Generator expressions and itertools¶
Generator functions (previous notebook) are powerful, but often you want something smaller — "multiply every element by two", "keep the odds", "pair these up". For those cases Python has generator expressions, the inline cousin of list comprehensions. And for everything more structured than "map" and "filter" — grouping, windowing, chaining, combining — the standard library's itertools module already has the generator written for you.
This notebook is the practical toolkit.
Generator expressions¶
A generator expression looks exactly like a list comprehension, but with round brackets instead of square ones. Square brackets mean "build a list now". Round brackets mean "build a generator — produce values lazily as consumed".
squares_list = [x * x for x in range(5)] # builds a list immediately
squares_gen = (x * x for x in range(5)) # builds a generator
print(squares_list) # [0, 1, 4, 9, 16]
print(squares_gen) # <generator object ...>
print(list(squares_gen))
They support the same machinery as list comprehensions — filter clauses, multiple for clauses, conditional expressions — just evaluated lazily.
xs = range(20)
# Squares of odd numbers, lazy:
odd_squares = (x * x for x in xs if x % 2)
print(list(odd_squares))
# Cartesian pairs, lazy:
pairs = ((a, b) for a in 'ab' for b in (1, 2))
print(list(pairs))
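To make "evaluated lazily" concrete, here is a small sketch that records when the element expression actually runs. The helper noisy_square and the calls list are illustrative names, not part of any library.

```python
calls = []

def noisy_square(x):
    """Square x and record that the call happened."""
    calls.append(x)
    return x * x

lazy = (noisy_square(x) for x in range(5))
print(calls)         # [] : creating the generator ran nothing

first = next(lazy)
print(first, calls)  # 0 [0] : exactly one element was computed

rest = list(lazy)
print(rest, calls)   # [1, 4, 9, 16] [0, 1, 2, 3, 4]
```

Only the outermost iterable (range(5) here) is evaluated when the generator expression is created; the element expression waits until a value is requested.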
When to use a genexp instead of a listcomp¶
Prefer a generator expression when:
- You only need to iterate the values once, e.g. passing straight into sum, max, any, all, join.
- The source is large and you don't want the intermediate list in memory.
- You're building a pipeline of transformations.
Prefer a list comprehension when:
- You need to iterate the result multiple times.
- You need indexing, slicing, or len.
- You're debugging and want to print the intermediate values.
# Good use of a genexp: no intermediate list built
print(sum(x * x for x in range(1_000_000)))
# Even nicer — the parentheses can be omitted when a genexp is the sole
# argument of a function call.
print(max(len(word) for word in 'one two three four'.split()))
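A rough way to see the memory difference is to compare container sizes with sys.getsizeof. This is only a sketch: the exact numbers depend on the Python build, and getsizeof counts the container itself, not the integers it refers to.

```python
import sys

n = 100_000
as_list = [x * x for x in range(n)]
as_gen = (x * x for x in range(n))

list_size = sys.getsizeof(as_list)  # grows with n
gen_size = sys.getsizeof(as_gen)    # small, and does not depend on n
print(list_size, gen_size)

# Both give the same answer when consumed once.
same_total = sum(as_list) == sum(as_gen)
print(same_total)  # True
```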
Watch out: single-use¶
A genexp is an iterator. The same one-shot rule applies.
squares = (x * x for x in range(4))
print(list(squares)) # [0, 1, 4, 9]
print(list(squares)) # [] — already consumed
If you need the values twice, materialise once into a list or tuple, or wrap the generation in a function you can call again.
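Both options can be sketched in a few lines. The function name squares is just an illustrative choice.

```python
def squares(n):
    """Return a fresh generator each time it is called."""
    return (x * x for x in range(n))

# Option 1: materialise once, then reuse the tuple as often as needed.
cached = tuple(squares(4))
print(sum(cached), max(cached))          # 14 9

# Option 2: call the factory again for each pass.
print(sum(squares(4)), max(squares(4)))  # 14 9
```

Option 1 trades memory for a single computation; option 2 recomputes the values but never holds them all at once.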
itertools — the toolkit¶
itertools ships with about twenty iterator building blocks. We won't cover every one here (see the reference page), but the ones below come up repeatedly in real code.
islice — lazy slicing¶
Slicing with [:] doesn't work on arbitrary iterators; it only works on sequences. islice(iterable, stop) or islice(iterable, start, stop, step) gives you slice-like behaviour for any iterator.
from itertools import islice
def naturals():
    n = 1
    while True:
        yield n
        n += 1
print(list(islice(naturals(), 5))) # first 5
print(list(islice(naturals(), 10, 15))) # indices 10..14
print(list(islice(naturals(), 0, 20, 3))) # every 3rd
islice is the standard way to bound an infinite generator.
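Two details are worth seeing in action: islice advances the underlying iterator rather than copying it, and it rejects negative indices. The sketch below uses itertools.count(1) as a stand-in for naturals() so that it is self-contained.

```python
from itertools import count, islice

numbers = count(1)  # 1, 2, 3, ... without end

first = list(islice(numbers, 3))
second = list(islice(numbers, 3))
print(first)   # [1, 2, 3]
print(second)  # [4, 5, 6] : continues where the first slice stopped

# Negative indices are rejected, because an iterator has no known end.
try:
    islice(numbers, -1)
    rejected = False
except ValueError as exc:
    rejected = True
    print(f'caught: {exc}')
```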
chain — concatenation without an intermediate list¶
chain(a, b, c, ...) yields everything from a, then everything from b, etc. No copy, no combined list in memory.
from itertools import chain
part1 = [1, 2, 3]
part2 = range(10, 13)
part3 = (x * x for x in [4, 5])
for x in chain(part1, part2, part3):
    print(x, end=' ')
print()
There's also chain.from_iterable(it_of_its) for chaining an iterable of iterables — especially useful for flattening one level deep.
from itertools import chain
rows = [[1, 2, 3], [4, 5], [6]]
print(list(chain.from_iterable(rows)))
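"One level deep" is meant literally, as this short sketch shows. The nested data and the row_source generator are made up for illustration.

```python
from itertools import chain

nested = [[1, [2, 3]], [4], []]
flat = list(chain.from_iterable(nested))
print(flat)  # [1, [2, 3], 4] : the inner [2, 3] is left alone

# The outer iterable can itself be lazy. chain(*row_source()) would also
# work, but the * unpacking drains the outer generator before chaining starts.
def row_source():
    yield [1, 2]
    yield [3]

from_gen = list(chain.from_iterable(row_source()))
print(from_gen)  # [1, 2, 3]
```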
groupby — runs of equal values¶
groupby(iterable, key=...) groups adjacent equal items. The important word is adjacent: it doesn't sort for you. If you want grouped output from an unsorted source, sort first by the same key.
from itertools import groupby
# Already sorted by key — groupby just picks out the runs
events = [
    ('2024-01', 'login'),
    ('2024-01', 'click'),
    ('2024-02', 'login'),
    ('2024-02', 'login'),
    ('2024-03', 'click'),
]
for month, items in groupby(events, key=lambda e: e[0]):
    # items is an iterator: consume it inside the loop
    print(month, [action for _, action in items])
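A common bug: saving the items iterator itself (rather than list(items)) and trying to read it after the loop has moved on. Each group iterator shares the underlying iterable with groupby, so it is only valid while its own loop iteration is active; advancing the outer loop invalidates it. If you need a group later, materialise it with list(items) inside the loop.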
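A minimal reproduction of that bug, next to the fix, might look like the sketch below. The exact contents of the stale groups are an implementation detail, so the code only checks that they no longer match the correct result.

```python
from itertools import groupby

data = ['a', 'a', 'b', 'b', 'b', 'c']

# Buggy: keep the group iterators and read them after the loop has finished.
stale = [(key, group) for key, group in groupby(data)]
stale_result = [(key, list(group)) for key, group in stale]
print(stale_result)  # groups come back empty or incomplete

# Correct: materialise each group while its loop iteration is still active.
kept = [(key, list(group)) for key, group in groupby(data)]
print(kept)  # [('a', ['a', 'a']), ('b', ['b', 'b', 'b']), ('c', ['c'])]
```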
# Sort first if your data isn't already grouped:
unsorted = [('a', 1), ('b', 2), ('a', 3), ('b', 4)]
sorted_pairs = sorted(unsorted, key=lambda p: p[0])
for key, group in groupby(sorted_pairs, key=lambda p: p[0]):
    print(key, sum(v for _, v in group))
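For contrast, here is a sketch of what happens when the sort is skipped: each key shows up once per run rather than once overall.

```python
from itertools import groupby

unsorted = [('a', 1), ('b', 2), ('a', 3), ('b', 4)]

def first_item(pair):
    return pair[0]

runs = [key for key, _ in groupby(unsorted, key=first_item)]
print(runs)     # ['a', 'b', 'a', 'b'] : four runs, not two groups

grouped = [key for key, _ in groupby(sorted(unsorted, key=first_item), key=first_item)]
print(grouped)  # ['a', 'b']
```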
zip — parallel iteration (built-in, not in itertools)¶
zip isn't in itertools but it's the same family. It pairs elements from multiple iterables, stopping at the shortest. Since Python 3.10 you can pass strict=True to fail loudly if lengths disagree.
names = ['Ada', 'Grace', 'Linus']
scores = [95, 88, 72]
for name, score in zip(names, scores):
    print(f'{name}: {score}')
# strict=True: catch the bug instead of silently truncating
try:
    list(zip([1, 2, 3], [10, 20], strict=True))
except ValueError as e:
    print(f'caught: {e}')
zip_longest — pad instead of truncate¶
itertools.zip_longest(*iterables, fillvalue=...) keeps going until the longest iterable is exhausted, padding missing values with fillvalue (default None).
from itertools import zip_longest
a = [1, 2, 3, 4]
b = ['x', 'y']
print(list(zip_longest(a, b, fillvalue='?')))
accumulate — running totals (and other folds)¶
accumulate(iterable) yields a running sum. Pass func= to accumulate with a different binary operation (max, multiplication, etc.).
from itertools import accumulate
import operator
print(list(accumulate([1, 2, 3, 4, 5]))) # running sum
print(list(accumulate([1, 2, 3, 4, 5], operator.mul))) # running product
print(list(accumulate([3, 1, 4, 1, 5, 9, 2, 6], max))) # running max
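accumulate also takes a keyword-only initial argument (Python 3.8 and later), which the examples above don't use. A small sketch with made-up deposit figures:

```python
from itertools import accumulate

deposits = [50, -20, 30]

# initial seeds the fold, so the output has one more item than the input.
balances = list(accumulate(deposits, initial=100))
print(balances)  # [100, 150, 130, 160]
```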
takewhile / dropwhile — conditional prefixes and suffixes¶
- takewhile(pred, iter): yield values until the predicate first returns false, then stop.
- dropwhile(pred, iter): skip values while the predicate is true, then yield the rest.
from itertools import takewhile, dropwhile
values = [1, 3, 5, 4, 7, 9]
print(list(takewhile(lambda x: x % 2, values))) # [1, 3, 5]
print(list(dropwhile(lambda x: x % 2, values))) # [4, 7, 9]
These differ from filter in one crucial way: they care about position, not just match. filter would keep all the odds; takewhile stops at the first non-odd.
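Running all three over the same input makes the difference visible. The last lines are an extra illustration using itertools.count, which is not part of the examples above: takewhile can bound an infinite stream, whereas filter over the same stream would never finish.

```python
from itertools import count, dropwhile, takewhile

values = [1, 3, 5, 4, 7, 9]

def is_odd(x):
    return x % 2 == 1

kept_by_filter = list(filter(is_odd, values))
prefix = list(takewhile(is_odd, values))
suffix = list(dropwhile(is_odd, values))
print(kept_by_filter)  # [1, 3, 5, 7, 9]
print(prefix)          # [1, 3, 5]
print(suffix)          # [4, 7, 9]

# takewhile stops at the first failure, so it terminates on an endless source.
bounded = list(takewhile(lambda x: x < 20, count(1, 5)))
print(bounded)         # [1, 6, 11, 16]
```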
tee — branching an iterator¶
tee(iterable, n) turns one iterator into n independent ones. Useful when you need to iterate the same stream twice but don't want to materialise it.
from itertools import tee
def events():
    for x in [1, 2, 3, 4, 5]:
        yield x
for_sum, for_max = tee(events(), 2)
print(sum(for_sum), max(for_max))
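Caveat: tee has to buffer every value that one branch has consumed and another has not. In the example above, sum drains for_sum completely before max reads anything, so the whole stream ends up in tee's buffer; in that situation list() is simpler and usually faster. Also avoid using the original iterator once you've teed it, or the branches will miss values. For streams that don't fit in memory, don't tee: restructure so you iterate once.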
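One way to restructure the sum-and-max example into a single pass is sketched below. The helper sum_and_max is a hypothetical name, not a library function.

```python
def sum_and_max(stream):
    """Compute the sum and the maximum in one pass, holding one value at a time."""
    iterator = iter(stream)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError('sum_and_max() needs at least one value') from None
    total = largest = first
    for value in iterator:
        total += value
        if value > largest:
            largest = value
    return total, largest

result = sum_and_max(x for x in [1, 2, 3, 4, 5])
print(result)  # (15, 5)
```

Nothing is buffered here, at the cost of writing the loop by hand instead of composing sum and max.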
Combining the pieces — a short pipeline¶
Here's a pattern that shows up constantly: read records, filter, transform, group, summarise. Every stage is lazy except the sort that groupby needs.
from itertools import groupby
transactions = [
    ('2024-01-15', 'food', 12.50),
    ('2024-01-18', 'transport', 3.20),
    ('2024-01-20', 'food', 24.00),
    ('2024-02-02', 'food', 10.00),
    ('2024-02-05', 'transport', 12.80),
    ('2024-02-12', 'transport', 0.00),  # will be filtered out
]
# 1. filter — drop zero-value rows
nonzero = (t for t in transactions if t[2] > 0)
# 2. key by month
keyed = ((t[0][:7], t) for t in nonzero)
# 3. sort by month so groupby sees contiguous runs
by_month = sorted(keyed, key=lambda p: p[0])
# 4. group and summarise
for month, rows in groupby(by_month, key=lambda p: p[0]):
    total = sum(t[2] for _, t in rows)
    print(f'{month}: £{total:.2f}')
Each stage is a generator expression, a built-in, or an itertools call. The sorted step is the one place we do need to materialise, because groupby requires contiguous runs. This is the typical shape of real pipelines: a few lazy stages, one bulk operation in the middle, and a consumer at the end.
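When only the per-group totals are needed, an alternative to sort-then-groupby (a different technique from the pipeline above) is to accumulate into a dictionary in a single pass. The sketch below uses a shortened, deliberately unordered sample of the transactions; only the small totals dictionary is held in memory.

```python
from collections import defaultdict

# Sample rows in arbitrary order: no sort is needed for this approach.
transactions = [
    ('2024-02-05', 'transport', 12.80),
    ('2024-01-15', 'food', 12.50),
    ('2024-02-02', 'food', 10.00),
    ('2024-01-20', 'food', 24.00),
    ('2024-02-12', 'transport', 0.00),
]

totals = defaultdict(float)
for date, _category, amount in transactions:
    if amount > 0:                  # same filter as the pipeline
        totals[date[:7]] += amount  # key by 'YYYY-MM'

for month in sorted(totals):        # sorts the few keys, not the rows
    print(f'{month}: £{totals[month]:.2f}')
```

The trade-off is that each group's rows are never available as a sequence, so this fits running aggregates better than per-group processing.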
Quick check — build a small pipeline¶
Given an iterable of (timestamp, level, message) log tuples where level is 'DEBUG', 'INFO', 'WARN', or 'ERROR':
- Drop 'DEBUG' entries.
- Take only the first 1000 rows remaining.
- Count how many there are of each level.
Do it with a generator expression for the filter, islice for the cap, and a single pass that updates a Counter.
from itertools import islice
from collections import Counter
# Sample data — pretend this is a multi-gigabyte log file
def fake_logs():
    import random
    levels = ['DEBUG', 'INFO', 'WARN', 'ERROR']
    weights = [0.5, 0.35, 0.12, 0.03]
    rnd = random.Random(42)
    for i in range(5000):
        yield (i, rnd.choices(levels, weights=weights)[0], f'msg {i}')
# Your turn — fill these:
nondebug = ... # generator expression: drop DEBUG
capped = ... # islice to the first 1000
counts = ... # Counter over capped
# print(counts)
Working solution¶
from itertools import islice
from collections import Counter
def fake_logs():
    import random
    levels = ['DEBUG', 'INFO', 'WARN', 'ERROR']
    weights = [0.5, 0.35, 0.12, 0.03]
    rnd = random.Random(42)
    for i in range(5000):
        yield (i, rnd.choices(levels, weights=weights)[0], f'msg {i}')
nondebug = (row for row in fake_logs() if row[1] != 'DEBUG')
capped = islice(nondebug, 1000)
counts = Counter(row[1] for row in capped)
print(counts)
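Because every stage is lazy, islice stops the whole pipeline as soon as the cap is reached, so the source is never read to the end. The sketch below checks this with a counting generator; counted and stats are illustrative names, and the even-number filter stands in for the DEBUG filter.

```python
from itertools import islice

stats = {'pulled': 0}

def counted(n):
    """Yield 0..n-1 and count how many values were actually requested."""
    for i in range(n):
        stats['pulled'] += 1
        yield i

evens = (i for i in counted(5000) if i % 2 == 0)
first_ten = list(islice(evens, 10))

print(first_ten)        # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(stats['pulled'])  # 19 : the source stopped long before 5000
```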
Summary¶
- (expr for x in xs) is a generator expression, the lazy sister of a list comprehension. Use it when you'll only iterate once.
- itertools is full of pre-written iterator combinators. The ones you'll reach for repeatedly: islice, chain, groupby, accumulate, takewhile/dropwhile, tee, and zip_longest, plus the built-in zip.
- These compose cleanly with each other and with generator expressions. The shape of a typical pipeline is filter → transform → group → summarise, with sorted in the middle when you need it.
Next: custom iterators — when a generator function isn't the right tool and you want the explicit control of writing the iterator class yourself.