Process a large file lazily
The question. You have a file that might be bigger than memory — a 10 GB log, a multi-million-row CSV, a stream that never ends — and you need to read it, transform each record, and produce a summary. The summary fits; the raw file doesn't.
The answer: iterate the file handle directly (for line in f), wrap the per-record work in a generator expression or function, and let a reducer (sum, Counter, writing to another file) consume the stream end-to-end. The file's iterator yields one line at a time, so memory is bounded by the longest line plus whatever the reducer keeps, regardless of file size.
# Lazy read + filter + transform + aggregate, all streaming.
# Here the 'aggregate' is a Counter of per-user totals, but the shape is
# identical whether the file is 10 KB or 10 GB.
from pathlib import Path
from collections import Counter
# Make a sample file to demo with (in production this is the file you can't load).
sample = Path('/tmp/transactions.csv')
sample.write_text(
'timestamp,user_id,category,amount\n'
'2024-01-15T09:23:11,42,food,12.50\n'
'2024-01-15T11:08:00,17,transport,3.20\n'
'2024-01-15T13:42:30,42,food,8.75\n'
'2024-01-16T07:15:00,99,utilities,45.00\n'
'2024-01-16T19:30:55,17,food,22.40\n'
'2024-01-17T08:01:00,42,transport,2.80\n'
)
totals: Counter[int] = Counter()
with open(sample) as f:
    next(f)  # skip header
    for line in f:  # one line at a time
        _, user_id, category, amount = line.rstrip().split(',')
        if category == 'food':  # filter
            totals[int(user_id)] += float(amount)
for user, total in totals.most_common():
    print(f'user {user}: £{total:.2f}')
Variant: wrap the parser in a generator function
Once the per-line work is more than split-and-cast, lift it into a named generator. You get a testable thing that takes a file handle and yields typed records. The pipeline above stays the same; the parser becomes swappable.
import csv
from dataclasses import dataclass
@dataclass
class Tx:
    timestamp: str
    user_id: int
    category: str
    amount: float
def read_transactions(file):
    '''Yield Tx records lazily from an open CSV file.'''
    reader = csv.reader(file)
    next(reader)  # header
    for row in reader:
        yield Tx(row[0], int(row[1]), row[2], float(row[3]))
# Reuses the sample file written in the first example.
# newline='' lets the csv module handle line endings itself.
with open('/tmp/transactions.csv', newline='') as f:
    for tx in read_transactions(f):
        print(tx)
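Because the parser only needs something that yields lines, it can be exercised without touching disk. The following is a minimal sketch, assuming the same Tx and read_transactions definitions as above (repeated here so the snippet runs on its own); io.StringIO stands in for the file and the two data rows are illustrative.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Tx:
    timestamp: str
    user_id: int
    category: str
    amount: float


def read_transactions(file):
    '''Yield Tx records lazily from an open CSV file.'''
    reader = csv.reader(file)
    next(reader)  # header
    for row in reader:
        yield Tx(row[0], int(row[1]), row[2], float(row[3]))


# io.StringIO stands in for a real file handle; the rows are illustrative.
fake_file = io.StringIO(
    'timestamp,user_id,category,amount\n'
    '2024-01-15T09:23:11,42,food,12.50\n'
    '2024-01-15T11:08:00,17,transport,3.20\n'
)

records = read_transactions(fake_file)
first = next(records)  # pulls the header and exactly one data row
rest = list(records)   # drains whatever is left
print(first)
print(len(rest))
```

Calling next() once and inspecting the result is a cheap way to check a parser before pointing it at the real file.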
Variant: multi-line records
Sometimes a 'record' spans several lines — a multi-line log entry, a pretty-printed JSON object. Accumulate lines into a buffer, yield at the boundary, flush at the end. Still O(longest-record), not O(file).
from pathlib import Path
log = Path('/tmp/multiline.log')
log.write_text(
'2024-01-15 ERROR: connection refused\n'
' at module foo.bar\n'
' at module foo.baz\n'
'2024-01-15 INFO: retry succeeded\n'
'2024-01-16 ERROR: out of memory\n'
' at module qux.quux\n'
)
def read_log_entries(file):
    '''An entry starts with a date; indented lines are continuations.'''
    buffer: list[str] = []
    for line in file:
        if line and not line[0].isspace() and buffer:
            yield ''.join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield ''.join(buffer)
with open(log) as f:
    for entry in read_log_entries(f):
        print('---')
        print(entry, end='')
Variant: fixed-size binary chunks with iter(callable, sentinel)
Two-argument iter calls the callable repeatedly until it returns the sentinel. For binary files read in fixed-size blocks it's cleaner than a while True loop.
from pathlib import Path
binpath = Path('/tmp/data.bin')
binpath.write_bytes(b'A' * 10 + b'B' * 10 + b'C' * 5)
with open(binpath, 'rb') as f:
    for chunk in iter(lambda: f.read(8), b''):  # stops when read returns b''
        print(len(chunk), chunk)
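A common use of the chunked form is feeding a reducer that accepts bytes incrementally, such as a hash. The sketch below is one possible shape, not a canonical one: the helper name sha256_of and the 64 KiB default chunk size are assumptions chosen for illustration, and functools.partial replaces the lambda.

```python
import hashlib
import tempfile
from functools import partial
from pathlib import Path


def sha256_of(path, chunk_size=64 * 1024):
    '''Hash a file chunk by chunk; chunk_size is an arbitrary choice.'''
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # partial(f.read, chunk_size) is the callable; b'' is the sentinel.
        for chunk in iter(partial(f.read, chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'data.bin'
    demo.write_bytes(b'A' * 10 + b'B' * 10 + b'C' * 5)
    streamed = sha256_of(demo, chunk_size=8)                # four reads: 8, 8, 8, 1 bytes
    whole = hashlib.sha256(demo.read_bytes()).hexdigest()   # eager, for comparison only
    print(streamed == whole)
```

The digest object is the reducer here: it holds a fixed amount of state no matter how many chunks pass through it.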
Why this works
A file object is its own iterator: each next() call returns one line, including the trailing newline, and for line in f just drives that protocol. Under the hood Python reads through a small fixed-size buffer and hands back one line at a time, so only the current line is held as a string. Memory used is O(longest line), not O(file).
The generator-expression style, (parse(line) for line in f if keep(line)), where parse and keep stand for your own functions, keeps filter and transform lazy too. The only eager step is the final reducer: Counter here, but it could equally be sum, max, heapq.nlargest, or writing each transformed line to another file. Because the reducer consumes one value at a time, the whole chain stays streaming.
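As a rough illustration of that chain, the sketch below stacks three generator expressions and finishes with sum as the reducer. io.StringIO stands in for an open file and the rows are illustrative; the column positions assume the timestamp,user_id,category,amount layout used earlier.

```python
import io

# io.StringIO stands in for an open file; the rows are illustrative.
f = io.StringIO(
    'timestamp,user_id,category,amount\n'
    '2024-01-15T09:23:11,42,food,12.50\n'
    '2024-01-15T11:08:00,17,transport,3.20\n'
    '2024-01-15T13:42:30,42,food,8.75\n'
)
next(f)  # skip header

rows = (line.rstrip('\n').split(',') for line in f)  # parse, lazily
food = (row for row in rows if row[2] == 'food')     # filter, lazily
amounts = (float(row[3]) for row in food)            # transform, lazily

# No data row has been read yet; sum() pulls values through one at a time.
food_total = sum(amounts)
print(f'{food_total:.2f}')
```

Each stage holds a reference to the one before it and nothing else, so adding stages does not add per-record memory.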
This pattern is also streaming-friendly in the broader sense: the same code works on sys.stdin, a pipe, or a network socket wrapped with socket.makefile(). Anything that supports line-at-a-time iteration plugs into the same shape.
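One way to keep that flexibility is to write the processing function against "any iterable of lines" rather than against a file. The sketch below assumes a hypothetical helper, count_levels, and a "date LEVEL: message" line layout like the log example earlier; neither is prescribed by the recipe.

```python
import io
from collections import Counter
from typing import Iterable


def count_levels(lines: Iterable[str]) -> Counter:
    '''Count log levels; `lines` can be any iterable of text lines.'''
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            counts[parts[1].rstrip(':')] += 1
    return counts


# The same function accepts a file-like object...
from_file_like = count_levels(io.StringIO(
    '2024-01-15 ERROR: connection refused\n'
    '2024-01-15 INFO: retry succeeded\n'
    '2024-01-16 ERROR: out of memory\n'
))
# ...or a plain list. sys.stdin would be passed the same way:
# count_levels(sys.stdin)
from_list = count_levels(['2024-01-16 INFO: ok\n'])
print(from_file_like, from_list)
```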
Trade-offs
The anti-pattern is f.read() or f.readlines() — both pull the whole file into memory. Fine for 10 KB config files; catastrophic for 10 GB logs. If you ever find yourself typing .readlines(), check whether a for line in f loop would do instead.
sorted(...) on the stream also breaks the streaming property: sorting needs every element, so it materialises the whole input. If you only want the top-k, heapq.nlargest(k, iterable) runs in O(n log k) time and O(k) memory. If you genuinely need a full sort, accept the cost and think about external sort (sort chunks, merge) for the really large cases.
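The top-k case can be sketched as follows. io.StringIO stands in for the large file, and the rows reuse the sample amounts from the first example; k = 3 is an arbitrary choice.

```python
import heapq
import io

# io.StringIO stands in for the large file; the rows are illustrative.
f = io.StringIO(
    'timestamp,user_id,category,amount\n'
    '2024-01-15T09:23:11,42,food,12.50\n'
    '2024-01-15T11:08:00,17,transport,3.20\n'
    '2024-01-15T13:42:30,42,food,8.75\n'
    '2024-01-16T07:15:00,99,utilities,45.00\n'
    '2024-01-16T19:30:55,17,food,22.40\n'
    '2024-01-17T08:01:00,42,transport,2.80\n'
)
next(f)  # skip header

amounts = (float(line.rstrip('\n').split(',')[3]) for line in f)

# nlargest keeps at most k candidates at a time while it consumes the stream.
top3 = heapq.nlargest(3, amounts)
print(top3)
```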
Keep all iteration inside the with block. If you return a generator whose source is the file and the with block then exits, the next next() call raises ValueError (I/O operation on closed file), because the file is already closed. Fix: consume inside the with, or manage the handle's lifetime explicitly: open it where it will outlive the iteration and close it with contextlib.closing once you are done.
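The failure and one possible fix can be sketched side by side. The function names broken_lines and safe_lines are hypothetical, and the fix shown here (letting the generator function own the file, with contextlib.closing to release it on early exit) is only one of several reasonable arrangements.

```python
import contextlib
import tempfile
from pathlib import Path


def broken_lines(path):
    '''Returns a generator, but the with block closes the file first.'''
    with open(path) as f:
        return (line.rstrip('\n') for line in f)


def safe_lines(path):
    '''The generator owns the file: it stays open while iteration is live.'''
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo.txt'
    demo.write_text('alpha\nbeta\ngamma\n')

    try:
        next(broken_lines(demo))
        outcome = 'no error'
    except ValueError as exc:
        outcome = f'ValueError: {exc}'
    print(outcome)

    # closing() calls gen.close(), which unwinds the with block inside
    # safe_lines even though iteration stops after the first line.
    with contextlib.closing(safe_lines(demo)) as gen:
        first = next(gen)
    print(first)
```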
Related reading
- Combine generators into a pipeline: the read loop is the source stage of a larger pipeline.
- Avoid common iterator mistakes: the readlines trap, consuming twice, forgetting the reducer.
- Laziness and memory: why the iterator protocol gives you constant memory for free.