Process large files¶
The question. You have a text file that's too big to load — a multi-GB log, a 10-million-row CSV — and you need to scan, filter, transform, or summarise it. f.read() and f.readlines() both load the whole file; you want the constant-memory path.
The answer: iterate the file object directly. for line in f yields one line at a time, O(longest line) in memory, regardless of file size. Wrap any filter/transform in a generator expression and consume with a reducer (sum, Counter, writing to another file) to keep the whole pipeline streaming.
# Line-by-line iteration: constant memory, works on files of any size.
# Works identically on a 10-KB sample and a 10-GB log.
from pathlib import Path
from collections import Counter
# Make a sample. In production this is the file you can't load.
sample = Path('/tmp/events.log')
sample.write_text(''.join(
    f'2026-04-{(i % 30) + 1:02d} INFO: event {i} for user {i % 5}\n'
    for i in range(10_000)
), encoding='utf-8')
# Streaming count of events per user, one line at a time.
counts: Counter[str] = Counter()
with open(sample, encoding='utf-8') as f:
    for line in f:
        if 'INFO' in line:                          # filter
            user = line.rsplit(' ', 1)[-1].strip()  # transform
            counts[user] += 1                       # aggregate
for user, n in counts.most_common():
    print(f'user {user}: {n}')
sample.unlink()
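The same count can also be written in the generator-expression form the answer describes: a lazy filter and transform, with Counter as the reducer. This is a minimal sketch; the helper name users_from and the temporary-directory location are illustrative assumptions, not part of the recipe.

```python
import tempfile
from collections import Counter
from collections.abc import Iterable, Iterator
from pathlib import Path


def users_from(lines: Iterable[str]) -> Iterator[str]:
    """Return a lazy stream of user ids taken from INFO lines (hypothetical helper)."""
    return (line.rsplit(' ', 1)[-1].strip() for line in lines if 'INFO' in line)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'events.log'
    sample.write_text(
        ''.join(f'2026-04-01 INFO: event {i} for user {i % 5}\n' for i in range(100)),
        encoding='utf-8',
    )
    with sample.open(encoding='utf-8') as f:
        # Counter pulls one item at a time from the generator; no list is built.
        counts = Counter(users_from(f))

print(counts.most_common())
```

Because users_from accepts any iterable of strings, it can be exercised with a plain list in a unit test and with an open file in production.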
Variant: fixed-size chunks for character-counting or binary-like work¶
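When line boundaries don't matter (counting characters or bytes, hashing, streaming a network upload), f.read(n) in a loop is the direct form. iter(callable, sentinel) keeps the loop clean: it calls the callable repeatedly until the result equals the sentinel, which is '' at end of file in text mode and b'' in binary mode.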
from pathlib import Path
path = Path('/tmp/chunk-demo.txt')
path.write_text('x' * 25_000, encoding='utf-8')
total = 0
with path.open(encoding='utf-8') as f:
    for chunk in iter(lambda: f.read(8192), ''):  # '' is the EOF sentinel
        total += len(chunk)
print(f'read {total:,} chars')
path.unlink()
Variant: CSV with DictReader¶
DictReader wraps the file iterator and yields each row as a dict, keyed by the header. Memory still scales with row size, not file size.
import csv
from pathlib import Path
path = Path('/tmp/prices.csv')
path.write_text('\n'.join(
    ['name,value'] + [f'item_{i},{i * 1.5}' for i in range(1, 1001)]
) + '\n', encoding='utf-8')
total, count = 0.0, 0
with path.open(encoding='utf-8', newline='') as f:
    for row in csv.DictReader(f):
        total += float(row['value'])
        count += 1
print(f'{count} rows, total £{total:.2f}')
path.unlink()
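Writing to another file is also a streaming reducer, so a filter can copy matching rows from one CSV to another without holding more than one row at a time. The sketch below assumes the same name,value layout as the example above; the 1000.0 threshold and the file names are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'prices.csv'
    dst = Path(tmp) / 'expensive.csv'
    src.write_text(
        'name,value\n' + ''.join(f'item_{i},{i * 1.5}\n' for i in range(1, 1001)),
        encoding='utf-8',
    )

    kept = 0
    with src.open(encoding='utf-8', newline='') as fin, \
         dst.open('w', encoding='utf-8', newline='') as fout:
        reader = csv.DictReader(fin)
        # Reuse the input header so the output has the same columns.
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row['value']) >= 1000.0:  # hypothetical threshold
                writer.writerow(row)           # written immediately, not collected
                kept += 1

    out_lines = dst.read_text(encoding='utf-8').splitlines()

print(f'kept {kept} rows')
```

Both files are opened with newline='' as the csv module expects, and each row is written as soon as it passes the filter.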
Why this works¶
A file object is an iterator — next(f) yields the next line, including its trailing newline. for line in f is just that protocol driven by the loop. Python keeps a small internal buffer and decodes lazily; memory stays O(longest line), not O(file).
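The iterator protocol described above can be observed directly. In this small sketch io.StringIO stands in for an opened text file (an assumption made to keep the example self-contained); real file objects returned by open() behave the same way.

```python
import io

# io.StringIO stands in for an opened text file.
f = io.StringIO('alpha\nbeta\ngamma\n')

same_object = iter(f) is f                 # a file object is its own iterator
first = next(f)                            # first line, trailing newline included
rest = [line.rstrip('\n') for line in f]   # the loop resumes where next() stopped
exhausted = next(f, None)                  # default returned once the stream is spent

print(same_object, repr(first), rest, exhausted)
```

Since the loop picks up wherever the last next() call left off, calling next(f) once before the loop is a simple way to skip a header line.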
The with open(...) wrapper closes the file on exit, whether the loop finished naturally or raised. Skipping encoding='utf-8' is a cross-platform trap: the default comes from the system locale (typically UTF-8 on Linux and macOS, often cp1252 on Windows), so a file that read fine on Linux can raise UnicodeDecodeError, or silently decode to the wrong characters, on Windows. Always specify the encoding.
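To see why the encoding has to match the bytes on disk, the sketch below writes a file in cp1252 and then reads it twice: once as UTF-8, once as cp1252. The file name and contents are invented for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'legacy.txt'
    path.write_bytes('café\n'.encode('cp1252'))  # é is stored as the single byte 0xE9

    wrong = None
    error = None
    try:
        with path.open(encoding='utf-8') as f:
            wrong = list(f)
    except UnicodeDecodeError as exc:
        error = exc.reason  # 0xE9 followed by a newline is not valid UTF-8

    with path.open(encoding='cp1252') as f:
        right = [line.rstrip('\n') for line in f]

print(error, right)
```

The failure only appears when the first non-ASCII byte is read, which in a large file can be long after the loop has started.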
The pattern generalises: the same shape works for sys.stdin, a network socket wrapped with socket.makefile(), or the stdout pipe of a subprocess. Anything that yields one record per iteration plugs in unchanged.
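One way to make that generality concrete is to write the processing step against "any iterable of lines" rather than against a file. In the sketch below, count_errors is a hypothetical helper, and the child process is simply the current Python interpreter printing three lines.

```python
import io
import subprocess
import sys
from collections.abc import Iterable


def count_errors(lines: Iterable[str]) -> int:
    """Count lines containing 'ERROR'; accepts any iterable of strings."""
    return sum(1 for line in lines if 'ERROR' in line)


# 1. An in-memory stream standing in for a file or sys.stdin.
from_memory = count_errors(io.StringIO('ok\nERROR a\nok\nERROR b\n'))

# 2. The stdout pipe of a child process, consumed line by line.
child_code = "print('ok'); print('ERROR x'); print('ok')"
with subprocess.Popen(
    [sys.executable, '-c', child_code],
    stdout=subprocess.PIPE,
    text=True,
    encoding='utf-8',
) as proc:
    from_child = count_errors(proc.stdout)

print(from_memory, from_child)
```

Passing sys.stdin to count_errors works the same way, which makes the script usable at the end of a shell pipeline.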
Trade-offs¶
The anti-pattern is f.read() or f.readlines() — both materialise the entire file. Fine for a 10-KB config; catastrophic for a 10-GB log. If a colleague is looking over your shoulder and you're about to type .readlines(), ask whether a for line in f loop would do.
sorted(...) on a streaming pipeline destroys the constant-memory property, because sorting needs every element. For top-k queries, heapq.nlargest(k, iterable) runs in O(n log k) time and O(k) memory. For a full sort on a too-big file, you're into external-sort territory (sort chunks, merge), and it is usually easier to push that into a purpose-built tool (sort(1) on Unix, DuckDB, pandas with chunksize).
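A top-k query over a stream might look like the sketch below. The "name latency_ms" line format and the latencies helper are assumptions made for the example; io.StringIO stands in for the large file.

```python
import heapq
import io

# Stand-in for a large log; each line is "<name> <latency_ms>" (assumed format).
log = io.StringIO(''.join(f'req_{i} {(i * 37) % 1000}\n' for i in range(10_000)))


def latencies(lines):
    """Lazily yield (latency_ms, name) pairs from the stream."""
    for line in lines:
        name, ms = line.split()
        yield int(ms), name


# nlargest keeps only a k-element heap while it consumes the generator.
top3 = heapq.nlargest(3, latencies(log))
print(top3)
```

Putting the latency first in each tuple makes it the sort key; ties are then broken by comparing the names as strings.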
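For CSVs, csv.reader or csv.DictReader wrap the file iterator and give you parsed rows; every field arrives as a str, so convert numbers yourself. For binary files, read fixed-size chunks via iter(lambda: f.read(n), b''). Note that iter(f.read, b'') would call f.read() with no size and load the whole file in one go. See the work-with-binary-files recipe.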
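For the binary case, a chunked hash is a common use. The sketch below assumes a 64 KiB chunk size and SHA-256, both arbitrary choices for illustration, and uses functools.partial to pass the size to f.read.

```python
import hashlib
import tempfile
from functools import partial
from pathlib import Path

CHUNK = 64 * 1024  # arbitrary chunk size for this sketch

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'blob.bin'
    payload = b'\x00\x01\x02\x03' * 50_000  # 200,000 bytes of sample data
    path.write_bytes(payload)

    digest = hashlib.sha256()
    n_bytes = 0
    with path.open('rb') as f:
        # partial(f.read, CHUNK) supplies the size; b'' marks end of file.
        for chunk in iter(partial(f.read, CHUNK), b''):
            digest.update(chunk)
            n_bytes += len(chunk)

streamed = digest.hexdigest()
print(n_bytes, streamed == hashlib.sha256(payload).hexdigest())
```

The streamed digest matches the one computed over the full payload, while peak memory is bounded by the chunk size rather than the file size.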
Related reading¶
- Work with binary files: the chunk-at-a-time pattern for non-text data.
- Avoid common file-handling mistakes: the readlines trap and other anti-patterns.
- Process a large file lazily: the generator-pipeline view of the same pattern.