Process large files¶

The question. You have a text file that's too big to load — a multi-GB log, a 10-million-row CSV — and you need to scan, filter, transform, or summarise it. f.read() and f.readlines() both load the whole file; you want the constant-memory path.

The answer: iterate the file object directly. for line in f yields one line at a time, O(longest line) in memory, regardless of file size. Wrap any filter/transform in a generator expression and consume with a reducer (sum, Counter, writing to another file) to keep the whole pipeline streaming.

# Line-by-line iteration — constant memory, works on files of any size.
# Works identically on a 10-KB sample and a 10-GB log.
from pathlib import Path
from collections import Counter

# Make a sample. In production this is the file you can't load.
sample = Path('/tmp/events.log')
sample.write_text(''.join(
    f'2026-04-{(i % 30) + 1:02d} INFO: event {i} for user {i % 5}\n'
    for i in range(10_000)
), encoding='utf-8')

# Streaming count of events per user — one line at a time.
counts: Counter[str] = Counter()
with open(sample, encoding='utf-8') as f:
    for line in f:
        if 'INFO' in line:                           # filter
            user = line.rsplit(' ', 1)[-1].strip()   # transform
            counts[user] += 1                        # aggregate

for user, n in counts.most_common():
    print(f'user {user}: {n}')

sample.unlink()
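
The same count can also be written as a chain of generator expressions feeding a reducer, as the opening answer suggests. This is a minimal sketch, assuming the log format of the sample above; the temporary file and the names info_lines, users, and pipeline_counts are illustrative only.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical sample file with the same line format as the example above.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / 'events.log'
    log.write_text(''.join(
        f'2026-04-01 INFO: event {i} for user {i % 5}\n'
        for i in range(1_000)
    ), encoding='utf-8')

    with log.open(encoding='utf-8') as f:
        # Each stage is lazy: nothing is read until Counter pulls from it.
        info_lines = (line for line in f if 'INFO' in line)               # filter
        users = (line.rsplit(' ', 1)[-1].strip() for line in info_lines)  # transform
        pipeline_counts = Counter(users)                                  # reduce

print(pipeline_counts.most_common())
```

The reducer has to run inside the with block, because the generators only read from the file while it is still open.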

Variant: fixed-size chunks for character-counting or binary-like work¶

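When line boundaries don't matter, f.read(n) in a loop is the direct form, and iter(callable, sentinel) keeps the loop clean. In text mode, as below, n counts characters and the EOF sentinel is ''. For byte-level work such as counting bytes, hashing, or streaming a network upload, open the file in 'rb' mode so that n counts bytes and the sentinel becomes b''.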

from pathlib import Path

path = Path('/tmp/chunk-demo.txt')
path.write_text('x' * 25_000, encoding='utf-8')

total = 0
with path.open(encoding='utf-8') as f:
    for chunk in iter(lambda: f.read(8192), ''):   # '' is the EOF sentinel
        total += len(chunk)

print(f'read {total:,} chars')
path.unlink()
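
For byte-level work the same loop applies in binary mode. The sketch below hashes a file chunk by chunk; sha256_of_file, the 64 KiB chunk size, and the temporary file are assumptions made for illustration, not part of the recipe above.

```python
import hashlib
import tempfile
from functools import partial
from pathlib import Path


def sha256_of_file(path, chunk_size=64 * 1024):
    """Hash a file in fixed-size binary chunks (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # partial(f.read, chunk_size) is called repeatedly until it returns b''.
        for chunk in iter(partial(f.read, chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    blob = Path(tmp) / 'blob.bin'
    blob.write_bytes(b'x' * 200_000)
    streamed = sha256_of_file(blob)

print(streamed)
```

Only one chunk is held in memory at a time, so peak usage depends on chunk_size rather than on the file size.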

Variant: CSV with DictReader¶

DictReader wraps the file iterator and yields each row as a dict, keyed by the header. Memory still scales with row size, not file size.

import csv
from pathlib import Path

path = Path('/tmp/prices.csv')
path.write_text('\n'.join(
    ['name,value'] + [f'item_{i},{i * 1.5}' for i in range(1, 1001)]
) + '\n', encoding='utf-8')

total, count = 0.0, 0
with path.open(encoding='utf-8', newline='') as f:
    for row in csv.DictReader(f):
        total += float(row['value'])
        count += 1

print(f'{count} rows, total £{total:.2f}')
path.unlink()
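
The reducer does not have to be a running total; it can be another file. Here is a sketch of a streaming filter that copies only some rows to a second CSV. The file names, the 100.0 threshold, and the kept counter are hypothetical, chosen only to make the example concrete.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'prices.csv'
    dst = Path(tmp) / 'expensive.csv'
    src.write_text(
        'name,value\n' + ''.join(f'item_{i},{i * 1.5}\n' for i in range(1, 101)),
        encoding='utf-8',
    )

    kept = 0
    with src.open(encoding='utf-8', newline='') as fin, \
         dst.open('w', encoding='utf-8', newline='') as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:                      # one row in memory at a time
            if float(row['value']) > 100.0:     # fields arrive as strings
                writer.writerow(row)
                kept += 1

    out_lines = dst.read_text(encoding='utf-8').splitlines()

print(f'kept {kept} rows')
```

Each row is written as soon as it is read, so neither the input nor the output is ever held in memory in full.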

Why this works¶

A file object is an iterator — next(f) yields the next line, including its trailing newline. for line in f is just that protocol driven by the loop. Python keeps a small internal buffer and decodes lazily; memory stays O(longest line), not O(file).
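
The iterator protocol described above can be driven by hand. This small sketch uses a throwaway three-line file; the variable names are illustrative.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / 'three.txt'
    p.write_text('first\nsecond\nthird\n', encoding='utf-8')

    with p.open(encoding='utf-8') as f:
        is_own_iterator = iter(f) is f     # a file object is its own iterator
        first = next(f)                    # one line, trailing newline included
        rest = list(f)                     # picks up where next() left off
        exhausted = next(f, None) is None  # nothing left after the last line

print(repr(first), rest)
```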

The with open(...) wrapper closes the file on exit, whether the loop finished naturally or raised. Skipping encoding='utf-8' is a cross-platform trap: the default comes from the locale, so on Windows (often a legacy code page such as cp1252) you can get a UnicodeDecodeError, or silently garbled text, from files that read fine on Linux. Always specify the encoding.

The pattern generalises: the same shape works for sys.stdin, the stdout pipe of a subprocess, or a network socket wrapped in a file object with socket.makefile(). Anything that yields one record per iteration plugs in unchanged.
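
One way to take advantage of this is to write the processing step against any iterable of lines rather than against a path. The following is a sketch; count_levels and the log format it parses are assumptions for illustration.

```python
import io
from collections import Counter
from typing import Iterable, Iterator


def count_levels(lines: Iterable[str]) -> Counter:
    """Count the second whitespace-separated field of each non-blank line.

    Hypothetical helper: assumes lines shaped like '<date> <LEVEL>: <message>'.
    """
    return Counter(
        line.split()[1].rstrip(':') for line in lines if line.strip()
    )


def generated_lines() -> Iterator[str]:
    """Stand-in for any other source that yields one record at a time."""
    yield '2026-04-03 ERROR: disk full\n'
    yield '2026-04-03 INFO: retrying\n'


# An in-memory text stream iterates by line, just like an open file.
from_stream = count_levels(io.StringIO(
    '2026-04-01 INFO: a\n2026-04-01 WARN: b\n2026-04-02 INFO: c\n'
))
from_generator = count_levels(generated_lines())

print(from_stream, from_generator)
```

The same call shape would apply to an open file or to sys.stdin, for example count_levels(sys.stdin).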

Trade-offs¶

The anti-pattern is f.read() or f.readlines(): both materialise the entire file. Fine for a 10-KB config; catastrophic for a 10-GB log. Whenever you're about to type .readlines(), ask whether a for line in f loop would do.

sorted(...) on a streaming pipeline destroys the constant-memory property, because sorting needs every element. For top-k queries, heapq.nlargest(k, iterable) runs in O(n log k) time and O(k) memory. For a full sort of a too-big file, you're into external-sort territory (sort chunks, then merge), which is usually easier to push into a purpose-built tool (sort(1) on Unix, DuckDB, or pandas with chunksize for the chunk-reading step).
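
A top-k query over a stream might look like the sketch below. The 'name size' line format, the temporary file, and k = 3 are assumptions chosen for illustration.

```python
import heapq
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / 'sizes.log'
    # Hypothetical records: a request name and a size, one pair per line.
    log.write_text(
        ''.join(f'req{i} {(i * 37) % 1000}\n' for i in range(1000)),
        encoding='utf-8',
    )

    with log.open(encoding='utf-8') as f:
        records = (line.split() for line in f)                 # lazy parse
        sized = ((int(size), name) for name, size in records)  # sort key first
        top3 = heapq.nlargest(3, sized)   # keeps only 3 candidates at a time

print(top3)
```

Putting the numeric size first in each tuple lets nlargest compare records without a key function.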

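For CSVs, csv.reader or csv.DictReader wrap the file iterator and give you parsed rows; every field is still a string, so convert numbers yourself. For binary files, read fixed-size chunks with iter(lambda: f.read(n), b''). Note that iter(f.read, b'') does not stream, because f.read() with no argument reads the whole file in one call. See the work-with-binary-files recipe.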

Related reading¶

Work with binary files: the chunk-at-a-time pattern for non-text data.
Avoid common file-handling mistakes: the readlines trap and other anti-patterns.
Process a large file lazily: the generator-pipeline view of the same pattern.
