Process a large file lazily¶

The question. You have a file that might be bigger than memory — a 10 GB log, a multi-million-row CSV, a stream that never ends — and you need to read it, transform each record, and produce a summary. The summary fits; the raw file doesn't.

The answer: iterate the file handle directly (for line in f), wrap the per-record work in a generator expression or function, and let a reducer (sum, Counter, writing to another file) consume the stream end-to-end. The file's iterator yields one line at a time, so memory is bounded by the longest line plus whatever the summary itself needs, regardless of file size.

# Lazy read + filter + transform + aggregate, all streaming.
# Here the 'aggregate' is a Counter of per-user totals, but the shape is
# identical whether the file is 10 KB or 10 GB.
from pathlib import Path
from collections import Counter

# Make a sample file to demo with (in production this is the file you can't load).
sample = Path('/tmp/transactions.csv')
sample.write_text(
    'timestamp,user_id,category,amount\n'
    '2024-01-15T09:23:11,42,food,12.50\n'
    '2024-01-15T11:08:00,17,transport,3.20\n'
    '2024-01-15T13:42:30,42,food,8.75\n'
    '2024-01-16T07:15:00,99,utilities,45.00\n'
    '2024-01-16T19:30:55,17,food,22.40\n'
    '2024-01-17T08:01:00,42,transport,2.80\n'
)

totals: Counter[int] = Counter()
with open(sample) as f:
    next(f)                              # skip header
    for line in f:                       # one line at a time
        _, user_id, category, amount = line.rstrip().split(',')
        if category == 'food':           # filter
            totals[int(user_id)] += float(amount)

for user, total in totals.most_common():
    print(f'user {user}: £{total:.2f}')

Variant: wrap the parser in a generator function¶

Once the per-line work is more than split-and-cast, lift it into a named generator. You get a testable thing that takes a file handle and yields typed records. The pipeline above stays the same; the parser becomes swappable.

import csv
from dataclasses import dataclass

@dataclass
class Tx:
    timestamp: str
    user_id: int
    category: str
    amount: float

def read_transactions(file):
    '''Yield Tx records lazily from an open CSV file.'''
    reader = csv.reader(file)
    next(reader)                    # header
    for row in reader:
        yield Tx(row[0], int(row[1]), row[2], float(row[3]))

# newline='' leaves line-ending handling to the csv module, as its docs recommend.
with open('/tmp/transactions.csv', newline='') as f:
    for tx in read_transactions(f):
        print(tx)
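
To illustrate what "swappable" could mean in practice, here is a minimal sketch of an alternative parser that looks columns up by header name using csv.DictReader. The function name read_transactions_by_name and the io.StringIO stand-in for an open file are assumptions made for the example; the Tx dataclass is repeated so the block runs on its own.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass

@dataclass
class Tx:
    timestamp: str
    user_id: int
    category: str
    amount: float

def read_transactions_by_name(file):
    '''Hypothetical alternative parser: columns are looked up by header name.'''
    for row in csv.DictReader(file):      # DictReader consumes the header row itself
        yield Tx(row['timestamp'], int(row['user_id']),
                 row['category'], float(row['amount']))

# io.StringIO stands in for an open file so the sketch is self-contained.
fake_file = io.StringIO(
    'timestamp,user_id,category,amount\n'
    '2024-01-15T09:23:11,42,food,12.50\n'
    '2024-01-15T11:08:00,17,transport,3.20\n'
    '2024-01-15T13:42:30,42,food,8.75\n'
)

# The consuming loop does not care which parser produced the Tx records.
food_totals = Counter()
for tx in read_transactions_by_name(fake_file):
    if tx.category == 'food':
        food_totals[tx.user_id] += tx.amount

print(food_totals)
```

The consuming loop only depends on the Tx fields, so either parser can feed it.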

Variant: multi-line records¶

Sometimes a 'record' spans several lines — a multi-line log entry, a pretty-printed JSON object. Accumulate lines into a buffer, yield at the boundary, flush at the end. Still O(longest-record), not O(file).

from pathlib import Path

log = Path('/tmp/multiline.log')
log.write_text(
    '2024-01-15 ERROR: connection refused\n'
    '    at module foo.bar\n'
    '    at module foo.baz\n'
    '2024-01-15 INFO: retry succeeded\n'
    '2024-01-16 ERROR: out of memory\n'
    '    at module qux.quux\n'
)

def read_log_entries(file):
    '''An entry starts with a date; indented lines are continuations.'''
    buffer: list[str] = []
    for line in file:
        if line and not line[0].isspace() and buffer:
            yield ''.join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield ''.join(buffer)

with open(log) as f:
    for entry in read_log_entries(f):
        print('---')
        print(entry, end='')
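
As a rough sketch of how multi-line entries can feed a reducer, the block below counts entries per log level. It repeats read_log_entries so it runs standalone, uses io.StringIO in place of a real file, and assumes the level is the second whitespace-separated token of each entry, which holds for the sample data but may not for real logs.

```python
import io
from collections import Counter

def read_log_entries(file):
    '''An entry starts at an unindented line; indented lines are continuations.'''
    buffer = []
    for line in file:
        if not line[0].isspace() and buffer:
            yield ''.join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield ''.join(buffer)

# io.StringIO stands in for an open log file.
log = io.StringIO(
    '2024-01-15 ERROR: connection refused\n'
    '    at module foo.bar\n'
    '    at module foo.baz\n'
    '2024-01-15 INFO: retry succeeded\n'
    '2024-01-16 ERROR: out of memory\n'
    '    at module qux.quux\n'
)

# Assumption: the second token of an entry is the level followed by a colon.
levels = Counter(entry.split()[1].rstrip(':') for entry in read_log_entries(log))
print(levels)
```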

Variant: fixed-size binary chunks with `iter(callable, sentinel)`¶

Two-argument iter calls the callable repeatedly until it returns the sentinel. For binary files read in fixed-size blocks it's cleaner than a while True loop.

from pathlib import Path

binpath = Path('/tmp/data.bin')
binpath.write_bytes(b'A' * 10 + b'B' * 10 + b'C' * 5)

with open(binpath, 'rb') as f:
    for chunk in iter(lambda: f.read(8), b''):   # stops when read returns b''
        print(len(chunk), chunk)
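
One place this idiom might pay off is hashing a file that is too big to read in one go. The sketch below is an assumption-laden example rather than a canonical recipe: sha256_of_file is a hypothetical helper, functools.partial replaces the lambda, and a temporary directory holds the demo file.

```python
import hashlib
import tempfile
from functools import partial
from pathlib import Path

def sha256_of_file(path, chunk_size=64 * 1024):
    '''Hypothetical helper: hash a file chunk by chunk instead of reading it whole.'''
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # partial(f.read, chunk_size) is called until it returns b'' at end of file.
        for chunk in iter(partial(f.read, chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'data.bin'
    payload = b'A' * 10 + b'B' * 10 + b'C' * 5
    path.write_bytes(payload)
    streamed = sha256_of_file(path, chunk_size=8)   # tiny chunks, to force several reads

# Hashing the bytes in one call should give the same digest.
whole = hashlib.sha256(payload).hexdigest()
print(streamed == whole)
```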

Why this works¶

A file object is an iterator — it yields one line (including the trailing newline) each time you call next() on it, and for line in f just drives that protocol. The buffer Python uses under the hood is small and fixed; lines are decoded one at a time. Memory used is O(longest line), not O(file).

The generator-expression style, (... for line in f if ...), keeps filter and transform lazy too. The only eager step is the final reducer: Counter here, but it could equally be sum, max, heapq.nlargest, or writing each transformed line to another file. Because the reducer consumes one value at a time, the whole chain stays streaming.
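
A minimal sketch of that style, assuming the same CSV layout as the sample file and using io.StringIO as a stand-in for an open file handle. Two generator expressions are chained and sum is the reducer; nothing is read until sum starts pulling values.

```python
import io

# io.StringIO stands in for an open file handle.
fake_file = io.StringIO(
    'timestamp,user_id,category,amount\n'
    '2024-01-15T09:23:11,42,food,12.50\n'
    '2024-01-15T11:08:00,17,transport,3.20\n'
    '2024-01-15T13:42:30,42,food,8.75\n'
    '2024-01-16T07:15:00,99,utilities,45.00\n'
    '2024-01-16T19:30:55,17,food,22.40\n'
    '2024-01-17T08:01:00,42,transport,2.80\n'
)

next(fake_file)                                              # skip header
rows = (line.rstrip().split(',') for line in fake_file)      # lazy parse
food_amounts = (float(row[3]) for row in rows if row[2] == 'food')  # lazy filter + transform
total = sum(food_amounts)                                    # the only eager step

print(f'{total:.2f}')
```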

This pattern is also streaming-friendly in the broader sense: the same code works on sys.stdin, a pipe, or a network socket wrapped in a file object with socket.makefile() (a raw socket is not itself iterable). Anything that supports line-at-a-time iteration plugs into the same shape.
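
One way to see this is to write the consumer against "any iterable of lines" rather than a file. In the sketch below, count_matching is a hypothetical helper; it is fed an io.StringIO and a plain list, and sys.stdin would plug in the same way.

```python
import io

def count_matching(lines, needle):
    '''Hypothetical helper: count the lines that contain needle.'''
    return sum(1 for line in lines if needle in line)

from_memory = io.StringIO('ok\nERROR a\nok\nERROR b\n')   # file-like object
from_list = ['ERROR x\n', 'ok\n']                         # any iterable of strings

in_memory_hits = count_matching(from_memory, 'ERROR')
list_hits = count_matching(from_list, 'ERROR')
print(in_memory_hits, list_hits)

# In a script, count_matching(sys.stdin, 'ERROR') would consume piped input line by line.
```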

Trade-offs¶

The anti-pattern is f.read() or f.readlines() — both pull the whole file into memory. Fine for 10 KB config files; catastrophic for 10 GB logs. If you ever find yourself typing .readlines(), check whether a for line in f loop would do instead.

sorted(...) on the stream also breaks the streaming property: sorting needs every element, so it materialises. If you only want the top-k, heapq.nlargest(k, iterable) runs in O(n log k) time and O(k) memory. If you genuinely need a full sort, accept the cost and think about external sort (sort chunks, merge) for the really large cases.
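
A short sketch of the top-k case, assuming the same CSV layout as the sample file and io.StringIO in place of a real file. heapq.nlargest consumes the lazy row stream and only ever holds k rows.

```python
import heapq
import io

# io.StringIO stands in for an open file handle.
fake_file = io.StringIO(
    'timestamp,user_id,category,amount\n'
    '2024-01-15T09:23:11,42,food,12.50\n'
    '2024-01-15T11:08:00,17,transport,3.20\n'
    '2024-01-15T13:42:30,42,food,8.75\n'
    '2024-01-16T07:15:00,99,utilities,45.00\n'
    '2024-01-16T19:30:55,17,food,22.40\n'
    '2024-01-17T08:01:00,42,transport,2.80\n'
)

next(fake_file)                                           # skip header
rows = (line.rstrip().split(',') for line in fake_file)   # lazy stream of rows

# Keep only the two largest transactions by amount; the rest are discarded as they pass.
top2 = heapq.nlargest(2, rows, key=lambda row: float(row[3]))

for row in top2:
    print(row[1], row[3])
```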

Keep all iteration inside the with block. If you return a generator whose source is the file and then the with block exits, subsequent next() calls will raise ValueError, because the file is closed by then. Fix: consume inside the with, or move open outside and close explicitly with contextlib.closing.
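
The failure mode is easy to reproduce. In this sketch, lines_leaky and count_lines are hypothetical names: the first returns a generator that outlives its with block, the second runs the reducer while the file is still open.

```python
import tempfile
from pathlib import Path

def lines_leaky(path):
    '''Anti-pattern: the returned generator outlives the with block.'''
    with open(path) as f:
        return (line.rstrip() for line in f)

def count_lines(path):
    '''Fix: the reducer runs while the file is still open.'''
    with open(path) as f:
        return sum(1 for _ in f)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'demo.txt'
    path.write_text('a\nb\nc\n')

    try:
        next(lines_leaky(path))          # the file is already closed here
        outcome = 'no error'
    except ValueError as exc:
        outcome = f'ValueError: {exc}'

    n = count_lines(path)

print(outcome)
print(n)
```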

Related reading¶

Combine generators into a pipeline: the read loop is the source stage of a larger pipeline.
Avoid common iterator mistakes: readlines trap, consuming twice, forgetting the reducer.
Laziness and memory: why the iterator protocol gives you constant memory for free.
