{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Process a large file lazily\n",
    "\n",
    "**The question.** You have a file that might be bigger than memory — a 10 GB log, a multi-million-row CSV, a stream that never ends — and you need to read it, transform each record, and produce a summary. The summary fits; the raw file doesn't.\n",
    "\n",
    "The answer: iterate the file handle directly (`for line in f`), wrap the per-record work in a generator expression or function, and let a reducer (`sum`, `Counter`, writing to another file) consume the stream end-to-end. The file's iterator yields one line at a time, so memory stays O(1) regardless of file size.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Lazy read + filter + transform + aggregate, all streaming.\n",
    "# Here the 'aggregate' is a Counter of per-user totals, but the shape is\n",
    "# identical whether the file is 10 KB or 10 GB.\n",
    "from pathlib import Path\n",
    "from collections import Counter\n",
    "\n",
    "# Make a sample file to demo with (in production this is the file you can't load).\n",
    "sample = Path('/tmp/transactions.csv')\n",
    "sample.write_text(\n",
    "    'timestamp,user_id,category,amount\\n'\n",
    "    '2024-01-15T09:23:11,42,food,12.50\\n'\n",
    "    '2024-01-15T11:08:00,17,transport,3.20\\n'\n",
    "    '2024-01-15T13:42:30,42,food,8.75\\n'\n",
    "    '2024-01-16T07:15:00,99,utilities,45.00\\n'\n",
    "    '2024-01-16T19:30:55,17,food,22.40\\n'\n",
    "    '2024-01-17T08:01:00,42,transport,2.80\\n'\n",
    ")\n",
    "\n",
    "totals: Counter[int] = Counter()\n",
    "with open(sample) as f:\n",
    "    next(f)                              # skip header\n",
    "    for line in f:                       # one line at a time\n",
    "        _, user_id, category, amount = line.rstrip().split(',')\n",
    "        if category == 'food':           # filter\n",
    "            totals[int(user_id)] += float(amount)\n",
    "\n",
    "for user, total in totals.most_common():\n",
    "    print(f'user {user}: £{total:.2f}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variant: wrap the parser in a generator function\n",
    "\n",
    "Once the per-line work is more than `split-and-cast`, lift it into a named generator. You get a testable thing that takes a file handle and yields typed records. The pipeline above stays the same; the parser becomes swappable.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "from dataclasses import dataclass\n",
    "\n",
    "@dataclass\n",
    "class Tx:\n",
    "    timestamp: str\n",
    "    user_id: int\n",
    "    category: str\n",
    "    amount: float\n",
    "\n",
    "def read_transactions(file):\n",
    "    '''Yield Tx records lazily from an open CSV file.'''\n",
    "    reader = csv.reader(file)\n",
    "    next(reader)                    # header\n",
    "    for row in reader:\n",
    "        yield Tx(row[0], int(row[1]), row[2], float(row[3]))\n",
    "\n",
    "from pathlib import Path\n",
    "with open('/tmp/transactions.csv') as f:\n",
    "    for tx in read_transactions(f):\n",
    "        print(tx)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variant: multi-line records\n",
    "\n",
    "Sometimes a 'record' spans several lines — a multi-line log entry, a pretty-printed JSON object. Accumulate lines into a buffer, yield at the boundary, flush at the end. Still O(longest-record), not O(file).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "log = Path('/tmp/multiline.log')\n",
    "log.write_text(\n",
    "    '2024-01-15 ERROR: connection refused\\n'\n",
    "    '    at module foo.bar\\n'\n",
    "    '    at module foo.baz\\n'\n",
    "    '2024-01-15 INFO: retry succeeded\\n'\n",
    "    '2024-01-16 ERROR: out of memory\\n'\n",
    "    '    at module qux.quux\\n'\n",
    ")\n",
    "\n",
    "def read_log_entries(file):\n",
    "    '''An entry starts with a date; indented lines are continuations.'''\n",
    "    buffer: list[str] = []\n",
    "    for line in file:\n",
    "        if line and not line[0].isspace() and buffer:\n",
    "            yield ''.join(buffer)\n",
    "            buffer = []\n",
    "        buffer.append(line)\n",
    "    if buffer:\n",
    "        yield ''.join(buffer)\n",
    "\n",
    "with open(log) as f:\n",
    "    for entry in read_log_entries(f):\n",
    "        print('---')\n",
    "        print(entry, end='')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variant: fixed-size binary chunks with `iter(callable, sentinel)`\n",
    "\n",
    "Two-argument `iter` calls the callable repeatedly until it returns the sentinel. For binary files read in fixed-size blocks it's cleaner than a `while True` loop.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "binpath = Path('/tmp/data.bin')\n",
    "binpath.write_bytes(b'A' * 10 + b'B' * 10 + b'C' * 5)\n",
    "\n",
    "with open(binpath, 'rb') as f:\n",
    "    for chunk in iter(lambda: f.read(8), b''):   # stops when read returns b''\n",
    "        print(len(chunk), chunk)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why this works\n",
    "\n",
    "A file object *is* an iterator — it yields one line (including the trailing newline) each time you call `next()` on it, and `for line in f` just drives that protocol. The buffer Python uses under the hood is small and fixed; lines are decoded one at a time. Memory used is O(longest line), not O(file).\n",
    "\n",
    "The generator-expression style (`for line in f if ...`) keeps filter and transform lazy too. The only eager step is the final reducer — `Counter` here, but it could equally be `sum`, `max`, `heapq.nlargest`, or writing each transformed line to another file. Because the reducer consumes one value at a time, the whole chain stays streaming.\n",
    "\n",
    "This pattern is also streaming-friendly in the broader sense: the same code works on `sys.stdin`, a network socket, or a pipe. Anything that supports line-at-a-time iteration plugs into the same shape.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trade-offs\n",
    "\n",
    "The anti-pattern is `f.read()` or `f.readlines()` — both pull the whole file into memory. Fine for 10 KB config files; catastrophic for 10 GB logs. If you ever find yourself typing `.readlines()`, check whether a `for line in f` loop would do instead.\n",
    "\n",
    "`sorted(...)` on the stream also breaks the streaming property — sorting needs every element, so it materialises. If you only want the top-*k*, `heapq.nlargest(k, iterable)` runs in O(n) time and O(k) memory. If you genuinely need a full sort, accept the cost and think about external sort (sort chunks, merge) for the really large cases.\n",
    "\n",
    "Keep all iteration inside the `with` block. If you return a generator whose source is the file and then the `with` block exits, subsequent `next()` calls will raise — the file is closed by then. Fix: consume inside the `with`, or move `open` outside and close explicitly with `contextlib.closing`.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Related reading\n",
    "\n",
    "- [Combine generators into a pipeline](https://agilearn.co.uk/guides/iterators-and-generators/recipes/combine-generators) — the read loop is the source stage of a larger pipeline.\n",
    "- [Avoid common iterator mistakes](https://agilearn.co.uk/guides/iterators-and-generators/recipes/avoid-common-iterator-mistakes) — `readlines` trap, consuming twice, forgetting the reducer.\n",
    "- [Laziness and memory](https://agilearn.co.uk/guides/iterators-and-generators/concepts/laziness-and-memory) — why the iterator protocol gives you constant memory for free.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}