{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Process large files\n",
    "\n",
    "**The question.** You have a text file that's too big to load — a multi-GB log, a 10-million-row CSV — and you need to scan, filter, transform, or summarise it. `f.read()` and `f.readlines()` both load the whole file; you want the constant-memory path.\n",
    "\n",
    "The answer: iterate the file object directly. `for line in f` yields one line at a time, O(longest line) in memory, regardless of file size. Wrap any filter/transform in a generator expression and consume with a reducer (`sum`, `Counter`, writing to another file) to keep the whole pipeline streaming.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Line-by-line iteration: memory is bounded by the longest line, not the file size,\n",
    "# so the same code works on a 10-KB sample and a 10-GB log.\n",
    "from pathlib import Path\n",
    "from collections import Counter\n",
    "\n",
    "# Make a sample. In production this is the file you can't load.\n",
    "sample = Path('/tmp/events.log')\n",
    "sample.write_text(''.join(\n",
    "    f'2026-04-{(i % 30) + 1:02d} INFO: event {i} for user {i % 5}\\n'\n",
    "    for i in range(10_000)\n",
    "), encoding='utf-8')\n",
    "\n",
    "# Streaming count of events per user — one line at a time.\n",
    "counts: Counter[str] = Counter()\n",
    "with open(sample, encoding='utf-8') as f:\n",
    "    for line in f:\n",
    "        if 'INFO' in line:                           # filter\n",
    "            user = line.rsplit(' ', 1)[-1].strip()   # transform\n",
    "            counts[user] += 1                        # aggregate\n",
    "\n",
    "for user, n in counts.most_common():\n",
    "    print(f'user {user}: {n}')\n",
    "\n",
    "sample.unlink()\n"
   ]
  },
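  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same count can also be written as a generator pipeline, which is one way to read the generator-expression-plus-reducer shape mentioned in the introduction. This is a minimal sketch rather than the only layout: the `events-pipeline.log` file name and the names `info_lines` and `users` are assumptions made for the illustration. Each stage pulls one line at a time from the stage before it, and `Counter` acts as the reducer. The reducer has to run inside the `with` block, because the generators read from the file only when they are consumed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from collections import Counter\n",
    "from pathlib import Path\n",
    "\n",
    "# Hypothetical sample file, recreated here so the cell runs on its own.\n",
    "sample = Path(tempfile.gettempdir()) / 'events-pipeline.log'\n",
    "sample.write_text(''.join(\n",
    "    f'2026-04-{(i % 30) + 1:02d} INFO: event {i} for user {i % 5}\\n'\n",
    "    for i in range(10_000)\n",
    "), encoding='utf-8')\n",
    "\n",
    "with open(sample, encoding='utf-8') as f:\n",
    "    info_lines = (line for line in f if 'INFO' in line)                # filter\n",
    "    users = (line.rsplit(' ', 1)[-1].strip() for line in info_lines)   # transform\n",
    "    counts = Counter(users)                                            # aggregate while the file is open\n",
    "\n",
    "for user, n in counts.most_common():\n",
    "    print(f'user {user}: {n}')\n",
    "\n",
    "sample.unlink()\n"
   ]
  },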
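  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variant: fixed-size chunks for character-counting or binary-like work\n",
    "\n",
    "When line boundaries don't matter (counting characters, hashing, streaming a network upload), `f.read(n)` in a loop is the direct form, and `iter(callable, sentinel)` keeps the loop clean. In text mode `read(n)` returns up to `n` decoded characters and the EOF sentinel is `''`; for byte-level work such as hashing, open the file with `'rb'` so that `read(n)` returns bytes and the sentinel is `b''`.\n"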
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "path = Path('/tmp/chunk-demo.txt')\n",
    "path.write_text('x' * 25_000, encoding='utf-8')\n",
    "\n",
    "total = 0\n",
    "with path.open(encoding='utf-8') as f:\n",
    "    for chunk in iter(lambda: f.read(8192), ''):   # '' is the EOF sentinel\n",
    "        total += len(chunk)\n",
    "\n",
    "print(f'read {total:,} chars')\n",
    "path.unlink()\n"
   ]
  },
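  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For hashing, the same loop would normally run in binary mode, since a digest is defined over bytes rather than decoded characters. The cell below is a sketch under a few assumptions: the `hash-demo.bin` file name, the SHA-256 algorithm, and the 8192-byte chunk size are all illustrative choices, not requirements.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hashlib\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "# Hypothetical demo file; in practice this is the file that is too big to read at once.\n",
    "path = Path(tempfile.gettempdir()) / 'hash-demo.bin'\n",
    "path.write_bytes(b'x' * 25_000)\n",
    "\n",
    "digest = hashlib.sha256()\n",
    "with path.open('rb') as f:\n",
    "    for chunk in iter(lambda: f.read(8192), b''):   # b'' is the EOF sentinel in binary mode\n",
    "        digest.update(chunk)                        # feed the hash one chunk at a time\n",
    "\n",
    "print(digest.hexdigest())\n",
    "path.unlink()\n"
   ]
  },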
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variant: CSV with `DictReader`\n",
    "\n",
    "`DictReader` wraps the file iterator and yields each row as a dict, keyed by the header. Memory still scales with row size, not file size.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "from pathlib import Path\n",
    "\n",
    "path = Path('/tmp/prices.csv')\n",
    "path.write_text('\\n'.join(\n",
    "    ['name,value'] + [f'item_{i},{i * 1.5}' for i in range(1, 1001)]\n",
    ") + '\\n', encoding='utf-8')\n",
    "\n",
    "total, count = 0.0, 0\n",
    "with path.open(encoding='utf-8', newline='') as f:\n",
    "    for row in csv.DictReader(f):\n",
    "        total += float(row['value'])\n",
    "        count += 1\n",
    "\n",
    "print(f'{count} rows, total £{total:.2f}')\n",
    "path.unlink()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why this works\n",
    "\n",
    "A file object is an iterator: `next(f)` yields the next line, including its trailing newline. `for line in f` is just that protocol driven by the loop. Python keeps a small internal buffer and decodes lazily, so memory stays O(longest line), not O(file).\n",
    "\n",
    "The `with open(...)` wrapper closes the file on exit, whether the loop finished naturally or raised. Skipping `encoding='utf-8'` is a cross-platform trap: without it, `open()` falls back to a locale-dependent default (commonly a legacy code page such as cp1252 on Windows, usually UTF-8 on Linux and macOS), so a file that read cleanly on Linux can raise `UnicodeDecodeError` or silently decode to mojibake on Windows. Always specify the encoding.\n",
    "\n",
    "The pattern generalises: the same shape works for `sys.stdin`, a socket wrapped with `socket.makefile()`, or the `stdout` pipe of a subprocess. Anything that yields one record per iteration plugs in unchanged.\n"
   ]
  },
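  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the point about other line sources, here is a sketch that runs the same loop over the output of a subprocess. The child program is a made-up one-liner (held in the hypothetical `child_code` variable) that prints an event id and a user id on each line; a real pipeline would launch whatever command produces your records.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "import sys\n",
    "from collections import Counter\n",
    "\n",
    "# Hypothetical child process: prints an event id and a user id on each of five lines.\n",
    "child_code = 'for i in range(5): print(i, i % 2)'\n",
    "\n",
    "counts = Counter()\n",
    "with subprocess.Popen(\n",
    "    [sys.executable, '-c', child_code],\n",
    "    stdout=subprocess.PIPE,\n",
    "    text=True,\n",
    "    encoding='utf-8',\n",
    ") as proc:\n",
    "    for line in proc.stdout:             # same loop as for a file on disk\n",
    "        event_id, user = line.split()    # transform\n",
    "        counts[user] += 1                # aggregate\n",
    "\n",
    "for user, n in counts.most_common():\n",
    "    print(f'user {user}: {n}')\n"
   ]
  },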
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trade-offs\n",
    "\n",
    "The anti-pattern is `f.read()` or `f.readlines()`: both materialise the entire file. Fine for a 10-KB config; catastrophic for a 10-GB log. Before you type `.readlines()`, ask whether a `for line in f` loop would do.\n",
    "\n",
    "`sorted(...)` on a streaming pipeline destroys the constant-memory property, because sorting needs every element. For top-*k* queries, `heapq.nlargest(k, iterable)` runs in O(n log k) time and O(k) memory. For a full sort on a too-big file, you're into external-sort territory (sort chunks, merge), and it is usually easier to push that into a purpose-built tool (`sort(1)` on Unix, DuckDB, pandas with `chunksize`).\n",
    "\n",
    "For CSVs, `csv.reader` or `csv.DictReader` wrap the file iterator and give you parsed rows; every field arrives as a `str`, so convert numbers yourself. For binary files, read fixed-size chunks via `iter(lambda: f.read(8192), b'')`. A bare `iter(f.read, b'')` would call `f.read()` with no size and load the whole file in one go. See the [work-with-binary-files](https://agilearn.co.uk/guides/file-handling/recipes/work-with-binary-files) recipe.\n"
   ]
  },
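  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the top-*k* case, the cell below streams a CSV through `heapq.nlargest` without ever holding more than `k` rows. The `prices-topk.csv` file name, the column names, and `k = 3` are assumptions chosen to mirror the earlier CSV example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import heapq\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "# Hypothetical sample file, shaped like the earlier prices example.\n",
    "path = Path(tempfile.gettempdir()) / 'prices-topk.csv'\n",
    "path.write_text('\\n'.join(\n",
    "    ['name,value'] + [f'item_{i},{i * 1.5}' for i in range(1, 1001)]\n",
    ") + '\\n', encoding='utf-8')\n",
    "\n",
    "with path.open(encoding='utf-8', newline='') as f:\n",
    "    # Lazy stream of (value, name) pairs; fields are strings until converted.\n",
    "    rows = ((float(row['value']), row['name']) for row in csv.DictReader(f))\n",
    "    top3 = heapq.nlargest(3, rows)   # consumes the stream, keeps only 3 pairs\n",
    "\n",
    "for value, name in top3:\n",
    "    print(f'{name}: {value}')\n",
    "path.unlink()\n"
   ]
  },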
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Related reading\n",
    "\n",
    "- [Work with binary files](https://agilearn.co.uk/guides/file-handling/recipes/work-with-binary-files) — the chunk-at-a-time pattern for non-text data.\n",
    "- [Avoid common file-handling mistakes](https://agilearn.co.uk/guides/file-handling/recipes/avoid-common-file-handling-mistakes) — the `readlines` trap and other anti-patterns.\n",
    "- [Process a large file lazily](https://agilearn.co.uk/guides/iterators-and-generators/recipes/process-a-large-file-lazily) — the generator-pipeline view of the same pattern.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}