{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use regex with file I/O\n",
    "\n",
    "**The question.** You want to run a pattern over a file — a log, a CSV, a config — and get back the matching lines (or the structured data inside them) without loading the whole thing into memory.\n",
    "\n",
    "The idiom is `re.compile` once, then stream the file line by line inside a `with` block. Each line is a tiny string the regex can chew through instantly, so the memory footprint stays flat even on multi-gigabyte logs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from pathlib import Path\n",
    "\n",
    "# Create a small demo log so the rest of the cell is self-contained.\n",
    "sample_path = Path('sample_log.txt')\n",
    "sample_path.write_text(\n",
    "    'Server started at 09:00:00\\n'\n",
    "    'User alice logged in at 09:15:30\\n'\n",
    "    'Error: connection timeout at 09:20:45\\n'\n",
    "    'User bob logged in at 09:25:00\\n'\n",
    "    'Warning: high memory usage at 09:30:15\\n'\n",
    "    'Error: disk space low at 10:15:30\\n'\n",
    ")\n",
    "\n",
    "LOG_PATTERN = re.compile(\n",
    "    r'(?P<type>Error|Warning|User \\w+)'\n",
    "    r'.*?at\\s+'\n",
    "    r'(?P<time>\\d{2}:\\d{2}:\\d{2})'\n",
    ")\n",
    "\n",
    "\n",
    "def parse_log(path: Path) -> list[dict[str, str]]:\n",
    "    \"\"\"Stream a log file and return one dict per matched line.\"\"\"\n",
    "    entries = []\n",
    "    with path.open() as fh:\n",
    "        for line in fh:\n",
    "            match = LOG_PATTERN.search(line)\n",
    "            if match:\n",
    "                entries.append(match.groupdict())\n",
    "    return entries\n",
    "\n",
    "\n",
    "for entry in parse_log(sample_path):\n",
    "    print(f'[{entry[\"time\"]}] {entry[\"type\"]}')\n",
    "\n",
    "sample_path.unlink()  # tidy up the demo file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why it works\n",
    "\n",
    "Opening a file object in Python and iterating it yields one line at a time. Only the current line sits in memory; the rest stays on disk until the iterator asks for it. That's why this pattern is safe on enormous files.\n",
    "\n",
    "`re.compile` is called at module import time (or near the top of the function), not inside the loop. Python caches recent patterns, but compiling explicitly makes the intent obvious and guarantees the cost is paid once, not per line.\n",
    "\n",
    "The `with path.open():` block guarantees the file handle is closed even if the regex or downstream code raises. On Linux that's a nicety; on Windows it's essential, since an open handle prevents the file from being moved or deleted by anything else."
   ]
  },
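  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of a lazier variant, assuming the caller only needs to loop over the results once: a generator yields each match instead of collecting a list, so memory stays flat on the output side as well. The names `ENTRY_PATTERN` and `iter_log_entries` and the demo file contents are illustrative, not part of the recipe above. One caveat: the file stays open until the generator is exhausted or closed, so consume it promptly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import tempfile\n",
    "from collections.abc import Iterator\n",
    "from pathlib import Path\n",
    "\n",
    "# Same shape as the recipe's pattern, redefined so this cell runs on its own.\n",
    "ENTRY_PATTERN = re.compile(\n",
    "    r'(?P<type>Error|Warning|User \\w+)'\n",
    "    r'.*?at\\s+'\n",
    "    r'(?P<time>\\d{2}:\\d{2}:\\d{2})'\n",
    ")\n",
    "\n",
    "\n",
    "def iter_log_entries(path: Path) -> Iterator[dict[str, str]]:\n",
    "    \"\"\"Yield one dict per matching line; the file closes when iteration ends.\"\"\"\n",
    "    with path.open(encoding='utf-8') as fh:\n",
    "        for line in fh:\n",
    "            match = ENTRY_PATTERN.search(line)\n",
    "            if match:\n",
    "                yield match.groupdict()\n",
    "\n",
    "\n",
    "# Demo on a throwaway file (hypothetical contents mirroring the sample log).\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    demo_path = Path(tmp) / 'demo_log.txt'\n",
    "    demo_path.write_text(\n",
    "        'Server started at 09:00:00\\n'\n",
    "        'Error: connection timeout at 09:20:45\\n'\n",
    "        'User bob logged in at 09:25:00\\n',\n",
    "        encoding='utf-8',\n",
    "    )\n",
    "    # Consume lazily: count errors without building a list of every entry.\n",
    "    error_count = sum(\n",
    "        1 for entry in iter_log_entries(demo_path) if entry['type'] == 'Error'\n",
    "    )\n",
    "\n",
    "print(error_count)"
   ]
  },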
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trade-offs and when not to use this\n",
    "\n",
    "- **Small files are easier to search whole.** If the file is under a few megabytes, `path.read_text()` + `re.findall` is shorter and does the job. Use the streaming pattern when size or memory actually matter.\n",
    "- **Line-by-line fails when matches span multiple lines.** If your pattern needs to see across newlines (e.g. a multi-line stack trace), you have to read the whole file and use `re.DOTALL`, or buffer lines yourself into records first.\n",
    "- **`re.MULTILINE` is for whole-file searches, not streaming.** The flag only changes what `^` and `$` mean; it doesn't help if you're already iterating line-by-line.\n",
    "- **Structured formats still deserve their parser.** `json`, `csv`, and `logging`'s own `Formatter` can round-trip a line back to a dict without the fragility of a regex. Use regex when the log format is bespoke and no parser exists."
   ]
  },
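  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The multi-line and whole-file bullets above might look like this in practice. This is a sketch under a stated assumption: a new record starts on any non-indented line, and indented lines continue the previous record. `SAMPLE_LINES`, `RECORD_START`, `TIMEOUT_RECORD`, and `iter_records` are hypothetical names for illustration; `iter_records` accepts any iterable of lines, including an open file handle, so it still streams."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from collections.abc import Iterable, Iterator\n",
    "\n",
    "# Hypothetical log lines: an error followed by indented continuation lines.\n",
    "SAMPLE_LINES = [\n",
    "    'Error: connection timeout at 09:20:45\\n',\n",
    "    '    Traceback (most recent call last):\\n',\n",
    "    '    TimeoutError: no response\\n',\n",
    "    'User bob logged in at 09:25:00\\n',\n",
    "]\n",
    "\n",
    "# Assumption for this sketch: a record starts on any non-indented line.\n",
    "RECORD_START = re.compile(r'^\\S')\n",
    "# re.DOTALL lets '.' cross the newlines inside one buffered record.\n",
    "TIMEOUT_RECORD = re.compile(r'^Error:.*?TimeoutError', re.DOTALL)\n",
    "\n",
    "\n",
    "def iter_records(lines: Iterable[str]) -> Iterator[str]:\n",
    "    \"\"\"Group continuation lines with the record line that precedes them.\"\"\"\n",
    "    buffer: list[str] = []\n",
    "    for line in lines:\n",
    "        if RECORD_START.match(line) and buffer:\n",
    "            yield ''.join(buffer)\n",
    "            buffer = []\n",
    "        buffer.append(line)\n",
    "    if buffer:\n",
    "        yield ''.join(buffer)\n",
    "\n",
    "\n",
    "records = list(iter_records(SAMPLE_LINES))\n",
    "timeouts = [record for record in records if TIMEOUT_RECORD.search(record)]\n",
    "print(len(records), len(timeouts))\n",
    "\n",
    "# Whole-text alternative for small inputs: re.MULTILINE anchors ^ at every line.\n",
    "whole_text = ''.join(SAMPLE_LINES)\n",
    "line_starts = re.findall(r'^(Error|Warning|User \\w+)', whole_text, re.MULTILINE)\n",
    "print(line_starts)"
   ]
  },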
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Related\n",
    "\n",
    "- **Recipe** — [Extract data from text](https://agilearn.co.uk/guides/regex/recipes/extract-data-from-text) for the named-group + `.groupdict()` technique this recipe builds on.\n",
    "- **Recipe** — [Process a large file lazily](https://agilearn.co.uk/guides/iterators-and-generators/recipes/process-a-large-file-lazily) for the more general streaming pattern.\n",
    "- **Reference** — [`re` module quick reference](https://agilearn.co.uk/guides/regex/reference/re-module-quick-reference) and the [File handling guide](https://agilearn.co.uk/guides/file-handling) for `pathlib` basics.\n",
    "- **Concepts** — [Why context managers matter](https://agilearn.co.uk/guides/file-handling/concepts/why-context-managers-matter) for the `with` block."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}