{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extract data from text\n",
    "\n",
    "**The question.** You've got a lump of unstructured text — a log file, a report, a page of form submissions — and you want to pull the dates, phone numbers, or key-value pairs out of it as proper Python objects, without writing a bespoke parser.\n",
    "\n",
    "The technique that does nearly all the work is **named groups** (`(?P<name>...)`) with `re.finditer` and `.groupdict()`. Each match comes back as a dictionary you can drop straight into a `DataFrame`, a JSON blob, or anywhere else you'd want structured data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "DATE_PATTERN = re.compile(\n",
    "    r'(?P<day>0[1-9]|[12]\\d|3[01])'\n",
    "    r'/'\n",
    "    r'(?P<month>0[1-9]|1[0-2])'\n",
    "    r'/'\n",
    "    r'(?P<year>\\d{4})'\n",
    ")\n",
    "\n",
    "KV_PATTERN = re.compile(\n",
    "    r'(?P<key>\\w+)\\s*[:=]\\s*(?P<value>.+?)\\s*$',\n",
    "    re.MULTILINE,\n",
    ")\n",
    "\n",
    "\n",
    "def extract(pattern: re.Pattern, text: str) -> list[dict[str, str]]:\n",
    "    \"\"\"Return a list of named-group dicts for every match in text.\"\"\"\n",
    "    return [m.groupdict() for m in pattern.finditer(text)]\n",
    "\n",
    "\n",
    "# Pull dates out of a sentence.\n",
    "sentence = 'Events on 25/12/2026, 01/01/2027, and 14/02/2027'\n",
    "print('Dates:', extract(DATE_PATTERN, sentence))\n",
    "\n",
    "# Pull key/value pairs out of a config-style block (note: either\n",
    "# `key: value` or `key = value` works thanks to the `[:=]` class).\n",
    "config = '''\n",
    "name: Alice Smith\n",
    "email: alice@example.com\n",
    "role = admin\n",
    "department: Engineering\n",
    "'''\n",
    "print('Config:', extract(KV_PATTERN, config))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why it works\n",
    "\n",
    "`re.finditer` returns an iterator that yields one `Match` object per match, lazily: matches are found one at a time as you iterate, and no intermediate list is built. (The `extract` helper above collects them into a list for convenience.) Each `Match` has a `.groupdict()` method that returns the named captures as a Python dict, which is almost always the shape you want to work with downstream.\n",
    "\n",
    "The date pattern is worth reading slowly. The `day` and `month` groups combine **alternation** (`|`) with character classes to enumerate only valid ranges: `0[1-9]|[12]\\d|3[01]` matches 01–31 but rejects 00, 32, and 99. That's more work than `\\d{2}`, but it means an out-of-range day or month simply doesn't match, so the regex does a first pass of validation for you. It cannot tell that 31/02 is impossible, though; that needs a real date check.\n",
    "\n",
    "The key/value pattern leans on `re.MULTILINE` so the `$` anchor matches at the end of every line (just before each newline) rather than only at the very end of the string. Without it, the pattern would only find the last line of the block."
   ]
  },
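  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what `re.MULTILINE` changes, here is a minimal sketch that compiles the same key/value expression with and without the flag and runs both over a two-line sample. The names `KV_SOURCE`, `single_line`, and `multi_line`, and the sample text, are illustrative assumptions rather than part of the recipe above.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Same key/value expression as KV_PATTERN, kept as a string so it can be\n",
    "# compiled twice (KV_SOURCE is a hypothetical name for this sketch).\n",
    "KV_SOURCE = r'(?P<key>\\w+)\\s*[:=]\\s*(?P<value>.+?)\\s*$'\n",
    "single_line = re.compile(KV_SOURCE)               # `$` only at the end of the text\n",
    "multi_line = re.compile(KV_SOURCE, re.MULTILINE)  # `$` at the end of every line\n",
    "\n",
    "sample = 'name: Alice\\nrole = admin\\n'  # made-up input\n",
    "\n",
    "without_flag = [m.groupdict() for m in single_line.finditer(sample)]\n",
    "with_flag = [m.groupdict() for m in multi_line.finditer(sample)]\n",
    "\n",
    "print(without_flag)  # only the last line: role / admin\n",
    "print(with_flag)     # both lines: name / Alice, then role / admin\n",
    "```\n",
    "\n",
    "Because `.` does not cross a newline, the flagless version cannot reach the end of the text from any line but the last, which is why the earlier lines drop out."
   ]
  },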
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trade-offs and when not to use this\n",
    "\n",
    "- **If the input is really a format (JSON, CSV, HTML, YAML, TOML), use a proper parser.** Regex will get you 80% there and then fall over on an edge case your sample didn't include. The standard library has `json`, `csv`, `html.parser`, and `tomllib` (Python 3.11+); reach for them first. YAML needs a third-party library such as PyYAML.\n",
    "- **Phone numbers, URLs, postcodes: adapt, don't copy.** Every country's phone format has its own rules; URL patterns have to decide whether to allow queries, fragments, and ports. Take the named-group structure above and tune the character classes to the format you actually see in your data.\n",
    "- **Extract, then validate.** If the extracted pieces need to satisfy more than \"looks roughly right\" (e.g. dates must be real calendar dates), pass them through `datetime.strptime` or similar after the regex match succeeds.\n",
    "- **For very large files, stream the input and use `re.finditer`** rather than reading everything into memory and calling `re.findall`, which builds the full list of results up front. See the *Use regex with file I/O* recipe for the streaming pattern."
   ]
  },
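  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The regex-then-`strptime` step from the list above can be sketched as follows. This is one possible shape, assuming the DD/MM/YYYY format used earlier in this recipe; `parse_dates` is a hypothetical helper name, and the sample sentence is made up.\n",
    "\n",
    "```python\n",
    "import re\n",
    "from datetime import date, datetime\n",
    "\n",
    "# Same date expression as DATE_PATTERN above, written on fewer lines.\n",
    "DATE_PATTERN = re.compile(\n",
    "    r'(?P<day>0[1-9]|[12]\\d|3[01])/(?P<month>0[1-9]|1[0-2])/(?P<year>\\d{4})'\n",
    ")\n",
    "\n",
    "\n",
    "def parse_dates(text: str) -> list[date]:\n",
    "    \"\"\"Hypothetical helper: keep only matches that are real calendar dates.\"\"\"\n",
    "    found = []\n",
    "    for m in DATE_PATTERN.finditer(text):\n",
    "        try:\n",
    "            found.append(datetime.strptime(m.group(0), '%d/%m/%Y').date())\n",
    "        except ValueError:\n",
    "            # The shape was right but the date does not exist (e.g. 31/02).\n",
    "            continue\n",
    "    return found\n",
    "\n",
    "\n",
    "print(parse_dates('Due 31/02/2027, moved to 14/02/2027'))  # only 14/02/2027 survives\n",
    "```\n",
    "\n",
    "The regex stays cheap and permissive about the calendar, and `strptime` does the part a pattern cannot express, such as month lengths and leap years."
   ]
  },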
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Related\n",
    "\n",
    "- **Learn** — [Groups and capturing](https://agilearn.co.uk/guides/regex/learn/03-groups-and-capturing) for a ground-up look at named groups and `.groupdict()`.\n",
    "- **Recipe** — [Use regex with file I/O](https://agilearn.co.uk/guides/regex/recipes/use-regex-with-file-io) for the streaming version of the same technique over log files.\n",
    "- **Reference** — [`re` module quick reference](https://agilearn.co.uk/guides/regex/reference/re-module-quick-reference) and [Regex flags](https://agilearn.co.uk/guides/regex/reference/regex-flags-reference) for the `MULTILINE` flag used here.\n",
    "- **Concepts** — [Understanding the regex engine](https://agilearn.co.uk/guides/regex/concepts/understanding-the-regex-engine) for what `finditer` actually does under the hood."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}