{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parse a messy date column\n",
    "\n",
    "**The question.** You've been handed a CSV column full of dates that use three or four formats interchangeably — ISO, `DD/MM/YYYY`, `21-Apr-2026`, `April 21, 2026` — with empty strings, `\"N/A\"`, and the occasional typo sprinkled through. You want a single cleaning pass that produces `date` objects for the valid rows and `None` (or a raised error) for the rest.\n",
    "\n",
    "The canonical shape is a multi-format parser: try each format in order, return the first that matches, return `None` if none does. Then layer on sentinel-handling on top. Ten lines total."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime, date\n",
    "\n",
    "# Deliberately awful input to clean\n",
    "raw = [\n",
    "    '2026-04-21',\n",
    "    '21/04/2026',\n",
    "    '21-Apr-2026',\n",
    "    'April 21, 2026',\n",
    "    '2026/04/21',\n",
    "    '',              # missing\n",
    "    'N/A',           # sentinel\n",
    "    'not a date',    # junk\n",
    "]\n",
    "\n",
    "FORMATS = (\n",
    "    '%Y-%m-%d',\n",
    "    '%d/%m/%Y',\n",
    "    '%d-%b-%Y',\n",
    "    '%B %d, %Y',\n",
    "    '%Y/%m/%d',\n",
    ")\n",
    "MISSING = {'', 'N/A', 'n/a', 'null', 'NULL'}\n",
    "\n",
    "\n",
    "def parse_date(s: str | None) -> date | None:\n",
    "    \"\"\"Return a date, None for explicit-missing, or raise for unparseable.\"\"\"\n",
    "    if s is None or s.strip() in MISSING:\n",
    "        return None\n",
    "    for fmt in FORMATS:\n",
    "        try:\n",
    "            return datetime.strptime(s.strip(), fmt).date()\n",
    "        except ValueError:\n",
    "            continue\n",
    "    raise ValueError(f'unparseable date: {s!r}')\n",
    "\n",
    "\n",
    "for s in raw:\n",
    "    try:\n",
    "        print(f'{s!r:20} -> {parse_date(s)}')\n",
    "    except ValueError as e:\n",
    "        print(f'{s!r:20} -> ERROR: {e}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Variant: pandas pd.to_datetime — handles most of this for you\n",
    "# import pandas as pd\n",
    "#\n",
    "# series = pd.Series(raw)\n",
    "# parsed = pd.to_datetime(series, errors='coerce', format='mixed', dayfirst=True)\n",
    "# print(parsed)\n",
    "#\n",
    "# format='mixed'    — let each row pick its own format (Python 3.11+ / pandas 2.0+)\n",
    "# dayfirst=True     — resolve DD/MM vs MM/DD in favour of British notation\n",
    "# errors='coerce'   — unparseable rows become NaT instead of raising\n",
    "# .isna()           — count how many rows failed, investigate by hand\n",
    "\n",
    "# Skipped in execution because pandas isn't imported by default in this\n",
    "# environment — uncomment the lines above when running locally.\n",
    "print('(pandas variant shown as reference; uncomment to run)')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why it works\n",
    "\n",
    "The multi-format loop is the Python equivalent of *\"try the formats I've seen, one after another, and take the first that matches\"*. `strptime` raises `ValueError` on a mismatch, so wrapping it in `try`/`except` and `continue`-ing makes the rest of the format list act as fallbacks. That's better than trying to write one giant regex: each format is its own line, new formats slot in without touching the others, and the order of the tuple controls precedence (which matters when `DD/MM` and `MM/DD` would both match).\n",
    "\n",
    "The sentinel set distinguishes three outcomes instead of two: **parsed**, **known missing** (empty, `N/A`), and **error** (junk). That split is what you want downstream — `None` collapses cleanly into pandas `NaT` or SQL `NULL`, while `ValueError` makes the bad rows noisy so you can investigate rather than silently losing them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trade-offs\n",
    "\n",
    "**Ambiguity can't be guessed.** `\"04/05/2026\"` is 4 May in British notation, 5 April in American. `strptime` can't tell you; whichever format comes first in `FORMATS` wins. Pick the convention that matches your source data and commit to it. If you genuinely have both conventions in the same column, you need a secondary signal (the source system, a column header, the day value being > 12) — or to reject the row.\n",
    "\n",
    "**Keep `FORMATS` narrow.** Every format you add is another chance for a row to match the *wrong* one. Only include formats you've actually seen in the data.\n",
    "\n",
    "**Try `date.fromisoformat` first for speed.** For large columns that are mostly ISO 8601, `fromisoformat` is much faster than `strptime` — a `try`/`except` wrapper around it as the first attempt is a cheap win.\n",
    "\n",
    "**In pandas, prefer `pd.to_datetime`.** `pd.to_datetime(series, errors='coerce', format='mixed', dayfirst=True)` handles mixed formats, missing values, and British vs American conventions in one call (pandas 2.0+ / Python 3.11+). The pure-Python loop above is for when you don't want a pandas dependency or you need the per-row error-handling distinction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Related reading\n",
    "\n",
    "- [Avoid common datetime mistakes](https://agilearn.co.uk/guides/dates-and-times/recipes/avoid-common-datetime-mistakes) — especially the `dayfirst` ambiguity trap.\n",
    "- [Datetime format codes](https://agilearn.co.uk/guides/dates-and-times/reference/datetime-format-codes) — every `%Y`/`%m`/`%d`/`%b` in one place.\n",
    "- [Convert between time zones](https://agilearn.co.uk/guides/dates-and-times/recipes/convert-between-time-zones) — once the naive dates are clean, what to do about zones.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}