{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How do I clean and normalise messy text?\n",
    "\n",
    "You've got text from the wild — copied from a PDF, pulled from a form, scraped from a webpage — and it's a mess. Mixed case, stray whitespace, accented characters, weird line endings. Before you can compare it, search it, store it, or display it consistently, you need to normalise it into a predictable shape.\n",
    "\n",
    "This recipe gives you a `clean_text()` function you can lift straight into your code, plus the smaller building blocks (`strip`, `casefold`, Unicode NFKD, whitespace collapse) for when you need finer control."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import unicodedata\n",
    "\n",
    "\n",
    "def clean_text(text: str) -> str:\n",
    "    \"\"\"Normalise messy real-world text into a predictable shape.\"\"\"\n",
    "    # 1. Normalise Unicode (compose accents into a canonical form,\n",
    "    #    then strip them) — keeps \"café\" comparable to \"cafe\".\n",
    "    text = unicodedata.normalize(\"NFKD\", text)\n",
    "    text = \"\".join(ch for ch in text if not unicodedata.combining(ch))\n",
    "\n",
    "    # 2. Lowercase using casefold() — stronger than lower() for non-English\n",
    "    #    scripts (e.g. German \"ß\" → \"ss\").\n",
    "    text = text.casefold()\n",
    "\n",
    "    # 3. Normalise line endings to \\n.\n",
    "    text = text.replace(\"\\r\\n\", \"\\n\").replace(\"\\r\", \"\\n\")\n",
    "\n",
    "    # 4. Collapse runs of whitespace (spaces, tabs, newlines) into one space,\n",
    "    #    then strip leading/trailing whitespace.\n",
    "    text = re.sub(r\"\\s+\", \" \", text).strip()\n",
    "\n",
    "    return text\n",
    "\n",
    "\n",
    "messy = \"  Café\\tau Lait  \\r\\n  with  EXTRA  cream  \"\n",
    "print(repr(clean_text(messy)))\n",
    "# 'cafe au lait with extra cream'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you only need one of the building blocks, here are the pieces in isolation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Stripping whitespace — three flavours\n",
    "raw = \"   Hello, world!   \\n\"\n",
    "print(repr(raw.strip()))   # both ends\n",
    "print(repr(raw.lstrip()))  # left only\n",
    "print(repr(raw.rstrip()))  # right only\n",
    "\n",
    "# Strip specific characters by passing them in\n",
    "print(\"***bold***\".strip(\"*\"))  # \"bold\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Case normalisation — prefer casefold() over lower() for comparisons\n",
    "print(\"Hello\".lower())     # \"hello\"\n",
    "print(\"Hello\".casefold())  # \"hello\"\n",
    "\n",
    "# casefold() handles tricky scripts that lower() misses\n",
    "german = \"STRASSE\"\n",
    "print(german.lower() == \"straße\".lower())        # False — leaves ß alone\n",
    "print(german.casefold() == \"straße\".casefold())  # True  — folds ß to \"ss\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Removing accents via Unicode NFKD decomposition\n",
    "import unicodedata\n",
    "\n",
    "text = \"naïve café résumé\"\n",
    "nfkd = unicodedata.normalize(\"NFKD\", text)\n",
    "no_accents = \"\".join(ch for ch in nfkd if not unicodedata.combining(ch))\n",
    "print(no_accents)  # \"naive cafe resume\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collapsing whitespace runs\n",
    "import re\n",
    "\n",
    "messy = \"  This   has    too\\tmany\\nspaces  \"\n",
    "print(repr(re.sub(r\"\\s+\", \" \", messy).strip()))\n",
    "# 'This has too many spaces'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why it works\n",
    "\n",
    "Real-world text gets dirty in a few well-known ways, and each step in `clean_text()` targets one of them.\n",
    "\n",
    "**Unicode normalisation** is the subtle one. Two strings can look identical on screen but compare unequal because one uses a precomposed character (`é` as a single code point) and the other uses a base letter plus a combining accent (`e` + `́`). `unicodedata.normalize(\"NFKD\", text)` decomposes everything into the base-plus-combining form, then filtering out the combining characters strips the accents. The \"K\" in NFKD also normalises compatibility variants — full-width digits, ligatures, and the like — into their plain ASCII equivalents.\n",
    "\n",
    "**`casefold()` instead of `lower()`** matters as soon as your text might contain non-English characters. `lower()` is a simple character-by-character lowercase. `casefold()` applies Unicode's full case-folding rules, which were designed specifically for case-insensitive comparison. The German ß is the classic example: `\"STRASSE\".lower()` gives `\"strasse\"`, but `\"straße\".lower()` gives `\"straße\"` — they only become equal under `casefold()`.\n",
    "\n",
    "**Line ending normalisation** keeps you safe across operating systems. Windows uses `\\r\\n`, classic Mac used `\\r`, Unix uses `\\n`. Doing the `\\r\\n` replacement first avoids accidentally turning `\\r\\n` into `\\n\\n`.\n",
    "\n",
    "**Collapsing whitespace** with `re.sub(r\"\\s+\", \" \", text)` handles spaces, tabs, newlines, and any other whitespace character in one pass. The final `.strip()` removes any leading or trailing space the collapse left behind."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trade-offs\n",
    "\n",
    "This pipeline is opinionated — it strips accents, lowercases everything, and collapses all whitespace. That's the right move when you're normalising for comparison or deduplication, but it's the wrong move when the original case or accents matter (display, names, anything you'll show back to a user).\n",
    "\n",
    "If you need to preserve the original for display but compare normalised, keep both: store the original, normalise into a separate `key` field, and search against the key.\n",
    "\n",
    "The accent-stripping step is destructive. `\"piñata\"` becomes `\"pinata\"`, which is fine for fuzzy matching but loses information. If you only need NFC normalisation (canonical compose, no accent removal), use `unicodedata.normalize(\"NFC\", text)` and skip the combining-character filter.\n",
    "\n",
    "Avoid building this with `str.replace()` chains for whitespace (\"replace tab, replace newline, replace double space...\"). It's verbose and gets the order wrong on edge cases. The single regex is clearer and faster.\n",
    "\n",
    "Don't reach for the `unidecode` library unless ASCII transliteration is genuinely required (e.g. generating URL slugs from Cyrillic or Chinese). For Latin-script normalisation the standard library is enough."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Related\n",
    "\n",
    "- [How to parse structured strings](https://agilearn.co.uk/guides/string-processing/recipes/parse-structured-strings) — once your text is clean, pull fields out of it.\n",
    "- [How to avoid common string mistakes](https://agilearn.co.uk/guides/string-processing/recipes/avoid-common-string-mistakes) — including the immutability trap that catches people writing `text.strip()` and wondering why nothing changed.\n",
    "- [Understanding Unicode and encodings](https://agilearn.co.uk/guides/string-processing/concepts/understanding-unicode-and-encodings) — the deeper picture behind NFKD, casefold, and why naive byte comparisons go wrong.\n",
    "- [String methods reference](https://agilearn.co.uk/guides/string-processing/reference/string-methods-reference) — the full menu of `str` methods, with what each one returns."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}