How do I clean and normalise messy text?
You've got text from the wild — copied from a PDF, pulled from a form, scraped from a webpage — and it's a mess. Mixed case, stray whitespace, accented characters, weird line endings. Before you can compare it, search it, store it, or display it consistently, you need to normalise it into a predictable shape.
This recipe gives you a clean_text() function you can lift straight into your code, plus the smaller building blocks (strip, casefold, Unicode NFKD, whitespace collapse) for when you need finer control.
import re
import unicodedata
def clean_text(text: str) -> str:
    """Normalise messy real-world text into a predictable shape."""
    # 1. Normalise Unicode: decompose accented characters into a base letter
    #    plus combining marks, then drop the marks, so "café" compares
    #    equal to "cafe".
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 2. Lowercase using casefold(), which is more aggressive than lower()
    #    for non-English scripts (e.g. German "ß" becomes "ss").
    text = text.casefold()
    # 3. Normalise line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 4. Collapse runs of whitespace (spaces, tabs, newlines) into one space,
    #    then strip leading/trailing whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    return text
messy = " Café\tau Lait \r\n with EXTRA cream "
print(repr(clean_text(messy)))
# 'cafe au lait with extra cream'
If you only need one of the building blocks, here are the pieces in isolation.
# Stripping whitespace — three flavours
raw = " Hello, world! \n"
print(repr(raw.strip())) # both ends
print(repr(raw.lstrip())) # left only
print(repr(raw.rstrip())) # right only
# Strip specific characters by passing them in
print("***bold***".strip("*")) # "bold"
# Case normalisation — prefer casefold() over lower() for comparisons
print("Hello".lower()) # "hello"
print("Hello".casefold()) # "hello"
# casefold() handles tricky scripts that lower() misses
german = "STRASSE"
print(german.lower() == "straße".lower()) # False — leaves ß alone
print(german.casefold() == "straße".casefold()) # True — folds ß to "ss"
# Removing accents via Unicode NFKD decomposition
import unicodedata
text = "naïve café résumé"
nfkd = unicodedata.normalize("NFKD", text)
no_accents = "".join(ch for ch in nfkd if not unicodedata.combining(ch))
print(no_accents) # "naive cafe resume"
# Collapsing whitespace runs
import re
messy = " This has too\tmany\nspaces "
print(repr(re.sub(r"\s+", " ", messy).strip()))
# 'This has too many spaces'
Why it works
Real-world text gets dirty in a few well-known ways, and each step in clean_text() targets one of them.
Unicode normalisation is the subtle one. Two strings can look identical on screen but compare unequal because one uses a precomposed character (é as a single code point) and the other uses a base letter plus a combining accent (e + ́). unicodedata.normalize("NFKD", text) decomposes everything into the base-plus-combining form; filtering out the combining characters then strips the accents. The "K" in NFKD also replaces compatibility variants, such as full-width digits and ligatures, with their plain equivalents (for example, the single-character ligature ﬁ becomes the two letters fi).
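To make the difference concrete, here is a small sketch that builds the same word two ways and shows how normalisation reconciles them. The variable names are only illustrative and are not part of the recipe.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by a combining acute accent (U+0301)

# Identical on screen, different code points underneath.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 4 5

# After NFKD both strings are in base-plus-combining form.
nfkd_a = unicodedata.normalize("NFKD", precomposed)
nfkd_b = unicodedata.normalize("NFKD", decomposed)
print(nfkd_a == nfkd_b)                   # True

# The "K" also maps compatibility variants: full-width digits and the
# "fi" ligature (U+FB01) come out as plain characters.
compat = unicodedata.normalize("NFKD", "\uff11\uff12\uff13 \ufb01le")
print(compat)                             # 123 file
```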
casefold() instead of lower() matters as soon as your text might contain non-English characters. lower() is a simple character-by-character lowercase. casefold() applies Unicode's full case-folding rules, which were designed specifically for case-insensitive comparison. The German ß is the classic example: "STRASSE".lower() gives "strasse", but "straße".lower() gives "straße" — they only become equal under casefold().
Line ending normalisation keeps you safe across operating systems. Windows uses \r\n, classic Mac used \r, Unix uses \n. Doing the \r\n replacement first avoids accidentally turning \r\n into \n\n.
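A quick way to convince yourself that the order matters is to run both orders on the same input. This is just a sketch; the sample string is made up for illustration.

```python
mixed = "line one\r\nline two\rline three\n"

# Correct order: handle the two-character \r\n before the lone \r.
right = mixed.replace("\r\n", "\n").replace("\r", "\n")

# Wrong order: the \r inside \r\n is turned into \n first,
# which leaves a doubled \n\n behind.
wrong = mixed.replace("\r", "\n").replace("\r\n", "\n")

print(repr(right))  # 'line one\nline two\nline three\n'
print(repr(wrong))  # 'line one\n\nline two\nline three\n'
```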
Collapsing whitespace with re.sub(r"\s+", " ", text) handles spaces, tabs, newlines, and any other whitespace character in one pass. The final .strip() removes any leading or trailing space the collapse left behind.
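For str patterns, \s also covers Unicode whitespace such as the non-breaking space, which is a common leftover in text copied from web pages. The sketch below shows that, along with one character that looks like it should count as whitespace but does not. The sample strings are invented for illustration.

```python
import re

# \u00a0 is a non-breaking space and \u2003 is an em space;
# both are matched by \s in a str pattern.
messy = "price:\u00a0\u00a010\u2003EUR\t\n"
collapsed = re.sub(r"\s+", " ", messy).strip()
print(repr(collapsed))  # 'price: 10 EUR'

# A zero-width space (\u200b) is not classed as whitespace,
# so it survives the collapse untouched.
sneaky = "zero\u200bwidth"
unchanged = re.sub(r"\s+", " ", sneaky)
print(unchanged == sneaky)  # True
```

If zero-width characters matter for your data, they need a separate, explicit removal step.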
Trade-offs
This pipeline is opinionated — it strips accents, lowercases everything, and collapses all whitespace. That's the right move when you're normalising for comparison or deduplication, but it's the wrong move when the original case or accents matter (display, names, anything you'll show back to a user).
If you need to preserve the original for display but compare normalised, keep both: store the original, normalise into a separate key field, and search against the key.
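One way to sketch the "keep both" approach is a dictionary keyed by the normalised form. The helper name match_key() and the sample names are hypothetical; the helper simply repeats the steps of the clean_text() recipe so the example runs on its own.

```python
import re
import unicodedata

def match_key(text: str) -> str:
    """Hypothetical helper: the same steps as the clean_text() recipe."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text.casefold()).strip()

# Keep the original for display; use the normalised key for lookups.
names = ["José García", "Zoë Müller", "Anna Schmidt"]
index = {match_key(name): name for name in names}

query = "jose  GARCIA"
found = index.get(match_key(query))
print(found)  # José García
```

In a database the same idea would be two columns: the text as entered, plus a key column that you search against.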
The accent-stripping step is destructive. "piñata" becomes "pinata", which is fine for fuzzy matching but loses information. If you only need NFC normalisation (canonical compose, no accent removal), use unicodedata.normalize("NFC", text) and skip the combining-character filter.
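As a minimal sketch of the NFC alternative, the snippet below starts from a decomposed string and recomposes it, keeping the accent. The variable names are illustrative.

```python
import unicodedata

decomposed = "pin\u0303ata"  # n followed by a combining tilde (U+0303)
composed = unicodedata.normalize("NFC", decomposed)

print(composed)                        # piñata
print(len(decomposed), len(composed))  # 7 6
print(composed == "pi\u00f1ata")       # True
```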
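Avoid building this with str.replace() chains for whitespace ("replace tab, replace newline, replace double space..."). Each replace() makes only one pass, so a run of four spaces shrinks to two rather than one, and the result depends on the order of the calls. The single regex is shorter and handles every whitespace character and every run length.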
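Here is a small sketch of how a replace() chain goes wrong, next to the regex and a split/join alternative that gives the same result. The input string is invented for illustration.

```python
import re

messy = "a    b\t\tc\n\n\nd"

# Hand-rolled chain: replace() makes a single pass, so a run of four
# spaces becomes two and a run of three also becomes two.
chained = messy.replace("\t", " ").replace("\n", " ").replace("  ", " ")

# Single regex: any run of whitespace becomes one space.
collapsed = re.sub(r"\s+", " ", messy)

# split() with no arguments splits on whitespace runs and drops empty
# pieces, so joining with one space collapses and strips in one go.
joined = " ".join(messy.split())

print(repr(chained))    # 'a  b c  d'
print(repr(collapsed))  # 'a b c d'
print(repr(joined))     # 'a b c d'
```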
Don't reach for the unidecode library unless ASCII transliteration is genuinely required (e.g. generating URL slugs from Cyrillic or Chinese). For Latin-script normalisation the standard library is enough.
Related
- How to parse structured strings: once your text is clean, pull fields out of it.
- How to avoid common string mistakes: including the immutability trap that catches people writing text.strip() and wondering why nothing changed.
- Understanding Unicode and encodings: the deeper picture behind NFKD, casefold, and why naive byte comparisons go wrong.
- String methods reference: the full menu of str methods, with what each one returns.