Parse a messy date column¶
The question. You've been handed a CSV column full of dates that use three or four formats interchangeably — ISO, DD/MM/YYYY, 21-Apr-2026, April 21, 2026 — with empty strings, "N/A", and the occasional typo sprinkled through. You want a single cleaning pass that produces date objects for the valid rows and None (or a raised error) for the rest.
The canonical shape is a multi-format parser: try each format in order, return the first that matches, return None if none does. Then layer on sentinel-handling on top. Ten lines total.
from datetime import datetime, date
# Deliberately awful input to clean
raw = [
'2026-04-21',
'21/04/2026',
'21-Apr-2026',
'April 21, 2026',
'2026/04/21',
'', # missing
'N/A', # sentinel
'not a date', # junk
]
FORMATS = (
'%Y-%m-%d',
'%d/%m/%Y',
'%d-%b-%Y',
'%B %d, %Y',
'%Y/%m/%d',
)
MISSING = {'', 'N/A', 'n/a', 'null', 'NULL'}
def parse_date(s: str | None) -> date | None:
"""Return a date, None for explicit-missing, or raise for unparseable."""
if s is None or s.strip() in MISSING:
return None
for fmt in FORMATS:
try:
return datetime.strptime(s.strip(), fmt).date()
except ValueError:
continue
raise ValueError(f'unparseable date: {s!r}')
for s in raw:
try:
print(f'{s!r:20} -> {parse_date(s)}')
except ValueError as e:
print(f'{s!r:20} -> ERROR: {e}')
# Variant: pandas pd.to_datetime — handles most of this for you
# import pandas as pd
#
# series = pd.Series(raw)
# parsed = pd.to_datetime(series, errors='coerce', format='mixed', dayfirst=True)
# print(parsed)
#
# format='mixed' — let each row pick its own format (Python 3.11+ / pandas 2.0+)
# dayfirst=True — resolve DD/MM vs MM/DD in favour of British notation
# errors='coerce' — unparseable rows become NaT instead of raising
# .isna() — count how many rows failed, investigate by hand
# Skipped in execution because pandas isn't imported by default in this
# environment — uncomment the lines above when running locally.
print('(pandas variant shown as reference; uncomment to run)')
Why it works¶
The multi-format loop is the Python equivalent of "try the formats I've seen, one after another, and take the first that matches". strptime raises ValueError on a mismatch, so wrapping it in try/except and continue-ing makes the rest of the format list act as fallbacks. That's better than trying to write one giant regex: each format is its own line, new formats slot in without touching the others, and the order of the tuple controls precedence (which matters when DD/MM and MM/DD would both match).
The sentinel set distinguishes three outcomes instead of two: parsed, known missing (empty, N/A), and error (junk). That split is what you want downstream — None collapses cleanly into pandas NaT or SQL NULL, while ValueError makes the bad rows noisy so you can investigate rather than silently losing them.
Trade-offs¶
Ambiguity can't be guessed. "04/05/2026" is 4 May in British notation, 5 April in American. strptime can't tell you; whichever format comes first in FORMATS wins. Pick the convention that matches your source data and commit to it. If you genuinely have both conventions in the same column, you need a secondary signal (the source system, a column header, the day value being > 12) — or to reject the row.
Keep FORMATS narrow. Every format you add is another chance for a row to match the wrong one. Only include formats you've actually seen in the data.
Try date.fromisoformat first for speed. For large columns that are mostly ISO 8601, fromisoformat is much faster than strptime — a try/except wrapper around it as the first attempt is a cheap win.
In pandas, prefer pd.to_datetime. pd.to_datetime(series, errors='coerce', format='mixed', dayfirst=True) handles mixed formats, missing values, and British vs American conventions in one call (pandas 2.0+ / Python 3.11+). The pure-Python loop above is for when you don't want a pandas dependency or you need the per-row error-handling distinction.
Related reading¶
- Avoid common datetime mistakes — especially the
dayfirstambiguity trap. - Datetime format codes — every
%Y/%m/%d/%bin one place. - Convert between time zones — once the naive dates are clean, what to do about zones.