Extract data from text
The question. You've got a lump of unstructured text — a log file, a report, a page of form submissions — and you want to pull the dates, phone numbers, or key-value pairs out of it as proper Python objects, without writing a bespoke parser.
The technique that does nearly all the work is named groups ((?P<name>...)) with re.finditer and .groupdict(). Each match comes back as a dictionary you can drop straight into a DataFrame, a JSON blob, or anywhere else you'd want structured data.
import re
DATE_PATTERN = re.compile(
r'(?P<day>0[1-9]|[12]\d|3[01])'
r'/'
r'(?P<month>0[1-9]|1[0-2])'
r'/'
r'(?P<year>\d{4})'
)
KV_PATTERN = re.compile(
r'(?P<key>\w+)\s*[:=]\s*(?P<value>.+?)\s*$',
re.MULTILINE,
)
def extract(pattern: re.Pattern, text: str) -> list[dict[str, str]]:
    """Return a list of named-group dicts for every match in text."""
    return [m.groupdict() for m in pattern.finditer(text)]
# Pull dates out of a sentence.
sentence = 'Events on 25/12/2026, 01/01/2027, and 14/02/2027'
print('Dates:', extract(DATE_PATTERN, sentence))
# Pull key/value pairs out of a config-style block (note: either
# `key: value` or `key = value` works thanks to the `[:=]` class).
config = '''
name: Alice Smith
email: alice@example.com
role = admin
department: Engineering
'''
print('Config:', extract(KV_PATTERN, config))
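Because each match is already a plain dict of strings, handing the results to another tool takes very little glue. The sketch below assumes a shortened two-line config and shows one possible way to collapse the per-match dicts into a single mapping and serialize it as JSON; the names `records`, `settings`, and `as_json` are illustrative only.

```python
import json
import re

KV_PATTERN = re.compile(
    r'(?P<key>\w+)\s*[:=]\s*(?P<value>.+?)\s*$',
    re.MULTILINE,
)

# A shortened sample block, assumed for illustration.
config = 'name: Alice Smith\nrole = admin\n'

# One dict per match: [{'key': ..., 'value': ...}, ...]
records = [m.groupdict() for m in KV_PATTERN.finditer(config)]

# Collapse the per-match dicts into a single key -> value mapping.
settings = {r['key']: r['value'] for r in records}

as_json = json.dumps(settings)
print(as_json)
```

If the same key can appear more than once, the dict comprehension keeps only the last value, so the list of records may be the better shape in that case.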
Why it works
re.finditer returns an iterator that yields one Match object per match, lazily: the text is scanned only as you iterate, so no intermediate list is built. Each Match has a .groupdict() method that returns the named captures as a Python dict, which is almost always the shape you want to work with downstream.
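A small sketch of that behaviour, using a throwaway pattern and sample string that are assumptions for illustration rather than part of the recipe: the iterator hands over one Match at a time, and each Match exposes both the named groups and the position of the match.

```python
import re
from collections.abc import Iterator

# Hypothetical pattern and input, chosen only to keep the example short.
WORD_NUM = re.compile(r'(?P<word>[a-z]+)=(?P<num>\d+)')
text = 'a=1 b=22 c=333'

matches = WORD_NUM.finditer(text)

# finditer gives back an iterator, not a list.
is_lazy = isinstance(matches, Iterator) and not isinstance(matches, list)

# Pull a single match; the rest of the text has not been consumed yet.
first = next(matches)
first_dict = first.groupdict()
first_span = first.span()

# Draining the iterator yields the remaining matches.
rest = [m.groupdict() for m in matches]

print(is_lazy, first_dict, first_span, rest)
```

The `extract` helper above trades that laziness for convenience by building a list; iterating over `pattern.finditer(text)` directly keeps memory use flat.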
The date pattern is worth reading slowly. The day and month groups use alternation (|) between short branches, each built from literals and character classes, to enumerate only the valid ranges: 0[1-9]|[12]\d|3[01] matches 01–31 but rejects 00, 32, and 99. That's more work than \d{2}, but it means an out-of-range day or month simply doesn't match, so the regex does a first pass of validation for you. It has no idea how long each month is, though: 31/02/2027 still matches.
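To see where that first pass of validation stops, the sketch below runs the same pattern against a handful of hand-picked strings. The sample values and the `results` and `embedded` names are assumptions for illustration.

```python
import re

DATE_PATTERN = re.compile(
    r'(?P<day>0[1-9]|[12]\d|3[01])'
    r'/'
    r'(?P<month>0[1-9]|1[0-2])'
    r'/'
    r'(?P<year>\d{4})'
)

samples = [
    '25/12/2026',  # ordinary date
    '00/12/2026',  # day below range
    '32/12/2026',  # day above range
    '25/13/2026',  # month above range
    '31/02/2027',  # in range field by field, but not a real date
]

# fullmatch requires the whole string to be a date, nothing more.
results = {s: DATE_PATTERN.fullmatch(s) is not None for s in samples}

# The pattern has no boundaries, so search can match inside a longer number.
embedded = DATE_PATTERN.search('131/12/2026').group()

print(results)
print(embedded)
```

If stray digits around a date are a concern in your data, one option is to wrap the pattern in lookarounds such as `(?<!\d)` and `(?!\d)`; whether that is needed depends on the input.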
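The key/value pattern leans on re.MULTILINE so the $ anchor matches at the end of every line (just before each newline) rather than only at the very end of the string. Without it, the pattern would only find the last line of the block, because .+? cannot cross a newline to reach the end of the string.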
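The effect of the flag is easy to check by running the same pattern source twice. This is a minimal sketch on an assumed three-line block; `PATTERN_SRC`, `with_flag`, and `without_flag` are illustrative names.

```python
import re

PATTERN_SRC = r'(?P<key>\w+)\s*[:=]\s*(?P<value>.+?)\s*$'

# Assumed sample block with a trailing newline.
config = 'name: Alice Smith\nrole = admin\ndepartment: Engineering\n'

# With MULTILINE, $ can match at the end of each line.
with_flag = [m['key'] for m in re.finditer(PATTERN_SRC, config, re.MULTILINE)]

# Without it, $ only matches at the end of the string, so only the
# final line can satisfy the pattern.
without_flag = [m['key'] for m in re.finditer(PATTERN_SRC, config)]

print(with_flag)
print(without_flag)
```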
Trade-offs and when not to use this
- If the input is really a format (JSON, CSV, HTML, YAML, TOML), use a proper parser. Regex will get you 80% there and then fall over on an edge case your sample didn't include. The standard library has json, csv, html.parser, and tomllib; reach for them first.
- Phone numbers, URLs, postcodes: adapt, don't copy. Every country's phone format has its own rules; URL patterns have to decide whether to allow queries, fragments, and ports. Take the named-group structure above and tune the character classes to the format you actually see in your data.
- Validate, then extract. If the extracted pieces need to satisfy more than "looks roughly right" (e.g. dates must be real calendar dates), pass them through datetime.strptime or similar after the regex match succeeds.
- Very large files belong in re.finditer with streaming, not in re.findall, which builds the full list up front. See the Use regex with file I/O recipe for the streaming pattern.
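The "validate, then extract" advice can be sketched as a second pass over the regex matches. The helper name `extract_real_dates` and the sample sentence are assumptions for illustration; the idea is simply to let datetime.strptime reject anything the calendar doesn't allow.

```python
import re
from datetime import date, datetime

DATE_PATTERN = re.compile(
    r'(?P<day>0[1-9]|[12]\d|3[01])'
    r'/'
    r'(?P<month>0[1-9]|1[0-2])'
    r'/'
    r'(?P<year>\d{4})'
)


def extract_real_dates(text: str) -> list[date]:
    """Return calendar-valid dates found in text, skipping impossible ones."""
    found = []
    for m in DATE_PATTERN.finditer(text):
        try:
            found.append(datetime.strptime(m.group(), '%d/%m/%Y').date())
        except ValueError:
            # The regex matched, but the calendar disagrees (e.g. 31/02).
            continue
    return found


dates = extract_real_dates('Due 31/02/2027, moved to 28/02/2027 and 29/02/2028.')
print(dates)
```

Silently skipping bad dates suits exploratory work; for data you need to audit, collecting the rejected strings alongside the valid ones may be the safer choice.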
Related
- Learn: Groups and capturing for a ground-up look at named groups and .groupdict().
- Recipe: Use regex with file I/O for the streaming version of the same technique over log files.
- Reference: re module quick reference and Regex flags for the MULTILINE flag used here.
- Concepts: Understanding the regex engine for what finditer actually does under the hood.