Validate email addresses¶
The question. You've been handed a string — from a form, a CSV, an API payload — and before you do anything else with it you need a cheap sanity check: does it at least look like an email address?
The pattern below is strict enough to catch obvious rubbish and loose enough to wave through the shapes real users actually type. Edit the sample list, press run, and see where it lines up with your own expectations.
import re

EMAIL_PATTERN = re.compile(
    r'^[a-zA-Z0-9._%+-]+'  # local part
    r'@'
    r'[a-zA-Z0-9.-]+'      # domain
    r'\.[a-zA-Z]{2,}$'     # top-level domain
)

def is_valid_email(value: str) -> bool:
    """Return True if value looks like an email address."""
    return bool(EMAIL_PATTERN.match(value))

# Try it on a mix of good and bad input.
samples = [
    'alice@example.com',
    'bob.smith@test.co.uk',
    'user+tag@example.org',
    'first-last@company.com',
    'not-an-email',
    '@missing-local.com',
    'missing-domain@',
    'spaces in@email.com',
    'valid@sub.domain.example.com',
]

for s in samples:
    print(f'{s:>35} -> {"ok" if is_valid_email(s) else "no"}')
Why it works¶
The pattern describes the four parts of an email address in the order they appear:
- ^[a-zA-Z0-9._%+-]+ is the local part (everything before the @). It allows letters, digits, and the punctuation most people actually use: dot, underscore, percent, plus, and hyphen.
- @ is the literal separator. It has to be there, and only once: neither character class contains @, so a second one has nowhere to match.
- [a-zA-Z0-9.-]+ is the domain. The same idea as the local part, but domains don't use underscores, plus signs, or percent symbols.
- \.[a-zA-Z]{2,}$ is a dot followed by at least two letters at the end. This catches the final label of the domain (com, org, or the uk in co.uk, where the co is consumed by the domain class) and, with the $ anchor, stops someone smuggling extra junk on after it.
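To see which piece of the pattern consumes which piece of an address, here is a small sketch that wraps each part in a named group. The group names (local, domain, tld) and the PARTS variable are illustrative assumptions, not part of the pattern above.

```python
import re

# Same character classes as EMAIL_PATTERN, with a named group around each part.
# The group names are hypothetical labels chosen for this illustration.
PARTS = re.compile(
    r'^(?P<local>[a-zA-Z0-9._%+-]+)'
    r'@'
    r'(?P<domain>[a-zA-Z0-9.-]+)'
    r'\.(?P<tld>[a-zA-Z]{2,})$'
)

match = PARTS.match('bob.smith@test.co.uk')
parts = match.groupdict()
print(parts)
# {'local': 'bob.smith', 'domain': 'test.co', 'tld': 'uk'}
```

The domain class is greedy, so it first swallows test.co.uk and then backs off just far enough for the final dot-plus-letters part to match. That is why only uk lands in the last group.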
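The $ anchor is what turns this into a validator rather than a finder. re.match already pins the match to the start of the string, so ^ is redundant here (it starts to matter if you switch to re.search), but without $, re.match would happily accept 'alice@example.com please reply': it would match the prefix and ignore the rest. Together the anchors force the regex engine to consume the whole string or fail, with one exception: $ also matches just before a single trailing newline, so 'alice@example.com\n' passes. If that matters for your input, call EMAIL_PATTERN.fullmatch(value) or end the pattern with \Z instead of $.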
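The effect of the anchors is easy to check directly. The sketch below compiles the same pattern body with and without them; the names BODY, UNANCHORED, and ANCHORED are assumptions made for this example. It also shows a detail of Python's re module that is easy to miss: $ matches just before a single trailing newline, whereas fullmatch does not.

```python
import re

# Same body as the pattern above, compiled with and without the ^ and $ anchors.
BODY = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
UNANCHORED = re.compile(BODY)
ANCHORED = re.compile('^' + BODY + '$')

trailing_text = 'alice@example.com please reply'
trailing_newline = 'alice@example.com\n'

# Without $, match() stops after the address and ignores the rest.
prefix_accepted = bool(UNANCHORED.match(trailing_text))           # True
# With $, the trailing words make the whole match fail.
anchored_rejects = ANCHORED.match(trailing_text) is None          # True
# $ still matches just before a single trailing newline.
newline_slips_through = bool(ANCHORED.match(trailing_newline))    # True
# fullmatch() insists on consuming every character, newline included.
fullmatch_rejects = ANCHORED.fullmatch(trailing_newline) is None  # True
```

If input may arrive with a trailing newline (lines read from a file, for instance), either strip it first or prefer fullmatch.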
Trade-offs and when not to use this¶
- The full specification (RFC 5322) is not practical to express as a regex. A truly compliant pattern runs to thousands of characters, permits quoted local parts, IP-literal domains, and comments mid-address, and is very hard to maintain. Don't chase it.
- A regex can't tell you whether an address exists. It's a shape check, not a delivery check. If you care whether the address actually works, send a one-time confirmation email and wait for the user to click the link.
- Drop the anchors if you're extracting, not validating. To pull every email out of a larger document, use the same pattern with ^ and $ removed, then call .findall(text) on it.
- Reach for re.VERBOSE once the pattern grows. Any regex with more than a handful of parts earns its keep spread across multiple lines with inline comments; future you will thank you.
- Internationalised addresses fall outside this pattern. Addresses containing non-ASCII characters (for example 用户@例子.公司) are valid under the internationalised email specs (RFC 6531 and RFC 6532), but the character classes above reject them. If you need to accept them, either broaden the character classes or let a dedicated library (such as email-validator) handle it.
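Two of the points above, dropping the anchors for extraction and using re.VERBOSE, can be sketched together. This is one possible layout rather than the only one; EMAIL_FINDER and the sample text are assumptions made for the example.

```python
import re

# The same four parts as the validator, without ^ and $, written with
# re.VERBOSE so whitespace is ignored and each part carries its own comment.
EMAIL_FINDER = re.compile(
    r"""
    [a-zA-Z0-9._%+-]+   # local part
    @                   # literal separator
    [a-zA-Z0-9.-]+      # domain
    \.[a-zA-Z]{2,}      # final dot plus top-level domain
    """,
    re.VERBOSE,
)

text = 'Contact alice@example.com or bob.smith@test.co.uk, not bob@localhost.'
found = EMAIL_FINDER.findall(text)
print(found)
# ['alice@example.com', 'bob.smith@test.co.uk']
```

Because the pattern has no capturing groups, findall returns each whole match as a string. bob@localhost is skipped because nothing after the @ ends in a dot followed by at least two letters.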
Related¶
- Learn: Character classes and quantifiers explains the [a-zA-Z0-9._%+-]+ building blocks the pattern is made of.
- Learn: Groups and capturing for when you need to pull the local part and the domain out separately.
- Reference: re module quick reference and Regex syntax for a look-up of every metacharacter the pattern uses.
- Concepts: Why regular expressions for a longer discussion of when regex is the right tool, and when it isn't.