Regex syntax reference¶
This reference covers the core regular expression syntax supported by Python's re module. Use it as a lookup resource when building patterns.
For the official documentation, see the Python Regular Expression HOWTO and the re module documentation.
Literal characters¶
Most characters match themselves literally. For example, the pattern hello matches the text hello.
The following characters have special meanings and must be escaped with a backslash (\) to match them literally:
. ^ $ * + ? { } [ ] \ | ( )
import re
re.search(r'3\.14', '3.14') # Matches literal dot
re.search(r'\$100', '$100') # Matches literal dollar sign
re.search(r'file\(1\)', 'file(1)') # Matches literal parentheses
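When the text to match comes from elsewhere, escaping each metacharacter by hand is error-prone. The sketch below uses re.escape() from the standard library to do it automatically; the variable names and sample strings are illustrative assumptions.

```python
import re

# Hypothetical text that should be matched literally, metacharacters included.
user_text = 'file(1).txt'

# re.escape() adds a backslash in front of the characters that are special.
safe_pattern = re.escape(user_text)

# The escaped pattern only matches the literal text.
literal_hit = re.search(safe_pattern, 'see file(1).txt for details')
literal_miss = re.search(safe_pattern, 'see file1xtxt for details')

# Unescaped, the parentheses form a group and the dot matches any character.
accidental_hit = re.search(user_text, 'see file1xtxt for details')

print(literal_hit.group())     # file(1).txt
print(literal_miss)            # None
print(accidental_hit.group())  # file1xtxt
```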
Metacharacters¶
The dot (.)¶
| Pattern | Matches |
|---|---|
| `.` | Any single character except a newline (unless re.DOTALL is set) |
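A minimal sketch of the newline behaviour described in the table, using an assumed two-line sample string:

```python
import re

text = 'line one\nline two'

# By default the dot does not match the newline, so there is no match.
default_match = re.search(r'one.line', text)

# With re.DOTALL the dot matches the newline as well.
dotall_match = re.search(r'one.line', text, re.DOTALL)

print(default_match)               # None
print(repr(dotall_match.group()))  # 'one\nline'
```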
Anchors¶
Anchors match positions, not characters.
| Pattern | Matches |
|---|---|
| `^` | Start of the string (or start of each line with re.MULTILINE) |
| `$` | End of the string, or just before a newline at the very end of the string (or end of each line with re.MULTILINE) |
| `\b` | Word boundary (between a \w and a \W character, or between a \w character and the start/end of the string) |
| `\B` | Non-word boundary |
| `\A` | Start of the string (not affected by re.MULTILINE) |
| `\Z` | End of the string only (not affected by re.MULTILINE) |
import re
re.search(r'^Hello', 'Hello world') # Matches at start
re.search(r'world$', 'Hello world') # Matches at end
re.findall(r'\bcat\b', 'cat concatenate') # ['cat']
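The difference between ^ / $ and \A / \Z only shows up with multi-line input. The following sketch uses an assumed sample string that ends in a newline:

```python
import re

text = 'first line\nsecond line\n'

# ^ matches only at the start of the string unless re.MULTILINE is set.
start_default = re.findall(r'^\w+', text)                  # ['first']
start_multiline = re.findall(r'^\w+', text, re.MULTILINE)  # ['first', 'second']

# \A ignores re.MULTILINE and still matches only at the very start.
start_absolute = re.findall(r'\A\w+', text, re.MULTILINE)  # ['first']

# $ also matches just before a trailing newline; \Z does not.
dollar_end = re.search(r'line$', text)   # matches the final 'line'
strict_end = re.search(r'line\Z', text)  # None
```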
Character classes¶
Character classes match a single character from a defined set.
Custom character classes¶
| Pattern | Matches |
|---|---|
| `[abc]` | Any one of a, b, or c |
| `[a-z]` | Any lowercase ASCII letter |
| `[A-Z]` | Any uppercase ASCII letter |
| `[0-9]` | Any ASCII digit |
| `[a-zA-Z0-9]` | Any ASCII letter or digit |
| `[^abc]` | Any character except a, b, or c |
| `[^0-9]` | Any character that is not an ASCII digit |
Special rules inside character classes:
- Most metacharacters lose their special meaning inside [...]
- The caret ^ has special meaning only at the start: [^abc]
- The hyphen - indicates a range, except at the start or end: [-abc] or [abc-]
- The closing bracket ] must come first, or be escaped, to be included literally: []abc] or [abc\]]
- The backslash \ still works as an escape character
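The rules above can be checked with a few short calls. This is a small sketch with made-up sample strings:

```python
import re

# Metacharacters such as . $ * are literal inside a class.
literal_meta = re.findall(r'[.$*]', 'a.b$c*d')   # ['.', '$', '*']

# A caret that is not first is an ordinary character.
literal_caret = re.findall(r'[a^]', 'a^b')       # ['a', '^']

# A hyphen at the start (or end) is literal, not a range.
literal_hyphen = re.findall(r'[-+]', '1-2+3')    # ['-', '+']

# A closing bracket placed first is literal.
bracket_first = re.findall(r'[]x]', 'x]y')       # ['x', ']']

# The backslash still escapes: this class matches ] or a backslash.
escaped = re.findall(r'[\]\\]', r'a]b\c')        # [']', '\\']
```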
Shorthand character classes¶
| Pattern | Equivalent | Matches |
|---|---|---|
| `\d` | `[0-9]` | Any digit |
| `\D` | `[^0-9]` | Any non-digit |
| `\w` | `[a-zA-Z0-9_]` | Any word character (letter, digit, or underscore) |
| `\W` | `[^a-zA-Z0-9_]` | Any non-word character |
| `\s` | `[ \t\n\r\f\v]` | Any whitespace character |
| `\S` | `[^ \t\n\r\f\v]` | Any non-whitespace character |
Note
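For str patterns, Unicode matching is the default in Python 3 (the re.UNICODE flag is redundant), so \d, \w, and \s also match their Unicode equivalents and the bracket expressions in the table are only exact in ASCII mode. Use the re.ASCII flag to restrict them to ASCII characters only.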
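A short sketch of the difference the re.ASCII flag makes, using an assumed sample string with accented letters (written with \u escapes so the example does not depend on file encoding):

```python
import re

# 'café naïve 42' written with escapes for the accented letters.
text = 'caf\u00e9 na\u00efve 42'

# Default (Unicode) matching: accented letters count as word characters.
unicode_words = re.findall(r'\w+', text)
# ['café', 'naïve', '42']

# ASCII-only matching: accented letters split the words.
ascii_words = re.findall(r'\w+', text, re.ASCII)
# ['caf', 'na', 've', '42']
```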
Quantifiers¶
Quantifiers control how many times the preceding element is matched.
Greedy quantifiers¶
Greedy quantifiers match as much text as possible.
| Pattern | Matches |
|---|---|
| `*` | Zero or more times |
| `+` | One or more times |
| `?` | Zero or one time |
| `{n}` | Exactly n times |
| `{n,}` | At least n times |
| `{n,m}` | Between n and m times (inclusive) |
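As a quick illustration of each greedy quantifier in the table, here is a sketch with assumed sample inputs:

```python
import re

# * allows zero repetitions, + requires at least one.
star = re.findall(r'ab*', 'a ab abbb')             # ['a', 'ab', 'abbb']
plus = re.findall(r'ab+', 'a ab abbb')             # ['ab', 'abbb']

# ? makes the preceding element optional.
optional = re.findall(r'colou?r', 'color colour')  # ['color', 'colour']

# {n} needs exactly n repetitions; {n,m} takes as many as it can up to m.
exact = re.findall(r'\d{3}', '12 345 6789')        # ['345', '678']
ranged = re.findall(r'\d{2,3}', '1 12 123 1234')   # ['12', '123', '123']
```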
Lazy quantifiers¶
Lazy quantifiers match as little text as possible. They are created by appending ? to a greedy quantifier.
| Pattern | Matches |
|---|---|
| `*?` | Zero or more times (lazy) |
| `+?` | One or more times (lazy) |
| `??` | Zero or one time (lazy) |
| `{n,}?` | At least n times (lazy) |
| `{n,m}?` | Between n and m times (lazy) |
import re
text = '<b>bold</b>'
re.search(r'<.+>', text).group() # '<b>bold</b>' (greedy)
re.search(r'<.+?>', text).group() # '<b>' (lazy)
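A negated character class is a common alternative to a lazy quantifier when the closing delimiter is a single character. The sketch below, on an assumed sample string, shows that both approaches find the same tags:

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Lazy quantifier: stop at the first '>' that allows a match.
lazy_tags = re.findall(r'<.+?>', text)

# Negated class: '>' can never be part of the tag body.
negated_tags = re.findall(r'<[^>]+>', text)

print(lazy_tags)     # ['<b>', '</b>', '<i>', '</i>']
print(negated_tags)  # ['<b>', '</b>', '<i>', '</i>']
```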
Groups¶
Capturing groups¶
| Pattern | Description |
|---|---|
| `(...)` | Create a capturing group. The matched text is accessible through .group(n). |
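A minimal sketch of reading numbered groups from a match object; the address and pattern are illustrative only and not a real email validator:

```python
import re

match = re.search(r'(\w+)@(\w+)\.com', 'contact: alice@example.com')

whole = match.group(0)       # 'alice@example.com' (the entire match)
user = match.group(1)        # 'alice'
domain = match.group(2)      # 'example'
all_groups = match.groups()  # ('alice', 'example')
```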
Named groups¶
| Pattern | Description |
|---|---|
| `(?P<name>...)` | Create a named capturing group. Accessible through .group('name') or .groupdict(). |
| `(?P=name)` | Backreference to a named group within the same pattern. |
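Named groups can be read by name, by number, or all at once. The following sketch assumes an ISO-style date embedded in a sample sentence:

```python
import re

match = re.search(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    'Released on 2024-03-15',
)

year = match.group('year')   # '2024'
month = match.group(2)       # '03' (named groups are still numbered)
parts = match.groupdict()    # {'year': '2024', 'month': '03', 'day': '15'}
```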
Non-capturing groups¶
| Pattern | Description |
|---|---|
| `(?:...)` | Group without capturing. Useful for applying quantifiers to a group. |
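The choice between capturing and non-capturing groups matters with re.findall(), which returns the group contents when the pattern has a capturing group. A small sketch:

```python
import re

# Capturing group: findall returns the last text captured by group 1.
captured = re.findall(r'(ab)+', 'ababab')        # ['ab']

# Non-capturing group: findall returns the whole match.
not_captured = re.findall(r'(?:ab)+', 'ababab')  # ['ababab']
```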
Backreferences¶
| Pattern | Description |
|---|---|
| `\1`, `\2`, and so on | Match the same text as the corresponding numbered group. |
| `(?P=name)` | Match the same text as the named group name. |
import re
# Backreference: match repeated words
re.search(r'\b(\w+)\s+\1\b', 'the the cat').group()
# 'the the'
# Named backreference
re.search(r'(?P<word>\w+)\s+(?P=word)', 'the the cat').group()
# 'the the'
Alternation¶
| Pattern | Description |
|---|---|
| `a\|b` | Match either a or b. Alternation has the lowest precedence of all operators. |
import re
re.findall(r'cat|dog', 'I have a cat and a dog')
# ['cat', 'dog']
# Use groups to limit the scope of alternation
re.findall(r'col(?:ou|o)r', 'colour and color')
# ['colour', 'color']
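Because alternation has the lowest precedence, anchors bind to a single branch unless the alternatives are grouped. The sketch below also shows that branches are tried from left to right; the sample strings are assumptions.

```python
import re

# Parsed as (^cat) | (dog$): the first branch matches the start of 'catalog'.
loose = re.search(r'^cat|dog$', 'catalog')

# Grouping applies both anchors to every alternative.
strict = re.search(r'^(?:cat|dog)$', 'catalog')

# The first branch that succeeds wins, even if a later one is longer.
first_wins = re.search(r'a|ab', 'ab').group()

print(loose.group())  # cat
print(strict)         # None
print(first_wins)     # a
```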
Lookahead and lookbehind¶
Lookahead and lookbehind assertions match a position without consuming characters. They are sometimes called zero-width assertions.
Lookahead¶
| Pattern | Description |
|---|---|
| `(?=...)` | Positive lookahead: matches if ... matches next, without consuming. |
| `(?!...)` | Negative lookahead: matches if ... does not match next. |
import re
# Positive lookahead: find words followed by a colon
re.findall(r'\w+(?=:)', 'name: Alice age: 30')
# ['name', 'age']
# Negative lookahead: find words NOT followed by a colon
# (the trailing \b stops backtracking to partial words such as 'nam')
re.findall(r'\w+(?!:)\b', 'name: Alice age: 30')
# ['Alice', '30']
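Because lookaheads consume nothing, several can be stacked at the same position to require multiple conditions at once. The rule below (at least 8 characters, one digit, one lowercase letter) is a hypothetical example, not a recommended password policy:

```python
import re

# Each lookahead scans ahead from the start, then .{8,} consumes the text.
rule = re.compile(r'(?=.*\d)(?=.*[a-z]).{8,}')

ok = rule.fullmatch('abc12345')         # match: has a digit and a letter
no_digit = rule.fullmatch('abcdefgh')   # None: the digit lookahead fails
no_letter = rule.fullmatch('12345678')  # None: the letter lookahead fails
too_short = rule.fullmatch('a1')        # None: fewer than 8 characters
```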
Lookbehind¶
| Pattern | Description |
|---|---|
| `(?<=...)` | Positive lookbehind: matches if ... matches immediately before the current position. |
| `(?<!...)` | Negative lookbehind: matches if ... does not match immediately before. |
Warning
Lookbehind patterns must be fixed-length in Python. You cannot use variable-length quantifiers (*, +, {n,m} where n and m differ) inside a lookbehind.
import re
# Positive lookbehind: find numbers preceded by £
re.findall(r'(?<=£)\d+\.?\d*', 'Prices: £5.99 and £12')
# ['5.99', '12']
# Negative lookbehind: find numbers NOT preceded by £
re.findall(r'(?<!£)\b\d+\.?\d*', 'Prices: £5.99 and 12 items')
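# ['99', '12'] ('99' is the fractional part of £5.99: the lookbehind
# only checks the character immediately before the match)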
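The fixed-length restriction on lookbehind can be observed directly, and one possible workaround is to combine several fixed-width lookbehinds with alternation. This is a sketch with assumed currency prefixes:

```python
import re

# A variable-length lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=a+)b')
    compile_error = None
except re.error as exc:
    compile_error = exc

# Workaround: each lookbehind is fixed-width on its own (1 and 4 characters).
amounts = re.findall(r'(?:(?<=£)|(?<=USD ))\d+', '£5 USD 10 EUR 7')

print(compile_error is not None)  # True
print(amounts)                    # ['5', '10']
```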
Conditional patterns¶
| Pattern | Description |
|---|---|
| `(?(id)yes\|no)` | Match yes pattern if group id matched, otherwise match no pattern. The no part is optional. |
import re
# Match an optionally quoted word
pattern = re.compile(r'(")?(\w+)(?(1)")')
print(pattern.search('"hello"').group()) # "hello"
print(pattern.search('hello').group()) # hello
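The example above omits the no branch. The sketch below uses both branches, following the pattern given in the re module documentation: an address must either be wrapped in angle brackets on both sides or on neither. The sample addresses are placeholders.

```python
import re

# If group 1 (the opening '<') matched, require '>'; otherwise require the end.
pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

samples = ['<user@host.com>', 'user@host.com', '<user@host.com', 'user@host.com>']
results = {s: bool(pattern.match(s)) for s in samples}

for sample, matched in results.items():
    print(sample, matched)
```

Only the first two samples match; the ones with an unbalanced bracket are rejected.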
Special sequences summary¶
| Sequence | Description |
|---|---|
| `\d` | Digit |
| `\D` | Non-digit |
| `\w` | Word character |
| `\W` | Non-word character |
| `\s` | Whitespace |
| `\S` | Non-whitespace |
| `\b` | Word boundary |
| `\B` | Non-word boundary |
| `\A` | Start of string |
| `\Z` | End of string |
| `\1` ... `\9` | Backreference to groups 1 to 9 |
| `\n`, `\t`, `\r` | Newline, tab, carriage return (the re module interprets these escapes itself, so they work unchanged in raw strings) |
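Raw strings matter because Python processes backslashes in ordinary string literals before the re module sees the pattern. A short sketch of where that does and does not change the result:

```python
import re

# In a normal string '\n' is one character; in a raw string it is two.
normal_len = len('\n')   # 1
raw_len = len(r'\n')     # 2

# Both forms match a newline, because re understands the \n escape itself.
normal_newline = re.search('\n', 'a\nb') is not None   # True
raw_newline = re.search(r'\n', 'a\nb') is not None     # True

# '\b' in a normal string is a backspace character, not a word boundary.
normal_boundary = re.search('\bcat\b', 'cat')   # None
raw_boundary = re.search(r'\bcat\b', 'cat')     # matches 'cat'
```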