{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Character classes and quantifiers\n",
    "\n",
    "In this tutorial, you will learn how to build flexible patterns using character classes and quantifiers. Character classes let you match any one of a set or range of characters, and quantifiers control how many times a pattern element repeats.\n",
    "\n",
    "**Time commitment:** 15–20 minutes\n",
    "\n",
    "**Prerequisites:**\n",
    "\n",
    "- Completion of [Your first pattern](https://agilearn.co.uk/guides/regex/learn/01-your-first-pattern)\n",
    "- Basic Python knowledge (strings, variables, and functions)\n",
    "\n",
    "## Learning objectives\n",
    "\n",
    "By the end of this tutorial, you will be able to:\n",
    "\n",
    "- Use character classes to match sets of characters\n",
    "- Use shorthand character classes (`\\d`, `\\w`, `\\s`)\n",
    "- Apply quantifiers (`+`, `*`, `?`, `{n,m}`) to control repetition\n",
    "- Understand the difference between greedy and lazy matching\n",
    "- Use anchors (`^`, `$`, `\\b`) to match positions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Character classes\n",
    "\n",
    "A **character class** matches any single character from a defined set. You create one by placing characters inside square brackets `[...]`.\n",
    "\n",
    "For example, `[aeiou]` matches any single vowel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match any vowel\n",
    "text = 'hello'\n",
    "match = re.search(r'[aeiou]', text)\n",
    "if match:\n",
    "    print(f'First vowel found: \"{match.group()}\" at position {match.start()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ranges in character classes\n",
    "\n",
    "You can use a hyphen to specify a range of characters:\n",
    "\n",
    "| Pattern | Matches |\n",
    "|---|---|\n",
    "| `[a-z]` | Any lowercase letter |\n",
    "| `[A-Z]` | Any uppercase letter |\n",
    "| `[0-9]` | Any digit |\n",
    "| `[a-zA-Z]` | Any letter (upper or lower) |\n",
    "| `[a-zA-Z0-9]` | Any letter or digit |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match a lowercase letter followed by a digit\n",
    "texts = ['a1', 'B2', 'c3', '4d', 'ef']\n",
    "\n",
    "for text in texts:\n",
    "    match = re.search(r'[a-z][0-9]', text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" → matched \"{match.group()}\"')\n",
    "    else:\n",
    "        print(f'\"{text}\" → no match')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Negated character classes\n",
    "\n",
    "Placing a caret `^` at the start of a character class **negates** it, matching any character that is **not** in the set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match any character that is NOT a digit\n",
    "text = 'abc123def'\n",
    "match = re.search(r'[^0-9]', text)\n",
    "if match:\n",
    "    print(f'First non-digit character: \"{match.group()}\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Shorthand character classes\n",
    "\n",
    "The `re` module provides shorthand notation for common character classes:\n",
    "\n",
    "| Shorthand | ASCII equivalent | Matches |\n",
    "|---|---|---|\n",
    "| `\\d` | `[0-9]` | Any digit |\n",
    "| `\\D` | `[^0-9]` | Any non-digit |\n",
    "| `\\w` | `[a-zA-Z0-9_]` | Any word character (letter, digit, or underscore) |\n",
    "| `\\W` | `[^a-zA-Z0-9_]` | Any non-word character |\n",
    "| `\\s` | `[ \\t\\n\\r\\f\\v]` | Any whitespace character |\n",
    "| `\\S` | `[^ \\t\\n\\r\\f\\v]` | Any non-whitespace character |\n",
    "\n",
    "For `str` patterns in Python 3, these shorthands are Unicode-aware by default: `\\d` also matches decimal digits from other scripts, `\\w` matches letters such as `é`, and `\\s` matches Unicode whitespace. The bracketed equivalents hold exactly only when the `re.ASCII` flag is set.\n",
    "\n",
    "These shorthand classes make patterns shorter and easier to read."
   ]
  },
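  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Python 3, the shorthand classes are Unicode-aware by default when the pattern is a `str`, so `\\d` can match digits outside `0`–`9`. The following is a minimal sketch, using an illustrative sample string, that compares the default behaviour with the `re.ASCII` flag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Illustrative sample text: the first number is written with Arabic-Indic digits\n",
    "text = 'Room \\u0664\\u0662 and room 42'\n",
    "\n",
    "# Default (Unicode) matching: \\d accepts decimal digits from any script\n",
    "unicode_digits = re.findall(r'\\d+', text)\n",
    "\n",
    "# With re.ASCII, \\d is restricted to the characters 0-9\n",
    "ascii_digits = re.findall(r'\\d+', text, flags=re.ASCII)\n",
    "\n",
    "print(f'Default:  {unicode_digits}')\n",
    "print(f're.ASCII: {ascii_digits}')"
   ]
  },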
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'Order #42 placed on 15/01/2026'\n",
    "\n",
    "# Find the first digit\n",
    "digit_match = re.search(r'\\d', text)\n",
    "print(f'First digit: \"{digit_match.group()}\"')\n",
    "\n",
    "# Find the first whitespace character\n",
    "space_match = re.search(r'\\s', text)\n",
    "print(f'First whitespace at position: {space_match.start()}')\n",
    "\n",
    "# Find '#' followed by one word character (the match includes the '#')\n",
    "word_match = re.search(r'#\\w', text)\n",
    "print(f'# and the next character: \"{word_match.group()}\"')"
   ]
  },
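  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The uppercase shorthands invert their lowercase counterparts. As a quick sketch on a sample string, each call below collects the runs of characters that fall outside the corresponding lowercase class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'Order #42 placed'\n",
    "\n",
    "# \\D+ : runs of non-digit characters\n",
    "non_digits = re.findall(r'\\D+', text)\n",
    "\n",
    "# \\S+ : runs of non-whitespace characters (splits the text on spaces)\n",
    "non_space = re.findall(r'\\S+', text)\n",
    "\n",
    "# \\W+ : runs of non-word characters (here, spaces and the '#')\n",
    "non_word = re.findall(r'\\W+', text)\n",
    "\n",
    "print(f'Non-digits:     {non_digits}')\n",
    "print(f'Non-whitespace: {non_space}')\n",
    "print(f'Non-word:       {non_word}')"
   ]
  },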
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quantifiers\n",
    "\n",
    "**Quantifiers** control how many times a pattern element is matched. Without quantifiers, each element matches exactly once.\n",
    "\n",
    "| Quantifier | Meaning |\n",
    "|---|---|\n",
    "| `+` | One or more |\n",
    "| `*` | Zero or more |\n",
    "| `?` | Zero or one (optional) |\n",
    "| `{n}` | Exactly n times |\n",
    "| `{n,}` | At least n times |\n",
    "| `{n,m}` | Between n and m times (inclusive) |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The `+` quantifier (one or more)\n",
    "\n",
    "The `+` quantifier matches **one or more** occurrences of the preceding element."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match one or more digits\n",
    "texts = ['Price: 5', 'Price: 42', 'Price: 1000', 'No price']\n",
    "\n",
    "for text in texts:\n",
    "    match = re.search(r'\\d+', text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" → found \"{match.group()}\"')\n",
    "    else:\n",
    "        print(f'\"{text}\" → no match')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The `*` quantifier (zero or more)\n",
    "\n",
    "The `*` quantifier matches **zero or more** occurrences of the preceding element. Unlike `+`, it still succeeds when that element does not appear at all."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match 'colour' or 'color' (with zero or more 'u' characters)\n",
    "texts = ['colour', 'color', 'colouur']\n",
    "\n",
    "for text in texts:\n",
    "    match = re.search(r'colou*r', text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" → matched \"{match.group()}\"')\n",
    "    else:\n",
    "        print(f'\"{text}\" → no match')"
   ]
  },
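  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because `*` allows zero repetitions, a pattern consisting only of a starred element can match the empty string. As a small sketch on a sample string, compare `\\d*` with `\\d+` when the text does not begin with a digit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'abc123'\n",
    "\n",
    "# \\d* is satisfied by zero digits, so it matches the empty string at position 0\n",
    "star = re.search(r'\\d*', text)\n",
    "\n",
    "# \\d+ needs at least one digit, so the search moves on until it finds '123'\n",
    "plus = re.search(r'\\d+', text)\n",
    "\n",
    "print(f'* matched {star.group()!r} at position {star.start()}')\n",
    "print(f'+ matched {plus.group()!r} at position {plus.start()}')"
   ]
  },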
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The `?` quantifier (zero or one)\n",
    "\n",
    "The `?` quantifier matches **zero or one** occurrence, making the preceding element optional."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match 'colour' or 'color' using ?\n",
    "texts = ['colour', 'color']\n",
    "\n",
    "for text in texts:\n",
    "    match = re.search(r'colou?r', text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" → matched \"{match.group()}\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exact and range quantifiers\n",
    "\n",
    "Curly braces let you specify exact counts or ranges."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match exactly 3 digits\n",
    "texts = ['12', '123', '1234', '12345']\n",
    "\n",
    "for text in texts:\n",
    "    match = re.fullmatch(r'\\d{3}', text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" → exactly 3 digits')\n",
    "    else:\n",
    "        print(f'\"{text}\" → not exactly 3 digits')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match between 2 and 4 digits\n",
    "texts = ['1', '12', '123', '1234', '12345']\n",
    "\n",
    "for text in texts:\n",
    "    match = re.fullmatch(r'\\d{2,4}', text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" → between 2 and 4 digits')\n",
    "    else:\n",
    "        print(f'\"{text}\" → not between 2 and 4 digits')"
   ]
  },
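  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The open-ended form `{n,}` sets only a minimum. As a sketch in the same style as the cells above, the pattern below accepts any string of at least three digits, with no upper limit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Match at least 3 digits\n",
    "texts = ['12', '123', '12345']\n",
    "\n",
    "# Record whether each whole string consists of three or more digits\n",
    "results = {text: bool(re.fullmatch(r'\\d{3,}', text)) for text in texts}\n",
    "\n",
    "for text, is_match in results.items():\n",
    "    if is_match:\n",
    "        print(f'\"{text}\" → at least 3 digits')\n",
    "    else:\n",
    "        print(f'\"{text}\" → fewer than 3 digits')"
   ]
  },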
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Combining character classes and quantifiers\n",
    "\n",
    "The real power of regular expressions comes from combining character classes with quantifiers. Let us build some practical patterns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match a simple word (one or more word characters)\n",
    "text = 'Hello, World! How are you?'\n",
    "match = re.search(r'\\w+', text)\n",
    "print(f'First word: \"{match.group()}\"')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match a date in DD/MM/YYYY format\n",
    "text = 'The deadline is 25/12/2026 for all submissions'\n",
    "match = re.search(r'\\d{2}/\\d{2}/\\d{4}', text)\n",
    "if match:\n",
    "    print(f'Date found: {match.group()}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match a price in pounds and pence (for example, £9.99); a whole-pound amount such as £100 would not match\n",
    "text = 'The book costs £9.99 and the pen costs £1.50'\n",
    "match = re.search(r'£\\d+\\.\\d{2}', text)\n",
    "if match:\n",
    "    print(f'Price found: {match.group()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Greedy versus lazy matching\n",
    "\n",
    "By default, quantifiers are **greedy** — they match as much text as possible. You can make them **lazy** (matching as little as possible) by adding a `?` after the quantifier.\n",
    "\n",
    "| Greedy | Lazy | Meaning |\n",
    "|---|---|---|\n",
    "| `+` | `+?` | One or more (lazy) |\n",
    "| `*` | `*?` | Zero or more (lazy) |\n",
    "| `?` | `??` | Zero or one (lazy) |\n",
    "| `{n,m}` | `{n,m}?` | Between n and m (lazy) |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = '<b>bold</b> and <i>italic</i>'\n",
    "\n",
    "# Greedy: matches as much as possible\n",
    "greedy = re.search(r'<.+>', text)\n",
    "print(f'Greedy:  \"{greedy.group()}\"')\n",
    "\n",
    "# Lazy: matches as little as possible\n",
    "lazy = re.search(r'<.+?>', text)\n",
    "print(f'Lazy:    \"{lazy.group()}\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The greedy pattern `<.+>` matched from the first `<` all the way to the last `>`, capturing everything in between. The lazy pattern `<.+?>` stopped at the first `>` it encountered.\n",
    "\n",
    "The choice matters whenever the closing delimiter, here `>`, can appear more than once in the text being searched."
   ]
  },
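  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common alternative to a lazy quantifier is a negated character class, which states directly that the match may not cross a `>`. The sketch below applies both approaches to the same sample text with `re.findall()`; on this input they return the same tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = '<b>bold</b> and <i>italic</i>'\n",
    "\n",
    "# Lazy quantifier: stop at the first '>' after each '<'\n",
    "lazy_tags = re.findall(r'<.+?>', text)\n",
    "\n",
    "# Negated class: one or more characters that are not '>', then '>'\n",
    "negated_tags = re.findall(r'<[^>]+>', text)\n",
    "\n",
    "print(f'Lazy:          {lazy_tags}')\n",
    "print(f'Negated class: {negated_tags}')"
   ]
  },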
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Anchors\n",
    "\n",
    "**Anchors** do not match characters — they match **positions** in the string.\n",
    "\n",
    "| Anchor | Matches |\n",
    "|---|---|\n",
    "| `^` | Start of the string (or start of each line with `re.MULTILINE`) |\n",
    "| `$` | End of the string, or just before a newline at the very end (or end of each line with `re.MULTILINE`) |\n",
    "| `\\b` | Word boundary (between a word character and a non-word character, or at the start or end of the string next to a word character) |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match a string that starts with a digit\n",
    "texts = ['42 is the answer', 'The answer is 42', '100 percent']\n",
    "\n",
    "for text in texts:\n",
    "    match = re.search(r'^\\d+', text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" starts with digits: \"{match.group()}\"')\n",
    "    else:\n",
    "        print(f'\"{text}\" does not start with digits')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match a string that ends with a digit\n",
    "texts = ['Room 42', 'Room 42B', 'Answer: 100']\n",
    "\n",
    "for text in texts:\n",
    "    match = re.search(r'\\d+$', text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" ends with digits: \"{match.group()}\"')\n",
    "    else:\n",
    "        print(f'\"{text}\" does not end with digits')"
   ]
  },
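  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the `re.MULTILINE` flag, `^` and `$` also match at the start and end of each line. A brief sketch, using a three-line sample string, shows the difference for `^`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Sample text with three lines; the first and last begin with digits\n",
    "text = '42 apples\\nno number here\\n7 pears'\n",
    "\n",
    "# Without the flag, ^ matches only at the very start of the string\n",
    "without_flag = re.findall(r'^\\d+', text)\n",
    "\n",
    "# With re.MULTILINE, ^ also matches immediately after each newline\n",
    "with_flag = re.findall(r'^\\d+', text, flags=re.MULTILINE)\n",
    "\n",
    "print(f'Without re.MULTILINE: {without_flag}')\n",
    "print(f'With re.MULTILINE:    {with_flag}')"
   ]
  },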
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Word boundaries\n",
    "\n",
    "The `\\b` anchor matches a position where a word character sits next to a non-word character or the edge of the string. This is useful for matching whole words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'The cat concatenated the catalogue'\n",
    "\n",
    "# Without word boundaries: matches 'cat' inside other words\n",
    "matches_no_boundary = re.findall(r'cat', text)\n",
    "print(f'Without boundaries: {matches_no_boundary}')\n",
    "\n",
    "# With word boundaries: matches only the whole word 'cat'\n",
    "matches_with_boundary = re.findall(r'\\bcat\\b', text)\n",
    "print(f'With boundaries: {matches_with_boundary}')"
   ]
  },
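  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A word boundary is a position, not a character, and it also occurs at the start or end of the string when the adjacent character is a word character. To make the positions visible, this sketch lists every index at which `\\b` matches in a short sample string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'cat, dog'\n",
    "\n",
    "# Each match of \\b is empty; start() gives the index of the boundary\n",
    "boundaries = [m.start() for m in re.finditer(r'\\b', text)]\n",
    "\n",
    "# Boundaries fall before 'c', after 't', before 'd', and after 'g'\n",
    "print(f'Word boundaries in \"{text}\": {boundaries}')"
   ]
  },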
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Putting it all together\n",
    "\n",
    "Let us combine everything you have learned to build a practical pattern. We will match a simple UK phone number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pattern for a simple UK phone number: 0XXXX XXXXXX\n",
    "phone_pattern = re.compile(r'\\b0\\d{4}\\s?\\d{6}\\b')\n",
    "\n",
    "texts = [\n",
    "    'Call us on 01234 567890 today',\n",
    "    'Phone: 01234567890',\n",
    "    'Number is 01234 567890',\n",
    "    'Not a phone: 12345',\n",
    "]\n",
    "\n",
    "for text in texts:\n",
    "    match = phone_pattern.search(text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" → found \"{match.group()}\"')\n",
    "    else:\n",
    "        print(f'\"{text}\" → no phone number found')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "### Exercise 1\n",
    "\n",
    "Write a pattern that matches a string containing only lowercase letters. Test it using `re.fullmatch()` against the strings `'hello'`, `'Hello'`, `'hello123'`, and `'world'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "strings_to_test = ['hello', 'Hello', 'hello123', 'world']\n",
    "\n",
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2\n",
    "\n",
    "Write a pattern that matches a time in 24-hour format (`HH:MM`), where the hours are 00–23 and the minutes are 00–59. Test it against `'09:30'`, `'14:45'`, `'25:00'`, and `'9:30'`.\n",
    "\n",
    "*Hint:* You will need to think carefully about the ranges for the first digit of the hour."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "times_to_test = ['09:30', '14:45', '25:00', '9:30']\n",
    "\n",
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3\n",
    "\n",
    "Use `re.findall()` with the `\\b` anchor to find all four-letter words in the following sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'The quick brown fox jumps over the lazy dogs'\n",
    "\n",
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solutions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1\n",
    "strings_to_test = ['hello', 'Hello', 'hello123', 'world']\n",
    "\n",
    "for s in strings_to_test:\n",
    "    if re.fullmatch(r'[a-z]+', s):\n",
    "        print(f'\"{s}\" → only lowercase letters')\n",
    "    else:\n",
    "        print(f'\"{s}\" → not only lowercase letters')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2\n",
    "times_to_test = ['09:30', '14:45', '25:00', '9:30']\n",
    "\n",
    "# Hours: 00-19 is [01]\\d, 20-23 is 2[0-3]\n",
    "time_pattern = re.compile(r'(?:[01]\\d|2[0-3]):[0-5]\\d')\n",
    "\n",
    "for t in times_to_test:\n",
    "    if time_pattern.fullmatch(t):\n",
    "        print(f'\"{t}\" → valid 24-hour time')\n",
    "    else:\n",
    "        print(f'\"{t}\" → not a valid 24-hour time')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3\n",
    "text = 'The quick brown fox jumps over the lazy dogs'\n",
    "four_letter_words = re.findall(r'\\b\\w{4}\\b', text)\n",
    "print(f'Four-letter words: {four_letter_words}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, you learned:\n",
    "\n",
    "- **Character classes**: `[abc]` matches any character in the set; `[^abc]` matches any character not in the set\n",
    "- **Ranges**: `[a-z]`, `[0-9]`, and `[A-Za-z]` match ranges of characters\n",
    "- **Shorthand classes**: `\\d` (digit), `\\w` (word character), `\\s` (whitespace), and their negated forms\n",
    "- **Quantifiers**: `+` (one or more), `*` (zero or more), `?` (optional), `{n}` (exactly n), `{n,}` (at least n), `{n,m}` (between n and m)\n",
    "- **Greedy versus lazy**: Quantifiers are greedy by default; add `?` to make them lazy\n",
    "- **Anchors**: `^` (start), `$` (end), `\\b` (word boundary)\n",
    "\n",
    "## Next steps\n",
    "\n",
    "In the next tutorial, [Groups and capturing](https://agilearn.co.uk/guides/regex/learn/03-groups-and-capturing), you will learn how to use parentheses to group parts of a pattern and extract specific portions of matched text."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}