{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Groups and capturing\n",
    "\n",
    "In this tutorial, you will learn how to use parentheses to group parts of a pattern and extract specific portions of matched text. Groups are one of the most useful features in regular expressions, allowing you to organise patterns and pull out the exact data you need.\n",
    "\n",
    "**Time commitment:** 15–20 minutes\n",
    "\n",
    "**Prerequisites:**\n",
    "\n",
    "- Completion of [Character classes and quantifiers](https://agilearn.co.uk/guides/regex/learn/02-character-classes-and-quantifiers)\n",
    "- Basic Python knowledge (strings, variables, and functions)\n",
    "\n",
    "## Learning objectives\n",
    "\n",
    "By the end of this tutorial, you will be able to:\n",
    "\n",
    "- Use parentheses to create capturing groups\n",
    "- Extract matched text with `.group()`, `.groups()`, and `.groupdict()`\n",
    "- Use named groups with `(?P<name>...)`\n",
    "- Apply non-capturing groups with `(?:...)`\n",
    "- Use the alternation operator `|` with groups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Capturing groups\n",
    "\n",
    "**Capturing groups** are created by enclosing part of a pattern in parentheses `(...)`. They serve two purposes:\n",
    "\n",
    "1. They **group** elements together so that quantifiers apply to the whole group\n",
    "2. They **capture** the matched text so you can extract it later\n",
    "\n",
    "Let us start with a simple example — extracting the day, month, and year from a date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'The event is on 25/12/2026'\n",
    "match = re.search(r'(\\d{2})/(\\d{2})/(\\d{4})', text)\n",
    "\n",
    "if match:\n",
    "    print(f'Full match: {match.group()}')\n",
    "    print(f'Day:   {match.group(1)}')\n",
    "    print(f'Month: {match.group(2)}')\n",
    "    print(f'Year:  {match.group(3)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each pair of parentheses creates a numbered group, starting from 1. The `.group(0)` (or simply `.group()`) returns the entire match, whilst `.group(1)`, `.group(2)`, and so on return each captured group.\n",
    "\n",
    "You can also use `.groups()` to get all captured groups as a tuple."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'The event is on 25/12/2026'\n",
    "match = re.search(r'(\\d{2})/(\\d{2})/(\\d{4})', text)\n",
    "\n",
    "if match:\n",
    "    day, month, year = match.groups()\n",
    "    print(f'Date: {day}/{month}/{year}')\n",
    "    print(f'All groups: {match.groups()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Named groups\n",
    "\n",
    "Numbered groups can be hard to read, especially with many groups. **Named groups** solve this by letting you assign a name to each group using the syntax `(?P<name>...)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'The event is on 25/12/2026'\n",
    "match = re.search(r'(?P<day>\\d{2})/(?P<month>\\d{2})/(?P<year>\\d{4})', text)\n",
    "\n",
    "if match:\n",
    "    print(f'Day:   {match.group(\"day\")}')\n",
    "    print(f'Month: {match.group(\"month\")}')\n",
    "    print(f'Year:  {match.group(\"year\")}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Named groups are especially useful when working with complex patterns. You can also access them through `.groupdict()`, which returns a dictionary of all named groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'Contact: Alice Smith, alice@example.com'\n",
    "pattern = re.compile(\n",
    "    r'(?P<name>[A-Z][a-z]+ [A-Z][a-z]+), (?P<email>[\\w.]+@[\\w.]+)'\n",
    ")\n",
    "match = pattern.search(text)\n",
    "\n",
    "if match:\n",
    "    details = match.groupdict()\n",
    "    print(f'Group dictionary: {details}')\n",
    "    print(f'Name:  {details[\"name\"]}')\n",
    "    print(f'Email: {details[\"email\"]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Non-capturing groups\n",
    "\n",
    "Sometimes you need parentheses for grouping (for example, to apply a quantifier to a group of characters) but you do not need to capture the matched text. Use `(?:...)` to create a **non-capturing group**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Capturing group: the prefix is captured\n",
    "text = 'http://example.com'\n",
    "match = re.search(r'(https?)://(.+)', text)\n",
    "if match:\n",
    "    print(f'Groups with capturing: {match.groups()}')\n",
    "\n",
    "# Non-capturing group: the prefix is not captured\n",
    "match = re.search(r'(?:https?)://(.+)', text)\n",
    "if match:\n",
    "    print(f'Groups with non-capturing: {match.groups()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the first example, `match.groups()` returns two items (the protocol and the domain). In the second example, only the domain is captured because the protocol group uses `(?:...)`. Non-capturing groups are useful for keeping your group numbering clean."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Alternation with groups\n",
    "\n",
    "The **alternation** operator `|` works like a logical OR. When combined with groups, it lets you match one of several alternatives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match different file extensions\n",
    "filenames = ['report.pdf', 'image.png', 'data.csv', 'script.py', 'notes.txt']\n",
    "\n",
    "pattern = re.compile(r'\\w+\\.(?:pdf|png|csv)')\n",
    "\n",
    "for name in filenames:\n",
    "    if pattern.fullmatch(name):\n",
    "        print(f'\"{name}\" → matched')\n",
    "    else:\n",
    "        print(f'\"{name}\" → not matched')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use a capturing group to also extract the extension\n",
    "pattern = re.compile(r'(\\w+)\\.(pdf|png|csv)')\n",
    "\n",
    "for name in filenames:\n",
    "    match = pattern.fullmatch(name)\n",
    "    if match:\n",
    "        print(f'\"{name}\" → base: \"{match.group(1)}\", extension: \"{match.group(2)}\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Groups with quantifiers\n",
    "\n",
    "You can apply quantifiers to groups to repeat the entire group pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Match an IP address (simplified)\n",
    "text = 'Server IP: 192.168.1.100'\n",
    "match = re.search(r'(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})', text)\n",
    "\n",
    "if match:\n",
    "    print(f'Full IP: {match.group()}')\n",
    "    print(f'Octets: {match.groups()}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use a non-capturing group with a quantifier to match repeated patterns\n",
    "# Match a sequence like \"abc-abc-abc\"\n",
    "text = 'Reference: ABC-123-XYZ'\n",
    "match = re.search(r'[A-Z]{3}(?:-[A-Z0-9]{3}){2}', text)\n",
    "\n",
    "if match:\n",
    "    print(f'Reference found: {match.group()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Groups with `re.findall()`\n",
    "\n",
    "When you use `re.findall()` with a pattern that contains capturing groups, it returns the captured groups rather than the full matches. This is a common source of confusion, but it is also very useful for extracting data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'Dates: 25/12/2026, 01/01/2027, 14/02/2027'\n",
    "\n",
    "# Without groups: returns full matches\n",
    "print('Without groups:', re.findall(r'\\d{2}/\\d{2}/\\d{4}', text))\n",
    "\n",
    "# With one group: returns list of captured strings\n",
    "print('One group (year):', re.findall(r'\\d{2}/\\d{2}/(\\d{4})', text))\n",
    "\n",
    "# With multiple groups: returns list of tuples\n",
    "print('Multiple groups:', re.findall(r'(\\d{2})/(\\d{2})/(\\d{4})', text))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A practical example: parsing log entries\n",
    "\n",
    "Let us combine everything to parse a realistic log entry."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "log_entry = '2026-02-09 14:30:45 [ERROR] Connection timeout after 30s'\n",
    "\n",
    "pattern = re.compile(\n",
    "    r'(?P<date>\\d{4}-\\d{2}-\\d{2})\\s'\n",
    "    r'(?P<time>\\d{2}:\\d{2}:\\d{2})\\s'\n",
    "    r'\\[(?P<level>\\w+)\\]\\s'\n",
    "    r'(?P<message>.+)'\n",
    ")\n",
    "\n",
    "match = pattern.search(log_entry)\n",
    "if match:\n",
    "    info = match.groupdict()\n",
    "    for key, value in info.items():\n",
    "        print(f'{key:>10}: {value}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "### Exercise 1\n",
    "\n",
    "Write a pattern with named groups to extract the hours and minutes from a time string in `HH:MM` format. Test it on `'Meeting at 14:30'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'Meeting at 14:30'\n",
    "\n",
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2\n",
    "\n",
    "Use `re.findall()` with capturing groups to extract all the names and ages from the following text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'Alice is 30, Bob is 25, and Charlie is 35'\n",
    "\n",
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3\n",
    "\n",
    "Write a pattern using alternation to match both `'colour'` and `'color'`. Use a non-capturing group to avoid capturing the optional `'u'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "texts = ['I like this colour', 'I like this color', 'colourful display']\n",
    "\n",
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solutions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1\n",
    "text = 'Meeting at 14:30'\n",
    "match = re.search(r'(?P<hours>\\d{2}):(?P<minutes>\\d{2})', text)\n",
    "\n",
    "if match:\n",
    "    print(f'Hours:   {match.group(\"hours\")}')\n",
    "    print(f'Minutes: {match.group(\"minutes\")}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2\n",
    "text = 'Alice is 30, Bob is 25, and Charlie is 35'\n",
    "results = re.findall(r'(\\w+) is (\\d+)', text)\n",
    "\n",
    "for name, age in results:\n",
    "    print(f'{name} is {age} years old')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3\n",
    "texts = ['I like this colour', 'I like this color', 'colourful display']\n",
    "\n",
    "pattern = re.compile(r'colou?r')\n",
    "\n",
    "for text in texts:\n",
    "    match = pattern.search(text)\n",
    "    if match:\n",
    "        print(f'\"{text}\" → found \"{match.group()}\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, you learned:\n",
    "\n",
    "- **Capturing groups**: Use `(...)` to capture matched text and access it with `.group(n)` or `.groups()`\n",
    "- **Named groups**: Use `(?P<name>...)` for readable patterns and access groups with `.group(\"name\")` or `.groupdict()`\n",
    "- **Non-capturing groups**: Use `(?:...)` when you need grouping but do not need to capture the text\n",
    "- **Alternation**: Use `|` inside groups to match one of several alternatives\n",
    "- **Groups with `re.findall()`**: When groups are present, `re.findall()` returns captured groups rather than full matches\n",
    "\n",
    "## Next steps\n",
    "\n",
    "In the next tutorial, [Find and replace](https://agilearn.co.uk/guides/regex/learn/04-find-and-replace), you will learn how to use `re.sub()`, `re.findall()`, `re.finditer()`, and backreferences to search, extract, and transform text."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}