{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Find and replace\n",
    "\n",
    "In this tutorial, you will learn to find, extract, and transform text with regular expressions. You will use `re.findall()`, `re.finditer()`, `re.sub()`, and `re.split()`, together with backreferences and replacement functions, to manipulate text.\n",
    "\n",
    "**Time commitment:** 15–20 minutes\n",
    "\n",
    "**Prerequisites:**\n",
    "\n",
    "- Completion of [Groups and capturing](https://agilearn.co.uk/guides/regex/learn/03-groups-and-capturing)\n",
    "- Basic Python knowledge (strings, variables, and functions)\n",
    "\n",
    "## Learning objectives\n",
    "\n",
    "By the end of this tutorial, you will be able to:\n",
    "\n",
    "- Use `re.findall()` to extract all matches from a string\n",
    "- Use `re.finditer()` to iterate over matches with full match object details\n",
    "- Use `re.sub()` to replace matched text\n",
    "- Use backreferences in replacement strings\n",
    "- Use `re.split()` to split strings on patterns\n",
    "- Use functions as replacement arguments in `re.sub()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding all matches with `re.findall()`\n",
    "\n",
    "You have already seen `re.search()`, which finds the **first** match. The `re.findall()` function finds **all** non-overlapping matches and returns them as a list of strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'The prices are £5.99, £12.50, and £100.00'\n",
    "\n",
    "# Find all prices\n",
    "prices = re.findall(r'£\\d+\\.\\d{2}', text)\n",
    "print(f'Prices found: {prices}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find all words that start with a capital letter\n",
    "text = 'Alice and Bob visited London and Paris last Summer'\n",
    "capitalised_words = re.findall(r'\\b[A-Z][a-z]+\\b', text)\n",
    "print(f'Capitalised words: {capitalised_words}')"
   ]
  },
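  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember from the previous tutorial: when the pattern contains capturing groups, `re.findall()` returns the captured text rather than the full matches. With one group, you get a list of strings; with several groups, you get a list of tuples, one string per group."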
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'Contact: alice@example.com, bob@test.co.uk'\n",
    "\n",
    "# Without groups: returns full matches\n",
    "print('Full matches:', re.findall(r'[\\w.]+@[\\w.]+', text))\n",
    "\n",
    "# With groups: returns only the captured parts\n",
    "print('Domains only:', re.findall(r'[\\w.]+@([\\w.]+)', text))"
   ]
  },
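  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With more than one capturing group, each item in the list returned by `re.findall()` is a tuple holding one string per group. The sketch below reuses the email text from above; the `user` and `domain` names are illustrative, and the pattern is not a robust email parser."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'Contact: alice@example.com, bob@test.co.uk'\n",
    "\n",
    "# Two capturing groups: each match becomes a (user, domain) tuple\n",
    "pairs = re.findall(r'([\\w.]+)@([\\w.]+)', text)\n",
    "print(pairs)\n",
    "\n",
    "# Tuples unpack neatly in a loop\n",
    "for user, domain in pairs:\n",
    "    print(f'{user} is at {domain}')"
   ]
  },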
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Iterating over matches with `re.finditer()`\n",
    "\n",
    "The `re.finditer()` function returns an iterator of **match objects**, giving you access to the full match details (position, groups, and so on) for each match. This is more powerful than `re.findall()` when you need more than just the matched text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'Order #101 placed on 15/01/2026, Order #102 placed on 20/01/2026'\n",
    "\n",
    "for match in re.finditer(r'Order #(\\d+)', text):\n",
    "    print(f'Found \"{match.group()}\" at position {match.start()}-{match.end()}')\n",
    "    print(f'  Order number: {match.group(1)}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using finditer() with named groups for structured extraction\n",
    "log_text = \"\"\"2026-02-09 14:30:00 [INFO] Server started\n",
    "2026-02-09 14:31:15 [WARNING] High memory usage\n",
    "2026-02-09 14:32:00 [ERROR] Connection refused\"\"\"\n",
    "\n",
    "pattern = re.compile(\n",
    "    r'(?P<date>[\\d-]+) (?P<time>[\\d:]+) \\[(?P<level>\\w+)\\] (?P<message>.+)'\n",
    ")\n",
    "\n",
    "for match in pattern.finditer(log_text):\n",
    "    info = match.groupdict()\n",
    "    print(f'[{info[\"level\"]:>7}] {info[\"time\"]} - {info[\"message\"]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Replacing text with `re.sub()`\n",
    "\n",
    "The `re.sub()` function replaces all occurrences of a pattern with a replacement string. Its basic syntax is:\n",
    "\n",
    "```python\n",
    "re.sub(pattern, replacement, string)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'The colour of the colour wheel is colourful'\n",
    "\n",
    "# Replace 'colour' with 'color'\n",
    "result = re.sub(r'colour', 'color', text)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove all digits from a string\n",
    "text = 'Room 42, Floor 3, Building 7'\n",
    "result = re.sub(r'\\d+', '', text)\n",
    "print(f'Without digits: \"{result}\"')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace multiple whitespace characters with a single space\n",
    "text = 'Too   many     spaces    here'\n",
    "result = re.sub(r'\\s+', ' ', text)\n",
    "print(f'Cleaned: \"{result}\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Limiting replacements with `count`\n",
    "\n",
    "You can limit the number of replacements using the `count` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'one two three four five'\n",
    "\n",
    "# Replace only the first two words with 'X'\n",
    "result = re.sub(r'\\w+', 'X', text, count=2)\n",
    "print(result)"
   ]
  },
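  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Backreferences in replacements\n",
    "\n",
    "**Backreferences** allow you to refer to captured groups in the replacement string. Use `\\1`, `\\2`, and so on for numbered groups, or `\\g<name>` for named groups. Write the replacement as a raw string (`r'...'`) so that Python passes the backslash through to `re` instead of treating `\\1` as a string escape."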
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Swap first name and last name\n",
    "text = 'Smith, Alice'\n",
    "result = re.sub(r'(\\w+), (\\w+)', r'\\2 \\1', text)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert dates from DD/MM/YYYY to YYYY-MM-DD using named groups\n",
    "text = 'Dates: 25/12/2026, 01/01/2027'\n",
    "result = re.sub(\n",
    "    r'(?P<day>\\d{2})/(?P<month>\\d{2})/(?P<year>\\d{4})',\n",
    "    r'\\g<year>-\\g<month>-\\g<day>',\n",
    "    text,\n",
    ")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Wrap all email addresses in angle brackets\n",
    "text = 'Contact alice@example.com or bob@test.co.uk'\n",
    "result = re.sub(r'([\\w.]+@[\\w.]+)', r'<\\1>', text)\n",
    "print(result)"
   ]
  },
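  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One case needs care: a numbered backreference followed directly by a literal digit. In a replacement string, `\\10` is read as group 10, not group 1 followed by `0`. Writing `\\g<1>` removes the ambiguity. A minimal sketch, using a made-up sentence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'Items cost 5, 12, and 100 pounds'\n",
    "\n",
    "# \\g<1> refers to group 1; the 0 after it is a literal digit\n",
    "result = re.sub(r'(\\d+)', r'\\g<1>0', text)\n",
    "print(result)"
   ]
  },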
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using functions as replacements\n",
    "\n",
    "For more complex replacements, you can pass a **function** as the replacement argument. The function receives a match object and must return the replacement string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def double_number(match: re.Match) -> str:\n",
    "    \"\"\"Double the matched number.\"\"\"\n",
    "    number = int(match.group())\n",
    "    return str(number * 2)\n",
    "\n",
    "\n",
    "text = 'I have 3 cats and 5 dogs'\n",
    "result = re.sub(r'\\d+', double_number, text)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert temperatures from Fahrenheit to Celsius\n",
    "def fahrenheit_to_celsius(match: re.Match) -> str:\n",
    "    \"\"\"Convert a Fahrenheit temperature to Celsius.\"\"\"\n",
    "    fahrenheit = float(match.group(1))\n",
    "    celsius = (fahrenheit - 32) * 5 / 9\n",
    "    return f'{celsius:.1f}°C'\n",
    "\n",
    "\n",
    "text = 'Today: 68°F, Tomorrow: 77°F, Next week: 50°F'\n",
    "result = re.sub(r'(\\d+)°F', fahrenheit_to_celsius, text)\n",
    "print(result)"
   ]
  },
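  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A replacement function is also a convenient way to apply a lookup table in a single pass. The sketch below assumes a hypothetical `abbreviations` dictionary and uses a `lambda` as the replacement; the names and the sample text are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical lookup table, for illustration only\n",
    "abbreviations = {'Mon': 'Monday', 'Tue': 'Tuesday', 'Wed': 'Wednesday'}\n",
    "\n",
    "# Build the alternation \\b(Mon|Tue|Wed)\\b from the dictionary keys\n",
    "pattern = re.compile(r'\\b(' + '|'.join(map(re.escape, abbreviations)) + r')\\b')\n",
    "\n",
    "text = 'Open Mon and Wed, closed Tue'\n",
    "\n",
    "# The lambda receives each match object and looks up its replacement\n",
    "result = pattern.sub(lambda match: abbreviations[match.group(1)], text)\n",
    "print(result)"
   ]
  },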
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting strings with `re.split()`\n",
    "\n",
    "The `re.split()` function splits a string at each occurrence of the pattern. This is more flexible than the built-in `str.split()` because the separator can be any pattern rather than a single fixed string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split on any whitespace (similar to str.split())\n",
    "text = 'one  two\\tthree\\nfour'\n",
    "parts = re.split(r'\\s+', text)\n",
    "print(f'Split on whitespace: {parts}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split on multiple delimiters (comma, semicolon, or pipe)\n",
    "text = 'apple,banana;cherry|date'\n",
    "parts = re.split(r'[,;|]', text)\n",
    "print(f'Split on delimiters: {parts}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split sentences on punctuation\n",
    "text = 'First sentence. Second sentence! Third sentence? Fourth.'\n",
    "sentences = re.split(r'[.!?]\\s*', text)\n",
    "# Filter out empty strings\n",
    "sentences = [s for s in sentences if s]\n",
    "print(f'Sentences: {sentences}')"
   ]
  },
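  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Much like `count` in `re.sub()`, `re.split()` accepts a `maxsplit` argument that caps the number of splits and leaves the rest of the string intact as the final item. A brief sketch, using a made-up log line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "line = 'ERROR: disk full: /dev/sda1: 98% used'\n",
    "\n",
    "# Split at the first colon only; the remainder stays in one piece\n",
    "level, message = re.split(r':\\s*', line, maxsplit=1)\n",
    "print(level)\n",
    "print(message)"
   ]
  },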
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Keeping the delimiters\n",
    "\n",
    "If you wrap the pattern in a capturing group, `re.split()` includes the delimiters in the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = '3+5-2*4'\n",
    "\n",
    "# Without capturing group: delimiters are removed\n",
    "print('Without delimiters:', re.split(r'[+\\-*]', text))\n",
    "\n",
    "# With capturing group: delimiters are kept\n",
    "print('With delimiters:   ', re.split(r'([+\\-*])', text))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using `re.subn()` to count replacements\n",
    "\n",
    "The `re.subn()` function works just like `re.sub()`, but returns a tuple of the new string and the number of replacements made."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'foo bar foo baz foo'\n",
    "result, count = re.subn(r'foo', 'qux', text)\n",
    "print(f'Result: \"{result}\"')\n",
    "print(f'Replacements made: {count}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A practical example: cleaning messy data\n",
    "\n",
    "Let us combine the techniques from this tutorial to clean up messy input data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "messy_data = \"\"\"\n",
    "  Name:   Alice   Smith  \n",
    "  Email: alice@example.com   \n",
    "  Phone:  01234  567890\n",
    "  Date:  25/12/2026  \n",
    "\"\"\"\n",
    "\n",
    "# Step 1: Extract key-value pairs\n",
    "pairs = re.findall(r'(\\w+):\\s*(.+?)\\s*$', messy_data, re.MULTILINE)\n",
    "print('Extracted pairs:')\n",
    "for key, value in pairs:\n",
    "    # Step 2: Clean up extra whitespace within values\n",
    "    clean_value = re.sub(r'\\s+', ' ', value.strip())\n",
    "    print(f'  {key}: {clean_value}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "### Exercise 1\n",
    "\n",
    "Use `re.findall()` to extract all hashtags (words starting with `#`) from the following text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'Learning #Python and #regex is great! #coding #programming'\n",
    "\n",
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2\n",
    "\n",
    "Use `re.sub()` with a backreference to convert the names from `\"Last, First\"` format to `\"First Last\"` format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "names = 'Smith, Alice\\nJones, Bob\\nBrown, Charlie'\n",
    "\n",
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3\n",
    "\n",
    "Write a replacement function for `re.sub()` that censors any number in the text by replacing each digit with `*`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'Card number: 1234 5678 9012, PIN: 4321'\n",
    "\n",
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solutions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1\n",
    "text = 'Learning #Python and #regex is great! #coding #programming'\n",
    "hashtags = re.findall(r'#\\w+', text)\n",
    "print(f'Hashtags: {hashtags}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2\n",
    "names = 'Smith, Alice\\nJones, Bob\\nBrown, Charlie'\n",
    "result = re.sub(r'(\\w+), (\\w+)', r'\\2 \\1', names)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3\n",
    "def censor_number(match: re.Match) -> str:\n",
    "    \"\"\"Replace each digit in the matched text with an asterisk.\"\"\"\n",
    "    return '*' * len(match.group())\n",
    "\n",
    "\n",
    "text = 'Card number: 1234 5678 9012, PIN: 4321'\n",
    "result = re.sub(r'\\d+', censor_number, text)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, you learned:\n",
    "\n",
    "- **`re.findall()`**: Find all non-overlapping matches and return them as a list\n",
    "- **`re.finditer()`**: Iterate over matches with full match object details\n",
    "- **`re.sub()`**: Replace all occurrences of a pattern with a replacement string\n",
    "- **Backreferences**: Use `\\1`, `\\2`, or `\\g<name>` in replacement strings to refer to captured groups\n",
    "- **Function replacements**: Pass a function to `re.sub()` for complex replacement logic\n",
    "- **`re.split()`**: Split strings on regex patterns, with optional delimiter retention\n",
    "- **`re.subn()`**: Replace and count the number of replacements made\n",
    "\n",
    "## Next steps\n",
    "\n",
    "Congratulations — you have completed all four tutorials! You now have a solid foundation in Python regular expressions. From here, you can:\n",
    "\n",
    "- Explore the [Recipes](https://agilearn.co.uk/guides/regex/recipes/index) for practical, real-world applications\n",
    "- Consult the [Reference](https://agilearn.co.uk/guides/regex/reference/index) documentation for detailed technical information\n",
    "- Read the [Concepts](https://agilearn.co.uk/guides/regex/concepts/index) articles to deepen your understanding"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}