{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4",
   "metadata": {},
   "source": [
    "# String searching\n",
    "\n",
    "In this tutorial, you will learn how to search within strings and test for the presence of substrings. These skills are essential for tasks such as validating input, extracting information, and filtering text.\n",
    "\n",
    "**Time commitment:** 15&ndash;20 minutes\n",
    "\n",
    "**Prerequisites:**\n",
    "- Completion of [String formatting](https://agilearn.co.uk/guides/string-processing/learn/03-string-formatting)\n",
    "- Understanding of string methods and indexing\n",
    "\n",
    "## Learning objectives\n",
    "\n",
    "By the end of this tutorial, you will be able to:\n",
    "\n",
    "- Use the `in` operator to test for substring presence\n",
    "- Find the position of substrings with `find()` and `index()`\n",
    "- Search from the right with `rfind()` and `rindex()`\n",
    "- Test string beginnings and endings with `startswith()` and `endswith()`\n",
    "- Count occurrences with `count()`\n",
    "- Combine searching methods to solve practical problems"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5",
   "metadata": {},
   "source": [
    "## The `in` operator\n",
    "\n",
    "The simplest way to check whether a substring exists within a string is the `in` operator. It returns `True` if the substring is found and `False` otherwise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3d4e5f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence = \"The quick brown fox jumps over the lazy dog\"\n",
    "\n",
    "print(\"fox\" in sentence)\n",
    "print(\"cat\" in sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4e5f6a7",
   "metadata": {},
   "source": [
    "The `in` operator is **case-sensitive**. If you need a case-insensitive search, convert both strings to the same case first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5f6a7b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"Hello, World!\"\n",
    "\n",
    "print(\"hello\" in text)                 # False &ndash; case mismatch\n",
    "print(\"hello\" in text.lower())         # True &ndash; both lowercase"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6a7b8c9",
   "metadata": {},
   "source": [
    "You can also use `not in` to check for the absence of a substring."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7b8c9d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence = \"The quick brown fox jumps over the lazy dog\"\n",
    "\n",
    "print(\"cat\" not in sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8c9d0e1",
   "metadata": {},
   "source": [
    "## Finding substrings\n",
    "\n",
    "While `in` tells you whether a substring exists, `str.find()` tells you **where** it is. It returns the index of the first occurrence of the substring, or `-1` if the substring is not found."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c9d0e1f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence = \"Python is a powerful programming language\"\n",
    "\n",
    "print(sentence.find(\"powerful\"))   # 12\n",
    "print(sentence.find(\"Ruby\"))       # -1 -- not found"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0e1f2a3",
   "metadata": {},
   "source": [
    "### `str.index()` -- an alternative to `str.find()`\n",
    "\n",
    "The `str.index()` method works the same as `str.find()`, but it raises a `ValueError` if the substring is not found. Use `str.index()` when you expect the substring to be present and want an error to alert you if it is not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3b4c5d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence = \"Python is a powerful programming language\"\n",
    "\n",
    "print(sentence.index(\"powerful\"))   # 12\n",
    "\n",
    "# Uncommenting the following line would raise a ValueError:\n",
    "# sentence.index(\"Ruby\")"
   ]
  },
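  {
   "cell_type": "markdown",
   "id": "0f1e2d3c",
   "metadata": {},
   "source": [
    "When a missing substring is a normal possibility rather than a bug, the `ValueError` can be caught with `try`/`except`. The following is a minimal sketch: the helper name `position_of()` and its choice to return `None` are illustrative assumptions, not part of Python's string API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f2e3d4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Optional\n",
    "\n",
    "\n",
    "def position_of(text: str, substring: str) -> Optional[int]:\n",
    "    \"\"\"Return the index of substring in text, or None if it is absent.\n",
    "\n",
    "    Hypothetical helper, for illustration only.\n",
    "    \"\"\"\n",
    "    try:\n",
    "        return text.index(substring)\n",
    "    except ValueError:\n",
    "        # str.index() raises ValueError when the substring is missing\n",
    "        return None\n",
    "\n",
    "\n",
    "sentence = \"Python is a powerful programming language\"\n",
    "\n",
    "print(position_of(sentence, \"powerful\"))   # 12\n",
    "print(position_of(sentence, \"Ruby\"))       # None"
   ]
  },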
  {
   "cell_type": "markdown",
   "id": "b4c5d6e7",
   "metadata": {},
   "source": [
    "### `str.rfind()` and `str.rindex()` -- searching from the right\n",
    "\n",
    "The methods `str.rfind()` and `str.rindex()` work the same way as `str.find()` and `str.index()`, but they search from the **right** side of the string. They return the index of the **last** occurrence of the substring."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5d6e7f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"/home/user/documents/report.final.pdf\"\n",
    "\n",
    "print(path.find(\".\"))     # 27 -- first dot\n",
    "print(path.rfind(\".\"))    # 33 -- last dot"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6e7f8a9",
   "metadata": {},
   "source": [
    "### Specifying start and end positions\n",
    "\n",
    "All four methods -- `str.find()`, `str.index()`, `str.rfind()`, and `str.rindex()` -- accept optional `start` and `end` arguments to limit the search to a specific portion of the string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7f8a9b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"banana\"\n",
    "\n",
    "# Find the first \"a\" at or after index 2\n",
    "print(text.find(\"a\", 2))      # 3\n",
    "\n",
    "# Find \"a\" from index 2 up to, but not including, index 4\n",
    "print(text.find(\"a\", 2, 4))   # 3\n",
    "\n",
    "# Find \"a\" from index 4 up to, but not including, index 5\n",
    "print(text.find(\"a\", 4, 5))   # -1 -- only \"n\" is in that range"
   ]
  },
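  {
   "cell_type": "markdown",
   "id": "2f3e4d5c",
   "metadata": {},
   "source": [
    "Passing the position just after the previous match as `start` lets `str.find()` walk through a string one match at a time. The sketch below collects the index of every non-overlapping occurrence; the helper name `find_all()` is an illustrative assumption, not a built-in string method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f4e5d6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_all(text: str, substring: str) -> list[int]:\n",
    "    \"\"\"Return the start index of every non-overlapping occurrence.\n",
    "\n",
    "    Hypothetical helper, for illustration only.\n",
    "    \"\"\"\n",
    "    if not substring:\n",
    "        # An empty substring matches everywhere; this sketch reports nothing\n",
    "        return []\n",
    "    positions = []\n",
    "    start = 0\n",
    "    while True:\n",
    "        position = text.find(substring, start)\n",
    "        if position == -1:\n",
    "            break\n",
    "        positions.append(position)\n",
    "        # Resume the search just after the end of this match\n",
    "        start = position + len(substring)\n",
    "    return positions\n",
    "\n",
    "\n",
    "print(find_all(\"banana\", \"a\"))     # [1, 3, 5]\n",
    "print(find_all(\"banana\", \"an\"))    # [1, 3]\n",
    "print(find_all(\"banana\", \"x\"))     # []"
   ]
  },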
  {
   "cell_type": "markdown",
   "id": "f8a9b0c1",
   "metadata": {},
   "source": [
    "## Testing beginnings and endings\n",
    "\n",
    "Python provides `str.startswith()` and `str.endswith()` for checking whether a string begins or ends with a particular substring."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9b0c1d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://www.python.org\"\n",
    "\n",
    "print(url.startswith(\"https\"))     # True\n",
    "print(url.startswith(\"http://\"))   # False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1d2e3f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = \"report.pdf\"\n",
    "\n",
    "print(filename.endswith(\".pdf\"))    # True\n",
    "print(filename.endswith(\".docx\"))   # False"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2e3f4a5",
   "metadata": {},
   "source": [
    "### Using tuples for multiple options\n",
    "\n",
    "Both `str.startswith()` and `str.endswith()` accept a **tuple** of strings to test against. The method returns `True` if the string matches **any** of the options in the tuple."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3f4a5b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = \"photo.jpg\"\n",
    "\n",
    "# Check for common image file extensions\n",
    "is_image = filename.endswith((\".jpg\", \".jpeg\", \".png\", \".gif\"))\n",
    "print(is_image)   # True\n",
    "\n",
    "# Check for common web protocols\n",
    "url = \"ftp://files.example.com\"\n",
    "is_web = url.startswith((\"http://\", \"https://\"))\n",
    "print(is_web)     # False -- this is an FTP URL"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6c7d8e9",
   "metadata": {},
   "source": [
    "## Counting occurrences\n",
    "\n",
    "The `str.count()` method returns the number of **non-overlapping** occurrences of a substring within a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7d8e9f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"she sells sea shells by the sea shore\"\n",
    "\n",
    "print(text.count(\"sea\"))     # 2\n",
    "print(text.count(\"s\"))       # 8\n",
    "print(text.count(\"xyz\"))     # 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Non-overlapping matches\n",
    "text = \"aaaa\"\n",
    "\n",
    "# Counting \"aa\" -- finds 2, not 3, because matches do not overlap\n",
    "print(text.count(\"aa\"))   # 2"
   ]
  },
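  {
   "cell_type": "markdown",
   "id": "4f5e6d7c",
   "metadata": {},
   "source": [
    "If overlapping matches should be counted, one possible approach is to call `str.find()` repeatedly and advance the `start` position by a single character after each match. This is a minimal sketch, and the helper name `count_overlapping()` is an illustrative assumption rather than a built-in method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f6e7d8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_overlapping(text: str, substring: str) -> int:\n",
    "    \"\"\"Count occurrences of substring, allowing matches to overlap.\n",
    "\n",
    "    Hypothetical helper, for illustration only.\n",
    "    \"\"\"\n",
    "    if not substring:\n",
    "        # This sketch treats an empty substring as having no matches\n",
    "        return 0\n",
    "    count = 0\n",
    "    position = text.find(substring)\n",
    "    while position != -1:\n",
    "        count += 1\n",
    "        # Advance by one character so the next match may overlap this one\n",
    "        position = text.find(substring, position + 1)\n",
    "    return count\n",
    "\n",
    "\n",
    "print(\"aaaa\".count(\"aa\"))                 # 2\n",
    "print(count_overlapping(\"aaaa\", \"aa\"))    # 3"
   ]
  },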
  {
   "cell_type": "markdown",
   "id": "b2c3d4e6",
   "metadata": {},
   "source": [
    "## Character testing methods\n",
    "\n",
    "Python strings include several methods that test whether **all** characters in the string satisfy a particular condition. These methods return `True` or `False` and are especially useful for input validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3d4e5f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"hello\".isalpha())       # True &ndash; all alphabetic characters\n",
    "print(\"hello123\".isalpha())    # False &ndash; contains digits\n",
    "\n",
    "print(\"12345\".isdigit())       # True &ndash; all digit characters\n",
    "print(\"12.34\".isdigit())       # False &ndash; contains a dot\n",
    "\n",
    "print(\"hello123\".isalnum())    # True &ndash; all alphanumeric characters\n",
    "print(\"hello 123\".isalnum())   # False &ndash; contains a space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5f6a7b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"HELLO\".isupper())       # True\n",
    "print(\"Hello\".isupper())       # False\n",
    "\n",
    "print(\"hello\".islower())       # True\n",
    "print(\"Hello\".islower())       # False\n",
    "\n",
    "print(\"Hello World\".istitle())  # True\n",
    "print(\"Hello world\".istitle())  # False"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8c9d0e2",
   "metadata": {},
   "source": [
    "### Practical use: input validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c9d0e1f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def is_valid_username(username: str) -> bool:\n",
    "    \"\"\"Check whether a username is valid.\n",
    "\n",
    "    A valid username contains only alphanumeric characters\n",
    "    and is between 3 and 20 characters long.\n",
    "    \"\"\"\n",
    "    return username.isalnum() and 3 <= len(username) <= 20\n",
    "\n",
    "\n",
    "print(is_valid_username(\"alice42\"))        # True\n",
    "print(is_valid_username(\"ab\"))             # False -- too short\n",
    "print(is_valid_username(\"hello world\"))    # False -- contains a space\n",
    "print(is_valid_username(\"user@name\"))      # False -- contains @"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0e1f2a4",
   "metadata": {},
   "source": [
    "## Practical examples\n",
    "\n",
    "Let us now combine the searching methods you have learned to solve some real-world problems.\n",
    "\n",
    "### Extracting file extensions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1f2a3b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_extension(filename: str) -> str:\n",
    "    \"\"\"Return the file extension, including the leading dot.\n",
    "\n",
    "    Returns an empty string if no extension is found.\n",
    "    \"\"\"\n",
    "    dot_position = filename.rfind(\".\")\n",
    "    if dot_position == -1:\n",
    "        return \"\"\n",
    "    return filename[dot_position:]\n",
    "\n",
    "\n",
    "print(get_extension(\"report.pdf\"))          # .pdf\n",
    "print(get_extension(\"archive.tar.gz\"))      # .gz\n",
    "print(get_extension(\"README\"))              # (empty string)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2a3b4c6",
   "metadata": {},
   "source": [
    "### Basic email format validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3b4c5d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def is_valid_email(email: str) -> bool:\n",
    "    \"\"\"Perform a basic check on the format of an email address.\n",
    "\n",
    "    Checks that the email contains exactly one @ symbol,\n",
    "    has text before and after it, and the domain contains a dot.\n",
    "    \"\"\"\n",
    "    if email.count(\"@\") != 1:\n",
    "        return False\n",
    "\n",
    "    local_part, domain = email.split(\"@\")\n",
    "\n",
    "    if not local_part or not domain:\n",
    "        return False\n",
    "\n",
    "    if \".\" not in domain:\n",
    "        return False\n",
    "\n",
    "    if domain.startswith(\".\") or domain.endswith(\".\"):\n",
    "        return False\n",
    "\n",
    "    return True\n",
    "\n",
    "\n",
    "print(is_valid_email(\"user@example.com\"))       # True\n",
    "print(is_valid_email(\"user@.com\"))               # False\n",
    "print(is_valid_email(\"user@@example.com\"))       # False\n",
    "print(is_valid_email(\"@example.com\"))            # False"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4c5d6e8",
   "metadata": {},
   "source": [
    "### Counting words in text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5d6e7f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_words(text: str) -> int:\n",
    "    \"\"\"Count the number of words in a string.\n",
    "\n",
    "    Words are separated by whitespace. Leading and trailing\n",
    "    whitespace is ignored.\n",
    "    \"\"\"\n",
    "    return len(text.split())\n",
    "\n",
    "\n",
    "print(count_words(\"The quick brown fox\"))           # 4\n",
    "print(count_words(\"  lots   of   spaces  \"))        # 3\n",
    "print(count_words(\"\"))                               # 0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6e7f8aa",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "Now it is time to practise what you have learned. Try to complete each exercise before looking at the solution.\n",
    "\n",
    "### Exercise 1: Extracting a domain name\n",
    "\n",
    "Write a function called `extract_domain()` that takes a URL string and returns the domain name. For example, `extract_domain(\"https://www.python.org/docs\")` should return `\"www.python.org\"`. You can assume the URL always starts with `\"http://\"` or `\"https://\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7f8a9b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1: Write your solution here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8a9b0c2",
   "metadata": {},
   "source": [
    "### Exercise 2: Counting vowels\n",
    "\n",
    "Write a function called `count_vowels()` that takes a string and returns the number of vowels (a, e, i, o, and u) it contains. The function should be case-insensitive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9b0c1d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2: Write your solution here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0c1d2e4",
   "metadata": {},
   "source": [
    "### Exercise 3: Identifying a pangram\n",
    "\n",
    "A pangram is a sentence that contains every letter of the alphabet at least once. Write a function called `is_pangram()` that takes a string and returns `True` if it is a pangram and `False` otherwise.\n",
    "\n",
    "For example, `is_pangram(\"The quick brown fox jumps over the lazy dog\")` should return `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1d2e3f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3: Write your solution here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4a5b6c8",
   "metadata": {},
   "source": [
    "### Solutions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6c7d8ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution 1: Extracting a domain name\n",
    "def extract_domain(url: str) -> str:\n",
    "    \"\"\"Extract the domain name from a URL.\"\"\"\n",
    "    start = url.find(\"//\") + 2\n",
    "    end = url.find(\"/\", start)\n",
    "    if end == -1:\n",
    "        return url[start:]\n",
    "    return url[start:end]\n",
    "\n",
    "\n",
    "print(extract_domain(\"https://www.python.org/docs\"))    # www.python.org\n",
    "print(extract_domain(\"http://example.com\"))             # example.com"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7d8e9f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution 2: Counting vowels\n",
    "def count_vowels(text: str) -> int:\n",
    "    \"\"\"Count the number of vowels in a string (case-insensitive).\"\"\"\n",
    "    vowels = \"aeiou\"\n",
    "    return sum(1 for char in text.lower() if char in vowels)\n",
    "\n",
    "\n",
    "print(count_vowels(\"Hello World\"))    # 3\n",
    "print(count_vowels(\"AEIOU\"))          # 5\n",
    "print(count_vowels(\"rhythm\"))         # 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8e9f0a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution 3: Identifying a pangram\n",
    "def is_pangram(text: str) -> bool:\n",
    "    \"\"\"Check whether the given text is a pangram.\"\"\"\n",
    "    alphabet = \"abcdefghijklmnopqrstuvwxyz\"\n",
    "    lower_text = text.lower()\n",
    "    return all(letter in lower_text for letter in alphabet)\n",
    "\n",
    "\n",
    "print(is_pangram(\"The quick brown fox jumps over the lazy dog\"))   # True\n",
    "print(is_pangram(\"Hello World\"))                                   # False"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9f0a1b3",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, you learned how to search within strings and test for the presence of substrings using a variety of built-in methods and operators.\n",
    "\n",
    "Here is a summary of the key points:\n",
    "\n",
    "| Method / Operator | Purpose | Returns |\n",
    "|---|---|---|\n",
    "| `in` / `not in` | Test for substring presence or absence | `True` or `False` |\n",
    "| `str.find()` | Find the first occurrence of a substring | Index, or `-1` if not found |\n",
    "| `str.index()` | Find the first occurrence of a substring | Index, or raises `ValueError` |\n",
    "| `str.rfind()` | Find the last occurrence of a substring | Index, or `-1` if not found |\n",
    "| `str.rindex()` | Find the last occurrence of a substring | Index, or raises `ValueError` |\n",
    "| `str.startswith()` | Test whether a string starts with a prefix | `True` or `False` |\n",
    "| `str.endswith()` | Test whether a string ends with a suffix | `True` or `False` |\n",
    "| `str.count()` | Count non-overlapping occurrences | Integer count |\n",
    "\n",
    "### What you have covered in this tutorial series\n",
    "\n",
    "This is the final tutorial in the introductory series on string processing with Python. Across the four tutorials, you have learned:\n",
    "\n",
    "1. **String basics** -- creating strings, indexing, slicing, and immutability\n",
    "2. **String methods** -- transforming and manipulating text with built-in methods\n",
    "3. **String formatting** -- producing well-formatted output with f-strings and `str.format()`\n",
    "4. **String searching** -- finding substrings, testing beginnings and endings, and validating input\n",
    "\n",
    "### Next steps\n",
    "\n",
    "Now that you have a solid foundation in string processing, you can explore further:\n",
    "\n",
    "- Browse the [recipes](https://agilearn.co.uk/guides/string-processing/recipes/index) for practical, task-oriented guides that solve specific string processing problems\n",
    "- Consult the [reference documentation](https://agilearn.co.uk/guides/string-processing/reference/index) for detailed information on string methods and the `string` module\n",
    "- Read the [explanations](https://agilearn.co.uk/guides/string-processing/concepts/index) for deeper understanding of topics such as string immutability, Unicode, and encoding\n",
    "\n",
    "Well done on completing the tutorial series &ndash; you are well equipped to handle a wide variety of string processing tasks in Python!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}