{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b7399328",
   "metadata": {},
   "source": "# Generator expressions and `itertools`\n\nGenerator functions (previous notebook) are powerful, but often you want something smaller — \"multiply every element by two\", \"keep the odds\", \"pair these up\". For those cases Python has **generator expressions**, the inline cousin of list comprehensions. And for everything more structured than \"map\" and \"filter\" — grouping, windowing, chaining, combining — the standard library's `itertools` module already has the generator written for you.\n\nThis notebook is the practical toolkit."
  },
  {
   "cell_type": "markdown",
   "id": "9957f3f3",
   "metadata": {},
   "source": "## Generator expressions\n\nA generator expression looks exactly like a list comprehension, but with round brackets instead of square ones. Square brackets mean \"build a list now\". Round brackets mean \"build a generator — produce values lazily as consumed\"."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9dec0783",
   "metadata": {},
   "outputs": [],
   "source": "squares_list = [x * x for x in range(5)]       # builds a list immediately\nsquares_gen  = (x * x for x in range(5))       # builds a generator\n\nprint(squares_list)     # [0, 1, 4, 9, 16]\nprint(squares_gen)      # <generator object ...>\nprint(list(squares_gen))"
  },
  {
   "cell_type": "markdown",
   "id": "448b2241",
   "metadata": {},
   "source": "They support the same machinery as list comprehensions — filter clauses, multiple `for` clauses, conditional expressions — just evaluated lazily."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02d44760",
   "metadata": {},
   "outputs": [],
   "source": "xs = range(20)\n\n# Squares of odd numbers, lazy:\nodd_squares = (x * x for x in xs if x % 2)\nprint(list(odd_squares))\n\n# Cartesian pairs, lazy:\npairs = ((a, b) for a in 'ab' for b in (1, 2))\nprint(list(pairs))"
  },
  {
   "cell_type": "markdown",
   "id": "de04d3fe",
   "metadata": {},
   "source": "### When to use a genexp instead of a listcomp\n\nPrefer a generator expression when:\n\n- You only need to iterate the values once, e.g. passing straight into `sum`, `max`, `any`, `all`, or `str.join`.\n- The source is large and you don't want the intermediate list in memory.\n- You're building a pipeline of transformations.\n\nPrefer a list comprehension when:\n\n- You need to iterate the result multiple times.\n- You need indexing, slicing, or `len`.\n- You're debugging and want to `print` the intermediate values."
  },
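  {
   "cell_type": "markdown",
   "id": "3e1a5c70",
   "metadata": {},
   "source": "A rough way to see the memory difference is `sys.getsizeof`. This is only a sketch: `getsizeof` reports the size of the container object itself, not of the integers a list refers to, and the exact numbers depend on your Python build."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d42b6f1",
   "metadata": {},
   "outputs": [],
   "source": "import sys\n\nas_list = [x * x for x in range(100_000)]   # all 100,000 results stored\nas_gen = (x * x for x in range(100_000))    # nothing computed yet\n\n# The list's size grows with the number of elements;\n# the generator object stays small however long the range is.\nprint(sys.getsizeof(as_list))\nprint(sys.getsizeof(as_gen))"
  },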
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "323dd3c5",
   "metadata": {},
   "outputs": [],
   "source": "# Good use of a genexp: no intermediate list built\nprint(sum(x * x for x in range(1_000_000)))\n\n# Even nicer — the parentheses can be omitted when a genexp is the sole\n# argument of a function call.\nprint(max(len(word) for word in 'one two three four'.split()))"
  },
  {
   "cell_type": "markdown",
   "id": "f81c4af0",
   "metadata": {},
   "source": "### Watch out: single-use\n\nA genexp is an iterator. The same one-shot rule applies."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40e8e074",
   "metadata": {},
   "outputs": [],
   "source": "squares = (x * x for x in range(4))\nprint(list(squares))   # [0, 1, 4, 9]\nprint(list(squares))   # [] — already consumed"
  },
  {
   "cell_type": "markdown",
   "id": "ea8bbb8d",
   "metadata": {},
   "source": "If you need the values twice, materialise once into a list or tuple, or wrap the generation in a function you can call again."
  },
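  {
   "cell_type": "markdown",
   "id": "1f6e09ab",
   "metadata": {},
   "source": "A minimal sketch of both options. The helper name `squares_up_to` is invented for this illustration."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a40c7d35",
   "metadata": {},
   "outputs": [],
   "source": "# Option 1: materialise once, then reuse the stored values.\nsquares = tuple(x * x for x in range(4))\nprint(sum(squares), max(squares))   # 14 9\n\n# Option 2: wrap the generator expression in a function (hypothetical\n# helper name) so that each call builds a fresh one-shot generator.\ndef squares_up_to(n):\n    return (x * x for x in range(n))\n\nprint(sum(squares_up_to(4)), max(squares_up_to(4)))   # 14 9"
  },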
  {
   "cell_type": "markdown",
   "id": "59878b3b",
   "metadata": {},
   "source": "## `itertools` — the toolkit\n\n`itertools` ships with around twenty iterator building blocks. We won't cover every one here (see the [reference page](https://agilearn.co.uk/guides/iterators-and-generators/reference/itertools-cheatsheet)), but the ones below come up repeatedly in real code."
  },
  {
   "cell_type": "markdown",
   "id": "43b259e6",
   "metadata": {},
   "source": "### `islice` — lazy slicing\n\nSlicing with `[:]` doesn't work on arbitrary iterators; it only works on sequences. `islice(iterable, stop)` or `islice(iterable, start, stop[, step])` gives you slice-like behaviour for any iterator. Unlike sequence slicing, it doesn't accept negative values for `start`, `stop`, or `step`."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9662b319",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import islice\n\ndef naturals():\n    n = 1\n    while True:\n        yield n\n        n += 1\n\nprint(list(islice(naturals(), 5)))        # first 5\nprint(list(islice(naturals(), 10, 15)))   # indices 10..14\nprint(list(islice(naturals(), 0, 20, 3))) # every 3rd"
  },
  {
   "cell_type": "markdown",
   "id": "95328b42",
   "metadata": {},
   "source": "`islice` is the standard way to bound an infinite generator."
  },
  {
   "cell_type": "markdown",
   "id": "2405061f",
   "metadata": {},
   "source": "### `chain` — concatenation without an intermediate list\n\n`chain(a, b, c, ...)` yields everything from `a`, then everything from `b`, etc. No copy, no combined list in memory."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f77fd61",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import chain\n\npart1 = [1, 2, 3]\npart2 = range(10, 13)\npart3 = (x * x for x in [4, 5])\n\nfor x in chain(part1, part2, part3):\n    print(x, end=' ')\nprint()"
  },
  {
   "cell_type": "markdown",
   "id": "c5fa1ada",
   "metadata": {},
   "source": "There's also `chain.from_iterable(it_of_its)` for chaining an iterable *of* iterables — especially useful for flattening one level deep."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "948519c3",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import chain\nrows = [[1, 2, 3], [4, 5], [6]]\nprint(list(chain.from_iterable(rows)))"
  },
  {
   "cell_type": "markdown",
   "id": "6530a627",
   "metadata": {},
   "source": "### `groupby` — runs of equal values\n\n`groupby(iterable, key=...)` groups *adjacent* equal items. The important word is *adjacent*: it doesn't sort for you. If you want grouped output from an unsorted source, sort first by the same key."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42510e08",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import groupby\n\n# Already sorted by key — groupby just picks out the runs\nevents = [\n    ('2024-01', 'login'),\n    ('2024-01', 'click'),\n    ('2024-02', 'login'),\n    ('2024-02', 'login'),\n    ('2024-03', 'click'),\n]\n\nfor month, items in groupby(events, key=lambda e: e[0]):\n    # items is an iterator — consume it inside the loop\n    print(month, [action for _, action in items])"
  },
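  {
   "cell_type": "markdown",
   "id": "7fd61f93",
   "metadata": {},
   "source": "A common bug: holding on to a group's `items` iterator and reading it after the outer loop has moved on. Every group shares the underlying iterator with `groupby`, so once `groupby` advances to the next group, the previous one can no longer produce its items. If you need a group's contents later, copy them with `list(items)` while that group is still the current one."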
  },
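  {
   "cell_type": "markdown",
   "id": "5b93e2c8",
   "metadata": {},
   "source": "Here is a small sketch of that pitfall and its fix, using made-up sample data. The exact contents of the stale groups are deliberately not printed; the point is only that they no longer match the real groups."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7f01a4d",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import groupby\n\ndata = ['a', 'a', 'b', 'c', 'c']   # made-up sample data\n\n# Pitfall: keep the group iterators and read them only after the loop.\n# groupby has already moved past those items by then.\nstale = [(key, group) for key, group in groupby(data)]\nstale_contents = [(key, list(group)) for key, group in stale]\n\n# Fix: copy each group into a list while it is still the current group.\nfresh = [(key, list(group)) for key, group in groupby(data)]\n\nprint(fresh)                     # [('a', ['a', 'a']), ('b', ['b']), ('c', ['c', 'c'])]\nprint(stale_contents == fresh)   # False"
  },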
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a7f1f9a",
   "metadata": {},
   "outputs": [],
   "source": "# Sort first if your data isn't already grouped:\nunsorted = [('a', 1), ('b', 2), ('a', 3), ('b', 4)]\nsorted_pairs = sorted(unsorted, key=lambda p: p[0])\n\nfor key, group in groupby(sorted_pairs, key=lambda p: p[0]):\n    print(key, sum(v for _, v in group))"
  },
  {
   "cell_type": "markdown",
   "id": "41ae1181",
   "metadata": {},
   "source": "### `zip` — parallel iteration (built-in, not in itertools)\n\n`zip` isn't in `itertools` but it's the same family. It pairs elements from multiple iterables, stopping at the shortest. Since Python 3.10 you can pass `strict=True` to fail loudly if lengths disagree."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "122a86a4",
   "metadata": {},
   "outputs": [],
   "source": "names  = ['Ada', 'Grace', 'Linus']\nscores = [95, 88, 72]\n\nfor name, score in zip(names, scores):\n    print(f'{name}: {score}')\n\n# strict=True — catch the bug instead of silently truncating\ntry:\n    list(zip([1, 2, 3], [10, 20], strict=True))\nexcept ValueError as e:\n    print(f'caught: {e}')"
  },
  {
   "cell_type": "markdown",
   "id": "6341540a",
   "metadata": {},
   "source": "### `zip_longest` — pad instead of truncate\n\n`itertools.zip_longest(*iterables, fillvalue=...)` keeps going until the *longest* iterable is exhausted, padding missing values with `fillvalue` (default `None`)."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c146b9d",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import zip_longest\n\na = [1, 2, 3, 4]\nb = ['x', 'y']\nprint(list(zip_longest(a, b, fillvalue='?')))"
  },
  {
   "cell_type": "markdown",
   "id": "9f093590",
   "metadata": {},
   "source": "### `accumulate` — running totals (and other folds)\n\n`accumulate(iterable)` yields a running sum. Pass `func=` to accumulate with a different binary operation (max, multiplication, etc.)."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39a8ef76",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import accumulate\nimport operator\n\nprint(list(accumulate([1, 2, 3, 4, 5])))                    # running sum\nprint(list(accumulate([1, 2, 3, 4, 5], operator.mul)))      # running product\nprint(list(accumulate([3, 1, 4, 1, 5, 9, 2, 6], max)))      # running max"
  },
  {
   "cell_type": "markdown",
   "id": "da8f05fc",
   "metadata": {},
   "source": "### `takewhile` / `dropwhile` — conditional prefixes and suffixes\n\n- `takewhile(pred, iter)`: yield values *until* the predicate first returns false, then stop.\n- `dropwhile(pred, iter)`: skip values *while* the predicate is true, then yield the rest."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee334927",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import takewhile, dropwhile\n\nvalues = [1, 3, 5, 4, 7, 9]\nprint(list(takewhile(lambda x: x % 2, values)))   # [1, 3, 5]\nprint(list(dropwhile(lambda x: x % 2, values)))   # [4, 7, 9]"
  },
  {
   "cell_type": "markdown",
   "id": "d6b80c9d",
   "metadata": {},
   "source": "These differ from `filter` in one crucial way: they care about *position*, not just match. `filter` would keep all the odds; `takewhile` stops at the first non-odd."
  },
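  {
   "cell_type": "markdown",
   "id": "2c8b6d19",
   "metadata": {},
   "source": "The difference is easy to see side by side. This sketch reuses the `values` list from the cell above; `is_odd` is a helper named here for readability. It also shows one consequence worth knowing: `takewhile` has to read the first failing element to know when to stop, so that element is consumed from the underlying iterator."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94a3f5e0",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import takewhile\n\nvalues = [1, 3, 5, 4, 7, 9]\n\ndef is_odd(x):\n    return x % 2 == 1\n\nprint(list(filter(is_odd, values)))      # [1, 3, 5, 7, 9]\nprint(list(takewhile(is_odd, values)))   # [1, 3, 5]\n\n# takewhile must read the 4 to discover that it should stop,\n# so the 4 is consumed and is not left in the source iterator.\nsource = iter(values)\nprefix = list(takewhile(is_odd, source))\nrest = list(source)\nprint(prefix, rest)                      # [1, 3, 5] [7, 9]"
  },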
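  {
   "cell_type": "markdown",
   "id": "1462e753",
   "metadata": {},
   "source": "### `tee` — branching an iterator\n\n`tee(iterable, n)` turns one iterator into `n` independent ones. Useful when you need to iterate the same stream twice but don't want to materialise it. Once you have called `tee`, use only the branches: advancing the original iterator directly would make the branches miss those values."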
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b648e63",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import tee\n\ndef events():\n    for x in [1, 2, 3, 4, 5]:\n        yield x\n\nfor_sum, for_max = tee(events(), 2)\nprint(sum(for_sum), max(for_max))"
  },
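  {
   "cell_type": "markdown",
   "id": "02f36d44",
   "metadata": {},
   "source": "Caveat: if one branch gets far ahead of the other, `tee` has to buffer all the values in between. That is what happens in the cell above: `sum` drains the first branch completely before `max` starts, so `tee` ends up holding every value, and a plain `list` would do the same job more simply. For streams that don't fit in memory, don't `tee`; restructure so you iterate once."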
  },
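  {
   "cell_type": "markdown",
   "id": "6d1e8b47",
   "metadata": {},
   "source": "One way to restructure the `sum`/`max` example so that it iterates once is to update both summaries inside a single loop. This is only a sketch of the idea, with `total` and `largest` as illustrative names."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b05c92fa",
   "metadata": {},
   "outputs": [],
   "source": "def events():\n    for x in [1, 2, 3, 4, 5]:\n        yield x\n\n# One pass, constant memory: update both summaries as each value arrives.\ntotal = 0\nlargest = None\nfor x in events():\n    total += x\n    if largest is None or x > largest:\n        largest = x\n\nprint(total, largest)   # 15 5"
  },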
  {
   "cell_type": "markdown",
   "id": "dd94d08e",
   "metadata": {},
   "source": "## Combining the pieces — a short pipeline\n\nHere's a pattern that shows up constantly: read records, filter, transform, group, summarise. The filter and transform stages are lazy; the only intermediate list is the one `sorted` builds so that `groupby` sees each month as one contiguous run."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eddaac4d",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import groupby\n\ntransactions = [\n    ('2024-01-15', 'food',      12.50),\n    ('2024-01-18', 'transport',  3.20),\n    ('2024-01-20', 'food',      24.00),\n    ('2024-02-02', 'food',      10.00),\n    ('2024-02-05', 'transport', 12.80),\n    ('2024-02-12', 'transport',  0.00),   # will be filtered out\n]\n\n# 1. filter — drop zero-value rows\nnonzero = (t for t in transactions if t[2] > 0)\n# 2. key by month\nkeyed = ((t[0][:7], t) for t in nonzero)\n# 3. sort by month so groupby sees contiguous runs\nby_month = sorted(keyed, key=lambda p: p[0])\n# 4. group and summarise\nfor month, rows in groupby(by_month, key=lambda p: p[0]):\n    total = sum(t[2] for _, t in rows)\n    print(f'{month}: £{total:.2f}')"
  },
  {
   "cell_type": "markdown",
   "id": "5315e3b2",
   "metadata": {},
   "source": "Every stage except `sorted` is a generator expression or an `itertools` call. The `sorted` step is the one place we *do* need to materialise, because `groupby` only groups adjacent items. This is the typical shape of real pipelines: a few lazy stages, one bulk operation in the middle, and a consumer at the end."
  },
  {
   "cell_type": "markdown",
   "id": "8ab86fc9",
   "metadata": {},
   "source": "## Quick check — build a small pipeline\n\nGiven an iterable of `(timestamp, level, message)` log tuples where `level` is `'DEBUG'`, `'INFO'`, `'WARN'`, or `'ERROR'`:\n\n1. Drop `'DEBUG'` entries.\n2. Take only the first 1000 rows remaining.\n3. Count how many there are of each level.\n\nDo it with a generator expression for the filter, `islice` for the cap, and a single pass that updates a `Counter`."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "516c1588",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import islice\nfrom collections import Counter\n\n# Sample data — pretend this is a multi-gigabyte log file\ndef fake_logs():\n    import random\n    levels = ['DEBUG', 'INFO', 'WARN', 'ERROR']\n    weights = [0.5, 0.35, 0.12, 0.03]\n    rnd = random.Random(42)\n    for i in range(5000):\n        yield (i, rnd.choices(levels, weights=weights)[0], f'msg {i}')\n\n# Your turn — fill these:\n\nnondebug = ...          # generator expression: drop DEBUG\ncapped   = ...          # islice to the first 1000\ncounts   = ...          # Counter over capped\n\n# print(counts)"
  },
  {
   "cell_type": "markdown",
   "id": "ef4c569e",
   "metadata": {},
   "source": "### Working solution"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd47d3ec",
   "metadata": {},
   "outputs": [],
   "source": "from itertools import islice\nfrom collections import Counter\n\ndef fake_logs():\n    import random\n    levels = ['DEBUG', 'INFO', 'WARN', 'ERROR']\n    weights = [0.5, 0.35, 0.12, 0.03]\n    rnd = random.Random(42)\n    for i in range(5000):\n        yield (i, rnd.choices(levels, weights=weights)[0], f'msg {i}')\n\nnondebug = (row for row in fake_logs() if row[1] != 'DEBUG')\ncapped   = islice(nondebug, 1000)\ncounts   = Counter(row[1] for row in capped)\n\nprint(counts)"
  },
  {
   "cell_type": "markdown",
   "id": "5f7d7f0a",
   "metadata": {},
   "source": "## Summary\n\n- `(expr for x in xs)` is a generator expression, the lazy sister of a list comprehension. Use it when you'll only iterate once.\n- `itertools` is full of pre-written iterator combinators. The ones you'll reach for repeatedly: `islice`, `chain`, `groupby`, `accumulate`, `takewhile`/`dropwhile`, `tee`, and `zip_longest`, alongside the built-in `zip`.\n- These compose cleanly with each other and with generator expressions. The shape of a typical pipeline is filter → transform → group → summarise, with `sorted` in the middle when you need it.\n\nNext: **custom iterators**, for when a generator function isn't the right tool and you want the explicit control of writing the iterator class yourself."
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}