{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Processes and parallelism\n",
    "\n",
    "Threads can't speed up computation, because the GIL lets only one thread run Python at a time. **Processes** get around this completely: each process is a separate Python interpreter with its own memory and its own GIL, so on an 8-core machine, eight processes can run Python in genuine parallel. This is the tool for **CPU-bound** work — and the only one of the three that delivers real multi-core speed.\n",
    "\n",
    "The good news: the high-level API is nearly identical to the thread pool. The catches are all about the fact that processes *don't share memory*.\n",
    "\n",
    "> These examples spawn real OS processes and won't run inside the in-browser sandbox — run them locally, ideally saved as a `.py` file (see the `__main__` note below)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Same interface, different engine\n",
    "\n",
    "`ProcessPoolExecutor` has the same `submit` / `map` / `as_completed` interface as `ThreadPoolExecutor`. Often you can switch a CPU-bound workload from threads to processes by changing one word."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": [
     "no-run"
    ]
   },
   "execution_count": null,
   "outputs": [],
   "source": [
    "from concurrent.futures import ProcessPoolExecutor\n",
    "import time\n",
    "\n",
    "def cpu_task(n):\n",
    "    \"\"\"Busy work: sum of squares. The core is pinned the whole time.\"\"\"\n",
    "    total = 0\n",
    "    for i in range(10_000_000):\n",
    "        total += i * i\n",
    "    return total\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    start = time.perf_counter()\n",
    "    with ProcessPoolExecutor() as pool:        # defaults to one worker per core\n",
    "        results = list(pool.map(cpu_task, range(4)))\n",
    "    print(f'4 tasks across processes: {time.perf_counter() - start:.2f}s')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare that to running the four sequentially. On a multi-core machine the process version is several times faster — close to Nx on N cores, minus overhead. Try the same swap with `ThreadPoolExecutor` and you'll see *no* speedup, because the GIL serialises the computation."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": [
     "no-run"
    ]
   },
   "execution_count": null,
   "outputs": [],
   "source": [
    "if __name__ == '__main__':\n",
    "    start = time.perf_counter()\n",
    "    for n in range(4):\n",
    "        cpu_task(n)\n",
    "    print(f'4 tasks sequentially:     {time.perf_counter() - start:.2f}s')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The `if __name__ == \"__main__\"` guard is mandatory\n",
    "\n",
    "You've probably seen this idiom treated as optional politeness. With `multiprocessing` (and `ProcessPoolExecutor`, which uses it) on Windows and macOS, it is **required**.\n",
    "\n",
    "To create a worker, Python starts a fresh interpreter and **imports your script** into it. Without the guard, that import re-runs your top-level pool-creating code in every child — which tries to spawn more children, which import the script, which spawn more children... an infinite cascade that crashes with a clear error. The guard ensures the pool-launching code runs only in the original process, not on import.\n",
    "\n",
    "This is also why these examples belong in a `.py` file run directly, not pasted into a REPL or notebook: the children need a real module to import."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Everything must be picklable\n",
    "\n",
    "Processes don't share memory, so arguments and return values are **serialised with `pickle`**, sent across a pipe, and reconstructed in the worker. Anything you pass in or return must therefore be picklable.\n",
    "\n",
    "The usual casualties:\n",
    "\n",
    "- **`lambda`s and local (nested) functions** can't be pickled — the function you give to `map`/`submit` must be defined at module top level.\n",
    "- Open files, sockets, locks, and database connections can't cross the boundary.\n",
    "- Large arguments and results are *copied* between processes, which costs time and memory."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": [
     "no-run"
    ]
   },
   "execution_count": null,
   "outputs": [],
   "source": [
    "# This FAILS: a lambda can't be pickled to send to a worker process.\n",
    "#   with ProcessPoolExecutor() as pool:\n",
    "#       pool.map(lambda x: x * x, range(4))     # PicklingError\n",
    "#\n",
    "# Define the function at module level instead:\n",
    "def square(x):\n",
    "    return x * x\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    with ProcessPoolExecutor() as pool:\n",
    "        print(list(pool.map(square, range(8))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overhead is real — chunk your work\n",
    "\n",
    "Spawning processes and pickling data back and forth costs milliseconds. If each task is tiny, that overhead dwarfs the work and the parallel version is *slower* than just looping. The fix is to give each worker a meaningful **chunk** of work rather than one trivial item at a time.\n",
    "\n",
    "`executor.map` takes a `chunksize` argument that batches inputs per worker, which cuts the per-item overhead dramatically for large, cheap iterables."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": [
     "no-run"
    ]
   },
   "execution_count": null,
   "outputs": [],
   "source": [
    "def is_prime(n):\n",
    "    if n < 2:\n",
    "        return False\n",
    "    for i in range(2, int(n ** 0.5) + 1):\n",
    "        if n % i == 0:\n",
    "            return False\n",
    "    return True\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    numbers = range(1_000_000, 1_000_500)\n",
    "    with ProcessPoolExecutor() as pool:\n",
    "        # chunksize batches inputs so each worker gets ~50 at a time\n",
    "        primes = [n for n, ok in zip(numbers, pool.map(is_prime, numbers, chunksize=50)) if ok]\n",
    "    print(f'found {len(primes)} primes')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A rule of thumb: parallelise across processes only when each unit of work takes meaningfully longer than the cost of shipping it to a worker — think milliseconds of compute at minimum, and prefer chunks. For thousands of microsecond-sized items, stay sequential or batch them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sharing results, not state\n",
    "\n",
    "Because memory is isolated, the clean pattern is exactly the one from the thread pool: each task **returns** its result, and you combine the returns in the parent. Don't try to have workers mutate a shared list — they each get their own *copy*, and your changes vanish.\n",
    "\n",
    "When you genuinely need shared mutable state across processes (a counter several workers update, say), `multiprocessing` offers `Value`, `Array`, and `Manager` objects backed by shared memory or a server process. They're slower and fiddlier than returning values, so reach for them only when the algorithm truly requires it — see the [threading and multiprocessing reference](https://agilearn.co.uk/guides/concurrency/reference/threading-and-multiprocessing-reference)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Measure, don't assume\n",
    "\n",
    "Parallel speedup is never the full Nx. Pickling, process startup, and the parts of your program that *can't* be parallelised all eat into it (this ceiling is **Amdahl's law**: if 20% of the work is inherently sequential, even infinite cores cap you at 5x). Always time the real workload both ways before concluding processes helped."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": [
     "no-run"
    ]
   },
   "execution_count": null,
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    print('cores available:', os.cpu_count())\n",
    "\n",
    "    start = time.perf_counter()\n",
    "    [cpu_task(n) for n in range(8)]\n",
    "    seq = time.perf_counter() - start\n",
    "\n",
    "    start = time.perf_counter()\n",
    "    with ProcessPoolExecutor() as pool:\n",
    "        list(pool.map(cpu_task, range(8)))\n",
    "    par = time.perf_counter() - start\n",
    "\n",
    "    print(f'sequential: {seq:.2f}s   parallel: {par:.2f}s   speedup: {seq/par:.1f}x')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recap\n",
    "\n",
    "- Processes give **true parallelism** for **CPU-bound** work; each has its own interpreter and GIL.\n",
    "- `ProcessPoolExecutor` mirrors `ThreadPoolExecutor` — often a one-word swap.\n",
    "- Guard pool creation with `if __name__ == \"__main__\"`; run from a `.py` file.\n",
    "- Arguments and results must be **picklable** — no lambdas or local functions; large data is copied.\n",
    "- **Chunk** small tasks to beat overhead, return results instead of sharing state, and **measure** the real speedup.\n",
    "\n",
    "Next: [Async and await](https://agilearn.co.uk/guides/concurrency/learn/04-async-await) — back to I/O-bound work, but scaled to thousands of simultaneous tasks on a single thread."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}