Processes and parallelism¶

Threads can't speed up pure-Python computation, because the GIL lets only one thread execute Python code at a time. Processes get around this completely: each process is a separate Python interpreter with its own memory and its own GIL, so on an 8-core machine, eight processes can run Python in genuine parallel. This is the tool for CPU-bound work, and the only one of the three that delivers real multi-core speed.

The good news: the high-level API is nearly identical to the thread pool. The catches are all about the fact that processes don't share memory.

These examples spawn real OS processes and won't run inside the in-browser sandbox — run them locally, ideally saved as a .py file (see the __main__ note below).

Same interface, different engine¶

ProcessPoolExecutor has the same submit / map / as_completed interface as ThreadPoolExecutor. Often you can switch a CPU-bound workload from threads to processes by changing one word.

from concurrent.futures import ProcessPoolExecutor
import time

def cpu_task(n):
    """Busy work: sum of squares. The core is pinned the whole time."""
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total

if __name__ == '__main__':
    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:        # defaults to one worker per core
        results = list(pool.map(cpu_task, range(4)))
    print(f'4 tasks across processes: {time.perf_counter() - start:.2f}s')

Compare that to running the four sequentially. On a multi-core machine the process version is several times faster — close to Nx on N cores, minus overhead. Try the same swap with ThreadPoolExecutor and you'll see no speedup, because the GIL serialises the computation.
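The one-word swap can be made explicit by passing the executor class in as a parameter. The following is a minimal sketch, not the tutorial's own code: `sum_squares` and `run_pool` are hypothetical helper names, and the task size is an arbitrary choice that keeps the run short.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def sum_squares(n):
    """CPU-bound busy work: sum of i*i for i in range(n)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_pool(executor_cls, sizes):
    """Map sum_squares over sizes using executor_cls; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls() as pool:
        results = list(pool.map(sum_squares, sizes))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    sizes = [2_000_000] * 4  # assumed workload, small enough to finish quickly
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        _, elapsed = run_pool(executor_cls, sizes)
        print(f"{executor_cls.__name__}: {elapsed:.2f}s")
```

Only the class passed to `run_pool` changes between the two runs; the timings depend on the machine, so compare them yourself rather than trusting a quoted figure.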

if __name__ == '__main__':
    start = time.perf_counter()
    for n in range(4):
        cpu_task(n)
    print(f'4 tasks sequentially:     {time.perf_counter() - start:.2f}s')

The if __name__ == "__main__" guard is mandatory¶

You've probably seen this idiom treated as optional politeness. With multiprocessing (and ProcessPoolExecutor, which is built on it) on Windows and macOS, where workers are started with the spawn method by default, it is required.

To create a worker, Python starts a fresh interpreter and imports your script into it. Without the guard, that import re-runs your top-level pool-creating code in every child, which tries to spawn more children, which import the script, which spawn more children... In principle this is an infinite cascade; in practice Python detects it and aborts with an error (typically a RuntimeError about starting a process during bootstrapping). The guard ensures the pool-launching code runs only in the original process, not on import.

This is also why these examples belong in a .py file run directly, not pasted into a REPL or notebook: the children need a real module to import.
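A small script can make the re-import visible. This is a sketch under assumptions: `double` and `main` are illustrative names, and the extra "module loaded" lines appear only where workers are started by importing the script (the spawn start method); under fork the line prints once.

```python
import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor

# Top-level code: under the spawn start method this line runs again in each
# worker that is started, because the worker imports the script afresh.
print(f"module loaded in pid {os.getpid()}")


def double(x):
    return 2 * x


def main():
    print("start method:", mp.get_start_method())
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(double, range(4))))


if __name__ == "__main__":
    main()  # only the original process reaches this call
```

Keeping the pool creation inside `main()` and calling it only under the guard means an import of the script defines `double` and nothing more.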

Everything must be picklable¶

Processes don't share memory, so arguments and return values are serialised with pickle, sent across a pipe, and reconstructed in the worker. Anything you pass in or return must therefore be picklable.

The usual casualties:

Lambdas and local (nested) functions can't be pickled, so the function you give to map/submit must be defined at module top level.
Open files, sockets, locks, and database connections can't cross the boundary.
Large arguments and results are copied between processes, which costs time and memory.
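One way to find out whether an object will survive the trip is to try pickling it yourself before handing it to a pool. The helper below is a hypothetical convenience, not part of concurrent.futures; it catches the exception types pickle commonly raises for unpicklable objects.

```python
import math
import pickle
import threading


def is_picklable(obj):
    """Return True if pickle can serialise obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


if __name__ == "__main__":
    candidates = {
        "module-level function (math.sqrt)": math.sqrt,
        "lambda": lambda x: x * x,
        "threading.Lock()": threading.Lock(),
        "list of ints": [1, 2, 3],
    }
    for label, obj in candidates.items():
        print(f"{label}: {is_picklable(obj)}")
```

A check like this is mainly a debugging aid: it tells you which argument is the problem before the error surfaces from inside the pool.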

# This FAILS: a lambda can't be pickled to send to a worker process.
#   with ProcessPoolExecutor() as pool:
#       pool.map(lambda x: x * x, range(4))     # PicklingError
#
# Define the function at module level instead:
def square(x):
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(square, range(8))))

Overhead is real — chunk your work¶

Spawning processes and pickling data back and forth costs milliseconds. If each task is tiny, that overhead dwarfs the work and the parallel version is slower than just looping. The fix is to give each worker a meaningful chunk of work rather than one trivial item at a time.

executor.map takes a chunksize argument that batches inputs per worker, which cuts the per-item overhead dramatically for large, cheap iterables.

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

if __name__ == '__main__':
    numbers = range(1_000_000, 1_000_500)
    with ProcessPoolExecutor() as pool:
        # chunksize batches inputs so each worker gets ~50 at a time
        primes = [n for n, ok in zip(numbers, pool.map(is_prime, numbers, chunksize=50)) if ok]
    print(f'found {len(primes)} primes')

A rule of thumb: parallelise across processes only when each unit of work takes meaningfully longer than the cost of shipping it to a worker — think milliseconds of compute at minimum, and prefer chunks. For thousands of microsecond-sized items, stay sequential or batch them.
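An alternative to chunksize is to batch by hand, so that each task is a whole sub-range rather than a single number. This is a sketch of that idea: `primes_in` and `split_range` are hypothetical helpers, and the piece size of 100 is an arbitrary assumption rather than a tuned value.

```python
from concurrent.futures import ProcessPoolExecutor


def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True


def primes_in(bounds):
    """Return the primes in [start, stop); one call is one sizeable task."""
    start, stop = bounds
    return [n for n in range(start, stop) if is_prime(n)]


def split_range(start, stop, step):
    """Cut [start, stop) into consecutive (lo, hi) pieces of at most step items."""
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]


if __name__ == "__main__":
    pieces = split_range(1_000_000, 1_000_500, 100)  # 5 tasks instead of 500
    with ProcessPoolExecutor() as pool:
        primes = [p for chunk in pool.map(primes_in, pieces) for p in chunk]
    print(f"found {len(primes)} primes")
```

Each task now ships two integers to the worker and one short list back, so the pickling cost is paid once per piece instead of once per number.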

Sharing results, not state¶

Because memory is isolated, the clean pattern is exactly the one from the thread pool: each task returns its result, and you combine the returns in the parent. Don't try to have workers mutate a shared list: they each get their own copy, and your changes vanish.
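The vanishing changes are easy to see in a small experiment. In this sketch, `seen` and `record_square` are illustrative names: the worker both appends to a module-level list and returns a value, and only the return value reaches the parent.

```python
from concurrent.futures import ProcessPoolExecutor

seen = []  # lives in the parent; each worker has its own separate copy


def record_square(x):
    seen.append(x)   # changes only the copy in whichever process runs this
    return x * x     # the return value is what travels back to the parent


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        squares = list(pool.map(record_square, range(5)))
    print("returned:", squares)    # [0, 1, 4, 9, 16]
    print("parent's seen:", seen)  # [] because the workers' appends never arrive
```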

When you genuinely need shared mutable state across processes (a counter several workers update, say), multiprocessing offers Value, Array, and Manager objects backed by shared memory or a server process. They're slower and fiddlier than returning values, so reach for them only when the algorithm truly requires it — see the threading and multiprocessing reference.
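For the counter case, a minimal sketch with multiprocessing.Value might look like the following. It uses Process directly rather than a pool, because a Value is meant to be handed to child processes when they are created; `add_many` and the counts are illustrative assumptions.

```python
from multiprocessing import Process, Value


def add_many(counter, n):
    """Increment the shared counter n times, holding its lock for each update."""
    for _ in range(n):
        with counter.get_lock():  # += is a read-modify-write, so it needs the lock
            counter.value += 1


if __name__ == "__main__":
    counter = Value("i", 0)  # "i" = C int, stored in shared memory
    procs = [Process(target=add_many, args=(counter, 1000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)  # 4000
```

Every increment takes a lock that all four processes contend for, which is the cost the paragraph above warns about; summing per-worker return values in the parent avoids it entirely.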

Measure, don't assume¶

Parallel speedup is never the full Nx. Pickling, process startup, and the parts of your program that can't be parallelised all eat into it (this ceiling is Amdahl's law: if 20% of the work is inherently sequential, even infinite cores cap you at 5x). Always time the real workload both ways before concluding processes helped.
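The Amdahl ceiling is a one-line formula: with a serial fraction s and N workers, the best possible speedup is 1 / (s + (1 - s) / N). The function below is a small illustration of that formula; `amdahl_speedup` is a name chosen here, and the result is an idealised bound that ignores pickling and startup costs.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelised."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    for workers in (2, 4, 8, 64, 1_000_000):
        print(f"{workers:>9} workers: {amdahl_speedup(0.2, workers):.2f}x")
```

With a 20% serial part, eight workers give at most about 3.3x, and the bound creeps toward 5x without ever reaching it as the worker count grows.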

import os

if __name__ == '__main__':
    print('cores available:', os.cpu_count())

    start = time.perf_counter()
    [cpu_task(n) for n in range(8)]
    seq = time.perf_counter() - start

    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        list(pool.map(cpu_task, range(8)))
    par = time.perf_counter() - start

    print(f'sequential: {seq:.2f}s   parallel: {par:.2f}s   speedup: {seq/par:.1f}x')

Recap¶

Processes give true parallelism for CPU-bound work; each has its own interpreter and GIL.
ProcessPoolExecutor mirrors ThreadPoolExecutor — often a one-word swap.
Guard pool creation with if __name__ == "__main__"; run from a .py file.
Arguments and results must be picklable — no lambdas or local functions; large data is copied.
Chunk small tasks to beat overhead, return results instead of sharing state, and measure the real speedup.

Next: Async and await — back to I/O-bound work, but scaled to thousands of simultaneous tasks on a single thread.
