Processes and parallelism¶
Threads can't speed up CPU-bound Python code, because in standard CPython the GIL lets only one thread run Python bytecode at a time. Processes get around this completely: each process is a separate Python interpreter with its own memory and its own GIL, so on an 8-core machine, eight processes can run Python in genuine parallel. This is the tool for CPU-bound work, and the only one of the three that delivers real multi-core speed.
The good news: the high-level API is nearly identical to the thread pool. The catches are all about the fact that processes don't share memory.
These examples spawn real OS processes and won't run inside the in-browser sandbox. Run them locally, ideally saved as a .py file (see the __main__ note below).
Same interface, different engine¶
ProcessPoolExecutor has the same submit / map / as_completed interface as ThreadPoolExecutor. Often you can switch a CPU-bound workload from threads to processes by changing one word.
from concurrent.futures import ProcessPoolExecutor
import time
def cpu_task(n):
    """Busy work: sum of squares. The core is pinned the whole time."""
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total
if __name__ == '__main__':
    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:  # defaults to one worker per core
        results = list(pool.map(cpu_task, range(4)))
    print(f'4 tasks across processes: {time.perf_counter() - start:.2f}s')
Compare that to running the four sequentially. On a multi-core machine the process version is several times faster — close to Nx on N cores, minus overhead. Try the same swap with ThreadPoolExecutor and you'll see no speedup, because the GIL serialises the computation.
if __name__ == '__main__':
    start = time.perf_counter()
    for n in range(4):
        cpu_task(n)
    print(f'4 tasks sequentially: {time.perf_counter() - start:.2f}s')
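The one-word swap can be sketched by passing the executor class in as a parameter. The helper names `sum_squares` and `timed_run` and the smaller loop size are illustrative assumptions, not part of the original example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import time

def sum_squares(limit):
    """CPU-bound busy work: sum of squares below `limit`."""
    total = 0
    for i in range(limit):
        total += i * i
    return total

def timed_run(executor_cls, inputs):
    """Run sum_squares over `inputs` with the given executor class.

    Returns (results, elapsed_seconds). Only the class differs between
    the thread version and the process version.
    """
    start = time.perf_counter()
    with executor_cls() as pool:
        results = list(pool.map(sum_squares, inputs))
    return results, time.perf_counter() - start

if __name__ == '__main__':
    inputs = [2_000_000] * 4
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        results, elapsed = timed_run(executor_cls, inputs)
        print(f'{executor_cls.__name__}: {elapsed:.2f}s')
```

On a multi-core machine the process line should typically report the shorter time, though the exact numbers depend on the hardware and the Python build.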
The if __name__ == "__main__" guard is mandatory¶
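You've probably seen this idiom treated as optional politeness. With multiprocessing (and ProcessPoolExecutor, which uses it), it is required wherever workers are started with the spawn or forkserver method. That covers Windows and macOS, where spawn is the default, and Linux from Python 3.14, where forkserver became the default.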
To create a worker, Python starts a fresh interpreter and imports your script into it. Without the guard, that import re-runs your top-level pool-creating code in every child, which tries to spawn more children, which import the script, and so on. Python detects the attempt and stops with a RuntimeError that points at the missing guard, rather than letting the cascade run. The guard ensures the pool-launching code runs only in the original process, not on import.
This is also why these examples belong in a .py file run directly, not pasted into a REPL or notebook: the children need a real module to import.
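To see which rule applies on a given machine, you can ask multiprocessing for its start method. This is a minimal sketch: the helper name `guard_required` is made up for illustration, and it encodes the assumption that spawn and forkserver re-import the main module in each worker while fork does not.

```python
import multiprocessing as mp

def guard_required(start_method):
    """True for start methods that re-import the main module in workers."""
    return start_method in ('spawn', 'forkserver')

def main():
    method = mp.get_start_method()
    print(f'start method: {method}')
    print(f'guard required: {guard_required(method)}')

if __name__ == '__main__':
    main()
```

Keeping the guard in place regardless is the portable choice, since the same script then behaves the same way on every platform.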
Everything must be picklable¶
Processes don't share memory, so arguments and return values are serialised with pickle, sent across a pipe, and reconstructed in the worker. Anything you pass in or return must therefore be picklable.
The usual casualties:
- Lambdas and local (nested) functions can't be pickled: the function you give to map/submit must be defined at module top level.
- Open files, sockets, locks, and database connections can't cross the boundary.
- Large arguments and results are copied between processes, which costs time and memory.
# This FAILS: a lambda can't be pickled to send to a worker process.
# The error only surfaces when the results are read, hence the list().
# with ProcessPoolExecutor() as pool:
#     list(pool.map(lambda x: x * x, range(4)))  # PicklingError
#
# Define the function at module level instead:
def square(x):
    return x * x
if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(square, range(8))))
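One way to catch these problems before they reach a pool is to try pickling the object yourself. This is a sketch only: `is_picklable` is a hypothetical helper, not a standard-library function.

```python
import pickle
import threading

def is_picklable(obj):
    """Return True if pickle can serialise obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True

if __name__ == '__main__':
    print(is_picklable([1, 2, 3]))         # plain data
    print(is_picklable(len))               # built-in function
    print(is_picklable(lambda x: x * x))   # lambda
    print(is_picklable(threading.Lock()))  # lock
```

The first two checks succeed and the last two fail, matching the list of casualties above.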
Overhead is real — chunk your work¶
Spawning processes and pickling data back and forth costs milliseconds. If each task is tiny, that overhead dwarfs the work and the parallel version is slower than just looping. The fix is to give each worker a meaningful chunk of work rather than one trivial item at a time.
executor.map takes a chunksize argument that batches inputs per worker, which cuts the per-item overhead dramatically for large, cheap iterables.
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True
if __name__ == '__main__':
    numbers = range(1_000_000, 1_000_500)
    with ProcessPoolExecutor() as pool:
        # chunksize batches inputs so each worker gets ~50 at a time
        primes = [n for n, ok in zip(numbers, pool.map(is_prime, numbers, chunksize=50)) if ok]
    print(f'found {len(primes)} primes')
A rule of thumb: parallelise across processes only when each unit of work takes meaningfully longer than the cost of shipping it to a worker — think milliseconds of compute at minimum, and prefer chunks. For thousands of microsecond-sized items, stay sequential or batch them.
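Besides chunksize, you can batch by hand: give each worker a whole range and have it return one small result. A sketch under that assumption follows; `chunk_bounds` and `count_primes` are illustrative names, and the chunk size of 50 simply mirrors the chunksize example above.

```python
from concurrent.futures import ProcessPoolExecutor

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def chunk_bounds(start, stop, size):
    """Split [start, stop) into (lo, hi) pairs covering up to `size` numbers each."""
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]

def count_primes(bounds):
    """Count primes in [lo, hi). One call is one meaningful unit of work."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))

if __name__ == '__main__':
    chunks = chunk_bounds(1_000_000, 1_000_500, 50)
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(count_primes, chunks))
    print(f'found {total} primes')
```

Only two integers cross the process boundary per task and one integer comes back, so the pickling cost stays small compared with the work done.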
Sharing results, not state¶
Because memory is isolated, the clean pattern is exactly the one from the thread pool: each task returns its result, and you combine the returns in the parent. Don't try to have workers mutate a shared list — they each get their own copy, and your changes vanish.
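The vanishing changes are easy to see in a small experiment. In this sketch the names `collected`, `append_square`, and `square` are made up for illustration: the first pool mutates a module-level list inside the workers, the second returns values to the parent.

```python
from concurrent.futures import ProcessPoolExecutor

collected = []  # lives in the parent; every worker has its own separate copy

def append_square(x):
    """Anti-pattern: mutates the worker's copy of `collected` only."""
    collected.append(x * x)

def square(x):
    """Clean pattern: return the result and let the parent combine it."""
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        list(pool.map(append_square, range(5)))
    print(collected)  # prints [] because the parent's list was never touched

    with ProcessPoolExecutor() as pool:
        gathered = list(pool.map(square, range(5)))
    print(gathered)   # prints [0, 1, 4, 9, 16]
```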
When you genuinely need shared mutable state across processes (a counter several workers update, say), multiprocessing offers Value, Array, and Manager objects backed by shared memory or a server process. They're slower and fiddlier than returning values, so reach for them only when the algorithm truly requires it — see the threading and multiprocessing reference.
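For the shared-counter case, a minimal sketch with multiprocessing.Value might look like this. It uses Process directly rather than a pool, on the assumption that the shared object is handed to each worker when it starts; `add_many` is an illustrative name.

```python
from multiprocessing import Process, Value

def add_many(counter, times):
    """Increment a shared counter `times` times, holding its lock each time."""
    for _ in range(times):
        # += is a read-modify-write, so it must happen under the lock
        with counter.get_lock():
            counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)  # 'i' means a C int, starting at 0, in shared memory
    workers = [Process(target=add_many, args=(counter, 1_000)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)  # prints 4000
```

Every increment takes a lock that all workers contend for, which is where the slowness comes from. Returning a per-worker count and summing in the parent avoids that cost.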
Measure, don't assume¶
Parallel speedup is never the full Nx. Pickling, process startup, and the parts of your program that can't be parallelised all eat into it (this ceiling is Amdahl's law: if 20% of the work is inherently sequential, even infinite cores cap you at 5x). Always time the real workload both ways before concluding processes helped.
import os
if __name__ == '__main__':
    print('cores available:', os.cpu_count())
    start = time.perf_counter()
    for n in range(8):
        cpu_task(n)
    seq = time.perf_counter() - start
    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        list(pool.map(cpu_task, range(8)))
    par = time.perf_counter() - start
    print(f'sequential: {seq:.2f}s parallel: {par:.2f}s speedup: {seq/par:.1f}x')
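The Amdahl ceiling is easy to compute for your own numbers. This sketch uses the standard formula 1 / (s + (1 - s) / n), where s is the sequential fraction and n the number of workers; the function name is illustrative.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when `serial_fraction` of the work can't be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

if __name__ == '__main__':
    for workers in (2, 4, 8, 64, 1_000_000):
        bound = amdahl_speedup(0.2, workers)
        print(f'{workers:>9} workers: at most {bound:.2f}x')
```

With 20% sequential work, 8 workers give at most about 3.3x, and the bound creeps toward 5x without ever reaching it. A measured speedup well below the bound usually points at pickling or startup overhead.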
Recap¶
- Processes give true parallelism for CPU-bound work; each has its own interpreter and GIL.
- ProcessPoolExecutor mirrors ThreadPoolExecutor; often a one-word swap.
- Guard pool creation with if __name__ == "__main__"; run from a .py file.
- Arguments and results must be picklable: no lambdas or local functions; large data is copied.
- Chunk small tasks to beat overhead, return results instead of sharing state, and measure the real speedup.
Next: Async and await — back to I/O-bound work, but scaled to thousands of simultaneous tasks on a single thread.