TOPIC: GLOBAL INTERPRETER LOCK
Making sense of parallel and asynchronous execution in Python
Parallel processing in Python is often presented as a straightforward route to faster programs, though the reality is rather more nuanced. At its core, parallel processing means executing parts of a task simultaneously across multiple processors or cores on the same machine, with the intention of reducing the total time needed to complete the work. Any honest explanation must include an important caveat because parallelism brings overhead of its own: processes need to be created, scheduled and coordinated, and data often has to be passed between them. For small or lightweight tasks, that overhead can outweigh any gain, and two tasks that each take five seconds may still require around eight seconds when parallelised, rather than the ideal five.
The Multiprocessing Module
One of the standard ways to work with parallel execution in Python is the multiprocessing module This module creates subprocesses rather than threads, which matters because each process has its own memory space. On both Unix-like systems and Windows, this arrangement allows Python code to use multiple processors more effectively for independent work, and it sidesteps some of the limitations commonly associated with threads in CPython, particularly for CPU-bound tasks. Threads still have an important role, especially for workloads that are heavy on input/output operations, but multiprocessing is often the better fit when the work involves substantial computation.
Understanding the Global Interpreter Lock
The reason threads are less effective for CPU-bound work in CPython relates directly to the Global Interpreter Lock (GIL). The GIL is a mutex that allows only one thread to hold control of the Python interpreter at any one time, meaning that even in a multithreaded programme, only one thread can execute Python bytecode at a given moment. When a thread is waiting for an external input/output operation it releases the GIL, allowing other threads to run, which is why threading remains a reasonable choice for I/O-bound workloads. Multiprocessing sidesteps the GIL entirely by spawning separate processes, each with its own Python interpreter, allowing genuine parallel execution across cores.
How Many Processes Can Run in Parallel?
Before using multiprocessing, it helps to understand the practical ceiling on how many processes can run in parallel. The upper bound is usually tied to the number of logical processors or cores available on the machine, and Python exposes this through multiprocessing.cpu_count(), which returns the number of processors detected. That figure is a useful starting point rather than an absolute rule. In real applications, the best number of worker processes can vary according to available memory, the nature of the task and what else the machine is doing at the time.
Synchronous and Asynchronous Execution
Another foundation worth clarifying is the difference between synchronous and asynchronous execution. In synchronous execution, tasks are coordinated so that results are typically gathered in the same order in which they were started, and the main programme effectively waits for those tasks to finish. In asynchronous execution, by contrast, tasks can complete in any order and the results may not correspond to the original input sequence, which often improves throughput but requires the programmer to be more deliberate about collecting and arranging results.
Pool and Process: The Two Main Abstractions
The multiprocessing module offers two main abstractions for parallel work: Pool and Process. For most practical tasks, Pool is the easier and more convenient option. It manages a collection of worker processes and provides methods such as apply(), map() and starmap() for synchronous execution, alongside apply_async(), map_async() and starmap_async() for asynchronous execution. The lower-level Process class offers more control and suits more specialised cases, but for many data-processing jobs Pool is sufficient and considerably easier to reason about.
An Example: Counting Values in a Range
A useful way to see these ideas in action is through a concrete example. Suppose there is a two-dimensional list, or matrix, where each row contains a small set of integers, and the task is to count how many values in each row fall within a given range. In the example, the data are generated with NumPy using np.random.randint(0, 10, size=[200000, 5]) and then converted to a plain list of lists with tolist(). A simple function, howmany_within_range(row, minimum, maximum), loops through each number in a row and increments a counter whenever the number falls between the supplied minimum and maximum values.
Without any parallelism, this task is handled with a straightforward loop in which each row is passed to the function in turn and the returned counts are appended to a results list. This serial approach is simple, easy to read and often good enough as a baseline, and it provides an important benchmark because parallel processing should not be adopted merely because it is available but should address an actual performance problem.
Pool.apply()
To parallelise the same function, the first step is to create a process pool, typically with mp.Pool(mp.cpu_count()). The simplest method to understand is Pool.apply(), which runs a function in a worker process using the arguments supplied through args. In the range-counting example, each row is submitted with the same minimum and maximum values. The resulting code is concise, but there is an important detail to note: when apply() is used inside a list comprehension, each call still blocks until it completes. It is parallel in terms of the workers available, but it is not always the most efficient pattern for distributing a large iterable of similar tasks.
Pool.map()
That is where Pool.map() can be more suitable. The map() method accepts a single iterable and applies the target function to each element. Because the original howmany_within_range() function expects more than one argument, the example adapts it by defining howmany_within_range_rowonly(row, minimum=4, maximum=8), giving default values to the range bounds so that only the row must be supplied. This is not always the cleanest design, but it illustrates the central constraint of map(): it expects one iterable of inputs rather than multiple arguments per call. In return, it is often a good fit for simple, repeated operations over a dataset.
Pool.starmap()
When a function genuinely needs multiple arguments and one wants the convenience of map-like behaviour, Pool.starmap() is usually the better choice. Like map(), it takes a single iterable, but each element of that iterable is itself another iterable containing the arguments for one function call. In the example, the input becomes [(row, 4, 8) for row in data], with each tuple unpacked into howmany_within_range(). This tends to be clearer than altering function signatures purely to satisfy the constraints of map().
Asynchronous Variants
The asynchronous equivalents follow the same broad pattern but differ in one crucial respect: they do not force the main process to wait for each task in order. With Pool.apply_async(), tasks are submitted, and the programme can continue while workers process them in the background. The example demonstrates this by redefining the counting function as howmany_within_range2(i, row, minimum, maximum), which returns both the original index and the count, a distinction that matters because asynchronous execution may alter the order of results. A callback function appends each completed result to a shared list and, after all tasks finish, that list is sorted by index so that the final output matches the original row order.
There is also an alternative form of apply_async() that avoids callbacks by returning ApplyResult objects, which can later be resolved with .get() to retrieve the actual result. This approach can be easier to follow when callbacks feel too indirect, though it still requires care to ensure that the pool is properly closed and joined so that all processes complete. The use of pool.join() is particularly important here because it prevents subsequent lines of code from running until the queued work is finished. Asynchronous mapping methods are available too, including Pool.starmap_async(), which mirrors starmap() but returns an asynchronous result object whose data can be fetched with .get().
Parallelising Pandas DataFrames
Parallelism in Python is not restricted to plain lists. In data analysis and machine learning work, it is often more relevant to process pandas DataFrames, and there are several levels at which this can happen: a function can operate on one row, one column or an entire DataFrame. The first two can be managed with the standard multiprocessing module alone, while whole-DataFrame parallelism often needs more flexible serialisation support than the standard library provides.
Row-wise and Column-wise Parallelism
For row-wise work, one approach is to iterate over df.itertuples(name=False) so that each row is presented as a simple tuple. A hypotenuse(row) function can compute the square root of the sum of squares of two values from each row, with a pool of four worker processes handling the rows through pool.imap(). This resembles pd.apply() conceptually, but the work is spread across processes rather than performed in a single interpreter thread.
Column-wise parallelism follows the same idea but uses df.items() to iterate over columns (it is worth noting that df.iteritems(), which older examples may reference, was deprecated in pandas 1.5.0 and has since been removed, with df.items() being the correct modern equivalent). A sum_of_squares(column) function receives each column as a pair containing the column label and the series itself, and pool.imap() distributes this work across multiple processes. This pattern is useful when independent operations need to be applied to separate columns.
Whole-DataFrame Parallelism with Pathos
Parallelising functions that accept an entire DataFrame or similarly complex object is more difficult with the standard multiprocessing machinery because of serialisation constraints, since the standard library uses pickle internally and pickle has well-known limitations with certain object types. The pathos package addresses this by using dill internally, which supports serialising and deserialising almost all Python types. A DataFrame is split into chunks with np.array_split(df, cores, axis=0), and a ProcessingPool from pathos.multiprocessing maps a function across those chunks, with the results combined using np.vstack(). This extends the same Pool > Map > Close > Join pattern, though the pool is also cleared afterwards with pool.clear().
Lower-level Process Control and Queues
There are broader ways to think about parallel execution beyond multiprocessing.Pool. Lower-level process management with multiprocessing.Process gives explicit control over individual processes, and this can be paired with queues managed through multiprocessing.Manager() for inter-process communication. In such designs, one queue can hold tasks and another can collect results, with worker processes repeatedly fetching tasks, processing them and placing outputs in the result queue, terminating when they receive a sentinel value such as -1. This approach is more verbose than using a pool, but it can be valuable when workflows are dynamic or when processes need long-lived coordination.
Threads, Executors and External Commands
Python also offers other concurrency models worth knowing. Threads, available through the threading module or concurrent.futures.ThreadPoolExecutor, are often well suited to I/O-bound work such as downloading files or waiting on network responses. Because of the GIL in CPython, threads are less effective for CPU-bound pure Python code, though they can still provide concurrency when much of the time is spent waiting. Process-based approaches, including ProcessPoolExecutor, are generally more effective for CPU-heavy work because they achieve genuine parallel execution across cores.
External process execution forms another category entirely. The os.system() method can launch shell commands, potentially in the background, though it is relatively crude. The subprocess module is more robust, providing better control over arguments, output capture and return codes. These tools are useful when the work is best handled by external programmes rather than Python functions, though they are conceptually distinct from in-Python data parallelism.
Choosing the Right Approach for Parallel Processing in Python
What emerges from all of this is that parallel processing in Python is less about memorising one trick and more about matching the method to the problem at hand. For simple data transformations over independent records, Pool.map() or Pool.starmap() can be effective, while asynchronous methods come into play when result order is not guaranteed or when responsiveness matters. When working with pandas, row-wise and column-wise strategies fit naturally into the standard multiprocessing model, whereas whole-object processing may call for a package such as pathos. Lower-level process control, thread pools, external commands and task queues each have their place too.
It is also worth remembering that parallelism is not free. Process creation, serialisation, memory usage and coordination all introduce cost, and the right question is not whether code can be parallelised but whether the effort and overhead make sense for the workload in question. Python provides several mature tools for splitting work across processes and threads, and the multiprocessing module remains one of the most practical for CPU-bound tasks on a single machine, with the Pool interface offering the clearest path from serial code to parallel execution for many everyday applications.