Central Processing Unit | Technology Tales

Making sense of parallel and asynchronous execution in Python

16th March 2026

Parallel processing in Python is often presented as a straightforward route to faster programs, though the reality is rather more nuanced. At its core, parallel processing means executing parts of a task simultaneously across multiple processors or cores on the same machine, with the intention of reducing the total time needed to complete the work. Any honest explanation must include an important caveat because parallelism brings overhead of its own: processes need to be created, scheduled and coordinated, and data often has to be passed between them. For small or lightweight tasks, that overhead can outweigh any gain, and two tasks that each take five seconds may still require around eight seconds when parallelised, rather than the ideal five.

The Multiprocessing Module

One of the standard ways to work with parallel execution in Python is the multiprocessing module. This module creates subprocesses rather than threads, which matters because each process has its own memory space. On both Unix-like systems and Windows, this arrangement allows Python code to use multiple processors more effectively for independent work, and it sidesteps some of the limitations commonly associated with threads in CPython, particularly for CPU-bound tasks. Threads still have an important role, especially for workloads that are heavy on input/output operations, but multiprocessing is often the better fit when the work involves substantial computation.

Understanding the Global Interpreter Lock

The reason threads are less effective for CPU-bound work in CPython relates directly to the Global Interpreter Lock (GIL). The GIL is a mutex that allows only one thread to hold control of the Python interpreter at any one time, meaning that even in a multithreaded programme, only one thread can execute Python bytecode at a given moment. When a thread is waiting for an external input/output operation it releases the GIL, allowing other threads to run, which is why threading remains a reasonable choice for I/O-bound workloads. Multiprocessing sidesteps the GIL entirely by spawning separate processes, each with its own Python interpreter, allowing genuine parallel execution across cores.

How Many Processes Can Run in Parallel?

Before using multiprocessing, it helps to understand the practical ceiling on how many processes can run in parallel. The upper bound is usually tied to the number of logical processors or cores available on the machine, and Python exposes this through multiprocessing.cpu_count(), which returns the number of processors detected. That figure is a useful starting point rather than an absolute rule. In real applications, the best number of worker processes can vary according to available memory, the nature of the task and what else the machine is doing at the time.
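As a rough illustration of treating that figure as a starting point, the following sketch reads the detected processor count and derives a more conservative worker count from it. The suggested_workers() helper and its reserve parameter are hypothetical names introduced here and are not part of the multiprocessing API.

```python
import multiprocessing as mp
import os


def suggested_workers(reserve=1):
    """Return a conservative worker count: usable CPUs minus a small reserve.

    Both this helper and its `reserve` parameter are illustrative
    assumptions, not something the standard library provides.
    """
    detected = mp.cpu_count()
    # On some Unix platforms, the CPUs this process may actually use can be
    # fewer than those detected (inside a container, for instance).
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = detected
    # Never go below one worker, however large the reserve is.
    return max(1, min(detected, usable) - reserve)


if __name__ == "__main__":
    print("Logical processors detected:", mp.cpu_count())
    print("Suggested worker count:", suggested_workers())
```

Leaving one processor free is only one possible policy; memory limits or other running services may justify a smaller number still.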

Synchronous and Asynchronous Execution

Another foundation worth clarifying is the difference between synchronous and asynchronous execution. In synchronous execution, tasks are coordinated so that results are typically gathered in the same order in which they were started, and the main programme effectively waits for those tasks to finish. In asynchronous execution, by contrast, tasks can complete in any order and the results may not correspond to the original input sequence, which often improves throughput but requires the programmer to be more deliberate about collecting and arranging results.

Pool and Process: The Two Main Abstractions

The multiprocessing module offers two main abstractions for parallel work: Pool and Process. For most practical tasks, Pool is the easier and more convenient option. It manages a collection of worker processes and provides methods such as apply(), map() and starmap() for synchronous execution, alongside apply_async(), map_async() and starmap_async() for asynchronous execution. The lower-level Process class offers more control and suits more specialised cases, but for many data-processing jobs Pool is sufficient and considerably easier to reason about.

An Example: Counting Values in a Range

A useful way to see these ideas in action is through a concrete example. Suppose there is a two-dimensional list, or matrix, where each row contains a small set of integers, and the task is to count how many values in each row fall within a given range. In the example, the data are generated with NumPy using np.random.randint(0, 10, size=[200000, 5]) and then converted to a plain list of lists with tolist(). A simple function, howmany_within_range(row, minimum, maximum), loops through each number in a row and increments a counter whenever the number falls between the supplied minimum and maximum values.

Without any parallelism, this task is handled with a straightforward loop in which each row is passed to the function in turn and the returned counts are appended to a results list. This serial approach is simple, easy to read and often good enough as a baseline, and it provides an important benchmark because parallel processing should not be adopted merely because it is available but should address an actual performance problem.
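A minimal sketch of that serial baseline might look like the following. To keep it free of third-party dependencies, the standard library random module is swapped in for the NumPy call mentioned above, and the dataset is much smaller; the make_data() helper is a hypothetical name, and treating both range bounds as inclusive is an assumption.

```python
import random


def howmany_within_range(row, minimum, maximum):
    """Count the numbers in `row` lying between `minimum` and `maximum`.

    Treating both bounds as inclusive is an assumption of this sketch.
    """
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count += 1
    return count


def make_data(rows=1000, cols=5, seed=100):
    """Build a list of lists of integers from 0 to 9.

    Hypothetical stand-in for the NumPy randint(...).tolist() call.
    """
    rng = random.Random(seed)
    return [[rng.randrange(0, 10) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    data = make_data()
    # Serial baseline: pass each row to the function in turn.
    results = []
    for row in data:
        results.append(howmany_within_range(row, minimum=4, maximum=8))
    print(results[:10])
```

Timing this loop first gives the figure that any parallel version has to beat.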

Pool.apply()

To parallelise the same function, the first step is to create a process pool, typically with mp.Pool(mp.cpu_count()). The simplest method to understand is Pool.apply(), which runs a function in a worker process using the arguments supplied through args. In the range-counting example, each row is submitted with the same minimum and maximum values. The resulting code is concise, but there is an important detail to note: when apply() is used inside a list comprehension, each call blocks until it completes, so the rows are processed one after another and only one worker is busy at any moment. The work does run in a worker process, but this is rarely the most efficient pattern for distributing a large iterable of similar tasks.
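The pattern described here could be sketched as follows. The count_with_apply() wrapper is a hypothetical name, the pool is fixed at two workers to keep the example small rather than sized with mp.cpu_count(), and inclusive bounds are assumed.

```python
import multiprocessing as mp


def howmany_within_range(row, minimum, maximum):
    """Count the numbers in `row` between the bounds (inclusive, by assumption)."""
    return sum(1 for n in row if minimum <= n <= maximum)


def count_with_apply(data, minimum=4, maximum=8, processes=2):
    """Hypothetical wrapper: submit one Pool.apply() call per row."""
    with mp.Pool(processes) as pool:
        # apply() returns only when the worker has finished that one call,
        # so the comprehension hands over rows strictly one at a time.
        return [
            pool.apply(howmany_within_range, args=(row, minimum, maximum))
            for row in data
        ]


if __name__ == "__main__":
    data = [[1, 5, 9, 4, 7], [0, 2, 3, 3, 1], [8, 8, 4, 6, 5]]
    print(count_with_apply(data))  # [3, 0, 5]
```

The if __name__ == "__main__" guard matters on platforms that start worker processes by importing the main module afresh, Windows among them.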

Pool.map()

That is where Pool.map() can be more suitable. The map() method accepts a single iterable and applies the target function to each element. Because the original howmany_within_range() function expects more than one argument, the example adapts it by defining howmany_within_range_rowonly(row, minimum=4, maximum=8), giving default values to the range bounds so that only the row must be supplied. This is not always the cleanest design, but it illustrates the central constraint of map(): it expects one iterable of inputs rather than multiple arguments per call. In return, it is often a good fit for simple, repeated operations over a dataset.
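A small sketch of that adaptation follows, assuming inclusive bounds and using a hypothetical count_with_map() wrapper with a two-worker pool.

```python
import multiprocessing as mp


def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    """Same count as before, but only the row must be supplied."""
    return sum(1 for n in row if minimum <= n <= maximum)


def count_with_map(data, processes=2):
    """Hypothetical wrapper: let Pool.map() split the rows among the workers."""
    with mp.Pool(processes) as pool:
        # map() blocks until every row is done and keeps the input order.
        return pool.map(howmany_within_range_rowonly, data)


if __name__ == "__main__":
    data = [[1, 5, 9, 4, 7], [0, 2, 3, 3, 1], [8, 8, 4, 6, 5]]
    print(count_with_map(data))  # [3, 0, 5]
```

Where changing the signature is undesirable, wrapping the original three-argument function with functools.partial is another way to present map() with a one-argument callable.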

Pool.starmap()

When a function genuinely needs multiple arguments and one wants the convenience of map-like behaviour, Pool.starmap() is usually the better choice. Like map(), it takes a single iterable, but each element of that iterable is itself another iterable containing the arguments for one function call. In the example, the input becomes [(row, 4, 8) for row in data], with each tuple unpacked into howmany_within_range(). This tends to be clearer than altering function signatures purely to satisfy the constraints of map().
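The tuple-per-call arrangement might be sketched like this, with count_with_starmap() as a hypothetical wrapper, a two-worker pool and inclusive bounds assumed.

```python
import multiprocessing as mp


def howmany_within_range(row, minimum, maximum):
    """Count the numbers in `row` between the bounds (inclusive, by assumption)."""
    return sum(1 for n in row if minimum <= n <= maximum)


def count_with_starmap(data, minimum=4, maximum=8, processes=2):
    """Hypothetical wrapper: one argument tuple per function call."""
    arguments = [(row, minimum, maximum) for row in data]
    with mp.Pool(processes) as pool:
        # Each tuple is unpacked into howmany_within_range(row, minimum, maximum).
        return pool.starmap(howmany_within_range, arguments)


if __name__ == "__main__":
    data = [[1, 5, 9, 4, 7], [0, 2, 3, 3, 1], [8, 8, 4, 6, 5]]
    print(count_with_starmap(data))  # [3, 0, 5]
```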

Asynchronous Variants

The asynchronous equivalents follow the same broad pattern but differ in one crucial respect: they do not force the main process to wait for each task in order. With Pool.apply_async(), tasks are submitted, and the programme can continue while workers process them in the background. The example demonstrates this by redefining the counting function as howmany_within_range2(i, row, minimum, maximum), which returns both the original index and the count, a distinction that matters because asynchronous execution may alter the order of results. A callback function appends each completed result to a shared list and, after all tasks finish, that list is sorted by index so that the final output matches the original row order.
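The callback approach could be sketched along these lines. The count_with_callback() wrapper and the collect_result() callback are hypothetical names, the pool size of two is arbitrary, and inclusive bounds are assumed.

```python
import multiprocessing as mp


def howmany_within_range2(i, row, minimum, maximum):
    """Return the row's index along with its count so order can be restored."""
    count = sum(1 for n in row if minimum <= n <= maximum)
    return (i, count)


def count_with_callback(data, minimum=4, maximum=8, processes=2):
    """Hypothetical wrapper around apply_async() with a callback."""
    results = []

    def collect_result(result):
        # Called in the parent process each time a task completes.
        results.append(result)

    pool = mp.Pool(processes)
    for i, row in enumerate(data):
        pool.apply_async(
            howmany_within_range2,
            args=(i, row, minimum, maximum),
            callback=collect_result,
        )
    pool.close()  # no further tasks will be submitted
    pool.join()   # wait until the queued work is finished

    # Completion order is not guaranteed, so restore the original row order.
    results.sort(key=lambda pair: pair[0])
    return [count for _, count in results]


if __name__ == "__main__":
    data = [[1, 5, 9, 4, 7], [0, 2, 3, 3, 1], [8, 8, 4, 6, 5]]
    print(count_with_callback(data))  # [3, 0, 5]
```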

There is also an alternative form of apply_async() that avoids callbacks by returning ApplyResult objects, which can later be resolved with .get() to retrieve the actual result. This approach can be easier to follow when callbacks feel too indirect, though it still requires care to ensure that the pool is properly closed and joined so that all processes complete. The use of pool.join() is particularly important here because it prevents subsequent lines of code from running until the queued work is finished. Asynchronous mapping methods are available too, including Pool.starmap_async(), which mirrors starmap() but returns an asynchronous result object whose data can be fetched with .get().
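Both callback-free forms might be sketched as below. The two wrapper functions are hypothetical names, and inclusive bounds and a two-worker pool are assumptions as before.

```python
import multiprocessing as mp


def howmany_within_range(row, minimum, maximum):
    """Count the numbers in `row` between the bounds (inclusive, by assumption)."""
    return sum(1 for n in row if minimum <= n <= maximum)


def count_with_async_results(data, minimum=4, maximum=8, processes=2):
    """Hypothetical wrapper: keep the ApplyResult objects and resolve them later."""
    with mp.Pool(processes) as pool:
        pending = [
            pool.apply_async(howmany_within_range, args=(row, minimum, maximum))
            for row in data
        ]
        # get() blocks until that particular task is done; resolving in
        # submission order keeps the output aligned with the input rows.
        return [result.get() for result in pending]


def count_with_starmap_async(data, minimum=4, maximum=8, processes=2):
    """Hypothetical wrapper: starmap_async() followed by close() and join()."""
    pool = mp.Pool(processes)
    async_result = pool.starmap_async(
        howmany_within_range, [(row, minimum, maximum) for row in data]
    )
    pool.close()  # no further tasks will be submitted
    pool.join()   # wait until the queued work is finished
    return async_result.get()


if __name__ == "__main__":
    data = [[1, 5, 9, 4, 7], [0, 2, 3, 3, 1], [8, 8, 4, 6, 5]]
    print(count_with_async_results(data))   # [3, 0, 5]
    print(count_with_starmap_async(data))   # [3, 0, 5]
```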

Parallelising Pandas DataFrames

Parallelism in Python is not restricted to plain lists. In data analysis and machine learning work, it is often more relevant to process pandas DataFrames, and there are several levels at which this can happen: a function can operate on one row, one column or an entire DataFrame. The first two can be managed with the standard multiprocessing module alone, while whole-DataFrame parallelism often needs more flexible serialisation support than the standard library provides.

Row-wise and Column-wise Parallelism

For row-wise work, one approach is to iterate over df.itertuples(name=False) so that each row is presented as a simple tuple. A hypotenuse(row) function can compute the square root of the sum of squares of two values from each row, with a pool of four worker processes handling the rows through pool.imap(). This resembles df.apply() with axis=1 conceptually, but the work is spread across processes rather than performed in a single interpreter thread.

Column-wise parallelism follows the same idea but uses df.items() to iterate over columns (it is worth noting that df.iteritems(), which older examples may reference, was deprecated in pandas 1.5.0 and has since been removed, with df.items() being the correct modern equivalent). A sum_of_squares(column) function receives each column as a pair containing the column label and the series itself, and pool.imap() distributes this work across multiple processes. This pattern is useful when independent operations need to be applied to separate columns.
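The row-wise and column-wise patterns might be sketched together as follows. This assumes pandas is installed; the rowwise() and columnwise() wrappers, the column names and the sample values are all hypothetical, and returning the column label alongside each sum is a choice made for this sketch.

```python
import math
import multiprocessing as mp

import pandas as pd


def hypotenuse(row):
    """`row` is (index, a, b): itertuples() puts the index first by default."""
    _, a, b = row
    return math.sqrt(a ** 2 + b ** 2)


def sum_of_squares(column):
    """`column` is a (label, Series) pair as produced by DataFrame.items()."""
    label, series = column
    return label, int((series ** 2).sum())


def rowwise(df, processes=4):
    """Hypothetical wrapper: one task per row."""
    with mp.Pool(processes) as pool:
        # imap() is lazy, so consume it fully before the pool shuts down.
        return list(pool.imap(hypotenuse, df.itertuples(name=False), chunksize=10))


def columnwise(df, processes=2):
    """Hypothetical wrapper: one task per column."""
    with mp.Pool(processes) as pool:
        return dict(pool.imap(sum_of_squares, df.items()))


if __name__ == "__main__":
    df = pd.DataFrame({"a": [3, 5, 8], "b": [4, 12, 15]})
    print(rowwise(df))     # [5.0, 13.0, 17.0]
    print(columnwise(df))  # {'a': 98, 'b': 385}
```

For a frame this small, the cost of sending rows and columns to other processes will dwarf the arithmetic; the sketch only shows the shape of the calls.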

Whole-DataFrame Parallelism with Pathos

Parallelising functions that accept an entire DataFrame or similarly complex object is more difficult with the standard multiprocessing machinery because of serialisation constraints, since the standard library uses pickle internally and pickle has well-known limitations with certain object types. The pathos package addresses this by using dill internally, which supports serialising and deserialising almost all Python types. A DataFrame is split into chunks with np.array_split(df, cores, axis=0), and a ProcessingPool from pathos.multiprocessing maps a function across those chunks, with the results combined using np.vstack(). This extends the same Pool > Map > Close > Join pattern, though the pool is also cleared afterwards with pool.clear().

Lower-level Process Control and Queues

There are broader ways to think about parallel execution beyond multiprocessing.Pool. Lower-level process management with multiprocessing.Process gives explicit control over individual processes, and this can be paired with queues managed through multiprocessing.Manager() for inter-process communication. In such designs, one queue can hold tasks and another can collect results, with worker processes repeatedly fetching tasks, processing them and placing outputs in the result queue, terminating when they receive a sentinel value such as -1. This approach is more verbose than using a pool, but it can be valuable when workflows are dynamic or when processes need long-lived coordination.
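A minimal sketch of that task-queue and result-queue design follows. The worker() and square_all() functions are hypothetical, squaring a number stands in for real work, and the inputs are assumed never to equal the sentinel value.

```python
import multiprocessing as mp

SENTINEL = -1  # tells a worker that there is nothing left to do


def worker(task_queue, result_queue):
    """Fetch tasks until the sentinel arrives, placing outputs on the result queue."""
    while True:
        task = task_queue.get()
        if task == SENTINEL:
            break
        result_queue.put((task, task * task))


def square_all(numbers, n_workers=2):
    """Hypothetical driver: square each number using explicit worker processes."""
    numbers = list(numbers)
    with mp.Manager() as manager:
        tasks = manager.Queue()
        results = manager.Queue()
        workers = [
            mp.Process(target=worker, args=(tasks, results))
            for _ in range(n_workers)
        ]
        for process in workers:
            process.start()
        for n in numbers:
            tasks.put(n)
        for _ in workers:
            tasks.put(SENTINEL)  # one sentinel per worker
        for process in workers:
            process.join()
        # Every worker has exited, so all outputs are already queued.
        collected = [results.get() for _ in numbers]
    return dict(collected)


if __name__ == "__main__":
    squares = square_all([1, 2, 3, 4])
    print(sorted(squares.items()))  # [(1, 1), (2, 4), (3, 9), (4, 16)]
```

A dedicated sentinel object or None would be safer than -1 whenever -1 could be a legitimate task.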

Threads, Executors and External Commands

Python also offers other concurrency models worth knowing. Threads, available through the threading module or concurrent.futures.ThreadPoolExecutor, are often well suited to I/O-bound work such as downloading files or waiting on network responses. Because of the GIL in CPython, threads are less effective for CPU-bound pure Python code, though they can still provide concurrency when much of the time is spent waiting. Process-based approaches, including ProcessPoolExecutor, are generally more effective for CPU-heavy work because they achieve genuine parallel execution across cores.
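The two executor types could be contrasted with a sketch like this one. Both task functions are hypothetical stand-ins: time.sleep() imitates waiting on a network response, and a pure Python summation imitates CPU-heavy work.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def simulated_download(seconds):
    """Stand-in for I/O-bound work: sleeping releases the GIL."""
    time.sleep(seconds)
    return seconds


def sum_of_squares(n):
    """Stand-in for CPU-bound pure Python work."""
    return sum(i * i for i in range(n))


def run_io_tasks(delays, max_workers=4):
    """Threads suit tasks that spend most of their time waiting."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(simulated_download, delays))


def run_cpu_tasks(sizes, max_workers=2):
    """Separate processes let CPU-heavy tasks run on different cores."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(sum_of_squares, sizes))


if __name__ == "__main__":
    start = time.perf_counter()
    run_io_tasks([0.2] * 4)
    print(f"Four 0.2 s waits in threads took {time.perf_counter() - start:.2f} s")
    print(run_cpu_tasks([10, 100, 1000]))  # [285, 328350, 332833500]
```

Both executors share the same interface, so switching between them while measuring is mostly a matter of changing one class name.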

External process execution forms another category entirely. The os.system() function can launch shell commands, potentially in the background, though it is relatively crude. The subprocess module is more robust, providing better control over arguments, output capture and return codes. These tools are useful when the work is best handled by external programmes rather than Python functions, though they are conceptually distinct from in-Python data parallelism.
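As a brief illustration of the control that subprocess offers, the following sketch runs a command, captures its output and reports its return code. The run_external() helper is a hypothetical name, and the Python interpreter itself is used as the external command purely so that the example works on any platform.

```python
import subprocess
import sys


def run_external(args, timeout=30):
    """Run a command and return its exit code and captured standard output."""
    completed = subprocess.run(
        args,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        timeout=timeout,      # give up if the command takes too long
        check=False,          # report non-zero exit codes rather than raising
    )
    return completed.returncode, completed.stdout.strip()


if __name__ == "__main__":
    code, output = run_external(
        [sys.executable, "-c", "print('hello from a child process')"]
    )
    print(code, output)  # 0 hello from a child process
```

Passing the command as a list of arguments avoids involving a shell at all, which is one of the ways in which this is less crude than os.system().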

Choosing the Right Approach for Parallel Processing in Python

What emerges from all of this is that parallel processing in Python is less about memorising one trick and more about matching the method to the problem at hand. For simple data transformations over independent records, Pool.map() or Pool.starmap() can be effective, while asynchronous methods come into play when result order is not guaranteed or when responsiveness matters. When working with pandas, row-wise and column-wise strategies fit naturally into the standard multiprocessing model, whereas whole-object processing may call for a package such as pathos. Lower-level process control, thread pools, external commands and task queues each have their place too.

It is also worth remembering that parallelism is not free. Process creation, serialisation, memory usage and coordination all introduce cost, and the right question is not whether code can be parallelised but whether the effort and overhead make sense for the workload in question. Python provides several mature tools for splitting work across processes and threads, and the multiprocessing module remains one of the most practical for CPU-bound tasks on a single machine, with the Pool interface offering the clearest path from serial code to parallel execution for many everyday applications.

Upheaval and miniaturisation

4th March 2025

The ongoing AI boom got me refreshing my computer assets. One was a hefty upgrade to my main workstation, still powered by Linux. Along the way, I learned a few lessons:

Processing with LLMs only works on a graphics card when everything can remain within its onboard memory. It is all too easy to revert to system memory and CPU usage, given the amount of memory you get on consumer graphics cards. That applies even with the latest and greatest from Nvidia, when the main use case is for gaming. Things become prohibitively expensive when you go on from there.
Even with water cooling, keeping a top of the range CPU cool and its fans running quietly remains a challenge, more so than when I last went for a major upgrade. It takes time for things to settle down.
My Iiyama monitor now feels flaky with input from the latest technology. This is enough to make me look for a replacement, and it is waking up from dormancy that is the real issue. While it was always slow, plugging out from mains electricity and then back in again is a hack that is needed all too often.
KVM switches may need upgrading to work with the latest graphical input. The monitor may have been a culprit with the problems that I was getting, yet things were smoother once I replaced the unit that I had been using with another that is more modern.
AMD Ryzen 9 chips now have onboard graphics, a boon when things are not proceeding too well with a dedicated graphics card. Even though this was not the case when the last major upgrade happened, there were no issues like what I faced this time around.
Having LEDs on a motherboard to tell what might be stopping system startup is invaluable. This helped in July 2021 and averted confusion this time around as well. While only four of them were on offer, knowing which of CPU, DRAM, GPU or system boot needs attention is a big help.
Optical drives are not needed any longer. Booting off a USB drive was enough to get Linux Mint installed, once I got the image loaded on there properly. Rufus got used, and I needed to select the low-level writing option before things proceeded as I had hoped.

Just like 2021, the 2025 upgrade cycle needed a few weeks for everything to settle down. The previous cycle was more challenging, and this was not just because of an accompanying heatwave. The latest one was not so bedevilled.

Given the above, one might be tempted to go for a less arduous path, like my acquisition of an iMac last year for another place that I own. After all, a Mac Mini packs in quite a lot of power, and it is not the only miniature option. Now that I have one, I have moved image processing off the workstation and onto it. The images are stored on the Linux machine and edited on the Mac, which has plenty of memory and storage of its own. There is also an M4 chip, so processing power is not lacking either.

It could have been used for work affairs, yet I acquired a Geekom A8 for just that. Though seeking work as I write this, my being an incorporated freelancer means that having a dedicated machine that uses my main monitor has its advantages. Virtualisation can allow drift from business affairs to business matters, something that is not so easy when a separate machine is involved. There is no shortage of power either with an AMD Ryzen 9 8945HS and Radeon 780M Graphics on board. Add in 32 GB of memory and 2 TB of storage and all is commodious. It can be surprising what a small package can do.

The Iiyama's travails also pop up with these smaller machines, less so on the Geekom than with the Mac. The latter needs the HDMI cable to be removed and reinserted after a delay to sort out things. Perhaps that new monitor is not such an off-the-wall idea after all.

Performing parallel processing in Perl scripting with the Parallel::ForkManager module

30th September 2019

In a previous post, I described how to add Perl modules in Linux Mint, while mentioning that I hoped to add another that discusses the use of the Parallel::ForkManager module. This is that second post, and I am going to keep things as simple and generic as they can be. There are other articles like one on the Perl Maven website that go into more detail.

The first thing to do is ensure that the Parallel::ForkManager module is called by your script; having the line of code presented below near the top of the file will do just that. Without this step, the script will not be able to find the required module by itself and errors will be generated.

use Parallel::ForkManager;

Then, the maximum number of child processes needs to be specified. While that can be achieved using a simple variable declaration, the following line reads this from the command used to invoke the script. It even tells a forgetful user what they need to do in its own terse manner. Here $0 is the name of the script and N is the number of processes. Not all of these will necessarily get used, since processing capacity limits how many are actually in use, which means that there is less chance of overwhelming a CPU.

my $forks = shift or die "Usage: $0 N\n";

Once the maximum number of child processes is known, the next step is to instantiate the Parallel::ForkManager object as follows to make use of them:

my $pm = Parallel::ForkManager->new($forks);

With the Parallel::ForkManager object available, it is now possible to use it as part of a loop. A foreach loop works well, though only a single array can be used, with hashes being needed when other collections need interrogation. Two extra statements are needed, with one to start a child process and another to end it.

foreach my $t (@array) {
    my $pid = $pm->start and next;   # parent moves on to the next item
    # << Other code to be processed >>
    $pm->finish;                     # child exits here
}

Since there is often other processing performed by a script, and it is possible to have multiple parallelised loops in one, there needs to be a way of getting the parent process to wait until all the child processes have completed before moving from one step to another in the main script. That is what the following statement does. In short, it adds more control.

$pm->wait_all_children;

To close, there needs to be a comment on the advantages of parallel processing. Modern multicore processors often get used in single-threaded operations, which leaves most of the capacity unused. Utilising this extra power then shortens processing times markedly. To give you an idea of what can be achieved, I had a single script taking around 2.5 minutes to complete in single-threaded mode, while setting the maximum number of child processes to 24 reduced this to just over half a minute while taking up 80% of the processing capacity. This was with an AMD Ryzen 7 2700X CPU with eight cores and a maximum of 16 processor threads. Surprisingly, using 16 as the maximum number of child processes only used half the processor capacity, so it seems to be a matter of performing one's own measurements when making these decisions.

Interrogating Solaris hardware for installed CPU and memory resources

2nd October 2008

There are times when working with a Solaris server that you need to know a little more about the hardware configuration. Knowing how much memory you have and how many processors there are can be very useful if you are not to hog such resources.

The command for revealing how much memory has been installed is:

prtconf -v

Since memory is often allocated to individual CPUs, knowing how many are on the system is a must. This command will give you the bare number:

psrinfo -p

The following variant provides the full detail that you see below it:

psrinfo -v

Output:

Status of virtual processor 0 as of: 10/06/2008 16:47:54
  on-line since 09/13/2008 14:47:52.
  The sparcv9 processor operates at 1503 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 10/06/2008 16:47:54
  on-line since 09/13/2008 14:47:49.
  The sparcv9 processor operates at 1503 MHz,
        and has a sparcv9 floating point processor.

For a level intermediate between both extremes, try this to get what you see below it:

psrinfo -vp

Output:

The physical processor has 1 virtual processor (0)
  UltraSPARC-IIIi (portid 0 impl 0x16 ver 0x34 clock 1503 MHz)
The physical processor has 1 virtual processor (1)
  UltraSPARC-IIIi (portid 1 impl 0x16 ver 0x34 clock 1503 MHz)

Killing those runaway processes that refuse to die

5th July 2008

I must admit that there have been times when I logged off from my main Ubuntu box at home to dispatch a runaway process that I couldn't kill, and then logged back in again. The standard signal being sent to the process by the very useful kill command just wasn't sending the nefarious CPU-eating nuisance the right kind of signal. Thankfully, there is a way to control the signal being sent and there is one that does what's needed:

kill -9 [ID of nuisance process]

For Linux users, there appears to be another option for terminating a process that doesn't need the ps and grep command combination: it's killall. On other UNIX systems, killall generally terminates all processes, and its own process has no immunity to its quest. Hence, it's an administrator-only tool there, with a very definite and perhaps rarely required use. The Linux variant is more useful because it terminates all instances of a named process at a stroke and has the same signal control as the kill command. It is used as follows:

killall -9 nuisanceprocess

I'll certainly be continuing to use both of the above; it appears that Wine needs termination like this at times and VMware Workstation lapsed into the same sort of antisocial behaviour while running a VM running a development version of Ubuntu's Intrepid Ibex (or 8.10 if you prefer). Anything that keeps you from constantly needing to restart Linux sessions on your PC has to be good.

New Firefox, new ForecastFox

25th February 2007

Firefox 2.0.0.2 has made its appearance and the CPU usage bug seems to have gone away. We'll see how it goes... Also, a new version of the Accuweather.com powered ForecastFox plug-in has come out. It was when I was using it that I noticed heavy CPU usage, but the behaviour has yet to make its reappearance and I hope it never will. Now, I can get back to enjoying this very useful widget.