Technology Tales

Notes drawn from experiences in consumer and enterprise technology

TOPIC: PARALLEL COMPUTING

Speeding up R Code with parallel processing

17th March 2026

Parallel processing in R has evolved considerably over the past fifteen years, moving from a patchwork of platform-specific workarounds into a well-structured ecosystem with clean, consistent interfaces. The appeal is easy to grasp: modern computers offer several processor cores, yet most R code runs on only one of them unless the user makes a deliberate choice to go parallel. When a task involves repeated calculations across groups, repeated model fitting or many independent data retrievals, spreading that work across multiple cores can reduce elapsed time substantially.

At its heart, the idea is simple. A larger job is split into smaller pieces, those pieces are executed simultaneously where possible, and the results are combined back together. That pattern appears throughout R's parallel ecosystem, whether the work is running on a laptop with a handful of cores or on a university supercomputer with thousands.

Why Parallel Processing?

Most modern computers have multiple cores that sit idle while a single-threaded R script runs. Parallel processing takes advantage of this by splitting work across those cores, but it is important to understand that it is not always beneficial. Starting workers, transmitting data and collecting results all take time, so parallel processing makes the most sense when each iteration does enough computational work to justify that overhead. For fast operations taking well under a second each, the overhead will outweigh any gain and serial execution is faster. The sweet spot is iterative work where each unit of computation takes at least a few seconds.

Benchmarking: Amdahl's Law

The theoretical speed-up from adding processors is always limited by the fraction of work that cannot be parallelised. Amdahl's Law, formulated by computer scientist Gene Amdahl in 1967, captures this:

Maximum Speedup = 1 / ( f/p + (1 - f) )

Here, f is the parallelisable fraction and p is the number of processors. Problems where f = 1 (the entire computation is parallelisable) are called embarrassingly parallel: bootstrapping, simulation studies and applying the same model to many independent groups all fall into this category. For everything else, the sequential fraction, including the overhead of setting up workers and moving data, sets a ceiling on how much improvement is achievable.
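
To make that ceiling concrete, suppose (purely for illustration) that 90 per cent of a job can be parallelised and eight cores are available; the achievable speed-up falls well short of eightfold:

f <- 0.9                 # parallelisable fraction (illustrative)
p <- 8                   # number of processors
1 / (f / p + (1 - f))
# [1] 4.705882           # far short of the naive eightfold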

How We Got Here

The current landscape makes more sense with a brief orientation. R 2.14.0 in 2011 brought {parallel} into base R, providing built-in support for both forking and socket clusters along with reproducible random number streams, and it remains the foundation everything else builds on. The {foreach} package with {doParallel} became the most common high-level interface for many years, and is still widely encountered in existing code. The split-apply-combine package {plyr} was an early entry point for parallel data manipulation but is now retired; the recommendation is to use {dplyr} for data frames and {purrr} for list iteration instead. The {future} ecosystem, covered in the next section, is the current best practice for new code.

The Modern Standard: The {future} Ecosystem

The most significant development in R parallel computing in recent years has been the {future} package by Henrik Bengtsson, which provides a unified API for sequential and parallel execution across a wide range of backends. Its central concept is simple: a future is a value that will be computed (possibly in parallel) and retrieved later. What makes it powerful is that you write code once and change the execution strategy by swapping a single plan() call, with no other changes to your code.

library(future)
plan(multisession)  # Use all available cores via background R sessions

The common plans are sequential (the default, no parallelism), multisession (multiple background R processes, which works on all platforms including Windows) and multicore (forking, faster but Unix/macOS only). On a cluster, plan(cluster) and backends such as {future.batchtools} extend the same interface to remote nodes.
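
To see the core abstraction in isolation before turning to the higher-level packages, here is a minimal sketch; the sleep is just a stand-in for real work:

library(future)
plan(multisession)

f <- future({
    Sys.sleep(2)   # stand-in for an expensive computation
    42
})
# the main session is free to do other work here
value(f)           # blocks until the result is ready, then returns 42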

The {future} package itself is a low-level building block. For day-to-day work, three higher-level packages are the main entry points.

{future.apply}: Drop-in Replacements for base R Apply

{future.apply} provides parallel versions of every *apply function in base R, including future_lapply(), future_sapply(), future_mapply(), future_replicate() and more. The conversion from serial to parallel code requires just two lines:

library(future.apply)
plan(multisession)

# Serial
results <- lapply(my_list, my_function)

# Parallel — identical output, just faster
results <- future_lapply(my_list, my_function)

Global variables and packages are automatically identified and exported to workers, which removes the manual clusterExport and clusterEvalQ calls that {parallel} requires.

{furrr}: Drop-in Replacements for {purrr}

{furrr} does the same for {purrr}'s mapping functions. Any map() call can become future_map() by loading the library and setting a plan:

library(furrr)
plan(multisession, workers = availableCores() - 1)

# Serial
results <- map(my_list, my_function)

# Parallel
results <- future_map(my_list, my_function)

Like {future.apply}, {furrr} handles environment export automatically. There are parallel equivalents for all typed variants (future_map_dbl(), future_map_chr(), etc.) and for map2() and pmap() as well. It is the most natural choice for tidyverse-style code that already uses {purrr}.

{futurize}: One-Line Parallelisation

For users who want to parallelise existing code with minimal changes, {futurize} can transpile calls to lapply(), purrr::map() and foreach::foreach() %do% {} into their parallel equivalents automatically.

{foreach} with {doFuture}

The {foreach} package remains widely used, and the modern way to parallelise it is with the {doFuture} backend and the %dofuture% operator:

library(foreach)
library(doFuture)
plan(multisession)

results <- foreach(i = 1:10) %dofuture% {
    my_function(i)
}

This approach inherits all the benefits of {future}, including automatic global variable handling and reproducible random numbers.

The {parallel} Package: Core Functions

The {parallel} package remains part of base R and is the foundation that {future} and most other packages build on. It is useful to know its core functions directly, especially for distributed work across multiple nodes.

Shared memory (single machine, Unix/macOS only):

mclapply(X, FUN, mc.cores = n) is a parallelised lapply that works by forking. It is not available on Windows: there, only mc.cores = 1 is accepted (which simply runs serially), and requesting more cores is an error.

Distributed memory (all platforms, including multi-node):

Function                         Description
makeCluster(n)                   Start n worker processes
clusterExport(cl, vars)          Copy named objects to all workers
clusterEvalQ(cl, expr)           Run an expression (e.g. library(pkg)) on all workers
parLapply(cl, X, FUN)            Parallelised lapply across the cluster
parLapplyLB(cl, X, FUN)          The same, with load balancing for uneven tasks
clusterSetRNGStream(cl, seed)    Set reproducible random seeds on the workers
stopCluster(cl)                  Shut down the cluster

Note that detectCores() can return misleading values in HPC environments, reporting the total cores on a node rather than those allocated to your job. The {parallelly} package's availableCores() is more reliable in those settings and is what {furrr} and {future.apply} use internally.
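
Putting those pieces together, here is a minimal sketch of the socket-cluster workflow; the exported object and the toy function are placeholders:

library(parallel)

cl <- makeCluster(4)                  # start four background workers
scale_factor <- 10
clusterExport(cl, "scale_factor")     # copy the object to every worker
clusterEvalQ(cl, library(stats))      # run an expression on every worker
clusterSetRNGStream(cl, 123)          # reproducible random number streams
res <- parLapply(cl, 1:100, function(i) i * scale_factor)
stopCluster(cl)                       # always shut the workers down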

A Tidyverse Approach with {multidplyr}

For data frame-centric workflows, {multidplyr} (available on CRAN) provides a {dplyr} backend that distributes grouped data across worker processes. The API has been simplified considerably since older tutorials were written: there is no longer any need to manually add group index columns or call create_cluster(). The current workflow is straightforward.

library(multidplyr)
library(dplyr)

# Step 1: Create a cluster (leave 1–2 cores free)
cluster <- new_cluster(parallel::detectCores() - 1)

# Step 2: Load packages on workers
cluster_library(cluster, "dplyr")

# Step 3: Group your data and partition it across workers
flights_partitioned <- nycflights13::flights %>%
    group_by(dest) %>%
    partition(cluster)

# Step 4: Work with dplyr verbs as normal
results <- flights_partitioned %>%
    summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
    collect()

partition() uses a greedy algorithm to keep all rows of a group on the same worker and balance shard sizes. The collect() call at the end recombines the results into an ordinary tibble in the main session. If you need to use custom functions, load them on each worker with cluster_assign():

cluster_assign(cluster, my_function = my_function)

One important caveat from the official documentation: for basic {dplyr} operations, {multidplyr} is unlikely to give measurable speed-ups unless you have tens or hundreds of millions of rows. Its real strength is in parallelising slower, more complex operations such as fitting models to each group. For large in-memory data with fast transformations, {dtplyr} (which translates {dplyr} to {data.table}) is often a better first choice.

Running R on HPC Clusters

For computations that exceed what a single workstation can provide, university and research HPC clusters are the next step. The core terminology is worth understanding clearly before submitting your first job.

One node is a single physical computer, which may itself contain multiple processors. One processor contains multiple cores. Wall-time is the real-world clock time a job is permitted to run; the job is terminated when this limit is reached, regardless of whether the script has finished. Memory refers to the RAM the job requires. When requesting resources, leave a margin of at least five per cent of RAM for system processes, as exceeding the allocation will cause the job to fail.

Slurm Job Submission

Slurm is the dominant scheduler on modern HPC clusters, including Penn State's Roar Collab system, managed by the Institute for Computational and Data Sciences (ICDS). Jobs are described in a shell script and submitted with sbatch. From R, the {rslurm} package allows Slurm jobs to be created and submitted directly without leaving the R session:

library(rslurm)
sjob <- slurm_apply(my_function, params_df, jobname = "my_job",
                    nodes = 2, cpus_per_node = 8)
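
Once submitted, the job's output can be collected back into the session and the scratch files removed; a brief sketch using {rslurm}'s own helpers:

res <- get_slurm_out(sjob, outtype = "table")  # waits for the job by default
cleanup_files(sjob)                            # remove the temporary job files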

Connecting R Workflows to Cluster Schedulers

The {batchtools} package provides Map, Reduce and Filter variants for managing R jobs on PBS, Slurm, LSF and Sun Grid Engine. The {clustermq} package sends function calls out as cluster jobs with a single line of code and without relying on network-mounted storage. For users already in the {future} ecosystem, {future.batchtools} wraps {batchtools} as a {future} backend, letting you scale from a local plan(multisession) all the way to plan(batchtools_slurm) with no other code changes.
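
As a sketch of that scaling path; the template file name here is an assumption, since each cluster supplies its own batchtools template:

library(future.batchtools)
# each future becomes a Slurm job; "slurm.tmpl" is a hypothetical site template
plan(batchtools_slurm, template = "slurm.tmpl")

results <- future.apply::future_lapply(my_list, my_function)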

The Broader Ecosystem

The CRAN Task View on High-Performance and Parallel Computing, maintained by Dirk Eddelbuettel and regularly updated, remains the most comprehensive catalogue of R packages in this space. The core packages designated by the Task View are {Rmpi} and {snow}. Beyond these, several areas are worth knowing about.

For large and out-of-memory data, {arrow} provides the Apache Arrow in-memory format with support for out-of-memory processing and streaming. {bigmemory} allows multiple R processes on the same machine to share large matrix objects. {bigstatsr} operates on file-backed matrices via memory-mapped access with parallel matrix operations and PCA.

For pipeline orchestration, the {targets} package constructs a directed acyclic graph of your workflow and orchestrates distributed computing across {future} workers, only re-running steps whose upstream dependencies have changed. For GPU computing, the {tensorflow} package by Allaire and colleagues provides access to the complete TensorFlow API from within R, enabling computation across CPUs and GPUs with a single API.

When it comes to random number reproducibility across parallel workers, the L'Ecuyer-CMRG streams built into {parallel} are available via RNGkind("L'Ecuyer-CMRG"). The {rlecuyer}, {rstream}, {sitmo} and {dqrng} packages provide further alternatives. The {doRNG} package handles reproducible seeds specifically for {foreach} loops.
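
With the {future}-based tools, the practical route to reproducible parallel random numbers is the future.seed argument, which assigns each element its own pre-set L'Ecuyer-CMRG stream:

library(future.apply)
plan(multisession)

# identical results on every run, however the work is scheduled
res <- future_lapply(1:4, function(i) rnorm(1), future.seed = TRUE)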

Choosing the Right Approach

The appropriate tool depends on the shape of the problem and how it fits into your existing code.

If you are already using {purrr}'s map() functions, replacing them with future_map() from {furrr} after plan(multisession) is the path of least resistance. If you use base R's lapply or sapply, {future.apply} provides identical drop-in replacements. Both inherit automatic environment handling, reproducible random numbers and cross-platform compatibility from {future}.

If you are working with grouped data frames in a {dplyr} style and each group operation is computationally substantial, {multidplyr} is a good fit. For fast operations on large data, try {dtplyr} first.

For the largest workloads on institutional clusters, {future} scales directly to HPC environments via plan(cluster) or plan(batchtools_slurm). The {rslurm} and {batchtools} packages provide more direct control over job submission and resource management.

Further Reading

The CRAN Task View on High-Performance and Parallel Computing is the most comprehensive and current reference. The Futureverse website documents the full {future} ecosystem. The {multidplyr} vignette covers the current API in detail. Penn State users can find cluster support through ICDS and the QuantDev group's HPC in R tutorial. The R Special Interest Group on High-Performance Computing mailing list is a further resource for more specialist questions.

Making sense of parallel and asynchronous execution in Python

16th March 2026

Parallel processing in Python is often presented as a straightforward route to faster programs, though the reality is rather more nuanced. At its core, parallel processing means executing parts of a task simultaneously across multiple processors or cores on the same machine, with the intention of reducing the total time needed to complete the work. Any honest explanation must include an important caveat because parallelism brings overhead of its own: processes need to be created, scheduled and coordinated, and data often has to be passed between them. For small or lightweight tasks, that overhead can outweigh any gain, and two tasks that each take five seconds may still require around eight seconds when parallelised, rather than the ideal five.

The Multiprocessing Module

One of the standard ways to work with parallel execution in Python is the multiprocessing module. This module creates subprocesses rather than threads, which matters because each process has its own memory space. On both Unix-like systems and Windows, this arrangement allows Python code to use multiple processors more effectively for independent work, and it sidesteps some of the limitations commonly associated with threads in CPython, particularly for CPU-bound tasks. Threads still have an important role, especially for workloads that are heavy on input/output operations, but multiprocessing is often the better fit when the work involves substantial computation.

Understanding the Global Interpreter Lock

The reason threads are less effective for CPU-bound work in CPython relates directly to the Global Interpreter Lock (GIL). The GIL is a mutex that allows only one thread to hold control of the Python interpreter at any one time, meaning that even in a multithreaded program, only one thread can execute Python bytecode at a given moment. When a thread is waiting for an external input/output operation it releases the GIL, allowing other threads to run, which is why threading remains a reasonable choice for I/O-bound workloads. Multiprocessing sidesteps the GIL entirely by spawning separate processes, each with its own Python interpreter, allowing genuine parallel execution across cores.

How Many Processes Can Run in Parallel?

Before using multiprocessing, it helps to understand the practical ceiling on how many processes can run in parallel. The upper bound is usually tied to the number of logical processors or cores available on the machine, and Python exposes this through multiprocessing.cpu_count(), which returns the number of processors detected. That figure is a useful starting point rather than an absolute rule. In real applications, the best number of worker processes can vary according to available memory, the nature of the task and what else the machine is doing at the time.

Synchronous and Asynchronous Execution

Another foundation worth clarifying is the difference between synchronous and asynchronous execution. In synchronous execution, tasks are coordinated so that results are typically gathered in the same order in which they were started, and the main program effectively waits for those tasks to finish. In asynchronous execution, by contrast, tasks can complete in any order and the results may not correspond to the original input sequence, which often improves throughput but requires the programmer to be more deliberate about collecting and arranging results.

Pool and Process: The Two Main Abstractions

The multiprocessing module offers two main abstractions for parallel work: Pool and Process. For most practical tasks, Pool is the easier and more convenient option. It manages a collection of worker processes and provides methods such as apply(), map() and starmap() for synchronous execution, alongside apply_async(), map_async() and starmap_async() for asynchronous execution. The lower-level Process class offers more control and suits more specialised cases, but for many data-processing jobs Pool is sufficient and considerably easier to reason about.

An Example: Counting Values in a Range

A useful way to see these ideas in action is through a concrete example. Suppose there is a two-dimensional list, or matrix, where each row contains a small set of integers, and the task is to count how many values in each row fall within a given range. In the example, the data are generated with NumPy using np.random.randint(0, 10, size=[200000, 5]) and then converted to a plain list of lists with tolist(). A simple function, howmany_within_range(row, minimum, maximum), loops through each number in a row and increments a counter whenever the number falls between the supplied minimum and maximum values.

Without any parallelism, this task is handled with a straightforward loop in which each row is passed to the function in turn and the returned counts are appended to a results list. This serial approach is simple, easy to read and often good enough as a baseline, and it provides an important benchmark because parallel processing should not be adopted merely because it is available but should address an actual performance problem.
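
Reconstructed in code from that description (the bounds of 4 and 8 match those used in the sections below):

import numpy as np

# 200,000 rows of five small integers, converted to a plain list of lists
data = np.random.randint(0, 10, size=[200000, 5]).tolist()

def howmany_within_range(row, minimum, maximum):
    """Count how many values in one row fall within [minimum, maximum]."""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count += 1
    return count

# serial baseline: process each row in turn
results = []
for row in data:
    results.append(howmany_within_range(row, 4, 8))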

Pool.apply()

To parallelise the same function, the first step is to create a process pool, typically with mp.Pool(mp.cpu_count()). The simplest method to understand is Pool.apply(), which runs a function in a worker process using the arguments supplied through args. In the range-counting example, each row is submitted with the same minimum and maximum values. The resulting code is concise, but there is an important detail to note: when apply() is used inside a list comprehension, each call still blocks until it completes. It is parallel in terms of the workers available, but it is not always the most efficient pattern for distributing a large iterable of similar tasks.
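
A sketch of that pattern, reusing the data and function defined above:

import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())

# each apply() call blocks until its worker returns a result
results = [pool.apply(howmany_within_range, args=(row, 4, 8)) for row in data]

pool.close()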

Pool.map()

That is where Pool.map() can be more suitable. The map() method accepts a single iterable and applies the target function to each element. Because the original howmany_within_range() function expects more than one argument, the example adapts it by defining howmany_within_range_rowonly(row, minimum=4, maximum=8), giving default values to the range bounds so that only the row must be supplied. This is not always the cleanest design, but it illustrates the central constraint of map(): it expects one iterable of inputs rather than multiple arguments per call. In return, it is often a good fit for simple, repeated operations over a dataset.
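
In code, the adapted function and the map() call might look like this:

def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    # default bounds mean map() only has to supply the row
    return howmany_within_range(row, minimum, maximum)

pool = mp.Pool(mp.cpu_count())
results = pool.map(howmany_within_range_rowonly, data)
pool.close()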

Pool.starmap()

When a function genuinely needs multiple arguments and one wants the convenience of map-like behaviour, Pool.starmap() is usually the better choice. Like map(), it takes a single iterable, but each element of that iterable is itself another iterable containing the arguments for one function call. In the example, the input becomes [(row, 4, 8) for row in data], with each tuple unpacked into howmany_within_range(). This tends to be clearer than altering function signatures purely to satisfy the constraints of map().
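
The same job with starmap(), leaving the original function signature untouched:

pool = mp.Pool(mp.cpu_count())

# each tuple is unpacked into (row, minimum, maximum)
results = pool.starmap(howmany_within_range, [(row, 4, 8) for row in data])

pool.close()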

Asynchronous Variants

The asynchronous equivalents follow the same broad pattern but differ in one crucial respect: they do not force the main process to wait for each task in order. With Pool.apply_async(), tasks are submitted, and the program can continue while workers process them in the background. The example demonstrates this by redefining the counting function as howmany_within_range2(i, row, minimum, maximum), which returns both the original index and the count, a distinction that matters because asynchronous execution may alter the order of results. A callback function appends each completed result to a shared list and, after all tasks finish, that list is sorted by index so that the final output matches the original row order.
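
A sketch of that callback pattern, with howmany_within_range2() reconstructed from the description above:

results = []

def howmany_within_range2(i, row, minimum, maximum):
    # return the index with the count so the order can be restored later
    count = sum(minimum <= n <= maximum for n in row)
    return (i, count)

def collect_result(result):
    results.append(result)

pool = mp.Pool(mp.cpu_count())
for i, row in enumerate(data):
    pool.apply_async(howmany_within_range2, args=(i, row, 4, 8),
                     callback=collect_result)
pool.close()
pool.join()   # wait until every queued task has finished

results.sort(key=lambda r: r[0])             # restore the original row order
final = [count for i, count in results]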

There is also an alternative form of apply_async() that avoids callbacks by returning ApplyResult objects, which can later be resolved with .get() to retrieve the actual result. This approach can be easier to follow when callbacks feel too indirect, though it still requires care to ensure that the pool is properly closed and joined so that all processes complete. The use of pool.join() is particularly important here because it prevents subsequent lines of code from running until the queued work is finished. Asynchronous mapping methods are available too, including Pool.starmap_async(), which mirrors starmap() but returns an asynchronous result object whose data can be fetched with .get().
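
A sketch of the callback-free form, with the starmap_async() equivalent shown as a comment:

pool = mp.Pool(mp.cpu_count())

# submit everything first, then resolve each ApplyResult with .get()
result_objects = [pool.apply_async(howmany_within_range2, args=(i, row, 4, 8))
                  for i, row in enumerate(data)]
results = [r.get()[1] for r in result_objects]

# the mapping form returns one asynchronous object covering all the tasks:
# results = pool.starmap_async(
#     howmany_within_range2,
#     [(i, row, 4, 8) for i, row in enumerate(data)]).get()

pool.close()
pool.join()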

Parallelising Pandas DataFrames

Parallelism in Python is not restricted to plain lists. In data analysis and machine learning work, it is often more relevant to process pandas DataFrames, and there are several levels at which this can happen: a function can operate on one row, one column or an entire DataFrame. The first two can be managed with the standard multiprocessing module alone, while whole-DataFrame parallelism often needs more flexible serialisation support than the standard library provides.

Row-wise and Column-wise Parallelism

For row-wise work, one approach is to iterate over df.itertuples(name=False) so that each row is presented as a simple tuple. A hypotenuse(row) function can compute the square root of the sum of squares of two values from each row, with a pool of four worker processes handling the rows through pool.imap(). This resembles pd.apply() conceptually, but the work is spread across processes rather than performed in a single interpreter thread.
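
A sketch of that row-wise pattern; the toy DataFrame and its dimensions are assumptions for illustration:

import multiprocessing as mp
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(3, 10, size=[500, 2]))  # toy data

def hypotenuse(row):
    # itertuples(name=False) yields (index, x, y), so values start at row[1]
    return (row[1] ** 2 + row[2] ** 2) ** 0.5

with mp.Pool(4) as pool:
    output = list(pool.imap(hypotenuse, df.itertuples(name=False), chunksize=10))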

Column-wise parallelism follows the same idea but uses df.items() to iterate over columns (it is worth noting that df.iteritems(), which older examples may reference, was deprecated in pandas 1.5.0 and has since been removed, with df.items() being the correct modern equivalent). A sum_of_squares(column) function receives each column as a pair containing the column label and the series itself, and pool.imap() distributes this work across multiple processes. This pattern is useful when independent operations need to be applied to separate columns.
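
The column-wise counterpart, reusing the toy DataFrame from above:

def sum_of_squares(column):
    # df.items() yields (label, Series) pairs
    label, series = column
    return sum(x ** 2 for x in series)

with mp.Pool(4) as pool:
    output = list(pool.imap(sum_of_squares, df.items(), chunksize=10))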

Whole-DataFrame Parallelism with Pathos

Parallelising functions that accept an entire DataFrame or similarly complex object is more difficult with the standard multiprocessing machinery because of serialisation constraints, since the standard library uses pickle internally and pickle has well-known limitations with certain object types. The pathos package addresses this by using dill internally, which supports serialising and deserialising almost all Python types. A DataFrame is split into chunks with np.array_split(df, cores, axis=0), and a ProcessingPool from pathos.multiprocessing maps a function across those chunks, with the results combined using np.vstack(). This extends the same Pool > Map > Close > Join pattern, though the pool is also cleared afterwards with pool.clear().
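
A sketch of that chunked approach; process_chunk() is a hypothetical stand-in for whatever whole-DataFrame function is needed:

import numpy as np
from pathos.multiprocessing import ProcessingPool

def process_chunk(chunk):
    # placeholder: any function taking a whole DataFrame chunk will do,
    # provided it returns something np.vstack() can recombine
    return chunk.values ** 2

cores = 4
pool = ProcessingPool(cores)
chunks = np.array_split(df, cores, axis=0)    # df as defined earlier
combined = np.vstack(pool.map(process_chunk, chunks))
pool.close()
pool.join()
pool.clear()   # drop the cached pool so a fresh one can be created later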

Lower-level Process Control and Queues

There are broader ways to think about parallel execution beyond multiprocessing.Pool. Lower-level process management with multiprocessing.Process gives explicit control over individual processes, and this can be paired with queues managed through multiprocessing.Manager() for inter-process communication. In such designs, one queue can hold tasks and another can collect results, with worker processes repeatedly fetching tasks, processing them and placing outputs in the result queue, terminating when they receive a sentinel value such as -1. This approach is more verbose than using a pool, but it can be valuable when workflows are dynamic or when processes need long-lived coordination.
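
A minimal sketch of that design, with squaring as a placeholder task:

import multiprocessing as mp

def worker(task_queue, result_queue):
    while True:
        task = task_queue.get()
        if task == -1:                        # sentinel: stop this worker
            break
        result_queue.put(task * task)         # placeholder computation

if __name__ == "__main__":
    manager = mp.Manager()
    task_queue, result_queue = manager.Queue(), manager.Queue()

    workers = [mp.Process(target=worker, args=(task_queue, result_queue))
               for _ in range(4)]
    for w in workers:
        w.start()

    n_tasks = 20
    for t in range(n_tasks):
        task_queue.put(t)
    for _ in workers:
        task_queue.put(-1)                    # one sentinel per worker

    for w in workers:
        w.join()

    results = [result_queue.get() for _ in range(n_tasks)]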

Threads, Executors and External Commands

Python also offers other concurrency models worth knowing. Threads, available through the threading module or concurrent.futures.ThreadPoolExecutor, are often well suited to I/O-bound work such as downloading files or waiting on network responses. Because of the GIL in CPython, threads are less effective for CPU-bound pure Python code, though they can still provide concurrency when much of the time is spent waiting. Process-based approaches, including ProcessPoolExecutor, are generally more effective for CPU-heavy work because they achieve genuine parallel execution across cores.
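
A short sketch of the executor interface; cpu_heavy() is a stand-in for real computation, and on Windows process pools also need the usual __main__ guard:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_heavy(n):
    return sum(i * i for i in range(n))

# processes for CPU-bound work; swap in ThreadPoolExecutor for I/O-bound tasks
with ProcessPoolExecutor() as executor:
    totals = list(executor.map(cpu_heavy, [10 ** 6] * 8))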

External process execution forms another category entirely. The os.system() method can launch shell commands, potentially in the background, though it is relatively crude. The subprocess module is more robust, providing better control over arguments, output capture and return codes. These tools are useful when the work is best handled by external programs rather than Python functions, though they are conceptually distinct from in-Python data parallelism.
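
A brief sketch of both, assuming a Unix-like system for the ls command:

import os
import subprocess

os.system("ls -l")                    # crude: yields only an exit status

completed = subprocess.run(["ls", "-l"], capture_output=True, text=True)
print(completed.returncode)           # return code under program control
print(completed.stdout)               # captured output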

Choosing the Right Approach for Parallel Processing in Python

What emerges from all of this is that parallel processing in Python is less about memorising one trick and more about matching the method to the problem at hand. For simple data transformations over independent records, Pool.map() or Pool.starmap() can be effective, while asynchronous methods come into play when result order is not guaranteed or when responsiveness matters. When working with pandas, row-wise and column-wise strategies fit naturally into the standard multiprocessing model, whereas whole-object processing may call for a package such as pathos. Lower-level process control, thread pools, external commands and task queues each have their place too.

It is also worth remembering that parallelism is not free. Process creation, serialisation, memory usage and coordination all introduce cost, and the right question is not whether code can be parallelised but whether the effort and overhead make sense for the workload in question. Python provides several mature tools for splitting work across processes and threads, and the multiprocessing module remains one of the most practical for CPU-bound tasks on a single machine, with the Pool interface offering the clearest path from serial code to parallel execution for many everyday applications.

Performing parallel processing in Perl scripting with the Parallel::ForkManager module

30th September 2019

In a previous post, I described how to add Perl modules in Linux Mint, while mentioning that I hoped to add another that discusses the use of the Parallel::ForkManager module. This is that second post, and I am going to keep things as simple and generic as they can be. There are other articles like one on the Perl Maven website that go into more detail.

The first thing to do is to ensure that the Parallel::ForkManager module is loaded by your script; having the line of code presented below near the top of the file will do just that. Without this step, the script will not be able to find the required module by itself, and errors will be generated.

use Parallel::ForkManager;

Then, the maximum number of parallel processes needs to be specified; Parallel::ForkManager forks child processes rather than threads, though the word "threads" often gets used loosely in this context. While that can be achieved using a simple variable declaration, the following line reads the value from the command used to invoke the script, and it even tells a forgetful user what they need to do in its own terse manner. Here $0 is the name of the script and N is the maximum number of child processes. Not all of these will necessarily be in use at once, since processing capacity limits how many actually run at any time, which means that there is less chance of overwhelming a CPU.

my $forks = shift or die "Usage: $0 N\n";

Once the maximum number of available threads is known, the next step is to instantiate the Parallel::ForkManager object as follows to use these child processes:

my $pm = Parallel::ForkManager->new($forks);

With the Parallel::ForkManager object available, it is now possible to use it as part of a loop. A foreach loop works well, though only a single array can be iterated over, with hashes being needed when other kinds of collection require interrogation. Two extra statements are needed: one to start a child process and another to end it.

foreach my $t (@array) {
    my $pid = $pm->start and next;    # fork a child; the parent moves on
    # ... code to process $t runs here in the child ...
    $pm->finish;                      # end the child process
}

Since a script often performs other processing, and it is possible to have more than one parallelised loop in a single script, there needs to be a way of getting the parent process to wait until all the child processes have completed before moving from one step to the next in the main script, and that is what the following statement does. In short, it adds more control.

$pm->wait_all_children;
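
Assembling the pieces above gives a complete minimal script; the contents of @array and the per-item work are placeholders:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $forks = shift or die "Usage: $0 N\n";
my $pm = Parallel::ForkManager->new($forks);

my @array = (1 .. 100);               # placeholder list of work items
foreach my $t (@array) {
    my $pid = $pm->start and next;    # fork a child; the parent moves on
    # ... process $t here ...
    $pm->finish;                      # end the child process
}
$pm->wait_all_children;               # pause until every child has finished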

To close, a comment on the advantages of parallel processing is in order. Modern multicore processors often run single-threaded operations, which leaves most of their capacity unused. Utilising this extra power can shorten processing times markedly. To give an idea of what can be achieved, I had a single script taking around 2.5 minutes to complete in single-threaded mode, while setting the maximum number of child processes to 24 reduced this to just over half a minute while taking up 80% of the processing capacity. This was with an AMD Ryzen 7 2700X CPU with eight cores and a maximum of 16 processor threads. Surprisingly, using 16 as the maximum only used half the processor capacity, so it seems to be a matter of performing one's own measurements when making these decisions.

Interrogating Solaris hardware for installed CPU and memory resources

2nd October 2008

When working with a Solaris server, there are times when you need to know a little more about the hardware configuration. Knowing how much memory you have and how many processors there are is very useful if you are not to hog such resources.

The command for revealing how much memory has been installed is:

prtconf -v
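
The -v listing is lengthy, so it can help to filter the plain prtconf output for the summary line; the figure shown below is purely illustrative:

prtconf | grep Memory

Output:

Memory size: 4096 Megabytes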

Since memory is often allocated to individual CPUs, knowing how many are on the system is a must. This command gives you the bare number:

psrinfo -p

The following variant provides the full detail that you see below it:

psrinfo -v

Output:

Status of virtual processor 0 as of: 10/06/2008 16:47:54
on-line since 09/13/2008 14:47:52.
The sparcv9 processor operates at 1503 MHz,
and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 10/06/2008 16:47:54
on-line since 09/13/2008 14:47:49.
The sparcv9 processor operates at 1503 MHz,
and has a sparcv9 floating point processor.

For a level of detail between these two extremes, try the following command to get output like that shown below it:

psrinfo -vp

Output:

The physical processor has 1 virtual processor (0)
UltraSPARC-IIIi (portid 0 impl 0x16 ver 0x34 clock 1503 MHz)
The physical processor has 1 virtual processor (1)
UltraSPARC-IIIi (portid 1 impl 0x16 ver 0x34 clock 1503 MHz)
