Technology Tales

Notes drawn from experiences in consumer and enterprise technology

TOPIC: R PACKAGE

Open Source Tools for Pharmaceutical Clinical Data Reporting, Analysis & Regulatory Submissions

25th March 2026

There was a time when SAS was the predominant technology for clinical data reporting, analysis and submission work in the pharmaceutical industry. Within the last decade, open-source alternatives have gained a lot of traction, and the {pharmaverse} initiative has arisen from this. The packages span everything from dataset creation (SDTM and ADaM) to output production, with utilities for test data and submission activities, so there is quite a range here. The effort also marks a change from each company working by itself to sharing and collaborating with others. Here then is the outcome of their endeavours.

{admiral}

Designed as an open-source, modular R toolbox, the {admiral} package assists in the creation of ADaM datasets through reusable functions and utilities tailored for pharmaceutical data analysis. Core packages handle general ADaM derivations whilst therapeutic area-specific extensions address more specialised needs, with a structured release schedule divided into two phases. Usability, simplicity and readability are central priorities, supported by comprehensive documentation, vignettes and example scripts. Community contributions and collaboration are actively encouraged, with the aim of fostering a shared, industry-wide approach to ADaM development in R. Related packages for test data and metadata manipulation complement the main toolkit, alongside a commitment to consistent coding practices and accessible code.

{aNCA}

Maintained by contributors from F. Hoffmann-La Roche AG, {aNCA} is an open-source R Shiny application that makes Non-Compartmental Analysis (NCA) accessible to scientists working with clinical and pre-clinical pharmacokinetic datasets. Users can upload their own data, apply pre-processing filters and run NCA with configurable options including half-life calculation rules, manual slope selection and user-defined AUC intervals. Results are explorable through interactive box plots, scatter plots and summary statistics tables, and can be exported in `PP` and `ADPP` dataset domains alongside a reproducible R script. Analysis settings can be saved and reloaded for continuity across sessions. Installation is available from CRAN via a standard install command, from GitHub using the `pak` package manager, or by cloning the repository directly for those wishing to contribute.

{autoslider.core}

The {autoslider.core} package generates standard table templates commonly used in Study Results Endorsement Plans. Its principal purpose is to reduce duplicated effort between statisticians and programmers when creating slides. Available on CRAN, the package can be installed either through the standard installation method or directly from GitHub for the latest development version.

{cards}

Supporting the CDISC Analysis Results Standard, the {cards} package facilitates the creation of analysis results data sets that enhance automation, reproducibility and consistency in clinical research. Structured data sets for statistical summaries are generated to enable tasks such as quality control, pre-calculating statistics for reports and combining results across studies. Tools for creating, modifying and analysing these data sets are provided, with the {cardx} extension offering additional functions for statistical tests and models. Installation is available through CRAN or GitHub, with resources including documentation and community contributions.
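
To give a flavour of what these structured summaries look like, here is a minimal sketch using `ard_continuous()` and `ard_categorical()` on the built-in mtcars data; the column choices are purely illustrative and argument details may differ between releases.

```r
library(cards)

# Summary statistics for a continuous variable, split by a grouping column
ard_continuous(mtcars, by = cyl, variables = mpg)

# Counts and percentages for a categorical variable
ard_categorical(mtcars, by = cyl, variables = gear)
```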

{cardx}

Extending the {cards} package, {cardx} facilitates the creation of Analysis Results Data Objects (ARDs) in R by leveraging utility functions from {cards} and statistical methods from packages such as {stats} and {emmeans}. These ARDs enable the generation of tables and visualisations for regulatory submissions, support quality control checks by storing both results and parameters, and allow for reproducible analyses through the inclusion of function inputs. Installation options include CRAN and GitHub, with examples demonstrating its use in t-tests and regression models. External statistical library dependencies are not enforced by the package, requiring explicit references in code for tools like {renv} to track them.

{chevron}

A collection of high-level functions for generating standardised outputs in clinical trials reporting, {chevron} covers a broad range of output types including tables for safety summaries, adverse events, demographics, ECG results, laboratory findings, medical history, response data, time-to-event analyses and vital signs, as well as listings and graphs such as Kaplan-Meier and mean plots. Straightforward implementation with limited parameterisation is a defining characteristic of the package. It is available on CRAN, with a development version accessible via GitHub, and those requiring greater flexibility are directed to the related {tern} package and its associated catalogue.

{clinify}

Built on the {flextable} and {officer} packages, {clinify} streamlines the creation of clinical tables, listings and figures whilst addressing challenges such as adherence to organisational reporting standards, the need for flexibility across different clients and the importance of reusable configurations. Compatibility with existing tools is a key priority, ensuring that its features do not interfere with the core functionalities of {flextable} or {officer}, whilst enabling tasks like dynamic page breaks, grouped headers and customisable formatting. Complex documents such as Word files with consistent layouts and tailored elements like footnotes and titles can be produced with reduced effort by building on these established frameworks.

{connector}

Offering a unified interface for establishing connections to various data sources, the {connector} package covers file systems and databases through a central configuration file that maintains consistent references across project scripts and facilitates switching between data sources. Functions such as `connector_fs()` for file system access and `connector_dbi()` for database connections are provided, with additional expansion packages enabling integration with specific platforms like Databricks and SharePoint. Installation is available via CRAN or GitHub, and usage involves defining a YAML configuration file to specify connection details that can then be initialised and utilised to interact with data sources. Operations including reading, writing and listing content are supported, with methods for managing connections and handling data in formats like parquet.

{covtracer}

Linking test traces to package code and documentation using coverage data from {covr}, the {covtracer} package enables the creation of a traceability matrix that maps tests to specific documented functions. Installation is via remotes from GitHub with specific dependencies, and configuration of {covr} is required to record tests alongside coverage traces. Untested behaviours can be identified and the direct testing of functions assessed, providing insights into test coverage and software validation. The example workflow demonstrates generating a matrix to show which tests evaluate code related to documented behaviours, highlighting gaps in test coverage.

{datacutr}

An open-source solution for applying data cuts to SDTM datasets within R, the {datacutr} package is designed to support pharmaceutical data analysis workflows. Available via CRAN or GitHub, it offers options for different types of cuts tailored to specific SDTM domains. Supplemental qualifiers are assumed to be merged with their parent domain before processing, allowing users flexibility in defining cut types such as patient, date, or domain-specific cuts. Documentation, contribution guidelines and community support through platforms like Slack and GitHub provide further assistance.

{datasetjson}

Facilitating the creation and manipulation of CDISC Dataset JSON formatted datasets, the {datasetjson} R package enables users to generate structured data files by applying metadata attributes to data frames. Metadata such as file paths, study identifiers and system details can be incorporated into dataset objects and written to disk or returned as JSON text. Reading JSON files back into data frames is also supported, with metadata preserved as attributes for use in analysis. The package currently supports version 1.1.0 of the Dataset JSON standard and is available via CRAN or GitHub.

{dataviewR}

An interactive data viewer for R, {dataviewR} enhances data exploration through a Shiny-based interface that enables users to examine data frames and tibbles with tools for filtering, column selection and generating reproducible {dplyr} code. Viewing multiple datasets simultaneously is supported, and the tool provides metadata insights alongside features for importing and exporting data, all within a responsive and user-friendly design. By combining intuitive navigation with automated code generation, the package aims to streamline data analysis workflows and improve the efficiency of dataset manipulation and documentation.

{docorator}

Generating formatted documents by adding headers, footers and page numbers to displays such as tables and figures, {docorator} exports outputs as PDF or RTF files. Accepted inputs include tables created with the {gt} package, figures generated using {ggplot2}, or paths to existing PNG files, and users can customise document elements like titles and footers. The package can be installed from CRAN or via GitHub, and its use involves creating a display object with specified formatting options before rendering the output. LaTeX libraries are required for PDF generation.

{envsetup}

Providing a configuration system for managing R project environments, the {envsetup} package enables adaptation to different deployment stages such as development, testing and production without altering code. YAML files are used to define paths for data and output directories, and R scripts are automatically sourced from specified locations to reduce the need for manual configuration changes. This approach supports consistent code usage across environments whilst allowing flexibility in environment-specific settings, streamlining workflows for projects requiring multiple deployment contexts.

{ggsurvfit}

Simplifying the creation of survival analysis visualisations using {ggplot2}, the {ggsurvfit} package offers tools to generate publication-ready figures with features such as confidence intervals, risk tables and quantile markers. Seamless integration with {ggplot2} functions allows for extensive customisation of plot elements whilst maintaining alignment between graphical components and annotations. Competing risks analysis is supported through `ggcuminc()`, and specific functions such as `Surv_CNSR()` handle CDISC ADaM `ADTTE` data by adjusting event coding conventions to prevent errors. Installation options are available via CRAN or GitHub, with examples and further resources accessible through its documentation and community links.
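
As a minimal sketch, here is a Kaplan-Meier figure built from the lung dataset in {survival} rather than a CDISC `ADTTE` dataset:

```r
library(ggsurvfit)
library(survival)

# survfit2() keeps the call environment so strata can be labelled automatically
survfit2(Surv(time, status) ~ sex, data = lung) |>
  ggsurvfit() +
  add_confidence_interval() +
  add_risktable()
```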

{gridify}

Addressing challenges in creating consistent and customisable graphical arrangements for figures and tables, the {gridify} package leverages the base {grid} package to facilitate the addition of headers, footers, captions and other contextual elements through predefined or custom layouts. Multiple input types are supported, including {ggplot2}, {flextable} and base R plots, and the workflow involves generating an object, selecting a layout and using functions to populate text elements before rendering the final output. Installation options include CRAN and GitHub, with examples demonstrating its application in enhancing tables with metadata and formatting. Uniformity across different projects is promoted, reducing manual adjustments and aligning visual elements consistently.

{gtsummary}

Offering a streamlined approach to generating publication-quality analytical and summary tables in R, the {gtsummary} package enables users to summarise datasets, regression models and other statistical outputs with minimal code. Variable types are identified automatically, relevant descriptive statistics computed and measures of data incompleteness included, whilst customisation of table formatting such as adjusting labels, adding p-values or merging tables for comparative analysis is also supported. Integration with packages like {broom} and {gt} facilitates the creation of visually appealing tables, and results can be exported to multiple formats including HTML, Word and LaTeX, making the package suitable for reproducible reporting in academic and professional contexts.
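
A short sketch using the trial dataset that ships with the package gives an idea of how little code is involved:

```r
library(gtsummary)

trial |>
  tbl_summary(
    by = trt,                          # one column per treatment arm
    include = c(age, grade, response)
  ) |>
  add_p()                              # append p-values comparing the arms
```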

{logrx}

Supporting logging in clinical programming environments, the {logrx} package generates detailed logs for R scripts, ensuring code execution is traceable and reproducible. An overview of script execution and the associated environment is provided, enabling users to recreate conditions for verification or further analysis. Available on CRAN, installation is possible via standard methods or from its development repository, offering flexibility for both file-based and scripted usage. Structured logging tailored to the specific requirements of clinical applications is the defining characteristic of the package, with simplicity and minimal intrusion in coding workflows maintained throughout.
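
Usage can be as brief as the sketch below, where `axecute()` runs a script and writes its log; the file path is purely illustrative.

```r
library(logrx)

# Run a script and capture a log of the execution and its environment
axecute("programs/adsl.R")
```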

{metacore}

Providing a standardised framework for managing metadata within R sessions, the {metacore} package is particularly suited to clinical trial data analysis. Metadata is organised into six interconnected tables covering dataset specifications, variable details, value definitions, derivations, code lists and supplemental information, ensuring consistency and ease of access. By centralising metadata in a structured, immutable format, the package facilitates the development of tools that can leverage this information across different workflows, reducing the need for redundant data structures. Reading metadata from various sources, including Define-XML 2.0, is also supported.

{metatools}

Working with {metacore} objects, {metatools} enables users to build datasets, enhance columns in existing datasets and validate data against metadata specifications. Installation is available from CRAN or via GitHub. Core functionality includes pulling columns from existing datasets, creating new categorical variables, converting columns to factors and running checks to verify that data conforms to control terminology and that all expected variables are present.

{pharmaRTF}

Developed to address gaps in RTF output capabilities within R, {pharmaRTF} is a package for pharmaceutical industry programmers who produce RTF documents for clinical trial data analysis. Whilst the {huxtable} package offers extensive RTF styling and formatting options, it lacks the ability to set document properties such as page size and orientation, repeat column headers across pages, or create multi-level titles and footnotes within document headers and footers. These limitations are resolved by {pharmaRTF}, which wraps around {huxtable} tables to provide document property controls, proper multipage display and title and footnote management within headers and footers. Two core objects form the basis of the package: `rtf_doc` for document-wide attributes and `hf_line` for creating individual title and footnote lines, each carrying formatting properties such as alignment, font and bold or italic styling. Default output files use Courier New at 12-point size and Letter page dimensions in landscape orientation with one-inch margins, though all of these can be adjusted through property functions. The package is available on CRAN and supports both a {tidyverse} piping style and a more traditional assignment-based coding approach.
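
The sketch below, which assumes a {huxtable} table and illustrative titles and file names, shows roughly how those two objects fit together; argument details may vary between versions.

```r
library(huxtable)
library(pharmaRTF)

# A display built with {huxtable}, then wrapped for RTF output
ht <- as_hux(head(mtcars), add_colnames = TRUE)

doc <- rtf_doc(ht) |>
  add_titles(hf_line("Table 1.1: Demonstration Output", bold = TRUE)) |>
  add_footnotes(hf_line("Source: mtcars demonstration data", italic = TRUE))

write_rtf(doc, file = "table_1_1.rtf")
```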

{pharmaverseadam}

Serving as a repository for ADaM test datasets generated by executing templates from related packages such as {admiral} and its extensions, the {pharmaverseadam} package automates dataset creation through a script that installs required packages, runs templates and saves results. Metadata is managed centrally in an XLSX file to ensure consistency in documentation, and updates occur regularly or ad-hoc when templates change. Documentation is generated automatically from metadata and saved as `.R` files, and the package includes contributions from multiple developers with examples provided for each dataset. Preparing metadata, updating configuration files for new therapeutic areas and executing a script to generate datasets and documentation ensure alignment with the latest versions of dependent packages. Installation is available via CRAN or GitHub.

{pharmaverseraw}

Providing raw datasets to support the creation of SDTM datasets, the {pharmaverseraw} package includes examples that are independent of specific electronic data capture systems or data standards such as CDASH. Datasets are named using SDTM domain identifiers with the suffix `_raw`, and installation options include CRAN or direct GitHub access. Updates involve contributing via GitHub issues, generating new or modified datasets through standalone R scripts stored in the `data-raw` folder, and ensuring generated files are saved in the `data` folder as `.rda` files with consistent naming. Documentation is maintained in `R/*.R` files, and changes require updating `NAMESPACE` and `.Rd` files using `devtools::document()`.

{pharmaversesdtm}

A collection of test datasets formatted according to the SDTM standard, the {pharmaversesdtm} package is designed for use within the pharmaverse family of packages. Datasets applicable across therapeutic areas, such as `DM` and `VS`, are included alongside those specific to particular areas, like `RS` and `OE`. Available via CRAN and GitHub, the package provides installation instructions for both stable and development versions, with test data sourced from the CDISC pilot project and ad-hoc datasets generated by the {admiral} team. Naming conventions distinguish between general and therapeutic area-specific categories, with examples such as `dm` for general use and `rs_onco` for oncology-specific data. Updates involve creating or modifying R scripts in the `data-raw` folder, generating `.rda` files and updating metadata in a central JSON file to automate documentation and maintain consistency, including specifying dataset details like labels, descriptions and therapeutic areas.

{pkglite} (R)

Converting R package source code into text files and reconstructing package structures from those files, {pkglite} enables the exchange and management of R packages as plain text. Single or multiple packages can be processed through functions that collate, pack and unpack files, with installation options available via CRAN or GitHub. The tool adheres to a defined format for text files and includes documentation for generating specifications and managing file collections.
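
A hedged sketch of that round trip, with illustrative paths, might look like this:

```r
library(pkglite)

# Pack an R package source tree into a single text file
"path/to/mypkg" |>
  collate(file_default()) |>
  pack(output = "mypkg.txt")

# Restore the original directory structure elsewhere
unpack("mypkg.txt", output = "restored/")
```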

{pkglite} (Python)

An open-source framework licensed under the MIT licence, {pkglite} for Python allows source projects written in any programming language to be packed into portable files and restored to their original directory structure. Installation is available via PyPI or as a development version cloned from GitHub, and the package can also be run without installation using `uvx`. A command line interface, which can be installed globally using `pipx`, is provided in addition to the Python API.

{rhino}

Streamlining the development of high-quality, enterprise-grade Shiny applications, {rhino} integrates software engineering best practices, modular code structures and robust testing frameworks. Scalable architecture is supported through modularisation, code quality is enhanced with unit and end-to-end testing, and automation is facilitated via tools for project setup, continuous integration and dependency management. Comprehensive documentation is divided into tutorials, explanations and guides, with examples and resources available for learning.

{risk.assessr}

Evaluating the reliability and security of R packages during validation, the {risk.assessr} package analyses maintenance, documentation and dependencies through metrics such as R CMD check results, unit test coverage and dependency assessments. A traceability matrix linking functions to tests is generated, and risk profiles are based on predefined thresholds including documentation completeness, licence type and code coverage. The tool supports installation from GitHub or CRAN, processes local package files or `renv.lock` dependencies and offers detailed outputs such as risk analysis, dependency lists and reverse dependency information. Advanced features include identifying potential issues in suggested package dependencies and generating HTML reports for risk evaluation, with applications in clinical trial workflows and package validation processes.

{riskassessment}

Built on the {riskmetric} framework, the {riskassessment} application offers a user-friendly interface for evaluating the risk of using R packages within regulated industries, assessing development practices, documentation and sustainability. Non-technical users can review {riskmetric} outputs, add personalised comments, categorise packages into risk levels, generate reports and store assessments securely, with features such as user authentication and role-based access. Alignment with validation principles outlined by the R Validation Hub supports decision-making in regulated settings, though deeper software inspection may be required in some cases. Deployment is possible using tools like Shiny Server or Posit Connect, with installation options including GitHub and local configuration via {renv}.

{riskmetric}

Providing a framework for evaluating the quality of R packages, the {riskmetric} package assesses development practices, documentation, community engagement and sustainability through a series of metrics. Currently operating in a maintenance-only phase, further development is focused on a new tool called {val.metre}. The workflow involves retrieving package information, assessing it against predefined criteria and generating a risk score, with installation available from CRAN or GitHub. An associated application, {riskassessment}, offers a user interface for organisations to review and manage package risk assessments, store metrics and apply organisational rules.

{rlistings}

Designed to create and display formatted listings with a focus on ASCII rendering for tables and regulatory-ready outputs, the {rlistings} R package relies on the {formatters} package for formatting infrastructure. Requirements such as flexible pagination, multiple output formats and repeated key columns informed its development. Available on CRAN and GitHub, the package is under active development and includes features such as adjustable column widths, alignment and support for titles and footnotes.

{rtables}

Tailored for generating submission-ready tables for health authority review, the {rtables} R package creates and displays complex tables with advanced formatting and output options that support regulatory requirements for clinical trial data presentation. Separation of data values from their visualisation is enabled, multiple values can be included within cells, and flexible tabulation and formatting capabilities are provided, including cell spans, rounding and alignment. Output formats include HTML, ASCII, LaTeX, PDF and PowerPoint, with additional formats under development. The package also incorporates features such as pagination, distinction between data names and labels for CDISC standards and support for titles and footnotes. Installation is available via CRAN or GitHub, with ongoing community support and training resources.
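
As a rough sketch, the layout-based workflow looks something like the following, built here against the example ADSL-like data that accompanies the package rather than real study data:

```r
library(rtables)

# Define a layout, then build it against a data frame
lyt <- basic_table() |>
  split_cols_by("ARM") |>
  analyze("AGE", afun = mean, format = "xx.x")

build_table(lyt, ex_adsl)  # ex_adsl: example data shipped with the package
```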

{rtflite}

A lightweight Python library focused on precise formatting of production-quality tables and figures, {rtflite} is designed for composing RTF documents. Installation is available via PyPI or directly from its GitHub repository, with optional dependencies available to enable DOCX assembly support and RTF-to-PDF or RTF-to-DOCX conversion via LibreOffice.

{sdtm.oak}

Offering a modular, open-source solution for generating CDISC SDTM datasets, the {sdtm.oak} R package is designed to work across different electronic data capture systems and data standards. Industry challenges related to inconsistent raw data structures and varying data collection practices are addressed through reusable algorithms that map raw datasets to SDTM domains, with current capabilities covering Findings, Events and Intervention classes. Future developments aim to expand domain support, introduce metadata-driven code generation and enhance automation potential, though sponsor-specific metadata management tasks are not yet handled by the package. Available on CRAN and GitHub, development is ongoing with refinements based on user feedback and evolving SDTM requirements.

{sdtmchecks}

Providing functions to detect common data issues in SDTM datasets, the {sdtmchecks} package is designed to be broadly applicable and useful for analysis. Installation is available from CRAN or via GitHub, with development versions accessible through specific repositories, and users are not required to specify SDTM versions. A range of data check functions stored as R scripts is included, and contributions are encouraged that maintain flexibility across different data standards.

{siera}

Facilitating the generation of Analysis Results Datasets (ARDs) by processing Analysis Results Standard (ARS) metadata, the {siera} package works with parameters such as analysis sets, groupings, data subsets and methods. Metadata is typically provided in JSON format and used to automatically create R scripts that, when executed with corresponding ADaM datasets, produce ARDs in a structured format. The package can be installed from CRAN or GitHub, and its primary function, `readARS`, requires an ARS file, an output directory and access to relevant ADaM data. The CDISC Analysis Results Standard underpins this process, promoting automation and consistency in analysis outcomes.

{teal}

An open-source, Shiny-based interactive framework for exploratory data analysis, {teal} is developed as part of the pharmaverse ecosystem and maintained by F. Hoffmann-La Roche AG alongside a broad community of contributors. Analytical applications are built by combining supported data types, including CDISC clinical trial data, independent or relational datasets and `MultiAssayExperiment` objects, with modular analytical components known as teal modules. These modules can be drawn from dedicated packages covering general data exploration, clinical reporting and multi-omics analysis and define the specific analyses presented within an application. A suite of companion packages handles logging, reproducibility, data loading, filtering, reporting and transformation. The package is available on CRAN and is under active development, with community support provided through the {pharmaverse} Slack workspace.

{tern}

Supporting clinical trial reporting through a broad range of analysis functions, the {tern} R package offers data visualisation capabilities including line plots, Kaplan-Meier plots, forest plots, waterfall plots and Bland-Altman plots. Statistical model fit summaries for logistic and Cox regression are also provided, along with numerous analysis and summary table functions. Many of these outputs can be integrated into interactive Teal Shiny applications via the {teal.modules.clinical} package.

{tfrmt}

Offering a structured approach to defining and applying formatting rules for data displays in clinical trials, the {tfrmt} package streamlines the creation of mock displays, aligns with industry-standard Analysis Results Data (ARD) formats and integrates formatting tasks into the programming workflow to reduce manual effort and rework. Metadata is leveraged to automate styling and layout, enabling standardised formatting with minimal code, supporting quality control before final output and facilitating the reuse of datasets across different table types. Built on the {gt} package, the tool provides a flexible interface for generating tables and mock-ups, allowing users to focus on data interpretation rather than repetitive formatting tasks.

{tfrmtbuilder}

A tool for defining display-related metadata to streamline the creation and modification of table formats, the {tfrmtbuilder} package supports workflows such as generating tables from scratch, using templates or editing existing ones. Features include a toggle to switch between mock and real data, options to load or create datasets, tools for mapping and formatting data and the ability to export results as JSON, HTML or PNG. Designed for use in study planning and analysis phases, the package allows users to manage table structures efficiently.

{tidyCDISC}

An open-source R Shiny application, {tidyCDISC} is designed to help clinical personnel explore and analyse ADaM-standard data sets without writing any code. Customised clinical tables can be generated through a point-and-click interface, trends across patient populations examined using dynamic figures and individual patient profiles explored in detail. A broad range of users is served, from clinical heads with no programming background to statisticians and statistical programmers, with reported time savings of around 95% for routine trial analysis tasks. The app accepts only `sas7bdat` files conforming to CDISC ADaM standards and includes a feature to export reproducible R scripts from its table generator. A demo version is available without installation using CDISC pilot data, whilst uploading study data requires installing the package from CRAN or via GitHub.

{tidytlg}

Facilitating the creation of tables, listings and graphs using the {tidyverse} framework, the {tidytlg} package offers two approaches: a functional method involving custom scripts for each output and a metadata-driven method that leverages column and table metadata to generate results automatically. Tools for data analysis, including frequency tables and univariate statistics, are included alongside support for exporting outputs to formatted documents.

{Tplyr}

Simplifying the creation of clinical data summaries by breaking down complex tables into reusable layers, {Tplyr} allows users to focus on presentation rather than repetitive data processing. The conceptual approach of {dplyr} is mirrored but applied to common clinical table types, such as counting event-based variables, generating descriptive statistics for continuous data and categorising numerical ranges. Metadata is included with each summary produced to ensure traceability from raw data to final output, and user-acceptance testing documentation is provided to support its use in regulated environments. Installation options are available via CRAN or GitHub, accompanied by detailed vignettes covering features like layer templates, metadata extension and styled table outputs.
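
The sketch below illustrates the layered idea using hypothetical ADSL-style inputs; `adsl`, `TRT01P`, `AGE` and `AGEGR1` are stand-ins rather than bundled data.

```r
library(Tplyr)

# adsl is a hypothetical ADSL data frame
adsl |>
  tplyr_table(TRT01P) |>
  add_layer(group_desc(AGE, by = "Age (years)")) |>
  add_layer(group_count(AGEGR1, by = "Age group")) |>
  build()
```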

{valtools}

Streamlining the validation of R packages used in clinical research and drug development, {valtools} offers templates and functions to support tasks such as setting up validation frameworks, managing requirements and test cases and generating reports. Developed by the R Package Validation Framework PHUSE Working Group, the package integrates with standard development tools and provides functions prefixed with `vt` to facilitate structured validation processes including infrastructure setup, documentation creation and automated checks. Generating validation reports, scraping metadata from validation configurations and executing validation workflows through temporary installations or existing packages are all supported.

{whirl}

Facilitating the execution of scripts in batch mode whilst generating detailed logs that meet regulatory requirements, the {whirl} package produces logs including script status, execution timestamps, environment details, package versions and environmental variables, presented in a structured HTML format. Individual or multiple scripts can be run simultaneously, with parallel processing enabled through specified worker counts. A configuration file allows scripts to be executed in sequential steps, ensuring dependencies are respected, and the package produces individual logs for each script alongside a summary log and a tibble summarising execution outcomes. Installation options include CRAN and GitHub, with documentation available for customisation and advanced usage.

{xportr}

Assisting clinical programmers in preparing CDISC compliant XPT files for clinical data sets, the {xportr} package associates metadata with R data frames, performs validation checks and converts data into transportable SAS v5 XPT format. Tools are included to define variable types, set appropriate lengths, apply labels, format data, reorder variables and assign dataset labels, ensuring adherence to standards such as variable naming conventions, character length limits and the absence of non-ASCII characters. A practical example demonstrates how to use a specification file to apply these transformations to an ADSL dataset, ultimately generating a compliant XPT file.
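
The pipeline might be sketched as follows, where `adsl` and `var_spec` stand in for a real dataset and variable-level specification; argument details may differ between releases.

```r
library(xportr)

# adsl: a hypothetical ADSL data frame; var_spec: a hypothetical specification
adsl |>
  xportr_type(var_spec, "ADSL") |>
  xportr_length(var_spec, "ADSL") |>
  xportr_label(var_spec, "ADSL") |>
  xportr_order(var_spec, "ADSL") |>
  xportr_format(var_spec, "ADSL") |>
  xportr_write("adsl.xpt")
```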

Some R packages to explore as you find your feet with the language

24th March 2026

Here are some commonly used R packages and other tools, along with others that I have encountered while getting started with the language, itself becoming pervasive in my line of business. The collection grew organically as my explorations proceeded, and it reflects what I was trying out during my acclimatisation.

General

Here are two general packages to get things started, with one of them being unavoidable in the R world. The other is more advanced, possibly offering more to package developers.

{tidyverse}

You cannot use R without knowing about this collection of packages. In many ways, they form a mini-language of their own, drawing some criticism from those who reckon that base R functionality covers a sufficient gamut anyway. Nevertheless, there is so much here that will get you going with data wrangling and visualisation that it is worth knowing what is possible. The complaints may well stem from the fact that you end up needing little else for these purposes.

{plumber}

This R package enables developers to convert existing R functions into web API endpoints by adding roxygen2-like comment annotations to their code. Once annotated, functions can handle HTTP GET and POST requests, accept query string or JSON parameters and return outputs such as plain values or rendered plots. The package is available on CRAN as a stable release, with a development version hosted on GitHub. For deployment, it integrates with DigitalOcean through a companion package called {plumberDeploy}, and also supports Posit Connect, PM2 and Docker as hosting options. Related projects in the same space include OpenCPU, which is designed for hosting R APIs in scientific research contexts, and the now-discontinued jug package, which took a more programmatic approach to API construction.
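
A minimal sketch of such an annotated file, with an illustrative endpoint, might look like this:

```r
# Contents of a hypothetical plumber.R file

#* Return the sum of two numbers
#* @param a First number
#* @param b Second number
#* @get /sum
function(a, b) {
  as.numeric(a) + as.numeric(b)
}

# Served from another session with:
# plumber::plumb("plumber.R")$run(port = 8000)
```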

Data Preparation

You simply cannot avoid working with data during any analysis or reporting work. While there is a learning curve if you are used to other languages, there is little doubt that R is well-endowed when it comes to performing these tasks. Here are some packages that extend base R capabilities and might even add some extra user-friendliness along the way.

{forcats}

The {forcats} package in R provides functions to manage categorical variables by reordering factor levels, collapsing infrequent values and adjusting their sequence based on frequency or other variables. It includes tools such as reordering by another variable, grouping rare categories into 'other' and modifying level order manually, which are useful for data analysis and visualisation workflows. Designed as part of the tidyverse, it integrates with other packages to streamline tasks like counting and plotting categorical data, enhancing clarity and efficiency in handling factors within R.
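
A few illustrative calls give the idea:

```r
library(forcats)

colours <- factor(c("red", "blue", "red", "green", "red", "blue", "pink"))

fct_infreq(colours)            # reorder levels by how often they occur
fct_lump_n(colours, n = 2)     # keep the two most common levels, lump the rest into "Other"
fct_relevel(colours, "green")  # move a chosen level to the front
```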

{tidyr}

Around this time last year, I remember completing a LinkedIn course on a set of good practices known as tidy data, where each variable occupies a column, each observation a row and each value a single cell. This package is designed to help users restructure data so it follows those rules. It provides tools for reshaping data between long and wide formats, handling nested lists, splitting or combining columns, managing missing values and layering or flattening grouped data.

Installation options include the {tidyverse} collection, standalone installation, or the development version from GitHub. The package succeeds earlier reshaping tools like {reshape2} and {reshape}, offering a focused approach to tidying data rather than general reshaping or aggregation.
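
A small sketch of moving between wide and long formats, using a made-up data frame, looks like this:

```r
library(tidyr)

wide <- data.frame(id = 1:2, visit1 = c(5.1, 4.8), visit2 = c(5.4, 5.0))

# Wide to long: one row per subject per visit
long <- pivot_longer(wide, cols = starts_with("visit"),
                     names_to = "visit", values_to = "value")

# And back again
pivot_wider(long, names_from = visit, values_from = value)
```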

{haven}

Having a long track record of working with SAS, I find that {haven} arouses my interest with its ability to read and write data files from statistical software such as SAS, SPSS and Stata, leveraging the ReadStat library. Handily, it supports a range of file formats, including SAS transport and data files, SPSS system and older portable files and Stata data files up to version 15, converting these into tibbles with enhanced printing capabilities. Value labels are preserved as a labelled class, allowing conversion to factors, while dates and times are transformed into standard R classes.
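
Typical usage is as brief as the sketch below; the file names are illustrative.

```r
library(haven)

adsl <- read_sas("adsl.sas7bdat")  # SAS dataset to a tibble
dm   <- read_xpt("dm.xpt")         # SAS v5 transport file

write_xpt(adsl, "adsl_out.xpt")    # write back out as a transport file
```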

{RMariaDB}

While there are other approaches to working with databases using R, {RMariaDB} provides a database interface and driver for MariaDB, designed to fully comply with the DBI specification and serve as a replacement for the older {RMySQL} package. It supports connecting to databases using configuration files, executing queries, reading and writing data tables and managing results in chunks. Installation options include binary packages from CRAN or development versions from GitHub, with additional dependencies such as MariaDB Connector/C or libmysqlclient required for Linux and macOS systems. Configuration is typically handled through a MariaDB-specific file, and the package includes acknowledgments for contributions from various developers and organisations.
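
A hedged sketch of the usual DBI workflow, with illustrative connection details and an assumed `ae_data` data frame, might look like this:

```r
library(DBI)
library(RMariaDB)

con <- dbConnect(RMariaDB::MariaDB(),
                 host = "localhost", user = "analyst",
                 password = Sys.getenv("MARIADB_PWD"), dbname = "trials")

dbWriteTable(con, "ae", ae_data, overwrite = TRUE)  # ae_data: an existing data frame
res <- dbGetQuery(con, "SELECT USUBJID, AETERM FROM ae LIMIT 10")

dbDisconnect(con)
```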

COVID-19 Data Hub

For many people, the pandemic may be a fading memory, yet it offered its chances for learning R, not least because there was a use case with more than a hint of personal interest about it. Here is a library making it easier to get hold of the data, with some added pre-processing too. Memories of how I needed to wrangle what was published by various sources make me appreciate just how vital it is to have harmonised data for analysis work.

Table Production

While many appear to prefer the graphical presentation of results to their tabular display, R has its options here too. In recent times, the options have improved, particularly through the pharmaverse initiative. Here is a selection of what I found during my explorations.

{officer}

Part of the {officeverse} along with {officedown}, {flextable}, {rvg} and {mschart}, the {officer} R package enables users to create and modify Word and PowerPoint documents directly from R, allowing the insertion of images, tables and formatted content, as well as the import of document content into data frames. It supports the generation of RTF files and integrates with other packages for advanced features such as vector graphics and native office charts. Installation options include CRAN and GitHub, with community resources available for assistance and contributions. The package facilitates the manipulation of document elements like paragraphs, tables and section breaks and provides tools for exporting and importing content between R and office formats, alongside functions for managing slide layouts and embedded objects in presentations.
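
As a minimal sketch, building a small Word document goes something like this; the file name is illustrative.

```r
library(officer)

doc <- read_docx() |>
  body_add_par("Study Summary", style = "heading 1") |>
  body_add_par("Demographics are shown below.") |>
  body_add_table(head(mtcars))

print(doc, target = "summary.docx")
```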

{pharmaRTF}

If you work in clinical research like I do, the need to produce data tabulations is a non-negotiable requirement. That is how this package came to be developed, and the pharmaverse of which it is part has numerous other options, should you need to look at using one of those. The flavour of RTF produced here is the Microsoft Word variety, which did not look as good in LibreOffice Writer when I last viewed the results with that open-source alternative. Otherwise, the results look good to many eyes.

{formattable}

Here is a package that enhances data presentation by applying customisable formatting to vectors and data frames, supporting formats such as percentages, currency and accounting. Available on GitHub and CRAN, it integrates with dynamic document tools like {knitr} and {rmarkdown} to produce visually distinct tables, with features including gradient colour scales, conditional styling and icon-based representations. It automatically converts to {htmlwidgets} in interactive environments and is licensed under MIT, enabling flexible use in both static and interactive data displays.
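
A brief sketch with a made-up data frame shows the sort of styling on offer:

```r
library(formattable)

df <- data.frame(product = c("A", "B", "C"),
                 growth  = percent(c(0.12, 0.03, 0.27)),  # rendered as percentages
                 revenue = c(120, 95, 150))

formattable(df, list(
  growth  = color_bar("lightgreen"),          # in-cell bars scaled to the values
  revenue = color_tile("white", "lightblue")  # gradient background shading
))
```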

{reactable}

The {reactable} package for R provides interactive data tables built on the React Table library, offering features such as sorting, filtering, pagination, grouping with aggregation, virtual scrolling for large datasets and support for custom rendering through R or JavaScript. It integrates seamlessly into R Markdown documents and Shiny applications, enabling the use of HTML widgets and conditional styling. Installation options include CRAN and GitHub, with examples demonstrating its application across various datasets and scenarios. The package supports major web browsers and is licensed under MIT, designed for developers seeking dynamic data presentation tools within the R ecosystem.
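
A minimal sketch using the built-in iris data gives an idea of the options:

```r
library(reactable)

reactable(iris,
          filterable = TRUE,     # per-column filters
          searchable = TRUE,     # global search box
          defaultPageSize = 8,
          groupBy = "Species")   # grouped rows with expandable detail
```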

{DT}

Particularly useful in dynamic web applications like Shiny, the {DT} package in R provides a means of rendering interactive HTML tables by building on the DataTables JavaScript library. It supports features including sorting, searching, pagination and advanced filtering, with numeric, date and time columns using range-based sliders whilst factor and character columns rely on search boxes or dropdowns. Filtering operates on the client side by default, though server-side processing is also available. JavaScript callbacks can be injected after initialisation to manipulate table behaviour, such as enabling automatic page navigation or adding child rows to display additional detail. HTML content is escaped by default as a safeguard against cross-site scripting attacks, with the option to adjust this on a per-column basis. Whilst the package integrates with Shiny applications, attention is needed around scrolling and slider positioning to prevent layout problems. Overall, the package is well suited to exploratory data analysis and the building of interactive dashboards.
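
Something as short as the following sketch produces a filterable table:

```r
library(DT)

datatable(iris,
          filter = "top",                   # column filters along the top
          options = list(pageLength = 10),
          rownames = FALSE)
```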

{gt}

The {gt} package in R enables users to create well-structured tables with a variety of formatting options, starting from data frames or tibbles and incorporating elements such as headers, footers and customised column labels. It supports output in HTML, LaTeX and RTF formats and includes example datasets for experimentation. The package prioritises simplicity for common tasks while offering advanced functions for detailed customisation, with installation available via CRAN or GitHub. Users can access resources like documentation, community forums and example projects to explore its capabilities, and it is supported by a range of related packages that extend its functionality.
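
A minimal sketch starting from a data frame might look like this:

```r
library(gt)

head(mtcars) |>
  gt(rownames_to_stub = TRUE) |>
  tab_header(title = "Motor Trend Cars", subtitle = "First six rows") |>
  fmt_number(columns = c(mpg, wt), decimals = 1)
```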

{gtsummary}

Enabling users to produce publication-ready outputs with minimal code, the {gtsummary} package offers a streamlined approach to generating analytical and summary tables in R. It automates the summarisation of data frames, regression models and other datasets, identifying variable types and calculating relevant statistics, including measures of data incompleteness. Customisation options allow for formatting, merging and styling tables to suit specific needs, while integration with packages such as {broom} and {gt} facilitates seamless incorporation into R Markdown workflows. The package supports the creation of side-by-side regression tables and provides tools for exporting results as images, HTML, Word, or LaTeX files, enhancing flexibility for reporting and sharing findings.

{huxtable}

Here is an R package designed to generate LaTeX and HTML tables with a modern, user-friendly interface, offering extensive control over styling, formatting, alignment and layout. It supports features such as custom borders, padding, background colours and cell spanning across rows or columns, with tables modifiable using standard R subsetting or dplyr functions. Examples demonstrate its use for creating simple tables, applying conditional formatting and producing regression output with statistical details. The package also facilitates quick export to formats like PDF, DOCX, HTML and XLSX. Installation options include CRAN, R-Universe and GitHub, while the name reflects its origins as an enhanced version of the {xtable} package. The logo was generated using the package itself, and the background design draws inspiration from Piet Mondrian’s artwork.

Figure Generation

R has such a reputation for graphical presentations that it is cited as a strong reason to explore what the ecosystem has to offer. While base R itself is not shabby when it comes to creating graphs and charts, these packages will extend things by quite a way. In fact, the first on this list is near enough pervasive.

{ggplot2}

Though its default formatting does not appeal to me, the myriad of options makes this a very flexible tool, albeit at the expense of some code verbosity. Multi-panel plots are not among its strengths, which may send you elsewhere for that need.
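
Even a small sketch like the one below shows both the verbosity and the flexibility, with faceting covering simple panelling:

```r
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +                # simple panelling by a grouping variable
  theme_minimal() +                  # override the default look
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```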

{ggforce}

Focusing on features not included in the core library, the {ggforce} package extends {ggplot2} by offering additional tools to enhance data visualisation. Designed to complement the primary role of {ggplot2} in exploratory data analysis, it provides a range of geoms, stats and other components that are well-documented and implemented, aiming to support more complex and custom plot compositions. Available for installation via CRAN or GitHub, the package includes a variety of functionalities described in detail on its associated website, though specific examples are not included here.

{cowplot}

Developed by Claus O. Wilke for internal use in his lab, {cowplot} is an R package designed to help with the creation of publication-quality figures built on top of {ggplot2}. It provides a set of themes, tools for aligning and arranging plots into compound figures and functions for annotating plots or combining them with images. The package can be installed directly from CRAN or as a development version via GitHub, and it has seen widespread use in the book Fundamentals of Data Visualisation.
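
A brief sketch of arranging two {ggplot2} plots into a labelled compound figure:

```r
library(ggplot2)
library(cowplot)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

plot_grid(p1, p2, labels = c("A", "B"), ncol = 2)
```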

{sjPlot}

The {sjPlot} package provides a range of tools for visualising data and statistical results commonly used in social science research, including frequency tables, histograms, box plots, regression models, mixed effects models, PCA, correlation matrices and cluster analyses. It supports installation via CRAN for stable releases or through GitHub for development versions, with documentation and examples available online. The package is licensed under GPL-3 and developed by Daniel Lüdecke, offering functions to create visualisations such as scatter plots, Likert scales and interaction effect plots, along with tools for constructing index variables and presenting statistical outputs in tabular formats.

{thematic}

By offering a centralised approach to theming and enabling automatic adaptation of plot styles within Shiny applications, the {thematic} package simplifies the styling of R graphics, including {ggplot2}, {lattice} and base R plots, R Markdown documents and RStudio. It allows users to apply consistent visual themes across different plotting systems, with auto-theming in Shiny and R Markdown relying on CSS and {bslib} themes, respectively. Installation requires specific versions of dependent packages such as {shiny} and {rmarkdown}, while custom fonts benefit from {showtext} or {ragg}. Users can set global defaults for background, foreground and accent colours, as well as fonts, which can be overridden with plot-specific theme adjustments. The package also defines default colour scales for qualitative and sequential data and integrates with tools like bslib to import Google Fonts, enhancing visual consistency across different environments and user interfaces.

Publishing Tools

The R ecosystem goes beyond mere graphical and tabular display production to offer means for taking things much further, often offering platforms for publishing your work. These can be used locally too, so there is no need to entrust everything to a third-party provider. The uses are endless for what is available, and it appears that Posit has used this to help with building documentation and training too.

R Markdown

What you have here is one of those distinguishing facilities of the R ecosystem, particularly for those wanting to share their analysis work with more than a hint of reproducibility. The tool combines narrative text and code to generate various outputs, supporting multiple programming languages and formats such as HTML, PDF and dashboards. It enables users to produce reports, presentations and interactive applications, with options for publishing and scheduling through platforms like RStudio Connect, facilitating collaboration and distribution of results in professional settings.
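
A minimal document, rendered with rmarkdown::render() or the Knit button in RStudio, can be as simple as this sketch:

````markdown
---
title: "Analysis Report"
output: html_document
---

## Results

```{r summary-stats}
summary(mtcars$mpg)
```
````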

Distill for R Markdown

Distill for R Markdown is a tool designed to streamline the creation of technical documents, offering features such as code folding, syntax highlighting and theming. It builds on existing frameworks like Pandoc, MathJax and D3, enabling the production of dynamic, interactive content. Users can customise the appearance with CSS and incorporate appendices for supplementary information. The tool acknowledges the contributions of developers who created foundational libraries, ensuring accessibility and functionality for a wide audience. Its design prioritises clarity, allowing authors to focus on presenting results rather than underlying code, while maintaining flexibility for those who wish to include detailed explanations.

{shiny}

For a while, this was one of R's unique selling points, and it remains a compelling reason to use the language, even now that Python has its own version of the package. By enabling the creation of interactive web applications for data analysis without requiring web development expertise, it allows users to build interfaces that let others explore data through dynamic visualisations and filters. One simple example is an app that generates scatter plots with adjustable variables, species filters and marginal plots, hosted either on personal servers or through a dedicated hosting service.
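
The structural sketch below, a histogram with a variable selector rather than the scatter plot example just described, shows the pairing of a UI definition with a server function:

```r
library(shiny)

ui <- fluidPage(
  selectInput("var", "Variable", choices = names(mtcars)),
  plotOutput("hist")
)

server <- function(input, output, session) {
  output$hist <- renderPlot(hist(mtcars[[input$var]], main = input$var))
}

shinyApp(ui, server)
```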

{bslib}

The {bslib} R package offers a modern user interface toolkit for Shiny and R Markdown applications, leveraging Bootstrap to enable the creation of customisable dashboards and interactive theming. It supports the use of updated Bootstrap and Bootswatch versions while maintaining compatibility with existing defaults, and provides tools for real-time visual adjustments. Installation is available through CRAN, with example previews demonstrating its capabilities.

{rhandsontable}

Enabling users to manipulate and validate data within a spreadsheet-like interface, the {rhandsontable} package introduces an interactive data grid for R. It supports features such as custom cell rendering, validation rules and integration with Shiny applications. When used in Shiny, the widget requires explicit conversion of data using the `hot_to_r()` function, as updates may not be immediately reflected in reactive contexts. Examples demonstrate its application in various scenarios, including date editing, financial calculations and dynamic visualisations linked to charts. The package also accommodates bookmarks in Shiny apps with specific handling. Users are encouraged to report issues or contribute improvements, with guidance provided for those seeking to expand its functionality. The development team welcomes feedback to refine the tool further, ensuring it aligns with evolving user needs.

{xaringanExtra}

{xaringanExtra} offers a range of enhancements and extensions for creating and presenting slides with xaringan, enabling features such as adding an overview tile view, making slides editable, broadcasting in real time, incorporating animations, embedding live video feeds and applying custom styles. It allows users to selectively activate individual tools or load multiple features simultaneously through a single function call, supporting tasks like adding banners, enabling code copying, fitting slides to screen dimensions and integrating utility toolkits. The package is available for installation via CRAN or GitHub, providing flexibility for developers and presenters seeking to expand the functionality of their slides.

Online R programming books that are worth bookmarking

23rd March 2026

As part of making content more useful following its reorganisation, numerous articles on the R statistical computing language have appeared on here. All of those have taken a more narrative form. With this collation of online books on the R language, I take a different approach. What you find below is a collection of links with associated descriptions. While narrative accounts can be very useful, there is something handy about running one's eye down a compilation as well. Many entries have a corresponding print edition, some of which are not cheap to buy, which makes me wonder about the economics of posting the content online as well, though it can help with getting feedback during book preparation.

Big Book of R

We start with this comprehensive collection of over 400 free and affordable resources related to the R programming language, organised into categories such as data science, statistics, machine learning and specific fields like economics and life sciences. In many ways, it is a superset of what you find below and complements this collection with many other finds. The fact that it is a living collection makes it even more useful.

R Programming for Data Science

Here is an introduction to the R programming language, focusing on its application in data science. It covers foundational topics such as installation, data manipulation, function writing, debugging and code optimisation, alongside advanced concepts like parallel computation and data analysis case studies. The text includes practical guidance on handling data structures, using packages such as {dplyr} and {readr} as well as working with dates, times and regular expressions. Additional sections address control structures, scoping rules and profiling techniques, while the author also discusses resources for staying updated through a podcast and accessing e-book versions for ongoing revisions.

Hands-On Programming with R

Designed for individuals with no prior coding experience, the book provides an introduction to programming in R while using practical examples to teach fundamental concepts such as data manipulation, function creation and the use of R's environment system. It is structured around hands-on projects, including simulations of weighted dice, playing cards and a slot machine, alongside explanations of core programming principles like objects, notation, loops and performance optimisation. Additional sections cover installation, package management, data handling and debugging techniques. While the book is written using RMarkdown and published under a Creative Commons licence, a physical edition is available through O’Reilly.

Advanced R

What you have here is one of several books written by Hadley Wickham. This one is published in its second edition as part of Chapman and Hall's R Series and is aimed primarily at R users who want to deepen their programming skills and understanding of the language, though it is also useful for programmers migrating from other languages. The book covers a broad range of topics organised into sections on foundations, functional programming, object-oriented programming, metaprogramming and techniques, with the latter including debugging, performance measurement and rewriting R code in C++.

Cookbook for R

Unlike Paul Teetor's separately published R Cookbook, the Cookbook for R was created by Winston Chang. It offers solutions to common tasks and problems in data analysis, covering topics such as basic operations, numbers, strings, formulas, data input and output, data manipulation, statistical analysis, graphs, scripts and functions, and tools for experiments.

R for Data Science

The second edition of R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund offers a structured approach to learning data science with R, covering essential skills such as data visualisation, transformation, import, programming and communication. Organised into chapters that explore workflows, data manipulation techniques and tools like Quarto for reproducible research, the book emphasises practical applications and best practices for handling data effectively.

R Graphics Cookbook

The R Graphics Cookbook, 2nd edition, offers a comprehensive guide to creating visualisations in R, structured into chapters that cover foundational skills such as installing and using packages, loading data from various formats and exploring datasets through basic plots. It progresses to detailed techniques for constructing bar graphs, line graphs, scatter plots and histograms, alongside methods for customising axes, annotations, themes and legends.

The book also addresses advanced topics like colour application, faceting data into subplots, generating specialised graphs such as network diagrams and heat maps and preparing data for visualisation through reshaping and summarising. Additional sections focus on refining graphical outputs for presentation, including exporting to different file formats and adjusting visual elements for clarity and aesthetics, while an appendix provides an overview of the {ggplot2} system.

R Markdown: The Definitive Guide

Published by Chapman & Hall/CRC, R Markdown: The Definitive Guide by Yihui Xie, J.J. Allaire and Garrett Grolemund covers the R Markdown document format, which has been in use since 2012 and is built on the {knitr} and Pandoc tools. The format allows users to embed code within Markdown documents and compile the results into a range of output formats including PDF, HTML and Word. The guide covers a broad scope of practical applications, from creating presentations, dashboards, journal articles and books to building interactive applications and generating blogs, reflecting how the ecosystem has matured since the {rmarkdown} package was first released in 2014.

A key principle running throughout is that Markdown's deliberately limited feature set is a strength rather than a drawback, encouraging authors to focus on content rather than complex typesetting. Despite this simplicity, the format remains highly customisable through tools such as Pandoc templates, LaTeX and CSS. Documents produced in R Markdown are also notably portable, as their straightforward syntax makes conversion between output formats more reliable, and because results are generated dynamically from code rather than entered manually, they are far more reproducible than those produced through conventional copy-and-paste methods.

R Markdown Cookbook

The R Markdown Cookbook is a practical guide designed to help users enhance their ability to create dynamic documents by combining analysis and reporting. It covers essential topics such as installation, document structure, formatting options and output formats like LaTeX, HTML and Word, while also addressing advanced features such as customisations, chunk options and integration with other programming languages. The book provides step-by-step solutions to common tasks, drawing on examples from online resources and community discussions to offer clear, actionable advice for both new and experienced users seeking to improve their workflow and explore the full potential of R Markdown.

RMarkdown for Scientists

This book provides a practical guide to using R Markdown for scientists, developed from a three-hour workshop and designed to evolve as a living resource. It covers essential topics such as setting up R Markdown documents, integrating with RStudio for efficient workflows, exporting outputs to formats like PDF, HTML and Word, managing figures and tables with dynamic references and captions, incorporating mathematical equations, handling bibliographies with citations and style adjustments, troubleshooting common issues and exploring advanced R Markdown extensions.

bookdown: Authoring Books and Technical Documents with R Markdown

Here is a guide to using the {bookdown} package, which extends R Markdown to facilitate the creation of books and technical documents. It covers Markdown syntax, integration of R code, formatting options for HTML, LaTeX and e-book outputs and features such as cross-referencing, custom blocks and theming. The package supports both multipage and single-document outputs, and its applications extend beyond traditional books to include course materials, manuals and other structured content. The work includes practical examples, publishing workflows and details on customisation, alongside information about licensing and the availability of a printed version.

{blogdown}: Creating Websites with R Markdown

The authors note that some information may be outdated following recent updates to Hugo and the {blogdown} package, and they direct readers to additional resources for the latest features and changes; even so, the book remains a useful guide to building static websites using R Markdown and the Hugo static site generator, emphasising the advantages of this approach for creating reproducible, portable content. It covers installation, configuration, deployment options such as Netlify and GitHub Pages, migration from platforms like WordPress, and advanced topics including custom layouts and version control, alongside practical examples, workflow recommendations and discussions of themes, content management and the technical aspects of website development.

{pagedown}: Create Paged HTML Documents for Printing from R Markdown

The R package {pagedown} enables users to create paged HTML documents suitable for printing to PDF, using R Markdown combined with a JavaScript library called paged.js, the latter of which implements W3C specifications for paged media. While tools like LaTeX and Microsoft Word have traditionally dominated PDF production, {pagedown} offers an alternative approach through HTML and CSS, supporting a range of document types including resumes, posters, business cards, letters, theses and journal articles.

Documents can be converted to PDF via Google Chrome, Microsoft Edge or Chromium, either manually or through the chrome_print() function, with additional support for server-based, CI/CD pipeline and Docker-based workflows. The package provides customisable CSS stylesheets, a CSS overriding mechanism for adjusting fonts and page properties, and various formatting features such as lists of tables and figures, abbreviations, footnotes, line numbering, page references, cover images, running headers, chapter prefixes and page breaks. Previewing paged documents requires a local or remote web server, and the layout is sensitive to browser zoom levels, with 100% zoom recommended for the most accurate output.

Dynamic Documents with R and knitr

Developed by Yihui Xie and inspired by the earlier Sweave system, {knitr} is an R package designed for dynamic report generation that consolidates the functionality of numerous other add-on packages into a single, cohesive tool. It supports multiple input languages, including R, Python and shell scripts, as well as multiple output markup languages such as LaTeX, HTML, Markdown, AsciiDoc and reStructuredText. The package operates on a principle of transparency, giving users full control over how input and output are handled, and runs R code in a manner consistent with how it would behave in a standard R terminal.

Among its notable features are built-in caching, automatic code formatting via the {formatR} package, support for more than 20 graphics devices and flexible options for managing plots within documents. It also allows advanced users to define custom hooks and regular expressions to extend and tailor its behaviour further. The package is affiliated with the Foundation for Open Access Statistics, a nonprofit organisation promoting free software, open access publishing and reproducible research in statistics.

Mastering Shiny

Mastering Shiny is a comprehensive guide to developing web applications using R, focusing on the Shiny framework designed for data scientists. It introduces core concepts such as user interface design, reactive programming and dynamic content generation, while also exploring advanced topics like performance optimisation, security and modular app development. The book covers practical applications across industries, from academic teaching tools to real-time analytics dashboards, and aims to equip readers with the skills to build scalable, maintainable applications. It includes detailed chapters on workflow, layout, visualisation and user interaction, alongside case studies and technical best practices.

Engineering Production-Grade Shiny Apps

This is aimed at developers and team managers who already possess a working knowledge of the Shiny framework for R and wish to advance beyond the basics toward building robust, production-ready applications. Rather than covering introductory Shiny concepts or post-deployment concerns, the book focuses on the intermediate ground between those two stages, addressing project management, workflow, code structure and optimisation.

It introduces the {golem} package as a central framework and guides readers through a five-step workflow covering design, prototyping, building, strengthening and deployment, with additional chapters on optimisation techniques including R code performance, JavaScript integration and CSS. The book is structured to serve both those with project management responsibilities and those focused on technical development, acknowledging that in many small teams these roles are carried out by the same individual.

Outstanding User Interfaces with Shiny

Written by David Granjon and published in 2022, Outstanding User Interfaces with Shiny is a book aimed at filling the gap between beginner and advanced Shiny developers, covering how to deeply customise and enhance Shiny applications to the point where they become indistinguishable from classic web applications. The book spans a wide range of topics, including working with HTML and CSS, integrating JavaScript, building Bootstrap dashboard templates, mobile development and the use of React, providing a comprehensive resource that consolidates knowledge and experience previously scattered across the Shiny developer community.

R Packages

Now in its second edition, R Packages by Hadley Wickham and Jennifer Bryan is a freely available online guide that teaches readers how to develop packages in R. A package is the core unit of shareable and reproducible R code, typically comprising reusable functions, documentation explaining how to use them and sample data. The book guides readers through the entire process of package development, covering areas such as package structure, metadata, dependencies, testing, documentation and distribution, including how to release a package to CRAN. The authors encourage a gradual approach, noting that an imperfect first version is perfectly acceptable provided each subsequent version improves on the last.

Mastering Spark with R

Written by Javier Luraschi, Kevin Kuo and Edgar Ruiz, Mastering Spark with R is a comprehensive guide designed to take readers from little or no familiarity with Apache Spark or R through to proficiency in large-scale data science. The book covers a broad range of topics, including data analysis, modelling, pipelines, cluster management, connections, data handling, performance tuning, extensions, distributed computing, streaming and contributing to the Spark ecosystem.

Happy Git and GitHub for the useR

Here is a practical guide written by Jenny Bryan and contributors, aimed primarily at R users involved in data analysis or package development. It covers the installation and configuration of Git alongside GitHub, the development of key workflows for common tasks and the integration of these tools into day-to-day work with R and R Markdown. The guide is structured to take readers from initial setup through to more advanced daily workflows, with particular attention paid to how Git and GitHub serve the needs of data science rather than pure software development.

JavaScript for R

Written by John Coene and intended for release as part of the CRC Press R series, JavaScript for R explores how the R programming language and JavaScript can be used together to enhance data science workflows. Rather than teaching JavaScript as a standalone language, the book demonstrates how a limited working knowledge of it can meaningfully extend what R developers can achieve, particularly through the integration of external JavaScript libraries.

The book covers a broad range of topics, progressing from foundational concepts through to data visualisation using the {htmlwidgets} package, bidirectional communication with Shiny, JavaScript-powered computations via the V8 engine and Node.js and the use of modern JavaScript tools such as Vue, React and webpack alongside R. Practical examples are woven throughout, including the building of interactive visualisations, custom Shiny inputs and outputs, image classification and machine learning operations, with all accompanying code made publicly available on GitHub.

HTTP Testing in R

This guide addresses challenges faced by developers of R packages that interact with web resources, offering strategies to create reliable unit tests despite dependencies on internet connectivity, authentication and external service availability. It explores tools such as {vcr}, {webmockr}, {httptest} and {webfakes}, which enable mocking and recording HTTP requests to ensure consistent testing environments, reduce reliance on live data and improve test reliability. The text also covers advanced topics like handling errors, securing tests and ensuring compatibility with CRAN and Bioconductor, while emphasising best practices for maintaining test robustness and contributor-friendly workflows. Funded by rOpenSci and the R Consortium, the resource aims to support developers in building more resilient and maintainable R packages through structured testing approaches.

The Shiny AWS Book

The Shiny AWS Book is an online resource designed to teach data scientists how to deploy, host and maintain Shiny web applications using cloud infrastructure. Addressing a common gap in data science education, it guides readers through a range of DevOps technologies including AWS, Docker, Git, NGINX and open-source Shiny Server, covering everything from server setup and cost management to networking, security and custom configuration.

{ggplot2}: Elegant Graphics for Data Analysis

The third edition of {ggplot2}: Elegant Graphics for Data Analysis provides an in-depth exploration of the Grammar of Graphics framework, focusing on the theoretical foundations and detailed implementation of the {ggplot2} package rather than offering step-by-step instructions for specific visualisations. Written by Hadley Wickham, Danielle Navarro and Thomas Lin Pedersen, the book is presented as an online work-in-progress, with content structured across sections such as layers, scales, coordinate systems and advanced programming topics. It aims to equip readers with the knowledge to customise plots according to their needs, rather than serving as a direct guide for creating predefined graphics.

YaRrr! The Pirate’s Guide to R

Written by Nathaniel D. Phillips, this is a beginner-oriented guide to learning the R programming language from the ground up, covering everything from installation and basic navigation of the RStudio environment through to more advanced topics such as data manipulation, statistical analysis and custom function writing. The guide progresses logically through foundational concepts including scalars, vectors, matrices and dataframes before moving into practical areas such as hypothesis testing, regression, ANOVA and Bayesian statistics. Visualisation is given considerable attention across dedicated chapters on plotting, while later sections address loops, debugging and managing data from a variety of file formats. Each chapter includes practical exercises to reinforce learning, and the book concludes with a solutions section for reference.

Data Visualisation: A Practical Introduction

Data Visualisation: A Practical Introduction is a forthcoming second edition from Princeton University Press, written by Kieran Healy and due for release in March 2026, which teaches readers how to explore, understand and present data using the R programming language and the {ggplot2} library. The book aims to bridge the gap between works that discuss visualisation principles without teaching the underlying tools and those that provide code recipes without explaining the reasoning behind them, instead combining both practical instruction and conceptual grounding.

Revised and updated throughout to reflect developments in R and {ggplot2}, the second edition places greater emphasis on data wrangling, introduces updated and new datasets, and substantially rewrites several chapters, particularly those covering statistical models and map-drawing. Readers are guided through building plots progressively, from basic scatter plots to complex layered graphics, with the expectation that by the end they will be able to reproduce nearly every figure in the book and understand the principles that inform each choice.

The book also addresses the growing role of large language models in coding workflows, arguing that genuine understanding of what one is doing remains essential regardless of the tools available. It is suitable for complete beginners, those with some prior R experience, and instructors looking for a course companion, and requires the installation of R, RStudio and a number of supporting packages before work can begin.

Learning R for Data Analysis: Going from the basics to professional practice

22nd March 2026

R has grown from a specialist statistical language into one of the most widely recognised tools for working with data. Across tutorials, community sites, training platforms and industry resources, it is presented as both a programming language and a software environment for statistical computing, graphics and reporting. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand, and its name draws on the first letter of their first names while also alluding to the Bell Labs language S. It is freely available under the GNU General Public Licence and runs on Linux, Windows and macOS, which has helped it spread across research, education and industry alike.

What Makes R Distinctive

What makes R notable is its combination of programming features with a strong focus on data analysis. Introductory material, such as the tutorials at Tutorialspoint and Datamentor, repeatedly highlights its support for conditionals, loops, user-defined recursive functions and input and output, but these sit alongside effective data handling, a broad set of operators for arrays, lists, vectors and matrices and strong graphical capabilities. That mixture means R can be used for straightforward scripts and for complex analytical workflows. A beginner may start by printing "Hello, World!" with the print() function, while a more experienced user may move on to regression models, interactive dashboards or automated reporting.

The Learning Progression

Learning materials generally present R in a structured progression. A beginner is first introduced to reserved words, variables and constants, operators and the order in which expressions are evaluated. From there, the path usually moves into flow control through if…else, ifelse(), for, while, repeat and the use of break and next, before functions follow naturally, including return values, environments and scope, recursive functions, infix operators and switch(). Most sources agree that confidence with the syntax and fundamentals is the real starting point, and this early sequence matters because it helps learners become comfortable reading and writing R rather than only copying examples.
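
To make that early sequence concrete, a short sketch of the kind of code a learner meets at this stage might look like the following; the values and function names are illustrative only:

x <- 7

if (x %% 2 == 0) {
  print("even")
} else {
  print("odd")
}

parity <- ifelse((1:5) %% 2 == 0, "even", "odd")   # vectorised alternative to if...else

factorial_r <- function(n) {        # a user-defined recursive function
  if (n <= 1) return(1)
  n * factorial_r(n - 1)
}
factorial_r(5)   # 120

describe <- function(type) {        # switch() picks a branch by name
  switch(type, numeric = "a number", character = "text", "something else")
}
describe("character")   # "text"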

After the basics, attention tends to turn to the structures that make R so useful for data work. Vectors, matrices, lists, data frames and factors appear in nearly every introductory course because they are central to how information is stored and manipulated. Object-oriented concepts also emerge quite early in some routes through the language, with classes and objects extending into S3, S4 and reference classes. For someone coming from spreadsheets or point-and-click statistical software, this shift can feel significant, but it also opens the way to more reproducible and flexible analysis.
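
A rough sketch of how those structures look in practice, with purely illustrative values:

v <- c(2.5, 3.1, 4.8)                        # a numeric vector
m <- matrix(1:6, nrow = 2)                   # a 2 x 3 matrix
l <- list(name = "trial A", n = 120)         # a list holding mixed types
df <- data.frame(id = 1:3, dose = c("low", "mid", "high"))    # a data frame
df$dose <- factor(df$dose, levels = c("low", "mid", "high"))  # a factor with set levels
str(df)                                      # inspect the structure of any object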

Visualisation

Visualisation is another recurring theme in R education. Basic chart types such as bar plots, histograms, pie charts, box plots and strip charts are common early examples because they show how quickly data can be turned into graphics. More advanced lessons widen the scope through plot functions, multiple plots, saving graphics, colour selection and the production of 3D plots.
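
A few lines of base R are enough to produce several of the chart types mentioned above; this sketch uses randomly generated data purely for illustration:

values <- rnorm(100)                          # made-up data
hist(values, main = "Histogram")
boxplot(values, main = "Box plot")
groups <- sample(c("A", "B", "C"), 50, replace = TRUE)
barplot(table(groups), main = "Bar plot")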

Beyond base plotting, there is extensive evidence of the central role of {ggplot2} in contemporary R practice. Data Cornering demonstrates this well, with articles covering how to create funnel charts in R using {ggplot2} and how to diversify stacked column chart data label colours, showing how R is used not only to summarise data but also to tell visual stories more clearly. In the pharmaceutical and clinical research space, the PSI VIS-SIG blog is published by the PSI Visualisation Special Interest Group and summarises its monthly Wonderful Wednesday webinars, presenting real-world datasets and community-contributed chart improvements alongside news from the group.
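
As a minimal illustration of the {ggplot2} approach, here is a sketch using the mpg dataset bundled with the package rather than anything from the articles mentioned above:

library(ggplot2)

# a layered plot built from the bundled mpg dataset
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")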

Data Wrangling and the Tidyverse

Much of modern R work is built around data wrangling, and here the {tidyverse} has become especially prominent. Claudia A. Engel's openly published guide Data Wrangling with R (last updated 3rd November 2023) sets out a preparation phase that assumes some basic R knowledge, a recent installation of R and RStudio and the installation of the {tidyverse} package with install.packages("tidyverse") followed by library(tidyverse). It also recommends creating a dedicated RStudio project and downloading CSV files into a data subdirectory, reinforcing the importance of organised project structure.
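
In code, that preparation phase amounts to little more than the following; the file name is an example rather than anything prescribed by the guide:

install.packages("tidyverse")   # one-off installation
library(tidyverse)              # load the core packages

# assuming an RStudio project with a data/ subdirectory; the file name is illustrative
surveys <- read_csv("data/surveys.csv")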

That same guide then moves through data manipulation with {dplyr}, covering selecting columns and filtering rows, pipes, adding new columns, split-apply-combine, tallying and joining two tables, before moving on to {tidyr} topics such as long and wide table formats, pivot_wider, pivot_longer and exporting data. These topics reflect a broader pattern in the R ecosystem because data import and export, reshaping, combining tables and counting by group recur across teaching resources as they mirror common analytical tasks.
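
A compressed sketch of those operations, using the built-in mtcars data rather than the guide's own files, might look like this:

library(dplyr)
library(tidyr)

summary_tbl <- mtcars %>%
  select(mpg, cyl, wt) %>%                # choose columns
  filter(wt < 5) %>%                      # keep rows meeting a condition
  mutate(mpg_per_cyl = mpg / cyl) %>%     # add a new column
  group_by(cyl) %>%                       # split-apply-combine
  summarise(mean_mpg = mean(mpg), n = n())

# reshape between long and wide formats
long_tbl <- pivot_longer(summary_tbl, cols = c(mean_mpg, n),
                         names_to = "measure", values_to = "value")
wide_tbl <- pivot_wider(long_tbl, names_from = measure, values_from = value)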

Applications and Professional Use

The range of applications attached to R is wide, though data science remains the clearest centre of gravity. Educational sources describe R as valuable for data wrangling, visualisation and analysis, often pointing to packages such as {dplyr}, {tidyr}, {ggplot2} and {Shiny}. Statistical modelling is another major strand, with R offering extensible techniques for descriptive and inferential statistics, regression analysis, time series methods and classical tests. Machine learning appears as a further area of growth, supported by a large and expanding package ecosystem. In more advanced contexts, R is also linked with dashboards, web applications, report generation and publishing systems such as Quarto and R Markdown.

R's place in professional settings is underscored by the breadth of organisations and sectors associated with it. Introductory resources mention companies such as Google, Microsoft, Facebook, ANZ Bank, Ford and The New York Times as examples of organisations using R for modelling, forecasting, analysis and visualisation. The NHS-R Community promotes the use of R and open analytics in health and care, building a community of practice for data analysis and data science using open-source software in the NHS and wider UK health and care system. Its resources include reports, blogs, webinars and workshops, books, videos and R packages, with webinar materials archived in a publicly accessible GitHub repository. The R Validation Hub, supported through the pharmaR initiative, is a collaboration to support the adoption of R within a biopharmaceutical regulatory setting and provides tools including the {riskmetric} package, the {riskassessment} app and the {riskscore} package for assessing package quality and risk.

The Wider Ecosystem

The wider ecosystem around R is unusually rich. The R Consortium promotes the growth and development of the R language and its ecosystem by supporting technical and social infrastructure, fostering community engagement and driving industry adoption. It notes that the R language supports over two million users and has been adopted in industries including biotech, finance, research and high technology. Community growth is visible not only through organisations and conferences but through user groups, scholarships, project working groups and local meetups, which matters because learning a language is easier when there is an active support network around it.

Another sign of maturity is the depth of R's package and publication landscape. rdrr.io provides a comprehensive index of over 29,000 CRAN packages alongside more than 2,100 Bioconductor packages, over 2,200 R-Forge packages and more than 76,000 GitHub packages, making it possible to search for packages, functions, documentation and source code in one place. Rdocumentation, powered by DataCamp, covers 32,130 packages across CRAN and Bioconductor and offers a searchable interface for function-level documentation. The Journal of Statistical Software adds a scholarly dimension, publishing open-access articles on statistical computing software together with source code, with full reproducibility mandatory for publication. R-bloggers aggregates R news and tutorials contributed by hundreds of R bloggers, while R Weekly curates a community digest and an accompanying podcast, both helping users keep pace with the steady flow of tutorials, package releases, blog posts and developments across the R world.

Where to Begin

For beginners, one recurring challenge is knowing where to start, and different learning routes reflect different backgrounds. Datamentor points learners towards step-by-step tutorials covering popular topics such as R operators, if...else statements, data frames, lists and histograms, progressing through to more advanced material. R for the Rest of Us offers a staged path through three core courses, Getting Started With R, Fundamentals of R and Going Deeper with R, and extends into nine topics courses covering Git and GitHub, making beautiful tables, mapping, graphics, data cleaning, inferential statistics, package development, reproducibility and interactive dashboards with {Shiny}. The site is explicitly designed for people who may never have coded before and also offers the structured R in 3 Months programme alongside training and consulting. RStudio Education (now part of Posit) outlines six distinct ways to begin learning R, covering installation, a free introductory webinar on tidy statistics, the book R for Data Science, browser-based primers, and further options suited to different learning styles, along with guidance on R Markdown and good project practices.

Despite the variety, the underlying advice is consistent: start by learning the basics well enough to read and write simple code, practise regularly beginning with straightforward exercises and gradually take on more complex tasks, then build projects that matter to you because projects create context and make concepts stick. There is no suggestion that mastery comes from passively reading documentation alone, as practical engagement is treated as essential throughout. The blog Stats and R exemplifies this philosophy well, with the stated aim of making statistics accessible to everyone by sharing, explaining and illustrating statistical concepts and, where appropriate, applying them in R.

That practical engagement can take many forms. Someone interested in data journalism may focus on visualisation and reproducible reporting, while a researcher may prioritise statistical modelling and publishing workflows, and a health analyst may use R for quality assurance, open health data and clinical reporting. Others may work with {Shiny}, package development, machine learning, Git and GitHub or interactive dashboards. The variety shows that R is not confined to a single use case, even if statistics and data science remain the common thread.

Free Learning Resources for R

It is also worth noting that R learning is supported by a great deal of freely available material. Statistics Globe, founded in 2017 by Joachim Schork and now an education and consulting platform, offers more than 3,000 free tutorials and over 1,000 video tutorials on YouTube, spanning R programming, Python and statistical methodology. STHDA (Statistical Tools for High-Throughput Data Analysis) covers basics, data import and export, reshaping, manipulation and visualisation, with material geared towards practical data analysis at every level. Community sites, webinar repositories and newsletters add further layers of accessibility, and even where paid courses exist, the surrounding free ecosystem is substantial.

Taken together, these sources present R as far more than a niche programming language. It is a mature open-source environment with a strong statistical heritage, a practical orientation towards data work and a well-developed community of learners, teachers, developers and organisations. Its core concepts are approachable enough for beginners, yet its package ecosystem and publishing culture support highly specialised and advanced work. For anyone looking to enter data analysis, statistics, visualisation or related areas, R offers a route that begins with simple code and can extend into large-scale analytical workflows.

Speeding up R Code with parallel processing

17th March 2026

Parallel processing in R has evolved considerably over the past fifteen years, moving from a patchwork of platform-specific workarounds into a well-structured ecosystem with clean, consistent interfaces. The appeal is easy to grasp: modern computers offer several processor cores, yet most R code runs on only one of them unless the user makes a deliberate choice to go parallel. When a task involves repeated calculations across groups, repeated model fitting or many independent data retrievals, spreading that work across multiple cores can reduce elapsed time substantially.

At its heart, the idea is simple. A larger job is split into smaller pieces, those pieces are executed simultaneously where possible, and the results are combined back together. That pattern appears throughout R's parallel ecosystem, whether the work is running on a laptop with a handful of cores or on a university supercomputer with thousands.

Why Parallel Processing?

Most modern computers have multiple cores that sit idle during single-threaded R scripts. Parallel processing takes advantage of this by splitting work across those cores, but it is important to understand that it is not always beneficial. Starting workers, transmitting data and collecting results all take time. Parallel processing makes the most sense when each iteration does enough computational work to justify that overhead. For fast operations of well under a second, the overhead will outweigh any gain and serial execution is faster. The sweet spot is iterative work, where each unit of computation takes at least a few seconds.

Benchmarking: Amdahl's Law

The theoretical speed-up from adding processors is always limited by the fraction of work that cannot be parallelised. Amdahl's Law, formulated by computer scientist Gene Amdahl in 1967, captures this:

Maximum Speedup = 1 / ( f/p + (1 - f) )

Here, f is the parallelisable fraction and p is the number of processors. Problems where f = 1 (the entire computation is parallelisable) are called embarrassingly parallel: bootstrapping, simulation studies and applying the same model to many independent groups all fall into this category. For everything else, the sequential fraction, including the overhead of setting up workers and moving data, sets a ceiling on how much improvement is achievable.
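
Expressed as a small R function, a direct transcription of the formula above rather than anything from a package, the ceiling becomes easy to explore:

amdahl_speedup <- function(f, p) {
  1 / (f / p + (1 - f))                # f = parallelisable fraction, p = processors
}

amdahl_speedup(f = 0.95, p = 8)    # about 5.9x with 8 cores when 95% of the work parallelises
amdahl_speedup(f = 0.95, p = 64)   # about 15.4x; the serial 5% caps any further gain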

How We Got Here

The current landscape makes more sense with a brief orientation. R 2.14.0 in 2011 brought {parallel} into base R, providing built-in support for both forking and socket clusters along with reproducible random number streams, and it remains the foundation everything else builds on. The {foreach} package with {doParallel} became the most common high-level interface for many years, and is still widely encountered in existing code. The split-apply-combine package {plyr} was an early entry point for parallel data manipulation but is now retired; the recommendation is to use {dplyr} for data frames and {purrr} for list iteration instead. The {future} ecosystem, covered in the next section, is the current best practice for new code.

The Modern Standard: The {future} Ecosystem

The most significant development in R parallel computing in recent years has been the {future} package by Henrik Bengtsson, which provides a unified API for sequential and parallel execution across a wide range of backends. Its central concept is simple: a future is a value that will be computed (possibly in parallel) and retrieved later. What makes it powerful is that you write code once and change the execution strategy by swapping a single plan() call, with no other changes to your code.

library(future)
plan(multisession)  # Use all available cores via background R sessions

The common plans are sequential (the default, no parallelism), multisession (multiple background R processes, works on all platforms including Windows) and multicore (forking, faster but Unix/macOS only). Beyond a single machine, the cluster plan and backends such as {future.batchtools} extend the same interface to remote nodes.

The {future} package itself is a low-level building block. For day-to-day work, three higher-level packages are the main entry points.
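
Even so, it is worth seeing the low-level API once; here is a minimal sketch in which slow_task() merely stands in for an expensive computation:

library(future)
plan(multisession)

slow_task <- function(x) { Sys.sleep(2); x^2 }   # stand-in for an expensive computation

f <- future(slow_task(42))   # starts running on a background worker straight away
# ...the main session is free to do other things here...
result <- value(f)           # blocks only when the result is actually needed; 1764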

{future.apply}: Drop-in Replacements for base R Apply

{future.apply} provides parallel versions of every *apply function in base R, including future_lapply(), future_sapply(), future_mapply(), future_replicate() and more. The conversion from serial to parallel code requires just two lines:

library(future.apply)
plan(multisession)

# Serial
results <- lapply(my_list, my_function)

# Parallel — identical output, just faster
results <- future_lapply(my_list, my_function)

Global variables and packages are automatically identified and exported to workers, which removes the manual clusterExport and clusterEvalQ calls that {parallel} requires.

{furrr}: Drop-in Replacements for {purrr}

{furrr} does the same for {purrr}'s mapping functions. Any map() call can become future_map() by loading the library and setting a plan:

library(furrr)
plan(multisession, workers = availableCores() - 1)

# Serial
results <- map(my_list, my_function)

# Parallel
results <- future_map(my_list, my_function)

Like {future.apply}, {furrr} handles environment export automatically. There are parallel equivalents for all typed variants (future_map_dbl(), future_map_chr(), etc.) and for map2() and pmap() as well. It is the most natural choice for tidyverse-style code that already uses {purrr}.

{futurize}: One-Line Parallelisation

For users who want to parallelise existing code with minimal changes, {futurize} can transpile calls to lapply(), purrr::map() and foreach::foreach() %do% {} into their parallel equivalents automatically.

{foreach} with {doFuture}

The {foreach} package remains widely used, and the modern way to parallelise it is with the {doFuture} backend and the %dofuture% operator:

library(foreach)
library(doFuture)
plan(multisession)

results <- foreach(i = 1:10) %dofuture% {
    my_function(i)
}

This approach inherits all the benefits of {future}, including automatic global variable handling and reproducible random numbers.

The {parallel} Package: Core Functions

The {parallel} package remains part of base R and is the foundation that {future} and most other packages build on. It is useful to know its core functions directly, especially for distributed work across multiple nodes.

Shared memory (single machine, Unix/macOS only):

mclapply(X, FUN, mc.cores = n) is a parallelised lapply that works by forking. It does not work on Windows and falls back silently to serial execution there.

Distributed memory (all platforms, including multi-node):

  • makeCluster(n): Start `n` worker processes
  • clusterExport(cl, vars): Copy named objects to all workers
  • clusterEvalQ(cl, expr): Run an expression (e.g. library(pkg)) on all workers
  • parLapply(cl, X, FUN): Parallelised lapply across the cluster
  • parLapplyLB(cl, X, FUN): The same with load balancing for uneven tasks
  • clusterSetRNGStream(cl, seed): Set reproducible random seeds on the workers
  • stopCluster(cl): Shut down the cluster

Note that detectCores() can return misleading values in HPC environments, reporting the total cores on a node rather than those allocated to your job. The {parallelly} package's availableCores() is more reliable in those settings and is what {furrr} and {future.apply} use internally.
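
Put together, a minimal socket-cluster workflow using those functions might look like this, with the workload function acting purely as a placeholder:

library(parallel)

square_slowly <- function(x) { Sys.sleep(1); x^2 }   # placeholder workload
inputs <- 1:8

cl <- makeCluster(4)                         # start 4 workers
clusterSetRNGStream(cl, 123)                 # reproducible random numbers
clusterExport(cl, "square_slowly")           # ship the function to every worker
clusterEvalQ(cl, library(stats))             # load packages on every worker
results <- parLapply(cl, inputs, square_slowly)
stopCluster(cl)                              # always shut the cluster down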

A Tidyverse Approach with {multidplyr}

For data frame-centric workflows, {multidplyr} (available on CRAN) provides a {dplyr} backend that distributes grouped data across worker processes. The API has been simplified considerably since older tutorials were written: there is no longer any need to manually add group index columns or call create_cluster(). The current workflow is straightforward.

library(multidplyr)
library(dplyr)

# Step 1: Create a cluster (leave 1–2 cores free)
cluster <- new_cluster(parallel::detectCores() - 1)

# Step 2: Load packages on workers
cluster_library(cluster, "dplyr")

# Step 3: Group your data and partition it across workers
flights_partitioned <- nycflights13::flights %>%
    group_by(dest) %>%
    partition(cluster)

# Step 4: Work with dplyr verbs as normal
results <- flights_partitioned %>%
    summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
    collect()

partition() uses a greedy algorithm to keep all rows of a group on the same worker and balance shard sizes. The collect() call at the end recombines the results into an ordinary tibble in the main session. If you need to use custom functions, load them on each worker with cluster_assign():

cluster_assign(cluster, my_function = my_function)

One important caveat from the official documentation: for basic {dplyr} operations, {multidplyr} is unlikely to give measurable speed-ups unless you have tens or hundreds of millions of rows. Its real strength is in parallelising slower, more complex operations such as fitting models to each group. For large in-memory data with fast transformations, {dtplyr} (which translates {dplyr} to {data.table}) is often a better first choice.

Running R on HPC Clusters

For computations that exceed what a single workstation can provide, university and research HPC clusters are the next step. The core terminology is worth understanding clearly before submitting your first job.

One node is a single physical computer, which may itself contain multiple processors. One processor contains multiple cores. Wall-time is the real-world clock time a job is permitted to run; the job is terminated when this limit is reached, regardless of whether the script has finished. Memory refers to the RAM the job requires. When requesting resources, leave a margin of at least five per cent of RAM for system processes, as exceeding the allocation will cause the job to fail.

Slurm Job Submission

Slurm is the dominant scheduler on modern HPC clusters, including Penn State's Roar Collab system, managed by the Institute for Computational and Data Sciences (ICDS). Jobs are described in a shell script and submitted with sbatch. From R, the {rslurm} package allows Slurm jobs to be created and submitted directly without leaving the R session:

library(rslurm)
sjob <- slurm_apply(my_function, params_df, jobname = "my_job",
                    nodes = 2, cpus_per_node = 8)
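
Once the scheduler has finished, the output can be pulled back into the R session with {rslurm}'s helper functions; a brief sketch:

# retrieve the output once the Slurm job has completed
results <- get_slurm_out(sjob, outtype = "table")
cleanup_files(sjob)   # remove the temporary submission files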

Connecting R Workflows to Cluster Schedulers

The {batchtools} package provides Map, Reduce and Filter variants for managing R jobs on PBS, Slurm, LSF and Sun Grid Engine. The {clustermq} package sends function calls as cluster jobs via a single line of code without network-mounted storage. For users already in the {future} ecosystem, {future.batchtools} wraps {batchtools} as a {future} backend, letting you scale from a local plan(multisession) all the way to plan(batchtools_slurm) with no other code changes.
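
As a sketch of that scaling path, assuming a site-specific Slurm template file (the name used here is only an example):

library(future.batchtools)

# the same downstream code as before; only the plan changes
plan(batchtools_slurm, template = "slurm.tmpl")

results <- future.apply::future_lapply(my_list, my_function)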

The Broader Ecosystem

The CRAN Task View on High-Performance and Parallel Computing, maintained by Dirk Eddelbuettel and updated regularly, remains the most comprehensive catalogue of R packages in this space. The core packages designated by the Task View are {Rmpi} and {snow}. Beyond these, several areas are worth knowing about.

For large and out-of-memory data, {arrow} provides the Apache Arrow in-memory format with support for out-of-memory processing and streaming. {bigmemory} allows multiple R processes on the same machine to share large matrix objects. {bigstatsr} operates on file-backed matrices via memory-mapped access with parallel matrix operations and PCA.

For pipeline orchestration, the {targets} package constructs a directed acyclic graph of your workflow and orchestrates distributed computing across {future} workers, only re-running steps whose upstream dependencies have changed. For GPU computing, the {tensorflow} package by Allaire and colleagues provides access to the complete TensorFlow API from within R, enabling computation across CPUs and GPUs with a single API.

When it comes to random number reproducibility across parallel workers, the L'Ecuyer-CMRG streams built into {parallel} are available via RNGkind("L'Ecuyer-CMRG"). The {rlecuyer}, {rstream}, {sitmo} and {dqrng} packages provide further alternatives. The {doRNG} package handles reproducible seeds specifically for {foreach} loops.
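
Within the {future} ecosystem, reproducible parallel random numbers are usually requested through the functions themselves rather than by setting the RNG kind manually; a minimal sketch:

library(future.apply)
plan(multisession)

# future.seed = TRUE gives each element its own parallel-safe L'Ecuyer-CMRG
# stream, so results repeat across runs and across worker counts
set.seed(42)
draws <- future_lapply(1:4, function(i) rnorm(3), future.seed = TRUE)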

Choosing the Right Approach

The appropriate tool depends on the shape of the problem and how it fits into your existing code.

If you are already using {purrr}'s map() functions, replacing them with future_map() from {furrr} after plan(multisession) is the path of least resistance. If you use base R's lapply or sapply, {future.apply} provides identical drop-in replacements. Both inherit automatic environment handling, reproducible random numbers and cross-platform compatibility from {future}.

If you are working with grouped data frames in a {dplyr} style and each group operation is computationally substantial, {multidplyr} is a good fit. For fast operations on large data, try {dtplyr} first.

For the largest workloads on institutional clusters, {future} scales directly to HPC environments via plan(cluster) or plan(batchtools_slurm). The {rslurm} and {batchtools} packages provide more direct control over job submission and resource management.

Further Reading

The CRAN Task View on High-Performance and Parallel Computing is the most comprehensive and current reference. The Futureverse website documents the full {future} ecosystem. The {multidplyr} vignette covers the current API in detail. Penn State users can find cluster support through ICDS and the QuantDev group's HPC in R tutorial. The R Special Interest Group on High-Performance Computing mailing list is a further resource for more specialist questions.

SAS Packages: Revolutionising code sharing in the SAS ecosystem

26th July 2025

In the world of statistical programming, SAS has long been the backbone of data analysis for countless organisations worldwide. Yet, for decades, one of the most significant challenges facing SAS practitioners has been the efficient sharing and reuse of code. Knowledge and expertise have often remained siloed within individual developers or teams, creating inefficiencies and missed opportunities for collaboration. Enter the SAS Packages Framework (SPF), a solution that changes how SAS professionals share, distribute and utilise code across their organisations and the broader community.

The Problem: Fragmented Knowledge and Complex Dependencies

Anyone who has worked extensively with SAS knows the frustration of trying to share complex macros or functions with colleagues. Traditional code sharing in SAS has been plagued by several issues:

  • Dependency nightmares: A single macro often relies on dozens of utility macros working behind the scenes, making it nearly impossible to share everything needed for the code to function properly
  • Version control chaos: Keeping track of which version of which macro works with which other components becomes an administrative burden
  • Platform compatibility issues: Code that works on Windows might fail on Linux systems and vice versa
  • Lack of documentation: Without proper documentation and help systems, even the most elegant code becomes unusable to others
  • Knowledge concentration: Valuable SAS expertise remains trapped within individuals rather than being shared with the broader community

These challenges have historically meant that SAS developers spend countless hours reinventing the wheel, recreating functionality that already exists elsewhere in their organisation or the wider SAS community.

The Solution: SAS Packages Framework

The SAS Packages Framework, developed by Bartosz Jabłoński, represents a paradigm shift in how SAS code is organised, shared and deployed. At its core, a SAS package is an automatically generated, single, standalone zip file containing organised and ordered code structures, extended with additional metadata and utility files. This solution addresses the fundamental challenges of SAS code sharing by providing:

  • Functionality over complexity: Instead of worrying about 73 utility macros working in the background, you simply share one file and tell your colleagues about the main functionality they need to use.
  • Complete self-containment: Everything needed for the code to function is bundled into one file, eliminating the "did I remember to include everything?" problem that has plagued SAS developers for years.
  • Automatic dependency management: The framework handles the loading order of code components and automatically updates system options like cmplib= and fmtsearch= for functions and formats.
  • Cross-platform compatibility: Packages work seamlessly across different operating systems, from Windows to Linux and UNIX environments.

Beyond Macros: A Spectrum of SAS Functionality

One of the most compelling aspects of the SAS Packages Framework is its versatility. While many code-sharing solutions focus solely on macros, SAS packages support a wide range of SAS functionality:

  • User-defined functions (both FCMP and CASL)
  • IML modules for matrix programming
  • PROC PROTO C routines for high-performance computing
  • Custom formats and informats
  • Libraries and datasets
  • PROC DS2 threads and packages
  • Data generation code
  • Additional content such as documentation PDF files

This comprehensive approach means that virtually any SAS functionality can be packaged and shared, making the framework suitable for everything from simple utility macros to complex analytical frameworks.

Real-World Applications: From Pharmaceutical Research to General Analytics

The adoption of SAS packages has been particularly notable in the pharmaceutical industry, where code quality, validation and sharing are critical concerns. The PharmaForest initiative, led by PHUSE Japan's Open-Source Technology Working Group, exemplifies how the framework is being used to revolutionise pharmaceutical SAS programming. PharmaForest offers a collaborative repository of SAS packages specifically designed for pharmaceutical applications, including:

  • OncoPlotter: A comprehensive package for creating figures commonly used in oncology studies
  • SAS FAKER: Tools for generating realistic test data while maintaining privacy
  • SASLogChecker: Automated log review and validation tools
  • rtfCreator: Streamlined RTF output generation

The initiative's philosophy captures perfectly the spirit of the SAS Packages Framework: "Through SAS packages, we want to actively encourage sharing of SAS know-how that has often stayed within individuals. By doing this, we aim to build up collective knowledge, boost productivity, ensure quality through standardisation and energise our community".

The SASPAC Archive: A Growing Ecosystem

The establishment of SASPAC (SAS Packages Archive) represents the maturation of the SAS packages ecosystem. This dedicated repository serves as the official home for SAS packages, with each package maintained as a separate repository complete with version history and documentation. Some notable packages available through SASPAC include:

  • BasePlus: Extends BASE SAS with functionality that many developers find themselves wishing was built into SAS itself. With 12 stars on GitHub, it's become one of the most popular packages in the archive.
  • MacroArray: Provides macro array functionality that simplifies complex macro programming tasks, addressing a long-standing gap in SAS's macro language capabilities.
  • SQLinDS: Enables SQL queries within data steps, bridging the gap between SAS's powerful data step processing and SQL's intuitive query syntax.
  • DFA (Dynamic Function Arrays): Offers advanced data structures that extend SAS's analytical capabilities.
  • GSM (Generate Secure Macros): Provides tools for protecting proprietary code while still enabling sharing and collaboration.

Getting Started: Surprisingly Simple

Despite these capabilities, getting started with SAS packages is fairly straightforward. The framework can be deployed in multiple ways, depending on your needs. For a quick test or one-time use, you can enable the framework directly from the web:

filename packages "%sysfunc(pathname(work))";
filename SPFinit url "https://raw.githubusercontent.com/yabwon/SAS_PACKAGES/main/SPF/SPFinit.sas";
%include SPFinit;

For permanent installation, you simply create a directory for your packages and install the framework:

filename packages "C:SAS_PACKAGES";
%installPackage(SPFinit)

Once installed, using packages becomes as simple as:

%installPackage(packageName)
%helpPackage(packageName)
%loadPackage(packageName)

Developer Benefits: Quality and Efficiency

For SAS developers, the framework offers numerous advantages that go beyond simple code sharing:

  • Enforced organisation: The package development process naturally encourages better code organisation and documentation practices.
  • Built-in testing: The framework includes testing capabilities that help ensure code quality and reliability.
  • Version management: Packages include metadata such as version numbers and generation timestamps, supporting modern DevOps practices.
  • Integrity verification: The framework provides tools to verify package authenticity and integrity, addressing security concerns in enterprise environments.
  • Cherry-picking: Users can load only specific components from a package, reducing memory usage and namespace pollution.

The Future of SAS Code Sharing

The growing adoption of SAS packages represents more than just a new tool; it signals a fundamental shift towards a more collaborative and efficient SAS ecosystem. The framework's MIT licensing and 100% open-source nature ensure that it remains accessible to all SAS users, from individual practitioners to large enterprise installations. This democratisation of advanced code-sharing capabilities levels the playing field and enables even small teams to benefit from enterprise-grade development practices.

As the ecosystem continues to grow, with contributions from pharmaceutical companies, academic institutions and individual developers worldwide, the SAS Packages Framework is proving that the future of SAS programming lies not in isolated development, but in collaborative, community-driven innovation.

For SAS practitioners looking to modernise their development practices, improve code quality and tap into the collective knowledge of the global SAS community, exploring SAS packages isn't just an option; it's becoming an essential step towards more efficient and effective statistical programming.

Broadening data science horizons: Useful Python packages for working with data

14th October 2021

My response to changes in the technology stack used in clinical research is to develop some familiarity with programming and scripting platforms that complement and compete with SAS, a system with which I have been programming since 2000. While one of these has been R, Python is another that has taken up my attention, and I now also have Julia in my sights as well. There may be others to assess in the fullness of time.

While I began to explore the Data Science world in the autumn of 2017, it was in the autumn of 2019 that I began to complete LinkedIn training courses on the subject. Good though they were, I find that I need to actually use a tool to better understand it. At that time, I did get to hear about Python packages like Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn and Beautiful Soup, though it took until the spring of this year for me to start gaining some hands-on experience with using any of these.

During the summer of 2020, I attended a BCS webinar on the CodeGrades initiative, a programming mentoring scheme inspired by the way classical musicianship is assessed. In fact, one of the main progenitors is a trained classical musician and teacher of classical music who turned to Python programming when starting a family to have a more stable income. The approach is that a student selects a project and works their way through it, with mentoring and periodic assessments carried out in a gentle and discursive manner. Of course, the project has to be engaging for the learning experience to stay the course, and that point came through in the webinar.

That is one lesson that resonates with me, with subjects as diverse as web server performance and the ongoing pandemic supplying data, and there are other sources of public data to examine as well before looking through my own personal archive gathered over the decades. Though some subjects are uplifting while others are more foreboding, the key thing is that they sustain interest and offer opportunities for new learning. Without being able to dream up new things to try, my knowledge of R and Python would not be as extensive as it is, and I hope that it will help with learning Julia too.

In the main, my own learning has been a solo effort, consulting documentation along with web searches that have brought me to the likes of Real Python, Stack Abuse, Data Viz with Python and R and others for longer tutorials, as well as threads on Stack Overflow. Usually, the web searching begins when I need a steer on a particular task or a way to resolve a particular error or warning message, but books are always worth reading even if that is the slower route. While those from the Dummies series or from O'Reilly have proved most useful so far, I do need to read them more completely than I already have; it is all too tempting to go with the "program and search for solutions as you go" approach instead.

To get going, many choose the Anaconda distribution to get Jupyter notebook functionality, but I prefer a more traditional editor, so Spyder has been my tool of choice for Python programming and there are others like PyCharm as well. Because Spyder itself is written in Python, it can be installed using pip from PyPI like other Python packages. It has other dependencies like Pylint for code management activities, but these get installed behind the scenes.

The packages that I first met in 2019 may be the mainstays for doing data science, but I have discovered others since then. It also seems that there is porosity between the worlds of R and Python, so you get some Python packages aping R packages and R has the Reticulate package for executing Python code. There are Python counterparts to such Tidyverse staples as dplyr and ggplot2 in the form of Siuba and Plotnine, respectively. Though the syntax of these packages is not a direct copy of what is executed in R, it is close enough to feel familiar, which adds user-friendliness compared to Pandas or Matplotlib. The interoperability does not stop there, for there is SQLAlchemy for connecting to MySQL and other databases (PyMySQL is needed as well) and there also is SASPy for interacting with SAS Viya.

While Python may not have the speed of Julia, there are plenty of packages for working with larger workloads. Of these, Dask, Modin and RAPIDS all have their uses for dealing with data volumes that make Pandas code crawl. As if to prove that there are plenty of libraries for various forms of data analytics, data science, artificial intelligence and machine learning, there also are the likes of Keras, TensorFlow and NetworkX. These are just a selection of what is available, and there is always the possibility of checking out others. It may be tempting to stick with the most popular packages all the time, especially when they do so much, but it never hurts to keep an open mind either.
