TOPIC: R
Open Source Tools for Pharmaceutical Clinical Data Reporting, Analysis & Regulatory Submissions
There was a time when SAS was the predominant technology for clinical data reporting, analysis and submission work in the pharmaceutical industry. Within the last decade, open-source alternatives have gained considerable traction, and the {pharmaverse} initiative has arisen from this. Its packages span everything from dataset creation (SDTM and ADaM) to output production, with utilities for test data and submission activities along the way. The effort also marks a shift from each company working in isolation to sharing and collaborating with others. Here, then, is the outcome of their endeavours.
{admiral}
Designed as an open-source, modular R toolbox, the {admiral} package assists in the creation of ADaM datasets through reusable functions and utilities tailored for pharmaceutical data analysis. Core packages handle general ADaM derivations whilst therapeutic area-specific extensions address more specialised needs, with a structured release schedule divided into two phases. Usability, simplicity and readability are central priorities, supported by comprehensive documentation, vignettes and example scripts. Community contributions and collaboration are actively encouraged, with the aim of fostering a shared, industry-wide approach to ADaM development in R. Related packages for test data and metadata manipulation complement the main toolkit, alongside a commitment to consistent coding practices and accessible code.
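To give a flavour of the modular style described above, here is a hedged sketch of a single {admiral} derivation; the `dm` data frame is invented for illustration, and the call reflects the documented `derive_vars_dt()` interface.

```r
library(dplyr)
library(admiral)

# A toy DM extract, invented for illustration
dm <- tibble::tribble(
  ~USUBJID, ~RFSTDTC,
  "01-001", "2023-04-02",
  "01-002", "2023-05-17"
)

adsl <- dm %>%
  # derive_vars_dt() converts the character --DTC date into a proper
  # Date variable, here named TRTSDT via the new_vars_prefix argument
  derive_vars_dt(new_vars_prefix = "TRTS", dtc = RFSTDTC)
```

Each derivation is a pipeable function of this shape, which is what makes the toolkit composable across ADaM datasets.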
{aNCA}
Maintained by contributors from F. Hoffmann-La Roche AG, {aNCA} is an open-source R Shiny application that makes Non-Compartmental Analysis (NCA) accessible to scientists working with clinical and pre-clinical pharmacokinetic datasets. Users can upload their own data, apply pre-processing filters and run NCA with configurable options including half-life calculation rules, manual slope selection and user-defined AUC intervals. Results are explorable through interactive box plots, scatter plots and summary statistics tables, and can be exported in `PP` and `ADPP` dataset domains alongside a reproducible R script. Analysis settings can be saved and reloaded for continuity across sessions. Installation is available from CRAN via a standard install command, from GitHub using the `pak` package manager, or by cloning the repository directly for those wishing to contribute.
{autoslider.core}
The {autoslider.core} package generates standard table templates commonly used in Study Results Endorsement Plans. Its principal purpose is to reduce duplicated effort between statisticians and programmers when creating slides. Available on CRAN, the package can be installed either through the standard installation method or directly from GitHub for the latest development version.
{cards}
Supporting the CDISC Analysis Results Standard, the {cards} package facilitates the creation of analysis results data sets that enhance automation, reproducibility and consistency in clinical research. Structured data sets for statistical summaries are generated to enable tasks such as quality control, pre-calculating statistics for reports and combining results across studies. Tools for creating, modifying and analysing these data sets are provided, with the {cardx} extension offering additional functions for statistical tests and models. Installation is available through CRAN or GitHub, with resources including documentation and community contributions.
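As a brief sketch of the workflow, assuming the documented `ard_continuous()` interface, a continuous summary can be rendered as a long, machine-readable analysis results dataset using built-in data:

```r
library(cards)

# One row per statistic (N, mean, sd, ...), stratified by cylinder count:
# the structured format intended for downstream tabulation and QC
ard <- ard_continuous(mtcars, by = cyl, variables = mpg)
ard
```

The long structure is the point: the same object can feed tables, quality-control comparisons or cross-study pooling without recomputation.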
{cardx}
Extending the {cards} package, {cardx} facilitates the creation of Analysis Results Data Objects (ARDs) in R by leveraging utility functions from {cards} and statistical methods from packages such as {stats} and {emmeans}. These ARDs enable the generation of tables and visualisations for regulatory submissions, support quality control checks by storing both results and parameters, and allow for reproducible analyses through the inclusion of function inputs. Installation options include CRAN and GitHub, with examples demonstrating its use in t-tests and regression models. External statistical library dependencies are not enforced by the package, requiring explicit references in code for tools like {renv} to track them.
{chevron}
A collection of high-level functions for generating standardised outputs in clinical trials reporting, {chevron} covers a broad range of output types including tables for safety summaries, adverse events, demographics, ECG results, laboratory findings, medical history, response data, time-to-event analyses and vital signs, as well as listings and graphs such as Kaplan-Meier and mean plots. Straightforward implementation with limited parameterisation is a defining characteristic of the package. It is available on CRAN, with a development version accessible via GitHub, and those requiring greater flexibility are directed to the related {tern} package and its associated catalogue.
{clinify}
Built on the {flextable} and {officer} packages, {clinify} streamlines the creation of clinical tables, listings and figures whilst addressing challenges such as adherence to organisational reporting standards, the need for flexibility across different clients and the importance of reusable configurations. Compatibility with existing tools is a key priority, ensuring that its features do not interfere with the core functionalities of {flextable} or {officer}, whilst enabling tasks like dynamic page breaks, grouped headers and customisable formatting. Complex documents such as Word files with consistent layouts and tailored elements like footnotes and titles can be produced with reduced effort by building on these established frameworks.
{connector}
Offering a unified interface for establishing connections to various data sources, the {connector} package covers file systems and databases through a central configuration file that maintains consistent references across project scripts and facilitates switching between data sources. Functions such as `connector_fs()` for file system access and `connector_dbi()` for database connections are provided, with additional expansion packages enabling integration with specific platforms like Databricks and SharePoint. Installation is available via CRAN or GitHub, and usage involves defining a YAML configuration file to specify connection details that can then be initialised and utilised to interact with data sources. Operations including reading, writing and listing content are supported, with methods for managing connections and handling data in formats like parquet.
{covtracer}
Linking test traces to package code and documentation using coverage data from {covr}, the {covtracer} package enables the creation of a traceability matrix that maps tests to specific documented functions. Installation is via remotes from GitHub with specific dependencies, and configuration of {covr} is required to record tests alongside coverage traces. Untested behaviours can be identified and the direct testing of functions assessed, providing insights into test coverage and software validation. The example workflow demonstrates generating a matrix to show which tests evaluate code related to documented behaviours, highlighting gaps in test coverage.
{datacutr}
An open-source solution for applying data cuts to SDTM datasets within R, the {datacutr} package is designed to support pharmaceutical data analysis workflows. Available via CRAN or GitHub, it offers options for different types of cuts tailored to specific SDTM domains. Supplemental qualifiers are assumed to be merged with their parent domain before processing, allowing users flexibility in defining cut types such as patient, date, or domain-specific cuts. Documentation, contribution guidelines and community support through platforms like Slack and GitHub provide further assistance.
{datasetjson}
Facilitating the creation and manipulation of CDISC Dataset JSON formatted datasets, the {datasetjson} R package enables users to generate structured data files by applying metadata attributes to data frames. Metadata such as file paths, study identifiers and system details can be incorporated into dataset objects and written to disk or returned as JSON text. Reading JSON files back into data frames is also supported, with metadata preserved as attributes for use in analysis. The package currently supports version 1.1.0 of the Dataset JSON standard and is available via CRAN or GitHub.
{dataviewR}
An interactive data viewer for R, {dataviewR} enhances data exploration through a Shiny-based interface that enables users to examine data frames and tibbles with tools for filtering, column selection and generating reproducible {dplyr} code. Viewing multiple datasets simultaneously is supported, and the tool provides metadata insights alongside features for importing and exporting data, all within a responsive and user-friendly design. By combining intuitive navigation with automated code generation, the package aims to streamline data analysis workflows and improve the efficiency of dataset manipulation and documentation.
{docorator}
Generating formatted documents by adding headers, footers and page numbers to displays such as tables and figures, {docorator} exports outputs as PDF or RTF files. Accepted inputs include tables created with the {gt} package, figures generated using {ggplot2}, or paths to existing PNG files, and users can customise document elements like titles and footers. The package can be installed from CRAN or via GitHub, and its use involves creating a display object with specified formatting options before rendering the output. LaTeX libraries are required for PDF generation.
{envsetup}
Providing a configuration system for managing R project environments, the {envsetup} package enables adaptation to different deployment stages such as development, testing and production without altering code. YAML files are used to define paths for data and output directories, and R scripts are automatically sourced from specified locations to reduce the need for manual configuration changes. This approach supports consistent code usage across environments whilst allowing flexibility in environment-specific settings, streamlining workflows for projects requiring multiple deployment contexts.
{ggsurvfit}
Simplifying the creation of survival analysis visualisations using {ggplot2}, the {ggsurvfit} package offers tools to generate publication-ready figures with features such as confidence intervals, risk tables and quantile markers. Seamless integration with {ggplot2} functions allows for extensive customisation of plot elements whilst maintaining alignment between graphical components and annotations. Competing risks analysis is supported through `ggcuminc()`, and specific functions such as `Surv_CNSR()` handle CDISC ADaM `ADTTE` data by adjusting event coding conventions to prevent errors. Installation options are available via CRAN or GitHub, with examples and further resources accessible through its documentation and community links.
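A minimal example, using the `df_colon` dataset bundled with the package, shows how the pieces combine: `survfit2()` fits the model while preserving the calling environment, and the annotation helpers layer on like any other {ggplot2} components.

```r
library(ggsurvfit)

# Kaplan-Meier curves by surgery type, with a confidence band
# and an at-risk table beneath the plot
survfit2(Surv(time, status) ~ surg, data = df_colon) |>
  ggsurvfit() +
  add_confidence_interval() +
  add_risktable()
```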
{gridify}
Addressing challenges in creating consistent and customisable graphical arrangements for figures and tables, the {gridify} package leverages the base {grid} package to facilitate the addition of headers, footers, captions and other contextual elements through predefined or custom layouts. Multiple input types are supported, including {ggplot2}, {flextable} and base R plots, and the workflow involves generating an object, selecting a layout and using functions to populate text elements before rendering the final output. Installation options include CRAN and GitHub, with examples demonstrating its application in enhancing tables with metadata and formatting. Uniformity across different projects is promoted, reducing manual adjustments and aligning visual elements consistently.
{gtsummary}
Offering a streamlined approach to generating publication-quality analytical and summary tables in R, the {gtsummary} package enables users to summarise datasets, regression models and other statistical outputs with minimal code. Variable types are identified automatically, relevant descriptive statistics computed and measures of data incompleteness included, whilst customisation of table formatting such as adjusting labels, adding p-values or merging tables for comparative analysis is also supported. Integration with packages like {broom} and {gt} facilitates the creation of visually appealing tables, and results can be exported to multiple formats including HTML, Word and LaTeX, making the package suitable for reproducible reporting in academic and professional contexts.
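The minimal-code claim is easy to demonstrate with the `trial` dataset that ships with {gtsummary}:

```r
library(gtsummary)

# Summarise selected variables by treatment arm; variable types and
# appropriate descriptive statistics are detected automatically
trial |>
  tbl_summary(
    by = trt,                 # stratify columns by treatment arm
    include = c(age, grade)   # variables to summarise
  ) |>
  add_p()                     # append a p-value column
```

Three function calls produce a formatted, publication-ready comparison table with missing-data counts included by default.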
{logrx}
Supporting logging in clinical programming environments, the {logrx} package generates detailed logs for R scripts, ensuring code execution is traceable and reproducible. An overview of script execution and the associated environment is provided, enabling users to recreate conditions for verification or further analysis. Available on CRAN, installation is possible via standard methods or from its development repository, offering flexibility for both file-based and scripted usage. Structured logging tailored to the specific requirements of clinical applications is the defining characteristic of the package, with simplicity and minimal intrusion in coding workflows maintained throughout.
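In practice the package centres on a single call; the script path below is invented for illustration:

```r
library(logrx)

# Execute the script and write a log alongside it capturing the
# session info, package versions, warnings, errors and results
axecute("analysis/adsl.R")
```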
{metacore}
Providing a standardised framework for managing metadata within R sessions, the {metacore} package is particularly suited to clinical trial data analysis. Metadata is organised into six interconnected tables covering dataset specifications, variable details, value definitions, derivations, code lists and supplemental information, ensuring consistency and ease of access. By centralising metadata in a structured, immutable format, the package facilitates the development of tools that can leverage this information across different workflows, reducing the need for redundant data structures. Reading metadata from various sources, including Define-XML 2.0, is also supported.
{metatools}
Working with {metacore} objects, {metatools} enables users to build datasets, enhance columns in existing datasets and validate data against metadata specifications. Installation is available from CRAN or via GitHub. Core functionality includes pulling columns from existing datasets, creating new categorical variables, converting columns to factors and running checks to verify that data conforms to control terminology and that all expected variables are present.
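A hedged sketch of the intended flow, assuming a `define.xml` specification and an invented `dm` data frame, and using function names as documented in the two packages:

```r
library(metacore)
library(metatools)

# Read the spec and scope it to one dataset's metadata
meta <- spec_to_metacore("define.xml") %>%
  select_dataset("ADSL")

# Pull predecessor columns from source data, then check values
# against the controlled terminology defined in the spec
adsl <- build_from_derived(meta, ds_list = list(dm = dm)) %>%
  check_ct_data(meta)
```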
{pharmaRTF}
Developed to address gaps in RTF output capabilities within R, {pharmaRTF} is a package for pharmaceutical industry programmers who produce RTF documents for clinical trial data analysis. Whilst the {huxtable} package offers extensive RTF styling and formatting options, it lacks the ability to set document properties such as page size and orientation, repeat column headers across pages, or create multi-level titles and footnotes within document headers and footers. These limitations are resolved by {pharmaRTF}, which wraps around {huxtable} tables to provide document property controls, proper multipage display and title and footnote management within headers and footers. Two core objects form the basis of the package: `rtf_doc` for document-wide attributes and `hf_line` for creating individual title and footnote lines, each carrying formatting properties such as alignment, font and bold or italic styling. Default output files use Courier New at 12-point size, Letter page dimensions in landscape orientation with one-inch margins, though all of these can be adjusted through property functions. The package is available on CRAN and supports both a {tidyverse} piping style and a more traditional assignment-based coding approach.
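The two core objects combine along these lines; this is a hedged sketch based on the documented `rtf_doc`/`hf_line` interface, with the output file name invented:

```r
library(huxtable)
library(pharmaRTF)

# Any huxtable can serve as the table body
ht <- as_hux(head(mtcars))

# Wrap it in an RTF document, attaching titles and footnotes
# as hf_line objects placed in the document header and footer
doc <- rtf_doc(ht) %>%
  add_titles(hf_line("Table 1.1: Example Output")) %>%
  add_footnotes(hf_line("Source: mtcars"))

write_rtf(doc, file = "example.rtf")
```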
{pharmaverseadam}
Serving as a repository for ADaM test datasets generated by executing templates from related packages such as {admiral} and its extensions, the {pharmaverseadam} package automates dataset creation through a script that installs required packages, runs templates and saves results. Metadata is managed centrally in an XLSX file to ensure consistency in documentation, and updates occur regularly or ad-hoc when templates change. Documentation is generated automatically from metadata and saved as `.R` files, and the package includes contributions from multiple developers with examples provided for each dataset. Preparing metadata, updating configuration files for new therapeutic areas and executing a script to generate datasets and documentation ensures alignment with the latest versions of dependent packages. Installation is available via CRAN or GitHub.
{pharmaverseraw}
Providing raw datasets to support the creation of SDTM datasets, the {pharmaverseraw} package includes examples that are independent of specific electronic data capture systems or data standards such as CDASH. Datasets are named using SDTM domain identifiers with the suffix `_raw`, and installation options include CRAN or direct GitHub access. Updates involve contributing via GitHub issues, generating new or modified datasets through standalone R scripts stored in the `data-raw` folder, and ensuring generated files are saved in the `data` folder as `.rda` files with consistent naming. Documentation is maintained in `R/*.R` files, and changes require updating `NAMESPACE` and `.Rd` files using `devtools::document()`.
{pharmaversesdtm}
A collection of test datasets formatted according to the SDTM standard, the {pharmaversesdtm} package is designed for use within the pharmaverse family of packages. Datasets applicable across therapeutic areas, such as `DM` and `VS`, are included alongside those specific to particular areas, like `RS` and `OE`. Available via CRAN and GitHub, the package provides installation instructions for both stable and development versions, with test data sourced from the CDISC pilot project and ad-hoc datasets generated by the {admiral} team. Naming conventions distinguish between general and therapeutic area-specific categories, with examples such as `dm` for general use and `rs_onco` for oncology-specific data. Updates involve creating or modifying R scripts in the `data-raw` folder, generating `.rda` files and updating metadata in a central JSON file to automate documentation and maintain consistency, including specifying dataset details like labels, descriptions and therapeutic areas.
{pkglite} (R)
Converting R package source code into text files and reconstructing package structures from those files, {pkglite} enables the exchange and management of R packages as plain text. Single or multiple packages can be processed through functions that collate, pack and unpack files, with installation options available via CRAN or GitHub. The tool adheres to a defined format for text files and includes documentation for generating specifications and managing file collections.
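The round trip looks roughly like this, following the documented collate/pack/unpack workflow; the paths are invented for illustration:

```r
library(pkglite)

# Collate the package files to include (file_default() covers the
# standard R package layout), then pack them into one text file
"path/to/mypkg" %>%
  collate(file_default()) %>%
  pack(output = "mypkg.txt")

# Later, reconstruct the original package structure from the text file
unpack("mypkg.txt", output = "restored/")
```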
{pkglite} (Python)
An open-source framework licensed under the MIT licence, {pkglite} for Python allows source projects written in any programming language to be packed into portable files and restored to their original directory structure. Installation is available via PyPI or as a development version cloned from GitHub, and the package can also be run without installation using `uvx`. A command line interface is provided in addition to the Python API and can be installed globally using `pipx`.
{rhino}
Streamlining the development of high-quality, enterprise-grade Shiny applications, {rhino} integrates software engineering best practices, modular code structures and robust testing frameworks. Scalable architecture is supported through modularisation, code quality is enhanced with unit and end-to-end testing, and automation is facilitated via tools for project setup, continuous integration and dependency management. Comprehensive documentation is divided into tutorials, explanations and guides, with examples and resources available for learning.
{risk.assessr}
Evaluating the reliability and security of R packages during validation, the {risk.assessr} package analyses maintenance, documentation and dependencies through metrics such as R CMD check results, unit test coverage and dependency assessments. A traceability matrix linking functions to tests is generated, and risk profiles are based on predefined thresholds including documentation completeness, licence type and code coverage. The tool supports installation from GitHub or CRAN, processes local package files or `renv.lock` dependencies and offers detailed outputs such as risk analysis, dependency lists and reverse dependency information. Advanced features include identifying potential issues in suggested package dependencies and generating HTML reports for risk evaluation, with applications in clinical trial workflows and package validation processes.
{riskassessment}
Built on the {riskmetric} framework, the {riskassessment} application offers a user-friendly interface for evaluating the risk of using R packages within regulated industries, assessing development practices, documentation and sustainability. Non-technical users can review {riskmetric} outputs, add personalised comments, categorise packages into risk levels, generate reports and store assessments securely, with features such as user authentication and role-based access. Alignment with validation principles outlined by the R Validation Hub supports decision-making in regulated settings, though deeper software inspection may be required in some cases. Deployment is possible using tools like Shiny Server or Posit Connect, with installation options including GitHub and local configuration via {renv}.
{riskmetric}
Providing a framework for evaluating the quality of R packages, the {riskmetric} package assesses development practices, documentation, community engagement and sustainability through a series of metrics. Currently operating in a maintenance-only phase, further development is focused on a new tool called {val.metre}. The workflow involves retrieving package information, assessing it against predefined criteria and generating a risk score, with installation available from CRAN or GitHub. An associated application, {riskassessment}, offers a user interface for organisations to review and manage package risk assessments, store metrics and apply organisational rules.
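The three-step workflow mentioned above maps directly onto three pipeable functions, run here against a package already installed in the local library:

```r
library(riskmetric)
library(dplyr)

# Reference the package, assess it against the metric battery,
# then roll the assessments up into a single risk score
pkg_ref("dplyr") %>%
  pkg_assess() %>%
  pkg_score()
```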
{rlistings}
Designed to create and display formatted listings with a focus on ASCII rendering for tables and regulatory-ready outputs, the {rlistings} R package relies on the {formatters} package for formatting infrastructure. Requirements such as flexible pagination, multiple output formats and repeated key columns informed its development. Available on CRAN and GitHub, the package is under active development and includes features such as adjustable column widths, alignment and support for titles and footnotes.
{rtables}
Tailored for generating submission-ready tables for health authority review, the {rtables} R package creates and displays complex tables with advanced formatting and output options that support regulatory requirements for clinical trial data presentation. Separation of data values from their visualisation is enabled, multiple values can be included within cells, and flexible tabulation and formatting capabilities are provided, including cell spans, rounding and alignment. Output formats include HTML, ASCII, LaTeX, PDF and PowerPoint, with additional formats under development. The package also incorporates features such as pagination, distinction between data names and labels for CDISC standards and support for titles and footnotes. Installation is available via CRAN or GitHub, with ongoing community support and training resources.
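The separation of layout from data is the package's signature idea: a layout is declared once and then applied to a data frame. A short sketch, using the `ex_adsl` example data that {rtables} makes available via its {formatters} dependency:

```r
library(rtables)

# Declare the table structure independently of any data
lyt <- basic_table() %>%
  split_cols_by("ARM") %>%                      # one column per arm
  analyze("AGE", afun = mean, format = "xx.x")  # mean age per column

# Apply the layout to a dataset to produce the table
build_table(lyt, ex_adsl)
```

Because the layout is a first-class object, the same structure can be reused across endpoints and studies.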
{rtflite}
A lightweight Python library focused on precise formatting of production-quality tables and figures, {rtflite} is designed for composing RTF documents. Installation is available via PyPI or directly from its GitHub repository, with optional dependencies available to enable DOCX assembly support and RTF-to-PDF or RTF-to-DOCX conversion via LibreOffice.
{sdtm.oak}
Offering a modular, open-source solution for generating CDISC SDTM datasets, the {sdtm.oak} R package is designed to work across different electronic data capture systems and data standards. Industry challenges related to inconsistent raw data structures and varying data collection practices are addressed through reusable algorithms that map raw datasets to SDTM domains, with current capabilities covering Findings, Events and Intervention classes. Future developments aim to expand domain support, introduce metadata-driven code generation and enhance automation potential, though sponsor-specific metadata management tasks are not yet handled by the package. Available on CRAN and GitHub, development is ongoing with refinements based on user feedback and evolving SDTM requirements.
{sdtmchecks}
Providing functions to detect common data issues in SDTM datasets, the {sdtmchecks} package is designed to be broadly applicable and useful for analysis. Installation is available from CRAN or via GitHub, with development versions accessible through specific repositories, and users are not required to specify SDTM versions. A range of data check functions stored as R scripts is included, and contributions are encouraged that maintain flexibility across different data standards.
{siera}
Facilitating the generation of Analysis Results Datasets (ARDs) by processing Analysis Results Standard (ARS) metadata, the {siera} package works with parameters such as analysis sets, groupings, data subsets and methods. Metadata is typically provided in JSON format and used to create R scripts automatically that, when executed with corresponding ADaM datasets, produce ARDs in a structured format. The package can be installed from CRAN or GitHub, and its primary function, `readARS`, requires an ARS file, an output directory and access to relevant ADaM data. The CDISC Analysis Results Standard underpins this process, promoting automation and consistency in analysis outcomes.
{teal}
An open-source, Shiny-based interactive framework for exploratory data analysis, {teal} is developed as part of the pharmaverse ecosystem and maintained by F. Hoffmann-La Roche AG alongside a broad community of contributors. Analytical applications are built by combining supported data types, including CDISC clinical trial data, independent or relational datasets and `MultiAssayExperiment` objects, with modular analytical components known as teal modules. These modules can be drawn from dedicated packages covering general data exploration, clinical reporting and multi-omics analysis and define the specific analyses presented within an application. A suite of companion packages handles logging, reproducibility, data loading, filtering, reporting and transformation. The package is available on CRAN and is under active development, with community support provided through the {pharmaverse} Slack workspace.
{tern}
Supporting clinical trial reporting through a broad range of analysis functions, the {tern} R package offers data visualisation capabilities including line plots, Kaplan-Meier plots, forest plots, waterfall plots and Bland-Altman plots. Statistical model fit summaries for logistic and Cox regression are also provided, along with numerous analysis and summary table functions. Many of these outputs can be integrated into interactive Teal Shiny applications via the {teal.modules.clinical} package.
{tfrmt}
Offering a structured approach to defining and applying formatting rules for data displays in clinical trials, the {tfrmt} package streamlines the creation of mock displays, aligns with industry-standard Analysis Results Data (ARD) formats and integrates formatting tasks into the programming workflow to reduce manual effort and rework. Metadata is leveraged to automate styling and layout, enabling standardised formatting with minimal code, supporting quality control before final output and facilitating the reuse of datasets across different table types. Built on the {gt} package, the tool provides a flexible interface for generating tables and mock-ups, allowing users to focus on data interpretation rather than repetitive formatting tasks.
{tfrmtbuilder}
A tool for defining display-related metadata to streamline the creation and modification of table formats, the {tfrmtbuilder} package supports workflows such as generating tables from scratch, using templates or editing existing ones. Features include a toggle to switch between mock and real data, options to load or create datasets, tools for mapping and formatting data and the ability to export results as JSON, HTML or PNG. Designed for use in study planning and analysis phases, the package allows users to manage table structures efficiently.
{tidyCDISC}
An open-source R Shiny application, {tidyCDISC} is designed to help clinical personnel explore and analyse ADaM-standard data sets without writing any code. Customised clinical tables can be generated through a point-and-click interface, trends across patient populations examined using dynamic figures and individual patient profiles explored in detail. A broad range of users is served, from clinical heads with no programming background to statisticians and statistical programmers, with reported time savings of around 95% for routine trial analysis tasks. The app accepts only `sas7bdat` files conforming to CDISC ADaM standards and includes a feature to export reproducible R scripts from its table generator. A demo version is available without installation using CDISC pilot data, whilst uploading study data requires installing the package from CRAN or via GitHub.
{tidytlg}
Facilitating the creation of tables, listings and graphs using the {tidyverse} framework, the {tidytlg} package offers two approaches: a functional method involving custom scripts for each output and a metadata-driven method that leverages column and table metadata to generate results automatically. Tools for data analysis, including frequency tables and univariate statistics, are included alongside support for exporting outputs to formatted documents.
{Tplyr}
Simplifying the creation of clinical data summaries by breaking down complex tables into reusable layers, {Tplyr} allows users to focus on presentation rather than repetitive data processing. The conceptual approach of {dplyr} is mirrored but applied to common clinical table types, such as counting event-based variables, generating descriptive statistics for continuous data and categorising numerical ranges. Metadata is included with each summary produced to ensure traceability from raw data to final output, and user-acceptance testing documentation is provided to support its use in regulated environments. Installation options are available via CRAN or GitHub, accompanied by detailed vignettes covering features like layer templates, metadata extension and styled table outputs.
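The layering idea is best seen in code; this is a hedged sketch on an invented `adsl` data frame, using the documented layer constructors:

```r
library(Tplyr)

# One count layer and one descriptive-statistics layer, each reusable
# across tables; build() resolves them into a presentation-ready tibble
t <- tplyr_table(adsl, TRT01P) %>%
  add_layer(group_count(AGEGR1, by = "Age group")) %>%
  add_layer(group_desc(AGE, by = "Age (years)"))

build(t)
```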
{valtools}
Streamlining the validation of R packages used in clinical research and drug development, {valtools} offers templates and functions to support tasks such as setting up validation frameworks, managing requirements and test cases and generating reports. Developed by the R Package Validation Framework PHUSE Working Group, the package integrates with standard development tools and provides functions prefixed with `vt` to facilitate structured validation processes including infrastructure setup, documentation creation and automated checks. Generating validation reports, scraping metadata from validation configurations and executing validation workflows through temporary installations or existing packages are all supported.
{whirl}
Facilitating the execution of scripts in batch mode whilst generating detailed logs that meet regulatory requirements, the {whirl} package produces logs including script status, execution timestamps, environment details, package versions and environmental variables, presented in a structured HTML format. Individual or multiple scripts can be run simultaneously, with parallel processing enabled through specified worker counts. A configuration file allows scripts to be executed in sequential steps, ensuring dependencies are respected, and the package produces individual logs for each script alongside a summary log and a tibble summarising execution outcomes. Installation options include CRAN and GitHub, with documentation available for customisation and advanced usage.
{xportr}
Assisting clinical programmers in preparing CDISC compliant XPT files for clinical data sets, the {xportr} package associates metadata with R data frames, performs validation checks and converts data into transportable SAS v5 XPT format. Tools are included to define variable types, set appropriate lengths, apply labels, format data, reorder variables and assign dataset labels, ensuring adherence to standards such as variable naming conventions, character length limits and the absence of non-ASCII characters. A practical example demonstrates how to use a specification file to apply these transformations to an ADSL dataset, ultimately generating a compliant XPT file.
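The metadata-driven pipeline can be sketched as below; the data frame and specification are hypothetical, with the spec column names following those used in the package vignettes:

```r
library(xportr)
library(dplyr)

# Hypothetical ADSL data and a variable-level specification
adsl <- data.frame(USUBJID = c("01-701-1015", "01-701-1023"),
                   AGE = c(63, 64))

var_spec <- data.frame(
  dataset  = "ADSL",
  variable = c("USUBJID", "AGE"),
  label    = c("Unique Subject Identifier", "Age"),
  type     = c("text", "integer"),
  length   = c(20, 8),
  order    = c(1, 2)
)

adsl %>%
  xportr_type(var_spec, "ADSL") %>%    # coerce variable types
  xportr_length(var_spec, "ADSL") %>%  # apply storage lengths
  xportr_label(var_spec, "ADSL") %>%   # attach variable labels
  xportr_order(var_spec, "ADSL") %>%   # reorder columns
  xportr_write("adsl.xpt")             # write the SAS v5 transport file
```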
Some R packages to explore as you find your feet with the language
Here are some commonly used R packages and other tools, along with others that I encountered while getting started with the language, which itself is becoming pervasive in my line of business. The collection grew organically as my explorations proceeded, and reflects what I was trying out during my acclimatisation.
General
Here are two general packages to get things started, with one of them being unavoidable in the R world. The other is more advanced, possibly offering more to package developers.
You cannot use R without knowing about the {tidyverse} collection of packages. In many ways, they form a mini-language of their own, drawing some criticism from those who reckon that base R functionality covers a sufficient gamut anyway. Nevertheless, there is so much here that will get you going with data wrangling and visualisation that it is worth knowing what is possible. Indeed, the complaints may stem from your not needing to use much else for these purposes.
The {plumber} package enables developers to convert existing R functions into web API endpoints by adding roxygen2-like comment annotations to their code. Once annotated, functions can handle HTTP GET and POST requests, accept query string or JSON parameters and return outputs such as plain values or rendered plots. The package is available on CRAN as a stable release, with a development version hosted on GitHub. For deployment, it integrates with DigitalOcean through a companion package called {plumberDeploy}, and also supports Posit Connect, PM2 and Docker as hosting options. Related projects in the same space include OpenCPU, which is designed for hosting R APIs in scientific research contexts, and the now-discontinued {jug} package, which took a more programmatic approach to API construction.
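The annotation style can be sketched in a short API definition file; this follows the pattern shown in the package's own examples:

```r
# plumber.R: annotate ordinary functions to expose them over HTTP

#* Echo back a message supplied as a query parameter
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(message = paste0("The message is: '", msg, "'"))
}

#* Return a histogram of random values as a rendered PNG
#* @serializer png
#* @get /plot
function() {
  hist(rnorm(100))
}

# To launch the API from another script or session:
# plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)
```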
Data Preparation
You simply cannot avoid working with data during any analysis or reporting work. While there is a learning curve if you are used to other languages, there is little doubt that R is well-endowed when it comes to performing these tasks. Here are some packages that extend base R capabilities and might even add some extra user-friendliness along the way.
The {forcats} package in R provides functions to manage categorical variables by reordering factor levels, collapsing infrequent values and adjusting their sequence based on frequency or other variables. It includes tools such as reordering by another variable, grouping rare categories into 'other' and modifying level order manually, which are useful for data analysis and visualisation workflows. Designed as part of the tidyverse, it integrates with other packages to streamline tasks like counting and plotting categorical data, enhancing clarity and efficiency in handling factors within R.
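Those reordering and lumping tools can be illustrated with a small factor; a quick sketch assuming {forcats} is installed:

```r
library(forcats)

f <- factor(c("b", "b", "a", "c", "c", "c", "d"))

fct_infreq(f)             # reorder levels by descending frequency
fct_rev(fct_infreq(f))    # ...then reverse that order
fct_lump_n(f, n = 2)      # keep the two most frequent levels, lump the rest into "Other"
fct_relevel(f, "c")       # move "c" to the front manually

# Reordering by another variable, handy before plotting
fct_reorder(factor(mtcars$cyl), mtcars$mpg, .fun = median)
```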
Around this time last year, I remember completing a LinkedIn course on a set of good practices known as tidy data, where each variable occupies a column, each observation a row and each value a single cell. The {tidyr} package is designed to help users restructure data so it follows those rules. It provides tools for reshaping data between long and wide formats, handling nested lists, splitting or combining columns, managing missing values and layering or flattening grouped data.
Installation options include the {tidyverse} collection, standalone installation, or the development version from GitHub. The package succeeds earlier reshaping tools like {reshape2} and {reshape}, offering a focused approach to tidying data rather than general reshaping or aggregation.
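The long/wide reshaping and column-splitting tools can be sketched as follows, using a small made-up data frame:

```r
library(tidyr)

# Wide data: one column per week
wide <- data.frame(artist = c("A", "B"),
                   wk1 = c(87, 92), wk2 = c(82, 88))

# Long format: one row per artist-week observation
long <- pivot_longer(wide, cols = starts_with("wk"),
                     names_to = "week", values_to = "rank")

# ...and back to wide again
pivot_wider(long, names_from = week, values_from = rank)

# Splitting one column into two
separate(data.frame(rate = c("3/10", "5/12")),
         rate, into = c("cases", "population"), sep = "/")
```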
Having a long track record of working with SAS, I find my interest aroused by {haven}, with its ability to read and write data files from statistical software such as SAS, SPSS and Stata by leveraging the ReadStat library. Handily, it supports a range of file formats, including SAS transport and data files, SPSS system and older portable files and Stata data files up to version 15, converting these into tibbles with enhanced printing capabilities. Value labels are preserved as a labelled class, allowing conversion to factors, while dates and times are transformed into standard R classes.
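A brief sketch of the reading, labelling and writing workflow; the file names here are hypothetical:

```r
library(haven)

# Read a SAS transport (v5 XPT) file and a SAS dataset
adsl <- read_xpt("adsl.xpt")
dm   <- read_sas("dm.sas7bdat")

# SAS value labels survive as a labelled class...
sex <- labelled(c(1, 2, 1), c(Male = 1, Female = 2))

# ...and can be converted to an ordinary factor when needed
as_factor(sex)

# Writing back out to transport format
write_xpt(adsl, "adsl_out.xpt")
```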
While there are other approaches to working with databases using R, {RMariaDB} provides a database interface and driver for MariaDB, designed to fully comply with the DBI specification and serve as a replacement for the older {RMySQL} package. It supports connecting to databases using configuration files, executing queries, reading and writing data tables and managing results in chunks. Installation options include binary packages from CRAN or development versions from GitHub, with additional dependencies such as MariaDB Connector/C or libmysqlclient required for Linux and macOS systems. Configuration is typically handled through a MariaDB-specific file, and the package includes acknowledgments for contributions from various developers and organisations.
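Because the package complies with the DBI specification, a session follows the standard DBI pattern; connection details below are hypothetical, and the chunked fetch shows the "managing results in chunks" point:

```r
library(DBI)

# Connect via the RMariaDB driver (hypothetical credentials)
con <- dbConnect(RMariaDB::MariaDB(),
                 dbname   = "analysis",
                 host     = "localhost",
                 user     = "analyst",
                 password = Sys.getenv("DB_PASSWORD"))

dbWriteTable(con, "mtcars", mtcars, overwrite = TRUE)

# Fetch query results in chunks rather than all at once
res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 5)
  print(nrow(chunk))
}
dbClearResult(res)
dbDisconnect(con)
```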
For many people, the pandemic may be a fading memory, yet it offered its chances for learning R, not least because there was a use case with more than a hint of personal interest about it. Here is a library making it easier to get hold of the data, with some added pre-processing too. Memories of how I needed to wrangle what was published by various sources make me appreciate just how vital it is to have harmonised data for analysis work.
Table Production
While many prefer the graphical presentation of results to their tabular display, R has its options here too. In recent times, those options have improved, particularly because of the pharmaverse initiative. Here is a selection of what I found during my explorations.
Part of the {officeverse} along with {officedown}, {flextable}, {rvg} and {mschart}, the {officer} R package enables users to create and modify Word and PowerPoint documents directly from R, allowing the insertion of images, tables and formatted content, as well as the import of document content into data frames. It supports the generation of RTF files and integrates with other packages for advanced features such as vector graphics and native office charts. Installation options include CRAN and GitHub, with community resources available for assistance and contributions. The package facilitates the manipulation of document elements like paragraphs, tables and section breaks and provides tools for exporting and importing content between R and office formats, alongside functions for managing slide layouts and embedded objects in presentations.
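Producing a simple Word document can be sketched in a few piped calls; a minimal example with a hypothetical output file name:

```r
library(officer)

# Build a Word document with a heading, a paragraph and a table
doc <- read_docx() |>
  body_add_par("Demographics Summary", style = "heading 1") |>
  body_add_par("Generated from R with the officer package.") |>
  body_add_table(head(mtcars))

# Write the assembled document to disk
print(doc, target = "summary.docx")
```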
If you work in clinical research like I do, the need to produce data tabulations is a non-negotiable requirement. That is how this package came to be developed, and the pharmaverse of which it is part has numerous other options, should you need to look at using one of those. The flavour of RTF produced here is the Microsoft Word variety, which did not look as good in LibreOffice Writer when I last examined the results with that open-source alternative. Otherwise, the output looks fine to many eyes.
Here is {formattable}, a package that enhances data presentation by applying customisable formatting to vectors and data frames, supporting formats such as percentages, currency and accounting. Available on GitHub and CRAN, it integrates with dynamic document tools like {knitr} and {rmarkdown} to produce visually distinct tables, with features including gradient colour scales, conditional styling and icon-based representations. It automatically converts to {htmlwidgets} in interactive environments and is licensed under MIT, enabling flexible use in both static and interactive data displays.
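Those formatters can be combined in a single call; a small sketch with made-up data, assuming {formattable} is installed:

```r
library(formattable)

df <- data.frame(
  product = c("A", "B", "C"),
  growth  = c(0.12, -0.05, 0.31),
  revenue = c(12000, 9000, 15000)
)

formattable(df, list(
  growth  = percent,                 # render proportions as percentages
  revenue = color_bar("lightblue"),  # proportional bar backgrounds
  product = formatter("span",        # conditional, per-cell styling
                      style = x ~ style(font.weight = "bold"))
))
```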
The {reactable} package for R provides interactive data tables built on the React Table library, offering features such as sorting, filtering, pagination, grouping with aggregation, virtual scrolling for large datasets and support for custom rendering through R or JavaScript. It integrates seamlessly into R Markdown documents and Shiny applications, enabling the use of HTML widgets and conditional styling. Installation options include CRAN and GitHub, with examples demonstrating its application across various datasets and scenarios. The package supports major web browsers and is licensed under MIT, designed for developers seeking dynamic data presentation tools within the R ecosystem.
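Grouping with aggregation and column filters can be sketched briefly with the iris data:

```r
library(reactable)

# Interactive table with sorting, filtering and grouped aggregation
reactable(
  iris,
  filterable = TRUE,
  searchable = TRUE,
  defaultPageSize = 10,
  groupBy = "Species",
  columns = list(
    Sepal.Length = colDef(aggregate = "mean",
                          format = colFormat(digits = 2))
  )
)
```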
Particularly useful in dynamic web applications like Shiny, the {DT} package in R provides a means of rendering interactive HTML tables by building on the DataTables JavaScript library. It supports features including sorting, searching, pagination and advanced filtering, with numeric, date and time columns using range-based sliders whilst factor and character columns rely on search boxes or dropdowns. Filtering operates on the client side by default, though server-side processing is also available. JavaScript callbacks can be injected after initialisation to manipulate table behaviour, such as enabling automatic page navigation or adding child rows to display additional detail. HTML content is escaped by default as a safeguard against cross-site scripting attacks, with the option to adjust this on a per-column basis. Whilst the package integrates with Shiny applications, attention is needed around scrolling and slider positioning to prevent layout problems. Overall, the package is well suited to exploratory data analysis and the building of interactive dashboards.
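A one-call sketch shows the filtering and option pass-through described above:

```r
library(DT)

datatable(
  mtcars,
  filter = "top",                  # per-column filters above the table
  options = list(pageLength = 5),  # passed straight through to DataTables
  escape = TRUE                    # escape HTML content (the safe default)
)
```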
The {gt} package in R enables users to create well-structured tables with a variety of formatting options, starting from data frames or tibbles and incorporating elements such as headers, footers and customised column labels. It supports output in HTML, LaTeX and RTF formats and includes example datasets for experimentation. The package prioritises simplicity for common tasks while offering advanced functions for detailed customisation, with installation available via CRAN or GitHub. Users can access resources like documentation, community forums and example projects to explore its capabilities, and it is supported by a range of related packages that extend its functionality.
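Starting from a data frame, headers and column formatting can be layered on step by step; a short sketch:

```r
library(gt)

head(mtcars) |>
  gt(rownames_to_stub = TRUE) |>
  tab_header(
    title = "Motor Trend Cars",
    subtitle = "First six rows of the mtcars dataset"
  ) |>
  fmt_number(columns = c(drat, wt, qsec), decimals = 2) |>
  cols_label(mpg = "Miles/Gallon", cyl = "Cylinders")
```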
Enabling users to produce publication-ready outputs with minimal code, the {gtsummary} package offers a streamlined approach to generating analytical and summary tables in R. It automates the summarisation of data frames, regression models and other datasets, identifying variable types and calculating relevant statistics, including measures of data incompleteness. Customisation options allow for formatting, merging and styling tables to suit specific needs, while integration with packages such as {broom} and {gt} facilitates seamless incorporation into R Markdown workflows. The package supports the creation of side-by-side regression tables and provides tools for exporting results as images, HTML, Word, or LaTeX files, enhancing flexibility for reporting and sharing findings.
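The automated summarisation can be sketched with the package's bundled trial dataset; the `include` argument assumes a reasonably recent version:

```r
library(gtsummary)

# Summary table split by treatment arm, with a p-value column
trial |>
  tbl_summary(
    by = trt,
    include = c(age, grade, response)  # variables to summarise
  ) |>
  add_p() |>       # add comparison p-values
  bold_labels()    # style the variable labels
```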
Here is an R package designed to generate LaTeX and HTML tables with a modern, user-friendly interface, offering extensive control over styling, formatting, alignment and layout. It supports features such as custom borders, padding, background colours and cell spanning across rows or columns, with tables modifiable using standard R subsetting or {dplyr} functions. Examples demonstrate its use for creating simple tables, applying conditional formatting and producing regression output with statistical details. The package also facilitates quick export to formats like PDF, DOCX, HTML and XLSX. Installation options include CRAN, R-Universe and GitHub, while the name reflects its origins as an enhanced version of the {xtable} package. The logo was generated using the package itself, and the background design draws inspiration from Piet Mondrian’s artwork.
Figure Generation
R has such a reputation for graphical presentations that it is cited as a strong reason to explore what the ecosystem has to offer. While base R itself is not shabby when it comes to creating graphs and charts, these packages will extend things by quite a way. In fact, the first on this list is near enough pervasive.
Though its default formatting does not appeal to me, the myriad options of {ggplot2} make it a very flexible tool, albeit at the expense of some code verbosity. Multi-panel plots are not among its strengths, which may send you elsewhere for that need.
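A flavour of that flexibility, and of the verbosity, comes through in even a modest plot; a sketch using the bundled mpg data:

```r
library(ggplot2)

# Scatter plot with a smoother, faceted by drivetrain;
# theme_minimal() tones down the default look
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~ drv) +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon") +
  theme_minimal()
```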
Focusing on features not included in the core library, the {ggforce} package extends {ggplot2} by offering additional tools to enhance data visualisation. Designed to complement the primary role of {ggplot2} in exploratory data analysis, it provides a range of geoms, stats and other components that are well-documented and implemented, aiming to support more complex and custom plot compositions. Available for installation via CRAN or GitHub, the package includes a variety of functionalities described in detail on its associated website, though specific examples are not included here.
Developed by Claus O. Wilke for internal use in his lab, {cowplot} is an R package designed to help with the creation of publication-quality figures built on top of {ggplot2}. It provides a set of themes, tools for aligning and arranging plots into compound figures and functions for annotating plots or combining them with images. The package can be installed directly from CRAN or as a development version via GitHub, and it has seen widespread use in the book Fundamentals of Data Visualisation.
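Arranging plots into a compound figure is the package's bread and butter; a quick sketch:

```r
library(ggplot2)
library(cowplot)

p1 <- ggplot(mtcars, aes(disp, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()

# Place the two plots side by side with panel labels
plot_grid(p1, p2, labels = c("A", "B"), ncol = 2)
```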
The {sjPlot} package provides a range of tools for visualising data and statistical results commonly used in social science research, including frequency tables, histograms, box plots, regression models, mixed effects models, PCA, correlation matrices and cluster analyses. It supports installation via CRAN for stable releases or through GitHub for development versions, with documentation and examples available online. The package is licensed under GPL-3 and developed by Daniel Lüdecke, offering functions to create visualisations such as scatter plots, Likert scales and interaction effect plots, along with tools for constructing index variables and presenting statistical outputs in tabular formats.
By offering a centralised approach to theming and enabling automatic adaptation of plot styles within Shiny applications, the {thematic} package simplifies the styling of R graphics, including {ggplot2}, {lattice} and base R plots, R Markdown documents and RStudio. It allows users to apply consistent visual themes across different plotting systems, with auto-theming in Shiny and R Markdown relying on CSS and {bslib} themes, respectively. Installation requires specific versions of dependent packages such as {shiny} and {rmarkdown}, while custom fonts benefit from {showtext} or {ragg}. Users can set global defaults for background, foreground and accent colours, as well as fonts, which can be overridden with plot-specific theme adjustments. The package also defines default colour scales for qualitative and sequential data and integrates with tools like bslib to import Google Fonts, enhancing visual consistency across different environments and user interfaces.
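Setting those global defaults takes one call; a sketch following the pattern in the package's README, with arbitrary example colours:

```r
library(thematic)
library(ggplot2)

# Apply one consistent theme across base, lattice and ggplot2 graphics
thematic_on(bg = "#222222", fg = "white", accent = "#0CE3AC")

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth()

thematic_off()  # restore the defaults afterwards
```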
Publishing Tools
The R ecosystem goes beyond mere graphical and tabular display production, offering the means to take things much further, including platforms for publishing your work. These can be used locally too, so there is no need to entrust everything to a third-party provider. The uses for what is available are endless, and it appears that Posit has used this to help with building documentation and training too.
What you have here is one of those distinguishing facilities of the R ecosystem, particularly for those wanting to share their analysis work with more than a hint of reproducibility. The tool combines narrative text and code to generate various outputs, supporting multiple programming languages and formats such as HTML, PDF and dashboards. It enables users to produce reports, presentations and interactive applications, with options for publishing and scheduling through platforms like RStudio Connect, facilitating collaboration and distribution of results in professional settings.
Distill for R Markdown is a tool designed to streamline the creation of technical documents, offering features such as code folding, syntax highlighting and theming. It builds on existing frameworks like Pandoc, MathJax and D3, enabling the production of dynamic, interactive content. Users can customise the appearance with CSS and incorporate appendices for supplementary information. The tool acknowledges the contributions of developers who created foundational libraries, ensuring accessibility and functionality for a wide audience. Its design prioritises clarity, allowing authors to focus on presenting results rather than underlying code, while maintaining flexibility for those who wish to include detailed explanations.
For a while, this was one of R's unique selling points, and it remains as compelling a reason to use the language even now that Python has its own version of the package. Shiny enables the creation of interactive web applications for data analysis without requiring web development expertise, allowing users to build interfaces that let others explore data through dynamic visualisations and filters. Here is a simple example: an app that generates scatter plots with adjustable variables, species filters and marginal plots, hosted either on personal servers or through a dedicated hosting service.
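A more minimal sketch than the app just described, assuming only {shiny} is installed: two variable selectors driving a scatter plot of the iris data.

```r
library(shiny)

ui <- fluidPage(
  selectInput("x", "X variable", names(iris)[1:4]),
  selectInput("y", "Y variable", names(iris)[1:4],
              selected = "Sepal.Width"),
  plotOutput("scatter")
)

server <- function(input, output, session) {
  output$scatter <- renderPlot({
    # Re-drawn reactively whenever either selector changes
    plot(iris[[input$x]], iris[[input$y]],
         xlab = input$x, ylab = input$y, pch = 19)
  })
}

shinyApp(ui, server)
```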
The {bslib} R package offers a modern user interface toolkit for Shiny and R Markdown applications, leveraging Bootstrap to enable the creation of customisable dashboards and interactive theming. It supports the use of updated Bootstrap and Bootswatch versions while maintaining compatibility with existing defaults, and provides tools for real-time visual adjustments. Installation is available through CRAN, with example previews demonstrating its capabilities.
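Applying a theme to a Shiny app is a one-argument change; the `minty` Bootswatch preset below is just an example choice:

```r
library(shiny)
library(bslib)

ui <- fluidPage(
  # Swap in an updated Bootstrap version and a Bootswatch preset
  theme = bs_theme(version = 5, bootswatch = "minty"),
  titlePanel("Themed app"),
  sliderInput("n", "Observations", min = 10, max = 100, value = 50),
  plotOutput("hist")
)

server <- function(input, output, session) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server)
```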
Enabling users to manipulate and validate data within a spreadsheet-like interface, the {rhandsontable} package introduces an interactive data grid for R. It supports features such as custom cell rendering, validation rules and integration with Shiny applications. When used in Shiny, the widget requires explicit conversion of data using the `hot_to_r()` function, as updates may not be immediately reflected in reactive contexts. Examples demonstrate its application in various scenarios, including date editing, financial calculations and dynamic visualisations linked to charts. The package also accommodates bookmarks in Shiny apps with specific handling. Users are encouraged to report issues or contribute improvements, with guidance provided for those seeking to expand its functionality. The development team welcomes feedback to refine the tool further, ensuring it aligns with evolving user needs.
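The explicit round trip between the widget and an R data frame is the part that trips people up; a small Shiny sketch:

```r
library(shiny)
library(rhandsontable)

df <- data.frame(item = c("A", "B"), qty = c(1L, 2L))

ui <- fluidPage(
  rHandsontableOutput("grid"),
  verbatimTextOutput("total")
)

server <- function(input, output, session) {
  output$grid <- renderRHandsontable(rhandsontable(df))
  output$total <- renderPrint({
    req(input$grid)
    # Edits live in the widget until converted back explicitly
    edited <- hot_to_r(input$grid)
    sum(edited$qty)
  })
}

shinyApp(ui, server)
```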
{xaringanExtra} offers a range of enhancements and extensions for creating and presenting slides with xaringan, enabling features such as adding an overview tile view, making slides editable, broadcasting in real time, incorporating animations, embedding live video feeds and applying custom styles. It allows users to selectively activate individual tools or load multiple features simultaneously through a single function call, supporting tasks like adding banners, enabling code copying, fitting slides to screen dimensions and integrating utility toolkits. The package is available for installation via CRAN or GitHub, providing flexibility for developers and presenters seeking to expand the functionality of their slides.
Online R programming books that are worth bookmarking
As part of making content more useful following its reorganisation, numerous articles on the R statistical computing language have appeared here. All of those have taken a more narrative form. With this collation of online books on the R language, I take a different approach. What you find below is a collection of links with associated descriptions. While narrative accounts can be very useful, there is something handy about running one's eye down a compilation as well. Many entries have a corresponding print edition, some of which are not cheap to buy, which makes me wonder about the economics of posting the content online as well, though it can help with getting feedback during book preparation.
We start with this comprehensive collection of over 400 free and affordable resources related to the R programming language, organised into categories such as data science, statistics, machine learning and specific fields like economics and life sciences. In many ways, it is a superset of what you find below and complements this collection with many other finds. The fact that it is a living collection makes it even more useful.
R Programming for Data Science
Here is an introduction to the R programming language, focusing on its application in data science. It covers foundational topics such as installation, data manipulation, function writing, debugging and code optimisation, alongside advanced concepts like parallel computation and data analysis case studies. The text includes practical guidance on handling data structures, using packages such as {dplyr} and {readr} as well as working with dates, times and regular expressions. Additional sections address control structures, scoping rules and profiling techniques, while the author also discusses resources for staying updated through a podcast and accessing e-book versions for ongoing revisions.
Designed for individuals with no prior coding experience, the book provides an introduction to programming in R while using practical examples to teach fundamental concepts such as data manipulation, function creation and the use of R's environment system. It is structured around hands-on projects, including simulations of weighted dice, playing cards and a slot machine, alongside explanations of core programming principles like objects, notation, loops and performance optimisation. Additional sections cover installation, package management, data handling and debugging techniques. While the book is written using RMarkdown and published under a Creative Commons licence, a physical edition is available through O’Reilly.
What you have here is one of several books written by Hadley Wickham. This one is published in its second edition as part of Chapman and Hall's R Series and is aimed primarily at R users who want to deepen their programming skills and understanding of the language, though it is also useful for programmers migrating from other languages. The book covers a broad range of topics organised into sections on foundations, functional programming, object-oriented programming, metaprogramming and techniques, with the latter including debugging, performance measurement and rewriting R code in C++.
Unlike Paul Teetor's separately published R Cookbook, the Cookbook for R was created by Winston Chang. It offers solutions to common tasks and problems in data analysis, covering topics such as basic operations, numbers, strings, formulas, data input and output, data manipulation, statistical analysis, graphs, scripts and functions, and tools for experiments.
The second edition of R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund offers a structured approach to learning data science with R, covering essential skills such as data visualisation, transformation, import, programming and communication. Organised into chapters that explore workflows, data manipulation techniques and tools like Quarto for reproducible research, the book emphasises practical applications and best practices for handling data effectively.
The R Graphics Cookbook, 2nd edition, offers a comprehensive guide to creating visualisations in R, structured into chapters that cover foundational skills such as installing and using packages, loading data from various formats and exploring datasets through basic plots. It progresses to detailed techniques for constructing bar graphs, line graphs, scatter plots and histograms, alongside methods for customising axes, annotations, themes and legends.
The book also addresses advanced topics like colour application, faceting data into subplots, generating specialised graphs such as network diagrams and heat maps and preparing data for visualisation through reshaping and summarising. Additional sections focus on refining graphical outputs for presentation, including exporting to different file formats and adjusting visual elements for clarity and aesthetics, while an appendix provides an overview of the {ggplot2} system.
R Markdown: The Definitive Guide
Published by Chapman & Hall/CRC, R Markdown: The Definitive Guide by Yihui Xie, J.J. Allaire and Garrett Grolemund covers the R Markdown document format, which has been in use since 2012 and is built on the knitr and Pandoc tools. The format allows users to embed code within Markdown documents and compile the results into a range of output formats including PDF, HTML and Word. The guide covers a broad scope of practical applications, from creating presentations, dashboards, journal articles and books to building interactive applications and generating blogs, reflecting how the ecosystem has matured since the {rmarkdown} package was first released in 2014.
A key principle running throughout is that Markdown's deliberately limited feature set is a strength rather than a drawback, encouraging authors to focus on content rather than complex typesetting. Despite this simplicity, the format remains highly customisable through tools such as Pandoc templates, LaTeX and CSS. Documents produced in R Markdown are also notably portable, as their straightforward syntax makes conversion between output formats more reliable, and because results are generated dynamically from code rather than entered manually, they are far more reproducible than those produced through conventional copy-and-paste methods.
The R Markdown Cookbook is a practical guide designed to help users enhance their ability to create dynamic documents by combining analysis and reporting. It covers essential topics such as installation, document structure, formatting options and output formats like LaTeX, HTML and Word, while also addressing advanced features such as customisations, chunk options and integration with other programming languages. The book provides step-by-step solutions to common tasks, drawing on examples from online resources and community discussions to offer clear, actionable advice for both new and experienced users seeking to improve their workflow and explore the full potential of R Markdown.
This book provides a practical guide to using R Markdown for scientists, developed from a three-hour workshop and designed to evolve as a living resource. It covers essential topics such as setting up R Markdown documents, integrating with RStudio for efficient workflows, exporting outputs to formats like PDF, HTML and Word, managing figures and tables with dynamic references and captions, incorporating mathematical equations, handling bibliographies with citations and style adjustments, troubleshooting common issues and exploring advanced R Markdown extensions.
bookdown: Authoring Books and Technical Documents with R Markdown
Here is a guide to using the {bookdown} package, which extends R Markdown to facilitate the creation of books and technical documents. It covers Markdown syntax, integration of R code, formatting options for HTML, LaTeX and e-book outputs and features such as cross-referencing, custom blocks and theming. The package supports both multipage and single-document outputs, and its applications extend beyond traditional books to include course materials, manuals and other structured content. The work includes practical examples, publishing workflows and details on customisation, alongside information about licensing and the availability of a printed version.
blogdown: Creating Websites with R Markdown
Though the authors note that some information may be outdated due to recent updates to Hugo and the {blogdown} package, directing readers to additional resources for the latest features and changes, this book still provides a guide to building static websites using R Markdown and the Hugo static site generator, emphasising the advantages of this approach for creating reproducible, portable content. It covers installation, configuration, deployment options such as Netlify and GitHub Pages, migration from platforms like WordPress and advanced topics including custom layouts and version control, as well as practical examples, workflow recommendations and discussions on themes, content management and technical aspects of website development.
pagedown: Create Paged HTML Documents for Printing from R Markdown
The R package {pagedown} enables users to create paged HTML documents suitable for printing to PDF, using R Markdown combined with a JavaScript library called paged.js, the latter of which implements W3C specifications for paged media. While tools like LaTeX and Microsoft Word have traditionally dominated PDF production, {pagedown} offers an alternative approach through HTML and CSS, supporting a range of document types including resumes, posters, business cards, letters, theses and journal articles.
Documents can be converted to PDF via Google Chrome, Microsoft Edge or Chromium, either manually or through the chrome_print() function, with additional support for server-based, CI/CD pipeline and Docker-based workflows. The package provides customisable CSS stylesheets, a CSS overriding mechanism for adjusting fonts and page properties, and various formatting features such as lists of tables and figures, abbreviations, footnotes, line numbering, page references, cover images, running headers, chapter prefixes and page breaks. Previewing paged documents requires a local or remote web server, and the layout is sensitive to browser zoom levels, with 100% zoom recommended for the most accurate output.
Dynamic Documents with R and knitr
Developed by Yihui Xie and inspired by the earlier {Sweave} package, {knitr} is an R package designed for dynamic report generation that consolidates the functionality of numerous other add-on packages into a single, cohesive tool. It supports multiple input languages, including R, Python and shell scripts, as well as multiple output markup languages such as LaTeX, HTML, Markdown, AsciiDoc and reStructuredText. The package operates on a principle of transparency, giving users full control over how input and output are handled, and runs R code in a manner consistent with how it would behave in a standard R terminal.
Among its notable features are built-in caching, automatic code formatting via the {formatR} package, support for more than 20 graphics devices and flexible options for managing plots within documents. It also allows advanced users to define custom hooks and regular expressions to extend and tailor its behaviour further. The package is affiliated with the Foundation for Open Access Statistics, a nonprofit organisation promoting free software, open access publishing and reproducible research in statistics.
Mastering Shiny is a comprehensive guide to developing web applications using R, focusing on the Shiny framework designed for data scientists. It introduces core concepts such as user interface design, reactive programming and dynamic content generation, while also exploring advanced topics like performance optimisation, security and modular app development. The book covers practical applications across industries, from academic teaching tools to real-time analytics dashboards, and aims to equip readers with the skills to build scalable, maintainable applications. It includes detailed chapters on workflow, layout, visualisation and user interaction, alongside case studies and technical best practices.
Engineering Production-Grade Shiny Apps
This is aimed at developers and team managers who already possess a working knowledge of the Shiny framework for R and wish to advance beyond the basics toward building robust, production-ready applications. Rather than covering introductory Shiny concepts or post-deployment concerns, the book focuses on the intermediate ground between those two stages, addressing project management, workflow, code structure and optimisation.
It introduces the {golem} package as a central framework and guides readers through a five-step workflow covering design, prototyping, building, strengthening and deployment, with additional chapters on optimisation techniques including R code performance, JavaScript integration and CSS. The book is structured to serve both those with project management responsibilities and those focused on technical development, acknowledging that in many small teams these roles are carried out by the same individual.
Outstanding User Interfaces with Shiny
Written by David Granjon and published in 2022, Outstanding User Interfaces with Shiny is a book aimed at filling the gap between beginner and advanced Shiny developers, covering how to deeply customise and enhance Shiny applications to the point where they become indistinguishable from classic web applications. The book spans a wide range of topics, including working with HTML and CSS, integrating JavaScript, building Bootstrap dashboard templates, mobile development and the use of React, providing a comprehensive resource that consolidates knowledge and experience previously scattered across the Shiny developer community.
R Packages
Now in its second edition, R Packages by Hadley Wickham and Jennifer Bryan is a freely available online guide that teaches readers how to develop packages in R. A package is the core unit of shareable and reproducible R code, typically comprising reusable functions, documentation explaining how to use them and sample data. The book guides readers through the entire process of package development, covering areas such as package structure, metadata, dependencies, testing, documentation and distribution, including how to release a package to CRAN. The authors encourage a gradual approach, noting that an imperfect first version is perfectly acceptable provided each subsequent version improves on the last.
Mastering Spark with R
Written by Javier Luraschi, Kevin Kuo and Edgar Ruiz, Mastering Spark with R is a comprehensive guide designed to take readers from little or no familiarity with Apache Spark or R through to proficiency in large-scale data science. The book covers a broad range of topics, including data analysis, modelling, pipelines, cluster management, connections, data handling, performance tuning, extensions, distributed computing, streaming and contributing to the Spark ecosystem.
Happy Git and GitHub for the useR
Here is a practical guide written by Jenny Bryan and contributors, aimed primarily at R users involved in data analysis or package development. It covers the installation and configuration of Git alongside GitHub, the development of key workflows for common tasks and the integration of these tools into day-to-day work with R and R Markdown. The guide is structured to take readers from initial setup through to more advanced daily workflows, with particular attention paid to how Git and GitHub serve the needs of data science rather than pure software development.
JavaScript for R
Written by John Coene and intended for release as part of the CRC Press R series, JavaScript for R explores how the R programming language and JavaScript can be used together to enhance data science workflows. Rather than teaching JavaScript as a standalone language, the book demonstrates how a limited working knowledge of it can meaningfully extend what R developers can achieve, particularly through the integration of external JavaScript libraries.
The book covers a broad range of topics, progressing from foundational concepts through to data visualisation using the {htmlwidgets} package, bidirectional communication with Shiny, JavaScript-powered computations via the V8 engine and Node.js and the use of modern JavaScript tools such as Vue, React and webpack alongside R. Practical examples are woven throughout, including the building of interactive visualisations, custom Shiny inputs and outputs, image classification and machine learning operations, with all accompanying code made publicly available on GitHub.
HTTP Testing in R
This guide addresses challenges faced by developers of R packages that interact with web resources, offering strategies to create reliable unit tests despite dependencies on internet connectivity, authentication and external service availability. It explores tools such as {vcr}, {webmockr}, {httptest} and {webfakes}, which enable mocking and recording HTTP requests to ensure consistent testing environments, reduce reliance on live data and improve test reliability. The text also covers advanced topics like handling errors, securing tests and ensuring compatibility with CRAN and Bioconductor, while emphasising best practices for maintaining test robustness and contributor-friendly workflows. Funded by rOpenSci and the R Consortium, the resource aims to support developers in building more resilient and maintainable R packages through structured testing approaches.
The Shiny AWS Book
The Shiny AWS Book is an online resource designed to teach data scientists how to deploy, host and maintain Shiny web applications using cloud infrastructure. Addressing a common gap in data science education, it guides readers through a range of DevOps technologies including AWS, Docker, Git, NGINX and open-source Shiny Server, covering everything from server setup and cost management to networking, security and custom configuration.
{ggplot2}: Elegant Graphics for Data Analysis
The third edition of {ggplot2}: Elegant Graphics for Data Analysis provides an in-depth exploration of the Grammar of Graphics framework, focusing on the theoretical foundations and detailed implementation of the ggplot2 package rather than offering step-by-step instructions for specific visualisations. Written by Hadley Wickham, Danielle Navarro and Thomas Lin Pedersen, the book is presented as an online work-in-progress, with content structured across sections such as layers, scales, coordinate systems and advanced programming topics. It aims to equip readers with the knowledge to customise plots according to their needs, rather than serving as a direct guide for creating predefined graphics.
YaRrr! The Pirate’s Guide to R
Written by Nathaniel D. Phillips, this is a beginner-oriented guide to learning the R programming language from the ground up, covering everything from installation and basic navigation of the RStudio environment through to more advanced topics such as data manipulation, statistical analysis and custom function writing. The guide progresses logically through foundational concepts including scalars, vectors, matrices and dataframes before moving into practical areas such as hypothesis testing, regression, ANOVA and Bayesian statistics. Visualisation is given considerable attention across dedicated chapters on plotting, while later sections address loops, debugging and managing data from a variety of file formats. Each chapter includes practical exercises to reinforce learning, and the book concludes with a solutions section for reference.
Data Visualisation: A Practical Introduction
Data Visualisation: A Practical Introduction is a forthcoming second edition from Princeton University Press, written by Kieran Healy and due for release in March 2026, which teaches readers how to explore, understand and present data using the R programming language and the {ggplot2} library. The book aims to bridge the gap between works that discuss visualisation principles without teaching the underlying tools and those that provide code recipes without explaining the reasoning behind them, instead combining both practical instruction and conceptual grounding.
Revised and updated throughout to reflect developments in R and {ggplot2}, the second edition places greater emphasis on data wrangling, introduces updated and new datasets, and substantially rewrites several chapters, particularly those covering statistical models and map-drawing. Readers are guided through building plots progressively, from basic scatter plots to complex layered graphics, with the expectation that by the end they will be able to reproduce nearly every figure in the book and understand the principles that inform each choice.
The book also addresses the growing role of large language models in coding workflows, arguing that genuine understanding of what one is doing remains essential regardless of the tools available. It is suitable for complete beginners, those with some prior R experience, and instructors looking for a course companion, and requires the installation of R, RStudio and a number of supporting packages before work can begin.
Learning R for Data Analysis: Going from the basics to professional practice
R has grown from a specialist statistical language into one of the most widely recognised tools for working with data. Across tutorials, community sites, training platforms and industry resources, it is presented as both a programming language and a software environment for statistical computing, graphics and reporting. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand, and its name draws on the first letter of their first names while also alluding to the Bell Labs language S. It is freely available under the GNU General Public Licence and runs on Linux, Windows and macOS, which has helped it spread across research, education and industry alike.
What Makes R Distinctive
What makes R notable is its combination of programming features with a strong focus on data analysis. Introductory material, such as the tutorials at Tutorialspoint and Datamentor, repeatedly highlights its support for conditionals, loops, user-defined recursive functions and input and output, but these sit alongside effective data handling, a broad set of operators for arrays, lists, vectors and matrices, and strong graphical capabilities. That mixture means R can be used for straightforward scripts and for complex analytical workflows. A beginner may start by printing "Hello, World!" with the print() function, while a more experienced user may move on to regression models, interactive dashboards or automated reporting.
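That progression from a first printed string to analytical work can be sketched in a few lines of base R; the height and weight values here are purely illustrative:

```r
# A first R script: print a greeting, then take one small step beyond it
print("Hello, World!")

# The same building blocks scale up quickly: vectors, a summary, a model
heights <- c(150, 160, 170, 180)
weights <- c(50, 60, 70, 80)
mean(heights)                 # arithmetic works on whole vectors at once
fit <- lm(weights ~ heights)  # a simple linear regression in one line
coef(fit)
```

The jump from print() to lm() is small in code but large in capability, which is precisely the mixture the tutorials highlight.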
The Learning Progression
Learning materials generally present R in a structured progression. A beginner is first introduced to reserved words, variables and constants, operators and the order in which expressions are evaluated. From there, the path usually moves into flow control through if…else, ifelse(), for, while, repeat and the use of break and next, before functions follow naturally, including return values, environments and scope, recursive functions, infix operators and switch(). Most sources agree that confidence with the syntax and fundamentals is the real starting point, and this early sequence matters because it helps learners become comfortable reading and writing R rather than only copying examples.
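The early sequence described above, flow control followed by functions, can be condensed into a short, self-contained sketch with invented values:

```r
# if...else inside a user-defined function, with an explicit return path
classify <- function(x) {
  if (x > 0) "positive" else if (x < 0) "negative" else "zero"
}

# ifelse() is the vectorised counterpart of if...else
parity <- ifelse(1:6 %% 2 == 0, "even", "odd")

# A recursive function with return()
factorial_r <- function(n) {
  if (n <= 1) return(1)
  n * factorial_r(n - 1)
}

# for with next and break
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next  # skip even numbers
  if (i > 7) break       # stop once past 7
  total <- total + i     # accumulates 1 + 3 + 5 + 7
}

classify(-3)     # "negative"
factorial_r(5)   # 120
total            # 16
```

Reading a snippet like this comfortably, rather than copying it, is the confidence with the fundamentals that most sources treat as the real starting point.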
After the basics, attention tends to turn to the structures that make R so useful for data work. Vectors, matrices, lists, data frames and factors appear in nearly every introductory course because they are central to how information is stored and manipulated. Object-oriented concepts also emerge quite early in some routes through the language, with classes and objects extending into S3, S4 and reference classes. For someone coming from spreadsheets or point-and-click statistical software, this shift can feel significant, but it also opens the way to more reproducible and flexible analysis.
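A minimal tour of those core structures, again with illustrative values, looks like this:

```r
v <- c(2, 4, 6)                    # vector: one type, many values
m <- matrix(1:6, nrow = 2)         # matrix: a 2 x 3 grid, filled by column
l <- list(name = "iris", n = 150)  # list: mixed types, named slots
df <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

v[2]              # 4 -- vectors index from 1
m[2, 3]           # 6 -- row 2, column 3
l$name            # "iris"
levels(df$group)  # "a" "b" -- factors store categorical levels
```

Each structure has its own indexing and semantics, which is exactly the shift that can feel significant to someone arriving from spreadsheets.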
Visualisation
Visualisation is another recurring theme in R education. Basic chart types such as bar plots, histograms, pie charts, box plots and strip charts are common early examples because they show how quickly data can be turned into graphics. More advanced lessons widen the scope through plot functions, multiple plots, saving graphics, colour selection and the production of 3D plots.
Beyond base plotting, there is extensive evidence of the central role of {ggplot2} in contemporary R practice. Data Cornering demonstrates this well, with articles covering how to create funnel charts in R using {ggplot2} and how to diversify stacked column chart data label colours, showing how R is used not only to summarise data but also to tell visual stories more clearly. In the pharmaceutical and clinical research space, the PSI VIS-SIG blog is published by the PSI Visualisation Special Interest Group and summarises its monthly Wonderful Wednesday webinars, presenting real-world datasets and community-contributed chart improvements alongside news from the group.
Data Wrangling and the Tidyverse
Much of modern R work is built around data wrangling, and here the {tidyverse} has become especially prominent. Claudia A. Engel's openly published guide Data Wrangling with R (last updated 3rd November 2023) sets out a preparation phase that assumes some basic R knowledge, a recent installation of R and RStudio and the installation of the {tidyverse} package with install.packages("tidyverse") followed by library(tidyverse). It also recommends creating a dedicated RStudio project and downloading CSV files into a data subdirectory, reinforcing the importance of organised project structure.
That same guide then moves through data manipulation with {dplyr}, covering selecting columns and filtering rows, pipes, adding new columns, split-apply-combine, tallying and joining two tables, before moving on to {tidyr} topics such as long and wide table formats, pivot_wider, pivot_longer and exporting data. These topics reflect a broader pattern in the R ecosystem because data import and export, reshaping, combining tables and counting by group recur across teaching resources as they mirror common analytical tasks.
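As a rough sketch of that {dplyr}-then-{tidyr} sequence, assuming the tidyverse packages are installed; the small surveys data frame is invented for illustration:

```r
library(dplyr)
library(tidyr)

surveys <- data.frame(
  site  = c("A", "A", "B", "B"),
  year  = c(2022, 2023, 2022, 2023),
  count = c(10, 12, 7, 9)
)

summary_tbl <- surveys %>%
  filter(count > 5) %>%           # keep rows meeting a condition
  mutate(logged = log(count)) %>% # add a derived column
  group_by(site) %>%              # split-apply-combine
  summarise(total = sum(count), .groups = "drop")

# Long-to-wide reshaping with pivot_wider: one column per year
wide <- surveys %>%
  pivot_wider(names_from = year, values_from = count)
```

The pipeline reads top to bottom as a sequence of verbs, which is much of why these tasks recur so consistently across teaching resources.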
Applications and Professional Use
The range of applications attached to R is wide, though data science remains the clearest centre of gravity. Educational sources describe R as valuable for data wrangling, visualisation and analysis, often pointing to packages such as {dplyr}, {tidyr}, {ggplot2} and {Shiny}. Statistical modelling is another major strand, with R offering extensible techniques for descriptive and inferential statistics, regression analysis, time series methods and classical tests. Machine learning appears as a further area of growth, supported by a large and expanding package ecosystem. In more advanced contexts, R is also linked with dashboards, web applications, report generation and publishing systems such as Quarto and R Markdown.
R's place in professional settings is underscored by the breadth of organisations and sectors associated with it. Introductory resources mention companies such as Google, Microsoft, Facebook, ANZ Bank, Ford and The New York Times as examples of organisations using R for modelling, forecasting, analysis and visualisation. The NHS-R Community promotes the use of R and open analytics in health and care, building a community of practice for data analysis and data science using open-source software in the NHS and wider UK health and care system. Its resources include reports, blogs, webinars and workshops, books, videos and R packages, with webinar materials archived in a publicly accessible GitHub repository. The R Validation Hub, supported through the pharmaR initiative, is a collaboration to support the adoption of R within a biopharmaceutical regulatory setting and provides tools including the {riskmetric} package, the {riskassessment} app and the {riskscore} package for assessing package quality and risk.
The Wider Ecosystem
The wider ecosystem around R is unusually rich. The R Consortium promotes the growth and development of the R language and its ecosystem by supporting technical and social infrastructure, fostering community engagement and driving industry adoption. It notes that the R language supports over two million users and has been adopted in industries including biotech, finance, research and high technology. Community growth is visible not only through organisations and conferences but through user groups, scholarships, project working groups and local meetups, which matters because learning a language is easier when there is an active support network around it.
Another sign of maturity is the depth of R's package and publication landscape. rdrr.io provides a comprehensive index of over 29,000 CRAN packages alongside more than 2,100 Bioconductor packages, over 2,200 R-Forge packages and more than 76,000 GitHub packages, making it possible to search for packages, functions, documentation and source code in one place. RDocumentation, powered by DataCamp, covers 32,130 packages across CRAN and Bioconductor and offers a searchable interface for function-level documentation. The Journal of Statistical Software adds a scholarly dimension, publishing open-access articles on statistical computing software together with source code, with full reproducibility mandatory for publication. R-bloggers aggregates R news and tutorials contributed by hundreds of R bloggers, while R Weekly curates a community digest and an accompanying podcast, both helping users keep pace with the steady flow of tutorials, package releases, blog posts and developments across the R world.
Where to Begin
For beginners, one recurring challenge is knowing where to start, and different learning routes reflect different backgrounds. Datamentor points learners towards step-by-step tutorials covering popular topics such as R operators, if...else statements, data frames, lists and histograms, progressing through to more advanced material. R for the Rest of Us offers a staged path through three core courses, Getting Started With R, Fundamentals of R and Going Deeper with R, and extends into nine topics courses covering Git and GitHub, making beautiful tables, mapping, graphics, data cleaning, inferential statistics, package development, reproducibility and interactive dashboards with {Shiny}. The site is explicitly designed for people who may never have coded before and also offers the structured R in 3 Months programme alongside training and consulting. RStudio Education (now part of Posit) outlines six distinct ways to begin learning R, covering installation, a free introductory webinar on tidy statistics, the book R for Data Science, browser-based primers, and further options suited to different learning styles, along with guidance on R Markdown and good project practices.
Despite the variety, the underlying advice is consistent: start by learning the basics well enough to read and write simple code, practise regularly beginning with straightforward exercises and gradually take on more complex tasks, then build projects that matter to you because projects create context and make concepts stick. There is no suggestion that mastery comes from passively reading documentation alone, as practical engagement is treated as essential throughout. The blog Stats and R exemplifies this philosophy well, with the stated aim of making statistics accessible to everyone by sharing, explaining and illustrating statistical concepts and, where appropriate, applying them in R.
That practical engagement can take many forms. Someone interested in data journalism may focus on visualisation and reproducible reporting, while a researcher may prioritise statistical modelling and publishing workflows, and a health analyst may use R for quality assurance, open health data and clinical reporting. Others may work with {Shiny}, package development, machine learning, Git and GitHub or interactive dashboards. The variety shows that R is not confined to a single use case, even if statistics and data science remain the common thread.
Free Learning Resources for R
It is also worth noting that R learning is supported by a great deal of freely available material. Statistics Globe, founded in 2017 by Joachim Schork and now an education and consulting platform, offers more than 3,000 free tutorials and over 1,000 video tutorials on YouTube, spanning R programming, Python and statistical methodology. STHDA (Statistical Tools for High-Throughput Data Analysis) covers basics, data import and export, reshaping, manipulation and visualisation, with material geared towards practical data analysis at every level. Community sites, webinar repositories and newsletters add further layers of accessibility, and even where paid courses exist, the surrounding free ecosystem is substantial.
Taken together, these sources present R as far more than a niche programming language. It is a mature open-source environment with a strong statistical heritage, a practical orientation towards data work and a well-developed community of learners, teachers, developers and organisations. Its core concepts are approachable enough for beginners, yet its package ecosystem and publishing culture support highly specialised and advanced work. For anyone looking to enter data analysis, statistics, visualisation or related areas, R offers a route that begins with simple code and can extend into large-scale analytical workflows.
How to centre titles, remove gridlines and write reusable functions in {ggplot2}
{ggplot2} is widely used for data visualisation in R because it offers a flexible, layered grammar for constructing charts. A plot can begin with a straightforward mapping of data to axes and then be refined with titles, themes and annotations until it better serves the message being communicated. That flexibility is one of the greatest strengths of {ggplot2}, though it also means that many useful adjustments are small, specific techniques that are easy to overlook when first learning the package.
Three of those techniques fit together particularly well. The first is centring a plot title, a common formatting need because {ggplot2} titles are left-aligned by default. The second is removing grid lines and background elements to produce a cleaner, less cluttered appearance. The third is wrapping familiar {ggplot2} code into a reusable function so that the same visual style can be applied across different datasets without rewriting everything each time. Together, these approaches show how a basic plot can move from a default graphic to something more polished and more efficient to reproduce.
Centring the Plot Title
A clear starting point comes from a short tutorial by Luis Serra at Ubiqum Code Academy, published on RPubs, which focuses on one specific goal: centring the title of a {ggplot2} output. The example uses the well-known Iris dataset, which is included with R and contains 150 observations across five variables. Those variables are Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species, with Species stored as a factor containing three levels (setosa, versicolor and virginica), each represented by 50 samples.
The first step is to load {ggplot2} and inspect the structure of the data using library(ggplot2), followed by data("iris") and str(iris). The structure output confirms that the first four columns are numeric, and the fifth is categorical. That distinction matters because it makes the dataset well suited to a scatter plot with a colour grouping, allowing two continuous variables to be compared while species differences are shown visually.
The initial chart plots petal length against petal width, with points coloured by species:
ggplot() + geom_point(data = iris, aes(x = Petal.Width, y = Petal.Length, color = Species))
This produces a simple scatter plot and serves as the base for later refinements. Even in this minimal form, the grammar is clear: the data are supplied to geom_point(), the x and y aesthetics are mapped to Petal.Width and Petal.Length, and colour is mapped to Species.
Once the scatter plot is in place, a title is added using ggtitle("My dope plot"), appended to the existing plotting code. This creates a title above the graphic, but it remains left-justified by default. That alignment is not necessarily wrong, as left-aligned titles work well in many visual contexts, yet there are situations where a centred title gives a more balanced appearance, particularly for standalone blog images, presentation slides or teaching examples.
The adjustment required is small and direct. {ggplot2} allows title styling through its theme system, and horizontal justification for the title is controlled through plot.title = element_text(hjust = 0.5). Setting hjust to 0.5 centres the title within the plot area, whilst 0 aligns it to the left and 1 to the right. The revised code becomes:
ggplot() +
geom_point(data = iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
ggtitle("My dope plot") +
theme(plot.title = element_text(hjust = 0.5))
That small example also opens the door to a broader understanding of {ggplot2} themes. Titles, text size, panel borders, grid lines and background fills are all managed through the same theming system, which means that once one element is adjusted, others can be modified in a similar way.
Removing Grids and Background Elements
A second set of techniques, demonstrated by Felix Fan in a concise tutorial on his personal site, begins by generating simple data rather than using a built-in dataset. The code creates a sequence from 1 to 20 with a <- seq(1, 20), calculates the fourth root with b <- a^0.25 and combines both into a data frame using df <- as.data.frame(cbind(a, b)). The plot is then created as a reusable object:
myplot <- ggplot(df, aes(x = a, y = b)) + geom_point()
From there, several styling approaches become available. One of the quickest is theme_bw(), which removes the default grey background and replaces it with a cleaner black-and-white theme. This does not strip the graphic down completely, but it does provide a more neutral base and is often a practical shortcut when the standard {ggplot2} appearance feels too heavy.
More selective adjustments can also be made independently. Grid lines can be removed with the following:
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
This suppresses both major and minor grid lines, whilst leaving other parts of the panel unchanged. Borderlines can be removed separately with theme(panel.border = element_blank()), though that does not affect the background colour or the grid. Likewise, the panel background can be cleared with theme(panel.background = element_blank()), which removes the panel fill and borderlines but leaves grid lines in place. Each of these commands targets a different component, so they can be combined depending on the desired result.
If the background and border are removed, axis lines can be added back for clarity using theme(axis.line = element_line(colour = "black")). This is an important finishing step in a stripped-back plot because removing too many panel elements can leave the chart without enough visual structure. The explicit axis line restores a frame of reference without reintroducing the full border box.
Two combined approaches are worth knowing. The first uses a single custom theme call:
myplot + theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")
)
The second starts from theme_bw() and then removes the border and grids whilst adding axis lines:
myplot + theme_bw() + theme(
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black")
)
Both approaches produce a cleaner chart, though they begin from slightly different defaults. The practical lesson is that {ggplot2} styling is modular, so there is often more than one route to a similar visual result.
This matters because chart design is rarely only about appearance. Cleaner formatting can make a chart easier to read by reducing distractions and placing more emphasis on the data itself. A centred title, a restrained background and the selective use of borders all influence how quickly the eye settles on what is important.
Building Reusable Custom Plot Functions
A third area extends these ideas further by showing how to build custom {ggplot2} functions in R, a topic covered in depth by Sharon Machlis in a tutorial published on InfoWorld. The central problem discussed is the mismatch that used to make this awkward: tidyverse functions typically use unquoted column names, whilst base R functions generally expect quoted names. This tension became especially noticeable when users wanted to write their own plotting functions that accepted a data frame and column names as arguments.
The example in that article uses Zillow data containing estimated median home values. After loading {dplyr} and {ggplot2}, a horizontal bar chart is created to show home values by neighbourhood in Boston, with bars ordered from highest to lowest values, outlined in black and filled in blue:
ggplot(data = bos_values, aes(x = reorder(RegionName, Zhvi), y = Zhvi)) +
geom_col(color = "black", fill = "#0072B2") +
xlab("") + ylab("") +
ggtitle("Zillow Home Value Index by Boston Neighborhood") +
theme_classic() +
theme(plot.title = element_text(size = 24)) +
coord_flip()
The next step is to turn that pattern into a function. An initial attempt passes unquoted column names but does not work as intended because of the underlying tension between standard R evaluation and the non-standard evaluation of {ggplot2}. The solution came with the introduction of the tidy evaluation {{ operator, commonly known as "curly-curly", in {rlang} version 0.4.0. As noted in the official tidyverse announcement, this operator abstracts the previous two-step quote-and-unquote process into a single interpolation step. Once library(rlang) is loaded, column references inside the plotting code are wrapped in double curly braces:
library(rlang)
mybarplot <- function(mydf, myxcol, myycol, mytitle) {
ggplot2::ggplot(data = mydf, aes(x = reorder({{ myxcol }}, {{ myycol }}), y = {{ myycol }})) +
geom_col(color = "black", fill = "#0072B2") +
xlab("") + ylab("") +
coord_flip() +
ggtitle(mytitle) +
theme_classic() +
theme(plot.title = element_text(size = 24))
}
With that change in place, the function can be called with unquoted column names, just as they would appear in many tidyverse functions:
mybarplot(bos_values, RegionName, Zhvi, "Zillow Home Value Index by Boston Neighborhood")
That final point is particularly useful in practice. The resulting plot object can be stored and extended further, for example by adding data labels on the bars with geom_text() and the scales::comma() function. A custom plotting function does not lock the user into a fixed result; it provides a well-designed starting point that can still be extended with additional {ggplot} layers.
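A hedged sketch of that extension step, assuming a data frame shaped like the article's bos_values with RegionName and Zhvi columns; the three rows below are invented stand-ins, not Zillow figures:

```r
library(ggplot2)

# Hypothetical sample data standing in for the Zillow extract
bos_values <- data.frame(
  RegionName = c("Back Bay", "Dorchester", "Roxbury"),
  Zhvi       = c(1200000, 450000, 400000)
)

# Store the base plot object, as a custom plotting function would return it
p <- ggplot(bos_values, aes(x = reorder(RegionName, Zhvi), y = Zhvi)) +
  geom_col(color = "black", fill = "#0072B2") +
  coord_flip()

# Layers can still be added to the stored object afterwards:
# comma-formatted data labels placed just inside each bar
p + geom_text(aes(label = scales::comma(Zhvi)), hjust = 1.1, color = "white")
```

Because the stored object is an ordinary {ggplot2} plot, any further layer, scale or theme call composes with it in the usual way.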
Putting the Three Techniques Together in {ggplot2}
Seen as a progression, these examples build on one another in a logical way. The first shows how to centre a title with theme(plot.title = element_text(hjust = 0.5)). The second shows how to simplify a chart by removing grids, borders and background elements whilst restoring axis lines where needed. The third scales those preferences up by packaging them inside a reusable function. What begins as a one-off styling adjustment can therefore become part of a repeatable workflow.
These techniques also reflect a wider culture around R graphics. Resources such as the R Graph Gallery, created by Yan Holtz, have helped make this style of incremental learning more accessible by offering reproducible examples across a wide range of chart types. The gallery presents over 400 R-based graphics, with a strong emphasis on {ggplot2} and the tidyverse, and organises them into nearly 50 chart families and use cases. Its broader message is that effective visualisation is often the result of small, deliberate decisions rather than dramatic reinvention.
For anyone working with {ggplot2}, that is a helpful principle to keep in mind. A centred title may seem minor, just as removing a panel grid may seem cosmetic, yet these changes can improve clarity and consistency across a body of work. When those preferences are wrapped into a function, they also save time and reduce repetition, connecting plot styling directly to good code design.
Some R functions for working with dates, strings and data frames
Working with data in R often comes down to a handful of recurring tasks: combining text, converting dates and times, reshaping tables and creating summaries that are easier to interpret. This article brings together several strands of base R and tidyverse-style practice, with a particular focus on string handling, date parsing, subsetting and simple time series smoothing. Taken together, these functions form part of the everyday toolkit for data cleaning and analysis, especially when imported data arrive in inconsistent formats.
String Building
At the simplest end of this toolkit is paste(), a base R function for concatenating character vectors. Its purpose is straightforward: it converts one or more R objects to character vectors and joins them together, separating terms with the string supplied in sep, which defaults to a space. If the inputs are vectors, concatenation happens term by term, so paste("A", 1:6, sep = "") yields "A1" through "A6", while paste(1:12) behaves much like as.character(1:12). There is also a collapse argument, which takes the resulting vector and combines its elements into a single string separated by the chosen delimiter, making paste() useful both for constructing values row by row and for creating one final display string from many parts.
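A few short calls illustrate the sep and collapse behaviour described above:

```r
paste("A", 1:6, sep = "")                  # "A1" "A2" "A3" "A4" "A5" "A6"
paste(1:3, c("st", "nd", "rd"), sep = "")  # "1st" "2nd" "3rd"

# collapse combines the whole result into a single string
paste(c("red", "green", "blue"), collapse = ", ")  # "red, green, blue"
```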
That basic string-building role becomes more important when dates and times are involved because imported date-time data often arrive as text split across multiple columns. A common example is having one column for a date and another for a time, then joining them with paste(dates, times) before parsing the result. In that sense, the paste() function often acts as a bridge between messy raw input and structured date-time objects. It is simple, but it appears repeatedly in data preparation pipelines.
Date-Time Conversion
For date-time conversion, base R provides strptime(), strftime() and format() methods for POSIXlt and POSIXct objects. These functions convert between character representations and R date-time classes, and they are central to understanding how R reads and prints times. strptime() takes character input and converts it to an object of class "POSIXlt", while strftime() and format() move in the other direction, turning date-time objects into character strings. The as.character() method for "POSIXt" classes fits into the same family, and the essential idea is that the date-time value and its textual representation are separate things, with the format string defining how R should interpret or display that representation.
Format strings rely on conversion specifications introduced with %, and many of these are standard across systems. %Y means a four-digit year with century, %y means a two-digit year, %m is a month, %d is the day of a month and %H:%M:%S captures hours, minutes and seconds in 24-hour time. %F is equivalent to %Y-%m-%d, which is the ISO 8601 date format. %b and %B represent abbreviated and complete month names, while %a and %A do the same for weekdays. Locale matters here because month names, weekday names, AM/PM indicators and some separators depend on the LC_TIME locale, meaning a date string like "1jan1960" may parse correctly in one locale and return NA in another unless the locale is set appropriately.
R's defaults generally follow ISO 8601 rules, so dates print as "2001-02-28" and times as "14:01:02", though R inserts a space between date and time by default. Several details matter in practice. strptime() processes input strings only as far as needed for the specified format, so trailing characters are ignored. Unspecified hours, minutes and seconds default to zero, and if no year, month or day is supplied then the current values are assumed, though if a month is given, the day must also be valid for that month. Invalid calendar dates such as "2010-02-30 08:00" produce results whose components are all NA.
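A brief sketch shows the round trip between text and date-time objects; the example values are invented:

```r
# Character -> POSIXlt, with the format string defining the interpretation
x <- strptime("28/02/2001 14:01:02", format = "%d/%m/%Y %H:%M:%S")
class(x)                # "POSIXlt" "POSIXt"

# Date-time -> character with format(); %F is shorthand for %Y-%m-%d
format(x, "%F %H:%M")   # "2001-02-28 14:01"

# Trailing characters beyond the format are ignored on input
strptime("2001-02-28 and some extra text", format = "%Y-%m-%d")
```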
Time Zones and Daylight Saving
Time zones add another layer of complexity. The tz argument specifies the time zone to use for conversion, with "" meaning the current time zone and "GMT" meaning UTC. Invalid values are often treated as UTC, though behaviour can be system-specific. The usetz argument controls whether a time zone abbreviation is appended to output, which is generally more reliable than %Z. %z represents a signed UTC offset such as -0800, and R supports it for input on most platforms. Even so, time zones can be awkward because daylight saving transitions create times that do not occur at all, or occur twice, and strptime() itself does not validate those cases, though conversion through as.POSIXct() may do so.
Two-Digit Years
Two-digit years are a notable source of confusion for analysts working with historical data. As described in the R date formats guide on R-bloggers, %y maps values 00 to 68 to the years 2000 to 2068 and 69 to 99 to 1969 to 1999, following the POSIX standard. A value such as "08/17/20" may therefore be interpreted as 2020 when the intended year is 1920. One practical workaround is to identify any parsed dates lying in the future and then rebuild them with a 19 prefix using format() and ifelse(). This approach is explicit and practical, though it depends on the assumptions of the data at hand.
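As a sketch of that workaround, assuming any genuine date in the data must lie in the past:

```r
dates <- as.Date(c("08/17/31", "05/27/84"), format = "%m/%d/%y")
# "31" parses as 2031 under the POSIX rule, even if 1931 was intended

# Rebuild any future-dated values with a 19 prefix
fixed <- as.Date(ifelse(dates > Sys.Date(),
                        format(dates, "19%y-%m-%d"),
                        format(dates)))
```

The ifelse() branches both return character strings, which as.Date() then parses in its default %Y-%m-%d form.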
Plain Dates
For plain dates, rather than full date-times, as.Date() is usually the entry point. Character dates can be imported by specifying the format in which they arrive, such as %m/%d/%y for "05/27/84" or %B %d %Y for "May 27 1984". If no format is supplied, as.Date() first tries %Y-%m-%d and then %Y/%m/%d. Numeric dates are common when data come from Excel, and here the crucial issue is the origin date: Windows Excel uses an origin of "1899-12-30" for dates after 1900 because Excel incorrectly treated 1900 as a leap year (an error originally copied from Lotus 1-2-3 for compatibility), while Mac Excel traditionally uses "1904-01-01". Once the correct origin is supplied, as.Date() converts the serial numbers into standard R dates.
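A short sketch, with made-up serial numbers, shows how the origin changes the result:

```r
serial <- c(30829, 38540)   # hypothetical Excel serial dates

as.Date(serial, origin = "1899-12-30")   # Windows Excel workbooks
as.Date(serial, origin = "1904-01-01")   # older Mac Excel workbooks
```

The same serial numbers land on different calendar dates, which is why identifying the source platform matters before conversion.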
After import, format() can display dates in other ways without changing their underlying class. For example, format(betterDates, "%a %b %d") might yield values like "Sun May 27" and "Thu Jul 07". This distinction between storage and display is important because once R recognises values as dates, they can participate in date-aware operations such as mean(), min() and max(), and a vector of dates can have a meaningful mean date with the minimum and maximum identifying the earliest and latest observations.
Extracting Columns and Manipulating Lists
These ideas about correct types and structure carry over into table manipulation. A data frame column often needs to be extracted as a vector before further processing, and there are several standard ways to do this, as covered in this guide from Statistics Globe. In base R, the $ operator gives a direct route, as in data$x1. Subsetting with data[, "x1"] yields the same result for a single column, and in the tidyverse, dplyr::pull(data, x1) serves the same purpose. All three approaches convert a column of a data frame into a standalone vector, and each is useful depending on the surrounding code style.
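All three routes can be compared side by side on a small, made-up data frame:

```r
data <- data.frame(x1 = 1:5, x2 = letters[1:5])

data$x1                  # dollar-sign extraction
data[, "x1"]             # bracket subsetting of one column drops to a vector
dplyr::pull(data, x1)    # the tidyverse equivalent

# All three return the same vector: 1 2 3 4 5
```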
List manipulation has similar patterns, detailed in this Statistics Globe tutorial on removing list elements. Removing elements from a list can be done by position with negative indexing, as in my_list[-2], or by assigning NULL to the relevant component, for example my_list_2[2] <- NULL. If names are more meaningful than positions, then subsetting with names(my_list) != "b" or names(my_list) %in% "b" == FALSE removes the named element instead. The same logic extends to multiple elements, whether by positions such as -c(2, 3) or names such as %in% c("b", "c") == FALSE. These are simple techniques, but they matter because lists are a common structure in R, especially when working with nested results.
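The same patterns, sketched on a small example list:

```r
my_list <- list(a = 1, b = "two", c = 3:5)

my_list[-2]                       # drop the second element by position
my_list[names(my_list) != "b"]    # drop by name instead

my_list_2 <- my_list
my_list_2[2] <- NULL              # removal by NULL assignment

# Several elements at once, by position or by name
my_list[-c(2, 3)]
my_list[names(my_list) %in% c("b", "c") == FALSE]
```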
Subsetting, Renaming and Reordering Data Frames
Data frames themselves can be subset in several ways, and the choice often depends on readability, as the five-method overview on R-bloggers demonstrates clearly. The bracket form example[x, y] remains the foundation, whether selecting rows and columns directly or omitting unwanted ones with negative indices. More expressive alternatives include which() together with %in%, the base subset() function and tidyverse verbs like filter() and select(). The point is not that one method is universally best, but that R offers both low-level precision and higher-level readability, depending on the task.
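The alternatives can be lined up against one another; each of these selects the six-cylinder cars from the built-in mtcars data:

```r
mtcars[mtcars$cyl == 6, ]              # bracket form
mtcars[which(mtcars$cyl %in% 6), ]     # which() with %in%
subset(mtcars, cyl == 6)               # base subset()
dplyr::filter(mtcars, cyl == 6)        # tidyverse verb
```

The results are equivalent; the choice is mostly one of readability in the surrounding code.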
Column names and column order also need regular attention. Renaming can be done with dplyr::rename(), as explained in this lesson from Datanovia, for instance changing Sepal.Length to sepal_length and Sepal.Width to sepal_width. In base R, the same effect comes from modifying names() or colnames(), either by matching specific names or by position. Reordering columns is just as direct, with a data frame rearranged by column indices such as my_data[, c(5, 4, 1, 2, 3)] or by an explicit character vector of names, as the STHDA guide on reordering columns illustrates. Both approaches are useful when preparing data for presentation or for functions that expect variables in a certain order.
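A compact sketch of both operations, using the iris data mentioned above:

```r
my_data <- iris

# dplyr::rename() takes new_name = old_name pairs
my_data <- dplyr::rename(my_data,
                         sepal_length = Sepal.Length,
                         sepal_width  = Sepal.Width)

# Base R equivalent, matching by position
colnames(my_data)[3] <- "petal_length"

# Reordering by index or by an explicit vector of names
head(my_data[, c(5, 4, 1, 2, 3)])
head(my_data[, c("Species", "sepal_length", "sepal_width")])
```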
Sorting and Cumulative Calculations
Sorting and cumulative calculations fit naturally into this same preparatory workflow. To sort a data frame in base R, the DataCamp sorting reference demonstrates that order() is the key function: mtcars[order(mtcars$mpg), ] sorts ascending by mpg, while mtcars[order(mtcars$mpg, -mtcars$cyl), ] sorts by mpg ascending and cyl descending (the mtcars$ prefix is needed unless the data frame has been attached). For cumulative totals, cumsum() provides a running sum, as in calculating cumulative air miles from the airmiles dataset, an example covered in the Data Cornering guide to cumulative calculations. Within grouped data, dplyr::group_by() and mutate() can apply cumsum() separately to each group, and a related idea is cumulative count, which can be built by summing a column of ones within groups, or with data.table::rowid() to create a group index.
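Both ideas in miniature, using built-in datasets:

```r
# order() returns the row permutation needed for sorting
mtcars[order(mtcars$mpg), ]                # ascending by mpg
mtcars[order(mtcars$mpg, -mtcars$cyl), ]   # mpg ascending, cyl descending

# cumsum() gives a running total of the airmiles series
cumsum(airmiles)

# Grouped running totals with dplyr
library(dplyr)
mtcars |>
  group_by(cyl) |>
  mutate(cum_mpg = cumsum(mpg)) |>
  ungroup()
```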
Time Series Smoothing
Time series smoothing introduces one further pattern: replacing noisy raw values with moving averages. As the Storybench rolling averages guide explains, the zoo::rollmean() function calculates rolling means over a window of width k, and examples using 3, 5, 7, 15 and 21-day windows on pandemic deaths and confirmed cases by state demonstrate the approach clearly. After arranging and grouping by state, mutate() adds variables such as death_03da, death_05da and death_07da. Because rollmean() is centred by default, the resulting values are symmetrical around the observation of interest and produce NA values at the start and end where there are not enough surrounding observations, which is why odd values of k are usually preferred as they make the smoothing window balanced.
The arithmetic is uncomplicated, but the interpretation is useful. A 3-day moving average for a given date is the mean of that day, the previous day and the following day, while a 7-day moving average uses three observations on either side. As the window widens, the line becomes smoother, but more short-term variation is lost. This trade-off is visible when comparing 3-day and 21-day averages: a shorter average tracks recent changes more closely, while a longer one suppresses noise and makes broader trends stand out. If a trailing rather than centred calculation is needed, rollmeanr() shifts the window to the right-hand end.
The same grouped workflow can be used to derive new daily values before smoothing. In the pandemic example, daily new confirmed cases are calculated from cumulative confirmed counts using dplyr::lag(), with each day's new cases equal to the current cumulative total minus the previous day's total. Grouping by state and date, summing confirmed counts and then subtracting the lagged value produces new_confirmed_cases, which can then be smoothed with rollmean() in the same way as deaths. Once these measures are available, reshaping with pivot_longer() allows raw values and rolling averages to be plotted together in ggplot2, making it easier to compare volatility against trend.
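A condensed sketch of that workflow, on invented daily data rather than the real pandemic counts (the column names echo the Storybench example):

```r
library(dplyr)
library(zoo)

set.seed(42)
cases <- data.frame(
  date      = seq(as.Date("2020-03-01"), by = "day", length.out = 30),
  confirmed = cumsum(sample(0:50, 30, replace = TRUE))   # cumulative totals
)

cases <- cases |>
  mutate(
    new_confirmed_cases = confirmed - lag(confirmed, default = 0),
    new_cases_07da = zoo::rollmean(new_confirmed_cases, k = 7, fill = NA)
  )
```

The fill = NA argument pads the ends of the series where the centred window cannot be filled, matching the edge behaviour described above.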
How These R Data Manipulation Techniques Fit Together
What links all of these techniques is not just that they are common in R, but that they solve the mundane, essential problems of analysis. Data arrive as text when they should be dates, as cumulative counts when daily changes are needed, as broad tables when only a few columns matter, or as inconsistent names that get in the way of clear code. Functions such as paste(), strptime(), as.Date(), order(), cumsum(), rollmean(), rename(), select() and simple bracket subsetting are therefore less like isolated tricks and more like pieces of a coherent working practice. Knowing how they fit together makes it easier to move from raw input to reliable analysis, with fewer surprises along the way.
How to persist R packages across remote Windows server sessions
Recently, I was using R to automate some code changes that were needed when porting code from a vendor's systems to a client's. While doing so, I noticed that packages had to be reinstalled every time I logged into the remote system, because they were going into a temporary area by default. The solution was to define another location where the packages could persist.
That meant creating a .Renviron file, something that Windows Explorer would not let me do. PowerShell was the solution: there, I could use the following command to do what I needed:
New-Item -ItemType File "$env:USERPROFILE\Documents\.Renviron" -Force
That gave me an empty .Renviron file, to which I could add the following text for where the packages should be kept (the path may differ on your system):
R_LIBS_USER=C:/R/packages
The paths shown here are examples rather than the real ones, by design and for reasons of client confidentiality. After restarting RStudio to get a fresh R session, I could install packages using commands like this one:
install.packages("tidyverse")
Version constraints meant compiling from source in my case, which made for a long wait. Once that was done, though, there was no need to repeat the operation.
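Once a fresh session is running, it is worth confirming that the new location is actually in use:

```r
Sys.getenv("R_LIBS_USER")   # should echo the path set in .Renviron
.libPaths()                 # the custom library should appear first in this list
```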
One final remark is that file creation and population could be done in the same command in PowerShell:
'R_LIBS_USER=C:/R/packages' | Out-File -Encoding ascii "$env:USERPROFILE\Documents\.Renviron"
This places the text into a new file, or completely overwrites an existing one, so it is a command that you only want to run once; any settings added to .Renviron later should be appended rather than piped through Out-File again.
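Should more settings be needed later, appending with Add-Content avoids the overwrite; the extra variable shown here is purely illustrative:

```powershell
# Append a further line to .Renviron without touching the existing content
Add-Content -Encoding ascii "$env:USERPROFILE\Documents\.Renviron" 'R_MAX_NUM_DLLS=200'
```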
Automating Positron and RStudio updates on Linux Mint 22
Elsewhere, I have written about avoiding manual updates with VSCode and VSCodium. Here, I come to the IDEs produced by Posit (the company formerly known as RStudio) for data science and analytics use. The first, Positron, is a more recent innovation that works with both R and Python code natively, while RStudio has been around for much longer and focusses on native R code alone, though there are R packages allowing an interface of sorts with Python. Neither is released via a PPA, necessitating either manual downloading or the scripted approach taken here for a Linux system. Each software tool will be discussed in turn.
Positron
Now, we work through a script that automates the upgrade process for Positron. This starts with a shebang line calling the bash executable, before moving to a line that adds safety to how the script works using a set statement. Here, the -e switch triggers an exit whenever there is an error, halting the script before it carries on to perform any undesirable actions. That is followed by the -u switch, which causes errors when unset variables are referenced; normally these would expand to an empty string, which is not desirable in all cases. Lastly, the -o pipefail switch causes a pipeline (cmd1 | cmd2 | cmd3) to fail if any command in the pipeline produces an error, which can help debugging because the error is associated with the command that fails to complete.
#!/bin/bash
set -euo pipefail
The next step then is to determine the architecture of the system on which the script is running so that the correct download is selected.
ARCH=$(uname -m)
case "$ARCH" in
  x86_64)        POSIT_ARCH="x64" ;;
  aarch64|arm64) POSIT_ARCH="arm64" ;;
  *) echo "Unsupported arch: $ARCH"; exit 1 ;;
esac
Once that completes, we define the address of the web page to be interrogated and the path to the temporary file that is to be downloaded.
RELEASES_URL="https://github.com/posit-dev/positron/releases"
TMPFILE="/tmp/positron-latest.deb"
Now, we scrape the page to find the address of the latest DEB file that has been released.
echo "Finding latest Positron .deb for $POSIT_ARCH..."
DEB_URL=$(curl -fsSL "$RELEASES_URL" \
| grep -Eo "https://cdn\.posit\.co/[A-Za-z0-9/_\.-]+Positron-[0-9\.~-]+-${POSIT_ARCH}\.deb" \
| head -n 1)
If that were to fail, we get an error message produced before the script is aborted.
if [ -z "${DEB_URL:-}" ]; then
  echo "Could not find a .deb link for ${POSIT_ARCH} on the releases page"
  exit 1
fi
Should all go well thus far, we download the latest DEB file using curl.
echo "Downloading: $DEB_URL"
curl -fL "$DEB_URL" -o "$TMPFILE"
When the download completes, we try installing the package using apt, much like we do with a repo, apart from specifying an actual file path on our system.
echo "Installing Positron..."
sudo apt install -y "$TMPFILE"
Following that, we delete the installation file and issue a message informing the user of the task's successful completion.
echo "Cleaning up..."
rm -f "$TMPFILE"
echo "Done."
When I do this, I tend to find that the Python REPL console does not open straight away, causing me to shut down Positron and leave things for a while before starting it again. There may be temporary files that need to be expunged, and that takes its own time; someone else may have a better explanation, which I would happily adopt if it makes more sense than my suggestion. Otherwise, all works well.
RStudio
A lot of the same processing happens during the script updating RStudio, so we will just cover the differences. The set -x statement ensures that every command is printed to the console for the debugging that was needed while this was being developed. Otherwise, much code, including architecture detection, is shared between the two apps.
#!/bin/bash
set -euo pipefail
set -x
# --- Detect architecture ---
ARCH=$(uname -m)
case "$ARCH" in
  x86_64)        RSTUDIO_ARCH="amd64" ;;
  aarch64|arm64) RSTUDIO_ARCH="arm64" ;;
  *) echo "Unsupported architecture: $ARCH"; exit 1 ;;
esac
Figuring out the distro version and the web page to scrape was where additional effort was needed, and that is reflected in some of the code that follows. Otherwise, many of the ideas applied with Positron also have a place here.
# --- Detect Ubuntu base ---
DISTRO=$(grep -oP '(?<=UBUNTU_CODENAME=).*' /etc/os-release || true)
[ -z "$DISTRO" ] && DISTRO="noble"
# --- Define paths ---
TMPFILE="/tmp/rstudio-latest.deb"
LOGFILE="/var/log/rstudio_update.log"
echo "Detected Ubuntu base: ${DISTRO}"
echo "Fetching latest version number from Posit..."
# --- Get version from Posit's official RStudio Desktop page ---
VERSION=$(curl -s https://posit.co/download/rstudio-desktop/ \
| grep -Eo 'rstudio-[0-9]+\.[0-9]+\.[0-9]+-[0-9]+' \
| head -n 1 \
| sed -E 's/rstudio-([0-9]+\.[0-9]+\.[0-9]+-[0-9]+)/\1/')
if [ -z "$VERSION" ]; then
  echo "Error: Could not extract the latest RStudio version number from Posit's site."
  exit 1
fi
echo "Latest RStudio version detected: ${VERSION}"
# --- Construct download URL (Jammy build for Noble until Noble builds exist) ---
BASE_DISTRO="jammy"
BASE_URL="https://download1.rstudio.org/electron/${BASE_DISTRO}/${RSTUDIO_ARCH}"
FULL_URL="${BASE_URL}/rstudio-${VERSION}-${RSTUDIO_ARCH}.deb"
echo "Downloading from:"
echo " ${FULL_URL}"
# --- Validate URL before downloading ---
if ! curl --head --silent --fail "$FULL_URL" >/dev/null; then
  echo "Error: The expected RStudio package was not found at ${FULL_URL}"
  exit 1
fi
# --- Download and install ---
curl -L "$FULL_URL" -o "$TMPFILE"
echo "Installing RStudio..."
sudo apt install -y "$TMPFILE" | tee -a "$LOGFILE"
# --- Clean up ---
rm -f "$TMPFILE"
echo "RStudio update to version ${VERSION} completed successfully." | tee -a "$LOGFILE"
When all was done, RStudio worked without a hitch, leaving me to move on to other things. The next time that I am prompted to upgrade the environment, this likely is the way that I will go.
Avoiding Python missing package errors with automatic installation checks
Though some may not like having something preceding package import statements in Python scripts, I prefer the added robustness of an extra piece of code that checks for package presence and installs anything missing, in place of getting an error. In what follows, I define the list of packages that need to be present for everything to work:
required_packages = ["pandas", "tqdm", "progressbar2", "sqlalchemy", "pymysql"]
Then, I import the built-in modules before looping through the list defined above (with special handling for a case where the package and module names differ):
import subprocess
import sys

for package in required_packages:
    try:
        __import__(package if package != "progressbar2" else "progressbar")
        print(f"{package} is already installed.")
    except ImportError:
        print(f"{package} not found. Installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
The above code tries importing the package and catches the error to do the required installation. While a stable environment may be a better way around all of this, I find that this way of working adds valuable robustness to a script and automates what you would need to do anyway. Though the use of requirements files and even the Poetry tool for dependency management may be next steps, this approach suffices for my simpler needs, at least when it comes to personal projects.
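An alternative sketch checks for a package without importing it, using importlib.util.find_spec() from the standard library; the ensure() function name and the import_names mapping are my own, with the mapping playing the same role as the progressbar2 special case above:

```python
import importlib.util
import subprocess
import sys

def ensure(packages):
    """Install any listed package whose module cannot be found."""
    # PyPI name -> import name, for packages where the two differ
    import_names = {"progressbar2": "progressbar"}
    for package in packages:
        name = import_names.get(package, package)
        if importlib.util.find_spec(name) is None:
            print(f"{package} not found. Installing...")
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        else:
            print(f"{package} is already installed.")

# Example: ensure(["pandas", "tqdm", "progressbar2", "sqlalchemy", "pymysql"])
```

The advantage of find_spec() over a bare __import__() is that nothing gets executed at check time, which matters for packages with slow or side-effecting imports.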
PandasGUI: A simple solution for Pandas DataFrame inspection from within VSCode
One of the things that I miss about Spyder when running Python scripts is the ability to look at DataFrames easily. Recently, I was checking a VAT return, only for tmux to truncate how much of the DataFrame I could see in the output from the print function. While closing tmux might have been an idea, I sought a DataFrame windowing alternative instead. That led me to the pandasgui package, which did exactly what I needed, apart from pausing the script execution to show me the data. Installation was done using pip:
pip install pandasgui
Once that completed, I could use the following code construct to accomplish what I wanted:
import pandasgui
pandasgui.show(df)
In my case, there were several lines between the two lines above. Nevertheless, the first line made the pandasgui package available to the script, while the second one displayed the DataFrame in a GUI with scrollbars and cells, among other things. That was close enough to what I wanted to leave me able to complete the task that was needed of me.