Technology Tales

Notes drawn from experiences in consumer and enterprise technology

TOPIC: TIDYVERSE

Some R packages to explore as you find your feet with the language

24th March 2026

Here are some commonly used R packages and tools, along with others that I encountered while getting started with the language, which is itself becoming pervasive in my line of business. The collection grew organically as my explorations proceeded, and it reflects what I was trying out during my acclimatisation.

General

Here are two general packages to get things started, with one of them being unavoidable in the R world. The other is more advanced, possibly offering more to package developers.

{tidyverse}

You cannot use R without knowing about this collection of packages. In many ways, they form a mini-language of their own, drawing some criticism from those who reckon that base R functionality covers a sufficient gamut anyway. Nevertheless, there is so much here that will get you going with data wrangling and visualisation that it is worth knowing what is possible. Indeed, the complaints may stem from your scarcely needing anything else for these purposes.
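To give a flavour of the mini-language in question, here is a minimal sketch of the piped wrangling style that the collection encourages, using the built-in mtcars data:

```r
# Attach the core tidyverse packages (dplyr, ggplot2, tidyr and others)
library(tidyverse)

# Filter, group and summarise in one readable pipeline
result <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%   # keep 4- and 6-cylinder cars
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

result
```

The same steps in base R would involve subsetting and aggregate(), which is workable but reads less fluently.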

{plumber}

This R package enables developers to convert existing R functions into web API endpoints by adding roxygen2-like comment annotations to their code. Once annotated, functions can handle HTTP GET and POST requests, accept query string or JSON parameters and return outputs such as plain values or rendered plots. The package is available on CRAN as a stable release, with a development version hosted on GitHub. For deployment, it integrates with DigitalOcean through a companion package called {plumberDeploy}, and also supports Posit Connect, PM2 and Docker as hosting options. Related projects in the same space include OpenCPU, which is designed for hosting R APIs in scientific research contexts, and the now-discontinued jug package, which took a more programmatic approach to API construction.
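To illustrate the annotation style, here is a hypothetical plumber.R file sketching two endpoints; the routes and parameter names are invented for the example:

```r
# plumber.R: ordinary R functions become API endpoints via #* annotations
library(plumber)

#* Echo back a message supplied as a query string parameter
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(message = paste0("The message is: '", msg, "'"))
}

#* Return the sum of two numbers passed in the query string
#* @param a First number
#* @param b Second number
#* @get /sum
function(a, b) {
  as.numeric(a) + as.numeric(b)
}
```

Running pr <- plumb("plumber.R"); pr$run(port = 8000) then serves the API locally.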

Data Preparation

You simply cannot avoid working with data during any analysis or reporting work. While there is a learning curve if you are used to other languages, there is little doubt that R is well-endowed when it comes to performing these tasks. Here are some packages that extend base R capabilities and might even add some extra user-friendliness along the way.

{forcats}

The {forcats} package in R provides functions to manage categorical variables by reordering factor levels, collapsing infrequent values and adjusting their sequence based on frequency or other variables. It includes tools such as reordering by another variable, grouping rare categories into 'other' and modifying level order manually, which are useful for data analysis and visualisation workflows. Designed as part of the tidyverse, it integrates with other packages to streamline tasks like counting and plotting categorical data, enhancing clarity and efficiency in handling factors within R.
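A short sketch of the operations just described, assuming a small example factor:

```r
library(forcats)

# A factor whose levels appear with different frequencies
x <- factor(c("a", "b", "b", "c", "c", "c", "d"))

# Reorder levels so the most frequent comes first
y <- fct_infreq(x)
levels(y)

# Keep the two most common levels and lump the rest into "Other"
z <- fct_lump_n(x, n = 2)
levels(z)
```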

{tidyr}

Around this time last year, I remember completing a LinkedIn course on a set of good practices known as tidy data, where each variable occupies a column, each observation a row and each value a single cell. This package is designed to help users restructure data so it follows those rules. It provides tools for reshaping data between long and wide formats, handling nested lists, splitting or combining columns, managing missing values and layering or flattening grouped data.
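The long/wide reshaping mentioned above comes down to pivot_longer() and pivot_wider(); here is a small sketch with invented sales figures:

```r
library(tidyr)

# One row per product, one column per year: a wide table
wide <- tibble::tibble(
  product = c("A", "B"),
  `2023`  = c(10, 20),
  `2024`  = c(15, 25)
)

# Wide to long: one row per product-year observation
long <- pivot_longer(wide, cols = c(`2023`, `2024`),
                     names_to = "year", values_to = "sales")

# And back to wide again
wide_again <- pivot_wider(long, names_from = year, values_from = sales)
```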

Installation options include the {tidyverse} collection, standalone installation, or the development version from GitHub. The package succeeds earlier reshaping tools like {reshape2} and {reshape}, offering a focused approach to tidying data rather than general reshaping or aggregation.

{haven}

Given my long track record of working with SAS, {haven} arouses my interest with its ability to read and write data files from statistical software such as SAS, SPSS and Stata, leveraging the ReadStat library. Handily, it supports a range of file formats, including SAS transport and data files, SPSS system and older portable files and Stata data files up to version 15, converting these into tibbles with enhanced printing capabilities. Value labels are preserved as a labelled class, allowing conversion to factors, while dates and times are transformed into standard R classes.
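{haven} ships a small SAS copy of the iris data, which makes for a self-contained demonstration:

```r
library(haven)

# Read a bundled SAS dataset into a tibble
path <- system.file("examples", "iris.sas7bdat", package = "haven")
iris_sas <- read_sas(path)
head(iris_sas)

# Labelled columns, where present, convert to factors with as_factor();
# write_sav() or write_dta() go back out to SPSS or Stata formats
```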

{RMariaDB}

While there are other approaches to working with databases using R, {RMariaDB} provides a database interface and driver for MariaDB, designed to fully comply with the DBI specification and serve as a replacement for the older {RMySQL} package. It supports connecting to databases using configuration files, executing queries, reading and writing data tables and managing results in chunks. Installation options include binary packages from CRAN or development versions from GitHub, with additional dependencies such as MariaDB Connector/C or libmysqlclient required for Linux and macOS systems. Configuration is typically handled through a MariaDB-specific file, and the package includes acknowledgments for contributions from various developers and organisations.
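As a sketch of the DBI-compliant workflow, here is an example that assumes a running MariaDB server; the host, database name and credential variables are placeholders:

```r
library(DBI)
library(RMariaDB)

# Connection details are placeholders; in practice they often live in a
# MariaDB configuration file or environment variables
con <- dbConnect(
  RMariaDB::MariaDB(),
  host     = "localhost",
  user     = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASS"),
  dbname   = "analysis"
)

# Standard DBI verbs then apply
dbWriteTable(con, "mtcars", mtcars, overwrite = TRUE)
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")

dbDisconnect(con)
```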

COVID-19 Data Hub

For many people, the pandemic may be a fading memory, yet it offered its chances for learning R, not least because there was a use case with more than a hint of personal interest about it. Here is a library making it easier to get hold of the data, with some added pre-processing too. Memories of how I needed to wrangle what was published by various sources make me appreciate just how vital it is to have harmonised data for analysis work.

Table Production

While many prefer graphical presentation of results to tabular display, R has its options here too. In recent times, those options have improved, particularly because of the pharmaverse initiative. Here is a selection of what I found during my explorations.

{officer}

Part of the {officeverse} along with {officedown}, {flextable}, {rvg} and {mschart}, the {officer} R package enables users to create and modify Word and PowerPoint documents directly from R, allowing the insertion of images, tables and formatted content, as well as the import of document content into data frames. It supports the generation of RTF files and integrates with other packages for advanced features such as vector graphics and native office charts. Installation options include CRAN and GitHub, with community resources available for assistance and contributions. The package facilitates the manipulation of document elements like paragraphs, tables and section breaks and provides tools for exporting and importing content between R and office formats, alongside functions for managing slide layouts and embedded objects in presentations.
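A minimal sketch of building a Word file this way; the document title and output filename are invented for the example:

```r
library(officer)

# Assemble a simple Word document step by step
doc <- read_docx()
doc <- body_add_par(doc, "Quarterly Summary", style = "heading 1")
doc <- body_add_par(doc, "The table below uses the built-in mtcars data.")
doc <- body_add_table(doc, head(mtcars))

# Writing the file produces summary.docx in the working directory
print(doc, target = "summary.docx")
```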

{pharmaRTF}

If you work in clinical research as I do, the need to produce data tabulations is a non-negotiable requirement. That is how this package came to be developed, and the pharmaverse of which it is part has numerous other options, should you need to look at using one of those. The flavour of RTF produced here is the Microsoft Word variety, which did not render as well in LibreOffice Writer when I last viewed the results with that open-source alternative. Otherwise, the output looks good to many eyes.

{formattable}

Here is a package that enhances data presentation by applying customisable formatting to vectors and data frames, supporting formats such as percentages, currency and accounting. Available on GitHub and CRAN, it integrates with dynamic document tools like {knitr} and {rmarkdown} to produce visually distinct tables, with features including gradient colour scales, conditional styling and icon-based representations. It automatically converts to {htmlwidgets} in interactive environments and is licensed under MIT, enabling flexible use in both static and interactive data displays.
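A brief sketch of the formatters and conditional styling in action, with invented example data:

```r
library(formattable)

# Formatter objects stay numeric but print in the chosen format
p <- percent(c(0.1, 0.255), digits = 1)
m <- currency(c(1234.5, 99), symbol = "$")

# Conditional styling: shade the score column with a colour gradient
df <- data.frame(item = c("A", "B", "C"), score = c(0.2, 0.5, 0.9))
formattable(df, list(score = color_tile("white", "lightblue")))
```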

{reactable}

The {reactable} package for R provides interactive data tables built on the React Table library, offering features such as sorting, filtering, pagination, grouping with aggregation, virtual scrolling for large datasets and support for custom rendering through R or JavaScript. It integrates seamlessly into R Markdown documents and Shiny applications, enabling the use of HTML widgets and conditional styling. Installation options include CRAN and GitHub, with examples demonstrating its application across various datasets and scenarios. The package supports major web browsers and is licensed under MIT, designed for developers seeking dynamic data presentation tools within the R ecosystem.
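A small sketch of the grouping and aggregation features, using the built-in mtcars data:

```r
library(reactable)

# Sorting, filtering and grouped aggregation in one widget
tbl <- reactable(
  mtcars,
  filterable = TRUE,   # per-column filter boxes
  searchable = TRUE,   # global search box
  groupBy = "cyl",     # group rows by cylinder count
  columns = list(
    mpg = colDef(aggregate = "mean", format = colFormat(digits = 1))
  ),
  defaultPageSize = 5
)
tbl
```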

{DT}

Particularly useful in dynamic web applications like Shiny, the {DT} package in R provides a means of rendering interactive HTML tables by building on the DataTables JavaScript library. It supports features including sorting, searching, pagination and advanced filtering, with numeric, date and time columns using range-based sliders whilst factor and character columns rely on search boxes or dropdowns. Filtering operates on the client side by default, though server-side processing is also available. JavaScript callbacks can be injected after initialisation to manipulate table behaviour, such as enabling automatic page navigation or adding child rows to display additional detail. HTML content is escaped by default as a safeguard against cross-site scripting attacks, with the option to adjust this on a per-column basis. Whilst the package integrates with Shiny applications, attention is needed around scrolling and slider positioning to prevent layout problems. Overall, the package is well suited to exploratory data analysis and the building of interactive dashboards.
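The filtering behaviour described above can be sketched in a few lines:

```r
library(DT)

# An interactive table with per-column filters along the top
widget <- datatable(
  iris,
  filter = "top",                   # sliders for numeric columns, dropdowns for factors
  options = list(pageLength = 10),  # rows shown per page
  escape = TRUE                     # escape HTML: the safe default
)
widget
```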

{gt}

The {gt} package in R enables users to create well-structured tables with a variety of formatting options, starting from data frames or tibbles and incorporating elements such as headers, footers and customised column labels. It supports output in HTML, LaTeX and RTF formats and includes example datasets for experimentation. The package prioritises simplicity for common tasks while offering advanced functions for detailed customisation, with installation available via CRAN or GitHub. Users can access resources like documentation, community forums and example projects to explore its capabilities, and it is supported by a range of related packages that extend its functionality.
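A minimal sketch of the layered construction, with an invented title and subtitle:

```r
library(gt)

# Start from a data frame, then layer on a header and number formatting
tab <- head(mtcars) |>
  gt(rownames_to_stub = TRUE) |>
  tab_header(
    title = "Motor Trend Cars",
    subtitle = "First six rows of the mtcars data"
  ) |>
  fmt_number(columns = c(mpg, wt), decimals = 1)
tab
```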

{gtsummary}

Enabling users to produce publication-ready outputs with minimal code, the {gtsummary} package offers a streamlined approach to generating analytical and summary tables in R. It automates the summarisation of data frames, regression models and other datasets, identifying variable types and calculating relevant statistics, including measures of data incompleteness. Customisation options allow for formatting, merging and styling tables to suit specific needs, while integration with packages such as {broom} and {gt} facilitates seamless incorporation into R Markdown workflows. The package supports the creation of side-by-side regression tables and provides tools for exporting results as images, HTML, Word, or LaTeX files, enhancing flexibility for reporting and sharing findings.
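A short sketch using the trial dataset bundled with the package:

```r
library(dplyr)
library(gtsummary)

# Summarise the bundled trial data by treatment arm; variable types
# and suitable statistics are chosen automatically
tbl <- trial |>
  select(age, grade, trt) |>
  tbl_summary(by = trt) |>
  add_p()   # append p-values comparing the two arms
tbl
```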

{huxtable}

Here is an R package designed to generate LaTeX and HTML tables with a modern, user-friendly interface, offering extensive control over styling, formatting, alignment and layout. It supports features such as custom borders, padding, background colours and cell spanning across rows or columns, with tables modifiable using standard R subsetting or dplyr functions. Examples demonstrate its use for creating simple tables, applying conditional formatting and producing regression output with statistical details. The package also facilitates quick export to formats like PDF, DOCX, HTML and XLSX. Installation options include CRAN, R-Universe and GitHub, while the name reflects its origins as an enhanced version of the {xtable} package. The logo was generated using the package itself, and the background design draws inspiration from Piet Mondrian’s artwork.
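The subsetting-based styling can be sketched with a tiny invented table:

```r
library(huxtable)

# Build a small table, including the column names as the first row
ht <- hux(
  Item  = c("Widgets", "Gadgets"),
  Sales = c(1200, 450),
  add_colnames = TRUE
)

# Styling uses set_* helpers addressed by row and column
ht <- set_bold(ht, 1, everywhere)
ht <- set_bottom_border(ht, 1, everywhere, 0.5)
ht

# quick_pdf(ht) or quick_docx(ht) export straight to a file
```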

Figure Generation

R has such a reputation for graphical presentations that it is cited as a strong reason to explore what the ecosystem has to offer. While base R itself is not shabby when it comes to creating graphs and charts, these packages will extend things by quite a way. In fact, the first on this list is near enough pervasive.

{ggplot2}

Though its default formatting does not appeal to me, the myriad of options makes this a very flexible tool, albeit at the expense of some code verbosity. Combining separate plots into multi-panel figures is not among its strengths, which may send you elsewhere for that need.
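A basic scatter plot with the default theme swapped out illustrates both the layering and the verbosity:

```r
library(ggplot2)

# Layered construction: data mapping, geometry, labels, then a theme
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(title = "Fuel economy against weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders") +
  theme_minimal()   # one of several built-in replacements for the default look
p
```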

{ggforce}

Focusing on features not included in the core library, the {ggforce} package extends {ggplot2} by offering additional tools to enhance data visualisation. Designed to complement the primary role of {ggplot2} in exploratory data analysis, it provides a range of geoms, stats and other components that are well-documented and implemented, aiming to support more complex and custom plot compositions. Available for installation via CRAN or GitHub, the package includes a variety of functionalities described in detail on its associated website, though specific examples are not included here.
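One example of those extra geoms and facets is facet_zoom(), sketched here with the iris data:

```r
library(ggplot2)
library(ggforce)

# facet_zoom() shows the full data alongside a zoomed-in panel
p <- ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  facet_zoom(x = Species == "versicolor")
p
```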

{cowplot}

Developed by Claus O. Wilke for internal use in his lab, {cowplot} is an R package designed to help with the creation of publication-quality figures built on top of {ggplot2}. It provides a set of themes, tools for aligning and arranging plots into compound figures and functions for annotating plots or combining them with images. The package can be installed directly from CRAN or as a development version via GitHub, and it has seen widespread use in the book Fundamentals of Data Visualization.
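The compound-figure arrangement can be sketched as follows:

```r
library(ggplot2)
library(cowplot)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# Arrange the two plots side by side with panel labels
fig <- plot_grid(p1, p2, labels = c("A", "B"), ncol = 2)
fig
```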

{sjPlot}

The {sjPlot} package provides a range of tools for visualising data and statistical results commonly used in social science research, including frequency tables, histograms, box plots, regression models, mixed effects models, PCA, correlation matrices and cluster analyses. It supports installation via CRAN for stable releases or through GitHub for development versions, with documentation and examples available online. The package is licensed under GPL-3 and developed by Daniel Lüdecke, offering functions to create visualisations such as scatter plots, Likert scales and interaction effect plots, along with tools for constructing index variables and presenting statistical outputs in tabular formats.
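A small sketch of the regression-oriented side of the package:

```r
library(sjPlot)

# Fit a linear model, then plot the coefficients and tabulate the results
m <- lm(mpg ~ wt + hp, data = mtcars)

plot_model(m)   # forest-style plot of estimates with confidence intervals
tab_model(m)    # publication-style HTML regression table
```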

{thematic}

By offering a centralised approach to theming and enabling automatic adaptation of plot styles within Shiny applications, the {thematic} package simplifies the styling of R graphics, including {ggplot2}, {lattice} and base R plots, R Markdown documents and RStudio. It allows users to apply consistent visual themes across different plotting systems, with auto-theming in Shiny and R Markdown relying on CSS and {bslib} themes, respectively. Installation requires specific versions of dependent packages such as {shiny} and {rmarkdown}, while custom fonts benefit from {showtext} or {ragg}. Users can set global defaults for background, foreground and accent colours, as well as fonts, which can be overridden with plot-specific theme adjustments. The package also defines default colour scales for qualitative and sequential data and integrates with tools like bslib to import Google Fonts, enhancing visual consistency across different environments and user interfaces.
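A brief sketch of the centralised theming; the colour values here are arbitrary choices:

```r
library(thematic)
library(ggplot2)

# Set global colours once and let them flow through to subsequent plots
thematic_on(bg = "#222222", fg = "white", accent = "#0CE3AC")

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  ggtitle("Styled by thematic rather than by a ggplot2 theme")

thematic_off()   # restore the usual defaults
```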

Publishing Tools

The R ecosystem goes beyond mere graphical and tabular display production to offer means for taking things much further, often providing platforms for publishing your work. These can be used locally too, so there is no need to entrust everything to a third-party provider. The possible uses are endless, and it appears that Posit has used these tools to help with building documentation and training too.

R Markdown

What you have here is one of those distinguishing facilities of the R ecosystem, particularly for those wanting to share their analysis work with more than a hint of reproducibility. The tool combines narrative text and code to generate various outputs, supporting multiple programming languages and formats such as HTML, PDF and dashboards. It enables users to produce reports, presentations and interactive applications, with options for publishing and scheduling through platforms like RStudio Connect, facilitating collaboration and distribution of results in professional settings.

Distill for R Markdown

Distill for R Markdown is a tool designed to streamline the creation of technical documents, offering features such as code folding, syntax highlighting and theming. It builds on existing frameworks like Pandoc, MathJax and D3, enabling the production of dynamic, interactive content. Users can customise the appearance with CSS and incorporate appendices for supplementary information. The tool acknowledges the contributions of developers who created foundational libraries, ensuring accessibility and functionality for a wide audience. Its design prioritises clarity, allowing authors to focus on presenting results rather than underlying code, while maintaining flexibility for those who wish to include detailed explanations.

{shiny}

For a while, this was one of R's unique selling points, and it remains a compelling reason to use the language, even now that Shiny for Python exists. By enabling the creation of interactive web applications for data analysis without requiring web development expertise, it allows users to build interfaces that let others explore data through dynamic visualisations and filters. Here is a simple example: an app that generates scatter plots with adjustable variables, species filters and marginal plots, hosted either on personal servers or through a dedicated hosting service.
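A pared-down sketch of such an app, without the species filter and marginal plots, shows the basic UI/server split:

```r
library(shiny)

# A minimal app: pick variables from iris and draw a scatter plot
ui <- fluidPage(
  selectInput("x", "X variable", names(iris)[1:4]),
  selectInput("y", "Y variable", names(iris)[1:4], selected = "Sepal.Width"),
  plotOutput("scatter")
)

server <- function(input, output, session) {
  output$scatter <- renderPlot({
    plot(iris[[input$x]], iris[[input$y]],
         col = iris$Species, pch = 19,
         xlab = input$x, ylab = input$y)
  })
}

app <- shinyApp(ui, server)
# runApp(app) serves it locally; shinyapps.io or Shiny Server host it remotely
```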

{bslib}

The {bslib} R package offers a modern user interface toolkit for Shiny and R Markdown applications, leveraging Bootstrap to enable the creation of customisable dashboards and interactive theming. It supports the use of updated Bootstrap and Bootswatch versions while maintaining compatibility with existing defaults, and provides tools for real-time visual adjustments. Installation is available through CRAN, with example previews demonstrating its capabilities.
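A sketch of applying a Bootswatch theme to a Shiny UI; the theme choice is arbitrary:

```r
library(shiny)
library(bslib)

# Swap the default Bootstrap look for a Bootswatch theme
theme <- bs_theme(version = 5, bootswatch = "minty")

ui <- fluidPage(
  theme = theme,
  h1("A themed heading"),
  sliderInput("n", "Sample size", min = 10, max = 100, value = 50),
  plotOutput("hist")
)

server <- function(input, output, session) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

# shinyApp(ui, server); bs_theme_preview() opens an interactive theming tool
```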

{rhandsontable}

Enabling users to manipulate and validate data within a spreadsheet-like interface, the {rhandsontable} package introduces an interactive data grid for R. It supports features such as custom cell rendering, validation rules and integration with Shiny applications. When used in Shiny, the widget requires explicit conversion of data using the hot_to_r function, as updates may not be immediately reflected in reactive contexts. Examples demonstrate its application in various scenarios, including date editing, financial calculations and dynamic visualisations linked to charts. The package also accommodates bookmarks in Shiny apps with specific handling. Users are encouraged to report issues or contribute improvements, with guidance provided for those seeking to expand its functionality. The development team welcomes feedback to refine the tool further, ensuring it aligns with evolving user needs.
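A small sketch of the editable grid, with invented data; the Shiny input name is hypothetical:

```r
library(rhandsontable)

df <- data.frame(
  item = c("Widgets", "Gadgets"),
  due  = as.Date(c("2026-04-01", "2026-05-01")),
  done = c(FALSE, TRUE)
)

# Logical columns render as checkboxes; date columns get a date picker
widget <- rhandsontable(df, rowHeaders = NULL)
widget

# Inside a Shiny server function, edits come back via hot_to_r():
# observeEvent(input$grid, { edited <- hot_to_r(input$grid) })
```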

{xaringanExtra}

{xaringanExtra} offers a range of enhancements and extensions for creating and presenting slides with xaringan, enabling features such as adding an overview tile view, making slides editable, broadcasting in real time, incorporating animations, embedding live video feeds and applying custom styles. It allows users to selectively activate individual tools or load multiple features simultaneously through a single function call, supporting tasks like adding banners, enabling code copying, fitting slides to screen dimensions and integrating utility toolkits. The package is available for installation via CRAN or GitHub, providing flexibility for developers and presenters seeking to expand the functionality of their slides.

Learning R for Data Analysis: Going from the basics to professional practice

22nd March 2026

R has grown from a specialist statistical language into one of the most widely recognised tools for working with data. Across tutorials, community sites, training platforms and industry resources, it is presented as both a programming language and a software environment for statistical computing, graphics and reporting. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand, and its name draws on the first letter of their first names while also alluding to the Bell Labs language S. It is freely available under the GNU General Public Licence and runs on Linux, Windows and macOS, which has helped it spread across research, education and industry alike.

What Makes R Distinctive

What makes R notable is its combination of programming features with a strong focus on data analysis. Introductory material, such as the tutorials at Tutorialspoint and Datamentor, repeatedly highlights its support for conditionals, loops, user-defined recursive functions and input and output, but these sit alongside effective data handling, a broad set of operators for arrays, lists, vectors and matrices and strong graphical capabilities. That mixture means R can be used for straightforward scripts and for complex analytical workflows. A beginner may start by printing "Hello, World!" with the print() function, while a more experienced user may move on to regression models, interactive dashboards or automated reporting.

The Learning Progression

Learning materials generally present R in a structured progression. A beginner is first introduced to reserved words, variables and constants, operators and the order in which expressions are evaluated. From there, the path usually moves into flow control through if…else, ifelse(), for, while, repeat and the use of break and next, before functions follow naturally, including return values, environments and scope, recursive functions, infix operators and switch(). Most sources agree that confidence with the syntax and fundamentals is the real starting point, and this early sequence matters because it helps learners become comfortable reading and writing R rather than only copying examples.
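The constructs named above can be sketched in a few lines of base R:

```r
# A recursive function plus the main flow-control constructs
fib <- function(n) {
  if (n <= 1) {
    return(n)
  }
  fib(n - 1) + fib(n - 2)
}

# Vectorised conditional with ifelse()
parity <- ifelse(1:6 %% 2 == 0, "even", "odd")

# A for loop using next to skip even numbers
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next
  total <- total + i
}

fib(10)   # 55
total     # 25, the sum of the odd numbers below 10
```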

After the basics, attention tends to turn to the structures that make R so useful for data work. Vectors, matrices, lists, data frames and factors appear in nearly every introductory course because they are central to how information is stored and manipulated. Object-oriented concepts also emerge quite early in some routes through the language, with classes and objects extending into S3, S4 and reference classes. For someone coming from spreadsheets or point-and-click statistical software, this shift can feel significant, but it also opens the way to more reproducible and flexible analysis.
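The structures just listed, along with a small taste of S3 dispatch, can be sketched as:

```r
# The core data structures in a few lines
v  <- c(2, 4, 6)                     # atomic vector
m  <- matrix(1:6, nrow = 2)          # 2 x 3 matrix
l  <- list(name = "iris", n = 150)   # lists mix types freely
df <- data.frame(id = 1:3, group = c("a", "b", "a"))
f  <- factor(df$group)               # categorical variable with levels

dim(m)      # 2 3
levels(f)   # "a" "b"

# S3 in miniature: a class attribute plus a method dispatched on it
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2
circ <- structure(list(r = 2), class = "circle")
area(circ)
```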

Visualisation

Visualisation is another recurring theme in R education. Basic chart types such as bar plots, histograms, pie charts, box plots and strip charts are common early examples because they show how quickly data can be turned into graphics. More advanced lessons widen the scope through plot functions, multiple plots, saving graphics, colour selection and the production of 3D plots.

Beyond base plotting, there is extensive evidence of the central role of {ggplot2} in contemporary R practice. Data Cornering demonstrates this well, with articles covering how to create funnel charts in R using {ggplot2} and how to diversify stacked column chart data label colours, showing how R is used not only to summarise data but also to tell visual stories more clearly. In the pharmaceutical and clinical research space, the PSI VIS-SIG blog is published by the PSI Visualisation Special Interest Group and summarises its monthly Wonderful Wednesday webinars, presenting real-world datasets and community-contributed chart improvements alongside news from the group.

Data Wrangling and the Tidyverse

Much of modern R work is built around data wrangling, and here the {tidyverse} has become especially prominent. Claudia A. Engel's openly published guide Data Wrangling with R (last updated 3rd November 2023) sets out a preparation phase that assumes some basic R knowledge, a recent installation of R and RStudio and the installation of the {tidyverse} package with install.packages("tidyverse") followed by library(tidyverse). It also recommends creating a dedicated RStudio project and downloading CSV files into a data subdirectory, reinforcing the importance of organised project structure.

That same guide then moves through data manipulation with {dplyr}, covering selecting columns and filtering rows, pipes, adding new columns, split-apply-combine, tallying and joining two tables, before moving on to {tidyr} topics such as long and wide table formats, pivot_wider, pivot_longer and exporting data. These topics reflect a broader pattern in the R ecosystem because data import and export, reshaping, combining tables and counting by group recur across teaching resources as they mirror common analytical tasks.

Applications and Professional Use

The range of applications attached to R is wide, though data science remains the clearest centre of gravity. Educational sources describe R as valuable for data wrangling, visualisation and analysis, often pointing to packages such as {dplyr}, {tidyr}, {ggplot2} and {shiny}. Statistical modelling is another major strand, with R offering extensible techniques for descriptive and inferential statistics, regression analysis, time series methods and classical tests. Machine learning appears as a further area of growth, supported by a large and expanding package ecosystem. In more advanced contexts, R is also linked with dashboards, web applications, report generation and publishing systems such as Quarto and R Markdown.

R's place in professional settings is underscored by the breadth of organisations and sectors associated with it. Introductory resources mention companies such as Google, Microsoft, Facebook, ANZ Bank, Ford and The New York Times as examples of organisations using R for modelling, forecasting, analysis and visualisation. The NHS-R Community promotes the use of R and open analytics in health and care, building a community of practice for data analysis and data science using open-source software in the NHS and wider UK health and care system. Its resources include reports, blogs, webinars and workshops, books, videos and R packages, with webinar materials archived in a publicly accessible GitHub repository. The R Validation Hub, supported through the pharmaR initiative, is a collaboration to support the adoption of R within a biopharmaceutical regulatory setting and provides tools including the {riskmetric} package, the {riskassessment} app and the {riskscore} package for assessing package quality and risk.

The Wider Ecosystem

The wider ecosystem around R is unusually rich. The R Consortium promotes the growth and development of the R language and its ecosystem by supporting technical and social infrastructure, fostering community engagement and driving industry adoption. It notes that the R language supports over two million users and has been adopted in industries including biotech, finance, research and high technology. Community growth is visible not only through organisations and conferences but through user groups, scholarships, project working groups and local meetups, which matters because learning a language is easier when there is an active support network around it.

Another sign of maturity is the depth of R's package and publication landscape. rdrr.io provides a comprehensive index of over 29,000 CRAN packages alongside more than 2,100 Bioconductor packages, over 2,200 R-Forge packages and more than 76,000 GitHub packages, making it possible to search for packages, functions, documentation and source code in one place. Rdocumentation, powered by DataCamp, covers 32,130 packages across CRAN and Bioconductor and offers a searchable interface for function-level documentation. The Journal of Statistical Software adds a scholarly dimension, publishing open-access articles on statistical computing software together with source code, with full reproducibility mandatory for publication. R-bloggers aggregates R news and tutorials contributed by hundreds of R bloggers, while R Weekly curates a community digest and an accompanying podcast, both helping users keep pace with the steady flow of tutorials, package releases, blog posts and developments across the R world.

Where to Begin

For beginners, one recurring challenge is knowing where to start, and different learning routes reflect different backgrounds. Datamentor points learners towards step-by-step tutorials covering popular topics such as R operators, if...else statements, data frames, lists and histograms, progressing through to more advanced material. R for the Rest of Us offers a staged path through three core courses, Getting Started With R, Fundamentals of R and Going Deeper with R, and extends into nine topics courses covering Git and GitHub, making beautiful tables, mapping, graphics, data cleaning, inferential statistics, package development, reproducibility and interactive dashboards with {shiny}. The site is explicitly designed for people who may never have coded before and also offers the structured R in 3 Months programme alongside training and consulting. RStudio Education (now part of Posit) outlines six distinct ways to begin learning R, covering installation, a free introductory webinar on tidy statistics, the book R for Data Science, browser-based primers, and further options suited to different learning styles, along with guidance on R Markdown and good project practices.

Despite the variety, the underlying advice is consistent: start by learning the basics well enough to read and write simple code, practise regularly beginning with straightforward exercises and gradually take on more complex tasks, then build projects that matter to you because projects create context and make concepts stick. There is no suggestion that mastery comes from passively reading documentation alone, as practical engagement is treated as essential throughout. The blog Stats and R exemplifies this philosophy well, with the stated aim of making statistics accessible to everyone by sharing, explaining and illustrating statistical concepts and, where appropriate, applying them in R.

That practical engagement can take many forms. Someone interested in data journalism may focus on visualisation and reproducible reporting, while a researcher may prioritise statistical modelling and publishing workflows, and a health analyst may use R for quality assurance, open health data and clinical reporting. Others may work with {shiny}, package development, machine learning, Git and GitHub or interactive dashboards. The variety shows that R is not confined to a single use case, even if statistics and data science remain the common thread.

Free Learning Resources for R

It is also worth noting that R learning is supported by a great deal of freely available material. Statistics Globe, founded in 2017 by Joachim Schork and now an education and consulting platform, offers more than 3,000 free tutorials and over 1,000 video tutorials on YouTube, spanning R programming, Python and statistical methodology. STHDA (Statistical Tools for High-Throughput Data Analysis) covers basics, data import and export, reshaping, manipulation and visualisation, with material geared towards practical data analysis at every level. Community sites, webinar repositories and newsletters add further layers of accessibility, and even where paid courses exist, the surrounding free ecosystem is substantial.

Taken together, these sources present R as far more than a niche programming language. It is a mature open-source environment with a strong statistical heritage, a practical orientation towards data work and a well-developed community of learners, teachers, developers and organisations. Its core concepts are approachable enough for beginners, yet its package ecosystem and publishing culture support highly specialised and advanced work. For anyone looking to enter data analysis, statistics, visualisation or related areas, R offers a route that begins with simple code and can extend into large-scale analytical workflows.

How to centre titles, remove gridlines and write reusable functions in {ggplot2}

20th March 2026

{ggplot2} is widely used for data visualisation in R because it offers a flexible, layered grammar for constructing charts. A plot can begin with a straightforward mapping of data to axes and then be refined with titles, themes and annotations until it better serves the message being communicated. That flexibility is one of the greatest strengths of {ggplot2}, though it also means that many useful adjustments are small, specific techniques that are easy to overlook when first learning the package.

Three of those techniques fit together particularly well. The first is centring a plot title, a common formatting need because {ggplot2} titles are left-aligned by default. The second is removing grid lines and background elements to produce a cleaner, less cluttered appearance. The third is wrapping familiar {ggplot2} code into a reusable function so that the same visual style can be applied across different datasets without rewriting everything each time. Together, these approaches show how a basic plot can move from a default graphic to something more polished and more efficient to reproduce.

Centring the Plot Title

A clear starting point comes from a short tutorial by Luis Serra at Ubiqum Code Academy, published on RPubs, which focuses on one specific goal: centring the title of a {ggplot2} output. The example uses the well-known Iris dataset, which is included with R and contains 150 observations across five variables. Those variables are Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species, with Species stored as a factor containing three levels (setosa, versicolor and virginica), each represented by 50 samples.

The first step is to load {ggplot2} and inspect the structure of the data using library(ggplot2), followed by data("iris") and str(iris). The structure output confirms that the first four columns are numeric, and the fifth is categorical. That distinction matters because it makes the dataset well suited to a scatter plot with a colour grouping, allowing two continuous variables to be compared while species differences are shown visually.

The initial chart plots petal length against petal width, with points coloured by species:

ggplot() + geom_point(data = iris, aes(x = Petal.Width, y = Petal.Length, color = Species))

This produces a simple scatter plot and serves as the base for later refinements. Even in this minimal form, the grammar is clear: the data are supplied to geom_point(), the x and y aesthetics are mapped to Petal.Width and Petal.Length, and colour is mapped to Species.

Once the scatter plot is in place, a title is added using ggtitle("My dope plot"), appended to the existing plotting code. This creates a title above the graphic, but it remains left-justified by default. That alignment is not necessarily wrong, as left-aligned titles work well in many visual contexts, yet there are situations where a centred title gives a more balanced appearance, particularly for standalone blog images, presentation slides or teaching examples.

The adjustment required is small and direct. {ggplot2} allows title styling through its theme system, and horizontal justification for the title is controlled through plot.title = element_text(hjust = 0.5). Setting hjust to 0.5 centres the title within the plot area, whilst 0 aligns it to the left and 1 to the right. The revised code becomes:

ggplot() +
  geom_point(data = iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
  ggtitle("My dope plot") +
  theme(plot.title = element_text(hjust = 0.5))

That small example also opens the door to a broader understanding of {ggplot2} themes. Titles, text size, panel borders, grid lines and background fills are all managed through the same theming system, which means that once one element is adjusted, others can be modified in a similar way.

Removing Grids and Background Elements

A second set of techniques, demonstrated by Felix Fan in a concise tutorial on his personal site, begins by generating simple data rather than using a built-in dataset. The code creates a sequence from 1 to 20 with a <- seq(1, 20), calculates the fourth root with b <- a^0.25 and combines both into a data frame using df <- as.data.frame(cbind(a, b)). The plot is then created as a reusable object:

myplot <- ggplot(df, aes(x = a, y = b)) + geom_point()

From there, several styling approaches become available. One of the quickest is theme_bw(), which removes the default grey background and replaces it with a cleaner black-and-white theme. This does not strip the graphic down completely, but it does provide a more neutral base and is often a practical shortcut when the standard {ggplot2} appearance feels too heavy.

More selective adjustments can also be made independently. Grid lines can be removed with the following:

theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

This suppresses both major and minor grid lines, whilst leaving other parts of the panel unchanged. Borderlines can be removed separately with theme(panel.border = element_blank()), though that does not affect the background colour or the grid. Likewise, the panel background can be cleared with theme(panel.background = element_blank()), which removes the panel fill and borderlines but leaves grid lines in place. Each of these commands targets a different component, so they can be combined depending on the desired result.

If the background and border are removed, axis lines can be added back for clarity using theme(axis.line = element_line(colour = "black")). This is an important finishing step in a stripped-back plot because removing too many panel elements can leave the chart without enough visual structure. The explicit axis line restores a frame of reference without reintroducing the full border box.

Two combined approaches are worth knowing. The first uses a single custom theme call:

myplot + theme(
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.background = element_blank(),
  axis.line = element_line(colour = "black")
)

The second starts from theme_bw() and then removes the border and grids whilst adding axis lines:

myplot + theme_bw() + theme(
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  axis.line = element_line(colour = "black")
)

Both approaches produce a cleaner chart, though they begin from slightly different defaults. The practical lesson is that {ggplot2} styling is modular, so there is often more than one route to a similar visual result.

This matters because chart design is rarely only about appearance. Cleaner formatting can make a chart easier to read by reducing distractions and placing more emphasis on the data itself. A centred title, a restrained background and the selective use of borders all influence how quickly the eye settles on what is important.

Building Reusable Custom Plot Functions

A third area extends these ideas further by showing how to build custom {ggplot2} functions in R, a topic covered in depth by Sharon Machlis in a tutorial published on InfoWorld. The central problem discussed is the mismatch that used to make this awkward: tidyverse functions typically use unquoted column names, whilst base R functions generally expect quoted names. This tension became especially noticeable when users wanted to write their own plotting functions that accepted a data frame and column names as arguments.

The example in that article uses Zillow data containing estimated median home values. After loading {dplyr} and {ggplot2}, a horizontal bar chart is created to show home values by neighbourhood in Boston, with bars ordered from highest to lowest values, outlined in black and filled in blue:

ggplot(data = bos_values, aes(x = reorder(RegionName, Zhvi), y = Zhvi)) +
  geom_col(color = "black", fill = "#0072B2") +
  xlab("") + ylab("") +
  ggtitle("Zillow Home Value Index by Boston Neighborhood") +
  theme_classic() +
  theme(plot.title = element_text(size = 24)) +
  coord_flip()

The next step is to turn that pattern into a function. An initial attempt passes unquoted column names but does not work as intended because of the underlying tension between standard R evaluation and the non-standard evaluation of {ggplot2}. The solution came with the introduction of the tidy evaluation {{ operator, commonly known as "curly-curly", in {rlang} version 0.4.0. As noted in the official tidyverse announcement, this operator abstracts the previous two-step quote-and-unquote process into a single interpolation step. Once library(rlang) is loaded, column references inside the plotting code are wrapped in double curly braces:

library(rlang)
mybarplot <- function(mydf, myxcol, myycol, mytitle) {
  ggplot2::ggplot(data = mydf, aes(x = reorder({{ myxcol }}, {{ myycol }}), y = {{ myycol }})) +
    geom_col(color = "black", fill = "#0072B2") +
    xlab("") + ylab("") +
    coord_flip() +
    ggtitle(mytitle) +
    theme_classic() +
    theme(plot.title = element_text(size = 24))
}

With that change in place, the function can be called with unquoted column names, just as they would appear in many tidyverse functions:

mybarplot(bos_values, RegionName, Zhvi, "Zillow Home Value Index by Boston Neighborhood")

That final point is particularly useful in practice. The resulting plot object can be stored and extended further, for example by adding data labels on the bars with geom_text() and the scales::comma() function. A custom plotting function does not lock the user into a fixed result; it provides a well-designed starting point that can still be extended with additional {ggplot2} layers.
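As a sketch of that extensibility, the snippet below uses a small made-up data frame in place of the article's Zillow extract (the values are invented purely for illustration) and shows that a stored plot object accepts further layers after the fact:

```r
library(ggplot2)

# A small stand-in for the article's bos_values data frame (values invented)
bos_values <- data.frame(
  RegionName = c("Back Bay", "Dorchester", "Roxbury"),
  Zhvi = c(1200000, 450000, 380000)
)

p <- ggplot(bos_values, aes(x = reorder(RegionName, Zhvi), y = Zhvi)) +
  geom_col(color = "black", fill = "#0072B2") +
  coord_flip()

# The stored object accepts further layers, here data labels with
# thousands separators formatted by scales::comma()
p + geom_text(aes(label = scales::comma(Zhvi)), hjust = 1.1, colour = "white")
```

Because every call returns an ordinary ggplot object, styling decisions made inside a function never close off later additions.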

Putting the Three Techniques Together in {ggplot2}

Seen as a progression, these examples build on one another in a logical way. The first shows how to centre a title with theme(plot.title = element_text(hjust = 0.5)). The second shows how to simplify a chart by removing grids, borders and background elements whilst restoring axis lines where needed. The third scales those preferences up by packaging them inside a reusable function. What begins as a one-off styling adjustment can therefore become part of a repeatable workflow.

These techniques also reflect a wider culture around R graphics. Resources such as the R Graph Gallery, created by Yan Holtz, have helped make this style of incremental learning more accessible by offering reproducible examples across a wide range of chart types. The gallery presents over 400 R-based graphics, with a strong emphasis on {ggplot2} and the tidyverse, and organises them into nearly 50 chart families and use cases. Its broader message is that effective visualisation is often the result of small, deliberate decisions rather than dramatic reinvention.

For anyone working with {ggplot2}, that is a helpful principle to keep in mind. A centred title may seem minor, just as removing a panel grid may seem cosmetic, yet these changes can improve clarity and consistency across a body of work. When those preferences are wrapped into a function, they also save time and reduce repetition, connecting plot styling directly to good code design.

From summary statistics to published reports with R, LaTeX and TinyTeX

19th March 2026

For anyone working across LaTeX, R Markdown and data analysis in R, there comes a point where separate tools begin to converge. Data has to be summarised, those summaries have to be turned into presentable tables and the finished result has to compile into a report that looks appropriate for its audience rather than a console dump. These notes follow that sequence, moving from the practical business of summarising data in R through to tabulation and then on to the publishing infrastructure that makes clean PDF and Word output possible.

Summarising Data with {dplyr}

The starting point for many analyses is a quick exploration of the data at hand. One useful example uses the anorexia dataset from the {MASS} package together with {dplyr}. The dataset contains weight change data for young female anorexia patients, divided into three treatment groups: Cont for the control group, CBT for cognitive behavioural treatment and FT for family treatment.

The basic manipulation starts by loading {MASS} and {dplyr}, then using filter() to create separate subsets for each treatment group. From there, mutate() adds a wtDelta column defined as Postwt - Prewt, giving the weight change for each patient. group_by(Treat) prepares the data for grouped summaries, and arrange(wtDelta) sorts within treatment groups. The notes then show how {dplyr}'s pipe operator, %>%, makes the workflow more readable by chaining these operations. The final summary table uses summarize() to compute the number of observations, the mean weight change and the standard deviation within each treatment group: CBT has 29 observations with a mean weight change of 3.006897 and a standard deviation of 7.308504, Cont has 26 observations with a mean of -0.450000 and a standard deviation of 7.988705, and FT has 17 observations with a mean of 7.264706 and a standard deviation of 7.157421.
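The workflow just described can be sketched as one pipeline, assuming {MASS} and {dplyr} are installed; it reproduces the grouped figures quoted above:

```r
library(MASS)    # provides the anorexia dataset
library(dplyr)

anorexia %>%
  mutate(wtDelta = Postwt - Prewt) %>%   # weight change per patient
  group_by(Treat) %>%
  summarize(n         = n(),             # observations per group
            meanDelta = mean(wtDelta),   # mean weight change
            sdDelta   = sd(wtDelta))     # standard deviation
```

The filter() and arrange() steps mentioned in the notes slot into the same chain when per-group subsets or sorted output are wanted.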

That example is not presented as a complete statistical analysis. Instead, it serves as a quick exploratory route into the data, with the wording remaining appropriately cautious and noting that this is only a glance and not a rigorous analysis.

Choosing an R Package for Descriptive Summaries

The question of how best to summarise data opens up a broader comparison of R packages for descriptive statistics. A useful review sets out a common set of needs: a count of observations, the number and types of fields, transparent handling of missing data and sensible statistics that depend on the data type. Numeric variables call for measures such as mean, median, range and standard deviation, perhaps with percentiles. Categorical variables call for counts of levels and some sense of which categories dominate.

Base R's summary() does some of this reasonably well. It distinguishes categorical from numeric variables and reports distributions or numeric summaries accordingly, while also highlighting missing values. Yet, it does not show an overall record count, lacks standard deviation and is not especially tidy or ready for tools such as kable. Several contributed packages aim to improve on that. Hmisc::describe() gives counts of variables and observations, handles both categorical and numerical data and reports missing values clearly, showing the highest and lowest five values for numeric data instead of a simple range. pastecs::stat.desc() is more focused on numeric variables and provides confidence intervals, standard errors and optional normality tests. psych::describe() includes categorical variables but converts them to numeric codes by default before describing them, which the package documentation itself advises should be interpreted cautiously. psych::describeBy() extends this approach to grouped summaries and can return a matrix form with mat = TRUE.
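To make the summary() shortfall concrete, a small base-R workaround (no contributed packages) supplies the record count and standard deviations that the review notes are missing:

```r
# summary() distinguishes numeric from categorical columns and flags
# missing values, but reports no overall record count and no standard deviation
summary(iris)

# Base R fills both gaps without extra packages
nrow(iris)                            # the record count: 150
sapply(Filter(is.numeric, iris), sd)  # sd for each numeric column
```

This is exactly the kind of boilerplate that packages such as {skimr} fold into a single call.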

Among the packages reviewed, {skimr} receives especially strong attention for balancing readability and downstream usefulness. skim() reports record and variable counts clearly, separates variables by type and includes missing data and standard summaries in an accessible layout. It also works with group_by() from {dplyr}, making grouped summaries straightforward to produce. More importantly for analytical workflows, the skim output can be treated as a tidy data frame in which each combination of variable and statistic is represented in long form, meaning the results can be filtered, transformed and plotted with standard tidyverse tools such as {ggplot2}.
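A minimal sketch of that usage, assuming {skimr} and {dplyr} are installed:

```r
library(skimr)
library(dplyr)

# Whole-data-frame overview: record and variable counts, variable types,
# missingness and standard summaries, one row per variable
skim(iris)

# Grouped summaries work through the usual {dplyr} verb
iris %>%
  group_by(Species) %>%
  skim(Sepal.Length)
```

Because the result behaves as a tidy data frame, it can be filtered or passed straight to {ggplot2} like any other long-form table.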

{summarytools} is presented as another strong option, though with a distinction between its functions. descr() handles numeric variables and can be converted to a data frame for use with kable, while dfSummary() works across entire data frames and produces an especially polished summary. At the time of the original notes, dfSummary() was considered slow. The package author subsequently traced the issue, as documented in the same review, to an excessive number of histogram breaks being generated for variables with large values, imposing a limit to resolve it. The package also supports output through view(dfSummary(data)), which yields an attractive HTML-style summary.

Grouped Summary Table Packages

Once the data has been summarised, the next step is turning those summaries into formal tables. A detailed comparison covers a number of packages specifically designed for this purpose: {arsenal}, {qwraps2}, {Amisc}, {table1}, {tangram}, {furniture}, {tableone}, {compareGroups} and {Gmisc}. {arsenal} is described as highly functional and flexible, with tableby() able to create grouped tables in only a few lines and then be customised through control objects that specify tests, display statistics, labels and missing value treatment. {qwraps2} offers a lot of flexibility through nested lists of summary specifications, though at the cost of more code. {Amisc} can produce grouped tables and works with pander::pandoc.table(), but is noted as not being on CRAN. {table1} creates attractive tables with minimal code, though its treatment of missing values may not suit every use case. {tangram} produces visually appealing HTML output and allows custom rows such as missing counts to be inserted manually, although only HTML output is supported. {furniture} and {tableone} both support grouped table creation, but {tableone} in particular is notable because it is widely used in biomedical research for baseline characteristics tables.

The {tableone} package deserves separate mention because it is designed to summarise continuous and categorical variables in one table, a common need in medical papers. As the package introduction explains, CreateTableOne() can be used on an entire dataset or on a selected subset of variables, with factorVars specifying variables that are coded numerically but should be treated as categorical. The package can display all levels for categorical variables, report missing values via summary() and switch selected continuous variables to non-normal summaries using medians and interquartile ranges instead of means and standard deviations. For grouped comparisons, it prints p-values by default and can switch to non-parametric tests or Fisher's exact test where needed. Standardised mean differences can also be shown. Output can be captured as a matrix and written to CSV for editing in Excel or Word.
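A short sketch of that interface, reusing the anorexia data from {MASS} rather than a clinical dataset (so the variable names differ from what a medical paper would show), and assuming {tableone} is installed:

```r
library(tableone)
library(MASS)    # provides the anorexia dataset

# Summarise two continuous variables, stratified by treatment group
tab <- CreateTableOne(vars   = c("Prewt", "Postwt"),
                      strata = "Treat",
                      data   = anorexia)

# Report median and interquartile range for a selected variable,
# switching it away from the default mean (SD) display
print(tab, nonnormal = "Prewt")
```

The factorVars argument would be added when numerically coded categorical variables are present, and the printed matrix can be captured and written to CSV as described above.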

Styling and Exporting Tables

With tables constructed, the focus shifts to how they are presented and exported. As Hao Zhu's conference slides explain, the {kableExtra} package builds on knitr::kable() and provides a grammar-like approach to adding styling layers, importing the pipe operator %>% from {magrittr} so that formatting functions can be chained in the same way that layers are added in {ggplot2}. It supports themes such as kable_paper, kable_classic, kable_minimal and kable_material, as well as options for striping, hover effects, condensed layouts, fixed headers, grouped rows and columns, footnotes, scroll boxes and inline plots.
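A brief sketch of that layered style, assuming {knitr} and {kableExtra} are installed (the result renders as HTML):

```r
library(knitr)
library(kableExtra)

# Styling layers are piped onto kable() much as geoms are added in {ggplot2}
kable(head(mtcars[, 1:4])) %>%
  kable_paper("hover", full_width = FALSE) %>%   # theme plus hover effect
  footnote(general = "Source: the mtcars dataset shipped with R.")
```

Swapping kable_paper() for kable_classic() or kable_material() changes the theme without touching the rest of the chain.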

Table output is often the visible end of an analysis, and a broader review of R table packages covers a range of approaches that go well beyond the default output. In R Markdown, packages such as {gt}, {kableExtra}, {formattable}, {DT}, {reactable}, {reactablefmtr} and {flextable} all offer richer possibilities. Some are aimed mainly at HTML output, others at Word. {DT} in particular supports highly customised interactive tables with searching, filtering and cell styling through more advanced R and HTML code. {flextable} is highlighted as the strongest option when knitting to Word, given that the other packages are primarily designed for HTML.

For users working in Word-heavy settings, older but still practical workflows remain relevant too. One approach is simply to write tables to comma-separated text files and then paste and convert the content into a Word table. Another route is through {arsenal}'s write2 functions, designed as an alternative to SAS ODS. The convenience functions write2word(), write2html() and write2pdf() accept a wide range of objects: tableby, modelsum, freqlist and comparedf from {arsenal} itself, as well as knitr::kable(), xtable::xtable() and pander::pander_return() output. One notable constraint is that {xtable} is incompatible with write2word(). Beyond single tables, the functions accept a list of objects so that multiple tables, headers, paragraphs and even raw HTML or LaTeX can all be combined into a single output document. A yaml() helper adds a YAML header to the output, and a code.chunk() helper embeds executable R code chunks, while the generic write2() function handles formats beyond the three convenience wrappers, such as RTF.

The Publishing Infrastructure: CTAN and Its Mirrors

Producing PDF output from R Markdown depends on a working LaTeX installation, and the backbone of that ecosystem is CTAN, the Comprehensive TeX Archive Network. CTAN is the main archive for TeX and LaTeX packages and is supported by a large collection of mirrors spread around the world. The purpose of this distributed system is straightforward: users are encouraged to fetch files from a site that is close to them in network terms, which reduces load and tends to improve speed.

That global spread is extensive. The CTAN mirror list organises sites alphabetically by continent and then by country, with active sites listed across Africa, Asia, Europe, North America, Oceania and South America. Africa includes mirrors in South Africa and Morocco. Asia has particularly wide coverage, with many mirrors in China as well as sites in Korea, Hong Kong, India, Indonesia, Japan, Singapore, Taiwan, Saudi Arabia and Thailand. Europe is especially rich in mirrors, with hosts in Denmark, Germany, Spain, France, Italy, the Netherlands, Norway, Poland, Portugal, Romania, Switzerland, Finland, Sweden, the United Kingdom, Austria, Greece, Bulgaria and Russia. North America includes Canada, Costa Rica and the United States, while Oceania covers Australia and South America includes Brazil and Chile.

The details matter because different mirrors expose different protocols. While many support HTTPS, some also offer HTTP, FTP or rsync. CTAN provides a mirror multiplexer to make the common case simpler: pointing a browser to https://mirrors.ctan.org/ results in automatic redirection to a mirror in or near the user's country. There is one caveat. The multiplexer always redirects to an HTTPS mirror, so anyone intending to use another protocol needs to select manually from the mirror list. That is why the full listings still include non-HTTPS URLs alongside secure ones.

There is also an operational side to the network that is easy to overlook when things are working well. CTAN monitors mirrors to ensure they are current, and if one falls behind, then mirrors.ctan.org will not redirect users there. Updates to the mirror list can be sent to ctan@ctan.org. The master host of CTAN is ftp.dante.de in Cologne, Germany, with rsync access available at rsync://rsync.dante.ctan.org/CTAN/ and web access on https://ctan.org/. For those who want to contribute infrastructure rather than simply use it, CTAN also invites volunteers to become mirrors.

TinyTeX: A Lightweight LaTeX Distribution

This infrastructure becomes much more tangible when looking at a lightweight TeX distribution such as TinyTeX. TinyTeX is a lightweight, cross-platform, portable and easy-to-maintain LaTeX distribution based on TeX Live. It is small in size but intended to function well in most situations, especially for R users. Its appeal lies in not requiring users to install thousands of packages they will never use, installing them as needed instead. This also means installation can be done without administrator privileges, which removes one of the more familiar barriers around traditional TeX setups. TinyTeX can even be run from a flash drive.

For R users, TinyTeX is closely tied to the {tinytex} R package. The distinction is important: tinytex in lower case refers to the R package, while TinyTeX refers to the LaTeX distribution. Installation is intentionally direct. After installing the R package with install.packages('tinytex'), a user can run tinytex::install_tinytex(). Uninstallation is equally simple with tinytex::uninstall_tinytex(). For the average R Markdown user, that is often enough. Once TinyTeX is in place, PDF compilation usually requires no further manual package management.

There is slightly more to know if the aim is to compile standalone LaTeX documents from R. The {tinytex} package provides wrappers such as pdflatex(), xelatex() and lualatex(). These functions detect required LaTeX packages that are missing and install them automatically by default. In practical terms, that means a small example document can be written to a file and compiled with tinytex::pdflatex('test.tex') without much concern about whether every dependency has already been installed. For R users, this largely removes the old pattern of cryptic missing-package errors followed by manual searching through TeX repositories.

Developers may want more than the basics, and TinyTeX has a path for that as well. A helper such as tinytex:::install_yihui_pkgs() installs a collection of packages needed for building the PDF vignettes of many CRAN packages. That is a specific convenience rather than a universal requirement, but it illustrates the design philosophy behind TinyTeX: keep the initial footprint light and offer ways to add what is commonly needed later.

Using TinyTeX Outside R

For users outside R, TinyTeX still works, but the focus shifts to the command-line utility tlmgr. The documentation is direct in its assumptions: if command-line work is unwelcome, another LaTeX distribution may be a better fit. The central command is tlmgr, and much of TinyTeX maintenance can be expressed through it.

On Linux, installation places TinyTeX in $HOME/.TinyTeX and creates symlinks for executables such as pdflatex under $HOME/bin or $HOME/.local/bin if it exists. The installation script is fetched with wget and piped to sh, after first checking that Perl is correctly installed. On macOS, TinyTeX lives in ~/Library/TinyTeX, and users without write permission to /usr/local/bin may need to change ownership of that directory before installation. Windows users can run a batch file, install-bin-windows.bat, and the default installation directory is %APPDATA%/TinyTeX unless APPDATA contains spaces or non-ASCII characters, in which case %ProgramData% is used instead. PowerShell version 3.0 or higher is required on Windows.

Uninstallation follows the same self-contained logic. On Linux and macOS, tlmgr path remove is followed by deleting the TinyTeX folder. On Windows, tlmgr path remove is followed by removing the installation directory. This simplicity is a deliberate contrast with larger LaTeX distributions, which are considerably more involved to remove cleanly.

Maintenance and Package Management

Maintenance is where TinyTeX's relationship to CTAN and TeX Live becomes especially visible. If a document fails with an error such as File 'times.sty' not found, the fix is to search for the package containing that file with tlmgr search --global --file "/times.sty". In the example given, that identifies the psnfss package, which can then be installed with tlmgr install psnfss. If the package includes executables, tlmgr path add may also be needed. An alternative route is to upload the error log to the yihui/latex-pass GitHub repository, where package searching is carried out remotely.

If the problem is less obvious, a full update cycle is suggested: tlmgr update --self --all, then tlmgr path add and fmtutil-sys --all. R users have wrappers for these tasks too, including tlmgr_search(), tlmgr_install() and tlmgr_update(). Some situations still require a full reinstallation. If TeX Live reports Remote repository newer than local, TinyTeX should be reinstalled manually, which for R users can be done with tinytex::reinstall_tinytex(). Similarly, when a TeX Live release is frozen in preparation for a new one, the advice is simply to wait and then reinstall when the next release is ready.

The motivation behind TinyTeX is laid out with unusual clarity. Traditional LaTeX distributions often present a choice between a small basic installation that soon proves incomplete and a very large full installation containing thousands of packages that will never be used. TinyTeX is framed as a way around those frustrations by building on TeX Live's portability and cross-platform design while stripping away unnecessary size and complexity. The acknowledgements also underline that TinyTeX depends on the work of the TeX Live team.

Connecting the R Workflow to a Finished Report

Taken together, these notes show how closely summarisation, tabulation and publishing are linked. {dplyr} and related tools make it easy to summarise data quickly, while a wide range of R packages then turn those summaries into tables that are not only statistically useful but also presentable. CTAN and its mirrors keep the TeX ecosystem available and current across the world, and TinyTeX builds on that ecosystem to make LaTeX more manageable, especially for R users. What begins with a grouped summary in the console can end with a polished report table in HTML, PDF or Word, and understanding the chain between those stages makes the whole workflow feel considerably less mysterious.

Some R functions for working with dates, strings and data frames

18th March 2026

Working with data in R often comes down to a handful of recurring tasks: combining text, converting dates and times, reshaping tables and creating summaries that are easier to interpret. This article brings together several strands of base R and tidyverse-style practice, with a particular focus on string handling, date parsing, subsetting and simple time series smoothing. Taken together, these functions form part of the everyday toolkit for data cleaning and analysis, especially when imported data arrive in inconsistent formats.

String Building

At the simplest end of this toolkit is paste(), a base R function for concatenating character vectors. Its purpose is straightforward: it converts one or more R objects to character vectors and joins them together, separating terms with the string supplied in sep, which defaults to a space. If the inputs are vectors, concatenation happens term by term, so paste("A", 1:6, sep = "") yields "A1" through "A6", while paste(1:12) behaves much like as.character(1:12). There is also a collapse argument, which takes the resulting vector and combines its elements into a single string separated by the chosen delimiter, making paste() useful both for constructing values row by row and for creating one final display string from many parts.
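Those behaviours are quick to verify at the console:

```r
# Element-wise concatenation; sep is inserted between corresponding terms
paste("A", 1:6, sep = "")   # "A1" "A2" "A3" "A4" "A5" "A6"
paste0("A", 1:6)            # paste0() is the sep = "" shorthand

# collapse joins the resulting vector into a single string
paste(c("red", "green", "blue"), collapse = ", ")   # "red, green, blue"
```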

That basic string-building role becomes more important when dates and times are involved because imported date-time data often arrive as text split across multiple columns. A common example is having one column for a date and another for a time, then joining them with paste(dates, times) before parsing the result. In that sense, the paste() function often acts as a bridge between messy raw input and structured date-time objects. It is simple, but it appears repeatedly in data preparation pipelines.
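A minimal sketch of that bridge, with invented date and time strings standing in for imported columns:

```r
# Dates and times arriving as separate text columns
dates <- c("2026-03-18", "2026-03-19")
times <- c("09:30:00", "17:45:10")

# paste() joins them (the default sep is a space),
# then strptime() parses the combined string into date-time objects
stamps <- strptime(paste(dates, times),
                   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```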

Date-Time Conversion

For date-time conversion, base R provides strptime(), strftime() and format() methods for POSIXlt and POSIXct objects. These functions convert between character representations and R date-time classes, and they are central to understanding how R reads and prints times. strptime() takes character input and converts it to an object of class "POSIXlt", while strftime() and format() move in the other direction, turning date-time objects into character strings. The as.character() method for "POSIXt" classes fits into the same family, and the essential idea is that the date-time value and its textual representation are separate things, with the format string defining how R should interpret or display that representation.

Format strings rely on conversion specifications introduced with %, and many of these are standard across systems. %Y means a four-digit year with century, %y means a two-digit year, %m is a month, %d is the day of a month and %H:%M:%S captures hours, minutes and seconds in 24-hour time. %F is equivalent to %Y-%m-%d, which is the ISO 8601 date format. %b and %B represent abbreviated and complete month names, while %a and %A do the same for weekdays. Locale matters here because month names, weekday names, AM/PM indicators and some separators depend on the LC_TIME locale, meaning a date string like "1jan1960" may parse correctly in one locale and return NA in another unless the locale is set appropriately.

R's defaults generally follow ISO 8601 rules, so dates print as "2001-02-28" and times as "14:01:02", though R inserts a space between date and time by default. Several details matter in practice. strptime() processes input strings only as far as needed for the specified format, so trailing characters are ignored. Unspecified hours, minutes and seconds default to zero, and if no year, month or day is supplied then the current values are assumed, though if a month is given, the day must also be valid for that month. Invalid calendar dates such as "2010-02-30 08:00" produce results whose components are all NA.
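A short round trip illustrates both directions of the conversion; the sample value is made up.

```r
# Character to date-time and back again
x <- strptime("28/02/2001 14:01:02", format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
class(x)                        # "POSIXlt" "POSIXt"

format(x, "%F")                 # "2001-02-28", the ISO 8601 shorthand
format(x, "%A %d %B %Y")        # weekday and month names, locale-dependent

# Trailing characters beyond the format are ignored,
# and unspecified time components default to zero
strptime("2001-02-28 something else", format = "%Y-%m-%d")$hour   # 0
```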

Time Zones and Daylight Saving

Time zones add another layer of complexity. The tz argument specifies the time zone to use for conversion, with "" meaning the current time zone and "GMT" meaning UTC. Invalid values are often treated as UTC, though behaviour can be system-specific. The usetz argument controls whether a time zone abbreviation is appended to output, which is generally more reliable than %Z. %z represents a signed UTC offset such as -0800, and R supports it for input on all platforms. Even so, time zones can be awkward because daylight saving transitions create times that do not occur at all, or occur twice, and strptime() itself does not validate those cases, though conversion through as.POSIXct may do so.

Two-Digit Years

Two-digit years are a notable source of confusion for analysts working with historical data. As described in the R date formats guide on R-bloggers, %y maps values 00 to 68 to the years 2000 to 2068 and 69 to 99 to 1969 to 1999, following the POSIX standard. A value such as "08/17/20" may therefore be interpreted as 2020 when the intended year is 1920. One practical workaround is to identify any parsed dates lying in the future and then rebuild them with a 19 prefix using format() and ifelse(). This approach is explicit and practical, though it depends on the assumptions of the data at hand.
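The workaround can be sketched as follows; the input strings and the choice of Sys.Date() as the cut-off are assumptions made for illustration.

```r
# %y pushes 00-68 into the 21st century, so 55 parses as 2055
raw    <- c("08/17/20", "03/01/55")
parsed <- as.Date(raw, format = "%m/%d/%y")   # "2020-08-17" "2055-03-01"

# Rebuild any date lying in the future with a 19 prefix
fixed <- as.Date(ifelse(parsed > Sys.Date(),
                        format(parsed, "19%y-%m-%d"),
                        format(parsed)))
fixed                                         # "2055-03-01" becomes "1955-03-01"
```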

Plain Dates

For plain dates, rather than full date-times, as.Date() is usually the entry point. Character dates can be imported by specifying the current format, such as %m/%d/%y for "05/27/84" or %B %d %Y for "May 27 1984". If no format is supplied, as.Date() first tries %Y-%m-%d and then %Y/%m/%d. Numeric dates are common when data come from Excel, and here the crucial issue is the origin date: Windows Excel uses an origin of "1899-12-30" for dates after 1900 because Excel incorrectly treated 1900 as a leap year (an error originally copied from Lotus 1-2-3 for compatibility), while Mac Excel traditionally uses "1904-01-01". Once the correct origin is supplied, as.Date() converts the serial numbers into standard R dates.
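The serial numbers below are the ones used in the R-bloggers example; swapping the origin shows why knowing the source spreadsheet matters.

```r
serial <- c(30829, 38540)
as.Date(serial, origin = "1899-12-30")   # "1984-05-27" "2005-07-07" (Windows Excel)
as.Date(serial, origin = "1904-01-01")   # later dates under the Mac origin

# Character dates need the format that matches the input
as.Date("05/27/84", format = "%m/%d/%y")      # "1984-05-27"
as.Date("May 27 1984", format = "%B %d %Y")   # month name, locale-dependent
```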

After import, format() can display dates in other ways without changing their underlying class. For example, format(betterDates, "%a %b %d") might yield values like "Sun May 27" and "Thu Jul 07". This distinction between storage and display is important because once R recognises values as dates, they can participate in date-aware operations such as mean(), min() and max(), and a vector of dates can have a meaningful mean date with the minimum and maximum identifying the earliest and latest observations.

Extracting Columns and Manipulating Lists

These ideas about correct types and structure carry over into table manipulation. A data frame column often needs to be extracted as a vector before further processing, and there are several standard ways to do this, as covered in this guide from Statistics Globe. In base R, the $ operator gives a direct route, as in data$x1. Subsetting with data[, "x1"] yields the same result for a single column, and in the tidyverse, dplyr::pull(data, x1) serves the same purpose. All three approaches convert a column of a data frame into a standalone vector, and each is useful depending on the surrounding code style.
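A small sketch of those routes, using a toy data frame; the {dplyr} line is commented out so that the example runs with base R alone.

```r
data <- data.frame(x1 = 1:3, x2 = letters[1:3])

v1 <- data$x1                  # dollar operator
v2 <- data[, "x1"]             # single-column bracket subsetting
# v3 <- dplyr::pull(data, x1)  # tidyverse equivalent, if {dplyr} is available

identical(v1, v2)              # TRUE: both are the same plain vector
```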

List manipulation has similar patterns, detailed in this Statistics Globe tutorial on removing list elements. Removing elements from a list can be done by position with negative indexing, as in my_list[-2], or by assigning NULL to the relevant component, for example my_list_2[2] <- NULL. If names are more meaningful than positions, then subsetting with names(my_list) != "b" or names(my_list) %in% "b" == FALSE removes the named element instead. The same logic extends to multiple elements, whether by positions such as -c(2, 3) or names such as %in% c("b", "c") == FALSE. These are simple techniques, but they matter because lists are a common structure in R, especially when working with nested results.
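These patterns look like this in practice, with a throwaway list:

```r
my_list <- list(a = 1, b = "two", c = 3:5)

my_list[-2]                         # drop by position
my_list[names(my_list) != "b"]      # drop by name

my_list_2 <- my_list
my_list_2[2] <- NULL                # drop in place via NULL assignment

my_list[!(names(my_list) %in% c("b", "c"))]   # several at once, by name
```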

Subsetting, Renaming and Reordering Data Frames

Data frames themselves can be subset in several ways, and the choice often depends on readability, as the five-method overview on R-bloggers demonstrates clearly. The bracket form example[x, y] remains the foundation, whether selecting rows and columns directly or omitting unwanted ones with negative indices. More expressive alternatives include which() together with %in%, the base subset() function and tidyverse verbs like filter() and select(). The point is not that one method is universally best, but that R offers both low-level precision and higher-level readability, depending on the task.
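The main styles can be compared side by side on a toy data frame; the {dplyr} line stays commented so that the sketch runs in base R.

```r
df <- data.frame(id = 1:5, grp = c("a", "b", "a", "c", "b"), val = 11:15)

df[df$grp == "a", ]                          # bracket form with a logical test
df[which(df$grp %in% c("a", "b")), ]         # which() together with %in%
subset(df, grp == "a", select = c(id, val))  # base subset()
# dplyr::filter(df, grp == "a")              # tidyverse verb
df[, -2]                                     # drop a column by negative index
```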

Column names and column order also need regular attention. Renaming can be done with dplyr::rename(), as explained in this lesson from Datanovia, for instance changing Sepal.Length to sepal_length and Sepal.Width to sepal_width. In base R, the same effect comes from modifying names() or colnames(), either by matching specific names or by position. Reordering columns is just as direct, with a data frame rearranged by column indices such as my_data[, c(5, 4, 1, 2, 3)] or by an explicit character vector of names, as the STHDA guide on reordering columns illustrates. Both approaches are useful when preparing data for presentation or for functions that expect variables in a certain order.
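Both jobs take only a line or two in base R, shown here on the first rows of iris:

```r
my_data <- head(iris)

# Rename by matching a name or by position
names(my_data)[names(my_data) == "Sepal.Length"] <- "sepal_length"
colnames(my_data)[2] <- "sepal_width"
# dplyr::rename(iris, sepal_length = Sepal.Length)  # tidyverse route

# Reorder by index or by an explicit vector of names
my_data[, c(5, 4, 1, 2, 3)]
my_data[, c("Species", "sepal_length")]
```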

Sorting and Cumulative Calculations

Sorting and cumulative calculations fit naturally into this same preparatory workflow. To sort a data frame in base R, the DataCamp sorting reference demonstrates that order() is the key function: mtcars[order(mtcars$mpg), ] sorts ascending by mpg, while mtcars[order(mtcars$mpg, -mtcars$cyl), ] sorts by mpg ascending and cyl descending; note that the columns must be referenced explicitly (or wrapped in with()) unless the data frame has been attached. For cumulative totals, cumsum() provides a running sum, as in calculating cumulative air miles from the airmiles dataset, an example covered in the Data Cornering guide to cumulative calculations. Within grouped data, dplyr::group_by() and mutate() can apply cumsum() separately to each group, and a related idea is a cumulative count, which can be built by summing a column of ones within groups, or with data.table::rowid() to create a group index.
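A compact base R sketch covers both; the grouped running sum uses ave() rather than {dplyr} so that it runs without extra packages.

```r
# order() returns a permutation of row indices
mtcars[order(mtcars$mpg), ]                  # ascending by mpg
mtcars[order(mtcars$mpg, -mtcars$cyl), ]     # mpg ascending, cyl descending

cumsum(c(5, 3, 2))                           # 5 8 10: a running total

# A grouped running sum in base R via ave()
df <- data.frame(g = c("a", "a", "b"), x = 1:3)
df$run <- ave(df$x, df$g, FUN = cumsum)      # 1 3 3
```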

Time Series Smoothing

Time series smoothing introduces one further pattern: replacing noisy raw values with moving averages. As the Storybench rolling averages guide explains, the zoo::rollmean() function calculates rolling means over a window of width k, and examples using 3, 5, 7, 15 and 21-day windows on pandemic deaths and confirmed cases by state demonstrate the approach clearly. After arranging and grouping by state, mutate() adds variables such as death_03da, death_05da and death_07da. Because rollmean() is centred by default, the resulting values are symmetrical around the observation of interest and produce NA values at the start and end where there are not enough surrounding observations, which is why odd values of k are usually preferred as they make the smoothing window balanced.

The arithmetic is uncomplicated, but the interpretation is useful. A 3-day moving average for a given date is the mean of that day, the previous day and the following day, while a 7-day moving average uses three observations on either side. As the window widens, the line becomes smoother, but more short-term variation is lost. This trade-off is visible when comparing 3-day and 21-day averages: a shorter average tracks recent changes more closely, while a longer one suppresses noise and makes broader trends stand out. If a trailing rather than centred calculation is needed, rollmeanr() shifts the window to the right-hand end.

The same grouped workflow can be used to derive new daily values before smoothing. In the pandemic example, daily new confirmed cases are calculated from cumulative confirmed counts using dplyr::lag(), with each day's new cases equal to the current cumulative total minus the previous day's total. Grouping by state and date, summing confirmed counts and then subtracting the lagged value produces new_confirmed_cases, which can then be smoothed with rollmean() in the same way as deaths. Once these measures are available, reshaping with pivot_longer() allows raw values and rolling averages to be plotted together in ggplot2, making it easier to compare volatility against trend.
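For readers without {zoo}, the centred average and the lagged difference can both be reproduced in base R; the series below is invented, and stats::filter() matches the core values that zoo::rollmean() with k = 3 would give.

```r
# Centred 3-point moving average; NAs appear where the window is incomplete
deaths <- c(2, 4, 6, 8, 10, 12, 14)
ma3 <- stats::filter(deaths, rep(1/3, 3), sides = 2)
ma3                                     # NA 4 6 8 10 12 NA

# Daily new values from a cumulative series, mirroring dplyr::lag()
cumulative <- c(10, 15, 23, 30)
new_cases  <- c(NA, diff(cumulative))   # NA 5 8 7
```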

How These R Data Manipulation Techniques Fit Together

What links all of these techniques is not just that they are common in R, but that they solve the mundane, essential problems of analysis. Data arrive as text when they should be dates, as cumulative counts when daily changes are needed, as broad tables when only a few columns matter, or as inconsistent names that get in the way of clear code. Functions such as paste(), strptime(), as.Date(), order(), cumsum(), rollmean(), rename(), select() and simple bracket subsetting are therefore less like isolated tricks and more like pieces of a coherent working practice. Knowing how they fit together makes it easier to move from raw input to reliable analysis, with fewer surprises along the way.

Speeding up R Code with parallel processing

17th March 2026

Parallel processing in R has evolved considerably over the past fifteen years, moving from a patchwork of platform-specific workarounds into a well-structured ecosystem with clean, consistent interfaces. The appeal is easy to grasp: modern computers offer several processor cores, yet most R code runs on only one of them unless the user makes a deliberate choice to go parallel. When a task involves repeated calculations across groups, repeated model fitting or many independent data retrievals, spreading that work across multiple cores can reduce elapsed time substantially.

At its heart, the idea is simple. A larger job is split into smaller pieces, those pieces are executed simultaneously where possible, and the results are combined back together. That pattern appears throughout R's parallel ecosystem, whether the work is running on a laptop with a handful of cores or on a university supercomputer with thousands.

Why Parallel Processing?

Most modern computers have multiple cores that sit idle during single-threaded R scripts. Parallel processing takes advantage of this by splitting work across those cores, but it is important to understand that it is not always beneficial. Starting workers, transmitting data and collecting results all take time. Parallel processing makes the most sense when each iteration does enough computational work to justify that overhead. For fast operations of well under a second, the overhead will outweigh any gain and serial execution is faster. The sweet spot is iterative work, where each unit of computation takes at least a few seconds.

Benchmarking: Amdahl's Law

The theoretical speed-up from adding processors is always limited by the fraction of work that cannot be parallelised. Amdahl's Law, formulated by computer scientist Gene Amdahl in 1967, captures this:

Maximum Speedup = 1 / ( f/p + (1 - f) )

Here, f is the parallelisable fraction and p is the number of processors. Problems where f = 1 (the entire computation is parallelisable) are called embarrassingly parallel: bootstrapping, simulation studies and applying the same model to many independent groups all fall into this category. For everything else, the sequential fraction, including the overhead of setting up workers and moving data, sets a ceiling on how much improvement is achievable.
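Plugging numbers into the formula makes the ceiling concrete:

```r
# Maximum speed-up under Amdahl's Law
amdahl <- function(f, p) 1 / (f / p + (1 - f))

amdahl(f = 1.0, p = 8)     # 8: fully parallel work scales with the cores
amdahl(f = 0.9, p = 8)     # ~4.7: a 10% serial share already halves the gain
amdahl(f = 0.9, p = 1e6)   # ~10: no processor count can beat the ceiling
```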

How We Got Here

The current landscape makes more sense with a brief orientation. R 2.14.0 in 2011 brought {parallel} into base R, providing built-in support for both forking and socket clusters along with reproducible random number streams, and it remains the foundation everything else builds on. The {foreach} package with {doParallel} became the most common high-level interface for many years, and is still widely encountered in existing code. The split-apply-combine package {plyr} was an early entry point for parallel data manipulation but is now retired; the recommendation is to use {dplyr} for data frames and {purrr} for list iteration instead. The {future} ecosystem, covered in the next section, is the current best practice for new code.

The Modern Standard: The {future} Ecosystem

The most significant development in R parallel computing in recent years has been the {future} package by Henrik Bengtsson, which provides a unified API for sequential and parallel execution across a wide range of backends. Its central concept is simple: a future is a value that will be computed (possibly in parallel) and retrieved later. What makes it powerful is that you write code once and change the execution strategy by swapping a single plan() call, with no other changes to your code.

library(future)
plan(multisession)  # Use all available cores via background R sessions

The common plans are sequential (the default, no parallelism), multisession (multiple background R processes, works on all platforms including Windows) and multicore (forking, faster but Unix/macOS only). On a cluster, the cluster plan and backends such as {future.batchtools} extend the same interface to remote nodes.

The {future} package itself is a low-level building block. For day-to-day work, three higher-level packages are the main entry points.

{future.apply}: Drop-in Replacements for base R Apply

{future.apply} provides parallel versions of every *apply function in base R, including future_lapply(), future_sapply(), future_mapply(), future_replicate() and more. The conversion from serial to parallel code requires just two lines:

library(future.apply)
plan(multisession)

# Serial
results <- lapply(my_list, my_function)

# Parallel — identical output, just faster
results <- future_lapply(my_list, my_function)

Global variables and packages are automatically identified and exported to workers, which removes the manual clusterExport and clusterEvalQ calls that {parallel} requires.

{furrr}: Drop-in Replacements for {purrr}

{furrr} does the same for {purrr}'s mapping functions. Any map() call can become future_map() by loading the library and setting a plan:

library(furrr)
plan(multisession, workers = availableCores() - 1)

# Serial
results <- map(my_list, my_function)

# Parallel
results <- future_map(my_list, my_function)

Like {future.apply}, {furrr} handles environment export automatically. There are parallel equivalents for all typed variants (future_map_dbl(), future_map_chr(), etc.) and for map2() and pmap() as well. It is the most natural choice for tidyverse-style code that already uses {purrr}.

{futurize}: One-Line Parallelisation

For users who want to parallelise existing code with minimal changes, {futurize} can transpile calls to lapply(), purrr::map() and foreach::foreach() %do% {} into their parallel equivalents automatically.

{foreach} with {doFuture}

The {foreach} package remains widely used, and the modern way to parallelise it is with the {doFuture} backend and the %dofuture% operator:

library(foreach)
library(doFuture)
plan(multisession)

results <- foreach(i = 1:10) %dofuture% {
    my_function(i)
}

This approach inherits all the benefits of {future}, including automatic global variable handling and reproducible random numbers.

The {parallel} Package: Core Functions

The {parallel} package remains part of base R and is the foundation that {future} and most other packages build on. It is useful to know its core functions directly, especially for distributed work across multiple nodes.

Shared memory (single machine, Unix/macOS only):

mclapply(X, FUN, mc.cores = n) is a parallelised lapply that works by forking. It is not supported on Windows, where only mc.cores = 1 is accepted, which amounts to serial execution.

Distributed memory (all platforms, including multi-node):

Function                         Description
makeCluster(n)                   Start n worker processes
clusterExport(cl, vars)          Copy named objects to all workers
clusterEvalQ(cl, expr)           Run an expression (e.g. library(pkg)) on all workers
parLapply(cl, X, FUN)            Parallelised lapply across the cluster
parLapplyLB(cl, X, FUN)          The same, with load balancing for uneven tasks
clusterSetRNGStream(cl, seed)    Set reproducible random seeds on workers
stopCluster(cl)                  Shut down the cluster

Note that detectCores() can return misleading values in HPC environments, reporting the total cores on a node rather than those allocated to your job. The {parallelly} package's availableCores() is more reliable in those settings and is what {furrr} and {future.apply} use internally.
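A minimal round trip through these functions looks like this; my_offset is a made-up global used only to demonstrate the export step.

```r
library(parallel)

cl <- makeCluster(2)                 # two background workers, any platform
clusterSetRNGStream(cl, 123)         # reproducible random numbers on workers

my_offset <- 10
clusterExport(cl, "my_offset")       # copy the global to every worker

res <- parLapply(cl, 1:4, function(x) x^2 + my_offset)
stopCluster(cl)                      # always shut the workers down

unlist(res)                          # 11 14 19 26
```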

A Tidyverse Approach with {multidplyr}

For data frame-centric workflows, {multidplyr} (available on CRAN) provides a {dplyr} backend that distributes grouped data across worker processes. The API has been simplified considerably since older tutorials were written: there is no longer any need to manually add group index columns or call create_cluster(). The current workflow is straightforward.

library(multidplyr)
library(dplyr)

# Step 1: Create a cluster (leave 1–2 cores free)
cluster <- new_cluster(parallel::detectCores() - 1)

# Step 2: Load packages on workers
cluster_library(cluster, "dplyr")

# Step 3: Group your data and partition it across workers
flights_partitioned <- nycflights13::flights %>%
    group_by(dest) %>%
    partition(cluster)

# Step 4: Work with dplyr verbs as normal
results <- flights_partitioned %>%
    summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
    collect()

partition() uses a greedy algorithm to keep all rows of a group on the same worker and balance shard sizes. The collect() call at the end recombines the results into an ordinary tibble in the main session. If you need to use custom functions, load them on each worker with cluster_assign():

cluster_assign(cluster, my_function = my_function)

One important caveat from the official documentation: for basic {dplyr} operations, {multidplyr} is unlikely to give measurable speed-ups unless you have tens or hundreds of millions of rows. Its real strength is in parallelising slower, more complex operations such as fitting models to each group. For large in-memory data with fast transformations, {dtplyr} (which translates {dplyr} to {data.table}) is often a better first choice.

Running R on HPC Clusters

For computations that exceed what a single workstation can provide, university and research HPC clusters are the next step. The core terminology is worth understanding clearly before submitting your first job.

One node is a single physical computer, which may itself contain multiple processors. One processor contains multiple cores. Wall-time is the real-world clock time a job is permitted to run; the job is terminated when this limit is reached, regardless of whether the script has finished. Memory refers to the RAM the job requires. When requesting resources, leave a margin of at least five per cent of RAM for system processes, as exceeding the allocation will cause the job to fail.

Slurm Job Submission

Slurm is the dominant scheduler on modern HPC clusters, including Penn State's Roar Collab system, managed by the Institute for Computational and Data Sciences (ICDS). Jobs are described in a shell script and submitted with sbatch. From R, the {rslurm} package allows Slurm jobs to be created and submitted directly without leaving the R session:

library(rslurm)
sjob <- slurm_apply(my_function, params_df, jobname = "my_job",
                    nodes = 2, cpus_per_node = 8)

Connecting R Workflows to Cluster Schedulers

The {batchtools} package provides Map, Reduce and Filter variants for managing R jobs on PBS, Slurm, LSF and Sun Grid Engine. The {clustermq} package sends function calls as cluster jobs via a single line of code without network-mounted storage. For users already in the {future} ecosystem, {future.batchtools} wraps {batchtools} as a {future} backend, letting you scale from a local plan(multisession) all the way to plan(batchtools_slurm) with no other code changes.

The Broader Ecosystem

The CRAN Task View on High-Performance and Parallel Computing, maintained by Dirk Eddelbuettel and regularly updated, remains the most comprehensive catalogue of R packages in this space. The core packages designated by the Task View are {Rmpi} and {snow}. Beyond these, several areas are worth knowing about.

For large and out-of-memory data, {arrow} provides the Apache Arrow in-memory format with support for out-of-memory processing and streaming. {bigmemory} allows multiple R processes on the same machine to share large matrix objects. {bigstatsr} operates on file-backed matrices via memory-mapped access with parallel matrix operations and PCA.

For pipeline orchestration, the {targets} package constructs a directed acyclic graph of your workflow and orchestrates distributed computing across {future} workers, only re-running steps whose upstream dependencies have changed. For GPU computing, the {tensorflow} package by Allaire and colleagues provides access to the complete TensorFlow API from within R, enabling computation across CPUs and GPUs with a single API.

When it comes to random number reproducibility across parallel workers, the L'Ecuyer-CMRG streams built into {parallel} are available via RNGkind("L'Ecuyer-CMRG"). The {rlecuyer}, {rstream}, {sitmo} and {dqrng} packages provide further alternatives. The {doRNG} package handles reproducible seeds specifically for {foreach} loops.

Choosing the Right Approach

The appropriate tool depends on the shape of the problem and how it fits into your existing code.

If you are already using {purrr}'s map() functions, replacing them with future_map() from {furrr} after plan(multisession) is the path of least resistance. If you use base R's lapply or sapply, {future.apply} provides identical drop-in replacements. Both inherit automatic environment handling, reproducible random numbers and cross-platform compatibility from {future}.

If you are working with grouped data frames in a {dplyr} style and each group operation is computationally substantial, {multidplyr} is a good fit. For fast operations on large data, try {dtplyr} first.

For the largest workloads on institutional clusters, {future} scales directly to HPC environments via plan(cluster) or plan(batchtools_slurm). The {rslurm} and {batchtools} packages provide more direct control over job submission and resource management.

Further Reading

The CRAN Task View on High-Performance and Parallel Computing is the most comprehensive and current reference. The Futureverse website documents the full {future} ecosystem. The {multidplyr} vignette covers the current API in detail. Penn State users can find cluster support through ICDS and the QuantDev group's HPC in R tutorial. The R Special Interest Group on High-Performance Computing mailing list is a further resource for more specialist questions.

Broadening data science horizons: Useful Python packages for working with data

14th October 2021

My response to changes in the technology stack used in clinical research is to develop some familiarity with programming and scripting platforms that complement and compete with SAS, a system with which I have been programming since 2000. While one of these has been R, Python is another that has taken up my attention, and I now have Julia in my sights as well. There may be others to assess in the fullness of time.

While I began to explore the Data Science world in the autumn of 2017, it was in the autumn of 2019 that I began to complete LinkedIn training courses on the subject. Good though they were, I find that I need to actually use a tool to better understand it. At that time, I did get to hear about Python packages like Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn and Beautiful Soup, though it took until the spring of this year for me to start gaining some hands-on experience with using any of these.

During the summer of 2020, I attended a BCS webinar on the CodeGrades initiative, a programming mentoring scheme inspired by the way classical musicianship is assessed. In fact, one of the main progenitors is a trained classical musician and teacher of classical music who turned to Python programming when starting a family to have a more stable income. The approach is that a student selects a project and works their way through it, with mentoring and periodic assessments carried out in a gentle and discursive manner. Of course, the project has to be engaging for the learning experience to stay the course, and that point came through in the webinar.

That is one lesson that resonates with me with subjects as diverse as web server performance and the ongoing pandemic supplying data, and there are other sources of public data to examine as well before looking through my own personal archive gathered over the decades. Though some subjects are uplifting while others are more foreboding, the key thing is that they sustain interest and offer opportunities for new learning. Without being able to dream up new things to try, my knowledge of R and Python would not be as extensive as it is, and I hope that it will help with learning Julia too.

In the main, my own learning has been a solo effort with consultation of documentation along with web searches that have brought me to the likes of Real Python, Stack Abuse, Data Viz with Python and R and others for longer tutorials, as well as threads on Stack Overflow. Usually, the web searching begins when I need a steer on a particular topic or a way to resolve a particular error or warning message, but books are always worth reading even if that is the slower route. While those from the Dummies series or from O'Reilly have proved most useful so far, I do need to read them more completely than I already have; it is all too tempting to go with the "program and search for solutions as you go" approach instead.

To get going, many choose the Anaconda distribution to get Jupyter notebook functionality, but I prefer a more traditional editor, so Spyder has been my tool of choice for Python programming and there are others like PyCharm as well. Because Spyder itself is written in Python, it can be installed using pip from PyPi like other Python packages. It has other dependencies like Pylint for code management activities, but these get installed behind the scenes.

The packages that I first met in 2019 may be the mainstays for doing data science, but I have discovered others since then. It also seems that there is porosity between the worlds of R and Python, so you get some Python packages aping R packages, while R has the Reticulate package for executing Python code. There are Python counterparts to such Tidyverse staples as dplyr and ggplot2 in the form of Siuba and Plotnine, respectively. Though the syntax of these packages is not a direct copy of what is executed in R, it is close enough for there to be added user-friendliness compared to Pandas or Matplotlib. The interoperability does not stop there, for there is SQLAlchemy for connecting to MySQL and other databases (PyMySQL is needed as well) and there also is SASPy for interacting with SAS Viya.

While Python may not have the speed of Julia, there are plenty of packages for working with larger workloads. Of these, Dask, Modin and RAPIDS all have their uses for dealing with data volumes that make Pandas code crawl. As if to prove that there are plenty of libraries for various forms of data analytics, data science, artificial intelligence and machine learning, there also are the likes of Keras, TensorFlow and NetworkX. These are just a selection of what is available, and there is always the possibility of checking out others. It may be tempting to stick with the most popular packages all the time, especially when they do so much, but it never hurts to keep an open mind either.
