Learning R for Data Analysis: Going from the basics to professional practice
R has grown from a specialist statistical language into one of the most widely recognised tools for working with data. Across tutorials, community sites, training platforms and industry resources, it is presented as both a programming language and a software environment for statistical computing, graphics and reporting. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand, and its name draws on the first letter of their first names while also alluding to the Bell Labs language S. It is freely available under the GNU General Public License and runs on Linux, Windows and macOS, which has helped it spread across research, education and industry alike.
What Makes R Distinctive
What makes R notable is its combination of programming features with a strong focus on data analysis. Introductory material, such as the tutorials at Tutorialspoint and Datamentor, repeatedly highlights its support for conditionals, loops, user-defined recursive functions and input and output, but these sit alongside effective data handling, a broad set of operators for arrays, lists, vectors and matrices and strong graphical capabilities. That mixture means R can be used for straightforward scripts and for complex analytical workflows. A beginner may start by printing "Hello, World!" with the print() function, while a more experienced user may move on to regression models, interactive dashboards or automated reporting.
The Learning Progression
Learning materials generally present R in a structured progression. A beginner is first introduced to reserved words, variables and constants, operators and the order in which expressions are evaluated. From there, the path usually moves into flow control through if...else, ifelse(), for, while, repeat and the use of break and next. Functions follow naturally, including return values, environments and scope, recursive functions, infix operators and switch(). Most sources agree that confidence with the syntax and fundamentals is the real starting point, and this early sequence matters because it helps learners become comfortable reading and writing R rather than only copying examples.
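As a rough illustration of that early sequence, the following base R sketch touches each construct once. The function names factorial_rec and describe, and the infix operator %paste%, are invented for this example rather than taken from any of the tutorials mentioned.

```r
# Conditionals: if...else handles a single value, ifelse() is vectorised
x <- 7
parity <- if (x %% 2 == 0) "even" else "odd"
labels <- ifelse(c(1, 2, 3, 4) %% 2 == 0, "even", "odd")

# for loop: next skips an iteration, break leaves the loop entirely
odd_total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 7) break        # stop once i exceeds 7
  odd_total <- odd_total + i
}

# repeat runs until a break is reached
n <- 1
repeat {
  n <- n * 2
  if (n >= 100) break
}

# A recursive function with an explicit return value
factorial_rec <- function(k) {
  if (k <= 1) return(1)
  k * factorial_rec(k - 1)
}

# A user-defined infix operator (hypothetical name)
`%paste%` <- function(a, b) paste(a, b)

# switch() picks a branch by name; the unnamed last value is the default
describe <- function(kind) {
  switch(kind,
    mean   = "average value",
    median = "middle value",
    "unknown statistic"
  )
}

print(parity)
print(factorial_rec(5))
print("Hello," %paste% "World!")
```

Running the script line by line in a console is a reasonable way to see how each construct behaves before combining them.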
After the basics, attention tends to turn to the structures that make R so useful for data work. Vectors, matrices, lists, data frames and factors appear in nearly every introductory course because they are central to how information is stored and manipulated. Object-oriented concepts also emerge quite early in some routes through the language, with classes and objects extending into S3, S4 and reference classes. For someone coming from spreadsheets or point-and-click statistical software, this shift can feel significant, but it also opens the way to more reproducible and flexible analysis.
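A minimal sketch of those structures, using made-up values, might look like the following. The S3 class name "reading" and its format and print methods are hypothetical and serve only to show how a class attribute and a method fit together.

```r
# Core data structures
v  <- c(2.5, 4.0, 6.5)                        # numeric vector
m  <- matrix(1:6, nrow = 2)                   # 2 x 3 matrix, filled column by column
l  <- list(name = "trial", values = v)        # a list can mix types
f  <- factor(c("low", "high", "low"), levels = c("low", "high"))
df <- data.frame(id = 1:3, score = v, group = f)

# Subsetting a data frame column by a factor level
low_mean <- mean(df$score[df$group == "low"])

# S3: an object is a value with a class attribute, and methods are
# ordinary functions named generic.class
reading <- structure(list(value = 21.5, unit = "C"), class = "reading")

format.reading <- function(x, ...) paste0(x$value, " ", x$unit)
print.reading  <- function(x, ...) {
  cat(format(x), "\n")
  invisible(x)
}

print(reading)
```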
Visualisation
Visualisation is another recurring theme in R education. Basic chart types such as bar plots, histograms, pie charts, box plots and strip charts are common early examples because they show how quickly data can be turned into graphics. More advanced lessons widen the scope through plot functions, multiple plots, saving graphics, colour selection and the production of 3D plots.
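To give a flavour of those early examples, this sketch draws three base R charts side by side from the built-in mtcars dataset and saves them to a file. The choice of dataset, the PDF output format and the file name are assumptions made for illustration.

```r
# Save three base R charts into one PDF in the session's temporary directory
out_file <- file.path(tempdir(), "base_plots.pdf")

pdf(out_file, width = 9, height = 4)   # open a graphics device that writes to a file
par(mfrow = c(1, 3))                   # arrange multiple plots in one row

barplot(table(mtcars$cyl),
        main = "Cars by cylinder count",
        xlab = "Cylinders", ylab = "Count", col = "steelblue")

hist(mtcars$mpg,
     main = "Fuel economy",
     xlab = "Miles per gallon", col = "grey80")

boxplot(mpg ~ cyl, data = mtcars,
        main = "Fuel economy by cylinders",
        xlab = "Cylinders", ylab = "Miles per gallon", col = "orange")

dev.off()                              # close the device so the file is written
```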
Beyond base plotting, there is extensive evidence of the central role of {ggplot2} in contemporary R practice. Data Cornering demonstrates this well, with articles covering how to create funnel charts in R using {ggplot2} and how to diversify stacked column chart data label colours, showing how R is used not only to summarise data but also to tell visual stories more clearly. In the pharmaceutical and clinical research space, the PSI VIS-SIG blog is published by the PSI Visualisation Special Interest Group and summarises its monthly Wonderful Wednesday webinars, presenting real-world datasets and community-contributed chart improvements alongside news from the group.
Data Wrangling and the Tidyverse
Much of modern R work is built around data wrangling, and here the {tidyverse} has become especially prominent. Claudia A. Engel's openly published guide Data Wrangling with R (last updated 3rd November 2023) sets out a preparation phase that assumes some basic R knowledge, a recent installation of R and RStudio and the installation of the {tidyverse} package with install.packages("tidyverse") followed by library(tidyverse). It also recommends creating a dedicated RStudio project and downloading CSV files into a data subdirectory, reinforcing the importance of organised project structure.
That same guide then moves through data manipulation with {dplyr}, covering selecting columns and filtering rows, pipes, adding new columns, split-apply-combine, tallying and joining two tables, before moving on to {tidyr} topics such as long and wide table formats, pivot_wider, pivot_longer and exporting data. These topics reflect a broader pattern in the R ecosystem because data import and export, reshaping, combining tables and counting by group recur across teaching resources as they mirror common analytical tasks.
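The following sketch runs through the same verbs on a small invented table. The data, column names and output file name are illustrative assumptions and do not come from Engel's guide.

```r
library(dplyr)
library(tidyr)

# Made-up data for illustration only
visits <- tibble(
  site   = c("A", "A", "B", "B", "C"),
  year   = c(2022, 2023, 2022, 2023, 2023),
  visits = c(120, 150, 80, 95, 60)
)
sites <- tibble(site = c("A", "B", "C"),
                region = c("North", "South", "South"))

# Filter rows, select columns and add a new column, chained with a pipe
recent <- visits %>%
  filter(year == 2023) %>%
  select(site, visits) %>%
  mutate(visits_k = visits / 1000)

# Split-apply-combine: one summary row per site
by_site <- visits %>%
  group_by(site) %>%
  summarise(total = sum(visits), .groups = "drop")

# Tallying rows per group
n_per_year <- visits %>% count(year)

# Joining two tables on a shared key
joined <- visits %>% left_join(sites, by = "site")

# Long to wide and back again
wide <- visits %>%
  pivot_wider(names_from = year, values_from = visits)
long <- wide %>%
  pivot_longer(cols = -site, names_to = "year",
               values_to = "visits", values_drop_na = TRUE)

# Exporting the reshaped table
write.csv(wide, file.path(tempdir(), "visits_wide.csv"), row.names = FALSE)
```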
Applications and Professional Use
The range of applications attached to R is wide, though data science remains the clearest centre of gravity. Educational sources describe R as valuable for data wrangling, visualisation and analysis, often pointing to packages such as {dplyr}, {tidyr}, {ggplot2} and {Shiny}. Statistical modelling is another major strand, with R offering extensible techniques for descriptive and inferential statistics, regression analysis, time series methods and classical tests. Machine learning appears as a further area of growth, supported by a large and expanding package ecosystem. In more advanced contexts, R is also linked with dashboards, web applications, report generation and publishing systems such as Quarto and R Markdown.
R's place in professional settings is underscored by the breadth of organisations and sectors associated with it. Introductory resources mention companies such as Google, Microsoft, Facebook, ANZ Bank, Ford and The New York Times as examples of organisations using R for modelling, forecasting, analysis and visualisation. The NHS-R Community promotes the use of R and open analytics in health and care, building a community of practice for data analysis and data science using open-source software in the NHS and wider UK health and care system. Its resources include reports, blogs, webinars and workshops, books, videos and R packages, with webinar materials archived in a publicly accessible GitHub repository. The R Validation Hub, supported through the pharmaR initiative, is a collaboration to support the adoption of R within a biopharmaceutical regulatory setting and provides tools including the {riskmetric} package, the {riskassessment} app and the {riskscore} package for assessing package quality and risk.
The Wider Ecosystem
The wider ecosystem around R is unusually rich. The R Consortium promotes the growth and development of the R language and its ecosystem by supporting technical and social infrastructure, fostering community engagement and driving industry adoption. It notes that the R language supports over two million users and has been adopted in industries including biotech, finance, research and high technology. Community growth is visible not only through organisations and conferences but through user groups, scholarships, project working groups and local meetups, which matters because learning a language is easier when there is an active support network around it.
Another sign of maturity is the depth of R's package and publication landscape. rdrr.io provides a comprehensive index of over 29,000 CRAN packages alongside more than 2,100 Bioconductor packages, over 2,200 R-Forge packages and more than 76,000 GitHub packages, making it possible to search for packages, functions, documentation and source code in one place. Rdocumentation, powered by DataCamp, covers 32,130 packages across CRAN and Bioconductor and offers a searchable interface for function-level documentation. The Journal of Statistical Software adds a scholarly dimension, publishing open-access articles on statistical computing software together with source code, with full reproducibility mandatory for publication. R-bloggers aggregates R news and tutorials contributed by hundreds of R bloggers, while R Weekly curates a community digest and an accompanying podcast, both helping users keep pace with the steady flow of tutorials, package releases, blog posts and developments across the R world.
Where to Begin
For beginners, one recurring challenge is knowing where to start, and different learning routes reflect different backgrounds. Datamentor points learners towards step-by-step tutorials covering popular topics such as R operators, if...else statements, data frames, lists and histograms, progressing through to more advanced material. R for the Rest of Us offers a staged path through three core courses, Getting Started With R, Fundamentals of R and Going Deeper with R, and extends into nine topics courses covering Git and GitHub, making beautiful tables, mapping, graphics, data cleaning, inferential statistics, package development, reproducibility and interactive dashboards with {Shiny}. The site is explicitly designed for people who may never have coded before and also offers the structured R in 3 Months programme alongside training and consulting. RStudio Education (now part of Posit) outlines six distinct ways to begin learning R, covering installation, a free introductory webinar on tidy statistics, the book R for Data Science, browser-based primers, and further options suited to different learning styles, along with guidance on R Markdown and good project practices.
Despite the variety, the underlying advice is consistent: start by learning the basics well enough to read and write simple code, practise regularly beginning with straightforward exercises and gradually take on more complex tasks, then build projects that matter to you because projects create context and make concepts stick. There is no suggestion that mastery comes from passively reading documentation alone, as practical engagement is treated as essential throughout. The blog Stats and R exemplifies this philosophy well, with the stated aim of making statistics accessible to everyone by sharing, explaining and illustrating statistical concepts and, where appropriate, applying them in R.
That practical engagement can take many forms. Someone interested in data journalism may focus on visualisation and reproducible reporting, while a researcher may prioritise statistical modelling and publishing workflows, and a health analyst may use R for quality assurance, open health data and clinical reporting. Others may work with {Shiny}, package development, machine learning, Git and GitHub or interactive dashboards. The variety shows that R is not confined to a single use case, even if statistics and data science remain the common thread.
Free Learning Resources for R
It is also worth noting that R learning is supported by a great deal of freely available material. Statistics Globe, founded in 2017 by Joachim Schork and now an education and consulting platform, offers more than 3,000 free tutorials and over 1,000 video tutorials on YouTube, spanning R programming, Python and statistical methodology. STHDA (Statistical Tools for High-Throughput Data Analysis) covers basics, data import and export, reshaping, manipulation and visualisation, with material geared towards practical data analysis at every level. Community sites, webinar repositories and newsletters add further layers of accessibility, and even where paid courses exist, the surrounding free ecosystem is substantial.
Taken together, these sources present R as far more than a niche programming language. It is a mature open-source environment with a strong statistical heritage, a practical orientation towards data work and a well-developed community of learners, teachers, developers and organisations. Its core concepts are approachable enough for beginners, yet its package ecosystem and publishing culture support highly specialised and advanced work. For anyone looking to enter data analysis, statistics, visualisation or related areas, R offers a route that begins with simple code and can extend into large-scale analytical workflows.
Other sides to R beyond data analysis: Excel automation, Docker environments, shell commands and `Rscript`
The idea of using R purely as an interactive analysis tool understates what the language can do. Once a workflow is expressed in code, R can collect and transform data, write Excel workbooks, run inside a reproducible containerised environment, call shell scripts and execute unattended from the command line. Each of those capabilities is useful on its own, but together they describe something more ambitious: R as a general-purpose automation layer sitting at the centre of a reporting pipeline. This article draws on four sources to illustrate that picture, beginning with a Business Science tutorial on generating Excel workbooks with {openxlsx} and {tidyquant}, then expanding outwards to cover running RStudio inside Docker, calling shell commands from an interactive R session and executing R scripts from the command line with Rscript.
The Business Science Tutorial
A useful foundation for this workflow comes from a tutorial published on R-bloggers on the 6th of October 2020, contributed by Business Science as part of its R-Tips Weekly series. That tutorial demonstrates how to use the {openxlsx} and {tidyquant} packages together to automate the creation of an Excel workbook. The core idea is straightforward: gather financial data in R, transform it into a summary table, create a chart and then write those outputs into an Excel file programmatically.
The workbook described in the tutorial contains two main outputs. One is a pivot-style summary table showing stock returns broken down by year and symbol, and the other is a stock chart plotted over time. Rather than treating Excel as the place where all data manipulation begins and ends, the workflow shifts gathering and processing into R first, with Excel becoming the delivery format rather than the main engine of the analysis.
Collecting Data with {tidyquant}
The data collection stage in the tutorial uses {tidyquant}, a package designed for importing and working with financial data within the tidyverse. It wraps functionality from packages such as {quantmod}, {xts}, {zoo} and {TTR}, returning results as tidy tibbles that integrate cleanly with standard tidyverse tools. In the tutorial, financial market data are imported and then used to derive annual returns, which are reshaped into a pivot-table-like structure and plotted through time as a stock chart.
This keeps the entire preparation stage inside R before any results are written elsewhere. The tq_get() function serves as the main entry point for retrieving stock price data, accepting ticker symbols and returning results in a consistent tabular format. Keeping data collection in code rather than manual downloads also makes the workflow straightforward to update or extend.
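A minimal sketch of that preparation stage follows. It is not the tutorial's code: the helper yearly_returns() is hypothetical, the return is a simple within-year approximation (last adjusted price over first, minus one) rather than whatever calculation the tutorial applies, and the prices are made-up stand-ins so that the example runs offline. The commented tq_get() call shows how real prices could be fetched when a connection is available.

```r
library(dplyr)

# Hypothetical helper: one return per symbol and calendar year, computed as
# last adjusted price / first adjusted price - 1 within that year
yearly_returns <- function(prices) {
  prices %>%
    mutate(year = as.integer(format(date, "%Y"))) %>%
    arrange(symbol, date) %>%
    group_by(symbol, year) %>%
    summarise(annual_return = last(adjusted) / first(adjusted) - 1,
              .groups = "drop")
}

# With an internet connection, real prices could be retrieved like this:
# prices <- tidyquant::tq_get(c("AAPL", "MSFT"), get = "stock.prices",
#                             from = "2018-01-01", to = "2019-12-31")

# Offline stand-in with invented numbers and ticker names
prices <- data.frame(
  symbol   = rep(c("AAA", "BBB"), each = 2),
  date     = as.Date(c("2019-01-02", "2019-12-31", "2019-01-02", "2019-12-31")),
  adjusted = c(100, 110, 50, 45)
)

returns <- yearly_returns(prices)

# Pivot-table-like layout: one row per year, one column per symbol
returns_wide <- tidyr::pivot_wider(returns,
                                   names_from = symbol,
                                   values_from = annual_return)
print(returns_wide)
```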
Writing Workbooks with {openxlsx}
Once the data and visualisation are prepared, {openxlsx} provides the Excel automation layer. The tutorial describes a six-step process: initialise a workbook, create a worksheet, add the stock plot, add the pivot table, save the workbook and then open it programmatically. That sequence reflects a common pattern in reporting automation, where code assembles an output file from several components before making it available.
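The six steps can be sketched as follows. This is an illustration under stated assumptions rather than the tutorial's own code: the table and chart use invented numbers, the sheet and file names are hypothetical, and the chart is inserted with insertImage() from a PNG saved beforehand, which may differ from the call the tutorial uses.

```r
library(openxlsx)

# Invented summary table and chart standing in for the tutorial's outputs
returns <- data.frame(year = c(2018, 2019),
                      AAA  = c(0.05, 0.12),
                      BBB  = c(-0.02, 0.08))

plot_file <- file.path(tempdir(), "stock_plot.png")
png(plot_file, width = 800, height = 500)
months <- 1:12
plot(months, 100 + 2 * months, type = "l",
     main = "Illustrative price series", xlab = "Month", ylab = "Price")
dev.off()

wb <- createWorkbook()                                   # 1. initialise a workbook
addWorksheet(wb, "stock_report")                         # 2. create a worksheet
insertImage(wb, "stock_report", plot_file,               # 3. add the stock plot
            width = 6, height = 4, startRow = 2, startCol = 2)
writeData(wb, "stock_report", returns,                   # 4. add the summary table
          startRow = 2, startCol = 12)

out_file <- file.path(tempdir(), "stock_report.xlsx")
saveWorkbook(wb, out_file, overwrite = TRUE)             # 5. save the workbook

if (interactive()) openXL(out_file)                      # 6. open it when run interactively
```

Guarding the final step with interactive() keeps the script usable in unattended runs, where there is no desktop application to open the file.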
What makes {openxlsx} particularly convenient is that it works directly with Excel xlsx files, without requiring Excel itself to be open or installed during the creation process. In practical terms, this means an R script can generate a workbook as part of a larger task, whether run manually, scheduled on a machine or incorporated into a reporting pipeline. It is worth noting that {openxlsx} is no longer under active development; the package is maintained and CRAN warnings are fixed, but users starting new projects are encouraged to consider {openxlsx2} as a modern alternative.
The original tutorial also points readers towards a GitHub repository containing the full code and a YouTube walk-through showing the process step by step. Those references underline that the workflow is intended as a repeatable practical skill rather than a one-off demonstration. The tutorial forms part of a weekly series in which readers are invited to pull the latest code from the repository.
Running RStudio Inside Docker
Excel automation is one side of a wider theme in R workflows, namely integrating R with the surrounding operating environment. A good illustration of the broader approach is running RStudio inside Docker, which provides a reproducible computing environment that behaves consistently regardless of the host machine.
Docker needs to be installed first, after which the rocker/verse image can be used to launch a containerised RStudio session. As described in this Docker tutorial for R users, this image already has many useful R packages installed and allows RStudio Server to be accessed through a web browser. The launch command is as follows:
docker run --rm -p 8787:8787 -e PASSWORD=YOURNEWPASSWORD rocker/verse
The -p flag exposes port 8787 so that RStudio Server can be reached in a browser, while --rm ensures the container is deleted when it is shut down, preventing temporary containers from accumulating and consuming disk space. If Docker does not find the image locally, it will search Docker Hub and download it automatically.
Connecting to the running container depends on the operating system and Docker configuration. On Mac or Linux machines, pointing a browser to http://localhost:8787 should work. On Mac or Windows setups using Docker Quickstart Terminal, the IP address is shown at launch (for example http://192.168.99.100:8787). Should the error "Cannot connect to the Docker daemon" appear, running eval "$(docker-machine env default)" may resolve it. Once connected, log in using the username rstudio and the password set at launch.
Container File Systems and Volume Mounting
A key characteristic of Docker containers is that their file systems are temporary by default. Any files created inside a container launched with --rm will be lost when the container is shut down. The Docker tutorial illustrates this by having users create a script and a plot inside a running container, then restarting it with --rm to find them gone. This apparent limitation leads naturally into a more durable arrangement through volume mounting.
A local directory on the host machine can be linked to a directory inside the container using the -v flag, so that files written inside the container are stored on the host. Once that volume is linked, the user can open files from the mounted directory, set a working directory and load data stored on the host. The tutorial uses read.csv to load a CSV file, then loads {ggplot2}, creates a plot with qplot (since deprecated in {ggplot2} 3.4.0 in favour of ggplot()) and saves the result with ggsave to the mounted directory. Files saved in this way persist after the container exits, separating the reproducible environment from persistent project data.
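The R side of that workflow might look like the sketch below. The host and container paths in the comment are hypothetical, tempdir() stands in for the mounted directory so that the example runs anywhere, the data are invented, and ggplot() is used here in place of qplot.

```r
library(ggplot2)

# Inside a container started with a mounted volume, for example (hypothetical paths):
#   docker run --rm -p 8787:8787 -v /path/on/host:/home/rstudio/project \
#     -e PASSWORD=YOURNEWPASSWORD rocker/verse
# project_dir would point at the container side of the mount.
# tempdir() is only a stand-in so the sketch runs outside Docker.
project_dir <- tempdir()

# Create a small CSV to play the role of data already stored on the host
csv_file <- file.path(project_dir, "measurements.csv")
write.csv(data.frame(x = 1:10, y = (1:10)^2), csv_file, row.names = FALSE)

# Load the data, build a plot and save it back into the same directory
dat <- read.csv(csv_file)
p <- ggplot(dat, aes(x = x, y = y)) + geom_point()

plot_file <- file.path(project_dir, "measurements.png")
ggsave(plot_file, plot = p, width = 6, height = 4)
```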
This is an important practical consideration for anyone building automated R workflows. It demonstrates that a containerised environment can provide isolation and consistency without sacrificing continuity of project files, making Docker a useful complement to R-based automation where consistent execution environments matter.
Calling Shell Commands from R
Another route to automation is the use of system commands from an interactive R session. As described in this post by Jay on his Notes blog, base R provides the system() function for this purpose, allowing an R session to call out to the operating system to list files, launch scripts or trigger shell-based tools. A simple example is system("ls"), which lists files in the current working directory from within an R session.
The post illustrates a more practical use case with a shell script called show_notes.sh. That script accepts a source file and a marker string, extracts all lines containing the marker and writes them to a new file. Running system("show_notes.sh explore.R NOTE") from within R would search through explore.R for lines labelled with NOTE and save them to explore.R.NOTE, assuming both files are in the same directory or their paths are provided in full.
If this becomes a regular part of a workflow, the shell script can be bundled inside an R package. The script is placed under an inst/sh subdirectory, and a wrapper function in the package's R directory calls it via system.file() to locate the installed script. The wrapper function, named show_notes() in the example, constructs a command string, appends the source file and marker arguments and runs the command with system(). Once the package is installed and loaded, calling show_notes("explore.R") performs the extraction without the user needing to remember the script location. The post also notes that system2() and the {fs} package are alternatives for similar tasks, though the author had not yet tried them at the time of writing.
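A short sketch of both ideas follows, assuming a Unix-like shell. The package name "mynotes" and the helper build_notes_command() are hypothetical, and the wrapper is an approximation of the pattern described rather than the post's actual code.

```r
# Capture the output of a shell command as a character vector
listing <- system("ls", intern = TRUE)

# system2() takes the command and its arguments separately
greeting <- system2("echo", args = c("hello", "from", "the", "shell"),
                    stdout = TRUE)

# Hypothetical helper: build a safely quoted command string
build_notes_command <- function(script, file, marker = "NOTE") {
  paste(shQuote(script), shQuote(file), shQuote(marker))
}

# Hypothetical wrapper: locate a script installed under inst/sh of a package
# (assumed here to be called "mynotes") and run it on a source file
show_notes <- function(file, marker = "NOTE", package = "mynotes") {
  script <- system.file("sh", "show_notes.sh", package = package)
  if (script == "") {
    stop("show_notes.sh was not found in package ", package)
  }
  system(build_notes_command(script, file, marker))
}

# The command string can be inspected without running anything
cmd <- build_notes_command("show_notes.sh", "explore.R")
print(cmd)
```

Quoting each argument with shQuote() is a defensive choice for file names that contain spaces; the post itself may assemble the command differently.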
Batch Execution with Rscript
For running R scripts outside an interactive session, Rscript provides a command-line front end. Its synopsis, as documented in the Rscript man page, is Rscript [options] [-e expr] file [args], allowing either an expression or a file of R code to be executed directly from a terminal. Additional arguments passed on the command line can then be accessed within the script using commandArgs().
Several flags influence how predictable a given run will be. The --vanilla flag suppresses saved workspaces, profile files and environment settings, helping ensure a script behaves consistently regardless of the local user environment. Other options include --verbose, --version, --help, --default-packages, --no-environ, --no-site-file and --no-init-file. Together, these controls make Rscript well suited to automated or scheduled execution where a clean, reproducible session is required.
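As an illustration, a script intended for Rscript might read its arguments as shown below. The file name report.R, the helper parse_args() and the default values are assumptions made for this sketch.

```r
# report.R (hypothetical file name), which could be run from a terminal as:
#   Rscript --vanilla report.R MSFT 2019

# Hypothetical helper: turn positional arguments into named options,
# falling back to defaults when an argument is missing
parse_args <- function(args) {
  list(
    symbol = if (length(args) >= 1) args[[1]] else "AAPL",
    year   = if (length(args) >= 2) as.integer(args[[2]]) else 2020L
  )
}

# trailingOnly = TRUE returns only the arguments that follow the script name
opts <- parse_args(commandArgs(trailingOnly = TRUE))

cat("Building report for", opts$symbol, "in", opts$year, "\n")
```

Keeping the parsing in a small function makes the defaults easy to check interactively before the script is scheduled.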
When considered alongside the Excel example, Rscript illustrates how a reporting workflow might be run unattended. An R script that gathers financial data with {tidyquant}, creates a chart, writes a workbook with {openxlsx} and saves it to disk could be launched from the command line with Rscript, and that invocation could itself be triggered by either a scheduler or a shell process. This closes the loop between interactive development and fully automated deployment.
Putting the Pieces Together for Reproducible R Reporting
Across these examples, a coherent picture emerges of R as an orchestration layer rather than simply an interactive analysis tool. It can collect and transform data, produce visualisations, write Excel workbooks, run inside a containerised environment, call shell commands and execute as a standalone script. Each piece serves a different purpose, but together they show how a scripted workflow can cover the entire journey from data collection to delivered report.
For those primarily interested in Excel automation, the Business Science tutorial on R-bloggers remains the clearest practical entry point. It demonstrates that an Excel workbook can be generated from R by importing financial data with {tidyquant}, building a table of stock returns by year and symbol, creating a stock chart and inserting both into a workbook with {openxlsx}. The surrounding material on Docker, shell commands and Rscript adds useful depth, demonstrating that once a reporting task is expressed in code, there are several ways to run and maintain it, whether in a browser-based RStudio session inside a container, combined with shell tools or executed from the command line in a clean session.
How to centre titles, remove gridlines and write reusable functions in {ggplot2}
{ggplot2} is widely used for data visualisation in R because it offers a flexible, layered grammar for constructing charts. A plot can begin with a straightforward mapping of data to axes and then be refined with titles, themes and annotations until it better serves the message being communicated. That flexibility is one of the greatest strengths of {ggplot2}, though it also means that many useful adjustments are small, specific techniques that are easy to overlook when first learning the package.
Three of those techniques fit together particularly well. The first is centring a plot title, a common formatting need because {ggplot2} titles are left-aligned by default. The second is removing grid lines and background elements to produce a cleaner, less cluttered appearance. The third is wrapping familiar {ggplot2} code into a reusable function so that the same visual style can be applied across different datasets without rewriting everything each time. Together, these approaches show how a basic plot can move from a default graphic to something more polished and more efficient to reproduce.
Centring the Plot Title
A clear starting point comes from a short tutorial by Luis Serra at Ubiqum Code Academy, published on RPubs, which focuses on one specific goal: centring the title of a {ggplot2} output. The example uses the well-known Iris dataset, which is included with R and contains 150 observations across five variables. Those variables are Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species, with Species stored as a factor containing three levels (setosa, versicolor and virginica), each represented by 50 samples.
The first step is to load {ggplot2} and inspect the structure of the data using library(ggplot2), followed by data("iris") and str(iris). The structure output confirms that the first four columns are numeric, and the fifth is categorical. That distinction matters because it makes the dataset well suited to a scatter plot with a colour grouping, allowing two continuous variables to be compared while species differences are shown visually.
The initial chart plots petal length against petal width, with points coloured by species:
ggplot() + geom_point(data = iris, aes(x = Petal.Width, y = Petal.Length, color = Species))
This produces a simple scatter plot and serves as the base for later refinements. Even in this minimal form, the grammar is clear: the data are supplied to geom_point(), the x and y aesthetics are mapped to Petal.Width and Petal.Length, and colour is mapped to Species.
Once the scatter plot is in place, a title is added using ggtitle("My dope plot"), appended to the existing plotting code. This creates a title above the graphic, but it remains left-justified by default. That alignment is not necessarily wrong, as left-aligned titles work well in many visual contexts, yet there are situations where a centred title gives a more balanced appearance, particularly for standalone blog images, presentation slides or teaching examples.
The adjustment required is small and direct. {ggplot2} allows title styling through its theme system, and horizontal justification for the title is controlled through plot.title = element_text(hjust = 0.5). Setting hjust to 0.5 centres the title within the plot area, whilst 0 aligns it to the left and 1 to the right. The revised code becomes:
ggplot() +
  geom_point(data = iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
  ggtitle("My dope plot") +
  theme(plot.title = element_text(hjust = 0.5))
That small example also opens the door to a broader understanding of {ggplot2} themes. Titles, text size, panel borders, grid lines and background fills are all managed through the same theming system, which means that once one element is adjusted, others can be modified in a similar way.
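To illustrate how several elements can be adjusted in one theme() call, the sketch below centres the title and also changes its size and weight, the axis title size and the legend position. The specific values and the title text are arbitrary choices for this example, not part of the original tutorial.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length, colour = Species)) +
  geom_point() +
  ggtitle("Petal dimensions by species") +
  theme(
    plot.title      = element_text(hjust = 0.5, size = 16, face = "bold"),  # centred, larger, bold
    axis.title      = element_text(size = 12),                              # axis label size
    legend.position = "bottom"                                              # move the legend
  )

# print(p) would draw the chart in an interactive session
```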
Removing Grids and Background Elements
A second set of techniques, demonstrated by Felix Fan in a concise tutorial on his personal site, begins by generating simple data rather than using a built-in dataset. The code creates a sequence from 1 to 20 with a <- seq(1, 20), calculates the fourth root with b <- a^0.25 and combines both into a data frame using df <- as.data.frame(cbind(a, b)). The plot is then created as a reusable object:
myplot <- ggplot(df, aes(x = a, y = b)) + geom_point()
From there, several styling approaches become available. One of the quickest is theme_bw(), which removes the default grey background and replaces it with a cleaner black-and-white theme. This does not strip the graphic down completely, but it does provide a more neutral base and is often a practical shortcut when the standard {ggplot2} appearance feels too heavy.
More selective adjustments can also be made independently. Grid lines can be removed with the following:
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
This suppresses both major and minor grid lines, whilst leaving other parts of the panel unchanged. Border lines can be removed separately with theme(panel.border = element_blank()), though that does not affect the background colour or the grid. Likewise, the panel background can be cleared with theme(panel.background = element_blank()), which removes the panel fill and border lines but leaves grid lines in place. Each of these commands targets a different component, so they can be combined depending on the desired result.
If the background and border are removed, axis lines can be added back for clarity using theme(axis.line = element_line(colour = "black")). This is an important finishing step in a stripped-back plot because removing too many panel elements can leave the chart without enough visual structure. The explicit axis line restores a frame of reference without reintroducing the full border box.
Two combined approaches are worth knowing. The first uses a single custom theme call:
myplot + theme(
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.background = element_blank(),
  axis.line = element_line(colour = "black")
)
The second starts from theme_bw() and then removes the border and grids whilst adding axis lines:
myplot + theme_bw() + theme(
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  axis.line = element_line(colour = "black")
)
Both approaches produce a cleaner chart, though they begin from slightly different defaults. The practical lesson is that {ggplot2} styling is modular, so there is often more than one route to a similar visual result.
This matters because chart design is rarely only about appearance. Cleaner formatting can make a chart easier to read by reducing distractions and placing more emphasis on the data itself. A centred title, a restrained background and the selective use of borders all influence how quickly the eye settles on what is important.
Building Reusable Custom Plot Functions
A third area extends these ideas further by showing how to build custom {ggplot2} functions in R, a topic covered in depth by Sharon Machlis in a tutorial published on InfoWorld. The central problem discussed is the mismatch that used to make this awkward: tidyverse functions typically use unquoted column names, whilst base R functions generally expect quoted names. This tension became especially noticeable when users wanted to write their own plotting functions that accepted a data frame and column names as arguments.
The example in that article uses Zillow data containing estimated median home values. After loading {dplyr} and {ggplot2}, a horizontal bar chart is created to show home values by neighbourhood in Boston, with bars ordered from highest to lowest values, outlined in black and filled in blue:
ggplot(data = bos_values, aes(x = reorder(RegionName, Zhvi), y = Zhvi)) +
  geom_col(color = "black", fill = "#0072B2") +
  xlab("") + ylab("") +
  ggtitle("Zillow Home Value Index by Boston Neighborhood") +
  theme_classic() +
  theme(plot.title = element_text(size = 24)) +
  coord_flip()
The next step is to turn that pattern into a function. An initial attempt passes unquoted column names but does not work as intended because of the underlying tension between standard R evaluation and the non-standard evaluation of {ggplot2}. The solution came with the introduction of the tidy evaluation {{ operator, commonly known as "curly-curly", in {rlang} version 0.4.0. As noted in the official tidyverse announcement, this operator abstracts the previous two-step quote-and-unquote process into a single interpolation step. Once library(rlang) is loaded, column references inside the plotting code are wrapped in double curly braces:
library(rlang)
# {{ }} forwards the unquoted column arguments into aes()
mybarplot <- function(mydf, myxcol, myycol, mytitle) {
  ggplot2::ggplot(data = mydf, aes(x = reorder({{ myxcol }}, {{ myycol }}), y = {{ myycol }})) +
    geom_col(color = "black", fill = "#0072B2") +
    xlab("") + ylab("") +
    coord_flip() +
    ggtitle(mytitle) +
    theme_classic() +
    theme(plot.title = element_text(size = 24))
}
With that change in place, the function can be called with unquoted column names, just as they would appear in many tidyverse functions:
mybarplot(bos_values, RegionName, Zhvi, "Zillow Home Value Index by Boston Neighborhood")
The function returns an ordinary ggplot object, which is particularly useful in practice. The resulting plot can be stored and extended further, for example by adding data labels on the bars with geom_text() and the scales::comma() function. A custom plotting function does not lock the user into a fixed result; it provides a well-designed starting point that can still be extended with additional {ggplot2} layers.
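As a rough illustration of extending a stored plot, the sketch below rebuilds a shortened version of the function and adds a label layer. The bos_values data frame here is a made-up stand-in, not the Zillow data, and the label position and colour are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in for the Zillow data; the values are invented for illustration.
bos_values <- data.frame(
  RegionName = c("Back Bay", "Roxbury", "Allston"),
  Zhvi = c(1200000, 550000, 700000)
)

# Shortened version of the custom function; {{ }} forwards unquoted column names.
mybarplot <- function(mydf, myxcol, myycol, mytitle) {
  ggplot(mydf, aes(x = reorder({{ myxcol }}, {{ myycol }}), y = {{ myycol }})) +
    geom_col(color = "black", fill = "#0072B2") +
    coord_flip() +
    ggtitle(mytitle) +
    theme_classic()
}

# Store the result instead of printing it straight away.
p <- mybarplot(bos_values, RegionName, Zhvi, "Home values (made-up data)")

# Add a text layer with comma-formatted values; it inherits x and y from the plot.
p_labelled <- p +
  geom_text(aes(label = scales::comma(Zhvi)), hjust = 1.1, color = "white")
```

Printing p_labelled draws the chart with the labels placed just inside the end of each bar.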
Putting the Three Techniques Together in {ggplot2}
Seen as a progression, these examples build on one another in a logical way. The first shows how to centre a title with theme(plot.title = element_text(hjust = 0.5)). The second shows how to simplify a chart by removing grids, borders and background elements whilst restoring axis lines where needed. The third scales those preferences up by packaging them inside a reusable function. What begins as a one-off styling adjustment can therefore become part of a repeatable workflow.
These techniques also reflect a wider culture around R graphics. Resources such as the R Graph Gallery, created by Yan Holtz, have helped make this style of incremental learning more accessible by offering reproducible examples across a wide range of chart types. The gallery presents over 400 R-based graphics, with a strong emphasis on {ggplot2} and the tidyverse, and organises them into nearly 50 chart families and use cases. Its broader message is that effective visualisation is often the result of small, deliberate decisions rather than dramatic reinvention.
For anyone working with {ggplot2}, that is a helpful principle to keep in mind. A centred title may seem minor, just as removing a panel grid may seem cosmetic, yet these changes can improve clarity and consistency across a body of work. When those preferences are wrapped into a function, they also save time and reduce repetition, connecting plot styling directly to good code design.
From summary statistics to published reports with R, LaTeX and TinyTeX
For anyone working across LaTeX, R Markdown and data analysis in R, there comes a point where separate tools begin to converge. Data has to be summarised, those summaries have to be turned into presentable tables and the finished result has to compile into a report that looks appropriate for its audience rather than a console dump. These notes follow that sequence, moving from the practical business of summarising data in R through to tabulation and then on to the publishing infrastructure that makes clean PDF and Word output possible.
Summarising Data with {dplyr}
The starting point for many analyses is a quick exploration of the data at hand. One useful example uses the anorexia dataset from the {MASS} package together with {dplyr}. The dataset contains weight change data for young female anorexia patients, divided into three treatment groups: Cont for the control group, CBT for cognitive behavioural treatment and FT for family treatment.
The basic manipulation starts by loading {MASS} and {dplyr}, then using filter() to create separate subsets for each treatment group. From there, mutate() adds a wtDelta column defined as Postwt - Prewt, giving the weight change for each patient. group_by(Treat) prepares the data for grouped summaries, and arrange(wtDelta) sorts the rows by weight change (note that arrange() ignores grouping unless .by_group = TRUE is set). The notes then show how the pipe operator, %>%, which {dplyr} makes available, improves readability by chaining these operations. The final summary table uses summarize() to compute the number of observations, the mean weight change and the standard deviation within each treatment group. The reported values are: for CBT, count 29, average weight change 3.006897 and standard deviation 7.308504; for Cont, count 26, average weight change -0.450000 and standard deviation 7.988705; and for FT, count 17, average weight change 7.264706 and standard deviation 7.157421.
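The chained version of that workflow might look like the sketch below. The output column names count, avg_wtDelta and sd_wtDelta are illustrative choices, not necessarily those used in the original notes.

```r
library(MASS)   # provides the anorexia data
library(dplyr)  # loaded second so that dplyr's verbs are not masked

wt_summary <- anorexia %>%
  mutate(wtDelta = Postwt - Prewt) %>%   # weight change per patient
  group_by(Treat) %>%                    # one group per treatment
  summarize(
    count = n(),
    avg_wtDelta = mean(wtDelta),
    sd_wtDelta = sd(wtDelta)
  )

wt_summary
```

The result has one row per treatment group, in the order of the factor levels CBT, Cont and FT.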
That example is not presented as a complete statistical analysis. Instead, it serves as a quick exploratory route into the data, with the wording remaining appropriately cautious and noting that this is only a glance and not a rigorous analysis.
Choosing an R Package for Descriptive Summaries
The question of how best to summarise data opens up a broader comparison of R packages for descriptive statistics. A useful review sets out a common set of needs: a count of observations, the number and types of fields, transparent handling of missing data and sensible statistics that depend on the data type. Numeric variables call for measures such as mean, median, range and standard deviation, perhaps with percentiles. Categorical variables call for counts of levels and some sense of which categories dominate.
Base R's summary() does some of this reasonably well. It distinguishes categorical from numeric variables and reports distributions or numeric summaries accordingly, while also highlighting missing values. Yet, it does not show an overall record count, lacks standard deviation and is not especially tidy or ready for tools such as kable. Several contributed packages aim to improve on that. Hmisc::describe() gives counts of variables and observations, handles both categorical and numerical data and reports missing values clearly, showing the highest and lowest five values for numeric data instead of a simple range. pastecs::stat.desc() is more focused on numeric variables and provides confidence intervals, standard errors and optional normality tests. psych::describe() includes categorical variables but converts them to numeric codes by default before describing them, which the package documentation itself advises should be interpreted cautiously. psych::describeBy() extends this approach to grouped summaries and can return a matrix form with mat = TRUE.
Among the packages reviewed, {skimr} receives especially strong attention for balancing readability and downstream usefulness. skim() reports record and variable counts clearly, separates variables by type and includes missing data and standard summaries in an accessible layout. It also works with group_by() from {dplyr}, making grouped summaries straightforward to produce. More importantly for analytical workflows, the skim output can be treated as a tidy data frame in which each combination of variable and statistic is represented in long form, meaning the results can be filtered, transformed and plotted with standard tidyverse tools such as {ggplot2}.
{summarytools} is presented as another strong option, though with a distinction between its functions. descr() handles numeric variables and can be converted to a data frame for use with kable, while dfSummary() works across entire data frames and produces an especially polished summary. At the time of the original notes, dfSummary() was considered slow. The package author subsequently traced the issue, as documented in the same review, to an excessive number of histogram breaks being generated for variables with large values, imposing a limit to resolve it. The package also supports output through view(dfSummary(data)), which yields an attractive HTML-style summary.
Grouped Summary Table Packages
Once the data has been summarised, the next step is turning those summaries into formal tables. A detailed comparison covers a number of packages specifically designed for this purpose: {arsenal}, {qwraps2}, {Amisc}, {table1}, {tangram}, {furniture}, {tableone}, {compareGroups} and {Gmisc}. {arsenal} is described as highly functional and flexible, with tableby() able to create grouped tables in only a few lines and then be customised through control objects that specify tests, display statistics, labels and missing value treatment. {qwraps2} offers a lot of flexibility through nested lists of summary specifications, though at the cost of more code. {Amisc} can produce grouped tables and works with pander::pandoc.table(), but is noted as not being on CRAN. {table1} creates attractive tables with minimal code, though its treatment of missing values may not suit every use case. {tangram} produces visually appealing HTML output and allows custom rows such as missing counts to be inserted manually, although only HTML output is supported. {furniture} and {tableone} both support grouped table creation, but {tableone} in particular is notable because it is widely used in biomedical research for baseline characteristics tables.
The {tableone} package deserves separate mention because it is designed to summarise continuous and categorical variables in one table, a common need in medical papers. As the package introduction explains, CreateTableOne() can be used on an entire dataset or on a selected subset of variables, with factorVars specifying variables that are coded numerically but should be treated as categorical. The package can display all levels for categorical variables, report missing values via summary() and switch selected continuous variables to non-normal summaries using medians and interquartile ranges instead of means and standard deviations. For grouped comparisons, it prints p-values by default and can switch to non-parametric tests or Fisher's exact test where needed. Standardised mean differences can also be shown. Output can be captured as a matrix and written to CSV for editing in Excel or Word.
Styling and Exporting Tables
With tables constructed, the focus shifts to how they are presented and exported. As Hao Zhu's conference slides explain, the {kableExtra} package builds on knitr::kable() and provides a grammar-like approach to adding styling layers, importing the pipe %>% symbol from {magrittr} so that formatting functions can be added in the same way that layers are added in {ggplot2}. It supports themes such as kable_paper, kable_classic, kable_minimal and kable_material, as well as options for striping, hover effects, condensed layouts, fixed headers, grouped rows and columns, footnotes, scroll boxes and inline plots.
Table output is often the visible end of an analysis, and a broader review of R table packages covers a range of approaches that go well beyond the default output. In R Markdown, packages such as {gt}, {kableExtra}, {formattable}, {DT}, {reactable}, {reactablefmtr} and {flextable} all offer richer possibilities. Some are aimed mainly at HTML output, others at Word. {DT} in particular supports highly customised interactive tables with searching, filtering and cell styling through more advanced R and HTML code. {flextable} is highlighted as the strongest option when knitting to Word, given that the other packages are primarily designed for HTML.
For users working in Word-heavy settings, older but still practical workflows remain relevant too. One approach is simply to write tables to comma-separated text files and then paste and convert the content into a Word table. Another route is through {arsenal}'s write2 functions, designed as an alternative to SAS ODS. The convenience functions write2word(), write2html() and write2pdf() accept a wide range of objects: tableby, modelsum, freqlist and comparedf from {arsenal} itself, as well as knitr::kable(), xtable::xtable() and pander::pander_return() output. One notable constraint is that {xtable} is incompatible with write2word(). Beyond single tables, the functions accept a list of objects so that multiple tables, headers, paragraphs and even raw HTML or LaTeX can all be combined into a single output document. A yaml() helper adds a YAML header to the output, and a code.chunk() helper embeds executable R code chunks, while the generic write2() function handles formats beyond the three convenience wrappers, such as RTF.
The Publishing Infrastructure: CTAN and Its Mirrors
Producing PDF output from R Markdown depends on a working LaTeX installation, and the backbone of that ecosystem is CTAN, the Comprehensive TeX Archive Network. CTAN is the main archive for TeX and LaTeX packages and is supported by a large collection of mirrors spread around the world. The purpose of this distributed system is straightforward: users are encouraged to fetch files from a site that is close to them in network terms, which reduces load and tends to improve speed.
That global spread is extensive. The CTAN mirror list organises sites alphabetically by continent and then by country, with active sites listed across Africa, Asia, Europe, North America, Oceania and South America. Africa includes mirrors in South Africa and Morocco. Asia has particularly wide coverage, with many mirrors in China as well as sites in Korea, Hong Kong, India, Indonesia, Japan, Singapore, Taiwan, Saudi Arabia and Thailand. Europe is especially rich in mirrors, with hosts in Denmark, Germany, Spain, France, Italy, the Netherlands, Norway, Poland, Portugal, Romania, Switzerland, Finland, Sweden, the United Kingdom, Austria, Greece, Bulgaria and Russia. North America includes Canada, Costa Rica and the United States, while Oceania covers Australia and South America includes Brazil and Chile.
The details matter because different mirrors expose different protocols. While many support HTTPS, some also offer HTTP, FTP or rsync. CTAN provides a mirror multiplexer to make the common case simpler: pointing a browser to https://mirrors.ctan.org/ results in automatic redirection to a mirror in or near the user's country. There is one caveat. The multiplexer always redirects to an HTTPS mirror, so anyone intending to use another protocol needs to select manually from the mirror list. That is why the full listings still include non-HTTPS URLs alongside secure ones.
There is also an operational side to the network that is easy to overlook when things are working well. CTAN monitors mirrors to ensure they are current, and if one falls behind, then mirrors.ctan.org will not redirect users there. Updates to the mirror list can be sent to ctan@ctan.org. The master host of CTAN is ftp.dante.de in Cologne, Germany, with rsync access available at rsync://rsync.dante.ctan.org/CTAN/ and web access on https://ctan.org/. For those who want to contribute infrastructure rather than simply use it, CTAN also invites volunteers to become mirrors.
TinyTeX: A Lightweight LaTeX Distribution
This infrastructure becomes much more tangible when looking at a lightweight TeX distribution such as TinyTeX. TinyTeX is a lightweight, cross-platform, portable and easy-to-maintain LaTeX distribution based on TeX Live. It is small in size but intended to function well in most situations, especially for R users. Its appeal lies in not requiring users to install thousands of packages they will never use, installing them as needed instead. This also means installation can be done without administrator privileges, which removes one of the more familiar barriers around traditional TeX setups. TinyTeX can even be run from a flash drive.
For R users, TinyTeX is closely tied to the {tinytex} R package. The distinction is important: tinytex in lower case refers to the R package, while TinyTeX refers to the LaTeX distribution. Installation is intentionally direct. After installing the R package with install.packages('tinytex'), a user can run tinytex::install_tinytex(). Uninstallation is equally simple with tinytex::uninstall_tinytex(). For the average R Markdown user, that is often enough. Once TinyTeX is in place, PDF compilation usually requires no further manual package management.
There is slightly more to know if the aim is to compile standalone LaTeX documents from R. The {tinytex} package provides wrappers such as pdflatex(), xelatex() and lualatex(). These functions detect required LaTeX packages that are missing and install them automatically by default. In practical terms, that means a small example document can be written to a file and compiled with tinytex::pdflatex('test.tex') without much concern about whether every dependency has already been installed. For R users, this largely removes the old pattern of cryptic missing-package errors followed by manual searching through TeX repositories.
Developers may want more than the basics, and TinyTeX has a path for that as well. A helper such as tinytex:::install_yihui_pkgs() installs a collection of packages needed for building the PDF vignettes of many CRAN packages. That is a specific convenience rather than a universal requirement, but it illustrates the design philosophy behind TinyTeX: keep the initial footprint light and offer ways to add what is commonly needed later.
Using TinyTeX Outside R
For users outside R, TinyTeX still works, but the focus shifts to the command-line utility tlmgr. The documentation is direct in its assumptions: if command-line work is unwelcome, another LaTeX distribution may be a better fit. The central command is tlmgr, and much of TinyTeX maintenance can be expressed through it.
On Linux, installation places TinyTeX in $HOME/.TinyTeX and creates symlinks for executables such as pdflatex under $HOME/bin or $HOME/.local/bin if it exists. The installation script is fetched with wget and piped to sh, after first checking that Perl is correctly installed. On macOS, TinyTeX lives in ~/Library/TinyTeX, and users without write permission to /usr/local/bin may need to change ownership of that directory before installation. Windows users can run a batch file, install-bin-windows.bat, and the default installation directory is %APPDATA%/TinyTeX unless APPDATA contains spaces or non-ASCII characters, in which case %ProgramData% is used instead. PowerShell version 3.0 or higher is required on Windows.
Uninstallation follows the same self-contained logic. On Linux and macOS, tlmgr path remove is followed by deleting the TinyTeX folder. On Windows, tlmgr path remove is followed by removing the installation directory. This simplicity is a deliberate contrast with larger LaTeX distributions, which are considerably more involved to remove cleanly.
Maintenance and Package Management
Maintenance is where TinyTeX's relationship to CTAN and TeX Live becomes especially visible. If a document fails with an error such as File 'times.sty' not found, the fix is to search for the package containing that file with tlmgr search --global --file "/times.sty". In the example given, that identifies the psnfss package, which can then be installed with tlmgr install psnfss. If the package includes executables, tlmgr path add may also be needed. An alternative route is to upload the error log to the yihui/latex-pass GitHub repository, where package searching is carried out remotely.
If the problem is less obvious, a full update cycle is suggested: tlmgr update --self --all, then tlmgr path add and fmtutil-sys --all. R users have wrappers for these tasks too, including tlmgr_search(), tlmgr_install() and tlmgr_update(). Some situations still require a full reinstallation. If TeX Live reports Remote repository is newer than local, TinyTeX should be reinstalled manually, which for R users can be done with tinytex::reinstall_tinytex(). Similarly, when a TeX Live release is frozen in preparation for a new one, the advice is simply to wait and then reinstall when the next release is ready.
The motivation behind TinyTeX is laid out with unusual clarity. Traditional LaTeX distributions often present a choice between a small basic installation that soon proves incomplete and a very large full installation containing thousands of packages that will never be used. TinyTeX is framed as a way around those frustrations by building on TeX Live's portability and cross-platform design while stripping away unnecessary size and complexity. The acknowledgements also underline that TinyTeX depends on the work of the TeX Live team.
Connecting the R Workflow to a Finished Report
Taken together, these notes show how closely summarisation, tabulation and publishing are linked. {dplyr} and related tools make it easy to summarise data quickly, while a wide range of R packages then turn those summaries into tables that are not only statistically useful but also presentable. CTAN and its mirrors keep the TeX ecosystem available and current across the world, and TinyTeX builds on that ecosystem to make LaTeX more manageable, especially for R users. What begins with a grouped summary in the console can end with a polished report table in HTML, PDF or Word, and understanding the chain between those stages makes the whole workflow feel considerably less mysterious.
Some R functions for working with dates, strings and data frames
Working with data in R often comes down to a handful of recurring tasks: combining text, converting dates and times, reshaping tables and creating summaries that are easier to interpret. This article brings together several strands of base R and tidyverse-style practice, with a particular focus on string handling, date parsing, subsetting and simple time series smoothing. Taken together, these functions form part of the everyday toolkit for data cleaning and analysis, especially when imported data arrive in inconsistent formats.
String Building
At the simplest end of this toolkit is paste(), a base R function for concatenating character vectors. Its purpose is straightforward: it converts one or more R objects to character vectors and joins them together, separating terms with the string supplied in sep, which defaults to a space. If the inputs are vectors, concatenation happens term by term, so paste("A", 1:6, sep = "") yields "A1" through "A6", while paste(1:12) behaves much like as.character(1:12). There is also a collapse argument, which takes the resulting vector and combines its elements into a single string separated by the chosen delimiter, making paste() useful both for constructing values row by row and for creating one final display string from many parts.
That basic string-building role becomes more important when dates and times are involved because imported date-time data often arrive as text split across multiple columns. A common example is having one column for a date and another for a time, then joining them with paste(dates, times) before parsing the result. In that sense, the paste() function often acts as a bridge between messy raw input and structured date-time objects. It is simple, but it appears repeatedly in data preparation pipelines.
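A short sketch of both uses follows. The dates and times vectors are made-up examples of text columns, and the UTC time zone is an assumption chosen to keep the result reproducible.

```r
# Term-by-term concatenation; sep = "" removes the default space.
ids <- paste("A", 1:6, sep = "")

# collapse joins the resulting vector into one string.
id_string <- paste(ids, collapse = ", ")

# Hypothetical date and time columns that arrived as separate text fields.
dates <- c("2024-01-15", "2024-01-16")
times <- c("08:30:00", "17:45:10")

# Join them with the default space, then parse the combined text.
stamps <- paste(dates, times)
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

Here ids holds "A1" to "A6", id_string is the single string "A1, A2, A3, A4, A5, A6" and parsed is a POSIXct vector of two date-times.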
Date-Time Conversion
For date-time conversion, base R provides strptime(), strftime() and format() methods for POSIXlt and POSIXct objects. These functions convert between character representations and R date-time classes, and they are central to understanding how R reads and prints times. strptime() takes character input and converts it to an object of class "POSIXlt", while strftime() and format() move in the other direction, turning date-time objects into character strings. The as.character() method for "POSIXt" classes fits into the same family, and the essential idea is that the date-time value and its textual representation are separate things, with the format string defining how R should interpret or display that representation.
Format strings rely on conversion specifications introduced with %, and many of these are standard across systems. %Y means a four-digit year with century, %y means a two-digit year, %m is a month, %d is the day of a month and %H:%M:%S captures hours, minutes and seconds in 24-hour time. %F is equivalent to %Y-%m-%d, which is the ISO 8601 date format. %b and %B represent abbreviated and complete month names, while %a and %A do the same for weekdays. Locale matters here because month names, weekday names, AM/PM indicators and some separators depend on the LC_TIME locale, meaning a date string like "1jan1960" may parse correctly in one locale and return NA in another unless the locale is set appropriately.
R's defaults generally follow ISO 8601 rules, so dates print as "2001-02-28" and times as "14:01:02", though R inserts a space between date and time by default. Several details matter in practice. strptime() processes input strings only as far as needed for the specified format, so trailing characters are ignored. Unspecified hours, minutes and seconds default to zero, and an unspecified year, month or day is taken from the current date. The exception is that when a month is given the day must be given too, since the current day of the month may not exist in the specified month. Invalid calendar dates such as "2010-02-30 08:00" produce results whose components are all NA.
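These behaviours can be checked with a few lines of base R. The input strings below are invented for illustration, and tz = "UTC" is set only to keep the results independent of the local time zone.

```r
# Character input becomes a POSIXlt object under the given format.
x <- strptime("2001-02-28 14:01:02", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# format() goes the other way, from date-time object to text.
# The month name printed by %B depends on the LC_TIME locale.
shown <- format(x, "%d %B %Y, %H:%M")

# Input is read only as far as the format needs; the trailing text is ignored
# and the unspecified hours, minutes and seconds default to zero.
trailing <- strptime("2001-02-28 and some notes", format = "%Y-%m-%d", tz = "UTC")

# An impossible calendar date is returned as NA.
bad <- strptime("2010-02-30 08:00", format = "%Y-%m-%d %H:%M", tz = "UTC")
is.na(bad)
```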
Time Zones and Daylight Saving
Time zones add another layer of complexity. The tz argument specifies the time zone to use for conversion, with "" meaning the current time zone and "GMT" meaning UTC. Invalid values are often treated as UTC, though behaviour can be system-specific. The usetz argument controls whether a time zone abbreviation is appended to output, which is generally more reliable than %Z. %z represents a signed UTC offset such as -0800, and R supports it for input on all platforms. Even so, time zones can be awkward because daylight saving transitions create times that do not occur at all, or occur twice, and strptime() itself does not validate those cases, though conversion through as.POSIXct may do so.
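As a small sketch of %z on input and usetz on output, the example below parses an invented timestamp carrying a UTC offset and prints it in UTC.

```r
# The text carries its own offset: 12:00 at eight hours behind UTC.
stamp <- "2024-03-01 12:00:00 -0800"

# %z reads the signed offset; tz sets the zone the result is expressed in.
x <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M:%S %z", tz = "UTC")

# usetz = TRUE appends the time zone abbreviation to the printed value.
format(x, "%Y-%m-%d %H:%M:%S", usetz = TRUE)
```

Because the offset is applied during parsing, the stored instant is 20:00 UTC on the same day.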
Two-Digit Years
Two-digit years are a notable source of confusion for analysts working with historical data. As described in the R date formats guide on R-bloggers, %y maps values 00 to 68 to the years 2000 to 2068 and 69 to 99 to 1969 to 1999, following the POSIX standard. A value such as "08/17/20" may therefore be interpreted as 2020 when the intended year is 1920. One practical workaround is to identify any parsed dates lying in the future and then rebuild them with a 19 prefix using format() and ifelse(). This approach is explicit and practical, though it depends on the assumptions of the data at hand.
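The workaround can be sketched as follows. The raw values are invented, and the rule that any date after today must belong to the 1900s is an assumption that has to hold for the data being cleaned.

```r
# Made-up two-digit-year dates; the last two are meant to be 1945 and 1968.
raw <- c("05/27/84", "08/17/45", "11/30/68")

# %y sends 00-68 to 20xx, so two of these land in the future.
parsed <- as.Date(raw, format = "%m/%d/%y")

# Rebuild future dates with a "19" prefix; leave the rest unchanged.
# ifelse() works on the character forms, so the result is converted back to Date.
fixed <- as.Date(ifelse(parsed > Sys.Date(),
                        format(parsed, "19%y-%m-%d"),
                        format(parsed)))
```

Going through character strings inside ifelse() is a deliberate choice in this sketch, since it makes the prefix substitution explicit before the values are converted back to dates.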
Plain Dates
For plain dates, rather than full date-times, as.Date() is usually the entry point. Character dates can be imported by specifying the current format, such as %m/%d/%y for "05/27/84" or %B %d %Y for "May 27 1984". If no format is supplied, as.Date() first tries %Y-%m-%d and then %Y/%m/%d. Numeric dates are common when data come from Excel, and here the crucial issue is the origin date: Windows Excel uses an origin of "1899-12-30" for dates after 1900 because Excel incorrectly treated 1900 as a leap year (an error originally copied from Lotus 1-2-3 for compatibility), while Mac Excel traditionally uses "1904-01-01". Once the correct origin is supplied, as.Date() converts the serial numbers into standard R dates.
After import, format() can display dates in other ways without changing their underlying class. For example, format(betterDates, "%a %b %d") might yield values like "Sun May 27" and "Thu Jul 07". This distinction between storage and display is important because once R recognises values as dates, they can participate in date-aware operations such as mean(), min() and max(), and a vector of dates can have a meaningful mean date with the minimum and maximum identifying the earliest and latest observations.
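A brief sketch of these steps is shown below. The serial numbers are illustrative values of the kind exported by Windows Excel, and the printed weekday and month names depend on the LC_TIME locale.

```r
# Character import with an explicit format.
d1 <- as.Date("05/27/84", format = "%m/%d/%y")

# Illustrative Excel serial numbers, converted with the Windows Excel origin.
betterDates <- as.Date(c(30829, 38540), origin = "1899-12-30")

# Display only: the underlying class is still Date.
shown <- format(betterDates, "%a %b %d")

# Date-aware summaries work once the class is right.
earliest <- min(betterDates)
latest <- max(betterDates)
middle <- mean(betterDates)
```

In an English locale, shown contains "Sun May 27" and "Thu Jul 07", while betterDates itself still prints in the default ISO form.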
Extracting Columns and Manipulating Lists
These ideas about correct types and structure carry over into table manipulation. A data frame column often needs to be extracted as a vector before further processing, and there are several standard ways to do this, as covered in this guide from Statistics Globe. In base R, the $ operator gives a direct route, as in data$x1. Subsetting with data[, "x1"] yields the same result for a single column of an ordinary data frame (a tibble would return a one-column tibble instead), and in the tidyverse, dplyr::pull(data, x1) serves the same purpose. All three approaches convert a column of a data frame into a standalone vector, and each is useful depending on the surrounding code style.
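The base R routes can be compared on a small made-up data frame, as in this sketch. The [[ ]] form is added here as a further base R option that is not mentioned above.

```r
# Small invented data frame for illustration.
data <- data.frame(x1 = 1:5, x2 = letters[1:5])

v_dollar <- data$x1         # by name with $
v_bracket <- data[, "x1"]   # single column via [ , ] drops to a vector
v_double <- data[["x1"]]    # [[ ]] always returns the column itself

identical(v_dollar, v_bracket)
```

With {dplyr} installed, dplyr::pull(data, x1) would return the same vector.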
List manipulation has similar patterns, detailed in this Statistics Globe tutorial on removing list elements. Removing elements from a list can be done by position with negative indexing, as in my_list[-2], or by assigning NULL to the relevant component, for example my_list_2[2] <- NULL. If names are more meaningful than positions, then subsetting with names(my_list) != "b" or names(my_list) %in% "b" == FALSE removes the named element instead. The same logic extends to multiple elements, whether by positions such as -c(2, 3) or names such as %in% c("b", "c") == FALSE. These are simple techniques, but they matter because lists are a common structure in R, especially when working with nested results.
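These removal patterns can be tried on a small made-up list, as sketched here. The negated %in% form is written with ! instead of == FALSE, which is equivalent.

```r
# Invented list with three named elements.
my_list <- list(a = 1:3, b = "text", c = TRUE)

# Remove by position with negative indexing.
by_position <- my_list[-2]

# Remove by assigning NULL to the component.
my_list_2 <- my_list
my_list_2[2] <- NULL

# Remove by name.
by_name <- my_list[names(my_list) != "b"]

# Remove several elements at once, by name.
several <- my_list[!(names(my_list) %in% c("b", "c"))]
```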
Subsetting, Renaming and Reordering Data Frames
Data frames themselves can be subset in several ways, and the choice often depends on readability, as the five-method overview on R-bloggers demonstrates clearly. The bracket form example[x, y] remains the foundation, whether selecting rows and columns directly or omitting unwanted ones with negative indices. More expressive alternatives include which() together with %in%, the base subset() function and tidyverse verbs like filter() and select(). The point is not that one method is universally best, but that R offers both low-level precision and higher-level readability, depending on the task.
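The base R variants can be placed side by side on an invented data frame, as in the sketch below. The tidyverse verbs filter() and select() would express the first selection in a similar way but are left out to keep the example free of package dependencies.

```r
# Invented data frame for illustration.
example <- data.frame(
  id = 1:6,
  grp = rep(c("a", "b", "c"), times = 2),
  val = seq(10, 60, by = 10)
)

# Rows by condition, columns by name.
by_bracket <- example[example$val > 25, c("id", "val")]

# Negative indices drop the first row and the second column.
dropped <- example[-1, -2]

# which() with %in% picks rows whose group is in a set.
by_which <- example[which(example$grp %in% c("a", "b")), ]

# subset() expresses the first selection more readably.
by_subset <- subset(example, val > 25, select = c(id, val))
```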
Column names and column order also need regular attention. Renaming can be done with dplyr::rename(), as explained in this lesson from Datanovia, for instance changing Sepal.Length to sepal_length and Sepal.Width to sepal_width. In base R, the same effect comes from modifying names() or colnames(), either by matching specific names or by position. Reordering columns is just as direct, with a data frame rearranged by column indices such as my_data[, c(5, 4, 1, 2, 3)] or by an explicit character vector of names, as the STHDA guide on reordering columns illustrates. Both approaches are useful when preparing data for presentation or for functions that expect variables in a certain order.
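The base R versions of renaming and reordering can be sketched on the first rows of the built-in iris data. The object name my_data follows the indexing example above.

```r
# Work on a small copy of the built-in iris data.
my_data <- head(iris)

# Rename by matching a specific name.
names(my_data)[names(my_data) == "Sepal.Length"] <- "sepal_length"

# Rename by position.
names(my_data)[2] <- "sepal_width"

# Reorder by column indices.
reordered <- my_data[, c(5, 4, 1, 2, 3)]

# Reorder (and select) by an explicit vector of names.
by_name <- my_data[, c("Species", "sepal_length", "sepal_width")]
```

With {dplyr}, the renaming step would be rename(my_data, sepal_length = Sepal.Length, sepal_width = Sepal.Width).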
Sorting and Cumulative Calculations
Sorting and cumulative calculations fit naturally into this same preparatory workflow. To sort a data frame in base R, the DataCamp sorting reference demonstrates that order() is the key function: mtcars[order(mtcars$mpg), ] sorts ascending by mpg, while mtcars[order(mtcars$mpg, -mtcars$cyl), ] sorts by mpg ascending and cyl descending. The shorter form order(mpg) works only after attach(mtcars) or inside with(). For cumulative totals, cumsum() provides a running sum, as in calculating cumulative air miles from the airmiles dataset, an example covered in the Data Cornering guide to cumulative calculations. Within grouped data, dplyr::group_by() and mutate() can apply cumsum() separately to each group, and a related idea is cumulative count, which can be built by summing a column of ones within groups, or with data.table::rowid() to number the rows within each group.
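A base R sketch of these steps follows. The flights data frame is invented, and ave() is used here as a base R substitute for the group_by() and mutate() combination, so it is not the method of the guides cited above.

```r
# Sort by mpg ascending, then cyl descending within ties.
sorted <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]

# Running total of the built-in airmiles series.
miles <- as.numeric(airmiles)
running_total <- cumsum(miles)

# Invented grouped data.
flights <- data.frame(
  carrier = c("x", "x", "y", "x", "y"),
  miles = c(100, 250, 80, 50, 120)
)

# Running sum within each carrier; ave() applies FUN group by group.
flights$cum_miles <- ave(flights$miles, flights$carrier, FUN = cumsum)

# Cumulative count: a running sum over a column of ones within each group.
flights$cum_count <- ave(rep(1, nrow(flights)), flights$carrier, FUN = cumsum)
```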
Time Series Smoothing
Time series smoothing introduces one further pattern: replacing noisy raw values with moving averages. As the Storybench rolling averages guide explains, the zoo::rollmean() function calculates rolling means over a window of width k, and examples using 3, 5, 7, 15 and 21-day windows on pandemic deaths and confirmed cases by state demonstrate the approach clearly. After arranging and grouping by state, mutate() adds variables such as death_03da, death_05da and death_07da. Because rollmean() is centred by default, each value is the mean of a window placed symmetrically around the observation of interest. With fill = NA, the start and end of the series, where there are not enough surrounding observations, are padded with NA so that the result keeps the length of the input; without it, the result is shorter than the input. Odd values of k are usually preferred because they make the window balanced around its centre.
The arithmetic is uncomplicated, but the interpretation is useful. A 3-day moving average for a given date is the mean of that day, the previous day and the following day, while a 7-day moving average uses three observations on either side. As the window widens, the line becomes smoother, but more short-term variation is lost. This trade-off is visible when comparing 3-day and 21-day averages: a shorter average tracks recent changes more closely, while a longer one suppresses noise and makes broader trends stand out. If a trailing rather than centred calculation is needed, rollmeanr() shifts the window to the right-hand end.
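The difference between centred and trailing windows can be seen on a short invented series, as in this sketch. fill = NA is supplied so that both results keep the length of the input.

```r
library(zoo)

# Made-up daily counts for one state.
deaths <- c(2, 4, 9, 7, 5, 12, 10)

# Centred 3-day average: previous day, the day itself and the next day.
death_03da <- rollmean(deaths, k = 3, fill = NA)

# Trailing 3-day average: the day itself and the two days before it.
death_03dr <- rollmeanr(deaths, k = 3, fill = NA)

rbind(deaths, death_03da, death_03dr)
```

For the second day, the centred value is (2 + 4 + 9) / 3 = 5, and the same figure appears one position later in the trailing version.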
The same grouped workflow can be used to derive new daily values before smoothing. In the pandemic example, daily new confirmed cases are calculated from cumulative confirmed counts using dplyr::lag(), with each day's new cases equal to the current cumulative total minus the previous day's total. Grouping by state and date, summing confirmed counts and then subtracting the lagged value produces new_confirmed_cases, which can then be smoothed with rollmean() in the same way as deaths. Once these measures are available, reshaping with pivot_longer() allows raw values and rolling averages to be plotted together in ggplot2, making it easier to compare volatility against trend.
How These R Data Manipulation Techniques Fit Together
What links all of these techniques is not just that they are common in R, but that they solve the mundane, essential problems of analysis. Data arrive as text when they should be dates, as cumulative counts when daily changes are needed, as broad tables when only a few columns matter, or as inconsistent names that get in the way of clear code. Functions such as paste(), strptime(), as.Date(), order(), cumsum(), rollmean(), rename(), select() and simple bracket subsetting are therefore less like isolated tricks and more like pieces of a coherent working practice. Knowing how they fit together makes it easier to move from raw input to reliable analysis, with fewer surprises along the way.
Speeding up R Code with parallel processing
Parallel processing in R has evolved considerably over the past fifteen years, moving from a patchwork of platform-specific workarounds into a well-structured ecosystem with clean, consistent interfaces. The appeal is easy to grasp: modern computers offer several processor cores, yet most R code runs on only one of them unless the user makes a deliberate choice to go parallel. When a task involves repeated calculations across groups, repeated model fitting or many independent data retrievals, spreading that work across multiple cores can reduce elapsed time substantially.
At its heart, the idea is simple. A larger job is split into smaller pieces, those pieces are executed simultaneously where possible, and the results are combined back together. That pattern appears throughout R's parallel ecosystem, whether the work is running on a laptop with a handful of cores or on a university supercomputer with thousands.
Why Parallel Processing?
Most modern computers have multiple cores that sit idle during single-threaded R scripts. Parallel processing takes advantage of this by splitting work across those cores, but it is important to understand that it is not always beneficial. Starting workers, transmitting data and collecting results all take time. Parallel processing makes the most sense when each iteration does enough computational work to justify that overhead. For fast operations of well under a second, the overhead will outweigh any gain and serial execution is faster. The sweet spot is iterative work, where each unit of computation takes at least a few seconds.
Benchmarking: Amdahl's Law
The theoretical speed-up from adding processors is always limited by the fraction of work that cannot be parallelised. Amdahl's Law, formulated by computer scientist Gene Amdahl in 1967, captures this:
Maximum Speedup = 1 / ( f/p + (1 - f) )
Here, f is the parallelisable fraction and p is the number of processors. Problems where f = 1 (the entire computation is parallelisable) are called embarrassingly parallel: bootstrapping, simulation studies and applying the same model to many independent groups all fall into this category. For everything else, the sequential fraction, including the overhead of setting up workers and moving data, sets a ceiling on how much improvement is achievable.
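As a worked illustration of the formula, writing S(p) for the maximum speed-up on p processors, and taking f = 0.9 and p = 8 (values chosen here purely for the arithmetic, not drawn from any benchmark):

```latex
S(p) = \frac{1}{\frac{f}{p} + (1 - f)}, \qquad
S(8)\Big|_{f = 0.9} = \frac{1}{\frac{0.9}{8} + 0.1} = \frac{1}{0.2125} \approx 4.71, \qquad
\lim_{p \to \infty} S(p) = \frac{1}{1 - f} = 10
```

So even with 90 per cent of the work parallelisable, eight processors give less than a five-fold gain, and no number of processors can push the gain beyond ten-fold.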
How We Got Here
The current landscape makes more sense with a brief orientation. R 2.14.0 in 2011 brought {parallel} into base R, providing built-in support for both forking and socket clusters along with reproducible random number streams, and it remains the foundation everything else builds on. The {foreach} package with {doParallel} became the most common high-level interface for many years, and is still widely encountered in existing code. The split-apply-combine package {plyr} was an early entry point for parallel data manipulation but is now retired; the recommendation is to use {dplyr} for data frames and {purrr} for list iteration instead. The {future} ecosystem, covered in the next section, is the current best practice for new code.
The Modern Standard: The {future} Ecosystem
The most significant development in R parallel computing in recent years has been the {future} package by Henrik Bengtsson, which provides a unified API for sequential and parallel execution across a wide range of backends. Its central concept is simple: a future is a value that will be computed (possibly in parallel) and retrieved later. What makes it powerful is that you write code once and change the execution strategy by swapping a single plan() call, with no other changes to your code.
library(future)
plan(multisession) # Use all available cores via background R sessions
The common plans are sequential (the default, no parallelism), multisession (multiple background R processes, works on all platforms including Windows) and multicore (forking, faster but Unix/macOS only). On HPC systems, the cluster plan and backends such as future.batchtools extend the same interface to remote nodes.
The {future} package itself is a low-level building block. For day-to-day work, three higher-level packages are the main entry points.
{future.apply}: Drop-in Replacements for base R Apply
{future.apply} provides parallel versions of every *apply function in base R, including future_lapply(), future_sapply(), future_mapply(), future_replicate() and more. The conversion from serial to parallel code requires just two lines:
library(future.apply)
plan(multisession)
# Serial
results <- lapply(my_list, my_function)
# Parallel — identical output, just faster
results <- future_lapply(my_list, my_function)
Global variables and packages are automatically identified and exported to workers, which removes the manual clusterExport and clusterEvalQ calls that {parallel} requires.
{furrr}: Drop-in Replacements for {purrr}
{furrr} does the same for {purrr}'s mapping functions. Any map() call can become future_map() by loading the library and setting a plan:
library(furrr)
plan(multisession, workers = availableCores() - 1)
# Serial
results <- map(my_list, my_function)
# Parallel
results <- future_map(my_list, my_function)
Like {future.apply}, {furrr} handles environment export automatically. There are parallel equivalents for all typed variants (future_map_dbl(), future_map_chr(), etc.) and for map2() and pmap() as well. It is the most natural choice for tidyverse-style code that already uses {purrr}.
{futurize}: One-Line Parallelisation
For users who want to parallelise existing code with minimal changes, {futurize} can transpile calls to lapply(), purrr::map() and foreach::foreach() %do% {} into their parallel equivalents automatically.
{foreach} with {doFuture}
The {foreach} package remains widely used, and the modern way to parallelise it is with the {doFuture} backend and the %dofuture% operator:
library(foreach)
library(doFuture)
plan(multisession)
results <- foreach(i = 1:10) %dofuture% {
  my_function(i)
}
This approach inherits all the benefits of {future}, including automatic global variable handling and reproducible random numbers.
The {parallel} Package: Core Functions
The {parallel} package remains part of base R and is the foundation that {future} and most other packages build on. It is useful to know its core functions directly, especially for distributed work across multiple nodes.
Shared memory (single machine, Unix/macOS only):
mclapply(X, FUN, mc.cores = n) is a parallelised lapply that works by forking. Forking is not available on Windows, where mclapply() runs serially with mc.cores = 1 and stops with an error for any larger value.
Distributed memory (all platforms, including multi-node):
| Function | Description |
|---|---|
| makeCluster(n) | Start `n` worker processes |
| clusterExport(cl, vars) | Copy named objects to all workers |
| clusterEvalQ(cl, expr) | Run an expression (e.g. library(pkg)) on all workers |
| parLapply(cl, X, FUN) | Parallelised lapply across the cluster |
| parLapplyLB(cl, X, FUN) | Same with load balancing for uneven tasks |
| clusterSetRNGStream(cl, seed) | Set reproducible random seeds on workers |
| stopCluster(cl) | Shut down the cluster |
Note that detectCores() can return misleading values in HPC environments, reporting the total cores on a node rather than those allocated to your job. The {parallelly} package's availableCores() is more reliable in those settings and is what {furrr} and {future.apply} use internally.
A Tidyverse Approach with {multidplyr}
For data frame-centric workflows, {multidplyr} (available on CRAN) provides a {dplyr} backend that distributes grouped data across worker processes. The API has been simplified considerably since older tutorials were written: there is no longer any need to manually add group index columns or call create_cluster(). The current workflow is straightforward.
library(multidplyr)
library(dplyr)
# Step 1: Create a cluster (leave 1–2 cores free)
cluster <- new_cluster(parallel::detectCores() - 1)
# Step 2: Load packages on workers
cluster_library(cluster, "dplyr")
# Step 3: Group your data and partition it across workers
flights_partitioned <- nycflights13::flights %>%
  group_by(dest) %>%
  partition(cluster)
# Step 4: Work with dplyr verbs as normal
results <- flights_partitioned %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()
partition() uses a greedy algorithm to keep all rows of a group on the same worker and balance shard sizes. The collect() call at the end recombines the results into an ordinary tibble in the main session. If you need to use custom functions, load them on each worker with cluster_assign():
cluster_assign(cluster, my_function = my_function)
One important caveat from the official documentation: for basic {dplyr} operations, {multidplyr} is unlikely to give measurable speed-ups unless you have tens or hundreds of millions of rows. Its real strength is in parallelising slower, more complex operations such as fitting models to each group. For large in-memory data with fast transformations, {dtplyr} (which translates {dplyr} to {data.table}) is often a better first choice.
Running R on HPC Clusters
For computations that exceed what a single workstation can provide, university and research HPC clusters are the next step. The core terminology is worth understanding clearly before submitting your first job.
One node is a single physical computer, which may itself contain multiple processors. One processor contains multiple cores. Wall-time is the real-world clock time a job is permitted to run; the job is terminated when this limit is reached, regardless of whether the script has finished. Memory refers to the RAM the job requires. When requesting resources, leave a margin of at least five per cent of RAM for system processes, as exceeding the allocation will cause the job to fail.
Slurm Job Submission
Slurm is the dominant scheduler on modern HPC clusters, including Penn State's Roar Collab system, managed by the Institute for Computational and Data Sciences (ICDS). Jobs are described in a shell script and submitted with sbatch. From R, the {rslurm} package allows Slurm jobs to be created and submitted directly without leaving the R session:
library(rslurm)
sjob <- slurm_apply(my_function, params_df, jobname = "my_job",
                    nodes = 2, cpus_per_node = 8)
Connecting R Workflows to Cluster Schedulers
The {batchtools} package provides Map, Reduce and Filter variants for managing R jobs on PBS, Slurm, LSF and Sun Grid Engine. The {clustermq} package sends function calls as cluster jobs via a single line of code without network-mounted storage. For users already in the {future} ecosystem, {future.batchtools} wraps {batchtools} as a {future} backend, letting you scale from a local plan(multisession) all the way to plan(batchtools_slurm) with no other code changes.
The Broader Ecosystem
The CRAN Task View on High-Performance and Parallel Computing, maintained by Dirk Eddelbuettel and regularly updated, remains the most comprehensive catalogue of R packages in this space. The core packages designated by the Task View are {Rmpi} and {snow}. Beyond these, several areas are worth knowing about.
For large and out-of-memory data, {arrow} provides the Apache Arrow in-memory format with support for out-of-memory processing and streaming. {bigmemory} allows multiple R processes on the same machine to share large matrix objects. {bigstatsr} operates on file-backed matrices via memory-mapped access with parallel matrix operations and PCA.
For pipeline orchestration, the {targets} package constructs a directed acyclic graph of your workflow and orchestrates distributed computing across {future} workers, only re-running steps whose upstream dependencies have changed. For GPU computing, the {tensorflow} package by Allaire and colleagues provides access to the complete TensorFlow API from within R, enabling computation across CPUs and GPUs with a single API.
When it comes to random number reproducibility across parallel workers, the L'Ecuyer-CMRG streams built into {parallel} are available via RNGkind("L'Ecuyer-CMRG"). The {rlecuyer}, {rstream}, {sitmo} and {dqrng} packages provide further alternatives. The {doRNG} package handles reproducible seeds specifically for {foreach} loops.
Choosing the Right Approach
The appropriate tool depends on the shape of the problem and how it fits into your existing code.
If you are already using {purrr}'s map() functions, replacing them with future_map() from {furrr} after plan(multisession) is the path of least resistance. If you use base R's lapply or sapply, {future.apply} provides identical drop-in replacements. Both inherit automatic environment handling, reproducible random numbers and cross-platform compatibility from {future}.
If you are working with grouped data frames in a {dplyr} style and each group operation is computationally substantial, {multidplyr} is a good fit. For fast operations on large data, try {dtplyr} first.
For the largest workloads on institutional clusters, {future} scales directly to HPC environments via plan(cluster) or plan(batchtools_slurm). The {rslurm} and {batchtools} packages provide more direct control over job submission and resource management.
Further Reading
The CRAN Task View on High-Performance and Parallel Computing is the most comprehensive and current reference. The Futureverse website documents the full {future} ecosystem. The {multidplyr} vignette covers the current API in detail. Penn State users can find cluster support through ICDS and the QuantDev group's HPC in R tutorial. The R Special Interest Group on High-Performance Computing mailing list is a further resource for more specialist questions.
Making sense of parallel and asynchronous execution in Python
Parallel processing in Python is often presented as a straightforward route to faster programs, though the reality is rather more nuanced. At its core, parallel processing means executing parts of a task simultaneously across multiple processors or cores on the same machine, with the intention of reducing the total time needed to complete the work. Any honest explanation must include an important caveat: parallelism brings overhead of its own, because processes need to be created, scheduled and coordinated, and data often has to be passed between them. For small or lightweight tasks, that overhead can outweigh any gain; two tasks that each take five seconds may still require around eight seconds when parallelised, rather than the ideal five.
The Multiprocessing Module
One of the standard ways to work with parallel execution in Python is the multiprocessing module. This module creates subprocesses rather than threads, which matters because each process has its own memory space. On both Unix-like systems and Windows, this arrangement allows Python code to use multiple processors more effectively for independent work, and it sidesteps some of the limitations commonly associated with threads in CPython, particularly for CPU-bound tasks. Threads still have an important role, especially for workloads that are heavy on input/output operations, but multiprocessing is often the better fit when the work involves substantial computation.
Understanding the Global Interpreter Lock
The reason threads are less effective for CPU-bound work in CPython relates directly to the Global Interpreter Lock (GIL). The GIL is a mutex that allows only one thread to hold control of the Python interpreter at any one time, meaning that even in a multithreaded programme, only one thread can execute Python bytecode at a given moment. When a thread is waiting for an external input/output operation it releases the GIL, allowing other threads to run, which is why threading remains a reasonable choice for I/O-bound workloads. Multiprocessing sidesteps the GIL entirely by spawning separate processes, each with its own Python interpreter, allowing genuine parallel execution across cores.
How Many Processes Can Run in Parallel?
Before using multiprocessing, it helps to understand the practical ceiling on how many processes can run in parallel. The upper bound is usually tied to the number of logical processors or cores available on the machine, and Python exposes this through multiprocessing.cpu_count(), which returns the number of processors detected. That figure is a useful starting point rather than an absolute rule. In real applications, the best number of worker processes can vary according to available memory, the nature of the task and what else the machine is doing at the time.
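A minimal sketch of how that figure might be turned into a worker count. The "leave one processor free" rule is only an illustrative policy assumed here, not a recommendation from the multiprocessing documentation, and the variable names are hypothetical.

```python
import multiprocessing as mp

# Number of logical processors Python detects on this machine
logical_cpus = mp.cpu_count()

# Illustrative policy (an assumption): keep one processor free for the
# rest of the system, but never drop below a single worker
workers = max(1, logical_cpus - 1)

print(f"Detected {logical_cpus} logical processors; using {workers} workers")
```

The resulting number would then be passed to mp.Pool() in place of mp.cpu_count() when memory or other running jobs make a smaller pool sensible.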
Synchronous and Asynchronous Execution
Another foundation worth clarifying is the difference between synchronous and asynchronous execution. In synchronous execution, tasks are coordinated so that results are typically gathered in the same order in which they were started, and the main programme effectively waits for those tasks to finish. In asynchronous execution, by contrast, tasks can complete in any order and the results may not correspond to the original input sequence, which often improves throughput but requires the programmer to be more deliberate about collecting and arranging results.
Pool and Process: The Two Main Abstractions
The multiprocessing module offers two main abstractions for parallel work: Pool and Process. For most practical tasks, Pool is the easier and more convenient option. It manages a collection of worker processes and provides methods such as apply(), map() and starmap() for synchronous execution, alongside apply_async(), map_async() and starmap_async() for asynchronous execution. The lower-level Process class offers more control and suits more specialised cases, but for many data-processing jobs Pool is sufficient and considerably easier to reason about.
An Example: Counting Values in a Range
A useful way to see these ideas in action is through a concrete example. Suppose there is a two-dimensional list, or matrix, where each row contains a small set of integers, and the task is to count how many values in each row fall within a given range. In the example, the data are generated with NumPy using np.random.randint(0, 10, size=[200000, 5]) and then converted to a plain list of lists with tolist(). A simple function, howmany_within_range(row, minimum, maximum), loops through each number in a row and increments a counter whenever the number falls between the supplied minimum and maximum values.
Without any parallelism, this task is handled with a straightforward loop in which each row is passed to the function in turn and the returned counts are appended to a results list. This serial approach is simple, easy to read and often good enough as a baseline, and it provides an important benchmark because parallel processing should not be adopted merely because it is available but should address an actual performance problem.
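The serial baseline can be sketched as follows. To keep the sketch free of third-party dependencies, the standard random module stands in for the NumPy call, the row count is reduced, and the range check is assumed to be inclusive at both ends; the make_data() helper and its seed are hypothetical.

```python
import random


def howmany_within_range(row, minimum, maximum):
    """Count the numbers in `row` that lie between minimum and maximum.

    Both bounds are treated as inclusive (an assumption of this sketch).
    """
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count += 1
    return count


def make_data(n_rows, n_cols=5, seed=100):
    """Hypothetical stand-in for np.random.randint(0, 10, ...).tolist()."""
    rng = random.Random(seed)
    return [[rng.randint(0, 9) for _ in range(n_cols)] for _ in range(n_rows)]


data = make_data(1000)

# Serial baseline: pass each row to the function in turn
results = []
for row in data:
    results.append(howmany_within_range(row, minimum=4, maximum=8))

print(results[:10])
```

Timing this loop first gives the benchmark against which any parallel version should be judged.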
Pool.apply()
To parallelise the same function, the first step is to create a process pool, typically with mp.Pool(mp.cpu_count()). The simplest method to understand is Pool.apply(), which runs a function in a worker process using the arguments supplied through args. In the range-counting example, each row is submitted with the same minimum and maximum values. The resulting code is concise, but there is an important detail to note: when apply() is used inside a list comprehension, each call blocks until its result is returned, so the next row is not submitted until the previous one has finished. The work does run in worker processes, but the tasks execute one after another rather than side by side, which makes this an inefficient pattern for distributing a large iterable of similar tasks.
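A minimal sketch of the Pool.apply() pattern, assuming an inclusive range check; the wrapper count_with_apply() and the small sample data are hypothetical and exist only to make the example self-contained.

```python
import multiprocessing as mp


def howmany_within_range(row, minimum, maximum):
    """Count the numbers in `row` between minimum and maximum (inclusive)."""
    return sum(1 for n in row if minimum <= n <= maximum)


def count_with_apply(data, minimum, maximum, processes=2):
    """Submit one blocking apply() call per row and keep the counts in order."""
    with mp.Pool(processes) as pool:
        # Each apply() call waits for its result before the next is submitted
        return [
            pool.apply(howmany_within_range, args=(row, minimum, maximum))
            for row in data
        ]


if __name__ == "__main__":
    sample = [[1, 4, 6, 8, 9], [0, 1, 2, 3, 3], [5, 5, 5, 5, 5]]
    print(count_with_apply(sample, 4, 8))  # [3, 0, 5]
```

The guard around the demonstration matters on platforms that start workers by launching a fresh interpreter, where unguarded top-level code would be re-run in every worker.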
Pool.map()
That is where Pool.map() can be more suitable. The map() method accepts a single iterable and applies the target function to each element. Because the original howmany_within_range() function expects more than one argument, the example adapts it by defining howmany_within_range_rowonly(row, minimum=4, maximum=8), giving default values to the range bounds so that only the row must be supplied. This is not always the cleanest design, but it illustrates the central constraint of map(): it expects one iterable of inputs rather than multiple arguments per call. In return, it is often a good fit for simple, repeated operations over a dataset.
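The map() variant might look like the sketch below. The wrapper names are hypothetical, and the functools.partial alternative is not part of the example described above; it is shown only as another way of meeting the one-iterable constraint.

```python
from functools import partial
import multiprocessing as mp


def howmany_within_range(row, minimum, maximum):
    """Count the numbers in `row` between minimum and maximum (inclusive)."""
    return sum(1 for n in row if minimum <= n <= maximum)


def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    """Same count, with the bounds fixed as defaults so map() can call it."""
    return howmany_within_range(row, minimum, maximum)


def count_with_map(data, processes=2):
    """Apply the row-only function to every row; results keep input order."""
    with mp.Pool(processes) as pool:
        return pool.map(howmany_within_range_rowonly, data)


def count_with_partial(data, minimum, maximum, processes=2):
    """Alternative (an addition here): bind the bounds with functools.partial."""
    bound = partial(howmany_within_range, minimum=minimum, maximum=maximum)
    with mp.Pool(processes) as pool:
        return pool.map(bound, data)


if __name__ == "__main__":
    sample = [[1, 4, 6, 8, 9], [0, 1, 2, 3, 3], [5, 5, 5, 5, 5]]
    print(count_with_map(sample))            # [3, 0, 5]
    print(count_with_partial(sample, 0, 3))  # [1, 5, 0]
```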
Pool.starmap()
When a function genuinely needs multiple arguments and one wants the convenience of map-like behaviour, Pool.starmap() is usually the better choice. Like map(), it takes a single iterable, but each element of that iterable is itself another iterable containing the arguments for one function call. In the example, the input becomes [(row, 4, 8) for row in data], with each tuple unpacked into howmany_within_range(). This tends to be clearer than altering function signatures purely to satisfy the constraints of map().
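A short sketch of the starmap() form, again assuming an inclusive range check; count_with_starmap() is a hypothetical wrapper added for self-containment.

```python
import multiprocessing as mp


def howmany_within_range(row, minimum, maximum):
    """Count the numbers in `row` between minimum and maximum (inclusive)."""
    return sum(1 for n in row if minimum <= n <= maximum)


def count_with_starmap(data, minimum, maximum, processes=2):
    """Build one argument tuple per row and let starmap() unpack each tuple."""
    arguments = [(row, minimum, maximum) for row in data]
    with mp.Pool(processes) as pool:
        return pool.starmap(howmany_within_range, arguments)


if __name__ == "__main__":
    sample = [[1, 4, 6, 8, 9], [0, 1, 2, 3, 3], [5, 5, 5, 5, 5]]
    print(count_with_starmap(sample, 4, 8))  # [3, 0, 5]
```

The original function signature is left untouched, which is the main attraction over the default-argument workaround needed for map().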
Asynchronous Variants
The asynchronous equivalents follow the same broad pattern but differ in one crucial respect: they do not force the main process to wait for each task in order. With Pool.apply_async(), tasks are submitted, and the programme can continue while workers process them in the background. The example demonstrates this by redefining the counting function as howmany_within_range2(i, row, minimum, maximum), which returns both the original index and the count, a distinction that matters because asynchronous execution may alter the order of results. A callback function appends each completed result to a shared list and, after all tasks finish, that list is sorted by index so that the final output matches the original row order.
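The callback pattern just described can be sketched roughly as follows. The wrapper count_with_callback() and the local collect_result() callback are hypothetical names, and the range check is assumed to be inclusive.

```python
import multiprocessing as mp


def howmany_within_range2(i, row, minimum, maximum):
    """Return the row index together with its count, so order can be restored."""
    count = sum(1 for n in row if minimum <= n <= maximum)
    return (i, count)


def count_with_callback(data, minimum, maximum, processes=2):
    """Submit every row with apply_async() and gather results via a callback."""
    results = []

    def collect_result(result):
        # Called in the parent process each time a task completes
        results.append(result)

    pool = mp.Pool(processes)
    for i, row in enumerate(data):
        pool.apply_async(
            howmany_within_range2,
            args=(i, row, minimum, maximum),
            callback=collect_result,
        )
    pool.close()  # no further tasks will be submitted
    pool.join()   # wait until all queued work has finished

    # Completion order is not guaranteed, so sort by the original index
    results.sort(key=lambda pair: pair[0])
    return [count for _, count in results]


if __name__ == "__main__":
    sample = [[1, 4, 6, 8, 9], [0, 1, 2, 3, 3], [5, 5, 5, 5, 5]]
    print(count_with_callback(sample, 4, 8))  # [3, 0, 5]
```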
There is also an alternative form of apply_async() that avoids callbacks by returning ApplyResult objects, which can later be resolved with .get() to retrieve the actual result. This approach can be easier to follow when callbacks feel too indirect, though it still requires care to ensure that the pool is properly closed and joined so that all processes complete. The use of pool.join() is particularly important here because it prevents subsequent lines of code from running until the queued work is finished. Asynchronous mapping methods are available too, including Pool.starmap_async(), which mirrors starmap() but returns an asynchronous result object whose data can be fetched with .get().
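Both callback-free forms are sketched below under the same assumptions as before (inclusive bounds, hypothetical wrapper names). The first keeps the ApplyResult objects and resolves them with .get(); the second uses starmap_async().

```python
import multiprocessing as mp


def howmany_within_range2(i, row, minimum, maximum):
    """Return the row index together with its count."""
    return (i, sum(1 for n in row if minimum <= n <= maximum))


def count_with_apply_results(data, minimum, maximum, processes=2):
    """Keep the ApplyResult objects and resolve them once the pool has finished."""
    pool = mp.Pool(processes)
    pending = [
        pool.apply_async(howmany_within_range2, args=(i, row, minimum, maximum))
        for i, row in enumerate(data)
    ]
    pool.close()
    pool.join()
    # `pending` is in submission order, so no sorting is needed here
    return [result.get()[1] for result in pending]


def count_with_starmap_async(data, minimum, maximum, processes=2):
    """Submit all rows at once and fetch the whole result list with .get()."""
    arguments = [(i, row, minimum, maximum) for i, row in enumerate(data)]
    with mp.Pool(processes) as pool:
        async_result = pool.starmap_async(howmany_within_range2, arguments)
        pairs = async_result.get()  # blocks until every task is done
    return [count for _, count in pairs]


if __name__ == "__main__":
    sample = [[1, 4, 6, 8, 9], [0, 1, 2, 3, 3], [5, 5, 5, 5, 5]]
    print(count_with_apply_results(sample, 4, 8))   # [3, 0, 5]
    print(count_with_starmap_async(sample, 4, 8))   # [3, 0, 5]
```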
Parallelising Pandas DataFrames
Parallelism in Python is not restricted to plain lists. In data analysis and machine learning work, it is often more relevant to process pandas DataFrames, and there are several levels at which this can happen: a function can operate on one row, one column or an entire DataFrame. The first two can be managed with the standard multiprocessing module alone, while whole-DataFrame parallelism often needs more flexible serialisation support than the standard library provides.
Row-wise and Column-wise Parallelism
For row-wise work, one approach is to iterate over df.itertuples(name=False) so that each row is presented as a simple tuple. A hypotenuse(row) function can compute the square root of the sum of squares of two values from each row, with a pool of four worker processes handling the rows through pool.imap(). This resembles DataFrame.apply() with axis=1 conceptually, but the work is spread across processes rather than performed in a single interpreter thread.
Column-wise parallelism follows the same idea but uses df.items() to iterate over columns (it is worth noting that df.iteritems(), which older examples may reference, was deprecated in pandas 1.5.0 and has since been removed, with df.items() being the correct modern equivalent). A sum_of_squares(column) function receives each column as a pair containing the column label and the series itself, and pool.imap() distributes this work across multiple processes. This pattern is useful when independent operations need to be applied to separate columns.
Whole-DataFrame Parallelism with Pathos
Parallelising functions that accept an entire DataFrame or similarly complex object is more difficult with the standard multiprocessing machinery because of serialisation constraints, since the standard library uses pickle internally and pickle has well-known limitations with certain object types. The pathos package addresses this by using dill internally, which supports serialising and deserialising almost all Python types. A DataFrame is split into chunks with np.array_split(df, cores, axis=0), and a ProcessingPool from pathos.multiprocessing maps a function across those chunks, with the results combined using np.vstack(). This extends the same Pool > Map > Close > Join pattern, though the pool is also cleared afterwards with pool.clear().
Lower-level Process Control and Queues
There are broader ways to think about parallel execution beyond multiprocessing.Pool. Lower-level process management with multiprocessing.Process gives explicit control over individual processes, and this can be paired with queues managed through multiprocessing.Manager() for inter-process communication. In such designs, one queue can hold tasks and another can collect results, with worker processes repeatedly fetching tasks, processing them and placing outputs in the result queue, terminating when they receive a sentinel value such as -1. This approach is more verbose than using a pool, but it can be valuable when workflows are dynamic or when processes need long-lived coordination.
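A rough sketch of the task-queue and result-queue design, assuming a trivial squaring job in place of real work and tasks that are never equal to the sentinel. The names worker(), square_all() and SENTINEL are hypothetical.

```python
import multiprocessing as mp

SENTINEL = -1  # tells a worker that no more tasks are coming


def worker(tasks, results):
    """Fetch tasks until the sentinel arrives, putting each output on `results`."""
    while True:
        task = tasks.get()
        if task == SENTINEL:
            break
        results.put((task, task * task))


def square_all(numbers, n_workers=2):
    """Square non-negative integers using explicit processes and managed queues."""
    with mp.Manager() as manager:
        tasks = manager.Queue()
        results = manager.Queue()

        processes = [
            mp.Process(target=worker, args=(tasks, results))
            for _ in range(n_workers)
        ]
        for process in processes:
            process.start()

        for number in numbers:
            tasks.put(number)
        for _ in processes:
            tasks.put(SENTINEL)  # one sentinel per worker

        for process in processes:
            process.join()

        # Every task produced exactly one result
        collected = [results.get() for _ in range(len(numbers))]
    return dict(collected)


if __name__ == "__main__":
    print(square_all([1, 2, 3, 4]))
```

Results arrive in completion order, which is why the sketch returns a dictionary keyed by the original task rather than a list.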
Threads, Executors and External Commands
Python also offers other concurrency models worth knowing. Threads, available through the threading module or concurrent.futures.ThreadPoolExecutor, are often well suited to I/O-bound work such as downloading files or waiting on network responses. Because of the GIL in CPython, threads are less effective for CPU-bound pure Python code, though they can still provide concurrency when much of the time is spent waiting. Process-based approaches, including ProcessPoolExecutor, are generally more effective for CPU-heavy work because they achieve genuine parallel execution across cores.
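The two executor types share one interface, which the following sketch tries to show. time.sleep() stands in for a download or network wait, and the function names are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import time


def simulated_io(delay):
    """Stand-in for an I/O-bound task: sleep, then return the delay."""
    time.sleep(delay)
    return delay


def cpu_bound(n):
    """A small CPU-bound task: sum of squares below n."""
    return sum(i * i for i in range(n))


def run_io_tasks(delays, max_workers=4):
    """Threads suit waiting-heavy work, since a sleeping thread releases the GIL."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(simulated_io, delays))


def run_cpu_tasks(sizes, max_workers=2):
    """Separate processes suit computation-heavy work."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(cpu_bound, sizes))


if __name__ == "__main__":
    print(run_io_tasks([0.1, 0.1, 0.1, 0.1]))
    print(run_cpu_tasks([10, 100]))  # [285, 328350]
```

Because both executors expose map() and submit(), switching between threads and processes is usually a one-line change.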
External process execution forms another category entirely. The os.system() function can launch shell commands, potentially in the background, though it is relatively crude. The subprocess module is more robust, providing better control over arguments, output capture and return codes. These tools are useful when the work is best handled by external programmes rather than Python functions, though they are conceptually distinct from in-Python data parallelism.
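A small sketch of the subprocess approach. To stay portable it launches the current Python interpreter as the external programme; the helper run_python_snippet() is hypothetical.

```python
import subprocess
import sys


def run_python_snippet(code):
    """Run `code` in a separate interpreter; return (return code, stdout)."""
    completed = subprocess.run(
        [sys.executable, "-c", code],  # argument list, so no shell quoting issues
        capture_output=True,           # collect stdout and stderr
        text=True,                     # decode output to str
        check=False,                   # inspect the return code ourselves
    )
    return completed.returncode, completed.stdout.strip()


returncode, output = run_python_snippet("print(2 + 3)")
print(returncode, output)  # 0 5
```

Passing the command as a list rather than a single shell string is what gives subprocess its finer control over arguments compared with os.system().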
Choosing the Right Approach for Parallel Processing in Python
What emerges from all of this is that parallel processing in Python is less about memorising one trick and more about matching the method to the problem at hand. For simple data transformations over independent records, Pool.map() or Pool.starmap() can be effective, while asynchronous methods come into play when results can be collected in any order or when responsiveness matters. When working with pandas, row-wise and column-wise strategies fit naturally into the standard multiprocessing model, whereas whole-object processing may call for a package such as pathos. Lower-level process control, thread pools, external commands and task queues each have their place too.
It is also worth remembering that parallelism is not free. Process creation, serialisation, memory usage and coordination all introduce cost, and the right question is not whether code can be parallelised but whether the effort and overhead make sense for the workload in question. Python provides several mature tools for splitting work across processes and threads, and the multiprocessing module remains one of the most practical for CPU-bound tasks on a single machine, with the Pool interface offering the clearest path from serial code to parallel execution for many everyday applications.
Moving from 32-Bit to 64-Bit SAS on Windows
Moving from 32-bit SAS on Microsoft Windows to a 64-bit environment can look deceptively straightforward from the outside. The operating system is still Windows, programmes often run without alteration, and many data sets open just as expected. Beneath that continuity, however, sit several technical differences that matter considerably in practice, especially for organisations with long-lived code, established format libraries and regular exchanges with Microsoft Office files.
What makes this transition particularly awkward is that SAS treats some of these changes as more than a simple in-place upgrade. As Jacques Thibault notes in his PharmaSUG 2012 paper, a new operating system will often be accompanied by a new version of surrounding applications, and what matters most is ensuring sufficient time and resources to fully test existing programmes under the new environment before committing to the change. SAS file types are not uniformly portable across the 32-bit to 64-bit boundary, and support behaviour also differs by SAS release, with SAS 9.3 marking the point at which some earlier friction was meaningfully reduced. As of 2025, the current release of the SAS 9 line is SAS 9.4 Maintenance 9 (M9), and organisations running any SAS 9.4 release benefit from the data-set interoperability improvements first introduced in SAS 9.3, whilst the catalog and Office-integration issues described in this article remain relevant across all SAS 9.x environments.
Data Sets and Catalogs: A Fundamental Distinction
The broadest distinction is between SAS data sets and SAS catalogs. Data sets are generally more forgiving, while catalogs are not. SAS Usage Note 38339 explains that when upgrading from 32-bit to 64-bit Windows SAS in releases earlier than SAS 9.3, Cross-Environment Data Access (CEDA) is invoked to access 32-bit SAS data sets. CEDA allows the file to be read without immediate conversion, though it can impose restrictions and may reduce performance. The same note states directly that 64-bit SAS provides no access to 32-bit catalogs at all.
That distinction sits at the centre of most migration problems, and it is the reason a move that feels routine can catch teams off guard when they first encounter the ERROR: CATALOG was created for a different operating system message. As Chris Hemedinger explains in a post on The SAS Dummy, the move from 32-bit SAS for Windows to 64-bit SAS for Windows is, for all intents and purposes, a platform change from SAS's perspective, even though only the bit architecture has changed, and SAS catalogs are not portable across platforms.
How SAS Handles Data Sets Across the Boundary
For data sets, the picture is comparatively manageable. If a 32-bit SAS data set is opened in a 64-bit SAS session in releases before SAS 9.3, SAS writes a note to the log stating that the file is native to another host or that its encoding differs from the current session encoding, and that Cross-Environment Data Access will be used, which might require additional CPU resources and might reduce performance. This is SAS performing translation work in the background, and whilst useful for continued access, it is not always ideal for regular production use.
There is an important nuance that changes things significantly with SAS 9.3. In 32-bit SAS on Windows, the data representation is WINDOWS_32, whilst in 64-bit SAS on Windows it is WINDOWS_64. Hemedinger notes that in SAS 9.3 the developers taught SAS for Windows to bypass the CEDA layer when the only encoding difference is WINDOWS_32 versus WINDOWS_64. SAS Knowledge Base article 38379 confirms this, stating that from SAS 9.3 onwards, Windows 32-bit data sets can be read, written and updated in Windows 64-bit SAS, and vice versa, as a result of a change in how SAS determines file compatibility at open time. Users on SAS 9.3 and later, including all SAS 9.4 maintenance releases, may therefore see fewer warnings and less friction with ordinary data sets originating in 32-bit Windows SAS.
Converting Data Sets to Native 64-Bit Format
Even with those SAS 9.3 improvements, many organisations prefer to convert files into the native 64-bit format rather than rely indefinitely on cross-environment access. For entire libraries, PROC MIGRATE is the recommended mechanism. SAS Usage Note 38339 notes that for releases preceding SAS 9.3, PROC MIGRATE can migrate 32-bit SAS data sets to 64-bit, changing their format so that CEDA is no longer required.
The advantages of PROC MIGRATE over the older conversion procedures are set out in detail by Diane Olson and David Wiehle of SAS Institute in their paper hosted by the University of Delaware. Unlike PROC COPY, PROC MIGRATE retains deleted observations, migrates audit trails, preserves all integrity constraints and automatically retains created and last-modified date/times, compression, encryption, indexes and passwords from the source library. It is designed to produce members in the target library that differ from the source only in being in the new SAS format.
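As a minimal sketch of a library-level migration, run in the 64-bit session: the librefs OLD32 and NEW64 and both folder paths are hypothetical, and the target folder is assumed to start empty.

```sas
/* Hypothetical locations: OLD32 holds the 32-bit members, NEW64 is an empty folder */
libname old32 'C:\sasdata\lib32';
libname new64 'C:\sasdata\lib64';

/* Migrate the members of the source library into the format of the current session */
proc migrate in=old32 out=new64;
run;
```

The log is worth reading after a run like this, because any members that could not be migrated (catalogs in particular) are reported there rather than silently skipped.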
When the task concerns individual SAS data files rather than a whole library, SAS Usage Note 38339 points to PROC COPY with the NOCLONE option. Used in a 64-bit SAS session, this copies a 32-bit Windows data set into a new file that is native to the 64-bit environment. The NOCLONE option prevents SAS from cloning the original data representation during the copy, so that the resulting file is written in the target environment's native format and CEDA is no longer needed to process it. Thibault's PharmaSUG paper illustrates this with an example using PROC COPY with the NOCLONE option together with an OUTREP setting on the target LIBNAME statement to force creation in the desired representation.
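The combination described above might look like the following sketch, submitted in 64-bit SAS. The librefs, paths and the data set name DEMOG are hypothetical placeholders rather than names taken from the usage note or the paper.

```sas
/* Source library holding the 32-bit data set (hypothetical path) */
libname old32 'C:\sasdata\lib32';

/* Target library; OUTREP= requests the 64-bit Windows data representation */
libname new64 'C:\sasdata\lib64' outrep=windows_64;

/* NOCLONE stops PROC COPY from carrying over the source data representation */
proc copy in=old32 out=new64 noclone memtype=data;
   select demog;   /* hypothetical data set name */
run;
```

Restricting the copy with MEMTYPE=DATA and SELECT keeps catalogs out of the operation, which matters because they cannot be copied across the boundary in this way.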
Catalogs: The Hard Problem
Catalogs are a different matter entirely. If a user running 64-bit SAS attempts to open a catalog created in a 32-bit SAS session, the familiar error appears: ERROR: CATALOG was created for a different operating system. In the case of format catalogs, a related message often reads ERROR: File LIBRARY.FORMATS.CATALOG was created for a different operating system, and this is frequently followed by failures to use user-defined formats attached to variables. As the SSCC guidance from the University of Wisconsin-Madison notes, this can prevent 64-bit SAS from reading the data set at all, with the error about formats recorded only in the log whilst the visible symptom is simply that the table did not open.
This matters because catalogs are machine-dependent. User-defined formats created by PROC FORMAT are usually stored in catalogs, often in a member named FORMATS. If those formats were built in 32-bit SAS, 64-bit SAS cannot use the catalog directly, and this affects not only explicit formatting in code but also routine data viewing because a data set linked to permanent user-defined formats may fail to display properly unless the associated format catalog is converted.
Options for Migrating Format Catalogs
There are several ways to address catalog incompatibility. If the original PROC FORMAT source code still exists, the cleanest option is simply to rerun it under 64-bit SAS, producing a fresh native catalog. The SSCC guidance treats this as the easiest solution that preserves the formats themselves, and it also describes a short-term workaround: adding to the DATA or PROC step a FORMAT statement that names the variable but no format (format female; in the SSCC example, where female is the variable). This removes the custom format from that variable, so SAS has no need to read the problem catalog file at all.
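The short-term workaround can be sketched as follows. The libref MYLIB, its path and the data set name SURVEY are hypothetical; FEMALE is the variable name used in the SSCC example.

```sas
libname mylib 'C:\sasdata\survey';   /* hypothetical library */

proc print data=mylib.survey;
   /* Naming a variable with no format removes its format for this step,
      so the 32-bit format catalog is never consulted for FEMALE */
   format female;
run;
```

The stored data set is unchanged by this; the format is only dropped for the duration of the step.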
When source code is not available, transport-based conversion is the answer. In a 32-bit SAS session, PROC CPORT creates a transport file from the catalog library, and in a 64-bit SAS session, PROC CIMPORT recreates the catalog in the new environment. SAS Knowledge Base article KB0041614 provides sample code that creates a transport file in 32-bit SAS using proc cport lib=my32 file=trans memtype=catalog; select formats; and then unloads it in 64-bit SAS using PROC CIMPORT, after which a new Formats.sas7bcat file should be present in the target library. The same article notes that if access to a 32-bit SAS session is simply not available, the system option NOFMTERR can be submitted as a last resort: this allows the underlying data values to be displayed whilst user-defined formats are ignored, avoiding the error without converting the catalog.
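A sketch of the two-session transport route is shown below. The MY32 libref, TRANS fileref and SELECT FORMATS statement follow the sample quoted above; the MY64 libref and all paths are hypothetical.

```sas
/* ---- Session 1: 32-bit SAS ---- */
libname my32 'C:\sasdata\lib32';              /* library holding formats.sas7bcat */
filename trans 'C:\transfer\formats.cpt';     /* transport file to create */

proc cport lib=my32 file=trans memtype=catalog;
   select formats;
run;

/* ---- Session 2: 64-bit SAS ---- */
libname my64 'C:\sasdata\lib64';              /* target library */
filename trans 'C:\transfer\formats.cpt';     /* same transport file */

proc cimport lib=my64 infile=trans;
run;

/* Last resort when no 32-bit session exists: show raw values, ignore missing formats */
/* options nofmterr; */
```

The two halves cannot be submitted together; each must run in the session named in its comment.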
A more robust route for user-defined formats is to avoid moving the catalog as a catalog at all. PROC FORMAT can write format definitions to a standard SAS data set using CNTLOUT, and later rebuild them from that data set using CNTLIN. Because SAS data sets are generally portable across the 32-bit to 64-bit boundary, this method sidesteps the catalog incompatibility directly. KB0041614 describes CNTLOUT/CNTLIN as the most robust method available for migrating user-defined format libraries. Karin LaPann, writing in a poster presented at a meeting of the Philadelphia Area SAS Users Group, reaches the same conclusion and recommends always creating data sets from format catalogs and storing them alongside the data in the same library as a matter of good practice.
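The CNTLOUT/CNTLIN route might be sketched like this, assuming the format catalog lives in a library referenced as MY32. The data set name FMTDEFS, the MY64 libref and the paths are hypothetical.

```sas
/* ---- Session 1: 32-bit SAS ---- */
libname my32 'C:\sasdata\lib32';

/* Write every format definition in MY32.FORMATS to an ordinary data set */
proc format library=my32 cntlout=my32.fmtdefs;
run;

/* ---- Session 2: 64-bit SAS ---- */
libname my32 'C:\sasdata\lib32';
libname my64 'C:\sasdata\lib64';

/* Rebuild a native format catalog in MY64 from the control data set */
proc format library=my64 cntlin=my32.fmtdefs;
run;
```

Keeping FMTDEFS alongside the data, as LaPann recommends, means the catalog can be rebuilt again on any later platform change.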
Caveats: Item Stores, Compiled Macros and the PIPE Engine
SAS Usage Note 38339 explicitly states that stored compiled macro catalogs are not supported by PROC CPORT and must be recompiled in the new operating environment, with SAS Note 46846 covering compatibility guidance for those files specifically. The note also warns that the 32-bit version of SAS should not be removed until it can be verified that all 32-bit catalogs have been successfully migrated.
Thibault's PharmaSUG paper identifies two further file types that require attention. SAS Item Store files (.sas7bitm), which organisations may use to store standard PROC TEMPLATE output templates, are not compatible across 32-bit and 64-bit environments, and the practical solution is to recreate them under the new environment using the same programme that created them originally, targeting a different output directory to avoid a mixed 32-bit and 64-bit directory. Thibault also notes that programmes using the PIPE engine may produce errors on Windows 64-bit environments, and recommends replacing such code with newer SAS functions such as filename, dopen and dread to avoid the issue altogether. These are not universal blockers, but they underline why testing is essential rather than assumed.
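As an illustration of the kind of replacement Thibault recommends, the sketch below lists the members of a directory with FILENAME, DOPEN and DREAD instead of piping a DIR command. It is not code from the paper; the fileref SRCDIR, the path and the output data set name are hypothetical.

```sas
data work.filelist;
   length name $256;
   /* Assign a fileref to the directory (hypothetical path) */
   rc  = filename('srcdir', 'C:\sasdata\incoming');
   did = dopen('srcdir');               /* 0 means the directory could not be opened */
   if did > 0 then do;
      do i = 1 to dnum(did);            /* DNUM returns the number of members */
         name = dread(did, i);          /* name of the i-th member */
         output;
      end;
      rc = dclose(did);
   end;
   rc = filename('srcdir');             /* clear the fileref */
   keep name;
run;
```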
Microsoft Office Integration After the Move
Another area where 64-bit moves catch users out is access to Microsoft Excel and Access files. The issue is not SAS data compatibility but the bit-ness of the Microsoft data providers. In 64-bit SAS for Windows, attempts to use PROC IMPORT with DBMS=EXCEL, PROC EXPORT with Excel or Access options, or LIBNAME EXCEL can fail with errors such as ERROR: Connect: Class not registered or Connection Failed. As Hemedinger explains, the cause is that the 64-bit SAS process cannot use the built-in data providers for Microsoft Excel or Microsoft Access, which are usually 32-bit modules. Thibault's paper confirms that the PC Files Server must be installed on the same machine, since the 32-bit ODBC drivers needed to read these files are incompatible with 64-bit SAS on Windows.
The workarounds depend on the file type and local setup. SAS/ACCESS to PC Files provides methods such as DBMS=EXCELCS, DBMS=ACCESSCS and LIBNAME PCFILES, all of which use the PC Files Server as an intermediary, with an autostart feature that minimises configuration changes to existing SAS programmes. For .xlsx files, DBMS=XLSX removes the Microsoft data providers from the equation entirely and requires no additional setup from SAS 9.3 Maintenance 1 onwards. Installing 64-bit Microsoft Office may appear to solve the bit-ness mismatch by supplying 64-bit providers, but as Hemedinger cautions, Microsoft recommends the 64-bit version of Office in only a few circumstances, and that route can introduce other incompatibilities with how Office applications are used.
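The two import routes could be sketched as follows. The workbook paths and output data set names are hypothetical, and the second step assumes SAS/ACCESS to PC Files is licensed and the PC Files Server is installed.

```sas
/* .xlsx workbook: read directly, with no Microsoft data provider involved */
proc import datafile='C:\data\sales.xlsx'
            out=work.sales
            dbms=xlsx
            replace;
   getnames=yes;
run;

/* Older .xls workbook: routed through the PC Files Server */
proc import datafile='C:\data\sales_2009.xls'
            out=work.sales_2009
            dbms=excelcs
            replace;
run;
```

Existing programmes that used DBMS=EXCEL often need no change beyond the DBMS= value, which is why testing a handful of representative imports early is worthwhile.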
Identifying 32-Bit Catalogs in a Mixed Environment
In mixed environments, a practical challenge is identifying which catalogs are still 32-bit and which are already 64-bit. This was precisely the problem Michael Raithel posed on LinkedIn in March 2015, after finding that no SAS facility, whether PROC CATALOG, PROC CONTENTS, PROC DATASETS or the Dictionary Tables, provided a direct way to distinguish them. His solution treats the .sas7bcat file as a flat file rather than a catalog, reading the first record and searching for the character strings W32_7PRO (identifying a 32-bit catalog) and X64_7PRO (identifying a 64-bit catalog). The macro he developed can be run against any number of catalogs and builds a SAS data set recording the bit-ness and full path of each file, making large-scale inventory automation entirely practical during a phased transition.
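The idea behind that approach can be sketched for a single file as below. This is not Raithel's macro: the path is hypothetical, and the 1024-byte record length is an assumption about how much of the file header needs to be scanned to find the marker strings.

```sas
data work.catalog_bitness;
   length bitness $7;
   /* Treat the catalog as a fixed-length binary flat file; read only the first record */
   infile 'C:\sasdata\lib32\formats.sas7bcat' recfm=f lrecl=1024 obs=1;
   input;   /* loads the record into the _INFILE_ buffer */
   if index(_infile_, 'W32_7PRO') > 0 then bitness = '32-bit';
   else if index(_infile_, 'X64_7PRO') > 0 then bitness = '64-bit';
   else bitness = 'unknown';
run;
```

Wrapping a step like this in a macro that loops over a list of .sas7bcat paths would give the kind of inventory data set described above.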
For broader validation work, the Olson and Wiehle paper pairs PROC MIGRATE with macros based on PROC CONTENTS, PROC DATASETS and PROC COMPARE, documenting what existed in the source library before migration and verifying what exists in the target library afterwards. For highly regulated or large-scale environments, that kind of structured checking is not optional.
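A much-reduced sketch of that before-and-after checking is shown below. It is not the Olson and Wiehle macro set; the librefs OLD32 and NEW64 and the data set name DEMOG are hypothetical.

```sas
/* List the members of each library without printing per-member detail */
proc contents data=old32._all_ nods;
run;
proc contents data=new64._all_ nods;
run;

/* Compare one migrated data set with its source, value by value */
proc compare base=old32.demog compare=new64.demog;
run;
```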
Navigating the Transition Without Unnecessary Disruption
The main lesson from all of this is that moving from 32-bit to 64-bit SAS on Windows is not simply a matter of reinstalling software and carrying on unchanged. Much will work as before, particularly with ordinary data sets and particularly in SAS 9.3 and later. Catalogs, format libraries, item stores and Microsoft Office integration, however, require deliberate attention.
The transition is not so much problematic as predictable. Keeping 32-bit SAS available until catalog migration is confirmed, using PROC MIGRATE for full libraries, using PROC COPY with NOCLONE for individual data sets, converting format catalogs via CPORT/CIMPORT or CNTLOUT/CNTLIN, recreating item stores and compiled macros in the new environment, and testing Office-related workflows and PIPE-based code before deployment together form a sound path through the process. With that preparation in place, the advantages of a 64-bit environment can be gained without avoidable disruption.
Shaping SAS output using ODS Style Definitions as well as SAS Formats
Working with SAS output involves two related but distinct concerns: how results look, and how values are displayed. The material here covers both sides of that equation. On one hand, the DEFINE STYLE statement in PROC TEMPLATE provides a way to create and customise ODS styles for destinations that support the STYLE= option. On the other, SAS formats determine how character, numeric, date and time values are written in output. Taken together, these features shape both presentation and readability, which is why it is useful to understand them in the same discussion.
The DEFINE STYLE Statement
The DEFINE STYLE statement is the foundation for creating a stand-alone style. Its syntax allows a style to be stored in a template store and to include inherited behaviour, notes, imported CSS and individual style element definitions. A style definition begins with DEFINE STYLE followed by either a style path or, in one special case, the name Base.Template.Style, and it must end with an END statement. That final END is a hard requirement. Within the body of the style, statements such as PARENT=, NOTES, CLASS, IMPORT and STYLE determine how the style behaves and what it contains.
Style Paths and the STORE= Option
The style path identifies where a style is stored. It consists of one or more names separated by periods, with each name representing a directory in a template store. PROC TEMPLATE writes the style to the first writeable template store in the current path unless a STORE= option directs it elsewhere. The STORE=libref.template-store option specifies a particular template store, and if that template store does not already exist, SAS creates it automatically. One important point is that the syntax of the STORE= option does not become part of the compiled template, so it affects where the style is saved rather than the internal definition itself.
Base.Template.Style
A notable special case is Base.Template.Style. This creates a style that becomes the parent of all styles that do not explicitly specify a parent, and once created it is automatically applied to output until it is specifically removed from the item store. That convenience comes with a clear caution: the SAS-supplied Base.Template.Style contains inheritance information relied upon by many styles, and if that inheritance structure is not preserved, some style elements might not appear in output. The safer route is therefore to start from the existing Base.Template.Style, write it to an external file and edit its contents rather than constructing a replacement from scratch. There is also a restriction: if PARENT= is specified, it must refer to a style other than Base.Template.Style.
Inheritance and the PARENT= Statement
Inheritance is central to how ODS styles work. The PARENT= statement specifies the style from which the current style inherits its style elements, style attributes and statements. The style path named in PARENT= is looked up in the first readable template store in the current path, and unless the current style overrides something, everything in the parent style carries through. SAS ships with several styles that can be used as a base, including styles.default, styles.beige, styles.brick, styles.brown, styles.d3d, styles.minimal, styles.printer and styles.statdoc. This inheritance model makes style creation more manageable because most new styles are refinements of existing ones rather than fully independent definitions.
The NOTES Statement
For documentation inside the style itself, the NOTES statement provides a place to store descriptive text. This differs from a SAS comment because the text becomes part of the compiled style template and can be viewed with the SOURCE statement. That makes NOTES useful for recording what a style is for, what it changes, or any implementation detail worth preserving alongside the template. In a shared environment, that sort of embedded documentation can be more durable than comments kept in a separate program file.
The CLASS Statement
The CLASS statement creates a style element from a like-named style element. In practical terms, it duplicates an existing element of the same name and applies modifications. The three statements class fonts;, style fonts from fonts; and style fonts from _self_; are equivalent, making CLASS a convenience form for a common pattern. It takes one or more style element names, optional descriptive text and optional attribute specifications. If the same attribute is specified more than once, the last value given is the one SAS uses, and that rule is worth keeping in mind when reading or maintaining larger templates.
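The pieces covered so far (style path, STORE=, PARENT=, NOTES and CLASS) can be combined in a short sketch. The style name styles.corporate, the note text, the colour choice and the output path are hypothetical; SASUSER.TEMPLAT is used here on the assumption that it is the writeable template store in the current path.

```sas
proc template;
   define style styles.corporate / store=sasuser.templat;
      parent = styles.default;                    /* inherit everything from the default style */
      notes "House style: default look with blue system titles";
      class systemtitle / color = blue;           /* same as: style systemtitle from _self_ / ... */
   end;                                           /* END is required */
run;

/* Display the compiled definition, including the NOTES text */
proc template;
   source styles.corporate;
run;

/* Apply the style to some output */
ods html file='C:\reports\class.html' style=styles.corporate;
proc print data=sashelp.class;
   title 'Class listing';
run;
ods html close;
```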
The STYLE Statement
The STYLE statement is more general and is the main mechanism for creating or modifying one or more style elements. It can define new elements, override inherited ones, or absorb attributes from an existing element by using the FROM option. When a new style element overrides one that is a parent of other elements, all of its descendants (including those inherited from parent styles) also inherit the new attributes, which is one of the reasons why small changes can have broad visual effects in output. Style elements within a single STYLE statement must be separated by commas.
The distinction between using FROM and not using it is particularly important. If a like-named style element already exists in the child style, and it is not created with FROM, the child version overrides the parent version entirely. If it is created with FROM, the attributes from the parent style element are absorbed into the child style element. Without FROM, an attribute defined in a like-named style element in the parent is not inherited unless it is explicitly specified again. With FROM, inherited attributes remain in play and can then be modified selectively, and this is the practical difference between replacement and extension.
The _SELF_ keyword is a shorthand within the STYLE statement, specifying that each named style element should inherit from an existing style element of the same name. It is most useful when specifying multiple style elements in one statement. For example, the single statement style data, data1, dataempty from _self_ / color = red backgroundcolor = black; is exactly equivalent to writing separate STYLE statements for data, data1 and dataempty individually. Where the same attribute appears more than once among multiple identical style element names, the last value specified is used. PROC TEMPLATE looks first in the current style for the named style element when resolving a FROM reference, and only looks in the parent style if the element is not found there.
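The difference between extension and replacement can be seen by defining two small styles side by side. The style names and attribute values below are hypothetical, chosen only to make the contrast visible.

```sas
proc template;
   /* Extension: Header keeps the attributes inherited from styles.default
      and only the text colour changes */
   define style styles.hdr_extend;
      parent = styles.default;
      style header from header / color = red;
      /* One statement, several elements, each inheriting from its own namesake */
      style data, dataempty from _self_ / backgroundcolor = white;
   end;

   /* Replacement: no FROM, so Header carries only the attribute listed here */
   define style styles.hdr_replace;
      parent = styles.default;
      style header / color = red;
   end;
run;
```

Rendering the same table with each style makes the loss of the inherited header attributes in the second definition easy to spot.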
Style Attributes
Style attributes follow the general form style-attribute-name=<|>style-attribute-value. Standard attribute names from the documented list are written without quotation marks, while user-defined attribute names must be enclosed in quotation marks. The vertical bar (|) symbol prevents the style attribute from being inherited by any child style elements, allowing a template author to control precisely how far a change spreads through the inheritance tree. Text associated with a STYLE statement also becomes part of the compiled template (much like NOTES), which can help explain why a specific element is defined in a particular way.
The IMPORT Statement and CSS
The IMPORT statement bridges CSS and ODS styles by importing Cascading Style Sheet information from a file into the style. The file specification can be an external file path, a fileref or a URL, and once imported, SAS converts the CSS code into style attributes and style elements that can be used by PROC TEMPLATE. Two requirements apply: the CSS file must be written in the same type of CSS that the ODS HTML statement produces, and only class names that match ODS style element names are supported, with no IDs and no context-based selectors permitted. If needed, the CSS that ODS creates can be examined with the STYLESHEET= option, or by viewing the HTML source and inspecting the code at the top of the file.
Media types add another layer to the IMPORT statement. The syntax allows up to ten media types to be specified, separated by commas, corresponding to how output will be rendered on screen, paper, with a speech synthesiser or with a braille device, for example. CSS code outside any media block is always included, and the media type option additionally imports the section of a CSS file intended only for a specific media type. If no media type is specified in the ODS statement, but media types exist in the CSS file, ODS uses the Screen media type by default. If multiple media types are specified, all of their style information is applied, though if duplicate style information appears in different media blocks, the styles from the last media block are used.
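A minimal sketch of an import with one media type follows. The style name and CSS path are hypothetical, and the file is assumed to contain a print media block written in the form of CSS that ODS HTML itself generates.

```sas
proc template;
   define style styles.fromcss;
      parent = styles.default;
      /* Import the general CSS rules plus the section meant for print media */
      import 'C:\styles\corporate.css' print;
   end;
run;
```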
The REPLACE Statement
One statement that no longer belongs in current practice is REPLACE. The SAS documentation states plainly that it is no longer supported and that STYLE or CLASS should be used instead to create and modify style elements. That is a useful reminder when reading older code, as REPLACE appears in legacy templates and conference papers that predate its deprecation.
The ODS Style Element Catalogue
To make sense of style customisation, it helps to understand the wider catalogue of ODS style elements. These elements are organised by function, and many are abstract, meaning they exist for inheritance purposes rather than direct rendering. Abstract elements are not explicitly used in ODS output and will not appear in destinations that generate a style sheet.
Miscellaneous and Document Elements
A broad abstract element, Container, controls all container-oriented elements and sits near the top of several inheritance chains. Document-related elements such as Document, Body, Frame, Contents and Pages control the overall presentation of output files, including page background and margins, with Body, Frame, Contents and Pages all inheriting from Document. Several further miscellaneous elements handle specific rendering concerns: Continued controls the continued flag when a table breaks across a page (paginated destinations only), ExtendedPage handles the message displayed when a page will not fit (Printer destination only), PageNo controls page numbers for paginated destinations and Parskip controls the space between tables. UserText controls the ODS TEXT= style and inherits from Note. The StartUpFunction and ShutDownFunction elements add JavaScript functions to HTML output that execute on page load and page exit, respectively, and PrePage controls the ODS RTF/MEASURED PREPAGE= style.
Date Elements
Date-related elements include Date (an abstract element controlling how date fields look), BodyDate (which controls the date field in the Body file and inherits from Date), ContentsDate (which controls the date field in the Contents file and inherits from Date) and PagesDate (which controls the date field in the Pages file and inherits from Date).
Contents and Pages Elements
Contents and pages files are influenced by a substantial group of elements. IndexItem is an abstract element controlling list items and folders for both files. ContentFolder controls folders in the Contents file, and ByContentFolder controls byline folders there, inheriting from ContentFolder. ContentItem controls items in the Contents file and PagesItem controls items in the Pages file, both inheriting from IndexItem. The abstract element Index covers miscellaneous Contents and Pages components, and from it inherit IndexProcName, ContentProcName, ContentProcLabel, PagesProcName and PagesProcLabel, which handle procedure names and labels in each file. IndexTitle and ContentTitle control the titles of the Contents and Pages files; in styles.default, ContentTitle contains a PRETEXT= attribute that prints the text "Table of Contents". IndexAction and FolderAction determine what happens on mouse-over events for folders and items (HTML only). SysTitleAndFooterContainer controls the container for system page titles and footers, and is generally used to add borders around a title.
Titles, Footers and Related Elements
Titles and footers are handled by the abstract element TitlesAndFooters, which controls system page title and footer text. SystemTitle inherits from it and chains through SystemTitle2 up to SystemTitle10, with each inheriting from the one before. The footer series follows the same pattern from SystemFooter through SystemFooter2 to SystemFooter10. TitleAndNoteContainer controls the container for procedure-defined titles and notes, inheriting from Container. ProcTitle controls procedure title text and inherits from TitlesAndFooters, with ProcTitleFixed handling procedure title text that requests a fixed font.
Bylines
BylineContainer controls the container for the byline (generally used to add borders) and inherits from Container. Byline controls byline text and inherits from TitlesAndFooters.
Notes, Warnings and Errors
Notes, warnings and errors each consist of two pieces: a banner area and a content area. The abstract element Note controls the container for note banners and note contents, and inherits from Container. The banner elements (NoteBanner, WarnBanner, ErrorBanner and FatalBanner) generally use the PRETEXT= attribute to print the banner label. Each has a corresponding content element (NoteContent, WarnContent, ErrorContent and FatalContent), and fixed-font variants exist for note, warning and error content (NoteContentFixed, WarnContentFixed and ErrorContentFixed). All of these elements inherit from Note.
Table Elements
Elements governing table output form a substantial hierarchy. Output is an abstract element that controls basic output forms, including borders (via FRAME=, RULES= and individual border control attributes), cell spacing, cell padding and background colour, inheriting from Container. Table controls overall table style and inherits from Output, as does Batch (which controls batch mode output). Three further abstract elements are specific to RTF output: TableHeaderContainer (which places and controls the box around all column headings), TableFooterContainer (which does the same for column footers) and ColumnGroup (which controls the box around groups of columns).
Data Cell Elements
Cell is an abstract element that controls data, header and footer cells, inheriting from Container. Data cells are controlled by Data (the default style for data cells), DataFixed (for data cells requesting a fixed font), DataEmpty (for empty data cells), DataEmphasis (for emphasised data cells), DataEmphasisFixed (for emphasised data cells requesting a fixed font), DataStrong (for strong, more emphasised data cells) and DataStrongFixed. All inherit from Cell or from one another in a chain.
Header and Footer Cell Elements
Header and footer cells are governed by HeadersAndFooters, an abstract element inheriting from Cell. Headers include Header, HeaderFixed, HeaderEmpty, HeaderEmphasis, HeaderEmphasisFixed, HeaderStrong and HeaderStrongFixed. Row headers follow a parallel set: RowHeader, RowHeaderFixed, RowHeaderEmpty, RowHeaderEmphasis, RowHeaderEmphasisFixed, RowHeaderStrong and RowHeaderStrongFixed. Footers mirror the same pattern through Footer, FooterFixed, FooterEmpty, FooterEmphasis, FooterEmphasisFixed, FooterStrong and FooterStrongFixed, with row footers following suit via RowFooter and its variants. PROC TABULATE captions are separately covered by the abstract element Caption (which inherits from HeadersAndFooters), BeforeCaption and AfterCaption.
SAS Formats
While styles affect appearance, formats affect representation. SAS organises formats into four categories: Character, Date and Time, ISO 8601 and Numeric. Formats that support national languages are documented separately in the SAS National Language Support reference, and storing user-defined formats is an important consideration when those formats are associated with variables in permanent SAS data sets shared with others.
Character Formats
Character formats cover both simple display and conversion tasks. $CHARw. and $w. write standard character data, while $QUOTEw. encloses values in double quotation marks. $UPCASEw. converts character data to uppercase, and $MSGCASEw. writes uppercase output when the MSGCASE system option is in effect. Several formats transform character data into alternative encodings or representations: $ASCIIw. converts to ASCII, $EBCDICw. converts to EBCDIC, $HEXw. converts to hexadecimal, $BINARYw. converts to binary and $OCTALw. converts to octal. Others alter ordering or length handling: $REVERJw. writes character data in reverse order and preserves blanks, $REVERSw. writes it in reverse and left-aligns it, and $VARYINGw. writes character data of varying length. $BASE64Xw. converts character data into ASCII text using Base 64 encoding.
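A few of these character formats can be tried in a short DATA step that writes to the log. The variable name and value are arbitrary, and the hexadecimal result assumes an ASCII-based session.

```sas
data _null_;
   name = 'Smith';
   put name $upcase5.;    /* SMITH */
   put name $quote7.;     /* "Smith" */
   put name $revers5.;    /* htimS */
   put name $hex10.;      /* 536D697468 on an ASCII system */
run;
```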
Date and Time Formats
Date and time formats are especially broad. Traditional date formats include DATEw. (writing values as ddmmmyy or ddmmmyyyy), DDMMYYw. and DDMMYYxw. (day-month-year with various separators), MMDDYYw. and MMDDYYxw. (month-day-year), YYMMDDw. and YYMMDDxw. (year-month-day), MONYYw. (month and year), MONNAMEw. (month name), DOWNAMEw. (day of week name), WEEKDATEw. and WEEKDATXw. (day of week and date in different orderings) and WORDDATEw. and WORDDATXw. (month name with day and year in different orderings). Quarter and year formats include QTRw., QTRRw. (Roman numerals), YEARw., YYQw., YYQxw., YYQRw. and YYQRxw.. Week number formats include WEEKUw., WEEKVw. and WEEKWw., each using a different numbering algorithm.
Year-month combination formats include YYMMw., YYMMxw., YYMONw., MMYYw. and MMYYxw.. DAYw. writes the day of the month and WEEKDAYw. writes the day of the week as a number. Time and date time formats include TIMEw.d, TIMEAMPMw.d, TODw.d, HHMMw.d, HOURw.d, MMSSw.d, DATETIMEw.d and DATEAMPMw.d. Formats that take a date time value and write only part of it include DTDATEw., DTMONYYw., DTWKDATXw., DTYEARw. and DTYYQCw.. Julian date formats include JULDAYw. (Julian day of the year), JULIANw. (Julian date in yyddd or yyyyddd), PDJULGw. (packed Julian in hexadecimal yyyydddF for IBM) and PDJULIw. (packed Julian in hexadecimal ccyydddF for IBM).
The $N8601 character formats also appear within the Date and Time category. $N8601Bw.d and $N8601BAw.d both write ISO 8601 duration, date time and interval forms using basic notations. $N8601Ew.d and $N8601EAw.d use extended notations. $N8601EHw.d uses extended notation with a hyphen for omitted components, $N8601EXw.d uses an x in place of each digit of an omitted component, $N8601Hw.d drops omitted components in duration values and uses a hyphen for omitted date time components, and $N8601Xw.d drops omitted duration components and uses an x for each digit of an omitted date time component.
ISO 8601 Formats
The ISO 8601 category covers the same $N8601 character formats listed above, together with the B8601 (basic notation) and E8601 (extended notation) families of numeric formats. Basic formats include B8601DAw. (date as yyyymmdd), B8601DNw. (date from a date time value as yyyymmdd), B8601DTw.d (date time as yyyymmddThhmmssffffff), B8601DZw. (date time in UTC with time zone offset as yyyymmddThhmmss+|-hhmm), B8601LZw. (local time with UTC offset as hhmmss+|-hhmm), B8601TMw.d (time as hhmmssffff) and B8601TZw. (time adjusted to UTC as hhmmss+|-hhmm). Extended formats follow the same structure: E8601DAw. (date as yyyy-mm-dd), E8601DNw., E8601DTw.d, E8601DZw., E8601LZw., E8601TMw.d and E8601TZw.d, each using hyphen and colon delimiters to separate date and time components. These formats are important where standards compliance, machine readability or time zone clarity matter.
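The contrast between the traditional and ISO 8601 date formats is easiest to see by writing one value several ways. The date and datetime constants below are arbitrary examples.

```sas
data _null_;
   d  = '15MAR2024'd;            /* SAS date value */
   dt = '15MAR2024:13:45:30'dt;  /* SAS datetime value */

   put d  date9.;       /* 15MAR2024 */
   put d  ddmmyy10.;    /* 15/03/2024 */
   put d  yymmdd10.;    /* 2024-03-15 */
   put d  b8601da8.;    /* 20240315 (ISO 8601 basic) */
   put d  e8601da10.;   /* 2024-03-15 (ISO 8601 extended) */
   put dt e8601dt19.;   /* 2024-03-15T13:45:30 */
run;
```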
Numeric Formats
Numeric formats address general presentation, technical encoding and domain-specific output. BESTw. lets SAS choose the best notation, w.d writes standard numeric data one digit per byte and Zw.d adds leading zeroes. BESTDw.p lines up decimal places for values of similar magnitude and prints integers without decimals. Dw.p does the same over a potentially wider range of values, and Ew. writes values in scientific notation.
Financial and punctuation-sensitive displays are handled by COMMAw.d (comma every three digits, period for decimal), COMMAXw.d (period every three digits, comma for decimal), NUMXw.d (comma in place of the decimal point), DOLLARw.d, DOLLARXw.d, PERCENTw.d, PERCENTNw.d (using a minus sign for negative values) and NEGPARENw.d (negative values in parentheses). Integer and binary formats include IBw.d (native integer binary including negative values), IBRw.d (integer binary in Intel and DEC formats), PIBw.d (positive integer binary), PIBRw.d (positive integer binary in Intel and DEC formats) and RBw.d (real binary floating-point). Floating-point formats include FLOATw.d (native single-precision) and IEEEw.d. FRACTw. converts values to fractions.
Encoding formats include HEXw. (hexadecimal), BINARYw. (binary), OCTALw. (octal), PDw.d (packed decimal), PKw.d (unsigned packed decimal) and ZDw.d (zoned decimal). IBM mainframe formats form their own group: S370FFw.d (standard numeric), S370FIBw.d (integer binary including negative values), S370FIBUw.d (unsigned integer binary), S370FPDw.d (packed decimal), S370FPDUw.d (unsigned packed decimal), S370FPIBw.d (positive integer binary), S370FRBw.d (real binary floating-point), S370FZDw.d (zoned decimal), S370FZDLw.d (zoned decimal leading sign), S370FZDSw.d (zoned decimal separate leading sign), S370FZDTw.d (zoned decimal separate trailing sign) and S370FZDUw.d (unsigned zoned decimal). VAXRBw.d writes real binary data in VMS format and VMSZNw.d generates VMS and OpenText COBOL zoned numeric data.
Readable formats include ROMANw. (Roman numerals), WORDSw. (values as words) and WORDFw. (values as words with fractions shown numerically). The SSNw. format writes Social Security numbers and PVALUEw.d writes p-values.
Combining ODS Styles and Formats for Cleaner SAS Output
The connection between style definitions and formats is straightforward, even if the details are substantial. Styles determine the visual structure of ODS output through inheritance, element definitions and optional CSS imports, while formats determine how the values inside that output are written. A report can therefore be shaped at two levels at once: the appearance of titles, tables, notes and cells through DEFINE STYLE, and the textual form of dates, times, percentages, identifiers and other values through the SAS format system. Understanding both gives a clearer picture of how SAS turns data into output that is both functional and legible.
Managing Microsoft Outlook on Windows: Fonts, Zoom, Data Files and Deployment Controls
Outlook continues to evolve across Windows, with a mixture of everyday personalisation options for users and deployment controls for administrators. Recent guidance from Microsoft brings together practical steps for composing messages in a preferred typeface, approaches for reading messages more comfortably, and a set of administrative measures to manage when and how the new Outlook appears in an organisation. Alongside this are reminders about where Outlook stores data on different account types and how that affects moving between computers, as well as pointers for finding POP, IMAP and SMTP settings for Outlook.com when manual configuration is needed. What follows draws these threads together so that individual users and IT teams can navigate the changes with clarity.
Changing the Default Font for New Messages and Replies
For those composing email, Outlook starts with a familiar default: new messages use Calibri in black. This is only a starting point because the application allows the font, its colour, size and style to be changed, and it treats new messages separately from replies and forwards so that different choices can be set for each if desired.
In new Outlook for Windows, the path goes like this: View > View Settings > Email > Compose and Reply. Under Message Format, the preferred font, size and style can be chosen before saving, and these settings then apply whenever a message is written or a reply is sent. Note that in new Outlook the font setting applies to both new messages and replies and forwards from a single control, so a separate choice for each is not available in this version.
In classic Outlook for Windows, the approach is different and more granular. Navigating to File > Options > Mail reveals a Stationery and Fonts button. On the Personal Stationery tab, there are separate Font buttons for new mail messages and for replying or forwarding messages, which allows a distinct typeface, size and colour to be set for each scenario independently. This separation can be useful for distinguishing composed messages from replied ones at a glance. If similar changes are needed for the message list rather than the compose window, there is a separate set of options for changing the font or font size in the message list.
Adjusting the Zoom Level in the Reading Pane
Comfort when reading is equally important, particularly with longer emails. Both new and classic Outlook offer ways to adjust zoom in the Reading Pane without touching system-wide display settings, though the controls differ between the two versions. In new Outlook, selecting a message in the inbox opens it in the Reading Pane, after which the View tab's Zoom control can be used. Zooming in and out is done with plus and minus buttons, and there is a Reset option that returns the view to its default level. In classic Outlook, the same result can be achieved either by dragging the zoom bar at the bottom right of the window or by going to View and then Zoom, where a specific percentage between 50% and 200% can be chosen. Classic Outlook also offers a "Remember my preference" checkbox in the Zoom dialogue, which locks the chosen level so it persists across sessions without needing to be reset each time. In both versions, these adjustments affect only how messages appear on the screen and have no bearing on how they are composed or how recipients will see them.
Confirming Which Version of Outlook Is in Use
Not every copy of Outlook presents the same options at the same time. If steps that are described as applying to new Outlook do not appear, the device may still be running classic Outlook for Windows. That is not uncommon in environments where administrators are controlling the transition or where devices have not yet received the relevant updates, so checking the version in use is a sensible first step before assuming that something has gone wrong.
Hiding the New Outlook Toggle in Classic Outlook
For administrators, a recurring question is how to prevent users from switching to new Outlook until the organisation is ready. Microsoft provides a cloud policy in the Microsoft 365 Apps admin centre that hides the Try the new Outlook toggle in classic Outlook for Windows. After signing in to the admin centre, the policy can be created by going to Customisation, selecting Policy Management and enabling the policy named Hide the "Try the new Outlook" toggle in Outlook. There is also a registry-based method for controlling the same setting: the key is under HKEY_CURRENT_USER\Software\Microsoft\Office\16.0\Outlook\Options\General and is named HideNewOutlookToggle, with a value of dword:00000000 to hide the toggle. To later enable the policy, the same value is set to 1. As with any registry change, this approach is best handled with care and in line with internal change management practices.
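For teams that distribute registry changes as .reg files, the text of such a file can be generated rather than typed by hand. The following is a minimal Python sketch, not a Microsoft tool: the function name render_reg_dword is hypothetical, and the key path, value name and value are simply those quoted above. It only builds the text; saving and importing the file is left to the usual change process.

```python
def render_reg_dword(key_path: str, value_name: str, value: int) -> str:
    """Return the text of a .reg file that sets a single DWORD value."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{key_path}]\n"
        f'"{value_name}"=dword:{value:08x}\n'
    )


# Key path and value name as quoted in the text above.
OUTLOOK_GENERAL_KEY = (
    r"HKEY_CURRENT_USER\Software\Microsoft\Office\16.0\Outlook\Options\General"
)

reg_text = render_reg_dword(OUTLOOK_GENERAL_KEY, "HideNewOutlookToggle", 0)
print(reg_text)
```

Generating the text keeps the eight-digit hexadecimal padding consistent and makes it easy to produce the companion file with the value set to 1.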
Removing the New Outlook App After Preinstallation on Windows 11
Preinstallation of the new Outlook on Windows 11 is another area where planning matters. On Windows 11 builds later than version 23H2, the app is preinstalled for all users, and there is currently no way to block that preinstallation. If devices should not surface the new Outlook, it can be removed after installation using the following Windows PowerShell command:
Remove-AppxProvisionedPackage -AllUsers -Online -PackageName (Get-AppxPackage Microsoft.OutlookForWindows).PackageFullName
After deprovisioning, Windows updates will not reinstall the app. Administrators can also remove an additional Windows orchestrator registry value at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\WindowsUpdate\Orchestrator\UScheduler_Oobe\OutlookUpdate where applicable. Devices that have installed the March 2024 Non-Security Preview release, or a later cumulative update for Windows 11 version 23H2, respect the deprovisioning command and do not require removal of that registry value.
Handling User-Installed Instances and Start Menu Placeholders
Users may also install the app themselves, for example by selecting a toggle. In that case, the management approach shifts from provisioned packages to installed packages, and the following PowerShell command removes the app for all users:
Remove-AppxPackage -AllUsers -Package (Get-AppxPackage Microsoft.OutlookForWindows).PackageFullName
It is worth verifying whether the app is actually installed or whether only a Start menu placeholder is visible because a pinned icon may appear even when the underlying app is not yet present. A quick check of the folder at %localappdata%\Microsoft\Olk\logs can confirm whether the app has produced logs, and Start layout policies can be used to manage pins, so users are not inadvertently prompted to install by selecting a placeholder. On consumer devices, a Recommended section in the Windows 11 Start menu can also surface the app, which may need consideration in user communications.
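The log folder check can be scripted when many profiles need to be inspected. The sketch below is one possible Python version under a stated assumption: it treats "has produced logs" as "the Microsoft\Olk\logs folder exists and is not empty", which is an interpretation rather than a documented rule. The function name is hypothetical.

```python
import os
from pathlib import Path


def new_outlook_has_logs(local_appdata: Path) -> bool:
    """Return True if the Olk logs folder exists and contains at least one entry.

    Assumption: a non-empty logs folder is taken as evidence that the app has run.
    """
    logs_dir = local_appdata / "Microsoft" / "Olk" / "logs"
    return logs_dir.is_dir() and any(logs_dir.iterdir())


if __name__ == "__main__":
    # LOCALAPPDATA is normally set on Windows; elsewhere it is usually absent.
    base = os.environ.get("LOCALAPPDATA")
    if base:
        print(new_outlook_has_logs(Path(base)))
    else:
        print("LOCALAPPDATA is not set")
```

A False result suggests that only a placeholder is present, although it is not proof on its own.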
Migrating Users Away from Windows Mail and Calendar
The end of support for Windows Mail and Calendar on the 31st of December 2024 introduced another migration pathway. Active users of those apps are being switched automatically to the new Outlook app, so organisations that wish to block that route can remove the Mail and Calendar apps from devices using the following command:
Get-AppxProvisionedPackage -Online | Where {$_.DisplayName -match "microsoft.windowscommunicationsapps"} | Remove-AppxProvisionedPackage -Online -PackageName {$_.PackageName}
For current users, the installed package can be removed with Remove-AppxPackage -AllUsers -Package (Get-AppxPackage microsoft.windowscommunicationsapps).PackageFullName. Alternatives exist through Microsoft Intune or Configuration Manager, which may be preferable in environments that already use those tools for application lifecycle management.
Blocking Acquisition via the Microsoft Store
Preventing acquisition from the Microsoft Store is more straightforward. Because the new Outlook for Windows is available there as well, blocking access to the Microsoft Store app prevents users from downloading it through that channel. Microsoft provides configuration options for controlling Microsoft Store access, and administrators can align those with broader device management policies that may already limit consumer app installs on corporate devices.
Opting Out of Automatic Migration
Some organisations will want to opt out of new Outlook migration entirely for a period. Starting in January 2025, users with Microsoft 365 Business Standard and Premium licences are automatically migrated from classic Outlook to new Outlook, with in-app notifications sent before the switch and the option to toggle back afterwards. Microsoft exposes a policy named Manage user setting for new Outlook automatic migration that controls whether users are switched automatically. If the policy is not set, the user setting remains uncontrolled and users can manage it themselves, with the default being enabled. Enabling the policy enforces automatic migration and prevents users from changing the setting, while disabling it turns off automatic migration and also prevents user changes. The equivalent registry setting sits under HKEY_CURRENT_USER\Software\Policies\Microsoft\office\16.0\outlook\preferences with a DWORD named NewOutlookMigrationUserSetting set to 0 to disable or 1 to enable. The same controls can be managed via Group Policy Administrative Templates and through the Cloud Policy service from the Microsoft 365 Apps admin centre, and because the setting is defined in ADMX templates it can also be surfaced in Intune using Administrative Templates.
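The three policy states are easy to confuse, so it can help to write them down as a small lookup. The Python sketch below only restates the behaviour described above; the type and function names are hypothetical, and None stands for "policy not set".

```python
from typing import NamedTuple, Optional


class MigrationBehaviour(NamedTuple):
    auto_migrate: bool     # whether automatic migration is on (by default or by policy)
    user_can_change: bool  # whether the user may change the setting


def migration_behaviour(policy_value: Optional[int]) -> MigrationBehaviour:
    """Map NewOutlookMigrationUserSetting (None = not set) to its described effect."""
    if policy_value is None:
        # Not set: the user controls the setting, which defaults to enabled.
        return MigrationBehaviour(auto_migrate=True, user_can_change=True)
    if policy_value == 1:
        # Enabled: migration is enforced and the setting is locked.
        return MigrationBehaviour(auto_migrate=True, user_can_change=False)
    if policy_value == 0:
        # Disabled: migration is turned off and the setting is locked.
        return MigrationBehaviour(auto_migrate=False, user_can_change=False)
    raise ValueError(f"unexpected policy value: {policy_value!r}")


for value in (None, 1, 0):
    print(value, migration_behaviour(value))
```

The point to notice is that both configured states lock the setting; only the unset state leaves the choice with the user.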
Applying Conditional Access and Mailbox Policies
Beyond installation state and migration timing, access policies are a decisive layer of control. Conditional Access policies can require multifactor authentication, restrict access by location, block risky sign-in behaviours or insist on organisation-managed devices. For additional nuance, Outlook on the web (OWA) mailbox policies used together with the ConditionalAccessPolicy parameter can limit capabilities for users on non-compliant devices, for instance by restricting attachments. This approach allows a more graduated user experience that reduces risk without completely blocking access, and it can be combined with broader Conditional Access baseline requirements.
There are cases where a firmer control is required. To prevent mailbox access from the new Outlook regardless of how users acquired the app, administrators can use an Exchange mailbox policy that blocks organisation mailboxes from being added. This acts as a final block so that work or school accounts cannot be used in the app, even if an individual user has installed it or found it preinstalled. Because mailbox policies are applied to the account rather than to a device or a specific app, it is prudent to consider them alongside the earlier measures that block acquisition or control installation, so that personal accounts are not used in ways that bypass organisational safeguards.
Understanding How Outlook Stores Data and What Moves to a New Computer
While deployment and access are important, day-to-day continuity often depends on understanding how Outlook stores data and how that affects moving to a new computer. Outlook saves backup information in a variety of different locations depending on the account type involved. For users of Microsoft 365, Exchange, Outlook.com, Hotmail.com or Live.com accounts not accessed by POP or IMAP, email is backed up on the server and there is no Personal Folders file with a .pst extension. An Offline Folders file with an .ost extension may be present, but Outlook automatically recreates this when a new email account is added, and it cannot be moved between computers. Other elements such as navigation pane settings, print styles, signatures and stationery can be transferred, and their locations vary with version and configuration.
Users of POP accounts encounter a different arrangement. All email, calendar, contact and task information is stored in a .pst file, and moving this file to a new computer preserves that information. It does not carry over the account settings themselves, so Outlook needs to be set up on the new computer before opening the .pst file that was copied from the old one. On Windows 11, navigation pane settings are found at drive:\Users\<username>\AppData\Roaming\Microsoft\Outlook and signatures at drive:\Users\<username>\AppData\Roaming\Microsoft\Signatures. Knowing these paths saves time during a migration and reduces the risk of overlooking important data.
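When preparing a migration checklist, the two folder locations can be built from a drive letter and a user name. This is a small Python sketch using only the standard library; the function name and the sample user name "alex" are hypothetical, and the folder layout is the one quoted above.

```python
from pathlib import PureWindowsPath


def outlook_profile_folders(drive: str, username: str) -> dict:
    """Build the navigation pane and signature folder paths for one user."""
    roaming_microsoft = (
        PureWindowsPath(f"{drive}:\\")
        / "Users" / username / "AppData" / "Roaming" / "Microsoft"
    )
    return {
        "navigation_pane_settings": roaming_microsoft / "Outlook",
        "signatures": roaming_microsoft / "Signatures",
    }


# "alex" is a made-up user name for illustration.
folders = outlook_profile_folders("C", "alex")
for label, path in folders.items():
    print(f"{label}: {path}")
```

PureWindowsPath only manipulates path text, so the sketch can be run on any operating system without touching the file system.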
Avoiding OneDrive Synchronisation Problems with PST Files
Large .pst files can slow down OneDrive synchronisation if they are stored in folders that OneDrive is backing up. Symptoms include messages such as "Processing changes" or "A file is in use" that persist for longer than expected. Microsoft provides guidance on removing an Outlook PST data file from OneDrive if that becomes necessary, and doing so can restore normal synchronisation behaviour while keeping Outlook functional on the local machine.
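Before moving anything, it can be useful to see which .pst files sit inside a synchronised folder and how large they are. The Python sketch below walks a folder and lists them, largest first. The folder to scan is an assumption that the caller supplies; the sketch does not detect which folders OneDrive is actually backing up.

```python
from pathlib import Path


def find_pst_files(folder: Path) -> list:
    """Return (path, size_in_bytes) for every .pst file under folder, largest first."""
    results = []
    for path in folder.rglob("*"):
        # Compare the suffix case-insensitively so .PST files are also found.
        if path.is_file() and path.suffix.lower() == ".pst":
            results.append((path, path.stat().st_size))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical location; replace with the folder that is being synchronised.
    candidate = Path.home() / "OneDrive"
    if candidate.is_dir():
        for pst_path, size in find_pst_files(candidate):
            print(f"{size:>15,d}  {pst_path}")
    else:
        print(f"{candidate} does not exist")
```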
Showing Hidden Files and Extensions on Windows
Locating Outlook data sometimes means revealing folders and file name extensions that Windows hides by default. This is especially true when navigating to AppData or similar directories, or when differentiating between PST and OST files. In Windows 11 File Explorer, go to View > Show, where both the "File name extensions" and "Hidden items" settings can be switched on. Doing so makes the AppData folder and the distinction between these file types visible without needing to navigate through the Control Panel.
Configuring POP, IMAP and SMTP Settings for Outlook.com
Configuration of Outlook.com accounts brings its own questions when used in the Outlook desktop app or other mail applications. Outlook and Outlook.com can often detect the correct mailbox settings automatically, which simplifies setup for many users. When that is not the case, or when using a third-party app, the POP, IMAP and SMTP settings can be viewed within Outlook.com settings and used for manual configuration. For Outlook.com accounts, IMAP and POP share the same server name, outlook.office365.com, with IMAP using port 993 and POP using port 995, both with SSL/TLS encryption and OAuth2 authentication. It is worth noting that POP and IMAP access is disabled by default in Outlook.com and must be enabled in account settings before either protocol can be used. For other non-Microsoft accounts, the safest course is to obtain settings directly from the relevant email provider rather than guessing values, since incorrect entries can lead to connection issues that are not always obvious at first glance.
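For scripts that configure mail clients, the incoming server values quoted above can be kept in one place instead of being retyped. The Python sketch below records only what the text states; the class and function names are hypothetical, and SMTP is left out because no SMTP values are given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncomingServer:
    protocol: str
    host: str
    port: int
    encryption: str
    authentication: str


# Values as quoted in the text above for Outlook.com accounts.
OUTLOOK_COM_INCOMING = {
    "IMAP": IncomingServer("IMAP", "outlook.office365.com", 993, "SSL/TLS", "OAuth2"),
    "POP": IncomingServer("POP", "outlook.office365.com", 995, "SSL/TLS", "OAuth2"),
}


def incoming_server(protocol: str) -> IncomingServer:
    """Look up the Outlook.com incoming server settings for IMAP or POP."""
    try:
        return OUTLOOK_COM_INCOMING[protocol.upper()]
    except KeyError:
        raise ValueError(f"unsupported protocol: {protocol!r}") from None


imap = incoming_server("imap")
print(f"{imap.protocol}: {imap.host}:{imap.port} ({imap.encryption}, {imap.authentication})")
```

Keeping the values in a frozen dataclass means a typo in one script cannot silently change the port used elsewhere.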
Getting Support for Outlook.com
Support remains close at hand for Outlook.com users who need it. The Help option on the menu bar in Outlook.com opens self-help resources where queries can be entered and common issues surfaced. If those do not resolve the problem, there is a path to contact support, which requires signing in to the account so that assistance can be tailored. If signing in is not possible, Microsoft directs users to a separate route to begin recovery or get help, and the Outlook.com Community provides an additional place to search for answers or ask questions from other users.
Keeping Users and IT Teams Informed During Outlook's Transition
Together, these user-facing features and administrative controls reflect a period of transition for Outlook on Windows. Individuals can shape the way they write and read messages, adjusting fonts to suit their preferences and using zoom where needed, without altering system-wide settings. Administrators can pace the adoption of the new Outlook with policies that hide toggles, prevent or reverse preinstallation, opt out of automatic migration and apply Conditional Access or mailbox policies that enforce organisational requirements. Underneath these changes, the fundamentals of data storage and account setup remain steady, with server-backed accounts recreating their local caches on-demand and POP accounts relying on .pst files that can be moved with care. By keeping these points in mind, users and IT teams alike can make informed decisions that avoid surprises and maintain a smooth email experience.