From summary statistics to published reports with R, LaTeX and TinyTeX
For anyone working across LaTeX, R Markdown and data analysis in R, there comes a point where separate tools begin to converge. Data has to be summarised, those summaries have to be turned into presentable tables and the finished result has to compile into a report that looks appropriate for its audience rather than a console dump. These notes follow that sequence, moving from the practical business of summarising data in R through to tabulation and then on to the publishing infrastructure that makes clean PDF and Word output possible.
Summarising Data with {dplyr}
The starting point for many analyses is a quick exploration of the data at hand. One useful example uses the anorexia dataset from the {MASS} package together with {dplyr}. The dataset contains weight change data for young female anorexia patients, divided into three treatment groups: Cont for the control group, CBT for cognitive behavioural treatment and FT for family treatment.
The basic manipulation starts by loading {MASS} and {dplyr}, then using filter() to create separate subsets for each treatment group. From there, mutate() adds a wtDelta column defined as Postwt - Prewt, giving the weight change for each patient. group_by(Treat) prepares the data for grouped summaries, and arrange(wtDelta) sorts within treatment groups. The notes then show how {dplyr}'s pipe operator, %>%, makes the workflow more readable by chaining these operations. The final summary table uses summarize() to compute the number of observations, the mean weight change and the standard deviation within each treatment group. The reported values are count 29, average weight change 3.006897 and standard deviation 7.308504 for CBT, count 26, average weight change -0.450000 and standard deviation 7.988705 for Cont and count 17, average weight change 7.264706 and standard deviation 7.157421 for FT.
That example is not presented as a complete statistical analysis. Instead, it serves as a quick exploratory route into the data, with the wording remaining appropriately cautious and noting that this is only a glance and not a rigorous analysis.
Choosing an R Package for Descriptive Summaries
The question of how best to summarise data opens up a broader comparison of R packages for descriptive statistics. A useful review sets out a common set of needs: a count of observations, the number and types of fields, transparent handling of missing data and sensible statistics that depend on the data type. Numeric variables call for measures such as mean, median, range and standard deviation, perhaps with percentiles. Categorical variables call for counts of levels and some sense of which categories dominate.
Base R's summary() does some of this reasonably well. It distinguishes categorical from numeric variables and reports distributions or numeric summaries accordingly, while also highlighting missing values. Yet, it does not show an overall record count, lacks standard deviation and is not especially tidy or ready for tools such as kable. Several contributed packages aim to improve on that. Hmisc::describe() gives counts of variables and observations, handles both categorical and numerical data and reports missing values clearly, showing the highest and lowest five values for numeric data instead of a simple range. pastecs::stat.desc() is more focused on numeric variables and provides confidence intervals, standard errors and optional normality tests. psych::describe() includes categorical variables but converts them to numeric codes by default before describing them, which the package documentation itself advises should be interpreted cautiously. psych::describeBy() extends this approach to grouped summaries and can return a matrix form with mat = TRUE.
Among the packages reviewed, {skimr} receives especially strong attention for balancing readability and downstream usefulness. skim() reports record and variable counts clearly, separates variables by type and includes missing data and standard summaries in an accessible layout. It also works with group_by() from {dplyr}, making grouped summaries straightforward to produce. More importantly for analytical workflows, the skim output can be treated as a tidy data frame in which each combination of variable and statistic is represented in long form, meaning the results can be filtered, transformed and plotted with standard tidyverse tools such as {ggplot2}.
{summarytools} is presented as another strong option, though with a distinction between its functions. descr() handles numeric variables and can be converted to a data frame for use with kable, while dfSummary() works across entire data frames and produces an especially polished summary. At the time of the original notes, dfSummary() was considered slow. The package author subsequently traced the issue, as documented in the same review, to an excessive number of histogram breaks being generated for variables with large values, imposing a limit to resolve it. The package also supports output through view(dfSummary(data)), which yields an attractive HTML-style summary.
Grouped Summary Table Packages
Once the data has been summarised, the next step is turning those summaries into formal tables. A detailed comparison covers a number of packages specifically designed for this purpose: {arsenal}, {qwraps2}, {Amisc}, {table1}, {tangram}, {furniture}, {tableone}, {compareGroups} and {Gmisc}. {arsenal} is described as highly functional and flexible, with tableby() able to create grouped tables in only a few lines and then be customised through control objects that specify tests, display statistics, labels and missing value treatment. {qwraps2} offers a lot of flexibility through nested lists of summary specifications, though at the cost of more code. {Amisc} can produce grouped tables and works with pander::pandoc.table(), but is noted as not being on CRAN. {table1} creates attractive tables with minimal code, though its treatment of missing values may not suit every use case. {tangram} produces visually appealing HTML output and allows custom rows such as missing counts to be inserted manually, although only HTML output is supported. {furniture} and {tableone} both support grouped table creation, but {tableone} in particular is notable because it is widely used in biomedical research for baseline characteristics tables.
The {tableone} package deserves separate mention because it is designed to summarise continuous and categorical variables in one table, a common need in medical papers. As the package introduction explains, CreateTableOne() can be used on an entire dataset or on a selected subset of variables, with factorVars specifying variables that are coded numerically but should be treated as categorical. The package can display all levels for categorical variables, report missing values via summary() and switch selected continuous variables to non-normal summaries using medians and interquartile ranges instead of means and standard deviations. For grouped comparisons, it prints p-values by default and can switch to non-parametric tests or Fisher's exact test where needed. Standardised mean differences can also be shown. Output can be captured as a matrix and written to CSV for editing in Excel or Word.
Styling and Exporting Tables
With tables constructed, the focus shifts to how they are presented and exported. As Hao Zhu's conference slides explain, the {kableExtra} package builds on knitr::kable() and provides a grammar-like approach to adding styling layers, importing the pipe %>% symbol from {magrittr} so that formatting functions can be added in the same way that layers are added in {ggplot2}. It supports themes such as kable_paper, kable_classic, kable_minimal and kable_material, as well as options for striping, hover effects, condensed layouts, fixed headers, grouped rows and columns, footnotes, scroll boxes and inline plots.
Table output is often the visible end of an analysis, and a broader review of R table packages covers a range of approaches that go well beyond the default output. In R Markdown, packages such as {gt}, {kableExtra}, {formattable}, {DT}, {reactable}, {reactablefmtr} and {flextable} all offer richer possibilities. Some are aimed mainly at HTML output, others at Word. {DT} in particular supports highly customised interactive tables with searching, filtering and cell styling through more advanced R and HTML code. {flextable} is highlighted as the strongest option when knitting to Word, given that the other packages are primarily designed for HTML.
For users working in Word-heavy settings, older but still practical workflows remain relevant too. One approach is simply to write tables to comma-separated text files and then paste and convert the content into a Word table. Another route is through {arsenal}'s write2 functions, designed as an alternative to SAS ODS. The convenience functions write2word(), write2html() and write2pdf() accept a wide range of objects: tableby, modelsum, freqlist and comparedf from {arsenal} itself, as well as knitr::kable(), xtable::xtable() and pander::pander_return() output. One notable constraint is that {xtable} is incompatible with write2word(). Beyond single tables, the functions accept a list of objects so that multiple tables, headers, paragraphs and even raw HTML or LaTeX can all be combined into a single output document. A yaml() helper adds a YAML header to the output, and a code.chunk() helper embeds executable R code chunks, while the generic write2() function handles formats beyond the three convenience wrappers, such as RTF.
The Publishing Infrastructure: CTAN and Its Mirrors
Producing PDF output from R Markdown depends on a working LaTeX installation, and the backbone of that ecosystem is CTAN, the Comprehensive TeX Archive Network. CTAN is the main archive for TeX and LaTeX packages and is supported by a large collection of mirrors spread around the world. The purpose of this distributed system is straightforward: users are encouraged to fetch files from a site that is close to them in network terms, which reduces load and tends to improve speed.
That global spread is extensive. The CTAN mirror list organises sites alphabetically by continent and then by country, with active sites listed across Africa, Asia, Europe, North America, Oceania and South America. Africa includes mirrors in South Africa and Morocco. Asia has particularly wide coverage, with many mirrors in China as well as sites in Korea, Hong Kong, India, Indonesia, Japan, Singapore, Taiwan, Saudi Arabia and Thailand. Europe is especially rich in mirrors, with hosts in Denmark, Germany, Spain, France, Italy, the Netherlands, Norway, Poland, Portugal, Romania, Switzerland, Finland, Sweden, the United Kingdom, Austria, Greece, Bulgaria and Russia. North America includes Canada, Costa Rica and the United States, while Oceania covers Australia and South America includes Brazil and Chile.
The details matter because different mirrors expose different protocols. While many support HTTPS, some also offer HTTP, FTP or rsync. CTAN provides a mirror multiplexer to make the common case simpler: pointing a browser to https://mirrors.ctan.org/ results in automatic redirection to a mirror in or near the user's country. There is one caveat. The multiplexer always redirects to an HTTPS mirror, so anyone intending to use another protocol needs to select manually from the mirror list. That is why the full listings still include non-HTTPS URLs alongside secure ones.
There is also an operational side to the network that is easy to overlook when things are working well. CTAN monitors mirrors to ensure they are current, and if one falls behind, then mirrors.ctan.org will not redirect users there. Updates to the mirror list can be sent to ctan@ctan.org. The master host of CTAN is ftp.dante.de in Cologne, Germany, with rsync access available at rsync://rsync.dante.ctan.org/CTAN/ and web access on https://ctan.org/. For those who want to contribute infrastructure rather than simply use it, CTAN also invites volunteers to become mirrors.
TinyTeX: A Lightweight LaTeX Distribution
This infrastructure becomes much more tangible when looking at a lightweight TeX distribution such as TinyTeX. TinyTeX is a lightweight, cross-platform, portable and easy-to-maintain LaTeX distribution based on TeX Live. It is small in size but intended to function well in most situations, especially for R users. Its appeal lies in not requiring users to install thousands of packages they will never use, installing them as needed instead. This also means installation can be done without administrator privileges, which removes one of the more familiar barriers around traditional TeX setups. TinyTeX can even be run from a flash drive.
For R users, TinyTeX is closely tied to the {tinytex} R package. The distinction is important: tinytex in lower case refers to the R package, while TinyTeX refers to the LaTeX distribution. Installation is intentionally direct. After installing the R package with install.packages('tinytex'), a user can run tinytex::install_tinytex(). Uninstallation is equally simple with tinytex::uninstall_tinytex(). For the average R Markdown user, that is often enough. Once TinyTeX is in place, PDF compilation usually requires no further manual package management.
There is slightly more to know if the aim is to compile standalone LaTeX documents from R. The {tinytex} package provides wrappers such as pdflatex(), xelatex() and lualatex(). These functions detect required LaTeX packages that are missing and install them automatically by default. In practical terms, that means a small example document can be written to a file and compiled with tinytex::pdflatex('test.tex') without much concern about whether every dependency has already been installed. For R users, this largely removes the old pattern of cryptic missing-package errors followed by manual searching through TeX repositories.
Developers may want more than the basics, and TinyTeX has a path for that as well. A helper such as tinytex:::install_yihui_pkgs() installs a collection of packages needed for building the PDF vignettes of many CRAN packages. That is a specific convenience rather than a universal requirement, but it illustrates the design philosophy behind TinyTeX: keep the initial footprint light and offer ways to add what is commonly needed later.
Using TinyTeX Outside R
For users outside R, TinyTeX still works, but the focus shifts to the command-line utility tlmgr. The documentation is direct in its assumptions: if command-line work is unwelcome, another LaTeX distribution may be a better fit. The central command is tlmgr, and much of TinyTeX maintenance can be expressed through it.
On Linux, installation places TinyTeX in $HOME/.TinyTeX and creates symlinks for executables such as pdflatex under $HOME/bin or $HOME/.local/bin if it exists. The installation script is fetched with wget and piped to sh, after first checking that Perl is correctly installed. On macOS, TinyTeX lives in ~/Library/TinyTeX, and users without write permission to /usr/local/bin may need to change ownership of that directory before installation. Windows users can run a batch file, install-bin-windows.bat, and the default installation directory is %APPDATA%/TinyTeX unless APPDATA contains spaces or non-ASCII characters, in which case %ProgramData% is used instead. PowerShell version 3.0 or higher is required on Windows.
Uninstallation follows the same self-contained logic. On Linux and macOS, tlmgr path remove is followed by deleting the TinyTeX folder. On Windows, tlmgr path remove is followed by removing the installation directory. This simplicity is a deliberate contrast with larger LaTeX distributions, which are considerably more involved to remove cleanly.
Maintenance and Package Management
Maintenance is where TinyTeX's relationship to CTAN and TeX Live becomes especially visible. If a document fails with an error such as File 'times.sty' not found, the fix is to search for the package containing that file with tlmgr search --global --file "/times.sty". In the example given, that identifies the psnfss package, which can then be installed with tlmgr install psnfss. If the package includes executables, tlmgr path add may also be needed. An alternative route is to upload the error log to the yihui/latex-pass GitHub repository, where package searching is carried out remotely.
If the problem is less obvious, a full update cycle is suggested: tlmgr update --self --all, then tlmgr path add and fmtutil-sys --all. R users have wrappers for these tasks too, including tlmgr_search(), tlmgr_install() and tlmgr_update(). Some situations still require a full reinstallation. If TeX Live reports Remote repository newer than local, TinyTeX should be reinstalled manually, which for R users can be done with tinytex::reinstall_tinytex(). Similarly, when a TeX Live release is frozen in preparation for a new one, the advice is simply to wait and then reinstall when the next release is ready.
The motivation behind TinyTeX is laid out with unusual clarity. Traditional LaTeX distributions often present a choice between a small basic installation that soon proves incomplete and a very large full installation containing thousands of packages that will never be used. TinyTeX is framed as a way around those frustrations by building on TeX Live's portability and cross-platform design while stripping away unnecessary size and complexity. The acknowledgements also underline that TinyTeX depends on the work of the TeX Live team.
Connecting the R Workflow to a Finished Report
Taken together, these notes show how closely summarisation, tabulation and publishing are linked. {dplyr} and related tools make it easy to summarise data quickly, while a wide range of R packages then turn those summaries into tables that are not only statistically useful but also presentable. CTAN and its mirrors keep the TeX ecosystem available and current across the world, and TinyTeX builds on that ecosystem to make LaTeX more manageable, especially for R users. What begins with a grouped summary in the console can end with a polished report table in HTML, PDF or Word, and understanding the chain between those stages makes the whole workflow feel considerably less mysterious.
Some R functions for working with dates, strings and data frames
Working with data in R often comes down to a handful of recurring tasks: combining text, converting dates and times, reshaping tables and creating summaries that are easier to interpret. This article brings together several strands of base R and tidyverse-style practice, with a particular focus on string handling, date parsing, subsetting and simple time series smoothing. Taken together, these functions form part of the everyday toolkit for data cleaning and analysis, especially when imported data arrive in inconsistent formats.
String Building
At the simplest end of this toolkit is paste(), a base R function for concatenating character vectors. Its purpose is straightforward: it converts one or more R objects to character vectors and joins them together, separating terms with the string supplied in sep, which defaults to a space. If the inputs are vectors, concatenation happens term by term, so paste("A", 1:6, sep = "") yields "A1" through "A6", while paste(1:12) behaves much like as.character(1:12). There is also a collapse argument, which takes the resulting vector and combines its elements into a single string separated by the chosen delimiter, making paste() useful both for constructing values row by row and for creating one final display string from many parts.
That basic string-building role becomes more important when dates and times are involved because imported date-time data often arrive as text split across multiple columns. A common example is having one column for a date and another for a time, then joining them with paste(dates, times) before parsing the result. In that sense, the paste() function often acts as a bridge between messy raw input and structured date-time objects. It is simple, but it appears repeatedly in data preparation pipelines.
Date-Time Conversion
For date-time conversion, base R provides strptime(), strftime() and format() methods for POSIXlt and POSIXct objects. These functions convert between character representations and R date-time classes, and they are central to understanding how R reads and prints times. strptime() takes character input and converts it to an object of class "POSIXlt", while strftime() and format() move in the other direction, turning date-time objects into character strings. The as.character() method for "POSIXt" classes fits into the same family, and the essential idea is that the date-time value and its textual representation are separate things, with the format string defining how R should interpret or display that representation.
Format strings rely on conversion specifications introduced with %, and many of these are standard across systems. %Y means a four-digit year with century, %y means a two-digit year, %m is a month, %d is the day of a month and %H:%M:%S captures hours, minutes and seconds in 24-hour time. %F is equivalent to %Y-%m-%d, which is the ISO 8601 date format. %b and %B represent abbreviated and complete month names, while %a and %A do the same for weekdays. Locale matters here because month names, weekday names, AM/PM indicators and some separators depend on the LC_TIME locale, meaning a date string like "1jan1960" may parse correctly in one locale and return NA in another unless the locale is set appropriately.
R's defaults generally follow ISO 8601 rules, so dates print as "2001-02-28" and times as "14:01:02", though R inserts a space between date and time by default. Several details matter in practice. strptime() processes input strings only as far as needed for the specified format, so trailing characters are ignored. Unspecified hours, minutes and seconds default to zero, and if no year, month or day is supplied then the current values are assumed, though if a month is given, the day must also be valid for that month. Invalid calendar dates such as "2010-02-30 08:00" produce results whose components are all NA.
Time Zones and Daylight Saving
Time zones add another layer of complexity. The tz argument specifies the time zone to use for conversion, with "" meaning the current time zone and "GMT" meaning UTC. Invalid values are often treated as UTC, though behaviour can be system-specific. The usetz argument controls whether a time zone abbreviation is appended to output, which is generally more reliable than %Z. %z represents a signed UTC offset such as -0800, and R supports it for input on all platforms. Even so, time zones can be awkward because daylight saving transitions create times that do not occur at all, or occur twice, and strptime() itself does not validate those cases, though conversion through as.POSIXct may do so.
Two-Digit Years
Two-digit years are a notable source of confusion for analysts working with historical data. As described in the R date formats guide on R-bloggers, %y maps values 00 to 68 to the years 2000 to 2068 and 69 to 99 to 1969 to 1999, following the POSIX standard. A value such as "08/17/20" may therefore be interpreted as 2020 when the intended year is 1920. One practical workaround is to identify any parsed dates lying in the future and then rebuild them with a 19 prefix using format() and ifelse(). This approach is explicit and practical, though it depends on the assumptions of the data at hand.
Plain Dates
For plain dates, rather than full date-times, as.Date() is usually the entry point. Character dates can be imported by specifying the current format, such as %m/%d/%y for "05/27/84" or %B %d %Y for "May 27 1984". If no format is supplied, as.Date() first tries %Y-%m-%d and then %Y/%m/%d. Numeric dates are common when data come from Excel, and here the crucial issue is the origin date: Windows Excel uses an origin of "1899-12-30" for dates after 1900 because Excel incorrectly treated 1900 as a leap year (an error originally copied from Lotus 1-2-3 for compatibility), while Mac Excel traditionally uses "1904-01-01". Once the correct origin is supplied, as.Date() converts the serial numbers into standard R dates.
After import, format() can display dates in other ways without changing their underlying class. For example, format(betterDates, "%a %b %d") might yield values like "Sun May 27" and "Thu Jul 07". This distinction between storage and display is important because once R recognises values as dates, they can participate in date-aware operations such as mean(), min() and max(), and a vector of dates can have a meaningful mean date with the minimum and maximum identifying the earliest and latest observations.
Extracting Columns and Manipulating Lists
These ideas about correct types and structure carry over into table manipulation. A data frame column often needs to be extracted as a vector before further processing, and there are several standard ways to do this, as covered in this guide from Statistics Globe. In base R, the $ operator gives a direct route, as in data$x1. Subsetting with data[, "x1"] yields the same result for a single column, and in the tidyverse, dplyr::pull(data, x1) serves the same purpose. All three approaches convert a column of a data frame into a standalone vector, and each is useful depending on the surrounding code style.
List manipulation has similar patterns, detailed in this Statistics Globe tutorial on removing list elements. Removing elements from a list can be done by position with negative indexing, as in my_list[-2], or by assigning NULL to the relevant component, for example my_list_2[2] <- NULL. If names are more meaningful than positions, then subsetting with names(my_list) != "b" or names(my_list) %in% "b" == FALSE removes the named element instead. The same logic extends to multiple elements, whether by positions such as -c(2, 3) or names such as %in% c("b", "c") == FALSE. These are simple techniques, but they matter because lists are a common structure in R, especially when working with nested results.
Subsetting, Renaming and Reordering Data Frames
Data frames themselves can be subset in several ways, and the choice often depends on readability, as the five-method overview on R-bloggers demonstrates clearly. The bracket form example[x, y] remains the foundation, whether selecting rows and columns directly or omitting unwanted ones with negative indices. More expressive alternatives include which() together with %in%, the base subset() function and tidyverse verbs like filter() and select(). The point is not that one method is universally best, but that R offers both low-level precision and higher-level readability, depending on the task.
Column names and column order also need regular attention. Renaming can be done with dplyr::rename(), as explained in this lesson from Datanovia, for instance changing Sepal.Length to sepal_length and Sepal.Width to sepal_width. In base R, the same effect comes from modifying names() or colnames(), either by matching specific names or by position. Reordering columns is just as direct, with a data frame rearranged by column indices such as my_data[, c(5, 4, 1, 2, 3)] or by an explicit character vector of names, as the STHDA guide on reordering columns illustrates. Both approaches are useful when preparing data for presentation or for functions that expect variables in a certain order.
Sorting and Cumulative Calculations
Sorting and cumulative calculations fit naturally into this same preparatory workflow. To sort a data frame in base R, the DataCamp sorting reference demonstrates that order() is the key function: mtcars[order(mpg), ] sorts ascending by mpg, while mtcars[order(mpg, -cyl), ] sorts by mpg ascending and cyl descending. For cumulative totals, cumsum() provides a running sum, as in calculating cumulative air miles from the airmiles dataset, an example covered in the Data Cornering guide to cumulative calculations. Within grouped data, dplyr::group_by() and mutate() can apply cumsum() separately to each group, and a related idea is cumulative count, which can be built by summing a column of ones within groups, or with data.table::rowid() to create a group index.
Time Series Smoothing
Time series smoothing introduces one further pattern: replacing noisy raw values with moving averages. As the Storybench rolling averages guide explains, the zoo::rollmean() function calculates rolling means over a window of width k, and examples using 3, 5, 7, 15 and 21-day windows on pandemic deaths and confirmed cases by state demonstrate the approach clearly. After arranging and grouping by state, mutate() adds variables such as death_03da, death_05da and death_07da. Because rollmean() is centred by default, the resulting values are symmetrical around the observation of interest and produce NA values at the start and end where there are not enough surrounding observations, which is why odd values of k are usually preferred as they make the smoothing window balanced.
The arithmetic is uncomplicated, but the interpretation is useful. A 3-day moving average for a given date is the mean of that day, the previous day and the following day, while a 7-day moving average uses three observations on either side. As the window widens, the line becomes smoother, but more short-term variation is lost. This trade-off is visible when comparing 3-day and 21-day averages: a shorter average tracks recent changes more closely, while a longer one suppresses noise and makes broader trends stand out. If a trailing rather than centred calculation is needed, rollmeanr() shifts the window to the right-hand end.
The same grouped workflow can be used to derive new daily values before smoothing. In the pandemic example, daily new confirmed cases are calculated from cumulative confirmed counts using dplyr::lag(), with each day's new cases equal to the current cumulative total minus the previous day's total. Grouping by state and date, summing confirmed counts and then subtracting the lagged value produces new_confirmed_cases, which can then be smoothed with rollmean() in the same way as deaths. Once these measures are available, reshaping with pivot_longer() allows raw values and rolling averages to be plotted together in ggplot2, making it easier to compare volatility against trend.
How These R Data Manipulation Techniques Fit Together
What links all of these techniques is not just that they are common in R, but that they solve the mundane, essential problems of analysis. Data arrive as text when they should be dates, as cumulative counts when daily changes are needed, as broad tables when only a few columns matter, or as inconsistent names that get in the way of clear code. Functions such as paste(), strptime(), as.Date(), order(), cumsum(), rollmean(), rename(), select() and simple bracket subsetting are therefore less like isolated tricks and more like pieces of a coherent working practice. Knowing how they fit together makes it easier to move from raw input to reliable analysis, with fewer surprises along the way.
Speeding up R Code with parallel processing
Parallel processing in R has evolved considerably over the past fifteen years, moving from a patchwork of platform-specific workarounds into a well-structured ecosystem with clean, consistent interfaces. The appeal is easy to grasp: modern computers offer several processor cores, yet most R code runs on only one of them unless the user makes a deliberate choice to go parallel. When a task involves repeated calculations across groups, repeated model fitting or many independent data retrievals, spreading that work across multiple cores can reduce elapsed time substantially.
At its heart, the idea is simple. A larger job is split into smaller pieces, those pieces are executed simultaneously where possible, and the results are combined back together. That pattern appears throughout R's parallel ecosystem, whether the work is running on a laptop with a handful of cores or on a university supercomputer with thousands.
Why Parallel Processing?
Most modern computers have multiple cores that sit idle during single-threaded R scripts. Parallel processing takes advantage of this by splitting work across those cores, but it is important to understand that it is not always beneficial. Starting workers, transmitting data and collecting results all take time. Parallel processing makes the most sense when each iteration does enough computational work to justify that overhead. For fast operations of well under a second, the overhead will outweigh any gain and serial execution is faster. The sweet spot is iterative work, where each unit of computation takes at least a few seconds.
Benchmarking: Amdahl's Law
The theoretical speed-up from adding processors is always limited by the fraction of work that cannot be parallelised. Amdahl's Law, formulated by computer scientist Gene Amdahl in 1967, captures this:
Maximum Speedup = 1 / ( f/p + (1 - f) )
Here, f is the parallelisable fraction and p is the number of processors. Problems where f = 1 (the entire computation is parallelisable) are called embarrassingly parallel: bootstrapping, simulation studies and applying the same model to many independent groups all fall into this category. For everything else, the sequential fraction, including the overhead of setting up workers and moving data, sets a ceiling on how much improvement is achievable.
How We Got Here
The current landscape makes more sense with a brief orientation. R 2.14.0 in 2011 brought {parallel} into base R, providing built-in support for both forking and socket clusters along with reproducible random number streams, and it remains the foundation everything else builds on. The {foreach} package with {doParallel} became the most common high-level interface for many years, and is still widely encountered in existing code. The split-apply-combine package {plyr} was an early entry point for parallel data manipulation but is now retired; the recommendation is to use {dplyr} for data frames and {purrr} for list iteration instead. The {future} ecosystem, covered in the next section, is the current best practice for new code.
The Modern Standard: The {future} Ecosystem
The most significant development in R parallel computing in recent years has been the {future} package by Henrik Bengtsson, which provides a unified API for sequential and parallel execution across a wide range of backends. Its central concept is simple: a future is a value that will be computed (possibly in parallel) and retrieved later. What makes it powerful is that you write code once and change the execution strategy by swapping a single plan() call, with no other changes to your code.
library(future)
plan(multisession) # Use all available cores via background R sessions
The common plans are sequential (the default, no parallelism), multisession (multiple background R processes, works on all platforms including Windows) and multicore (forking, faster but Unix/macOS only). On a cluster, cluster and backends such as future.batchtools extend the same interface to remote nodes.
The {future} package itself is a low-level building block. For day-to-day work, three higher-level packages are the main entry points.
{future.apply}: Drop-in Replacements for base R Apply
{future.apply} provides parallel versions of every *apply function in base R, including future_lapply(), future_sapply(), future_mapply(), future_replicate() and more. The conversion from serial to parallel code requires just two lines:
library(future.apply)
plan(multisession)
# Serial
results <- lapply(my_list, my_function)
# Parallel — identical output, just faster
results <- future_lapply(my_list, my_function)
Global variables and packages are automatically identified and exported to workers, which removes the manual clusterExport and clusterEvalQ calls that {parallel} requires.
{furrr}: Drop-in Replacements for {purrr}
{furrr} does the same for {purrr}'s mapping functions. Any map() call can become future_map() by loading the library and setting a plan:
library(furrr)
plan(multisession, workers = availableCores() - 1)
# Serial
results <- map(my_list, my_function)
# Parallel
results <- future_map(my_list, my_function)
Like {future.apply}, {furrr} handles environment export automatically. There are parallel equivalents for all typed variants (future_map_dbl(), future_map_chr(), etc.) and for map2() and pmap() as well. It is the most natural choice for tidyverse-style code that already uses {purrr}.
{futurize}: One-Line Parallelisation
For users who want to parallelise existing code with minimal changes, {futurize} can transpile calls to lapply(), purrr::map() and foreach::foreach() %do% {} into their parallel equivalents automatically.
{foreach} with {doFuture}
The {foreach} package remains widely used, and the modern way to parallelise it is with the {doFuture} backend and the %dofuture% operator:
library(foreach)
library(doFuture)
plan(multisession)
results <- foreach(i = 1:10) %dofuture% {
my_function(i)
}
This approach inherits all the benefits of {future}, including automatic global variable handling and reproducible random numbers.
The {parallel} Package: Core Functions
The {parallel} package remains part of base R and is the foundation that {future} and most other packages build on. It is useful to know its core functions directly, especially for distributed work across multiple nodes.
Shared memory (single machine, Unix/macOS only):
mclapply(X, FUN, mc.cores = n) is a parallelised lapply that works by forking. It does not work on Windows and falls back silently to serial execution there.
Distributed memory (all platforms, including multi-node):
| Function | Description |
|---|---|
makeCluster(n) |
Start `n` worker processes |
clusterExport(cl, vars) |
Copy named objects to all workers |
clusterEvalQ(cl, expr) |
Run an expression (e.g. library(pkg)) on all workers |
parLapply(cl, X, FUN) |
Parallelised lapply across the cluster |
parLapplyLB(cl, X, FUN) |
Same with load balancing for uneven tasks |
clusterSetRNGStream(cl, seed) |
Set reproducible random seeds on workers |
stopCluster(cl) |
Shut down the cluster |
Note that detectCores() can return misleading values in HPC environments, reporting the total cores on a node rather than those allocated to your job. The {parallelly} package's availableCores() is more reliable in those settings and is what {furrr} and {future.apply} use internally.
A Tidyverse Approach with {multidplyr}
For data frame-centric workflows, {multidplyr} (available on CRAN) provides a {dplyr} backend that distributes grouped data across worker processes. The API has been simplified considerably since older tutorials were written: there is no longer any need to manually add group index columns or call create_cluster(). The current workflow is straightforward.
library(multidplyr)
library(dplyr)
# Step 1: Create a cluster (leave 1–2 cores free)
cluster <- new_cluster(parallel::detectCores() - 1)
# Step 2: Load packages on workers
cluster_library(cluster, "dplyr")
# Step 3: Group your data and partition it across workers
flights_partitioned <- nycflights13::flights %>%
group_by(dest) %>%
partition(cluster)
# Step 4: Work with dplyr verbs as normal
results <- flights_partitioned %>%
summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
collect()
partition() uses a greedy algorithm to keep all rows of a group on the same worker and balance shard sizes. The collect() call at the end recombines the results into an ordinary tibble in the main session. If you need to use custom functions, load them on each worker with cluster_assign():
cluster_assign(cluster, my_function = my_function)
One important caveat from the official documentation: for basic {dplyr} operations, {multidplyr} is unlikely to give measurable speed-ups unless you have tens or hundreds of millions of rows. Its real strength is in parallelising slower, more complex operations such as fitting models to each group. For large in-memory data with fast transformations, {dtplyr} (which translates {dplyr} to {data.table}) is often a better first choice.
Running R on HPC Clusters
For computations that exceed what a single workstation can provide, university and research HPC clusters are the next step. The core terminology is worth understanding clearly before submitting your first job.
One node is a single physical computer, which may itself contain multiple processors. One processor contains multiple cores. Wall-time is the real-world clock time a job is permitted to run; the job is terminated when this limit is reached, regardless of whether the script has finished. Memory refers to the RAM the job requires. When requesting resources, leave a margin of at least five per cent of RAM for system processes, as exceeding the allocation will cause the job to fail.
Slurm Job Submission
Slurm is the dominant scheduler on modern HPC clusters, including Penn State's Roar Collab system, managed by the Institute for Computational and Data Sciences (ICDS). Jobs are described in a shell script and submitted with sbatch. From R, the {rslurm} package allows Slurm jobs to be created and submitted directly without leaving the R session:
library(rslurm)
sjob <- slurm_apply(my_function, params_df, jobname = "my_job",
nodes = 2, cpus_per_node = 8)
Connecting R Workflows to Cluster Schedulers
The {batchtools} package provides Map, Reduce and Filter variants for managing R jobs on PBS, Slurm, LSF and Sun Grid Engine. The {clustermq} package sends function calls as cluster jobs via a single line of code without network-mounted storage. For users already in the {future} ecosystem, {future.batchtools} wraps {batchtools} as a {future} backend, letting you scale from a local plan(multisession) all the way to plan(batchtools_slurm) with no other code changes.
The Broader Ecosystem
The CRAN Task View on High-Performance and Parallel Computing, maintained by Dirk Eddelbuettel and updated lately, remains the most comprehensive catalogue of R packages in this space. The core packages designated by the Task View are {Rmpi} and {snow}. Beyond these, several areas are worth knowing about.
For large and out-of-memory data, {arrow} provides the Apache Arrow in-memory format with support for out-of-memory processing and streaming. {bigmemory} allows multiple R processes on the same machine to share large matrix objects. {bigstatsr} operates on file-backed matrices via memory-mapped access with parallel matrix operations and PCA.
For pipeline orchestration, the {targets} package constructs a directed acyclic graph of your workflow and orchestrates distributed computing across {future} workers, only re-running steps whose upstream dependencies have changed. For GPU computing, the {tensorflow} package by Allaire and colleagues provides access to the complete TensorFlow API from within R, enabling computation across CPUs and GPUs with a single API.
When it comes to random number reproducibility across parallel workers, the L'Ecuyer-CMRG streams built into {parallel} are available via RNGkind("L'Ecuyer-CMRG"). The {rlecuyer}, {rstream}, {sitmo} and {dqrng} packages provide further alternatives. The {doRNG} package handles reproducible seeds specifically for {foreach} loops.
Choosing the Right Approach
The appropriate tool depends on the shape of the problem and how it fits into your existing code.
If you are already using {purrr}'s map() functions, replacing them with future_map() from {furrr} after plan(multisession) is the path of least resistance. If you use base R's lapply or sapply, {future.apply} provides identical drop-in replacements. Both inherit automatic environment handling, reproducible random numbers and cross-platform compatibility from {future}.
If you are working with grouped data frames in a {dplyr} style and each group operation is computationally substantial, {multidplyr} is a good fit. For fast operations on large data, try {dtplyr} first.
For the largest workloads on institutional clusters, {future} scales directly to HPC environments via plan(cluster) or plan(batchtools_slurm). The {rslurm} and {batchtools} packages provide more direct control over job submission and resource management.
Further Reading
The CRAN Task View on High-Performance and Parallel Computing is the most comprehensive and current reference. The Futureverse website documents the full {future} ecosystem. The {multidplyr} vignette covers the current API in detail. Penn State users can find cluster support through ICDS and the QuantDev group's HPC in R tutorial. The R Special Interest Group on High-Performance Computing mailing list is a further resource for more specialist questions.
Making sense of parallel and asynchronous execution in Python
Parallel processing in Python is often presented as a straightforward route to faster programs, though the reality is rather more nuanced. At its core, parallel processing means executing parts of a task simultaneously across multiple processors or cores on the same machine, with the intention of reducing the total time needed to complete the work. Any honest explanation must include an important caveat because parallelism brings overhead of its own: processes need to be created, scheduled and coordinated, and data often has to be passed between them. For small or lightweight tasks, that overhead can outweigh any gain, and two tasks that each take five seconds may still require around eight seconds when parallelised, rather than the ideal five.
The Multiprocessing Module
One of the standard ways to work with parallel execution in Python is the multiprocessing module This module creates subprocesses rather than threads, which matters because each process has its own memory space. On both Unix-like systems and Windows, this arrangement allows Python code to use multiple processors more effectively for independent work, and it sidesteps some of the limitations commonly associated with threads in CPython, particularly for CPU-bound tasks. Threads still have an important role, especially for workloads that are heavy on input/output operations, but multiprocessing is often the better fit when the work involves substantial computation.
Understanding the Global Interpreter Lock
The reason threads are less effective for CPU-bound work in CPython relates directly to the Global Interpreter Lock (GIL). The GIL is a mutex that allows only one thread to hold control of the Python interpreter at any one time, meaning that even in a multithreaded programme, only one thread can execute Python bytecode at a given moment. When a thread is waiting for an external input/output operation it releases the GIL, allowing other threads to run, which is why threading remains a reasonable choice for I/O-bound workloads. Multiprocessing sidesteps the GIL entirely by spawning separate processes, each with its own Python interpreter, allowing genuine parallel execution across cores.
How Many Processes Can Run in Parallel?
Before using multiprocessing, it helps to understand the practical ceiling on how many processes can run in parallel. The upper bound is usually tied to the number of logical processors or cores available on the machine, and Python exposes this through multiprocessing.cpu_count(), which returns the number of processors detected. That figure is a useful starting point rather than an absolute rule. In real applications, the best number of worker processes can vary according to available memory, the nature of the task and what else the machine is doing at the time.
Synchronous and Asynchronous Execution
Another foundation worth clarifying is the difference between synchronous and asynchronous execution. In synchronous execution, tasks are coordinated so that results are typically gathered in the same order in which they were started, and the main programme effectively waits for those tasks to finish. In asynchronous execution, by contrast, tasks can complete in any order and the results may not correspond to the original input sequence, which often improves throughput but requires the programmer to be more deliberate about collecting and arranging results.
Pool and Process: The Two Main Abstractions
The multiprocessing module offers two main abstractions for parallel work: Pool and Process. For most practical tasks, Pool is the easier and more convenient option. It manages a collection of worker processes and provides methods such as apply(), map() and starmap() for synchronous execution, alongside apply_async(), map_async() and starmap_async() for asynchronous execution. The lower-level Process class offers more control and suits more specialised cases, but for many data-processing jobs Pool is sufficient and considerably easier to reason about.
An Example: Counting Values in a Range
A useful way to see these ideas in action is through a concrete example. Suppose there is a two-dimensional list, or matrix, where each row contains a small set of integers, and the task is to count how many values in each row fall within a given range. In the example, the data are generated with NumPy using np.random.randint(0, 10, size=[200000, 5]) and then converted to a plain list of lists with tolist(). A simple function, howmany_within_range(row, minimum, maximum), loops through each number in a row and increments a counter whenever the number falls between the supplied minimum and maximum values.
Without any parallelism, this task is handled with a straightforward loop in which each row is passed to the function in turn and the returned counts are appended to a results list. This serial approach is simple, easy to read and often good enough as a baseline, and it provides an important benchmark because parallel processing should not be adopted merely because it is available but should address an actual performance problem.
Pool.apply()
To parallelise the same function, the first step is to create a process pool, typically with mp.Pool(mp.cpu_count()). The simplest method to understand is Pool.apply(), which runs a function in a worker process using the arguments supplied through args. In the range-counting example, each row is submitted with the same minimum and maximum values. The resulting code is concise, but there is an important detail to note: when apply() is used inside a list comprehension, each call still blocks until it completes. It is parallel in terms of the workers available, but it is not always the most efficient pattern for distributing a large iterable of similar tasks.
Pool.map()
That is where Pool.map() can be more suitable. The map() method accepts a single iterable and applies the target function to each element. Because the original howmany_within_range() function expects more than one argument, the example adapts it by defining howmany_within_range_rowonly(row, minimum=4, maximum=8), giving default values to the range bounds so that only the row must be supplied. This is not always the cleanest design, but it illustrates the central constraint of map(): it expects one iterable of inputs rather than multiple arguments per call. In return, it is often a good fit for simple, repeated operations over a dataset.
Pool.starmap()
When a function genuinely needs multiple arguments and one wants the convenience of map-like behaviour, Pool.starmap() is usually the better choice. Like map(), it takes a single iterable, but each element of that iterable is itself another iterable containing the arguments for one function call. In the example, the input becomes [(row, 4, 8) for row in data], with each tuple unpacked into howmany_within_range(). This tends to be clearer than altering function signatures purely to satisfy the constraints of map().
Asynchronous Variants
The asynchronous equivalents follow the same broad pattern but differ in one crucial respect: they do not force the main process to wait for each task in order. With Pool.apply_async(), tasks are submitted, and the programme can continue while workers process them in the background. The example demonstrates this by redefining the counting function as howmany_within_range2(i, row, minimum, maximum), which returns both the original index and the count, a distinction that matters because asynchronous execution may alter the order of results. A callback function appends each completed result to a shared list and, after all tasks finish, that list is sorted by index so that the final output matches the original row order.
There is also an alternative form of apply_async() that avoids callbacks by returning ApplyResult objects, which can later be resolved with .get() to retrieve the actual result. This approach can be easier to follow when callbacks feel too indirect, though it still requires care to ensure that the pool is properly closed and joined so that all processes complete. The use of pool.join() is particularly important here because it prevents subsequent lines of code from running until the queued work is finished. Asynchronous mapping methods are available too, including Pool.starmap_async(), which mirrors starmap() but returns an asynchronous result object whose data can be fetched with .get().
Parallelising Pandas DataFrames
Parallelism in Python is not restricted to plain lists. In data analysis and machine learning work, it is often more relevant to process pandas DataFrames, and there are several levels at which this can happen: a function can operate on one row, one column or an entire DataFrame. The first two can be managed with the standard multiprocessing module alone, while whole-DataFrame parallelism often needs more flexible serialisation support than the standard library provides.
Row-wise and Column-wise Parallelism
For row-wise work, one approach is to iterate over df.itertuples(name=False) so that each row is presented as a simple tuple. A hypotenuse(row) function can compute the square root of the sum of squares of two values from each row, with a pool of four worker processes handling the rows through pool.imap(). This resembles pd.apply() conceptually, but the work is spread across processes rather than performed in a single interpreter thread.
Column-wise parallelism follows the same idea but uses df.items() to iterate over columns (it is worth noting that df.iteritems(), which older examples may reference, was deprecated in pandas 1.5.0 and has since been removed, with df.items() being the correct modern equivalent). A sum_of_squares(column) function receives each column as a pair containing the column label and the series itself, and pool.imap() distributes this work across multiple processes. This pattern is useful when independent operations need to be applied to separate columns.
Whole-DataFrame Parallelism with Pathos
Parallelising functions that accept an entire DataFrame or similarly complex object is more difficult with the standard multiprocessing machinery because of serialisation constraints, since the standard library uses pickle internally and pickle has well-known limitations with certain object types. The pathos package addresses this by using dill internally, which supports serialising and deserialising almost all Python types. A DataFrame is split into chunks with np.array_split(df, cores, axis=0), and a ProcessingPool from pathos.multiprocessing maps a function across those chunks, with the results combined using np.vstack(). This extends the same Pool > Map > Close > Join pattern, though the pool is also cleared afterwards with pool.clear().
Lower-level Process Control and Queues
There are broader ways to think about parallel execution beyond multiprocessing.Pool. Lower-level process management with multiprocessing.Process gives explicit control over individual processes, and this can be paired with queues managed through multiprocessing.Manager() for inter-process communication. In such designs, one queue can hold tasks and another can collect results, with worker processes repeatedly fetching tasks, processing them and placing outputs in the result queue, terminating when they receive a sentinel value such as -1. This approach is more verbose than using a pool, but it can be valuable when workflows are dynamic or when processes need long-lived coordination.
Threads, Executors and External Commands
Python also offers other concurrency models worth knowing. Threads, available through the threading module or concurrent.futures.ThreadPoolExecutor, are often well suited to I/O-bound work such as downloading files or waiting on network responses. Because of the GIL in CPython, threads are less effective for CPU-bound pure Python code, though they can still provide concurrency when much of the time is spent waiting. Process-based approaches, including ProcessPoolExecutor, are generally more effective for CPU-heavy work because they achieve genuine parallel execution across cores.
External process execution forms another category entirely. The os.system() method can launch shell commands, potentially in the background, though it is relatively crude. The subprocess module is more robust, providing better control over arguments, output capture and return codes. These tools are useful when the work is best handled by external programmes rather than Python functions, though they are conceptually distinct from in-Python data parallelism.
Choosing the Right Approach for Parallel Processing in Python
What emerges from all of this is that parallel processing in Python is less about memorising one trick and more about matching the method to the problem at hand. For simple data transformations over independent records, Pool.map() or Pool.starmap() can be effective, while asynchronous methods come into play when result order is not guaranteed or when responsiveness matters. When working with pandas, row-wise and column-wise strategies fit naturally into the standard multiprocessing model, whereas whole-object processing may call for a package such as pathos. Lower-level process control, thread pools, external commands and task queues each have their place too.
It is also worth remembering that parallelism is not free. Process creation, serialisation, memory usage and coordination all introduce cost, and the right question is not whether code can be parallelised but whether the effort and overhead make sense for the workload in question. Python provides several mature tools for splitting work across processes and threads, and the multiprocessing module remains one of the most practical for CPU-bound tasks on a single machine, with the Pool interface offering the clearest path from serial code to parallel execution for many everyday applications.
Moving from 32-Bit to 64-Bit SAS on Windows
Moving from 32-bit SAS on Microsoft Windows to a 64-bit environment can look deceptively straightforward from the outside. The operating system is still Windows, programmes often run without alteration, and many data sets open just as expected. Beneath that continuity, however, sit several technical differences that matter considerably in practice, especially for organisations with long-lived code, established format libraries and regular exchanges with Microsoft Office files.
What makes this transition particularly awkward is that SAS treats some of these changes as more than a simple in-place upgrade. As Jacques Thibault notes in his PharmaSUG 2012 paper, a new operating system will often be accompanied by a new version of surrounding applications, and what matters most is ensuring sufficient time and resources to fully test existing programmes under the new environment before committing to the change. SAS file types are not uniformly portable across the 32-bit to 64-bit boundary, and support behaviour also differs by SAS release, with SAS 9.3 marking the point at which some earlier friction was meaningfully reduced. As of 2025, the current release of the SAS 9 line is SAS 9.4 Maintenance 9 (M9), and organisations running any SAS 9.4 release benefit from the data-set interoperability improvements first introduced in SAS 9.3, whilst the catalog and Office-integration issues described in this article remain relevant across all SAS 9.x environments.
Data Sets and Catalogs: A Fundamental Distinction
The broadest distinction is between SAS data sets and SAS catalogs. Data sets are generally more forgiving, while catalogs are not. SAS Usage Note 38339 explains that when upgrading from 32-bit to 64-bit Windows SAS in releases earlier than SAS 9.3, Cross-Environment Data Access (CEDA) is invoked to access 32-bit SAS data sets. CEDA allows the file to be read without immediate conversion, though it can impose restrictions and may reduce performance. The same note states directly that 64-bit SAS provides no access to 32-bit catalogs at all.
That distinction sits at the centre of most migration problems, and it is the reason a move that feels routine can catch teams off guard when they first encounter the ERROR: CATALOG was created for a different operating system message. As Chris Hemedinger explains in a post on The SAS Dummy, the move from 32-bit SAS for Windows to 64-bit SAS for Windows is, for all intents and purposes, a platform change from SAS's perspective, even though only the bit architecture has changed, and SAS catalogs are not portable across platforms.
How SAS Handles Data Sets Across the Boundary
For data sets, the picture is comparatively manageable. If a 32-bit SAS data set is opened in a 64-bit SAS session in releases before SAS 9.3, SAS writes a note to the log stating that the file is native to another host or that its encoding differs from the current session encoding, and that Cross-Environment Data Access will be used, which might require additional CPU resources and might reduce performance. This is SAS performing translation work in the background, and whilst useful for continued access, it is not always ideal for regular production use.
There is an important nuance that changes things significantly with SAS 9.3. In 32-bit SAS on Windows, the data representation is WINDOWS_32, whilst in 64-bit SAS on Windows it is WINDOWS_64. Hemedinger notes that in SAS 9.3 the developers taught SAS for Windows to bypass the CEDA layer when the only encoding difference is WINDOWS_32 versus WINDOWS_64. SAS Knowledge Base article 38379 confirms this, stating that from SAS 9.3 onwards, Windows 32-bit data sets can be read, written and updated in Windows 64-bit SAS, and vice versa, as a result of a change in how SAS determines file compatibility at open time. Users on SAS 9.3 and later, including all SAS 9.4 maintenance releases, may therefore see fewer warnings and less friction with ordinary data sets originating in 32-bit Windows SAS.
Converting Data Sets to Native 64-Bit Format
Even with those SAS 9.3 improvements, many organisations prefer to convert files into the native 64-bit format rather than rely indefinitely on cross-environment access. For entire libraries, PROC MIGRATE is the recommended mechanism. SAS Usage Note 38339 notes that for releases preceding SAS 9.3, PROC MIGRATE can migrate 32-bit SAS data sets to 64-bit, changing their format so that CEDA is no longer required.
The advantages of PROC MIGRATE over the older conversion procedures are set out in detail by Diane Olson and David Wiehle of SAS Institute in their paper hosted by the University of Delaware. Unlike PROC COPY, PROC MIGRATE retains deleted observations, migrates audit trails, preserves all integrity constraints and automatically retains created and last-modified date/times, compression, encryption, indexes and passwords from the source library. It is designed to produce members in the target library that differ from the source only in being in the new SAS format.
When the task concerns individual SAS data files rather than a whole library, SAS Usage Note 38339 points to PROC COPY with the NOCLONE option. Used in a 64-bit SAS session, this copies a 32-bit Windows data set into a new file that is native to the 64-bit environment. The NOCLONE option prevents SAS from cloning the original data representation during the copy, so that the resulting file is written in the target environment's native format and CEDA is no longer needed to process it. Thibault's PharmaSUG paper illustrates this with an example using PROC COPY with the NOCLONE option together with an OUTREP setting on the target LIBNAME statement to force creation in the desired representation.
Catalogs: The Hard Problem
Catalogs are a different matter entirely. If a user running 64-bit SAS attempts to open a catalog created in a 32-bit SAS session, the familiar error appears: ERROR: CATALOG was created for a different operating system. In the case of format catalogs, a related message often reads ERROR: File LIBRARY.FORMATS.CATALOG was created for a different operating system, and this is frequently followed by failures to use user-defined formats attached to variables. As the SSCC guidance from the University of Wisconsin-Madison notes, this can prevent 64-bit SAS from reading the data set at all, with the error about formats recorded only in the log whilst the visible symptom is simply that the table did not open.
This matters because catalogs are machine-dependent. User-defined formats created by PROC FORMAT are usually stored in catalogs, often in a member named FORMATS. If those formats were built in 32-bit SAS, 64-bit SAS cannot use the catalog directly, and this affects not only explicit formatting in code but also routine data viewing because a data set linked to permanent user-defined formats may fail to display properly unless the associated format catalog is converted.
Options for Migrating Format Catalogs
There are several ways to address catalog incompatibility. If the original PROC FORMAT source code still exists, the cleanest option is simply to rerun it under 64-bit SAS, producing a fresh native catalog. The SSCC guidance treats this as the easiest solution that preserves the formats themselves, and it also describes a short-term workaround: adding a bare format female; statement to the DATA or PROC step, which removes the custom format from that variable, so there is no need to read the problem catalog file at all.
When source code is not available, transport-based conversion is the answer. In a 32-bit SAS session, PROC CPORT creates a transport file from the catalog library, and in a 64-bit SAS session, PROC CIMPORT recreates the catalog in the new environment. SAS Knowledge Base article KB0041614 provides sample code that creates a transport file in 32-bit SAS using proc cport lib=my32 file=trans memtype=catalog; select formats; and then unloads it in 64-bit SAS using PROC CIMPORT, after which a new Formats.sas7bcat file should be present in the target library. The same article notes that if access to a 32-bit SAS session is simply not available, the system option NOFMTERR can be submitted as a last resort: this allows the underlying data values to be displayed whilst user-defined formats are ignored, avoiding the error without converting the catalog.
A more robust route for user-defined formats is to avoid moving the catalog as a catalog at all. PROC FORMAT can write format definitions to a standard SAS data set using CNTLOUT, and later rebuild them from that data set using CNTLIN. Because SAS data sets are generally portable across the 32-bit to 64-bit boundary, this method sidesteps the catalog incompatibility directly. KB0041614 describes CNTLOUT/CNTLIN as the most robust method available for migrating user-defined format libraries. Karin LaPann, writing in a poster presented at a meeting of the Philadelphia Area SAS Users Group, reaches the same conclusion and recommends always creating data sets from format catalogs and storing them alongside the data in the same library as a matter of good practice.
Caveats: Item Stores, Compiled Macros and the PIPE Engine
SAS Usage Note 38339 explicitly states that stored compiled macro catalogs are not supported by PROC CPORT and must be recompiled in the new operating environment, with SAS Note 46846 covering compatibility guidance for those files specifically. The note also warns that the 32-bit version of SAS should not be removed until it can be verified that all 32-bit catalogs have been successfully migrated.
Thibault's PharmaSUG paper identifies two further file types that require attention. SAS Item Store files (.sas7bitm), which organisations may use to store standard PROC TEMPLATE output templates, are not compatible across 32-bit and 64-bit environments, and the practical solution is to recreate them under the new environment using the same programme that created them originally, targeting a different output directory to avoid a mixed 32-bit and 64-bit directory. Thibault also notes that programmes using the PIPE engine may produce errors on Windows 64-bit environments, and recommends replacing such code with newer SAS functions such as filename, dopen and dread to avoid the issue altogether. These are not universal blockers, but they underline why testing is essential rather than assumed.
Microsoft Office Integration After the Move
Another area where 64-bit moves catch users out is access to Microsoft Excel and Access files. The issue is not SAS data compatibility but the bit-ness of the Microsoft data providers. In 64-bit SAS for Windows, attempts to use PROC IMPORT with DBMS=EXCEL, PROC EXPORT with Excel or Access options, or LIBNAME EXCEL can fail with errors such as ERROR: Connect: Class not registered or Connection Failed. As Hemedinger explains, the cause is that the 64-bit SAS process cannot use the built-in data providers for Microsoft Excel or Microsoft Access, which are usually 32-bit modules. Thibault's paper confirms that installation of the PC Files Server on the same machine will be required, since the required 32-bit ODBC drivers are incompatible with 64-bit SAS on Windows.
The workarounds depend on the file type and local setup. SAS/ACCESS to PC Files provides methods such as DBMS=EXCELCS, DBMS=ACCESSCS and LIBNAME PCFILES, all of which use the PC Files Server as an intermediary, with an autostart feature that minimises configuration changes to existing SAS programmes. For .xlsx files, DBMS=XLSX removes the Microsoft data providers from the equation entirely and requires no additional setup from SAS 9.3 Maintenance 1 onwards. Installing 64-bit Microsoft Office may appear to solve the bit-ness mismatch by supplying 64-bit providers, but as Hemedinger cautions, Microsoft recommends the 64-bit version of Office in only a few circumstances, and that route can introduce other incompatibilities with how Office applications are used.
Identifying 32-Bit Catalogs in a Mixed Environment
In mixed environments, a practical challenge is identifying which catalogs are still 32-bit and which are already 64-bit. This was precisely the problem Michael Raithel posed on LinkedIn in March 2015, after finding that no SAS facility, whether PROC CATALOG, PROC CONTENTS, PROC DATASETS or the Dictionary Tables, provided a direct way to distinguish them. His solution treats the .sas7bcat file as a flat file rather than a catalog, reading the first record and searching for the character strings W32_7PRO (identifying a 32-bit catalog) and X64_7PRO (identifying a 64-bit catalog). The macro he developed can be run against any number of catalogs and builds a SAS data set recording the bit-ness and full path of each file, making large-scale inventory automation entirely practical during a phased transition.
For broader validation work, the Olson and Wiehle paper pairs PROC MIGRATE with macros based on PROC CONTENTS, PROC DATASETS and PROC COMPARE, documenting what existed in the source library before migration and verifying what exists in the target library afterwards. For highly regulated or large-scale environments, that kind of structured checking is not optional.
Navigating the Transition Without Unnecessary Disruption
The main lesson from all of this is that moving from 32-bit to 64-bit SAS on Windows is not simply a matter of reinstalling software and carrying on unchanged. Much will work as before, particularly with ordinary data sets and particularly in SAS 9.3 and later. Catalogs, format libraries, item stores and Microsoft Office integration, however, require deliberate attention.
The transition is not so much problematic as predictable. Keeping 32-bit SAS available until catalog migration is confirmed, using PROC MIGRATE for full libraries, using PROC COPY with NOCLONE for individual data sets, converting format catalogs via CPORT/CIMPORT or CNTLOUT/CNTLIN, recreating item stores and compiled macros in the new environment and testing Office-related workflows and PIPE based code before deployment together form a sound path through the process. With that preparation in place, the advantages of a 64-bit environment can be gained without avoidable disruption.
Shaping SAS output using ODS Style Definitions as well as SAS Formats
Working with SAS output involves two related but distinct concerns: how results look, and how values are displayed. The material here covers both sides of that equation. On one hand, the DEFINE STYLE statement in PROC TEMPLATE provides a way to create and customise ODS styles for destinations that support the STYLE= option. On the other, SAS formats determine how character, numeric, date and time values are written in output. Taken together, these features shape both presentation and readability, which is why it is useful to understand them in the same discussion.
The DEFINE STYLE Statement
The DEFINE STYLE statement is the foundation for creating a stand-alone style. Its syntax allows a style to be stored in a template store and to include inherited behaviour, notes, imported CSS and individual style element definitions. A style definition begins with DEFINE STYLE followed by a style path (or in the special case of Base.Template.Style, it is that name itself), and it must end with an END statement. That final END is not optional, as it is a hard requirement. Within the body of the style, statements such as PARENT=, NOTES, CLASS, IMPORT and STYLE determine how the style behaves and what it contains.
Style Paths and the STORE= Option
The style path identifies where a style is stored. It consists of one or more names separated by periods, with each name representing a directory in a template store. PROC TEMPLATE writes the style to the first writeable template store in the current path unless a STORE= option directs it elsewhere. The STORE=libref.template-store option specifies a particular template store, and if that template store does not already exist, SAS creates it automatically. One important point is that the syntax of the STORE= option does not become part of the compiled template, so it affects where the style is saved rather than the internal definition itself.
Base.Template.Style
A notable special case is Base.Template.Style. This creates a style that becomes the parent of all styles that do not explicitly specify a parent, and once created it is automatically applied to output until it is specifically removed from the item store. That convenience comes with a clear caution: the SAS-supplied Base.Template.Style contains inheritance information relied upon by many styles, and if that inheritance structure is not preserved, some style elements might not appear in output. The safer route is therefore to start from the existing Base.Template.Style, write it to an external file and edit its contents rather than constructing a replacement from scratch. There is also a restriction: if PARENT= is specified, it must refer to a style other than Base.Template.Style.
Inheritance and the PARENT= Statement
Inheritance is central to how ODS styles work. The PARENT= statement specifies the style from which the current style inherits its style elements, style attributes and statements. The style path named in PARENT= is looked up in the first readable template store in the current path, and unless the current style overrides something, everything in the parent style carries through. SAS ships with several styles that can be used as a base, including styles.default, styles.beige, styles.brick, styles.brown, styles.d3d, styles.minimal, styles.printer and styles.statdoc. This inheritance model makes style creation more manageable because most new styles are refinements of existing ones rather than fully independent definitions.
The NOTES Statement
For documentation inside the style itself, the NOTES statement provides a place to store descriptive text. This differs from a SAS comment because the text becomes part of the compiled style template and can be viewed with the SOURCE statement. That makes NOTES useful for recording what a style is for, what it changes, or any implementation detail worth preserving alongside the template. In a shared environment, that sort of embedded documentation can be more durable than comments kept in a separate program file.
The CLASS Statement
The CLASS statement creates a style element from a like-named style element. In practical terms, it duplicates an existing element of the same name and applies modifications. The three statements class fonts;, style fonts from fonts; and style fonts from _self_; are equivalent, making CLASS a convenience form for a common pattern. It takes one or more style element names, optional descriptive text and optional attribute specifications. If the same attribute is specified more than once, the last value given is the one SAS uses, and that rule is worth keeping in mind when reading or maintaining larger templates.
The STYLE Statement
The STYLE statement is more general and is the main mechanism for creating or modifying one or more style elements. It can define new elements, override inherited ones, or absorb attributes from an existing element by using the FROM option. When a new style element overrides one that is a parent of other elements, all of its descendants (including those inherited from parent styles) also inherit the new attributes, which is one of the reasons why small changes can have broad visual effects in output. Style elements within a single STYLE statement must be separated by commas.
The distinction between using FROM and not using it is particularly important. If a like-named style element already exists in the child style, and it is not created with FROM, the child version overrides the parent version entirely. If it is created with FROM, the attributes from the parent style element are absorbed into the child style element. Without FROM, an attribute defined in a like-named style element in the parent is not inherited unless it is explicitly specified again. With FROM, inherited attributes remain in play and can then be modified selectively, and this is the practical difference between replacement and extension.
The _SELF_ keyword is a shorthand within the STYLE statement, specifying that each named style element should inherit from an existing style element of the same name. It is most useful when specifying multiple style elements in one statement. For example, the single statement style data, data1, dataempty from _self_ / color = red backgroundcolor = black; is exactly equivalent to writing separate STYLE statements for data, data1 and dataempty individually. Where the same attribute appears more than once among multiple identical style element names, the last value specified is used. PROC TEMPLATE looks first in the current style for the named style element when resolving a FROM reference, and only looks in the parent style if the element is not found there.
Style Attributes
Style attributes follow the general form style-attribute-name=<|>style-attribute-value. Standard attribute names from the documented list are written without quotation marks, while user-defined attribute names must be enclosed in quotation marks. The vertical bar (|) symbol prevents the style attribute from being inherited by any child style elements, allowing a template author to control precisely how far a change spreads through the inheritance tree. Text associated with a STYLE statement also becomes part of the compiled template (much like NOTES), which can help explain why a specific element is defined in a particular way.
The IMPORT Statement and CSS
The IMPORT statement bridges CSS and ODS styles by importing Cascading Style Sheet information from a file into the style. The file specification can be an external file path, a fileref or a URL, and once imported, SAS converts the CSS code into style attributes and style elements that can be used by PROC TEMPLATE. There are requirements of which you need to be aware: the CSS file must be written in the same type of CSS that the ODS HTML statement produces, and only class names that match ODS style element names are supported, with no IDs and no context-based selectors permitted. If needed, the CSS that ODS creates can be examined with the STYLESHEET= option, or by viewing the HTML source and inspecting the code at the top of the file.
Media types add another layer to the IMPORT statement. The syntax allows up to ten media types to be specified, separated by commas, corresponding to how output will be rendered on screen, paper, with a speech synthesiser or with a braille device, for example. CSS code outside any media block is always included, and the media type option additionally imports the section of a CSS file intended only for a specific media type. If no media type is specified in the ODS statement, but media types exist in the CSS file, ODS uses the Screen media type by default. If multiple media types are specified, all of their style information is applied, though if duplicate style information appears in different media blocks, the styles from the last media block are used.
The REPLACE Statement
One statement that no longer belongs in current practice is REPLACE. The SAS documentation states plainly that it is no longer supported and that STYLE or CLASS should be used instead to create and modify style elements. That is a useful reminder when reading older code, as REPLACE appears in legacy templates and conference papers that predate its deprecation.
The ODS Style Element Catalogue
To make sense of style customisation, it helps to understand the wider catalogue of ODS style elements. These elements are organised by function, and many are abstract, meaning they exist for inheritance purposes rather than direct rendering. Abstract elements are not explicitly used in ODS output and will not appear in destinations that generate a style sheet.
Miscellaneous and Document Elements
A broad abstract element, Container, controls all container-oriented elements and sits near the top of several inheritance chains. Document-related elements such as Document, Body, Frame, Contents and Pages control the overall presentation of output files, including page background and margins, with Body, Frame, Contents and Pages all inheriting from Document. Several further miscellaneous elements handle specific rendering concerns: Continued controls the continued flag when a table breaks across a page (paginated destinations only), ExtendedPage handles the message displayed when a page will not fit (Printer destination only), PageNo controls page numbers for paginated destinations and Parskip controls the space between tables. UserText controls the ODS TEXT= style and inherits from Note. The StartUpFunction and ShutDownFunction elements add JavaScript functions to HTML output that execute on page load and page exit, respectively, and PrePage controls the ODS RTF/MEASURED PREPAGE= style.
Date Elements
Date-related elements include Date (an abstract element controlling how date fields look), BodyDate (which controls the date field in the Contents file and inherits from ContentsDate) and PagesDate (which controls the date field in the Pages file and inherits from Date).
Contents and Pages Elements
Contents and pages files are influenced by a substantial group of elements. IndexItem is an abstract element controlling list items and folders for both files. ContentFolder controls folders in the Contents file, and ByContentFolder controls byline folders there, inheriting from ContentFolder. ContentItem controls items in the Contents file and PagesItem controls items in the Pages file, both inheriting from IndexItem. The abstract element Index covers miscellaneous Contents and Pages components, and from it inherit IndexProcName, ContentProcName, ContentProcLabel, PagesProcName and PagesProcLabel, which handle procedure names and labels in each file. IndexTitle and ContentTitle control the titles of the Contents and Pages files; in styles.default, ContentTitle contains a PRETEXT= attribute that prints the text "Table of Contents". IndexAction and FolderAction determine what happens on mouse-over events for folders and items (HTML only). SysTitleAndFooterContainer controls the container for system page titles and footers, and is generally used to add borders around a title.
Titles, Footers and Related Elements
Titles and footers are handled by the abstract element TitlesAndFooters, which controls system page title and footer text. SystemTitle inherits from it and chains through SystemTitle2 up to SystemTitle10, with each inheriting from the one before. The footer series follows the same pattern from SystemFooter through SystemFooter2 to SystemFooter10. TitleAndNoteContainer controls the container for procedure-defined titles and notes, inheriting from Container. ProcTitle controls procedure title text and inherits from TitlesAndFooters, with ProcTitleFixed handling procedure title text that requests a fixed font.
Bylines
BylineContainer controls the container for the byline (generally used to add borders) and inherits from Container. Byline controls byline text and inherits from TitlesAndFooters.
Notes, Warnings and Errors
Notes, warnings and errors each consist of two pieces: a banner area and a content area. The abstract element Note controls the container for note banners and note contents, and inherits from Container. The banner elements (NoteBanner, WarnBanner, ErrorBanner and FatalBanner) generally use the PRETEXT= attribute to print the banner label. Each has a corresponding content element (NoteContent, WarnContent, ErrorContent and FatalContent), and fixed-font variants exist for note, warning and error content (NoteContentFixed, WarnContentFixed and ErrorContentFixed). All of these elements inherit from Note.
Table Elements
Elements governing table output form a substantial hierarchy. Output is an abstract element that controls basic output forms, including borders (via FRAME=, RULES= and individual border control attributes), cell spacing, cell padding and background colour, inheriting from Container. Table controls overall table style and inherits from Output, as does Batch (which controls batch mode output). Three further abstract elements are specific to RTF output: TableHeaderContainer (which places and controls the box around all column headings), TableFooterContainer (which does the same for column footers) and ColumnGroup (which controls the box around groups of columns).
Data Cell Elements
Cell is an abstract element that controls data, header and footer cells, inheriting from Container. Data cells are controlled by Data (the default style for data cells), DataFixed (for data cells requesting a fixed font), DataEmpty (for empty data cells), DataEmphasis (for emphasised data cells), DataEmphasisFixed (for emphasised data cells requesting a fixed font), DataStrong (for strong, more emphasised data cells) and DataStrongFixed. All inherit from Cell or from one another in a chain.
Header and Footer Cell Elements
Header and footer cells are governed by HeadersAndFooters, an abstract element inheriting from Cell. Headers include Header, HeaderFixed, HeaderEmpty, HeaderEmphasis, HeaderEmphasisFixed, HeaderStrong and HeaderStrongFixed. Row headers follow a parallel set: RowHeader, RowHeaderFixed, RowHeaderEmpty, RowHeaderEmphasis, RowHeaderEmphasisFixed, RowHeaderStrong and RowHeaderStrongFixed. Footers mirror the same pattern through Footer, FooterFixed, FooterEmpty, FooterEmphasis, FooterEmphasisFixed, FooterStrong and FooterStrongFixed, with row footers following suit via RowFooter and its variants. PROC TABULATE captions are separately covered by the abstract element Caption (which inherits from HeadersAndFooters), BeforeCaption and AfterCaption.
SAS Formats
While styles affect appearance, formats affect representation. SAS organises formats into four categories: Character, Date and Time, ISO 8601 and Numeric. Formats that support national languages are documented separately in the SAS National Language Support reference, and storing user-defined formats is an important consideration when those formats are associated with variables in permanent SAS data sets shared with others.
Character Formats
Character formats cover both simple display and conversion tasks. $CHARw. and $w. write standard character data, while $QUOTEw. encloses values in double quotation marks. $UPCASEw. converts character data to uppercase, and $MSGCASEw. writes uppercase output when the MSGCASE system option is in effect. Several formats transform character data into alternative encodings or representations: $ASCIIw. converts to ASCII, $EBCDICw. converts to EBCDIC, $HEXw. converts to hexadecimal, $BINARYw. converts to binary and $OCTALw. converts to octal. Others alter ordering or length handling: $REVERJw. writes character data in reverse order and preserves blanks, $REVERSw. writes it in reverse and left-aligns it, and $VARYINGw. writes character data of varying length. $BASE64Xw. converts character data into ASCII text using Base 64 encoding.
Date and Time Formats
Date and time formats are especially broad. Traditional date formats include DATEw. (writing values as ddmmmyy or ddmmmyyyy), DDMMYYw. and DDMMYYxw. (day-month-year with various separators), MMDDYYw. and MMDDYYxw. (month-day-year), YYMMDDw. and YYMMDDxw. (year-month-day), MONYYw. (month and year), MONNAMEw. (month name), DOWNAMEw. (day of week name), WEEKDATEw. and WEEKDATXw. (day of week and date in different orderings) and WORDDATEw. and WORDDATXw. (month name with day and year in different orderings). Quarter and year formats include QTRw., QTRRw. (Roman numerals), YEARw., YYQw., YYQxw., YYQRw. and YYQRxw.. Week number formats include WEEKUw., WEEKVw. and WEEKWw., each using a different numbering algorithm.
Year-month combination formats include YYMMw., YYMMxw., YYMONw., MMYYw. and MMYYxw.. DAYw. writes the day of the month and WEEKDAYw. writes the day of the week as a number. Time and date time formats include TIMEw.d, TIMEAMPMw.d, TODw.d, HHMMw.d, HOURw.d, MMSSw.d, DATETIMEw.d and DATEAMPMw.d. Formats that take a date time value and write only part of it include DTDATEw., DTMONYYw., DTWKDATXw., DTYEARw. and DTYYQCw.. Julian date formats include JULDAYw. (Julian day of the year), JULIANw. (Julian date in yyddd or yyyyddd), PDJULGw. (packed Julian in hexadecimal yyyydddF for IBM) and PDJULIw. (packed Julian in hexadecimal ccyydddF for IBM).
The $N8601 character formats also appear within the Date and Time category. $N8601Bw.d and $N8601BAw.d both write ISO 8601 duration, date time and interval forms using basic notations. $N8601Ew.d and $N8601EAw.d use extended notations. $N8601EHw.d uses extended notation with a hyphen for omitted components, $N8601EXw.d uses an x in place of each digit of an omitted component, $N8601Hw.d drops omitted components in duration values and uses a hyphen for omitted date time components, and $N8601Xw.d drops omitted duration components and uses an x for each digit of an omitted date time component.
ISO 8601 Formats
The ISO 8601 category covers the same $N8601 character formats listed above, together with the B8601 (basic notation) and E8601 (extended notation) families of numeric formats. Basic formats include B8601DAw. (date as yyyymmdd), B8601DNw. (date from a date time value as yyyymmdd), B8601DTw.d (date time as yyyymmddThhmmssffffff), B8601DZw. (date time in UTC with time zone offset as yyyymmddThhmmss+|-hhmm), B8601LZw. (local time with UTC offset as hhmmss+|-hhmm), B8601TMw.d (time as hhmmssffff) and B8601TZw. (time adjusted to UTC as hhmmss+|-hhmm). Extended formats follow the same structure: E8601DAw. (date as yyyy-mm-dd), E8601DNw., E8601DTw.d, E8601DZw., E8601LZw., E8601TMw.d and E8601TZw.d, each using hyphen and colon delimiters to separate date and time components. These formats are important where standards compliance, machine readability or time zone clarity matter.
Numeric Formats
Numeric formats address general presentation, technical encoding and domain-specific output. BESTw. lets SAS choose the best notation, w.d writes standard numeric data one digit per byte and Zw.d adds leading zeroes. BESTDw.p lines up decimal places for values of similar magnitude and prints integers without decimals. Dw.p does the same over a potentially wider range of values, and Ew. writes values in scientific notation.
Financial and punctuation-sensitive displays are handled by COMMAw.d (comma every three digits, period for decimal), COMMAXw.d (period every three digits, comma for decimal), NUMXw.d (comma in place of the decimal point), DOLLARw.d, DOLLARXw.d, PERCENTw.d, PERCENTNw.d (using a minus sign for negative values) and NEGPARENw.d (negative values in parentheses). Integer and binary formats include IBw.d (native integer binary including negative values), IBRw.d (integer binary in Intel and DEC formats), PIBw.d (positive integer binary), PIBRw.d (positive integer binary in Intel and DEC formats) and RBw.d (real binary floating-point). Floating-point formats include FLOATw.d (native single-precision) and IEEEw.d. FRACTw. converts values to fractions.
Encoding formats include HEXw. (hexadecimal), BINARYw. (binary), OCTALw. (octal), PDw.d (packed decimal), PKw.d (unsigned packed decimal) and ZDw.d (zoned decimal). IBM mainframe formats form their own group: S370FFw.d (standard numeric), S370FIBw.d (integer binary including negative values), S370FIBUw.d (unsigned integer binary), S370FPDw.d (packed decimal), S370FPDUw.d (unsigned packed decimal), S370FPIBw.d (positive integer binary), S370FRBw.d (real binary floating-point), S370FZDw.d (zoned decimal), S370FZDLw.d (zoned decimal leading sign), S370FZDSw.d (zoned decimal separate leading sign), S370FZDTw.d (zoned decimal separate trailing sign) and S370FZDUw.d (unsigned zoned decimal). VAXRBw.d writes real binary data in VMS format and VMSZNw.d generates VMS and OpenText COBOL zoned numeric data.
Readable formats include ROMANw. (Roman numerals), WORDSw. (values as words) and WORDFw. (values as words with fractions shown numerically). The SSNw. format writes Social Security numbers and PVALUEw.d writes p-values.
Combining ODS Styles and Formats for Cleaner SAS Output
The connection between style definitions and formats is straightforward, even if the details are substantial. Styles determine the visual structure of ODS output through inheritance, element definitions and optional CSS imports, while formats determine how the values inside that output are written. A report can therefore be shaped at two levels at once: the appearance of titles, tables, notes and cells through DEFINE STYLE, and the textual form of dates, times, percentages, identifiers and other values through the SAS format system. Understanding both gives a clearer picture of how SAS turns data into output that is both functional and legible.
Managing Microsoft Outlook on Windows: Fonts, Zoom, Data Files and Deployment Controls
Outlook continues to evolve across Windows, with a mixture of everyday personalisation options for users and deployment controls for administrators. Recent guidance from Microsoft brings together practical steps for composing messages in a preferred typeface, approaches for reading messages more comfortably, and a set of administrative measures to manage when and how the new Outlook appears in an organisation. Alongside this are reminders about where Outlook stores data on different account types and how that affects moving between computers, as well as pointers for finding POP, IMAP and SMTP settings for Outlook.com when manual configuration is needed. What follows draws these threads together so that individual users and IT teams can navigate the changes with clarity.
Changing the Default Font for New Messages and Replies
For those composing email, Outlook starts with a familiar default: new messages use Calibri in black. This is only a starting point because the application allows the font, its colour, size and style to be changed, and it treats new messages separately from replies and forwards so that different choices can be set for each if desired.
In new Outlook for Windows, the path goes like this: View > View Settings > Email > Compose and Reply. Under Message Format, the preferred font, size and style can be chosen before saving, and these settings then apply whenever a message is written or a reply is sent. Note that in new Outlook the font setting applies to both new messages and replies and forwards from a single control, so a separate choice for each is not available in this version.
In classic Outlook for Windows, the approach is different and more granular. Navigating to File > Options > Mail reveals a Stationery and Fonts button. On the Personal Stationery tab, there are separate Font buttons for new mail messages and for replying or forwarding messages, which allows a distinct typeface, size and colour to be set for each scenario independently. This separation can be useful for distinguishing composed messages from replied ones at a glance. If similar changes are needed for the message list rather than the compose window, there is a separate set of options for changing the font or font size in the message list.
Adjusting the Zoom Level in the Reading Pane
Comfort when reading is equally important, particularly with longer emails. Both new and classic Outlook offer ways to adjust zoom in the Reading Pane without touching system-wide display settings, though the controls differ between the two versions. In new Outlook, selecting a message in the inbox opens it in the Reading Pane, after which the View tab's Zoom control can be used. Zooming in and out is done with plus and minus buttons, and there is a Reset option that returns the view to its default level. In classic Outlook, the same result can be achieved either by dragging the zoom bar at the bottom right of the window or by going to View and then Zoom, where a specific percentage between 50% and 200% can be chosen. Classic Outlook also offers a "Remember my preference" checkbox in the Zoom dialogue, which locks the chosen level so it persists across sessions without needing to be reset each time. In both versions, these adjustments affect only how messages appear on the screen and have no bearing on how they are composed or how recipients will see them.
Confirming Which Version of Outlook Is in Use
Not every copy of Outlook presents the same options at the same time. If steps that are described as applying to new Outlook do not appear, the device may still be running classic Outlook for Windows. That is not uncommon in environments where administrators are controlling the transition or where devices have not yet received the relevant updates, so checking the version in use is a sensible first step before assuming that something has gone wrong.
Hiding the New Outlook Toggle in Classic Outlook
For administrators, a recurring question is how to prevent users from switching to new Outlook until the organisation is ready. Microsoft provides a cloud policy in the Microsoft 365 Apps admin centre that hides the Try the new Outlook toggle in classic Outlook for Windows. After signing in to the admin centre, the policy can be created by going to Customisation, selecting Policy Management and enabling the policy named Hide the "Try the new Outlook" toggle in Outlook. There is also a registry-based method for controlling the same setting: the key is under HKEY_CURRENT_USERSoftwareMicrosoftOffice16.0OutlookOptionsGeneral and is named HideNewOutlookToggle, with a value of dword:00000000 to hide the toggle. To later enable the policy, the same value is set to 1. As with any registry change, this approach is best handled with care and in line with internal change management practices.
Removing the New Outlook App After Preinstallation on Windows 11
Preinstallation of the new Outlook on Windows 11 is another area where planning matters. On Windows 11 builds later than version 23H2, the app is preinstalled for all users, and there is currently no way to block that preinstallation. If devices should not surface the new Outlook, it can be removed after installation using the following Windows PowerShell command:
Remove-AppxProvisionedPackage -AllUsers -Online -PackageName (Get-AppxPackage Microsoft.OutlookForWindows).PackageFullName
After deprovisioning, Windows updates will not reinstall the app. Administrators can also remove an additional Windows orchestrator registry value at HKEY_LOCAL_MACHINESOFTWAREMicrosoftWindowsUpdateOrchestratorUScheduler_OobeOutlookUpdate where applicable. Devices that have installed the March 2024 Non-Security Preview release, or a later cumulative update for Windows 11 version 23H2, respect the deprovisioning command and do not require removal of that registry value.
Handling User-Installed Instances and Start Menu Placeholders
Users may also install the app themselves, for example by selecting a toggle. In that case, the management approach shifts from provisioned packages to installed packages, and the following PowerShell command removes the app for all users:
Remove-AppxPackage -AllUsers -Package (Get-AppxPackage Microsoft.OutlookForWindows).PackageFullName
It is worth verifying whether the app is actually installed or whether only a Start menu placeholder is visible because a pinned icon may appear even when the underlying app is not yet present. A quick check of the folder at %localappdata%MicrosoftOlklogs can confirm whether the app has produced logs, and Start layout policies can be used to manage pins, so users are not inadvertently prompted to install by selecting a placeholder. On consumer devices, a Recommended section in the Windows 11 Start menu can also surface the app, which may need consideration in user communications.
Migrating Users Away from Windows Mail and Calendar
The end of support for Windows Mail and Calendar on the 31st of December 2024 introduced another migration pathway. Active users of those apps are being switched automatically to the new Outlook app, so organisations that wish to block that route can remove the Mail and Calendar apps from devices using the following command:
Get-AppxProvisionedPackage -Online | Where {$_.DisplayName -match "microsoft.windowscommunicationsapps"} | Remove-AppxProvisionedPackage -Online -PackageName {$_.PackageName}
For current users, the installed package can be removed with Remove-AppxPackage -AllUsers -Package (Get-AppxPackage microsoft.windowscommunicationsapps).PackageFullName. Alternatives exist through Microsoft Intune or Configuration Manager, which may be preferable in environments that already use those tools for application lifecycle management.
Blocking Acquisition via the Microsoft Store
Preventing acquisition from the Microsoft Store is more straightforward. Because the new Outlook for Windows is available there as well, blocking access to the Microsoft Store app prevents users from downloading it through that channel. Microsoft provides configuration options for controlling Microsoft Store access, and administrators can align those with broader device management policies that may already limit consumer app installs on corporate devices.
Opting Out of Automatic Migration
Some organisations will want to opt out of new Outlook migration entirely for a period. Starting in January 2025, users with Microsoft 365 Business Standard and Premium licences are automatically migrated from classic Outlook to new Outlook, with in-app notifications sent before the switch and the option to toggle back afterwards. Microsoft exposes a policy named Manage user setting for new Outlook automatic migration that controls whether users are switched automatically. If the policy is not set, the user setting remains uncontrolled and users can manage it themselves, with the default being enabled. Enabling the policy enforces automatic migration and prevents users from changing the setting, while disabling it turns off automatic migration and also prevents user changes. The equivalent registry setting sits under HKEY_CURRENT_USERSoftwarePoliciesMicrosoftoffice16.0outlookpreferences with a DWORD named NewOutlookMigrationUserSetting set to 0 to disable or 1 to enable. The same controls can be managed via Group Policy Administrative Templates and through the Cloud Policy service from the Microsoft 365 Apps admin centre, and because the setting is defined in ADMX templates it can also be surfaced in Intune using Administrative Templates.
Applying Conditional Access and Mailbox Policies
Beyond installation state and migration timing, access policies are a decisive layer of control. Conditional Access policies can require multifactor authentication, restrict access by location, block risky sign-in behaviours or insist on organisation-managed devices. For additional nuance, Outlook on the web (OWA) mailbox policies used together with the ConditionalAccessPolicy parameter can limit capabilities for users on non-compliant devices, for instance by restricting attachments. This approach allows a more graduated user experience that reduces risk without completely blocking access, and it can be combined with broader Conditional Access baseline requirements.
There are cases where a firmer control is required. To prevent mailbox access from the new Outlook regardless of how users acquired the app, administrators can use an Exchange mailbox policy that blocks organisation mailboxes from being added. This acts as a final block so that work or school accounts cannot be used in the app, even if an individual user has installed it or found it preinstalled. Because mailbox policies are applied to the account rather than to a device or a specific app, it is prudent to consider them alongside the earlier measures that block acquisition or control installation, so that personal accounts are not used in ways that bypass organisational safeguards.
Understanding How Outlook Stores Data and What Moves to a New Computer
While deployment and access are important, day-to-day continuity often depends on understanding how Outlook stores data and how that affects moving to a new computer. Outlook saves backup information in a variety of different locations depending on the account type involved. For users of Microsoft 365, Exchange, Outlook.com, Hotmail.com or Live.com accounts not accessed by POP or IMAP, email is backed up on the server and there is no Personal Folders file with a .pst extension. An Offline Folders file with an .ost extension may be present, but Outlook automatically recreates this when a new email account is added, and it cannot be moved between computers. Other elements such as navigation pane settings, print styles, signatures and stationery can be transferred, and their locations vary with version and configuration.
Users of POP accounts encounter a different arrangement. All email, calendar, contact and task information is stored in a .pst file, and moving this file to a new computer preserves that information. It does not carry over the account settings themselves, so Outlook needs to be set up on the new computer before opening the .pst file that was copied from the old one. On Windows 11, navigation pane settings are found at drive:Users<username>AppDataRoamingMicrosoftOutlook and signatures at drive:Users<username>AppDataRoamingMicrosoftSignatures. Knowing these paths saves time during a migration and reduces the risk of overlooking important data.
Avoiding OneDrive Synchronisation Problems with PST Files
Large .pst files can slow down OneDrive synchronisation if they are stored in folders that OneDrive is backing up. Symptoms include messages such as "Processing changes" or "A file is in use" that persist for longer than expected. Microsoft provides guidance on removing an Outlook PST data file from OneDrive if that becomes necessary, and doing so can restore normal synchronisation behaviour while keeping Outlook functional on the local machine.
Showing Hidden Files and Extensions on Windows
Locating Outlook data sometimes means revealing folders and file name extensions that Windows hides by default. This is especially true when navigating to AppData or similar directories, or when differentiating between PST and OST files. On Windows 11 File Explorer, going to View > Show, where both "File name extensions" and "Hidden items" settings can be toggled to their on positions. Doing so makes the AppData folder and the distinction between these file types visible without needing to navigate through the Control Panel.
Configuring POP, IMAP and SMTP Settings for Outlook.com
Configuration of Outlook.com accounts brings its own questions when used in the Outlook desktop app or other mail applications. Outlook and Outlook.com can often detect the correct mailbox settings automatically, which simplifies setup for many users. When that is not the case, or when using a third-party app, the POP, IMAP and SMTP settings can be viewed within Outlook.com settings and used for manual configuration. For Outlook.com accounts, both the IMAP and POP server name is outlook.office365.com, with IMAP using port 993 and POP using port 995, both with SSL/TLS encryption and OAuth2 authentication. It is worth noting that POP and IMAP access is disabled by default in Outlook.com and must be enabled in account settings before either protocol can be used. For other non-Microsoft accounts, the safest course is to obtain settings directly from the relevant email provider rather than guessing values, since incorrect entries can lead to connection issues that are not always obvious at first glance.
Getting Support for Outlook.com
Support remains close at hand for Outlook.com users who need it. The Help option on the menu bar in Outlook.com opens self-help resources where queries can be entered and common issues surfaced. If those do not resolve the problem, there is a path to contact support, which requires signing in to the account so that assistance can be tailored. If signing in is not possible, Microsoft directs users to a separate route to begin recovery or get help, and the Outlook.com Community provides an additional place to search for answers or ask questions from other users.
Keeping Users and IT Teams Informed During Outlook's Transition
Together, these user-facing features and administrative controls reflect a period of transition for Outlook on Windows. Individuals can shape the way they write and read messages, adjusting fonts to suit their preferences and using zoom where needed, without altering system-wide settings. Administrators can pace the adoption of the new Outlook with policies that hide toggles, prevent or reverse preinstallation, opt out of automatic migration and apply Conditional Access or mailbox policies that enforce organisational requirements. Underneath these changes, the fundamentals of data storage and account setup remain steady, with server-backed accounts recreating their local caches on-demand and POP accounts relying on .pst files that can be moved with care. By keeping these points in mind, users and IT teams alike can make informed decisions that avoid surprises and maintain a smooth email experience.
Understanding and Managing MySQL Binary Log Files
When a MySQL server begins filling a disk with files named along the lines of server-bin.n or mysql-bin.00000n, where n is an incrementing number, the immediate concern is usually storage pressure. These files typically live in /var/lib/mysql and can accumulate steadily or very quickly depending on workload. While they may look obscure at first glance, they are neither random nor a sign of corruption. They are MySQL binary log files, commonly shortened to binlogs, and they exist for specific operational reasons.
What Binary Logs Are and What They Do
Understanding what these files do is the first step towards deciding whether they should remain enabled. The binary log contains all statements that update data or that potentially could have updated it, including operations such as DELETE or UPDATE where no rows were matched. Rather than being a plain text history, the log stores events describing those modifications and also includes information about how long each data-changing statement took to execute. In practical terms, the binary log forms a continuous stream of change activity on the server.
That design serves two important purposes. The first is data recovery: after a backup file has been restored, the events recorded in the binary log after that backup was made are re-executed, bringing the databases up to date from the point of the backup. The second is high availability and replication: the binary log is used on primary replication servers as a record of the statements to be sent to secondary servers. It is worth noting that it is actually the replica that pulls log data from the primary, rather than the primary pushing it outward. Without binlogs, that style of replication cannot function at all.
Before Making Any Changes
Because of those uses, any change to binary logging should be made with care. Changes to MySQL server settings and to binlog files should be made only when verified backups already exist, as mistakes can lead to data corruption or loss. If a server is important, caution matters more than convenience.
It is also worth noting that from MySQL 8.0 onwards, binary logging is enabled by default, with the log_bin system variable set to ON whether or not the --log-bin option is specified explicitly. This is a significant change from earlier versions, where binary logging had to be switched on manually. Commenting out log_bin may not switch it off in newer versions, which is consistent with this change in default behaviour and is one reason why simply commenting out that directive may not always produce the expected result.
Disabling Binary Logging
On systems where replication is not in use and point-in-time recovery is not needed, disabling binary logging may be a reasonable choice. The conventional approach is to open /etc/my.cnf or /etc/mysql/my.cnf and find the line that reads log_bin, then remove or comment it out:
#log_bin = /var/log/mysql/mysql-bin.log
The following related settings for automatic expiry and maximum log file size can also be commented out if binary logging is being disabled entirely:
#expire_logs_days = 10
#max_binlog_size = 100M
Once the file is saved, the server needs to be restarted. On many Linux systems, that means running one of the following:
service mysql restart
or:
systemctl restart mysql
Following a restart, it is worth confirming the service came back cleanly:
systemctl status mysql
If disabling binary logging does not produce the expected result, the safest course is to check the active MySQL version and review the matching vendor documentation for that release, as configuration behaviour is not always identical across versions and distributions. In MySQL 8.0 and later, the preferred startup options for disabling binary logging are --skip-log-bin or --disable-log-bin rather than commenting out log_bin alone.
Replication and the Case for Keeping Binlogs
For servers that are using replication, disabling binary logs is not a realistic option. If the primary does not log, there is nothing for the replica to download and apply, and replication cannot proceed. In those cases, the question is not whether to keep the files but how to manage their growth sensibly, and that is where retention settings and manual purging come in.
MySQL provides built-in ways to expire old binlogs automatically. With expire_logs_days = 10 in the configuration file, logs older than ten days are purged automatically by the MySQL server, removing the need for manual intervention under normal circumstances. Automatic expiry is often the simplest route on systems where a fixed retention window is acceptable and where replication lag, or backup policies do not require logs to be kept for longer.
Purging Binary Logs Manually
Manual purging remains useful in several situations: storage may already be tight, log growth may have exceeded expectations, or a one-off clean-up may be needed after a burst of activity. For those cases, the MySQL client can issue purge commands directly. The first example removes everything before a named log file, and the second removes everything before a given date and time:
mysql -u root -p 'MyPassword' -e "PURGE BINARY LOGS TO 'mysql-bin.03';"
mysql -u root -p 'MyPassword' -e "PURGE BINARY LOGS BEFORE '2008-12-15 10:06:06';"
It is also possible to remove logs older than a specified interval using a relative expression:
mysql -u root -p -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);"
Any purge action on a replicated environment should be timed so that replicas have already consumed the relevant logs. Purging logs that a replica still needs will break replication. A more drastic option, RESET MASTER (or RESET BINARY LOGS AND GTIDS in MySQL 8.4 and later), deletes all binary log files and resets the binary log index entirely, and should be approached with corresponding caution rather than used as a routine housekeeping measure.
Securing Database Credentials
One practical concern in the purge examples above is authentication. Placing a password directly on the command line is a security risk because command-line passwords can be exposed to other users through shell history or process inspection. A more secure alternative is to store connection details using the mysql_config_editor utility, which writes credentials to an obfuscated file named .mylogin.cnf.
On non-Windows systems, that file lives in the current user's home directory, and each group of settings within it is called a login path. A login path can store host, user, password, port and socket values. A default client login path is created with the following command, after which the password is entered at a prompt and is not echoed to the screen:
mysql_config_editor set --login-path=client --host=localhost --user=localuser --password
Additional login paths can be added for other servers. MySQL clients can then use those stored credentials like so:
mysql --login-path=client
A purge command written with a stored login path then becomes:
mysql --login-path=client -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);"
That form is easier to automate cleanly and keeps credentials out of shell history. Client programmes still read .mylogin.cnf even when --no-defaults is used, unless --no-login-paths is specified, which gives administrators a safer way to avoid command-line passwords without giving up flexible client behaviour.
The Relationship Between mysqlcheck and Binary Logging
It is also worth understanding how table maintenance tools interact with binary logging because they can contribute to log growth in ways that are not immediately obvious. The mysqlcheck utility performs table maintenance operations, including checking, repairing, analysing and optimising tables, while the server is running. The --write-binlog option is enabled by default for ANALYZE TABLE, OPTIMIZE TABLE and REPAIR TABLE statements generated by mysqlcheck, meaning those statements are written to the binary log unless instructed otherwise.
Using --skip-write-binlog adds NO_WRITE_TO_BINLOG to those statements, preventing them from being recorded in the binary log. That can matter if maintenance operations should not be sent to replicas or replayed during recovery from backup. It is also important to note that InnoDB tables can be checked with CHECK TABLE but cannot be repaired with REPAIR TABLE, and that taking a backup before attempting any repair operation is advisable because some circumstances can result in data loss. Binary logging records change events at the server level rather than belonging to any single storage engine, so the presence of InnoDB does not in itself remove binlogs from the picture.
Managing MySQL Binary Log Files: A Practical Summary
For administrators trying to decide what to do when binlogs are consuming disk space, context is everything. A development server with no replication and no need for point-in-time recovery may be a sensible candidate for disabling binary logging. A production server participating in replication or relying on binlogs for recovery should retain them and focus instead on retention settings, controlled purging and an understanding of what workloads are causing rapid log rotation.
The practical path is to begin with the role and requirements of the server rather than with deletion. If the server does not need binary logs, use --skip-log-bin or --disable-log-bin at startup (or comment out log_bin on older versions) and then verify whether the change actually took effect for the installed version. If the server does need them, review retention settings such as expire_logs_days, use purge commands carefully when necessary and avoid exposing passwords on the command line by using mysql_config_editor and stored login paths. Where maintenance commands are involved, remember that mysqlcheck writes certain statements to the binary log by default unless --skip-write-binlog is specified. Binlog files are part of the server's change history and can be central to recovery and replication; when they begin to dominate a file system, the right response depends on what the server is expected to do and how carefully housekeeping is carried out.
Practical Excel skills for dates, locked files and text splitting
Excel has a reputation for simplicity that does not always survive contact with real working life. Project schedules need to respect weekends and public holidays, files arrive locked at precisely the moment edits are most urgent, and data lands in cells formatted in ways that resist immediate analysis. Three everyday tasks illustrate this gap well: calculating future or past dates that exclude non-working days, removing or working around read-only restrictions when genuine edits are required, and splitting text strings cleanly at the first or Nth delimiter. The guidance below draws on established Excel features, keeping the emphasis firmly on what works in practice.
Working with Business Calendars
The WORKDAY Function
Working with dates that observe a business calendar is a frequent requirement in scheduling, logistics and reporting. Excel's WORKDAY function is designed for this purpose and returns a date that is a given number of working days from a start date, excluding Saturdays and Sundays by default. At its simplest, a formula such as =WORKDAY(A1,5) moves forward five working days from the date in A1. The function can also respect a list of holidays so that results avoid specific non-working dates as well as weekends, as in =WORKDAY(A1,5,C1:C3), which skips any matching dates in C1:C3 when calculating the result.
A concrete example shows how this behaves with a real date. Using named ranges for readability, with start referring to cell B5, days to B8, and holidays to B11:B13, the formula =WORKDAY(start,days) returns the next working day five days after 23rd December 2024. With weekends excluded, but no holidays provided, the result is Monday 30th December 2024. When holidays are supplied with =WORKDAY(start,days,holidays), the function also avoids the listed dates in B11:B13 and produces Thursday 2nd January 2025. In all cases, the weekend definition is Saturday and Sunday, and the holidays must be stored as valid Excel dates to be recognised correctly.
Visualising the Path to the Result
It often helps to see the individual dates that WORKDAY steps through when reaching its answer. A compact way to achieve this is to generate a short run of consecutive dates from the start date and display them alongside abbreviated day names. Using =SEQUENCE(13,1,start) in cell D5 creates thirteen dates beginning with the date held in the named range start because Excel dates are serial numbers that increment by one per day. Formatting these cells with the custom number format ddd, dd-mmm-yy shows an abbreviated weekday alongside the date, making it straightforward to spot weekends at a glance.
Conditional formatting can then shade non-working days directly within this generated block. Because WORKDAY does not evaluate a date when zero is supplied for the argument called days, a small workaround helps determine whether a given date is itself a working day. In column D, a rule based on the formula =WORKDAY(D5-1,1)<>D5 asks WORKDAY for the next working day after the previous day; if the answer does not equal the date in D5, then D5 is not a working day and can be shaded grey. A similar rule for column E, =WORKDAY(E5-1,1,holidays)<>E5, incorporates the named holiday range and produces additional shading where dates overlap with the supplied holiday list as well as weekends.
Highlighting calculated end dates ties the visualisation together. If the main results appear in G5 and G6, cells in column D can be highlighted when equal to $G$5 using a rule such as =D5=$G$5, and cells in column E can be highlighted when equal to $G$6 using =E5=$G$6. If preferred, the formatting rules can be defined without relying on G5 and G6 by embedding the WORKDAY calls directly in the comparisons, as in =D5=WORKDAY(start,days) and =E5=WORKDAY(start,days,holidays). In either arrangement, there are four conditional formatting rules in play across the grid: two to shade non-working days and two to pick out the final dates.
Handling Non-Standard Working Weeks
Work patterns do not always match the standard five-day week. Where a schedule follows a different rhythm, such as a four-day or six-day working week, switching to the WORKDAY.INTL function is the appropriate step. It follows the same principle as WORKDAY, returning a date a set number of business days from a start date while optionally excluding holidays, but it accepts a custom definition of which weekdays count as working days. This flexibility allows organisations that operate alternative rosters to generate accurate due dates without resorting to manual adjustments or complex helper columns.
Managing Read-Only Excel Files
There are several reasons why a workbook might be set to read-only: preventing accidental erasure of data, or ensuring a file remains unchanged as it passes between parties. Each form of protection serves a different purpose and has a distinct method for disabling it. ExcelRibbon.Tips.Net is a useful ongoing reference for these and many other workbook and security scenarios across Excel 2007 and later versions.
Read-Only Recommended
The lightest touch is the Read-Only Recommended setting. When a workbook carries this flag, Excel prompts on opening with a dialogue asking whether to open it as read-only. This method applies across all versions of Microsoft Excel from 2003 through to current releases. To remove the recommendation, open the workbook, use File > Save As and choose Browse to open the Save As window, then select Tools in the lower right of the dialogue, pick General Options, clear the Read-Only Recommended checkbox and click OK before saving. The next time the file opens, the prompt does not appear.
Marked as Final
A firmer signal is applied when a workbook is Marked as Final. In this state, commands, typing and proofing marks are all disabled, and Excel displays a Marked as Final notification bar at the top of the worksheet. To turn this off when editing is required, click Edit Anyway on the notification bar. This removes the read-only state for the current copy and allows modifications to proceed. The flag is more about signalling completion than enforcing security, so the application provides a clear override directly within the interface.
Password to Modify
Password protection introduces a stronger barrier. When a workbook has a Password to Modify set, a dialogue appears on opening that invites the password or offers the option to open the file as read-only; without the password, the file cannot be modified directly. A pragmatic path when only a working copy is needed is to open the original in read-only mode, then use File > Save As with Browse, select Tools > General Options, clear the Password to Modify field and confirm with OK before saving under a new name. Opening the newly saved file allows edits because it no longer carries a modification password. Using a third-party utility to crack a password on someone else's file is inadvisable and potentially inappropriate, so the better route is to request an editable version from whoever sent the document.
Operating System-Level Restrictions on a Mac
Occasionally, the read-only state is imposed not by Excel but by the operating system, where the file has been locked so that only the owner can edit it. On a Mac, the fix is made outside Excel: locate the file in Finder, right-click it and choose Get Info, then clear the Locked checkbox before reopening the file in Excel. If the issue is one of permissions rather than a simple lock, the Get Info window also contains a Sharing and Permissions section at the bottom. This lists each user alongside a drop-down privilege set to either Read Only or Read and Write, and the file owner can adjust these entries to grant editing access to the relevant users.
Operating System-Level Restrictions on a PC
On a PC, the equivalent controls are found in File Explorer. Right-clicking the workbook and choosing Properties opens the General tab, where unchecking the Read-Only attribute and clicking OK is often sufficient to restore full access. If the restriction stems from security permissions rather than the file attribute, the Security tab lists the groups and usernames that have access along with their permission levels. Clicking Edit beneath that list allows the file owner to adjust access for individual entries, including granting the ability to modify the file where that is justified.
Splitting Text in Excel
LEFT, MID and RIGHT
Reshaping text is another everyday requirement in Excel, and the LEFT, MID and RIGHT functions provide predictable building blocks. LEFT extracts a specified number of characters from the start of a string, MID extracts from a given position in the middle, and RIGHT extracts from the end. For instance, =LEFT("test string",2) returns te, =MID("test string",6,3) returns str, and =RIGHT("test string",2) returns ng. When exact character counts are known in advance these functions can be applied directly, but real data often arrives with variable-length segments separated by spaces or other delimiters, so the position of the delimiter must first be discovered before extraction can take place.
Splitting at the First Delimiter
To split at the first occurrence of a delimiter such as a space, combining these extraction functions with FIND or SEARCH is effective. Both functions return the position of a substring within a string, with the key distinction that FIND is case-sensitive while SEARCH is not. Suppose cell A1 contains test string. To return everything to the left of the first space, use =LEFT(A1,FIND(" ",A1)-1). Here FIND returns 5 as the position of the space, subtracting 1 yields 4, and LEFT uses that figure to return test. To return the text to the right of the first space, the formula subtracts the space position from the total string length: =RIGHT(A1,LEN(A1)-FIND(" ",A1)). In this example, LEN(A1) is 11 and the FIND result is 5, so the expression evaluates to 6 and RIGHT returns string. The pattern generalises to other delimiters by replacing the space character with the required alternative.
Locating the Nth Delimiter
Locating the Nth occurrence of a delimiter takes an extra step because FIND and SEARCH identify only the first match after a given starting point. A common technique relies on SUBSTITUTE to mark the Nth occurrence with a unique character that does not appear elsewhere in the text, then uses FIND to locate it. Consider An example text string in A1 and a requirement to return everything up to the third space. Substituting the third space with a vertical bar creates a dependable marker: =SUBSTITUTE(A1," ","|",3) produces An example text|string. Finding the bar with =FIND("|",SUBSTITUTE(A1," ","|",3)) returns 16, the position of the marker. LEFT can then extract the part before that position by subtracting one: =LEFT(A1,FIND("|",SUBSTITUTE(A1," ","|",3))-1), which returns An example text. Combining the steps in this way makes the formula self-contained and avoids the need for helper cells.
These text approaches extend naturally to extracting the segment after the Nth delimiter by using MID or RIGHT, with similar position logic. Replacing LEFT with MID and adjusting the start index to the marker position plus one retrieves the portion that follows. The same idea works with second or fourth occurrences by changing the instance number inside SUBSTITUTE. When working with data sets that include inconsistent spacing or punctuation, it is worth verifying that the chosen marker character does not already appear in the source text, since the method depends on its uniqueness within each processed string.
Bringing These Techniques Together in Your Excel Workflows
These three strands combine to form a toolkit that handles a surprising range of everyday scenarios. WORKDAY and WORKDAY.INTL anchor date calculations to real-world calendars so that estimates and commitments respect weekends and public holidays, while the SEQUENCE-based visualisation grid can help colleagues understand how an end date is reached rather than simply accepting a single cell value. Managing read-only states allows teams to balance protection with flexibility, with the key being to identify which type of restriction applies before attempting to remove it. LEFT, MID and RIGHT, combined with FIND and SUBSTITUTE, turn messy text into consistent fields ready for further analysis.
From planning to production: Selected aspects of modern software delivery
Software delivery has never been more interlinked across strategy, planning and operations. Agile practices are adapting to hybrid work, AI is reshaping how teams plan and execute, and cloud platforms have become the default substrate for everything from build pipelines to runtime security. What follows traces a practical route through that terrain, drawing together current guidance, tools and community efforts so teams can make informed choices without having to assemble the big picture for themselves.
Work Management: Asana and Jira
Planning and coordination remain the foundation of any delivery effort, and the market still gravitates to two names for day-to-day project management: Asana and Jira. Each can bring order to multi-team projects and distributed work, yet they approach the job from very different histories.
With a history rooted in large DevOps teams and issue tracking, Jira carries that lineage into its Scrum and Kanban options, backlogs, sprints and a reporting catalogue that leans into metrics such as time in status, resolution percentages and created-versus-resolved trends. Built as a more general project manager from the outset, Asana shows its intent in the way users move from a decluttered home screen to “My Tasks”, switch among Kanban, Gantt and Calendar views using tabs, and add custom fields or rules from within the view rather than navigating to separate screens. The two now look similar at a glance, but their structure and presentation differ, and that influences how quickly a team settles into a rhythm.
Dashboards and Reporting
Those differences widen when examining dashboards and reporting. Jira allows users to create multiple dashboards and fill them with a large range of gadgets, including assigned issues, average time in status, bubble charts, heat maps and slideshows. The designs are sparse yet flexible, and administrators on company-managed accounts can add custom reporting, while the Atlassian Marketplace offers hundreds of additional reporting integrations.
By contrast, the home dashboard in Asana is intentionally pared back, with reports placed in their own section to keep personal task management separate from project or portfolio-level tracking. Its native reporting is broader and more polished out of the box, with pre-built views for progress, work health and resourcing, together with custom report creation that does not require admin-level access.
Interoperability
How well each tool connects to other systems also sets expectations. Jira, as part of Atlassian's suite, has a bustling marketplace with over a thousand apps for its cloud product, covering project management, IT service management, reporting and more. Asana's store is smaller, with under 400 apps at the time of writing, though it continues to grow and offers breadth across staples such as Slack, Teams and Adobe Creative Cloud, as well as a strong showing for IT and developer use cases.
Both tools connect to Zapier, which has also published a detailed comparison of the two platforms, opening pathways to thousands of further automations, such as creating Jira issues from Typeform submissions or making Asana tasks from Airtable records without writing integration code. In practice, many teams will get what they need natively and then extend in targeted ways, whether through marketplace add-ons or workflow automations.
Plans and AI
Plans and AI are where the most significant recent movement has occurred. On the Asana side, a free Personal tier leads into paid Starter and Advanced plans followed by Enterprise, with AI tools (branded "Asana Intelligence") included across paid plans. Those features help prioritise work, automate repetitive steps, suggest smart workflows and summarise discussions to reduce time spent on status communication.
Over at Jira, the structure runs from a free tier for small teams through Standard, Premium and Enterprise plans. "Atlassian Intelligence" focuses on generative support in the issue editor, AI summaries and AI-assisted sprint planning, adding predictive insights to help with resource allocation and automation. It is worth noting that Jira's entry-level paid plan appears cheaper on paper, but real-world total cost of ownership often rises once Marketplace apps, Confluence licences and security add-ons are factored in.
Choosing between the two typically comes down to need. If you want a task manager built for general use with crisp reporting and strong collaboration features, Asana presents itself clearly. If your roadmap lives and breathes Agile sprints, backlogs and issue workflows, and you need deep extensibility across a suite, Jira remains a natural fit.
Scrum: Back to Basics
Method matters as much as tooling. Scrum remains the most widely adopted Agile framework, and it is worth revisiting its essentials when translating plans into delivery. The DevOps Institute tracks the human side of this evolution, noting that skills, learning and collaboration are as central to DevOps success as the toolchain itself. A Scrum Team is cross-functional and self-organising, combining the Product Owner's focus on prioritising a transparent, value-ordered Product Backlog with a Development Team that turns backlog items into a potentially shippable increment every Sprint.
The Scrum Master keeps the framework alive, removes impediments, and coaches both the team and the wider organisation. Sprints run for no longer than four weeks and bundle Sprint Planning, Daily Scrums, a Sprint Review and a Retrospective, with online whiteboards increasingly used to run those ceremonies effectively across distributed and hybrid teams. The Sprint Goal provides a unifying target, and the Sprint Backlog breaks selected Product Backlog items into tasks and steps owned by the team.
Scrum Versus Waterfall
That cadence stands in deliberate contrast to classic waterfall approaches, where specification, design, implementation, testing and deployment proceed in long phases with significant hand-offs between each. Scrum replaces upfront specifications with user stories and collaborative refinement using the "three Cs" of Card, Conversation and Confirmation, so requirements can evolve alongside market needs. It places self-organisation ahead of management directives in deciding how work is done within a Sprint, and it raises transparency by making progress and problems visible every day rather than at phase gates.
Teams feel the shift when they commit to delivering a working increment each Sprint rather than aiming for a distant release, and when they see the cost of change flatten because feedback arrives through Reviews and Retrospectives rather than months after decisions have been made.
The State of Agile
Richer context for these shifts appears in longitudinal views of industry practice. The 18th State of Agile Report, published by Digital.ai in late 2025, observes that Agile is adapting rather than fading, with adoption remaining widespread while many organisations rebuild from the ground up to focus on measurable outcomes. The report, drawing on responses from approximately 350 practitioners, notes that AI and automation are accelerating change while introducing fresh expectations around data quality, decision-making and governance, and it emphasises that outcomes have become the currency connecting strategy, planning and execution.
That aligns with the Agile Alliance's ongoing work to re-examine Agile's core values for enterprise settings, as well as with the joint Manifesto for Enterprise Agility initiative with PMI{:target="_blank"}, which argues for adaptability as a strategic advantage rather than a team-level method choice. Significantly, the 18th report found that only 13% of respondents say Agile is deeply embedded across their business, and that only 15% of business leaders participate meaningfully in Agile practices, suggesting that leadership alignment remains one of the most persistent blockers to realising the framework's full potential.
Continuous Delivery and CI/CD Tooling
Getting from plan to production relies on engineering foundations that have matured alongside Agile. Continuous Delivery reframes deployment as a safe, rapid and sustainable capability by keeping code in a deployable state and eliminating the traditional post-"dev complete" phases of integration, testing and hardening. By building deployment pipelines that automate build, environment provisioning and regression testing, teams reduce risk, shorten lead time and can redirect human effort towards exploratory, usability, performance and security testing throughout delivery, not just at the end.
The results can be counterintuitive. High-performing teams deploy more frequently and more reliably, even in regulated settings because painful activities are made routine and small batches make feedback economical.
CI/CD in Practice
Contemporary CI/CD tools express that philosophy in developer-centred ways. Travis CI can often be described in minutes using minimal YAML configuration, specifying runtimes, caching dependencies, parallelising jobs and running tests across multiple language versions. Azure Pipelines, GitHub Actions and Azure DevOps provide similar capabilities at broader scale, with managed runners, gated releases, integrated artefact feeds, security scanning and policy controls that matter in larger enterprises.
The emphasis across these platforms is on speed to first pipeline, consistency across environments and adding guardrails such as signed artefacts, scoped credentials and secret management, so that velocity does not undercut safety.
Cloud Native Architecture
Architecture and platform choices amplify or constrain delivery flow. The cloud native ecosystem, curated by the Cloud Native Computing Foundation (CNCF) under the Linux Foundation, has become the common bedrock for organisations standardising on Kubernetes, service meshes and observability stacks. Hosting more than 200 projects across sandbox, incubating and graduated maturity levels, it spans everything from container orchestration to policy and tracing, and brings together vendors, end users and maintainers at events such as KubeCon + CloudNativeCon.
Sitting higher up the stack, Knative is a recent CNCF graduate that provides building blocks for HTTP-first, event-driven serverless workloads on Kubernetes. It unifies serving and eventing, so teams can scale to zero on demand while routing asynchronous events with the same fluency as web requests, and was created at Google before joining the CNCF as an incubating project and subsequently reaching graduation status. For teams that need to manage the underlying cluster infrastructure declaratively, Cluster API provides a Kubernetes-native way to provision, upgrade and operate clusters across cloud and on-premises environments, bringing the same declarative model used for application workloads to the infrastructure layer itself.
APIs and Developer Ecosystems
API-driven integration is part of the cloud native picture rather than an afterthought. The API Landscape compiled by Apidays shows the sheer diversity of stakeholders and tools across the programmable economy, from design and testing to gateways, security and orchestration. Developer ecosystems such as Cisco DevNet bring this to ground level by offering documentation, labs, sample code and sandboxes across networking, security and collaboration products, encouraging infrastructure as code with tools like Terraform and Ansible.
Version control and collaboration sit at the centre of modern delivery, and GitHub's documentation, spanning everything from Codespaces to REST and GraphQL APIs, reflects that centrality. The breadth of what is available through a single platform, from repository management to CI/CD workflows and AI-assisted coding, illustrates how much of the delivery stack can now be coordinated in one place.
Security: An End-to-End Discipline
Security threads through every layer and is increasingly treated as an end-to-end discipline rather than a late-stage gate. The Open-Source Security Foundation (OpenSSF) coordinates community efforts to secure open-source software for the public good, spanning working groups on AI and machine learning security, supply chain integrity and vulnerability disclosure, and offering guides, courses and annual reviews.
On the cloud side, a Cloud-Native Application Protection Platform (CNAPP) consolidates capabilities to protect applications across multi-cloud estates. Core components typically include Cloud Infrastructure Entitlement Management (to rein in excessive permissions), Kubernetes Security Posture Management (to maintain container orchestration best practices and flag misconfigurations), Data Security Posture Management (to classify and monitor sensitive data) and Cloud Detection and Response (to automate threat response and connect to security orchestration platforms).
Increasingly, AI-driven Security Posture Management sits across these layers to spot anomalies and predict risks from historical patterns, though this brings its own challenges around false positives and model bias that require careful adoption planning. Vendors such as Check Point offer CNAPP products including CloudGuard with unified management and compliance automation. While such examples illustrate what is available commercially, it is the architecture and functions described above that define the category itself.
Site Reliability Engineering
Reliability is not left to chance in well-run organisations. Site Reliability Engineering (SRE), pioneered and documented by Google, treats operations as a software problem and asks SRE's to protect, provide for and progress the systems that underpin user-facing services. The remit ranges from disk I/O considerations to continental-scale capacity planning, with a constant focus on availability, latency, performance and efficiency.
Error budgets, automation, toil reduction and blameless post-mortems become part of the vocabulary for teams that want to move fast without eroding trust. The approach complements Continuous Delivery by turning operational quality into something measurable and improvable, rather than a set of aspirations.
Code Quality, Testing and Documentation
For all the automation and platform power now available, the basics of code quality and testing still count. The Twelve-Factor App methodology remains relevant in encouraging declarative automation, clean contracts with the operating system, strict separation of build and run stages, stateless processes, externalised configuration, dev-prod parity and treating logs as event streams rather than files to be managed. It was first presented by developers at Heroku and continues to inform how teams design applications for cloud environments.
Documentation practices have also evolved, from literate programming's argument that source should be written as human-readable text with code woven through, to modern API documentation standards that keep codebases easier to change and onboard. General-purpose resources such as the long-running Software QA and Testing FAQ remind teams that verification and validation are distinct activities, that a spectrum of testing types is available and that common delivery problems have known countermeasures when documentation, estimation and test design are taken seriously.
AI in Software Delivery
No survey of modern software delivery can sidestep artificial intelligence. Adoption is now near-universal: the 2025 DORA State of AI-Assisted Software Development report, drawing on responses from almost 5,000 technology professionals worldwide, found that around 90% of developers now use AI as part of their daily work, with the median respondent spending roughly two hours per day interacting with AI tools. More than 80% report feeling more productive as a result. The picture is not straightforward, however. The same research found that AI adoption correlates with higher delivery instability, more change failures and longer cycle times for resolving issues because the acceleration AI brings upstream tends to expose bottlenecks in testing, code review and quality assurance that were previously hidden.
The report's central conclusion is that AI functions as an amplifier rather than a remedy. Strong teams with solid engineering foundations use it to accelerate further, while teams carrying technical debt or process dysfunction find those problems magnified rather than resolved. This means the strategic question is not simply which AI tools to adopt, but whether the underlying platform, workflow and culture are ready to benefit from them. The DORA AI Capabilities Model, published as a companion guide, identifies seven foundational practices that consistently improve AI outcomes, including a clear organisational stance on AI use, healthy data ecosystems, working in small batches and a user-centric focus. Teams without that last ingredient, the report warns, can actually see performance worsen after adopting AI.
At the tooling level, the landscape has moved quickly. Coding assistants such as GitHub Copilot have gone from novelty to standard practice in many engineering organisations, with newer entrants including Cursor, Windsurf and agentic tools like Claude Code pushing the category further. The shift from "copilot" to "agent" is significant: where earlier tools suggested completions as a developer typed, agentic systems accept a goal and execute a multistep plan to reach it, handling scaffolding, test generation, documentation and deployment checks with far less human intervention. That brings real efficiency gains and also new governance questions around traceability, code provenance and the trust that teams place in AI-generated output. Around 30% of DORA respondents reported little or no trust in code produced by AI, a figure that points to where the next wave of tooling and practice will need to focus.
Putting It Together
Translating all of this into practice looks different in every organisation, yet certain patterns recur. Teams choose a work management tool that matches the shape of their portfolio and the degree of Agile structure they need, whether that is Asana's lighter-weight task management with strong reporting or Jira's DevOps-aligned issue and sprint workflows with deep extensibility, then align on a Scrum-like cadence if iteration and feedback are priorities, or adopt hybrid approaches that sustain visibility while staying compatible with regulatory or vendor constraints.
Build, test and release are automated early so that pipelines, not people, become the route to production, and cloud native platforms keep environments reproducible and scalable across teams and geographies. Instrumentation ensures that security posture, reliability and cost are visible and managed continuously rather than episodically, and deliberate investment in engineering foundations, small batches, fast feedback and strong platform quality, creates the conditions that the evidence now shows are prerequisites for AI to deliver on its promise rather than amplify existing dysfunction.
If anything remains uncertain, it is often the sequencing rather than the destination. Few organisations can refit planning tools, delivery pipelines, platform architecture and security models all at once, and there is no definitive order that works everywhere. Starting where friction is highest and then iterating tends to be more durable than a one-shot transformation, and most of the resources cited here assume that change will be continuous rather than staged. Agile communities, cloud native foundations and security collaboratives exist because no single team has all the answers, and that may be the most practical lesson of all.