11:17, 12th May 2021
How to download and convert CSV files for use in SAS
Chris Hemedinger of SAS demonstrates how to download a CSV file from GitHub and prepare it for use in SAS, using a four-step process. First, PROC HTTP is used to fetch the raw file from GitHub, as it is considered more robust and efficient than alternative methods. Next, PROC IMPORT brings the data into SAS with the VALIDVARNAME=ANY option enabled, allowing the original column names containing spaces or special characters to be retained temporarily. The third step uses PROC SQL with a SELECT INTO clause to dynamically generate RENAME and LABEL statements, converting the original column names into valid SAS variable names by stripping spaces and non-alphanumeric characters, while preserving the originals as descriptive labels. Finally, PROC DATASETS applies these generated statements to update the variable names and labels without rewriting the entire dataset, after which the VALIDVARNAME option is reset to its standard setting. The approach is particularly useful when working with externally sourced data whose column naming conventions do not conform to SAS programming rules.
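The renaming logic in the third step can be illustrated outside SAS. The following is a minimal Python sketch, not the SAS code from the original post: it strips non-alphanumeric characters from each column name and builds the text of RENAME and LABEL statements. The function names, the sample column names and the underscore prefix for names starting with a digit are assumptions made for illustration.

```python
import re

def sas_name(original: str, max_len: int = 32) -> str:
    """Derive a SAS-style variable name by dropping disallowed characters."""
    # Keep only letters, digits and underscores.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", original)
    # SAS names cannot start with a digit; the underscore prefix is an
    # assumption of this sketch, not something the original post specifies.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    # SAS variable names are limited to 32 characters.
    return cleaned[:max_len]

def rename_and_label(columns):
    """Build RENAME and LABEL statement text for a list of column names."""
    renames = []
    labels = []
    for col in columns:
        new = sas_name(col)
        # 'Original Name'n is SAS name-literal syntax for non-standard names.
        renames.append(f"'{col}'n={new}")
        # The original column name is preserved as the descriptive label.
        labels.append(f"{new}='{col}'")
    return ("RENAME " + " ".join(renames) + ";",
            "LABEL " + " ".join(labels) + ";")

# Hypothetical column names, as they might appear in an external CSV file.
columns = ["Total Cases", "Rate (%)", "2020 Count"]
rename_stmt, label_stmt = rename_and_label(columns)
print(rename_stmt)
print(label_stmt)
```

This sketch does not handle two columns that clean to the same name, nor column names that themselves contain quotation marks; a real workflow would need to check for both.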
11:16, 12th May 2021
R to SAS
Robert Allison, a data visualisation specialist at SAS, has published a series of blog posts exploring the conversion of graphs created in R into their SAS equivalents. The series covers a range of chart and map types, including pie charts, bar charts and maps generated from shapefiles, with each post taking a detailed look at how customised R visualisations can be reproduced using SAS tools.
11:15, 12th May 2021
Here is an exploration of how graphs produced using the popular R package ggplot2 compare with those created using SAS tools, specifically the SGPLOT procedure and Graph Template Language. Using built-in datasets from both systems, the author recreated scatter plots, box plots and histograms in SAS to visually match their R counterparts, noting that while default styling differs between the two, both approaches follow a similar layered philosophy for building graphics.
Simple graphs are straightforward in either system, and more complex ones are achievable, though ggplot2 favours brevity in its syntax while SAS opts for a more structured and verbose approach. The author also observed that grouped histograms, supported natively in ggplot2, were not directly available in SAS at the time of writing, requiring a workaround using overlaid histograms from reshaped data.
11:07, 12th May 2021
Quick Intro to Parallel Computing in R
Parallel computing in R offers a powerful way to speed up data-heavy computational tasks by distributing workloads across multiple processor cores. Modern computers contain multiple cores, and while R has historically been single-threaded, packages such as parallel and foreach allow programmers to take advantage of this additional processing power. Sequential loops and standard apply functions process tasks one at a time, but functions like mclapply from the parallel package and the %dopar% operator from the foreach package enable the same tasks to run simultaneously across multiple cores, which can significantly reduce overall computation time.
This approach is particularly valuable when processing large datasets, such as those involving remote sensing or complex environmental modelling, where hundreds of thousands of files may need to be handled. However, parallelisation is not always the most efficient solution, as there is inherent overhead in copying data and spawning new processes, meaning that for shorter or less intensive tasks, the setup costs can outweigh the performance gains. The proportion of a task that can actually be parallelised also affects efficiency, a concept related to Amdahl's Law, which suggests that speedup diminishes as the non-parallelisable portion of a task grows.
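Amdahl's Law can be made concrete with a short calculation. The sketch below, written in Python rather than R, computes the theoretical speedup 1 / ((1 - p) + p / n) for a task whose parallelisable fraction is p when run on n workers. The function name and the sample fractions are illustrative assumptions, and the formula ignores the data-copying and process-spawning overhead mentioned above, so real speedups will be lower.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup under Amdahl's Law, ignoring all overhead."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # The serial part always runs on a single worker.
    serial_fraction = 1.0 - parallel_fraction
    # The parallel part is divided evenly across the workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# Illustrative fractions: the larger the serial share, the sooner
# adding workers stops helping.
for p in (0.5, 0.9, 0.99):
    speedups = [round(amdahl_speedup(p, n), 2) for n in (2, 4, 8, 16)]
    print(f"parallel fraction {p}: {speedups}")
```

As the number of workers grows, the speedup approaches 1 / (1 - p), so a task that is half serial can never run more than twice as fast however many cores are available.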
11:07, 12th May 2021
Comparing Dataframes In R Using compareDF
The compareDF package for R, developed to address gaps in existing data comparison tools, provides a straightforward way to identify and summarise differences between two dataframes that share the same structure. Its core function, compare_df, accepts two dataframes alongside one or more grouping variables and produces several outputs, including a comparison table that highlights rows where at least one value has changed, a colour-coded HTML output where changed cells are marked in red for older values and green for newer ones, and summary objects that quantify the number of changes, additions and removals per group.
The package supports grouping by multiple columns, allowing distinctions to be made between records that share a name but belong to different categories. Additional parameters enable users to exclude specific columns from comparison, to limit the number of rows rendered in the HTML output so as to avoid performance issues with large datasets, and to set a numeric tolerance threshold so that minor variations below a defined percentage are not flagged as meaningful changes. Rows that are identical across both dataframes are omitted from the output entirely, keeping results focused on genuine differences.
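The idea of a percentage tolerance can be sketched in a few lines. The Python example below is not compareDF's implementation: it assumes the change is measured relative to the older value, and the function name, keys and figures are hypothetical. The package's exact formula may differ.

```python
def exceeds_tolerance(old: float, new: float, tolerance: float = 0.0) -> bool:
    """Return True when new differs from old by more than the given fraction."""
    if old == new:
        return False
    if old == 0:
        # A relative difference cannot be computed against zero, so any
        # change is treated as meaningful (an assumption of this sketch).
        return True
    # Assumed definition: relative change measured against the older value.
    return abs(new - old) / abs(old) > tolerance

# Hypothetical values keyed by a grouping variable.
old = {"A": 100.0, "B": 50.0, "C": 8.0}
new = {"A": 104.0, "B": 60.0, "C": 8.0}

# With a 5% tolerance, only changes larger than 5% are flagged.
flagged = sorted(key for key in old if exceeds_tolerance(old[key], new[key], 0.05))
print(flagged)
```

Here the 4% change in A falls within the tolerance, the unchanged C is skipped entirely, and only B is reported.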
11:05, 12th May 2021
Sharing Your Work with xaringan
A four-hour online workshop held across two days in November 2020 introduced R users to the xaringan package as a tool for building and sharing presentation slides via HTML. The first session covered the fundamentals of creating slides and deploying them in a shareable format, while the second explored advanced customisation using CSS and the xaringanExtra package. Designed for those already familiar with R Markdown and GitHub, the workshop was delivered through RStudio Cloud and Zoom, with local installation of R, RStudio and several packages available as a backup. The workshop was created by Dr Silvia Canelón of the University of Pennsylvania for the NHS-R Community 2020 Virtual Conference, drawing on prior work by Alison Hill and Greg Wilson.
11:03, 12th May 2021
ggplot2 - Easy Way to Mix Multiple Graphs on The Same Page
Combining multiple ggplot2 graphs onto a single page or across multiple pages requires specialist approaches, as standard R functions such as par() and layout() are incompatible with ggplot2. Several R packages offer solutions to this challenge, including gridExtra, cowplot and ggpubr.
The gridExtra package provides functions for arranging plots in a grid format, though it does not align plot panels or axes. The cowplot package addresses axis alignment through its plot_grid() function but lacks support for multipage layouts. The ggpubr package bridges this gap with its ggarrange() function, which wraps cowplot functionality while adding support for multipage arrangements and shared legends.
Beyond basic arrangement, these tools collectively support a range of more advanced layout techniques, including nested arrangements, custom column and row spanning, annotated figures, scatter plots with marginal density plots and the embedding of tables, paragraphs or additional graphical elements within a plot. Background images can also be incorporated into ggplot2 graphics using the background_image() function from ggpubr. Completed arrangements can be exported to file formats such as PDF, EPS or PNG using the ggexport() function, with options to control how many plots appear on each page.
11:02, 12th May 2021
6 Life-Altering RStudio Keyboard Shortcuts
Written by Matt Dancho of Business Science, this article highlights six keyboard shortcuts in the RStudio integrated development environment that are designed to boost productivity for R programmers. The shortcuts covered include commenting and uncommenting code, inserting the pipe operator, adding the assignment operator, selecting multiple lines with a multi-cursor tool, searching for specific terms across files and accessing a full keyboard shortcut reference sheet. Each shortcut is presented as a practical time-saver for common coding tasks, particularly those involving data wrangling and function building.
11:01, 12th May 2021
Rtools
Rtools is a collection of compilers, build utilities and Unix-like tools designed to enable the compilation of R packages on Windows, which lacks these components by default. It includes the GCC toolchain for C, C++ and Fortran, along with utilities such as make, tar and bash that replicate the Linux build environment commonly assumed by R packages. Rtools is essential for installing packages containing compiled code from source, installing development versions from repositories like GitHub, or creating custom packages with C/C++ extensions. Each R version requires a corresponding Rtools version to ensure compatibility, and proper installation involves adding it to the system PATH to allow seamless compilation during package installation. On Windows, Rtools serves as the equivalent of the standard build tools found on Linux and macOS, enabling the automatic construction of packages with compiled components without requiring users to manually manage the underlying infrastructure.
10:53, 12th May 2021
SAS Free Software Trials
SAS offers free trials and resources tailored to different user groups, including organisations seeking to test its data and AI platform, students and educators accessing software and training materials, and professionals exploring learning subscriptions or career-specific courses. These initiatives aim to support skill development, academic instruction and organisational innovation through access to analytics tools, visualisation capabilities and industry-focused solutions. Additional options are available for purchasing licences and tailored industry applications, covering areas such as data management, decisioning and advanced analytics.