10:43, 4th October 2020
Advanced ODS Graphics: A deeper dive into item stores
Modifying an ODS template in SAS involves exporting its source code to a file, editing it to suit specific needs and recompiling it. This process allows users to customise graphical outputs without altering the original templates stored in SAS libraries. For instance, adjusting a Kaplan-Meier plot to display percentages instead of proportions requires deleting any previously customised copy of the template, exporting the template's source code and modifying elements such as axis labels, tick values and expressions within it. Temporary files are often used to store edited templates, ensuring that changes are isolated and do not interfere with the original system templates. This approach enables precise control over visual elements while leaving the underlying data and procedures untouched. The key to successful modification lies in thoroughly reviewing the original template code before making changes, so that edits respect the syntax and structure of the template language.
13:44, 13th July 2020
Code Maven: Groovy
A collection of Groovy programming resources has been migrated from Code Maven, covering a broad range of topics for those learning or working with the language. The material spans foundational concepts such as variables, functions, scope and classes, as well as more practical subjects including file handling, regular expressions, JSON processing, date and time management, exception handling and working with maps and lists. Supplementary slide-based content is also available, addressing reasons to use Groovy, available resources, installation and core language features such as input and output operations.
10:25, 29th June 2020
Groovy Goodness: Removing Elements From a Collection
Groovy extends Java's collection classes with several methods for removing elements. The removeAll method accepts a closure defining a condition, and any elements meeting that condition are removed directly from the collection. The removeIf method, introduced in Java 8, works similarly using a predicate that can be expressed as a closure in Groovy. To remove multiple elements at once, removeAll can also accept an array of objects. The removeElement method was added to resolve ambiguity with the standard remove method, which accepts either an object or an integer index value, causing confusion when working with collections of integers. For lists, the removeAt method allows removal of an element at a specified index position. As an alternative to these removal methods, the retainAll method takes the opposite approach, keeping only the elements that satisfy a given condition or match a provided array of objects, and removing everything else from the collection.
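The Java methods that Groovy builds on can be sketched as follows. This is plain Java rather than Groovy, so the closure-based removeAll and retainAll variants and the Groovy-only removeElement and removeAt methods are not shown; the class and method names are hypothetical and exist only for illustration. The sketch shows removeIf with a predicate, the index-versus-value ambiguity of remove on a list of integers, and retainAll with a collection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; each method works on a copy so the input is untouched.
class RemovalDemo {

    // removeIf (Java 8+) drops every element that matches the predicate.
    public static List<Integer> withoutEvens(List<Integer> source) {
        List<Integer> copy = new ArrayList<>(source);
        copy.removeIf(n -> n % 2 == 0);
        return copy;
    }

    // With an int argument, remove(int) is chosen: the argument is an index.
    public static List<Integer> removeByIndex(List<Integer> source, int index) {
        List<Integer> copy = new ArrayList<>(source);
        copy.remove(index);
        return copy;
    }

    // With an Integer argument, remove(Object) is chosen: the argument is a value.
    public static List<Integer> removeByValue(List<Integer> source, int value) {
        List<Integer> copy = new ArrayList<>(source);
        copy.remove(Integer.valueOf(value));
        return copy;
    }

    // retainAll keeps only the elements that also appear in the given collection.
    public static List<Integer> keepOnly(List<Integer> source, List<Integer> wanted) {
        List<Integer> copy = new ArrayList<>(source);
        copy.retainAll(wanted);
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(3, 2, 1, 0);
        System.out.println(withoutEvens(numbers));                 // [3, 1]
        System.out.println(removeByIndex(numbers, 1));             // [3, 1, 0]
        System.out.println(removeByValue(numbers, 1));             // [3, 2, 0]
        System.out.println(keepOnly(numbers, Arrays.asList(0, 3))); // [3, 0]
    }
}
```

The second and third calls pass the same number but remove different elements, which is the confusion that Groovy's removeElement and removeAt are described as resolving.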
12:32, 25th February 2020
Converting Numeric Data to Categories in R
The process of categorising data involves several approaches, each suited to different analytical needs. One method divides data into groups based on quantiles, such as Lower_third, Middle_third and Upper_third, which are determined by percentile thresholds. This technique places a roughly equal number of observations in each category, though the resulting groups may not reflect underlying patterns in the data.
Another approach uses clustering algorithms, such as partitioning around medoids (PAM), to group observations based on similarity across multiple variables. This method identifies natural groupings within the data, which can reveal hidden structures not apparent through quantile-based methods. For instance, in an example involving two variables, Happy and Tired, clustering revealed four distinct groups, each with unique characteristics. The results of these methods are typically summarised in tables, showing group assignments and associated observations. Visualisation tools like ggplot2 are then used to represent these groupings graphically, providing an intuitive understanding of the data's structure. Each method has its strengths, and the choice depends on the specific goals of the analysis and the nature of the data being examined.
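The quantile-based split described above can be sketched in a few lines. This is a Java illustration, not the R code from the linked article: the class and method names are hypothetical, the input is assumed to be non-empty, and the cut points use a simple nearest-rank rule, whereas R's quantile() interpolates by default and may therefore place boundary values differently.

```java
import java.util.Arrays;

// Hypothetical helper that labels each value as lower, middle or upper third.
class TertileDemo {

    public static String[] toThirds(double[] values) {
        // Sort a copy so the caller's array keeps its original order.
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;

        // Nearest-rank cut points at the 1/3 and 2/3 positions (an assumption).
        double lowerCut = sorted[(int) Math.ceil(n / 3.0) - 1];
        double upperCut = sorted[(int) Math.ceil(2 * n / 3.0) - 1];

        // Assign a label to each value in its original position.
        String[] labels = new String[n];
        for (int i = 0; i < n; i++) {
            if (values[i] <= lowerCut) {
                labels[i] = "Lower_third";
            } else if (values[i] <= upperCut) {
                labels[i] = "Middle_third";
            } else {
                labels[i] = "Upper_third";
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[] scores = {5, 1, 4, 2, 6, 3};
        // [Upper_third, Lower_third, Middle_third, Lower_third, Upper_third, Middle_third]
        System.out.println(Arrays.toString(toThirds(scores)));
    }
}
```

With six distinct values, each category receives exactly two observations; ties at a cut point would make the groups uneven.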
12:31, 25th February 2020
The expss package for R enables users to compute and display cross-tabulation tables with support for labelled variables, multiple and nested banners, weights, multiple-response variables and significance testing, with output rendered in HTML for use in knitr, R notebooks and Jupyter notebooks.
Drawing on functions familiar to users of SPSS and Excel, such as RECODE, COUNT and VLOOKUP, the package is designed to ease the transition of data processing workflows into R. Table construction follows a pipeline approach using the magrittr pipe operator, chaining functions that specify variables, calculate statistics and finalise output, with optional steps for sorting, transposing and dropping empty rows or columns.
The package supports a wide range of statistical outputs including column, row and table percentages, means, standard deviations and custom summary functions, as well as significance testing for both means and proportions. A practical demonstration using a product testing survey illustrates how multiple-response variables can be recoded, labelled and analysed across demographic and preference subgroups, with results exportable as CSV files accompanied by either R labelling code or SPSS syntax.
11:26, 25th February 2020
How to add a column to a dataframe in R
Adding a new column to a dataframe in R can be achieved through two main approaches. The preferred method uses the mutate() function from the dplyr package, which forms part of the broader Tidyverse collection of R packages. The function takes the dataframe as its first argument, followed by a name-value pair that defines the new variable and how its values should be calculated. The alternative approach uses base R, employing the dollar sign operator to reference and create a new column by assigning a vector of values to it.
While both methods work, the Tidyverse approach is generally considered superior because its functions are intuitively named, easy to learn and straightforward to debug. The Tidyverse also includes other useful packages such as ggplot2 for visualisation, tidyr for reshaping data and stringr for handling string data, making it a comprehensive toolkit for data science in R. One important practical consideration when using mutate() is that it does not modify the original dataframe directly but instead produces a new one, meaning the output must be explicitly assigned to a variable name in order to retain the changes.
11:16, 25th February 2020
4 data wrangling tasks in R for advanced beginners
The guide covers four core data manipulation tasks: adding columns to existing data frames, generating summaries by data subgroups, sorting results and reshaping data between wide and long formats. Using a sample dataset of revenue and profit figures for Apple, Google and Microsoft from 2010 to 2012, it walks through multiple approaches for each task, ranging from base R syntax such as apply() and transform() to the more readable and efficient functions offered by the tidyverse ecosystem, particularly dplyr and tidyr. For adding columns, five distinct methods are demonstrated, with the dplyr mutate() function highlighted as the most elegant option. Grouping and summarising data by category is handled through dplyr's group_by() and summarise() functions, while sorting is made considerably more readable using dplyr's arrange() function compared to base R's order() approach. The final section addresses the conceptually challenging but practically important task of reshaping data, explaining the distinction between wide and long formats and demonstrating how tidyr's newer pivot_longer() and pivot_wider() functions can be used to switch between the two, which is particularly useful when preparing data for visualisation tools such as ggplot2.
22:20, 24th February 2020
How to Import Data: Reading SAS Files in R
Reading SAS files in R is achievable through two dedicated packages, haven and sas7bdat, as well as through the graphical interface of RStudio. The haven package, which forms part of the Tidyverse collection and also supports SPSS and Stata formats, provides the read_sas and write_sas functions for loading and saving sas7bdat files respectively. The sas7bdat package serves the sole purpose of reading SAS files into R dataframes using its read.sas7bdat function. Both packages can be installed via the install.packages() function in R or through Conda.
RStudio users can alternatively import SAS files through the Environment tab's Import Dataset menu, which uses haven in the background to handle the process. Once data has been loaded into a dataframe, it can be exported as a new SAS file using write_sas or converted to CSV format using R's built-in write.csv function.
13:02, 17th February 2020
SAS LIBNAME Statement: JSON Engine
The JSON LIBNAME statement in SAS provides read-only sequential access to JavaScript Object Notation data, reading the file a single time upon assignment and requiring reassignment to read it again. It uses a JSON map file to describe the structure of the data sets within the JSON document, which can either be generated automatically by the JSON engine or supplied manually by the user via the MAP= option.
The automapper produces a separate data set for each object found in the JSON, along with a consolidated ALLDATA data set containing all information in a single structure. Users can inspect generated data sets using procedures such as PROC DATASETS, PROC CONTENTS and PROC PRINT.
Several options govern the engine's behaviour: AUTOMAP, which controls whether a map file is created or reused; RETAIN, which preserves observation buffer values between observations; ORDINALCOUNT, which determines how many ordinal variables are generated; and MEMLEAVE, which manages memory allocation during processing. Map files can be edited manually to rename data sets, restructure variables, specify formats, informats and labels, and control retention at the variable level.
When JSON data spans multiple related data sets, ordinal variables serve as key fields that allow those data sets to be merged into a single consolidated output. The ALLDATA data set can also be manipulated directly using data step logic to extract and reshape information as needed, and the JSONPP DATA step function can be used to produce a human-readable formatted copy of any JSON input.
17:38, 25th November 2019
Elvis SAS Log Analyser
Elvis is a Windows-based application designed to enhance the analysis of SAS logs, offering tools such as the Log Explorer for summarising key events, the Folder View for monitoring log statuses and the Call Tree for visualising macro calls. It supports logs from SAS and the World Programming System, with features including colour-coded line types, quick navigation to errors and the ability to define custom line markers. The software transitioned to a subscription model in version 3.0, with a free trial allowing access to the first 800 lines of each log. Telemetry is enabled by default to collect anonymised usage data for development purposes, though users can disable it through preferences. The application is compatible with Windows 10, 8.1 and 8 and does not require additional runtime dependencies. User feedback highlights its utility for log analysis but notes its Windows-only limitation, with some users questioning its compatibility with z/OS environments.