Coding Notebook

10:52, 12th May 2021

Open Integration with SAS

SAS Viya facilitates collaboration between SAS users and open-source developers by integrating open-source technology throughout the analytics process, enabling models to be deployed across various environments such as cloud platforms and private Docker containers, and through APIs. This approach supports governance and orchestration of analytics workflows, allowing models developed in multiple languages to be managed centrally and scaled effectively.

By leveraging REST and language-specific APIs, organisations can embed analytics into business applications, enhancing decision-making and operational efficiency. Examples from various industries demonstrate how this integration helps organisations streamline complex analytics ecosystems, foster innovation and make data-driven decisions at different levels of an organisation, with support for multiple programming languages and deployment options.

10:50, 12th May 2021

SAS Developer Home

The SAS Developer Portal is a resource for building applications that integrate SAS artificial intelligence and analytics capabilities with open-source technologies, offering a range of REST APIs spanning areas such as machine learning, data management, fraud detection, generative AI, health and life sciences, IoT analytics and visualisation. Developers can explore use cases drawn from various SAS media outlets, access libraries for integration with Python, R, Lua and Java, and learn how to run SAS code on the SAS Viya platform using CAS actions or the SAS extension for Visual Studio Code.

Software development kits are also available for embedding insights and content from SAS Viya into custom dashboards and applications, while featured solutions include Customer Intelligence 360, SAS Viya Workbench and a Trustworthy AI platform centred on transparency and accountability. A developer community and GitHub repositories provide further opportunities for collaboration, knowledge sharing and access to code examples and tools, with an annual conference bringing together leaders, users and partners to advance their knowledge of the platform.

10:48, 12th May 2021

How to Check if a File or a Directory exists in R, Python and Bash

When building data workflows and machine learning pipelines, it is often necessary to verify whether specific files or directories exist before proceeding. In R, this can be done using the base functions file.exists() and dir.exists(), with new files and directories created using file.create() and dir.create() respectively. Python offers two main approaches through the os and pathlib modules. With os, the functions os.path.isfile(), os.path.isdir() and os.path.exists() handle these checks, while new files and directories can be created using open() and os.makedirs(). In Bash, the same checks are performed using flags within conditional statements, with -f used for files and -d for directories, and new files and directories are created with the touch and mkdir commands respectively. A range of additional flags is available for more specific checks, such as verifying file permissions, ownership and type.
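As a minimal sketch of the Python side, the following uses only the standard library calls named above. The helper names (describe_path, ensure_dir, ensure_file) and the directory layout are hypothetical and chosen purely for illustration; everything runs inside a temporary directory.

```python
import os
import tempfile
from pathlib import Path


def describe_path(path):
    """Return 'file', 'directory' or 'missing' for the given path."""
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        return "directory"
    return "missing"


def ensure_dir(path):
    """Create a directory (and any parents) if it does not already exist."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


def ensure_file(path):
    """Create an empty file if it is absent, leaving existing content alone."""
    # Mode "a" creates the file when missing without truncating it when present.
    with open(path, "a"):
        pass
    return os.path.isfile(path)


# Demonstration in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data", "raw")      # hypothetical layout
    data_file = os.path.join(data_dir, "input.csv")  # hypothetical file name

    before = describe_path(data_file)
    ensure_dir(data_dir)
    ensure_file(data_file)
    after = describe_path(data_file)

    # pathlib exposes the same checks as methods on a Path object.
    same_answer = Path(data_file).is_file() and Path(data_dir).is_dir()
```

Passing exist_ok=True to os.makedirs() means the call does not raise if the directory is already there, which removes the need for a separate existence check before creating it.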

10:46, 12th May 2021

Deleting a substring from a SAS string

Leonid Batkhan's SAS Users blog post explains how to delete substrings from SAS character variables and macro variables, framing this as the reverse of a previously covered substring insertion technique. For removing all instances of an unwanted substring from a character variable, the post demonstrates use of the TRANSTRN function paired with TRIMN, which together allow a zero-length replacement that effectively erases the target string. When working with macro variables, two approaches are presented, one using a data step with TRANSTRN and CALL SYMPUTX, and another using %SYSFUNC within the macro language.

For cases where only a specific occurrence of a substring needs to be removed rather than all of them, the post outlines two further solutions, one using SUBSTR and CATX to cut and rejoin the string around the unwanted portion, and another using the KUPDATE function for a more concise result. The FIND function is used to locate the precise position of the target substring, with its direction argument allowing searches from right to left. The post also notes that FINDNTH can be used when the goal is to remove a particular numbered instance of a repeated substring.
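The distinction between removing every occurrence and removing one particular occurrence can be sketched in Python. This is an analogue for illustration only, not the SAS code from the post: the function names are hypothetical, and str.replace, str.rfind and str.find stand in loosely for the roles played by TRANSTRN and FIND.

```python
def remove_all(text, target):
    """Remove every occurrence of target from text."""
    return text.replace(target, "")


def remove_last(text, target):
    """Remove only the last occurrence of target, searching from the right."""
    pos = text.rfind(target)
    if pos == -1 or not target:
        return text
    return text[:pos] + text[pos + len(target):]


def remove_nth(text, target, n):
    """Remove the n-th (1-based) occurrence of target, if there is one."""
    if n < 1 or not target:
        return text
    pos = -1
    for _ in range(n):
        # Continue the search just past the previous match.
        pos = text.find(target, pos + 1)
        if pos == -1:
            return text  # fewer than n occurrences, so nothing to remove
    return text[:pos] + text[pos + len(target):]


sample = "red-green-blue-grey"
no_dashes = remove_all(sample, "-")
last_gone = remove_last(sample, "-")
second_gone = remove_nth(sample, "-", 2)
```

Each helper returns the input unchanged when the target is not found, so they can be applied without checking for the substring first.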

10:45, 12th May 2021

10 Tips And Tricks For Data Scientists Vol.6

This sixth instalment in a series of data science tips covers a range of practical techniques in both Python and R. In Python, the tips include finding the mode of a list using the max function with a lambda key, disabling warnings via the warnings module, pasting copied data directly into a Pandas DataFrame using read_clipboard, saving DataFrames as image files with the dataframe-image library and tracking the progress of applied functions using the tqdm library.
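Two of the Python tips need nothing beyond the standard library and can be sketched as follows. This is an illustrative reconstruction rather than the article's own code; the noisy function is hypothetical and exists only to trigger a warning.

```python
import warnings


def mode(values):
    """Return the most frequent item in a non-empty list."""
    # max() scans the distinct items and keeps the one with the highest count.
    return max(set(values), key=lambda item: values.count(item))


def noisy():
    """Hypothetical function that emits a warning before returning."""
    warnings.warn("this call is deprecated", DeprecationWarning)
    return 42


most_common = mode([3, 1, 3, 2, 3, 1])

# Silence warnings only inside this block rather than for the whole session.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()
```

When two items are tied for the highest count, which one mode() returns depends on set iteration order, so it should not be relied on for tie-breaking.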

On the R side, the article advises using the matrixStats package over the apply function for large datasets and recommends the data.table package for reading and writing CSV files, owing to its significantly faster performance compared with base R and the readr package. It also introduces the waldo package for comparing R objects and identifying differences in a clear, colour-coded format, demonstrates how to dynamically check for and install required packages before loading them, and shows how to convert all character variables to factors in a single line of code using sapply and lapply.

10:42, 12th May 2021

Timeseries analysis in R

Time series analysis in R covers several key areas, including decomposition, forecasting, clustering and classification. Using the AirPassengers dataset, which contains 144 monthly observations spanning 1949 to 1960, the data can be log-transformed to address non-stationarity before being decomposed into trend, seasonal and random components, with notable seasonal peaks in months seven and eight and a trough in month eleven.

For forecasting, an ARIMA model is fitted using automatic selection, with ACF and PACF plots used to assess residuals, and a Ljung-Box test indicating that any apparent autocorrelation is likely due to chance rather than a model deficiency. The resulting forecast achieves strong accuracy, with residuals closely approximating a normal distribution centred at zero.

For clustering, dynamic time warping is used to calculate distances between series exhibiting six distinct patterns (normal, cyclic, increasing trend, decreasing trend, upward shift and downward shift) before hierarchical clustering groups them accordingly. Finally, a decision tree classifier trained on the same patterned data achieves an overall accuracy above 95%, demonstrating that time series patterns can be reliably identified and categorised using R.
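To make the distance measure concrete, here is a minimal pure-Python sketch of dynamic time warping. It assumes absolute difference as the local cost and the classic three-way step (insert, delete, match), so its values may not match the defaults of the R package used in the article; the function name and sample series are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances while b stays
                cost[i][j - 1],      # b advances while a stays
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


increasing = [0, 1, 2, 3, 4, 5]
delayed = [0, 0, 0, 1, 2, 3, 4, 5]   # same shape, starting later
decreasing = [5, 4, 3, 2, 1, 0]

d_same_shape = dtw_distance(increasing, delayed)
d_opposite = dtw_distance(increasing, decreasing)
```

Because warping lets one series pause while the other advances, the delayed copy aligns with the original at no cost, whereas the opposite trend cannot be warped away. A matrix of such pairwise distances is what a hierarchical clustering step would take as input.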

10:42, 12th May 2021

Code performance in R: Which part of the code is slow?

Code performance in R: How to make code faster

Improving the performance of R code requires a combination of smart coding habits and the right diagnostic tools. Profiling tools such as the system.time function and the microbenchmark and profvis packages allow developers to identify which parts of their code are slowest, enabling targeted optimisation rather than guesswork. Once bottlenecks are identified, several techniques can meaningfully reduce computation time, including avoiding redundant operations within loops by moving static calculations outside them, pre-allocating vectors of the required length rather than appending values incrementally and collapsing strings in a single step rather than building them piece by piece.

Vectorisation offers particularly significant speed gains by leveraging internally compiled C-based loops, and functions such as rowSums, rowMeans, colSums and colMeans provide ready-made vectorised alternatives to manual iteration. Where vectorisation is not possible, translating performance-critical code into C++ using the Rcpp package can achieve comparable results. Saving intermediate results to file also reduces unnecessary recomputation across longer workflows.
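The loop-level habits described above (hoisting static calculations, pre-allocating the result and collapsing strings in one step) carry over to other languages. The sketch below is a Python analogue rather than R code from the articles, with timeit standing in for system.time or microbenchmark; the function names are hypothetical, and the size of any speed-up will differ from what R shows.

```python
import math
import timeit


def slow_labels(values, scale):
    """Recomputes a constant and grows the list and string step by step."""
    out = []
    text = ""
    for v in values:
        factor = math.sqrt(scale)      # static calculation repeated on every pass
        out.append(v * factor)         # result grown incrementally
        text = text + str(v) + ","     # string rebuilt on every pass
    return out, text.rstrip(",")


def fast_labels(values, scale):
    """Same result with the constant hoisted, the output pre-sized and one join."""
    factor = math.sqrt(scale)          # computed once, outside the loop
    out = [0.0] * len(values)          # pre-allocated to the required length
    for i, v in enumerate(values):
        out[i] = v * factor
    text = ",".join(str(v) for v in values)  # collapsed in a single step
    return out, text


values = list(range(1000))
results_match = slow_labels(values, 9.0) == fast_labels(values, 9.0)

# Time both versions over the same number of repetitions.
slow_t = timeit.timeit(lambda: slow_labels(values, 9.0), number=20)
fast_t = timeit.timeit(lambda: fast_labels(values, 9.0), number=20)
```

Checking that both versions return identical results before comparing timings guards against an optimisation that quietly changes behaviour.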

17:02, 11th May 2021

SAS 9.4 ODS Graphics: Procedures Guide

The SAS ODS Graphics Procedures Guide, now in its sixth edition, forms part of the broader SAS 9.4 and SAS Viya 3.5 programming documentation suite and covers the full range of procedures available for creating statistical graphics. It includes guidance on core procedures such as SGPLOT, SGPANEL, SGSCATTER, SGPIE and SGRENDER, alongside the SGDESIGN procedure, and addresses common concepts such as controlling graph appearance, managing output and working with attribute maps and annotation tools. The guide also provides reference material on syntax conventions, units of measurement, reserved keywords and comparisons with legacy SAS/GRAPH procedures, making it a comprehensive resource for those building and customising graphs within the ODS Graphics environment.

17:02, 11th May 2021

Making great graphs even better with ODS Graphics

SAS has introduced updated graphing capabilities through ODS Graphics that allow programmers to produce modern, visually refined graphs using less code than older methods required. Where the Data Step Graphics Interface was once used to draw bar charts with custom symbols, the SGPLOT procedure now handles this with a dedicated SYMBOLCHAR statement that references hexadecimal values for special characters. Attribute maps in SGPLOT also replace the need for conditional macro code when assigning consistent colours to specific data values across groups. The SGPANEL procedure simplifies the creation of side-by-side panel plots by combining what previously required both PROC GPLOT and the GREPLAY procedure into a single step. Finally, the SGMAP procedure, available from SAS 9.4M5 onward on 64-bit Windows and Linux systems, enables the placement of markers and labels on detailed maps using OpenStreetMap data, replacing the older combination of GMAP, GPROJECT and the Annotate facility. All of these features are documented in the SAS 9.4 ODS Graphics Procedures Guide.

17:01, 11th May 2021

How to fix common problems in output from SAS ODS Graphics procedures

Common problems with ODS Graphics output in SAS often stem from using incorrect options or syntax, and resolving them typically requires only small adjustments to the code. When colours do not appear as expected in scatter plots, the distinction between DATACOLORS and DATACONTRASTCOLORS in the STYLEATTRS statement is important, as the former applies to filled areas while the latter governs marker symbols and lines. Attribute maps, which link group variable values to specific colours, must use formatted values rather than raw ones if a format has been applied to the group variable. Marker symbol cycling can also behave unexpectedly depending on the style in use, and setting the ATTRPRIORITY option to NONE in an ODS GRAPHICS statement ensures that symbols and colours rotate independently. For annotation, those familiar with SAS/GRAPH may find that ODS Statistical Graphics requires different parameter names and function values, specifically X1 and Y1 for positioning rather than X and Y, and the TEXT function rather than LABEL for adding labels to a plot.
