Technology Tales

Adventures & experiences in contemporary technology

Compressing an entire dataset library using SAS

26th September 2019

Turning dataset compression for SAS datasets can produce quite a reduction in size so it often is standard practice to do just this. It can be set globally with a single command and many working systems do this for you:

options compress=yes;

It also can be done on a dataset by dataset option by adding (compress=yes) beside the dataset name on the data line of a data step. The size reduction can be gained retrospectively too but only if you create additional copies of the datasets. The following code uses the DATASETS procedure to accomplish this end at the time of copy the datasets from one library to another:

proc datasets lib=adam nolist;
copy inlib=adam outlib=adamc noclone datecopy memtype=data;
run;
quit;

The NOCLONE option on the COPY statement allows the compression status to be changed while the DATECOPY one keeps the same date/time stamp on the new file at the operating system level. Lastly, the MEMTYPE=DATA setting ensures that only datasets are processed while the NOLIST option on the PROC DATASETS line suppresses listing of dataset information.

It may be possible to do something like this with PROC COPY too but I know from use that the option presented here works as expected. It also does not involve much coding effort, which helps a lot.

Using NOT IN operator type functionality in SAS Macro

9th November 2018

For as long as I have been programming with SAS, there has been the ability to test if a variable does or does not have one value from a list of values in data step IF clauses or WHERE clauses in both data step and most if not all procedures. It was only within the last decade that its Macro language got similar functionality with one caveat that I recently uncovered: you cannot have a NOT IN construct. To get that, you need to go about things in a different way.

In the example below, you see the NOT operator being placed before the IN operator component that is enclosed in parentheses. If this is not done, SAS produces the error messages that caused me to look at SAS Usage Note 31322. Once I followed that approach, I was able to do what I wanted without resorting to older more long-winded coding practices.

options minoperator;

%macro inop(x);

%if not (&x in (a b c)) %then %do;
%put Value is not included;
%end;
%else %do;
%put Value is included;
%end;

%mend inop;

%inop(a);

Running the above code should produce a similar result to another featured on here in another post but the logic is reversed. There are times when such an approach is needed. One is where a small number of possibilities is to be excluded from a larger number of possibilities. Programming often involves more inventive thinking and this may be one of those.

WARNING: No bars were drawn. This could have been caused by ORDER= on the AXIS statement. You might wish to use the MIDPOINTS= option on the VBAR statement instead.

25th September 2015

What you see above is a an error issued by a SAS program like what a colleague at work recently found. The following code will reproduce this so let us walk through the steps to explain a possible cause for this.

The first stage is to create a test dataset containing variables y and x, for the vertical and midpoint axes, respectively, and populating these using a CARDS statement in a data step:

data a;
input y x;
cards;
1 5
3 9
;
run;

Now, we define an axis with tick marks for particular values that will be used as the definition for the midpoint or horizontal axis of the chart:

axis1 order=(1 3);

Then, we try creating the chart using the GCHART procedure that comes with SAS/GRAPH and this is what results in the error message being issued in the program log:

proc gchart data=a;
vbar x / freq=y maxis=axis1;
run;
quit;

The cause is that the midpoint axis tick marks are no included in the data so changing these to the actual values of the x variable removes the message and allows the creation of the required chart. Thus, the AXIS1 statement needs to become the following:

axis1 order=(5 9);

Another solution is to remove the MAXIS option from the VBAR statement and let GCHART be data driven. However, if requirements do not allow this, create a shell dataset with all expected values for the midpoint axis with y set 0 since that is used for presenting frequencies as per the FREQ option in the VBAR statement.

Changing the working directory in a SAS session

12th August 2014

It appears that PROC SGPLOT and other statistical graphics procedures create image files even if you are creating RTF or PDF files. By default, these are PNG files but there are other possibilities. When working with PC SAS , I have seen them written to the current working directory and that could clutter up your folder structure, especially if they are unwanted.

Being unable to track down a setting that controls this behaviour, I resolved to find a way around it by sending the files to the SAS work directory so they are removed when a SAS session is ended. One option is to set the session’s working directory to be the SAS work one and that can be done in SAS code without needing to use the user interface. As a result, you get some automation.

The method is implicit though in that you need to use an X statement to tell the operating system to change folder for you. Here is the line of code that I have used:

x "cd %sysfunc(pathname(work))";

The X statement passes commands to an operating system’s command line and they are enclosed in quotes. %sysfunc then is a macro command that allows certain data step functions or call routines as well as some SCL functions to be executed. An example of the latter is pathname and this resolves library or file references and it is interrogating the location of the SAS work library here so it can be passed to the operating systems cd (change directory) command for processing. This method works on Windows and UNIX so Linux should be covered too, offering a certain amount of automation since you don’t have to specify the location of the SAS work library in every session due to the folder name changing all the while.

Of course, if someone were to tell me of another way to declare the location of the generated PNG files that works with RTF and PDF ODS destinations, then I would be all ears. Even direct output without image file creation would be even better. Until then though, the above will do nicely.

Some SAS Macro code for detecting the presence or absence of a variable in a dataset

4th December 2013

Recently, I needed to put in place some code to detect the presence or absence of a variable in a dataset and I chose SAS Macro programming as the way to do what I wanted. The logic was based on a SAS sample that achieved the same result in a data step and some code that I had for detecting the presence or absence of a dataset. Mixing the two together gave me something like the following:

%macro testvar(ds=,var=);

%let dsid=%sysfunc(open(&ds,in));
%let varexist=%sysfunc(varnum(&dsid,&var));
%if &dsid > 0 %then %let rc=%sysfunc(close(&dsid));

%if &varexist gt 0 %then %put Info: Variable &var is in the &ds dataset;
%else %put Info: Variable &var is not in the &ds dataset;

%mend testvar;

%testvar(ds=dataset,var=var);

What this does is open up a dataset and look for the variable number in the dataset. In datasets, variables are numbered from left to right with 1 for the first one, 2 for the second and so on. If the variable is not in the dataset, the result is 0 so you know that it is not there. All of this is what the VARNUM SCL function within the SYSFUNC macro function does. In the example, this resolves to %sysfunc(varnum(&dsid,var)) with no quotes around the variable name like you would do in data step programming. Once you have the variable number or 0, then you can put in place some conditional logic that makes use of the information like what you see in the above simple example. Of course, that would be expanded to something more useful in real life but I hope it helps to show you the possibilities here.

Calculating geometric means using SAS

19th March 2012

Recently, I needed to calculate geometric means after a break of a number of years and ended racking a weary brain until it was brought back to me. In order that I am not in the same situation again, I am recording it here and sharing it always is good too.

The first step is to take the natural log (to base e or the approximated irrational value of the mathematical constant, 2.718281828) of the actual values in the data. Since you cannot have a log of zero, the solution is either to exclude those values or substitute a small value that will not affect the overall result as is done in the data step below. In SAS, the log function uses the number e as its base and you need to use the log10 equivalent when base 10 logs are needed.

data temp;
set temp;
if result=0 then _result=0.000001;
else _result=result;
ln_result=log(_result);
run;

Next, the mean of the log values is determined and you can use any method of doing that so long as it gives the correct values. PROC MEANS is used here but PROC SUMMARY (identical to MEANS except it defaults to data set creation while that generates output by default; because of that, we need to use the NOPRINT option here), PROC UNIVARIATE or even the MEAN function in PROC SQL.

proc means data=temp noprint;
output out=mean mean=mean;
var ln_result;
run;

With the mean of the log values obtained, we need to take the exponential of the obtained value(s) using the EXP function. This returns values of the same magnitude as in the original data using the formula emean.

data gmean;
set mean;
gmean=exp(mean);
run;

Dealing with variable length warnings in SAS 9.2

11th January 2012

A habit of mine is to put a LENGTH or ATTRIB statement between DATA and SET statements in a SAS data step to reset variable lengths. By default, it seems that this triggers truncation warnings in SAS 9.2 or SAS 9.3 when it didn’t in previous versions. SAS 9.1.3, for instance, allowed you have something like the following for shortening a variable length without issuing any messages at all:

data b;
length x $100;
set a;
run;

In this case, x could have a length of 200 previously and SAS 9.1.3 wouldn’t have complained. Now, SAS 9.2 and 9.3 will issue a warning if the new length is less than the old length. This can be useful to know but it can be changed using the VARLENCHK system option. The default value is WARN but it can be set to ERROR if you really want to ensure that there is no chance of truncation. Then, you get error messages and the program fails where it normally would run with warnings. Setting the value of the option to NOWARN restores the type of behaviour seen in SAS 9.1.3 and versions prior to that.

The SAS documentation says that the ability to change VARLENCHK can be restricted by an administrator so you might need to deal with this situation in a more locked down environment. Then, one option would be to do something like the following:

data b;
drop x;
rename _x=x;
set a;
length _x $100;
_x=strip(x);
run;

It’s a bit more laborious than setting to the VARLENCHK option to NOWARN but the idea is that you create a new variable of the right length and replace the old one with it. That gets rid of warnings or errors in the log and resets the variable length as needed. Of course, you have to ensure that there is no value truncation with either remedy. If any is found, then the dataset specification probably needs updating to accommodate the length of the values in the data. After all, there is no substitute for getting to know your data and doing your own checking should you decide to take matters into your hands.

There is a use for the default behaviour though. If you use a specification to specify a shell for a dataset, then you will be warned when the shell shortens variable lengths. That allows you to either adjust the dataset or your program. Also, it gives more information when you get variable length mismatch warnings when concatenating or merging datasets. There was a time when SAS wasn’t so communicative in these situations and some investigation was needed to establish which variable was affected. Now, that has changed without leaving the option to work differently if you so do desire. Sometimes, what can seem like an added restriction can have its uses.

Using Data Step to Create a Dataset Template from a Dataset in SAS

23rd November 2010

Recently, I wanted to make sure that some temporary datasets that were being created during data processing in a dataset creation program weren’t truncating values or differed from the variable lengths in the original. It was then that a brainwave struck me: create an empty dataset shell using data step and use that set all the variable lengths for me when the new datasets were concatenated to it. The code turned out to be very simple and here is an example of how it looked:

data shell;
stop;
set example;
run;

The STOP statement, prevents the data step from reading in any of the values in the template dataset and just its header is written out to another (empty) dataset that can be used to set things up as you would want them to be. It certainly was a quick solution in my case.

Using ODS Graphics to Create Plots Using PROC LIFETEST

3rd September 2010

One of the nice things about SAS 9.2 is that creation of statistical graphics is enhanced using ODS. One of the beneficiaries of this is PROC LIFETEST, a procedure that gained a lot when data sets could be created from it using ODS OUTPUT  statements. Before that, it was a matter of creating text output and converting it to a SAS data set using Data Step and that was a nuisance on a system that attached special significance to output destinations set up using PROC PRINTTO. What you’ll find below is a sample of the type of code for creating a Kaplan-Meier survival plot for time to adverse events resulting in discontinuation of study treatment with actual and censored times. The IMAGENAME parameter on the ODS GRAPHICS statement line controls the name of the file and it is possible to change the type using the IMAGEFMT parameter too.

ods graphics on / imagename=”fig5″;
proc lifetest data=km3 method=km plots=survival;
time timetoae*cens_ae(0);
run;
ods graphics off;

On Making PROC REPORT Work Harder

1st September 2010

In the early years of my SAS programming career, there seemed to be just the one procedure to use if you wanted to create a summary table. That was TABULATE and it was great for generating columns according to the value of a variable such as the treatment received by a subject in a clinical study. To a point, it could generate statistics for you too and I often used it to sum frequency and percentage variables. Since then, it seems to have been enhanced a little and it surprised me with the statistics it could produce when I had a recent play. Here’s the code:

proc tabulate data=sashelp.class;
class sex;
var age;
table age*(n median*f=8. mean*f=8.1 std*f=8.1 min*f=8. max*f=8. lclm*f=8.1 uclm*f=8.1),sex / misstext="0";
run;

When you compare that with the idea of creating one variable per column and then defining them in PROC REPORT as many do, it has to look more elegant and the results aren’t bad either though they can be tweaked further from the quick example that I generated. That last comment brings me to the point that PROC REPORT seems to have taken over from TABULATE wherever I care to look these days and I do ask myself if it is the right tool for that for which it is being used or if it is being used in the best way.

Using Data Step to create one variable per column in a PROC REPORT output doesn’t strike me as the best way to write reusable code but there are ways to make REPORT do more for you. For example, by defining GROUP, ACROSS and ANALYSIS columns in an output, you can persuade the procedure to do the summarising for you and there’s some example code below with the comma nesting height under sex in the resulting table. Sums are created by default if you do this and forgoing an analysis column definition means that you get a frequency table, not at all a useless thing in many cases.

proc report data=sashelp.class nowd missing;
columns age sex,height;
define age / group "Age";
define sex / across "Sex";
define height / analysis mean f=missing. "Mean Height";
run;

For those times when you need to create more heavily formatted statistics (summarising range as min-max rather showing min and max separately, for example), you might feel that the GROUP/ACROSS set-up’s non-display of character values puts a stop to using that approach. However, I found that making every value combination unique and attaching a cell ID helps to work around the problem. Then, you can create a format control data set from the data like in the code below and create a format from that which you can apply to the cell ID’s to display things as you need them. This method does make things more portable from situation to situation than adding or removing columns depending on the values of a classification variable.

proc sql noprint;
create table cntlin as
select distinct "fmtname" as fmtname, cellid as start, cellid as end, decode as label
from report;
quit;

proc format lib=work cntlin=cnlin;
run;

  • All the views that you find expressed on here in postings and articles are mine alone and not those of any organisation with which I have any association, through work or otherwise. As regards editorial policy, whatever appears here is entirely of my own choice and not that of any other person or organisation.

  • Please note that everything you find here is copyrighted material. The content may be available to read without charge and without advertising but it is not to be reproduced without attribution. As it happens, a number of the images are sourced from stock libraries like iStockPhoto so they certainly are not for abstraction.

  • With regards to any comments left on the site, I expect them to be civil in tone of voice and reserve the right to reject any that are either inappropriate or irrelevant. Comment review is subject to automated processing as well as manual inspection but whatever is said is the sole responsibility of the individual contributor.