Tag Archive for statistics

Creating a Data Set Containing Confidence Intervals Using PROC UNIVARIATE

While you could generate data sets containing means and confidence intervals using PROC SUMMARY or PROC MEANS, curiosity and the need to verify a program using a different technique were what drove me to consider using PROC UNIVARIATE for the task. For the record, the PROC SUMMARY code is below and the only difference between it and MEANS is that it doesn’t produce printed output by default, something that’s not needed in this case anyway. Quite why there are two SAS procedures doing exactly the same thing is beyond me, though I do wonder if the NOPRINT option was a later addition than these two procedures. The LCLM and UCLM keywords are what trigger the calculation of confidence limits and the ALPHA option controls the confidence level used; 0.05 specifies a 95% interval, 0.1 a 90% one and so on.

proc summary data=sashelp.class mean lclm uclm alpha=0.05;
var age;
output out=sasuser.lims mean=mean lclm=lclm uclm=uclm;
run;

Given that I have had PROC UNIVARIATE producing statistics that MEANS/SUMMARY didn’t in previous versions of SAS (I believe that it was standard deviation that was absent from MEANS/SUMMARY), I might have expected the calculation and export of confidence limits to a data set to be straightforward. Sadly, it’s not a case of simply adding LCLM and UCLM keywords in the OUTPUT statement for the procedure, and ODS OUTPUT is needed to create the data set instead. An ODS SELECT statement is needed to pick out the BasicIntervals output object (UNIVARIATE creates quite a few, it seems) that is created by specifying the CIBASIC and ALPHA options on the PROC UNIVARIATE statement; ALPHA performs the same role as it does for PROC MEANS/SUMMARY. The reason for the ODS LISTING and ODS RTF statements below is to stop output being sent to the output window in a standard SAS session. For some reason, it appears that you need to send output to one of the LISTING, HTML or RTF destinations or there will be no data in the data set; I met the same behaviour when using ODS PS, an ODS PRINTER destination. The data set will contain statistics for mean, standard deviation and variance, which is why there is a WHERE clause on the ODS OUTPUT statement.

ods listing close;
ods rtf body="c:\temp\uni_eg.doc";
ods select BasicIntervals;
ods output BasicIntervals=sasuser.stats(where=(lowcase(parameter)="mean"));

proc univariate cibasic alpha=0.05 data=sashelp.class;
var age;
run;

ods output close;
ods rtf close;
ods listing;

On Making PROC REPORT Work Harder

In the early years of my SAS programming career, there seemed to be just the one procedure to use if you wanted to create a summary table. That was TABULATE and it was great for generating columns according to the value of a variable such as the treatment received by a subject in a clinical study. To a point, it could generate statistics for you too and I often used it to sum frequency and percentage variables. Since then, it seems to have been enhanced a little and it surprised me with the statistics it could produce when I had a recent play. Here’s the code:

proc tabulate data=sashelp.class;
class sex;
var age;
table age*(n median*f=8. mean*f=8.1 std*f=8.1 min*f=8. max*f=8. lclm*f=8.1 uclm*f=8.1),sex / misstext="0";
run;

When you compare that with the idea of creating one variable per column and then defining them in PROC REPORT as many do, it has to look more elegant and the results aren’t bad either though they can be tweaked further from the quick example that I generated. That last comment brings me to the point that PROC REPORT seems to have taken over from TABULATE wherever I care to look these days and I do ask myself if it is the right tool for that for which it is being used or if it is being used in the best way.

Using Data Step to create one variable per column in a PROC REPORT output doesn’t strike me as the best way to write reusable code but there are ways to make REPORT do more for you. For example, by defining GROUP, ACROSS and ANALYSIS columns in an output, you can persuade the procedure to do the summarising for you and there’s some example code below with the comma nesting height under sex in the resulting table. Sums are created by default if you do this and forgoing an analysis column definition means that you get a frequency table, not at all a useless thing in many cases.

proc report data=sashelp.class nowd missing;
columns age sex,height;
define age / group "Age";
define sex / across "Sex";
define height / analysis mean f=missing. "Mean Height";
run;

For those times when you need to create more heavily formatted statistics (summarising range as min-max rather than showing min and max separately, for example), you might feel that the GROUP/ACROSS set-up’s non-display of character values puts a stop to using that approach. However, I found that making every value combination unique and attaching a cell ID helps to work around the problem. Then, you can create a format control data set from the data as in the code below and create a format from that which you can apply to the cell IDs to display things as you need them. This method does make things more portable from situation to situation than adding or removing columns depending on the values of a classification variable.

proc sql noprint;
create table cntlin as
select distinct "fmtname" as fmtname, cellid as start, cellid as end, decode as label
from report;
quit;

proc format lib=work cntlin=cntlin;
run;

Consolidation

For a while, the Windows computing side of my life has been spread across far too many versions of the pervasive operating system with the list including 2000 (desktop and server), XP, 2003 Server, Vista and 7; 9x hasn’t been part of my life for what feels like an age. At home, XP has been the mainstay for my Windows computing needs with Vista Home Premium loaded on my Toshiba laptop. The latter variant came in for more use during that period of home computing “homelessness” and, despite a cacophony of complaints from some, it seemed to work well enough. Since the start of the year, 7 has also been in my sights with beta and release candidate instances in virtual machines leaving me impressed enough to go popping the final version onto both the laptop and in a VM on my main PC. Microsoft finally have got around to checking product keys over the net so that meant a licence purchase for each installation using the same downloaded 32-bit ISO image. 7 still is doing well by me so I am beginning to wonder whether having an XP VM is becoming pointless. The reason for that train of thought is that 7 is becoming the only version that I really need for anything that takes me into the world of Windows.

Work is a different matter with a recent move away from Windows 2000 to Vista heavily reducing my exposure to the venerable old stager (businesses usually take longer to migrate and any good IT manager usually delays any migration by a year anyway). 2000 is sufficiently outmoded by now that even my brother was considering a move to 7 for his work because of all the Office 2007 files that have been coming his way. He may be no technical user but the bad press gained by Vista hasn’t passed him by so a certain wariness is understandable. Saying that, my experiences with Vista haven’t been unpleasant; it always worked well on the laptop and the same can be said for its corporate desktop counterpart. Much of the noise centred on issues of hardware and software compatibility and that certainly is apparent at work, where I have some creases left to straighten.

With all of this general forward heaving, you might think that IE6 would be shuffling off its mortal coil by now but a recent check on visitor statistics for this website places it at about 13% share, tantalisingly close to oblivion but still too large to ignore completely. All in all, it is lingering like that earlier blight of web design, Netscape 4.x. If I were planning a big change to the site design, setting up a Win2K VM would be in order so as not to completely put off those labouring with the old curmudgeon. For smaller changes, the temptation is not to bother checking but that is questionable when XP is set to live on for a while yet. XP came with IE6 and there must be users still stuck with it, which is ironic given that IE8 has been available for XP SP2 since its original launch a while back. Where all this is leading me is towards the idea of waiting for IE6 share to decrease further before tackling any major site changes. After all, I can wait with the general downward trend in market share; there has to be a point when its awkwardness makes it no longer viable to support the thing. That would be a happy day.

Quickly surveying free disk space on UNIX and Linux

Keeping an eye on disk space on a Solaris server is important for me at work while keeping the same top level overview is good for my use of Linux at home too. Luckily, there’s a simple command that delivers the goods:

df -h 2>/dev/null

The "df -h" piece is what delivers the statistics while the "2>/dev/null" rids the terminal of any error messages; ones stating that access has been denied are common and can cloud the picture.
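To go a step beyond eyeballing the listing, the same output can be filtered so that only the fuller file systems show up. What follows is only a rough sketch and not something from my own set-up: the over_threshold function name and the 80% limit are made up for illustration, and it assumes a df that accepts the POSIX -P option (fixed six-column output) together with a POSIX awk that supports -v, which may mean using the XPG4 df and nawk on older Solaris releases. Mount points containing spaces would also trip it up.

```shell
# Hypothetical helper: read POSIX "df -P" style output on standard input and
# print the mount point and capacity of every file system at or above a limit.
# The limit is the first argument and defaults to 80 (percent).
over_threshold() {
    threshold="${1:-80}"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5                  # fifth column is Capacity, e.g. 90%
            sub(/%/, "", use)         # strip the percent sign
            if (use + 0 >= limit)     # compare numerically
                print $6, $5          # sixth column is the mount point
        }'
}

# Usage: as before, error messages are discarded so they do not cloud the picture.
df -P 2>/dev/null | over_threshold 80
```

Splitting the filter into a function that reads standard input means it can be tried out on canned text before being pointed at a real server.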

Google Analytics

Furthering my excursions into things related to Google, I have been giving Google Analytics a whirl for my hillwalking and photo gallery website. Aside from the fact that it is updated only once a day, it could enable me to give WordPress plug-ins like Popularity Contest and FireStats the chop. As it happens, I also have a Google Analytics plug-in installed but a little editing of the blog template that I have developed would get rid of that too.

That’s enough about WordPress plug-ins; let’s return to Google Analytics. It has all the usual stuff: who’s visiting, from where are they coming, what are they using to see your site, etc. In addition, it captures if they are coming back, how long they are staying on the site and how deep they are going. Bounce rate is another term that features heavily: it is when a user only goes to one page and then leaves. With a blog, this unfortunately seems to come out as a high figure and that is ironic given that the blog was meant to promote the online photo gallery; it has very much taken on a life all of its own. There’s more to the information from Google Analytics but it’s all useful stuff and I plan to make good use of it to improve how my site works.

Do we surf the web less at the weekend?

Looking at the visitor statistics for both this blog and for my main website, I have noticed a definite dip in visitor numbers at the weekends, at least over the last few weeks. Time will tell as to whether this is a definite trend but it is an intriguing one: fewer people are reading blogs and suchlike when they might have more time to do so. It would also suggest that people are getting away from the web at the weekend, not necessarily a bad thing at all. In fact, I was away from the world of computers and out walking in the border country shared by Wales and England yesterday.

Speaking of walking, it does not surprise me that my hillwalking blog received less attention: many of my readers could have been in the outdoors anyway. And as for this blog, it does contain stuff that I find useful in the day job and it seems that others are looking for the same stuff too if the blog statistics are to be believed. Couple that with the fact that technology news announcements peak during the week and it seems that the weekday upsurge is real. I’ll continue to keep an eye on things to see if my theorising is right or mistaken…
