26th November 2022
Recently, I encountered the following kind of message when reading an Excel file into SAS using PROC IMPORT:
ERROR: Error opening XLSX file -> xxx-.xlsx . It is either not an Excel spreadsheet or it is damaged. Error code=8000101D
Requested Input File Is Invalid
ERROR: Import unsuccessful. See SAS Log for details.
Naturally, thoughts arise regarding the state of the Excel file when you see a message like this but that was not the case because the file opened successfully in Excel and looked OK to me. After searching on the web, I found that it was a file permissions issue. The actual environment that I was using at the time was entimICE and I had forgotten to set up a link that granted read access to the file. Once that was added, the problem got resolved. In other systems, checking on file system permissions is needed even if the message seems to suggest that you are experiencing a file integrity problem.
16th February 2022
Custom error messages are good to add to SAS macros but you can get inconsistent colouration of the message text in multi-line messages. That was something that I just overlooked until I recently came across a solution. That is to use a hyphen at the end of the ERROR/WARNING/NOTE prefix instead of the more usual colon. Any prefixes ending on a hyphen are not included in the log text and the colouration ignores the carriage return that ordinary would change the text colour to black. The simple macro below demonstrates the effect.
%put ERROR: this is a test;
%put ERROR- this is another test;
%put WARNING: this is a test;
%put WARNING- this is another test;
%put NOTE: this is a test;
%put NOTE- this is another test;
ERROR: this is a test
this is another test
WARNING: this is a test
this is another test
NOTE: this is a test
this is another test
14th October 2021
My response to changes in the technology stack used in clinical research is to develop some familiarity with programming and scripting platforms that complement and compete with SAS, a system with which I have been programming since 2000. One of these has been R but Python is another that has taken up my attention and I now also have Julia in my sights as well. There may be others to assess in the fullness of time.
While I first started to explore the Data Science world in the autumn of 2017, it was in the autumn of 2019 that I began to complete LinkedIn training courses on the subject. Good though they were, I find that I need to actually use a tool in order to better understand it. At that time, I did get to hear about Python packages like Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn and Beautiful Soup though it took until of spring of this year for me to start gaining some hands-on experience with using any of these.
During the summer of 2020, I attended a BCS webinar on the CodeGrades initiative, a programming mentoring scheme inspired by the way classical musicianship is assessed. In fact, one of the main progenitors is a trained classical musician and teacher of classical music who turned to Python programming when starting a family so as to have a more stable income. The approach is that a student selects a project and works their way through it with mentoring and periodic assessments carried out in a gentle and discursive manner. Of course, the project has to be engaging for the learning experience to stay the course and that point came through in the webinar.
That is one lesson that resonates with me with subjects as diverse as web server performance and the ongoing pandemic pandemic supplying data and there are other sources of public data to examine as well before looking through my own personal archive gathered over the decades. Some subjects are uplifting while others are more foreboding but the key thing is that they sustain interest and offer opportunities for new learning. Without being able to dream up new things to try, my knowledge of R and Python would not be as extensive as it is and I hope that it will help with learning Julia too.
In the main, my own learning has been a solo effort with consultation of documentation along with web searches that have brought me to the likes of Real Python, Stack Abuse, Data Viz with Python and R and others for longer tutorials as well as threads on Stack Overflow. Usually, the web searching begins when I need a steer on a particular or a way to resolve a particular error or warning message but books always are worth reading even if that is the slower route. Those from the Dummies series or from O’Reilly have proved must useful so far but I do need to read them more completely than I already have; it is all too tempting to go with the try the “programming and search for solutions as you go” approach instead.
To get going, many choose the Anaconda distribution to get Jupyter notebook functionality but I prefer a more traditional editor so Spyder has been my tool of choice for Python programming and there are others like PyCharm as well. Spyder itself is written in Python so it can be installed using pip from PyPi like other Python packages. It has other dependencies like Pylint for code management activities but these get installed behind the scenes.
The packages that I first met in 2019 may be the mainstays for doing data science but I have discovered others since then. It also seems that there is porosity between the worlds of R an Python so you get some Python packages aping R packages and R has the Reticulate package for executing Python code. There are Python counterparts to such Tidyverse stables as dply and ggplot2 in the form of Siuba and Plotnine, respectively. The syntax of these packages are not direct copies of what is executed in R but they are close enough for there to be enough familiarity for added user friendliness compared to Pandas or Matplotlib. The interoperability does not stop there for there is SQLAlchemy for connecting to MySQL and other databases (PyMySQL is needed as well) and there also is SASPy for interacting with SAS Viya.
Pyhton may not have the speed of Julia but there are plenty of packages for working with larger workloads. Of these, Dask, Modin and RAPIDS all have there uses for dealing with data volumes that make Pandas code crawl. As if to prove that there are plenty of libraries for various forms of data analytics, data science, artificial intelligence and machine learning, there also are the likes of Keras, TensorFlow and NetworkX. These are just a selection of what is available and there is no need not to check out more. It may be tempting to stick with the most popular packages all the time, especially when they do so much, but it never hurst to keep an open mind either.
10th October 2021
Recently, a client of mine updated one of their systems from SAS 9.4 M5 to SAS 9.4 M7. In spite of performing due diligence regarding changes between the maintenance release, a change in behaviour of the SYSODSESCAPECHAR automatic macro variable surprised them. The macro variable captures the assignment of the ODS escape character used to prefix RTF codes for page numbering and other things. That setting is made using an ODS ESCAPECHAR statement like the following:
In the M5 release, the tilde character in this example was output by the automatic macro variable but that changed in the M7 release to 7E, the hexadecimal code for the same and this tripped up one of their validated macro programs used in output production. The adopted solution was to use the escape sequence (*ESC*) that gave the same outcome that was there before the change. That was less verbose than alternative code changing the hexadecimal code into the expected ASCII character that follows.
The above supplies a hexadecimal code to the BYTE function for correct rendering with the SYMPUT routine assigning the resulting value to a macro variable named new. Just using the escape sequence is far more succinct though there is now an added validation need once user pilot testing has completed. In my line of business, the updating of code is the quickest part of many such changes; documentation and testing always take longer.
21st August 2021
Data science has remained in my awareness since 2017 though my work is more on its fringes in clinical research. In fact, I have been involved more in the standardisation and automation of more traditional data reporting than in the needs of data modelling such as data engineering or other similar disciplines. Much of this effort has meant the use of SAS, with which I have programmed since 2000 and for which I have a licence (an expensive commodity, it has to be said), but other technologies are being explored with R, Python and Julia being among them.
The change in technological scope does bring an element of excitement and new interest but there is also some sadness when tried and trusted technologies meet with newer competition and valued skills are no longer as career securing as they once were. Still, there is plenty of online training out there and I already have collected some of my thoughts on this. The learning continues and the need for repositioning is also clear.
The journey also has brought some curios to my notice. One of these is This Person Does Not Exist, a website building photos of non-existent faces using machine learning. Recently, I learned of others like it such as This Artwork Does Not Exist, This Cat Does Not Exist, This Horse Does Not Exist, and This Chemical Does Not Exist. The last of these probably should be entitled “This Molecule Does Not Exist (Yet)” since it is a fictitious molecular structure that has been created and what you get is an actual moving image that spins it around in three-dimensional space. The one with dynamically generated abstract art is the main inspiration for this piece and is of more interest to me while the other two are more explanatory though the horse website is not so successful in its execution and one can ask why we need more cat pictures.
To some, the idea of creating fake pictures may feel a little foreboding and that especially applies to photos of people and the livelihoods of any content creators. Nevertheless, these sources of imagery have their legitimate uses such as decorating websites or brochures and that is where my interest is piqued. After all, there are some subjects where pictures can be scarce so any form of decoration that enlivens an article has to have some use. Technology websites like this one can feature images too with screenshots and device photos being commonplace but they can all look like each other, hence the need for a little more variety and having pictures often increases the choice of website themes as well since so many need images to make them work or stand out. As ever, being sparing with any new innovations remains in order so that is how I approach this matter as well.
15th August 2021
One should not do a new PC build in the middle of a heatwave if you do not want to be concerned about how fast fans are spinning and how hot things are getting. Yet, that is what I did last month after delaying the act for numerous months.
My efforts mean that I have a system built around an AMD Ryzen 9 5950X CPU and a Gigabyte X570 Aorus Pro with 64 GB of memory and things are settling down after the initial upheaval. That also meant some adjustments to the CPU fan profile in the BIOS for quieter running while the the use of Be Quiet! Dark Rock 4 cooler also helps as does a Be Quiet! Silent Wings 3 case fan. All are components from trusted brands though I wonder how much abuse they got during their installation and subsequent running in.
Fan noise is a non-quantitative indicator of heat levels as much as touch so more quantitative means are in order. Aside from using a thermocouple device, there are in-built sensors too. My using Linux Mint means that I have the sensors command from the lm-sensors package for checking on CPU and other temperatures though hddtemp is what you need for checking on the same for hard drives. The latter can be used as follows:
sudo hddtemp /dev/sda /dev/sdb
This has to happen using administrator access and a list of drives needs to be provided because it cannot find them by itself. In my case, I have no mechanical hard drives installed in non-NAS systems and I even got to replacing a 6 TB Western Digital Green disk with an 8 TB SSD but I got the following when I tried checking on things with hddtemp:
WARNING: Drive /dev/sda doesn't seem to have a temperature sensor.
WARNING: This doesn't mean it hasn't got one.
WARNING: If you are sure it has one, please contact me ([email protected]).
WARNING: See --help, --debug and --drivebase options.
/dev/sda: Samsung SSD 870 QVO 8TB: no sensor
The cause of the message for me was that there is no entry for Samsung SSD 870 QVO 8TB in /etc/hddtemp.db so that needed to be added there. Before that could be rectified, I needed to get some additional information using smartmontools and these needed to be installed using the following command:
sudo apt-get install smartmontools
What I needed to do was check the drive’s SMART data output for extra information and that was achieved using the following command:
sudo smartctl /dev/sda -a | grep -i Temp
What this does is to look for the temperature information from smartctl output using the grep command with output from the first being passed to the second through a pipe. This yielded the following:
190 Airflow_Temperature_Cel 0x0032 072 050 000 Old_age Always - 28
The first number in the above (190) is the thermal sensor’s attribute identifier and that was needed in what got added to /etc/hddtemp.db. The following command added the necessary data to the aforementioned file:
echo \"Samsung SSD 870 QVO 8TB\" 190 C \"Samsung SSD 870 QVO 8TB\" | sudo tee -a /etc/hddtemp.db
Here, the output of the echo command was passed to the tee command for adding to the end of the file. In the echo command output, the first part is the name of the drive, the second is the heat sensor identifier, the third is the temperature scale (C for Celsius or F for Fahrenheit) and the last part is the label (it can be anything that you like but I kept it the same as the name). On re-running the hddtemp command, I got output like the following so all was as I needed it to be.
/dev/sda: Samsung SSD 870 QVO 8TB: 28°C
Since then, temperatures may have cooled and the weather become more like what we usually get but I am still keeping an eye on things, especially when the system is put under load using Perl, R, Python or SAS. There may be further modifications such as changing the case or even adding water cooling, not least to have a cooler power supply unit, but nothing is being rushed as I monitor things to my satisfaction.
18th April 2021
Recently, I shared my thoughts on learning new computing languages by oneself using books, online research and personal practice. As successful as that can be, there remains a place for getting some actual instruction as well. Maybe that is why so many turn to YouTube where there is a multitude of video channels offering such possibilities without cost. What I have also discovered is that this is complemented by a host of other providers whose services attract a fee and there will be a few of those mentioned later in this post. Paying for online courses does mean that you can get the benefit of curation and an added assurance of quality in what appears to be a growing market.
The variation in quality can dog the YouTube approach and it also can be tricky to find something good even if the platform does suggest new videos based on what you have been watching. Much of what is on there does take the form of webinars from the likes of the Why R? Foundation, RStudio or the NHSR Community. These can be useful and there are shorter videos from such providers as the Association of Computing Machinery or SAS Users. These do help more if you already have some knowledge about the topic area being discussed so they may not make the best starting points for someone that is starting from scratch.
Of course, working your way through a good book will help and it is something that I have been known to do but supplementing this with one or more video courses really adds to the experience and I have done a few of these on LinkedIn. That part of the professional platform came from the acquisition of Lynda.com and the topic areas range from soft skills like time management through to computing skills courses with R, SAS and Python seeing coverage among the data science portfolio. Even O’Reilly has ventured into the area in an expansion from the book publishing activities for which so many of us know the organisation.
The available online instructor community does not stop at the above since there are others like Udacity, Programiz, Udemy, Business Science and Datanovia. Some of these tend towards online education provision that feels more like an online university course and those are numerous as well as you will find through Data Science Central or KDNuggets. Both of these earn income from advertising to pay for featured blog posts and newsletters while the former also organises regular webinars and was my first port of call when I became curious about the world of data science during the autumn of 2017.
My point of approach into the world of online training has been as a freelance information professional needing to keep up to date with a rapidly changing field. The mix of content that is both free of charge and that which attracts a fee is one that can work. Both kinds do complement each other while possessing their unique advantages and disadvantages. The need to continually expand skills and knowledge never goes away so it is well worth spending some time working what you are after since you need to be sure that any training always adds to your own knowledge and skill level.
10th April 2021
Over the years, I have taught myself a number of computing languages with some coming in useful for professional work while others came in handy for website development and maintenance. The collection has grown to include HTML, CSS, XML, Perl, PHP and UNIX Shell Scripting. The ongoing pandemic allowed to me added two more to the repertoire: R and Python.
My interest in these arose from my work as an information professional concerned with standardisation and automation of statistical results delivery. To date, the main focus has been on clinical study data but ongoing changes in the life sciences sector could mean that I may need to look further afield so having extra knowledge never hurts. Though I have been a SAS programmer for more than twenty years, its predominance in the clinical research field is not what it was so that I am having to rethink things.
As it happens, I would like to continue working with SAS since it does so much and thoughts of leaving it after me bring sadness. It also helps to know what the alternatives might be and to reject some management hopes about any newcomers, especially with regard to the amount of code being produced and the quality of graphs being created. Use cases need to be assessed dispassionately even when emotions loom behind the scenes.
Both R and Python bring large scripting ecosystems with active communities so the attraction of their adoption makes a deal of sense. SAS is comparable in the scale of its own ecosystem though there are considerable differences and the platform is catching up when it comes to Data Science. The aforementioned open source languages may have had a head start but it seems that others are not standing still either. It is a time to have wider awareness and online conference attendance helps with that.
The breadth of what is available for any programming language more than stymies any attempt to create a truly all encompassing starting point and I have abandoned thoughts of doing anything like that for R. Similarly, I will not even try such a thing for Python. Consequently, this means that my sharing of anything learned will be in the form of discrete postings from time to time, especially given ho easy it is to collect numerous website links for sharing.
The learning has been facilitated by ongoing pandemic restrictions though things are opening up a little now. The pandemic also has given us public data that can be used for practice since much can be gained from having one’s own project instead of completing exercises from a book. Having an interesting data set with which to work is a must and COVID-19 data contain a certain self-interest as well though one always is mindful of the suffering and loss of life that has been happening since the pandemic first took hold.
21st December 2019
Recently, I had someone ask me how to create PNG files in SAS using ODS Graphics so I sought out the answer for them. Normally, the suggestion would have been to create RTF or PDF files instead but there was a specific need so a different approach was taken. Adding the something like following lines before an SGPLOT, SGPANEL or SGRENDER procedure does the needful:
ods listing gpath='E:\';
ods graphics / imagename="test" imagefmt=png;
Here, the ODS LISTING statement declares the destination for the desired graphics file while the ODS GRAPHICS statement defines the file name and type. In the above example, the file test.png would be created in the root of the E drive of a Windows machine. However, this also works with Linux or UNIX directory paths.
27th September 2019
Recently, I needed to create some example data with an extra numeric identifier variable that would be assigned according to the value of a character identifier variable. Not wanting to add another dataset merge or join to the code, I decided to create an informat from data. Initially, I looked into creating a format instead but it not accomplish what I wanted to do.
keep fmtname start end label type;
The input data needed a little processing as shown above. The format name was added in the variable FMTNAME and TYPE variable was assigned a value of I to make this a numeric informat; to make character equivalent, a value of J is assigned. The START and END variables declare the value range associated with the value of LABEL that would become the actual value of the numeric identifier variable. The variable names are fixed because the next step will not work with different ones.
proc format lib=work cntlin=patient;
To create the actual informat, the dataset is read by a FORMAT procedure with the CNTLIN parameter specifying the name of the input dataset and LIB defining the library where the format catalog is stored. When this in complete, the informat is available for use with an input function as shown in the code excerpt below.