Technology Tales

Adventures & experiences in contemporary technology

Useful Python packages for working with data

14th October 2021

My response to changes in the technology stack used in clinical research is to develop some familiarity with programming and scripting platforms that complement and compete with SAS, a system with which I have been programming since 2000. One of these has been R but Python is another that has taken up my attention and I now also have Julia in my sights as well. There may be others to assess in the fullness of time.

While I first started to explore the Data Science world in the autumn of 2017, it was in the autumn of 2019 that I began to complete LinkedIn training courses on the subject. Good though they were, I find that I need to actually use a tool in order to better understand it. At that time, I did get to hear about Python packages like Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn and Beautiful Soup  though it took until of spring of this year for me to start gaining some hands-on experience with using any of these.

During the summer of 2020, I attended a BCS webinar on the CodeGrades initiative, a programming mentoring scheme inspired by the way classical musicianship is assessed. In fact, one of the main progenitors is a trained classical musician and teacher of classical music who turned to Python programming when starting a family so as to have a more stable income. The approach is that a student selects a project and works their way through it with mentoring and periodic assessments carried out in a gentle and discursive manner. Of course, the project has to be engaging for the learning experience to stay the course and that point came through in the webinar.

That is one lesson that resonates with me with subjects as diverse as web server performance and the ongoing pandemic pandemic supplying data and there are other sources of public data to examine as well before looking through my own personal archive gathered over the decades. Some subjects are uplifting while others are more foreboding but the key thing is that they sustain interest and offer opportunities for new learning. Without being able to dream up new things to try, my knowledge of R and Python would not be as extensive as it is and I hope that it will help with learning Julia too.

In the main, my own learning has been a solo effort with consultation of documentation along with web searches that have brought me to the likes of Real Python, Stack Abuse, Data Viz with Python and R and others for longer tutorials as well as threads on Stack Overflow. Usually, the web searching begins when I need a steer on a particular or a way to resolve a particular error or warning message but books always are worth reading even if that is the slower route. Those from the Dummies series or from O’Reilly have proved must useful so far but I do need to read them more completely than I already have; it is all too tempting to go with the try the “programming and search for solutions as you go” approach instead.

To get going, many choose the Anaconda distribution to get Jupyter notebook functionality but I prefer a more traditional editor so Spyder has been my tool of choice for Python programming and there are others like PyCharm as well. Spyder itself is written in Python so it can be installed using pip from PyPi like other Python packages. It has other dependencies like Pylint for code management activities but these get installed behind the scenes.

The packages that I first met in 2019 may be the mainstays for doing data science but I have discovered others since then. It also seems that there is porosity between the worlds of R an Python so you get some Python packages aping R packages and R has the Reticulate package for executing Python code. There are Python counterparts to such Tidyverse stables as dply and ggplot2 in the form of Siuba and Plotnine, respectively. The syntax of these packages are not direct copies of what is executed in R but they are close enough for there to be enough familiarity for added user friendliness compared to Pandas or Matplotlib. The interoperability does not stop there for there is SQLAlchemy for connecting to MySQL and other databases (PyMySQL is needed as well) and there also is SASPy for interacting with SAS Viya.

Pyhton may not have the speed of Julia but there are plenty of packages for working with larger workloads. Of these, Dask, Modin and RAPIDS all have there uses for dealing with data volumes that make Pandas code crawl. As if to prove that there are plenty of libraries for various forms of data analytics, data science, artificial intelligence and machine learning, there also are the likes of Keras, TensorFlow and NetworkX. These are just a selection of what is available and there is no need not to check out more. It may be tempting to stick with the most popular packages all the time, especially when they do so much, but it never hurst to keep an open mind either.

Something to watch with the SYSODSESCAPECHAR automatic SAS macro variable

10th October 2021

Recently, a client of mine updated one of their systems from SAS 9.4 M5 to SAS 9.4 M7. In spite of performing due diligence regarding changes between the maintenance release, a change in behaviour of the SYSODSESCAPECHAR automatic macro variable surprised them. The macro variable captures the assignment of the ODS escape character used to prefix RTF codes for page numbering and other things. That setting is made using an ODS ESCAPECHAR statement like the following:

ods escapechar="~";

In the M5 release, the tilde character in this example was output by the automatic macro variable but that changed in the M7 release to 7E, the hexadecimal code for the same and this tripped up one of their validated macro programs used in output production. The adopted solution was to use the escape sequence (*ESC*) that gave the same outcome that was there before the change. That was less verbose than alternative code changing the hexadecimal code into the expected ASCII character that follows.

data _null_;
call symput("new",byte(input("&sysodsescapechar.",hex.)));
run;

The above supplies a hexadecimal code to the BYTE function for correct rendering with the SYMPUT routine assigning the resulting value to a macro variable named new. Just using the escape sequence is far more succinct though there is now an added validation need once user pilot testing has completed. In my line of business, the updating of code is the quickest part of many such changes; documentation and testing always take longer.

Some books and other forms of documentation on R

11th September 2021

The thrust of an exhortation from a computing handbook publisher comes to mind here: don’t just look things up on Google, read a book so you really understand what you are doing. Something like those words was used to sell an eBook on Github but the same sentiment applies to R or any other computing language. Using a search engine will get you going or add to existing knowledge but only a book or a training course will help to embed real competence.

In the case of R, there is a myriad of blogs out there that can be consulted as well as function and package documentation on RDocumentation or rrdr.io. For the former, R-bloggers or R Weekly can make good places to start while ones like Stats and R, Statistics Globe, STHDA, PSI’s VIS-SIG and anything from Posit (including their main blog as well as their AI one) can be worth consulting. Additionally, there is also RStudio Education and the NHS-R Community, which also have a Github repository together with a YouTube channel. Many packages have dedicated websites as well so there is no lack of documentation with all of these so here is a selection:

Tidyverse

forcats

tidyr

Distill for R Markdown

Databases using R

RMariaDB

R Markdown

xaringanExtra

Shiny

formattable

reactable

DT

rhandsontable

thematic

bslib

plumber

ggforce

officeverse

officer

pharmaRTF

COVID-19 Data Hub

To come to the real subject of this post, R is unusual in that books that you can buy also have companions websites that contain the same content with the same structure. Whatever funds this approach (and some appear to be supported by RStudio itself by the looks of things), there certainly are a lot of books available freely online in HTML as you will see from the list below while a few do not have a print counterpart as far as I know:

Big Book of R

R Programming for Data Science

Hands-On Programming with R

Advanced R

Cookbook for R

R Graphics Cookbook

R Markdown: The Definitive Guide

R Markdown Cookbook

RMarkdown for Scientists

bookdown: Authoring Books and Technical Documents with R Markdown

blogdown: Creating Websites with R Markdown

pagedown: Create Paged HTML Documents for Printing from R Markdown

Dynamic Documents with R and knitr

Mastering Shiny

Engineering Production-Grade Shiny Apps

Outstanding User Interfaces with Shiny

R Packages

Mastering Spark with R

Happy Git and GitHub for the useR

JavaScript for R

HTTP Testing in R

Outstanding User Interfaces with Shiny

Engineering Production-Grade Shiny Apps

The Shiny AWS Book

Many of the above have counterparts published by O’Reilly or Chapman & Hall, to name the two publishers that I have found so far. Aside from sharing these with you, there is also the personal motivation of having the collection of links somewhere so I can close tabs in my Firefox session. There are other web articles open in other tabs that I need to retain and share but these will need to do for now and I hope that you find them as useful as I do.

A little bit of abstraction

21st August 2021

A little bit of abstraction

Data science has remained in my awareness since 2017 though my work is more on its fringes in clinical research. In fact, I have been involved more in the standardisation and automation of more traditional data reporting than in the needs of data modelling such as data engineering or other similar disciplines. Much of this effort has meant the use of SAS, with which I have programmed since 2000 and for which I have a licence (an expensive commodity, it has to be said), but other technologies are being explored with R, Python and Julia being among them.

The change in technological scope does bring an element of excitement and new interest but there is also some sadness when tried and trusted technologies meet with newer competition and valued skills are no longer as career securing as they once were. Still, there is plenty of online training out there and I already have collected some of my thoughts on this. The learning continues and the need for repositioning is also clear.

A little bit of abstraction

A little bit of abstraction

The journey also has brought some curios to my notice. One of these is This Person Does Not Exist, a website building photos of non-existent faces using machine learning. Recently, I learned of others like it such as This Artwork Does Not Exist, This Cat Does Not Exist, This Horse Does Not Exist, and This Chemical Does Not Exist. The last of these probably should be entitled “This Molecule Does Not Exist (Yet)” since it is a fictitious molecular structure that has been created and what you get is an actual moving image that spins it around in three-dimensional space. The one with dynamically generated abstract art is the main inspiration for this piece and is of more interest to me while the other two are more explanatory though the horse website is not so successful in its execution and one can ask why we need more cat pictures.

To some, the idea of creating fake pictures may feel a little foreboding and that especially applies to photos of people and the livelihoods of any content creators. Nevertheless, these sources of imagery have their legitimate uses such as decorating websites or brochures and that is where my interest is piqued. After all, there are some subjects where pictures can be scarce so any form of decoration that enlivens an article has to have some use. Technology websites like this one can feature images too with screenshots and device photos being commonplace but they can all look like each other, hence the need for a little more variety and having pictures often increases the choice of website themes as well since so many need images to make them work or stand out. As ever, being sparing with any new innovations remains in order so that is how I approach this matter as well.

Online learning

18th April 2021

Recently, I shared my thoughts on learning new computing languages by oneself using books, online research and personal practice. As successful as that can be, there remains a place for getting some actual instruction as well. Maybe that is why so many turn to YouTube, where there is a multitude of video channels offering such possibilities without cost. What I have also discovered is that this is complemented by a host of other providers whose services attract a fee, and there will be a few of those mentioned later in this post. Paying for online courses does mean that you can get the benefit of curation and an added assurance of quality in what appears to be a growing market.

The variation in quality can dog the YouTube approach, and it also can be tricky to find something good, even if the platform does suggest new videos based on what you have been watching. Much of what is found there does take the form of webinars from the likes of the Why R? Foundation, Posit or the NHSR Community. These can be useful, and there are shorter videos from such providers as the Association of Computing Machinery or SAS Users. These do help more if you already have some knowledge about the topic area being discussed, so they may not make the best starting points for someone who is starting from scratch.

Of course, working your way through a good book will help, and it is something that I have been known to do, but supplementing this with one or more video courses really adds to the experience and I have done a few of these on LinkedIn. That part of the professional platform came from the acquisition of Lynda.com and the topic areas range from soft skills like time management through to computing skills courses with R, SAS and Python seeing coverage among the data science portfolio. Even O’Reilly has ventured into the area in an expansion from the book publishing activities for which so many of us know the organisation.

The available online instructor community does not stop at the above since there are others like Degreed, Baeldung, Udacity, Programiz, Udemy, Business Science and Datanovia. Some of these tend towards online education provision that feels more like an online university course and those are numerous as well as you will find through Data Science Central or KDNuggets. Both of these earn income from advertising to pay for featured blog posts and newsletters, while the former also organises regular webinars and was my first port of call when I became curious about the world of data science during the autumn of 2017.

My point of approach into the world of online training has been as a freelance information professional needing to keep up to date with a rapidly changing field. The mix of content that is both free of charge and that which attracts a fee is one that can work. Both kinds do complement each other while possessing their unique advantages and disadvantages. The need to continually expand skills and knowledge never goes away, so it is well worth spending some time working what you are after, since you need to be sure that any training always adds to your own knowledge and skill level.

Self-learning new computing languages

10th April 2021

Over the years, I have taught myself a number of computing languages with some coming in useful for professional work while others came in handy for website development and maintenance. The collection has grown to include HTML, CSS, XML, Perl, PHP and UNIX Shell Scripting. The ongoing pandemic allowed to me added two more to the repertoire: R and Python.

My interest in these arose from my work as an information professional concerned with standardisation and automation of statistical results delivery. To date, the main focus has been on clinical study data but ongoing changes in the life sciences sector could mean that I may need to look further afield so having extra knowledge never hurts. Though I have been a SAS programmer for more than twenty years, its predominance in the clinical research field is not what it was so that I am having to rethink things.

As it happens, I would like to continue working with SAS since it does so much and thoughts of leaving it after me bring sadness. It also helps to know what the alternatives might be and to reject some management hopes about any newcomers, especially with regard to the amount of code being produced and the quality of graphs being created. Use cases need to be assessed dispassionately even when emotions loom behind the scenes.

Both R and Python bring large scripting ecosystems with active communities so the attraction of their adoption makes a deal of sense. SAS is comparable in the scale of its own ecosystem though there are considerable differences and the platform is catching up when it comes to Data Science. The aforementioned open source languages may have had a head start but it seems that others are not standing still either. It is a time to have wider awareness and online conference attendance helps with that.

The breadth of what is available for any programming language more than stymies any attempt to create a truly all encompassing starting point and I have abandoned thoughts of doing anything like that for R. Similarly, I will not even try such a thing for Python. Consequently, this means that my sharing of anything learned will be in the form of discrete postings from time to time, especially given ho easy it is to collect numerous website links for sharing.

The learning has been facilitated by ongoing pandemic restrictions though things are opening up a little now. The pandemic also has given us public data that can be used for practice since much can be gained from having one’s own project instead of completing exercises from a book. Having an interesting data set with which to work is a must and COVID-19 data contain a certain self-interest as well though one always is mindful of the suffering and loss of life that has been happening since the pandemic first took hold.

Creating a data-driven informat in SAS

27th September 2019

Recently, I needed to create some example data with an extra numeric identifier variable that would be assigned according to the value of a character identifier variable. Not wanting to add another dataset merge or join to the code, I decided to create an informat from data. Initially, I looked into creating a format instead but it not accomplish what I wanted to do.

data patient;
keep fmtname start end label type;
set test.dm;
by subject;
fmtname="PATIENT";
start=subject;
end=start;
label=patient;
type="I";
run;

The input data needed a little processing as shown above. The format name was added in the variable FMTNAME and TYPE variable was assigned a value of I to make this a numeric informat; to make character equivalent, a value of J is assigned. The START and END variables declare the value range associated with the value of LABEL that would become the actual value of the numeric identifier variable. The variable names are fixed because the next step will not work with different ones.

proc format lib=work cntlin=patient;
run;
quit;

To create the actual informat, the dataset is read by a FORMAT procedure with the CNTLIN parameter specifying the name of the input dataset and LIB defining the library where the format catalog is stored. When this in complete, the informat is available for use with an input function as shown in the code excerpt below.

data ae1;
set ae;
patient=input(subject,patient.);
run;

Compressing an entire dataset library using SAS

26th September 2019

Turning dataset compression for SAS datasets can produce quite a reduction in size so it often is standard practice to do just this. It can be set globally with a single command and many working systems do this for you:

options compress=yes;

It also can be done on a dataset by dataset option by adding (compress=yes) beside the dataset name on the data line of a data step. The size reduction can be gained retrospectively too but only if you create additional copies of the datasets. The following code uses the DATASETS procedure to accomplish this end at the time of copy the datasets from one library to another:

proc datasets lib=adam nolist;
copy inlib=adam outlib=adamc noclone datecopy memtype=data;
run;
quit;

The NOCLONE option on the COPY statement allows the compression status to be changed while the DATECOPY one keeps the same date/time stamp on the new file at the operating system level. Lastly, the MEMTYPE=DATA setting ensures that only datasets are processed while the NOLIST option on the PROC DATASETS line suppresses listing of dataset information.

It may be possible to do something like this with PROC COPY too but I know from use that the option presented here works as expected. It also does not involve much coding effort, which helps a lot.

Moving a website from shared hosting to a virtual private server

24th November 2018

This year has seen some optimisation being applied to my web presences guided by the results of GTMetrix scans. It was then that I realised how slow things were, so server loads were reduced. Anything that slowed response times, such as WordPress plugins, got removed. Usage of Matomo also was curtailed in favour of Google Analytics while HTML, CSS and JS minification followed. What had yet to happen was a search for a faster server. Now, another website has been moved onto a virtual private server (VPS) to see how that would go.

Speed was not the only consideration since security was a factor too. After all, a VPS is more locked away from other users than a folder on a shared server. There also is the added sense of control, so Let’s Encrypt SSL certificates can be added using the Electronic Frontier Foundation’s Certbot. That avoids the expense of using an SSL certificate provided through my shared hosting provider and a successful transition for my travel website may mean that this one undergoes the same move.

For the VPS, I chose Ubuntu 18.04 as its operating system and it came with the LAMP stack already in place. Have offload development websites, the mix of Apache, MySQL and PHP is more familiar to me than anything using Nginx or Python. It also means that .htaccess files become more useful than they were on my previous Nginx-based platform. Having full access to the operating system by means of SSH helps too and should mean that I have fewer calls on technical support since I can do more for myself. Any extra tinkering should not affect others either, since this type of setup is well known to me and having an offline counterpart means that anything riskier is tried there beforehand.

Naturally, there were niggles to overcome with the move. The first to fix was to make the MySQL instance accept calls from outside the server so that I could migrate data there from elsewhere and I even got my shared hosting setup to start using the new database to see what performance boost it might give. To make all this happen, I first found the location of the relevant my.cnf configuration file using the following command:

find / -name my.cnf

Once I had the right file, I commented out the following line that it contained and restarted the database service afterwards using another command to stop the appearance of any error 111 messages:

bind-address 127.0.0.1
service mysql restart

After that, things worked as required and I moved onto another matter: uploading the requisite files. That meant installing an FTP server so I chose proftpd since I knew that well from previous tinkering. Once that was in place, file transfer commenced.

When that was done, I could do some testing to see if I had an active web server that loaded the website. Along the way, I also instated some Apache modules like mod-rewrite using the a2enmod command, restarting Apache each time I enabled another module.

Then, I discovered that Textpattern needed php-7.2-xml installed, so the following command was executed to do this:

apt install php7.2-xml

Then, the following line was uncommented in the correct php.ini configuration file that I found using the same method as that described already for the my.cnf configuration and that was followed by yet another Apache restart:

extension=php_xmlrpc.dll

Addressing the above issues yielded enough success for me to change the IP address in my Cloudflare dashboard so it pointed at the VPS and not the shared server. The changeover happened seamlessly without having to await DNS updates as once would have been the case. It had the added advantage of making both WordPress and Textpattern work fully.

With everything working to my satisfaction, I then followed the instructions on Certbot to set up my new Let’s Encrypt SSL certificate. Aside from a tweak to a configuration file and another Apache restart, the process was more automated than I had expected so I was ready to embark on some fine-tuning to embed the new security arrangements. That meant updating .htaccess files and Textpattern has its own, so the following addition was needed there:

RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

This complemented what was already in the main .htaccess file and WordPress allows you to include http(s) in the address it uses, so that was another task completed. The general .htaccess only needed the following lines to be added:

RewriteCond %{SERVER_PORT} 80
RewriteRule ^(.*)$ https://www.assortedexplorations.com/$1 [R,L]

What all these achieve is to redirect insecure connections to secure ones for every visitor to the website. After that, internal hyperlinks without https needed updating along with any forms so that a padlock sign could be shown for all pages.

With the main work completed, it was time to sort out a lingering niggle regarding the appearance of an FTP login page every time a WordPress installation or update was requested. The main solution was to make the web server account the owner of the files and directories, but the following line was added to wp-config.php as part of the fix even if it probably is not necessary:

define('FS_METHOD', 'direct');

There also was the non-operation of WP Cron and that was addressed using WP-CLI and a script from Bjorn Johansen. To make double sure of its effectiveness, the following was added to wp-config.php to turn off the usual WP-Cron behaviour:

define('DISABLE_WP_CRON', true);

Intriguingly, WP-CLI offers a long list of possible commands that are worth investigating. A few have been examined but more await attention.

Before those, I still need to get my new VPS to send emails. So far, sendmail has been installed, the hostname changed from localhost and the server restarted. More investigations are needed but what I have not is faster than what was there before, so the effort has been rewarded already.

Rethinking photo editing

17th April 2018

Photo editing has been something that I have been doing since my first-ever photo scan in 1998 (I believe it was in June of that year but cannot be completely sure nearly twenty years later). Since then, I have been using a variety of tools for the job and wondered how other photos can look better than my own. What cannot be excluded is my preference for being active in the middle of the day when light is at its bluest as well as a penchant for using a higher ISO of 400. In other words, what I do when making photos affects how they look afterwards as much as the weather that I had encountered.

My reason for mentioning the above aspects of photographic craft is that they affect what you can do in photo editing afterwards, even with the benefits of technological advancement. My tastes have changed over time, so the appeal of re-editing old photos fades when you realise that you only are going around in circles and there always are new ones to share, so that may be a better way to improve.

When I started, I was a user of Paint Shop Pro but have gone over to Adobe since then. First, it was Photoshop Elements, but an offer in 2011 lured me into having Lightroom and the full version of Photoshop. Nowadays, I am a Creative Cloud photography plan subscriber so I get to see new developments much sooner than once was the case.

Even though I have had Lightroom for all that time, I never really made full use of it and preferred a Photoshop-based workflow. Lightroom was used to select photos for Photoshop editing, mainly using adjustments for such things as tones, exposure, levels, hue and saturation. Removal of dust spots, resizing and sharpening were other parts of a still minimalist approach.

What changed all this was a day spent pottering about the 2018 Photography Show at the Birmingham NEC during a cold snap in March. That was followed by my checking out the Adobe YouTube Channel afterwards where there were videos of the talks featured every day of the four-day event. Here are some shortcuts if you want to do some catching up yourself: Day 1, Day 2, Day 3, and Day 4. Be warned though that these videos are long in that they feature the whole day and there are enough gaps that you may wish to fast-forward through them. Even so, there is quite a bit of variety of things to see.

Of particular interest were the talks given by the landscape photographer David Noton who sensibly has a philosophy of doing as little to his images as possible. It helps that his starting points are so good that adjusting black and white points with a little tonal adjustment does most of what he needs. Vibrancy, clarity and sharpening adjustments are kept to a minimum while some work with graduated filters evens out exposure differences between skies and landscapes. It helps that all this can be done in Lightroom, so that set me thinking about trying it out for size and the trick of using the backslash (\) key to switch between raw and processed views is a bonus granted by non-destructive editing. Others may have demonstrated the creation of composite imagery, but simplicity is more like my way of working.

Confusingly, we now have the cloud-based Lightroom CC while the previous desktop counterpart is known as Lightroom Classic CC. Though the former may allow for easy dust spot removal among other things, it is the latter that I prefer because the idea of wholesale image library upload does not appeal to me for now and I already have other places for off-site image backup like Google Drive and Dropbox. The mobile app does look interesting since it allows capturing images on a such a device in Adobe’s raw image format DNG. Still, my workflow is set to be more Lightroom-based than it once was and I quite fancy what new technology offers, especially since Adobe is progressing its Sensai artificial intelligence engine. The fact that it has access to many images on its systems due to Lightroom CC and its own stock library (Adobe Stock, formerly Fotolia) must mean that it has plenty of data for training this AI engine.

  • All the views that you find expressed on here in postings and articles are mine alone and not those of any organisation with which I have any association, through work or otherwise. As regards editorial policy, whatever appears here is entirely of my own choice and not that of any other person or organisation.

  • Please note that everything you find here is copyrighted material. The content may be available to read without charge and without advertising but it is not to be reproduced without attribution. As it happens, a number of the images are sourced from stock libraries like iStockPhoto so they certainly are not for abstraction.

  • With regards to any comments left on the site, I expect them to be civil in tone of voice and reserve the right to reject any that are either inappropriate or irrelevant. Comment review is subject to automated processing as well as manual inspection but whatever is said is the sole responsibility of the individual contributor.