Linux | Technology Tales

Useful Python packages for working with data

14th October 2021

My response to changes in the technology stack used in clinical research is to develop some familiarity with programming and scripting platforms that complement and compete with SAS, a system with which I have been programming since 2000. One of these has been R but Python is another that has taken up my attention and I now also have Julia in my sights as well. There may be others to assess in the fullness of time.

While I first started to explore the Data Science world in the autumn of 2017, it was in the autumn of 2019 that I began to complete LinkedIn training courses on the subject. Good though they were, I find that I need to actually use a tool in order to better understand it. At that time, I did get to hear about Python packages like Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn and Beautiful Soup though it took until of spring of this year for me to start gaining some hands-on experience with using any of these.

During the summer of 2020, I attended a BCS webinar on the CodeGrades initiative, a programming mentoring scheme inspired by the way classical musicianship is assessed. In fact, one of the main progenitors is a trained classical musician and teacher of classical music who turned to Python programming when starting a family so as to have a more stable income. The approach is that a student selects a project and works their way through it with mentoring and periodic assessments carried out in a gentle and discursive manner. Of course, the project has to be engaging for the learning experience to stay the course and that point came through in the webinar.

That is one lesson that resonates with me with subjects as diverse as web server performance and the ongoing pandemic pandemic supplying data and there are other sources of public data to examine as well before looking through my own personal archive gathered over the decades. Some subjects are uplifting while others are more foreboding but the key thing is that they sustain interest and offer opportunities for new learning. Without being able to dream up new things to try, my knowledge of R and Python would not be as extensive as it is and I hope that it will help with learning Julia too.

In the main, my own learning has been a solo effort with consultation of documentation along with web searches that have brought me to the likes of Real Python, Stack Abuse, Data Viz with Python and R and others for longer tutorials as well as threads on Stack Overflow. Usually, the web searching begins when I need a steer on a particular or a way to resolve a particular error or warning message but books always are worth reading even if that is the slower route. Those from the Dummies series or from O’Reilly have proved must useful so far but I do need to read them more completely than I already have; it is all too tempting to go with the try the “programming and search for solutions as you go” approach instead.

To get going, many choose the Anaconda distribution to get Jupyter notebook functionality but I prefer a more traditional editor so Spyder has been my tool of choice for Python programming and there are others like PyCharm as well. Spyder itself is written in Python so it can be installed using pip from PyPi like other Python packages. It has other dependencies like Pylint for code management activities but these get installed behind the scenes.

The packages that I first met in 2019 may be the mainstays for doing data science but I have discovered others since then. It also seems that there is porosity between the worlds of R an Python so you get some Python packages aping R packages and R has the Reticulate package for executing Python code. There are Python counterparts to such Tidyverse stables as dply and ggplot2 in the form of Siuba and Plotnine, respectively. The syntax of these packages are not direct copies of what is executed in R but they are close enough for there to be enough familiarity for added user friendliness compared to Pandas or Matplotlib. The interoperability does not stop there for there is SQLAlchemy for connecting to MySQL and other databases (PyMySQL is needed as well) and there also is SASPy for interacting with SAS Viya.

Pyhton may not have the speed of Julia but there are plenty of packages for working with larger workloads. Of these, Dask, Modin and RAPIDS all have there uses for dealing with data volumes that make Pandas code crawl. As if to prove that there are plenty of libraries for various forms of data analytics, data science, artificial intelligence and machine learning, there also are the likes of Keras, TensorFlow and NetworkX. These are just a selection of what is available and there is no need not to check out more. It may be tempting to stick with the most popular packages all the time, especially when they do so much, but it never hurst to keep an open mind either.

Stopping Firefox from launching on the wrong virtual desktop on Linux Mint

12th October 2021

During the summer, I discovered that Firefox was steadfastly opening on the same virtual desktop on Linux Mint (the Cinnamon version) regardless of the one on which it was started. Being a creature of habit who routinely opens Firefox within the same virtual desktop all the time, this was not something that I had noticed until the upheaval of a system rebuild. The supposed cause is setting the browser to reopen tabs from the preceding session. The settings change according to the version of Firefox but it is found in Settings > General in the version in which I am writing these words (Firefox Developer Edition 94.0b4) and the text beside the tick box is Open previous windows and tabs.

While disabling the aforementioned setting could work, there is another less intrusive solution. This needs the opening of a new tab and the entering of the address about:config in the address bar. If you see a warning message about the consequences of proceed further, accept responsibility using the interface as you do just that. In the resulting field marked Search preference name, enter the text widget.disable-workspace-management and toggle the setting from false to true in order to activate it. Then, Firefox should open on the desktop where you want it and not some other default location.

Something to watch with the SYSODSESCAPECHAR automatic SAS macro variable

10th October 2021

Recently, a client of mine updated one of their systems from SAS 9.4 M5 to SAS 9.4 M7. In spite of performing due diligence regarding changes between the maintenance release, a change in behaviour of the SYSODSESCAPECHAR automatic macro variable surprised them. The macro variable captures the assignment of the ODS escape character used to prefix RTF codes for page numbering and other things. That setting is made using an ODS ESCAPECHAR statement like the following:

ods escapechar="~";

In the M5 release, the tilde character in this example was output by the automatic macro variable but that changed in the M7 release to 7E, the hexadecimal code for the same and this tripped up one of their validated macro programs used in output production. The adopted solution was to use the escape sequence (*ESC*) that gave the same outcome that was there before the change. That was less verbose than alternative code changing the hexadecimal code into the expected ASCII character that follows.

data _null_; call symput("new",byte(input("&sysodsescapechar.",hex.))); run;

The above supplies a hexadecimal code to the BYTE function for correct rendering with the SYMPUT routine assigning the resulting value to a macro variable named new. Just using the escape sequence is far more succinct though there is now an added validation need once user pilot testing has completed. In my line of business, the updating of code is the quickest part of many such changes; documentation and testing always take longer.

Limiting Google Drive upload & synchronisation speeds using Trickle

9th October 2021

Having had a mishap that lost me some photos in the early days of my dalliance with digital photography, I have been far more careful since then and that now applies to other files as well. Doing regular backups is a must that you find reiterated by many different authors and the current computing climate makes doing that more vital than it ever was.

So, as well as having various local backups, I also have remote ones in the form of OneDrive, Dropbox and Google Drive. These more correctly are file synchronisation services but disciplined use can make them useful as additional storage facilities in the interests of maintaining added resilience. There also are dedicated backup services that I have seen reviewed in the likes of PC Pro magazine but I have to make use of those.

Insync

Part of my process for dealing with new digital photo files is to back them up to Google Drive and I did that with a Windows client in the early days but then moved to Insync running on Linux Mint. One drawback to the approach is that this hogs the upload bandwidth of an internet connection that has yet to move to fibre from copper cabling. Having fibre connections to a local cabinet helps but a 100 KiB/s upload speed is easily overwhelmed and digital photo file sizes keep increasing. It does not help that I insist on using more flexible raw formats like DNG, CR2 or CR3 either.

Making fewer images could help to cut the load but I still come away from an excursion with many files because I get so besotted with my surroundings. This means that upload sessions take numerous hours and can extend across calendar days. Ultimately, this makes my internet connection far less usable so I want to throttle upload speed much like what is possible in the Transmission BitTorrent client or in the Dropbox client. Unfortunately, this is not available in Insync so I have tried using the trickle command instead and an example is below:

trickle -d 2000 -u 50 insync

Here, the upload speed is limited to 50 KiB/s while the download speed is limited to 2000 KiB/s. In my case, the latter of these hardly matters while the former leaves me with acceptable internet usability. Insync does not work smoothly with this, however, so occasional restarts are needed to keep file uploads progressing and CPU load also is higher. As rough as the user experience feels, uploads can continue in parallel with other work.

gdrive

One other option that I am exploring is the use of the command-line tool gdrive and this appears to work well with trickle. After downloading and installing the tool, getting going is a matter of issuing the following command and following the instructions:

gdrive about

On web servers, I even have the tool backing up things to Google Drive on a scheduled basis. Because of a Google Drive limitation that I have encountered not only with gdrive but also with Insync and Google’s own Windows Google Drive client, synchronisation only can happen with two new folders, one local and the other remote. Handily, gdrive supports the usual bash style commands for working with remote directories so something like the following will create a directory on Google Drive:

gdrive mkdir ttdc [ID for parent folder]

Here, the ID for the parent folder may be omitted but it can be obtained by going to Google Drive online and getting a link location by right-clicking on a folder and choosing the appropriate context menu item. This gets you something like the following and the required identifier is found between the last slash and the first question mark in the address string (so as not to share any real links, I made the address more general below):

https://drive.google.com/drive/folders/[remote folder ID]?usp=sharing

Then, synchronisation uses a command like the following:

gdrive sync upload [local folder or file path] [remote folder ID]

There also is the option to do a one-way upload and this is the form of the command used:

gdrive upload [local folder or file path] -p [remote folder ID]

Because every file or folder object has its own ID on Google Drive, it is possible to create two objects on there that appear to have the same name though that is sure to cause confusion even if you know what is happening. It is possible in each of the above to throttle them using trickle as well:

trickle -d 2000 -u 50 gdrive sync upload [local folder or file path] [remote folder ID] trickle -d 2000 -u 50 gdrive upload [local folder or file path] -p [remote folder ID]

Handily, this works without the added drama seen with Insync and lends itself to scripting as well so it could be something that I will incorporate into my current workflow. One thing that needs to be watched is file upload failures but there may be ways to catch those and retry them so that would another thing that needs doing. This is built into Insync and it would be a learning opportunity if I was to stick with gdrive instead.

When CRON is stalled by incorrect file and folder permissions

8th October 2021

During the past week, I rebooted my system only to find that a number of things no longer worked and my Pi-hole DNS server was among them. Having exhausted other possibilities by testing out things on another machine, I did a status check when I spotted a line like the following in my system logs and went investigating further:

cron[322]: (root) INSECURE MODE (mode 0600 expected) (crontabs/root)

It turned out to be more significant than I had expected because this was why every CRON job was failing and that included the network set up needed by Pi-hole; a script is executed using the @reboot directive to accomplish this and I got Pi-hole working again by manually executing it. The evening before, I did make some changes to file permissions under /var/www but I was not expecting it to affect other parts of /var though that may have something to do with some forgotten heavy-handedness. The cure was to issue a command like the following for execution in a terminal session:

sudo chmod -R 600 /var/spool/cron/crontabs/

Then, CRON itself needed starting since it had not not running at all and executing this command did the needful without restarting the system:

sudo systemctl start cron

That outcome was proved by executing the following command to issue some terminal output that include the welcome text “active (running)” highlighted in green:

sudo systemctl status cron

There was newly updated output from a frequently executing job that checked on web servers for me but this was added confirmation. It was a simple solution to a perplexing situation that led up all sorts of blind alleys before I alighted on the right solution to the problem.

When a hard drive is unrecognised by the Linux hddtemp command

15th August 2021

One should not do a new PC build in the middle of a heatwave if you do not want to be concerned about how fast fans are spinning and how hot things are getting. Yet, that is what I did last month after delaying the act for numerous months.

My efforts mean that I have a system built around an AMD Ryzen 9 5950X CPU and a Gigabyte X570 Aorus Pro with 64 GB of memory and things are settling down after the initial upheaval. That also meant some adjustments to the CPU fan profile in the BIOS for quieter running while the the use of Be Quiet! Dark Rock 4 cooler also helps as does a Be Quiet! Silent Wings 3 case fan. All are components from trusted brands though I wonder how much abuse they got during their installation and subsequent running in.

Fan noise is a non-quantitative indicator of heat levels as much as touch so more quantitative means are in order. Aside from using a thermocouple device, there are in-built sensors too. My using Linux Mint means that I have the sensors command from the lm-sensors package for checking on CPU and other temperatures though hddtemp is what you need for checking on the same for hard drives. The latter can be used as follows:

sudo hddtemp /dev/sda /dev/sdb

This has to happen using administrator access and a list of drives needs to be provided because it cannot find them by itself. In my case, I have no mechanical hard drives installed in non-NAS systems and I even got to replacing a 6 TB Western Digital Green disk with an 8 TB SSD but I got the following when I tried checking on things with hddtemp:

WARNING: Drive /dev/sda doesn't seem to have a temperature sensor. WARNING: This doesn't mean it hasn't got one. WARNING: If you are sure it has one, please contact me ([email protected]). WARNING: See --help, --debug and --drivebase options. /dev/sda: Samsung SSD 870 QVO 8TB: no sensor

The cause of the message for me was that there is no entry for Samsung SSD 870 QVO 8TB in /etc/hddtemp.db so that needed to be added there. Before that could be rectified, I needed to get some additional information using smartmontools and these needed to be installed using the following command:

sudo apt-get install smartmontools

What I needed to do was check the drive’s SMART data output for extra information and that was achieved using the following command:

sudo smartctl /dev/sda -a | grep -i Temp

What this does is to look for the temperature information from smartctl output using the grep command with output from the first being passed to the second through a pipe. This yielded the following:

190 Airflow_Temperature_Cel 0x0032 072 050 000 Old_age Always - 28

The first number in the above (190) is the thermal sensor’s attribute identifier and that was needed in what got added to /etc/hddtemp.db. The following command added the necessary data to the aforementioned file:

echo \"Samsung SSD 870 QVO 8TB\" 190 C \"Samsung SSD 870 QVO 8TB\" | sudo tee -a /etc/hddtemp.db

Here, the output of the echo command was passed to the tee command for adding to the end of the file. In the echo command output, the first part is the name of the drive, the second is the heat sensor identifier, the third is the temperature scale (C for Celsius or F for Fahrenheit) and the last part is the label (it can be anything that you like but I kept it the same as the name). On re-running the hddtemp command, I got output like the following so all was as I needed it to be.

/dev/sda: Samsung SSD 870 QVO 8TB: 28°C

Since then, temperatures may have cooled and the weather become more like what we usually get but I am still keeping an eye on things, especially when the system is put under load using Perl, R, Python or SAS. There may be further modifications such as changing the case or even adding water cooling, not least to have a cooler power supply unit, but nothing is being rushed as I monitor things to my satisfaction.

Changing the UUID of a VirtualBox Virtual Disk Image in Linux

11th July 2021

Recent experimentation centring around getting my hands on a test version of Windows 11 had me duplicating virtual machines and virtual disk images though VirtualBox still is not ready for the next Windows version; it has no TPM capability at the moment. Nevertheless, I was able to get something after a fresh installation that removed whatever files were on the disk image. That meant that I needed to mount the old version to get at those files again.

Renaming partially helped with this but what I really needed to do was change the UUID so VirtualBox would not report a collision between two disk images with the same UUID. To avoid this, the UUID of one of the disk images had to be changed and a command like the following was used to accomplish this:

VBoxManage internalcommands sethduuid [Virtual Disk Image Name].vdi

Because I was doing this on Linux Mint, I could call VBoxManage without need to tell the system where it was as would be the case in Windows. Otherwise, it is the sethduuid portion that changes the UUID as required. Another way around this is to clone the VDI file using the following command but I had not realised that at the time:

VBoxManage clonevdi [old virtual disk image].vdi [new virtual disk image].vdi

It seems that there can be more than way to do things in VirtualBox at times so the second way will remain on reference for the future.

Getting rid of the Windows Resizing message from a Manjaro VirtualBox guest

27th July 2020

Like Fedora, Manjaro also installs a package for VirtualBox Guest Additions when you install the Linux distro in a VirtualBox virtual machine. However, it does have certain expectations when doing this. On many systems and my own is one of these, Linux guests are forced to use the VMSVGA virtual graphics controller while Windows guests are allowed to use the VBoxSVGA one. It is the latter that Manjaro expects so you get a message like the following appearing when the desktop environment has loaded:

Windows Resizing Set your VirtualBox Graphics Controller to enable windows resizing

After ensuring that gcc, make, perl and kernel headers are installed, I usually install VirtualBox Guest Additions myself from the included ISO image and so I did the same with Manjaro. Doing that and restarting the virtual machine got me extra functionality like screen resizing and being able to copy and paste between the VM and elsewhere after choosing the Bidirectional setting in the menus under Devices > Shared Clipboard.

That still left an unwanted message popping up on startup. To get rid of that, I just needed to remove /etc/xdg/autostart/mhwd-vmsvga-alert.desktop. It can be deleted but I just moved it somewhere else and a restart proved that the message was gone as needed. Now everything is working as I wanted.

Contents not displaying for Shared Folders on a Fedora 32 guest instance in VirtualBox

26th July 2020

While some Linux distros like Fedora install VirtualBox drivers during installation time, I prefer to install the VirtualBox Guest Additions themselves. Before doing this, it is best to remove the virtualbox-guest-additions package from Fedora to avoid conflicts. After that, execute the following command to ensure that all prerequisites for the VirtualBox Guest Additions are in place prior to mounting the VirtualBox Guest Additions ISO image and installing from there:

sudo dnf -y install gcc automake make kernel-headers dkms bzip2 libxcrypt-compat kernel-devel perl

During the installation, you may encounter a message like the following:

ValueError: File context for /opt/VBoxGuestAdditions-<VERSION>/other/mount.vboxsf already defined

This is generated by SELinux so the following commands need executing before the VirtualBox Guest Additions installation is repeated:

sudo semanage fcontext -d /opt/VBoxGuestAdditions-<VERSION>/other/mount.vboxsf sudo restorecon /opt/VBoxGuestAdditions-<VERSION>/other/mount.vboxsf

Without doing the above step and fixing the preceding error message, I had an issue with mounting of Shared Folders whereby the mount point was set up but no folder contents were displayed. This happened even when my user account was added to the vboxsf group and it proved to be the SELinux context issue that was the cause.

Removing obsolete libraries from Flatpak

1st February 2020

Along with various pieces of software, Flatpak also installs KDE and GNOME libraries needed to support them. However, it does not always remove obsolete versions of those libraries whenever software gets updated. One result is that messages regarding obsolete versions of GNOME may be issued and this has been known to cause confusion because there is the GNOME instance that is part of a Linux distro like Ubuntu and using Flatpak adds another one for its software packages to use. My use of Linux Mint may lesson the chances of misunderstanding.

Thankfully, executing a single command will remove any obsolete Flatpak libraries so the messages no longer appear and there then is no need to touch your actual Linux installation. This then is the command that sorted it for me:
flatpak uninstall --unused && sudo flatpak repair

The first part that removes any unused libraries is run as a normal user so there is no error in the above command. Administrative privileges are needed for the second section that does any repairs that are needed. It might be better if Flatpak did all this for you using the update command but that is not how the thing works. At least, there is a quick way to address this state of affairs and there might be some good reasons for having things work as they do.

« Older Entries « » Newer Entries »

Insync

gdrive

Topics Discussed

Translate