Technology Tales

Adventures in consumer and enterprise technology

Performing parallel processing in Perl scripting with the Parallel::ForkManager module

30th September 2019

In a previous post, I described how to add Perl modules in Linux Mint, while mentioning that I hoped to add another that discusses the use of the Parallel::ForkManager module. This is that second post, and I am going to keep things as simple and generic as they can be. There are other articles like one on the Perl Maven website that go into more detail.

The first thing to do is ensure that the Parallel::ForkManager module is called by your script; having the line of code presented below near the top of the file will do just that. Without this step, the script will not be able to find the required module by itself and errors will be generated.

use Parallel::ForkManager;

Then, the maximum number of threads needs to be specified. While that can be achieved using a simple variable declaration, the following line reads this from the command used to invoke the script. It even tells a forgetful user what they need to do in its own terse manner. Here $0 is the name of the script and N is the number of threads. Not all these threads will get used and processing capacity will limit how many actually are in use, which means that there is less chance of overwhelming a CPU.

my $forks = shift or die "Usage: $0 N\n";

Once the maximum number of available threads is known, the next step is to instantiate the Parallel::ForkManager object as follows to use these child processes:

my $pm = Parallel::ForkManager->new($forks);

With the Parallel::ForkManager object available, it is now possible to use it as part of a loop. A foreach loop works well, though only a single array can be used, with hashes being needed when other collections need interrogation. Two extra statements are needed, with one to start a child process and another to end it.

foreach $t (@array) {
my $pid = $pm->start and next;
<< Other code to be processed >>
$pm->finish;
}

Since there is often other processing performed by script, and it is possible to have multiple threaded loops in one, there needs to be a way of getting the parent process to wait until all the child processes have completed before moving from one step to another in the main script and that is what the following statement does. In short, it adds more control.

$pm->wait_all_children;

To close, there needs to be a comment on the advantages of parallel processing. Modern multicore processors often get used in single threaded operations, which leaves most of the capacity unused. Utilising this extra power then shortens processing times markedly. To give you an idea of what can be achieved, I had a single script taking around 2.5 minutes to complete in single threaded mode, while setting the maximum number of threads to 24 reduced this to just over half a minute while taking up 80% of the processing capacity. This was with an AMD Ryzen 7 2700X CPU with eight cores and a maximum of 16 processor threads. Surprisingly, using 16 as the maximum thread number only used half the processor capacity, so it seems to be a matter of performing one's own measurements when making these decisions.

Installing Perl modules using CPAN on Linux Mint 19.2

28th September 2019

My online travel photo gallery is a self-coded set of PHP scripts that read data from tables in a MySQL database. These tables are built from input XML files using a Perl script that itself creates and executes an SQL script. The Perl script also does some image processing using GraphicsMagick commands to resize images and to add copyright information and image framing. Because this processed one image at a time sequentially, it was taking several minutes to complete and only partly used the capacity of the PC that I used.

This led me to look at adding parallel processing, which is what brought me to looking at the Parallel::ForkManager Perl module. An alternative approach might have been to add new images in such a way as not to need the full run involving hundreds of image files, but that will take more work and I fancied having a look at parallelising things anyway.

If it was not there already, the first act would have been to install build-essential to get access to the cpan command. The following command accomplishes this:

sudo apt-get install build-essential

Once that is there, the cpan command needs to be run and some questions answered to get things going. The first question to answer is whether you want setup to be as automated as possible, and the default answer of yes worked for me. The next question to answer regards the approach that cpan takes when installing modules: I chose sudo here (local::lib is the default value and manual is another option). After this, cpan drops into its own command shell. Here, I issued two more commands to continue the basic setup by updating CPAN.pm to the latest version and adding Bundle::CPAN to optimise the module further:

make install
install Bundle::CPAN

Continuing the last of these may need extra intervention to confirmation the suggested default of exit at one point in its operation, which takes a little time to complete. It is after this that Parallel::ForkManager can be installed using the following command:

install Parallel::ForkManager

That completed quickly and the cpan shell was exited using its exit command. Then, the new module was available in scripting after that. The actual use of this module is something that hope to describe in another post, so I am ending this one here, and the same process is just as applicable to setting up cpan and adding any other Perl CPAN module.

Creating a data-driven informat in SAS

27th September 2019

Recently, I needed to create some example data with an extra numeric identifier variable that would be assigned according to the value of a character identifier variable. Not wanting to add another dataset merge or join to the code, I decided to create an informat from data. Initially, I looked into creating a format instead, but it did not accomplish what I wanted to do.

data patient;
    keep fmtname start end label type;
    set test.dm;
    by subject;
    fmtname="PATIENT";
    start=subject;
    end=start;
    label=patient;
    type="I";
run;

The input data needed a little processing as shown above. The format name was defined in the variable FMTNAME and the TYPE variable was assigned a value of I to make this a numeric informat; to make character equivalent, a value of J was assigned. The START and END variables declare the value range associated with the value of the LABEL variable that would become the actual value of the numeric identifier variable. The variable names are fixed because the next step will not work with different ones.

proc format lib=work cntlin=patient;
run;
quit;

To create the actual informat, the dataset is read by the FORMAT procedure with the CNTLIN parameter specifying the name of the input dataset and LIB defining the library where the format catalogue is stored. When this in complete, the informat is available for use with an input function as shown in the code excerpt below.

data ae1;
    set ae;
    patient=input(subject,patient.);
run;

Compressing an entire dataset library using SAS

26th September 2019

Turning dataset compression for SAS datasets can produce quite a reduction in size, so it is often standard practice to do just this. It can be set globally with a single command, and many working systems do this for you:

options compress=yes;

It also can be done on a dataset by dataset option by adding (compress=yes) beside the dataset name on the data line of a data step. The size reduction can be gained retrospectively too, but only if you create additional copies of the datasets. The following code uses the DATASETS procedure to accomplish this end at the time of copying the datasets from one library to another:

proc datasets lib=adam nolist;
    copy inlib=adam outlib=adamc noclone datecopy memtype=data;
run;
quit;

The NOCLONE option on the COPY statement allows the compression status to be changed, while the DATECOPY one keeps the same date/time stamp on the new file at the operating system level. Lastly, the MEMTYPE=DATA setting ensures that only datasets are processed, while the NOLIST option on the PROC DATASETS line suppresses listing of dataset information.

It may be possible to do something like this with PROC COPY too, but I know from use that the option presented here works as expected. It also does not involve much coding effort, which helps a lot.

Creating a VirtualBox virtual disk image using the Linux command line

9th September 2019

Much of the past weekend was spent getting a working Debian 10 installation up and running in a VirtualBox virtual machine. Because I chose the Cinnamon desktop environment, the process was not as smooth as I would have liked, so a minimal installation was performed before I started to embellish as I liked. Along the way, I got to wondering if I could create virtual hard drives using the command line, and I found that something like the following did what was needed:

VBoxManage createmedium disk --filename <full path including file name without extension> -size <size in MiB> --format VDI --variant Standard

Most of the options are self-explanatory, apart from the one named variant. This defines whether the VDI file expands to the maximum size specified using the size parameter or is reserved with the size defined in that parameter. Two VDI files were created in this way and I used these to replace their Debian 8 predecessors and even to save a bit of space too. If you want, you can find out more in the user documentation, but this post hopefully gets you started anyway.

Getting Eclipse to start without incompatibility errors on Linux Mint 19.1

12th June 2019

Recent curiosity about Java programming and Groovy scripting got me trying to start up the Eclipse IDE that I had installed on my main machine. What I got instead of a successful application startup was a message that included the following:

!MESSAGE Exception launching the Eclipse Platform:
!STACK
java.lang.ClassNotFoundException: org.eclipse.core.runtime.adaptor.EclipseStarter
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:466)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:566)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:499)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:626)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:584)
at org.eclipse.equinox.launcher.Main.run(Main.java:1438)
at org.eclipse.equinox.launcher.Main.main(Main.java:1414)

The cause was a mismatch between Eclipse and the installed version of Java that it needed to run. After all, the software itself is written in the Java language and the installed version from the usual software repositories was too old for Java 11. The solution turned out to be installing a newer version as a Snap (Ubuntu's answer to Flatpak). The following command did the needful since snapd already was running on my machine:

sudo snap install eclipse --classic

The only part of the command that warrants extra comment is the --classic switch, since that is needed for a tool like Eclipse that needs to access a host file system. On executing, the software was downloaded from Snapcraft and then installed within its own bundle of dependencies. The latter adds a certain detachment from the underlying Linux installation and ensures that no messages appear because of incompatibilities like the one near the start of this post.

Fixing an update error in OpenMediaVault 4.0

10th June 2019

For a time, I found that executing the command omv-update in OpenMediaVault 4.0 produced the following Python errors appeared, among other more benign messages:

Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0xb7099d64>
Traceback (most recent call last):
File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0xb7099d64>
Traceback (most recent call last):
File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable

Not wanting a failed update, I decided that I needed to investigate this and found that /usr/lib/python3.5/weakref.py required the following updates to lines 109 and 117, respectively:

def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):

_atomic_removal(d, wr.key)

To be more clear, the line beginning with "def" is how line 109 should appear, while the line beginning with _atomic_removal is how line 117 should appear. Once the required edits were made and the file closed, re-running omv-update revealed that the problem was fixed and that is how things remain at the time of writing.

Running cron jobs using the www-data system account

22nd December 2018

When you set up your own web server or use a private server (virtual or physical), you will find that web servers run using the www-data account. That means that website files need to be accessible to that system account if not owned by it. The latter is mandatory if you want WordPress to be able to update itself with needing FTP details.

It also means that you probably need scheduled jobs to be executed using the privileges possessed by the www-data account. For instance, I use WP-CLI to automate spam removal and updates to plugins, themes and WordPress itself. Spam removal can be done without the www-data account, but the updates need file access and cannot be completed without this. Therefore, I got interested in setting up cron jobs to run under that account and the following command helps to address this:

sudo -u www-data crontab -e

For that to work, your own account needs to be listed in /etc/sudoers or be assigned to the sudo group in /etc/group. If it is either of those, then entering your own password will open the cron file for www-data, and it can be edited as for any other account. Closing and saving the session will update cron with the new job details.

In fact, the same approach can be taken for a variety of commands where files only can be accessed using www-data. This includes copying, pasting and deleting files as well as executing WP-CLI commands. The latter issues a striking message if you run a command using the root account, a pervasive temptation given what it allows. Any alternative to the latter has to be better from a security standpoint.

Enlarging a VirtualBox VDI virtual disk

20th December 2018

It is remarkable how the Windows folder manages to grow on your C drive, and one in a Windows 7 installation was the cause of my needing to expand the VirtualBox virtual machine VDI disk on which it was installed. After trying various ways to cut down the size, an enlargement could not be avoided. In all of this, it was handy that I had a recent backup for restoration after any damage.

The same thing meant that I could resort to enlarging the VDI file with more peace of mind than otherwise might have been the case. This needed use of the command line once the VM was shut down. The form of the command that I used was the following:

VBoxManage modifyhd <filepath/filename>.vdi --resize 102400

It appears that this also would work on a Windows host, but mine was Linux, and it did what I needed. The next step was to attach it to an Ubuntu VM and use GParted to expand the main partition to fill the newly available space. That does not mean that it takes up 100 GiB on my system just yet because these things can be left to grow over time and there is a way to shrink them too if you ever need to do just that. As ever, having a backup made before any such operation may have its uses if anything goes awry.

Moving a website from shared hosting to a virtual private server

24th November 2018

This year has seen some optimisation being applied to my web presences, guided by the results of GTMetrix scans. It was then that I realised how slow things were, so server loads were reduced. Anything that slowed response times, such as WordPress plugins, got removed. Usage of Matomo also was curtailed in favour of Google Analytics, while HTML, CSS and JS minification followed. Something that had yet to happen was a search for a faster server. Now, another website has been moved onto a virtual private server (VPS) to see how that would go.

Speed was not the only consideration, since security was a factor too. After all, a VPS is more locked away from other users than a folder on a shared server. There also is the added sense of control, so Let's Encrypt SSL certificates can be added using the Electronic Frontier Foundation's Certbot. That avoids the expense of using an SSL certificate provided through my shared hosting provider, and a successful transition for my travel website may mean that this one undergoes the same move.

For the VPS, I chose Ubuntu 18.04 as its operating system, and it came with the LAMP stack already in place. Have offload development websites, the mix of Apache, MySQL and PHP is more familiar to me than anything using Nginx or Python. It also means that .htaccess files become more useful than they were on my previous Nginx-based platform. Having full access to the operating system using SSH helps too and should mean that I have fewer calls on technical support since I can do more for myself. Any extra tinkering should not affect others either, since this type of setup is well known to me and having an offline counterpart means that anything riskier is tried there beforehand.

Naturally, there were niggles to overcome with the move. The first to fix was to make the MySQL instance accept calls from outside the server so that I could migrate data there from elsewhere, and I even got my shared hosting setup to start using the new database to see what performance boost it might give. To make all this happen, I first found the location of the relevant my.cnf configuration file using the following command:

find / -name my.cnf

Once I had the right file, I commented out the following line that it contained and restarted the database service afterwards, using another command to stop the appearance of any error 111 messages:

bind-address 127.0.0.1
service mysql restart

After that, things worked as required and I moved onto another matter: uploading the requisite files. That meant installing an FTP server, so I chose proftpd since I knew that well from previous tinkering. Once that was in place, file transfer commenced.

When that was done, I could do some testing to see if I had an active web server that loaded the website. Along the way, I also instated some Apache modules like mod-rewrite using the a2enmod command, restarting Apache each time I enabled another module.

Then, I discovered that Textpattern needed php-7.2-xml installed, so the following command was executed to do this:

apt install php7.2-xml

Then, the following line was uncommented in the correct php.ini configuration file that I found using the same method as that described already for the my.cnf configuration and that was followed by yet another Apache restart:

extension=php_xmlrpc.dll

Addressing the above issues yielded enough success for me to change the IP address in my Cloudflare dashboard so it pointed at the VPS and not the shared server. The changeover happened seamlessly without having to await DNS updates as once would have been the case. It had the added advantage of making both WordPress and Textpattern work fully.

With everything working to my satisfaction, I then followed the instructions on Certbot to set up my new Let's Encrypt SSL certificate. Aside from a tweak to a configuration file and another Apache restart, the process was more automated than I had expected, so I was ready to embark on some fine-tuning to embed the new security arrangements. That meant updating .htaccess files and Textpattern has its own, so the following addition was needed there:

RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

This complemented what was already in the main .htaccess file, and WordPress allows you to include http(s) in the address it uses, so that was another task completed. The general .htaccess only needed the following lines to be added:

RewriteCond %{SERVER_PORT} 80
RewriteRule ^(.*)$ https://www.assortedexplorations.com/$1 [R,L]

What all these achieve is to redirect insecure connections to secure ones for every visitor to the website. After that, internal hyperlinks without https needed updating along with any forms so that a padlock sign could be shown for all pages.

With the main work completed, it was time to sort out a lingering niggle regarding the appearance of an FTP login page every time a WordPress installation or update was requested. The main solution was to make the web server account the owner of the files and directories, but the following line was added to wp-config.php as part of the fix even if it probably is not necessary:

define('FS_METHOD', 'direct');

There also was the non-operation of WP Cron and that was addressed using WP-CLI and a script from Bjorn Johansen. To make double sure of its effectiveness, the following was added to wp-config.php to turn off the usual WP-Cron behaviour:

define('DISABLE_WP_CRON', true);

Intriguingly, WP-CLI offers a long list of possible commands that are worth investigating. A few have been examined, but more await attention.

Before those, I still need to get my new VPS to send emails. So far, sendmail has been installed, the hostname changed from localhost and the server restarted. More investigations are needed, but what I have not is faster than what was there before, so the effort has been rewarded already.

  • The content, images, and materials on this website are protected by copyright law and may not be reproduced, distributed, transmitted, displayed, or published in any form without the prior written permission of the copyright holder. All trademarks, logos, and brand names mentioned on this website are the property of their respective owners. Unauthorised use or duplication of these materials may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties.

  • All comments on this website are moderated and should contribute meaningfully to the discussion. We welcome diverse viewpoints expressed respectfully, but reserve the right to remove any comments containing hate speech, profanity, personal attacks, spam, promotional content or other inappropriate material without notice. Please note that comment moderation may take up to 24 hours, and that repeatedly violating these guidelines may result in being banned from future participation.

  • By submitting a comment, you grant us the right to publish and edit it as needed, whilst retaining your ownership of the content. Your email address will never be published or shared, though it is required for moderation purposes.