Performing parallel processing in Perl scripting with the Parallel::ForkManager module
30th September 2019

In a previous post, I described how to add Perl modules in Linux Mint, while mentioning that I hoped to follow up with another post discussing the use of the Parallel::ForkManager module. This is that second post, and I am going to keep things as simple and generic as they can be. There are other articles, like one on the Perl Maven website, that go into more detail.
The first thing to do is to ensure that the Parallel::ForkManager module is loaded by your script; having the line of code presented below near the top of the file will do just that. Without this step, the script will not be able to find the required module by itself and errors will be generated.
use Parallel::ForkManager;
Then, the maximum number of child processes needs to be specified. While that can be achieved using a simple variable declaration, the following line reads it from the command used to invoke the script, and it even tells a forgetful user what they need to do in its own terse manner. Here, $0 is the name of the script and N is the maximum number of child processes. Not all of these will necessarily be busy at once, since the operating system schedules them according to the available processing capacity, which means that there is less chance of overwhelming a CPU.
my $forks = shift or die "Usage: $0 N\n";
Once the maximum number of child processes is known, the next step is to instantiate the Parallel::ForkManager object as follows:
my $pm = Parallel::ForkManager->new($forks);
With the Parallel::ForkManager object available, it is now possible to use it as part of a loop. A foreach loop works well, though only a single array can be iterated over; hashes are needed when other collections require interrogation. Two extra statements are needed: one to start a child process and another to end it.
foreach my $t (@array) {
    my $pid = $pm->start and next;
    # other code to be processed
    $pm->finish;
}
Since there is often other processing performed by a script, and it is possible to have multiple parallelised loops in one, there needs to be a way of getting the parent process to wait until all the child processes have completed before moving from one step to the next in the main script, and that is what the following statement does. In short, it adds more control.
$pm->wait_all_children;
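Pulling these pieces together, here is a minimal self-contained sketch of the whole approach; the array contents and the work done inside the loop are made up purely for illustration:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

# maximum number of child processes, read from the command line
my $forks = shift or die "Usage: $0 N\n";
my $pm = Parallel::ForkManager->new($forks);

# these file names are invented for the example
my @files = ("one.xml", "two.xml", "three.xml");

foreach my $file (@files) {
    my $pid = $pm->start and next;   # parent skips ahead; child carries on
    print "Processing $file in process $$\n";
    $pm->finish;                     # child process ends here
}

$pm->wait_all_children;              # parent waits for every child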
To close, there needs to be a comment on the advantages of parallel processing. Modern multicore processors often get used for single-threaded operations, which leaves most of their capacity unused. Utilising this extra power shortens processing times markedly. To give you an idea of what can be achieved, I had a script taking around 2.5 minutes to complete in single-threaded mode, while setting the maximum number of child processes to 24 reduced this to just over half a minute while taking up 80% of the processing capacity. This was with an AMD Ryzen 7 2700X CPU with eight cores and a maximum of 16 processor threads. Surprisingly, using 16 as the maximum only used half the processor capacity, so it seems to be a matter of performing one's own measurements when making these decisions.
Installing Perl modules using CPAN on Linux Mint 19.2
28th September 2019

My online travel photo gallery is a self-coded set of PHP scripts that read data from tables in a MySQL database. These tables are built from input XML files using a Perl script that itself creates and executes an SQL script. The Perl script also does some image processing using GraphicsMagick commands to resize images and to add copyright information and image framing. Because this processed one image at a time sequentially, it was taking several minutes to complete and only partly used the capacity of the PC that I used.
This led me to explore parallel processing, which is what brought me to the Parallel::ForkManager Perl module. An alternative approach might have been to add new images in such a way as not to need a full run involving hundreds of image files, but that would take more work, and I fancied having a look at parallelising things anyway.
If it was not there already, the first act would have been to install build-essential to get access to the cpan command. The following command accomplishes this:
sudo apt-get install build-essential
Once that is there, the cpan command needs to be run and some questions answered to get things going. The first question to answer is whether you want the setup to be as automated as possible, and the default answer of yes worked for me. The next question concerns the approach that cpan takes when installing modules: I chose sudo here (local::lib is the default value and manual is another option). After this, cpan drops into its own command shell. Here, I issued two more commands to continue the basic setup by updating CPAN.pm to the latest version and adding Bundle::CPAN to optimise the tool further:
install CPAN
install Bundle::CPAN
Continuing with the last of these may need extra intervention to confirm the suggested default of exit at one point in its operation, and it takes a little time to complete. It is after this that Parallel::ForkManager can be installed using the following command:
install Parallel::ForkManager
That completed quickly, and the cpan shell was exited using its exit command. The new module was available in scripting after that. The actual use of this module is something that I hope to describe in another post, so I am ending this one here; the same process is just as applicable to setting up cpan and adding any other Perl CPAN module.
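Before finishing, it is worth noting that a quick one-liner confirms the module really is available by printing its version number:

perl -MParallel::ForkManager -e 'print $Parallel::ForkManager::VERSION, "\n";'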
Creating a data-driven informat in SAS
27th September 2019

Recently, I needed to create some example data with an extra numeric identifier variable that would be assigned according to the value of a character identifier variable. Not wanting to add another dataset merge or join to the code, I decided to create an informat from data. Initially, I looked into creating a format instead, but it did not accomplish what I wanted to do.
data patient;
    keep fmtname start end label type;
    set test.dm;
    by subject;
    fmtname="PATIENT";
    start=subject;
    end=start;
    label=patient;
    type="I";
run;
The input data needed a little processing, as shown above. The format name was defined in the variable FMTNAME, and the TYPE variable was assigned a value of I to make this a numeric informat; a value of J would make a character informat instead. The START and END variables declare the value range associated with the value of the LABEL variable, which becomes the actual value of the numeric identifier variable. The variable names are fixed because the next step will not work with different ones.
proc format lib=work cntlin=patient;
run;
quit;
To create the actual informat, the dataset is read by the FORMAT procedure, with the CNTLIN parameter specifying the name of the input dataset and LIB defining the library where the format catalogue is stored. When this is complete, the informat is available for use with an INPUT function, as shown in the code excerpt below.
data ae1;
    set ae;
    patient=input(subject,patient.);
run;
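For anyone wanting to try this out without the original datasets, here is a self-contained sketch of the whole flow; the subject identifiers and numeric values are made up for illustration:

data dm;
    input subject $ patient;
    datalines;
S001 101
S002 102
S003 103
;
run;

data patient;
    keep fmtname start end label type;
    set dm;
    fmtname="PATIENT";
    start=subject;
    end=start;
    label=patient;
    type="I";
run;

proc format lib=work cntlin=patient;
run;

data check;
    subject="S002";
    numid=input(subject, patient.);
    put numid=;  /* expected in the log: numid=102 */
run;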
Compressing an entire dataset library using SAS
26th September 2019

Turning on dataset compression for SAS datasets can produce quite a reduction in size, so it is often standard practice to do just this. It can be set globally with a single command, and many working systems do this for you:
options compress=yes;
It can also be done on a dataset-by-dataset basis by adding (compress=yes) beside the dataset name on the DATA statement of a data step, as in the short sketch below. The size reduction can be gained retrospectively too, but only if you create additional copies of the datasets; the code following the sketch uses the DATASETS procedure to accomplish this at the time of copying the datasets from one library to another.
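As a minimal sketch of the per-dataset form, with made-up library and dataset names:

data adamc.adsl (compress=yes);
    set adam.adsl;
run;

And here is the retrospective copy using the DATASETS procedure: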
proc datasets lib=adam nolist;
    copy inlib=adam outlib=adamc noclone datecopy memtype=data;
run;
quit;
The NOCLONE option on the COPY statement allows the compression status to be changed, while the DATECOPY one keeps the same date/time stamp on the new file at the operating system level. Lastly, the MEMTYPE=DATA setting ensures that only datasets are processed, while the NOLIST option on the PROC DATASETS line suppresses the listing of dataset information.
It may be possible to do something like this with PROC COPY too, but I know from experience that the option presented here works as expected. It also does not involve much coding effort, which helps a lot.
Creating a VirtualBox virtual disk image using the Linux command line
9th September 2019

Much of the past weekend was spent getting a working Debian 10 installation up and running in a VirtualBox virtual machine. Because I chose the Cinnamon desktop environment, the process was not as smooth as I would have liked, so a minimal installation was performed before I started to embellish it as I liked. Along the way, I got to wondering if I could create virtual hard drives using the command line, and I found that something like the following did what was needed:
VBoxManage createmedium disk --filename <full path including file name without extension> --size <size in MiB> --format VDI --variant Standard
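To make that concrete, here is a hypothetical invocation creating a 20 GiB (20480 MiB) dynamically allocated disk; the file path is invented for illustration:

VBoxManage createmedium disk --filename "$HOME/VirtualBox VMs/debian10.vdi" --size 20480 --format VDI --variant Standard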
Most of the options are self-explanatory, apart from the one named variant. This defines whether the VDI file is dynamically allocated, growing as needed up to the maximum given by the size parameter (Standard), or has the full size reserved at creation (Fixed). Two VDI files were created in this way, and I used them to replace their Debian 8 predecessors, even saving a bit of space too. If you want, you can find out more in the user documentation, but this post hopefully gets you started anyway.
Getting Eclipse to start without incompatibility errors on Linux Mint 19.1
12th June 2019

Recent curiosity about Java programming and Groovy scripting got me trying to start up the Eclipse IDE that I had installed on my main machine. What I got instead of a successful application startup was a message that included the following:
!MESSAGE Exception launching the Eclipse Platform:
!STACK
java.lang.ClassNotFoundException: org.eclipse.core.runtime.adaptor.EclipseStarter
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:466)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:566)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:499)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:626)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:584)
at org.eclipse.equinox.launcher.Main.run(Main.java:1438)
at org.eclipse.equinox.launcher.Main.main(Main.java:1414)
The cause was a mismatch between Eclipse and the installed version of Java that it needed to run. After all, the software itself is written in the Java language, and the version installed from the usual software repositories was too old to work with Java 11. The solution turned out to be installing a newer version as a Snap (Ubuntu's answer to Flatpak). The following command did the needful, since snapd already was running on my machine:
sudo snap install eclipse --classic
The only part of the command that warrants extra comment is the --classic switch, since that is needed for a tool like Eclipse that has to access the host file system. On executing, the software was downloaded from Snapcraft and then installed within its own bundle of dependencies. The latter adds a certain detachment from the underlying Linux installation and ensures that no messages appear because of incompatibilities like the one near the start of this post.
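If you want to confirm what is in place after the switch, a couple of quick checks do the job, assuming snapd and a Java runtime are installed:

java -version
snap list eclipse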
Fixing an update error in OpenMediaVault 4.0
10th June 2019

For a time, I found that executing the command omv-update in OpenMediaVault 4.0 produced the following Python errors, among other more benign messages:
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0xb7099d64>
Traceback (most recent call last):
File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0xb7099d64>
Traceback (most recent call last):
File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
Not wanting a failed update, I decided that I needed to investigate this and found that /usr/lib/python3.5/weakref.py required the following updates to lines 109 and 117, respectively:
def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):
_atomic_removal(d, wr.key)
To be clearer, the line beginning with "def" is how line 109 should appear, while the line beginning with _atomic_removal is how line 117 should appear. Once the required edits were made and the file saved, re-running omv-update revealed that the problem was fixed, and that is how things remain at the time of writing.
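One small aside: when editing a system file like this, keeping a copy of the original is a sensible precaution in case anything goes awry, for example:

sudo cp /usr/lib/python3.5/weakref.py /usr/lib/python3.5/weakref.py.bak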
Running cron jobs using the www-data system account
22nd December 2018

When you set up your own web server or use a private server (virtual or physical), you will find that web servers run using the www-data account. That means that website files need to be accessible to that system account, if not owned by it. The latter is mandatory if you want WordPress to be able to update itself without needing FTP details.
It also means that you probably need scheduled jobs to be executed using the privileges possessed by the www-data account. For instance, I use WP-CLI to automate spam removal and updates to plugins, themes and WordPress itself. Spam removal can be done without the www-data account, but the updates need file access and cannot be completed without it. Therefore, I got interested in setting up cron jobs to run under that account, and the following command helps to address this:
sudo -u www-data crontab -e
For that to work, your own account needs to be listed in /etc/sudoers or be assigned to the sudo group in /etc/group. If either of those applies, then entering your own password will open the cron file for www-data, and it can be edited as for any other account. Closing and saving the session will update cron with the new job details.
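As an illustration of the sort of entry that could be added, the following uses WP-CLI to update plugins, themes and WordPress itself every night; the schedule and the /var/www/html installation path are made up for the example:

0 2 * * * cd /var/www/html && wp plugin update --all && wp theme update --all && wp core update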
In fact, the same approach can be taken for a variety of commands where files can only be accessed using www-data. This includes copying, pasting and deleting files, as well as executing WP-CLI commands. WP-CLI itself issues a striking message if you run a command using the root account, a pervasive temptation given what it allows. Any alternative to that has to be better from a security standpoint.
Enlarging a VirtualBox VDI virtual disk
20th December 2018

It is remarkable how the Windows folder manages to grow on your C drive, and one in a Windows 7 installation was the cause of my needing to expand the VirtualBox VDI disk on which it was installed. After trying various ways to cut down the size, an enlargement could not be avoided. In all of this, it was handy that I had a recent backup for restoration in case anything got damaged.
The same thing meant that I could resort to enlarging the VDI file with more peace of mind than otherwise might have been the case. This needed use of the command line once the VM was shut down. The form of the command that I used was the following:
VBoxManage modifyhd <filepath/filename>.vdi --resize 102400
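Incidentally, newer VirtualBox releases favour the modifymedium subcommand over the deprecated modifyhd one, so an equivalent sketch of the same operation would be:

VBoxManage modifymedium disk <filepath/filename>.vdi --resize 102400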
Either form appears to work on a Windows host too, but mine was Linux, and it did what I needed. The next step was to attach the disk to an Ubuntu VM and use GParted to expand the main partition to fill the newly available space. That does not mean that it takes up 100 GiB on my system just yet, because these things can be left to grow over time, and there is a way to shrink them too if you ever need to do just that. As ever, having a backup made before any such operation may have its uses if anything goes awry.
Moving a website from shared hosting to a virtual private server
24th November 2018

This year has seen some optimisation being applied to my web presences, guided by the results of GTMetrix scans. It was then that I realised how slow things were, so server loads were reduced. Anything that slowed response times, such as WordPress plugins, got removed. Usage of Matomo also was curtailed in favour of Google Analytics, while HTML, CSS and JS minification followed. Something that had yet to happen was a search for a faster server. Now, another website has been moved onto a virtual private server (VPS) to see how that would go.
Speed was not the only consideration, since security was a factor too. After all, a VPS is more locked away from other users than a folder on a shared server. There also is the added sense of control, so Let's Encrypt SSL certificates can be added using the Electronic Frontier Foundation's Certbot. That avoids the expense of using an SSL certificate provided through my shared hosting provider, and a successful transition for my travel website may mean that this one undergoes the same move.
For the VPS, I chose Ubuntu 18.04 as its operating system, and it came with the LAMP stack already in place. Having offline development websites means that the mix of Apache, MySQL and PHP is more familiar to me than anything using Nginx or Python. It also means that .htaccess files become more useful than they were on my previous Nginx-based platform. Having full access to the operating system using SSH helps too and should mean that I have fewer calls on technical support, since I can do more for myself. Any extra tinkering should not affect others either, since this type of setup is well known to me, and having an offline counterpart means that anything riskier is tried there beforehand.
Naturally, there were niggles to overcome with the move. The first to fix was to make the MySQL instance accept calls from outside the server so that I could migrate data there from elsewhere, and I even got my shared hosting setup to start using the new database to see what performance boost it might give. To make all this happen, I first found the location of the relevant my.cnf configuration file using the following command:
find / -name my.cnf
Once I had the right file, I commented out the following line that it contained and restarted the database service afterwards, using another command to stop the appearance of any error 111 messages:
bind-address = 127.0.0.1
service mysql restart
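Opening up the bind address is only half the job, since MySQL also needs an account that is permitted to connect from another host. A sketch of the sort of statements involved follows, with the user, password and database names all invented for illustration:

CREATE USER 'webuser'@'%' IDENTIFIED BY 'choose-a-strong-password';
GRANT ALL PRIVILEGES ON websitedb.* TO 'webuser'@'%';
FLUSH PRIVILEGES;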
After that, things worked as required, and I moved on to another matter: uploading the requisite files. That meant installing an FTP server, so I chose proftpd since I knew it well from previous tinkering. Once that was in place, file transfer commenced.
When that was done, I could do some testing to see if I had an active web server that loaded the website. Along the way, I also enabled some Apache modules like mod_rewrite using the a2enmod command, restarting Apache each time I enabled another one.
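For example, enabling mod_rewrite and restarting Apache goes something like this:

sudo a2enmod rewrite
sudo systemctl restart apache2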
Then, I discovered that Textpattern needed php7.2-xml installed, so the following command was executed to do this:
apt install php7.2-xml
Then, the following line was uncommented in the correct php.ini configuration file, which I found using the same method as that described already for the my.cnf configuration, and that was followed by yet another Apache restart:
extension=xmlrpc
Addressing the above issues yielded enough success for me to change the IP address in my Cloudflare dashboard so it pointed at the VPS and not the shared server. The changeover happened seamlessly without having to await DNS updates as once would have been the case. It had the added advantage of making both WordPress and Textpattern work fully.
With everything working to my satisfaction, I then followed the instructions on Certbot to set up my new Let's Encrypt SSL certificate. Aside from a tweak to a configuration file and another Apache restart, the process was more automated than I had expected, so I was ready to embark on some fine-tuning to embed the new security arrangements. That meant updating .htaccess files, and Textpattern has its own, so the following addition was needed there:
RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
This complemented what was already in the main .htaccess file, and WordPress allows you to include http(s) in the address it uses, so that was another task completed. The general .htaccess file only needed the following lines to be added:
RewriteCond %{SERVER_PORT} 80
RewriteRule ^(.*)$ https://www.assortedexplorations.com/$1 [R,L]
What all these rules achieve is to redirect insecure connections to secure ones for every visitor to the website. After that, internal hyperlinks without https needed updating, along with any forms, so that a padlock sign could be shown for all pages.
With the main work completed, it was time to sort out a lingering niggle regarding the appearance of an FTP login page every time a WordPress installation or update was requested. The main solution was to make the web server account the owner of the files and directories, but the following line was added to wp-config.php as part of the fix even if it probably is not necessary:
define('FS_METHOD', 'direct');
There also was the non-operation of WP Cron and that was addressed using WP-CLI and a script from Bjorn Johansen. To make double sure of its effectiveness, the following was added to wp-config.php to turn off the usual WP-Cron behaviour:
define('DISABLE_WP_CRON', true);
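The gist of that approach is a scheduled job that asks WP-CLI to run any events that have fallen due; a hypothetical crontab entry, with a made-up path and schedule, might look like this:

*/5 * * * * cd /var/www/html && wp cron event run --due-now >/dev/null 2>&1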
Intriguingly, WP-CLI offers a long list of possible commands that are worth investigating. A few have been examined, but more await attention.
Before those, I still need to get my new VPS to send emails. So far, sendmail has been installed, the hostname changed from localhost and the server restarted. More investigation is needed, but what I have now is faster than what was there before, so the effort has been rewarded already.