Technology Tales

Adventures & experiences in contemporary technology

Useful Python packages for working with data

14th October 2021

My response to changes in the technology stack used in clinical research is to develop some familiarity with programming and scripting platforms that complement and compete with SAS, a system with which I have been programming since 2000. One of these has been R, but Python is another that has taken up my attention, and I now have Julia in my sights as well. There may be others to assess in the fullness of time.

While I first started to explore the Data Science world in the autumn of 2017, it was in the autumn of 2019 that I began to complete LinkedIn training courses on the subject. Good though they were, I find that I need to actually use a tool in order to understand it better. At that time, I got to hear about Python packages like Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn and Beautiful Soup, though it took until the spring of this year for me to start gaining some hands-on experience with any of them.

During the summer of 2020, I attended a BCS webinar on the CodeGrades initiative, a programming mentoring scheme inspired by the way classical musicianship is assessed. In fact, one of the main progenitors is a trained classical musician and music teacher who turned to Python programming when starting a family, so as to have a more stable income. The approach is that a student selects a project and works their way through it with mentoring and periodic assessments, carried out in a gentle and discursive manner. Of course, the project has to be engaging for the learning experience to stay the course, and that point came through in the webinar.

That is one lesson that resonates with me, with subjects as diverse as web server performance and the ongoing pandemic supplying data, and there are other sources of public data to examine as well before looking through my own personal archive gathered over the decades. Some subjects are uplifting while others are more foreboding, but the key thing is that they sustain interest and offer opportunities for new learning. Without being able to dream up new things to try, my knowledge of R and Python would not be as extensive as it is, and I hope that it will help with learning Julia too.

In the main, my own learning has been a solo effort, with consultation of documentation along with web searches that have brought me to the likes of Real Python, Stack Abuse, Data Viz with Python and R and others for longer tutorials, as well as threads on Stack Overflow. Usually, the web searching begins when I need a steer on a particular topic or a way to resolve a particular error or warning message, but books always are worth reading even if that is the slower route. Those from the Dummies series or from O’Reilly have proved most useful so far, but I do need to read them more completely than I already have; it is all too tempting to go with the “program and search for solutions as you go” approach instead.

To get going, many choose the Anaconda distribution to get Jupyter Notebook functionality, but I prefer a more traditional editor, so Spyder has been my tool of choice for Python programming; there are others like PyCharm as well. Spyder itself is written in Python, so it can be installed using pip from PyPI like other Python packages. It has other dependencies like Pylint for code management activities, but these get installed behind the scenes.

The packages that I first met in 2019 may be the mainstays for doing data science, but I have discovered others since then. It also seems that there is porosity between the worlds of R and Python, so you get some Python packages aping R packages, and R has the Reticulate package for executing Python code. There are Python counterparts to such Tidyverse staples as dplyr and ggplot2 in the form of Siuba and Plotnine, respectively. The syntax of these packages is not a direct copy of what is executed in R, but it is close enough to feel familiar, which adds a degree of user friendliness compared to Pandas or Matplotlib. The interoperability does not stop there, for there is SQLAlchemy for connecting to MySQL and other databases (PyMySQL is needed as well) and there also is SASPy for interacting with SAS Viya.
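
To give a flavour of that interoperability, here is a minimal sketch pulling rows from a MySQL database through SQLAlchemy and PyMySQL and then plotting them with Plotnine in its ggplot2-like grammar; the connection string, table name and column names are invented for illustration and would need to be replaced with real ones.

import pandas as pd
from sqlalchemy import create_engine
from plotnine import ggplot, aes, geom_point, labs

# SQLAlchemy uses the PyMySQL driver here; the credentials and database are placeholders.
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

# Read a hypothetical table of walks into a Pandas data frame.
walks = pd.read_sql("SELECT distance, duration FROM walks", engine)

# Plotnine follows the ggplot2 grammar closely, so this should look familiar to R users.
plot = (
    ggplot(walks, aes(x="distance", y="duration"))
    + geom_point()
    + labs(x="Distance (km)", y="Duration (hours)")
)
plot.save("walks.png")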

Python may not have the speed of Julia, but there are plenty of packages for working with larger workloads. Of these, Dask, Modin and RAPIDS all have their uses for dealing with data volumes that make Pandas code crawl. As if to prove that there are plenty of libraries for various forms of data analytics, data science, artificial intelligence and machine learning, there also are the likes of Keras, TensorFlow and NetworkX. These are just a selection of what is available, and there is no reason not to check out more. It may be tempting to stick with the most popular packages all the time, especially when they do so much, but it never hurts to keep an open mind either.
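
As a small illustration of how Dask stands in for Pandas when data volumes grow, here is a sketch that totals up values across a set of CSV files; the file pattern and column names are made up for the example, and Modin or RAPIDS could play a similar role.

import dask.dataframe as dd

# Dask reads the files lazily and splits them into partitions,
# so nothing is actually loaded until compute() is called.
logs = dd.read_csv("webserver-logs-*.csv")

# The API mirrors Pandas; this sums a hypothetical hits column per page.
hits_per_page = logs.groupby("page")["hits"].sum().compute()
print(hits_per_page.sort_values(ascending=False).head())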

Setting the PHP version in .htaccess on Apache web servers

7th September 2014

The default PHP version on my outdoors, travel and photography website is 5.2.17 and that is getting on a bit now, since it is no longer supported by the PHP project and has not been since 2011. One obvious impact was on Piwik, which I use for web analytics and which needs at least 5.3.2. WordPress 4.0 even needs 5.2.24, so that upgrade became implausible. Thus, I contacted Webfusion’s support team and they showed me how to get to at least 5.3.3 and even as far as 5.5.9. The trick is the addition of a line of code to the .htaccess file (near the top was my choice) like one of the following:

PHP 5.3.x

AddHandler application/x-httpd-php53 .php

PHP 5.5.x

AddHandler application/x-httpd-php55 .php

When I got one of these in place, things started to look promising, but for a locked database due to my not watching how big it had got. Replacing it with two additional databases addressed the problem of losing write access, though there was a little upheaval caused by this. Using PHP 5.5.9 meant that I spotted messages regarding the deprecation of the mysql_connect function, so that needed fixing too (prefixing it with @ might be a temporary fix, but a more permanent one always is better, so that is what I did by piggybacking off what WordPress uses; MySQLi and PDO_MySQL are other options). Sorting the database issue meant that I saw the upgrade message for WordPress as well as for a mix of plugins and themes, so all looked better and I need to worry less about losing security updates. Also, I am up to the latest version of Piwik and that’s an even better way to be.

Self-hosted web analytics tracking

24th April 2009

It amazes me now to think how little tracking I used to do on my various web “experiments” only a few short years ago. However, there was a time when a mere web counter, perhaps displayed on web pages themselves, was enough to yield some level of satisfaction, or dissatisfaction in many a case. Things have come a long way since then and we now seem to have analytics packages all around us. In fact, we don’t even have to dig into our pockets to get our hands on the means to peruse this sort of information either.

At this point, I need to admit that I am known to make use of a few simultaneously, though thoughts about reducing their number are coming to mind; there’ll be more on that later. Given that this site is hosted using WordPress software, it should come as no surprise that Automattic’s own plugin has been set into action to see how things are going. The main focus is on the total number of visits by day, week and month, with a breakdown showing what pages are doing well as well as an indication of how people came to the site and what links they followed while there. Don’t go expecting details of your visitors, like the software that they are using or the country from which they are accessing the site, from this minimalist option, and satisfaction should head your way.

There is next to no way of discussing the subject of website analytics without mentioning Google’s comprehensive offering in the area. You have to admit that it’s comprehensive, with perhaps the only bugbear being the lack of live tracking. That need has been addressed very effectively by Woopra, even if its WordPress plugin will not work with IE6. Otherwise, you need the desktop application (being written in Java, it’s a cross-platform affair and I have had it going in both Windows and Linux), but that works well too. Apart maybe from the lack of campaigns, Woopra supplies as good as all of the information that its main competitor provides. It certainly does what I would need from it.

However, while they can be free as in beer, there are some costs associated with using external services like Google Analytics and Woopra. Their means of tracking your web pages for you is by executing a piece of JavaScript that needs to be added to every page. If you have everything set to use a common header or footer page, that shouldn’t be too laborious, and there are plugins for publishing platforms like WordPress too. This way of working means that if anyone has JavaScript disabled, or decides not to enable JavaScript for the requisite hosts while using the NoScript extension with Firefox, then your numbers are scuppered. Saying that, the same concern probably applies to any JavaScript code that you may want to execute, but there’s another cost again: the calls to external websites can, even with the best attention in the world, slow down the loading of your own pages. Not only is additional JavaScript being run, but there also is the latency caused by servers having to communicate across the web.

A self-hosted analytics package would avoid the latter and I found one recently through Lifehacker. Amazingly, it has been around for a while and I hadn’t known about it but I can’t say that I was actively looking for it either. Piwik, formerly known as PHPMyVisites, is the name of my discovery and it seems not too immature either. In fact, I’d venture that it does next to everything that Google Analytics does. While I’d prefer that it used PHP, JavaScript is its means of tracking web pages too. Nevertheless, page loading is still faster than with Google Analytics and/or Woopra and Firefox/NoScript users would only have to allow JavaScript for one site too. If you have had experience with installing PHP/MySQL powered publishing platforms like WordPress, Textpattern and such like, then putting Piwik in place is no ordeal. You may find yourself changing folder access but uploading of the required files, the specification of database credentials and adding an administration user is all fairly standard stuff. I have the thing tracking this edifice as well as my outdoor activities (hillwalking/cycling/photography) web presence and I cannot say that I have any complaints so we’ll see how it goes from here.

An alternative use for Woopra

4th August 2008

Google Analytics is all very fine with its once a day reporting cycle, but the availability of real-time data does have its advantages. WordPress.com’s Stats plugin goes some way to serving the need, but Woopra trumps it in every way, apart from a possible overkill in the amount of information that it makes available. The software may be in the beta phase and it does crash from time to time, but its usefulness remains more than apparent.

One of its uses is seeing if there are people visiting your website at a time when you might be thinking of making a change like upgrading WordPress. Timing such activities to avoid a clash is a win-win situation: a better experience for your visitors and more reliable updates for you. After all, it’s very easy to make a poor impression, and an unreliable site will do that faster than anything else, so it’s paramount that your visitors do not find themselves on the receiving end of updates, even if those are all for the better.

Want attention? Just mention Ubuntu…

5th November 2007

According to Google Analytics, visitor numbers for this blog hit their highest level one day last week. I suspect that it might have been down to a mention of two of my posts on tuxmachines.org. Thanks guys. Feedburner activity has been strong too.

That brings me to another thought: the web seems a good place for Ubuntu users to find solutions to problems that they might encounter. I certainly found recipes that resolved issues that I was having: scanner set-up and using another hard drive to host my home directory, all very useful stuff. When I last played with Linux to the same extent that I am now doing, the web was still a resource, but it wouldn’t have been as helpful as I have found it recently. I suppose that there are people like me posting tips and tricks for computing on blogs and that makes them easier to find. That’s no bad thing and I hope that it continues. Saying that, I might still get my mitts on an Ubuntu book yet…

IE7 on the way up…

9th September 2007

I don’t spend too much time looking at the stats in Google Analytics, but I do find it useful to see what people come to see. Another thing that I keep on the radar is the browser technology that visitors are using. Screen resolution is a particular interest of mine. However, browsers and their versions are watched too, and I have spotted the ascent of IE7 from where it was; there seems to be a surge in recent times. I am unsure as to the cause of this, but it’s definitely happening and Vista take-up seems to have nothing to do with it.

Using external JavaScript files? Just don’t load them at the bottom of the page…

21st July 2007

Looking through Google Analytics for my websites, I have always been struck by the lack of IE7 uptake seemingly apparent from the statistics. However, I recently discovered that there may be a reason for this. I use the Ultimate GA plugin with my WordPress blogs and that adds the JavaScript code block near the bottom of the page. I recently saw that giving me scripting errors in IE7, and a spot of manual coding saw it travel to the header section of the web page. That, and the deactivation of the said plugin, was sufficient to get rid of the errors in question. Seeing the effect of my changes on the reported share of visitors using IE7 could be interesting. It might even boost the Vista numbers as well.

Google Analytics

25th May 2007

Furthering my excursions into things related to Google, I have been giving Google Analytics a whirl for my hillwalking and photo gallery website. Aside from the fact that it is updated only once a day, it could see WordPress plug-ins like Popularity Contest and FireStats getting the chop. As it happens, I also have a Google Analytics plugin installed, but a little editing of the blog template that I have developed would get rid of that too.

That’s enough about WordPress plug-ins; let’s return to Google Analytics. It has all the usual stuff: who’s visiting, from where they are coming, what they are using to see your site, etc. In addition, it captures whether they are coming back, how long they are staying on the site and how deep they are going. Bounce rate is another term that features heavily: it is the proportion of visits where a user only goes to one page and then leaves. With a blog, this unfortunately seems to come out as a high figure, which is ironic given that the blog was meant to promote the online photo gallery; it has very much taken on a life all of its own. There’s more to the information from Google Analytics, but it’s all useful stuff and I plan to make good use of it to improve how my site works.

Do we surf the web less at the weekend?

21st May 2007

Looking at the visitor statistics for both this blog and my main website, I have noticed a definite dip in visitor numbers at the weekends, at least over the last few weeks. Time will tell whether this is a definite trend, but it is an intriguing one: fewer people are reading blogs and the like when they might have more time to do so. It would also suggest that people are getting away from the web at the weekend, not necessarily a bad thing at all. In fact, I was away from the world of computers and out walking in the border country shared by Wales and England yesterday.

Speaking of walking, it does not surprise me that my hillwalking blog received less attention: many of my readers could have been in the outdoors anyway. And as for this blog, it does contain stuff that I find useful in the day job and it seems that others are looking for the same stuff too if the blog statistics are to be believed. Couple that to the fact that technology news announcements peak during the week and it seems that the weekday upsurge is real. I’ll continue to keep an eye on things to see if my theorising is right or mistaken…

  • All the views that you find expressed on here in postings and articles are mine alone and not those of any organisation with which I have any association, through work or otherwise. As regards editorial policy, whatever appears here is entirely of my own choice and not that of any other person or organisation.

  • Please note that everything you find here is copyrighted material. The content may be available to read without charge and without advertising but it is not to be reproduced without attribution. As it happens, a number of the images are sourced from stock libraries like iStockPhoto so they certainly are not for abstraction.

  • With regards to any comments left on the site, I expect them to be civil in tone of voice and reserve the right to reject any that are either inappropriate or irrelevant. Comment review is subject to automated processing as well as manual inspection but whatever is said is the sole responsibility of the individual contributor.