TOPIC: PROGRAMMING LANGUAGES
Python productivity: Building better code through design, performance and scale
12th September 2025
Python's success in data science and beyond stems from more than just readable syntax. It represents a coherent philosophy where errors guide development, explicitness prevents bugs, modern tooling enforces quality, performance comes from purpose-built engines, and scaling extends rather than replaces familiar patterns. Understanding these principles transforms everyday coding from a series of individual tasks into a systematic approach to building robust, maintainable and efficient systems.
- Error-Driven Development as a Design Philosophy
Python treats errors not as failures, but as design features that surface problems early and prevent subtle defects later. The language embodies an "easier to ask forgiveness than permission" philosophy, attempting operations first and objecting meaningfully when they cannot proceed.
Consider how Python handles basic operations. A SyntaxError appears immediately when code violates grammatical rules: if True print("hello") triggers an immediate complaint with a caret pointing to the problematic location. Python neither guesses intentions nor continues with broken syntax, because this guarantee of clear structure keeps code understandable across projects and platforms.
Sequence operations demonstrate similar principles. When code attempts to access lst[5] on a three-element list, Python raises IndexError: list index out of range rather than silently padding or expanding the sequence. This deliberate failure prevents hidden logic errors in loops and aggregations by forcing explicit checks of assumptions about data size.
Dictionary lookups follow the same pattern. Accessing a non-existent key with d['missing'] yields KeyError: 'missing' rather than inventing placeholder values. This explicit failure catches typos and unclear control flow whilst enabling defensive programming patterns through try/except blocks.
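As a minimal sketch of that defensive pattern (the dictionary and key names are hypothetical):
config = {"host": "localhost", "port": 8080}

try:
    timeout = config["timeout"]
except KeyError:
    # The key is absent, so fall back to an explicit default rather than a silent guess.
    timeout = 30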
Name resolution errors like NameError and UnboundLocalError enforce clear scoping rules without creating variables accidentally or resolving names to unexpected contexts. Type discipline appears at runtime through TypeError for incorrect argument types and ValueError for correct types with inappropriate values. Each error message identifies which contract has been violated, directing fixes to either the object passed or the value it contains.
Assertions provide a final layer of optional verification. The assert statement allows code to state assumptions explicitly, failing with meaningful messages when invariants do not hold. This narrows the search space for defects by making expectations visible and providing immediate context for failures.
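For instance, a small illustration of stating an invariant with assert (the function is made up for the example):
def average(values):
    # State the assumption explicitly; an empty sequence would make the result meaningless.
    assert len(values) > 0, "average() requires at least one value"
    return sum(values) / len(values)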
Taking these error signals seriously nudges development towards explicitness and clarity, establishing a foundation for all subsequent quality improvements.
- Explicitness Over Implicitness
Making intentions clear through code structure prevents ambiguity, aids tooling and simplifies reuse. This principle manifests across multiple areas of Python development, from data structures to function signatures.
Raw dictionaries offer flexibility but create fragility. A typo in a key or missing field becomes a runtime KeyError with no contract about required contents. Using @dataclass to define structured objects like User with id, email, full_name, status and an optional last_login provides clear interfaces with minimal overhead. Type hints and IDE support make attribute access unambiguous, whilst construction fails early when required fields are absent.
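A minimal sketch of such a dataclass, using the field names mentioned above (the types are illustrative assumptions):
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class User:
    id: int
    email: str
    full_name: str
    status: str
    last_login: Optional[datetime] = None  # optional, so it carries a default

# Construction raises a TypeError straight away if a required field is missing.
user = User(id=1, email="ada@example.com", full_name="Ada Lovelace", status="active")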
For cases requiring validation, pydantic models build on this foundation. An email field declared as EmailStr automatically validates format, while custom validators can restrict status values to specific options such as 'active', 'inactive' or 'pending'. The resulting models are self-documenting and shield downstream code from invalid data.
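A hedged sketch of that validation, assuming pydantic v2 and the email-validator extra that EmailStr needs:
from pydantic import BaseModel, EmailStr, field_validator

class UserModel(BaseModel):
    email: EmailStr          # format is checked automatically
    status: str = "pending"

    @field_validator("status")
    @classmethod
    def check_status(cls, value: str) -> str:
        # Restrict status to the closed set of options named above.
        if value not in {"active", "inactive", "pending"}:
            raise ValueError("status must be 'active', 'inactive' or 'pending'")
        return value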
Function parameters representing closed sets of options benefit from similar treatment. Plain strings invite typos and lack autocomplete support. Defining enums such as OrderStatus with PENDING, SHIPPED and DELIVERED makes possible states explicit whilst helping both developers and tools. Passing OrderStatus.SHIPPED to process_order reveals intention clearly and enables straightforward comparisons against enum members.
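A small sketch of that enum approach; process_order here is a hypothetical stand-in:
from enum import Enum

class OrderStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"

def process_order(order_id: int, status: OrderStatus) -> None:
    # Comparisons against enum members are unambiguous and autocomplete-friendly.
    if status is OrderStatus.SHIPPED:
        print(f"Order {order_id} is on its way")

process_order(42, OrderStatus.SHIPPED)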
Function signatures become clearer through keyword-only arguments, enforced with a bare star in definitions. A function like create_user(name, email, *, admin=False, notify=True, temporary=False) forces call sites to write create_user(..., admin=True, notify=False) rather than passing sequences of ambiguous boolean values. The resulting calls read almost as documentation.
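In sketch form, with the body elided because only the signature matters here:
def create_user(name, email, *, admin=False, notify=True, temporary=False):
    ...

create_user("Ada", "ada@example.com", admin=True, notify=False)
# create_user("Ada", "ada@example.com", True, False) would raise a TypeError instead.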
File path operations improve through object-oriented design. The pathlib module treats paths as objects where joining uses natural / syntax, directory creation uses mkdir, suffix changes use with_suffix, and text operations use read_text and write_text. Code becomes shorter, more portable and less prone to string manipulation errors.
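A minimal pathlib sketch covering those operations (the paths themselves are hypothetical):
from pathlib import Path

reports = Path("data") / "reports"          # joining with the / operator
reports.mkdir(parents=True, exist_ok=True)  # create the directory tree if needed

summary = reports / "summary.txt"
summary.write_text("quarterly totals\n")
backup = summary.with_suffix(".bak")        # change the extension
backup.write_text(summary.read_text())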
These patterns consistently replace implicit assumptions with explicit contracts, making code intention more visible and reducing the cognitive load of understanding system behaviour.
- Structural Code Quality Through Tooling and Patterns
Sustainable code quality emerges from systematic approaches to organisation, testing and maintenance rather than individual discipline alone. Several key patterns and tools work together to create robust, readable codebases.
Control flow benefits from handling error conditions early rather than nesting deeply. Guard clauses invert the traditional structure so that invalid states return immediately, whilst main logic remains non-indented when preconditions are met. A process_payment function checking order.is_valid, then user.has_payment_method, then available funds before performing charges reads linearly. Exceptions during processing are caught precisely, errors logged with context, and functions return deterministically.
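A sketch of that guard-clause shape, with hypothetical order and user objects and a made-up charge helper:
import logging

def charge(user, amount):
    # Hypothetical stand-in for a payment-gateway call.
    ...

def process_payment(order, user):
    if not order.is_valid:
        return "invalid order"
    if not user.has_payment_method:
        return "no payment method"
    if user.available_funds < order.total:
        return "insufficient funds"

    try:
        charge(user, order.total)
    except ConnectionError as exc:
        logging.error("Charge failed for order %s: %s", order.id, exc)
        return "payment failed"
    return "charged"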
Even beloved list comprehensions have limits. When filtering and transformation logic become complex, sprawling comprehensions become opaque. Extracting predicates into named functions like is_valid_premium_user restores readability by giving conditions clear names. Where multiple checks and transformations are needed, conventional loops with early continue statements may prove more straightforward and debuggable.
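For example, a quick sketch of pulling a predicate out of a crowded comprehension (the user records are invented):
def is_valid_premium_user(user):
    # A named condition is easier to read, test and reuse than an inline expression.
    return user["is_active"] and user["tier"] == "premium" and user["email_verified"]

users = [
    {"is_active": True, "tier": "premium", "email_verified": True},
    {"is_active": True, "tier": "free", "email_verified": True},
]
premium_users = [user for user in users if is_valid_premium_user(user)]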
Pure functions that accept all inputs as parameters and return results without changing external state simplify testing and reuse. Moving from designs where functions mutate global totals and read from global inventories to approaches where calculations accept prices, quantities and discounts as inputs removes hidden coupling. This enables deterministic testing of edge cases and reasoning about code without tracking changing state.
Documentation ties these practices together. Docstrings explaining what functions do, parameters they accept, values they return and including examples make codebases self-explanatory. Combined with tooling, docstrings serve as both reference and executable documentation.
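Bringing those two points together, a small sketch of a pure function whose docstring doubles as an executable example (the discount logic is invented):
def order_total(price: float, quantity: int, discount: float = 0.0) -> float:
    """Return the total for a line item.

    price is the unit price, quantity the number of units and discount a fraction between 0 and 1.

    >>> order_total(10.0, 3, discount=0.5)
    15.0
    """
    return price * quantity * (1 - discount)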
Automation enforces consistency where human attention falters. Formatters like Black, linters like Ruff, static type checkers like mypy and import organisers like isort can run before each commit using pre-commit. Style issues and common mistakes are caught automatically, freeing mental capacity for higher-level concerns.
When handling errors, resist blanket except: statements that swallow everything from syntax errors to keyboard interrupts. Be specific where possible, catching ConnectionError, ValueError or database errors and handling each appropriately. When catch-alls are necessary, prefer except Exception as e: and log full tracebacks so that unexpected failures remain visible and traceable.
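A brief sketch of that graduated approach, with fetch_orders as a hypothetical stand-in for some real operation:
import logging

def fetch_orders(url):
    # Hypothetical network call that may fail or return unparsable data.
    return []

def load_orders(url):
    try:
        return fetch_orders(url)
    except ConnectionError:
        logging.warning("Could not reach %s; will retry later", url)
        return []
    except ValueError:
        logging.error("Response from %s could not be parsed", url)
        return []
    except Exception:
        # Catch-all as a last resort: keep the full traceback visible.
        logging.exception("Unexpected failure while loading orders")
        raise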
- Performance Through Modern Engines
Once code achieves cleanliness and robustness, performance becomes the next frontier. Traditional tools often leave substantial speed gains on the table, particularly for data-intensive work where single-threaded processing creates bottlenecks on modern hardware.
Polars, a DataFrame library written in Rust, addresses these limitations by making parallelism the default whilst providing both eager and lazy execution modes. Benchmarks on datasets of around 580,000 rows show Polars completing filtering roughly four times faster than Pandas, aggregation over twenty times faster, groupby operations eight times faster, sorting three times faster, and feature engineering five times faster. These gains stem from fundamental architectural differences rather than incremental optimisations.
The performance improvement requires a shift in mental model. Instead of writing sequential operations that execute immediately, you can batch expressions and let Polars parallelise them automatically. Creating both profit and margin with one with_columns call signals that these calculations can proceed together. Lazy evaluation extends this approach further. Building pipelines with pl.scan_csv('large_file.csv').filter(...).group_by(...).agg(...).collect() lets Polars construct a query plan and optimise it before execution. Filters are pushed down so less data reaches later stages, only selected columns are read, and compatible operations are combined.
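A hedged sketch of such a lazy pipeline, assuming a recent Polars release (where the method is group_by) and a hypothetical file with region and revenue columns:
import polars as pl

summary = (
    pl.scan_csv("large_file.csv")                          # lazy: nothing is read yet
    .filter(pl.col("revenue") > 0)                         # pushed down towards the scan
    .group_by("region")
    .agg(pl.col("revenue").sum().alias("total_revenue"))
    .collect()                                             # the plan is optimised, then executed
)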
Expressiveness comes from an expression system applying operations across columns succinctly. Where Pandas encourages thinking in terms of single columns assigned individually, Polars supports expressions like pl.col(['revenue', 'cost']) * 1.1 applied to multiple columns simultaneously. Familiar transformations translate directly: pl.read_csv('sales.csv') replaces pd.read_csv, selection and filtering become df.filter(pl.col('order_value') > 500).select(['customer_id', 'order_value']), new columns are created with df.with_columns(((pl.col('revenue') - pl.col('cost')) / pl.col('revenue')).alias('profit_margin')), and operations utilise all available cores automatically.
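And a short eager-mode sketch of batching expressions in one with_columns call, using the revenue and cost columns mentioned above with made-up values:
import polars as pl

df = pl.DataFrame({"revenue": [120.0, 300.0], "cost": [80.0, 210.0]})
df = df.with_columns(
    (pl.col("revenue") - pl.col("cost")).alias("profit"),
    ((pl.col("revenue") - pl.col("cost")) / pl.col("revenue")).alias("margin"),
)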
Memory efficiency improves through Apache Arrow's columnar format, storing data more compactly and avoiding NumPy-based overhead. CSV files of around 2 GB requiring roughly 10 GB of RAM in Pandas often process in approximately 4 GB with Polars. This difference can determine whether workflows run smoothly on laptops or require chunking strategies.
- Scaling Beyond Single Processes
When single processes reach their limits, two prominent approaches help scale Python across cores and machines whilst preserving familiar patterns and mental models.
Dask extends NumPy, Pandas and scikit-learn idioms to larger-than-memory datasets by partitioning arrays, DataFrames and computations, then scheduling them in parallel. Its primary abstractions are dask.dataframe and dask.array, along with delayed task graphs. It excels at scalable batch processing, feature engineering and out-of-core work where the mental model remains close to the PyData stack. Integration with scikit-learn and XGBoost is mature, work-stealing schedulers are sophisticated, and detailed dashboards provide visibility. Clusters can be managed natively or through systems like Kubernetes and YARN.
For large-scale data cleaning and feature engineering, Dask provides natural extensions. Reading many CSV files from storage with dd.read_csv('s3://data/large-dataset-*.csv'), filtering rows with df[df['amount'] > 100], applying transformations per partition, then writing Parquet with df.to_parquet('s3://processed/output/') looks like Pandas but runs in parallel and out of core. Array computations through dask.array handle chunked operations so that x.mean(axis=0).compute() runs across partitions without exhausting memory.
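A minimal Dask sketch along those lines, treating the S3 paths and the amount column as placeholders and assuming s3fs is available:
import dask.dataframe as dd

df = dd.read_csv("s3://data/large-dataset-*.csv")  # one logical DataFrame over many files
filtered = df[df["amount"] > 100]                  # lazy, Pandas-style filtering
filtered.to_parquet("s3://processed/output/")      # triggers parallel, out-of-core execution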
Ray takes a more general approach to distributed computing through remote functions and actors. It suits workloads with many independent Python functions, stateful services and complex machine learning pipelines. A growing ecosystem includes Ray Tune for hyperparameter optimisation, Ray Train for multi-GPU training, Ray Serve for model serving, and RLlib for reinforcement learning. Scheduling is dynamic and actor-based, cluster management integrates with cloud providers, and the system scales to applications requiring fine-grained control and flexibility.
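As a small illustration of the remote-function model, not tied to any particular workload:
import ray

ray.init()  # starts a local Ray instance when no cluster address is given

@ray.remote
def square(x):
    return x * x

futures = [square.remote(i) for i in range(8)]  # scheduled in parallel across workers
print(ray.get(futures))                         # [0, 1, 4, 9, 16, 25, 36, 49]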
For model training that requires exploring many configurations, Ray Tune provides schedulers and search strategies. Training functions can be wrapped and launched across workers with tune.run, with methods like ASHA stopping unpromising runs early. Integration with popular libraries means scaling experiments requires minimal code changes. Ray Serve turns model classes exposing __call__ methods into scalable services with @serve.deployment and serve.run, handling routing and scaling automatically.
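A hedged Ray Serve sketch, assuming Ray 2.x where deployments are bound and passed to serve.run; the scoring logic is a stand-in for a real model:
from ray import serve
from starlette.requests import Request

@serve.deployment
class Scorer:
    def __init__(self):
        self.threshold = 0.5  # stand-in for real model state

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"positive": payload.get("score", 0.0) > self.threshold}

serve.run(Scorer.bind())  # Serve handles routing and replica scaling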
- Incremental Adoption and Pragmatic Choices
The most sustainable approach to improving Python productivity involves gradual implementation rather than wholesale changes. Each improvement builds on previous ones, creating compound benefits over time whilst minimising disruption to existing workflows.
Adopting Polars illustrates this principle well. The first step can be simply loading data with pl.read_csv('big_file.csv') for faster I/O, then converting to Pandas with .to_pandas() if the rest of the pipeline expects Pandas objects. As comfort grows, expression-oriented patterns yield dividends: filtering then adding multiple columns in single chained calls, so Polars can optimise across steps. Full benefits appear when entire pipelines are expressed lazily, but this transition can happen gradually as understanding deepens.
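That first step really can be as small as this sketch, assuming pyarrow is installed for the conversion and big_file.csv is a placeholder:
import polars as pl

df = pl.read_csv("big_file.csv")  # Polars does the fast, parallel I/O
pdf = df.to_pandas()              # hand the result to an existing Pandas pipeline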
Similarly, clean code practices can be introduced incrementally. Start by letting error messages guide fixes rather than suppressing them. Refactor one fragile dictionary into a dataclass when maintenance becomes painful. Extract a complex list comprehension into a named function when debugging becomes difficult. Each change teaches principles that apply more broadly whilst delivering immediate benefits.
Scaling decisions are often pragmatic rather than theoretical. If work centres on DataFrames and arrays with minimal conceptual shift from Pandas or NumPy, Dask likely delivers what you need. If workloads mix training, tuning and serving or require orchestrating many concurrent Python tasks with fine-grained control, Ray's abstractions and libraries provide better matches. Trying each approach on representative workflow slices quickly clarifies which will serve best.
The choice between tools should be driven by actual requirements rather than perceived sophistication. A single machine with Polars may outperform a small cluster running Pandas. A well-structured monolithic application may be more maintainable than a prematurely distributed system. The key is understanding when complexity serves genuine needs rather than adding overhead.
- Synthesis: The Compound Nature of Python Productivity
These themes work together rather than in isolation. Error-driven development creates habits that surface problems early. Explicit code structures make intentions clear to both humans and tools. Quality practices through tooling and patterns create sustainable foundations. Modern engines provide performance without sacrificing readability. Scaling approaches extend familiar patterns rather than replacing them. Incremental adoption ensures changes compound rather than disrupt.
The result is a coherent approach to Python development where each improvement reinforces others. Explicit data structures work better with static type checkers. Pure functions are easier to test and parallelise. Clean error handling integrates naturally with distributed systems. Modern DataFrame engines benefit from lazy evaluation patterns that also improve code clarity.
This synthesis explains Python's enduring appeal in data science and beyond. The language welcomes beginners with approachable syntax, whilst scaling to demanding production work without losing clarity. The ecosystem encourages practices that speed up teams over time rather than optimising for immediate gratification. The same principles that guide small scripts apply to large systems, creating a path for continuous improvement rather than periodic rewrites.
Start small: let error messages guide one fix, refactor one fragile dictionary into a dataclass, switch one slow operation to Polars, or run one hyperparameter sweep with Ray Tune. The improvements compound, and the foundations established early enable sophisticated capabilities later without fundamental changes to approach or mindset.
Clearing the Julia REPL
23rd September 2024
During development, there are times when you need to clear the Julia REPL. It can become so laden with content that it gets hard to debug your code. One way to accomplish this is to press the CTRL + L keyboard shortcut while focus is within the REPL; you need to click on it first. Another is to issue the following in the REPL itself:
print("\033c")
Here \033 is an escape code in octal format that is often used in terminal control sequences. The c character is what resets the terminal to its initial state. Printing this sequence is what does the clearing, and variations can be used to clear other kinds of console screens too, which makes it a more generic solution.
Dropping to an underlying shell using the ; character is another possibility. Then, you can use the clear or cls commands as needed; the latter is for Windows systems.
One last option is to define a Julia function for doing this:
function clear_console()
    run(`clear`)  # or `cls` for Windows
end
Calling the clear_console function then clears the screen programmatically, allowing for greater automation. The run function is the one that sends that command in backticks to the underlying shell for execution. Even using that alone should work too.
AttributeError: module 'PIL' has no attribute 'Image'
11th March 2024
One of my websites has an online photo gallery. This has been a long-term activity that has taken several forms over the years. Once HTML and JavaScript based, it then was powered by Perl before PHP and MySQL came along to take things from there.
While that remains how it works, the publishing side of things has used its own selection of mechanisms over the same time span. Perl and XML were the backbone until Python and Markdown took over. There was a time when ImageMagick and GraphicsMagick handled image processing, but Python now does that as well.
That was when the error message gracing the title of this post came to my notice. Everything was working well when executed in Spyder, but the message appeared when I tried running things using Python on the command line. PIL is the import name used by the Python 3 Pillow package; there was a separate package actually called PIL in the Python 2 days.
For me, pillow loads, resizes and creates new images, which is handy for adding borders and copyright/source information to each image as well as creating thumbnails. All this happens in memory and that makes everything go quickly, much faster than disk-based tools like ImageMagick and GraphicsMagick.
Of course, nothing is going to happen if the package cannot be loaded, and that is what the error message is about. Linux is what I mainly use, so that is the context for this scenario. What I was doing was something like the following in the Python script:
import PIL
Then, I referred to PIL.Image when I needed it, and this could not be found when the script was run from the command line (BASH). The solution was to add something like the following:
from PIL import Image
That sorted it, and I must have run into trouble with PIL.ImageFilter too, since I now load it in the same manner. In both cases, I could just refer to Image or ImageFilter as I required and without the dot syntax. However, you need to make sure that there is no clash with anything in another loaded Python package when doing this.
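For illustration, a small sketch of the kind of processing described here once the imports are right; the file names and sizes are hypothetical:
from PIL import Image, ImageFilter

img = Image.open("photo.jpg")
thumb = img.resize((320, 213))                           # in-memory resize for a thumbnail
soft = thumb.filter(ImageFilter.GaussianBlur(radius=1))  # light blur as an example filter
soft.save("photo_thumb.jpg")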
A look at the Julia programming language
19th November 2022
Several open-source computing languages get mentioned when talking about working with data. Among these are R and Python, but there are others; Julia is another one of these. It took a while before I got to check out Julia because I felt the need to get acquainted with R and Python beforehand. There are others like Lua to investigate too, but that can wait for now.
With the way that R is making an incursion into clinical data reporting analysis following the passage of decades when SAS was predominant, my explorations of Julia are inspired by a certain contrariness on my part. Alongside some small personal projects, there has been some reading in (digital) book form and online. Concerning the latter of these, there are useful tutorials like Introduction to Data Science: Learn Julia Programming, Maths & Data Science from Scratch or Julia Programming: a Hands-on Tutorial. Like what happens with R, there are online versions of published books available free of charge, and they include Julia Data Science and Interactive Visualization and Plotting with Julia. Video learning can help too and Jane Herriman has recorded and shared useful beginner's guides on YouTube that start with the basics before heading onto more advanced subjects like multiple dispatch, broadcasting and metaprogramming.
This piece of learning has been made of simple self-inspired puzzles before moving on to anything more complex. That differs from my dalliance with R and Python, where I ventured into complexity first, not least because of testing them out with public COVID data. Eventually, I got around to doing that with Julia too, though my interest was beginning to wane by then, and Julia's abilities for creating multipage PDF files were such that the PDF Toolkit was needed to help with this. Along the way, I have made use of such packages as CSV.jl, DataFrames.jl, DataFramesMeta, Plots, Gadfly.jl, XLSX.jl and JSON3.jl, among others. After that, there is PrettyTables.jl to try out, and anyone can look at the Beautiful Makie website to see what Makie can do. There are plenty of other packages creating graphs, such as SpatialGraphs.jl, PGFPlotsX and GRUtils.jl. For formatting numbers, options include Format.jl and Humanize.jl.
So far, my primary usage has been with personal financial data together with automated processing and backup of photo files. The photo file processing has taken advantage of the ability to compile Julia scripts for added speed because just-in-time compilation always means there is a lag before the real work begins.
VS Code is my chosen editor for working with Julia scripts, since it has a plugin for the language. That adds the REPL, syntax highlighting, execution and data frame viewing capabilities that once were added to the now defunct Atom editor by its own plugin. While it would be nice to have a keyboard shortcut for script execution, the whole thing works well and is regularly updated.
Naturally, there have been a load of queries as I have gone along and the Julia Documentation has been consulted as well as Julia Discourse and Stack Overflow. The latter pair have become regular landing spots on many a Google search. One example followed a glitch that I encountered after a Julia upgrade when I asked a question about this and was directed to the XLSX.jl Migration Guides where I got the information that I needed to fix my code for it to run properly.
There is more learning to do as I continue to use Julia for various things. Once compiled, it does run fast, as promised. The syntax paradigm is akin to R and Python, but there are Julia-specific features too. If you have used the others, the learning curve is lessened but not eliminated completely. This is not an object-oriented language as such, but its functional nature makes it familiar enough for getting going with it. In short, the project has come a long way since it started more than ten years ago. There is much here for the scientific programmer, but only time will tell whether it can usurp its older competitors. For now, I will remain interested in it.
Removing a Julia package
5th October 2022
While I have been programming with SAS for a few decades, and it remains a linchpin in the world of clinical development in the pharmaceutical industry, other technologies like R and Python are gaining a foothold. Two years ago, I started to look at those languages with personal projects being a great way of facilitating this. In addition, I got to hear of Julia and got to try that too. That journey continues since I have put it into use for importing and backing up photos, and there are other possible uses too.
Recently, I updated Julia to version 1.8.2 but ran into a problem with the DataArrays package that I had installed, so I decided to remove it since it was added during experimentation. Though the Pkg package that is used for package management is documented, I had not got to that, which meant that some web searching ensued. It turns out that there are two ways of doing this. One uses the REPL: after pressing the ] key, the following command gets issued:
rm DataArrays
When all is done, pressing the delete or backspace keys returns things to normal. This also can be done in a script as well as the REPL, and the following line works in both instances:
using Pkg; Pkg.rm("DataArrays")
While the semicolon is used to separate two commands issued on the same line, they can be on different lines or issued separately just as well. Naturally, DataArrays is just an example here; you replace that with the name of whatever other package you need to remove. Since we can get carried away when downloading packages, there are times when a clean-up is needed to remove redundant packages, so knowing how to remove any clutter is invaluable.
Getting custom Python imports to work in Visual Studio Code
18th February 2022
While I continue to use Spyder as my preferred Python code editor, I also tried out Visual Studio Code. Handily, this Integrated Development Environment also has facilities for working with R and Julia code as well as Markdown text editing, and adding the required extensions is enough for these applications; it helps that there is an unofficial Grammarly extension for content creation.
My Python code development makes use of the Pylance extension, and it works a little differently from Spyder when it comes to including files using import statements. Spyder will look into the folder where the base script is located, but the default behaviour of Pylance is that it looks in the root path of your workspace. This meant that any code that ran successfully in Spyder failed in Visual Studio Code.
To solve this issue, I added the location using the python.analysis.extraPaths setting for the workspace. I opened Settings by going to File > Preferences > Settings in the menu and typed python.analysis.extraPaths into the search box, which brought up the correct section. There, I clicked on Add Item, entered the required path and clicked OK. This resolved the problem, and everything worked properly afterwards.
Broadening data science horizons: Useful Python packages for working with data
14th October 2021
My response to changes in the technology stack used in clinical research is to develop some familiarity with programming and scripting platforms that complement and compete with SAS, a system with which I have been programming since 2000. While one of these has been R, Python is another that has taken up my attention, and I now also have Julia in my sights as well. There may be others to assess in the fullness of time.
While I first started to explore the Data Science world in the autumn of 2017, it was in the autumn of 2019 that I began to complete LinkedIn training courses on the subject. Good though they were, I find that I need to actually use a tool to better understand it. At that time, I did get to hear about Python packages like Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn and Beautiful Soup, though it took until the spring of this year for me to start gaining some hands-on experience with using any of these.
During the summer of 2020, I attended a BCS webinar on the CodeGrades initiative, a programming mentoring scheme inspired by the way classical musicianship is assessed. In fact, one of the main progenitors is a trained classical musician and teacher of classical music who turned to Python programming when starting a family to have a more stable income. The approach is that a student selects a project and works their way through it, with mentoring and periodic assessments carried out in a gentle and discursive manner. Of course, the project has to be engaging for the learning experience to stay the course, and that point came through in the webinar.
That is one lesson that resonates with me with subjects as diverse as web server performance and the ongoing pandemic supplying data, and there are other sources of public data to examine as well before looking through my own personal archive gathered over the decades. Though some subjects are uplifting while others are more foreboding, the key thing is that they sustain interest and offer opportunities for new learning. Without being able to dream up new things to try, my knowledge of R and Python would not be as extensive as it is, and I hope that it will help with learning Julia too.
In the main, my own learning has been a solo effort, with consultation of documentation along with web searches that have brought me to the likes of Real Python, Stack Abuse, Data Viz with Python and R and others for longer tutorials, as well as threads on Stack Overflow. Usually, the web searching begins when I need a steer on a particular topic or a way to resolve a particular error or warning message, but books are always worth reading, even if that is the slower route. While those from the Dummies series or from O'Reilly have proved most useful so far, I do need to read them more completely than I already have; it is all too tempting to go with the "program and search for solutions as you go" approach instead.
To get going, many choose the Anaconda distribution to get Jupyter notebook functionality, but I prefer a more traditional editor, so Spyder has been my tool of choice for Python programming; there are others like PyCharm as well. Because Spyder itself is written in Python, it can be installed using pip from PyPI like other Python packages. It has other dependencies like Pylint for code management activities, but these get installed behind the scenes.
The packages that I first met in 2019 may be the mainstays for doing data science, but I have discovered others since then. It also seems that there is porosity between the worlds of R and Python, so you get some Python packages aping R packages, and R has the Reticulate package for executing Python code. There are Python counterparts to such Tidyverse staples as dplyr and ggplot2 in the form of Siuba and Plotnine, respectively. Though the syntax of these packages is not a direct copy of what is executed in R, it is close enough for there to be enough familiarity for added user-friendliness compared to Pandas or Matplotlib. The interoperability does not stop there, for there is SQLAlchemy for connecting to MySQL and other databases (PyMySQL is needed as well), and there also is SASPy for interacting with SAS Viya.
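As a small, hedged example of that database interoperability, with placeholder connection details and table name, and PyMySQL supplying the driver:
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/gallery")
with engine.connect() as conn:
    photo_count = conn.execute(text("SELECT COUNT(*) FROM photos")).scalar()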
While Python may not have the speed of Julia, there are plenty of packages for working with larger workloads. Of these, Dask, Modin and RAPIDS all have their uses for dealing with data volumes that make Pandas code crawl. As if to prove that there are plenty of libraries for various forms of data analytics, data science, artificial intelligence and machine learning, there also are the likes of Keras, TensorFlow and NetworkX. These are just a selection of what is available, and there is always the possibility of checking out others. It may be tempting to stick with the most popular packages all the time, especially when they do so much, but it never hurts to keep an open mind either.
Getting Eclipse to start without incompatibility errors on Linux Mint 19.1
12th June 2019
Recent curiosity about Java programming and Groovy scripting got me trying to start up the Eclipse IDE that I had installed on my main machine. What I got instead of a successful application startup was a message that included the following:
!MESSAGE Exception launching the Eclipse Platform:
!STACK
java.lang.ClassNotFoundException: org.eclipse.core.runtime.adaptor.EclipseStarter
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:466)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:566)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:499)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:626)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:584)
at org.eclipse.equinox.launcher.Main.run(Main.java:1438)
at org.eclipse.equinox.launcher.Main.main(Main.java:1414)
The cause was a mismatch between Eclipse and the installed version of Java that it needed to run. After all, the software itself is written in the Java language and the installed version from the usual software repositories was too old for Java 11. The solution turned out to be installing a newer version as a Snap (Ubuntu's answer to Flatpak). The following command did the needful since snapd already was running on my machine:
sudo snap install eclipse --classic
The only part of the command that warrants extra comment is the --classic switch, since that is needed for a tool like Eclipse that needs to access a host file system. On executing, the software was downloaded from Snapcraft and then installed within its own bundle of dependencies. The latter adds a certain detachment from the underlying Linux installation and ensures that no messages appear because of incompatibilities like the one near the start of this post.
On Making PROC REPORT Work Harder
1st September 2010
In the early years of my SAS programming career, there seemed to be just the one procedure to use if you wanted to create a summary table. That was TABULATE, and it was great for generating columns according to the value of a variable such as the treatment received by a subject in a clinical study. To a point, it could generate statistics for you too, and I often used it to sum frequency and percentage variables. Since then, it seems to have been enhanced a little, and it surprised me with the statistics it could produce when I had a recent play. Here's the code:
proc tabulate data=sashelp.class;
class sex;
var age;
table age*(n median*f=8. mean*f=8.1 std*f=8.1 min*f=8. max*f=8. lclm*f=8.1 uclm*f=8.1),sex
/ misstext="0";
run;
When you compare that with the idea of creating one variable per column and then defining them in PROC REPORT as many do, it looks more elegant, and the results aren't bad either, though they can be tweaked further from the quick example that I generated. That last comment brings me to the point that PROC REPORT seems to have taken over from TABULATE wherever I care to look these days, and I do ask myself whether it is the right tool for the jobs it is being given, or whether it is being used in the best way.
While using a Data Step to create one variable per column in a PROC REPORT output doesn't strike me as the best way to write reusable code, there are ways to make PROC REPORT do more for you. For example, by defining GROUP, ACROSS and ANALYSIS columns in an output, you can persuade the procedure to do the summarising for you, and there's some example code below with the comma nesting height under sex in the resulting table. Sums are created by default if you do this, and forgoing an analysis column definition means that you get a frequency table, not at all a useless thing in numerous instances.
proc report data=sashelp.class nowd missing;
columns age sex,height;
define age / group "Age";
define sex / across "Sex";
define height / analysis mean f=8.1 "Mean Height";
run;
For those times when you need to create more heavily formatted statistics (summarising a range as min-max rather than showing min and max separately, for example), you might feel that the GROUP/ACROSS set-up's non-display of character values puts a stop to using that approach. However, I found that making every value combination unique and attaching a cell ID helps to work around the problem. Then, you can create a format control data set from the data, like in the code below, and create a format from that which you can apply to the cell IDs to display things as you need them. This method does make things more portable from situation to situation than adding or removing columns depending on the values of a classification variable.
proc sql noprint;
create table cntlin as
select distinct "fmtname" as fmtname, cellid as start, cellid as end, decode as label
from report;
quit;
proc format lib=work cntlin=cntlin;
run;
About Perl's Binding Operator
20th May 2009
While this piece is as much an aide-mémoire for myself as anything else, putting it here seems worthwhile if it answers questions for others. The binding operators, =~ and !~, come in handy when you are framing conditional statements in Perl using regular expressions, for example, testing whether x =~ /\d+/ or not. The =~ variant is also used for changing strings using the s/[pattern1]/[pattern2]/ regular expression construct (here, s stands for "substitute"). What has brought this to mind is that I wanted to ensure that something was done for strings that did not contain a certain pattern, and that's where the !~ binding operator came in useful; ^~ might have come to mind for some reason, but it wasn't what I needed.