TOPIC: PROGRAMMING LANGUAGES
Python productivity: Building better code through design, performance and scale
12th September 2025
Python's success in data science and beyond stems from more than just readable syntax. It represents a coherent philosophy where errors guide development, explicitness prevents bugs, modern tooling enforces quality, performance comes from purpose-built engines, and scaling extends rather than replaces familiar patterns. Understanding these principles transforms everyday coding from a series of individual tasks into a systematic approach to building robust, maintainable and efficient systems.
- Error-Driven Development as a Design Philosophy
Python treats errors not as failures, but as design features that surface problems early and prevent subtle defects later. The language embodies an "easier to ask forgiveness than permission" philosophy, attempting operations first and objecting meaningfully when they cannot proceed.
Consider how Python handles basic operations. A SyntaxError appears immediately when code violates grammatical rules: if True print("hello") triggers an immediate complaint with a caret pointing to the problematic location. Python neither guesses intentions nor continues with broken syntax, because this guarantee of clear structure keeps code understandable across projects and platforms.
Sequence operations demonstrate similar principles. When code attempts to access lst[5] on a three-element list, Python raises IndexError: list index out of range rather than silently padding or expanding the sequence. This deliberate failure prevents hidden logic errors in loops and aggregations by forcing explicit checks of assumptions about data size.
Dictionary lookups follow the same pattern. Accessing a non-existent key with d['missing'] yields KeyError: 'missing' rather than inventing placeholder values. This explicit failure catches typos and unclear control flow whilst enabling defensive programming patterns through try/except blocks.
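As a minimal sketch of that defensive pattern (the dictionary and key names are hypothetical):
config = {"host": "localhost", "port": 8080}

try:
    timeout = config["timeout"]
except KeyError:
    # The key is absent, so fall back to an explicit default rather than a silent guess.
    timeout = 30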
Name resolution errors like NameError and UnboundLocalError enforce clear scoping rules without creating variables accidentally or resolving names to unexpected contexts. Type discipline appears at runtime through TypeError for incorrect argument types and ValueError for correct types with inappropriate values. Each error message identifies which contract has been violated, directing fixes to either the object passed or the value it contains.
Assertions provide a final layer of optional verification. The assert statement allows code to state assumptions explicitly, failing with meaningful messages when invariants do not hold. This narrows the search space for defects by making expectations visible and providing immediate context for failures.
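For instance, a small illustration of stating an invariant with assert (the function is made up for the example):
def average(values):
    # State the assumption explicitly; an empty sequence would make the result meaningless.
    assert len(values) > 0, "average() requires at least one value"
    return sum(values) / len(values)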
Taking these error signals seriously nudges development towards explicitness and clarity, establishing a foundation for all subsequent quality improvements.
- Explicitness Over Implicitness
Making intentions clear through code structure prevents ambiguity, aids tooling and simplifies reuse. This principle manifests across multiple areas of Python development, from data structures to function signatures.
Raw dictionaries offer flexibility but create fragility. A typo in a key or missing field becomes a runtime KeyError with no contract about required contents. Using @dataclass to define structured objects like User with id, email, full_name, status and an optional last_login provides clear interfaces with minimal overhead. Type hints and IDE support make attribute access unambiguous, whilst construction fails early when required fields are absent.
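A minimal sketch of such a dataclass, using the field names mentioned above (the types are illustrative assumptions):
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class User:
    id: int
    email: str
    full_name: str
    status: str
    last_login: Optional[datetime] = None  # optional, so it carries a default

# Construction raises a TypeError straight away if a required field is missing.
user = User(id=1, email="ada@example.com", full_name="Ada Lovelace", status="active")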
For cases requiring validation, pydantic models build on this foundation. An email field declared as EmailStr automatically validates format, while custom validators can restrict status values to specific options such as 'active', 'inactive' or 'pending'. The resulting models are self-documenting and shield downstream code from invalid data.
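A hedged sketch of that validation, assuming pydantic v2 and the email-validator extra that EmailStr needs:
from pydantic import BaseModel, EmailStr, field_validator

class UserModel(BaseModel):
    email: EmailStr          # format is checked automatically
    status: str = "pending"

    @field_validator("status")
    @classmethod
    def check_status(cls, value: str) -> str:
        # Restrict status to the closed set of options named above.
        if value not in {"active", "inactive", "pending"}:
            raise ValueError("status must be 'active', 'inactive' or 'pending'")
        return value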
Function parameters representing closed sets of options benefit from similar treatment. Plain strings invite typos and lack autocomplete support. Defining enums such as OrderStatus with PENDING, SHIPPED and DELIVERED makes possible states explicit whilst helping both developers and tools. Passing OrderStatus.SHIPPED to process_order reveals intention clearly and enables straightforward comparisons against enum members.
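A small sketch of that enum approach; process_order here is a hypothetical stand-in:
from enum import Enum

class OrderStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"

def process_order(order_id: int, status: OrderStatus) -> None:
    # Comparisons against enum members are unambiguous and autocomplete-friendly.
    if status is OrderStatus.SHIPPED:
        print(f"Order {order_id} is on its way")

process_order(42, OrderStatus.SHIPPED)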
Function signatures become clearer through keyword-only arguments, enforced with a bare star in definitions. A function like create_user(name, email, *, admin=False, notify=True, temporary=False) forces call sites to write create_user(..., admin=True, notify=False) rather than passing sequences of ambiguous boolean values. The resulting calls read almost as documentation.
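In sketch form, with the body elided because only the signature matters here:
def create_user(name, email, *, admin=False, notify=True, temporary=False):
    ...

create_user("Ada", "ada@example.com", admin=True, notify=False)
# create_user("Ada", "ada@example.com", True, False) would raise a TypeError instead.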
File path operations improve through object-oriented design. The pathlib module treats paths as objects where joining uses natural / syntax, directory creation uses mkdir, suffix changes use with_suffix, and text operations use read_text and write_text. Code becomes shorter, more portable and less prone to string manipulation errors.
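A minimal pathlib sketch covering those operations (the paths themselves are hypothetical):
from pathlib import Path

reports = Path("data") / "reports"          # joining with the / operator
reports.mkdir(parents=True, exist_ok=True)  # create the directory tree if needed

summary = reports / "summary.txt"
summary.write_text("quarterly totals\n")
backup = summary.with_suffix(".bak")        # change the extension
backup.write_text(summary.read_text())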
These patterns consistently replace implicit assumptions with explicit contracts, making code intention more visible and reducing the cognitive load of understanding system behaviour.
- Structural Code Quality Through Tooling and Patterns
Sustainable code quality emerges from systematic approaches to organisation, testing and maintenance rather than individual discipline alone. Several key patterns and tools work together to create robust, readable codebases.
Control flow benefits from handling error conditions early rather than nesting deeply. Guard clauses invert the traditional structure so that invalid states return immediately, whilst main logic remains non-indented when preconditions are met. A process_payment function checking order.is_valid, then user.has_payment_method, then available funds before performing charges reads linearly. Exceptions during processing are caught precisely, errors logged with context, and functions return deterministically.
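A sketch of that guard-clause shape, with hypothetical order and user objects and a made-up charge helper:
import logging

def charge(user, amount):
    # Hypothetical stand-in for a payment-gateway call.
    ...

def process_payment(order, user):
    if not order.is_valid:
        return "invalid order"
    if not user.has_payment_method:
        return "no payment method"
    if user.available_funds < order.total:
        return "insufficient funds"

    try:
        charge(user, order.total)
    except ConnectionError as exc:
        logging.error("Charge failed for order %s: %s", order.id, exc)
        return "payment failed"
    return "charged"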
Even beloved list comprehensions have limits. When filtering and transformation logic become complex, sprawling comprehensions become opaque. Extracting predicates into named functions like is_valid_premium_user restores readability by giving conditions clear names. Where multiple checks and transformations are needed, conventional loops with early continue statements may prove more straightforward and debuggable.
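For example, a quick sketch of pulling a predicate out of a crowded comprehension (the user records are invented):
def is_valid_premium_user(user):
    # A named condition is easier to read, test and reuse than an inline expression.
    return user["is_active"] and user["tier"] == "premium" and user["email_verified"]

users = [
    {"is_active": True, "tier": "premium", "email_verified": True},
    {"is_active": True, "tier": "free", "email_verified": True},
]
premium_users = [user for user in users if is_valid_premium_user(user)]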
Pure functions that accept all inputs as parameters and return results without changing external state simplify testing and reuse. Moving from designs where functions mutate global totals and read from global inventories to approaches where calculations accept prices, quantities and discounts as inputs removes hidden coupling. This enables deterministic testing of edge cases and reasoning about code without tracking changing state.
Documentation ties these practices together. Docstrings explaining what functions do, parameters they accept, values they return and including examples make codebases self-explanatory. Combined with tooling, docstrings serve as both reference and executable documentation.
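Bringing those two points together, a small sketch of a pure function whose docstring doubles as an executable example (the discount logic is invented):
def order_total(price: float, quantity: int, discount: float = 0.0) -> float:
    """Return the total for a line item.

    price is the unit price, quantity the number of units and discount a fraction between 0 and 1.

    >>> order_total(10.0, 3, discount=0.5)
    15.0
    """
    return price * quantity * (1 - discount)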
Automation enforces consistency where human attention falters. Formatters like Black, linters like Ruff, static type checkers like mypy and import organisers like isort can run before each commit using pre-commit. Style issues and common mistakes are caught automatically, freeing mental capacity for higher-level concerns.
When handling errors, resist blanket except: statements that swallow everything from syntax errors to keyboard interrupts. Be specific where possible, catching ConnectionError, ValueError or database errors and handling each appropriately. When catch-alls are necessary, prefer except Exception as e: and log full tracebacks so that unexpected failures remain visible and traceable.
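A brief sketch of that graduated approach, with fetch_orders as a hypothetical stand-in for some real operation:
import logging

def fetch_orders(url):
    # Hypothetical network call that may fail or return unparsable data.
    return []

def load_orders(url):
    try:
        return fetch_orders(url)
    except ConnectionError:
        logging.warning("Could not reach %s; will retry later", url)
        return []
    except ValueError:
        logging.error("Response from %s could not be parsed", url)
        return []
    except Exception:
        # Catch-all as a last resort: keep the full traceback visible.
        logging.exception("Unexpected failure while loading orders")
        raise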
- Performance Through Modern Engines
Once code achieves cleanliness and robustness, performance becomes the next frontier. Traditional tools often leave substantial speed gains on the table, particularly for data-intensive work where single-threaded processing creates bottlenecks on modern hardware.
Polars, a DataFrame library written in Rust, addresses these limitations by making parallelism the default whilst providing both eager and lazy execution modes. Benchmarks on datasets of around 580,000 rows show Polars completing filtering roughly four times faster than Pandas, aggregation over twenty times faster, groupby operations eight times faster, sorting three times faster, and feature engineering five times faster. These gains stem from fundamental architectural differences rather than incremental optimisations.
The performance improvement requires a shift in mental model. Instead of writing sequential operations that execute immediately, you can batch expressions and let Polars parallelise them automatically. Creating both profit and margin with one with_columns call signals that these calculations can proceed together. Lazy evaluation extends this approach further. Building pipelines with pl.scan_csv('large_file.csv').filter(...).group_by(...).agg(...).collect() lets Polars construct a query plan and optimise it before execution. Filters are pushed down so less data reaches later stages, only selected columns are read, and compatible operations are combined.
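A hedged sketch of such a lazy pipeline, assuming a recent Polars release (where the method is group_by) and a hypothetical file with region and revenue columns:
import polars as pl

summary = (
    pl.scan_csv("large_file.csv")                          # lazy: nothing is read yet
    .filter(pl.col("revenue") > 0)                         # pushed down towards the scan
    .group_by("region")
    .agg(pl.col("revenue").sum().alias("total_revenue"))
    .collect()                                             # the plan is optimised, then executed
)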
Expressiveness comes from an expression system applying operations across columns succinctly. Where Pandas encourages thinking in terms of single columns assigned individually, Polars supports expressions like pl.col(['revenue', 'cost']) * 1.1 applied to multiple columns simultaneously. Familiar transformations translate directly: pl.read_csv('sales.csv') replaces pd.read_csv, selection and filtering become df.filter(pl.col('order_value') > 500).select(['customer_id', 'order_value']), new columns are created with df.with_columns(((pl.col('revenue') - pl.col('cost')) / pl.col('revenue')).alias('profit_margin')), and operations utilise all available cores automatically.
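And a short eager-mode sketch of batching expressions in one with_columns call, using the revenue and cost columns mentioned above with made-up values:
import polars as pl

df = pl.DataFrame({"revenue": [120.0, 300.0], "cost": [80.0, 210.0]})
df = df.with_columns(
    (pl.col("revenue") - pl.col("cost")).alias("profit"),
    ((pl.col("revenue") - pl.col("cost")) / pl.col("revenue")).alias("margin"),
)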
Memory efficiency improves through Apache Arrow's columnar format, storing data more compactly and avoiding NumPy-based overhead. CSV files of around 2 GB requiring roughly 10 GB of RAM in Pandas often process in approximately 4 GB with Polars. This difference can determine whether workflows run smoothly on laptops or require chunking strategies.
- Scaling Beyond Single Processes
When single processes reach their limits, two prominent approaches help scale Python across cores and machines whilst preserving familiar patterns and mental models.
Dask extends NumPy, Pandas and scikit-learn idioms to larger-than-memory datasets by partitioning arrays, DataFrames and computations, then scheduling them in parallel. Its primary abstractions are dask.dataframe and dask.array, along with delayed task graphs. It excels at scalable batch processing, feature engineering and out-of-core work where the mental model remains close to the PyData stack. Integration with scikit-learn and XGBoost is mature, work-stealing schedulers are sophisticated, and detailed dashboards provide visibility. Clusters can be managed natively or through systems like Kubernetes and YARN.
For large-scale data cleaning and feature engineering, Dask provides natural extensions. Reading many CSV files from storage with dd.read_csv('s3://data/large-dataset-*.csv'), filtering rows with df[df['amount'] > 100], applying transformations per partition, then writing Parquet with df.to_parquet('s3://processed/output/') looks like Pandas but runs in parallel and out of core. Array computations through dask.array handle chunked operations so that x.mean(axis=0).compute() runs across partitions without exhausting memory.
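A minimal Dask sketch along those lines, treating the S3 paths and the amount column as placeholders and assuming s3fs is available:
import dask.dataframe as dd

df = dd.read_csv("s3://data/large-dataset-*.csv")  # one logical DataFrame over many files
filtered = df[df["amount"] > 100]                  # lazy, Pandas-style filtering
filtered.to_parquet("s3://processed/output/")      # triggers parallel, out-of-core execution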
Ray takes a more general approach to distributed computing through remote functions and actors. It suits workloads with many independent Python functions, stateful services and complex machine learning pipelines. A growing ecosystem includes Ray Tune for hyperparameter optimisation, Ray Train for multi-GPU training, Ray Serve for model serving, and RLlib for reinforcement learning. Scheduling is dynamic and actor-based, cluster management integrates with cloud providers, and the system scales to applications requiring fine-grained control and flexibility.
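As a small illustration of the remote-function model, not tied to any particular workload:
import ray

ray.init()  # starts a local Ray instance when no cluster address is given

@ray.remote
def square(x):
    return x * x

futures = [square.remote(i) for i in range(8)]  # scheduled in parallel across workers
print(ray.get(futures))                         # [0, 1, 4, 9, 16, 25, 36, 49]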
For model training that requires exploring many configurations, Ray Tune provides schedulers and search strategies. Training functions can be wrapped and launched across workers with tune.run, with methods like ASHA stopping unpromising runs early. Integration with popular libraries means scaling experiments requires minimal code changes. Ray Serve turns model classes exposing __call__ methods into scalable services with @serve.deployment and serve.run, handling routing and scaling automatically.
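A hedged Ray Serve sketch, assuming Ray 2.x where deployments are bound and passed to serve.run; the scoring logic is a stand-in for a real model:
from ray import serve
from starlette.requests import Request

@serve.deployment
class Scorer:
    def __init__(self):
        self.threshold = 0.5  # stand-in for real model state

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"positive": payload.get("score", 0.0) > self.threshold}

serve.run(Scorer.bind())  # Serve handles routing and replica scaling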
- Incremental Adoption and Pragmatic Choices
The most sustainable approach to improving Python productivity involves gradual implementation rather than wholesale changes. Each improvement builds on previous ones, creating compound benefits over time whilst minimising disruption to existing workflows.
Adopting Polars illustrates this principle well. The first step can be simply loading data with pl.read_csv('big_file.csv') for faster I/O, then converting to Pandas with .to_pandas() if the rest of the pipeline expects Pandas objects. As comfort grows, expression-oriented patterns yield dividends: filtering then adding multiple columns in single chained calls, so Polars can optimise across steps. Full benefits appear when entire pipelines are expressed lazily, but this transition can happen gradually as understanding deepens.
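That first step really can be as small as this sketch, assuming pyarrow is installed for the conversion and big_file.csv is a placeholder:
import polars as pl

df = pl.read_csv("big_file.csv")  # Polars does the fast, parallel I/O
pdf = df.to_pandas()              # hand the result to an existing Pandas pipeline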
Similarly, clean code practices can be introduced incrementally. Start by letting error messages guide fixes rather than suppressing them. Refactor one fragile dictionary into a dataclass when maintenance becomes painful. Extract a complex list comprehension into a named function when debugging becomes difficult. Each change teaches principles that apply more broadly whilst delivering immediate benefits.
Scaling decisions are often pragmatic rather than theoretical. If work centres on DataFrames and arrays with minimal conceptual shift from Pandas or NumPy, Dask likely delivers what you need. If workloads mix training, tuning and serving or require orchestrating many concurrent Python tasks with fine-grained control, Ray's abstractions and libraries provide better matches. Trying each approach on representative workflow slices quickly clarifies which will serve best.
The choice between tools should be driven by actual requirements rather than perceived sophistication. A single machine with Polars may outperform a small cluster running Pandas. A well-structured monolithic application may be more maintainable than a prematurely distributed system. The key is understanding when complexity serves genuine needs rather than adding overhead.
- Synthesis: The Compound Nature of Python Productivity
These themes work together rather than in isolation. Error-driven development creates habits that surface problems early. Explicit code structures make intentions clear to both humans and tools. Quality practices through tooling and patterns create sustainable foundations. Modern engines provide performance without sacrificing readability. Scaling approaches extend familiar patterns rather than replacing them. Incremental adoption ensures changes compound rather than disrupt.
The result is a coherent approach to Python development where each improvement reinforces others. Explicit data structures work better with static type checkers. Pure functions are easier to test and parallelise. Clean error handling integrates naturally with distributed systems. Modern DataFrame engines benefit from lazy evaluation patterns that also improve code clarity.
This synthesis explains Python's enduring appeal in data science and beyond. The language welcomes beginners with approachable syntax, whilst scaling to demanding production work without losing clarity. The ecosystem encourages practices that speed up teams over time rather than optimising for immediate gratification. The same principles that guide small scripts apply to large systems, creating a path for continuous improvement rather than periodic rewrites.
Start small: let error messages guide one fix, refactor one fragile dictionary into a dataclass, switch one slow operation to Polars, or run one hyperparameter sweep with Ray Tune. The improvements compound, and the foundations established early enable sophisticated capabilities later without fundamental changes to approach or mindset.
Clearing the Julia REPL
23rd September 2024
During development, there are times when you need to clear the Julia REPL. It can become so laden with content that it gets hard to debug your code. One way to accomplish this is to press the CTRL + L keyboard shortcut while focus is within the REPL; you need to click on it first. Another is to issue the following in the REPL itself:
print("\033c")
Here \033 is an escape code in octal format that is often used in terminal control sequences. The c character is what resets the terminal to its initial state. Printing this sequence is what does the clearing, and variations can be used to clear other kinds of console screens too, which makes it a more generic solution.
Dropping to an underlying shell using the ; character is another possibility. Then, you can use the clear or cls commands as needed; the latter is for Windows systems.
One last option is to define a Julia function for doing this:
function clear_console()
    run(`clear`)  # or `cls` for Windows
end
Calling the clear_console function then clears the screen programmatically, allowing for greater automation. The run function is the one that sends that command in backticks to the underlying shell for execution. Even using that alone should work too.
AttributeError: module 'PIL' has no attribute 'Image'
11th March 2024
One of my websites has an online photo gallery. This has been a long-term activity that has taken several forms over the years. Once HTML and JavaScript based, it then was powered by Perl before PHP and MySQL came along to take things from there.
While that remains how it works, the publishing side of things has used its own selection of mechanisms over the same time span. Perl and XML were the backbone until Python and Markdown took over. There was a time when ImageMagick and GraphicsMagick handled image processing, but Python now does that as well.
That was when the error message gracing the title of this post came to my notice. Everything was working well when executed in Spyder, but the message appeared when I tried running things using Python on the command line. PIL is the import name used by the Python 3 Pillow package; there was a separate package actually called PIL in the Python 2 days.
For me, pillow loads, resizes and creates new images, which is handy for adding borders and copyright/source information to each image as well as creating thumbnails. All this happens in memory and that makes everything go quickly, much faster than disk-based tools like ImageMagick and GraphicsMagick.
Of course, nothing is going to happen if the package cannot be loaded, and that is what the error message is about. Linux is what I mainly use, so that is the context for this scenario. What I was doing was something like the following in the Python script:
import PIL
Then, I referred to PIL.Image when I needed it, and this could not be found when the script was run from the command line (BASH). The solution was to add something like the following:
from PIL import Image
That sorted it, and I must have run into trouble with PIL.ImageFilter too, since I now load it in the same manner. In both cases, I could just refer to Image or ImageFilter as I required and without the dot syntax. However, you need to make sure that there is no clash with anything in another loaded Python package when doing this.
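For illustration, a small sketch of the kind of processing described here once the imports are right; the file names and sizes are hypothetical:
from PIL import Image, ImageFilter

img = Image.open("photo.jpg")
thumb = img.resize((320, 213))                           # in-memory resize for a thumbnail
soft = thumb.filter(ImageFilter.GaussianBlur(radius=1))  # light blur as an example filter
soft.save("photo_thumb.jpg")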
A look at the Julia programming language
19th November 2022
Several open-source computing languages get mentioned when talking about working with data. Among these are R and Python, but there are others; Julia is another one of these. It took a while before I got to check out Julia because I felt the need to get acquainted with R and Python beforehand. There are others like Lua to investigate too, but that can wait for now.
With the way that R is making an incursion into clinical data reporting analysis following the passage of decades when SAS was predominant, my explorations of Julia are inspired by a certain contrariness on my part. Alongside some small personal projects, there has been some reading in (digital) book form and online. Concerning the latter of these, there are useful tutorials like Introduction to Data Science: Learn Julia Programming, Maths & Data Science from Scratch or Julia Programming: a Hands-on Tutorial. Like what happens with R, there are online versions of published books available free of charge, and they include Julia Data Science and Interactive Visualization and Plotting with Julia. Video learning can help too and Jane Herriman has recorded and shared useful beginner's guides on YouTube that start with the basics before heading onto more advanced subjects like multiple dispatch, broadcasting and metaprogramming.
This piece of learning has been made of simple self-inspired puzzles before moving on to anything more complex. That differs from my dalliance with R and Python, where I ventured into complexity first, not least because of testing them out with public COVID data. Eventually, I got around to doing that with Julia too, though my interest was beginning to wane by then, and Julia's abilities for creating multipage PDF files were such that the PDF Toolkit was needed to help with this. Along the way, I have made use of such packages as CSV.jl, DataFrames.jl, DataFramesMeta, Plots, Gadfly.jl, XLSX.jl and JSON3.jl, among others. After that, there is PrettyTables.jl to try out, and anyone can look at the Beautiful Makie website to see what Makie can do. There are plenty of other packages creating graphs, such as SpatialGraphs.jl, PGFPlotsX and GRUtils.jl. For formatting numbers, options include Format.jl and Humanize.jl.
So far, my primary usage has been with personal financial data together with automated processing and backup of photo files. The photo file processing has taken advantage of the ability to compile Julia scripts for added speed because just-in-time compilation always means there is a lag before the real work begins.
VS Code is my chosen editor for working with Julia scripts, since it has a plugin for the language. That adds the REPL, syntax highlighting, execution and data frame viewing capabilities that once were added to the now defunct Atom editor by its own plugin. While it would be nice to have a keyboard shortcut for script execution, the whole thing works well and is regularly updated.
Naturally, there have been a load of queries as I have gone along and the Julia Documentation has been consulted as well as Julia Discourse and Stack Overflow. The latter pair have become regular landing spots on many a Google search. One example followed a glitch that I encountered after a Julia upgrade when I asked a question about this and was directed to the XLSX.jl Migration Guides where I got the information that I needed to fix my code for it to run properly.
There is more learning to do as I continue to use Julia for various things. Once compiled, it does run fast, as promised. The syntax paradigm is akin to R and Python, but there are Julia-specific features too. If you have used the others, the learning curve is lessened but not eliminated completely. This is not an object-oriented language as such, but its functional nature makes it familiar enough for getting going with it. In short, the project has come a long way since it started more than ten years ago. There is much here for the scientific programmer, but only time will tell whether it can usurp its older competitors. For now, I will remain interested in it.
Removing a Julia package
5th October 2022
While I have been programming with SAS for a few decades, and it remains a linchpin in the world of clinical development in the pharmaceutical industry, other technologies like R and Python are gaining a foothold. Two years ago, I started to look at those languages with personal projects being a great way of facilitating this. In addition, I got to hear of Julia and got to try that too. That journey continues since I have put it into use for importing and backing up photos, and there are other possible uses too.
Recently, I updated Julia to version 1.8.2 but ran into a problem with the DataArrays package that I had installed, so I decided to remove it since it was added during experimentation. Though the Pkg package that is used for package management is documented, I had not got to that, which meant that some web searching ensued. It turns out that there are two ways of doing this. One uses the REPL: after pressing the ] key, the following command gets issued:
rm DataArrays
When all is done, pressing the delete or backspace keys returns things to normal. This also can be done in a script as well as the REPL, and the following line works in both instances:
using Pkg; Pkg.rm("DataArrays")
While the semicolon is used to separate two commands issued on the same line, they can be on different lines or issued separately just as well. Naturally, DataArrays is just an example here; you replace that with the name of whatever other package you need to remove. Since we can get carried away when downloading packages, there are times when a clean-up is needed to remove redundant packages, so knowing how to remove any clutter is invaluable.
Getting custom Python imports to work in Visual Studio Code
18th February 2022
While I continue to use Spyder as my preferred Python code editor, I also tried out Visual Studio Code. Handily, this Integrated Development Environment also has facilities for working with R and Julia code as well as Markdown text editing, and adding the required extensions is enough for these applications; it helps that there is an unofficial Grammarly extension for content creation.
My Python code development makes use of the Pylance extension, and it works a little differently from Spyder when it comes to including files using import statements. Spyder will look into the folder where the base script is located, but the default behaviour of Pylance is that it looks in the root path of your workspace. This meant that any code that ran successfully in Spyder failed in Visual Studio Code.
To solve this issue, I added the location using the python.analysis.extraPaths setting for the workspace. I opened Settings by going to File > Preferences > Settings in the menu and typed python.analysis.extraPaths into the search box, which brought up the correct section. There, I clicked on Add Item, entered the required path and clicked OK. This resolved the problem, and everything worked properly afterwards.
Broadening data science horizons: Useful Python packages for working with data
14th October 2021
My response to changes in the technology stack used in clinical research is to develop some familiarity with programming and scripting platforms that complement and compete with SAS, a system with which I have been programming since 2000. While one of these has been R, Python is another that has taken up my attention, and I now also have Julia in my sights as well. There may be others to assess in the fullness of time.
While I first started to explore the Data Science world in the autumn of 2017, it was in the autumn of 2019 that I began to complete LinkedIn training courses on the subject. Good though they were, I find that I need to actually use a tool to better understand it. At that time, I did get to hear about Python packages like Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn and Beautiful Soup, though it took until the spring of this year for me to start gaining some hands-on experience with using any of these.
During the summer of 2020, I attended a BCS webinar on the CodeGrades initiative, a programming mentoring scheme inspired by the way classical musicianship is assessed. In fact, one of the main progenitors is a trained classical musician and teacher of classical music who turned to Python programming when starting a family to have a more stable income. The approach is that a student selects a project and works their way through it, with mentoring and periodic assessments carried out in a gentle and discursive manner. Of course, the project has to be engaging for the learning experience to stay the course, and that point came through in the webinar.
That is one lesson that resonates with me with subjects as diverse as web server performance and the ongoing pandemic supplying data, and there are other sources of public data to examine as well before looking through my own personal archive gathered over the decades. Though some subjects are uplifting while others are more foreboding, the key thing is that they sustain interest and offer opportunities for new learning. Without being able to dream up new things to try, my knowledge of R and Python would not be as extensive as it is, and I hope that it will help with learning Julia too.
In the main, my own learning has been a solo effort, with consultation of documentation along with web searches that have brought me to the likes of Real Python, Stack Abuse, Data Viz with Python and R and others for longer tutorials, as well as threads on Stack Overflow. Usually, the web searching begins when I need a steer on a particular topic or a way to resolve a particular error or warning message, but books are always worth reading, even if that is the slower route. While those from the Dummies series or from O'Reilly have proved most useful so far, I do need to read them more completely than I already have; it is all too tempting to go with the "program and search for solutions as you go" approach instead.
To get going, many choose the Anaconda distribution to get Jupyter notebook functionality, but I prefer a more traditional editor, so Spyder has been my tool of choice for Python programming; there are others like PyCharm as well. Because Spyder itself is written in Python, it can be installed using pip from PyPI like other Python packages. It has other dependencies like Pylint for code management activities, but these get installed behind the scenes.
The packages that I first met in 2019 may be the mainstays for doing data science, but I have discovered others since then. It also seems that there is porosity between the worlds of R and Python, so you get some Python packages aping R packages, and R has the Reticulate package for executing Python code. There are Python counterparts to such Tidyverse staples as dplyr and ggplot2 in the form of Siuba and Plotnine, respectively. Though the syntax of these packages is not a direct copy of what is executed in R, it is close enough for there to be enough familiarity for added user-friendliness compared to Pandas or Matplotlib. The interoperability does not stop there, for there is SQLAlchemy for connecting to MySQL and other databases (PyMySQL is needed as well), and there also is SASPy for interacting with SAS Viya.
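As a small, hedged example of that database interoperability, with placeholder connection details and table name, and PyMySQL supplying the driver:
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/gallery")
with engine.connect() as conn:
    photo_count = conn.execute(text("SELECT COUNT(*) FROM photos")).scalar()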
While Python may not have the speed of Julia, there are plenty of packages for working with larger workloads. Of these, Dask, Modin and RAPIDS all have their uses for dealing with data volumes that make Pandas code crawl. As if to prove that there are plenty of libraries for various forms of data analytics, data science, artificial intelligence and machine learning, there also are the likes of Keras, TensorFlow and NetworkX. These are just a selection of what is available, and there is always the possibility of checking out others. It may be tempting to stick with the most popular packages all the time, especially when they do so much, but it never hurts to keep an open mind either.
Getting Eclipse to start without incompatibility errors on Linux Mint 19.1
12th June 2019
Recent curiosity about Java programming and Groovy scripting got me trying to start up the Eclipse IDE that I had installed on my main machine. What I got instead of a successful application startup was a message that included the following:
!MESSAGE Exception launching the Eclipse Platform:
!STACK
java.lang.ClassNotFoundException: org.eclipse.core.runtime.adaptor.EclipseStarter
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:466)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:566)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:499)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:626)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:584)
at org.eclipse.equinox.launcher.Main.run(Main.java:1438)
at org.eclipse.equinox.launcher.Main.main(Main.java:1414)
The cause was a mismatch between Eclipse and the installed version of Java that it needed to run. After all, the software itself is written in the Java language and the installed version from the usual software repositories was too old for Java 11. The solution turned out to be installing a newer version as a Snap (Ubuntu's answer to Flatpak). The following command did the needful since snapd already was running on my machine:
sudo snap install eclipse --classic
The only part of the command that warrants extra comment is the --classic switch, since that is needed for a tool like Eclipse that needs to access a host file system. On executing, the software was downloaded from Snapcraft and then installed within its own bundle of dependencies. The latter adds a certain detachment from the underlying Linux installation and ensures that no messages appear because of incompatibilities like the one near the start of this post.
On Making PROC REPORT Work Harder
1st September 2010
In the early years of my SAS programming career, there seemed to be just the one procedure to use if you wanted to create a summary table. That was TABULATE, and it was great for generating columns according to the value of a variable such as the treatment received by a subject in a clinical study. To a point, it could generate statistics for you too, and I often used it to sum frequency and percentage variables. Since then, it seems to have been enhanced a little, and it surprised me with the statistics it could produce when I had a recent play. Here's the code:
proc tabulate data=sashelp.class;
class sex;
var age;
table age*(n median*f=8. mean*f=8.1 std*f=8.1 min*f=8. max*f=8. lclm*f=8.1 uclm*f=8.1),sex
/ misstext="0";
run;
When you compare that with the idea of creating one variable per column and then defining them in PROC REPORT as many do, it looks more elegant, and the results aren't bad either, though they can be tweaked further from the quick example that I generated. That last comment brings me to the point that PROC REPORT seems to have taken over from TABULATE wherever I care to look these days, and I do ask myself whether it is the right tool for the jobs it is being given, or whether it is being used in the best way.
While using a Data Step to create one variable per column in a PROC REPORT output doesn't strike me as the best way to write reusable code, there are ways to make PROC REPORT do more for you. For example, by defining GROUP, ACROSS and ANALYSIS columns in an output, you can persuade the procedure to do the summarising for you, and there's some example code below with the comma nesting height under sex in the resulting table. Sums are created by default if you do this, and forgoing an analysis column definition means that you get a frequency table, not at all a useless thing in numerous instances.
proc report data=sashelp.class nowd missing;
columns age sex,height;
define age / group "Age";
define sex / across "Sex";
define height / analysis mean f=8.1 "Mean Height";
run;
For those times when you need to create more heavily formatted statistics (summarising a range as min-max rather than showing min and max separately, for example), you might feel that the GROUP/ACROSS set-up's non-display of character values puts a stop to using that approach. However, I found that making every value combination unique and attaching a cell ID helps to work around the problem. Then, you can create a format control data set from the data, like in the code below, and create a format from that which you can apply to the cell IDs to display things as you need them. This method does make things more portable from situation to situation than adding or removing columns depending on the values of a classification variable.
proc sql noprint;
create table cntlin as
select distinct "fmtname" as fmtname, cellid as start, cellid as end, decode as label
from report;
quit;
proc format lib=work cntlin=cntlin;
run;
About Perl's Binding Operator
20th May 2009
While this piece is as much an aide-mémoire for myself as anything else, putting it here seems worthwhile if it answers questions for others. The binding operators, =~ and !~, come in handy when you are framing conditional statements in Perl using regular expressions, for example, testing whether x =~ /\d+/ or not. The =~ variant is also used for changing strings using the s/[pattern1]/[pattern2]/ regular expression construct (here, s stands for "substitute"). What has brought this to mind is that I wanted to ensure that something was done for strings that did not contain a certain pattern, and that's where the !~ binding operator came in useful; ^~ might have come to mind for some reason, but it wasn't what I needed.