TOPIC: DASK
Python productivity: Building better code through design, performance and scale
12th September 2025

Python's success in data science and beyond stems from more than just readable syntax. It represents a coherent philosophy where errors guide development, explicitness prevents bugs, modern tooling enforces quality, performance comes from purpose-built engines, and scaling extends rather than replaces familiar patterns. Understanding these principles transforms everyday coding from a series of individual tasks into a systematic approach to building robust, maintainable and efficient systems.
Error-Driven Development as a Design Philosophy
Python treats errors not as failures, but as design features that surface problems early and prevent subtle defects later. The language embodies an "easier to ask forgiveness than permission" philosophy, attempting operations first and objecting meaningfully when they cannot proceed.
Consider how Python handles basic operations. A SyntaxError appears immediately when code violates grammatical rules: if True print("hello") triggers a complaint with a caret pointing to the problematic location. Python neither guesses intentions nor continues with broken syntax, and this guarantee of clear structure keeps code understandable across projects and platforms.
Sequence operations demonstrate similar principles. When code attempts to access lst[5] on a three-element list, Python raises IndexError: list index out of range rather than silently padding or expanding the sequence. This deliberate failure prevents hidden logic errors in loops and aggregations by forcing explicit checks of assumptions about data size.
Dictionary lookups follow the same pattern. Accessing a non-existent key with d['missing'] yields KeyError: 'missing' rather than inventing placeholder values. This explicit failure catches typos and unclear control flow whilst enabling defensive programming patterns through try/except blocks.
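A minimal sketch of that defensive pattern, with the dictionary and fallback value invented for illustration:

```python
config = {"host": "localhost", "port": 8080}

try:
    timeout = config["timeout"]   # raises KeyError if the key is absent
except KeyError:
    timeout = 30                  # fall back to an explicit, visible default
```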
Name resolution errors like NameError and UnboundLocalError enforce clear scoping rules without creating variables accidentally or resolving names to unexpected contexts. Type discipline appears at runtime through TypeError for incorrect argument types and ValueError for correct types with inappropriate values. Each error message identifies which contract has been violated, directing fixes to either the object passed or the value it contains.
Assertions provide a final layer of optional verification. The assert statement allows code to state assumptions explicitly, failing with meaningful messages when invariants do not hold. This narrows the search space for defects by making expectations visible and providing immediate context for failures.
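A short sketch of the idea (assertions can be disabled with the -O flag, so they suit internal sanity checks rather than input validation):

```python
def average(values: list[float]) -> float:
    # State the invariant explicitly; the message gives immediate context on failure.
    assert len(values) > 0, "average() requires at least one value"
    return sum(values) / len(values)
```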
Taking these error signals seriously nudges development towards explicitness and clarity, establishing a foundation for all subsequent quality improvements.
Explicitness Over Implicitness
Making intentions clear through code structure prevents ambiguity, aids tooling and simplifies reuse. This principle manifests across multiple areas of Python development, from data structures to function signatures.
Raw dictionaries offer flexibility but create fragility. A typo in a key or missing field becomes a runtime KeyError with no contract about required contents. Using @dataclass to define structured objects like User with id, email, full_name, status and optional last_login provides clear interfaces with minimal overhead. Type hints and IDE support make attribute access unambiguous, whilst construction fails early when required fields are absent.
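A minimal sketch of such a class, with the field types assumed for illustration:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class User:
    id: int
    email: str
    full_name: str
    status: str
    last_login: Optional[datetime] = None   # optional field with an explicit default

# Omitting a required field fails at construction time, not deep inside later logic:
user = User(id=1, email="ada@example.com", full_name="Ada Lovelace", status="active")
```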
For cases requiring validation, pydantic models build on this foundation. An email field declared as EmailStr automatically validates format, while custom validators can restrict status values to specific options such as 'active', 'inactive' or 'pending'. The resulting models are self-documenting and shield downstream code from invalid data.
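A rough sketch assuming pydantic v2 (EmailStr needs the optional email-validator dependency); the field names and allowed statuses mirror the description above:

```python
from pydantic import BaseModel, EmailStr, field_validator

ALLOWED_STATUSES = {"active", "inactive", "pending"}

class UserModel(BaseModel):
    id: int
    email: EmailStr            # format validated automatically
    status: str = "pending"

    @field_validator("status")
    @classmethod
    def check_status(cls, value: str) -> str:
        if value not in ALLOWED_STATUSES:
            raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
        return value

# UserModel(id=1, email="not-an-email", status="archived") raises a ValidationError
# naming both offending fields, shielding downstream code from bad data.
```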
Function parameters representing closed sets of options benefit from similar treatment. Plain strings invite typos and lack autocomplete support. Defining enums such as OrderStatus with PENDING, SHIPPED and DELIVERED makes possible states explicit whilst helping both developers and tools. Passing OrderStatus.SHIPPED to process_order reveals intention clearly and enables straightforward comparisons against enum members.
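A minimal sketch, with the process_order body invented purely to show the comparison:

```python
from enum import Enum

class OrderStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"

def process_order(order_id: int, status: OrderStatus) -> None:
    # Comparison against an enum member is explicit and immune to typos.
    if status is OrderStatus.SHIPPED:
        print(f"Order {order_id} is on its way")

process_order(42, OrderStatus.SHIPPED)
```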
Function signatures become clearer through keyword-only arguments, enforced with a bare star in definitions. A function like create_user(name, email, *, admin=False, notify=True, temporary=False) forces call sites to write create_user(..., admin=True, notify=False) rather than passing sequences of ambiguous boolean values. The resulting calls read almost as documentation.
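A short sketch of the signature described above; the function body is a placeholder:

```python
def create_user(name: str, email: str, *, admin: bool = False,
                notify: bool = True, temporary: bool = False) -> dict:
    # Everything after the bare * must be passed by keyword.
    return {"name": name, "email": email, "admin": admin,
            "notify": notify, "temporary": temporary}

create_user("Grace", "grace@example.com", admin=True, notify=False)  # reads as documentation
# create_user("Grace", "grace@example.com", True, False)             # TypeError: positional booleans rejected
```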
File path operations improve through object-oriented design. The pathlib module treats paths as objects where joining uses natural / syntax, directory creation uses mkdir, suffix changes use with_suffix, and text operations use read_text and write_text. Code becomes shorter, more portable and less prone to string manipulation errors.
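A small sketch pulling those operations together; the paths are invented for the example:

```python
from pathlib import Path

data_dir = Path("data") / "exports"            # joining with the / operator
data_dir.mkdir(parents=True, exist_ok=True)    # create the directory tree as needed

report = data_dir / "summary.txt"
report.write_text("42 rows processed\n")       # simple text output
backup = report.with_suffix(".bak")            # swap the extension
print(report.read_text(), backup.name)
```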
These patterns consistently replace implicit assumptions with explicit contracts, making code intention more visible and reducing the cognitive load of understanding system behaviour.
Structural Code Quality Through Tooling and Patterns
Sustainable code quality emerges from systematic approaches to organisation, testing and maintenance rather than individual discipline alone. Several key patterns and tools work together to create robust, readable codebases.
Control flow benefits from handling error conditions early rather than nesting deeply. Guard clauses invert the traditional structure so that invalid states return immediately, whilst the main logic remains unindented when preconditions are met. A process_payment function checking order.is_valid, then user.has_payment_method, then available funds before performing charges reads linearly. Exceptions during processing are caught precisely, errors logged with context, and functions return deterministically.
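A self-contained sketch of that shape; the Order and User classes and the charge_card helper are stand-ins invented for the example:

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class Order:
    id: int
    total: float
    is_valid: bool = True

@dataclass
class User:
    available_funds: float
    has_payment_method: bool = True

def charge_card(user: User, amount: float) -> None:
    user.available_funds -= amount        # stand-in for a real payment call

def process_payment(order: Order, user: User) -> bool:
    # Guard clauses: invalid states return immediately, main logic stays flat.
    if not order.is_valid:
        return False
    if not user.has_payment_method:
        return False
    if user.available_funds < order.total:
        return False
    try:
        charge_card(user, order.total)
    except Exception as exc:              # a real system would catch the specific payment error
        logger.error("Charge failed for order %s: %s", order.id, exc)
        return False
    return True
```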
Even beloved list comprehensions have limits. When filtering and transformation logic become complex, sprawling comprehensions become opaque. Extracting predicates into named functions like is_valid_premium_user restores readability by giving conditions clear names. Where multiple checks and transformations are needed, conventional loops with early continue statements may prove more straightforward and debuggable.
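A brief illustration, with the user fields assumed for the example:

```python
def is_valid_premium_user(user: dict) -> bool:
    """The condition now carries its own explanation in its name."""
    return (
        user.get("active", False)
        and user.get("tier") == "premium"
        and user.get("email") is not None
    )

users = [
    {"name": "Ada", "active": True, "tier": "premium", "email": "ada@example.com"},
    {"name": "Bob", "active": True, "tier": "free", "email": "bob@example.com"},
]

premium_emails = [u["email"] for u in users if is_valid_premium_user(u)]
```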
Pure functions that accept all inputs as parameters and return results without changing external state simplify testing and reuse. Moving from designs where functions mutate global totals and read from global inventories to approaches where calculations accept prices, quantities and discounts as inputs removes hidden coupling. This enables deterministic testing of edge cases and reasoning about code without tracking changing state.
Documentation ties these practices together. Docstrings explaining what functions do, parameters they accept, values they return and including examples make codebases self-explanatory. Combined with tooling, docstrings serve as both reference and executable documentation.
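A small example combining both ideas: a pure function whose inputs arrive as parameters and whose docstring doubles as reference and executable example:

```python
def order_total(price: float, quantity: int, discount: float = 0.0) -> float:
    """Return the cost of a line item.

    Args:
        price: Unit price of the item.
        quantity: Number of units ordered.
        discount: Fractional discount, e.g. 0.1 for 10% off.

    Returns:
        The discounted total for the line.

    Example:
        >>> order_total(10.0, 3, discount=0.1)
        27.0
    """
    return price * quantity * (1 - discount)
```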
Automation enforces consistency where human attention falters. Formatters like Black, linters like Ruff, static type checkers like mypy and import organisers like isort can run before each commit using pre-commit. Style issues and common mistakes are caught automatically, freeing mental capacity for higher-level concerns.
When handling errors, resist blanket except: statements that swallow everything from syntax errors to keyboard interrupts. Be specific where possible, catching ConnectionError, ValueError or database errors and handling each appropriately. When catch-alls are necessary, prefer except Exception as e: and log full tracebacks so that unexpected failures remain visible and traceable.
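A sketch of that layering; download here is a hypothetical stand-in for a real network call:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def download(url: str) -> dict:
    # Hypothetical stand-in for a real network call.
    return {"url": url, "status": "ok"}

def fetch_record(url: str) -> Optional[dict]:
    try:
        return download(url)
    except ConnectionError:
        logger.warning("Network problem fetching %s; will retry later", url)
    except ValueError as exc:
        logger.error("Bad payload from %s: %s", url, exc)
    except Exception:
        # Catch-all kept visible: log the full traceback rather than swallowing it.
        logger.exception("Unexpected failure fetching %s", url)
    return None
```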
Performance Through Modern Engines
Once code achieves cleanliness and robustness, performance becomes the next frontier. Traditional tools often leave substantial speed gains on the table, particularly for data-intensive work where single-threaded processing creates bottlenecks on modern hardware.
Polars, a DataFrame library written in Rust, addresses these limitations by making parallelism the default whilst providing both eager and lazy execution modes. Benchmarks on datasets of around 580,000 rows show Polars completing filtering roughly four times faster than Pandas, aggregation over twenty times faster, groupby operations eight times faster, sorting three times faster, and feature engineering five times faster. These gains stem from fundamental architectural differences rather than incremental optimisations.
The performance improvement requires a shift in mental model. Instead of writing sequential operations that execute immediately, you can batch expressions and let Polars parallelise them automatically. Creating both profit and margin with one with_columns call signals that these calculations can proceed together. Lazy evaluation extends this approach further. Building pipelines with pl.scan_csv('large_file.csv').filter(...).group_by(...).agg(...).collect() lets Polars construct a query plan and optimise it before execution: filters are pushed down so less data reaches later stages, only selected columns are read, and compatible operations are combined.
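A rough sketch of both styles; the file name and column names are assumed:

```python
import polars as pl

# Eager: two derived columns in one with_columns call, computed together.
df = pl.DataFrame({"revenue": [100.0, 250.0], "cost": [60.0, 200.0]})
df = df.with_columns([
    (pl.col("revenue") - pl.col("cost")).alias("profit"),
    ((pl.col("revenue") - pl.col("cost")) / pl.col("revenue")).alias("margin"),
])

# Lazy: build the plan first, let Polars optimise it, then collect.
result = (
    pl.scan_csv("large_file.csv")              # assumed file path
      .filter(pl.col("amount") > 100)          # assumed column names
      .group_by("region")
      .agg(pl.col("amount").sum().alias("total"))
      .collect()
)
```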
Expressiveness comes from an expression system applying operations across columns succinctly. Where Pandas encourages thinking in terms of single columns assigned individually, Polars supports expressions like pl.col(['revenue', 'cost']) * 1.1 applied to multiple columns simultaneously. Familiar transformations translate directly: pl.read_csv('sales.csv') replaces pd.read_csv, selection and filtering become df.filter(pl.col('order_value') > 500).select(['customer_id', 'order_value']), new columns are created with df.with_columns(((pl.col('revenue') - pl.col('cost')) / pl.col('revenue')).alias('profit_margin')), and operations utilise all available cores automatically.
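Collecting those translations into one place, assuming sales.csv contains the columns mentioned:

```python
import polars as pl

df = pl.read_csv("sales.csv")                  # replaces pd.read_csv

high_value = (
    df.filter(pl.col("order_value") > 500)
      .select(["customer_id", "order_value"])
)

with_margin = df.with_columns(
    ((pl.col("revenue") - pl.col("cost")) / pl.col("revenue")).alias("profit_margin")
)

# Expressions can target several columns at once:
uplifted = df.with_columns(pl.col(["revenue", "cost"]) * 1.1)
```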
Memory efficiency improves through Apache Arrow's columnar format, storing data more compactly and avoiding NumPy-based overhead. CSV files of around 2 GB requiring roughly 10 GB of RAM in Pandas often process in approximately 4 GB with Polars. This difference can determine whether workflows run smoothly on laptops or require chunking strategies.
Scaling Beyond Single Processes
When single processes reach their limits, two prominent approaches help scale Python across cores and machines whilst preserving familiar patterns and mental models.
Dask extends NumPy, Pandas and scikit-learn idioms to larger-than-memory datasets by partitioning arrays, DataFrames and computations, then scheduling them in parallel. Its primary abstractions are dask.dataframe and dask.array, along with delayed task graphs. It excels for scalable batch processing, feature engineering and out-of-core work where the mental model remains close to the PyData stack. Integration with scikit-learn and XGBoost is mature, work-stealing schedulers are sophisticated, and detailed dashboards provide visibility. Clusters can be managed natively or through systems like Kubernetes and YARN.
For large-scale data cleaning and feature engineering, Dask provides natural extensions. Reading many CSV files from storage with dd.read_csv('s3://data/large-dataset-*.csv'), filtering rows with df[df['amount'] > 100], applying transformations per partition, then writing Parquet with df.to_parquet('s3://processed/output/') looks like Pandas but runs in parallel and out of core. Array computations through dask.array handle chunked operations so that x.mean(axis=0).compute() runs across partitions without exhausting memory.
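A condensed sketch of both patterns; the S3 paths come from the text and assume s3fs and the relevant credentials are configured:

```python
import dask.dataframe as dd
import dask.array as da

# DataFrame work: the same idioms as Pandas, executed per partition and in parallel.
df = dd.read_csv("s3://data/large-dataset-*.csv")
df = df[df["amount"] > 100]
df.to_parquet("s3://processed/output/")

# Array work: chunked NumPy-style computation that never loads everything at once.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
column_means = x.mean(axis=0).compute()
```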
Ray takes a more general approach to distributed computing through remote functions and actors. It suits workloads with many independent Python functions, stateful services and complex machine learning pipelines. A growing ecosystem includes Ray Tune for hyperparameter optimisation, Ray Train for multi-GPU training, Ray Serve for model serving, and RLlib for reinforcement learning. Scheduling is dynamic and actor-based, cluster management integrates with cloud providers, and scalability handles applications requiring control and flexibility.
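A minimal sketch of the remote-function and actor primitives on a local cluster:

```python
import ray

ray.init()  # local cluster; connect to an existing one with ray.init(address=...)

@ray.remote
def square(x: int) -> int:
    return x * x

@ray.remote
class Counter:
    """A tiny stateful actor."""
    def __init__(self):
        self.total = 0

    def add(self, value: int) -> int:
        self.total += value
        return self.total

futures = [square.remote(i) for i in range(8)]   # fan out independent tasks
print(ray.get(futures))

counter = Counter.remote()
print(ray.get(counter.add.remote(5)))
```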
For model training requiring many configuration explorations, Ray Tune provides schedulers and search strategies. Training functions can be wrapped and launched across workers with tune.run, with methods like ASHA stopping unpromising runs early. Integration with popular libraries means scaling experiments requires minimal code changes. Ray Serve turns model classes exposing __call__ methods into scalable services with @serve.deployment and serve.run, handling routing and scaling automatically.
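A sketch of the Serve half, assuming a recent Ray 2.x release where deployments are composed with .bind() and launched with serve.run; the Classifier logic is a placeholder rather than a real model:

```python
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class Classifier:
    def __init__(self):
        self.threshold = 0.5               # stand-in for loading a real model

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        score = min(len(payload.get("text", "")) / 100, 1.0)  # fake "inference"
        return {"positive": score > self.threshold}

# Starts Serve on the local Ray cluster and exposes the deployment over HTTP.
serve.run(Classifier.bind())
```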
Incremental Adoption and Pragmatic Choices
The most sustainable approach to improving Python productivity involves gradual implementation rather than wholesale changes. Each improvement builds on previous ones, creating compound benefits over time whilst minimising disruption to existing workflows.
Adopting Polars illustrates this principle well. The first step can be simply loading data with pl.read_csv('big_file.csv') for faster I/O, then converting to Pandas with .to_pandas() if the rest of the pipeline expects Pandas objects. As comfort grows, expression-oriented patterns yield dividends: filtering then adding multiple columns in single chained calls, so Polars can optimise across steps. The full benefits appear when entire pipelines are expressed lazily, but this transition can happen gradually as understanding deepens.
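A sketch of those first two steps; the file and column names are invented, and .to_pandas() assumes pandas and pyarrow are installed:

```python
import polars as pl

# Step 1: use Polars only for fast I/O, hand the result back to Pandas.
pdf = pl.read_csv("big_file.csv").to_pandas()

# Step 2: chain filtering and derived columns in expression form; swapping
# read_csv for scan_csv later turns this into a fully lazy pipeline.
summary = (
    pl.read_csv("big_file.csv")
      .filter(pl.col("amount") > 0)                       # assumed column
      .with_columns([
          (pl.col("amount") * 0.2).alias("tax"),
          (pl.col("amount") * 1.2).alias("gross"),
      ])
)
```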
Similarly, clean code practices can be introduced incrementally. Start by letting error messages guide fixes rather than suppressing them. Refactor one fragile dictionary into a dataclass when maintenance becomes painful. Extract a complex list comprehension into a named function when debugging becomes difficult. Each change teaches principles that apply more broadly whilst delivering immediate benefits.
Scaling decisions are often pragmatic rather than theoretical. If work centres on DataFrames and arrays with minimal conceptual shift from Pandas or NumPy, Dask likely delivers what you need. If workloads mix training, tuning and serving or require orchestrating many concurrent Python tasks with fine-grained control, Ray's abstractions and libraries provide better matches. Trying each approach on representative workflow slices quickly clarifies which will serve best.
The choice between tools should be driven by actual requirements rather than perceived sophistication. A single machine with Polars may outperform a small cluster running Pandas. A well-structured monolithic application may be more maintainable than a prematurely distributed system. The key is understanding when complexity serves genuine needs rather than adding overhead.
Synthesis: The Compound Nature of Python Productivity
These themes work together rather than in isolation. Error-driven development creates habits that surface problems early. Explicit code structures make intentions clear to both humans and tools. Quality practices through tooling and patterns create sustainable foundations. Modern engines provide performance without sacrificing readability. Scaling approaches extend familiar patterns rather than replacing them. Incremental adoption ensures changes compound rather than disrupt.
The result is a coherent approach to Python development where each improvement reinforces others. Explicit data structures work better with static type checkers. Pure functions are easier to test and parallelise. Clean error handling integrates naturally with distributed systems. Modern DataFrame engines benefit from lazy evaluation patterns that also improve code clarity.
This synthesis explains Python's enduring appeal in data science and beyond. The language welcomes beginners with approachable syntax, whilst scaling to demanding production work without losing clarity. The ecosystem encourages practices that speed up teams over time rather than optimising for immediate gratification. The same principles that guide small scripts apply to large systems, creating a path for continuous improvement rather than periodic rewrites.
Start small: let error messages guide one fix, refactor one fragile dictionary into a dataclass, switch one slow operation to Polars, or run one hyperparameter sweep with Ray Tune. The improvements compound, and the foundations established early enable sophisticated capabilities later without fundamental changes to approach or mindset.
Broadening data science horizons: Useful Python packages for working with data
14th October 2021

My response to changes in the technology stack used in clinical research is to develop some familiarity with programming and scripting platforms that complement and compete with SAS, a system with which I have been programming since 2000. While one of these has been R, Python is another that has taken up my attention, and I now have Julia in my sights as well. There may be others to assess in the fullness of time.
While I first started to explore the Data Science world in the autumn of 2017, it was in the autumn of 2019 that I began to complete LinkedIn training courses on the subject. Good though they were, I find that I need to actually use a tool to better understand it. At that time, I did get to hear about Python packages like Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn and Beautiful Soup, though it took until the spring of this year for me to start gaining some hands-on experience with using any of these.
During the summer of 2020, I attended a BCS webinar on the CodeGrades initiative, a programming mentoring scheme inspired by the way classical musicianship is assessed. In fact, one of the main progenitors is a trained classical musician and teacher of classical music who turned to Python programming when starting a family to have a more stable income. The approach is that a student selects a project and works their way through it, with mentoring and periodic assessments carried out in a gentle and discursive manner. Of course, the project has to be engaging for the learning experience to stay the course, and that point came through in the webinar.
That is one lesson that resonates with me, with subjects as diverse as web server performance and the ongoing pandemic supplying data, and there are other sources of public data to examine as well before looking through my own personal archive gathered over the decades. Though some subjects are uplifting while others are more foreboding, the key thing is that they sustain interest and offer opportunities for new learning. Without being able to dream up new things to try, my knowledge of R and Python would not be as extensive as it is, and I hope that it will help with learning Julia too.
In the main, my own learning has been a solo effort, with consultation of documentation along with web searches that have brought me to the likes of Real Python, Stack Abuse, Data Viz with Python and R and others for longer tutorials, as well as threads on Stack Overflow. Usually, the web searching begins when I need a steer on a particular problem or a way to resolve a particular error or warning message, but books are always worth reading even if that is the slower route. While those from the Dummies series or from O'Reilly have proved most useful so far, I do need to read them more completely than I already have; it is all too tempting to go with the "program and search for solutions as you go" approach instead.
To get going, many choose the Anaconda distribution to get Jupyter notebook functionality, but I prefer a more traditional editor, so Spyder has been my tool of choice for Python programming, and there are others like PyCharm as well. Because Spyder itself is written in Python, it can be installed using pip from PyPI like other Python packages. It has other dependencies like Pylint for code management activities, but these get installed behind the scenes.
The packages that I first met in 2019 may be the mainstays for doing data science, but I have discovered others since then. It also seems that there is porosity between the worlds of R and Python, so you get some Python packages aping R packages, and R has the Reticulate package for executing Python code. There are Python counterparts to such Tidyverse staples as dplyr and ggplot2 in the form of Siuba and Plotnine, respectively. Though the syntax of these packages is not a direct copy of what is executed in R, it is close enough to feel familiar, which adds user-friendliness compared to Pandas or Matplotlib. The interoperability does not stop there, for there is SQLAlchemy for connecting to MySQL and other databases (PyMySQL is needed as well), and there also is SASPy for interacting with SAS Viya.
While Python may not have the speed of Julia, there are plenty of packages for working with larger workloads. Of these, Dask, Modin and RAPIDS all have their uses for dealing with data volumes that make Pandas code crawl. As if to prove that there are plenty of libraries for various forms of data analytics, data science, artificial intelligence and machine learning, there also are the likes of Keras, TensorFlow and NetworkX. These are just a selection of what is available, and there is always the possibility of checking out others. It may be tempting to stick with the most popular packages all the time, especially when they do so much, but it never hurts to keep an open mind either.