Coding Notebook

12:01, 13th July 2021

Python Pandas Series.str.strip(), lstrip() and rstrip()

The Pandas library in Python offers three string methods for removing whitespace, including newlines, from data in a Series. The str.lstrip() method removes leading whitespace from the left side of each string, str.rstrip() removes trailing whitespace from the right side and str.strip() removes it from both sides.

Because these methods share names with Python's built-in string methods, they are reached through the .str accessor of a Series, which applies them to every element. Each method returns a new Series with the relevant whitespace removed. Their behaviour can be verified by comparing the cleaned output against known string values: extra spaces are deliberately introduced into team name entries, each method is applied and the results are checked.
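As a minimal sketch of the check described above, the following pads a few team names with whitespace and strips it again. The team names and variable names are made up for illustration and are not taken from the original article.

```python
import pandas as pd

# Hypothetical team names with stray whitespace added for illustration.
teams = pd.Series(["  Boston Celtics", "Utah Jazz \n", "\t Brooklyn Nets  "])

left_clean = teams.str.lstrip()    # leading whitespace removed
right_clean = teams.str.rstrip()   # trailing whitespace removed
both_clean = teams.str.strip()     # whitespace removed from both ends

# Each call returns a new Series; the original Series is left unchanged.
print(both_clean.tolist())
print((both_clean == "Utah Jazz").tolist())
```

Comparing the raw Series against "Utah Jazz" would give no match for the padded entry, which is why stripping before comparison matters.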

08:22, 12th July 2021

Adding lines or other geoms to a plot in ggplot2 by calling a custom function

When creating multiple similar plots in R using ggplot2, a common requirement is to add the same vertical and horizontal reference lines across all of them. The intuitive approach of combining geom calls using the addition operator inside a custom function fails because ggplot2 cannot add ggproto objects together outside a plot context.

The solution is to have the custom function return a list of geom calls rather than a sum of them, as ggplot2 can process a list of geom objects when added to an existing plot. This approach allows the same set of reference lines to be reused across multiple plots cleanly and efficiently.

12:34, 11th July 2021

Tips for Selecting Columns in a Pandas DataFrame

Working with large datasets in Python often requires efficient column selection techniques, and Pandas offers several useful approaches for this purpose. The iloc indexer enables integer-location based indexing, allowing users to select individual columns, lists of columns or ranges using slice notation. When combining non-sequential ranges, NumPy's r_ object can translate a mix of slice notation and individual integers into a single array that iloc can process. Boolean arrays offer another powerful method: string accessor methods applied to the column index produce a mask that filters columns by name, and regular expressions allow several patterns to be matched at once.
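The selection techniques above might look roughly like this. The data frame and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with six columns, for illustration only.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6]],
    columns=["id", "sales_q1", "sales_q2", "cost_q1", "cost_q2", "notes"],
)

first = df.iloc[:, 0]        # single column by position (returns a Series)
some = df.iloc[:, [0, 5]]    # list of positions
block = df.iloc[:, 1:3]      # slice: positions 1 and 2

# np.r_ expands a mix of slices and integers into one integer array.
mixed = df.iloc[:, np.r_[0, 1:3, 5]]

# Boolean array built from the column index; the regex matches either prefix.
mask = df.columns.str.contains(r"^sales|^cost")
matched = df.loc[:, mask]
```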

The Pandas filter function provides a more straightforward alternative for selecting columns by name or pattern, accepting either partial string matching or regular expression inputs. Practical utilities such as list and dictionary comprehensions can also help users build reference mappings of column names and their indices, reducing the need to repeatedly consult original data files during analysis. One important consideration when using numerical indexing is ensuring that column positions remain consistent across different data inputs, as changes in column order could cause errors in subsequent processing steps.
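A short sketch of name-based selection and of a name-to-position lookup follows. The column names and the final order check are assumptions made for the example rather than code from the original article.

```python
import pandas as pd

# Hypothetical data frame, for illustration only.
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=["id", "sales_q1", "sales_q2", "cost_q1"],
)

by_substring = df.filter(like="sales")   # partial string match on column names
by_regex = df.filter(regex=r"q1$")       # regular expression on column names

# Dictionary comprehension mapping each column name to its position.
col_index = {name: pos for pos, name in enumerate(df.columns)}

# One possible guard before relying on positions: check the expected order.
expected = ["id", "sales_q1", "sales_q2", "cost_q1"]
assert list(df.columns) == expected, "column order differs from expected"
```

Failing early on an unexpected column order is cheaper than tracing a silent mix-up later in the analysis.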

09:27, 11th July 2021

What is Spaghetti Code (And Why You Should Avoid It)

Spaghetti code is a derogatory term in IT jargon used to describe poorly structured, convoluted programming that arises from a lack of planning, inexperience, unclear project scopes and the gradual accumulation of changes made by multiple developers over time.

Rather than following sound architectural principles, spaghetti code tends to rely on shortcuts such as GOTO statements, making programs difficult to maintain and scale. The term has been in use since at least the late 1970s, with historical references linking it to early programming languages such as FORTRAN.

For enterprise businesses, the consequences can be significant, as untangling years of disorganised code wastes time, money and developer effort. To prevent it, developers are encouraged to plan architecture carefully from the outset, conduct regular unit testing, seek peer reviews and use lightweight frameworks with clearly defined layers. Related coding problems include ravioli code, which affects object-oriented projects, lasagna code, which results from overly interdependent layers and pizza code, which describes an architecture that is too flat.

09:26, 11th July 2021

Spyder: The Scientific Python Development Environment

Spyder is an Integrated Development Environment designed for scientific computing in Python, offering an editor, an interactive console, a variable explorer and various other tools to support program development. Users can write and execute code directly within the editor, run individual lines or sections using keyboard shortcuts, and interact with defined objects through the IPython Console.

The environment supports incremental coding and debugging by allowing objects to persist between executions, though this can occasionally cause issues if code inadvertently relies on variables defined in a previous session rather than within the script itself. Spyder includes a step-by-step debugger that lets users move through code line by line, inspect variables at each stage and modify them as needed.

Plotting with Matplotlib is supported either inline within the console or in a separate window, and documentation strings can be formatted using reStructuredText and the Numpydoc standard to produce well-rendered output in the built-in help panel. Configuration options allow users to manage code style compliance, symbolic mathematics via SymPy and a range of run settings, while keyboard shortcuts and customisable preferences help streamline the development workflow.

13:12, 8th July 2021

Python KeyError Exceptions and How to Handle Them

A KeyError is a common Python exception raised when attempting to access a key that does not exist in a dictionary or similar mapping structure. It is a subclass of LookupError and can occasionally appear in standard library modules such as zipfile, where it signals that a requested item cannot be located.

When encountered, the traceback provides useful details, including the missing key and the line of code responsible for the error. There are several practical ways to handle a KeyError, depending on the situation.

The most common approach is to use the .get() method on a dictionary, which returns a default value rather than raising an exception when a key is absent. In cases where it is important to confirm whether a key exists without retrieving its value, the in operator offers a straightforward check. For more general scenarios, particularly when dealing with third-party code or modules that do not support .get(), a try-except block provides reliable control over program flow by catching the exception and executing a fallback action instead.
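The three approaches can be sketched side by side as follows. The dictionary contents and variable names are hypothetical.

```python
# Hypothetical mapping, for illustration only.
ages = {"Jim": 30, "Pam": 28}

# 1. .get() returns a default instead of raising KeyError.
age = ages.get("Michael", 0)

# 2. The in operator checks for a key without retrieving its value.
has_pam = "Pam" in ages

# 3. try-except catches the exception and runs a fallback instead.
try:
    value = ages["Michael"]
except KeyError as err:
    value = None
    missing = err.args[0]   # the key that could not be found
```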

09:03, 5th July 2021

10 Tips And Tricks For Data Scientists Vol.10

Volume 10 of a series aimed at data scientists covers a range of practical coding techniques in Python and R. In Python, the tips include retrieving the key of the maximum value in a dictionary, sorting dictionaries by value in ascending or descending order, shuffling a Pandas data frame using fractional sampling, repositioning a column to the end of a data frame through re-indexing, performing circular shifts on arrays using NumPy, replacing data frame values by specifying column and index positions with the loc indexer, automatically generating a requirements file for a Python project using the pigar package, producing random names with the names library and reading header-free CSV files by manually defining column names. The single R tip addresses a statistics problem typical of data science interviews: it demonstrates how to calculate the standard deviation of a normal distribution from a known probability threshold using the standard normal distribution, then confirms the result through simulation.
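A few of the Python tips listed above could be written along these lines. This is a sketch with invented data and is not necessarily how the original series implements them.

```python
import numpy as np
import pandas as pd

# Hypothetical dictionary, for illustration only.
scores = {"a": 3, "b": 7, "c": 5}

top_key = max(scores, key=scores.get)   # key of the maximum value
ascending = dict(sorted(scores.items(), key=lambda kv: kv[1]))
descending = dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})

# Shuffle all rows by sampling a fraction of 1 (the seed is an arbitrary choice).
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Move column "x" to the end by re-indexing the columns.
cols = [c for c in df.columns if c != "x"] + ["x"]
reordered = df.reindex(columns=cols)

# Circular shift: elements pushed off the end wrap round to the start.
shifted = np.roll(np.array([1, 2, 3, 4, 5]), 2)
```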

19:26, 4th July 2021

Passing arguments to an R script from command lines

There are two main approaches to passing external arguments to an R script from the command line. The first uses the built-in R function commandArgs, which scans the arguments supplied when the R session was invoked, storing them in a vector that the script can then reference by position, with the ability to set default values or return errors when required arguments are missing. The second approach uses the optparse package, which works in a Python-like style by allowing named flags and options to be declared with their types, default values and help messages, producing a more structured and user-friendly interface that also generates automatic help documentation when requested.

09:31, 3rd July 2021

Think of && as a stricter &

R provides two versions each of the logical AND and OR operators, namely the shorter vectorised forms (& and |) and the longer scalar forms (&& and ||). The shorter forms compare pairs of elements across two vectors and return a result of the same length, making them well-suited for filtering rows in a data frame. The longer forms, by contrast, are intended for length-one inputs and return a single scalar value. Historically they examined only the first element of each input; R 4.2.0 introduced a warning when a vector longer than one element is passed to them, and R 4.3.0 turned that into an error.

The longer forms also support short-circuit evaluation, meaning the second operand is not assessed if the result can be determined from the first alone. Because if() statements require a scalar condition, the longer forms are generally preferred in control-flow programming, though care must be taken to supply genuinely scalar inputs.

The functions all() and any() offer a clean way to reduce a logical vector to a scalar by applying AND or OR across all elements respectively. Finally, NA values complicate logical operations by infecting results, though the isTRUE() and isFALSE() functions provide a reliable means of testing whether a value is strictly and unambiguously TRUE or FALSE.

16:54, 2nd July 2021

Speeding Up R Shiny – The Definitive Guide

Optimising the performance of R Shiny applications is achievable through a combination of thoughtful development practices and targeted technical interventions. Proper data handling is a foundational concern, including preprocessing static data in advance and choosing appropriate storage solutions based on data size, with faster reading and writing functions offering measurable gains over base R defaults.

Because R is single-threaded, delegating long-running tasks to separate processes using packages such as promises or shiny.worker can prevent one user's activity from blocking others, resulting in a smoother concurrent experience. Understanding Shiny's scoping rules allows developers to share objects efficiently across sessions where appropriate, while keeping user-specific data private.

Reducing reliance on renderUI in favour of update functions and leveraging JavaScript where possible minimises unnecessary communication between the browser and the server. Caching repeated, resource-heavy operations using either the memoise package or Shiny's built-in bindCache mechanism can significantly reduce rendering times, particularly when outputs depend on a finite set of input combinations.

Serving an application via a platform that supports multiple concurrent R processes improves scalability further. Finally, profiling tools such as profvis help identify which specific functions are consuming the most time, enabling developers to focus optimisation efforts precisely where they are needed rather than relying on guesswork.
