Coding Notebook

10:26, 21st July 2021

Time Travel with py datatable 1.0

Version 1.0 of datatable, the Python counterpart to the widely used R package data.table, introduced support for temporal data types through two new formats and an accompanying family of functions. The date32 type represents a calendar date without a time component, storing values internally as a 32-bit signed integer counting days from the epoch date of 1 January 1970, with a range spanning approximately 5.8 million years in either direction. The time64 type captures a specific moment, stored as a 64-bit integer measuring nanoseconds from the same epoch in UTC.

Both types support initialisation in several ways, including from integer values, ISO 8601 formatted strings and individual date or time components via the constructor functions ymd() and ymdt(). When working with non-standard date strings, a combination of casting and string slicing functions within the datatable API can be used to parse and convert values correctly. The datatable.time family also includes a range of part functions such as year(), month(), day() and hour(), which allow users to extract individual components from date or time columns and apply them in operations such as filtering.
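The storage model described above can be illustrated with nothing but the standard library. This is a minimal sketch of the arithmetic only, not the datatable API: the helper names days_since_epoch and nanos_since_epoch are made up for illustration, and datatable performs the equivalent conversion internally.

```python
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_MOMENT = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_since_epoch(d: date) -> int:
    """Day count of the kind a date32 value is described as holding."""
    return (d - EPOCH_DATE).days


def nanos_since_epoch(moment: datetime) -> int:
    """Nanosecond count of the kind a time64 value is described as holding."""
    delta = moment - EPOCH_MOMENT
    # Integer arithmetic avoids float rounding; datetime resolves to microseconds.
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 1_000


entry_date = date(2021, 7, 21)
entry_moment = datetime(2021, 7, 21, 10, 26, tzinfo=timezone.utc)

print(days_since_epoch(entry_date))
print(nanos_since_epoch(entry_moment))

# Rough span, in years, of a signed 32-bit counter of days.
print(2**31 / 365.25)
```

The last line shows where the figure of roughly 5.8 million years comes from: it is simply the largest 32-bit signed day count divided by the length of a year.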

14:57, 15th July 2021

Get image size (width, height) with Python, OpenCV, Pillow (PIL)

In Python, image dimensions can be retrieved using either OpenCV or Pillow (PIL), with a key difference in how each library orders width and height. OpenCV treats images as NumPy arrays, where the shape attribute returns dimensions in the order of height, width and channel for colour images, or height and width for greyscale images, meaning width and height must be accessed by their respective index positions or via tuple unpacking.

Pillow, by contrast, offers a more straightforward approach through its size attribute, which returns a (width, height) tuple directly, as well as dedicated width and height attributes that can be accessed individually, and this behaviour is consistent across both colour and greyscale images.
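The difference in ordering can be seen in a short sketch. It assumes Pillow and NumPy are installed and builds a blank image in memory rather than reading a file; OpenCV itself is not imported, since the array returned by cv2.imread has the same (height, width, channels) layout as the NumPy array used here.

```python
import numpy as np
from PIL import Image

# A blank 640x480 RGB image built in memory, so no file is needed.
pil_image = Image.new("RGB", (640, 480))

# Pillow: size is (width, height); width and height are also attributes.
width, height = pil_image.size

# OpenCV hands images back as NumPy arrays; converting the Pillow image
# gives an array with the same (height, width, channels) layout.
array = np.asarray(pil_image)
h, w, channels = array.shape

# Greyscale arrays have no channel axis, so slicing the first two entries
# gives code that works for both colour and greyscale.
grey = np.asarray(pil_image.convert("L"))
grey_h, grey_w = grey.shape[:2]

print("Pillow size:", (width, height))
print("Array shape:", array.shape)
print("Greyscale shape:", grey.shape)
```

Slicing with shape[:2] is a common way to avoid an unpacking error when the same code may receive either kind of image.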

14:56, 15th July 2021

Python Pandas – Stop Truncating Strings

In the Python Pandas library, long strings are truncated by default when displayed, but this behaviour can be overridden by calling the set_option function with the display.max_colwidth option set to None, which prevents any truncation from occurring. Older releases used -1 for the same purpose, but that value has been deprecated since pandas 1.0. This approach is considered cleaner than the common workaround of entering a large, arbitrary integer value to achieve the same result.
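A minimal sketch of the setting in action, assuming pandas 1.0 or later, where None is the value that switches truncation off (earlier releases used -1 for this). The frame and the 120-character string are invented for illustration.

```python
import pandas as pd

long_text = "x" * 120
frame = pd.DataFrame({"text": [long_text]})

# With the default column width limit, the value is cut short and
# ends in an ellipsis.
truncated = repr(frame)

# Lift the limit for the rest of the session.
pd.set_option("display.max_colwidth", None)
full = repr(frame)

# Restore the default so later output is unaffected.
pd.reset_option("display.max_colwidth")

print(truncated)
print(full)
```

For a temporary change, pd.option_context("display.max_colwidth", None) can be used in a with block instead, which restores the previous value automatically.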

09:32, 15th July 2021

Reading and Writing XML Files in Python

Python offers two primary modules for handling XML files: the older minidom module and the more modern ElementTree module. Whilst minidom treats XML as a tree of objects based on the Document Object Model, ElementTree provides a more straightforward interface in which each element behaves much like a list of its child elements and holds its attributes in a dictionary, making it the more accessible option for those unfamiliar with DOM.

Using ElementTree, developers can parse existing XML files by creating a tree structure and accessing its root element, count child nodes, write new XML files by constructing elements and sub-elements, search for specific elements using functions such as find() and findall(), modify node content and attributes, add new sub-elements and remove individual attributes, specific sub-elements or entire groups of child nodes. Minidom is capable of parsing and counting XML elements too, but ElementTree is generally the recommended choice due to its simpler, more Pythonic approach and broader functionality.
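The operations listed above can be sketched in a few lines with the standard library. The XML sample, tag names and attribute names here are invented for illustration, and the document is parsed from a string rather than a file so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<library>
  <book id="1"><title>First</title><year>2019</year></book>
  <book id="2"><title>Second</title><year>2021</year></book>
</library>"""

root = ET.fromstring(SAMPLE)      # parse and obtain the root element
book_count = len(root)            # number of direct child nodes

# find() returns the first match, findall() returns every match.
first_title = root.find("book/title").text
all_titles = [title.text for title in root.findall("book/title")]

# Modify node content and attributes; attrib behaves like a dictionary.
second = root.findall("book")[1]
second.find("year").text = "2022"
second.set("lang", "en")
del second.attrib["id"]

# Add a new sub-element, then remove an existing one.
third = ET.SubElement(root, "book", id="3")
ET.SubElement(third, "title").text = "Third"
root.remove(root.find("book"))    # drops the first <book>

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

To work with files instead, ET.parse("input.xml").getroot() reads a document and ET.ElementTree(root).write("output.xml") writes one; both file names are placeholders.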

09:31, 15th July 2021

How to Use sorted() and sort() in Python

Python offers two primary methods for sorting data: the built-in sorted() function and the .sort() list method. The sorted() function accepts any iterable as an argument and returns a new sorted list, leaving the original data unchanged, whilst .sort() operates directly on a list, modifying it in place and returning nothing.

Both methods support two optional keyword arguments, namely reverse, which accepts a Boolean value to switch between ascending and descending order, and key, which accepts a single-argument function to customise how elements are compared during sorting. The key argument can be used with built-in functions such as len() or str.lower(), or with lambda functions for more flexible sorting logic.

There are notable limitations to be aware of, including the inability to sort lists containing incompatible data types and the potential for errors when the function passed to key cannot handle all values in the iterable. Choosing between the two approaches depends largely on whether preserving the original data matters, as sorted() is the safer choice when the original order may still be needed, whilst .sort() is appropriate for lists where in-place modification is acceptable.
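A short sketch of the behaviour described above, using made-up sample lists. It covers the return values of the two approaches, the key and reverse arguments, and the error raised for incompatible types.

```python
words = ["pear", "Fig", "banana", "apple"]

# sorted() returns a new list and leaves the original untouched.
by_default = sorted(words)                    # capitals sort before lowercase
by_length = sorted(words, key=len)
ignoring_case = sorted(words, key=str.lower)
last_letter_desc = sorted(words, key=lambda w: w[-1], reverse=True)

# .sort() reorders the list in place and returns None.
numbers = [3, 1, 2]
result = numbers.sort()

# Mixing types that cannot be compared raises TypeError.
try:
    sorted([1, "two", 3])
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(by_default, by_length, ignoring_case, last_letter_desc, sep="\n")
print(numbers, result)
print(type(mixed_error).__name__)
```

Because .sort() returns None, writing numbers = numbers.sort() is a common mistake that discards the list.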

09:30, 15th July 2021

sed, a stream editor

GNU sed is a stream editor whose manual is published under the GNU Free Documentation Licence. It performs basic text transformations on files or pipeline input in a single pass over the data, which makes it more efficient than interactive editors. It is invoked from the command line using a script and one or more input files, and supports a wide range of options including in-place file editing, extended regular expressions, sandbox mode and unbuffered input and output. Scripts consist of one or more commands, each optionally preceded by an address or address range that determines which lines the command acts upon, and commands can be separated by semicolons or newlines or grouped using curly braces.

The most commonly used command is the substitute command, which matches a regular expression against the pattern space and replaces matched content with a specified replacement string, supporting flags for global replacement, case-insensitive matching and output to a file. sed maintains two internal buffers, the pattern space and the hold space, and advanced scripting techniques make use of multi-line commands such as N, P and D to process multiple lines simultaneously, alongside branching commands such as b, t and T for flow control. Both basic and extended regular expression syntaxes are supported, with the latter enabled via the -E or -r option, and GNU sed additionally provides extensions including special character classes, back-references, escape sequences and multibyte character handling for use in localised environments.
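The cycle behind the substitute command can be approximated in Python to make the idea concrete. This is a rough analogue rather than a reimplementation of sed: the function sed_substitute and its parameters are hypothetical, only a single regular-expression address is supported, and the replacement string follows Python's re syntax, not sed's.

```python
import re
from typing import Iterable, Iterator, Optional


def sed_substitute(
    lines: Iterable[str],
    pattern: str,
    replacement: str,
    address: Optional[str] = None,
    replace_all: bool = False,
) -> Iterator[str]:
    """Yield each line after an s/pattern/replacement/ style edit."""
    regex = re.compile(pattern)
    selector = re.compile(address) if address is not None else None
    for line in lines:
        # Each cycle loads one line, minus its newline, into the pattern space.
        pattern_space = line.rstrip("\n")
        if selector is None or selector.search(pattern_space):
            # For re.sub, a count of 0 means every match, like the g flag.
            count = 0 if replace_all else 1
            pattern_space = regex.sub(replacement, pattern_space, count=count)
        # The pattern space is emitted at the end of the cycle.
        yield pattern_space


text = ["colour red red", "size large", "colour blue"]

first_only = list(sed_substitute(text, r"red", "green"))
every_match = list(sed_substitute(text, r"red", "green", replace_all=True))
addressed = list(sed_substitute(text, r"colour", "color", address=r"blue"))

print(first_only)
print(every_match)
print(addressed)
```

The three calls correspond roughly to s/red/green/, s/red/green/g and /blue/s/colour/color/ in sed. Because the function is a generator, it processes one line at a time, which mirrors the single-pass design.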

09:15, 15th July 2021

How to conditionally stop SAS code execution and gracefully terminate SAS session

When developing SAS programs that handle large datasets, there is often a need to stop code execution conditionally and terminate the SAS session without generating errors or warnings in the log. The ABORT statement, while useful for genuine failures, produces error messages that make it unsuitable for scenarios where stopping is a logical and expected outcome. The ENDSAS statement is a more appropriate tool, though it carries its own limitations, as it is a global statement that cannot be placed directly within conditional executable blocks such as IF-THEN logic without causing syntax errors.

Two reliable workarounds exist for achieving truly graceful termination. The first uses a data step with CALL EXECUTE to push the ENDSAS statement outside the step boundaries so that it executes conditionally after the step completes. The second uses SAS Macro Language to conditionally generate the ENDSAS statement alongside an informative note in the log.

For interactive development environments, capturing the SAS log to a file using PROC PRINTTO before any termination logic runs is strongly advisable, as closing the session will otherwise destroy the log output. Developers working in SAS Studio should be aware that ENDSAS behaves differently there, stopping further processing without terminating the session itself.

09:14, 15th July 2021

Using SQL for R data.frames with sqldf

The R package sqldf offers a way to query R data frames using standard SQL syntax, making it particularly appealing to those with a SQL background. While numerous R packages exist for data manipulation and wrangling, including dplyr, data.table and tidyverse, sqldf provides an alternative approach by allowing users to filter and retrieve data using familiar SQL statements. A simple comparison using the iris dataset demonstrates that fetching filtered columns can be achieved through base R indexing, dplyr piping or sqldf queries, with each method producing equivalent results and the choice ultimately coming down to personal preference and familiarity.

09:14, 15th July 2021

How to conditionally execute SAS global statements

SAS global statements, which include LIBNAME, FILENAME, TITLE and FOOTNOTE, define objects and configure system settings that persist throughout a SAS session. Although these statements can appear within data steps, they are not considered executable in that context, meaning they cannot form part of an IF-THEN/ELSE conditional structure. Instead, they take effect immediately after compilation, before any data processing begins.

This behaviour can cause confusion, as a global statement placed inside a conditional block will still execute regardless of whether the condition is met. To work around this, two practical approaches exist. The first involves using the SAS macro language, whereby the macro processor runs before the SAS compiler and can conditionally generate global statements that are then passed to the compiler for execution. The second approach uses the CALL EXECUTE routine inside a data step, which allows code to be constructed dynamically based on actual data values and then submitted for execution once the step boundary is crossed, making it particularly useful for data-driven scenarios such as generating individual report titles for each unique category within a dataset.

09:11, 15th July 2021

The Best Resources for Learning Shiny App Development

Shiny is an R package developed by Posit (formerly RStudio) that enables users to build interactive web applications directly from R, combining the computational power of the language with modern web interactivity. Posit offers a range of learning materials for Shiny, including written and video tutorials, an app gallery, development articles and community support. Beyond what Posit itself provides, a number of books cover the subject in depth, including works by Hadley Wickham, Colin Fay and David Granjon, while tutorials from contributors such as Zev Ross, Ted Laderas and Dean Attali offer structured, practical introductions to building Shiny applications.
