12:44, 30th September 2021
Decorators in R
Decorators in R, similar to those in Python, are higher-order functions that modify or extend the behaviour of existing functions without altering their core functionality, often for tasks such as logging inputs or timing execution. Because R treats functions as first-class objects, a decorator can wrap a function so that additional code runs before or after the original function executes.
Examples include a timer decorator that records start and end times, or a logger that writes outputs to a file, both demonstrating how decorators can enhance functionality with minimal changes to the original code. While syntactic sugar for decorators in R is less seamless than in Python, tools like the tinsel package offer partial support, allowing decorators to be applied directly above function definitions. A practical application involved using a decorator to log the first argument of the system function during code refactoring, highlighting their utility in improving code readability and debugging complex operations. This approach underscores how decorators can simplify tasks such as monitoring function performance or capturing input data, making them a versatile tool for developers working with R.
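As a minimal sketch of the wrapping idea, here is the pattern in Python, the language the comparison above refers to, rather than in R. The names timed, log_first_arg, calls and add are hypothetical and chosen only for illustration; an R version would follow the same shape by returning a new function from a function.

```python
import functools
import time


def timed(func):
    """Wrap func so that each call records how long it took."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        # Store the duration of the most recent call on the wrapper itself.
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    wrapper.last_elapsed = None
    return wrapper


def log_first_arg(log):
    """Return a decorator that appends the first positional argument to log."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if args:
                log.append(args[0])
            return func(*args, **kwargs)
        return wrapper
    return decorator


calls = []  # hypothetical in-memory log; a file could be used instead


@timed
@log_first_arg(calls)
def add(x, y):
    return x + y
```

Calling add(2, 3) still returns the sum, while calls receives the first argument and add.last_elapsed holds the duration of the most recent call. The body of add itself is unchanged.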
13:27, 29th September 2021
R for Loop
Loops in programming are used to repeat code execution efficiently, reducing redundancy and improving clarity. In R, for loops are particularly useful for iterating over sequences such as vectors or lists, executing a block of code for each element. The syntax involves specifying a value and a sequence, with the value taking each element in turn.
Examples demonstrate how for loops can count even numbers in a vector, use break to exit early when a condition is met, or employ next to skip iterations based on specific criteria. Nested for loops iterate over every combination of elements from two or more sequences, enabling tasks like identifying combinations of numbers that meet particular mathematical conditions. These structures provide flexibility in handling repetitive operations, making them essential tools for data processing and analysis in R.
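The same loop patterns can be sketched in Python for comparison, where continue plays the role of R's next. The function names and the sum-to-a-target condition are assumptions made for illustration, not taken from the original tutorial.

```python
def count_even(values):
    """Count the even numbers in a sequence."""
    count = 0
    for value in values:
        if value % 2 == 0:
            count += 1
    return count


def first_over(values, limit):
    """Return the first value greater than limit, leaving the loop early."""
    found = None
    for value in values:
        if value > limit:
            found = value
            break  # exit the loop as soon as the condition is met
    return found


def sum_skipping_negatives(values):
    """Sum the values, skipping negative ones (continue is R's next)."""
    total = 0
    for value in values:
        if value < 0:
            continue  # skip this iteration
        total += value
    return total


def pairs_summing_to(xs, ys, target):
    """Use nested loops to find every (x, y) pair that adds up to target."""
    pairs = []
    for x in xs:
        for y in ys:
            if x + y == target:
                pairs.append((x, y))
    return pairs
```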
13:26, 29th September 2021
Calmcode is an online learning platform offering 757 short video tutorials spread across 106 courses, designed to teach modern programming concepts and open-source tools in a calm, accessible manner. The platform is particularly focused on Python developers, covering a broad range of topics including code formatting, testing, data science, machine learning, web development and productivity tooling, with additional content for R users and general software development practices. New courses and lessons are published once or twice a month and are communicated through a newsletter, which the platform positions as the primary way for learners to stay informed. The overall philosophy behind Calmcode is one of reducing skill anxiety by presenting technical knowledge in short, straightforward lessons that begin from first principles, with the aim of making professional life in software development more manageable and enjoyable.
10:34, 28th September 2021
Understanding the Parquet File Format
Apache Parquet is a columnar storage file format designed for efficient data storage and querying, widely used in big data systems such as Hadoop. It organises data by columns rather than rows, enabling faster access to specific fields and reducing storage requirements through techniques like run-length encoding, dictionary encoding and compression.
This approach minimises file size, which is particularly beneficial for large datasets, and supports cross-platform compatibility. Compared to formats like RDS, which are specific to R and can store complex objects, Parquet prioritises storage efficiency and interoperability. It also contrasts with Feather, which focuses on speed and is part of the Apache Arrow ecosystem. Parquet files also store metadata, which facilitates efficient data processing. The format is implemented in tools such as the R package {arrow}, which reads and writes Parquet files with options for compression and encoding optimisation.
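To see why a columnar layout suits these encodings, here is a conceptual sketch in plain Python. It is not Parquet's actual on-disk format, and the function names and sample rows are hypothetical; it only shows how repeated values within a single column collapse under run-length and dictionary encoding.

```python
def run_length_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def dictionary_encode(column):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    index_of = {}
    codes = []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


# Hypothetical row-oriented records, pivoted into one list per column.
rows = [
    {"country": "UK", "year": 2021},
    {"country": "UK", "year": 2021},
    {"country": "UK", "year": 2021},
    {"country": "FR", "year": 2021},
    {"country": "FR", "year": 2021},
]
columns = {name: [row[name] for row in rows] for name in rows[0]}

country_runs = run_length_encode(columns["country"])
country_dictionary, country_codes = dictionary_encode(columns["country"])
```

Stored row by row, the country and year values would alternate and few long runs would form. Stored column by column, five country values reduce to two runs.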
19:10, 26th September 2021
Statistics with Julia From the Ground Up
Here is a workshop that introduces the Julia programming language to data scientists and statisticians, focusing on statistical applications rather than general programming. Designed for those with experience in languages like R but no prior Julia knowledge, it covers foundational probability, statistical inference, regression and data manipulation using packages such as StatsBase, Distributions and GLM.
The session emphasises practical, goal-oriented scripting through examples and code snippets from a related book, with an accompanying Jupyter notebook for participants to follow along. The approach prioritises statistical methods and tools, treating Julia as a means to achieve analytical tasks rather than a programming language in isolation.
09:02, 24th September 2021
How to Find Files in Linux Using the Command Line
The Linux command line offers a powerful utility called find for locating files and directories within a file system by recursively filtering objects based on specified conditions. Users can search by file name or extension, modification time, file type and ownership, with results refined further using flags such as -iname for case-insensitive searches, -maxdepth to limit directory depth and -not to exclude certain results. Performance can be tuned through three optimisation levels, -O1, -O2 and -O3, with the default being -O1, which filters by file name first.
For content-based searching, the grep command can be paired with find using the -exec flag, which also allows matched files to be processed immediately, for example by changing permissions with chmod. The -execdir variant runs commands in the directory containing the matched file rather than the directory from which find was started, which can offer security and performance benefits. A -delete flag can remove matched files, though this should be used with considerable caution, and interactive prompting before any action is taken can be enabled by substituting -exec with -ok or -execdir with -okdir.
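The recursive filtering that find performs can be approximated in Python as a rough sketch. This is an assumption-laden analogue, not how find is implemented: the hypothetical find_files function matches regular file names only, ignores case in the manner of -iname, and treats max_depth like -maxdepth, where depth 1 means entries directly inside the starting directory.

```python
import fnmatch
import os
import tempfile


def find_files(root, pattern, max_depth=None):
    """Yield files under root whose name matches pattern, ignoring case."""
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        dir_depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        file_depth = dir_depth + 1  # files sit one level below their directory
        if max_depth is not None and file_depth >= max_depth:
            dirnames[:] = []  # stop os.walk from descending any further
        if max_depth is not None and file_depth > max_depth:
            continue
        for name in filenames:
            if fnmatch.fnmatchcase(name.lower(), pattern.lower()):
                yield os.path.join(dirpath, name)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel_path in ("Notes.TXT", "data.csv", os.path.join("sub", "more.txt")):
        with open(os.path.join(tmp, rel_path), "w") as handle:
            handle.write("example\n")
    shallow = sorted(os.path.basename(p) for p in find_files(tmp, "*.txt", max_depth=1))
    deep = sorted(os.path.basename(p) for p in find_files(tmp, "*.txt"))
```

Here shallow contains only the file in the top-level directory, while deep also includes the one inside sub.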
13:07, 23rd September 2021
Excel TRIM function - quick way to remove extra spaces
Excel's TRIM function offers a straightforward way to remove unwanted spaces from cell data, resolving common formula errors caused by hidden leading, trailing or extra spaces between values. The basic syntax requires only a single cell reference as its argument, instantly stripping all excess spaces whilst preserving single spaces between words. For numeric data, TRIM must be combined with the VALUE function so that results behave as numbers rather than strings. The MID, FIND and LEN functions can be used together to remove only leading spaces, keeping multiple spaces between words intact.
Counting excess spaces is achievable by comparing the original string length against the trimmed length using the LEN function, and conditional formatting can highlight affected cells before any changes are made. When the TRIM function fails to remove certain spaces, this is typically due to non-breaking spaces or non-printing characters that require the SUBSTITUTE, CHAR and CLEAN functions in combination to resolve, with the CODE function helping to identify the specific character values causing the problem.
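The behaviour described above can be mimicked in Python as a sketch for reasoning about what each step does. These are hypothetical helper functions, not Excel formulas: excel_trim approximates TRIM, count_excess_spaces mirrors the LEN comparison, and clean_then_trim approximates combining SUBSTITUTE, CHAR(160), CLEAN and TRIM.

```python
import re


def excel_trim(text):
    """Strip leading and trailing spaces and collapse inner runs to one space."""
    return re.sub(" +", " ", text).strip(" ")


def count_excess_spaces(text):
    """Number of characters that trimming would remove."""
    return len(text) - len(excel_trim(text))


def remove_leading_only(text):
    """Drop leading spaces but keep multiple spaces between words."""
    return text.lstrip(" ")


def clean_then_trim(text):
    """Handle characters that trimming ordinary spaces leaves untouched."""
    # Turn non-breaking spaces (character code 160) into ordinary spaces.
    text = text.replace("\u00a0", " ")
    # Drop non-printing control characters (codes 0 to 31).
    text = "".join(ch for ch in text if ord(ch) >= 32)
    return excel_trim(text)
```

Applying ord to a suspicious character plays the part of Excel's CODE function: ord("\u00a0") gives 160, which explains why trimming ordinary spaces alone leaves that character in place.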
10:44, 23rd September 2021
GxP Compliance in Pharma Made Easier: Good Documentation Practices with R Markdown and {officedown}
In regulated pharmaceutical industries, maintaining rigorous documentation standards is essential for ensuring consumer safety and product reliability. GxP, a globally recognised framework covering practices such as Good Clinical Practice and Good Manufacturing Practice, places significant emphasis on Good Documentation Practices, requiring that records be traceable, accountable and data-integrity compliant. Meeting these standards manually is time-consuming and error-prone, which is where programmatic tools offer a practical advantage. R Markdown enables the creation of automated, reproducible and testable regulatory documents in multiple formats, reducing the burden of repetitive manual reporting.
However, its formatting flexibility is limited when precise structural or stylistic requirements must be met. The R package officedown addresses these shortcomings by extending R Markdown with capabilities more suited to generating Microsoft Word and PowerPoint documents, allowing users to control document structure with greater precision, apply custom styling to paragraphs and tables, use Word-based templates for consistent formatting, merge multiple documents while preserving references and numbering, and switch selected pages to landscape orientation. Together, these features make it considerably more straightforward for pharmaceutical teams to produce documentation that satisfies the detailed requirements of regulatory bodies such as the FDA and the European Medicines Agency.
09:13, 22nd September 2021
Metaprogramming workshop at JuliaCon 2021
This workshop on metaprogramming in Julia, held during JuliaCon 2021, provided an in-depth exploration of techniques for writing programs that manipulate or generate other programs, highlighting Julia's strengths in this area due to its flexible code structure. Led by David P. Sanders, the session covered foundational concepts such as symbols, expressions and abstract syntax trees, alongside practical applications like macros, code generation using eval and generated functions, which enable the creation of optimised, type-specific code.
Participants worked with Jupyter notebooks offering step-by-step exercises, including examples of substituting values in expressions and building domain-specific language features. The workshop emphasised the balance between leveraging metaprogramming for efficiency and avoiding excessive complexity, underscoring its role in enhancing Julia's performance and expressiveness for scientific and computational tasks.
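Substituting values into an expression can be illustrated with a loosely analogous sketch in Python's ast module. This is not the workshop's Julia code, and Julia's Expr objects and macros work differently in detail. The Substitute class and substitute function are hypothetical names, and ast.unparse assumes Python 3.9 or later.

```python
import ast


class Substitute(ast.NodeTransformer):
    """Replace selected variable names in a syntax tree with constants."""

    def __init__(self, values):
        self.values = values

    def visit_Name(self, node):
        if node.id in self.values:
            constant = ast.Constant(value=self.values[node.id])
            return ast.copy_location(constant, node)
        return node


def substitute(source, values):
    """Parse an expression, substitute the given names, return the new tree."""
    tree = ast.parse(source, mode="eval")
    tree = Substitute(values).visit(tree)
    return ast.fix_missing_locations(tree)


tree = substitute("x + 2 * y", {"x": 3})
rewritten = ast.unparse(tree)  # the modified expression as source text
result = eval(compile(tree, "<substituted>", "eval"), {"y": 4})
```

The expression is handled as data (a tree) before it is run: x is replaced by 3 inside the tree, and only then is the rewritten expression compiled and evaluated with y set to 4.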
20:32, 13th September 2021
Jedi SAS Tricks: Explicit SQL Pass-through in DS2
DS2, a SAS programming language run through PROC DS2, offers significant advantages over traditional DATA step processing through its tight integration with SQL and its ability to handle data retrieval and manipulation more efficiently. Unlike a conventional DATA step, which requires data to be pre-sorted or indexed before using a BY statement, DS2 retrieves data using an implicit SQL query, making pre-sorting unnecessary regardless of whether the data resides in SAS or in a relational database management system such as Oracle.
A particularly powerful feature of DS2 is its support for explicit SQL pass-through queries within the SET statement, allowing programmers to leverage database-specific functions, such as Oracle's DECODE function, while still applying DS2 logic to the results. Ordering of results can be handled either within the database using an ORDER BY clause or within DS2 using a BY statement, though the latter is generally more reliable across different processing environments, particularly in distributed or multithreaded configurations such as SAS Cloud Analytic Services. It is worth noting that row ordering behaviour in DS2 can vary depending on the platform and whether code is executed in-database or through threaded processing, so programmers are advised to consult the relevant SAS documentation to better understand these nuances before relying on any particular ordering approach.