Coding Notebook

13:11, 26th May 2022

The touch command on Linux creates empty files and updates file timestamps. Linux records access, modification and change times for each file; touch sets the first two directly, while the change time is updated by the system as a side effect. Options allow the access time or the modification time to be changed individually (-a and -m), a custom date and time to be set with -t or -d, or the timestamps of another file to be copied with -r.

The command can create one file or several at once and, with the -c option, can be told not to create any file that does not already exist. Because it changes only metadata, touch provides a straightforward way to manage file information without altering a file's content.
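As a rough illustration of the options described above, the following sketch runs touch in a temporary directory. The file names are invented for the example, and the -d date string assumes GNU touch, since the accepted formats differ between implementations.

```shell
#!/usr/bin/env bash

# Work in a throwaway directory so that no real files are affected.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create several empty files in one call (hypothetical names).
touch notes.txt draft.txt

# -c: update timestamps only if the file exists; never create it.
touch -c missing.txt

# -t sets an explicit timestamp in [[CC]YY]MMDDhhmm[.ss] form.
touch -t 202201011230.00 notes.txt

# -r copies the timestamps of a reference file onto another file.
touch -r notes.txt draft.txt

# -m changes only the modification time (-a would change only the access time).
# -d takes a date string; the format shown here is assumed to be accepted.
touch -m -d '2022-03-01 09:00:00' notes.txt

# List the results: missing.txt is absent and both files are still empty.
ls -l
```

Running stat on either file afterwards shows the individual access, modification and change times.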

14:55, 27th April 2022

Pharmaverse is a collaborative network of pharmaceutical companies and individuals dedicated to the open-source development of curated R packages for clinical reporting. Rather than working in isolation on closed, often duplicative solutions, contributors share tools across a post-competitive space, with the aim of easing regulatory review and ultimately bringing new treatments to patients more quickly.

The network hosts a broad catalogue of packages spanning areas such as data standards, metadata, validation, submission and reporting, and anyone is free to adopt whichever packages suit their needs, with the understanding that inclusion in the catalogue does not constitute an endorsement of any code's reliability. Community involvement is central to the project, with decisions about package inclusion driven by open proposals and discussion, and a governing council stepping in for more contentious matters. The network also engages with industry working groups to explore approaches to R package validation in regulated environments.

10:21, 4th April 2022

Splitting string into array of substrings in Julia – split() and rsplit() Method

Julia provides two functions, split() and rsplit(), for dividing strings into arrays of substrings based on specified delimiters. The split() function processes the string from the beginning, while rsplit() operates from the end, with both allowing parameters to control the maximum number of resulting elements and whether empty substrings are included. These methods are useful for parsing and manipulating text data, with examples demonstrating their application in handling various string formats and delimiter placements.

10:21, 4th April 2022

Arrays in Julia

Working with arrays in Julia has been a journey of discovery, particularly when it comes to understanding the breadth of functions available for manipulation and analysis. Arrays form the backbone of data handling in the language, and their versatility is evident in functions like axes, which returns the valid indices of an array, and cat, which concatenates along specified dimensions.

Functions such as broadcast and broadcast! enable efficient element-wise operations across multiple arrays, while fill and fill! provide ways to initialise or overwrite arrays with specific values. Manipulation of arrays is straightforward, with push! and pop! adding or removing elements and deleteat! allowing targeted deletions. For multidimensional arrays, hcat and vcat handle horizontal and vertical concatenation, while reshape and permutedims offer flexible reorganisation of data. The getindex and setindex! functions provide precise control over accessing and modifying elements, and findall, findfirst and findlast aid in locating specific values within arrays.

Beyond basic operations, Julia’s array methods extend to advanced tasks, such as computing strides with stride, creating views with @view and using similar to generate new arrays with the same structure. These tools collectively make array handling in Julia both powerful and intuitive, catering to everything from simple data storage to complex transformations.

10:20, 4th April 2022

For loop in Julia

Julia's for loop follows a for ... in structure rather than the C-style syntax found in many other programming languages, making it closer in behaviour to a for-each loop. The syntax consists of the for keyword, a loop variable, the collection or range to iterate over and a closing end keyword, allowing sequential traversal across various data structures. These include arrays, tuples and strings, each of which can be iterated over in the same straightforward manner. Julia also supports nested for loops, where one loop is placed inside another, enabling more complex iteration patterns such as printing structured numerical output across multiple rows and columns.

19:30, 27th March 2022

How to Compare Strings in Bash

Bash string comparison relies on a set of operators that evaluate equality, inequality, alphabetical ordering and string length within conditional statements. The equality operators = and == check whether two strings match exactly, while != checks for a mismatch, < and > compare alphabetical order, and =~ tests whether a string matches a regular expression. The -z and -n tests determine whether a string is empty or non-empty respectively.
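The operators above can be sketched as follows. This is only a minimal example with made-up variable names and values, written for Bash and using the [[ construct throughout.

```shell
#!/usr/bin/env bash

# Hypothetical values used only for this illustration.
a="apple"
b="banana"
empty=""

# = and == test for an exact match.
if [[ "$a" == "apple" ]]; then echo "a is apple"; fi

# != is true when the two strings differ.
if [[ "$a" != "$b" ]]; then echo "a and b differ"; fi

# < and > compare alphabetical order.
if [[ "$a" < "$b" ]]; then echo "a sorts before b"; fi

# =~ matches the left side against an unquoted extended regular expression.
if [[ "$b" =~ ^ban(an)+a$ ]]; then echo "b matches the regex"; fi

# -z is true for an empty string, -n for a non-empty one.
if [[ -z "$empty" ]]; then echo "empty has zero length"; fi
if [[ -n "$a" ]]; then echo "a is not empty"; fi
```

With these values, every condition is true, so all six messages are printed.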

Since comparisons are case-sensitive by default, case-insensitive checks can be achieved either by converting variables to lowercase using the ,, parameter expansion or by enabling the nocasematch shell option, which should be disabled again afterwards to avoid unintended side effects. The double-bracket [[ construct is generally preferred over the single-bracket [ command in Bash scripts, as it supports pattern matching and regular expressions without requiring variables to be quoted to prevent word splitting.
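Both case-insensitive approaches might look like the sketch below. The variable names are placeholders, and the ,, expansion assumes Bash 4.0 or later.

```shell
#!/usr/bin/env bash

# Hypothetical values that differ only in case.
first="Linux"
second="LINUX"

# Option 1: lowercase both sides with the ,, parameter expansion.
if [[ "${first,,}" == "${second,,}" ]]; then
    lower_result="match"
else
    lower_result="no match"
fi
echo "lowercased comparison: $lower_result"

# Option 2: enable nocasematch for the test, then disable it again so that
# later comparisons in the script stay case-sensitive.
shopt -s nocasematch
if [[ "$first" == "$second" ]]; then
    nocase_result="match"
else
    nocase_result="no match"
fi
shopt -u nocasematch
echo "nocasematch comparison: $nocase_result"

# With the option off again, the default case-sensitive behaviour returns.
if [[ "$first" == "$second" ]]; then
    default_result="match"
else
    default_result="no match"
fi
echo "default comparison: $default_result"
```

The first two comparisons report a match and the last one does not, which shows why leaving nocasematch enabled could change the outcome of later tests.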

Glob-style patterns can also be used with == to check for substrings or character classes, and the case statement offers a practical alternative when a string needs to be evaluated against several possible patterns. A useful distinction to keep in mind is that string operators compare characters individually, whereas numeric operators such as -eq and -gt evaluate integer values, meaning that strings like 02 and 2 would not be considered equal under string comparison even though they represent the same number.
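A short sketch of these three ideas follows. The file name and the classify function are invented for the example, and the numeric test uses the single-bracket command so that 02 is read as a plain decimal integer.

```shell
#!/usr/bin/env bash

# Hypothetical file name used only for this illustration.
filename="report_2022.csv"

# An unquoted right-hand side of == inside [[ ]] is treated as a glob pattern.
if [[ "$filename" == *2022* ]]; then echo "contains 2022"; fi
if [[ "$filename" == report_[0-9]* ]]; then echo "report_ followed by a digit"; fi

# case checks one string against several patterns in turn.
classify() {
    case "$1" in
        *.csv)      echo "comma-separated data" ;;
        *.txt|*.md) echo "plain text" ;;
        *)          echo "unknown type" ;;
    esac
}
classify "$filename"

# String comparison looks at characters; -eq compares integer values.
x="02"
y="2"
if [[ "$x" == "$y" ]]; then string_equal="yes"; else string_equal="no"; fi
if [ "$x" -eq "$y" ]; then numeric_equal="yes"; else numeric_equal="no"; fi
echo "string equal: $string_equal, numeric equal: $numeric_equal"
```

Here the string comparison reports that 02 and 2 differ, while the numeric comparison treats them as equal.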

20:59, 18th March 2022

How to remove Scientific Notation in R

When working with large numbers in R, scientific notation may be used by default, but this can be adjusted using two approaches. One method sets a global preference by raising the scipen option to a large positive value, which discourages scientific notation in all output for the rest of the session. Alternatively, a specific value can be displayed without scientific notation by applying the format function with the scientific argument set to FALSE, allowing direct control over individual results without altering broader settings. Which technique to use depends on whether the change should apply to a whole analysis or to a single result.

15:13, 9th March 2022

Complete tutorial on using apply functions in R

The apply family of functions in R provides a flexible approach to applying operations across data structures such as data frames, lists and vectors. Each function within this family serves a distinct purpose. For instance, apply is commonly used for row-wise or column-wise operations on matrices or data frames, while lapply and sapply both iterate over lists and vectors, with lapply always returning a list and sapply simplifying the result to a vector or matrix where possible. In turn, tapply extends this capability by applying a function to groups defined by one or more factors.

These functions are particularly useful when dealing with complex data transformations that require iteration, though they are often less efficient than vectorised operations or dplyr alternatives. The dplyr package offers a more intuitive and readable approach for data manipulation, especially when working with tidy data.

Functions like group_by and summarise in dplyr can replicate the functionality of tapply, but with the added benefit of seamless integration with the pipe operator, enabling a more fluid workflow. For example, grouping data by a variable and calculating summary statistics can be achieved in a single, concise pipeline. Similarly, the across function in dplyr applies a function to several columns at once, much like a column-wise apply, while rowwise supports row-wise calculations, all within the context of tidy data principles.

The choice between apply and dplyr depends on the specific task and the structure of the data. The apply functions are well-suited to quick, ad hoc calculations, particularly when working with matrices or lists. On the other hand, dplyr excels in scenarios where data is already in a tidy format and further transformations or analyses are required.

Both approaches have their strengths and the decision often hinges on the complexity of the task, the need for readability and the compatibility with downstream workflows. Understanding the nuances of each method allows for more effective data analysis and manipulation in R.

15:24, 8th March 2022

How to Calculate a Cumulative Average in R

Calculating a cumulative average in R involves dividing the cumulative sum of a dataset by the sequence number of its elements, a process achievable through multiple methods. Using base R, the cumsum function combined with seq_along provides a straightforward approach, while the dplyr package offers a more efficient alternative with its cummean function, particularly beneficial for large datasets.

A sample dataset comprising monthly values demonstrates how each method generates identical results, illustrating the cumulative average as a running mean that updates with each additional data point, such as the average of the first three values being 5 and the first six values averaging 6. This technique is useful for tracking trends over time, with the output clearly displayed in the modified data frame, showing how each row's cumulative average evolves incrementally.

14:35, 7th March 2022

How To Get Twitter Data Using R

Twitter data can be retrieved in R through the rtweet library, which requires a developer account along with consumer and access keys for API authorisation. Once installed and authorised, the library enables users to search for tweets using keywords or hashtags, stream a live random sample of approximately one percent of all tweets and apply filters to refine results by language, engagement thresholds or by excluding retweets, quotes and replies. Additional functionality includes retrieving user timelines of up to 3,200 posts, identifying the most recently liked tweets from a given account, searching for users associated with particular keywords, accessing follower and friend lists and monitoring trending topics by location using either city names or geographic coordinates.
