Coding Notebook

16:48, 2nd July 2021

Parallelisation in R allows multiple independent tasks to run simultaneously across processor cores, potentially reducing computation time significantly. By default, R uses only one core, but the parallel package enables multi-core processing. On Linux and Mac systems, this is achieved by replacing the standard lapply function with mclapply, which distributes tasks across available cores.

On Windows, where process forking is unavailable, the procedure is more involved: a cluster must be created with makeCluster, the computations run with parLapply and the cluster then closed with stopCluster. Windows also differs in that each parallel process requires its own copy of the data in working memory, rather than sharing a single copy, and each process starts in an empty environment, meaning objects must be manually exported before use.

Parallelisation is not always beneficial, as the initialisation process itself takes time, making it counterproductive for fast tasks but worthwhile for those running for minutes or longer. The optimal number of cores to use depends on available working memory and whether the computer needs to remain usable for other tasks during computation.

09:19, 30th June 2021

Common Format and MIME Type for Comma-Separated Values (CSV) Files

RFC 4180, published in October 2005 by Yakov Shafranovich, formally documents the Comma-Separated Values (CSV) format and registers the associated MIME type text/csv with IANA. Although CSV had long been used for exchanging and converting data between spreadsheet programs, no formal specification had previously existed, leading to inconsistent implementations across different systems.

The format structures data in records separated by line breaks, with fields within each record divided by commas, and allows for an optional header line identifying field names. Fields may be enclosed in double quotes, which is required when a field contains commas, line breaks or double quotes, with any double quote appearing within a quoted field escaped by a preceding double quote.
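The quoting rules described above can be illustrated with Python's standard csv module. This is only a sketch: the sample record is hypothetical, and the module's default "excel" dialect is assumed to be close enough to RFC 4180 for the purpose (it quotes only where needed, doubles embedded double quotes and ends records with CRLF).

```python
import csv
import io

# Hypothetical sample record: one plain field, one containing a comma,
# one containing double quotes and one containing a line break.
header = ["plain", "with_comma", "with_quote", "with_newline"]
row = ["abc", "a,b", 'say "hi"', "line1\nline2"]

# newline="" stops the text layer from altering the CRLF record endings.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)  # default dialect: minimal quoting, CRLF endings
writer.writerow(header)
writer.writerow(row)
encoded = buffer.getvalue()

# Reading the text back should recover the original field values.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))

print(encoded)
print(decoded[1] == row)
```

Only the three fields that need protection are wrapped in double quotes, and the embedded double quotes are written as two consecutive double quotes.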

The registered MIME type takes two optional parameters, charset and header, the latter indicating whether a header line is present or absent. US-ASCII is the commonly used character set, and line breaks are CRLF. Security considerations note that whilst CSV data are generally passive, there is a theoretical risk of malicious binary data being embedded to exploit buffer overruns in processing programs, and that private data may be inadvertently shared through the format.

09:26, 27th June 2021

Create and Preview RMarkdown Documents with QBit Workspace

RMarkdown is a document format rooted in the concept of Literate Programming, a paradigm introduced by Donald Knuth that combines code outputs with written content. While it is widely associated with the R programming community and the RStudio IDE, it supports a broad range of language engines, with version 1.33 of the knitr package offering 44 options including Python, SQL, Julia and many others.

Through the pandoc document converter, RMarkdown can produce a variety of output formats, such as static HTML files, PDF documents generated via LaTeX, Microsoft Word documents, PowerPoint presentations and flexdashboard layouts, all configurable through the YAML header. QBit Workspace allows users to author and preview RMarkdown documents directly in a browser, with an instant preview feature in the Viewer pane designed to accelerate the development process and support the creation of all the aforementioned output formats.

14:49, 25th June 2021

Compare data frames in R

Comparing data frames in R can be achieved using several packages, each offering different levels of detail. The dplyr package provides a straightforward approach through its all_equal function, which returns TRUE when two data frames are identical and describes the specific row differences when they are not. The arsenal package offers a more comprehensive comparison via its comparedf function, producing detailed summaries that highlight differing variables, unequal values and observations that appear in one data frame but not the other. The diffdf package similarly identifies and reports differences between data frames, flagging unequal values by variable and row number.

These tools handle various scenarios, including comparing identical data frames, those with differing values and those with different numbers of rows altogether. Of the three approaches, dplyr is generally considered the most accessible for quick comparisons.

12:37, 24th June 2021

Top 5 tricks to make Matplotlib plots look better

Creating visually appealing data visualisations is an important skill for data scientists, and several straightforward techniques can significantly improve the quality of graphs produced using Python libraries such as Matplotlib and Seaborn. Applying a predefined plot theme, such as ggplot or one of the many Seaborn styles, instantly changes the overall look and feel of a chart with minimal effort.

Adjusting the colours of individual bars or lines, drawing from a library of over 950 named colours, allows specific data points to be highlighted through contrast. Changing the font family ensures that charts blend cohesively with the typography used in a wider presentation, while Seaborn's set_context function scales all visual elements, including fonts and titles, to suit different display settings such as paper, poster or talk formats. Finally, applying a colour palette to a plot unifies its tones and creates a more polished, harmonious appearance compared to using default, uncoordinated colours.

12:36, 24th June 2021

Choosing Colormaps in Matplotlib

Matplotlib offers a wide range of built-in colormaps, organised into several categories to suit different types of data. Selecting the right one depends on factors such as whether the data has a natural ordering, a critical midpoint value or repeating endpoints, as well as any conventions expected by the intended audience.

Perceptually uniform colormaps, in which equal steps in data correspond to equal perceived steps in colour, are generally the best choice, since the human brain responds more reliably to changes in lightness than to changes in hue. Sequential colormaps are suited to ordered data, diverging colormaps work well when the data varies around a meaningful central value, cyclic colormaps are appropriate for values that wrap around at their endpoints and qualitative colormaps are used for unordered categorical data.

A separate miscellaneous category includes colormaps designed for specific purposes, such as topographic or depth visualisation. Lightness values are also important when considering how a plot will appear when printed in greyscale, as colormaps that increase monotonically in lightness tend to reproduce more clearly, whereas those with irregular lightness patterns can result in unreadable output. Awareness of colour vision deficiencies is also advisable, and avoiding colormaps that combine red and green reduces the risk of problems for a significant portion of viewers.

12:15, 24th June 2021

Rotate Tick Labels in Matplotlib

Rotating tick labels in Matplotlib can be achieved through several methods, applicable to both the X and Y axes. Through the pyplot interface, plt.xticks() and plt.yticks() accept a rotation argument directly, while at the axes level, options include using ax.set_xticklabels() or ax.set_yticklabels(), iterating over the tick labels and applying tick.set_rotation() to each, or using ax.tick_params() with a labelrotation argument. When existing tick labels are read back at the axes level, it is important to call plt.draw() first, as the labels are only populated after the plot is drawn. For plots displaying dates, which often overlap and become unreadable without adjustment, Matplotlib provides fig.autofmt_xdate() as a convenient alternative to manual rotation of the X-axis labels.
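A few of these approaches are sketched below, assuming Matplotlib is installed. The data are hypothetical, the Agg backend is chosen only so that the example runs without a display, and fig.canvas.draw() is used here in place of plt.draw() to force the labels to be populated.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 15, 30])  # hypothetical data

# pyplot interface: rotates the X tick labels of the current axes.
plt.xticks(rotation=45)

# Axes level: tick_params with the labelrotation argument, here for the Y axis.
ax.tick_params(axis="y", labelrotation=30)

# Axes level: draw first so the labels exist, then adjust each one in turn.
fig.canvas.draw()
for label in ax.get_xticklabels():
    label.set_rotation(45)
    label.set_horizontalalignment("right")

x_rotations = [label.get_rotation() for label in ax.get_xticklabels()]
print(x_rotations)
```

Setting the horizontal alignment to "right" is an optional extra that tends to keep rotated labels lined up with their ticks.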

12:14, 24th June 2021

Create a grouped bar chart with Matplotlib and Pandas

A developer working through freeCodeCamp's Data Analysis with Python certification encountered difficulties creating a grouped bar chart using Matplotlib and Pandas, and documented their solution for the Page View Time Series Visualizer project. The dataset used contains daily page view recordings, which are loaded via Pandas, cleaned by removing the values in the top and bottom 2.5 per cent, and enriched with year and month columns derived from date/time index attributes. The months are stored as categorical data to preserve chronological order.

The key step in producing the grouped bar chart is reshaping the DataFrame into a pivot table, with years as the index, months as the columns and mean page views as the cell values. Calling the plot method with kind="bar" on the reshaped DataFrame is then sufficient for Matplotlib to render the grouped visualisation correctly. The author notes that while the final plotting call is straightforward, the real challenge lies in understanding how to manipulate the data into the required shape beforehand. Comparable results can be achieved more simply using Plotly, which requires only two additional parameters and no pivoting step.
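The reshaping step might look something like the sketch below, which assumes Pandas, NumPy and Matplotlib are installed. The randomly generated page views and the column names value, year and month are hypothetical stand-ins rather than the author's actual data or code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import numpy as np
import pandas as pd

# Hypothetical stand-in for the daily page view data.
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.integers(1000, 5000, size=len(dates))}, index=dates)

# Drop the values in the top and bottom 2.5 per cent.
low, high = df["value"].quantile([0.025, 0.975])
df = df[(df["value"] >= low) & (df["value"] <= high)].copy()

# Month names in calendar order, used as ordered categories.
month_names = list(pd.date_range("2000-01-01", periods=12, freq="MS").month_name())
df["year"] = df.index.year
df["month"] = pd.Categorical(df.index.month_name(), categories=month_names, ordered=True)

# Years as the index, months as the columns, mean page views as the values.
pivot = df.pivot_table(index="year", columns="month", values="value",
                       aggfunc="mean", observed=False)

# One group of bars per year, one bar per month within each group.
ax = pivot.plot(kind="bar")
print(pivot.shape)
```

Because the months are an ordered categorical, the columns of the pivot table, and therefore the bars within each group, follow calendar rather than alphabetical order.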

12:44, 23rd June 2021

10 Tips And Tricks For Data Scientists Vol.9

This ninth instalment in a series of practical tips for data scientists covers a range of techniques across R, Python, SQL and Postman. In R, the tips cover writing cross-platform file paths using the file.path() command, repeating vectors using the rep() function and performing circular shifts on vectors with a custom-built function. The Python tips demonstrate how to retrieve the source code of a function using the inspect module, remove elements from a NumPy array based on a specific value, generate filenames that include a creation date or timestamp, identify the most recently modified file in a directory and collect all files of a given type across directories and subdirectories using the os module. The SQL section explains how to extract key-value pairs from JSON objects in PostgreSQL, including nested structures, and the final tip highlights a Postman feature that automatically generates code for API calls in a chosen programming language.
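Several of the Python tips can be sketched with the standard library alone. The helper names below (stamped_name, newest_file and find_files) are hypothetical and are not taken from the original article, and json.dumps is used only as a convenient function whose source can be retrieved.

```python
import inspect
import json
import os
import tempfile
from datetime import datetime

def stamped_name(prefix, extension, moment=None):
    """Build a filename containing a timestamp (the current time by default)."""
    moment = moment or datetime.now()
    return f"{prefix}_{moment:%Y%m%d_%H%M%S}{extension}"

def newest_file(directory):
    """Return the path of the most recently modified file directly in directory."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [path for path in paths if os.path.isfile(path)]
    return max(files, key=os.path.getmtime)

def find_files(root, extension):
    """Collect files with the given extension under root, including subdirectories."""
    matches = []
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            if name.endswith(extension):
                matches.append(os.path.join(folder, name))
    return sorted(matches)

# Retrieve the source code of a function with the inspect module.
source_text = inspect.getsource(json.dumps)

example_name = stamped_name("report", ".csv", datetime(2021, 6, 23, 12, 44, 0))

# Demonstrate the file helpers in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "nested")
    os.makedirs(nested)
    targets = [os.path.join(root, "a.csv"),
               os.path.join(root, "b.txt"),
               os.path.join(nested, "c.csv")]
    for offset, path in enumerate(targets):
        with open(path, "w") as handle:
            handle.write("x")
        # Set explicit modification times so the ordering is predictable.
        os.utime(path, (1_600_000_000 + offset, 1_600_000_000 + offset))
    latest = os.path.basename(newest_file(root))
    csv_files = [os.path.relpath(path, root) for path in find_files(root, ".csv")]

print(example_name, latest, csv_files)
```

The newest_file helper looks only at files directly inside the directory, whereas find_files descends into subdirectories through os.walk.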

12:44, 23rd June 2021

Within() - Base R’s Mutate() function

Base R's built-in within() function serves as a direct alternative to the mutate() function from the popular dplyr package, with both performing the same task of creating new variables within a data frame. Using the classic iris dataset as a demonstration, both functions can be used to calculate a sepal length-to-width ratio, producing identical results regardless of whether tidy or base R syntax is employed.

A benchmark comparison using the rbenchmark package reveals that within() is considerably faster than mutate(), completing 1,000 replications in 0.18 seconds compared to 1.60 seconds for mutate(). Memory usage differs negligibly between the two, with mutate() consuming 1,312 bytes and within() consuming 1,296 bytes. For those seeking an even faster approach, the data.table package offers a further alternative, and the newer base R pipe operator has also been noted as a slightly quicker option than the magrittr pipe commonly associated with tidy workflows.
