Technology Tales

Notes drawn from experiences in consumer and enterprise technology

Some R packages to explore as you find your feet with the language

24th March 2026

Here are some commonly used R packages and other tools, some of them pervasive, along with others that I encountered while getting started with the language, which is itself becoming pervasive in my line of business. The collection grew organically as my explorations proceeded, and it reflects what I was trying out during my acclimatisation.

General

Here are two general packages to get things started, with one of them being unavoidable in the R world. The other is more advanced, possibly offering more to package developers.

{tidyverse}

You cannot use R without knowing about this collection of packages. In many ways, they form a mini-language of their own, drawing some criticism from those who reckon that base R functionality covers a sufficient gamut anyway. Nevertheless, there is so much here that will get you going with data wrangling and visualisation that it is worth knowing what is possible. Indeed, the complaints may arise precisely because many users never need anything else for these purposes.

{plumber}

This R package enables developers to convert existing R functions into web API endpoints by adding roxygen2-like comment annotations to their code. Once annotated, functions can handle HTTP GET and POST requests, accept query string or JSON parameters and return outputs such as plain values or rendered plots. The package is available on CRAN as a stable release, with a development version hosted on GitHub. For deployment, it integrates with DigitalOcean through a companion package called {plumberDeploy}, and also supports Posit Connect, PM2 and Docker as hosting options. Related projects in the same space include OpenCPU, which is designed for hosting R APIs in scientific research contexts, and the now-discontinued jug package, which took a more programmatic approach to API construction.
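
As a flavour of those annotations, here is a minimal sketch along the lines of the package's documentation; saving it as plumber.R and serving it with plumb() exposes the function over HTTP:

# plumber.R
#* Echo back a message supplied in the query string
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}

# In a separate script or session:
library(plumber)
plumb("plumber.R")$run(port = 8000)
# GET http://localhost:8000/echo?msg=hello then returns the message as JSON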

Data Preparation

You simply cannot avoid working with data during any analysis or reporting work. While there is a learning curve if you are used to other languages, there is little doubt that R is well-endowed when it comes to performing these tasks. Here are some packages that extend base R capabilities and might even add some extra user-friendliness along the way.

{forcats}

The {forcats} package in R provides functions to manage categorical variables by reordering factor levels, collapsing infrequent values and adjusting their sequence based on frequency or other variables. It includes tools such as reordering by another variable, grouping rare categories into 'other' and modifying level order manually, which are useful for data analysis and visualisation workflows. Designed as part of the tidyverse, it integrates with other packages to streamline tasks like counting and plotting categorical data, enhancing clarity and efficiency in handling factors within R.
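
To give a flavour of those helpers, here is a small sketch with a made-up factor:

library(forcats)

x <- factor(c("cat", "dog", "dog", "fish", "fish", "fish", "gerbil"))

fct_infreq(x)         # reorder levels by how often each occurs
fct_rev(x)            # reverse the level order
fct_lump_n(x, n = 2)  # keep the two most common levels, pooling the rest into "Other"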

{tidyr}

Around this time last year, I remember completing a LinkedIn course on a set of good practices known as tidy data, where each variable occupies a column, each observation a row and each value a single cell. This package is designed to help users restructure data so it follows those rules. It provides tools for reshaping data between long and wide formats, handling nested lists, splitting or combining columns, managing missing values and layering or flattening grouped data.

Installation options include the {tidyverse} collection, standalone installation, or the development version from GitHub. The package succeeds earlier reshaping tools like {reshape2} and {reshape}, offering a focused approach to tidying data rather than general reshaping or aggregation.
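
As a brief sketch, here is the reshaping round trip using table4a, one of the deliberately untidy datasets bundled with the package:

library(tidyr)

# One column per year breaks the one-variable-per-column rule
tidy <- pivot_longer(table4a, cols = c(`1999`, `2000`),
                     names_to = "year", values_to = "cases")

# And back to wide format again
wide <- pivot_wider(tidy, names_from = year, values_from = cases)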

{haven}

Having a long track record of working with SAS, I find my interest aroused by {haven} and its ability to read and write data files from statistical software such as SAS, SPSS and Stata, leveraging the ReadStat library. Handily, it supports a range of file formats, including SAS transport and data files, SPSS system and older portable files and Stata data files up to version 15, converting these into tibbles with enhanced printing capabilities. Value labels are preserved as a labelled class, allowing conversion to factors, while dates and times are transformed into standard R classes.
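
A sketch of typical usage follows; the file names and the region column are hypothetical placeholders:

library(haven)

sas_data   <- read_sas("adsl.sas7bdat")  # read_xpt() handles SAS transport files
spss_data  <- read_sav("survey.sav")
stata_data <- read_dta("panel.dta")

# Value labels arrive as the labelled class; convert to factors when required
spss_data$region <- as_factor(spss_data$region)

write_dta(sas_data, "adsl.dta")          # writing is supported too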

{RMariaDB}

While there are other approaches to working with databases using R, {RMariaDB} provides a database interface and driver for MariaDB, designed to fully comply with the DBI specification and serve as a replacement for the older {RMySQL} package. It supports connecting to databases using configuration files, executing queries, reading and writing data tables and managing results in chunks. Installation options include binary packages from CRAN or development versions from GitHub, with additional dependencies such as MariaDB Connector/C or libmysqlclient required for Linux and macOS systems. Configuration is typically handled through a MariaDB-specific file, and the package includes acknowledgments for contributions from various developers and organisations.
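
Here is a minimal connection sketch with hypothetical credentials and table; in practice, such details are better kept in the configuration file just mentioned:

library(DBI)

con <- dbConnect(RMariaDB::MariaDB(),
                 host = "localhost", dbname = "sales",
                 user = "analyst", password = Sys.getenv("DB_PASS"))

orders <- data.frame(id = 1:3, amount = c(50, 150, 220))
dbWriteTable(con, "orders", orders, overwrite = TRUE)

big_orders <- dbGetQuery(con, "SELECT * FROM orders WHERE amount > 100")

dbDisconnect(con)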

COVID-19 Data Hub

For many people, the pandemic may be a fading memory, yet it offered its chances for learning R, not least because there was a use case with more than a hint of personal interest about it. Here is a library making it easier to get hold of the data, with some added pre-processing too. Memories of how I needed to wrangle what was published by various sources make me appreciate just how vital it is to have harmonised data for analysis work.
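
Assuming the {COVID19} package, which serves as the hub's R client, getting hold of harmonised data takes very little code:

library(COVID19)

world   <- covid19(level = 1)                       # country-level data, all countries
ireland <- covid19(country = "Ireland", level = 2)  # one country, regional detail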

Table Production

While many appear to prefer graphical presentation of results to their tabular display, R does have its options here too. In recent times, the options have improved, particularly because of the pharmaverse initiative. Here is a selection of what I found during my explorations.

{officer}

Part of the {officeverse} along with {officedown}, {flextable}, {rvg} and {mschart}, the {officer} R package enables users to create and modify Word and PowerPoint documents directly from R, allowing the insertion of images, tables and formatted content, as well as the import of document content into data frames. It supports the generation of RTF files and integrates with other packages for advanced features such as vector graphics and native office charts. Installation options include CRAN and GitHub, with community resources available for assistance and contributions. The package facilitates the manipulation of document elements like paragraphs, tables and section breaks and provides tools for exporting and importing content between R and office formats, alongside functions for managing slide layouts and embedded objects in presentations.
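
A short sketch of building a Word document from scratch; the output file name is a placeholder:

library(officer)

doc <- read_docx()                       # start from the default template
doc <- body_add_par(doc, "Quarterly summary", style = "heading 1")
doc <- body_add_par(doc, "Figures below are provisional.")
doc <- body_add_table(doc, head(mtcars))

print(doc, target = "summary.docx")      # write the document to disk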

{pharmaRTF}

If you work in clinical research like I do, the need to produce data tabulations is a non-negotiable requirement. That is how this package came to be developed, and the pharmaverse of which it is part has numerous other options, should you need to look at using one of those. The flavour of RTF produced here is the Microsoft Word variety, which did not render as well in LibreOffice Writer when I last examined the output with that open-source alternative. Otherwise, the results look good to many eyes.

{formattable}

Here is a package that enhances data presentation by applying customisable formatting to vectors and data frames, supporting formats such as percentages, currency and accounting. Available on GitHub and CRAN, it integrates with dynamic document tools like {knitr} and {rmarkdown} to produce visually distinct tables, with features including gradient colour scales, conditional styling and icon-based representations. It automatically converts to {htmlwidgets} in interactive environments and is licensed under MIT, enabling flexible use in both static and interactive data displays.
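
A small example with made-up figures shows the idea:

library(formattable)

df <- data.frame(product = c("A", "B", "C"),
                 growth  = c(0.12, -0.05, 0.31),
                 revenue = c(52000, 18000, 97000))

formattable(df, list(
  growth  = percent,                       # render proportions as percentages
  revenue = color_tile("white", "orange")  # gradient shading by value
))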

{reactable}

The {reactable} package for R provides interactive data tables built on the React Table library, offering features such as sorting, filtering, pagination, grouping with aggregation, virtual scrolling for large datasets and support for custom rendering through R or JavaScript. It integrates seamlessly into R Markdown documents and Shiny applications, enabling the use of HTML widgets and conditional styling. Installation options include CRAN and GitHub, with examples demonstrating its application across various datasets and scenarios. The package supports major web browsers and is licensed under MIT, designed for developers seeking dynamic data presentation tools within the R ecosystem.
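
A quick sketch with the built-in iris data demonstrates grouping and aggregation:

library(reactable)

reactable(iris,
          groupBy    = "Species",   # group rows with aggregation
          filterable = TRUE,        # per-column filters
          searchable = TRUE,        # global search box
          columns = list(Sepal.Length = colDef(aggregate = "mean")))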

{DT}

Particularly useful in dynamic web applications like Shiny, the {DT} package in R provides a means of rendering interactive HTML tables by building on the DataTables JavaScript library. It supports features including sorting, searching, pagination and advanced filtering, with numeric, date and time columns using range-based sliders whilst factor and character columns rely on search boxes or dropdowns. Filtering operates on the client side by default, though server-side processing is also available. JavaScript callbacks can be injected after initialisation to manipulate table behaviour, such as enabling automatic page navigation or adding child rows to display additional detail. HTML content is escaped by default as a safeguard against cross-site scripting attacks, with the option to adjust this on a per-column basis. Whilst the package integrates with Shiny applications, attention is needed around scrolling and slider positioning to prevent layout problems. Overall, the package is well suited to exploratory data analysis and the building of interactive dashboards.
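
A single call gets you a long way:

library(DT)

datatable(iris,
          filter  = "top",                  # column filters above the table
          options = list(pageLength = 10))  # passed through to DataTables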

{gt}

The {gt} package in R enables users to create well-structured tables with a variety of formatting options, starting from data frames or tibbles and incorporating elements such as headers, footers and customised column labels. It supports output in HTML, LaTeX and RTF formats and includes example datasets for experimentation. The package prioritises simplicity for common tasks while offering advanced functions for detailed customisation, with installation available via CRAN or GitHub. Users can access resources like documentation, community forums and example projects to explore its capabilities, and it is supported by a range of related packages that extend its functionality.
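
Using exibble, the example tibble bundled with the package, a basic table looks like this:

library(gt)

exibble |>
  gt() |>
  tab_header(title = "A demonstration table",
             subtitle = "Built from the exibble dataset") |>
  fmt_number(columns = num, decimals = 2)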

{gtsummary}

Enabling users to produce publication-ready outputs with minimal code, the {gtsummary} package offers a streamlined approach to generating analytical and summary tables in R. It automates the summarisation of data frames, regression models and other datasets, identifying variable types and calculating relevant statistics, including measures of data incompleteness. Customisation options allow for formatting, merging and styling tables to suit specific needs, while integration with packages such as {broom} and {gt} facilitates seamless incorporation into R Markdown workflows. The package supports the creation of side-by-side regression tables and provides tools for exporting results as images, HTML, Word, or LaTeX files, enhancing flexibility for reporting and sharing findings.
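
Using trial, the example dataset bundled with the package, a comparative summary table needs only a line or two:

library(gtsummary)

trial |>
  tbl_summary(by = trt) |>  # summarise every variable by treatment group
  add_p()                   # append hypothesis test results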

{huxtable}

Here is an R package designed to generate LaTeX and HTML tables with a modern, user-friendly interface, offering extensive control over styling, formatting, alignment and layout. It supports features such as custom borders, padding, background colours and cell spanning across rows or columns, with tables modifiable using standard R subsetting or {dplyr} functions. Examples demonstrate its use for creating simple tables, applying conditional formatting and producing regression output with statistical details. The package also facilitates quick export to formats like PDF, DOCX, HTML and XLSX. Installation options include CRAN, R-Universe and GitHub, while the name reflects its origins as an enhanced version of the {xtable} package. The logo was generated using the package itself, and the background design draws inspiration from Piet Mondrian’s artwork.
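
A brief sketch, with a placeholder file name for the export:

library(huxtable)

ht <- as_hux(head(mtcars)[, 1:4])   # column names become the first row
ht <- set_bold(ht, 1, everywhere)   # embolden that header row
quick_docx(ht, file = "cars.docx")  # quick_pdf(), quick_html() and quick_xlsx() also exist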

Figure Generation

R has such a reputation for graphical presentations that it is cited as a strong reason to explore what the ecosystem has to offer. While base R itself is not shabby when it comes to creating graphs and charts, these packages will extend things by quite a way. In fact, the first on this list is near enough pervasive.

{ggplot2}

Though its default formatting does not appeal to me, the myriad options make this a very flexible tool, albeit at the expense of some code verbosity. Multi-panel plots are not among its strengths, which may send you elsewhere for that need.
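
For anyone yet to meet the grammar of graphics, a small example shows the layered approach:

library(ggplot2)

ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
  theme_minimal()   # one way to escape the default look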

{ggforce}

Focusing on features not included in the core library, the {ggforce} package extends {ggplot2} by offering additional tools to enhance data visualisation. Designed to complement the primary role of {ggplot2} in exploratory data analysis, it provides a range of geoms, stats and other components that are well-documented and implemented, aiming to support more complex and custom plot compositions. Available for installation via CRAN or GitHub, the package includes a variety of functionalities that are described in detail on its associated website.
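
The zoomed facet is a good illustration, echoing the package's own documentation:

library(ggforce)
library(ggplot2)

ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  facet_zoom(x = Species == "versicolor")  # full plot plus a zoomed panel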

{cowplot}

Developed by Claus O. Wilke for internal use in his lab, {cowplot} is an R package designed to help with the creation of publication-quality figures built on top of {ggplot2}. It provides a set of themes, tools for aligning and arranging plots into compound figures and functions for annotating plots or combining them with images. The package can be installed directly from CRAN or as a development version via GitHub, and it has seen widespread use in the book Fundamentals of Data Visualization.
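
A sketch of combining two plots into a labelled compound figure:

library(cowplot)
library(ggplot2)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()

plot_grid(p1, p2, labels = c("A", "B"))  # side-by-side panels with labels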

{sjPlot}

The {sjPlot} package provides a range of tools for visualising data and statistical results commonly used in social science research, including frequency tables, histograms, box plots, regression models, mixed effects models, PCA, correlation matrices and cluster analyses. It supports installation via CRAN for stable releases or through GitHub for development versions, with documentation and examples available online. The package is licensed under GPL-3 and developed by Daniel Lüdecke, offering functions to create visualisations such as scatter plots, Likert scales and interaction effect plots, along with tools for constructing index variables and presenting statistical outputs in tabular formats.
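
A sketch with a simple linear model gives the flavour:

library(sjPlot)

m <- lm(mpg ~ wt + hp, data = mtcars)

plot_model(m)  # forest plot of the model estimates
tab_model(m)   # the same model as a formatted HTML table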

{thematic}

By offering a centralised approach to theming and enabling automatic adaptation of plot styles within Shiny applications, the {thematic} package simplifies the styling of R graphics, including {ggplot2}, {lattice} and base R plots, R Markdown documents and RStudio. It allows users to apply consistent visual themes across different plotting systems, with auto-theming in Shiny and R Markdown relying on CSS and {bslib} themes, respectively. Installation requires specific versions of dependent packages such as {shiny} and {rmarkdown}, while custom fonts benefit from {showtext} or {ragg}. Users can set global defaults for background, foreground and accent colours, as well as fonts, which can be overridden with plot-specific theme adjustments. The package also defines default colour scales for qualitative and sequential data and integrates with tools like {bslib} to import Google Fonts, enhancing visual consistency across different environments and user interfaces.
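
Switching on a global theme is a one-liner; the colours here echo the package's own examples:

library(thematic)
library(ggplot2)

thematic_on(bg = "#222222", fg = "white", accent = "#0CE3AC")
ggplot(mtcars, aes(wt, mpg)) + geom_point()  # picks up the theme automatically
thematic_off()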

Publishing Tools

The R ecosystem goes beyond mere graphical and tabular display production to offer means for taking things much further, often providing platforms for publishing your work. These can be used locally too, so there is no need to entrust everything to a third-party provider. The uses for what is available are endless, and it appears that Posit has used this to help with building documentation and training too.

R Markdown

What you have here is one of those distinguishing facilities of the R ecosystem, particularly for those wanting to share their analysis work with more than a hint of reproducibility. The tool combines narrative text and code to generate various outputs, supporting multiple programming languages and formats such as HTML, PDF and dashboards. It enables users to produce reports, presentations and interactive applications, with options for publishing and scheduling through platforms like RStudio Connect, facilitating collaboration and distribution of results in professional settings.
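
A minimal document skeleton shows how little is needed; knitting it yields an HTML report with the code and its output interleaved:

---
title: "A minimal report"
output: html_document
---

Some narrative text to set the scene.

```{r summary-chunk}
summary(cars)  # cars is a built-in dataset; its summary appears in the report
```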

Distill for R Markdown

Distill for R Markdown is a tool designed to streamline the creation of technical documents, offering features such as code folding, syntax highlighting and theming. It builds on existing frameworks like Pandoc, MathJax and D3, enabling the production of dynamic, interactive content. Users can customise the appearance with CSS and incorporate appendices for supplementary information. The tool acknowledges the contributions of developers who created foundational libraries, ensuring accessibility and functionality for a wide audience. Its design prioritises clarity, allowing authors to focus on presenting results rather than underlying code, while maintaining flexibility for those who wish to include detailed explanations.

{shiny}

For a while, this was one of R's unique selling points, and it remains a compelling reason to use the language, even now that Python has its own version of the package. It enables the creation of interactive web applications for data analysis without requiring web development expertise, letting users build interfaces through which others can explore data using dynamic visualisations and filters. Here is a simple example: an app that generates scatter plots with adjustable variables, species filters and marginal plots, hosted either on personal servers or through a dedicated hosting service.
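
As a rough sketch of such an app, using the built-in iris data as a stand-in and omitting the marginal plots for brevity:

library(shiny)

ui <- fluidPage(
  selectInput("xvar", "X variable", names(iris)[1:4], selected = "Sepal.Length"),
  selectInput("yvar", "Y variable", names(iris)[1:4], selected = "Petal.Length"),
  checkboxGroupInput("species", "Species", levels(iris$Species),
                     selected = levels(iris$Species)),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    d <- subset(iris, Species %in% input$species)
    plot(d[[input$xvar]], d[[input$yvar]],
         xlab = input$xvar, ylab = input$yvar, pch = 19)
  })
}

shinyApp(ui, server)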

{bslib}

The {bslib} R package offers a modern user interface toolkit for Shiny and R Markdown applications, leveraging Bootstrap to enable the creation of customisable dashboards and interactive theming. It supports the use of updated Bootstrap and Bootswatch versions while maintaining compatibility with existing defaults, and provides tools for real-time visual adjustments. Installation is available through CRAN, with example previews demonstrating its capabilities.
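
A minimal sketch of applying a Bootswatch preset to a Shiny app:

library(shiny)
library(bslib)

ui <- fluidPage(
  theme = bs_theme(version = 5, bootswatch = "minty"),  # Bootstrap 5 with a preset
  h1("A themed application")
)

shinyApp(ui, server = function(input, output) {})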

{rhandsontable}

Enabling users to manipulate and validate data within a spreadsheet-like interface, the {rhandsontable} package introduces an interactive data grid for R. It supports features such as custom cell rendering, validation rules and integration with Shiny applications. When used in Shiny, the widget requires explicit conversion of data using the hot_to_r function, as updates may not be immediately reflected in reactive contexts. Examples demonstrate its application in various scenarios, including date editing, financial calculations and dynamic visualisations linked to charts. The package also accommodates bookmarks in Shiny apps with specific handling. Users are encouraged to report issues or contribute improvements, with guidance provided for those seeking to expand its functionality. The development team welcomes feedback to refine the tool further, ensuring it aligns with evolving user needs.
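
A sketch of that Shiny round trip, where hot_to_r() recovers the edited data frame:

library(shiny)
library(rhandsontable)

ui <- fluidPage(rHandsontableOutput("grid"))

server <- function(input, output) {
  output$grid <- renderRHandsontable(
    rhandsontable(data.frame(item = c("A", "B"), done = c(FALSE, TRUE)))
  )
  observe({
    req(input$grid)
    edited <- hot_to_r(input$grid)  # convert the widget state back to a data frame
    print(edited)
  })
}

shinyApp(ui, server)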

{xaringanExtra}

{xaringanExtra} offers a range of enhancements and extensions for creating and presenting slides with xaringan, enabling features such as adding an overview tile view, making slides editable, broadcasting in real time, incorporating animations, embedding live video feeds and applying custom styles. It allows users to selectively activate individual tools or load multiple features simultaneously through a single function call, supporting tasks like adding banners, enabling code copying, fitting slides to screen dimensions and integrating utility toolkits. The package is available for installation via CRAN or GitHub, providing flexibility for developers and presenters seeking to expand the functionality of their slides.
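
In a xaringan presentation, features are switched on from a setup chunk, as in this sketch:

```{r xaringan-extra, echo=FALSE}
xaringanExtra::use_tile_view()  # press "o" during a presentation for a slide overview
xaringanExtra::use_editable()   # make marked slide elements editable in the browser
```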

Advance your Data Science, AI and Computer Science skills using these online learning opportunities

25th July 2025

The landscape of online education has transformed dramatically over the past decade, creating unprecedented access to high-quality learning resources across multiple disciplines. This comprehensive examination explores the diverse array of courses available for aspiring data scientists, analysts, and computer science professionals, spanning from foundational programming concepts to cutting-edge artificial intelligence applications.

Data Analysis with R Programming

R programming has established itself as a cornerstone language for statistical analysis and data visualisation, making it an essential skill for modern data professionals. DataCamp's Data Analyst with R programme represents a comprehensive 77-hour journey through the fundamentals of data analysis, encompassing 21 distinct courses that progressively build expertise. Students begin with core programming concepts including data structures, conditional statements, and loops before advancing to sophisticated data manipulation techniques using tools such as dplyr and ggplot2. The curriculum extends beyond basic programming to include R Markdown for reproducible research, data manipulation with data.table, and essential database skills through SQL integration.

For those seeking more advanced statistical expertise, DataCamp's Statistician with R career track provides an extensive 108-hour programme spanning 27 courses. This comprehensive pathway develops essential skills for professional statistician roles, progressing from fundamental concepts of data collection and analysis to advanced statistical methodology. Students explore random variables, distributions, and conditioning through practical examples before advancing to linear and logistic regression techniques. The curriculum encompasses sophisticated topics including binomial and Poisson regression models, sampling methodologies, hypothesis testing, experimental design, and A/B testing frameworks. Advanced modules cover missing data handling, survey design principles, survival analysis, Bayesian data analysis, and factor analysis, making this track particularly suitable for those with existing R programming knowledge who seek to specialise in statistical practice.

The Google Data Analytics Professional Certificate programme, developed by Google and hosted on Coursera with US and UK versions, offers a structured six-month pathway for those seeking industry-recognised credentials. Students progress through eight carefully designed courses, beginning with foundational concepts in "Foundations: Data, Data, Everywhere" and culminating in a practical capstone project. The curriculum emphasises real-world applications, teaching students to formulate data-driven questions, prepare datasets for analysis, and communicate findings effectively to stakeholders.

Udacity's Data Analysis with R course presents a unique proposition as a completely free resource spanning two months of study. This programme focuses intensively on exploratory data analysis techniques, providing students with hands-on experience using RStudio and essential R packages. The course structure emphasises practical application through projects, including an in-depth exploration of diamond pricing data that demonstrates predictive modelling techniques.

Advanced Statistical Learning and Specialised Applications

Duke University's Statistics with R Specialisation elevates statistical understanding through a comprehensive seven-month programme that has earned a 4.6-star rating from participants. This five-course sequence delves deep into statistical theory and application, beginning with probability and data fundamentals before progressing through inferential statistics, linear regression, and Bayesian analysis. The programme distinguishes itself by emphasising both theoretical understanding and practical implementation, making it particularly valuable for those seeking to master statistical concepts rather than merely apply them.

The R Programming: Advanced Analytics course on Udemy, led by instructor Kirill, provides focused training in advanced R techniques within a compact six-hour format. This course addresses specific challenges that working analysts face, including data preparation workflows, handling missing data through median imputation, and working with complex date-time formats. The curriculum emphasises efficiency techniques such as using apply functions instead of traditional loops, making it particularly valuable for professionals seeking to optimise their analytical workflows.

Complementing this practical approach, the Applied Statistical Modelling for Data Analysis in R course on Udemy offers a more comprehensive 9.5-hour exploration of statistical methodology. The curriculum covers linear modelling implementation, advanced regression analysis techniques, and multivariate analysis methods. With its emphasis on statistical theory and application, this course serves those who already possess foundational R and RStudio knowledge but seek to deepen their understanding of statistical modelling approaches.

Imperial College London's Statistical Analysis with R for Public Health Specialisation brings academic rigour to practical health applications through a four-month programme. This specialisation addresses real-world public health challenges, using datasets that examine fruit and vegetable consumption patterns, diabetes risk factors, and cardiac outcomes. Students develop expertise in linear and logistic regression while gaining exposure to survival analysis techniques, making this programme particularly relevant for those interested in healthcare analytics.

Visualisation and Data Communication

Johns Hopkins University's Data Visualisation & Dashboarding with R Specialisation represents the pinnacle of visual analytics education, achieving an exceptional 4.9-star rating across its four-month curriculum. This five-course programme begins with fundamental visualisation principles before progressing through advanced ggplot2 techniques and interactive dashboard development. Students learn to create compelling visual narratives using Shiny applications and flexdashboard frameworks, skills that are increasingly essential in today's data-driven business environment.

The programme's emphasis on publication-ready visualisations and interactive dashboards addresses the growing demand for data professionals who can not only analyse data but also communicate insights effectively to diverse audiences. The curriculum balances technical skill development with design principles, ensuring graduates can create both statistically accurate and visually compelling presentations.

Professional Certification Pathways

DataCamp's certification programmes offer accelerated pathways to professional recognition, with each certification designed to be completed within 30 days. The Data Analyst Certification combines timed examinations with practical assessments to evaluate real-world competency. Candidates must demonstrate proficiency in data extraction, quality assessment, cleaning procedures, and metric calculation, reflecting the core responsibilities of working data analysts.

The Data Scientist Certification expands these requirements to include machine learning and artificial intelligence applications, requiring candidates to collect and interpret large datasets whilst effectively communicating results to business stakeholders. Similarly, the Data Engineer Certification focuses on data infrastructure and preprocessing capabilities, essential skills as organisations increasingly rely on automated data pipelines and real-time analytics.

The SQL Associate Certification addresses the universal need for database querying skills across all data roles. This certification validates both theoretical knowledge through timed examinations and practical application through hands-on database challenges, ensuring graduates can confidently extract and manipulate data from various database systems.

Emerging Technologies and Artificial Intelligence

The rapid advancement of artificial intelligence has created new educational opportunities that bridge traditional data science with cutting-edge generative technologies. DataCamp's Understanding Artificial Intelligence course provides a foundation for those new to AI concepts, requiring no programming background whilst covering machine learning, deep learning, and generative model fundamentals. This accessibility makes it valuable for business professionals seeking to understand AI's implications without becoming technical practitioners.

The Generative AI Concepts course builds upon this foundation to explore the specific technologies driving current AI innovation. Students examine how large language models function, consider ethical implications of AI deployment, and learn to maximise the effectiveness of AI tools in professional contexts. This programme addresses the growing need for AI literacy across various industries and roles.

DataCamp's Large Language Model Concepts course provides intermediate-level exploration of the technologies underlying systems like ChatGPT. The curriculum covers natural language processing fundamentals, fine-tuning techniques, and various learning approaches including zero-shot and few-shot learning. This technical depth makes it particularly valuable for professionals seeking to implement or customise language models within their organisations.

The ChatGPT Prompt Engineering for Developers course addresses the developing field of prompt engineering, a skill that has gained significant commercial value. Students learn to craft effective prompts that consistently produce desired outputs from language models, a capability that combines technical understanding with creative problem-solving. This expertise has become increasingly valuable as organisations integrate AI tools into their workflows.

Working with OpenAI API provides practical implementation skills for those seeking to build AI-powered applications. The course covers text generation, sentiment analysis, and chatbot development, giving students hands-on experience with the tools that are reshaping how businesses interact with customers and process information.

Computer Science Foundations

Stanford University's Computer Science 101 offers an accessible introduction to computing concepts without requiring prior programming experience. This course addresses fundamental questions about computational capabilities and limitations whilst exploring hardware architecture, software development, and internet infrastructure. The curriculum includes essential topics such as computer security, making it valuable for anyone seeking to understand the digital systems that underpin modern society.

The University of Leeds' Introduction to Logic for Computer Science provides focused training in logical reasoning, a skill that underlies algorithm design and problem-solving approaches. This compact course covers propositional logic and logical modelling techniques that form the foundation for more advanced computer science concepts.

Harvard's CS50 course, taught by Professor David Malan, has gained worldwide recognition for its engaging approach to computer science education. The programme combines theoretical concepts with practical projects, teaching algorithmic thinking alongside multiple programming languages including Python, SQL, HTML, CSS, and JavaScript. This breadth of coverage makes it particularly valuable for those seeking a comprehensive introduction to software development.

MIT's Introduction to Computer Science and Programming Using Python focuses specifically on computational thinking and Python programming. The curriculum emphasises problem-solving methodologies, testing and debugging strategies, and algorithmic complexity analysis. This foundation proves essential for those planning to specialise in data science or software development.

MIT's The Missing Semester course addresses practical tools that traditional computer science curricula often overlook. Students learn command-line environments, version control with Git, debugging techniques, and security practices. These skills prove essential for professional software development but are rarely taught systematically in traditional academic settings.

Accessible Learning Resources and Community Support

The democratisation of education extends beyond formal courses to include diverse learning resources that support different learning styles and schedules. YouTube channels such as Programming with Mosh, freeCodeCamp, Alex the Analyst, Tina Huang, and Ken Lee provide free, high-quality content that complements formal education programmes. These resources offer everything from comprehensive programming tutorials to career guidance and project-based learning opportunities.

The 365 Data Science platform contributes to this ecosystem through flashcard decks that reinforce learning of essential terminology and concepts across Excel, SQL, Python, and emerging technologies like ChatGPT. Their statistics calculators provide interactive tools that help students understand the mechanics behind statistical calculations, bridging the gap between theoretical knowledge and practical application.

Udemy's marketplace model supports this diversity by hosting over 100,000 courses, including many free options that allow instructors to share expertise with global audiences. The platform's filtering capabilities enable learners to identify resources that match their specific needs and learning preferences.

Industry Integration and Career Development

Major technology companies have recognised the value of contributing to global education initiatives, with Google, Microsoft and Amazon offering professional-grade courses at no cost. Google's Data Analytics Professional Certificate exemplifies this trend, providing industry-recognised credentials that directly align with employment requirements at leading technology firms.

These industry partnerships ensure that course content remains current with rapidly evolving technological landscapes, whilst providing students with credentials that carry weight in hiring decisions. The integration of real-world projects and case studies helps bridge the gap between academic learning and professional application.

The comprehensive nature of these educational opportunities reflects the complex requirements of modern data and technology roles. Successful professionals must combine technical proficiency with communication skills, statistical understanding with programming capability, and theoretical knowledge with practical application. The diversity of available courses enables learners to develop these multifaceted skill sets according to their career goals and learning preferences.

As technology continues to reshape industries and create new professional opportunities, access to high-quality education becomes increasingly critical. These courses represent more than mere skill development; they provide pathways for career transformation and professional advancement that transcend traditional educational barriers. Whether pursuing data analysis, software development, or artificial intelligence applications, learners can now access world-class education that was previously available only through expensive university programmes or exclusive corporate training initiatives.

The future of professional development lies in this combination of accessibility, quality and relevance that characterises the modern online education landscape. These resources enable individuals to build expertise that matches industry demands, while also maintaining the flexibility to learn at their own pace and according to their specific circumstances and goals.

Synthetic Data: The key to unlocking AI's potential in healthcare

18th July 2025

The integration of artificial intelligence into healthcare is being hindered by challenges such as data scarcity, privacy concerns and regulatory constraints. Healthcare organisations face difficulties in obtaining sufficient volumes of high-quality, real-world data to train AI models that can accurately predict outcomes or assist in decision-making.

Synthetic data, defined as algorithmically generated data that mimics real-world data, is emerging as a solution to these challenges. This artificially generated data mirrors the statistical properties of real-world data without containing any sensitive or identifiable information, allowing organisations to sidestep privacy issues and adhere to regulatory requirements.

By generating datasets that preserve statistical relationships and distributions found in real data, synthetic data enables healthcare organisations to train AI models with rich datasets while ensuring sensitive information remains secure. The use of synthetic data can also help address bias and ensure fairness in AI systems by enabling the creation of balanced training sets and allowing for the evaluation of model outputs across different demographic groups.
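
To make the principle concrete, here is a minimal sketch in R: it fits the means and covariance of some simulated, hypothetical patient measurements and draws wholly artificial records from them. Dedicated generators (the {synthpop} package, for instance) go much further, but the aim of preserving distributions without copying individuals is the same.

library(MASS)

set.seed(42)
# Hypothetical "real" measurements: systolic blood pressure and cholesterol
real <- data.frame(sbp  = rnorm(500, mean = 130, sd = 15),
                   chol = rnorm(500, mean = 5.2, sd = 1.1))

# Draw synthetic records from the fitted means and covariance structure;
# no generated row corresponds to any original individual
synth <- as.data.frame(mvrnorm(n = 500, mu = colMeans(real), Sigma = cov(real)))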

Furthermore, synthetic data can be generated programmatically, reducing the time spent on data collection and processing and enabling organisations to scale their AI initiatives more efficiently. Ultimately, synthetic data is becoming a critical asset in the development of AI in healthcare, enabling faster development cycles, improving outcomes and driving innovation while maintaining trust and security.

Avoiding permissions, times or ownership failure messages when using rsync

22nd April 2023

The rsync command is one that I use heavily for doing backups and web publishing. The latter means that it is part of how I update websites built using Hugo because new and/or updated files need uploading. The command also sees usage when uploading files onto other websites. During one of these operations, and I am unsure now as to which type is relevant, I encountered errors about being unable to set permissions.

The cause was the encompassing -a option. This is a shorthand for -rltpgoD, and the individual options perform the following:

-r: recursive transfer, copying all contents within a directory hierarchy

-l: symbolic links copied as symbolic links

-t: preserve times

-p: preserve permissions

-g: preserve groups

-o: preserve owners

-D: preserve device and special files

The solution is to drop some of the options if they are inappropriate. The minimum is to omit the option for permissions preservation, but others may not apply between different servers either, especially when operating systems differ. Removing the options for preserving permissions, groups and owners results in something like this:

rsync -rltD [rest of command]

While it can be good to have a more powerful command with the setting of a single option, it can mean trying to do too much. Another way to avoid permissions and similar errors is to have consistency between source and destination file systems, but that is not always possible.
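
For illustration, a website upload with the pared-back options might look like the line below; the paths and host are hypothetical stand-ins, and the --delete option, which removes destination files no longer present in the source, should be used with care:

rsync -rltDv --delete public/ user@example.com:/var/www/mysite/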

Copying only new or updated files by command line in Linux or Windows

2nd August 2014

With a growing collection of photographic images, I often find myself making backups of files using copy commands, and the data volumes are such that I don't want to keep copying the same files over and over again, so incremental file transfers are what I need. Thus, commands like the following often get issued from a Linux command line:

cp -pruv [source] [destination]

Because this is on Linux, it is the bash shell that I use, so the switches may not apply with other shells like sh, fish or ksh. In my case, the p switch preserves file properties such as time and date, something that the cp command does not always do, so it needs adding. The r switch is useful because the copy is then recursive, so only a directory needs to be specified as the source, while the destination needs to be one level up from a folder of the same name to avoid file duplication. It is the u switch that makes the file copy incremental, and the v one issues messages to the shell that show how the copying is going. Seeing a file name issued by the latter tells you how much more needs to be copied and that the files are going where they should.
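
To make the source and destination relationship concrete, backing up a Photos directory to a mounted drive would look like the following, with hypothetical paths; the copy lands in /media/backup/Photos:

cp -pruv ~/Photos /media/backup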

What inspired this post though is my need to do the same in a Windows session, and issuing xcopy commands will achieve the same end. Here are two that will do the needful:

xcopy [source] [destination] /d /s

xcopy [source] [destination] /d /e

In both cases, it is the d switch that ensures that the copy is incremental, and you can add a date too, with a colon between it and the /d, if you see fit. The s switch copies only directories that contain files, while the e one copies even empty directories. Using the d switch without either of those did not trigger any copying action when I tried, so I reckon that you cannot do without either of them. By default, both of these commands issue output to the command line so you can keep an eye on what is happening, and this is especially useful when ensuring that files are going to the right destination, because the behaviour differs from that of the bash shell on Linux.
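
For instance, with hypothetical paths and a date restricting the copy to files changed on or after it (note the month-day-year ordering that xcopy expects):

xcopy C:\photos D:\backup\photos /d:08-01-2014 /s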
