TOPIC: COMPUTER DATA
Synthetic Data: The key to unlocking AI's potential in healthcare
18th July 2025
The integration of artificial intelligence into healthcare is being hindered by challenges such as data scarcity, privacy concerns and regulatory constraints. Healthcare organisations face difficulties in obtaining sufficient volumes of high-quality, real-world data to train AI models that can accurately predict outcomes or assist in decision-making.
Synthetic data, defined as algorithmically generated data that mimics real-world data, is emerging as a solution to these challenges. Because it mirrors the statistical properties of real-world data without containing any sensitive or identifiable information, it allows organisations to sidestep privacy issues and adhere to regulatory requirements.
By generating datasets that preserve statistical relationships and distributions found in real data, synthetic data enables healthcare organisations to train AI models with rich datasets while ensuring sensitive information remains secure. The use of synthetic data can also help address bias and ensure fairness in AI systems by enabling the creation of balanced training sets and allowing for the evaluation of model outputs across different demographic groups.
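To make the idea concrete, here is a minimal sketch in Python using entirely made-up numbers rather than any real healthcare data: a Gaussian model is fitted to a table of numeric measurements and then sampled, so the synthetic records share the original's means and correlations while no row corresponds to a real individual. Real synthetic data tools are far more sophisticated, but the principle is the same.

import numpy as np

# Hypothetical stand-in for a real table of numeric health measurements
rng = np.random.default_rng(42)
real = rng.normal(loc=[120.0, 80.0, 70.0], scale=[15.0, 10.0, 12.0], size=(500, 3))

# Fit the mean vector and covariance matrix of the "real" data,
# then draw synthetic records from a Gaussian with the same statistics
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=500)

# The synthetic table mirrors the original's means and pairwise
# correlations, but contains none of the original records
print(np.round(mean, 1))
print(np.round(synthetic.mean(axis=0), 1))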
Furthermore, synthetic data can be generated programmatically, reducing the time spent on data collection and processing and enabling organisations to scale their AI initiatives more efficiently. Ultimately, synthetic data is becoming a critical asset in the development of AI in healthcare, enabling faster development cycles, improving outcomes and driving innovation while maintaining trust and security.
Copying only new or updated files by command line in Linux or Windows
2nd August 2014
With a growing collection of photographic images, I often find myself making backups of files using copy commands. The data volumes are such that I don't want to keep copying the same files over and over again, so incremental file transfers are what I need. Commands like the following often get issued from a Linux command line:
cp -pruv [source] [destination]
Because this is on Linux, it is the bash shell that I use, so the switches may not apply with other shells like csh, fish or ksh. In my case, the -p switch preserves file properties such as time and date, something the cp command does not always do, so it needs adding. The -r switch is useful because the copy is then recursive, so only a directory needs to be specified as the source, and the destination needs to be one level up from a folder with the same name there to avoid file duplication. It is the -u switch that makes the file copy incremental, and the -v one issues messages to the shell that show how the copying is going. Seeing a file name issued by the latter tells you how much more needs to be copied and that the files are going where they should.
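As a concrete illustration, with hypothetical paths, the following would copy only new or updated files from a photo collection to a backup drive while preserving timestamps; note that /media/backup is the level above the Pictures folder, in line with the destination rule above:

cp -pruv ~/Pictures /media/backup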
What inspired this post, though, is my need to do the same in a Windows session, and issuing xcopy commands will achieve the same end. Here are two that will do the needful:
xcopy [source] [destination] /d /s
xcopy [source] [destination] /d /e
In both cases, it is the /d switch that ensures that the copy is incremental, and you can add a date too, with a colon between it and the /d, if you see fit. The /s switch copies only directories that contain files, while the /e one copies even empty directories. Using the /d switch without either of those did not trigger any copying action when I tried, so I reckon that you cannot do without one of them. By default, both of these commands issue output to the command line so you can keep an eye on what is happening, and this is especially useful when ensuring that files are going to the right destination, because the behaviour differs from that of the bash shell on Linux.
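To illustrate with hypothetical paths, the first command below mirrors a folder including empty subdirectories, while the second adds a cut-off date so that only files changed on or after it are copied (the date format depends on the system locale; month-day-year is shown here):

xcopy C:\Pictures D:\Backup\Pictures /d /e
xcopy C:\Pictures D:\Backup\Pictures /d:08-01-2014 /e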
Using Data Step to Create a Dataset Template from a Dataset in SAS
23rd November 2010
Recently, I wanted to make sure that some temporary datasets being created during data processing in a dataset creation program weren't truncating values or differing from the variable lengths in the original. It was then that a brainwave struck me: create an empty dataset shell using a data step, and use that to set all the variable lengths for me when the new datasets were concatenated to it. The code turned out to be very simple, and here is an example of how it looked:
data shell;
  stop;
  set example;
run;
The STOP statement prevents the data step from reading in any of the values from the template dataset, so just its header is written out to another (empty) dataset that can be used to set things up as you would want them to be. Because the SET statement is processed at compile time, the variable names, lengths and other attributes are copied across even though no observations are read. It certainly was a quick solution in my case.
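As a sketch of how the shell might then be used, with hypothetical dataset names: since SAS takes a variable's attributes from the first dataset in which it appears, listing the shell first in a concatenation makes its lengths apply to the combined output.

data combined;
  set shell temp1;
run;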