Technology Tales

Adventures in consumer and enterprise technology

AI & Data Science Jottings

While there is a LinkBlog on here, it has caught many different things, so I want to split off links to Data Science material and that is what you find here. Meanwhile, the world of GenAI has burst upon us, so you will find something from there in here too. There is so much happening that just keeping up can become an effort all of its own. Even so, I will try to stop things getting lost in a growing pile.

13th August 2025, 23:41

Here is another course from SAS Institute, one that provides foundational knowledge about trust and responsibility in artificial intelligence and machine learning systems, targeting anyone involved in making business decisions based on AI or designing AI systems regardless of their role. The programme covers how trustworthy AI integrates with analytics life cycles and data supply chains, focusing on identifying and addressing unwanted biases throughout these processes. Participants learn six core principles of responsible innovation including human-centricity, inclusivity, accountability, privacy and security, robustness, and transparency through practical scenarios ranging from healthcare risk models to speech recognition systems. The curriculum examines real-world examples such as racial bias in research, mobile device encryption, cryptocurrency exchange failures, and credit rating agency practices to illustrate these principles in action. The course requires no formal prerequisites beyond basic data literacy and can be completed at one's own pace with each module designed to take under an hour, making it accessible to data consumers, IT professionals, managers, analysts, data scientists, and decision-makers across various industries.

13th August 2025, 23:39

This comprehensive course explores Generative Artificial Intelligence and its practical applications through SAS tools, covering approximately four hours of content with hands-on practice components. The programme examines various types of GenAI systems within the broader AI landscape, addressing key challenges and opportunities in developing trustworthy AI solutions. Students learn to generate synthetic data using techniques such as Synthetic Minority Oversampling Technique and Generative Adversarial Networks, whilst exploring how Large Language Models produce meaningful content through transformer architecture and attention mechanisms. The curriculum includes practical instruction on using Bidirectional Encoder Representations from Transformers for content classification and implementing Retrieval Augmented Generation to enhance LLM output accuracy and relevance. Designed for learners with existing statistics and machine learning background using SAS, the course takes a phased release approach with new lessons added periodically to reflect the rapidly evolving field, covering everything from fundamental GenAI concepts to advanced implementation techniques within SAS Viya and SAS Machine Learning environments.

8th August 2025, 14:10

OpenAI has released GPT-5, their most advanced model for coding and agentic tasks, now available through their API platform in three sizes: gpt-5, gpt-5-mini, and gpt-5-nano. The model achieves state-of-the-art performance across key coding benchmarks, scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot, whilst demonstrating particular excellence in frontend development where it outperformed OpenAI o3 in 70% of internal tests. GPT-5 excels at collaborative coding tasks, bug fixing, and handling complex codebases, with enhanced capabilities for chaining together multiple tool calls in sequence or parallel without losing context. The model introduces new API features including adjustable verbosity levels (low, medium, high), a minimal reasoning effort option for faster responses, and custom tools that allow plaintext input instead of JSON formatting. Beyond coding, GPT-5 shows significant improvements in instruction following, achieving 69.6% on Scale MultiChallenge, and demonstrates superior performance in long-context tasks with support for up to 400,000 total tokens. The model exhibits substantially improved factual accuracy, making approximately 80% fewer factual errors than previous models on Long Fact and FactScore benchmarks, making it more suitable for high-stakes applications where correctness is essential. Early testing partners including Cursor, Windsurf, and Vercel have provided positive feedback regarding the model's intelligence, steerability, and reduced error rates compared to other frontier models.
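To make the new API options a little more concrete, here is a minimal Python sketch that assembles the request settings mentioned above. The helper name `build_gpt5_request` is hypothetical, and the field names (`text.verbosity`, `reasoning.effort`) and the effort values other than `minimal` are my assumptions about how the Responses API is laid out, so check the current OpenAI documentation before relying on them.

```python
# Hypothetical helper: validates options and builds keyword arguments
# for a request. It does not call any network service itself.
MODEL_SIZES = {"gpt-5", "gpt-5-mini", "gpt-5-nano"}
VERBOSITY_LEVELS = {"low", "medium", "high"}
# "minimal" comes from the announcement; the other values are assumed.
REASONING_EFFORTS = {"minimal", "low", "medium", "high"}


def build_gpt5_request(prompt, model="gpt-5", verbosity="medium", effort="medium"):
    """Return a dict of request settings, rejecting unknown option values."""
    if model not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {model}")
    if verbosity not in VERBOSITY_LEVELS:
        raise ValueError(f"unknown verbosity: {verbosity}")
    if effort not in REASONING_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": model,
        "input": prompt,
        "text": {"verbosity": verbosity},   # assumed field name
        "reasoning": {"effort": effort},    # assumed field name
    }


request = build_gpt5_request(
    "Summarise this diff in two sentences.",
    model="gpt-5-mini",
    verbosity="low",
    effort="minimal",
)
print(request)
```

If the assumptions hold, a dictionary like this could be unpacked into a call made with the official client library; validating the options first simply catches typos before any tokens are spent.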

28th July 2025, 17:30

The development of Good Machine Learning Practice (GMLP) for medical device innovation is at the forefront of regulatory initiatives led by the U.S. FDA, Health Canada, and the UK's Medicines and Healthcare Products Regulatory Agency. These organisations have outlined ten guiding principles aimed at promoting the safe and effective use of AI and machine learning technologies in healthcare. Emphasising multidisciplinary expertise throughout the product lifecycle is crucial for integrating machine learning models into clinical workflows safely and effectively, while addressing patient needs. Ensuring representative data sets in clinical studies, maintaining independence between training and test data sets, and selecting reference data based on the best available methods are essential for generalising results across intended patient populations. Appropriately tailored model design can mitigate risks like overfitting and security issues, with the focus placed on the performance of the human-AI team and not just on the model in isolation. Monitoring real-world use while managing re-training risks, providing users with clear and contextually relevant information, and maintaining robust software engineering and security practices are imperative. This collaborative framework aims to advance GMLP standards and regulatory guidelines by encouraging international cooperation, harmonisation, and innovation in AI-powered medical technologies. Users are encouraged to engage with these developments, providing valuable feedback through dedicated platforms.
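One common way of keeping training and test data independent is to split by patient instead of by record, so that no individual contributes to both sets. The Python sketch below illustrates that idea only; the function name `split_by_patient` and the `patient_id` field are hypothetical, and nothing here is prescribed by the guiding principles themselves.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split records so that no patient appears in both train and test."""
    by_patient = defaultdict(list)
    for record in records:
        by_patient[record["patient_id"]].append(record)

    # Shuffle patient identifiers, not individual records.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [r for pid in patient_ids if pid not in test_ids for r in by_patient[pid]]
    test = [r for pid in patient_ids if pid in test_ids for r in by_patient[pid]]
    return train, test


# Made-up example: 50 records spread across 10 patients.
records = [{"patient_id": f"P{i % 10}", "scan": i} for i in range(50)]
train, test = split_by_patient(records)

train_ids = {r["patient_id"] for r in train}
test_ids = {r["patient_id"] for r in test}
print(len(train), len(test), train_ids.isdisjoint(test_ids))
```

A record-level random split of the same data would almost certainly place scans from one patient on both sides, which tends to flatter the test results.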

26th July 2025, 19:26

Advanced problem-solving models, known as reasoning models, have been developed to perform complex tasks such as coding, scientific reasoning and multistep planning. These models think before responding, producing a chain of internal thought before generating an answer. They are particularly useful for tasks that require high-level guidance rather than precise instructions. The models use reasoning tokens, which are not visible, to break down prompts and consider multiple approaches to generating a response. To manage costs, it is possible to limit the total number of tokens generated by the model, including both reasoning and completion tokens. Ensuring sufficient space in the context window for reasoning tokens is crucial to prevent incurring costs without receiving a visible response. The models can be used through various endpoints, and developers may need to complete organisation verification before accessing certain models. When prompting these models, it is generally more effective to provide high-level guidance rather than precise instructions, allowing them to work out the details themselves.
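The point about leaving room for reasoning tokens can be sketched with a few small checks in Python. The functions work on a plain dictionary standing in for a parsed API response; the field names (`status`, `incomplete_details`, `usage`, `output_tokens_details`, `reasoning_tokens`) reflect my understanding of the OpenAI Responses API and should be treated as assumptions, while the function names and all of the numbers are invented for illustration.

```python
def fits_context(prompt_tokens, max_output_tokens, context_window):
    """True when the prompt plus the output budget fits in the context window.

    The output budget has to cover hidden reasoning tokens as well as
    the visible completion.
    """
    return prompt_tokens + max_output_tokens <= context_window


def ran_out_of_tokens(response):
    """True when a response stopped because the output budget was used up."""
    details = response.get("incomplete_details") or {}
    return (
        response.get("status") == "incomplete"
        and details.get("reason") == "max_output_tokens"
    )


def visible_tokens(response):
    """Output tokens left over once reasoning tokens are subtracted."""
    usage = response.get("usage") or {}
    details = usage.get("output_tokens_details") or {}
    return usage.get("output_tokens", 0) - details.get("reasoning_tokens", 0)


# Invented example: the whole budget went on reasoning, so nothing is visible.
example = {
    "status": "incomplete",
    "incomplete_details": {"reason": "max_output_tokens"},
    "usage": {
        "output_tokens": 300,
        "output_tokens_details": {"reasoning_tokens": 300},
    },
}

print(fits_context(prompt_tokens=2_000, max_output_tokens=25_000, context_window=128_000))
print(ran_out_of_tokens(example), visible_tokens(example))
```

A case like the example is the one to watch for: tokens have been billed, yet there is no visible answer, so the sensible reaction is to raise the budget and try again.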

26th July 2025, 12:23

To securely and reliably allow traffic from ChatGPT agents to reach a site, it is possible to identify authentic traffic by checking for specific headers. The ChatGPT agent signs every outbound HTTP request, enabling confident identification of genuine traffic. This is achieved through the use of HTTP Message Signatures, which include a Signature and Signature-Input set of headers, as well as a companion Signature-Agent header. By verifying these headers and checking the public key associated with the signature, it is possible to confirm the authenticity of the request. Cloudflare users can allowlist ChatGPT agent traffic by creating a rule that skips or allows requests from verified bots, while users of other CDNs can trust ChatGPT agent traffic by checking the request headers and verifying the signature.
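For anyone not using Cloudflare, the header checks described above might start with something like the following Python sketch. It is only a cheap pre-filter: it confirms that the three headers are present and that the signature's validity window covers the current time, and it does not perform the cryptographic verification against the public key, which is still needed before trusting a request. The expected `Signature-Agent` value, the presence of `created` and `expires` parameters and the function names are all assumptions on my part.

```python
import re
import time

REQUIRED_HEADERS = ("signature", "signature-input", "signature-agent")
EXPECTED_AGENT = '"https://chatgpt.com"'  # assumed value, check the provider docs


def parse_signature_params(signature_input):
    """Pull ;name=value parameters (created, expires, keyid...) from Signature-Input."""
    params = {}
    pattern = r';\s*([a-z]+)=(?:"([^"]*)"|([^;\s]+))'
    for name, quoted, bare in re.findall(pattern, signature_input):
        params[name] = quoted if quoted else bare
    return params


def precheck_agent_request(headers, now=None):
    """Return (ok, reason). This does NOT verify the signature itself."""
    lowered = {key.lower(): value for key, value in headers.items()}
    for name in REQUIRED_HEADERS:
        if name not in lowered:
            return False, f"missing {name} header"
    if lowered["signature-agent"].strip() != EXPECTED_AGENT:
        return False, "unexpected Signature-Agent"

    params = parse_signature_params(lowered["signature-input"])
    try:
        created, expires = int(params["created"]), int(params["expires"])
    except (KeyError, ValueError):
        return False, "missing or malformed created/expires"

    now = time.time() if now is None else now
    if not created <= now <= expires:
        return False, "signature outside validity window"
    return True, "pre-check passed; cryptographic verification still required"


# Invented header values for illustration only.
sample_headers = {
    "Signature": "sig1=:ZmFrZQ==:",
    "Signature-Input": 'sig1=("@authority" "@method" "@path");created=1000;expires=2000;keyid="example-key"',
    "Signature-Agent": '"https://chatgpt.com"',
}
print(precheck_agent_request(sample_headers, now=1500))
```

Requests that fail the pre-check can be rejected early, while those that pass still need their signature verified with a proper HTTP Message Signatures implementation.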

26th July 2025, 12:16

ChatGPT Agent is a feature that enables ChatGPT to complete complex online tasks on behalf of users. It can conduct research, fill out forms and edit documents, all while allowing users to remain in control. To use this feature, users must be subscribed to certain plans, such as Pro, Plus, or Team, and it is available on various devices, including web, mobile and desktop apps. The feature is not currently available in Switzerland or the European Economic Area, but access is expected to be expanded soon. Users can schedule tasks to repeat and view and manage their tasks, and the feature includes safeguards to help prevent privacy risks, such as prompt injection attacks. To keep data safe, users are advised to be cautious when logging in to websites or using connectors and to follow best practices, such as not typing passwords or private information directly into messages and regularly reviewing connector permissions. The feature takes screenshots to interact with web pages, but does not capture sensitive data when users are controlling the virtual browser. Users' data are used in accordance with the provider's privacy policy, and chats and screenshots are retained until deleted by the user.

26th July 2025, 11:56

DeepLearning.AI is an online education platform founded by Andrew Ng in 2017, with the aim of making top-tier artificial intelligence education accessible globally. The company offers a wide range of courses and certifications, including deep learning foundations, natural language processing and AI for non-technical audiences. It is led by Ng, a leading figure in artificial intelligence, who has consistently advocated for accessible AI education and has launched several notable courses. The platform hosts expert instruction, hands-on projects and a supportive community, furthering its mission to democratise AI tools and skills for broad societal benefit.

9th July 2025, 22:12

Anthropic has unveiled a new 'Integrations' feature enabling Claude to connect with various applications and tools, alongside an enhanced 'Research' capability that can search the web, Google Workspace and integrated apps. This advanced research function allows Claude to investigate topics for up to 45 minutes before delivering comprehensive reports with proper citations. Initially available to users on premium plans, Integrations supports ten popular services including Atlassian's Jira, Zapier, Cloudflare and Intercom, with more partnerships forthcoming. Developers can create their own integrations in approximately 30 minutes using provided documentation. These updates significantly expand Claude's functionality, allowing it to understand project histories, organisational knowledge and take actions across multiple platforms, effectively transforming it into a more informed digital collaborator for complex project management.

9th July 2025, 21:38

The integration of data science and artificial intelligence is transforming biometrics careers, with employers now valuing candidates who possess hybrid skills, are familiar with newer platforms and can adapt to complex data environments. As organisations adopt more automated systems and predictive modelling tools, traditional biometric roles are being redefined, with a greater emphasis on interpretation, validation and system-level oversight. Biometrics teams must be able to work alongside automated systems, validate outputs and ensure that data meets regulatory standards, with skills such as programming fluency, experience with cloud tools and familiarity with machine learning libraries becoming increasingly important. Employers must prioritise candidates who understand regulated systems and can support traceability and inspection readiness. They should also provide training on audit trail review, output validation and documentation of overrides to upskill their teams. Ultimately, the most effective biometrics teams will combine strong analytical skills with a clear understanding of how automated outputs must be validated, interpreted and documented to meet regulatory standards.

9th July 2025, 17:41

Large language models are inherently non-deterministic, meaning they can produce different responses to the same input, which can lead to errors and inconsistencies. This lack of determinism can be problematic in enterprise software applications where reliability is crucial. To mitigate this issue, developers can implement measures such as sanitising inputs and outputs, observing the process as much as possible and ensuring that processes run once and only once. Additionally, using durable execution technologies can help save progress in workflows and prevent repeated calls to external services. By introducing these controls, developers can make large language models more reliable and trustworthy, which is essential for building robust enterprise software applications that organisations can rely on.
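Two of the controls mentioned above, running a step once and only once and sanitising outputs, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: a plain dictionary stands in for the durable store that a real durable execution system would provide, `fake_llm_call` stands in for a model call, and every name here is hypothetical.

```python
import hashlib
import json


def step_key(workflow_id, step_name, payload):
    """Derive a stable key so the same step with the same input is recognised."""
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{workflow_id}:{step_name}:{blob}".encode()).hexdigest()


def run_once(store, workflow_id, step_name, payload, action):
    """Run action(payload) unless a saved result exists, then save the result."""
    key = step_key(workflow_id, step_name, payload)
    if key in store:
        return store[key]          # replay the saved result, no second call
    result = action(payload)
    store[key] = result
    return result


def sanitise_label(raw, allowed, fallback="unknown"):
    """Normalise a free-text model answer to one of the allowed labels."""
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in allowed else fallback


calls = []


def fake_llm_call(payload):
    """Stand-in for a model call; records each invocation."""
    calls.append(payload)
    return " Positive. "


store = {}  # a real system would persist this to disk or a database
first = run_once(store, "wf-1", "classify", {"text": "Great service"}, fake_llm_call)
second = run_once(store, "wf-1", "classify", {"text": "Great service"}, fake_llm_call)

print(len(calls), sanitise_label(first, {"positive", "negative"}))
```

Because the second call is answered from the store, a retried workflow neither pays for the model twice nor risks getting a different answer the second time around.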

21st April 2025, 22:19

The Quartz guide to bad data is an extensive resource that helps journalists and data users recognise and address frequent issues found in real-world datasets. It details a wide variety of common data problems, such as missing or duplicated values, inconsistent spellings, ambiguous fields, problematic categorisations, and undocumented origins. The guide categorises issues according to who is best placed to resolve them: the user, the data provider, an external expert, or a programmer. It also offers guidance for dealing with challenges like human data entry errors, non-random or biased samples, unclear margins of error, manual editing, inflation, seasonal variations, and manipulation of timeframes or reference points. More complex problems, such as those involving untrustworthy sources, opaque collection methods, unrealistic precision, outliers, misleading indices, statistical manipulation, or poorly aggregated data, may require the input of specialists or programmers. Overall, the guide emphasises a careful, questioning approach to data to help prevent mistakes and ensure more reliable analysis and reporting.
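A few of the simpler problems on that list, namely missing values, duplicated rows and inconsistent spellings, lend themselves to quick automated checks. The Python sketch below is my own illustration and does not come from the guide; the function names and the sample rows are made up.

```python
from collections import Counter


def audit_rows(rows, key_fields):
    """Count blank values per field and find keys that occur more than once."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                missing[field] += 1
    keys = Counter(tuple(row.get(field) for field in key_fields) for row in rows)
    duplicates = {key: count for key, count in keys.items() if count > 1}
    return dict(missing), duplicates


def spelling_variants(rows, field):
    """Group values that differ only by case or spacing."""
    groups = {}
    for row in rows:
        value = row.get(field)
        if value:
            normalised = " ".join(str(value).lower().split())
            groups.setdefault(normalised, set()).add(value)
    return {key: sorted(values) for key, values in groups.items() if len(values) > 1}


# Made-up rows showing each problem once.
rows = [
    {"id": 1, "city": "New York", "count": "12"},
    {"id": 2, "city": "new york ", "count": ""},
    {"id": 2, "city": "Boston", "count": "7"},
]

missing, duplicates = audit_rows(rows, key_fields=["id"])
variants = spelling_variants(rows, "city")
print(missing, duplicates, variants)
```

Checks like these only flag candidates for a closer look; deciding whether a duplicate is an error or a legitimate repeat still needs a person who knows the data.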

20th March 2025, 15:38

Elluminate Clinical Data Cloud from eClinical Solutions is a cloud-based platform that integrates various data streams, standardises complex information, and provides analytics capabilities, supporting decision-making throughout the clinical research process. It consolidates clinical and operational data into a single repository, eliminating traditional data silos and facilitating cross-functional collaboration. With built-in automation and study-agnostic machine learning, the platform supports AI integration, optimising data flow from initial acquisition to regulatory submission. The platform includes tools like the Elluminate Mapper, which allows non-technical users to perform intricate data transformations needed for regulatory compliance.

18th December 2024, 11:08

A blog post from Dataiku in November 2024 evaluates the performance of ChatGPT two years after its release, comparing its responses to those of AI professionals surveyed in May 2024. The survey involved 400 senior AI professionals from globally recognised companies, focusing on AI deployment trends. ChatGPT was tested with five questions presented to these AI leaders to assess its knowledge. The analysis revealed that large organisations typically adopt a Hub & Spoke or Centralised Center of Excellence model for AI initiatives, with most achieving a -5 return on each spent on AI and data science. Key barriers hindering AI value include access to quality data and a shortage of data talent. ChatGPT achieved a score of 3.15 out of 5 in the test, demonstrating a close alignment with the survey findings and highlighting its potential as a useful tool for understanding industry trends, despite some nuances it may miss.

5th December 2024, 17:06

Cursor

22nd November 2024, 23:04

Introduction to Meta AI’s LLaMa

25th October 2024, 15:49

NotebookLM

25th October 2024, 15:43

Build and Deploy RAG-as-a-service

25th October 2024, 15:41

IBM Granite 3.0: open, state-of-the-art enterprise models

2nd October 2024, 20:35

Using Llama 3.2 Locally

2nd October 2024, 20:35

5 LLM Tools I Can’t Live Without

26th September 2024, 22:20

ScraperAPI

14th March 2024, 15:21

7 GPTs to Help Improve Your Data Science Workflow

14th March 2024, 15:20

Data Science and the Go Programming Language

22nd January 2024, 16:04

Tugan.ai

8th January 2024, 22:33

Microsoft Clarity

5th January 2024, 13:32

The best AI chatbots in 2024

How to write effective AI art prompts

The best AI image generators in 2024

18th December 2023, 18:18

OWASP Top 10 for Large Language Model Applications

20th October 2023, 23:04

Object Management Group Business Process Model and Notation

31st July 2023, 20:26

So you want to build your own open source ChatGPT-style chatbot…

31st July 2023, 16:42

Did ChatGPT write this? Here’s how to tell.

4th July 2023, 19:52

Voiceflow

14th June 2023, 17:13

ChatGPT brings AI into popular culture

14th June 2023, 17:02

My general advice on getting an analytics job

19th March 2023, 15:26

Welcome to NIHPO's Synthetic Health Data Platform

24th February 2023, 14:35

The best AI scheduling assistants

19th January 2023, 15:52

How to use OpenAI's GPT-3 to write business emails

30th November 2022, 16:27

What is Chebychev’s Theorem and How Does it Apply to Data Science?

19th November 2022, 18:22

The Complete Free PyTorch Course for Deep Learning

19th November 2022, 14:50

7 Techniques to Handle Imbalanced Data

18th November 2022, 22:46

15 More Free Machine Learning and Deep Learning Books

26th October 2022, 14:58

Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS

26th October 2022, 14:57

Alliance for Data Science Professionals

26th October 2022, 09:12

SingleStoreDB

24th October 2022, 09:25

What is data extraction? And how to automate the process

12th October 2022, 11:57

How to reveal new connections in a knowledge graph with link prediction

12th October 2022, 11:52

Fathom

12th October 2022, 11:43

Metabase

28th September 2022, 15:09

BlueSky Statistics

28th September 2022, 15:01

Top ten database attacks

8th September 2022, 13:49

Excel Formula Generator

17th August 2022, 16:54

Six tips for better spreadsheets

21st July 2022, 17:50

Apache Superset

8th July 2022, 15:54

FOSS For Spectroscopy

9th May 2022, 15:36

9 Free Harvard Courses to Learn Data Science

8 Free MIT Courses to Learn Data Science Online

28th April 2022, 14:32

The Book of OHDSI

28th April 2022, 14:31

Observational Health Data Sciences and Informatics

27th April 2022, 14:56

Katja Glass Consulting Open Source Portal

27th April 2022, 14:54

CDISC Open Source Alliance

27th April 2022, 14:54

OpenClinica

21st February 2022, 15:30

Best Data Science Books For Beginners

13th February 2022, 13:02

Data Hub

29th January 2022, 22:02

OpenCPU

24th January 2022, 08:54

The High Paying Side Hustles for Data Scientists

13th January 2022, 18:05

7 top predictive analytics use cases: Enterprise examples

13th January 2022, 17:53

6 challenges of building predictive analytics models

23rd December 2021, 14:00

SciML Scientific Machine Learning Software

23rd December 2021, 10:31

PumasAI

10th December 2021, 18:33

SAS Institute has shared a few COVID resources for data scientists and others, so I have shared links to them here as well:

8 terms you need to understand when assessing COVID-19 data

Vaccine Efficacy, Clinical Trials, and SAS: Part 4 of Biostats in the Time of Coronavirus

What matters now when it comes to COVID-19

7th December 2021, 13:37

CoCalc

2nd December 2021, 16:46

DataKind UK

21st October 2021, 16:49

Open Neural Network Exchange

28th September 2021, 14:47

Data Sources in Power BI Desktop

22nd September 2021, 17:09

OpenText Experience Platform

26th August 2021, 14:03

Text Mining Node in SAS Model Studio on SAS Viya

26th August 2021, 14:02

Natural Language Processing: An Introduction

25th August 2021, 11:13

JuliaHub

4th August 2021, 14:02

Data Science Experience | SAS

4th August 2021, 09:01

NumFOCUS: A Nonprofit Supporting Open Code for Better Science

13th July 2021, 18:09

Sudowrite

21st June 2021, 16:21

Top 10 Data Science Projects for Beginners

21st June 2021, 16:20

5 Data Science Open-source Projects To Which You Should Consider Contributing

3rd June 2021, 16:37

SAS analytics platform adds native support for AWS, GCP

28th May 2021, 09:58

SAS Curiosity

27th May 2021, 09:01

SingleStore

27th May 2021, 09:01

Teradata

27th May 2021, 08:59

Apache ORC

16th May 2021, 13:47

Machine Learning Operations

13th May 2021, 09:34

Business Process Model and Notation

13th May 2021, 09:33

BPR4GDPR

12th May 2021, 10:49

SAS User Group UK & Ireland

12th May 2021, 10:47

Top YouTube Machine Learning Channels

12th May 2021, 10:47

Top YouTube Channels for Data Science

10th May 2021, 17:12

Decisions in the Cloud from SAS

10th May 2021, 17:02

SAS Viya

10th May 2021, 17:01

Microsoft Azure and SAS

15th April 2021, 12:48

Apache Arrow

6th December 2020, 20:47

Julia Computing

18th November 2020, 15:12

Conda

9th November 2020, 09:20

Mockaroo

23rd October 2020, 13:53

rOpenSci Packages: Development, Maintenance, and Peer Review

7th October 2020, 09:22

Learning Machines

7th October 2020, 09:21

rOpenSci

18th September 2020, 10:47

Open Source Portal for Clinical Study Evaluations

17th September 2020, 16:11

Visual Define-XML Editor

17th September 2020, 09:15

Business Science University

1st July 2020, 14:10

23 sources of data bias for Machine Learning and Deep Learning

4th March 2020, 20:23

KDnuggets

4th March 2020, 20:22

Kaggle

29th January 2020, 17:41

Smart Submission Dataset Viewer

31st October 2019, 16:54

Open Data Science Conference

31st October 2019, 09:19

Cloudera

31st October 2019, 09:18

JASP

Jamovi

Qlik

Tableau

Scala

MATLAB

23rd October 2019, 22:02

What is eCOA and How Does it Improve Clinical Trial Data Quality?

1st October 2019, 15:08

What is a geometric mean?

8th June 2019, 14:30

How to install R on Windows, Mac OS X and Ubuntu

25th October 2017, 23:43

Revolutions

25th October 2017, 23:42

TIBCO

24th October 2017, 19:21

Nature Reviews Drug Discovery

22nd October 2017, 23:36

Navicat

22nd October 2017, 23:26

Institute of Clinical Research

12th October 2017, 22:42

RStudio

Jupyter

Impala

Amazon Redshift

Hadoop

4th October 2017, 13:51

Data Science Central

Association for Computing Machinery

ImageNet

TensorFlow

WordNet

Alberto Cairo

Visualizing and Understanding Convolutional Networks

Hidden Technical Debt in Machine Learning Systems

Apache Spark

Anaconda

10th March 2017, 16:50

9th June 2016, 18:21

PMDA New Drug Review with Electronic Data

5th March 2015, 02:40

List of ISO 639-1 codes
