14:58, 26th October 2022
Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS
Research led by Associate Professor Emma Schymanski at the University of Luxembourg examined how open-access chemistry tools can be used to extract per- and polyfluoroalkyl substances, commonly known as PFAS, from large collections of scientific literature and patent documents. PFAS are a broad group of fluorinated chemical compounds that present significant environmental and health concerns, and the sheer scale of their presence in scientific records poses a considerable challenge for researchers attempting to catalogue them. One major difficulty lies in the inconsistency of how these substances are recorded, as a single compound may appear under dozens of synonyms, multiple database identifiers and various cheminformatics formats, as well as in structural images that require additional processing to interpret.
The research drew on tens of millions of deduplicated documents and hundreds of billions of patent annotations to identify millions of fluorine-containing compounds. Different chemistry toolkits were found to interpret and validate chemical structures in noticeably different ways, meaning that the choice of toolkit directly affects how many PFAS are identified and whether certain structures are accepted or rejected.
Multiple definitions of PFAS also exist, and the breadth of any given definition significantly influences how many compounds qualify. Comparison with established PFAS lists revealed that the number of substances identified through this extraction process far exceeded those currently catalogued, in some cases by a ratio of roughly nine to one. To help researchers navigate this landscape, tools including MetFrag and a dedicated PFAS classification tree hosted on PubChem were highlighted as practical resources for finding and identifying these compounds.
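As a rough illustration of the screening idea, fluorine-containing structures written as SMILES strings can be flagged for perfluoroalkyl fragments with a simple pattern match. This is only a sketch: the compound names and SMILES below are illustrative, and real extraction pipelines rely on full cheminformatics toolkits (such as RDKit or Open Babel with proper SMARTS matching), which, as the research notes, can disagree on validation.

```python
import re

# Crude screen: flag SMILES containing an explicit -C(F)(F)- fragment,
# which covers CF2 chains and CF3 groups written in this notation.
# A real workflow would use toolkit-level SMARTS matching instead.
PERFLUORO = re.compile(r"C\(F\)\(F\)")

def looks_perfluorinated(smiles: str) -> bool:
    """Return True if the SMILES string contains an explicit CF2/CF3 fragment."""
    return bool(PERFLUORO.search(smiles))

# Illustrative examples (PFOA is a well-known PFAS; ethanol is a control).
candidates = {
    "PFOA": "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
    "ethanol": "CCO",
}
hits = [name for name, smi in candidates.items() if looks_perfluorinated(smi)]
```

A pattern this naive would miss aromatic fluorines and alternative SMILES spellings, which is precisely why toolkit choice matters in the study.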
14:57, 26th October 2022
Alliance for Data Science Professionals
The Alliance for Data Science Professionals has established certifications and industry-wide standards aimed at promoting ethical practices in data science, ensuring data are handled responsibly and transparently. These standards, developed through collaboration with stakeholders and volunteers, address challenges such as data breaches, biased algorithms and improper data usage, providing assurance that data are stored securely and analysed rigorously. The initiative includes defining professional competencies for roles like data scientists and analysts, offering certifications to members, and using these criteria to accredit educational programmes that align with the field's evolving needs.
09:25, 24th October 2022
What is data extraction? And how to automate the process
Data extraction involves retrieving usable information from large, unstructured datasets, enabling organisations to make informed decisions based on accurate, timely data. It differs from data mining in focus, with extraction targeting specific, actionable information rather than uncovering broader patterns. Structured data, such as contact details or financial figures, contrasts with unstructured formats like emails or resumes, each requiring tailored approaches for effective processing.
Methods include incremental extraction, which updates data in real time, and full extraction, which captures complete datasets periodically. The ETL (extract, transform, load) process streamlines data preparation for analysis, while tools like Zapier automate workflows by integrating disparate applications. For instance, Zapier’s Formatter tool can parse email content to populate CRM systems, and its Email Parser extracts structured data from unstructured text.
Similarly, tools such as Machete monitor website changes, and CandidateZip automates resume parsing for recruitment purposes. These automation tools reduce manual effort, enhance accuracy and free teams to focus on strategic tasks, demonstrating how modern technologies transform data handling into a seamless, scalable operation.
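The extract-transform-load pattern described above can be sketched in a few lines of plain Python. The email text, field names and regexes here are hypothetical stand-ins for what a tool like Zapier's Email Parser does behind the scenes: pull fields out of unstructured text, normalise them, then load a clean record into a destination such as a CRM.

```python
import re

RAW_EMAIL = """\
From: Jane Doe <jane.doe@example.com>
Subject: Quote request
Hi, please call me on +44 20 7946 0958 about the enterprise plan.
"""

def extract(text: str) -> dict:
    """Extract: pull raw fields out of unstructured text with regexes."""
    name_email = re.search(r"From: (.+) <(.+?)>", text)
    phone = re.search(r"\+?[\d ]{7,}", text)
    return {
        "name": name_email.group(1) if name_email else None,
        "email": name_email.group(2) if name_email else None,
        "phone": phone.group(0).strip() if phone else None,
    }

def transform(record: dict) -> dict:
    """Transform: normalise formats (lower-case email, strip spaces from phone)."""
    return {
        "name": record["name"],
        "email": record["email"].lower() if record["email"] else None,
        "phone": record["phone"].replace(" ", "") if record["phone"] else None,
    }

crm_rows = []  # stand-in for the "load" destination

def load(record: dict) -> None:
    """Load: append the cleaned record to the destination store."""
    crm_rows.append(record)

load(transform(extract(RAW_EMAIL)))
```

Incremental extraction would run this on new emails as they arrive; full extraction would re-process the whole mailbox on a schedule.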
11:57, 12th October 2022
How to reveal new connections in a knowledge graph with link prediction
Knowledge graphs serve as powerful tools for organising complex, interconnected data across various fields, such as biomedicine and healthcare, by representing entities, relationships and attributes in a flexible structure. These graphs often face challenges like incompleteness or missing connections, which can be addressed through link prediction techniques that infer potential relationships using unsupervised machine learning.
By applying methods such as network projection within tools like SAS Viya, analysts can identify missing links in knowledge graphs, such as disease or compound similarities, by analysing existing data patterns. This approach was demonstrated using the Hetionet knowledge graph, where predictions were made by removing specific link types and re-inferring them based on remaining connections, achieving high accuracy in identifying relevant associations. The results highlight the utility of such methods in applications like drug repurposing and improving data curation efficiency, showcasing how analytical workflows can enhance the completeness and usefulness of knowledge graphs in real-world scenarios.
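The core idea of link prediction can be shown without any graph library: score each pair of nodes by the overlap of their neighbourhoods (Jaccard similarity) and propose the highest-scoring unconnected pairs as candidate links. The toy graph below is an illustrative assumption, not the Hetionet data or the SAS Viya workflow.

```python
from itertools import combinations

# Toy bipartite-style graph: diseases mapped to associated genes (hypothetical).
graph = {
    "DiseaseA": {"Gene1", "Gene2", "Gene3"},
    "DiseaseB": {"Gene2", "Gene3", "Gene4"},
    "DiseaseC": {"Gene5"},
}

def jaccard(a: set, b: set) -> float:
    """Neighbourhood overlap: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Score every disease pair by how many gene neighbours they share.
scores = {
    (u, v): jaccard(graph[u], graph[v])
    for u, v in combinations(graph, 2)
}

# The highest-scoring pair is the most plausible missing "similar-to" link.
best_pair = max(scores, key=scores.get)
```

Removing a known link type and checking whether this kind of scoring re-infers it is essentially the evaluation strategy described for the Hetionet experiment.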
11:43, 12th October 2022
Metabase offers an open-source analytics platform designed to enable users to interact with data through natural language queries, AI-assisted tools and visual interfaces, allowing teams to generate insights without requiring extensive technical expertise. It supports integration with over 20 data sources, provides features for creating and sharing dashboards and includes tools for managing permissions, securing data and embedding analytics within applications. The platform is used by many organisations and includes advanced capabilities such as model creation, SQL editing and caching to enhance performance, alongside enterprise-level compliance and security measures. Its flexibility and ease of use make it suitable for startups and larger enterprises seeking to streamline data analysis and reporting processes.
15:09, 28th September 2022
BlueSky Statistics is a data analysis tool designed for professionals in statistics, quality control, engineering and related fields, offering a user-friendly interface available in multiple languages. It integrates advanced statistical methods, machine learning and quality management features, with recent updates including an interactive graph builder that allows dynamic creation of visualisations through drag-and-drop functionality.
The software is used globally by thousands of organisations and individuals, supported by a range of collaboration tools and the ability to export analyses to common formats. Users have transitioned from other platforms due to its comprehensive feature set, ease of use and cost-effectiveness, with testimonials highlighting its suitability for applications ranging from Six Sigma to biostatistics. The company participates in industry events and provides training resources through partnerships, positioning itself as a versatile alternative to proprietary statistical software.
15:01, 28th September 2022
Top ten database attacks
Enterprise databases face numerous serious security threats, many of which stem from poor configuration, inadequate design or insufficient oversight. Misconfigured cloud databases remain a persistent problem, frequently exposing vast quantities of sensitive data due to weak authentication or failure to properly restrict public access. SQL injection continues to be a highly damaging attack vector, exploiting poorly written application code to extract or manipulate database contents, while weak authentication practices such as storing passwords in plain text or failing to implement multifactor authentication leave systems unnecessarily vulnerable.
Privilege abuse and excessive privileges both pose significant internal risks, with users potentially exploiting legitimate access rights beyond their intended scope, particularly when role changes are not properly managed. Inadequate logging and auditing undermine an organisation's ability to detect and investigate suspicious activity, and denial-of-service attacks can render systems unavailable through either network flooding or resource exhaustion. Running unpatched software dramatically increases exposure to known vulnerabilities, and insecure overall system architecture can allow an initial breach to cascade into a much broader compromise. Finally, inadequate backup practices, particularly where backups remain reachable from a compromised environment or are not encrypted, leave organisations dangerously exposed to ransomware and other destructive attacks. Addressing these threats requires a combination of strong technical controls, sound procedural practices and regular risk assessment.
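To make the SQL-injection point concrete, the defence is parameterised queries, which keep attacker-controlled input out of the SQL text entirely. This sketch uses Python's built-in sqlite3 module and a hypothetical users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"  # classic injection payload

# Vulnerable: string interpolation splices the payload into the SQL itself,
# turning the WHERE clause into a tautology that matches every row.
unsafe_sql = f"SELECT role FROM users WHERE name = '{user_input}'"
leaked = conn.execute(unsafe_sql).fetchall()   # admin row leaks

# Safe: the driver binds the value as data, never as SQL, so the payload
# is just an oddly named (nonexistent) user.
safe = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)
).fetchall()                                   # no rows
```

The same placeholder-binding principle applies across database drivers, though the placeholder syntax varies (`?`, `%s`, named parameters).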
13:49, 8th September 2022
Excel Formula Generator
Formula Bot is an AI-powered data analysis platform aimed primarily at marketers and data teams, enabling users to upload, connect and combine data from multiple sources and then query that data using natural-language prompts in any language. The platform can generate interactive charts and graphs, perform data transformation tasks such as cleaning, merging and reshaping datasets, and carry out text analysis functions including sentiment detection, keyword extraction and language translation.
Users can export results to Excel, create formatted reports and schedule recurring analyses to run automatically on a daily, weekly or monthly basis. Additional capabilities include a curated data explorer, embeddable analytics, web scraping, code transparency showing the underlying Python, SQL or R generated for each request, and a knowledge base for improving query accuracy. Security features include end-to-end encryption via AWS infrastructure, row-level access controls and isolated sandbox environments for each session, with a stated commitment to never using customer data for AI model training.
16:54, 17th August 2022
Six tips for better spreadsheets
Nature examines how widely used spreadsheet tools such as Microsoft Excel and Google Sheets are frequently misused, drawing on insights from data scientists and researchers. Stephanie Labou, a data-science librarian at the University of California, San Diego, highlights common pitfalls encountered in practice, including errors arising from manually entered data such as GPS coordinates. The piece offers six practical tips aimed at improving how spreadsheets are structured and used, with the broader goal of encouraging more reliable and reproducible data handling in scientific and research contexts.
17:50, 21st July 2022
Apache Superset
An open-source data exploration and visualisation platform, Apache Superset offers a range of tools for users to interact with data through intuitive interfaces, including a drag-and-drop chart builder, SQL query capabilities and pre-installed visualisations. It supports integration with numerous databases, from traditional systems to modern cloud-native solutions, and provides features such as data caching, customisable dashboards and semantic layers for complex data transformations. Designed to be lightweight and scalable, the platform enables teams to create interactive dashboards, explore datasets and perform detailed analysis using cross-filters and drill-down functionalities. Adopted by numerous organisations, it is used for self-serve analytics, allowing users to generate insights without requiring extensive technical expertise.