18:18, 18th December 2023
OWASP Top 10 for Large Language Model Applications
The OWASP GenAI Security Project is a global, open-source initiative that identifies and addresses security risks in generative AI technologies, including large language models and AI-driven applications, and provides guidance and tools for secure development and deployment. Originally launched as the OWASP Top 10 for Large Language Model Applications, the project has grown into a broader effort with over 600 contributors from more than 18 countries and nearly 8,000 community members, offering resources, educational materials and opportunities for collaboration through meetings and community engagement. The initiative remains non-commercial, relying on community support and sponsorships, and continues to update its Top 10 list as a core reference for critical vulnerabilities in LLM applications.
20:26, 31st July 2023
So you want to build your own open source ChatGPT-style chatbot…
Mozilla undertook a week-long hackathon to build an internal, open-source chatbot prototype that runs entirely on its own cloud infrastructure, using no third-party APIs or proprietary services. The team navigated multiple layers of technical decision-making, choosing the llama.cpp runtime over Hugging Face's tools because of time constraints and configuration difficulties, and ultimately selecting Meta's LLaMA 2 model after manually evaluating several options for resistance to bias, toxicity and misinformation.
Model selection was complicated by legal restrictions, as many popular models inherit non-commercial licensing terms from the original LLaMA weights, limiting viable options. The team also implemented a custom vector search and embedding solution using Python to give the chatbot access to a small amount of internal Mozilla knowledge, while carefully crafting a system prompt to align the chatbot's behaviour with Mozilla's values and policies around inclusion and factual accuracy. LangChain was used only minimally for the embedding layer, with most orchestration handled through custom Python code, and an existing internal interface was repurposed as the front end.
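The retrieval pattern described above — embed documents, embed the query, rank by similarity — can be sketched in a few lines of plain Python. This is not Mozilla's actual code; it uses a toy bag-of-words "embedding" and cosine similarity purely to illustrate the shape of a custom vector search, where a real system would use learned dense vectors from an embedding model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use learned dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A miniature "internal knowledge" corpus (illustrative only).
documents = [
    "Mozilla builds the Firefox web browser",
    "LLaMA 2 is an open large language model",
    "Vector search retrieves the most similar documents",
]
index = [(doc, embed(doc)) for doc in documents]

def search(query: str, k: int = 1) -> list[str]:
    # Rank stored documents by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(search("open large language model"))
```

In a production pipeline the retrieved passages would then be injected into the system prompt so the model can answer from them, which is the role LangChain's embedding layer played in Mozilla's build.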
The project concluded that open-source chatbot development remains complex and inconsistent, that evaluating models on trustworthiness criteria beyond raw performance is still too difficult, and that prompt engineering remains critically important, with Mozilla signalling its intention to address these challenges by contributing further to the open-source AI community.
16:42, 31st July 2023
Did ChatGPT write this? Here’s how to tell.
The rise of AI chatbots like ChatGPT has introduced new possibilities for generating information, yet also raised concerns about accuracy and the spread of misinformation. While these tools can produce detailed responses on a wide range of topics, they occasionally provide incorrect or misleading information, complicating efforts to distinguish between human and AI-generated content.
Detection methods, such as tools developed by OpenAI and academic institutions, are available but imperfect, often failing to identify shorter texts or content that has been edited by humans. Experts caution that no detection system is foolproof, as AI models continue to evolve, making it increasingly difficult to reliably identify AI-generated text.
Additionally, verifying the accuracy of ChatGPT's responses requires critical evaluation, including checking for inconsistencies, errors, or contextually inappropriate information. Organisations like Mozilla are addressing these challenges by promoting responsible AI development and emphasising the need for education on the societal and ethical implications of emerging technologies.
19:52, 4th July 2023
Voiceflow is a platform designed to enable the creation and deployment of AI-driven customer experience solutions across multiple channels, offering tools for building, testing and scaling conversational agents. It supports omnichannel integration, allowing businesses to embed AI agents into websites, mobile applications and call centres, with features such as real-time collaboration, customisable workflows and compatibility with major enterprise systems.
The platform is used by organisations to automate customer support, generate leads and improve user engagement, with case studies highlighting rapid implementation and measurable outcomes such as increased automation rates and user satisfaction. Users and industry professionals describe it as a versatile tool that simplifies the development process, facilitates teamwork and provides flexibility through visual design interfaces, API integrations and code editing capabilities, while adhering to enterprise security standards including SOC 2, ISO 27001 and GDPR compliance.
17:13, 14th June 2023
ChatGPT brings AI into popular culture
The rise of ChatGPT has sparked widespread discussion about the potential and challenges of generative AI, with users divided between enthusiasm for its ability to assist with tasks like coding and writing and concern over its reliability, accuracy and potential to undermine professional standards. As a large language model trained on diverse data, it produces responses that often appear plausible but lack transparency in sourcing or explanation, raising issues of intellectual property and trust. While it demonstrates capability in areas such as SAS programming, its outputs can contain errors or overly complex solutions, highlighting the need for human oversight. The technology relies on user feedback to refine its outputs, and its continued evolution may reshape how AI is integrated into professional and educational contexts; its current limitations, however, underscore the importance of critical evaluation by users.
17:02, 14th June 2023
My general advice on getting an analytics job
For those looking to break into or transition to a career in analytics, a few practical approaches are worth considering. Listening to dedicated podcasts on the subject can provide collective insight from a wide range of professionals far more comprehensively than any single piece of advice could. Actively searching job listings and setting up alerts on platforms such as LinkedIn allows aspiring analysts to identify patterns in job titles, locations, industries and required skills, effectively building a picture of their ideal role. Sharing projects, reviews and tutorials online, particularly on LinkedIn, can demonstrate value even without formal professional experience in the field, and the habit becomes easier to maintain over time. Finally, working with a coach who can offer personalised guidance may prove more beneficial than generic training programmes, given that individual circumstances vary considerably.
16:27, 30th November 2022
What is Chebyshev’s Theorem, and How Does it Apply to Data Science?
Chebyshev’s Theorem provides a method to estimate the proportion of data within a certain number of standard deviations from the mean in any dataset, regardless of its distribution. Unlike the Empirical Rule, which applies specifically to normal distributions and states that approximately 68%, 95% and 99.7% of data fall within one, two and three standard deviations respectively, Chebyshev’s Theorem offers a more general guarantee: at least 1 − 1/k² of the data lies within k standard deviations of the mean, for any k greater than one. That gives at least 75% within two standard deviations and at least 88.9% (8/9) within three. This theorem is particularly valuable in data science when dealing with non-normal distributions, as it allows analysts to infer data dispersion and make probabilistic statements about observations even when the underlying distribution is unknown or skewed, complementing statistical measures like the mean and standard deviation.
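The bound above can be checked directly. A minimal sketch in plain Python computes 1 − 1/k² and then verifies empirically, on a deliberately skewed (exponential) sample, that the observed proportion within two standard deviations really does exceed the guaranteed 75%:

```python
import random
import statistics

def chebyshev_lower_bound(k: float) -> float:
    # Minimum proportion of any dataset within k standard deviations of the mean (k > 1).
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k**2

print(chebyshev_lower_bound(2))  # 0.75
print(chebyshev_lower_bound(3))  # 0.888... (8/9)

# Empirical check on a skewed, non-normal sample:
random.seed(0)
data = [random.expovariate(1.0) for _ in range(10_000)]
mu, sigma = statistics.mean(data), statistics.pstdev(data)
within_2sd = sum(abs(x - mu) <= 2 * sigma for x in data) / len(data)
print(within_2sd >= chebyshev_lower_bound(2))  # the guarantee holds for this sample
```

Note that Chebyshev's bound is loose by design: for this exponential sample the observed proportion within two standard deviations is far above 75%, but 75% is the most that can be guaranteed without knowing the distribution.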
18:22, 19th November 2022
The Complete Free PyTorch Course for Deep Learning
A comprehensive free course on PyTorch for deep learning and machine learning is available, offering 25 hours of video content alongside an online book and accompanying resources. Designed by Daniel Bourke, the course covers foundational topics such as PyTorch fundamentals, workflow processes, neural network classification, computer vision techniques and creating custom datasets, providing practical coding examples throughout. Hosted by KDnuggets, the course is aimed at learners seeking to develop proficiency in using PyTorch for deep learning applications, with materials accessible through the platform's resources.
14:50, 19th November 2022
7 Techniques to Handle Imbalanced Data
Handling imbalanced datasets, common in fields such as fraud detection and intrusion detection, requires careful consideration of evaluation metrics beyond accuracy, as models may otherwise fail to identify rare events effectively. Techniques include resampling methods like under-sampling to reduce the majority class or over-sampling to generate synthetic minority instances, though both approaches have limitations depending on data availability.
Cross-validation should be set up before resampling, with resampling applied only to the training folds, so that duplicated or synthetic minority instances do not leak into the validation data and inflate performance estimates. Ensemble methods that combine multiple resampled datasets can improve generalisation. Adjusting the ratio of classes during resampling, clustering the majority class to retain representative samples and designing models with cost functions that penalise misclassifying the minority class more heavily are further strategies. These approaches, along with algorithms such as XGBoost that can manage class imbalance through built-in weighting, offer practical solutions to enhance model performance in scenarios where rare events are critical to detect.
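The simplest of the resampling techniques above, random over-sampling, can be sketched in plain Python. This is a deliberately minimal stand-in for library implementations (such as SMOTE, which synthesises new minority points rather than duplicating existing ones): it just duplicates randomly chosen minority-class rows until every class matches the majority count.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    # Duplicate minority-class rows at random until every class
    # matches the majority-class count. Apply only to training data.
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced data: 8 "normal" rows vs 2 "fraud" rows.
X = [[i] for i in range(10)]
y = ["normal"] * 8 + ["fraud"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 8 rows
```

Consistent with the cross-validation caveat above, this would be applied inside each training fold, never to the held-out data.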
22:46, 18th November 2022
15 More Free Machine Learning and Deep Learning Books
Another compilation of 15 freely available eBooks covering machine learning and deep learning has been put together for those looking to build or deepen their knowledge in these areas. The selection spans a wide range of topics and skill levels, from foundational introductions to neural network architectures and the mathematics underpinning machine learning, through to more advanced subjects such as deep learning applied to physical simulations, graph-based data representation and the analysis of predictive models.
Authors include both academic researchers and industry practitioners, with titles from figures such as Ian Goodfellow, Yoshua Bengio and Aaron Courville, as well as Microsoft researchers Li Deng and Dong Yu. Some books take a practical, code-focused approach using tools like Jupyter notebooks and the fastai library, while others lean more heavily into theory, covering topics such as backpropagation, regularisation, natural language processing and reinforcement learning. One title is aimed specifically at those preparing for deep learning job interviews, and another offers an exceptionally thorough grounding in the mathematics relevant to computer science and machine learning.