AI & Data Science Jottings

15:30, 21st February 2022

Best Data Science Books For Beginners

For those looking to enter the field of data science, a strong grounding in programming, machine learning, probability, statistics and linear algebra is essential, and a range of books exists to help beginners build these skills. Among the most recommended are "Data Science from A-Z" by Benjamin Smith, which offers a clear and balanced introduction to core concepts, and "Data Science for Dummies" by Lillian Pierson (foreword by Jake Porway), which focuses on practical business applications and covers big data frameworks such as Hadoop and Spark.

Cole Nussbaumer Knaflic's "Storytelling with Data" takes a narrative approach to teaching data visualisation, while Joel Grus's "Data Science from Scratch" walks readers through Python, statistics and machine learning from the ground up. Jake VanderPlas covers key Python libraries in the "Python Data Science Handbook", and Wes McKinney's "Python for Data Analysis" is particularly suited to those new to both Python and analytical computing.

For those preferring R, Hadley Wickham's "R for Data Science" provides thorough coverage of the language. Statistical foundations are addressed by Gareth James and co-authors in "An Introduction to Statistical Learning" and by Peter Bruce in "Practical Statistics for Data Scientists", while Sheldon Axler's "Linear Algebra Done Right" and Blitzstein and Hwang's "Introduction to Probability" cover the underlying mathematics. Machine learning is explored practically in books by Andreas Müller and Sarah Guido and by Aurélien Géron, whose hands-on approach uses Scikit-Learn and TensorFlow, and Yves Hilpisch rounds out the selection with a finance-focused application of data science methods using Python.

22:02, 29th January 2022

OpenCPU is an HTTP-based API system for executing R functions and scripts remotely, using standard GET and POST methods to retrieve objects and perform remote procedure calls, respectively. The API is structured around a configurable root path and exposes endpoints for accessing installed R packages, their functions, datasets and documentation, as well as temporary sessions that store the outputs of function or script executions. R objects can be retrieved in a range of formats including JSON, CSV, PDF and PNG, and function arguments can be passed using several content types such as URL-encoded form data, multipart form data or JSON.

Scripts are executed by posting to their file path, with the interpreter determined by file extension, supporting formats including R, LaTeX, knitr and Markdown. A simplified JSON RPC mode is available for cases where only the output data are needed, returning results directly in a single request rather than requiring a follow-up retrieval step. The system also supports static web applications bundled within R packages and offers continuous integration with GitHub, whereby pushing a commit to a repository's master branch can trigger automatic package installation on an OpenCPU server.
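As a rough illustration of the request patterns described above, the Python sketch below builds an object URL and prepares a JSON RPC call using only the standard library. The server address (assumed here to be the public demo server), the default "/ocpu" root path and the helper names object_url, build_rpc_request and call_r are assumptions made for this example rather than anything defined by OpenCPU itself.

```python
import json
from urllib import request

# Assumption: a server reachable at this address with the default "/ocpu" root path.
BASE_URL = "https://cloud.opencpu.org/ocpu"


def object_url(package, kind, name, fmt=None, base=BASE_URL):
    """Build the URL of an object in an installed package.

    kind is "R" for functions and other R objects, or "data" for datasets;
    fmt optionally appends an output format such as "json" or "csv".
    """
    url = f"{base}/library/{package}/{kind}/{name}"
    return f"{url}/{fmt}" if fmt else url


def build_rpc_request(package, function, args, base=BASE_URL):
    """Prepare a POST request that calls an R function in JSON RPC mode."""
    url = object_url(package, "R", function, fmt="json", base=base)
    body = json.dumps(args).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_r(package, function, args, base=BASE_URL, timeout=30):
    """Send the RPC request and decode the JSON result (needs network access)."""
    req = build_rpc_request(package, function, args, base)
    with request.urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a reachable server, call_r("stats", "rnorm", {"n": 3}) would be expected to return three random numbers in a single request, while object_url("MASS", "data", "Boston", "csv") gives the address from which a dataset could be fetched with a plain GET.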

08:54, 24th January 2022

The High-Paying Side Hustles for Data Scientists

The rise of remote working since the COVID-19 pandemic has led many data scientists to explore ways of supplementing their income through side work. Freelancing platforms such as Upwork, Toptal, AngelList and Kolabtree offer varying levels of entry, from open project bidding to elite networks requiring several years of experience.

Technical writing is another viable avenue, whether through blogging on platforms like Medium, contributing articles to publications that offer financial rewards based on readership, or taking on ghostwriting work that, while uncredited, tends to pay at a premium rate. Contract work, covering areas such as machine learning model design, data analysis and research, offers clear terms and flexible hours.

Consultancy, typically charged at an hourly rate, suits those with substantial field experience who can advise companies on data science strategy and investment. Career coaching rounds out the options, with platforms connecting experienced professionals with graduates and jobseekers needing guidance on interviews, networking and career direction. Beyond immediate earnings, these pursuits can broaden professional experience, strengthen a personal brand and contribute meaningfully to long-term career development.

14:00, 23rd December 2021

SciML is an open-source ecosystem designed for scientific machine learning, offering a modular framework that integrates differentiable programming with physics-informed AI to solve complex problems in differential equations, nonlinear systems and inverse problems. Built primarily in Julia, it leverages high performance and scalability through distributed and GPU parallelism, while supporting interoperability with Python and R via tools like diffeqpy and diffeqr.

The ecosystem includes advanced solvers for a wide range of equations, automated model discovery tools and methods for sparsity acceleration and compiler-assisted analysis, enabling efficient simulation and optimisation. It also provides ML-assisted tools for accelerating scientific computations, such as neural differential equations and surrogate models, alongside extensive community resources for collaboration and support. The platform fosters research and development through a large contributor base and a suite of tools for benchmarking and testing new methodologies, aiming to bridge the gap between theoretical advancements and practical applications in scientific computing.
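To give a flavour of the Python interoperability mentioned above, here is a minimal sketch that solves a simple decay equation through diffeqpy. The equation, the function names decay and solve_decay and the default values are illustrative assumptions; running the solver needs diffeqpy installed together with a working Julia setup.

```python
def decay(u, p, t):
    """Right-hand side of du/dt = -0.5 * u (exponential decay)."""
    return -0.5 * u


def solve_decay(u0=1.0, tspan=(0.0, 5.0)):
    """Solve the decay equation with the SciML solvers via diffeqpy."""
    # Imported lazily because diffeqpy depends on a Julia installation,
    # which may not be present where this module is merely imported.
    from diffeqpy import de

    problem = de.ODEProblem(decay, u0, tspan)
    return de.solve(problem)
```

Calling solve_decay() should return a solution object whose t and u fields hold the time points and the computed values.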

18:33, 10th December 2021

SAS Institute has published a few COVID-19 resources for data scientists and others, so I have linked to them here as well:

8 terms you need to understand when assessing COVID-19 data

Vaccine Efficacy, Clinical Trials, and SAS: Part 4 of Biostats in the Time of Coronavirus

What matters now when it comes to COVID-19

16:46, 2nd December 2021

DataKind UK is a charity that supports third-sector organisations in the UK by enhancing their use of data analysis and data science to address social challenges. Established in 2013, it connects these organisations with skilled volunteers who provide free, expert assistance to improve decision-making, build capacity and drive innovation. By fostering collaboration between data professionals and charities, voluntary groups and social enterprises, the organisation helps its partners navigate complex issues, leverage insights from data and strengthen their impact. Over the years, it has supported more than 280 organisations through hundreds of projects, contributing thousands of pro bono hours and demonstrating the value of data-driven approaches in addressing societal needs.

16:49, 21st October 2021

Open Neural Network Exchange (ONNX)

An open format designed to enable machine learning models to be used across various frameworks and hardware, ONNX provides a standardised set of operators and file formats that facilitate compatibility between different tools and runtimes. It supports a wide range of frameworks and accelerators, allowing developers to leverage hardware optimisations while maintaining flexibility in model development. As a community-driven project, ONNX encourages collaboration through contributions, working groups and events such as meetups and surveys aimed at gathering feedback to guide its ongoing development.

14:47, 28th September 2021

Data Sources in Power BI Desktop

Power BI Desktop supports a wide range of data sources, from traditional databases and spreadsheets to cloud-based services and web resources. Connecting to these sources involves selecting the appropriate protocol and specifying details such as server addresses, URLs or file paths.

For scenarios requiring shared connection settings, PBIDS files offer a way to export and distribute connection configurations. These files use a structured JSON format to define protocols, addresses and optional parameters such as connection mode, with supported examples including Azure Analysis Services, SharePoint lists, SQL Server and web data sources.

When using PBIDS files, users must ensure compatibility with supported protocols and avoid including encrypted columns or unsupported features. The files can be created automatically through Power BI Desktop or edited manually in a text editor, offering flexibility in how connection details are defined and maintained. This approach facilitates collaboration and standardisation in data integration workflows.
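For anyone who would rather generate these files than edit them by hand, the following Python sketch assembles the JSON structure described above. The field layout and the protocol names "tds" (SQL Server) and "http" (web) follow my recollection of Microsoft's published examples and should be checked against the current documentation; the server, database and URL values, along with the helper names make_pbids and write_pbids, are hypothetical.

```python
import json


def make_pbids(protocol, address, mode=None):
    """Assemble a PBIDS document holding a single connection."""
    connection = {"details": {"protocol": protocol, "address": address}}
    if mode is not None:
        # Optional connection mode, for example "DirectQuery" or "Import".
        connection["mode"] = mode
    return {"version": "0.1", "connections": [connection]}


def write_pbids(path, document):
    """Save the document as a .pbids file (plain JSON text)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(document, handle, indent=2)


# Hypothetical SQL Server connection using DirectQuery.
sql_server = make_pbids(
    "tds",
    {"server": "example-server", "database": "example-db"},
    mode="DirectQuery",
)

# Hypothetical web data source with no explicit mode.
web_source = make_pbids("http", {"url": "https://example.com/data.csv"})

print(json.dumps(sql_server, indent=2))
```

Calling write_pbids("sales.pbids", sql_server) would save the first definition to a file that could then be shared with colleagues.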

14:03, 26th August 2021

Text Mining Node in SAS Model Studio on SAS Viya

The Text Mining node in SAS Model Studio on SAS Viya enables users to process unstructured data, such as free-form comments and reviews, and transform it into structured, quantitative representations through singular value decomposition, which can then be used as inputs for predictive modelling. When multiple variables carry a text role, the node defaults to the one with the greatest length, though users can override this by rejecting unwanted variables in the Data tab.

Configurable parsing options include part-of-speech tagging, noun group extraction, entity extraction and term stemming, whilst a minimum document threshold controls which terms are retained. From SAS Viya 4 version 2021.1.3 onwards, users can also upload custom lists such as stop lists and start lists.
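The effect of a stop list and a minimum document threshold can be illustrated outside SAS with a small Python sketch. This is not how the Text Mining node is implemented: the tokeniser, the sample reviews and the function names are assumptions made for illustration, and stemming and the later singular value decomposition step are left out.

```python
import re
from collections import Counter


def parse_terms(text, stop_list):
    """Lower-case the text, split it into word tokens and drop stop-list terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_list]


def build_document_term_matrix(documents, stop_list=frozenset(), min_documents=2):
    """Count terms per document, keeping terms found in at least min_documents documents."""
    parsed = [Counter(parse_terms(doc, stop_list)) for doc in documents]
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for counts in parsed for term in counts)
    terms = sorted(term for term, n in doc_freq.items() if n >= min_documents)
    matrix = [[counts[term] for term in terms] for counts in parsed]
    return terms, matrix


reviews = [
    "The delivery was fast and the product works well",
    "Fast delivery but the product stopped working",
    "Product arrived late and the box was damaged",
]

terms, matrix = build_document_term_matrix(
    reviews, stop_list={"the", "and", "was", "but"}, min_documents=2
)
print(terms)
for row in matrix:
    print(row)
```

Each row of the matrix corresponds to one review and each column to a retained term; a matrix of this kind is what a decomposition step would then reduce to a handful of numeric features.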

The node generates up to 25 topic-based features, and in demonstrated comparisons a Decision Tree model incorporating those features outperformed one that did not, illustrating the potential value of extracting information from unstructured data. SAS Model Studio also supports automated pipeline creation, which detects text-role variables automatically and incorporates the Text Mining node into a full pipeline covering data preparation, model building, hyperparameter tuning and model selection. Results show that generated features frequently rank highly in variable importance across multiple model types.

14:02, 26th August 2021

Natural Language Processing: An Introduction

Natural language processing offers tools to extract meaningful insights from unstructured text data, enabling applications across a wide range of fields. Techniques such as tokenisation, sentiment analysis and entity recognition allow textual information to be transformed into structured formats that can be integrated with relational databases for predictive modelling and decision-making.

In healthcare, for example, analysing patient notes can reveal psychosocial factors influencing treatment outcomes. In legal contexts, automated summarisation of case documents aids in identifying key details. The technology also supports innovations such as chatbots and translation services, though its effectiveness relies heavily on the quality of input data.

As the field advances, the integration of text analytics with other data sources will become increasingly important for generating comprehensive insights. This is particularly relevant in domains where traditional data alone may not capture the full complexity of a problem.
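The idea of turning free text into structured rows that sit alongside relational data can be sketched in a few lines of Python using only the standard library. The word lists, sample notes, table name and function names below are invented for illustration, and the lexicon-based score is a deliberately crude stand-in for a real sentiment model.

```python
import re
import sqlite3

# Tiny illustrative lexicons; real sentiment analysis uses far richer resources.
POSITIVE = {"good", "great", "helpful", "improved"}
NEGATIVE = {"bad", "poor", "slow", "worse"}


def tokenise(text):
    """Split text into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_score(tokens):
    """Return the number of positive tokens minus the number of negative tokens."""
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def load_notes(notes):
    """Store one row of derived features per note in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE note_features ("
        "note_id INTEGER PRIMARY KEY, token_count INTEGER, sentiment INTEGER)"
    )
    rows = []
    for note_id, text in notes:
        tokens = tokenise(text)
        rows.append((note_id, len(tokens), sentiment_score(tokens)))
    conn.executemany("INSERT INTO note_features VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


notes = [
    (1, "Great service and helpful staff"),
    (2, "Slow response, poor follow-up"),
]
conn = load_notes(notes)
features = conn.execute(
    "SELECT note_id, token_count, sentiment FROM note_features ORDER BY note_id"
).fetchall()
print(features)
```

Once the features are in a table, they can be joined to other records on note_id and fed into a predictive model like any other column.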
