14:00, 23rd December 2021
SciML is an open-source ecosystem for scientific machine learning, offering a modular framework that combines differentiable programming with physics-informed AI to solve differential equations, nonlinear systems and inverse problems. Built primarily in Julia, it achieves high performance and scalability through distributed and GPU parallelism, and it can be used from Python and R via diffeqpy and diffeqr.
The ecosystem includes advanced solvers for a wide range of equations, automated model discovery tools and methods for sparsity acceleration and compiler-assisted analysis, enabling efficient simulation and optimisation. It also provides ML-assisted tools for accelerating scientific computations, such as neural differential equations and surrogate models, alongside extensive community resources for collaboration and support. The platform fosters research and development through a large contributor base and a suite of tools for benchmarking and testing new methodologies, aiming to bridge the gap between theoretical advancements and practical applications in scientific computing.
18:33, 10th December 2021
SAS Institute has shared a few COVID resources for data scientists and others, so I have shared links to them here as well:
8 terms you need to understand when assessing COVID-19 data
Vaccine Efficacy, Clinical Trials, and SAS: Part 4 of Biostats in the Time of Coronavirus
16:46, 2nd December 2021
DataKind UK is a charity that supports third-sector organisations in the UK by enhancing their use of data analysis and science to address social challenges. Established in 2013, it connects these organisations with skilled volunteers who provide free, expert assistance to improve decision-making, build capacity and drive innovation. By fostering collaboration between data professionals and charities, voluntary groups and social enterprises, the organisation helps its partners navigate complex issues, leverage insights from data and strengthen their impact. Over the years, it has supported more than 280 organisations through hundreds of projects, contributing thousands of pro bono hours and demonstrating the value of data-driven approaches in addressing societal needs.
16:49, 21st October 2021
Open Neural Network Exchange (ONNX)
ONNX is an open format that allows machine learning models to be used across different frameworks and hardware. It defines a common set of operators and a common file format, which gives tools and runtimes a shared way to exchange models. It supports a wide range of frameworks and accelerators, allowing developers to leverage hardware optimisations while maintaining flexibility in model development. As a community-driven project, ONNX encourages collaboration through contributions, working groups and events such as meetups, along with surveys that gather feedback to guide its ongoing development.
14:47, 28th September 2021
Data Sources in Power BI Desktop
Power BI Desktop supports a wide range of data sources, from traditional databases and spreadsheets to cloud-based services and web resources. Connecting to these sources involves selecting the appropriate protocol and specifying details such as server addresses, URLs, or file paths. For scenarios requiring shared connection settings, PBIDS files offer a way to export and distribute connection configurations. These files use a structured JSON format to define protocols, addresses and optional parameters like connection mode.
Examples include configurations for Azure Analysis Services, SharePoint lists, SQL Server and web data sources. When using PBIDS files, users must ensure compatibility with supported protocols and avoid including encrypted columns or unsupported features. A PBIDS file can be exported from Power BI Desktop or written and edited by hand in a text editor, which allows flexibility in defining connection details. This approach facilitates collaboration and standardisation in data integration workflows.
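As a rough illustration of the JSON structure described above, the following Python sketch builds a PBIDS document for a single connection. The key layout (version, connections, details, protocol, address, mode) and the protocol names "tds" for SQL Server and "http" for web sources follow Microsoft's published examples as best recalled here, so check them against the current documentation before relying on them. The helper name make_pbids and the server, database and URL values are made up for the example.

```python
import json


def make_pbids(protocol, address, mode=None):
    """Build a PBIDS document (as a dict) holding a single connection.

    make_pbids is a hypothetical helper, not part of any Power BI tooling.
    """
    connection = {
        "details": {
            "protocol": protocol,  # e.g. "tds" for SQL Server, "http" for web
            "address": address,    # protocol-specific keys such as server or url
        }
    }
    if mode is not None:
        # Optional connection mode, e.g. "DirectQuery" or "Import"
        connection["mode"] = mode
    return {"version": "0.1", "connections": [connection]}


# Placeholder server, database and URL values, for illustration only.
sql_server = make_pbids(
    "tds",
    {"server": "example-server", "database": "example-db"},
    mode="DirectQuery",
)
web = make_pbids("http", {"url": "https://example.com/data.csv"})

# Serialise to the text that would be saved with a .pbids extension.
pbids_text = json.dumps(sql_server, indent=2)
print(pbids_text)
```

Saving pbids_text to a file with the .pbids extension should give something that can be shared with colleagues, assuming the structure above matches what the installed version of Power BI Desktop expects.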
14:03, 26th August 2021
Text Mining Node in SAS Model Studio on SAS Viya
The Text Mining node in SAS Model Studio on SAS Viya enables users to process unstructured data, such as free-form comments and reviews. It uses singular value decomposition to transform the text into structured, quantitative features, and these features can then be used as inputs for predictive modelling. When multiple variables carry a text role, the node defaults to the one with the greatest length, though users can override this by rejecting unwanted variables in the Data tab. Configurable parsing options include part-of-speech tagging, noun group extraction, entity extraction and term stemming, while a minimum document threshold controls which terms are retained.
From SAS Viya 4 version 2021.1.3 onwards, users can also upload custom lists such as stop lists and start lists. The node generates up to 25 topic-based features, and in demonstrated comparisons, a Decision Tree model that incorporated those features outperformed one that did not, illustrating the potential value of extracting information from unstructured data. SAS Model Studio also supports automated pipeline creation, which detects text-role variables automatically and incorporates the Text Mining node into a full pipeline covering data preparation, model building, hyperparameter tuning and model selection. Results from such pipelines show that the generated features frequently rank highly in variable importance across multiple model types.
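The general idea of turning text into numeric features with singular value decomposition can be sketched in a few lines of Python with NumPy. This is not how SAS implements the node, which also performs parsing, term weighting and topic derivation. It only shows a minimum document threshold followed by a truncated SVD, and the sample documents, MIN_DOCS and N_FEATURES are invented for the example.

```python
import numpy as np

# Toy free-text comments, invented for illustration.
documents = [
    "great service and friendly staff today",
    "slow service and rude staff",
    "friendly staff great food",
    "rude staff slow food",
]

MIN_DOCS = 2    # keep only terms that occur in at least this many documents
N_FEATURES = 2  # number of SVD-based features to keep per document

# Whitespace tokenisation; a real parser would also stem, tag and filter terms.
tokenised = [doc.split() for doc in documents]
vocab = sorted({term for doc in tokenised for term in doc})

# Apply the minimum document threshold.
doc_freq = {term: sum(term in doc for doc in tokenised) for term in vocab}
kept = [term for term in vocab if doc_freq[term] >= MIN_DOCS]

# Document-term count matrix: one row per document, one column per kept term.
counts = np.array(
    [[doc.count(term) for term in kept] for doc in tokenised], dtype=float
)

# Truncated SVD: project each document onto the leading singular directions.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
features = U[:, :N_FEATURES] * s[:N_FEATURES]

print(kept)
print(features.shape)
```

Each row of features is a compact numeric description of one document and could be appended to the other inputs of a predictive model. The term "today" occurs in only one document, so the threshold drops it.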
14:02, 26th August 2021
Natural Language Processing: An Introduction
Natural Language Processing offers tools to extract meaningful insights from unstructured text data, enabling applications across various fields. Techniques such as tokenisation, sentiment analysis and entity recognition allow the transformation of textual information into structured formats that can be integrated with relational databases for predictive modelling and decision-making. In healthcare, for example, analysing patient notes can reveal psychosocial factors influencing treatment outcomes, while in legal contexts, automated summarisation of case documents aids in identifying key details. Beyond these, NLP supports innovations like chatbots and translation services, though its effectiveness relies heavily on the quality of input data. As the field advances, the integration of text analytics with other data sources will become increasingly vital for comprehensive insights, particularly in domains where traditional data alone may not capture the full complexity of a problem.
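To make the step from free text to structured, database-ready features concrete, here is a minimal Python sketch using only the standard library. It tokenises short notes, scores them against a tiny hand-made word list, and stores the results in a relational table that can be joined to conventional data. The notes, the word lists, and the table and column names are all invented for the example, and a lexicon count is only a crude stand-in for real sentiment analysis.

```python
import re
import sqlite3

# Tiny hand-made lexicons, invented for illustration only.
POSITIVE = {"improved", "supportive", "stable"}
NEGATIVE = {"isolated", "anxious", "worse"}


def tokenise(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment(tokens):
    """Count positive words minus negative words."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


# Made-up free-text notes keyed by a made-up patient id.
notes = {
    1: "Patient feels isolated and anxious since moving.",
    2: "Mood improved; family is supportive.",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO patients VALUES (?, ?)", [(1, 71), (2, 64)])
conn.execute(
    "CREATE TABLE note_features "
    "(patient_id INTEGER, n_tokens INTEGER, sentiment INTEGER)"
)

# Turn each note into a structured row.
for patient_id, text in notes.items():
    tokens = tokenise(text)
    conn.execute(
        "INSERT INTO note_features VALUES (?, ?, ?)",
        (patient_id, len(tokens), sentiment(tokens)),
    )

# Join the text-derived features with the conventional structured data.
rows = conn.execute(
    "SELECT p.id, p.age, f.n_tokens, f.sentiment "
    "FROM patients p JOIN note_features f ON f.patient_id = p.id "
    "ORDER BY p.id"
).fetchall()
print(rows)  # [(1, 71, 7, -2), (2, 64, 5, 2)]
```

Once the text-derived columns sit alongside the structured ones, they can feed a predictive model in the same way as any other variable.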
14:02, 4th August 2021
Data Science Experience | SAS
The Data Science Experience highlights real-world applications of data science across industries, showcasing how professionals address complex challenges through innovative solutions. Examples include using integrated tools to enhance customer experiences in banking and developing strategies to ensure model reliability in digital transformation initiatives. These stories illustrate the importance of combining technical expertise with clear objectives to solve problems in healthcare, insurance and public sector contexts. Additional resources such as training programmes, events and cloud-based analytics deployment options are available to support further exploration and skill development in the field.
09:01, 4th August 2021
NumFOCUS: A Nonprofit Supporting Open Code for Better Science
NumFOCUS supports open-source projects used by organisations ranging from major technology companies to research institutions, aiming to address complex challenges through collaborative development. It offers opportunities for community engagement, employment in open-source roles and ways for individuals and organisations to contribute financially or through sponsorship. The organisation also provides resources such as a newsletter, annual reports and a shop whose proceeds support its initiatives, while maintaining a focus on fostering innovation and accessibility in scientific computing through its various programmes and partnerships.
08:59, 27th May 2021
Apache ORC is a columnar storage format designed for Hadoop workloads, offering efficient data handling through features such as ACID transaction support, built-in indexes for rapid data retrieval and compatibility with complex data types including structs, lists and maps. It is maintained under the Apache Software Foundation, a non-profit organisation that oversees open-source projects released under the Apache License and sets their governance and privacy standards. The project provides documentation and tools for integration with frameworks such as Spark, Hive and Hadoop.