Technology Tales

Notes drawn from experiences in consumer and enterprise technology

14:50, 19th November 2022

7 Techniques to Handle Imbalanced Data

Handling imbalanced datasets, common in fields such as fraud detection and intrusion detection, requires careful consideration of evaluation metrics beyond accuracy, as models may otherwise fail to identify rare events effectively. Techniques include resampling methods like under-sampling to reduce the majority class or over-sampling to generate synthetic minority instances, though both approaches have limitations depending on data availability.

Cross-validation folds should be created before any resampling is applied, so that duplicated or synthetic minority samples cannot leak from training into validation data and inflate performance estimates; ensemble methods that combine models trained on multiple resampled datasets can further improve generalisation. Adjusting the ratio of classes during resampling, clustering the majority class to retain representative samples and designing models with cost functions that penalise minority-class errors more heavily are further strategies. These approaches, along with algorithms such as XGBoost that can manage class imbalance through built-in weighting, offer practical solutions to enhance model performance in scenarios where rare events are critical to detect.
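
The simplest of these techniques, random over-sampling, can be sketched in a few lines of plain Python: minority rows are duplicated with replacement until the classes are balanced. This is a naive stand-in for synthetic methods such as SMOTE, since it creates no new information; the data here are invented for illustration.

```python
import random
from collections import Counter

def oversample_minority(X, y, seed=42):
    """Randomly duplicate minority-class rows until all classes are balanced.

    A naive alternative to synthetic over-sampling (e.g. SMOTE): rows are
    resampled with replacement, so no new information is created.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    majority_size = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        if n < majority_size:
            idx = [i for i, lab in enumerate(y) if lab == label]
            extra = rng.choices(idx, k=majority_size - n)
            X_out.extend(X[i] for i in extra)
            y_out.extend(label for _ in extra)
    return X_out, y_out

# A 95:5 imbalance, typical of fraud-style data
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5
X_bal, y_bal = oversample_minority(X, y)
```

Crucially, a function like this should only ever be applied to the training portion of each cross-validation fold, never to the data as a whole.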

22:46, 18th November 2022

15 More Free Machine Learning and Deep Learning Books

Another compilation of 15 freely available eBooks covering machine learning and deep learning has been put together for those looking to build or deepen their knowledge in these areas. The selection spans a wide range of topics and skill levels, from foundational introductions to neural network architectures and the mathematics underpinning machine learning, through to more advanced subjects such as deep learning applied to physical simulations, graph-based data representation and the analysis of predictive models.

Authors include both academic researchers and industry practitioners, with titles from figures such as Ian Goodfellow, Yoshua Bengio and Aaron Courville, as well as Microsoft researchers Li Deng and Dong Yu. Some books take a practical, code-focused approach using tools like Jupyter notebooks and the fastai library, while others lean more heavily into theory, covering topics such as backpropagation, regularisation, natural language processing and reinforcement learning. One title is aimed specifically at those preparing for deep learning job interviews, and another offers an exceptionally thorough grounding in the mathematics relevant to computer science and machine learning.

14:58, 26th October 2022

Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS

Research led by Associate Professor Emma Schymanski at the University of Luxembourg examined how open-access chemistry tools can be used to extract per- and polyfluoroalkyl substances, commonly known as PFAS, from large collections of scientific literature and patent documents. PFAS are a broad group of fluorinated chemical compounds that present significant environmental and health concerns, and the sheer scale of their presence in scientific records poses a considerable challenge for researchers attempting to catalogue them. One major difficulty lies in the inconsistency of how these substances are recorded, as a single compound may appear under dozens of synonyms, multiple database identifiers and various cheminformatics formats, as well as in structural images that require additional processing to interpret.

The research drew on tens of millions of deduplicated documents and hundreds of billions of patent annotations to identify millions of fluorine-containing compounds. Different chemistry toolkits were found to interpret and validate chemical structures in noticeably different ways, meaning that the choice of toolkit directly affects how many PFAS are identified and whether certain structures are accepted or rejected.

Multiple definitions of PFAS also exist, and the breadth of any given definition significantly influences how many compounds qualify. Comparison with established PFAS lists revealed that the number of substances identified through this extraction process far exceeded those currently catalogued, in some cases by a ratio of roughly nine to one. To help researchers navigate this landscape, tools including MetFrag and a dedicated PFAS classification tree hosted on PubChem were highlighted as practical resources for finding and identifying these compounds.
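
The dependence on how "PFAS" is defined can be illustrated with a toy filter. Real pipelines use cheminformatics toolkits such as RDKit or Open Babel to parse structures properly; the heuristic below merely counts fluorine atoms in SMILES strings (outside square brackets, the organic-subset symbol F always denotes fluorine, while two-letter elements such as Fe must appear inside brackets). The compound list is illustrative, not from the study.

```python
import re

def count_fluorines(smiles):
    """Heuristically count fluorine atoms in a SMILES string.

    Bracket atoms such as [F-] or [Fe+2] are checked separately from the
    rest of the string, where a bare F is always fluorine.
    """
    n = 0
    for atom in re.findall(r"\[([^\]]+)\]", smiles):
        if re.match(r"F(?![a-z])", atom):  # F or F-, but not Fe, Fr, Fl
            n += 1
    outside = re.sub(r"\[[^\]]+\]", "", smiles)
    n += len(re.findall(r"F(?![a-z])", outside))
    return n

compounds = {
    "PFOA": "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
    "ethanol": "CCO",
    "fluoromethane": "CF",
}
fluorinated = {name for name, smi in compounds.items() if count_fluorines(smi) > 0}
```

Note that a one-fluorine threshold admits fluoromethane, which most PFAS definitions would exclude; tightening the rule (for example, requiring a fully fluorinated carbon) shrinks the candidate set, which is exactly the definitional sensitivity the study describes.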

14:57, 26th October 2022

Alliance for Data Science Professionals

The Alliance for Data Science Professionals has established certifications and industry-wide standards aimed at promoting ethical practices in data science, ensuring data are handled responsibly and transparently. These standards, developed through collaboration with stakeholders and volunteers, address challenges such as data breaches, biased algorithms and improper data usage, providing assurance that data are stored securely and analysed rigorously. The initiative includes defining professional competencies for roles like data scientists and analysts, offering certifications to members, and using these criteria to accredit educational programmes that align with the field's evolving needs.

09:25, 24th October 2022

What is data extraction? And how to automate the process

Data extraction involves retrieving usable information from large, unstructured datasets, enabling organisations to make informed decisions based on accurate, timely data. It differs from data mining in focus, with extraction targeting specific, actionable information rather than uncovering broader patterns. Structured data, such as contact details or financial figures, contrasts with unstructured formats like emails or resumes, each requiring tailored approaches for effective processing.

Methods include incremental extraction, which captures only new or changed records, often in near real time, and full extraction, which captures complete datasets periodically. The ETL (extract, transform, load) process streamlines data preparation for analysis, while tools like Zapier automate workflows by integrating disparate applications. For instance, Zapier’s Formatter tool can parse email content to populate CRM systems, and its Email Parser extracts structured data from unstructured text.
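
The kind of parsing such tools automate can be sketched with regular expressions: labelled fields are pulled out of unstructured message text and assembled into a record ready for loading into a CRM. The field names and message format below are invented, not Zapier's actual configuration.

```python
import re

# Patterns for fields we expect to find in an enquiry email (illustrative).
FIELD_PATTERNS = {
    "name": re.compile(r"^Name:\s*(.+)$", re.MULTILINE),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "amount": re.compile(r"\$([\d,]+(?:\.\d{2})?)"),
}

def parse_lead_email(body):
    """Return a dict of whichever fields were found in the message."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(body)
        if match:
            # Use the capture group if the pattern has one, else the whole match.
            record[field] = match.group(1) if pattern.groups else match.group(0)
    return record

message = """New enquiry received.
Name: Jane Doe
Contact: jane.doe@example.com
Quoted: $1,250.00
"""
lead = parse_lead_email(message)
```

In a workflow tool, a record like `lead` would then be mapped onto CRM fields as the "load" step.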

Similarly, tools such as Machete monitor website changes and CandidateZip automates resume parsing for recruitment purposes. These automation tools reduce manual effort, enhance accuracy and free teams to focus on strategic tasks, demonstrating how modern technologies transform data handling into a seamless, scalable operation.

11:57, 12th October 2022

How to reveal new connections in a knowledge graph with link prediction

Knowledge graphs serve as powerful tools for organising complex, interconnected data across various fields, such as biomedicine and healthcare, by representing entities, relationships and attributes in a flexible structure. These graphs often face challenges like incompleteness or missing connections, which can be addressed through link prediction techniques that infer potential relationships using unsupervised machine learning.

By applying methods such as network projection within tools like SAS Viya, analysts can identify missing links in knowledge graphs, such as disease or compound similarities, by analysing existing data patterns. This approach was demonstrated using the Hetionet knowledge graph, where predictions were made by removing specific link types and re-inferring them based on remaining connections, achieving high accuracy in identifying relevant associations. The results highlight the utility of such methods in applications like drug repurposing and improving data curation efficiency, showcasing how analytical workflows can enhance the completeness and usefulness of knowledge graphs in real-world scenarios.
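
The underlying idea can be shown at toy scale. The article's workflow used network projection over Hetionet in SAS Viya; the sketch below instead scores absent links with a simple unsupervised heuristic, Jaccard similarity of shared neighbours, on a tiny invented drug/condition graph.

```python
from itertools import combinations

# Each drug node is linked to the conditions it treats (invented data).
graph = {
    "aspirin": {"inflammation", "pain"},
    "ibuprofen": {"inflammation", "pain", "fever"},
    "paracetamol": {"pain", "fever", "headache"},
}

def jaccard(a, b):
    """Shared neighbours over all neighbours: a basic link-prediction score."""
    na, nb = graph[a], graph[b]
    return len(na & nb) / len(na | nb)

# Score every pair of drug nodes; high scores suggest a missing similarity link.
scores = {
    frozenset(pair): round(jaccard(*pair), 3)
    for pair in combinations(graph, 2)
}
best = max(scores, key=scores.get)
```

In a drug-repurposing setting, a high similarity score between two compounds hints that conditions linked to one may be candidate links for the other.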

11:43, 12th October 2022

Metabase offers an open-source analytics platform designed to enable users to interact with data through natural language queries, AI-assisted tools and visual interfaces, allowing teams to generate insights without requiring extensive technical expertise. It supports integration with over 20 data sources, provides features for creating and sharing dashboards and includes tools for managing permissions, securing data and embedding analytics within applications. The platform is used by many organisations and includes advanced capabilities such as model creation, SQL editing and caching to enhance performance, alongside enterprise-level compliance and security measures. Its flexibility and ease of use make it suitable for startups and larger enterprises seeking to streamline data analysis and reporting processes.

15:09, 28th September 2022

BlueSky Statistics is a data analysis tool designed for professionals in statistics, quality control, engineering and related fields, offering a user-friendly interface available in multiple languages. It integrates advanced statistical methods, machine learning and quality management features, with recent updates including an interactive graph builder that allows dynamic creation of visualisations through drag-and-drop functionality.

The software is used globally by thousands of organisations and individuals, supported by a range of collaboration tools and the ability to export analyses to common formats. Users have transitioned from other platforms due to its comprehensive feature set, ease of use and cost-effectiveness, with testimonials highlighting its suitability for applications ranging from Six Sigma to biostatistics. The company participates in industry events and provides training resources through partnerships, positioning itself as a versatile alternative to proprietary statistical software.

15:01, 28th September 2022

Top ten database attacks

Enterprise databases face numerous serious security threats, many of which stem from poor configuration, inadequate design or insufficient oversight. Misconfigured cloud databases remain a persistent problem, frequently exposing vast quantities of sensitive data due to weak authentication or failure to properly restrict public access. SQL injection continues to be a highly damaging attack vector, exploiting poorly written application code to extract or manipulate database contents, while weak authentication practices such as storing passwords in plain form or failing to implement multifactor authentication leave systems unnecessarily vulnerable.

Privilege abuse and excessive privileges both pose significant internal risks, with users potentially exploiting legitimate access rights beyond their intended scope, particularly when role changes are not properly managed. Inadequate logging and auditing undermine an organisation's ability to detect and investigate suspicious activity, and denial-of-service attacks can render systems unavailable through either network flooding or resource exhaustion. Running unpatched software dramatically increases exposure to known vulnerabilities, and insecure overall system architecture can allow an initial breach to cascade into a much broader compromise. Finally, inadequate backup practices, particularly where backups remain reachable from a compromised environment or are not encrypted, leave organisations dangerously exposed to ransomware and other destructive attacks. Addressing these threats requires a combination of strong technical controls, sound procedural practices and regular risk assessment.
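
Of the attacks above, SQL injection is the one most directly preventable in application code: user input must never be interpolated into a SQL string, but passed as a bound parameter instead. The sketch below uses Python's sqlite3 with an illustrative table to show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 1), ('bob', 0)")

user_input = "alice' OR '1'='1"  # a classic injection payload

# Vulnerable: the payload becomes part of the SQL text and matches every row.
vulnerable = conn.execute(
    f"SELECT name FROM users WHERE name = '{user_input}'"
).fetchall()

# Safe: the driver binds the payload as a literal string, so nothing matches.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()
```

The same placeholder discipline applies to every database driver, whatever its placeholder syntax (`?`, `%s` or named parameters).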

13:49, 8th September 2022

Excel Formula Generator

Formula Bot is an AI-powered data analysis platform aimed primarily at marketers and data teams, enabling users to upload, connect and combine data from multiple sources, then query that data with plain-language questions in any spoken language. The platform can generate interactive charts and graphs, perform data transformation tasks such as cleaning, merging and reshaping datasets, and carry out text analysis functions including sentiment detection, keyword extraction and language translation.

Users can export results to Excel, create formatted reports and schedule recurring analyses to run automatically on a daily, weekly or monthly basis. Additional capabilities include a curated data explorer, embeddable analytics, web scraping, code transparency showing the underlying Python, SQL or R generated for each request, and a knowledge base for improving query accuracy. Security features include end-to-end encryption via AWS infrastructure, row-level access controls and isolated sandbox environments for each session, with a stated commitment to never using customer data for AI model training.
