14:58, 26th October 2022
Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS
Research led by Associate Professor Emma Schymanski at the University of Luxembourg examined how open-access chemistry tools can be used to extract per- and polyfluoroalkyl substances, commonly known as PFAS, from large collections of scientific literature and patent documents. PFAS are a broad group of fluorinated chemical compounds that present significant environmental and health concerns, and the sheer scale of their presence in scientific records poses a considerable challenge for researchers attempting to catalogue them. One major difficulty lies in the inconsistency of how these substances are recorded, as a single compound may appear under dozens of synonyms, multiple database identifiers and various cheminformatics formats, as well as in structural images that require additional processing to interpret.
The research drew on tens of millions of deduplicated documents and hundreds of billions of patent annotations to identify millions of fluorine-containing compounds. Different chemistry toolkits were found to interpret and validate chemical structures in noticeably different ways, meaning that the choice of toolkit directly affects how many PFAS are identified and whether certain structures are accepted or rejected.
Multiple definitions of PFAS also exist, and the breadth of any given definition significantly influences how many compounds qualify. Comparison with established PFAS lists revealed that the number of substances identified through this extraction process far exceeded those currently catalogued, in some cases by a ratio of roughly nine to one. To help researchers navigate this landscape, tools including MetFrag and a dedicated PFAS classification tree hosted on PubChem were highlighted as practical resources for finding and identifying these compounds.
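As a rough illustration of the screening idea, fluorine-containing structures written as SMILES strings can be flagged for perfluoroalkyl fragments with a simple pattern match. This is only a sketch: the compound names and SMILES below are illustrative, and real extraction pipelines rely on full cheminformatics toolkits (such as RDKit or Open Babel with proper SMARTS matching), which, as the research notes, can disagree on validation.

```python
import re

# Crude screen: flag SMILES containing an explicit -C(F)(F)- fragment,
# which covers CF2 chains and CF3 groups written in this notation.
# A real workflow would use toolkit-level SMARTS matching instead.
PERFLUORO = re.compile(r"C\(F\)\(F\)")

def looks_perfluorinated(smiles: str) -> bool:
    """Return True if the SMILES string contains an explicit CF2/CF3 fragment."""
    return bool(PERFLUORO.search(smiles))

# Illustrative examples (PFOA is a well-known PFAS; ethanol is a control).
candidates = {
    "PFOA": "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
    "ethanol": "CCO",
}
hits = [name for name, smi in candidates.items() if looks_perfluorinated(smi)]
```

A pattern this naive would miss aromatic fluorines and alternative SMILES spellings, which is precisely why toolkit choice matters in the study.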
14:57, 26th October 2022
Alliance for Data Science Professionals
The Alliance for Data Science Professionals has established certifications and industry-wide standards aimed at promoting ethical practices in data science, ensuring data are handled responsibly and transparently. These standards, developed through collaboration with stakeholders and volunteers, address challenges such as data breaches, biased algorithms and improper data usage, providing assurance that data are stored securely and analysed rigorously. The initiative includes defining professional competencies for roles like data scientists and analysts, offering certifications to members, and using these criteria to accredit educational programmes that align with the field's evolving needs.
09:25, 24th October 2022
What is data extraction? And how to automate the process
Data extraction involves retrieving usable information from large, unstructured datasets, enabling organisations to make informed decisions based on accurate, timely data. It differs from data mining in focus, with extraction targeting specific, actionable information rather than uncovering broader patterns. Structured data, such as contact details or financial figures, contrasts with unstructured formats like emails or resumes, each requiring tailored approaches for effective processing.
Methods include incremental extraction, which updates data in real time, and full extraction, which captures complete datasets periodically. The ETL (extract, transform, load) process streamlines data preparation for analysis, while tools like Zapier automate workflows by integrating disparate applications. For instance, Zapier’s Formatter tool can parse email content to populate CRM systems, and its Email Parser extracts structured data from unstructured text.
Similarly, tools such as Machete monitor website changes, and CandidateZip automates resume parsing for recruitment purposes. These automation tools reduce manual effort, enhance accuracy and free teams to focus on strategic tasks, demonstrating how modern technologies transform data handling into a seamless, scalable operation.
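The extract-transform-load pattern described above can be sketched in a few lines of plain Python. The email text, field names and regexes here are hypothetical stand-ins for what a tool like Zapier's Email Parser does behind the scenes: pull fields out of unstructured text, normalise them, then load a clean record into a destination such as a CRM.

```python
import re

RAW_EMAIL = """\
From: Jane Doe <jane.doe@example.com>
Subject: Quote request
Hi, please call me on +44 20 7946 0958 about the enterprise plan.
"""

def extract(text: str) -> dict:
    """Extract: pull raw fields out of unstructured text with regexes."""
    name_email = re.search(r"From: (.+) <(.+?)>", text)
    phone = re.search(r"\+?[\d ]{7,}", text)
    return {
        "name": name_email.group(1) if name_email else None,
        "email": name_email.group(2) if name_email else None,
        "phone": phone.group(0).strip() if phone else None,
    }

def transform(record: dict) -> dict:
    """Transform: normalise formats (lower-case email, strip spaces from phone)."""
    return {
        "name": record["name"],
        "email": record["email"].lower() if record["email"] else None,
        "phone": record["phone"].replace(" ", "") if record["phone"] else None,
    }

crm_rows = []  # stand-in for the "load" destination

def load(record: dict) -> None:
    """Load: append the cleaned record to the destination store."""
    crm_rows.append(record)

load(transform(extract(RAW_EMAIL)))
```

Incremental extraction would run this on new emails as they arrive; full extraction would re-process the whole mailbox on a schedule.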
11:57, 12th October 2022
How to reveal new connections in a knowledge graph with link prediction
Knowledge graphs serve as powerful tools for organising complex, interconnected data across various fields, such as biomedicine and healthcare, by representing entities, relationships and attributes in a flexible structure. These graphs often face challenges like incompleteness or missing connections, which can be addressed through link prediction techniques that infer potential relationships using unsupervised machine learning.
By applying methods such as network projection within tools like SAS Viya, analysts can identify missing links in knowledge graphs, such as disease or compound similarities, by analysing existing data patterns. This approach was demonstrated using the Hetionet knowledge graph, where predictions were made by removing specific link types and re-inferring them based on remaining connections, achieving high accuracy in identifying relevant associations. The results highlight the utility of such methods in applications like drug repurposing and improving data curation efficiency, showcasing how analytical workflows can enhance the completeness and usefulness of knowledge graphs in real-world scenarios.
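The core idea of link prediction can be shown without any graph library: score each pair of nodes by the overlap of their neighbourhoods (Jaccard similarity) and propose the highest-scoring unconnected pairs as candidate links. The toy graph below is an illustrative assumption, not the Hetionet data or the SAS Viya workflow.

```python
from itertools import combinations

# Toy bipartite-style graph: diseases mapped to associated genes (hypothetical).
graph = {
    "DiseaseA": {"Gene1", "Gene2", "Gene3"},
    "DiseaseB": {"Gene2", "Gene3", "Gene4"},
    "DiseaseC": {"Gene5"},
}

def jaccard(a: set, b: set) -> float:
    """Neighbourhood overlap: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Score every disease pair by how many gene neighbours they share.
scores = {
    (u, v): jaccard(graph[u], graph[v])
    for u, v in combinations(graph, 2)
}

# The highest-scoring pair is the most plausible missing "similar-to" link.
best_pair = max(scores, key=scores.get)
```

Removing a known link type and checking whether this kind of scoring re-infers it is essentially the evaluation strategy described for the Hetionet experiment.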
11:43, 12th October 2022
Metabase offers an open-source analytics platform designed to enable users to interact with data through natural language queries, AI-assisted tools and visual interfaces, allowing teams to generate insights without requiring extensive technical expertise. It supports integration with over 20 data sources, provides features for creating and sharing dashboards and includes tools for managing permissions, securing data and embedding analytics within applications. The platform is used by many organisations and includes advanced capabilities such as model creation, SQL editing and caching to enhance performance, alongside enterprise-level compliance and security measures. Its flexibility and ease of use make it suitable for startups and larger enterprises seeking to streamline data analysis and reporting processes.
15:09, 28th September 2022
BlueSky Statistics is a data analysis tool designed for professionals in statistics, quality control, engineering and related fields, offering a user-friendly interface available in multiple languages. It integrates advanced statistical methods, machine learning and quality management features, with recent updates including an interactive graph builder that allows dynamic creation of visualisations through drag-and-drop functionality.
The software is used globally by thousands of organisations and individuals, supported by a range of collaboration tools and the ability to export analyses to common formats. Users have transitioned from other platforms due to its comprehensive feature set, ease of use and cost-effectiveness, with testimonials highlighting its suitability for applications ranging from Six Sigma to biostatistics. The company participates in industry events and provides training resources through partnerships, positioning itself as a versatile alternative to proprietary statistical software.
15:01, 28th September 2022
Top ten database attacks
Enterprise databases face numerous serious security threats, many of which stem from poor configuration, inadequate design or insufficient oversight. Misconfigured cloud databases remain a persistent problem, frequently exposing vast quantities of sensitive data due to weak authentication or failure to properly restrict public access. SQL injection continues to be a highly damaging attack vector, exploiting poorly written application code to extract or manipulate database contents, while weak authentication practices such as storing passwords in plain text or failing to implement multifactor authentication leave systems unnecessarily vulnerable.
Privilege abuse and excessive privileges both pose significant internal risks, with users potentially exploiting legitimate access rights beyond their intended scope, particularly when role changes are not properly managed. Inadequate logging and auditing undermine an organisation's ability to detect and investigate suspicious activity, and denial-of-service attacks can render systems unavailable through either network flooding or resource exhaustion. Running unpatched software dramatically increases exposure to known vulnerabilities, and insecure overall system architecture can allow an initial breach to cascade into a much broader compromise. Finally, inadequate backup practices, particularly where backups remain reachable from a compromised environment or are not encrypted, leave organisations dangerously exposed to ransomware and other destructive attacks. Addressing these threats requires a combination of strong technical controls, sound procedural practices and regular risk assessment.
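To make the SQL-injection point concrete, the defence is parameterised queries, which keep attacker-controlled input out of the SQL text entirely. This sketch uses Python's built-in sqlite3 module and a hypothetical users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"  # classic injection payload

# Vulnerable: string interpolation splices the payload into the SQL itself,
# turning the WHERE clause into a tautology that matches every row.
unsafe_sql = f"SELECT role FROM users WHERE name = '{user_input}'"
leaked = conn.execute(unsafe_sql).fetchall()   # admin row leaks

# Safe: the driver binds the value as data, never as SQL, so the payload
# is just an oddly named (nonexistent) user.
safe = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)
).fetchall()                                   # no rows
```

The same placeholder-binding principle applies across database drivers, though the placeholder syntax varies (`?`, `%s`, named parameters).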
13:49, 8th September 2022
Excel Formula Generator
Formula Bot is an AI-powered data analysis platform aimed primarily at marketers and data teams, enabling users to upload, connect and combine data from multiple sources and then query that data using natural-language prompts in any language. The platform can generate interactive charts and graphs, perform data transformation tasks such as cleaning, merging and reshaping datasets, and carry out text analysis functions including sentiment detection, keyword extraction and language translation.
Users can export results to Excel, create formatted reports and schedule recurring analyses to run automatically on a daily, weekly or monthly basis. Additional capabilities include a curated data explorer, embeddable analytics, web scraping, code transparency showing the underlying Python, SQL or R generated for each request, and a knowledge base for improving query accuracy. Security features include end-to-end encryption via AWS infrastructure, row-level access controls and isolated sandbox environments for each session, with a stated commitment to never using customer data for AI model training.
16:54, 17th August 2022
Six tips for better spreadsheets
Nature examines how widely used spreadsheet tools such as Microsoft Excel and Google Sheets are frequently misused, drawing on insights from data scientists and researchers. Stephanie Labou, a data-science librarian at the University of California, San Diego, highlights common pitfalls encountered in practice, including errors arising from manually entered data such as GPS coordinates. The piece offers six practical tips aimed at improving how spreadsheets are structured and used, with the broader goal of encouraging more reliable and reproducible data handling in scientific and research contexts.
17:50, 21st July 2022
Apache Superset
An open-source data exploration and visualisation platform, Apache Superset offers a range of tools for users to interact with data through intuitive interfaces, including a drag-and-drop chart builder, SQL query capabilities and pre-installed visualisations. It supports integration with numerous databases, from traditional systems to modern cloud-native solutions, and provides features such as data caching, customisable dashboards and semantic layers for complex data transformations. Designed to be lightweight and scalable, the platform enables teams to create interactive dashboards, explore datasets and perform detailed analysis using cross-filters and drill-down functionalities. Adopted by numerous organisations, it is used for self-serve analytics, allowing users to generate insights without requiring extensive technical expertise.