15:54, 8th July 2022
FOSS For Spectroscopy
A comprehensive catalogue of free and open-source software for spectroscopy has been compiled, covering techniques including NMR, IR, Raman, ESR/EPR, fluorescence, XRF, LIBS and UV-Vis, though mass spectrometry and MRI software are deliberately excluded. The catalogue currently lists 405 entries, the majority written in R or Python, and draws information directly from the respective project repositories and developer pages.
Each entry is lightly vetted to exclude incomplete or unclear projects, and a status date is provided for each to indicate the most recent repository activity, issue filing or package submission, serving as a rough indicator of whether a project is actively maintained. Related resources are also acknowledged, including a comprehensive survey of R packages suited to metabolomics, a curated collection focused on Raman spectroscopy and a task view covering chemometrics and computational physics.
15:36, 9th May 2022
9 Free Harvard Courses to Learn Data Science
8 Free MIT Courses to Learn Data Science Online
Both Harvard and MIT offer free online data science courses through their respective open learning platforms, covering a progression from beginner programming through to advanced machine learning. The Harvard pathway consists of nine courses, primarily taught in R, that move through programming basics, data visualisation, probability, statistics, data pre-processing, linear regression and machine learning, culminating in a capstone project that brings all the learnt skills together. The MIT pathway takes a similarly structured approach, beginning with an introductory Python programming course before moving into statistics, foundational mathematics including calculus and linear algebra, and finally an intermediate machine learning programme that requires prior knowledge of all the preceding subjects.
While the MIT courses are noted for their depth and relatively fast pace, the Harvard sequence is designed to cover the full data science workflow in a more applied manner, giving learners the opportunity to work with real-world data throughout. Both pathways are available to audit at no cost, though certificates of completion carry a fee. The Harvard courses are hosted on edX, and the MIT courses on MIT OpenCourseWare.
14:32, 28th April 2022
The Book of OHDSI serves as a comprehensive resource detailing the Observational Health Data Sciences and Informatics collaborative, covering its community, data standards and analytical tools. Organised into five sections, it explores topics such as the common data model, standardised vocabularies, data analytics use cases, evidence quality assessment and methodologies for conducting studies within a distributed research network.
Aimed at both newcomers and experienced participants, the book provides theoretical explanations alongside practical guidance on implementing OHDSI initiatives, including cohort definition, population-level estimation and patient-level prediction. It is developed using open-source tools and maintained through continuous community contributions, with updates reflected in the online version to ensure alignment with evolving software and methodologies.
14:31, 28th April 2022
Observational Health Data Sciences and Informatics
The Observational Health Data Sciences and Informatics initiative brings together researchers, healthcare professionals and data scientists globally to enhance healthcare through large-scale analysis of observational health data. Based at Columbia University, the programme develops open-source tools and fosters collaboration across diverse stakeholders to generate real-world evidence that supports informed clinical decisions and improves patient outcomes. It hosts international events such as the 2026 Global Symposium, which aims to showcase innovations and strengthen partnerships in advancing healthcare research. The organisation also provides educational resources, software and a platform for sharing findings, with ongoing efforts to expand its network and impact through community engagement and scientific exchange.
14:54, 27th April 2022
OpenClinica is a modular eClinical platform designed primarily for small to midsize organisations involved in clinical research, including academic institutions, sponsors, contract research organisations and biotech companies. It brings together electronic data capture, electronic consent, patient-reported outcomes, randomisation, EHR integration, analytics and patient recruitment into a single offering, with the stated aim of reducing the time required to launch a study from several months to just a few weeks.
Each client is assigned a dedicated Customer Success Manager rather than being directed to a general support queue, and the platform also provides around-the-clock application support for most of the working week, alongside an on-demand training system. Organisations can either configure studies themselves using drag-and-drop tools and pre-built templates, or engage the platform's professional services team to handle the build on their behalf.
The platform claims to reduce data queries by around half and to deliver patient recruitment at significantly lower cost per conversion than traditional approaches. Having reportedly supported more than 15,000 studies and three million patients globally, it positions itself as a practical middle ground between overly complex enterprise systems and tools that lack the necessary rigour for regulated clinical research.
14:54, 27th April 2022
CDISC Open-Source Alliance
The CDISC Open-Source Alliance maintains a directory of repositories that have been officially recognised as open-source projects aimed at implementing or developing CDISC standards, with the goal of fostering innovation within the CDISC community. Each project must satisfy specific inclusion criteria before being listed in the directory, and smaller projects that emerge from hackathons are catalogued separately in a dedicated hackathons panel.
15:49, 21st March 2022
Download, Tidy and Visualize Covid-19 Related Data
The Mathematics and Statistics of Infectious Disease Outbreaks
The tidycovid19 R package, created by economist Joachim Gassen, aggregates and tidies COVID-19 related data from multiple authoritative sources to support research into the pandemic, with a particular focus on non-pharmaceutical interventions. Data are drawn from organisations including Johns Hopkins University, the European Centre for Disease Prevention and Control, Our World in Data, the World Bank, ACAPS and Oxford University, as well as from the Google and Apple mobility reports, all accessible through dedicated download functions. The package also includes visualisation tools for plotting the spread of the virus, generating stripe-based country comparisons and mapping global or regional trends, as well as a Shiny app for interactive exploration.
A separate but related GitHub repository hosts materials for the MT3002 summer 2020 course on the mathematics and statistics of infectious disease outbreaks, delivered at Stockholm University by Tom Britton and Michael Höhle. The course covers topics such as epidemic modelling, reproduction numbers, vaccination, outbreak detection and COVID-19-specific analyses through video lectures, slides and accompanying R code.
15:30, 21st February 2022
Best Data Science Books For Beginners
For those looking to enter the field of data science, a strong grounding in programming, machine learning, probability, statistics and linear algebra is essential, and a range of books exists to help beginners build these skills. Among the most recommended are "Data Science from A-Z" by Benjamin Smith, which offers a clear and balanced introduction to core concepts, and "Data Science for Dummies" by Lillian Pierson and Jake Porway, which focuses on practical business applications and covers big data frameworks such as Hadoop and Spark.
Cole Nussbaumer Knaflic's "Storytelling with Data" takes a narrative approach to teaching data visualisation, while Joel Grus's "Data Science from Scratch" walks readers through Python, statistics and machine learning from the ground up. Jake VanderPlas covers key Python libraries in the "Python Data Science Handbook", and Wes McKinney's "Python for Data Analysis" is particularly suited to those new to both Python and analytical computing.
For those preferring R, Hadley Wickham's "R for Data Science" provides thorough coverage of the language. Statistical foundations are addressed by Gareth James and co-authors in "An Introduction to Statistical Learning" and by Peter Bruce in "Practical Statistics for Data Scientists", while Sheldon Axler's "Linear Algebra Done Right" and Blitzstein and Hwang's "Introduction to Probability" cover the underlying mathematics. Machine learning is explored practically in books by Andreas Müller and Sarah Guido and by Aurélien Géron, whose hands-on approach uses Scikit-Learn and TensorFlow, and Yves Hilpisch rounds out the selection with a finance-focused application of data science methods using Python.
22:02, 29th January 2022
OpenCPU is an HTTP-based API system for executing R functions and scripts remotely, using standard GET and POST methods to retrieve objects and perform remote procedure calls, respectively. The API is structured around a configurable root path and exposes endpoints for accessing installed R packages, their functions, datasets and documentation, as well as temporary sessions that store the outputs of function or script executions. R objects can be retrieved in a range of formats including JSON, CSV, PDF and PNG, and function arguments can be passed using several content types such as URL-encoded form data, multipart form data or JSON.
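The two-step pattern described above (POST to call a function, then GET to fetch the stored result) can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: it assumes the public demo server at cloud.opencpu.org with the default /ocpu root path, and it assumes that the POST response lists the session paths one per line, with the return value stored under a path ending in /R/.val. The helper names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SERVER = "https://cloud.opencpu.org"  # assumption: public demo server
ROOT = "/ocpu"                        # default root path; configurable per server


def function_url(package: str, function: str) -> str:
    """Endpoint of a function in an installed R package."""
    return f"{SERVER}{ROOT}/library/{package}/R/{function}"


def encode_args(**args) -> bytes:
    """Encode function arguments as URL-encoded form data."""
    return urlencode(args).encode("ascii")


def value_path(listing: str) -> str:
    """Pick the path of the return value out of a session listing."""
    for line in listing.splitlines():
        if line.strip().endswith("/R/.val"):
            return line.strip()
    raise ValueError("no return value found in session listing")


def call_function(package: str, function: str, fmt: str = "json", **args) -> str:
    """POST to run the function, then GET its stored result in the given format."""
    request = Request(function_url(package, function),
                      data=encode_args(**args), method="POST")
    with urlopen(request) as response:          # step 1: remote procedure call
        listing = response.read().decode("utf-8")
    with urlopen(f"{SERVER}{value_path(listing)}/{fmt}") as response:  # step 2
        return response.read().decode("utf-8")


# Example (needs network access):
# print(call_function("stats", "rnorm", n=3, mean=5))
```

Changing fmt to "csv" would request the same stored object in a different format, which is the main reason for keeping the call and the retrieval as separate steps.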
Scripts are executed by posting to their file path, with the interpreter determined by file extension, supporting formats including R, LaTeX, knitr and Markdown. A simplified JSON RPC mode is available for cases where only the output data are needed, returning results directly in a single request rather than requiring a follow-up retrieval step. The system also supports static web applications bundled within R packages and offers continuous integration with GitHub, whereby pushing a commit to a repository's master branch can trigger automatic package installation on an OpenCPU server.
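The simplified JSON RPC mode mentioned above can be sketched as a single request: the function arguments are sent as a JSON body, and /json is appended to the function's URL so that the result comes back directly. As before, this is an illustrative sketch rather than an official client; the server address and helper names are assumptions, and the /ocpu root path is the default rather than a fixed value.

```python
import json
from urllib.request import Request, urlopen


def build_rpc_request(server: str, package: str, function: str, args: dict) -> Request:
    """Build a one-step JSON RPC request for a function in an R package."""
    url = f"{server}/ocpu/library/{package}/R/{function}/json"
    body = json.dumps(args).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


def call_rpc(request: Request):
    """Send the request and decode the JSON result (needs network access)."""
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (needs network access):
# request = build_rpc_request("https://cloud.opencpu.org", "stats", "rnorm", {"n": 3})
# print(call_rpc(request))
```

Separating request construction from sending keeps the URL and payload easy to inspect without contacting a server.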
08:54, 24th January 2022
The High-Paying Side Hustles for Data Scientists
The rise of remote working since the COVID-19 pandemic has led many data scientists to explore ways of supplementing their income through side work. Freelancing platforms such as Upwork, Toptal, AngelList and Kolabtree offer varying levels of entry, from open project bidding to elite networks requiring several years of experience. Technical writing is another viable avenue, whether through blogging on platforms like Medium, contributing articles to publications that offer financial rewards based on readership, or taking on ghostwriting work that, while uncredited, tends to pay at a premium rate. Contract work, covering areas such as machine learning model design, data analysis and research, offers clear terms and flexible hours, while consultancy, typically charged at an hourly rate, suits those with substantial field experience who can advise companies on data science strategy and investment. Career coaching rounds out the options, with platforms connecting experienced professionals with graduates and jobseekers needing guidance on interviews, networking and career direction. Beyond immediate earnings, these pursuits can broaden professional experience, strengthen a personal brand and contribute meaningfully to long-term career development.