Docling, MarkItDown and Textract: Understanding the New Document-Processing Landscape
The growing use of large language models has changed the way many organisations think about documents. Reports, manuals, protocols, spreadsheets, presentations and scanned PDFs are no longer just files to be opened by a person; they are also sources of knowledge that can feed search systems, retrieval-augmented generation workflows and internal knowledge bases.
This shift has brought renewed attention to a practical problem that long predates the current AI boom. Documents are often rich in structure, yet many extraction tools reduce them to plain text, and once headings, tables, figures, captions and reading order are discarded, the resulting output can become difficult for humans to review and even harder for an AI system to use reliably.
Docling, MarkItDown and Textract all sit in this space, but they approach the problem from different directions. Textract is rooted in general-purpose text extraction, MarkItDown focuses on producing Markdown for text-analysis workflows and Docling aims to build a richer understanding of document structure.
Why Document Conversion Matters for AI Systems
A PDF, Word document or spreadsheet may look orderly on screen, yet that order is not always easy to recover programmatically. A human reader can see that a line of text is a heading, that a table belongs to a section or that a caption describes a nearby figure, while a simple text extractor may see only a stream of characters.
That difference matters when documents are used with large language models. If a technical manual, financial report or policy document is flattened into undifferentiated text, a search system may retrieve the wrong passage or miss the relationship between a table and its surrounding explanation. Retrieval-augmented generation depends not only on having the right words in an index, but also on preserving enough context for those words to remain meaningful.
Markdown and structured JSON have therefore become important intermediate formats. Markdown is close to plain text, but it can still represent headings, tables, links and lists in a compact way. JSON can go further by encoding document hierarchy, page-level information and other metadata for downstream processing.
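To make the point about context concrete, here is a minimal sketch of heading-aware chunking using only the Python standard library. The names Chunk and chunk_markdown are hypothetical and come from no particular tool. The sketch assumes ATX-style headings (lines starting with #), ignores the possibility of # characters inside fenced code, and drops sections that have a heading but no body.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Matches ATX-style Markdown headings such as "# Title" or "### Sub-section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    """Hypothetical container: a block of text plus the headings above it."""
    heading_path: List[str]
    lines: List[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def chunk_markdown(markdown: str) -> List[Chunk]:
    """Split Markdown into one chunk per heading, keeping the heading trail."""
    chunks: List[Chunk] = []
    path: List[str] = []  # current stack of headings, outermost first
    current = Chunk(heading_path=[])
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the previous chunk if it collected any body text.
            if current.text:
                chunks.append(current)
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then add the new one.
            path = path[: level - 1] + [match.group(2)]
            current = Chunk(heading_path=list(path))
        else:
            current.lines.append(line)
    if current.text:
        chunks.append(current)
    return chunks
```

Each chunk carries its heading trail, so a retrieval system can index a passage together with the section it came from instead of as an isolated run of words.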
Docling and Document Understanding
Docling is an open-source Python toolkit designed to convert and understand documents for AI-oriented workflows. It was initially developed by IBM's AI for Knowledge team at IBM Research Zurich, open-sourced in July 2024 and is now hosted under the LF AI & Data Foundation (part of the Linux Foundation), following IBM's formal contribution of the project to the foundation on 29th April 2025. Its purpose is not merely to extract text, but to preserve structure and meaning in a form that can be used by search systems, knowledge extraction tools and retrieval-augmented generation pipelines.
A simple way to describe Docling is to place it between raw documents and a language model. Instead of treating a document as a block of text, it attempts to identify headings, document hierarchy, tables, figures, captions, formulas, code blocks, reading order and page layout information. This is particularly important for PDFs, where the visual appearance of a page can hide a complicated underlying structure.
Docling supports parsing for PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3 and image formats, as well as several application-specific XML schemas including USPTO patents, JATS articles and XBRL financial reports, making it relevant well beyond ordinary office files.
The project's outputs are also designed for modern AI pipelines. Docling can export to Markdown, HTML, WebVTT, DocTags and lossless JSON, while its internal DoclingDocument representation provides a unified model of the parsed material. This gives developers a way to move from a PDF or other source file to a structured representation that is easier to chunk, index and query.
Its PDF capabilities are among the main reasons it has drawn attention. The documentation highlights advanced PDF understanding, including page layout, reading order, table structure, code, formulas and image classification. It also includes extensive OCR support for scanned PDFs and images, support for several visual language models under the GraniteDocling name and audio support using automatic speech recognition models.
Docling is also designed to fit into the wider generative AI ecosystem. Its integrations include LangChain, LlamaIndex, CrewAI and Haystack for agentic AI, and it supports local execution for sensitive data and air-gapped environments. The ability to run entirely on local hardware is important where sending files to an external service is not acceptable.
A typical installation begins with pip install docling. A simple conversion in Python uses DocumentConverter from docling.document_converter, then calls converter.convert("study_report.pdf") and exports the result with result.document.export_to_markdown(). The resulting Markdown can then be used in a search index, vector database or language-model workflow.
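The conversion described above can be sketched in a few lines of Python. The wrapper function pdf_to_markdown and its optional converter argument are assumptions added for illustration, mainly so the function can be exercised without Docling installed. Only DocumentConverter, convert() and export_to_markdown() come from the description above.

```python
from typing import Any, Optional


def pdf_to_markdown(path: str, converter: Optional[Any] = None) -> str:
    """Convert a document to Markdown with Docling (hypothetical wrapper)."""
    if converter is None:
        # Imported lazily so this module loads even where Docling is absent.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    result = converter.convert(path)
    return result.document.export_to_markdown()


# Example call, assuming Docling is installed and the file exists:
#     markdown = pdf_to_markdown("study_report.pdf")
```

Reusing one converter instance across many files, by passing it in, is likely to be cheaper than creating a new one per document, though that is worth checking against the Docling documentation for the version in use.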
Textract and Traditional Text Extraction
Textract, in this context, refers to the Python package named textract and not Amazon Textract, which is a separate, cloud-hosted service. Its philosophy is much simpler than Docling's: give it a file and receive extracted text. The package provides both a command-line interface (for example, textract path/to/file.extension) and a Python interface using textract.process("path/to/file.extension").
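As a rough sketch of the Python interface, the snippet below wraps textract.process() and decodes its result. The wrapper extract_text and its process argument are hypothetical and exist only to keep the example testable without the package installed. It assumes the library returns UTF-8 encoded bytes, which is its default behaviour.

```python
from typing import Callable, Optional


def extract_text(path: str, process: Optional[Callable[[str], bytes]] = None) -> str:
    """Return the plain text of a file using the textract package."""
    if process is None:
        # Imported lazily; the package also needs system tools for some formats.
        import textract

        process = textract.process
    raw = process(path)  # bytes, not str
    # Replace undecodable bytes rather than failing on a stray character.
    return raw.decode("utf-8", errors="replace")


# Example call, assuming textract and its dependencies are installed:
#     text = extract_text("path/to/file.extension")
```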
The Textract documentation describes the problem it addresses as one of recovering useful information from "dark data" embedded in Word documents, PowerPoint presentations, PDFs and other files. It does so by providing a single interface across many formats, which has long been useful for natural language processing and textual analysis. Its design is method-agnostic, meaning it wraps different tools and libraries depending on the file type.
Textract supports many formats through a variety of underlying systems. Its documentation mentions CSV, TSV, DOC, DOCX, EML, EPUB, GIF, JPG, JSON, HTML, MP3, MSG, ODT, OGG, PDF, PNG, PPTX, PostScript, RTF, TIFF, TXT, WAV, XLSX and XLS. DOC files may be processed through antiword, DOCX through python-docx2txt, images through tesseract-ocr, PDFs through pdftotext by default (or pdfminer.six) and audio through tools such as sox, SpeechRecognition and pocketsphinx.
This broad format support is useful, but Textract was not designed specifically for the current generation of RAG and LLM systems. Its goal is text extraction rather than document understanding, meaning it may retrieve text effectively from many sources, but it does not build a rich document model and has limited awareness of tables, layout, semantic sections or chunking for retrieval.
That distinction is central when comparing Textract with newer tools. Textract answers the question of what text is present in a file. Docling tries to answer what the document means structurally, which is a more demanding task when tables, appendices, captions and section hierarchies carry much of the meaning.
MarkItDown and the Rise of Markdown-First Conversion
MarkItDown is an open-source Microsoft tool for converting files and office documents to Markdown. It is published under an MIT licence and its stated purpose is to provide a lightweight Python utility for converting various files to Markdown for indexing, text analysis and related purposes. It is therefore closer to Textract than to a full document-understanding framework, but it places much more emphasis on preserving useful structure.
The tool's basic command-line use is straightforward. A document can be converted with a command such as markitdown path-to-file.pdf > document.md, or an output file can be specified with -o. In Python, the usual pattern is to import MarkItDown, create an instance and call md.convert("test.xlsx"), then read the result through result.text_content.
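The Python pattern can be sketched as follows, mirroring the command-line -o option by writing the Markdown to a file. The function convert_to_md_file and its parameters are hypothetical. Only MarkItDown, convert() and text_content come from the description above.

```python
from pathlib import Path
from typing import Any, Optional, Union


def convert_to_md_file(
    source: Union[str, Path],
    output: Optional[Union[str, Path]] = None,
    converter: Optional[Any] = None,
) -> Path:
    """Convert a file with MarkItDown and save the Markdown next to it."""
    if converter is None:
        # Imported lazily so this module loads even where MarkItDown is absent.
        from markitdown import MarkItDown

        converter = MarkItDown()
    result = converter.convert(str(source))
    # Default to the source name with a .md suffix, e.g. test.xlsx -> test.md.
    target = Path(output) if output else Path(source).with_suffix(".md")
    target.write_text(result.text_content, encoding="utf-8")
    return target


# Example call, assuming MarkItDown is installed with the relevant extras:
#     convert_to_md_file("test.xlsx")
```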
MarkItDown supports PDF, PowerPoint, Word, Excel, images (with EXIF metadata and OCR), audio (with EXIF metadata and speech transcription), HTML and various text-based formats such as CSV, JSON and XML. The exact capabilities depend on optional dependencies, which can be installed all at once with pip install 'markitdown[all]' or more selectively, such as pip install 'markitdown[pdf, docx, pptx]'.
The reason for focusing on Markdown is practical. Markdown is close to plain text, while still allowing structure to be represented through headings, tables, links and other simple conventions. The MarkItDown documentation notes that mainstream large language models such as GPT-4o appear to understand Markdown well and often produce it unprompted, while Markdown conventions are also token-efficient.
MarkItDown is not intended as a high-fidelity document conversion tool for human publishing, and the project documentation makes clear that it is mainly intended for consumption by text-analysis tools. This makes it well suited to smaller knowledge-management projects, indexing workflows and pipelines where clean Markdown is more valuable than visual reproduction.
The project also supports optional integrations and extensions. Plugins are supported but disabled by default, and the documentation points developers to third-party plugins using the #markitdown-plugin tag. The markitdown-ocr plugin adds OCR support to PDF, DOCX, PPTX and XLSX converters by using LLM Vision through the same llm_client and llm_model pattern already used for image descriptions, while falling back to the standard converter if no client is provided.
MarkItDown also integrates with Microsoft cloud services for more advanced cases. Azure Document Intelligence can be used for conversion by providing an endpoint, while Azure Content Understanding offers higher-quality cloud extraction, structured field extraction through YAML front matter and multimodal support for documents, images, audio and video. The documentation notes that these cloud routes involve billable Azure API calls, so users can restrict which file types are sent through Content Understanding.
Its security guidance is also worth noting. MarkItDown performs I/O with the privileges of the current process, similar to open() or requests.get(). The project advises sanitising untrusted inputs and using the narrowest conversion function suitable for the task, such as convert_local(), convert_stream() or convert_response(), rather than the more permissive convert() when tighter control is needed.
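One way to apply that advice is sketched below: a file name from an untrusted source is resolved against a fixed base directory, rejected if it escapes that directory, and only then passed to convert_local(). The helpers resolve_inside and convert_untrusted_name are hypothetical and not part of MarkItDown. The containment check is a common pattern rather than anything the project prescribes, and it requires Python 3.9 or later.

```python
from pathlib import Path
from typing import Any, Optional, Union


def resolve_inside(base_dir: Union[str, Path], candidate: str) -> Path:
    """Resolve candidate under base_dir, refusing paths that escape it."""
    base = Path(base_dir).resolve()
    target = (base / candidate).resolve()
    # is_relative_to() needs Python 3.9+; it catches "../" and absolute paths.
    if not target.is_relative_to(base):
        raise ValueError(f"{candidate!r} is outside the allowed directory")
    return target


def convert_untrusted_name(
    base_dir: Union[str, Path], filename: str, converter: Optional[Any] = None
) -> str:
    """Convert a local file only, never a URL or other remote resource."""
    target = resolve_inside(base_dir, filename)
    if converter is None:
        from markitdown import MarkItDown

        converter = MarkItDown()
    # convert_local() is narrower than convert(), which accepts other inputs.
    return converter.convert_local(str(target)).text_content
```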
Comparing the Three Approaches
The simplest comparison is based on the main question each tool is designed to answer. Textract asks what text is in a file, MarkItDown asks how to turn a file into useful Markdown, and Docling asks how to represent the structure of the document itself. These differences lead to different strengths, even where the tools appear to support similar file types.
For a quick extraction task, Textract remains a practical option. If the requirement is simply to read a DOCX file or pull text from a straightforward PDF, its single API can be convenient, particularly where plain text is enough and the downstream process does not require reliable headings, tables or layout.
MarkItDown occupies a middle ground. It is lightweight, actively maintained by Microsoft and designed with LLM workflows in mind. It can produce Markdown that preserves more structure than plain text, making it useful for search, summarisation and note-taking systems without requiring the heavier processing associated with a document-understanding framework.
Docling is the strongest fit when the structure of a document is central to its meaning. Complex PDFs, detailed tables, heavily formatted reports, figure captions, formulas and multi-level document hierarchies are exactly the kinds of material that can lose meaning when converted to plain text, and Docling's richer document representation and JSON export make it especially relevant for more demanding AI pipelines.
There is a cost to that additional capability, since Docling is a heavier and more complex tool than Textract. MarkItDown may be easier to adopt for smaller projects where Markdown output is the main requirement, while Docling becomes more attractive when accuracy of structure matters more than simplicity.
Where Document Structure Carries the Meaning
Some document types illustrate more clearly than others why structure matters as much as content. Legal contracts, engineering specifications, academic papers, financial filings and policy documents are all examples of material where meaning is often distributed across headings, tables, footnotes, appendices and cross-references. A plain-text extraction from any of these can be difficult to use reliably because it may blur the boundary between sections and collapse the relationship between a table and the text that explains it.
A Markdown conversion may preserve enough structure for many search and summarisation tasks. A richer representation such as Docling's can be more suitable when table structure, reading order and document hierarchy need to be retained for reliable retrieval or knowledge-base construction.
This does not mean that one tool replaces all others. A lightweight converter may be preferable for routine ingestion of simple office files, while a structured parser may be selected for scanned PDFs, complex reports or multisection reference documents. The appropriate choice depends on the nature of the source material and the level of structure required downstream.
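The routing idea can be illustrated with a small dispatcher. Everything in this sketch is hypothetical, including the DocumentProfile fields and the order of the rules. It restates the comparison above as code and is not a recommendation from any of the three projects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentProfile:
    """Hypothetical description of a source file and what is needed from it."""
    suffix: str                       # e.g. ".pdf" or ".docx"
    scanned: bool = False             # needs OCR and layout analysis
    structure_critical: bool = False  # tables, hierarchy or reading order matter
    plain_text_enough: bool = False   # downstream only needs raw text


def choose_converter(profile: DocumentProfile) -> str:
    """Return the name of the tool that best matches the profile."""
    # Structure-heavy or scanned material goes to the structured parser.
    if profile.scanned or profile.structure_critical:
        return "docling"
    # If raw text suffices, the simplest extractor is adequate.
    if profile.plain_text_enough:
        return "textract"
    # Otherwise prefer lightweight Markdown conversion.
    return "markitdown"
```

In practice the flags would have to come from somewhere, such as file metadata or a manual triage step, and the thresholds would depend on the material being ingested.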
Choosing Between Docling, MarkItDown and Textract
The current landscape reflects a broader shift from text extraction towards document understanding. Textract represents an older but still useful model, where the priority is to get plain text from many file types through a consistent interface. MarkItDown reflects the needs of LLM-era workflows by turning varied content into Markdown that is compact, readable and easier for language models to process.
Docling goes further by treating documents as structured objects rather than text containers. Its support for layout analysis, OCR, tables, reading order, formulas, figures, audio and specialised schemas makes it a more ambitious option for complex pipelines, and its ability to run locally also matters where privacy, security or regulatory constraints limit the use of external services.
For general users, the choice ultimately follows the complexity of the source material and what is expected of the output. Textract handles straightforward extraction well enough, while MarkItDown adds lightweight structure without much overhead, and Docling is the right tool when the document's own organisation needs to survive the conversion intact.
Online R programming books that are worth bookmarking
As part of making content more useful following its reorganisation, numerous articles on the R statistical computing language have appeared here. All of those have taken a more narrative form. With this collation of online books on the R language, I take a different approach. What you find below is a collection of links with associated descriptions. While narrative accounts can be very useful, there is something handy about running one's eye down a compilation as well. Many entries have a corresponding print edition, some of which are not cheap to buy, which makes me wonder about the economics of posting the content online as well, though it can help with getting feedback during book preparation.
We start with this comprehensive collection of over 400 free and affordable resources related to the R programming language, organised into categories such as data science, statistics, machine learning and specific fields like economics and life sciences. In many ways, it is a superset of what you find below and complements this collection with many other finds. The fact that it is a living collection makes it even more useful.
R Programming for Data Science
Here is an introduction to the R programming language, focusing on its application in data science. It covers foundational topics such as installation, data manipulation, function writing, debugging and code optimisation, alongside advanced concepts like parallel computation and data analysis case studies. The text includes practical guidance on handling data structures, using packages such as {dplyr} and {readr} as well as working with dates, times and regular expressions. Additional sections address control structures, scoping rules and profiling techniques, while the author also discusses resources for staying updated through a podcast and accessing e-book versions for ongoing revisions.
Designed for individuals with no prior coding experience, the book provides an introduction to programming in R while using practical examples to teach fundamental concepts such as data manipulation, function creation and the use of R's environment system. It is structured around hands-on projects, including simulations of weighted dice, playing cards and a slot machine, alongside explanations of core programming principles like objects, notation, loops and performance optimisation. Additional sections cover installation, package management, data handling and debugging techniques. While the book is written using R Markdown and published under a Creative Commons licence, a physical edition is available through O’Reilly.
What you have here is one of several books written by Hadley Wickham. This one is published in its second edition as part of Chapman and Hall's R Series and is aimed primarily at R users who want to deepen their programming skills and understanding of the language, though it is also useful for programmers migrating from other languages. The book covers a broad range of topics organised into sections on foundations, functional programming, object-oriented programming, metaprogramming and techniques, with the latter including debugging, performance measurement and rewriting R code in C++.
Unlike Paul Teetor's separately published R Cookbook, the Cookbook for R was created by Winston Chang. It offers solutions to common tasks and problems in data analysis, covering topics such as basic operations, numbers, strings, formulas, data input and output, data manipulation, statistical analysis, graphs, scripts and functions, and tools for experiments.
The second edition of R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund offers a structured approach to learning data science with R, covering essential skills such as data visualisation, transformation, import, programming and communication. Organised into chapters that explore workflows, data manipulation techniques and tools like Quarto for reproducible research, the book emphasises practical applications and best practices for handling data effectively.
The R Graphics Cookbook, 2nd edition, offers a comprehensive guide to creating visualisations in R, structured into chapters that cover foundational skills such as installing and using packages, loading data from various formats and exploring datasets through basic plots. It progresses to detailed techniques for constructing bar graphs, line graphs, scatter plots and histograms, alongside methods for customising axes, annotations, themes and legends.
The book also addresses advanced topics like colour application, faceting data into subplots, generating specialised graphs such as network diagrams and heat maps and preparing data for visualisation through reshaping and summarising. Additional sections focus on refining graphical outputs for presentation, including exporting to different file formats and adjusting visual elements for clarity and aesthetics, while an appendix provides an overview of the {ggplot2} system.
R Markdown: The Definitive Guide
Published by Chapman & Hall/CRC, R Markdown: The Definitive Guide by Yihui Xie, J.J. Allaire and Garrett Grolemund covers the R Markdown document format, which has been in use since 2012 and is built on the knitr and Pandoc tools. The format allows users to embed code within Markdown documents and compile the results into a range of output formats including PDF, HTML and Word. The guide covers a broad scope of practical applications, from creating presentations, dashboards, journal articles and books to building interactive applications and generating blogs, reflecting how the ecosystem has matured since the {rmarkdown} package was first released in 2014.
A key principle running throughout is that Markdown's deliberately limited feature set is a strength rather than a drawback, encouraging authors to focus on content rather than complex typesetting. Despite this simplicity, the format remains highly customisable through tools such as Pandoc templates, LaTeX and CSS. Documents produced in R Markdown are also notably portable, as their straightforward syntax makes conversion between output formats more reliable, and because results are generated dynamically from code rather than entered manually, they are far more reproducible than those produced through conventional copy-and-paste methods.
The R Markdown Cookbook is a practical guide designed to help users enhance their ability to create dynamic documents by combining analysis and reporting. It covers essential topics such as installation, document structure, formatting options and output formats like LaTeX, HTML and Word, while also addressing advanced features such as customisations, chunk options and integration with other programming languages. The book provides step-by-step solutions to common tasks, drawing on examples from online resources and community discussions to offer clear, actionable advice for both new and experienced users seeking to improve their workflow and explore the full potential of R Markdown.
This book provides a practical guide to using R Markdown for scientists, developed from a three-hour workshop and designed to evolve as a living resource. It covers essential topics such as setting up R Markdown documents, integrating with RStudio for efficient workflows, exporting outputs to formats like PDF, HTML and Word, managing figures and tables with dynamic references and captions, incorporating mathematical equations, handling bibliographies with citations and style adjustments, troubleshooting common issues and exploring advanced R Markdown extensions.
bookdown: Authoring Books and Technical Documents with R Markdown
Here is a guide to using the {bookdown} package, which extends R Markdown to facilitate the creation of books and technical documents. It covers Markdown syntax, integration of R code, formatting options for HTML, LaTeX and e-book outputs and features such as cross-referencing, custom blocks and theming. The package supports both multipage and single-document outputs, and its applications extend beyond traditional books to include course materials, manuals and other structured content. The work includes practical examples, publishing workflows and details on customisation, alongside information about licensing and the availability of a printed version.
blogdown: Creating Websites with R Markdown
This book provides a guide to building static websites using R Markdown and the Hugo static site generator, emphasising the advantages of this approach for creating reproducible, portable content. It covers installation, configuration, deployment options such as Netlify and GitHub Pages, migration from platforms like WordPress and advanced topics including custom layouts and version control, along with practical examples, workflow recommendations and discussions on themes, content management and technical aspects of website development. The authors note that some information may be outdated due to recent updates to Hugo and the {blogdown} package, and they direct readers to additional resources for the latest features and changes.
pagedown: Create Paged HTML Documents for Printing from R Markdown
The R package {pagedown} enables users to create paged HTML documents suitable for printing to PDF, using R Markdown combined with a JavaScript library called paged.js, the latter of which implements W3C specifications for paged media. While tools like LaTeX and Microsoft Word have traditionally dominated PDF production, {pagedown} offers an alternative approach through HTML and CSS, supporting a range of document types including resumes, posters, business cards, letters, theses and journal articles.
Documents can be converted to PDF via Google Chrome, Microsoft Edge or Chromium, either manually or through the chrome_print() function, with additional support for server-based, CI/CD pipeline and Docker-based workflows. The package provides customisable CSS stylesheets, a CSS overriding mechanism for adjusting fonts and page properties, and various formatting features such as lists of tables and figures, abbreviations, footnotes, line numbering, page references, cover images, running headers, chapter prefixes and page breaks. Previewing paged documents requires a local or remote web server, and the layout is sensitive to browser zoom levels, with 100% zoom recommended for the most accurate output.
Dynamic Documents with R and knitr
Developed by Yihui Xie and inspired by the earlier {Sweave} package, {knitr} is an R package designed for dynamic report generation that consolidates the functionality of numerous other add-on packages into a single, cohesive tool. It supports multiple input languages, including R, Python and shell scripts, as well as multiple output markup languages such as LaTeX, HTML, Markdown, AsciiDoc and reStructuredText. The package operates on a principle of transparency, giving users full control over how input and output are handled, and runs R code in a manner consistent with how it would behave in a standard R terminal.
Among its notable features are built-in caching, automatic code formatting via the {formatR} package, support for more than 20 graphics devices and flexible options for managing plots within documents. It also allows advanced users to define custom hooks and regular expressions to extend and tailor its behaviour further. The package is affiliated with the Foundation for Open Access Statistics, a nonprofit organisation promoting free software, open access publishing and reproducible research in statistics.
Mastering Shiny is a comprehensive guide to developing web applications using R, focusing on the Shiny framework designed for data scientists. It introduces core concepts such as user interface design, reactive programming and dynamic content generation, while also exploring advanced topics like performance optimisation, security and modular app development. The book covers practical applications across industries, from academic teaching tools to real-time analytics dashboards, and aims to equip readers with the skills to build scalable, maintainable applications. It includes detailed chapters on workflow, layout, visualisation and user interaction, alongside case studies and technical best practices.
Engineering Production-Grade Shiny Apps
This is aimed at developers and team managers who already possess a working knowledge of the Shiny framework for R and wish to advance beyond the basics toward building robust, production-ready applications. Rather than covering introductory Shiny concepts or post-deployment concerns, the book focuses on the intermediate ground between those two stages, addressing project management, workflow, code structure and optimisation.
It introduces the {golem} package as a central framework and guides readers through a five-step workflow covering design, prototyping, building, strengthening and deployment, with additional chapters on optimisation techniques including R code performance, JavaScript integration and CSS. The book is structured to serve both those with project management responsibilities and those focused on technical development, acknowledging that in many small teams these roles are carried out by the same individual.
Outstanding User Interfaces with Shiny
Written by David Granjon and published in 2022, Outstanding User Interfaces with Shiny is a book aimed at filling the gap between beginner and advanced Shiny developers, covering how to deeply customise and enhance Shiny applications to the point where they become indistinguishable from classic web applications. The book spans a wide range of topics, including working with HTML and CSS, integrating JavaScript, building Bootstrap dashboard templates, mobile development and the use of React, providing a comprehensive resource that consolidates knowledge and experience previously scattered across the Shiny developer community.
Now in its second edition, R Packages by Hadley Wickham and Jennifer Bryan is a freely available online guide that teaches readers how to develop packages in R. A package is the core unit of shareable and reproducible R code, typically comprising reusable functions, documentation explaining how to use them and sample data. The book guides readers through the entire process of package development, covering areas such as package structure, metadata, dependencies, testing, documentation and distribution, including how to release a package to CRAN. The authors encourage a gradual approach, noting that an imperfect first version is perfectly acceptable provided each subsequent version improves on the last.
Written by Javier Luraschi, Kevin Kuo and Edgar Ruiz, Mastering Spark with R is a comprehensive guide designed to take readers from little or no familiarity with Apache Spark or R through to proficiency in large-scale data science. The book covers a broad range of topics, including data analysis, modelling, pipelines, cluster management, connections, data handling, performance tuning, extensions, distributed computing, streaming and contributing to the Spark ecosystem.
Happy Git and GitHub for the useR
Here is a practical guide written by Jenny Bryan and contributors, aimed primarily at R users involved in data analysis or package development. It covers the installation and configuration of Git alongside GitHub, the development of key workflows for common tasks and the integration of these tools into day-to-day work with R and R Markdown. The guide is structured to take readers from initial setup through to more advanced daily workflows, with particular attention paid to how Git and GitHub serve the needs of data science rather than pure software development.
Written by John Coene and intended for release as part of the CRC Press R series, JavaScript for R explores how the R programming language and JavaScript can be used together to enhance data science workflows. Rather than teaching JavaScript as a standalone language, the book demonstrates how a limited working knowledge of it can meaningfully extend what R developers can achieve, particularly through the integration of external JavaScript libraries.
The book covers a broad range of topics, progressing from foundational concepts through to data visualisation using the {htmlwidgets} package, bidirectional communication with Shiny, JavaScript-powered computations via the V8 engine and Node.js and the use of modern JavaScript tools such as Vue, React and webpack alongside R. Practical examples are woven throughout, including the building of interactive visualisations, custom Shiny inputs and outputs, image classification and machine learning operations, with all accompanying code made publicly available on GitHub.
This guide addresses challenges faced by developers of R packages that interact with web resources, offering strategies to create reliable unit tests despite dependencies on internet connectivity, authentication and external service availability. It explores tools such as {vcr}, {webmockr}, {httptest} and {webfakes}, which enable mocking and recording HTTP requests to ensure consistent testing environments, reduce reliance on live data and improve test reliability. The text also covers advanced topics like handling errors, securing tests and ensuring compatibility with CRAN and Bioconductor, while emphasising best practices for maintaining test robustness and contributor-friendly workflows. Funded by rOpenSci and the R Consortium, the resource aims to support developers in building more resilient and maintainable R packages through structured testing approaches.
The Shiny AWS Book is an online resource designed to teach data scientists how to deploy, host and maintain Shiny web applications using cloud infrastructure. Addressing a common gap in data science education, it guides readers through a range of DevOps technologies including AWS, Docker, Git, NGINX and open-source Shiny Server, covering everything from server setup and cost management to networking, security and custom configuration.
{ggplot2}: Elegant Graphics for Data Analysis
The third edition of {ggplot2}: Elegant Graphics for Data Analysis provides an in-depth exploration of the Grammar of Graphics framework, focusing on the theoretical foundations and detailed implementation of the {ggplot2} package rather than offering step-by-step instructions for specific visualisations. Written by Hadley Wickham, Danielle Navarro and Thomas Lin Pedersen, the book is presented as an online work-in-progress, with content structured across sections such as layers, scales, coordinate systems and advanced programming topics. It aims to equip readers with the knowledge to customise plots according to their needs, rather than serving as a direct guide for creating predefined graphics.
YaRrr! The Pirate’s Guide to R
Written by Nathaniel D. Phillips, this is a beginner-oriented guide to learning the R programming language from the ground up, covering everything from installation and basic navigation of the RStudio environment through to more advanced topics such as data manipulation, statistical analysis and custom function writing. The guide progresses logically through foundational concepts including scalars, vectors, matrices and dataframes before moving into practical areas such as hypothesis testing, regression, ANOVA and Bayesian statistics. Visualisation is given considerable attention across dedicated chapters on plotting, while later sections address loops, debugging and managing data from a variety of file formats. Each chapter includes practical exercises to reinforce learning, and the book concludes with a solutions section for reference.
Data Visualisation: A Practical Introduction
The forthcoming second edition of Data Visualisation: A Practical Introduction, written by Kieran Healy and due for release from Princeton University Press in March 2026, teaches readers how to explore, understand and present data using the R programming language and the {ggplot2} library. The book aims to bridge the gap between works that discuss visualisation principles without teaching the underlying tools and those that provide code recipes without explaining the reasoning behind them, combining practical instruction with conceptual grounding.
Revised and updated throughout to reflect developments in R and {ggplot2}, the second edition places greater emphasis on data wrangling, introduces updated and new datasets, and substantially rewrites several chapters, particularly those covering statistical models and map-drawing. Readers are guided through building plots progressively, from basic scatter plots to complex layered graphics, with the expectation that by the end they will be able to reproduce nearly every figure in the book and understand the principles that inform each choice.
The book also addresses the growing role of large language models in coding workflows, arguing that genuine understanding of what one is doing remains essential regardless of the tools available. It is suitable for complete beginners, those with some prior R experience, and instructors looking for a course companion, and requires the installation of R, RStudio and a number of supporting packages before work can begin.