PDF | Technology Tales

TOPIC: PDF

Docling, MarkItDown and Textract: Understanding the New Document-Processing Landscape

3rd June 2026

The growing use of large language models has changed the way many organisations think about documents. Reports, manuals, protocols, spreadsheets, presentations and scanned PDFs are no longer just files to be opened by a person; they are also sources of knowledge that can feed search systems, retrieval-augmented generation workflows and internal knowledge bases.

This shift has brought renewed attention to a practical problem that long predates the current AI boom. Documents are often rich in structure, yet many extraction tools reduce them to plain text, and once headings, tables, figures, captions and reading order are discarded, the resulting output can become difficult for humans to review and even harder for an AI system to use reliably.

Docling, MarkItDown and Textract all sit in this space, but they approach the problem from different directions. Textract is rooted in general-purpose text extraction, MarkItDown focuses on producing Markdown for text-analysis workflows and Docling aims to build a richer understanding of document structure.

Why Document Conversion Matters for AI Systems

A PDF, Word document or spreadsheet may look orderly on screen, yet that order is not always easy to recover programmatically. A human reader can see that a line of text is a heading, that a table belongs to a section or that a caption describes a nearby figure, while a simple text extractor may see only a stream of characters.

That difference matters when documents are used with large language models. If a technical manual, financial report or policy document is flattened into undifferentiated text, a search system may retrieve the wrong passage or miss the relationship between a table and its surrounding explanation. Retrieval-augmented generation depends not only on having the right words in an index, but also on preserving enough context for those words to remain meaningful.

Markdown and structured JSON have therefore become important intermediate formats. Markdown is close to plain text, but it can still represent headings, tables, links and lists in a compact way. JSON can go further by encoding document hierarchy, page-level information and other metadata for downstream processing.
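To make the contrast between the two formats concrete, here is a minimal sketch that renders one small document both ways. The dictionary layout (title, page, sections, table) is purely an illustrative assumption and does not follow the schema of any tool discussed here.

```python
import json

# Hypothetical document model; the keys are illustrative only and do not
# follow the schema of Docling, MarkItDown or any other tool.
document = {
    "title": "Quarterly Report",
    "page": 1,
    "sections": [
        {
            "heading": "Revenue",
            "text": "Revenue rose in every region.",
            "table": {
                "columns": ["Region", "Revenue"],
                "rows": [["North", "120"], ["South", "95"]],
            },
        }
    ],
}


def table_to_markdown(table):
    """Render a simple table dictionary as a Markdown pipe table."""
    header = "| " + " | ".join(table["columns"]) + " |"
    divider = "| " + " | ".join("---" for _ in table["columns"]) + " |"
    rows = ["| " + " | ".join(row) + " |" for row in table["rows"]]
    return "\n".join([header, divider, *rows])


def document_to_markdown(doc):
    """Render the model as Markdown: compact, but page metadata is dropped."""
    parts = [f"# {doc['title']}"]
    for section in doc["sections"]:
        parts.append(f"## {section['heading']}")
        parts.append(section["text"])
        if "table" in section:
            parts.append(table_to_markdown(section["table"]))
    return "\n\n".join(parts)


markdown = document_to_markdown(document)   # headings and table survive
as_json = json.dumps(document, indent=2)    # hierarchy and page number survive
print(markdown)
```

The Markdown version keeps the heading levels and the table in a form that is easy to read and index, while the JSON version also retains the page number and the explicit nesting that a downstream process might need.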

Docling and Document Understanding

Docling is an open-source Python toolkit designed to convert and understand documents for AI-oriented workflows. It was initially developed by IBM's AI for Knowledge team at IBM Research Zurich and open-sourced in July 2024. It is now hosted under the LF AI & Data Foundation (part of the Linux Foundation), following IBM's formal contribution of the project to the foundation on 29th April 2025. Its purpose is not merely to extract text, but to preserve structure and meaning in a form that can be used by search systems, knowledge extraction tools and retrieval-augmented generation pipelines.

A simple way to describe Docling is to place it between raw documents and a language model. Instead of treating a document as a block of text, it attempts to identify headings, document hierarchy, tables, figures, captions, formulas, code blocks, reading order and page layout information. This is particularly important for PDFs, where the visual appearance of a page can hide a complicated underlying structure.

Docling supports parsing for PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3 and image formats, as well as several application-specific XML schemas including USPTO patents, JATS articles and XBRL financial reports, making it relevant well beyond ordinary office files.

The project's outputs are also designed for modern AI pipelines. Docling can export to Markdown, HTML, WebVTT, DocTags and lossless JSON, while its internal DoclingDocument representation provides a unified model of the parsed material. This gives developers a way to move from a PDF or other source file to a structured representation that is easier to chunk, index and query.

Its PDF capabilities are among the main reasons it has drawn attention. The documentation highlights advanced PDF understanding, including page layout, reading order, table structure, code, formulas and image classification. It also includes extensive OCR support for scanned PDFs and images, support for several visual language models such as GraniteDocling and audio support using automatic speech recognition models.

Docling is also designed to fit into the wider generative AI ecosystem. Its integrations include LangChain, LlamaIndex, CrewAI and Haystack for agentic AI, and it supports local execution for sensitive data and air-gapped environments. The ability to run entirely on local hardware is important where sending files to an external service is not acceptable.

A typical installation begins with pip install docling. A simple conversion in Python uses DocumentConverter from docling.document_converter, then calls converter.convert("study_report.pdf") and exports the result with result.document.export_to_markdown(). The resulting Markdown can then be used in a search index, vector database or language-model workflow.
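Put together, the steps described above might look like the following sketch. The helper name convert_to_markdown, the optional converter parameter and the output file handling are assumptions added for illustration; only DocumentConverter, convert() and export_to_markdown() come from the description above.

```python
from pathlib import Path


def convert_to_markdown(source, output_path, converter=None):
    """Convert a document with Docling and save the Markdown to a file.

    `converter` is an illustrative extra: it lets a preconfigured converter
    be passed in. By default, a plain DocumentConverter is created.
    """
    if converter is None:
        # Imported lazily, so docling is only needed when actually used
        # (pip install docling).
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert(source)
    markdown = result.document.export_to_markdown()

    # Keep a copy on disk for indexing or manual review.
    Path(output_path).write_text(markdown, encoding="utf-8")
    return markdown
```

Calling convert_to_markdown("study_report.pdf", "study_report.md") would then leave a Markdown file beside the source, ready for a search index or vector database. The first run can take a while because Docling may need to download its models.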

Textract and Traditional Text Extraction

Textract, in this context, refers to the Python package named textract and not Amazon Textract, which is a separate, cloud-hosted service. Its philosophy is much simpler than Docling's: give it a file and receive extracted text. The package provides both a command-line interface (for example, textract path/to/file.extension) and a Python interface using textract.process("path/to/file.extension").
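As a rough sketch of the Python route, the wrapper below calls textract.process() and decodes the result. The function name extract_text and the process parameter are assumptions for illustration. To the best of my knowledge textract returns bytes rather than a string, which is why the decoding step is there.

```python
def extract_text(path, process=None, encoding="utf-8"):
    """Return the plain text of a file as a str, using textract."""
    if process is None:
        # Imported lazily: needs pip install textract, plus whichever
        # system helpers the file type relies on.
        import textract

        process = textract.process

    raw = process(path)

    # Decode bytes for use as ordinary text; pass strings through unchanged.
    return raw.decode(encoding) if isinstance(raw, bytes) else raw
```

The placeholder path from the documentation works here too: extract_text("path/to/file.extension") should give back the text that textract found, with no headings or table structure attached.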

The Textract documentation describes the problem it addresses as one of recovering useful information from "dark data" embedded in Word documents, PowerPoint presentations, PDFs and other files. It provides a single interface across many formats, which has long been useful for natural language processing and textual analysis. Its design is method-agnostic, meaning it wraps different tools and libraries depending on the file type.

Textract supports many formats through a variety of underlying systems. Its documentation mentions CSV, TSV, DOC, DOCX, EML, EPUB, GIF, JPG, JSON, HTML, MP3, MSG, ODT, OGG, PDF, PNG, PPTX, PostScript, RTF, TIFF, TXT, WAV, XLSX and XLS. DOC files may be processed through antiword, DOCX through python-docx2txt, images through tesseract-ocr, PDFs through pdftotext by default (or pdfminer.six) and audio through tools such as sox, SpeechRecognition and pocketsphinx.

This broad format support is useful, but Textract was not designed specifically for the current generation of RAG and LLM systems. Its goal is text extraction rather than document understanding, meaning it may retrieve text effectively from many sources, but it does not build a rich document model and has limited awareness of tables, layout, semantic sections or chunking for retrieval.

That distinction is central when comparing Textract with newer tools. Textract answers the question of what text is present in a file. Docling tries to answer what the document means structurally, which is a more demanding task when tables, appendices, captions and section hierarchies carry much of the meaning.

MarkItDown and the Rise of Markdown-First Conversion

MarkItDown is an open-source Microsoft tool for converting files and office documents to Markdown. It is published under an MIT licence and its stated purpose is to provide a lightweight Python utility for converting various files to Markdown for indexing, text analysis and related purposes. It is therefore closer to Textract than to a full document-understanding framework, but it places much more emphasis on preserving useful structure.

The tool's basic command-line use is straightforward. A document can be converted with a command such as markitdown path-to-file.pdf > document.md, or an output file can be specified with -o. In Python, the usual pattern is to import MarkItDown, create an instance and call md.convert("test.xlsx"), then read the result through result.text_content.
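For an indexing workflow, the same Python pattern can be applied to a whole folder. The sketch below is one way of doing that. The function name convert_folder, its parameters and the default list of suffixes are assumptions of mine, while MarkItDown, convert() and text_content are as described above.

```python
from pathlib import Path


def convert_folder(source_dir, output_dir, converter=None,
                   suffixes=(".pdf", ".docx", ".xlsx")):
    """Convert matching files in source_dir to Markdown files in output_dir."""
    if converter is None:
        # Imported lazily: needs pip install 'markitdown[all]' or similar.
        from markitdown import MarkItDown

        converter = MarkItDown()

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for path in sorted(Path(source_dir).iterdir()):
        if path.suffix.lower() not in suffixes:
            continue  # skip file types that were not asked for
        result = converter.convert(str(path))
        # Note: files sharing a stem (report.pdf, report.docx) would
        # overwrite each other; a real pipeline should handle that case.
        target = out / (path.stem + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written
```

Restricting the suffixes keeps the run predictable, since the formats that MarkItDown can handle depend on which optional dependencies are installed.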

MarkItDown supports PDF, PowerPoint, Word, Excel, images (with EXIF metadata and OCR), audio (with EXIF metadata and speech transcription), HTML and various text-based formats such as CSV, JSON and XML. The exact capabilities depend on optional dependencies, which can be installed all at once with pip install 'markitdown[all]' or more selectively, such as pip install 'markitdown[pdf, docx, pptx]'.

The reason for focusing on Markdown is practical. Markdown is close to plain text, while still allowing structure to be represented through headings, tables, links and other simple conventions. The MarkItDown documentation notes that mainstream large language models such as GPT-4o appear to understand Markdown well and often produce it without being asked to do that, while Markdown conventions are also token-efficient.

MarkItDown is not intended as a high-fidelity document conversion tool for human publishing, and the project documentation makes clear that it is mainly intended for consumption by text-analysis tools. This makes it well suited to smaller knowledge-management projects, indexing workflows and pipelines where clean Markdown is more valuable than visual reproduction.

The project also supports optional integrations and extensions. Plugins are supported but disabled by default, and the documentation points developers to third-party plugins using the #markitdown-plugin tag. The markitdown-ocr plugin adds OCR support to PDF, DOCX, PPTX and XLSX converters by using LLM Vision through the same llm_client and llm_model pattern already used for image descriptions, while falling back to the standard converter if no client is provided.

MarkItDown also integrates with Microsoft cloud services for more advanced cases. Azure Document Intelligence can be used for conversion by providing an endpoint, while Azure Content Understanding offers higher-quality cloud extraction, structured field extraction through YAML front matter and multimodal support for documents, images, audio and video. The documentation notes that these cloud routes involve billable Azure API calls, so users can restrict which file types are sent through Content Understanding.

Its security guidance is also worth noting. MarkItDown performs I/O with the privileges of the current process, similar to open() or requests.get(). The project advises sanitising untrusted inputs and using the narrowest conversion function suitable for the task, such as convert_local(), convert_stream() or convert_response(), rather than the more permissive convert() when tighter control is needed.
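One way of following that advice is sketched below: a file name supplied by a user is checked against an allowed base directory before being handed to convert_local(). The helper names resolve_inside and convert_untrusted are hypothetical, and the path check is my own addition rather than anything that MarkItDown provides.

```python
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Return the resolved path if it lies inside base_dir, else raise."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # is_relative_to needs Python 3.9 or later.
    if not candidate.is_relative_to(base):
        raise ValueError(f"{user_path!r} is outside {base}")
    return candidate


def convert_untrusted(base_dir, user_path, converter=None):
    """Convert a user-named file, but only if it sits under base_dir."""
    safe_path = resolve_inside(base_dir, user_path)

    if converter is None:
        # Imported lazily: needs markitdown to be installed.
        from markitdown import MarkItDown

        converter = MarkItDown()

    # convert_local() is the narrower entry point for files on disk,
    # used here instead of the more permissive convert().
    return converter.convert_local(str(safe_path)).text_content
```

Resolving the path first means that inputs such as ../secrets.txt or an absolute path elsewhere on the system are rejected before any file is opened.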

Comparing the Three Approaches

The simplest comparison is based on the main question each tool is designed to answer. Textract asks what text is in a file, MarkItDown asks how to turn a file into useful Markdown, and Docling asks how to represent the structure of the document itself. These differences lead to different strengths, even where the tools appear to support similar file types.

For a quick extraction task, Textract remains a practical option. If the requirement is simply to read a DOCX file or pull text from a straightforward PDF, its single API can be convenient, particularly where plain text is enough and the downstream process does not require reliable headings, tables or layout.

MarkItDown occupies a middle ground. It is lightweight, actively maintained by Microsoft and designed with LLM workflows in mind. It can produce Markdown that preserves more structure than plain text, making it useful for search, summarisation and note-taking systems without requiring the heavier processing associated with a document-understanding framework.

Docling is the strongest fit when the structure of a document is central to its meaning. Complex PDFs, detailed tables, heavily formatted reports, figure captions, formulas and multi-level document hierarchies are exactly the kinds of material that can lose meaning when converted to plain text, and Docling's richer document representation and JSON export make it especially relevant for more demanding AI pipelines.

There is a cost to that additional capability: Docling is a heavier and more complex tool than Textract. MarkItDown may be easier to adopt for smaller projects where Markdown output is the main requirement, while Docling becomes more attractive when accuracy of structure matters more than simplicity.

Where Document Structure Carries the Meaning

Some document types illustrate more clearly than others why structure matters as much as content. Legal contracts, engineering specifications, academic papers, financial filings and policy documents are all examples of material where meaning is often distributed across headings, tables, footnotes, appendices and cross-references. A plain-text extraction from any of these can be difficult to use reliably because it may blur the boundary between sections and collapse the relationship between a table and the text that explains it.

A Markdown conversion may preserve enough structure for many search and summarisation tasks. A richer representation such as Docling's can be more suitable when table structure, reading order and document hierarchy need to be retained for reliable retrieval or knowledge-base construction.
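As an illustration of how retained headings help retrieval, the following sketch splits Markdown into chunks and labels each one with the trail of headings above it. It is a simplified, hypothetical chunker rather than the method used by any of the three tools, and it does not cope with # characters inside fenced code blocks.

```python
import re

# Matches ATX-style headings such as "## Setup".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown):
    """Split Markdown into chunks, each tagged with its heading trail."""
    chunks = []
    trail = []  # stack of (level, title) for the headings currently open
    body = []   # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(title for _, title in trail)
            chunks.append({"path": path, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            # Leave any headings at the same or a deeper level.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Manual\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n## Usage\n\nRun it.\n"
for chunk in chunk_by_heading(sample):
    print(chunk["path"], "|", chunk["text"])
```

Because every chunk carries its section path, a passage retrieved from the index still says where in the document it came from, which is the context that a flat text extraction would lose.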

This does not mean that one tool replaces all others. A lightweight converter may be preferable for routine ingestion of simple office files, while a structured parser may be selected for scanned PDFs, complex reports or multisection reference documents. The appropriate choice depends on the nature of the source material and the level of structure required downstream.

Choosing Between Docling, MarkItDown and Textract

The current landscape reflects a broader shift from text extraction towards document understanding. Textract represents an older but still useful model, where the priority is to get plain text from many file types through a consistent interface. MarkItDown reflects the needs of LLM-era workflows by turning varied content into Markdown that is compact, readable and easier for language models to process.

Docling goes further by treating documents as structured objects rather than text containers. Its support for layout analysis, OCR, tables, reading order, formulas, figures, audio and specialised schemas makes it a more ambitious option for complex pipelines, and its ability to run locally also matters where privacy, security or regulatory constraints limit the use of external services.

For general users, the choice ultimately follows the complexity of the source material and what is expected of the output. Textract handles straightforward extraction well enough, while MarkItDown adds lightweight structure without much overhead, and Docling is the right tool when the document's own organisation needs to survive the conversion intact.

Converting from CGM to PostScript

24th November 2009

One thing that I recently had to investigate was the possibility of converting CGM vector graphics files into PostScript and from there into PDF. Having used ImageMagick for converting images before, that was an obvious option. However, it cannot process CGM files on its own and needs a delegate or helper application as well. This is the case with raw digital camera files too, with UFRaw being the program chosen. For CGM images, the more obscure RALCGM is what's needed, and tracking it down is a bit of an art. Though it was developed at the U.K.'s Rutherford Appleton Laboratory, it appears to have been left to go off into the wilderness without anyone keeping an eye on things. With that in mind, here are the installation packages for Windows and Linux (RPM):

Windows Installer

Linux RPM

RALCGM is a handy command line tool that can convert from CGM to PostScript on its own without any need for ImageMagick at all. From what I have seen, fonts on graphical output may look greyer than black, but it otherwise does its job well. Considering that it is a freely available tool, one cannot complain too much. There are other packages for doing vector to raster conversion, and the ones that I have seen do have GUIs, but the freedom to look at paid-for software wasn't mine to have. The required command looks something like the following:

ralcgm -d PS -oL test.cgm test.ps

The switch -d PS uses the software's PostScript driver and -oL specifies landscape orientation. If you would like to find out more, here's a PDF rendition of the help file that comes with the thing:

RALCGM Documentation
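If the conversion needs scripting, the two steps could be chained along the lines of the sketch below. The ralcgm call mirrors the command above. The second step, handing the PostScript file to Ghostscript's pdfwrite device, is an assumption on my part about how to finish the journey to PDF. The function names are made up for illustration.

```python
import subprocess


def cgm_to_pdf_commands(cgm_path, ps_path, pdf_path, landscape=True):
    """Build the two command lines: CGM to PostScript, then on to PDF."""
    ralcgm = ["ralcgm", "-d", "PS"]
    if landscape:
        ralcgm.append("-oL")  # landscape orientation
    ralcgm += [cgm_path, ps_path]

    # Assumed second step: Ghostscript's pdfwrite device reads the
    # PostScript file and writes a PDF.
    ghostscript = [
        "gs", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=pdfwrite", f"-sOutputFile={pdf_path}", ps_path,
    ]
    return [ralcgm, ghostscript]


def cgm_to_pdf(cgm_path, ps_path, pdf_path, landscape=True):
    """Run both steps, stopping at the first one that fails."""
    for command in cgm_to_pdf_commands(cgm_path, ps_path, pdf_path, landscape):
        subprocess.run(command, check=True)
```

Passing each command as a list avoids any shell quoting trouble with file names that contain spaces.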

Ghostscript: **** Unable to open the initial device, quitting.

6th October 2008

The above error message has been greeting me when creating PDFs with Ghostscript on a Solaris box and does need some translation. If you are directing output to a real printer, I suppose that it is sensible enough: nothing will happen unless you can connect to it. It gets a little less obvious when associated with PDF creation and seems to mean that the pdfwrite virtual device is unable to create the specified output file. A first port of call would be to check that you can write to the directory where you are putting the new PDF file. In my case, there appears to be another cause, so I'll have to keep looking for a solution.

Update: I have since discovered the cause of this: a now defunct TEMP assignment in the .profile file for my user account. Removing that piece of code resolved the problem.
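Both checks, the writable output directory and the stale TEMP setting, are easy to script. The sketch below is a hypothetical helper of my own, not anything that ships with Ghostscript. It also looks at TMPDIR on the assumption that Ghostscript consults that variable for temporary files as well.

```python
import os
from pathlib import Path


def check_pdfwrite_environment(output_file, env=None):
    """Return a list of likely reasons why pdfwrite cannot create its output."""
    env = os.environ if env is None else env
    problems = []

    # First port of call: can the output directory be written to?
    out_dir = Path(output_file).resolve().parent
    if not out_dir.is_dir():
        problems.append(f"output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory is not writable: {out_dir}")

    # Second: does a temporary directory setting point somewhere defunct?
    for name in ("TEMP", "TMPDIR"):
        value = env.get(name)
        if value and not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")

    return problems
```

An empty list means that neither of these two causes applies, so the search would need to continue elsewhere.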

Combining bookmarked PDF files using Ghostscript

4th October 2008

My latest adventure in the world of computing has led me into the world of automated PDF generation. When my first approach didn't prove to be completely trouble-free, I decided to look at the idea of going part of the way with it and finishing off the job with the open-source utility Ghostscript. It is that which got me thinking about combining bookmarked PDF files, and I can say that Ghostscript is capable of producing what I need as long as it doesn't generate any errors along the way. Here's the command that does the trick:

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=final.pdf source_file1.pdf source_file2.pdf

The various switches of the gs command have very useful roles: -dBATCH ensures that Ghostscript shuts down when all is done, -dNOPAUSE removes any prompts that would otherwise be given, -q selects quiet mode, -sDEVICE=pdfwrite uses Ghostscript's own PDF creation functionality and -sOutputFile names the output file, stopping Ghostscript from sending it to its default stream. All of this applies to Windows Ghostscript too, though the name of the executable is gswin32c for 32-bit Windows instead of gs.
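For automated generation, the same command can be assembled and run from a script. What follows is a minimal sketch in Python. The function names are made up for illustration, and the executable lookup simply tries the two names mentioned above.

```python
import shutil
import subprocess


def build_merge_command(output_pdf, input_pdfs, executable="gs"):
    """Assemble the Ghostscript command line that combines several PDFs."""
    if not input_pdfs:
        raise ValueError("at least one input PDF is needed")
    return [
        executable,
        "-dBATCH",                      # exit when processing is finished
        "-dNOPAUSE",                    # no prompt between pages
        "-q",                           # quiet mode
        "-sDEVICE=pdfwrite",            # Ghostscript's PDF output device
        f"-sOutputFile={output_pdf}",   # where the combined file goes
        *input_pdfs,
    ]


def find_ghostscript():
    """Return the first Ghostscript executable found on the PATH, if any."""
    for name in ("gs", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


def merge_pdfs(output_pdf, input_pdfs):
    """Combine the input PDFs, raising an error if Ghostscript fails."""
    executable = find_ghostscript()
    if executable is None:
        raise FileNotFoundError("no Ghostscript executable found on the PATH")
    subprocess.run(build_merge_command(output_pdf, input_pdfs, executable), check=True)
```

Keeping the switches in a list preserves their exact capitalisation, which matters because Ghostscript's switches are case-sensitive.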

When it comes to any debugging, it is useful to remember that Ghostscript is case-sensitive with its command line switches, something that I have seen trip up others. I am getting initial device initialisation errors, so it strikes me that dropping some of the switches that reduce the number of messages might help me work out what's going on. It's a useful idea that I have yet to try.

There is also online documentation if you fancy learning more, and Linux.com has an article that considers other possible PDF combination tools as well. All in all, it's nice to have command line tools to do these sorts of things rather than having to use GUI applications all the time.

Other uses for the middle mouse button

11th November 2007

Here's another one of those things that I discovered while being clumsy: in Firefox, click on your middle mouse button/wheel while hovering over a tab, and it will close it; you don't even need to click on the close icon. Evince, the PDF viewer favoured by Ubuntu, also makes use of the middle mouse button: for panning your way through documents using the hand tool. In a moment of lateral thinking, I tried the same trick with Adobe Reader; in version 7.x, it works in the same way. On Windows at least, Adobe Reader 8.x is a different animal and features automatic scrolling, a very useful proposition for the reading of eBooks if the text doesn't pass by you too quickly, and even a moderately reliable read aloud feature.