Technology Tales

Notes drawn from experiences in consumer and enterprise technology

TOPIC: MARKUP LANGUAGES

Docling, MarkItDown and Textract: Understanding the New Document-Processing Landscape

3rd June 2026

The growing use of large language models has changed the way many organisations think about documents. Reports, manuals, protocols, spreadsheets, presentations and scanned PDFs are no longer just files to be opened by a person; they are also sources of knowledge that can feed search systems, retrieval-augmented generation workflows and internal knowledge bases.

This shift has brought renewed attention to a practical problem that long predates the current AI boom. Documents are often rich in structure, yet many extraction tools reduce them to plain text, and once headings, tables, figures, captions and reading order are discarded, the resulting output can become difficult for humans to review and even harder for an AI system to use reliably.

Docling, MarkItDown and Textract all sit in this space, but they approach the problem from different directions. Textract is rooted in general-purpose text extraction, MarkItDown focuses on producing Markdown for text-analysis workflows and Docling aims to build a richer understanding of document structure.

Why Document Conversion Matters for AI Systems

A PDF, Word document or spreadsheet may look orderly on screen, yet that order is not always easy to recover programmatically. A human reader can see that a line of text is a heading, that a table belongs to a section or that a caption describes a nearby figure, while a simple text extractor may see only a stream of characters.

That difference matters when documents are used with large language models. If a technical manual, financial report or policy document is flattened into undifferentiated text, a search system may retrieve the wrong passage or miss the relationship between a table and its surrounding explanation. Retrieval-augmented generation depends not only on having the right words in an index, but also on preserving enough context for those words to remain meaningful.

Markdown and structured JSON have therefore become important intermediate formats. Markdown is close to plain text, but it can still represent headings, tables, links and lists in a compact way. JSON can go further by encoding document hierarchy, page-level information and other metadata for downstream processing.
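
To make the link between Markdown structure and retrieval context more concrete, here is a minimal sketch in Python that splits Markdown into chunks and tags each chunk with the headings above it. The function name, the sample text and the output shape are all assumptions made for illustration, and the sketch ignores complications such as # characters inside fenced code blocks.

```python
import json
import re

# Matches ATX-style headings such as "## Safety"
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str) -> list[dict]:
    """Split Markdown into chunks, each tagged with its heading path."""
    chunks = []
    path = []  # stack of (level, title) for the headings currently in force
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before adding this one
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


sample = """# Manual
Intro text.

## Safety
Wear gloves.

## Limits
| Item | Max |
|------|-----|
| Load | 5kg |
"""

chunks = chunk_markdown(sample)
print(json.dumps(chunks, indent=2))
```

Each chunk carries its own heading trail, so a table under "Limits" stays associated with both "Manual" and "Limits" when it is indexed on its own.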

Docling and Document Understanding

Docling is an open-source Python toolkit designed to convert and understand documents for AI-oriented workflows. It was initially developed by IBM's AI for Knowledge team at IBM Research Zurich, open-sourced in July 2024 and is now hosted under the LF AI & Data Foundation (part of the Linux Foundation), following IBM's formal contribution of the project to the foundation on 29th April 2025. Its purpose is not merely to extract text, but to preserve structure and meaning in a form that can be used by search systems, knowledge extraction tools and retrieval-augmented generation pipelines.

A simple way to describe Docling is to place it between raw documents and a language model. Instead of treating a document as a block of text, it attempts to identify headings, document hierarchy, tables, figures, captions, formulas, code blocks, reading order and page layout information. This is particularly important for PDFs, where the visual appearance of a page can hide a complicated underlying structure.

Docling supports parsing for PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3 and image formats, as well as several application-specific XML schemas including USPTO patents, JATS articles and XBRL financial reports, making it relevant well beyond ordinary office files.

The project's outputs are also designed for modern AI pipelines. Docling can export to Markdown, HTML, WebVTT, DocTags and lossless JSON, while its internal DoclingDocument representation provides a unified model of the parsed material. This gives developers a way to move from a PDF or other source file to a structured representation that is easier to chunk, index and query.

Its PDF capabilities are among the main reasons it has drawn attention. The documentation highlights advanced PDF understanding, including page layout, reading order, table structure, code, formulas and image classification. It also includes extensive OCR support for scanned PDFs and images, support for several visual language models under the GraniteDocling name and audio support using automatic speech recognition models.

Docling is also designed to fit into the wider generative AI ecosystem. Its integrations include LangChain, LlamaIndex, CrewAI and Haystack for agentic AI, and it supports local execution for sensitive data and air-gapped environments. The ability to run entirely on local hardware is important where sending files to an external service is not acceptable.

A typical installation begins with pip install docling. A simple conversion in Python uses DocumentConverter from docling.document_converter, then calls converter.convert("study_report.pdf") and exports the result with result.document.export_to_markdown(). The resulting Markdown can then be used in a search index, vector database or language-model workflow.
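
The conversion just described can be sketched as follows. The wrapper function pdf_to_markdown and its optional converter parameter are my own additions for illustration, so that the code can be exercised without Docling installed; only DocumentConverter, convert() and export_to_markdown() come from the description above.

```python
def pdf_to_markdown(source, converter=None):
    """Convert a document to Markdown with Docling, or a compatible stand-in."""
    if converter is None:
        # Imported lazily so that this module still loads where Docling is absent
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

Calling pdf_to_markdown("study_report.pdf") with Docling installed should return the Markdown text, which can then be written to a file or passed on to an indexing step.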

Textract and Traditional Text Extraction

Textract, in this context, refers to the Python package named textract and not Amazon Textract, which is a separate, cloud-hosted service. Its philosophy is much simpler than Docling's: give it a file and receive extracted text. The package provides both a command-line interface (for example, textract path/to/file.extension) and a Python interface using textract.process("path/to/file.extension").
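
As a rough sketch of the Python interface, the helper below wraps textract.process and decodes what it returns. The helper name extract_text and its process parameter are hypothetical, added so the function can be tried with a stand-in. The decoding step assumes that the package hands back a byte string, which is my understanding of its behaviour rather than something stated above.

```python
def extract_text(path, process=None, encoding="utf-8"):
    """Return the text of a file using textract, or a stand-in callable."""
    if process is None:
        # Imported lazily so that this module still loads where textract is absent
        import textract

        process = textract.process
    raw = process(path)
    # Decode only when bytes come back; pass strings through unchanged
    return raw.decode(encoding) if isinstance(raw, bytes) else raw
```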

The Textract documentation describes the problem it addresses as one of recovering useful information from "dark data" embedded in Word documents, PowerPoint presentations, PDFs and other files, providing a single interface across many formats, which has long been useful for natural language processing and textual analysis. Its design is method-agnostic, meaning it wraps different tools and libraries depending on the file type.

Textract supports many formats through a variety of underlying systems. Its documentation mentions CSV, TSV, DOC, DOCX, EML, EPUB, GIF, JPG, JSON, HTML, MP3, MSG, ODT, OGG, PDF, PNG, PPTX, PostScript, RTF, TIFF, TXT, WAV, XLSX and XLS. DOC files may be processed through antiword, DOCX through python-docx2txt, images through tesseract-ocr, PDFs through pdftotext by default (or pdfminer.six) and audio through tools such as sox, SpeechRecognition and pocketsphinx.

This broad format support is useful, but Textract was not designed specifically for the current generation of RAG and LLM systems. Its goal is text extraction rather than document understanding, meaning it may retrieve text effectively from many sources, but it does not build a rich document model and has limited awareness of tables, layout, semantic sections or chunking for retrieval.

That distinction is central when comparing Textract with newer tools. Textract answers the question of what text is present in a file. Docling tries to answer what the document means structurally, which is a more demanding task when tables, appendices, captions and section hierarchies carry much of the meaning.

MarkItDown and the Rise of Markdown-First Conversion

MarkItDown is an open-source Microsoft tool for converting files and office documents to Markdown. It is published under an MIT licence and its stated purpose is to provide a lightweight Python utility for converting various files to Markdown for indexing, text analysis and related purposes. It is therefore closer to Textract than to a full document-understanding framework, but it places much more emphasis on preserving useful structure.

The tool's basic command-line use is straightforward. A document can be converted with a command such as markitdown path-to-file.pdf > document.md, or an output file can be specified with -o. In Python, the usual pattern is to import MarkItDown, create an instance and call md.convert("test.xlsx"), then read the result through result.text_content.
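
The Python pattern described above might look like this when applied to several files at once. The function convert_many and its md parameter are assumptions for illustration; MarkItDown, convert() and text_content are the names given in the description.

```python
def convert_many(paths, md=None):
    """Convert each file to Markdown, returning a dict of path -> Markdown text."""
    if md is None:
        # Imported lazily so that this module still loads where MarkItDown is absent
        from markitdown import MarkItDown

        md = MarkItDown()
    # One converter instance is reused for every file
    return {path: md.convert(path).text_content for path in paths}
```

With the package installed, convert_many(["test.xlsx"]) should return a dictionary holding the Markdown for that workbook.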

MarkItDown supports PDF, PowerPoint, Word, Excel, images (with EXIF metadata and OCR), audio (with EXIF metadata and speech transcription), HTML and various text-based formats such as CSV, JSON and XML. The exact capabilities depend on optional dependencies, which can be installed all at once with pip install 'markitdown[all]' or more selectively, such as pip install 'markitdown[pdf, docx, pptx]'.

The reason for focusing on Markdown is practical. Markdown is close to plain text, while still allowing structure to be represented through headings, tables, links and other simple conventions. The MarkItDown documentation notes that mainstream large language models such as GPT-4o appear to understand Markdown well and often produce it without being asked to do that, while Markdown conventions are also token-efficient.

MarkItDown is not intended as a high-fidelity document conversion tool for human publishing, and the project documentation makes clear that it is mainly intended for consumption by text-analysis tools. This makes it well suited to smaller knowledge-management projects, indexing workflows and pipelines where clean Markdown is more valuable than visual reproduction.

The project also supports optional integrations and extensions. Plugins are supported but disabled by default, and the documentation points developers to third-party plugins using the #markitdown-plugin tag. The markitdown-ocr plugin adds OCR support to PDF, DOCX, PPTX and XLSX converters by using LLM Vision through the same llm_client and llm_model pattern already used for image descriptions, while falling back to the standard converter if no client is provided.

MarkItDown also integrates with Microsoft cloud services for more advanced cases. Azure Document Intelligence can be used for conversion by providing an endpoint, while Azure Content Understanding offers higher-quality cloud extraction, structured field extraction through YAML front matter and multimodal support for documents, images, audio and video. The documentation notes that these cloud routes involve billable Azure API calls, so users can restrict which file types are sent through Content Understanding.

Its security guidance is also worth noting. MarkItDown performs I/O with the privileges of the current process, similar to open() or requests.get(). The project advises sanitising untrusted inputs and using the narrowest conversion function suitable for the task, such as convert_local(), convert_stream() or convert_response(), rather than the more permissive convert() when tighter control is needed.
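
One way of acting on that advice is to check a path before it reaches the converter at all. The sketch below is not taken from the MarkItDown documentation: safe_local_path, convert_untrusted and the idea of an allowed root folder are all hypothetical, and the only MarkItDown calls used are convert_local() and text_content. It needs Python 3.9 or later for is_relative_to().

```python
from pathlib import Path


def safe_local_path(candidate: str, allowed_root: str) -> Path:
    """Resolve candidate under allowed_root, rejecting anything that escapes it."""
    root = Path(allowed_root).resolve()
    path = (root / candidate).resolve()
    if not path.is_relative_to(root):
        raise ValueError(f"{candidate!r} is outside {root}")
    return path


def convert_untrusted(candidate: str, allowed_root: str, md=None) -> str:
    """Convert a user-supplied file name using the local-only conversion call."""
    path = safe_local_path(candidate, allowed_root)
    if md is None:
        # Imported lazily so that this module still loads where MarkItDown is absent
        from markitdown import MarkItDown

        md = MarkItDown()
    # convert_local() only reads local files, unlike the broader convert()
    return md.convert_local(str(path)).text_content
```

A request for "../secret.txt" raises ValueError before any conversion is attempted, while "notes/a.docx" resolves to a path inside the allowed folder.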

Comparing the Three Approaches

The simplest comparison is based on the main question each tool is designed to answer. Textract asks what text is in a file, MarkItDown asks how to turn a file into useful Markdown, and Docling asks how to represent the structure of the document itself. These differences lead to different strengths, even where the tools appear to support similar file types.

For a quick extraction task, Textract remains a practical option. If the requirement is simply to read a DOCX file or pull text from a straightforward PDF, its single API can be convenient, particularly where plain text is enough, and the downstream process does not require reliable headings, tables or layout.

MarkItDown occupies a middle ground. It is lightweight, actively maintained by Microsoft and designed with LLM workflows in mind. It can produce Markdown that preserves more structure than plain text, making it useful for search, summarisation and note-taking systems without requiring the heavier processing associated with a document-understanding framework.

Docling is the strongest fit when the structure of a document is central to its meaning. Complex PDFs, detailed tables, heavily formatted reports, figure captions, formulas and multi-level document hierarchies are undoubtedly the kinds of material that can lose meaning when converted to plain text, and Docling's richer document representation and JSON export make it especially relevant for more demanding AI pipelines.

That additional capability comes at a cost: Docling is a heavier and more complex tool than Textract. MarkItDown may be easier to adopt for smaller projects where Markdown output is the main requirement, while Docling becomes more attractive when accuracy of structure matters more than simplicity.

Where Document Structure Carries the Meaning

Some document types illustrate more clearly than others why structure matters as much as content. Legal contracts, engineering specifications, academic papers, financial filings and policy documents are all examples of material where meaning is often distributed across headings, tables, footnotes, appendices and cross-references. A plain-text extraction from any of these can be difficult to use reliably because it may blur the boundary between sections and collapse the relationship between a table and the text that explains it.

A Markdown conversion may preserve enough structure for many search and summarisation tasks. A richer representation such as Docling's can be more suitable when table structure, reading order and document hierarchy need to be retained for reliable retrieval or knowledge-base construction.

This does not mean that one tool replaces all others. A lightweight converter may be preferable for routine ingestion of simple office files, while a structured parser may be selected for scanned PDFs, complex reports or multisection reference documents. The appropriate choice depends on the nature of the source material and the level of structure required downstream.
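
That rule of thumb can be written down as a small routing function. This is only a sketch of the reasoning in this piece, not a recommendation from any of the three projects, and the function name and its three flags are invented for the purpose.

```python
def suggest_tool(scanned: bool = False,
                 complex_tables: bool = False,
                 wants_markdown: bool = True) -> str:
    """Suggest a converter from a few rough traits of the source material."""
    if scanned or complex_tables:
        # Layout, OCR and table structure matter: use the structured parser
        return "docling"
    if wants_markdown:
        # Simple files where headings and lists should survive
        return "markitdown"
    # Plain text is enough
    return "textract"


print(suggest_tool(scanned=True))
print(suggest_tool())
print(suggest_tool(wants_markdown=False))
```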

Choosing Between Docling, MarkItDown and Textract

The current landscape reflects a broader shift from text extraction towards document understanding. Textract represents an older but still useful model, where the priority is to get plain text from many file types through a consistent interface. MarkItDown reflects the needs of LLM-era workflows by turning varied content into Markdown that is compact, readable and easier for language models to process.

Docling goes further by treating documents as structured objects rather than text containers. Its support for layout analysis, OCR, tables, reading order, formulas, figures, audio and specialised schemas makes it a more ambitious option for complex pipelines, and its ability to run locally also matters where privacy, security or regulatory constraints limit the use of external services.

For general users, the choice ultimately follows the complexity of the source material and what is expected of the output. Textract handles straightforward extraction well enough, while MarkItDown adds lightweight structure without much overhead, and Docling is the right tool when the document's own organisation needs to survive the conversion intact.

Altering table and hyperlink tags for single Grav articles using HTML post-processing

1st April 2026

This year, there have been a few entries on here regarding Grav because of my moving parts of my website estate to that content management system, first from Textpattern and latterly from WordPress. Once the second activity was completed, I then added an article on German public holidays elsewhere. That brought me to the topic of this piece: ensuring that some Markdown was rendered as required.

There were two parts to this: the styling of tables and the actions of hyperlinks. Each needs to be performed in a page template when all HTML has been initially rendered. Further processing then makes the required changes. Since this is a page template and not a partial template, you need to import a master template like this:

{% extends 'partials/base.html.twig' %}

Then, you go to the next stage, defining the content block within {% block content %}...{% endblock %} Twig tags:

{% set content = page.content
|replace({'<table>': '<table class="table mt-5 mb-5">'})
|replace({'<a href="http': '<a target="_blank" rel="noopener noreferrer" href="http'})
%}

The above reads in the page content (page.content) and does some text replacement operations. The first of these changes <table> to <table class="table mt-5 mb-5">, while the second replaces <a href="http with <a target="_blank" rel="noopener noreferrer" href="http. While my content was a mix of Markdown and HTML, depending on the article, the latter operation appeared to standardise every link.
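
To see which tags these two replacements catch and which they leave alone, the same substitutions can be mimicked outside Grav. The sketch below uses Python purely for illustration, with made-up sample HTML; on the site itself, it is the Twig above that does the work. The replacements are applied one after another here, which gives the same result in this case because the two search strings do not overlap.

```python
# The same search/replace pairs as in the Twig template
REPLACEMENTS = {
    '<table>': '<table class="table mt-5 mb-5">',
    '<a href="http': '<a target="_blank" rel="noopener noreferrer" href="http',
}


def post_process(html: str) -> str:
    """Apply each literal replacement to the rendered HTML."""
    for old, new in REPLACEMENTS.items():
        html = html.replace(old, new)
    return html


sample = (
    '<table><tr><td>1</td></tr></table>'
    '<a href="https://example.com">external</a>'
    '<a href="/about">internal</a>'
)
result = post_process(sample)
print(result)
```

The internal link is untouched because its href does not begin with http. The same would apply to any table or link whose opening tag already carries other attributes ahead of the matched text.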

Once the text replacement has been completed, the next step is to output the processed HTML like this:

{{ content|raw }}

This last line sits inside the {% block content %}...{% endblock %} block, coming after the set statement, since Twig does not allow a template that extends another one to output content outside its blocks. To send the processed output to the generated web page, you need to ensure that you are referring to the right variable, the local one called content and not page.content. The raw filter is also essential here: it stops Twig from escaping the markup when autoescaping is in effect, so the processed HTML is output as it is.

All of this effort ensures that straightforward Markdown can be used in content, while Grav does some extra work in the background to ensure that all is rendered without extra intervention. While there may need to be a certain level of standardisation to make this all work well, I find that it does what is needed, albeit in a different manner from the shortcode approach that you find in Hugo.

Displaying superscripted text in Hugo website content

6th January 2025

In a previous post, there was a discussion about displaying ordinal publishing dates with superscripted suffixes in Hugo and WordPress. Here, I go further with inserting superscripted text into Markdown content. Because of the default set up for the Goldmark Markdown renderer, it is not as simple as adding <sup>...</sup> constructs to your Markdown source file. That will generate a warning like this:

WARN Raw HTML omitted while rendering "[initial location of Markdown file]"; see https://gohugo.io/getting-started/configuration-markup/#rendererunsafe
You can suppress this warning by adding the following to your site configuration:
ignoreLogs = ['warning-goldmark-raw-html']

Because JavaScript can be added using HTML tags, there is an added security hazard that could be overlooked if you switch off the warning as suggested. Also, Goldmark does not interpret Markdown specifications of superscripting without an extension whose incorporation needs some familiarity with Go development.

That leaves using a Shortcode. These go into layouts/shortcodes under your theme area; the file containing mine got called super.html. The content is the following one-liner:

<sup>{{ .Get 0 | markdownify }}</sup>

This then is what is added to the Markdown content:

{{< super "th" >}}

What happens here is that the Shortcode picks up the content within the quotes and encapsulates it with the HTML superscript tags to give the required result. This approach can be extended for subscripts and other similar ways of rendering text, too. All that is required is a use case, and the rest can be put in place.

Easier to print?

20th February 2010

One matter that really came to light was how well or not the pages on here and on my hill walking and photography website came out on the printed page. After spotting a WordPress Codex article and with an eye on improving things, I have made a distinction between screen and print stylesheets. The code in the XHTML looks like this:

<link rel="stylesheet" href="/style.css" type="text/css" media="screen" />
<link rel="stylesheet" href="/style_print.css" type="text/css" media="print" />

The media attribute seems to be respected by the browsers that I have been using for testing (latest versions of Firefox, MSIE and Opera), so it then was a matter of using CSS to control what was shown and how it was displayed. Extraneous items like sidebars were excluded from the printed page in favour of the real content that visitors would be wanting anyway, and everything else was made as monochrome as possible, with images being the only things to escape. After all, people don't want to be wasting paper and ink in these cash-strained times, and there's no need to have any more colour than necessary either. Then, there's the distraction caused by non-functioning hyperlinks that has inspired the sharing of some wisdom on A List Apart. Returning to my implementation, please let me know in the comments what you think of what I have done on here and if there remains any room for improvement.

Eliminating Peekaboo content display problems in Internet Explorer

1st July 2008

Recently, I changed the engine of my online photo gallery to a speedier PHP/MySQL-based affair from its PHP/Perl/XML-powered predecessor. On the server side, all was well, but a peculiar display issue turned up in Internet Explorer (6, 7 & 8 were afflicted by this behaviour) where photo caption text on the thumbnail gallery pages was being displayed erratically.

As far as I can gather, the trigger for the behaviour was that the thumbnail block was placed within a DIV floated using CSS that touched another DIV that cleared the floating behaviour. I use a table to hold the images and their associated captions in place. Furthermore, each caption was also a hyperlink nested within a set of P tags.

The remedy was to set the CSS display property for the affected XHTML tag to a value of "inline-block". Within a DIV, TABLE, TR, TD, P and A tag hierarchy, finding the right tag where the CSS property in question has the desired effect took some doing. As it happened, it was the hyperlink (A) tag at the bottom of the stack that needed the fix.

Of course, it's all very fine fixing something for one browser, but it's worthless if it breaks the presentation in other browsers. In that vein, I did some testing in Opera, Firefox, Seamonkey and Safari to check if all was well and it was. There may be older browsers, like versions of IE before 6, where things don't appear as intended, yet I get the impression from my visitor statistics that the newer variants hold sway anyway. All in all, it was a useful lesson learnt, and that's never a bad thing.

java.net.MalformedURLException: unknown protocol: j

15th December 2007

While I know that there are better things to call a blog post than to use part of an error message that I got from Saxonica's Saxon when I was converting XML files into PHP equivalents for the visitor information section of my main website, it is handy for anyone else needing to look up a solution when they encounter it. In my case, I use the open source Saxon-B rather than the commercial Saxon-SA, and it fulfils all of my needs. Version 8 and later (it has now reached 9.0.0.2) handle the XSLT 2.0 features that I need to make the transformations really clever.

Also, because Saxon is available as a jar file, it is cross-platform so long as you have Java on board. There are, however, some slight differences in behaviour. Now, I run the thing on Linux, where any Windows-style file locations are not recognised. When I had the file path in a DTD declaration starting with J:\, that was thought to be a protocol like file, http, https, ftp and so on because of the colon. Since there's no j protocol, Java gets confused, issuing the rather obscure error that titles this post. Otherwise, the migration of the Perl script that creates XSLT files and fires off the required XML to PHP transformations was a fairly straightforward exercise once file locations and shebang line were set right.
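
The same ambiguity is easy to reproduce outside Java. The sketch below uses Python's standard URL parser, which is an analogy and not what Saxon does internally, and the file paths are made up for the example.

```python
from urllib.parse import urlsplit

# Hypothetical paths, standing in for the DTD location
windows_path = r"J:\schemas\gallery.dtd"
linux_path = "/home/user/schemas/gallery.dtd"

# Everything before the first colon is read as a URL scheme (protocol)
windows_scheme = urlsplit(windows_path).scheme
linux_scheme = urlsplit(linux_path).scheme

print(repr(windows_scheme))
print(repr(linux_scheme))
```

The drive letter is reported as a scheme called j, while the Linux-style path has no scheme at all, which is why setting the file locations right for the platform made the error go away.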

A selection of useful tools and technologies for contemporary web development

23rd March 2007

Having been on a web-building journey from Geocities to having a website with my own domain hosted by Fasthosts, it should come as no surprise that I have encountered a number of tools and technologies over this time and that my choices and knowledge have evolved too. I’ll muse over the technologies first before going on to the tools that I use.

Technologies

XHTML

When I started building websites, it was not long after HTML 4 got released, and I devoured most if not all of Elizabeth Castro’s Peachpit Visual Quickstart guide to the language within a weekend. Having previously used fairly primitive WYSIWYG tools like Netscape Composer and Claris Home Page, it was an empowering experience and the first edition (it is now on its third) of Jennifer Niederst Robbins’ Web Design in a Nutshell took things much further, becoming something of a bible for a number of years.

When it first appeared, XHTML 1.0 wasn’t a major change from HTML 4, but its stricter, more XML-compliant syntax was meant to point the way to the future and semantic markup was at its heart at least as much as it was for HTML 4. XHTML 2.0 is on the horizon and after the modular approach of XHTML 1.1 (which I have never used), it will be interesting to see how it develops. Nevertheless, there is a surprising development in that some people are musing over the idea of having an HTML 5. Let’s hope that the (X)HTML apple cart doesn’t get completely overturned after some years of relative stability. I still bear scars from the browser wars raging in the 1990s and don’t want to see standards wars supplanting the relative peace that we have now. That said, I don’t mind peaceful progression.

CSS

CSS only seems to be coming into its own in the last few years and is truly a remarkable technology despite the hobbles that MSIE places on our ambitions. CSS Zen Garden has been a major source of ideas; I wouldn’t have been able to customise this blog as much as I have without them. I was an early adopter of the technology and got burnt by inconsistent browser support; Netscape 4 was the proverbial bête noire back then, fulfilling the role that MSIE plays today. In those days, it was the idea of controlling text display and element backgrounds from a single place that appealed. Since then, I have progressed to using CSS to replace table-based layouts and to control element positioning. It can do more…

JavaScript

Having had a JavaScript-powered photo gallery before my current Perl-driven one, I can say that I have definitely sampled this ever-pervasive scripting language. Being a client-side language rather than a server-side one, it does place you rather at the mercy of the browser purveyors, and it never ceases to amaze me that there is a buzz around AJAX because of this. In fact, the abundance of AJAX cross-browser function libraries is testimony to the need for browser-specific code. Despite my preferences for server-side scripting, I still find a use for JavaScript, and its main use for me these days is to dynamically control CSS elements to do such things as control the height of a page element or whether it is shown or not. Apparently, CSS may get some dynamic capabilities in the future and reduce my dependence on JavaScript. Meanwhile, Jeremy Keith’s DOM Scripting (Friends of Ed) will prove as much of an asset as it has done.

XML

These days, a lot of the raw data underlying my personal website is stored in XML. I did try to dynamically transform the display of the XML into something meaningful with CSS and XSLT when I first scaled its dizzy heights, but I soon resorted to other techniques. Browser support and the complexity of what I required were the major contributors to this. The new strategy involved two different approaches. The first was to create PHP/XHTML pages from the precursor XML offline, and this is how I generate the website’s directory pages. The other one is to process the XML as text to dynamically supply an XHTML page as the user visits it; this is the way that the photo gallery works.

Perl

This still powers all of my photo gallery. While thoughts of changing it all to PHP linger, there is a certain something about the Perl language that keeps it there. I suppose it is that PHP is entangled in the HTML while Perl encases the whole business, and I am reasonably familiar with its syntax these days, which is why it still does a lot of the data processing grunt work that I need.

PHP

PHP is everywhere these days, though it doesn’t attract quite the level of hype that used to be the case. It still appears with its sidekick MySQL in many website applications. Blogging software such as WordPress and content management systems like Drupal, Mambo and Joomla! wouldn’t exist without the pair. It appears on my website as the glue that holds my visitor directories together and is the processing engine of my WordPress blog. And if I ever get to adding a Drupal element to the site, by no means a foregone conclusion though I am spending a lot of time with it at this time, PHP will continue its presence in my website scripting as it powers that too.

Applications

Macromedia HomeSite

I have a liking for hand coding, so this does most of what I need. When Macromedia (itself since taken over by Adobe, of course) took over Allaire, HomeSite sadly lost its WYSIWYG capability, but the application still soldiers on even though Dreamweaver offers a lot to code cutters these days. Nevertheless, it does have certain advantages over Dreamweaver: it is a fleeter beast to start up and colour codes Perl syntax.

Macromedia Dreamweaver

There was a time when Dreamweaver was solely a tool for visual web page development, but the advent of Dreamweaver UltraDev added server-side development capabilities to the Dreamweaver family. These days, there is only one Dreamweaver version, but UltraDev’s capabilities still live on in the latest version and I would not be surprised if they were taken further in these database-driven times.

Nowadays, Dreamweaver isn’t an application where I spend a great deal of time. In former times, when my site was made up of static HTML pages, I used Dreamweaver a lot, even if its rendering capabilities were a step behind the then-current browser versions. I suppose that it didn’t fit the way in which I worked, but its template-driven workflow would have been a boon back then.

However, my move from a static site to a dynamic one, starting with my photo gallery, has meant that I haven’t used it as much since then. That said, with my use of PHP/MySQL components on my site, its server-side abilities could get the level of investigation that its PHP/MySQL capabilities allow.

Altova XMLSpy Professional

Adding MySQL databases to my web hosting costs money, not a lot, but it could be spent on other (more important?) things. Hence, I use XML as the data store for my photo gallery and XML files are pre-processed into XHTML/PHP pages for my visitor directories before uploading onto the server.

I use XMLSpy to edit and manage the XML files that I use: its ability to view XML in grid format is a killer feature as far as I am concerned and XML validation also proves very useful, particularly when it comes to ensuring that DTDs and XML files are in step and for the correct coding of XSLT files. There are other features that I need to explore and that would also take my knowledge of XML further to boot, not at all a bad thing.

Saxon

For processing XML into another file format such as XHTML, you need a parser and I use the free version of Saxon to do the needful; Saxonica offers commercial versions of it too. There is, I believe, a parser in XMLSpy, but I don’t use it because Saxon’s command line interface fits better into my workflow. This is a Perl-driven process where XML files are read and XSLT files, one per XML file, are built before both are fed to Saxon for transforming into XHTML/PHP files. It all works smoothly and updating the XML inputs is all that is required.

AceFTP

If I were looking for an FTP client now, it would be FileZilla, but AceFTP has served me well over the last few years, and it looks as if that will continue. It does have some extra features over FileZilla: transfers between remote sites, and scheduling, for example. I have yet to use either, but they look valuable.

Hutmil

In bygone days when I had loads of static HTML files, making changes was a bit of a chore if they affected every single file. An example is changing the year on the copyright message on the page footers. Hutmil, which I found on a magazine cover-mounted disc, was a great time saver in those days. Today, I achieve this by putting this information into a single file and getting Perl or PHP to import that when building the page. The same “define once, use anywhere” approach underlies CSS as well, and scripting very usefully allows you to take that into the XHTML domain.
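
As a minimal sketch of the “define once, use anywhere” idea, the footer can live in one function that every page build calls. It is written in Python here purely for illustration (the site itself uses Perl or PHP for this), and the owner name, markup and function names are hypothetical.

```python
from datetime import date
from typing import Optional

# Single source of truth for footer details; "Example Site" is a placeholder.
SITE_OWNER = "Example Site"


def footer(year: Optional[int] = None) -> str:
    """Return the shared XHTML footer, defaulting to the current year."""
    if year is None:
        year = date.today().year
    return f'<div id="footer">&copy; {year} {SITE_OWNER}</div>'


def build_page(title: str, body: str, year: Optional[int] = None) -> str:
    """Assemble a page so that every page picks up the same footer."""
    return (
        "<html><head><title>" + title + "</title></head>\n"
        "<body>\n" + body + "\n" + footer(year) + "\n</body></html>"
    )
```

Changing the copyright wording or the year logic then means editing `footer` alone, and every page rebuilt afterwards picks up the change.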

Apache

Apache is ubiquitous these days, and both the online and offline versions of my site are powered by it. It does require some configuration, but it is a powerful piece of kit. The introduction of 2.2.x meant a big change in the way that configuration files were modularised: while most things were contained in a single file for 2.0.x, the settings are broken up into different files in 2.2.x, and it can take a while to find things again. Without having it on my home PC, I would not be able to use Perl, PHP or MySQL. Apart from this, I especially like its virtual host capability, which is very useful for offline development.

WordPress

My hosting supplier offers blogs on Blogware, but that didn’t offer the level of configuration that I would have liked. Admittedly, the same is probably true of any blog host: I can’t speak for Blogger, but WordPress.com does have its restrictions too. To make my hillwalking blog fit in with the appearance of my photo gallery, I popped over to WordPress.org to download WordPress so that I could host a blog myself and have maximum control over its appearance. WordPress supports themes, so I created my own and got my blog pages looking as if they are part of my website, rather than looking like something that was bolted on. Now that I think of it, what about WordPress.com supporting user-created themes? I suppose that there is the worry of insecure PHP code, but what about it?

MySQL

I am in two minds on whether this is a technology or a tool. SQL certainly would be a technology standard, but I am not so clear on what MySQL would be. In any case, I have classed it as a tool, and a very useful one at that. It is the linchpin for my WordPress blogs and, if I go for a content management system like Drupal, its role would surely grow. While I do have a lot of experience with using SAS SQL and this helps me to deal with other varieties, there is still a learning curve with MySQL that gets me heading for a good book. Kofler’s The Definitive Guide to MySQL 5 (Apress) seems to perform more than adequately in this endeavour.

Paint Shop Pro

As someone who hosts an online photo gallery, it won’t come as a surprise that I have had exposure to image editors. Despite various other flirtations, Paint Shop Pro has been my tool of choice over the years, but it is now set to be usurped by a member of Adobe’s Photoshop family. Paint Shop Pro does have books devoted to it, but it appears that Photoshop gets better coverage, and I feel that my image processing needs to be taken up a gear, hence the potential move to Photoshop.

HTML Tidy for Windows

22nd March 2007

Drupal has modules for importing static (X)HTML pages into its database (Import HTML and its helper Static HTML together make up one option), and that option needs HTML Tidy to work. Since I am playing with the thing on Windows, I went out and snagged the version for that OS. Being either lazy or bloody-minded, I tried an XHTML page with PHP code embedded in it and, needless to say, the thing choked. I must try it with plain XHTML instead.
