Liberal HTML Parsing, Not Big, Not Clever

Liberal parsing has gathered some publicity recently, so it is worth revisiting the topic and reflecting on what problems invalid HTML actually causes. In short, once errors are tolerated, every client has to invent its own way of handling them, and that leads to incompatibilities and inconsistencies between clients, compounded by proprietary extensions. Interspersed amongst my own commentary below are a few quotes gathered from a recent e-mail exchange with Dagfinn R. Parnas, the author of a paper on HTML error correction.

After doing some research on the history of browsers and HTML, I really didn’t find it very surprising that the standards compliance was really awful. A wise man once said: “It takes two […] to lie, one to lie and one to listen”. If the early browsers hadn’t started with error-correcting behind the backs of the authors, a lot of the incorrect code would have disappeared shortly. But then again the Internet might not have been as diverse as it is now (writing valid HTML code takes much more skill than writing tag soup). I think the solution would have been a simple smiley face incorporated in the browser or another means of telling the user how standards-compliant the site was. Most web designers (both pro and novice) are very dedicated to making the best site they can, and clients would not be too happy about an angry smiley as they want a professional site. By the way, the wise man was Homer J. Simpson.

What are the most common causes of invalid HTML then?

Type of Error	% of documents with error	Example
missing end tag	41.9%	`<p>some text, <a href = "http://example.org"> link text </p>`
invalid end tag	45.8%	`<p>Some text </strong></p>`
invalid element content	38.3%	`<p>Some text <p>Some more text </p></p>`
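
To make the first two error categories concrete, here is a minimal sketch in Python of a checker that flags stray end tags and unclosed elements. It is built on the standard library's `html.parser.HTMLParser`; the class name `TagBalanceChecker`, the helper `check`, and the short `VOID_ELEMENTS` list are my own illustrative assumptions and not anything taken from the paper. It also assumes, as a simplification, that every non-void element needs an explicit end tag, which is stricter than HTML itself.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML (partial list, for illustration).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical checker: records stray end tags and unclosed elements."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.errors = []     # human-readable error descriptions

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # e.g. <br/> reports both a start and an end; ignore it
        if tag not in self.open_tags:
            # Nothing open with this name: the end tag is stray.
            self.errors.append(f"invalid end tag </{tag}>")
            return
        # Pop back to the matching element; anything popped on the way
        # was opened but never closed.
        while self.open_tags:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.errors.append(f"missing end tag for <{top}>")

    def close(self):
        super().close()
        # Whatever is still open at end of input was never closed.
        for tag in reversed(self.open_tags):
            self.errors.append(f"missing end tag for <{tag}>")
        self.open_tags.clear()


def check(markup):
    """Return the list of balance errors found in a markup string."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors


# Inputs modelled on the examples in the table above.
print(check('<p>some text, <a href="http://example.org"> link text </p>'))
print(check("<p>Some text </strong></p>"))
print(check("<p>Some text <p>Some more text </p></p>"))
```

The third input comes back clean, which shows the limit of this approach: the tags balance, so catching a paragraph nested inside a paragraph requires knowledge of each element's content model (the DTD), not just a stack of open tags.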

In addition to these basic errors, the spectre of browser-specific extensions raises its head.

70.8 and 23.9 percent [of HTML documents] have defined non-standard attributes and non-standard elements respectively.

This heavy use of proprietary markup significantly raises the barrier to entry for new browsers. A new entrant cannot rely on published standards alone; it must also examine a multitude of vendor sources to work out how to deal with these proprietary elements and attributes. By raising the cost of entry in this manner, the current browsers help to maintain their market-leading position.

As we move towards XML and the well-formed paradigm, let’s leave behind the legacy of invalid HTML and adopt the strictness of well-formed, valid XHTML. If you can produce valid XHTML, do it; if you can’t, then stick with HTML and don’t pollute XHTML the way we polluted HTML.
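
As a rough illustration of what that strictness means in practice, the sketch below runs a fragment through Python's standard `xml.etree.ElementTree` parser, which rejects anything that is not well-formed instead of guessing at a repair. The helper name `is_well_formed` is hypothetical, and the inputs are fragments modelled on the error examples earlier in this post, not complete XHTML documents.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return (True, None) if markup parses as XML, else (False, message)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        # A strict XML parser stops at the first error; it does not recover.
        return False, str(exc)
    return True, None


samples = [
    "<p>Some text <strong>bold</strong></p>",
    '<p>some text, <a href="http://example.org"> link text </p>',
    "<p>Some text </strong></p>",
    "<p>Some text <p>Some more text </p></p>",
]

for sample in samples:
    ok, message = is_well_formed(sample)
    print("well-formed" if ok else f"rejected ({message})", "|", sample)
```

Note that the last sample is accepted: it is well-formed XML even though a paragraph inside a paragraph is not valid XHTML. Well-formedness is only the first hurdle, and checking validity would additionally require a validating parser and the XHTML DTD.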