Liberal Parsing

There has been some activity recently around liberal parsing of incorrect data. Mark Pilgrim published an article on O'Reilly about liberal parsing of RSS [additional comments]. At the same time I came across a pointer on Usenet to a master's thesis on the parsing of incorrect HTML [PDF version, also in PS format].

So now that you've read all that, you can see what my point is.

You can't? OK, I'll explain then. The master's thesis analyses how much incorrectly written HTML is out there. From a representative sample of 2.4 million URIs [harvested from DMOZ], about 14.5 thousand documents were valid. That is 0.71% of the documents that could actually be validated, or 0.61% of all the URIs requested. I've included a data table based on the results of the analysis below. Luckily RSS isn't in as bad a state as HTML, is it? Will the trend towards more liberal parsers lead to more authors never learning RSS properly and just crudely hacking it together, as happens with HTML at present? Or does RSS need to be hackable in the way HTML is in order to gain wider acceptance?

Category                  Number of Documents   % of Attempted Validations (2 dp)   % of Total Requests (2 dp)
Invalid HTML documents              2,034,788                               99.29                        84.85
Not Downloaded                        225,516                                  NA                         9.40
Unknown DTD                           123,359                                  NA                         5.14
Valid HTML documents                   14,563                                0.71                         0.61
Grand Total                         2,398,226                              100.00                       100.00
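
For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes both percentage columns from the document counts in the table. The function and variable names are my own, and I'm assuming that "attempted validations" means the valid plus the invalid documents, which is the reading that matches the published figures.

```python
# Document counts copied from the table above.
counts = {
    "Invalid HTML documents": 2_034_788,
    "Not Downloaded": 225_516,
    "Unknown DTD": 123_359,
    "Valid HTML documents": 14_563,
}

# Assumption: only these two categories count as attempted validations.
VALIDATED = ("Invalid HTML documents", "Valid HTML documents")


def percentages(counts):
    """Return (grand total, {category: (% of attempted, % of total)})."""
    total = sum(counts.values())
    attempted = sum(counts[name] for name in VALIDATED)
    rows = {}
    for name, n in counts.items():
        # Categories that were never validated get None (NA in the table).
        pct_attempted = round(100 * n / attempted, 2) if name in VALIDATED else None
        pct_total = round(100 * n / total, 2)
        rows[name] = (pct_attempted, pct_total)
    return total, rows


total, rows = percentages(counts)
print(f"Grand Total: {total:,}")
for name, (pct_attempted, pct_total) in rows.items():
    attempted_text = "NA" if pct_attempted is None else f"{pct_attempted:.2f}"
    print(f"{name}: {attempted_text} / {pct_total:.2f}")
```

The two percentages for valid documents differ only in the denominator: 0.71% is out of the pages that got as far as validation, while 0.61% is out of every URI requested.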

PS: I've just worked out a few bugs in my weblog RSS feed. Enjoy.