I’ve been thinking about how to deal with the malformed Xml feed problem that I was talking about a few days ago, after reviewing some of the information contained in the thesis on invalid HTML parsing I shifted my emphasis to solving the problem to before submitting the document to the XML processor.
Where’s the Focus? Most solutions that involve delivering valid RSS to an aggregator concentrate on the publishing tool that creates the document, this is undoubtedly the ideal solution, however how are we going to get the
Competitive Advantage that Mark discusses in his weblog? Simple, we need a tool that rescues the RSS file and nurtures it to health (or wellformdness) and then passes it into the XML processor for processing.
Lets get past the First aid shall we? What I want is a solution that doesn’t just try and put a band aid on the problem but prescribes the medicine and puts it in a sling. Rather than solving each error as it appears I think that a better approach would be to use something analogous to HTMLTidy to sort out all the illegal characters, entities and missing end tags. Kevin talked about his take on “smart” parsing which I mostly agree with, especially when he says
this functionality is needed for all XML aggregation formats not just RSS. If we resort to using regular expressions for this task then we have to write a new set of RegEx‘s for eery new type of Xml format we want to process. If we resort to fixing errors in an ad hoc manner then we are missing some of the elegance and formalism that comes with a well structured library that deals with the problem.
In summary: lets not break the Xml parsers, lets fix the Xml, if we can do that on the server so much the better, if we have to do it on the client then lets do it in a structured way that scales across formats, lets handle the errors gracefully. Let’s get well formed first, and then worry about validity.