RSS Tidy Anybody?

I’ve been thinking about how to deal with the malformed XML feed problem that I was talking about a few days ago. After reviewing some of the information in the thesis on invalid HTML parsing, I have shifted my emphasis to solving the problem before the document is submitted to the XML processor.

Where’s the Focus? Most solutions for delivering valid RSS to an aggregator concentrate on the publishing tool that creates the document. This is undoubtedly the ideal solution, but how are we going to get the Competitive Advantage that Mark discusses in his weblog? Simple: we need a tool that rescues the RSS file, nurses it back to health (or well-formedness), and then passes it to the XML processor.

Let’s get past the first aid, shall we? What I want is a solution that doesn’t just put a band-aid on the problem but prescribes the medicine and puts it in a sling. Rather than solving each error as it appears, I think a better approach would be to use something analogous to HTML Tidy to sort out all the illegal characters, entities and missing end tags. Kevin talked about his take on “smart” parsing, which I mostly agree with, especially when he says this functionality is needed for all XML aggregation formats, not just RSS. If we resort to regular expressions for this task, then we have to write a new set of regexes for every new XML format we want to process. If we resort to fixing errors in an ad hoc manner, then we miss the elegance and formalism that come with a well-structured library that deals with the problem.

In summary: let’s not break the XML parsers, let’s fix the XML. If we can do that on the server, so much the better; if we have to do it on the client, then let’s do it in a structured way that scales across formats and handles the errors gracefully. Let’s get well-formed first, and then worry about validity.
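To make the "tidy first, then parse" idea concrete, here is a minimal Python sketch. It is not a real RSS Tidy: the function names `tidy` and `parse_leniently` are made up for illustration, and the repair stage only handles two problems (characters that XML 1.0 forbids and ampersands that do not start a reference). Both repairs work at the XML level, so they apply to any feed format rather than to RSS alone. Missing end tags and undefined named entities are not repaired, and in those cases the parser error is simply passed on.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 does not allow anywhere in a document.
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# An ampersand that does not begin a character or entity reference.
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#x[0-9a-fA-F]+;|[A-Za-z_][\w.-]*;)")


def tidy(text):
    """Hypothetical repair stage: apply format-independent fixes only."""
    text = ILLEGAL_CHARS.sub("", text)
    return BARE_AMPERSAND.sub("&amp;", text)


def parse_leniently(text):
    """Return (root, repaired). The strict parser always has the final say."""
    try:
        return ET.fromstring(text), False
    except ET.ParseError:
        # Only touch the document if the strict parse has already failed.
        return ET.fromstring(tidy(text)), True


broken = "<rss><channel><title>Fish & Chips\x0b</title></channel></rss>"
root, repaired = parse_leniently(broken)
print(repaired, root.find("channel/title").text)
```

The XML parser itself stays strict. The repair step sits in front of it and is only used when the first parse fails, so well-formed feeds pass through untouched.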

Just some old school weblogging

The W3C released a final call on several RDF working drafts. I’ve been reading over the RDF primer and getting more of a feel for how RDF can be used. One of the most interesting parts is section six, which introduces some existing RDF applications; it’s nice to see it being used for something other than RSS!

I really like this. One of my favourite blogs, SarahHatter.com, has just posted a really interesting entry. Anyway, it helps to break up the endless technical stuff I end up reading most of the time.

Ten fold strong with the Tech. Let’s get back on topic, shall we? I’ve been having a couple of issues with implementing the functionality I want in my RSS feed. Basically, I was questioning whether I could use xml:lang to identify the language of my post descriptions. I found a well-thought-out answer over on RSS-Dev, along with some feedback on my idea of including namespaced XHTML in my RSS feed.

Liberal Parsing

There has been some activity recently regarding liberal parsing of incorrect data. Mark Pilgrim published an article on O’Reilly about liberal parsing of RSS [additional comments]. At the same time I came across a pointer on Usenet to a master’s thesis on the parsing of incorrect HTML [PDF version, also in PS format].

So now you’ve read all that you can see what my point is.

You can’t? OK, I’ll explain then. The master’s thesis presents an analysis of the number of incorrectly written HTML sites out there. From a representative sample of 2.4 million URIs [harvested from DMOZ], about 14.5 thousand were valid, which is 0.7% of the documents that could be validated and 0.6% of all requests. I’ve included a data table based on the results of the analysis below. Luckily RSS isn’t in as bad a state as HTML, is it? Will the trend towards more liberal parsers lead to more authors not learning about RSS and just crudely hacking it together, as happens with HTML at present? Does RSS need that ability to be hacked, as HTML can be, in order to gain wider acceptance?

Category                  Number of Documents  % of Attempted Validations (2 dp)  % of Total Requests (2 dp)
Invalid HTML documents              2,034,788                              99.29                       84.85
Not Downloaded                        225,516                                 NA                        9.40
Unknown DTD                           123,359                                 NA                        5.14
Valid HTML documents                   14,563                               0.71                        0.61
(All) Grand Total                   2,398,226                             100.00                      100.00
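The percentages can be recomputed from the document counts alone. The short Python sketch below does that, on the assumption (which the figures bear out) that "attempted validations" means the invalid plus the valid documents, and "total requests" means all four categories together. The variable and function names are only illustrative.

```python
# Document counts taken from the table.
counts = {
    "Invalid HTML documents": 2_034_788,
    "Not Downloaded": 225_516,
    "Unknown DTD": 123_359,
    "Valid HTML documents": 14_563,
}

total_requests = sum(counts.values())
# Only documents that were downloaded and had a known DTD could be validated.
attempted = counts["Invalid HTML documents"] + counts["Valid HTML documents"]


def percent(part, whole):
    """Percentage rounded to two decimal places, as in the table."""
    return round(100 * part / whole, 2)


print(total_requests)
print(percent(counts["Valid HTML documents"], attempted))
print(percent(counts["Valid HTML documents"], total_requests))
```

This is where the two "valid" figures come from: 0.71% of the documents that could be checked, but only 0.61% of everything that was requested.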

PS I’ve just worked out a few bugs in my weblog RSS feed, enjoy.

Building a Robots.txt parser

I’m currently building a robots.txt parser in C# for a project I’m running. It is quite interesting, as I have been meaning to get more into C# but have had difficulty finding the time. Luckily the grammar behind robots.txt files isn’t too difficult to parse, and I’m currently running a small-scale test.

After I’m satisfied with how the parser works I’ll release the source code as a class for C#. I am basing it broadly on the functionality offered by Python’s robotparser.
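For reference, this is roughly the functionality in question, shown with the standard library module as it exists in Python 3 (`urllib.robotparser`). The robots.txt content, the bot name and the URLs are invented for the example, and nothing here reflects how the C# class will look.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, fed in as text so that no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# "ExampleBot" is a hypothetical user agent name.
print(parser.can_fetch("ExampleBot", "http://example.com/private/page.html"))
print(parser.can_fetch("ExampleBot", "http://example.com/index.html"))
```

The interface is small: load the rules once, then ask whether a given user agent may fetch a given URL.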

WYSIWYG = Productivity Loss

There will be an anti-WYSIWYG backlash in corporate America when managers finally begin assessing the productivity hit they have taken in their engineering departments by allowing graphic illiterates to diddle fonts and push pixels instead of focusing on content.

From GUI – The Death of WYSIWYG. Interesting; just think of all the hours wasted by accountants on PowerPoint preparations…

RSS Enabled

In between my hectic revision schedule I’ve knocked up a fairly basic RSS scraper for this site. My CMS hasn’t gotten off the ground yet and I really wanted to begin providing an RSS feed for my weblog, so I couldn’t wait any longer. I wrote it in C#. It isn’t tied into anything on the live site, as I run it on my local machine to generate the RSS file.

A few little bugs. There do seem to be a few teething problems with some RSS aggregators due to my use of content:encoded data (as per the RSS 1.0 content module spec). Anyway, I’ll iron out some of the issues as time goes on.
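For anyone unfamiliar with the content module, the sketch below shows the general shape of an RSS 1.0 item that carries its HTML body in a content:encoded element. It is written in Python and is not the C# scraper itself; the function name `build_item` and the example link, title and body are made up. The namespace URIs are the ones defined for RSS 1.0, RDF and the content module.

```python
import xml.etree.ElementTree as ET

RSS_NS = "http://purl.org/rss/1.0/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Ask ElementTree to use the conventional prefixes when serialising.
ET.register_namespace("", RSS_NS)
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("content", CONTENT_NS)


def build_item(link, title, html_body):
    """Hypothetical helper: one RSS 1.0 item with a content:encoded body."""
    item = ET.Element(f"{{{RSS_NS}}}item", {f"{{{RDF_NS}}}about": link})
    ET.SubElement(item, f"{{{RSS_NS}}}title").text = title
    ET.SubElement(item, f"{{{RSS_NS}}}link").text = link
    # The HTML is stored as character data; the serialiser escapes the markup.
    ET.SubElement(item, f"{{{CONTENT_NS}}}encoded").text = html_body
    return item


item = build_item(
    "http://example.org/weblog/1",
    "RSS Enabled",
    "<p>Fish &amp; chips</p>",
)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

The markup inside content:encoded is escaped once more when the feed is written out, so an aggregator has to unescape it before rendering. That extra layer is one plausible source of the teething problems, although I have not confirmed it.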

A touch of real life

This website isn’t really about my life as such; the subject matter usually stays rooted in the technical stuff that I’m interested in. In a break from the normal run of conversation, though: I’m getting married soon! Things are going well and I’m looking forward to a nice wedding in a nice little Spanish town, followed by a religious ceremony in Madrid.

Glass chess pieces, the King and the Queen.

Tactics for Template Based Design

The recent get-together of UK usability professionals on accessibility is covered in isolani’s weblog entry about the meeting. The entry also contained a link to Ian Lloyd’s presentation on the Nationwide Building Society website redesign. The presentation covered many interesting points; one that applies to most projects is the management of templates in a template-driven site. First of all, though, what happens when you mismanage templates? The quality of the original template can be compromised by later additions or modifications, and these modifications can have ramifications that a new user is unaware of. Without clear guidelines on how to use the set of templates, the wrong template could be used for the wrong type of page.

Misuse of templates can have disastrous effects on the usability of the site: navigation may become inconsistent within subsections. It can also harm the accessibility of the site: someone with no accessibility training may use an HTML table in an inappropriate manner.

To guard against problems like this manifesting themselves there must be a form of management control exercised over the templates, especially in organisations where more than one person maintains them. A useful tool is to define and document the uses and properties of the templates explicitly. Some useful aspects to note are:

  1. What each template is for, and when to use it. (e.g. “this is the subsection homepage template”)
  2. How the templates were constructed.
  3. What you can change on the template.
  4. What you should leave alone.
  5. Screenshots of how pages should appear in different browsers.

As with most things, it is important to have a level of modularity in your templates so that pages can be customised for a particular section without damaging the overall site. Make sure, though, that the extension mechanism is documented and understood. A useful documentation system is an intranet where everyone can easily find out how to use the templates.

Remember that your set of site templates is where the usability and accessibility testing efforts are focused. Don’t let the investment in usable and accessible templates be wasted: protect them and promote them.