Liberal Parsing

There has been some activity recently regarding liberally parsing incorrect data. Mark Pilgrim released an article on O’Reilly that talked about liberal parsing of RSS [additional comments]. At the same time I came across a pointer on Usenet to a master’s thesis on the parsing of incorrect HTML [PDF version, also in PS format].

So now you’ve read all that you can see what my point is.

You can’t? OK, I’ll explain then. The master’s thesis presents an analysis of how much incorrectly written HTML is out there: from a representative sample of 2.4 million URIs [harvested from DMOZ], only 14.5 thousand were valid, which is 0.7% of the documents that could be validated and 0.6% of all requests. I’ve included a data table based on the results of the analysis below. Luckily RSS isn’t in as bad a state as HTML, is it? Will the trend towards more liberal parsers lead to more authors not learning about RSS and just crudely hacking it together, as happens with HTML at present? Does RSS need that ability to be hacked, like HTML can be, in order to gain wider acceptance?

Category                  Number of documents   % of attempted validations (2 dp)   % of total requests (2 dp)
Invalid HTML documents    2,034,788             99.29                               84.85
Not downloaded            225,516               NA                                  9.40
Unknown DTD               123,359               NA                                  5.14
Valid HTML documents      14,563                0.71                                0.61
Grand total (all)         2,398,226             100.00                              100.00
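
For anyone who wants to check the two percentage columns, here is a quick sketch of the arithmetic in Python. It assumes that "attempted validations" means the invalid plus the valid documents (the two rows that don’t show NA); that reading is mine, inferred from the table, and the variable names are made up for illustration.

```python
# Document counts as given in the table above.
counts = {
    "Invalid HTML documents": 2_034_788,
    "Not Downloaded": 225_516,
    "Unknown DTD": 123_359,
    "Valid HTML documents": 14_563,
}

# Every URI that was requested, whether or not it could be validated.
total_requests = sum(counts.values())

# Assumption: only the invalid and valid rows count as attempted validations.
attempted = counts["Invalid HTML documents"] + counts["Valid HTML documents"]


def pct(part, whole):
    """Return `part` as a percentage of `whole`, rounded to 2 decimal places."""
    return round(100 * part / whole, 2)


valid_of_attempted = pct(counts["Valid HTML documents"], attempted)
valid_of_total = pct(counts["Valid HTML documents"], total_requests)
invalid_of_total = pct(counts["Invalid HTML documents"], total_requests)

print(total_requests, valid_of_attempted, valid_of_total, invalid_of_total)
```

The difference between the two denominators is why the "valid" share can be quoted as either roughly 0.7% or roughly 0.6%.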

PS: I’ve just worked out a few bugs in my weblog RSS feed. Enjoy.

Building a Robots.txt parser

I’m currently building a robots.txt parser in C# for a project I’m running. It is quite interesting, as I have been meaning to get more into C# but have had difficulty finding the time. Luckily the grammar behind robots.txt files isn’t too difficult to parse, and I’m currently running a small-scale test.

Once I’m satisfied with how the parser works I’ll release the source code as a C# class. I am basing it broadly on the functionality offered by Python’s robotparser.
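
To give an idea of the functionality I’m aiming for, here is a minimal sketch of Python’s robotparser in use (in current Python it lives in `urllib.robotparser`). The robots.txt content, the `ExampleBot` user agent and the example.com URLs are all made up for illustration; this is not my C# class, just the Python module it is modelled on.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) answers "may this agent request this URL?"
allowed_public = parser.can_fetch("ExampleBot", "http://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")

print(allowed_public, allowed_private)
```

The core of the interface is small: feed in the rules, then ask whether a given user agent may fetch a given URL.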

WYSIWYG = Productivity Loss

There will be an anti-WYSIWYG backlash in corporate America when managers finally begin assessing the productivity hit they have taken in their engineering departments by allowing graphic illiterates to diddle fonts and push pixels instead of focusing on content.

From GUI – The Death of WYSIWYG. Interesting; just think of all the hours wasted by accountants on PowerPoint preparations…

RSS Enabled

In between my hectic revision schedule I’ve knocked up a fairly basic RSS scraper for this site. My CMS hasn’t gotten off the ground yet, and I really wanted to begin providing an RSS feed for my weblog, so I couldn’t wait any longer. I wrote it in C#. It isn’t tied into anything on the live site, as I run it on my local machine to generate the RSS file.

A few little bugs: there do seem to be some teething problems with some RSS aggregators due to my use of content encoded data (as per the RSS 1.0 content module spec). I’ll iron out the issues as time goes on.
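
For the curious, this is roughly what an aggregator has to cope with. The sketch below is in Python rather than the C# I used for the scraper, and the sample item is invented and simplified (a real RSS 1.0 item also carries an `rdf:about` attribute and sits inside an RDF document). It assumes the content module’s `content:encoded` element with its usual namespace URI.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the RSS 1.0 content module.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Invented, simplified sample item; not taken from my real feed.
SAMPLE_ITEM = """\
<item xmlns="http://purl.org/rss/1.0/"
      xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <title>Example entry</title>
  <content:encoded><![CDATA[<p>Hello <em>world</em></p>]]></content:encoded>
</item>
"""

item = ET.fromstring(SAMPLE_ITEM)

# ElementTree addresses namespaced elements as {namespace-uri}localname.
encoded = item.find("{%s}encoded" % CONTENT_NS)

# The text of the element is the entry's HTML, carried as character data.
html_body = encoded.text

print(html_body)
```

An aggregator that doesn’t know about the content namespace will simply never look for this element, which may be one source of the teething problems.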

A touch of real life

This website isn’t really about my life as such; the subject matter usually stays rooted in technical stuff that I’m interested in. In a break from the normal run of conversation, though: I’m getting married soon! Things are going well and I’m looking forward to a nice wedding in a nice little Spanish town, followed by a religious ceremony in Madrid.

Glass chess pieces, the King and the Queen.

Tactics for Template Based Design

The recent get-together of UK usability professionals concerning accessibility is covered in isolani’s weblog entry about the accessibility meeting. The blog entry also contained a link to Ian Lloyd’s presentation on the Nationwide Building Society website redesign. This presentation covered many interesting points; one that is applicable to most projects is the management of templates in a template-driven site.

First of all though, what happens when you mismanage templates? The quality of the original template can often be compromised by later additions or modifications, and these modifications can have ramifications unknown to a new user. Without clear guidelines on how to use the set of templates, the wrong template could be used for the wrong type of page.

Misuse of templates can have disastrous effects on both the usability of the site (navigation may become inconsistent within subsections of the site) and its accessibility (someone with no accessibility training may use an HTML table in an inappropriate manner).

To guard against problems like this manifesting themselves there must be a form of management control exercised over the templates, especially in organisations where more than one person maintains them. A useful tool is to define and document the uses and properties of the templates explicitly. Some useful aspects to note are:

  1. What each template is for, and when to use it. (e.g. “this is the subsection homepage template”)
  2. How the templates were constructed.
  3. What you can change on the template.
  4. What you should leave alone.
  5. Screenshots of how pages should appear in different browsers.

As with most things, it is important to have a level of modularity in your templates to enable pages to be customised to a particular section without damaging the overall site. Make sure though that the extension mechanism is documented and understood. A useful documentation system is an intranet where everyone can easily find out how to use the templates.

Remember, your set of site templates is where the usability and accessibility testing efforts are focused. Don’t let the investment in usable and accessible templates be wasted: protect them and promote them.

Semantic vs. Presentation

A recent post to a newsgroup I visit, CIWAH, sparked a small yet interesting debate concerning the differences and similarities between semantic and presentational markup in HTML. Daniel Tobias started the ball rolling with a short piece that covers a few of the well-worn points in this discussion. The article does, however, reinforce the need for people to understand what they are writing and not to abuse tags in a meaningless fashion.

In response to Dan’s post a more interesting point was raised by Jukka K. Korpela when he stated that:

Physical markup does not define the semantic meaning (except in the trivial sense where we might say that the visual presentation is the meaning), but it may carry a connotation, at least when taken in context.

In the discussion of logical vs. presentational markup I think that the point Jukka makes is important, as well as one that is often glossed over by proponents of semantic markup. Recognising that web-based content may be delivered in varied contexts, and demonstrating that semantic markup is the best method for delivering it in those contexts, is key to convincing web developers to use semantic markup.

An example of why semantic markup is important may be found in this post itself. When quoting Jukka’s post I added structural HTML constructs to give emphasis to certain words. In many browsers these emphasised words will appear in boldface or italics, but how can non-visual browsers pass this information on to their users? Well, first of all I have used <strong> and <em> tags to mark up the emphasis rather than <b> or <i> tags. Because the tags I used are defined semantically, the relationship between the differently emphasised words is clear, and an aural browser will be able to pass this information on easily. In contrast, the tags that merely make the text boldfaced or italicised have no defined relationship with each other and are open to differing interpretations. By using semantic tags, the relative importance of the words is effectively gauged and can be taken into account during the presentation of the content.

The example I have given demonstrates one advantage semantic (logical) markup has over presentational markup. An enumeration of the arguments supporting semantic markup has been proposed as:

  1. Logical markup can be mapped to varying physical presentations depending on presentation medium.
  2. Logical markup can be automatically processed in a manner that is based on the defined logical meanings of elements.
  3. Logical markup leads to more flexible ways of affecting the visual presentation and creating alternative presentations.
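
The first point in that list can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `VISUAL` and `AURAL` style tables and their markers are invented, and no real aural browser works from a lookup table this crude. It does show the same logical markup being mapped to two different presentations.

```python
from html.parser import HTMLParser

# Hypothetical per-medium style tables; the markers are invented for illustration.
VISUAL = {"em": ("_", "_"), "strong": ("*", "*")}
AURAL = {
    "em": ("[stress] ", " [end stress]"),
    "strong": ("[strong stress] ", " [end strong stress]"),
}


class Renderer(HTMLParser):
    """Replace logical tags with the open/close markers of one medium."""

    def __init__(self, styles):
        super().__init__()
        self.styles = styles
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Tags without an entry in the style table contribute nothing.
        self.out.append(self.styles.get(tag, ("", ""))[0])

    def handle_endtag(self, tag):
        self.out.append(self.styles.get(tag, ("", ""))[1])

    def handle_data(self, data):
        self.out.append(data)


def render(markup, styles):
    """Render `markup` for the medium described by `styles`."""
    renderer = Renderer(styles)
    renderer.feed(markup)
    renderer.close()  # flush any text still buffered by the parser
    return "".join(renderer.out)


SOURCE = "This is <em>important</em>"
visual_text = render(SOURCE, VISUAL)
aural_text = render(SOURCE, AURAL)

print(visual_text)
print(aural_text)
```

The markup never changes; only the style table does. With <b> and <i> there would be nothing meaningful to put in the aural table, because those tags say nothing beyond how the text should look.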