Liberal HTML Parsing, Not Big, Not Clever

Revisiting the topic of liberal parsing, which has gathered some publicity recently, it is worth reflecting on what problems invalid HTML actually causes. Suffice to say that tolerating errors, and the error handling that this tolerance requires, leads to incompatibilities and inconsistencies in error handling between clients, coupled with proprietary extensions. Interspersed amongst my own commentary below are a few quotes gathered from a recent e-mail exchange with Dagfinn R. Parnas, the author of a paper on HTML error correction.

After doing some research on the history of browsers and HTML, I really didn’t find it very surprising that the standards compliance was really awful. A wise man once said: “It takes two […] to lie, one to lie and one to listen”. If the early browsers hadn’t started with error-correcting behind the backs of the authors, a lot of the incorrect code would have disappeared shortly. But then again the Internet might not have been as diverse as it is now (writing valid HTML code takes much more skill than writing tag soup). I think the solution would have been a simple smiley face incorporated in the browser, or another means of telling the user how standards compliant the site was. Most web designers (both pro and novice) are very dedicated to making the best site they can, and clients would not be too happy about an angry smiley, as they want a professional site. By the way, the wise man was Homer J. Simpson.

What are the most common causes of invalid HTML then?

Type of error             % of documents with error   Example
Missing end tag           41.9%                       <p>some text, <a href="">link text </p>
Invalid end tag           45.8%                       <p>Some text </strong></p>
Invalid element content   38.3%                       <p>Some text <p>Some more text </p></p>
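For comparison, corrected versions of the three examples might look like this (a sketch only; the empty href value is just a placeholder):

```html
<p>Some text, <a href="">link text</a></p>  <!-- missing </a> end tag supplied -->
<p>Some text</p>                            <!-- stray </strong> end tag removed -->
<p>Some text</p><p>Some more text</p>       <!-- p elements cannot nest, so close the first -->
```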

In addition to these basic errors the spectre of browser-specific extensions raises its head.

70.8 and 23.9 percent [of HTML documents] have defined non-standard attributes and non-standard elements respectively.

This heavy use of proprietary markup significantly raises the barrier to entry for new browsers in the marketplace. A new entrant cannot rely on published standards, but must examine a multitude of vendor sources to understand how to deal with these proprietary elements. By raising the cost of entry in this manner, the current browsers help to maintain their market-leading position.

As we move towards XML and the well-formed paradigm, let’s leave behind the legacy of invalid HTML and adopt the strictness of well-formed, valid XHTML. If you can produce valid XHTML, do so; if you can’t, then stick with HTML, and don’t pollute XHTML as we did HTML.
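As a reminder of what that strictness looks like, a minimal well-formed and valid XHTML 1.0 Strict document follows (the title and paragraph text are placeholders):

```html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>A minimal valid document</title>
</head>
<body>
  <p>Every element is lower case, properly nested, and explicitly closed.</p>
</body>
</html>
```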

A Web site in Transition to a Web log (or the Opposite)

While checking my referrer logs the other day I came across a web log listing other blogs which discuss web standards or accessibility. What made me smile was the comment alongside the link to my site: “a web site in transition to a web log (or the opposite); info on CSS, WebDAV, HTML” (source).

As well as making me smile, it helped me to think about the direction of this site; hopefully a few of the new ideas this prompted will lead to a better site experience when I implement them.

Anyway, while browsing that list I came across a French site on web standards. Very nice looking, if only I could remember more of the French I learnt at school. If you speak French, then check this site out.

Attribution and the spreading of a meme

This is nothing more than a few observations concerning the significant increase in traffic I have been getting these last few days regarding my post that 99.29% of web sites are obsolete, based on the fact that only 0.71% of the sites tested were valid HTML.

  1. Dagfinn R. Parnas wrote a message about his masters thesis.
  2. I read the thesis, found it interesting, tabulated the results and blogged it.
  3. Got some normal linkage for a while, no major increases in traffic.
  4. Read an interesting W3C article about improving invalid sites.
  5. The article suggests running a survey to determine how many sites are valid; I commented that this had already been done, in the thesis.
  6. The short interval between my posting of that comment and Bill Mason’s tipping Zeldman off to my blog entry summarising the data leads me to conclude that Bill read my comment.
  7. The number of recorded hits to the site quadrupled (even more than after my last Zeldman mention).

So why mention this? Without permalinks and archives this information might not have spread as easily over the web; the number of referrers is growing rapidly as different sites pick up on it and comment. As I have noted before, when we create a mechanism by which knowledge can be perpetuated, we increase the likelihood that this knowledge will be built upon and more widely disseminated, especially when authors are liberal with their citations. The practice of citing references allows the web of interconnections to grow and knowledge to be more easily located.

Anyway I’ll leave you with something a little lighter to look over.

URNs and persistence

This is just a short note to state that I am alive and well! Seriously though, this is final exam season for me, alongside which I have been working hard on my final year project as well as doing some real work and travelling to Spain, again 🙂 Anyway, as the title suggests, this post is about URNs and persistence. How so? Well, I’ve recently written a new article explaining the benefits of URNs, their practical application, and how they can help web authors to preserve a measure of persistence in their citations.

I also received a rather interesting package yesterday from Microsoft, the contents of which will have to remain under wraps for a little while longer, until I launch the new project I am developing and can let the cat out of the bag.

[Photo: package from Microsoft]

Redesign Launched

The redesigned site has been launched! If the page looks a little bland then either you are using Netscape 4, in which case tough luck, or you have an old style sheet cached. To update your cached style sheet just reload this page. Please let me know what you think of the changes.

My frequently visited templates are unchanged; this redesign is based on the layout of template 2. The previous design was based on template 1. I have made these templates freely available for people to base their designs on, and to give them ideas of what CSS can do. Please do not copy the “style” of this redesigned version! A good example of using the templates effectively is given on the Rogue Librarian site: the template has been used as a base, but the finished design is definitely an individual work. Be individual, do not plagiarise!

The redesign was fairly basic, but there were quite a few changes that had to be factored in. Tim Bray recently wrote about refactoring software, and I had his words in the back of my mind while making numerous changes to the various bits and pieces that make this site what it is. There are some things that still need to be pushed out, but I didn’t want this redesign to follow my earlier abortive attempts by stalling before delivering it. When something works it is tempting to just leave it, or, as they say, “If it’s not broken, work on it.”

Very Busy…

Well I am quite busy hence the lack of an update for a little while. Exam times, coursework and work pressures are all mounting up but I am progressing on all fronts. This weblog has suffered, but hey it’s not my priority at the moment. Nevertheless once exam season is out of the way prepare for some more high tech tidbits and opinionated rants!

All has not been quiet on the web site front though; a redesign will be coming soon. Thanks to all those who commented on the prototype I did, IsoDavid and Leo, cheers. I might even get round to using proper headings for my blog posts soon 😉

Connected Computing

The new alpha release of Longhorn has stirred my interest again with a concept for managing contacts called the “My Contacts Library”. This replaces the Windows address book application, but uses a carousel concept that has been hinted at recently. As I stated a while ago, Longhorn seems to be crossing paths with some of the use cases covered by FOAF technology. The direction, and even the interface, seems to be getting closer and closer to what FOAF and its applications are doing.

The interface can be seen clearly in a mockup in Paul Thurrott’s review of one of the Longhorn alpha builds. To describe it textually, the interface has the individual at the centre, with contacts organised in concentric circles around them. Individual contacts can be grouped by user-defined categories, as Windows Messenger contacts can be. All in all, Microsoft seem to be using some UI ideas similar to FoafNaut and the Semaview FOAF browser. With the carousel interface and pivot-based viewing capability, I am starting to really look forward to the new version. I can understand why some people might think this has all gone a bit far; however, although I am a big fan of the Amiga, Linux, Mac OS etc., I really like working in XP, and Longhorn looks even better to me. Pivot views and filters aren’t for everyone, but I lived in Excel when I worked in finance and loved the flexibility and power that pivot tables gave the user (no, I wasn’t one of those losers who only uses 10% of Excel’s capabilities; I was a macro-writing pivot table freak!). Bringing this power to the desktop looks like a good move to me.

Oh, now that I’ve alienated most of my Linux-loving readers by praising Microsoft, I will say that I think FOAF and its contemporaries/replacements have a great opportunity: Longhorn isn’t coming to a computer near you for a while yet. Let’s make the most of that gap by creating applications that “normal people” can use now to manage their contacts in more powerful ways. Let’s face it, FOAF is a bit geeky at the moment; it’s a 0.1, after all! I can think of a few things that need to happen for something like FOAF to be more widely used:

  • Import from address books, such as the .wab format.
  • Introduce flexible categorisation of contacts.
  • Push the prototype use case implementations (ie co-depiction) onto the desktop.
  • Wow the consumer, not just the techies!
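For readers who haven’t run into it before, a minimal FOAF description is just a small RDF/XML file using the FOAF 0.1 vocabulary; a rough sketch (the names and address are placeholders):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Jane Example</foaf:name>
    <foaf:mbox rdf:resource="mailto:jane@example.org"/>
    <!-- foaf:knows is what gives tools like FoafNaut their social graph -->
    <foaf:knows>
      <foaf:Person>
        <foaf:name>John Example</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
```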

Well, there is my opinion piece for the day. If you are interested in researching more about FOAF, I suggest looking around the relevant category in the Open Directory or Google’s mirror of it. Paul Thurrott’s review of the Longhorn alpha contains plenty of information on the Windows side of the equation.

Problems with RSS as it is deployed

I have some longstanding issues with RSS, for example the method for RSS autodiscovery; however, the two most important problems with RSS as deployed are:

  1. Entity encoding in the <description> element.
  2. Resolving relative URLs.

As I use a decent news aggregator, I don’t suffer from the second problem. The first problem, however, is something that should interest us all. As Tim Bray notes, entity-encoding markup in the description element and then expecting the encoding to be resolved back is prone to errors. This is due to the under-specified nature of the various RSS branches, and to people crowbarring HTML (not necessarily well-formed XHTML fragments) into the early RSS deployments.

How to do it right: to include HTML in your RSS feed, there are a few steps you need to take.

Step 1: Convert to RSS 1.0 or RSS 2.0; the earlier 0.9x versions do not support what I am proposing here.
Step 2: Include the <encoded> element from the RSS 1.0 content module namespace; using the namespace prefix “content”, as in <content:encoded>, will work in more readers.
Step 3: Wrap your content in a CDATA section and put the result into the <content:encoded> element.
Step 4: Ensure the result is well-formed XML.

This solution ensures that the content is included in an element recognised as holding encoded data, rather than the much-abused description element. This is the method I use for my own feed, which you can take a look at to get some ideas.
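The four steps above produce an item along these lines (a sketch; the titles, links and entry text are placeholders, but the content module namespace URI is the real one):

```xml
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example feed</title>
    <link>http://example.org/</link>
    <description>A minimal feed illustrating content:encoded</description>
    <item>
      <title>Example entry</title>
      <link>http://example.org/entry</link>
      <!-- description stays plain text, with no markup to entity-encode -->
      <description>A plain-text summary of the entry.</description>
      <!-- the full markup goes here, wrapped in CDATA -->
      <content:encoded><![CDATA[
        <p>The <em>full</em> entry, as an XHTML fragment, goes here.</p>
      ]]></content:encoded>
    </item>
  </channel>
</rss>
```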

Using the Vera typeface with CSS

As I mentioned yesterday, a new typeface, Vera, has been released that is suitable for open source projects. To explore it a bit more fully I wrote a quick test page and compared the rendering of each of the fonts in my font viewer program with the fonts used by my web browser, Mozilla 1.3, to ensure that the right font was consistently applied. Below is a short list of each of the fonts and the CSS font selection properties that will request the desired font.

  • Bitstream Vera Sans.
    font-family:'Bitstream Vera Sans';
  • Bitstream Vera Sans Bold.
    font-family:'Bitstream Vera Sans'; font-weight:bold;
  • Bitstream Vera Sans Oblique.
    font-family:'Bitstream Vera Sans'; font-style:oblique;
  • Bitstream Vera Sans Bold Oblique.
    font-family:'Bitstream Vera Sans'; font-weight:bold; font-style:oblique;
  • Bitstream Vera Sans Mono.
    font-family:'Bitstream Vera Sans Mono';
  • Bitstream Vera Sans Mono Bold.
    font-family:'Bitstream Vera Sans Mono'; font-weight:bold;
  • Bitstream Vera Sans Mono Oblique.
    font-family:'Bitstream Vera Sans Mono'; font-style:oblique;
  • Bitstream Vera Sans Mono Bold Oblique.
    font-family:'Bitstream Vera Sans Mono'; font-weight:bold; font-style:oblique;
  • Bitstream Vera Serif.
    font-family:'Bitstream Vera Serif';
  • Bitstream Vera Serif Bold.
    font-family:'Bitstream Vera Serif'; font-weight:bold;

I’ve uploaded the test file I used for the typeface. It includes some lorem ipsum text to help get a better feel for the typefaces.
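Since few visitors will have Vera installed, it is worth pairing these family names with fallbacks in a style sheet; a sketch (the fallback families here are my own choices, not part of the Vera release):

```css
/* Request Vera Sans for body text, falling back to common sans-serif faces. */
body {
  font-family: 'Bitstream Vera Sans', Verdana, Helvetica, sans-serif;
}

/* Monospaced variant for code samples, falling back to the generic family. */
pre, code {
  font-family: 'Bitstream Vera Sans Mono', 'Courier New', monospace;
}
```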