Building a Robots.txt parser

I’m currently building a robots.txt parser in C# for a project I’m running. It is an interesting exercise, as I have been meaning to get further into C# but have struggled to find the time. Luckily the grammar behind robots.txt files isn’t too difficult to parse, and I’m currently running a small-scale test.

Once I’m satisfied with how the parser works I’ll release the source code as a C# class. I am basing it broadly on the functionality offered by Python’s robotparser.
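For reference, this is roughly the behaviour I have in mind, shown with Python's own module (named robotparser in Python 2, urllib.robotparser in Python 3). It is only a sketch: the robots.txt content, the crawler names and the example.com URLs are made up for illustration, and the C# class may well end up with a different shape.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented purely for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_parser(text):
    """Feed robots.txt text to Python's parser without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# An ordinary crawler falls under the "*" record.
print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))
print(parser.can_fetch("MyCrawler", "http://example.com/private/page.html"))

# BadBot has its own record that disallows everything.
print(parser.can_fetch("BadBot", "http://example.com/index.html"))
```

The two calls that matter are parse(), which takes the lines of the file, and can_fetch(), which answers whether a given user agent may request a given URL.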

WYSIWYG = Productivity Loss

There will be an anti-WYSIWYG backlash in corporate America when managers finally begin assessing the productivity hit they have taken in their engineering departments by allowing graphic illiterates to diddle fonts and push pixels instead of focusing on content.

From GUI – The Death of WYSIWYG. Interesting; just think of all the hours accountants have wasted on PowerPoint preparations…

RSS Enabled

In between my hectic revision schedule I’ve knocked up a fairly basic RSS scraper for this site. My CMS hasn’t got off the ground yet, and I really wanted to begin providing an RSS feed for my weblog, so I couldn’t wait any longer. I wrote it in C#. It isn’t tied into anything on the live site; I run it on my local machine to generate the RSS file.

There are a few little bugs. Some RSS aggregators seem to have teething problems with my use of content-encoded data (as per the RSS 1.0 content module spec), but I’ll iron out the issues as time goes on.
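To show what content-encoded data looks like, here is a small Python sketch (not the C# scraper itself) that builds a single item carrying a content:encoded element. It is simplified: the RDF wrapper and the RSS 1.0 default namespace are left out, and the build_item helper, the title and the example.com link are hypothetical. The point is only that the HTML body has to be escaped when it is written out, which may or may not be what trips up the aggregators mentioned above.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS 1.0 content module.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("content", CONTENT_NS)


def build_item(title, link, html_body):
    """Build a simplified <item> with the entry's HTML in content:encoded."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    encoded = ET.SubElement(item, "{%s}encoded" % CONTENT_NS)
    # ElementTree escapes <, > and & when the tree is serialised,
    # so the HTML travels as text rather than as child elements.
    encoded.text = html_body
    return item


# Hypothetical entry, for illustration only.
item = build_item(
    "RSS Enabled",
    "http://example.com/rss-enabled",
    "<p>Hello &amp; welcome</p>",
)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

An aggregator that understands the module unescapes the element's text once and gets the original HTML back.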

A touch of real life

This website isn’t really about my life as such; the subject matter usually stays rooted in the technical stuff that I’m interested in. In a break from the normal run of conversation, though: I’m getting married soon! Things are going well and I’m looking forward to a nice wedding in a nice little Spanish town, followed by a religious ceremony in Madrid.

Glass chess pieces, the King and the Queen.

Tactics for Template Based Design

The recent get-together of UK usability professionals on accessibility is covered in isolani’s weblog entry about the accessibility meeting. The entry also links to Ian Lloyd’s presentation on the Nationwide Building Society website redesign. The presentation covers many interesting points; one that applies to most projects is the management of templates in a template-driven site. First of all, though, what happens when you mismanage templates? The quality of the original template can be compromised by later additions or modifications, and these can have ramifications unknown to a new user of the templates. Without clear guidelines on how to use the set of templates, the wrong template may be used for the wrong type of page.

Misuse of templates can have disastrous effects on both the usability and the accessibility of the site. Navigation may become inconsistent within subsections of the site, and someone with no accessibility training may use an HTML table in an inappropriate manner.

To guard against problems like this manifesting themselves there must be a form of management control exercised over the templates, especially in organisations where more than one person maintains them. A useful tool is to define and document the uses and properties of the templates explicitly. Some useful aspects to note are:

  1. What each template is for, and when to use it (e.g. “this is the subsection homepage template”).
  2. How the templates were constructed.
  3. What you can change on the template.
  4. What you should leave alone.
  5. Screenshots of how pages should appear in different browsers.

As with most things, it is important to have a level of modularity in your templates so that pages can be customised to a particular section without damaging the overall site. Make sure, though, that the extension mechanism is documented and understood. A useful home for the documentation is an intranet, where everyone can easily find out how to use the templates.

Remember that your set of site templates is where the usability and accessibility testing efforts are focused. Don’t let the investment in usable and accessible templates be wasted: protect them and promote them.

Semantic vs. Presentation

A recent post to a newsgroup I visit, CIWAH, sparked a small yet interesting debate concerning the differences and similarities between semantic and presentational markup in HTML. Daniel Tobias started the ball rolling with a short piece that covers a few of the well-worn points in this discussion. The article does, however, reinforce the need for people to understand what they are writing and not to abuse tags in a meaningless fashion.

In response to Dan’s post a more interesting point was raised by Jukka K. Korpela when he stated that:

Physical markup does not define the semantic meaning (except in the trivial sense where we might say that the visual presentation is the meaning), but it may carry a connotation, at least when taken in context.

In the discussion of logical versus presentational markup, I think the point Jukka makes is important, and it is one that proponents of semantic markup often gloss over. Recognising that web-based content may be delivered in varied contexts, and demonstrating that semantic markup is the best way of delivering it in those contexts, is key to convincing web developers to use semantic markup.

An example of why semantic markup is important may be found in this post itself: when quoting Jukka’s post I added structural HTML constructs to give emphasis to certain words. In many browsers these emphasised words will appear in boldface or italics, but how can non-visual browsers pass this information on to their users? First of all, I have used <strong> and <em> tags to mark up the emphasis rather than <b> or <i> tags. Because the tags I used are defined semantically, the relationship between the differently emphasised words is clear, and an aural browser can easily pass this information on. In contrast, the tags that merely make text boldfaced or italicised have no defined relationship with each other and are open to differing interpretations. By using semantic tags, the relative importance of the words is effectively gauged and can be taken into account during the presentation of the content.

The example I have given demonstrates one advantage that semantic (logical) markup has over presentational markup. The arguments supporting semantic markup have been enumerated as:

  1. Logical markup can be mapped to varying physical presentations depending on presentation medium.
  2. Logical markup can be automatically processed in a manner that is based on the defined logical meanings of elements.
  3. Logical markup leads to more flexible ways of affecting the visual presentation and creating alternative presentations.
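The first point in the list can be illustrated with a few lines of Python. This is a minimal sketch, assuming two made-up output media: the VISUAL and AURAL mappings, the bracketed cue strings and the render helper are all hypothetical, and a real aural browser would change voice or stress rather than insert text. The same <em> and <strong> markup is fed through both mappings.

```python
from html.parser import HTMLParser

# Hypothetical per-medium mappings from logical elements to (open, close) cues.
VISUAL = {"em": ("*", "*"), "strong": ("**", "**")}
AURAL = {
    "em": ("[emphasis]", "[/emphasis]"),
    "strong": ("[strong emphasis]", "[/strong emphasis]"),
}


class Renderer(HTMLParser):
    """Render logical markup using whichever mapping the medium supplies."""

    def __init__(self, mapping):
        super().__init__()
        self.mapping = mapping
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Elements without a mapping contribute no cue at all.
        self.parts.append(self.mapping.get(tag, ("", ""))[0])

    def handle_endtag(self, tag):
        self.parts.append(self.mapping.get(tag, ("", ""))[1])

    def handle_data(self, data):
        self.parts.append(data)


def render(html, mapping):
    renderer = Renderer(mapping)
    renderer.feed(html)
    renderer.close()  # flush any text still buffered by the parser
    return "".join(renderer.parts)


SAMPLE = "This is <em>important</em> and this is <strong>vital</strong>."
print(render(SAMPLE, VISUAL))
print(render(SAMPLE, AURAL))
```

Because the markup only says that one phrase is emphasised and another strongly emphasised, each medium is free to choose its own cue while keeping the relative weight of the two intact.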

Google vs. SearchKing

One of the big stories circulating recently is the legal wrangling between Google and SearchKing. In reading some of the commentary on the case, several aspects interested me, especially people’s apparent willingness to turn search engines into regulated utility companies. This subject has already been covered elsewhere:

Google is so important to the web these days, that it probably ought to be a public utility. Regulatory interest from agencies such as the FTC is entirely appropriate, but we feel that the FTC addressed only the most blatant abuses among search engines. Google, which only recently began using sponsored links and ad boxes, was not even an object of concern to the Ralph Nader group, Commercial Alert, that complained to the FTC.

In my opinion, however, such regulation should not be imposed upon these companies. What is often unacknowledged is that many of these internet “giants” are not just used by US citizens. I live in the UK, and I do not feel that the US system should place restrictions on Google that would adversely affect my search experience. In any case, the following quote echoes many of my own views.

It’s possible to read this case as a case about media regulation. Maybe Google is a common carrier; in agreeing to rank pages and index the Internet, it has (implicitly) agreed to abide by a guarantee of equal and non-discriminatory treatment. On this view, it would be immensely important whether Google devalued SearchKing specifically, or as part of a general algorithm tweak. A great deal may also hinge on whether you think that Google provides access to information or merely comments on it. SearchKing alleges the latter, and Google agrees, but maybe SearchKing should have brought its case by arguing that Google has become, in effect, a gatekeeper to Internet content. On that view, a low PageRank isn’t just an opinion, it’s also partly a factual statement that you don’t exist in answer to certain questions, on the basis that low search results are never seen. When was the last time you looked for results beyond 200 on a search request returning 20,000 pages?

These are very messy questions, but also very important ones. They’re also very unlikely to be addressed directly in the courtroom, in this case or in other cases. Existing law just comes down too squarely on Google’s side (I think) for courts to take these broader questions without mutilating our existing rules. Nor should they. Not everything should be settled in the courtroom, and the discussion about the proper role of search engines is one that needs to take place in the same place this case began, back before it was a lawsuit: out on the Internet, where people read and appreciate others’ thoughts, and then contribute their own by adding links. Among other things, Google is a device for determining the consensus of the Web; and it’s just not right to fix the process by which we determine consensus by any means other than honestly arriving at one.

Perhaps, as the internet and the information contained on it become more important to us as a society, the answers we think we already have will have to be re-evaluated.