XML subsets

I came across the concept of subsetting XML for ease of parsing by reading an article by Joe Gregorio called Regexable XML. It raises some interesting points and I would recommend you go over there and have a look around if that kind of thing interests you.

Total Information Awareness

I’ve stumbled across a few interesting articles recently regarding civil liberties. Relax, I’m not getting all warblogger; I am just interested in the technological and social ramifications of this stuff. In fact I might even take bets (if I were a betting man) on how much money the British government is going to spend while they screw up the identity card scheme they’re intent on introducing. Let’s face it, introducing new information technology is not this government’s strong point; cases in point are the Home Office and the MOD (check out the list of screw ups at the bottom of that last link). Read a report on system failure.

At least the Government has not openly introduced anything like the TIA system those lucky Americans are going to enjoy.

One benefit of the system is that it may discourage your population from wanting to learn more about the potential conflict in Iraq and its background from sources in the area, like this one. While reading comments elsewhere I came across this gem:

I’m curious to check out websites from around the world, especially in the middle east, to get a different view of what is going on, but am entirely too afraid that I may be black-listed or linked to a terrorist group.

But that’s OK, because if you are innocent you have nothing to fear, good citizen.

IAO Logo
Scientia Est Potentia, Knowledge Is Power.

It’s all about the questions

FOAF, great isn’t it? The reactions of those who know what FOAF is (let’s face the fact that it isn’t mainstream technology) are mixed; a common reaction is “I haven’t found a practical use for it yet”. There are some interesting memes floating around regarding interconnecting people using FOAF, RSS and other assorted metadata schemes. Application of these schemes is at a rudimentary stage at the moment (hence FOAF’s 0.1 version number). Whether FOAF wins out over later formats is not really of concern, to me at least; what is interesting is thinking about the information we want to get from the data that is provided.

Two aspects of the same question. This introduction of large amounts of both personal and content-based data leads to the question: who is the consumer? Two types of consumer are potentially interested in this data:

  • Data Miners
  • Geeks

One example of a potential data mining application is the sending of spam; luckily FOAF provides a means of hiding the email addresses of people who have FOAF data. However, think of the potential in tying together an email address with detailed information on a person’s interests. This is certainly possible with FOAF; although the likelihood (aka potential payoff) is probably too low for it to be contemplated at the moment, the potential is there.
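The hiding mechanism mentioned above is presumably FOAF's mbox_sha1sum property, which publishes a SHA-1 hash of the mailto: URI instead of the address itself. A minimal sketch of how such a value could be computed; the function name and the example address are made up for illustration:

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """Return the SHA-1 hex digest of the mailto: URI for an address.

    Assumption: this mirrors how foaf:mbox_sha1sum values are usually
    produced (hash of "mailto:" followed by the address).
    """
    mailto_uri = "mailto:" + email.strip()
    return hashlib.sha1(mailto_uri.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical address, used only to show the call.
    print(mbox_sha1sum("alice@example.org"))
```

The hash lets two FOAF files refer to the same person without exposing the address, but anyone who already holds an address can hash it and look for a match, so it hides the address from harvesters rather than from people who know it.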

An example for the second consumer is easier to come across: they are the creators of stuff like FOAF. There have been plenty of potential applications aired by those in the FOAF community, such as using it with your blogroll and just generally finding “friends”. I’ve been examining some ways of using FOAF data myself; I am currently running a FOAF harvesting robot to research potential applications. One possible application is the integration of FOAF-based data into the browsing environment.

<<< Start Vapourware Content >>>

Bring on the Vapourware. First I will state that I have no intention of building this system myself; I am far too busy concentrating on the programming I’m having to do to get myself a degree! Anyway, here is my idea: a FOAF viewing sidebar. A simple implementation exists already that can be used to find out more information on the author of the page. The way it finds the author information is not in widespread use, however (it uses meta information rather than link information to get the data). A more polished implementation that supports the current trend for <link> based referencing of FOAF files would be quite interesting.
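As a rough illustration of the <link> based lookup such a sidebar would need, here is a minimal sketch using only the Python standard library. It assumes the common convention of a link element with rel="meta" and type="application/rdf+xml"; the class and function names are hypothetical, and a real sidebar would still have to fetch and interpret the FOAF file it finds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FoafLinkFinder(HTMLParser):
    """Collect the href of every <link> that looks like a FOAF pointer."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.foaf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel_values = attributes.get("rel", "").lower().split()
        content_type = attributes.get("type", "").lower()
        href = attributes.get("href", "")
        # Assumed convention: rel="meta" plus an RDF/XML media type.
        if "meta" in rel_values and content_type == "application/rdf+xml" and href:
            self.foaf_urls.append(urljoin(self.base_url, href))


def find_foaf_links(html, base_url):
    """Return absolute URLs of FOAF files referenced from an HTML page."""
    finder = FoafLinkFinder(base_url)
    finder.feed(html)
    return finder.foaf_urls
```

Resolving the href against the page URL matters because many pages point at a relative path such as foaf.rdf.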

<<< End Vapourware Content >>>

Interesting newsreader

I’ve been investigating a few new newsreaders recently; although I’m reasonably happy with Aggie, it isn’t fitting into my workflow as nicely as I would like. I’ve been evaluating a couple of potential replacements, such as newsmonster, but the one that has me really interested is nntp//rss.

This newsreader ties together RSS and NNTP, which suits me well as I spend a lot of time reading emails and catching up with my newsgroups. The aggregation runs in the background as well, so altogether it doesn’t require as much effort (or another desktop icon), just like it should be.

One initial drawback was the lack of autodiscovery and integration with my web browser; however, I wrote a bookmarklet that takes care of that. So if you download and install nntp//rss, come back here and get my bookmarklet (subscribe). Unfortunately it won’t work well if you try to subscribe to different sites on the same host; however, I will work around that after I’ve had some sleep!

This bookmarklet was released for version 0.2 of nntp//rss. For further info on bookmarklets see bookmarklets.com

Just FOAFing around

I’ve managed to get my FoafHarvester up and running; it is currently gathering some data for a FOAF exploration application I am building. All told I picked up about 350KB of FOAF data during the crawl. Other than a couple of minor glitches it went quite well. The data is currently in an XML file, and I’m just writing a program to put it all into a SQL Server database.

Some useful FOAF stuff. If you’re interested in learning more about FOAF then the best bet is the RDFWeb FOAF page; this links to all kinds of good stuff so I won’t replicate it here. Well, apart from FOAFNaut, and this presentation I found, “Photo RDF, Metadata and Pictures”, that talks a little about FOAF.

Setting up a serious development system

The time has finally come: all my programming projects are getting a little bit unwieldy to manage, with version control consisting of backing up all the files now and again to a different directory. So I’ve finally got around to installing CVS on my computer. One of the advantages of this is that my project folders have suddenly become a lot more organised, as I can just keep old projects in CVS and out of my frequently used folders.

After setting all this up I’m beginning to feel more like a serious programmer; my development tools are taking shape. As I am a bit of a geek I’ll give you a rundown of what my current programming setup is like.

  • CVSNT – version control system.
  • TortoiseCVS – Graphical CVS Interface for Windows.
  • Textpad – Favourite text editor (It’s that good I even registered my copy!)
  • .Net SDK – Big download, but the reference is invaluable, and the command line tools are good.
  • Visual Studio 6 – I can’t afford VS.Net as they haven’t released a student edition yet.
  • Cygwin Bash shell and NTEmacs – For when I want to do the Unix vibe.
  • ActiveState Perl – Because it’s just so damn useful.

Add a few batch scripts I’ve written, a few custom macros and commands to integrate my text editor with the .Net command line tools and voila, a nice little development system.

RSS Tidy Anybody?

I’ve been thinking about how to deal with the malformed XML feed problem that I was talking about a few days ago. After reviewing some of the information contained in the thesis on invalid HTML parsing, I shifted my emphasis to solving the problem before the document is submitted to the XML processor.

Where’s the Focus? Most solutions that involve delivering valid RSS to an aggregator concentrate on the publishing tool that creates the document. This is undoubtedly the ideal solution, but how are we going to get the Competitive Advantage that Mark discusses in his weblog? Simple: we need a tool that rescues the RSS file, nurtures it back to health (or well-formedness) and then passes it to the XML processor.

Let’s get past the First aid shall we? What I want is a solution that doesn’t just try to put a band aid on the problem but prescribes the medicine and puts it in a sling. Rather than solving each error as it appears, I think a better approach would be to use something analogous to HTMLTidy to sort out all the illegal characters, entities and missing end tags. Kevin talked about his take on “smart” parsing, which I mostly agree with, especially when he says this functionality is needed for all XML aggregation formats, not just RSS. If we resort to using regular expressions for this task then we have to write a new set of regexes for every new type of XML format we want to process. If we resort to fixing errors in an ad hoc manner then we miss out on some of the elegance and formalism that comes with a well structured library that deals with the problem.

In summary: let’s not break the XML parsers, let’s fix the XML. If we can do that on the server so much the better; if we have to do it on the client then let’s do it in a structured way that scales across formats, and let’s handle the errors gracefully. Let’s get well formed first, and then worry about validity.
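To make the "fix the XML before the parser sees it" idea concrete, here is a minimal sketch in Python. It is not HTMLTidy and not anyone's published tool: the function names are hypothetical, and it only covers two character-level repairs (illegal control characters and bare ampersands) that are the same for every XML vocabulary. Missing end tags would need a proper tokenizer, which is exactly the part a structured library should own.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 does not allow (tab, LF and CR are legal).
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# An ampersand that does not begin a predefined entity or a character reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def tidy(text):
    """Apply format-independent repairs to a possibly malformed document."""
    text = ILLEGAL_CHARS.sub("", text)
    return BARE_AMPERSAND.sub("&amp;", text)


def parse_feed(text):
    """Parse strictly first; repair and retry only if that fails.

    Returns (root_element, was_repaired). A document that is still
    malformed after the repair pass raises ET.ParseError as usual.
    """
    try:
        return ET.fromstring(text), False
    except ET.ParseError:
        return ET.fromstring(tidy(text)), True
```

Trying the strict parse first means well-formed feeds are never touched, and the caller can log which feeds needed rescuing. One deliberate simplification: an undeclared HTML entity such as &nbsp; is escaped to literal text rather than translated.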

Just some old school weblogging

The W3C released a last call on several RDF working drafts. I’ve been reading over the RDF primer and getting more of a feel for how RDF can be used. One of the most interesting parts is section six, which introduces some existing RDF applications; it’s nice to see it being used for something other than RSS!

I really like this. One of my favourite blogs, SarahHatter.com, has just posted a really interesting entry. Anyway, it helps to break up the endless technical stuff I end up reading most of the time.

Ten fold strong with the Tech. Let’s get back on topic shall we? I’ve been having a couple of issues implementing the functionality I want in my RSS feed. Basically I was questioning whether I could use xml:lang to identify the language of my post descriptions. I found a well thought out answer over on RSS-Dev, along with some feedback on my idea of including namespaced XHTML in my RSS feed.
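For anyone wondering what the xml:lang idea looks like in practice, here is a small sketch that builds a feed item with a language-tagged description. The element names follow the usual RSS item layout, the helper name is made up, and whether a given RSS version or aggregator honours the attribute is a separate question that this does not settle.

```python
import xml.etree.ElementTree as ET

# The namespace that the reserved "xml" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_item(title, description, lang):
    """Build an <item> whose <description> carries an xml:lang attribute."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    desc = ET.SubElement(item, "description")
    desc.set("{%s}lang" % XML_NS, lang)
    desc.text = description
    return item


if __name__ == "__main__":
    # Hypothetical post content, used only to show the serialised form.
    item = make_item("Hello", "Bonjour tout le monde", "fr")
    print(ET.tostring(item, encoding="unicode"))
```

Because the xml prefix is reserved, the attribute needs no namespace declaration in the feed itself.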

Liberal Parsing

There has been some activity recently regarding liberally parsing incorrect data. Mark Pilgrim released an article on O’Reilly that talked about liberal parsing of RSS [additional comments]. At the same time I came across a pointer on Usenet to a master’s thesis on the parsing of incorrect HTML [PDF version, also in PS format].

So now you’ve read all that you can see what my point is.

You can’t? OK, I’ll explain then. The master’s thesis presents an analysis of the number of incorrectly written HTML sites out there: from a representative sample of 2.4 million URIs [harvested from DMOZ], 14.5 thousand were valid, which is 0.7% of the documents that could be validated (0.6% of all requests). I’ve included a data table based on the results of the analysis below. Luckily RSS isn’t in as bad a state as HTML, is it? Will the trend towards more liberal parsers lead to more authors not learning about RSS and just crudely hacking it together, as happens with HTML at present? Does RSS need that ability to be hacked like HTML can be in order to gain wider acceptance?

Category               | Number of documents | % of attempted validations (2 dp) | % of total requests (2 dp)
Invalid HTML documents | 2,034,788           | 99.29                             | 84.85
Not downloaded         | 225,516             | NA                                | 9.40
Unknown DTD            | 123,359             | NA                                | 5.14
Valid HTML documents   | 14,563              | 0.71                              | 0.61
(All) Grand total      | 2,398,226           | 100.00                            | 100.00
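The two percentage columns use different denominators, which is easy to miss. A short sketch that recomputes them from the counts in the table; the variable names are mine, and "attempted validations" is taken to mean the valid plus invalid documents, which is the reading the table's figures support:

```python
counts = {
    "Invalid HTML documents": 2_034_788,
    "Not downloaded": 225_516,
    "Unknown DTD": 123_359,
    "Valid HTML documents": 14_563,
}

# Every URI that was requested, whatever happened to it afterwards.
total_requests = sum(counts.values())

# Only documents that were actually run through the validator.
attempted = counts["Invalid HTML documents"] + counts["Valid HTML documents"]


def pct(part, whole):
    """Percentage of whole, rounded to two decimal places."""
    return round(100 * part / whole, 2)


if __name__ == "__main__":
    for name, count in counts.items():
        print(name, count, pct(count, total_requests))
    print("Valid, of attempted validations:", pct(counts["Valid HTML documents"], attempted))
```

So the often-quoted 0.7% is the share of validated documents; measured against every request the figure is closer to 0.6%.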

PS I’ve just worked out a few bugs in my weblog RSS feed, enjoy.