I’ve managed to get my FoafHarvester up and running, and it is currently gathering data for a FOAF exploration application I am building. All told, I picked up about 350 KB of FOAF data during the crawl. Other than a couple of minor glitches, it went quite well. The data is currently in an XML file; I’m now writing a program to load it all into a SQL Server database.
Some useful FOAF stuff. If you’re interested in learning more about FOAF, the best bet is the RDFWeb FOAF page; it links to all kinds of good stuff, so I won’t replicate it here. Well, apart from FOAFNaut and a presentation I found, “Photo RDF, Metadata and Pictures”, which talks a little about FOAF.
The time has finally come: my programming projects are getting a little unwieldy to manage, with version control consisting of backing up all the files now and again to a different directory. So I’ve finally got around to installing CVS on my computer. One of the advantages is that my project folders have suddenly become a lot more organised, as I can keep old projects in CVS and out of my frequently used folders.
After setting all this up I’m beginning to feel more like a serious programmer; my development tools are taking shape. As I am a bit of a geek, I’ll give you a rundown of my current programming setup.
- CVSNT – version control system.
- TortoiseCVS – Graphical CVS Interface for Windows.
- Textpad – Favourite text editor (It’s that good I even registered my copy!)
- .Net SDK – Big download, but the reference is invaluable, and the command line tools are good.
- Visual Studio 6 – I can’t afford VS.Net, as they haven’t released a student edition yet.
- Cygwin Bash Shell & NTEmacs – For when I want to do the Unix vibe.
- ActiveState Perl – Because it’s just so damn useful.
Add a few batch scripts I’ve written, a few custom macros and commands to integrate my text editor with the .Net command line tools and voila, a nice little development system.
I’ve been thinking about how to deal with the malformed XML feed problem that I was talking about a few days ago. After reviewing some of the information in the thesis on invalid HTML parsing, I have shifted my emphasis to solving the problem before the document is submitted to the XML processor.
Where’s the Focus? Most solutions that involve delivering valid RSS to an aggregator concentrate on the publishing tool that creates the document. This is undoubtedly the ideal solution, but how then are we going to get the Competitive Advantage that Mark discusses in his weblog? Simple: we need a tool that rescues the RSS file, nurses it back to health (or well-formedness), and then passes it to the XML processor for processing.
Let’s get past the First aid, shall we? What I want is a solution that doesn’t just put a band-aid on the problem but prescribes the medicine and puts it in a sling. Rather than solving each error as it appears, I think a better approach would be to use something analogous to HTMLTidy to sort out all the illegal characters, entities and missing end tags. Kevin talked about his take on “smart” parsing, which I mostly agree with, especially when he says:
this functionality is needed for all XML aggregation formats not just RSS. If we resort to using regular expressions for this task then we have to write a new set of regexes for every new type of XML format we want to process. If we resort to fixing errors in an ad hoc manner then we miss some of the elegance and formalism that comes with a well-structured library that deals with the problem.
In summary: let’s not break the XML parsers, let’s fix the XML. If we can do that on the server, so much the better; if we have to do it on the client, then let’s do it in a structured way that scales across formats, and let’s handle the errors gracefully. Let’s get well formed first, and then worry about validity.
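To make "fix the XML, not the parser" concrete, here is a minimal sketch in Python. It is not HTMLTidy and not anybody's published library: the names `repair` and `parse_leniently` are made up for illustration, and the only repairs attempted are stripping illegal control characters and escaping stray ampersands. Missing end tags are not handled at all. The point is only that the repair step never looks at element names, so the same code would apply to RSS, FOAF or any other XML vocabulary.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 forbids (everything below 0x20 except tab, LF, CR).
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# An ampersand that does not start a predefined entity or a numeric character reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def repair(text):
    """Format-independent repairs (hypothetical helper, far from a full tidy)."""
    text = ILLEGAL_CHARS.sub("", text)
    return BARE_AMPERSAND.sub("&amp;", text)


def parse_leniently(text):
    """Try the strict parser first; repair and retry only if it rejects the input.

    Returns (root_element, was_repaired). A document that is still broken
    after repair raises ET.ParseError as usual.
    """
    try:
        return ET.fromstring(text), False
    except ET.ParseError:
        return ET.fromstring(repair(text)), True


broken_feed = "<rss><channel><title>Fish & Chips\x07</title></channel></rss>"
root, was_repaired = parse_leniently(broken_feed)
print(was_repaired, root.find("channel/title").text)
```

One consequence of this particular choice: an HTML entity such as `&nbsp;` is not predefined in XML, so the sketch escapes its ampersand and the text comes out as a literal `&nbsp;`. The document becomes well formed, but a smarter tool would map such entities to the characters they stand for.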
The W3C has issued a last call on several RDF working drafts. I’ve been reading over the RDF Primer and getting more of a feel for how RDF can be used. One of the most interesting parts is section six, which introduces some existing RDF applications; it’s nice to see it being used for something other than RSS!
I really like this. One of my favourite blogs, SarahHatter.com, has just posted a really interesting entry. Anyway, it helps to break up the endless technical stuff I end up reading most of the time.
Ten fold strong with the Tech. Let’s get back on topic, shall we? I’ve been having a couple of issues implementing the functionality I want in my RSS feed. Basically, I was questioning whether I could use xml:lang to identify the language of my post descriptions. I found a well-thought-out answer over on RSS-Dev, along with some feedback on my idea of including namespaced XHTML in my RSS feed.
There has been some activity recently regarding liberal parsing of incorrect data. Mark Pilgrim published an article on O’Reilly about liberal parsing of RSS [additional comments]. At the same time I came across a pointer on Usenet to a master’s thesis on the parsing of incorrect HTML [PDF version, also in PS format].
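For anyone wondering what the xml:lang idea looks like in practice, here is a rough sketch in Python. The helper name `item_with_lang` and the sample text are made up, and the RSS elements are left un-namespaced to keep it short. It only shows how the attribute ends up in the markup; whether a given aggregator does anything with it is a separate question.

```python
import xml.etree.ElementTree as ET

# Namespace that the reserved "xml:" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def item_with_lang(title, description, lang):
    """Build an <item> whose <description> carries an xml:lang attribute."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    desc = ET.SubElement(item, "description")
    desc.set("{%s}lang" % XML_NS, lang)  # serialised as xml:lang="..."
    desc.text = description
    return item


item = item_with_lang("Hello", "A post written in British English.", "en-GB")
print(ET.tostring(item, encoding="unicode"))
```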
So now you’ve read all that you can see what my point is.
You can’t? OK, I’ll explain then. The master’s thesis presents an analysis of the number of incorrectly written HTML sites out there: from a representative sample of 2.4 million URIs [harvested from DMOZ], 14.5 thousand were valid, or 0.7%. I’ve included a data table based on the results of the analysis below. Luckily RSS isn’t in as bad a state as HTML, is it? Will the trend towards more liberal parsers lead to more authors not learning RSS and just crudely hacking it together, as happens with HTML at present? Does RSS need to be hackable in the way HTML is in order to gain wider acceptance?
| |Number of Documents|% Attempted Validations (2 dp)|% Total Requests (2 dp)|
|Invalid HTML documents| | | |
|Valid HTML documents| | | |
|(All) Grand Total| | | |
PS I’ve just worked out a few bugs in my weblog RSS feed, enjoy.
I’m currently building a robots.txt parser in C# for a project I’m running. It is quite interesting, as I have been meaning to get more into C# but have had difficulty finding the time. Luckily the grammar behind robots.txt files isn’t too difficult to parse, and I’m currently running a small-scale test.
Once I’m satisfied with how the parser works I’ll release the source code as a C# class. I am basing it broadly on the functionality offered by Python’s robotparser.
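For reference, this is roughly the functionality I mean, shown with the Python module itself (`urllib.robotparser` in Python 3). The robots.txt content, the example.org URLs and the use of "FoafHarvester" as a user-agent string are all invented for the example. The C# class does not exist yet, so treat this as a sketch of the behaviour to aim for rather than of its eventual interface.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt: everyone is kept out of /private/, BadBot is kept out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() takes an iterable of lines

# Ask whether a given user agent may fetch a given URL.
home_ok = parser.can_fetch("FoafHarvester", "http://example.org/index.html")
private_ok = parser.can_fetch("FoafHarvester", "http://example.org/private/foaf.rdf")
badbot_ok = parser.can_fetch("BadBot", "http://example.org/index.html")
print(home_ok, private_ok, badbot_ok)
```

The whole interface boils down to two steps: feed in the file, then ask a yes/no question per user agent and URL. That is the shape I would like the C# class to have as well.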
The two books upon which the light is balancing are Professional C# 2nd Edition and Professional ASP.Net 1.0, both from Wrox Press and both recommended by me.
There will be an anti-WYSIWYG backlash in corporate America when managers finally begin assessing the productivity hit they have taken in their engineering departments by allowing graphic illiterates to diddle fonts and push pixels instead of focusing on content.
From GUI – The Death of WYSIWYG. Interesting; just think of all the hours accountants have wasted on PowerPoint preparations…
In between my hectic revision schedule I’ve knocked up a fairly basic RSS scraper for this site. My CMS hasn’t gotten off the ground yet and I really wanted to begin providing an RSS feed for my weblog; I couldn’t wait any longer. I wrote it in C#. It isn’t tied into anything on the live site, as I run it on my local machine to generate the RSS file.
A few little bugs. There do seem to be a few teething problems with some RSS aggregators due to my use of content-encoded data (as per the RSS 1.0 content module spec). Anyway, I’ll iron out some of the issues as time goes on.
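To show what "content encoded" means here, below is a small Python sketch of an item carrying its HTML body in the content module's `content:encoded` element. This is not my scraper (that one is in C#). The helper name `item_with_encoded_body`, the sample post and the un-namespaced `item`, `title` and `link` elements are simplifications made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS 1.0 content module.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("content", CONTENT_NS)


def item_with_encoded_body(title, link, html_body):
    """Build an <item> with the post's HTML carried in <content:encoded>."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    encoded = ET.SubElement(item, "{%s}encoded" % CONTENT_NS)
    # The HTML is stored as plain text; the serialiser escapes < > and &,
    # so the aggregator has to unescape it before rendering.
    encoded.text = html_body
    return item


html = "<p>Fish &amp; chips</p>"
item = item_with_encoded_body("Lunch", "http://example.org/lunch", html)
serialised = ET.tostring(item, encoding="unicode")
print(serialised)
```

The escaped markup inside the element is a plausible source of teething problems: an aggregator that ignores the module, or unescapes the text one time too many or too few, will show raw tags or mangled entities.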