Tuesday, May 11, 2010

Document and Concept: '#this' and how DBpedia does it

I'm following up on yesterday's post in which I looked at the distinction between 'concept' and 'document' as well as its implications for scholarly practice. To be honest, I'm not sure I've really addressed the scholarly practice aspect of this thread but that's where I'm heading. I'll give a preview at the very end of this post.

Yesterday I asked, "Is there an unambiguous and widely-accepted convention for indicating the concept lying behind a document?". Gabriel Bodard left a comment noting the convention of appending '#this' to indicate that a URI is a reference to the real-world concept rather than the document describing that concept. This is definitely worth considering.

As an aside, Gabby (if I may) is correct that it's hard to look for documentation of the convention since 'this' is understandably ignored by search engines. There's the W3C document 'Cool URIs for the Semantic Web', which does discuss '#this'. I'm not sure if that's the original citation but that title is definitely on the suggested reading list for this topic. So is 'Linked Data Tutorial - NG: Publishing and consuming linked data with RDFa', which I was reminded to look at anew by Sean Gillies.

I have reservations about '#this'. Some of them are aesthetic but that's not a strong leg to stand on. Practically, I don't like having to inspect the internal characters of a URI to figure out its semantics. I also wonder whether the convention has really taken off. The 'Linked Data Tutorial' was published after 'Cool URIs' so it may be telling that it doesn't discuss '#this'. I'm also not sure it's good to devote the '#' mechanism (aka fragment identifiers) to representing metadata rather than maintaining its original purpose of specifying internal portions of a document. But if '#this' comes to rule the world, I'll happily use it.
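To make the "inspect the internal characters" complaint concrete, here is a minimal Python sketch of what a consumer has to do under the '#this' convention. The function name and the example.org URI are my own placeholders, not part of any standard or library.

```python
from urllib.parse import urldefrag


def split_this(uri):
    """Return (document_uri, is_concept) under the '#this' convention.

    Assumption: a URI names the concept only when its fragment is
    exactly 'this'; anything else is treated as a document reference.
    """
    document_uri, fragment = urldefrag(uri)
    return document_uri, fragment == "this"


# Hypothetical URIs, for illustration only.
print(split_this("http://example.org/antioch#this"))
print(split_this("http://example.org/antioch"))
```

The point is only that the semantics live inside the string, so every consumer needs a small parser like this one.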

The 'Linked Data Tutorial' does use DBpedia in its examples so I want to look more closely at how that site handles the 'Document/Concept' distinction. In truth, I didn't find an explicit discussion of the topic on the DBpedia site itself. Maybe I just didn't come across it so I'd welcome a link. I did find the following on the OpenLink site: "the URI prefixes http://dbpedia.org/resource/..., http://dbpedia.org/page/... and http://dbpedia.org/data/... distinguish between a resource and its HTML or RDF description documents". OpenLink is the creator of Virtuoso, the software that powers DBpedia's SPARQL endpoint, so I'll take that statement as definitive until I find something more authoritative.

Time to get into details... http://dbpedia.org/resource/Antioch is the URI for the concept 'Antioch: the ancient city'. Clicking on that URI will cause your browser to be redirected to the document http://dbpedia.org/page/Antioch . That's great. We have a clean separation between concept and document.
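The prefix convention is simple enough to sketch in a few lines of Python. This is only an illustration of the naming scheme quoted from OpenLink; the function names are mine, and the code does not perform the HTTP redirect itself, which is the server's job.

```python
# The three prefixes quoted from the OpenLink site.
PREFIXES = {
    "resource": "http://dbpedia.org/resource/",  # the concept
    "page": "http://dbpedia.org/page/",          # HTML description
    "data": "http://dbpedia.org/data/",          # RDF description
}


def classify(uri):
    """Return (kind, name) for a DBpedia URI, or (None, None) if no prefix matches."""
    for kind, prefix in PREFIXES.items():
        if uri.startswith(prefix):
            return kind, uri[len(prefix):]
    return None, None


def page_for(resource_uri):
    """Build the HTML document URI that corresponds to a concept URI."""
    kind, name = classify(resource_uri)
    if kind != "resource":
        raise ValueError("not a DBpedia resource URI: " + resource_uri)
    return PREFIXES["page"] + name


print(classify("http://dbpedia.org/page/Antioch"))
print(page_for("http://dbpedia.org/resource/Antioch"))
```

Unlike '#this', the distinction here sits in the path prefix, so a consumer can tell concept from document without looking at the tail of the URI.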

Looking at the source of 'page/Antioch' (I'll use that shorthand going forward) shows that this document uses RDFa to embed semantic information in human-readable html. We could switch that around. RDFa allows human-readable text to be embedded in machine-parsable data. I'm not sure it matters, which is the main point.

DBpedia even references the RDFa 1.0 DTD: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">. That's very cool and very correct. When RDFa 1.1 is published, I'm counting on DBpedia to be at the forefront of adoption.

The 'resource/Antioch' URL appears three times in the 'page/Antioch' document. The following link elements are in the header:
  • <link rel="foaf:primarytopic" href="http://dbpedia.org/resource/Antioch"/>
  • <link rev="describedby" href="http://dbpedia.org/resource/Antioch"/>

The body start tag looks like this:
  • <body onload="init();" about="http://dbpedia.org/resource/Antioch">
Ignore the @onload; it's the @about that's interesting. It's just RDFa to say that all the parsable information in the document describes the resource http://dbpedia.org/resource/Antioch .

But far more interesting to me is the 'rev="describedby"' in the quoted link element of the document's head. Note that it's 'rev', not 'rel'. The meaning of the whole element is "The current document describes the resource at http://dbpedia.org/resource/Antioch". Yes, that's similar to the @about of the body. I really like the distinctiveness of using @rev . It's easily accessible by JavaScript or by an RDFa extractor. And I like that I can point to a major player in the Linked Data world as a precedent. That gives it the sense of a de facto standard. And a little googling of 'describedby' found instances on the W3C site. It seems it's not quite an officially accepted standard but, again, it's nice to see a major player possibly getting behind 'describedby'.
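To show how accessible the @rev link is, here is a rough Python sketch that pulls it (and the body's @about) out of a page using only the standard library. The class name is mine, and the sample markup is a trimmed-down stand-in built from the elements quoted above, not the full 'page/Antioch' source. A real RDFa extractor would do much more than this.

```python
from html.parser import HTMLParser


class DescribedResourceFinder(HTMLParser):
    """Collect <link rev="describedby"> targets and the <body about="..."> value."""

    def __init__(self):
        super().__init__()
        self.described = []  # hrefs of link elements with rev="describedby"
        self.about = None    # value of @about on the body element, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # @rev may hold several space-separated values; compare case-insensitively.
        rev_values = (attrs.get("rev") or "").lower().split()
        if tag == "link" and "describedby" in rev_values:
            self.described.append(attrs.get("href"))
        elif tag == "body" and attrs.get("about"):
            self.about = attrs["about"]


# Stand-in markup assembled from the elements quoted in this post.
SAMPLE = """<html><head>
<link rel="foaf:primarytopic" href="http://dbpedia.org/resource/Antioch"/>
<link rev="describedby" href="http://dbpedia.org/resource/Antioch"/>
</head>
<body onload="init();" about="http://dbpedia.org/resource/Antioch"></body></html>"""

finder = DescribedResourceFinder()
finder.feed(SAMPLE)
print(finder.described)
print(finder.about)
```

Both values should come back as the 'resource/Antioch' URI, which is the concept the document claims to describe.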

So it's worth asking if this is a convention that others might be willing to adopt. Any takers or comments? Is @rev too obscure? Other objections?

I also want to briefly point out that the DBpedia 'page/...' documents make some effort to be clear to human readers that they are describing resources. The link at the top of 'page/Antioch' is to 'resource/Antioch'. This could be clearer but it's a start.

And as for scholarly practice, I'll just briefly say that this discussion is in part inspired by the observation that Concepts should be permanent while Documents may be temporary. Looking back to the Geonames discussion of yesterday, I will not hold it against geonames.org if it stops responding to the URL http://www.geonames.org/3020251/embrun.html . Maybe html will fall out of use someday. It will be annoying, though, if the string of characters http://sws.geonames.org/3020251/ ceases to mean anything. Actually, I wish they'd remove the 'sws' cruft from that URL but that's their choice. Scholarship likes permanence and to the extent that the distinction between document and concept is clearly maintained, scholarly practice will be well served.
