Thursday, February 7, 2008

PRAP Images: From Join Table to Containment

In my ongoing effort to create an archival version of PRAP's field data, I'm looking at how we dealt with image metadata and what we want to do now. A basic issue with the photographic record of an archaeological field project is that each photo can be of multiple subjects and each potential subject - a site, a tract, a piece of pottery - can appear in multiple photos. Photograph 203.29 is an example.

These relationships aren't hard to handle adequately in a modern database using a straightforward many-to-many structure. At PRAP we had a table listing photographs and a separate table listing the identifiable entities that appeared in each photograph. The latter can be called a join table. A schematic representation of the content of image 203.29 would be as follows:

Photographs
id: 203.29
filmtype: color slide
caption: byz sherds from area B.

Image<->Subject Join Table
photo_id: 203.29
subject_id: B92-181-02
position: left

photo_id: 203.29
subject_id: B92-186-03
position: right

Sherds
id: B92-181-02
<info about sherd>

id: B92-186-03
<info about sherd>

I'm skipping over a whole lot of detail, but I hope it's clear that this establishes that sherd B92-181-02 is the object at the left of photo 203.29. You can see this information put into action with further links on the PRAP website, which is serving the FileMaker versions of our databases.
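To make the join-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the FileMaker setup we used in the field: the table names (photographs, sherds, image_subject) and the helper functions are made up for illustration, and the second sherd id follows the photo log entry for 203.29 (B92-186-03).

```python
import sqlite3


def build_db():
    """Create an in-memory database holding the three schematic tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE photographs (id TEXT PRIMARY KEY, filmtype TEXT, caption TEXT);
        CREATE TABLE sherds (id TEXT PRIMARY KEY);
        -- The join table: one row per (photo, subject) pairing.
        CREATE TABLE image_subject (
            photo_id   TEXT REFERENCES photographs(id),
            subject_id TEXT REFERENCES sherds(id),
            position   TEXT
        );
        """
    )
    conn.execute(
        "INSERT INTO photographs VALUES (?, ?, ?)",
        ("203.29", "color slide", "byz sherds from area B."),
    )
    conn.executemany(
        "INSERT INTO sherds VALUES (?)", [("B92-181-02",), ("B92-186-03",)]
    )
    conn.executemany(
        "INSERT INTO image_subject VALUES (?, ?, ?)",
        [("203.29", "B92-181-02", "left"), ("203.29", "B92-186-03", "right")],
    )
    return conn


def subjects_in_photo(conn, photo_id):
    """Return (subject_id, position) pairs for one photograph."""
    rows = conn.execute(
        "SELECT subject_id, position FROM image_subject "
        "WHERE photo_id = ? ORDER BY subject_id",
        (photo_id,),
    )
    return rows.fetchall()


def photos_of_subject(conn, subject_id):
    """Return the ids of every photograph in which a subject appears."""
    rows = conn.execute(
        "SELECT photo_id FROM image_subject WHERE subject_id = ? ORDER BY photo_id",
        (subject_id,),
    )
    return [row[0] for row in rows]
```

The same join table answers both questions: what is in a photo, and which photos show a given sherd.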

At this point I should say that it was Debi Harlan, now of the excellent ArchAtlas project, and I who implemented this system in the field.

One way to represent this many-to-many concept in XML is just to wrap markup around the records of the join table and leave it at that. For example:


Image<->Subject Join
<div class="imagelink">
<span property="image" src="prap:image:203.29" />
<span property="subject" src="prap:pottery:B92-181-02" />
<span property="position">Left</span>
</div>
<div class="imagelink">
<span property="image" src="prap:image:203.29" />
<span property="subject" src="prap:pottery:B92-186-03" />
<span property="position">Right</span>
</div>

Image Info
<div class="image" id="prap:image:203.29">
<span property="filmtype">CS</span>
<span property="label">Late Byzantine decorated base contiguous to Site B02, Kavalaria; Byzantine decorated bowl rim from B team tract</span>
<span property="photologdescription">B-92-181-2 (left), B-92-186-3 (right), respectively LByz decorated base, Byz decorated bowl rim</span>
</div>

You can see that each 'imagelink' div refers to a separately instantiated image and a separately instantiated subject. Unfortunately, there is some fragility and a lot of unnecessary overhead to this structure. The fragility comes from the possibility of change in one of the joined tables. Perhaps "203.29" was a typo in the database. If you change the id of a photo, it needs to be changed in every join record that refers to it as well.
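The fragility is easy to demonstrate. The following is a rough sketch, not part of the PRAP tooling: it uses Python's xml.etree.ElementTree on a cut-down copy of the records, wrapped in a hypothetical root element and with no namespace, and reports any 'imagelink' that points at an image id that no longer exists. The function name dangling_links is my own invention.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the join records and the image record, wrapped in an
# assumed <root> element so the fragment parses as one document.
SAMPLE = """<root>
<div class="imagelink">
<span property="image" src="prap:image:203.29" />
<span property="subject" src="prap:pottery:B92-181-02" />
<span property="position">Left</span>
</div>
<div class="imagelink">
<span property="image" src="prap:image:203.29" />
<span property="subject" src="prap:pottery:B92-186-03" />
<span property="position">Right</span>
</div>
<div class="image" id="prap:image:203.29">
<span property="filmtype">CS</span>
</div>
</root>"""


def dangling_links(root):
    """Return the src of every image reference that matches no image div."""
    image_ids = {div.get("id") for div in root.findall("div[@class='image']")}
    refs = root.findall("div[@class='imagelink']/span[@property='image']")
    return [ref.get("src") for ref in refs if ref.get("src") not in image_ids]
```

On the sample as written the function returns an empty list. Change the id of the image div without touching the join records and both links are reported as dangling.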

XML allows one to take advantage of containment to incorporate the link information directly into the image database as follows:

<div class="image" id="prap:image:203.29">
  <span property="filmtype">CS</span>
  <span property="label">Late Byzantine decorated base contiguous to Site B02, Kavalaria; Byzantine decorated bowl rim from B team tract</span>
  <span property="photologdescription">B-92-181-2 (left), B-92-186-3 (right), respectively LByz decorated base, Byz decorated bowl rim</span>
  <span property="subjects">
    <span>
      <span property="subject" src="prap:pottery:B92-181-02" />
      <span property="position">Left</span>
    </span>
    <span>
      <span property="subject" src="prap:pottery:B92-186-03" />
      <span property="position">Right</span>
    </span>
  </span>
</div>

I'm not yet happy with the details of this markup. Among the many odd things is the extra, unlabeled 'span' surrounding each subject-position pair. I'd like to give that a name. The advantage of this representation is that the text from <div> to </div> is a self-contained description of the image with well-structured links to the named entities that appear within it.
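One payoff of containment is that a consumer needs only the image div to find its subjects. Here is a small sketch of that, again with Python's xml.etree.ElementTree and again assuming no namespace on the elements; subjects_of is a hypothetical helper name, and the sample omits the label and photo log spans for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the self-contained image description.
IMAGE = """<div class="image" id="prap:image:203.29">
<span property="filmtype">CS</span>
<span property="subjects">
<span>
<span property="subject" src="prap:pottery:B92-181-02" />
<span property="position">Left</span>
</span>
<span>
<span property="subject" src="prap:pottery:B92-186-03" />
<span property="position">Right</span>
</span>
</span>
</div>"""


def subjects_of(image_div):
    """Return (subject src, position) pairs found inside one image div."""
    pairs = []
    # Each unlabeled span under 'subjects' groups one subject with its position.
    for group in image_div.findall("span[@property='subjects']/span"):
        subject = group.find("span[@property='subject']")
        position = group.find("span[@property='position']")
        pairs.append((subject.get("src"), position.text))
    return pairs
```

No lookup in a second dataset is involved: everything comes from the one element.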

One last point... I was not about to cut-and-paste the subject info into each image div. Instead, I wrote a quick-and-dirty XSLT stylesheet to process the two XML datasets and produce a single set of image descriptions that point to their subjects. Here it is:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/2002/06/xhtml2/" >

  <!-- Index every element by its class and by its src attribute. -->
  <xsl:key name="classes" match="//*[@class]" use="@class"/>
  <xsl:key name="srcs" match="//*[@src]" use="@src"/>

  <xsl:template match="/">
    <!-- Emit one div per image record, copying its existing children. -->
    <xsl:for-each select="key('classes','image')">
      <div class="{@class}" id="{@id}">
        <xsl:copy-of select="./*"/>
        <span property="subjects">
          <!-- Find spans whose src is this image's id, then step up to the enclosing imagelink. -->
          <xsl:for-each select="key('srcs',@id)/..">
            <span>
              <!-- Links with several subjects go through the x:span template, which drops collection units. -->
              <xsl:choose>
                <xsl:when test="count(x:span[@property='subject']) &gt; 1">
                  <xsl:apply-templates select="x:span[not(@property='image')]"/>
                </xsl:when>
                <xsl:otherwise>
                  <xsl:copy-of select="x:span[not(@property='image')]"/>
                </xsl:otherwise>
              </xsl:choose>
            </span>
          </xsl:for-each>
        </span>
      </div>
    </xsl:for-each>
  </xsl:template>

  <!-- Copy a span unless its src names a collection unit. -->
  <xsl:template match="x:span">
    <xsl:if test="not(contains(@src,':collectionunit:'))">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

This stylesheet has now been used to combine the image and image link portions of the PRAP Digital Archive and the tarball has been updated.
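For readers more comfortable outside XSLT, the same merge can be sketched loosely in Python with xml.etree.ElementTree. This is an illustration rather than the tool actually used on the archive: it assumes un-namespaced elements inside a hypothetical root wrapper, merge_links is an invented name, and it only follows 'image' spans, where the stylesheet's key matches any element with a matching src.

```python
import copy
import xml.etree.ElementTree as ET

# Cut-down input: two join records and one image record in an assumed <root>.
SAMPLE = """<root>
<div class="imagelink">
<span property="image" src="prap:image:203.29" />
<span property="subject" src="prap:pottery:B92-181-02" />
<span property="position">Left</span>
</div>
<div class="imagelink">
<span property="image" src="prap:image:203.29" />
<span property="subject" src="prap:pottery:B92-186-03" />
<span property="position">Right</span>
</div>
<div class="image" id="prap:image:203.29">
<span property="filmtype">CS</span>
</div>
</root>"""


def merge_links(root):
    """Return copies of the image divs with their link records folded in."""
    # Group the imagelink divs by the image id they point at.
    links_by_image = {}
    for link in root.findall("div[@class='imagelink']"):
        ref = link.find("span[@property='image']")
        if ref is not None:
            links_by_image.setdefault(ref.get("src"), []).append(link)

    merged = []
    for image in root.findall("div[@class='image']"):
        out = copy.deepcopy(image)
        subjects = ET.SubElement(out, "span", {"property": "subjects"})
        for link in links_by_image.get(image.get("id"), []):
            group = ET.SubElement(subjects, "span")
            spans = [s for s in link.findall("span") if s.get("property") != "image"]
            several = sum(s.get("property") == "subject" for s in spans) > 1
            for span in spans:
                # Mirror the stylesheet: in multi-subject links, skip collection units.
                if several and ":collectionunit:" in (span.get("src") or ""):
                    continue
                group.append(copy.deepcopy(span))
        merged.append(out)
    return merged
```

Calling merge_links(ET.fromstring(SAMPLE)) yields one image div whose 'subjects' span holds a group for each of the two sherds, with the pointers back to the image dropped.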

I realize that I'm diving right into XML, XSLT, etc. without much explanation. One purpose of this post is simply to share notes on what I've done. The other is to move towards the idea of a "PRAP Digital Archive Cookbook" that illustrates how to work with the PRAP data. This post can't be the beginning of such a publication, since it describes a process that changed the underlying files in such a way that the stylesheet just quoted will no longer work on them. But stay tuned for more fun things to do with this developing dataset...
