Tuesday 10 January 2012

Progress in CLAROS towards Pelagios

I have been working on getting data into CLAROS (http://www.clarosnet.org/) to make it a proper contributing partner to Pelagios. Not new data exactly (we already have millions of RDF triples), but new connections between data. Finally, we're almost at the point, as Alex Dutton will explain in a subsequent post, of being able to list all the objects and people in CLAROS which can be linked to Pleiades places. But it may be instructive to describe informally the process we go through, and the tools we use.

The starting point for data providers in CLAROS is a supply of RDF against the CIDOC CRM (obviously, that takes some doing at their end; the wiki at http://www.clarosnet.org/wiki/index.php?title=CIDOC_CRM_RDF/XML helps explain how and what). This RDF (I give examples in XML) typically describes a set of objects, e.g. a <crm:E22_Man-Made_Object rdf:about="http://www.beazley.ox.ac.uk/record/AA1CD952-927D-41D7-B7AF-39520936CF95">, each of which has a section saying where the provider thinks it comes from, in the slightly tortuous way familiar to users of the CRM:

<P16i_was_used_for>
  <E7_Activity>
    <P2_has_type rdf:resource="http://id.clarosnet.org/vocab/Event_FindObject"/>
    <P7_took_place_at>
      <E53_Place>
        <P87_is_identified_by>
          <E48_Place_Name>
            <rdf:value>VULCI</rdf:value>
          </E48_Place_Name>
        </P87_is_identified_by>
        <P89_falls_within>
          <E53_Place>
            <P87_is_identified_by>
              <E48_Place_Name>
                <rdf:value>ETRURIA</rdf:value>
              </E48_Place_Name>
            </P87_is_identified_by>
          </E53_Place>
        </P89_falls_within>
      </E53_Place>
    </P7_took_place_at>
  </E7_Activity>
</P16i_was_used_for>

This is not wrong, but it is not ideal, since

  1. the E53_Place objects are not identified by a URL and so are not addressable in the RDF
  2. there is no indication of the geographical location of Vulci
  3. there is no link to any other record for Vulci
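The first of these problems is easy to detect mechanically. The following is only a rough Python sketch, not part of the CLAROS ingest itself (which uses XSLT, as described further on); the CRM namespace URI is an assumption made for illustration, since the snippets in this post omit their namespace declarations, and the function name is made up.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the CRM terms (illustration only).
CRM = "http://purl.org/NET/crm-owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SAMPLE = f"""
<E7_Activity xmlns="{CRM}" xmlns:rdf="{RDF}">
  <P7_took_place_at>
    <E53_Place>
      <P87_is_identified_by>
        <E48_Place_Name><rdf:value>VULCI</rdf:value></E48_Place_Name>
      </P87_is_identified_by>
      <P89_falls_within>
        <E53_Place>
          <P87_is_identified_by>
            <E48_Place_Name><rdf:value>ETRURIA</rdf:value></E48_Place_Name>
          </P87_is_identified_by>
        </E53_Place>
      </P89_falls_within>
    </E53_Place>
  </P7_took_place_at>
</E7_Activity>
"""

# Path from a place to its own name; nested places are not followed.
NAME_PATH = (
    f"{{{CRM}}}P87_is_identified_by/{{{CRM}}}E48_Place_Name/{{{RDF}}}value"
)


def anonymous_place_names(xml_text):
    """Return the names of E53_Place nodes that carry no rdf:about URI."""
    root = ET.fromstring(xml_text)
    names = []
    for place in root.iter(f"{{{CRM}}}E53_Place"):
        if place.get(f"{{{RDF}}}about") is None:
            value = place.find(NAME_PATH)
            if value is not None and value.text:
                names.append(value.text.strip())
    return names


print(anonymous_place_names(SAMPLE))
```

Both the Vulci and the Etruria nodes are reported, since neither has a URL of its own.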

The CLAROS ingest procedure reads this data, and enhances it by taking the place name "Vulci" and comparing it to a list of known places in an internal gazetteer called Metamorphoses. This has been built up by pulling together ad hoc catalogues from the various projects at Oxford, and gradually enhancing the entries with latitude and longitude acquired by finding places on Google Maps or Earth, and cross-referencing sites from Geonames (http://www.geonames.org/). By then consulting PleiadesPlus (http://googleancientplaces.wordpress.com/2011/01/24/pleiades-adapting-the-ancient-world-gazetteer-for-gap-%E2%80%93-by-leif-isaksen/), we can enhance the gazetteer still further with links to Pleiades. The end result looks like this, using the skos:closeMatch relationship to link our internal place Vulci with Pleiades and Geonames:

<E53_Place rdf:about="http://id.clarosnet.org/places/metamorphoses/place/vulci">
  <rdfs:label>[IT] Vulci</rdfs:label>
  <P87_is_identified_by>
    <E48_Place_Name rdf:about="http://id.clarosnet.org/places/metamorphoses/placename/vulci">
      <rdf:value>Vulci</rdf:value>
    </E48_Place_Name>
  </P87_is_identified_by>
  <P87_is_identified_by>
    <E47_Place_Spatial_Coordinates rdf:about="http://id.clarosnet.org/places/metamorphoses/place/vulci/coordinates">
      <claros:has_geoObject>
        <geo:Point xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
          <geo:lat>42.4167</geo:lat>
          <geo:long>11.5833</geo:long>
        </geo:Point>
      </claros:has_geoObject>
    </E47_Place_Spatial_Coordinates>
  </P87_is_identified_by>
  <skos:closeMatch rdf:resource="http://pleiades.stoa.org/places/413393#this"/>
  <skos:closeMatch rdf:resource="http://sws.geonames.org/3163940/"/>
  <P89_falls_within rdf:resource="http://id.clarosnet.org/places/metamorphoses/country/IT"/>
</E53_Place>
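To show how such an entry might be consumed, here is a small Python sketch that reads the skos:closeMatch targets of a gazetteer entry and optionally keeps only those on a given host. It is an illustration rather than CLAROS code: the CRM namespace URI is assumed, the entry is cut down to the parts needed, and the function name is invented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Assumed namespace URI for the CRM terms (illustration only).
CRM = "http://purl.org/NET/crm-owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Cut-down copy of the gazetteer entry for Vulci.
ENTRY = f"""
<E53_Place xmlns="{CRM}" xmlns:rdf="{RDF}" xmlns:skos="{SKOS}"
    rdf:about="http://id.clarosnet.org/places/metamorphoses/place/vulci">
  <skos:closeMatch rdf:resource="http://pleiades.stoa.org/places/413393#this"/>
  <skos:closeMatch rdf:resource="http://sws.geonames.org/3163940/"/>
</E53_Place>
"""


def close_matches(entry_xml, host=None):
    """Return (place URI, closeMatch URIs), optionally limited to one host."""
    place = ET.fromstring(entry_xml)
    targets = [
        match.get(f"{{{RDF}}}resource")
        for match in place.findall(f"{{{SKOS}}}closeMatch")
    ]
    if host is not None:
        targets = [t for t in targets if t and urlparse(t).netloc == host]
    return place.get(f"{{{RDF}}}about"), targets


print(close_matches(ENTRY, host="pleiades.stoa.org"))
```

Filtering on the host is just one simple way of picking out the Pleiades links from the Geonames ones.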

Now we can match the "VULCI" from earlier on with this "vulci", and rewrite the <P7_took_place_at> as <P7_took_place_at rdf:resource="http://id.clarosnet.org/places/metamorphoses/place/vulci"/>; this now lets us assert that http://www.beazley.ox.ac.uk/record/AA1CD952-927D-41D7-B7AF-39520936CF95 is associated with http://pleiades.stoa.org/places/413393#this in some way, which is where we meet Pelagios.
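The matching and rewriting step can be sketched in a few lines of Python. This is a simplified stand-in for what the real transform does, under stated assumptions: the CRM namespace URI is assumed, the gazetteer is reduced to a dictionary keyed on lower-cased names, and the function name is invented.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the CRM terms (illustration only).
CRM = "http://purl.org/NET/crm-owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical in-memory stand-in for the Metamorphoses gazetteer.
GAZETTEER = {"vulci": "http://id.clarosnet.org/places/metamorphoses/place/vulci"}

SAMPLE = f"""
<E7_Activity xmlns="{CRM}" xmlns:rdf="{RDF}">
  <P7_took_place_at>
    <E53_Place>
      <P87_is_identified_by>
        <E48_Place_Name><rdf:value>VULCI</rdf:value></E48_Place_Name>
      </P87_is_identified_by>
    </E53_Place>
  </P7_took_place_at>
</E7_Activity>
"""

NAME_PATH = (
    f"{{{CRM}}}E53_Place/{{{CRM}}}P87_is_identified_by/"
    f"{{{CRM}}}E48_Place_Name/{{{RDF}}}value"
)


def link_places(root, gazetteer):
    """Swap inline E53_Place nodes for rdf:resource links; return the count."""
    linked = 0
    # Take a list first, because the tree is modified inside the loop.
    for prop in list(root.iter(f"{{{CRM}}}P7_took_place_at")):
        value = prop.find(NAME_PATH)
        if value is None or not value.text:
            continue
        uri = gazetteer.get(value.text.strip().casefold())
        if uri is None:
            continue  # unknown place: leave the data as supplied
        for child in list(prop):
            prop.remove(child)
        prop.text = None
        prop.set(f"{{{RDF}}}resource", uri)
        linked += 1
    return linked


root = ET.fromstring(SAMPLE)
print(link_places(root, GAZETTEER))
```

Places that are not found in the gazetteer are deliberately left untouched in this sketch, so nothing the provider supplied is lost.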

Most of the normalizing process is done in a single XSLT 2.0 transform of the incoming RDF/XML (which also does quality checks of the RDF), working with the Metamorphoses RDF and a lookup XML file listing common spelling mistakes. When the resulting rewritten RDF is loaded into the triple store, additional inferences are performed to make subsequent retrievals easier. This process is, of course, very open to change and refinement, and as CLAROS develops we will no doubt rewrite it all.
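The name normalization with a spelling lookup might look something like the following in Python. This is a guess at the general shape, not a description of the actual XSLT: the accent stripping and whitespace handling are assumptions, and the misspelling listed is an invented example rather than an entry from the real lookup file.

```python
import unicodedata

# Invented entry, standing in for the lookup file of common misspellings.
CORRECTIONS = {"vulchi": "vulci"}


def normalise_name(raw, corrections):
    """Reduce a supplied place name to a key for gazetteer lookup."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Collapse runs of whitespace and ignore case.
    key = " ".join(no_accents.split()).casefold()
    # Map a known misspelling to its corrected form, if there is one.
    return corrections.get(key, key)


print(normalise_name("VULCI", CORRECTIONS))
print(normalise_name("  Vulchi ", CORRECTIONS))
```

Keeping the corrections in a separate table means new misspellings can be added without touching the matching logic.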

Does it work? CLAROS' gazetteer currently defines about 7300 places, of which only 1442 (roughly 20%) are linked to Pleiades. But bearing in mind that CLAROS has a lot of modern place names, and a lot of places in the Middle and Far East, we are not dissatisfied with progress. Our next step will be to go gradually over places in the obvious countries (Greece, Italy, France, Germany etc.), and check them against Pleiades, with the target of complete synchronization across the Mediterranean. It will be slow work...
