"Bundling" instances of author names together without using owl:sameas

Bundling?

It’s a verb I’ve taken from ”Glaser, H., Millard, I., Jaffri, A., Lewy, T. and Dowling, B. (2008) On Coreference and The Semantic Web http://eprints.ecs.soton.ac.uk/15765/” where the core idea is that you have a number of URIs that mean or reference the same real thing, and the technique they describe of bundling is to aggregate all those references together. The manner in which they describe is built on a sound basis in logic, and is related to (if not the same as) a congruent closure.

The notion of bundling I am using is not as rooted in terms of mathematical logic, because I need to convey an assertion that one URI is meant to represent the same thing that another URI represents in a given context and for a given reason. This is a different assertion, if only subtly different, than ‘owl:sameas’ asserts, but the difference is key for me.

It is best to think through an example of where I am using this – curating bibliographic records and linking authors together.

It’s an obvious desire – given a book or article, to find all the other works by an author of that said work. Technologically, with RDF this is a very simple proposition BUT the data needs to be there. This is the point where we come unstuck. We don’t really have that quality of data that firmly establishes that one author is the same as a number of others. String matching is not enough!

So, how do we clean up this data (converted to RDF) so that we can try to stitch together the authors and other entities in them?

See this previous post on augmenting British Library metadata so that the authors, publishers and so on are externally reference-able once they are given unique URIs. This really is the key step. Any other work that can be done to make any of the data about the authors and so on more semantically reference-able will be a boon to the process of connecting the dots, as I have done for authors with birth and/or death dates.

The fundamental aspect to realise is that we are dealing with datasets which have missing data, misrepresented data (typos), misinterpreted fields (ISBNs of £2.50 for example) and other non-uniform and irregular problems. Connecting authors together in datasets with these characteristics will rely on us and code that we write making educated guesses, and probabilistic assertions, based on how confident we are that things match and so on.

We cannot say for sure that something is a cast-iron match, only that we are above a certain limit of confidence that this is so. We also have to have a good reason as well.

Something else to take on board is that what I would consider to be a good match might not be good for someone else so there needs to be a manner to state a connection and to say why, who and how this match was made as well as a need to keep this data made up of assertions away from our source data.

I’ve adopted the following model for encoding this assertion in RDF, in a form that sits outside of the source data, as a form of overlay data and you can find the bundle ontology I’ve used at http://purl.org/net/bundle.rdf (pay no attention to where it is currently living):

Click to view in full, unsquished form:

Bundle of URIs, showing use of OPMV

The URIs shown to be ‘opmv:used’ in this diagram are not meant to be exhaustive. It is likely that a bundle may depend on a look-up or resolution service, external datasheets, authority files, csv lists, dictionary lists and so on.

Note that the ‘Reason’ class has few, if any, mandatory properties aside from its connection to a given Bundle and opmv:Process. Assessing if you trust a Bundle at this moment is very much based on the source and the agent that made the assertion. As things get more mature, more information will regularly find its place attached to a ‘Reason’ instance.

There are currently two subtypes of Reason: AlgorithmicReason and AgentReason. Straightforwardly, this is the difference between a machine-made match and a human-made match and use of these should aid the assessment of a given match.

Creating a bundle using python:

I have added a few classes to Will Waites’ excellent ‘ordf’ library, and you can find my version here. To create a virtualenv to work within, do as follows. You will need mercurial and virtualenv already installed:

At a command line – eg ‘[@localhost] $’, enter the following:

hg clone http://bitbucket.org/beno/ordf
virtualenv myenv
. ./myenv/bin/activate
(myenv) $ pip install ordf

So, creating a bundle of some URIs – “info:foo” and “info:bar”, due to a human choice of “They look the same to me :)”:

In python: code here


from ordf.vocab.bundle import Bundle, Reason, AlgorithmicReason, AgentReason

from ordf.vocab.opmv import Agent

from ordf.namespace import RDF, BUNDLE, OPMV, DC # you are likely to use these yourself

from ordf.term import Literal, URIRef # when adding arbitrary triples

b = Bundle()

"""or if you don't want a bnode for the Bundle URI: b = Bundle(identifier="http://example.org/1")"""

"""
NB this also instantiates empty bundle.Reason and opmv.Process instances too
in b.reason and b.process which are used to create the final combined graph at the end"""

b.encapsulate( URIRef("info:foo"), URIRef("info:bar") )

""" we don't want the default plain Reason, we want a human reason:"""

r = AgentReason()

""" again, pass a identifier="" kw to set the URI if you wish"""

r.comment("They look the same to me :)")

"""Let them know who made the assertion:"""

a = Agent()

a.nick("benosteen")

a.homepage("http://benosteen.com")

""" Add this agent as the controller of the process:"""
b.process.agent(a)

g = b.bundle_graph() # this creates an in-memory graph of all the triples required to assert this bundle

""" easiest way to get it out is to "serialize" it:"""

print g.serialize()

==============

Output:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:bundle="http://purl.org/net/bundle#"
   xmlns:foaf="http://xmlns.com/foaf/0.1/"
   xmlns:opmv="http://purl.org/net/opmv/ns#"
   xmlns:ordf="http://purl.org/NET/ordf/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
>
  <rdf:Description rdf:nodeID="PZCNCkfJ2">
    <rdfs:label> on monster (18787)</rdfs:label>
    <ordf:hostname>monster</ordf:hostname>
    <ordf:pid rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">18787</ordf:pid>
    <opmv:wasControlledBy rdf:nodeID="PZCNCkfJ9"/>
    <ordf:version rdf:nodeID="PZCNCkfJ4"/>
    <rdf:type rdf:resource="http://purl.org/net/opmv/ns#Process"/>
    <ordf:cmdline></ordf:cmdline>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ0">
    <bundle:encapsulates rdf:resource="info:bar"/>
    <bundle:encapsulates rdf:resource="info:foo"/>
    <bundle:justifiedby rdf:nodeID="PZCNCkfJ5"/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
    <rdf:type rdf:resource="http://purl.org/net/bundle#Bundle"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ5">
    <rdf:type rdf:resource="http://purl.org/net/bundle#Reason"/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ9">
    <foaf:nick>benosteen</foaf:nick>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/>
    <foaf:homepage rdf:resource="http://benosteen.com"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ4">
    <rdfs:label>ordf</rdfs:label>
    <rdf:value>0.26.391.901cf0a0995c</rdf:value>
  </rdf:Description>
</rdf:RDF>


Given a triplestore with these bundles, you can query for ‘same as’ URIs via which Bundles a given URI appears in.

This entry was posted in JISC OpenBib, ORDF, Semantic Web and tagged , , , , , , . Bookmark the permalink.

7 Responses to "Bundling" instances of author names together without using owl:sameas

  1. Pingback: Tweets that mention “Bundling” instances of author names together without using owl:sameas | Open Biblio (graphic) Projects -- Topsy.com

  2. I’ve still yet to read ‘on co-reference and the semantic web’, but wondering if “congruent closure” (while not making a mathematical assertion necessarily) is similar to “equivalence relations”[1] in set theory (had to crack open an old set theory text to try and explore this, but couldn’t find any definitive answer)?

    Also, great to see example code, could we maybe get an example query which demonstrates ability to get back a set of bundles?

    Very interesting post.

    [1]= http://en.wikipedia.org/wiki/Equivalence_relation

  3. Pingback: Disillusionment – 4Kids Entertainment, American Beauty in mind the whole OP (reprint, … | Home cooking

  4. Pingback: Name matching strategy using bibliographic data | Open Biblio (graphic) Projects

  5. Adrian Pohl says:

    Hello Ben,

    I am looking for a vocabulary to represent co-reference of bibliographic identifiers and am asking myself which approach fits best:
    – the coref-ontology, – your bundle ontology proposed in this post
    – or even owl:sameAs. (It is interesting – and disturbing – that Ian Millard and Hugh Glaser use owl:sameAs and not their bundling approach for the service sameas.org.)
    – or maybe even the Similarity Ontology?

    I have the following questions to you:
    Why don’t you (re)use the coref ontology in your approach although you explicitely built on the work by Hugh and Ian?What are the advantages of the bundle ontology comparing to the coref ontology?

    Furthermore, I think it makes good sense relating the bundle ontology to OAI-ORE but shouldn’t it read rdfs:subPropertyOf in lines 102 and 116 of the ontology, instead of rdfs:subClassOf?

    Adrian

  6. Hugh Glaser says:

    Hi Adrian,
    Sorry to disturb you, even if we also interest you. :-)
    In fact the underlying CRS system doesn’t use owl:sameAs; see for example
    http://southampton.rkbexplorer.com/crs/export/?uri=http://southampton.rkbexplorer.com/id/eprints-13916&format=rdfxml
    But when we decided to gather everything into sameas.org, it became apparent that the service needed to go along with the sameAs usage, wrong as it may be. otherwise we would get involved with complex ontology discussions that would distract from the use of the service.
    In those days, I think the whole thing was even less understood than it is now, and it was very hard to get people to accept the need for a coref-ontology, so they didn’t understand what the CRS software was doing.
    In fact people can do a simply script to edit owl:sameAs to whatever their favourite predicate is, if they want.
    One day I may add an argument to the interface so that you can tell sameAs.org what predicate(s) you want it back in.
    Hi David,
    I primarily think of bundles as Equivalence Classes, but sometimes refer to them as Closure :-).
    Since equivalence predicates are reflexive, symmetric and transitive, they are equivalence relations (in the mathematical sense) and construct equivalence classes over the sets of URIs.
    Closure is a related topic, concerned in part with whether, if you apply an operation to elements of a set (in this case the bundle, I think), you get an element of the set.
    Congruence relations are equivalence relations (in the mathematical sense).
    I’m sure someone else can do better on this if you want…
    Best
    Hugh

  7. Pingback: Co-referencing « LOCAH Project

Leave a Reply to Adrian Pohl Cancel reply

Your email address will not be published. Required fields are marked *