Name matching strategy using bibliographic data

One of the aims of an RDF representation of bibliographic data should be to have authors represented by unique, referenceable points within the data (as URIs), rather than as free-text fields. What steps can we take to match up the text value representing an author’s name with another occurrence of that name in the data?

It’s not realistic to expect a match between, say, Mark Twain and Samuel Clemens without using extra information that is typically not present in bibliographic datasets. What can be achieved, however, is the ‘fuzzy’ matching of alternate forms of names – variations due to typos, mistakes, omitted initials and the like. It is important that these matches are understood to be fuzzy and not precise: they are based on statistics rather than definite assertions.

How best to carry out the matching of a set of authors within a bibliographic dataset? The following is not the only way, but it is a useful method for making progress:

  1. List – Gather a list of the things you wish to match, with a unique identifier for each, and map out the pairs of names that need to be compared. (Note that this mapping will be greatly affected by the next step.)
  2. Filter – Remove the comparisons that aren’t worth fully evaluating. An index of the text can give a qualitative view of which names are worth comparing and which are not.
  3. Compare – Run through the name pairs and evaluate each match (most likely using string metrics of some kind). The accuracy of the match may be improved by using other data; some kinds of data drastically improve your chances, such as author email, affiliation (and date), and birth and death dates.
  4. Bind – Bind the names together in whatever manner is required. I would recommend creating Bundles as a record of each successful match, and an index or SPARQL-able service to allow ‘owl:sameAs’-style lookups in a live service.

In terms of the BL dataset at http://bnb.bibliographica.org:

List:

We have had to apply a form of identifier to each instance of an author’s name within the BL dataset. Currently, this is done via an owl:sameAs property on the original blank node, linking to a URI of our own making, e.g. http://bibliographica.org/entity/735b0…12d033. It would be far better if the BL were to mint their own URIs for this, but in the meantime this gives us enough of a hook to begin with.

One way you might gather the names and URIs is via SPARQL:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?book ?authoruri ?name
WHERE {
    ?book a bibo:Book .
    ?book dc:contributor ?authorbn .
    ?authorbn skos:notation ?name .
    ?authorbn owl:sameAs ?authoruri .
}

However, there will be very many results to page through, and it will put a lot of stress on the SPARQL engine if lots of users are doing this heavy querying at the same time!
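One way to soften that load is to page through the results with LIMIT/OFFSET. A minimal Python sketch of the paging (the page size is an assumption, and the actual HTTP call to the endpoint is left out):

```python
# Sketch: build paged versions of the SPARQL query above.
# Prefixes are omitted for brevity; the page size is an assumption,
# not a value recommended by the BL service.

QUERY = """\
SELECT DISTINCT ?book ?authoruri ?name
WHERE {
    ?book a bibo:Book .
    ?book dc:contributor ?authorbn .
    ?authorbn skos:notation ?name .
    ?authorbn owl:sameAs ?authoruri .
}
"""

def paged_queries(base_query, page_size=1000):
    """Yield the base query with LIMIT/OFFSET appended, page by page."""
    offset = 0
    while True:
        yield "%sLIMIT %d OFFSET %d" % (base_query, page_size, offset)
        offset += page_size
```

Each yielded string can be sent to the endpoint in turn; stop when a page comes back empty.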

This is also a useful point at which to gather any extra data you will use at the compare stage (usually email or affiliation; in this particular case, the potential birth and death dates).

Filter:

This is the difficult part, as you have to work out a method for ruling out matches without the cost of a full comparison. If the filter method is slower than simply working through all the comparisons, the step is not worth doing. In matching names from the BL set, however, there are many millions of values, but from glancing over the data I expect only tens of matches or fewer, on average, for a given name.

The method I am using is to build a simple stemming index of the names, with the birth/death dates as extra fields. I have done this in Solr (experimenting with different stemming methods) and came to the somewhat odd conclusion that default English stemming provides suitable groupings. I found this was backed up somewhat by this comparison of string metrics [PDF], which suggests that a simple index combined with a metric called ‘Jaro’ works well for names.
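Solr specifics aside, the effect of such an index can be approximated with a simple blocking key – a crude normalisation that groups the names worth comparing. A sketch (the normalisation rules here are my own assumptions, not what the Solr stemmer does):

```python
import re
from collections import defaultdict

def block_key(name):
    """Crude grouping key: lowercased surname plus first initial.
    Assumes 'Surname, Forename' entry order, as is common in library data."""
    parts = [p for p in re.split(r"[,\s]+", name.lower()) if p]
    if not parts:
        return ""
    surname = parts[0]
    initial = parts[1][0] if len(parts) > 1 else ""
    return surname + ":" + initial

def group_names(names):
    """Map each blocking key to the list of names sharing it."""
    groups = defaultdict(list)
    for n in names:
        groups[block_key(n)].append(n)
    return groups
```

Only names falling into the same group need to go forward to the compare stage.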

So, in this case, I generate the candidates by running the names through an index of all the names and using the most relevant search results as the basis for the pairs of names to be compared. Each pair is stored as a set, ordered alphabetically – only the pairing is necessary, not the order within the pair. This ensures we don’t end up comparing the same two names twice.
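The ‘ordered pair’ idea can be sketched as follows. The input structure (each name mapped to its most relevant index results) is hypothetical; the point is that each pairing is stored sorted, so (A, B) and (B, A) collapse into one comparison:

```python
def candidate_pairs(candidates_by_name):
    """Build the set of unique name pairs to compare.

    candidates_by_name maps each name to the most relevant search
    results the index returned for it (a hypothetical structure).
    Each pair is stored sorted, so (a, b) and (b, a) are one entry.
    """
    pairs = set()
    for name, candidates in candidates_by_name.items():
        for other in candidates:
            if other != name:
                pairs.add(tuple(sorted((name, other))))
    return pairs
```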

Compare:

This is the most awkward step – it is hard to generate a ‘gold standard’ set of data against which you can rate the comparison without using other data sources. The matching algorithm I am using is the Jaro comparison, which gives a figure indicating the similarity of two names. As the BL data is quite good (it is human-entered and care has been taken over the entry), this type of comparison works well – the gap between a positive and a negative match is quite wide. Care must be taken to catch false positives arising from omitted or included initials, middle names, proper forms and so on.
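For reference, the Jaro metric itself is straightforward to implement. A sketch of the standard definition in Python (any production use would be better served by a tested library):

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for no common characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    # characters match if equal and within half the longer length of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    # count transpositions: matched characters that line up out of order
    transpositions = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0
```

The classic worked example is jaro("MARTHA", "MARHTA"), which comes out at 17/18 ≈ 0.944.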

The weight given to the additional data depends on how well the names match. If the names match perfectly but the birth dates are very different (different in value and distant in edit distance), then this is likely to be a different author. If the names match somewhat but the dates match perfectly, then this is a possible match. If the dates match perfectly but the names don’t match at all (unlikely, given the filtering step above), then this is not a match.
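Those heuristics can be written down directly. The thresholds below are illustrative assumptions, not tuned values, and ‘very different’ dates are simplified here to any disagreement:

```python
HIGH = 0.95  # names effectively identical (assumed threshold)
LOW = 0.75   # names similar enough to consider (assumed threshold)

def judge(name_similarity, dates_a, dates_b):
    """Combine a name-similarity score with birth/death dates.

    dates_a and dates_b are (birth, death) tuples; None means unknown.
    Returns 'match', 'possible' or 'non-match' per the heuristics above.
    """
    known = all(d is not None for d in dates_a + dates_b)
    dates_equal = known and dates_a == dates_b
    dates_differ = known and dates_a != dates_b
    if name_similarity >= HIGH:
        # perfect name match but clearly different dates: different author
        return "non-match" if dates_differ else "match"
    if name_similarity >= LOW:
        # names only somewhat alike: the dates must agree to keep the pair
        return "possible" if dates_equal else "non-match"
    # names don't match at all: reject even if the dates agree
    return "non-match"
```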

Binding:

I have not fully made my mind up about this step, as binding is a compromise between completeness of the record and the number of triples required. By bundling together all the names for a given author in a single bundle when they are below the threshold for a positive match, you get a bundle that requires fewer triples to describe. Really, though, you should have a bundle for each pairing, but this dramatically increases the number of triples required to express it. Either way, the method for querying the resultant data is the same. For example:

A set of bundles ‘encapsulates’ A, B, C, D, E, F and G – so, given B, you can find the others with a quick, if inelegant, SPARQL query:

SELECT ?sameas
WHERE {
  ?bundle bundle:encapsulates <B> .
  ?bundle bundle:encapsulates ?sameas .
}
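The trade-off between the two bundling styles is easy to quantify. For n names judged to belong to one author, a single bundle needs n encapsulates triples, while a bundle per pairing needs n(n-1)/2 bundles of two triples each (ignoring any bookkeeping triples on the bundles themselves):

```python
def single_bundle_triples(n):
    """One bundle encapsulating all n names: n encapsulates triples."""
    return n

def pairwise_bundle_triples(n):
    """One bundle per pair: n*(n-1)/2 bundles, each with two
    encapsulates triples, i.e. n*(n-1) triples in total."""
    return n * (n - 1)

# For the seven names A..G above:
# single bundle: 7 triples; pairwise bundles: 42 triples.
```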

Whether this data should be collapsed into a closure of some sort is up to the administrator – how much must you trust a match before you can use owl:sameAs and incorporate it into a congruent closure? I’m not sure the methods outlined above give a strong enough guarantee to do so at this point.

This entry was posted in JISC OpenBib, Semantic Web.
