(NB Republished from a mailing list conversation at http://lists.okfn.org/pipermail/open-bibliography/2010-August/000397.html – follow this link to see the comments and replies)
In my work on meshing bibliographic datasets together, I’ve been using a conceptual tool that I would like to hear views on.
I am creating nodes for the ideals of things on records – whether that is for people, journals or even the bibliographic document itself. The ideal represents the best and most complete data for that thing – something we’ll never really achieve, but that’s not the point. This ideal serves as a node, a hook, on which we can join up records which describe the same thing (person, FRBR manifestation, etc.) but which hold differing data for it.
It’s easy to consider it for ‘deduplications’ of, say, article references. Consider two records, one from the RIS feed from PubMed and one from a citation in a PLoS article. These are found to be references to the same article but, as you can expect, they differ, not just in terms of data but also in terms of the source or author of that reference.
The way I am tackling this is by creating a node for the ideal bibliographic reference each record aspires to. When dupes are believed to be found, these ideal nodes are joined into a bundle using sameas (in a different store), and this bundle has some provenance triples recording the how, when and why of the merge (using Open Provenance Model verbs/classes).
Eg:
:bibrec --> record node from PubMed
:citerec --> record node from PLoS
_i suffix --> ideal node
- Running the analyser on the records suggests the two are dupes, with a certain confidence score from a certain weighted matching (call this ‘heur.v0.13’)
Create ideal nodes Just In Time:
:bibrec hasIdeal :bibrec_i
:citerec hasIdeal :citerec_i

Make the bundle:

:b1 a Bundle
    sameas :bibrec_i
    sameas :citerec_i
    opmv:wasGeneratedBy :p1
    created: 2010-08-......

:p1 a opmv:Process
    opmv:controlledBy :Ben
    opmv:used :bibrec
    opmv:used :citerec

:confidence a ConfidenceReport
    opmv:wasGeneratedBy :p1
    hasReport <url of doc> # for time being

This structure lets me create an aggregated RDF dataset with the best-guess ideal records at any one time. Also, bundles can be merged later if required, creating a tree structure – the top bundle instance and the ‘leaf’ records form a congruent closure and are thus exportable as such, without the admin structure triples necessary for ongoing maintenance. The bundle notion comes from the excellent work by the team at Southampton, including Hugh Glaser, Ian Millard et al. (google for coreference on the semantic web).
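To make the bundle-and-tree idea a bit more concrete, here is a minimal Python sketch. It is not the code I actually run: the `Bundle` class, the `export_sameas` helper, the third record `otherrec_i`, the process `p2` and the confidence numbers are all invented for illustration. It only shows how a top bundle and its leaf ideal nodes could be exported without the intermediate admin structure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Bundle:
    """A set of ideal nodes (or child bundles) believed to describe the same thing."""
    name: str
    members: list        # ideal node names (str) or child Bundle objects
    generated_by: str    # name of the process that proposed the merge
    confidence: float    # score reported by the matching heuristic (made-up values below)
    created: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def leaves(self):
        """Return every ideal node reachable from this bundle, sorted by name."""
        found = set()
        for member in self.members:
            if isinstance(member, Bundle):
                found.update(member.leaves())
            else:
                found.add(member)
        return sorted(found)


def export_sameas(top):
    """Flatten a bundle tree into (top, 'sameas', leaf) triples, dropping admin structure."""
    return [(top.name, "sameas", leaf) for leaf in top.leaves()]


# Two ideal nodes judged to be dupes by one matching run (hypothetical score).
b1 = Bundle("b1", ["bibrec_i", "citerec_i"], generated_by="p1", confidence=0.92)
# A later, hypothetical run links a third ideal node by bundling the bundle.
b2 = Bundle("b2", [b1, "otherrec_i"], generated_by="p2", confidence=0.81)

for triple in export_sameas(b2):
    print(triple)
```

Each `Bundle` keeps its own provenance (who generated it, when, with what confidence), so a doubtful merge can be undone by discarding that one bundle without touching the records or the ideal nodes underneath it.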
Using this technique for entities like people is actually very similar, if I use the words ‘person’ and ‘persona’ for the ideal and the data in a record respectively. The persona can have alternative spellings, time-dependent details like a fleeting institutional affiliation, and so on. The (difficult) trick is spotting when two personas refer to the same person, but the process for merging is the same even if the creation of an aggregated record for each is different.
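As a rough sketch of what building the aggregated record for a person might look like once the personas have been bundled, the Python below keeps every spelling and dates each affiliation instead of picking a single winner. The `Persona` fields, the `aggregate_person` function and the sample names and institutions are assumptions made up for this example, not part of any existing schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """The data about a person as it appears in one record."""
    record: str       # which record this persona came from
    name: str         # name as spelled in that record
    affiliation: str  # institution given in that record, possibly empty
    year: int         # year of the record, used to date the affiliation


def aggregate_person(personas):
    """Build a best-guess person record from personas already bundled together."""
    names = sorted({p.name for p in personas})
    affiliations = sorted(
        {(p.year, p.affiliation) for p in personas if p.affiliation}
    )
    return {
        "names": names,                # every spelling seen, none discarded
        "affiliations": affiliations,  # (year, institution) pairs, oldest first
        "sources": sorted(p.record for p in personas),
    }


# Invented sample personas, assumed to have been matched already.
personas = [
    Persona("bibrec", "Smith, J.", "Example University", 2008),
    Persona("citerec", "Smith, Jane", "", 2010),
    Persona("otherrec", "Smith, J.", "Another Institute", 2005),
]
person = aggregate_person(personas)
print(person["names"])
print(person["affiliations"])
```

The merge itself is untouched by this; only the aggregation step differs from the one used for bibliographic references.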
