Augmenting the British Library's RDF data to allow for disambiguation

The British Library have released what they term the ‘British National Bibliography’ (BNB) under a permissive licence. It comprises just under 3 million records and is derived from the ‘most polished set of bibliographic data’, some of which dates back a good number of years.

This effort is to be applauded, and the data in that set is turning out to be of reasonably high quality, with some errors due to XSLT problems rather than problems with the source data.

However, the RDF that is being created is very heavily dominated by blank nodes – each ‘record’ is an untyped blank node with many properties that point to further blank nodes, which in turn carry an rdf:value stating what that property’s value is.

For example:

  <rdf:Description>
    <dcterms:title>Tooley's dictionary of mapmakers.</dcterms:title>
    <dcterms:contributor>
      <rdf:Description>
        <rdfs:label>Tooley, R. V. (Ronald Vere), 1898-1986</rdfs:label>
      </rdf:Description>
    </dcterms:contributor>

etc...

  </rdf:Description>
  <rdf:Description>
....

This has a number of drawbacks. Much of the data is unlinkable, because a blank node cannot be referenced from outside a given triplestore. The author name information is mixed (the name includes date information), and there are RDF errors in the file.

Another issue is that the data is held in 17 very large XML files, which makes it very hard to address individual records as independent documents.

The first task is to augment this data such that:

  • The ‘records’, authors, publishers, and related item bnodes are given unique, globally referenceable URIs
  • The items themselves are given a type, based on the literal values present within bnodes linked to the item by the dc:type property (eg bibo:Book).
  • Any MARC -> RDF/XML errors are cleaned up (notably, there are a few occurrences of rdf:description rather than rdf:Description).
  • Author names that carry dates (eg Smith, John, b. 1923 or similar) have those dates broken out into a more semantic construction.
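The rdf:description cleanup mentioned in the list can be pictured as a simple text substitution applied before the XML is parsed. This is a minimal sketch rather than the code from the script itself; the function name `fix_description_case` is made up for illustration.

```python
import re

# Matches the mis-cased tag name in both opening and closing tags.
_BAD_TAG = re.compile(r"(</?)rdf:description\b")


def fix_description_case(text):
    """Return text with rdf:description rewritten as rdf:Description.

    Hypothetical helper: the real script may clean this up differently.
    """
    return _BAD_TAG.sub(r"\1rdf:Description", text)


fixed = fix_description_case("<rdf:description><x/></rdf:description>")
```

Working on the raw text rather than the parsed tree has the advantage that the substitution can be made before the parser has a chance to misread the node.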

You can find the script that will do this augmentation at https://bitbucket.org/okfn/jiscobib/src/4ddaa37e44a2/BL_scripts/BLDump_convert_and_store.py

This script requires the lxml Python module for XPath support as well as the pairtree module to store the records as individual documents. The script should be able to process all 17 files in a few hours, but make sure you have plenty of disc space.
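To give a feel for how very large files can be cut into individual records without loading them whole, here is a rough sketch. It is not the script linked above: it uses the standard library's `xml.etree.ElementTree.iterparse` in place of lxml, it assumes each file has a single rdf:RDF root whose direct children are the records, and it leaves out the pairtree storage step entirely. The second record in the sample is invented.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION = "{%s}Description" % RDF_NS


def iter_records(source):
    """Yield each top-level rdf:Description as a serialized XML string."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # After closing an element, depth == 1 means it was a direct
        # child of the root, i.e. a whole record and not a nested bnode.
        if depth == 1 and elem.tag == DESCRIPTION:
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # drop the record's children once it has been used


# Assumed file layout, with one invented record for illustration.
SAMPLE = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description>
    <dcterms:title>Tooley's dictionary of mapmakers.</dcterms:title>
    <dcterms:contributor>
      <rdf:Description>
        <rdfs:label>Tooley, R. V. (Ronald Vere), 1898-1986</rdfs:label>
      </rdf:Description>
    </dcterms:contributor>
  </rdf:Description>
  <rdf:Description>
    <dcterms:title>A second, made-up record</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""

records = list(iter_records(io.BytesIO(SAMPLE)))
```

Each string in `records` could then be written out as its own document, which is the role pairtree plays in the real script.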

It’s easiest to explain what is happening to the individual author/publisher nodes by use of a diagram:

[Diagram: Augmenting the BL RDF to allow for disambiguation to be overlaid]

For example, the original fragment:

    <dcterms:contributor>
      <rdf:Description>
        <rdfs:label>Tooley, R. V. (Ronald Vere), 1898-1986</rdfs:label>
      </rdf:Description>
    </dcterms:contributor>

To:

    <dcterms:contributor>
      <foaf:Agent rdf:about="http://purl.org/okfn/bl#agent-eea1ab4ff2be4baa6f9d623bdda5e852">
        <foaf:name>Tooley, R. V. (Ronald Vere)</foaf:name>
        <bio:event>
          <bio:Birth>
            <bio:date>1898</bio:date>
          </bio:Birth>
        </bio:event>
        <bio:event>
          <bio:Death>
            <bio:date>1986</bio:date>
          </bio:Death>
        </bio:event>
      </foaf:Agent>
    </dcterms:contributor>
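The split between name and dates shown in the fragments above could be sketched with a regular expression along these lines. This is an illustration only, not the script's own logic: it handles just the "1898-1986", "1950-", "b. 1923" and "d. 1986" forms, and the function name `split_name_and_dates` is made up. Real MARC and AACR2 data punctuates dates in more ways than this.

```python
import re

# Trailing date forms handled here (an assumed, incomplete set).
_TAIL = re.compile(
    r"""^(?P<name>.*?),\s*
        (?:
            (?P<birth>\d{4})-(?P<death>\d{4})?   # 1898-1986 or open-ended 1950-
          | b\.\s*(?P<born>\d{4})                # b. 1923
          | d\.\s*(?P<died>\d{4})                # d. 1986
        )\.?\s*$""",
    re.VERBOSE,
)


def split_name_and_dates(label):
    """Split an author label into name, birth year and death year."""
    label = label.strip()
    match = _TAIL.match(label)
    if match is None:
        # No recognised date suffix: keep the whole label as the name.
        return {"name": label, "birth": None, "death": None}
    return {
        "name": match.group("name"),
        "birth": match.group("birth") or match.group("born"),
        "death": match.group("death") or match.group("died"),
    }


tooley = split_name_and_dates("Tooley, R. V. (Ronald Vere), 1898-1986")
```

Here `tooley["name"]` is the name without its dates, which maps to foaf:name, while the two years map to the bio:Birth and bio:Death events.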

The URIs are generated by taking an md5 hash of a number of values, including the full line from the file that the name appears on, the extracted author’s name, and the line number within the file. The idea was to generate URIs that were as unique as possible, but reproducible from the same set of data if the script was rerun.
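As a rough sketch of that idea, the three values can be fed into an md5 digest and the hex string appended to the base URI seen in the example. The exact values, their order and their encoding in the real script are not spelled out here, so the details below (including the function name `agent_uri` and the NUL separator) are assumptions, and the hashes produced will not match the ones in the published data.

```python
import hashlib

# Base taken from the example fragment above.
AGENT_BASE = "http://purl.org/okfn/bl#agent-"


def agent_uri(raw_line, name, line_number):
    """Build a reproducible URI from the source line, name and line number.

    Assumed recipe: the real script may hash different or additional values.
    """
    digest = hashlib.md5()
    for part in (raw_line, name, str(line_number)):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent parts cannot run together
    return AGENT_BASE + digest.hexdigest()


line = "        <rdfs:label>Tooley, R. V. (Ronald Vere), 1898-1986</rdfs:label>"
uri_a = agent_uri(line, "Tooley, R. V. (Ronald Vere)", 15)
uri_b = agent_uri(line, "Tooley, R. V. (Ronald Vere)", 15)
uri_c = agent_uri(line, "Tooley, R. V. (Ronald Vere)", 48)
```

Because the line number is part of the input, the same author string on two different lines gets two different URIs. That is deliberate: deciding that they are the same person is left to a later overlay rather than baked into the identifiers.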

By giving the books, musical scores, authors, publishers and related works externally addressable URIs, third-party datasets, such as sameas.org, can overlay their own version of which things are the same.

You can then choose the overlay dataset which links together the authors and so on based on how much you trust the matching techniques of the service, rather than glomming both original data and asserted data together inextricably.
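One way to picture this separation is to keep the agent URIs in one place and each overlay's "same as" assertions in another, combining them only when asked. The sketch below is purely illustrative: the URIs, the two overlays and the function name `bundle` are all invented, and a real overlay would come from a service rather than a hard-coded list.

```python
def bundle(agents, same_as_pairs):
    """Group agent URIs into bundles using one overlay's sameAs pairs."""
    parent = {agent: agent for agent in agents}

    def find(node):
        # Follow parent links to the representative, shortening the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in same_as_pairs:
        if left in parent and right in parent:
            parent[find(left)] = find(right)

    groups = {}
    for agent in agents:
        groups.setdefault(find(agent), set()).add(agent)
    return sorted(sorted(group) for group in groups.values())


# Invented identifiers standing in for generated agent URIs.
agents = ["bl#agent-a1", "bl#agent-a2", "bl#agent-a3"]

# Two hypothetical overlays with different levels of matching.
cautious_overlay = []
eager_overlay = [("bl#agent-a1", "bl#agent-a2")]

cautious_view = bundle(agents, cautious_overlay)
eager_view = bundle(agents, eager_overlay)
```

The original list of agents is never modified, so switching from one overlay to another, or dropping one that turns out to be unreliable, costs nothing.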


Responses to Augmenting the British Library's RDF data to allow for disambiguation

  Ed Chamberlain says:

    Firstly, it’s great to see the factually rich but structurally poor MARC-based BNB data released in this way. Congratulations to all involved! My first real library job involved archiving CD-ROMs with BNB updates on …

    I’ve recently had to break down the author entry from the MARC 100 field myself to get birth and death dates separated. I’d be interested in the process used, especially in accounting for all the different ways AACR2 and MARC punctuate the date information. I’m not sure I’ve got them all!
