The British Library have released what they term the ‘British National Bibliography’ (BNB) under a permissive licence. This constitutes just under 3 million records and is derived from their ‘most polished set of bibliographic data’, as some of it dates back a good number of years.
This effort is to be applauded, and the data in the set is turning out to be of reasonably high quality – the errors found so far are due to XSLT conversion problems rather than problems with the source data.
However, the RDF that is being created is very heavily dominated by blank nodes – each ‘record’ is an untyped blank node with many properties that end in blank nodes, which in turn carry rdf:value properties stating what each property’s value is.
<rdf:Description>
  <dcterms:title>Tooley's dictionary of mapmakers.</dcterms:title>
  <dcterms:contributor>
    <rdf:Description>
      <rdfs:label>Tooley, R. V. (Ronald Vere), 1898-1986</rdfs:label>
    </rdf:Description>
  </dcterms:contributor>
  etc...
</rdf:Description>
<rdf:Description>
....
This has a number of drawbacks: much of the data is unlinkable – you cannot reference it outside of a given triplestore – the author name information is mixed in with date information, and there are RDF errors in the files.
Another issue is that the data is held in 17 very large XML files, which makes it very hard to address individual records as independent documents.
The first task is to augment this data such that:
- The ‘records’, authors, publishers, and related item bnodes are given unique, globally referenceable URIs
- The items themselves are given a type, based on the literal values present within bnodes linked to the item by the dc:type property (eg bibo:Book)
- Any MARC -> RDF/XML conversion errors are cleaned up (notably, there are a few occurrences of rdf:description, rather than rdf:Description, in there)
- For authors with more authoritative names (eg ‘Smith, John, b. 1923’ or similar), the dates are broken out into a more semantic construction.
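The last step can be sketched with a small regular expression. This is an illustration only, not the script’s actual code: the patterns below assume authority-style labels of the forms ‘Name, 1898-1986’, ‘Name, b. 1923’ and ‘Name, d. 1986’.

```python
import re

# Hypothetical sketch: split an authority-style label such as
# "Tooley, R. V. (Ronald Vere), 1898-1986" or "Smith, John, b. 1923"
# into a bare name plus birth/death years.
DATES = re.compile(r"""
    ,\s*
    (?:
        (?P<birth>\d{4})\s*-\s*(?P<death>\d{4})   # "1898-1986"
      | b\.\s*(?P<born_only>\d{4})                # "b. 1923"
      | d\.\s*(?P<died_only>\d{4})                # "d. 1986"
    )\s*$
""", re.VERBOSE)

def split_name(label):
    """Return (name, birth, death); dates are None when absent."""
    label = label.strip()
    match = DATES.search(label)
    if not match:
        return label, None, None
    name = label[:match.start()].strip()
    birth = match.group("birth") or match.group("born_only")
    death = match.group("death") or match.group("died_only")
    return name, birth, death

# e.g. split_name("Smith, John, b. 1923") -> ("Smith, John", "1923", None)
```

The birth and death years recovered this way are what end up inside the bio:Birth and bio:Death events shown in the augmented fragment below; labels with no trailing dates (publishers, corporate names) pass through untouched.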
You can find the script that will do this augmentation at https://bitbucket.org/okfn/jiscobib/src/4ddaa37e44a2/BL_scripts/BLDump_convert_and_store.py
This script requires the lxml Python module for XPath support, as well as the pairtree module to store the records as individual documents. The script should be able to process all 17 files in a few hours, but make sure you have plenty of disk space.
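The script itself relies on lxml and XPath, but the key point when files are this large is to stream them rather than parse each one into memory whole. A dependency-free sketch of that streaming approach, using the standard library’s iterparse (element and namespace names here match the BNB fragments shown in this post):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

def iter_records(source):
    """Yield each top-level rdf:Description in turn, streaming the input.

    Depth tracking keeps nested rdf:Description elements (contributors,
    publishers, etc.) inside their parent record; clearing each record
    after use keeps memory usage flat across a multi-gigabyte file.
    """
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 1 and elem.tag == RDF + "Description":
                yield elem
                elem.clear()

# A tiny stand-in for one of the 17 files:
sample = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                      xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description>
    <dcterms:title>Tooley's dictionary of mapmakers.</dcterms:title>
  </rdf:Description>
  <rdf:Description>
    <dcterms:title>Another record</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""

titles = [rec.findtext("{http://purl.org/dc/terms/}title")
          for rec in iter_records(BytesIO(sample))]
```

Each record yielded this way can then be cleaned up, augmented and written out to the pairtree store as its own document.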
It’s easiest to explain what happens to the individual author/publisher nodes with a before-and-after example.
Original fragment:
<dcterms:contributor>
  <rdf:Description>
    <rdfs:label>Tooley, R. V. (Ronald Vere), 1898-1986</rdfs:label>
  </rdf:Description>
</dcterms:contributor>
Augmented fragment:
<dcterms:contributor>
  <foaf:Agent rdf:about="http://purl.org/okfn/bl#agent-eea1ab4ff2be4baa6f9d623bdda5e852">
    <foaf:name>Tooley, R. V. (Ronald Vere)</foaf:name>
    <bio:event>
      <bio:Birth>
        <bio:date>1898</bio:date>
      </bio:Birth>
    </bio:event>
    <bio:event>
      <bio:Death>
        <bio:date>1986</bio:date>
      </bio:Death>
    </bio:event>
  </foaf:Agent>
</dcterms:contributor>
The URIs are generated by taking an MD5 hash of a number of values, including the full line from the file it appears on, the extracted author’s name, and the line number within the file. The idea was to generate URIs that were as unique as possible, but reproducible from the same set of data if the script was re-run.
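The minting step looks roughly like this. The exact values hashed, and how they are combined, are assumptions here – the real script has its own recipe – but the property that matters is shown: the same inputs always reproduce the same URI.

```python
import hashlib

def mint_uri(raw_line, name, line_no,
             base="http://purl.org/okfn/bl#agent-"):
    """Deterministically mint a URI from the hashed inputs.

    Assumption: the raw line, extracted name and line number are joined
    with a separator before hashing; the real script may differ.
    """
    digest = hashlib.md5(
        ("%s|%s|%d" % (raw_line, name, line_no)).encode("utf-8")
    ).hexdigest()
    return base + digest

uri = mint_uri("<rdfs:label>Tooley, R. V. ...</rdfs:label>",
               "Tooley, R. V. (Ronald Vere)", 42)
# Re-running over the same data reproduces the same URI.
```

Hashing the full raw line as well as the extracted name means two different people who happen to share a name on different lines still get distinct URIs.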
By giving the books, musical scores, authors, publishers and related works externally addressable URIs, it allows third-party datasets, such as sameas.org, to overlay their version of which things are the same.
You can then choose the overlay dataset that links together the authors and so on based on how much you trust the matching techniques of the service, rather than glomming the original data and the asserted data together inextricably.