Querying the British National Bibliography

Following up on the earlier announcement that the British Library
has made the British National Bibliography available under a public
domain dedication, the JISC Open Bibliography project has worked to
make this data more useable.

The data has been loaded into a Virtuoso store that can be queried
through the SPARQL Endpoint. The URIs that we have assigned to each
record are made dereferenceable by the ORDF software, which supports
content auto-negotiation and embeds RDFa in the HTML representation.
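
As a rough illustration of the content negotiation described above, the
following Python sketch prepares a request for a record URI that asks for
an RDF serialisation instead of HTML. The media types in the Accept header
are assumptions about what the server offers, and build_rdf_request is a
hypothetical helper name, not part of ORDF.

```python
import urllib.request

# Sample record URI taken from this post.
ENTRY_URI = "http://bnb.bibliographica.org/entry/GB8102507"

# Assumed media types; the server may support a different set.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9, text/html;q=0.1"


def build_rdf_request(uri, accept=RDF_ACCEPT):
    """Return a Request that asks for RDF through the Accept header."""
    return urllib.request.Request(uri, headers={"Accept": accept})


# Sending the request needs network access and a live server:
#   with urllib.request.urlopen(build_rdf_request(ENTRY_URI)) as response:
#       print(response.headers.get("Content-Type"))
```

Keeping the request construction separate from the network call makes it
easy to inspect the headers before anything is sent.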

The data contains some 3 million individual records and some 173
million triples. Indexing the data was a very CPU-intensive process
taking approximately three days. Transforming and loading the source
data took about five hours.

To get an idea of the shape of the data, let us consider a sample
resource, http://bnb.bibliographica.org/entry/GB8102507 . Apart from
linkage between the various representations, the description of the
entity itself is as follows:

@prefix ov: <http://open.vocab.org/terms/> .
@prefix isbd: <http://iflastandards.info/ns/isbd/elements/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix bio: <http://purl.org/vocab/bio/0.1/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://bnb.bibliographica.org/entry/GB8102507> a bibo:Book, bibo:Document;
  dc:source <http://bnb.bibliographica.org/dataset/BNBrdfdc03.xml#183143>;
  dc:isPartOf <http://bnb.bibliographica.org/dataset>;
  rdfs:seeAlso <http://purl.org/NET/book/isbn/0241105161#book>,
               <http://www4.wiwiss.fu-berlin.de/bookmashup/books/0241105161>;

  dc:title "A good man in Africa";
  dc:language [ rdf:value "eng"^^dc:ISO639-2 ];
  dc:extent [ rdfs:label "251p" ];

  dc:contributor [ a foaf:Agent;
                   foaf:name "Boyd, William";
                   skos:notation "Boyd, William, 1952-";
                   bio:event [ a bio:Birth;
                               bio:date "1952"^^xsd:gYear ];
                   = <http://bibliographica.org/entity/735b02a8f051e2249e40fbd48112d033>;
                 ];

  dc:subject [ rdfs:label "Fiction in English" ],
             [ rdfs:label "1945-" ],
             [ rdfs:label "Texts" ],
             [ a skos:Concept;
               skos:inScheme <http://dewey.info/scheme/e18>;
               skos:notation "823/.9/1"^^<ddc:Notation> ],
             [ a skos:Concept;
               skos:inScheme <http://dewey.info/scheme/e19>;
               skos:notation "823/.914"^^<ddc:Notation> ];

  dc:publisher [ a foaf:Agent;
                 foaf:name "Hamilton";
                 skos:notation "Hamilton";
                 = <http://bibliographica.org/entity/c080da5b03a0786efa61e61123b359d9>;
               ];
  dc:issued "1981"^^xsd:gYear;
  isbd:hasPlaceOfPublicationProductionDistribution [ rdfs:label "London" ];

  bibo:identifier "GB8102507";
  bibo:isbn <urn:isbn:0241105161>;
  ov:blid "008042853".

Some of the salient features of this representation are:

  • Assignment of URIs for each entry in the British National
    Bibliography under http://bnb.bibliographica.org/.
  • Linkage with rdfs:seeAlso to the RDF Book Mashup and RDF Book
    Vocabulary.
  • Author and publisher are preserved as blank nodes as in the source
    data but are augmented with owl:sameAs links into the
    Bibliographica namespace to support further annotation,
    correction, deduplication, etc.
  • Any series, if present, is promoted to a first-class entity in the
    Bibliographica namespace for further processing.
  • Extraction of birth and death dates from the canonical
    string representation for authors.
  • For authors, publishers and series, their name as present in the
    source data is preserved using skos:notation whilst their
    names less any metadata about birth and death are represented with
    foaf:name.
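
The split between skos:notation and foaf:name, together with the date
extraction, can be sketched in Python as follows. This is not the
project's actual transformation code; the regular expression is an
assumption based on the single example heading shown above
("Boyd, William, 1952-"), and split_name_and_dates is a hypothetical name.

```python
import re

# Matches a trailing ", 1952-" or ", 1812-1870" style date suffix.
# The pattern is an assumption about how the source data writes dates.
_DATES = re.compile(r",\s*(?P<birth>\d{4})?\s*-\s*(?P<death>\d{4})?\.?\s*$")


def split_name_and_dates(notation):
    """Split a heading into (name, birth_year, death_year).

    Years are returned as strings, or None when absent.
    """
    match = _DATES.search(notation)
    if match is None or not (match.group("birth") or match.group("death")):
        # No recognisable date suffix: the whole string is the name.
        return notation.strip(), None, None
    name = notation[:match.start()].strip()
    return name, match.group("birth"), match.group("death")


print(split_name_and_dates("Boyd, William, 1952-"))
```

The original string would be kept as skos:notation, the first element of
the result would become foaf:name, and the years would feed the bio:Birth
and bio:Death events.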

The entire dataset can be queried through the SPARQL Endpoint and
makes use of some of the extended features of Virtuoso such as
full-text indexing. This is accomplished by using the bif:contains
built-in function and is what powers the search functionality on the
website. The default (example) query returns some details about all
books that have "Edinburgh" in their titles:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?book ?title ?name ?description
WHERE {
    ?book a bibo:Book .
    ?book dc:title ?title . ?title bif:contains "Edinburgh" .
    OPTIONAL { ?book dc:description ?description } .
    OPTIONAL {
        ?book dc:contributor ?author . ?author foaf:name ?name
    }
} GROUP BY ?book LIMIT 50

It should be noted that only some predicates are indexed for full-text
searching, namely:

  • rdf:value
  • rdfs:label
  • rdfs:comment
  • skos:prefLabel
  • skos:altLabel
  • dc:title
  • dc:description
  • foaf:name
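
To show how such a full-text search might be issued from a program, here
is a minimal Python sketch that builds a bif:contains query against
dc:title and encodes it as a GET request URL. The endpoint address is a
hypothetical placeholder, the helper names are invented for illustration,
and the sketch only accepts a single alphanumeric search word so that
nothing needs escaping.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://bnb.bibliographica.org/sparql"

QUERY_TEMPLATE = """PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT DISTINCT ?book ?title
WHERE {{
    ?book a bibo:Book .
    ?book dc:title ?title . ?title bif:contains "{term}" .
}} LIMIT {limit}"""


def build_title_search(term, limit=50):
    """Return a SPARQL query that searches dc:title for one word."""
    if not term.isalnum():
        raise ValueError("this sketch only accepts one alphanumeric word")
    return QUERY_TEMPLATE.format(term=term, limit=int(limit))


def build_request_url(term, endpoint=ENDPOINT, limit=50):
    """Return a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": build_title_search(term, limit)})


print(build_title_search("Edinburgh"))
```

Because bif:contains is a Virtuoso extension, a query built this way is
not portable to other SPARQL stores.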

Further Work

An ultimate goal of our work in the Open Bibliography group at the
OKF is to enable the collection of rich metadata about the
relationships between works and authors, to document and map the
scholarly discourse. This dataset is an important building block to
help ground the references in such a project. However, more immediately
we will:

  • Make available a voiD description of this dataset that describes
    its properties in more detail.
  • Make available a dump of our dataset derived from the BNB so that
    the data can be easily mirrored and copied for local processing.
  • Correct the errors listed in the Errata section below.

though not necessarily in that order.

Errata

  • ISBNs were represented in the source dataset as string literals of the
    form URN:ISBN:0123456789 and were erroneously transformed to URIs in
    violation of the rdfs:range of bibo:isbn.
  • The predicate linking a resource to its representations,
    foaf:isPrimaryTopicOf, contains a typo. This may make the data
    difficult to use with RDF browsing clients that do not infer the
    inverse of foaf:primaryTopic.