This blog post is written by Etienne Posthumus and Adrian Pohl.
We are happy that the German National Library recently released the German National Bibliography as Linked Open Data, see (announcement). At the #bibliohack this week we worked on getting the data into a BibServer instance. Here, we want to share our experiences in trying to re-use this dataset.
Parsing large turtle files: problem and solution
The raw data file is 1.1GB in a compressed format – unzipped it is a 6.8 GB turtle file.
Working with this file is unwieldy, it can not be read into memory or converted with tools like rapper (which only works for turtle files up to 2 GB, see this mail thread). Thus, it would be nice if the German National Library could either provide one big N-Triples file that is better for streaming processing or provide a number of smaller turtle files.
Our solution to get the file into a workable form is to make a small Python script that is Turtle syntax aware, to split the file into smaller pieces. You can’t use the standard UNIX split command, as each snippet of the split file also needs the prefix information at the top and we do not want to split an entry in the middle, losing triples.
Converting the N-Triples to BibJSON
After this, we started working on parsing an example N-Triples file to convert the data to BibJSON. We haven’t gotten that far, though. See https://gist.github.com/2928984#file_ntriple2bibjson.py for the resulting code (work in progress).
We noted problems with some properties that we like to document here as feedback for the German National Library.
Heterogeneous use of dcterms:extent
The dcterms:extent property is used in many different ways, thus we are considering to omit it in the conversion to BibJSON. Some example values of this property: “Mikrofiches”, “21 cm”, “CD-ROMs”, “Videokassetten”, “XVII, 330 S.”. Probably it would be the more appropriate choice to use dcterms:format for most of these and to limit the use of dcterms:extent to pagination information and duration.
URIs that don’t resolve
Also, DDC URIs that are connected to a resource with dcters:subject don’t resolve, e.g. http://d-nb.info/ddc-sg/070.
At a previous BibServer hackday, we loaded the Britsh National Bibliography data into BibServer. This was a similar problem, but as the data was in RDF/XML we could directly use the built-in Python XML streaming parser to convert the RDF data into BibJSON.
See: https://gist.github.com/1731588 for the source.