Open Bibliography and Open Bibliographic Data » national library
Open Bibliographic Data Working Group of the Open Knowledge Foundation, http://openbiblio.net

German National Library publishes 11.5 million MARC records from national bibliography
Mon, 01 Jul 2013
http://openbiblio.net/2013/07/01/german-national-library-publishes-11-5-marc-records-from-national-bibliography/

In January 2012 the German National Library (DNB) had already started publishing the national bibliography as linked data under a CC0 license. Today, the DNB announced that it also publishes the national bibliography up to the year 2011 as MARC data. The full announcement reads as follows (quick translation by myself):

“All of the German National Library’s title data that is offered under a Creative Commons Zero (CC0) license for free use is now also available gratis as MARC 21 records. In total, these are more than 11.5 million title records.

Currently, title data up to bibliography year 2011 is offered under a Creative Commons Zero (CC0) license. A free-of-charge registration is required to use the data. Title data of the current and the previous year remain subject to a charge. The CC0 data package will be expanded by one bibliography year in the first quarter of each year.

It is planned to provide free access under CC0 conditions to all data in all formats in mid-2015. The German National Library thus takes into account the growing need for freely available metadata.”

As the MARC data contains much more information than the linked data (because not all MARC fields are currently mapped to RDF), this is good news for anybody interested in getting all the information available in the national bibliography. As the DNB still makes money by selling the national bibliography to libraries and other interested parties, it won’t release the bibliographic data right up to the present day into the public domain. It’s good to see that there are already plans to switch to a fully free model in 2015.

See also Lars Svensson: Licensing Library and Authority Data Under CC0: The DNB Experience (pdf).

Importing Spanish National Library to BibServer
Tue, 07 Aug 2012
http://openbiblio.net/2012/08/07/importing-spanish-national-library-to-bibserver/

The Spanish National Library (Biblioteca Nacional de España, or BNE) has released its library catalogue as Linked Open Data on the Datahub.

Initially this entry contained only the SPARQL endpoints and no downloads of the full datasets. After some enquiries from Naomi Lillie, the entry was updated with links to more information and bulk downloads at http://www.bne.es/es/Catalogos/DatosEnlazados/DescargaFicheros/

This library dataset is particularly interesting as it is not a ‘straightforward’ dump of bibliographic records. This is best explained by Karen Coyle in her blogpost.

For a BibServer import, this means the importing script has to distinguish the type of each record it reads and take the relevant action before building the BibJSON entry. Fortunately the data dump was already provided as N-Triples, so we did not have to pre-process the large data file (4.9 GB) in the same manner as we did with the German National Library dataset.

The Python script that reads the data file can be viewed at https://gist.github.com/3225004

A complication from a data wrangler’s point of view is that the field names are based on IFLA standards, which use numeric codes rather than ‘guessable’ English terms such as Dublin Core fields. This is more correct from an international and data-quality point of view, but it does make the initial mapping more time-consuming.

So when mapping a data item like https://gist.github.com/3225004#file_sample.nt, we need to dereference each field name and map it to the relevant BibJSON entry.
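The authoritative code is the gist linked above; purely as an illustration of this kind of lookup-based mapping, a minimal sketch might look like the following. The IFLA property URIs and BibJSON keys shown here are hypothetical placeholders, not the mapping actually used:

```python
# Minimal sketch: map numeric-coded IFLA property URIs to BibJSON keys.
# The URIs and target keys below are illustrative placeholders only.
PROPERTY_MAP = {
    "http://iflastandards.info/ns/isbd/elements/P1004": "title",      # hypothetical
    "http://iflastandards.info/ns/isbd/elements/P1006": "publisher",  # hypothetical
}

def triples_to_bibjson(triples):
    """Group (subject, predicate, object) triples by subject and build one
    BibJSON-style dict per subject, keeping only the mapped properties."""
    records = {}
    for s, p, o in triples:
        key = PROPERTY_MAP.get(p)
        if key is None:
            continue  # property not (yet) dereferenced and mapped
        records.setdefault(s, {"_id": s})[key] = o
    return list(records.values())

if __name__ == "__main__":
    sample = [
        ("http://datos.bne.es/resource/XX0000001",
         "http://iflastandards.info/ns/isbd/elements/P1004",
         "Don Quijote de la Mancha"),
    ]
    print(triples_to_bibjson(sample))
```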

As we identify more Linked Open Data National Bibliographies, these experiments will be continued under the http://nb.bibsoup.net/ BibServer instance.

Bringing the Open German National Bibliography to a BibServer
Mon, 18 Jun 2012
http://openbiblio.net/2012/06/18/bringing-the-open-german-national-bibliography-to-a-bibserver/

This blog post is written by Etienne Posthumus and Adrian Pohl.

We are happy that the German National Library recently released the German National Bibliography as Linked Open Data (see the announcement). At the #bibliohack this week we worked on getting the data into a BibServer instance. Here we want to share our experiences in trying to re-use this dataset.

Parsing large Turtle files: problem and solution

The raw data file is 1.1 GB compressed; unzipped it is a 6.8 GB Turtle file.
Working with this file is unwieldy: it cannot be read into memory or converted with tools like rapper (which only works for Turtle files up to 2 GB, see this mail thread). It would thus be nice if the German National Library could either provide one big N-Triples file, which is better suited to streaming processing, or a number of smaller Turtle files.

Our solution for getting the file into a workable form is a small, Turtle-syntax-aware Python script that splits the file into smaller pieces. The standard UNIX split command is not enough, as each chunk of the split file also needs the prefix declarations at the top, and we do not want to split an entry in the middle and lose triples.
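The script we actually used is not reproduced here, but a minimal sketch of the idea might look like the following. It assumes that all @prefix declarations appear at the top of the file and that each new record starts at a non-indented line; the file names and chunk size are arbitrary examples, not the values used at the hackday:

```python
# Minimal sketch of a Turtle-aware splitter. Assumptions: all @prefix lines
# appear at the top of the file, and every non-indented, non-empty line starts
# a new record. File names and chunk size are arbitrary examples.
RECORDS_PER_CHUNK = 100000

def split_turtle(path, records_per_chunk=RECORDS_PER_CHUNK):
    prefixes = []                      # @prefix lines, repeated in every chunk
    chunk, count, part = [], 0, 0
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            if line.startswith("@prefix"):
                prefixes.append(line)
                continue
            if line.strip() and not line[0].isspace():   # start of a new record
                count += 1
                if count > records_per_chunk:
                    _write_chunk(path, part, prefixes, chunk)
                    chunk, count, part = [], 1, part + 1
            chunk.append(line)
    if chunk:
        _write_chunk(path, part, prefixes, chunk)

def _write_chunk(path, part, prefixes, lines):
    with open("%s.part-%04d.ttl" % (path, part), "w", encoding="utf-8") as out:
        out.writelines(prefixes)
        out.writelines(lines)

if __name__ == "__main__":
    split_turtle("DNBTitel.ttl")   # hypothetical input file name
```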

See a sample N-Triples file converted from a Turtle snippet.

Converting the N-Triples to BibJSON

After this, we started working on parsing an example N-Triples file to convert the data to BibJSON. We haven’t gotten that far, though. See https://gist.github.com/2928984#file_ntriple2bibjson.py for the resulting code (work in progress).
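For reference, a line-oriented N-Triples reader is easy to sketch; the following is only an illustration (it handles URIs and plain literals, not every corner of the N-Triples grammar) and is not the work-in-progress code in the gist:

```python
import re

# Very small N-Triples reader: yields (subject, predicate, object) strings.
# Illustrative only; it ignores blank nodes, language tags, datatypes and escapes.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

def read_ntriples(path):
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            match = TRIPLE.match(line)
            if not match:
                continue  # blank line, comment or unhandled syntax
            subject, predicate, obj = match.groups()
            if obj.startswith('"'):
                obj = obj[1:obj.rfind('"')]  # crude literal handling
            elif obj.startswith('<') and obj.endswith('>'):
                obj = obj[1:-1]              # URI object
            yield subject, predicate, obj
```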

Problems

We noted problems with some properties, which we would like to document here as feedback for the German National Library.

Heterogeneous use of dcterms:extent

The dcterms:extent property is used in many different ways, so we are considering omitting it in the conversion to BibJSON. Some example values of this property: “Mikrofiches”, “21 cm”, “CD-ROMs”, “Videokassetten”, “XVII, 330 S.”. It would probably be more appropriate to use dcterms:format for most of these and to limit dcterms:extent to pagination information and duration.
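If we did keep the field, one option would be a crude filter that only passes values which look like pagination or duration. A sketch of that heuristic follows; the patterns are guesses based on the example values above, not a vetted rule set:

```python
import re

# Heuristic: keep dcterms:extent values that look like pagination ("XVII, 330 S.")
# or duration ("74 min."), drop carrier/format terms like "CD-ROMs" or "21 cm".
# The patterns are illustrative guesses, not a complete rule set.
PAGINATION = re.compile(r'\d+\s*S\.')       # e.g. "330 S.", "XVII, 330 S."
DURATION = re.compile(r'\d+\s*[Mm]in')      # e.g. "74 min."

def keep_extent(value):
    return bool(PAGINATION.search(value) or DURATION.search(value))

assert keep_extent("XVII, 330 S.")
assert not keep_extent("CD-ROMs")
assert not keep_extent("21 cm")
```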

URIs that don’t resolve

We stumbled over some URIs that don’t resolve, regardless of whether you request RDF or HTML in the Accept header. Examples: http://d-nb.info/019673442, http://d-nb.info/019675585, http://d-nb.info/011077166

Also, DDC URIs that are connected to a resource with dcterms:subject don’t resolve, e.g. http://d-nb.info/ddc-sg/070.

Footnote

At a previous BibServer hackday, we loaded the British National Bibliography data into BibServer. This was a similar problem, but as the data was in RDF/XML, we could directly use Python’s built-in streaming XML parser to convert the RDF data into BibJSON.
See: https://gist.github.com/1731588 for the source.
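The actual BNB conversion code is in the gist above; as a generic sketch of the streaming approach, assuming the records are serialized as top-level rdf:Description elements, the pattern looks roughly like this:

```python
import xml.etree.ElementTree as ET

RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

def stream_descriptions(path):
    """Stream rdf:Description elements from a large RDF/XML file without
    loading it all into memory, clearing each element after use."""
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == RDF_NS + "Description":
            yield elem.get(RDF_NS + "about"), elem
            elem.clear()  # free the subtree of the record we just handled
```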

Linked Data at the Biblioteca Nacional de España
Thu, 02 Feb 2012
http://openbiblio.net/2012/02/02/linked-data-at-the-biblioteca-nacional-de-espana/

The following guest post is from the National Library of Spain and the Ontology Engineering Group at the Technical University of Madrid (UPM).

Datos.bne.es is an initiative of the Biblioteca Nacional de España (BNE) whose aim is to enrich the Semantic Web with library data.

This initiative is part of the project “Linked Data at the BNE”, supported by the BNE in cooperation with the Ontology Engineering Group (OEG) at the Universidad Politécnica de Madrid (UPM). The first meeting took place in September 2010, and the collaboration agreement was signed in October 2010. A first set of data was transformed and linked in April 2011, and a more significant set followed in December 2011.

The initiative was presented in the auditorium of the BNE on 14 December 2011 by Asunción Gómez-Pérez, Professor at the UPM, and Daniel Vila-Suero, Project Lead (OEG-UPM), and by Ricardo Santos, Chief of Authorities, and Ana Manchado Mangas, Chief of Bibliographic Projects, both from the BNE. The audience enjoyed the invaluable participation of Gordon Dunsire, Chair of the IFLA Namespace Group.

The concept of Linked Data was first introduced by Tim Berners-Lee in the context of the Semantic Web. It refers to the method of publishing and linking structured data on the Web. Hence, the project “Linked Data at the BNE” involves the transformation of BNE bibliographic and authority catalogues into RDF as well as their publication and linkage by means of IFLA-backed ontologies and vocabularies, with the aim of making data available in the so-called cloud of “Linked Open Data”. This project focuses on connecting the published data to other data sets in the cloud, such as VIAF (Virtual International Authority File) or DBpedia.
With this initiative, the BNE takes on the challenge of publishing bibliographic and authority data in RDF, following the Linked Data principles and under the open CC0 license (Creative Commons Public Domain Dedication). Spain thereby joins the initiatives that national libraries from countries such as the United Kingdom and Germany have recently launched.

Vocabularies and models

IFLA-backed ontologies and models, widely agreed upon by the library community, have been used to represent the resources in RDF. Datos.bne.es is one of the first international initiatives to thoroughly embrace the models developed by IFLA: the FR models FRBR (Functional Requirements for Bibliographic Records), FRAD (Functional Requirements for Authority Data) and FRSAD (Functional Requirements for Subject Authority Data), as well as ISBD (International Standard Bibliographic Description).

FRBR has been used as the reference and data model because it provides a comprehensive and organized description of the bibliographic universe, allowing the gathering of useful data and navigation. Entities, relationships and properties have been expressed in RDF using the vocabularies provided by IFLA; thus the FR ontologies have been used to describe Persons, Corporate Bodies, Works and Expressions, and ISBD properties for Manifestations. All these vocabularies are now available at the Open Metadata Registry (OMR) with “published” status. Additionally, in cooperation with IFLA, labels have been translated into Spanish.
MARC 21 bibliographic and authority files have been tested and mapped to the classes and properties at the OMR. The following mappings were carried out (a schematic sketch follows the list):

  • A mapping to determine, given a field tag and a certain subfield combination, to which FRBR entity (Person, Corporate Body, Work, Expression) it relates. This mapping was applied to authority files.
  • A mapping to establish relationships between entities.
  • A mapping to determine, given a field/subfield combination, to which property it can be mapped. Authority files were mapped to FR vocabularies, whereas bibliographic files were mapped to ISBD vocabulary. A number of properties from other vocabularies were also used.
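To illustrate what such table-driven mappings can look like in practice, here is a purely schematic sketch. The tags, subfield combinations, entities and property URIs below are illustrative placeholders, not the rules actually used by datos.bne.es, and a real pipeline would read records with a MARC library rather than hand-written tuples:

```python
# Schematic sketch of table-driven MARC-to-RDF mapping rules.
# All tags, subfields, entities and property URIs are illustrative placeholders.
ENTITY_RULES = {
    # (field tag, subfield) -> FRBR entity
    ("100", "a"): "Person",
    ("110", "a"): "CorporateBody",
    ("130", "a"): "Work",
}

PROPERTY_RULES = {
    # (field tag, subfield) -> property URI (hypothetical examples)
    ("100", "a"): "http://iflastandards.info/ns/fr/frad/P3039",
    ("245", "a"): "http://iflastandards.info/ns/isbd/elements/P1004",
}

def map_record(fields):
    """fields: list of (tag, {subfield_code: value}) pairs standing in for a
    parsed MARC record. Returns the detected entity and mapped statements."""
    entity, statements = None, {}
    for tag, subfields in fields:
        for code, value in subfields.items():
            entity = entity or ENTITY_RULES.get((tag, code))
            prop = PROPERTY_RULES.get((tag, code))
            if prop:
                statements[prop] = value
    return entity, statements

if __name__ == "__main__":
    authority = [("100", {"a": "Cervantes Saavedra, Miguel de"})]
    print(map_record(authority))  # ('Person', {...})
```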

The aforementioned mappings will soon be made available to the library community; with this the BNE would like to contribute to the discussion of mapping MARC records to RDF, and other libraries willing to transform their MARC records into RDF will be able to reuse these mappings.

Almost 7 million records transformed under an open license

Approximately 2.4 million bibliographic records have been transformed into RDF, covering modern and ancient monographs, sound recordings and musical scores. In addition, 4 million authority records of persons, corporate names, uniform titles and subjects have been transformed. All of them belong to the bibliographic and authority catalogues of the BNE, stored in MARC 21 format. For the data transformation, the MARiMbA (MARc mappIngs and rdf generAtor) tool has been developed and used. MARiMbA is a tool for librarians whose goal is to support the entire process of generating RDF from MARC 21 records. The tool allows any vocabulary to be used (in this case ISBD and the FR family) and simplifies the process of assigning correspondences between RDFS/OWL vocabularies and MARC 21. As a result of this process, about 58 million triples have been generated in Spanish. These triples are high-quality data with important cultural value that substantially increase the presence of the Spanish language in the data cloud.

Once the data had been described with the IFLA models and the bibliographic and authority catalogues generated in RDF, the next step was to connect these data with other RDF knowledge bases included in the Linking Open Data initiative. The BNE data are now linked to data from other international data sources through VIAF, the Virtual International Authority File.

The licence applied to the data is CC0 (Creative Commons Public Domain Dedication), a completely open licence aimed at promoting data reuse. With this project, the BNE adheres to the Spanish public sector’s commitment to openness and data reuse, as established in Royal Decree 1495/2011 of 24 October (Real Decreto 1495/2011, de 24 de octubre) on the reuse of public sector information, and also acknowledges the proposals of the CENL (Conference of European National Librarians).

Future steps

In the short term, the next steps to carry out include:

  • Migration of a larger set of catalogue records.
  • Improvement of the quality and granularity of both the transformed entities and the relationships between them.
  • Establishment of new links to other interesting datasets.
  • Development of a human-friendly visualization tool.
  • SKOSification of subject headings.

Team

From BNE: Ana Manchado, Mar Hernández Agustí, Fernando Monzón, Pilar Tejero López, Ana Manero García, Marina Jiménez Piano, Ricardo Santos Muñoz and Elena Escolano.
From UPM: Asunción Gómez-Pérez, Elena Montiel-Ponsoda, Boris Villazón-Terrazas and Daniel Vila-Suero.

German National Library goes LOD & publishes National Bibliography
Thu, 26 Jan 2012
http://openbiblio.net/2012/01/26/german-national-library-goes-lod-publishes-national-bibliography/

Good news from Germany. The German National Library

  1. changed its licensing regime for Linked Data to CC0, which makes the data open according to the Open Definition,
  2. has begun to publish the German national bibliography as Linked Open Data.

For background see the email (German) announcing this step. There it says (my translation):

In 2010 the German National Library (DNB) started publishing authority data as Linked Data. The existing Linked Data service of the DNB is now extended with title data. In this context the licence for Linked Data is changed to “Creative Commons Zero”.

So far, the majority of DNB title data has been implemented, as well as periodicals and series – the music data and the holdings of the German Exile Archive are still missing. From now on, the RDF/XML representation of a title record is available in the DNB portal via a link. This is expressly an experimental service which will be extended and refined continually. More detailed information about modelling questions and the general approach can be found in the updated documentation.

The English documentation (PDF) hasn’t been updated yet and only describes the GND authority data. On the wiki page about the LOD service it says: “Examples and further information about FTP-downloads will come soon.” An entry on the Data Hub has already been made for the data.

Swedish National Bibliography as Open Data
Wed, 21 Sep 2011
http://openbiblio.net/2011/09/21/swedish-national-bibliography-as-open-data/

The blog of LIBRIS, Sweden’s national library system, announced today that the Swedish National Bibliography along with the authority data are published under a CC0 license.

“We are now pleased to announce the general availability of these records (see below for details). We see the investment in Open Data as a strategic one and one that is needed to ensure long term sustainability and competition when it comes to the services needed by libraries and their users as well as the right to control over their collections. The license chosen is CC0 which waives any rights the National Library have over the National Bibliography and the authority data.

There are two ways to access the data: as Atom feeds with references to the records, and via (a somewhat rudimentary implementation of) the OAI-PMH protocol.”
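Harvesting records over OAI-PMH is straightforward to script. Below is a rough sketch of the standard ListRecords/resumptionToken loop; the endpoint URL and metadata prefix are placeholders to adapt to whatever the LIBRIS service actually exposes:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "https://example.org/oai"  # placeholder, not the real LIBRIS endpoint
PREFIX = "marcxml"                    # placeholder metadata prefix

def harvest(endpoint=ENDPOINT, metadata_prefix=PREFIX):
    """Yield OAI-PMH <record> elements, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        with urlopen(endpoint + "?" + urlencode(params)) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI_NS + "record"):
            yield record
        token = tree.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```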

LIBRIS pioneered in 2008 by making the records in the Swedish union catalogue available as Linked Data. But the data had not been openly licensed, and there was no easy way to get hold of bigger parts of it; both of these things have changed as of today. The long-term goal, the announcement regarding the Swedish Union Catalogue says, is “to release the whole database under an open license, though this will undoubtedly take some time.”

LOD at Bibliothèque nationale de France
Wed, 21 Sep 2011
http://openbiblio.net/2011/09/21/lod-at-bibliotheque-nationale-de-france/

Romain Wenz of the Bibliothèque nationale de France (BnF) informed me via email about this pleasant development: the BnF’s Linked Data service is now a Linked Open Data service!

The BnF LOD service is in the first instance limited to classical authors and currently comprises approximately 1,600 authors and nearly 4,000 works described in 1.4 million RDF triples. See also the accompanying entry on the Data hub. More information about data.bnf.fr can be found in this use case for the W3C Linked Library Data Incubator Group.

On http://data.bnf.fr/semanticweb one can find a link to a full dump of the RDF data. The corresponding license text reads (in my rough translation from the French):

“Reuse of the data exposed in RDF format is free of charge and unrestricted, provided that the legislation in force is respected and that the source attribution “Bibliothèque nationale de France” is kept with the data. Users may adapt or modify the data, on condition that they clearly inform third parties of this and do not distort its meaning.”

This looks like an attribution license to me (my French is not the best!), but it is not made clear – neither on the webpage nor in the LICENCE.txt accompanying the dump – how the requirement of attribution should be met in practice. (The BnF might learn in this respect from New Zealand National Library’s approach.)

It also says that data in formats other than RDF is still licensed under a non-commercial license:

“Reuse of the data exposed in any other format is subject to the following conditions:

  • non-commercial reuse of the data is free of charge and unrestricted, provided that the legislation in force is respected, in particular that the source attribution “Bibliothèque nationale de France” is kept with the data. Users may adapt or modify the data, on condition that they clearly inform third parties of this and do not distort its meaning.

  • commercial reuse of this content is subject to a fee and requires a licence. Commercial reuse means acquiring the data in order to develop a product or service intended to be made available to third parties for a fee, or free of charge but with a commercial purpose. Click here for the pricing and the licence.”

Open national bibliography data by New Zealand National Library
Mon, 12 Sep 2011
http://openbiblio.net/2011/09/12/open-national-bibliography-data-by-new-zealand-national-library/

A tweet by Owen Stephens pointed me to the New Zealand National Library’s service which provides the national bibliography as MARC/MARCXML dumps (350,000 records) licensed under a Creative Commons Attribution license. Great!

Obviously this service has been around for a while now, but I had not heard about it before. As it wasn’t registered on CKAN/the Data Hub, I created an entry and added it to the Bibliographic Data group.

Using attribution licenses for data

This publication is an interesting case as it uses an attribution license for bibliographic data. Until now, most open bibliographic datasets have been published under a public domain license. So, the question pops up: “Under what conditions may I use a CC-BY licensed dataset?”

The readme.txt accompanying the download file (118 MB!) gives some clarity:

“You do not need to make any attribution to National Library of New Zealand Te Puna Matauranga o Aotearoa if you are copying, distributing or adapting only a small number of individual bibliographic records from the overall Dataset.

If you publish, distribute or otherwise disseminate this work to the public without adapting it, the following attribution to National Library of New Zealand Te Puna Matauranga o Aotearoa should be used:

“Source: National Library of New Zealand Te Puna Matauranga o Aotearoa and licensed by the Department of Internal Affairs for re-use under the Creative Commons Attribution 3.0 New Zealand Licence (http://creativecommons.org/licenses/by/3.0/nz/).”

If you adapt this work in any way or include it in a wider collection, and publish, distribute or otherwise disseminate that adaptation or collection to the public, the following style of attribution to National Library of New Zealand Te Puna Matauranga o Aotearoa should be used:

“This work uses data sourced from National Library of New Zealand Te Puna Matauranga o Aotearoa’s Publications New Zealand Metadata Dataset which is licensed by the Department of Internal Affairs for re-use under the Creative Commons Attribution 3.0 New Zealand licence (http://creativecommons.org/licenses/by/3.0/nz/).”

In my opinion, these license requirements set a good precedent for licensing bibliographic data under an attribution license, although it is not clear what still counts as “a small number of individual records”. I think it is important, and the only legally consistent approach, that datasets with an attribution or share-alike license be attributed at the database level and not at the record level. Others who want to use an attribution license should use similar wording.

This might be of interest for other attempts to use an attribution license, e.g. at OCLC or E-LIS.

In related news, there’ll be a LODLAM-NZ event on December 1st in Wellington, see http://lod-lam.net/summit/2011/09/08/lodlam-nz/. Converting this dataset to LOD might be a topic…

Update: Tim McNamara has already provided an RDF version of the bibliographic data and reported on his motivations and challenges, see this post.

JISC OpenBibliography: British Library data release
Wed, 17 Nov 2010
http://openbiblio.net/2010/11/17/jisc-openbibliography-british-library-data-release/

The JISC OpenBibliography project is excited to announce that the British Library is providing a set of bibliographic data under the CC0 Public Domain Dedication licence.

We have initially received a dataset of approximately 3 million records, which is now available as a CKAN package. This dataset comprises the entire British National Bibliography, describing new books published in the UK since 1950; it represents about 20% of the total BL catalogue, and we are working to add further releases. In addition, we are developing sample access methods for the data, which we will post about later this week.

Agreements such as these are crucial to our community, as developments in areas such as Linked Data are only beneficial when there is content on which to operate. We look forward to announcing further releases and developments, and to being part of a community dedicated to the future of open scholarship.

Usage guide from BL:

This usage guide is based on goodwill. It is not a legal contract. We ask that you respect it.

Use of Data: This data is being made available under a Creative Commons CC0 1.0 Universal Public Domain Dedication licence. This means that the British Library Board makes no copyright, related or neighbouring rights claims to the data and does not apply any restrictions on subsequent use and reuse of the data. The British Library accepts no liability for damages from any use of the supplied data. For more detail please see the terms of the licence.

Support: The British Library is committed to providing high quality services and accurate data. If you have any queries or identify any problems with the data please contact metadata@bl.uk.

Share knowledge: We are also very interested to hear the ways in which you have used this data so we can understand more fully the benefits of sharing it and improve our services. Please contact metadata@bl.uk if you wish to share your experiences with us and those that are using this service.

Give Credit Where Credit is Due: The British Library has a responsibility to maintain its bibliographic data on the nation’s behalf. Please credit all use of this data to the British Library and link back to www.bl.uk/bibliographic/datafree.html in order that this information can be shared and developed with today’s Internet users as well as future generations.

Link to British Library announcement
