“Opportunities for apps developers, designers and other digital innovators will be boosted today as the digital portal Europeana opens up its dataset of over 20 million cultural objects for free re-use.
The massive dataset is the descriptive information about Europe’s digitised treasures. For the first time, the metadata is released under the Creative Commons CC0 Public Domain Dedication, meaning that anyone can use the data for any purpose – creative, educational, commercial – with no restrictions. This release, which is by far the largest one-time dedication of cultural data to the public domain using CC0, offers a new boost to the digital economy, providing electronic entrepreneurs with opportunities to create innovative apps and games for tablets and smartphones and to create new web services and portals.
Europeana’s move to CC0 is a step change in open data access. Releasing data from across the memory organisations of every EU country sets an important new international precedent, a decisive move away from the world of closed and controlled data.”
Thanks to all the people who made this possible! See also Jonathan Gray’s post at the Guardian’s Datablog.
Update 30 September 2012: Actually, it is not true to call this release “by far the largest one-time dedication of cultural data to the public domain using CC0”. In December 2011 two German library networks released their catalogue b3kat under CC0, which at that time held 22 million descriptions of bibliographic resources. See this post for more information.
Two days ago OCLC announced that linked data has been added to worldcat.org. I took a quick look at it and just want to share some notes on this.
OCLC goes open, finally
I am very happy that OCLC – by choosing the ODC-BY license – has finally opted for open licensing for WorldCat. Quite a change of attitude when you recall the attempt in 2008 to sneak in a restrictive viral copyright license as part of a WorldCat record policy (for more information see the code4lib wiki page on the policy change or my German article about it). Certainly, it was not least the blogging librarians and library tech people, the open access/open data proponents and others, who kept pushing OCLC towards openness, who made this possible. Thank you all!
Of course, this is only the beginning. One thing is that dumps of this WorldCat data aren’t available yet (see follow-up addendum here), making it necessary to crawl the whole of WorldCat to get hold of the data. Another is that there is probably a whole lot of useful information in WorldCat that isn’t part of the linked data on worldcat.org yet.
schema.org in RDFa and microdata
What information is actually encoded as linked data in worldcat.org? And how did OCLC add RDF to worldcat.org? It used the schema.org vocabulary to add semantic markup to the HTML. This markup is added both as microdata – the native choice for the schema.org vocabulary – and as RDFa. schema.org lets people choose how to use the vocabulary; the schema.org blog recently put it this way: “Our approach is “Microdata and more”. As implementations and services begin to consume RDFa 1.1, publishers with an interest in mixing schema.org with additional vocabularies, or who are using tools like Drupal 7, may find RDFa well worth exploring.”
Let’s take a look at a description of a bibliographic resource in worldcat.org, e.g. http://www.worldcat.org/title/linked-data-evolving-the-web-into-a-global-data-space/oclc/704257552. The part of the HTML source containing the semantic markup is labelled “Microdata Section” (although it also contains RDFa). As the HTML source isn’t really readable for humans, we first need to get hold of the RDF in a readable form to have a look at it. I prefer the Turtle syntax for looking at RDF. One can extract the RDF contained in the HTML using the RDFa distiller provided by the W3C. More precisely, you have to use the distiller that supports RDFa 1.1, as schema.org supports RDFa 1.1 and worldcat.org is thus marked up according to the RDFa 1.1 standard.
Running the distiller on the example resource, I get back a Turtle document that contains the following triples:
1: @prefix library: <http://purl.org/library/> .
2: @prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .
3: @prefix owl: <http://www.w3.org/2002/07/owl#> .
4: @prefix schema: <http://schema.org/> .
5: @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
6: <http://www.worldcat.org/oclc/707877350> a schema:Book;
7: library:holdingsCount "1"@en;
8: library:oclcnum "707877350"@en;
9: library:placeOfPublication [ a schema:Place;
10: schema:name "San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) :"@en ];
11: schema:about [ a skos:Concept;
12: schema:name "Web site development."@en;
13: madsrdf:isIdentifiedByAuthority <http://id.loc.gov/authorities/subjects/sh98004795> ],
14: [ a skos:Concept;
15: schema:name "Semantic Web."@en;
16: madsrdf:isIdentifiedByAuthority <http://id.loc.gov/authorities/subjects/sh2002000569> ],
17: <http://dewey.info/class/025/e22/>,
18: <http://id.worldcat.org/fast/1112076>,
19: <http://id.worldcat.org/fast/1173243>;
20: schema:author <http://viaf.org/viaf/38278185>;
21: schema:bookFormat schema:EBook;
22: schema:contributor <http://viaf.org/viaf/171087834>;
23: schema:copyrightYear "2011"@en;
24: schema:description "1. Introduction -- The data deluge -- The rationale for linked data -- Structure enables sophisticated processing -- Hyperlinks connect distributed data -- From data islands to a global data space -- Introducing Big Lynx productions --"@en,
25: "The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study."@en;
26: schema:inLanguage "en"@en;
27: schema:isbn "1608454312"@en,
28: "9781608454310"@en;
29: schema:name "Linked data evolving the web into a global data space"@en;
30: schema:publisher [ a schema:Organization;
31: schema:name "Morgan & Claypool"@en ];
32: owl:sameAs <http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001> .
This looks quite nice to me. You see how schema.org lets you easily convey the most relevant information, and the property names are well chosen to make the RDF easy for humans to read (in contrast, e.g., to the ISBD vocabulary, which – following library tradition – uses numbers in the property URIs :-/).
The example also shows the current shortcomings of schema.org and where the library community might put some effort into extending it, as OCLC has already been doing for this release with the experimental “library” extension vocabulary for use with schema.org. For example, there are no separate schema.org properties for a table of contents and an abstract, so both end up as plain strings in the schema:description property.
Links to other linked data sources
There are links to several other data sources: LoC authorities (lines 13, 16, 41, 44), dewey.info (17), the linked data FAST headings (18, 19), viaf.org (20, 22) and an owl:sameAs link to the HTTP DOI identifier (32). As most of these services are already run by OCLC, and as the connections probably already existed in the data, creating these links wasn’t hard work – which of course doesn’t make them less useful.
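If you want to work with these links programmatically, a minimal sketch could look like the following (assuming the distiller output has been saved locally as worldcat.ttl – a made-up filename – and that the rdflib package is installed); it loads the Turtle and prints every URI that is linked via schema:about, schema:author, schema:contributor or owl:sameAs:

```python
# Minimal sketch: list the outbound links in the Turtle produced by the RDFa
# distiller. "worldcat.ttl" is an assumed local filename; requires rdflib.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

SCHEMA = Namespace("http://schema.org/")

g = Graph()
g.parse("worldcat.ttl", format="turtle")

# Objects of these properties that are URIs point to other data sources.
for prop in (SCHEMA.about, SCHEMA.author, SCHEMA.contributor, OWL.sameAs):
    for _, obj in g.subject_objects(prop):
        if isinstance(obj, URIRef):
            print(prop, obj)
```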
Copyright information
What I found very interesting is the schema:copyrightYear property used in some descriptions in worldcat.org. I don’t know how many resources come with an indication of a copyright year or how accurate the data is, but this seems to me a useful source for projects like publicdomainworks.net.
Missing URIs
As with other preceding publications of linked bibliographic data, some URIs are missing for things we might want to link to instead of only serving the name string of the respective entity: I am talking about places and publishers. As far as I know, URIs for publishers don’t exist yet; hopefully someone (OCLC perhaps?) is already working on a LOD registry for publishers. For places we have GeoNames, but it is not that trivial to generate the right links. It’s no great surprise that a lot of work remains to be done to build the global data space.
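To illustrate why the place linking is not trivial, here is a rough sketch of a naive lookup against the GeoNames search API (purely an illustration, not part of any OCLC workflow; it needs the requests package and a registered GeoNames username – the value below is a placeholder). A bare place name usually returns several plausible candidates:

```python
# Naive GeoNames lookup for a place name string. Illustrates the ambiguity
# problem; "YOUR_GEONAMES_USER" is a placeholder for a registered username.
import requests

def geonames_candidates(place_name, username="YOUR_GEONAMES_USER"):
    resp = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": place_name, "maxRows": 5, "username": username},
    )
    return [
        ("http://sws.geonames.org/%s/" % hit["geonameId"], hit["name"], hit.get("countryName", ""))
        for hit in resp.json().get("geonames", [])
    ]

# "San Rafael" (from the placeOfPublication string above) could be in the USA,
# Argentina, Venezuela, the Philippines ... hence the need for disambiguation.
print(geonames_candidates("San Rafael"))
```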
The event attendees started the day all together, both hackers and workshop / seminar attendees, and Sam introduced the purpose of the day as follows: coders – to build tools and share ideas about things that will make our shared cultural heritage and knowledge commons more accessible and useful; non-coders – to get a crash course in what openness means for galleries, libraries, archives and museums, why it’s important and how you can begin opening up your data; everyone – to get a better idea about what other people working in your domain do and engender a better understanding between librarians, academics, curators, artists and technologists, in order to foster the creation of better, cooler tools that respond to the needs of our communities.
The hackers began the day with an overview of what a hackathon is for and how it can be run, as presented by Mahendra Mahey, and followed with lightning talks as follows:
We decided we wanted to work as a community, using our different skills towards one overarching goal, rather than breaking into smaller groups with separate agendas. We formed the central idea of an ‘open bibliographic tool-kit’ and people identified three main areas to hack around, playing to their skills and interests:
At this point we all broke for lunch, and the workshoppers and hackers mingled together. As hoped, conversations sprung up between people from the two different groups and it was great to see suggestions arising from shared ideas and applications of one group being explained to the theories of the other.
We re-grouped and the workshop continued until 16.00 – see here for Tim Hodson’s excellent write-up of the event and talks given – when the hackers were joined by some who attended the workshop. Each group gave a quick update on status, to try to persuade the new additions to the group to join their particular work-flow, and each group grew in number. After more hushed discussions and typing, the day finished with a talk from Tara Taubman about her background in the legalities of online security and IP, and we went for dinner. Hacking continued afterwards and we celebrated a hard day’s work down the pub, looking forward to what was to come.
Day 2 to follow…
This enables us to keep the core of BibJSON pretty simple whilst also opening up potential for more complex usage where that is required.
Due to this, we no longer use the “namespace” key in BibJSON.
Other changes include the use of a “_” prefix on internal keys – wherever our own database writes info into a record, we prefix it, as with “_id”. Because of this, uploaded BibJSON records can have an “id” key that will work, as well as an “_id” UUID applied by the BibServer system.
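As a purely illustrative sketch (the values below are made up, not taken from a real BibServer installation), an uploaded record that carries its own “id” might end up stored like this, with the system-generated “_id” sitting alongside it:

```json
{
  "_id": "0f8fad5b-d9cb-469f-a165-70867728950e",
  "id": "example2011record",
  "title": "An example record title",
  "author": [ {"name": "A. Author"} ],
  "year": "2011"
}
```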
For more information, check out BibJSON.org and JSON-LD.
Multilingual matters in BibJSON arose again and, once more, JSON-LD was given backing as being useful for our purposes
Adrian Pohl discussed Nature’s release of 450,000 articles under a CC0 licence and followed up with this more in-depth article
Todd Robbins circulated Jim Pitman’s detailed article on author identity, which explores the issue of citations in relation to a non-existent publication (!) and includes recommendations for opening your own data
Antoine Isaac notified us of Europeana’s ‘Connecting Society to Culture Programme’, as part of the Hack4Europe! 2012 road show, including 3 hackathons in 3 different countries; it turns out that one in Berlin starts on the day our own Hackathon finishes – speaking of which…
Hackathon ‘show of hands’ request sent out! Contact naomi.lillie [@] okfn.org if you’d like to add your name to the list of interested people. This Hackathon is being run by Open Biblio with Open GLAM and DevCSI, on the 12th-14th June, in East London – more details to follow
Following the last Working Group meeting, Adrian Pohl expounded upon the role and goals of the Working Group, which is now given at http://openbiblio.net/about and http://openbiblio.net/get-involved
David Weinberger announced that Harvard has opened 12 million bibliographic records to the public domain under a CC0 licence – see his blog post to read more about it and for links to further information
Finally, Adrian announced the next Working Group meeting on 7th May, where Mark MacGillivray will be updating the community on project developments
As always, thanks to our amazing community for promoting and leading open bibliographic data in all its manifestations. To become part of the group and get your voice heard, sign up to the List here.
Europeana aims to provide the widest possible access to European cultural heritage, published on a massive scale as digital resources by hundreds of museums, libraries and archives. This includes empowering other actors to build services that contribute to such access. Making data openly available to the public and private sectors alike is thus central to Europeana’s business strategy. We are also trying to provide a better service by making available richer data than that very often published by cultural institutions: data in which millions of texts, images, videos and sounds are linked to other relevant resources – persons, places, concepts…
Europeana has therefore been interested for a while in Linked Data, as a technology that facilitates these objectives. We entirely subscribe to the views expressed in the W3C Library Linked Data report, which shows the benefits (but also acknowledges the challenges) of Linked Data for the cultural sector.
Last year, we released a first Linked Data pilot at data.europeana.eu. This has been a very exciting moment, a first opportunity for us to play with Linked Data.
We could deploy our prototype relatively easily, and the whole experience was extremely valuable from a technical perspective. In particular, this has been the first large-scale implementation of Europeana’s new approach to metadata, the Europeana Data Model (EDM). This model enables the representation of much richer data than the format currently used by Europeana in its production service. First, our pilot could use EDM’s ability to represent several perspectives on a cultural object. We have used it to distinguish the original metadata our providers send us from the data that we add ourselves. Among the Europeana data there are indeed enrichments that are created automatically and are not checked by professional data curators. For trust purposes, it is important that data consumers can see the difference.
We could also better highlight a part of Europeana’s added value as a central point for accessing digitized cultural material, in direct connection with the above-mentioned enrichment. Europeana indeed employs semantic extraction tools that connect its objects with large multilingual reference resources available as Linked Data, in particular GeoNames and GEMET. This new metadata allows us to deliver a better search service, especially in a European context. With the Linked Data pilot we could explicitly point to these reference resources, in the same environment they are published in. We hope this will help the entire community to better recognize the importance of these sources, and to continue providing authority resources in an interoperable Linked Data format, using for example the SKOS vocabulary.
If you are interested in more lessons learnt from a technical perspective, we have published them in a technical paper at the Dublin Core conference last year. On the less positive side, data.europeana.eu is still not part of the production system behind the main europeana.eu portal. It therefore does not come with the guarantee of service we would like to offer for the linked data server, though the provision of data dumps is not affected by this.
Another downside is that data.europeana.eu publishes data only for a subset of the objects our main portal provides access to. We started with 3.5 million objects out of a total of 20 million. These were selected after a call for volunteers, to which only a few providers answered. Additionally, we could not release our metadata under fully open terms. This was clearly an obstacle to the re-use of our data.
After several months we have thus released a second version of data.europeana.eu. Though still a pilot, it now contains fully open metadata (CC0).
The new version concerns an even smaller subset of our collections: as of February 2012, data.europeana.eu contains metadata on 2.4 million objects. But this must be considered in context. The qualitative step of fully open publication is crucial to us. And over the past year, we have started an active campaign to convince our community to open up their metadata, allowing everyone to make it work harder for the benefit of end users. The metadata currently served at data.europeana.eu comes from data providers who reacted early and positively to our efforts. We trust we will be able to make metadata available for many more objects in the coming year.
In fact, we hope that this Linked Open Data pilot can carry part of our Open Data advocacy message. We believe such technology can trigger third parties to develop innovative applications and services, stimulating end users’ interest in digitized heritage. This would of course help to convince more partners to contribute metadata openly in the future. Alongside our new pilot, we have released an animation that conveys exactly this message; you can view it here.
For additional information about access to and technical details of the dataset, see data.europeana.eu and our entry on the Data Hub.
This is brilliant news in itself, but what I found particularly enchanting was the animated video that Europeana created in support of this announcement:
Linked Open Data from europeana on Vimeo.
As mentioned before, I am not of a technical background, and sometimes terms and explanations are difficult for me to grasp; however – I understand this video! I think it’s a brilliant explanation of the detail of Linked Open Data (LOD): how metadata works together, why it’s important that it’s open, how the more open data is available the more can be done with it, etc. This provides great clarity on what it is we’re seeking to do, for those of us who can’t tell a gig from a meg. The Open Biblio readership is generally more savvy with the nitty-gritty workings of LOD than I – if not in fact doing this sort of stuff already – but how can you not love the simplicity of this video?! It’s engaging, interesting, mostly jargon-free and less than four minutes long… So, if you’re an advocate of LOD without really understanding the processes behind the philosophy, get yourself a cuppa, settle down to watch this, and be informed.
Thanks Europeana, keep up the good work!
Europeana’s press release provides more information about the above dataset release and video.
Datos.bne.es is an initiative of the Biblioteca Nacional de España (BNE) whose aim is to enrich the Semantic Web with library data.
This initiative is part of the project “Linked Data at the BNE”, supported by the BNE in cooperation with the Ontology Engineering Group (OEG) at the Universidad Politécnica de Madrid (UPM). The first meeting took place in September 2010 and the collaboration agreement was signed in October 2010. A first set of data was transformed and linked in April 2011, and a more significant set followed in December 2011.
The initiative was presented in the auditorium of the BNE on 14th December 2011 by Asunción Gómez-Pérez, Professor at the UPM, and Daniel Vila-Suero, Project Lead (OEG-UPM), together with Ricardo Santos, Chief of Authorities, and Ana Manchado Mangas, Chief of Bibliographic Projects, both from the BNE. The audience also enjoyed the invaluable participation of Gordon Dunsire, Chair of the IFLA Namespace Group.
The concept of Linked Data was first introduced by Tim Berners-Lee in the context of the Semantic Web. It refers to the method of publishing and linking structured data on the Web. Hence, the project “Linked Data at the BNE” involves the transformation of BNE bibliographic and authority catalogues into RDF as well as their publication and linkage by means of IFLA-backed ontologies and vocabularies, with the aim of making data available in the so-called cloud of “Linked Open Data”. This project focuses on connecting the published data to other data sets in the cloud, such as VIAF (Virtual International Authority File) or DBpedia.
With this initiative, the BNE takes on the challenge of publishing bibliographic and authority data in RDF, following the Linked Data principles and under the open CC0 (Creative Commons Public Domain Dedication) license. Spain thereby joins the initiatives that national libraries of countries such as the United Kingdom and Germany have recently launched.
IFLA-backed ontologies and models, widely agreed upon by the library community, have been used to represent the resources in RDF. Datos.bne.es is one of the first international initiatives to thoroughly embrace the models developed by IFLA, such as the FR models FRBR (Functional Requirements for Bibliographic Records), FRAD (Functional Requirements for Authority Data), FRSAD (Functional Requirements for Subject Authority Data), and ISBD (International Standard for Bibliographic Description).
FRBR has been used as a reference and data model because it provides a comprehensive and organized description of the bibliographic universe, supporting the gathering of useful data and navigation. Entities, relationships and properties have been written in RDF using the RDF vocabularies published by IFLA; thus, the FR ontologies have been used to describe Persons, Corporate Bodies, Works and Expressions, and ISBD properties for Manifestations. All these vocabularies are now available at the Open Metadata Registry (OMR) with the status “published”. Additionally, in cooperation with IFLA, labels have been translated into Spanish.
MARC21 bibliographic and authority files have been tested and mapped to the classes and properties at OMR. The following mappings were carried out:
The aforementioned mappings will soon be made available to the library community; with them, the BNE would like to contribute to the discussion of mapping MARC records to RDF, and other libraries willing to transform their MARC records into RDF will be able to reuse them.
Approximately 2.4 million bibliographic records have been transformed into RDF. They cover modern and ancient monographs, sound recordings and musical scores. In addition, 4 million authority records for persons, corporate names, uniform titles and subjects have been transformed. All of them belong to the bibliographic and authority catalogues of the BNE, stored in MARC 21 format. For the data transformation, the MARiMbA (MARc mappIngs and rdf generAtor) tool has been developed and used. MARiMbA is a tool for librarians whose goal is to support the entire process of generating RDF from MARC 21 records. The tool allows using any vocabulary (in this case ISBD and the FR family) and simplifies the process of assigning correspondences between RDFS/OWL vocabularies and MARC 21. As a result of this process, about 58 million triples have been generated in Spanish. These triples are high-quality data with an important cultural value that substantially increases the presence of the Spanish language in the data cloud.
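To give a rough idea of the kind of mapping such a tool automates, here is an illustrative sketch only – the file name, URI pattern and property choice are assumptions for the example, not the BNE’s actual mapping rules and not MARiMbA itself – that maps the MARC 245 $a title of each record to an ISBD property using pymarc and rdflib:

```python
# Illustrative sketch of a single MARC21-to-RDF mapping step. File name, URI
# pattern and the chosen ISBD property are assumptions for this example.
from pymarc import MARCReader
from rdflib import Graph, Literal, Namespace, URIRef

ISBD = Namespace("http://iflastandards.info/ns/isbd/elements/")

g = Graph()
with open("records.mrc", "rb") as fh:  # assumed file of MARC 21 records
    for i, record in enumerate(MARCReader(fh)):
        manifestation = URIRef("http://datos.bne.es/resource/manifestation/%d" % i)  # made-up URI pattern
        title_field = record["245"]
        if title_field and title_field["a"]:
            # isbd:P1004 ("has title proper") is used here as an assumed example property
            g.add((manifestation, ISBD.P1004, Literal(title_field["a"], lang="es")))

print(g.serialize(format="turtle"))
```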
Once the data were described with the IFLA models and the bibliographic and authority catalogues were generated in RDF, the next step was to connect these data with other RDF knowledge bases included in the Linking Open Data initiative. The data of the BNE are now linked to data from other international data sources through VIAF, the Virtual International Authority File.
The licence applied to the data is CC0 (Creative Commons Public Domain Dedication), a completely open licence aimed at promoting data reuse. With this project, the BNE adheres to the Spanish public sector’s commitment to openness and data reuse, as established in Royal Decree 1495/2011 of 24 October (Real Decreto 1495/2011, de 24 de octubre) on reusing information in the public sector, and also acknowledges the proposals of the CENL (Conference of European National Librarians).
In the short term, the next steps to carry out include
From BNE: Ana Manchado, Mar Hernández Agustí, Fernando Monzón, Pilar Tejero López, Ana Manero García, Marina Jiménez Piano, Ricardo Santos Muñoz and Elena Escolano.
From UPM: Asunción Gómez-Pérez, Elena Montiel-Ponsoda, Boris Villazón-Terrazas and Daniel Vila-Suero.
The data was released as MARC records (MARC XML), and an experimental Linked Open Data service was launched at http://lod.b3kat.de/ with full dumps of the RDF data made available for download:
The data was exported on 22 November 2011, and updates can be harvested via an OAI-PMH interface from http://bvbr.bib-bvb.de:8991/aleph-cgi/oai/oai_opendata.pl?verb=ListRecords&set=OpenData&metadataPrefix=marc21.
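A rough sketch of harvesting the full set over this interface – using only the endpoint and parameters given above, with the actual record processing left as a stub – could look like this:

```python
# Harvest the OpenData set via OAI-PMH, following resumption tokens until the
# list is exhausted. Only prints record identifiers; requires requests.
import xml.etree.ElementTree as ET
import requests

BASE = "http://bvbr.bib-bvb.de:8991/aleph-cgi/oai/oai_opendata.pl"
OAI = "{http://www.openarchives.org/OAI/2.0/}"

params = {"verb": "ListRecords", "set": "OpenData", "metadataPrefix": "marc21"}
while True:
    root = ET.fromstring(requests.get(BASE, params=params).content)
    for record in root.iter(OAI + "record"):
        identifier = record.find(OAI + "header/" + OAI + "identifier")
        print(identifier.text)  # replace with real processing of the MARC 21 payload
    token = root.find(".//" + OAI + "resumptionToken")
    if token is None or not (token.text or "").strip():
        break
    # Follow-up requests must carry only the verb and the resumption token.
    params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```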
So far, no entries for this data exist on the Data Hub. Anyone?
With this initiative, open data is now available from four German library networks, BVB and KOBV being the first to have released their entire union catalogue. The North Rhine-Westphalian Library Service Center (hbz) started opening up data in March 2010 (see here), and in autumn 2010 libraries from the southwestern library network (SWB) followed (see here).
Hopefully this development will continue and libraries and related institutions will continue releasing open data, be it bibliographic data or other data produced by libraries.
“Meeting at the Royal Library of Denmark, the Conference of European National Librarians (CENL), has voted overwhelmingly to support the open licensing of their data. CENL represents Europe’s 46 national libraries, and are responsible for the massive collection of publications that represent the accumulated knowledge of Europe.
What does that mean in practice?
It means that the datasets describing all the millions of books and texts ever published in Europe – the title, author, date, imprint, place of publication and so on, which exists in the vast library catalogues of Europe – will become increasingly accessible for anybody to re-use for whatever purpose they want.
The first outcome of the open licence agreement is that the metadata provided by national libraries to Europeana.eu, Europe’s digital library, museum and archive, via the CENL service The European Library, will have a Creative Commons Universal Public Domain Dedication, or CC0 licence. This metadata relates to millions of digitised texts and images coming into Europeana from initiatives that include Google’s mass digitisations of books in the national libraries of the Netherlands and Austria.”
See also this post by Richard Wallis.
(Thanks to Mathias for the title of this post.)