Open Bibliography and Open Bibliographic Data » Semantic Web
http://openbiblio.net
Open Bibliographic Data Working Group of the Open Knowledge Foundation

Nature’s data platform strongly expanded (20 July 2012)
http://openbiblio.net/2012/07/20/natures-data-platform-strongly-expanded/

Nature has greatly expanded the Linked Open Data platform it launched in April 2012. From today’s press release:

[Image: logo of the journal Nature as used in its first issue, 4 November 1869]

“As part of its wider commitment to open science, Nature Publishing Group’s (NPG) Linked Data Platform now hosts more than 270 million Resource Description Framework (RDF) statements. The platform has been expanded more than tenfold, across a growing number of datasets. These datasets have been created under the Creative Commons Zero (CC0) waiver, which permits maximal use/reuse of this data. The data is now being updated in real time, and new triples are being dynamically added to the datasets as articles are published on nature.com.

Available at http://data.nature.com, the platform now contains bibliographic metadata for all NPG titles, including Scientific American back to 1845, and NPG’s academic journals published on behalf of our society partners. NPG’s Linked Data Platform now includes citation metadata for all published article references. The NPG subject ontology is also significantly expanded.

The new release expands the platform to include additional RDF statements of bibliographic, citation, data citation and ontology metadata, which are organised into 12 datasets – an increase from the 8 datasets previously available. Full snapshots of this data release are now available for download, either by individual dataset or as a complete package, for registered users at http://developers.nature.com.

This is exciting; the commitment to real-time updates especially is a great move, and it shows how serious Linked Open Data is becoming, in general and in the realm of bibliographic data in particular. Also, Nature now uses the Data Hub and has registered the data there, separated into several datasets.

Nature releases Metadata for 450k Articles into the Public Domain (5 April 2012)
http://openbiblio.net/2012/04/05/nature-releases-metadata-for-450k-articles-into-the-public-domain/

Yesterday Nature Publishing Group announced the launch of a Linked Data platform with RDF descriptions of more than 450,000 articles published by NPG since 1869.

From the press release:

Nature Publishing Group (NPG) today is pleased to join the linked data community by opening up access to its publication data via a linked data platform. NPG’s Linked Data Platform is available at http://data.nature.com.

The platform includes more than 20 million Resource Description Framework (RDF) statements, including primary metadata for more than 450,000 articles published by NPG since 1869. In this first release, the datasets include basic citation information (title, author, publication date, etc.) as well as NPG-specific ontologies. These datasets are being released under an open metadata license, Creative Commons Zero (CC0), which permits maximal use/re-use of this data.

NPG’s platform allows for easy querying, exploration and extraction of data and relationships about articles, contributors, publications, and subjects. Users can run web-standard SPARQL Protocol and RDF Query Language (SPARQL) queries to obtain and manipulate data stored as RDF. The platform uses standard vocabularies such as Dublin Core, FOAF, PRISM, BIBO and OWL, and the data is integrated with existing public datasets including CrossRef and PubMed.

That’s great news: such an important publisher moving towards open bibliographic data. As yet, there are no dumps of the data available, and there is no entry on the Data Hub. Anybody?

There may be problems accessing the platform from some browsers, but this is being worked on.
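To get a feel for the platform, a SPARQL query along the lines of the following sketch is a reasonable starting point. This is a rough Python example using the SPARQLWrapper library; the endpoint URL and the predicates are assumptions made for illustration, so consult the documentation at data.nature.com for the actual endpoint and vocabulary.

    # Rough sketch: query an NPG-style SPARQL endpoint with SPARQLWrapper.
    # The endpoint URL and predicates below are illustrative assumptions;
    # see http://data.nature.com for the real endpoint and vocabulary.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://data.nature.com/sparql")  # assumed endpoint location
    sparql.setQuery("""
        PREFIX dc: <http://purl.org/dc/terms/>
        SELECT ?article ?title ?date WHERE {
          ?article dc:title ?title ;
                   dc:date  ?date .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        # each binding maps variable names to {"type": ..., "value": ...} dicts
        print(row["article"]["value"], "|", row["title"]["value"], "|", row["date"]["value"])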

Europeana and Linked Open Data (19 March 2012)
http://openbiblio.net/2012/03/19/europeana-and-linked-open-data/

Europeana has recently released a new version of its Linked Data pilot, data.europeana.eu. We now publish data for 2.4 million objects under an open metadata licence: CC0, the Creative Commons Public Domain Dedication. This post elaborates on an earlier one by Naomi.

Europeana’s interest in Linked Open Data

Europeana aims to provide the widest possible access to European cultural heritage, published at scale as digital resources by hundreds of museums, libraries and archives. This includes empowering other actors to build services that contribute to such access. Making data openly available to the public and private sectors alike is thus central to Europeana’s business strategy. We are also trying to provide a better service by making available richer data than that very often published by cultural institutions: data in which millions of texts, images, videos and sounds are linked to other relevant resources such as persons, places and concepts.

Europeana has therefore been interested for a while in Linked Data, as a technology that facilitates these objectives. We entirely subscribe to the views expressed in the W3C Library Linked Data report, which shows the benefits (but also acknowledges the challenges) of Linked Data for the cultural sector.

Europeana’s first toe in the Linked Data water

Last year, we released a first Linked Data pilot at data.europeana.eu. This was a very exciting moment, and a first opportunity for us to experiment with Linked Data.

We were able to deploy our prototype relatively easily, and the whole experience was extremely valuable from a technical perspective. In particular, this was the first large-scale implementation of Europeana’s new approach to metadata, the Europeana Data Model (EDM). This model enables the representation of much richer data than the format currently used by Europeana in its production service. First, our pilot used EDM’s ability to represent several perspectives on a cultural object: we used it to distinguish the original metadata our providers send us from the data that we add ourselves. The Europeana data does indeed include enrichments that are created automatically and are not checked by professional data curators, and for trust purposes it is important that data consumers can see the difference.

We could also better highlight part of Europeana’s added value as a central point for accessing digitized cultural material, in direct connection with the enrichment mentioned above. Europeana employs semantic extraction tools that connect its objects with large multilingual reference resources available as Linked Data, in particular GeoNames and GEMET. This new metadata allows us to deliver a better search service, especially in a European context. With the Linked Data pilot we can point at these resources explicitly, in the same environment in which they are published. We hope this will help the entire community to better recognize the importance of these sources, and to continue providing authority resources in interoperable Linked Data formats, for example using the SKOS vocabulary.
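As a rough illustration of what such an enrichment looks like as Linked Data, here is a small rdflib sketch. The object URI, the choice of linking properties and the GEMET concept URI are invented for this example (the actual EDM modelling is richer); the point is simply that an object gets connected to external reference resources such as GeoNames and GEMET, which carry their own multilingual SKOS labels.

    # Illustrative only: a tiny graph linking a cultural object to external
    # reference resources, roughly in the spirit of Europeana's enrichments.
    # URIs and property choices are examples, not Europeana's actual data.
    from rdflib import Graph, URIRef, Literal, Namespace

    DCTERMS = Namespace("http://purl.org/dc/terms/")
    SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

    g = Graph()
    obj = URIRef("http://example.org/object/12345")       # hypothetical object URI
    place = URIRef("http://sws.geonames.org/2988507/")    # a GeoNames place URI
    concept = URIRef("http://example.org/gemet/concept/1")  # hypothetical GEMET-style concept URI

    g.add((obj, DCTERMS.spatial, place))    # automatic place enrichment
    g.add((obj, DCTERMS.subject, concept))  # automatic subject enrichment
    g.add((concept, SKOS.prefLabel, Literal("example label", lang="en")))  # multilingual label

    print(g.serialize(format="turtle"))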

If you are interested in more lessons learnt from a technical perspective, we published them in a technical paper at the Dublin Core conference last year. Among the less positive aspects, data.europeana.eu is still not part of the production system behind the main europeana.eu portal. It does not come with the guarantee of service we would like to offer for the linked data server, though the provision of data dumps is not affected by this.

Making progress on Open Data

Another downside is that data.europeana.eu publishes data only for a subset of the objects that our main portal provides access to. We started with 3.5 million objects out of a total of 20 million. These were selected after a call for volunteers, to which only a few providers responded. Additionally, we could not release our metadata under fully open terms. This was clearly an obstacle to the re-use of our data.

After several months we have thus released a second version of data.europeana.eu. Though still a pilot, it now contains fully open metadata (CC0).

The new version covers an even smaller subset of our collections: as of February 2012, data.europeana.eu contains metadata on 2.4 million objects. But this must be seen in context. The qualitative step to fully open publication is crucial to us, and over the past year we have run an active campaign to convince our community to open up their metadata, allowing everyone to make it work harder for the benefit of end users. The metadata currently served at data.europeana.eu comes from data providers who reacted early and positively to our efforts. We trust we will be able to make metadata available for many more objects in the coming year.

In fact we hope that this Linked Open Data pilot can carry part of our Open Data advocacy message. We believe such technology can trigger third parties to develop innovative applications and services, stimulating end users’ interest in digitized heritage. This would of course help to convince more partners to contribute metadata openly in the future. Alongside the new pilot we have released an animation that conveys exactly this message; you can view it here.

For additional information about access to and technical details of the dataset, see data.europeana.eu and our entry on the Data Hub.

Linked Data at the Biblioteca Nacional de España (2 February 2012)
http://openbiblio.net/2012/02/02/linked-data-at-the-biblioteca-nacional-de-espana/

The following guest post is from the National Library of Spain and the Ontology Engineering Group (Technical University of Madrid (UPM)).

Datos.bne.es is an initiative of the Biblioteca Nacional de España (BNE) whose aim is to enrich the Semantic Web with library data.

This initiative is part of the project “Linked Data at the BNE”, supported by the BNE in cooperation with the Ontology Engineering Group (OEG) at the Universidad Politécnica de Madrid (UPM). The first meeting took place in September 2010, and the collaboration agreement was signed in October 2010. A first set of data was transformed and linked in April 2011, and a more significant set followed in December 2011.

The initiative was presented in the auditorium of the BNE on 14 December 2011 by Asunción Gómez-Pérez, Professor at the UPM, and Daniel Vila-Suero, project lead (OEG-UPM), together with Ricardo Santos, Chief of Authorities, and Ana Manchado Mangas, Chief of Bibliographic Projects, both from the BNE. The audience also enjoyed the invaluable participation of Gordon Dunsire, Chair of the IFLA Namespace Group.

The concept of Linked Data was first introduced by Tim Berners-Lee in the context of the Semantic Web. It refers to the method of publishing and linking structured data on the Web. Hence, the project “Linked Data at the BNE” involves the transformation of BNE bibliographic and authority catalogues into RDF as well as their publication and linkage by means of IFLA-backed ontologies and vocabularies, with the aim of making data available in the so-called cloud of “Linked Open Data”. This project focuses on connecting the published data to other data sets in the cloud, such as VIAF (Virtual International Authority File) or DBpedia.

With this initiative, the BNE takes on the challenge of publishing bibliographic and authority data in RDF, following the Linked Data principles and under the CC0 (Creative Commons Public Domain Dedication) open licence. Spain thereby joins initiatives that national libraries in countries such as the United Kingdom and Germany have recently launched.

Vocabularies and models

IFLA-backed ontologies and models, widely agreed upon by the library community, have been used to represent the resources in RDF. Datos.bne.es is one of the first international initiatives to thoroughly embrace the models developed by IFLA: the FR family of models, FRBR (Functional Requirements for Bibliographic Records), FRAD (Functional Requirements for Authority Data) and FRSAD (Functional Requirements for Subject Authority Data), as well as ISBD (International Standard Bibliographic Description).

FRBR has been used as a reference model and as a data model because it provides a comprehensive and organized description of the bibliographic universe, allowing the gathering of useful data and supporting navigation. Entities, relationships and properties have been written in RDF using the RDF vocabularies taken from IFLA; thus the FR ontologies have been used to describe Persons, Corporate Bodies, Works and Expressions, and ISBD properties have been used for Manifestations. All these vocabularies are now available at the Open Metadata Registry (OMR) with “published” status. Additionally, in cooperation with IFLA, labels have been translated into Spanish.

MARC21 bibliographic and authority files have been tested and mapped to the classes and properties at the OMR. The following mappings were carried out:

  • A mapping to determine, given a field tag and a certain subfield combination, to which FRBR entity it is related (Person, Corporate Body, Work, Expression). This mapping was applied to authority files.
  • A mapping to establish relationships between entities.
  • A mapping to determine, given a field/subfield combination, to which property it can be mapped. Authority files were mapped to FR vocabularies, whereas bibliographic files were mapped to ISBD vocabulary. A number of properties from other vocabularies were also used.

The aforementioned mappings will soon be made available to the library community; the BNE would thus like to contribute to the discussion of mapping MARC records to RDF, and other libraries willing to transform their MARC records into RDF will be able to reuse these mappings.
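To make the idea of such mappings more concrete, here is a rough Python sketch of how a field/subfield combination might be routed to an FRBR entity or an RDF property. The mapping table below is invented for illustration and the ISBD element URIs are shown only as examples; this is not the actual datos.bne.es mapping, which will be published separately as described above.

    # Illustrative sketch of MARC tag/subfield -> (FRBR entity, RDF property) routing.
    # The entries below are examples only, not the actual datos.bne.es mappings.

    # (tag, subfield) -> FRBR entity, for authority records
    ENTITY_MAP = {
        ("100", "a"): "Person",
        ("110", "a"): "CorporateBody",
        ("130", "a"): "Work",
    }

    # (tag, subfield) -> property URI, for bibliographic records (ISBD-style, illustrative URIs)
    PROPERTY_MAP = {
        ("245", "a"): "http://iflastandards.info/ns/isbd/elements/P1004",
        ("260", "b"): "http://iflastandards.info/ns/isbd/elements/P1017",
    }

    def map_authority_field(tag, subfield):
        """Return the FRBR entity a given authority field/subfield maps to, if any."""
        return ENTITY_MAP.get((tag, subfield))

    def map_bibliographic_field(tag, subfield):
        """Return the RDF property a given bibliographic field/subfield maps to, if any."""
        return PROPERTY_MAP.get((tag, subfield))

    print(map_authority_field("100", "a"))      # -> "Person"
    print(map_bibliographic_field("245", "a"))  # -> an ISBD title property URI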

Almost 7 million records transformed under an open license

Approximately 2.4 million bibliographic records have been transformed into RDF: modern and ancient monographs, sound recordings and musical scores. In addition, 4 million authority records for persons, corporate names, uniform titles and subjects have been transformed. All of them belong to the bibliographic and authority catalogues of the BNE, stored in MARC 21 format. For the data transformation, the MARiMbA (MARc mappIngs and rdf generAtor) tool was developed and used. MARiMbA is a tool for librarians whose goal is to support the entire process of generating RDF from MARC 21 records. It allows any vocabulary to be used (in this case ISBD and the FR family) and simplifies the process of assigning correspondences between RDFS/OWL vocabularies and MARC 21. As a result of this process, about 58 million triples have been generated in Spanish. These triples are high-quality data with an important cultural value, and they substantially increase the presence of the Spanish language in the data cloud.

Once the data were described with the IFLA models, and the bibliographic and authority catalogues were generated in RDF, the next step was to connect these data with other existing RDF knowledge bases included in the Linking Open Data initiative. Thus, the data of the BNE are now linked to data from other international data sources through VIAF, the Virtual International Authority File.

The licence applied to the data is CC0 (Creative Commons Public Domain Dedication), a completely open licence aimed at promoting data reuse. With this project, the BNE adheres to the Spanish public sector’s commitment to openness and data reuse, as established in Royal Decree 1495/2011 of 24 October (Real Decreto 1495/2011, de 24 de octubre) on the reuse of public sector information, and also acknowledges the proposals of the CENL (Conference of European National Librarians).

Future steps

In the short term, the next steps to carry out include:

  • Migration of a larger set of catalogue records.
  • Improvement of the quality and granularity of both the transformed entities and the relationships between them.
  • Establishment of new links to other interesting datasets.
  • Development of a human-friendly visualization tool.
  • SKOSification of subject headings.

Team

From BNE: Ana Manchado, Mar Hernández Agustí, Fernando Monzón, Pilar Tejero López, Ana Manero García, Marina Jiménez Piano, Ricardo Santos Muñoz and Elena Escolano.
From UPM: Asunción Gómez-Pérez, Elena Montiel-Ponsoda, Boris Villazón-Terrazas and Daniel Vila-Suero.

LOD at Bibliothèque nationale de France (21 September 2011)
http://openbiblio.net/2011/09/21/lod-at-bibliotheque-nationale-de-france/

Romain Wenz of the Bibliothèque nationale de France (BnF) informed me via email about this pleasant development: the BnF’s Linked Data service is now a Linked Open Data service!

The BnF LOD service is in the first instance limited to classical authors and currently comprises approximately 1,600 authors and nearly 4,000 works, described in 1.4 million RDF triples. See also the accompanying entry on the Data Hub. More information about data.bnf.fr can be found in this use case for the W3C Library Linked Data Incubator Group.

On http://data.bnf.fr/semanticweb one can find a link to a full dump of the RDF data. The corresponding licence text reads (translated from the French):

“Reuse of the data exposed in RDF format is free of charge, subject to compliance with the legislation in force and to retaining the source statement ‘Bibliothèque nationale de France’ alongside the data. Users may adapt or modify the data, provided that they clearly inform third parties of this and do not distort its meaning.”

This looks like an attribution licence to me, but it is not made clear – neither on the webpage nor in the LICENCE.txt accompanying the dump – how the requirement of attribution should be met in practice. (The BnF might learn in this respect from the National Library of New Zealand’s approach.)

It also says that data in formats other than RDF is still licensed under non-commercial terms (again translated):

“Reuse of data exposed in any other format is subject to the following conditions:

  • non-commercial reuse of these data is free of charge, subject to the legislation in force and in particular to retaining the source statement ‘Bibliothèque nationale de France’ alongside the data. Users may adapt or modify the data, provided that they clearly inform third parties of this and do not distort its meaning.

  • commercial reuse of this content is subject to a fee and to a licence. Commercial reuse means acquiring the data in order to develop a product or service intended to be made available to third parties, whether for a fee or free of charge but with a commercial purpose. Click here for the tariffs and the licence.”

Collections in Bibliographica: unsorted information is not information (12 June 2011)
http://openbiblio.net/2011/06/12/collections-in-bibliographica/

Collections are the first feature aimed at user participation in Bibliographica. Collections are lists of books that users can create and share with others, and they are one of the basic features of Bibliographica, as Jonathan Gray has already pointed out:

lists of publications are an absolutely critical part of scholarship. They articulate the contours of a body of knowledge, and define the scope and focus of scholarly enquiry in a given domain. Furthermore such lists are always changing.

Details of use

They are accessible via the collections link on the top menu of the website.

To create collections you must be logged in. You can log in at http://bibliographica.org/account/login with an OpenID.

Once logged in, every time you open a book page (e.g. http://bnb.bibliographica.org/entry/GB6502067) you will see the Collections menu on the right, where you can choose between creating a new collection containing that work or adding the work to an existing collection.

If you have created some collections, you can always access them through the menu, and they will also appear on your account page.

To remove a book from a collection, click “remove” in the collection listing in the sidebar.

Collections screencast

Bibliographica gadget in Wikipedia (6 June 2011)
http://openbiblio.net/2011/06/06/bibliographica-gadget-in-wikipedia/

What is a Wikipedia gadget?

Thinking of ways to show the possibilities of linked data, we have made a Wikipedia gadget, making use of a great resource that the Wikimedia developers provide to the community.

Wikipedia gadgets are small pieces of code that you can add to your Wikipedia user templates; they allow you to add functionality and render extra information as you browse Wikipedia pages.

In our case, we wanted to retrieve information from our Bibliographica site and render it in Wikipedia. Since Wikipedia pages are rendered with specific markup, we can use the ISBNs present in articles to query the Bibliographica database, in a way similar to what Mark has done with the Edinburgh International Science Festival.

Bibliographica.org offers an ISBN search endpoint at http://bibliographica.org/isbn/, so if we request the page http://bibliographica.org/isbn/0241105161 we receive:

[{"issued": "1981-01-01T00:00:00Z", "publisher": {"name": "Hamilton"}, "uri": "http://bnb.bibliographica.org/entry/GB8102507", "contributors": [{"name": "Boyd, William, 1952-"}], "title": "A good man in Africa"}]

I can use this information to make a window pop up with details about a work when its ISBN is hovered over on a Wikipedia page. If my user template includes the Bibliographica gadget, every time I open a wiki page the script queries our database for information about all the ISBNs on that page.

If something is found, the gadget renders a frame around the ISBN numbers, and hovering over one of them opens a window with information about the book.
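If you want to experiment with the endpoint outside of Wikipedia, here is a minimal Python sketch that fetches and prints the same kind of record; it assumes the endpoint and the JSON structure shown above.

    # Minimal sketch: query the Bibliographica ISBN endpoint and print the result.
    # Assumes the endpoint and JSON structure shown above.
    import json
    from urllib.request import urlopen

    def lookup_isbn(isbn):
        with urlopen("http://bibliographica.org/isbn/" + isbn) as response:
            return json.load(response)

    for entry in lookup_isbn("0241105161"):
        contributors = ", ".join(c["name"] for c in entry.get("contributors", []))
        print(entry["title"], "/", contributors, "/", entry["publisher"]["name"], "/", entry["uri"])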

Get the widget

So, if you want this widget, you first need to create an account on Wikipedia and then edit your user template to add the JavaScript snippet. Once you have done this (instructions here), you will be able to see the information available in Bibliographica about the books.

Next steps

For now, the interaction goes in just one direction. Later on, we will be able to feed information back to Bibliographica.

Medline dataset (23 May 2011)
http://openbiblio.net/2011/05/23/medline-dataset/

Announcing the CC0 Medline dataset

We are happy to report that we now have a full, clean public domain (CC0) version of the Medline dataset available for use by the community.

What is the Medline dataset?

The Medline dataset is a subset of bibliographic metadata covering approximately 98% of all PubMed publications. It comes as a package of approximately 653 XML files, listing records chronologically by the date each record was created. There are approximately 19 million publication records.

Medline is a maintained dataset; updates are appended chronologically to the current dataset.

Read our explanation of the different PubMed datasets for further information.

Where to get it

The raw dataset can be downloaded from CKAN: http://ckan.net/package/medline

What is in a record

Most records contain useful non-copyrightable bibliographic metadata such as author, title, journal and PubMed record ID. Many also have DOIs. We have stripped out any potentially copyrightable material, such as abstracts.

Read our technical description of a record for further information.
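As a rough illustration of how one might pull the basic fields out of the XML files with Python, here is a sketch using only the standard library. The element names follow the Medline DTD (MedlineCitation, PMID, Article, ArticleTitle), and the file name is just an example; verify both against the technical description and DTD before relying on this.

    # Rough sketch: stream one Medline XML file and print PMID, journal and title per record.
    # Element names follow the Medline DTD; check the DTD / technical description before use.
    import xml.etree.ElementTree as ET

    def iter_citations(path):
        # iterparse keeps memory bounded, which matters for files this large
        for _event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "MedlineCitation":
                pmid = elem.findtext("PMID")
                title = elem.findtext("Article/ArticleTitle")
                journal = elem.findtext("Article/Journal/Title")
                yield pmid, title, journal
                elem.clear()  # free the processed element

    for pmid, title, journal in iter_citations("medline11n0001.xml"):  # example file name
        print(pmid, "|", journal, "|", title)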

Sample usage

We have made an online visualisation of a sample of the Medline dataset; however, the visualisation relies on WebGL, which is not yet supported by all browsers. It should work in Chrome and probably Firefox 4.

This is just one example, but it shows what great things we can build and learn from when we have open access to the necessary data.

OpenBiblio workshop report (9 May 2011)
http://openbiblio.net/2011/05/09/openbiblio-workshop-report/

#openbiblio #jiscopenbib

The OpenBiblio workshop took place on 6 May 2011 at the London Knowledge Lab.

Participants

  • Peter Murray-Rust (Open Bibliography project, University of Cambridge, IUCr)
  • Mark MacGillivray (Open Bibliography project, University of Edinburgh, OKF, Cottage Labs)
  • William Waites (Open Bibliography project, University of Edinburgh, OKF)
  • Ben O’Steen (Open Bibliography project, Cottage Labs)
  • Alex Dutton (Open Citation project, University of Oxford)
  • Owen Stephens (Open Bibliographic Data guide project, Open University)
  • Neil Wilson (British Library)
  • Richard Jones (Cottage Labs)
  • David Flanders (JISC)
  • Jim Pitman (Bibserver project, UCB) (remote)
  • Adrian Pohl (OKF bibliographic working group) (remote)

During the workshop we covered some key areas where we have seen some success already in the project, and discussed how we could continue further.

Open bibliographic data formats

In order to ensure successful sharing of bibliographic data, we require agreement on a suitable yet simple format via which to disseminate records. Whilst representing linked data is valuable, it also adds complexity; however, simplicity is key for ensuring uptake and for enabling easy front end system development.

Whilst data is available as RDF/XML, JSON is now a very popular format for data transfer, particularly where front end systems are concerned. We considered various JSON linked data formats, and have implemented two for further evaluation. In order to make sure this development work is as widely applicable as possible, we wrote parsers and serialisers for JSON-LD and RDF/JSON as plugins for the popular RDFlib.

The RDF/JSON format is, of course, RDF; therefore it requires no further changes to handle our data, and our RDF/JSON parser and serialiser are already complete. However, it is not very JSON-like, as data takes the subject(predicate(object)) form rather than the general key:value form. This is where JSON-LD can improve the situation – it lists information in a more key:value-like format, making it easier for front-end developers who are not interested in the RDF relations to use. But this leads to additional complexity in the spec and in parsing requirements, so we have some further work to complete:
* remove angle brackets from blank nodes
* use type coercion to move datatypes out of the main body of the data
* use language coercion to omit language tags

Our code is currently available in our repository, and we will request that our parsers and serialisers get added to RDFlib or to RDFextras once they are complete (they are still in development at present).
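For those who want to try this sort of serialiser, usage would look roughly like the sketch below once a plugin is registered with RDFlib. The module and class names here are placeholders rather than the actual names in our repository, and the input file is assumed to be an RDF/XML record such as one downloaded from Bibliographica.

    # Rough sketch of registering and using a JSON-LD serialiser plugin with RDFlib.
    # Module and class names are placeholders; see the project repository for the real ones.
    from rdflib import Graph, plugin
    from rdflib.serializer import Serializer

    plugin.register("json-ld", Serializer, "openbiblio_jsonld.serializer", "JsonLDSerializer")

    g = Graph()
    g.parse("record.rdf", format="xml")  # an RDF/XML record, e.g. saved from Bibliographica
    print(g.serialize(format="json-ld"))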

To further assist in representing bibliographic information in JSON, we also intend to implement BibJSON within JSON-LD; this should provide the necessary linked data functionality where needed, via JSON-LD support, whilst also enabling simpler representation of bibliographic data via key:value pairs where that is all that is required.

By making these options available to our users, we will be able to gauge the most popular representation format.

Regardless of the format used, a critical consideration is that of stable references to data; without these, maintaining datasets will be very hard. The British Library data, for example, does not yet have suitable identifiers. However, the BL are moving forward with applying identifiers and will be issuing a new version of their dataset soon, which we will take as a new starting point. We have provided a list of records that we have identified as non-unique, and in turn the BL will share the tools they use to manage and convert data where possible, to enable better community collaboration.

Getting more open datasets

We are building on the success of the BL data release by continuing work on our CUL and IUCr data, and also by obtaining more datasets. The latest is the Medline dataset; there were some initial issues with properly identifying this dataset, so we have published a previous blog post and links to further information, the Medline DTD and the specifications of the PubMed data elements, to help.

The Medline dataset

We are very excited to have the Medline dataset; we are currently working on cleaning it so that we can provide access to all the non-copyrightable material it contains, which should represent a listing of about 98% of all articles published in PubMed.

The Medline dataset comes as a package of approximately 653 XML files, chronologically listing records in terms of the date the record was created. This also means that further updates will be trackable as they will append to the current dataset. We have found that most records contain useful non-copyrightable bibliographic metadata such as author, title, journal, PubMed record ID, and that some contain further metadata such as citations, which we will remove. Once this is done, and we have checked that there are unique IDs (e.g. that the PubMed IDs are unique) we will make the raw CC0 collection available, then attempt to get it into our Bibliographica instance. We will then also be able to generate visualisations on our total dataset, which we hope will be approaching 30 million records by the end of the JISC Open Bibliography project.

Displaying bibliographic records

Whilst Bibliographica allows for the display of individual bibliographic records and enables building collections of such records, it does not yet provide a means of neatly displaying lists of bibliographic records. We have partnered with Jim Pitman of UC Berkeley to develop his BibServer to meet this requirement, and also to bring further functionality such as search and faceted browse. This also provides a development direction for the output of the project beyond the July end date of the JISC Open Bibliography project.

Searching bibliographic records

Given the collaboration between Bibliographica and BibServer on collection and display of bibliographic records, we are also considering ways to enable search across non-copyrightable bibliographic metadata relating to any published article. We believe this may be achievable by building a collection of DOIs with relevant metadata, and enabling crowdsourcing of updates and comments.

This effort is separate from the main development of the project; however, it would make a very good addition both to the functionality of the developed software and to the community. It would also tie in with any future functionality that enables author identification and information retrieval, such as ORCID, and allow us to build on the work done at sites such as BIBKN.

Disambiguation without deduplication

There have been a number of experiments recently highlighting the fact that a simple Lucene search index over datasets tends to give better matches than more complex methods of identifying duplicates. Ben O’Steen and Alex Dutton both provided examples of this from their work on the Open Citation project.

This is also supported by a recent paper from Jeff Bilder entitled “Disambiguation without Deduplication” (not publicly available). The main point is that instead of deduplicating objects we can simply do machine disambiguation and make sameAs assertions between multiple objects; this would enable changes to still be applied to different versions of an object by disparate groups (e.g. where each group has a different spelling or identifier for some key part of the record) whilst still maintaining a relationship between the two objects. We could build on this sort of functionality by applying expertise from the library community if necessary, although deduplication or merging should only be contemplated if a new dataset is being formed which some agent is taking responsibility to curate. If not, it is better simply to cluster the data by sameAs assertions and to keep track of who is making those assertions, in order to assess their reliability.
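A minimal sketch of what “sameAs with provenance” could look like in practice is given below, using RDFlib named graphs so that each asserting group’s links are kept separate from everyone else’s; the record and asserter URIs are invented for the example, and a real system would of course need much more around it.

    # Sketch: record owl:sameAs assertions in per-asserter named graphs, so that we can
    # later weigh each cluster by who made the assertions. URIs are invented examples.
    from rdflib import ConjunctiveGraph, URIRef
    from rdflib.namespace import OWL

    store = ConjunctiveGraph()

    bl_record  = URIRef("http://example.org/bl/record/123")   # hypothetical identifiers
    cul_record = URIRef("http://example.org/cul/record/abc")

    # Each asserting agent gets its own named graph (context)
    asserter = URIRef("http://example.org/asserter/open-citation")
    store.get_context(asserter).add((bl_record, OWL.sameAs, cul_record))

    # Later: inspect who asserted which links, instead of merging the records
    for s, p, o, ctx in store.quads((None, OWL.sameAs, None)):
        print(s, "sameAs", o, "asserted in", ctx.identifier)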

We suggest a concept for increasing collaboration on this sort of work – a ReCaptcha of identities. Upon login, perhaps to a Bibliographica or another relevant system, a user could be presented with two questions, one of which we know the answer to, and the other being a request to match identical objects. This, in combination with decent open source software tools enabling bibliographic data management (building on tools such as Google Refine and Needlebase), would allow for simple verifiable disambiguation across large datasets.

Sustaining open bibliographic data

Having had success in getting open bibliographic datasets and prototyping their availability, we must consider how to maintain long-term open access. There are several key issues:

Continuing community engagement

We must continue to work with the community, and to provide explanatory information to those needing to make decisions about bibliographic data, such as the OpenBiblio Principles and the Open Bibliographic Data guide. We must also ensure we improve resource discovery by supporting the requirement for generating collections and searching content.

Additionally, quality bibliographic data should be hosted at some key sites – there are a variety of options such as Freebase, CKAN and Bibliographica – but we must also ensure that community members can be recruited both to manage records within these central options and to provide access to smaller distributed nodes, where data can be owned and maintained at the local level whilst being discoverable globally.

Maintaining datasets

Dataset maintenance is critical to ongoing success – stale data is of little use, and disregard for content maintenance will put off new users. We must co-ordinate with source providers such as the BL by accepting changesets from them and incorporating them into other versions. This is already possible with the Medline data, for example, and will very soon be the case with BL updates too. We should advocate for this method of dataset updates during any future open data negotiations. This will allow us to keep our datasets fresh and relevant, and to properly represent growing datasets.

We must continue to promote open access to non-copyrightable datasets, and ensure that there is a location for open data providers to easily make their raw datasets available – such as CKAN.

We will ensure that all the software we have developed during the course of the project – and in future – will remain open source and publicly available, so that it will be possible for anyone to perform the transforms and services that we can perform.

Community involvement with dataset maintenance

We should support community members who wish to take responsibility for overseeing the updating of datasets. This is critical for long-term sustainability, but such people are hard to find. They need to be recruited and provided with simple tools that empower them to easily maintain and share the datasets they care about with a minimal time commitment. Thus we must make sure that our software and tools are not only open source, but usable by non-team members.

We will work on developing tools such as the ReCaptcha-style disambiguation described above, and on building game and ranking-table functionality for those wishing to participate in entity disambiguation (in addition to machine disambiguation).

Critical mass

We hope that by providing almost 30 million records to the community under a CC0 licence, and with the support of all the providers that made this possible, we will achieve a critical mass of data, and an exemplar for future open access to such data.

This should provide the go-to list of such information, and inspire others to contribute and maintain. However, such community assistance will only continue for as long as there appears to be reasonable maintenance of the corpus and software we have already developed – if this slips into disrepair, community engagement is far less likely.

Maintaining services

The Bibliographica service that we currently run already requires significant hardware. Once we add the Medline data, we will require very large indexes, demanding a great deal of RAM and fast disks. There is therefore a long-term maintenance requirement implicit in running any central service of open bibliographic data on this scale.

We will present a case for ongoing funding requirements and seek sources for financial support both for technical maintenance and for ongoing software maintenance and community engagement.

Business cases

In order to ensure future engagement with groups and business entities, we must provide clear examples of the benefits of open bibliographic data. We have already done some work on visualising the underlying data, which we will develop further for higher impact. We will identify key figures in the data that we can feed into such representations to act as exemplars. Additionally, we will continue to develop mashups using the datasets, to show the serendipitous benefits that increase exposure but are only possible with unambiguously open access to useful data.

Events and announcements

We will continue to promote our work and the efforts of our partners, and advocate further for open bibliography, by publicising our successes so far. We will co-ordinate this with JISC, the BL, the OKF and other interested groups, to ensure that the impact of announcements by all groups is enhanced.

We will present our work at further events throughout the year, such as attendance and sessions at OKCon, OR11 and other conferences, and by arranging further hackdays.

Follow-up to serialising RDF in JSON (5 May 2011)
http://openbiblio.net/2011/05/05/follow-up-to-serialising-rdf-in-json/

Following on from Richard’s post yesterday, we now have a JSON-LD serialiser for RDFlib. This is still a work in progress, and there may be things that it is serialising incorrectly. So, please give us feedback on this, and tell us where we have misinterpreted the structure.

Here you will find a sample JSON-LD output file, which was generated from this Bibliographica record.

The particular area of concern is how the JSON-LD spec describes serialising disjoint graphs into JSON-LD (section 8.2). How does this differ from serialising joined graphs? We are presuming that our output file is an example of a joined graph, and that additional disjoint graphs would be added by appending additional @:[] sections.
