open bibliographic data and dev8d

Ben, Mark and Rufus are attending Dev8D this week – 16th Feb – 18th Feb 2011. This is a very exciting event where we hope to see lots of people from UK higher education institutions engaging with open knowledge and open source development.

We will also be advertising the open bibliographic data challenge today, and hoping to see some great new developments with the data.

Further updates will be posted soon – or follow the #dev8d tag on twitter, or check out the dev8d webpages.

Mark

Posted in JISC OpenBib, News, OKFN Openbiblio

Minutes: 8th Virtual meeting of the OKFN Openbiblio Group

Date: February, 1st 2011, 16:00-17:15 GMT

Channel: Meeting was held via Skype and Etherpad

Participants

  • Adrian Pohl
  • Karen Coyle
  • Rufus Pollock
  • Thomas Krichel
  • Tatiana de la O

Openbiblio Principles: What’s next?

Now that the Principles on Open Bibliographic Data have been published, we discussed what to do to broaden their reception and endorsement. We collected the following ideas:

  • Contact organisations to sign the principles. Here is a short list of organisations we came up with:
    • AGRIS
    • Europeana
    • hbz
    • Institutional Repositories
    • Libraries
    • Project Gutenberg
    • Open Library
    • Internet Archive
    • BASE
  • Create a call for endorsement to be used in blog posts and mails to mailing lists, individuals and organizations.
  • Provide a button (similar to these) to show support for open bibliographic data, which signatories could put on their page after signing the principles.
  • Enable organizations to include their logo in the endorsement list.
    • ACTION: Karen will approach the Internet Archive to endorse the principles
    • ACTION: Thomas will contact BASE.
    • ACTION: Karen will write a call for endorsement.
    • ACTION: Adrian will create a button to show support for openbiblio (not to mark openbiblio data).
    • ACTION: Rufus will implement the possibility for organizations to include their logo in the endorsement list.

Working with Open Library Society (OLS)

  • Thomas gives a short characterization of the Open Library Society: It focuses on scientific publication data (journal articles, papers but not monographs) and promotes open services around it. (See http://3lib.org/ for data that is gathered by OLS and used in their services.)
  • Open Library Society’s policy is to not scrape data and acknowledge all sources.
  • A problem with OLS data is that it isn’t clean open data: some of it isn’t licensed at all and, for example, RePEc data forbids commercial use. As OKFN is interested in clean open data, this is a problem.
  • Thomas emphasizes that at least the author claim data is open; the legal status of the other data sets has yet to be determined or declared.
    • ACTION: Ask 3lib providers whether their data is free using http://isitopendata.org/.
    • ACTION: Thomas will start with contacting DBLP regarding open licensing.

Open Bibliographic Data Challenge

Rufus gave some information about the running openbiblio challenge. Already submitted ideas can be seen here.

Reports from events and other projects

Karen reported on this:

  • At the American Library Association’s meeting a new Interest Group on Linked Library Data was founded, which will also cover the topic of open data.
  • At the Knowledge Organization Conference in Oslo (January 2011) (see also Karen’s post about it) open data – as far as Karen could understand – wasn’t touched.
  • The W3C LLD Incubator group endorsed the use of CKAN for listing Linked Library Data sets.

Action Collection

  • Karen will approach the Internet Archive to endorse the principles
  • Thomas will contact BASE.
  • Karen will write a call for endorsement.
  • Adrian will create a button to show support for openbiblio (not to mark openbiblio data).
  • Rufus will implement the possibility for organizations to include their logo in the endorsement list.
  • Ask 3lib providers whether their data is free using http://isitopendata.org/.
  • Thomas will start with contacting DBLP regarding open licensing.
  • NEXT MEETING: Discuss how to improve the openbiblio webpresence at http://openbiblio.net.
Posted in minutes, OKFN Openbiblio

Open bibliographic principles announced

On Monday 17th January at the Visions of a semantic molecular future symposium and hackfest, during a presentation packed with displays of exciting new technology, Peter Murray-Rust introduced the Open Bibliographic Principles on behalf of the contributor group that has worked on them for the last six months.

  1. When publishing bibliographic data make an explicit and robust license statement.
  2. Use a recognized waiver or license that is appropriate for data.
  3. If you want your data to be effectively used and added to by others it should be open as defined by the Open Definition (http://opendefinition.org/) – in particular non-commercial and other restrictive clauses should not be used.
  4. Where possible, we recommend explicitly placing bibliographic data in the Public Domain via PDDL or CC0.

http://openbiblio.net/principles has full details along with an endorsement form, and will soon link to alternate language versions. We will now work on exposing these principles as widely as possible to achieve endorsement from individuals and organisations across the academic, library and publisher community.

(pictures to follow)

Posted in Data, JISC OpenBib, OKFN Openbiblio

Academic Bibliography data available from Acta Cryst E

The bibliographic data from Acta Cryst E, a publication by the International Union of Crystallography (IUCr), has been extracted and made available with their consent.

You can find a SPARQL endpoint for the data here and the full dataset here.

I have also geocoded a number of the affiliations of the authors, plotting them on a timemap (visualising the time of publication against the location of the authors), and you can see this at this location.

What you will find:

  • A SPARQL endpoint with limited output capabilities (limited content negotiation).
  • A ‘describe’ method, to display unstyled HTML pages about authors or the papers, based on the given URI.
  • Links from the data in the service to the original papers.
  • The data dump consists of zipped-up directories of RDF, with most of the intermediary XML, HTML and other bits removed. Hopefully, this helps explain the odd layout!
Posted in JISC OpenBib

Minutes: 7th virtual meeting of the OKFN Openbiblio Group

Date: January, 4th 2011, 16:00-17:15 GMT

Channel: Meeting was held via Skype and Google Docs

Participants

  • Adrian Pohl
  • Karen Coyle
  • Rufus Pollock
  • Peter Murray-Rust
  • Jim Pitman
  • Alexander Dutton

Summary

Discussing the draft Principles on Open Bibliographic Data

For background information see the google doc and the discussion thread on the mailing list

  • We did a significant rewriting of the ‘Definition of bibliographic data’ section in order to remove the reference to intellectual property rights on different parts of bibliographic data and deleted point 5 in the recommendations for the same reason.
  • A question to be resolved is: Do we need in the 4th principle (“…it is STRONGLY recommended that bibliographic data or collections of bibliographic data, especially where publicly funded, be explicitly placed in the public domain…”) a ‘STRONGLY’ recommend for public domain or should we drop the ‘Strongly’ and simply recommend? The underlying question is whether attribution is rather a problem or a benefit in the future bibliographic data environment.
  • At the moment we have gathered the following list of contributors: Karen Coyle, Jim Pitman, Adrian Pohl, Rufus Pollock, Peter Murray-Rust. Who else wants to be listed as a contributor?
    • ACTION: Discuss whether there are areas where attribution would be a problem or a benefit.
    • ACTION: Ask on list who also wishes to be contributor.

Updates on Open Bibliography work & projects

  • In October 2010 Adrian wrote a provisional summary of Open Bibliographic Data activities in 2010. He’ll try to complete it and translate it into English for a post on the OKFN blog.
  • Jim knows a person he could possibly get to do the coordinating work together with Adrian.
  • Karen said that the W3C Incubator Group on Linked Library Data is working on a report to be published in the next months.
  • Furthermore we talked about the meetings several of the group members attend and that it would be great to spread the word about the Openbiblio principles there. E.g. Karen is attending the ALA midwinter meeting (January 7-9) where she will be talking about linked library data. Rufus Pollock is going to the annual UKSG conference.
  • We agreed that it would be great if group members who attend a meeting anywhere ping the Openbiblio list about it (before and/or after) and use the gathering to promote the Openbiblio principles.
    • ACTION: Rufus writes/sends a description of coordinator duties.
    • ACTION: Jim puts a person he has for coordinator work in touch with Adrian and Rufus.
    • ACTION: Please ping the list if you are going to a meeting.

Privacy and user data

Some people have recently raised the issue concerning Open Data in the context of user data. E.g. Tim Spalding wrote on the Openbiblio list:

Basically, many institutions and sites are gathering user data relative to books—tags, reviews, lists, etc. How can that data be shared?

Let’s say, for example, that two libraries were collecting tags or reviews and wanted to share that data in an open way. How would they do it? What should the balance be between openness, privacy, and keeping users in control of their data? Is there an open license that requires you to refresh data, so a user can release their review but expect to be able to update it? Should institutions sharing tags include primary keys relative to users, or just submit total tag counts, etc.?

Rufus posed the question whether this should be an area we think about and produce some guidance on. Everybody had a strong feeling that this is interesting. However, we agreed that it presents very significant complications and challenges. Therefore we should, at least at present, spend our energy on the main issue of opening up core bibliographic data. Karen said that these questions refer to so much stuff (reviews, user stats, tagging etc.) that it is hard to do good work on it. Adrian said we should watch the issue and address it if it comes up more and more.

Launch of Open Scholarship (17th Jan)

On January 17th there will be the “Visions of a Semantic (Molecular) Future” symposium organized by Peter. Lots of interesting and influential people will be there and Peter will speak on Open Citation, Open Bibliography, Open Scholarship. He would like to present the Open Bibliographic Data Principles there in an agreed (late draft) version. He asked whether an agreed set of principles by the 11th of January is possible and the participants made clear that they would like to contribute to this goal.

  • ACTION: Provide an agreed Principles document (maybe still a late draft version) by January 11th.

Action collection

We agreed on the following actions:

  • Adrian posts a write-up of the meeting on http://openbiblio.net and notifies the mailing list.
  • We provide an agreed Principles document (maybe still a late draft version) by January 11th.
  • Adrian bugs the people about finishing the principles and asks on list who wishes to be listed as a contributor.
  • We discuss on list whether there are areas where attribution licenses would be a problem or a benefit.
  • Rufus writes/sends a description of coordinator duties.
  • Jim puts his coordinator person in touch with Adrian and Rufus.
  • Everybody pings the list if they are going to an openbiblio-related meeting.
Posted in minutes, OKFN Openbiblio

Open library data: more data & a flyer from Germany

Open data by the German National Library of Medicine

The open library data movement in Germany has achieved a major success. In December, the German National Library of Medicine (ZB MED) announced the release of more than one million bibliographic records into the public domain under a CC0 licence. Descriptions of the data as well as download links are available here. The data was converted to RDF using the Bibliographic Ontology (Bibo) and is now part of the Linked Open Bibliographic Data service lobid.org.

Flyer on open library data

Together with the Open Knowledge Foundation, the hbz has also published a German flyer on open library data which explains the advantages of open data for libraries. The flyer was created based on feedback and discussions within the OKFN Working Group on Open Bibliographic Data. A PDF version of the flyer can be downloaded here. If you are interested in printed copies of the flyer, just print your own, create a new one (it’s openly licensed) or write an email to the hbz.
A – slightly different – English version of the flyer can be viewed here.

Adrian Pohl is coordinator of the OKFN Working Group on Open Bibliographic Data and works at the North Rhine-Westphalian Library Service Center (hbz).

Posted in Data, OKFN Openbiblio, Uncategorized

JISC OpenBibliography: Development ideas

Now that we have a queryable British National Bibliography dataset, we are investigating useful functionality to take advantage of the data.

The team have listed a few development ideas based both on our own interests and on discussion with others in the community:

  1. flagging – attaching notes to bibliographic records highlighting possible updates
  2. wikipedia – link to wikipedia by author / title / ISBN for further information
  3. book crossing – search an ISBN, find where a copy of it is available
  4. public libraries – search by ISBN and find out which local public library it is in
  5. exporting records – for example to bibtex
  6. google scholar lookup

We are moving forward with these, however we know that it is not possible for us to guess all the uses that the community might find for such data, so we would appreciate further comments and new ideas. It would be great to have a list of use cases that are valued by the community, and to enable as many of them as possible by project end.

Posted in JISC OpenBib

Name matching strategy using bibliographic data

One of the aims of an RDF representation of bibliographic data should be to have authors represented by unique, reference-able points within the data (as URIs), rather than as free-text fields. What steps can we take to match up the text value representing an author’s name to another example of their name in the data?

It’s not realistic to expect a match between, say, Mark Twain and Samuel Clemens without using some extra information typically not present in bibliographic datasets. What can be achieved, however, is the ‘fuzzy’ matching of alternate forms of names that arise from typos, mistakes, omitted initials and the like. It is important that these matches are understood to be fuzzy and not precise, based more on statistics than on a definite assertion.

How best to carry out the matching of a set of authors within a bibliographic dataset? This is not the only way, but it is a useful method to make progress with:

  1. List – Gather a list of the things you wish to match, with unique identifiers for each, and map out a list of the pairs of names that need to be compared. (Note that this mapping will be greatly affected by the next step.)
  2. Filter – Remove the matches that aren’t worth fully evaluating. An index of the text can give a qualitative view on which names are worth comparing and which are not.
  3. Compare – Run through the name-pairs and evaluate the match (likely using string metrics of some kind). The accuracy of the match may be improved by using other data, with some sorts of data drastically improving your chances, such as author email, affiliation (and date) and birth and death dates.
  4. Binding – Bind the names together in whichever manner required. I would recommend creating Bundles as a record of a successful match, and an index or sparql-able service to allow the ‘sameas’ style lookups in a live service.

In terms of the BL dataset within http://bnb.bibliographica.org then:

List:

We have had to apply a form of identifier to each instance of an author’s name within the BL dataset. Currently, this is done via an ‘owl:sameAs’ property on the original blank node linking to a URI of our own making, e.g. http://bibliographica.org/entity/735b0…12d033. It would be a lot better if the BL were to mint their own URIs for this, but in the meantime, this gives us enough of a hook to begin with.

One way you might gather the names and URIs is via SPARQL:

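PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>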
SELECT DISTINCT ?book ?authoruri ?name
WHERE {
    ?book a bibo:Book .
    ?book dc:contributor ?authorbn .
    ?authorbn skos:notation ?name .
    ?authorbn owl:sameAs ?authoruri .
}

However, there will be very many results to page through, and it will put a lot of stress on the SPARQL engine if lots of users are doing this heavy querying at the same time!

This is also a useful place to gather any extra data you will use at the compare stage (usually email, affiliation or in this particular case, the potential birth and death dates).
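Paging can ease the load a little. The following is a minimal sketch, in Python, of how the query above might be split into pages with LIMIT and OFFSET. The helper name paged_queries, the page size and the added ORDER BY clause are assumptions made for illustration and are not part of the setup described here.

```python
# Hypothetical helper: split one large SELECT into LIMIT/OFFSET pages.
AUTHOR_QUERY = """\
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?book ?authoruri ?name
WHERE {
    ?book a bibo:Book .
    ?book dc:contributor ?authorbn .
    ?authorbn skos:notation ?name .
    ?authorbn owl:sameAs ?authoruri .
}
ORDER BY ?book ?authoruri
"""


def paged_queries(query, page_size, pages):
    """Yield `pages` copies of `query`, each restricted to one page of results."""
    for page in range(pages):
        yield f"{query}LIMIT {page_size} OFFSET {page * page_size}\n"


# Show only the paging clause of the first two pages.
for q in paged_queries(AUTHOR_QUERY, page_size=1000, pages=2):
    print(q.splitlines()[-1])
```

A stable ORDER BY is included because, without one, a store is free to return rows in a different order for each page.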

Filter:

This is the part that is quite difficult as you have to work out a method for negating matches without the cost of a full match. If the filter method is slower than simply working through all the matches, then it is not worth doing the step. In matching names from the BL set however, there are many millions of values, but from glancing over the data, I only expect tens of matches or fewer on average for a given name.

The method I am using is to make a simple stemming index of the names, with the birth/death dates as extra fields. I have done this in Solr (experimenting with different stemming methods) and have come to the odd conclusion that the default English stemming provides suitable groupings. I found this was backed up somewhat by this comparison of string metrics [PDF]. It suggests that a simple index combined with a metric called ‘Jaro’ works well for names.

So, in this case, I generate the matching by running the names through an index of all the names and using the most relevant search results as the base for the pairs of names to be matched. The pairs are combined into a set, ordered alphabetically – only the pairing is necessary, not the ordering of the pair. This is so that we don’t end up matching the same names twice.
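As a rough illustration of the pairing step, here is a small Python sketch. It swaps the Solr index for a much cruder grouping (the lowercased text before the first comma), so it is a stand-in for the idea rather than the method actually used; the function names and the sample names are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def block_key(name):
    """Crude stand-in for an index lookup: the lowercased surname."""
    return name.split(",")[0].strip().lower()


def candidate_pairs(names):
    """Return unordered pairs of distinct names that share a block key."""
    blocks = defaultdict(set)
    for name in names:
        blocks[block_key(name)].add(name)
    pairs = set()
    for group in blocks.values():
        # sorted() fixes the order inside each pair, so (a, b) and (b, a)
        # collapse into a single entry of the set.
        pairs.update(combinations(sorted(group), 2))
    return pairs


names = ["Boyd, William", "Boyd, W.", "Boyd, William, 1952-", "Hamilton"]
for pair in sorted(candidate_pairs(names)):
    print(pair)
```

Only names that land in the same group are ever compared, which is what keeps the number of full comparisons small.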

Compare:

This is the most awkward step – it is hard to generate a ‘golden set’ of data by which you can rate the comparison without using other data sources. However, the matching algorithm I am using is the Jaro comparison, to get a figure indicating the similarity of the names. As the BL data is quite a good set of data (in that it is human-entered and care has been taken over the entry), this type of comparison works quite well – the difference between a positive and a negative match is quite high. Care must be taken not to miss genuine matches because of omitted or included initials, middle names, proper forms and so on.
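For readers unfamiliar with the metric, the following Python sketch implements the standard definition of Jaro similarity. It is written for illustration and is not the code used in this work; the variable names are my own.

```python
def jaro(s1, s2):
    """Jaro similarity of two strings, from 0.0 (nothing shared) to 1.0 (equal)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count mismatches.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro("Boyd, William", "Boyd, W."), 3))
```

A score close to 1.0 suggests the two strings are likely variants of one name; where to draw the line is a matter for experiment with the data at hand.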

The additional data is quite dependent on the match between the names. If the names match perfectly, but the birth dates are very different (different in value and distant in edit distance), then this is likely to be a different author. If the names match somewhat, but the dates match perfectly, then this is a possible match. If the dates match perfectly, but the name doesn’t at all (unlikely due to the above filtering step) then this is not a match.
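These rules can be sketched as a small decision function. The Python below is only an illustration: the two thresholds are guesses rather than tuned values, any difference in birth year is treated as a conflict (a simplification of ‘very different’), and the outcomes for cases the text does not mention (no dates available, or a near-perfect name with equal dates) are my own assumptions.

```python
def classify(name_score, birth_a, birth_b, strong=0.95, weak=0.80):
    """Combine a name similarity score with optional birth years.

    `strong` and `weak` are illustrative thresholds, not tuned values.
    """
    dates_known = birth_a is not None and birth_b is not None
    dates_equal = dates_known and birth_a == birth_b
    if name_score < weak:
        return "no match"          # names differ too much, whatever the dates say
    if dates_known and not dates_equal:
        return "different author"  # similar names, conflicting dates
    if dates_equal:
        # Assumption: a near-perfect name plus equal dates counts as a match.
        return "match" if name_score >= strong else "possible match"
    return "possible match"        # similar names, no dates to confirm


print(classify(1.0, 1952, 1852))
print(classify(0.85, 1952, 1952))
print(classify(0.30, 1952, 1952))
```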

Binding:

I have not made my mind up about this step, as binding is a compromise between two ways of recording matches. If you bundle together all the names for a given author in a single bundle when they fall within the threshold for a positive match, you get a bundle that requires fewer triples to describe it. Strictly, however, you should have a bundle for each pairing, but this dramatically increases the number of triples required to express it. Either way, the method for querying the resultant data is the same. For example:

A set of bundles ‘encapsulates’ A, B, C, D, E, F, G – so, given B, you can find the others by a quick, if inelegant, SPARQL query:

SELECT ?sameas
WHERE {
  ?bundle bundle:encapsulates <B> .
  ?bundle bundle:encapsulates ?sameas .
}

Whether this data should be collapsed into a closure of some sort is up to the administrator – how much must you trust this match before you can use owl:sameAs and incorporate it into a congruent closure? I’m not sure the methods outlined above can give you a strong enough guarantee to do so at this point.
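To make the difference between the two recording styles concrete, here is a small in-memory Python sketch of the same lookup. The bundles are plain sets of made-up identifiers and the function name same_as is hypothetical; it mirrors the one-step SPARQL query above and deliberately does not compute any closure.

```python
def same_as(bundles, uri):
    """Return every URI that shares at least one bundle with `uri`."""
    found = set()
    for members in bundles:
        if uri in members:
            found.update(members)
    return found


# One bundle per author ...
single = [{"A", "B", "C", "D"}]
# ... or one bundle per matched pair.
pairwise = [{"A", "B"}, {"B", "C"}, {"B", "D"}, {"C", "E"}]

print(sorted(same_as(single, "B")))
print(sorted(same_as(pairwise, "B")))
```

With pairwise bundles, E is linked to C but not directly to B, so a one-step lookup from B does not return it. Whether such indirect links should be followed is exactly the trust question raised above.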

Posted in JISC OpenBib, Semantic Web

Neil Wilson on open British Library metadata

Opening up the BL’s Metadata

Neil Wilson presented at GLAMWIKI on the British Library open metadata strategy, providing an overview of the history and responsibilities of the British Library, and how open metadata relates to the vision of future library services.

The slides of this informative talk are available on slideshare:

http://www.slideshare.net/nw13/wiki-opening-the-b-ls-data

Posted in News, OKFN Openbiblio

Querying the British National Bibliography

Following up on the earlier announcement that the British Library
has made the British National Bibliography available under a public
domain dedication, the JISC Open Bibliography project has worked to
make this data more useable.

The data has been loaded into a Virtuoso store that is queryable
through the SPARQL Endpoint, and the URIs that we have assigned to each
record use the ORDF software to make them dereferenceable,
supporting content auto-negotiation as well as embedding RDFa
in the HTML representation.

The data contains some 3 million individual records and some 173
million triples. Indexing the data was a very CPU intensive process
taking approximately three days. Transforming and loading the source
data took about five hours.

To get an idea of the shape of the data, let us consider a sample
resource, http://bnb.bibliographica.org/entry/GB8102507 . Apart from
linkage between the various representations, the description of the
entity itself is as follows:

@prefix ov: <http://open.vocab.org/terms/> .
@prefix isbd: <http://iflastandards.info/ns/isbd/elements/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix bio: <http://purl.org/vocab/bio/0.1/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://bnb.bibliographica.org/entry/GB8102507> a bibo:Book, bibo:Document;
  dc:source <http://bnb.bibliographica.org/dataset/BNBrdfdc03.xml#183143>;
  dc:isPartOf <http://bnb.bibliographica.org/dataset>;
  rdfs:seeAlso <http://purl.org/NET/book/isbn/0241105161#book>,
               <http://www4.wiwiss.fu-berlin.de/bookmashup/books/0241105161>;

  dc:title "A good man in Africa";
  dc:language [ rdf:value "eng"^^dc:ISO639-2 ];
  dc:extent [ rdfs:label "251p" ];

  dc:contributor [ a foaf:Agent;
                   foaf:name "Boyd, William";
                   skos:notation "Boyd, William, 1952-";
                   bio:event [ a bio:Birth;
                               bio:date "1952"^^xsd:gYear ];
                   = <http://bibliographica.org/entity/735b02a8f051e2249e40fbd48112d033>;
                 ];

  dc:subject [ rdfs:label "Fiction in English" ],
             [ rdfs:label "1945-" ],
             [ rdfs:label "Texts" ],
             [ a skos:Concept;
               skos:inScheme <http://dewey.info/scheme/e18>;
               skos:notation "823/.9/1"^^<ddc:Notation> ],
             [ a skos:Concept;
               skos:inScheme <http://dewey.info/scheme/e19>;
               skos:notation "823/.914"^^<ddc:Notation> ];

  dc:publisher [ a foaf:Agent;
                 foaf:name "Hamilton";
                 skos:notation "Hamilton";
                 = <http://bibliographica.org/entity/c080da5b03a0786efa61e61123b359d9>;
               ];
  dc:issued "1981"^^xsd:gYear;
  isbd:hasPlaceOfPublicationProductionDistribution [ rdfs:label "London" ];

  bibo:identifier "GB8102507";
  bibo:isbn <urn:isbn:0241105161>;
  ov:blid "008042853".

Some of the salient features of this representation are:

  • Assignment of URIs for each entry in the British National
    Bibliography under http://bnb.bibliographica.org/.
  • Linkage with rdfs:seeAlso to the RDF Book Mashup and RDF
    Book Vocabulary.
  • Author and publisher are preserved as blank nodes as in the source
    data but are augmented with owl:sameAs links into the
    Bibliographica namespace to support further annotation,
    correction, deduplication, etc.
  • Any series if present is promoted to a first-class entity in the
    Bibliographica namespace for further processing.
  • Extraction of birth and death dates from the canonical
    string representation for authors.
  • For authors, publishers and series, their name as present in the
    source data is preserved using skos:notation whilst their
    names less any metadata about birth and death are represented with
    foaf:name.
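The date extraction can be pictured with a short Python sketch. The regular expression below is an assumption about the shape of the canonical strings, based only on the "Boyd, William, 1952-" example above; it is not the project's actual transformation code, and other date forms are simply left untouched.

```python
import re

# Matches a trailing ", 1952-" or ", 1835-1910"; other forms are not handled.
DATES = re.compile(r",\s*(?P<birth>\d{4})\s*-\s*(?P<death>\d{4})?\s*$")


def split_name(notation):
    """Split a canonical author string into (name, birth, death)."""
    m = DATES.search(notation)
    if m is None:
        return notation.strip(), None, None
    return notation[:m.start()].strip(), m.group("birth"), m.group("death")


print(split_name("Boyd, William, 1952-"))
```

In terms of the representation above, the first element would correspond to foaf:name, while the untouched input string is what skos:notation preserves.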

The entire dataset is queryable through the SPARQL Endpoint and
makes use of some of the extended features of Virtuoso such as
full-text indexing. This is accomplished by using the bif:contains
built-in function and is what powers the search functionality on the
website. The default (example) query returns some details about all
books that have "Edinburgh" in their titles:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?book ?title ?name ?description
WHERE {
    ?book a bibo:Book .
    ?book dc:title ?title . ?title bif:contains "Edinburgh" .
    OPTIONAL { ?book dc:description ?description } .
    OPTIONAL {
        ?book dc:contributor ?author . ?author foaf:name ?name
    }
} GROUP BY ?book LIMIT 50

It should be noted that only some predicates are indexed for full-text
searching, namely,

  • rdf:value
  • rdfs:label
  • rdfs:comment
  • skos:prefLabel
  • skos:altLabel
  • dc:title
  • dc:description
  • foaf:name
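Since a bif:contains search on any other predicate will not use the full-text index, it can help to check the predicate before building a query. The Python sketch below does this for a single search word; the helper name fulltext_pattern, the ?book subject and the restriction to one alphanumeric word are assumptions made to keep the example small.

```python
# The predicates listed above as indexed for full-text search.
INDEXED = {"rdf:value", "rdfs:label", "rdfs:comment", "skos:prefLabel",
           "skos:altLabel", "dc:title", "dc:description", "foaf:name"}


def fulltext_pattern(predicate, word, var="?text"):
    """Build the triple patterns for a bif:contains search on one predicate."""
    if predicate not in INDEXED:
        raise ValueError(f"{predicate} is not indexed for full-text search")
    if not word.isalnum():
        # Keeps quoting trivial; phrases and wildcards are out of scope here.
        raise ValueError("only single alphanumeric words are handled here")
    return f'?book {predicate} {var} . {var} bif:contains "{word}" .'


print(fulltext_pattern("dc:title", "Edinburgh"))
```

The returned string is meant to be dropped into the WHERE clause of a query like the example above, alongside the matching PREFIX declarations.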

Further Work

An ultimate goal of our work in the Open Bibliography group at the
OKF is to enable the collection of rich metadata about the
relationships between works and authors, to document and map the
scholarly discourse. This dataset is an important building block to
help ground the references in such a project. However, more immediately
we will:

  • Make available a voiD description of this dataset describing
    its properties in more detail.
  • Make available a dump of our dataset derived from the BNB
    so that the data can be easily mirrored and copied for local
    processing.
  • Correct the errors listed in the Errata section below.

though not necessarily in that order.

Errata

  • ISBNs were represented in the source dataset as string literals of the
    form URN:ISBN:0123456789 and were erroneously transformed to URIs in
    violation of the rdfs:range of bibo:isbn.
  • The predicate linking the resource to its representations,
    foaf:isPrimaryTopicOf, contains a typo, which may make it
    difficult to use some RDF browsing clients that do not
    infer the inverse of foaf:primaryTopic.
Posted in JISC OpenBib