Minutes: 24th Virtual Meeting of the OKFN Working Group for Open Bibliographic Data

Date: August, 7th 2012, 15:00 GMT

Channels: Meeting was held via Skype and Etherpad

Participants

  • Jim Pitman
  • Karen Coyle
  • Naomi Lillie

Agenda

JISC Open Biblio 2 project coming to close

  • Blog-post write-up of project being finished this week, Mark MacGillivray reporting back to JISC in late September
  • Further funding being explored mainly in terms of related work

ISBNdb http://isbndb.com/

  • Similar to BibJSON
  • Uses other sources, has no explicit license / restrictions
  • API will give 500 returns a day
  • Jim’s example: http://isbndb.com/d/person/pitman_jim/books.html
    • author identity is not working very well – this example contains a book that isn’t Jim’s
  • There is no record without an ISBN – seems to be no information from pre-1970
  • Claims to have 7million books but only 2m authors – FAQs state that records are gleaned from different libraries so duplication is likely
  • Open Library is possibly a better source

Karen’s most recent blog: http://kcoyle.blogspot.co.uk/2012/07/fair-use-deja-vu.html

  • “The argument that Google has made from the beginning of its book scanning project is that copying for the purpose of providing keyword access to full texts is fair use”
    • HathiTrust has been in court to defend the storing and searching of metadata

Actions:

Posted in JISC OpenBib, minutes, OKFN Openbiblio | Leave a comment

Nature’s data platform strongly expanded

Nature has largely expanded its Linked Open Data platform that was launched in April 2012. From today’s press release:

Logo of the journal Nature used in its first issue on Nov. 4, 1869

“As part of its wider commitment to open science, Nature Publishing Group’s (NPG) Linked Data Platform now hosts more than 270 million Resource Description Framework (RDF) statements. It has been expanded more than ten times, in a growing number of datasets. These datasets have been created under the Creative Commons Zero (CC0) waiver, which permits maximal use/reuse of this data. The data is now being updated in real-time and new triples are being dynamically added to the datasets as articles are published on nature.com.

Available at http://data.nature.com, the platform now contains bibliographic metadata for all NPG titles, including Scientific American back to 1845, and NPG’s academic journals published on behalf of our society partners. NPG’s Linked Data Platform now includes citation metadata for all published article references. The NPG subject ontology is also significantly expanded.

The new release expands the platform to include additional RDF statements of bibliographic, citation, data citation and ontology metadata, which are organised into 12 datasets – an increase from the 8 datasets previously available. Full snapshots of this data release are now available for download, either by individual dataset or as a complete package, for registered users at http://developers.nature.com.

This is exciting, especially the commitment to real-time updates is a great move and shows how serious Linked Open Data becomes in general and in particular in the realm of bibliographic data. Also, Nature now uses the Data Hub and has registered the data seperated into several datasets.

Posted in Data, News, Semantic Web | Tagged | Leave a comment

Community Discussions 3

It has been a couple of months since the round-up on Community Discussions 2 and we have been busy! BiblioHack was a highlight for me, and last week included a meeting of many OKFN types – here’s a picture taken by Lucy Chambers for @OKFN of some team members:

IMG_0351

The Discussion List has been busy too:

  • Further to David Weinbergers’s pointer that Harvard released 12 million bibliographic records with a CC0 licence, Rufus Pollock created a collection on the DataHub and added it to the Biblio section for easy of reference

  • Rufus also noticed that OCLC had issued their major release of VIAF, meaning that millions of author records are now available as Open Data (under Open Data Commons Attribution license), and updated the DataHub dataset to reflect this

  • Peter Murray-Rust noted that Nature has made its metadata Open CC0

  • David Shotton promoted the International Workshop on Contributorship and Scholarly Attribution at Harvard, and prepared a handy guide for attribution of submissions

  • Adrian Pohl circulated a call for participation for the SWIB12 “Semantic Web in Bibliotheken” (Semantic Web in Libraries) Conference in Cologne, 26-28 November this year, and hosted the monthly Working Group call

  • Lars Aronsson looked at multivolume works, asking whether the OpenLibrary can create and connect records for each volume. HathiTrust and Gallica were suggested as potential tools in collating volumes, and the barcode (containing information populated by the source library) was noted as being invaluable in processing these

  • Sam Leon explained that TEXTUS would be integrating BibSever facet view and encouraged people to have a look at the work so far; Tom Oinn highlighted the collaboration between Enriched BibJSON and TEXTUS, and explained that he would be adding a ‘TEXTUS’ field to BibJSON for this purpose

  • Sam also circulated two tools for people to test, Pundit and Korbo, which have been developed out of Digitised Manuscripts to Europeana (DM2E)

  • Jenny Molloy promoted the Open Science Hackday which took place last week – see below for a snap-shot courtesy of @OKFN:

IMG_1964

In related news, Peter Murray-Rust is continuing to advocate the cause of open data – do have a read of the latest posts on his blog to see how he’s getting on.

The Open Biblio community continues to be invaluable to the Open GLAM, Heritage, Access and other groups too and I would encourage those interested in such discussions to join up at the OKFN Lists page.

Posted in BibServer, Data, event, events, JISC OpenBib, licensing, News, OKFN Openbiblio | Tagged , , , | Leave a comment

Using wikipedia to build a philosophy (or other sort of) collection in BibSoup

Here is a quick example of how to quickly build a reference collection in BibSoup, using the great source of knowledge that is Wikipedia.

To begin with, you might want to go to Wikipedia directly and try performing some searches for relevant material, to help you put together sensible search terms for your area of interest. Your search terms will be used to pull relevant citations from the wikipedia database.

Then, go over to the BibSoup upload page; signup / login is required, so do that if you have not already done so.

Type in your wikipedia search terms in the upload box at the top of the page, give your collection a name and a description, specify the license if you wish, and choose the “wikipedia search to citations” file format from the list at the bottom. Then hit upload.

A ticket will be created for building your collection, and you can view the progress on the ticket page.

Once it is done, you can find your new collection either on the BibSoup collections page or on your own BibSoup user account page – for example atfor the user named “test”. Also of course, you could go straight to the URL of your collection – they appear at http://bibsoup.net/username/collection.

There you go! You should now have a reference collection based on your wikipedia search terms. Check out our our example.

Posted in BibServer, JISC OpenBib | Tagged , , | 1 Comment

Linked Data in worldcat.org

This post was first published on Übertext: Blog.

Two days ago OCLC announced that linked data has been added to worldcat.org. I took a quick look at it and just want to share some notes on this.

OCLC goes open, finally

I am very happy that OCLC – with using the ODC-BY license – finally managed to choose open licensing for WorldCat. Quite a change of attitude when you recall the attempt in 2008 to sneak in a restrictive viral copyright license as part of a WorldCat record policy (for more information see the code4lib wikipage on the policy change or my German article about it). Certainly, it were not at last the blogging librarians and library tech people, the open access/open data proponents etc. who didn’t stop to push OCLC towards openness, who made this possible. Thank you all!

Of course, this is only the beginning. One thing is, that dumps of this WorldCat data aren’t available yet (see follow-up addendum here), thus, making it necessary to crawl the whole WorldCat to get hold of the data. Another thing is, that there probably is a whole lot of useful information in WorldCat that isn’t part of the linked data in worldcat.org yet .

schema.org in RDFa and microdata

What information is actually encoded as linked data in worldcat.org? And how did OCLC add RDF to worldcat.org? It used the schema.org vocabulary to add semantic markup to the HTML. This markup is both added as microdata – the native choice fo schema.org vocab – as well as in RDFa. schema.org lets people choose how to use the vocabulary, on the schema.org blog it recently said: “Our approach is “Microdata and more”. As implementations and services begin to consume RDFa 1.1, publishers with an interest in mixing schema.org with additional vocabularies, or who are using tools like Drupal 7, may find RDFa well worth exploring.

Let’s take a look at a description of a bibliographic resource in worldcat.org, e.g. http://www.worldcat.org/title/linked-data-evolving-the-web-into-a-global-data-space/oclc/704257552The part of the HTML source containing the semantic markup is marked as “Microdata Section” (although it does also contain RDFa). As the HTML source isn’t really readable for humans, we need to get hold of the RDF in a readable form first to have a look at it. I prefer the turtle syntax for looking at RDF. One can get the RDF contained in the HTML out using the RDFa distiller provided by the W3C. More precisely you have to use the distiller that supports RDFa 1.1 as schema.org supports RDFa 1.1 and, thus, worldcat.org is enriched according to the RDFa 1.1 standard.

However, using the distiller on the example resource I can get back a turtle document that contains the following triples:

1:  @prefix library: <http://purl.org/library/> .
2: @prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .
3: @prefix owl: <http://www.w3.org/2002/07/owl#> .
4: @prefix schema: <http://schema.org/> .
5: @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
6: <http://www.worldcat.org/oclc/707877350> a schema:Book;
7: library:holdingsCount "1"@en;
8: library:oclcnum "707877350"@en;
9: library:placeOfPublication [ a schema:Place;
10: schema:name "San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) :"@en ];
11: schema:about [ a skos:Concept;
12: schema:name "Web site development."@en;
13: madsrdf:isIdentifiedByAuthority <http://id.loc.gov/authorities/subjects/sh98004795> ],
14: [ a skos:Concept;
15: schema:name "Semantic Web."@en;
16: madsrdf:isIdentifiedByAuthority <http://id.loc.gov/authorities/subjects/sh2002000569> ],
17: <http://dewey.info/class/025/e22/>,
18: <http://id.worldcat.org/fast/1112076>,
19: <http://id.worldcat.org/fast/1173243>;
20: schema:author <http://viaf.org/viaf/38278185>;
21: schema:bookFormat schema:EBook;
22: schema:contributor <http://viaf.org/viaf/171087834>;
23: schema:copyrightYear "2011"@en;
24: schema:description "1. Introduction -- The data deluge -- The rationale for linked data -- Structure enables sophisticated processing -- Hyperlinks connect distributed data -- From data islands to a global data space -- Introducing Big Lynx productions --"@en,
25: "The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study."@en;
26: schema:inLanguage "en"@en;
27: schema:isbn "1608454312"@en,
28: "9781608454310"@en;
29: schema:name "Linked data evolving the web into a global data space"@en;
30: schema:publisher [ a schema:Organization;
31: schema:name "Morgan & Claypool"@en ];
32: owl:sameAs <http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001> .

This looks quite nice to me. You see, how schema.org let’s you easily convey the most relevant information and the property names are well-chosen to make it easy for humans to read the RDF (in contrast e.g. to the ISBD vocabulary which uses numbers in the property URIs following the library tradition :-/).

The example also shows the current shortcomings of schema.org and where the library community might put some effort in to extending it, as OCLC has already been doing for this release with the experimental “library” extension vocabulary for use with Schema.org. E.g., there are no seperate schema.org properties for a table of content and an abstract so that they are both put into one string using ther schema:description property.

Links to other linked data sources

There are links to several other data sources: LoC authorities (lines 13, 16, 41, 44) , dewey.info (17), the linked data FAST headings (18,19), viaf.org (20,22) and an owl:sameAs link to the HTTP-DOI identifier (32). As most of these services are already run by OCLC and as the connections probably all were already existent in the data, creating these links wasn’t hard work, which of course doesn’t make them less useful.

Copyright information

What I found very interesting is the schema:copyrightYear property used in some descriptions in worldcat.org. I don’t know how much resources are covered with the indication of a copyright year and how accurate the data is, but this seems a useful source to me for projects like publicdomainworks.net.

Missing URIs

As with other preceding publications of linked bibliographic data there are some URIs missing for things we might want to link to instead of only serving the name string of the respecting entity: I am talking about places and publishers. Until now, AFAIK URIs for publishers don’t exist, hopefully someone (OCLC perhaps?) is already working on a LOD registry for publishers. For places, we have geonames but it is not that trivial to generate the right links. It’s not a great surprise that a lot of work has to be done to build the global data space.

Posted in Data, LOD-LAM | 3 Comments

Bringing the Open German National Bibliography to a BibServer

This blog post is written by Etienne Posthumus and Adrian Pohl.

We are happy that the German National Library recently released the German National Bibliography as Linked Open Data, see (announcement). At the #bibliohack this week we worked on getting the data into a BibServer instance. Here, we want to share our experiences in trying to re-use this dataset.

Parsing large turtle files: problem and solution

The raw data file is 1.1GB in a compressed format – unzipped it is a 6.8 GB turtle file. Working with this file is unwieldy, it can not be read into memory or converted with tools like rapper (which only works for turtle files up to 2 GB, see this mail thread). Thus, it would be nice if the German National Library could either provide one big N-Triples file that is better for streaming processing or provide a number of smaller turtle files.

Our solution to get the file into a workable form is to make a small Python script that is Turtle syntax aware, to split the file into smaller pieces. You can’t use the standard UNIX split command, as each snippet of the split file also needs the prefix information at the top and we do not want to split an entry in the middle, losing triples.

See a sample converted N-Triples file from a turtle snippet.

Converting the N-Triples to BibJSON

After this, we started working on parsing an example N-Triples file to convert the data to BibJSON. We haven’t gotten that far, though. See https://gist.github.com/2928984#file_ntriple2bibjson.py for the resulting code (work in progress).

Problems

We noted problems with some properties that we like to document here as feedback for the German National Library.

Heterogeneous use of dcterms:extent

The dcterms:extent property is used in many different ways, thus we are considering to omit it in the conversion to BibJSON. Some example values of this property: “Mikrofiches”, “21 cm”, “CD-ROMs”, “Videokassetten”, “XVII, 330 S.”. Probably it would be the more appropriate choice to use dcterms:format for most of these and to limit the use of dcterms:extent to pagination information and duration.

URIs that don’t resolve

We stumbled over some URIs that don’t resolve, whether you order RDF or HTML in the accept header. Examples: http://d-nb.info/019673442, http://d-nb.info/019675585, http://d-nb.info/011077166

Also, DDC URIs that are connected to a resource with dcters:subject don’t resolve, e.g. http://d-nb.info/ddc-sg/070.

Footnote

At a previous BibServer hackday, we loaded the Britsh National Bibliography data into BibServer. This was a similar problem, but as the data was in RDF/XML we could directly use the built-in Python XML streaming parser to convert the RDF data into BibJSON. See: https://gist.github.com/1731588 for the source.

Posted in BibServer, Data, event, events | Tagged , , | 4 Comments

BiblioHack: Day 2, part 2

Pens down! Or, rather, key-strokes cease!

BiblioHack has drawn to a close and the results of two days’ hard labour are in:

A Bibliographic Toolkit

Utilising BibServer

Peter Murray-Rust reported back on what was planned, what was done, and the overlap between the two! The priority was cleaning up the process for setting up BibServers and getting them running on different architectures. (PubCrawler was going to be run on BibServer but currently it’s not working). Yesterday’s big news was that Nature has released 30 million references or thereabouts – this furthers the cause of scholarly literature whereby we, in principle, can index records rather than just corporate organisations being able / permitted to do so. National Bibliographies have been put on BibSoup – UK (‘BL’), Germany, Spain and Sweden – with the technical problem character encodings raising its head (UTF8 solves this where used). Also, BibSoup is useful for TEXTUS so the overall ‘toolkit’ approach is reinforced!

Open Access Index

Emanuil Tolev presented on ACat – Academic Catalogue. The first part of an index is having things to access – so gathering about 55,000 journals was a good start! Using Elastic Search within these journals will give list of contents which will then provide lists of articles (via facet view), then other services will determine licensing / open access information (URL checks assisted in this process). The ongoing plan is to use this tool to ascertain licensing information for every single record in the world. (Link to ACat to follow).

Annotation Tools

Tom Oinn talked about the ideas that have come out of discussions and hacking around annotators and TEXTUS. Reading lists and citation management is a key part of what TEXTUS is intended to assist with, so the plan is for any annotation to be allowed to carry a citation – whether personal opinion or related record. Personalised lists will come out of this and TEXTUS should become a reference management tool in its own right. Keep your eye on TEXTUS for the practical applications of these ideas!

Note: more detailed write-ups will appear courtesy of others, do watch the OKFN blog for this and all things open…

Postscript: OKFN blog post here

Huge thanks to all those who participated in the event – your ideas and enthusiasm have made this so much fun to be involved with.

Also thanks to those who helped run the event, visible or behind-the-scenes, particularly Sam Leon.

Here’s to the next one :-)

Posted in BibServer, Data, event, events, JISC OpenBib, minutes, News, OKFN Openbiblio, Talks | Tagged , , , , , , , , , | Leave a comment

BiblioHack: Day 2, part 1

After easing into the day with breakfast and coffee, each of the 3 sub-groups gave an overview of the mini-project’s aim and fed back on the evening’s progress:

  • Peter Murray-Rust revisited the overarching theme of ‘A Bibliographic Toolkit’ and the BibServer sub-group’s specific work on adding datasets and easily deploying BibServer; Adrian Pohl followed up to explain that he would be developing a National Libraries BibServer.
  • Tom Oinn explained the Annotation Tools sub-groups’s work on developing annotation tools – ie TEXTUS – looking at adding fragments of text, with your own comments and metadata linked to it, which then forms BibSoup collections. Collating personalised references is enhanced with existing search functionality, and reading lists with annotations can refer to other texts within TEXTUS.
  • Mark MacGillivray presented the 3rd group’s work on an Open Access Index. This began with listing all the journals that can be found in the whole world, with the aim of identifying the licence of each article. They have been scraping collections (eg PubMed) and gathering journals – at the time of speaking they had around 50,000+! The aim is to enable a crowd-sourced list of every journal in the world which, using PubCrawler, should provide every single article in the world.

With just 5 hours left before stopping to gather thoughts, write-up and feedback to the rest of the group, it will be very interesting to see the result…

Posted in BibServer, Data, event, events, JISC OpenBib, minutes, News, OKFN Openbiblio, Talks | Tagged , , , , , , , , , | Leave a comment

BiblioHack: Day 1

The first day of BiblioHack was a day of combinations and sub-divisions!

The event attendees started the day all together, both hackers and workshop / seminar attendees, and Sam introduced the purpose of the day as follows: coders – to build tools and share ideas about things that will make our shared cultural heritage and knowledge commons more accessible and useful; non-coders – to get a crash course in what openness means for galleries, libraries, archives and museums, why it’s important and how you can begin opening up your data; everyone – to get a better idea about what other people working in your domain do and engender a better understanding between librarians, academics, curators, artists and technologists, in order to foster the creation of better, cooler tools that respond to the needs of our communities.

The hackers began the day with an overview of what a hackathon is for and how it can be run, as presented by Mahendra Mahey, and followed with lightning talks as follows:

  • Talk 1 Peter Murray Rust & Ross Mounce – Content and Data Mining and a PDF extractor
  • Talk 2 Mike Jones – the m-biblio project
  • Talk 4 Ian Stuart – ORI/RJB (formerly OA-RJ)
  • Talk 5 Etienne Posthumus – Making a BibServer Parser
  • Talk 6 Emanuil Tolev – IDFind – identifying identifiers (“Feedback and real user needs won’t gather themselves”)
  • Talk 7 Mark MacGillivray – BibServer – what the project has been doing recently, how that ties into the open access index idea.
  • Talk 8 Tom Oinn – TEXTUS
  • Talk 9 Simone Fonda – Pundit – collaborative semantic annotations of texts (Semantic Web-related tool)
  • Talk 10 Ian Stuart – The basics of Linked Data

We decided we wanted to work as a community, using our different skills towards one overarching goal, rather than breaking into smaller groups with separate agendas. We formed the central idea of an ‘open bibliographic tool-kit’ and people identified three main areas to hack around, playing to their skills and interests:

  • Utilising BibServer – adding datasets and using PubCrawler
  • Creating an Open Access Index
  • Developing annotation tools

At this point we all broke for lunch, and the workshoppers and hackers mingled together. As hoped, conversations sprung up between people from the two different groups and it was great to see suggestions arising from shared ideas and applications of one group being explained to the theories of the other.

We re-grouped and the workshop continued until 16.00 – see here for Tim Hodson’s excellent write-up of the event and talks given – when the hackers were joined by some who attended the workshop. Each group gave a quick update on status, to try to persuade the new additions to the group to join their particular work-flow, and each group grew in number. After more hushed discussions and typing, the day finished with a talk from Tara Taubman about her background in the legalities of online security and IP, and we went for dinner. Hacking continued afterwards and we celebrated a hard day’s work down the pub, lookong forward to what was to come.

Day 2 to follow…

Posted in BibServer, Data, event, events, JISC OpenBib, licensing, LOD-LAM, minutes, OKFN Openbiblio, Talks | Tagged , , , , , , , , , | Leave a comment

BiblioHack Meet-up

I’ve been quiet on this blog lately, but it’s in the same way a duck looks still when swimming: things may look peaceful but there is much activity going on beneath the surface! The Open Biblio crowd have been busy on the discussion List (link to follow) and the BiblioHack organisers have been preparing for this week’s events, which kicked off with a Meet-up last night.

The pre-BiblioHack Meet-up was designed to be an informal opportunity for those involved in the events to put names to faces and start up discussions; it was also open to anyone who wanted to come along to find out more about open data and the OKFN’s Working Groups including Open GLAM, and projects such as DM2E as well as Open Biblio.

With no formal agenda, we started up conversations as the mood took us – this covered legalities of openness in relation to IP, licensing and open access, annotation, cat-sitting and the Blues. In a nod to the more ‘usual’ OKFN #OpenData meet-ups, we went around the room to introduce ourselves (trying to explain our interests in only 3 words was challenging…) which prompted some people to cross the room in a purposeful fashion to intercept someone they hadn’t spoken to by that point. I really enjoyed meeting the people with whom I’d be spending the next two days, so thanks to all those who came along, for their interesting ideas and suggestions, and huge thanks to Sam Leon for arranging the tasty food and drinks at C4CC and for facilitating the evening.

Posted in event, events, JISC OpenBib, OKFN Openbiblio | Tagged , , | 1 Comment