Open Bibliography and Open Bibliographic Data » JISC OpenBib

Final report: JISC Open Bibliography 2

Mark MacGillivray — Thu, 23 Aug 2012 12:35:16 +0000

Following on from the success of the first JISC Open Bibliography project we have now completed a further year of development and advocacy as part of the JISC Discovery programme.

Our stated aims at the beginning of the second year of development were to show our community (namely all those interested in furthering the cause of Open via bibliographic data, including: coders; academics; those with interest in supporting Galleries, Libraries, Archives and Museums; etc) what we are missing if we do not commit to Open Bibliography, and to show that Open Bibliography is a fundamental requirement of a community committed to discovery and dissemination of ideas. We intended to do this by demonstrating the value of carefully managed metadata collections of particular interest to individuals and small groups, thus realising the potential of the open access to large collections of metadata we now enjoy.

We have been successful overall in achieving our aims, and we present here a summary of our output to date (it may be useful to refer to this guide to terms).

Outputs

BibServer and FacetView

The BibServer open source software package enables individuals and small groups to present their bibliographic collections easily online. BibServer utilises elasticsearch in the background to index supplied records, and these are presented via the frontend using the FacetView javascript library. This use of javascript at the front end allows easy embedding of result displays on any web page.
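As a rough illustration of what a faceted display does with indexed records, the sketch below counts the distinct values of a field and then narrows a record list by one chosen value. It is plain Python rather than elasticsearch or FacetView code, and the function names and sample records are invented for the example.

```python
from collections import Counter


def facet_counts(records, field):
    """Count how many records carry each value of `field`."""
    counts = Counter()
    for record in records:
        value = record.get(field)
        if value is None:
            continue
        # A field may hold a single value or a list of values.
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts


def filter_by(records, field, wanted):
    """Keep only the records whose `field` contains `wanted`."""
    kept = []
    for record in records:
        value = record.get(field)
        values = value if isinstance(value, list) else [value]
        if wanted in values:
            kept.append(record)
    return kept


# Invented sample records, only for demonstration.
SAMPLE_RECORDS = [
    {"title": "A", "year": "2010", "keyword": ["malaria", "open access"]},
    {"title": "B", "year": "2011", "keyword": ["malaria"]},
    {"title": "C", "year": "2011", "keyword": ["physics"]},
]

if __name__ == "__main__":
    print(dict(facet_counts(SAMPLE_RECORDS, "year")))
    print([r["title"] for r in filter_by(SAMPLE_RECORDS, "keyword", "malaria")])
```

In a real deployment the counting is done by the search index, not in application code; the sketch only shows the shape of the result a facet list is built from.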

BibSoup and more demonstrations

Our own version of BibServer is up and running at http://bibsoup.net, where we have seen over 100 users sharing more than 14000 records across over 60 collections. Some particularly interesting example collections include:

Cambridge Physics Tripos – a collection of 234 records extracted from a physics department MS Word reading list
Adrians bibsonomy bibliographie – a collection of 338 records extracted directly from Bibsonomy
Testing philosophy – a collection of 21 records extracted directly from Wikipedia via the Wikipedia search collection builder

Additionally, we have created some niche instances of BibServer for solving specific problems – for example, check out http://malaria.bibsoup.net; here we have used BibServer to analyse and display collections specific to malaria researchers, as a demonstration of the extent of open access materials in the field. Further analysis allowed us to show where best to look for relevant materials that could be expected to be openly available, and to begin work on the concept of an Open Access Index for research.

Another example is the German National Bibliography, as provided by the German National Library, which is in progress (as explained by Adrian Pohl and Etienne Posthumus here). We have built, and are still building, similar collections for all other national bibliographies that we receive.

BibJSON

At http://bibjson.org we have produced a simple convention for presenting bibliographic records in JSON. This has seen good uptake so far, with additional use in the JISC TEXTUS project and in Total Impact, amongst others.
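To give a feel for the convention, here is a minimal sketch of a BibJSON-style record built and serialised in Python. The key names reflect our reading of the convention and should be checked against http://bibjson.org; every value is a made-up placeholder, and the helper `author_names` is hypothetical.

```python
import json

# A BibJSON-style record: a flat JSON object with a few structured fields.
# All values are invented placeholders.
record = {
    "type": "article",
    "title": "An example article title",
    "author": [{"name": "A. N. Author"}, {"name": "Another Author"}],
    "year": "2012",
    "journal": {"name": "Journal of Examples"},
    "identifier": [{"type": "doi", "id": "10.0000/example"}],
    "link": [{"url": "http://example.org/article"}],
}


def author_names(rec):
    """Return the plain list of author names from a record (hypothetical helper)."""
    return [author["name"] for author in rec.get("author", [])]


# Serialise to JSON text, as it might be stored or sent to an upload form.
serialised = json.dumps(record, indent=2, sort_keys=True)

if __name__ == "__main__":
    print(serialised)
    print(author_names(record))
```

Because a record is ordinary JSON, any language with a JSON library can read and write it without special tooling.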

Pubcrawler

Pubcrawler collects bibliographic metadata, via parsers created for particular sites, and we have used it to create collections of articles. The full post provides more information.

datahub collections

We have continued to collect useful bibliographic collections throughout the year, and these along with all others discovered by the community can be found on the datahub in the bibliographic group.

Open Access / Bibliography advocacy videos and presentations

As part of a Sprint in January we recorded videos of the work we were doing and the roles we play in this project and wider biblio promotion; we also made a how-to for using BibServer, including feedback from a new user:

Setting up a Bibserver and Faceted Browsing (Mark MacGillivray) from Bibsoup Project on Vimeo.

Peter and Tom Murray-Rust’s video, made into a prezi, has proven useful in explaining the basics of the need for Open Bibliography and Open Access.

Community activities

The Open Biblio community have gathered for a number of different reasons over the duration of this project: the project team met in Cambridge and Edinburgh to plan work in Sprints; Edinburgh also played host to a couple of Meet-ups for the wider open community, as did London; and London hosted BiblioHack – a hackathon / workshop for established enthusiasts as well as new faces, both with and without technical know-how.

These events – particularly BiblioHack – attracted people from all over the UK and Europe, and we were pleased that the work we are doing is gaining attention from similar projects world-wide.

Further collaborations

TEXTUS wants to integrate BibServer FacetView and add a ‘TEXTUS’ field to BibJSON – this is ongoing but work to-date is available as a prototype;
Public Domain Works produced the Open Metadata Handbook in collaboration with the Open Bibliographic Data Working Group;
Mike Jones used his mobile application, M-Biblio, to deposit metadata to BibServer during the Hackathon – he writes about the trials and successes here;
OSS Watch, the JISC-funded organisation, looked at our project output to monitor good open-source standards. We shared their results here;
ServiceCORE tells you if a record within BibServer is available from any UK repository – we link directly to the results from our record pages.

Lessons

Over the course of this project we have learnt that open source development provides great flexibility and power to do what we need to do, and open access in general frees us from many difficult constraints. There is now a lot of useful information available online for how to do open source and open access.
Whilst licensing remains an issue, it becomes clear that making everything publicly and freely available to the fullest extent possible is the simplest solution, causing no further complications down the line. See the open definition as well as our principles for more information.

We discovered during the BibJSON spec development that it must be clear whether a specification is centrally controlled, or more of a communal agreement on use. There are advantages and disadvantages to each method, however they are not compatible – although one may become the other. We took the communal agreement approach, as we found that in the early stages there was more value in exposing the spec to people as widely and openly as possible than in maintaining close control. Moving to a close control format requires specific and ongoing commitment.

Community building remains tricky and somewhat serendipitous. Just as word-of-mouth can enhance reputation, failure of certain communities can detrimentally impact other parts of the project. Again, the best solution is to ensure everything is as open as possible from the outset, thereby reducing the impact of any one particular failure.

Opportunities and Possibilities

Over the two years, the concept of open bibliography has gone from requiring justification to being an expectation; the value of making this metadata openly available to the public is now obvious, and getting such access is no longer so difficult; where access is not yet available, many groups are now moving toward making it available. And of course, there are now plenty of tools to make good use of available metadata.

Future opportunities now lie in the more general field of Open Scholarship, where a default of Open Bibliography can be leveraged to great effect. For example, recent Open Access mandates by many UK funding councils (eg Finch Report) could be backed up by investigative checks on the accessibility of research outputs, supporting provision of an open access corpus of scholarly material.

We intend now to continue work in this wider context, and we will soon publicise our more specific ideas; we would appreciate contact with other groups interested in working further in this area.

Further information

For the original project overview, see http://openbiblio.net/p/jiscopenbib2; also, a full chronological listing of all our project posts is available at http://openbiblio.net/tag/jiscopenbib2/. The work package descriptions are available at http://openbiblio.net/p/jiscopenbib2/work-packages/, and links to posts relevant to each work package over the course of the project follow:

WP1 Participation with Discovery programme
WP2 Collaborate with partners to develop social and technical interoperability
WP3 Open Bibliography advocacy
WP4 Community support
WP5 Data acquisition
WP6 Software development
WP7 Beta deployment
WP8 Disruptive innovation
WP9 Project management (NB all posts about the project are relevant to this WP)
WP10 Preparation for service delivery

All software developed during this project is available under an open source licence. All the data that was released during this project fell under OKD-compliant licences such as PDDL or CC0, depending on which was chosen by the publisher. The content of our site is licensed under a Creative Commons Attribution 3.0 License (all jurisdictions).

The project team would like to thank supporting staff at the Open Knowledge Foundation and Cambridge University Library, the OKF Open Bibliography working group and Open Access working group, Neil Wilson and the team at the British Library, and Andy McGregor and the rest of the team at JISC.

Importing Spanish National Library to BibServer

Etienne Posthumus — Tue, 07 Aug 2012 16:09:29 +0000

The Spanish National Library (Biblioteca Nacional de España or BNE) has released their library catalogue as Linked Open Data on the Datahub.

Initially this entry only contained the SPARQL endpoints and not downloads of the full datasets. After some enquiries from Naomi Lillie the entry was updated with links to some more information and bulk downloads at: http://www.bne.es/es/Catalogos/DatosEnlazados/DescargaFicheros/

This library dataset is particularly interesting as it is not a ‘straightforward’ dump of bibliographic records. This is best explained by Karen Coyle in her blogpost.

For a BibServer import, the implications are that we have to distinguish the types of record that are read by the importing script and take the relevant action before building the BibJSON entry. Fortunately the datadump was made as N-Triples already, so we did not have to pre-process the large datafile (4.9GB) in the same manner as we did with the German National Library dataset.

The Python script to perform the reading of the datafile can be viewed at https://gist.github.com/3225004

A complicating matter from a data wrangler’s point of view is that the field names are based on IFLA Standards, which are numeric codes and not ‘guessable’ English terms like DublinCore fields for example. This is more correct from an international and data quality point of view, but does make the initial mapping more time consuming.

So when mapping a data item like https://gist.github.com/3225004#file_sample.nt we need to dereference each fieldname and map it to the relevant BibJSON entry.
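The general shape of such a mapping can be sketched as follows. This is not the script in the gist: the predicate and class URIs in `FIELD_MAP` and `WANTED_TYPES` are hypothetical stand-ins for the real numeric IFLA codes, the regular expression only handles simple N-Triples lines (no blank nodes, no unescaping of literals), and everything is held in memory, which would not suit a multi-gigabyte dump.

```python
import re
from collections import defaultdict

# Matches the simple N-Triples shapes used in this sketch:
#   <subject> <predicate> <object-uri> .
#   <subject> <predicate> "literal"@lang .   (or "literal"^^<datatype>)
TRIPLE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<o_uri>[^>]*)>'
    r'|"(?P<o_lit>(?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
    r'\s*\.\s*$'
)

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# HYPOTHETICAL mapping from numeric-coded predicates to BibJSON keys.
FIELD_MAP = {
    "http://example.org/vocab/P0001": "title",
    "http://example.org/vocab/P0002": "publisher",
}
# HYPOTHETICAL class URI for the record type we want to import.
WANTED_TYPES = {"http://example.org/vocab/C0001"}


def read_subjects(lines):
    """Group triples by subject: {subject: {predicate: [objects]}}."""
    subjects = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = TRIPLE.match(line)
        if match is None:
            continue  # skip anything this simple pattern cannot read
        obj = match.group("o_uri")
        if obj is None:
            obj = match.group("o_lit")
        subjects[match.group("s")][match.group("p")].append(obj)
    return subjects


def to_bibjson(subject, props):
    """Build a BibJSON-style dict, or return None for unwanted record types."""
    if not WANTED_TYPES.intersection(props.get(RDF_TYPE, [])):
        return None
    record = {"identifier": [{"type": "uri", "id": subject}]}
    for predicate, key in FIELD_MAP.items():
        values = props.get(predicate)
        if values:
            record[key] = values[0]
    return record


# Invented sample lines, only for demonstration.
SAMPLE = [
    '<http://example.org/r/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/vocab/C0001> .',
    '<http://example.org/r/1> <http://example.org/vocab/P0001> "Don Quijote"@es .',
    '<http://example.org/r/2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/vocab/C0002> .',
]

if __name__ == "__main__":
    for subject, props in read_subjects(SAMPLE).items():
        print(to_bibjson(subject, props))
```

The type check comes first because, as noted above, the dump mixes several kinds of record and only some of them should become BibJSON entries.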

As we identify more Linked Open Data National Bibliographies, these experiments will be continued under the http://nb.bibsoup.net/ BibServer instance.

Minutes: 24th Virtual Meeting of the OKFN Working Group for Open Bibliographic Data

Naomi Lillie — Tue, 07 Aug 2012 15:48:44 +0000

Date: 7th August 2012, 15:00 GMT

Channels: Meeting was held via Skype and Etherpad

Participants

Jim Pitman
Karen Coyle
Naomi Lillie

Agenda

JISC Open Biblio 2 project coming to close

Blog-post write-up of project being finished this week, Mark MacGillivray reporting back to JISC in late September
Further funding being explored mainly in terms of related work

ISBNdb http://isbndb.com/

Similar to BibJSON
Uses other sources, has no explicit license / restrictions
API will give 500 returns a day
Jim’s example: http://isbndb.com/d/person/pitman_jim/books.html
- author identity is not working very well – this example contains a book that isn’t Jim’s
There is no record without an ISBN – seems to be no information from pre-1970
Claims to have 7million books but only 2m authors – FAQs state that records are gleaned from different libraries so duplication is likely
Open Library is possibly a better source

Karen’s most recent blog: http://kcoyle.blogspot.co.uk/2012/07/fair-use-deja-vu.html

“The argument that Google has made from the beginning of its book scanning project is that copying for the purpose of providing keyword access to full texts is fair use”
- HathiTrust has been in court to defend the storing and searching of metadata

Actions:

Next meeting on Tuesday 4th September 2012
- KC apologies – away at DublinCore conference
- Will include what we hope to do at OKFest

Community Discussions 3

Naomi Lillie — Fri, 13 Jul 2012 12:41:46 +0000

It has been a couple of months since the round-up on Community Discussions 2 and we have been busy! BiblioHack was a highlight for me, and last week included a meeting of many OKFN types – here’s a picture taken by Lucy Chambers for @OKFN of some team members:

The Discussion List has been busy too:

Further to David Weinberger’s pointer that Harvard released 12 million bibliographic records with a CC0 licence, Rufus Pollock created a collection on the DataHub and added it to the Biblio section for ease of reference
Rufus also noticed that OCLC had issued their major release of VIAF, meaning that millions of author records are now available as Open Data (under Open Data Commons Attribution license), and updated the DataHub dataset to reflect this
Peter Murray-Rust noted that Nature has made its metadata Open CC0
David Shotton promoted the International Workshop on Contributorship and Scholarly Attribution at Harvard, and prepared a handy guide for attribution of submissions
Adrian Pohl circulated a call for participation for the SWIB12 “Semantic Web in Bibliotheken” (Semantic Web in Libraries) Conference in Cologne, 26-28 November this year, and hosted the monthly Working Group call
Lars Aronsson looked at multivolume works, asking whether the OpenLibrary can create and connect records for each volume. HathiTrust and Gallica were suggested as potential tools in collating volumes, and the barcode (containing information populated by the source library) was noted as being invaluable in processing these
Sam Leon explained that TEXTUS would be integrating BibServer FacetView and encouraged people to have a look at the work so far; Tom Oinn highlighted the collaboration between Enriched BibJSON and TEXTUS, and explained that he would be adding a ‘TEXTUS’ field to BibJSON for this purpose
Sam also circulated two tools for people to test, Pundit and Korbo, which have been developed out of Digitised Manuscripts to Europeana (DM2E)
Jenny Molloy promoted the Open Science Hackday which took place last week – see below for a snap-shot courtesy of @OKFN:

In related news, Peter Murray-Rust is continuing to advocate the cause of open data – do have a read of the latest posts on his blog to see how he’s getting on.

The Open Biblio community continues to be invaluable to the Open GLAM, Heritage, Access and other groups too and I would encourage those interested in such discussions to join up at the OKFN Lists page.

Using wikipedia to build a philosophy (or other sort of) collection in BibSoup

Mark MacGillivray — Wed, 27 Jun 2012 08:48:12 +0000

Here is a short example of how to quickly build a reference collection in BibSoup, using the great source of knowledge that is Wikipedia.

To begin with, you might want to go to Wikipedia directly and try performing some searches for relevant material, to help you put together sensible search terms for your area of interest. Your search terms will be used to pull relevant citations from the wikipedia database.

Then, go over to the BibSoup upload page; signup / login is required, so do that if you have not already done so.

Type in your wikipedia search terms in the upload box at the top of the page, give your collection a name and a description, specify the license if you wish, and choose the “wikipedia search to citations” file format from the list at the bottom. Then hit upload.

A ticket will be created for building your collection, and you can view the progress on the ticket page.

Once it is done, you can find your new collection either on the BibSoup collections page or on your own BibSoup user account page (for example, the page for the user named “test”). Also, of course, you could go straight to the URL of your collection – they appear at http://bibsoup.net/username/collection.

There you go! You should now have a reference collection based on your wikipedia search terms. Check out our example.
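For the curious, one plausible way of turning Wikipedia article text into citation records is sketched below. It is not the actual BibSoup parser: it simply assumes that citations appear in the wikitext as non-nested `{{cite ...}}` templates, and the function name and the choice of output fields are our own for this example.

```python
import re

# Matches simple, non-nested templates such as {{cite book |title=... |year=...}}
CITE = re.compile(
    r"\{\{\s*cite\s+(?P<kind>\w+)\s*\|(?P<body>[^{}]*)\}\}",
    re.IGNORECASE,
)


def parse_citations(wikitext):
    """Extract BibJSON-style dicts from {{cite ...}} templates in wikitext."""
    records = []
    for match in CITE.finditer(wikitext):
        fields = {}
        for part in match.group("body").split("|"):
            if "=" not in part:
                continue
            key, value = part.split("=", 1)
            key, value = key.strip().lower(), value.strip()
            if value:
                fields[key] = value

        record = {"type": match.group("kind").lower()}
        if "title" in fields:
            record["title"] = fields["title"]
        if "year" in fields:
            record["year"] = fields["year"]
        name = " ".join(
            part for part in (fields.get("first"), fields.get("last")) if part
        )
        if name:
            record["author"] = [{"name": name}]
        if "isbn" in fields:
            record["identifier"] = [{"type": "isbn", "id": fields["isbn"]}]
        records.append(record)
    return records


SAMPLE_WIKITEXT = (
    "Some article text.<ref>{{cite book |last=Russell |first=Bertrand "
    "|title=The Problems of Philosophy |year=1912}}</ref> More text."
)

if __name__ == "__main__":
    print(parse_citations(SAMPLE_WIKITEXT))
```

Real wikitext is messier than this (nested templates, multiple authors, differing parameter names), so a production parser would need considerably more care.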

BiblioHack: Day 2, part 2

Naomi Lillie — Thu, 14 Jun 2012 15:00:10 +0000

Pens down! Or, rather, key-strokes cease!

BiblioHack has drawn to a close and the results of two days’ hard labour are in:

A Bibliographic Toolkit

Utilising BibServer

Peter Murray-Rust reported back on what was planned, what was done, and the overlap between the two! The priority was cleaning up the process for setting up BibServers and getting them running on different architectures. (PubCrawler was going to be run on BibServer but currently it’s not working.) Yesterday’s big news was that Nature has released 30 million references or thereabouts – this furthers the cause of scholarly literature whereby we, in principle, can index records rather than just corporate organisations being able / permitted to do so. National Bibliographies have been put on BibSoup – UK (‘BL’), Germany, Spain and Sweden – with the technical problem of character encodings raising its head (UTF8 solves this where used). Also, BibSoup is useful for TEXTUS so the overall ‘toolkit’ approach is reinforced!

Open Access Index

Emanuil Tolev presented on ACat – Academic Catalogue. The first part of an index is having things to access – so gathering about 55,000 journals was a good start! Using Elastic Search within these journals will give a list of contents which will then provide lists of articles (via facet view), then other services will determine licensing / open access information (URL checks assisted in this process). The ongoing plan is to use this tool to ascertain licensing information for every single record in the world. (Link to ACat to follow).

Annotation Tools

Tom Oinn talked about the ideas that have come out of discussions and hacking around annotators and TEXTUS. Reading lists and citation management is a key part of what TEXTUS is intended to assist with, so the plan is for any annotation to be allowed to carry a citation – whether personal opinion or related record. Personalised lists will come out of this and TEXTUS should become a reference management tool in its own right. Keep your eye on TEXTUS for the practical applications of these ideas!

Note: more detailed write-ups will appear courtesy of others, do watch the OKFN blog for this and all things open…

Postscript: OKFN blog post here

Huge thanks to all those who participated in the event – your ideas and enthusiasm have made this so much fun to be involved with.

Also thanks to those who helped run the event, visible or behind-the-scenes, particularly Sam Leon.

Here’s to the next one

BiblioHack: Day 2, part 1

Naomi Lillie — Thu, 14 Jun 2012 10:46:36 +0000

After easing into the day with breakfast and coffee, each of the 3 sub-groups gave an overview of the mini-project’s aim and fed back on the evening’s progress:

Peter Murray-Rust revisited the overarching theme of ‘A Bibliographic Toolkit’ and the BibServer sub-group’s specific work on adding datasets and easily deploying BibServer; Adrian Pohl followed up to explain that he would be developing a National Libraries BibServer.
Tom Oinn explained the Annotation Tools sub-group’s work on developing annotation tools – ie TEXTUS – looking at adding fragments of text, with your own comments and metadata linked to it, which then forms BibSoup collections. Collating personalised references is enhanced with existing search functionality, and reading lists with annotations can refer to other texts within TEXTUS.
Mark MacGillivray presented the 3rd group’s work on an Open Access Index. This began with listing all the journals that can be found in the whole world, with the aim of identifying the licence of each article. They have been scraping collections (eg PubMed) and gathering journals – at the time of speaking they had around 50,000+! The aim is to enable a crowd-sourced list of every journal in the world which, using PubCrawler, should provide every single article in the world.

With just 5 hours left before stopping to gather thoughts, write-up and feedback to the rest of the group, it will be very interesting to see the result…

BiblioHack: Day 1

Naomi Lillie — Thu, 14 Jun 2012 10:25:46 +0000

The first day of BiblioHack was a day of combinations and sub-divisions!

The event attendees started the day all together, both hackers and workshop / seminar attendees, and Sam introduced the purpose of the day as follows: coders – to build tools and share ideas about things that will make our shared cultural heritage and knowledge commons more accessible and useful; non-coders – to get a crash course in what openness means for galleries, libraries, archives and museums, why it’s important and how you can begin opening up your data; everyone – to get a better idea about what other people working in your domain do and engender a better understanding between librarians, academics, curators, artists and technologists, in order to foster the creation of better, cooler tools that respond to the needs of our communities.

The hackers began the day with an overview of what a hackathon is for and how it can be run, as presented by Mahendra Mahey, and followed with lightning talks as follows:

Talk 1 Peter Murray-Rust & Ross Mounce – Content and Data Mining and a PDF extractor
Talk 2 Mike Jones – the m-biblio project
Talk 4 Ian Stuart – ORI/RJB (formerly OA-RJ)
Talk 5 Etienne Posthumus – Making a BibServer Parser
Talk 6 Emanuil Tolev – IDFind – identifying identifiers (“Feedback and real user needs won’t gather themselves”)
Talk 7 Mark MacGillivray – BibServer – what the project has been doing recently, how that ties into the open access index idea.
Talk 8 Tom Oinn – TEXTUS
Talk 9 Simone Fonda – Pundit – collaborative semantic annotations of texts (Semantic Web-related tool)
Talk 10 Ian Stuart – The basics of Linked Data

We decided we wanted to work as a community, using our different skills towards one overarching goal, rather than breaking into smaller groups with separate agendas. We formed the central idea of an ‘open bibliographic tool-kit’ and people identified three main areas to hack around, playing to their skills and interests:

Utilising BibServer – adding datasets and using PubCrawler
Creating an Open Access Index
Developing annotation tools

At this point we all broke for lunch, and the workshoppers and hackers mingled together. As hoped, conversations sprang up between people from the two different groups and it was great to see suggestions arising from shared ideas and applications of one group being explained to the theories of the other.

We re-grouped and the workshop continued until 16.00 – see here for Tim Hodson’s excellent write-up of the event and talks given – when the hackers were joined by some who attended the workshop. Each group gave a quick update on status, to try to persuade the new additions to the group to join their particular work-flow, and each group grew in number. After more hushed discussions and typing, the day finished with a talk from Tara Taubman about her background in the legalities of online security and IP, and we went for dinner. Hacking continued afterwards and we celebrated a hard day’s work down the pub, looking forward to what was to come.

Day 2 to follow…

BiblioHack Meet-up

Naomi Lillie — Wed, 13 Jun 2012 18:05:26 +0000

I’ve been quiet on this blog lately, but it’s in the same way a duck looks still when swimming: things may look peaceful but there is much activity going on beneath the surface! The Open Biblio crowd have been busy on the discussion List (link to follow) and the BiblioHack organisers have been preparing for this week’s events, which kicked off with a Meet-up last night.

The pre-BiblioHack Meet-up was designed to be an informal opportunity for those involved in the events to put names to faces and start up discussions; it was also open to anyone who wanted to come along to find out more about open data and the OKFN’s Working Groups including Open GLAM, and projects such as DM2E as well as Open Biblio.

With no formal agenda, we started up conversations as the mood took us – this covered legalities of openness in relation to IP, licensing and open access, annotation, cat-sitting and the Blues. In a nod to the more ‘usual’ OKFN #OpenData meet-ups, we went around the room to introduce ourselves (trying to explain our interests in only 3 words was challenging…) which prompted some people to cross the room in a purposeful fashion to intercept someone they hadn’t spoken to by that point. I really enjoyed meeting the people with whom I’d be spending the next two days, so thanks to all those who came along, for their interesting ideas and suggestions, and huge thanks to Sam Leon for arranging the tasty food and drinks at C4CC and for facilitating the evening.

Pubcrawler: finding research publications

Mark MacGillivray — Wed, 13 Jun 2012 10:45:48 +0000

This is a guest post from Sam Adams. (We have been using Pubcrawler in the Open Biblio 2 project to create reference collections of journal articles, and hope to continue this work further; this is a brief introduction to the software. Code is currently available in http://bitbucket.org/sea36/pubcrawler)

Pubcrawler collects bibliographic metadata (author, title, reference, DOI) by indexing journals’ websites in a similar manner to the way in which search engines explore the web to build their indexes. Where possible (which depends on the particular publication) it identifies any supplementary resources associated with a paper, and whether the paper is open access (i.e. readable without a subscription or any other charge) – though it cannot determine the license / conditions of such access.

Pubcrawler was originally developed by Nick Day as part of the CrystalEye project to aggregate published crystallographic structures from the supplementary data to articles on journals’ websites. Since then Pubcrawler has been extended to collect bibliographic metadata and support a wider range of journals than just those containing crystallography. Some of the activities Pubcrawler can currently support are:

Providing core bibliographic metadata
Identifying collections of open access articles
Identifying freely accessible supplementary information, which is often a rich source of scientific data

When pointed at a publisher’s homepage Pubcrawler will generate a list of the journals on the site and then crawl the issues’ tables of contents, recording the bibliographic metadata for the articles that it discovers. Pubcrawler uses a combination of two approaches to crawling a journal: starting at the current issue it can follow links to previous issues, walking the journal’s publication history, and if a journal’s website contains a list of issues it will also use that as a source of pages to crawl. When necessary, such as to identify supplementary resources, Pubcrawler can follow links to individual articles’ splash pages.
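The two crawling approaches described above can be sketched in a few lines of Python. This is an illustration only, not Pubcrawler's own code: the `fetch` callable, the page dictionaries with `articles`, `previous` and `issues` keys, and the in-memory `SITE` are all assumptions made so that the example runs without touching any real website.

```python
from collections import deque


def crawl_journal(fetch, current_issue_url, issue_list_url=None):
    """Collect article metadata by combining two sources of issue pages:
    following 'previous issue' links from the current issue, and, when
    available, reading a list of issues."""
    queue = deque([current_issue_url])
    if issue_list_url is not None:
        queue.extend(fetch(issue_list_url).get("issues", []))

    seen = set()
    articles = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # an issue can be reached by both routes
        seen.add(url)
        page = fetch(url)
        articles.extend(page.get("articles", []))
        previous = page.get("previous")
        if previous:
            queue.append(previous)
    return articles


# Invented in-memory "website". Note the issue list is incomplete:
# issue 2 is only reachable by walking back from issue 3.
SITE = {
    "/issues": {"issues": ["/issue/3", "/issue/1"]},
    "/issue/3": {"articles": [{"title": "C"}], "previous": "/issue/2"},
    "/issue/2": {"articles": [{"title": "B"}], "previous": "/issue/1"},
    "/issue/1": {"articles": [{"title": "A"}], "previous": None},
}

if __name__ == "__main__":
    found = crawl_journal(SITE.__getitem__, "/issue/3", "/issues")
    print(sorted(article["title"] for article in found))
```

Using both routes means that a gap in either one (an incomplete issue list, or a broken "previous issue" link) need not stop the crawl from reaching the remaining issues.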

Pubcrawler does not index any content that is restricted by a journal’s paywall – it has been designed not to follow such links, and as added protection it is run over a commercial broadband connection, rather than from inside a University network, to ensure that it does not receive any kind of privileged access.

While Pubcrawler’s general workflow is the same for any publication, custom parsers are required to extract the metadata and correct links from each website. Generally publishers use common templates for their journals’ web pages, so a parser only needs to be developed once per publisher; however, in some instances, such as where older issues have not been updated to match the current template, a parser may need to support a variety of styles.
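A minimal sketch of the one-parser-per-publisher idea follows, with a fallback for an older page style. It is not taken from Pubcrawler: the hostname, the two HTML patterns and the function names are hypothetical, and a real parser would use a proper HTML parser rather than regular expressions.

```python
import re
from urllib.parse import urlparse

# HYPOTHETICAL page templates for one imaginary publisher.
CURRENT_STYLE = re.compile(r'<h2 class="title">(?P<title>[^<]+)</h2>')
OLDER_STYLE = re.compile(r"<b>(?P<title>[^<]+)</b>")


def parse_example_publisher(html):
    """Try the current template first, then fall back to the older one."""
    for pattern in (CURRENT_STYLE, OLDER_STYLE):
        titles = [m.group("title").strip() for m in pattern.finditer(html)]
        if titles:
            return [{"title": title} for title in titles]
    return []


# One parser per publisher, selected by the hostname of the page.
PARSERS = {"journals.example.org": parse_example_publisher}


def extract(url, html):
    """Dispatch a fetched page to the parser registered for its host."""
    host = urlparse(url).hostname
    parser = PARSERS.get(host)
    if parser is None:
        raise LookupError(f"no parser registered for {host}")
    return parser(html)


if __name__ == "__main__":
    print(extract("http://journals.example.org/issue/9",
                  '<h2 class="title">A newer article</h2>'))
    print(extract("http://journals.example.org/issue/1",
                  "<b>An older article</b>"))
```

Supporting another publisher then means adding one entry to `PARSERS`, while the crawling logic stays untouched.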

Pubcrawler currently has parsers (in varying states of completeness) for a number of publishers (biased by its history of indexing published Crystallographic structures):

The American Chemical Society (ACS)
Elsevier
The International Union of Crystallography (IUCr)
Nature
The Royal Society of Chemistry (RSC)
Springer
Wiley

And to date it has indexed over 10 million bibliographic records.

There are many other publishers who could be supported by Pubcrawler; they just require parsers to be created for them. Pubcrawler requires two types of maintenance: the general support to keep it running (administering servers etc.) that any software requires, and occasional updates to the parsers as journals’ websites change their formatting.