Pubcrawler: finding research publications

This is a guest post from Sam Adams. We have been using Pubcrawler in the Open Biblio 2 project to create reference collections of journal articles, and hope to continue this work further; this is a brief introduction to the software. Code is currently available in

Pubcrawler collects bibliographic metadata (author, title, reference, DOI) by indexing journals’ websites in a similar manner to the way in which search engines explore the web to build their indexes. Where possible (which depends on the particular publication) it identifies any supplementary resources associated with a paper, and whether the paper is open access (i.e. readable without a subscription or any other charge) – though it cannot determine the license / conditions of such access.

Pubcrawler was originally developed by Nick Day as part of the CrystalEye project to aggregate published crystallographic structures from the supplementary data to articles on journals’ websites. Since then Pubcrawler has been extended to collect bibliographic metadata and support a wider range of journals than just those containing crystallography. Some of the activities Pubcrawler can currently support are:

  • Providing core bibliographic metadata
  • Identifying collections of open access articles
  • Identifying freely accessible supplementary information, which is often a rich source of scientific data

When pointed at a publisher’s homepage Pubcrawler will generate a list of the journals on the site and then crawl the issues’ tables of contents, recording the bibliographic metadata for the articles that it discovers. Pubcrawler uses a combination of two approaches to crawling a journal: starting at the current issue it can follow links to previous issues, walking the journal’s publication history, and if a journal’s website contains a list of issues it will also use that as a source of pages to crawl. When necessary, such as to identify supplementary resources, Pubcrawler can follow links to individual articles’ splash pages.
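The two crawling approaches described above can be sketched as a single work queue. This is a minimal illustration, not Pubcrawler's actual code: the in-memory `PREVIOUS_LINK` and `ISSUE_LIST` data and the `crawl_issues` function are hypothetical stand-ins for pages that a real crawler would fetch and parse.

```python
from collections import deque

# Hypothetical stand-in for a journal website: each issue page maps to the
# "previous issue" link it exposes (None when the page has no such link).
PREVIOUS_LINK = {
    "issue-4": "issue-3",
    "issue-3": "issue-2",
    "issue-2": None,  # broken chain: issue-1 is not linked from issue-2
    "issue-1": None,
}

# Hypothetical list-of-issues page; it may itself be incomplete.
ISSUE_LIST = ["issue-4", "issue-1"]


def crawl_issues(current_issue, issue_list, previous_link):
    """Return every issue found by either approach, visiting each one once."""
    # Approach 2: seed the queue with whatever the issue list page offers.
    queue = deque([current_issue, *issue_list])
    seen = set()
    order = []
    while queue:
        issue = queue.popleft()
        if issue is None or issue in seen:
            continue
        seen.add(issue)
        order.append(issue)
        # Approach 1: walk backwards through the publication history.
        queue.append(previous_link.get(issue))
    return order


print(crawl_issues("issue-4", ISSUE_LIST, PREVIOUS_LINK))
```

Combining the two sources means that a gap in one (here, the missing link from issue-2 to issue-1) can be covered by the other, while the `seen` set keeps each page from being crawled twice.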

Pubcrawler does not index any content that is restricted by a journal’s paywall – it has been designed not to follow such links, and as added protection it is run over a commercial broadband connection rather than from inside a university network, to ensure that it does not receive any kind of privileged access.

While Pubcrawler’s general workflow is the same for any publication, custom parsers are required to extract the metadata and the correct links from each website. Publishers generally use common templates for their journals’ web pages, so a parser only needs to be developed once per publisher. In some instances, however, such as where older issues have not been updated to match the current template, a parser may need to support a variety of styles.
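As a rough illustration of a parser that supports more than one page style, the sketch below tries a current template first and falls back to an older one. The HTML snippets, the two regular expressions and the `parse_toc` function are invented for this example; they do not reflect any real publisher's markup or Pubcrawler's own parsers, which would more likely use a proper HTML parser than regular expressions.

```python
import re

# Hypothetical table-of-contents styles for one publisher: the current
# template, and an older one that was never migrated.
CURRENT_STYLE = re.compile(
    r'<li class="article" data-doi="(?P<doi>[^"]+)">(?P<title>[^<]+)</li>'
)
LEGACY_STYLE = re.compile(
    r'<p><b>(?P<title>[^<]+)</b> doi:(?P<doi>\S+)</p>'
)


def parse_toc(html, styles=(CURRENT_STYLE, LEGACY_STYLE)):
    """Try each known template in turn; return the first non-empty result."""
    for pattern in styles:
        records = [match.groupdict() for match in pattern.finditer(html)]
        if records:
            return records
    return []


new_page = '<ul><li class="article" data-doi="10.1000/new1">New paper</li></ul>'
old_page = '<p><b>Old paper</b> doi:10.1000/old1</p>'

print(parse_toc(new_page))
print(parse_toc(old_page))
```

Keeping the styles in an ordered tuple means that support for another legacy layout can be added by appending one more pattern, without touching the crawl logic.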

Pubcrawler currently has parsers (in varying states of completeness) for a number of publishers (biased by its history of indexing published crystallographic structures):

  • The American Chemical Society (ACS)
  • Elsevier
  • The International Union of Crystallography (IUCr)
  • Nature
  • The Royal Society of Chemistry (RSC)
  • Springer
  • Wiley

To date it has indexed over 10 million bibliographic records.

There are many other publishers who could be supported by Pubcrawler; they just require parsers to be created for them. Pubcrawler requires two types of maintenance – the general support that any software requires to keep it running (administering servers etc.), and occasional updates to the parsers as journals’ websites change their formatting.

Posted in BibServer, JISC OpenBib

Open source development – how we are doing

Whilst at Open Source Junction earlier this year, I talked to Sander van der Waal and Rowan Wilson about the problems of doing open source development. Sander and Rowan work at OSS Watch, and their aim is to make sure that open source software development delivers its potential to UK HEI and research, so I thought it would be good to get their feedback on how our project is doing, and whether there is anything we are getting wrong or could improve on.

It struck me that, as other JISC projects such as ours are required to make their output similarly publicly available, this discussion may be of benefit to others; after all, not everyone knows what open source software is, let alone the complexities that can arise from trying to create such software. Whilst we cannot help others avoid all such complexities, we can at least detail what we have found helpful to date, and how OSS Watch view our efforts.

I provided Sander and Rowan with a review of our project, and Rowan provided some feedback confirming that overall we are doing a good job, although we lack a listing of the other open source software our project relies on, and their licences. Whilst such data can be discerned from the dependencies of the project, this is not clear enough; I will add a written list of dependencies to the README.

The response we received is provided below, followed by the review I initially provided, which gives a brief overview of how we have managed our open source development efforts:

==== Rowan Wilson, OSS Watch, responds:

Your work on this project is extremely impressive. You have the systems in place that we recommend for open development and creation of community around software, and you are using them. As an outsider I am able to quickly see that your project is active and the mailing list and roadmap present information about ways in which I could participate.

One thing I could not find, although this may be my fault, is a list of third party software within the distribution. This may well be because there is none, but it’s something I would generally be keen to see for the purposes of auditing licence compatibility.

Overall though I commend you on how tangible and visible the development work on this project is, and on the focus on user-base expansion that is evident on the mailing list.

==== Mark MacGillivray wrote:

Background – May 2011, OKF / AIM bibserver project

Open Knowledge Foundation contracted with American Institute of
Mathematics under the direction of Jim Pitman in the dept. of Maths
and Stats at UC Berkeley. The purpose of the project was to create an
open source software repository named BibServer, and to develop a
software tool that could be deployed by anyone requiring an easy way
to put and share bibliographic records online.

A repository was created at, and it
performs the usual logging of commits and other activities expected of
a modern DVCS system. This work was completed in September 2011, and the repository has been available since the start of that project with a GNU Affero GPL v3 licence attached.

October 2011 – JISC Open Biblio 2 project

The JISC Open Biblio 2 project chose to build on the open source
software tool named BibServer. As there was no support from AIM for
maintaining the BibServer repository, the project took on maintenance
of the repository and all further development work, with no change to
previous licence conditions.

We made this choice as we perceive open source licensing as a benefit
rather than a threat; it fit very well with the requirements of JISC
and with the desires of the developers involved in the project. At
worst, an owner may change the licence attached to some software, but
even in such a situation we could continue our work by forking from
the last available open source version (presuming that licence
conditions cannot be altered retrospectively).

The code continues to display the licence under which it is available,
and remains publicly downloadable at
Should this hosting resource become publicly unavailable, an
alternative public host would be sought.

Development work and discussion have been managed publicly, via a
combination of the project website at, the issue tracker at, a project wiki at, and via a mailing list.

February 2012 – JISC Open Biblio 2 offers beta service

In February the JISC Open Biblio 2 project announced a beta service
available online for free public use at The
website runs an instance of BibServer, and highlights that the code is
open source and available (linking to the repository) to anyone who
wishes to use it.

Current status

We believe that we have made sensible decisions in choosing open
source software for our project, and have made all efforts to promote
the fact that the code is freely and publicly available.

We have found the open source development paradigm to be highly
beneficial – it has enabled us to publicly share all the work we have
done on the project, increasing engagement with potential users and
also with collaborators; we have also been able to take advantage of
other open source software during the project, incorporating it into
our work to enable faster development and improved outcomes.

We continue to develop code for the benefit of people wishing to
publicly put and share their bibliographies online, and all our
outputs will continue to be publicly available beyond the end of the
current project.

Posted in BibServer, JISC OpenBib

Hackathon alert: BiblioHack!

This is cross-posted from the OKFN blog

The Open Knowledge Foundation’s Open Biblio group, and Working Group on Open Data in Cultural Heritage, along with DevCSI, present BiblioHack: an open Hackathon to kick-start the summer months. From Wednesday 13th – Thursday 14th June, we’ll be meeting at Queen Mary, University of London, East London, and any budding hackers are welcome, along with anyone interested in opening up metadata and the open cause – this free event aims to bring together software developers, project managers, librarians and experts in the area of Open Bibliographic Data. A workshop will run alongside the coding on the 13th, and a meet-up on the evening of the 12th is open to all whether you’re attending the Hackathon or not.

What is BiblioHack?

BiblioHack will be two days of hacking and sharing ideas about open bibliographic metadata.

There will be opportunities to hack on open bibliographic datasets and experiment with new prototypes and tools. The focus will be on building things and improving existing systems that enable people and institutions to get the most out of bibliographic data.

If you’re a non-coder there are sessions for you too. We will be running a hands-on workshop addressing the technical aspects of opening up cultural heritage data: best-of-breed open source tools for doing so, preparing your data for a hackathon, and the best standards for storing and exposing your data to make it more easily re-used.

When and where?

  • The main hackathon will take place over two days between 13th and 14th June at Queen Mary University of London
  • On the morning of the 13th June we’ll be running the workshop aimed at the technical challenges of opening up metadata. So for those unable to participate in the hack due to time constraints or lack of coding know-how – this is for you!
  • On the 12th June – Tuesday evening (details TBC but will be a pub in central / east London!) – we’ll also be hosting a meet-up for anyone attending the hack or interested in open data more generally. Whether it’s open bibliographic data, spending or government data that floats your boat, all tribes are welcome!

Who is organising the event?

Who else is involved?

We’ve already lined up a whole host of speakers and groups who’ll be attending both the hack and the workshop. The list so far includes UK Discovery, CKAN, Europeana, Total Impact, Neontribe, The British Library with many more to be added in the coming days…

You’re giving your time and expertise – what do you get if you attend the whole hack?

  • Accommodation at QMUL overnight on the 13th
  • Food and drink across the 3 days
  • The chance to work with experts in their fields
  • Admiration and respect from your peers
  • We could expound at length, but… go on, you know you want to (it’s free!)

How can I sign up?

  • Register here for the 2 day hack
  • Register here for workshop only
  • Register here for Meet-up only

Please note, if you wish to attend all 3 events you should sign up for each, and the Workshop will run in parallel with the hacking on the morning of the 13th.

More questions?

Contact Naomi Lillie on admin [@]

See you there!

Posted in event, events, JISC OpenBib, News, OKFN Openbiblio

BiblioHack hackathon registration form

Registration is now closed, but keep an eye on this blog and the main OKFN blog for upcoming events.

Posted in event, events, JISC OpenBib, OKFN Openbiblio

#OpenDataEDB 2: 16th May

Following the fun we had at March’s Meet-up ‘launch’, we will be having another gathering of people interested in open data next Wednesday 16th May. We will be hosted by the Wash Bar, Edinburgh, from 19.00; come and join us to discuss ideas, projects and plans in relation to openness.

Lightning Talks will include Federico Sangati on crowdsourcing and education, ahead of his presentation at Dev8ed later this month, and a sneak preview of the hackathon that Open Biblio will be running 12-14th June in collaboration with OKFN’s Open GLAM and Cultural Heritage Working Group and DevCSI.

If you would like to give a lightning talk (an informal 2-3 minute presentation) about anything related to open data or knowledge, contact naomi.lillie [@]

Sign up here and we’ll see you there!

For this and other events in Edinburgh and the rest of Scotland, sign up here.

Posted in event, events, JISC OpenBib, News, OKFN Openbiblio

Recent BibServer technical development

Along with the recent push of new front-end functionality to BibServer, as demonstrated on BibSoup, we have also made some changes to the back-end.

The new scheduled collection uploader is now runnable as a stand-alone tool, to which source URLs can be provided for retrieval, conversion, and upload. Retrieved sources are stored and available from a folder on disk, as are the conversions.

Parsers can now be written in any language and plugged into the ingest functionality – for example, we now have a MARC parser that runs in Perl and is usable via and available on an instance of BibServer – thanks very much to Ed for that.

In addition, parsers need no longer be ‘parsers’ – we have introduced the concept of scrapers as well. Check out our new Wikipedia parser / scraper, for example; it functions by taking in a search value rather than a URL, then using that to search Wikipedia for relevant references which it downloads, bundles, and converts to a BibJSON collection – this is a really great example that Etienne put together, and it demonstrates a great deal of potential for further parser / scraper development.

See the examples on the BibServer repo for more insight – they are in the parserscrapers_plugins folder, and they are managed by bibserver/
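To give a flavour of what a stand-alone parser might look like, here is a minimal sketch. It assumes, purely for illustration, that a parser reads source data from an input stream and writes a JSON collection to an output stream; the tab-separated input format, the `parse_lines` and `run` functions, and the exact shape of the output are hypothetical, so check the examples in the repo for the real plugin interface.

```python
import io
import json


def parse_lines(lines):
    """Convert 'title<TAB>author<TAB>year' lines into a BibJSON-style collection."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # This sketch expects exactly three tab-separated fields per line.
        title, author, year = line.split("\t")
        records.append({
            "title": title,
            "author": [{"name": author}],
            "year": year,
        })
    # The collection layout here is an assumption made for this example.
    return {"metadata": {"records": len(records)}, "records": records}


def run(source, sink):
    """Read source text from one stream and write JSON to another."""
    json.dump(parse_lines(source), sink, indent=2)


# Demonstration with in-memory streams instead of real stdin / stdout.
sample = io.StringIO("Open bibliography\tA. Author\t2012\n")
output = io.StringIO()
run(sample, output)
print(output.getvalue())
```

Used as a command-line tool, the same function could be called as `run(sys.stdin, sys.stdout)`, which is the kind of language-neutral contract that lets parsers be written in Perl, Python or anything else.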

We know documentation is currently lacking – we have set up an online docs resource and are in the process of writing content to populate it – please check back soon.

As usual, development work is scheduled via the tickets and milestones on our repo. Current efforts are on documentation and adding as many feature requests as possible before our hackathon on June 12th – 14th.

Posted in BibServer, Data, JISC OpenBib, News, OKFN Openbiblio

BibJSON updates

Following recent discussion on our mailing list, BibJSON has been updated to adopt JSON-LD for all your linked data needs.

This enables us to keep the core of BibJSON pretty simple whilst also opening up potential for more complex usage where that is required.

Due to this, we no longer use the “namespace” key in BibJSON.

Other changes include use of a “_” prefix on internal keys – so wherever our own database writes info into a record, we prefix the key, as in “_id”. Because of this, uploaded BibJSON records can have an “id” key that will work, alongside an “_id” uuid applied by the BibServer system.
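The coexistence of “id” and “_id” can be illustrated with a short sketch. The `store_record` function below is hypothetical and is not BibServer's code; in particular, discarding any “_”-prefixed keys supplied by the uploader is an assumption made for this example.

```python
import uuid


def store_record(uploaded):
    """Return a copy of an uploaded record with a system-assigned "_id"."""
    # Assumption for this sketch: keys starting with "_" are reserved for
    # the system, so any supplied by the uploader are discarded.
    record = {key: value for key, value in uploaded.items()
              if not key.startswith("_")}
    # The system writes its own identifier under the prefixed key.
    record["_id"] = uuid.uuid4().hex
    return record


uploaded = {"id": "smith2012", "title": "An example record"}
stored = store_record(uploaded)

print(stored["id"])   # the uploader's own identifier is preserved
print(stored["_id"])  # a uuid assigned by the system
```

Because the two identifiers live under different keys, a collection can keep its original citation keys while the system still has a unique handle for every record.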

For more information, check out and JSON-LD

Posted in BibServer, Data, JISC OpenBib, LOD-LAM, News, OKFN Openbiblio

New BibServer features available on BibSoup

A couple of months ago the development team had a Sprint and came up with some cool ideas of how to improve the user experience for BibServer and, subsequently, BibSoup. Have a play with the new features and see below for the details:

Main pages

  • Collections visualisation – a smart new graphic on the landing page showing information from new collections

  • Improved FAQ section with links to videos (coming soon: links to our new online docs)

Creating collections

  • New Wikipedia parser – create a collection based on the references retrievable from Wikipedia for your chosen search value

  • Improved collection upload – specify collection information, then view upload tickets to see progress and errors

  • ‘Retry’ and other options on particular collection creation attempts are also now available from the tickets page

Search results

  • Filter search results by a value range as well as specific values

  • Visualise any filter as a bubble chart and select the values you want to search with

  • Add / remove available filters and rename filter display names

  • Improved layout of record info in search results, including auto-display of the first image referenced in a record – e.g. if there is a link to an image in your record, it is displayed in the search result

Managing and sharing collections

  • Collection admin available – save your current display settings as the default for your collection, allow other users to have admin rights on your own collection

  • Share any specific searches by providing the URL displayed under the ‘share’ option

  • Embed – as the whole front-end of search and collection visualisation is handled by facetview it is possible to embed your collection search in any web page you control; the share / embed option on collection pages provides the code you need to insert to enable this

  • Download as BibJSON – a nice new obvious button on each collection provides a link to download your collection as BibJSON

Viewing records

  • Improved display of individual records, including search options to discover relevant content online

  • EXPERIMENTAL record editing – this has been enabled although still in progress – you can edit the content of a record using a visual display of the keys and values in the record, although functionality for adding new keys does not yet work. However, you can also edit the JSON directly via the options, and try saving that. Be aware – this could damage your records, and of course changes the details from whatever they were in the source content.

Still in development

These ones are not yet available on BibSoup but watch this space:

  • Creating new collections on-site – search and find particular records for inclusion in new collections or addition to pre-existing collections. This is not currently possible but we are working on making this an easy process

  • Merging collections

  • Better user creation and management, plus gravatars

  • Additional functionality on record pages – linking out directly to related sources such as PubMed, Total Impact, Service Core etc

We hope you like these changes, and find them useful – do let us know what you think and keep an eye out for the upcoming improvements.

Posted in BibServer, Data, JISC OpenBib, News, OKFN Openbiblio

Harvard Library releases 12M bibliographic records under CC0

Harvard Library yesterday announced the release of 12 million bibliographic records into the public domain using CC0.

From the announcement:

The Harvard Library announced it is making more than 12 million catalog records from Harvard’s 73 libraries publicly available.

The records contain bibliographic information about books, videos, audio recordings, images, manuscripts, maps, and more. The Harvard Library is making these records available in accordance with its Open Metadata Policy and under a Creative Commons 0 (CC0) public domain license. In addition, the Harvard Library announced its open distribution of metadata from its Digital Access to Scholarship at Harvard (DASH) scholarly article repository under a similar CC0 license.

‘The Harvard Library is committed to collaboration and open access. We hope this contribution is one of many steps toward sharing the vital cultural knowledge held by libraries with all,’ said Mary Lee Kennedy, Senior Associate Provost for the Harvard Library.

The catalog records are available for bulk download from Harvard, and are available for programmatic access by software applications via API’s at the Digital Public Library of America (DPLA). The records are in the standard MARC21 format.

That’s great news. There is already an entry for this dataset at the Data Hub.

See also David Weinberger’s post on the data release.

Posted in Data, News

Community discussions 2

It’s been a funny few weeks, with Easter meaning that various people have been out-and-about at various times, but as always, the community never rests… Following on from Community Discussions (1), here are the latest goings-on to raise your interest and maybe your eyebrows:

  • Mark MacGillivray reported on the 29th March that the project is working with Total Impact to link their services to an instance of Bibserver

  • Multilingual matters in BibJSON arose again and, once more, JSON-LD was given backing as being useful for our purposes

  • Adrian Pohl discussed Nature’s release of 450,000 articles under a CC0 licence and followed up with this more in-depth article

  • Todd Robbins circulated Jim Pitman’s detailed article on author identity, which explores the issue of citations in relation to a non-existent publication (!) and includes recommendations for opening your own data

  • Antoine Isaac notified us of Europeana’s ‘Connecting Society to Culture Programme’, as part of the Hack4Europe! 2012 road show, including 3 hackathons in 3 different countries; it turns out that one in Berlin starts on the day our own Hackathon finishes – speaking of which…

  • Hackathon ‘show of hands’ request sent out! Contact naomi.lillie [@] if you’d like to add your name to the list of interested people. This Hackathon is being run by Open Biblio with Open GLAM and DevCSI, on the 12th-14th June, in East London – more details to follow

  • Following the last Working Group meeting, Adrian Pohl expounded upon the role and goals of the Working Group, which is now given at and

  • David Weinberger announced that Harvard has opened 12 million bibliographic records to the public domain under a CC0 licence – see his blog post to read more about it and for links to further information

  • Finally, Adrian announced the next Working Group meeting on 7th May, where Mark MacGillivray will be updating the community on project developments

As always, thanks to our amazing community for promoting and leading open bibliographic data in all its manifestations. To become part of the group and get your voice heard, sign up to the List here.

Posted in event, JISC OpenBib, LOD-LAM, News, OKFN Openbiblio