A revamp of bibserver and bibsoup

Mark MacGillivray — Fri, 11 Jan 2013 03:26:12 +0000

Since our work last year on the JISC Open Bibliography 2 project, I have been thinking about the approach we took to building a tool that people might use; some of that approach, I think, was wrong. So, I have recently been working on some changes and pushed a new version of bibserver to the repository, in a branch called bibwiki.

Also today, the service running at http://bibsoup.net has been rebooted to run the new branch. One of the downsides of this is that user accounts and data that existed on the old system are no longer available. The old system had some issues anyway, and was giving errors for a few of the more recent attempts to upload large datasets, so I decided to wipe the slate clean and start again from scratch. However, if you had any particular collections in there that you need to have recovered, please get in touch via the openbiblio-dev mailing list and I will recover them for you.

Now, on to the details of what has changed, and why. Let’s start with the why.

Why change it?

One of the original requirements of bibserver was that it would present a personally curated collection of bibliographic records; this extended not only to the curation of the collection, but to the curation of records within that collection. Unfortunately, this made every collection an island – a private island, with guards round the edges; not so good for building open knowledge or community. Also, we put too much emphasis on legacy data and formats; whilst there is of course value in old standards like bibtex, and in historical records, giving up the flexibility of the present for the sake of the past is the opposite of progress. Instead we should take the best bits of what we had and improve on them, then get our historical content into newer, more useful forms.

Because of these issues, it seems sensible to try a more connected, more open, more modern approach. So, what I have done is to remove the concept of “ownership” of a record and to remove the ties to legacy data formats or sources. Instead what we now have is a tool into which we can dump bibJSON data, and via which we can build personally curated collections of shared bibliographic records.

So what has changed?

You can only upload bibJSON

Whilst the conversion tools we wrote to process data from formats such as bibtex or RIS into bibJSON are useful and will be utilised elsewhere, they are not part of the core functionality of bibserver. They are a way to get from the past into the present, and once you are here, you should forget about the past and get on with the future. So your upload is one-off, and cares not from whence it came.
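To give a rough idea of what an upload might look like, here is a minimal sketch in Python. The field names (title, author, year, identifier) follow common bibJSON usage, but the example data, the collection name, and the looks_like_bibjson helper are assumptions for illustration, not part of bibserver.

```python
import json

# Hypothetical example data; the field names follow common bibJSON usage,
# but nothing here is taken from the bibserver code itself.
collection = {
    "metadata": {"collection": "my_papers", "label": "My papers"},
    "records": [
        {
            "type": "article",
            "title": "An example article",
            "author": [{"name": "A. Author"}, {"name": "B. Writer"}],
            "year": "2012",
            "identifier": [{"type": "doi", "id": "10.1234/example"}],
        }
    ],
}


def looks_like_bibjson(data):
    """Loose sanity check (not a validator): a dict holding a list of record dicts."""
    records = data.get("records") if isinstance(data, dict) else None
    return isinstance(records, list) and all(isinstance(r, dict) for r in records)


# Serialise to a JSON string, ready to be saved to a file for upload.
payload = json.dumps(collection, indent=2)
print(looks_like_bibjson(json.loads(payload)))
```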

You can edit records, but so can anybody else

Does what it says on the tin. For now, editing is only via clunky editing of the JSON itself, but a nice UI can be added later.

You can tag any record with anything, but so can anybody else

Anyone can tag a record with a useful term; anyone can remove a tag.

You can still build your own collection

You can still create your own collection and curate it as you see fit, and other people will not be able to change what records are in that collection; but the records themselves are still editable by anyone. Seems scary? Well, yes. But get used to it. It works for Wikipedia. (Which is why I called the new branch bibwiki.)
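The rules in the last three sections can be sketched as a toy model: records are shared and open to edits and tags from anyone, while only the owner of a collection decides which records are in it. The Store class and its method names are hypothetical and are not the bibserver implementation.

```python
class Store:
    """Hypothetical in-memory stand-in for a bibserver instance."""

    def __init__(self):
        self.records = {}      # record id -> record dict, shared by everyone
        self.collections = {}  # collection name -> {"owner": str, "ids": set}

    def edit_record(self, user, rec_id, **changes):
        # No ownership check: any user may edit any record.
        self.records[rec_id].update(changes)

    def tag(self, user, rec_id, term):
        # Any user may add a tag.
        tags = self.records[rec_id].setdefault("tags", [])
        if term not in tags:
            tags.append(term)

    def untag(self, user, rec_id, term):
        # Any user may remove a tag.
        tags = self.records[rec_id].get("tags", [])
        if term in tags:
            tags.remove(term)

    def create_collection(self, owner, name):
        self.collections[name] = {"owner": owner, "ids": set()}

    def add_to_collection(self, user, name, rec_id):
        # Membership of a collection is controlled by its owner only.
        coll = self.collections[name]
        if coll["owner"] != user:
            raise PermissionError("only the owner can change a collection")
        coll["ids"].add(rec_id)
```

A collection here holds only record ids, so an edit made by anyone shows up in every collection that includes the record.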

You can’t visualise facets anymore

You used to be able to make a little bubble picture out of the facet filters down the left-hand side. Now you can’t. It was a bit incongruously located, so this functionality is being hived off into a more specifically useful form.

You can search for any record and add it to your collection

Anything that is on the bibserver instance can be found by anyone using the search box, then you can add it to one of your collections. However, searching for everything has limited functionality and does not offer filters. This is because one of the constraints of scaling up to large datasets is that filtering is expensive; so now, you have simple search across everything, then nice complex filtered search on the things you care about. Best of both worlds with minimal compromise.

Simplistic record deduping

Where a record appears to have the same title-and-authors string on import as another record already in the database, it will try to squish them together. The important point here though is that the functionality now exists to deduplicate things via various methods, and there is no longer a constraint to maintain unique copies of things, so we can get on and build those methods.
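As a rough illustration of this kind of title-and-authors matching, here is a minimal sketch. The normalisation rule, the merge policy (keep existing values, fill in missing ones), and the function names are all assumptions; the actual bibwiki code may well do this differently.

```python
import re


def dedupe_key(record):
    """Build a normalised title-and-authors string (assumed normalisation)."""
    title = record.get("title", "")
    authors = " ".join(a.get("name", "") for a in record.get("author", []))
    # Lowercase and collapse every run of non-alphanumerics to a single space.
    return re.sub(r"[^a-z0-9]+", " ", f"{title} {authors}".lower()).strip()


def merge(existing, incoming):
    """Squish two records together: keep existing values, fill in missing ones."""
    merged = dict(existing)
    for field, value in incoming.items():
        if field not in merged or merged[field] in ("", None, []):
            merged[field] = value
    return merged


def import_records(db, records):
    """db maps dedupe key -> record; matching records are merged on import."""
    for rec in records:
        key = dedupe_key(rec)
        db[key] = merge(db[key], rec) if key in db else rec
    return db
```

Note that in this sketch records with neither a title nor authors all share the empty key, so a real version would want to skip matching for those.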

Exciting. So, what next?

Rework the parsers into a stand-alone service

The parsers from bibtex, RIS, etc. should be built out as a simple service that we can run, where you hit the webpage, give it your file (or file URL), and it pings you when it has done the conversion, with a link for you to get your bibJSON from. This should work with some sort of parser plugin functionality, so that we can run it with the parsers we have, and other people can run it with their own if they wish. Then we can boot up a translation service at http://translate.bibsoup.net.

This is the most important next step, as without it not many people will be able to upload records.
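One way the plugin idea could work is a simple registry that maps a format name to a parser function, so that other people can register their own. The sketch below is hypothetical throughout: the PARSERS registry, the register decorator, and the toy RIS parser (which only understands the TI, AU, PY and ER tags) are illustrations, not the existing bibserver parsers.

```python
PARSERS = {}  # format name -> parser function (hypothetical registry)


def register(fmt):
    """Decorator that adds a parser function to the registry under a format name."""
    def decorator(func):
        PARSERS[fmt] = func
        return func
    return decorator


@register("ris")
def parse_ris(text):
    """Toy RIS parser: handles only TI, AU and PY; ER closes a record."""
    records, current = [], {}
    for line in text.splitlines():
        # RIS lines look like "TI  - Some title": two-letter tag, then the value.
        tag, value = line[:2], line[6:].strip()
        if tag == "ER":
            if current:
                records.append(current)
            current = {}
        elif tag == "TI":
            current["title"] = value
        elif tag == "AU":
            current.setdefault("author", []).append({"name": value})
        elif tag == "PY":
            current["year"] = value
    return records


def convert(fmt, text):
    """Look up the parser for fmt and wrap its records as a bibJSON-style dict."""
    if fmt not in PARSERS:
        raise ValueError(f"no parser registered for {fmt!r}")
    return {"records": PARSERS[fmt](text)}


SAMPLE = "TY  - JOUR\nTI  - Open data\nAU  - Mac, M.\nPY  - 2013\nER  - \n"
print(convert("ris", SAMPLE))
```

A web service would then only need to call convert with the uploaded file and the requested format, and hand back the result.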

Upload some bibliographic metadata

There are numerous sources of biblio metadata we have collected over the years, and some of these will be uploaded into bibsoup for people to use.

Also, there is potential to run specific instances of bibsoup for people who need them – although, overall, it is probably more sensible to keep them all together and distinguish via collections.

Bugfix

This is basically a beta 2 implementation. Please go and use the new system at http://bibsoup.net, and get back to the mailing list with the usual issues.

Build up some deduplication maybe with pybossa

Now that we can edit records and find similar ones, we can also do interesting things like enable users to tag records that are about the same thing. We can also run queries to find similar records and expose that data perhaps through a tool like pybossa, to get crowd-sourced deduplication on the go.
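As a sketch of how candidate duplicates might be found before handing them to people for checking, the following compares titles pairwise using difflib from the Python standard library. The 0.85 threshold, the title-only comparison, and the function names are assumptions, and comparing every pair will not scale to large datasets without a cheaper first pass.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Ratio between 0 and 1 for how alike two records' titles are."""
    return SequenceMatcher(
        None, a.get("title", "").lower(), b.get("title", "").lower()
    ).ratio()


def candidate_pairs(records, threshold=0.85):
    """Return (index, index, score) for pairs that may be the same work."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs
```

Each pair returned could become one small task, asking a person whether the two records describe the same thing.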

Rewrite the tests

All the tests that were in the original branch have yet to be copied over. A lot of them will become redundant. So if you like tests (and we should have them), then get involved with porting them over / writing new ones.

Update the docs

The documentation needs to be updated; a lot of it still refers to the old branch, although a fair bit of it is still relevant.

Decide how to manage the code and bibsoup in the future

What I have done here is make some fairly large changes to our original aims; it is possible that not everybody will like this. However, the great thing about code repositories is that we have versioning, so anyone can use any version of the software. My changes are still in a branch, so we can either merge these into the main, or fork them off to a separate project if necessary. Unless reasons against merging into main are given, that will be the course taken once the parsers have been hived off.
