Open Bibliography and Open Bibliographic Data
http://openbiblio.net
Open Bibliographic Data Working Group of the Open Knowledge Foundation

Importing Spanish National Library to BibServer
http://openbiblio.net/2012/08/07/importing-spanish-national-library-to-bibserver/
Tue, 07 Aug 2012 16:09:29 +0000

The Spanish National Library (Biblioteca Nacional de España or BNE) has released their library catalogue as Linked Open Data on the Datahub.

Initially this entry only contained the SPARQL endpoints and not downloads of the full datasets. After some enquiries from Naomi Lillie the entry was updated with links to some more information and bulk downloads at: http://www.bne.es/es/Catalogos/DatosEnlazados/DescargaFicheros/

This library dataset is particularly interesting as it is not a ‘straightforward’ dump of bibliographic records. This is best explained by Karen Coyle in her blogpost.

For a BibServer import, the implication is that the importing script has to distinguish the type of each record it reads and take the relevant action before building the BibJSON entry. Fortunately the data dump was already provided as N-Triples, so we did not have to pre-process the large data file (4.9GB) in the same manner as we did with the German National Library dataset.

The Python script to perform the reading of the datafile can be viewed at https://gist.github.com/3225004

A complicating matter from a data wrangler’s point of view is that the field names are based on IFLA Standards, which are numeric codes and not ‘guessable’ English terms such as Dublin Core fields. This is more correct from an international and data quality point of view, but does make the initial mapping more time consuming.

 So when mapping a data item like https://gist.github.com/3225004#file_sample.nt we need to dereference each fieldname and map it to the relevant BibJSON entry.
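To give a feel for what that mapping step involves, here is a minimal sketch (not the script from the gist). It assumes a simplified N-Triples line pattern rather than the full grammar, and the property URIs, the `FIELD_MAP` table and the `records_from_ntriples` helper are all hypothetical names made up for illustration.

```python
import re
from collections import defaultdict

# Simplified N-Triples line pattern: <subject> <predicate> object .
# An assumption for illustration; it does not cover the whole N-Triples grammar.
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')

# Hypothetical mapping from numeric property URIs to BibJSON keys.
# The real URIs have to be dereferenced and mapped by hand.
FIELD_MAP = {
    "http://example.org/isbd/P1004": "title",
    "http://example.org/isbd/P1018": "year",
}


def parse_object(raw):
    """Return the value of a URI or literal object as a plain string."""
    if raw.startswith("<") and raw.endswith(">"):
        return raw[1:-1]
    match = re.match(r'^"((?:[^"\\]|\\.)*)"', raw)
    return match.group(1) if match else raw


def records_from_ntriples(lines):
    """Group triples by subject and map known predicates to BibJSON keys."""
    records = defaultdict(dict)
    for line in lines:
        match = TRIPLE_RE.match(line.strip())
        if not match:
            continue  # blank lines, comments, or anything this sketch cannot parse
        subject, predicate, raw_object = match.groups()
        key = FIELD_MAP.get(predicate)
        if key is None:
            continue  # unmapped field: still needs a mapping entry
        records[subject].setdefault("id", subject)
        records[subject][key] = parse_object(raw_object)
    return dict(records)


sample = [
    '<http://example.org/r/1> <http://example.org/isbd/P1004> "Don Quijote" .',
    '<http://example.org/r/1> <http://example.org/isbd/P1018> "1605" .',
    '<http://example.org/r/1> <http://example.org/other> <http://example.org/x> .',
]
print(records_from_ntriples(sample))
```

Because N-Triples puts one statement per line, a file of several gigabytes can be streamed through a function like this without loading it into memory, provided the triples for a record are grouped or the grouping is done in a store rather than a dictionary.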

As we identify more Linked Open Data National Bibliographies, these experiments will be continued under the http://nb.bibsoup.net/ BibServer instance.

BiblioHack: Day 2, part 2
http://openbiblio.net/2012/06/14/bibliohack-day-2-part-2/
Thu, 14 Jun 2012 15:00:10 +0000

Pens down! Or, rather, key-strokes cease!

BiblioHack has drawn to a close and the results of two days’ hard labour are in:

A Bibliographic Toolkit

Utilising BibServer

Peter Murray-Rust reported back on what was planned, what was done, and the overlap between the two! The priority was cleaning up the process for setting up BibServers and getting them running on different architectures. (PubCrawler was going to be run on BibServer, but it is not currently working.) Yesterday’s big news was that Nature has released 30 million references or thereabouts – this furthers the cause of scholarly literature, in that we can, in principle, index records ourselves rather than only corporate organisations being able or permitted to do so. National Bibliographies have been put on BibSoup – UK (‘BL’), Germany, Spain and Sweden – with the technical problem of character encodings raising its head (UTF-8 solves this where used). Also, BibSoup is useful for TEXTUS, so the overall ‘toolkit’ approach is reinforced!

Open Access Index

Emanuil Tolev presented on ACat – Academic Catalogue. The first part of an index is having things to access – so gathering about 55,000 journals was a good start! Using Elastic Search within these journals will give lists of contents, which will then provide lists of articles (via facet view); other services will then determine licensing / open access information (URL checks assisted in this process). The ongoing plan is to use this tool to ascertain licensing information for every single record in the world. (Link to ACat to follow).

Annotation Tools

Tom Oinn talked about the ideas that have come out of discussions and hacking around annotators and TEXTUS. Reading lists and citation management are a key part of what TEXTUS is intended to assist with, so the plan is for any annotation to be allowed to carry a citation – whether personal opinion or related record. Personalised lists will come out of this and TEXTUS should become a reference management tool in its own right. Keep your eye on TEXTUS for the practical applications of these ideas!

Note: more detailed write-ups will appear courtesy of others, do watch the OKFN blog for this and all things open…

Postscript: OKFN blog post here

Huge thanks to all those who participated in the event – your ideas and enthusiasm have made this so much fun to be involved with.

Also thanks to those who helped run the event, visible or behind-the-scenes, particularly Sam Leon.

Here’s to the next one :-)

BiblioHack: Day 2, part 1
http://openbiblio.net/2012/06/14/day-2-part-1/
Thu, 14 Jun 2012 10:46:36 +0000

After easing into the day with breakfast and coffee, each of the 3 sub-groups gave an overview of its mini-project’s aim and fed back on the evening’s progress:

  • Peter Murray-Rust revisited the overarching theme of ‘A Bibliographic Toolkit’ and the BibServer sub-group’s specific work on adding datasets and easily deploying BibServer; Adrian Pohl followed up to explain that he would be developing a National Libraries BibServer.
  • Tom Oinn explained the Annotation Tools sub-group’s work on developing annotation tools – i.e. TEXTUS – looking at adding fragments of text, with your own comments and metadata linked to them, which then form BibSoup collections. Collating personalised references is enhanced with existing search functionality, and reading lists with annotations can refer to other texts within TEXTUS.
  • Mark MacGillivray presented the 3rd group’s work on an Open Access Index. This began with listing all the journals that can be found in the whole world, with the aim of identifying the licence of each article. They have been scraping collections (e.g. PubMed) and gathering journals – at the time of speaking they had more than 50,000! The aim is to enable a crowd-sourced list of every journal in the world which, using PubCrawler, should provide every single article in the world.

With just 5 hours left before stopping to gather thoughts, write up and feed back to the rest of the group, it will be very interesting to see the result…

BiblioHack: Day 1
http://openbiblio.net/2012/06/14/bibliohack-day-1/
Thu, 14 Jun 2012 10:25:46 +0000

The first day of BiblioHack was a day of combinations and sub-divisions!

The event attendees started the day all together, both hackers and workshop / seminar attendees, and Sam introduced the purpose of the day as follows:

  • coders – to build tools and share ideas about things that will make our shared cultural heritage and knowledge commons more accessible and useful
  • non-coders – to get a crash course in what openness means for galleries, libraries, archives and museums, why it’s important and how you can begin opening up your data
  • everyone – to get a better idea about what other people working in your domain do and engender a better understanding between librarians, academics, curators, artists and technologists, in order to foster the creation of better, cooler tools that respond to the needs of our communities

The hackers began the day with an overview of what a hackathon is for and how it can be run, as presented by Mahendra Mahey, and followed with lightning talks as follows:

  • Talk 1 Peter Murray-Rust & Ross Mounce – Content and Data Mining and a PDF extractor
  • Talk 2 Mike Jones – the m-biblio project
  • Talk 4 Ian Stuart – ORI/RJB (formerly OA-RJ)
  • Talk 5 Etienne Posthumus – Making a BibServer Parser
  • Talk 6 Emanuil Tolev – IDFind – identifying identifiers (“Feedback and real user needs won’t gather themselves”)
  • Talk 7 Mark MacGillivray – BibServer – what the project has been doing recently, how that ties into the open access index idea.
  • Talk 8 Tom Oinn – TEXTUS
  • Talk 9 Simone Fonda – Pundit – collaborative semantic annotations of texts (Semantic Web-related tool)
  • Talk 10 Ian Stuart – The basics of Linked Data

We decided we wanted to work as a community, using our different skills towards one overarching goal, rather than breaking into smaller groups with separate agendas. We formed the central idea of an ‘open bibliographic tool-kit’ and people identified three main areas to hack around, playing to their skills and interests:

  • Utilising BibServer – adding datasets and using PubCrawler
  • Creating an Open Access Index
  • Developing annotation tools

At this point we all broke for lunch, and the workshoppers and hackers mingled together. As hoped, conversations sprang up between people from the two different groups, and it was great to see suggestions arising from shared ideas, with the practical applications of one group being related to the theories of the other.

We re-grouped and the workshop continued until 16.00 – see here for Tim Hodson’s excellent write-up of the event and talks given – when the hackers were joined by some who attended the workshop. Each group gave a quick update on status, to try to persuade the new additions to join their particular work-flow, and each group grew in number. After more hushed discussions and typing, the day finished with a talk from Tara Taubman about her background in the legalities of online security and IP, and we went for dinner. Hacking continued afterwards and we celebrated a hard day’s work down the pub, looking forward to what was to come.

Day 2 to follow…

Pubcrawler: finding research publications
http://openbiblio.net/2012/06/13/pubcrawler-finding-research-publications/
Wed, 13 Jun 2012 10:45:48 +0000

This is a guest post from Sam Adams. (We have been using Pubcrawler in the Open Biblio 2 project to create reference collections of journal articles, and hope to continue this work further; this is a brief introduction to the software. Code is currently available in http://bitbucket.org/sea36/pubcrawler)

Pubcrawler collects bibliographic metadata (author, title, reference, DOI) by indexing journals’ websites in a similar manner to the way in which search engines explore the web to build their indexes. Where possible (which depends on the particular publication) it identifies any supplementary resources associated with a paper, and whether the paper is open access (i.e. readable without a subscription or any other charge) – though it cannot determine the license / conditions of such access.

Pubcrawler was originally developed by Nick Day as part of the CrystalEye project to aggregate published crystallographic structures from the supplementary data to articles on journals’ websites. Since then Pubcrawler has been extended to collect bibliographic metadata and support a wider range of journals than just those containing crystallography. Some of the activities Pubcrawler can currently support are:

  • Providing core bibliographic metadata
  • Identifying collections of open access articles
  • Identifying freely accessible supplementary information, which is often a rich source of scientific data

When pointed at a publisher’s homepage Pubcrawler will generate a list of the journals on the site and then crawl the issues’ tables of contents, recording the bibliographic metadata for the articles that it discovers. Pubcrawler uses a combination of two approaches to crawling a journal: starting at the current issue it can follow links to previous issues, walking the journal’s publication history, and if a journal’s website contains a list of issues it will also use that as a source of pages to crawl. When necessary, such as to identify supplementary resources, Pubcrawler can follow links to individual articles’ splash pages.
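The two crawling approaches described above can be combined in a single traversal. The following is only a sketch of the idea, not Pubcrawler's own code: the `SITE` dictionary is a hypothetical in-memory stand-in for a journal website, and `crawl_journal` and the page fields (`articles`, `previous`, `issue_list`) are names invented for the example.

```python
from collections import deque

# Hypothetical stand-in for a journal website. Each issue page lists its
# articles, an optional link to the previous issue, and (on some pages)
# a list of other issues.
SITE = {
    "/issue/3": {"articles": ["c1", "c2"], "previous": "/issue/2", "issue_list": []},
    "/issue/2": {"articles": ["b1"], "previous": "/issue/1",
                 "issue_list": ["/issue/1", "/issue/0"]},
    "/issue/1": {"articles": ["a1"], "previous": None, "issue_list": []},
    "/issue/0": {"articles": ["z1"], "previous": None, "issue_list": []},
}


def crawl_journal(current_issue, fetch):
    """Visit every reachable issue once and collect its articles."""
    queue = deque([current_issue])
    seen = set()
    articles = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # an issue can be reached by both approaches
        seen.add(url)
        page = fetch(url)
        if page is None:
            continue  # missing page: nothing to record
        articles.extend(page["articles"])
        if page["previous"]:
            queue.append(page["previous"])  # walk the publication history
        queue.extend(page["issue_list"])    # use any list of issues as well
    return articles


found = crawl_journal("/issue/3", SITE.get)
print(found)
```

In this toy site `/issue/0` is not linked as a previous issue from anywhere and is only found through the issue list, which is the kind of gap that using both sources of links is meant to cover.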

Pubcrawler does not index any content that is restricted by a journal’s paywall – it has been designed not to follow such links, and as added protection it is run over a commercial broadband connection, rather than from inside a University network to ensure that it does not receive any kind of privileged access.

While Pubcrawler’s general workflow is the same for any publication, custom parsers are required to extract the metadata and correct links from each website. Generally publishers use common templates for their journals’ web pages, so a parser only needs to be developed once per publisher. However, in some instances, such as where older issues have not been updated to match the current template, a parser may need to support a variety of styles.
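One way to picture the "one parser per publisher, several templates where needed" arrangement is a small registry that tries each known template in turn. This is a sketch under assumptions, not how Pubcrawler is actually structured: the HTML snippets, the `PARSERS` table and the function names are all hypothetical.

```python
import re


def parse_current_template(html):
    """Extract article titles from a (hypothetical) current page template."""
    return re.findall(r'<h3 class="title">(.*?)</h3>', html)


def parse_legacy_template(html):
    """Extract article titles from an older (hypothetical) page template."""
    return re.findall(r'<b class="art">(.*?)</b>', html)


# One entry per publisher; each lists the template parsers to try, newest first.
PARSERS = {
    "example-publisher": [parse_current_template, parse_legacy_template],
}


def extract_titles(publisher, html):
    """Return titles from the first template parser that finds any."""
    for parser in PARSERS.get(publisher, []):
        titles = parser(html)
        if titles:
            return titles
    return []  # unknown publisher, or no template matched


print(extract_titles("example-publisher", '<h3 class="title">New issue article</h3>'))
print(extract_titles("example-publisher", '<b class="art">Old issue article</b>'))
```

Supporting a new publisher then means adding one entry to the table, and a template change on a publisher's site means updating or adding one function.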

Pubcrawler currently has parsers (in varying states of completeness) for a number of publishers (biased by its history of indexing published Crystallographic structures):

  • The American Chemical Society (ACS)
  • Elsevier
  • The International Union of Crystallography (IUCr)
  • Nature
  • The Royal Society of Chemistry (RSC)
  • Springer
  • Wiley

And to date it has indexed over 10 million bibliographic records.

There are many other publishers who could be supported by Pubcrawler; they just require parsers to be created for them. Pubcrawler requires two types of maintenance – the general support to keep it running, administer servers etc., that any software requires, and occasional updates to the parsers as journals’ websites change their formatting.

Open source development – how we are doing
http://openbiblio.net/2012/05/29/open-source-development-how-we-are-doing/
Tue, 29 May 2012 11:24:17 +0000

Whilst at Open Source Junction earlier this year, I talked to Sander van der Waal and Rowan Wilson about the problems of doing open source development. Sander and Rowan work at OSS Watch, and their aim is to make sure that open source software development delivers its potential to UK HEI and research; so, I thought it would be good to get their feedback on how our project is doing, and if there is anything we are getting wrong or could improve on.

It struck me that as other JISC projects such as ours are required to make their output similarly publicly available, this discussion may be of benefit to others; after all, not everyone knows what open source software is, let alone the complexities that can arise from trying to create such software. Whilst we cannot help avoid all such complexities, we can at least detail what we have found helpful to date, and how OSS Watch view our efforts.

I provided Sander and Rowan with a review of our project, and Rowan provided some feedback confirming that overall we are doing a good job, although we lack a listing of the other open source software our project relies on, and their licenses. Whilst such data can be discerned from the dependencies of the project, this is not clear enough; I will add a written list of dependencies to the README.

The response we received is provided below, followed by the review I initially sent, which gives a brief overview of how we managed our open source development efforts:

==== Rowan Wilson, OSS Watch, responds:

Your work on this project is extremely impressive. You have the systems in place that we recommend for open development and creation of community around software, and you are using them. As an outsider I am able to quickly see that your project is active and the mailing list and roadmap present information about ways in which I could participate.

One thing I could not find, although this may be my fault, is a list of third party software within the distribution. This may well be because there is none, but it’s something I would generally be keen to see for the purposes of auditing licence compatibility.

Overall though I commend you on how tangible and visible the development work on this project is, and on the focus on user-base expansion that is evident on the mailing list.

==== Mark MacGillivray wrote:

Background – May 2011, OKF / AIM bibserver project

Open Knowledge Foundation contracted with American Institute of
Mathematics under the direction of Jim Pitman in the dept. of Maths
and Stats at UC Berkeley. The purpose of the project was to create an
open source software repository named BibServer, and to develop a
software tool that could be deployed by anyone requiring an easy way
to put and share bibliographic records online.

A repository was created at http://github.com/okfn/bibserver, and it
performs the usual logging of commits and other activities expected of
a modern DVCS system. This work was completed in September 2011, and the repository has been available since the start of that project with a GNU Affero GPL v3 licence attached.

October 2011 – JISC Open Biblio 2 project

The JISC Open Biblio 2 project chose to build on the open source
software tool named BibServer. As there was no support from AIM for
maintaining the BibServer repository, the project took on maintenance
of the repository and all further development work, with no change to
previous licence conditions.

We made this choice as we perceive open source licensing as a benefit
rather than a threat; it fit very well with the requirements of JISC
and with the desires of the developers involved in the project. At
worst, an owner may change the licence attached to some software, but
even in such a situation we could continue our work by forking from
the last available open source version (presuming that licence
conditions cannot be altered retrospectively).

The code continues to display the licence under which it is available,
and remains publicly downloadable at http://github.com/okfn/bibserver.
Should this hosting resource become publicly unavailable, an
alternative public host would be sought.

Development work and discussion has been managed publicly, via a
combination of the project website at
http://openbiblio.net/p/jiscopenbib2, the issue tracker at
http://github.com/okfn/bibserver/issues, a project wiki at
http://wiki.okfn.org/Projects/openbibliography, and via a mailing list
at openbiblio-dev@lists.okfn.org

February 2012 – JISC Open Biblio 2 offers bibsoup.net beta service

In February the JISC Open Biblio 2 project announced a beta service
available online for free public use at http://bibsoup.net. The
website runs an instance of BibServer, and highlights that the code is
open source and available (linking to the repository) to anyone who
wishes to use it.

Current status

We believe that we have made sensible decisions in choosing open
source software for our project, and have made all efforts to promote
the fact that the code is freely and publicly available.

We have found the open source development paradigm to be highly
beneficial – it has enabled us to publicly share all the work we have
done on the project, increasing engagement with potential users and
also with collaborators; we have also been able to take advantage of
other open source software during the project, incorporating it into
our work to enable faster development and improved outcomes.

We continue to develop code for the benefit of people wishing to
publicly put and share their bibliographies online, and all our
outputs will continue to be publicly available beyond the end of the
current project.

Recent BibServer technical development
http://openbiblio.net/2012/05/08/recent-bibserver-technical-development/
Tue, 08 May 2012 14:39:23 +0000

Along with the recent push of new front-end functionality to BibServer, and demonstrated on BibSoup, we have also applied some changes to the back-end.

The new scheduled collection uploader is now runnable as a stand-alone tool, to which source URLs can be provided for retrieval, conversion, and upload. Retrieved sources are stored and available from a folder on disk, as are the conversions.

Parsers can now be written in any language and plugged into the ingest functionality – for example, we now have a MARC parser that runs in Perl and is usable via ingest.py and available on an instance of BibServer – thanks very much to Ed for that.

In addition, parsers need no longer be ‘parsers’ – we have introduced the concept of scrapers as well. Check out our new Wikipedia parser / scraper, for example; it functions by taking in a search value rather than a URL, then using that to search Wikipedia for relevant references which it downloads, bundles, and converts to a BibJSON collection – this is a really great example that Etienne put together, and it demonstrates a great deal of potential for further parser / scraper development.

See the examples on the BibServer repo for more insight – they are in the parserscrapers_plugins folder, and they are managed by bibserver/ingest.py.
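As a rough illustration of how a parser written in any language can be plugged in, the sketch below runs an external program, passes it the source on standard input and reads JSON back from standard output. This is an assumption about the general shape of such a plug-in mechanism, not a description of how ingest.py actually calls parsers; `run_parser` and `DEMO_PARSER` are hypothetical names, and the "parser" here is just a tiny inline Python script standing in for, say, a Perl MARC parser.

```python
import json
import subprocess
import sys

# Stand-in for an external parser executable: reads source text on stdin and
# writes a JSON list of records to stdout. Any language could fill this role.
DEMO_PARSER = [
    sys.executable,
    "-c",
    "import sys, json; "
    "print(json.dumps([{'title': line.strip()} "
    "for line in sys.stdin if line.strip()]))",
]


def run_parser(command, source_text):
    """Run an external parser and return the records it prints as JSON."""
    result = subprocess.run(
        command,
        input=source_text,
        capture_output=True,
        text=True,
        check=True,  # raise if the parser exits with an error
    )
    return json.loads(result.stdout)


records = run_parser(DEMO_PARSER, "First title\nSecond title\n")
print(records)
```

A scraper fits the same shape: the input would be a search value instead of the contents of a retrieved URL, and the output is still a collection of records.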

We know documentation is currently lacking – we have set up an online docs resource and are in the process of writing content to populate it – please check back soon.

As usual, development work is scheduled via the tickets and milestones on our repo. Current efforts are on documentation and adding as many feature requests as possible before our hackathon on June 12th – 14th.

BibJSON updates
http://openbiblio.net/2012/05/08/bibjson-updates/
Tue, 08 May 2012 14:30:19 +0000

Following recent discussion on our mailing list, BibJSON has been updated to adopt JSON-LD for all your linked data needs.

This enables us to keep the core of BibJSON pretty simple whilst also opening up potential for more complex usage where that is required.

Due to this, we no longer use the “namespace” key in BibJSON.

Other changes include usage of “_” prefix on internal keys – so wherever our own database writes info into a record, we prefix it, such as “_id”. Because of this, uploaded BibJSON records can have an “id” key that will work, as well as an “_id” uuid applied by the BibServer system.
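The "_" prefix convention can be sketched as follows. This is an illustration only, not BibServer's storage code: `prepare_for_storage` and `public_fields` are hypothetical helpers, and the use of a hex-formatted UUID is an assumption.

```python
import uuid


def prepare_for_storage(record):
    """Return a copy of an uploaded record with a system "_id" added."""
    stored = dict(record)             # leave the uploaded record untouched
    stored["_id"] = uuid.uuid4().hex  # system key, so it carries the "_" prefix
    return stored


def public_fields(stored):
    """Return only the keys the uploader supplied (no "_"-prefixed keys)."""
    return {key: value for key, value in stored.items() if not key.startswith("_")}


uploaded = {"id": "smith2012", "title": "An example record"}
stored = prepare_for_storage(uploaded)
print(stored["id"], stored["_id"])
```

Because the system only ever writes keys that start with "_", the uploader's own "id" and the system's "_id" can sit side by side in the same record without clashing.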

For more information, check out BibJSON.org and JSON-LD

New BibServer features available on BibSoup
http://openbiblio.net/2012/05/08/new-bibserver-features-available-on-bibsoup/
Tue, 08 May 2012 14:22:53 +0000

A couple of months ago the development team had a Sprint and came up with some cool ideas of how to improve the user experience for BibServer and, subsequently, BibSoup. Have a play with the new features and see below for the details:

Main pages

  • Collections visualisation – a smart new graphic on the landing page showing information from new collections

  • Improved FAQ section with links to videos (coming soon: links to our new online docs)

Creating collections

  • New Wikipedia parser – create a collection based on the references retrievable from Wikipedia for your chosen search value

  • Improved collection upload – specify collection information, then view upload tickets to see progress and errors

  • ‘Retry’ and other options on particular collection creation attempts are also now available from the tickets page

Search results

  • Filter search results by a value range as well as specific values

  • Visualise any filter as a bubble chart and select the values you want to search with

  • Add / remove available filters and rename filter display names

  • Improved layout of record info in search results, including auto-display of the first image referenced in a record – e.g. if there is a link to an image in your record, it is displayed in the search result

Managing and sharing collections

  • Collection admin available – save your current display settings as the default for your collection, allow other users to have admin rights on your own collection

  • Share any specific searches by providing the URL displayed under the ‘share’ option

  • Embed – as the whole front-end of search and collection visualisation is handled by facetview it is possible to embed your collection search in any web page you control; the share / embed option on collection pages provides the code you need to insert to enable this

  • Download as BibJSON – a nice new obvious button on each collection provides a link to download your collection as BibJSON

Viewing records

  • Improved display of individual records, including search options to discover relevant content online

  • EXPERIMENTAL record editing – this has been enabled although still in progress – you can edit the content of a record using a visual display of the keys and values in the record, although functionality for adding new keys does not yet work. However, you can also edit the JSON directly via the options, and try saving that. Be aware – this could damage your records, and of course changes the details from whatever they were in the source content.

Still in development

These ones are not yet available on BibSoup but watch this space:

  • Creating new collections on-site – search and find particular records for inclusion in new collections or addition to pre-existing collections. This is not currently possible but we are working on making this an easy process

  • Merging collections

  • Better user creation and management, plus gravatars

  • Additional functionality on record pages – linking out directly to related sources such as PubMed, Total Impact, Service Core etc

We hope you like these changes, and find them useful – do let us know what you think and keep an eye out for the upcoming improvements.

Planning for the next three months
http://openbiblio.net/2012/03/20/planning-for-the-next-three-months/
Tue, 20 Mar 2012 18:36:37 +0000

We have developed BibJSON

We’ve improved BibServer

We’ve made BibSoup

…But what’s next?

The nature of cutting-edge technology is that it is fast-paced and constantly adapting. We may think we’ve come up with a good idea, but if it turns out someone else has already had that idea and developed it – that’s great and means we incorporate it and go on to the next exciting thing. We may think that this next thing is important, but if it turns out it doesn’t quite do the helpful thing needed to make our users delighted or promote open bibliographic data – we change tack and try something else. We know what we want to do, i.e. make useful and smart tools for the people doing wonderful things in the public domain, but as for what our end product looks like (if indeed there is one product to play with) – well, that all depends on the emerging requirements, other technologies that come to light and how successful our ideas are along the way.

Taking all that into account, at the Sprint last week we attempted to plan for the next three months. Our work will be more successful the more focused we are, and having an end-result in mind is useful for that. So, here’s a rough guide to how we think our project will shape up between now and June:

[Image: To-Do]

[Image: Timeline]

NB the images are a little fuzzy, but do click on them to follow the links to Flickr where these are stored and appear more clearly.

We have already published the CUL blog post and Mark has written about BibServer functionality that arose from ideas at the Sprint. We’ll develop these ideas into workable and worthwhile tools or processes, and before we know it we’ll be three months down the line and thinking ‘…but what’s next?’
