Open Bibliography and Open Bibliographic Data » jiscLMS
http://openbiblio.net
Open Bibliographic Data Working Group of the Open Knowledge Foundation

Bibliographica gadget in Wikipedia
http://openbiblio.net/2011/06/06/bibliographica-gadget-in-wikipedia/
Mon, 06 Jun 2011

What is a Wikipedia gadget?

Thinking of ways to show the possibilities of linked data, we have made a Wikipedia gadget, making use of a great resource that the Wikimedia developers provide to the community.

Wikipedia gadgets are small pieces of code that you can add to your Wikipedia user templates; they let you add functionality and render extra information as you browse Wikipedia pages.

In our case, we wanted to retrieve information from our Bibliographica site and render it in Wikipedia. Because Wikipedia pages are rendered with specific markup, we can use the ISBN numbers present in articles to query the Bibliographica database, in a way similar to what Mark has done with the Edinburgh International Science Festival.

Bibliographica.org offers an ISBN search endpoint at http://bibliographica.org/isbn/, so if we request the page http://bibliographica.org/isbn/0241105161 we receive:

[{"issued": "1981-01-01T00:00:00Z", "publisher": {"name": "Hamilton"}, "uri": "http://bnb.bibliographica.org/entry/GB8102507", "contributors": [{"name": "Boyd, William, 1952-"}], "title": "A good man in Africa"}]
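Given the response above, here is a minimal Python sketch of how a client might digest it (the endpoint URL and response shape are taken from the example; the summarising function itself is purely illustrative, not the gadget's code):

```python
import json

# The example response shown above for ISBN 0241105161.
response_body = '''[{"issued": "1981-01-01T00:00:00Z",
  "publisher": {"name": "Hamilton"},
  "uri": "http://bnb.bibliographica.org/entry/GB8102507",
  "contributors": [{"name": "Boyd, William, 1952-"}],
  "title": "A good man in Africa"}]'''

def summarise(body):
    """Turn the endpoint's JSON list into short 'Title (Author, Year)' strings."""
    entries = json.loads(body)
    out = []
    for e in entries:
        authors = ", ".join(c["name"] for c in e.get("contributors", []))
        year = e.get("issued", "")[:4]
        out.append(f"{e['title']} ({authors}, {year})")
    return out

print(summarise(response_body))
# → ['A good man in Africa (Boyd, William, 1952-, 1981)']
```

A pop-up window like the gadget's would render exactly this kind of summary.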

We can use this information to pop up a window with more details about a work when we hover over its ISBN on a Wikipedia page. If my user templates have the bibliographica gadget, every time I open a wiki page the script will ask our database about all the ISBNs on that page.
If something is found, it will render a frame around the ISBN numbers:

And if I hover over them, I see a window with information about the book:
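The ISBN-scanning step can be sketched as follows (illustrative Python only; the actual gadget is JavaScript running in the browser, and it relies on Wikipedia's ISBN markup rather than this assumed regex):

```python
import re

# Hedged sketch: find ISBN-10/13 candidates in page text, the way the gadget
# might before querying bibliographica.
ISBN_RE = re.compile(r"\bISBN\s*:?\s*((?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx])")

def find_isbns(text):
    """Return normalised ISBN strings (hyphens/spaces stripped)."""
    return [re.sub(r"[- ]", "", m) for m in ISBN_RE.findall(text)]

sample = "See Boyd's novel (ISBN 0-241-10516-1) and the reissue ISBN 978-0-14-118285-6."
print(find_isbns(sample))
# → ['0241105161', '9780141182856']
```

Each normalised ISBN would then be appended to the /isbn/ endpoint URL to fetch the record.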

Get the widget

So, if you want this widget, first you need to create an account on Wikipedia, and then change your default template to add the JavaScript snippet. Once you do this (instructions here) you will be able to get the information available in Bibliographica about the books.

Next steps

For now, the interaction goes in just one direction. Later on, we will be able to feed that information back to Bibliographica.

OpenBiblio workshop report
http://openbiblio.net/2011/05/09/openbiblio-workshop-report/
Mon, 09 May 2011

#openbiblio #jiscopenbib

The OpenBiblio workshop took place on 6th May 2011 at the London Knowledge Lab.

Participants

  • Peter Murray-Rust (Open Bibliography project, University of Cambridge, IUCr)
  • Mark MacGillivray (Open Bibliography project, University of Edinburgh, OKF, Cottage Labs)
  • William Waites (Open Bibliography project, University of Edinburgh, OKF)
  • Ben O’Steen (Open Bibliography project, Cottage Labs)
  • Alex Dutton (Open Citation project, University of Oxford)
  • Owen Stephens (Open Bibliographic Data guide project, Open University)
  • Neil Wilson (British Library)
  • Richard Jones (Cottage Labs)
  • David Flanders (JISC)
  • Jim Pitman (Bibserver project, UCB) (remote)
  • Adrian Pohl (OKF bibliographic working group) (remote)

During the workshop we covered some key areas where we have seen some success already in the project, and discussed how we could continue further.

Open bibliographic data formats

In order to ensure successful sharing of bibliographic data, we require agreement on a suitable yet simple format via which to disseminate records. Whilst representing linked data is valuable, it also adds complexity; however, simplicity is key for ensuring uptake and for enabling easy front end system development.

Whilst data is available as RDF/XML, JSON is now a very popular format for data transfer, particularly where front end systems are concerned. We considered various JSON linked data formats, and have implemented two for further evaluation. In order to make sure this development work is as widely applicable as possible, we wrote parsers and serialisers for JSON-LD and RDF/JSON as plugins for the popular RDFlib.

The RDF/JSON format is, of course, RDF; therefore, it requires no further change to enable it to handle our data, and our RDF/JSON parser and serialiser are already complete. However, it is not very JSON-like, as data takes the subject(predicate(object)) form rather than the general key:value form. This is where JSON-LD can improve the situation – it provides for listing information in a more key:value-like format, making it easier for front end developers not interested in the RDF relations to utilise. But this leads to additional complexity in the spec and parsing requirements, so we have some further work to complete:
* remove angle brackets from blank nodes
* use type coercion to move types out of the main code
* use language coercion to omit languages
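To illustrate the difference described above, here is a toy Python sketch (not our actual parser code) flattening RDF/JSON's subject(predicate(object)) shape into the flatter key:value shape that JSON-LD aims for; prefix handling and @context details are deliberately omitted:

```python
# The example record from bibliographica, in (simplified) RDF/JSON shape:
# subject URI -> predicate URI -> list of object descriptions.
rdf_json = {
    "http://bnb.bibliographica.org/entry/GB8102507": {
        "http://purl.org/dc/terms/title": [
            {"type": "literal", "value": "A good man in Africa"}
        ],
        "http://purl.org/dc/terms/issued": [
            {"type": "literal", "value": "1981-01-01T00:00:00Z"}
        ],
    }
}

def flatten(doc):
    """Collapse one-subject RDF/JSON into a flat dict keyed by the last path
    segment of each predicate URI (a deliberate simplification)."""
    (subject, preds), = doc.items()
    flat = {"@id": subject}
    for pred, objs in preds.items():
        key = pred.rsplit("/", 1)[-1]
        values = [o["value"] for o in objs]
        flat[key] = values[0] if len(values) == 1 else values
    return flat

print(flatten(rdf_json))
```

The flat form is what a front end developer uninterested in the RDF relations would want to work with.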

Our code is currently available in our repository, and we will request that our parsers and serialisers be added to RDFlib or to RDFextras once they are complete (they are still in development at present).

To further assist in representing bibliographic information in JSON, we also intend to implement BibJSON within JSON-LD; this should provide the necessary linked data functionality where required via JSON-LD support, whilst also enabling simpler representation of bibliographic data via key:value pairs where that is all that is needed.
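As a hedged illustration (field names follow the draft BibJSON conventions; the exact record shape, and how it would be wired into a JSON-LD @context, are assumptions on our part), the example book above might look like this in BibJSON:

```json
{
  "title": "A good man in Africa",
  "author": [ {"name": "Boyd, William, 1952-"} ],
  "publisher": {"name": "Hamilton"},
  "year": "1981",
  "identifier": [ {"type": "isbn", "id": "0241105161"} ]
}
```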

By making these options available to our users, we will be able to gauge the most popular representation format.

Regardless of the format used, a critical consideration is that of stable references to data. Without these, maintaining datasets will be very hard. To date the British Library data, for example, does not have suitable identifiers. However, the BL are moving forward with applying identifiers and will be issuing a new version of their dataset soon, which we will take as a new starting point. We have provided a list of records that we have identified as non-unique, and in turn the BL will share the tools they use to manage and convert data where possible, to enable better community collaboration.

Getting more open datasets

We are building on the success of the BL data release by continuing work on our CUL and IUCr data, and also by getting more datasets. The latest is the Medline dataset; there were some initial issues with properly identifying this dataset, so to help we have a previous blog post and links to further information: the Medline DTD and the specifications of the PubMed data elements.

The Medline dataset

We are very excited to have the Medline dataset; we are currently working on cleaning it so that we can provide access to all the non-copyrightable material it contains, which should represent a listing of about 98% of all articles indexed in PubMed.

The Medline dataset comes as a package of approximately 653 XML files, chronologically listing records by the date each record was created. This also means that further updates will be trackable, as they will append to the current dataset. We have found that most records contain useful non-copyrightable bibliographic metadata such as author, title, journal and PubMed record ID, and that some contain further metadata, such as citations, which we will remove. Once this is done, and we have checked that there are unique IDs (e.g. that the PubMed IDs are unique), we will make the raw CC0 collection available, then attempt to load it into our Bibliographica instance. We will then also be able to generate visualisations over our total dataset, which we hope will be approaching 30 million records by the end of the JISC Open Bibliography project.
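A rough Python sketch of that cleaning step follows. Element names (MedlineCitation, PMID, ArticleTitle, and so on) follow the MEDLINE/PubMed DTD, but the sample record below is invented for illustration and the extraction logic is a simplification of what we actually run:

```python
import xml.etree.ElementTree as ET

# Invented, DTD-shaped toy record -- not real MEDLINE data.
SAMPLE = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>100001</PMID>
    <Article>
      <Journal><Title>Journal of Examples</Title></Journal>
      <ArticleTitle>An example article</ArticleTitle>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""

def extract_records(xml_text):
    """Yield (pmid, title, journal) per citation, ignoring everything else
    (e.g. any citation/reference data we would strip as non-open)."""
    root = ET.fromstring(xml_text)
    for cit in root.iter("MedlineCitation"):
        yield (cit.findtext("PMID"),
               cit.findtext("Article/ArticleTitle"),
               cit.findtext("Article/Journal/Title"))

def check_unique_pmids(records):
    """The uniqueness check mentioned above: no PMID may repeat."""
    pmids = [r[0] for r in records]
    return len(pmids) == len(set(pmids))

recs = list(extract_records(SAMPLE))
print(recs)
print(check_unique_pmids(recs))
```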

Displaying bibliographic records

Whilst Bibliographica allows for display of individual bibliographic records and enables building collections of such records, it does not yet provide a means of neatly displaying lists of bibliographic records. We have partnered with Jim Pitman of UC Berkeley to develop his BibServer to fit this requirement, and also to bring further functionality such as search and faceted browse. This also provides a development direction for the output of the project beyond the July end date of the JISC Open Bibliography project.

Searching bibliographic records

Given the collaboration between Bibliographica and BibServer on collection and display of bibliographic records, we are also considering ways to enable search across non-copyrightable bibliographic metadata relating to any published article. We believe this may be achievable by building a collection of DOIs with relevant metadata, and enabling crowdsourcing of updates and comments.

This effort is separate from the main development of the projects; however, it would make a very good addition both to the functionality of the developed software and to the community. It would also tie in with any future functionality enabling author identification and information retrieval, such as ORCID, and allow us to build on the work done at sites such as BIBKN.

Disambiguation without deduplication

There have been a number of experiments recently highlighting the fact that a simple Lucene search index over datasets tends to give better matches than more complex methods of identifying duplicates. Ben O’Steen and Alex Dutton both provided examples of this from their work with the Open Citation project.

This is also supported by a recent paper from Jeff Bilder entitled “Disambiguation without Deduplication” (not publicly available). The main point here is that instead of deduplicating objects we can simply do machine disambiguation and make sameAs assertions between multiple objects; this would enable changes to still be applied to different versions of an object by disparate groups (e.g. where each group has a different spelling or identifier for some key part of the record) whilst still maintaining a relationship between the objects. We could build on this sort of functionality by applying expertise from the library community if necessary, although deduplication/merging should only be contemplated if a new dataset is being formed which some agent takes responsibility to curate. If not, it is better simply to cluster the data via sameAs assertions, and to keep track of who is making those assertions in order to assess their reliability.
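The clustering-by-assertion idea can be sketched with a small union-find structure (a hypothetical illustration, not project code): records are never merged, each sameAs assertion is stored with its assertor, and clusters are derived from the assertions.

```python
class SameAsStore:
    """Keep every record; record sameAs assertions with provenance; derive
    clusters from the assertions instead of merging records."""

    def __init__(self):
        self.parent = {}      # union-find parent pointers
        self.assertions = []  # (a, b, asserted_by) -- provenance kept

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def assert_same(self, a, b, asserted_by):
        self.assertions.append((a, b, asserted_by))
        self.parent[self._find(a)] = self._find(b)

    def cluster_of(self, x):
        root = self._find(x)
        return {k for k in self.parent if self._find(k) == root}

store = SameAsStore()
store.assert_same("bl:123", "cul:abc", asserted_by="alice")
store.assert_same("cul:abc", "medline:999", asserted_by="bob")
print(store.cluster_of("bl:123"))
```

Because the assertions themselves are retained, an unreliable assertor's claims can later be discounted without touching the underlying records.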

We suggest a concept for increasing collaboration on this sort of work: a ReCaptcha of identities. Upon login, perhaps to Bibliographica or another relevant system, a user could be presented with two questions, one to which we know the answer, the other a request to match identical objects. This, in combination with decent open source tools for bibliographic data management (building on tools such as Google Refine and Needlebase), would allow for simple verifiable disambiguation across large datasets.

Sustaining open bibliographic data

Having had success in getting open bibliographic datasets and prototyping their availability, we must consider how to maintain long term open access. There are three key issues:

Continuing community engagement

We must continue to work with the community, and to provide explanatory information to those needing to make decisions about bibliographic data, such as the OpenBiblio Principles and the Open Bibliographic Data guide. We must also ensure we improve resource discovery by supporting the requirement for generating collections and searching content.

Additionally, quality bibliographic data should be hosted at some key sites – there are a variety of options such as Freebase, CKAN and bibliographica – but we must also ensure that community effort can be crowdsourced both for managing records within these central options and for providing access to smaller distributed nodes, where data can be owned and maintained at the local level whilst being discoverable globally.

Maintaining datasets

Dataset maintenance is critical to ongoing success – stale data is of little use to people, and disregard for content maintenance will put off new users. We must co-ordinate with source providers such as the BL by accepting changesets from them and incorporating them into other versions. This is already possible with the Medline data, for example, and will very soon be the case with BL updates too. We should advocate for this method of dataset updates during any future open data negotiations. This will allow us to keep our datasets fresh and relevant, and to properly represent growing datasets.

We must continue to promote open access to non-copyrightable datasets, and ensure that there is a location for open data providers to easily make their raw datasets available – such as CKAN.

We will ensure that all the software we have developed during the course of the project – and in future – will remain open source and publicly available, so that it will be possible for anyone to perform the transforms and services that we can perform.

Community involvement with dataset maintenance

We should support community members who wish to take responsibility for overseeing the updating of datasets. Such people are critical for long term sustainability, but hard to find. They need to be recruited and provided with simple tools which will empower them to easily maintain and share the datasets they care about with a minimal time commitment. Thus we must make sure that our software and tools are not only open source, but usable by non-team members.

We will work on developing tools such as ReCaptcha for disambiguation, and on building game / rank table functionality for those wishing to participate in entity disambiguation (in addition to machine disambiguation).

Critical mass

We hope that by providing almost 30 million records to the community under CC0 license, and with the support of all the providers that made this possible, we will achieve a critical mass of data, and an exemplar for future open access to such data.

This should provide the go-to list of such information, and inspire others to contribute and maintain. However, such community assistance will only continue for as long as there appears to be reasonable maintenance of the corpus and software we have already developed – if this slips into disrepair, community engagement is far less likely.

Maintaining services

The bibliographica service that we currently operate already requires significant hardware. Once we add in the Medline data, we will require very large indexes, demanding a great deal of RAM and fast disks. There is therefore a long term maintenance requirement implicit in running any such central service of open bibliographic data on this scale.

We will present a case for ongoing funding requirements and seek sources for financial support both for technical maintenance and for ongoing software maintenance and community engagement.

Business cases

In order to ensure future engagement with groups and business entities, we must provide clear examples of the benefits of open bibliographic data. We have already done some work on visualising the underlying data, which we will develop further for higher impact. We will identify key figures in the data that we can feed into such representations to act as exemplars. Additionally, we will continue to develop mashups using the datasets, to show the serendipitous benefit that increases exposure but is only possible with unambiguously open access to useful data.

Events and announcements

We will continue to promote our work and the efforts of our partners, and advocate further for open bibliography, by publicising our successes so far. We will co-ordinate this with JISC, the BL, the OKF and other interested groups, to ensure the impact of announcements by all groups is enhanced.

We will present our work at further events throughout the year, such as attendance and sessions at OKCon, OR11 and other conferences, and by arranging further hackdays.

Some obvious URI patterns for a service?
http://openbiblio.net/2010/10/26/some-obvious-uri-patterns-for-a-service/
Tue, 26 Oct 2010

Whilst the technical issues and backends may vary, I think there are one or two URI patterns that could be adopted. It’s not REST, but I hope it is a sensible structure. (This is not to replace voID, but to accompany a voID description and other characterisation methods.)

http://host/#catalog – URI for the catalog dataset

http://host/void
302 – conneg response to a voID description at .ttl, .rdf (xml), etc

http://host/describe/{uri} –
200 – responds with a conneg’d graph with the information a store ‘knows’ about a given URI. The HTML representation would likely be viewed as a ‘record’ page, insofar as this is valid for the item. (uses Content-Location: http://host/describe/{uri}/ttl etc rather than 302, due to load and network traffic cost.)
404 – doesn’t know about this uri
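A minimal Python sketch of the negotiation logic proposed for /describe/{uri} (the media types, paths and status handling here are illustrative assumptions, not a spec): answer 200 with a Content-Location header rather than a 302, to save a round trip, and 404 for unknown URIs.

```python
# Serialisations the store can offer, in preference order (illustrative).
FORMATS = [
    ("text/turtle", "ttl"),
    ("application/rdf+xml", "rdf"),
    ("text/html", "html"),
]

def negotiate(accept_header, uri, known_uris):
    """Return (status, headers) for GET /describe/{uri}."""
    if uri not in known_uris:
        return 404, {}
    accepted = [a.split(";")[0].strip() for a in accept_header.split(",")]
    for media_type, ext in FORMATS:
        if media_type in accepted or "*/*" in accepted:
            # 200 + Content-Location instead of a 302 redirect.
            return 200, {
                "Content-Type": media_type,
                "Content-Location": f"/describe/{uri}/{ext}",
            }
    return 406, {}

known = {"http://example.org/thing"}
print(negotiate("text/turtle", "http://example.org/thing", known))
```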

http://host/types
200 – voID-like response based on the canned query ‘SELECT DISTINCT ?x WHERE {?foo rdf:type ?x}’ BUT with the addition of some lowest common denominator types. Can be easily cached. Filtering out the least important types is at the discretion of the service – this is not intended to be a complete set, but to publish the set of types that this service cares most about. Best shown by example (note that some predicates need to be minted/swapped for suitable ones, shown by *):

<http://host/#catalog>  a void:Dataset ;
    *containsType* <myOnto:Researcher> ;
    *containsType* <myOnto:foo> ;
    etc...
    void:uriLookupEndpoint <http://host/describe/> ;
    etc...

<myOnto:Researcher> <owl:subClassOf> <foaf:Person> .
<myOnto:foo> <owl:subClassOf> <bibo:Article> .

Thoughts?

JISC OpenBibliography: IPR statement
http://openbiblio.net/2010/07/15/jisc-openbibliography-ipr-statement/
Thu, 15 Jul 2010

All sourced data will fall under a license compatible with the criteria laid out at http://www.opendefinition.org/ – ensuring that the data created and hosted by this project is fully reusable, both by the community that JISC seeks to support and by the wider community.

Project documentation will be published under a CC-BY attribution license, project data created by the team will be published under the PDDL and the source code created for the project will be published under the BSD license.

Organisational terms:

The OKF uses Open Definition compliant licenses for its content and data: for example, CC-BY for content, and for data the Open Data Commons Public Domain Dedication and License (PDDL) or the Open Data Commons Attribution License.

The University of Cambridge asserts its rights to IP created by employees in the course of their employment.

All software is distributed under the Artistic Licence (BSD style).

The IUCr uses CC-BY for its Open Access material and will use the services of the OKF to advise on the best ways of Opening data and services.

]]>
http://openbiblio.net/2010/07/15/jisc-openbibliography-ipr-statement/feed/ 0
JISC OpenBibliography: Risk Analysis and Success Plan
http://openbiblio.net/2010/07/15/jisc-openbibliography-risk-analysis-and-success-plan/
Thu, 15 Jul 2010

Key Risk:

Collections are unavailable or intractable:

This was quoted as one of the key risks in the project plan. However, from initial conversations with publishers and other sources, the likelihood of the project having too little data to work with is rapidly diminishing.

Success Plan:

Success: The initial search, query and other compute-intensive services become over-subscribed from real demand.

Managed by: The service is hosted on Amazon EC2 and is designed to be scalable. If there is money left in the budget, the service could be transferred to a more heavy-duty VM. Otherwise, part of the design is that anyone can set up and run the service, as all the tools and data are open, so we could recommend that heavy users run a local mirror instance.

Success: Bibliographic metadata from this project begins to be used in production library management systems.

Managed by: Whilst we cannot affect the cataloguing processes by which records are entered into a given institution’s system, we maintain URLs and provenance for all the records we provide. This enables systems that reuse the data to track and show the provenance of a given record, if they maintain a link to the source. We would also recommend that institutions or organisations reusing the data state openly that they do so, thereby increasing the profile of the project and of JISC, its funder.

Risk Assessment:

| Risk | Probability (1-5) | Severity (1-5) | Score (P x S) | Action to Prevent/Manage Risk |
|---|---|---|---|---|
| **Staffing** | | | | |
| Staff retention | 3 | 5 | 15 | Ensure staff are satisfied and challenged and have the chance to give feedback by means of regular one-to-ones. Apply open management to ensure sharing of expertise, thus enabling cover. |
| Key academic staff leave | 1 | 2 | 2 | There is sufficient in-depth coverage from expertise available in the university; recruit replacement. |
| **Technical** | | | | |
| Technical problems | 1 | 5 | 5 | Similar problems already solved; well-known experts on team. |
| Difficulty in integrating tools in services and workflows | 1 | 4 | 4 | Use iterative development so as to deliver at least a partial solution as opposed to nothing at all. |
| OKF service not supplied | 2 | 4 | 8 | Move to other available platforms such as Talis Connected Commons, 4store, Sesame. |
| Hardware failure resulting in loss of data | 2 | 4 | 8 | Use standard approaches to data and service backup, including automated backup and off-site replication. |
| **External suppliers** | | | | |
| Collections are unavailable or intractable | 2 | 5 | 10 | For catalogues, use other open offerings (several are available; many providers are members of the OKF’s working group on bibliographic information). |
| Open Citations is not funded | 1 | 1 | 1 | Work with other citation experts. |
| **Legal** | | | | |
| Data protection infringement | 1 | 5 | 5 | Close consultation with University legal services such as UMIP; establish clear project staff guidelines w.r.t. commercial partners. |
JISC OpenBibliography: Wider Benefits to Sector & Achievements for Host Institution
http://openbiblio.net/2010/07/15/jisc-openbibliography-wider-benefits-to-sector-achievements-for-host-institution/
Thu, 15 Jul 2010
  • Bibliographic data is useful: a number of organisations such as CERN and the Library of Congress have recognised that providing open access to bibliographic records and controlled vocabularies is a natural and necessary step towards identifying errors and avoiding erroneous or divergent duplication, thereby improving metadata accuracy. A key point from Karen Coyle is: “The change that libraries will need to make in response [to user demand] must include the transformation of the library’s public catalog from a stand-alone database of bibliographic records to a highly hyperlinked data set that can interact with information resources on the World Wide Web.”
  • Bibliographic data is, in general, not open or linked: this limits its usefulness to the academic community. This project will deliver bibliographic material that is truly open (as defined at http://opendefinition.org, where the team has particular expertise). Many attempts to create LOD suffer because there are no useful resources to link to. OpenBibliography will expose author names, institutions and geographical locations with semantic targets in the LOD ecosystem (e.g. Geonames, Wikipedia); the project will put significant effort into disambiguation so that OpenBibliography can become an important node in the LOD graph.
  • Processes to make it open or linked are not familiar to libraries and publishers: much modern bibliographic data is created implicitly or explicitly by the scholarly publication process but exposed poorly or not at all. Working with cooperating publishers can rapidly transform their output into complete open semantic bibliography. By providing a clear working model for bibliographic metadata as semantic, referenceable links, with a reusable workflow to gather, add provenance to, refine and disambiguate existing metadata, members of the JISC community can apply the same model and techniques with the open-source code and services we will provide, using data from and contributing to the aforementioned ‘highly hyperlinked data set’.
JISC OpenBibliography: Aims, Objectives and Final Outputs
http://openbiblio.net/2010/07/15/jisc-openbiblography-aims-objectives-and-final-outputs/
Thu, 15 Jul 2010

This project will publish a substantial corpus of bibliographic metadata as Linked Open Data, using existing semantic web tools, standards (RDF, SPARQL), linked data patterns and accepted open ontologies (FOAF, BIBO, DC, etc.).

The data will be from two distinct sources: traditional library catalogues (Cambridge University Library and the British Library) and ToCs from a scientific publisher, the IUCr. None of this material is currently available as LOD; furthermore, the outputs can be guaranteed to be open (unlike many existing data efforts, linked or otherwise).

Key strategies are:

  • transformation of current publishers’ models to create open bibliography as part of their future business, and
  • the immediate and continuing engagement of the scholarly community.

Deliverables include a maintained and growing bibliography on the IUCr site and engagements with other like-minded publishers such as PLoS, as well as the code for the software used to create the Linked Data versions of the aforementioned sources.
