Open Bibliography and Open Bibliographic Data | Open Bibliographic Data Working Group of the Open Knowledge Foundation

OpenBiblio workshop report

Posted on May 9, 2011 by Mark MacGillivray

#openbiblio #jiscopenbib

The OpenBiblio workshop took place on 6th May 2011, at the London Knowledge Lab.

Participants

Peter Murray-Rust (Open Bibliography project, University of Cambridge, IUCr)
Mark MacGillivray (Open Bibliography project, University of Edinburgh, OKF, Cottage Labs)
William Waites (Open Bibliography project, University of Edinburgh, OKF)
Ben O’Steen (Open Bibliography project, Cottage Labs)
Alex Dutton (Open Citation project, University of Oxford)
Owen Stephens (Open Bibliographic Data guide project, Open University)
Neil Wilson (British Library)
Richard Jones (Cottage Labs)
David Flanders (JISC)
Jim Pitman (Bibserver project, UCB) (remote)
Adrian Pohl (OKF bibliographic working group) (remote)

During the workshop we covered some key areas where we have seen some success already in the project, and discussed how we could continue further.

Open bibliographic data formats

In order to ensure successful sharing of bibliographic data, we require agreement on a suitable yet simple format via which to disseminate records. Whilst representing linked data is valuable, it also adds complexity; however, simplicity is key for ensuring uptake and for enabling easy front end system development.

Whilst data is available as RDF/XML, JSON is now a very popular format for data transfer, particularly where front end systems are concerned. We considered various JSON linked data formats, and have implemented two for further evaluation. In order to make sure this development work is as widely applicable as possible, we wrote parsers and serialisers for JSON-LD and RDF/JSON as plugins for the popular RDFlib.

The RDF/JSON format is, of course, RDF; therefore, it requires no further change to enable it to handle our data, and our RDF/JSON parser and serialiser are already complete. However, it is not very JSON-like, as data takes the subject(predicate(object)) form rather than the general key:value form. This is where JSON-LD can improve the situation – it provides for listing information in a more key:value-like format, making it easier for front end developers not interested in the RDF relations to utilise. But this leads to additional complexity in the spec and parsing requirements, so we have some further work to complete:
* remove angle brackets from blank nodes
* use type coercion to move types out of main code
* use language coercion to omit languages
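
To make the contrast between the two shapes concrete, here is a minimal Python sketch. It assumes the RDF/JSON layout of subject, then predicate, then a list of object dictionaries carrying `value` and `type` keys. The record URI and the values are invented for illustration, and the `flatten` helper is hypothetical rather than part of our RDFlib plugins.

```python
import json

# Illustrative record only: the subject URI and the literal values are made up.
rdf_json = {
    "http://example.org/record/1": {
        "http://purl.org/dc/terms/title": [
            {"value": "An example title", "type": "literal", "lang": "en"}
        ],
        "http://purl.org/dc/terms/creator": [
            {"value": "http://example.org/person/1", "type": "uri"}
        ],
    }
}


def flatten(graph, subject):
    """Reduce one subject to plain key:value pairs, dropping RDF detail.

    Hypothetical helper: keeps only the last path segment of each predicate
    URI as the key and discards type, datatype and language information.
    """
    flat = {}
    for predicate, objects in graph[subject].items():
        key = predicate.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]
        values = [obj["value"] for obj in objects]
        flat[key] = values[0] if len(values) == 1 else values
    return flat


record = flatten(rdf_json, "http://example.org/record/1")
print(json.dumps(record, indent=2))
```

The flattened form is what a front end developer would probably prefer to consume, but it can no longer tell a URI from a literal or recover the language tag, which is the trade-off discussed above.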

Our code is currently available in our repository, and we will request that our parsers and serialisers get added to RDFlib or to RDFextras once they are complete (they are still in development at present).

To further assist in representing bibliographic information in JSON, we also intend to implement BibJSON within JSON-LD; this should provide linked data functionality where necessary via JSON-LD support, whilst also enabling simpler representation of bibliographic data via key:value pairs where that is all that is required.

By making these options available to our users, we will be able to gauge the most popular representation format.

Regardless of the format used, a critical consideration is that of stable references to data. Without these, maintaining datasets will be very hard. To date, the British Library data, for example, does not have suitable identifiers. However, the BL are moving forward with applying identifiers and will be issuing a new version of their dataset soon, which we will take as a new starting point. We have provided a list of records that we have identified as non-unique, and in turn the BL will share the tools they use to manage and convert data where possible, to enable better community collaboration.

Getting more open datasets

We are building on the success of the BL data release by continuing work on our CUL and IUCr data, and also by getting more datasets. The latest is the Medline dataset. There were some initial issues with properly identifying this dataset, so to help we have a previous blog post and links to further information, the Medline DTD and the specifications of the PubMed data elements.

The Medline dataset

We are very excited to have the Medline dataset; we are currently working on cleaning it so that we can provide access to all the non-copyrightable material it contains, which should represent a listing of about 98% of all articles published in PubMed.

The Medline dataset comes as a package of approximately 653 XML files, chronologically listing records in terms of the date the record was created. This also means that further updates will be trackable as they will append to the current dataset. We have found that most records contain useful non-copyrightable bibliographic metadata such as author, title, journal, PubMed record ID, and that some contain further metadata such as citations, which we will remove. Once this is done, and we have checked that there are unique IDs (e.g. that the PubMed IDs are unique) we will make the raw CC0 collection available, then attempt to get it into our Bibliographica instance. We will then also be able to generate visualisations on our total dataset, which we hope will be approaching 30 million records by the end of the JISC Open Bibliography project.
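
The cleaning step described above could be sketched roughly as follows in Python. This is not our production code: the element names (`MedlineCitation`, `PMID`, `ArticleTitle` and so on) reflect our reading of the Medline DTD and should be checked against it, the two sample records are invented, and `core_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for one Medline file. Element names follow our reading of the
# Medline DTD and should be checked against it; the records are invented.
SAMPLE = """
<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <Journal><Title>Journal A</Title></Journal>
      <ArticleTitle>First example article</ArticleTitle>
      <Abstract><AbstractText>Dropped: may be copyrightable.</AbstractText></Abstract>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>Ann</ForeName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <Journal><Title>Journal B</Title></Journal>
      <ArticleTitle>Second example article</ArticleTitle>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>
"""


def core_record(citation):
    """Keep only the core bibliographic fields of one citation element."""
    authors = [
        " ".join(filter(None, [a.findtext("ForeName"), a.findtext("LastName")]))
        for a in citation.findall("Article/AuthorList/Author")
    ]
    return {
        "pmid": citation.findtext("PMID"),
        "title": citation.findtext("Article/ArticleTitle"),
        "journal": citation.findtext("Article/Journal/Title"),
        "authors": authors,
    }


root = ET.fromstring(SAMPLE)
records = [core_record(c) for c in root.iter("MedlineCitation")]

# Report any PubMed ID that occurs more than once.
counts = Counter(r["pmid"] for r in records)
duplicates = sorted(pmid for pmid, n in counts.items() if n > 1)
print(records)
print("duplicate PMIDs:", duplicates)
```

Because only named fields are copied, anything not listed (abstracts, citations) is left behind by construction rather than being removed afterwards.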

Displaying bibliographic records

Whilst Bibliographica allows for display of individual bibliographic records and enables building collections of such records, it does not yet provide a means of neatly displaying lists of bibliographic records. We have partnered with Jim Pitman of the University of California, Berkeley, to develop his BibServer to fit this requirement, and also to bring further functionality such as search and faceted browse. This also provides further development direction for the output of the project beyond the July end date of the JISC Open Bibliography project.

Searching bibliographic records

Given the collaboration between Bibliographica and BibServer on collection and display of bibliographic records, we are also considering ways to enable search across non-copyrightable bibliographic metadata relating to any published article. We believe this may be achievable by building a collection of DOIs with relevant metadata, and enabling crowdsourcing of updates and comments.

This effort is separate from the main development of the projects; however, it would make a very good addition both to the functionality of the developed software and to the community. It would also tie in with any future functionality that enables author identification and information retrieval, such as ORCID, and would allow us to build on the work done at sites such as BIBKN.

Disambiguation without deduplication

There have been a number of experiments recently highlighting the fact that a simple Lucene search index over datasets tends to give better matches than more complex methods of identifying duplicates. Ben O’Steen and Alex Dutton both provided examples of this, from their work with the Open Citation project.

This is also supported by a recent paper from Jeff Bilder entitled “Disambiguation without Deduplication” (not publicly available). The main point here is that instead of deduplicating objects we can simply do machine disambiguation and make sameAs assertions between multiple objects; this would enable changes to still be applied to different versions of an object by disparate groups (e.g. where each group has a different spelling or identifier, perhaps, for some key part of the record) whilst still maintaining a relationship between the two objects. We could build on this sort of functionality by applying expertise from the library community if necessary, although deduplication/merging should only be contemplated if a new dataset is being formed which some agent is taking responsibility to curate. If not, it is better to just cluster the data by sameAs assertions, and to keep track of who is making those assertions, in order to assess their reliability.
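
As a rough illustration of clustering by sameAs assertions without merging anything, here is a small Python sketch. The union-find approach is our own choice for the example, not something taken from the paper, and all identifiers and asserter names are invented.

```python
from collections import defaultdict

# Each assertion: (object_a, object_b, asserted_by). All identifiers are
# invented for illustration.
assertions = [
    ("bl:123", "cul:abc", "alice"),
    ("cul:abc", "medline:42", "bob"),
    ("bl:999", "cul:xyz", "alice"),
]

parent = {}


def find(item):
    """Return the representative of the cluster containing item."""
    parent.setdefault(item, item)
    while parent[item] != item:
        parent[item] = parent[parent[item]]  # path halving
        item = parent[item]
    return item


def union(a, b):
    """Record that a and b belong to the same cluster."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


for a, b, _who in assertions:
    union(a, b)

# Group the untouched objects by cluster; nothing is merged or deleted.
clusters = defaultdict(set)
for item in list(parent):
    clusters[find(item)].add(item)

# Keep track of who made which assertions, to assess reliability later.
asserted_by = defaultdict(list)
for a, b, who in assertions:
    asserted_by[who].append((a, b))

print(sorted(sorted(c) for c in clusters.values()))
print(dict(asserted_by))
```

Each source record stays intact and editable by its own maintainers; only the cluster membership and the provenance of each assertion are stored alongside.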

We suggest a concept for increasing collaboration on this sort of work – a ReCaptcha of identities. Upon login, perhaps to a Bibliographica or another relevant system, a user could be presented with two questions, one of which we know the answer to, and the other being a request to match identical objects. This, in combination with decent open source software tools enabling bibliographic data management (building on tools such as Google Refine and Needlebase), would allow for simple verifiable disambiguation across large datasets.
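
A minimal sketch of how such a "ReCaptcha of identities" might count answers is given below. Everything in it is an assumption made for illustration: the function names, the identifiers, and the rule that two agreeing trusted votes are enough.

```python
from collections import Counter

# Pairs whose answer is already known (True means "same object").
# All identifiers are invented for illustration.
KNOWN = {("bl:123", "cul:abc"): True, ("bl:123", "cul:xyz"): False}

votes = Counter()  # (pair, answer) -> number of trusted votes


def submit(control_pair, control_answer, unknown_pair, unknown_answer):
    """Count the answer to the unknown pair only if the control is right."""
    if KNOWN[control_pair] != control_answer:
        return False
    votes[(unknown_pair, unknown_answer)] += 1
    return True


def consensus(pair, minimum=2):
    """Return the agreed answer once enough trusted votes agree, else None."""
    yes, no = votes[(pair, True)], votes[(pair, False)]
    if max(yes, no) >= minimum and yes != no:
        return yes > no
    return None


unknown = ("medline:42", "cul:abc")
submit(("bl:123", "cul:abc"), True, unknown, True)
submit(("bl:123", "cul:xyz"), True, unknown, False)  # fails control, ignored
submit(("bl:123", "cul:abc"), True, unknown, True)
print(consensus(unknown))
```

The control question filters out careless or malicious answers, so the votes that remain can be treated as reasonably reliable input to the sameAs assertions.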

Sustaining open bibliographic data

Having had success in getting open bibliographic datasets and prototyping their availability, we must consider how to maintain long term open access. There are three key issues:

Continuing community engagement

We must continue to work with the community, and to provide explanatory information to those needing to make decisions about bibliographic data, such as the OpenBiblio Principles and the Open Bibliographic Data guide. We must also ensure we improve resource discovery by supporting the requirement for generating collections and searching content.

Additionally, quality bibliographic data should be hosted at some key sites – there are a variety of options such as Freebase, CKAN, bibliographica – but we must also ensure that community members can be crowdsourced both for managing records within these central options and also for providing access to smaller distributed nodes, where data can be owned and maintained at the local level whilst being discoverable globally.

Maintaining datasets

Dataset maintenance is critical to ongoing success – stale data is of little use to people, and disregard for content maintenance will put off new users. We must co-ordinate with source providers such as the BL by accepting changesets from them and incorporating them into other versions. This is already possible with the Medline data, for example, and will very soon be the case with BL updates too. We should advocate for this method of dataset updates during any future open data negotiations. This will allow us to keep our datasets fresh and relevant, and to properly represent growing datasets.
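
To show why changesets depend on stable identifiers, here is a small Python sketch of applying one. The changeset layout (a list of `add`, `update` and `delete` operations) is hypothetical and is not the format used by the BL or Medline; the records are invented.

```python
# Records keyed by a stable identifier; all values are invented examples.
dataset = {
    "rec-1": {"title": "Old title"},
    "rec-2": {"title": "To be withdrawn"},
}

# Hypothetical changeset layout: a list of operations from a provider.
changeset = [
    {"op": "update", "id": "rec-1", "record": {"title": "Corrected title"}},
    {"op": "delete", "id": "rec-2"},
    {"op": "add", "id": "rec-3", "record": {"title": "New record"}},
]


def apply_changeset(records, changes):
    """Return a new dataset with the changes applied in order."""
    updated = dict(records)
    for change in changes:
        if change["op"] in ("add", "update"):
            updated[change["id"]] = change["record"]
        elif change["op"] == "delete":
            updated.pop(change["id"], None)
        else:
            raise ValueError(f"unknown operation: {change['op']}")
    return updated


fresh = apply_changeset(dataset, changeset)
print(fresh)
```

The original dataset is left untouched, so an earlier version remains available for comparison after each update.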

We must continue to promote open access to non-copyrightable datasets, and ensure that there is a location for open data providers to easily make their raw datasets available – such as CKAN.

We will ensure that all the software we have developed during the course of the project – and in future – will remain open source and publicly available, so that it will be possible for anyone to perform the transforms and services that we can perform.

Community involvement with dataset maintenance

We should support community members that wish to take responsibility for overseeing updating of datasets. This is critical for long term sustainability, but hard to find. These people need to be recruited and provided with simple tools which will empower them to easily maintain and share datasets they care about with a minimal time commitment. Thus we must make sure that our software and tools are not only open source, but usable by non-team members.

We will work on developing tools such as ReCaptcha for disambiguation, and on building game / rank table functionality for those wishing to participate in entity disambiguation (in addition to machine disambiguation).

Critical mass

We hope that by providing almost 30 million records to the community under a CC0 licence, and with the support of all the providers that made this possible, we will achieve a critical mass of data, and an exemplar for future open access to such data.

This should provide the go-to list of such information, and inspire others to contribute and maintain. However, such community assistance will only continue for as long as there appears to be reasonable maintenance of the corpus and software we have already developed – if this slips into disrepair, community engagement is far less likely.

Maintaining services

The bibliographica service that we currently run already requires significant hardware to run. Once we add in Medline data, we will require very large indexes, requiring a great deal of RAM and fast disks. There is therefore a long term maintenance requirement implicit in running any such central service of open bibliographic data on this scale.

We will present a case for ongoing funding requirements and seek sources for financial support both for technical maintenance and for ongoing software maintenance and community engagement.

Business cases

In order to ensure future engagement with groups and business entities, we must make clear examples of the benefits of open bibliographic data. We have already done some work on visualising the underlying data, which we will develop further for higher impact. We will identify key figures in the data that we can feed into such representations to act as exemplars. Additionally, we will continue to develop mashups using the datasets, to show the serendipitous benefit that increases exposure but is only possible with unambiguously open access to useful data.

Events and announcements

We will continue to promote our work and the efforts of our partners, and advocate further for open bibliography, by publicising our successes so far. We will co-ordinate this with JISC, BL, OKF and other interested groups, to ensure the impact of announcements by all groups is enhanced.

We will present our work at further events throughout the year, such as attendance and sessions at OKCon, OR11 and other conferences, and by arranging further hackdays.

Posted in BibServer, Data, event, JISC OpenBib, OKFN Openbiblio, Semantic Web | Tagged bibliographic, communityBenefits, inf11, jisc, jiscEXPO, jiscLMS, jiscopenbib, progress, progressPosts, rdf, WIN

Minutes: 11th Virtual Meeting of the OKFN Openbiblio Group

Posted on May 9, 2011 by Adrian Pohl

Date: May 3rd, 2011, 15:00 – 16:15 GMT

Channel: Meeting was held via Skype and Etherpad

Participants

Adrian Pohl
Peter Murray-Rust
Karen Coyle
Jim Pitman
Mark MacGillivray
late: Thomas Krichel

Update on bibliographica

Getting Open Bibliographic Data from PubMed Central (PMC), see http://openbiblio.net/2011/05/03/getting-open-bibliographic-data-from-pmc/
Thoughts about pushing changes back to British Library (BL) and other data providers.
A JSON serialization of RDF will be offered. Discussions around which to choose: RDF/JSON or JSON-LD. See http://openbiblio.net/2011/05/04/comparative-serialisation-of-rdf-in-json/.
Working on provenance information in bibliographica: the subdomain-approach for BL-data (bnb.bibliographica.org) resulted in more problems than benefits.
Underlying Triple Store is changed back from Virtuoso to 4store because of Virtuoso bugs.
In the end use cases of bibliographica were discussed:
- Long term strategy: bibliographica.org should be only one instantiation of the openbiblio software.
- Mark: Use cases differ widely. Providing open data to be used freely is the goal in the first place.
- Peter: Use bibliographica/openbiblio for open institutional bibliographies to provide a snapshot of an academic organization.
- Peter/Mark: Bibliographic data is key for the Research assessment exercises.
- Karen makes clear the different functions of different data sets, like institutional bibliography vs. library catalog and states that these distinctions should be documented in the triple store.
- Jim proposes a further use case over bibliographica data: the Institute of Mathematical Statistics (IMS) will open up a data set of scholarly authors and their publications. Use case: Linking people information to entries in bibliographica (as well as other sources like Open Library).

Update on BibServer

There’s current work on an integration of the BibServer with bibliographica to provide HTML views over bibliographica data like personal or departmental bibliography.
BibServer data in BibJSON format has been pulled into CouchDB with ElasticSearch building a search index over the data. Example query over Jim’s bibliography serving BibJSON: http://elastic.cottagelabs.com/bibjp/bibjp/_search?q=probability.
ACTION: discussion over details will be taken to openbiblio-dev list.
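
For anyone wanting to try other queries against the index, the following Python sketch builds the search URL in the same form as the example query above. It only constructs the URL and does not fetch it, since the demo service may not stay available; `search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base URL taken from the example query above; the service may no longer exist.
BASE = "http://elastic.cottagelabs.com/bibjp/bibjp/_search"


def search_url(term, size=None):
    """Build a URI-search URL; `q` and `size` are ElasticSearch query parameters."""
    params = {"q": term}
    if size is not None:
        params["size"] = size
    return f"{BASE}?{urlencode(params)}"


url = search_url("probability")
print(url)
```

Using `urlencode` means that search terms containing spaces or other special characters are escaped correctly.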

Talk at ORCID

As announced in the last meeting, James Griffin and Thomas gave a talk at the ORCID participants meeting. The draft slides can be viewed here (PPT).

Openbiblio Principles

An Italian translation of the principles is done and published.
ACTION: Karen will solicit Norwegian and Swedish translations
As documented in the last meeting, Primavera wanted to have the Korean version checked by a friend. –> ACTION: Adrian will query her again about it.

OKCon 2011 (Berlin)

A Bibliographic data workshop will be held at this year’s OKCon.
Peter and Adrian will be there.
We have to define the aims of this workshop.
ACTION: Write a description/proposal of the workshop

Action Collection

Take discussion of BibServer/bibliographica integration to openbiblio-dev list.
Karen: Solicit Norwegian and Swedish translations of the openbiblio principles.
Adrian: Query Primavera regarding the Korean translation of the principles.
Adrian/Peter: Write a description/proposal for the OKCon openbiblio workshop.

Posted in BibServer, minutes, OKFN Openbiblio

Follow-up to serialising RDF in JSON

Posted on May 5, 2011 by Mark MacGillivray

Following on from Richard’s post yesterday, we now have a JSON-LD serialiser for RDFlib. This is still a work in progress, and there may be things that it is serialising incorrectly. So, please give us feedback on this, and tell us where we have misinterpreted the structure.

Here you will find a sample JSON-LD output file, which was generated from this Bibliographica record.

The particular area of concern surrounds how the JSON-LD spec describes serialising disjoint graphs into JSON-LD (section 8.2). How does this differ from serialising joined graphs? We are presuming that our output file is an example of a joined graph, and that additional disjoint graphs would be added by appending additional @:[] sections.

Posted in BibServer, Data, JISC OpenBib, OKFN Openbiblio, Semantic Web | Tagged inf11, jisc, jiscEXPO, jiscopenbib, ontology, progress, rdf

Comparative Serialisation of RDF in JSON

Posted on May 4, 2011 by richardjones

This is a comparison of RDF-JSON and JSON-LD for serialising bibliographic RDF data. Given that we are also working
with BibServer we have taken a BibJSON document as our source data for
comparison. The objective was to both understand these two JSON
serialisations of RDF and also to look at the BibJSON profile to see how it
fits into such a framework.

Due to limitations of the display of large plain-text code snippets on the site, we have placed the actual content in this text file which you should refer to as we go along.

We used a BibJSON document, which comes from the examples on the
BibJSON homepage.

When converting this into the two RDF serialisations we invent a namespace

http://www.bibkn.org/bibjson/terms/

This namespace provisionally holds all predicates/keys that are used by BibJSON
and are not immediately clearly available in another ontology. These terms should
not under any circumstances be considered definitive or final, only indicative.

Now consider the RDF-JSON serialisation:

Some key things to note about this serialisation:

- There is no explicit shortening of URIs for predicates into CURIEs; all URIs are instead presented in full.
- The object of each predicate is a JSON object with up to 4 keys (value, type, datatype, lang). This means that it is not easy for the human eye to pick out the value of a particular predicate.
- Of the two RDF serialisations, this is by far the more verbose.
- It is relatively difficult for a human to read and write.

Compare this with the equivalent JSON-LD serialisation:

Some things to note about this serialisation:

- It has a clear treatment of namespaces.
- It may be slightly inaccurate, as there are some parts of its specification which are ambiguous – feedback welcome.
- The object values cannot be taken as the value of the predicate, as they may contain datatype and/or language information in them, or may be surrounded by angled brackets.
- It is relatively easy for a human to read and write.

Both serialisations are capable of representing the same data, although JSON-LD
is far more terse and therefore easier to read and write. It is not, however,
possible to reliably treat JSON-LD as a pure list of key-value pairs in non-RDF
aware environments, as it includes RDF type and language semantics in the literal
values of objects. RDF-JSON does not suffer from this same issue within the object
literals, but in return its notation is more complex.

A serious shortcoming of RDF-JSON is its lack of explicit handling of CURIEs and namespaces,
and it could benefit from adopting the conventions laid out in JSON-LD – this
may bring the choice of which serialisation to use down to preference rather
than relying on any significant technical differences.
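
As an indication of how little is needed to add CURIE handling on top of full URIs, here is a minimal Python sketch. The `bibjson` prefix maps to the provisional namespace given earlier in this post; the `dc` prefix, the `citekey` term and both helper functions are assumptions for illustration and are not taken from either specification.

```python
# Prefix map: `bibjson` uses the provisional namespace from this post; the
# `dc` prefix is added for illustration.
NAMESPACES = {
    "bibjson": "http://www.bibkn.org/bibjson/terms/",
    "dc": "http://purl.org/dc/terms/",
}


def expand(curie):
    """Expand prefix:local to a full URI; leave unknown prefixes untouched."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in NAMESPACES:
        return NAMESPACES[prefix] + local
    return curie


def shorten(uri):
    """Shorten a full URI to a CURIE using the longest matching namespace."""
    for prefix, base in sorted(NAMESPACES.items(), key=lambda kv: -len(kv[1])):
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    return uri


print(expand("dc:title"))
# `citekey` is a hypothetical term used only to exercise the function.
print(shorten("http://www.bibkn.org/bibjson/terms/citekey"))
```

Full URIs such as `http://example.org/x` pass through `expand` unchanged because `http` is not a registered prefix.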

Each of the formats also comfortably represents BibJSON, and with the extensive lists of predicates provided in that specification it would be straightforward enough to do a full and proper treatment of BibJSON through one of these routes.

Posted in BibServer, Data, JISC OpenBib, OKFN Openbiblio, Semantic Web | Tagged inf11, jisc, jiscEXPO, jiscopenbib, model, ontology, outputs, progress, progressPosts, rdf

Getting open bibliographic data from (UK)PMC / PubMed

Posted on May 3, 2011 by Mark MacGillivray

For some time now, the JISC Open Bibliography project team has been attempting to get open bibliographic data from (UK)PMC / PubMed. Everyone involved (Robert Kiley – Wellcome, Ben O’Steen, Peter Murray-Rust – JISC OpenBib, Jeff Beck – NIH/NLM/NCBI, Johanna McEntyre) has worked hard to achieve this, but attempts have been hampered by ambiguities and technical restrictions. The purpose of this post is to clarify and highlight these issues as examples of stumbling blocks on any path to linked open data, to specify what it is we are trying to achieve at present, and learn how to improve this process.

WHAT WE ARE TRYING TO DO

Closed access to bibliography is dangerous – it actually holds back the scientific discovery process. We therefore believe it is important to have an authoritative Open collection of bibliographic records. This acts as a primary resource for the community which they can use for normalisation, discovery, annotation, etc. We seek confirmation that we can have programmatic access to the approximately twenty million or so records in PubMed. NCBI for example should be able to say: “these are the articles which we have in Pubmed” without breaking any laws or contracts. These articles would be identified by their core bibliographic data.

PROBLEMS

We received an original email last year stating that we could have such access to PubMed, but it has become unclear what PubMed is.
Identifying the correct content is not straightforward – are we talking about PMC / UKPMC / PubMed / Open Access subset?
What licenses are involved and on which subsets do open licenses such as CC0 apply?
These datasets are very large, so incremental and recordset-by-recordset requests to servers have resulted in roadblocks such as timeouts and errors.

WHAT DATASET ARE WE TALKING ABOUT

The 2 million articles in PMC are NOT all open access. There are 251,129 articles (approx 12% of PMC) that are in the open access subset.
Although there are 2 million or so articles in PMC which anyone can look at, print out etc, only 251k of these have an OA licence which allows people to re-use the content, including creating derivative works.
PMC and UKPMC have approximately the same full-text content. There are a small minority of journals which refused to allow their content to be mirrored to UKPMC.
The distinction between “public access” content and “open access” articles (i.e. 0.25m articles) is irrelevant, as we are only interested in the bibliographic record, not the content.
For current purposes PMC and UKPMC can be used interchangeably.
PMC is only a subset of PubMed – which contains about twenty million records, the totality of content in NIH / NLM / NCBI.
The MEDLINE dataset is a subset of about 98% of PubMed.
However we believe, as per previous discussions, that the legal situation applies equally to PubMed as to the PMC.
So we are looking for every bibliographic record in PubMed (or MEDLINE if that is easier to acquire).

WHAT DO WE MEAN BY BIBLIOGRAPHIC RECORD

“Bibliography” is sometimes used as synonymous with “a given collection of bibliographic records”. Consider “the bibliographic data for Pubmed”; what we are interested in is enumerating individual bibliographic records.
“Citation” often refers to the reference within the fulltext to another publication (via its bibliographic record). The list of citations is not in general Open except in Open Access journals.
For the purposes of Open Bibliography we are restricting our discussion to what we call core bibliographic data (described in the open bibliographic data principles)
We regard the core bibliographic data as uncopyrightable, and generally acknowledged to be necessarily Open.
This core bibliographic data is what we mean by the bibliographic record.
Such records are unoriginal and inevitable, being the only way of actually identifying a work.
Although collections of bibliographic data are copyrightable (at least in Europe) because they are the result of the creative act of assembling a set of records, the individual records are not.
There is no creative act in compiling the list of bibliographic records held by NCBI/Pubmed as it is an exhaustive enumeration.
We believe that there is no moral case and probably no legal case for regarding these as the property of the publisher.

WHAT DO WE NOT MEAN BY BIBLIOGRAPHIC RECORD

As abstracts appear to be copyrightable we do not include abstracts, or annotations.
If it is not in the open bibliographic principles, we do not consider it to be in the bibliographic record.

WHAT WE HOPE TO GET NOW

Due to issues with programmatic access to the PMC / PubMed dataset (restrictions on requests to the servers that contain them), we request a dump of the MEDLINE dataset.
This represents about 98% of PubMed which we believe is or should be available as CC0.
As MEDLINE also has incremental updates, we request ongoing access to those, to allow change tracking and synchronisation.
We have filled in the automatic leasing form for the MEDLINE set a few times since February (the most recent attempt was at the end of April).
We hope that the position is now clearly stated in this post, and await confirmation.
Upon agreement we look forward to receiving the XML files containing the MEDLINE dataset, from which we will extract the aforementioned unoriginal and re-usable bibliographic data.

We look forward to resolving this, to receiving the data, and to helping to make it openly available.

Posted in Data, JISC OpenBib, OKFN Openbiblio | Tagged bibliographic, inf11, jisc, jiscEXPO, jiscopenbib, progress, progressPosts

Open Linked Data at Mannheim University Library

Posted on April 28, 2011 by Adrian Pohl

Mannheim University Library recently announced that it is releasing its catalog data under a CC0 licence.

Translating part of the announcement:

After Mannheim University Library was the first German library to publish catalog data as Linked Data (…), both the original raw data and the RDF version are as of now also published under a CC0 licence and therefore available for anybody to reuse without any restrictions.

The data is available via http://data.bib.uni-mannheim.de/. A CKAN entry for this data set has yet to be made.

It is good news that Mannheim University Library, which has been experimenting with Linked Data for some time now, has also taken the step of opening up the data. Our congratulations on this move. May more libraries follow suit.

Posted in Data

Bibliographica and Edinburgh International Science Festival

Posted on April 11, 2011 by Mark MacGillivray

This weekend I was trying to build a useful search tool to help my wife find interesting events on at the Edinburgh International Science Festival. One problem was that the dataset was poor, and the descriptions did not always give a lot of detail. I attempted to rectify this by hooking up the events to bibliographica.

Now, you can filter events then select “more” to see further details and a list of relevant publications based on the event speakers and the event theme; this can give a slightly better idea of what might be going on, as you can review the published work of those involved.

http://eisf.cottagelabs.com

Unfortunately, the data does still have quite a few errors, and I have not ensured that names tie up properly, so the results are not always perfect. But still, it is quite a good demonstration. It would be even better with journal articles to search across.

Posted in Data, event, JISC OpenBib, OKFN Openbiblio | Tagged inf11, jisc, jiscEXPO, jiscopenbib, progressPosts, WIN

open theses at EURODOC

Posted on April 7, 2011 by Mark MacGillivray

#jiscopenbib #opentheses

On Friday 1st April 2011, Mark MacGillivray, Peter Murray-Rust and Ben O’Steen remotely attended the EURODOC conference in Vilnius, Lithuania in order to take part in an Open Theses workshop locally hosted by Daniel Mietchen and Alfredo Ferreira (funded by the JISC Open Bib project to attend in person).

During the workshop we began laying the foundations for open theses in Europe, discussing with current and recently finished postgraduate students and collecting data from those present and from anyone else interested.

As described by Peter prior to the event:

As part of our JISCOpenBIB project we are running a workshop on Open Theses at EURODOC 2011. “We” is an extended community of volunteers centered round the main JISC project. In that project we have developed an approach to the representation of Open Bibliographic metadata, and now we are extending this to theses.

Why theses? Because, surprisingly, many theses are not easily discoverable outside their universities. So we are running the workshop to see how much metadata we can collect on European theses. Things like name, university, subject, date, title – standard metadata.

We have the beginnings of a dataset at:

https://spreadsheets.google.com/ccc?key=0AnCtSdb7ZFJ3dHFTNDhJU0xfdGhIT01WeTBMMDZWOGc&hl=en_GB&authkey=CJuy4owB

The content of this datasheet will hopefully be used to populate an open theses collection in bibliographica, and in addition it is powering a mashup that will allow us to view at a glance the theses that have been published across the world, and where possible a link to the work itself:

http://benosteen.com/eurodoc.html

We also have a survey to fill in, to collect opinion around copyright issues for current / soon to be published theses, based at:

http://openbiblio.net/opentheses-survey/

The data collected by this survey is available at:

https://spreadsheets.google.com/ccc?key=0AnCtSdb7ZFJ3dDN1cHQ3TDJpYWRaWmkxWlFDS2lMWXc&hl=en_GB&authkey=CMKN-O8I#gid=0

Posted in JISC OpenBib | Tagged bibliographic, communityBenefits, inf11, jisc, jiscEXPO, jiscopenbib, progress, progressPosts, WIN

Minutes: 10th Virtual meeting of the OKFN Openbiblio Group

Posted on April 6, 2011 by Adrian Pohl

Date: April 5th, 2011, 15:00-16:15 GMT

Channel: Meeting was held via Skype and Etherpad

Participants

Adrian Pohl
Jim Pitman
Primavera De Filippi
Karen Coyle
James Robert Griffin III
Thomas Krichel
William Waites
Lucy Chambers

publicdomainworks.net & bibliographica.org

Primavera from publicdomainworks.net attended the meeting. As publicdomainworks.net wants to use bibliographica.org as a data store in the future, she wanted to know how to query bibliographica.org to get JSON output. It was decided to move this particular discussion to the openbiblio-dev mailing list, which has already happened.

The further discussion was about how to best model bibliographic data: First, which serialization to use and second which metadata elements. William said, regarding serializations the current lineup of JSON serialisations was being considered at bibliographica.org.

Furthermore, BibJSON was mentioned as a loose standard for describing bibliographic resources in JSON. FOAF/BIBO/DC are incorporated in BibJSON which also allows other name spaces.
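
To make the idea of plain keys mixed with terms from other vocabularies concrete, here is a minimal Python sketch. It is not taken from the BibJSON specification: the key names, the prefixes and the helper function namespaces_used are all assumptions chosen for illustration.

```python
import json

# Hypothetical record: the key names and prefixes below are illustrative
# assumptions, not copied from the BibJSON specification.
record = {
    "title": "An example thesis",
    "author": [{"name": "A. N. Author"}],
    "year": "2011",
    # Prefixed keys stand in for terms borrowed from other vocabularies.
    "dc:subject": "bibliography",
    "bibo:degree": "PhD",
}


def namespaces_used(rec):
    """Return the sorted prefixes that appear in prefixed keys of a record."""
    return sorted({key.split(":", 1)[0] for key in rec if ":" in key})


# Round-trip through JSON text, as a front end or data store would.
serialised = json.dumps(record, indent=2, sort_keys=True)
restored = json.loads(serialised)

print(namespaces_used(restored))
```

A consumer that only cares about the plain keys can ignore the prefixed ones, while a linked data consumer could map each prefix to its vocabulary.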

Openbiblio Principles

Since the last meeting, mails promoting the openbiblio principles have been sent to many library-related mailing lists:
- Karen: lita-l, code4lib, ngc4lib, ol-discuss, public-lld
- Antoine: europeana, public-lod, public-esw
- Adrian: Inetbib
- Jim and Thomas haven’t sent out mails. ACTION: They will catch up on this.
The number of signatories grew to 83.
A Korean version of the Principles on Open Bibliographic Data was sent to Mark MacGillivray. It needs to be verified before it is published on openbiblio.net.
ACTION: Primavera will ask a Korean friend to take a look at it.

Open Data enquiries

Adrian had a mail exchange with E-LIS.
They confirmed that E-LIS metadata is open and are thinking about explicitly using an open license and signing the openbiblio principles. The enquiry is now resolved, see here.
Nothing happened regarding the other running enquiries.

ORCID

ORCID is a non-profit organization “dedicated to solving the name ambiguity problem in scholarly research” by establishing “a registry that is adopted and embraced as the de facto standard by the whole of the community”. (See http://www.orcid.org/aboutus.)
Currently the ORCID principles state that individual person information will be licensed CC0, but there is no such statement regarding bulk data.
James Griffin and Thomas will give a talk at the next ORCID participant meeting on 18th May, where they will promote open bibliography.

BibServer/Openbiblio/Bibliographica

Jim announced that he agreed with Rufus and Mark about BibServer integration with Openbiblio/Bibliographica.

Microsoft Academic Search (MAS) API

Microsoft Academic Search (MAS) has clean, well-structured data about academic articles.
Thomas Krichel, Peter Murray-Rust and Jim Pitman are talking with MAS about making the service more open.

Upcoming events

Open Knowledge Conference: OKCon 2011 will take place in Berlin on 30th June and 1st July, 2011. It would be great if the openbiblio working group were represented there with a talk. Adrian said he could do this, but he’d like to present together with someone from the academic paper group (like PMR, Jim or Thomas) to paint a better picture of the group.

Action Collection

Thomas will post to engineering librarian lists and lists Karen hasn’t covered.
Jim will cover math/stat related lists and publications
Primavera will ask a Korean friend to take a look at the openbiblio principles translation.

Posted in minutes, OKFN Openbiblio, Uncategorized

Talk at UKSG 2011 Conference on Open Bibliography

Posted on April 6, 2011 by Rufus Pollock

This is a post by Rufus Pollock, a member of the Working Group and co-Founder of the Open Knowledge Foundation.

Yesterday, I was up in Harrogate at the UKSG (UK Serials Group) annual conference to speak in a keynote session on Open Bibliography and Open Bibliographic Data.

I’ve posted the slides online and iframed below.

Outline

Over the past few years, there has been explosive growth in open data
with significant uptake in government, research and elsewhere.

Bibliographic records are a key part of our shared cultural heritage.
They too should therefore be open, that is made available to the
public for access and re-use under an open license which permits use
and reuse without restriction. Doing this
promises a variety of benefits.

First, it would allow libraries and other managers of bibliographic
data to share records more efficiently and improve quality more
rapidly through better, easier feedback. Second, it would increase
innovation in bibliographic services and applications, generating
benefits for the producers and users of bibliographic data and the
wider community.

This talk will cover the what, why and how of open bibliographic
data, drawing on direct recent experience such as the development of
the Open Biblio Principles and the work of the Bibliographica and JISC
OpenBib projects to make the 3 million records of the British
Library’s British National Bibliography (BNB) into linked open data.

With a growing number of Government agencies and public institutions
making data open, is it now time for the publishing and library
community to do likewise?

Posted in News, Talks
