For some time now, the JISC Open Bibliography project team has been attempting to get open bibliographic data from (UK)PMC / PubMed. Everyone involved (Robert Kiley – Wellcome, Ben O’Steen, Peter Murray-Rust – JISC OpenBib, Jeff Beck – NIH/NLM/NCBI, Johanna McEntyre) has worked hard to achieve this, but attempts have been hampered by ambiguities and technical restrictions. The purpose of this post is to clarify and highlight these issues as examples of stumbling blocks on any path to linked open data, to specify what it is we are trying to achieve at present, and to learn how to improve this process.
WHAT WE ARE TRYING TO DO
Closed access to bibliography is dangerous – it actually holds back the scientific discovery process. We therefore believe it is important to have an authoritative Open collection of bibliographic records. Such a collection acts as a primary resource that the community can use for normalisation, discovery, annotation, etc. We seek confirmation that we can have programmatic access to the approximately twenty million records in PubMed. NCBI, for example, should be able to say: “these are the articles which we have in PubMed” without breaking any laws or contracts. These articles would be identified by their core bibliographic data.
- We received an original email last year stating that we could have such access to PubMed, but it has become unclear what PubMed is.
- Identifying the correct content is not straightforward – are we talking about PMC / UKPMC / PubMed / Open Access subset?
- What licenses are involved and on which subsets do open licenses such as CC0 apply?
- These datasets are very large, so incremental and recordset-by-recordset requests to servers have resulted in roadblocks such as timeouts and errors.
WHAT DATASET ARE WE TALKING ABOUT
- The 2 million articles in PMC are NOT all open access. There are 251,129 articles (approx 12% of PMC) that are in the open access subset.
- Although there are 2 million or so articles in PMC which anyone can look at, print out, etc., only 251k of these have an OA licence which allows people to re-use the content, including creating derivative works.
- PMC and UKPMC have approximately the same full-text content. A small minority of journals refused to allow their content to be mirrored to UKPMC.
- The distinction between “public access” content and “open access” articles (i.e. 0.25m articles) is irrelevant, as we are only interested in the bibliographic record, not the content.
- For current purposes PMC and UKPMC can be used interchangeably.
- PMC is only a subset of PubMed – which contains about twenty million records, the totality of content in NIH / NLM / NCBI.
- The MEDLINE dataset is a subset of about 98% of PubMed.
- However, we believe, as per previous discussions, that the legal situation applies equally to PubMed as to PMC.
- So we are looking for every bibliographic record in PubMed (or MEDLINE if that is easier to acquire).
WHAT DO WE MEAN BY BIBLIOGRAPHIC RECORD
- “Bibliography” is sometimes used as synonymous with “a given collection of bibliographic records”. Consider “the bibliographic data for PubMed”; what we are interested in is enumerating individual bibliographic records.
- “Citation” often refers to the reference within the fulltext to another publication (via its bibliographic record). The list of citations is not in general Open except in Open Access journals.
- For the purposes of Open Bibliography we are restricting our discussion to what we call core bibliographic data (described in the open bibliographic data principles).
- We regard the core bibliographic data as uncopyrightable, and generally acknowledged to be necessarily Open.
- This core bibliographic data is what we mean by the bibliographic record.
- Such records are unoriginal and inevitable, being the only way of actually identifying a work.
- Although collections of bibliographic data are copyrightable (at least in Europe) because they are the result of the creative act of assembling a set of records, the individual records are not.
- There is no creative act in compiling the list of bibliographic records held by NCBI/PubMed, as it is an exhaustive enumeration.
- We believe that there is no moral case and probably no legal case for regarding these as the property of the publisher.
WHAT DO WE NOT MEAN BY BIBLIOGRAPHIC RECORD
- As abstracts appear to be copyrightable, we do not include abstracts or annotations.
- If it is not in the open bibliographic principles, we do not consider it to be in the bibliographic record.
WHAT WE HOPE TO GET NOW
- Due to issues with programmatic access to the PMC / PubMed datasets (restrictions on requests to the servers that contain them), we request a dump of the MEDLINE dataset.
- This represents about 98% of PubMed, which we believe is, or should be, available as CC0.
- As MEDLINE also has incremental updates, we request ongoing access to those, to allow change tracking and synchronisation.
- We have filled in the automatic leasing form for the MEDLINE set a few times since February; the most recent attempt was at the end of April.
- We hope that the position is now clearly stated in this post, and await confirmation.
- Upon agreement we look forward to receiving the XML files containing the MEDLINE dataset, from which we will extract the aforementioned unoriginal and re-usable bibliographic data.
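To make the extraction step concrete, here is a minimal sketch of how core bibliographic data might be pulled out of MEDLINE-style XML while leaving abstracts behind. It is an illustration only, not our actual tooling: the element names (MedlineCitation, PMID, Article, ArticleTitle, Journal, AuthorList and so on) are assumed from the MEDLINE citation XML layout and should be checked against the DTD supplied with the files, the sample record is invented, and the choice of fields is just one reading of the open bibliographic data principles.

```python
import xml.etree.ElementTree as ET

# Invented sample record; element names are assumed to follow the
# MEDLINE citation XML layout and should be checked against the DTD.
SAMPLE = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>00000001</PMID>
    <Article>
      <Journal>
        <Title>Journal of Examples</Title>
        <JournalIssue>
          <Volume>1</Volume>
          <PubDate><Year>2011</Year></PubDate>
        </JournalIssue>
      </Journal>
      <ArticleTitle>An invented title for illustration.</ArticleTitle>
      <Abstract><AbstractText>Not part of the core record.</AbstractText></Abstract>
      <AuthorList>
        <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
        <Author><LastName>Roe</LastName><Initials>R</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""


def core_record(citation):
    """Return only the core bibliographic fields of one citation, or None."""
    article = citation.find("Article")
    if article is None:
        return None
    # Join whichever name parts are present; missing parts are skipped.
    authors = [
        " ".join(filter(None, (a.findtext("LastName"), a.findtext("Initials"))))
        for a in article.findall("AuthorList/Author")
    ]
    # The abstract is deliberately never read.
    return {
        "pmid": citation.findtext("PMID"),
        "title": article.findtext("ArticleTitle"),
        "journal": article.findtext("Journal/Title"),
        "volume": article.findtext("Journal/JournalIssue/Volume"),
        "year": article.findtext("Journal/JournalIssue/PubDate/Year"),
        "authors": authors,
    }


def extract_core_records(xml_text):
    """Parse an XML string and collect the core record of every citation."""
    root = ET.fromstring(xml_text)
    found = (core_record(c) for c in root.iter("MedlineCitation"))
    return [r for r in found if r is not None]


records = extract_core_records(SAMPLE)
for record in records:
    print(record)
```

For files the size of the real MEDLINE dump, a streaming parser such as `xml.etree.ElementTree.iterparse` would probably be a better fit than loading each file into memory at once.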
We look forward to resolving this, to receiving the data, and to helping to make it openly available.