Announcing the CC0 Medline dataset
We are happy to report that we now have a full, clean public domain (CC0) version of the Medline dataset available for use by the community.
What is the Medline dataset?
The Medline dataset is a subset of bibliographic metadata covering approximately 98% of all PubMed publications. It comes as a package of approximately 653 XML files, with records listed chronologically by the date each record was created. In total there are approximately 19 million publication records.
Medline is a maintained dataset, and updates are appended chronologically to the current dataset.
Read our explanation of the different PubMed datasets for further information.
Where to get it
The raw dataset can be downloaded from CKAN: http://ckan.net/package/medline
What is in a record
Most records contain useful non-copyrightable bibliographic metadata such as author, title, journal and PubMed record ID. Many also have DOIs. We have stripped out any potentially copyrightable material such as abstracts.
Read our technical description of a record for further information.
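As a rough illustration of working with these records, here is a minimal Python sketch that streams one XML file and pulls out the fields mentioned above. It assumes the files keep the element names of NLM's standard Medline XML (MedlineCitation, PMID, Article/ArticleTitle, Article/Journal/Title, Article/AuthorList/Author); that is an assumption on our part for this example, so check it against the technical description of a record before relying on it. The function name iter_records is made up for the sketch, and DOI extraction is left out.

```python
import xml.etree.ElementTree as ET


def iter_records(source):
    """Yield one dict per MedlineCitation element in `source`.

    `source` is a file path or a binary file object. The element names
    below are ASSUMED to follow NLM's standard Medline XML layout.
    """
    # iterparse streams the file, so a large XML file is never fully in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "MedlineCitation":
            continue

        authors = []
        for author in elem.findall("Article/AuthorList/Author"):
            parts = [author.findtext("ForeName"), author.findtext("LastName")]
            name = " ".join(p for p in parts if p)
            if name:  # skip entries with no personal name (e.g. collective authors)
                authors.append(name)

        yield {
            "pmid": elem.findtext("PMID"),
            "title": elem.findtext("Article/ArticleTitle"),
            "journal": elem.findtext("Article/Journal/Title"),
            "authors": authors,
        }

        # Drop the processed record's children to keep memory use low.
        elem.clear()
```

Calling iter_records with the path to one of the downloaded files would then give one dictionary per publication record, which is enough to start counting journals or building author lists.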
We have made an online visualisation of a sample of the Medline dataset. However, the visualisation relies on WebGL, which is not yet supported by all browsers. It should work in Chrome and probably Firefox 4.
This is just one example, but it shows what great things we can build and learn when we have open access to the necessary data.