Are bibliographies copyrightable? – the German case

Unfortunately, you have to deal with copyright and neighbouring rights like the European sui generis database right when you want to liberate data, be it bibliographic data or other data. In recent discussions about bibliographies and copyright, Peter Murray-Rust several times raised the point that (at least some) publishers consider a bibliography, i.e. the list of references at the end of a research article or a monograph, as copyrighted. (N.B., we don't talk about specialized bibliographies here which aren't part of an academic publication.)

Assuming this to be a fact, it would mean that you couldn't copy, reuse, publish, or distribute bibliographies without prior consent of the copyright holder (which often is the publisher). Peter's view of bibliographies and copyright surprised me because, from all I have heard regarding German law, bibliographies aren't copyrightable at all. This post provides the background for my conviction.

Currently, there are two main sources regarding this topic:

  • a letter by the German Publishers and Booksellers Association to the German Library Association, and
  • a legal guide to the digitization of public domain material, published by the North Rhine-Westphalian Library Service Center (hbz).

I will translate the letter and summarize the relevant information in the legal guide.

The German Publishers and Booksellers Association’s letter

The letter is addressed to Gabriele Beger, Director of the German Library Association (dbv), and was written in the context of discussions about enriching library catalogues with additional information like indexes, covers etc. (usually called "catalogue enrichment"). Here is my attempt at a translation of the letter:

++++begin translation+++

Dear Prof. Dr. Gabriele Beger,

In the last few months, discussions have taken place between the German National Library, the German Library Association and the German Publishers and Booksellers Association about the enrichment of library catalogues with different data that goes beyond mere bibliographic information. You've asked the Publishers and Booksellers Association to examine for its member publishers which possible addenda are unproblematic regarding copyright law.

After examination by our legal department I can inform you today that there are no legal objections to indexing of the following data for catalogue enrichment purposes:

  • title page (not book cover)
  • table of contents
  • list of tables
  • table of figures
  • bibliography
  • subject index
  • index of persons
  • places index

Also, the Publishers and Booksellers Association, on behalf of its member publishers, wants to enable the use of

  • covertext and blurb

These texts will often be copyrighted, but normally the publishers own the rights. Because these short descriptions are advertisement-like and in no way replace reading the book, we assure you that we will inform our member publishers that we see no objections against indexing this information in library catalogues.

Unfortunately, this doesn't apply to the summaries of academic books. Such abstracts in many cases are made freely accessible by the publishers with the authors' consent. But not all authors give their assent to this practice, so publishers have to take into account their legal duties to respect authors' rights. However, in order to enable public access to abstracts of scientific books in library catalogues, we will try to work towards a change in publishers' contracts in the academic field.

Also, publishers can't generally permit catalogue enrichment with cover scans. For the creation of book covers, copyrighted illustrations, photos etc. are very often used, for which the publisher has to conclude a license agreement with the rightsholder (…). In these cases often only a few distinct rights are conferred. Therefore, sometimes even the publisher himself doesn't own the rights to display cover scans in databases.

(…)

++++end translation+++

Obviously, this letter only addresses the content of monographs, but there is nothing to be said against this also applying to journal articles.

It has to be made clear that the practice of catalogue enrichment means scanning printed books and adding the scans (and/or the OCRed text) to search tools. The scraping of publishers' websites for this data wasn't part of the discussion. But it is made sufficiently clear that single bibliographies aren't copyrighted. Of course, scraping websites and aggregating many bibliographies from one source would probably conflict with the European database right – as does large-scale scraping of single bibliographic records.

Legal guide to the digitization of public domain material

While the above letter only states that bibliographies and other indexes aren't subject to copyright in Germany, the legal guide about digitization of public domain works by libraries also provides background for this view. The guide was published by the North Rhine-Westphalian Library Service Center (hbz). It was written on behalf of the hbz by Dr. Till Kreutzer, a lawyer focusing on copyright and related rights and an editor at irights.info.

The guide not only covers complete works which are in the public domain because of expired copyright but also public domain parts of otherwise copyrighted works. Concerning these, Till Kreutzer states that indexes like bibliographies, tables of figures, tables of content, name and subject indexes and indexes of tables normally are not copyrighted (in Germany) and can thus mostly be digitized without asking for permission.

Title page and covers

In section 2.3 the guide covers public domain parts of otherwise copyrighted works. Regarding the title page he writes on p. 12:

The title page normally isn’t copyrighted. Usually the title page only contains the work’s title and bibliographic information about author and publisher. These elements don’t reach the necessary level of originality for copyright protection.

In the following paragraph he also makes clear that covers may display copyrighted material so that they are often copyrighted.

Indexes

Till Kreutzer states that most indexes like bibliographies, tables of figures, tables of content, name and subject indexes and indexes of tables aren’t copyrightable because

  1. They are lists of items which aren't copyrightable by themselves, and
  2. The creation of an index, i.e. the collection of individual items (headings, names, figure titles etc.) into an index, lacks the necessary level of originality; it isn't a creative act and is actually mostly done automatically.

He concludes that tables of content and other indexes can be digitized and published without the rightsholder’s consent regardless of whether the underlying work is copyrighted or not.
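To illustrate the second point above – that building an index is a mechanical aggregation rather than a creative act – here is a minimal sketch of automatic name-index generation. The page texts and the list of names are invented for the example; nothing here comes from the legal guide itself.

```python
from collections import defaultdict

# Hypothetical input: page number -> page text (invented example data)
pages = {
    12: "Kreutzer discusses digitization; Beger replies.",
    13: "Beger summarizes the catalogue enrichment debate.",
}

names = ["Kreutzer", "Beger"]  # names to be indexed (assumed known in advance)

# Build the index purely mechanically: scan each page for each name.
index = defaultdict(list)
for page, text in sorted(pages.items()):
    for name in names:
        if name in text:
            index[name].append(page)

for name in sorted(index):
    print(f"{name}: {', '.join(map(str, index[name]))}")
```

The whole "creative" content is a deterministic scan-and-collect loop, which is exactly why such indexes fall below the originality threshold.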

Posted in OKFN Openbiblio | Tagged | 2 Comments

Minutes: 13th Virtual Meeting of the OKFN Openbiblio Group

Date July, 5th 2011, 15:00 GMT

Channel Meeting was held via Skype and Etherpad

Participants

  • Adrian Pohl
  • Peter Murray-Rust
  • David Shotton
  • Thomas Krichel
  • Jim Pitman
  • Karen Coyle

Agenda

Report from the #openbiblio workshop at #okcon2011 & okcon in general

openbiblio workshop

See workshop etherpad: http://okfnpad.org/okcon-biblio-workshop

  • More than 20 people attended
  • Nice mix of library people, scientists and humanities scholars
  • Peter presented openbiblio background: different types of data, some copyrighted some probably not
  • Different interests in open bibliographic data among librarians, humanities scholars and natural scientists were a topic; nonetheless much overlap –> Peter: pragmatically/politically we should work together, but institutionally and technically we differ (bibliographic data vs. library data with holdings, serials data vs. monographs)
  • Mark presented BibServer/BibJSON/bibsoup development which generated some discussion
  • General brainstorming: Why open data? What to do with it? What stops us from doing it? How do we do it?
  • Generally, we had focused, substantial discussions at the workshop. But time was too short to do actual work. Organize some local/regional hackathons?

“Knowledge for all” project

http://www.k4all.ca/

  • Aim: To build an international collaboration of libraries providing an open source / openly licensed database with journal article data.
  • Peter sees several problems which could impede the project:
    • Possible legal problems haven't been addressed in detail yet.
    • Still a very small group.

Internet Archive/Open library

  • Peter spoke at length with Brewster Kahle at OKCon2011
  • Collaboration with Open Library?
  • Problem with OL: It's not a clean "open" database in light of the Open Definition because of data contributed by libraries which wasn't explicitly openly licensed.
  • Asking OL to provide BibJSON? – Karen: We could write a template for OL to output BibJSON. – Jim: Otherwise we could map the output data ourselves.
  • Open Library bulk download: http://openlibrary.org/data#downloads
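Mapping Open Library output to BibJSON, as Karen and Jim suggested, could look roughly like the sketch below. The Open Library record is heavily simplified (real dump records reference authors by key rather than by name), and the BibJSON field names are an assumption based on the draft spec, not a definitive mapping.

```python
import json

def ol_to_bibjson(ol_record):
    """Map a (simplified) Open Library edition record to a BibJSON-style dict."""
    return {
        "type": "book",
        "title": ol_record.get("title", ""),
        "author": [{"name": a} for a in ol_record.get("authors", [])],
        "year": ol_record.get("publish_date", ""),
        "identifier": [
            {"type": "isbn", "id": i} for i in ol_record.get("isbn_13", [])
        ],
    }

# Hypothetical, simplified Open Library record for illustration only
ol = {
    "title": "Open Bibliography",
    "authors": ["Jane Example"],
    "publish_date": "2011",
    "isbn_13": ["9780000000000"],
}
print(json.dumps(ol_to_bibjson(ol), indent=2))
```

A template on the Open Library side would amount to the same field-by-field translation, just rendered by OL instead of by the consumer.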

JISC openbiblio project ends – results and perspectives

Open Scholarship

  • Peter has created draft proposal (to be sent to JISC) for continued funding. Being reviewed currently by Dave Flanders (JISC) and Jonathan Gray. No monetary info included.

BibServer and BibJSON

Journals data

  • Open Research Reports (OKF, Jenny Molloy) 2011-12-08? London. (PMR or DS)
  • Jenny in http://www.zoo.ox.ac.uk/staff/academics/sinkins_sp.htm
  • hackathon about bibliographic data in ?

Ex Libris Expert Group on Open Data

  • With Karen’s mediation, Adrian asked Carl Grant of Ex Libris to provide a blog post on OKFN blog and openbiblio.net clarifying the recent “open data expert group” announcement: http://bit.ly/jfPKDj

JISC Open Citations project

Final blogged report of JISC Open Citations Project – sister project of PMR’s Open Bibliography Project – is at https://opencitations.wordpress.com/2011/07/01/jisc-open-citations-project-%E2%80%93-final-project-blog-post/.

The Open Citations Web site (http://opencitations.net) permits you to browse the citation data of some 200,000 citing articles within the Open Access Subset (OASS) of Pubmed Central, citing ~3.4 million papers out there in the big wide world, as described at http://opencitations.wordpress.com/2011/07/03/like-a-kid-with-a-new-train-set-exploring-citation-networks/.

Promoting the openbiblio principles

Thomas will bring attention to principles on open bibliographic data on several mailing lists.


Final Product Post: Open Bibliography

Bibliographic data has long been understood to contain important information about the large-scale structure of scientific disciplines and about the influence and impact of various authors and journals. Instead of a relatively small number of privileged data owners being able to manage and control large bibliographic data stores, we want to enable an individual researcher to browse millions of records, view collaboration graphs, submit complex queries, make selections and analyses of data – all on their laptop while commuting to work. The software tools for such easy processing are not yet adequately developed, so for the last year we have been working to improve that: primarily by acquiring open datasets upon which the community can operate, and secondarily by demonstrating what can be done with these open datasets.

Our primary product is Open Bibliographic data

Open Bibliography is a combination of Open Source tools, Open specifications and Open bibliographic data. Bibliographic data is subject to a process of continual creation and replication. The elements of bibliographic data are facts, which in most jurisdictions cannot be copyrighted; there are few technical and legal obstacles to widespread replication of bibliographic records on a massive scale – but there are social limitations: whether individuals and organisations are adequately motivated and able to create and maintain open bibliographic resources.

Open bibliographic datasets

  • Cambridge University Library: This dataset consists of MARC 21 output in a single file, comprising around 180,000 records. More info… Availability: get the data
  • British Library: The British National Bibliography contains about 3 million records, covering every book published in the UK since 1950. More info… Availability: get the data; query the data
  • International Union of Crystallography: Crystallographic research journal publications metadata from Acta Cryst E. More info… Availability: get the data; query the data; view the data
  • PubMed: The PubMed Medline dataset contains about 19 million records, representing roughly 98% of PubMed publications. More info… Availability: get the data; view the data

Open bibliographic principles

In working towards acquiring these open bibliographic datasets, we have clarified the key principles of open bibliographic data and set them out for others to reference and endorse. We have already collected over 100 endorsements, and we continue to promote these principles within the community. Anyone battling with issues surrounding access to bibliographic data can use these principles and the endorsements supporting them to leverage arguments in favour of open access to such metadata.

Products demonstrating the value of Open Bibliography

OpenBiblio / Bibliographica

Bibliographica is an open catalogue of books with integrated bibliography tools, for example allowing you to create your own collections and work with Wikipedia. Search our instance to find metadata about anything in the British National Bibliography. More information is available about the collections tool and the Wikipedia tool.

Bibliographica runs on the open source openbiblio software, which is designed for others to use – so you can deploy your own bibliography service and create open collections. Other significant features include native RDF linked data support, queryable SOLR indexing and a variety of data output formats.

Visualising bibliographic data

Traditionally, bibliographic records have been seen as a management tool for physical and electronic collections, whether institutional or personal. In bulk, however, they are much richer than that because they can be linked, without violation of rights, to a variety of other information. The primary objective axes are:

  • Authors. As well as using individual authors as nodes in a bibliographic map, we can create co-occurrence of authors (collaborations).
  • Authors’ affiliation. Most bibliographic references will now allow direct or indirect identification of the authors’ affiliation, especially the employing institution. We can use heuristics to determine where the bulk of the work might have been done (e.g. first authorship, commonality of themes in related papers etc.). Disambiguation of institutions is generally much easier than for authors, as there are fewer of them and there are also high-quality sites on the web (e.g. Wikipedia for universities). In general, therefore, we can geo-locate all the components of a bibliographic record.
  • Time. The time of publication is well-recorded and although this may not always indicate when the work was done, the pressure of modern science indicates that in many cases bibliography provides a fairly accurate snapshot of current research (i.e. with a delay of perhaps one year).
  • Subject. Although we cannot rely on access to abstracts (most are closed), the title is Open and in many subjects gives high precision and recall. Currently, our best examples are in infectious diseases, where terms such as malaria, plasmodium etc. are regularly and consistently used.
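The first axis above – author co-occurrence – reduces to simple pair counting over records. The record shape below is a minimal assumption for illustration, not a fixed schema from the project:

```python
from collections import Counter
from itertools import combinations

# Invented example records; a real run would use one of the open datasets above.
records = [
    {"title": "Paper A", "authors": ["Smith", "Jones"], "year": 2010},
    {"title": "Paper B", "authors": ["Smith", "Lee", "Jones"], "year": 2011},
]

collaborations = Counter()
for rec in records:
    # Each unordered author pair in a record is one collaboration edge.
    for pair in combinations(sorted(rec["authors"]), 2):
        collaborations[pair] += 1

for (a, b), n in collaborations.most_common():
    print(f"{a} / {b}: {n}")
```

The resulting weighted edges feed directly into a collaboration graph of the kind described below; time and geo-location aggregations follow the same count-and-group pattern keyed on year or grid square instead of author pairs.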

With these components, it is possible to create a living map of scholarship, and we show three examples carried out with our bibliographic sets.

This is a geo-temporal bibliography from the full Medline dataset. Bibliographic records have been extracted by year and geo-spatial co-ordinates located on a grid. The frequency of publications in each grid square is represented by vertical bars. (Note: Only a proportion of the entries in the full dataset have been used and readers should not draw serious conclusions from this prototype). (A demonstration screencast is available at http://vimeo.com/benosteen/medline; the full interactive resource is accessible with Firefox 4 or Google Chrome, at http://benosteen.com/globe.)

This example shows a citation map of papers recursively referencing Wakefield’s paper on the adverse effects of MMR vaccination. A full analysis requires not just the act of citation but the sentiment, and initial inspection shows that the immediate papers had a negative sentiment i.e. were critical of the paper. Wakefield’s paper was eventually withdrawn but the other papers in the map still exist. It should be noted that recursive citation can often build a false sense of value for a distantly-cited object.

This is a geo-temporal bibliographic map for crystallography. The IUCr’s Open Access articles are an excellent resource as their bibliography is well-defined and the authors and affiliations well-identified. The records are plotted here on an interactive map where a slider determines the current timeslice and plots each week’s publications on a map of the world. Each publication is linked back to the original article. (The full interactive resource is available at http://benosteen.com/timemap/index.)

These visualisations show independent publications, but when the semantic facets on the data have been extracted it will be straightforward to aggregate by region, by date and to create linkages between locations.

Open bibliography for Science, Technology and Medicine

We have made further efforts to advocate for open bibliographic data by writing a paper on the subject of Open Bibliography for Science, Technology and Medicine. In addition to submitting it for publication to a journal, we have made the paper available as a prototype of the tools we are now developing. Although somewhat subsequent to the main development of this project, these examples show where this work is taking us – with large collections available, and agreement on what to expect in terms of open bibliographic data, we can now support the individual user in new ways.

Uses in the wider community

Demonstrating further applications of our main product, we have identified other projects making use of the data we have made available. These act as demonstrations for how others could make use of open bibliographic data and the tools we (or others) have developed on top of them.

Public Domain Works is an open registry of artistic works that are in the public domain. It was originally created with a focus on sound recordings (and their underlying compositions) because a term extension for sound recordings was being considered in the EU. However, it now aims to cover all types of cultural works, and the British National Bibliography data queryable via http://bibliographica.org provides an exemplar for books. The Public Domain Works team have built on our project output to create another useful resource for the community – which could not exist without both the open bibliographic data and the software to make use of it.

The Bruce at Brunel project was also able to make use of the output of the JISC Open Bibliography project; in their work to develop faceted browse for reporting, they required large quality datasets to operate on, and we were able to provide the open Medline dataset for this purpose. This is a clear advantage for having such open data, in that it informs further developments elsewhere. Additionally, in sharing these datasets we can receive feedback on the usefulness of the conversions we provide.

A further example involves the OKF Open Data in Science working group; Jenny Molloy is organising a hackathon as part of the SWAT4LS conference in December 2011, with the aim of generating open research reports using bibliographic data from PubMedCentral, focussing on malaria research. It is designed to demonstrate what can be done with open data, and this example highlights the concept of targeted bibliographic collections: essentially, reading lists of all the relevant publications on a particular topic. With open access to the bibliographic metadata, we can create and share these easily, and as required.

Additionally, with easy access to such useful datasets comes serendipitous development of useful tools. For example, one of our project team developed a simple tool over the course of a weekend for displaying relevant reading lists for events at the Edinburgh International Science Festival. This again demonstrates what can be done if only the key ingredient – the data – is openly available, discoverable and searchable.

Benefits of Open Bibliography products

Anyone with a vested interest in research and publication can benefit from these open data and open software products – academic researchers from students through to professors, as well as academic administrators and software developers, are better served by having open access to the metadata that helps describe and map the environments in which they operate. The key reasons and use cases which motivate our commitment to open bibliography are:

  1. Access to Information. Open Bibliography empowers and encourages individuals and organisations of various sizes to contribute, edit, improve, link to and enhance the value of public domain bibliographic records.
  2. Error detection and correction. Community supporting the practice of Open Bibliography will rapidly add means of checking and validating the quality of open bibliographic data.
  3. Publication of small bibliographic datasets. It is common for individuals, departments and organisations to provide definitive lists of bibliographic records.
  4. Merging bibliographic collections. With open data, we can enable referencing and linking of records between collections.
  5. A bibliographic node in the Linked Open Data cloud. Communities can add their own linked and annotated bibliographic material to an open LOD cloud.
  6. Collaboration with other bibliographic organisations, such as reference manager and identifier systems (Zotero, Mendeley, CrossRef) as well as academic libraries and library organisations.
  7. Mapping scholarly research and activity. Open Bibliography can provide definitive records against which publication assessments can be collated, and by which collaborations can be identified.
  8. An Open catalogue of Open scholarship. Since the bibliographic record for an article is Open, it can be annotated to show the Openness of the article itself, thus bibliographic data can be openly enhanced to show to what extent a paper is open and freely available.
  9. Cataloguing diverse materials related to bibliographic records. We see the opportunity to list databases, websites, review articles and other information which the community may find valuable, and to associate such lists with open bibliographic records.
  10. Use and development of machine learning methods for bibliographic data processing. Widespread availability of open bibliographic data in machine-readable formats
    should rapidly promote the use and development of machine-learning algorithms.
  11. Promotion of community information services. Widespread availability of open bibliographic web services will make it easier for those interested in promoting the development of scientific communities to develop and maintain subject-specific community information.

Sustaining Open Bibliography

Using these products

The products of this project add strength to an ecosystem of ongoing efforts towards large scale open bibliographic (and other) collections. We encourage others to use tools such as the OpenBiblio software, and to take our visualisations as examples for further application. We will maintain our exemplars for at least one year from publication of this post, whilst the software and content remain openly available to the community in perpetuity. We would be happy to hear from members of the community interested in using our products.

Further collaborations and future work

We intend to continue to build on the output of this project; after the success of liberating large bibliographic collections and clarifying open bibliographic principles, the focus is now on managing personal / small collections. Collaborative efforts with the Bibliographic Knowledge network project have begun, and continuing development will make the aforementioned releases of large scale open bibliographic datasets directly relevant and beneficial to people in the academic community, by providing a way for individuals – or departments or research groups – to easily manage, present, and search their own bibliographic collections.

Via collaboration with the Scholarly HTML community we intend to follow conventions for embedding bibliographic metadata within HTML documents whilst also enabling collection of such embedded records into BibJSON, thus allowing embedded metadata whilst also providing additional functionality similar to that demonstrated already, such as search and visualisation. We are also working towards ensuring compatibility between ScHTML and Schema.org, affording greater relevance and usability of ScHTML data.
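Collecting embedded records into BibJSON, as described above, amounts to walking the HTML and picking out annotated elements. The sketch below uses hypothetical `bib-*` class names purely as placeholders; they are not the actual Scholarly HTML or Schema.org vocabulary, which was still being settled at the time of writing.

```python
from html.parser import HTMLParser

class BibExtractor(HTMLParser):
    """Collect text from elements with hypothetical class="bib-<field>" markers."""

    def __init__(self):
        super().__init__()
        self.field = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        if cls.startswith("bib-"):
            self.field = cls[len("bib-"):]  # e.g. "bib-title" -> "title"

    def handle_data(self, data):
        if self.field:
            self.record.setdefault(self.field, []).append(data.strip())
            self.field = None

html = ('<p><span class="bib-title">Open Bibliography</span> by '
        '<span class="bib-author">J. Example</span></p>')
parser = BibExtractor()
parser.feed(html)
print(parser.record)  # {'title': ['Open Bibliography'], 'author': ['J. Example']}
```

The collected dict is already BibJSON-shaped; a real implementation would follow whatever attribute convention ScHTML and Schema.org converge on rather than these placeholder classes.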

Success in these ongoing efforts will enable us to support large scale open bibliographic data, providing a strong basis for open scholarship in the future. We hope to attract further support and collaboration from groups that realise the importance of Open Source code, Open Data and Open Knowledge to the future of scholarship.

Project TOC

All the posts about our project can be viewed in chronological order on our site via the jiscopenbib tag. Posts fall into three main types, and the key posts are listed below. The three types reflect the core strands of this project – documenting our progress whilst adjusting objectives for the best outcome, detailing technical development to increase awareness and for future reference, and announcing data releases. Whilst project progress and technical reports may be more common, it is very important to us to ensure also that the open dataset commitments are understood to be key events in themselves; these are the events that set the example for other groups in the publishing community, and should demonstrate open releases as the “best practice” for the community.

Project progress

Technical reports

Data releases

Further information

Project particulars

This project started on 14th June 2010 and finished successfully and on time on 30th June 2011, with a total cost of £77050. This project was funded by JISC under the jiscEXPO stream of the INF11 programme. The PIMS URL for this project is https://pims.jisc.ac.uk/projects/view/1867.

Software and documentation links and licenses

The main software output of this project is the further developed OpenBiblio software, which is available with installation documentation at http://bitbucket.org/okfn/openbiblio. However, there were other developments done as further demonstrations over the course of the project, and each is detailed on the project blog. See the Project TOC Technical reports list for further information.

All the data released during this project was licensed under OKD-compliant licenses such as PDDL or CC0, depending on the license chosen by the publisher, as detailed in the aforementioned announcement posts.

The content of this site is licensed under a Creative Commons Attribution 3.0 License (all jurisdictions).

Project team

  • Peter Murray-Rust – Principal Investigator
  • Rufus Pollock – Project manager
  • Ben O’Steen – Technical lead
  • Mark MacGillivray – Project management
  • Will Waites – Software developer
  • Richard Jones – Additional software development
  • Tatiana De La O – Additional software development

And with thanks to David Flanders – JISC Program Manager

Project partners


4 Stars for Metadata: an Open Ranking System for Library, Archive, and Museum Collection Metadata

This post was written by participants of the LOD-LAM Summit which was held on June 2nd/3rd in San Francisco and is crossposted on the Creative Commons blog and the OKFN blog. For author information see the list below the document.

The library, archives and museums (i.e. LAM) community is increasingly interested in the potential of Linked Open Data to enable new ways of leveraging and improving our digital collections, as recently illustrated by the first international Linked Open Data in Libraries, Archives and Museums (LOD-LAM) Summit in San Francisco. The Linked Open Data approach combines knowledge and information in new ways by linking data about cultural heritage and other materials coming from different museums, archives and libraries. This not only allows for the enrichment of metadata describing individual cultural objects, but also makes our collections more accessible to users by supporting new forms of online discovery and data-driven research.

But as cultural institutions start to embrace the Linked Open Data practices, the intellectual property rights associated with their digital collections become a more pressing concern. Cultural institutions often struggle with rights issues related to the content in their collections, primarily due to the fact that these institutions often do not hold the (copy)rights to the works in their collections. Instead, copyrights often rest with the authors or creators of the works, or intermediaries who have obtained these rights from the authors, so that cultural institutions must get permission before they can make their digital collections available online.

However, the situation with regard to the metadata — individual metadata records and collections of records — to describe these cultural collections is generally less complex. Factual data are not protected by copyright, and where descriptive metadata records or record collections are covered by rights (either because they are not strictly factual, or because they are vested with other rights such as the European Union’s sui generis database right) it is generally the cultural institutions themselves who are the rights holders. This means that in most cases cultural institutions can independently decide how to publish their descriptive metadata records — individually and collectively — allowing them to embrace the Linked Open Data approach if they so choose.

As the word “open” implies, the Linked Open Data approach requires that data be published under a license or other legal tool that allows everyone to freely use and reuse the data. This requirement is one of the most basic elements of the LOD architecture. And, according to Tim Berners-Lee’s 5-star scheme, the most basic way of making data available online is to make it ‘available on the web (whatever format), but with an open licence’. However, there still is considerable confusion in the field as to what exactly qualifies as “open” and “open licenses”.

While there are a number of definitions available such as the Open Knowledge Definition and the Definition of Free Cultural Works, these don’t easily translate into a licensing recommendation for cultural institutions that want to make their descriptive metadata available as Linked Open Data. To address this, participants of the LOD-LAM summit drafted ‘a 4-star classification-scheme for linked open cultural metadata’. The proposed scheme (obviously inspired by Tim Berners-Lee’s Linked Open Data star scheme) ranks the different options for metadata publishing — legal waivers and licenses — by their usefulness in the LOD context.

In line with the Open Knowledge Definition and the Definition of Free Cultural Works, licenses that impose restrictions on the ways the metadata may be used (such as ‘non-commercial only’ or ‘no derivatives’) are not considered truly “open” licenses in this context. This means that metadata made available under a more restrictive license than those proposed in the 4-star system below should not be considered Linked Open Data.

According to the classification there are 4 publishing options suitable for descriptive metadata as Linked Open Data, and libraries, archives and museums trying to maximize the benefits and interoperability of their metadata collections should aim for the approach with the highest number of stars that they’re comfortable with. Ideally the LAM community will come to agreement about the best approach to sharing metadata so that we all do it in a consistent way that makes our ambitions for new research and discovery services achievable.

Finally, it should be noted that the ranking system only addresses metadata licensing (individual records and collections of records) and does not specify how that metadata is made available, e.g., via APIs or downloadable files.

The proposed classification system is described in detail on the International LOD-LAM Summit blog but to give you a sneak preview, here are the rankings:

★★★★ Public Domain (CC0 / ODC PDDL / Public Domain Mark)
★★★ Attribution License (CC-BY / ODC-BY) where the licensor considers linkbacks to meet the attribution requirement
★★ Attribution License (CC-BY / ODC-BY) with another form of attribution defined by the licensor
★ Attribution Share-Alike License (CC-BY-SA/ODC-ODbL)

We encourage discussion of this proposal as we work towards a final draft this summer, so please take a look and tell us what you think!

Paul Keller, Creative Commons and Knowledgeland (Netherlands)
Adrian Pohl, Open Knowledge Foundation and hbz (Germany)
MacKenzie Smith, MIT Libraries (USA)
John Wilbanks, Creative Commons (USA)


Collections in Bibliographica: unsorted information is not information

Collections are the first feature at Bibliographica aimed at user participation.
Collections are lists of books that users can create and share with others, and they are one of the basic features of Bibliographica, as Jonathan Gray already pointed out:

lists of publications are an absolutely critical part of scholarship. They articulate the contours of a body of knowledge, and define the scope and focus of scholarly enquiry in a given domain. Furthermore such lists are always changing.

Details of use

They are accessible via the collections link on the top menu of the website.

To create collections you must be logged in. You can log in at http://bibliographica.org/account/login with an OpenID.

Once logged in, every time you open a book page (e.g. http://bnb.bibliographica.org/entry/GB6502067) you will see the Collections menu on the right, where you can choose between creating a new collection containing that work, or adding the work to an existing collection.

Once you have created some collections, you can always access them through the menu, and they will also appear on your account page.

To remove a book from a collection, click remove in the collection listing of the sidebar.

Collections screencast


Minutes: 12th Virtual Meeting of the OKFN Openbiblio Group

Date: June 7th, 2011, 1500 GMT

Channel: the meeting was held via Etherpad

Participants

  • Adrian Pohl
  • Karen Coyle
  • Jim Pitman
  • Mark MacGillivray
  • Will Waites

Planning an Open Bibliography activity for OKCON

This was the main focus of the call this month. OKCON is on 30th June and 1st July 2011 in Berlin, and we plan to have an Open Bibliography activity at 1530 Berlin time on the Friday afternoon.

There are a lot of possible avenues to explore that are covered in more detail on the pad; I will present here an overview to act as a first draft of the activity plan.

We have identified four areas of work – the ones we go with on the day will probably depend on the interests of the attendees.

BibJSON technical development

  • review the functionality developed for BibServer / BibSoup
  • demonstrate conversions e.g. from BibTex to BibJSON
  • list sources to scrape data from
  • learn how to write new converters for new locations / formats

Language / format / syntax

  • translation issues in BibJSON
  • how best to handle complex content?
  • converting MARC -> XML with XSLT -> BibJSON
  • BibJSON from Zebra or Aleph
  • RDF via Z39.50 – YAZ – http://river.styx.org/ww/2011/06/z39
  • institution/library URIs. Top-down approach of Adrian and Felix? Or bottom-up (mint URIs based on identifiers found in the data rather than in the master list)?

Community engagement

  • Social networking: what existing online places could use open bib data?
  • User options: what do users want to do with open bib data?
  • http://lod-lam.net/summit/2011/06/03/users-uses-service/
  • how could we get users to make open bib data available? e.g. what functionality do they want that we could trade for their participation?
  • Using CKAN as the description and registration hub for bibliographic data (in the wider sense).
  • design (and make?) the T-shirt – e.g. basics of BibJSON, how to approach people for open biblio data

Politics

  • How we promote the idea to (say) funding bodies
  • How to involve system vendors, since most libraries have their data in systems they do not control
  • Other collection holders such as individuals, departments, universities, publishers, mendeley, bibsonomy, archives, museums, government organisations
  • What dataset owners to approach and who might be good at doing that, get volunteers to take on particular dataset liberation projects. Sign up sheet?
  • How to organize/authorize people to approach other groups on behalf of OKF / BKN
  • Using isitopendata.org as the service to use, and targeting FOI responsibilities
  • Publicising successful as well as NO responses on OKF/BKN blog

ELAG 2011 and LOD-LAM Summit

OpenBiblio calls in Italy

There have been Skype calls with Stefano Costa, Antonella De Robbio and Francesca di Donato to discuss issues relating to open bibliographic data in Italy. The quick summary is that bibliographic data is not open at this point for a variety of reasons:

  • in many cases, the rights issues have not been made clear by institutions
  • in addition, libraries often lack the technical ability to make their data available
  • there is an assumption that proprietary library systems = proprietary bibliographic data

Bibliographica gadget in Wikipedia

What is a Wikipedia gadget?

Thinking of ways to show the possibilities of linked data, we have made a Wikipedia gadget, making use of a great resource the Wikimedia developers give to the community.

Wikipedia gadgets are small pieces of code you can add to your Wikipedia user templates; they allow you to add more functionality and render more information when you browse Wikipedia pages.

In our case, we wanted to retrieve information from our Bibliographica site and render it in Wikipedia. Since the pages are rendered with specific markup, we can use the ISBN numbers present in Wikipedia articles to query the Bibliographica database, similar to what Mark has done with the Edinburgh International Science Festival.

Bibliographica.org offers an ISBN search endpoint at http://bibliographica.org/isbn/, so if we request the page http://bibliographica.org/isbn/0241105161 we receive:

[{"issued": "1981-01-01T00:00:00Z", "publisher": {"name": "Hamilton"}, "uri": "http://bnb.bibliographica.org/entry/GB8102507", "contributors": [{"name": "Boyd, William, 1952-"}], "title": "A good man in Africa"}]
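The gadget itself is JavaScript, but to illustrate what a client of this endpoint does, here is a short Python sketch (the helper names are our own; the endpoint URL pattern and the response shape are the ones shown above):

```python
import json
from urllib.request import urlopen


def format_record(record):
    """Build a short display string from one Bibliographica ISBN record."""
    contributors = ", ".join(c["name"] for c in record.get("contributors", []))
    year = record.get("issued", "")[:4]  # "issued" is an ISO timestamp
    return "%s (%s, %s) <%s>" % (record["title"], contributors, year, record["uri"])


def lookup_isbn(isbn):
    """Query the ISBN endpoint and format each match (requires network access)."""
    with urlopen("http://bibliographica.org/isbn/%s" % isbn) as resp:
        records = json.loads(resp.read().decode("utf-8"))
    return [format_record(r) for r in records]


# Offline example using the response shown above:
sample = ('[{"issued": "1981-01-01T00:00:00Z", "publisher": {"name": "Hamilton"}, '
          '"uri": "http://bnb.bibliographica.org/entry/GB8102507", '
          '"contributors": [{"name": "Boyd, William, 1952-"}], '
          '"title": "A good man in Africa"}]')
print(format_record(json.loads(sample)[0]))
# prints: A good man in Africa (Boyd, William, 1952-, 1981) <http://bnb.bibliographica.org/entry/GB8102507>
```

The gadget does the same in the browser: collect the ISBNs on the page, hit the endpoint for each, and render whatever comes back.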

I can use this information to make a window pop up with more information about works when we hover over their ISBNs on Wikipedia pages. If my user template has the Bibliographica gadget, every time I open a wiki page the script will query our database for information about all the ISBNs on the page.
If something is found, it will render a frame around the ISBN numbers:

And if I hover over them, I see a window with information about the book:

Get the widget

So, if you want this widget, you first need to create an account on Wikipedia and then change your default template to add the JavaScript snippet. Once you do this (instructions here), you will be able to get the information available in Bibliographica about the books.

Next steps

For now, the interaction goes in just one direction. Later on, we will be able to feed that information back to Bibliographica.


Openbiblio at #elag2011 and #lodlam

I wrote a post over at the blog for the LOD-LAM (Linked Open Data in Libraries, Archives and Museums) summit. It’s mainly a summary of the ELAG 2011 from an openbiblio viewpoint. See http://lod-lam.net/summit/2011/06/01/from-elag2011-to-lodlam for the post.

Also, the German Zukunftswerkstatt published an interview podcast regarding Open Bibliographic Data. Julia Bergman interviewed Patrick Danowski, Kai Eckert and me at the German barcamp for librarians and other hackers BibCamp. Hopefully, a text version of this interview will also be published on the web soon.


“Full-text” search for openbiblio, using Apache Solr

Overview:

This provides a simple search interface for openbiblio, using a network-addressable Apache Solr instance to provide full-text search (FTS) over the content.

The indexer currently relies on the Entry Model (in /model/entry.py) to provide an acceptable dictionary of terms to be fed to a solr instance.

Configuration:

In the paster main .ini, you need to set the parameter ‘solr.server’ to point to the Solr instance. For example, ‘http://localhost:8983/solr’ or ‘http://solr.okfn.org/solr/bibliographica.org’. If the instance requires authentication, set the ‘solr.http_user’ and ‘solr.http_pass’ parameters too. (Solr is often put behind a password-protected proxy, due to its lack of native authentication for updating the index.)
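A minimal sketch of how these settings might look in the configuration file (the parameter names are the ones described above; the `[app:main]` section name is the usual paster convention and may differ in your deployment):

```ini
[app:main]
solr.server = http://localhost:8983/solr
# Only needed if the instance sits behind an authenticating proxy:
solr.http_user = usernamehere
solr.http_pass = passwordhere
```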

Basic usage:

The search controller: solr_search.py (linked in config/routing.py to /search)

Provides HTML and JSON responses (content negotiation or .html/.json as desired) and interprets a limited but easily expandable subset of Solr params (see ALLOWED_TERMS in /controller/solr_search.py).

The JSON response is the raw Solr response, as this is quite usable in JavaScript.

The HTML response is styled in the same manner as the previous (Xapian-based?) search service, with the key template function formatting each row in templates/paginated_common.html (the Genshi function “solr_search_row”). Unless specified, the search controller will get all the fields it can for the search terms, meaning that the list of results in c.solr.results contains dicts with much more information than is currently exposed. The potentially available fields are as follows:

    "uri"          # URI for the item - eg http://bibligraphica.org/entry/BB1000
    "title"        # Title of the item
    "type"         # URI type(s) of the item (eg http://.... bibo#Document)
    "description"
    "issued"       # Corresponds to the date issued, if given.
    "extent"
    "language"     # ISO formatted, 3 lettered - eg 'eng'
    "hasEditionStatement"
    "replaces"        # Free-text entry for the work that this item supersedes
    "isReplacedBy"    # Vice-versa above
    "contributor"           # Author, collaborator, co-author, etc
                            # Formatted as "John Smith b1920 "
                            # Use lib/helpers.py:extracturi method to add formatting.
                            # Give it a list of these sorts of strings, and it will return
                            # a list of tuples back, in the form ("John Smith b1920", "http...")
                            # or ("John Smith", "") if no -enclosed URI is found.
    "contributor_filtered"  # URIs removed
    "contributor_uris"      # Just the entity URIs alone
    "editor"                # editor and publisher are formatted as contributor
    "publisher"
    "publisher_uris"        # list of publisher entity URIs
    "placeofpublication"    # Place of publication - as defined in ISBD. Possible and likely to
                            # have multiple locations here
    "keyword"               # Keyword (eg not ascribed to a taxonomy)
    "ddc"                   # Dewey number (formatted as contributor, if accompanied by a URI scheme)
    "ddc_inscheme"          # Just the dewey scheme URIs
    "lcsh"                  # eg "Music "
    "lcsh_inscheme"         # lcsh URIs
    "subjects"              # Catch-all,with all the above subjects queriable in one field.
    "bnb_id"                # Identifiers, if found in the item
    "bl_id"
    "isbn"
    "issn"
    "eissn"
    "nlmid"                 # NLM-specific id, used in PubMed
    "seeAlso"               # URIs pertinent to this item
    "series_title"          # If part of a series: (again, formatted like contributor)
    "series_uris"
    "container_title"       # If it has some other container, like a Journal, or similar
    "container_type"
    "text"                  # Catch-all and default search field.
                            # Covers: title, contributor, description, publisher, and subjects
    "f_title"               # Fields indexed to be suitable for facetting
    "f_contributor"         # Contents as above
    "f_subjects"
    "f_publisher"
    "f_placeofpublication"  # See http://wiki.apache.org/solr/SimpleFacetParameters for info

The query text is passed to the Solr instance verbatim, so it is possible to do complex queries within the textbox, according to normal Solr/Lucene syntax. See http://wiki.apache.org/solr/SolrQuerySyntax for some generic documentation. The basics of the more advanced search are as follows, however:

  • field:query — search only within a given field,

eg ‘contributor:"Dickens, Charles"’

Note that query text within quotes is searched for as declared. The above search will not hit an author value of “Charles Dickens”, for example (which is why the above is not a good way to search generically).

  • Booleans, AND and OR — if left out, multiple queries will be OR’d

eg ‘contributor:Dickens contributor:Charles’ == ‘contributor:Dickens OR contributor:Charles’

The above will match contributors who are called ‘Charles’ OR ‘Dickens’ (non-exclusively), which is unlikely to be what is desired. ‘Charles Smith’ and ‘Eliza Dickens’ would be valid hits in this search.

‘contributor:Dickens AND contributor:Charles’ would be closer to what is intended.

  • URI matching — many fields include the URI and these can be used to be specific about the match

eg ‘contributor:"http://bibliographica.org/entity/E200000"’

Given an entity URI therefore, you can see which items are published/contributed/etc just by performing a search for the URI in that field.
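Since unquoted or mis-quoted values are the most common source of surprising results, the clauses above can also be assembled programmatically. A minimal sketch (the helper names are hypothetical; they simply produce strings in the Solr/Lucene syntax described above):

```python
def fielded_query(field, value, exact=True):
    """Build a single field:query clause; quote the value for an exact phrase match."""
    if exact:
        value = '"%s"' % value.replace('"', '\\"')  # escape embedded quotes
    return "%s:%s" % (field, value)


def and_query(*clauses):
    """Join clauses with an explicit AND (Solr ORs them if the operator is left out)."""
    return " AND ".join(clauses)


print(fielded_query("contributor", "Dickens, Charles"))
# prints: contributor:"Dickens, Charles"
print(and_query(fielded_query("contributor", "Dickens", exact=False),
                fielded_query("contributor", "Charles", exact=False)))
# prints: contributor:Dickens AND contributor:Charles
```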

Basic Solr Updating:

The ‘solrpy’ library is used to talk to a Solr instance, so consult that project for library-specific documentation. (Use >=0.9.4, as this includes basic auth.)

Fundamentally, to update the index, you need an Entry (model/entry.py) instance mapped to the item you wish to (re)index and a valid SolrConnection instance.

from solr import SolrConnection, SolrException
s = SolrConnection("http://host/solr", http_user="usernamehere", http_pass="passwordhere")
e = Entry.get_by_uri("Entry Graph URI")

Then it’s straightforward (catching two typical errors that might be thrown due to a bad or incorrectly configured Solr connection):

from socket import error as SocketError
try:
    s.add(e.to_solr_dict())
    # Uncomment the next line to commit updates (inadvisable to do after every small change of a bulk update):
    # s.commit()
except SocketError:
    print "Solr isn't responding or isn't there"
    # Do something here about it
except SolrException:
    print "Something wrong with the update that was sent. Make sure the solr instance has the correct schema in place and is working and that the Entry has something in it."
    # Do something here, like log the error, etc

Bulk Solr updating from nquads:

There is a paster command for taking the nquads Bibliographica.org dataset, parsing it into mapped Entry instances and then performing the above.

    Usage: paster indexnquads [options] config.ini NQuadFile
    Create Solr index from an NQuad input

Options:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config=CONFIG_FILE
                        Configuration File
  -b BATCHSIZE, --batchsize=BATCHSIZE
                        Number of solr 'docs' to combine into a single update
                        request document
  -j TOJSON, --json=TOJSON
                        Do not update solr - entry's solr dicts will be
                        json.dumped to files for later solr updating

The --json option is particularly useful for production systems, as the time-consuming part is the parsing and mapping to Entry instances; you can offload that drain to any computer and upload the solrupdate*.json files it creates directly to the production system for rapid indexing.

NOTE! This will start with solrupdate0.json and iterate upwards. IT WON'T CHECK for the existence of previous solr updates, and they will be overwritten!

[I used a batchsize of 10000 when using the json export method]

Bulk Solr updating from aforementioned solrupdate*.json:

    paster indexjson [options] config.ini solrupdate*
    Create Solr index from a JSON serialised list of dicts

Options:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config=CONFIG_FILE
                        Configuration File
  -C COMMIT, --commit=COMMIT
                        COMMIT the solr index after sending all the updates
  -o OPTIMISE, --optimise=OPTIMISE
                        Optimise the solr index after sending all the updates
                        and committing (forces a commit)

eg

    paster indexjson development.ini --commit=True /path/to/solrupdate*


Medline dataset

Announcing the CC0 Medline dataset

We are happy to report that we now have a full, clean public domain (CC0) version of the Medline dataset available for use by the community.

What is the Medline dataset?

The Medline dataset is a subset of bibliographic metadata covering approximately 98% of all PubMed publications. The dataset comes as a package of approximately 653 XML files, chronologically listing records by the date each record was created. There are approximately 19 million publication records.

Medline is a maintained dataset, and updates are chronologically appended to the current dataset.

Read our explanation of the different PubMed datasets for further information.

Where to get it

The raw dataset can be downloaded from CKAN: http://ckan.net/package/medline

What is in a record

Most records contain useful non-copyrightable bibliographic metadata such as author, title, journal, PubMed record ID. Many also have DOIs. We have stripped out any potentially copyrightable material such as abstracts.

Read our technical description of a record for further information.
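As a rough illustration of what working with the stripped records looks like, here is a Python sketch that pulls the basic fields out of one citation element. The element names follow the standard Medline XML DTD (MedlineCitation, PMID, ArticleTitle, AuthorList); treat them as assumptions and check our technical description above for the exact layout of this dataset:

```python
import xml.etree.ElementTree as ET


def parse_citation(elem):
    """Extract basic bibliographic fields from one <MedlineCitation> element."""
    pmid = elem.findtext("PMID")
    title = elem.findtext("Article/ArticleTitle")
    authors = ["%s, %s" % (a.findtext("LastName"), a.findtext("ForeName"))
               for a in elem.findall("Article/AuthorList/Author")]
    journal = elem.findtext("Article/Journal/Title")
    return {"pmid": pmid, "title": title, "authors": authors, "journal": journal}


# Tiny hand-made example record (not real data):
sample = """<MedlineCitation>
  <PMID>12345678</PMID>
  <Article>
    <Journal><Title>Example Journal</Title></Journal>
    <ArticleTitle>An example article</ArticleTitle>
    <AuthorList><Author><LastName>Smith</LastName><ForeName>Jane</ForeName></Author></AuthorList>
  </Article>
</MedlineCitation>"""

print(parse_citation(ET.fromstring(sample)))
```

In the real files, each of these citation elements sits inside a large container document, so in practice you would iterate over them (e.g. with ElementTree's iterparse) rather than load a whole file at once.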

Sample usage

We have made an online visualisation of a sample of the Medline dataset; however, the visualisation relies on WebGL, which is not yet supported by all browsers. It should work in Chrome and probably Firefox 4.

This is just one example, but shows what great things we can build and learn from when we have open access to the necessary data to do so.
