Open Bibliography and Open Bibliographic Data » projectMethodology
http://openbiblio.net
Open Bibliographic Data Working Group of the Open Knowledge Foundation

Open source development – how we are doing
http://openbiblio.net/2012/05/29/open-source-development-how-we-are-doing/
Tue, 29 May 2012 11:24:17 +0000

Whilst at Open Source Junction earlier this year, I talked to Sander van der Waal and Rowan Wilson about the problems of doing open source development. Sander and Rowan work at OSS Watch, whose aim is to make sure that open source software development delivers its potential to UK higher education institutions and research; so I thought it would be good to get their feedback on how our project is doing, and on anything we are getting wrong or could improve on.

It struck me that, as other JISC projects such as ours are required to make their output similarly publicly available, this discussion may benefit others; after all, not everyone knows what open source software is, let alone the complexities that can arise from trying to create such software. Whilst we cannot avoid all such complexities, we can at least detail what we have found helpful to date, and how OSS Watch view our efforts.

I provided Sander and Rowan with a review of our project, and Rowan responded with feedback confirming that overall we are doing a good job, although we lack a listing of the other open source software our project relies on, together with its licences. Whilst such information can be discerned from the dependencies of the project, this is not clear enough; I will add a written list of dependencies to the README.

The response we received is provided below, followed by the overview I initially provided, which briefly describes how we have managed our open source development efforts:

==== Rowan Wilson, OSS Watch, responds:

Your work on this project is extremely impressive. You have the systems in place that we recommend for open development and creation of community around software, and you are using them. As an outsider I am able to quickly see that your project is active and the mailing list and roadmap present information about ways in which I could participate.

One thing I could not find, although this may be my fault, is a list of third party software within the distribution. This may well be because there is none, but it’s something I would generally be keen to see for the purposes of auditing licence compatibility.

Overall though I commend you on how tangible and visible the development work on this project is, and on the focus on user-base expansion that is evident on the mailing list.

==== Mark MacGillivray wrote:

Background – May 2011, OKF / AIM bibserver project

Open Knowledge Foundation contracted with American Institute of
Mathematics under the direction of Jim Pitman in the dept. of Maths
and Stats at UC Berkeley. The purpose of the project was to create an
open source software repository named BibServer, and to develop a
software tool that could be deployed by anyone requiring an easy way
to put and share bibliographic records online.

A repository was created at http://github.com/okfn/bibserver, and it
performs the usual logging of commits and other activities expected of
a modern DVCS. This work was completed in September 2011, and the repository has been available since the start of the project with a GNU Affero GPL v3 licence attached.

October 2011 – JISC Open Biblio 2 project

The JISC Open Biblio 2 project chose to build on the open source
software tool named BibServer. As there was no support from AIM for
maintaining the BibServer repository, the project took on maintenance
of the repository and all further development work, with no change to
previous licence conditions.

We made this choice as we perceive open source licensing as a benefit
rather than a threat; it fit very well with the requirements of JISC
and with the desires of the developers involved in the project. At
worst, an owner may change the licence attached to some software, but
even in such a situation we could continue our work by forking from
the last available open source version (presuming that licence
conditions cannot be altered retrospectively).

The code continues to display the licence under which it is available,
and remains publicly downloadable at http://github.com/okfn/bibserver.
Should this hosting resource become publicly unavailable, an
alternative public host would be sought.

Development work and discussion has been managed publicly, via a
combination of the project website at
http://openbiblio.net/p/jiscopenbib2, the issue tracker at
http://github.com/okfn/bibserver/issues, a project wiki at
http://wiki.okfn.org/Projects/openbibliography, and via a mailing list
at openbiblio-dev@lists.okfn.org

February 2012 – JISC Open Biblio 2 offers bibsoup.net beta service

In February the JISC Open Biblio 2 project announced a beta service
available online for free public use at http://bibsoup.net. The
website runs an instance of BibServer, and highlights that the code is
open source and available (linking to the repository) to anyone who
wishes to use it.

Current status

We believe that we have made sensible decisions in choosing open
source software for our project, and have made all efforts to promote
the fact that the code is freely and publicly available.

We have found the open source development paradigm to be highly
beneficial – it has enabled us to publicly share all the work we have
done on the project, increasing engagement with potential users and
also with collaborators; we have also been able to take advantage of
other open source software during the project, incorporating it into
our work to enable faster development and improved outcomes.

We continue to develop code for the benefit of people wishing to
publicly put and share their bibliographies online, and all our
outputs will continue to be publicly available beyond the end of the
current project.

"Bundling" instances of author names together without using owl:sameas
http://openbiblio.net/2010/11/17/bundling-instances-of-author-names-together-without-using-owlsameas/
Wed, 17 Nov 2010 13:16:20 +0000

Bundling?

It’s a verb I’ve taken from Glaser, H., Millard, I., Jaffri, A., Lewy, T. and Dowling, B. (2008) On Coreference and The Semantic Web (http://eprints.ecs.soton.ac.uk/15765/), where the core idea is that you have a number of URIs that mean or reference the same real thing, and the technique of bundling they describe aggregates all those references together. Their technique is built on a sound basis in logic, and is related to (if not the same as) a congruence closure.

The notion of bundling I am using is not as rooted in mathematical logic, because I need to convey an assertion that one URI is meant to represent the same thing as another URI in a given context and for a given reason. This is a subtly different assertion from what ‘owl:sameAs’ makes, but the difference is key for me.

It is best to think through an example of where I am using this – curating bibliographic records and linking authors together.

It’s an obvious desire: given a book or article, find all the other works by an author of that work. Technologically, with RDF this is a very simple proposition – but the data needs to be there. This is the point where we come unstuck. We don’t really have data of the quality that firmly establishes that one author is the same as another. String matching is not enough!
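To make the proposition concrete, here is a toy sketch of that query, using an in-memory list of triples in place of a real triplestore; the work and author URIs (and the use of dcterms:creator) are invented for the illustration:

```python
# A toy "works by an author" query over an in-memory list of triples,
# standing in for a real triplestore. All URIs here are invented.
DC_CREATOR = "http://purl.org/dc/terms/creator"

triples = [
    ("info:work/1", DC_CREATOR, "info:author/pitman"),
    ("info:work/2", DC_CREATOR, "info:author/pitman"),
    ("info:work/3", DC_CREATOR, "info:author/glaser"),
]

def works_by(author_uri, triples):
    """Return every subject that has author_uri as its creator."""
    return [s for (s, p, o) in triples if p == DC_CREATOR and o == author_uri]

print(works_by("info:author/pitman", triples))  # ['info:work/1', 'info:work/2']
```

With the data in place the query really is trivial; the whole difficulty, as above, is that real datasets do not give us a single URI per author to match on.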

So, how do we clean up this data (converted to RDF) so that we can try to stitch together the authors and other entities in them?

See this previous post on augmenting British Library metadata so that the authors, publishers and so on become externally reference-able once they are given unique URIs. This really is the key step. Any further work that makes the data about the authors more semantically reference-able will be a boon to the process of connecting the dots, as I have done for authors with birth and/or death dates.
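As a sketch of that key step, the following mints a stable, reference-able URI from an author name; the base URI, the slug scheme and the folding-in of birth/death dates are illustrative assumptions, not the scheme actually used for the British Library data:

```python
import re
import unicodedata

def mint_author_uri(name, birth=None, death=None,
                    base="http://example.org/id/person/"):
    """Mint a stable URI for an author name (illustrative scheme only).

    Birth/death dates, when known, are folded into the slug to help keep
    distinct people with identical names apart.
    """
    # Fold accented characters to plain ASCII, e.g. "Müller" -> "Muller".
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    # Collapse any run of non-alphanumerics into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    parts = [slug] + [d for d in (birth, death) if d]
    return base + "-".join(parts)

print(mint_author_uri("Glaser, H."))                 # .../glaser-h
print(mint_author_uri("Pitman, Jim", birth="1949"))  # .../pitman-jim-1949
```

Once every author string in a record has a URI like this attached, the linking work below has something firmer than raw strings to operate on.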

The fundamental aspect to realise is that we are dealing with datasets which have missing data, misrepresented data (typos), misinterpreted fields (a price of £2.50 in the ISBN field, for example) and other non-uniform and irregular problems. Connecting authors together in datasets with these characteristics will rely on us, and on the code we write, making educated guesses and probabilistic assertions based on how confident we are that things match.

We cannot say for sure that something is a cast-iron match, only that we are above a certain level of confidence that it is. We also need a good reason for saying so.
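A minimal sketch of such a confidence-limited match, using simple string similarity from the standard library; the 0.8 threshold and the sample names are illustrative, not recommended values:

```python
from difflib import SequenceMatcher

def match_confidence(name_a, name_b):
    """Score two author-name strings between 0.0 and 1.0."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

CONFIDENCE_LIMIT = 0.8  # illustrative threshold, not a recommended value

pairs = [
    ("Glaser, Hugh", "Glaser, H."),
    ("Glaser, Hugh", "Millard, Ian"),
]
for a, b in pairs:
    score = match_confidence(a, b)
    if score >= CONFIDENCE_LIMIT:
        print("probable match (%.2f): %r ~ %r" % (score, a, b))
    else:
        print("no match (%.2f): %r / %r" % (score, a, b))
```

A real matcher would weigh more evidence than edit distance (dates, co-authors, publishers), but the shape is the same: a score, a threshold, and a recorded reason rather than a flat assertion of identity.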

Something else to take on board is that what I would consider a good match might not be good for someone else, so there needs to be a way to state a connection and to record why, by whom and how the match was made, as well as a need to keep this data of assertions separate from our source data.

I’ve adopted the following model for encoding this assertion in RDF, in a form that sits outside the source data as a kind of overlay; you can find the bundle ontology I’ve used at http://purl.org/net/bundle.rdf (pay no attention to where it currently lives):

[Figure: Bundle of URIs, showing use of OPMV]

The URIs shown as ‘opmv:used’ in this diagram are not meant to be exhaustive. A bundle is likely to depend on a look-up or resolution service, external datasheets, authority files, CSV lists, dictionary lists and so on.

Note that the ‘Reason’ class has few, if any, mandatory properties aside from its connection to a given Bundle and opmv:Process. For now, assessing whether you trust a Bundle rests largely on its source and the agent that made the assertion. As things mature, more information will routinely find its place attached to a ‘Reason’ instance.

There are currently two subtypes of Reason: AlgorithmicReason and AgentReason. Straightforwardly, this is the difference between a machine-made match and a human-made match and use of these should aid the assessment of a given match.

Creating a bundle using Python:

I have added a few classes to Will Waites’ excellent ‘ordf’ library, and you can find my version here. To create a virtualenv to work within, do as follows. You will need mercurial and virtualenv already installed:

At a command line – e.g. ‘[@localhost] $’ – enter the following (note that the final command installs the freshly cloned copy of ordf, not the PyPI release):

$ hg clone http://bitbucket.org/beno/ordf
$ virtualenv myenv
$ . ./myenv/bin/activate
(myenv) $ pip install ./ordf

So, creating a bundle of two URIs – “info:foo” and “info:bar” – due to a human’s judgement of “They look the same to me :)”:

In Python:


from ordf.vocab.bundle import Bundle, Reason, AlgorithmicReason, AgentReason
from ordf.vocab.opmv import Agent
from ordf.namespace import RDF, BUNDLE, OPMV, DC  # namespaces you are likely to use yourself
from ordf.term import Literal, URIRef             # for adding arbitrary triples

# Or, if you don't want a bnode for the Bundle URI:
#   b = Bundle(identifier="http://example.org/1")
# NB this also instantiates empty bundle.Reason and opmv.Process instances
# in b.reason and b.process, which are used to create the final combined
# graph at the end.
b = Bundle()

b.encapsulate(URIRef("info:foo"), URIRef("info:bar"))

# We don't want the default plain Reason, we want a human reason
# (again, pass an identifier="..." keyword argument to set the URI if you wish):
r = AgentReason()
r.comment("They look the same to me :)")

# Let them know who made the assertion:
a = Agent()
a.nick("benosteen")
a.homepage("http://benosteen.com")

# Add this agent as the controller of the process:
b.process.agent(a)

# Create an in-memory graph of all the triples required to assert this bundle:
g = b.bundle_graph()

# The easiest way to get it out is to "serialize" it:
print g.serialize()

==============

Output:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:bundle="http://purl.org/net/bundle#"
   xmlns:foaf="http://xmlns.com/foaf/0.1/"
   xmlns:opmv="http://purl.org/net/opmv/ns#"
   xmlns:ordf="http://purl.org/NET/ordf/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
>
  <rdf:Description rdf:nodeID="PZCNCkfJ2">
    <rdfs:label> on monster (18787)</rdfs:label>
    <ordf:hostname>monster</ordf:hostname>
    <ordf:pid rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">18787</ordf:pid>
    <opmv:wasControlledBy rdf:nodeID="PZCNCkfJ9"/>
    <ordf:version rdf:nodeID="PZCNCkfJ4"/>
    <rdf:type rdf:resource="http://purl.org/net/opmv/ns#Process"/>
    <ordf:cmdline></ordf:cmdline>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ0">
    <bundle:encapsulates rdf:resource="info:bar"/>
    <bundle:encapsulates rdf:resource="info:foo"/>
    <bundle:justifiedby rdf:nodeID="PZCNCkfJ5"/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
    <rdf:type rdf:resource="http://purl.org/net/bundle#Bundle"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ5">
    <rdf:type rdf:resource="http://purl.org/net/bundle#Reason"/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ9">
    <foaf:nick>benosteen</foaf:nick>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/>
    <foaf:homepage rdf:resource="http://benosteen.com"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ4">
    <rdfs:label>ordf</rdfs:label>
    <rdf:value>0.26.391.901cf0a0995c</rdf:value>
  </rdf:Description>
</rdf:RDF>


Given a triplestore containing these bundles, you can query for ‘same as’ URIs by finding which Bundles a given URI appears in.
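The lookup logic can be sketched without a triplestore: treating each bundle:encapsulates statement as a (bundle, member) pair, the ‘same as’ candidates for a URI are the other members of any Bundle containing it. A pure-Python sketch with invented bundle URIs (in a real store the same pattern would be a two-triple SPARQL query):

```python
from collections import defaultdict

# bundle:encapsulates statements as (bundle URI, member URI) pairs;
# the bundle URIs here are invented for the example.
encapsulates = [
    ("http://example.org/bundle/1", "info:foo"),
    ("http://example.org/bundle/1", "info:bar"),
    ("http://example.org/bundle/2", "info:bar"),
    ("http://example.org/bundle/2", "info:baz"),
]

def same_as(uri, statements):
    """Return the URIs that share at least one Bundle with `uri`."""
    members = defaultdict(set)
    for bundle, member in statements:
        members[bundle].add(member)
    found = set()
    for group in members.values():
        if uri in group:
            found |= group
    found.discard(uri)
    return found

print(sorted(same_as("info:bar", encapsulates)))  # ['info:baz', 'info:foo']
```

Unlike owl:sameAs, nothing here is transitive by fiat: each candidate arrives attached to a Bundle, whose Reason and Process can be inspected before you decide to trust it.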

JISC OpenBibliography: Projected Timeline, Workplan & Overall Project Methodology
http://openbiblio.net/2010/08/31/jisc-openbibliography-projected-timeline-workplan-overall-project-methodology/
Tue, 31 Aug 2010 15:40:40 +0000

The JISC OpenBiblio project is scheduled to run from 1st July 2010 to 31st March 2011. During that time, the project will run two-week iterative development cycles, each including (for links to trac, the code repository, the wiki etc. see the project resources page):

  • Weekly meetings.
  • Technical lead reports on development since the last meeting; incomplete functionality is moved into the next development cycle or abandoned (if at the end of a cycle).
  • Advocacy lead reports on activity since the last meeting; the team should be updated about recent advocacy successes and about events soon to take place.
  • Team discusses and decides technical developments for the next development cycle.
  • A project blog post should be written each time a success occurs, and referenced to a deliverable listed under the work packages in the JISC project bid.
  • A project blog post should be written describing obstacles causing delay to any functionality aims or advocacy successes.
  • Team members should identify topics raised via the mailing lists that need further consideration. This is to manage how mailing list discussions become documented parts of the project; anything from the mailing list that becomes significant to the project should be documented in a blog post, comment or trac task as appropriate.
  • Team members keep notes in the meeting minutes document.
  • Technical lead (or others) updates trac as necessary to keep note of technical aims, successes and failures. The trac should be viewed as a resource for the technical lead to report to the team, and as a source of information for writing up the project report, but not as a project management tool.
  • OKF project wiki and JISC expo spreadsheet should be updated as required to reflect any changes in the location of key documents.

Rather than aiming to develop a specific product, the aim is to develop as much useful output as possible before the project deadline, ideally meeting or exceeding the deliverables defined under the work packages in the JISC project bid.

This approach suits the nature of the project – a significant amount of advocacy is required to convince data publishers of the benefits of open access to bibliographic information, and this work in itself should not be overlooked. Achievements in attaining open agreements will therefore lead to further development opportunities. Overall success can be measured against three strands:

  1. Publicity / advocacy successes – e.g. a good response at a conference to a discussion of the project goals.
  2. Agreements to provide open data – when data providers actually commit to allowing access to their datasets; this is a specific achievement over and above those in point 1.
  3. Technical developments – with access to open data sets, develop examples of how they can be put to valuable use for the community; this should feed back into point 1, leading to more of point 2, and so on.