Open Bibliography and Open Bibliographic Data » benosteen

“Full-text” search for openbiblio, using Apache Solr

benosteen — Wed, 25 May 2011 13:55:34 +0000

Overview:

This provides a simple search interface for openbiblio, using a network-addressable Apache Solr instance to provide FTS over the content.

The indexer currently relies on the Entry Model (in /model/entry.py) to provide an acceptable dictionary of terms to be fed to a solr instance.

Configuration:

In the paster main .ini, you need to set the param ‘solr.server’ to point to the solr instance. For example, ‘http://localhost:8983/solr’ or ‘http://solr.okfn.org/solr/bibliographica.org’. If the instance requires authentication, set the ‘solr.http_user’ and ‘solr.http_pass’ parameters too. (Solr is often put behind a password-protected proxy, due to its lack of native authentication for updating the index.)

Basic usage:

The search controller: solr_search.py (linked in config/routing.py to /search)

Provides HTML and JSON responses (content-negotiation or .html/.json as desired) and interprets a limited but easily expandable subset of Solr params (see ALLOWED_TERMS in the /controller/solr_search.py.)

JSON response is the raw solr response as this is quite usable in javascript.

HTML response is styled in the same manner as the previous (xapian-based?) search service, with the key template function formatting each row in templates/paginated_common.html – genshi function “solr_search_row”. Unless specified, the search controller will get all the fields it can for the search terms, meaning that the list of resuts in c.solr.results contain dicts with much more information than is currently exposed. The potentially available fields are as follows:

    "uri"          # URI for the item - eg http://bibligraphica.org/entry/BB1000
    "title"        # Title of the item
    "type"         # URI type(s) of the item (eg http://.... bibo#Document)
    "description"
    "issued"       # Corresponds to the date issued, if given.
    "extent"
    "language"     # ISO formatted, 3 lettered - eg 'eng'
    "hasEditionStatement"

    "replaces"        # Free-text entry for the work that this item supercedes
    "isReplacedBy"    # Vice-versa above

    "contributor"           # Author, collaborator, co-author, etc
                            # Formatted as "John Smith b1920 "
                            # Use lib/helpers.py:extracturi method to add formatting.
                            # Give it a list of these sorts of strings, and it will return
                            # a list of tuples back, in the form ("John Smith b1920", "http...")
                            # or ("John Smith", "") if no -enclosed URI is found.
    "contributor_filtered"  # URIs removed
    "contributor_uris"      # Just the entity URIs alone

    "editor"                # editor and publisher are formatted as contributor
    "publisher"
    "publisher_uris"        # list of publisher entity URIs

    "placeofpublication"    # Place of publication - as defined in ISBD. Possible and likely to
                            # have multiple locations here

    "keyword"               # Keyword (eg not ascribed to a taxonomy)
    "ddc"                   # Dewey number (formatted as contributor, if accompanied by a URI scheme)
    "ddc_inscheme"          # Just the dewey scheme URIs
    "lcsh"                  # eg "Music "
    "lcsh_inscheme"         # lcsh URIs

    "subjects"              # Catch-all,with all the above subjects queriable in one field.

    "bnb_id"                # Identifiers, if found in the item
    "bl_id"
    "isbn"
    "issn"
    "eissn"
    "nlmid"                 # NLM-specific id, used in PubMed
    "seeAlso"               # URIs pertinent to this item

    "series_title"          # If part of a series: (again, formatted like contributor)
    "series_uris"

    "container_title"       # If it has some other container, like a Journal, or similar
    "container_type"

    "text"                  # Catch-all and default search field.
                            # Covers: title, contributor, description, publisher, and subjects

    "f_title"               # Fields indexed to be suitable for facetting
    "f_contributor"         # Contents as above
    "f_subjects
    "f_publisher"
    "f_placeofpublication"  # See http://wiki.apache.org/solr/SimpleFacetParameters for info

The query text is passed to the solr instance verbatim, so it is possible to do complex queries within the textbox, according to normal solr/lucene syntax. See http://wiki.apache.org/solr/SolrQuerySyntax for some generic documentation. The basics of the more advanced search are as follows however:

field:query — search only within a given field,

eg ‘contributor:”Dickens, Charles”‘

Note that query text within quotes is searched for as declared. The above search will

not hit an author value of “Charles Dickens” for example (and why the above is not a good

way to search generically.)

Booleans, AND and OR — if left out, multiple queries will be OR’d

eg ‘contributor:Dickens contributor:Charles’ == ‘contributor:Dickens OR contributor:Charles’

The above will match contributors who are called ‘Charles’ OR ‘Dickens’ (non-exclusively), which is unlikely to be what is desired. ‘Charles Smith’ and ‘Eliza Dickens’ would be valid hits in this search.

‘contributor:Dickens AND contributor:Charles’ would be closer to what is intended.

URI matching — many fields include the URI and these can be used to be specific about the match

eg ‘contributor:”http://bibliographica.org/entity/E200000″‘

Given an entity URI therefore, you can see which items are published/contributed/etc just by performing a search for the URI in that field.

Basic Solr Updating:

The ‘solrpy’ library is used to talk to a Solr instance and so seek that project out for library-specific documentation. (>=0.9.4 as this includes basic auth)

Fundamentally, to update the index, you need an Entry (model/entry.py) instance mapped to the item you wish to (re)index and a valid SolrConnection instance.

from solr import SolrConnection, SolrException
s = SolrConnection("http://host/solr", http_user="usernamehere", http_pass="passwordhere")
e = Entry.get_by_uri("Entry Graph URI")

Then, it’s straightforward: (catching two typical errors that might be thrown due to a bad or incorrectly configured Solr connection.)

from socket import error as SocketError
try:
    s.add(e.to_solr_dict())
    # Uncomment the next line to commit updates (inadvisable to do after every small change of a bulk update):
    # s.commit()
except SocketError:
    print "Solr isn't responding or isn't there"
    # Do something here about it
except SolrException:
    print "Something wrong with the update that was sent. Make sure the solr instance has the correct schema in place and is working and that the Entry has something in it."
    # Do something here, like log the error, etc

Bulk Solr updating from nquads:

There is a paster command for taking the nquads Bibliographica.org dataset, parsing this into mapped Entry’s and then performing the above.

    Usage: paster indexnquads [options] config.ini NQuadFile
Create Solr index from an NQuad input

Options:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config=CONFIG_FILE
                        Configuration File
  -b BATCHSIZE, --batchsize=BATCHSIZE
                        Number of solr 'docs' to combine into a single update
                        request document
  -j TOJSON, --json=TOJSON
                        Do not update solr - entry's solr dicts will be
                        json.dumped to files for later solr updating

The –json option is particularly useful for production systems, as the time consuming part of this is the parsing and mapping to Entry’s and you can offload that drain to any computer and upload the solrupdate*.json files it creates directly to the production system for rapid indexing.

NOTE! This will start with solrupdate0.json and iterate up. IT WONT CHECK for existence of previous solr updates and they will be overwritten!

[I used a batchsize of 10000 when using the json export method]

Bulk Solr updating from aforementioned solrupdate*.json:

    paster indexjson [options] config.ini solrupdate*
    Create Solr index from a JSON serialised list of dicts

Options:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config=CONFIG_FILE
                        Configuration File
  -C COMMIT, --commit=COMMIT
                        COMMIT the solr index after sending all the updates
  -o OPTIMISE, --optimise=OPTIMISE
                        Optimise the solr index after sending all the updates
                        and committing (forces a commit)

“paster indexjson development.ini –commit=True /path/to/solrupdate*”

Academic Bibliography data available from Acta Cryst E

benosteen — Wed, 12 Jan 2011 10:08:05 +0000

The bibliographic data from Acta Cryst E, a publication by the International Union of Crystallography (IUCr), has been extracted and made available with their consent.

You can find a SPARQL endpoint for the data here and the full dataset here.

I have also geocoded a number of the affiliations of the authors, plotting them on a timemap (visualising the time of publication against the location of the authors), and you can see this at this location.

What you will find:

A SPARQL endpoint with limited output capabilities (limited content negotiation).
A ‘describe’ method, to display unstyled HTML pages about authors or the papers, based on the given URI.
Links from the data in the service to the original papers.
The data dump consists of a zipped up directories of rdf, which have most of the intermediary xml, html and other bits removed. Hopefully, this helps explain the odd layout!

Name matching strategy using bibliographic data

benosteen — Wed, 01 Dec 2010 17:19:13 +0000

One of the aims of an RDF representation of bibliographic data should be to have authors represented by unique, reference-able points within the data (as URIs), rather than as free-text fields. What steps can we do to match up the text value representing an author’s name to another example of their name in the data?

It’s not realistic to expect a match between say, Mark Twain to Samuel Clemens, without using some extra information typically not present in bibliographic datasets. What can be achieved however, is the ‘fuzzy’ matching of alternate forms of names – due to typos, mistakes and omitted initials and the like. It is important that these matches are understood to be fuzzy and not precise, based more on statistics than a definite assertion.

How best to carry out the matching of a set of authors within a bibliographic dataset? This is not the only way, but it is a useful method to make progress with:

List – Gather a list of the things you wish to match, with unique identifiers for each and map out a list of the pairs of names that are required to be compared. (Note, that this mapping will be greatly affected by the next step.)
Filter – Remove the matches that aren’t worth fully evaluating. An index of the text can give a qualitative view on which names are worth comparing and which are not.
Compare – Run through the name-pairs and evaluate the match (likely using string metrics of some kind). The accuracy of the match may be improved by using other data, with some sorts of data drastically improving your chances, such as author email, affiliation (and date) and birth and death dates.
Binding – Bind the names together in whichever manner required. I would recommend creating Bundles as a record of a successful match, and an index or sparql-able service to allow the ‘sameas’ style lookups in a live service.

In terms of the BL dataset within http://bnb.bibliographica.org then:

List:

We have had to apply a form of an identifier for each instance of an author’s name within the BL dataset. Currently, this is done via a ‘owl:sameas’ property on the original blank node linking to a URI of our own making, eg http://bibliographica.org/entity/735b0…12d033. It would be a lot better if the BL were to mint their own URIs for this, but in the meantime, this gives us enough of a hook to begin with.

One way you might gather the names and URIs is via SPARQL:

PREFIX dc: 
PREFIX bibo: 
PREFIX foaf: 
PREFIX skos: 
PREFIX owl: 
SELECT DISTINCT ?book ?authoruri ?name
WHERE {
    ?book a bibo:Book .
    ?book dc:contributor ?authorbn .
    ?authorbn skos:notation ?name .
    ?authorbn owl:sameAs ?authoruri .
}

However, there will be very many results to page through, and it will put a lot of stress on the SPARQL engine if lots of users are doing this heavy querying at the same time!

This is also a useful place to gather any extra data you will use at the compare stage (usually email, affiliation or in this particular case, the potential birth and death dates).

Filter:

This is the part that is quite difficult as you have to work out a method for negating matches without the cost of a full match. If the filter method is slower than simply working through all the matches, then it is not worth doing the step. In matching names from the BL set however, there are many millions of values, but from glancing over the data, I only expect tens of matches or fewer on average for a given name.

The method I am using is to make a simple stemming index of the names, with the birth/death dates as extra fields. This I have done in Solr (experimenting with different stemming methods) but come to an odd conclusion that a default english stemming provides suitable groupings. I found this was backed up somewhat by this comparison of string metrics [PDF]. It suggests that a simple index combined with a metric called ‘Jaro’ works well for names.

So, in this case, I generate the matching by running the names through an index of all the names and using the most relevant search results as the base for the pairs of names to be matched. The pairs are combined into a set, ordered alphabetically – only the pairing is necessary, not the ordering of the pair. This is so that we don’t end up matching the same names twice.

Compare:

This is the most awkward step – it is hard to generate a ‘golden set’ of data by which you can rate the comparison without using other data sources. However, the matching algorithm I am using is the Jaro comparison to get a figure indicating the similarity of the names. As the BL data is quite a good set of data (in that it is human-entered and care has been taken over the entry), this type of comparison is quite good – the difference between a positive and a negative match is quite high. Care must be taken to avoid missing false positives from omitted or included initials, middle names, proper forms and so on.

The additional data is quite dependant on the match between the names. If the names match perfectly, but the birth dates are very different (different in value and distant in edit distance), then this is likely to be a different author. If the names match somewhat, but the dates match perfectly, then this is a possible match. If the dates match perfectly, but the name doesn’t at all (unlikely due to the above filtering step) then this is not a match.

Binding:

This step I have not made my mind up about as the binding step is a compromise between recording. Bundling together all the names for a given author in a single bundle if they are below the threshold for a positive match, you get a bundle that requires fewer triples to describe it. However, you really should have a bundle for each pairing, but this dramatically increases the number of triples required to express it. Either way, the method for querying the resultant data is the same. For example:

A set of bundles ‘encapsulates’ A, B, C, D, E, F, G – so, given B, you can find the others by a quick, if inelegant, SPARQL query:

SELECT ?sameas
WHERE {
  ?bundle bundle:encapsulates  .
  ?bundle bundle:encapsulates ?sameas .
}

Whether this data should be collapsed into a closure of some sort is up to the administrator – how much must you trust this match before you can use owl:sameAs and incorporate it into a congruent closure? I’m not sure the methods outlined above can give you a strong enough guarantee to do so at this point.

Characterising the British Library Bibliographic dataset

benosteen — Thu, 18 Nov 2010 10:38:21 +0000

Having RDF data is good. Having Linkable data is better but having some idea of what sorts of properties you can expect to find within a triplestore or block of data can be crucial. That sort of broad-stroke information can be vital in letting you know when a dataset contains interesting data that makes the work to use it worthwhile.

I ran the recently re-released BL RDF data (get from here or here) (CC0) through a simple script that counted occurrences of various elements within the 17 files, as well as enumerating all the different sorts of property you can expect to find.

Some interesting figures:

Over 100,000 records in each file, 2.9 million ‘records’ in total. Each record is a blank node.

Three main types of identifier – a ‘(Uk)123….’, ‘GB123…’ and (as a literal) ‘URN:ISBN:123…’, but not all records have ISBNs as some of them predate it.

Nearly 29 million blank nodes in total.

11,187,804 uses of dcterms:subject, for an average of just under 4 per record (3.75…)
Uses properties from Dublin Core terms, OWL-Time, ISBD, and SKOS

dcterms:subject’s are all as SKOS declarations, and include the Dewey decimal, LCSH and MESH schemes. (Work to use id.loc.gov LCSH URIs instead of literals is underway)

Includes rare and valuable information, stored in properties such as dcterms:isPartOf, isReferencedBy, isReplacedBy, replaces, requires and dcterms:relation.

Google spreadsheet of the tallys

Occurrence trends through the 17 data files (BNBrdfdc01.xml –> 17.xml)

(The image is as Google spreadsheet exported it, click on the link above to go to the sheet itself to view it natively without axis distortion.)

Literals and what to expect:

I wrote another straightforward script that can mine sample sets of unique literals from the BNBrdfdc xml files.

Usage for ‘gather_test_literals.py’
Usage: python gather_test_literals.py path/to/BNBrdfdcXX.xml ns:predicate number_to_retrieve [redis_set_to_populate]
For example, to retrieve 10 literal values from the bnodes within dcterms:publisher in BNBrdfdc01.xml: python gather_test_literals.py BNBrdfdc01.xml "dcterms:publisher" 10 and to also push those values into a local Redis set 'publisherset01' if Redis is running and redis-py is installed:
python gather_test_literals.py BNBrdfdc01.xml "dcterms:publisher" 10 publisherset01

So, to find out what, at most, 10 of those intriguing ‘dcterms:isReferencedBy’ predicates contain in BNBrdfdc12.xml, you can run:

python gather_test_literals.py BNBrdfdc12.xml "dcterms:isReferencedBy" 10

(As long as gather_test_literals.py and the xml files are in the same directory of course)

Result:
Chemical abstracts, Soulsby no. 4061 Soulsby no. 3921 Soulsby no. 4018 Chemical abstracts

As the script gathers the literals into a set, it will only return when it has either reached the desired number of unique values, or has reached the end of the file.

Hopefully, this will help other people explore this dataset and also pull information from it. I have also created a basic Solr configuration that has fields for all the elements found in the BNB dataset here.

"Bundling" instances of author names together without using owl:sameas

benosteen — Wed, 17 Nov 2010 13:16:20 +0000

Bundling?

It’s a verb I’ve taken from ”Glaser, H., Millard, I., Jaffri, A., Lewy, T. and Dowling, B. (2008) On Coreference and The Semantic Web http://eprints.ecs.soton.ac.uk/15765/” where the core idea is that you have a number of URIs that mean or reference the same real thing, and the technique they describe of bundling is to aggregate all those references together. The manner in which they describe is built on a sound basis in logic, and is related to (if not the same as) a congruent closure.

The notion of bundling I am using is not as rooted in terms of mathematical logic, because I need to convey an assertion that one URI is meant to represent the same thing that another URI represents in a given context and for a given reason. This is a different assertion, if only subtly different, than ‘owl:sameas’ asserts, but the difference is key for me.

It is best to think through an example of where I am using this – curating bibliographic records and linking authors together.

It’s an obvious desire – given a book or article, to find all the other works by an author of that said work. Technologically, with RDF this is a very simple proposition BUT the data needs to be there. This is the point where we come unstuck. We don’t really have that quality of data that firmly establishes that one author is the same as a number of others. String matching is not enough!

So, how do we clean up this data (converted to RDF) so that we can try to stitch together the authors and other entities in them?

See this previous post on augmenting British Library metadata so that the authors, publishers and so on are externally reference-able once they are given unique URIs. This really is the key step. Any other work that can be done to make any of the data about the authors and so on more semantically reference-able will be a boon to the process of connecting the dots, as I have done for authors with birth and/or death dates.

The fundamental aspect to realise is that we are dealing with datasets which have missing data, misrepresented data (typos), misinterpreted fields (ISBNs of £2.50 for example) and other non-uniform and irregular problems. Connecting authors together in datasets with these characteristics will rely on us and code that we write making educated guesses, and probabilistic assertions, based on how confident we are that things match and so on.

We cannot say for sure that something is a cast-iron match, only that we are above a certain limit of confidence that this is so. We also have to have a good reason as well.

Something else to take on board is that what I would consider to be a good match might not be good for someone else so there needs to be a manner to state a connection and to say why, who and how this match was made as well as a need to keep this data made up of assertions away from our source data.

I’ve adopted the following model for encoding this assertion in RDF, in a form that sits outside of the source data, as a form of overlay data and you can find the bundle ontology I’ve used at http://purl.org/net/bundle.rdf (pay no attention to where it is currently living):

Click to view in full, unsquished form:

The URIs shown to be ‘opmv:used’ in this diagram are not meant to be exhaustive. It is likely that a bundle may depend on a look-up or resolution service, external datasheets, authority files, csv lists, dictionary lists and so on.

Note that the ‘Reason’ class has few, if any, mandatory properties aside from its connection to a given Bundle and opmv:Process. Assessing if you trust a Bundle at this moment is very much based on the source and the agent that made the assertion. As things get more mature, more information will regularly find its place attached to a ‘Reason’ instance.

There are currently two subtypes of Reason: AlgorithmicReason and AgentReason. Straightforwardly, this is the difference between a machine-made match and a human-made match and use of these should aid the assessment of a given match.

Creating a bundle using python:

I have added a few classes to Will Waites’ excellent ‘ordf’ library, and you can find my version here. To create a virtualenv to work within, do as follows. You will need mercurial and virtualenv already installed:

At a command line – eg ‘[@localhost] $’, enter the following:

hg clone http://bitbucket.org/beno/ordf virtualenv myenv . ./myenv/bin/activate (myenv) $ pip install ordf

So, creating a bundle of some URIs – “info:foo” and “info:bar”, due to a human choice of “They look the same to me :)”:

In python: code here

from ordf.vocab.bundle import Bundle, Reason, AlgorithmicReason, AgentReason
from ordf.vocab.opmv import Agent from ordf.namespace import RDF, BUNDLE, OPMV, DC # you are likely to use these yourself from ordf.term import Literal, URIRef # when adding arbitrary triples b = Bundle() """or if you don't want a bnode for the Bundle URI: b = Bundle(identifier="http://example.org/1")""" """ NB this also instantiates empty bundle.Reason and opmv.Process instances too in b.reason and b.process which are used to create the final combined graph at the end""" b.encapsulate( URIRef("info:foo"), URIRef("info:bar") ) """ we don't want the default plain Reason, we want a human reason:""" r = AgentReason() """ again, pass a identifier="" kw to set the URI if you wish""" r.comment("They look the same to me :)") """Let them know who made the assertion:""" a = Agent() a.nick("benosteen") a.homepage("http://benosteen.com") """ Add this agent as the controller of the process:""" b.process.agent(a) g = b.bundle_graph() # this creates an in-memory graph of all the triples required to assert this bundle """ easiest way to get it out is to "serialize" it:""" print g.serialize() ==============
Output:

on monster (18787) monster 18787 benosteen ordf 0.26.391.901cf0a0995c

Given a triplestore with these bundles, you can query for ‘same as’ URIs via which Bundles a given URI appears in.

Augmenting the British Library's RDF data to allow for disambiguation

benosteen — Wed, 17 Nov 2010 12:00:59 +0000

The British Library have released what they term the ‘British National Bibliography’ (BNB) under a permissive licence. This constitutes just under 3 million records, and is derived from the ‘most polished set of bibliographic data’ as some of it dates back a good number of years.

This effort is to be applauded and the data that is represented by that set is turning out to be reasonably high quality, with some errors due to XSLT problems rather than problems with the source data.

However, the RDF that is being created is very heavily dominated by blank nodes – each ‘record’ is an untyped blank node which has many properties that end in blank nodes, which in turn have rdf:value’s stating what that property’s value is.

For example:

Tooley's dictionary of mapmakers. Tooley, R. V. (Ronald Vere), 1898-1986 etc... ....

This has a number of drawbacks as much of the data is unlinkable – you cannot reference it outside of a given triplestore, as well as the author name information being mixed (includes date information), and RDF errors in the file.

Another issue is that the data is held in 17 very large xml files, which makes it very hard to address individual records as independent documents.

The first task is to augment this data such that:

The ‘records’, authors, publishers, and related item bnodes are given unique, globally referenceable URIs

The items themselves are given a type, based on the literal values present within bnodes linked to the item by the dc:type property. (eg bibo:Book)

Any MARC -> RDF/XML errors are cleaned up (notably, there are a few occasions of rdf:description, rather than rdf:Description in there)

For authors with more authoritative names (eg Smith, John, b. 1923 or similar), to break up the dates into a more semantic construction.

You can find the script that will do this augmentation at https://bitbucket.org/okfn/jiscobib/src/4ddaa37e44a2/BL_scripts/BLDump_convert_and_store.py

This script requires the lxml python module for xpath support as well as the pairtree module to store the records as individual documents. The script should be able to process all 17 files in a few hours, but make sure you have plenty of disc space.

It’s easiest to explain what is happening to the individual author/publisher nodes by use of a diagram:

For example, Original fragment:

Tooley, R. V. (Ronald Vere), 1898-1986

To:

Tooley, R. V. (Ronald Vere) 1898 1986

The URIs are generated by taking an md5 hash of a number of values, including the full line from the file it appears on, the extracted author’s name, and the number of lines through the file it is. The idea was to generate URIs that were as unique as possible, but reproducable from the same set of data if the script was reran.

By giving the books, musical scores, authors, publishers and related works externally addressable works, it allows for third-party datasets, such as sameas.org, to overlay their version of which things are the same.

You can then choose the overlay dataset which links together the authors and so on based on how much you trust the matching techniques of the service, rather than glomming both original data and asserted data together inextricably.

Some obvious URI patterns for a service?

benosteen — Tue, 26 Oct 2010 12:05:15 +0000

Whilst the technical issues and backends may vary, there are one or two URI patterns that may be adopted I think. It’s not REST, but it is a sensible structure I hope. (This is not to replace voID, but to accompany a voID description and other characterisation methods)

http://host/#catalog – URI for the catalog dataset

http://host/void
302 – conneg response to a voID description at .ttl, .rdf (xml), etc

http://host/describe/{uri} –
200 – responds with a conneg’d graph with the information a store ‘knows’ about a given URI. The HTML representation would likely be viewed as a ‘record’ page, insofar as this is valid for the item. (uses Content-Location: http://host/describe/{uri}/ttl etc rather than 302, due to load and network traffic cost.)
404 – doesn’t know about this uri

http://host/types
200 – voID-like Response based on the canned query ‘SELECT DISTINCT ?x WHERE {?foo rdf:type ?x)’ BUT with the addition of some lowest common denominator types. Can be easily cached. Filtering out the least important types is at the discretion of the service – this is not intended to be a complete set, but to publish the set of types that this service cares most about. Best shown by example (note that some predicates need to be minted/swapped for suitable ones. Shown by *):

a void:Dataset ; *containsType* ; *containsType* ; etc... void:uriLookupEndpoint ; etc... ; ;

Thoughts?

Data Triage Notes

benosteen — Wed, 22 Sep 2010 16:10:17 +0000

I’ve begun to write up my experiences and notes on the triage of the datasets I am processing for the JISC Open Bibliography and Citation projects, in a way that others might make sense of them.

You can find the WIP writeup here: http://knowledgeforge.net/pdw/trac/wiki/datatriage

This will include links to the source datasets and any subsequent curated data as I am able to put them up online.

Disambiguation, deduplication and 'ideals'

benosteen — Wed, 22 Sep 2010 12:38:45 +0000

(NB Republished from a mailing list conversation at http://lists.okfn.org/pipermail/open-bibliography/2010-August/000397.html – follow this link to see the comments and replies)

In my work on meshing bibliographic datasets together, I’ve been using a
conceptual tool that I would like to hear views on.

I am creating nodes for the ideals of things on records – whether that is
for people, journals or even the bibliographic document itself. The ideal
represents the best and most complete data for that thing – something we’ll
never really achieve, but that’s not the point. This ideal serves as a node,
a hook, on which we can join up records which describe the same thing
(person, frbr manifestation, etc) but which have differing data for.

It’s easy to consider it for ‘deduplications’ of say article references.
Consider two records, one from the ris feed from pubmed and one from a
citation in a plos article. These are found to be references to the same
article but as you can expect they differ, not just in terms of data but
also on terms of the source or author of that reference.

The way I am tackling this is by creating a node for the ideal bibliographic
reference each aspires to and when dupes are believed to be found, these
ideal nodes are joined into a bundle using sameas (in a different store) and
this bundle has some provenance triples recording the how when and why for
this merging (using open provenance model verbs/classes)

Eg:

:bibrec —> record node from pubmed

:citerec —> plos record

_i suffix —> ideal node

running analyser on record suggests two records are dupes, with a certain
confidence score from a certain weighted matching (call this ‘heur.v0.13’)

Create ideal nodes Just In Time:

:bibrec hasIdeal :bibrec_i :citerec hasIdeal :citerec_I

Make the bundle:

:b1 a Bundle sameas :bibrec_i sameas :citerec_I opmv:wasGeneratedBy :p1 created: 2010-08-...... :p1 a opmv:Process Opmv:controlledBy :Ben Opmv:used :bibrec Opmv:used :citerec :confidence a ConfidenceReport Opmv:wasGeneratedBy :p1 Hasreport # for time being

This structure let’s me create an aggregated rdf dataset with the best guess
ideal records at any one time. Also, bundles can be merged later if required
creating a tree structure – the top bundle instance and the ‘leaf’ records
form a congruent closure and are thus exportable as such without the admin
structure triples necessary for ongoing maintenance. The bundle notion comes
from the excellent work by the team at southampton, including Hugh glazer,
Ian milliard et al (google for coreference on the semantic web)

Using this technique for entities like people is actually very similar. If I
use the words ‘person’ and ‘persona’ for the ideal and the data in a record
respectively. The persona can have alternative spellings, and time-dependant
details like a fleeting institutional affiliation, and so on. The
(difficult) trick is spotting when two persona’s refer to the same person
but the process for merging is the same even if the creation of an
aggregated record for each is different.

Bibliographic models in RDF

benosteen — Fri, 10 Sep 2010 14:56:12 +0000

Put it in RDF to solve all your problems!

As with most things in life, the reality is often a little more complex. If you are old enough, you may well remember when this very same cry was often uttered, but with ‘RDF’ above replaced by ‘XML’ or if you are older still, ‘SGML’.

We haven’t quite reached the tipping point with bibliographic data in RDF so that a defacto model and structure has clearly emerged. There are plenty of contenders though, each based on differing models for how this data should be encapsulated in RDF. The main characteristic difference is in how markedly hierarchical or flat the model structure is.

A model that has emerged from the library world is FRBR – Functional Requirements for Bibliographic Records. From wikipedia:

FRBR is a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) that relates user tasks of retrieval and access in online library catalogues and bibliographic databases from a user’s perspective. It represents a more holistic approach to retrieval and access as the relationships between the entities provide links to navigate through the hierarchy of relationships.

There are plenty of articles and documents online to explain further, so I will not take up your time with a summary of it, just my opinion. FRBR is very much built around the notion of books – what a book is, taking into account things like editions and so on. Where FRBR really does fall down a rabbit’s hole, is the consideration of things like serials and journal articles. Their treatment feels very much like an afterthought and the philosophical ideas of Work and Expression get very much more murky, especially when considering linking these records to conference papers and blog posts by the same article authors.

There is enough of a model, however, to render an understandable bibliographic ‘record’ for an article in RDF, and this post will give an example of this, using David Shotton and Silvio Peroni’s FaBIO ontology to encapsulate the information in a FRBR-like manner.

The data used comes from an IUCr paper “Nicotinamide-2,2,2-trifluoroethanol (2/1)” Acta Cryst. (2009). E65, o727-o728, which has RDF embedded in the HTML page itself. The original RDF looks something like this:

@prefix dc: . @prefix dcterms: . @prefix foaf: . @prefix prism: . @prefix rdf: . @prefix rdfs: . prism:eissn "1600-5368"; prism:endingpage "728"; prism:issn "1600-5368"; prism:number "4"; prism:publicationdate "2009-04-01"; prism:publicationname "Acta Crystallographica Section E: Structure Reports Online"; prism:rightsagent "med@iucr.org"; prism:section "organic compounds"; prism:startingpage "727"; prism:volume "65"; dc:creator "Bardin, J.", "Florence, A.J.", "Johnston, B.F.", "Kennedy, A.R.", "Wong, L.V."; dc:date "2009-04-01"; dc:description "The nicotinamide (NA) molecules of the title compound, 2C6H6N2O.C2H3F3O, form centrosymmetric R22(8) hydrogen-bonded dimers via N-H...O contacts. The asymmetric unit contains two molecules of NA and one trifluoroethanol molecule disordered over two sites of equal occupancy. The packing consists of alternating layers of nicotinamide dimers and disordered 2,2,2-trifluoroethanol molecules stacking in the c-axis direction. Intramolecular C-H...O and intermolecular N-H...N, O-H...N, C-H...N, C-H...O and C-H...F interactions are present."; dc:identifier _9:S1600536809007594; dc:language "en"; dc:link ; dc:publisher "International Union of Crystallography"; dc:rights ; dc:source ; dc:subject ""; dc:title "Nicotinamide-2,2,2-trifluoroethanol (2/1)"; dc:type "text"; dcterms:abstract "The nicotinamide (NA) molecules of the title compound, 2C6H6N2O.C2H3F3O, form centrosymmetric R22(8) hydrogen-bonded dimers via N-H...O contacts. The asymmetric unit contains two molecules of NA and one trifluoroethanol molecule disordered over two sites of equal occupancy. The packing consists of alternating layers of nicotinamide dimers and disordered 2,2,2-trifluoroethanol molecules stacking in the c-axis direction. Intramolecular C-H...O and intermolecular N-H...N, O-H...N, C-H...N, C-H...O and C-H...F interactions are present.".

This bibliographic information rendered into a FaBIO model (amongst other ontologies):

@prefix fabio: . @prefix c4o: . @prefix dc: . @prefix dcterms: . @prefix foaf: . @prefix owl: . @prefix rdfs: . @prefix rdf: . @prefix frbr: . @prefix prism: . :article a fabio:JournalArticle ; dc:title "Nicotinamide-2,2,2-trifluoroethanol (2/1)" ; dcterms:creator [ a foaf:Person ; foaf:name "Johnston, B.F." ] ; dcterms:creator [ a foaf:Person ; foaf:name "Florence, A.J." ] ; dcterms:creator [ a foaf:Person ; foaf:name "Bardin, J." ] ; dcterms:creator [ a foaf:Person ; foaf:name "Kennedy, A.R." ] ; dcterms:creator [ a foaf:Person ; foaf:name "Wong, L.V." ] ; dc:rights ; dc:language "en" ; fabio:hasPublicationYear "2009" ; fabio:publicationDate "2009-04-01" ; frbr:embodiment :printedArticle , :webArticle ; frbr:partOf :issue ; fabio:doi "10.1107/S1600536809007594" ; frbr:part :abstract ; prism:rightsagent "med@iucr.org" . :abstract a fabio:Abstract ; c4o:hasContent "The nicotinamide (NA) molecules of the title compound, 2C6H6N2O.C2H3F3O, form centrosymmetric R22(8) hydrogen-bonded dimers via N-H...O contacts. The asymmetric unit contains two molecules of NA and one trifluoroethanol molecule disordered over two sites of equal occupancy. The packing consists of alternating layers of nicotinamide dimers and disordered 2,2,2-trifluoroethanol molecules stacking in the c-axis direction. Intramolecular C-H...O and intermolecular N-H...N, O-H...N, C-H...N, C-H...O and C-H...F interactions are present." . :printedArticle a fabio:PrintObject ; prism:pageRange "727-728" . :webArticle a fabio:WebPage ; fabio:hasURL "http://scripts.iucr.org/cgi-bin/paper?fl2234" . :volume a fabio:JournalVolume ; prism:volume "65" ; frbr:partOf :journal . :issue a fabio:JournalIssue ; prism:issueIdentifier "4" ; frbr:partOf :volume :journal a fabio:Journal ; dc:title "Acta Crystallographica Section E: Structure Reports Online" ; fabio:hasShortTitle "Acta Cryst. E" ; dcterms:publisher [ a foaf:Organization ; foaf:name "International Union of Crystallography" ] ; fabio:issn "1600-5368" .

The most obvious model and ontology that has emerged for describing bibliographic metadata in RDF is the Bibliographic Ontology, developed by Frédérick Giasson and Bruce D’Arcus and has been in existence for long enough to gain acceptance by a number of other projects, such as EPrints, Talis Aspire and Chronicling America (The Chronicling America website at the Library of Congress provides a view on millions of page of digitized newspaper content from around the United States.)

The same data again, rendered this time using BIBO’s model and ontology, rather than a FRBR-like one:

@prefix bibo: . @prefix dc: . @prefix dcterms: . @prefix foaf: . @prefix owl: . @prefix rdfs: . @prefix rdf: . @prefix frbr: . @prefix prism: . a bibo:Article ; dc:title "Nicotinamide-2,2,2-trifluoroethanol (2/1)" ; dc:isPartOf ; bibo:volume "65" ; bibo:issue "4" ; bibo:pageStart "727" ; bibo:pageEnd "728" ; dc:creator :author1 ; dc:creator :author2 ; dc:creator :author3 ; dc:creator :author4 ; dc:creator :author5 ; bibo:authorList (:author1 :author2 :author3 :author4 :author5) ; dc:rights ; dc:language "en" ; dc:date "2009-04-01" ; bibo:doi "10.1107/S1600536809007594" ; bibo:abstract "The nicotinamide (NA) molecules of the title compound, 2C6H6N2O.C2H3F3O, form centrosymmetric R22(8) hydrogen-bonded dimers via N-H...O contacts. The asymmetric unit contains two molecules of NA and one trifluoroethanol molecule disordered over two sites of equal occupancy. The packing consists of alternating layers of nicotinamide dimers and disordered 2,2,2-trifluoroethanol molecules stacking in the c-axis direction. Intramolecular C-H...O and intermolecular N-H...N, O-H...N, C-H...N, C-H...O and C-H...F interactions are present." ; prism:rightsagent "med@iucr.org" . a bibo:Journal ; dc:title "Acta Crystallographica Section E: Structure Reports Online"@en ; ; bibo:shortTitle "Acta Cryst. E"@en ; bibo:issn "1600-5368" . :author1 a foaf:Person ; foaf:name "Johnston, B.F." . :author2 a foaf:Person ; foaf:name "Florence, A.J." . :author3 a foaf:Person ; foaf:name "Bardin, J." . :author4 a foaf:Person ; foaf:name "Kennedy, A.R." :author5 a foaf:Person ; foaf:name "Wong, L.V."

Comments on which is the most useable, the most understandable and what is likely to be the better model for sharing this data with other people are most welcome. This is an area in which the community will have to chose a model, as practically, wrapping the information in any of the models is straightforward, but if you put it into a model that noone uses, the model becomes more of a data coffin, than a useful concept to use.