Having RDF data is good. Having Linkable data is better but having some idea of what sorts of properties you can expect to find within a triplestore or block of data can be crucial. That sort of broad-stroke information can be vital in letting you know when a dataset contains interesting data that makes the work to use it worthwhile.
I ran the recently re-released BL RDF data (get from here or here) (CC0) through a simple script that counted occurrences of various elements within the 17 files, as well as enumerating all the different sorts of property you can expect to find.
Some interesting figures:
- Over 100,000 records in each file, 2.9 million ‘records’ in total. Each record is a blank node.
- Three main types of identifier – a ‘(Uk)123….’, ‘GB123…’ and (as a literal) ‘URN:ISBN:123…’, but not all records have ISBNs as some of them predate it.
- Nearly 29 million blank nodes in total.
- 11,187,804 uses of dcterms:subject, for an average of just under 4 per record (3.75…)
- Uses properties from Dublin Core terms, OWL-Time, ISBD, and SKOS
- dcterms:subject’s are all as SKOS declarations, and include the Dewey decimal, LCSH and MESH schemes. (Work to use id.loc.gov LCSH URIs instead of literals is underway)
- Includes rare and valuable information, stored in properties such as dcterms:isPartOf, isReferencedBy, isReplacedBy, replaces, requires and dcterms:relation.
Google spreadsheet of the tallys
Occurrence trends through the 17 data files (BNBrdfdc01.xml –> 17.xml)
(The image is as Google spreadsheet exported it, click on the link above to go to the sheet itself to view it natively without axis distortion.)
Literals and what to expect:
I wrote another straightforward script that can mine sample sets of unique literals from the BNBrdfdc xml files.
Usage for ‘gather_test_literals.py’
Usage: python gather_test_literals.py path/to/BNBrdfdcXX.xml ns:predicate number_to_retrieve [redis_set_to_populate]
For example, to retrieve 10 literal values
from the bnodes within dcterms:publisher
in BNBrdfdc01.xml:
python gather_test_literals.py BNBrdfdc01.xml "dcterms:publisher" 10
- and to also push those values into a local
Redis set 'publisherset01' if Redis is running
and redis-py is installed:
python gather_test_literals.py BNBrdfdc01.xml "dcterms:publisher" 10 publisherset01
So, to find out what, at most, 10 of those intriguing ‘dcterms:isReferencedBy’ predicates contain in BNBrdfdc12.xml, you can run:
python gather_test_literals.py BNBrdfdc12.xml "dcterms:isReferencedBy" 10
(As long as gather_test_literals.py and the xml files are in the same directory of course)
Result:
Chemical abstracts,
Soulsby no. 4061
Soulsby no. 3921
Soulsby no. 4018
Chemical abstracts
As the script gathers the literals into a set, it will only return when it has either reached the desired number of unique values, or has reached the end of the file.
Hopefully, this will help other people explore this dataset and also pull information from it. I have also created a basic Solr configuration that has fields for all the elements found in the BNB dataset here.
Pingback: PabloG » Blog Archive » links for 2010-11-23
Many of the counts in the Google spreadsheet are double (or more) what they should be. I thought perhaps you were counting both start and end tags separately, but the contributors count is actually 4x, not 2x.
Rather than resorting to Python, you can get simple counts just using grep, e.g.
$ zcat BNBrdfdc*.xml.gz| grep ” | wc -l
3010972
Working with compressed data is much faster, at least on my laptop. I can rip through 445 MB of compressed data in 20% of the time it would take to search the raw 7.3 GB of uncompressed XML.
$ time zcat BNBrdfdc*.xml.gz| grep ‘<dcterms:contributor' | wc -l
1777986
real 0m59.000s
user 1m29.528s
sys 0m10.200s