Comments on: Characterising the British Library Bibliographic dataset
http://openbiblio.net/2010/11/18/characterising-the-british-library-bibliographic-dataset/
Open Bibliographic Data Working Group of the Open Knowledge Foundation
Wed, 26 Nov 2014 08:29:16 +0000

By: Tom Morris
http://openbiblio.net/2010/11/18/characterising-the-british-library-bibliographic-dataset/#comment-171
Thu, 25 Nov 2010 22:45:21 +0000

Many of the counts in the Google spreadsheet are double (or more) what they should be. I thought perhaps you were counting both start and end tags separately, but the contributors count is actually 4x, not 2x.

Rather than resorting to Python, you can get simple counts just using grep, e.g.

$ zcat BNBrdfdc*.xml.gz| grep ” | wc -l
3010972

Working with compressed data is much faster, at least on my laptop. I can rip through 445 MB of compressed data in 20% of the time it would take to search the raw 7.3 GB of uncompressed XML.

$ time zcat BNBrdfdc*.xml.gz | grep '<dcterms:contributor' | wc -l
1777986

real 0m59.000s
user 1m29.528s
sys 0m10.200s
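
One caveat worth illustrating: `grep | wc -l` counts matching lines, not matching tags, so the two numbers only agree if each element sits on its own line. The sketch below is not taken from the BL files; it builds a tiny hypothetical sample (`sample.xml.gz`, with invented contributor values) and compares a line count with a per-tag count using `grep -o`, which is available in GNU and BSD grep. Whether the real dump ever puts more than one element on a line is an assumption to check, not something established here.

```shell
# Hypothetical sample standing in for BNBrdfdc*.xml.gz (contents invented for illustration)
tmpdir=$(mktemp -d)
printf '%s\n' \
  '<rdf:Description>' \
  '<dcterms:contributor>A</dcterms:contributor><dcterms:contributor>B</dcterms:contributor>' \
  '<dcterms:contributor>C</dcterms:contributor>' \
  '</rdf:Description>' | gzip > "$tmpdir/sample.xml.gz"

# Counts matching LINES: the two start tags sharing a line are counted once
line_count=$(gzip -dc "$tmpdir/sample.xml.gz" | grep -c '<dcterms:contributor')

# Counts every start tag: -o prints each match on its own line before wc counts them
tag_count=$(gzip -dc "$tmpdir/sample.xml.gz" | grep -o '<dcterms:contributor' | wc -l | tr -d ' ')

echo "lines: $line_count, start tags: $tag_count"

# Remove the temporary sample
rm -rf "$tmpdir"
```

Because the pattern begins with `<dcterms:contributor`, it cannot match the closing `</dcterms:contributor>` tag, so end tags are never counted twice. `gzip -dc` is used here as a portable equivalent of `zcat`.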

By: PabloG » Blog Archive » links for 2010-11-23
http://openbiblio.net/2010/11/18/characterising-the-british-library-bibliographic-dataset/#comment-170
Wed, 24 Nov 2010 01:03:29 +0000

[…] Characterising the British Library Bibliographic dataset | Open Biblio (graphic) Projects (tags: BL bibliothèques bibliographie rdf xml webservice JSON websemantique library opendata reference catalogue opac metadonnees) […]
