By: Tom Morris

Tom Morris — Thu, 25 Nov 2010 22:45:21 +0000

Many of the counts in the Google spreadsheet are double (or more) what they should be. I thought perhaps you were counting both start and end tags separately, but the contributors count is actually 4x, not 2x.

Rather than resorting to Python, you can get simple counts just using grep, e.g.

$ zcat BNBrdfdc*.xml.gz | grep '' | wc -l
3010972

Working with compressed data is much faster, at least on my laptop. I can rip through 445 MB of compressed data in 20% of the time it would take to search the raw 7.3 GB of uncompressed XML.

$ time zcat BNBrdfdc*.xml.gz | grep '<dcterms:contributor' | wc -l
1777986

real 0m59.000s
user 1m29.528s
sys 0m10.200s
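
One caveat with this approach: `grep | wc -l` counts matching lines, not individual tags, so the result can come out low if several elements share a line. The following is a minimal sketch of the difference, not a statement about how the real dump is laid out. The sample file, its contents, and the variable names are all hypothetical, and `gzip -dc` is used as a portable equivalent of `zcat`.

```shell
# Build a tiny hypothetical stand-in for the BNBrdfdc*.xml.gz files.
sample_dir=$(mktemp -d)
printf '%s\n' \
  '<rdf:Description>' \
  '<dcterms:contributor>A</dcterms:contributor>' \
  '<dcterms:contributor>B</dcterms:contributor><dcterms:contributor>C</dcterms:contributor>' \
  '<dcterms:title>T</dcterms:title>' \
  '</rdf:Description>' | gzip > "$sample_dir/sample.xml.gz"

# Number of lines containing at least one start tag.
# The pattern does not match end tags, which begin with '</'.
line_count=$(gzip -dc "$sample_dir"/*.xml.gz | grep -c '<dcterms:contributor')

# Number of individual start tags: grep -o prints each match on its own line.
tag_count=$(gzip -dc "$sample_dir"/*.xml.gz | grep -o '<dcterms:contributor' | wc -l | tr -d ' ')

echo "lines: $line_count, start tags: $tag_count"
```

If the two numbers agree on the real data, each start tag sits on its own line and the simpler line count is safe to use.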

By: PabloG » Blog Archive » links for 2010-11-23

PabloG » Blog Archive » links for 2010-11-23 — Wed, 24 Nov 2010 01:03:29 +0000

[...] Characterising the British Library Bibliographic dataset | Open Biblio (graphic) Projects (tags: BL bibliothèques bibliographie rdf xml webservice JSON websemantique library opendata reference catalogue opac metadonnees) [...]

Comments on: Characterising the British Library Bibliographic dataset
