
Showing posts with label Darwin Core triplet.

How many specimens does GBIF really have?

Duplicate records are the bane of any project that aggregates data from multiple sources. Mendeley, for example, has numerous copies of the same article, as documented by Duncan Hull (How many unique papers are there in Mendeley?). In their defence, Mendeley is aggregating data from lots of personal reference libraries and hence they will often encounter the same article with slightly differing metadata (we all have our own quirks when we store bibliographic details of papers). It's a challenging problem to identify and merge records which are not identical, but which are clearly the same thing.

What I'm finding rather more alarming is that GBIF has duplicate records for the same specimen from the same data provider. For example, the specimen USNM 547844 is present twice.

As far as I can tell this is the same specimen, but the catalogue numbers differ (547844 versus 547844.6544573). Apart from this the only difference is when the two records were indexed. The source for 547844 was last indexed August 9, 2009, the source for 547844.6544573 was first indexed August 22, 2010. So it would appear that some time between these two dates the US National Museum of Natural History (NMNH) changed the catalogue codes (by appending another number), so GBIF has treated them as two distinct specimens. Browsing other GBIF records from the NMNH shows the same pattern. I've not quantified the extent of this problem, but it's probably a safe bet that every NMNH herp specimen occurs twice in GBIF.

Then there are the records from Harvard's Museum of Comparative Zoology that are duplicates, such as http://data.gbif.org/occurrences/33400333/ and http://data.gbif.org/occurrences/328478233/ (both for specimen MCZ A-4092, in this case the collectionCode is either "Herp" or "HERPAMPH"). These are records that have been loaded at different times, and because the metadata has changed GBIF hasn't recognised that these are the same thing.
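
To give a flavour of what de-duplicating these records might involve, here's a minimal sketch (my own, not GBIF's actual logic) that groups records on crudely normalised Darwin Core fields. The field values and normalisation rules below are purely illustrative.

```python
# Illustrative sketch only: group occurrence records that probably refer to the
# same specimen, despite a changed catalogue number or collection code.
# Field names follow Darwin Core; the normalisation rules are my own guesses.

from collections import defaultdict

def normalise(record):
    # Strip a trailing ".<digits>" suffix of the kind seen in the NMNH example
    # (547844 versus 547844.6544573).
    catalog = record["catalogNumber"].split(".")[0]
    # Collapse collection-code variants (e.g. "Herp" versus "HERPAMPH") by
    # comparing an uppercase prefix; a real system would need a curated mapping.
    collection = record.get("collectionCode", "").upper()[:4]
    return (record["institutionCode"].upper(), collection, catalog)

def probable_duplicates(records):
    groups = defaultdict(list)
    for rec in records:
        groups[normalise(rec)].append(rec)
    return [recs for recs in groups.values() if len(recs) > 1]

records = [
    {"institutionCode": "USNM", "collectionCode": "Herps", "catalogNumber": "547844"},
    {"institutionCode": "USNM", "collectionCode": "Herps", "catalogNumber": "547844.6544573"},
    {"institutionCode": "MCZ",  "collectionCode": "Herp",  "catalogNumber": "A-4092"},
    {"institutionCode": "MCZ",  "collectionCode": "HERPAMPH", "catalogNumber": "A-4092"},
]
print(probable_duplicates(records))
```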

At the root of this problem is the lack of globally unique identifiers for specimens, or even identifiers that are unique and stable within a dataset. The Darwin Core wiki lists an occurrenceID field, for which it states:

The occurrenceID is supposed to (globally) uniquely identify an occurrence record, whether it is a specimen-based occurrence, a one-time observation of a species at a location, or one of many occurrences of an individual who is being tracked, monitored, or recaptured. Making it globally unique is quite a trick, one for which we don't really have good solutions in place yet, but one which ontologists insist is essential.

Well, now we see the side effect of not tackling this problem: our flagship aggregator of biodiversity data has duplicate records. Note that this has nothing to do with "ontologists" (whatever they are); it's simple data management. Assign a unique id (a primary key in a database will do fine) that can be used to track the identity of an object even as its metadata changes. Otherwise you are reduced to matching based on metadata, and if that is changeable then you have a problem.
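
Here's a toy example (with made-up values, including the identifier) of why this matters: a metadata-based key breaks as soon as the catalogue number is edited, whereas a stable per-record identifier sails through the change.

```python
# Made-up records illustrating the point: the occurrenceID value is invented,
# the catalogue numbers are from the USNM example above.

old = {"occurrenceID": "some-stable-local-id-123",
       "institutionCode": "USNM", "catalogNumber": "547844"}
new = {"occurrenceID": "some-stable-local-id-123",
       "institutionCode": "USNM", "catalogNumber": "547844.6544573"}

# Matching on metadata: the edited catalogue number breaks the match,
# and the aggregator ends up with a "new" record.
metadata_key = lambda r: (r["institutionCode"], r["catalogNumber"])
print(metadata_key(old) == metadata_key(new))      # False

# Matching on a stable identifier: still recognisably the same record.
print(old["occurrenceID"] == new["occurrenceID"])  # True
```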

Now, just imagine the potential chaos if we start changing institution and collection codes to conform to the Darwin Core triplet. In the absence of unique identifiers (again, these can be local to the data set) GBIF is going to be faced with a massive data reconciliation task to try and match old and new specimen records.

The other problem, of course, is that my plan to use GBIF occurrence URLs as globally unique identifiers for specimens is looking pretty shaky, because they are not unique (the same specimen can have more than one) and if GBIF cleans up the duplicates a number of these URLs will disappear. Bugger.



Extracting museum specimen codes from text

Quick note about a tool I've cobbled together as part of the phyloinformatics course, which addresses a long-standing need I and others have to extract specimen codes from text. I've had this code kicking around for a while (as part of various never-finished data mining projects), but never got around to releasing it, until now. It is very crude (basically a bunch of regular expressions), and there's a lot that could be done to improve it (not least starting with a complete list of museum specimen codes, rather than just those I've come across in, say, Zootaxa and BioStor).

You can try the tool at http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php. Paste in some text and it will try and extract museum codes. The tool tries to handle ranges of specimens (e.g., MHNSM 1808-09), and some of the more common specimen numbering schemes.
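
To give a sense of the approach, here is an illustrative regular expression of the same general flavour as those in the tool (not its actual pattern list, which is longer and messier):

```python
import re

# Illustrative only: a regular expression of the same general kind as those in
# the tool, not its actual pattern list. Real museum codes are far messier.
SPECIMEN = re.compile(r"\b([A-Z]{2,6})\s+([A-Z]?-?\d{2,7})(?:\s*-\s*(\d{1,7}))?")

def extract_codes(text):
    codes = []
    for acronym, number, range_end in SPECIMEN.findall(text):
        codes.append(f"{acronym} {number}")
        if range_end:  # crude handling of ranges such as "MHNSM 1808-09"
            codes.append(f"{acronym} {range_end} (range end)")
    return codes

print(extract_codes("Paratypes: MHNSM 1808-09, MVZ 246033 and USNM 547844."))
# ['MHNSM 1808', 'MHNSM 09 (range end)', 'MVZ 246033', 'USNM 547844']
```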

Comments welcome. If you are looking for a source of text, papers in ZooKeys or Zootaxa are a good place to start (especially papers on vertebrates, where specimen numbers are often used). BioStor is also a good source: if you're looking at a paper in BioStor, click on the "Text" link to get the OCR text for an article and paste that into the form above. For example, the text for Systematics of the Bufo coccifer complex (Anura: Bufonidae) of Mesoamerica is available at http://biostor.org/reference/97426.text.

The extraction tool can also be called as a web service using POST to get back the results in JSON.
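
Something along these lines should work, although the name of the form field ("text") and the exact structure of the JSON response are assumptions on my part, so check the form's HTML and an actual response before relying on this:

```python
# Sketch only: the form field name ("text") and the shape of the JSON response
# are assumptions, not documented facts about the service.

import requests

SERVICE = "http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php"

# Grab the OCR text of the Bufo coccifer paper from BioStor...
ocr_text = requests.get("http://biostor.org/reference/97426.text").text

# ...and POST it to the parser, then read the JSON result.
response = requests.post(SERVICE, data={"text": ocr_text})
print(response.json())
```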

DNA Barcoding, the Darwin Core Triplet, and failing to learn from past mistakes

Given various discussions about identifiers, dark taxa, and DNA barcoding that have been swirling around the last few weeks, there's one notion that is starting to bug me more and more. It's the "Darwin Core triplet", which creates identifiers for voucher specimens in the form <institution-code>:<OPTIONAL collection-code>:<specimen-id>. For example,

MVZ:Herp:246033

is the identifier for specimen 246033 in the Herpetology collection of the Museum of Vertebrate Zoology (see http://arctos.database.museum/guid/MVZ:Herp:246033).
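
Mechanically the triplet is trivial to pull apart, which is part of its appeal; a throwaway parser (my own sketch, treating the collection code as optional) might look like this:

```python
# Throwaway sketch: split a Darwin Core triplet into its parts. The collection
# code is optional, so a two-part string is treated as institution:specimen.

def parse_triplet(triplet):
    parts = triplet.split(":")
    if len(parts) == 3:
        institution, collection, specimen = parts
    elif len(parts) == 2:
        institution, specimen = parts
        collection = None
    else:
        raise ValueError(f"Not a Darwin Core triplet: {triplet!r}")
    return {"institutionCode": institution,
            "collectionCode": collection,
            "catalogNumber": specimen}

print(parse_triplet("MVZ:Herp:246033"))
# {'institutionCode': 'MVZ', 'collectionCode': 'Herp', 'catalogNumber': '246033'}
```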

On the face of it this seems a perfectly reasonable idea, and goes some way towards addressing the problem of linking GenBank sequences to vouchers (see, for example, http://dx.doi.org/10.1016/j.ympev.2009.04.016, preprint at PubMed Central). But I'd argue that this is a hack, and one which potentially will create the same sort of mess that citation linking was in before the widespread use of DOIs. In other words, it's a fudge to postpone adopting what we really need, namely persistent resolvable identifiers for specimens.

In many ways the Darwin Core triplet is analogous to an article citation of the form <journal>, <volume>:<starting page>. In order to go from this "triplet" to the digital version of the article we've ended up with OpenURL resolvers, which are basically web services that take this triple and (hopefully) return a link. In practice building OpenURL resolvers gets tricky, not least because you have to deal with ambiguities in the <journal> field. Journal names are often abbreviated, and there are various ways those abbreviations can be constructed. This leads to lists of standard abbreviations of journals and/or tools to map these to standard identifiers for journals, such as ISSNs.
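
To make the analogy concrete, a metadata-based lookup of this kind boils down to packing the citation fields into a query string and hoping the resolver can make sense of them (the resolver base URL and the citation values below are placeholders):

```python
from urllib.parse import urlencode

# The resolver base URL is a placeholder, as are the citation values; genre,
# title, volume and spage are typical OpenURL-style keys, but which keys a
# given resolver actually honours varies.
RESOLVER = "http://resolver.example.org/openurl"

query = urlencode({
    "genre": "article",
    "title": "Journal of Herpetology",  # the ambiguous, abbreviation-prone part
    "volume": "36",
    "spage": "141",
})
print(f"{RESOLVER}?{query}")
```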

This should sound familiar to anybody dealing with specimens. Databases such as the Registry of Biological Repositories and the Biodiversity Collections Index have been created to provide standardised lists of collection abbreviations (such as MVZ = Museum of Vertebrate Zoology). Indeed, one could easily argue that what we need is an OpenURL for specimens (and I've done exactly that).

As much as there are advantages to OpenURL (nicely articulated in Eric Hellman's post When shall we link?), ultimately this will end in tears. Linking mechanisms that depend on metadata (such as museum acronyms and specimen codes, or journal names) are prone to break as the metadata changes. In the case of journals, publishers can rename entire back catalogues and change the corresponding metadata (see Orwellian metadata: making journals disappear), journals can be renamed, merged, or moved to new publishers. In the same way, museums can be rebranded, specimens moved to new institutions, etc. By using a metadata-based identifier we are storing up a world of hurt for someone in the future. Why don't we look at the publishing industry and learn from them? By having unique, resolvable, widely adopted identifiers (in this case DOIs) scientific publishers have created an infrastructure we now take for granted. I can read a paper online, and follow the citations by clicking on the DOIs. It's seamless and by and large it works.

One could argue that a big advantage of the Darwin Core triplet is that it can identify a specimen even if it doesn't have a web presence (which is another way of saying that maybe it doesn't have a web presence now, but it might in the future). But for me this is the crux of the matter. Why don't these specimens have a web presence? Why is it the case that biodiversity informatics has failed to tackle this? It seems crazy that in the context of digital data (DNA sequences) and digital databases (GenBank) we are constructing unresolvable text strings as identifiers.

But, of course, much of the specimen data we care about is online, in the form of aggregated records hosted by GBIF. It would be technically trivial for GBIF to assign a decent identifier to these (for example, a DOI) and we could complete the link between sequence and specimen. There are ways this could be done such that these identifiers could be passed on to the home institutions if and when they have the infrastructure to do it (see GBIF and Handles: admitting that "distributed" begets "centralized").

But for now, we seem determined to postpone having resolvable identifiers for specimens. The Darwin Core triplet may seem a pragmatic solution to the lack of specimen identifiers, but it seems to me it's simply postponing the day we actually get serious about this problem.