Accounting Careers

Showing posts with label Catalogue of Life.

More fictional taxa and the myth of the expert taxonomic database

I know I'm starting to sound like a broken record, but the more I look, the more taxonomic databases seem to be full of garbage. Databases such as the Catalogue of Life, which states that it is a "quality-assured checklist", have records that are patently wrong. Here's yet another example.

If you search for the genus Raymondia in the Catalogue of Life you get multiple occurrences of the same species names, e.g.:

Raymondionymus fossor Hustache, A., 1930
Raymondionymus fossor Ganglebauer, L., 1906

Both of these are listed as "provisionally accepted names", supplied by WTaxa: Electronic Catalogue of Weevil names (Curculionoidea). Clearly we can't have two species with the same name, so what's happening?

Firstly, Hustache, A., 1930 is:

Hustache A (1930) Curculionidae Gallo-Rhénans. Annales de la Société entomologique de France 99: 81-272. http://gallica.bnf.fr/ark:/12148/bpt6k6112240j/f3

On p. 246 Hustache refers to Raymondionymus fossor Aubé, 1864 (see below).

[Figure: scan of p. 246 of Hustache (1930)]

So, Raymondionymus fossor Hustache, A., 1930 is not a new species but simply the citation of a previously published one (it's a chresonym). Hustache cites the author of the name as Aubé, 1864, and you can see the original description by Aubé in BioStor (Description de six espèces nouvelles de Coléoptères d'Europe dont deux appartenant a deux genres nouveaux et aveugles, http://biostor.org/reference/104589). So, if the taxonomic authority should be Aubé, 1864, what about Raymondionymus fossor Ganglebauer, L., 1906? Again, if we track down the original publication (Revision der Blindrüsslergattungen Alaocyba und Raymondionymus, http://biostor.org/reference/104591) it's simply Ganglebauer citing (on p. 142) Aubé's paper, not describing a new species.
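The kind of check that would surface records like these can be sketched in a few lines. This is only an illustration, not anything the Catalogue of Life provides: the function name `find_repeated_names` and the (name, authority) tuple layout are assumptions made for the example. It groups records by name and reports any name listed with more than one authority, which are candidates for being chresonyms rather than distinct species.

```python
from collections import defaultdict


def find_repeated_names(records):
    """Return names that are listed with more than one authority.

    `records` is an iterable of (canonical_name, authority) tuples
    (a hypothetical layout chosen for this sketch).
    """
    authorities = defaultdict(set)
    for name, authority in records:
        authorities[name].add(authority)
    # Keep only the names that occur with several different authorities.
    return {
        name: sorted(auths)
        for name, auths in authorities.items()
        if len(auths) > 1
    }


# The two records discussed in this post.
records = [
    ("Raymondionymus fossor", "Hustache, A., 1930"),
    ("Raymondionymus fossor", "Ganglebauer, L., 1906"),
]
suspects = find_repeated_names(records)
```

A flagged name still needs a human (or the original literature) to decide which authority, if any, is correct; the check only says where to look.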

Note that the nomenclature of this weevil species is further complicated because Aubé originally described the species as Raymondia fossor, but Raymondia was already in use for a fly (see Über eine neue Fliegengattung: Raymondia, aus der Familie der Coriaceen, nebst Beschreibung zweier Arten derselben, http://biostor.org/reference/104588). To resolve this homonymy Wollaston proposed the name Raymondionymus:

Wollaston, T. V. (1873). XVIII. On the Genera of the Cossonidae. Transactions of the Royal Entomological Society of London, 21(4), 427–652. doi:10.1111/j.1365-2311.1873.tb00645.x http://biostor.org/reference/51301

So, we have a bit of a mess. Unfortunately this mess percolates up through other databases, for example EOL has three different pages for Raymondionymus fossor.

For me the lesson here is that relying on acquiring data from "trusted" sources, curated by "experts" is simply not a tenable strategy for building lists of taxa. If names are essential bits of biodiversity infrastructure upon which we hang other data, then these lists need to be cleaned, which means exposing them to scrutiny, and providing an easy means for errors to be flagged and corrected. Trust is something that is earned, not asserted, and it's time taxonomic databases stop claiming to be authoritative simply because they rely on expert sources. Expertise is no guarantee that you won't make errors.

For me this is one of the key reasons projects like BHL are so important. As more and more of the original literature becomes available, we lessen our reliance on "expertise". We can start to see for ourselves. In other words, "Nullius in verba" ("take nobody's word for it").

Using a zoomable treemap to visualise a taxonomic classification

One visualisation method I keep coming back to is the treemap. Each time I experiment with them I learn a little bit more, but I usually end up abandoning them (with the exception of using quantum treemaps to display bibliographic data). But they keep calling me back.

My latest experiment builds on some earlier thoughts on quantum treemaps, but tackles two issues that have kept bugging me. The first is that quantum treemaps are limited to hierarchies that are only two levels deep (e.g., family → genus → species). This is because, unlike regular treemaps where you are slicing and dicing a rectangle of predetermined size, when you construct a quantum treemap you don't know how big it will be until you've made it (this is because you want to ensure that every item in the hierarchy can be displayed at the same size, and fitting them in may require you to tweak the size of the treemap). Given that taxonomic classifications have > 2 levels this is a problem. One approach is to construct quantum treemaps for the lower parts of the classification, then pack those into a larger rectangle. This is an instance of the packing problem. After Googling for a bit I came across this code for packing rectangles, which was easy to follow and gave reasonable results.
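To make the packing step concrete, here is a minimal sketch using a simple "shelf" heuristic: sort the rectangles by height, then lay them out left to right in rows. This is not the code linked above (I have swapped in a simpler technique for illustration), and the function name `pack_rectangles` and the `max_width` parameter are assumptions of the example.

```python
def pack_rectangles(sizes, max_width):
    """Pack (width, height) rectangles into rows ("shelves").

    Returns (positions, total_width, total_height), where positions[i]
    is the (x, y) of the top-left corner of sizes[i].
    """
    # Tallest first, so each shelf wastes less vertical space.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i][1], reverse=True)
    positions = [None] * len(sizes)
    x = y = 0
    shelf_height = 0
    total_width = 0
    for i in order:
        w, h = sizes[i]
        # Start a new shelf when the rectangle would overflow the row.
        if x > 0 and x + w > max_width:
            y += shelf_height
            x = 0
            shelf_height = 0
        positions[i] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
        total_width = max(total_width, x)
    return positions, total_width, y + shelf_height


# Four already-built treemaps of differing sizes, packed into a strip 8 wide.
sizes = [(4, 3), (3, 3), (5, 2), (2, 1)]
positions, width, height = pack_rectangles(sizes, max_width=8)
```

Shelf packing is far from optimal, but it is easy to follow and the size of the enclosing rectangle falls out of the layout, which is what a treemap of treemaps needs.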

The second problem is that I want the treemap to be interactive. I want to be able to zoom in and out and navigate around the treemap. After more Googling, I came across the Zoomooz.js library which makes web page elements zoom (for a pretty mind-blowing example of what can be done see impress.js), but I decided I want to work with SVG. After playing with examples from Keith Wood's jQuery SVG plugin I started to get the hang of creating zoomable visualisations in SVG.
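The arithmetic behind zooming in SVG is small enough to sketch. Zooming to a rectangle amounts to choosing a `viewBox` that contains the rectangle and has the same aspect ratio as the viewport. The function below is an illustrative assumption, not part of the jQuery SVG plugin or of my own page; `zoom_view_box` and its parameters are made up for the example.

```python
def zoom_view_box(rect, viewport, padding=0.0):
    """Compute an SVG viewBox (x, y, width, height) that shows `rect`.

    `rect` is (x, y, width, height) in drawing units, `viewport` is
    (width, height) in pixels. The box keeps the viewport's aspect ratio
    and centres the rectangle inside it.
    """
    x, y, w, h = rect
    vw, vh = viewport
    # Grow the target by the requested margin on every side.
    x, y, w, h = x - padding, y - padding, w + 2 * padding, h + 2 * padding
    # Drawing units per pixel needed so that the whole rectangle fits.
    scale = max(w / vw, h / vh)
    box_w, box_h = vw * scale, vh * scale
    # Centre the rectangle in whichever dimension has room to spare.
    return (x - (box_w - w) / 2, y - (box_h - h) / 2, box_w, box_h)


# Zoom an 800x600 viewport onto a 200x100 rectangle at (100, 50).
box = zoom_view_box((100, 50, 200, 100), (800, 600))
view_box_attr = " ".join(str(v) for v in box)
```

Animating between the current box and the new one (interpolating the four numbers) would give the zooming effect.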

Here's a video of what I've come up with so far (you can see this live at http://iphylo.org/~rpage/zoomrect/primates.html). This is an interactive display of the Catalogue of Life 2010 classification of primates, with images from EOL. It's crude, there are some obvious issues with redrawing images, labels, etc., but it gives a sense of what can be done. With care this could probably be scaled up to handle the entire Catalogue of Life classification. With a bit more care, it could probably be optimised for the iPad, which would be a fun way to navigate through the diversity of life.

Visualising primate classification from Roderic Page on Vimeo.

How many species are there, and why do we get two very different answers from the same data?

Two papers estimating the total number of species have recently been published, one in the open access journal PLoS Biology:

Camilo Mora, Derek P. Tittensor, Sina Adl, Alastair G. B. Simpson, Boris Worm. How Many Species Are There on Earth and in the Ocean?. PLoS Biol 9(8): e1001127. doi:10.1371/journal.pbio.1001127

the second in Systematic Biology (which has an open access option but the authors didn't use it for this article):

Mark J. Costello, Simon Wilson and Brett Houlding. Predicting total global species richness using rates of species description and estimates of taxonomic effort. Syst Biol (2011) doi:10.1093/sysbio/syr080

The first paper has gained a lot of attention, in part because Jonathan Eisen ("Bacteria & archaea don't get no respect from interesting but flawed #PLoSBio paper on # of species on the planet") was mightily pissed off about the estimates of the number of bacteria and archaea:

Their estimates of ~ 10,000 or so bacteria and archaea on the planet are so completely out of touch in my opinion that this calls into question the validity of their method for bacteria and archaea at all.

The fuss over the number of bacteria and archaea seems to me to be largely a misunderstanding of how taxonomic databases count taxa. Databases like Catalogue of Life record described species, and most bacteria aren't formally described because they can't be cultured. Hence there will always be a disparity between the extent of diversity revealed by phylogenetics and by classical taxonomy.

The PLoS Biology paper has garnered a lot more reaction (e.g., the commentary by Carl Zimmer in the New York Times, "How Many Species? A Study Says 8.7 Million, but It's Tricky") than the Systematic Biology paper, which arguably has the more dramatic conclusion.

How many species, 8.7 million, or 1.8 to 2.0 million?

Whereas Mora et al. in PLoS Biology concluded that there are some 8.7 million (±1.3 million SE) species on the planet, Costello et al. in Systematic Biology arrive at a much more conservative figure (1.8 to 2.0 million). The implications of these two studies are very different: one implies there's a lot of work to do, the other leads to headlines such as 'Every species on Earth could be discovered within 50 years'.

What is intriguing is that both studies use the same databases, Catalogue of Life and the World Register of Marine Species, and yet arrive at very different results.

So, the question is, how did we arrive at two very different answers from the same data?

Are names really the key to the big new biology?

David ("Paddy") Patterson, Jerry Cooper, Paul Kirk, Rich Pyle, and David Remsen have published an article in TREE entitled "Names are key to the big new biology" (doi:10.1016/j.tree.2010.09.004). The abstract states:

Those who seek answers to big, broad questions about biology, especially questions emphasizing the organism (taxonomy, evolution and ecology), will soon benefit from an emerging names-based infrastructure. It will draw on the almost universal association of organism names with biological information to index and interconnect information distributed across the Internet. The result will be a virtual data commons, expanding as further data are shared, allowing biology to become more of a ‘big science’. Informatics devices will exploit this ‘big new biology’, revitalizing comparative biology with a broad perspective to reveal previously inaccessible trends and discontinuities, so helping us to reveal unfamiliar biological truths. Here, we review the first components of this freely available, participatory and semantic Global Names Architecture.

Do we need names?

Reading this (full disclosure, I was a reviewer) I can't help wondering whether the assumption that names are key really needs to be challenged. Roger Hyam has argued that we should be calling time on biological nomenclature, and I wonder whether for a generation of biologists brought up on DNA barcodes and GPS, taxonomy and names will seem horribly quaint. For a start, sequences and GPS coordinates are computable, we can stick them in computers and do useful things with them. DNA barcodes can be used to infer identity, evolutionary relationships, and dates of divergence. Taken in aggregate we can infer ecological relationships (such as diet, e.g., doi:10.1371/journal.pone.0000831), biogeographic history, gene flow, etc. While barcodes can tell us something about an organism, names don't. Even if we have the taxonomic description we can't do much with it: extracting information from taxonomic descriptions is hard.

Furthermore, formal taxonomic names don't seem terribly necessary in order to do a lot of science. Patterson et al. note that taxa may have "surrogate" names:

Surrogates include provisional names and specimen, culture or strain numbers which refer to a taxon. 'SAR-11' ('SAR' refers to the Sargasso Sea) was a surrogate name given in 1990 to an important member of the marine plankton. Only a decade later did it become known as Pelagibacter ubique.

The name Pelagibacter ubique was published in 2002 (doi:10.1038/nature00917), although as a Candidatus name (doi:10.1099/00207713-45-1-186), not a name conforming to the International Code of Nomenclature of Bacteria. I doubt the lack of a name that follows this code is hindering the study of this organism, and researchers seem happy to continue to use 'SAR11'.

So, I think that as we go forward we are going to find nomenclature struggling to establish its relevance in the age of digital biology.

If we do need them, how do we manage them?
If we grant Patterson et al. their premise that names matter (and for a lot of the legacy literature they will), then how do we manage them? In many ways the "Names are key to the big new biology" paper is really a pitch for the Global Names Architecture or GNA (and its components GNI, GNITE, and GNUB). So, we're off into alphabet soup again (sigh). The more I think about this the more I want something very simple.

Names
All I want here is a database of name strings and tools to find them in documents. In other words, uBio.

Documents
Broadly defined to include articles, books, DNA sequences, specimens, etc. I want a database of [name,document] pairs (BHL has a huge one), and a database of documents.

Realistically, given the number and type of documents there will be several "document" databases, such as GenBank and GBIF. For citations Mendeley looks very promising. If we had every taxonomic publication in Mendeley, tagged with scientific names, then we'd have the bibliography of life. Taxonomic nomenclators would be essentially out of business, given that their function is to store the first publication of a name. Given a complete bibliography we just create a timeline of usage for a name and note the earliest [name,document] pair:
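As a rough sketch of what "create a timeline and note the earliest pair" means in practice, the code below sorts [name,document] pairs by year and picks the first. The triple layout (name, document, year) and the function names are assumptions for illustration; the documents are ones cited earlier on this page.

```python
from collections import defaultdict


def usage_timeline(pairs):
    """Map each name to its (year, document) usages, oldest first."""
    timeline = defaultdict(list)
    for name, document, year in pairs:
        timeline[name].append((year, document))
    return {name: sorted(uses) for name, uses in timeline.items()}


def earliest_use(pairs, name):
    """Return the oldest (year, document) for `name`, or None if unseen."""
    uses = usage_timeline(pairs).get(name)
    return uses[0] if uses else None


pairs = [
    ("Raymondionymus fossor", "http://gallica.bnf.fr/ark:/12148/bpt6k6112240j/f3", 1930),
    ("Raymondionymus fossor", "http://biostor.org/reference/104591", 1906),
    ("Raymondia fossor", "http://biostor.org/reference/104589", 1864),
]
first = earliest_use(pairs, "Raymondionymus fossor")
```

With only these three pairs the earliest use of Raymondionymus fossor comes out as the 1906 paper, which is not the first publication of that combination. The answer is only as good as the bibliography is complete, which is exactly why the bibliography has to be complete.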

Taxonomy
There are a few wrinkles to deal with. Firstly, names may have synonyms, lexical variants, etc. (the Patterson et al. paper has a nice example of this). Leaving aside lexical variants, what we want is a "view" of the [name,document] pairs that says this subset refers to the same thing (the "taxon concept").

We can obsess over details in individual cases, but at web scale there are only two such views that spring to mind. The first is the Catalogue of Life, the second is NCBI. The Catalogue of Life lists sets of names and references that it regards as being the same thing, although it does unspeakable things to many of the references. In the case of NCBI the "concepts" would be the sets of DNA sequences and associated publications linked to the same taxonomy id. Whatever you think of the NCBI taxonomy, it is at least computable, in the sense that you could take a taxon and generate a list of publications "about" that taxon.

So, we have names, [name,document] pairs, and sets of [name,document] pairs. Simples.
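Those three pieces are simple enough to write down directly. The following is a minimal sketch, assuming a `NameUsage` tuple for a [name,document] pair and a plain `frozenset` for a concept; neither is taken from any existing database schema. The example concept reuses the weevil documents cited earlier on this page.

```python
from typing import NamedTuple


class NameUsage(NamedTuple):
    """One [name,document] pair."""
    name: str
    document: str


def names_in(concept):
    """All name strings used for the concept."""
    return sorted({usage.name for usage in concept})


def documents_about(concept):
    """All documents 'about' the concept, whatever name they use."""
    return sorted({usage.document for usage in concept})


# A taxon concept is just a set of pairs judged to refer to the same thing.
weevil = frozenset({
    NameUsage("Raymondia fossor", "http://biostor.org/reference/104589"),
    NameUsage("Raymondionymus fossor", "http://biostor.org/reference/104591"),
    NameUsage("Raymondionymus fossor", "http://gallica.bnf.fr/ark:/12148/bpt6k6112240j/f3"),
})
names = names_in(weevil)
documents = documents_about(weevil)
```

Different sources (Catalogue of Life, NCBI) would simply be different sets over the same pool of pairs.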

Replicating and forking data in 2010: Catalogue of Life and CouchDB

Time (just) for a Friday folly. A couple of days ago the latest edition of the Catalogue of Life (CoL) arrived in my mailbox in the form of a DVD and booklet:

While in some ways it's wonderful that the Catalogue of Life provides a complete data dump of its contents, this strikes me as a rather old-fashioned way to distribute it. So I began to wonder how this could be done differently, and started to think of CouchDB. In particular, I began to think of being able to upload the data to a service (such as Cloudant) where the data could be stored and replicated at scale. Then I began to think about forking the data. The Catalogue of Life has some good things going for it (some 1.25 million species, and around 2 million names), and is widely used as the backbone of sites such as EOL, GBIF, and iNaturalist.org, but parts of it are broken. Literature citations are often incomplete or mangled, and in places it is horribly out of date.

Rather than wait for the Catalogue of Life to fix this, what if we could share the data, annotate it, correct mistakes, and add links? In particular, what if we link the literature to records in the Biodiversity Heritage Library so that we can finally start to connect names to the primary literature (imagine clicking on a name and being able to see the original species description)? We could have something akin to github, but instead of downloading and forking code, we download and fork data. CouchDB makes replicating data pretty straightforward.

So, I've started to upload some Catalogue of Life records to a CouchDB instance at Cloudant, and write a simple web site to display these records. For example, you can see one record at http://iphylo.org/~rpage/col/?id=e9fda47629c1102b9a4a00304854f820.

The e9fda47629c1102b9a4a00304854f820 in this URL is the UUID of the record in CouchDB, which is also the UUID embedded in the (non-functional) CoL LSIDs. This ensures the records have a unique identifier, but also one that is related to the original record. You can search for names, or browse the immediate hierarchy around a name. I hope to add more records over time as I explore this further — at the moment I've added a few lizards, wasps, and conifers while I explore how to convert the CoL records into a sensible JSON object to upload to CouchDB.
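Here is a minimal sketch of deriving the CouchDB `_id` from a CoL LSID. The LSID layout shown (`urn:lsid:<authority>:<namespace>:<uuid>:<revision>`), the example LSID string itself, and the `rank` field are assumptions made for illustration, not a description of the actual conversion I'm using.

```python
import json
import uuid


def couch_id_from_lsid(lsid):
    """Extract the UUID from an LSID and return it as 32 hex digits.

    Assumes the UUID is the fifth colon-separated part of the LSID.
    """
    parts = lsid.split(":")
    return uuid.UUID(parts[4]).hex


def to_couch_doc(lsid, fields):
    """Build a JSON-ready document keyed on the record's own UUID."""
    doc = {"_id": couch_id_from_lsid(lsid), "lsid": lsid}
    doc.update(fields)
    return doc


# Hypothetical LSID built around the UUID in the URL above.
example_lsid = (
    "urn:lsid:catalogueoflife.org:taxon:"
    "e9fda476-29c1-102b-9a4a-00304854f820:ac2010"
)
doc = to_couch_doc(example_lsid, {"rank": "species"})  # "rank" is illustrative
body = json.dumps(doc)
```

Keeping the original identifier as the document id means a forked copy can always be traced back to the CoL record it came from.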

The next step is to think about this as a way to distribute data (want a copy of CoL, just point your CouchDB at the Cloudant URL and replicate it), and to think about how to build upon the basic records, editing and improving them, then thinking about how to get that information into a future version of the Catalogue.
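"Point your CouchDB at the URL and replicate" translates into a single HTTP request: CouchDB exposes a `_replicate` endpoint that takes a JSON body naming a source and a target. The sketch below only builds that request without sending it; the server and database URLs are placeholders, not real endpoints.

```python
import json
from urllib.request import Request


def replication_request(server, source, target):
    """Build (but do not send) a POST to CouchDB's /_replicate endpoint."""
    body = json.dumps({
        "source": source,        # database to copy from
        "target": target,        # database to copy into
        "create_target": True,   # create the target if it doesn't exist
    }).encode("utf-8")
    return Request(
        server.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URLs: substitute a real Cloudant database and a local CouchDB.
req = replication_request(
    "http://localhost:5984",
    "https://example.cloudant.com/col",
    "http://localhost:5984/col",
)
```

Passing `req` to `urllib.request.urlopen` would start the replication (authentication is omitted here).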

Why we need wikis

I've just spent a frustrating few minutes trying to find a reference in BioStor. The reference in question is

Heller, Edmund 1901. Papers from the Hopkins Stanford Galapagos Expedition, 1898-1899. WIV. Reptiles. Proc. Biol. Soc. Washington 14: 39-98

and comes from the Reptile Database page for the gecko Phyllodactylus gilberti HELLER, 1903. This is the primary database for reptile taxonomy, and supplies the Catalogue of Life, which repeats this reference verbatim.

Thing is, this reference doesn't exist! Page 39 of Proc. Biol. Soc. Washington volume 14 is the start of Gerrit S Miller (1901) A new dormouse from Italy. Proc Biol Soc Washington 14: 39-40.

After much fussing with trying different volumes and dates for Proc. Biol. Soc. Washington, I searched BHL for Phyllodactylus gilberti, and discovered that this name was published in Proceedings of the Washington Academy of Sciences:

Edmund Heller (1903) Papers from the Hopkins Stanford Galapagos Expedition, 1898-1899. XIV. Reptiles. Proceedings of the Washington Academy of Sciences 5: 39-98

(see http://biostor.org/reference/20322). Three errors (wrong journal, wrong date, minor typo in title), but enough to break the link between a name and the primary source for that name.
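Once both versions of a citation are broken into fields, spotting the discrepancies is mechanical. A minimal sketch, assuming the citations have already been parsed into dictionaries (the field names are my own choice for the example):

```python
def citation_differences(cited, found):
    """Return {field: (cited_value, found_value)} for fields that differ."""
    fields = sorted(set(cited) | set(found))
    return {
        field: (cited.get(field), found.get(field))
        for field in fields
        if cited.get(field) != found.get(field)
    }


title = "Papers from the Hopkins Stanford Galapagos Expedition, 1898-1899. "

# The reference as given by the Reptile Database.
cited = {
    "title": title + "WIV. Reptiles",
    "journal": "Proc. Biol. Soc. Washington",
    "year": 1901,
    "volume": "14",
    "pages": "39-98",
}
# The reference as found in BHL.
found = {
    "title": title + "XIV. Reptiles",
    "journal": "Proceedings of the Washington Academy of Sciences",
    "year": 1903,
    "volume": "5",
    "pages": "39-98",
}
differences = citation_differences(cited, found)
```

Note that the volume number differs as well, which follows from the wrong journal. Only the page range survives intact.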

Anybody who demands authoritative, expert-vetted resources, and thinks the Catalogue of Life is a shining example of this needs to think again. Our databases are riddled with errors, which are repackaged over and over again, yet these would be so easy to fix if they were opened up and made easy to edit. It's time to get serious about wikis.

Tag trees: displaying the taxonomy of names in BHL

I've added a feature to my Biodiversity Heritage Library viewer that should help make sense of the names found on a page. Until now I've displayed them as a list of "tags", which ignores the relations among the names. Based on some code I'd developed for my e-Biosphere 09 challenge entry I've added a "tag tree" that displays the classification of the names found on a BHL page:

The idea is that a set of names can make much more sense if you know what kind of organism they are referring to. For example, I don't know what Onetes is, but if I look at BHL page 2298380 I can see that it's an insect:

The names in gray don't occur on the page, but do occur in the tree that links those names (the latter are highlighted in black). The tag tree can be useful for separating out host and parasite, e.g. BHL page 2298491 is about a flea and its mammalian hosts:

The tag tree can also flag names that might be mistaken, such as those found on page 2298330:

This page has names of some grasshoppers from Madagascar, as well as the name of a butterfly (Tsaratanana), which seems a little odd. Looking at the text, we discover that "Tsaratanana" is Mont. Tsaratanana, a mountain in Madagascar. It would be fun to develop tools to annotate such cases so that somebody looking for the butterfly won't be presented with this page.

How it works

The inspiration for this tag tree came from several sources. David Remsen has often used an example of finding a fly name in the middle of a book on birds as being of interest, and the NCBI have a subtree view of taxa in a PubMed article. My own tag tree is constructed by finding for each name the ancestor-descendant path in a local, modified copy of the Catalogue of Life database, then assembling those paths into a tree. Because not all the names on a BHL page are in the Catalogue of Life, there may be names that aren't classified. These are simply listed below the tag tree (see image above).
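The path-merging step can be sketched as follows. The `CLASSIFICATION` dictionary is a tiny stand-in for the local copy of the Catalogue of Life (the paths in it are abbreviated and purely illustrative), and `Unknownus` is a made-up name included to show what happens to names that aren't classified.

```python
def tag_tree(names, classification):
    """Merge each name's root-to-name path into one nested dict.

    Returns (tree, unclassified), where `unclassified` lists the names
    with no path in `classification`.
    """
    tree, unclassified = {}, []
    for name in names:
        path = classification.get(name)
        if path is None:
            unclassified.append(name)
            continue
        node = tree
        for taxon in path:
            # Shared ancestors are created once and reused.
            node = node.setdefault(taxon, {})
    return tree, unclassified


# Stand-in for the local classification: name -> path from the root.
CLASSIFICATION = {
    "Onetes": ["Animalia", "Arthropoda", "Insecta", "Onetes"],
    "Tsaratanana": ["Animalia", "Arthropoda", "Insecta", "Lepidoptera", "Tsaratanana"],
}
tree, unclassified = tag_tree(["Onetes", "Tsaratanana", "Unknownus"], CLASSIFICATION)
```

Any node in the tree that isn't one of the input names is an ancestor that doesn't occur on the page, which is what gets drawn in gray.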

Visualising taxonomic classifications using SpaceTrees

The problem of displaying large taxonomic classifications on a web page continues to be an on-again, off-again obsession. My latest experiment makes use of Nicolas Garcia Belmonte's wonderful JavaScript Infovis Toolkit (JIT), which provides implementations of classic visualisations such as treemaps, hyperbolic trees, and SpaceTrees.

SpaceTrees were developed at Maryland's HCIL lab, and that lab has applied them to biodiversity informatics. The LepTree project has also used them (see LepTaxonTree). I've not been a huge fan, mainly because the existing implementation is a stand-alone Java program, which somewhat limits its utility. But JIT changes all that.

To get a sense of whether SpaceTrees would be useful, I took Belmonte's second SpaceTree example as a starting point. In this example, nodes are created on demand (rather than loading the entire tree into memory). It proved relatively straightforward (after getting my head around making Ajax requests using Mootools) to modify the example to load nodes from a local copy of the Catalogue of Life 2008 classification.
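The server side of loading nodes on demand is little more than "given a node id, return that node and its immediate children". The sketch below is an assumption about how such a response could be built, not the code behind the demo: the `ROWS` table stands in for the classification database, and the `id`/`name`/`data`/`children` keys follow my understanding of the JSON tree structure JIT expects.

```python
import json

# Stand-in for the classification table: (id, parent_id, name).
ROWS = [
    ("1", None, "Primates"),
    ("2", "1", "Hominidae"),
    ("3", "1", "Cebidae"),
    ("4", "2", "Homo"),
]


def node_with_children(node_id, rows):
    """Return one node plus its immediate children as a nested dict.

    Grandchildren are deliberately left out; the client asks for them
    when the user expands a child.
    """
    names = {row_id: name for row_id, _parent, name in rows}

    def as_node(row_id, children):
        return {"id": row_id, "name": names[row_id], "data": {}, "children": children}

    children = [as_node(row_id, []) for row_id, parent, _name in rows if parent == node_id]
    return as_node(node_id, children)


payload = json.dumps(node_with_children("1", ROWS))
```

An Ajax request for node "2" would then return Hominidae with Homo as its only child, and so on down the tree.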

I've put a live version of the Catalogue of Life SpaceTree up at http://bioguid.info/demos/spacetree. It doesn't do much, beyond display the tree, together with some basic information about the node. But I think it shows the power of JavaScript to create pleasing visualisations, and the potential of SpaceTrees as a simple tool to browse large taxonomic classifications.
