Search this keyword

Showing posts with label ION. Show all posts
Showing posts with label ION. Show all posts

BioNames update - matching taxon names to classifications

On eof the things BioNames will need to do is match taxon names to classifications. For example, if I want to display a taxonomic hierarchy for the user to browse through the names, then I need a map between the taxon names that I've collected and one or more classifications. The approach I'm taking is to match strings, wherever possible using both the name and taxon authority. In many cases this is straightforward, especially if there is only one taxon with a name. But often we have cases where the same name has been used more than once for different taxa. For example, here is what ION has for the name "Nystactes".
Nystactes Bohlke2735131
Nystactes2787598
Nystactes Gloger 18274888093
Nystactes Kaup 18294888094


If I want to map these names to GBIF then these are corresponding taxa with the name "Nystactes":
Nystactes Böhlke, 19572403398
Nystactes Gloger, 18272475109
Nystactes Kaup, 18293239722


Clearly the names are almost identical, but there are enough little differences (presence or absence of comma, "o" versus "ö") to make things interesting. To make the mapping I construct a bipartite graph where the nodes are taxon names, divided into two sets based on which database they came from. I then connect the nodes of the graph by edges, weighted by how similar the names are. For example, here is the graph for "Nystactes" (displayed using Google images:


I then compute the maximum weighted bipartite matching using a C++ program I wrote. This matching corresponds to the solid lines in the graph above.

In this way we can make a sensible guess as to how names in the two databases relate to one another.

Why the ICZN is in trouble

There are many reasons why the International Commission on Zoological Nomenclature (ICZN) is in trouble, but fundamentally I think it's because of situation illustrated by following diagram.

ICZN
Based on an analysis of the Index of Organism Names (ION) database that I'm currently working on, there are around 3.8 million animal names (I define "animal" loosely, the ICZN covers a number of eukaryote groups), of which around 1.5 million are "original combinations", that is, the name as originally published. The other 2 million plus names are synonyms, spelling variations, etc.

Of these 3.8 million names the ICZN itself can say very little. It has placed some 12,600 names (around 0.3% of the total) on its Official Lists and Indexes (which is where it records decisions on nomenclature), and its new register of names, ZooBank, has less than 100,000 names (i.e., less than 3% of all animal names).

The ICZN doesn't have a comprehensive database of animal names, so it can't answer the most basic questions one might have about names (e.g., "is this a name?", "can I use this name, or has somebody already used it?", "what other names have people used for this taxon?", "where was this name originally published?", "can I see the original description?", "who first said these two names are synonyms?", and so on). The ICZN has no answer to these questions. In the absence of these services, it is reduced to making decisions about a tiny fraction of the names that are in use (and there is no database of these decisions). It is no wonder that it is in such trouble.

Rate of description of new animal species and *that* Taxatoy graph

As part of the discussion on whether legacy biodiversity literature matters a graph from the following paper came up:

Sarkar, I., Schenk, R., & Norton, C. N. (2008). Exploring historical trends using taxonomic name metadata. BMC Evolutionary Biology, 8(1), 144. doi:10.1186/1471-2148-8-144


So, why is the Sarkar et al. graph bogus? Here is their graph (Fig. 3) for animals:

Taxatoy

This is the number of new animal species described each year, estimated by parsing taxonomic names and extracting the date in the taxonomic authority. There are two prominent "spikes" which are worrying. Sarkar et al. discuss the peak in 1994:

For example, the analyzed data indicate that a significant portion of the 1994 peak is due to an increase in descriptions of the family Cerambycidae, a large group of beetles.


So, 1994 was a bumper year for describing new species of Cerambycidae? Not quite. Taxatoy is based on names in uBio, and I have a local copy of most of these names. The Cerambycidae names contain lots of duplicate names that differ only in taxon authority. For example, searching the name Ancylocera macrotela on uBio finds:


Ancylocera macrotela
Ancylocera macrotela Aurivillius, 1912
Ancylocera macrotela BATES Henry Walter, 1880
Ancylocera macrotela Bates, 1880
Ancylocera macrotela Bates, 1885
Ancylocera macrotela Blackwelder, 1946
Ancylocera macrotela Chemsak & Linsley, 1970
Ancylocera macrotela Chemsak, 1963
Ancylocera macrotela Chemsak, 1964
Ancylocera macrotela Chemsak, Linsley & Mankins, 1980
Ancylocera macrotela Chemsak, Linsley & Noguera, 1992
Ancylocera macrotela Lameere, 1883
Ancylocera macrotela Maes & al., 1994
Ancylocera macrotela Monné & Giesbert, 1994
Ancylocera macrotela Monné, 1994
Ancylocera macrotela Noguera & Chemsak, 1996
Ancylocera macrotela Viana, 1971


These names are chresonyms. The original name is Ancylocera macrotela Bates, 1880 (you can see first publication of this name in BHL), the rest are subsequent citations of that name (gotta love taxonomy...).

Why the spike in 1994? I suspect that this is due to the publication in 1994 of "Checklist of the Cerambycidae and Disteniidae (Coleoptera) of the Western Hemisphere" by Miguel A Monné and Edmund F Giesbert. At least 8552 names from that checklist seem to have ended up in uBio, all with the date "1994". So the spike is an artefact. Similarly, the other peak (1912) corresponds to the publication of a checklist by Per Olof Christopher Aurivillius, which contributes over 3000 names.

One reason I was suspicious of the Taxatoy graph is that it doesn't look anything like the equivalent graph from the Index of Organism Names. After a bit of fussing I've grabbed data from the ION site, and from Taxatoy's Google Code repository and created the following chart:

Taxatoy version2

The data for this chart is on figshare http://dx.doi.org/10.6084/m9.figshare.156862. ION is an index of all new animal names, based on Zoological Record. I place more confidence in its data than data derived from uBio, but it clearly ION has its own issues (such as the gap after 1850, and the uneven sampling of the early years of taxonomy). The key point is that arguments on the temporal distribution of taxonomic descriptions (and the value of legacy literature) need to be aware that the data used is in pretty poor shape.

Update 2013-02-23
Jose Antonio Gonzalez Oreja pointed out in an email that the values for ION that I used were a little higher than those that appear on the ION web site. My script for retrieving those values hadn't quite worked. I've uploaded the corrected data to Figshare http://dx.doi.org/10.6084/m9.figshare.156862, updated the diagram above, and put the web calls I used to fetch the data on GitHub https://gist.github.com/rdmpage/5019153. The story doesn't change, but it helps to have the correct data.

Mapping names to literature: closing in on 250,000 names

Following on from my earlier post Linking taxonomic names to literature: beyond digitised 5×3 index cards I've been slowly updating my latest toy:

http://iphylo.org/~rpage/itaxonAlpheus

This site displays a database mapping over 200,000 animal names to the primary literature, using a mix of identifiers (DOIs, Handles, PubMed, URLs) as well as links to freely available PDFs where they are available. Lots still to do as about a third of the 1.5 million names in the database have citations that my code hasn't been able to parse. There are also lots of gaps that need to be filled in, for example missing DOIs or PubMed identifiers, and a lot of the earlier names are linked by "microcitations" to names, and I'll need to handle those (using code from my earlier project Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature).

The mapping itself is stored in a database that I'm constantly editing, so this is far from production quality, but I've found it eye-opening just how much literature is available. There is a lot of scope for generating customised lists of papers, for example, primary taxonomic sources for taxa currently on the IUCN Red List, or those taxa which have sequences in GenBank (building on the mapping of NCBI taxa onto Wikipedia). Given that a lot of the relevant literature is in BHL, or available as PDFs, we could do some data mining, such as extracting geographical coordinates, taxonomic names, and citations. And if linked data is your thing, the 110,000 DOIs and nearly 9,000 CiNiii URLs all serve RDF (albeit not without a few problems).

I've set a "goal" of having 250,000 names mapped to the primary literature, at which point the database interface will get some much-needed attention, but for now have a look for your favourite animal and see if it's original description has been digitised.

Towards the bibliography of life

David King et al.'s paper "Towards the bibliography of life" http://dx.doi.org/10.3897/zookeys.150.2167 has just appeared in a special issue of ZooKeys. I've written a number of posts on this topic, so I've a few comments.

King et al. survey some of the issues, but don't really tackle the big issue of how we're going to build this. If we define the "bibliography of life" somewhat narrowly as the list of all papers that have published a scientific name (or a new combination, such as moving a species from one genus to another), then this is a large, but measurable undertaking. According to ION's metrics page, these are the numbers involved (for animals and protozoa):

Total New Names1,510,402
Total New Genera / Subgenera215,242
Total New Species / Subspecies1,192,366
Total Other New Names102,794
Total New Combinations241,296
Total New Synonyms260,544


Even in the worse case scenario of one name per publication (clearly not the case) this is big, but not insurmountable, task.

Publications not taxa
Part of the challenge is figuring out the best way to tackle the problem. In the past, most efforts at building taxonomic bibliographies have focussed on specific taxa, which is natural — the bibliographies are being built by taxonomists and they specialise in particular groups. But I'd argue that this is not the most efficient way to tackle the problem. Because the taxonomic literature is so widely dispersed, after the obvious "low hanging fruit" have been collected, considerable effort must be spent tracking down the harder to find citations. There are few economies of scale in this approach. In contrast, if we focus on publications at, say, the level of journal, then we can build a bibliography much more quickly. Once we've found the source, say, for one article, often we could use that information to harvest many articles from the same source (e.g., write scripts to harvest from a digital repository such as a DSpace server, or a digital library such as Gallica). But if we are focussed on a particular taxon, we will ignore the other articles in that journal ("what do I care about fish, I like turtles").

Put another way, if we imagine a taxa × publication matrix, then we can either go after rows (i.e., a bibliography for a specific taxonomic group), or columns (a list of articles in a specific journal). The article-based approach will be faster, albeit at the cost of finding articles that aren't necessarily relevant to taxonomy. This is why I'm spending what feels like far too much time harvesting article lists and uploading these to Mendeley. It is also one reason BHL has been so successful. They've simply gone after scanning the literature wholesale, rather than focussing on particular taxonomic groups.

TaxapublicationmatrixWikispecies logo enCrowd sourcing and Wikispecies
Crowd sourcing often strikes me as a euphemism for "we can't be bothered doing the tedious stuff, lets get the public to do it for us (plus it will look like we're engaged with the public)." I'm not denying can work, but I suspect it's not a magic bullet. Perhaps the best crowd sourcing is not to try and bring the crowd to a project, but go where the crowd has already gathered. In this case, an obvious crowd is the Wikispecies community. Working with the ION database for my Sherborn presentation, it's clear that the quality of bibliographic data in ION is variable, and rather poor for older references. In contrast, the reference lists on Wikispecies can be very good (e.g., the bibliography for George Boulenger). There are some issues with Wikispecies, notably the lack of a decent bibliographic template (unlike Wikipedia) so parsing references can be *cough* interesting, but there is scope here to use it to improve other databases. Citation matching can be a challenge, but in this case we have citations indexed by taxonomic name (in both ION and Wikispecies), which greatly reduces the scope of possible matches.

Summary
I think building the "bibliography of life" needs a combination of aggressive data gathering, and avoiding building additional tools unless absolutely needed. There are great tools and communities that can already be leveraged (e.g., Mendeley, Wikispecies), let's make use of them.

Sherborn presentation on Open Taxonomy

Here is my presentation from today's Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond meeting.


All the presentations will be posted online, along with podcasts of the audio. Meantime, presentations by Dave Remsen and Chris Freeland are already online.

Linking taxonomic names to literature: beyond digitised 5×3 index cards

Pubs
Tomorrow is the Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond meeting. It should be an interesting gathering, albeit overshadowed by the sudden death of Frank Bisby.

I'm giving a talk entitled "Open Taxonomy", in which I argue that most taxonomic databases are little more than digitised collections of 5×3 index cards, where literature is treated as dumb citation strings rather than as resources with digital identifiers. To make the discussion concrete I've created a mapping between the Index to Organism Names (ION) database and a range of bibliographic sources, such as CrossRef (for DOIs), BioStor, JSTOR, etc.

This mapping is online at http://iphylo.org/~rpage/itaxon/.

So far I've managed to link some 200,000 animal names to a literature identifier, and a good fraction of these articles are freely available, either as images in BioStor and Gallica (for I've created a simple viewer) or as PDFs (which are displayed using Google Docs.

Some examples are:


The site is obviously a work in progress, and there's a lot to be done to the interface, but I hope it conveys the key point: a significant fraction of the primary taxonomic literature is online, and we should be linking to this. The days of digitised 5×3 index cards are past.



Reflections on the TDWG RDF "Challenge"

This is a follow up to my previous post TDWG Challenge - what is RDF good for? where I'm being, frankly, a pain in the arse, and asking why we bother with RDF? In many ways I'm not particularly anti-RDF, but it bothers me that there's a big disconnect between the reasons we are going down this route and how we are actually using RDF. In other words, if you like RDF and buy the promise of large-scale data integration while still being decentralised ("the web as database"), then we're doing it wrong.

As an aside, my own perspective is one of data integration. I want to link all this stuff together so I can follow a path through multiple datasets and extract the information I want. In other words, "linked data" (little "l", little "d"). I'm interested in fairly light weight integration, typically through shared identifiers. There is also integration via ontologies, which strikes me as a different, if related, problem, that in many ways is closer to the original vision of the Semantic Web as a giant inference engine. I think the concerns (and experience) of these two communities are somewhat different. I don't particularly care about ontologies, I want key-value pairs and reusable identifiers so I can link stuff together. If, for example, you're working on something like Phenoscape, then I think you have a rather more circumscribed set of data, with potentially complicated interrelationships that you want to make inferences on, in which case ontologies are your friend.

So, I posted a "challenge". It wasn't a challenge so much as a set of RDF to play with. What I'm interested in is seeing how easily we can string this data together to learn stuff. For example, using the RDF I posted earlier here is a table listing the name, conservation status, publication DOI and date, and (where available) image from Wikipedia for frogs with sequences in GenBank.

SpeciesStatusDOIYear describedImage
Atelopus nanayCRhttp://dx.doi.org/10.1655/0018-0831(2002)058[0229:TNSOAA]2.0.CO;22002
Eleutherodactylus mariposaCRhttp://dx.doi.org/10.2307/14669621992
Phrynopus kauneorumCRhttp://dx.doi.org/10.2307/15659932002
Eleutherodactylus eunasterCRhttp://dx.doi.org/10.2307/15630101973
Eleutherodactylus amadeusCRhttp://dx.doi.org/10.2307/14455571987
Eleutherodactylus lamprotesCRhttp://dx.doi.org/10.2307/15630101973
Churamiti maridadiCRhttp://dx.doi.org/10.1080/21564574.2002.96354672002
Eleutherodactylus thorectesCRhttp://dx.doi.org/10.2307/14453811988
Eleutherodactylus apostatesCRhttp://dx.doi.org/10.2307/15630101973
Leptodactylus silvanimbusCRhttp://dx.doi.org/10.2307/15636911980
Eleutherodactylus sciagraphusCRhttp://dx.doi.org/10.2307/15630101973
Bufo chavinCRhttp://dx.doi.org/10.1643/0045-8511(2001)001[0216:NSOBAB]2.0.CO;22001
Eleutherodactylus fowleriCRhttp://dx.doi.org/10.2307/15630101973
Ptychohyla hypomykterCRhttp://dx.doi.org/10.2307/36720601993
Hyla suweonensisDDhttp://dx.doi.org/10.2307/14441381980
Proceratophrys concavitympanumDDhttp://dx.doi.org/10.2307/15654122000
Phrynopus bufoidesDDhttp://dx.doi.org/10.1643/CH-04-278R22005
Boophis periegetesDDhttp://dx.doi.org/10.1111/j.1096-3642.1995.tb01427.x1995
Phyllomedusa duellmaniDDhttp://dx.doi.org/10.2307/14446491982
Boophis liamiDDhttp://dx.doi.org/10.1163/1568538033224407722003
Hyalinobatrachium ignioculusDDhttp://dx.doi.org/10.1670/0022-1511(2003)037[0091:ANSOHA]2.0.CO;22003
Proceratophrys cururuDDhttp://dx.doi.org/10.2307/14477121998
Amolops bellulusDDhttp://dx.doi.org/10.1643/0045-8511(2000)000[0536:ABANSO]2.0.CO;22000
Centrolene bacatumDDhttp://dx.doi.org/10.2307/15645281994
Litoria kumaeDDhttp://dx.doi.org/10.1071/ZO030082004
Phrynopus pesantesiDDhttp://dx.doi.org/10.1643/CH-04-278R22005
Gastrotheca galeataDDhttp://dx.doi.org/10.2307/14436171978
Paratelmatobius cardosoiDDhttp://dx.doi.org/10.2307/14479761999
Rhacophorus catamitusDDhttp://dx.doi.org/10.1655/0733-1347(2002)016[0046:NAPKPF]2.0.CO;22002
Huia melasmaDDhttp://dx.doi.org/10.1643/CH-04-137R32005
Telmatobius vilamensisDDhttp://dx.doi.org/10.1655/0018-0831(2003)059[0253:ANSOTA]2.0.CO;22003
Callulina kisiwamsituENhttp://dx.doi.org/10.1670/209-03A2004
Arthroleptis nikeaeENhttp://dx.doi.org/10.1080/21564574.2003.96354862003
Eleutherodactylus amplinymphaENhttp://dx.doi.org/10.1139/z94-2971994
Eleutherodactylus glaphycompusENhttp://dx.doi.org/10.2307/15630101973
Bufo tacanensisENhttp://dx.doi.org/10.2307/14397001952
Phrynopus brackiENhttp://dx.doi.org/10.2307/14458261990
Telmatobius sibiricusENhttp://dx.doi.org/10.1655/0018-0831(2003)059[0127:ANSOTF]2.0.CO;22003
Cochranella macheENhttp://dx.doi.org/10.1655/03-742004
Eleutherodactylus melacaraENhttp://dx.doi.org/10.2307/14669621992
Plectrohyla glandulosaENhttp://dx.doi.org/10.2307/14410461964
Aglyptodactylus laticepsENhttp://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x1998
Eleutherodactylus glamyrusENhttp://dx.doi.org/10.2307/15656641997
Gastrotheca trachycepsENhttp://dx.doi.org/10.2307/15643751987
Eleutherodactylus grahamiENhttp://dx.doi.org/10.2307/15639291979
Litoria havinaLChttp://dx.doi.org/10.1071/ZO99302251993
Crinia ripariaLChttp://dx.doi.org/10.2307/14407941965
Litoria longirostrisLChttp://dx.doi.org/10.2307/14431591977
Osteocephalus mutaborLChttp://dx.doi.org/10.1163/1568538023208776092002
Leptobrachium nigropsLChttp://dx.doi.org/10.2307/14409661963
Pseudis tocantinsLChttp://dx.doi.org/10.1590/S0101-817519980004000111998
Mantidactylus argenteusLChttp://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x1919
Aglyptodactylus securiferLChttp://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x1998
Pseudis cardosoiLChttp://dx.doi.org/10.1163/1568538005072642000
Uperoleia inundataLChttp://dx.doi.org/10.1071/AJZS0791981
Litoria pronimiaLChttp://dx.doi.org/10.1071/ZO99302251993
Litoria paraewingiLChttp://dx.doi.org/10.1071/ZO97602831976
Philautus aurifasciatusLChttp://dx.doi.org/10.1163/156853887X000361987
Proceratophrys avelinoiLChttp://dx.doi.org/10.1163/156853893X001561993
Osteocephalus deridensLChttp://dx.doi.org/10.1163/1568538005075252000
Gephyromantis boulengeriLChttp://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x1919
Crossodactylus caramaschiiLChttp://dx.doi.org/10.2307/14469071995
Rana yavapaiensisLChttp://dx.doi.org/10.2307/14453381984
Boophis lichenoidesLChttp://dx.doi.org/10.1163/156853898X000251998
Megistolotis lignariusLChttp://dx.doi.org/10.1071/ZO97901351979
Ansonia endauensisNEhttp://dx.doi.org/10.1655/0018-0831(2006)62[466:ANSOAS]2.0.CO;22006
Ansonia kraensisNEhttp://dx.doi.org/10.2108/zsj.22.8092005
Arthroleptella landdrosiaNThttp://dx.doi.org/10.2307/15653592000
Litoria jungguyNThttp://dx.doi.org/10.1071/ZO020692004
Phrynobatrachus phyllophilusNThttp://dx.doi.org/10.2307/15659252002
Philautus ingeriVUhttp://dx.doi.org/10.1163/156853887X000361987
Gastrotheca dendronastesVUhttp://dx.doi.org/10.2307/14450881983
Hyperolius cystocandicansVUhttp://dx.doi.org/10.2307/14439111977
Boophis sambiranoVUhttp://dx.doi.org/10.1080/21564574.2005.96355202005
Ansonia torrentisVUhttp://dx.doi.org/10.1163/156853883X000211983
Telmatobufo australisVUhttp://dx.doi.org/10.2307/15630861972
Stefania coxiVUhttp://dx.doi.org/10.1655/0018-0831(2002)058[0327:EDOSAH]2.0.CO;22002
Oreolalax multipunctatusVUhttp://dx.doi.org/10.2307/15648281993
Eleutherodactylus guantanameraVUhttp://dx.doi.org/10.2307/14669621992
Spicospina flammocaeruleaVUhttp://dx.doi.org/10.2307/14477571997
Cycloramphus acangatanVUhttp://dx.doi.org/10.1655/02-782003
Leiopelma pakekaVUhttp://dx.doi.org/10.1080/03014223.1998.95175541998
Rana okaloosaeVUhttp://dx.doi.org/10.2307/14448471985
Phrynobatrachus uzungwensisVUhttp://dx.doi.org/10.1163/156853883X000301983


This is a small fraction of the frog species actually in GenBank because I've filtered it down to those that have been linked to Wikipedia (from where we get the conservation status) and which were described in papers with DOIs (from which we get the date of description).

I generated this result using this SPARQL query on a triple store that had the primary data sources (Uniprot, Dbpedia, CrossRef, ION) loaded, together with the all-important "glue" datasets that link ION to CrossRef, and Uniprot to Dbpedia (see previous post for details):


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX tdwg_tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tdwg_co: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?name ?status ?doi ?date ?thumbnail
WHERE {
?ncbi uniprot:scientificName ?name .
?ncbi rdfs:seeAlso ?dbpedia .
?dbpedia dbpedia-owl:conservationStatus ?status .
?ion tdwg_tn:nameComplete ?name .
?ion tdwg_co:publishedInCitation ?doi .
?doi dcterms:date ?date .

OPTIONAL
{
?dbpedia dbpedia-owl:thumbnail ?thumbnail
}
}
ORDER BY ASC(?status)


This table doesn't tell us a great deal, but we could, for example, graph date of description against conservation status (CR=critical, EN=endangered, VU=vulnerable, NT=not threatened, LC=least concern, DD=data deficient):
Chart
In other words, is it the case that more recently described species are more likely to be endangered than taxa we've known about for some time (based on the assumption that we've found all the common species already)? We could imagine extending this query to retrieve sequences for a class of frog (e.g., critically endangered) so we could compute a measure population genetic variation, etc. We shouldn't take the graph above too seriously because it's based on small fraction of the data, but you get the idea. As more frog taxonomy goes online (there's a lot of stuff in BHL and BioStor, for example) we could add more dates and build a dataset worth analysing properly.

It seems to me that these should be fairly simple things to do, yet they are the sort of thing that if we attempt today it's a world of hurt involving scripts, Excel, data cleaning, etc. before we can do the science.

The thing is, without the "glue" files mapping identifiers across different databases even this simple query isn't possible. Obviously we have no say in how many organisations publish RDF, but within the biodiversity informatics community we should make every effort to use external identifiers wherever possible so that we can make these links. This is the core of my complaint. If we are using RDF to foster data integration so we can query across the diverse data sets that speak to biodiversity, then we are doing it wrong.

Update
Here is a nice visualisation of this dataset from @orovellotti (original here), made using ecoRelevé:

AcNbdh2CMAA3ysc png large

TDWG Challenge - what is RDF good for?

Last month, feeling particularly grumpy, I fired off an email to the TDWG-TAG mailing list with the subject Lobbing grenades: a challenge. Here's the email:
It's morning and the coffee hasn't quite kicked in yet, but reading through recent TDWG TAG posts, and mindful of the upcoming meeting in New Orleans (which sadly I won't be attending) I'm seeing a mismatch between the amount of effort being expended on discussions of vocabularies, ontologies, etc. and the concrete results we can point to.

Hence, a challenge:

"What new things have we learnt about biodiversity by converting biodiversity data into RDF?"

I'm not saying we can't learn new things, I'm simply asking what have we learnt so far?

Since around 2006 we have had literally millions of triples in the wild (uBio, ION, Index Fungorum, IPNI, Catalogue of Life, more recently Biodiversity Collections Index, Atlas of Living Australia, World Register of Marine Species, etc.), most of these using the same vocabulary. What new inferences have we made?

Let's make the challenge more concrete. Load all these data sources into a triple store (subchallenge - is this actually possible?). Perhaps add other RDF sources (DBpedia, Bio2RDF, CrossRef). What novel inferences can we make?

I may, of course, simply be in "grumpy old arse" mode, but we have millions of triples in the wild and nothing to show for it. I hope I'm not alone in wondering why...

In the context of the TDWG meeting (happening as we speak and which I'm following via Twitter, hashtag #tdwg) Joel Sachs asked me whether I had any specific data in mind that could form the basis of a discussion. So, here goes. I've assembled some small RDF data sets that it might be fun to play with. Each data set is for frogs, and I've divided them into two sets.

Primary data
These data sets are essentially unmodified RDF fetched from data providers:
  • uniprot.rdf Uniprot RDF for frogs in GenBank
  • ion.rdf Index of Organism Names (ION) RDF for taxonomic names for frogs (filtered to just those names that are also in GenBank, the RDF comes from ION LSIDs)
  • crossref.rdf CrossRef RDF for DOIs for publications that published new frog names (obtaining using CrossRef's support for Linked Data for DOIs)
  • dbpedia.rdf Dbpedia RDF for frogs in GenBank (Update 2011-10-20: the dbpedia.rdf file is a bit big, so here is subset.rdf which has just the conservation status and thumbnail image)


These sources give us information on genomics (at least, they tell us which taxa have been sequenced), where and when the original taxonomic description was published, and by whom, as well as some information on conservation status and what the frog looks like (via Dbpedia). Ideally we just load these files into a triple store and then ask a bunch of questions, such as what is the conservation status of frogs sequenced in Genbank?, is there correlation between the conservation status of a frog and the date it was discovered?, who has described the most frog species?, etc.

My contention is that actually we can't do any of this because the data is siloed due to the lack of shared identifiers and vocabularies (I suspect that there is not a single identifier any of these files share). The only way we can currently link these data sets together is by shared string literals (e.g., taxonomic names), in which case why bother with RDF? So my first challenge is to see whether any of the questions I've just listed can actually be tackled using this data.

Glue
In a slightly more constructive mode, to see if we can make progress I'm providing some additional RDF files, based on projects I'm working on to link data together. These files may help provide some of the missing "glue" to connect these data sets.

  • linkout.rdf The list of links between NCBI and Dbpedia (based on mapping in iPhylo LinkOut)
  • ion_doi.rdf A subset of publications listed in ION have DOIs, this file links the corresponding ION LSIDs to those DOIs (this file is from an ongoing project mapping names to primary literature)


The first file links the ION and CrossRef RDF, so we could start to ask questions about dates of discovery, who described what species, etc.. The second file links NCBI taxon ids (in this case in the form of UniProt URIs) to Wikipedia (in the form of Dbpedia URIs). Dbpedia has information on conservation status, and some frogs will also have pictures, so we can start to join genomics to conservation, as well as make some visualisations.

Update
I've now added another RDF file for 1000 georeferenced GenBank sequences for frogs. The file is genbank.rdf. This file is generated from a local, processed version of EMBL, and uses a mixture of Dublin Core and TDWG vocabularies. Here's an example of a single record:

<?xml version="1.0"?>
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:tcommon="http://rs.tdwg.org/ontology/voc/Common#"
xmlns:toccurrence="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#"
xmlns:uniprot="http://purl.uniprot.org/core/">
<uniprot:Molecule rdf:about="http://bio2rdf.org/genbank:EU566842">
<dcterms:created>2008-07-06</dcterms:created>
<dcterms:modified>2010-12-23</dcterms:modified>
<dcterms:title>EU566842</dcterms:title>
<dcterms:description>Xenopus borealis voucher MHNG:Herp:2644.64
cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial.</dcterms:description>
<dcterms:subject rdf:resource="http://purl.uniprot.org/taxonomy/8354"/>
<dcterms:relation rdf:parseType="Resource">
<rdf:type rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#TaxonOccurrence"/>
<toccurrence:identifiedToString>Xenopus borealis</toccurrence:identifiedToString>
<toccurrence:decimalLatitude>0.66</toccurrence:decimalLatitude>
<geo:lat>0.66</geo:lat>
<toccurrence:decimalLongitude>37.5</toccurrence:decimalLongitude>
<geo:long>37.5</geo:long>
<toccurrence:verbatimCoordinates>0.66 N 37.5 E</toccurrence:verbatimCoordinates>
<toccurrence:country>Kenya</toccurrence:country>
<dcterms:identifier>MHNG:Herp:2644.64</dcterms:identifier>
</dcterms:relation>
</uniprot:Molecule>
</rdf:RDF>

I've added this simply so one could do some geographical queries.

Missing links
There are still lots of missing links here (for example, there's no explicit link between NCBI and ION, so we'd need to create this using taxonomic names), and we could add further links to the literature via sequences for taxa. Then there's the lack of geographic data. We could get some of this via georeferenced sequences in GenBank, but there's no RDF for this (Bio2RDF does have RDF for sequences but it ignores the bulk of the organismal metadata such as voucher specimens and latitude and longitude).

In many ways it's this lack of links that was point of my original email. The reality is that "linked data" isn't linked to anything like the extent that makes it useful. Simply pumping out RDF won't get us very far until we tackle this problem (see also my earlier post Linked data that isn't: the failings of RDF).

So, if you think RDF is the way to go, please tell me what you can learn from these data files.