Accounting Careers


Reflections on the TDWG RDF "Challenge"

This is a follow-up to my previous post TDWG Challenge - what is RDF good for? where I'm being, frankly, a pain in the arse, and asking why we bother with RDF. In many ways I'm not particularly anti-RDF, but it bothers me that there's a big disconnect between the reasons we are going down this route and how we are actually using RDF. In other words, if you like RDF and buy the promise of large-scale data integration while still being decentralised ("the web as database"), then we're doing it wrong.

As an aside, my own perspective is one of data integration. I want to link all this stuff together so I can follow a path through multiple datasets and extract the information I want. In other words, "linked data" (little "l", little "d"). I'm interested in fairly lightweight integration, typically through shared identifiers. There is also integration via ontologies, which strikes me as a different, if related, problem that in many ways is closer to the original vision of the Semantic Web as a giant inference engine. I think the concerns (and experience) of these two communities are somewhat different. I don't particularly care about ontologies; I want key-value pairs and reusable identifiers so I can link stuff together. If, for example, you're working on something like Phenoscape, then I think you have a rather more circumscribed set of data, with potentially complicated interrelationships that you want to make inferences on, in which case ontologies are your friend.

So, I posted a "challenge". It wasn't a challenge so much as a set of RDF to play with. What I'm interested in is seeing how easily we can string this data together to learn stuff. For example, using the RDF I posted earlier here is a table listing the name, conservation status, publication DOI and date, and (where available) image from Wikipedia for frogs with sequences in GenBank.

Species	Status	DOI	Year described
Atelopus nanay	CR	http://dx.doi.org/10.1655/0018-0831(2002)058[0229:TNSOAA]2.0.CO;2	2002
Eleutherodactylus mariposa	CR	http://dx.doi.org/10.2307/1466962	1992
Phrynopus kauneorum	CR	http://dx.doi.org/10.2307/1565993	2002
Eleutherodactylus eunaster	CR	http://dx.doi.org/10.2307/1563010	1973
Eleutherodactylus amadeus	CR	http://dx.doi.org/10.2307/1445557	1987
Eleutherodactylus lamprotes	CR	http://dx.doi.org/10.2307/1563010	1973
Churamiti maridadi	CR	http://dx.doi.org/10.1080/21564574.2002.9635467	2002
Eleutherodactylus thorectes	CR	http://dx.doi.org/10.2307/1445381	1988
Eleutherodactylus apostates	CR	http://dx.doi.org/10.2307/1563010	1973
Leptodactylus silvanimbus	CR	http://dx.doi.org/10.2307/1563691	1980
Eleutherodactylus sciagraphus	CR	http://dx.doi.org/10.2307/1563010	1973
Bufo chavin	CR	http://dx.doi.org/10.1643/0045-8511(2001)001[0216:NSOBAB]2.0.CO;2	2001
Eleutherodactylus fowleri	CR	http://dx.doi.org/10.2307/1563010	1973
Ptychohyla hypomykter	CR	http://dx.doi.org/10.2307/3672060	1993
Hyla suweonensis	DD	http://dx.doi.org/10.2307/1444138	1980
Proceratophrys concavitympanum	DD	http://dx.doi.org/10.2307/1565412	2000
Phrynopus bufoides	DD	http://dx.doi.org/10.1643/CH-04-278R2	2005
Boophis periegetes	DD	http://dx.doi.org/10.1111/j.1096-3642.1995.tb01427.x	1995
Phyllomedusa duellmani	DD	http://dx.doi.org/10.2307/1444649	1982
Boophis liami	DD	http://dx.doi.org/10.1163/156853803322440772	2003
Hyalinobatrachium ignioculus	DD	http://dx.doi.org/10.1670/0022-1511(2003)037[0091:ANSOHA]2.0.CO;2	2003
Proceratophrys cururu	DD	http://dx.doi.org/10.2307/1447712	1998
Amolops bellulus	DD	http://dx.doi.org/10.1643/0045-8511(2000)000[0536:ABANSO]2.0.CO;2	2000
Centrolene bacatum	DD	http://dx.doi.org/10.2307/1564528	1994
Litoria kumae	DD	http://dx.doi.org/10.1071/ZO03008	2004
Phrynopus pesantesi	DD	http://dx.doi.org/10.1643/CH-04-278R2	2005
Gastrotheca galeata	DD	http://dx.doi.org/10.2307/1443617	1978
Paratelmatobius cardosoi	DD	http://dx.doi.org/10.2307/1447976	1999
Rhacophorus catamitus	DD	http://dx.doi.org/10.1655/0733-1347(2002)016[0046:NAPKPF]2.0.CO;2	2002
Huia melasma	DD	http://dx.doi.org/10.1643/CH-04-137R3	2005
Telmatobius vilamensis	DD	http://dx.doi.org/10.1655/0018-0831(2003)059[0253:ANSOTA]2.0.CO;2	2003
Callulina kisiwamsitu	EN	http://dx.doi.org/10.1670/209-03A	2004
Arthroleptis nikeae	EN	http://dx.doi.org/10.1080/21564574.2003.9635486	2003
Eleutherodactylus amplinympha	EN	http://dx.doi.org/10.1139/z94-297	1994
Eleutherodactylus glaphycompus	EN	http://dx.doi.org/10.2307/1563010	1973
Bufo tacanensis	EN	http://dx.doi.org/10.2307/1439700	1952
Phrynopus bracki	EN	http://dx.doi.org/10.2307/1445826	1990
Telmatobius sibiricus	EN	http://dx.doi.org/10.1655/0018-0831(2003)059[0127:ANSOTF]2.0.CO;2	2003
Cochranella mache	EN	http://dx.doi.org/10.1655/03-74	2004
Eleutherodactylus melacara	EN	http://dx.doi.org/10.2307/1466962	1992
Plectrohyla glandulosa	EN	http://dx.doi.org/10.2307/1441046	1964
Aglyptodactylus laticeps	EN	http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x	1998
Eleutherodactylus glamyrus	EN	http://dx.doi.org/10.2307/1565664	1997
Gastrotheca trachyceps	EN	http://dx.doi.org/10.2307/1564375	1987
Eleutherodactylus grahami	EN	http://dx.doi.org/10.2307/1563929	1979
Litoria havina	LC	http://dx.doi.org/10.1071/ZO9930225	1993
Crinia riparia	LC	http://dx.doi.org/10.2307/1440794	1965
Litoria longirostris	LC	http://dx.doi.org/10.2307/1443159	1977
Osteocephalus mutabor	LC	http://dx.doi.org/10.1163/156853802320877609	2002
Leptobrachium nigrops	LC	http://dx.doi.org/10.2307/1440966	1963
Pseudis tocantins	LC	http://dx.doi.org/10.1590/S0101-81751998000400011	1998
Mantidactylus argenteus	LC	http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x	1919
Aglyptodactylus securifer	LC	http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x	1998
Pseudis cardosoi	LC	http://dx.doi.org/10.1163/156853800507264	2000
Uperoleia inundata	LC	http://dx.doi.org/10.1071/AJZS079	1981
Litoria pronimia	LC	http://dx.doi.org/10.1071/ZO9930225	1993
Litoria paraewingi	LC	http://dx.doi.org/10.1071/ZO9760283	1976
Philautus aurifasciatus	LC	http://dx.doi.org/10.1163/156853887X00036	1987
Proceratophrys avelinoi	LC	http://dx.doi.org/10.1163/156853893X00156	1993
Osteocephalus deridens	LC	http://dx.doi.org/10.1163/156853800507525	2000
Gephyromantis boulengeri	LC	http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x	1919
Crossodactylus caramaschii	LC	http://dx.doi.org/10.2307/1446907	1995
Rana yavapaiensis	LC	http://dx.doi.org/10.2307/1445338	1984
Boophis lichenoides	LC	http://dx.doi.org/10.1163/156853898X00025	1998
Megistolotis lignarius	LC	http://dx.doi.org/10.1071/ZO9790135	1979
Ansonia endauensis	NE	http://dx.doi.org/10.1655/0018-0831(2006)62[466:ANSOAS]2.0.CO;2	2006
Ansonia kraensis	NE	http://dx.doi.org/10.2108/zsj.22.809	2005
Arthroleptella landdrosia	NT	http://dx.doi.org/10.2307/1565359	2000
Litoria jungguy	NT	http://dx.doi.org/10.1071/ZO02069	2004
Phrynobatrachus phyllophilus	NT	http://dx.doi.org/10.2307/1565925	2002
Philautus ingeri	VU	http://dx.doi.org/10.1163/156853887X00036	1987
Gastrotheca dendronastes	VU	http://dx.doi.org/10.2307/1445088	1983
Hyperolius cystocandicans	VU	http://dx.doi.org/10.2307/1443911	1977
Boophis sambirano	VU	http://dx.doi.org/10.1080/21564574.2005.9635520	2005
Ansonia torrentis	VU	http://dx.doi.org/10.1163/156853883X00021	1983
Telmatobufo australis	VU	http://dx.doi.org/10.2307/1563086	1972
Stefania coxi	VU	http://dx.doi.org/10.1655/0018-0831(2002)058[0327:EDOSAH]2.0.CO;2	2002
Oreolalax multipunctatus	VU	http://dx.doi.org/10.2307/1564828	1993
Eleutherodactylus guantanamera	VU	http://dx.doi.org/10.2307/1466962	1992
Spicospina flammocaerulea	VU	http://dx.doi.org/10.2307/1447757	1997
Cycloramphus acangatan	VU	http://dx.doi.org/10.1655/02-78	2003
Leiopelma pakeka	VU	http://dx.doi.org/10.1080/03014223.1998.9517554	1998
Rana okaloosae	VU	http://dx.doi.org/10.2307/1444847	1985
Phrynobatrachus uzungwensis	VU	http://dx.doi.org/10.1163/156853883X00030	1983

This is a small fraction of the frog species actually in GenBank because I've filtered it down to those that have been linked to Wikipedia (from where we get the conservation status) and which were described in papers with DOIs (from which we get the date of description).

I generated this result using this SPARQL query on a triple store that had the primary data sources (Uniprot, Dbpedia, CrossRef, ION) loaded, together with the all-important "glue" datasets that link ION to CrossRef, and Uniprot to Dbpedia (see previous post for details):


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX tdwg_tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tdwg_co: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?name ?status ?doi ?date ?thumbnail
WHERE {
  ?ncbi uniprot:scientificName ?name .
  ?ncbi rdfs:seeAlso ?dbpedia .
  ?dbpedia dbpedia-owl:conservationStatus ?status .
  ?ion  tdwg_tn:nameComplete ?name . 
  ?ion tdwg_co:publishedInCitation ?doi .
  ?doi dcterms:date ?date .

  OPTIONAL
  {
   ?dbpedia dbpedia-owl:thumbnail ?thumbnail
  }
} 
ORDER BY ASC(?status)
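
To make the role of the "glue" explicit, the graph pattern can be read as a chain of joins. Here is a rough Python sketch over a toy in-memory list of triples. The "ex:" identifiers are made-up placeholders (not real URIs), the predicates are written as prefixed names rather than full URIs, and the OPTIONAL thumbnail is left out. The name, status, DOI and year come from the Ansonia kraensis row of the table.

```python
# Toy triples. The "ex:" subjects are hypothetical placeholders, not real URIs.
TRIPLES = [
    ("ex:ncbi/1", "uniprot:scientificName", "Ansonia kraensis"),
    # glue: Uniprot -> Dbpedia
    ("ex:ncbi/1", "rdfs:seeAlso", "ex:dbpedia/Ansonia_kraensis"),
    ("ex:dbpedia/Ansonia_kraensis", "dbpedia-owl:conservationStatus", "NE"),
    ("ex:ion/1", "tdwg_tn:nameComplete", "Ansonia kraensis"),
    # glue: ION -> CrossRef
    ("ex:ion/1", "tdwg_co:publishedInCitation", "http://dx.doi.org/10.2108/zsj.22.809"),
    ("http://dx.doi.org/10.2108/zsj.22.809", "dcterms:date", "2005"),
]


def objects(triples, subject, predicate):
    """All objects of (subject, predicate, ?o)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """All subjects of (?s, predicate, obj)."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def frog_rows(triples):
    """Follow the same path as the SPARQL graph pattern, one join per loop."""
    rows = []
    for ncbi, predicate, name in triples:
        if predicate != "uniprot:scientificName":
            continue
        for dbpedia in objects(triples, ncbi, "rdfs:seeAlso"):
            for status in objects(triples, dbpedia, "dbpedia-owl:conservationStatus"):
                # The NCBI and ION records only meet via the shared name string.
                for ion in subjects(triples, "tdwg_tn:nameComplete", name):
                    for doi in objects(triples, ion, "tdwg_co:publishedInCitation"):
                        for date in objects(triples, doi, "dcterms:date"):
                            rows.append((name, status, doi, date))
    return rows
```

Drop either of the two triples marked as glue and frog_rows returns nothing, which is the point: the primary datasets alone don't join.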

This table doesn't tell us a great deal, but we could, for example, graph date of description against conservation status (CR=critically endangered, EN=endangered, VU=vulnerable, NT=near threatened, LC=least concern, DD=data deficient, NE=not evaluated):
[Chart: date of description plotted against conservation status]

In other words, is it the case that more recently described species are more likely to be endangered than taxa we've known about for some time (based on the assumption that we've found all the common species already)? We could imagine extending this query to retrieve sequences for a class of frog (e.g., critically endangered) so we could compute a measure of population genetic variation, etc. We shouldn't take the graph above too seriously because it's based on a small fraction of the data, but you get the idea. As more frog taxonomy goes online (there's a lot of stuff in BHL and BioStor, for example) we could add more dates and build a dataset worth analysing properly.
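
As a sketch of the sort of summary one might compute once the table exists, here is a few lines of Python that group year of description by status and take the median. It uses only nine rows copied from the table above (three each for CR, LC and VU), so the numbers are purely illustrative and shouldn't be read as a result.

```python
from collections import defaultdict
from statistics import median

# Nine rows copied from the table: (species, status, year described).
SAMPLE = [
    ("Atelopus nanay", "CR", 2002),
    ("Eleutherodactylus mariposa", "CR", 1992),
    ("Eleutherodactylus eunaster", "CR", 1973),
    ("Litoria havina", "LC", 1993),
    ("Crinia riparia", "LC", 1965),
    ("Mantidactylus argenteus", "LC", 1919),
    ("Philautus ingeri", "VU", 1987),
    ("Stefania coxi", "VU", 2002),
    ("Rana okaloosae", "VU", 1985),
]


def median_year_by_status(rows):
    """Map each conservation status to the median year of description."""
    years = defaultdict(list)
    for _species, status, year in rows:
        years[status].append(year)
    return {status: median(values) for status, values in years.items()}


if __name__ == "__main__":
    for status, year in sorted(median_year_by_status(SAMPLE).items()):
        print(status, year)
```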

It seems to me that these should be fairly simple things to do, yet if we attempt them today it's a world of hurt involving scripts, Excel, data cleaning, etc. before we can do the science.

The thing is, without the "glue" files mapping identifiers across different databases even this simple query isn't possible. Obviously we have no say in how many organisations publish RDF, but within the biodiversity informatics community we should make every effort to use external identifiers wherever possible so that we can make these links. This is the core of my complaint. If we are using RDF to foster data integration so we can query across the diverse data sets that speak to biodiversity, then we are doing it wrong.

Update
Here is a nice visualisation of this dataset from @orovellotti (original here), made using ecoRelevé:

[Image: visualisation of the dataset]

Linked data that isn't: the failings of RDF

OK, a bit of hyperbole in the morning. One of the goals of RDF is to create the Semantic Web, an interwoven network of data seamlessly linked by shared identifiers and shared vocabularies. Everyone uses the same identifiers for the same things, and when they describe these things they use the same terms. Simples.

Of course, the reality is somewhat different. Typically people don't reuse identifiers, and there are usually several competing vocabularies we can choose from. To give a concrete example, consider two RDF documents describing the same article, one provided by CiNii, the other by CrossRef. The article is:

Astuti, D., Azuma, N., Suzuki, H., & Higashi, S. (2006). Phylogenetic Relationships Within Parrots (Psittacidae) Inferred from Mitochondrial Cytochrome-b Gene Sequences(Phylogeny). Zoological science, 23(2), 191-198. doi:10.2108/zsj.23.191

You can get RDF for a CiNii record by appending ".rdf" to the URL for the article, in this case http://ci.nii.ac.jp/naid/130000017049. For CrossRef you need a Linked Data compliant client, or you can do something like this:


curl -D - -L -H "Accept: application/rdf+xml" "http://dx.doi.org/10.2108/zsj.23.191"
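
For anyone who would rather script this than use curl, a minimal Python equivalent might look like the following. It uses only the standard library. The function names are my own, and it assumes both services still behave as described above and return UTF-8. The request is built separately from being sent, so the headers can be inspected without touching the network.

```python
import urllib.request


def cinii_rdf_url(article_url):
    """CiNii: the RDF lives at the article URL with ".rdf" appended."""
    return article_url.rstrip("/") + ".rdf"


def crossref_rdf_request(doi):
    """Build (but do not send) a request asking the DOI resolver for RDF/XML."""
    return urllib.request.Request(
        "http://dx.doi.org/" + doi,
        headers={"Accept": "application/rdf+xml"},
    )


def fetch(request_or_url):
    """Send the request; urlopen follows redirects by default, like curl -L."""
    with urllib.request.urlopen(request_or_url) as response:
        # Assumption: the server returns UTF-8.
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(fetch(cinii_rdf_url("http://ci.nii.ac.jp/naid/130000017049"))[:500])
    print(fetch(crossref_rdf_request("10.2108/zsj.23.191"))[:500])
```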

You can view the RDF from these two sources here and here.

No shared identifiers
The two RDF documents have no shared identifiers, or at least, any identifiers they do share aren't described in a way that is easily discovered. The CrossRef record knows nothing about the CiNii record, but the CiNii document includes this statement:


<rdfs:seeAlso rdf:resource="http://ci.nii.ac.jp/lognavi?name=crossref&amp;id=info:doi/10.2108/zsj.23.191" dc:title="CrossRef" />

So, CiNii knows about the DOI, but this doesn't help much as the CrossRef document has the URI "http://dx.doi.org/10.2108/zsj.23.191", so we don't have an explicit statement that the two documents refer to the same article.
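
A client could paper over this by pulling a bare DOI out of both URI styles before comparing. The sketch below handles only the two patterns seen in these documents (a dx.doi.org URL and CiNii's "id=info:doi/..." query parameter). The function name is mine, and it is not a general DOI parser.

```python
from urllib.parse import parse_qs, urlsplit


def doi_from_uri(uri):
    """Return a lower-cased bare DOI from either URI style, or None."""
    parts = urlsplit(uri)
    if parts.netloc == "dx.doi.org":
        # CrossRef style: the DOI is the path.
        return parts.path.lstrip("/").lower()
    # CiNii style: the DOI is carried in the "id" query parameter.
    for value in parse_qs(parts.query).get("id", []):
        if value.startswith("info:doi/"):
            return value[len("info:doi/"):].lower()
    return None


if __name__ == "__main__":
    # The CiNii URI is written as it appears after XML unescaping (&amp; -> &).
    cinii = "http://ci.nii.ac.jp/lognavi?name=crossref&id=info:doi/10.2108/zsj.23.191"
    crossref = "http://dx.doi.org/10.2108/zsj.23.191"
    print(doi_from_uri(cinii) == doi_from_uri(crossref))
```

Of course, this is exactly the sort of per-source special-casing that shared identifiers are supposed to make unnecessary.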

The other identifier the documents could share is the ISSN for the journal (0289-0003), but CiNii writes this without the "-", and uses the PRISM term "prism:issn", so we have:


<prism:issn>02890003</prism:issn>

whereas CrossRef writes the ISSN like this:


<ns0:issn xmlns:ns0="http://prismstandard.org/namespaces/basic/2.1/">0289-0003</ns0:issn>

Unless we have a linked data client that normalises ISSNs before it does a SPARQL query we will miss the fact that these two articles are in the same journal.
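
The normalisation itself is trivial, which makes it all the more annoying that every client has to do it. A minimal sketch (the function name is mine) that strips whitespace and hyphens and rewrites the value in the usual NNNN-NNNN form:

```python
import re


def normalise_issn(raw):
    """Return an ISSN as NNNN-NNNN, whatever hyphenation or padding it had."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    # Eight characters; the final check character may be "X".
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        raise ValueError(f"not an ISSN: {raw!r}")
    return compact[:4] + "-" + compact[4:]


if __name__ == "__main__":
    # CiNii and CrossRef spellings of the same ISSN.
    print(normalise_issn("02890003") == normalise_issn("0289-0003"))
```

This checks only the shape of the string, not the check digit.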

Inconsistent vocabularies
Both CiNii and CrossRef use the PRISM vocabulary to describe the article, but they use different versions. CrossRef uses "http://prismstandard.org/namespaces/basic/2.1/" whereas CiNii uses "http://prismstandard.org/namespaces/basic/2.0/". Version 2.1 versus version 2.0 is a minor difference, but the URIs are different and hence they are different vocabularies (having version numbers in vocabulary URIs is asking for trouble). Hence, even if CiNii and CrossRef wrote ISSNs in the same way, we'd still not be able to assert that the articles come from the same journal.
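
One workaround a client could apply is to strip the version segment from PRISM term URIs before comparing predicates. This is a sketch only. Treating the two versions as interchangeable is my assumption, and it would be wrong for any term whose meaning changed between versions. The unversioned URI it produces is just a local comparison key, not a real PRISM namespace.

```python
import re

# Matches the PRISM "basic" namespace followed by a version segment such as 2.0/ or 2.1/.
_PRISM_VERSIONED = re.compile(
    r"^(http://prismstandard\.org/namespaces/basic/)\d+(?:\.\d+)*/"
)


def unversioned(term_uri):
    """Drop the version segment from a PRISM term URI; other URIs pass through."""
    return _PRISM_VERSIONED.sub(r"\1", term_uri)


if __name__ == "__main__":
    cinii = "http://prismstandard.org/namespaces/basic/2.0/issn"
    crossref = "http://prismstandard.org/namespaces/basic/2.1/issn"
    print(unversioned(cinii) == unversioned(crossref))
```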
Inconsistent use of vocabularies
Both CiNii and CrossRef use FOAF for author names, but they write the names differently (CiNii first, then CrossRef):


<foaf:name xml:lang="en">Suzuki Hitoshi</foaf:name>


<ns0:name xmlns:ns0="http://xmlns.com/foaf/0.1/">Hitoshi Suzuki</ns0:name>

So, another missed opportunity to link the documents. One could argue this would be solved if we had consistent identifiers for authors, but we don't. In this case CiNii have their own local identifiers (e.g. http://ci.nii.ac.jp/nrid/1000040179239), and CrossRef has a rather hideous looking Skolemisation: http://id.crossref.org/contributor/hitoshi-suzuki-2gypi8bnqk7yy.
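
Lacking author identifiers, about the best a client can do is fuzzy matching on the name strings. As a deliberately crude sketch (the function name is mine), an order-insensitive comparison is enough to match the two spellings above, though it will happily conflate different people who share the same name parts and does nothing sensible with initials:

```python
def name_key(name):
    """Order- and case-insensitive key for a personal name."""
    return frozenset(name.lower().replace(",", " ").split())


if __name__ == "__main__":
    # CiNii writes family name first, CrossRef given name first.
    print(name_key("Suzuki Hitoshi") == name_key("Hitoshi Suzuki"))
```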

In summary, it's a mess. Both CiNii and CrossRef are organisations whose core business is bibliographic metadata. It's great that both are serving RDF, but if we think this is anything more than providing metadata in a useful format I think we may be deceiving ourselves.

NCBI taxonomy, TDWG vocabularies, and RDF

Lately I've been returning to playing with RDF and triple stores. This is a serious case of déjà vu, as two blogs I've now abandoned will testify (bioGUID and SemAnt). Basically, a combination of frustration with the tools, data cleaning, and the lack of identifiers got in the way of making much progress. I gave up on triple stores for a while, rolling my own Entity–Attribute–Value (EAV) database, which I used for the Elsevier Challenge (EAV databases are essentially key-value databases, CouchDB being a well-known example).

Now, I'm revisiting triple stores and SPARQL, partly because Linked Data is gaining momentum, and partly because we now have a few LSID providers, and some decent vocabularies from TDWG. Having created an LSID resolver that plays nicely with Linked Data (it also does the same thing for DOIs), it's time to dust off SPARQL and see what can be done.

One reason there's interest in having GUIDs and standard vocabularies is so that we can link different sources of information together. But more than just linking, we should be able to compute across these links and learn new things, or at least add annotations from one database to another.

To make this concrete, take the NCBI taxon 101855, Lulworthia uniseptata. If we visit the NCBI page we see links to other resources, such as Index Fungorum record 105488, which tells us that Lulworthia uniseptata was published in Trans. Mycol. Soc. Japan 25(4): 382 (1984), and that the current name is Lulwoana uniseptata, which was published in Mycol. Res. 109(5): 562 (2005).

Wouldn't it be nice to be able to automatically link these things together? And wouldn't it be nice to have identifiers for the literature, rather than only human-readable text strings? Using bioGUID, we can discover that Mycol. Res. 109(5): 562 (2005) has the DOI doi:10.1017/S0953756205002716 -- I haven't found Trans. Mycol. Soc. Japan 25(4): 382 (1984) online anywhere.

Now, given that we have LSIDs for Index Fungorum, I can resolve urn:lsid:indexfungorum.org:names:369395 and discover that

urn:lsid:indexfungorum.org:names:369395 tname:hasBasionym urn:lsid:indexfungorum.org:names:105488

and, I can add the statement

urn:lsid:indexfungorum.org:names:369395 tcommon:publishedInCitation doi:10.1017/S0953756205002716
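
For reference, an LSID has the form urn:lsid:authority:namespace:object, with an optional revision at the end. A minimal Python sketch of pulling one apart (the class and function names are mine; this does no resolution, it only splits the string):

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    return LSID(*parts[2:])


if __name__ == "__main__":
    current = parse_lsid("urn:lsid:indexfungorum.org:names:369395")
    basionym = parse_lsid("urn:lsid:indexfungorum.org:names:105488")
    # Same authority and namespace, different object identifiers.
    print(current.authority == basionym.authority, current.object_id, basionym.object_id)
```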

What I'd like to do is link this to the NCBI taxon, so that I can display this additional knowledge in one place (i.e., there is an additional name for this fungus, and where it is published). To do this, I need the NCBI taxonomy in RDF. Turns out that everyone and their dog has been generating RDF versions of the NCBI taxonomy, including Uniprot (source of the diagram above). The problem is, each effort creates its own project-specific vocabulary. For example, here is the record for NCBI taxon 101855 in Uniprot RDF (http://www.uniprot.org/taxonomy/101855):


<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
        <rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
        <rank rdf:resource="http://purl.uniprot.org/core/Species"/>
        <scientificName>Lulworthia uniseptata</scientificName>
        <otherName>Zalerion maritimum</otherName>
        <rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/45817"/>
        <partOfLineage>false</partOfLineage>
    </rdf:Description>
</rdf:RDF>

Uniprot has its own vocabulary, http://purl.uniprot.org/core/. So, what I'd like to do is create a version of the NCBI taxonomy using TDWG's TaxonConcept vocabulary, so that it becomes straightforward to link NCBI to name databases such as Index Fungorum, IPNI, Zoobank, and ION that are serving taxon names.
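
As a very rough sketch of what such a conversion might involve, the Python below reads a trimmed copy of the Uniprot record above and re-expresses the scientific name using a TDWG term. The mapping is my assumption, not an agreed one: it uses nameComplete from TDWG's TaxonName vocabulary (the term ION uses in the SPARQL query in the first post) because I'm sure that term exists, and it ignores rank, synonyms and the parent taxon entirely.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
UNIPROT = "http://purl.uniprot.org/core/"
TDWG_TN = "http://rs.tdwg.org/ontology/voc/TaxonName#"

# Trimmed copy of the Uniprot record for NCBI taxon 101855.
RECORD = """
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <rdf:Description rdf:about="http://purl.uniprot.org/taxonomy/101855">
        <rdf:type rdf:resource="http://purl.uniprot.org/core/Taxon"/>
        <scientificName>Lulworthia uniseptata</scientificName>
        <otherName>Zalerion maritimum</otherName>
    </rdf:Description>
</rdf:RDF>
"""


def tdwg_name_triples(rdf_xml):
    """Return (subject, predicate, literal) triples using the TDWG name term."""
    root = ET.fromstring(rdf_xml.strip())
    triples = []
    for description in root.findall(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for name in description.findall(f"{{{UNIPROT}}}scientificName"):
            # Assumed mapping: uniprot:scientificName -> tdwg_tn:nameComplete
            triples.append((subject, TDWG_TN + "nameComplete", name.text))
    return triples


if __name__ == "__main__":
    for triple in tdwg_name_triples(RECORD):
        print(triple)
```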
