
Linked data that isn't: the failings of RDF

OK, a bit of hyperbole in the morning. One of the goals of RDF is to create the Semantic Web, an interwoven network of data seamlessly linked by shared identifiers and shared vocabularies. Everyone uses the same identifiers for the same things, and when they describe these things they use the same terms. Simples.

Of course, the reality is somewhat different. Typically people don't reuse identifiers, and there are usually several competing vocabularies to choose from. To give a concrete example, consider two RDF documents describing the same article, one provided by CiNii, the other by CrossRef. The article is:

Astuti, D., Azuma, N., Suzuki, H., & Higashi, S. (2006). Phylogenetic Relationships Within Parrots (Psittacidae) Inferred from Mitochondrial Cytochrome-b Gene Sequences(Phylogeny). Zoological science, 23(2), 191-198. doi:10.2108/zsj.23.191

You can get RDF for a CiNii record by appending ".rdf" to the URL for the article, in this case http://ci.nii.ac.jp/naid/130000017049. For CrossRef you need a Linked Data compliant client, or you can do something like this:


curl -D - -L -H "Accept: application/rdf+xml" "http://dx.doi.org/10.2108/zsj.23.191"

You can view the RDF from these two sources here and here.

No shared identifiers
The two RDF documents have no shared identifiers, or at least, any identifiers they do share aren't described in a way that is easily discovered. The CrossRef record knows nothing about the CiNii record, but the CiNii document includes this statement:


<rdfs:seeAlso rdf:resource="http://ci.nii.ac.jp/lognavi?name=crossref
&amp;id=info:doi/10.2108/zsj.23.191" dc:title="CrossRef" />

So, CiNii knows about the DOI, but this doesn't help much: the CrossRef document uses the URI "http://dx.doi.org/10.2108/zsj.23.191", so we still have no explicit statement that the two documents refer to the same article.

The other identifier the two documents could share is the ISSN for the journal (0289-0003), but CiNii writes this without the "-" and uses the PRISM term "prism:issn", so we have:


<prism:issn>02890003</prism:issn>


whereas CrossRef writes the ISSN like this:


<ns0:issn xmlns:ns0="http://prismstandard.org/namespaces/basic/2.1/">
0289-0003</ns0:issn>


Unless we have a linked data client that normalises ISSNs before it does a SPARQL query, we will miss the fact that these two records describe an article in the same journal.
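A crude workaround is to normalise ISSNs to a single canonical form before loading or querying the data. Here's a minimal sketch (the function name and the choice of the hyphenated form as canonical are mine, not anything mandated by PRISM):

import re

def normalise_issn(value):
    """Normalise an ISSN to the canonical NNNN-NNNC form.

    Accepts "02890003" (CiNii style) or "0289-0003" (CrossRef style);
    returns None if the string doesn't look like an ISSN.
    """
    chars = re.sub(r"[^0-9Xx]", "", value).upper()
    if len(chars) != 8:
        return None
    return chars[:4] + "-" + chars[4:]

assert normalise_issn("02890003") == "0289-0003"
assert normalise_issn("0289-0003") == "0289-0003"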

Inconsistent vocabularies
Both CiNii and CrossRef use the PRISM vocabulary to describe the article, but they use different versions. CrossRef uses "http://prismstandard.org/namespaces/basic/2.1/" whereas CiNii uses "http://prismstandard.org/namespaces/basic/2.0/". Version 2.1 versus version 2.0 is a minor difference, but the URIs are different and hence they are different vocabularies (having version numbers in vocabulary URIs is asking for trouble). Hence, even if CiNii and CrossRef wrote ISSNs in the same way, we'd still not be able to assert that the two records refer to the same journal.
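One way to paper over this when building a local triple store is to rewrite one namespace as the other at load time. A rough sketch using rdflib (the file names are placeholders for local copies of the two documents, and collapsing PRISM 2.0 into 2.1 is my own simplification, not something either provider endorses):

from rdflib import Graph, URIRef

PRISM_20 = "http://prismstandard.org/namespaces/basic/2.0/"
PRISM_21 = "http://prismstandard.org/namespaces/basic/2.1/"

g = Graph()
g.parse("cinii.rdf", format="xml")     # placeholder: local copy of the CiNii RDF
g.parse("crossref.rdf", format="xml")  # placeholder: local copy of the CrossRef RDF

# Rewrite every PRISM 2.0 predicate as its 2.1 equivalent, so that
# prism:issn from both sources ends up as the same property.
for s, p, o in list(g):
    if str(p).startswith(PRISM_20):
        g.remove((s, p, o))
        g.add((s, URIRef(PRISM_21 + str(p)[len(PRISM_20):]), o))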
Inconsistent use of vocabularies
Both CiNii and CrossRef use FOAF for author names, but they write the names differently:


<foaf:name xml:lang="en">Suzuki Hitoshi</foaf:name>


<ns0:name xmlns:ns0="http://xmlns.com/foaf/0.1/">Hitoshi Suzuki</ns0:name>


So, another missed opportunity to link the documents. One could argue this would be solved if we had consistent identifiers for authors, but we don't. In this case CiNii have their own local identifiers (e.g. http://ci.nii.ac.jp/nrid/1000040179239), and CrossRef has a rather hideous looking Skolemisation: http://id.crossref.org/contributor/hitoshi-suzuki-2gypi8bnqk7yy.
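In the absence of shared author identifiers, about the best a client can do is fuzzy matching on the name strings themselves, for example treating a name as an unordered bag of tokens so that "Suzuki Hitoshi" and "Hitoshi Suzuki" compare equal. A naive sketch, purely for illustration (neither CiNii nor CrossRef suggest doing this, and it will happily be fooled by homonyms):

def same_person(name1, name2):
    # Same tokens, ignoring order and case: "Suzuki Hitoshi" == "Hitoshi Suzuki"
    return sorted(name1.lower().split()) == sorted(name2.lower().split())

assert same_person("Suzuki Hitoshi", "Hitoshi Suzuki")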

In summary, it's a mess. Both CiNii and CrossRef are organisations whose core business is bibliographic metadata. It's great that both are serving RDF, but if we think this is anything more than providing metadata in a useful format I think we may be deceiving ourselves.

Referring to a one-degree square in RDF using c-squares

I'm in the midst of rebuilding iSpecies (my mash-up of Wikipedia, NCBI, GBIF, Yahoo, and Google search results) with the aim of outputting the results in RDF. The goal is to convert iSpecies from a pretty crude "on-the-fly" mash-up to a triple store where results are cached and can be queried in interesting ways. Why? Partly because I think such a triple store is an obvious way to underpin a "biodiversity hub" of the kind envisaged by PLoS (see my earlier post).

As ever, once one embarks down the RDF route (and I've been here before), one hits all the classic stumbling blocks, such as "what URI do I use for a thing?", and "what vocabulary do I use to express relationships between things?". For example, I'd like to represent the geographic distribution of a taxon as depicted on a GBIF map. How do I describe this in an RDF document?

To make this concrete, take one of my favourite animals, the New Zealand mud crab Helice crassa. Here's the GBIF map for this taxon:

[Figure: GBIF distribution map for Helice crassa]
This map has the URL (I kid you not):

http://ogc.gbif.org/wms?request=GetMap
&bgcolor=0x666698
&styles=,,,
&layers=gbif:country_fill,gbif:tabDensityLayer,gbif:country_borders,gbif:country_names
&srs=EPSG:4326
&filter=()(
%3CFilter%3E
%3CPropertyIsEqualTo%3E
%3CPropertyName%3Eurl
%3C/PropertyName%3E
%3CLiteral%3E
%3C![CDATA[http%3A%2F%2Fdata.gbif.org%2Fmaplayer%2Ftaxon%2F17462693%2FtenDeg%2F-45%2F160%2F]]%3E
%3C/Literal%3E
%3C/PropertyIsEqualTo%3E
%3C/Filter%3E)()()
&width=721
&height=362
&Format=image/png
&bbox=160,-45,180,-35

(or http://bit.ly/cuTFW9, if you prefer). Now, there's no way I'm using this URL! Plus, the URL identifies an image, not the distribution.

But, if we look at the map we see that it is made of 1° × 1° squares. If each of those had a URI then I could simply list those squares as the distribution of the crab. This seems straightforward as GBIF has a service that provides these squares. For example, the URL http://data.gbif.org/species/17462693 (where 17462693 corresponds to Helice crassa) returns:

MINX MINY MAXX MAXY DENSITY
167.0 -45.0 168.0 -44.0 5
174.0 -42.0 175.0 -41.0 20
174.0 -38.0 175.0 -37.0 17
174.0 -37.0 175.0 -36.0 4

These are the 1° × 1° squares for which there are records of Helice crassa. Now, what I'd like to do is have a URI for each square, and I'd like to do this without reinventing the wheel. I've come across a URI space for points of the globe (the "WGS 84 Geographic Point URI Space"), but not one for polygons. Then it dawned on me that perhaps c-squares, developed by Tony Rees at the CSIRO in Australia, would do the trick [1]. To quote Tony:
C-squares is a system for storage, querying, display, and exchange of "spatial data" locations and extents in a simple, text-based, human- and machine- readable format. It uses numbered (coded) squares on the earth's surface measured in degrees (or fractions of degrees) of latitude and longitude as fundamental units of spatial information, which can then be quoted as single squares (similar to a "global postcode") in which one or more data points are located, or be built up into strings of codes to represent a wide variety of shapes and sizes of spatial data "footprints".

C-squares appeal partly (and this says nothing good about me) because they have a slightly Byzantine syntax. However, they are short, and quite easy to calculate. I'll let the reader find out the gory details. To give an example, my home town, Auckland, has latitude -36.84, longitude 174.74, which corresponds to the 1° × 1° c-square with the code 3317:364.
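Here's a rough sketch of the 1° × 1° calculation as I understand the c-squares notation (treat it as my reading of the spec rather than reference code); it reproduces the Auckland example above:

def csquare_1degree(lat, lon):
    """Return the 1-degree c-square code for a latitude/longitude pair.

    My reading of the notation: a global quadrant digit, the tens of
    degrees of latitude and longitude (the 10-degree square), then an
    intermediate quadrant digit and the units digits of latitude and
    longitude.
    """
    # Global quadrant: 1 = NE, 3 = SE, 5 = SW, 7 = NW
    if lat >= 0:
        quadrant = 1 if lon >= 0 else 7
    else:
        quadrant = 3 if lon >= 0 else 5
    alat, alon = abs(lat), abs(lon)
    ten_degree = "%d%d%02d" % (quadrant, int(alat // 10), int(alon // 10))
    # Intermediate quadrant within the 10-degree square
    intermediate = 1 + 2 * int(alat % 10 >= 5) + int(alon % 10 >= 5)
    return "%s:%d%d%d" % (ten_degree, intermediate, int(alat % 10), int(alon % 10))

assert csquare_1degree(-36.84, 174.74) == "3317:364"  # Auckland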

Now, all I need to do is convert c-squares into URIs. If you append the c-square to http://bioguid.info/csquare:, like this, http://bioguid.info/csquare:3317:364, you get a linked data-friendly URI for the c-square. In a web browser you get a simple web page like this:

[Screenshot: bioguid.info page for the c-square 3317:364]
A linked data client will get RDF, like this:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
  xmlns:geom="http://fabl.net/vocabularies/geometry/1.1/">
  <dcterms:Location rdf:about="http://bioguid.info/csquare:3317:364">
    <rdfs:label>3317:364</rdfs:label>
    <geom:xmin>174</geom:xmin>
    <geom:ymin>-37</geom:ymin>
    <geom:xmax>175</geom:xmax>
    <geom:ymax>-36</geom:ymax>
    <dwc:footprintWKT>POLYGON((174 -37,175 -37,175 -36,174 -36,174 -37))</dwc:footprintWKT>
  </dcterms:Location>
</rdf:RDF>

Now, I can refer to each square by its own URI. This will also enable me to query a triple store by c-square (e.g., what other taxa occur within this 1° × 1° square?).
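For example, a query along these lines would pull back everything recorded from the Auckland square (a sketch only: the property linking a taxon to a c-square, and the file being loaded, are placeholders for whatever the iSpecies store ends up using):

from rdflib import Graph

g = Graph()
g.parse("ispecies.rdf", format="xml")  # placeholder dump of the triple store

# ex:occursIn is a made-up property standing in for whatever predicate
# actually links a taxon to a c-square in the store.
query = """
PREFIX ex: <http://example.org/terms/>
SELECT ?taxon
WHERE {
  ?taxon ex:occursIn <http://bioguid.info/csquare:3317:364> .
}
"""
for row in g.query(query):
    print(row.taxon)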
[1] Tony Rees had emailed me about this in response to a tweet about URIs for co-ordinates, but it took me a while to realise how useful c-square notation could be.

Progress on a phylogeny wiki

I've made some progress on a wiki of phylogenies. Still much to do, but here are some samples of what I'm trying to do.

First up, here's an example of a publication http://iphylo.org/treebase/Doi:10.1016/j.ympev.2008.02.021:
[Screenshot: wiki page for the publication]

In addition to basic bibliographic details we have links to GenBank sequences and a phylogeny. The sequences are georeferenced, which enables us to generate a map. At a glance we see that the study area is Central America.

This study published the following tree:
[Screenshot: the phylogeny published in this study]

The tree is displayed using my tvwidget. A key task in constructing the wiki is mapping labels used in TreeBASE to other taxonomic names, for example, those in the NCBI taxonomy database. This is something I first started working on in the TbMap project (doi:10.1186/1471-2105-8-158). In the context of this wiki I'm explicitly mapping TreeBASE taxa to NCBI taxa. Taxa are modelled as "namestrings" (simple text strings), OTUs (TreeBASE taxa), and taxonomic concepts (sets of observations or taxa). For example, the tree shown above has numerous samples of the frog Pristimantis ridens, each with a unique label (namestring) that includes, in this case, voucher specimen information (e.g., "Pristimantis ridens La Selva AJC0522 CR"). Each of these labels is mapped to the NCBI taxon Pristimantis ridens.
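As a crude illustration of the namestring-to-taxon step (this is not how TbMap works, just the flavour of the mapping), one can look for an NCBI name embedded at the start of each TreeBASE label:

# A couple of NCBI names, purely for illustration.
ncbi_names = ["Pristimantis ridens", "Craugastor fitzingeri"]

def map_label_to_ncbi(label, names=ncbi_names):
    # Return the longest NCBI name the label starts with, or None.
    matches = [n for n in names if label.startswith(n)]
    return max(matches, key=len) if matches else None

assert map_label_to_ncbi("Pristimantis ridens La Selva AJC0522 CR") == "Pristimantis ridens"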

One thing I'm interested in doing is annotating the tree. Eventually I hope to generate (or make it easy to generate) things such as Google Earth phylogenies (via georeferenced sequences and specimens). For now I'm playing with generating nicer labels for the terminal taxa. As it stands if you download the original tree from TreeBASE you have the original study-specific labels (e.g., "Pristimantis ridens La Selva AJC0522 CR"), whereas it would be nice to also have taxonomic names (for example, if you wanted to combine the tree or data with another study). Below the tree you'll see a NEXUS NOTES block with the "ALTTAXNAMES" command. The program Mesquite can use this command to enable users to toggle between different labels, so that you can have either a tree like this:
[Screenshot: tree with the original study-specific labels]

or a tree like this:
[Screenshot: the same tree with taxonomic names as labels]

How Wikipedia can help scope a project

I'm revisiting the idea of building a wiki of phylogenies using Semantic Mediawiki. One problem with a project like this is that it can rapidly explode. Phylogenies have taxa, which have characters, nucleotide sequences and other genomic data, and names, and come from geographic locations, and are collected and described by people, who may deposit samples in museums, and also write papers, which are published in journals, and so on. Pretty soon, any decent model of a phylogeny database is connected to pretty much anything of interest in the biological sciences. So we have a problem of scope. At what point do we stop adding things to the database model?

It seems to me that Wikipedia can help. Once we hit a topic that exists in Wikipedia, then we can stop. It's a reasonable bet that either now, or at some point in the future, the Wikipedia page is likely to be as good as, or better than, anything a single project could do. Hence, there's probably not much point storing lots of information about genes, countries, geographic regions, people, journals, or even taxa, as Wikipedia has these. This means we can focus on gluing together the core bits of a phylogenetic study (trees, taxa, data, specimens, publications) and then link these to Wikipedia.

In a sense this is a variation on the ideas explored in EOL, the BBC, and Wikipedia, but in developing my wiki of phylogenies project (this is the third iteration of this project) it's struck me how the question "is this in Wikipedia?" is the quickest way to answer the question "should I add x to my wiki?" Hence, Wikipedia becomes an antidote to feature bloat, and helps define the scope of a project more clearly.

Towards a wiki of phylogenies

At the start of this week I took part in a biodiversity informatics workshop at the Naturhistoriska riksmuseet, organised by Kevin Holston. It was a fun experience, and Kevin was a great host, going out of his way to make sure that the other contributors and I were looked after. I gave my usual pitch along the lines of "if you're not online you don't exist", and talked about iSpecies, identifiers, and wikis.

I also ran a short, not terribly successful exercise using iTaxon to demo what semantic wikis can do. As is often the case with something that hasn't been polished yet, the students found the mechanics of doing things less than intuitive. I need to do a lot of work making data input easier (to date I've focussed on automated adding of data, and forms to edit existing data). Adding data is easy if you know how, but the user needs to know more than they really should have to.

The exercise was to take some frog taxa from the Frost et al. amphibian tree (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2) and link them to GenBank sequences and museum specimens. The hope was that by making these links new information would emerge. You could think of it as an editable version of this. With a bit of post-exercise tidying, we got some way there. The wiki page for the Frost et al. paper now shows a list of sequences from that paper (not all, I hasten to add), and a map for those sequences that the students added to the wiki:

[Screenshot: wiki page for the Frost et al. paper, with sequences and map]


Although much remains to be done, I can't help thinking that this approach would work well for a database like TreeBASE, where one really needs to add a lot of annotation to make it useful (for example, mapping OTUs to taxon names, linking data to sequences and specimens). So, one of the things I'm going to look at is dumping a copy of TreeBASE (complete with trees) into the wiki and seeing what can be done with it. Oh, and I need to make it much, much easier for people to add data.

Semantic Publishing: towards real integration by linking

PLoS Computational Biology has recently published "Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article" (doi:10.1371/journal.pcbi.1000361) by David Shotton and colleagues. As a proof of concept, they took Reis et al. (doi:10.1371/journal.pntd.0000228) and "semantically enhanced" it:
These semantic enhancements include provision of live DOIs and hyperlinks; semantic markup of textual terms, with links to relevant third-party information resources; interactive figures; a re-orderable reference list; a document summary containing a study summary, a tag cloud, and a citation analysis; and two novel types of semantic enrichment: the first, a Supporting Claims Tooltip to permit “Citations in Context”, and the second, Tag Trees that bring together semantically related terms. In addition, we have published downloadable spreadsheets containing data from within tables and figures, have enriched these with provenance information, and have demonstrated various types of data fusion (mashups) with results from other research articles and with Google Maps.
The enhanced article is here: doi:10.1371/journal.pntd.0000228.x001. For background on these enhancements, see also David's companion article "Semantic publishing: the coming revolution in scientific journal publishing" (doi:10.1087/2009202, PDF preprint available here). The process is summarised in the figure below (Fig. 10 from Shotton et al., doi:10.1371/journal.pcbi.1000361.g010).



While there is lots of cool stuff here (see also Elsevier's Article 2.0 Contest, and the Grand Challenge, for which David is one of the judges), I have a couple of reservations.

The unique role of the journal article?

Shotton et al. argue for a clear distinction between journal article and database, in contrast to the view articulated by Philip Bourne (doi:10.1371/journal.pcbi.0010034) that there's really no difference between a database and a journal article and that the two are converging. I tend to favour the latter viewpoint. Indeed, as I argued in my Elsevier Challenge entry (doi:10.1038/npre.2008.2579.1), I think we should publish articles (and indeed data) as wikis, so that we can fix the inevitable errors. We can always roll back to the original version if we want to see the author's original paper.

Real linking

But my real concern is that the example presented is essentially "integration by linking", that is, the semantically enhanced version gives us lots of links to other information, but these are regular hyperlinks to web pages. So, essentially we've gone from pre-web documents with no links, to documents where the bibliography is hyperlinked (most online journals), to documents where both the bibliography and some terms in the text are hyperlinked (a few journals, plus the Shotton et al. example). I'm a tad underwhelmed.
What bothers me about this is:
  1. The links are to web pages, so it will be hard to do computation on these (unless the web page has easily retrievable metadata)
  2. There is no reciprocal linking -- the resource being linked to doesn't know it is the target of the link


Web pages are for humans

The first concern is that the marked-up article is largely intended for human readers. Yes, there are associated metadata files in RDF N3, but the core "added value" is really only of use to humans. For it to be of use to a computer, the links would have to go to resources that the computer can understand. A human clicking on many of the links will get a web page and they can interpret that, but computers are thick and they need a little help. For example, one hyperlinked term is Leptospira spirochete, linked to the uBio namebank record (click on the link to see it). The link resolves to a web page, so it's not much use to a computer (unless it has a scraper for uBio HTML). Ironically, uBio serves LSIDs, so we could retrieve RDF metadata for this name (urn:lsid:ubio.org:namebank:255659), but there's nothing in the uBio web page that tells the computer that.

Of course, Shotton et al. aren't responsible for the fact that most web pages aren't easily interpreted by computers, but simply embedding links to web pages isn't a big leap forward. What could they have done instead? One approach is to link to resources that are computer-readable. For example, instead of linking the term "Oswaldo Cruz Foundation" to that organisation's home page (http://www.fiocruz.br/cgi/cgilua.exe/sys/start.htm?tpl=home), why not use the DBpedia URI http://dbpedia.org/page/Instituto_Oswaldo_Cruz? Now we get both a human-readable page, and extensive RDF that a computer can use. In other words, if we crawl the semantically enhanced PLoS article with a program, I want that crawler to be able to follow the links and still get useful information, not the dead end of an HTML web page. Quite a few of the institutions listed in the enhanced paper have DBpedia URIs:


Why does this matter? Well, if you use DBpedia URIs you get RDF, plus you get connections with the Linked Data crowd, who are rapidly linking diverse data sets together:


I think this is where we need to be headed, and with a little extra effort we can get there, once we move on from thinking solely about human readers.
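For what it's worth, the "little extra effort" on the crawler's side is just the same content negotiation trick used for CrossRef above. A minimal sketch with Python's standard library (the DBpedia resource URI is the machine-readable counterpart of the /page/ URL above; exact redirect behaviour may vary):

import urllib.request

# Ask DBpedia for RDF rather than HTML, just like the dx.doi.org example earlier.
req = urllib.request.Request(
    "http://dbpedia.org/resource/Instituto_Oswaldo_Cruz",
    headers={"Accept": "application/rdf+xml"},
)
with urllib.request.urlopen(req) as response:
    rdf_xml = response.read().decode("utf-8")
print(rdf_xml[:200])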

An alternative approach (and one that I played with in my Challenge entry, as well as my ongoing wiki efforts) is to create what Vandervalk et al. term a "semantic warehouse" (doi:10.1093/bib/bbn051). Information about each object of interest is stored locally, so that clicking on a link doesn't take you off-site into the world wide wilderness, but to information about that object. For example, the page for the paper Mitochondrial paraphyly in a polymorphic poison frog species (Dendrobatidae; D. pumilio) lists the papers cited; clicking on one takes you to the page about that paper. There are limitations to this approach as well, but the key thing is that one could imagine doing computations over this (e.g., computing citation counts for DNA sequences, or geospatial queries across papers) that simple HTML hyperlinking won't get you.

Reciprocal links

The other big issue I have with the Shotton et al. "integration by linking" is that it is one-way. The semantically enhanced paper "knows" that it links to, say, the uBio record for Leptospira, but uBio doesn't know this. It would enhance the uBio record if it knew that doi:10.1371/journal.pntd.0000228.x001 linked to it.

Links are inherently reciprocal, in the sense that if paper 1 cites paper 2, then paper 2 is cited by paper 1.

Publishers understand this, and the web page of an article will often show lists of papers that cite the paper being displayed. How do we do this for data and other objects of interest? If we database everything, then it's straightforward. CrossRef stores citation metadata and offers a "forward linking" service, and some publishers (e.g., Elsevier and HighWire) offer their own versions of this. In the same way, this record for GenBank sequence AY322281 "knows" that it is cited by (at least) two papers because I've stored those links in a database. Knowing that you're being linked to dramatically enhances discoverability. If I'm browsing uBio I gain more from the experience if I know that the PLoS paper cites Leptospira.

Knowing when you're being linked to

If we database everything locally then reciprocal linking is easy. But, realistically, we can't database everything (OK, maybe that's not strictly true, we can think of Google as a database of everything). The enhanced PLoS paper "knows" that it cites the uBio record, so how can the uBio record "know" that it has been cited by the PLoS paper? What if the act of linking was reciprocal? How can we achieve this in a distributed world? Some possibilities:
  • we have an explicit API embedded in the link so that uBio can extract the source of the link (could be spoofed, need authentication?)
  • we use OpenURL-style links that embed the PLoS DOI, so that uBio knows the source of the link (OpenURL is a mess, but potentially very powerful)
  • uBio uses the HTTP referrer header to get the source of the link, then parses the PLoS HTML to extract metadata and the DOI (ugly screen scraping, but no work for PLoS)
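To make the last of these concrete, here's a toy sketch of how a service like uBio might capture the source of an incoming link from the Referer header (everything here is hypothetical: uBio offers no such endpoint, and record_incoming_link is a stub for the scraping and storage a real service would do):

incoming_links = []

def record_incoming_link(referer):
    # Stub: a real service would fetch the referring page, try to extract
    # its DOI and metadata, and store the reciprocal link.
    incoming_links.append(referer)

def namebank_app(environ, start_response):
    # Toy WSGI handler for a namebank record page. The Referer header tells
    # us which page linked here, so the record can learn it is being cited.
    referer = environ.get("HTTP_REFERER")
    if referer:
        record_incoming_link(referer)
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html>namebank record</html>"]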

Obviously this needs a little more thought, but I think that real integration by linking requires that the resources being linked are both computer and human readable, and that both resources know about the link. This would create much more powerful "semantically enhanced" publications.