
Planet management, GBIF, and the future of biodiversity informatics


Next week I'm in Copenhagen for GBIC, the Global Biodiversity Informatics Conference. The goal of the conference is to:
...convene expertise in the fields of biodiversity informatics, genomics, earth observation, natural history collections, biodiversity research and policy needed to set such collaboration in motion.

The collaboration referred to is the agreement to mobilise data and informatics capability to meet the Aichi Biodiversity Targets.

I confess I have mixed feelings about the upcoming meeting. There will be something like 100 people attending the conference, with backgrounds ranging from pure science to intergovernmental policy. It promises to be interesting, but whether a clear vision of the future of biodiversity informatics will emerge is another matter.

GBIC is part of the process of "planet management", a phrase that's been around for a while, but which I only came across in Bowker's essay "Biodiversity Datadiversity"1:

Bowker, G. C. (2000). Biodiversity Datadiversity. Social Studies of Science, 30(5), 643–683. doi:10.1177/030631200030005001

Bowker's essay is well worth a read, not least for the choice quotes such as:

Each particular discipline associated with biodiversity has its own incompletely articulated series of objects. These objects each enfold an organizational history and subtend a particular temporality or spatiality. They frequently are incompletely articulated with other objects, temporalities and spatialities — often legacy versions, when drawing on non-proximate disciplines. If one wants to produce a consistent, long-term database of biodiversity-relevant information the world over, all this sounds like an unholy mess. At the very least it suggests that global panopticons are not the way to go in biodiversity data. (p. 675, emphasis added)

and
I have not, in general, questioned the mania to name which is rife in the circles whose work I have described. There is no absolutely compelling connection between the observation that many of the world’s species are dying and the attempt to catalogue the world before they do. If your house is on fire, you do not necessarily stop to inventory the contents before diving out the window. However, as Jack Goody (1977) and others have observed, list-keeping is at the heart of our body politic. It is also, by extension, at the heart of our scientific strategies. Right or wrong, it is what we do. (p. 676, emphasis added)

Given that I'm a fan of the notion of a "global panopticon", and spend a lot of time fussing with lists of names, I find Bowker's views refreshing. Meantime, roll on GBIC2012.



1. Bowker cites Elichirigoity as a source of the term "planet management":

Fernando Elichirigoity (1999), Planet Management: Limits to Growth, Computer Simulations, and the Emergence of Global Spaces (Evanston, IL: Northwestern University Press). ISBN 0810115875 (Google Books oP3wVnKpGDkC).

From the limited Google preview, and the review by Edwards, this looks like an interesting book:

Edwards, P. (2000). Book review: Planet Management: Limits to Growth, Computer Simulations, and the Emergence of Global Spaces by Fernando Elichirigoity. Isis, 91(4), 828. doi:10.1086/385020 (PDF here)

Sherborn presentation on Open Taxonomy

Here is my presentation from today's Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond meeting.


All the presentations will be posted online, along with podcasts of the audio. Meantime, presentations by Dave Remsen and Chris Freeland are already online.

How many species are there, and why do we get two very different answers from the same data?

Two papers estimating the total number of species have recently been published, one in the open access journal PLoS Biology:

Camilo Mora, Derek P. Tittensor, Sina Adl, Alastair G. B. Simpson, Boris Worm. How Many Species Are There on Earth and in the Ocean?. PLoS Biol 9(8): e1001127. doi:10.1371/journal.pbio.1001127
The second is in Systematic Biology (which has an open access option, but the authors didn't use it for this article):

Mark J. Costello, Simon Wilson and Brett Houlding. Predicting total global species richness using rates of species description and estimates of taxonomic effort. Syst Biol (2011) doi:10.1093/sysbio/syr080

The first paper has gained a lot of attention, in part because Jonathan Eisen, in his post "Bacteria & archaea don't get no respect from interesting but flawed #PLoSBio paper on # of species on the planet", was mightily pissed off about its estimates:
Their estimates of ~ 10,000 or so bacteria and archaea on the planet are so completely out of touch in my opinion that this calls into question the validity of their method for bacteria and archaea at all.

The fuss over the number of bacteria and archaea seems to me to be largely a misunderstanding of how taxonomic databases count taxa. Databases like Catalogue of Life record described species, and most bacteria aren't formally described because they can't be cultured. Hence there will always be a disparity between the extent of diversity revealed by phylogenetics and by classical taxonomy.

The PLoS Biology paper has garnered a lot more reaction than the Systematic Biology paper (e.g., the commentary by Carl Zimmer in the New York Times, "How Many Species? A Study Says 8.7 Million, but It's Tricky"), which arguably has the more dramatic conclusion.

How many species, 8.7 million, or 1.8 to 2.0 million?

Whereas Mora et al. in PLoS Biology concluded that there are some 8.7 million (±1.3 million SE) species on the planet, Costello et al. in Systematic Biology arrive at a much more conservative figure (1.8 to 2.0 million). The implications of these two studies are very different: one implies there's a lot of work to do, the other leads to headlines such as 'Every species on Earth could be discovered within 50 years'.

What is intriguing is that both studies use the same databases, the Catalogue of Life and the World Register of Marine Species, and yet arrive at very different results.

So, the question is, how did we arrive at two very different answers from the same data?
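To make the puzzle concrete, here's a toy illustration in Python (emphatically not the method of either paper, and the numbers are made up): fit two different growth models to the same cumulative curve of described species, and the fitted asymptote, i.e. the predicted total number of species, changes with the model you assume.

import numpy as np
from scipy.optimize import curve_fit

# Made-up cumulative counts of described species (millions) over time
years = np.array([1760, 1800, 1850, 1900, 1950, 2000, 2010], dtype=float)
described = np.array([0.01, 0.05, 0.15, 0.4, 0.8, 1.5, 1.7])

def logistic(t, K, r, t0):
    # Cumulative descriptions approach the asymptote K (total species)
    return K / (1.0 + np.exp(-r * (t - t0)))

def gompertz(t, K, b, c):
    # A different growth curve, with a slower approach to its asymptote K
    return K * np.exp(-b * np.exp(-c * (t - 1750.0)))

for name, model, p0 in [("logistic", logistic, (2.5, 0.05, 1950.0)),
                        ("gompertz", gompertz, (5.0, 5.0, 0.01))]:
    params, _ = curve_fit(model, years, described, p0=p0, maxfev=20000)
    print(f"{name}: estimated total species (millions) = {params[0]:.2f}")

# Same data, different model assumptions, different answer to
# "how many species are there?"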


Anchoring Biodiversity Information: from Sherborn to the 21st century and beyond

Next month I'll be speaking in London at The Natural History Museum at a one day event, Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. This meeting is being organised by the International Commission on Zoological Nomenclature and the Society for the History of Natural History, and is partly a celebration of Charles Davies Sherborn's major work Index Animalium and partly a chance to look at the future of zoological nomenclature.

Details are available from the ICZN web site. I'll be giving a talk entitled "Towards an open taxonomy" (no, I don't know what I mean by that either). But it should be a chance to rant about the failure of taxonomy to embrace the Interwebs.


Biodiversity informatics = #fail (and what to do about it)

The context for this post is the PLoS markup meeting held at the California Academy of Sciences over the weekend (many thanks to Brian Fisher for the invitation). PLoS are launching a "biodiversity hub" and were looking for ideas on how to implement this. The fact that nobody -- least of all those attending from PLoS -- could adequately explain what a hub was made things a tad tricky, but that didn't matter, because PLoS did know when the first iteration of the hub was going live (later this summer). So, once we got past the fact that PLoS operates with a timeline that says "cool stuff will happen here" and then sets about figuring out what that cool stuff will actually be (in retrospect you gotta admire this approach), we tried to figure out what PLoS needed from us.

That's when things got messy. It became very clear that PLoS wanted basic things like, you know, information on names, being able to link to specimens, etc., and our community can't do this, at least not yet. Nor can we provide simple answers to simple questions. For example, Rich Pyle gave an overview of taxonomic names, nomenclature, concepts, and the horrendous alphabet soup of databases (uBio, ZooBank, IPNI, IndexFungorum, GNA, GNUB, GNI, CoL, etc.) that have a stake in this. You could see the look of horror in the eyes of the PLoS developers who were tasked with making the hub happen ("run away, run away now"). And this was after the simple version of things. In a week where taxonomy was in the news because of the possibility that Drosophila melanogaster would have to, *cough*, change its name (doi:10.1038/464825a)1, this was not a great start.

At each step, when we outlined some of the stuff that would be cool, it became clear we couldn't deliver what we were actually arguing PLoS should do. For example, we have millions of digitised specimen records, and lots of papers refer to these specimens by name, but because individual specimens don't have URIs we can't link to them (instead we have horrific query interfaces like TAPIR, see Accessing specimens using TAPIR or, why do we make this so hard?). We're digitising the taxonomic literature, but don't provide a way to link it to modern literature at the level of granularity publishers use (i.e., articles).
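For contrast, here's what the frictionless version could look like. The host and URL pattern below are entirely hypothetical, but the point stands: a stable, resolvable specimen URI would make retrieval a single HTTP GET, rather than an XML query document aimed at a provider-specific TAPIR endpoint.

import requests

# Hypothetical: if every specimen had a stable, resolvable URI, fetching
# its record would be one GET. The host below is an assumption for
# illustration; the specimen code follows AntWeb's style.
specimen_uri = "http://specimens.example.org/antweb:casent0100367"
response = requests.get(specimen_uri,
                        headers={"Accept": "application/json"},
                        timeout=10)
print(response.status_code)
print(response.text[:200])  # the start of the specimen record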

Readers of this blog will have heard this all before, but what made this meeting different was we actually had a "customer" rock up and ask for our help to enhance their content and create something useful for the community...and the best we could do was um and er and confess we couldn't really give them what they wanted2.

Think of the children
It's time biodiversity informatics stopped playing "let's make an acronym", stopped trying to keep taxonomists happy (face it, that's never going to happen, and frankly, they'll be extinct soon anyway), stopped obsessing over who owns the data, and instead focused on delivering some simple, solid services that address the needs of people who, you know, will actually do something useful with them. Otherwise we'll be like digital librarians, who thought people would search the way librarians do, then got their noses out of joint when Google ate their lunch.

It's time to make some simple services, and stop the endless cycle of inward-looking meetings where we talk to each other. We need to learn to hide what people don't need (or want) to see. We need to be able to:

  1. Extract entities from text, e.g. scientific names, specimen codes, localities, GenBank accession numbers.

  2. Lookup a taxonomic name and return basic information about that name (rather like iSpecies but as a service; see the sketch after this list).

  3. Make specimen codes resolvable.

  4. Make taxonomic literature accessible using identifiers and tools publishers know about (that means DOIs and OpenURL).
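As a concrete example of service 2, here's a minimal sketch that wraps a name-matching web service and returns basic information about a name, iSpecies-style. GBIF's species-match API is used here as one possible backend, and the handful of fields picked out is my choice, not a standard.

import requests

def lookup_name(name):
    # Service 2: look up a scientific name, return basic information
    r = requests.get("https://api.gbif.org/v1/species/match",
                     params={"name": name}, timeout=10)
    r.raise_for_status()
    match = r.json()
    # Pick out a few fields a "hub" page would need
    return {key: match.get(key)
            for key in ("scientificName", "rank", "status",
                        "kingdom", "matchType")}

print(lookup_name("Drosophila melanogaster"))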


We're close to a lot of this already, but we're still far enough away to make some of this non-trivial. And we keep having meetings about this stuff, and fail to actually get it done. Something is wrong somewhere when E. O. Wilson has his name on yet another call for megabucks for a biodiversity project (the "Barometer of Life", doi:10.1126/science.1188606). At what point will someone ask "um, we've given you guys a lot of money already, why can't you tell me the stuff we need to know?"

Let me just say that I'm a short term pessimist, but a long term optimist. The things I complain about will get fixed, one day. It's just that I see little evidence they'll get fixed by us. Prove me wrong, go on, I dare you...

  1. Personally I'm intensely relaxed about Drosophila melanogaster remaining Drosophila melanogaster, even if it ends up in a clade surrounded by flies with other generic names. Having (a) a stable name and (b) knowing where it fits in the tree of life is all we need to do science.

  2. At the meeting I couldn't stop thinking of the scene in The West Wing where President Bartlet walks up to the Capitol for an impromptu meeting with the Speaker of the House to sort out the budget, and is left waiting outside while the Speaker sorts out his game plan. By the time the Speaker is ready, the President has turned on his heel and left, making the Speaker look a tad foolish.


Integrating and displaying data using RSS


Although I'd been thinking of getting the wiki project ready for e-Biosphere '09 as a challenge entry, lately I've been playing with RSS as a complementary, but quicker, way to achieve some simple integration.

I've been playing with RSS on and off for a while, but what reignited my interest was the swine flu timemap I made last week. The neatest thing about the timemap was how easy it was to make. Just take some RSS that is geotagged and you get the timemap (courtesy of Nick Rabinowitz's wonderful Timemap library).

So, I began to think about taking RSS feeds for, say, journals and taxonomic and genomic databases, adding them together, and displaying them using tools such as timemap (see here for an earlier mock up of some GenBank data). Two obstacles are in the way. The first is that not every data source of interest provides RSS feeds. To address this I've started to develop wrappers around some sources, the first of which is ZooBank.

The second obstacle is that integration requires shared content (e.g., tags, identifiers, or localities). Some integration will be possible geographically (for example, adding geotagged sequences and images to a map), but this won't work for everything. So, I need to spend some time trying to link stuff together. In the case of ZooBank there's some scope for this, as ZooBank metadata sometimes includes DOIs, which enables us to link to the original publication, as well as to bookmarking services such as Connotea. I'm aiming to include these links within the feed, as shown in this snippet (see the <link rel="related"...> element):


<entry>
<title>New Protocetid Whale from the Middle Eocene of Pakistan: Birth on Land, Precocial Development, and Sexual Dimorphism</title>
<link rel="alternate" type="text/html" href="http://zoobank.org/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
<updated>2009-05-06T18:37:34+01:00</updated>
<id>urn:uuid:c8f6be01-2359-1805-8bdb-02f271a95ab4</id>
<content type="html"><![CDATA[Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith & Iyad S. Zalmout<br/><a href="http://dx.doi.org/10.1371/journal.pone.0004366">doi:10.1371/journal.pone.0004366</a>]]></content>
<summary type="html"><![CDATA[Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith & Iyad S. Zalmout<br/><a href="http://dx.doi.org/10.1371/journal.pone.0004366">doi:10.1371/journal.pone.0004366</a>]]></summary>
<link rel="related" type="text/html" href="http://dx.doi.org/10.1371/journal.pone.0004366" title="doi:10.1371/journal.pone.0004366"/>
<link rel="related" type="text/html" href="http://bioguid.info/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C" title="urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
</entry>
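Once feeds carry those rel="related" links, aggregating them is straightforward. Here's a minimal sketch using the Python feedparser library; the wrapper feed URL is illustrative only.

import feedparser

# Illustrative list of feeds to merge; the URL below is a placeholder
feed_urls = [
    "http://example.org/rss/zoobank.xml",
]

entries = []
for url in feed_urls:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        # Collect the rel="related" links (DOIs, LSIDs): these are the
        # hooks that let us mesh entries from different sources
        related = [link["href"] for link in entry.get("links", [])
                   if link.get("rel") == "related"]
        entries.append({"title": entry.get("title"),
                        "updated": entry.get("updated"),
                        "related": related})

# Entries sharing a related link (e.g., the same DOI) are candidates
# for merging across feeds
for e in entries:
    print(e["title"], "->", ", ".join(e["related"]))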


What I'm hoping is that there will be enough links to create something rather like my Elsevier Challenge entry, but with a much more diverse set of sources.

GBIF and Handles: admitting that "distributed" begets "centralized"

The problem with this ... is that my personal and unfashionable observation is that “distributed” begets “centralized.” For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.).
--Geoffrey Bilder interviewed by Martin Fenner

Thinking about the GUID mess in biodiversity informatics, stumbling across some documents about the PILIN (Persistent Identifier Linking INfrastructure) project, and still smarting from problems getting hold of specimen data, I thought I'd try and articulate one solution.

Firstly, I think biodiversity informatics has made the same mistake as digital librarians in thinking that people care where they get information from. We don't, in the sense that I don't care whether I get the information from Google or my local library, I just want the information. In this context local is irrelevant. Nor do I care about individual collections. I care about particular taxa, or particular areas, but not collections (likewise, I may care about philosophy, but not philosophy books at Glasgow University Library). I think the concern for the local has led to an emphasis on providing complex software to each data provider that supports operations (such as search) that don't scale (live federated search simply doesn't work), at the expense of focussing on simple solutions that are easy to use.

In a (no doubt unsuccessful) attempt to think beyond what I want, let's imagine we have several people/organisations with interests in this area. For example:

Imagine I am an occasional user. I see a specimen referred to, say a holotype, and I want to learn more about that specimen. Is there some identifier I can use to find out more? I'm used to using DOIs to retrieve papers; what about specimens? So, I want:
  1. identifiers for specimens so I can retrieve more information

Imagine I am a publisher (which can be anything from a major commercial publisher to a blogger). I want to make my content more useful to my readers, and I've noticed that others are doing this so I'd better get on board. But I don't want to clutter my content with fragile links -- and if a link breaks I want it fixed, or I want a cached copy (hence the use of WebCite by some publishers). If I want a link fixed I don't want to have to chase up individual providers, I want one place to go (as I do for references if a DOI breaks). So, I want:
  1. stable links with some guarantee of persistence
  2. somebody who will take responsibility to fix the broken ones

Imagine I am a data provider. I want to make my data available, but I want something simple to put in place (I have better things to do with my time, and my IT department keeps a tight grip on the servers). I would also like to be able to show my masters that this is a good thing to do, for example by being able to present statistics on how many times my data has been accessed. I'd like identifiers that are meaningful to me (maybe carrying some local "branding"). I might not be so keen on some central agency serving all my data as if it were theirs. So, I want:
  1. simplicity
  2. option to serve my own data with my own identifiers

Imagine I am a power user. I want lots of data, maybe grouped in ways that the data providers hadn't anticipated. I'm in a hurry, so I want to get this stuff quickly. So I want:
  1. convenient, fast APIs to fetch data
  2. flexible search interfaces would be nice, but I may just download the data because it's probably quicker if I do it myself

Imagine I am an aggregator. I want data providers to have a simple harvesting interface so that I can grab the data. I don't need a search interface to their data because I can do it much faster if I have the data locally (federated search sucks). So I want:
  1. the ability to harvest all the data ("all your data are belong to me")
  2. a simple way to update my copy of provider's data when it changes


It's too late in the evening for me to do this justice, but I think a reasonable solution is this:
  1. Individual data providers serve their data via URLs, ideally serving a combination of HTML and RDF (i.e., linked data), but XML would be OK
  2. Each record (e.g., specimen) has an identifier that is locally unique, and the identifier is resolvable (for example, by simply appending it to a URL)
  3. Each data provider is encouraged to reuse existing GUIDs wherever possible (e.g., for literature (DOIs) and taxonomic names) to make their data "meshable"
  4. Data providers can be harvested, either completely, or for records modified after a given date
  5. A central aggregator (e.g., GBIF) aggregates all specimen/observation data. It uses Handles (or DOIs) to create GUIDs, comprising a naming authority (one for each data provider), and an identifier (supplied by the data provider, which may carry branding, e.g. "antweb:casent0100367"), so an example would be "hdl:1234567/antweb:casent0100367" or "doi:10.1234/antweb:casent0100367". Note that this avoids labeling these GUIDs as, say, http://gbif.org/1234567/antweb:casent0100367
  6. Handles resolve to the data provider's URL, but the aggregator's cached copy of the metadata may be used if the data provider is offline (see the sketch after this list)
  7. Publishers use "hdl:1234567/antweb:casent0100367" (i.e., authors use this when writing manuscripts), as they can harass the central aggregator if links break
  8. The central aggregator is responsible for generating reports to providers on how their data has been used, e.g. how many times it has been "cited" in the literature
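To make point 6 concrete, here's a sketch of the resolver logic in Python. The provider registry, URLs, and cache below are all assumptions for illustration: the Handle's local part identifies the provider and record, resolution goes to the provider's URL, and the aggregator serves its cached metadata if the provider is down.

import requests

# Hypothetical registry: naming-authority prefix -> provider base URL
PROVIDERS = {
    "antweb": "http://www.antweb.org/specimen/",
}

# The aggregator's cache of harvested metadata: handle -> record
CACHE = {}

def resolve(handle):
    # Resolve e.g. "1234567/antweb:casent0100367" to the provider's record,
    # falling back to the cached copy if the provider is offline
    naming_authority, local_id = handle.split("/", 1)
    prefix, record_id = local_id.split(":", 1)
    url = PROVIDERS[prefix] + record_id
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return CACHE.get(handle)  # provider down: serve the cached copy

print(resolve("1234567/antweb:casent0100367"))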

So, GBIF (or whoever steps up to the plate) would use Handles (or DOIs). This gives them the tools to manage the identifiers, plus tells the world that we are serious about this. Publishers can trust that links to millions of specimen records won't disappear. Providers don't have complex software to install, removing one barrier to making more data available.

I think it's time we made a serious effort to address these issues.