
Showing posts with label data. Show all posts

Species wait 21 years to be described - show me the data

Benoît Fontaine et al. recently published a study concluding that the average lag time between a species being discovered and subsequently described is 21 years.

Fontaine, B., Perrard, A., & Bouchet, P. (2012). 21 years of shelf life between discovery and description of new species. Current Biology, 22(22), R943–R944. doi:10.1016/j.cub.2012.10.029

The paper concludes:

With a biodiversity crisis that predicts massive extinctions and a shelf life that will continue to reach several decades, taxonomists will increasingly be describing from museum collections species that are already extinct in the wild, just as astronomers observe stars that vanished thousands of years ago.

This is a conclusion that merits more investigation, especially as the title of the paper suggests there is an appalling lack of efficiency (or resources) in the way we describe biodiversity. So, with interest I looked at the Supplemental Information for the data:

I was hoping to see the list of the 600 species chosen at random, the publication containing their original description, and the date of their first collection. Instead, all we have is a description of the methods for data collection and analysis. Where is the data? Without the data I have no way of exploring the conclusions or asking additional questions. For example, what is the distribution of dates of specimen collection in each species? One could imagine situations where a number of specimens are recently collected, prompting recognition and description of a new species, and as part of that process rummaging through the collections turns up older, unrecognised members of that species. Indeed, if it takes a certain number of specimens to describe a species (people tend to frown upon descriptions based on single specimens), perhaps what we are seeing is the outcome of a sampling process: specimens of new species are rare, they take a while to accumulate in collections, and the distribution of collection dates will have a long tail.
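This sampling-process argument can be made concrete with a small simulation (a sketch with invented parameters, not anything taken from the paper): suppose specimens of a rare species turn up at random, and the species is only described once some minimum number have accumulated in collections.

```python
import random

def simulate_shelf_life(arrival_rate=0.2, specimens_needed=5, seed=0):
    """Simulate collection of one species' specimens.

    Each year a specimen is collected with probability `arrival_rate`;
    the species is described once `specimens_needed` specimens have
    accumulated. Returns (collection_years, shelf_life), where
    shelf_life is the gap in years between the first collection and
    the description. All parameters are invented for illustration.
    """
    rng = random.Random(seed)
    years = []
    year = 0
    while len(years) < specimens_needed:
        year += 1
        if rng.random() < arrival_rate:
            years.append(year)
    return years, years[-1] - years[0]

# Mean shelf life over many simulated species. With these invented
# parameters the expected gap is four inter-arrival waits of roughly
# five years each, i.e. around two decades, without any inefficiency
# on the part of taxonomists.
lags = [simulate_shelf_life(seed=i)[1] for i in range(1000)]
print(sum(lags) / len(lags))
```

Under these assumptions a multi-decade lag emerges purely from the rarity of specimens, which is exactly the kind of alternative explanation the published data would let us test.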

These are the sort of questions we could ask if we had the data, but the authors don't provide it. The worrying thing is that we are seeing a number of high-visibility papers that potentially have major implications for how we view the field of taxonomy but which don't publish their data. Another recent example is:

Joppa, L. N., Roberts, D. L., & Pimm, S. L. (2011). The population ecology and social behaviour of taxonomists. Trends in Ecology & Evolution, 26(11), 551–553. doi:10.1016/j.tree.2011.07.010

Biodiversity is a big data science, it's time we insisted on that data being made available.

EOL Computable Data Challenge community

Now we are awash in challenges! EOL has announced its Computable Data Challenge:
We invite ideas for scientific research projects that use EOL, including the Biodiversity Heritage Library (BHL), to answer questions in biology. The specific field of biological interest for the challenge is open; projects in ecology, evolution, behavior, conservation biology, developmental biology, or systematics may be most appropriate. Projects advancing informatics alone may be less competitive. EOL may be used as a source of biological information, to establish a sampling strategy, to assist the retrieval of computable data by mapping identifiers across sources (e.g. to accomplish name resolution), and/or in other innovative ways. Projects involving data or text or image mining of EOL or BHL content are encouraged. Current EOL data and API shall be used; suggestions for modification of content or the API could be a deliverable of the project. We encourage the use of data not yet in EOL for analyses. In all cases projects must honor terms of use and licensing as appropriate.

Some $US 50,000 is on offer. "Challenge" is perhaps a misnomer, as EOL is offering this money not as a prize at the end, but rather to fund one or more proposals (submitted by 22 May) that are accepted. So, it's essentially a grant competition (with a pleasingly minimal amount of administrivia). There is also a Computable Data Challenge community to discuss the challenge.

It's great to see EOL trying different strategies to engage with developers. Of the different challenges EOL is running this one is perhaps the most appealing to me, because one of my biggest complaints about EOL is that it's hard to envisage "doing science" with it. For example, we can download GenBank and cluster sequences into gene families, or grab data from GBIF and model species distributions, but what could we do with EOL? This challenge will be a chance to explore the extent to which EOL can support science, which I would argue will be a key part of its long term future.

Data matters but do data sets?

Interest in archiving data and data publication is growing, as evidenced by projects such as Dryad, and earlier tools such as TreeBASE. But I can't help wondering whether this is a little misguided. I think the issues are granularity and reuse.

Taking the second issue first, how much re-use do data sets get? I suspect the answer is "not much". I think there are two clear use cases, repeatability of a study, and benchmarks. Repeatability is a worthy goal, but difficult to achieve given the complexity of many analyses and the constant problem of "bit rot" as software becomes harder to run the older it gets. Furthermore, despite the growing availability of cheap cloud computing, it simply may not be feasible to repeat some analyses.

Methodological fields often rely on benchmarks to evaluate new methods, and this is an obvious case where a dataset may get reused ("I ran my new method on your dataset, and my method is the business — yours, not so much").

But I suspect the real issue here is granularity. Take DNA sequences, for example. New studies rarely reuse (or cite) previous data sets, such as a TreeBASE alignment or a GenBank Popset. Instead they cite individual sequences by accession number. I think in part this is because the rate of accumulation of new sequences is so great that any subsequent study would need to add these new sequences to be taken seriously. Similarly, in taxonomic work the citable data unit is often a single museum specimen, rather than a data set made up of specimens.

To me, citing data sets makes almost as much sense as citing journal volumes - the level of granularity is wrong. Journal volumes are largely arbitrary collections of articles; it's the articles that are the typical unit of citation. Likewise, I think sequences will be cited more often than alignments.

It might be argued that there are disciplines where the dataset is the sensible unit, such as an ecological study of a particular species. Such a data set may lack obvious subsets, and hence it makes sense to cite it as a unit. But my expectation here is that such datasets will see limited re-use, for the very reason that they can't be easily partitioned and mashed up. Data sets such as alignments, which are built from smaller, reusable units of data (i.e., sequences), can be recombined, trimmed, or merged, and hence can be readily re-used. Monolithic datasets with largely unique content can't be easily mashed up with other data.
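The contrast is easy to sketch (the accession numbers and sequence fragments below are invented for illustration): when a data set is just a mapping from citable units to their data, recombining two studies' data is trivial.

```python
# Two hypothetical alignments, each a mapping from a citable unit
# (an accession number) to its data (a sequence fragment).
alignment_a = {"AB000001": "ACGT-", "AB000002": "ACGTT"}
alignment_b = {"AB000002": "ACGTT", "AB000003": "AAGTT"}

# Merge: the union of the underlying units gives a new, larger data set.
merged = {**alignment_a, **alignment_b}

# Trim: keep only the units both studies sampled.
shared = {acc: seq for acc, seq in alignment_a.items() if acc in alignment_b}

print(sorted(merged))  # all three accessions
print(sorted(shared))  # just the accession the two alignments share
```

A monolithic data set with no such internal keys offers no equivalent of these operations, which is why it is hard to mash up with anything else.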

Hence, my suspicion is that many data sets in digital archives will gather digital dust, and anyone submitting a data set in the expectation that it will be cited may turn out to be disappointed.

The Plant List: nice data, shame it's not open

The Plant List (http://www.theplantlist.org/) has been released today, complete with glowing press releases. The list includes some 1,040,426 names. I eagerly looked for the Download button, but none is to be found. You can download individual search results (say, at family level), but not the whole data set.
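Grabbing the results family-by-family can be scripted; here is a sketch (the URL pattern below is my guess for illustration only, not the site's documented layout — inspect the actual family pages before relying on it):

```python
# Hypothetical per-family download loop for The Plant List.
# The URL pattern is an assumption, NOT a documented API.
BASE = "http://www.theplantlist.org"

def family_result_url(family):
    """Build a (hypothetical) per-family results URL."""
    return "{0}/browse/{1}/".format(BASE, family)

families = ["Asteraceae", "Fabaceae", "Orchidaceae"]  # 620 in the full list
urls = [family_result_url(f) for f in families]
print(urls[0])

# Fetching each page is then a simple loop, e.g.:
# import urllib.request
# for url in urls:
#     html = urllib.request.urlopen(url).read()
```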

OK, so that makes getting the complete data set a little tedious (there are 620 plant families in the data set), but we can still do it without too much hassle (in fact, I've grabbed the complete data set while writing this blog post). Then I see that the data is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license. Creative Commons is good, right? In this case, not so much. The CC BY-NC-ND license includes the clause:
You may not alter, transform, or build upon this work.
So, you can look but not touch. You can't take this data (properly attributed, of course) and build your own list, for example with references linked to DOIs, or to the Biodiversity Heritage Library (which is, of course, exactly what I plan to do). That's a derivative work, and the creators of the Plant List don't want you to do that. Despite this, the Plant List wants us to use the data:
Use of the content (such as the classification, synonymised species checklist, and scientific names) for publications and databases by individuals and organizations for not-for-profit usage is encouraged, on condition that full and precise credit is given to The Plant List and the conditions of the Creative Commons Licence are observed.
Great, but you've pretty much killed that by using BY-NC-ND. Then there's this:
If you wish to use the content on a public portal or webpage you are required to contact The Plant List editors at editors@theplantlist.org to request written permission and to ensure that credits are properly made.
Really? The whole point of Creative Commons is that the permissions are explicit in the license. So, actually I don't need your permission to use the data on a public portal, CC BY-NC-ND gives me permission (but with the crippling limitation that I can't make a derivative work).

So, instead of writing a post congratulating the Royal Botanic Gardens, Kew and Missouri Botanical Garden (MOBOT) for releasing this data, I'm left spluttering in disbelief that they would hamstring its use through such a poor choice of license. Kew and MOBOT could have made the Plant List available as open data using one of the licenses listed on the Open Definition web site, such as putting the data in the public domain (for example, using a Creative Commons CC0 license). Instead, they've chosen a restrictive license which makes the data closed, effectively killing the possibility for people to build upon the effort they've put into creating the list. Why do biodiversity data providers seem determined to cling to data for dear life, rather than open it up and let people realise its potential?

Dryad, DOIs, and why data matters more than journal articles


For the last two days I've been participating in a NESCent meeting on Dryad, a "repository of data underlying scientific publications, with an initial focus on evolutionary biology and related fields". The aim of Dryad is to provide a durable home for the kinds of data that don't get captured by existing databases such as GenBank and TreeBASE (for example, the Excel spreadsheets, Word files, and tarballs of data that, if they are lucky, make it on to a journal's web site as supplementary material, like this example). These data have an alarming tendency to disappear (see "Unavailability of online supplementary scientific information from articles published in major journals" doi:10.1096/fj.05-4784lsf).

Perhaps it was because I was participating virtually (via Adobe Connect, which worked very well), but at times I felt seriously out of step with many of the participants. I got the sense that they regard the scientific article as primary, data as secondary, and weren't entirely convinced that data needed to be treated in the same way as a publication. I was arguing that Dryad should assign DOIs to data sets, join CrossRef, and ensure data sets were cited in the same way as papers. For me this is a no brainer -- by making data equivalent to a publication, journals don't need to do anything special, publishers know how to handle DOIs, and will have fewer qualms than with URLs, which have a nasty tendency to break (see "Going, Going, Gone: Lost Internet References" doi:10.1126/science.1088234).

Furthermore, being part of CrossRef would bring other benefits. Their cited-by linking service enables publishers to display lists of articles that cite a given paper -- imagine being able to do this for data sets. Dryad could display not just the paper associated with publication of the data set, but all subsequent citations. As an author, I'd love to see this. It would enable me to see what others had done with my data, and provide an incentive to submit my data to Dryad (providing incentives to authors to archive data is a big issue, see Mark Costello's recent paper doi:10.1525/bio.2009.59.5.9).
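Treating a data set as equivalent to a paper means it can be cited in exactly the same form; a sketch of what such a citation might look like (the record fields and the DOI below are invented for illustration, not a real Dryad record):

```python
def format_data_citation(record):
    """Format a Dryad-style data record as a conventional citation.
    The field names here are assumptions for illustration."""
    return "{authors} ({year}). Data from: {title}. {repository}. doi:{doi}".format(**record)

record = {
    "authors": "Smith, J. & Jones, A.",
    "year": 2009,
    "title": "Phylogeny of an example clade",
    "repository": "Dryad Digital Repository",
    "doi": "10.5061/dryad.9999",  # invented DOI for illustration
}
citation = format_data_citation(record)
print(citation)

# A data DOI resolves through the same infrastructure as article DOIs,
# so publishers need nothing new to link to it:
doi_url = "https://doi.org/" + record["doi"]
```

Because the citation is indistinguishable in form from an article citation, CrossRef's cited-by machinery could track reuse of the data set with no changes on the publishers' side.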

Not everyone saw things this way, and it's often a "reality check" to discover that things one takes for granted are not at all obvious to others (leading to mutual incomprehension). Many editors, understandably, think of the journal article as primary, and data as something else (some even struggle to see why one would want to cite data). There's also (to my mind) a ridiculous level of concern about whether ISI would index the data. In the age of Google, who cares? Partly these concerns may reflect the diversity of the participants. Some subjects, such as phylogenetics, are built on reuse of previous data, and it's this reuse that makes data citation both important and potentially powerful (for more on this see my papers hdl:10101/npre.2009.3173.1 and doi:10.1093/bib/bbn022). In many ways, the data is more important than the publication. If I look at a phylogenetics paper published, say, 5 or more years ago, the methods may be outmoded, the software obsolete (I might not be able to run it on a modern machine), and the results likely to be outdated (additional data and/or taxa changing the tree). So, the paper might be virtually useless, but the data continues to be of value. Furthermore, the great thing about data (especially sequence data) is that it can be used in all sorts of unexpected ways. In disciplines such as phylogenetics, data reuse is very common. In other areas in evolution and ecology, this might not be the case.

It will be clear from this that I buy the idea articulated by Philip Bourne (doi:10.1371/journal.pcbi.0010034) that there's really no difference between a database and a journal article and that the two are converging (I've argued for a long time that the best thing that could happen to phylogenetics would be if Molecular Phylogenetics and Evolution and TreeBASE were to merge and become one entity). Data submission would equal publication. In the age of Google where data is unreasonably effective (doi:10.1109/mis.2009.36, PDF here), privileging articles at the expense of data strikes me as archaic.

So, whither Dryad? I wish it every success, and I'm sure it will be a great start. There are some very clever people behind it, and it takes a lot of work to bring a community on board. However, I think Dryad's use of Handles is a mistake (they are the obvious choice of identifier given Dryad is based on DSpace), as this presents publishers with another identifier to deal with, and has none of the benefits of DOIs. Indeed, I would go further and say that the use of Handles + DSpace marks Dryad as being basically yet another digital library project, which is fine, but it puts it outside the mainstream of science publishing, and I think that is a strategic mistake. An example of how to do things better is Nature Precedings, which assigns DOIs to manuscripts, reports, and presentations. I think the use of DOIs in this context demonstrated that Nature was serious, and valued these sorts of resource. Personally, I'd argue that Dryad should be more ambitious, and see itself as a publisher, not a repository. In fact, it could think of itself as a journal publisher. Ironically, maybe the editors at the NESCent meeting were well advised to be wary, what they could be witnessing is the formation of a new kind of publication, where data is the article, and the article is data.