Showing posts with label open data. Show all posts

On Names Attribution, Rights, and Licensing of taxonomic names

Few things have annoyed me as much as the following post on TAXACOM:

The Global Names project will host a workshop to explore options and to make recommendations as to issues that relate to Attribution, Rights and Licensing of names and compilations of names. The aim of the workshop is a report that clarifies if and how we share names.

We seek submissions from all interested parties - nomenclaturalists, taxonomists, aggregators, and users of names. Let us know what (you think) intellectual property rights apply or what rights should be associated with names and compilations of names. How can those who compile names get useful attribution for names, and what responsibilities do they have to ensure that information is authoritative. If there are rights, what kind of licensing is appropriate.

Contributions can be submitted http://names-attribution-rights-and-licensing.wikia.com/wiki/Main_Page, where you will find more information about this event.

I'm trying to work out why this seemingly innocuous post made me so mad. I think it is because it frames the question in fundamentally the wrong way. Surely the goal is to have a list of names that is global in scope, well documented, and freely usable by all without restriction? Surely we want open and free access to fundamental biodiversity data? In which case, can we please stop having meetings and get on with making this so?

If you frame the discussion as one of "Attribution, Rights and Licensing of names and compilations of names" then you've already lost sight of the prize. You've focussed on the presumed "rights" of name compilers instead.

I would argue that names compilations are somewhat overvalued. They are basically lists of names, sometimes (all too rarely) with some degree of provenance (e.g., a citation to the original use of the name). As I've documented before (e.g., More fictional taxa and the myth of the expert taxonomic database and Fictional taxa), entirely fictional taxa can end up in taxonomic databases with alarming ease. So any claims that these are expert-curated lists should be taken with a pinch of salt.

Furthermore, it is increasingly easy to automate building these lists, given that we have tools for finding names in text, and an ever expanding volume of digitised text becoming available. Indeed, in an ideal world where all taxonomic literature was digitised much of the rationale for taxonomic name databases would disappear (in the same way that library card catalogues are irrelevant in the age of Google). We are fast approaching the point where we can do better than experts. To give just one example, in a recent BHL interview with Gary Poore it was stated that:

For example, the name widely used name Pentastomida itself was widely attributed to Diesing, 1836, but the word did not appear in the literature until 1905.


A quick check of Google Ngrams shows this to be simply false:

Pentastomida

I don't need taxonomic expertise to see this, I simply need decent text indexing. So, if you have a list of names, you have something that it will soon be largely possible to recreate using automated methods (i.e., text mining). With a little sophistication we could mine the literature for further details, such as synonymy, etc. Annotation and clarification of a few "edge cases" where things get tricky will always be needed, but if you want to argue that your list deserves "Attribution, Rights and Licensing" then you fail to realise that your list is going to be increasingly easy to recreate simply by crawling the web.
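To make the text-indexing point concrete, here is a minimal sketch in Python. It assumes a hypothetical corpus of digitised pages supplied as (year, text) pairs. The `earliest_mention` function, the `pages` list, and the name "Examplida" are all invented for illustration. The snippets are not real quotations or dates, and no real BHL or Google Ngrams API is involved.

```python
import re
from typing import Iterable, Optional, Tuple


def earliest_mention(name: str, pages: Iterable[Tuple[int, str]]) -> Optional[int]:
    """Return the earliest publication year whose text contains `name`.

    `pages` is an iterable of (year, text) pairs, e.g. OCR text from
    digitised literature. Matching is case-insensitive on whole words.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    years = [year for year, text in pages if pattern.search(text)]
    return min(years) if years else None


# Toy corpus: invented snippets, not real quotations or dates.
pages = [
    (1905, "A revision of the group Examplida with notes on anatomy."),
    (1848, "The Examplida are here treated as a distinct order."),
    (1870, "No mention of the taxon on this page."),
]

print(earliest_mention("Examplida", pages))  # 1848
```

A real pipeline would also have to cope with OCR errors and spelling variants, which is where the "edge cases" come in.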

It seems to me that most taxonomic databases are little more than digitised 5x3 index cards, and lack any details on the provenance of the names they contain. They often don't have links to the primary literature, and if they do cite that literature they typically do so in a way that makes it hard to find the actual publication. I once gave a talk which included the slide below showing taxonomic databases as being "in the way" between taxonomists and users of taxonomic information:

Users

In the old days building taxonomic databases required expertise and access to obscure, hard to find, physical literature. A catalogue of names was a way to summarise that information (since we couldn't share access). Now we are in an age where more and more primary taxonomic information is available to all, which removes most of the rationale for taxonomic databases. Users can go directly to taxonomic information themselves, which means they can get the "good stuff", and maybe even cite it (giving us provenance and credit, which I regard as basically the same thing). In many ways taxonomic databases are transitional phenomena (like phone directories, remember those?), and one could argue they are now in the way of the taxonomists' Holy Grail, getting their work cited.

Lastly, any discussion of "Attribution, Rights and Licensing of names and compilations of names" reflects one of the great self-inflicted wounds of biodiversity informatics, namely the reluctance to freely share data. As we speak terabytes of genomics data are whizzing around the planet, people are downloading entire copies of GenBank and creating new databases. All of this without people fussing over "Attribution, Rights and Licensing." It's time for taxonomic databases to get over themselves and focus on making biodiversity data as accessible and available as genomics data.

Why are botanists locking away their data in JSTOR Plant Science?

Somehow I get the feeling that botanists haven't got the "open data" religion. Not only is the list of plant names behind a really bad license, but the Global Plants Initiative (GPI) hides its type images behind a JSTOR Plant Science paywall. Why is botany determined to keep its data under wraps?

For example, the first specimen on the JSTOR site is GOET008353, the isotype of Aa achalensis Schltr. You can see a thumbnail of the specimen (shown on the right), but if you want the full image you need to have a subscription, otherwise you see this message:

The resource you are attempting to access is part of JSTOR Plant Science. JSTOR Plant Science is currently being offered free of charge for all JSTOR participants and not for profit institutions. To learn more about JSTOR Plant Science, please contact plants@jstor.org.


So, without a subscription you don't get to see this in high resolution (the JSTOR site features a higher resolution image and associated viewer):

Resolver

Why would herbaria hand over this imagery? I complained about this on Facebook and Chuck Miller responded that the original herbaria retain control over the images, so they aren't locked away. However, when I went to the herbarium that has this specimen (the Type Database of Herbarium Göttingen (GOET)) and searched for it, I eventually found it listed as 4966. There is no image!

So, the only place I can see this image is on JSTOR, for which I need a subscription. I'm also puzzled by the fact that JSTOR refers to this as "GOET008353", whereas the original herbarium refers to it as "4966". GBIF also has this specimen, which it refers to as GOET GOET-Typen 4966. The GOET008353 is a barcode given to types as part of the GPI digitisation programme. Unfortunately, neither the originating herbarium nor GBIF seems to know about this.

In summary, we have three databases with data on this specimen, each with a different specimen identifier, none of which link to each other, and the available imagery is behind a paywall.
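The missing links are simple to describe in code. The sketch below, in Python, builds a lookup from each of the three identifiers quoted above to one shared record. The `SPECIMENS` layout, its field names, and the `build_index` helper are hypothetical, and no real JSTOR, GOET, or GBIF service is being called.

```python
from typing import Dict, List

# One record per physical specimen, listing every identifier the
# three databases use for it (values taken from the text above;
# the field names are made up for this sketch).
SPECIMENS: List[Dict[str, str]] = [
    {
        "jstor_barcode": "GOET008353",
        "herbarium_number": "4966",
        "gbif_catalogue": "GOET GOET-Typen 4966",
    },
]


def build_index(specimens: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Map every known identifier to its specimen record."""
    index: Dict[str, Dict[str, str]] = {}
    for record in specimens:
        for identifier in record.values():
            index[identifier] = record
    return index


index = build_index(SPECIMENS)

# Any one identifier now resolves to the same record.
same = index["4966"] is index["GOET008353"] is index["GOET GOET-Typen 4966"]
print(same)  # True
```

Nothing here is difficult. What is missing is any one of the three databases publishing the other two identifiers alongside its own.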

Clearly botany hasn't gotten the memo about open data...

The Plant List: nice data, shame it's not open

The Plant List (http://www.theplantlist.org/) has been released today, complete with glowing press releases. The list includes some 1,040,426 names. I eagerly looked for the Download button, but none is to be found. You can download individual search results (say, at family level), but not the whole data set.

OK, so that makes getting the complete data set a little tedious (there are 620 plant families in the data set), but we can still do it without too much hassle (in fact, I've grabbed the complete data set while writing this blog post). Then I see that the data is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license. Creative Commons is good, right? In this case, not so much. The CC BY-NC-ND license includes the clause:
You may not alter, transform, or build upon this work.
So, you can look but not touch. You can't take this data (properly attributed, of course) and build your own list, for example with references linked to DOIs, or to the Biodiversity Heritage Library (which is, of course, exactly what I plan to do). That's a derivative work, and the creators of the Plant List don't want you to do that. Despite this, the Plant List want us to use the data:
Use of the content (such as the classification, synonymised species checklist, and scientific names) for publications and databases by individuals and organizations for not-for-profit usage is encouraged, on condition that full and precise credit is given to The Plant List and the conditions of the Creative Commons Licence are observed.
Great, but you've pretty much killed that by using BY-NC-ND. Then there's this:
If you wish to use the content on a public portal or webpage you are required to contact The Plant List editors at editors@theplantlist.org to request written permission and to ensure that credits are properly made.
Really? The whole point of Creative Commons is that the permissions are explicit in the license. So, actually I don't need your permission to use the data on a public portal, CC BY-NC-ND gives me permission (but with the crippling limitation that I can't make a derivative work).

So, instead of writing a post congratulating the Royal Botanic Gardens, Kew and Missouri Botanical Garden (MOBOT) for releasing this data, I'm left spluttering in disbelief that they would hamstring its use through such a poor choice of license. Kew and MOBOT could have made the Plant List available as open data using one of the licenses listed on the Open Definition web site, such as putting the data in the public domain (for example, by using a Creative Commons CC0 license). Instead, they've chosen a restrictive license which makes the data closed, effectively killing the possibility for people to build upon the effort they've put into creating the list. Why do biodiversity data providers seem determined to cling to data for dear life, rather than open it up and let people realise its potential?

On being open: Mendeley and open data versus open source

Paulo Nuin, not the biggest fan of Mendeley, wrote a blog post entitled Mendeley is going to be open source, in which he wrote:

After extensively researching some material online, analysing many blog posts and statements made by people linked to Mendeley, checking my sources, I reached the conclusion that soon Mendeley is going to be open source.

Among the essays Paulo read is Jason Hoyt's post on the Mendeley blog: Dear researcher, which side of history will you be on?. In response to a question about open sourcing the Mendeley client, Jason replied:
I get asked a lot about open sourcing Mendeley when I go to speaking events. I always state that we are open to the possibility, but then ask how many people know how to type a URL verus how many know how to program in C++? That’s why we went with the Open API first instead of open sourcing the desktop software. If you can type a URL, which is what the API is based upon, then you can build on top of Mendeley. You don’t need to know how to program.

Despite the fact that open sourcing the desktop client is the second most requested feature for Mendeley, I think Jason is right. I also think Paulo's campaign to make Mendeley open source is misguided. The client doesn't matter. OK, yes, it's probably the reason most people use Mendeley, but there are lots of competing clients (EndNote, Zotero, Papers, etc.), and there are several bibliographic data formats (RIS, EndNote XML, BibTeX) and essentially one document format (PDF) that they support, so individual users don't have to worry about locking their individual bibliographies into a proprietary format. Couple this with the existence of an API (albeit a pretty crap one), and whether an individual software client is closed or open source doesn't matter much.

Will the data be open?

However, what makes Mendeley different is the aggregation of bibliographic data (35 million references and counting).


I'd argue it's the fate of this aggregation that matters. In a comment on the Guardian's piece Mendeley 'most likely to change the world for the better', Jane Good wrote:
"World-changing potential"? This utopian fantasy stuff is a little much, no? After all, we're talking about a for-profit corporation using closed-source software to monitor private usage habits for monetary gain. And how exactly is this company meant to sustain its millions of dollars of annual burn on a few measly storage subscriptions? At some point the data will have to go up for sale to the highest bidder, plain and simple. The API, as it exists now, does not provide access to that data, and it probably never will, right, DrGinn[sic]?

Toning down the rhetoric, the question is Mendeley, Scopus, Talis – will you be making your data Open?:

But how can a company create an income stream from Open Scientific content? That’s the a question for me for this decade. If we can solve it we can transform the world. If however the linked Open data are all going to be through paywalls, portals, query engines then we regress into the feudal information possession of the past. I hope the companies present in this session can help solve this. It won’t be easy but it has to be done. So I now ask Mendeley, Elsevier/Scopus, Talis: Are your data Openly available for re-use?

For me the question of whether the source code for the Mendeley desktop will be made open source is a red herring, and ultimately a distraction from the real question — will the data be open?