
Showing posts with label COinS.

In defence of OpenURL: making bibliographic metadata hackable

This is not a post I thought I'd write, because OpenURL is an awful spec. But last week I ended up in a vigorous debate on Twitter after I posted what I thought was a casual remark:



This ended up being a marathon thread about OpenURL, accessibility, bibliographic metadata, and more. It spilled over onto a previous blog post (Tight versus loose coupling) where Ed Summers and I debated the merits of Context Object in Span (COinS).

This debate still nags at me because I think there's an underlying assumption that people making bibliographic web sites know what's best for their users.

Ed wrote:

I prefer to encourage publishers to use HTML's metadata facilities using the <meta> tag and microdata/RDFa, and build actually useful tools that do something useful with it, like Zotero or Mendeley have done.

That's fine, I like embedded metadata, both as a consumer and as a provider (I provide Google Scholar-compatible metadata in BioStor). What I object to is the idea that this is all we need to do. Embedded metadata is great if you want to make individual articles visible to search engines:
[Figure: a search engine extracting embedded metadata from a web page]
Tools like Google (or bibliographic managers like Mendeley and Zotero) can "read" the web page, extract structured data, and do something with that. Nice for search engines, nice for repositories (metadata becomes part of their search engine optimisation strategy).

But this isn't the only thing a user might want to do. I often find myself confronted with a list of articles on a web site (e.g., a bibliography on a topic, a list of references cited in a paper, the results of a bibliographic search) and those references have no links. Often those links may not have existed when the original web page was published, but may exist now. I'd like a tool that helped me find those links.

If a web site doesn't provide the functionality you need then, luckily, you are not entirely at the mercy of the people who made the decisions about what you can and can't do. Tools like Greasemonkey pioneered the idea that we can hack a web page to make it more useful. I see COinS as an example of this approach. If the web page doesn't provide links, but has embedded COinS then I can use those to create OpenURL links to try and locate those references. I am no longer bound by the limitations of the web page itself.
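To make this concrete, here is a minimal sketch of what such a tool does: scan a page for COinS (spans with class "Z3988"), pull out the ContextObject from the title attribute, and prepend a resolver's base URL. The HTML fragment and the resolver URL below are invented for illustration.

```python
# Sketch: extract COinS metadata from a web page and turn it into OpenURL
# links, the way a Greasemonkey-style script or browser extension might.
# The resolver base URL and the HTML fragment below are hypothetical.
from html.parser import HTMLParser

RESOLVER = "https://resolver.example.org/openurl?"  # hypothetical resolver

class COinSParser(HTMLParser):
    """Collect the title attribute of every <span class="Z3988">."""
    def __init__(self):
        super().__init__()
        self.context_objects = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and a.get("class") == "Z3988" and "title" in a:
            # HTMLParser has already unescaped &amp; etc. in the attribute,
            # leaving the raw ContextObject key-value string
            self.context_objects.append(a["title"])

html_fragment = (
    '<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;'
    'rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;'
    'rft.atitle=A+new+species&amp;rft.jtitle=Zootaxa"></span>'
)

parser = COinSParser()
parser.feed(html_fragment)
links = [RESOLVER + co for co in parser.context_objects]
print(links[0])
```

The key point is that the page itself never has to know which resolver the reader prefers: the same span can be turned into a link to BioStor, CrossRef, or a library's own link server.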

[Figure: a user's tool reading embedded COinS and constructing OpenURL links to external resolvers]
This strikes me as very powerful, and I use COinS a lot where they are available. For example, CrossRef's excellent search engine supports COinS, which means I can find a reference using that tool, then use the embedded COinS to see whether there is a version of that article digitised by the Biodiversity Heritage Library. This enables me to do stuff that CrossRef itself hasn't anticipated, and that makes their search engine much more valuable to me. In a way this is ironic because CrossRef is predicated on the idea that there is one definitive link to a reference, the DOI.

So, what I found frustrating about the conversation with Ed was that it seemed to me that his insistence on following certain standards was at the expense of functionality that I found useful. If the client is the search engine, or the repository, then COinS do indeed seem to offer little apart from God-awful HTML messing up the page. But if you include the user and accept that users may want to do stuff that you don't (indeed can't) anticipate then COinS are useful. This is the "genius of and", why not support both approaches?

Now, COinS are not the only way to implement what I want to do; we could imagine other approaches. But to support the functionality they offer we need a way to encode metadata in a web page, a way to extract that metadata and form a query URL, and a set of services that know what to do with that URL. OpenURL and COinS provide all of this right now, and they work. I'd be all for alternative tools that did this more simply than the Byzantine syntax of OpenURL, but in the absence of such tools I stick by my original tweet:
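The provider side of that pipeline is equally mechanical: serialise the bibliographic record as key-encoded values (KEV) and drop it into a span's title attribute. A minimal sketch, using a made-up record:

```python
# Sketch of the provider side: given a bibliographic record, emit a COinS
# <span> whose title attribute is a Z39.88-2004 ContextObject in KEV form.
# The record itself is invented for illustration.
from urllib.parse import urlencode
from html import escape

record = {
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    "rft.atitle": "A new species of frog",
    "rft.jtitle": "Zootaxa",
    "rft.volume": "1234",
    "rft.spage": "1",
}

# Percent-encode the key-value pairs, then HTML-escape the result so the
# string can live inside an attribute value
kev = urlencode({"ctx_ver": "Z39.88-2004", **record})
coins = '<span class="Z3988" title="%s"></span>' % escape(kev)
print(coins)
```

The span renders as nothing at all in the browser, which is exactly why Ed calls it God-awful HTML and why I call it a free extension point.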

If you publish bibliographic data and don't use COinS you are doing it wrong

Tight versus loose coupling

Following on from my previous post bemoaning the lack of links between biodiversity data sets, it's worth looking at different ways we can build these links. Specifically, data can be tightly or loosely coupled.

Tight coupling


Tight coupling uses identifiers. A good example is bibliographic citation, where we state that one reference cites another by linking DOIs. This makes it easy to store these links in a database, such as the Open Citations project which is exploring citation networks based on data from PubMed Central. Tight coupling also makes it easy to aggregate information from multiple sources. For example, one database may record citations of a paper, another may record citations of GenBank sequences, a third may record publication of taxonomic names. If all three databases use the same identifiers for the same publications (e.g., DOIs) we can combine them and potentially discover new things (for example, we could answer the question "how many descriptions of new species include sequence data?").
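When everyone keys on the same identifier, that kind of cross-database question collapses into set algebra. A toy sketch with invented DOIs:

```python
# Sketch: if three databases key their records on the same identifiers
# (DOIs here), aggregation is just set algebra. All DOIs are invented.
citations     = {"10.1000/a", "10.1000/b", "10.1000/c"}  # papers with recorded citations
genbank_links = {"10.1000/b", "10.1000/c", "10.1000/d"}  # papers citing GenBank sequences
new_species   = {"10.1000/c", "10.1000/d", "10.1000/e"}  # papers describing new species

# "How many descriptions of new species include sequence data?"
with_sequences = new_species & genbank_links
print(len(with_sequences))
```

Without shared identifiers the intersection step becomes a fuzzy-matching project in its own right, which is the whole point of the contrast drawn below.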

Loose coupling


In part this post has been prompted by a discussion I've been having with Paul Murray (@PaulMurrayCbr) on his blog. Paul has added COinS to pages in the Australian Faunal Directory (AFD). These are snippets of HTML that encode a bibliographic reference as an OpenURL, and which browser extensions such as OpenURL Referrer for Firefox and COinS 2 OpenURL for Chrome can convert into links.

I've mapped many of the references in AFD to standard identifiers such as DOIs, or to digital libraries such as BioStor, and this tightly-coupled mapping is available in AFD on CouchDB. To date these mappings haven't been imported into AFD itself, which means that users of the original site don't have easy access to the literature that appears on that site (basically they'll have to Google each reference). However, if they have a browser extension (or the Javascript bookmarklet available from http://iphylo.org/~rpage/afd/openurl) that supports COinS, they will now see a clickable link that, in many cases, will take them to the online version of the corresponding reference.

This is an example of loose linking. The AFD site provides OpenURL links which can be resolved "just in time". Users of the AFD site can get some of the benefits of the tight linking stored in my CouchDB version of AFD, but the maintainers of AFD itself don't need to add code to handle these identifiers.

A lot of linking of biodiversity data shares this pattern. Instead of linking identifiers, one site links to another through a query. For example, NCBI taxonomy links to GBIF using URLs of the form "http://data.gbif.org/search/<taxon name>". Linking by query is potentially more robust than simply linking by URLs, especially if the target of the link doesn't ensure its identifiers are stable (GBIF, I'm looking at you). But there may be multiple ways to construct the same search query, which makes them poor candidates for use as identifiers. COinS are perhaps an extreme example, where there are at least two versions of the OpenURL standard in the wild, and the key-value pairs that make up the query can be in any order.
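The NCBI-to-GBIF pattern quoted above is just string construction plus percent-encoding, which also makes the weakness obvious: the same query can be serialised in more than one way, so string equality tells you nothing about identity.

```python
# Sketch of "linking by query": build a search URL from data you already
# hold, rather than storing the target's identifier. The URL pattern is
# the one quoted in the text.
from urllib.parse import quote

def gbif_search_url(taxon_name: str) -> str:
    # Percent-encode the name (spaces become %20) and append it to the
    # search endpoint
    return "http://data.gbif.org/search/" + quote(taxon_name)

print(gbif_search_url("Atelophryniscus chrysophorus"))
```

For a single path parameter like this the encoding is at least canonical; an OpenURL, by contrast, is a bag of key-value pairs in arbitrary order, so two different strings can denote the same reference.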

If the goal is to integrate data then having the same identifiers for the same thing makes life a lot simpler, and means that we can switch from endless data cleaning and matching ("is this citation the same as that one?") to building systems that can tackle some of the scientific questions we are interested in. But in their absence we are left with a kind of defensive programming where we expect the links to fail. Loose linking creates "soft links" that may work for humans (we get to click on a link and, with luck, see a web page) but they are less useful for mechanised tools trying to aggregate data.

When tight=loose


Although I've distinguished between tight and loose coupling, the distinction is not absolute. Indeed, one could argue that the best "tight" coupling is a form of "loose" coupling. For example, the most obvious form of tight linking is to use URLs for the things of interest. This is simple and direct, but has drawbacks for both publisher and consumer. For the consumer, we are now at the mercy of the publisher's ability to keep the URLs stable. If they change (for example, the publishing firm is bought by another, or adopts a new publishing platform which generates different URLs) then the links break (not to mention that URLs for some resources, such as articles, are often conditional on how you are accessing the article, and may contain extraneous cruft such as session ids, etc.).

Likewise, the publisher is now constrained by a decision it made at the time of publication. If it decides to adopt better technology, or if circumstances otherwise change, it may find itself having to break existing identifiers. Some of this can be avoided if we designed clean URLs, such as this example http://data.rbge.org.uk/herb/E00001195 given by Roger Hyam. However, I wonder how persistent the ".uk" part of this URL will be if the Royal Botanic Garden Edinburgh finds itself in a Scotland that is no longer part of the United Kingdom.

One solution is our old friend indirection, where we put an identifier in between the consumer and the actual URL of the resource, and the consumer uses that identifier. This is the rationale for DOIs. The user gets an identifier that is unlikely to change, and hence can build systems upon that identifier. The publisher knows that they can change how they serve the corresponding data without disrupting their users, so long as they update the URL that the DOI points to. Indirection gives users the appearance of tight coupling without imposing the constraints of tight coupling on publishers.
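The mechanics of indirection are trivial, which is part of its appeal. A toy sketch (invented DOI and URLs) of a resolver table: the consumer only ever holds the identifier, and the publisher updates the mapping, not the consumers.

```python
# Sketch of indirection: a resolver maps a stable identifier (a DOI-like
# string here) to whatever URL is current. DOIs and URLs are invented.
resolver = {
    "10.1000/xyz": "http://oldpublisher.example.com/article/xyz",
}

def resolve(doi: str) -> str:
    """Look up the current location for an identifier."""
    return resolver[doi]

before = resolve("10.1000/xyz")
# The publisher moves platforms: only the resolver entry changes, the
# identifier held by every consumer stays the same.
resolver["10.1000/xyz"] = "http://newplatform.example.org/doi/xyz"
after = resolve("10.1000/xyz")
print(before != after)
```

This is exactly what the DOI system does at scale: the handle is the stable key, and the publisher is responsible for keeping the value fresh.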

BioStor

Today I finally got a project out the door. BioStor is my take on what an interface to the Biodiversity Heritage Library (BHL) could look like. It features the visualisations I've mentioned in earlier posts, such as Google maps based on extracted localities, and tag trees. It also has a modified version of my earlier BHL viewer.

There are a number of ideas I want to play with using BioStor, but the main goal this site is to provide article-level metadata for BHL. As I've discussed earlier (see also Chris Freeland's post But where are the articles?), BHL has very little article-level metadata, making searching for articles a frustrating experience. BioStor aims to make this easier by providing an OpenURL resolver that tries to find articles in BHL.

BioStor supports the OpenURL standard, which means it can be used from within EndNote and Zotero. Web sites that support COinS (such as Drupal-based Scratchpads and EOL's LifeDesks) can also use BioStor (see http://biostor.org/referrer.php for details).

My approach to finding articles in BHL is to take existing metadata from bibliographies and databases, and use this to search BHL using techniques ranging from reasonably elegant (Smith-Waterman alignment on words to match titles) to down-and-dirty regular expression matching. Since this metadata may contain errors, BioStor provides basic editing tools (using reCAPTCHA rather than user logins at this point).
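For readers unfamiliar with it, word-level Smith-Waterman is a local alignment: it rewards runs of matching words while tolerating OCR damage and extra words on either side. A minimal sketch (the scoring parameters and example strings are arbitrary choices, not BioStor's actual settings):

```python
# Sketch of word-level Smith-Waterman local alignment, one of the matching
# techniques mentioned above: it scores how well a known article title
# aligns with (possibly OCR-damaged) text. Parameters are arbitrary.
def smith_waterman_words(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between word lists a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # match / mismatch
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            best = max(best, H[i][j])
    return best

title = "revision of the genus atelopus".split()
ocr   = "a revision of the gcnus atelopus from honduras".split()
score = smith_waterman_words(title, ocr)
print(score)
```

Because the alignment is local, the leading "a" and the trailing "from honduras" in the OCR text cost nothing, and the single OCR error ("gcnus") only dents the score rather than sinking the match.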

There's much to be done: the article finding is somewhat error-prone, and the search requires a local copy of BHL, mine being rather out of date. However, it is a start.

To get a flavour of BioStor, try browsing some references:

http://biostor.org/reference/1
http://biostor.org/reference/4
http://biostor.org/reference/12

or view information for a journal:

http://biostor.org/issn/0007-1498


or an author:

http://biostor.org/author/41
http://biostor.org/author/16

or a taxon name:

http://biostor.org/name/Atelophryniscus%20chrysophorus