

In defence of OpenURL: making bibliographic metadata hackable

This is not a post I thought I'd write, because OpenURL is an awful spec. But last week I ended up in vigorous debate on Twitter after I posted what I thought was a casual remark:

If you publish bibliographic data and don't use COinS you are doing it wrong

This ended up being a marathon thread about OpenURL, accessibility, bibliographic metadata, and more. It spilled over onto a previous blog post (Tight versus loose coupling) where Ed Summers and I debated the merits of Context Object in Span (COinS).

This debate still nags at me because I think there's an underlying assumption that people making bibliographic web sites know what's best for their users.

Ed wrote:

I prefer to encourage publishers to use HTML's metadata facilities using the <meta> tag and microdata/RDFa, and build actually useful tools that do something useful with it, like Zotero or Mendeley have done.

That's fine, I like embedded metadata, both as a consumer and as a provider (I provide Google Scholar-compatible metadata in BioStor). What I object to is the idea that this is all we need to do. Embedded metadata is great if you want to make individual articles visible to search engines:
[Diagram: a search engine extracting embedded metadata from an article web page]
Tools like Google (or bibliographic managers like Mendeley and Zotero) can "read" the web page, extract structured data, and do something with that. Nice for search engines, nice for repositories (metadata becomes part of their search engine optimisation strategy).

But this isn't the only thing a user might want to do. I often find myself confronted with a list of articles on a web site (e.g., a bibliography on a topic, a list of references cited in a paper, the results of a bibliographic search) and those references have no links. Often those links may not have existed when the original web page was published, but may exist now. I'd like a tool that helps me find those links.

If a web site doesn't provide the functionality you need then, luckily, you are not entirely at the mercy of the people who made the decisions about what you can and can't do. Tools like Greasemonkey pioneered the idea that we can hack a web page to make it more useful. I see COinS as an example of this approach. If the web page doesn't provide links, but has embedded COinS then I can use those to create OpenURL links to try and locate those references. I am no longer bound by the limitations of the web page itself.

[Diagram: a COinS-aware tool building OpenURL links from metadata embedded in a web page]
This strikes me as very powerful, and I use COinS a lot where they are available. For example, CrossRef's excellent search engine supports COinS, which means I can find a reference using that tool, then use the embedded COinS to see whether there is a version of that article digitised by the Biodiversity Heritage Library. This enables me to do stuff that CrossRef itself hasn't anticipated, and that makes their search engine much more valuable to me. In a way this is ironic because CrossRef is predicated on the idea that there is one definitive link to a reference, the DOI.
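
To make this concrete, here's a minimal sketch (in Python) of what a COinS-aware tool does: the <span> carries the citation as URL-encoded key-value pairs in its title attribute, which a client decodes and points at whatever OpenURL resolver it likes. The citation metadata and the resolver URL below are illustrative, and a real tool would pull the title attribute out of the page's HTML rather than hard-coding it.


from urllib.parse import parse_qsl, urlencode
from html import unescape

# A COinS title attribute as it might appear in a web page (illustrative metadata).
coins_title = ("ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal"
               "&amp;rft.genre=article&amp;rft.atitle=A+new+tree+frog+from+Brisbane"
               "&amp;rft.jtitle=Proceedings+of+the+Royal+Society+of+Queensland"
               "&amp;rft.volume=20&amp;rft.spage=31&amp;rft.date=1907")

# Decode the HTML entities, then parse the key-value pairs.
metadata = dict(parse_qsl(unescape(coins_title)))
print(metadata["rft.atitle"])  # A new tree frog from Brisbane

# Form an OpenURL query against a resolver of your choice (URL here is illustrative).
print("http://biostor.org/openurl?" + urlencode(metadata))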

So, what I found frustrating about the conversation with Ed was that it seemed to me that his insistence on following certain standards was at the expense of functionality that I found useful. If the client is the search engine, or the repository, then COinS do indeed seem to offer little apart from God-awful HTML messing up the page. But if you include the user, and accept that users may want to do stuff that you don't (indeed, can't) anticipate, then COinS are useful. This is the "genius of and": why not support both approaches?

Now, COinS are not the only way to implement what I want to do; we could imagine other ways to do this. But to support the functionality that they offer we need a way to encode metadata in a web page, a way to extract that metadata and form a query URL, and a set of services that know what to do with that URL. OpenURL and COinS provide all of this right now, and they work. I'd be all for alternative tools that did this more simply than the Byzantine syntax of OpenURL, but in the absence of such tools I stick by my original tweet:

If you publish bibliographic data and don't use COinS you are doing it wrong

Bibliographic metadata pollution

I spend a lot of time searching the web for bibliographic metadata and links to digitised versions of publications. Sometimes I search Google and get nothing, sometimes I get the article I'm after, but often I get something like this:

[Screenshot of Google search results for "Die cestoden der Vogel"]

If I search for Die cestoden der Vogel in Google I get masses of hits for the same thing from multiple sources (e.g., Google Books, Amazon, other booksellers, etc.). For this query we can happily click through pages and pages of results that are all, in some sense, the same thing. Sometimes I get similar results when searching for an article: multiple hits from sites with metadata on that article, but few, if any, with an actual link to the article itself.

One byproduct of putting bibliographic metadata on the web is that we are starting to pollute web space with repetitions of the same (or closely similar) metadata. This makes searching for definitive metadata difficult, never mind actually finding the content itself. In some cases we can use tools such as Google Scholar, which clusters multiple versions of the same reference, but Google Scholar is often poor for the kind of literature I am after (e.g., older taxonomic publications).

As Alan Ruttenberg (@alanruttenberg) points out, books would seem to be a case where Google could extend its knowledge graph and cluster the books together (using ISBNs, title matching, etc.). But in the meantime, if you think simply pumping out bibliographic metadata is a good thing, spare a thought for those of us trying to wade through the metadata soup looking for the "good stuff".

Linked data that isn't: the failings of RDF

OK, a bit of hyperbole in the morning. One of the goals of RDF is to create the Semantic Web, an interwoven network of data seamlessly linked by shared identifiers and shared vocabularies. Everyone uses the same identifiers for the same things, and when they describe these things they use the same terms. Simples.

Of course, the reality is somewhat different. Typically people don't reuse identifiers, and there are usually several competing vocabularies we can choose from. To give a concrete example, consider two RDF documents describing the same article, one provided by CiNii, the other by CrossRef. The article is:

Astuti, D., Azuma, N., Suzuki, H., & Higashi, S. (2006). Phylogenetic Relationships Within Parrots (Psittacidae) Inferred from Mitochondrial Cytochrome-b Gene Sequences(Phylogeny). Zoological science, 23(2), 191-198. doi:10.2108/zsj.23.191

You can get RDF for a CiNii record by appending ".rdf" to the URL for the article, in this case http://ci.nii.ac.jp/naid/130000017049. For CrossRef you need a Linked Data compliant client, or you can do something like this:


curl -D - -L -H "Accept: application/rdf+xml" "http://dx.doi.org/10.2108/zsj.23.191"

You can view the RDF from these two sources here and here.

No shared identifiers
The two RDF documents have no shared identifiers, or at least, any identifiers they do share aren't described in a way that is easily discovered. The CrossRef record knows nothing about the CiNii record, but the CiNii document includes this statement:


<rdfs:seeAlso rdf:resource="http://ci.nii.ac.jp/lognavi?name=crossref&amp;id=info:doi/10.2108/zsj.23.191" dc:title="CrossRef" />

So, CiNii knows about the DOI, but this doesn't help much as the CrossRef document has the URI "http://dx.doi.org/10.2108/zsj.23.191", so we don't have an explicit statement that the two documents refer to the same article.

The other identifier the documents could share is the ISSN for the journal (0289-0003), but CiNii writes this without the "-", and uses the PRISM term "prism:issn", so we have:


<prism:issn>02890003</prism:issn>


whereas CrossRef writes the ISSN like this:


<ns0:issn xmlns:ns0="http://prismstandard.org/namespaces/basic/2.1/">
0289-0003</ns0:issn>


Unless we have a linked data client that normalises ISSNs before it does a SPARQL query, we will miss the fact that the two records refer to the same journal.
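
If I were writing such a client, the normalisation step might look something like this (a rough sketch in Python that simply strips everything except digits and X, then reinserts the hyphen):


import re

def normalise_issn(value):
    digits = re.sub(r"[^0-9Xx]", "", value).upper()
    # A valid ISSN is 8 characters; anything else is returned untouched.
    return digits[:4] + "-" + digits[4:] if len(digits) == 8 else value

print(normalise_issn("02890003"))   # CiNii's form    -> 0289-0003
print(normalise_issn("0289-0003"))  # CrossRef's form -> 0289-0003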

Inconsistent vocabularies
Both CiNii and CrossRef use the PRISM vocabulary to describe the article, but they use different versions. CrossRef uses "http://prismstandard.org/namespaces/basic/2.1/" whereas CiNii uses "http://prismstandard.org/namespaces/basic/2.0/". Version 2.1 versus version 2.0 is a minor difference, but the URIs are different and hence they are different vocabularies (having version numbers in vocabulary URIs is asking for trouble). Hence, even if CiNii and CrossRef wrote ISSNs in the same way, we'd still not be able to assert that the two records refer to the same journal.
Inconsistent use of vocabularies
Both CiNii and CrossRef use FOAF for author names, but they write the names differently:


<foaf:name xml:lang="en">Suzuki Hitoshi</foaf:name>


<ns0:name xmlns:ns0="http://xmlns.com/foaf/0.1/">Hitoshi Suzuki</ns0:name>


So, another missed opportunity to link the documents. One could argue this would be solved if we had consistent identifiers for authors, but we don't. In this case CiNii have their own local identifiers (e.g. http://ci.nii.ac.jp/nrid/1000040179239), and CrossRef has a rather hideous looking Skolemisation: http://id.crossref.org/contributor/hitoshi-suzuki-2gypi8bnqk7yy.

In summary, it's a mess. CiNii and CrossRef are both organisations whose core business is bibliographic metadata. It's great that both are serving RDF, but if we think this is anything more than providing metadata in a useful format I think we may be deceiving ourselves.

Orwellian metadata: making journals disappear

I've been spending a lot of time recently mapping bibliographic citations for taxonomic names to digital identifiers (such as DOIs). This is tedious work at the best of times (despite lots of automation), but it is not helped by the somewhat Orwellian practices of some publishers. Occasionally, when an established journal gets renamed, the publisher retrospectively applies the new name to the previous journal. For example, in 2000 the journal Entomologica Scandinavica (ISSN 0013-8711) became Insect Systematics & Evolution (ISSN 1399-560X):


(diagram based on WorldCat xISSN history tool, rendered using Google Charts.)

Content for both Entomologica Scandinavica and Insect Systematics & Evolution is available from Ingenta's web site, but every article is listed as being in Insect Systematics & Evolution, and this is reflected in the metadata CrossRef has for each DOI.

For example, the paper
Andersen, N.M. & P.-p. Chen, 1993. A taxonomic revision of pondskater genus Gerris Fabricius in China, with two new species (Hemiptera: Gerridae). – Entomologica Scandinavica 24: 147-166

has the DOI doi:10.1163/187631293X00262 which resolves to a page saying this article was published in Insect Systematics & Evolution. The XML for the DOI says the same thing:



<issn type="print">1399560X</issn>
<issn type="electronic">1876312X</issn>
<journal_title>Insect Systematics & Evolution</journal_title>


In one sense this is no big deal. If you know the DOI then that's all you need to use to refer to the article (and the sooner we abandon fussing with citation styles and just use DOIs the better).

But if you haven't yet found the DOI then this is a problem, because if I search CrossRef using the original journal name (Entomologica Scandinavica) I get nothing. As far as CrossRef is concerned the DOI doesn't exist. If, however, I happen to know that Entomologica Scandinavica is now Insect Systematics & Evolution, I can rewrite the query and retrieve the DOI.
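
One workaround is to keep a small table of known renamings and retry the query with the journal's current name when the original name draws a blank. Here's a rough sketch against CrossRef's REST search API; the renaming table is something you'd have to curate yourself (e.g., from xISSN histories), and the query parameters are illustrative rather than gospel.


import requests

# Known retrospective renamings (curated by hand, or from WorldCat's xISSN history data).
RENAMED = {"Entomologica Scandinavica": "Insect Systematics & Evolution"}

def find_doi(journal, article_title):
    # Try the journal name as cited first, then fall back to its current name.
    for container in (journal, RENAMED.get(journal)):
        if container is None:
            continue
        r = requests.get("https://api.crossref.org/works", params={
            "query.bibliographic": article_title,
            "query.container-title": container,
            "rows": 1,
        })
        items = r.json()["message"]["items"]
        if items:
            return items[0]["DOI"]
    return None

print(find_doi("Entomologica Scandinavica",
               "A taxonomic revision of the pondskater genus Gerris Fabricius in China"))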

It's bad enough dealing with taxonomic name changes without having to deal with journal name changes as well! It would be great if publishers didn't indulge in the wholesale renaming of old journals, or if CrossRef had a mechanism (perhaps based on WorldCat's xISSN History Visualization Tool) to handle retrospectively renamed journals.

Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem; for a starting point, see the papers collected in the Citation::Multi::Parser group on Mendeley (in Computer and Information Science).



To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone — the citation might not exactly follow the rules, and there might be errors (e.g., from OCR). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.

It occurs to me another way of doing this would be the following:
  1. Assume, for argument's sake, that we have a database of most of the references we are likely to encounter.
  2. Using the most common citation styles, generate a set of possible citations for each reference.
  3. Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.
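
Here's a toy sketch of the idea, using Python's difflib for the approximate matching. A real implementation would need a much larger database and some way of avoiding an all-against-all comparison (e.g., blocking on author surnames or years); the reference, styles, and threshold below are just for illustration.


from difflib import SequenceMatcher

# A toy "database" of references; in practice this would be large.
refs = [{
    "authors": "Andersen, N.M. & Chen, P.-p.", "year": 1993,
    "title": "A taxonomic revision of the pondskater genus Gerris Fabricius in China",
    "journal": "Entomologica Scandinavica", "volume": "24", "pages": "147-166",
}]

# A handful of common citation styles, written as format templates.
STYLES = [
    "{authors} ({year}). {title}. {journal} {volume}: {pages}.",
    "{authors}, {year}. {title}. {journal} {volume}: {pages}",
    "{authors} {year}. {title}. {journal}, {volume}, {pages}.",
]

def match_citation(citation, refs, threshold=0.8):
    # Generate every (reference, style) citation string and keep the closest match.
    best, best_score = None, 0.0
    for ref in refs:
        for style in STYLES:
            candidate = style.format(**ref)
            score = SequenceMatcher(None, citation.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = ref, score
    return best if best_score >= threshold else None

print(match_citation("Andersen, N.M. & Chen, P.-p., 1993. A taxonomic revision of the "
                     "pondskater genus Gerris Fabricius in China. Entomologica "
                     "Scandinavica 24: 147-166", refs))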

Might not be elegant, but I suspect it would be fast.

Why metadata matters

Quick note to express the frustration I experience sometimes when dealing with taxonomic literature. As part of a frankly Quixotic desire to link every article cited in the Australian Faunal Directory (AFD) to the equivalent online resource (for example, in the Biodiversity Heritage Library using BioStor, or to a publisher web site using a DOI) I sometimes come across references that I should be able to find yet can't. Often it turns out that the metadata for the article is incorrect. For example, take this reference:
Report upon the Stomatopod crustaceans obtained by P.W. Basset-Smith Esq., surgeon R.N. during the cruise, in the Australia and China Sea, of H.M.S. "Penguin", commander W.V. Moore. Ann. Mag. Nat. Hist. Vol. 6 pp. 473-479 pl. 20B
which is in the Australian Faunal Directory (urn:lsid:biodiversity.org.au:afd.publication:087892ae-2134-4bb4-83ae-8b8cbd15b299). Using my OpenURL resolver in BioStor I failed to locate this article. Sometimes this is because the code I used to parse references from AFD mangles the reference, but not in this case. So, I Googled the title and found a page in the Zoological catalogue of Australia: Aplacophora, Polyplacophora, Scaphopoda.


The relevant part of that page gives the same details as AFD: Ann. Mag. Nat. Hist. volume 6, pages 473-479, 1893.

In despair I looked at the BHL page for The Annals and Magazine of Natural History and discovered that there is no volume 6 published in 1893. There is, however, a series 6. Oops! Browsing the BHL content I found the start of the article I'm looking for on BHL page 27734740, in volume 11 of series 6 of The Annals and Magazine of Natural History. Gotcha! So, I can now link AFD to BHL like this.

I should stress that in general AFD is a great resource for someone like me trying to link names to literature and, to be fair, with its reuse of volume numbers across series, The Annals and Magazine of Natural History can be a challenge to cite. Usually the bibliographic details in AFD are accurate enough to locate articles in BHL or CrossRef, but every so often references get mangled, misinterpreted, or someone couldn't resist adding a few "helpful" notes to a field in the database, resulting in my parser failing. What is slightly alarming is how often, when I Google for the reference, I find the same erroneous metadata repeated across several articles. This, coupled with the inevitable citation mutations, can make life a little tricky. The bulk of the links I'm making are constructed automatically, but there are a few cases where one is led on a wild goose chase to find the actual reference.

Although this is an example of why it matters to have accurate metadata, it can also be seen as an argument for using identifiers rather than metadata. If these references had stable, persistent identifiers (such as DOIs) that taxonomic databases cited, then we wouldn't need detailed metadata, and we could avoid the pain of rummaging around in digital archives trying to make sense of what the author meant to cite. Until taxonomic databases routinely use identifiers for literature, names and literature will be as ships that pass in the night.

How do I know if an article is Open Access?

One of my pet projects is to build a "Universal Article Reader" for the iPad (or similar mobile device), so that a reader can seamlessly move between articles from different publishers, follow up citations, and get more information on entities mentioned in those articles (e.g., species, molecules, localities, etc.). I've made various toys towards this, the latest being a HTML5 clone of Nature's iPhone app.

One impediment to this is knowing whether an article is Open Access, and if so, what representations are available (i.e., PDF, HTML, XML). Ideally, the "Universal Article Reader" would be able to look at the web page for an article, determine whether it can extract and redisplay the text (i.e., is the article Open Access) and, if so, whether it can, for example, grab the article in XML and reformat it.

Some journals are entirely Open Access, so for these journals the first problem (is it Open Access?) is trivial, but a large number of journals have a mixed publishing model: some articles are Open Access, some aren't. One thing publishers could do that would be helpful would be to specify the access status of an article in a consistent manner. Here's a quick survey of how things stand at the moment.

Journal: how access status is flagged

PLoS ONE: embedded RDF, e.g. <license rdf:resource="http://creativecommons.org/licenses/by/2.5/" />
Nature Communications: <meta name="access" content="Yes" /> for open, <meta name="access" content="No" /> for closed
Systematic Biology: <meta name="citation_access" content="all" /> for open; the tag is missing if closed
BioOne: nothing for the article itself; an Open Access icon next to open access articles in the table of contents
BMC Evolutionary Biology: <meta name="dc.rights" content="http://creativecommons.org/licenses/by/2.0/" />
Philosophical Transactions of the Royal Society: <meta name="citation_access" content="all" /> for open access
Microbial Ecology: no metadata (links and images in HTML)
Human Genomics and Proteomics: <meta name="dc.rights" content="http://creativecommons.org/licenses/by/2.0/" />


A bit of a mess. Some publishers embed this information in <meta> tags (which is good), some (such as PLoS) embed RDF (good, if a little more hassle), and some leave us in the dark or give visual clues such as logos (which mean nothing to a computer). In some ways this parallels the variety of ways journals have implemented RSS feeds, which has led to some explicit Recommendations on RSS Feeds for Scholarly Publishers. Perhaps the time is right to develop equivalent recommendations for article metadata, so that apps to read the scientific literature can correctly determine whether they can display an article or not.
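
In the meantime, here's a sketch of how a reader app might test an article page against the conventions in the table above. The tag names come from that table; the code (Python, using requests and BeautifulSoup) is illustrative rather than exhaustive, and says nothing about the many journals that give no machine-readable signal at all.


import requests
from bs4 import BeautifulSoup

# (meta tag name, test on its content) pairs mirroring the survey above.
CHECKS = [
    ("access", lambda v: v.lower() == "yes"),             # Nature Communications style
    ("citation_access", lambda v: v.lower() == "all"),    # Systematic Biology / Phil. Trans. style
    ("dc.rights", lambda v: "creativecommons.org" in v),  # BMC / Human Genomics and Proteomics style
]

def looks_open_access(url):
    # Fetch the article page and test it against each known <meta> tag convention.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for name, test in CHECKS:
        tag = soup.find("meta", attrs={"name": name})
        if tag and test(tag.get("content", "")):
            return True
    return False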





BioStor gets PDFs with XMP metadata - bibliographies made easy with Mendeley and Papers

The ability to create PDFs for the articles BioStor extracts from the Biodiversity Heritage Library has been the single most requested feature for BioStor. I've taken a while to get around to this -- for a bunch of reasons -- but I've finally added it today. You can get a PDF of an article by either clicking on the PDF link on the page for an article, or by appending ".pdf" to the article URL (e.g., http://biostor.org/reference/570.pdf). In some ways the BioStor PDFs are pretty basic - they contain page images, not the OCR text, so they tend to be quite large and you can't search for text within the article. But what they do have is XMP metadata.

XMP metadata
One of the great bugbears of organising bibliographies is the lack of embedded metadata in PDFs, in other words Why can't I manage academic papers like MP3s? (see my earlier post for some background). Music files and digital photos contain embedded metadata that store information such as song title and artist in the case of music, or date, exposure, camera model, and GPS co-ordinates in the case of digital images. This means software (and web sites such as Flickr) can automatically organise your collection of media based on this embedded metadata.

Wouldn't it be great if there was an equivalent for PDFs of papers, whereby the PDF contains all the relevant bibliographic details (article title, authorship, journal, volume, pages, etc.), and reference managing software could read this and automatically put the PDF into whatever categories you chose (e.g., by author, journal, or date)? Well, at least two programs can do this, namely the cross-platform Mendeley, and Papers, which supports Apple's Macintosh, iPhone, and iPad platforms. Both programs can read bibliographic metadata in Adobe's Extensible Metadata Platform (XMP), which has been adopted by journals such as Nature, and CrossRef has recently been experimenting with providing services to add XMP to PDFs.

One reason I put off adding PDFs to BioStor was the issue of simply generating dumb PDFs for which users would then have to retype the corresponding bibliographic metadata if they wanted to store the PDF in a reference manager. However, given that both Papers and Mendeley support XMP, you can simply drag the PDF on to either program and they will extract the details for you (including a list of up to 10 taxonomic names found in the article). Both Papers and Mendeley support the notion of a "watched folder" where you can dump PDFs and they will "automagically" appear in your reference manager's library. Hence, if you use either program you should be able to simply download PDFs from BioStor and add them to your library without having to retype anything at all.

Technical details
This post is being written as I'm waiting to catch a plane, so I haven't time to go into all the gory details. The basic tools I used to construct the PDFs were FPDF and ExifTool, the latter of which supports injecting XMP into PDFs (I couldn't find another free tool that could insert XMP into a PDF that didn't already have any XMP metadata). I store basic Dublin Core and PRISM metadata in the PDF. The ten most common taxonomic names found in the pages of the article are stored as subject tags.
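
The XMP injection itself boils down to a single ExifTool call, something like the sketch below (wrapped in Python here, with illustrative values taken from an article mentioned elsewhere on this blog; the exact tags BioStor writes may differ).


import subprocess

# Write Dublin Core and PRISM metadata into an existing PDF using ExifTool.
subprocess.run([
    "exiftool",
    "-XMP-dc:Title=A new tree frog from Brisbane",
    "-XMP-dc:Creator=Ogilby, J. Douglas",
    "-XMP-prism:publicationName=Proceedings of the Royal Society of Queensland",
    "-XMP-prism:volume=20",
    "-XMP-prism:startingPage=31",
    "570.pdf",
], check=True)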

Initially it appeared that only Papers could extract the XMP; Mendeley failed completely (somewhat confirming my prejudices about Mendeley). However, I sent an example PDF to Mendeley support, and they helpfully diagnosed the problem. Because XMP metadata can't always be trusted, Mendeley compares title and author values in the XMP metadata with text on the first couple of pages of the PDF. If they match, then the program accepts the XMP metadata. Because my initial efforts at creating PDFs just contained the BHL page images and no text, they wouldn't pass Mendeley's tests. Hence, I added a cover page containing the basic bibliographic metadata for the article, and now Mendeley is happy (the program itself is growing on me, but if you're an Apple fanboy like me, Papers has native look and feel, and syncing your library with your iPhone is a killer feature). There are a few minor differences in how Papers and Mendeley handle tags. Papers will take text in the Dublin Core "Subject" tag and use it as keywords, whereas to get Mendeley to extract tags I had to store them using the "Keywords" tag (implemented using FPDF's SetKeywords function). But after a bit of fussing I think the BioStor PDFs should play nice in both programs.

BioStor

Today I finally got a project out the door. BioStor is my take on what an interface to the Biodiversity Heritage Library (BHL) could look like. It features the visualisations I've mentioned in earlier posts, such as Google maps based on extracted localities, and tag trees. It also has a modified version of my earlier BHL viewer.

There are a number of ideas I want to play with using BioStor, but the main goal of this site is to provide article-level metadata for BHL. As I've discussed earlier (see also Chris Freeland's post But where are the articles?), BHL has very little article-level metadata, making searching for articles a frustrating experience. BioStor aims to make this easier by providing an OpenURL resolver that tries to find articles in BHL.

BioStor supports the OpenURL standard, which means it can be used from within EndNote and Zotero. Web sites that support COinS (such as Drupal-based Scratchpads and EOL's LifeDesks) can also use BioStor (see http://biostor.org/referrer.php for details).

My approach to finding articles in BHL is to take existing metadata from bibliographies and databases, and use this to search BHL using techniques ranging from the reasonably elegant (Smith-Waterman alignment on words to match titles) to down-and-dirty regular expression matching. Since this metadata may contain errors, BioStor provides basic editing tools (using reCAPTCHA rather than user logins at this point).

There's much to be done: the article finding is somewhat error-prone, and the search requires a local copy of BHL, and mine is rather out of date. However, it is a start.

To get a flavour of BioStor, try browsing some references:

http://biostor.org/reference/1
http://biostor.org/reference/4
http://biostor.org/reference/12

or view information for a journal:

http://biostor.org/issn/0007-1498


or an author:

http://biostor.org/author/41
http://biostor.org/author/16

or a taxon name:

http://biostor.org/name/Atelophryniscus%20chrysophorus

Biodiversity Heritage Library viewer experiments

In between the chaos that is term-time I've been playing with ways to view Biodiversity Heritage Library content. The viewer is crude, and likely to go off-line at any moment while I fuss with it, but you can view an example here. This link takes you to a display of BHL item 19513, which is volume 12 of the Bulletin of the British Museum (Natural History) Entomology, and which includes some striking insects, such as the species of Lopheuthymia displayed below:

[Screenshot of the viewer showing BHL item 19513]


The viewer attempts to do several things:
  1. Display a BHL document in a way that I can quickly scan pages looking for article boundaries by showing thumbnails
  2. Provide a simple way for me to edit metadata (if you find the start of an article you can mark it as the first page of that article and edit the article details)
  3. Provide a RIS dump of the articles which can then be uploaded into other bibliographic tools

Note that on the left hand side I'm displaying the articles that have already been found in the volume. The editing interface is crude, and I'll need to look at user authentication and versioning to do this seriously, but it seems a quick way to annotate BHL content. Much of this need only be done once: once we have article boundaries then searching BHL journal content using, say, OpenURL becomes easy, and we can link bibliographic records in nomenclators to BHL. Improving the metadata will also improve visualisations such as my BHL timeline.

I hope to play a bit more with the viewer over the next days and weeks. It's pretty simple (Javascript with a PHP back end). The key to creating the viewer was a complete dump of BHL's page metadata kindly provided by Mike Lichtenberg. I use this to locate individual page images stored by the Internet Archive, which I then store locally (and generate 100 pixel wide thumbnails).

Potentially there's a lot more one could add to a tool like this. I'm playing with displaying the taxon names found by uBio so that I can flag pages where a name is first published. One could imagine adding other flags, such as when a taxon is depicted, so that we could easily extract images of taxa from BHL content. It would also be nice to be able to add taxon names that uBio's algorithms have missed. Ultimately, one could even display the OCR text and correct/annotate that. Much to do...

Zotero group for Biodiversity Heritage Library content

One thing I find myself doing (probably more often than I should) is adding a reference to my Zotero library for an item in the Biodiversity Heritage Library (BHL). BHL doesn't have article-level metadata (see But where are the articles?), so when I discover a page of interest (e.g., one that contains the original description of a taxon) I store metadata for the article containing that page in my Zotero library. Typically this involves manually stepping back through the scanned pages until I find the start of the article, then storing that URL as well as the page number as a Zotero item. As an example, here is the record for Ogilby, J. Douglas (1907). A new tree frog from Brisbane. Proceedings of the Royal Society of Queensland 20:31-32. The URL in the Zotero record, http://www.biodiversitylibrary.org/page/13861218, takes you to the first page of this article.

One reason for storing metadata in Zotero is so that these references are made available through the bioGUID OpenURL resolver. This is achieved by regularly harvesting the RSS feed for my Zotero account, and adding items in that feed to the bioGUID database of articles. This makes Zotero highly attractive, as I don't have to write a graphical user interface to create bibliographic records. BHL have their own citation editor in the works ("CiteBank"), based on the ubiquitous Drupal, but I wonder whether Zotero is the better bet -- it has a bunch of nice features, including being able to sync local copies across multiple machines, and store local copies of PDFs (synced using WebDAV).
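
The harvesting side is straightforward: Zotero exposes a feed of a user's library that any feed parser can read. Something along these lines (the feed URL is schematic and the user ID is a placeholder):


import feedparser

# Zotero serves an Atom feed of a user's library; USER_ID is a placeholder.
feed = feedparser.parse("https://api.zotero.org/users/USER_ID/items/top?format=atom")

for entry in feed.entries:
    # A harvester would map each entry's title and link onto its own article records,
    # e.g. rows in the bioGUID resolver's database of articles.
    print(entry.title, entry.link)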

For fun I've created a Zotero group called Biodiversity Heritage Library which will list articles that I've extracted from BHL. At some point I may investigate automating this process of extracting articles (using existing bibliographic metadata mapped to BHL page numbers), but for now there are a mere 27 manually acquired items listed in the BHL group.




Google Scholar metadata quality and Mendeley hype

Hot on the heels of Geoffrey Nunberg's essay about the train wreck that is Google books metadata (see my earlier post) comes Google Scholar’s Ghost Authors, Lost Authors, and Other Problems by Péter Jacsó. It's a fairly scathing look at some of the problems with the quality of Google Scholar's metadata.

Now, Google Scholar isn't perfect, but it's come to play a key role in a variety of bibliographic tools, such as Mendeley and Papers. These tools do a delicate dance with Google Scholar who, strictly speaking, don't want anybody scraping their content. There's no API, so Mendeley, Papers (and my own iSpecies) have to keep up with the HTML tweaks that Google introduces, pretend to be web browsers, fuss with cookies, and try to keep the rate of queries below the level at which the Google monster stirs and slaps them down.

Jacsó's critique also misses the main point. Why do we have free (albeit closed) tools like Google Scholar in the first place? It's largely because scientists have ceded the field of citation analysis to commercial companies, such as Elsevier and Thomson Reuters. To echo Martin Kalfatovic's comment:
Over the years, we've (librarians and the user community) have allowed an important class of metadata - specifically the article level metadata - migrate to for profit entities.
Some visionaries, such as Robert Cameron in his A Universal Citation Database as a Catalyst for Reform in Scholarly Communication, argued for free, open citation databases, but this came to nought.

For me, this is the one thing the ridiculously over-hyped Mendeley could do that would merit the degree of media attention it is getting -- be the basis of an open citation database. It would need massive improvement to its metadata extraction algorithms, which currently suck (Google Scholar's, for all Jacsó's complaints, are much better), but it would generate something of lasting value.




Biodiversity Heritage Library, Google books, and metadata quality

I've been playing recently with the Biodiversity Heritage Library (BHL), and am starting to get a sense for the complexities (and limitations) of the metadata BHL stores about publications. The more I look at BHL the more I think the resource is (a) wonderfully useful and (b) hampered by some dodgy metadata.

The BHL data model has three kinds of entities: "Titles", "Items", and "Pages". Pages are individual pages in an item, where an item corresponds to a physical object that has been scanned (such as a book or a bound volume of a journal). A title may comprise a single item, such as a book, or many items, such as the volumes of a journal. Most of the metadata BHL has relates to physical items (books and bound volumes), as opposed to article-level metadata, which is basically absent (see But where are the articles?).

[Diagram of the BHL data model: titles comprise items, items comprise pages]


This model reflects the sources of the BHL metadata (library catalogues) and the mode of operation (bulk scanning of bound volumes). But it can make working out dates somewhat challenging.

To give an example, I did a search on the frog name Hyla rivularis Taylor, 1952 (NameBankID 27357), currently known as Isthmohyla rivularis. I wanted to find the original description of this frog. A BHL search returns 34 pages containing the name Hyla rivularis, distributed over 5 titles (a title in BHL may be a book, or a journal). Given that the name was published in 1952, it would be nice if I could sort these results by date, and then look at items from 1952. Unfortunately I can't. BHL has limited information on dates, especially at the level I would need to find a document published in 1952.

For the five titles returned in the search, I have dates for four of them, albeit two are ranges (University of Kansas publications, Museum of Natural History, 1946-1971, and The University of Kansas science bulletin, 1902-1996). At the level of individual items, only item 25858 (University of Kansas publications, Museum of Natural History) has dates (1961-1966). If I look at the VolumeInfo field for an item (you can get this from the database dump, or using the JSON web service) I sometimes get strings like this "v.35:pt.1 (1952)". This item (25857) is the one I'm after, but the date is buried in the VolumeInfo string. So, the information I need is there, but it's going to need some parsing.
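
For the well-behaved cases the parsing isn't hard; here's a rough sketch (real BHL VolumeInfo strings are far more variable than this single pattern suggests):


import re

def parse_volume_info(volume_info):
    # Pull volume, part, and year out of strings like "v.35:pt.1 (1952)".
    m = re.search(r"v\.?\s*(\d+)(?::pt\.?\s*(\d+))?\s*\((\d{4})\)", volume_info)
    if not m:
        return None
    return {"volume": m.group(1), "part": m.group(2), "year": int(m.group(3))}

print(parse_volume_info("v.35:pt.1 (1952)"))  # {'volume': '35', 'part': '1', 'year': 1952}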


Another issue is that of duplicates. Searching for publications on Rana grahamii, I found items 41040 and 45847. Although one item is treated as a book, and the other as a volume of the journal Records of the Indian Museum, these are the same thing. Having duplicates is a complication, but it might also be useful for quality control and testing (for example, do taxon name extraction algorithms return the same names from the OCR text of both copies?). Nor is having duplicate copies and/or identifiers unique to BHL. The Records of the Indian Museum has a series-level identifier (ISSN 0537-0744), and this article ("A monograph of the South Asian, Papuan, Melanesian and Australian frogs of the genus Rana") also has the ISBN 8121104327.

There are parallels with Google's book scanning project, which has been the subject of criticism on several fronts, including the quality of the metadata it has for each book. Geoff Nunberg has an entertaining post entitled Google Books: A Metadata Train Wreck which lists many examples of errors. His blog post also contains a detailed response from Jon Orwant of Google Books. In essence, Google Books is riddled with metadata errors (such as books on the Internet with publication dates predating the birth of their authors), but most of these errors have come from library catalogues (not unexpected given the scale of the task), not Google.

What could BHL do about its metadata? One thing is crowdsourcing. BHL does a little of this already, for example capturing user-provided metadata when PDFs are created, but I wonder if we could do more. For example, imagine dumping metadata for all 39,000 items into a semantic wiki and inviting people to edit and annotate the metadata. This could be extended to adding article boundaries (i.e., identifying which page corresponds to the start of an article). There is also considerable scope for trying to find article boundaries using existing metadata from bibliographies assembled by individual scientists.

But we should watch closely what Google does with its book project. Eric Hellman has argued that, far from creating the metadata mess, Google is ideally positioned to sort it out. He writes:
What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so?