
Showing posts with label data quality. Show all posts

Google Scholar metadata quality and Mendeley hype

Hot on the heels of Geoffrey Nunberg's essay about the train wreck that is Google books metadata (see my earlier post) comes Google Scholar’s Ghost Authors, Lost Authors, and Other Problems by Péter Jacsó. It's a fairly scathing look at some of the problems with the quality of Google Scholar's metadata.

Now, Google Scholar isn't perfect, but it's come to play a key role in a variety of bibliographic tools, such as Mendeley and Papers. These tools do a delicate dance with Google, which, strictly speaking, doesn't want anybody scraping its content. There's no API, so Mendeley, Papers (and my own iSpecies) have to keep up with the HTML tweaks that Google introduces, pretend to be web browsers, fuss with cookies, and try to keep the rate of queries below the level at which the Google monster stirs and slaps them down.

Jacsó's critique also misses the main point. Why do we have free (albeit closed) tools like Google Scholar in the first place? It's largely because scientists have ceded the field of citation analysis to commercial companies, such as Elsevier and Thomson Reuters. To echo Martin Kalfatovic's comment:
Over the years, we've (librarians and the user community) have allowed an important class of metadata - specifically the article level metadata - migrate to for profit entities.
Some visionaries, such as Robert Cameron in his A Universal Citation Database as a Catalyst for Reform in Scholarly Communication, argued for free, open citation databases, but this came to nought.

For me, this is the one thing the ridiculously over-hyped Mendeley could do that would merit the degree of media attention it is getting -- be the basis of an open citation database. It would need massive improvement to its metadata extraction algorithms, which currently suck (Google Scholar's, for all Jacsó's complaints, are much better), but it would generate something of lasting value.




Biodiversity Heritage Library, Google books, and metadata quality

I've been playing recently with the Biodiversity Heritage Library (BHL), and am starting to get a sense for the complexities (and limitations) of the metadata BHL stores about publications. The more I look at BHL the more I think the resource is (a) wonderfully useful and (b) hampered by some dodgy metadata.

The BHL data model has three kinds of entities: "Titles", "Items", and "Pages". Pages are individual pages in an item, where an item corresponds to a physical object that has been scanned (such as a book or a bound volume of a journal). A title may comprise a single item, such as a book, or many items, such as volumes of a journal. Most of the metadata BHL has relates to physical items (books and bound volumes), as opposed to article-level metadata, which is basically absent (see But where are the articles?).

[Figure: BHL data model]


This model reflects the sources of the BHL metadata (library catalogues) and the mode of operation (bulk scanning of bound volumes). But it can make working out dates somewhat challenging.

To give an example, I did a search on the frog name Hyla rivularis Taylor, 1952 (NameBankID 27357), currently known as Isthmohyla rivularis. I wanted to find the original description of this frog. A BHL search returns 34 pages containing the name Hyla rivularis, distributed over 5 titles (a title in BHL may be a book, or a journal). Given that the name was published in 1952, it would be nice if I could sort these results by date, and then look at items from 1952. Unfortunately I can't. BHL has limited information on dates, especially at the level I would need to find a document published in 1952.

For the five titles returned in the search, I have dates for four of them, albeit two are ranges (University of Kansas publications, Museum of Natural History, 1946-1971, and The University of Kansas science bulletin, 1902-1996). At the level of individual items, only item 25858 (University of Kansas publications, Museum of Natural History) has dates (1961-1966). If I look at the VolumeInfo field for an item (you can get this from the database dump, or using the JSON web service) I sometimes get strings like this "v.35:pt.1 (1952)". This item (25857) is the one I'm after, but the date is buried in the VolumeInfo string. So, the information I need is there, but it's going to need some parsing.
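The parsing needn't be elaborate. Here is a minimal sketch in Python, assuming that years appear in VolumeInfo as four-digit numbers, optionally as a range such as "1961-1966". The function names, the dictionary keys, and the second sample string are my own inventions for illustration; only the string "v.35:pt.1 (1952)" comes from BHL. Real VolumeInfo strings are far messier than this, so treat the pattern as a starting point rather than a solution.

```python
import re

# A four-digit year (1700-2099), optionally followed by a second year
# to form a range such as "1961-1966". The year bounds are an assumption.
YEAR_RANGE = re.compile(
    r"\b(1[7-9]\d{2}|20\d{2})(?:\s*-\s*(1[7-9]\d{2}|20\d{2}))?\b"
)


def parse_volume_years(volume_info):
    """Return (start_year, end_year) for the first year found, or None."""
    match = YEAR_RANGE.search(volume_info)
    if match is None:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return (start, end)


def items_covering_year(items, year):
    """Keep the items whose parsed year range includes the given year."""
    hits = []
    for item in items:
        years = parse_volume_years(item.get("VolumeInfo", ""))
        if years is not None and years[0] <= year <= years[1]:
            hits.append(item)
    return hits


# Illustrative records: the first VolumeInfo string is the one quoted above,
# the second is a made-up example of a range (hypothetical item, ID 0).
sample = [
    {"ItemID": 25857, "VolumeInfo": "v.35:pt.1 (1952)"},
    {"ItemID": 0, "VolumeInfo": "v.12 (1961-1966)"},
]

for item in items_covering_year(sample, 1952):
    print(item["ItemID"], item["VolumeInfo"])
```

With years pulled out like this, search results could at least be filtered or sorted by date, even when the item-level date fields are empty.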


Another issue is that of duplicates. Searching for publications on Rana grahamii, I found items 41040 and 45847. Although one item is treated as a book, and the other as a volume of the journal Records of the Indian Museum, these are the same thing. Having duplicates is a complication, but it might also be useful for quality control and testing (for example, do taxon name extraction algorithms return the same names from OCR text from both copies?). Nor is having duplicate copies and/or identifiers unique to BHL. The Records of the Indian Museum has a series-level identifier (ISSN 0537-0744), and this article ("A monograph of the South Asian, Papuan, Melanesian and Australian frogs of the genus Rana") also has the ISBN 8121104327.
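As a rough sketch of the quality-control idea, the names extracted from two copies of the same work could be compared as sets. The code below assumes the extraction step has already produced a list of name strings for each copy. The function name and the sample lists are hypothetical and are not real output from either BHL item.

```python
def compare_name_sets(names_a, names_b):
    """Compare two lists of extracted taxon names, ignoring case and padding."""
    a = {n.strip().lower() for n in names_a if n.strip()}
    b = {n.strip().lower() for n in names_b if n.strip()}
    union = a | b
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Jaccard similarity: 1.0 means both copies yielded identical names
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


# Made-up inputs standing in for names extracted from two scans
copy_a = ["Rana grahamii", "Rana tigrina"]
copy_b = ["rana grahamii ", "Rana limnocharis"]

report = compare_name_sets(copy_a, copy_b)
print(report["shared"], report["only_a"], report["only_b"])
```

Names found in only one copy would be the first place to look for OCR errors or extraction failures.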

There are parallels with the Google Books scanning project, which has been the subject of criticism on several fronts, including the quality of the metadata Google has for each book. Geoff Nunberg has an entertaining post entitled Google Books: A Metadata Train Wreck which lists many examples of errors. That post also contains a detailed response from Jon Orwant of Google Books. In essence, Google Books is riddled with metadata errors (such as books on the Internet with publication dates predating the birth of their authors), but most of these errors have come from library catalogues (not unexpected given the scale of the task), not Google.

What could BHL do about its metadata? One thing is crowdsourcing. BHL does a little of this already, for example capturing user-provided metadata when PDFs are created, but I wonder if we could do more. For example, imagine dumping metadata for all 39,000 items into a semantic wiki and inviting people to edit and annotate the metadata. This could be extended to adding article boundaries (i.e., identifying which page corresponds to the start of an article). There is also considerable scope for trying to find article boundaries using existing metadata from bibliographies assembled by individual scientists.

But we should watch closely what Google does with its book project. Eric Hellman has argued that, far from creating the metadata mess, Google is ideally positioned to sort it out. He writes:
What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so?