
Showing posts with label Zootaxa. Show all posts

Extracting semantic goodness from Zootaxa articles


I've just come back from a holiday in New Zealand, during which time I spent a morning chatting with Zhi-Qiang Zhang (@Zootaxa, editor of Zootaxa) and Stephen Thorpe (stho002, a major contributor to Wikispecies).

Fresh from playing with PLoS XML to explore ways of redisplaying articles (described in my commentary on the PLoS iPad app), I was extolling the virtues of the XML mark-up that underlies PLoS (and other Open Access journals, such as the BMC series). These publishers provide Open Access XML versions of their papers that are quite richly marked up: internal citations, links to figures, the bibliography, etc. are all clearly identified, although they don't have the semantic mark-up of TaxPub, used in some recent ZooKeys papers.

Talking to Zhi-Qiang Zhang is always a useful reality check. Zootaxa describes itself as the "World's foremost journal in taxonomy; publisher of 15,421 new taxa in 141,518 pages by 7,385 authors worldwide since 2001".

This is taxonomic publishing on a grand scale, averaging more than an article a day. Since 2004 Zootaxa has published 12.60% of the new taxa recorded in Zoological Record, an order of magnitude more than its nearest rival. The journal is tightly run, and doesn't have cash to spare (it has nothing like the funding PLoS has, for example). Any change to the basic workflow (author submits Word file, this is imported into Adobe FrameMaker, which creates the PDF files displayed on the Zootaxa web site) requires compelling justification. Furthermore, any change would have to scale. The level of work required to embellish articles using custom mark-up, such as TaxPub, just isn't feasible.

Zhi-Qiang waxed enthusiastic about Google Books' interface, where basic information such as keywords, geographic locations, and references is extracted automatically. Google Books was one inspiration for the article display I use in BioStor, so I wondered how hard it would be to take some of the work I've been doing on BioStor and on adding mark-up to PLoS XML and apply it to Zootaxa PDFs. After some fussing with regular expressions, the bioGUID OpenURL resolver and uBio's FindIT taxonomic name tool, I have some scripts that automate extracting basic information from a Zootaxa PDF, such as the abstract, localities, taxonomic names, GenBank sequences, and the bibliography. You can see some examples at http://iphylo.org/~rpage/zootaxa/. It's all a bit crude, and isn't the same as being able to mark up the actual text (which could be done, but with rather more effort), but there's potential here to create nice interfaces to Zootaxa papers, as well as extract the data needed to do some interesting queries.
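
To give a flavour of the regular-expression side of this, here is a minimal sketch in Python. It is not the actual scripts: it assumes the PDF has already been converted to plain text, that the abstract sits between an "Abstract" heading and "Key words" or "Introduction", that coordinates are written as degrees and optional minutes with a hemisphere letter, and that GenBank accessions follow the classic one-letter-plus-five-digits or two-letters-plus-six-digits patterns. The function names are made up for illustration, and taxonomic names and the bibliography (handled by uBio and bioGUID in the real thing) are left out.

```python
import re

# Classic GenBank nucleotide accession patterns (an assumption; newer
# accession formats are not covered by this sketch).
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

# Coordinates such as 36°52'S, 174°46'E (degrees, optional minutes).
COORD = re.compile(
    r"(\d{1,3})°\s*(?:(\d{1,2})['′]\s*)?([NS])[,;\s]+"
    r"(\d{1,3})°\s*(?:(\d{1,2})['′]\s*)?([EW])"
)


def to_decimal(degrees, minutes, hemisphere):
    """Convert degrees/minutes plus N/S/E/W to signed decimal degrees."""
    value = int(degrees) + (int(minutes) / 60 if minutes else 0)
    return -value if hemisphere in "SW" else value


def extract_abstract(text):
    """Return the text between 'Abstract' and the next section heading."""
    match = re.search(
        r"Abstract\s*(.+?)\s*(?:Key\s*words|Introduction)", text, re.S
    )
    return match.group(1).strip() if match else None


def extract_accessions(text):
    """Return the sorted, de-duplicated accession-like strings."""
    return sorted(set(ACCESSION.findall(text)))


def extract_localities(text):
    """Return (latitude, longitude) pairs in decimal degrees."""
    return [
        (to_decimal(d1, m1, h1), to_decimal(d2, m2, h2))
        for d1, m1, h1, d2, m2, h2 in COORD.findall(text)
    ]
```

Patterns like these are crude: they will miss unusual layouts and occasionally match something that merely looks like an accession number, which is why the output is better treated as a starting point for an interface than as clean data.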



e-Biosphere Challenge: visualising biodiversity digitisation in real time

e-Biosphere '09 kicks off next week, and features the challenge:
Prepare and present a real-time demonstration during the days of the Conference of the capabilities in your community of practice to discover, disseminate, integrate, and explore new biodiversity-related data by:
  • Capturing data in private and public databases;
  • Conducting quality assurance on the data by automated validation and/or peer review;
  • Indexing, linking and/or automatically submitting the new data records to other relevant databases;
  • Integrating the data with other databases and data streams;
  • Making these data available to relevant audiences;
  • Make the data and links to the data widely accessible; and
  • Offering interfaces for users to query or explore the data.


Originally I planned to enter the wiki project I've been working on for a while, but time was running out and the deadline was too ambitious. Hence, I switched to thinking about RSS feeds. The idea was to first create a set of RSS feeds for sources that lack them, which I've been doing over at http://bioguid.info/rss, then integrate these feeds in a useful way. For example, the feeds would include images from Flickr (such as EOL's pool), geotagged sequences from GenBank, the latest papers from Zootaxa, and new names from uBio (I'd hoped to include ION as well, but they've been spectacularly hacked).

After playing with triple stores and SPARQL (incompatible vocabularies and multiple identifiers rather bugger this approach), and visualisations based on Google Maps (building on my swine flu timemap), it dawned on me that what I really needed was an eye-catching way of displaying geotagged, timestamped information, just like David Troy's wonderful twittervision and flickrvision.com. In particular, David took the Poly9 Globe and added Twitter and Flickr feeds (see twittervision 3D and flickrvision 3D). So, I hacked David's code and created this, which you can view at http://bioguid.info/ebio09/www/3d/:



It's a lot easier to simply look at it rather than describe what it does, but here's a quick sketch of what's under the hood.

Firstly, I take RSS feeds, either the raw geoFeed from Flickr, or from http://bioguid.info/rss. The bioGUID feeds include the latest papers in Zootaxa (most new animal species are described in this journal), a modified version of uBio's new names feed, and a feed of the latest geotagged sequences in GenBank (I'd hoped to use only DNA barcodes, but it turns out rather few barcode sequences are geotagged, and few have the "BARCODE" keyword). The Flickr feeds are simple to handle because they include locality information: latitude, longitude, and Yahoo Where-on-Earth Identifiers (WOEIDs). Similarly, the GenBank feed I created has latitudes and longitudes (although extracting these isn't always as straightforward as it should be). Other feeds require more processing. The uBio feed already has taxonomic names, but no geotagging, so I use services from Yahoo! GeoPlanet™ to find localities from article titles. For the Zootaxa feed that I created I use uBio's SOAP service to extract taxonomic names, and Yahoo! GeoPlanet™ to extract localities.
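
For the feeds that already carry coordinates, the processing boils down to pulling out a title, a timestamp and a latitude/longitude pair for each item. The following Python sketch shows one way that step could look; it is an illustration rather than the code behind the demo. It assumes RSS 2.0 items with a `pubDate`, and coordinates given either as a GeoRSS `georss:point` element or as W3C `geo:lat`/`geo:long` elements. The function name `parse_geotagged_items` is hypothetical, and the name finding and GeoPlanet look-ups needed for the other feeds are not shown.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Namespaces commonly used for geotagging in RSS feeds.
NS = {
    "georss": "http://www.georss.org/georss",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}


def parse_geotagged_items(rss_xml):
    """Return geotagged, timestamped items from an RSS string, oldest first."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        # Prefer georss:point ("lat lon"), fall back to geo:lat / geo:long.
        point = item.findtext("georss:point", namespaces=NS)
        if point:
            lat, lon = (float(value) for value in point.split())
        else:
            lat_text = item.findtext("geo:lat", namespaces=NS)
            lon_text = item.findtext("geo:long", namespaces=NS)
            if lat_text is None or lon_text is None:
                continue  # not geotagged: needs a geocoding step instead
            lat, lon = float(lat_text), float(lon_text)

        published = item.findtext("pubDate")
        if not published:
            continue  # the display needs a timestamp as well as a place

        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "when": parsedate_to_datetime(published),
            "lat": lat,
            "lon": lon,
        })
    return sorted(items, key=lambda entry: entry["when"])
```

Items that are skipped for lack of coordinates are exactly the ones that need the extra processing described above (finding a place name in the title and geocoding it).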


I've tried to create a useful display popup. For Zootaxa papers you get a thumbnail of the paper, and where possible an icon of the taxonomic group the paper talks about (the presence of this icon depends on the success of uBio's taxonomic name finding service, the Catalogue of Life having the same name, and my having a suitable icon). The example above shows a paper about copepods. Other papers have an icon for the journal (again, a function of my being able to determine the journal ISSN and having a suitable icon). Flickr images simply display a thumbnail of the image.
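
The icon logic is essentially a chain of look-ups, any of which can fail. A small Python sketch of that chain, assuming plain dictionaries stand in for the real services: `name_to_group` plays the role of a Catalogue of Life look-up, while `group_icons` and `journal_icons` are hypothetical tables of the icons to hand. It also assumes a journal icon is the fallback when no taxon icon is found, which simplifies the behaviour described above.

```python
def choose_icon(names, issn, name_to_group, group_icons, journal_icons):
    """Pick a popup icon: a taxon-group icon if possible, else a journal icon.

    names         -- taxonomic names found in the item (may be empty)
    issn          -- journal ISSN, or None if it could not be determined
    name_to_group -- maps a name to its higher taxonomic group
    group_icons   -- maps a group to an icon file
    journal_icons -- maps an ISSN to an icon file
    """
    for name in names:
        # Each step may come back empty: name unknown, or no icon for the group.
        group = name_to_group.get(name)
        icon = group_icons.get(group)
        if icon:
            return icon
    # No usable taxon icon; returns None if there is no journal icon either.
    return journal_icons.get(issn)
```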

What does it all mean? Well, I could say all sorts of things about integration and mash-ups but, dammit, it's pretty. I think it's a fun way to see just what is happening in digital biodiversity. I've deliberately limited the demo to items that came online in the month of May, and I'll be adding items during the conference (June 1-3 in London). For example, if any more papers appear in Zootaxa, or in the uBio feeds, I'll add those. If anybody uploads geotagged photos to EOL's Flickr group, I'll grab those as well. It's still a bit crude, but it shows some of the potential of bringing things together, coupled with a nice visualisation. I welcome any feedback.