Accounting Careers


Viewing scientific articles on the iPad: the PLoS Reader

Continuing on from my previous post Viewing scientific articles on the iPad: towards a universal article reader, here are some brief notes on the PLoS iPad app that I've previously been critical of.

There are two key things to note about this app. The first is that it uses the page turning metaphor. The article is displayed as a PDF, a page at a time, and the user swipes the page to turn it over. Hence, the app is simulating paper on the iPad screen.

But perhaps more interesting is that, unlike the Nature app discussed earlier, the PLoS app doesn't use a custom API to retrieve articles. Instead the app uses RSS feeds from the PLoS site. PLoS provides journal-specific RSS feeds, as well as subject-specific feeds within journals (see, for example, the PLoS ONE home page). The PLoS Reader app takes these feeds and uses them to create a list of articles the reader can choose from.

A nice feature of the PLoS ATOM feeds is the provision of links to alternative formats for the article (unlike many journal RSS feeds, which provide just a DOI or a URL). For example, the feed item for the article "Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing" doi:10.1371/journal.pone.0012303 contains links to the PDF and XML versions of the article:


<link rel="related"
   type="application/pdf"
   href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=PDF"
   title="(PDF) Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing" />
<link rel="related"
   type="text/xml"
   href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=XML"
   title="(XML) Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing" />

This makes the task of an article reader much easier. Rather than attempt to screen scrape the article web page, or rely on a rule for constructing the link to the desired file, the feed provides an explicit URL to the different available formats.
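As a rough sketch of how a reader could use these links (Python, standard library only): the function name `related_formats` is my own, and I'm assuming the `link` elements sit inside an Atom `entry` element. This is not how the PLoS Reader itself is implemented, just one way to do it.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def related_formats(entry):
    """Map MIME type -> URL for every <link rel="related"> in an Atom entry."""
    formats = {}
    for link in entry.iter(ATOM + "link"):
        if link.get("rel") == "related" and link.get("type"):
            formats[link.get("type")] = link.get("href")
    return formats


# A cut-down entry holding just the two links shown above (assumed wrapper).
SAMPLE = """<entry xmlns="http://www.w3.org/2005/Atom">
  <link rel="related" type="application/pdf"
        href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=PDF" />
  <link rel="related" type="text/xml"
        href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=XML" />
</entry>"""

formats = related_formats(ET.fromstring(SAMPLE))
for mime_type, url in formats.items():
    print(mime_type, url)
```

A reader can then pick whichever MIME type it knows how to display, without guessing at URL patterns.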

I've not seen this feature in other journal RSS feeds, although article web pages sometimes provide this information. BMC journals, for example, provide <link rel="alternate"> tags in the web page for each article, from which we can extract links to the XML and PDF versions, and some journals (BMC included) provide the Google Scholar metadata tag <meta name="citation_pdf_url"> to link to the PDF. Hence, a generic article reader will need to be able to extract metadata tags from article web pages as it seeks formats suitable to display.
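Here is a minimal sketch of that extraction step in Python, using only the standard library. The class name `FormatLinkParser`, the helper `available_formats`, and the example.org URLs are all made up for illustration; real article pages will be messier than this.

```python
from html.parser import HTMLParser


class FormatLinkParser(HTMLParser):
    """Collect alternate-format URLs advertised in an article web page."""

    def __init__(self):
        super().__init__()
        self.formats = {}  # MIME type -> URL (first one seen wins)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        if tag == "link" and "alternate" in rel and a.get("type") and a.get("href"):
            self.formats.setdefault(a["type"], a["href"])
        elif tag == "meta" and a.get("name") == "citation_pdf_url" and a.get("content"):
            self.formats.setdefault("application/pdf", a["content"])


def available_formats(html):
    parser = FormatLinkParser()
    parser.feed(html)
    parser.close()
    return parser.formats


# Hypothetical page head; the URLs are placeholders, not real articles.
PAGE = """<html><head>
<link rel="alternate" type="text/xml" href="http://example.org/article/1.xml" />
<meta name="citation_pdf_url" content="http://example.org/article/1.pdf" />
</head><body></body></html>"""

formats = available_formats(PAGE)
print(formats)
```

Note that a feed autodiscovery link (type application/rss+xml) is also a rel="alternate" link, so a real reader would want to filter on the MIME types it can actually display.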

Show me the trees! Playing with the TreeBASE API

Being in an unusually constructive mood, I've spent the last couple of days playing with the TreeBASE II API, in an effort to find out how hard it would be to replace TreeBASE's frankly ghastly interface.

After some hair pulling and bad language I've got something to work. It's very crude, but gives a glimpse at what can be done. If you visit http://iphylo.org/~rpage/mytreebase/ and enter a taxon name, my code paddles off and queries TreeBASE to see if it has any phylogenies for that taxon. Gears grind, RSS feeds are crunched, a triple store is populated, NEXUS files are grabbed and Newick trees extracted, small creatures are needlessly harmed, and at last some phylogeny thumbnails are rendered in SVG (based on code I mentioned earlier), grouped by study. Functionality is limited (you can't click on the trees to make them bigger, for example), and the bibliographic information TreeBASE stores for studies is a bit ropey, but you get the idea.

What I'm looking for at this stage is a very simple interface that answers the question "show me the trees", which I think is the most basic question you can ask of TreeBASE (and one its own web interface makes unnecessarily hard). I've also gained some inspiration from the BioText search engine.

If you want to give it a try, here are some examples. These examples should be fairly responsive as the data is cached, but if you try searching for other taxa you may have a bit of a wait while my code talks to TreeBASE.

Wikispecies RSS feed

Following on from my previous post about Wikispecies (which generated some discussion on TAXACOM) I've played some more with Wikispecies.

As a first step I've added a Wikispecies RSS feed to my list of RSS feeds. This feed takes the original Wikispecies RSS feed for new pages (generated by the page Special:NewPages) and tries to extract some details before reformatting it as an ATOM feed. Specifically, I extract GUIDs such as IPNI and Index Fungorum identifiers, bibliographic references (which I will later parse to try and extract identifiers such as DOIs), and latitude and longitude if the Wikispecies page has type locality information. Having the latter means that the RSS feed can be displayed as a map (Google Maps can take an RSS feed with geotagged items and display it on a map for you).
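To give an idea of what a geotagged item might look like, here is a small Python sketch that builds an ATOM entry carrying a GeoRSS-Simple point. Using GeoRSS for the geotag is my assumption for this example (the feed itself may encode coordinates differently), and the function name, title, and example.org URL are invented.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
GEORSS_NS = "http://www.georss.org/georss"

# Ask ElementTree to serialise with readable prefixes.
ET.register_namespace("", ATOM_NS)
ET.register_namespace("georss", GEORSS_NS)


def geotagged_entry(title, url, lat, lon):
    """Build an Atom <entry> with a <georss:point> holding "lat lon"."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range")
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="alternate", href=url)
    # GeoRSS-Simple order is latitude first, then longitude.
    ET.SubElement(entry, f"{{{GEORSS_NS}}}point").text = f"{lat} {lon}"
    return entry


# Hypothetical page and type locality, for illustration only.
entry = geotagged_entry("Example species", "http://example.org/wiki/Example_species",
                        -31.95, 115.86)
print(ET.tostring(entry, encoding="unicode"))
```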

The map below is live, so it will show any geotagged items in the current Wikispecies feed.


Index Fungorum

I've added Index Fungorum to the list of RSS feeds that I generate at bioguid.info/rss. The feed uses the Index Fungorum web services to get the names added the previous day, and tries to extract any bibliographic identifiers from the metadata associated with each record (we get the metadata by resolving the LSID for the name). As with IPNI, bibliographic information in an Index Fungorum record lists the page the name was published on, which makes locating identifiers such as DOIs a bit of a struggle. Still, it's nice to have another feed of taxonomic names.

How to publish a journal RSS feed

This morning I posted this tweet:

Harvesting Nuytsia RSS http://science.dec.wa.gov.a... Non-trivial as links are not to individual articles #fail

My grumpiness (on this occasion; lots of things seem to make me grumpy lately) is that journal RSS feeds often leave a lot to be desired. As RSS feeds are a major source of biodiversity information (for a great example of their use see uBio's RSS, described in doi:10.1093/bioinformatics/btm109) it would be helpful if publishers did a few basic things. Some of these suggestions are in Lisa Roger's RSS and Scholarly Journal Tables of Contents: the ticTOCs Project, and Good Practice Guidelines for Publishers, but some aren't.

In the spirit of being constructive, here are some dos and don'ts.

Do

1. Validate the RSS feed

Run your feed through Feed Validator to make sure it's valid. Your feed is an XML document; if it won't validate then aggregators may struggle with it. Testing it in your favourite web browser isn't enough, but if a browser fails to display it this may be a clue that something's wrong. For example, Safari won't display the ZooKeys RSS feed, and at the time of writing this feed is not valid.

2. Make sure your feed is autodiscoverable

When I visit your web page my browser should tell me that there is a feed available (typically with an RSS icon in the location bar). If there's no such icon, then I have to look at your page to find the feed (if it exists). The Nuytsia page is an example of a non-discoverable feed. Making your feed autodiscoverable is easy: just add a link tag inside the head tag on the page. For example, something like this:


<link rel="alternate" type="application/rss+xml"
title="RSS Feed for Nuytsia"
href="http://science.dec.wa.gov.au/nuytsia/nuytsia.rss.xml" />

3. Use standard identifiers as the links

If your journal has DOIs, use those in the links, not the URL of the article web page. The latter is likely to change (the DOI won't, unless you are being naughty), and given a DOI I can harvest the metadata via CrossRef.
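One way to do that harvesting today is DOI content negotiation: ask the DOI resolver for citation metadata instead of the landing page. The Python sketch below only builds the request, so it runs without a network connection; the function name `doi_metadata_request` is mine, and this is an illustration rather than the method I used at the time.

```python
import urllib.parse
import urllib.request


def doi_metadata_request(doi):
    """Build (but don't send) a request asking the DOI resolver for CSL JSON."""
    doi = doi.strip()
    # Accept the common ways a DOI turns up in feeds and web pages.
    for prefix in ("doi:", "info:doi/", "http://dx.doi.org/"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return urllib.request.Request(
        "https://doi.org/" + urllib.parse.quote(doi, safe="/"),
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )


request = doi_metadata_request("doi:10.1093/bioinformatics/btm109")
print(request.full_url)
# To fetch for real: urllib.request.urlopen(request).read()
```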

4. Each item link in the feed links to ONE article

This was the reason for my grumpy tweet. The journal Nuytsia has an RSS feed (great!), but the links are not to individual articles. Instead, they are database queries that may generate one or more results. For example, this link RYE, B.L., (2009). Reinstatement of the Western Australian genus Oxymyrrhine (Myrtaceae : Chamelauci... actually lists two papers, both authored by B. L. Rye. This breaks the underlying model where the feed lists individual articles.

5. Include lots of metadata in your feed

If you don't use DOIs, then include metadata about your article in your feed. That way, I don't need to scrape your web pages; all I need is already in the feed.

6. Make it possible to harvest metadata about your articles

If you don't use DOIs as your article identifier, or don't use the DOIs as the item links in your RSS feed, then make it easy for me to get the bibliographic details from either the RSS feed, or from the web page. If you use RSS 1.0, then ideally you are using PRISM and I can get the metadata from that. If not, you can embed the metadata in the HTML page describing the article using Dublin Core and meta and link tags. For example, if you resolve this doi:10.1076/snfe.38.2.115.15923 and view the HTML source you will see this:


<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />
<meta http-equiv="Content-Language" content="en-gb" />
<link rel="shortcut icon" href="/mpp/favicon.ico" />
<meta name="verify-v1" content="xKhof/of+uTbjR1pAOMT0/eOFPxG8QxB8VTJ07qNY8w=" />
<meta name="DC.publisher" content="Taylor & Francis" />
<meta name="DC.identifier" content="info:doi/10.1076/snfe.38.2.115.15923" />
<meta name="description" content="In this study we determined the effects of topography on the distribution of ground-dwelling ants in a primary terra-firme forest near Manaus, in cent..." />
<meta name="authors" content="Heraldo L. Vasconcelos ,Antônio C. C. Macedo,José M. S. Vilhena" />
<meta name="DC.creator" content="Heraldo L. Vasconcelos" />
<meta name="DC.creator" content="Antônio C. C. Macedo" />
<meta name="DC.creator" content="José M. S. Vilhena" />

Not pretty, but it enables me to get the details I want.
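For what it's worth, pulling those details out takes only a few lines. This is a minimal Python sketch using the standard library; the class name `DublinCoreParser` is invented, and the sample is a subset of the tags shown above.

```python
from collections import defaultdict
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Gather <meta name="DC.*"> values; repeated names (authors) are kept in order."""

    def __init__(self):
        super().__init__()
        self.fields = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        name = a.get("name") or ""
        if tag == "meta" and name.startswith("DC.") and a.get("content"):
            self.fields[name[3:]].append(a["content"])


HEAD = """
<meta name="DC.identifier" content="info:doi/10.1076/snfe.38.2.115.15923" />
<meta name="DC.creator" content="Heraldo L. Vasconcelos" />
<meta name="DC.creator" content="Antônio C. C. Macedo" />
<meta name="DC.creator" content="José M. S. Vilhena" />
"""

parser = DublinCoreParser()
parser.feed(HEAD)
parser.close()
fields = dict(parser.fields)
print(fields["identifier"], fields["creator"])
```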

7. Support conditional HTTP GET

If you don't want feed readers and aggregators to hammer your service, support HTTP conditional GET (see here for details) so that feed readers only grab your feed if it has changed. Not many journal publishers do this, so if they get overloaded by people grabbing RSS feeds they've only themselves to blame.
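The server-side logic is not complicated. Here is a simplified Python sketch of the decision a feed server has to make: given the validators the client sent back, can it answer 304 Not Modified instead of sending the whole feed again? The function name `is_not_modified` is mine, and I'm assuming the request headers arrive as a plain dict with canonical capitalisation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _opaque(tag):
    """Drop the weak-validator prefix so W/"x" and "x" compare equal."""
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, etag, last_modified):
    """Return True when a 304 Not Modified response is appropriate.

    etag is the feed's current entity tag (including its quotes) and
    last_modified is a timezone-aware datetime.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = [_opaque(t.strip()) for t in if_none_match.split(",")]
        return "*" in candidates or _opaque(etag) in candidates
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            return last_modified <= parsedate_to_datetime(since)
        except (TypeError, ValueError):
            return False  # unparseable date: just send the feed
    return False


feed_changed = datetime(2009, 5, 6, 17, 37, 34, tzinfo=timezone.utc)
headers = {"If-Modified-Since": "Wed, 06 May 2009 17:37:34 GMT"}
print(304 if is_not_modified(headers, '"abc"', feed_changed) else 200)
```

A well-behaved feed reader stores the ETag and Last-Modified values from its previous fetch and sends them back in these two headers.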

Don'ts

1. Sign up/log in

Don't ever ask me to sign up or log in to get the RSS feed (Cambridge University Press, I'm looking at you). If you think your content is so good/precious that I should sign up for it, you are sadly mistaken. Nature doesn't ask me to log in, nor should you.

2. Break DOIs

Another major cause of grumpiness is the frequency with which DOIs break, especially for recently published articles (i.e., precisely those that will be encountered in RSS feeds). There is quite simply no excuse for this. If your workflow results in DOIs being put on web pages before they are registered with CrossRef, then you (or CrossRef) are incompetent.

e-Biosphere Challenge: visualising biodiversity digitisation in real time

e-Biosphere '09 kicks off next week, and features the challenge:

Prepare and present a real-time demonstration during the days of the Conference of the capabilities in your community of practice to discover, disseminate, integrate, and explore new biodiversity-related data by:
Capturing data in private and public databases;
Conducting quality assurance on the data by automated validation and/or peer review;
Indexing, linking and/or automatically submitting the new data records to other relevant databases;
Integrating the data with other databases and data streams;
Making these data available to relevant audiences;
Make the data and links to the data widely accessible; and
Offering interfaces for users to query or explore the data.

Originally I planned to enter the wiki project I've been working on for a while, but time was running out and the deadline was too ambitious. Hence, I switched to thinking about RSS feeds. The idea was to first create a set of RSS feeds for sources that lack them, which I've been doing over at http://bioguid.info/rss, then integrate these feeds in a useful way. For example, the feeds would include images from Flickr (such as EOL's pool), geotagged sequences from GenBank, the latest papers from Zootaxa, and new names from uBio (I'd hoped to include ION as well, but they've been spectacularly hacked).

After playing with triple stores and SPARQL (incompatible vocabularies and multiple identifiers rather bugger this approach), and visualisations based on Google Maps (building on my swine flu timemap), it dawned on me that what I really needed was an eye-catching way of displaying geotagged, timestamped information, just like David Troy's wonderful twittervision and flickrvision.com. In particular, David took the Poly9 Globe and added Twitter and Flickr feeds (see twittervision 3D and flickrvision 3D). So, I hacked David's code and created this, which you can view at http://bioguid.info/ebio09/www/3d/:

It's a lot easier to simply look at it rather than describe what it does, but here's a quick sketch of what's under the hood.

Firstly, I take RSS feeds, either the raw geoFeed from Flickr, or from http://bioguid.info/rss. The bioGUID feeds include the latest papers in Zootaxa (most new animal species are described in this journal), a modified version of uBio's new names feed, and a feed of the latest, geotagged sequences in GenBank (I'd hoped to use only DNA barcodes, but it turns out rather few barcode sequences are geotagged, and few have the "BARCODE" keyword). The Flickr feeds are simple to handle because they include locality information (including latitude, longitude, and Yahoo Where-on-Earth Identifiers (WOEIDs)). Similarly, the GenBank feed I created has latitude and longitudes (although extracting this isn't always as straightforward as it should be). Other feeds require more processing. The uBio feed already has taxonomic names, but no geotagging, so I use services from Yahoo! GeoPlanet™ to find localities from article titles. For the Zootaxa feed that I created I use uBio's SOAP service to extract taxonomic names, and Yahoo! GeoPlanet™ to extract localities.

I've tried to create a useful display popup. For Zootaxa papers you get a thumbnail of the paper, and where possible an icon of the taxonomic group the paper talks about (the presence of this icon depends on the success of uBio's taxonomic name finding service, the Catalogue of Life having the same name, and my having a suitable icon). The example above shows a paper about copepods. Other papers have an icon for the journal (again, a function of my being able to determine the journal ISSN and having a suitable icon). Flickr images simply display a thumbnail of the image.

What does it all mean? Well, I could say all sorts of things about integration and mash-ups but, dammit, it's pretty. I think it's a fun way to see just what is happening in digital biodiversity. I've deliberately limited the demo to items that came online in the month of May, and I'll be adding items during the conference (June 1-3rd in London). For example, if any more papers appear in Zootaxa, or in the uBio feeds I'll add those. If anybody uploads geotagged photos to EOL's Flickr group, I'll grab those as well. It's still a bit crude, but it shows some of the potential of bringing things together, coupled with a nice visualisation. I welcome any feedback.

Integrating and displaying data using RSS

Although I'd been thinking of getting the wiki project ready for e-Biosphere '09 as a challenge entry, lately I've been playing with RSS as a complementary, but quicker way to achieve some simple integration.

I've been playing with RSS on and off for a while, but what reignited my interest was the swine flu timemap I made last week. The neatest thing about the timemap was how easy it was to make. Just take some RSS that is geotagged and you get the timemap (courtesy of Nick Rabinowitz's wonderful Timemap library).

So, I began to think about taking RSS feeds for, say, journals and taxonomic and genomic databases, adding them together, and displaying them using tools such as timemap (see here for an earlier mock up of some GenBank data). Two obstacles are in the way. The first is that not every data source of interest provides RSS feeds. To address this I've started to develop wrappers around some sources, the first of which is ZooBank.

The second obstacle is that integration requires shared content (e.g., tags, identifiers, or localities). Some integration will be possible geographically (for example, adding geotagged sequences and images to a map), but this won't work for everything. So, I need to spend some time trying to link stuff together. In the case of ZooBank there's some scope for this, as ZooBank metadata sometimes includes DOIs, which enables us to link to the original publication, as well as bookmarking services such as Connotea. I'm aiming to include these links within the feed, as shown in this snippet (see the <link rel="related"...> element):

<entry>
<title>New Protocetid Whale from the Middle Eocene of Pakistan: Birth on Land, Precocial Development, and Sexual Dimorphism</title>
<link rel="alternate" type="text/html" href="http://zoobank.org/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
<updated>2009-05-06T18:37:34+01:00</updated>
<id>urn:uuid:c8f6be01-2359-1805-8bdb-02f271a95ab4</id>
<content type="html">Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith &amp; Iyad S. Zalmout&lt;br/&gt;&lt;a href="http://dx.doi.org/10.1371/journal.pone.0004366"&gt;doi:10.1371/journal.pone.0004366&lt;/a&gt;</content>
<summary type="html">Gingerich, Philip D., Munir ul-Haq, Wighart von Koenigswald, William J. Sanders, B. Holly Smith &amp; Iyad S. Zalmout&lt;br/&gt;&lt;a href="http://dx.doi.org/10.1371/journal.pone.0004366"&gt;doi:10.1371/journal.pone.0004366&lt;/a&gt;</summary>
<link rel="related" type="text/html" href="http://dx.doi.org/10.1371/journal.pone.0004366" title="doi:10.1371/journal.pone.0004366"/>
<link rel="related" type="text/html" href="http://bioguid.info/urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C" title="urn:lsid:zoobank.org:pub:8625FB9A-1FC3-43C3-9A99-7A3CDE0DFC9C"/>
</entry>

What I'm hoping is that there will be enough links to create something rather like my Elsevier Challenge entry, but with a much more diverse set of sources.
