Mashing up NCBI and Wikipedia using treemaps
Having made a first stab at mapping NCBI taxa to Wikipedia, I thought it might be fun to see what could be done with it. I've always wanted to get quantum treemaps working (quantum treemaps ensure that the cells in the treemap are all the same size, see my 2006[!] blog post for further description and links). After some fussing I have some code that seems to do the trick. As an example, here is a quantum treemap for Laurasiatheria.
The diagram shows the NCBI taxonomy subtree rooted on Laurasiatheria, with images (where available) from Wikipedia for the children of the children of that node. In other words, the images correspond to the tips of the tree below.
There's a lot to be done to tidy this up, but there is potential to create a nice, visual way to navigate through the NCBI taxonomy (it might work well on the iPhone or iPad, for example).
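For the curious, the heart of the idea is easy to sketch. Here is a minimal JavaScript version of the cell-sizing step (just the core idea, not the full quantum treemap layout algorithm; the function and parameter names are my own):

// Every image is drawn at a fixed "quantum" size, so a group of n images
// gets a grid of columns x rows quanta (with columns * rows >= n). Cell
// dimensions are exact multiples of the quantum, which is what keeps all
// the thumbnails the same size.
function quantumCell(n, quantumWidth, quantumHeight, targetAspect) {
  // choose a column count that roughly matches the target aspect ratio
  var columns = Math.max(1, Math.round(Math.sqrt(n * targetAspect)));
  var rows = Math.ceil(n / columns);
  return { width: columns * quantumWidth, height: rows * quantumHeight };
}

// e.g. a taxon with 7 images at 64 x 64 pixels:
// quantumCell(7, 64, 64, 1) -> { width: 192, height: 192 } (a 3 x 3 grid)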
Labels: mashup, NCBI, treemap, visualisation, Wikipedia
EOL, the BBC, and Wikipedia
Last month EOL took the brave step of including Wikipedia content in its pages. I say "brave" because early on EOL was pretty reluctant to embrace Wikipedia on this scale (see the report of the Informatics Advisory Group that I chaired back in 2008), and also because not all of EOL's curators have been thrilled with this development. Partly to assuage their fears, EOL displays Wikipedia-derived content on a yellow background to flag its "unreviewed" status, as it does on its page for the python genus Leiopython.
It's interesting to compare EOL's approach to Wikipedia with that taken by the BBC, as documented in Case Study: Use of Semantic Web Technologies on the BBC Web Sites. The BBC makes extensive use of content from community-driven external sites such as MusicBrainz and Wikipedia. They embed the content in their own pages, stating where the content came from, but not flagging it as any less meaningful or reliable than the BBC's own content (i.e., no garish yellow background).
Furthermore, the BBC does two clever things. Firstly:
To facilitate integration with the resources external to bbc.co.uk the music site reuses MusicBrainz URL slugs and Wildlife Finder Wikipedia URL slugs. This means that it is relatively straight forward to find equivalent concepts on Wikipedia/DBpedia and Wildlife Finder and, MusicBrainz and /music.
This means that if the identifier for the artist Bat for Lashes in MusicBrainz is http://musicbrainz.org/artist/10000730-525f-4ed5-aaa8-92888f060f5f.html, the BBC reuse the "slug" 10000730-525f-4ed5-aaa8-92888f060f5f and create a page at http://www.bbc.co.uk/music/artists/10000730-525f-4ed5-aaa8-92888f060f5f. Likewise, if the Wikipedia page for Varanus komodoensis is http://en.wikipedia.org/wiki/Komodo_dragon, then the BBC Wildlife Finder page becomes http://www.bbc.co.uk/nature/species/Komodo_dragon, reusing the slug Komodo_dragon.
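In code, reusing a slug is about as simple as linking gets. A sketch (the helper function is my own):

// Reuse another database's URL "slug" as your own identifier: take the
// last path segment and strip any file extension.
function slugFromUrl(url) {
  var last = url.split('/').pop();
  return last.replace(/\.[a-z]+$/, '');
}

var artist = slugFromUrl('http://musicbrainz.org/artist/10000730-525f-4ed5-aaa8-92888f060f5f.html');
var musicPage = 'http://www.bbc.co.uk/music/artists/' + artist;

var species = slugFromUrl('http://en.wikipedia.org/wiki/Komodo_dragon');
var naturePage = 'http://www.bbc.co.uk/nature/species/' + species;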
Reusing identifiers like this can greatly facilitate linking between databases. I don't need to do a search or approximate string matching; I just reuse the slug. Note that this is a two-way thing: it is trivial for MusicBrainz to create links to BBC information, and vice versa. Reusing identifiers isn't new; other examples include Amazon.com's ASINs (which for books are ISBNs), and BHL's reuse of uBio NameBankIDs. Want literature that mentions the Komodo dragon? Use the uBio NameBankID 2546401 in a BHL URL: http://www.biodiversitylibrary.org/name/2546401.
The second clever thing the BBC does is treat the web as a content management system:
BBC Music is underpinned by the Musicbrainz music database and Wikipedia, thereby linking out into the Web as well as improving links within the BBC site. BBC Music takes the approach that the Web itself is its content management system. Our editors directly contribute to Musicbrainz and Wikipedia, and BBC Music will show an aggregated view of this information, put in a BBC context.
Instead of separating BBC and Wikipedia content (and putting the latter in quarantine, as EOL does), the BBC embraces Wikipedia, editing Wikipedia content if they feel a page needs improving. One advantage of this approach is that it avoids the need for the BBC to replicate Wikipedia, either in terms of content (the BBC doesn't need to write its own descriptions of what an organism does) or services (the BBC doesn't need to develop tools for people to edit the BBC pages; people use Wikipedia's infrastructure for this). Wikipedia provides core text and identifiers; the BBC provides its own unique content and branding.
EOL is trying something different, and perhaps more challenging (at least to do it properly). Given that both EOL and Wikipedia offer text about organisms, there is likely to be overlap (and possibly conflict) between what EOL and Wikipedia say about the same taxon. Furthermore, there will be duplication of information such as bibliographic references. For example, the Wikipedia content included in the EOL page for Leiopython contains a bibliography, which includes these references:
Hubrecht AAW. 1879. Notes III on a new genus and species of Pythonidae from Salawatti. Notes from the Leyden Museum 14-15.
Boulenger GA. 1898. An account of the reptiles and batrachians collected by Dr. L. Loria in British New Guinea. Annali del Museo Civico de Storia Naturale di Genova (2) 18:694-710.
The genus name Leiopython was published by Hubrecht (1879), and Boulenger (1898) is cited in support of a claim that a distribution record is erroneous. Hence, these look like useful papers to read. Neither reference on the Wikipedia page is linked to an online version of the article, but both have been scanned by EOL's partner BHL (you can see the articles in BioStor here and here, respectively)1.
The problem is, you'd be hard-pressed to discover this from the EOL page. The BHL results do list the journal Notes from the Leyden Museum, but you'd have to visit the links manually to discover whether they include Hubrecht (1879) (they do, as well as various occurrences of Leiopython in the indices for the journal). In part this problem is a consequence of the crude way EOL handles bibliographies retrieved from BHL, but it's symptomatic of a broader problem. By simply mashing EOL and Wikipedia content together, EOL is missing an opportunity to make both itself and Wikipedia more useful. Surely it would be helpful to discover which publications cited on Wikipedia pages are in BHL (or in the list of references for hand-curated EOL pages)? This requires genuine integration, for example by reusing existing bibliographic identifiers such as DOIs, and tools such as OpenURL resolvers. If it fails to do this, EOL will resemble crude pre-Web 2.0 mashups, where people created web pages that had content from external sites enclosed in <IFRAME> tags.
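To sketch what that integration could look like, here is the sort of OpenURL query that would let a site ask "is this reference in BHL?" for Hubrecht (1879). The resolver URL is illustrative (bioGUID and BioStor both offer OpenURL services; their exact parameters may differ):

// Build an OpenURL key/value query for a bibliographic reference
var ref = {
  genre: 'article',
  title: 'Notes from the Leyden Museum',  // journal title
  volume: '1',                            // see the footnote below
  spage: '14',
  epage: '15',
  date: '1879'
};
var pairs = [];
for (var key in ref) {
  pairs.push(key + '=' + encodeURIComponent(ref[key]));
}
var query = 'http://bioguid.info/openurl?' + pairs.join('&');
// Resolving queries like this, rather than matching journal titles by eye,
// is what would let EOL link Wikipedia's references straight to the scans.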
The contrast between the approaches adopted by EOL and the BBC is pretty stark. The BBC has devolved text content to external, community-driven sites that it thinks will do a better job than the BBC could alone. EOL is trying to integrate Wikipedia into its own text content, but without addressing the potentially massive duplication (and, indeed, possible contradictions) that are likely to arise. Perhaps it's time for EOL to be as brave as the BBC, and ask itself whether it is sensible for EOL to try and occupy the same space as Wikipedia.
1. Note that the bibliographic details of both papers are wanting: Hubrecht (1879) is in volume 1 of Notes from the Leyden Museum, and Annali del Museo Civico de Storia Naturale di Genova series 2, volume 18 is also treated as volume 38.
Labels: BBC, BHL, Encyclopedia of Life, Linked data, mashup, Wikipedia
To wiki or not to wiki?
What follows are some random thoughts as I try and sort out what things I want to focus on in the coming days/weeks. If you don't want to see some wallowing and general procrastination, look away now.
I see four main strands in what I've been up to in the last year or so:
- services
- mashups
- wikis
- phyloinformatics
Services
Not glamourous, but necessary. This is basically bioGUID (see also hdl:10101/npre.2009.3079.1). bioGUID provides OpenURL services for resolving articles (it has nearly 84,000 articles in its cache), as well as journal name lookups, LSID resolution, and RSS feeds.
Mashups
iSpecies is my now-aging tool for mashing up data from diverse sources, such as Wikipedia, NCBI, GBIF, Yahoo, and Google Scholar. I tweak it every so often (mainly to deal with Google Scholar forever mucking around with their HTML). The big limitation of iSpecies is that it doesn't make its results reusable (i.e., you can't write a script to call iSpecies and return data; see the sketch below). However, it's still the place I go to find out quickly about a taxon.
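To make that concrete: if iSpecies returned something like the following JSON instead of a web page, scripts could build on it. This is purely hypothetical; none of these fields exist, and the values are placeholders:

// A hypothetical machine-readable iSpecies result
var result = {
  query: "Komodo_dragon",
  sources: {
    wikipedia: { url: "http://en.wikipedia.org/wiki/Komodo_dragon", summary: "..." },
    ncbi: { taxonomy: "..." },
    gbif: { occurrences: "..." },
    scholar: [ /* articles */ ]
  }
};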
The other mashups I've been playing with focus on taking standardised RSS feeds (provided by bioGUID, see above) and mashing them up, sometimes with a nice front end (e.g., my e-Biosphere 09 challenge entry).
Wiki
I've invested a huge amount of effort in learning how wikis (especially MediaWiki and its semantic extensions) work, documented in earlier posts. I created a wiki of taxonomic names as a sandbox to explore some of these ideas.
I've come to the conclusion that for basic taxonomic and biological information, the only sensible strategy for our community is to use (and contribute to) Wikipedia. I'm struggling to see any justification for continuing with a proliferation of taxonomic databases. After e-Biosphere 09 the game's up: people have started to notice that we have an excess of databases (see Claire Thomas in Science, "Biodiversity Databases Spread, Prompting Unification Call", doi:10.1126/science.324_1632).
Phyloinformatics
In truth I've not been doing much on this, apart from releasing tvwidget (code available from Google Code), and playing with a mapping of TreeBASE studies to bibliographic identifiers (available as a featured download from here). I've played with tvwidget in MediaWiki, and it seems to work quite well.
Where now?
So, where now? Here are some thoughts:
- I will continue to hack bioGUID (it's now consuming RSS feeds from journals, as well as Zotero). Everything I do pretty much depends on the services bioGUID provides.
- iSpecies really needs a big overhaul to serve data in a form that can be built upon. But this requires decisions on what that format should be, so it isn't likely to happen soon. That said, I think the future of mashup work is to use RDF and triple stores (provided that some degree of editing is possible). I think a tool linking together different data sources (along the lines of my ill-fated Elsevier Challenge entry) has enormous potential.
- I'm exploring Wikipedia and Wikispecies. I'm tempted to do a quantitative analysis of Wikipedia's classification. I think there needs to be some serious analysis of Wikipedia if people are going to use it as a major taxonomic resource.
- If I focus on Wikipedia (i.e., using an existing wiki rather than trying to create my own), then that leaves me wondering what all the playing with iTaxon was for. Well, actually I think the original goal of this blog (way back in December 2005) is ideally suited to a wiki. Pretty much all the elements are in place to dump a copy of TreeBASE into a wiki and open up the editing of links to literature and taxonomic names. I think this is going to handily beat my previous efforts (TbMap, doi:10.1186/1471-2105-8-158), especially as errors will be easy to fix.
H1N1 Swine Flu TimeMap
Tweets from @attilacsordas and @stew alerted me to the Google Map of the H1N1 Swine Flu outbreak by niman.
Ryan Schenk commented that "It'd be a million times more useful if that map was hooked into a timeline so you could see the spread", which inspired me to knock together a timemap of swine flu. The timemap takes the RSS feed from niman's map and generates a timemap using Nick Rabinowitz's Timemap library.
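For anyone who wants to try the same trick, the basic recipe is short. Here's a sketch based on the timemap.js examples (option names may differ between versions of the library, and the feed URL here stands in for the real one):

// Load a GeoRSS feed onto a linked Google Map and SIMILE timeline
// using Nick Rabinowitz's timemap.js
var tm = TimeMap.init({
  mapId: "map",              // id of the <div> holding the map
  timelineId: "timeline",    // id of the <div> holding the timeline
  datasets: [{
    title: "H1N1 Swine Flu",
    type: "georss",          // parse the dataset as a GeoRSS feed
    options: {
      url: "swineflu.xml"    // stand-in for the feed from niman's map
    }
  }]
});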
Gotcha
Although in principle this should have been a trivial exercise (cutting and pasting into existing examples), it wasn't quite so straightforward. The Google Maps RSS feed is a GeoRSS feed, but initially I couldn't get Timemap to accept it. The contents of the <georss:point> tag in the Google Maps feed look like this:
<georss:point>
33.041477 -116.894531
</georss:point>
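Note the whitespace either side of the coordinates; that turns out to matter. A quick illustration of why (a sketch, assuming the loader splits the tag contents on whitespace without trimming first):

// In JavaScript, leading whitespace produces an empty first field when
// splitting on whitespace, so the "latitude" ends up as "" (NaN once parsed):
'\n  33.041477 -116.894531\n'.split(/\s+/);
// -> ["", "33.041477", "-116.894531", ""]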
Turns out there's a minor bug in the file timemap.js, which I fixed by adding
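// Google's feed includes leading whitespace inside <georss:point>,
// so trim the tag contents before splitting on whitespace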
coords = TimeMap.trim(coords);
before line 1369. The contents of the <georss:point> tag include leading white space, and because timemap.js splits the latitude and longitude using whitespace, Google's feed breaks the code.
Postscript
Nick Rabinowitz has fixed this bug.
Labels: GeoRSS, Google Maps, H1N1, mashup, swine flu, timemap, visualisation