Search this keyword

Showing posts with label MySQL. Show all posts
Showing posts with label MySQL. Show all posts

Browsing a digital library using a map

Every so often I revisit the idea of browsing a collection of documents (or specimens, or phylogenies) geographically. It's one thing to display a map of localities for single document (as I did most recently for Zootaxa), it's quite another to browse a large collection.

Today I finally bit the bullet and put something together, which you can see at http://biostor.org/maps/. The website comprises a Google Map showing localities extracted from papers in BioStor, and a list of the papers that have one or more points visible on the map.

mapbrowser.png


In building this I hit a few obstacles. The first is the number of localities involved. I've extracted several thousand point localities from articles in BioStor. Displaying all these on a Google Map is going to be tedious. Fortunately, there's a wonderful library called MarkerCluster, part of the google-maps-utility-library-v3 that handles this problem. MarkerCluster cluster together markers based on zoom level. If you zoom out the markers cluster together, as you zoom in these clusters will start to resolve into their component points. Very, very cool.

The second challenge was to have the list of references update automatically as we move around or zoom in and out on the map. To do this I need to know the bounding box currently being displayed in the map, I can then query the MySQL database underlying BioStor for the localities within the bounding box, using MySQL's spatial extensions. The query is easy enough to implement using ajax, but the trick was knowing when to call it. Initially, listening for the bounds_changed event seemed a good idea. However, this event is fired as the map is being moved (i.e., if the user is panning or dragging the map a whole series of bounds_changed events are fired), whereas what I want is something that signals that the user has stopped moving the map, at which point I can query the database for articles that correspond to the region that map is currently displaying. Turns out that the event I need to listen for is idle (see Issue 1371: map.bounds_changed event fires repeatedly when the map is moving), so I have a function that captures that event and loads the corresponding set of articles.

Another "gotcha" occurs when the region being viewed crosses longitude 180° (or -180°) (see diagram below from http://georss.org/Encodings).

179-rule.jpg


In this case the polygon used to query MySQL would be incorrectly interpreted, so I create two polygons, each with 180° or -180° as one of the boundaries, and merge the articles with points in either of those two polygons.

I've made a short video showing the map in action. Although I've implemented this for BioStor, the code is actually pretty generic, and could easily be adapted to other cases where we want to navigate through a set of objects geographically.


n-gram fulltext indexing in MySQL

Continuing with my exploration of the Biodiversity Heritage Library one obstacle to linking BHL content with nomenclature databases is the lack of a consistent way to refer to the same bibliographic item (e.g., book or journal). For example, the Amphibia Species of the World (ASW) page for Gastrotheca aureomaculata gives the first reference for this name as:

Gastrotheca aureomaculata Cochran and Goin, 1970, Bull. U.S. Natl. Mus., 288: 177. Holotype: FMNH 69701, by original designation. Type locality: "in [Departamento] Huila, Colombia, at San Antonio, a small village 25 kilometers west of San Agustín, at 2,300 meters".


The journal that ASW abbreviates as "Bull. U.S. Natl. Mus." is in the BHL, which gives its title as "Bulletin - United States National Museum.". How do I link these two records? In my bioGUID OpenURL project I've been doing things like using SQL LIKE statements with periods (.) replaced by wildcards ('%') to find journal titles that match abbreviations (as well as building a database of these abbreviations). But this is error prone, and won't work for abbreviations such as "Bull. U.S. Natl. Mus." because the word "National" has been abbreviated to "Natl", which isn't a substring of "National".

After exploring various methods (including longest common subsequences, and sequence alignment algorithms) I came across a MySQL plugin for n-grams. The plugin tokenises strings into bi-grams (tokens with just two characters, see the Wikipedia page on N-grams for more information). This means that even though as words "National" and "Natl" are different, they will have some similarity due to the shared bi-grams "Na" and "at".

So, I grabbed the source for the plugin and the ICU dependency, compiled the plugin and added it to MySQL (I'm running MySQL 5.1.34 on Mac OS X 10.5.8). The plugin can be added while the MySQL server is running using this SQL command:

INSTALL PLUGIN bigram SONAME 'libftbigram.so';

Initial experiments seem promising. For the bhl_title table I created a bi-gram index:

ALTER TABLE `bhl_title` ADD FULLTEXT (`ShortTitle`) WITH PARSER bigram;

If I then take the abbreviation "Bull. U.S. Natl. Mus.", strip out the punctuation, and search for the resulting string ("Bull US Natl Mus")

SELECT TitleID, ShortTitle, MATCH(ShortTitle) AGAINST('Bull U S Natl Museum')
AS score FROM bhl_title
WHERE MATCH(ShortTitle) AGAINST('Bull U S Natl Museum') LIMIT 5;

I get this:
TitleIDShortTitle score
7548Bulletin - United States National Museum. 19.4019603729248
13855Bulletin du Muséum National d'Histoire Naturelle. 17.6493873596191
14964Bulletin du Muséum National d'Histoire Naturelle. 17.6493873596191
5943Bulletin du Muséum national d'histoire naturelle. 17.6493873596191
12908Bulletin du Muséum National d'Histoire Naturelle. 17.6493873596191


The journal we want is the top hit (if only just). I'll probably have to do some post-processing to check that the top hit makes sense (e.g., is it a supersequence of the search term?) but this looks like a promising way to match abbreviated journal names and book titles to records in BHL (and other databases).