Ruminations of a Programmer: May 2009

Sunday, May 17, 2009

scouchdb gets View Server in Scala

CouchDB views are the real wings of the datastore: they go into every document and pull out exactly the data you have asked for through your queries. The queries are different from the ones you write in an RDBMS using SQL - here map/reduce is exercised across each of the cores that your server may have. One very good part of views in CouchDB is that the view server is a separate abstraction from the data store. Computation of views is delegated to an external server process that communicates with the main process over standard input/output using a simple line-based protocol. You can find more details about this protocol in the CouchDB wiki.
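To make the line-based protocol a little more concrete, here is a minimal sketch of the request/response loop. It is not scouchdb's code: the object name, the regex-based command detection and the error payload are all assumptions made for illustration, and the registered functions are never evaluated (every map result is an empty list). It only shows the shape of the exchange for the reset, add_fun and map_doc commands, each arriving as one JSON array per line.

```scala
// Hypothetical sketch of a line-based query-server loop; not scouchdb's actual code.
object ViewServerSketch {
  // Source text of the functions registered so far via "add_fun".
  private var funs: Vector[String] = Vector.empty

  // Extracts the command name from a request line such as ["add_fun", "..."].
  private val Command = """^\s*\[\s*"(\w+)".*""".r

  /** Returns the single response line for one request line. */
  def respond(line: String): String = line match {
    case Command("reset") =>
      funs = Vector.empty
      "true"
    case Command("add_fun") =>
      funs = funs :+ line // a real server would compile the function source here
      "true"
    case Command("map_doc") =>
      // One list of [key, value] pairs per registered function (always empty in this sketch).
      funs.map(_ => "[]").mkString("[", ",", "]")
    case _ =>
      // Assumed error shape, for illustration only.
      """{"error":"unknown_command","reason":"unsupported request"}"""
  }

  def main(args: Array[String]): Unit =
    for (line <- scala.io.Source.stdin.getLines()) {
      println(respond(line))
      Console.out.flush()
    }
}
```

Feeding it a reset line, one add_fun line and one map_doc line yields true, true and a list holding a single empty result, one slot per registered function.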

The default implementation of the query server in CouchDB uses JavaScript running via Mozilla SpiderMonkey. However, language aficionados always find a way to push their own favorite into any accessible option. People have developed query servers for Ruby, PHP, Python and Common Lisp.

scouchdb now provides one for Scala. You can write map and reduce scripts for CouchDB views in Scala .. the reduce part is not yet ready, but the map functions in the repository do work. Here is a typical session using ScalaTest ..

// create some records in the store
couch(test doc Js("""{"item":"banana","prices":{"Fresh Mart":1.99,"Price Max":0.79,"Banana Montana":4.22}}"""))
couch(test doc Js("""{"item":"apple","prices":{"Fresh Mart":1.59,"Price Max":5.99,"Apples Express":0.79}}"""))
couch(test doc Js("""{"item":"orange","prices":{"Fresh Mart":1.99,"Price Max":3.19,"Citrus Circus":1.09}}"""))

// create a design document
val d = DesignDocument("power", null, Map[String, View]())
d.language = "scala"

// a sample map function in Scala
val mapfn1 = 
  """(doc: dispatch.json.JsValue) => {
    val it = couch.json.JsBean.toBean(doc, classOf[couch.json.TestBeans.Item_1])._3; 
    for (st <- it.prices)
      yield(List(it.item, st._2))
  }"""
    
// another map function
val mapfn2 = """(doc: dispatch.json.JsValue) => {
    import dispatch.json.Js._; 
    val x = Symbol("item") ? dispatch.json.Js.str;
    val x(x_) = doc; 
    val i = Symbol("_id") ? dispatch.json.Js.str;
    val i(i_) = doc;
    List(List(i_, x_)) ;
  }"""

The way the protocol works is that once the view functions are registered with the view server, CouchDB starts sending the documents one by one, and every function gets invoked on every document. So once we create a design document and attach views with the above map functions, the view server starts processing the documents over the line-based protocol with the main server. And if we invoke the views using the scouchdb API as ..

couch(test view(
  Views builder("power/power_lunch") build))

and

couch(test view(
  Views builder("power/mega_lunch") build))

we get back the results based on the queries defined in the map functions. Have a look at the project home page for a complete description of the sample session that works with Scala view functions.
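The "every function on every document" behaviour can be simulated with plain Scala collections. The sketch below is only an illustration: documents are modelled as Map[String, Any] rather than dispatch's JsValue, and the names MapPhaseSketch, itemPrices and idAndItem are made up here. The two functions emit roughly what the two map functions above emit.

```scala
// Illustrative simulation of the map phase; documents are plain Maps, not JsValue.
object MapPhaseSketch {
  type Doc = Map[String, Any]
  type MapFn = Doc => List[(Any, Any)]

  // Emits (item, price) for every store, in the spirit of the first map function.
  val itemPrices: MapFn = doc =>
    (doc.get("item"), doc.get("prices")) match {
      case (Some(item), Some(prices: Map[_, _])) =>
        prices.values.toList.map(price => (item, price))
      case _ => Nil
    }

  // Emits (_id, item), in the spirit of the second map function.
  val idAndItem: MapFn = doc =>
    (for (id <- doc.get("_id"); item <- doc.get("item")) yield List((id, item)))
      .getOrElse(Nil)

  // Every function sees every document; the result has one row list per function.
  def run(docs: List[Doc], fns: List[MapFn]): List[List[(Any, Any)]] =
    fns.map(fn => docs.flatMap(fn))
}
```

Calling MapPhaseSketch.run with a list of documents and both functions returns two row lists, one per view, which is the same per-function grouping the view server reports back to CouchDB.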

Setting up the View Server

The view server is an external program that communicates with the CouchDB server. To set up the scouchdb query server, follow these steps:

The usual place for custom CouchDB settings is local.ini, which can usually be found under the /usr/local/etc/couchdb folder. There have been some changes in the configuration files since CouchDB 0.9 - check out the wiki for them. On my system, I set the view server path as follows in local.ini ..

[query_servers]
scala=$SCALA_HOME/bin/scala -classpath <path to scouchdb jar> couch.db.VS "/tmp/vs.txt"

scala is the language name under which the query server is registered with CouchDB. Once you start Futon after registering scala as the language, you should be able to see "scala" listed as a view query language for writing map functions.

The classpath points to the jar where you deploy scouchdb.

couch.db.VS is the main program that interacts with the CouchDB server. Currently it takes one argument, the name of a file to which it logs all statements that it exchanges with the CouchDB server. If the argument is not supplied, all interactions are routed to stderr.

Another change that I needed to make was to the os_process_timeout value. The default is 5000 milliseconds (5 seconds). I made the following change in local.ini ..

[couchdb]
os_process_timeout=20000

Another thing that needs to be set up is an environment variable named CDB_VIEW_CLASSPATH. This should point to the classpath which needs to be passed to the Scala interpreter for executing the map/reduce functions.
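Since a wrong classpath only shows up later as a failing view, a small sanity check before starting CouchDB can save some head scratching. The following is a hypothetical helper, not part of scouchdb: it reads CDB_VIEW_CLASSPATH, splits it on the platform path separator and reports entries that do not exist on disk.

```scala
import java.io.File

// Hypothetical helper, not part of scouchdb: checks the entries of CDB_VIEW_CLASSPATH.
object ClasspathCheck {
  /** Splits a classpath string and returns the entries that do not exist on disk. */
  def missingEntries(classpath: String): List[String] =
    classpath.split(File.pathSeparator).toList
      .filter(_.nonEmpty)
      .filterNot(p => new File(p).exists())

  def main(args: Array[String]): Unit =
    sys.env.get("CDB_VIEW_CLASSPATH") match {
      case None => println("CDB_VIEW_CLASSPATH is not set")
      case Some(cp) =>
        val missing = missingEntries(cp)
        if (missing.isEmpty) println("all classpath entries exist")
        else missing.foreach(m => println("missing: " + m))
    }
}
```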

You've been warned!

All the above stuff is very much development in progress and has been tested only to the limits of the unit test suites recorded in the codebase. Use at your own risk, and please, please send feedback, patches, bug reports etc. to the project tracker.

Happy hacking!

P.S. Over the weekend I got a patch from Martin Kleppmann that adds the ability to store the type name of an object in the JSON blob when it is serialized (either as fully-qualified class name or as base name without the package component), and to automatically create a bean of the right type when that JSON blob is loaded from the database (without advance knowledge of what that type is going to be). Thanks Martin - I will have a look and integrate it in the trunk.

I have undertaken this as a side project and only get to work on it over the weekends. It is great to have patches contributed by the community, which only go on to enrich the framework. I need to work on the reduce part of the query server and will then launch into a major refactoring to incorporate the 0.3 release of Nathan's dbDispatch. Nathan has made some fruitful changes to exception and response-code handling. I am itching to incorporate that goodness into scouchdb.

Monday, May 11, 2009

CouchDB and Scala - Updates on scouchdb

A couple of posts back, I introduced scouchdb, the Scala driver for CouchDB persistence. The primary goal of the framework is to offer non-intrusiveness in persistence, in the sense that the Scala objects can be absolutely oblivious to the underlying CouchDB existence. The last post discussed how Scala objects can be added, updated or deleted from CouchDB with the underlying JSON representation carefully veneered away from client APIs. Here is an example of the fetch API in scouchdb ..

val sh = couch(test by_id(s_id, classOf[Shop]))

The document is fetched as an instance of the Scala class Shop, which can then be manipulated using usual Scala machinery. The return type is a Tuple3, where the first two components are the id and revision that may be useful for doing future updates of the document, while sh._3 is the object retrieved from the data store. Returning tuples from a method is a typical Scala idiom that can give rise to some nice pattern matching code capsules ..

couch(test by_id(s_id, classOf[Shop])) match {
  case (id, rev, obj) =>
    //..
  //..
}

The last post also discussed the View APIs and the little builder syntax for View queries.

Over the weekend, scouchdb got some more features, hence a brief post introducing the new additions ..

Temporary Views

No frills here: temporary views share the same builder interface as ordinary views, with the addition of specifying the map and reduce functions. Here is the spec for querying temporary views ..

describe("fetch from temporary views") {
  it("should fetch 3 rows with group option and 1 row without group option") {
    val mf = 
      """function(doc) {
           var store, price;
           if (doc.item && doc.prices) {
             for (store in doc.prices) {
               price = doc.prices[store];
               emit(doc.item, price);
             }
           }
         }"""
      
    val rf = 
      """function(key, values, rereduce) {
           return(sum(values))
         }"""
      
    // with grouping
    val aq = 
      Views.adhocBuilder(View(mf, rf))
           .options(optionBuilder group(true) build)
           .build
    val s = couch(
      test adhocView(aq))
    s.size should equal(3)
      
    // without grouping
    val aq_1 = 
      Views.adhocBuilder(View(mf, rf))
           .build
    val s_1 = couch(
      test adhocView(aq_1))
    s_1.size should equal(1)
  }
}
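Why 3 rows with grouping and 1 without? With the group option the reduce function is applied once per distinct key, while without it the reduce runs over all emitted rows and produces a single value. A rough simulation with Scala collections, using made-up sample rows (not the contents of the test database) and hypothetical names:

```scala
// Illustrative simulation of reduce with and without grouping; the rows are made-up sample data.
object GroupReduceSketch {
  // (key, value) pairs as a map function like the one above would emit them.
  val emitted: List[(String, Double)] = List(
    "banana" -> 1.99, "banana" -> 0.79,
    "apple" -> 1.59, "apple" -> 5.99,
    "orange" -> 1.09)

  // group=true: the reduce (a sum here) is applied once per distinct key.
  def grouped(rows: List[(String, Double)]): Map[String, Double] =
    rows.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2).sum }

  // No grouping: the reduce is applied across all rows and yields a single value.
  def ungrouped(rows: List[(String, Double)]): Double =
    rows.map(_._2).sum
}
```

With three distinct items, grouped returns three entries, while ungrouped collapses everything into one total.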

Attachment Handling

With each document, CouchDB allows attachments, much like emails. Along with creating a document, I can have a separate attachment associated with it. However, when the document is retrieved, the attachment is not fetched by default; it has to be fetched using a special URI. All of this is now encapsulated in Scala APIs in scouchdb. Have a look at the following spec ..

describe("create a document and make an attachment") {
  val att = "The quick brown fox jumps over the lazy dog."
    
  val s = Shop("Sears", "refrigerator", 12500)
  val d = Doc(test, "sears")
  var ir:(String, String) = null
  var ii:(String, String) = null
    
  it("document creation should be successful") {
    couch(d add s)
    ir = couch(d ># %(Id._id, Id._rev))
    ir._1 should equal("sears")
  }
  it("query by id should fetch a row") {
    ii = couch(test by_id ir._1)
    ii._1 should equal("sears")
  }
  it("sticking an attachment should be successful") {
    couch(d attach("foo", "text/plain", att.getBytes, Some(ii._2)))
  }
  it("retrieving the attachment should equal to att") {
    val air = couch(d ># %(Id._id, Id._rev))
    air._1 should equal("sears")
    couch(d.getAttachment("foo") as_str) should equal(att)
  }
}

CouchDB also allows adding attachments to documents that do not yet exist; adding the attachment creates the document. scouchdb supports that as well. Have a look at the BDD specs in the test folder for details of the usage.
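For reference, the "special URI" of an attachment is simply the document path with the attachment name appended, and an upload to an existing document carries the current revision as a query parameter. A tiny sketch of that layout, with a made-up helper name, a made-up revision string in the usage note, and the assumption that database, document and attachment names are already URL-safe:

```scala
// Hypothetical helper showing the attachment URI layout; assumes URL-safe names (no escaping done).
object AttachmentUriSketch {
  def attachmentPath(db: String, docId: String, name: String, rev: Option[String] = None): String = {
    val base = "/" + db + "/" + docId + "/" + name
    // The revision is only needed when the attachment is added to an existing document.
    rev.fold(base)(r => base + "?rev=" + r)
  }
}
```

For the spec above, AttachmentUriSketch.attachmentPath("test", "sears", "foo") gives /test/sears/foo, and passing Some("1-abc") as the revision appends ?rev=1-abc.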

Bulk Documents

CouchDB has separate REST interfaces for handling editing of multiple documents at the same time. I can have multiple documents, some of which need to be added as new, some to be updated with specific revision information and some to be deleted from the existing database. And all these can be done using a single POST. scouchdb uses a small DSL for handling such requests. Here is how ..

describe("bulk updates of documents") {
  it("should create 4 documents with 1 post") {
    val cnt = couch(test all_docs).filterNot(_.startsWith("_design")).size
      
    val s1 = Shop("cc", "refrigerator", 12500)
    val s2 = Shop("best buy", "macpro", 1500)
    val a1 = Address("Survey Park", "Kolkata", "700075")
    val a2 = Address("Salt Lake", "Kolkata", "700091")
      
    couch(test docs(List(s1, s2, a1, a2), false)).size should equal(4)
    couch(test all_docs).filterNot(_.startsWith("_design")).size should equal(cnt + 4)
  }
  it("should insert 2 new documents, update 1 existing document and delete 1 - all in 1 post") {
    val sz = couch(test all_docs).filterNot(_.startsWith("_design")).size
    val s = Shop("Shoppers Stop", "refrigerator", 12500)
    val d = Doc(test, "ss")
      
    val t = Address("Monroe Street", "Denver, CO", "987651")
    val ad = Doc(test, "add1")
      
    var ir:(String, String) = null
    var ir1:(String, String) = null
    
    couch(d add s)
    ir = couch(d ># %(Id._id, Id._rev))
    ir._1 should equal("ss")
      
    couch(ad add t)
    ir1 = couch(ad ># %(Id._id, Id._rev))
    ir1._1 should equal("add1")
      
    val s1 = Shop("cc", "refrigerator", 12500)
    val s2 = Shop("best buy", "macpro", 1500)
    val a1 = Address("Survey Park", "Kolkata", "700075")
      
    val d1 = bulkBuilder(Some(s1)).id("a").build 
    val d2 = bulkBuilder(Some(s2)).id("b").build
    val d3 = bulkBuilder(Some(s)).id("ss").rev(ir._2).build
    val d4 = bulkBuilder(None).id("add1").rev(ir1._2).deleted(true).build

    couch(test bulkDocs(List(d1, d2, d3, d4), false)).size should equal(4)
    couch(test all_docs).filterNot(_.startsWith("_design")).size should equal(sz + 3)
  }
}

As the specs above show, there are 2 levels of APIs for bulk updates. scouchdb already has an API for creating a document from a Scala object with auto id generation :

def doc[T <: AnyRef](obj: T) = { //..

As an extension, I introduce the following, which lets users add multiple new documents through a single API. Note that here all of the documents are added as new ..

def docs(objs: List[_ <: AnyRef], allOrNothing: Boolean) = { //..

and the objects can be of any type, not necessarily the same. This is illustrated in the first of the 2 specs above.

But in case you need the full feature set of bulk uploads and editing of multiple documents, I offer a builder-based interface, which is illustrated in the second spec above. Here 2 new documents are added, 1 is updated and 1 is deleted, all through a single API call.
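Under the covers, a mixed bulk request boils down to one JSON body with a docs array, where an update carries _id and _rev and a delete additionally carries _deleted set to true. The sketch below builds such a body by hand. It is not how scouchdb's bulkBuilder is implemented; the Op types are invented for illustration, field values are expected as already-rendered JSON text, and no string escaping is done.

```scala
// Hypothetical sketch of the JSON body behind a bulk request; not scouchdb's bulkBuilder.
object BulkDocsSketch {
  sealed trait Op
  // `fields` holds member names with values that are already valid JSON text.
  case class Insert(id: String, fields: List[(String, String)]) extends Op
  case class Update(id: String, rev: String, fields: List[(String, String)]) extends Op
  case class Delete(id: String, rev: String) extends Op

  // Quotes a string; assumes it contains no characters that need JSON escaping.
  private def q(s: String): String = "\"" + s + "\""

  private def obj(members: List[(String, String)]): String =
    members.map { case (k, v) => q(k) + ":" + v }.mkString("{", ",", "}")

  def render(op: Op): String = op match {
    case Insert(id, fields)      => obj(("_id" -> q(id)) :: fields)
    case Update(id, rev, fields) => obj(("_id" -> q(id)) :: ("_rev" -> q(rev)) :: fields)
    case Delete(id, rev)         => obj(List("_id" -> q(id), "_rev" -> q(rev), "_deleted" -> "true"))
  }

  /** The whole request body: a single object wrapping all operations. */
  def payload(ops: List[Op]): String =
    """{"docs":[""" + ops.map(render).mkString(",") + "]}"
}
```

A list holding two Insert values, one Update and one Delete corresponds to the four documents d1 to d4 of the second spec, sent in one POST.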

In case you are doing CouchDB and Scala stuff, give scouchdb a spin and post your feedback in the comments. I am yet to write a meaningful application using scouchdb - any feedback will be immensely helpful.

Tuesday, May 05, 2009

Scala and Lift - Functional Recipes for the Web

As part of his IEEE Internet Computing columns, Steve Vinoski will be evangelizing The Functional Web, covering the application of functional programming languages and techniques to the world of web development. He had set up the background nicely in the March/April issue of the column.

The first topical column on the same subject is out with the May/June issue. I am deeply honored to coauthor with Steve in exploring Scala, the statically typed OO-functional language and Lift, the Web development framework based on the functional features of Scala.

The column highlights the functional programming features that Scala implements viz. immutable data structures, higher order functions and closures, pattern matching over abstract data types, alongside an advanced static typing model. Scala runs on the Java Virtual Machine and offers a synergistic interoperability with the rich ecosystem of frameworks and libraries that the Java world embodies. Lift is the Web MVC framework built on top of the richness of the Scala functional model.

In the coming issues of the Internet Computing column, Steve has an exciting set of recipes for all the FP enthusiasts planning to implement their next Web application on the goodness of functional programming .. keep an eye on his blog for more details ..

Meanwhile here is the pointer to the PDF version of the current column on Scala and Lift .. enjoy!

Sunday, May 03, 2009

Hacking with Scala and CouchDB

I have been hacking with CouchDB and Scala over the last couple of weekends as a part-time project. CouchDB is a REST-based document store, implemented in Erlang, that steams with the force of the map/reduce paradigm. Objects are stored as JSON documents in CouchDB, in a format which is far too disruptive for a community so indoctrinated in the constraints of the relational database paradigm. This is not to predict the demise of the relational world - the use cases of CouchDB are somewhat orthogonal, but it fits like a glove in cases where we have so long been trying to force-fit the fangs of SQL backed by a heavily normalized schema.

I wanted some of my Scala objects to reside in a CouchDB database. It should look like a normal persistence API, and the primary precondition was one of non-invasiveness. I do not want my Scala objects to be CouchDB aware. Incorporating CouchDB-specific attributes like _id and _rev takes away a lot of reusability goodness from domain objects and ties them to the specific platform.

Suppose I have a Scala class used to record item prices in various stores ..

case class ItemPrice(store: String, item: String, price: Number)

and I would like to store it in CouchDB through an API that converts it to JSON under the covers and issues a PUT/POST to the CouchDB server.

Here is a sample session that does this for a local CouchDB server running on localhost and port 5984 ..

// specification of the db server running
val couch = Couch("127.0.0.1")
val test = Db("test_db")

// create the database
couch(test create)

// create the Scala object
val s = ItemPrice("Best Buy", "mac book pro", 3000)

// create a document for the database with an id
val doc = Doc(test, "best_buy")

// add
couch(doc add s)

// query by id to get the id and revision of the document
val id_rev = couch(test by_id "best_buy")

// query by id to get back the object
// returns a tuple3 of (id, rev, object)
val sh = couch(test by_id("best_buy", classOf[ItemPrice]))

// got back the original object
sh._3.item should equal(s.item)
sh._3.price should equal(s.price)

Suppose the price of a mac book pro has changed in Best Buy and I get a new ItemPrice. I need to update the document that I have in CouchDB with the new ItemPrice object. For updates, I need to pass in the original revision that I would like to update ..

val new_itemPrice = //..
couch(doc update(new_itemPrice, sh._2))
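The reason the revision has to be passed in is optimistic concurrency: CouchDB accepts an update only if it names the document's current revision, and answers with a conflict otherwise. Here is a toy in-memory model of that rule. It is only a sketch, with integer revisions instead of CouchDB's revision strings and made-up names throughout.

```scala
// Toy model of revision-checked updates; revisions are Ints here, unlike CouchDB's strings.
object RevisionSketch {
  final case class Stored(rev: Int, body: String)

  /** Right(updated store) if `rev` is current, Left(reason) otherwise. */
  def update(
      store: Map[String, Stored],
      id: String,
      rev: Int,
      body: String): Either[String, Map[String, Stored]] =
    store.get(id) match {
      case Some(Stored(current, _)) if current == rev =>
        Right(store.updated(id, Stored(current + 1, body)))
      case Some(_) => Left("conflict")   // someone else updated the document first
      case None    => Left("not_found")
    }
}
```

An update that names the current revision succeeds and bumps it; replaying the same update with the old revision is rejected, which is why the fresh revision has to be fetched (or remembered) before every update.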

The Scala client is at a very early stage. All the above stuff works now; a lot more has been planned and is on the roadmap. The main focus has been on the non-intrusiveness of the framework, so that the Scala objects remain pure and can be used freely in other contexts of the application. The library uses the goodness of Nathan Hamblen's dispatch library, which provides elegant Scala wrappers over the Apache Commons HttpClient and a great JSON parser with a set of extractors.

Very often we need to have different property names in the JSON document than what is present in the Scala object. Sometimes we may also want to filter out some properties while persisting in the data store. The framework uses annotations to achieve these functionalities (much like the ones used by jcouchdb, the Java client of CouchDB) ..

case class Trade(
  @JSONProperty("Reference No")
  val ref: String,

  @JSONProperty("Instrument"){val ignoreIfNull = true}
  val ins: Instrument,
  val amount: Number
)

When an instance of this class is serialized to JSON and stored in CouchDB, the properties are renamed as specified by the annotations. Selective filtering is also possible through additional annotation properties, as shown above.

Handling aggregate data members for JSON serialization is tricky, since erasure takes away information of the underlying types contained in the aggregates. e.g.

case class Person(
  lastName: String,
  firstName: String,

  @JSONTypeHint(classOf[Address])
  addresses: List[Address]
)

Using the annotation makes it possible to get the proper types during runtime and generate the proper serialization format.
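A quick way to see the erasure problem is to compare the runtime classes of two lists with different element types: they are identical, so the list itself cannot tell a deserializer what to build. An explicit class token, which is what a type hint carries, restores that information. The snippet is a self-contained illustration with made-up names and is unrelated to scouchdb's internals.

```scala
// Illustration of type erasure on collections; names are made up for this sketch.
object ErasureSketch {
  case class Address(street: String, city: String)

  val addresses: List[Address] = List(Address("Survey Park", "Kolkata"))
  val names: List[String] = List("Kolkata")

  // Both lists have the same runtime class; the element type is not recorded in it.
  val sameRuntimeClass: Boolean = addresses.getClass == names.getClass

  // A class token, like the one passed to a type hint, keeps the element type available at runtime.
  val hint: Class[Address] = classOf[Address]
  val hintMatchesAddress: Boolean = hint.isInstance(addresses.head)
  val hintMatchesString: Boolean = hint.isInstance(names.head)
}
```

Here sameRuntimeClass is true, while the hint accepts the Address element and rejects the String.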

One of the biggest hits of CouchDB is the view engine, which uses the power of MapReduce to fetch data for users. The current version of the framework does not offer much in terms of view creation apart from basic abstractions that allow plugging "map" and "reduce" functions written in JavaScript into the design document. There are some plans to make this more Scala-ish, with little languages that will enable map and reduce function generation from Scala objects.

But what it offers today is a small DSL that enables building up view queries along with the sea of options that CouchDB server offers ..

// fetches records from the view named least_cost_lunch
couch(test view(Views.builder("lunch/least_cost_lunch").build))

// fetches records from the view named least_cost_lunch 
// using key and limit options
couch(test view(
  Views.builder("lunch/least_cost_lunch")
       .options(optionBuilder key(List("apple", 0.79)) limit(10) build)
       .build))

// fetches records from the view named least_cost_lunch 
// using specific keys and other options for deciding output filters
couch(test view(
  Views.builder("lunch/least_cost_lunch")
       .options(optionBuilder descending(true) limit(10) build)
       .keys(List(List("apple", 0.79), List("banana", 0.79)))
       .build))
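What such a builder ultimately has to produce is a view URI with the options as query parameters, each value being JSON text that is URL-encoded. The following sketch shows that last step only. It is not the scouchdb builder; the helper names are invented, and the _design/.../_view/... path layout is an assumption based on CouchDB 0.9 and later.

```scala
import java.net.URLEncoder

// Hypothetical sketch of turning view options into a request path; not the scouchdb builder.
object ViewQuerySketch {
  // Option values are given as already-rendered JSON text, e.g. ["apple",0.79] or 10.
  def queryString(options: List[(String, String)]): String =
    options.map { case (k, v) => k + "=" + URLEncoder.encode(v, "UTF-8") }.mkString("&")

  // Assumed path layout: /db/_design/<design>/_view/<view>
  def viewPath(db: String, design: String, view: String, options: List[(String, String)]): String = {
    val base = "/" + db + "/_design/" + design + "/_view/" + view
    if (options.isEmpty) base else base + "?" + queryString(options)
  }
}
```

For example, the options descending=true and limit=10 on lunch/least_cost_lunch become /test/_design/lunch/_view/least_cost_lunch?descending=true&limit=10, and a composite key has its brackets, quotes and commas percent-encoded.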

Reflection warts!

The framework is based on introspecting Scala objects for serialization and de-serialization. This brings in some of the usual warts, like requiring default constructors for the class. This does not mean that the properties need to be mutable; the default constructor is only used so that the framework can set the properties through reflection after a newInstance(). I am still thinking of ways to get around this. I need to look at some third-party frameworks that do bytecode instrumentation to preserve constructor parameter names .. but I guess this can wait .. and having the default constructor is not necessarily a constraint so long as it does not invade the immutability guarantees of the abstraction with public setters.
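The general technique looks roughly like the sketch below: create the object through its no-argument constructor, then write the backing fields by name. This is a generic illustration and not scouchdb's code. It assumes the JVM permits reflective writes to final instance fields once setAccessible(true) has succeeded, which newer JDKs may warn about or restrict.

```scala
// Generic illustration of newInstance-then-set-fields; not scouchdb's actual implementation.
object ReflectionSketch {
  // The no-argument constructor exists only so that reflection can create the object;
  // the properties themselves stay immutable vals with no public setters.
  class Shop(val store: String, val item: String) {
    def this() = this(null, null)
  }

  /** Creates an instance via the default constructor, then fills fields by name. */
  def populate[T](cls: Class[T], values: Map[String, AnyRef]): T = {
    val instance = cls.getDeclaredConstructor().newInstance()
    for ((name, value) <- values) {
      val field = cls.getDeclaredField(name)
      field.setAccessible(true) // the backing fields of vals are private (assumption: write is permitted)
      field.set(instance, value)
    }
    instance
  }
}
```

Populating a Shop from a map of store and item values gives back an object whose public interface is still read-only.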

Test It Early

The framework is very much a work in progress, as things are for a typical side project. It does not yet handle a lot of things like attachments, compaction, bulk document creation etc. I have been working on some of these and they will hopefully see the light of day pretty soon.

The current snapshot of the source code is available in the Google Code repository for scouchdb. No formal release has been made so far. However, there is a test suite that accompanies the project. It is not a unit test suite per se, in the sense that it actually requires a CouchDB server running on localhost on port 5984. Still, the intention is to give an idea of the API set that it exposes today. It is very much pre-alpha, with no API compatibility guarantees as it evolves.

Have fun .. and let me know your feedback on the API ..