Monday, June 28, 2010

A Phase Shift for the ORM

I came to know about Squealer from one of Dean's tweets last weekend. The project's git repo README makes a very succinct point about the role that relational mappers will be playing in the days to come. It says "... ORMs had it the wrong way around: that the application should be persisting its data in a manner natural to it, and that external systems (like reporting and decision support systems - or even numbskull integration at the persistence layer) should bear the cost of mapping."

I have expressed similar observations in the past, when I talked about the rationale of modeling data close to the way applications will be using them. I talk about this same architecture in an upcoming IEEE Software Multi-Paradigm Programming Special Issue for Sep/Oct 2010.

In most of the applications that churn out domain models in an object oriented language and persist data in a relational store, we use the ORM layer as follows :


It sits between the domain model and the relational database and provides an isolation layer between the two, at the expense of an intrusive framework presence within an otherwise uncomplicated application architecture.

The scenario changes if we allow the application to manipulate and persist data in the same form that it uses for modeling its domain. My email application needs an address book as a composite object, not one torn apart into multiple relational tables in the name of normalization. It will be better if I can program the domain model to access and persist all its entities in a document store that gives it the same flexibility. Then the application layer does not have to deal with the translation between the two data models that adds a significant layer of complexity today. The normal online application flow doesn't need an ORM.

How does the translation layer get shifted?

Consider an example application that uses Terrastore as the persistent storage for implementing online domain functionality. Terrastore is a document database with some similarities to CouchDB, MongoDB and Riak, in the sense that data is stored in the form of JSON documents. However, unlike the others, it emphasizes the "C" component of the CAP theorem and provides advanced scalability and elasticity features without sacrificing consistency.

Terrastore is built on top of Terracotta, which is itself an ACID based object database that offers storage of data larger than your RAM size clustered across multiple machines. Terrastore uses the storage and clustering capabilities of Terracotta and adds more advanced features like partitioning, data manipulation and querying through client APIs.

As long as you are using Terrastore as the persistent database for your domain model, your data layer is at the same level of abstraction as your domain layer. You manipulate objects in memory and store them in Terrastore in the same format. Terrastore, like most other NoSQL stores, is schemaless and offers a collection-based interface for storing JSON documents. No impedance mismatch to handle so far between the application and the data model.

Have a look at the following figure where there's no additional framework sitting between the application model and the data storage (Terrastore).



However, there are many reasons why you might still want a relational database as an underlying store. Requirements like ad hoc reporting, building decision support systems or data warehouses are some of the areas that are best served by relational engines or the extensions that relational vendors offer. Such applications are not real time and can very well be served out of a snapshot of data that has a temporal lag from the online version.

Every NoSQL store, and many SQL ones as well, offers commit handlers for publishing asynchronous jobs. In Terrastore you can write custom event handlers that publish information from the document store to an underlying relational store. It's as simple as implementing the terrastore.event.EventListener interface. This is well illustrated in the above figure. Translation to the relational model takes place here, one level down the stack in your application architecture. The Terrastore event handlers are enqueued synchronously in FIFO order but execute asynchronously, which is exactly what you need to scale out your application.
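As a rough sketch of the idea - the listener interface, the document shape and the reporting table below are simplified stand-ins, not the actual terrastore.event.EventListener signature - such a handler would receive the changed document and shovel it into the relational store, here over plain JDBC ..

import java.sql.DriverManager

// A simplified stand-in for a Terrastore-style event listener; the real
// terrastore.event.EventListener has its own signature. The point is only
// to show where the object-to-relational translation now lives.
trait DocumentChangeListener {
  def onValueChanged(bucket: String, key: String, doc: Map[String, Any]): Unit
}

// hypothetical reporting table and document fields
class OrderReportingListener(jdbcUrl: String) extends DocumentChangeListener {
  def onValueChanged(bucket: String, key: String, doc: Map[String, Any]): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      // the relational mapping happens here, downstream of the online flow
      val stmt = conn.prepareStatement(
        "insert into order_report(order_id, customer, amount) values (?, ?, ?)")
      stmt.setString(1, key)
      stmt.setString(2, doc("customer").toString)
      stmt.setBigDecimal(3, new java.math.BigDecimal(doc("amount").toString))
      stmt.executeUpdate()
      stmt.close()
    } finally conn.close()
  }
}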

I took up Terrastore just as an example - you can do most of this (some of it differently) with other NoSQL stores acting as the frontal persistent store of your application. In real-life usage, choose the store that best maps to the needs of your domain model. It can be a document store, a generic key/value store or a graph store.

Squealer, the example with which I started the post, provides a simple, declarative Ruby DSL for mapping values from document trees into relations. It was built to serve exactly this use case, with MongoDB as the frontal store of the application and MySQL providing the system of record as the underlying relational store. In this model too we see a shift of the translation layer from the main online application functionality to a more downstream component of the application architecture stack. As the documentation says, "It can be used both in bulk operations on many documents (e.g. a periodic batch job) or executed for one document asynchronously as part of an after_save method (e.g. via a Resque job). It is possible that more work may done on this event-driven approach, perhaps in the form of a squealerd, to reduce latency". Nice!

All the above examples go on to show us that the translation layer, which has so long existed between the core application domain logic and the persistent data model, has undergone a phase shift. With many of today's NoSQL data stores allowing you to model your data closer to how the domain needs it, you can push the translation further downstream. But again, you can only do this for applications that fit this paradigm. For applications that need instant access to the relational store, this will not be a good fit. Use it wherever you deem it applicable - besides the simplicity that you gain in your application level programming model, you can also scale your application more easily when you have a scalable frontal database along with non-blocking asynchronous writes shoveling data into the relational model.

Sunday, June 20, 2010

Scala Implicits : Type Classes Here I Come

I was having some discussion on type classes in Scala with Daniel on Twitter the other day when I suddenly discovered one of my unfinished posts on the same topic. Not that there's anything new that you will get to know from this rant. But these are some of my thoughts on the value that type class based thinking adds to your design space. I started this post when I published my thoughts on orthogonality in design some time back.

Let's start with the GOF Adapter pattern .. the object adapter that uses the highly recommended composition technique to bind abstractions. The example is also the same one that I used for discussing orthogonality in design ..

case class Address(no: Int, street: String, city: String, 
  state: String, zip: String)


And we want to adapt this to the interface of a LabelMaker .. i.e. we would want to use Address as a LabelMaker ..

trait LabelMaker[T] {
  def toLabel(value: T): String
}


the adapter that does the interface transformation ..

// adapter class
case class AddressLabelMaker() extends LabelMaker[Address] {
  def toLabel(address: Address) = {
    import address._
    "%d %s, %s, %s - %s".format(no, street, city, state, zip)
  }
}

// the adapter provides the interface of the LabelMaker on an Address
AddressLabelMaker().toLabel(Address(100, "Monroe Street", "Denver", "CO", "80231"))


Now what's the incidental complexity that we introduce in the above design?

As a client, our point of focus has shifted to the adapter class AddressLabelMaker, which wraps our original abstraction, the Address instance. This leads to an identity crisis for the Address instance itself, a typical problem with any wrapper-based idiom. The identity of the adaptee is now lost within the identity of the adapter. And if you are brave enough to have the adaptee as an aggregate member of another object, then the entire composition of the adapter idiom obviously breaks down.

Conclusion: Object adapters don't compose. And the class variant of the adapter pattern is worse, in the sense that you have wiring by inheritance right from the start, which makes the entire design much more rigid and coupled.

Enter Type Class ..

In the above structure of the adapter pattern, the adaptee was wrapped into the adapter, leading to the identity loss of the adaptee that we discussed earlier. Let's take a different approach and see if we can abstract the adapter away from the client usage. The type class approach offers exactly this way of composing abstractions - the type system of the language selects the appropriate adapter instance at compile time, so that the user can program explicitly to the adaptee.

Consider this function printLabel, that takes an argument and generates a label for it using the LabelMaker that we supply to it ..

def printLabel[T](t: T)(lm: LabelMaker[T]) = lm.toLabel(t)

In order to make a label out of an Address, we need to define the adapter that does it. In Scala we have first class module support through the object syntax. Let's define a module that defines the semantics to convert an Address to a LabelMaker.

object LabelMaker {
  implicit object AddressLabelMaker extends LabelMaker[Address] {
    def toLabel(address: Address): String = {
      import address._
      "%d %s, %s, %s - %s".format(no, street, city, state, zip)
    }
  }
}


Note that we make the adapter a Scala object with an implicit qualifier. This means that the compiler will supply the implicit argument if it finds an appropriate instance within the lexical scope. But for that we need to make the LabelMaker argument to printLabel implicit as well.

def printLabel[T](t: T)(implicit lm: LabelMaker[T]) = lm.toLabel(t)

or using the Scala 2.8 context bound syntax, we can make the implicit argument unnamed ..

def printLabel[T: LabelMaker](t: T) = implicitly[LabelMaker[T]].toLabel(t)

We don't supply the implicit argument at all; instead we use the context bound syntax, and the compiler automatically picks up the appropriate instance from the enclosing lexical scope. In the above example the implicit object AddressLabelMaker needs to be in scope of the function where you call printLabel. The compiler will complain if it doesn't find one - hence you are alerted at compile time itself. Look ma - no evil run time failures ..

Now if we want to print a label for an Address, we do ..

printLabel(Address(100, "Monroe Street", "Denver", "CO", "80231"))

No incidental complexity in the client code in the form of adapter objects; the abstractions are explicit, and we only supply the object for which we want to print the label. We get rid of indirections as well. Not only that, if you look at the various abstractions that constitute the surface area of the entire design, the modularity speaks for itself. The client only programs against the type class definition, while the type class instance is nicely abstracted away *only* for the eyes of the compiler.

Let's see how Haskell does it ..

We have come this far discussing type classes without mentioning Haskell once. Let's make amends and see how the above design fits in the pure functional world of Haskell.

LabelMaker is the type class - let's define it in Haskell ..

class LabelMaker a where
    toLabel :: a -> String


This corresponds to our LabelMaker trait definition in Scala.

We want to make Address a LabelMaker. Here's an instance of the above type class for Address ..

instance LabelMaker Address where
    toLabel (Address h s c st z) = show(h) ++ " " ++ s ++ "," ++ c ++ "," ++ st ++ "-" ++ z


This corresponds to the implicit object AddressLabelMaker definition in Scala that we did above. Scala's first class support for executable modules is an added feature that we don't have in Haskell.

Oh BTW here's the Address definition in Haskell record syntax ..

type HouseNo = Int
type Street = String
type City = String
type State = String
type Zip = String

data Address = Address {
      houseNo        :: HouseNo
    , street         :: Street
    , city           :: City
    , state          :: State
    , zip            :: Zip
    }


And now we can define printLabel that uses the type class and generates the String for the label ..

printLabel :: (LabelMaker a) => a -> String
printLabel a = toLabel(a)


This corresponds to the Scala definition of printLabel that we have above.

A little observation ..

Note that one place where Haskell is much less verbose than Scala in the above implementation is the instance definition of the type class. But here the added verbosity in Scala is not without a purpose and offers a definite advantage over the corresponding Haskell definition. In Scala's case we name the instance explicitly as AddressLabelMaker, while the instance is unnamed in Haskell. The Haskell compiler looks up a qualifying instance in the global dictionary. In Scala's case the search is performed locally, in the scope of the method call that triggered it. And because it's an explicitly named instance in Scala, you can inject another instance into the scope and it will be picked up for the implicit. Consider the following example where we define another instance of the type class for Address that generates the label in a special way ..

object SpecialLabelMaker {
  implicit object AddressLabelMaker extends LabelMaker[Address] {
    def toLabel(address: Address): String = {
      import address._
      "[%d %s, %s, %s - %s]".format(no, street, city, state, zip)
    }
  }
}


We can choose to bring this special instance in scope instead of the default one and generate a label for our address in a very special way ..

import SpecialLabelMaker._
printLabel(Address(100, "Monroe Street", "Denver", "CO", "80231"))


You don't get this feature in Haskell.

Type classes in Scala and Haskell ..

Type classes define a set of contracts that the adaptee type needs to implement. Many people mistakenly equate type classes with interfaces in Java or other programming languages. With interfaces the focus is on subtype polymorphism; with type classes the focus shifts to ad hoc polymorphism: you implement the contracts that the type class publishes across completely unrelated types.
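To make that concrete, here's a small sketch - the Person type and its instance are hypothetical additions, not part of the example above - that reuses the same LabelMaker type class for a completely different abstraction; printLabel works for it even though Person and Address share no common supertype ..

// a hypothetical type, unrelated to Address, that we also want to use as a LabelMaker
case class Person(first: String, last: String)

object PersonLabels {
  implicit object PersonLabelMaker extends LabelMaker[Person] {
    def toLabel(p: Person): String = "%s, %s".format(p.last, p.first)
  }
}

import PersonLabels._

// the same printLabel works for both types
printLabel(Person("John", "Doe"))
printLabel(Address(100, "Monroe Street", "Denver", "CO", "80231"))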

A type class implementation has two aspects to it:
  1. defining the contracts that instances need to implement
  2. allowing selection of the appropriate instance based on the static type checking that the language offers
Haskell uses the class syntax to implement (1), though this class is a completely different concept from the one we associate with OO programming. In Scala we used a trait to implement this aspect, and an extension of the trait into an object to implement instances of it.

As I mentioned earlier, Haskell uses a global dictionary to implement (2), while Scala does it through a lookup in the enclosing scope of invocation. This gives an additional flexibility in Scala where you can select the instance by bringing it into the local scope.

Monday, June 14, 2010

Functional Data Structures in a Persistent Setting - Part 2: Okasaki's BankersQueue

In the last post we saw how Okasaki's BatchedQueue does not fit in the model of persistence. Okasaki showed that the data structure gives you an amortized O(1) for tail, but strictly in an ephemeral way, where we have only one logical future for a sequence of operations. Let's consider a hypothetical situation with the scenario that we left off with in the last post. The main reason the data structure failed in the context of persistence was that by the time it reached the realization stage of logical future #2, we were left with no credits to pay for it. But what happens if we can ensure that the credits for logical future #2 have already been paid for? That is, logical future #2 could reuse the computation that logical future #1 had already done before.


Enter Memoization and call-by-need. Implementing the functional queue in a lazy language that supports call-by-need (like Haskell) will enable us to reuse the computation that one logical future has already performed.

The result is the BankersQueue, which gives us amortized O(1) in a persistent setting. Here's the snippet that implements the data model ..

module BankersQueue (BankersQueue) where
  import Prelude hiding (head, tail)
  import Queue

  data BankersQueue a = BQ Int [a] Int [a]


The basic structure is the same as the BatchedQueue. In the implementation of the BatchedQueue in a strict language, we would use lists with strict evaluation semantics. The BankersQueue needs lazy lists and call-by-need semantics. In addition to the two lists, the BankersQueue also maintains the lengths of the lists as part of the data structure. This information will be used when we try to determine when to reverse the rear and append it to the front.

Lazy functional amortized data structures are not easy to analyze for time complexity. The main difficulty is the fact that operations don't get executed when they seem to. Instead we have suspensions or thunks that get created, only to be forced when actually needed. Okasaki's framework based on debits suggests that a suspension can be forced only if the debt assigned to it (proportional to the shared cost of the operation) has been paid off. I am not going into the details of explaining the banker's method framework. The basic idea is that you can enjoy your pie only if you have completely paid for it. There has to be a linear and sequential ordering between the following three stages in the lifecycle of a suspension.

Create a Suspension (1) => Payoff a Suspension (2) => Force a Suspension (3)

In the implementation of the BatchedQueue we did the reverse of the rear only when the front became empty. In the banker's method based on debits, doing so will not give us sufficient time to pay for the entire reverse operation. Remember the crucial thing is to ensure that the following sequence is honored:
      
  1. calling reverse creates a suspension, and we assign a debt to it proportional to the shared cost of the operation - in our case the length of the rear list.
  2. subsequent operations discharge debits, enough to pay off the debt assigned to the suspension for reverse.
  3. reverse actually executes, since its debt has been paid off.
In order to ensure that we pay off the debt assigned to reverse, we have to discharge an equivalent number of debits before the suspension for reverse is forced. Since the debt on the suspension is |rear|, this is possible only if we discharge |rear| debits before forcing it. Every tail discharges one debit, and the suspension will not be forced before another |front| tail operations have run, because the front has to be exhausted first. So after |front| operations we must be ready to execute reverse with all its debt paid off, which is possible only if we create the suspension at the point where |rear| just exceeds |front|. The following change to the check function ensures this ..

check lenf f lenr r =
  if lenr <= lenf then BQ lenf f lenr r
  else BQ (lenf + lenr) (f ++ reverse r) 0 []


Have a look at the following diagram that shows how reverse gets it done in O(1) amortized in a persistent setting.



Note how laziness with call-by-need is the secret sauce behind the numbers. The BankersQueue can also be implemented in languages that offer optional laziness. Scala can implement it with Stream, which is a lazy sequence, and lazy val, which offers call-by-need semantics. An interesting benchmark would be against the corresponding implementation in Haskell - it remains to be seen how much the overhead of creating and managing suspensions impacts the actual runtime.
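As a rough sketch of how such a port might look - my own assumption-laden take, not a benchmarked implementation - a Stream based BankersQueue in Scala could be written as ..

// A sketch of a BankersQueue in Scala: the front is a Stream (lazy and memoizing),
// the rear a strict List, and the lengths are carried along as in the Haskell version.
class BankersQueue[A] private (lenf: Int, front: Stream[A], lenr: Int, rear: List[A]) {
  def isEmpty: Boolean = lenf == 0

  // enqueue at the rear
  def snoc(x: A): BankersQueue[A] = BankersQueue.check(lenf, front, lenr + 1, x :: rear)

  def head: A =
    if (isEmpty) throw new NoSuchElementException("head of empty queue") else front.head

  // dequeue from the front
  def tail: BankersQueue[A] =
    if (isEmpty) throw new NoSuchElementException("tail of empty queue")
    else BankersQueue.check(lenf - 1, front.tail, lenr, rear)
}

object BankersQueue {
  def empty[A]: BankersQueue[A] = new BankersQueue[A](0, Stream.empty, 0, Nil)

  // suspend the reverse of the rear as soon as it grows longer than the front;
  // Stream memoizes the suspension, so every logical future pays for it only once
  private def check[A](lenf: Int, f: Stream[A], lenr: Int, r: List[A]): BankersQueue[A] =
    if (lenr <= lenf) new BankersQueue(lenf, f, lenr, r)
    else new BankersQueue(lenf + lenr, f #::: r.reverse.toStream, 0, Nil)
}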

In the next post on this series I will discuss yet another data structure that offers O(1) amortized functional queue and deque implementations. It's actually a tree structured model and can be malleable enough to offer a host of other data structure implementations as well.

Monday, June 07, 2010

Functional Data Structures in a Persistent Setting - Part 1

When you have multiple options for implementing a data structure, or you have multiple data structure implementations of the same ADT to choose from, what do you do? Do you choose the one that has the simplest implementation, the one that has the best amortized complexity, or the one that has the best worst-case complexity? Dick Lipton has a great post on how to decide What is the right complexity measure for algorithms?. In case you haven't read it yet, leave this useless blog post and go read it over at his blog.

Which implementation of a data structure you should use depends on your use case. If you are more concerned about the overall time complexity of a sequence of operations on the data structure, choose the one with the best amortized complexity bounds. When you have an amortized data structure with a time bound of O(n) for a sequence of n operations, there may be some individual operations that take more than O(1). But you are ok so long as the overall bound is maintained at O(n) for the whole sequence.

However, there are situations where you cannot afford amortized data structures. Many real time systems need predictability that can only be guaranteed through worst case time bounds. For such cases, you need to use a data structure implementation that gives you the worst case promise.

In my last post I discussed functional data structures that come with the guarantee of being persistent. Which means that even if you make changes to the data structure, you will still be able to access all the earlier versions at no added performance cost. In that post I briefly touched upon some of the techniques that persistent implementations employ to ensure that there's no brutal copying going on behind the scenes. Clojure does that to great effect in implementing its family of persistent data structures. Scala 2.8 also has an implementation of Bagwell's Hash Array Mapped Trie, the secret sauce behind Clojure's persistent hash maps and vectors. But this is not a post for discussing Clojure or Scala data structure implementations .. so let's move on ..

Amortized data structures, in a sense, walk the middle path between Dick Lipton's Genius Channels and Dumb Channels. You have an adversary that makes some of the operations complicated, threatening to pull down your promise. But at the same time you have the friendly operations that act as a counter-balancing force, thereby maintaining the amortized bounds over the whole sequence of operations.

Consider Okasaki's BatchedQueue implementation, which uses a pair of lists to implement a queue data structure. The elements of the queue are split across the two lists, front and rear. front contains half of the elements in the correct order, while rear contains the other half in reverse order. Dequeue takes place from the front (see head below), while enqueue takes place at the rear (see snoc below). Here's the Haskell implementation of the BatchedQueue taken from Okasaki's book ..





Clearly head offers worst case O(1). But tail has worst case O(n), since it has to reverse the rear list when front becomes empty. But with an amortized analysis using either the credits of the Banker's method or the potential of the Physicist's method, you can prove that all operations offer amortized O(1). Okasaki's book contains the proof of the amortized bounds for BatchedQueue.
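To make the structure concrete, here's a strict two-list sketch of the same idea in Scala - a simplified illustration, not Okasaki's code - where head peeks at the front in O(1) and tail pays the O(n) reverse only when the front runs out ..

// A strict two-list queue in the spirit of BatchedQueue - a sketch for illustration.
class BatchedQueue[A] private (front: List[A], rear: List[A]) {
  def isEmpty: Boolean = front.isEmpty

  // enqueue at the rear
  def snoc(x: A): BatchedQueue[A] = BatchedQueue.check(front, x :: rear)

  // worst case O(1)
  def head: A =
    if (front.isEmpty) throw new NoSuchElementException("head of empty queue") else front.head

  // worst case O(n): reverses the rear when the front runs out
  def tail: BatchedQueue[A] =
    if (front.isEmpty) throw new NoSuchElementException("tail of empty queue")
    else BatchedQueue.check(front.tail, rear)
}

object BatchedQueue {
  def empty[A]: BatchedQueue[A] = new BatchedQueue[A](Nil, Nil)

  // maintain the invariant: front is empty only if rear is empty too
  private def check[A](f: List[A], r: List[A]): BatchedQueue[A] =
    if (f.isEmpty) new BatchedQueue(r.reverse, Nil) else new BatchedQueue(f, r)
}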

The problem with BatchedQueue is that all bounds break in a persistent setting. Remember we talked about multiple logical futures for a persistent data structure. For one specific instance of the BatchedQueue, when you invoke tail persistently, you are actually doing the reverse multiple times, once for each of the logical futures of the operation that create the new object. Have a look at the following diagram that illustrates how BatchedQueue cannot offer an amortized O(1) in a persistent setting.


In the figure above, we have the original queue containing the numbers from 1 to 10 distributed evenly across front and rear lists. After 4 invocations of tail, we have the second scenario with only a single element left in the front list. If we have 2 invocations of tail in a persistent setting, each will invoke reverse separately despite the fact that the rear list only has 5 credits that can be spent by one of those operations.

In an upcoming post, I will discuss the BankersQueue, an alternative implementation of the queue data structure that offers O(1) amortized complexity in a persistent setting.

Monday, May 31, 2010

Grokking Functional Data Structures

With so many languages espousing functional programming paradigms, it's no wonder that functional implementations of common data structures are also becoming more and more popular. As programmers it's important that we learn to adopt them into our repertoire of tools and techniques.

Imperative data structures operate through in-place mutation of data. In our everyday usage of data structures in Java we mutate lists, arrays and every other type to realize the algorithms that we implement. Most books on data structures and algorithms present their complexity analysis using imperative data structures as case studies. When we create an object in Java (an instance of an abstract data type) and invoke some methods on it (some of which can potentially mutate its state), we get some result as the observable outcome of the sequence of operations. This sequence of operations on the value of the abstract data type constitutes a logical future of that type.

With functional data structures, immutability is the buzz of the day, though it's not mandatory that all of them be immutable. Many functional languages allow mutable data structures, which, more often than not, have also proved to be more pragmatic than the pure school of thought. I wrote a blog post on this same topic some time back.

The purists' other point of view is that the main question is not whether you need a mutable array or a hash table. The more important part is that you need to access or update a data structure within certain bounds of time and space complexity. Okasaki's book Purely Functional Data Structures contains a rich collection of functional implementations of many of the popular imperative data structures. And quite contrary to what we tend to believe, given proper language support, these implementations are often as efficient as their imperative counterparts.

When we talk about functional programming, we talk about immutability of data structures and purity of functions. In the absence of mutable state, these are the two cornerstones that make reasoning about concurrent programs much easier in the functional paradigm. Since data structures are immutable, there's no in-place update of values, which means that every operation that changes the value of a data structure will create a new instance of it. So at the end of n operations on an instance of an ADT we will have all the previous versions of the data structure available for access. This is called persistence of data structures, unlike the imperative ones we discussed earlier, where an update destroys the previous state (we call those ephemeral data structures). Unlike ephemeral data structures, which have a single logical future (as we saw earlier), persistent data structures can potentially have multiple logical futures.

Have a look at the following figure, where the list l can have 2 logical futures based on the two possible mutating operations that we make on it. Neither of them mutates the original list, and at the end we still have l as '(1 2 3). The two logical futures simply point to two different instances of the ADT.


Since Clojure sequences are persistent, we have all instances of the ADT accessible even after the entire sequence of operations. Had persistence meant brutal copying of instances, obviously we would not have the same performance guarantees as the imperative ones. Persistent data structures are implemented through careful sharing of structures across instances that make them affordable in real life applications. Have a look at the following tree structure that employs copying of paths from the root down to the newly inserted leaf while sharing the rest with the original ADT.


As with the earlier scenario, both the versions of the tree are available even after adding the new node 8 to the tree - xs points to the original tree, while ys points to the new one.
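To make the path copying idea concrete, here's a small Scala sketch - a hypothetical unbalanced binary search tree, not the actual Clojure or Scala library code - where insert copies only the nodes along the search path and shares everything else with the original tree ..

// Path copying in a persistent binary search tree: insert rebuilds only the
// nodes on the path from the root to the insertion point; the rest is shared.
sealed trait Tree[+A]
case object Leaf extends Tree[Nothing]
case class Node[A](left: Tree[A], value: A, right: Tree[A]) extends Tree[A]

def insert[A](t: Tree[A], x: A)(implicit ord: Ordering[A]): Tree[A] = t match {
  case Leaf => Node(Leaf, x, Leaf)
  case n @ Node(l, v, r) =>
    if (ord.lt(x, v)) Node(insert(l, x), v, r)      // copy this node, share r
    else if (ord.lt(v, x)) Node(l, v, insert(r, x)) // copy this node, share l
    else n                                          // already present, share everything
}

val xs = List(5, 3, 10).foldLeft(Leaf: Tree[Int])((t, x) => insert(t, x))
val ys = insert(xs, 8)   // xs is still fully intact, ys shares most of its nodes with xs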

Despite the fact that functional data structures are persistent, it's possible that their implementations make use of mutable storage. It's mostly done for efficiency of implementation, as we have seen with transients in Clojure. However, for all practical purposes, the data structure is published to its clients as an immutable one.

It's also possible to implement a persistent data structure in an imperative language like Java. In fact, FunctionalJava does exactly that and offers a rich suite of persistent data structures developed in Java. But of course there's a trade-off. First, the usage looks intimidating, with Java being quite verbose and without much type-inferencing capability. And, as Okasaki points out in his book, you need an implementation of call-by-need in order to get the improved amortized complexity of functional data structures. That's the subject of another post, some other day.

What would life be like without mutable arrays and hash tables, the bread-and-butter data structures on which imperative programming thrives?

Well, in functional languages lists, trees and tuples take the place of arrays and hash tables. The basic approach of a functional data structure is to find a general representation of a sequence and then apply recursive transformations on it to implement specific operations. In this post I give you a very brief look at a couple of such building blocks that are used extensively to implement persistent data structures in many languages.

Consider the use case for a data structure where you would like to reach a specific item (often called a focus point) within the structure and do manipulations around it. With mutable structures you can reach a specific node within a list, an array or a tree and make changes to it. The sole reason doubly linked lists exist is to allow you to reach a node within the list and move in either direction. Or you may want to take a new node and splice it into an existing doubly linked list using a minimum number of pointer manipulations.

In the functional world we have zippers which can be adapted to this very use case. Zipper, invented by Gerard Huet, is a generic data structure which allows you to identify a focal point within an ADT and make all manipulations around it as cheap as in-place updates. And it does all this in a completely immutable and persistent way. Every focal point within the tree is identified by a pair of Tree and Context, each of which is again a recursive data structure. The root of the Tree is the focal point around which we would like to do mutable operations. The functional pearl paper by Huet has all the details of this functional data structure. Clojure has a nice implementation of zipper as part of its core distribution. Zipper is a general purpose data structure and has lots of exciting use cases. I hope to cover some of them in my upcoming posts.
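For a flavor of the idea, here's a minimal list zipper sketched in Scala - a deliberate simplification of Huet's tree zipper and of what clojure.zip offers - where moving the focus and updating the element at the focus are cheap, purely functional operations ..

// A minimal list zipper: 'focus' is the element we are currently looking at,
// 'left' holds the elements before it in reverse order, 'right' the ones after it.
case class ListZipper[A](left: List[A], focus: A, right: List[A]) {
  def moveLeft: ListZipper[A] = left match {
    case l :: ls => ListZipper(ls, l, focus :: right)
    case Nil     => this                    // already at the leftmost position
  }
  def moveRight: ListZipper[A] = right match {
    case r :: rs => ListZipper(focus :: left, r, rs)
    case Nil     => this                    // already at the rightmost position
  }
  // replace the element at the focus - as cheap as an in-place update,
  // but the original zipper remains untouched
  def update(a: A): ListZipper[A] = copy(focus = a)

  def toList: List[A] = left.reverse ::: focus :: right
}

val z  = ListZipper(Nil, 1, List(2, 3, 4))
val z1 = z.moveRight.update(20)
// z.toList  is still List(1, 2, 3, 4)
// z1.toList is List(1, 20, 3, 4)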

Another interesting data structure used functionally is the finger tree, invented by Hinze and Paterson. A finger tree is a 2-3 tree representation of a sequence that gives you cheap access to the 'fingers', the places where you would like to do your manipulations. You can then define transformation functions using reductions and monoids that implement specific operations. For example, you can annotate every node of a finger tree using a characteristic function (modeled as a monoid), which you can use to look up specific elements having that property. If you want to implement fast positional operations (like accessing the nth element of the sequence), annotate the finger tree with the 'size' function, so that every node has the size of its subtree stored as its measure. Given size annotations we can now find the nth element in O(log n) time. Similarly, annotating the tree with a 'priority' function turns a finger tree into a priority queue. In the same way you can implement deques, interval trees and a host of other data structure variants on top of the single underlying representation.

This post is just an introduction to functional data structures, and I hope to cover some of the specific ones in future. The key difference in implementation from mutable data structures is that the functional ones are based on recursive transformations, which are easier to model using the power of pure functions. They are persistent and, as we discussed above, can have multiple logical futures. This makes their runtime analysis a big challenge. In order to obtain amortized bounds that match their mutable counterparts you need support for laziness and call-by-need from the underlying implementation language. Computations, once forced, need to be memoized so that the next logical future can use them without any additional expense on its realized cost of execution.

Monday, May 10, 2010

Laziness in Clojure - Some Thoughts

Some readers commented on my earlier post on Thrush implementation in Clojure that the functional way of doing stuff may seem to create additional overhead through unnecessary iterations and intermediate structures that could have been avoided using traditional imperative loops. Consider the following thrush for the collection of accounts from the earlier post ..

(defn savings-balance
  [& accounts]
  (->> accounts
       (filter #(= (:type %) ::savings))
       (map :balance)
       (apply +)))


Is it one iteration of the collection for the filter and another for the map?

It's not. Clojure offers lazy sequences and the functions that operate on them all return lazy sequences only. So in the above snippet Clojure actually produces a composition of the filter and the map that act on the collection accounts in a lazy fashion. Of course with apply, everything is computed since we need to realize the full list to compute the sum. Let's look at the following example without the sum to see how Clojure sequences differ from a language with eager evaluation semantics ..

user> (def lazy-balance
      (->> accounts
           (filter #(= (:type %) ::savings))
           (map #(do (println "getting balance") (:balance %)))))
#'user/lazy-balance


lazy-balance has not been evaluated - we don't yet have the printlns. Only when we force the evaluation do we have it computed ..

user> lazy-balance
(getting balance
getting balance
200 300)


Had Clojure been a strict language it would have been stupid to follow the above strategy for a large list of accounts. We would have been doing multiple iterations over the list generating lots of intermediate structures to arrive at the final result. An imperative loop would have rested the case much more cheaply.

Laziness improves compositionality. With laziness, Clojure sequences and the higher order functions on them essentially reify loops so that you can transform them all at once. As Cale Gibbard defends laziness in Haskell with his comments on this LtU thread .. "It's laziness that allows you to think of data structures like control structures."

Clojure is not as lazy as Haskell. And hence the benefits are also not as pervasive. Haskell being lazy by default, the compiler can afford to make aggressive optimizations like reordering of operations and transformations that Clojure can't. With Haskell's purity that guarantees absence of side-effects, deforestation optimizations like stream fusion generates tight loops and minimizes heap allocations. But I hear that Clojure 1.2 will have some more compiler level optimizations centered around laziness of its sequences.

Laziness makes you think differently. I had written an earlier post in this context with Haskell as the reference language. I have been doing some Clojure programming of late. Many of my observations with Haskell apply to Clojure as well. You need to keep in mind the idioms and best practices that laziness demands. And many times they may not seem obvious to you. In fact with Clojure you need to know the implementation of the abstraction in order to ensure that you get the benefits of lazy sequences.

You need to know that destructuring's & uses the nthnext function, which uses next, which needs to know the future to determine the present. In short, next doesn't fit in the lazy paradigm.

The other day I was working on a generic walker that traverses some recursive data structures for some crap processing. I used walk from clojure.walk, but later realized that for seqs it does a doall that realizes the sequence - another lazy gotcha that caught me unawares. But I actually needed to peek into the implementation to get to know it.

Being interoperable with Java is one of the benefits of Clojure. However you need to be aware of the pitfalls of using Java's data structures with the lazy paradigm of Clojure. Consider the following example where I put all accounts in a java.util.concurrent.LinkedBlockingQueue.

(import '(java.util.concurrent LinkedBlockingQueue))
(def myq (new LinkedBlockingQueue))
(doto myq (.put acc-1) (.put acc-2) (.put acc-3))


Now consider the following snippet that does some stuff on the queue ..

(let [savings-accounts (filter #(= (:type %) ::savings) myq)]
     (.clear myq)
     (.addAll myq savings-accounts))


Should work .. right? It doesn't! filter is lazy, and hence savings-accounts has not been realized within the let-block. Then we clear myq, and when we do the addAll, nothing gets added, since savings-accounts gets realized only at that point, against a queue that is already empty. The solution is to use a doall, which blows away the laziness and realizes the filtered sequence ..

(let [savings-accounts (doall (filter #(= (:type %) ::savings) myq))]
     (.clear myq)
     (.addAll myq savings-accounts))


Of course laziness in Clojure sequences is something that adds power to your abstractions. However you need to be careful on two grounds:
  • Clojure as a language is not lazy by default in its totality (unlike Haskell), and hence laziness may get mixed up with strict evaluation, leading to surprising and suboptimal consequences.
  • Clojure interoperates with Java, which has mutable data structures and strict evaluation. As in the situation I described above with LinkedBlockingQueue, sometimes it's safer to bite the bullet and do things the Java way.

Monday, April 26, 2010

DSL Interoperability and Language Cacophony

Many times I hear people say that DSL based development often leads to a situation where you have to manage a mix of code written in various languages, which are barely interoperable and often need to be integrated using glue like ScriptEngine that makes your life more difficult than easier. Well, if you are in this situation, you are in the world of language cacophony. You have somehow managed to pitchfork yourself into it through immature decisions - either selecting an improper implementation language for your DSL or a poor means of integrating it with your core application.

Internal DSLs are nothing but a disciplined way of designing APIs that speak the Ubiquitous Language of the domain you are modeling. Hence when you design an internal DSL, stick to the idioms and best practices that your host language offers. For domain rules that you think could be better modeled using a more expressive language, select one that has *good* interoperability with your host language. If you cannot find one, work within the limitations of your host language. It's better to be slightly limiting in syntax and flexibility with your DSL than to force integration with a language that does not interoperate seamlessly with your host language. Otherwise you will end up having problems not only in developing the interfaces between the DSL and the host language, but also endless interoperability problems in the face of exceptions.

Don't let the language cacophony invade the horizons of your DSL implementation!

Consider the following example where we have a DSL in Groovy that models an Order that a client places to a securities broker firm for trading of stocks or bonds. The DSL is used to populate orders into the database when a client calls up the broker and asks for buy/sell transactions.



I am not going into the details of implementation of this DSL. But consider that the above script on execution returns an instance of Order, which is an abstraction that you developed in Groovy. Now you have your main application written in Java where you have the complete domain model of the order processing component. Your core application Order abstraction may be different from what you have in the DSL. Your DSL only constructs the abstraction that you need to populate the order details that the user receives from their clients.

It's extremely important that the Order abstraction that you pass from executing the Groovy script be available within your Java code. You can use the following mechanism to execute your Groovy code ..



This runs the script but does not have any integration with your Java application. Another option may be using the Java 6 ScriptEngine for talking back to your Java application. Something like the following snippet, which you can have within your main application to execute the Groovy script using ScriptEngine ..



Here you have some form of integration, but it's not ideal. You get back some objects into your Java code, can iterate over the collection .. still the objects that you get back from Groovy are opaque at best. Also since the script executes in the sandbox of the ScriptEngine, in case of any exception, the line numbers mentioned in the stack trace will not match the line number of the source file. This can lead to difficulty in debugging exceptions thrown from the DSL script.

Groovy has excellent interoperability with Java even at the scripting level. When integrating DSLs with your main application always prefer the language specific integration features over any generic script engine based support. Have a look at the following that does the same job of executing the Groovy script from within a Java application. But this time we use GroovyClassLoader, a ClassLoader that extends URLClassLoader and can load Groovy classes from within your Java application ..



You will have to make your DSL script return a Closure that then gets called within the Java application. Note that within Java we now get a complete handle on the Order classes that we defined in Groovy.

This is an example to demonstrate the usage of proper integration techniques while designing your DSL. In my upcoming book DSLs In Action I discuss many other techniques of integration for DSLs developed on the JVM. The above example is also an adaptation from what I discuss in the book in the context of integrating Groovy DSLs with Java application.

Thanks to Guillaume Laforge and John Wilson for many suggestions on improving the Groovy DSL and its interoperability with Java.

Monday, April 19, 2010

PubSub with Redis and Akka Actors

Redis (the version on the trunk) offers publish/subscribe based messaging. This is quite a big feature compared to the typical data structure oriented services that it has been offering so far. It also opens up lots of possibilities for using Redis as a messaging engine of a different kind. The sender and the receiver of the messages are absolutely decoupled from each other, in the sense that senders do not send messages to specific receivers. Publishers publish messages on specific channels. Subscribers who subscribe to those channels get them and can take specific actions on them. As Salvatore notes in his weekly updates on Redis, this specific feature has evolved from lots of users asking for a general notification mechanism to trap changes in the key space. Redis already offers BLPOP (blocking list pop operation) for similar use cases, but it's still not sufficient to satisfy the needs of a general notification scheme. Salvatore explains it in more detail in his blog post.

I have been working on a Scala client, which I forked from Alejandro Crosa's repository. I implemented pubsub very recently and also have integrated it with Akka actors. The full implementation of the pubsub client in Scala is in my github repository. And if you like to play around with the Akka actor based implementation, have a look at the Akka repository.

You define your publishers and subscribers as actors and exchange messages over channels. You can define your own callbacks as to what you would like to do when you receive a particular message. Let's have a look at a sample implementation at the client level .. I will assume that you want to implement your own pub/sub application on top of the Akka actor based pubsub substrate that uses the redis service underneath.

Implementing the publisher interface is easy .. here is how you can bootstrap your own publishing service ..



The publish method just sends a Publish message to the Publisher. Publisher is an actor defined in Akka as follows:



The subscriber implementation is a bit more complex since you need to register your callback as to what you would like to do when you receive a specific type of message. This depends on your use case and it's your responsibility to provide a proper callback function downstream.

Here is a sample implementation for the subscriber. We need two methods to subscribe and unsubscribe from channels. Remember in Redis the subscriber cannot publish - hence our Sub cannot do a Pub.



I have not yet specified the implementation of the callback. What should it look like?

The callback will be invoked when the subscriber receives a specific type of message. According to Redis specification, the types of messages which a subscriber can receive are:

a. subscribe
b. unsubscribe
c. message

Refer to the Redis documentation for details of these message formats. In our case, we model them as case classes as part of the core Redis client implementation ..
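Their exact names and fields in the scala-redis client are not reproduced here, but a hypothetical modeling of the three message types could look like this ..

// Hypothetical shapes for the three Redis pub/sub message types --
// the real case classes in the scala-redis client may differ.
sealed trait PubSubMessage
case class Subscribed(channel: String, subscribedChannels: Int) extends PubSubMessage
case class Unsubscribed(channel: String, subscribedChannels: Int) extends PubSubMessage
case class Message(channel: String, data: String) extends PubSubMessage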



Our callback needs to take appropriate custom action on receipt of any of these types of messages. The following can be one such implementation .. It is customized for a specific application that treats various message formats specially and gives them appropriate application-dependent semantics ..



Note in the above implementation we specialize some of the messages to give additional semantics. e.g. if I receive a message as "+t", I will interpret it as subscribing to the channel "t". Similarly "exit" will unsubscribe me from all channels.
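A callback along these lines - using the hypothetical message types sketched earlier, and with sub/unsub standing in for whatever subscribe/unsubscribe calls the subscriber exposes - could encode that interpretation ..

// A sketch of such a callback: "+channel" subscribes to an additional channel,
// "exit" unsubscribes from everything, anything else is just printed.
def callback(sub: String => Unit, unsub: () => Unit)(msg: PubSubMessage): Unit = msg match {
  case Subscribed(channel, count) =>
    println("subscribed to " + channel + " (now listening on " + count + " channels)")
  case Unsubscribed(channel, count) =>
    println("unsubscribed from " + channel + " (" + count + " channels left)")
  case Message(_, data) if data startsWith "+" =>
    sub(data drop 1)                       // "+t" means subscribe to channel t
  case Message(_, "exit") =>
    unsub()                                // "exit" means unsubscribe from all channels
  case Message(channel, data) =>
    println("received on " + channel + ": " + data)
}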

How to run this application?

I will assume that you have the Akka master with you. Also you need to have a version of Redis running that implements pubsub. You can start the subscription service using the above implementation and then use any other Redis client to publish messages. Here's a sample recipe for a run ..

Prerequisite: Need Redis Server running (the version that supports pubsub)



For running this sample application :-

Starting the Subscription service



Starting a Publishing service



Another publishing client using redis-cli



Have fun with the message formats



The full implementation of the above is there as a sample project in Akka master. And in case you are not using Akka, I also have a version of the above implemented using Scala actors in the scala-redis distribution.

Have fun!

Monday, April 12, 2010

Thrush in Clojure

Quite some time back I wrote a blog post on the Thrush Combinator implementation in Scala. Just for those who missed reading it, Thrush is one of the permuting combinators from Raymond Smullyan's To Mock a Mockingbird. The book is a fascinating read where the author teaches combinatorial logic using examples of birds from an enchanted forest. In case you've not yet read it, please do yourself a favor and get it from your favorite book store.

A Thrush is defined by the following condition: Txy = yx. Thrush reverses the order of evaluation. In our context, it's not really an essential programming tool. But if you're someone who takes special care to make your code readable to the human interface, the technique sometimes comes in quite handy.

Recently I came across Thrush in Clojure. You don't have to implement anything - it's there for you in the Clojure library implemented as a macro ..

Consider this simple example of bank accounts, where we represent an account as a Map in Clojure ..

(def acc-1 {:no 101 :name "debasish" :type 'savings :balance 100})
(def acc-2 {:no 102 :name "john p." :type 'checking :balance 200})


We have a list of accounts and we need to find all savings accounts and compute the sum of their current balances .. well not too difficult in Clojure ..

(defn savings-balance
  [& accounts]
  (apply +
    (map :balance
      (filter #(= (:type %) 'savings) 
        (seq accounts)))))


To a programmer familiar with the concepts of functional programming, it's quite clear what the above function does. Let's read it out for all of us .. From a list of accounts, filter the ones with type savings, get their balances and report the sum of them. That was easy .. but did you notice that we read it inside out from the implementation, which, btw, is a 4-level nested function?

Enter Thrush ..

Being a permuting combinator, Thrush enables us to position the functions outside in, in the exact sequence that the human mind interprets the problem. In our Scala version we had to implement something custom .. with Clojure, it comes with the standard library .. have a look ..

(defn savings-balance
  [& accounts]
  (->> (seq accounts)
       (filter #(= (:type %) 'savings))
       (map :balance)
       (apply +)))


->> is implemented as a macro in Clojure that does the following:

  1. threads the first form (seq accounts) as the last argument of the second form (the filter), which makes (seq accounts) the last argument of filter
  2. then makes the entire filter form, including the newly added argument, the last argument of the map
.. and so on for all the forms present in the argument list. Effectively the resulting form that we see at runtime is our previous version using nested functions. The Thrush combinator only dresses it up nicely for the human eye, synchronizing the thought process with the visual implementation of the logic. And all this at no extra runtime cost! Macros FTW :)

->> has a related cousin ->, which is the same as ->> but threads the forms as the second argument of the next form instead of the last. These macros implement Thrush in Clojure. Idiomatic Clojure code is concise and readable, and when it uses a proper ubiquitous language of the domain, it makes a very good DSL. Think about using Thrush when you feel that reordering the forms will make your API look more intuitive to the user.

Thrush also helps you implement the Decorator pattern in a very cool and concise way. In my upcoming book, DSLs In Action I discuss these techniques in the context of designing DSLs in Clojure.