Ruminations of a Programmer: terracotta


Monday, March 09, 2009

Bloom Filters - optimizing network bandwidth

The Bloom filter is one of the coolest data structures around: elegant as well as incredibly useful. I had a short post on a cool application of bloom filters in front-ending access to disk-based data to achieve improved throughput in query processing. I didn't know that Oracle uses bloom filters for processing parallel joins and join filter pruning. In the case of parallel joins, the idea is quite simple ..

Each slave process prepares a bloom filter for the join condition that it is processing

It then passes the bloom filter on to the other slave processes, which can apply the filter to their own selected sets of records before passing the final set on to the join coordinator

Remember that each of the above processes may run in a distributed environment - hence the above technique leads to less data being transported across nodes, trading some extra CPU cycles for savings in network bandwidth. This paper describes all of this in full detail with illustrative examples.
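The two steps above can be sketched roughly as follows. This is only an illustration of the idea, not Oracle's implementation: the class names, the filter size, the number of hash functions and the hashing scheme are all my own assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Minimal Bloom filter over string join keys (illustrative only).
class JoinKeyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    JoinKeyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}

class BloomJoinSketch {
    // Step 1: one process builds a filter from the join keys it has seen.
    static JoinKeyBloomFilter buildFilter(List<String> joinKeys) {
        JoinKeyBloomFilter filter = new JoinKeyBloomFilter(1 << 16, 3);
        for (String key : joinKeys) {
            filter.add(key);
        }
        return filter;
    }

    // Step 2: another process drops rows whose key cannot match
    // before shipping the survivors to the coordinator.
    static List<String> prefilter(List<String> localKeys, JoinKeyBloomFilter remote) {
        List<String> survivors = new ArrayList<>();
        for (String key : localKeys) {
            if (remote.mightContain(key)) {
                survivors.add(key);
            }
        }
        return survivors;
    }
}
```

A few non-matching rows may slip through as false positives, so the coordinator still has to perform the real join. Matching rows are never dropped.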

This idea of serializing bloom filters instead of the data set has been used quite extensively in load balancing MapReduce operations, to minimize intermediate results before sending everything across the network for final aggregation. When processing distributed join operations, we may need to compose multiple bloom filters to get the final dataset. Bloom joins, as they are called, allow cheap serialization of filters over the wire by employing some clever techniques like linear hash tables and multi-tier bloom filters, as described in this paper in Comonad Reader.

Bloom joins can also be used effectively in MapReduce processing with CouchDB. The map phase can produce the bloom filters, which can be joined in the reduce phase. In a recent application, I needed to store a very large list mainly for set operations. Instead of storing individual elements, I decided to store a bloom filter that fit nicely in a memcached slab. I could pull it out and do all sorts of bit operations easily, and it's blindingly fast. Next time you decide to distribute your huge list in Terracotta, think again - there may be a lighter weight option in distributing a bloom filter instead. There are use cases where you will be doing membership checks only and some false positives may not do much harm.
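To give a flavor of the "bit operations" part, here is a minimal sketch built on java.util.BitSet. It is not the code from that application: the filter size, the two hash positions and all the names are assumptions chosen for illustration, and the byte array is simply what one would hand to a cache client as the value.

```java
import java.util.BitSet;

class BloomBits {
    // 8192 bits serialize to at most 1 KB; the size is an arbitrary choice.
    static final int NUM_BITS = 8192;

    // Two cheap bit positions per element (illustrative hashing only).
    static int[] positions(String element) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B1, 13);
        return new int[] { Math.floorMod(h1, NUM_BITS), Math.floorMod(h2, NUM_BITS) };
    }

    static BitSet filterOf(String... elements) {
        BitSet bits = new BitSet(NUM_BITS);
        for (String element : elements) {
            for (int p : positions(element)) {
                bits.set(p);
            }
        }
        return bits;
    }

    static boolean mightContain(BitSet filter, String element) {
        for (int p : positions(element)) {
            if (!filter.get(p)) {
                return false;
            }
        }
        return true;
    }

    // The byte[] is what would be stored as the cached value.
    static byte[] serialize(BitSet filter) {
        return filter.toByteArray();
    }

    static BitSet deserialize(byte[] bytes) {
        return BitSet.valueOf(bytes);
    }

    // Set union is a bitwise OR, provided both filters use the same
    // size and the same hash functions.
    static BitSet union(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }
}
```

The union of two filters answers membership for either original set, with a somewhat higher false positive rate than each filter alone.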

Sunday, February 01, 2009

Asynchronous Write Behinds and the Repository Pattern

The following is a typical implementation of service methods of the domain model of an application. The Repository is injected and is used to persist the domain model or lookup objects from the underlying store. The entire storage and the mechanics of the underlying retrieval is abstracted within the DAO / Repository layer.

public class RestaurantServiceImpl implements RestaurantService {

  @Autowired
  public RestaurantServiceImpl(..) {
    //..
  }

  // injected
  private final RestaurantRepository restaurantRepo;

  public void storeRestaurants(List<Restaurant> restaurants) {
    restaurantRepo.store(restaurants);
  }
}

In a typical layered architecture, the database often proves to be the hardest layer to scale. And in the above implementation, restaurantRepo.store() is a synchronous method that keeps you in abeyance till the data gets persisted across all the layers of your architecture, down to the bits and pieces of the underlying relational store. Of course it can be any other store as well - after all, the repository is an abstraction, so it doesn't matter to the application whether you use a relational database, a native file system or a document database underneath. But you get the idea - synchronous communication with the database / hard disk often turns out to be the bottleneck here.

Terracotta provides a nice option of virtualizing your interaction with the database. The async TIM (Terracotta Integration Module) provides asynchronous write behind to the database, while the application works on in-memory data structures. Terracotta offers network attached memory with transparent JVM clustering that allows data structures to be *declaratively* clustered. The value proposition here is that the user can work on the object model, using POJOs, delegating the concerns of persistence to an asynchronous Terracotta process.

Here is an example of the above service extended to handle asynchronous write behinds ..

public class AsyncRestaurantServiceImpl extends RestaurantServiceImpl {

  // need to be clustered
  @Root
  private final AsyncCoordinator<Restaurant> asyncCommitter =
    new AsyncCoordinator<Restaurant>(new RestaurantAsyncConfig(), new NeverStealPolicy<Restaurant>());

  // dependency injected
  private final RestaurantCommitHandler handler;

  @Autowired
  public AsyncRestaurantServiceImpl(..) {
    super();
    asyncCommitter.start(handler, ..);
  }

  @Override
  public void storeRestaurants(List<Restaurant> restaurants) {
    asyncCommitter.add(restaurants);
  }

  //.. other methods
}

The AsyncCoordinator<> is the agent that handles the persistence asynchronously in the background. The class RestaurantCommitHandler contains the actual code that writes the collection of Restaurants to the database. RestaurantCommitHandler implements ItemProcessor<> - instances of ItemProcessor get bucketed and throttled asynchronously for database commits, while the application continues by adding the objects to be persisted to a POJO.

@Service
public class RestaurantCommitHandler implements ItemProcessor<Restaurant> {
  //..
}
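For readers who have not seen write behind before, the mechanics can be approximated with nothing but java.util.concurrent. The sketch below is NOT the Terracotta async TIM API, and it is neither clustered nor durable. WriteBehindBuffer, batchWriter and maxBatch are hypothetical names of mine, with batchWriter standing in for the commit handler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Plain-Java write-behind sketch; not the Terracotta AsyncCoordinator.
class WriteBehindBuffer<T> implements AutoCloseable {
    private final BlockingQueue<T> pending = new LinkedBlockingQueue<>();
    private final Consumer<List<T>> batchWriter; // stands in for the commit handler
    private final int maxBatch;
    private final Thread worker;
    private volatile boolean running = true;

    WriteBehindBuffer(Consumer<List<T>> batchWriter, int maxBatch) {
        this.batchWriter = batchWriter;
        this.maxBatch = maxBatch;
        this.worker = new Thread(this::drainLoop, "write-behind");
        this.worker.start();
    }

    // Called by the service: returns immediately, without touching the store.
    void add(T item) {
        pending.add(item);
    }

    // Background loop: collect queued items into batches and write them out.
    private void drainLoop() {
        try {
            while (running || !pending.isEmpty()) {
                T first = pending.poll(50, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                List<T> batch = new ArrayList<>();
                batch.add(first);
                pending.drainTo(batch, maxBatch - 1);
                batchWriter.accept(batch);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stop the loop once everything queued so far has been written.
    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Anything still queued is lost if the JVM dies, which is exactly the gap that clustered, network attached memory is meant to close.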

Now, we can take this one step further. The Repository is supposed to abstract the handling of storage and retrieval - so why not abstract the asynchronous persistence within the repository itself and keep the service implementation clean? Enabling asynchrony at the service layer then becomes simply a matter of injecting the proper repository ..

interface RestaurantRepository {
  void store(List<Restaurant> restaurants);
}

class RestaurantRepositoryImpl implements RestaurantRepository {
  public void store(..) {
    //.. standard DAO based implementation
  }
}

class AsyncRestaurantRepositoryImpl implements RestaurantRepository {
  @Root
  private final AsyncCoordinator<Restaurant> asyncCommitter =
    new AsyncCoordinator<Restaurant>(new RestaurantAsyncConfig(), new NeverStealPolicy<Restaurant>());

  // dependency injected
  private final RestaurantCommitHandler handler;

  public AsyncRestaurantRepositoryImpl() {
    super();
    asyncCommitter.start(handler, ..);
  }

  public void store(List<Restaurant> restaurants) {
    asyncCommitter.add(restaurants);
  }

  //.. other methods
}

I have not yet used the above in any production application. But decoupling the main processing from the underlying database decreases the write latency of domain objects. Couple this idea with Terracotta's original value proposition of cluster-wide, in-process, distributed coherent caching, and I think it can prove to be a really wicked cool platform for scaling out your application. The system of record (SOR) is now closer to the application, and the database can act as a snapshot for audit trails and reporting purposes. Of course this asynchronous write behind is not suitable as a plug-in for an existing architectural framework where you have lots of loosely coupled systems interconnected through databases. But I guess there can be many use cases for which this can be a viable solution.

However, looking at the current state of the Terracotta async write behind framework, one area that concerns me is the lack of out-of-the-box support for cases when the database may be down for an extended period. The framework leaves it to the client to implement any such failover support. The ItemProcessor is a non-clustered local instance - hence the user can very well catch the ProcessingException and act upon it according to business needs. Still, it would be nice to have some support from the framework, whereby the application can continue to run in-memory and sync up later when the database comes back up.
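One shape such client-side handling could take is to park failed batches locally and replay them on the next attempt. The sketch below is plain Java and purely hypothetical: RetryingBatchWriter and storeWriter are names I made up, a RuntimeException stands in for whatever the real processor throws, and the backlog is an unbounded in-memory queue that would need a cap in practice.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Keeps items in a local backlog while the store is unavailable.
class RetryingBatchWriter<T> {
    // Assumed to throw a RuntimeException while the store is down.
    private final Consumer<List<T>> storeWriter;
    private final Deque<T> backlog = new ArrayDeque<>();

    RetryingBatchWriter(Consumer<List<T>> storeWriter) {
        this.storeWriter = storeWriter;
    }

    // Try to write the backlog plus the new items in arrival order.
    // On failure everything stays in the backlog for the next attempt.
    synchronized boolean write(List<T> items) {
        backlog.addAll(items);
        List<T> attempt = new ArrayList<>(backlog);
        try {
            storeWriter.accept(attempt);
            backlog.clear();
            return true;
        } catch (RuntimeException storeUnavailable) {
            return false;
        }
    }

    synchronized int backlogSize() {
        return backlog.size();
    }
}
```

This assumes the store write is all-or-nothing (for example, one transaction per batch). Otherwise a retry could write some items twice.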

Would love to hear some real life stories from anyone with experience to share on usage of Terracotta Async module ..

Monday, September 29, 2008

Memcached and Terracotta : Alternatives or Complementary ?

Last week I had a chatty Twitter session with Ari Zilka, CTO of Terracotta. It all started with Ari's observation about the confusion that exists in people's minds regarding the actual use of memcached and how it compares to Terracotta as a caching solution. Ari was chatty, and I thought it would be useful to share his observations with a broader audience. Here is a transcript of the chat, with some snippets of my own personal observations and conclusions ..

Do you think Terracotta is an alternative to Memcached ?

The following is a snapshot directly captured from the Twitter stream ..

Reminder to self :

Memcached is a specialized engine for caching *only*. In case you are trying to use it as a data store, think twice and refactor your thoughts.

Memcached is NOT a distributed hash table. This is quite a popular misconception, one that even Dare mentions in one of his recent posts. Every memcached server is atomic and is not aware of any other memcached server in the cluster. Any algorithm for distribution, HA or failover is the responsibility of the client.

How do you handle database updates and prevent staleness of your memcached data ? Updates to data are usually handled by pushing writes to the database and then having some asynchronous processes (or database triggers) build objects that are pushed into memcached. In case of Terracotta it is the other way round. You write into Terracotta Network Attached Memory and the data gets pushed into the database through asynchronous write behinds.

Terracotta offers a truly coherent and clustered cache that remains consistent even in the presence of database writes, through write behind to the System of Record. Hence by writing data directly into Terracotta Network Attached Memory, data can be safely and durably stored, without risk of loss or corruption, and later drained to the DB asynchronously.
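On the earlier point that distribution is the client's responsibility: the servers never talk to each other, so it is the client that decides which server owns a key. One common way to do that is a consistent hash ring, sketched below. This is an illustration and not the code of any particular memcached client; ServerRing, the hash function and the server names used with it are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Client-side server selection with a consistent hash ring (illustrative).
class ServerRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    ServerRing(List<String> servers, int pointsPerServer) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        // Each server owns several points on the ring to even out the load.
        for (String server : servers) {
            for (int i = 0; i < pointsPerServer; i++) {
                ring.put(hash(server + "#" + i), server);
            }
        }
    }

    // A key belongs to the first server point at or after its hash,
    // wrapping around to the start of the ring if there is none.
    String serverFor(String key) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        if (entry == null) {
            entry = ring.firstEntry();
        }
        return entry.getValue();
    }

    // FNV-1a over the characters; any well-spread hash would do.
    private static int hash(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }
}
```

With a ring like this, removing one server only remaps the keys that server owned. Every other key keeps its server, which matters when each remapped key is a cache miss.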

Can we conclude that for (mostly) read-only use cases we should use memcached, while for read-write use cases we should use Terracotta to obtain transparent durability to the persistent store? Rather than alternatives, the two technologies are complementary.