
Thursday, January 21, 2010

A new way to think of Data Storage for your Enterprise Application

A couple of posts earlier I blogged about a real life case study from one of our projects where we use a SQL store (Oracle) and a NoSQL store (MongoDB) in combination over a message based backbone. MongoDB caters to a very specific subset of the application functionality, where we felt it was a better fit than a traditional RDBMS. This hybrid architecture of data organization is turning out to be an increasingly attractive option today, with more and more specialized persistent storage structures being developed.

In many applications we need to process graph data structures. Neo4J can be a viable option for this. You can have your mainstream data storage still in an RDBMS and use Neo4J only for the subset of functionalities for which you need to use graph data structures. If you need to sync back to your main storage, use messaging as the transport to talk back to your relational database.
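
To make this concrete, here is a minimal sketch of storing a small graph through Neo4J's embedded Java API, driven from Scala. The database path, property names and relationship type are purely illustrative, not part of any real project.

```scala
import org.neo4j.graphdb.DynamicRelationshipType
import org.neo4j.kernel.EmbeddedGraphDatabase

object RuleGraph {
  // embedded database; the path is illustrative
  val db = new EmbeddedGraphDatabase("/data/neo4j/rules")
  val DEPENDS_ON = DynamicRelationshipType.withName("DEPENDS_ON")

  // create two nodes and a typed edge between them, inside a transaction
  def addDependency(from: String, to: String) {
    val tx = db.beginTx
    try {
      val a = db.createNode; a.setProperty("name", from)
      val b = db.createNode; b.setProperty("name", to)
      a.createRelationshipTo(b, DEPENDS_ON)
      tx.success
    } finally {
      tx.finish
    }
  }
}
```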

Using multiple data stores along with asynchronous messaging is an option that looks very potent today. Drizzle bases its entire replication on a RabbitMQ transport, and using AMQP messaging, Drizzle replicates data to a host of stores like Voldemort, MemcacheDB and Cassandra.

If we agree that messaging is going to be one of the dominant paradigms shaping application architectures, why not go one level up and look at some higher level abstractions for message based programming? Erlang programmers have been using the actor model for many years now and have demonstrated the good qualities that the model embodies. Inspired by Erlang, Scala offers a similar model on the JVM. In an earlier post I discussed how we can use the actor model in Scala to scale out messaging applications over a RabbitMQ transport.
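
A minimal sketch of the model with the scala.actors library follows. The message types and the handler body are made up for illustration; in a real deployment the actor would hand the payload off to the RabbitMQ transport or a data store.

```scala
import scala.actors.Actor
import scala.actors.Actor._

case class Persist(json: String)
case object Stop

// an actor fronting a data store: clients fire messages and move on,
// the actor serializes the writes in its own thread of control
class StoreWriter extends Actor {
  def act() {
    loop {
      react {
        case Persist(json) =>
          // in real life: write to the store or publish over AMQP
          println("persisting: " + json)
        case Stop => exit()
      }
    }
  }
}

object Client {
  def main(args: Array[String]) {
    val writer = new StoreWriter
    writer.start
    writer ! Persist("""{"symbol": "IBM", "qty": 100}""")
    writer ! Stop
  }
}
```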

Now, with the developing ecosystem of polyglot storage, we can use the same model of actor based communication as the backbone for integrating the multiple data stores you may plug in to your application. Have specific clients front the store they need to work with, and use messaging to sync that store up with the main storage backend so that you keep a consistent system of record. Have a look at the following diagram, which may not look that unreal today: you have a host of options that bring your data closer to the way you process it in your domain model, be it document oriented, graph based, key/value based, or simple POJOs across a data grid like Terracotta.

[diagram not shown: multiple specialized data stores - document, graph, key/value, data grid - syncing with the main storage backend over a messaging backbone]

When a bunch of architectural components are loosely connected through a messaging infrastructure, you have a world of options for managing the interactions between them. In fact your options open up even more when you get to interact with data shaped the way you would like it to be. You can now think in terms of a data model aligned with the model of your domain. And you know that once your rule base gets updated in Neo4J, it will be synced up with the backend storage through some other service that makes the system eventually consistent.

In a future post I will explore some of the options that a higher order middleware service like Akka can add to your stack. With Akka providing abstractions like transactors, pluggable persistence and out of the box integration modules for AMQP, there are a number of ways you can think of modularizing your application's domain model and storage. You can use a peer-to-peer distributed actor based communication model to set up synchronization with your databases, or you can use an AMQP based transport to do the same, much like what Drizzle does for replication. But that's food for thought for yet another future post.

Thursday, December 24, 2009

A case for a hybrid SQL-NoSQL stack

Alex Popescu talks about Drizzle replication in his myNoSQL column. He makes a very interesting observation in his post regarding Drizzle's ability to replicate into a host of NoSQL storage backends ..

"Leaving aside the technical details — which are definitely interesting .., the solution using the Erlang AMQP .. implementation RabbitMQ .. — I think this replication layer could represent a good basis for SQL-NoSQL hybrid solutions".

We are going to see more and more of such hybrid solutions in the days to come. Drizzle does it at the product level. Using RabbitMQ as the transport, Drizzle can replicate data as serialized Java objects to Voldemort, as JSON marshalled objects to Memcached or as a hashmap to column family based Cassandra.

Custom implementation projects have also started using hybrid stacks of persistent stores. When you are dealing with really high volumes and access patterns where you cannot use joins, you need to denormalize anyway, and ORMs cease to be part of your solution toolset. Data access patterns vary widely across the profile of clients using your system. If you are running an ecommerce suite, the shopping cart module may see explosive use right after your product launches. It makes perfect sense to move the shopping cart out of the single relational data store where it has been lying around and serve it through a more appropriate store that lives up to the scalability requirements. It's not that you need to throw away the relational database that has served you so long. Like Alex mentioned, you can always go with a hybrid model.

In one of our recent projects, we were using Oracle as the main relational database for a securities trading back office solution. The database load had been provisioned based on calculations done early in the project. At a very late stage a new requirement came up that needed heavy processing and storage of semi-structured data and meta-data from an external feed. Both the data and the meta-data were extensible, which meant it was difficult to model them with a fixed schema.

We could not afford frequent schema changes, since each one would entail long downtime of the production database. But there was also the requirement that after processing, lots of this semi-structured data would have to be made available in the production database. We could have modeled it following the key/value paradigm in Oracle itself, which we were using anyway as the primary database. But that again would have been the age old hammer and nail story.

We decided to supplement the stack with another data store that fit the bill for this specific use case. We used MongoDB, which gave us phenomenal performance for our requirements. We loaded MongoDB with all the semi-structured data and meta-data coming in from the external feed. All the necessary processing was done in MongoDB, and the relevant information was then pushed onto JMS based queues for consumption by services that copied it asynchronously to our Oracle servers.
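
What follows is a much simplified sketch of that pipeline in Scala, using the MongoDB Java driver and the plain JMS API. The database and collection names, the queue name and the injected ConnectionFactory are illustrative, not the actual project artifacts.

```scala
import com.mongodb.{Mongo, BasicDBObject, DBObject}
import javax.jms.{ConnectionFactory, Session}

class FeedPipeline(factory: ConnectionFactory) {
  val coll = new Mongo("localhost", 27017).getDB("feeds").getCollection("instruments")

  // store a raw semi-structured feed record: no fixed schema required
  def store(symbol: String, meta: DBObject) {
    coll.insert(new BasicDBObject("symbol", symbol).append("meta", meta))
  }

  // after processing, push the relevant subset onto a JMS queue; a consumer
  // on the other side copies it asynchronously into Oracle
  def publish(payload: String) {
    val conn = factory.createConnection
    try {
      val session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE)
      val producer = session.createProducer(session.createQueue("oracle.sync"))
      producer.send(session.createTextMessage(payload))
    } finally {
      conn.close
    }
  }
}
```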

What did we achieve with the above architecture ?

  1. Kept Oracle free to do what it does best.

  2. Took away unnecessary load from production database servers.

  3. Introduced a document database to serve a requirement tailor made for it - semi-structured data, mainly reads, no constraints, no overhead of ORM ceremony. MongoDB offers a very clean programming model and a decent query interface, and is simple to use and easy to sell to your client.

  4. Used message based mapping to sync up data ASYNCHRONOUSLY between the NoSQL MongoDB store and the SQL based Oracle. Each data store was doing what it does best, keeping us clear of the hammer and nail trap.


With more and more NoSQL stores coming up, message based replication is going to play a very important role. Even within the NoSQL data stores we are seeing SQL based storage backends being offered - Voldemort offers MySQL as one of its storage backends, so the hybrid model starts right there. It's always advisable to use multiple stores that fit your use cases rather than trying to force-fit everything into a single paradigm.

Monday, November 02, 2009

NOSQL Movement - Excited by the Coexistence of Divergent Thoughts

Today we are witnessing a great deal of excitement around the NoSQL movement. Call it NoSQL (~SQL) or NOSQL (Not Only SQL), the movement has a mission: not all applications need to store and process data the same way, and storage should be architected accordingly. Till today we have been force-fitting a single hammer to drive every nail - irrespective of how we process data in our application, we have traditionally stored it as rows and columns in a relational database.

When we talk about really big write-scaling applications, relational databases suck big time. Normalized data, joins and ACID transactions are definite anti-patterns for write scalability. You may think sharding will solve your problems by splitting data into smaller chunks. But in reality the biggest problem with sharding is that relational databases were never designed for it. Sharding takes away many of the benefits that relational databases were traditionally built for. Sharding cannot be an afterthought: it intrudes into the business logic of your application, and joining data from multiple shards is definitely a non trivial effort. As long as you can scale your data model vertically by increasing the size of your box, that's possibly the sanest way to go. But Moore .. *cough* .. *cough* .. And even if you are able to scale up vertically, try migrating a really large MySQL database - it can take hours, even days. That's one of the reasons why some companies are moving to schemaless databases when their applications can afford to.
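
As a toy illustration of that intrusion, hypothetical Scala along these lines ends up sprinkled through the application: every data access first routes through shard-selection logic, and anything spanning shards has to be stitched together by hand.

```scala
import javax.sql.DataSource

// application-level sharding: the routing key leaks into every query path
class ShardedUserStore(shards: Array[DataSource]) {
  private def shardFor(userId: Long): DataSource =
    shards((userId % shards.length).toInt)  // naive modulo routing

  def loadUserName(userId: Long): Option[String] = {
    val conn = shardFor(userId).getConnection
    try {
      val stmt = conn.prepareStatement("select name from users where id = ?")
      stmt.setLong(1, userId)
      val rs = stmt.executeQuery
      if (rs.next) Some(rs.getString("name")) else None
    } finally {
      conn.close
    }
  }
  // a "join" across users on different shards means N queries plus
  // merge logic right here in application code
}
```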

If, for horizontal scalability, we sacrifice normalization, joins and ACID transactions, why should we use an RDBMS ? You don't need to .. Digg is moving from MySQL to Cassandra. It all depends on your application and the kind of write scalability you need to achieve in processing your data. For read scalability you can still manage with read-only slaves that replicate everything coming into the master database in realtime, plus a smart proxy router between your clients and the database.

The biggest excitement that the NOSQL movement has created today comes from the divergence of thought that each of the products promises. This is very much unlike the RDBMS movement, which started with a single hammer named SQL, capable of munging rows and columns of data based on the mathematics of set operations. Every application adopted the same storage architecture irrespective of how it processed the data. One thing led to another, people thought they could solve the mismatch with yet another level of indirection .. and the strange thingy called an Object Relational Mapper was born.

At last it took the momentum of Web shaped data processing to make us realize that not all data is processed alike. The storage that works so well for your desktop trading application will fail miserably in a social application where you need to process linked data, more in the shape of a graph. The NOSQL community has responded with Neo4J, a graph database that offers easy storage and traversal of graph structures.

If you want to go big on write scalability, the only way out is decentralization and eventual consistency. The CAP theorem kicks in: you can get at most two of consistency, availability and partition tolerance, so you need to compromise on at least one of them. Riak and Cassandra offer decentralized data stores that can potentially scale indefinitely. If your application needs more structure than a pure key/value database, you can go for Cassandra, the distributed, peer-to-peer, column oriented data store. Have a look at the nice article from Digg that compares their use case on relational storage with the columnar storage that Cassandra offers. For a document oriented database with all the goodness of REST and JSON, Riak is the option to choose. Riak also offers map/reduce along with the option to store linked data items, much in the way the Web works - Riak is truly a Web shaped data store.

CouchDB has yet another very interesting value proposition in this whole ecosystem of NOSQL databases. Many applications are inherently offline and need seamless, painless replication facilities. CouchDB's B-Tree based storage structure, append-only operations with MVCC based concurrency control, lockless operations, REST APIs and incremental map/reduce give it a sweet spot in the space of local browser storage. Chris Anderson, one of the core developers of CouchDB, sums up the value of CouchDB in today's Web based world very nicely ..

"CouchApps are the product of an HTML5 browser and a CouchDB instance. Their key advantage is portability, based on the ubiquity of the html5 platform. Features like Web Workers and cross-domain XHR really make a huge difference in the fabric of the web. Their availability on every platform is key to the future of the web."

MongoDB, like CouchDB, is also a document store. It doesn't offer REST out of the box, but it's based on JSON style document storage (using the binary BSON format). It has map/reduce as well, but also offers a strong suite of query APIs that feel much like SQL. This is the main sweet spot of MongoDB, and it plays very well with people coming from a SQL background. MongoDB also offers master/slave replication and has been working towards auto-sharding based scalability and failover support.
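
To give a feel of that query interface, here is a small hypothetical example with the MongoDB Java driver; the collection and field names are made up.

```scala
import com.mongodb.{Mongo, BasicDBObject}

object QueryDemo {
  def main(args: Array[String]) {
    val coll = new Mongo("localhost", 27017).getDB("test").getCollection("users")

    // roughly: SELECT * FROM users WHERE age > 30 ORDER BY name LIMIT 10
    val query = new BasicDBObject("age", new BasicDBObject("$gt", 30))
    val cursor = coll.find(query).sort(new BasicDBObject("name", 1)).limit(10)
    while (cursor.hasNext) println(cursor.next)
  }
}
```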

There are quite a few other data stores that offer solutions to the problems you face in everyday application design: caching, worker queues requiring atomic push/pop operations, processing activity streams, logging data etc. Redis and Tokyo Cabinet are nice fits for such use cases. You can think of Redis as a memcached backed by a persistent key/value database. It's single threaded, uses non-blocking IO and is blazing fast. Besides everyday key/value storage, Redis also offers lists and sets with atomic operations on each of them. Pick the one that fits your bill best.
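
For the worker queue use case, here is a minimal sketch assuming the Jedis Java client for Redis; the key and payload names are illustrative.

```scala
import redis.clients.jedis.Jedis

object WorkQueue {
  def main(args: Array[String]) {
    val redis = new Jedis("localhost")

    // producer: atomic push onto the head of the list
    redis.lpush("jobs", """{"task": "resize", "img": "42.png"}""")

    // worker: atomic pop from the tail; lpush/rpop together give a FIFO queue
    val job = redis.rpop("jobs")
    if (job != null) println("processing " + job)
  }
}
```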

Another interesting aspect is the interoperability between these data stores. Riak, for example, offers pluggable storage backends - possibly we can have CouchDB as the storage backend for Riak (can we ?). Possibly we will also see a Cassandra backend for Neo4J. It's extremely heartening to see that each of these communities has a deep sense of cooperation in making the entire ecosystem more meaningful and thriving.