
Thursday, January 01, 2015

Probabilistic techniques, data streams and online learning - Looking forward to a bigger 2015

I look forward to 2015 as the year when randomized algorithms, probabilistic techniques and data structures become more pervasive and mainstream. The primary driving factor will be the growing prevalence of big data and the necessity to process it in near real time using minimal (or constant) memory. You are given data streams where you will possibly see each data item only once in your lifetime, and you need to churn out analytics from them in real time. You cannot afford to store all of it in a database on disk, since that would incur an unrealistic performance penalty for serving queries in real time. And you cannot afford to keep all of the information in memory, even if you can add RAM at will. You need to find clever ways to optimize your storage and employ algorithms and data structures that use sublinear space and yet deliver information in real time.

Many such data structures are already being used quite heavily for specialized processing of data streams ..


These data structures are becoming more and more useful as we prepare to embrace and process larger data sets with fairly strict online requirements. And they have already started making a difference. Take, for example, Impala, the open source analytic database from Cloudera that works on top of Hadoop. Impala's NDV aggregate function (number of distinct values) uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space. This blog post has the details of the performance improvement it offers in comparison to a standard distinct count. The immensely popular NoSQL store Redis also offers a HyperLogLog implementation that you can use to get an approximation of the cardinality of a set using randomization. Salvatore has the details here on the implementation of the HyperLogLog algorithm in Redis.
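To get a feel for how lightweight this is, here is a minimal sketch of using the Redis HyperLogLog commands from Python. It assumes a local Redis server and the redis-py client; the key name and the user ids are purely illustrative.

import redis

r = redis.Redis(host="localhost", port=6379)

# PFADD feeds observed elements into the sketch; the memory used stays
# around 12 KB per key no matter how many distinct elements are added.
for user_id in ("u1", "u2", "u3", "u2", "u1"):
    r.pfadd("unique_visitors", user_id)

# PFCOUNT returns an approximate cardinality (standard error ~0.81%).
print(r.pfcount("unique_visitors"))   # ~3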

The most important reason these algorithms and data structures are becoming popular is the increased focus on our "online" requirements. We are not only processing bigger and bigger data sets, we need results faster too. We just cannot afford to push all analytics to batch mode and expect results to come out after an overnight run. Various architectural paradigms like the lambda architecture also aim to address this niche. But before investing in such complex architectures, some neat data structures that use probabilistic techniques and randomization may offer the much lighter weight solution you are looking for.

Consider processing the Twitter stream and generating analytics (of whatever form) online. This means that immediately after seeing one tweet you must be able to predict something and update your model at the same time. That means you need to absorb the data you see in the feed and apply it to update your model, and yet you cannot store the entire firehose you have seen so far. This is online learning, and it is the essence of techniques like stochastic gradient descent - the model is capable of making up-to-date predictions after every data point it sees. John Myles White has an excellent presentation on this topic.
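As a concrete illustration, here is a minimal sketch of online learning with stochastic gradient descent - a logistic regression model updated one example at a time, so it can predict after every data point without storing the stream. The feature dimension and the fixed learning rate are illustrative choices, not recommendations.

import numpy as np

dim = 4
w = np.zeros(dim)          # model weights, updated in place
learning_rate = 0.1

def predict(x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # sigmoid

def update(x, y):
    """One SGD step on a single (features, label) pair, label in {0, 1}."""
    global w
    w -= learning_rate * (predict(x) - y) * x

# Simulated stream: predict first, then learn from the same example.
stream = [(np.array([1.0, 0.5, 0.0, 1.0]), 1),
          (np.array([1.0, 0.0, 1.0, 0.0]), 0)]
for features, label in stream:
    print(predict(features))   # up-to-date prediction
    update(features, label)    # constant-memory model update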

Consider this other problem of detecting similarities between documents. When you are doing this at Web scale you will have to deal with millions of documents to find the similar sets. There are techniques like minhash which enable you to compress documents into signature matrices. But even then the scale becomes too big to be processed and reported to the user in a meaningful amount of time. As an example (from Mining Massive Datasets), if you process 1 million documents using signatures of length 250, you still need about 1000 bytes per document - the total comes to 1 gigabyte, which fits comfortably into the memory of a standard laptop. But when you check for similar pairs, you need to process (1,000,000 choose 2), or half a trillion, pairs of documents, which would take almost 6 days on a laptop. Enter probabilistic techniques: the locality sensitive hashing (LSH) algorithm fits this problem like a charm. Detecting similarity is a problem that also arises in recommender systems with collaborative filtering, and LSH can be used there as well. The basic idea of LSH as applied to similarity detection is to hash the items multiple times and identify candidate pairs that qualify for an exact similarity check. The idea is to reduce the search space using probabilistic techniques so that we can eliminate the candidates which have a very low chance of being similar.
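To make the banding idea concrete, here is a compact sketch of minhash signatures with LSH banding: documents are reduced to short signatures, and only documents that collide in at least one band become candidate pairs for an exact similarity check. The parameters (100 hash functions, 20 bands) and the md5-based hash family are illustrative, not tuned.

import hashlib
from collections import defaultdict

NUM_HASHES, BANDS = 100, 20
ROWS = NUM_HASHES // BANDS          # 5 signature rows per band

def shingles(text, k=3):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def h(i, shingle):
    return int(hashlib.md5(f"{i}:{shingle}".encode()).hexdigest(), 16)

def minhash(text):
    sh = shingles(text)
    return tuple(min(h(i, s) for s in sh) for i in range(NUM_HASHES))

def candidate_pairs(docs):
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            band = sig[b * ROWS:(b + 1) * ROWS]
            buckets[(b, band)].append(doc_id)   # bucket on the whole band
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

docs = {"d1": "the quick brown fox",
        "d2": "the quick brown fix",
        "d3": "lorem ipsum dolor"}
print(candidate_pairs(docs))   # very likely just ("d1", "d2")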

Here I have only scratched the surface of the areas where we apply randomization and probabilistic techniques to solve problems that are very real today. There are plenty of other areas in data mining, graph clustering, machine learning and big data processing where similar techniques are employed to tame the curse of dimensionality and provide practical solutions at scale. 2014 has already seen a big surge in popularizing these techniques. I expect 2015 to be bigger and more mainstream in terms of their usage.

Personally I have been exploring data stream algorithms a lot and have prepared a collection of useful references. Feel free to share it in case you find it useful. I hope to do something more meaningful with stream processing data structures and online learning in 2015. Have a very happy and joyous new year ..

Monday, March 09, 2009

Bloom Filters - optimizing network bandwidth

The Bloom filter is one of the coolest data structures - elegant as well as incredibly useful. I had a short post on a cool application of bloom filters to front ending access to disk based data for improved throughput in query processing. I didn't know Oracle uses bloom filters for processing parallel joins and for join filter pruning. In the case of parallel joins, the idea is quite simple ..

  1. Each slave process prepares a bloom filter for the join condition that it is processing

  2. It then passes the bloom filter on to the other slave processes, which apply it to their own selected sets of records before passing the final set to the join coordinator


Remember, each of the above processes may run in a distributed environment - hence the technique leads to less data being transported across the nodes, saving network bandwidth at the cost of some extra CPU cycles. This paper describes all of this in full detail with illustrative examples.
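Here is a toy sketch of that flow in Python: one side builds a small bloom filter over its join keys, and the other side uses it to drop rows that cannot match before anything travels to the coordinator. The tiny BloomFilter class (fixed size, a few md5-derived hash functions) is a deliberate simplification, not how Oracle implements it.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        digest = hashlib.md5(str(item).encode()).hexdigest()
        for i in range(self.k):
            yield int(digest[i * 8:(i + 1) * 8], 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Slave 1: build a filter over the join keys it owns.
build_side_keys = [10, 42, 77]
bf = BloomFilter()
for key in build_side_keys:
    bf.add(key)

# Slave 2: apply the (small) filter locally; only surviving rows travel.
probe_side_rows = [(10, "a"), (13, "b"), (42, "c"), (99, "d")]
shipped = [row for row in probe_side_rows if row[0] in bf]
print(shipped)   # false positives possible, but no matching row is dropped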

This idea of serializing bloom filters instead of data sets has been used quite extensively in load balancing MapReduce operations to minimize intermediate results before sending everything across the network for final aggregation. When processing distributed join operations, we may need to compose multiple bloom filters to get the final dataset. Bloom joins, as they are called, allow cheap serialization of the filters over the wire, employing some clever techniques like linear hash tables and multi-tier bloom filters, as described in this paper in Comonad Reader.
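One simple way to compose partial results is worth noting: two bloom filters built with identical parameters (same size, same hash functions) can be merged with a bitwise OR of their bit arrays, giving the filter of the union of the two key sets. A short sketch, reusing the illustrative BloomFilter class above:

def union(f1, f2):
    # Merging only makes sense when both filters share size and hash count.
    assert (f1.m, f1.k) == (f2.m, f2.k), "filters must share parameters"
    merged = BloomFilter(f1.m, f1.k)
    merged.bits = bytearray(a | b for a, b in zip(f1.bits, f2.bits))
    return merged

left, right = BloomFilter(), BloomFilter()
for key in (1, 2, 3):
    left.add(key)
for key in (3, 4, 5):
    right.add(key)

combined = union(left, right)
print(all(k in combined for k in (1, 2, 3, 4, 5)))   # True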

Bloom joins can also be used effectively in MapReduce processing with CouchDB. The map phase can produce the bloom filters, which can be joined in the reduce phase. In a recent application, I needed to store a very large list mainly for set operations. Instead of storing the individual elements, I decided to store a bloom filter, which fit nicely in a memcached slab. I could pull it out and do all sorts of bit operations on it easily, and it's blindingly fast. Next time you decide to distribute your huge list in Terracotta, think back - there may be a lighter weight option in distributing a bloom filter instead. There are use cases where you will be doing membership checks only and a few false positives will not do much harm.
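Since the filter is just a fixed-size bit array, it serializes to a compact blob that fits comfortably under a single cache key. The sketch below assumes a local memcached instance and the pymemcache client, and reuses the illustrative BloomFilter class from above.

from pymemcache.client.base import Client

mc = Client(("localhost", 11211))

bf = BloomFilter()
for member in ("alice", "bob", "carol"):
    bf.add(member)

mc.set("members_bf", bytes(bf.bits))            # ship only the bits

cached = BloomFilter()                          # same parameters on the reader
cached.bits = bytearray(mc.get("members_bf"))
print("alice" in cached, "mallory" in cached)   # True, (almost surely) False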

Sunday, March 01, 2009

Preemptive Bloom Filter

Michael Mitzenmacher describes a great trick with a bloom filter and a hash table that has been used in Longest Prefix Matching techniques to improve performance by reducing the number of dependent memory lookups.

Just in case you are not familiar with bloom filters, have a look at the great introduction in Wikipedia ..
The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.


Suppose you are doing lookups of large keys in a slow hash table, and the lookups are primarily negative searches, i.e. in most cases the lookup will fail. In such cases you can front end the (slow) hash table lookup with a bloom filter over the keys kept in fast memory. Since most of your lookups are likely to fail, you will potentially save a lot by avoiding access to the slower memory based hash table. Of course the bloom filter can return false positives (remember, it never gives false negatives), in which case you will still have to look up the hash table.
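A small sketch of that fast path, reusing the illustrative BloomFilter class from the sketch in the previous post: consult the in-memory filter first and touch the slow table only when the filter says "maybe". The slow_table dict is just a stand-in for the slow, large-key store.

slow_table = {"k1": "v1", "k2": "v2"}   # pretend this lives in slow memory

bf = BloomFilter()                      # filter lives in fast memory
for key in slow_table:
    bf.add(key)

def lookup(key):
    if key not in bf:                   # definite miss: no false negatives
        return None
    return slow_table.get(key)          # "maybe": pay for the slow lookup;
                                        # a false positive costs one extra probe

print(lookup("k1"), lookup("missing"))  # v1 None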

I was wondering if the same technique could be generalized to lookups in database tables. For all the sets of indexes that will typically be used for searching a table, we could have separate bloom filters in fast memory, which would reduce the number of accesses to the disk based tables. Of course this works meaningfully only for tables where lookup failures outweigh successes, as in Michael's examples of blacklisted URLs or dangerous packet signatures.

A nice trick to have up your sleeve ..