(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
I don't know why, but I've been struggling with a series of posts on large-data clustering for a while now. Distilling out the key ideas in a way that is intuitive and yet rigorous has been trickier than I thought. But here goes ...
Let's consider a prototypical clustering algorithm: $k$-means. In a $k$-means iteration, you have to scan each point and determine its nearest cluster center. Then you have to group all the points that share a cluster center and compute a new center by averaging them. Each iteration takes $O(nk)$ time, and that's not even accounting for high dimensions.
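For concreteness, here's a bare-bones sketch of one such iteration (my own Python, with points as plain tuples and squared Euclidean distances); the nested loop over $n$ points and $k$ centers is where the $O(nk)$ cost, times the dimension, comes from.

```python
def kmeans_iteration(points, centers):
    """One Lloyd iteration: assign each point to its nearest center, then
    replace each center by the mean of its assigned points. The assignment
    loop does n * k distance computations, each costing the dimension."""
    k, dim = len(centers), len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:                                  # n points ...
        j = min(range(k), key=lambda j:               # ... times k centers
                sum((a - b) ** 2 for a, b in zip(p, centers[j])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # empty clusters keep their old center
    return [tuple(s / counts[j] for s in sums[j]) if counts[j] else centers[j]
            for j in range(k)]
```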
This algorithm doesn't scale well to large data, needless to say. But clustering, like most data mining tasks, is needed most acutely when we're dealing with large data. So what do we do ?
In the next few posts in this series, I'll try to walk you through some of the key principles used in designing large-data clustering algorithms. As always, the twin ideas of sampling and streaming will be the keys to the kingdom.
Clusters can be formed by clustering clusters.
One of the ideas that makes large-data clustering tractable is that you can cluster clusters (say that three times quickly, will ya ?). $k$-means is a particularly good example of this. I can think of a cluster center as a "representative weighted point" and discard the data associated with it. This is critical, because now I can just worry about the cluster centers themselves, and repeat.
The advantage of this is locality. I can take small chunks of data and cluster them efficiently. Then I can combine the cluster reps from each chunk to get a smaller set of reps, and so on. Many of the "standard" large-data clustering methods (BIRCH, CLARA, CLARANS and the like) are based on this principle. As a concept, it's sound. But does it guarantee anything ?
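To make the "representative weighted point" idea concrete, here's a minimal sketch (mine, not from any of the papers mentioned here) of a Lloyd-style $k$-means that operates on weighted points. Ordinary points are just weight-1 inputs, and the weighted centers it outputs form a much smaller instance that can be clustered again.

```python
import random

def weighted_kmeans(points, weights, k, iters=20):
    """Lloyd-style k-means where each point carries a weight, so a cluster
    center plus its total weight can stand in for the points it replaced.
    Returns (centers, center_weights); the output is itself a (much smaller)
    weighted instance that can be clustered again at the next level."""
    dim = len(points[0])
    centers = random.sample(list(points), k)
    wsum = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        wsum = [0.0] * k
        for p, w in zip(points, weights):
            # assignment: nearest center under squared Euclidean distance
            j = min(range(k), key=lambda j:
                    sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            wsum[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # update: weighted mean of each cluster (empty clusters keep their center)
        centers = [tuple(s / wsum[j] for s in sums[j]) if wsum[j] > 0 else centers[j]
                   for j in range(k)]
    return centers, wsum
```

Chaining this is the whole trick: cluster each chunk, pool the (center, weight) pairs from all the chunks, and cluster those.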
The catch, of course, is error propagation. How do I make sure that in this repeated (hierarchical) clustering process, I don't commit to a bad clustering early on because of locality ? It seems like I could pay a lot of error at each level, which limits how small I can make the chunks (because small chunks mean many levels).
It turns out that there are clever ways around this problem by careful merging. The first example of this is a paper from 1997 by Charikar, Chekuri, Feder and Motwani. They look at the $k$-center problem in a general metric space (minimize the maximum radius), and their technique illustrates this principle well.
Their algorithm is quite elegant, and works in phases. In each phase, they first merge cluster centers together to ensure that all resulting cluster centers are far apart. Then they insert new points into existing clusters as long as those clusters don't grow too large, and otherwise create new clusters. This continues as long as the number of clusters stays below $k$. When it crosses $k$, the phase ends, and the process repeats.
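Here's a rough doubling-style sketch of that phase structure. The constants and the merge rule in the actual paper are more careful than this; `dist` is any metric you supply, and everything else is my own simplification.

```python
def incremental_k_center(stream, k, dist):
    """Doubling-style sketch of one-pass k-center with merging phases.
    Rough invariant: centers stay far apart, and every point seen so far is
    within a small multiple of d of some center. (The real Charikar et al.
    algorithm uses more careful constants; this only shows the phase idea.)"""
    it = iter(stream)
    centers = []
    for p in it:                                   # bootstrap with k+1 distinct points
        if all(dist(p, c) > 0 for c in centers):   # skip exact duplicates
            centers.append(p)
        if len(centers) == k + 1:
            break
    if len(centers) <= k:
        return centers, 0.0
    d = min(dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])

    def merge(cs, d):
        # keep a maximal subset of centers pairwise > 2*d apart;
        # every dropped center is within 2*d of a kept one
        kept = []
        for c in cs:
            if all(dist(c, q) > 2 * d for q in kept):
                kept.append(c)
        return kept

    while len(centers) > k:                        # initial merge down to <= k
        centers = merge(centers, d)
        if len(centers) > k:
            d *= 2
    for p in it:
        if all(dist(p, c) > 2 * d for c in centers):
            centers.append(p)                      # p is far from everything: new cluster
            while len(centers) > k:                # phase ends: double d and merge
                d *= 2
                centers = merge(centers, d)
    return centers, d
```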
Note that the algorithm reads in points and processes them on the fly in a single "pass". The reason it works despite the repeated mergings is a nifty lower-bound property (one that also lies at the heart of the Gonzalez 2-approximation for $k$-center):
If you can produce $k+1$ points that are all at distance at least $d$ from each other, then the optimal $k$-clustering has diameter at least $d$.
This is because of a 'quantitative pigeonhole principle': At least two of these points must be in a single cluster, which then has diameter at least $d$.
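In code, the certificate is trivial to check: the best bound a set of $k+1$ candidate witnesses certifies is just their minimum pairwise distance (a small helper of my own, with `dist` again left abstract):

```python
def witness_lower_bound(witnesses, k, dist):
    """Given k+1 candidate witness points, two of them must share a cluster
    in any k-clustering, so the optimal diameter is at least their minimum
    pairwise distance. That minimum is the bound this witness set certifies."""
    assert len(witnesses) == k + 1
    return min(dist(a, b)
               for i, a in enumerate(witnesses)
               for b in witnesses[i + 1:])
```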
There are some tricks needed to figure out how much you can expand each cluster when inserting, and how far apart cluster centers should be in the phase after the current one. I'll spare you the details, but they lead to an 8-approximation, which can be improved using a neat randomization trick to a $2e$-approximation.
The above trick works because we can build a witness structure that lower bounds OPT, so the merging can be done carefully enough that error doesn't propagate badly. What happens if you change the cost function to the $k$-median, though, where the witness isn't as straightforward ?
An excellent source here is the work by Guha, Meyerson, Mishra, Motwani and O'Callaghan. This paper, which combines prior work on both the theory and practice of $k$-median clustering, presents the following idea:
Break the data set into pieces. Cluster each set "loosely", using many more than $k$ clusters. Weight each cluster by the number of points assigned to it. Cluster these weighted cluster centers to find the desired $k$ clusters.
The key observations they make are:
- The total cost of clustering the pieces isn't too much larger than the overall clustering cost (so it gives a good lower bound). This follows from the triangle inequality.
- The new "weighted" instance admits a solution that again is not too much larger than the optimal cost (is actually within a factor of 6 for any partition). This again follows from the triangle inequality.
The last piece of this puzzle is: how to cluster each set "loosely" ? They propose using a bicriterion method: one whose cost is compared to the optimal $k$-median cost, but which is allowed to use more than $k$ medians. Such bicriterion methods are often easier to design.
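Putting the pieces together, the scheme looks roughly like this. It's a sketch under my own naming: `bicriterion_cluster(wpoints, m)` stands for any routine that maps a weighted point set to at most $m$ weighted medians (a bicriterion $k$-median method as above, or even the weighted $k$-means sketch from earlier as a crude stand-in, though the paper's guarantees are for $k$-median).

```python
def two_level_k_median(points, k, chunk_size, m, bicriterion_cluster):
    """Two-level sketch of the idea described above: cluster each chunk
    'loosely' with m >> k medians, weight each median by the total weight
    assigned to it, then cluster the pooled weighted medians down to k.
    `bicriterion_cluster(wpoints, m)` is an assumed helper mapping a list
    of (point, weight) pairs to at most m weighted representatives."""
    reps = []
    for start in range(0, len(points), chunk_size):
        chunk = [(p, 1) for p in points[start:start + chunk_size]]
        reps.extend(bicriterion_cluster(chunk, m))   # loose: m centers, not k
    return bicriterion_cluster(reps, k)              # final: down to k
```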
Note that there's nothing stopping them from using their own algorithm recursively to cluster the weighted cluster centers. What's neat about this is that it can be adapted to a streaming setting using another standard trick.
Specifically, if you have an algorithm that breaks things into pieces and then recombines them (a two-level scheme), you can first recursively apply it to get a multi-level hierarchical scheme, and then you can implement this in a streaming setting by updating (potentially) an entire side of the computation tree whenever a new item appears. This can also be viewed as a lazy implementation of the hierarchical strategy.
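This is the familiar merge-and-reduce pattern. Here's a sketch, with the clustering step abstracted into an assumed helper `reduce_fn(wpoints, k)` that turns a list of (point, weight) pairs into a small set of weighted representatives (think $O(k)$ of them):

```python
def merge_and_reduce(stream, k, bucket_size, reduce_fn):
    """Sketch of the merge-and-reduce trick. `reduce_fn(wpoints, k)` is an
    assumed helper: any clustering routine mapping a list of (point, weight)
    pairs to a small set of weighted representatives. levels[i] holds at most
    one bucket of reps summarizing roughly 2^i * bucket_size original points;
    a full buffer cascades upward like carries in binary addition."""
    levels, buffer = [], []
    for p in stream:
        buffer.append((p, 1))
        if len(buffer) < bucket_size:
            continue
        carry = reduce_fn(buffer, k)
        buffer = []
        i = 0
        while i < len(levels) and levels[i] is not None:
            carry = reduce_fn(levels[i] + carry, k)   # merge two buckets, reduce
            levels[i] = None
            i += 1
        if i == len(levels):
            levels.append(None)
        levels[i] = carry
    # final answer: reduce whatever remains in the buffer and across all levels
    leftover = buffer + [r for lvl in levels if lvl for r in lvl]
    return reduce_fn(leftover, k) if leftover else []
```

The cascade in the inner while loop is exactly the "update one side of the computation tree" behaviour: most new points touch nothing, and once in a while a whole chain of levels gets merged at once.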