``I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it...''
-- Justice Potter Stewart in Jacobellis v. Ohio.
What makes one clustering better than another? To answer that question we have previously assumed that a well motivated objective function was given to us. Then the situation is easy: we compute the value of the objective and decide accordingly. In practice, however, clustering is often an intermediate step in a longer chain of computations, and there is no specific motivation for one objective function over another. Debating the pros and cons of the multitude of possible objective functions can easily take hours. Instead, we can formalize the ``I know it when I see it'' intuition by writing down the properties that any good clustering objective should satisfy, and see if this axiomatic approach guides us towards the right solution.
This approach was undertaken by Kleinberg in his work on clustering axioms. A clustering function is one that takes a set of points (and their pairwise distances) and returns a partition of the data. The function should abide by three simple axioms (a small code sketch of this interface follows the list):
- Scale-Invariance: If all distances are scaled by a constant factor, the clustering should not change. Put another way, measuring the distance in miles or kilometers (or nanometers) should not change the final clustering partition.
- Richness: Depending on the pairwise distances, any partition should be possible. For example, the clustering should not always put two specific points $x$ and $y$ together, regardless of the distance metric.
- Consistency: If the pairwise distances between points within a cluster decrease, while the distances between points in different clusters increase, the clustering should not change. Indeed, such a transformation makes clusters `tighter' and better separated from each other, so why should the clustering change?
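To make the setup concrete, here is a minimal sketch in Python: a clustering function consumes pairwise distances and emits a partition, and Scale-Invariance can be checked numerically. The single-linkage routine and the toy distances are my illustrations, not Kleinberg's constructions; note that fixing the number of clusters $k$ in advance already gives up Richness.

```python
from itertools import combinations

def single_linkage(dist, n, k):
    """Merge the two closest clusters until only k remain.

    dist maps frozenset({i, j}) -> distance for 0 <= i < j < n;
    returns the partition as a set of frozensets of point indices.
    (Fixing k in advance is a simplification: it gives up Richness,
    since partitions with a different number of clusters become impossible.)
    """
    clusters = [{i} for i in range(n)]
    while len(clusters) > k:
        # Single-link distance between two clusters: the closest cross pair.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: min(dist[frozenset({i, j})]
                                      for i in clusters[ab[0]]
                                      for j in clusters[ab[1]]))
        clusters[a] |= clusters[b]
        del clusters[b]          # b > a, so index a is unaffected
    return {frozenset(c) for c in clusters}

# Two tight pairs far apart; the 2-clustering is the natural answer.
dist = {frozenset({0, 1}): 1.0, frozenset({2, 3}): 1.2,
        frozenset({0, 2}): 9.0, frozenset({0, 3}): 9.5,
        frozenset({1, 2}): 8.7, frozenset({1, 3}): 9.1}

# Scale-Invariance: miles vs. kilometers (or nanometers) -- same partition.
miles = single_linkage(dist, n=4, k=2)
scaled = single_linkage({p: 1000 * d for p, d in dist.items()}, n=4, k=2)
assert miles == scaled == {frozenset({0, 1}), frozenset({2, 3})}
```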
Kleinberg's surprising result is that no clustering function can satisfy all three of these axioms simultaneously. The troublemaker is Consistency: suppose the best view of the data is a single cluster, and shrink the pairwise distances unevenly so that two tight, well-separated groups emerge. Every distance is a within-cluster distance, so Consistency insists the output remain a single cluster, even though a 2-clustering now looks far more natural. This leads to a slightly different tack at the problem. Instead of axiomatizing the clustering function, or a specific partition of the data, we try to define an objective function such that the optimum clustering under that objective behaves in a reasonable manner. This is the approach adopted by Ackerman and Ben-David. Let $m$ be a function measuring the quality of a clustering. As before, Scale-Invariance requires the measure $m$ to be independent of the units on the distances, and Richness requires that every clustering can be the optimum (under $m$) for some setting of the distances.
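For intuition, a measure of this kind might look like the following sketch: essentially the Dunn index, a ratio of separation to spread. This is my illustration, not one of the measures from the paper; because it is a ratio of distances, Scale-Invariance comes for free.

```python
from itertools import combinations

def dunn_like(dist, clustering):
    """m(C) = (min between-cluster distance) / (max within-cluster distance).

    Higher is better. Scaling every distance by a constant cancels in the
    ratio, so the measure is Scale-Invariant by construction.
    (Assumes at least two clusters and at least one non-singleton cluster.)
    """
    within = [dist[frozenset(p)] for c in clustering
              for p in combinations(sorted(c), 2)]
    between = [dist[frozenset({i, j})] for a, b in combinations(clustering, 2)
               for i in a for j in b]
    return min(between) / max(within)

dist = {frozenset({0, 1}): 1.0, frozenset({2, 3}): 1.2,
        frozenset({0, 2}): 9.0, frozenset({0, 3}): 9.5,
        frozenset({1, 2}): 8.7, frozenset({1, 3}): 9.1}
partition = {frozenset({0, 1}), frozenset({2, 3})}
miles = dunn_like(dist, partition)
km = dunn_like({p: 1.609 * d for p, d in dist.items()}, partition)
assert abs(miles - km) < 1e-9   # units do not matter
```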
The Consistency axiom is the one undergoing a subtle change. Instead of requiring that the clustering stay the same when such a perturbation to the distances occurs, we only require that the score, or measure, of the clustering {\em improve} under such a transformation. This breaks the counterexample above: while the score of the 1-clustering improves as we shrink the distances, the score of the 2-clustering can improve even more and overtake it, so the optimum is free to change. Indeed, the authors demonstrate a number of different clustering measures that satisfy this set of axioms.
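A toy computation makes the crossover visible. The penalized score below is a made-up measure chosen only so the arithmetic is easy to follow, not one of Ackerman and Ben-David's measures. Every pairwise distance shrinks (a consistent change with respect to the 1-clustering), both scores improve, and yet the optimum legitimately flips to the 2-clustering.

```python
from itertools import combinations

def toy_score(dist, clustering, lam=0.35):
    """Toy quality score (higher is better): tightness minus a per-cluster penalty.

    s(C) = -(mean within-cluster distance)/(max pairwise distance) - lam*|C|
    The ratio keeps it Scale-Invariant; lam*|C| discourages over-splitting.
    (An illustration only -- not a measure from the paper.)
    """
    within = [dist[frozenset(p)] for c in clustering
              for p in combinations(sorted(c), 2)]
    mean_within = sum(within) / len(within) if within else 0.0
    return -mean_within / max(dist.values()) - lam * len(clustering)

one = {frozenset({0, 1, 2, 3})}
two = {frozenset({0, 1}), frozenset({2, 3})}

# Before: two loose pairs, cross distances barely larger -- one cluster wins.
before = {frozenset({0, 1}): 2.0, frozenset({2, 3}): 2.0,
          frozenset({0, 2}): 3.0, frozenset({0, 3}): 3.0,
          frozenset({1, 2}): 3.0, frozenset({1, 3}): 3.0}

# A change consistent w.r.t. the 1-clustering: every pairwise distance
# shrinks, but unevenly, so two tight well-separated groups emerge.
after = {frozenset({0, 1}): 0.1, frozenset({2, 3}): 0.1,
         frozenset({0, 2}): 2.9, frozenset({0, 3}): 2.9,
         frozenset({1, 2}): 2.9, frozenset({1, 3}): 2.9}

assert toy_score(before, one) > toy_score(before, two)  # 1-clustering optimal
assert toy_score(after, one) > toy_score(before, one)   # its score improves...
assert toy_score(after, two) > toy_score(after, one)    # ...but 2 overtakes it
```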
-- Sergei Vassilvitskii