The Geomblog: cs.LG

Showing posts with label cs.LG. Show all posts

Thursday, January 15, 2015

Brief Review of the post-SODA workshop on algorithmic challenges in machine learning.

My student +John Moeller attended the workshop on algorithmic challenges in machine learning organized by +Shachar Lovett and +Kamalika Chaudhuri after SODA. He wrote up a brief report of some papers that he liked.

Monday, October 06, 2014

Algorithms is the new Algorithms...

Hal Daumé wrote a rather provocative post titled ‘Machine Learning is the new algorithms’, and has begged someone to throw his gauntlet back at him. Consider it now thrown !

His thesis is the following quote:

Everything that algorithms was to computer science 15 years ago, machine learning is today

And among the conclusions he draws is the following:

we should yank algorithms out as an UG requirement and replace it with machine learning

Having spent many years having coffees and writing papers with Hal, I know that he truly does understand algorithms and isn’t just trying to be a troll (at least I hope not). So I’m trying to figure out exactly what he’s trying to say. It will help if you read his article first before returning…

First off, I don’t understand the conclusion. Why not (say) replace architecture with ML, or databases with ML. Or why replace anything at all ? the assumption is that ML is a core tool that students will need more than any other topic. Now I have no problem with adding ML to the list of “things a CS person ought to know”, but I don’t see why it’s not important for a CS person to understand how a database works, or how a compiler processes code, or even how we design an algorithm. This fake mutual exclusiveness appears to be without basis.

But maybe he’s saying that algorithms and ML are two flavors of the same object, and hence one can be replaced by the other. If so, what exactly is that object ? In his view, that object is:

given an input, what’s the best way to produce an (optimal) output ?

This seems to be an overly charitable view of ML. In ML, the goal is to, well, learn. Or to be more stodgy about it, ML provides tools for making systematic inferences and predictions from data.

Which suggests that the concerns are fundamentally orthogonal, not in opposition (and Sasho Nikolov makes this point very well in a comment on Hal’s post). As Hal correctly argues, the hard work in ML is front-loaded: modeling, feature extraction, and the like. The algorithm itself is mostly an afterthought.

But what’s ironic is that one of the most important trends in ML of late is the conversion of an ML problem to an optimization problem. The goal is to make good modeling choices that lead to an optimization problem that can be solved well. But wait: what do you need to know how to solve an optimization ? Wait for it…… ALGORITHMS !!

The argument about stability in algorithms is an odd one, especially coming from someone who’s just written a book on ML. Yes, some core algorithms techniques haven’t changed much in the last many years, but did you see that 2014 paper on improvements in recurrence analysis ? Or the new sorting algorithm by Mike Goodrich ? or even the flurry of new results for Hal’s beloved flow problems ?

As for “everything’s in a library”, I’ll take your boost graph library and give you back WEKA, or libSVM, or even scikit-learn. I don’t need to know anything from Hal’s book to do some basic futzing around in ML using libraries: much like I could invoke some standard hashing subroutine without knowing much about universal hash families.

In one sense though, Hal is right: ML is indeed where algorithms was 15 years ago. Because 15 years ago (approximately) is when the streaming revolution started, and with it the new interest in sub linear algorithms, communication complexity, matrix approximations, distributed algorithms with minimal communication, and the whole “theory of big data” effort. And ML is now trying to catch up to all of this, with some help with from algorithms folks :).

What is true is this though: it wouldn’t hurt us to revisit how we construct the core algorithms classes (undergrad and grad). I realize that CLRS is the canon, and it would cause serious heart palpitations to contemplate not stuffing every piece of paper of that book down students’ throats, but maybe we should be adapting algorithms courses to the new and exciting developments in algorithms itself. I bring in heavy doses of approximation and randomization in my grad algorithms class, and before we forked off a whole new class, I used to teach bits of streaming, bloom filters, min-wise hashing and the like as well. My geometry class used to be all about the core concepts from the 3 Marks book, but now I throw in lots of high dimensional geometry, approximate geometry, kernels and the like.

Ultimately, I think a claim of fake mutual exclusivity is lazy, and ignores the true synergy between ML and algorithms. I think algorithms research has a lot to learn from ML about robust solution design and the value of "noise-tolerance", and ML has plenty to learn from algorithms about how to organize problems and solutions, and how deep dives into the structure of a problem can yield insights that are resilient to changes in the problem specification.

Wednesday, April 18, 2012

Distributed Learning: A new model

Communication is now the key to modelling distributed/multicore computations. Jim Demmel has been writing papers and giving talks on this theme for a while now, and as processors get faster, and the cloud becomes a standard computing platform, communication between nodes is turning out to be the major bottleneck.

So suppose you want to learn in this setting ? Suppose you have data sitting on different nodes (you have a data center, or a heterogeneous sensor network, and so on) and you'd like to learn something on the union of the data sets. You can't afford to ship everything to a single server for processing: the data might be too large to store, and the time to ship might be prohibitive.

So can you learn over the (implicit) union of all the data, with as little discussion among nodes as possible ? This was the topic of my Shonan talk, as well as two papers that I've been working on with my student Avishek Saha, in collaboration with Jeff Phillips and Hal Daume. The first one will be presented at AISTATS this week, and the second was just posted to the arxiv.

We started out with the simplest of learning problems: classification. Supppose you have data sitting on two nodes (A and B), and you wish to learn a hypothesis over the union of A and B. What you'd like is a way for the nodes to communicate as little as possible with each other while still generating a hypothesis close to the optimal solution.

It's not hard to see that you could compute an $\epsilon$-sample on A, and ship it over to B. By the usual properties of an $\epsilon$-sample, you guarantee that any classifier on B's data combined with the sample will also classify A correctly to within some $\epsilon$-error. It's also not too hard to show a lower bound that matches this upper bound. The amount of communication is nearly linear in $1/\epsilon$.

But can you do better ? In fact yes, if you let the nodes talk to each other, rather than only allowing one-way communication. One way of gaining intuition for this is that $A$ can generate classifiers, and send them over to $B$, and $B$ can tell $A$ to turn the classifier left or right. Effectively, $B$ acts as an oracle for binary search. The hard part is showing that this is actually a decimation (in that a constant fraction of points are eliminated from consideration as support points in each step), and once we do that, we can show an exponential improvement over one-way communication. There's a trivial way to extend this to more than 2 players, with a $k^2$ blow up in communication for $k$ players.

This binary search intuition only works for points in 2D, because then the search space of classifiers is on the circle, which lends itself naturally to a binary search. In higher dimensions, we have to use what is essentially a generalization of binary search - the multiplicative weight update method. I'll have more to say about this in a later post, but you can think of the MWU as a "confused zombie" binary search, in that you only sort of know "which way to go" when doing the search, and even then points that you dismissed earlier might rise from the dead.

It takes a little more work to bring the overhead for k-players down to a factor k. This comes by selecting one node as a coordinator, and implementing one of the distributed continuous sampling techniques to pass data to the coordinator.

You can read the paper for more details on the method. One thing to note is that the MWU can be "imported" from other methods that use it, which means that we get distributed algorithms for many optimization problems for free. This is great because a number of ML problems essentially reduce to some kind of optimization.

A second design template is multipass streaming: it's fairly easy to see that any multipass sublinear streaming algorithm can be placed in the k-player distributed setting, and so if you want a distributed algorithm, design a multipass streaming algorithm first.

One weakness of our algorithms was that we didn't work in the "agnostic" case, where the optimal solution itself might not be a perfect classifier (or where the data isn't separable, to view it differently). This can be fixed: in an arxiv upload made simultaneously with ours, Blum, Balcan, Fine and Mansour solve this problem very neatly, in addition to proving a number of PAC-learning results in this model.

It's nice to see different groups exploring this view of distributed learning. It shows that the model itself has legs. There are a number of problems that remain to be explored, and I'm hoping we can crack some of them. In all of this, the key is to get from a 'near linear in error' bound to a 'logarithmic in error' bound via replacing sampling by active sampling (or binary search).

Saturday, February 19, 2011

Agnostic/Non-agnostic Learning and Nets/Samples of Range Spaces

There are deep connections between geometry and learning that go back to the origins of VC dimension and include concepts like support vector machines, lifting maps, kernel functions, and iterative reweighting. What is often difficult is the translation problem of going between machine learning concepts and the corresponding (often identical) geometric concepts. One example of this that I've talked about is the two different viewpoints on characterizing isometric embeddability in an $\ell_2$ space.

I strongly believe that with a better "dictionary", we might find a lot more cross-fertilization between geometry and learning, and this would be beneficial for both communities. The 2009 summer school on geometry and learning was a good example of this: alas, it almost completely coincided with SoCG 2009. This summer, there is again a summer school in geometry and learning just before SoCG 2011, and it's in conjunction with it (shameless plug: I'll be lecturing there), and I look forward to seeing more such events.

In this post, my postdoc Jeff Phillips and student Avishek Saha explore a fascinating connection between nets and samples in range spaces on the one hand, and agnostic/non-agnostic learning on the other. Avishek is a machine learning specialist, and Jeff is of course an expert in approximate geometry, sampling and uncertainty. So it is fitting that their joint conversation led to this post.

There is a deep connection between some concepts in combinatorial geometry and learning. It turns out that $\varepsilon$-approximations and $\varepsilon$-nets of range spaces correspond to sample complexity in agnostic learning and non-agnostic learning, respectively. We noticed this last fall when both reading a cute paper by Gandhi, Suri, and Welzl and a book by Anthony and Bartlett.

The geometry:
Given a data set $P$ and a family of subsets $\mathcal{A} \subset 2^P$, the pair $(P,\mathcal{A})$ is a range space. An $\varepsilon$-approximation of $(P,\mathcal{A})$ is subset $Q \subset P$ such that

$\max_{R \in \mathcal{A}} \left| \frac{|R|}{|P|} - \frac{|R \cap Q|}{|Q|} \right| \leq \varepsilon.$

An $\varepsilon$-net of $(P,\mathcal{A})$ is a subset $N \subset P$ such that for all $R \in \mathcal{A}$ with $|R| \geq \varepsilon |P|$ then the intersection $|R \cap Q| \geq 1$. Informally, the $\varepsilon$-approximation approximately preserves the density, and the $\varepsilon$-net hits every large enough subset.

The VC-dimension $v$ of a range space $(P,\mathcal{A})$ informally describes the description complexity of a range $R \in \mathcal{A}$. For instance, halfspaces in $\mathbb{R}^d$ have VC-dimension $d+1$. A random sample $Q \subset P$ of size $O(v/\varepsilon^2)$ is an $\varepsilon$-approximation of $(P,\mathcal{A})$ with constant probability, and a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$ is an $\varepsilon$-net of $(P,\mathcal{A})$ with constant probability.

Learning theory:
Agnostic and non-agnostic learning can also be described with respect to a range space $(P,\mathcal{A})$. We are given a data set $P$ where each element $p \in P$ is either labeled $+$ or $-$. We want to find a classifier $R \in \mathcal{A}$ such that all elements $p \in P$ labeled $+$ are in $R$ and all elements $p \in P$ labeled $-$ are not in $R$. Non-agnostic learning, also referred to as realizable learning, is the setting where we assume that there exists such a classifier $R^*$ in "concept class" $\mathcal{A}$ that has zero-error, it perfectly separates all $+$ points from $-$ points. Agnostic learning is the setting without this assumption of the existence of a zero-error classifier in $\mathcal{A}$, so the aim is to find a classifier $\hat R \in \mathcal{A}$ which misclassifies the fewest number of points. This weakened target function assumption can be alternatively viewed as traditional PAC learning that accommodates arbitrary noise.

The connection:
Now here is where the concepts start looking similar: In non-agnostic learning, if we are given a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$, and run an algorithm to perfectly classify $+$ and $-$ points from $N$, then the same classifier will misclassify at most $\varepsilon |P|$ points on $P$, with constant probability.
In agnostic learning, if we are given a random sample $Q \subset P$ of size $O(v/\varepsilon^2)$, and run an algorithm to find a classifier on $Q$ that misclassifies the fewest number of points, then the same classifier will misclassify at most $\varepsilon$ larger fraction of points on $P$, with constant probability.

Why are those bounds the same?

Lets first look at $\varepsilon$-nets and non-agnostic learning. The key is to construct a range space $(P,\mathcal{S})$ where each $T \in \mathcal{S}$ is the symmetric difference of two ranges $R^*,R^\prime \in \mathcal{A})$, that is it contains all points contained in exactly one of the two ranges. If $(P,\mathcal{A})$ has VC-dimension $v$, then $(P,\mathcal{S})$ has VC-dimension $O(v)$.
Now consider the optimal classifier $R^* \in \mathcal{A}$ of $P$, and any other classifier $R^\prime \in \mathcal{A}$ that misclassifies at most $\varepsilon |P|$ points of $P$. Then the symmetric difference of $R^*$ and $R^\prime$ is at most $\varepsilon |P|$ points, and corresponds with some range $T \in \mathcal{S}$. If $N \subset P$ is an $\varepsilon$-net of $(P,\mathcal{S})$, then any classifier $R^\prime \in \mathcal{A}$ learned perfectly on $(N,\mathcal{A})$, cannot misclassify more than $\varepsilon |P|$ points since it would then necessarily have too many points in the symmetric difference with the perfect classifier $R^*$ on $P$, and $N$ would contain at least one of them.

Now to understand $\varepsilon$-approximations and agnostic learning, we need to see why this same approach won't work. Namely, because there is, in general, no perfect classifier $R^\prime \in \mathcal{A}$ defined on a subset of $Q \subset P$. Thus when we look at the symmetric difference between $\hat R$ and $R^\prime$, we cannot argue that the set of misclassified points should be empty in $Q$ using $R^\prime$ if the difference is less than $\varepsilon |P|$, and can't apply the $\varepsilon$-net result.

But we can apply the stronger $\varepsilon$-approximation result in this agnostic setting. The difference between any two ranges $R, R^\prime \subset \mathcal{A}$ on $P$, that contain the same points in $Q$, is at most $\varepsilon |P|$. Thus the difference in the fraction of misclassified points in $P$ from the optimal range $\hat R$ found on $Q$ is no more than $\varepsilon$, since $\hat R$ only differs by an $\varepsilon$-fraction in the number of points it contains in $Q$ versus $P$.

An intuitive way to see this connection is to think of agnostic and non-agnostic learning in the context of presence and absence of noise. Non-agnostic learning is the ideal setting where we have no noise and standard VC-dimenion based learning bounds suggest that we need $O((v/\varepsilon)\log(1/\varepsilon))$ samples (an $\varepsilon$-net) to learn a target function with $\varepsilon$ classification error. On the other hand, agnostic learning accommodates noise by allowing points from either class to spill over the decision boundary. This implies that some of the randomly sampled points might be noisy and hence misleading for the purposes of learning the target classifier. To nullify this effect of noisy sampled points we need to oversample which increases the sample complexity by about an $(1/\varepsilon)$ factor (yielding an $\varepsilon$-approximation).

Wednesday, February 09, 2011

My talks at ITA and at the college of engineering in Montana State

This is the abstract for the talk I'm giving in brief at the ITA Workshop and more expanded at a College of Engineering colloquium at Montana State. Thanks to the ITA program commitee and Brendan Mumey (at Montana State) for the invitations.

Dimensionality reduction for distributions: the good and the bad

In many application areas, the natural representation of data is in the form of an empirical probability distribution. Documents are represented as normalized frequency counts of words, images are represented by color (or other feature) histograms, and speech signals can be represented by spectral histograms.

There are many natural and meaningful ways of measuring similarity (or distance) between such distributions, and these define different geometries in which we might wish to analyze collections of distributions for interesting patterns. However, a practical bottleneck is the high dimensionality of these representations: for example, an 256 x 256 image might be represented by a vector of over 1000 features, and a document might be represented as a sparse vector with hundreds of attributes.

Thus, a key problem is: can we reduce the dimensionality of collections of distributions to make data analysis tractable while preserving the distances in the collection ?

In this talk, I'll discuss a collection of recent results centered around this theme, that provide both good news (and bad) for dimensionality reduction on distributions in theory and in practice.

The above draws on information mostly from this paper with Arvind Agarwal and Jeff Phillips, and this work-in-progress with Rasmus Kyng and Jeff Phillips.

Friday, February 04, 2011

POTD: Reproducing Kernel Banach Spaces with the ℓ1 Norm

Reproducing Kernel Banach Spaces with the ℓ1 Norm

Guohui Song, Haizhang Zhang, and Fred J. Hickernell

http://arxiv.org/abs/1101.4388

Abstract:

Targeting at sparse learning, we construct Banach spaces B of functions on an input space X with the properties that (1) B possesses an l1 norm in the sense that it is isometrically isomorphic to the Banach space of integrable functions on X with respect to the counting measure; (2) point evaluations are continuous linear functionals on B and are representable through a bilinear form with a kernel function; (3) regularized learning schemes on B satisfy the linear representer theorem. Examples of kernel functions admissible for the construction of such spaces are given.

Notes:
This one probably requires some explanation, for the non-ML folks. Reproducing Kernel Hilbert spaces are the coin of the realm in machine learning and for good reason. They allow much of ML to be "ported" from linear classifiers to non-linear classifiers: the kernel mapping essentially linearizes (via lifting) the nonlinear classifiers so you can get the benefit of the nonlinearity while operating algorithmically in a linear world. Even though the induced Hilbert space is typically a function space and is therefore infinite-dimensional, the representer theorem allows us in most cases to operate in a finite dimensional space (where the dimension is bounded by the number of samples). From a metric embedding perspective, kernels completely characterize the class of metrics isometrically embeddable in Euclidean space.

So RKHSs are great. So what's the deal with this paper ? What it tries to do is combine the power of RKHSs with the regularity and sparsity properties guaranteed by $\ell_1$ norms. Even though your typical Banach space doesn't admit an inner product (what you need for the kernel mapping), they show that you can define special Banach spaces in which kernels can be defined as before, and the representer theorem holds, but that you can get sparse bases for solutions because of the nice $\ell_1$ properties.

I promised SHORT summaries, so I'm not going to go further. But the takeaway message here is the ability to extend the nice properties of RKHSs to Banach spaces. For completeness I should mention that there are other approaches that have tried to do this, but using different mathematical constructs that are in some way less well suited.

Wednesday, February 02, 2011

Sample Complexity for eps-approximations of Range Spaces

This is a guest post by Jeff Phillips, who among other things ponders the computation of $\epsilon$-samples for range spaces under uncertainty. Jeff is looking for permanent research positions right now, so uncertainty is a major part of his life !

At some point I had plans to write a blog post detailing a simpler proof of the number of random samples needed for an $\varepsilon$-approximation of a range space. This is a very powerful result that is simple to state (I'll define below) and simple to use. I view it as very geometric, but it originated in learning theory. I had used it in several papers, but until recently had not actually read the full proof, and even slightly less recently was unaware of the best bound. The original paper by Vapnik and Chervonenkis is hard to find, but I eventually found a proof in a learning-theory book by Anthony and Bartlett. The proof uses a series of tricks I was not aware of, and the notation is different from what I was used to.

After reading the 20+ page proof, I felt there was a simpler geometric way to think about the problem, and even sketched-out what I thought was a simpler proof. I thought that would be a great thing to post on this blog! But slowly things began to break down. I prepared a pdf writeup, that started at under one page. Then realizing gaps in the reasoning, it grew to 2-3 pages, and then 4 as I attempted to write the tighter result. In presenting this approach to the informal theory seminar here at Utah and realized one of the main geometric tricks I was hoping to use was insufficient for proving what was needed. Looking back at the Anthony and Bartlett proof, I finally realized why all of tricks and machinery they use are really needed!

Anyways, I'd like sketch a proof of the $\varepsilon$-approximation theorem and try to include intuition of why techniques are being used.

Let $P$ be a set of objects (i.e. points) and $\mathcal{R}$ be a set of subset of $P$ (i.e. ranges, like disks). Then $(P,\mathcal{R})$ is called a range space. We will concern ourselves with range spaces with bounded VC-dimension d, which implies that the total number of distinct ranges is at most $c |P|^d$ for some fixed constant $c$. Now a subset $Q$ of $P$ is an $\varepsilon$-approximation of $(P,\mathcal{R})$ (for $0 \le \varepsilon \le 1$) for all ranges $A \in \mathcal{R}$

$\left| \frac{|P \cap A|}{|P|} - \frac{|Q \cap A|}{|Q|} \right| \leq \varepsilon. \hspace{.2in} (1)$

The main result is that:

Theorem. If $Q$ is a random sample from $P$ of size $h = O((1/\varepsilon^2)(d + \log(1/\delta))$ then it is an $\varepsilon$-approximation of $(P,\mathcal{R})$ with probability at least $1-\delta$.

This was first proven by Talagrand in 94. The original result of Vapnik and Chervonenkis had a slightly worse bound of $h = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta))$. The main reason this is a powerful tool is that $h$ is independent of $|P|$. As I found out, even this weaker result is difficult to prove.

I'll also note, that I first became aware of this stronger result as a reduction from a more general result by Li, Long, and Srinivasan 01, as cited by Har-Peled and Sharir 07.

There is a nice progression to the full result through a series of weaker results.

Step 1.
We can show that a random sample of $r = (1/2 \varepsilon^2) \ln(2/\delta)$ points is sufficient to prove (1) for any one fixed range, by a Chernoff-Hoeffding (C-H) bound. Then since there are only $c |P|^d$ distinct ranges, by the union bound, we only need $h_1 = (1/2 \varepsilon^2)(d \ln |P| + \ln(2c/\delta))$ samples to hold for all ranges.

Note that $h_1$ is a function of $|P|$, we want to get rid of this dependence.

This C-H + union bound will be used again. Here the C-H bound has $h_1$ events, one for each random sample, we will reduce this in Step 3. The dependence on $|P|$ is through the union bound, we improve this in Step 2.

Step 2.
To get rid of the dependence on |P| we need to apply the union bound on a smaller number of ranges. We use a technique called symmetrization, where we consider two random samples $S$ and $Q$ of $P$, both of size $h_2$. Then we compare them against each other with the goal of showing for all $A \in \mathcal{R}$

$\left| \frac{|Q \cap A|}{|Q|} - \frac{|S \cap A|}{|S|} \right| \leq \varepsilon/2. \hspace{.2in} (2)$

We can intuitively see that showing (2) holds with probability $\geq 1-\delta$ is sufficient to show (1): If $h_2$ is too small, it is likely both $Q$ and $S$ have at least some range $A$ in which they are far from $P$, but since there are many ranges, it is likely that they are far on different ranges. Thus if $h_2$ is too small $Q$ and $S$ are likely different from each other on some range $A$ also. Likewise, if $h_2$ is large enough so $Q$ and $S$ are close on all ranges, then it is likely they are close to $P$ on each of those ranges too.

Another technical trick is required to show (2) holds. We correspond each element in $q_i \in Q$ with an element of $s_i \in S$. Then it is sufficient (once we have sampled $Q$ and $S$) to show (2) occurs with probability $\geq 1-\delta$ w.r.t. possible swaps of ($s_i \in Q$ and $q_i \in S$) vs. ($s_i \in S$ and $q_i \in Q$), for all corresponding pairs.

Since the events in the permutation have bounded effect on each range, and there are only $c |Q \cup S|^d$ ranges we need to consider, we can apply the C-H + Union bounds as in Step 1, but now we only have $O((d/\varepsilon)^d)$ events (independent of $|P|$) and we get the original V-C bound of $h_2 = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta))$.

Another neat effect of symmetrization is that it allows us to consider continuous sets $P$. The analysis in Step 1 required us to compare $Q$ to all ranges defined by $P$, which would have been infinite if $P$ were continuous. But now since we compare two random samples from $P$, even if it is continuous, the random samples are finite, and hence induce a finite number of ranges.

Step 3.
We now show how to get rid of the extra $\log(d/\varepsilon)$ term. We do this through a technique called chaining which allows us to decompose ranges and change the type of events we measure with the C-H bound.

Using symmetrization consider two subsets $Q$ and $S$ of $P$ and let $P^\prime = Q \cup S$. We define a series of range spaces $(P^\prime, \mathcal{T}_0)$, $(P^\prime, \mathcal{T}_1)$, $\ldots$ so that

(a) each range $A \in \mathcal{R}$ is a disjoint union $A \bigcup_j B_j$ where each $B_j \in \mathcal{T}_j$.
(b) each $B_j \in \mathcal{T}_j$ is of size $|B_j| \leq |P^\prime|/2^j$.
(c) $Q$ is an $\varepsilon_j$-approximation of $(P^\prime, \mathcal{T}_j)$ with probability $\geq \delta_j$ where $\varepsilon_j = \sqrt{j/2^j}/12$ and $\delta_j = \delta/2^{j+2}$.

Fact (a) allows us to analyze $(P^\prime,\mathcal{R})$ (with $\varepsilon, \delta$) by considering each $(P^\prime,\mathcal{T}_j)$ separately (with $\varepsilon_j, \delta_j$) since by (c) $\varepsilon \leq \sum_j \varepsilon_j$ and $\delta \leq \sum_j \delta_j$.

From (c) the extra $\sqrt{j}$ term in $\varepsilon_j$ cancels out the $\log(1/\varepsilon)$ term in the technical analysis leading to $h = O((1/\varepsilon^2)(d + \log(1/\delta))$ as desired.

To prove (c) we need to use (b). The key insight is that for large $j$, each range $B_j$ is small. We can then apply the C-H bound using a different type of event which there are fewer of. For a range $B_j$, instead of measuring the event if $q \in Q$ is inside $B$ (as in Step 1 and 2) we measure if some $b \in B_j$ is in $Q$ -- thus there are only $|B_j| \leq |P'|/2^j$ events. Since $|B_j|$ decreases geometrically as $j$ increases, we can also decrease $\varepsilon_j$ and $\delta_j$ geometrically as $j$ increases, proving (c).

It only remains to show the existence of each $(P^\prime, \mathcal{T}_j)$.

We first consider a series of range spaces $(P^\prime, \mathcal{R}_j)$ where $\mathcal{R}_j$ correspond to all ranges in $(Q_j, \mathcal{R})$ where $Q_j$ is an $(1/2^j)$-approximation of $(P^\prime,\mathcal{R})$. We know $|Q_j| = O((1/\varepsilon^2) d \log(d/\varepsilon))$ by Step 2.
Then $\mathcal{T}_j$ is defined inductively as follows. $\mathcal{T}_0 = \mathcal{R}_0$. Each $B \in \mathcal{T}_j$ corresponds to a range $A \in \mathcal{R}_j$. We know there exists another range $A^\prime \in \mathcal{R}_{j-1}$ such that $A$ and $A^\prime$ differ by at most $|P^\prime|/2^j$ points; let $B$ be those points.

Note: Most proofs construct $\mathcal{T}_j$ using a combinatorial result of Haussler 95, but this approach does not affect the asymptotics of the final solution and may actually improve the constants.

Wednesday, February 13, 2008

Metric spaces, VC-dimension, and metric entropy.

For a problem that I've been working on, it turns out that if a related range space has bounded VC-dimension, the problem can be solved exactly (but with running time exponential in the dimension). The range space is constructed from two parameters: a metric space (X, d), and a radius e, and consists of the domain X, and all balls of radius at most e in X.

So a natural question that I've been unable to answer is:

What properties of a metric space ensure that the induced range space has bounded VC dimension ?

Most of what we do know comes from the PAC-learning community. For instance, the doubling dimension of a metric space is the smallest number d such that any ball of radius e can be covered by at most 2^d balls of radius e/2. In recent years, it has been popular to explore the extent to which

"metric space of bounded doubling dimension == bounded dimensional Euclidean space"

is true. Unfortunately, there are spaces of bounded VC-dimension that do not have bounded doubling dimension.

Another related proxy for the "dimension" of a metric space is its metric entropy: Determine the minimum number of balls of radius e needed to cover all points in a metric space. The log of this number is the metric entropy, and among other things is useful as a measure of the number of points needed to "cover" a space approximately (in a dominating set sense). It's known that the metric entropy of concept classes is closely related to their VC-dimension, (where the underlying metric is symmetric difference between the classes). I am not aware of any general result that relates the two though.

For more on the dizzying array of numbers used to describe metric space dimension, do read Ken Clarkson's magnificent survey.

[On a side note, I don't quite understand why the term "entropy" is used. It seems to me that if one wanted to use entropy, one would compute the entropy of the resulting set of balls, rather than merely the log of their number. ]

The Geomblog

Pages