
Monday, September 09, 2013

Life at Simons

There have been many posts on the technical talks happening at Simons (see +Moritz Hardt's latest on gradient descent if you haven't yet). Not as many yet on the less-formed discussions that are happening in and around the building, but maybe it's too soon for that.

In the meantime, I thought it might be fun to describe "a day at Calvin Hall".

I usually bike in after 9 am. Biking in Berkeley is not for the (literally) weak-kneed, and I'm discovering how weak my knees actually are. I haven't yet found a good site to plot elevation gains and losses for an arbitrary route, but this one is pretty decent.

The Institute lends out bikes if you don't have your own. It's a great way to get around if you're staying near campus, because parking is difficult and expensive.

Of course, the next step is:


It's a great machine. Only downside is that it shuts down at 5pm when the building does. But there are plenty of cafes around for later.

Then of course the decision: which talks should I feel guilty about missing today ? There are A LOT OF TALKS going on at the various workshops/boot camps. I feel embarrassed complaining about this, because we have such a stellar crowd of speakers and such fascinating talks. But there are almost too many talks: I have to shut down the feelings of guilt in order to find big chunks of time to think (or write this blog post <ahem>).

Thankfully, there's the live stream. I've already mentioned the Simons livestream on G+, but it's worth mentioning again. If you're late to the start of any talk you can watch it on the high quality live feed. If you can't watch it live you can watch it as soon as it's done on the Ustream site for the event. And a few days after that the talks are archived on the Simons site directly. There's also this cute feature where the live stream plays the most recent talk during coffee breaks. I know people outside the Institute are watching talks live because someone pinged me on twitter to ask the speaker to repeat questions they get from the audience.

So I've either attended talks or I've spent the morning brainstorming on some of the projects I have going. It's now time for lunch: thankfully, we're walking distance from both Northside and Southside Berkeley, which means a huge variety of eating places all within 5-10 minutes walking. All we need to do is avoid the huge class let-out at 12:20 or so. On Tuesdays we have a lunch seminar (organized by +Moritz Hardt) as part of the big-data program.

Once we're back, it's time for this:

Did I mention that the coffee machine is awesome ?

Usually, discussions start picking up in the afternoon, and you'll often find me here:

3:30 is time for the daily tea, at which we get all kinds of nice cookies, and the occasional cake/ice cream (on Thursdays). This is much like the Dagstuhl tea time, but without the fancy tortes (hint hint!). Of course, no daily tea is complete without the addition of this:

By five, people are starting to trickle out slowly. On Tuesdays we have a happy hour at a local bar ($3 pints!). And it's time for me to check the sad state of my knees for the ride back, which is hard because it's uphill both ways !

Friday, September 06, 2013

More camping with high dimensional boots...

I discovered today that you don't have to wait for talk videos to be posted on the Simons website. All videos are live streamed via ustream, and they have their own channel for the boot camp talks, where you can watch the videos immediately after they're streamed.

Martin Wainwright gave a 3 hour presentation on high dimensional statistics. Michael Jordan's talk earlier was good preparation for this, just to get familiar with basic concepts like the risk of an estimator and minimax theory.

Wainwright's talks were much denser, and it would be hard to do an intelligent summary of the entire presentation. The central theme of his talk was this:

In classical statistics, we can evaluate the quality of an estimator as $n \rightarrow \infty$ using standard asymptotic methods, and for the most part they are well understood (convergence, rates of convergence, sampling distributions, and so on). But in all these results, it's assumed that the data dimensionality stays fixed. Suppose it doesn't though ?

In particular, suppose you have a situation where $d/n \rightarrow \alpha > 0$ (this notion was first introduced by Kolmogorov). For example, suppose $\alpha = 0.5$. What happens to the behavior of your estimation methods ?  He worked out a few simple examples with experiments to show that in such cases, classical asymptotic bounds fail to capture what's really going on. For example, suppose you wanted to estimate the mean of a distribution and you used the sample mean, relying on the central limit theorem. Then it turns out that in the $d/n \rightarrow \alpha$ regime, the convergence is slowed down by the parameter $\alpha$.
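To make this concrete, here's a tiny numpy simulation of my own (not something from the talk): with $d = \alpha n$, the squared error of the sample mean hovers around $\alpha$ no matter how large $n$ gets.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5                             # the ratio d/n

for n in [100, 1000, 10000]:
    d = int(alpha * n)
    theta = np.zeros(d)                 # the true mean
    X = rng.normal(theta, 1.0, size=(n, d))      # n iid samples from N(theta, I_d)
    err = np.sum((X.mean(axis=0) - theta) ** 2)  # squared error of the sample mean
    print(n, d, round(err, 3))          # stays near alpha = 0.5 instead of going to 0
```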

Another example of this problem is estimating a covariance matrix. Assume you have a sample of iid Gaussian random variables drawn from $N(0, I_{d \times d})$, and you want to use the sample covariance to estimate the population covariance (in this case, the identity matrix). You can look at the distribution of eigenvalues of the resulting matrix (which you expect to be sharply concentrated around 1) and in fact you get a much more spread out distribution (this is known as the Marchenko-Pastur distribution). You can show that the maximum singular value of the matrix is no bigger than $1 + \sqrt{d/n}$ with high probability. But this error term $\sqrt{d/n}$ does not go away.
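Here's a quick simulation of my own (again, not from the talk) of that covariance example: with $d/n = 1/2$, the eigenvalues of the sample covariance spread out over roughly $[(1 - \sqrt{d/n})^2, (1 + \sqrt{d/n})^2]$ instead of concentrating at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 500                        # d/n = 0.5
X = rng.standard_normal((n, d))         # n iid draws from N(0, I_d), stacked as rows
sample_cov = X.T @ X / n                # sample covariance (population covariance is I_d)
eigs = np.linalg.eigvalsh(sample_cov)

upper = (1 + np.sqrt(d / n)) ** 2       # Marchenko-Pastur upper edge, ~2.91 here
lower = (1 - np.sqrt(d / n)) ** 2       # lower edge, ~0.09
print(eigs.min(), eigs.max(), lower, upper)   # eigenvalues spread far from 1
```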

If the data is indeed high dimensional, is there low-dimensional/sparse structure one can exploit to do inference more efficiently ? This gets us into the realm of sparse approximations and compressed sensing. He spent some time explaining why sparse recovery via the LASSO is actually possible, and described a condition called the "restricted null space property" that characterizes when exact recovery can be done (this property is implied by the RIP, but is strictly weaker).
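As a toy illustration of the kind of recovery being discussed (a sketch of my own, assuming scikit-learn is installed; the problem sizes and regularization are ad hoc choices, not anything from the talk):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, k = 200, 500, 5                    # n observations, d features, a k-sparse signal
A = rng.standard_normal((n, d)) / np.sqrt(n)    # random design with roughly unit-norm columns

x_true = np.zeros(d)
support = rng.choice(d, k, replace=False)
x_true[support] = rng.choice([-5.0, 5.0], k)    # well-separated nonzero coefficients
y = A @ x_true + 0.01 * rng.standard_normal(n)  # nearly noiseless observations

lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=50000).fit(A, y)
support_hat = np.flatnonzero(np.abs(lasso.coef_) > 1e-3)
print(sorted(support), sorted(support_hat))     # the two supports should (typically) coincide
```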

In the second part of the talk he talked more generally about so-called regularized M-estimators, and how one might prove minimax bounds for parameter estimation. Again, while the specifics are quite technical, he brought up one point in passing that I think is worth highlighting.

When doing parameter estimation, the "curvature" of the space plays an important role, and has done so since the Cramer-Rao bound, the idea of Fisher information and Efron's differential-geometric perspective. The idea is that if the optimal parameter lies in a highly curved region of loss-parameter space, then estimation is easy, because any deviation from the true parameter incurs a huge loss. Conversely, a region of space where the loss function doesn't change a lot is difficult for parameter estimation, because the parameter can move significantly while barely changing the loss.

Once again, this is in sharp contrast to how we view the landscape of approximations for a hard problem. If we have a cost function that varies gently over the space, then this actually makes approximating the function a little easier. But a cost function that has sharp spikes is a little trickier to approximate, because a small movement away from the optimal solution changes the cost dramatically.

There are some problems with this intuition. After all, a sharp potential well is good for gradient descent methods. But the difference here is that in estimation, you only care about the loss function as a tool to get at the true goal: the desired parameter. However, in much of algorithm design, the loss function IS the thing you're optimizing, and you don't necessarily care about the object that achieves that optimal loss. This goes back to my earlier point about the data versus the problem. If optimizing the loss is the end goal, then a certain set of tools comes into play. But if the goal is to find the "true" answer, and the loss function is merely a means to this end, then our focus on problem definition isn't necessarily helpful.




Wednesday, September 04, 2013

Statistics, geometry and computer science.

Geometry is the glue between statistics and computer science. -- Michael Jordan

That's why everyone gets stuck in it. -- Graham Cormode

Yesterday was the first day of the "Big Data boot camp" at the Simons Institute. The idea behind these boot camps is to get participants in a program "on the same page". In a program like ours, with statisticians, ML folks, algorithms people and optimizers all mingling together, getting "on the same page" with "the same language" is critical. For a more general overview of the first day activities, see Muthu's post.

Michael Jordan opened the proceedings with a talk titled "Big Data: The computation/statistics interface". From a CS perspective, it was an excellent talk that hit just the right level of detail for me to grasp some basic terminology in statistics, as well as understand some of the questions people are pondering right now.

We think of big data as a PROBLEM. More data, more scaling issues, O(n) becomes infeasible, and so on. In particular, we think of large data as a workload that taxes our resources: time, space, communication, and so on.

Michael presented a counterpoint. From a statistical perspective, more data is better. Estimators converge, error reduces, and the law of large numbers kicks in. Rather than treating data as something that we have to manage efficiently, can we think of data (quality) as a resource itself that can be managed efficiently ? In other words, can we trade off estimator accuracy against other resources ?

He proceeded to give a wonderful exposition of the basics of statistical decision theory, including the frequentist and Bayesian views of estimator risk. Along the way, he described a geometric interpretation of the James-Stein estimator which is too neat to skip.

The James-Stein estimator is really a very counterintuitive object. Consider a set of samples $X_i \sim N(\theta_i, 1), i \le n$, where the $\theta_i$ are unknown parameters. The goal is to use the samples $X_i$ to estimate the $\theta_i$, minimizing the loss function $\sum_i (\theta_i - \hat{\theta_i})^2$. The $\theta_i$ are assumed unrelated to each other.

The "obvious" estimator is $\hat{\theta_i} = X_i$. It's the MLE after all. But it turns out that this estimator is strictly dominated (in terms of expected loss aka risk) by the so-called shrinkage estimator:
\[ \hat{\theta_i} = (1 - \frac{c}{S^2}) X_i \]
where $S^2 = \sum_j X_j^2$.

What's surprising about this is that the estimator for a single $\theta_i$ takes into account samples from other $X_i$, even though the $\theta_i$ are assumed to be unrelated.
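Here's a quick simulation of my own (not from the talk) comparing the two estimators; with the classical choice $c = n - 2$ (for $n \ge 3$), the shrinkage estimator has visibly smaller risk than the MLE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 20000                   # n unrelated parameters theta_i, many repeated experiments
theta = rng.normal(0, 1, n)             # some fixed (unknown, in principle) true parameters

X = rng.normal(theta, 1.0, size=(trials, n))          # each row: one draw of X_i ~ N(theta_i, 1)
mle_risk = np.mean(np.sum((X - theta) ** 2, axis=1))  # risk of theta_hat_i = X_i (equals n)

c = n - 2                               # the classical James-Stein choice of the constant
S2 = np.sum(X ** 2, axis=1, keepdims=True)            # S^2 = sum_j X_j^2
js = (1 - c / S2) * X                   # the shrinkage estimator
js_risk = np.mean(np.sum((js - theta) ** 2, axis=1))

print(mle_risk, js_risk)                # js_risk comes out noticeably below mle_risk
```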

It turns out that there's an elegant geometric interpretation of this estimator, first attributed to Galton by Stephen Stigler. Consider the pairs $(X_i, \theta_i)$ as points in the plane. We don't quite know where these points lie because we don't know what $\theta_i$ is. Because the $X_i$ are normally distributed around the $\theta_i$, each point is really a horizontal interval of interest.

Now the standard estimator $\hat{\theta_i} = X_i$ arises from trying to solve a regression problem of $X$ versus $\theta$, or more precisely solving the regression $\hat{X} = E(X\mid \theta)$. But really, what we're trying to do is solve the regression $\hat{\theta} = E(\theta \mid X)$. In other words, regression of $y$ against $x$ is different from the regression of $x$ against $y$, as long as we have at least three points. And this is precisely what the JS estimator says.
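The difference between the two regressions is easy to see numerically; here's a toy sketch of my own (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(0, 3, 50)            # the "true" values
X = theta + rng.normal(0, 1, 50)        # noisy observations of the theta_i

slope_X_on_theta = np.polyfit(theta, X, 1)[0]   # regression X-hat = E(X | theta): slope near 1
slope_theta_on_X = np.polyfit(X, theta, 1)[0]   # regression theta-hat = E(theta | X): slope below 1

print(slope_X_on_theta, slope_theta_on_X)
# The second regression shrinks estimates towards the mean -- the flavor of the JS estimator.
```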

Returning to the question he started the talk with, can we show a formal tradeoff between data estimation accuracy and sampling cost ? In a recent paper with Venkat Chandrasekaran from Caltech, they show a very nice (and counterintuitive) result: that using a cruder relaxation of an optimization problem can actually lead to more efficient estimation as long as you have sufficient data. Note that this goes against the standard idea that a crude relaxation is "worse" in an approximation sense.

The idea is as follows. Consider the very simple problem of denoising where the observations $\mathbf{y}$ are generated from the input $\mathbf{x}$ by a noise perturbation:
\[ \mathbf{y} = \mathbf{x} + \sigma \mathbf{z} \]
where $\mathbf{z}$ is normally distributed. Let us assume that $\mathbf{x}$ is drawn from some set $S$ (for example, $S$ is the set of low-rank matrices, or the set of permutation matrices).

The simplest estimator for $\mathbf{x}$ is an $\ell_2$ projection: compute the sample mean $\overline{\mathbf{y}}$ and then find its projection onto $S$. But this involves a minimization over $S$, which might be intractable.

We can relax $S$ to some $S'$, where $S \subset S'$ and minimization becomes easier. But this would worsen the approximation ratio of the optimization depending on the relaxation. And here's where the insight comes in.

Suppose you instead look at the statistical risk of this estimator. In other words, look at the expected difference between the true $\mathbf{x}^*$ and the estimated $\hat{\mathbf{x}}$. The main result they show in this paper (paraphrased) is
\[ \text{expected risk} \;\le\; \frac{C}{n} \cdot \big(\text{geometric complexity of } S \text{ at } \mathbf{x}^*\big) \]
where $n$ is the number of samples.

Suppose we fix the desired risk. Then an increase in $n$ can be used to "pay for" increased "geometric complexity". And here's where the final idea comes in. The "geometric complexity" used here is the Gaussian-squared complexity, which is defined as
\[ g(D) = E[ \sup_{x \in D} \langle x, z\rangle^2 ] \]
where $z$ is normally distributed.

In particular, the set $D$ used in the above expression is the set of directions from $\mathbf{x}^*$ to points in $S$. Suppose $S$ is very "pointed" at $\mathbf{x}^*$. Then the Gaussian-squared complexity is small and the number of samples needed is small. But if instead we use a relaxation $S'$ that is "blunt", the Gaussian complexity goes up, and we need more samples to keep the risk fixed. If the optimization for $S'$ is significantly easier than the corresponding problem for $S$, then we still win, even though $n$ has increased, and even though the classical "approximation ratio" might have become worse.
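To put a (cartoonishly simple) number on the pointed-versus-blunt intuition, here's a Monte Carlo sketch of my own, with deliberately extreme choices of the direction set (not the actual sets from the paper): a single direction has Gaussian-squared complexity about 1, while the whole unit sphere has complexity about $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 100, 5000
Z = rng.standard_normal((trials, d))    # draws of z ~ N(0, I_d)

# A maximally "pointed" direction set: a single unit vector v.
v = np.zeros(d)
v[0] = 1.0
g_pointed = np.mean((Z @ v) ** 2)       # estimates E[<v, z>^2] = 1

# A maximally "blunt" direction set: the whole unit sphere,
# for which sup_x <x, z>^2 is exactly ||z||^2.
g_blunt = np.mean(np.sum(Z ** 2, axis=1))   # estimates E[||z||^2] = d

print(g_pointed, g_blunt)               # ~1 versus ~100: the blunter set needs far more samples
```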

In the final part of his talk, Michael talked about his recent work on the "bag of little bootstraps", which Muthu also covered in his post. This post is already too long, so I'll defer this to another time.

Tuesday, August 27, 2013

And now for something different...

People who know me through my blog probably assume I have no life at all outside blogging and big data.

oh wait...

But more seriously, my family is here with me during this sa-battle-cal and +Karen Ho is ably keeping up the non-work blogginess. If you're interested in our adventures outside big theories of data, check out her blog Simply Batty: you might even learn what "academic headcheese" is.

Friday, August 23, 2013

Simons Institute opening, with pictures.

Today was the official kickoff for the Simons Institute, as well as the start of the two special programs in big data and real analysis. For the last few days I've been busy getting my paperwork organized over at Calvin Hall, which is a beautifully redone circular building named after the Nobel Laureate and Berkeley professor (more on the building here).

The inside is very conducive to collaborative work. The offices are organized around the periphery of the building, and are functional, but not overly so. The idea is that people will congregate in the large interior open spaces that are covered with whiteboards and comfy chairs.

This is what the interior looks like: (apologies for the picture quality)


The second floor open lounge


Comfy chairs in front of a whiteboard


even more chairs and more whiteboards


Let us not forget the very fancy coffee machine (which makes nice espresso)



and of course you need your Simons Institute mug to drink it in.


This is the main auditorium for workshops, on the first floor.

The next plan is to pave the outer walls of the building at ground level with chalkboards, so people can think and chat outside as well as inside. Ah, Berkeley weather. I'm told the boards are already there, and just need to be installed, so I hope to have pictures soon. 

There are 93 visitors between the two programs (53/39 by my rough count), including 22 Fellows. Offices are either two- or three-person, and there are some large cubicle spaces with 7+ inhabitants. The visitors are all on the second and third floors of the building, which both have large open areas (the pictures above are from the second floor).

Wednesday, August 14, 2013

The Sa-battle-cal starts..

I'm off on sabbatical, and we (two cars, two adults, two children and one cat) just started the long drive to Berkeley from SLC. This has turned out to be more exciting than I anticipated...

Our original plan was to drive 4 hours each day, making up the 12 hour drive to Berkeley in three days. With two drivers for two cars, this seemed like the best option to prevent us from getting over-tired.

Right at Wendover (Las Vegas for poor people!), about halfway on our first leg, my car broke down. Thankfully, I was able to coast it to a mechanic's shop just off the highway as we entered town. I had no clue what the problem was, and the mechanic wouldn't be able to take a look at it for a few hours (this is the week of the Bonneville races on the salt flats).

So I called my mechanic back in Salt Lake, and described the problem to him. He diagnosed it on the spot as a faulty ignition coil, which is apparently an easy part to procure and replace.... if you're near a dealership.

Which I was not...

He also needed me to figure out which coil was failing, which needed a computer scanner, which needed a mechanic.

So....

Here's what needed to happen, in short order. It was now 5pm. We (my wife and I) needed to

  • Get the mechanic to at least scan the car to get the appropriate error codes
  • Call the car parts store (closing at 6) to see if they could procure the needed parts
  • Find a hotel in Wendover (did I mention this was the week of the Bonneville Races, and almost everything in town was booked?)
  • Change our reservations downstream because we were stuck in Wendover. 
Thankfully, through a series of small miracles and the generosity of many strangers and non-strangers, we managed to get all of this done. My mechanic even offered to walk me through doing the installation myself once I did the appropriate Youtube self-study (MOOCs FTW !!)

Long story short, it's now day II of our trek. We doubled up the driving today to make up for lost time, and we're back on schedule (minus one set of car keys that I managed to lose in all the hurry).

So far, this sabbatical is certainly not relaxing. 

p.s. Shout out to Utah Imports of SLC, and S&R auto repair in Wendover.

p.p.s. I-80 through Nevada is one of the most mind-numbingly boring drives imaginable.

Monday, August 05, 2013

Hi ho, hi ho, it's off to sabbatical we go...

It is now 7 days and counting before I head off on sabbatical, not to return till August 2014. I'll be heading to the Simons Institute to think big thoughts about the theory of data (or was it big data about theory, or maybe just the theory of big data). After that, I'll be enjoying the fine life at Google for a semester, followed by a visit to MADALGO in Aarhus (because big data in Europe just tastes better, you know).

I've been using this summer to nurse my psychic wounds from six years of the grind, and gently slide into sabbatical mode. The rush of joy that comes every time I delete a departmental email without reading beyond the subject tells me I'm ready :).

So far, advice I've received on how to conduct myself during a sabbatical includes:

  • Don't plan too much
  • Work very hard, but on something other than your current research
  • Have an adventure, and if work happens, don't blame yourself. 
  • Say NO repeatedly (this also applies to life post-tenure, apparently). I maybe took this advice a little too literally and managed to decline a review request in about 0.5 seconds, which surprised (and possibly annoyed) the editor who had made the request. 
  • Do something different (or do something that you've been meaning to do for years but never got a chance to). 
What else ? 
