
Friday, January 20, 2012

The Shonan Meeting (Part 3): Optimal Distributed Sampling

Reservoir sampling is a beautiful gem of sampling: easy to explain, and almost magical in how it works. The setting is this:
A stream of items passes by you, and your goal is to extract a sample of size $s$ that is uniform over all the elements you've seen so far.
The technique works as follows: if you're currently examining the $i$-th element, select it with probability $s/i$ (so $1/i$ for a sample of size one), and then pick an element uniformly from the current sample to be replaced by it. I've talked about this result before, including three different ways of proving it.
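As a concrete (if illustrative) sketch, here's the classic scheme in Python, in its size-$s$ form where the $i$-th item is kept with probability $s/i$; the function name is mine:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform sample of size s over a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if len(sample) < s:
            sample.append(item)                 # fill the reservoir first
        elif random.random() < s / i:           # keep the i-th item w.p. s/i
            sample[random.randrange(s)] = item  # evict a uniformly chosen victim
    return sample
```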

But what happens if you're in a continuous distributed setting ? Now each of $k$ players is reading a stream of items, and they all talk to a coordinator who wishes to maintain a random sample of the union of the streams. Let's assume for now that $s \le k$.

Each player can run the above protocol and send an item to the coordinator, and the coordinator can pick a random subset from these. But this won't work ! At least, not unless each player has read in exactly the same amount of data as each other player. This is because we need to weight the sample element sent by a player with the number of elements that player has read.

It's not hard to see that each player sends roughly $\log n$ messages to the coordinator for a stream of length $n$. So maybe each player also annotates the element with the number of elements it has seen so far. This sort of works, but the counts could be off significantly, since a player that doesn't send a sample might have read many more elements since the previous time it sent an update.

This can be fixed by having each player send an extra control message when its stream increases in size by a factor of 2, and that would not change the asymptotic complexity of the process, but we still don't get a truly uniform sample.

The problem with this approach is that it's trying to get around knowing $n$, the size of the stream, which is expensive to communicate in a distributed setting. So can we revisit the original reservoir method in a 'communication-friendly' way ?

Let's design a new strategy for reservoir sampling that works as follows.
Maintain a current "threshold" t. When a new item arrives, assign it a random value r between 0 and 1. If r < t, keep the new item and set t = r, else discard it.
By using the principle of deferred decisions, you can convince yourself that this does exactly the same thing as the previous strategy (because at step $i$, the probability of the current element being retained is the probability that its $r$ is the minimum over the set seen so far, which is $1/i$). The good thing is that this approach doesn't need to know how many elements have passed by so far.
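For concreteness, here's the size-one version of the threshold strategy in Python (names are illustrative); note that it never looks at a counter:

```python
import random

def threshold_sample(stream):
    """Keep the item whose random tag is the minimum seen so far."""
    t = 1.0            # current threshold
    current = None
    for item in stream:
        r = random.random()
        if r < t:      # new minimum: retain this item, tighten the threshold
            t = r
            current = item
    return current
```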

This approach can be extended almost immediately to the distributed setting. Each player now runs this protocol instead of the previous one, and every time the coordinator gets an update, it sends out a new global threshold (the minimum over all thresholds sent in) to all nodes. If you want to maintain a sample of size $s$, the coordinator keeps $s$ of the $k$ elements sent in, and the overall complexity is $O(ks \log n)$.
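A toy sequential simulation of this broadcast protocol might look like this (names are mine, and a real execution would interleave the $k$ streams rather than visit them in turn):

```python
import heapq
import random

def coordinated_sample(streams, s):
    """Broadcast-threshold sketch: players forward only items whose random
    tag beats the global threshold; the coordinator keeps the s items with
    the smallest tags and rebroadcasts the threshold after every update."""
    heap = []            # coordinator's s smallest-tagged items, as (-tag, item)
    threshold = 1.0      # global threshold, shared by all players
    messages = 0
    for stream in streams:                   # in reality the streams interleave
        for item in stream:
            r = random.random()
            if r >= threshold:
                continue                     # filtered locally: nothing is sent
            messages += 1                    # one message player -> coordinator
            heapq.heappush(heap, (-r, item))
            if len(heap) > s:
                heapq.heappop(heap)          # evict the item with the largest tag
            if len(heap) == s:
                threshold = -heap[0][0]      # broadcast the s-th smallest tag
    return [item for _, item in heap], messages
```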

But you can do even better.

Now, each player maintains its own threshold. The coordinator doesn't broadcast the "correct" threshold until a player sends an element whose random value is above the global threshold. This tells the coordinator that the player had a stale threshold, and it then updates that player (and only that player).
Analyzing this approach takes a little more work, but the resulting bound is much better:
The (expected) amount of communication is $O(k \frac{\log (n/s)}{\log (k/s)})$
What's even more impressive: this is optimal !
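To see the mechanics, here's a hedged sequential sketch of the lazy variant (again with illustrative names); the key invariant is that a stale local threshold is only ever too high, so nothing that belongs in the true sample is ever filtered away:

```python
import heapq
import random

def lazy_threshold_sample(streams, s):
    """Lazy-threshold sketch: each player keeps its own (possibly stale)
    threshold; the coordinator corrects a player only when that player
    forwards an item whose tag no longer beats the true global threshold."""
    k = len(streams)
    local = [1.0] * k                # each player's view of the threshold
    heap = []                        # coordinator's s best items, as (-tag, item)
    global_t = 1.0                   # true global threshold (s-th smallest tag)
    messages = 0
    for p, stream in enumerate(streams):     # in reality the streams interleave
        for item in stream:
            r = random.random()
            if r >= local[p]:
                continue                     # filtered locally: no communication
            messages += 1                    # player p sends (item, r) upward
            if len(heap) < s or r < global_t:
                heapq.heappush(heap, (-r, item))   # genuinely a top-s item
                if len(heap) > s:
                    heapq.heappop(heap)      # evict the largest tag
                if len(heap) == s:
                    global_t = -heap[0][0]
            else:
                messages += 1                # player p was stale: one reply
                local[p] = global_t          # updates that player, and only it
    return [item for _, item in heap], messages
```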

This last algorithm and the lower bound were presented by Srikanta Tirthapura at the Shonan meeting, based on his DISC 2011 work with David Woodruff. Key elements of this result (including a broadcast-the-threshold variant of the upper bound) also appeared in a PODS 2010 paper by Muthu, Graham Cormode, Ke Yi and Qin Zhang. The optimal lower bound is new, and rather neat.

Tuesday, January 17, 2012

The Shonan Meeting (Part 2): Talks review I

I missed one whole day of the workshop because of classes, and also missed a half day because of an intense burst of slide-making. While I wouldn't apologize for missing talks at a conference, it feels worse to miss them at a small focused workshop. At any rate, the usual disclaimers apply: omissions are not due to my not liking a presentation, but because of having nothing even remotely intelligent to say about it.

Jeff Phillips led off with his work on mergeable summaries. The idea is that you have a distributed collection of nodes, each with their own data. The goal is to compute some kind of summary from all the nodes, with the caveat that each node only transmits a fixed size summary to other nodes (or the parent in an implied hierarchy). What's tricky about this is keeping the error down. It's easy to see for example that $\epsilon$-samples compose - you could take two $\epsilon$-samples and take an $\epsilon$-sample of that, giving you a $2\epsilon$-sample over the union. But you want to keep the error fixed AND the size of the sample fixed. He showed a number of summary structures that could be maintained in this mergeable fashion, and there are a number of interesting questions that remain open, including how to do clustering in a mergeable way.
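As the simplest instance of the mergeable idea, here's a sketch (mine, not from the talk) of merging two fixed-size uniform samples drawn with replacement: weighting each output slot by the amount of data each side has seen keeps the result uniform over the union without growing the sample:

```python
import random

def merge_samples(sample_a, n_a, sample_b, n_b):
    """Merge two size-s uniform samples (drawn with replacement) of streams
    of lengths n_a and n_b into one size-s uniform sample of their union."""
    s = len(sample_a)
    assert len(sample_b) == s
    merged = []
    for _ in range(s):
        # fill each slot from A with probability proportional to A's data size
        if random.random() < n_a / (n_a + n_b):
            merged.append(random.choice(sample_a))
        else:
            merged.append(random.choice(sample_b))
    return merged
```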

In the light of what I talked about earlier, you could think of the 'mergeable' model as a restricted kind of distributed computation, where the topology is fixed, and messages are fixed size. The topology is a key aspect, because nodes don't encounter data more than once. This is good, because otherwise the lack of idempotence of some of the operators could be a problem: indeed, it would be interesting to see how to deal with non-idempotent summaries in a truly distributed fashion.

Andrew McGregor talked about graph sketching problems (sorry, no abstract yet). One neat aspect of his work is that in order to build sketches for graph connectivity, he uses a vertex-edge representation that essentially looks like the cycle-basis vector in the 1-skeleton of a simplicial complex, and exploits the homology structure to compute the connected components (aka $\beta_0$). He also uses the bipartite double cover trick to reduce bipartiteness testing to connected component computation. It's kind of neat to see topological methods show up in a useful way in these settings, and his approach probably extends to other homological primitives.
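The double cover trick itself is easy to state concretely. Here's an offline (non-sketching) Python illustration, with names of my choosing: $G$ is bipartite exactly when its bipartite double cover has twice as many connected components as $G$, since each bipartite component lifts to two copies while a non-bipartite one lifts to a single connected graph:

```python
def connected_components(n, edges):
    """Count connected components of a graph on vertices 0..n-1 (union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(x) for x in range(n)})

def is_bipartite(n, edges):
    """Bipartiteness via the bipartite double cover: vertex v is duplicated
    as v and v + n, and each edge (u, v) becomes (u, v+n) and (v, u+n)."""
    cover = [(u, v + n) for u, v in edges] + [(v, u + n) for u, v in edges]
    return connected_components(2 * n, cover) == 2 * connected_components(n, edges)
```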

Donatella Firmani and Luigi Laura talked about different aspects of graph sketching and MapReduce, studying core problems like the MST and bi/triconnectivity. Donatella's talk in particular had a detailed experimental study of various MR implementations for these problems, and had interesting (but preliminary) observations about the tradeoff between the number of reducers and the amount of communication needed.

This theme was explored further by Jeff Ullman in his talk on one-pass MR algorithms (the actual talk title was slightly different, since the unwritten rule at the workshop was to change the name of the title from the official listing). Again, his argument was that one should be combining both the communication cost and the overall computation cost. A particularly neat aspect of his work was showing (for the problem of finding a particular shaped subgraph in a given large graph) when there was an efficient one-pass MR algorithm, given the existence of a serial algorithm for the same problem. He called such algorithms convertible algorithms: one such result is that if there's an algorithm running in time $n^\alpha m^\beta$ for finding a particular subgraph of size $s$, and $s \le \alpha + 2\beta$, then there's an efficient MR algorithm for the problem (in the sense of total computation time being comparable to the serial algorithm).


The Shonan Meeting (Part 1): In the beginning, there was a disk...

What follows is a personal view of the evolution of large-data models. This is not necessarily chronological, or even reflective of reality, but it's a retroactive take on the field, inspired by listening to talks at the Shonan meeting.

Arguably, the first formal algorithmic engagement with large data  was the Aggarwal-Vitter external memory model from 1988. The idea was simple enough: accessing an arbitrary element of disk was orders of magnitude more expensive than accessing an element of main memory, so let's ignore main memory access and charge a single unit for accessing a block of disk.

The external memory model was (and is still) a very effective model of disk access. It wasn't just a good guide to thinking about algorithm design, it also encouraged design strategies that were borne out well in practice. One could prove that natural-sounding buffering strategies were in fact optimal, and that prioritizing sequential scans as far as possible (even to the extent of preparing data for sequential scans) was more efficient. Nothing earth-shattering, but a model that guides (and conforms to) proper practice is always a good one.

Two independent directions spawned off from the external memory model. One direction was to extend the hierarchy. Why stop at one main memory level when we have multilevel caches  ? A simple extension to handle caches is tricky, because the access time differential between caches and main memory isn't sufficient to justify the idealized "0-1" model that the EM model used. But throwing in another twist - I don't actually know the correct block size for transfer of data between hierarchy levels - led us to cache-obliviousness.

I can't say for sure whether the cache-oblivious model speaks to the practice of programming with caches as effectively as the EM model. Being aware of your cache can bring significant benefits in principle. But the design principles ("repeated divide and conquer, and emphasizing locality of access") are sound, and there's already at least one company (Tokutek, founded by Martin Farach-Colton, Michael Bender and Bradley Kuszmaul) that is capitalizing on the performance yielded by cache-oblivious data structures.

The other direction was to weaken the power of the model. Since a sequential scan was so much more efficient than random access to disk, a natural question was to ask what you could do with just one scan. And thus was born the streaming model, which is by far the most successful model for large-data to date, with theoretical depth and immense practical value.

What we've been seeing over the past few years is the evolution of the streaming model to capture ever more complex data processing scenarios and communication frameworks.

It's quite useful to think of a stream algorithm as "communicating" a limited amount of information (the working memory) from the first half of the stream to the second. Indeed, this view is the basis for communication-complexity-based lower bounds for stream algorithms.

But if we think of this as an algorithmic principle, we then get into the realm of a distributed computation, where one player possesses the "first half" of the data, the other player has "the second half", and the goal is for them to exchange a small number of bits with each other in order to compute something (while streaming corresponds to one-way communication, a multi-pass streaming algorithm corresponds to two-way communication).

Of course, there's nothing saying that we only have two players. This gets you to the $k$-player setup for a distributed computation, in which you wish to minimize the amount of communication exchanged as part of a computation. This model is of course not new at all ! It's exactly the distributed computing model pioneered in the 80s and 90s and has a natural home at PODC. What appears to be different in its new uses is that the questions being asked are not the old classics like leader election or byzantine agreement, but statistical estimation on large data sets. In other words, the reason to limit communication is because of the need to process a large data set, rather than the need to merely coordinate. It's a fine distinction, and I'm not sure I entirely believe it myself :)

There are many questions about how to compute various objects in a distributed setting: of course the current motivation is to do with distributed data centers, sensor networks, and even different cores on a computer. Because of the focus on data analysis, there are sometimes surprising results that you can prove. For example, a recent SIGMOD paper by Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu shows that if you want to do quantile estimation, you only need communication that's sublinear in the number of players ! The trick here is that you don't need to have very careful error bounds on the estimates at each player before sending up the summary to a coordinator.

It's also quite interesting to think about distributed learning problems, where the information being exchanged is specifically in order to build a good model for whatever task you're trying to learn. Some recent work that I have together with Jeff Phillips, Hal Daume and my student Avishek Saha explores the communication complexity of doing classification in such a setting.

An even more interesting twist on the distributed setting is the so-called 'continuous streaming' setting. Here, you don't just have a one-shot communication problem. Each player receives a stream of data, and now the challenge is not just to communicate a few bits of information to solve a problem, but to update the information appropriately as new input comes in. Think of this as streaming with windows, or a dynamic version of the basic distributed setting.

Here too, there are a number of interesting results, a beautiful new sampling trick that I'll talk about next, and some lower bounds.

I haven't even got to MapReduce yet, and how it fits in: while you're waiting, you might want to revisit this post.

Sunday, January 15, 2012

The Shonan Meeting (Part 0): On Workshops

Coming up with new ideas requires concentration and immersion. When you spend enough unbroken time thinking about a problem, you start forming connections between thoughts, and eventually you get a giant "connected component" that's an actual idea.

Distractions, even technical ones, kill this process. And this is why even at a focused theory conference, I don't reach that level of "flow". While I'm bombarded from all directions by interesting theory, there's a lot of context switching. Look, a new TSP result ! Origami and folding - fun ! Who'd have thought of analyzing Jenga ! Did someone really prove that superlinear epsilon net lower bound ?

This is why focused workshops are so effective. You get bombarded with information for sure, but each piece reinforces aspects of the overall theme if it's done well. Slowly, over the course of the event, a bigger picture starts emerging, connections start being made, and you can feel the buzz of new ideas.

And this is why the trend of 'conferencizing' workshops, that Moshe Vardi lamented recently, is so pernicious. It's another example of perverse incentives ("conferences count more than workshops for academic review, and so let's redefine a workshop as a conference"). A good workshop (with submitted papers or otherwise) provides focus and intensity, and good things come of it. A workshop that's really just a miniconference doesn't have either the intense intimacy of a true workshop or the quality of a larger symposium.

All of this is a very roundabout way of congratulating Muthu, Graham Cormode and Ke Yi (ed: Can we just declare that Muthu has reached exalted one-word status, like Madonna and Adele ? I can't imagine anyone in the theory community hearing the name 'Muthu' and not knowing who that is) for putting on a fantastic workshop on Large-Scale Distributed Computing at the Shonan Village Center (the Japanese Dagstuhl, if you will). There was reinforcement, intensity, the buzz of new ideas, and table tennis ! There was also the abomination of fish-flavored cheese sticks, of which nothing more will be said.

In what follows, I'll have a series of posts from the event itself, with a personal overview of the evolution of the area, highlights from the talks, and a wrap up. Stay tuned...
