
Wednesday, May 16, 2012

"In the long run..."

"In the long run, we're all dead" is a famous quote attributed to John Maynard Keynes. The context was his arguments with economists of the time: he was trying to argue for government intervention in the markets to control inflation, rather than just letting it play out.

It's also an apt response to the occasional claim that asymptotics will eventually win out, especially with large data. Asymptotics will indeed win out eventually, as long as everything else stays fixed.

But that's precisely the problem: everything else doesn't stay fixed. Well before your $C n \log n$ algorithm beats the $c n^2$ algorithm, we run out of memory, or out of local cache, or out of something else, and the computational model changes under us.
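
To make the crossover concrete, here's a throwaway sketch with purely hypothetical constants ($C = 100$, $c = 1$) that finds the input size at which the $C n \log n$ algorithm first wins. Even a modest constant-factor gap pushes the crossover out to around a thousand items; a bigger gap pushes it past the point where the model itself stops being true.

```python
import math

C, c = 100.0, 1.0  # hypothetical constant factors


def crossover():
    """Smallest n at which C*n*log2(n) drops below c*n^2."""
    n = 2
    while C * n * math.log2(n) >= c * n * n:
        n += 1
    return n


print(crossover())  # roughly n = 1000 for these constants
```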

We come up with external memory models, and are informed that in fact even a factor of $\log N$ is too much. We design streaming models and MapReduce and so on and so forth, and realize that all this communication is burning a hole in our atmosphere.

Lesson to be learned (and re-learned): asymptotic analysis is a powerful tool, but it's only as good as the model of computation you're REALLY working with.

Tuesday, January 17, 2012

The Shonan Meeting (Part 2): Talks review I

I missed one whole day of the workshop because of classes, and also missed a half day because of an intense burst of slide-making. While I wouldn't apologize for missing talks at a conference, it feels worse to miss them at a small focused workshop. At any rate, the usual disclaimers apply: omissions are not due to my not liking a presentation, but because of having nothing even remotely intelligent to say about it.

Jeff Phillips led off with his work on mergeable summaries. The idea is that you have a distributed collection of nodes, each with their own data. The goal is to compute some kind of summary over all the nodes, with the caveat that each node only transmits a fixed-size summary to other nodes (or to its parent in an implied hierarchy). What's tricky about this is keeping the error down. It's easy to see, for example, that $\epsilon$-samples compose: you could take two $\epsilon$-samples and take an $\epsilon$-sample of their union, giving you a $2\epsilon$-sample over the union. But you want to keep the error fixed AND the size of the sample fixed. He showed a number of summary structures that can be maintained in this mergeable fashion, and there are a number of interesting questions that remain open, including how to do clustering in a mergeable way.
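
To fix intuition, here's a minimal sketch (my own toy version, not taken from the paper) of perhaps the simplest mergeable summary: a fixed-size uniform random sample. Two size-$k$ samples, of $n_1$ and $n_2$ items respectively, merge into a single size-$k$ sample of all $n_1 + n_2$ items, with neither the error nor the size growing, so the result can be merged again further up the hierarchy.

```python
import random

def merge_samples(s1, n1, s2, n2, k):
    """Merge a uniform k-sample s1 of n1 items and a uniform k-sample s2
    of n2 items into a uniform k-sample of all n1 + n2 items.
    Assumes k <= n1 and k <= n2, so both inputs really have k elements."""
    s1, s2 = list(s1), list(s2)
    random.shuffle(s1)
    random.shuffle(s2)
    out, r1, r2 = [], n1, n2
    for _ in range(k):
        # draw from the side in proportion to how many of the original
        # items it still represents, mimicking sampling without
        # replacement from the full union
        if random.random() < r1 / (r1 + r2):
            out.append(s1.pop())
            r1 -= 1
        else:
            out.append(s2.pop())
            r2 -= 1
    return out, n1 + n2
```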

In the light of what I talked about earlier, you could think of the 'mergeable' model as a restricted kind of distributed computation, where the topology is fixed, and messages are fixed size. The topology is a key aspect, because nodes don't encounter data more than once. This is good, because otherwise the lack of idempotence of some of the operators could be a problem: indeed, it would be interesting to see how to deal with non-idempotent summaries in a truly distributed fashion.

Andrew McGregor talked about graph sketching problems (sorry, no abstract yet). One neat aspect of his work is that in order to build sketches for graph connectivity, he uses a vertex-edge representation that essentially looks like the cycle-basis vector in the 1-skeleton of a simplicial complex, and exploits the homology structure to compute the connected components (aka $\beta_0$). He also uses the bipartite double cover trick to reduce bipartiteness testing to connected component computation. It's kind of neat to see topological methods show up in a useful way in these settings, and his approach probably extends to other homological primitives.
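
As an aside, the double cover trick itself is easy to state, and a toy (non-streaming) sketch may help: each vertex $v$ splits into $(v,0)$ and $(v,1)$, each edge $(u,v)$ becomes the pair $((u,0),(v,1))$ and $((u,1),(v,0))$, and $G$ is bipartite exactly when every component of $G$ lifts to two components of the cover. This is just the reduction; doing the connectivity step with small sketches is the actual contribution.

```python
def double_cover(edges):
    """Bipartite double cover: vertex v becomes (v,0) and (v,1);
    edge (u,v) becomes ((u,0),(v,1)) and ((u,1),(v,0))."""
    return [((u, 0), (v, 1)) for u, v in edges] + \
           [((u, 1), (v, 0)) for u, v in edges]

def num_components(edges, vertices):
    """Count connected components with a simple union-find."""
    parent = {v: v for v in vertices}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in vertices})

def is_bipartite(edges, vertices):
    cover = double_cover(edges)
    cover_vertices = [(v, b) for v in vertices for b in (0, 1)]
    # G is bipartite iff every component of G lifts to two components
    return num_components(cover, cover_vertices) == 2 * num_components(edges, vertices)

print(is_bipartite([(1, 2), (2, 3), (3, 4), (4, 1)], [1, 2, 3, 4]))  # True: 4-cycle
print(is_bipartite([(1, 2), (2, 3), (3, 1)], [1, 2, 3]))             # False: triangle
```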

Donatella Firmani and Luigi Laura talked about different aspects of graph sketching and MapReduce, studying core problems like MST and bi/triconnectivity. Donatella's talk in particular had a detailed experimental study of various MR implementations for these problems, with interesting (but preliminary) observations about the tradeoff between the number of reducers and the amount of communication needed.

This theme was explored further by Jeff Ullman in his talk on one-pass MR algorithms (the actual talk title was slightly different, since the unwritten rule at the workshop was to change one's title from the official listing). Again, his argument was that one should account for both the communication cost and the overall computation cost. A particularly neat aspect of his work was showing (for the problem of finding a particular subgraph in a given large graph) when an efficient one-pass MR algorithm exists, given a serial algorithm for the same problem. He called such algorithms convertible algorithms: one such result is that if there's an algorithm running in time $n^\alpha m^\beta$ for finding a particular subgraph of size $s$, and $s \le \alpha + 2\beta$, then there's an efficient MR algorithm for the problem (in the sense that the total computation time is comparable to that of the serial algorithm).


The Shonan Meeting (Part 1): In the beginning, there was a disk...

What follows is a personal view of the evolution of large-data models. This is not necessarily chronological, or even reflective of reality, but it's a retroactive take on the field, inspired by listening to talks at the Shonan meeting.

Arguably, the first formal algorithmic engagement with large data was the Aggarwal-Vitter external memory model from 1988. The idea was simple enough: accessing an arbitrary element on disk is orders of magnitude more expensive than accessing an element in main memory, so let's ignore the cost of main memory access and charge a single unit for accessing a block of disk.
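
For reference, the model's flavor is captured by its two canonical bounds (with $N$ items, internal memory of size $M$, and block size $B$): a sequential scan costs $\Theta(N/B)$ block accesses, and sorting costs
$$\mathrm{sort}(N) = \Theta\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$$
block accesses, which is why so many external memory algorithms are stated purely in terms of scan and sort primitives.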

The external memory model was (and is still) a very effective model of disk access. It wasn't just a good guide to thinking about algorithm design, it also encouraged design strategies that were borne out well in practice. One could prove that natural-sounding buffering strategies were in fact optimal, and that prioritizing sequential scans as far as possible (even to the extent of preparing data for sequential scans) was more efficient. Nothing earth-shattering, but a model that guides (and conforms to) proper practice is always a good one.

Two independent directions spawned off from the external memory model. One direction was to extend the hierarchy: why stop at one level of main memory when we have multilevel caches ? A naive extension to handle caches is tricky, because the access-time differential between caches and main memory isn't large enough to justify the idealized "0-1" model that the EM model uses. But throwing in another twist - that we don't actually know the correct block size for transferring data between levels of the hierarchy - led us to cache-obliviousness.

I can't say for sure whether the cache-oblivious model speaks to the practice of programming with caches as effectively as the EM model does; being aware of your cache can, in principle, bring significant benefits. But the design principles ("repeated divide and conquer, and emphasizing locality of access") are sound, and there's already at least one company (Tokutek, founded by Martin Farach-Colton, Michael Bender and Bradley Kuszmaul) capitalizing on the performance yielded by cache-oblivious data structures.
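
The divide-and-conquer principle is easy to illustrate. Here's a toy sketch (in Python, where the actual cache behavior is admittedly mostly lost) of a cache-oblivious matrix transpose: no block size or cache size appears anywhere, yet the recursion eventually works on tiles that fit in whatever caches happen to exist.

```python
def co_transpose(A, B, r0, r1, c0, c1, cutoff=16):
    """Write the transpose of the block A[r0:r1][c0:c1] into B, by
    recursively splitting the longer dimension in half."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= cutoff and cols <= cutoff:   # small tile: transpose directly
        for i in range(r0, r1):
            for j in range(c0, c1):
                B[j][i] = A[i][j]
    elif rows >= cols:                      # split the rows
        mid = r0 + rows // 2
        co_transpose(A, B, r0, mid, c0, c1, cutoff)
        co_transpose(A, B, mid, r1, c0, c1, cutoff)
    else:                                   # split the columns
        mid = c0 + cols // 2
        co_transpose(A, B, r0, r1, c0, mid, cutoff)
        co_transpose(A, B, r0, r1, mid, c1, cutoff)

# usage: A is n x m (list of lists), B is a preallocated m x n result
n, m = 5, 3
A = [[i * m + j for j in range(m)] for i in range(n)]
B = [[0] * n for _ in range(m)]
co_transpose(A, B, 0, n, 0, m)
```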

The other direction was to weaken the power of the model. Since a sequential scan is so much more efficient than random access to disk, a natural question is what you can do with just one scan. And thus was born the streaming model, which is by far the most successful model for large data to date, with theoretical depth and immense practical value.
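
For a sense of how much a single scan buys you, here's the classical Misra-Gries frequent-items algorithm as a quick illustration (the algorithm is standard; the code is just my sketch of it): one pass, $O(k)$ counters, and every item occurring more than $n/k$ times in a stream of length $n$ is guaranteed to survive.

```python
def misra_gries(stream, k):
    """One pass, at most k-1 counters. Every item occurring more than n/k
    times ends up with a counter, and each counter undercounts the true
    frequency by at most n/k (n = length of the stream)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement all counters; drop any that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries("abababacadaeaf", 3))  # 'a' (7 of 14 items) must survive
```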

What we've been seeing over the past few years is the evolution of the streaming model to capture ever more complex data processing scenarios and communication frameworks.

It's quite useful to think of a stream algorithm as "communicating" a limited amount of information (the working memory) from the first half of the stream to the second. Indeed, this view is the basis for communication-complexity-based lower bounds for stream algorithms.

But if we think of this as an algorithmic principle, we then get into the realm of distributed computation, where one player possesses the "first half" of the data, the other player has the "second half", and the goal is for them to exchange a small number of bits in order to compute something (while streaming corresponds to one-way communication, a multi-pass streaming algorithm corresponds to two-way communication).

Of course, there's nothing saying that we only have two players. This gets you to the $k$-player setup for a distributed computation, in which you wish to minimize the amount of communication exchanged as part of a computation. This model is of course not new at all ! It's exactly the distributed computing model pioneered in the 80s and 90s and has a natural home at PODC. What appears to be different in its new uses is that the questions being asked are not the old classics like leader election or byzantine agreement, but statistical estimation on large data sets. In other words, the reason to limit communication is because of the need to process a large data set, rather than the need to merely coordinate. It's a fine distinction, and I'm not sure I entirely believe it myself :)

There are many questions about how to compute various objects in a distributed setting: of course the current motivation is to do with distributed data centers, sensor networks, and even different cores on a computer. Because of the focus on data analysis, there are sometimes surprising results that you can prove. For example, a recent SIGMOD paper by Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu shows that if you want to do quantile estimation, you only need communication that's sublinear in the number of players ! The trick here is that you don't need to have very careful error bounds on the estimates at each player before sending up the summary to a coordinator.
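
For contrast, the obvious baseline is easy to write down: every site ships a small summary to the coordinator, which merges them. A toy version (mine, and emphatically not the algorithm in the paper) might look like the sketch below; its communication grows linearly with the number of sites, which is exactly the dependence the Huang et al. result improves on.

```python
import random

def site_summary(data, s=32):
    """Each site ships a fixed-size sorted sample plus its local count.
    (Picking s to meet a target error is where the real work hides.)"""
    sample = sorted(random.sample(data, min(s, len(data))))
    return sample, len(data)

def coordinator_quantile(summaries, phi):
    """Weight each sampled point by the number of items it represents,
    then read off the phi-quantile of the weighted merge."""
    weighted = []
    for sample, n in summaries:
        w = n / len(sample)
        weighted.extend((x, w) for x in sample)
    weighted.sort()
    total = sum(w for _, w in weighted)
    acc = 0.0
    for x, w in weighted:
        acc += w
        if acc >= phi * total:
            return x
    return weighted[-1][0]

sites = [[random.random() for _ in range(1000)] for _ in range(10)]
est = coordinator_quantile([site_summary(d) for d in sites], 0.5)
print(est)  # should land near 0.5, the true median of uniform data
```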

It's also quite interesting to think about distributed learning problems, where the information being exchanged is specifically in order to build a good model for whatever task you're trying to learn. Some recent work that I have together with Jeff Phillips, Hal Daume and my student Avishek Saha explores the communication complexity of doing classification in such a setting.

An even more interesting twist on the distributed setting is the so-called 'continuous streaming' setting. Here, you don't just have a one-shot communication problem. Each player receives a stream of data, and now the challenge is not just to communicate a few bits of information to solve a problem, but to update the information appropriately as new input comes in. Think of this as streaming with windows, or a dynamic version of the basic distributed setting.
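
A tiny sketch of the flavor, for the simplest possible problem (tracking a global count across sites): each site talks to the coordinator only when its local count has grown by a $(1+\epsilon)$ factor since its last report, so the coordinator's estimate is always within a $(1+\epsilon)$ factor, and each site sends only $O(\log_{1+\epsilon} n)$ messages over a stream of length $n$. The real results in this setting handle much richer statistics, but the communicate-only-on-significant-change idea is the same.

```python
class Coordinator:
    """Keeps the last value reported by each site; the sum is a running
    underestimate of the global count, off by at most a (1+eps) factor."""
    def __init__(self):
        self.counts = {}

    def update(self, site_id, count):
        self.counts[site_id] = count

    def estimate(self):
        return sum(self.counts.values())


class Site:
    """Reports its local count only when it has grown by a (1+eps) factor
    since the last report: O(log_{1+eps} n) messages per site, total."""
    def __init__(self, site_id, coordinator, eps=0.1):
        self.site_id, self.coord, self.eps = site_id, coordinator, eps
        self.count, self.reported = 0, 0

    def observe(self, item):
        self.count += 1
        if self.reported == 0 or self.count >= (1 + self.eps) * self.reported:
            self.coord.update(self.site_id, self.count)
            self.reported = self.count


coord = Coordinator()
sites = [Site(i, coord) for i in range(5)]
for t in range(10000):
    sites[t % 5].observe(t)          # round-robin stream of items
print(coord.estimate())              # within 10% of the true count, 10000
```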

Here too, there are a number of interesting results, a beautiful new sampling trick that I'll talk about next, and some lower bounds.

I haven't even got to MapReduce yet, and how it fits in: while you're waiting, you might want to revisit this post.

Sunday, January 15, 2012

The Shonan Meeting (Part 0): On Workshops

Coming up with new ideas requires concentration and immersion. When you spend enough unbroken time thinking about a problem, you start forming connections between thoughts, and eventually you get a giant "connected component" that's an actual idea.

Distractions, even technical ones, kill this process. And this is why even at a focused theory conference, I don't reach that level of "flow". While I'm bombarded from all directions by interesting theory, there's a lot of context switching. Look, a new TSP result ! Origami and folding - fun ! Who'd have thought of analyzing Jenga ! Did someone really prove that superlinear epsilon-net lower bound ?

This is why focused workshops are so effective. You get bombarded with information for sure, but each piece reinforces aspects of the overall theme if it's done well. Slowly, over the course of the event, a bigger picture starts emerging, connections start being made, and you can feel the buzz of new ideas.

And this is why the trend of 'conferencizing' workshops, which Moshe Vardi lamented recently, is so pernicious. It's another example of perverse incentives ("conferences count more than workshops for academic review, so let's redefine a workshop as a conference"). A good workshop (with submitted papers or otherwise) provides focus and intensity, and good things come of it. A workshop that's really just a mini-conference has neither the intense intimacy of a true workshop nor the quality of a larger symposium.

All of this is a very roundabout way of congratulating Muthu, Graham Cormode and Ke Yi (ed: Can we just declare that Muthu has reached exalted one-word status, like Madonna and Adele ? I can't imagine anyone in the theory community hearing the name 'Muthu' and not knowing who that is) for putting on a fantastic workshop on Large-Scale Distributed Computing at the Shonan Village Center (the Japanese Dagstuhl, if you will). There was reinforcement, intensity, the buzz of new ideas, and table tennis ! There was also the abomination of fish-flavored cheese sticks, of which nothing more will be said.

In what follows, I'll have a series of posts from the event itself, with a personal overview of the evolution of the area, highlights from the talks, and a wrap up. Stay tuned...

Monday, October 31, 2011

Models for MapReduce

I've been listening to Jeff Phillips' comparison of different models for MapReduce (he's teaching a class on models for massive data). In what follows, I'll add the disclaimer IANACT (I am not a complexity theorist).

There's something that bothers me about the various models floating around that attempt to capture the MapReduce framework (specifically the MUD framework by Feldman et al., the MRC framework by Karloff, Suri and (co-blogger) Vassilvitskii, and the newer Goodrich-Sitchinava-Zhang framework).

For "seasoned" models of computation like the RAM, or PRAM, or even I/O and streaming, there's a sense of identifying some intrinsic feature of the computation that you wish to capture, building a model around it, and then identifying resources that you wish to expend little of. After the dust settled around the different TM/RAM models for effective computation, we now accept (for the most part) the RAM as a good model for sequential computation. The I/O model identifies disk access as the key operation to capture and builds a model of computation where disk access is a limited resource. Streaming identifies "amnesia" as the key limitation on a computation, and builds a model of computation centered around that.

In all these cases, it's possible to grumble that $O(n^5)$ isn't really "efficient", or that galactic algorithms shouldn't count, or that ignoring main memory computations is "cheating" in the I/O model, or even that allowing $n^{1-\epsilon}$ memory is useless for streaming. But those discussions are secondary to the model itself: in fact, history tells us that models that appear to allow horribly inefficient computation inevitably end up facilitating algorithm design that is quite efficient !

Which is why I'm puzzled when I look at the different ways in which the MapReduce framework is being modelled. I see a number of choices being made: the number of processors should be $O(n^{2-\epsilon})$, or the total memory usage should be $O(n^\epsilon)$, or the number of 'rounds' of computation should be polylogarithmic or even constant.

At one level I understand these choices: for example, a memory bound of $n^\epsilon$ appears to lead to algorithms with a constant ($1/\epsilon$) number of rounds.
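
One way to see where the $1/\epsilon$ comes from, at least for aggregation-style computations: if a machine can hold (and therefore combine) $n^\epsilon$ items in a round, then an aggregation tree over $n$ items has fan-in $n^\epsilon$, and its depth is
$$\log_{n^\epsilon} n = \frac{\log n}{\epsilon \log n} = \frac{1}{\epsilon},$$
which is constant in $n$.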

But should the model be getting into decisions about what's efficient before even deciding what resources it is trying to capture ?

Ultimately, this for me gets back to a question that I asked at the W8F workshop back in May:
What is the intrinsic computational metaphor that we are trying to capture with these models ?
It's not just parallelism. It's not just distributed computation. It's not even streaming (or more generally sublinear computations). Is it some delicate tradeoff between these three ? Should we think of map-reduce models like the circuit classes that have bounds on both space and depth ? Wouldn't this be an unholy mess ?

I suspect that over time, if the MapReduce computational framework persists, then these issues will get shaken out. But there's a provisional (and exciting!) feel about the models we currently have.
