Friday, January 20, 2012

The Shonan Meeting (Part 3): Optimal Distributed Sampling

Reservoir sampling is one of the beautiful gems of sampling: easy to explain, and almost magical in how it works. The setting is this:
A stream of items passes by you, and your goal is to extract a sample of size $s$ that is uniform over all the elements you've seen so far.
The technique works as follows: if you're currently examining the $i$-th element, select it with probability $s/i$ (which is $1/i$ when the sample has size one), and if it's selected, have it replace an element chosen uniformly at random from the current sample. I've talked about this result before, including three different ways of proving it.
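
Here's a minimal sketch of that procedure for a sample of size $s$ (the function and variable names are mine, just for illustration):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of size s over the items seen so far."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)                  # fill the reservoir first
        elif random.random() < s / i:            # keep the i-th item with probability s/i
            sample[random.randrange(s)] = item   # evict a uniformly chosen current element
    return sample
```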

But what happens if you're in a continuous distributed setting ? Now each of $k$ players is reading a stream of items, and they all talk to a coordinator who wishes to maintain a random sample of the union of the streams. Let's assume for now that $s \le k$.

Each player can run the above protocol and send an item to the coordinator, and the coordinator can pick a random subset from these. But this won't work ! At least, not unless each player has read in exactly the same amount of data as each other player. This is because we need to weight the sample element sent by a player with the number of elements that player has read.

It's not hard to see that each player sends roughly $\log n$ messages to the coordinator for a stream of length $n$. So maybe each player also annotates the element with the number of elements it has seen so far. This sort of works, but the counts could be off significantly, since a player that doesn't send a sample might have read many more elements since the last time it sent an update.

This can be fixed by having each player send an extra control message when its stream increases in size by a factor of 2, and that would not change the asymptotic complexity of the process, but we still don't get a truly uniform sample.

The problem with this approach is that it's trying to get around knowing $n$, the size of the stream, which is expensive to communicate in a distributed setting. So can we revisit the original reservoir method in a 'communication-friendly' way ?

Let's design a new strategy for reservoir sampling that works as follows.
Maintain a current "threshold" t. When a new item arrives, assign it a random value r between 0 and 1. If r < t, keep the new item and set t = r, else discard it.
By using the principle of deferred decisions, you can convince yourself that this does exactly the same thing as the previous strategy (because at step $i$, the probability that the current element is retained is the probability that its $r$ is the minimum over all values seen so far, which is $1/i$). The good thing is that this approach doesn't need to know how many elements have passed by so far.
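
A minimal sketch of this variant for a sample of size one (for size $s$ you'd keep the items with the $s$ smallest tags); again, the names are mine:

```python
import random

def threshold_sample(stream):
    """Keep the item whose random tag is the minimum seen so far."""
    t, sample = 1.0, None        # threshold starts at 1, so the first item is always kept
    for item in stream:
        r = random.random()
        if r < t:                # r is the new minimum: keep this item, lower the threshold
            t, sample = r, item
    return sample
```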

This approach can be extended almost immediately to the distributed setting. Each player now runs this protocol instead of the previous one, and every time the coordinator gets an update, it sends out a new global threshold (the minimum over all thresholds sent in) to all nodes. If you want to maintain a sample of size $s$, the coordinator keeps the $s$ smallest-valued of the elements sent in, and the overall complexity is $O(ks \log n)$.

But you can do even better.

Now, each player maintains its own threshold. The coordinator doesn't send out the "correct" threshold until a player sends an element whose random value is above the global threshold. This tells the coordinator that the player had a stale threshold, and it then updates that player (and only that player).
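
To make the message flow concrete, here is a minimal sketch of the lazy protocol for a sample of size one; the class and method names are mine, not the paper's, and the bookkeeping for general $s$ is omitted:

```python
import random

class Coordinator:
    """Holds the global minimum tag; corrects a player only when it sends a stale update."""
    def __init__(self):
        self.t, self.sample = 1.0, None

    def receive(self, player, r, item):
        if r < self.t:            # a genuine new global minimum: accept the item
            self.t, self.sample = r, item
        else:                     # the player's local threshold was out of date:
            player.t = self.t     # reply with the correct threshold to that player only

class Player:
    def __init__(self, coordinator):
        self.coord, self.t = coordinator, 1.0   # t is a possibly-stale local copy of the threshold

    def process(self, item):
        r = random.random()
        if r < self.t:            # passes the local test: update locally, report to the coordinator
            self.t = r
            self.coord.receive(self, r, item)

# usage sketch: eight players feeding one coordinator
coord = Coordinator()
players = [Player(coord) for _ in range(8)]
```
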
Analyzing this approach takes a little more work, but the resulting bound is much better:
The (expected) amount of communication is $O(k \frac{\log (n/s)}{\log (k/s)})$
What's even more impressive: this is optimal !

This last algorithm and the lower bound were presented by Srikanta Tirthapura at the Shonan meeting, based on his DISC 2011 work with David Woodruff. Key elements of this result (including a broadcast-the-threshold variant of the upper bound) also appeared in a PODS 2010 paper by Muthu, Graham Cormode, Kevin Yi and Qin Zhang. The optimal lower bound is new, and rather neat.

Thursday, January 19, 2012

SODA Review II: The business meeting

Jeff Phillips posts a roundup of the SODA business meeting. Hawaii !!!!!!

I thought I would also post a few notes on the SODA business meeting. I am sure I missed some details, but here are the main points.

Everyone thought the organization of the conference was excellent (so far). The part in parentheses is a joke by Kazuo Iwama aimed at his students - I guess that is Japanese humor, and encouragement.

Despite being outside of North America for the first time, the attendance was quite high, I think around 350 people. And the splits were almost exactly 1/3 NA, 1/3 Europe, 1/3 Asia.

Yuval Rabani talked about being PC chair for the conference. He said there were the most submissions ever, the most accepted papers ever, and the largest PC ever for SODA. Each PC member reviewed about 49 papers, and over 500 papers were sub-reviewed. We all thank Yuval for all of his hard work.

We voted next on the location for 2014 (2013 is in New Orleans). The final vote came down to Honolulu, HI and Washington, DC, and Honolulu won about 60 to 49. David Johnson said they would try to book a hotel in Honolulu if he could get hotel prices below 200 USD/night. A quick look on kayak.com made it appear several large hotels could be booked next year around this time for about 160 USD/night. Otherwise it will be in DC, where the theory group at UMaryland (via David Mount) has stated it would help with local arrangements. They did a great job with SoCG a few years ago, but I heard many suggestions that it be held closer to downtown than the UofM campus. And there were also requests for good weather. We'll see what happens...

Finally, there was a discussion about how SODA is organized/governed. This discussion got quite lively. Bob Sedgewick led the discussion by providing a short series of slides outlining a rough plan for a "confederated SODA." I have linked to his slides. This could mean several things, for instance:
  • Having ALENEX and ANALCO (and SODA) talks spread out over 4 days and intermixed even possibly in the same session (much like ESA).
  • The PCs would stay separate most likely (although merging them was discussed, but this had less support). 
  • For SODA, the PC could be made more hierarchical, with, say, 6 main area chairs. Each area chair would then supervise, say, 12 or so PC members. The general chair would coordinate and normalize all of the reviews, but otherwise the process would be more hierarchical and partitioned. Then PC members in each area would have fewer papers to review, and could even submit to other subareas. 
  • There was also a suggestion that PC chairs / steering committee members have some SODA attendance requirements. (Currently the steering committee consists of David Johnson, 2 people appointed by SIAM, and the past two PC chairs - as far as I understand. David Johnson said he would provide a link to the official SODA bylaws somewhere.) 
Anyways, there was a lot of discussion that was boiled down to 3 votes (I will try to paraphrase, all vote totals approximate):
  • Should the steering committee consider spreading ALENEX, ANALCO, and SODA talks over 4 days? About 50 to 7 in favor. 
  • Should the steering committee consider/explore some variant of the Confederated SODA model? About 50 to 2 in favor.
  • Should the steering committee consider making the steering committee members elected? About 50 to 1 in favor. 
There were about 100+ votes for the location, so roughly half the crowd abstained on these votes. There were various arguments on either side of the positions, and other suggestions. Some people had very strong and well-argued positions on these discussion points, so I don't want to try to paraphrase (and probably get something nuanced wrong), but I encourage people to post opinions and ideas in the comments.

Wednesday, January 18, 2012

SODA review I: Talks, talks and more talks.

I asked Jeff Phillips (a regular contributor) if he'd do some conference posting for those of us unable to make it to SODA. Here's the first of his two missives.

Suresh requested I write a conference report for SODA. I never know how to write these reports since I always feel like I must have left out some nice talk/paper and then I risk offending people. The fact is, there are 3 parallel sessions, and I can't pay close attention to talks for 3 days straight, especially after spending the previous week at the Shonan meeting that Suresh has been blogging about.

Perhaps it is apt to contrast it with the Shonan meeting. At Shonan there were many talks (often informal, with much back and forth) on topics very well clustered around "Large-scale Distributed Computation". Several talks earlier in the workshop laid out the main techniques that have become quite powerful within an area, and then there were new talks on recent breakthroughs. Although we mixed up the ordering of subtopics a bit, there was never that big a context switch, and you could see larger views coalescing in people's minds throughout the week.

At SODA, the spectrum is much more diverse - probably the most diverse of the conferences on the theoretical end of computer science. The great thing is that I get to see colleagues across a much broader spectrum of areas. But the talks are often a bit more specific, and despite the usually fairly coherent sessions, the context switches are typically quite a bit larger, and it seems harder to stay focused enough to really get at the heart of what is in each talk. Really getting the point requires both paying attention and being in the correct mindset to start with. Also, there are not too many talks in my areas of interest (i.e. geometry, big-data algorithmics).



So then what is there to report ? I've spent most of my time in the hallways, catching up on gossip (which either is personal, or I probably shouldn't blog about without tenure - or even with tenure), or discussing ongoing or new research problems with friends (again, not yet ready for a blog). And of the talks I saw, I generally captured vague notions or concepts, usually stored away for when I think about a related problem and need to make a similar connection, or look up a technique in the paper. And although I was given a CD of the proceedings, my laptop has no CD drive. For the reasons discussed above, I rarely completely get how something works from a short conference talk. Here are some example snippets of what I took away from a few talks:

Private Data Release Via Learning Thresholds | Moritz Hardt, Guy Rothblum, Rocco A. Servedio.
Take-away : There is a deep connection between PAC learning and differential privacy. Some results from one can be applied to the other, but perhaps many others can be as well.
Submatrix Maximum Queries in Monge Matrices and Monge Partial Matrices, and Their Applications | Haim Kaplan, Shay Mozes, Yahav Nussbaum and Micha Sharir
Take-away: There is a cool "Monge" property that matrices can have which makes many subset query operations more efficient. This can be thought of as each row representing a pseudo-line. Looks useful for matrix problems where the geometric intuition about what the columns mean is relevant. 
Analyzing Graph Structure Via Linear Measurements | Kook Jin Ahn, Sudipto Guha, Andrew McGregor
Take-away : They presented a very cool linear sketch for graphs. This allows several graph problems to be solved in a streaming setting (or similar models) in the way that more abstract, if not geometric, data usually is. (ed: see my note on Andrew's talk at Shonan)

Lsh-Preserving Functions and Their Applications | Flavio Chierichetti, Ravi Kumar
Take-away: They present a nice characterization of similarities (based on combinatorial sets), showing which ones can and cannot be used within an LSH framework. Their techniques seemed to be a bit more general than just these discrete similarities over sets, so if you need this for another similarity, it may be worth checking out the paper in more detail. 

Data Reduction for Weighted and Outlier-Resistant Clustering | Dan Feldman, Leonard Schulman
Take-away: They continue to develop the understanding on what can be done for core sets using sensitivity-based analysis. This helps outline not just what functions can be approximated with subsets as proxy, but also how the distribution of points affects these results. The previous talk by Xin Xiao (with Kasturi Varadarajan on A Near-Linear Algorithm for Projective Clustering Integer Points) also used these concepts. 

There were many other very nice results and talks that I also enjoyed, but the take-away was often even less interesting to blog about. Or sometimes they just made more progress towards closing a specific subarea. I am not sure how others use a conference, but if you are preparing a talk, you might consider building a clear, concise take-away message into it, so that people like me with finite attention spans can remember something precise from it - and are more likely to look carefully at the paper the next time we work on a related problem.

Upcoming deadline: Multiclust 2012

One of the posts queued up in the clustering series is a note on 'metaclustering', which starts with the idea that instead of looking for one good clustering, we should be trying to explore the space of clusterings obtained through different methods and pick out good solutions informed by the landscape of answers. While the concept has been around a while, the area has attracted much more interest in recent years, with two workshops on the topic at recent data mining conferences.

I'm involved with the organization of the latest incarnation, to be held in conjunction with SDM 2012 in April. If you have
unpublished original research papers (of upto 8 pages) that are not under review elsewhere, vision papers and descriptions of work-in-progress or case studies on benchmark data as short paper submissions of up to 4 pages
then you should submit it ! The deadline is Jan 25.

Tuesday, January 17, 2012

The Shonan Meeting (Part 2): Talks review I

I missed one whole day of the workshop because of classes, and also missed a half day because of an intense burst of slide-making. While I wouldn't apologize for missing talks at a conference, it feels worse to miss them at a small focused workshop. At any rate, the usual disclaimers apply: omissions are not due to my not liking a presentation, but because of having nothing even remotely intelligent to say about it.

Jeff Phillips led off with his work on mergeable summaries. The idea is that you have a distributed collection of nodes, each with their own data. The goal is to compute some kind of summary over all the nodes, with the caveat that each node only transmits a fixed-size summary to other nodes (or to the parent in an implied hierarchy). What's tricky about this is keeping the error down. It's easy to see, for example, that $\epsilon$-samples compose - you could take two $\epsilon$-samples and take an $\epsilon$-sample of their union, giving you a $2\epsilon$-sample over the union. But you want to keep the error fixed AND the size of the sample fixed. He showed a number of summary structures that can be maintained in this mergeable fashion, and there are a number of interesting questions that remain open, including how to do clustering in a mergeable way.
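
For the special case of uniform random samples, one standard merge keeps both the size and the error fixed. Here's a minimal sketch, assuming each input is a size-$k$ sample drawn without replacement from one of two disjoint sets of sizes $n_1$ and $n_2$, and that those set sizes travel with the summaries (this is my illustration, not code from the paper):

```python
import random

def merge_samples(s1, n1, s2, n2):
    """Merge two size-k uniform samples (without replacement) of disjoint sets of
    sizes n1 and n2 into a size-k uniform sample of their union."""
    k = len(s1)                   # assumes len(s1) == len(s2) == k
    s1, s2 = list(s1), list(s2)
    merged = []
    for _ in range(k):
        # draw the next element from side 1 with probability proportional to the
        # number of elements of set 1 not yet accounted for
        if random.random() < n1 / (n1 + n2):
            merged.append(s1.pop(random.randrange(len(s1))))
            n1 -= 1
        else:
            merged.append(s2.pop(random.randrange(len(s2))))
            n2 -= 1
    return merged
```

The output is again a size-$k$ uniform sample, so such summaries can be merged repeatedly without growing; the harder summary structures covered in the talk need more care than this.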

In the light of what I talked about earlier, you could think of the 'mergeable' model as a restricted kind of distributed computation, where the topology is fixed, and messages are fixed size. The topology is a key aspect, because nodes don't encounter data more than once. This is good, because otherwise the lack of idempotence of some of the operators could be a problem: indeed, it would be interesting to see how to deal with non-idempotent summaries in a truly distributed fashion.

Andrew McGregor talked about graph sketching problems (sorry, no abstract yet). One neat aspect of his work is that in order to build sketches for graph connectivity, he uses a vertex-edge representation that essentially looks like the cycle-basis vector in the 1-skeleton of a simplicial complex, and exploits the homology structure to compute the connected components (aka $\beta_0$). He also uses the bipartite double cover trick to reduce bipartiteness testing to connected component computation. It's kind of neat to see topological methods show up in a useful way in these settings, and his approach probably extends to other homological primitives.

Donatella Firmani and Luigi Laura talked about different aspects of graph sketching and MapReduce, studying core problems like the MST and bi/triconnectivity. Donatella's talk in particular had a detailed experimental study of various MR implementations for these problems, and had interesting (but preliminary) observations about the tradeoff between the number of reducers and the amount of communication needed.

This theme was explored further by Jeff Ullman in his talk on one-pass MR algorithms (the actual talk title was slightly different, since the unwritten rule at the workshop was to change the title from the official listing). Again, his argument was that one should be combining both the communication cost and the overall computation cost. A particularly neat aspect of his work was showing (for the problem of finding a subgraph of a particular shape in a given large graph) when there is an efficient one-pass MR algorithm, given the existence of a serial algorithm for the same problem. He called such algorithms convertible algorithms: one representative result is that if there's an algorithm running in time $n^\alpha m^\beta$ for finding a particular subgraph of size $s$, and $s \le \alpha + 2\beta$, then there's an efficient MR algorithm for the problem (in the sense that the total computation time is comparable to that of the serial algorithm).


The Shonan Meeting (Part 1): In the beginning, there was a disk...

What follows is a personal view of the evolution of large-data models. This is not necessarily chronological, or even reflective of reality, but it's a retroactive take on the field, inspired by listening to talks at the Shonan meeting.

Arguably, the first formal algorithmic engagement with large data was the Aggarwal-Vitter external memory model from 1988. The idea was simple enough: accessing an arbitrary element of disk was orders of magnitude more expensive than accessing an element of main memory, so let's ignore main memory access and charge a single unit for accessing a block of disk.

The external memory model was (and is still) a very effective model of disk access. It wasn't just a good guide to thinking about algorithm design, it also encouraged design strategies that were borne out well in practice. One could prove that natural-sounding buffering strategies were in fact optimal, and that prioritizing sequential scans as far as possible (even to the extent of preparing data for sequential scans) was more efficient. Nothing earth-shattering, but a model that guides (and conforms to) proper practice is always a good one.

Two independent directions spawned off from the external memory model. One direction was to extend the hierarchy. Why stop at one main memory level when we have multilevel caches ? A simple extension to handle caches is tricky, because the access time differential between caches and main memory isn't sufficient to justify the idealized "0-1" model that the EM model used. But throwing in another twist - I don't actually know the correct block size for transfer of data between hierarchy levels - led us to cache-obliviousness.

I can't say for sure whether the cache-oblivious model speaks to the practice of programming with caches as effectively as the EM model speaks to disks. Being aware of your cache can bring significant benefits in principle. But the design principles ("repeated divide and conquer, and emphasizing locality of access") are sound, and there's already at least one company (Tokutek, founded by Martin Farach-Colton, Michael Bender and Bradley Kuszmaul) that is capitalizing on the performance yielded by cache-oblivious data structures.

The other direction was to weaken the power of the model. Since a sequential scan was so much more efficient than random access to disk, a natural question was to ask what you could do with just one scan. And thus was born the streaming model, which is by far the most successful model for large-data to date, with theoretical depth and immense practical value.

What we've been seeing over the past few years is the evolution of the streaming model to capture ever more complex data processing scenarios and communication frameworks.

It's quite useful to think of a stream algorithm as "communicating" a limited amount of information (the working memory) from the first half of the stream to the second. Indeed, this view is the basis for communication-complexity-based lower bounds for stream algorithms.

But if we think of this as an algorithmic principle, we then get into the realm of distributed computation, where one player possesses the "first half" of the data, the other player has the "second half", and the goal is for them to exchange a small number of bits in order to compute something (while streaming is one-way communication, a multi-pass streaming algorithm is a two-way communication).

Of course, there's nothing saying that we only have two players. This gets you to the $k$-player setup for a distributed computation, in which you wish to minimize the amount of communication exchanged as part of a computation. This model is of course not new at all ! It's exactly the distributed computing model pioneered in the 80s and 90s and has a natural home at PODC. What appears to be different in its new uses is that the questions being asked are not the old classics like leader election or byzantine agreement, but statistical estimation on large data sets. In other words, the reason to limit communication is because of the need to process a large data set, rather than the need to merely coordinate. It's a fine distinction, and I'm not sure I entirely believe it myself :)

There are many questions about how to compute various objects in a distributed setting: of course the current motivation is to do with distributed data centers, sensor networks, and even different cores on a computer. Because of the focus on data analysis, there are sometimes surprising results that you can prove. For example, a recent SIGMOD paper by Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu shows that if you want to do quantile estimation, you only need communication that's sublinear in the number of players ! The trick here is that you don't need to have very careful error bounds on the estimates at each player before sending up the summary to a coordinator.

It's also quite interesting to think about distributed learning problems, where the information being exchanged is specifically in order to build a good model for whatever task you're trying to learn. Some recent work that I have together with Jeff Phillips, Hal Daume and my student Avishek Saha explores the communication complexity of doing classification in such a setting.

An even more interesting twist on the distributed setting is the so-called 'continuous streaming' setting. Here, you don't just have a one-shot communication problem. Each player receives a stream of data, and now the challenge is not just to communicate a few bits of information to solve a problem, but to update the information appropriately as new input comes in. Think of this as streaming with windows, or a dynamic version of the basic distributed setting.

Here too, there are a number of interesting results, a beautiful new sampling trick that I'll talk about next, and some lower bounds.

I haven't even got to MapReduce yet, and how it fits in: while you're waiting, you might want to revisit this post.

Sunday, January 15, 2012

The Shonan Meeting (Part 0): On Workshops

Coming up with new ideas requires concentration and immersion. When you spend enough unbroken time thinking about a problem, you start forming connections between thoughts, and eventually you get a giant "connected component" that's an actual idea.

Distractions, even technical ones, kill this process. And this is why even at a focused theory conference, I don't reach that level of "flow". While I'm bombarded from all directions by interesting theory, there's a lot of context switching. Look, a new TSP result ! Origami and folding - fun ! Who'd have thought of analyzing Jenga ! Did someone really prove that superlinear epsilon-net lower bound ?

This is why focused workshops are so effective. You get bombarded with information for sure, but each piece reinforces aspects of the overall theme if it's done well. Slowly, over the course of the event, a bigger picture starts emerging, connections start being made, and you can feel the buzz of new ideas.

And this is why the trend of 'conferencizing' workshops, which Moshe Vardi lamented recently, is so pernicious. It's another example of perverse incentives ("conferences count more than workshops for academic review, so let's redefine a workshop as a conference"). A good workshop (with submitted papers or otherwise) provides focus and intensity, and good things come of it. A workshop that's really just a mini-conference has neither the intense intimacy of a true workshop nor the quality of a larger symposium.

All of this is a very roundabout way of congratulating Muthu, Graham Cormode and Ke Yi (ed: Can we just declare that Muthu has reached exalted one-word status, like Madonna and Adele ? I can't imagine anyone in the theory community hearing the name 'Muthu' and not knowing who that is) for putting on a fantastic workshop on Large-Scale Distributed Computing at the Shonan Village Center (the Japanese Dagstuhl, if you will). There was reinforcement, intensity, the buzz of new ideas, and table tennis ! There was also the abomination of fish-flavored cheese sticks, of which nothing more will be said.

In what follows, I'll have a series of posts from the event itself, with a personal overview of the evolution of the area, highlights from the talks, and a wrap up. Stay tuned...

Tuesday, January 03, 2012

Teaching review I: Programming

Please note: I'm in Boston this week at the Joint Math Meetings (6100 participants and counting!) for a SIAM minisymposium on computational geometry. If you happen to be in the area for the conference, stop by our session on Thursday at 8:30am.

I always assign programming questions in my graduate algorithms class. In the beginning, I used the ACM programming competition server, and also tried the Sphere Online system. This semester, I did neither, designing my own assignments instead.

Why do I ask students to code ? There are many reasons it's a good idea to get students to program in a graduate algorithms class. There are some additional reasons in my setting: a majority of the students in my class take it as a required course for the MS or Ph.D program. They don't plan on doing research in algorithms, many of them are looking for jobs at tech companies, and many others will at best be using algorithmic thinking in their own research.

Given that, the kinds of programming assignments I tried to give this semester focused on the boundary between theory and practice (or where O() notation ends). In principle, the goal of each assignment on a specific topic was to convert the theory (dynamic programming, prune and search, hashing etc) into practice by experimenting with different design choices and seeing how they affected overall run time.

Designing assignments that couldn't be solved merely by downloading some code was particularly tricky. As Panos Ipeirotis recommended in his study of cheating in the classroom (original post deleted, here's the post-post), the key is to design questions whose answer requires some exploration after you've implemented the algorithm. I was reasonably happy with my hashing assignment, where students were asked to experiment with different hashing strategies (open addressing, chaining, cuckoo hashing), measure the hash table behavior for different parameter choices, and draw their own conclusions. There was a fair degree of consistency in the reported results, and I got the feeling that the students (and I!) learned something lasting about the choices one makes in hashing (for example, simple cuckoo hashing degrades rapidly after a 70% load ratio).
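
To give a flavor of the kind of experiment involved, here is a minimal sketch of a plain two-table cuckoo hash that fills until an insertion fails and reports the load factor it reached; the hash functions and parameters are illustrative, not the ones from the actual assignment:

```python
import random

def cuckoo_insert(t1, t2, h1, h2, x, max_kicks=100):
    """Insert x into a two-table cuckoo hash; return False on giving up (a real table would rehash)."""
    for _ in range(max_kicks):
        i = h1(x)
        if t1[i] is None:
            t1[i] = x
            return True
        t1[i], x = x, t1[i]        # evict the occupant of table 1 and try to re-place it
        j = h2(x)
        if t2[j] is None:
            t2[j] = x
            return True
        t2[j], x = x, t2[j]        # evict the occupant of table 2 and keep going
    return False

def load_at_first_failure(m=2**14):
    t1, t2 = [None] * m, [None] * m
    a1, a2 = random.randrange(1, 2**31), random.randrange(1, 2**31)
    h1 = lambda x: (a1 * x % (2**31 - 1)) % m
    h2 = lambda x: (a2 * x % (2**31 - 1)) % m
    inserted = 0
    while cuckoo_insert(t1, t2, h1, h2, random.randrange(2**40)):
        inserted += 1
    return inserted / (2 * m)      # fraction of slots occupied when the first insert failed

print(load_at_first_failure())
```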

I did have a few disasters, and learned some important lessons.
  • If I expect to run student code on my test cases, I HAVE to provide a test harness into which they'd merely insert a subroutine. Otherwise, merely specifying input and output format is not enough. Students are not as familiar with a simple UNIX command line interface as I expected, and use pre-canned libraries in ways that make uniform testing almost impossible. Note that this is not language-independent, unless I'm willing to provide harnesses for the myriad different languages people use.
  • If you do want to accept code without using the 'subroutine-inside-harness' form, then you need some kind of testing environment where they can submit the code and get answers. This is essentially what ACM/TopCoder/SphereOnline do. 
  • If you're accepting code from students, use Moss or something like it. It's great for detecting suspiciously common code. When I used it and detected duplicates, in all cases the students admitted to working together.
  • It's easier to design assignments where the submitted product is not code, but some analysis of the performance (like in the hashing assignment above). Firstly, this is platform-independent, so I don't care if people use Java, C, C++, C#, python, perl, ruby, scheme, racket, haskell, ... Secondly, it avoids the issue of "did you write the code yourself" - not entirely, but partly.
But I'll definitely do this again. Firstly, it's good for the students - from what I hear, having written actual code for dynamic programming makes it much easier for them to write code under pressure during their tech interviews, and I've heard that knowledge of DPs and network flows is a key separator during tech hiring. Secondly, even for theory students, it helps illustrate where asymptotic analysis is effective and where it breaks down. I think this is invaluable to help theoreticians think beyond galactic algorithms.

Of course, no discussion of programming in algorithms classes is complete without a reference to Michael Mitzenmacher's exhortation.

Monday, December 26, 2011

On PC members submitting papers

Update: Michael Mitzenmacher's posts (one, two, and three, and the resulting comments) on implementing CoI at STOC are well worth reading (thanks, Michael). The comments there make me despair that *any* change will ever be implemented before the next century, but given that we've been able to make some changes already (electronic proceedings, contributed workshops, and so on), I remain hopeful.


For all but theory researchers, the reaction to the above statement is usually "don't they always?". In theoryCS, we pride ourselves on not having PC members submit papers to conferences. What ends up happening is:
  • You can't have too many PC members on a committee because otherwise there won't be enough submissions
  • The load on each PC member is much larger than reasonable (I'm managing 41 papers for STOC right now, and it's not uncommon to hit 60+ for SODA)
There's an ancillary effect: because of the first point, theory folks have fewer 'PC memberships' on their CVs, which can cause problems for academic performance review, but this is a classic Goodhart's Law issue, so I won't worry about it.

The main principle at play here is: we don't want potentially messy conflicts or complex conflict management issues if we do have PC members submitting papers. However, it seems to me that the practice of how we review papers is far different from this principle. 

Consider: I get an assignment of X papers to review if I'm on a conference PC. I then scramble around finding subreviewers for a good fraction of the papers I'm assigned (I used to do this less, but I eventually realized that a qualified subreviewer is FAR better than me in most subareas outside my own expertise, and is better for the paper).

Note (and this is important) my subreviewers have typically submitted papers to this conference (although I don't check) and I rely on them to declare any conflicts as per conference guidelines.

Subreviewers also get requests from different PC members, and some subreviewers might themselves review 3-4 papers.

Compare this to (say) a data mining conference: there are 30+ "area chairs" or "vice chairs", and over 200 PC members. PC members each review between 5-10 papers, and often don't even know who the other reviewers are (although they can see their reviews once they're done). The area/vice chairs manage 20-30 papers each, and their job is to study the reviews, encourage discussion as needed, and formulate the final consensus decision and 'meta-review'.


If you set "theory subreviewer = PC member" and "theory PC member = vice chair", you get systems that aren't significantly different. The main differences are:
  • theory subreviewers don't typically get to see other reviews of the paper. So their score assignment is in a vacuum. 
  • theory PC members are expected to produce a review for a paper taking the subreviewer comments into account (as opposed to merely scrutinizing the reviews being provided)
  • managing reviewer comments for 30 papers is quite different to generating 30 reviews yourself (even with subreviewer help)
  • A downside of the two-tier PC system is also that there isn't the same global view of the entire pool that a theory PC gets. But this is more a convention than a rule: there's nothing stopping a PC from opening up discussions to all vice chairs. 
  • One advantage of area chairs is that at least all papers in a given area get one common (re)viewer. That's not necessarily the case in a theory PC without explicit coordination from the PC chair and the committee itself.
But the main claimed difference (that people submitting papers don't get to review them) is false. Even worse, when submitters do review papers, this is 'under the table' and so there isn't the same strict conflict management that happens with explicit PC membership. 

We're dealing with problems of scale in all aspects of the paper review and evaluation process. This particular one though could be fixed quite easily.

Friday, December 23, 2011

Thoughts on ICDM II: Social networks

The other trend that caught my eye at ICDM is the dominance of social networking research. There was a trend line at the business meeting that bore this out, showing how topics loosely classified as social networking had a sharp rise among accepted papers in ICDM over the past few years.

There were at least three distinct threads of research that I encountered at the conference, and in each of them, there's something to interest theoreticians.
  • The first strand is modelling: is there a way to describe social network graphs using abstract evolution models or random graph processes ? I spent some time discussing this in a previous post, so I won't say more about it here. Suffice it to say that there's interesting work in random graph theory underpinning this strand, as well as a lot of what I'll call 'social network archaeology': scouring existing networks for interesting structures and patterns that could be the basis for a future model. 
  • The second strand is pattern discovery, and the key term here is 'community': is there a way to express natural communities in social networks in a graph-theoretic manner ? While modularity (the standard definition is recalled just after this list) is one of the most popular ways of defining a community, it's not the only one, and it has deficiencies of its own. In particular, it's not clear how to handle "soft" or "overlapping" communities. More generally, there appears to be no easy way to capture the dynamic (or time-varying) nature of communities, something Tanya Berger-Wolf has spent a lot of energy thinking about. Again, while modelling is probably the biggest problem here, I think there's a lot of room for good theory, especially when trying to capture dynamic communities.
  • The final strand is influence flow. After all, the goal of all social networking research is to monetize it (I kid, I kid). A central question here is: can you identify the key players who can make something go viral on the cheap ? Is the network topology a rich enough object to identify these players, and even if you can, how do you maximize the flow of influence (on a budget, and efficiently) ? 
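
For reference, the modularity objective mentioned in the second strand (in its standard Newman-Girvan form, for an undirected graph with $m$ edges, adjacency matrix $A$, degrees $k_i$, and community labels $c_i$) is

$$Q = \frac{1}{2m} \sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j),$$

which rewards partitions whose communities contain more internal edges than a degree-matched random graph would predict.
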
There were many papers on all of these topics -- too many to summarize here. But the landscape is more or less as I laid it out. Social networking research is definitely in its bubble phase, which means it's possible to get lots of papers published without necessarily going deep into the problem space. This can be viewed as an invitation to jump in, or a warning to stay out, depending on your inclination. And of course, the definitive tome on this topic is the Kleinberg-Easley book.

This concludes my ICDM wrap-up. Amazingly, it only took me a week after the conference concluded to write these up.

Thursday, December 22, 2011

CGWeek !!!

If you're on the compgeom mailing list, you probably know this already, but if not, read on:

There's an increased interest in expanding the nature and scope of events at SoCG beyond the conference proper. To this end, Joe Mitchell put together a committee titled "CG:APT" (CG: applications, practice and theory) to chalk out a strategy and solicit contributions.

The call for contributions is now out, and the main nugget is this:
Proposals are invited for workshops/minisymposia, or other types of events on topics related to all aspects of computational geometry and its applications. Typical events may feature some number of invited speakers and possibly some number of contributed presentations. Events may feature other forms of communications/presentations too, e.g., via software demos, panel discussions, industry forum, tutorials, posters, videos, implementation challenge, artwork, etc. CG:APT events will have no formal proceedings; optionally, the organizers may coordinate with journals to publish special issues or arrange for other dissemination (e.g., via arXiv, webpages, printed booklets, etc).
In other words, anything goes ! (This is an experiment, after all.) Topics are essentially anything that might be of interest geometrically in a broad sense (i.e. not limited to what might appear at the conference itself).

I'd strongly encourage people to consider putting together a proposal for an event. The procedure is really simple, and only needs a two page proposal containing:

  1. Title/theme of the workshop/minisymposium/event 
  2. Organizer(s) (name, email) 
  3. Brief scientific summary and discussion of merits (to CG) of the proposed topic. 
  4. A description of the proposed format and agenda 
  5. Proposed duration: include both minimum and ideal; we anticipate durations of approximately a half day (afternoon), with the possibility that some meritorious events could extend across two half-days. 
  6. Procedures for selecting participants and presenters 
  7. Intended audience 
  8. Potential invited speakers/panelists 
  9. Plans for dissemination (e.g., journal special issues) 
  10. Past experience of the organizer(s) relevant to the event 
Please note: EVERY COMMUNITY DOES THIS (and now, even theory). The deadline is Jan 13, 2012, and proposals should be emailed to Joe Mitchell (joseph.mitchell@stonybrook.edu). Note that the idea is for such events to be in the afternoon, after morning technical sessions of SoCG.

There are many people out there who grumble about how insular the CG community is. Now's your chance to walk into the lion's den (at Chapel Hill) and tell it off :).



Thoughts on ICDM I: Negative results (part C)

This is the third of three posts (one, two) on negative results in data mining, inspired by thoughts and papers from ICDM 2011.

If you come up with a better way of doing classification (for now let's just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (Note: this is a tricky problem in clustering that my student Parasaran Raman has been working on: more on that later.)

The obvious way to compare two classification methods is to see how well they do compared to some ground truth (i.e. labelled data), but this is a one-parameter system, because by changing the threshold of the classifier (or, if you like, translating the hyperplane around), you can trade off the false positive and false negative rates.

Now the more smug folks reading this are waiting with 'ROC' and 'AUC' at the tips of their tongues, and they'd be right ! You can plot a curve of the false positive rate vs the false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.

For example, if the y-axis measured increasing false negatives, and the x-axis measured increasing false positives, you'd want a curve that looks like an L with its apex at the origin, while a random classifier would look like the line x+y = 1. The AUC score would be zero for the good classifier and 0.5 for the bad one (there are ways of scaling this to be between 0 and 1).
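
Just to pin the quantity down, here is a minimal sketch that computes the standard AUC via its rank interpretation (the probability that a random positive outscores a random negative); in the FP-vs-FN orientation used above, the area is one minus this. The toy data and names are mine:

```python
import numpy as np

def auc(y_true, scores):
    """AUC = P(score of a random positive > score of a random negative), ties counting half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diffs = pos[:, None] - neg[None, :]          # all positive/negative score differences
    return (diffs > 0).mean() + 0.5 * (diffs == 0).mean()

y = np.array([1, 1, 0, 0, 1, 0])
s = np.array([0.9, 0.7, 0.6, 0.3, 0.2, 0.1])
print(auc(y, s))                                 # about 0.78 on this toy data
```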

The AUC is a popular way of comparing methods in order to balance the different error rates. It's also attractive because it's parameter-free and is objective: seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.

But is it ?

There's a line of work culminating (so far) in the ICDM 2011 paper 'An Analysis of Performance Measures For Binary Classifiers' by Charles Parker of BigML.com (sorry, no link yet). The story is quite complicated, and I doubt I can do it justice in a blog post, so I'll try to summarize the highlights, with the caveat that there's nuance that I'm missing out on, and you'll have to read the papers to dig deeper.

The story starts with "Measuring classifier performance: a coherent alternative to the area under the ROC curve" ($$ link) by David Hand (copy of paper here). His central result is a very surprising one:
The AUC measure implicitly uses different misclassification costs when evaluating different classifiers, and thus is incoherent as an "objective" way of comparing classifiers.
To unpack this result a little, what's going on is this. Suppose you have a scenario where correct classification costs you nothing, and misclassification costs you a certain amount (that could be different for the two different kinds of misclassification). You can now write down an overall misclassification cost for any threshold used for a classifier, and further you can compute the optimal threshold (that minimizes this cost). If you don't actually know the costs (as is typical) you can then ask for the expected misclassification cost assuming some distribution over the costs.
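
In symbols (my notation, which may not match Hand's paper exactly): say class $j$ has prior $\pi_j$ and score distribution $F_j$, a point is labelled positive when its score exceeds a threshold $t$, and the two kinds of misclassification cost $c_0$ and $c_1$ respectively. Then the overall cost at threshold $t$ is

$$Q(t; c_0, c_1) = c_0\,\pi_0\,\big(1 - F_0(t)\big) + c_1\,\pi_1\,F_1(t),$$

the optimal threshold is $T(c_0, c_1) = \arg\min_t Q(t; c_0, c_1)$, and the quantity of interest is the expected cost $\int Q(T(c_0, c_1); c_0, c_1)\, dW(c_0, c_1)$ for some distribution $W$ over the costs.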

If you run this computation through, what you end up with is a linear transformation of the AUC, where the distribution over the costs depends on the distribution of scores assigned by the classifier ! In other words, as Hand puts it,
It is as if one measured person A’s height using a ruler calibrated in inches and person B’s using one calibrated in centimetres, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used
This is a rather devastating critique of the use of the AUC. While there's been pushback (a case in point is an ICML 2011 paper by Flach, Hernandez-Orallo and Ferri, which is a very interesting read in its own right), the basic premise and argument are not contested (what's contested is the importance of finding the optimal threshold). Hand recommends a few alternatives, and in fact suggests that the distribution of costs should instead be made explicit, rather than being implicit (and subject to dependence on the data and classifiers).

What Parker does in his ICDM paper is take this further. In the first part of his paper, he extends the Hand analysis to other measures akin to the AUC, showing that such measures are incoherent as well. In the second part of his paper, he unleashes an experimental tour de force of classifier comparisons under different quality measures, showing that
  • nearly 50% of the time, measures disagree on which classifier in a pair is more effective. He breaks down the numbers in many different ways to show that if you come up with a new classification algorithm tomorrow, you'd probably be able to cherry pick a measure that showed you in a good light. 
  • It's the measures perceived to be more "objective" or parameter-less that had the most trouble reconciling comparisons between classifiers. 
  • It's also not the case that certain classifiers are more likely to cause disagreements: the problems are spread out fairly evenly.
  • His experiments also reinforce Hand's point that it's actually better to define measures that explicitly use domain knowledge, rather than trying to achieve some objective measure of quality. Measures that were either point-based (not integrating over the entire range) or domain specific tended to work better.  
I'm not even close to describing the level of detail in his experiments: it's a really well-executed empirical study that should be a case study for anyone doing experimental work in the field. It's especially impressive because, from personal experience, I've found it to be REALLY HARD to do quality methodological studies in this area (as opposed to the "define algorithm-find-toy-data-profit" model that most DM papers seem to follow).

At a deeper level, the pursuit of objective comparisons that can be reduced to a single number seems fundamentally misguided to me. First of all, we know that precise cost functions are often the wrong way to go when designing algorithms (because of modelling issues and uncertainty about the domain). Secondly, we know that individual methods have their own idiosyncrasies - hence the need for 'meta' methods. And finally, we're seeing that even the meta-comparison measures have severe problems ! In some ways, we're pursuing 'the foolish hobgoblin of precision and objectivity' in an area where context matters more than we as mathematicians/engineers are used to.


Tuesday, December 20, 2011

Thoughts on ICDM I: Negative Results (part B)

Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned !

One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because
  • you can target your algorithms to graphs that look like this, hopefully making them more efficient
  • You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
  • If you're lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.
But there aren't even good graph models that capture the known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn't have the heavy-tailed degree distribution that's common in social networks. It also doesn't capture two related properties common to large social networks: densification (the number of edges grows superlinearly in the number of nodes, so the network gets denser over time) and the fact that the diameter stays small, or even shrinks, as the network grows.

One approach to fixing this models a social network as a Stochastic Kronecker graph (SKG). You can read more about these graphs here: a simple way of imagining them is that you add an edge to the graph by a random process that does a (kind of) quad-tree-like descent down a recursive partitioning of the adjacency matrix and places a 1 at the leaf it reaches. SKGs were proposed by Leskovec, Chakrabarti, Kleinberg and Faloutsos, and include ER graphs as a special case. They appear to capture heavy-tailed degree distributions as well as densification, and have become a popular model for testing algorithms on social networks. They're also used as the method to generate benchmark graphs for the HPC benchmark Graph500.
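
Here is a minimal sketch of that quad-tree-style descent (essentially the R-MAT-style edge generator used for Graph500 benchmark graphs); the quadrant probabilities below are commonly quoted Graph500-ish values, used purely for illustration:

```python
import random

def skg_edge(levels, a, b, c, d):
    """Pick one edge of a 2**levels-vertex graph by descending the adjacency matrix:
    at each level, choose one of the four quadrants with probabilities a, b, c, d."""
    row = col = 0
    for _ in range(levels):
        row, col = 2 * row, 2 * col
        r = random.random()
        if r < a:
            pass                          # top-left quadrant
        elif r < a + b:
            col += 1                      # top-right
        elif r < a + b + c:
            row += 1                      # bottom-left
        else:
            row, col = row + 1, col + 1   # bottom-right
    return row, col

# e.g. a 2**16-vertex graph with average degree 16
edges = [skg_edge(16, 0.57, 0.19, 0.19, 0.05) for _ in range(16 * 2**16)]
```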

But a thorough understanding of the formal properties of SKGs has been lacking. In "An In-Depth Analysis of Stochastic Kronecker Graphs", Seshadhri, Pinar and Kolda show some rather stunning results. Firstly, they provide a complete analysis of the degree distribution of an SKG, and prove a beautiful result showing that it oscillates between having a lognormal and exponential tail. Their theorems are quite tight: plots of the actual degree distribution match their theorems almost perfectly, and convincingly display the weird oscillations in the degree frequencies (see Figure 2 in the paper).

Secondly, they also formally explain why a noisy variant of SKGs appears to have much more well-behaved degree distribution, proving that a slightly different generative process will indeed generate the desired distribution observed in practice.

Finally, they also show that the graphs generated by an SKG have many more isolated nodes than one might expect, sometimes up to 75% of the total number of vertices ! This has direct implications for the use of SKGs as benchmarks. Indeed, they mention that the Graph500 committee is considering changing its benchmarks based on this paper - now that's impact :)

What I like about this paper is that it proves definitive theoretical results about a popular graph model, and very clearly points out that it has significant problems. So any methodology that involves using SKGs for analysis will now have to be much more careful about the claims it makes.

p.s. There's also more supporting evidence on the lack of value of SKGs from another metric (the clustering coefficient, which measures how often configurations uv, uw also have the third edge vw). Real social networks have a high CC, and SKGs don't. This was first mentioned by Sala, Cao, Wilson, Zablit, Zheng and Zhao, and Seshadhri/Pinar/Kolda have more empirical evidence for it as well. (Disclaimer: I was pointed to these two references by Seshadhri: my opinions are my own though :))


Sunday, December 18, 2011

Thoughts on ICDM I: Negative Results (part A)

I just got back from ICDM (the IEEE conference on Data Mining). Data mining conferences are quite different from theory conferences (and much more similar to ML or DB conferences): there are numerous satellite events (workshops, tutorials and panels in this case), many more people (551 for ICDM, and that's on the smaller side), and a wide variety of papers that range from SODA-ish results to user studies and industrial case studies.

While your typical data mining paper is still a string of techniques cobbled together without rhyme or reason (anyone for spectral manifold-based correlation clustering with outliers using MapReduce?), there are some general themes that might be of interest to an outside viewer. What I'd like to highlight here is a trend (that I hope grows) in negative results.

It's not particularly hard to invent a new method for doing data mining. It's much harder to show why certain methods will fail, or why certain models don't make sense. But in my view, the latter is exactly what the field needs in order to give it a strong inferential foundation to build on (I'll note here that I'm talking specifically about data mining, NOT machine learning - the difference between the two is left for another post).

In the interest of brevity, I'll break this trend down in three posts. The first result I want to highlight isn't even quite a result, and isn't even from ICDM ! It's actually from KDD 2011 (back in August this year). The paper is Leakage in Data Mining: Formulation, Detection, and Avoidance, by Shachar Kaufman, Saharon Rosset, and Claudia Perlich, and got the Best Paper Award at KDD this year.

The problem they examine is "leakage", or the phenomenon that even in a "train model and test model" framework, it is possible for valuable information to "leak" from the test data or even from other sources into the training system, making a learned model look surprisingly effective and appearing to give it predictive powers beyond what it can really do. Obviously, the problem is that when such models are then applied to new data, their performance is worse than expected, compared to a model that wasn't "cheating" in this way.

They cite a number of examples, including many that come from the data challenges that have become all the rage. The examples highlight different kinds of leakage, including "seeing the future", cross contamination of data from other data sets, and even leakage by omission, where a well-intentioned process of anonymization actually leaks data about the trend being analyzed.

While there are no results of any kind (experimental or theoretical), the authors lay out a good taxonomy of common ways in which leakage can happen, and describe ways of avoiding leakage (when you have control over the data) and detecting it (when you don't). What makes their paper really strong is that they illustrate this with specific examples from recent data challenges, explaining how the leakage occurs, and how the winners took advantage of this leakage explicitly or implicitly.


There are no quick and dirty fixes for these problems: ultimately, leakage can even happen through bad modelling, and sometimes modellers fail to remove leaky data because they're trying to encourage competitors to build good predictive models. Ironically, it is this encouragement that can lead to less predictive models on truly novel data. But the paper makes a strong argument that the way we collect data and use it to analyze our algorithms is fundamentally flawed, and this is especially true for the more sophisticated (and mysterious) algorithms that might be learning models through complex exploitation of the trained data.

It remains to be seen whether the recommendations of this paper will be picked up, or if there will be more followup work along these lines. I hope it does.

A rant about school science fair projects

It's science fair time at my son's school. He's in the first grade, so admittedly there's not a lot that he can do without a reasonable amount of *cough* parental *cough* help. But why do we not have a 'mathematics' fair or a 'programming' fair ?

The science fair project format is very confining. You have to propose a hypothesis, tabulate a bunch of results, do some analysis, and discuss conclusions, with nice charts/graphs and other such science cultism. Even if you're interested in something more 'mathematical', there's no real way of shoehorning it into the format. A year or so ago, a colleague of mine was asking me about origami-related projects (because his daughter loves paper folding) but apart from experimenting with knapsack-style algorithms to determine how to fold a ruler into a specified length, we couldn't figure out something that fit into the 'hypothesis-experiment' setting.

Granted, it's a science fair. But at this age level, I assume the whole point of participation in science fairs is about learning something about science, rather than conducting rigorous analysis. You could equally well learn about something mathematical and demonstrate that knowledge. But there's no forum to do that in.


Friday, December 16, 2011

Simons Fellowship for theory grad students.

The Simons Foundation has been a great boon for theoretical computer science, supporting postdocs galore, and even running a "sooper-seekrit" search for a new TCS institute.

Their latest initiative is at the grad student level. They're offering a 2-year fellowship to grad students "with an outstanding track record of research accomplishments". I think the idea is to support students who've established a good body of work, and could use this to coast towards their graduation and ramp up their research even more.

The support is considerable:
Each award will provide annual support for the following:
  • Fellowship stipend as set by the student’s institution.
  • Tuition and fees at the student's institution, to the extent previously covered by other outside funding. 
  • $3,000 in additional support for the student to use at his/her discretion.
  • $5,000 per year toward the student’s health insurance and other fringe benefits.
  • $5,000 per year for travel, books, a computer and other supplies. 
  • $5,000 in institutional overhead allowance.
Fellowships will start September 2012 and end August 2014.

How do you apply?
Applicants may apply through proposalCENTRAL (http://proposalcentral.altum.com/default.asp?GMID=50) beginning December 7, 2011. The deadline to apply is May 1, 2012. Please coordinate submission of the proposal with the appropriate officials in accordance with institution/university policies. Please see the Application Instructions for further information.
Application Requirements:
  • Research Statement (two page limit): A statement summarizing the applicant’s research contributions, research plans for the immediate future, and career goals. References do not need to be included within the page limit, but should not exceed an additional page.
  • A curriculum vitae (two page limit), which includes institution, advisor, and a list of publications.
  • A letter of support from the Department Chair.
  • A letter of support from the student’s advisor.
  • A letter from a reference outside the student’s institution. This letter must be submitted directly via proposalCENTRAL by the reference. Please see the Application Instructions for more information.
  • Thesis topic.

(HT Sampath Kannan for pointing this out)


Thursday, December 08, 2011

ACM Fellows, 2011

Many theoreticians on this year's list of ACM Fellows:
  • Serge Abiteboul
  • Guy Blelloch
  • David Eppstein
  • Howard Karloff
  • Joe Mitchell
  • Janos Pach
  • Diane Souvaine
Congratulations to them, and to all the Fellows this year (especially my collaborator Divesh Srivastava).

SoCG and ACM: The Results Show

From Mark de Berg:
The bottom row of the table [below] gives the number of votes for each of the three options

A.    I prefer to stay with ACM.

B.    If involvement of ACM can be restricted to publishing the proceedings, at low cost for SoCG, then I prefer to stay with ACM; otherwise I prefer to leave ACM.

C.     I prefer to leave ACM, and organize SoCG as an independent conference with proceedings published in LIPIcs.

and it also gives a breakdown of the votes by the number of SoCG’s that the voter attended in the last 10 years.  

Attended (last 10 yrs)    A: stay    B: proceedings only    C: leave    Total
A: 0                          4               3                  3        10
B: 1-2                        6              16                 19        41
C: 3-5                       11              15                 16        42
D: >5                         8              14                  9        31
Total                        29              48                 47       124

Based on the result of the Poll, the Steering Committee decided to start negotiating with ACM to see if they can offer SoCG an arrangement in which involvement of ACM is limited primarily to publishing the proceedings, with the possible option for greater involvement if the local organizers request/require it.
It will be interesting to see how this plays out.


Monday, November 21, 2011

On getting rid of ACM affiliation...

As I mentioned earlier, the SoCG steering committee is running a poll on whether to separate SoCG from ACM and run it as an independent symposium with proceedings published by the Dagstuhl LIPIcs series.

At the time I had considered writing a post expressing my own thoughts on the matter, and then promptly forgot about it. The poll is still open (although probably many of you have voted already) and so I thought (with some external nudging :)) that I'd state my position here (one that I summarized in brief for the poll).

If you'd rather not read any more, here's the summary: 
I think it's a bad idea.
And here's why.

I understand the rationale behind this: it can be a little difficult to work with ACM. ACM does take a contingency fee that drives up registration costs. The cost of proceedings is a significant fraction of the total budget, and the timelines can be a little tricky to work with. These points and more are outlined in the poll document, and you can read all the arguments being made there.

But that's not why I think this is a bad idea (and I do have specific objections to the claims above). I think it's a bad idea because ACM is not just a conference-organizing, proceedings-publishing enterprise with bad taste in style files. It's also the home of SIGACT.

For better or worse, SIGACT is the flagship representative body for theoretical computer science in the US. General theory activities like the Theory Matters wiki and the SIGACT Committee for the Advancement of Theoretical Computer Science were jump-started by SIGACT. Association with SIGACT means that the work you do is 'theoryCS (A)' in some shape or form.

If you're doing geometry in the US, then affiliation with SIGACT is not a bad thing. It means that you're part of the larger theory community. It means that when the theory community makes pitches to the NSF to get more funding for theory research, you're included. It also helps because there aren't other communities (SIGGRAPH?) ready to absorb geometry into their fold. 

While de-linking from ACM doesn't mean that we turn in our 'theory badges', it doesn't help the already fraught relationship between computational geometry and the larger theory community. And while the relationship with ACM is not crucial to the survival of the geometry community in the US, being part of a larger theory community that speaks the same language is crucial.

p.s. The center of mass in CG is closer to Europe than many might realize. I understand that our European colleagues couldn't care less about US funding and community issues, which is fair for them. But this matter affects US-resident folks more than ACM's surcharges affect European researchers, and our community isn't so big that we can withstand shocks to one side of it.

Tuesday, November 15, 2011

SODA 2012 student travel support

While it's not a free round trip to Japan for blogging...

Well, actually it is!

David Johnson asks me to point out that the deadline for applying for student travel support to SODA 2012 has been extended to Nov 29.
The United States National Science Foundation, IBM, and Microsoft have provided funding to support student travel grants. The nominal award is $1000 toward travel expenses to Kyoto, Japan for SODA12, or ALENEX12 or ANALCO12. In special cases of large travel costs somewhat larger grants may be given.
Any full-time student in good standing who is affiliated with a United States institution is eligible to receive an award. All awardees are responsible for paying for meeting registration fees. However, registration fees may be included in travel expenses
Top priority will be given to students presenting papers at the meeting, with second priority to students who are co-authors of papers. Ties will be broken in favor of students who have not attended a prior SODA.
For more information, visit the SIAM travel support page

And if you'd also like to blog about the conference once you're there, the cstheory blog is always looking for volunteers!
