The Geomblog

Wednesday, April 18, 2012

Distributed Learning: A new model

Communication is now the key to modelling distributed/multicore computations. Jim Demmel has been writing papers and giving talks on this theme for a while now, and as processors get faster, and the cloud becomes a standard computing platform, communication between nodes is turning out to be the major bottleneck.

So suppose you want to learn in this setting ? Suppose you have data sitting on different nodes (you have a data center, or a heterogeneous sensor network, and so on) and you'd like to learn something on the union of the data sets. You can't afford to ship everything to a single server for processing: the data might be too large to store, and the time to ship might be prohibitive.

So can you learn over the (implicit) union of all the data, with as little discussion among nodes as possible ? This was the topic of my Shonan talk, as well as two papers that I've been working on with my student Avishek Saha, in collaboration with Jeff Phillips and Hal Daume. The first one will be presented at AISTATS this week, and the second was just posted to the arxiv.

We started out with the simplest of learning problems: classification. Supppose you have data sitting on two nodes (A and B), and you wish to learn a hypothesis over the union of A and B. What you'd like is a way for the nodes to communicate as little as possible with each other while still generating a hypothesis close to the optimal solution.

It's not hard to see that you could compute an $\epsilon$-sample on A, and ship it over to B. By the usual properties of an $\epsilon$-sample, you guarantee that any classifier on B's data combined with the sample will also classify A correctly to within some $\epsilon$-error. It's also not too hard to show a lower bound that matches this upper bound. The amount of communication is nearly linear in $1/\epsilon$.

But can you do better ? In fact yes, if you let the nodes talk to each other, rather than only allowing one-way communication. One way of gaining intuition for this is that $A$ can generate classifiers, and send them over to $B$, and $B$ can tell $A$ to turn the classifier left or right. Effectively, $B$ acts as an oracle for binary search. The hard part is showing that this is actually a decimation (in that a constant fraction of points are eliminated from consideration as support points in each step), and once we do that, we can show an exponential improvement over one-way communication. There's a trivial way to extend this to more than 2 players, with a $k^2$ blow up in communication for $k$ players.

This binary search intuition only works for points in 2D, because then the search space of classifiers is on the circle, which lends itself naturally to a binary search. In higher dimensions, we have to use what is essentially a generalization of binary search - the multiplicative weight update method. I'll have more to say about this in a later post, but you can think of the MWU as a "confused zombie" binary search, in that you only sort of know "which way to go" when doing the search, and even then points that you dismissed earlier might rise from the dead.

It takes a little more work to bring the overhead for k-players down to a factor k. This comes by selecting one node as a coordinator, and implementing one of the distributed continuous sampling techniques to pass data to the coordinator.

You can read the paper for more details on the method. One thing to note is that the MWU can be "imported" from other methods that use it, which means that we get distributed algorithms for many optimization problems for free. This is great because a number of ML problems essentially reduce to some kind of optimization.

A second design template is multipass streaming: it's fairly easy to see that any multipass sublinear streaming algorithm can be placed in the k-player distributed setting, and so if you want a distributed algorithm, design a multipass streaming algorithm first.

One weakness of our algorithms was that we didn't work in the "agnostic" case, where the optimal solution itself might not be a perfect classifier (or where the data isn't separable, to view it differently). This can be fixed: in an arxiv upload made simultaneously with ours, Blum, Balcan, Fine and Mansour solve this problem very neatly, in addition to proving a number of PAC-learning results in this model.

It's nice to see different groups exploring this view of distributed learning. It shows that the model itself has legs. There are a number of problems that remain to be explored, and I'm hoping we can crack some of them. In all of this, the key is to get from a 'near linear in error' bound to a 'logarithmic in error' bound via replacing sampling by active sampling (or binary search).

Monday, April 09, 2012

Approximate Bregman near neighbors.

(tl;dr: Our upcoming paper in SoCG 2012 shows that with a nontrivial amount of work, you can do approximate Bregman near neighbor queries in low dimensional spaces in logarithmic query time)

One of the things that has haunted me over the years (yes, haunted!) is the geometry of Bregman divergences. A Bregman divergence is a very interesting creature. It's defined as the difference between a convex function and its linear approximation starting at another point:
$$ d_\phi(p,q) = \phi(p) - \phi(q) - \langle \nabla \phi(q), p - q\rangle $$
Because the Bregman divergences are parametrized by the function $\phi$, they yield a family of divergences. The most familiar one is $\ell_2^2$, but the most important one is the Kullback-Leibler divergence. There are a ton of reasons why one should study Bregman divergences, and in a shameless plug I'll refer to you my slides (one, two, three) from a geometry summer school last year. Suffice it to say that it's possible to unify a number of different algorithms in machine learning under a single framework by realizing that they're all particular instances of doing something with a Bregman divergence: two notable examples of this are Adaboost and information-theoretic clustering.

So what's the geometric perspective on Bregman divergences ? They generalize duality !

There's a standard way to think about traditional point-hyperplane duality via a lifting map: take a point, lift it to the paraboloid one dimension up, and find the tangent plane. But suppose we replace the paraboloid by a generic convex function ? what we get is a general convex duality (technically a Fenchel-Legendre duality) defined by
$$\phi^*(u) = \sup_v \langle u,v \rangle - \phi(v)$$
The reason this doesn't come up in Euclidean space is that the mapping is self-dual if we set $\phi(x) = (1/2) \|x\|^2$, which explains the symmetry in $\ell_2^2$.

It turns out that this observation is enough to import much of standard combinatorial geometry over to general Bregman spaces. We can compute convex hulls, Voronoi Diagrams, delaunay triangulations etc, as long as we're careful to keep primal and dual straight. for example, the locus of points equidistant from two points under a Bregman divergence is either a primal straight line or a dual straight line (depending on how you measure things), and so on. This was all elegantly worked out by Nielsen, Nock and Boissonnat a while ago.

Alas, this beautiful connection doesn't help us with approximation problems.

One of the most interesting questions regarding Bregman divergences is solving the near-neighbor problem. This is interesting because the Kullback-Leibler distance is often used (for example) to compare images, and so there's been a lot of empirical work on this problem.

But here's the problem. Let's consider the low dimensional ANN problem. In Euclidean space (or even in spaces of bounded doubling dimension), here's how you solve the problem. You build some kind of quad-tree-like data structure, using the triangle inequality to reason about which cells you need to explore, and using packing bounds to bound the number of cells explored. You also need a crude bound on the near neighbor to start with, and to do this, you use some variant of a ring-tree.

The key ingredients: triangle inequality (twice over) and packing bounds.

Bregman divergences don't even satisfy a directed triangle inequality in general. And to date, we didn't even know how to define packing bounds properly for these directed distances.

In an upcoming paper at SoCG 2012 with my students Amirali Abdullah and John Moeller, we finally figured out how to get some "approximation" of the ANN machinery to work with Bregman divergences, and get logarithmic query times with small space.

If I say so myself, there are some nice ideas in the paper. Firstly, in a somewhat surprising twist, a Bregman divergence satisfies a "reverse triangle inequality" on the line:

\[ d_\phi(a,b) + d_\phi(b,c) \le d_\phi(a, c), a, b, c, \in R\]

This is neat, because it gives packing bounds ! Intuitively, if the sum of lengths of subintervals along an interval is at most the length of the interval, then you can't have too many subintervals.

The next trick is an observation that the square root of a Bregman divergence satisfies a kind of relaxed triangle inequality that we call $\mu$-defectiveness:

\[ |d(x, y) - d(x,z)| \le \mu d(y, z)\]

This allows us to import some of the ring-tree machinery to get a coarse approximation.

And I should mention that the $\mu$ values involved here are quite small. If you're comparing distributions with the KL-divergence, then the value of $\mu$ is less than 2 even quite close to the boundary of the simplex.

Even with all of this in place, the quad-tree argument breaks. This is because of $\mu$-defectiveness: we can't assume that cell sizes are "reasonably large" at some number of levels below the root of the tree, and so our packing bounds look terrible. It takes a lot more work to fix this problem: essentially we exploit second-order structure of the Bregman divergences to bound the cell sizes and get the desired packing bound.

Afterthoughts:

Most of the complexity of the algorithm is in the analysis: the actual algorithm looks mostly like a Euclidean ANN procedure. While we haven't implemented it, I'm hopeful that when we do, we'll be able enjoy the empirical behavior of our Euclidean cousins.
We're looking at high dimensional Bregman near neighbors, as well as some other approximate Bregman geometry questions. While our low-D ANN result comes from the "I will break your limbs one by one till you submit" school of algorithms, the hope is that we can start to exploit more of the dual geometry as we learn more.

Sunday, April 08, 2012

Postdoc Announcement

Piotr Indyk asked me to post this announcement for a postdoc position:

Applications are invited for a Postdoctoral Research Assistant position for the MIT-Shell-Draper Lab research project

"Applications of compressive sensing and sparse structure exploitation in hydrocarbon reservoir exploration and surveillance"

The goal of the project is to develop novel compressive sensing algorithms for geoscience problems in the area of hydrocarbon field exploration and surveillance. The appointment is initially for one-year, with the possibility of renewal for up to 3 years. The appointment should start either during the summer (the preferred option) or the fall of 2012.

Duties and Responsibilities:

Provide expertise on and contribute to the development of compressive sensing and sparse recovery algorithms for geoscience applications

Help MIT faculty in coordination of research projects, including periodic reporting and documentation as required by the program timeline

Frequent travel to Shell facilities in Houston

Minimum Qualifications
Ph.D. in Computer Science, Electrical Engineering, Mathematics or related disciplines

Preferred Qualifications

Expertise in sparse recovery, compressive sensing, streaming/sketching, machine learning and statistical inference algorithms

Experience and/or interest in research projects of interdisciplinary nature

Programming experience (MATLAB)

Applications (including CV and three reference letters) should be submitted to https://postdoc.csail.mit.edu/searches/sparsityp-search/ ideally by April 15, 2012. However, applications will be considered until the position is filled.

Thursday, March 29, 2012

STOC 2012 news: posters, registration and student travel.

This is a friendly reminder to submit your 1-2 paragraph abstracts for the STOC 2012 Poster session. The deadline is Mar 31 @ 5pm PDT (via email to stoc2012posters@gmail.com). Here's how to do it, and here's why you should do it !

Remember, you DON'T need to submit an actual poster by Mar 31 - just the abstract and information on who will present.

In related STOC news, STOC 2012 registration is now live and the deadline for applying for student travel funding is April 4th.

See you in NYC !

Sunday, March 18, 2012

STOC 2012 Posters: Submit your abstract now !

As I mentioned earlier, STOC 2012 is repeating the successful poster event from STOC 2011. For more details on the logistics of submitting, see my post. What I'd like to do is give you more reasons to submit !

No one attends talks !

Ok, that's a little bit of an exaggeration: but between "conference = journal in a hotel" and "chicken chicken chicken" it's fair to say that talks at conferences tend to draw only those who are particularly interested in the specifics of a work, and is less likely to draw casual attendees.

A poster session on the other hand is like wandering through a nice bookstore: you can browse the topics as you see fit, and jump into anything that catches the eye. If you're a poster presenter, this is a great opportunity to convey a higher level intuition for your work to an audience that might not be conversant in the specifics

It's all about the eyeballs !

Everyone should make posters ! We are in an attention-based economy now, and your work gets known and disseminated only if you can get people's attention. A poster is an extremely effective way to communicate with your audience, especially if you use the visual medium effectively. Most conferences now have elaborate poster sessions and it's a great way to meet people and hear about material I wouldn't have otherwise had time to assimilate.

It's all about the networking !

If you're a student, either presenting (or better yet, not presenting) at STOC, what better way to get conversations going with more senior researchers, instead of huddling together with your fellow students, wondering if you can make enough eye contact to get an intro (yes, we do notice :)). I can almost guarantee that you'd have more meaningful interactions with researchers at a poster than at a talk where people have half an eye on their email.

It's all about the impact !

So you got a paper into STOC ! Congratulations - there's a nice CV bullet for you. But what's next ? You want people to read your paper, talk about it, argue about it, and build upon it. Don't you ? Again, a well designed poster can draw in attention much more effectively than a talk in our attention-deficit world, and more attention means more discussions, and more potential impact.

It's easy !!

Now you surely don't believe me :). If you don't. consider your options. You can use LaTeX/beamer, and here's a fantastic resource to help. You can use Powerpoint or Inkscape, or if you're one of those fancy Mac people, you can use whatever fancy Mac tools I'm not cool enough to talk about.

Best of all, you don't need to have the poster ready by the deadline of Mar 31. All you need is a short abstract.

So don't think ! Whether you have a paper or not, do consider submitting a poster - by Mar 31 - to stoc2012posters@gmail.com - and here are the details.

Traffic...

When I first started blogging, I obsessed over page views and visits and other metrics. Eventually I got over myself. But reading Igor's post about pageviews (congrats on the million !) got me wondering, and I decided to look at my sitemeter:

Took me a while to get there, but it's nice to see regardless.

Monday, February 27, 2012

STOC 2012 Call For Posters

This year, STOC 2012 will have a poster session, reprising the successful event from STOC 2011. The poster session will be held during the main conference, and should be thought of as an extended hallway-discussion with a visual aid. The poster session will be ba separate session on its own, accompanied by refreshments. We welcome posters from attendees on research in any aspect of theoretical computer sciences, and believe presenting posters should be especially appealing for researchers with recent relevant result.

Researchers with papers in other conferences that would be of interest to the STOC community.
STOC 2012 authors who want to have a visual aid for extended/related discussions of their work.
Students who wish to discuss their work in a broader context (for example, by presenting an overview of a line of research)

Submission

In order to present a poster, authors must email the following information to stoc2012posters@gmail.com by March 31, 2012, 5pm PT

Title
A 1-2 paragraph abstract
Name of person presenting the poster
Whether the presenter is planning to register for STOC

Note that poster submissions are not refereed. However, the committee reserves the right to turn down submissions deemed out of scope. If the number of poster submissions exceeds capacity, priority will be given to registered STOC attendees and secondarily according to the date of submission. Poster slots will be confirmed by Apr 10, 2012. If any slots remain, submissions received after the deadline may be considered at the discretion of the committee.

Poster Preparation

Posters should be designed to fit within a space that is 42in wide and 48in high, and may consist of a single printed poster or separate pieces. Mounted posterboards, as well as pushpins and tape for attaching your poster to the posterboard, will be provided at the conference.

STOC 2012 Posters Committee
Chandra Chekuri
Sergei Vassilvitskii
Suresh Venkatasubramanian

Friday, January 20, 2012

The Shonan Meeting (Part 3): Optimal Distributed Sampling

Reservoir sampling is a beautiful gems of sampling: easy to explain, and almost magic in how it works. The setting is this:

A stream of items passes by you, and your goal is to extract a sample of size $s$ that is uniform over all the elements you've seen so far.

The technique works as follows: if you're currently examining the i-th element, select it with probability 1/i, and then pick an element uniformly from the current sample to be replaced by it. I've talked about this result before, including three different ways of proving it.

But what happens if you're in a continuous distributed setting ? Now each of $k$ players is reading a stream of items, and they all talk to a coordinator who wishes to maintain a random sample of the union of streams. Let's assume for now that $s \le k$

Each player can run the above protocol and send an item to the coordinator, and the coordinator can pick a random subset from these. But this won't work ! At least, not unless each player has read in exactly the same amount of data as each other player. This is because we need to weight the sample element sent by a player with the number of elements that player has read.

It's not hard to see that each player sends roughly log n messages to the coordinator for a stream of length n. So maybe each player also annotates the element with the number of elements it has seen so far. This sort of works, but the counts could be off significantly, since a stream that doesn't send a sample might have read many more elements since the previous time it sent an update.

This can be fixed by having each player send an extra control message when its stream increases in size by a factor of 2, and that would not change the asymptotic complexity of the process, but we still don't get a truly uniform sample.

The problem with this approach is that it's trying to get around knowing $n$, the size of the stream, which is expensive to communicate in a distributed setting. So can we revisit the original reservoir method in a 'communication friendly way' ?

Let's design a new strategy for reservoir sampling that works as follows.

Maintain a current "threshold" t. When a new item arrives, assign it a random value r between 0 and 1. If r < t, keep the new item and set t = r, else discard it.

By using the principle of deferred decisions, you can convince yourself that this does exactly the same thing as the previous strategy (because at step i, the probability of the current element being retained is its probability of r being the minimum over the set seen so far, which is 1/i). the good thing is that this approach doesn't need to know how many elements have passed so far.

This approach can be extended almost immediately to the distributed setting. Each player now runs this protocol instead of the previous one, and every time the coordinate gets an update, it sends out a new global threshold (the minimum over all thresholds sent in) to all nodes. If you want to maintain a sample of size $s$, the coordinator keeps $s$ of the $k$ elements sent in, and the overall complexity is $O(ks \log n)$.

But you can do even better.

Now, each player maintains its own threshold. The coordinator doesn't broadcast the "correct" threshold until a player sends an element whose random value is above the global threshold. This tells the coordinator that the player had the wrong threshold, and it then updates that player (and only that player)

Analyzing this approach takes a little more work, but the resulting bound is much better:

The (expected) amount of communication is $O(k \frac{\log (n/s)}{\log (k/s)})$

What's even more impressive: this is optimal !

This last algorithm and the lower bound, were presented by Srikanta Tirthapura at the Shonan meeting, based on his DISC 2011 work with David Woodruff. Key elements of this result (including a broadcast-the-threshold variant of the upper bound) also appeared in a PODS 2010 paper by Muthu, Graham Cormode, Kevin Yi and Qin Zhang. The optimal lower bound is new, and rather neat.

Thursday, January 19, 2012

SODA Review II: The business meeting

Jeff Phillips posts a roundup of the SODA business meeting. Hawaii !!!!!!

I thought I would also post a few notes on the SODA business meeting. I am sure I missed some details, but here are the main points.

Everyone thought the organization of the conference was excellent (so far). The part in parenthesis is a joke by Kazuo Iwama towards his students - I guess that is Japanese humor, and encouragement.

Despite being outside of North America for the first time, the attendance was quite high, I think around 350 people. And the splits were almost exactly 1/3 NA, 1/3 Europe, 1/3 Asia.

Yuval Rabani talked about being PC for the conference. He said there were the most submissions ever, most accepted paper ever, and largest PC ever for SODA. Each PC member reviewers about 49 papers, and over 500 were sub-reviewed. We all thank Yuval for all of his hard work.

We voted next on location for 2014 (2013 is in New Orleans). The final votes were down to Honolulu, HI and Washington DC, where Honolulu won about 60 to 49. David Johnson said they would try to book a hotel in Honolulu if he could get hotel prices below 200 USD/night. A quick look on kayak.com made it appear several large hotels could be booked next year around this time for about 160 USD/night or so. Otherwise it will be in DC where the theory group at UMaryland (via David Mount) have stated they would help with local arrangements. They did a great job with SoCG a few years ago, but I heard many suggestions that it be held more downtown than by UofM campus. And also there were requests for good weather. We'll see what happens...

Finally, there was a discussion about how SODA is organized/governed. This discussion got quite lively. Bob Sedgewick led the discussion by providing a short series of slides outlining a rough plan for a "confederated SODA." I have linked to his slides. This could mean several things, for instance:

Having ALENEX and ANALCO (and SODA) talks spread out over 4 days and intermixed even possibly in the same session (much like ESA).
The PCs would stay separate most likely (although merging them was discussed, but this had less support).
For SODA the PC could be made more hierarchical where there are, say, 6 main area chairs. Then each area chair supervises say 12 or so PC members. The general chair would coordinate and normalize all of the reviews, but otherwise it would be more hierarchical and partitioned. Then PC members in each area would have fewer papers to review, and could submit to other subareas even.
There was also a suggestion that PC chairs / steering committee members have some SODA attendance requirements. (Currently it consists of David Johnson, 2 people appointed by SIAM, and the past two PC chairs - as I think I understand. David Johnson said he would provide a link to the official SODA bylaws somewhere.).

Anyways, there was a lot of discussion that was boiled down to 3 votes (I will try to paraphrase, all vote totals approximate):

Should the steering committee consider spreading ALENEX, ANALCO, and SODA talks over 4 days? About 50 to 7 in favor.
Should the steering committee consider/explore some variant of the Confederated SODA model? About 50 to 2 in favor.
Should the steering committee consider making the steering committee members elected? About 50 to 1 in favor.

There were about 100+ votes for location, so usually about half the crowd abstained for all votes. There were various arguments on either side of the positions. And other suggestions. Some people had very strong and well-argued sides of these discussion points, so I don't want to try to paraphrase (and probably get something nuanced wrong), but I encourage people to post opinions and ideas in the comments.

Wednesday, January 18, 2012

SODA review I: Talks, talks and more talks.

I asked Jeff Phillips (a regular contributor) if he'd do some conference posting for those of us unable to make it to SODA. Here's the first of his two missives.

Suresh requested I write a conference report for SODA. I never know how to write these reports since I always feel like I must have left out some nice talk/paper and then I risk offending people. The fact is, there are 3 parallel sessions, and I can't pay close attention to talks for 3 days straight, especially after spending the previous week at the Shonan meeting that Suresh has been blogging about.

Perhaps, it is apt to contrast it with the Shonan meeting. At Shonan there were many talks (often informal with much back and forth) on topics very well clustered on "Large-scale Distributed Computation". There were several talks earlier in the workshop that just overlaid the main techniques that have become quite powerful within an area, and then there were new talks on recent breakthroughs. But although we mixed up the ordering of subtopics a bit, there was never that far of a context switch, and you could see larger views coalescing in people's minds throughout the week.

At SODA, the spectrum is much more diverse - probably the most diverse conference on the theoretical end of computer science. The great thing is that I get to see colleagues across a much broader spectrum of areas. But the talks are often a bit more specific, and despite having usually fairly coherent sessions, the context switches are typically quite a bit larger and it seems harder to stay focused enough to really get at the heart at what is in each talk. Really getting the point requires both paying attention and being in the correct mind set to start with. Also, there are not too many talks in my areas of interest (i.e. geometry, big data algorithmics).

So then what is there to report. I've spent most of my time in the hallways, catching up on gossip (which either is personal, or I probably shouldn't blog about without tenure - or even with tenure), or discussing on-going or new research problems with friends (again not yet ready for a blog). And of the talks I saw, I generally captured vague notions or concepts. Usually stored away for when I think about a related problem and I need to make a similar connection, or look up a technique in the paper. And, although, I was given a CD of the proceedings, but my laptop has not CD drive. For the reasons discussed above, I rarely completely get how something works from a short conference talk. Here are some example snip-its of what I took away from a few talks:

Private Data Release Via Learning Thresholds | Moritz Hardt, Guy Rothblum, Rocco A. Servedio.

Take-away : There is a deep connection between PAC learning and differential privacy. Some results from one can be applied to the other, but perhaps many others can be as well.

Submatrix Maximum Queries in Monge Matrices and Monge Partial Matrices, and Their Applications | Haim Kaplan, Shay Mozes, Yahav Nussbaum and Micha Sharir

Take-away: There is a cool "Monge" property that matrices can have which makes many subset query operations more efficient. This can be thought of as each row represents a pseudo-line. Looks useful for matrix problems where the geometric intuition about what the columns mean is relevant.

Analyzing Graph Structure Via Linear Measurements | Kook Jin Ahn, Sudipto Guha, Andrew McGregor

Take-away : They presented a very cool linear sketch for graphs. This allows several graph problems to be solved under a streaming (or similar models) in the way usually more abstract, if not geometric, data is. (ed: see my note on Andrew's talk at Shonan)

Lsh-Preserving Functions and Their Applications | Flavio Chierichetti, Ravi Kumar

Take-away: They present a nice characterization of what sorts of similarities (based on combinatorial sets), showing which ones can and cannot be used within a LSH framework. There techniques seemed to be a bit more general that for these discrete similarities over sets, so if need to use this for another similarity may be good to check out in more detail.

Data Reduction for Weighted and Outlier-Resistant Clustering | Dan Feldman, Leonard Schulman

Take-away: They continue to develop the understanding on what can be done for core sets using sensitivity-based analysis. This helps outline not just what functions can be approximated with subsets as proxy, but also how the distribution of points affects these results. The previous talk by Xin Xiao (with Kasturi Varadarajan on A Near-Linear Algorithm for Projective Clustering Integer Points) also used these concepts.

There were many other very nice results and talks that I also enjoyed, but the take-away was often even less interesting to blog about. Or sometimes they just made more progress towards closing a specific subarea. I am not sure how others use a conference, but if you are preparing your talk, you might consider trying to build a clear concise take-away message into your talk so that people like me with finite attention spans can remember something precise out of it. And so people like me are most likely to look more carefully at the paper the next time we work on a related problem.

Upcoming deadline: Multiclust 2012

One of the posts queued up in the clustering series is a note on 'metaclustering', which starts with the idea that instead of looking for one good clustering, we should be trying to explore the space of clusterings obtained through different methods and pick out good solutions informed by the landscape of answers. While the concept has been around a while, the area has attracted much more interest in recent years, with two workshops on the topic at recent data mining conferences.

I'm involved with the organization of the latest incarnation, to be held in conjunction with SDM 2012 in April. If you have

unpublished original research papers (of upto 8 pages) that are not under review elsewhere, vision papers and descriptions of work-in-progress or case studies on benchmark data as short paper submissions of up to 4 pages

then you should submit it ! The deadline is Jan 25.

Tuesday, January 17, 2012

The Shonan Meeting (Part 2): Talks review I

I missed one whole day of the workshop because of classes, and also missed a half day because of an intense burst of slide-making. While I wouldn't apologize for missing talks at a conference, it feels worse to miss them at a small focused workshop. At any rate, the usual disclaimers apply: omissions are not due to my not liking a presentation, but because of having nothing even remotely intelligent to say about it.

Jeff Phillips led off with his work on mergeable summaries. The idea is that you have a distributed collection of nodes, each with their own data. The goal is to compute some kind of summary from all the nodes, with the caveat that each node only transmits a fixed size summary to other nodes (or the parent in an implied hierarchy). What's tricky about this is keeping the error down. It's easy to see for example that $\epsilon$-samples compose - you could take two $\epsilon$-samples and take an $\epsilon$-sample of that, giving you a $2\epsilon$-sample over the union. But you want to keep the error fixed AND the size the sample fixed. He showed a number of summary structures that could be maintained in this mergeable fashion, and there are a number of interesting questions that remain open, including how to do clustering in a mergeable way.

In the light of what I talked about earlier, you could think of the 'mergeable' model as a restricted kind of distributed computation, where the topology is fixed, and messages are fixed size. The topology is a key aspect, because nodes don't encounter data more than once. This is good, because otherwise the lack of idempotence of some of the operators could be a problem: indeed, it would be interesting to see how to deal with non-idempotent summaries in a truly distributed fashion.

Andrew McGregor talked about graph sketching problems (sorry, no abstract yet). One neat aspect of his work is that in order to build sketches for graph connectivity, he uses a vertex-edge representation that essentially looks like the cycle-basis vector in the 1-skeleton of a simplicial complex, and exploits the homology structure to compute the connected components (aka $\beta_0$). He also uses the bipartite double cover trick to reduce bipartiteness testing to connected component computation. It's kind of neat to see topological methods show up in a useful way in these settings, and his approach probably extends to other homological primitives.

Donatella Firmani and Luigi Laura talked about different aspects of graph sketching and MapReduce, studying core problems like the MST and bi/triconnectivity. Donatella's talk in particular had a detailed experimental study of various MR implementations for these problems, and had interesting (but preliminary) observations about tradeoff between the number of reducers and the amount of communication needed.

This theme was explored further by Jeff Ullman in his talk on one-pass MR algorithms (the actual talk title was slightly different, since the unwritten rule at the workshop was to change the name of the title from the official listing). Again, his argument was that one should be combining both the communication cost and the overall computation cost. A particularly neat aspect of his work was showing (for the problem of finding a particular shaped subgraph in a given large graph) when there was an efficient one-pass MR algorithm, given the existence of a serial algorithm for the same problem. He called such algorithms convertible algorithms: one result type is that if there's an algorithm running in time $n^\alpha m^\beta$ for finding a particular subgraph of size $s$, and $s \le \alpha + 2\beta$, then there's an efficient MR algorithm for the problem (in the sense of total computation time being comparable to the serial algorithm).

The Shonan Meeting (Part 1): In the beginning, there was a disk...

What follows is a personal view of the evolution of large-data models. This is not necessarily chronological, or even reflective of reality, but it's a retroactive take on the field, inspired by listening to talks at the Shonan meeting.

Arguably, the first formal algorithmic engagement with large data was the Aggarwal-Vitter external memory model from 1988. The idea was simple enough: accessing an arbitrary element of disk was orders of magnitude more expensive than accessing an element of main memory, so let's ignore main memory access and charge a single unit for accessing a block of disk.

The external memory model was (and is still) a very effective model of disk access. It wasn't just a good guide to thinking about algorithm design, it also encouraged design strategies that were borne out well in practice. One could prove that natural-sounding buffering strategies were in fact optimal, and that prioritizing sequential scans as far as possible (even to the extent of preparing data for sequential scans) was more efficient. Nothing earth-shattering, but a model that guides (and conforms to) proper practice is always a good one.

Two independent directions spawned off from the external memory model. One direction was to extend the hierarchy. Why stop at one main memory level when we have multilevel caches ? A simple extension to handle caches is tricky, because the access time differential between caches and main memory isn't sufficient to justify the idealized "0-1" model that the EM model used. But throwing in another twist - I don't actually know the correct block size for transfer of data between hierarchy levels - led us to cache-obliviousness.

I can't say for sure whether the cache-oblivious model speaks to the practice of programming with caches as effectively as the EM model. Being aware of your cache can bring significant benefits in principle. But the design principles ("repeated divide and conquer, and emphasizing locality of access") are sound, and there's already at least one company (Tokutek, founded by Martin Farach-Colton, ~~and~~ Michael Bender and Bradley Kuszmaul) that is capitalizing on the performance yielded by cache-oblivious data structures.

The other direction was to weaken the power of the model. Since a sequential scan was so much more efficient than random access to disk, a natural question was to ask what you could do with just one scan. And thus was born the streaming model, which is by far the most successful model for large-data to date, with theoretical depth and immense practical value.

What we've been seeing over the past few years is the evolution of the streaming model to capture ever more complex data processing scenarios and communication frameworks.

It's quite useful to think of a stream algorithm as "communicating" a limited amount of information (the working memory) from the first half of the stream to the second. Indeed, this view is the basis for communication-complexity-based lower bounds for stream algorithms.

But if we think of this as an algorithmic principle, we then get into the realm of a distributed computation, where one player possesss the "first half" of the data, the other player has "the second half" and the goal is for them to exchange a small number of bits with each other in order to compute something (while streaming is one-way communication, a multi-pass streaming algorithm is a two-way communication).

Of course, there's nothing saying that we only have two players. This gets you to the $k$-player setup for a distributed computation, in which you wish to minimize the amount of communication exchanged as part of a computation. This model is of course not new at all ! It's exactly the distributed computing model pioneered in the 80s and 90s and has a natural home at PODC. What appears to be different in its new uses is that the questions being asked are not the old classics like leader election or byzantine agreement, but statistical estimation on large data sets. In other words, the reason to limit communication is because of the need to process a large data set, rather than the need to merely coordinate. It's a fine distinction, and I'm not sure I entirely believe it myself :)

There are many questions about how to compute various objects in a distributed setting: of course the current motivation is to do with distributed data centers, sensor networks, and even different cores on a computer. Because of the focus on data analysis, there are sometimes surprising results that you can prove. For example, a recent SIGMOD paper by Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu shows that if you want to do quantile estimation, you only need communication that's sublinear in the number of players ! The trick here is that you don't need to have very careful error bounds on the estimates at each player before sending up the summary to a coordinator.

It's also quite interesting to think about distributed learning problems, where the information being exchanged is specifically in order to build a good model for whatever task you're trying to learn. Some recent work that I have together with Jeff Phillips, Hal Daume and my student Avishek Saha explores the communication complexity of doing classification in such a setting.

An even more interesting twist on the distributed setting is the so-called 'continuous streaming' setting. Here, you don't just have a one-shot communication problem. Each player receives a stream of data, and now the challenge is not just to communicate a few bits of information to solve a problem, but to update the information appropriately as new input comes in. Think of this as streaming with windows, or a dynamic version of the basic distributed setting.

Here too, there are a number of interesting results, a beautiful new sampling trick that I'll talk about next, and some lower bounds.

I haven't even got to MapReduce yet, and how it fits in: while you're waiting, you might want to revisit this post.

Sunday, January 15, 2012

The Shonan Meeting (Part 0): On Workshops

Coming up with new ideas requires concentration and immersion. When you spend enough unbroken time thinking about a problem, you start forming connections between thoughts, and eventually you get a giant "connected component" that's an actual idea.

Distractions, even technical ones, kill this process. And this is why even at a focused theory conference, I don't reach that level of "flow". While I'm bombarded from all directions by interesting theory, there's a lot of context switching. Look, a new TSP result ! origami and folding - fun ! Who'd have thought of analyzing Jenga ! did someone really prove that superlinear epsilon net lower bound ?

This is why focused workshops are so effective. You get bombarded with information for sure, but each piece reinforces aspects of the overall theme if it's done well. Slowly, over the course of the event, a bigger picture starts emerging, connections start being made, and you can feel the buzz of new ideas.

And this is why the trend of 'conferencizing' workshops, that Moshe Vardi lamented recently, is so pernicious. it's another example of perverse incentives ("conferences count more than workshops for academic review, and so let's redefine a workshop as a conference"). A good workshop (with submitted papers or otherwise) provides focus and intensity, and good things come of it. A workshop that's really just a miniconference doesn't have either the intense intimacy of a true workshop or the quality of a larger symposium.

All of this is a very roundabout way of congratulating Muthu, Graham Cormode and Ke Yi (ed: Can we just declare that Muthu has reached exalted one-word status, like Madonna and Adele ? I can't imagine anyone in the theory community hearing the name 'Muthu' and not knowing who that is) for putting on a fantastic workshop on Large-Scale Distributed Computing at the Shonan Village Center (the Japanese Dagstuhl, if you will). There was reinforcement, intensity, the buzz of new ideas, and table tennis ! There was also the abomination of fish-flavored cheese sticks, of which nothing more will be said.

In what follows, I'll have a series of posts from the event itself, with a personal overview of the evolution of the area, highlights from the talks, and a wrap up. Stay tuned...

Tuesday, January 03, 2012

Teaching review I: Programming

Please note: I'm in Boston this week at the Joint Math Meetings (6100 participants and counting!) for a SIAM minisymposium on computational geometry. If you happen to be in the area for the conference, stop by our session on Thursday at 8:30am.

I always assign programming questions in my graduate algorithms class. In the beginning, I used the ACM programming competition server, and also tried the Sphere Online system. This semester, I didn't do either, designing my own homeworks.

Why do I ask students to code ? There are many reasons it's a good idea to get students to program in a graduate algorithms class. There are some additional reasons in my setting: a majority of the students in my class take it as a required course for the MS or Ph.D program. They don't plan on doing research in algorithms, many of them are looking for jobs at tech companies, and many others will at best be using algorithmic thinking in their own research.

Given that, the kinds of programming assignments I tried to give this semester focused on the boundary between theory and practice (or where O() notation ends). In principle, the goal of each assignment on a specific topic was to convert the theory (dynamic programming, prune and search, hashing etc) into practice by experimenting with different design choices and seeing how they affected overall run time.

Designing assignments that couldn't be solved merely by downloading some code was particularly tricky. As Panos Ipeirotis recommended in his study of cheating in the classroom (original post deleted, here's the post-post), the key is to design questions whose answer requires some exploration after you've implemented the algorithm. I was reasonably happy with my hashing assignment, where students were asked to experiment with different hashing strategies (open addressing, chaining, cuckoo hashing), measure the hash table behavior for different parameter choices, and draw their own conclusions. There was a fair degree of consistency in the reported results, and I got the feeling that the students (and I!) learned something lasting about the choices one makes in hashing (for example, simple cuckoo hashing degrades rapidly after a 70% load ratio).

I did have a few disasters, and learned some important lessons.

If I expect to run student code on my test cases, I HAVE to provide a test harness into which they'd merely insert a subroutine. Otherwise, merely specifying input and output format is not enough. Students are not as familiar with a simple UNIX command line interface as I expected, and use pre-canned libraries in ways that make uniform testing almost impossible. Note that this is not language-independent, unless I'm willing to provide harnesses for the myriad different languages people use.
If you do want to accept code without using the 'subroutine-inside-harness' form, then you need some kind of testing environment where they can submit the code and get answers. This is essentially what ACM/TopCoder/SphereOnline do.
If you're accepting code from students, use Moss or something like it. It's great for detecting suspiciously common code. When I used it and detected duplicates, in all cases the students admitted to working together.
It's easier to design assignments where the submitted product is not code, but some analysis of the performance (like in the hashing assignment above). Firstly, this is platform independent, so I don't care if people use Java, C, C++, C#, python, perl, ruby, scheme, racket, haskell, .....Secondly, it avoids the issue of "did you write the code yourself" - not entirely, but partly.

But I'll definitely do this again. Firstly, it's good for the students - from what I hear, having written actual code for dynamic programming makes it much easier for them to write code under pressure during their tech interviews, and I've heard that knowledge of DPs and network flows is a key separator during tech hiring. Secondly, even for theory students, it helps illustrate where asymptotic analysis is effective and where it breaks down. I think this is invaluable to help theoreticians think beyond galactic algorithms.

Of course, no discussion of programming in algorithms classes is complete with a reference to Michael Mitzenmacher's exhortation.

Monday, December 26, 2011

On PC members submitting papers

Update: Michael Mitzenmacher's posts (one, two, and three, and the resulting comments) on implementing CoI at STOC are well worth reading (thanks, Michael). The comments there make me despair that *any* change will ever be implemented before the next century, but given that we've been able to make some changes already (electronic proceedings, contributed workshops, and so on), I remain hopeful.

For all but theory researchers, the reaction to the above statement is usually "don't they always?". In theoryCS, we pride ourselves on not having PC members submit papers to conferences. What ends up happening is:

You can't have too many PC members on a committee because otherwise there won't be enough submissions
The load on each PC member is much larger than reasonable (I'm managing 41 papers for STOC right now, and it's not uncommon to hit 60+ for SODA)

There's an ancillary effect that because of the first point, theory folks have fewer 'PC memberships' on their CV which can cause problems for academic performance review, but this is a classic Goodhart's Law issue, so I won't worry about it.

The main principle at play here is: we don't want potentially messy conflicts or complex conflict management issues if we do have PC members submitting papers. However, it seems to me that the practice of how we review papers is far different from this principle.

Consider: I get an assignment of X papers to review if I'm on a conference PC. I then scramble around finding subreviewers for a good fraction of the papers I'm assigned (I used to do this less, but I eventually realized that a qualified subreviewer is FAR better than me in most subareas outside my own expertise, and is better for the paper).

Note (and this is important) my subreviewers have typically submitted papers to this conference (although I don't check) and I rely on them to declare any conflicts as per conference guidelines.

Subreviewers also get requests from different PC members, and some subreviewers might themselves review 3-4 papers.

Compare this to (say) a data mining conference: there are 30+ "area chairs" or "vice chairs", and over 200 PC members. PC members each review between 5-10 papers, and often don't even know who the other reviewers are (although they can see their reviews once they're done). The area/vice chairs manage 20-30 papers each, and their job is to study the reviews, encourage discussion as needed, and formulate the final consensus decision and 'meta-review'.

If you set "theory subreviewer = PC member" and "theory PC member = vice chair", you get systems that aren't significantly different. The main differences are:

theory subreviewers don't typically get to see other reviews of the paper. So their score assignment is in a vacuum.
theory PC members are expected to produce a review for a paper taking the subreviewer comments into account (as opposed to merely scrutinizing the reviews being provided)
managing reviewer comments for 30 papers is quite different to generating 30 reviews yourself (even with subreviewer help)
A downside of the two-tier PC system is also that there isn't the same global view of the entire pool that a theory PC gets. But this is more a convention than a rule: there's nothing stopping a PC for opening up discussions to all vice chairs.
One advantage of area chairs is that at least all papers in a given area get one common (re)viewer. that's not necessarily the case in a theory PC without explicit coordination from the PC chair and the committee itself.

But the main claimed difference (that people submitting papers don't get to review them) is false. Even worse, when submitters do review papers, this is 'under the table' and so there isn't the same strict conflict management that happens with explicit PC membership.

We're dealing with problems of scale in all aspects of the paper review and evaluation process. This particular one though could be fixed quite easily.

Friday, December 23, 2011

Thoughts on ICDM II: Social networks

The other trend that caught my eye at ICDM is the dominance of social networking research. There was a trend line at the business meeting that bore this out, showing how topics loosely classified as social networking had a sharp rise among accepted papers in ICDM over the past few years.

There were at least three distinct threads of research that I encountered at the conference, and in each of them, there's something to interest theoreticians.

The first strand is modelling: is there a way to describe social network graphs using abstract evolution models or random graph processes. I spent some time discussing this in a previous post, so I won't say more about it here. Suffice it to say that there's interesting work in random graph theory underpinning this strand, as well as a lot of what I'll call 'social network archaeology': scouring existing networks for interesting structures and patterns that could be the basis for a future model.
The second strand is pattern discovery, and the key term here is 'community': is there a way to express natural communities in social networks in a graph-theoretic manner ? While modularity is one of the most popular ways of defining community, it's not the only one, and has deficiencies of its own. In particular, it's not clear how to handle "soft" or "overlapping" communities. More generally, there appears to be no easy way to capture the dynamic (or time-varying) nature of communities, something Tanya Berger-Wolf has spent a lot of energy thinking about. Again, while modelling is probably the biggest problem here, I think there's a lot of room for good theory, especially when trying to capture dynamic communities.
The final strand is influence flow. After all, the goal of all social networking research is to monetize it (I kid, I kid). A central question here is: can you identify the key players who can make something go viral for cheap ? is the network topology a rich enough object to identify these players, and even if you do, how can you maximize flow (on a budget, efficiently).

There were many papers on all of these topics -- too many to summarize here. But the landscape is more or less as I laid it out. Social networking research is definitely in its bubble phase, which means it's possible to get lots of papers published without necessarily going deep into the problem space. This can be viewed as an invitation to jump in, or a warning to stay out, depending on your inclination. And of course, the definitive tome on this topic is the Kleinberg-Easley book.

This concludes my ICDM wrap-up. Amazingly, it only took me a week after the conference concluded to write these up.

Thursday, December 22, 2011

CGWeek !!!

If you're on the compgeom mailing list, you probably know this already, but if not, read on:

There's an increased interest in expanding the nature and scope of events at SoCG beyond the conference proper. To this end, Joe Mitchell put together a committee titled "CG:APT" (CG: applications, practice and theory) to chalk out a strategy and solicit contributions.

The call for contributions is now out, and the main nugget is this:

Proposals are invited for workshops/minisymposia, or other types of events on topics related to all aspects of computational geometry and its applications. Typical events may feature some number of invited speakers and possibly some number of contributed presentations. Events may feature other forms of communications/presentations too, e.g., via software demos, panel discussions, industry forum, tutorials, posters, videos, implementation challenge, artwork, etc. CG:APT events will have no formal proceedings; optionally, the organizers may coordinate with journals to publish special issues or arrange for other dissemination (e.g., via arXiv, webpages, printed booklets, etc).

In other words, anything goes ! (this is an experiment, after all). Topics are essentially anything that might be of interest geometrically in a broad sense (i.e not limited to what might appear at the conference itself).

I'd strongly encourage people to consider putting together a proposal for an event. The procedure is really simple, and only needs a two page proposal containing:

Title/theme of the workshop/minisymposium/event
Organizer(s) (name, email)
Brief scientific summary and discussion of merits (to CG) of the proposed topic.
A description of the proposed format and agenda
Proposed duration: include both minimum and ideal; we anticipate durations of approximately a half day (afternoon), with the possibility that some meritorious events could extend across two half-days.
Procedures for selecting participants and presenters
Intended audience
Potential invited speakers/panelists
Plans for dissemination (e.g., journal special issues)
Past experience of the organizer(s) relevant to the event

Please note: EVERY COMMUNITY DOES THIS (and now, even theory). The deadline is Jan 13, 2012, and proposals should be emailed to Joe Mitchell (joseph.mitchell@stonybrook.edu). Note that the idea is for such events to be in the afternoon, after morning technical sessions of SoCG.

There are many people out there who grumble about how insular the CG community is. Now's your chance to walk into the lion's den (at Chapel Hill) and tell it off :).

Thoughts on ICDM I: Negative results (part C)

This is the third of three posts (one, two) on negative results in data mining, inspired by thoughts and papers from ICDM 2011.

If you come up with a better way of doing classification (for now let's just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (note: this is a tricky problem in clustering that my student Parasaran Raman has been working on: more on that later.).

The obvious way to compare two classification methods is how well they do compared to some ground truth (i.e labelled data), but this is a one-parameter system, because by changing the threshold of the classifier (or if you like, translating the hyperplane around),you can change the false positive and false negative rates.

Now the more smug folks reading these are waiting with 'ROC' and "AUC" at the tip of their tongues, and they'd be right ! You can plot a curve of the false positive vs false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.

For example, if the y axis measured increase false negatives, and the x-axis measured increasing false positives, you'd want a curve that looked like an L with the apex at the origin, and a random classifier would look like the line x+y = 1. The AUC score would be zero for the good classifier and 0.5 for the bad one (there are ways of scaling this to be between 0 and 1).

The AUC is a popular way of comparing methods in order to balance the different error rates. It's also attractive because it's parameter-free and is objective: seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.

But is it ?

There's a line of work culminating (so far) in the ICDM 2011 paper 'An Analysis of Performance Measures For Binary Classifiers' by Charles Parker of BigML.com (sorry, no link yet). The story is quite complicated, and I doubt I can do it justice in a blog post, so I'll try to summarize the highlights, with the caveat that there's nuance that I'm missing out on, and you'll have to read the papers to dig deeper.

The story starts with "Measuring classifier performance: a coherent alternative to the area under the ROC curve" ($$ link) by David Hand (copy of paper here). His central result is a very surprising one:

The AUC measure implicitly uses different misclassification costs when evaluating different classifiers, and thus is incoherent as an "objective" way of comparing classifiers.

To unpack this result a little, what's going on is this. Suppose you have a scenario where correct classification costs you nothing, and misclassification costs you a certain amount (that could be different for the two different kinds of misclassification). You can now write down an overall misclassification cost for any threshold used for a classifier, and further you can compute the optimal threshold (that minimizes this cost). If you don't actually know the costs (as is typical) you can then ask for the expected misclassification cost assuming some distribution over the costs.

If you run this computation through, what you end up with is a linear transformation of the AUC, where the distribution over the costs depends on the distribution of scores assigned by the classifier ! In other words, as Hand puts it,

It is as if one measured person A’s height using a ruler calibrated in inches and person B’s using one calibrated in centimetres, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used

This is a rather devastating critique of the use of the AUC. While there's been pushback (case in point is an ICML 2011 paper by Flach, Hernandez-Orallo and Ferri which is a very interesting read in its own right), the basic premise and argument is not contested (what's contested is the importance of finding the optimal threshold). Hand recommends a few alternatives, and in fact suggests that the distribution of costs should instead be made explicit, rather than being implicit (and subject to dependence on the data and classifiers)

What Parker does in his ICML paper is take this further. In the first part of his paper, he extends the Hand analysis to other measures akin to the AUC, showing that such measures are incoherent as well. In the second part of his paper, he unleashes an experimental tour de force of classifier comparisons under different quality measures, showing that

nearly 50% of the time, measures disagree on which classifier in a pair is more effective. He breaks down the numbers in many different ways to show that if you come up with a new classification algorithm tomorrow, you'd probably be able to cherry pick a measure that showed you in a good light.
It's the measures perceived to be more "objective" or parameter-less that had the most trouble reconciling comparisons between classifiers.
It's also not the case that certain classifiers are more likely to cause disagreements: the problems are spread out fairly evenly.
His experiments also reinforce Hand's point that it's actually better to define measures that explicitly use domain knowledge, rather than trying to achieve some objective measure of quality. Measures that were either point-based (not integrating over the entire range) or domain specific tended to work better.

I'm not even close to describing the level of detail in his experiments: it's a really well-executed empirical study that should be a case study for anyone doing experimental work in the field. It's especially impressive because from personal experience I've found it to be REALLY HARD to do quality methodological studies in this area (as opposed to the "define algorithm-find-toy-data-profit" model that most DM papers seem to follow").

At a deeper level, the pursuit of objective comparisons that can be reduced to a single number seems fundamentally misguided to me. First of all, we know that precise cost functions are often the wrong way to go when designing algorithms (because of modelling issues and uncertainty about the domain). Secondly, we know that individual methods have their own idiosyncracies - hence the need for 'meta' methods. And finally, we're seeing that even the meta-comparison measures have severe problems ! In some ways, we're pursuing 'the foolish hobgoblin of precision and objectivity' in an area where context is more important than we as mathematicians/engineers are used to.

Tuesday, December 20, 2011

Thoughts on ICDM I: Negative Results (part B)

Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned !

One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because

you can target your algorithms to graphs that look like this, hopefully making them more efficient
You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
If you're lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.

But there aren't even good graph models that capture known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn't have the heavy-tailed degree distribution that's common in social networks. It also doesn't have a property that's common to large social networks: densification, or the fact that even as the network grows, the diameter stays small (implying that the network seems to get denser over time).

One approach to fixing this models a social network as a Stochastic Kronecker graph. You can read more about these graphs here: a simple way of imagining them is that you add an edge in the graph by a random process that does a (kind of) quad tree like descent down a partitioning of the adjacency matrix and places a 1 at a leaf. SKGs were proposed by Leskovec, Chakrabarti, Kleinberg and Faloutsos, and include ER graphs as a special case. They appear to capture heavy tailed degree distributions as well as densification, and have become a popular model used when testing algorithms on social networks. They're also used as the method to generate benchmark graphs for the HPC benchmark Graph500.

But a thorough understanding of the formal properties of SKGs has been lacking. In "An In-Depth Analysis of Stochastic Kronecker Graphs", Seshadhri, Pinar and Kolda show some rather stunning results. Firstly, they provide a complete analysis of the degree distribution of an SKG, and prove a beautiful result showing that it oscillates between having a lognormal and exponential tail. Their theorems are quite tight: plots of the actual degree distribution match their theorems almost perfectly, and convincingly display the weird oscillations in the degree frequencies (see Figure 2 in the paper).

Secondly, they also formally explain why a noisy variant of SKGs appears to have much more well-behaved degree distribution, proving that a slightly different generative process will indeed generate the desired distribution observed in practice.

Finally, they also show that the graphs generated by an SKG have many more isolated nodes than one might expect, sometimes upto 75% of the total number of vertices ! This has direct implications for the use of SKGs as benchmarks. Indeed, they mention that the Graph500 committee is considering changing their benchmarks based on this paper - now that's impact :)

What I like about this paper is that it proves definitive theoretical results about a popular graph model, and very clearly points out that it has significant problems. So any methodology that involves using SKGs for analysis will now have to be much more careful about the claims it makes.

p.s There's also more supporting evidence on the lack of value of SKGs from another metric (the clustering coefficient, that measures how many configurations uv, uw also have the third edge vw). Real social networks have a high CC, and SKGs don't.This was first mentioned by Sala, Cao, Wilson, Zablit, Zheng and Zhao, and Seshadhri/Pinar/Kolda have more empirical evidence for it as well. (Disclaimer: I was pointed to these two references by Seshadhri: my opinions are my own though :))