The Geomblog: icdm


Friday, December 23, 2011

Thoughts on ICDM II: Social networks

The other trend that caught my eye at ICDM is the dominance of social networking research. There was a trend line at the business meeting that bore this out, showing how topics loosely classified as social networking had a sharp rise among accepted papers in ICDM over the past few years.

There were at least three distinct threads of research that I encountered at the conference, and in each of them, there's something to interest theoreticians.

The first strand is modelling: is there a way to describe social network graphs using abstract evolution models or random graph processes. I spent some time discussing this in a previous post, so I won't say more about it here. Suffice it to say that there's interesting work in random graph theory underpinning this strand, as well as a lot of what I'll call 'social network archaeology': scouring existing networks for interesting structures and patterns that could be the basis for a future model.
The second strand is pattern discovery, and the key term here is 'community': is there a way to express natural communities in social networks in a graph-theoretic manner ? While modularity is one of the most popular ways of defining community, it's not the only one, and has deficiencies of its own. In particular, it's not clear how to handle "soft" or "overlapping" communities. More generally, there appears to be no easy way to capture the dynamic (or time-varying) nature of communities, something Tanya Berger-Wolf has spent a lot of energy thinking about. Again, while modelling is probably the biggest problem here, I think there's a lot of room for good theory, especially when trying to capture dynamic communities.
The final strand is influence flow. After all, the goal of all social networking research is to monetize it (I kid, I kid). A central question here is: can you identify the key players who can make something go viral for cheap ? Is the network topology a rich enough object to identify these players, and even if you can identify them, how can you maximize flow (on a budget, efficiently) ?
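
To make the modularity definition from the second strand concrete, here's a minimal sketch of the standard Newman-Girvan score: for each community, take the fraction of edges that fall inside it and subtract the fraction you'd expect if edges were placed at random with the same degrees. The function name and the toy graph are my own, and the code assumes an undirected graph with at least one edge and a hard (non-overlapping) assignment of nodes to communities, which is exactly the limitation mentioned above.

```python
def modularity(edges, community):
    """Modularity of a hard partition.

    edges: list of undirected (u, v) pairs
    community: dict mapping each node to its community label
    """
    m = len(edges)
    internal = {}  # number of edges with both endpoints in the community
    degree = {}    # total degree of the nodes in the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Toy graph (made up): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}

print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0
```

Splitting the graph into its two triangles scores higher than lumping everything together, which matches intuition. Note that a node has to live in exactly one community for this formula to make sense.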

There were many papers on all of these topics -- too many to summarize here. But the landscape is more or less as I laid it out. Social networking research is definitely in its bubble phase, which means it's possible to get lots of papers published without necessarily going deep into the problem space. This can be viewed as an invitation to jump in, or a warning to stay out, depending on your inclination. And of course, the definitive tome on this topic is the Kleinberg-Easley book.

This concludes my ICDM wrap-up. Amazingly, it only took me a week after the conference concluded to write these up.

Thursday, December 22, 2011

Thoughts on ICDM I: Negative results (part C)

This is the third of three posts (one, two) on negative results in data mining, inspired by thoughts and papers from ICDM 2011.

If you come up with a better way of doing classification (for now let's just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (Note: this is a tricky problem in clustering that my student Parasaran Raman has been working on; more on that later.)

The obvious way to compare two classification methods is to see how well each does against some ground truth (i.e., labelled data), but this is a one-parameter system, because by changing the threshold of the classifier (or if you like, translating the hyperplane around), you can change the false positive and false negative rates.

Now the more smug folks reading this are waiting with 'ROC' and 'AUC' at the tip of their tongues, and they'd be right ! You can plot a curve of the false positive vs false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.

For example, if the y-axis measured increasing false negatives, and the x-axis measured increasing false positives, you'd want a curve that looked like an L with the apex at the origin, and a random classifier would look like the line x+y = 1. The AUC score would be zero for the good classifier and 0.5 for the bad one (there are ways of scaling this to be between 0 and 1).
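
Here's a minimal sketch of that computation, using the same axes as above (false positive rate on x, false negative rate on y). It assumes labels are 0/1, that higher scores mean "more likely positive", and that both classes are present; the function names and toy scores are mine. Because the usual ROC curve plots the true positive rate instead of the false negative rate, the area computed here is one minus the usual AUC.

```python
def error_tradeoff_curve(scores, labels):
    """Sweep the threshold from high to low and record (FP rate, FN rate).

    An example is called positive when its score is >= the threshold.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 1.0)]  # threshold above every score: all positives missed
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / num_neg, 1 - true_pos / num_pos))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 0]
perfect = [0.9, 0.8, 0.2, 0.1]        # every positive outranks every negative
uninformative = [0.5, 0.5, 0.5, 0.5]  # one tied score for everything

print(area_under(error_tradeoff_curve(perfect, labels)))        # 0.0
print(area_under(error_tradeoff_curve(uninformative, labels)))  # 0.5
```

The perfect ranking traces out the L shape, and the classifier that can't tell anything apart gives the line x+y = 1.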

The AUC is a popular way of comparing methods in order to balance the different error rates. It's also attractive because it's parameter-free and is objective: seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.

But is it ?

There's a line of work culminating (so far) in the ICDM 2011 paper 'An Analysis of Performance Measures For Binary Classifiers' by Charles Parker of BigML.com (sorry, no link yet). The story is quite complicated, and I doubt I can do it justice in a blog post, so I'll try to summarize the highlights, with the caveat that there's nuance that I'm missing out on, and you'll have to read the papers to dig deeper.

The story starts with "Measuring classifier performance: a coherent alternative to the area under the ROC curve" ($$ link) by David Hand (copy of paper here). His central result is a very surprising one:

The AUC measure implicitly uses different misclassification costs when evaluating different classifiers, and thus is incoherent as an "objective" way of comparing classifiers.

To unpack this result a little, what's going on is this. Suppose you have a scenario where correct classification costs you nothing, and misclassification costs you a certain amount (that could be different for the two different kinds of misclassification). You can now write down an overall misclassification cost for any threshold used for a classifier, and further you can compute the optimal threshold (that minimizes this cost). If you don't actually know the costs (as is typical) you can then ask for the expected misclassification cost assuming some distribution over the costs.
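
The cost calculation in the previous paragraph can be sketched as follows. This is my own simplified setup, not Hand's notation: costs are normalized so that a false positive costs c and a false negative costs 1 - c, each classifier is summarized by the (false positive rate, false negative rate) pairs it can reach by moving its threshold, and the two classifiers A and B are invented for illustration.

```python
def min_cost(points, c, pos_fraction=0.5):
    """Smallest expected misclassification cost over the available thresholds.

    points: (false_positive_rate, false_negative_rate) pairs, one per threshold
    c: cost of a false positive; a false negative costs 1 - c
    pos_fraction: fraction of examples that are truly positive
    """
    neg_fraction = 1 - pos_fraction
    return min(c * neg_fraction * fpr + (1 - c) * pos_fraction * fnr
               for fpr, fnr in points)


def expected_min_cost(points, cost_values, pos_fraction=0.5):
    """Average the optimal cost over a set of equally likely cost values."""
    return sum(min_cost(points, c, pos_fraction)
               for c in cost_values) / len(cost_values)


# Hypothetical operating points: A is strong when false positives must be
# avoided, B is its mirror image and is strong when false negatives must be.
A = [(0.0, 1.0), (0.0, 0.5), (0.5, 0.4), (1.0, 0.0)]
B = [(0.0, 1.0), (0.4, 0.5), (0.5, 0.0), (1.0, 0.0)]

for costs in ([0.8, 0.9], [0.1, 0.2]):
    print(costs, expected_min_cost(A, costs), expected_min_cost(B, costs))
```

When the cost distribution favours expensive false positives, A comes out ahead; when it favours expensive false negatives, B does. So which classifier "wins" depends on the distribution you average over, which is why it matters where that distribution comes from.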

If you run this computation through, what you end up with is a linear transformation of the AUC, where the distribution over the costs depends on the distribution of scores assigned by the classifier ! In other words, as Hand puts it,

It is as if one measured person A’s height using a ruler calibrated in inches and person B’s using one calibrated in centimetres, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used

This is a rather devastating critique of the use of the AUC. While there's been pushback (case in point is an ICML 2011 paper by Flach, Hernandez-Orallo and Ferri which is a very interesting read in its own right), the basic premise and argument are not contested (what's contested is the importance of finding the optimal threshold). Hand recommends a few alternatives, and in fact suggests that the distribution of costs should instead be made explicit, rather than being implicit (and subject to dependence on the data and classifiers).

What Parker does in his ICDM paper is take this further. In the first part of his paper, he extends the Hand analysis to other measures akin to the AUC, showing that such measures are incoherent as well. In the second part of his paper, he unleashes an experimental tour de force of classifier comparisons under different quality measures, showing that

nearly 50% of the time, measures disagree on which classifier in a pair is more effective. He breaks down the numbers in many different ways to show that if you come up with a new classification algorithm tomorrow, you'd probably be able to cherry pick a measure that showed you in a good light.
It's the measures perceived to be more "objective" or parameter-less that had the most trouble reconciling comparisons between classifiers.
It's also not the case that certain classifiers are more likely to cause disagreements: the problems are spread out fairly evenly.
His experiments also reinforce Hand's point that it's actually better to define measures that explicitly use domain knowledge, rather than trying to achieve some objective measure of quality. Measures that were either point-based (not integrating over the entire range) or domain specific tended to work better.

I'm not even close to describing the level of detail in his experiments: it's a really well-executed empirical study that should be a case study for anyone doing experimental work in the field. It's especially impressive because from personal experience I've found it to be REALLY HARD to do quality methodological studies in this area (as opposed to the "define algorithm-find-toy-data-profit" model that most DM papers seem to follow).

At a deeper level, the pursuit of objective comparisons that can be reduced to a single number seems fundamentally misguided to me. First of all, we know that precise cost functions are often the wrong way to go when designing algorithms (because of modelling issues and uncertainty about the domain). Secondly, we know that individual methods have their own idiosyncrasies - hence the need for 'meta' methods. And finally, we're seeing that even the meta-comparison measures have severe problems ! In some ways, we're pursuing 'the foolish hobgoblin of precision and objectivity' in an area where context is more important than we as mathematicians/engineers are used to.

Tuesday, December 20, 2011

Thoughts on ICDM I: Negative Results (part B)

Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned !

One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because

You can target your algorithms to graphs that look like this, hopefully making them more efficient.
You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
If you're lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.

But there aren't even good graph models that capture known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn't have the heavy-tailed degree distribution that's common in social networks. It also doesn't have a property that's common to large social networks: densification, or the fact that even as the network grows, the diameter stays small (implying that the network seems to get denser over time).

One approach to fixing this models a social network as a Stochastic Kronecker graph. You can read more about these graphs here: a simple way of imagining them is that you add an edge in the graph by a random process that does a (kind of) quad tree like descent down a partitioning of the adjacency matrix and places a 1 at a leaf. SKGs were proposed by Leskovec, Chakrabarti, Kleinberg and Faloutsos, and include ER graphs as a special case. They appear to capture heavy tailed degree distributions as well as densification, and have become a popular model used when testing algorithms on social networks. They're also used as the method to generate benchmark graphs for the HPC benchmark Graph500.
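
To make the "quad tree like descent" concrete, here's a minimal sketch of that edge-placement process for a 2x2 initiator matrix. At each of k levels it picks one of the four quadrants of the adjacency matrix with probability proportional to the corresponding initiator entry, and after k levels it has landed on a single cell. The function names and the initiator values are my own assumptions, not parameters from any of the papers or from Graph500, and the sketch makes no attempt to handle duplicate edges or self-loops.

```python
import random


def skg_edge(k, initiator, rng):
    """Pick one cell of a 2^k x 2^k adjacency matrix by k quadrant choices."""
    (a, b), (c, d) = initiator
    total = a + b + c + d
    row = col = 0
    for _ in range(k):
        r = rng.random() * total
        if r < a:
            quad_row, quad_col = 0, 0
        elif r < a + b:
            quad_row, quad_col = 0, 1
        elif r < a + b + c:
            quad_row, quad_col = 1, 0
        else:
            quad_row, quad_col = 1, 1
        # Each level contributes one more bit to the row and column index.
        row = 2 * row + quad_row
        col = 2 * col + quad_col
    return row, col


def skg_edges(k, initiator, num_edges, seed=0):
    """Draw num_edges cells independently (duplicates are kept)."""
    rng = random.Random(seed)
    return [skg_edge(k, initiator, rng) for _ in range(num_edges)]


def isolated_fraction(k, edges):
    """Fraction of the 2^k vertices that appear in no sampled edge."""
    touched = set()
    for u, v in edges:
        touched.add(u)
        touched.add(v)
    return 1 - len(touched) / 2 ** k


initiator = ((0.5, 0.2), (0.2, 0.1))  # hypothetical values, skewed to one corner
edges = skg_edges(k=10, initiator=initiator, num_edges=5000, seed=1)
print(isolated_fraction(10, edges))
```

Setting all four initiator entries equal makes every cell equally likely, which is the ER special case. The `isolated_fraction` helper is there so you can see for yourself how many vertices a skewed initiator leaves untouched.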

But a thorough understanding of the formal properties of SKGs has been lacking. In "An In-Depth Analysis of Stochastic Kronecker Graphs", Seshadhri, Pinar and Kolda show some rather stunning results. Firstly, they provide a complete analysis of the degree distribution of an SKG, and prove a beautiful result showing that it oscillates between having a lognormal and exponential tail. Their theorems are quite tight: plots of the actual degree distribution match their theorems almost perfectly, and convincingly display the weird oscillations in the degree frequencies (see Figure 2 in the paper).

Secondly, they also formally explain why a noisy variant of SKGs appears to have a much better-behaved degree distribution, proving that a slightly different generative process will indeed generate the desired distribution observed in practice.

Finally, they also show that the graphs generated by an SKG have many more isolated nodes than one might expect, sometimes up to 75% of the total number of vertices ! This has direct implications for the use of SKGs as benchmarks. Indeed, they mention that the Graph500 committee is considering changing their benchmarks based on this paper - now that's impact :)

What I like about this paper is that it proves definitive theoretical results about a popular graph model, and very clearly points out that it has significant problems. So any methodology that involves using SKGs for analysis will now have to be much more careful about the claims it makes.

P.S. There's also more supporting evidence on the lack of value of SKGs from another metric: the clustering coefficient, which measures how often a pair of edges uv, uw is accompanied by the third edge vw. Real social networks have a high CC, and SKGs don't. This was first mentioned by Sala, Cao, Wilson, Zablit, Zheng and Zhao, and Seshadhri/Pinar/Kolda have more empirical evidence for it as well. (Disclaimer: I was pointed to these two references by Seshadhri: my opinions are my own though :))
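
For the record, here's a minimal sketch of the global version of that clustering coefficient: count every pair of edges uv, uw sharing a vertex u, and report the fraction for which vw is also an edge. The function name and the toy graphs are mine, and the code assumes a simple undirected graph (self-loops are skipped, repeated edges are merged).

```python
from itertools import combinations


def clustering_coefficient(edges):
    """Fraction of wedges (uv, uw) that are closed by the edge vw."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    wedges = closed = 0
    for u, neighbours in adjacency.items():
        for v, w in combinations(neighbours, 2):
            wedges += 1
            if w in adjacency[v]:
                closed += 1
    return closed / wedges if wedges else 0.0


triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]

print(clustering_coefficient(triangle))  # 1.0
print(clustering_coefficient(path))      # 0.0
```

This loops over all neighbour pairs, so it's quadratic in the degree of each vertex; fine for toy graphs, not what you'd run on a real social network.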

Sunday, December 18, 2011

Thoughts on ICDM I: Negative Results (part A)

I just got back from ICDM (the IEEE conference on Data Mining). Data mining conferences are quite different from theory conferences (and much more similar to ML or DB conferences): there are numerous satellite events (workshops, tutorials and panels in this case), many more people (551 for ICDM, and that's on the smaller side), and a wide variety of papers that range from SODA-ish results to user studies and industrial case studies.

While your typical data mining paper is still a string of techniques cobbled together without rhyme or reason (anyone for spectral manifold-based correlation clustering with outliers using MapReduce?), there are some general themes that might be of interest to an outside viewer. What I'd like to highlight here is a trend (that I hope grows) in negative results.

It's not particularly hard to invent a new method for doing data mining. It's much harder to show why certain methods will fail, or why certain models don't make sense. But in my view, the latter is exactly what the field needs in order to give it a strong inferential foundation to build on (I'll note here that I'm talking specifically about data mining, NOT machine learning - the difference between the two is left for another post).

In the interest of brevity, I'll break this trend down in three posts. The first result I want to highlight isn't even quite a result, and isn't even from ICDM ! It's actually from KDD 2011 (back in August this year). The paper is Leakage in Data Mining: Formulation, Detection, and Avoidance, by Shachar Kaufman, Saharon Rosset, and Claudia Perlich, and got the Best Paper Award at KDD this year.

The problem they examine is "leakage", or the phenomenon that even in a "train model and test model" framework, it is possible for valuable information to "leak" from the test data or even from other sources to the training system, making a learned model look surprisingly effective and even appear to have predictive powers beyond what it really has. Obviously, the problem is that when such models are then applied to new data, their performance is worse than expected, compared to a model that wasn't "cheating" in this way.

They cite a number of examples, including many that come from the data challenges that have become all the rage. The examples highlight different kinds of leakage, including "seeing the future", cross contamination of data from other data sets, and even leakage by omission, where a well-intentioned process of anonymization actually leaks data about the trend being analyzed.
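
As a toy illustration of the "seeing the future" case only, here's a minimal sketch of a time-based split together with a check that no training record is as recent as a test record. This is my own example, not a procedure from the paper; the record layout (a `time` field) and the function names are assumptions.

```python
def temporal_split(records, cutoff):
    """Train on records strictly before the cutoff, test on everything else."""
    train = [r for r in records if r["time"] < cutoff]
    test = [r for r in records if r["time"] >= cutoff]
    return train, test


def sees_the_future(train, test):
    """True if some training record is at least as recent as a test record."""
    if not train or not test:
        return False
    latest_train = max(r["time"] for r in train)
    earliest_test = min(r["time"] for r in test)
    return latest_train >= earliest_test


# Made-up records with timestamps 1..6.
records = [{"time": t, "label": t % 2} for t in range(1, 7)]

# Interleaved split: the model trains on times 1, 3, 5 and is tested on 2, 4, 6.
interleaved_train, interleaved_test = records[::2], records[1::2]
print(sees_the_future(interleaved_train, interleaved_test))  # True

# Time-based split: train on times 1..3, test on 4..6.
train, test = temporal_split(records, cutoff=4)
print(sees_the_future(train, test))  # False
```

A check like this only catches leakage through record timestamps. It says nothing about features that were themselves computed from later data, or about the other kinds of leakage in the paper.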

While there are no results of any kind (experimental or theoretical), the authors lay out a good taxonomy of common ways in which leakage can happen, and describe ways of avoiding leakage (when you have control over the data) and detecting it (when you don't). What makes their paper really strong is that they illustrate this with specific examples from recent data challenges, explaining how the leakage occurs, and how the winners took advantage of this leakage explicitly or implicitly.

There are no quick and dirty fixes for these problems: ultimately, leakage can even happen through bad modelling, and sometimes modellers fail to remove leaky data because they're trying to encourage competitors to build good predictive models. Ironically, it is this encouragement that can lead to less predictive models on truly novel data. But the paper makes a strong argument that the way we collect data and use it to analyze our algorithms is fundamentally flawed, and this is especially true for the more sophisticated (and mysterious) algorithms that might be learning models through complex exploitation of the training data.

It remains to be seen whether the recommendations of this paper will be picked up, or if there will be more followup work along these lines. I hope so.
