Monday, October 31, 2011

Models for MapReduce

I've been listening to Jeff Phillips' comparison of different models for MapReduce (he's teaching a class on models for massive data). In what follows, I'll add the disclaimer IANACT (I am not a complexity theorist).

There's something that bothers me about the various models floating around that attempt to capture the MapReduce framework (specifically the MUD framework by Feldman et al, the MRC framework by Karloff, Suri and (co-blogger) Vassilvitskii, and the newer Goodrich-Sitchinava-Zhang framework).

For "seasoned" models of computation like the RAM, or PRAM, or even I/O and streaming, there's a sense of identifying some intrinsic feature of the computation that you wish to capture, building a model around it, and then identifying resources that you wish to expend little of. After the dust settled around the different TM/RAM models for effective computation, we now accept (for the most part) the RAM as a good model for sequential computation. The I/O model identifies disk access as the key operation to capture and builds a model of computation where disk access is a limited resource. Streaming identifies "amnesia" as the key limitation on a computation, and builds a model of computation centered around that.

In all these cases, it's possible to grumble that $O(n^5)$ isn't really "efficient" or that galactic algorithms shouldn't count, or that ignoring main memory computations is "cheating" in the I/O model, or even that allowing $n^{1-\epsilon}$ memory storage is useless for streaming. But those discussions are secondary to the model: and in fact history tells us that inevitably models that appear to allow for horribly inefficient computation facilitate algorithm design that is quite efficient !

Which is why I'm puzzled when I look at the different ways in which the MapReduce framework is being modelled. I see a number of choices being made:  number of processors should be $O(n^{2-\epsilon})$, or total memory usage should be $O(n^\epsilon)$ or number of 'rounds' of computation should be polylogarithmic or even constant.

At one level I understand these choices: for example, a memory bound of $n^\epsilon$ appears to lead to constant $(1/\epsilon)$ round algorithms.

But should the model be getting into decisions about what's efficient before even deciding what resources it is trying to capture ?

Ultimately, this for me gets back to a question that I asked at the W8F workshop back in May:
What is the intrinsic computational metaphor that we are trying to capture with these models ?
It's not just parallelism. It's not just distributed computation. It's not even streaming (or more generally sublinear computations). Is it some delicate tradeoff between these three ? Should we think of map-reduce models like the circuit classes that have bounds on both space and depth ? Wouldn't this be an unholy mess ?

I suspect that over time, if the MapReduce computational framework persists, then these issues will get shaken out. But there's a provisional (and exciting!) feel about the models we currently have.

Wednesday, October 12, 2011

O() notation breeds laziness

I will often write "We store the $n^2$ pairs of objects in a data structure" when what I really mean is "We store the $\binom{n}{2}$ pairs...". Of course there's no "real" difference in asymptotic land. But I've often found that students reading a paper will puzzle over asymptotic handwaving, wondering why someone wrote $O(n^2)$ when it should really be over all pairs, and whether they're missing some key insight. It's even worse when people throw in gratuitous and specific constants just to make an analysis work ("why do we really need to run for 18.5 n steps???").

While I completely understand the reasons people do this (ranging from the last minute deadline insertion to a misguided attempt at clarity), I think that making these shortcuts often makes a paper harder to read rather than easier. There are places where the asymptotics are all that matter ("Run this randomized strategy $O(\log n)$ times and apply a Chernoff bound") but even there, it doesn't hurt to use $c \log n$ (or a specific constant if you need a specific polynomial bound) instead. And when there actually is a precise expression that conveys meaning (like in the binomial example above), it's far better to use that to signify what's going on.

I think the precept we teach to students in algorithm analysis (don't throw in O() notation till the end) would be well worth paying more heed to ourselves.

Caveat: I'm not advocating this in ALL instances. I'm advocating a greater sensitivity to when and where the use of O() notation is more or less appropriate.

SoCG 2012 CFP

It's out! Please also note the new workshop CG:APT (highlighted), which will attempt to expand the scope of the conference and bring in more applied work. I'm looking forward to hearing more about it.


28th Annual Symposium on Computational Geometry (SoCG 2012)
June 17-20, 2012 (tentative)
University of North Carolina, Chapel Hill
In cooperation with ACM SIGACT and SIGGRAPH



The Twenty-Eighth Annual Symposium on Computational Geometry will be held at the University of North Carolina, Chapel Hill. We invite submissions of high-quality papers that describe original research on issues related to computational problems arising from geometric considerations. We also invite submissions of original videos and multimedia on these same themes. This year, SoCG will be collocated with a workshop, Computational Geometry: Applications, Practice, and Theory (CG:APT), whose call for submissions (due Feb 27) will be distributed separately.

The topics of the SoCG 2012 Symposium reflect the rich diversity of research interests in computational geometry. They are intended to highlight both the depth and scope of computational geometry, and to invite fruitful interactions with other disciplines. Topics of interest include, but are not limited to:
  • design, analysis, and implementation of geometric algorithms and data structures; lower bounds on the computational complexity of geometric problems;
  • mathematical, numerical, algebraic, and experimental issues arising in geometric algorithms and heuristics; discrete and combinatorial geometry; computational topology;
  • novel applications of computational geometry in computer graphics, geometric modeling, computer-aided design and manufacturing, scientific computing, geographic information systems, database systems, robotics, computational biology, machine learning, sensor networks, biomedical imaging, combinatorial optimization, statistical analysis, discrete differential geometry, theoretical computer science, graph drawing, pure mathematics, and other fields; novel problems in computational geometry arising from these fields. 

Important Dates 

November 22, 2011: Paper titles and abstracts (at most 300 words) due (23:59, Honolulu time)
December 2, 2011: Paper submissions due (23:59, Honolulu time)
February 17, 2012: Notification of acceptance/rejection of papers
February 27, 2012 : Submissions of videos and multimedia due (23:59, Honolulu Time)
March 8, 2012: Notification of acceptance/rejection of video/multimedia
March 22, 2012: Camera-ready versions due for papers and video/multimedia abstracts
April 26, 2012: Final versions of video/multimedia due
June 17-20, 2012 (tentative): Symposium in Chapel Hill, North Carolina

Call for Papers

We invite submissions of high-quality papers describing original research on geometric algorithms and data structures, their mathematical foundations and correctness, their implementation, and their applications.

Final versions of accepted papers will be published by ACM in the symposium proceedings. Proceedings will be distributed to symposium participants and will also be available from ACM for purchase and through the ACM digital library. An author of each accepted paper will be expected to attend the Symposium and give a presentation (approximately 20 minutes) of the paper. Authors of a selection of papers from the conference will be invited to submit extended versions of their papers to a special issue of one or more journals. This year we also plan to confer two Best Paper Awards, and a Best Student Presentation Award. The Student Presentation award will be based on the quality of the presentation of a paper by a student during the conference.


Paper Submission 

All submissions should be made electronically; see the EasyChair SoCG2012 web page https://www.easychair.org/conferences/?conf=socg2012 for detailed submission instructions (to be available shortly). If electronic submission is not feasible, please contact the program committee chairs, Tamal Dey and Sue Whitesides, well in advance of the submission deadlines.

Submission Guidelines
Papers should be submitted in the form of an extended abstract, which begins with the title and abstract page that contains the title of the paper, each author's name, affiliation, and e-mail address followed by a short abstract. The main body of the extended abstract should begin with a precise statement of the problem considered, a succinct summary of the results obtained (emphasizing the significance, novelty, and potential impact of the research), and a clear comparison with related work. The remainder of the extended abstract should provide sufficient detail to allow the program committee to evaluate the validity, quality, and relevance of the contribution. Clarity of presentation is very important; the whole extended abstract should be written carefully, taking into consideration that it will be read and evaluated by both experts and non-experts, often under tight time constraints.


Submissions should be typeset in single column format, using 11-point or larger font, with at least 1 inch/2.54 cm margins and single line spacing. Excluding the title page and bibliography, the extended abstract must not exceed 10 pages. Submissions deviating from these guidelines risk rejection without consideration of their merits.

All necessary details to verify the results must be provided. If they cannot fit within the 10-page limit, a clearly marked appendix containing omitted details should be included. Appendices are not counted in the 10 page limit, so while they may serve as a reference, they will be read by the program committee members at their discretion. The paper excluding the appendix should clearly describe the results and the approach to achieve them, and give sufficient confidence for their validity. The appendix should then give all the necessary details to verify correctness.

Anticipating the usual high overall quality of submissions, the program committee intends to interpret the scope of the conference broadly, and will seriously consider all papers that are of significant interest to our research community.

Authors must submit the title and an abstract (at most 300 words) of their papers by November 22, 2011. This pre-submission will be used to help make program committee reading assignments. Extended abstracts must be received by December 2, 2011 (23:59, Honolulu time). There will be no extension of these deadlines; late submissions will not be considered. Authors will be notified of acceptance or rejection by February 17, 2012. The final proceedings papers must be formatted in accordance with ACM proceedings guidelines; LaTeX style files will be made available to authors of accepted papers.

Concurrent submission of the same (or essentially the same) abstract to SoCG and to another conference with published proceedings is not allowed. An extended abstract of a paper that is under journal review, or scheduled for publication in a journal after June 2012, may be submitted, when it is clear that the extended abstract differs substantially from the journal version. In such cases, the authors must include the journal version in an appendix that clearly identifies the status of the journal submission.

Program Committee

  • Pankaj Agarwal (Duke University)
  • Dominique Attali (Gipsa-Lab, Grenoble)
  • Gill Barequet (Technion-Israel Institute of Technology)
  • Mark de Berg (Technische Universiteit, Eindhoven)
  • Danny Chen (University of Notre Dame)
  • Tamal Dey (The Ohio State University) (co-chair)
  • Vida Dujmović (Carleton University)
  • David Eppstein (University of California, Irvine)
  • Leonidas Guibas (Stanford University)
  • Sylvain Lazard (INRIA, Nancy Grand Est)
  • Dinesh Manocha (The University of North Carolina at Chapel Hill)
  • Steve Oudot (INRIA Saclay)
  • Konrad Polthier (Freie Universität Berlin)
  • Edgar Ramos (Universidad Nacional de Colombia, Medellín)
  • Jian Sun (Tsinghua University)
  • Takeshi Tokuyama (Tohoku University)
  • Yusu Wang (The Ohio State University)
  • Max Wardetzky (University of Göttingen)
  • Sue Whitesides (University of Victoria, British Columbia) (co-chair)


Call for Video and Multimedia Presentations

Video and multimedia presentations are sought for the 21st Annual Video and Multimedia Review of Computational Geometry, to accompany the 28th Annual Symposium on Computational Geometry. This review showcases the use of visualization in computational geometry for exposition and education, for visual exploration of geometry in research, and as an interface and a debugging tool in software development. Algorithm animations, visual explanations of structural theorems, descriptions of applications of computational geometry, and demonstrations of software systems are all appropriate.

Three to five minutes is ideal for most presentations; eight minutes is the upper limit. Accepted video and multimedia presentations will have an abstract in the published conference proceedings; video/multimedia authors will have an opportunity to present their work at the Symposium during a dedicated video session. Accepted presentations will be available online in various formats in a web proceedings. See http://www.computational-geometry.org/ for examples of previous years' proceedings.

Submissions of video clips in QuickTime or MPEG-4 compressed formats (e.g., XviD or DivX version 6) are encouraged. We also encourage submissions of Macromedia Flash, Java applets, and limited forms of other multimedia or software. These formats must come with a script that will allow them to be distributed in both interactive and canned Quicktime or MPEG video formats. In case of doubt, please email the Video and Multimedia Program chair.

Each submission should include a one or two-page description of the material shown in the presentation, and where applicable, the techniques used in the implementation. The final two-page descriptions must be formatted according to the guidelines for proceedings. LaTeX style files will be provided to authors of accepted presentations.

Submission

Submissions should be deposited online where they are accessible through the web or via FTP. Send email to the Video/Multimedia committee chair, Christian Knauer by 23:59 Honolulu Time, Monday, February 27, 2012, with the following information: the names and institutions of the authors, the email address of the corresponding author, and instructions for downloading the submission. For ease of sharing and viewing, we encourage (but do not require) that each submission be uploaded to YouTube, and that the corresponding URL be included with the submission.

We explicitly encourage video/multimedia submissions that support papers submitted to the Symposium. Submitted papers and associated video/multimedia submissions will be treated entirely separately by the respective committees: acceptance or rejection of one will not influence acceptance or rejection of the other.

Authors will be notified of acceptance or rejection, and given reviewers' comments by March 8, 2012. For each accepted submission, the final version of the 2-page textual description will be due by March 22, 2012 (electronically) for inclusion in the proceedings. Final versions of accepted video/multimedia presentations will be due April 26, 2012 in the best format available.

Important Dates

Feb. 27, 2012: Video and Multimedia submissions due
Mar. 8, 2012: Notification for Video/MM submissions (23:59 Honolulu time)
Mar. 22, 2012: Camera-ready video/MM abstracts due
Apr. 26, 2012: Final versions of video/MM presentations due

Video and Multimedia Presentation Program Committee
  • Helmut Alt - Freie Universität Berlin
  • Sandor Fekete - TU Braunschweig
  • Panos Giannopoulos - Universität Bayreuth
  • Xavier Goaoc - INRIA Nancy Grand Est
  • Rolf Klein - Universität Bonn
  • Christian Knauer (chair) - Universität Bayreuth
  • Joseph S. B. Mitchell - SUNY Stony Brook
  • Bettina Speckmann - TU Eindhoven
  • Fabian Stehn - Universität Bayreuth
  • Alexander Wolff - Universität Würzburg


SOCG 2012 Local Arrangements

Jack Snoeyink, Chair

Friday, September 23, 2011

Finding the right reference for finding the majority element

Probably everyone knows the standard trick for determining in one pass whether a stream admits an element occurring more than half the time. But what not everyone might know is the origin of this method.
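For readers who haven't seen it, here's a minimal sketch of the trick (my own rendering, with made-up names and test data, not code from the original paper): keep one candidate and one counter, and remember that the candidate is only guaranteed to be the majority element if a majority element exists, so a verification pass is still needed. The Misra-Gries generalization mentioned below keeps $k-1$ candidate/counter pairs to catch every element occurring more than $n/k$ times.

```python
# Boyer-Moore majority vote: one pass, one candidate, one counter.
def majority_candidate(stream):
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # adopt a new candidate
        elif x == candidate:
            count += 1                # the candidate gains a vote
        else:
            count -= 1                # a different element cancels a vote
    return candidate

stream = [3, 1, 3, 3, 2, 3, 3]
cand = majority_candidate(stream)
# Second pass: verify the candidate really occurs more than half the time.
assert stream.count(cand) > len(stream) // 2
print(cand)  # 3
```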

Stuart Shieber writes about the algorithm (discovered by Boyer and Moore) and the mishaps it had on the way to publication. The article also describes how the followup paper by Misra and Gries came into being. Interestingly, I almost always see the Misra-Gries generalization being cited, and not the original Boyer-Moore method.

(HT Fernando Pereira)

Wednesday, September 07, 2011

ECML-PKDD Conference Report: Part II

My student Parasaran Raman continues his conference report from ECML-PKDD in Athens. For his first post, see here.

Christopher Bishop gave an excellent plenary talk on the first day about the need to embrace uncertainty. He talked about the Kinect technology that uses an infrared camera and sensors to measure the distance of a person from the screen. Instead of modeling the problem as a motion tracking problem, the folks at MSR started to look at the challenge as a pattern recognition problem. Each frame is now cold-started and understood as a picture, instead of tracking the motion in a video sense. The goal now is to classify sets of pixels enclosed inside an axis-aligned rectangle. The training data is collected as a series of motion-capture videos, made by letting a test subject wear sensors on the body Hollywood-animation style, and is combined with some synthetic data, corrupted artificially since these are "too pure". The training set consists of a whopping 1 million examples of body positions.

They then use a decision tree to classify the pixels into one of 31 predefined body parts by maximizing an information gain function. Another sub-goal that Bishop talked about was to identify human joints. For this, they use a kind of mean-shift technique, applying a simple Gaussian smoothing function on joints.

The second half of the talk revolved around model-based ML, where Bishop recommends constructing model equivalents of algorithms. He says, and I quote, “Don't look at the algorithms; look at the models”. He talked about an application where the goal is to rank players who play each other in a common environment. They have a new rating system, TrueSkill, that extends the popular Elo rating system [which typically only works in a two-player game like chess] to multi-player team games. TrueSkill performs a kind of Bayesian ranking inference on the graphs by local message passing.
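(For context, here's the classical Elo update that TrueSkill generalizes. This is the textbook two-player formula, not TrueSkill's Bayesian inference, and the K-factor of 32 is just a common default.)

```python
# Classical Elo: a point estimate of skill, updated after each two-player game.
# TrueSkill instead keeps a posterior (mean + uncertainty) per player and
# updates it over team games by message passing on a factor graph.
def elo_update(r_a, r_b, score_a, k=32):
    """score_a is 1 if A wins, 0 if A loses, 0.5 for a draw."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

print(elo_update(1600, 1500, score_a=1))  # the favourite gains about 11 points
```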

He also introduced Infer.NET, a framework for running Bayesian inference in graphical models.
[http://research.microsoft.com/en-us/um/cambridge/projects/infernet/]. This looks like a cool tool to play with, especially since it provides a single platform for classification and clustering problems.

The plenary talk on day two was by Rakesh Agrawal. In a talk that focused on a very interesting application, Rakesh introduced the idea of using data mining to enrich education. The key idea is to take concepts from textbooks, map them to relevant Wikipedia articles, and then induce relationships between the concepts.

The goal is to enrich sections in the textbooks by providing recommendations for illustrative pictures to go with the text. Dispersion [which is a measure of how unrelated the concepts are] and syntactic complexity play a big role in deciding “which picture should go where”. Since search engines fail when given long keyword queries, one challenge is to find key concepts in a section that can then be matched up to the most similar article in Wikipedia. The underlying assumption was that most high school textbook sections will have an associated Wikipedia article.

Rakesh showed results of applying this to a huge word corpus from NCERT, which drafts the high school textbooks used by most schools in India. He also talked about computing an immaturity score to see whether a section is well written, and not surprisingly, subjects like physics and chemistry scored over commerce and economics!

To summarize, two things go into solving the problem of augmenting textbooks with images: “Comity” [where they identify lots of short key queries by concatenating 2 or 3 important concepts at a time] and “Affinity”. Comity ranks images by computing a relevance score for each image in the context of the concepts picked out from a section, and Affinity identifies relevant webpages and computes an importance score. Another challenge is that these techniques cannot operate one section at a time, since the textbooks would probably not like the images to repeat!

Tuesday, September 06, 2011

ECML-PKDD conference report: Part I

My student Parasaran Raman is in Athens attending ECML-PKDD and associated workshops, and will be updating this blog regularly with details from the conference. His posts are edited lightly for format and to add links.


I am in Athens [+1 to location-driven-research] to attend ECML PKDD 2011 and I am at the second MultiClust workshop today. This workshop is held in conjunction with ECML PKDD 2011.

The first talk of the morning was an invited talk by Michael Houle on "Combinatorial approaches to clustering". Michael Houle began the talk with how, in high dimensions, all our intuitions about space go for a toss. One example that he gave illustrated how most of the points are concentrated along the boundaries in high dimensions. The talk revolved around how similarity measures between partitions cannot be fully trusted: although a range of combinatorial and spatial measures have been introduced, each suffers from some kind of drawback. He also talked about constructing secondary similarity measures [“Secondary similarity between two points v and w is defined in terms of data objects in the common intersection of neighborhoods based at v and w, where the neighborhoods themselves are determined according to a supplied primary similarity measure”] and how they can get around the curse of dimensionality.

One interesting question that was thrown out was "Can we do fully automated clustering?", i.e., can we convert similarity functions into rankings on the neighbors, assume no knowledge of the distribution of the data, and automatically pick 'k'? The talk moved on to approaches for computing the quality of a partition by computing a significance score for each point, which depends on the correlation between a member of a set and the members of neighboring sets and other members of the same set, along with the size of the cluster and the feature set. Once we know an object's significance, we can use it to change the shape of a cluster by adding and removing points.

Bart Goethals gave another invited talk titled “Cartification: from similarities to item set frequencies”. He talked about computing the k-nearest neighbors of every data point to find the right cluster centers. The idea is that true 'centers' occur in many k-NN baskets.
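(A toy illustration of that intuition, entirely my own and not Goethals' cartification method: count how often each point lands in other points' k-nearest-neighbor lists; the heavy hitters tend to sit near the cluster centers.)

```python
# Points near cluster centers show up in many k-NN "baskets".
import numpy as np

rng = np.random.default_rng(1)
pts = np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)])  # two 1-D clusters

k = 10
counts = np.zeros(len(pts), dtype=int)
for i in range(len(pts)):
    d = np.abs(pts - pts[i])
    d[i] = np.inf                    # a point is not its own neighbor
    for j in np.argsort(d)[:k]:      # the k nearest neighbors of point i
        counts[j] += 1               # point j lands in i's basket

# The most frequent basket members tend to lie near the cluster centers (0 and 10).
print(np.sort(pts[np.argsort(counts)[-5:]]))
```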

There were many interesting ideas out of the other talks at the workshop. I will highlight a few of them:

  1. One of the first talks, about “Subjectively interesting alternate clusters”, uses an information-theoretic framework to find interesting alternate clusters.
  2. Jilles Vreeken gave a very good talk about the current approaches in pattern mining and how we can use that knowledge in data mining. In this talk, titled “where pattern met subspace cluster”, he highlighted the similarities between subspace clustering and pattern mining, where the goal is to find informative local structures.
  3. The talk on “Factorial clustering” was about using latent variables to generate multiple clusterings. One question that they wanted an answer for was a way to compute the right distance between partitions, and I said, "Hey! Use ours!”
  4. Another talk, on “Browsing clustering alternatives", outlined constructing a Merge-Split tree structure on the space of partitions to enable user-driven browsing of partitions.
My talk was the final one of the session. I talked about some recent work with Suresh Venkatasubramanian and Jeff Phillips on exploring the landscape of clusterings by sampling from the space of partitions proportional to a quality measure. It was generally good to see a lot of head-nods during the talk. A couple of people from industry who were present agreed that multi-clusterable data exists and that the hard part is finding it. So presenting a landscape to pick from looked like a useful idea!

The day ended with a rooftop reception with a magnificent view of the Acropolis and the temple of Zeus! I had very useful dinner conversations with Michael Houle and Thomas Seidl about general problems in the area of meta-clustering and on how to compute distance metrics on partitions. We chatted about combinatorial distances, spatial distances, and secondary-level distances, where one constructs a ranking on top of an existing distance between many members.

Monday, September 05, 2011

FOCS 2011 Registration open

Marek Chrobak informs me that FOCS 2011 registration is open. The conference is being held from Oct 22-25 in Palm Springs CA, home of the Palm Springs Aerial Tramway that does a record 8500 ft (2600 m) elevation gain to give you views of the valley.

When you're not being made dizzy by the views, attend a tutorial or two or three ! Cynthia Dwork, Kirk Pruhs and Vinod Vaikuntanathan are the headliners for a day of tutorials on Saturday Oct 22. Shafi Goldwasser will be awarded the 2011 IEEE Emmanuel R. Piore Award, given for "outstanding contributions in the field of information processing, in relation to computer science".

I'm told there are also some talks, including some that smash through barriers for the k-server conjecture, metric TSP (in graphs) (twice over), and exact 3SAT.

Register now so that local organizers don't go crazy (I can assure you that they do :)). The deadline is Sep 29 for cheap rates on hotels and registration. And if you'd like to be a FOCS 2011 reporter and write a set of blog posts summarizing the conference for cstheory.blogoverflow.com, please add a note to this post.

Thursday, September 01, 2011

MADALGO Summer School: Wrapup

Speakers from the MADALGO summer school

I had mentioned the MADALGO summer school back in June. I'm happy to be informed by Piotr Indyk that lecture notes from the summer school are now up, along with additional reading material. Good stuff on similarity search, high dimensional combinatorics, the Hirsch non-conjecture, embedding/sketching and metrics/nets/dimension and measures.

Sunday, August 28, 2011

A way forward on reformatting conferences

In some circles, it's a badge of honor to attend as few talks at a conference as possible. Some of the usual comments are:
  • "I go to meet people, not attend talks"
  • "All the interesting conversations happen in the hallways"
  • "Talks are too difficult to follow: I'd rather read the paper or just ask the author"
  • "I read this paper 6 months ago when it appeared on the arxiv: it's old news now, and there are improvements"
You have to wonder why anyone gives talks at a conference any more ! And Moshe Vardi asks this question in a CACM note. Riffing off of "a conference is a journal held in a hotel" (attributed to Lance Fortnow, who attributes it to Ed Lazowska), he talks about the low quality of conference talks, and suggests ways to improve them.

But there's a certain 'band-aid on a bleeding carcass' aspect to this discussion. Indeed, between overloaded reviewers, authors who need the imprimatur of a prestigious conference, and registration fees that skyrocket as meetings get longer, it almost seems like this system is heading for a nervous breakdown.

But there are a number of experiments in play that point the way towards a gentler, kinder conference system (even if we decide not to grow up). In this G+ discussion, Fernando Pereira and Zach Ives describe two models that put together address the main problems with our conference process.

NIPS receives over 1400 submissions, and accepts a small fraction (a little over 20%). All papers are presented as posters (with a few special talks). This does two things:
  1. It removes artificial limits on number of papers accepted based on conference duration. Posters are presented in (semi)-parallel. 
  2. It eliminates the "20-minutes of droning with no questions" style of many conference talks. Posters are a much more interactive way of presenting material, and it's easier to skim papers, talk to the authors, and have a good discussion. The papers are still in the proceedings, so you can always "read the paper" if you want. As an aside, it really helps with communication skills if you have to answer questions on the fly. 
VLDB has now moved to a journal-based submission process. There's a deadline each month for submitting papers for review. The review process is fairly quick: 45 days or so, with enough time for back and forth with the authors. Accepted papers are published in the proceedings, and while I'm not sure exactly how the conference selects talks for presentation, it's possible that all accepted papers are then presented. The main advantages of this process:
  1. There isn't a huge burst of submissions, followed by a draining review process. Reviews are spread out over the year. Moreover, area chairs are used to partition papers further, so any (P)VLDB reviewer only gets  a few papers to review each month. This can only improve the quality of reviews.
  2. The journal-style back-and-forth makes papers better. Authors can make changes as recommended, rather than trying to defend their choices in an often-contentious rebuttal process. 
Between these two systems, we have a better review process for papers, and a better way of delivering the content once reviewed. Why not combine them ? 


Monday, July 18, 2011

Axioms of Clustering

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

``I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it..."
-- Justice Potter Stewart in Jacobellis v. Ohio.
What makes one clustering better than another? To answer that question we have previously assumed that a well-motivated objective function was given to us. Then the situation is easy: we compute the value of the objective and decide accordingly. In practice, however, clustering is often an intermediate step in a longer chain of computations, and there is no specific motivation for one objective function over another. Debating the pros and cons of the multitude of possible objective functions can easily take hours. Instead, we can further the "I know it when I see it" intuition by writing down the properties that any good clustering objective should satisfy, and see if this axiomatic approach guides us towards the right solution.

This approach was undertaken by Kleinberg in his work on clustering axioms. A clustering function is one that takes a set of points (and their pairwise distances) and returns a partition of the data. The function should abide by three simple axioms:

  • Scale-Invariance: If all distances are scaled by a constant factor, the clustering should not change. Put another way, measuring the distance in miles or kilometers (or nanometers) should not change the final clustering partition.
  • Richness: Depending on the pairwise distances, any partition should be possible. For example, the clustering should not always put two specific points $x$ and $y$ together, regardless of the distance metric.
  • Consistency: If the pairwise distances between points in a cluster decrease, while the distances between pairs of points in different clusters increase, the clustering should not change. Indeed, such a transformation makes clusters 'tighter' and better separated from each other, so why should the clustering change? 
Each of the axioms seems reasonable, and it is surprising that there is no clustering function that is consistent with all three axioms ! The trouble lies with the last axiom: by changing the distances we require that a clustering stay the same, but an even better clustering may emerge. For example, consider points lying in several "obvious" clusters, and proceed to move one of these clusters out to infinity. As we move the points further and further out, the resulting dataset appears to be two-clusterable (see the image below).


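To make the axioms concrete, here is a small executable sketch (my own example, in the spirit of Kleinberg's discussion of single-linkage stopping rules, not code from his paper): single linkage that stops merging at a fixed distance threshold $r$ is a perfectly natural clustering function, yet it violates Scale-Invariance, since rescaling the input changes which merges fall under the threshold.

```python
# Single linkage with a fixed distance threshold r, on 1-D points.
import itertools

def single_linkage_threshold(points, r):
    clusters = [{i} for i in range(len(points))]
    def dist(a, b):  # single-linkage distance between two clusters
        return min(abs(points[i] - points[j]) for i in a for j in b)
    while len(clusters) > 1:
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        if dist(clusters[i], clusters[j]) > r:
            break                       # closest pair is too far apart: stop
        clusters[i] |= clusters[j]
        del clusters[j]
    return sorted(tuple(sorted(c)) for c in clusters)

pts = [0.0, 1.0, 10.0, 11.0]                 # two "obvious" pairs
print(single_linkage_threshold(pts, r=2.0))  # [(0, 1), (2, 3)]
# Shrink all distances by a factor of 10: a scale-invariant function would
# return the same partition, but the fixed threshold now merges everything.
print(single_linkage_threshold([p / 10 for p in pts], r=2.0))  # [(0, 1, 2, 3)]
```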
This leads to a slightly different tack at the problem. Instead of thinking about a specific clustering, or a specific partition of the data, we try to define an objective function such that the optimum clustering under that objective behaves in a reasonable manner. This is the approach adopted by Ackerman and Ben-David. Let $m$ be a function measuring the quality of the clustering. As before, Scale-Invariance requires the measure $m$ to be independent of the units on the distances, and Richness requires that every clustering can be the optimum (under $m$) clustering.

The Consistency axiom is the one undergoing a subtle change. Instead of requiring that the clustering stay the same if such a perturbation to the distances occurs, we only require that the score, or measure, of the clustering improve under such a transformation. This breaks the counterexample above: while the score of the 1-clustering improves as we move the points out, the score of the 2-clustering improves even more and eventually surpasses it. Indeed, the authors demonstrate a set of different clustering metrics that are consistent with this set of axioms.

-- Sergei Vassilvitskii

Friday, July 01, 2011

SODA 2012 submission question

This is what the SODA 2012 submission guidelines say (emphasis mine):
The submission, excluding title page and bibliography, must not exceed 10 pages (authors should feel free to send submissions that are significantly shorter than 10 pages.) If 10 pages are insufficient to include a full proof of the results, then a complete full version of the paper (reiterating the material in the 10-page abstract) must be appended to the submission after the bibliography. The length of the appended paper is not limited, and it will be read at the discretion of the committee. A detailed proof in the appended complete version is not a substitute for establishing the main ideas of the validity of the result within the 10-page abstract.
If I'm reading this right, you can't just shunt certain proofs and exposition to an appendix, as was previously done. You have to make an entire full version of the paper and append it to the shortened version. Is that correct ?

Update (7/5): I have an update from Yuval Rabani, the PC chair, which resolves the question (in the affirmative):
This is an experiment I introduced in response to conflicting attitudes in the PC. The idea is that some PC members would like to read the entire paper, and it's hard to follow the paper when you have to jump back and forth between the 10 page abstract and the appendix. So the idea is that most committee members will just print the 10 page abstract, and those interested in the entire paper will print the attached full version. Since this is an experiment, I don't intend to enforce the guidelines very strictly, but if it's not too much of an effort on your part, we do prefer a submission according to the rules.

Thursday, June 30, 2011

Bob Morris and stream algorithms

Bob Morris, one of the early contributors to UNIX and an NSA cryptographer, died on Sunday. The New York Times has a remembrance that talks about his contribution to password security and cryptography in general.

What's not mentioned in the NYT profile is that he's the Morris of the 'Morris counter'. This was arguably the very first streaming algorithm, nearly 20 years before the AMS paper and the HRR tech report first introduced streaming algorithms, five years before the Flajolet-Martin paper on counting distinct elements in a stream (also not framed as a streaming problem), and two years before the Munro-Paterson algorithm for streaming median finding.

The Morris paper is titled "Counting large numbers of events in small registers":
It is possible to use a small counter to keep approximate counts of large numbers. The resulting expected error can be rather precisely controlled. An example is given in which 8-bit counters (bytes) are used to keep track of as many as 130,000 events with a relative error which is substantially independent of the number n of events. This relative error can be expected to be 24 percent or less 95 percent of the time (i.e. σ = n/8). The techniques could be used to advantage in multichannel counting hardware or software used for the monitoring of experiments or processes.
In modern language, this can be re-rendered as:
It is possible to approximately count the number of elements in a stream using $O(\log \log n)$ bits, where $n$ is the length of the stream.
The idea is very simple. The trick is to count $C = \log n$ instead of $n$. Clearly, doing so requires only $\log\log n$ bits. Then any additive approximation to $C$ yields a multiplicative approximation to $n$.

So how do we count $\log n$ ? Let's assume for now that we're using base 2 logarithms. If the counter currently represents a count of $i$, then the next increment should arrive, in expectation, about $i$ elements later. However, we don't know that the counter value reflects the true count, so we instead do it probabilistically: toss a coin whose probability of coming up heads is $2^{-C}$, and increment the counter only if the coin comes up heads.

Philippe Flajolet did a masterful analysis of this approach, and showed that the expected value of $2^C$ was $n+2$, yielding an unbiased estimator for the true count. He also showed that the variance of $2^C$ is $n(n+1)/2$.

This approach yields a fixed error bound for the resulting count. The next trick is then to change the base from 2 to some parameter $a$, and then reanalyze the method. Much work later, it can be shown that with roughly $\log \log n + \log(1/\epsilon)$ bits, you can get a multiplicative approximation of $1+\epsilon$.
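Here is a toy simulation of the counter (my own code, not Morris's; note that this version starts the register at 0, so the expected value of $a^C$ is $1 + n(a-1)$ and the unbiased estimator is $(a^C - 1)/(a-1)$; the $n+2$ quoted above corresponds to starting the base-2 register at 1):

```python
# Approximate counting in a tiny register: store C ~ log_a(count), and
# increment C with probability a^{-C}, so that a^C grows by (a-1) per event
# in expectation.
import random

class MorrisCounter:
    def __init__(self, a=2.0):
        self.a = a   # base of the logarithm being tracked
        self.c = 0   # the small register, roughly log_a of the count

    def increment(self):
        if random.random() < self.a ** (-self.c):
            self.c += 1

    def estimate(self):
        # Unbiased for the number of increments when the register starts at 0.
        return (self.a ** self.c - 1) / (self.a - 1)

random.seed(42)
counter = MorrisCounter(a=1.1)    # a base closer to 1: less variance, more bits
for _ in range(100000):
    counter.increment()
print(counter.c, round(counter.estimate()))  # register ~ 100, estimate ~ 100000
```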

Postscript

Stream algorithms are a good icebreaker in an algorithms class: they're unusual, seem to solve an impossible problem, and do it really well. I also like to point out to students that three of the more important streaming algorithms were designed well before people cared about "big data" or streaming. It's a valuable lesson in the importance of understanding "old" tools.

In case you're wondering if anyone still uses Morris counters, here's a paper from IJCAI 2009 that plugs them into a larger system to maintain counts of $n$-grams when building a statistical language model.

Friday, June 24, 2011

Workshop on Coding, Complexity and Sparsity

Anna Gilbert, S Muthukrishnan, Hung Ngo, Ely Porat, Atri Rudra, and Martin Strauss are organizing a workshop on coding, complexity and sparsity at Ann Arbor in August, and are soliciting participation for a Dagstuhl-style event. Here's the blurb:


Workshop on Coding, Complexity and Sparsity
Ann Arbor, Aug1--4, 2011. 
Organized by Anna Gilbert, S Muthukrishnan, Hung Ngo, Ely Porat, Atri Rudra, Martin Strauss.

There has been a considerable amount of fruitful interaction between coding and complexity theories, and a fledgling interaction between sparse representation and coding theories. We believe that academic research will be far richer exploring more of the connections between sparse representation and coding theories, as well as between sparse representation and complexity theories. Could there be a general structural complexity theory of sparse representation problems, and could techniques from algorithmic coding theory help sparse representation problems?
The plan is to get researchers from different areas together in a Dagstuhl-like setting and explore the question. There will be tutorials, talks and time to research. Registration is free and the deadline is July 15th. We have limited funding to (partially) support the participation of some graduate students.
For more details on the workshop (including the tutorials, list of invited speakers and how to apply for participation support), please go to the workshop webpage.

Thursday, June 23, 2011

MADALGO Summer School

There are lots of good things happening at MADALGO. The institute was recently renewed for another 5 years, they recently organized the now-regular MASSIVE workshop in conjunction with SoCG (maybe it should also be organized in conjunction with SIGMOD or GIS ?), and now they're organizing a summer school on high dimensional geometric computing in conjunction with the new Center for the Theory of Interactive Computation (CTIC) run jointly by Aarhus and Tsinghua.

There's a stellar list of lecturers:
  • Alexandr Andoni (Microsoft Research Silicon Valley)
  • Ken Clarkson (IBM Research)
  • Thomas Dueholm Hansen (Aarhus University)
  • Piotr Indyk (MIT)
  • Nati Linial (Hebrew University)
and they're soliciting participation. Here's the blurb:
The summer school will take place on August 8-11, 2011 at Center for Massive Data Algorithmics (MADALGO) and Center for the Theory of Interactive Computation (CTIC) in the Department of Computer Science, Aarhus University, Denmark.
The school is targeted at graduate students, as well as researchers interested in an in-depth introduction to high-dimensional geometric computing.
The capacity of the summer school is limited. Prospective participants should register using the online registration form available here as soon as possible. Registering graduate students must also have their supervisor send a letter confirming their graduate student status directly to Gerth Brodal: gerth@madalgo.au.dk; the subject line of the email should be 'student_last_name/SS_2011/confirming'. Registration is on a first-come-first-serve basis and will close on Monday July 4, 2011.
Registration is free; handouts, coffee breaks, lunches and a dinner will be provided by MADALGO, CTIC and Aarhus University.
 I'd be remiss if I didn't point out that any MADALGO-sponsored event will also have rivers of beer, if you needed more incentive to attend. 

Wednesday, June 22, 2011

Applying for Jobs: the Job Talk

This is the fifth post in a series on applying for faculty positions. It is written by Jeff Phillips, frequent guest blogger, who will be starting as an Assistant Professor at the University of Utah in Fall 2011.



The final component of applying for jobs is the job talk. It will usually be about an hour, but this may vary, so ask. Usually this means you can talk for about 58 minutes before they want to stop you for questions. But again, it's best to confirm this; ask your host specifically. Also ask how frequently you should expect to be interrupted with questions during the talk. Of course, this may vary greatly from talk to talk even at the same place (as I have seen in my experience). But, from my perspective, many instances where the speaker was delayed for more than 5 minutes due to questions in the talk were the fault of the speaker. This usually occurred by leaving ambiguities in the talk, by failing to answer questions clearly and concisely (which may lead to more questions), and by not taking charge and insisting on taking longer questions off-line.
The most important failure (ambiguities) will hopefully be avoided by taking the steps below.



First, what should you talk about?
  • Talk about your best work!
  • Talk about your work that appeals to a broadest audience and that you can convey the main ideas of most clearly.
  • Talk about the project you are most well-known for; this is probably because it's your best-received, and likely your actual best work, even if you have some "personal" connection to some other topic.
  • Talk about the work that is most likely to get you hired. This last one can be tricky, but try to ask your host or anyone else you know at the institution what specific type of person they are looking to hire. No one will fit a mold perfectly, but make sure the faculty there don't have to squint too hard to see you fit in that mold.
If it is the same topic that fits all of these criteria, then you have a much easier task. Otherwise, you may need to compromise on some of these. Try not to sacrifice strength of work and clarity of key ideas. If you are not sure which topic is your strongest, ask around.



What should go into the talk? Not too much.
Present at most 2 or 3 main ideas. Most papers have one main idea, even if there are several smaller ideas in the paper. The lasting contribution is usually one key insight (even if it took a lot of hard work and many other technical details to get everything to work). Try to abstract away those technical details, and convey the main new ideas your work added. You can mention that there were difficult technical details, but do not describe them!

So 2-3 main ideas equates to 2-3 papers. Perhaps you have a series of papers that layer on interesting ideas towards a single goal. In this case, you can possibly convey more, but do not try to cram more at the expense of not making clear what the main conceptual contribution is.

It is also helpful to have the talk tell a story, to have a single narrative. But if you have two pieces of work that are both your strongest and easiest to express the key contributions of, then give the talk in two parts: it is better to convey the stronger work than to tell a cleaner but weaker story.

If you are a theoretician, give at most 1 proof. (With the possible exception that if you are being hired into a strong theory group and they alone make the hiring decision, then, maybe, give 2 proofs.) Most people will get lost in the details. I gave one proof sketch, but "intuitively explained" why several things were true, usually each in one or two sentences. I am guessing that for non-theory people there is an equivalence between the number of "proofs" and the number of some other kind of technical descriptions. The point is, it might be good to spend 3 minutes at some point showing off your technical prowess, just to convince the audience you are an expert in an area - but there can be better ways of doing this than diving so far into the technical details that you somewhat intentionally lose most of the audience.

You should aim to teach the audience something. That is, go into a bit more depth on some general technique in your field that you feel is not well understood, and that would serve the general audience to know. This does not need to be explicitly your work (and often will not be), but you may have extended it. If you did extend it, the audience will appreciate your extension much more if they understand the importance of the general version. In this "teaching segment" the goal should be to allow the audience to understand how to operate some heavy machinery (at least in principle). Spend time developing simple examples that clearly demonstrate how it works and its power.
This segment is also important in demonstrating your teaching abilities. If someone comes away from your talk feeling they learned something, even if it was not your specific contribution, then they will have a positive impression. I know when I spend an hour in a talk and don't come away with this impression, I am very disappointed in the speaker.



How should you structure your talk?
Spend the first 10 minutes motivating the entire area you are working in, and then the set of key problems in the area. Start at a very high level, and use well-planned examples to zero in on what are the major hard problems in the area, and why they are the critical remaining component in advancing the sub-area. This should allow you to outline the rest of the talk by just saying which of these problems you solved (and then spending 40 minutes explaining how).
This first 10 minutes of motivation can be modified if you give job talks at several places, without changing much of the forthcoming meat of the talk. This may be necessary to paint yourself in a few slightly different roles depending on what each university/lab is looking to hire. Minor changes in the setup can give the talk a very different flavor, perhaps focusing on different aspects that potential collaborators at each institution could appreciate as fodder for possible proposals they could write with you.

Then spend about 15-20 minutes each covering your contribution to 2-3 core problems. Probably don't spend more than this on any one topic, since if you lose some audience members, you can re-captivate them for the next section. With any less than this, you probably will not be emphasizing the contribution enough, and if you don't feel it deserves that much, then your talk might be cleaner if you did not spend the time mentioning it.

Within each of these sections, explain the core problem in more detail, explain what other approaches are available and why they do not work. This should lead naturally into what you did. Do not just show results saying that your approach was better; make sure to explain the key idea of why your approach was better. What made your approach succeed where no one had before? If an audience member cannot immediately answer that question, then they may well come away with the impression that you did not do anything interesting, other than tightening the existing bolts. You can conclude each section by mentioning extensions to this work, either yours or others'. These extensions are especially useful if they demonstrate how your key idea was integral in advancing the field.

Finally, there is likely other work you have done that is not one of the 2-3 key contributions. You probably want to talk about it more than you actually should :). I dealt with this in two ways. Given that I had described a set of important problems in the first 10 minutes, with several sub-problems, after the couple of sub-problems I described my contributions for, I listed on a single page all of the other work I had done in this related sub-area. This allowed me to mention it and also place it in the context of the field I was overviewing.
The other way was just to have 1 or 2 slides at the very end of the talk listing other work I had done. It's good to show you are broad, and that you have completed a lot of work. But it's also easy to see this from glancing at your CV. So in these 1-2 slides attempt to cast each other project you did not talk about in relative comparison to the work you did talk about. If you mentioned 2 projects, and you had 4 other ones of similar proportion, try to convey this. This is more informative than CV paper-counting.

Conclude your talk with your vision for the next 5-10 years worth of research. I felt it best not to focus on specific problems, but on how the work I had done would have an impact down the road. Anyone can do work in a certain area or solve specific problems, but if you convince the audience that your work will be important for a lot of important future work, then not only did you have the foresight to produce that work, but you are well-positioned to continue to contribute in the field.



Finally, practice, practice, practice!!! I think I practiced my talk at least 50 times. That's 50 hours of just speaking. When I mentioned 58 minutes to speak earlier, I meant more-or-less exactly 58 minutes (plus or minus 15 seconds). If you practice enough, you will know exactly how long the talk takes as a whole, and how long each sub-section takes. If you get a lot of questions in some early part and lose 3 minutes, you will know where you can make it up in a later part.

Also, practicing it will help you realize what is explained well and what is not. If you repeatedly find yourself explaining a slide and wishing you had a figure to help explain that, then add the figure. If there is another slide that has a figure that you don't really need, then remove it.

I only had the opportunity to practice in front of an audience (other than at actual interviews) twice. So, I spent a lot of time sitting in a room by myself talking to myself :).

So, the key take-aways are: (1) motivate the field to convince the audience your problems are important, (2) make sure you convey 2 or 3 main conceptual contributions, focusing on key new ideas, (3) teach the audience something, (4) demonstrate vision for the future, and (5) if you practice enough and keep these in mind, you will definitely give an excellent talk.

Wednesday, June 15, 2011

Applying for Jobs: On-Site Interviews

This is the fourth post in a series on applying for faculty positions. It is written by Jeff Phillips, frequent guest blogger, who will be starting as an Assistant Professor at the University of Utah in Fall 2011.



Today's post is about what to do once you actually get a job interview. I'll try to cover everything except the job talk, which will come in the final post.

You will usually be assigned a host. Talk to this person and try to get a feel for what the department is looking for. Perhaps the AI person is retiring and the department is happy to hire an Algorithms person as long as you can also teach AI. Or you would be the only theory person in the department and some of the other faculty are nervous about this, so the more you can talk about real-world applications the better. These are very important things to find out.

And if you know other people in the department, by all means ask them as well. If you get hired, you will need an advocate to make your case. Contact whoever you think might be an advocate for you. Don't be shy!

Try to figure out who the big shots are in the department. These people may be more senior or outspoken, bring in more money, and most importantly can potentially have way more influence than other people. If you can connect with these people, all the better. Sometimes hiring is done entirely within an area, so then the big shot is relative to the area.

Starting a week to a few days before the interview, try to get a schedule of your visit, even if it is only a draft. Research everyone on your schedule; know about their research areas and what classes they teach. All of this can usually be found on their webpages. Also try to read at least one of their recent papers, choosing the one with the best chance of intersecting with your research. And prepare several questions or topics for potential discussion.

Several people may be on your schedule who do research that is not really similar to yours. These people may be on the hiring committee, and should not be ignored. Try to make a connection, even if it's not entirely about research - for example, if you went to the same school, or know people in common.

There is a good chance for last minute changes to your schedule, so research others in the department as well even if they are not on your schedule.

Also prepare answers to the same sort of questions as for phone interviews. If you suggest that you would teach a class, make sure you have many of those details sketched out, because you may get asked specific questions about what you will cover.



On the actual visit I always carried a print-out of all of my notes, but never got a chance to look at it. So you'll have to memorize most of the information on there.

It is important to try to make positive connections with as many faculty as you can. Be friendly, outgoing and smile. Make eye contact. Be very excited about your research. And obviously, don't say anything sexist or racist. Seriously, save the dirty jokes for later.




I've heard you should follow up with a short email to everyone you met, but I did not do this. I only sent something if it was meaningful, like a specific follow-up to a problem we discussed, or a reply to a specific question. Usually about 3-5 emails per visit.


Monday, June 13, 2011

Clustering with outliers

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

(ed. note: I'm happy to welcome Sergei Vassilvitskii (of k-means++ and mapreduce-algos fame, among other things) to this series. Sergei has great insights on clustering, and we decided to join forces as we continue this series. Here's his first post, on clustering with outliers)

When abstracting the clustering problem, we often assume that the data is perfectly clusterable and so we only need to find the right clusters. But what if your data is not so perfect? Maybe there's background noise, or a few random points were added to your data by an adversary. Some clustering formulations, in particular k-center or k-means, are not stable -- the addition of a single point can dramatically change the optimal clustering. For the case of k-center, if a single point $x$ is added far away from all of the original data points, it will become its own center in the optimum solution, necessitating that the other points are only clustered with $k-1$ centers.

It is easy to appeal to intuition for the definition of outliers, but formally specifying when a point becomes an outlier is a non-trivial task. Consider for example a set of points generated by a single Gaussian with small variance. Now we pick one data point and slowly start moving it away from this set. We agree that originally that point is not an outlier; as we move it out to infinity, it certainly becomes one. But where exactly did it cross this threshold? 

There are two ways to define clustering with outliers. First, we can simply choose to ignore some fraction of the points. When there are no outliers, ignoring a small (say 1%) fraction of the points should not materially affect the final clustering results. When outliers are present, however, the algorithm has the option of ignoring them when computing the final metric. Consider again the example above; although we don't know when the point we are moving becomes an outlier, in some sense we don't care, because we are always looking at the optimum clustering without it.

A generalization of the above approach assigns a cost for not classifying a particular point and gives a budget for unclassified points. Thus, if we know ahead of time that some points should be classified and some can be ignored, one can guide the algorithm to focus on the points that matter. Of course assigning each point an identical weight leads exactly to the first formulation. 

Algorithms

The ability to ignore a fraction of the points seems like it would make the problem easier, since the algorithm now has explicit permission to 'sweep things under the rug.' It actually makes the problem harder, since the optimum solution can also do this. Put another way, the algorithm needs to not only find the right clustering of points, but also decide which points to cluster in the first place. 

There are two extreme approaches to the latter problem: one can first decide which points are outliers and then run the baseline clustering algorithm, or one can choose not to ignore any of the points and decide which ones are outliers only after a clustering is found. Unfortunately, neither of these works well -- the first scales poorly, as the number of choices for the outlier set grows exponentially; the second leads to poor solutions, since, as we saw before, the presence of outliers can materially change the final outcome.

The key to finding clustering algorithms that handle outliers is to determine which points will be ignored during the algorithm's execution, and not before or after. For an example, consider the $k$-center problem. Recall we are trying to find a set of centers $\mathcal{C}$ and a radius $r$ so that every point is within $r$ of some point in $\mathcal{C}$. There is a lower bound that says unless $\mathsf{P}=\mathsf{NP}$ no algorithm can find better than a 2-approximation on the radius. There are many ways to achieve a 2-approximation: we previously saw the furthest point algorithm of Gonzales: repeatedly choose the point that is furthest away from the current set $\mathcal{C}$ and add it as a center. The algorithm breaks down if there are outliers --- we will surely choose an outlier as a cluster center.
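As a reference point, here is a minimal sketch of the furthest-point traversal (the function and variable names are mine, and Euclidean points are assumed); note how a single distant outlier gets grabbed as a center almost immediately.

```python
import numpy as np

def gonzalez_k_center(points, k):
    """Furthest-point traversal: a 2-approximation for k-center
    when there are no outliers. `points` is an (n, d) array."""
    centers = [0]  # start from an arbitrary point, here index 0
    # dist[i] = distance from point i to its nearest chosen center
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # the point furthest from all current centers
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers, float(dist.max())  # chosen centers and the radius achieved
```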

Here is an alternative algorithm for $k$-center: 
  1. Guess the optimal radius $r$. 
  2. While there are uncovered points, choose one uncovered point $p$ as a center, and mark all points within distance $2r$ of $p$ as covered.
It is easy to see that in every iteration we mark as covered all points in the optimal cluster that $p$ belongs to. If after $k$ iterations there are still uncovered points remaining, it must be that the original guess of $r$ was incorrect.
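In code, the covering step might look something like this (a sketch assuming Euclidean points and a guessed radius $r$; the names are mine):

```python
import numpy as np

def cover_with_guess(points, k, r):
    """Given a guessed optimal radius r, try to cover every point with
    k balls of radius 2r. Returns the chosen centers, or None if the
    guess was too small."""
    uncovered = np.ones(len(points), dtype=bool)
    centers = []
    while uncovered.any():
        if len(centers) == k:
            return None  # more than k balls needed: the guess r was too small
        p = int(np.flatnonzero(uncovered)[0])  # pick any uncovered point
        centers.append(p)
        dists = np.linalg.norm(points - points[p], axis=1)
        uncovered &= dists > 2 * r  # everything within 2r of p is now covered
    return centers
```

Since the optimal radius is one of the $O(n^2)$ pairwise distances, one would typically binary search over those to find the smallest guess for which this succeeds.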

To adapt this algorithm to the setting of outliers, we make two changes. First, we increase the slack on the radius from a factor of 2 to a factor of 3 (this turns out to be necessary: unless $\mathsf{P}=\mathsf{NP}$ no algorithm can find better than a 3-approximation to the radius). Second, instead of selecting a point arbitrarily, we select the point that has the maximum number of uncovered points within radius $r$ of it.


Thus the final algorithm is as follows: 
  1. Guess the optimal radius $r$. 
  2. Repeat $k$ times: select the point $p$ that has the largest number of uncovered points within distance $r$. Mark as covered all points within distance $3r$ of $p$. 
The points that are left over in the end are precisely the outliers: they are both far away from any individual cluster center, and are not well clustered themselves (otherwise one of them would have been chosen as a potential cluster center). Charikar et al. show that this algorithm attains a 3-approximation: if the optimum can cover all but $\ell$ points with radius $r$, then the algorithm above will cover all but $\ell$ points with radius $3r$.
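Here is a sketch of how the greedy selection might be implemented (again assuming Euclidean points and a guessed radius $r$; this is my own illustration of the greedy step, not the authors' code):

```python
import numpy as np

def k_center_with_outliers(points, k, r):
    """For each of k rounds, pick the point whose radius-r ball contains
    the most uncovered points, then mark everything within 3r of it as
    covered. Whatever remains uncovered is declared an outlier."""
    n = len(points)
    # pairwise distance matrix
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    uncovered = np.ones(n, dtype=bool)
    centers = []
    for _ in range(k):
        if not uncovered.any():
            break
        # score[p] = number of still-uncovered points within distance r of p
        score = ((D <= r) & uncovered[None, :]).sum(axis=1)
        p = int(np.argmax(score))
        centers.append(p)
        uncovered &= D[p] > 3 * r  # cover everything within 3r of p
    return centers, np.flatnonzero(uncovered)  # centers and outlier indices
```

If the number of leftover points exceeds the outlier budget $\ell$, the guess $r$ was too small.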

Saturday, June 11, 2011

Clustering as compression

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
Numquam ponenda est pluralitas sine necessitate.
    -- Plurality must never be posited without necessity -- William of Ockham.

Clustering as compression generalizes mixture model clustering, as well as distance based methods. The central idea in this view of clustering is Occam's razor. Can you express the data as concisely as possible in a lossy manner ?

The devil in the details here is the definition of "concisely". There are two different notions of bounded space that are common, but they are actually not that different.

The information-theoretic approach.

The information theoretic approach dates back to Shannon. Consider sending information across a pipe.  How big on average does the pipe need to be in order to transmit the information ? One of Shannon's celebrated results is that this can be quantified by the mutual information of the channel, or the amount of shared information contained between the input and output. 

Thinking of mutual information as a measure of space is quite easy once you get used to it. It's helpful to consider the boundary cases. If the channel merely outputs a random flip of the input, then the output has no information about the input and the mutual information is zero, corresponding to the idea that you don't need to transmit much information through the channel.

On the other hand, if the output is a deterministic function of the input, then the mutual information of the channel equals the entropy of the source, corresponding to the idea that the channel needs to send over a complete description of the input to compute the output correctly.

What does this have to do with clustering ?  Think of the channel as the process by which the input points are mapped to clusters. In particular, the channel can be encoded by the conditional probability $p(t \mid x)$ of assigning a point $x$ to cluster $t$. Once we have that, the mutual information $I(T;X)$ between the set of clusters $T$ and the set of points $X$ measures how concisely $T$ represents $X$.

There's an intuitive explanation for this as well. Suppose that points were assigned to exactly one cluster each. Then $I(T;X)$ equals the entropy of $T$, which is just the average number of bits needed to write down a description of $T$.
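Spelled out (these are just the standard definitions, nothing specific to clustering):
$$ I(T;X) = \sum_{x,t} p(x)\, p(t \mid x) \log \frac{p(t \mid x)}{p(t)}, \qquad p(t) = \sum_x p(x)\, p(t \mid x), $$
and when each $x$ is assigned to exactly one cluster, $H(T \mid X) = 0$, so $I(T;X) = H(T) - H(T \mid X) = H(T)$.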

But what about the quality of the embedding ? Given some distance function $d$ on the input and output domains, you can compute the expected error as $\sum_{x,t} p(t \mid x)\, p(x)\, d(x,t)$. The study of how this varies with $I(T;X)$ yields what's known in information theory as rate-distortion theory, and for an appropriate choice of the distance measure $d$ leads to the well-known Information Bottleneck Method.
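In symbols, writing $D$ for the expected error above, the tradeoff being studied is the rate-distortion function (standard notation; the particular choice of $d$ below is the one usually quoted as recovering the information bottleneck, so treat it as a pointer rather than a derivation):
$$ R(D) = \min_{p(t \mid x)\ :\ \sum_{x,t} p(x)\, p(t \mid x)\, d(x,t) \le D} I(T;X), $$
with $d(x,t) = \mathrm{KL}\big(p(y \mid x)\,\|\,p(y \mid t)\big)$ for a "relevance" variable $y$ giving the information bottleneck objective.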

Kolmogorov complexity.

The second approach takes the Kolmogorov view: that compactness of description can be expressed as the size of the smallest program expressing the data.  In many ways, the Kolmogorov complexity of a string serves as a point-wise proxy for entropy, and the equivalent of mutual information can be defined as well (for more on this, see the note by Grunwald and Vitanyi).

While the rate-distortion framework can be extended to the Kolmogorov setting, it's a lot more technical, and is hard to explain in a short note such as this. An easier application of the Kolmogorov method is due to Cilibrasi and Vitanyi, who showed how to do clustering by defining a distance metric based on compression (roughly, as the complement of a normalized mutual information), and then doing hierarchical clustering.
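As a toy illustration, here is roughly what that looks like with an off-the-shelf compressor standing in for the (uncomputable) ideal one; the corpus is made up, zlib is only a crude proxy, and scipy is assumed to be available:

```python
import zlib
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def clen(b):
    # compressed length: a crude, computable stand-in for Kolmogorov complexity
    return len(zlib.compress(b, 9))

def ncd(x, y):
    # normalized compression distance (Cilibrasi and Vitanyi)
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# a made-up corpus: two "similar" strings and one unrelated one
docs = [b"abcabcabc" * 50, b"abcabcabd" * 50, b"the quick brown fox " * 25]

n = len(docs)
D = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    D[i, j] = D[j, i] = ncd(docs[i], docs[j])

# hierarchical clustering on the pairwise NCD matrix
Z = linkage(squareform(D), method="average")
print(Z)
```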

In summary, the main value of the compression-based viewpoint is the idea of a space-quality tradeoff, as opposed to a more artificial k vs quality tradeoff.  What's nice about the information-theoretic approach is that the units for space and quality are the same, and so you don't need ugly balancing constants to compare them.

p.s. There are those who might argue that the information-theoretic view of clustering is merely a kind of mixture density estimation. In a sense, this is true: specifically, the information bottleneck method can be viewed as estimating a mixture of multinomials if we reinterpret the Kullback-Leibler divergence as the associated Bregman divergence for the multinomial family. This in turn connects to methods like LDA and probabilistic PCA. However, I'd argue that the viewpoint of compression is quite different, even though the end-result looks similar, and it's the viewpoint that's valuable here.

Tuesday, June 07, 2011

Applying for Jobs : Phone Interviews

This is the third post in a series on applying for faculty positions. It is written by Jeff Phillips, frequent guest blogger, who will be starting as an Assistant Professor at the University of Utah (yay!) starting Fall 2011.



One part of faculty interviews I had not heard or thought much about was the phone interview. I'm not sure this was my strength in the interview process, since more of the places where I had phone interviews did not follow up with on-site interviews than did. So if you have had different experiences, please post in the comments.

Some places do not do phone interviews at all: some of my on-site interviews were not preceded by one. From my perspective, the places that did have phone interviews were places that may have wanted to confirm I was actually interested in coming. There were always several questions about why I was interested in that particular location/university.

Anyways, I'll try and provide some advice for performing well on phone interviews, even if I did not necessarily follow all of my advice.

First, prepare for these as if they were an on-site interview. Practice answering questions with someone over the phone (or in a room without eye contact). You need to keep answers relatively short and to the point. It's harder to listen to a long answer over the phone. And it's harder to pick up hints that you are giving too long or too short an answer. It's probably more important here than in an on-site interview to be well prepared for questions, since you can't, say, move to a whiteboard to explain something in more detail with a picture.

Try to figure out ahead of time who will be on the other side of the interview. I've had one-on-one interviews as well as group interviews where the entire hiring committee was on the other side. Usually in that case one person asked most of the questions, but others would add follow-ups, and it could be a bit disorienting. When it was just one person, I usually tried to have their webpage up on my computer so I had a picture of them.

And research who your interviewers are, what their research areas are, and what classes they teach. If they bring up a topic, make sure you don't disrespect their research area, and try to relate to it positively if it's relevant. If they realize you put in this effort, it will definitely help show you are serious about that university, which I believe is a key reason they are calling you in the first place.

Most importantly, make a list of potential questions and prepare and practice answers for them. I tried writing down full answers, but this did not work so well in the interview, since at some point I really felt like I was just reading answers. I'd suggest writing them down just to organize your ideas, but not necessarily having them in front of you during the interview. Bullet points might be OK.

Here is a list of questions similar to those I was asked:
- why do you want to go to UofX?
- why is the location of UofX attractive?
- who will you collaborate with? (try to answer with actual names in the department)
- how will you fit into the department?
- what will your first proposal be about? (here they want to get a feeling that you have thought about proposals; name specific funding agencies and calls if you can. Don't be shy about mentioning any experience you have writing proposals, even if you are not asked.)
- what will you teach? (try not to duplicate existing classes, especially special topics ones)
- what are your research areas? / describe your research. (You may want to choose a topic that is easy to describe over the phone.)
- Do you have any questions? (This one always gets asked. Sometimes you get the option to ask this at the beginning or to wait until the end. I recommend waiting, since it lets the interview get underway and you can get a sense of your interviewers before you ask.)

For faculty interviews, you probably won't get any technical questions (although you might - I did for research lab interviews).

And finally, you will probably have no idea how well it went after it is over. Don't worry, this is normal, if quite frustrating. And you may not hear back for a couple of weeks or more, so hang in there.

Applying for Jobs : Sending out Applications

This is the second post in a series on applying for faculty positions. It is written by Jeff Phillips, frequent guest blogger, who will be starting as an Assistant Professor at the University of Utah (yay!) starting Fall 2011.



I found actually applying for faculty jobs takes a lot of time. Maybe I spent too much time and others were successful with a less time-consuming approach; if so please add your perspective in the comments.

The first step is to find which jobs to apply for. I basically relied completely on the CRA jobs website to find out about job openings. Often I found that job listings appeared there within a day or so of becoming available on the school's website.
Comically, CRA also posts the same ads in a printed newsletter which I get every couple of months. The print edition appears several months later, often after the deadline has passed.

Some jobs will not appear on the CRA website. Some top schools don't need to advertise, so for those you need to either look for an (often ambiguous) statement about faculty recruiting on the school's website, or better, ask someone you know at that school (or have your advisor ask).

In fact, it is very important to have an advocate within any school where you hope to get hired. So, as you see friends or colleagues at conferences, ask them "are you hiring this year or next?" For one, it is often a great conversation starter, and two, you can start planting the thought in their heads that you would be an excellent person to hire. I've heard that the head of a hiring committee can craft the description of the job posting to aim at a particular type of candidate, or even a single particular candidate. Now is the time to approach your colleagues at universities you might want to apply to and subtly try to get them to build an opening designed for you!



OK, now that you have found (or even crafted) the opening you are applying for, what is left to do? You already have your research statement, your teaching statement, and your letters lined up, all of which you have been working on, on and off, all summer.

The most customizable item is the cover letter. I've been given advice that you don't even need to write a cover letter unless it is required (presumably from people who, when on the committee, never even looked at them). While this may often be true, I felt it was worth trying to make a point of why this university is right for you.

Some middle-tier universities get many applicants with somewhat comparable resumes. How can you associate yourself with the university so they think that if they made you an offer you would actually come? Have you lived nearby and liked the area? Do you have family nearby? Is there a particular program that the university is strong in that would be a great fit for you?

Also, what I spent the most time on was describing whom I might collaborate with within the department. Don't spend too much time on people outside the department, since they usually have no bearing on the hiring decision. But if you have a lot of interests in common with someone on the hiring committee, definitely point those out. This usually took me 45-60 minutes per application, because not only did I want to find someone, I also did not want to leave anyone out. Perhaps I could have skipped this, but I felt it made a difference.

One thing I did not do much of, but could have done, is personalize the type of classes I would teach as described in my teaching statement. If someone, especially someone with a background similar to yours, already teaches a class very similar to a "new" one you propose, then many people outside that area would interpret that as you duplicating that person, and they would rather hire someone who could provide more breadth to the department. What I should have done is try to identify which types of classes were missing at the university and adapt my teaching statement accordingly.

I've also been given the advice to prepare descriptions of a few classes I would develop, and then choose which one to include based on the department's needs.


Finally, the most important advice is to not rush the application process. I tried to do 1-2 applications every night for about 3 weeks (I submitted a lot of applications). There were two reasons for this: first, each application took me about an hour (some customization, but sometimes just tedious forms); second, if you do too many in a row, you get burnt out and start cutting corners.
Treat every application as if it is your top choice. Every time I applied, I convinced myself that this was a great place for me, and got into that mindset. For a few places where I just could not convince myself of that, I ended up not applying. I figured it wasn't worth anyone's time for me to submit a half-hearted application.




One final note. Although I intended this series to be mainly about applying for faculty jobs, most of the advice carries over to research labs. I knew I wanted to apply for faculty positions, but I also applied to some research lab positions, and as it turned out, I almost went to one instead of accepting the Utah position. Each lab is somewhat different, and you might be pleasantly surprised if you get a chance to visit.

Most labs have some sort of hiring every year. But you generally need to know someone (or your advisor/mentor does) to get your foot in the door. Internships are a great way to meet these people and get the proper perspective. So again, ask colleagues at labs about hiring, and go through them to apply.
