
Friday, August 29, 2008

A Survey on Hash-Based Packet-Processing Algorithms

Sometime way back, Graham Cormode invited me to a DIMACS workshop on Algorithms for Next Generation Networks, where I talked the usual talk about hash tables, Bloom filters, and things. The organizers later had the wherewithal to undertake putting together a book of collected chapters on the topic, and asked us (us being, of course, me and my student Adam Kirsch) for a chapter related to the talk. Adam was just finishing his thesis, and was willing to take on modifying his thesis to form the basis for a survey chapter. I wrote up a section or two, and for good measure, we got George Varghese, who is the world expert -- and author of the book Network Algorithmics -- to join in as well. (In particular, we really wanted George's much more practical perspective on what works and what doesn't!) The result here is a hopefully pleasant light read on current goings-on in the area of hash-based network algorithms, attempting to blend and show the connections between theory and practice.

There's still time for feedback for the final version; please mail me if you have any constructive suggestions.

Friday, August 15, 2008

SIGCOMM 2008, Part 1

About a year ago, I took a look at some SIGCOMM 2007 papers. I won't be attending SIGCOMM this week, unfortunately, so in the interest of self-education, I thought I'd look at some of the papers this year. (The site currently has the papers up. Wow, what a neat idea...)

Before getting into papers, I thought I'd mention that Don Towsley is being given the ACM SIGCOMM award. This is a great choice, and well deserved. And relevant to this site's audience, Don is, in my mind, primarily a theorist. Not a FOCS/STOC theorist to be sure, but a theorist nonetheless. As the award announcement states:
Towsley, who is Distinguished Professor of Computer Science, has made innovative and pioneering contributions in developing foundational modeling and analysis techniques that have enabled a better understanding of some of the most important aspects of today's computer networks, network protocols and networked applications.
Modeling, analysis, understanding... that's what theory is all about. It's people like Don that made networking an open and understanding place for people like me. Thanks! And hooray for Don!

Now for papers. As before, I'll give brief synopses (at the level of the posted abstracts :) ), as I'm just looking at these papers on the fly. The network coding crowd has attacked again with the MIXIT system, which seems to throw together a bunch of ideas in a clever fashion to improve performance on wireless mesh networks. Recall that the basic working definition of network coding is that intermediate nodes do more than store and forward; they can process the packets as they come through (creating encoded packet variations). Here, the basic unit is not taken to be a packet, but a symbol (a small collection of bits), with symbols being packed into a packet. This lets nodes "take apart" packets; if a whole packet doesn't come in error-free, the node can take the symbols that appear to be right with high enough probability (based on information from the physical layer), re-package them (via linear combinations, a la "standard" network coding), and send on only those symbols. Because erroneous symbols might get through, an end-to-end error-correcting rateless code is also used. All of this appears to improve throughput.
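To make the "linear combinations of symbols" idea concrete: here's a minimal sketch of generic random linear coding over GF(2), where combining symbols is just XOR. This is the textbook network-coding operation, not MIXIT itself; the symbol size and the "trusted symbols" input are placeholders I made up for illustration.

    import random

    SYMBOL_BYTES = 4  # made-up symbol size, just for illustration

    def xor_symbols(a, b):
        """XOR two equal-length byte strings (addition over GF(2))."""
        return bytes(x ^ y for x, y in zip(a, b))

    def encode_symbol(trusted_symbols):
        """Form one coded symbol as a random GF(2) linear combination of the
        symbols the node believes are correct; the coefficient vector would
        travel in the packet header so the receiver can decode."""
        coeffs = [random.randint(0, 1) for _ in trusted_symbols]
        if not any(coeffs):  # avoid the useless all-zero combination
            coeffs[random.randrange(len(coeffs))] = 1
        combo = bytes(SYMBOL_BYTES)
        for c, s in zip(coeffs, trusted_symbols):
            if c:
                combo = xor_symbols(combo, s)
        return coeffs, combo

    # Example: a node received 4 symbols but only trusts 3 of them.
    received = [bytes([i] * SYMBOL_BYTES) for i in range(4)]
    trusted = [received[i] for i in (0, 1, 3)]  # symbol 2 looked unreliable
    print(encode_symbol(trusted))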

The paper seems interesting -- another proof-of-concept paper for network coding in wireless systems, which is where I suspect network coding will be able to make the most inroads over the next few years. I can't tell yet how practical this really seems (without a more detailed reading), but the idea of taking apart packets and sending only the good pieces in combination with multiple coding techniques seems quite nice.

As an aside, the pdf for this paper seems to contain a picture or something that crashes my poor Mac around the 8th or 9th page. Help!

Wednesday, August 06, 2008

On Simulations

I've been coding up some simulations for some Allerton papers that are due all too soon. Of late I've depended far too much on my (now former) student Adam Kirsch to take care of doing our simulations, but he's graduated, and they're needed, so off I go. (Adam's graduating is all to the good, but clearly, I'll be missing him, especially when it comes time to write simulations.)

I'm always amazed at how averse theory people seem to be to doing simulations. I find them useful for generating ideas and thinking about problems in the early stages -- cutting off wrong directions and giving insight into the right ones. If you don't like doing simulations for such purposes, because it doesn't work for you, or you're clever enough to not need data, I have no issue with that -- people work differently.

But I also use simulations as a way of checking my work. If I have a theorem that says that a random process will behave a certain way, and it's possible to code a simulation of the process, I'll check my theorem with code. If the theory and the code don't match up, my assumption is that something is wrong somewhere, and the result is not ready until the two match or I know why they don't. Surprisingly, I think it's about 50-50 as to which I end up finding is wrong, the code or the theorem. (By the same token, if I don't have a theorem, and there's more than one way to simulate a process, I'll code multiple simulations, and make sure they match!)
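To give a toy example of the kind of check I mean (my example, not tied to any particular paper): the expected fraction of empty bins when n balls are thrown uniformly at random into n bins tends to 1/e, and a few lines of code will catch a botched constant immediately.

    import math
    import random

    def empty_bin_fraction(n, trials=200):
        """Throw n balls into n bins uniformly at random; return the average
        fraction of bins left empty over the given number of trials."""
        total = 0.0
        for _ in range(trials):
            bins = [0] * n
            for _ in range(n):
                bins[random.randrange(n)] += 1
            total += sum(1 for b in bins if b == 0) / n
        return total / trials

    n = 10000
    print("simulation :", empty_bin_fraction(n))
    print("theory 1/e :", 1 / math.e)
    # If these disagree past the second decimal place, something is wrong --
    # with the code or with the claimed theorem.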

Of course not all results can be checked by coding something up -- but many can. Particularly in the study of random processes, which is my area. And it's clear to me that many researchers don't check by coding -- because I (or students working with me) have several times found mistakes by doing a simple implementation and finding that we get different numbers out than the paper gives. Generally the mistakes aren't "fatal" -- usually a constant is off somewhere, and often eventually the O notation will take care of it -- but of course it is grating as a reader when something in a paper is plain wrong and you're left to figure out why. When someone doesn't do a check-by-code, I must admit, it lowers my trust in, and my overall opinion of, the person's work. Sure, people make mistakes (myself included) -- but if you're ignoring a straightforward approach for checking your work, that doesn't inspire confidence.

I imagine some theory people are so out of practice coding they "can't" do a simulation. (But hey, that's not really an excuse, that's what grad students are for...) And others probably just consider it a waste of time. If you really are good enough not to need to check your work this way, more power to you. Me, I'll get back to the (admitted drudgery of) coding my things up and seeing if they work the way I think they should....

Monday, June 23, 2008

Yet Another Survey... Deletion Codes

Next week I'll be at SWAT for an invited talk. When I gave them a choice of topics, they went with: A Survey of Results for Deletion Channels and Related Synchronization Channels. I've actually given this talk a few other places, but there's been some new stuff, so it's been updated a bit. (I'll put the latest version of the slides up on my talks page after the talk...)

The invitation gave me an excuse to buckle down and finish something I started over a year ago -- a survey article on results for the capacity of the deletion channel (and related problems) to go with the talk. I must be out of practice writing, because it's taken me most of the month to get this thing done. (Well, OK, there was that whole "baby incident" that took up some time, but still, I've clearly been doing far too much administration this last year, and I need to shake off the rust.) It's probably still a bit rough so anyone interested enough to read it, please feel free to mail corrections/suggestions directly.

Apparently, based on some mild teasing from a colleague and friend, I'm known as someone who writes survey articles. (The teasing was that perhaps I'm known only as someone who writes survey articles...) I don't know why I end up writing them -- I can't really say it's a fun exercise to write a survey, though it is often educational -- but somehow I feel it's valuable. In this case, there's definitely the motivation that I've done a lot of work in this area, and I want to promote both the area and my work. The survey has a couple of dozen open problems spelled out right there, so let that be motivation for some of you to read it...

Also, I'm open to suggestions on where to publish it. My last few surveys have nicely been able to go to Internet Mathematics (see here for Bloom filters and here for power laws/lognormal distributions -- although the "uncensored" version of the power law survey from my publications page is more fun) so that they've had a worthwhile permanent home. This survey won't quite fit there, and I think it's good for such things to have a visible outlet (besides, say, arxiv). Ostensibly Transactions on Information Theory will take surveys, although I haven't seen that much in practice; and this survey doesn't really fit in the Foundations and Trends model. (I like the Foundations and Trends model, but it seems to me their surveys are like "mini-textbooks" -- and cost accordingly -- while my surveys are more like "here's some background, big ideas, and pointers to papers" -- and I'd like them to be essentially free.)

Friday, June 13, 2008

An Improved Analysis for "Why Simple Hash Functions Work"

I recently had a paper with Salil Vadhan (in SODA 2008) explaining why simple hash functions "work" for so many stream-related problems where data is hashed to be placed in a hash table or a Bloom filter or a similar structure. Here "work" means "behave essentially the way you would predict if the hash function were perfectly random." The basic reason, we suggested, was that if the data in the stream had enough randomness, it would combine with the randomness from the choice of hash function to produce values that looked independent and uniform.

Our analysis of how much randomness the data needed seemed a little loose. Salil and his student, Kai-Min Chung, have recently improved the analysis; here's the arxiv link, and the paper will be appearing in RANDOM '08. There are some really nice and somewhat subtle arguments that I imagine will find uses in other areas. While it "only" improves the constant factor (in terms of the number of bits of randomness needed), this is certainly a case where constant factors are important.

The result further reinforces my belief that this is the right high-order explanation for why, whenever anyone actually does an experiment involving hashing, you get the right results no matter what hash function you use.

Friday, May 30, 2008

More Robust Cuckoo Hashing (ESA paper)

I have a paper at ESA, with Adam Kirsch and Udi Wieder, entitled "More Robust Hashing: Cuckoo Hashing with a Stash." (Here's an extended preprint version.) The starting point of the paper is that cuckoo hashing is wonderful, and everybody should know about it and use it. (If you don't know about it or use it, see here or here. But basically, cuckoo hashing is multiple choice hashing with the added bonus of getting to move items around among their multiple choices as needed to balance load on-line.)

So, of course, we want to make cuckoo hashing even better. And one of the problems of cuckoo hashing is that it "fails" with non-trivial probability. For example, in standard 2-choice cuckoo hashing, putting n items into 2(1+eps)n cells with at most 1 item per cell can be done with probability 1 - Theta(1/n). That gives a Theta(1/n) "failure" rate, which doesn't really matter for "average" behavior, because a failure can always be handled by re-hashing everything with a new hash function -- so that's the end of the story in most theory papers.

In practice, that kind of failure rate is a bit high. Re-hashing an entire hash table could just be too expensive an operation to have occur that frequently for many applications. Can we improve the failure rate somehow?

Let's look at where that Theta(1/n) failure rate comes from. When there are 2 choices per item, the simplest type of failure one might imagine is that 3 items try to use the same two cells -- that is, 3 items have the same 2 choices of location. Indeed, such a problem occurs with probability Theta(1/n) [exercise LTR]. But when such a problem occurs, we can imagine taking one of those 3 items and stashing it somewhere -- in a stash, as suggested by the paper title. The stash should be a little space on the side that we may have to check whenever we look for something. If we implement a stash, how does the size of the stash affect the failure rate?

If failures were essentially independent, so that each item "fails" independently with probability proportional to 1/n^2, then we'd expect f failed items with probability proportional to O(n^{-f}). This turns out to be the right intuition; our analysis shows that a stash sized for s items (for constant s) reduces the failure rate to O(n^{-s-1}). The analysis is a bit trickier, since of course item failures aren't independent, but the result seems natural.
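To make the idea concrete, here's a toy sketch of 2-choice cuckoo hashing with a stash: an insertion that bounces around too long gets parked in the stash, and lookups check both cells and the stash. The hash functions, table size, eviction limit, and stash size are all arbitrary choices for illustration -- this is the flavor of the scheme, not the exact construction analyzed in the paper.

    import random

    class CuckooWithStash:
        """Toy 2-choice cuckoo hash table with a small stash (illustration only)."""

        def __init__(self, num_cells, max_evictions=32, stash_size=4):
            self.cells = [None] * num_cells
            self.seeds = (random.random(), random.random())  # stand-ins for hash functions
            self.max_evictions = max_evictions
            self.stash = []
            self.stash_size = stash_size

        def _slots(self, key):
            return [hash((key, s)) % len(self.cells) for s in self.seeds]

        def insert(self, key):
            s0, s1 = self._slots(key)
            if self.cells[s0] is None:
                self.cells[s0] = key
                return True
            if self.cells[s1] is None:
                self.cells[s1] = key
                return True
            # Both cells full: place the key anyway, then keep bouncing the
            # evicted item to its other choice, up to max_evictions moves.
            slot = random.choice((s0, s1))
            for _ in range(self.max_evictions):
                key, self.cells[slot] = self.cells[slot], key
                a, b = self._slots(key)
                slot = b if slot == a else a
                if self.cells[slot] is None:
                    self.cells[slot] = key
                    return True
            # Still holding a homeless item: park it in the stash.
            if len(self.stash) < self.stash_size:
                self.stash.append(key)
                return True
            return False  # a real implementation would rehash at this point

        def lookup(self, key):
            return key in self.stash or any(self.cells[s] == key for s in self._slots(key))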

So a small, constant-sized stash -- easily implementable in software or hardware -- reduces the failure probability dramatically, allowing one to avoid rehashing in practice. What I found particularly interesting from the theory+practice perspective is the power in this setting of a constant-sized stash. It's a very natural addition for practitioners -- I don't think it would cause an engineer to blink -- but I think it really changes the potential for cuckoo hashing, making it a structure you could imagine employing on devices used by millions of customers, without having to worry about this otherwise far-too-common failure mode.

We show similar results for multiple cuckoo hashing variations. (The analysis, unfortunately, is different for all of the variations; it would be nice for someone to develop a cleaner, nicer, more unified analysis of all the varieties of cuckoo hashing.) The case of 2 choices is of non-trivial interest, however, since Udi, along with Moni Naor and Gil Segev, have recently shown that 2-choice cuckoo hashing can be used to develop a very efficient history-independent hash table (check out Udi's web page for the paper). The main problem with such a structure is, naturally, the non-trivial probability of rehashing, which can be avoided using a stash.

The stash idea also fits nicely with my previous paper with Adam on the Power of One Move for cuckoo hashing style schemes, although there the stashes (with suggested implementation as Content Addressable Memories in hardware) were larger than constant-sized. I'm beginning to think the idea of stashes (or CAMs) should be a standard section in practically oriented hashing related papers.

Wednesday, May 14, 2008

Writing a Final Exam

For me, one of the hardest challenges of teaching is making up homework assignments and final exams. For homework assignments, there's some assistance from the textbooks (and I re-use problems a lot), but final exams are very challenging: coming up with problems that test a student's learning, but under a 3-hour time limit. Making a final exam usually takes me several hours.

For my final exam, usually about 1/2 is True-False, Multiple-Choice, Give-an-Example-or-Counterexample, or Execute-the-Algorithm-on-This-Small-Example type problems. These are very different from the homework assignments, which are usually proof/computation/programming-oriented, and hence have "bigger" problems. The other 1/2 is more like the homework assignments -- proof-type problems -- but sufficiently easier (or with sufficient hints) that students can hopefully get through them quickly. Interestingly, although you might think the first kind of problem is easier, student averages on both halves are about 70%.

I've toyed with the idea of giving take-home finals (which I do sometimes in my graduate classes), but these days it's just too easy for students to cheat. I know I've had students anonymously mail questions around looking for people to answer them. And arguably it's useful to have the final exam test something different than the type of problem-solving that they do on the homework.

When I think of the time spent making a final, I know I'd be happier not to give one. And of course the students would be happier too. Not better off, I think, but happier. Perhaps there's a different solution....

Monday, May 12, 2008

A Bubblesearch (Semi-)Success Story

A colleague was backing up files to DVDs and wanted to know a good bin-packing heuristic. I pointed him to First-Fit-Decreasing and Best-Fit-Decreasing, nicely described in the Wikipedia entry on bin-packing, but suggested that as long as he was doing that he ought to try Bubblesearch as well. Recall that Bubblesearch just tries random perturbations of the greedy ordering, to try to find a better solution; the full description is in this paper.
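Roughly what I had in mind is sketched below -- First-Fit-Decreasing, then first fit on many random perturbations of the decreasing order, keeping the best packing found. The perturbation scheme here (skip each next-best item with some probability) is just one simple choice and the exact parameterization in the paper differs a bit; the file sizes at the bottom are made up.

    import random

    def first_fit(sizes, capacity):
        """Pack items in the given order by first fit; return the list of bins."""
        bins = []
        for s in sizes:
            for b in bins:
                if sum(b) + s <= capacity:
                    b.append(s)
                    break
            else:
                bins.append([s])
        return bins

    def bubblesearch(sizes, capacity, iterations=10000, p=0.3):
        """First-Fit-Decreasing, then repeated first fit on random perturbations
        of the decreasing order, keeping the packing with the fewest bins."""
        order = sorted(sizes, reverse=True)
        best = first_fit(order, capacity)
        for _ in range(iterations):
            remaining = list(order)
            perturbed = []
            while remaining:
                i = 0
                while i < len(remaining) - 1 and random.random() < p:
                    i += 1  # skip the next-best remaining item with probability p
                perturbed.append(remaining.pop(i))
            candidate = first_fit(perturbed, capacity)
            if len(candidate) < len(best):
                best = candidate
        return best

    # Made-up example: file sizes in GB packed onto 4.7 GB DVDs.
    files = [2.3, 1.1, 0.8, 0.7, 3.2, 1.9, 0.6, 2.8, 1.4, 0.9]
    print(len(bubblesearch(files, 4.7)), "DVDs")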

His reported results are that with FFD everything fit into 5 DVDs, which was the minimum necessary; but he did find that Bubblesearch improved the utilization of the first 4 DVDs from 94% to 99% (with about 10,000 random perturbations, and some small parameter tuning), which he thought was impressive.

Your mileage may vary, and I'm sure there are better specific heuristics for bin-packing. The point behind Bubblesearch is that it is a natural extension for a wide class of greedy heuristics, and moreover is really easy to understand and to implement. My completely self-interested and biased opinion is that it (or something like it) should be taught in all undergraduate algorithms classes....

Tuesday, April 22, 2008

Failure, and a Second Try

About two years ago, I went through a huge failure on a project. In the back of my mind, I always wanted to use this blog as a jumping off point for a second chance for this project. Now is the time. Perhaps some of you can help me fix this failure.

Back in early 2006, while I was on what's now known as the SIGACT Committee for the Advancement of Theoretical Computer Science, I was thinking about ways to promote theoretical computer science (TCS). One idea that struck me is that most people entering college from high school know nothing about TCS. If they've had exposure to computer science, it has been programming. Compared to the excitement of biology (we're figuring out the building blocks of life!) or physics (we're figuring out the building blocks of the universe!), computer science must seem to the average high school student to be, scientifically speaking, a wasteland.

Now, we know this isn't true. Those of us in TCS wouldn't be in it unless we thought there were fantastic, mind-bending, potentially world-changing problems in the field. And I think history is on our side; computer science, and in particular TCS, has changed the world, fundamentally, and continues to do so. I don't think, though, we do a good job of promotion, showing the world, and in particular high school students who we might hope to attract to our field, the excitement of TCS. Somehow, our field still attracts talent, but it's an uphill battle. If we could change that, I thought, we could change the perception of our field, attracting top students earlier, and long-term creating a greater appreciation of TCS in computer science and across the sciences.

So in 2006, I dreamed up a project. I wanted a book that could be handed to high-school sophomores, to give them an idea of what TCS is all about. I wanted a book I could give to friends' kids -- heck, to my own kids, when they got to the right age (and wanted to really know what Dad did with his time). I wanted a book that would be interesting, engaging, and cover the wide swathe of work we do.

So, naturally, I knew I couldn't write this myself.

Instead, I wanted a group-book, with experts writing chapters in their areas of expertise. I reached out to about 40 or so people in the TCS community, asking them to write a chapter. I hoped first drafts could be done over summer 2006. And that by the end of the year, we might have a book ready for a publisher. About 1/2 the people agreed, and I sent occasional e-mails reminding and encouraging people.

And nothing happened. I think lots of people started, but never finished, their chapters. I heard a lot about how it was difficult to write chapters for the target audience, high school sophomores. It seemed for everyone that other projects, other papers, other things took priority. Disheartened, and busy myself, I let it go.

I still thought the project was a good idea. I still think so. So now, over the next couple of weeks, I'm going to revive the project here on the blog. I hope to change the nature of the project a bit, so my initial management failure won't necessarily repeat itself. It's something I've wanted to do since I started the blog, but I've kept putting it off.

I think it's time to try again. More to follow.

Monday, April 21, 2008

Books from my Shelf, Part 3

Today, let's look at some more practical books from the shelf.

Network Algorithmics, George Varghese: Even before he wrote this book, George Varghese had essentially written the book on how to get good performance on your network devices (e.g., routers) by smartly using algorithms combined with systems thinking. George is my role model in this area, even more so now that I've worked with him on a few projects. If you want to see how network people are thinking about and using algorithms for packet classification, prefix-lookups, switching, scheduling, security, and so on, pick up this book.

Mining the Web, Soumen Chakrabarti: Sigh. Has it really been over ten years (no need to get more specific than that!) since Soumen and I were graduate students at Berkeley together? Soumen's one of those people who could have been a great theoretician, but was always lured by that "real world" thing, and instead became an expert on information retrieval and the web, which lets him still do theory now and again but with an eye toward problems of more immediate importance. A great book for everything you might want to know to start building your own search engine. Web crawling, information retrieval, clustering and classification, etc. And lots of algorithms and algorithmic insights.

TCP/IP Illustrated, Volumes 1 + 2, Stevens and Wright: At some point, I was looking at TCP for research purposes, and this was a great book for understanding the fine details. It may take a little looking, but the details are in there. Volume 1 also provides background on a lot of basics that I think are useful for networking research. Finally, I also think it would be useful for theorists to look over this book at some point to get some insight into real-world systems-thinking.

Wednesday, April 16, 2008

Taxes and Algorithms

I'll continue with my walk through my bookshelf shortly, but as tax day just came and went, a thought or two on taxes, with an algorithmic bent...

Ideally, the computation of taxes is a (relatively simple) algorithm. Ostensibly, there should be a set of numbers one gives as inputs, and then an output -- the amount of tax you owe or are owed -- is produced. This is the basis for tax software programs -- you plug in the numbers, and off you go.

In real life, things are much worse. The required inputs are, at best, vaguely defined -- leading to such fun questions as: when I make a charitable deduction by buying a ticket to a charity dinner, how much am I not allowed to deduct because of the value of the dinner? -- and subject to interpretation. In computer-speak, garbage in, garbage out. I'm amused by reports like the one where Money magazine sends the same information to 50 tax professionals and ends up with 50 different tax returns (and a large spread in what is owed), or reports that in tests of IRS call-in helplines the wrong answer is given over a third of the time. And when exceptions or odd cases occur, as they do often with the way the tax code is set up (AMT, anyone?), things rapidly become incomprehensible.

Naturally, I think there are lessons here. I actually find many tax forms are simple to understand, when I understand what the end goal and result is supposed to be. That is, it's much easier to understand an algorithm when one understands the goal of the algorithm, and there's some natural connection on how the steps of the algorithm are getting you to that goal. The problem with many tax forms is that they give the steps of the process without clarifying the goal, so I can't tell if my situation is a standard case or a special case that requires more time and attention. When theorists explain algorithms to outsiders, we ought to keep in mind that they might not understand special cases or detours in the argument that seem natural to us. We should keep the end goal clear throughout our explanation. And I admit I'd be thrilled if the US moved toward a tax system that was algorithmically speaking comprehensible (at least to me).

Another way of viewing this lesson is that there are (or should be) significant differences between algorithms designed for a computer to use, and algorithms designed for a human to use (with or without the help of computers). Humans, for instance, often like to understand the output produced, for their own satisfaction or so they can explain it to others. Computers don't care. In my past foray into human-guided search systems, this was one of the lessons I learned. In theory, we often think of an algorithm as a set of instructions transforming an input to an output. Sticking too closely to that worldview can be quite limiting; algorithms can be objects that humans -- or other agents -- interact with, leading to different design goals than we commonly consider. Even if you could just plug numbers into a computer and have your tax bill pop out, I think many people (myself included) would be unhappy, if we didn't have some understanding of how that final number was derived.

Tuesday, April 01, 2008

Jennifer Rexford's take on "Practical Theory"

Jennifer Rexford, who I've mentioned before is a networking person who appreciates theory, passed along a rumor that she'll be giving a talk at STOC on "networking questions relevant to the theoretical CS community." I don't see the program up yet, but I hope it's true. She's a great speaker, with excellent perspective. Kudos to those who thought to invite her!!!

She also pointed me to an editorial she wrote on her ten favorite "practical theory" papers. Naturally, I enjoyed reading her perspective. Particularly since her perspective included my name (flattery, indeed, will get you everywhere -- or at least on my blog).

I think, however, her editorial offers several lessons. Here are my top two. First, her notion of a theory paper is a little different from ours. Most of the papers are by networking people who think mathematically and theoretically (Varghese, Towsley) or are more attuned to what I think of as EE theory. This first lesson is a reaffirmation of my long-standing notion that the FOCS/STOC view of theory is, in many respects, rather limited, especially to our peers outside of theory. I think a second, related lesson goes to the importance of theory people getting out there and making their work more accessible to non-theorists. In my case, that has meant writing survey articles, giving talks for general audiences and not just theorists, and going to networking conferences to talk about my work. I'm sure there's plenty of other "mainstream theory" work that Jennifer and others from networking would greatly appreciate -- if only we as a community did a bit better job of letting them know about it.

Friday, March 21, 2008

CS 124 and xkcd

A few of my students in CS 124 (Algorithms and Data Structures) pointed me to this xkcd cartoon -- amused, no doubt, that we had covered the O(2^n n^2) algorithms in class about a week ago. And who says that you don't learn anything important in my class!
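For readers outside the class: the cartoon's O(2^n n^2) bound is, I assume, the classic Held-Karp dynamic program for the traveling salesman problem. A minimal version, with a made-up distance matrix:

    from itertools import combinations

    def held_karp(dist):
        """O(n^2 2^n) dynamic program for TSP.
        dist[i][j] is the cost of going from city i to city j; returns the
        cost of the cheapest tour that starts and ends at city 0."""
        n = len(dist)
        # best[(S, j)] = cheapest way to start at 0, visit exactly the set S
        # of non-zero cities, and end at city j in S.
        best = {(frozenset([j]), j): dist[0][j] for j in range(1, n)}
        for size in range(2, n):
            for subset in combinations(range(1, n), size):
                S = frozenset(subset)
                for j in subset:
                    best[(S, j)] = min(best[(S - {j}, k)] + dist[k][j]
                                       for k in subset if k != j)
        full = frozenset(range(1, n))
        return min(best[(full, j)] + dist[j][0] for j in range(1, n))

    # Tiny example with 4 cities.
    dist = [[0, 2, 9, 10],
            [1, 0, 6, 4],
            [15, 7, 0, 8],
            [6, 3, 12, 0]]
    print(held_karp(dist))  # 21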

David Eppstein was inspired to write a longer blog post on the algorithm, well worth reading (especially if you're in my class!).

Saturday, March 01, 2008

Another Rant from Conference Reviewing : Algorithms and Evaluation

After my first conference rant on competitive analysis (based on my current stints on the SPAA and ICALP committees), I feel it's only fair to spread the love and rant a bit on my fellow algorithmists.

I simply claim the following: if you are presenting an algorithm with the claim that it might be useful in practice, you should aim to include at least a short experimental section showing that you've implemented the algorithm and how it behaves.

Here's why:

1) If your suggestion is that it's potentially going to be useful in practice, and it's your algorithm, it's incumbent on you to provide some evidence for this statement. An implementation is the best evidence. I don't expect pages of simulation results examining corner cases (although, if there's space, that's certainly nice); but a couple of paragraphs explaining that you implemented it, tested it on basic data, and the program actually finished goes a long way.
2) It's not that I don't trust your math. It's just that -- well, no, it is just that I don't trust your math. Maybe you've proven that the algorithm is O(n^2). I'd like to know if in practice it seems to be O(n log n) [even if it's O(n^2) in the worst case -- now you've given an average-case or special-case open problem!]. I'd like to know if there's a hidden constant of 10,000 there that makes it really not practical in its current form. I'd like to see that you didn't make an error and that it doesn't look like Theta(n^3) when you run it. A quick timing check, like the sketch after this list, goes a long way. [I've got 46 papers to look over in a month. Make my job easier, and your paper's more likely to get in.]
3) Maybe, just maybe, someone who might actually want to implement your algorithm will actually read your paper. A non-theorist. Don't you think they want to see that it seems to actually work before implementing it better themselves? Won't some experiments make talking about your work to non-theorists easier? Help make that connection...
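Along the lines of point 2, the kind of check I have in mind is nothing fancy: time the code at a few increasing input sizes and eyeball the log-log slope. A throwaway sketch, with Python's built-in sort standing in for whatever algorithm is being measured:

    import math
    import random
    import time

    def measure(algorithm, sizes, trials=3):
        """Time `algorithm` on random inputs of each size and report the
        log-log slope between consecutive sizes (roughly the exponent c
        in an n^c fit)."""
        times = []
        for n in sizes:
            best = float("inf")
            for _ in range(trials):
                data = [random.random() for _ in range(n)]
                start = time.perf_counter()
                algorithm(data)
                best = min(best, time.perf_counter() - start)
            times.append(best)
        for (n1, t1), (n2, t2) in zip(zip(sizes, times), zip(sizes[1:], times[1:])):
            slope = math.log(t2 / t1) / math.log(n2 / n1)
            print(f"n={n1}->{n2}: apparent exponent {slope:.2f}")

    measure(sorted, [10**4, 10**5, 10**6])  # should hover a bit above 1 (n log n)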

I'm not saying every algorithms paper needs an implementation section. In many cases, "algorithmic" results are really "complexity" results -- we're just showing that something can be done, we don't actually expect anyone to do it, and in this case there's no need for a simulation or experimental results. (Of course, in such cases, I expect the authors to limit their claims of interest/utility in practice.) In some cases, space won't permit a reasonable evaluation of the algorithm -- but do plan on it for the journal version. In some cases, the coding and evaluation of the algorithm are so interesting it merits an entirely separate paper!

But I'm amazed at how few algorithms papers provide any actual experimental results (unless they're appearing in a networking/database/other systems conference, where it's more understood that results are also expected). I've actually submitted theory papers to theory conferences with experimental sections and had reviewers urge them to be taken out, which I find mystifying (and I ignore).

And yes, if I'm reviewing your paper, and you don't have any numbers where I think you should, I'm probably mentally (and numerically) docking your paper score....

Wednesday, February 27, 2008

Of Mice and Computers

In the last of my posts about New Zealand, I'll talk about Mike Langston's talks on computational biology. He talked a lot about the style of computational biology: the difficulty of getting data unless you form a real partnership with people (who view their data as very valuable), the noisiness of the underlying problems, the need to worry about entire computation systems with memory/compute power/specialized hardware rather than individual algorithms, the compromises one has to make between theory and making things actually work, and so on. As I listened, it dawned on me that there's another area of research where I hear about similar issues -- search engines and computational advertising. The similarities sound uncanny.

The similarities reached the level of specific problems. The technical aspects of the talk were about heuristics/methods for maximum clique and biclique (bipartite clique) on real data. I've certainly heard of biclique coming up in search engine papers. Langston claimed to have the fastest bipartite clique algorithm in practice (at least on the biological data he was interested in). I wonder if there's room for crossover between these two fields?

The talk left me with a feeling that I've had before when seeing computational biology talks. The field seems interesting, but it looks like to have a real impact, you really need a lot of devotion to learning the underlying biology and making connections with working biologists. It doesn't seem like a good field for dabbling. So far, I haven't found myself ready to take the plunge and try to get involved with the field. Luckily for me, I still have other interesting things to do.

Tuesday, February 26, 2008

Practical Graph Isomorphism

Continuing my efforts to prove I was not just sitting out by the pool in New Zealand, I'll mention a highlight of the talks of Brendan McKay (of the Australian National University). Brendan has written code for graph isomorphism called nauty that works quite well in practice, handling graphs with up to millions of nodes. So while the complexity of the graph isomorphism problem is not known, it can be pretty hard to come up with bad examples that are actually hard to solve. As someone who's happy with good heuristics, I found it quite amusing to watch it in action. (On the other hand, it seems that many of the ideas behind this program are "old" by computer science standards. Are there any competing programs/approaches out there? Seems like a possible research direction to revisit the algorithm engineering for this problem...)

While settling whether graph isomorphism is in P would certainly be interesting, Brendan pointed out that the question of whether it is in coNP is also quite important. Are there polynomial-sized certificates that can be used to prove that two graphs are not isomorphic? And one can imagine in practice one would like to have some methodology for easily convincing someone that two graphs aren't isomorphic.

As Brendan is a co-editor-in-chief of the Electronic Journal of Combinatorics, we also talked a bit at lunch about electronic journals (in particular the EJC). The EJC continues to grow successfully. (I'm happy to say I've published in it, back in 2004.) Remember to support your electronic journals.

Monday, February 25, 2008

Voting and Anti-Voting

While at NZIMA, I listened to Dominic Welsh's talks on Markov chains. It was a general talk, with a large chunk devoted to rapid mixing and applications. At some point he brought up the voter and anti-voter models, problems that I heard about in the distant past, and that raised some questions in my mind.

The voter model can be expressed as follows. We have an undirected graph G, where each vertex has a label. For convenience, let's assume the labels are in {0,1}, although larger alphabets are possible. At each time step, a vertex x is chosen uniformly at random, and x chooses a neighbor y uniformly at random. Then x changes its label to match y's label. This is a Markov chain with absorbing states with all vertices having the same label. It's natural to ask things like what is the probability that you end up in the all 0/all 1 state from a given configuration.

If the graph is d-regular, and you start in a state with k 0's and n-k 1's, a simple argument gives that the probability of eventually being absorbed into the all 0 case is k/n. (Exercise Left To Reader.) Things seem to get more complicated for more general graphs. Here are some interesting questions that come to mind. Note: I don't know which are open and which aren't...
  1. Given a graph G and a starting configuration of labels, is there a polynomial time algorithm for computing the probability of being absorbed into the all 0 state? Or could a hardness result be proved? [My guess is there's a hardness result in there somewhere.]
  2. Given a star graph (one node in the center, connected to n-1 other vertices) and a starting configuration, what is the probability of being absorbed into the all 0 state (preferably in some natural closed form)? In particular, if you start with a 1 in the middle and a 0 everywhere else, what is this probability as a function of n? Notice that by symmetry here one can easily compute these values in polynomial time; the question is whether there are pleasant equations. [On some reflection, it's clear there will be nice equations; this is almost a birth-death chain with an added "phase" needed to track the center vertex. Exercise left to reader...]
  3. Similar questions to 1/2, but ask about the expected time/distribution of time until absorption.
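As a sanity check on that k/n claim (and a starting point for poking at the absorption-time questions), here's a minimal simulation of the voter model; the cycle graph and the parameters at the bottom are just placeholders of my choosing.

    import random

    def voter_absorption(neighbors, labels, runs=2000):
        """Simulate the voter model until absorption and return the fraction
        of runs ending in the all-0 state.  neighbors[v] lists v's neighbors."""
        n = len(neighbors)
        all_zero = 0
        for _ in range(runs):
            state = list(labels)
            ones = sum(state)
            while 0 < ones < n:
                x = random.randrange(n)
                y = random.choice(neighbors[x])
                ones += state[y] - state[x]  # update the count of 1-labels
                state[x] = state[y]          # x copies y's label
            all_zero += (ones == 0)
        return all_zero / runs

    # A cycle on 10 vertices (2-regular) with k = 3 vertices labeled 0.
    n, k = 10, 3
    neighbors = [[(v - 1) % n, (v + 1) % n] for v in range(n)]
    labels = [0] * k + [1] * (n - k)
    print("simulated Pr[absorb into all 0]:", voter_absorption(neighbors, labels))
    print("predicted k/n:", k / n)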
Perhaps more interesting is the anti-voter model, which works just like the voter model, except that x changes its label to disagree with y's label. (Here the {0,1} alphabet is the most sensible.) Now we do not have absorbing states; we have a stationary distribution. But again one can ask similar questions:
  1. Given a graph G and a configuration of labels, is there a polynomial time algorithm for computing the stationary probability for that state?
  2. Are there natural families of graphs for which the stationary distribution for the anti-voting model has an easily expressible form? [d-regularity no longer seems to be enough...]
Obviously, one can argue about the importance of these questions, but I would just take the point of view that such easily expressible and seemingly natural questions about basic Markov chains are always interesting in their own right.

Wednesday, February 20, 2008

Conference Reviewing: Another Rant on Competitive Analysis

Because of a lack of attention to conference deadlines on my part, I find myself currently reading lots of papers while concurrently serving on the SPAA and ICALP Program Committees.

And somehow, this means I'm reading lots of papers with competitive analyses of algorithms. (Both the standard variety, and the more new-fangled but generally similar "price of anarchy" variety.) Looking back on it, I should simply have told the PC chairs that it might be best not to assign me any such papers, as I'm not sure it's fair to the authors who get me as a reviewer.

OK, I'm exaggerating. Don't get me wrong; I'm not automatically assigning negative scores to any such paper. Some of these papers I really like. But in general, as I've expressed frequently elsewhere, I'm a skeptic when it comes to competitive analysis. Why exactly should anyone care that you have an algorithm that in the worst case always gets within a factor of 6.5 of the optimal solution possible? Why is that the right way to judge an algorithm? (Never mind the offshoots, like those where you have an algorithm that's competitive if it has twice as much memory as you allow for the optimal algorithm, and so on...)

In my mind, the problem is that competitive analysis has become such a standard in Theoretical Computer Science that people don't feel the need to bother justifying anywhere in the paper why they should use competitive analysis, when in fact these papers scream for a need for motivation.

Here are some motivations that tend to work for me when reading competitive analysis papers.

1. The resulting problem is just so mathematically interesting that we really have to look at it. (One person I can think of that manages to pull off this motivation with me time and again is Susanne Albers.) Generally, to pull off this motivation, your problem should be simple, clean, and natural; if you start to bog down in details, claiming that you're giving a "more realistic" model of the problem, you've already missed the boat in my mind. (If you really wanted to be realistic, you wouldn't be using competitive analysis...)
2. You are considering the performance of real, existing algorithms. This seems reasonable to me; here's something people actually do (like, say, Shortest Job First scheduling), and you want to gain some insight into its worst-case performance for this notion of worst-case. Many price-of-anarchy results actually correspond to the behavior of reasonable algorithms one might consider in practice, so I admit I often find myself more sympathetic to Price of Anarchy results, which seem more motivated.
3. You are using competitive analysis for an essentially (complexity)-theoretical result.
4. You are using competitive analysis to gain significant insight into the problem (and algorithms for the problem) that could be useful in understanding real performance or designing real algorithms. I think most papers implicitly claim that their paper fits into this last category, just because it obviously doesn't fit into the other three, but usually this isn't explicitly discussed. I think this motivation works best when the problem is inherently very hard.

I'm sure there are other reasonable motivations. But overall I would call for more restraint from the community; not everything should be subjected to competitive analysis, just because it can.

Monday, February 04, 2008

Microsoft Cambridge Lab II

It's officially hit the wires, Jennifer Chayes and Christian Borgs will be heading up a new Microsoft lab in Cambridge II -- that is, here by Harvard and MIT. The proposed theme is interesting (and, happily, very appealing to me) -- algorithms with an emphasis on social networks and algorithmic game theory.

It opens this summer. I can't wait to see how it plays out.

Saturday, February 02, 2008

Shopping Period Lecture

One of the great Harvard "traditions" that I enjoyed as a student is the "shopping period". Students don't pre-register for classes; they spend the first week "shopping" whatever classes they like, and then choose what ones they want to take. (I'll talk more about the "politics" of shopping -- the pros and cons -- next time.)

Because of this, rather than dive right into material in the first class of my Algorithms and Data Structures course, besides going over the syllabus and requirements, I do something that at least I consider fun. We talk about how to get fair bits from a biased coin. The class starts with the classic brain-teaser: suppose you have a coin that may be biased. How can we flip that coin to decide something fairly, like who should pay for lunch?

[The simple solution to this question, unless I'm mistaken, is commonly attributed to von Neumann.]
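Spoiler for the brain-teaser, for those who haven't seen it: flip the coin twice, call heads-tails a 0 and tails-heads a 1, and try again on the other two outcomes. Both accepted outcomes have probability p(1-p), so the output bit is fair. A minimal sketch, with the bias below made up:

    import random

    def biased_flip(p=0.7):
        """A coin of unknown bias: returns 'H' with probability p."""
        return 'H' if random.random() < p else 'T'

    def von_neumann_bit(flip=biased_flip):
        """Flip twice; output 0 on HT, 1 on TH, and retry on HH or TT."""
        while True:
            a, b = flip(), flip()
            if a != b:
                return 0 if a == 'H' else 1

    bits = [von_neumann_bit() for _ in range(100000)]
    print("fraction of 1s:", sum(bits) / len(bits))  # should be close to 0.5

The obvious waste -- all those discarded HH and TT pairs -- still contains randomness, and recycling it is, roughly, what Peres's iterated procedure does.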

Starting from there, I try and take the class through a series of questions, leading up to how to efficiently extract lots of random bits from a sequence of biased flips. The method I base the lecture on is due to Yuval Peres (see "Iterating von Neumann's Procedure for Extracting Random Bits," Annals of Statistics, March 1992), and I learned about it at some point in graduate school at Berkeley. I try to run this lecture in a very back-and-forth manner, asking questions of the students and trying to get them to answer. (I also do this a bunch during the semester, with varying degrees of success...) Here's a version of my notes, with the various questions.

For the students who decide not to take the course, I figure at the very least they've learned something interesting that they can take with them. Also, it's conceivably the only time students will hear the word "entropy" in a computer science class, so I think it's worthwhile for that alone. Somehow, this problem fascinates people. Way back when I had more energy, I wrote a Dr. Dobb's article on it to show it to a wider audience, and there's been lots of related research on the problem. In some sense, this problem is the pre-history of all the randomness extraction work that has come since.