Sunday, September 19, 2010

Van Emde Boas and its space complexity

In this post, I want to describe 3 neat and very different ways of making the space of the van Emde Boas (vEB) data structure linear. While this is not hard, it is subtle enough to confuse even seasoned researchers at times. In particular, it is the first bug I ever encountered in a class: Erik Demaine was teaching Advanced Data Structures at MIT in spring of 2003 (the first grad course I ever took!), and his solution for getting linear space was flawed.


Erik is the perfect example of how you can get astronomically high teaching grades while occasionally having bugs in your lectures. In fact, I sometimes suspected him of doing it on purpose: deliberately letting a bug slip by to make the course more interactive. Perhaps there is a lesson to be learned here.

***

Here is a quick review of vEB if you don't know it. Experienced readers can skip ahead.

The predecessor problem is to support a set S of |S|=n integers from the universe {1, ..., u} and answer:
predecessor(q) = max { x ∈ S | x ≤ q }
The vEB data structure can answer queries in O(lglg u) time, which is significantly faster than the O(lg n) of binary search for any reasonable universe (e.g. for u = poly(n), the query time is O(lglg n)).

The first idea is to divide the universe into √u segments of size √u. Let hi(x) = ⌊x/√u⌋ be the segment containing x, and lo(x) = x mod √u be the location of x within its segment. The data structure has the following components:
  • a hash table H storing hi(x) for all x ∈ S.
  • a top structure solving predecessor search among { hi(x) | xS }. This is the same as the original data structure, i.e. use recursion.
  • for each element α∈H, a recursive bottom structure solving predecessor search inside the α segment, i.e. among the keys { lo(x) | x ∈ S and hi(x) = α }.
The query algorithm first checks if hi(q) ∈ H. If so, all the action is in q's segment, so you recurse in the appropriate bottom structure. (You either find its predecessor there, or in the special case when q is less than the minimum in that segment, find the successor and follow a pointer in a doubly linked list.)

If q's segment is empty, all the action is at the segment level, and q's predecessor is the max in the preceding non-empty segment. So you recurse in the top structure.

In one step, the universe shrinks from u to √u, i.e. lg u shrinks to ½ lg u. Thus, in O(lglg u) steps the problem is solved.
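To make the recursion concrete, here is a minimal Python sketch (my own code, not from the post, with the universe written as {0, ..., 2^w − 1} so that hi and lo become bit operations). It implements exactly the structure above, with the classic trick of keeping min/max outside the recursion, and makes no attempt at the space optimizations discussed below:

```python
class VEB:
    """Sketch of van Emde Boas predecessor search over {0, ..., 2**w - 1}.
    Queries run in O(lg w) = O(lglg u); space is NOT optimized here."""

    def __init__(self, w):
        self.w = w                     # keys have w bits
        self.min = self.max = None     # min and max are kept outside the recursion
        if w > 1:
            self.lw = w // 2           # bits of lo(x); segments have size 2**lw ~ sqrt(u)
            self.hw = w - self.lw      # bits of hi(x)
            self.H = {}                # hash table: segment id -> bottom structure
            self.top = None            # recursive structure over non-empty segment ids

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x < self.min:
            x, self.min = self.min, x  # a new minimum pushes the old one down
        if x > self.max:
            self.max = x
        if self.w == 1:
            return
        h, l = x >> self.lw, x & ((1 << self.lw) - 1)
        if h not in self.H:            # segment h becomes non-empty
            self.H[h] = VEB(self.lw)
            if self.top is None:
                self.top = VEB(self.hw)
            self.top.insert(h)
        self.H[h].insert(l)

    def predecessor(self, q):
        """Return max { x in S : x <= q }, or None if q is below the minimum."""
        if self.min is None or q < self.min:
            return None
        if q >= self.max:
            return self.max
        if self.w == 1:
            return self.min            # here min <= q < max forces the answer
        h, l = q >> self.lw, q & ((1 << self.lw) - 1)
        if h in self.H and self.H[h].min <= l:
            # all the action is in q's segment
            return (h << self.lw) | self.H[h].predecessor(l)
        # otherwise the answer is the max of the preceding non-empty segment
        hp = self.top.predecessor(h - 1)
        return self.min if hp is None else (hp << self.lw) | self.H[hp].max
```

For example, after inserting 3, 5 and 12 into VEB(4) (universe {0, ..., 15}), predecessor(7) returns 5 and predecessor(2) returns None.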

***

So what is the space of this data structure? As described above, each key appears in the hash table, and in 2 recursive data structures. So the space per key obeys the recursion S(u) = 1 + 2 S(√u). Taking logs: S'(lg u) = 1 + 2 S'(½ lg u), so the space is O(lg u) per key.

How can we reduce this to space O(n)? Here are 3 very different ways:

Brutal bucketing. Group elements into buckets of O(lg u) consecutive elements. From each bucket, we insert the min into a vEB data structure. Once we find a predecessor in the vEB structure, we know the bucket where we must search for the real predecessor. We can use binary search inside the bucket, taking time O(lglg u). The space is (n/lg u) · lg u = O(n).
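For illustration, here is a minimal sketch of this reduction (the name BucketedPredecessor is mine, not from the post). A sorted list queried with bisect stands in for the vEB structure over the bucket minima, so the stated bounds only hold once that stand-in is replaced by vEB, e.g. the sketch above:

```python
from bisect import bisect_right

class BucketedPredecessor:
    """Brutal bucketing sketch: buckets of ~lg u consecutive keys;
    only each bucket's minimum goes into the predecessor structure."""

    def __init__(self, keys, u):
        keys = sorted(keys)
        b = max(1, u.bit_length())                  # bucket size ~ lg u
        self.buckets = [keys[i:i + b] for i in range(0, len(keys), b)]
        # In the real structure, these n/lg u minima live in a vEB structure,
        # costing (n/lg u) * O(lg u) = O(n) space; a sorted list stands in here.
        self.mins = [bucket[0] for bucket in self.buckets]

    def predecessor(self, q):
        i = bisect_right(self.mins, q) - 1          # "vEB" predecessor among minima
        if i < 0:
            return None                             # q is below every key
        bucket = self.buckets[i]
        return bucket[bisect_right(bucket, q) - 1]  # binary search in a bucket: O(lglg u)
```

For example, BucketedPredecessor([3, 5, 12, 20], 16).predecessor(13) returns 12.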


Better analysis. In fact, the data structure from above does take O(n) space if you analyze it better! For each segment, we need to remember the max inside the segment in the hash table, since a query in the top structure must translate the segment number into the real predecessor. But then there's no point in putting the max in the bottom structure: once the query accesses the hash table, it can simply compare with the max in O(1) time. (If the query is higher than the max in its segment, the max is the predecessor.)

In other words, every key is stored recursively in just one data structure: either the top structure (if it is the max of its segment) or its segment's bottom structure (otherwise). Each key still appears once per level of the recursion, so there are O(lglg u) copies of each element, giving space O(n lglg u).

But note that copies get geometrically cheaper! At the first level, keys are lg u bits. At the second level, they are only ½ lg u bits; etc. Thus, the cost per key, in bits, is a geometric series, which is bounded by O(lg u). In other words, the cost is only O(1) words per key. (You may ask: even if the cost of keys halves every time, what about the cost of pointers, counters, etc? The cost of a pointer is O(lg n) bits, and n ≤ u in any recursive data structure.)
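Spelling the series out (a quick check of the claim, with a key's copy at recursion depth i costing lg u / 2^i bits):

```latex
\sum_{i \ge 0} \frac{\lg u}{2^i} \;=\; 2 \lg u \;=\; O(\lg u) \text{ bits} \;=\; O(1) \text{ words per key}
```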


Be slick. Here's a trickier variation due to Rasmus Pagh. Consider the trie representing the set of keys (a trie is a perfect binary tree of depth lg u in which each key is a root-to-leaf path). The subtree induced by the keys has n-1 branching nodes, connected by 2n-1 unbranching paths. It suffices to find the lowest branching node above the query. (If each branching node stores a pointer to its children, and the min and max values in its subtree, we can find the predecessor with constant work after knowing the lowest branching node.)

We can afford space O(1) per path. The data structure stores:
  • a top structure, with all paths that begin and end above depth ½ lg u.
  • a hash table H with the nodes at depth ½ lg u of every path crossing this depth.
  • for each α∈H, a bottom structure with all paths starting below depth ½ lg u which have α as prefix.
Observe that each path is stored in exactly one place, so the space is linear. But why can we query for the lowest branching node above some key? As the query proceeds, we keep a pointer p to the lowest branching node found so far. Initially p is the root. Here is the query algorithm:
  • if p is below depth ½ lg u, recurse in the appropriate bottom structure. (We have no work to do on this level.)
  • look in H for the node above the query at depth ½ lg u. If not found, recurse in the top structure. If found, let p be the bottom node of the path crossing depth ½ lg u which we just found in the hash table. Recurse to the appropriate bottom structure.
The main point is that a path is only relevant once, at the highest level of the recursion where the path crosses the middle line. At lower levels the path cannot be queried, since if you're on the path you already have a pointer to the node at the bottom of the path!

Tuesday, September 7, 2010

IOI Wrap-up

In the past 2 years, I have been a member of the Host Scientific Committee (HSC) of the IOI. This is the body that comes up with problems and test data. While it consists primarily of people from the host country (Bulgaria in 2009, Canada in 2010), typically the host will have a call-for-problems and invite the authors of problems they intend to use.

This year, I was elected member of the International Scientific Committee (ISC). This committee works together with the HSC on the scientific aspects, the hope being that a perennial body will maintain similar standards of quality from one year to another. There are 3 elected members in the ISC, each serving 3-year terms (one position is open each year).

I anticipate this will be a lot of fun, and you will probably hear more about the IOI during this time. When a call for problems comes up (will be advertised here), do consider submitting!

I will end with an unusual problem from this IOI:

Consider the largest 50 languages on Wikipedia. We picked 200 random articles in each language, and extracted an excerpt of 100 consecutive characters from each. You will receive these 10000 texts one at a time in random order, and for each you have to guess its language. After each guess, your algorithm learns the correct answer. The score is the percentage of correct guesses.

To discourage students from coding a lot of special rules, a random permutation is applied to the Unicode alphabet, and the language IDs are random values. So, essentially, you start with zero knowledge.

Considering the tiny amount of training data and the real-time nature of guessing, one might not expect very good solutions. However, it turns out that one can get around 90% accuracy with relatively simple ideas.

My own approach was to define Score(text, language) as the minimal number of substrings seen previously in this language that compose the text. This can be computed efficiently by maintaining a suffix tree for each language, and using it to answer longest common prefix queries.
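Here is a naive sketch of that scoring rule (my own simplification: it rescans a plain string instead of maintaining a suffix tree, so it is far too slow for the contest, but it shows the definition). `history` is assumed to be all excerpts previously seen in one language, joined with a separator character:

```python
def score(text, history):
    """Score(text, language): the minimum number of substrings of `history`
    that concatenate to `text`, or infinity if impossible. `history` holds
    all text previously seen in one language, joined with a separator
    character so pieces cannot span excerpt boundaries.
    Greedy longest-match is optimal: every substring of a previously seen
    piece was itself previously seen, so a longer first piece never hurts."""
    pieces, i = 0, 0
    while i < len(text):
        k = 0
        while i + k < len(text) and text[i:i + k + 1] in history:
            k += 1                      # extend the current piece while it matches
        if k == 0:
            return float('inf')         # next character never seen in this language
        i += k
        pieces += 1
    return pieces
```

The guess for an excerpt would then be the language minimizing this score, appending the excerpt to the correct history once the answer is revealed; the naive `in` test is exactly what the per-language suffix tree replaces.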

Sunday, August 29, 2010

Barriers 2

Later today, I'll be giving a talk at the 2nd Barriers Workshop in Princeton.


Here's my attempt to survey data structure lower bounds in 70 minutes: PDF, PPSX.

Update: Based on comments, I'm publishing the slides as PPSX (which can be viewed with a free viewer) and PDF. I will try to convert my other talks to these formats when I have time.

Thursday, August 26, 2010

IOI: A Medium Problem

Here is another, medium-level problem from the IOI. (Parental advisory: this is not quite as easy as it may sound!)


I think of a number between 1 and N. You want to find the secret number by repeatedly guessing values in 1 to N. Starting with your second guess, I will always reply "hotter" or "colder", indicating whether your latest guess was closer to or farther from the secret number than the previous one.

You have lg N + O(1) questions.

***

The solution to the first problem I mentioned can be found in the comments. Bill Hesse solved the open question that I had posed. He has a neat example showing that the space should be N^1.5 · log₂ 3 bits, up to lower order terms. It is very nice to know the answer.

A very elegant solution to the second problem was posted in the comments by a user named Radu (do you want to disclose your last name?). Indeed, this is simpler than the one I had. However, the one I had worked even when the numbers in the array were arbitrary (i.e. you could not afford to sort them in linear time). I plan to post it soon if commenters don't find it.

Sunday, August 22, 2010

IOI: Another Hard Problem

You are given a matrix A[1..N][1..M] that contains a permutation of the numbers {1, ..., NM}. You are also given W≤N and H≤M. The goal is to find the rectangle A[i ... i+W][j ... j+H] with the lowest possible median.


Running time: O(NM).

***

This could have been another hard problem at the IOI, but it was decided to give maximum score to solutions running in O(NM lg(NM)). This is considerably easier to obtain (but still interesting).

In the style of what Terry Tao tried for the IMO, I am opening this blog post as a discussion thread to try to solve the problem collectively ("poly-CS" ?). You are encouraged to post any suggestions, half-baked ideas, trivial observations, etc – with the goal of jointly reaching a solution. If you think about the problem individually and find the solution, please don't post, as it will ruin the fun of the collective effort. Following this rule, I will not engage in the discussion myself.

I did not encourage a discussion for the first problem, since it was the kind of problem that only required one critical observation to solve. This second problem requires several ideas, and in fact I can see two very different solutions.

Friday, August 20, 2010

IOI: The Hard Problem

The International Olympiad in Informatics (IOI 2010) is taking place this week at the University of Waterloo, Canada.


The olympiad often features a hard problem, which is intended to be solved by a handful of contestants. This year, the problem was solved by about 6 people. Read the problem below and give it a shot! :)

I will describe the problem in both TCS and IOI fashion.

Asymptotic version. You are given an unweighted, undirected graph on N vertices. Some √N vertices are designated as "hubs". You have to encode the pairwise distances between all hubs and all vertices in O(N^1.5) bits of space.

The encoder and decoder may run in polynomial time. Of course, the decoder does not see the original graph; it receives the output of the encoder and must output the explicit distances between any hub and any other vertex. (This list of explicit distances takes O(N^1.5 · lg N) bits.)

Non-asymptotic version. You are given a graph on 1000 nodes and 36 designated hubs. You have to encode the distances between all hubs and all vertices in 70,000 bits of space.

The non-asymptotic version is a bit harder, since you have to pay more attention to some details.

The research version. Prove or disprove that the distances can be encoded using (1+o(1)) · N^1.5 bits of space. I don't know the answer to this question (but I find the question intriguing).

Wednesday, August 4, 2010

A taxonomy of range query problems

In this post, I will try to enumerate the range query problems that I know about. Let me know if I'm missing anything.

The query. Say you have n points in the plane, and you query for the points in an axis-parallel rectangle. What could we mean by "query"?
  • existential range queries: Is there any point in the rectangle?
  • counting queries: How many points are there in the rectangle?
  • reporting queries: List the points in the rectangle. Unlike the previous cases, the query time is now broken into two components: it is usually given as f(n) + k·g(n), where k is the number of output points.
Now let's assume the points have some number associated to them (a weight or a priority). Then one could ask the following natural queries:
  • weighted counting: What is the total weight of the points inside?
  • range min (max) query
  • range median query. (Possible generalizations: selection or quantiles.)
  • top-k reporting: Report just the top k guys, by priority (for k given). One may demand the output to be sorted. More stringently, one may ask the query algorithm to enumerate points sorted by priority, in time g(n) per point, until the user says "stop."
The number associated to a point can also be a "color". For instance, points could represent Irish pubs / Belgian bars / etc, and we may only want one point of each type in our output. Then the queries become:
  • colored counting: How many distinct colors are in the rectangle?
  • colored reporting: List the distinct colors in the rectangle (possibly with one example point from each color).
  • top-k colored reporting: If the colors are sorted by priorities (e.g. I prefer points of color 1 over points of color 2), one can then ask for the top-k distinct colors inside the rectangle.

Dynamism. The problem could be:
  • static: Preprocess the point set and then answer queries.
  • dynamic: Insert and delete from the point set.
  • incremental / decremental: We only insert, or we only delete.
  • offline: The sequence of operations is known in advance. This is enough for many algorithmic applications.
  • parametric / kinetic. I confess ignorance with respect to these.

Orthogonal range queries. The setting from above works in any number of dimensions d≥1: the data set consists of n points in d-dimensional space and the query is a box [a_1, b_1] × ··· × [a_d, b_d]. This setup is usually called "orthogonal range queries".

We can consider the following important restrictions on the query:
  • dominance queries: the box is [0, b_1] × ··· × [0, b_d]. In other words, we are asking for the points dominated, coordinate-wise, by a point (b_1, ..., b_d).

  • k-sided queries: exactly 2d-k values in (a_1, a_2, ..., a_d) are zero. For instance, a 3-sided query in 2D is a rectangle with one side on the x axis. Dominance queries are the special case of d-sided queries.

The universe. The coordinates of the points and queries can come from the following sets:
  • general universe. In the standard RAM model, we assume that the coordinates are integers that fit in a machine word.

  • rank space: the coordinates are from {1, 2, ..., n}. One can reduce any static problem to rank space by running 2d predecessor queries (see the sketch after this list). Most problems can be shown to be at least as hard as predecessor search, so their complexity is precisely: "optimal cost of predecessor search" + "optimal cost for the problem in rank space". In other words, for most problems it is sufficient to solve them in rank space.

  • dense universe: the points are exactly the points of the grid [1, n_1] × ··· × [1, n_d], where n_1·n_2···n_d = n. In 1D this is the same as rank space, but in 2 or more dimensions the problems are very different. (To my knowledgeable readers: Is there a standard name for this case? For counting queries people call this "the partial sums problem", but how about e.g. min queries?)
For dynamic problems, the "rank space" changes when a new coordinate value is inserted. Thus, a rank-space solution must support an "insert value" operation that increments all coordinate values after a given one, creating space for a newly inserted point. (I have heard the term "list space" for this. Should we just use "rank space"?)
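As a minimal illustration of the static rank-space reduction mentioned above (function names are mine), here is a 2D sketch; bisect on the sorted coordinate lists plays the role of the 2d predecessor/successor searches:

```python
from bisect import bisect_left, bisect_right

def to_rank_space(points):
    """Map static 2D points to rank space: coordinates become ranks in {1, ..., n}."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    xrank = {v: i + 1 for i, v in enumerate(xs)}
    yrank = {v: i + 1 for i, v in enumerate(ys)}
    return [(xrank[x], yrank[y]) for x, y in points], xs, ys

def range_to_ranks(a, b, axis_values):
    """Translate one side [a, b] of a query box into rank space.
    These two lookups are the predecessor/successor searches mentioned
    above; bisect plays that role on the sorted coordinate list."""
    lo = bisect_left(axis_values, a) + 1   # rank of the first value >= a
    hi = bisect_right(axis_values, b)      # rank of the last value <= b
    return lo, hi                          # empty range if lo > hi
```

A query box [a_1, b_1] × [a_2, b_2] then becomes the rank ranges range_to_ranks(a_1, b_1, xs) and range_to_ranks(a_2, b_2, ys), after which any rank-space structure applies.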


Stabbing. So far, our data set consisted of points and the queries asked for points in a given rectangle. Conversely, one can consider a data set of rectangles; the query is a point and asks about the rectangles containing that point ("stabbed" by it). This problem is important, among others, in routing: we can have rules for packets coming from some range of IP addresses and going to some other range of IP addresses.

The notion of rank space, and all the query combinations, still make sense. For instance, interval max stabbing is the following problem: given a set of intervals (in 1D) with priorities, return the interval of highest priority stabbed by a query point.
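To make this concrete, here is a minimal sketch of interval max stabbing (my own code, assuming endpoints already reduced to rank space, i.e. 0-based positions): each interval is written onto O(lg n) canonical nodes of a segment tree, and a stabbing query takes the max along one leaf-to-root path.

```python
class IntervalMaxStabbing:
    """Sketch: store intervals [a, b] (0-based positions) with priorities;
    stab(q) returns the highest priority among intervals containing q."""

    def __init__(self, npos):
        self.size = 1
        while self.size < npos:
            self.size *= 2
        self.best = [float('-inf')] * (2 * self.size)

    def add(self, a, b, p):
        """Write interval [a, b] with priority p onto O(lg n) canonical nodes."""
        lo, hi = a + self.size, b + self.size + 1   # half-open [lo, hi) over leaves
        while lo < hi:
            if lo & 1:
                self.best[lo] = max(self.best[lo], p)
                lo += 1
            if hi & 1:
                hi -= 1
                self.best[hi] = max(self.best[hi], p)
            lo //= 2
            hi //= 2

    def stab(self, q):
        """Max priority among intervals containing position q."""
        i, ans = q + self.size, float('-inf')
        while i:                                    # leaf-to-root path
            ans = max(ans, self.best[i])
            i //= 2
        return ans
```

For instance, after add(1, 5, 10) and add(3, 6, 20) on IntervalMaxStabbing(8), stab(2) returns 10 and stab(4) returns 20. Returning the interval itself, as in the problem statement, just means storing (priority, interval id) pairs instead of bare priorities.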

Note that the rectangles in the input can intersect! If we ask that the rectangles be disjoint, or more stringently, be a partition of the space, we obtain the point location problem.


Rectangle-rectangle queries. So far we looked at containment relations between rectangles and points. More generally, the data set can consist of rectangles, and the query can also be a rectangle. Then one can ask:
  • intersection queries: analyze the set of input rectangles that intersect the query rectangle.
  • containment queries: analyze the set of rectangles that contain / are-contained-by the query.
Two important special cases arise when the rectangles degenerate to line segments:
  • orthogonal segment intersection: Given a set of horizontal segments, find the ones that intersect a vertical query segment.

  • orthogonal ray shooting: Given a set of horizontal segments, find the lowest one immediately above a query point. In other words, consider a vertical ray leaving from the point and find the first segment it intersects. (This is the min segment intersection query, where the priority of each horizontal segment is its y coordinate.)

More complex geometry. Of course, our ranges need not be orthogonal. One can consider:
  • balls
  • arbitrary lines
  • half spaces
  • simplices (e.g. 2D triangles).
In non-orthogonal geometry, the concept of rank space disappears. However, most other combinations are possible. For instance, one could ask about the points in a query range; the ranges stabbed by a query point; the ranges intersecting a query range; the first range intersected by a ray; etc. We can ask existential, counting, or reporting questions on ranges that can have weights or priorities.