Google, MSN and Yahoo were recently served with subpoenas for search data from their databases. The background, from an article by Columbia professor
Tim Wu in
Slate:
Back in the 1990s, Congress passed a succession of laws designed to keep porn off the Internet. Those laws didn't work—largely because the courts kept striking them down as violations of the First Amendment. Sick of losing and finding themselves back in the district court defending a reconfigured version of the anti-porn bill (now named the "Child Online Protection Act"), lawyers in the Justice Department decided to hire Berkeley professor Philip Stark. Stark's assignment: Use statistics to show what everyone already knows—that there's an awful lot of porn on the Internet.
Stark's plan is to demand that Google and the other major search engines supply him, and thus the government, with a random selection of a million domain names available for search and a million sample user queries. He will then demonstrate how truly nasty the Internet is, and in particular, just how hard it is to block out stuff that's "Harmful to Minors." Faced with this third-party subpoena, Microsoft and Yahoo! agreed to supply Stark with this information, but Google refused, calling the subpoena request "overbroad, unduly burdensome, vague, and intended to harass."
He goes on to make the point that the problem is not that Google et al were slapped with subpeonas, but that they had the data lying around in the first place. His suggestion: get rid of tell-tale information:
That's why the public's demand must be of Google—not the state. It should be that Google please stop keeping quite so much information attached to our IP addresses; please modify logging practices so that all identifying information is stripped. And please run history's greatest "search and delete," right now, and take out the IP addresses from every file that contains everyone's last five years of searches.
Coincidentally, I was chatting with Piotr Indyk the other day at Penn, and he made the very valid point that in a sense, theoretical computer science is already on the case, and has been studying this problem for a very long time. The field of private information retrieval is precisely about this:
Can you make queries from a database without the database knowing what you want ?
It seems impossible (if you don't already know the answer). But here's a silly solution:
Ask for all the data from the database ! There's no way the database will know what you really wanted.
The solution is indeed silly, but why exactly is it ? Well, because you'd need to extract all the n bits of information from the database
[1] to get the answer bit
[2] for one query So if we think of "stuff the database returns" as a bounded resource, can we do better ?
Well, as you might have guessed, if you want to be deterministic, then not really. But if you can toss coins when asking your queries,
while still guaranteeing that the answer is always correct,
and if you assume that you are really querying multiple non-communicating copies of a database (not so unrealistic), then you can do a lot better. For example, if you wanted to know the value of a single bit (the "query") of a string of n bits (the "database"), you could pick a random subset of the string, and ask the first database for the XOR of the bits. Then XOR the index you want with the set, and ask the second database for a similar XOR. You can XOR the two answer to get your bit, and the databases are none the wiser.
Now this really doesn't save you much, since you'd send n bits in total (on average) to the servers, but the scheme can be generalized to k servers, in which case you'll end up with a total transmission of something like k log k n^(1/log k), which is significantly sublinear.
There are many variations: how smart can the servers be at trying to decipher your answers ? (in the above example, even an all powerful server can't do a thing) ? How much can the servers talk to each other ? what kinds of queries can you ask ? and so on.
One might argue that assuming the servers can't communicate with each other is an unreasonable restriction. Not so really, especially if you consider that it might be in Google's interests to facilitate this kind of querying
[3]. Another interesting question that I don't think has been studied would be: can you construct effective link graph structures and search results without needing to retain any information about the pages themselves ?
Piotr made the interesting point that there's a lot of scope here for a company that wants to innovate in the search space: "Search with our engine! We have no clue what you're looking for, and we'll find it anyway!". In fact, there's a
web page on practical PIR maintained at the Institut für Informatik at the Humboldt-Universität zu Berlin by Prof.
Johann-Christoph Freytag and
Dmitri Asonov.
Since the business of theoryCS these days is PR, it's worth pointing out that privacy is rapidly becoming a major area of concern, and work springing out of theoretical computer science is ideally placed to have an impact here. Just sayin'....
I drew my sources from the
wonderful survey on PIR written by
Bill Gasarch (who's current guest blogging at the
Complexity Blog) for Lance Fortnow's Complexity Column. He also has a
companion web page, and the two are a valuable resource of information on this topic.
(HT:
BoingBoing)
[1] Theoretical computer scientist love referring to all sizes as 'n'. If there are more, we will often use 'k', 'm', and the often scorned 'U".
[2] The one thing we love more than calling things 'n' is designing questions with YES/NO answers.
[3] There is another post waiting to be written on why it's NOT in Google's best interests to help make private information retrieval a reailty. Such a post would have a heavy dose of phrases like "Web 2.0", "AdSense", "AdWords", "revenue streams" and "targeted search" and would bore everyone to tears.