January 24, 2006
Google and online privacy
Reading Tim Wu's essay on the recent DOJ subpoena of Google for its search records, I thought of translucent databases. The core concept behind translucent databases is that the statistically, or epidemiologically, or biologically, or whateverologically interesting data in a record -- demographics, a web search, health history, credit rating, etc. -- is left in cleartext, but the key that uniquely identifies that record, which may be sensitive or personal information like a name, Social Security number, or IP address, is passed through a secure, one-way hashing algorithm that generates a so-called "hash", a string of characters that looks to the untrained eye like random gobbledygook. The algorithm is strong enough that the sensitive data it represents can't be discovered by reverse engineering the hash: without a special passphrase known only to trusted parties, it would take untold computers untold years to work back from any one hash to the cleartext original. In addition -- and this is the key point -- for any input, as long as that input doesn't change, you get the same hash. In other words, there is, for all intents and purposes, one unique hash for each possible input*. So with translucent databases, institutions can continue to do interesting data analysis without losing the critically important ability to uniquely identify records and therefore couple them together: the difference is that they can do this work without actually knowing that this record belongs to Bob Smith, or that that record is from the IP address 63.161.169.137. And arrangements can be made with the persons about whom data is collected to use the aforementioned passphrases to reliably update records without untrusted parties becoming privy to the information underneath.
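To make the idea a little more concrete, here is a minimal sketch in Python. It is my own illustration, not anything Google or the translucent-databases literature prescribes: I'm assuming HMAC-SHA256 from the standard library as the keyed one-way function, and the passphrase, the `opaque_key` helper, and the sample log records are all made up for the example.

```python
import hashlib
import hmac

# Assumption: a secret passphrase held only by trusted parties.
SECRET = b"example passphrase, not a real key"


def opaque_key(identifier: str, secret: bytes = SECRET) -> str:
    """Return a keyed one-way hash (HMAC-SHA256) of a sensitive identifier."""
    return hmac.new(secret, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical search-log records: the query stays in cleartext,
# while the IP address is replaced by its hash.
raw_log = [
    ("63.161.169.137", "translucent databases"),
    ("10.0.0.7", "macchiato recipe"),
    ("63.161.169.137", "tim wu essay"),
]
translucent_log = [(opaque_key(ip), query) for ip, query in raw_log]

# The same input always yields the same hash, so records from one
# address can still be coupled together without storing the address.
queries_by_key = {}
for key, query in translucent_log:
    queries_by_key.setdefault(key, []).append(query)

for key, queries in queries_by_key.items():
    print(key[:16], queries)
```

Two of the three sample records end up under the same opaque key, which is the whole point: the analysis still works, but the stored table no longer contains an IP address.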
One part of the media coverage of this Google v. DOJ story that's unsatisfying to anyone who is familiar with IP networks is that an IP address doesn't necessarily uniquely identify something the way most people think it does. Addresses can be dynamically assigned and therefore change regularly (though there is certainly no reason to think that ISPs aren't keeping track of IP assignment history). With the advent of NAT and private IP networks, an IP address is less likely than ever to even uniquely identify a single computer: there could be many Internet-accessing devices behind a router with a single IP address, which is certainly the case in many home and small business networks, where the scarcity of available public IP addresses makes it infeasible, and an administrative burden, to try to assign one to each machine. Think of a coffee shop with a WiFi access point: each of those macchiato-sipping laptop users is known to the rest of the Internet by the same IP address. It's far more likely that web browser cookies, tracked across many sites with sharing agreements and usually tied to a login session where a user has provided information that could ultimately be traced back to them, would yield interesting, per-surfer metrics.
But there are plenty of homes out there with a single PC and a single connection to the Internet, so why even bother with storing IP addresses? Once a cursory examination of them is done -- for instance, country of origin, which can easily be discovered with widely available tools -- run them through your hashing scheme and toss the originals. Then if you ever do get into a situation where you're forced to hand over the data, you can at least do it secure in the knowledge that you're not compromising your users' privacy. You still have problems, just one fewer on your conscience.
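The examine-then-toss step might look something like the sketch below. Everything named here is an assumption for illustration: `lookup_country` is a stand-in for whatever geolocation tool you actually use (its table and the "XX" country code are placeholders, not real lookup results), `scrub` is a hypothetical helper, and HMAC-SHA256 is just one reasonable choice of keyed hash.

```python
import hashlib
import hmac

# Assumption: a secret passphrase held only by trusted parties.
SECRET = b"example passphrase, not a real key"


def lookup_country(ip: str) -> str:
    """Hypothetical stand-in for a real IP geolocation tool."""
    demo_table = {"63.161.169.137": "XX"}  # placeholder code, not real data
    return demo_table.get(ip, "unknown")


def scrub(record: dict) -> dict:
    """Keep the coarse facts derived from the IP, hash it, drop the original."""
    ip = record["ip"]
    cleaned = {k: v for k, v in record.items() if k != "ip"}
    cleaned["country"] = lookup_country(ip)
    cleaned["ip_hash"] = hmac.new(
        SECRET, ip.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return cleaned


# Only the scrubbed record would ever be written to storage.
stored = scrub({"ip": "63.161.169.137", "query": "translucent databases"})
print(stored)
```

The order matters: anything you want to learn from the raw address has to be extracted before the hash replaces it, because afterwards there is no going back.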
* The information space of a typical hash is 128 bits, or 2 to the power of 128, or about 3.4e38: an extremely large number of possible outcomes. So while "collisions" -- two different inputs that yield the same output hash -- can happen, and have been demonstrated in the case of the MD5 algorithm, the odds of one occurring by chance are infinitesimally small, and in any case would not diminish the practical utility of day-to-day use of such hashes; that is, until quantum computers get their hands on them, but that is another, terrifying matter.
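For anyone who wants to check the arithmetic, here is a rough sketch. It assumes an idealized hash whose outputs are spread uniformly over the 128-bit space and uses the standard birthday approximation; the `collision_probability` helper and the trillion-record figure are my own illustrative choices, and none of this says anything about collisions that someone constructs on purpose.

```python
# Size of a 128-bit hash space.
space = 2 ** 128
print(space)           # the exact integer
print(f"{space:.1e}")  # in scientific notation


def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday approximation p ~ n(n-1) / (2 * 2^bits).

    Only meaningful while the result is small, and only for an
    idealized hash with uniformly distributed outputs.
    """
    return n * (n - 1) / (2 * 2 ** bits)


# Chance of any accidental collision among a trillion hashed records.
print(collision_probability(10 ** 12))
```

Even at a trillion records, the estimated chance of a single accidental collision stays many orders of magnitude below one in a billion.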