In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous. The state of Massachusetts had released data on 135,000 state employees and their families with obvious identifiers removed. However, the data contained zip code, birth date, and sex for each individual. Sweeney was able to cross-reference this data with publicly available voter registration data to find the medical records of William Weld, then governor of Massachusetts.
An estimated 87% of Americans can be identified by the combination of zip code, birth date, and sex. A back-of-the-envelope calculation shows that this should not be surprising, but Sweeney appears to be the first to do this calculation and pursue the results. (Update: See such a calculation in the next post.)
In her paper “Only You, Your Doctor, and Many Others May Know,” Sweeney says that her research was unwelcome. Over 20 journals turned down her paper on the Weld study, and nobody wanted to fund privacy research that might reach uncomfortable conclusions.
“A decade ago, funding sources refused to fund re-identification experiments unless there was a promise that results would likely show that no risk existed or that all problems could be solved by some promising new theoretical technology under development. Financial resources were unavailable to support rigorous scientific studies otherwise.”
There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.
Photo of Latanya Sweeney via Parker Higgins [CC BY 4.0], from Wikimedia Commons
I hope that flaws are first revealed discreetly rather than discretely.
They’re discovered one at a time, so it’s discrete. :)
Here’s the quote from the article that lays out the numbers for the back-of-the-envelope calculation:
“There are 365 days in a year, two genders, and people live about 78 years. Multiplying these numbers gives 56,940 unique combinations. However, the average five-digit ZIP code in the United States has only about 25,000 people.”
Then, if you assume a uniform distribution, the calculation goes as in https://en.wikipedia.org/wiki/Birthday_problem#Same_birthday_as_you, which gives me a ~64% chance that you are uniquely identified by birth date and gender within a ZIP code of 25,000 people. That is a bit lower than the 87% figure, but at least in the same ballpark.
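For anyone who wants to check the arithmetic, here is a minimal Python sketch of that calculation. It assumes, as above, that every combination of birthday, sex, and age is equally likely, which is a strong simplification. The function name prob_unique is made up for illustration.

```python
def prob_unique(num_people: int, num_combinations: int) -> float:
    """Probability that none of the other residents shares your combination,
    assuming all combinations are equally likely (a simplifying assumption)."""
    return (1 - 1 / num_combinations) ** (num_people - 1)


# Figures from the quote above: days in a year x genders x years of life.
combinations = 365 * 2 * 78
people = 25_000  # people per ZIP code, as quoted

p = prob_unique(people, combinations)
print(f"{combinations:,} combinations, P(unique) = {p:.3f}")
```

The result lands in the mid-60% range, consistent with the ~64% figure. Real birth dates are not uniformly distributed and ZIP codes vary widely in size, so this is only a ballpark.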
BTW, “Related Posts” should probably link this blogpost too: https://www.johndcook.com/blog/2018/03/02/bits-of-information-in-age-birthday-and-birthdate/.
Thanks, Malcolm.
I don’t understand how she can say an average zip code has 25,000 people. There are around 42,000 zip codes in the US, and 25,000 people per zip code would imply a US population over a billion.
I just did my own calculation in the next post with some Python code for simulation.
It’s possible that the distribution of people per zip code is skewed in such a way that the average person shares their zip code with 25,000 other people.
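To make the skew argument concrete, here is a small Python sketch. The ZIP code populations in it are invented purely for illustration and are not real data. The point is only that "average population per ZIP code" and "ZIP code population seen by the average person" can be very different numbers.

```python
# Hypothetical ZIP code populations, made up for illustration only:
# many small ZIP codes and a few very large ones.
populations = [500] * 80 + [5_000] * 15 + [60_000] * 5

# Average over ZIP codes: each ZIP code counts once.
mean_per_zip = sum(populations) / len(populations)

# Average over people: each ZIP code is weighted by its own population,
# since a ZIP code with n residents is the home ZIP code of n people.
mean_per_person = sum(n * n for n in populations) / sum(populations)

print(f"average ZIP code population:           {mean_per_zip:,.0f}")
print(f"ZIP code population of average person: {mean_per_person:,.0f}")
```

With these made-up numbers the second average is roughly ten times the first, so a figure like 25,000 could in principle describe the typical person's ZIP code even if the typical ZIP code is much smaller.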
I suspect her analysis and results focused on the zip codes from which the data was drawn. No need or reason to consider any others.