How does big data impact privacy? Which is a bigger risk to your privacy, being part of a little database or a big database?
Rows vs Columns
People commonly speak of big data in terms of volume—the “four v’s” of big data being volume, variety, velocity, and veracity—but what we’re concerned with here might better be called “area.” We’ll think of our data being in one big table. If there are repeated measures on an individual, think of them as more columns in a denormalized database table.
In what sense is the data big: is it wide or long? That is, if we think of the data as a table with rows for individuals and columns for different fields of information on individuals, are there a lot of rows or a lot of columns?
All else being equal, your privacy goes down as columns go up. The more information someone has about you, the more likely some of it may be used in combination to identify you.
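As a rough illustration of the column effect, the sketch below counts how many records become unique as more fields are considered. The records and field names are made up for the example, and `unique_fraction` is a hypothetical helper, not a standard privacy metric.

```python
from collections import Counter

# Made-up toy records: (sex, age_band, zip3, occupation)
records = [
    ("M", "40s", "770", "teacher"),
    ("M", "40s", "770", "nurse"),
    ("M", "40s", "773", "teacher"),
    ("F", "40s", "770", "teacher"),
    ("F", "30s", "770", "nurse"),
    ("M", "30s", "773", "nurse"),
]

def unique_fraction(rows, n_cols):
    """Fraction of rows whose first n_cols fields match no other row."""
    keys = [row[:n_cols] for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

# One entry per number of columns an observer can see (1 through 4).
fractions = [unique_fraction(records, n) for n in range(1, 5)]
print(fractions)
```

In this toy table nobody is unique on sex alone, but every record is unique once all four fields are combined.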
How privacy varies with the number of rows is more complicated. Your privacy could go up or down with the number of rows.
The more individuals in a dataset, the more likely there are individuals like you in the dataset. From the standpoint of k-anonymity, this makes it harder to identify you, but easier to reveal information about you.
Group privacy
For example, suppose you are one of 50 people in the dataset who share the same quasi-identifiers. Say you’re a Native American man in your 40s and there are 49 others like you. Then someone who knows your demographics can’t be sure which record is yours; they’d have a 2% chance of guessing the right one. The presence of other people with similar demographics makes you harder to identify. On the other hand, their presence makes it more likely that someone could find out something you all have in common. If the data show that middle-aged Native American men are highly susceptible to some disease, then the data imply that you are likely to be susceptible to that disease.
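The two sides of this example can be sketched in a few lines. The data below are invented for illustration: a group of 50 records sharing the same quasi-identifiers, 45 of which carry a hypothetical `has_condition` flag, plus 200 unrelated records.

```python
from collections import Counter

QUASI_IDS = ("race", "sex", "age_band")

def make_row(race, sex, age_band, has_condition):
    return {"race": race, "sex": sex, "age_band": age_band,
            "has_condition": has_condition}

# Invented data: 50 people in the target group, 200 others.
rows = (
    [make_row("Native American", "M", "40s", True) for _ in range(45)]
    + [make_row("Native American", "M", "40s", False) for _ in range(5)]
    + [make_row("Other", "F", "30s", False) for _ in range(200)]
)

def class_key(row):
    """The combination of quasi-identifier values for one record."""
    return tuple(row[q] for q in QUASI_IDS)

sizes = Counter(class_key(r) for r in rows)
target = ("Native American", "M", "40s")

k = sizes[target]                  # size of the equivalence class
guess_probability = 1 / k          # chance of picking the right record
group = [r for r in rows if class_key(r) == target]
condition_rate = sum(r["has_condition"] for r in group) / k

print(k, guess_probability, condition_rate)
```

A large class makes the individual record hard to pick out, yet the high rate within the class still says something about every member of it.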
Because privacy measures aimed at protecting individuals don’t necessarily protect groups, some minority groups have been reluctant to participate in scientific studies. The Genetic Information Nondiscrimination Act of 2008 (GINA) makes some forms of genetic discrimination illegal, but it’s quite understandable that minority groups might still be reluctant to participate in studies.
Improving privacy with size
Your privacy can improve as a dataset gets bigger. If there are a lot of rows, the data curator can afford a substantial amount of randomization without compromising the value of the data. The noise in the data will not affect statistical conclusions from the data but will protect individual privacy.
With differential privacy, the data is held by a trusted curator. Noise is added not to the data itself but to the results of queries on the data. Noise is added in proportion to the sensitivity of a query. The sensitivity of a query often goes down with the size of the database, and so a differentially private query of a big dataset may only need to add a negligible amount of noise to maintain privacy.
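To make the sensitivity argument concrete, here is a minimal sketch of a Laplace-noised mean. It assumes the values are clipped to a known range [lower, upper] and that the number of rows n is public, in which case changing one record moves the mean by at most (upper - lower) / n. The function names are illustrative and not taken from any particular library.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def mean_sensitivity(n, lower, upper):
    """Largest change in a bounded mean when one of n records changes."""
    return (upper - lower) / n

def private_mean(values, lower, upper, epsilon, rng):
    """Mean of clipped values plus Laplace noise of scale sensitivity/epsilon."""
    clipped = [min(max(v, lower), upper) for v in values]
    scale = mean_sensitivity(len(clipped), lower, upper) / epsilon
    return sum(clipped) / len(clipped) + laplace_noise(scale, rng)

rng = random.Random(0)
for n in (100, 100_000):
    data = [rng.random() for _ in range(n)]
    true_mean = sum(data) / n
    noisy_mean = private_mean(data, 0.0, 1.0, epsilon=1.0, rng=rng)
    print(n, mean_sensitivity(n, 0.0, 1.0), abs(noisy_mean - true_mean))
```

With the same epsilon, the noise scale for 100,000 rows is a thousand times smaller than for 100 rows.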
If the dataset is very large, it may be possible to randomize the data itself before it enters the database using randomized response or local differential privacy. With these approaches, there’s no need for a trusted data curator. This wouldn’t be possible with a small dataset because the noise would be too large relative to the size of the data.
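A minimal sketch of the classic coin-flip form of randomized response, for a yes/no question: each respondent answers truthfully with probability 1/2 and otherwise reports the result of a second fair coin flip. The function names and the simulated true rate are assumptions made for the example.

```python
import random

def randomized_response(truth, rng):
    """Report the truth on heads; on tails, report a second coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(reported_yes_rate):
    """Invert P(report yes) = 0.5 * p + 0.25 to recover p."""
    return 2 * (reported_yes_rate - 0.25)

def simulate(n, true_rate, rng):
    """Estimate the true rate from n randomized answers."""
    truths = [rng.random() < true_rate for _ in range(n)]
    reports = [randomized_response(t, rng) for t in truths]
    return estimate_true_rate(sum(reports) / n)

rng = random.Random(1)
for n in (100, 100_000):
    print(n, simulate(n, true_rate=0.3, rng=rng))
```

The standard error of this estimate is roughly 1/sqrt(n): about 0.1 for 100 respondents and about 0.003 for 100,000, which is why the approach only pays off on large datasets.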
Differential privacy sounds good, but how does one protect against repeated queries?
For example, suppose that you have enough information to identify a specific person (call him/her person X) and you want to find out a specific piece of private information about that person (call it field Y). Without differential privacy, you could choose some random, large set of people A, then query the average Y of A and the average Y of A union X in order to calculate the Y of X. With differential privacy, doing this once doesn’t work, but if you repeat it many times with many different sets A, you would expect the noise to average out and then you could calculate the Y of X. (Unless the noise somehow depends in a systematic way on whether or not X is in the set being queried, but then you might as well just randomize each row.)
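The attack described here can be simulated in a few lines. This sketch uses noisy sums rather than averages to keep the arithmetic short, and it assumes a curator that adds fresh, independent Laplace noise to every query with no limit on the number of queries. All names and numbers are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_sum(values, scale, rng):
    """A sum query answered with fresh noise each time it is asked."""
    return sum(values) + laplace_noise(scale, rng)

def averaging_attack(population, y_x, scale, repeats, rng, subset_size=20):
    """Average many noisy (A union X) minus A differences to estimate Y of X."""
    total = 0.0
    for _ in range(repeats):
        subset = rng.sample(population, subset_size)
        with_x = noisy_sum(subset + [y_x], scale, rng)
        without_x = noisy_sum(subset, scale, rng)
        total += with_x - without_x
    return total / repeats

rng = random.Random(0)
population = [rng.uniform(0, 10) for _ in range(1000)]
secret = 7.0  # person X's value of field Y
for repeats in (1, 20_000):
    print(repeats, averaging_attack(population, secret, 10.0, repeats, rng))
```

A single pair of queries gives an estimate whose noise has a standard deviation of 20, while 20,000 pairs bring that down to about 0.14, so unlimited fresh-noise queries do leak the value.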
There are a couple of ways to handle repeated queries. One is to cache the results, so that if you ask the same question twice you get the same answer.
The other is to treat each query as a draw on your privacy budget. With this approach, if you ask for the same information twice, you pay for it twice. If you ask nearly the same query as you’ve asked before, you pay twice. DP has clever mechanisms that amount to a “bulk discount,” i.e. asking for more information at once spends less of your privacy budget than asking for the same information across multiple requests.
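A privacy budget can be sketched as a simple accountant. This version assumes basic sequential composition, where the epsilons of successive queries simply add up; the "bulk discount" mechanisms are not modeled. The class and method names are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a query would push spending past the total budget."""

class PrivacyAccountant:
    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Deduct epsilon for one query, or refuse if the budget runs out."""
        if self.spent + epsilon > self.total_epsilon:
            raise BudgetExceeded(
                f"need {epsilon}, only {self.total_epsilon - self.spent} left"
            )
        self.spent += epsilon

accountant = PrivacyAccountant(total_epsilon=1.0)
accountant.charge(0.4)   # first query
accountant.charge(0.4)   # the same query again still costs 0.4
try:
    accountant.charge(0.4)
    third_query_allowed = True
except BudgetExceeded:
    third_query_allowed = False
print(accountant.spent, third_query_allowed)
```

Once the budget is exhausted the curator stops answering, which is what blocks the noise-averaging attack.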