How does big data impact privacy? Which is a bigger risk to your privacy, being part of a little database or a big database?
Rows vs Columns
People commonly speak of big data in terms of volume—the “four v’s” of big data being volume, variety, velocity, and veracity—but what we’re concerned with here might better be called “area.” We’ll think of our data being in one big table. If there are repeated measures on an individual, think of them as more columns in a denormalized database table.
In what sense is the data big: is it wide or long? That is, if we think of the data as a table with rows for individuals and columns for different fields of information on individuals, are there a lot of rows or a lot of columns?
All else being equal, your privacy goes down as columns go up. The more information someone has about you, the more likely some of it may be used in combination to identify you.
How privacy varies with the number of rows is more complicated. Your privacy could go up or down with the number of rows.
The more individuals in a dataset, the more likely there are individuals like you in the dataset. From the standpoint of k-anonymity, this makes it harder to identify you, but easier to reveal information about you.
For example, suppose there are 50 people who have all the same quasi-identifiers as you do. Say you’re a Native American man in your 40’s and there are 49 others like you. Then someone who knows your demographics can’t be sure which record is yours; they’d have a 2% chance of guessing the right one. The presence of other people with similar demographics makes you harder to identify. On the other hand, their presence makes it more likely that someone could find out something you all have in common. If the data show that middle aged Native American men are highly susceptible to some disease, then the data imply that you are likely to be susceptible to that disease.
Because privacy measures aimed at protecting individuals don’t necessarily protect groups, some minority groups have been reluctant to participate in scientific studies. The Genetic Information Nondiscrimination Act of 2008 (GINA) makes some forms of genetic discrimination illegal, but it’s quite understandable that minority groups might still be reluctant to participate in studies.
Improving privacy with size
Your privacy can improve as a dataset gets bigger. If there are a lot of rows, the data curator can afford a substantial amount of randomization without compromising the value of the data. The noise in the data will not effect statistical conclusions from the data but will protect individual privacy.
With differential privacy, the data is held by a trusted curator. Noise is added not to the data itself but to the results of queries on the data. Noise is added in proportion to the sensitivity of a query. The sensitivity of a query often goes down with the size of the database, and so a differentially private query of a big dataset may only need to add a negligible amount of noise to maintain privacy.
If the dataset is very large, it may be possible to randomize the data itself before it enters the database using randomized response or local differential privacy. With these approaches, there’s no need for a trusted data curator. This wouldn’t be possible with a small dataset because the noise would be too large relative to the size of the data.