# Big aggregate queries can still violate privacy

Suppose you want to prevent your data science team from finding out information about individual customers, while still allowing them to compute overall statistics. So you implement two policies.

1. Data scientists can only query aggregate statistics, such as counts and averages.
2. These aggregate statistics must be computed over at least 1,000 database rows.

This sounds good, but it’s naive: these two rules are not enough to protect customer privacy.

Someone wants to know how much money customer 123456789 makes. If he asks for this person’s income directly, the query is blocked by the first rule. If he asks for the average income of customers with ID 123456789, he gets past the first rule but not the second, since the result is based on a single row.

He decides to test whether the customer in question makes a six-figure income. So first he queries for the number of customers with income over \$100,000. This is a count, so it gets past the first rule. The result turns out to be 14,254, so it gets past the second rule as well. Now he asks how many customers with ID not equal to 123456789 have income over \$100,000. This is a valid query as well, and returns 14,253. So by executing only queries that return aggregate statistics on thousands of rows, he has found out that customer 123456789 has at least a six-figure income.
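The counting attack above can be sketched in a few lines of Python. The customer table and the target ID below are invented stand-ins, with deterministic made-up incomes so the example is reproducible.

```python
# A toy "database": (customer_id, income) pairs with made-up incomes.
# Customer 20 stands in for customer 123456789.
customers = [(cid, 20_000 + (cid * 7919) % 180_000) for cid in range(1, 20_001)]

MIN_ROWS = 1_000  # rule 2: aggregates must cover at least 1,000 rows
target = 20

def count_where(pred):
    """Aggregate-only query: returns a count, refusing small result sets."""
    n = sum(1 for row in customers if pred(row))
    if n < MIN_ROWS:
        raise ValueError("query covers fewer than 1,000 rows")
    return n

rich = count_where(lambda r: r[1] > 100_000)
rich_without = count_where(lambda r: r[1] > 100_000 and r[0] != target)

# Both queries pass both rules, yet their difference reveals whether
# the target makes over $100,000: it is 1 if so, 0 if not.
print(rich - rich_without)  # 1: in this toy data the target is over $100,000
```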

Now he goes back and asks for the average income of customers with income over \$100,000. Then he asks for the average income of customers with income over \$100,000 and with ID not equal to 123456789. With a little algebra, he’s able to find customer 123456789’s exact income.
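The “little algebra” is just: multiply each average by its count to recover sums, then subtract. A sketch with invented figures:

```python
# Hypothetical query results; all figures are invented for illustration.
n = 14_254                # customers with income over $100,000
avg_all = 142_310.25      # average income of that set
avg_without = 142_308.50  # same average, excluding customer 123456789

# sum_all = avg_all * n, sum_without = avg_without * (n - 1),
# and the difference of the two sums is the target's exact income.
income = avg_all * n - avg_without * (n - 1)
print(income)  # 167253.0 with these made-up numbers
```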

You might object that it’s cheating to have a clause such as “ID not equal to 123456789” in a query. Of course it’s cheating. It clearly violates the spirit of the law, but not the letter. You might try to patch the rules by saying you cannot ask questions about a small set, nor about the complement of a small set. [1]

That doesn’t work either. Someone could run queries on customers with ID less than or equal to 123456789 and on customers with ID greater than or equal to 123456789. Both these sets and their complements are large, but the two sets overlap at exactly one customer, so comparing their aggregates reveals information about an individual.
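The two-sided trick can be sketched the same way. Again the data is a toy: deterministic made-up incomes keyed by an invented customer ID.

```python
# Toy data: made-up incomes keyed by customer ID.
customers = {cid: 20_000 + (cid * 7919) % 180_000 for cid in range(1, 20_001)}
target = 10_000  # stand-in for customer 123456789

def avg_and_count(pred):
    """Aggregate-only query: average and count, refusing small result sets."""
    rows = [income for cid, income in customers.items() if pred(cid)]
    if len(rows) < 1_000:
        raise ValueError("query covers fewer than 1,000 rows")
    return sum(rows) / len(rows), len(rows)

low_avg, low_n = avg_and_count(lambda cid: cid <= target)    # 10,000 rows
high_avg, high_n = avg_and_count(lambda cid: cid >= target)  # 10,001 rows
all_avg, all_n = avg_and_count(lambda cid: True)             # 20,000 rows

# The two half-spaces overlap only at the target, so
# sum_low + sum_high - sum_all equals the target's income.
recovered = low_avg * low_n + high_avg * high_n - all_avg * all_n
assert round(recovered) == customers[target]
```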

You may be asking why a data scientist would have access to customer IDs at all. Obviously you wouldn’t allow that if you wanted to protect customer privacy. The point of this example is that limiting queries to aggregates of large sets is not enough: you can deduce information about small sets from information about large sets. This could still be a problem even with obvious identifiers removed.


***

[1] Readers familiar with measure theory might sense a σ-algebra lurking in the background.

## 6 thoughts on “Big aggregate queries can still violate privacy”

1. Jonathan

This is just an implementation detail, but why not (Sum of all salaries) – (Sum of all salaries where id != 123456789)? If ‘sum’ isn’t one of the aggregating functions allowed, well, sum = avg * count.
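Jonathan’s sum-difference variant, sketched with invented figures and the sum reconstructed as average × count:

```python
# Invented query results, for illustration only.
n_all = 20_000            # total number of customers
avg_all = 110_123.45      # average salary over everyone
n_but_one = 19_999
avg_but_one = 110_120.60  # average salary where id != 123456789

# sum = avg * count, and the difference of the two sums is one salary.
salary = avg_all * n_all - avg_but_one * n_but_one
print(round(salary, 2))  # 167120.6 with these made-up numbers
```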

2. Joe Mc Swiney

How about a third rule: the solution space described by successive queries has to have more than unit width in every dimension. The challenge is that if you apply this to every query by any analyst (to avoid collaboration on data discovery), the rule will eventually be broken innocently. The possible queries will diminish and the database will lock up.
Fundamentally it is not possible with current technology to indefinitely protect data that exists. Destruction is the only reliable protection. Otherwise you can only aim for a minimum time between breaches and try to make it 25 years or so.

3. Why is customer ID actually available to the analyst? Wouldn’t that be exactly something that we don’t want to make available in any way?

4. I address that in the last paragraph of the post.

5. Jonathan

As you say, in practice you wouldn’t give access to the customer_id column, but there might be other columns that are almost as uniquifying, and you’d want to restrict those too. I wonder if you could look at the information-theoretic entropy of each column and only allow access to those columns where it is low. Maybe there’s even some theorem you could prove giving an upper bound on the entropy no matter how you intersected the sets (i.e. added conditions to the WHERE clause)?
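The entropy idea in this comment could be sketched as follows. The toy columns and the notion of a cutoff are assumptions, not a worked-out scheme.

```python
from collections import Counter
from math import log2

def column_entropy(values):
    """Shannon entropy (in bits) of a column's empirical distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

# Toy columns: a near-unique ID column vs. a coarse category column.
ids = list(range(1000))                 # every value distinct
regions = [i % 4 for i in range(1000)]  # four equally likely values

print(column_entropy(ids))      # ~ log2(1000) ≈ 9.97 bits: very identifying
print(column_entropy(regions))  # 2.0 bits: coarse enough to consider exposing
```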