Differential privacy is a strong form of privacy protection with a solid mathematical definition.
Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not. This intuitive idea can be made precise as follows.
Queries and algorithms
First of all, differential privacy is something that applies to queries, not to databases. It could apply to a database if your query is to select everything in the database, but typically you want to run far more specific queries. Differential privacy adds noise to protect privacy, and the more intrusive the query, the more noise must be added, so typically you want queries to be more narrow than “Tell me everything about everything.”
Generally the differential privacy literature speaks of algorithms rather than queries. The two are the same if you have a general idea of what a query is, but using the word algorithm emphasizes that the query need not be a simple SQL query.
Opt-out and neighboring databases
To quantify the idea of “whether your information is included or not” we look at two versions of a database, D and D‘, that differ by one row, which you can think of as the row containing your data. Our formalism is symmetric, so we do not specify which database, D or D‘, is the one which contains your row and which is missing your row.
We should mention some fine print here. The paragraph above implicitly assumes that your database is just a simple table. In general we can speak of neighboring databases. In the case of a more complex database design, the notion of what it means for two databases to be neighboring is more complex. Differential privacy can be defined whenever a notion of neighboring databases can be defined, so you could, for example, consider differential privacy for a non-relational (“No SQL”) database. But for the simple case of a single table, two databases are neighboring if they differ by a row.
However, there’s a subtlety regarding what it means even for two tables to “differ by one row.” Unbound differential privacy assumes one row has been removed from one of the databases. Bound differential privacy assumes the two databases have the same number of rows, but the data in one row has been changed. In practice the difference between bound and unbound differential privacy doesn’t often matter.
We said it “makes little difference” whether an individual’s data is included or not. How to we quantify that statement? Differential privacy is a stochastic procedure, so we need to measure difference in terms of probability.
Whenever mathematicians want to speak of two things being close together, we often use ε as our symbol. While in principle ε could be large, the choice of symbol implies that you should think of it as a quantity you can make as small as you like (though there may be some cost that increases as ε decreases).
We’re going to look at ratios of probabilities, so we want these ratios to be near 1. This means we’ll work with eε = exp(ε) rather than ε itself. A convenient property that falls out of the Taylor series for the exponential function is that if ε is small then exp(ε) is approximate 1+ε. For example, exp(0.05) = 1.0513, and so if the ratio of two probabilities is bounded by exp(0.05), the probabilities are roughly within 5% of each other.
Definition of differential privacy
Now we can finally state our definition. An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D‘,
Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.
Update: See a follow-on post about Rényi differential privacy, a generalization of differential privacy that measures the difference between output probability distributions using information theory rather than bounding ratios. This includes the definition above a special case.
Help with differential privacy
If you’d like help understanding and implementing differential privacy, let’s talk.