Quantifying privacy loss in a statistical database

In the previous post we looked at a simple randomization procedure to obscure individual responses to yes/no questions in a way that retains the statistical usefulness of the data. In this post we’ll generalize that procedure, quantify the privacy loss, and discuss the utility/privacy trade-off.

More general randomized response

Suppose we have a binary response to some question as a field in our database. With probability t we leave the value alone. Otherwise we replace the answer with the result of a fair coin toss. In the previous post, what we now call t was implicitly equal to 1/2. The value recorded in the database could have come from a coin toss and so the value is not definitive. And yet it does contain some information. The posterior probability that the original answer was 1 (“yes”) is higher if a 1 is recorded. We did this calculation for t = 1/2 last time, and here we’ll look at the result for general t.

If t = 0, the recorded result is always random. The field contains no private information, but it is also statistically useless. At the opposite extreme, t = 1, the recorded result is pure private information and statistically useful. The closer t is to 0, the more privacy we have, and the closer t is to 1, the more useful the data is. We’ll quantify this privacy/utility trade-off below.
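The procedure is short enough to sketch in code. What follows is a minimal illustration, not a production mechanism: the function names `randomize` and `randomize_all` and the `seed` parameter are our own, and Python's `random` module stands in for whatever source of randomness a real system would use.

```python
import random

def randomize(answer, t, rng):
    """Keep the true 0/1 answer with probability t; otherwise record a fair coin toss."""
    if rng.random() < t:
        return answer
    return rng.randint(0, 1)

def randomize_all(answers, t, seed=0):
    """Apply the randomization independently to every answer in a list."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return [randomize(a, t, rng) for a in answers]

truth = [1, 0, 1, 1, 0, 0, 1, 0]
print(randomize_all(truth, t=1.0))  # t = 1 leaves every answer alone
print(randomize_all(truth, t=0.0))  # t = 0 records nothing but coin tosses
```

With t = 1 the output equals the input, and with t = 0 the output is independent of the input, matching the two extremes described above.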

Privacy loss

You can go through an exercise in applying Bayes' theorem as in the previous post to show that the probability that the original response is 1, given that the recorded response is 1, is

\frac{(t+1) p}{2tp -t + 1}

where p is the overall probability of a true response of 1.
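As a sanity check on that expression, here is a small sketch that applies Bayes' theorem step by step and compares the result with the closed form. The function names are ours, chosen for illustration. The two likelihoods come straight from the procedure: a true 1 is recorded as 1 with probability t + (1 - t)/2, and a true 0 is recorded as 1 with probability (1 - t)/2.

```python
def posterior_yes(p, t):
    """P(true = 1 | recorded = 1), computed directly from Bayes' theorem."""
    like_yes = t + (1 - t) / 2  # P(recorded 1 | true 1)
    like_no = (1 - t) / 2       # P(recorded 1 | true 0)
    return like_yes * p / (like_yes * p + like_no * (1 - p))

def closed_form(p, t):
    """The same posterior, using the simplified expression."""
    return (t + 1) * p / (2 * t * p - t + 1)

for p in (0.1, 0.25, 0.5, 0.9):
    for t in (0.0, 0.5, 1.0):
        assert abs(posterior_yes(p, t) - closed_form(p, t)) < 1e-12
```

At t = 0 the posterior reduces to the prior p, and at t = 1 it equals 1, as you would expect.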

The privacy loss associated with an observation of 1 is the gain in information due to that observation. Before knowing that a particular response was 1, our estimate that the true response was 1 would be p; not having any individual data, we use the group mean. But after observing a recorded response of 1, the posterior probability is the expression above. The information gain is the log base 2 of the ratio of these values:

\log_2 \left( \frac{(t+1) p}{2tp - t + 1} \middle/ \ p \right) = \log_2\left( \frac{(t+1)}{2tp - t + 1} \right)

When t = 0, the privacy loss is 0. When t = 1, the loss is -log2(p) bits, i.e. the entire information contained in the response. When t = 1/2, the loss is log2(3/(2p + 1)) bits.
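To get a feel for the numbers, the privacy loss formula can be evaluated directly. This is only a sketch: the function name `privacy_loss_bits` and the value p = 0.25 are our own choices for illustration.

```python
import math

def privacy_loss_bits(p, t):
    """Information gained, in bits, from observing a recorded response of 1."""
    return math.log2((t + 1) / (2 * t * p - t + 1))

p = 0.25  # hypothetical overall probability of a true "yes"
for t in (0.0, 0.5, 1.0):
    print(f"t = {t}: {privacy_loss_bits(p, t):.3f} bits")
```

For p = 0.25 the loss is 0 bits at t = 0 and 2 bits at t = 1, the latter being -log2(0.25).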

Privacy / utility trade-off

We’ve looked at the privacy cost of setting t to various values. What are the statistical costs? Why not make t as small as possible? Well, 0 is a possible value of t, corresponding to complete loss of statistical utility. So we’d expect that small positive values of t make it harder to estimate p.

Each recorded response is a 1 with probability tp + (1 – t)/2. Suppose there are N database records and let S be the sum of the recorded values. Then our estimator for p is

\hat{p} = \frac{\frac{S}{N} - \frac{1-t}{2}}{t}

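The variance of this estimator is q(1 - q)/(N t²), where q = tp + (1 - t)/2 is the probability that a recorded response is 1. For small t this is approximately 1/(4N t²), so the variance is roughly inversely proportional to t² and the width of our confidence intervals for p is roughly proportional to 1/t. Note that the larger N is, the smaller we can afford to make t.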
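The estimator is easy to try out on simulated data. The sketch below is an illustration under our own assumptions: the function names, the seed, and the values of p, t, and N are all hypothetical. The standard error is worked out from the binomial variance of S, which is a step we add here for the sake of the example.

```python
import math
import random

def estimate_p(recorded, t):
    """Invert E[S/N] = t*p + (1 - t)/2 to estimate p from the recorded values."""
    mean = sum(recorded) / len(recorded)
    return (mean - (1 - t) / 2) / t

def standard_error(p, t, n):
    """Standard deviation of the estimator, from the binomial variance of S."""
    q = t * p + (1 - t) / 2  # probability that a recorded value is 1
    return math.sqrt(q * (1 - q) / n) / t

def simulate(p, t, n, seed=0):
    """Draw n true answers with P(1) = p, then randomize each one."""
    rng = random.Random(seed)
    recorded = []
    for _ in range(n):
        truth = 1 if rng.random() < p else 0
        keep = rng.random() < t
        recorded.append(truth if keep else rng.randint(0, 1))
    return recorded

p, n = 0.3, 10_000  # hypothetical true rate and database size
for t in (0.9, 0.5, 0.1):
    est = estimate_p(simulate(p, t, n), t)
    print(f"t = {t}: estimate {est:.3f}, standard error {standard_error(p, t, n):.4f}")
```

Running this with smaller t or smaller N shows the estimates scattering more widely around the true p, which is the utility cost of the added privacy.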

Related posts

Next up: Adding Laplace or Gaussian noise and differential privacy