Handedness, introversion, height, blood type, and PII

I’ve had data privacy on my mind a lot lately because I’ve been doing some consulting projects in that arena.

When I saw a tweet from Tim Hopper a little while ago, my first thought was “How many bits of PII is that?”. [1]

Let’s see. There’s some small correlation between these characteristics, but let’s say they’re independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left handed. But we’ll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness carries -log2(0.1) = 3.3 bits of information.

I don’t know how many people identify as introverts. I believe I’m a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick “introvert” or “extrovert,” maybe half each. So we’ve got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the rarest, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)
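These figures are easy to reproduce. Here's a quick sketch; the blood type frequencies below are approximate US figures and should be treated as assumptions:

```python
import math

def bits(p):
    """Information content (surprisal) in bits of an event with probability p."""
    return -math.log2(p)

# Approximate US blood type frequencies (rough figures, assumed for illustration)
blood = {"O+": 0.374, "A+": 0.357, "B+": 0.085, "AB+": 0.034,
         "O-": 0.066, "A-": 0.063, "B-": 0.015, "AB-": 0.006}

print(bits(0.10))          # left-handedness: about 3.3 bits
print(bits(blood["O+"]))   # about 1.4 bits
print(bits(blood["AB-"]))  # about 7.4 bits

# Average information content (entropy) of blood type
entropy = sum(p * bits(p) for p in blood.values())
print(entropy)             # about 2.2 bits
```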

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we’re headed that way, but as noted above, this is just a quick and dirty calculation.

Heights in general don’t follow a normal distribution, but heights for men and women separately are approximately normal. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then the probability of a man being taller than 6′ 5″ would be about 0.01 and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.005, corresponding to about 7.7 bits of information.
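Under the assumed means and standard deviation, the mixture tail probability is a short calculation:

```python
import math

def normal_tail(x, mu, sigma):
    """P(X > x) for X ~ Normal(mu, sigma), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

h = 77  # 6'5" in inches
p_man = normal_tail(h, 70, 3)     # about 0.01
p_woman = normal_tail(h, 64, 3)   # about 8e-6, essentially zero
p_person = 0.5 * p_man + 0.5 * p_woman

info = -math.log2(p_person)       # about 7.7 bits
```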

All told, there are about 13.4 bits of information in the tweet above, as much information as you’d get after 13 or so questions in a game of Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered affirmatively.

***

[1] PII = Personally Identifiable Information

[2] There are certainly women at least 6′ 5″ tall. I can think of at least one woman I know who may be that tall. Under the model above, the normal approximation puts the probability at roughly 8 × 10^-6. Whatever the true figure, normal tail estimates this many standard deviations from the mean shouldn’t be taken literally; this is where the normal distribution assumption breaks down.

Big aggregate queries can still violate privacy

Suppose you want to prevent your data science team from being able to find out information on individual customers, but you do want them to be able to get overall statistics. So you implement two policies.

  1. Data scientists can only query aggregate statistics, such as counts and averages.
  2. These aggregate statistics must be based on results that return at least 1,000 database rows.

This sounds good, but it’s naive. It’s not enough to protect customer privacy.

Someone wants to know how much money customer 123456789 makes. If he asks for this person’s income, the query would be blocked by the first rule. If he asks for the average income of customers with ID 123456789 then he gets past the first rule but not the second.

He decides to test whether the customer in question makes a six-figure income. So first he queries for the number of customers with income over $100,000. This is a count so it gets past the first rule. The result turns out to be 14,254, so it gets past the second rule as well. Now he asks how many customers with ID not equal to 123456789 have income over $100,000. This is a valid query as well, and returns 14,253. So by executing only queries that return aggregate statistics on thousands of rows, he found out that customer 123456789 has at least a six-figure income.

Now he goes back and asks for the average income of customers with income over $100,000. Then he asks for the average income of customers with income over $100,000 and with ID not equal to 123456789. With a little algebra, he’s able to find customer 123456789’s exact income.
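Here's a sketch of the attack on a toy in-memory table. The customer IDs, the incomes, and the target's income of $137,000 are all made up for illustration:

```python
import random

random.seed(0)

# Synthetic customer table: id -> income (all values hypothetical)
customers = {cid: random.randint(30_000, 250_000) for cid in range(20_000)}
target_id = 123456789
customers[target_id] = 137_000  # the income the attacker will recover

def count_over(threshold, exclude_id=None):
    """Aggregate query: number of customers with income over threshold."""
    return sum(1 for cid, inc in customers.items()
               if inc > threshold and cid != exclude_id)

def avg_over(threshold, exclude_id=None):
    """Aggregate query: average income of customers with income over threshold."""
    rows = [inc for cid, inc in customers.items()
            if inc > threshold and cid != exclude_id]
    return sum(rows) / len(rows)

n_all = count_over(100_000)                         # thousands of rows: allowed
n_rest = count_over(100_000, exclude_id=target_id)  # also thousands of rows
print(n_all - n_rest)   # 1, so the target makes over $100,000

# A little algebra on two allowed averages recovers the exact income
recovered = (avg_over(100_000) * n_all
             - avg_over(100_000, exclude_id=target_id) * n_rest)
print(round(recovered))  # 137000
```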

You might object that it’s cheating to have a clause such as “ID not equal 123456789” in a query. Of course it’s cheating. It clearly violates the spirit of the law, but not the letter. You might try to patch the rules by saying you cannot ask questions about a small set, nor about the complement of a small set. [1]

That doesn’t work either. Someone could run queries on customers with ID less than or equal to 123456789 and on customers with ID greater than or equal to 123456789. Both these sets and their complements are large, yet comparing the results reveals information about an individual.

You may wonder why a data scientist would have access to customer IDs at all. Obviously you wouldn’t allow that if you wanted to protect customer privacy. The point of this example is that limiting queries to aggregates over large sets is not enough: you can deduce information about small sets from information about large sets. This can still be a problem even after obvious identifiers are removed.


***

[1] Readers familiar with measure theory might sense a σ-algebra lurking in the background.

Toxic pairs, re-identification, and information theory

Database fields can combine in subtle ways. For example, nationality is not usually enough to identify anyone. Neither is religion. But the combination of nationality and religion can be surprisingly informative.

Information content of nationality

How much information is contained in nationality? That depends on exactly how you define nations versus territories etc., but for this blog post I’ll take this Wikipedia table for my raw data. You can calculate that nationality has entropy of 5.26 bits. That is, on average, nationality is slightly more informative than asking five independent yes/no questions. (See this post for how to calculate information content.)

Entropy measures expected information content. Knowing that someone is from India (population 1.3 billion) carries only 2.50 bits of information. Knowing that someone is from Vatican City (population 800) carries 23.16 bits of information.

One way to reduce the re-identification risk of PII (personally identifiable information) such as nationality is to combine small categories. Suppose we lump all countries with a population under one million into “other.” Then we go from 240 categories down to 160. This hardly makes any difference to the entropy: it drops from 5.26 bits to 5.25 bits. But the information content for the smallest country on the list is now 8.80 bits rather than 23.16.
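The effect of lumping is easy to demonstrate on a made-up distribution (the numbers below are not the Wikipedia data, just an illustration of the same phenomenon):

```python
import math

def entropy(ps):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Hypothetical population shares: four large countries and twenty tiny ones
big = [0.4, 0.3, 0.2, 0.08]
tiny = [0.001] * 20

H_full = entropy(big + tiny)
H_lumped = entropy(big + [sum(tiny)])  # tiny countries combined into "other"

print(H_full, H_lumped)       # entropy barely changes
print(-math.log2(0.001))      # ~10 bits: surprisal of a tiny country before lumping
print(-math.log2(sum(tiny)))  # ~5.6 bits: after lumping
```

The entropy, an average, hardly moves, but the worst-case surprisal of a rare category drops sharply, which is the whole point of lumping.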

Information content of religion

What about religion? This is also subtle to define, but again I’ll use Wikipedia for my data. Using these numbers, we get an entropy of 2.65 bits. The largest religion, Christianity, has an information content of 1.67 bits. The smallest religion on the list, Rastafari, has an information content of 13.29 bits.

Joint information content

So if nationality carries 5.25 bits of information and religion 2.65 bits, how much information does the combination of nationality and religion carry? At least 5.25 bits, but no more than 5.25 + 2.65 = 7.9 bits on average. For two random variables X and Y, the joint entropy H(X, Y) satisfies

max( H(X), H(Y) ) ≤ H(X, Y) ≤ H(X) + H(Y)

where H(X) and H(Y) are the entropy of X and Y respectively.

Computing the joint entropy exactly would require getting into the joint distribution of nationality and religion. I’d rather not get into this calculation in detail, except to discuss possible toxic pairs. On average, the information content of the combination of nationality and religion is no more than the sum of the information content of each separately. But particular combinations can be highly informative.
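The entropy bounds are easy to verify numerically on a toy joint distribution (made-up numbers, deliberately correlated):

```python
import math

def H(ps):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Hypothetical joint distribution of two correlated binary traits
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal distributions of X and Y
px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

Hx, Hy, Hxy = H(px), H(py), H(joint.values())
assert max(Hx, Hy) <= Hxy <= Hx + Hy
print(Hx, Hy, Hxy)  # joint entropy below Hx + Hy because X and Y are correlated
```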

For example, there are not a lot of Jews living in predominantly Muslim countries. According to one source, there are at least five Jews living in Iraq; other sources put the number at fewer than ten. (There are zero Jews living in Libya.)

Knowing that someone is a Christian living in Mexico, for example, would not be highly informative. But knowing someone is a Jew living in Iraq would be extremely informative.


Adding Laplace or Gaussian noise to database for privacy


In the previous two posts we looked at a randomization scheme for protecting the privacy of a binary response. This post will look briefly at adding noise to continuous or unbounded data. I like to keep the posts here fairly short, but this topic is fairly technical. To keep it short I’ll omit some of the details and give more of an intuitive overview.

Differential privacy

The idea of differential privacy is to guarantee bounds on how much information may be revealed by someone’s participation in a database. These bounds are described by two numbers, ε (epsilon) and δ (delta). We’re primarily interested in the multiplicative bound described by ε. This number is roughly the number of bits of information an analyst might gain regarding an individual. (The multiplicative bound is exp(ε) and so ε, the natural log of the multiplicative bound, would be the information measure, though technically in nats rather than bits since we’re using natural logs rather than logs base 2.)

The δ term is added to the multiplicative bound. Ideally δ is 0, that is, we’d prefer (ε, 0)-differential privacy, but sometimes we have to settle for (ε, δ)-differential privacy. Roughly speaking, the δ term represents the possibility that a few individuals may stand to lose more privacy than the rest, that the multiplicative bound doesn’t apply to everyone. If δ is very small, this risk is very small.

Laplace mechanism

The Laplace distribution is also known as the double exponential distribution because its density function looks like an exponential density together with its mirror image reflected about the y-axis; the two exponential curves join at the origin to create a sort of circus tent shape. The absolute value of a Laplace random variable is an exponential random variable.

Why are we interested in this particular distribution? Because we’re interested in multiplicative bounds, and so it’s not too surprising that exponential distributions make the calculations work out, because of the way the exponential function scales multiplicatively.

The Laplace mechanism adds Laplace-distributed noise to a function. If Δf is the sensitivity of a function f, a measure of how revealing the function might be, then adding Laplace noise with scale Δf/ε preserves (ε, 0)-differential privacy.

Technically, Δf is the l1 sensitivity. We need this detail because the results for Gaussian noise involve l2 sensitivity. This is just a matter of whether we measure sensitivity by the l1 (sum of absolute values) norm or the l2 (root sum of squares) norm.
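A minimal sketch of the mechanism; the choice of query, its sensitivity, and ε are illustrative assumptions (a counting query has l1 sensitivity 1):

```python
import random

def laplace_noise(scale):
    """Sample from a Laplace distribution with the given scale.
    The difference of two independent exponentials is Laplace-distributed."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Release true_answer plus Laplace(sensitivity/epsilon) noise,
    preserving (epsilon, 0)-differential privacy."""
    return true_answer + laplace_noise(sensitivity / epsilon)

# e.g. a counting query (l1 sensitivity 1) released with epsilon = 0.5
noisy_count = laplace_mechanism(14_254, sensitivity=1, epsilon=0.5)
```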

Gaussian mechanism

The Gaussian mechanism protects privacy by adding randomness with a more familiar normal (Gaussian) distribution. Here the results are a little messier. Let ε be strictly between 0 and 1 and pick δ > 0. Then the Gaussian mechanism is (ε, δ)-differentially private provided the scale of the Gaussian noise satisfies

\sigma \geq \sqrt{2 \log(1.25/\delta)}\,\, \frac{\Delta_2 f}{\varepsilon}

It’s not surprising that the l2 norm appears in this context since the normal distribution and l2 norm are closely related. It’s also not surprising that a δ term appears: the Laplace distribution is ideally suited to multiplicative bounds but the normal distribution is not.
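Plugging numbers into the bound above, with illustrative (assumed) values of ε, δ, and sensitivity:

```python
import math

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Smallest noise scale satisfying the Gaussian mechanism bound above.
    Valid for 0 < epsilon < 1 and delta > 0."""
    assert 0 < epsilon < 1 and delta > 0
    return math.sqrt(2 * math.log(1.25 / delta)) * l2_sensitivity / epsilon

sigma = gaussian_sigma(l2_sensitivity=1.0, epsilon=0.5, delta=1e-5)
print(sigma)  # about 9.69
```

Note that shrinking δ only increases σ logarithmically, which is why very small δ is affordable.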


Quantifying privacy loss in a statistical database


In the previous post we looked at a simple randomization procedure to obscure individual responses to yes/no questions in a way that retains the statistical usefulness of the data. In this post we’ll generalize that procedure, quantify the privacy loss, and discuss the utility/privacy trade-off.

More general randomized response

Suppose we have a binary response to some question as a field in our database. With probability t we leave the value alone. Otherwise we replace the answer with the result of a fair coin toss. In the previous post, what we now call t was implicitly equal to 1/2. The value recorded in the database could have come from a coin toss and so the value is not definitive. And yet it does contain some information. The posterior probability that the original answer was 1 (“yes”) is higher if a 1 is recorded. We did this calculation for t = 1/2 last time, and here we’ll look at the result for general t.

If t = 0, the recorded result is always random. The field contains no private information, but it is also statistically useless. At the opposite extreme, t = 1, the recorded result is pure private information and statistically useful. The closer t is to 0, the more privacy we have, and the closer t is to 1, the more useful the data is. We’ll quantify this privacy/utility trade-off below.

Privacy loss

You can go through an exercise in applying Bayes theorem as in the previous post to show that the probability that the original response is 1, given that the recorded response is 1, is

\frac{(t+1) p}{2tp -t + 1}

where p is the overall probability of a true response of 1.

The privacy loss associated with an observation of 1 is the gain in information due to that observation. Before knowing that a particular response was 1, our estimate that the true response was 1 would be p; not having any individual data, we use the group mean. But after observing a recorded response of 1, the posterior probability is the expression above. The information gain is the log base 2 of the ratio of these values:

\log_2 \left( \frac{(t+1) p}{2tp - t + 1} \middle/ \ p \right) = \log_2\left( \frac{(t+1)}{2tp - t + 1} \right)

When t = 0, the privacy loss is 0. When t = 1, the loss is -log2(p) bits, i.e. the entire information contained in the response. When t = 1/2, the loss is log2(3/(2p + 1)) bits.
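The formulas above translate directly into code (a sketch, with t and p as defined in this post):

```python
import math

def posterior(p, t):
    """P(true answer = 1 | recorded answer = 1) under the scheme above:
    keep the true answer with probability t, else record a fair coin flip."""
    return (t + 1) * p / (2 * t * p - t + 1)

def privacy_loss_bits(p, t):
    """Information gained, in bits, from observing a recorded 1."""
    return math.log2(posterior(p, t) / p)

p = 0.25
print(privacy_loss_bits(p, 0))    # 0.0: fully randomized, no privacy loss
print(privacy_loss_bits(p, 1))    # 2.0 = -log2(0.25): the whole answer leaks
print(privacy_loss_bits(p, 0.5))  # log2(3/(2p+1)) = 1.0
```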

Privacy / utility trade-off

We’ve looked at the privacy cost of setting t to various values. What are the statistical costs? Why not make t as small as possible? Well, 0 is a possible value of t, corresponding to complete loss of statistical utility. So we’d expect that small positive values of t make it harder to estimate p.

Each recorded response is a 1 with probability tp + (1 – t)/2. Suppose there are N database records and let S be the sum of the recorded values. Then our estimator for p is

\hat{p} = \frac{\frac{S}{N} - \frac{1-t}{2}}{t}

The variance of this estimator is q(1 – q)/(Nt²), where q = tp + (1 – t)/2 is the probability that a recorded response is 1. So the variance grows like 1/t² as t decreases, and the width of our confidence intervals for p grows like 1/t. Note that the larger N is, the smaller we can afford to make t.
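A quick simulation (with made-up values of p and t) confirms that the estimator recovers p:

```python
import random

random.seed(1)

def record(true_value, t):
    """Keep the true 0/1 value with probability t, else record a coin flip."""
    return true_value if random.random() < t else random.randint(0, 1)

p_true, t, N = 0.3, 0.5, 200_000
responses = [record(int(random.random() < p_true), t) for _ in range(N)]

S = sum(responses)
p_hat = (S / N - (1 - t) / 2) / t
print(p_hat)  # close to 0.3
```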


Next up: Adding Laplace or Gaussian noise and differential privacy

Randomized response, privacy, and Bayes theorem


Suppose you want to gather data on an incriminating question. For example, maybe a statistics professor would like to know how many students cheated on a test. Being a statistician, the professor has a clever way to find out what he wants to know while giving each student deniability.

Randomized response

Each student is asked to flip two coins. If the first coin comes up heads, the student answers the question truthfully, yes or no. Otherwise the student reports “yes” if the second coin came up heads and “no” if it came up tails. Every student has deniability because each “yes” answer may have come from an innocent student who flipped tails on the first coin and heads on the second.

How can the professor estimate p, the proportion of students who cheated? Around half the students will get a head on the first coin and answer truthfully; the rest will look at the second coin and answer yes or no with equal probability. So the expected proportion of yes answers is Y = 0.5p + 0.25, and we can estimate p as 2Y – 0.5.
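A simulation with a hypothetical cheating rate shows the estimator at work:

```python
import random

random.seed(2)

def randomized_answer(cheated):
    """Two coins: heads on the first -> answer truthfully;
    tails -> report the second coin flip instead."""
    if random.random() < 0.5:      # first coin
        return cheated
    return random.random() < 0.5   # second coin

p_true = 0.2   # hypothetical true proportion of cheaters
N = 100_000
answers = [randomized_answer(random.random() < p_true) for _ in range(N)]

Y = sum(answers) / N   # expect 0.5 * p + 0.25
p_est = 2 * Y - 0.5
print(p_est)  # close to 0.2
```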

Database anonymization

The calculations above assume that everyone complied with the protocol, which may not be reasonable. If everyone were honest, there’d be no reason for this exercise in the first place. But we could imagine another scenario. Someone holds a database with identifiers and answers to a yes/no question. The owner of the database could follow the procedure above to introduce randomness in the data before giving the data over to someone else.

Information contained in a randomized response

What can we infer from someone’s randomized response to the cheating question? There’s nothing you can infer with certainty; that’s the point of introducing randomness. But that doesn’t mean that the answers contain no information. If we completely randomized the responses, dispensing with the first coin flip, then the responses would contain no information. The responses do contain information, but not enough to be incriminating.

Let C be a random variable representing whether someone cheated, and let R be their response, following the randomization procedure above. Given a response R = 1, what is the probability that C = 1, i.e. that the person cheated? This is a classic application of Bayes’ theorem.

\begin{eqnarray*} P(C=1 \mid R = 1) &=& \frac{P(R=1 \mid C=1) P(C=1)}{P(R=1\mid C=1)P(C=1) + P(R=1\mid C=0)P(C=0)} \\ &=& \frac{\frac{3}{4} p}{\frac{3}{4} p + \frac{1}{4}(1-p)} \\ &=& \frac{3p}{2p+1} \end{eqnarray*}

If we didn’t know someone’s response, we would estimate their probability of having cheated as p, the group average. But knowing that their response was “yes” we update our estimate to 3p / (2p + 1). At the extremes of p = 0 and p = 1 these coincide. But for any value of p strictly between 0 and 1, our estimate goes up. That is, the probability that someone cheated, conditional on knowing they responded “yes”, is higher than the unconditional probability. In symbols, we have

P(C = 1 | R = 1) > P(C = 1)

when 0 < p < 1. The difference between the left and right sides above is maximized when p = (√3 – 1)/2 ≈ 0.366. That is, a “yes” response tells us the most when about a third of the students cheated. When p = 0.366, P(C = 1 | R = 1) = 0.634, i.e. the posterior probability is almost twice the prior probability.

You could go through a similar exercise with Bayes’ theorem to show that P(C = 1 | R = 0) = p/(3 – 2p), which is less than p provided 0 < p < 1. So if someone answers “yes” to cheating, that does make it more likely that they actually cheated, but not so much more that you can justly accuse them of cheating. (Unless p = 1, in which case you’re in the realm of logic rather than probability: if everyone cheated, then you can conclude that any individual cheated.)
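Both posteriors are easy to check numerically:

```python
def posterior_given_yes(p):
    """P(C = 1 | R = 1) = 3p / (2p + 1) under the two-coin scheme."""
    return 3 * p / (2 * p + 1)

def posterior_given_no(p):
    """P(C = 1 | R = 0) = p / (3 - 2p) under the two-coin scheme."""
    return p / (3 - 2 * p)

p = 0.366  # roughly (sqrt(3) - 1)/2, where a "yes" is most informative
print(posterior_given_yes(p))  # about 0.634, nearly twice the prior
print(posterior_given_no(p))   # about 0.16, less than the prior
```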

Update: See the next post for a more general randomization scheme and more about the trade-off between privacy and utility. The post after that gives an overview of randomization for more general kinds of data.

If you would like help with database de-identification, please let me know.

A statistical problem with “nothing to hide”

One problem with the nothing-to-hide argument is that it assumes innocent people will be exonerated certainly and effortlessly. That is, it assumes that there are no errors, or if there are, they are resolved quickly and easily.

Suppose the probability of correctly analyzing an email or phone call is not 100% but 99.99%. In other words, there’s one chance in 10,000 of an innocent message appearing incriminating. Imagine authorities analyzing one message each from 300,000,000 people, roughly the population of the United States. Then around 30,000 innocent people will have some ‘splaining to do. They will have to interrupt their dinner to answer questions from an agent knocking on their door, or maybe they’ll spend a few weeks in custody. If the legal system is 99.99% reliable, then three of them will go to prison.

Now suppose false positives are really rare, one in a million. If you analyze 100 messages from each person rather than just one, you’re approximately back to the scenario above.
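The arithmetic is worth spelling out, using the population and error rates assumed above:

```python
population = 300_000_000
error_rate = 1e-4   # one in 10,000 messages misread

flagged = population * error_rate   # innocent people flagged
imprisoned = flagged * error_rate   # after a 99.99%-reliable legal system
print(round(flagged), round(imprisoned))  # 30000 3

# One-in-a-million errors, but 100 messages analyzed per person:
flagged_again = population * 100 * 1e-6
print(round(flagged_again))  # 30000, back to the first scenario
```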

Scientists call indiscriminately looking through large amounts of data “a fishing expedition” or “data dredging.” One way to mitigate the problem of massive false positives from data dredging is to demand a hypothesis: before you look through the data, say what you’re hoping to prove and why you think it’s plausible.

The legal analog of a plausible hypothesis is a search warrant. In statistical terms, “probable cause” is a judge’s estimation that the prior probability of a hypothesis is moderately high. Requiring scientists to have a hypothesis and requiring law enforcement to have a search warrant both dramatically reduce the number of false positives.


Place, privacy, and dignity

Richard Weaver argues in Visions of Order that our privacy and dignity depend on our being rooted in space. He predicted that as people become less attached to a geographical place, privacy and dignity erode.

There is something protective about “place”; it means isolation, privacy, and finally identity. … we must again become sensitive enough to realize that “place” means privacy and dignity …

When Weaver wrote those words in 1964, he was concerned about physical mobility. Imagine what he would have thought of online life.

I find it interesting that Weaver links privacy and dignity. There is a great deal of talk about loss of privacy online, but not much about loss of dignity. The loss of dignity is just as real, and more under our control. We may lose privacy through a third party mishandling data, but our loss of dignity we often bring on ourselves.
