Handedness, introversion, height, blood type, and PII

I’ve had data privacy on my mind a lot lately because I’ve been doing some consulting projects in that arena.

When I saw a tweet from Tim Hopper a little while ago, my first thought was “How many bits of PII is that?”. [1]

Let’s see. There’s some small correlation between these characteristics, but let’s say they’re independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left handed. But we’ll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness caries -log2(0.1) = 3.3 bits of information.

I don’t know how many people identify as introverts. I believe I’m a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick “introvert” or “extrovert,” maybe half each. So we’ve got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the most rare, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we’re headed that way, but as noted above, this is just a quick and dirty calculation.

Heights in general don’t follow a normal distribution, but heights for men and women separately follow a normal. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then the probability of a man being taller than 6′ 5″ would be about 0.001 and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.0005, corresponding to about 11 bits of information.

All told, there are 16.7 bits of information in the tweet above, as much information as you’d get after 16 or 17 questions of the game Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered affirmative.

***

[1] PII = Personally Identifiable Information

[2] There are certainly women at least 6′ 5″. I can think of at least one woman I know who may be that tall. So the probability shouldn’t be less than 1 in 7 billion. But the normal approximation gives a probability of 8.8 × 10-15. This is an example of where the normal distribution assumption breaks down in the extremes.

Leave a Reply

Your email address will not be published. Required fields are marked *