Categorical Data Analysis

Categorical data analysis could mean a couple of different things. One is analyzing data that falls into unordered categories (e.g. red, green, and blue) rather than numerical values (e.g. height in centimeters).

Another is using category theory to assist with the analysis of data. Here “category” means something more sophisticated than a list of items you might choose from in a drop-down menu. Instead we’re talking about applied category theory.

So we have ((categorical data) analysis) and (categorical (data analysis)), i.e. analyzing categorical data and categorically analyzing data. The former is far, far more common.

I ran across Alan Agresti’s classic book the other day in a used book store. The image below is from the third (2012) edition. The book store had the first (1990) edition with a more austere cover.

I bought Agresti’s book because it’s a good reference to have. But I was a little disappointed. My first thought was that someone had written a book on category theory and statistics, which is not the case, as far as I know.

The main reference for category theory and statistics is Peter McCullagh’s 2002 paper “What is a statistical model?” That paper raised a lot of interesting ideas, but the statistics community did not take McCullagh’s bait.

commutative diagram for statistical models

Maybe this just wasn’t a fruitful idea. I suspect it is a fruitful idea, but the number of people available to develop it, conversant in both statistics and category theory, is very small. I’ve seen category theory used in mathematical modeling more generally, but not in statistics per se.

At its most basic, category theory asks you to be explicit about the domain and range (codomain) of functions. It would be very helpful if statisticians merely did this. Statistical notation is notoriously bad at saying where a function goes from and to, or even when a function is a function. Just 0th level category theory, defining categories, would be useful. Maybe it would be useful to go on to identifying limits or adjoints, but simply being explicit about “from” and “to” would be a good start.

Category theory is far too abstract to completely carry out a statistical analysis. But it can prompt you to ask questions that check whether your model has any inconsistencies you hadn’t noticed. The idea of a “categorical error” doesn’t differ that much moving from its philosophical meaning under Aristotle to its mathematical meaning under MacLane. Nor does the idea of something being “natural.” One of the primary motivations for creating category theory was to come up with a rigorous definition of what it means for something in math to be “natural.”

Hypothesis testing vs estimation

I was looking at my daughter’s statistics homework recently, and there was a pair of questions about testing the level of lead in drinking water. One question concerned testing whether the water was safe, and the other concerned testing whether the water was unsafe.

There’s something bizarre, even embarrassing, about this. You want to do two things: estimate the amount of lead, and decide what to do in response. But instead of simply doing just that, you do this arcane dance of choosing two hypotheses, one natural and one arbitrary, and treating the two asymmetrically, depending on which one you call the null and which you call the alternative. This asymmetry is the reason you make a distinction between testing whether the water is safe and testing whether it is unsafe.

It’s a weird tangle of estimation and decision making. The decision-making rules implicit in the procedure are not at all transparent. And even though you are testing the level of lead, you’re doing so indirectly.

The Bayesian approach to the problem is much easier to understand. You estimate the probability distribution for the concentration of lead based on all available information. You can plot this distribution and show it to civil engineers, politicians, or anybody else who needs to make a decision. Non-statisticians are much more likely to understand such a plot than the nuances of null and alternative hypotheses, significance, power, and whether you’re testing for safety versus testing for non-safety. (Statisticians are more likely to understand estimation as well.)

In the homework problems, the allowable level of lead was 15 ppm. After obtaining the posterior distribution on the concentration of lead, you could simply estimate the probability that the concentration is above 15 ppm. But you could also calculate the probability that the concentration lies in any other range you’re interested in.

Classical statistics does not allow such probability calculations. Even a confidence interval, something that looks like a probability statement about the concentration of lead, is actually a probability statement about the statistical process being used and not a probability statement about lead concentration per se.

Generalized normal distribution and kurtosis

The generalized normal distribution adds an extra parameter β to the normal (Gaussian) distribution. The probability density function for the generalized normal distribution is

\frac{\beta}{2\sigma \Gamma\left(\frac{1}{\beta}\right )} \exp\left(-\left|\frac{x-\mu}{\sigma} \right|^\beta \right)

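Here the location parameter μ is the mean, but the scale parameter σ is not the standard deviation. The variance is σ² Γ(3/β) / Γ(1/β), so even in the normal case β = 2 the standard deviation is σ/√2 rather than σ.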

For small values of the shape parameter β, the distribution is more sharply pointed in the middle and decays more slowly in the tails. We say the tails are “thick” or “heavy.” When β = 1 the generalized normal distribution reduces to the Laplace distribution.

Here are examples with μ = 0 and σ = 1.

The normal distribution is a special case corresponding to β = 2. Large values of β make the distribution flatter on top and thinner (lighter) in the tails. Again μ = 0 and σ = 1 in the plots below.

thick-tailed generalized normal densities

One way to measure the thickness of probability distribution tails is kurtosis. The normal distribution has kurtosis equal to 3. Smaller values of kurtosis correspond to thinner tails and larger values to thicker tails.

There’s a common misunderstanding that kurtosis measures how pointy the distribution is in the middle. Often that’s the case, and in fact that’s the case for the generalized normal distribution. But it’s not true in general. It’s possible for a distribution to be flat on top and have heavy tails or pointy on top and have thin tails.

Distributions with thinner tails than the normal are called “platykurtic” and distributions with thicker tails than the normal are called “leptokurtic.” The names were based on the misunderstanding mentioned above. The platy– prefix means broad, but it’s not the tails that are broader, it’s the middle. Similarly, the lepto– prefix means “thin”, referring to being pointy in the middle. But leptokurtic distributions have thicker tails!

The kurtosis of the generalized normal distribution is given by

\frac{ \Gamma\left( \frac{5}{\beta} \right ) \Gamma\left( \frac{1}{\beta} \right ) }{\Gamma\left(\frac{3}{\beta}\right)^2}

We can use that to visualize how the kurtosis varies as a function of the shape parameter β.

The Laplace distribution (β = 1) has kurtosis 6 and the normal distribution (β = 2) has kurtosis 3.

You can use the fact that Γ(x) ~ 1/x for small x to show that in the limit as β goes to infinity, the kurtosis approaches 9/5.
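The kurtosis formula above is easy to check numerically. Here is a minimal sketch; the function name `gen_normal_kurtosis` is my own, and it works with log-gamma values only as a precaution against overflow for small β.

```python
from math import exp, lgamma

def gen_normal_kurtosis(beta):
    """Kurtosis of the generalized normal distribution with shape beta > 0."""
    # Gamma(5/b) * Gamma(1/b) / Gamma(3/b)^2, computed on the log scale.
    log_k = lgamma(5 / beta) + lgamma(1 / beta) - 2 * lgamma(3 / beta)
    return exp(log_k)

# Laplace (beta = 1), normal (beta = 2), and progressively flatter shapes.
for b in (1, 2, 8, 1000):
    print(b, gen_normal_kurtosis(b))
```

The values for β = 1 and β = 2 should reproduce the Laplace and normal kurtosis quoted above, and large β should approach 9/5.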

Related post: Computing skewness and kurtosis in one pass

Making sense of a probability problem in the WSJ

Wall Street Journal clipping

Someone wrote to me the other day asking if I could explain a probability example from the Wall Street Journal. (“Proving Investment Success Takes Time,” Spencer Jakab, November 25, 2017.)

Victor Haghani … and two colleagues told several hundred acquaintances who worked in finance that they would flip two coins, one that was normal and the other that was weighted so it came up heads 60% of the time. They asked the people how many flips it would take them to figure out, with a 95% confidence level, which one was the 60% coin. Told to give a “quick guess,” nearly a third said fewer than 10 flips, while the median response was 40. The correct answer is 143.

The anecdote is correct in spirit: it takes longer to discover the better of two options than most people suppose. But it’s jarring to read the answer is precisely 143 when the question hasn’t been stated clearly.

How many flips would it take to figure out which coin is better with a 95% confidence level? For starters, the answer would have to be a distribution, not a single number. You might quickly come to the right conclusion. You might quickly come to the wrong conclusion. You might flip coins for a long time and never come to a conclusion. Maybe there is a way of formulating the problem so that the expected value of the distribution is 143.

How are you to go about flipping the coins? Do you flip both of them, or just flip one coin? For example, you might flip both coins until you are confident that one is better, and conclude that the better one is the one that was designed to come up heads 60% of the time. Or you could just flip one of them and test the hypothesis Prob(heads) = 0.5 versus the alternative Prob(heads) = 0.6. Or maybe you flip one coin two times for every one time you flip the other. Etc.

What do you mean by “95% confidence level”? Is this a frequentist confidence interval? Or do you compute the (Bayesian) predictive probability of arriving at such a confidence level? Are you computing the (Bayesian) posterior model probabilities of two models, one in which the first coin has probability of heads 0.5 and the second has probability 0.6 versus the opposite model?

Do you assume that you know a priori that one coin has probability of heads 0.5 and the other 0.6, or do you not assume this and just want to find the coin with higher probability of heads, and evaluate such a model when in fact the probabilities of heads are as stated?

Are you conducting an experiment with a predetermined sample size of 143? Or are you continuously monitoring the data, stopping when you reach your conclusion?

I leave it as an exercise to the reader to implement the various alternatives suggested above and see whether one of them produces 143 as a result. (I did a back-of-the-envelope calculation that suggests there is one.) So the first question is to reverse engineer which problem statement the article was based on. The second question is to decide which problem formulation you believe would be most appropriate in the context of the article.
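As a starting point for that exercise, here is a sketch of just one of the formulations above: flip both coins n times each, with n fixed in advance, and pick whichever coin shows more heads, breaking ties with a fair guess. The function name `prob_pick_biased` and the tie-breaking rule are my assumptions, not anything stated in the article, and this is not necessarily the formulation behind the number 143.

```python
import numpy as np
from scipy.stats import binom

def prob_pick_biased(n, p_fair=0.5, p_biased=0.6):
    """Probability of choosing the biased coin after n flips of each coin."""
    k = np.arange(n + 1)
    pmf_biased = binom.pmf(k, n, p_biased)
    pmf_fair = binom.pmf(k, n, p_fair)
    # P(fair coin shows strictly fewer than k heads); cdf(-1) is 0.
    fair_below = binom.cdf(k - 1, n, p_fair)
    p_win = np.sum(pmf_biased * fair_below)
    p_tie = np.sum(pmf_biased * pmf_fair)
    # Ties are resolved by guessing, which is right half the time.
    return p_win + 0.5 * p_tie

for n in (10, 40, 143):
    print(n, prob_pick_biased(n))
```

Other formulations, such as sequential stopping or flipping only one coin, would need different code and may give quite different sample sizes.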

How can a statistician help a lawyer?

I’ll be presenting at a webinar on Wednesday, December 13 at 1:00 PM Eastern. The title of the presentation is “Seven questions a statistician can answer for an attorney.”

I will discuss, among other things, when common sense applies and when correct analysis can be counter-intuitive. There will be ample time at the end of the presentation for Q & A.

If you’re interested in attending, you can register here.

ACEDS association of certified e-discovery specialists and LexInsight

Handedness, introversion, height, blood type, and PII

I’ve had data privacy on my mind a lot lately because I’ve been doing some consulting projects in that arena.

When I saw a tweet from Tim Hopper a little while ago, my first thought was “How many bits of PII is that?”. [1]

Let’s see. There’s some small correlation between these characteristics, but let’s say they’re independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left handed. But we’ll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness carries -log2(0.1) = 3.3 bits of information.

I don’t know how many people identify as introverts. I believe I’m a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick “introvert” or “extrovert,” maybe half each. So we’ve got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the most rare, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we’re headed that way, but as noted above, this is just a quick and dirty calculation.

Heights in general don’t follow a normal distribution, but heights for men and women separately follow a normal. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then the probability of a man being taller than 6′ 5″ would be about 0.001 and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.0005, corresponding to about 11 bits of information.

All told, there are 16.7 bits of information in the tweet above, as much information as you’d get after 16 or 17 questions of the game Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered affirmative.


[1] PII = Personally Identifiable Information

[2] There are certainly women at least 6′ 5″. I can think of at least one woman I know who may be that tall. So the probability shouldn’t be less than 1 in 7 billion. But the normal approximation gives a probability of 8.8 × 10^-15. This is an example of where the normal distribution assumption breaks down in the extremes.

Pareto distribution and Benford’s law

The Pareto probability distribution has density

f(x) = \frac{a}{x^{a+1}}

for x ≥ 1 where a > 0 is a shape parameter. The Pareto distribution and the Pareto principle (i.e. “80-20” rule) are named after the same person, the Italian economist Vilfredo Pareto.

Samples from a Pareto distribution obey Benford’s law in the limit as the parameter a goes to zero. That is, the smaller the parameter a, the more closely the distribution of the first digits of the samples comes to following the distribution known as Benford’s law.

Here’s an illustration of this comparing the distribution of 1,000 random samples from a Pareto distribution with shape a = 1 and shape a = 0.2 with the counts expected under Benford’s law.

Distribution of leading digits of Pareto samples in base 10

Note that this has nothing to do with base 10 per se. If we look at the leading digits as expressed in any other base, such as base 16 below, we see the same pattern.

Distribution of leading digits of Pareto samples in base 16
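The comparison described above can be reproduced along the following lines. This is a sketch under my own assumptions: the helper names are hypothetical, the seed is arbitrary, and I use 100,000 samples rather than 1,000 to reduce sampling noise. NumPy's `pareto` draws from the Lomax (Pareto II) distribution, so adding 1 gives the Pareto density on x ≥ 1 used here.

```python
import numpy as np

def leading_digit_freqs(samples, base=10):
    """Relative frequency of each leading digit 1..base-1."""
    logs = np.log(samples) / np.log(base)
    frac = logs - np.floor(logs)            # fractional part of log_base(x)
    digits = np.floor(base ** frac).astype(int)
    digits = np.clip(digits, 1, base - 1)   # guard against rounding at the edges
    counts = np.bincount(digits, minlength=base)[1:]
    return counts / len(samples)

def benford(base=10):
    """Benford's law: P(leading digit = d) = log_base(1 + 1/d)."""
    d = np.arange(1, base)
    return np.log(1 + 1 / d) / np.log(base)

rng = np.random.default_rng(0)
n = 100_000
# Shift Lomax samples by 1 to get Pareto samples on x >= 1.
freqs = {a: leading_digit_freqs(rng.pareto(a, size=n) + 1.0) for a in (1.0, 0.2)}

for a, f in freqs.items():
    print("a =", a, np.round(f, 3))
print("Benford", np.round(benford(), 3))
```

Passing `base=16` to both functions gives the hexadecimal version of the comparison.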

More posts on Benford’s law

More posts on Pareto

Random number generation posts

Random number generation is typically a two step process: first generate a uniformly distributed value, then transform that value to have the desired distribution. The former is the hard part, but also the part more likely to have been done for you in a library. The latter is relatively easy in principle, though some distributions are hard to (efficiently) sample from.
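As a small illustration of the second step, here is a sketch of the inverse-CDF method for turning uniform samples into exponential samples. The exponential distribution is my choice of example, picked because its inverse CDF has a simple closed form; the function name is hypothetical.

```python
import math
import random

def exponential_from_uniform(u, rate):
    """Map a uniform value u in [0, 1) to an exponential(rate) sample."""
    # The exponential CDF is F(x) = 1 - exp(-rate * x); solve F(x) = u for x.
    return -math.log(1.0 - u) / rate

rng = random.Random(42)  # step 1: a library-provided uniform generator
samples = [exponential_from_uniform(rng.random(), rate=2.0) for _ in range(100_000)]

# The mean of an exponential(rate) distribution is 1/rate, here 0.5.
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

Distributions without a convenient inverse CDF need other techniques, which is where efficient sampling gets harder.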

Here are some posts on testing a uniform RNG.

Here’s a book chapter I wrote on testing the transformation of a uniform RNG into some other distribution.

A few posts on manipulating a random number generator.

And finally, a post on a cryptographically secure random number generator.

Quantifying information gain in beta-binomial Bayesian model

The beta-binomial model is the “hello world” example of Bayesian statistics. I would call it a toy model, except it is actually useful. It’s not nearly as complicated as most models used in application, but it illustrates the basics of Bayesian inference. Because it’s a conjugate model, the calculations work out trivially.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the a and b parameters of the prior and the posterior to compute how much information was gained.

    from numpy import log2
    from scipy.integrate import quad
    from scipy.stats import beta

    def infogain(post_a, post_b, prior_a, prior_b):
        # Densities of the posterior (p) and prior (q) beta distributions.
        p = beta(post_a, post_b).pdf
        q = beta(prior_a, prior_b).pdf

        # Kullback-Leibler divergence in bits, integrated numerically over [0, 1].
        (info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
        return info

This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine quad needs some help when the beta distributions become more concentrated.

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so you’re expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

    print( infogain(4, 7, 3, 7) )
    print( infogain(3, 8, 3, 7) )

The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.
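For large parameters, where numerical integration struggles, one option is to use the known closed form for the Kullback-Leibler divergence between two beta distributions instead of quadrature. The sketch below is my own addition, not part of the code above; the function name `infogain_closed_form` is hypothetical, and it relies on `betaln` and `digamma` from `scipy.special`.

```python
from math import log
from scipy.special import betaln, digamma

def infogain_closed_form(post_a, post_b, prior_a, prior_b):
    """KL divergence from beta(prior) to beta(post), in bits, without integration."""
    nats = (
        betaln(prior_a, prior_b) - betaln(post_a, post_b)
        + (post_a - prior_a) * digamma(post_a)
        + (post_b - prior_b) * digamma(post_b)
        + (prior_a - post_a + prior_b - post_b) * digamma(post_a + post_b)
    )
    return nats / log(2)  # convert nats to bits

print(infogain_closed_form(4, 7, 3, 7))
print(infogain_closed_form(3, 8, 3, 7))
# Large parameters are no harder than small ones here.
print(infogain_closed_form(4000, 7000, 3000, 7000))
```

The first two lines should agree with the success and failure examples above to the digits shown there.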

Database anonymization for testing

How do you create a database for testing that is like your production database? It depends on in what way you want the test database to be “like” the production one.

Replacing sensitive data

Companies often use an old version of their production database for testing. But what if the production database has sensitive information that software developers and testers should not have access to?

You can’t completely remove customer phone numbers from the database, for example, if your software handles customer phone numbers. You have to replace sensitive data with modified data. The question becomes how to modify it. Three approaches would be

  1. Use the original data.
  2. Generate completely new artificial data.
  3. Use the real data as a guide to generating new data.

We’ll assume the first option is off the table and consider the pros and cons of the other two options.

For example, suppose you collect customer ages. You could replace customer age with a random two-digit number. That’s fine as far as making sure that forms can display two-digit numbers. But maybe the age values matter. Maybe you want your fictional customers in the test database to have the same age distribution as your real customers. Or maybe you want your fictional customer ages to be correlated with other attributes so that you don’t have 11 year-old retirees or 98 year-old clients who can’t legally purchase alcohol.

Random vs realistic

There are pros and cons to having a realistic test database. A database filled with randomly generated data is likely to find more bugs, but a realistic database is likely to find more important bugs.

Randomly generated data may contain combinations that have yet to occur in the production data, combinations that will cause an error when they do come up in production. Maybe you’ve never sold your product to someone in Louisiana, and there’s a latent bug that will show up the first time someone from Louisiana does order. (For example, Louisiana retains vestiges of French law that make it different from all other states.)

On the other hand, randomly generated data may not find the bugs that affect the most customers. You might want the values in your test database to be distributed similarly to the values in real data so that bugs come up in testing with roughly the same frequency as in production. In that case, you probably want the joint distributions to match and not just the unconditional distributions. If you just match the latter, you could run into oddities such as a large number of teenage retirees as mentioned above.
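To make the difference between matching unconditional distributions and matching joint distributions concrete, here is a toy sketch. The records, field names, and the jitter value are all invented for illustration, and adding jitter is not a privacy guarantee of any kind; it only stands in for whatever modification step you choose.

```python
import random

# Hypothetical toy "production" records: (age, status). Invented values.
production = [(67, "retired"), (71, "retired"), (82, "retired"),
              (16, "student"), (17, "student"), (19, "student")]

def match_marginals(rows, n, rng):
    """Sample each column independently: marginals match, the joint does not."""
    ages = [age for age, _ in rows]
    statuses = [status for _, status in rows]
    return [(rng.choice(ages), rng.choice(statuses)) for _ in range(n)]

def match_joint(rows, n, rng, jitter=2):
    """Sample a status, then an age from rows with that status, plus small noise."""
    statuses = [status for _, status in rows]
    out = []
    for _ in range(n):
        status = rng.choice(statuses)
        age = rng.choice([a for a, s in rows if s == status])
        out.append((age + rng.randint(-jitter, jitter), status))
    return out

def teenage_retirees(rows):
    """Count implausible records: retired and under 20."""
    return sum(1 for age, status in rows if status == "retired" and age < 20)

rng = random.Random(0)
independent = match_marginals(production, 1000, rng)
conditional = match_joint(production, 1000, rng)
print(teenage_retirees(independent), teenage_retirees(conditional))
```

Sampling the columns independently produces many teenage retirees, while sampling age conditional on status produces none.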

So do you want a random test database or a realistic test database? Maybe both. It depends on your purposes and priorities. You might want to start by testing against a realistic database so that you first find the bugs that are likely to affect the largest number of customers. Then maybe you switch to a randomized database that is more effective at flushing out problems with edge cases.

How to make a realistic test database

So how would you go about creating a realistic test database that protects customer privacy? The answer depends on several factors. First of all, it depends on what aspects of the real data you want to preserve. Maybe verisimilitude is more important for some fields than others. Once you decide what aspects you want your test database to approximate, how well do you need to approximate them? If you want to do valid statistical analysis on the test database, you may need something sophisticated like differential privacy. But if you just want moderately realistic test cases, you can do something much simpler.

Finally, you have to address your privacy-utility trade-off. What kinds of privacy protection are you ethically and legally obligated to provide? For example, is your data considered PHI under HIPAA regulation? Once your privacy obligations are clear, you look for ways to maximize your utility subject to these privacy constraints.

If you’d like help with this process, let’s talk. We can help you determine what your obligations are and how best to meet them while meeting your business objectives.

Quantifying privacy loss in a statistical database


In the previous post we looked at a simple randomization procedure to obscure individual responses to yes/no questions in a way that retains the statistical usefulness of the data. In this post we’ll generalize that procedure, quantify the privacy loss, and discuss the utility/privacy trade-off.

More general randomized response

Suppose we have a binary response to some question as a field in our database. With probability t we leave the value alone. Otherwise we replace the answer with the result of a fair coin toss. In the previous post, what we now call t was implicitly equal to 1/2. The value recorded in the database could have come from a coin toss and so the value is not definitive. And yet it does contain some information. The posterior probability that the original answer was 1 (“yes”) is higher if a 1 is recorded. We did this calculation for t = 1/2 last time, and here we’ll look at the result for general t.

If t = 0, the recorded result is always random. The field contains no private information, but it is also statistically useless. At the opposite extreme, t = 1, the recorded result is pure private information and statistically useful. The closer t is to 0, the more privacy we have, and the closer t is to 1, the more useful the data is. We’ll quantify this privacy/utility trade-off below.

Privacy loss

You can go through an exercise in applying Bayes theorem as in the previous post to show that the probability that the original response is 1, given that the recorded response is 1, is

\frac{(t+1) p}{2tp -t + 1}

where p is the overall probability of a true response of 1.

The privacy loss associated with an observation of 1 is the gain in information due to that observation. Before knowing that a particular response was 1, our estimate that the true response was 1 would be p; not having any individual data, we use the group mean. But after observing a recorded response of 1, the posterior probability is the expression above. The information gain is the log base 2 of the ratio of these values:

\log_2 \left( \frac{(t+1) p}{2tp - t + 1} \middle/ \ p \right) = \log_2\left( \frac{(t+1)}{2tp - t + 1} \right)

When t = 0, the privacy loss is 0. When t = 1, the loss is -log2(p) bits, i.e. the entire information contained in the response. When t = 1/2, the loss is log2(3/(2p + 1)) bits.
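The posterior probability and privacy loss formulas above translate directly into code. This is a minimal sketch; the function names are mine.

```python
from math import log2

def posterior_yes(t, p):
    """P(original response is 1 | recorded response is 1)."""
    return (t + 1) * p / (2 * t * p - t + 1)

def privacy_loss_bits(t, p):
    """Information gained, in bits, from observing a recorded 1."""
    return log2(posterior_yes(t, p) / p)

p = 0.3
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, privacy_loss_bits(t, p))
```

For a fixed p, the loss rises from 0 at t = 0 to -log2(p) at t = 1.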

Privacy / utility trade-off

We’ve looked at the privacy cost of setting t to various values. What are the statistical costs? Why not make t as small as possible? Well, 0 is a possible value of t, corresponding to complete loss of statistical utility. So we’d expect that small positive values of t make it harder to estimate p.

Each recorded response is a 1 with probability tp + (1 – t)/2. Suppose there are N database records and let S be the sum of the recorded values. Then our estimator for p is

\hat{p} = \frac{\frac{S}{N} - \frac{1-t}{2}}{t}

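The variance of this estimator is q(1 – q)/(N t²), where q = tp + (1 – t)/2 is the probability that a recorded response is 1. For small t the variance is roughly inversely proportional to t², and so the width of our confidence intervals for p is roughly proportional to 1/t. Note that the larger N is, the smaller we can afford to make t.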
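A quick simulation shows the estimator at work. This is a sketch under assumed values: p = 0.3, N = 200,000, and the seed are arbitrary choices of mine, and the function names are hypothetical.

```python
import numpy as np

def randomize(truth, t, rng):
    """Keep each true answer with probability t, else record a fair coin toss."""
    keep = rng.random(truth.shape) < t
    coin = rng.integers(0, 2, size=truth.shape)
    return np.where(keep, truth, coin)

def estimate_p(recorded, t):
    """Invert E[recorded] = t*p + (1 - t)/2 to estimate p."""
    return (recorded.mean() - (1 - t) / 2) / t

rng = np.random.default_rng(1)
p_true, N = 0.3, 200_000
truth = (rng.random(N) < p_true).astype(int)

for t in (0.9, 0.5, 0.1):
    print(t, estimate_p(randomize(truth, t, rng), t))
```

Repeating the loop with different seeds should show the estimates scattering more widely around p as t gets smaller.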


Previous related posts:

Next up: Adding Laplace or Gaussian noise and differential privacy

Randomized response, privacy, and Bayes theorem


Suppose you want to gather data on an incriminating question. For example, maybe a statistics professor would like to know how many students cheated on a test. Being a statistician, the professor has a clever way to find out what he wants to know while giving each student deniability.

Randomized response

Each student is asked to flip two coins. If the first coin comes up heads, the student answers the question truthfully, yes or no. Otherwise the student reports “yes” if the second coin came up heads and “no” if it came up tails. Every student has deniability because each “yes” answer may have come from an innocent student who flipped tails on the first coin and heads on the second.

How can the professor estimate p, the proportion of students who cheated? Around half the students will get a head on the first coin and answer truthfully; the rest will look at the second coin and answer yes or no with equal probability. So the expected proportion of yes answers is Y = 0.5p + 0.25, and we can estimate p as 2Y – 0.5.

Database anonymization

The calculations above assume that everyone complied with the protocol, which may not be reasonable. If everyone were honest, there’d be no reason for this exercise in the first place. But we could imagine another scenario. Someone holds a database with identifiers and answers to a yes/no question. The owner of the database could follow the procedure above to introduce randomness in the data before giving the data over to someone else.

Information contained in a randomized response

What can we infer from someone’s randomized response to the cheating question? There’s nothing you can infer with certainty; that’s the point of introducing randomness. But that doesn’t mean that the answers contain no information. If we completely randomized the responses, dispensing with the first coin flip, then the responses would contain no information. The responses do contain information, but not enough to be incriminating.

Let C be a random variable representing whether someone cheated, and let R be their response, following the randomization procedure above. As before, let p = P(C = 1) be the proportion of students who cheated. Given a response R = 1, what is the probability that C = 1, i.e. that someone cheated? This is a classic application of Bayes’ theorem.

\begin{eqnarray*} P(C=1 \mid R = 1) &=& \frac{P(R=1 \mid C=1) P(C=1)}{P(R=1\mid C=1)P(C=1) + P(R=1\mid C=0)P(C=0)} \\ &=& \frac{\frac{3}{4} p}{\frac{3}{4} p + \frac{1}{4}(1-p)} \\ &=& \frac{3p}{2p+1} \end{eqnarray*}

If we didn’t know someone’s response, we would estimate their probability of having cheated as p, the group average. But knowing that their response was “yes” we update our estimate to 3p / (2p + 1). At the extremes of p = 0 and p = 1 these coincide. But for any value of p strictly between 0 and 1, our estimate goes up. That is, the probability that someone cheated, conditional on knowing they responded “yes”, is higher than the unconditional probability. In symbols, we have

P(C = 1 | R = 1) > P(C = 1)

when 0 < p < 1. The difference between the left and right sides above is maximized when p = (√3 – 1)/2 = 0.366. That is, a “yes” response tells us the most when about 1/3 of the students cheated. When p = 0.366, P(C = 1 | R = 1) = 0.634, i.e. the posterior probability is almost twice the prior probability.

You could go through a similar exercise with Bayes theorem to show that P(C = 1 | R = 0) = p/(3 – 2p), which is less than p provided 0 < p < 1. So if someone answers “yes” to cheating, that does make it more likely that they actually cheated, but not so much more that you can justly accuse them of cheating. (Unless p = 1, in which case you’re in the realm of logic rather than probability: if everyone cheated, then you can conclude that any individual cheated.)
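The claim about where a “yes” response is most informative can be checked numerically with a grid search. This is only a sanity check of the algebra, with a grid resolution I picked arbitrarily.

```python
import numpy as np

p = np.linspace(0.0, 1.0, 100_001)   # candidate values of the prior probability
posterior = 3 * p / (2 * p + 1)      # P(C = 1 | R = 1)
gain = posterior - p                 # how far a "yes" moves the estimate

i = np.argmax(gain)
p_star = p[i]
print(p_star, posterior[i])
```

The maximizer should land near (√3 – 1)/2, with a posterior probability near 0.634.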

Update: See the next post for a more general randomization scheme and more about the trade-off between privacy and utility. The post after that gives an overview of randomization for more general kinds of data.

If you would like help with database de-identification, please let me know.

Quantifying the information content of personal data

It can be surprisingly easy to identify someone from data that’s not directly identifiable. One commonly cited result is that the combination of birth date, zip code, and sex is enough to identify 87% of Americans. This post will look at how to quantify the amount of information contained in such data.

If the answer to a question has probability p, then it contains -log2 p bits of information. Knowing someone’s sex gives you 1 bit of information because -log2(1/2) = 1.

Knowing whether someone can roll their tongue could give you more or less information than knowing their sex. Estimates vary, but say 75% can roll their tongue. Then knowing that someone can roll their tongue gives you 0.415 bits of information, but knowing that they cannot roll their tongue gives you 2 bits of information.

On average, knowing someone’s tongue rolling ability gives you less information than knowing their sex. The average amount of information, or entropy, is

0.75(-log2 0.75) + 0.25(-log2 0.25) = 0.81.

Entropy is maximized when all outcomes are equally likely. But for identifiability, we’re concerned with maximum information as well as average information.

Knowing someone’s zip code gives you a variable amount of information, less for densely populated zip codes and more for sparsely populated zip codes. An average zip code contains about 7,500 people. If we assume a US population of 326,000,000, this means a typical zip code would give us about 15.4 bits of information. (Update: see a more precise calculation here.)

The Safe Harbor provisions of US HIPAA regulations let you use the first three digits of someone’s zip code except when this would represent fewer than 20,000 people, as it would in several sparsely populated areas. Knowing that an American lives in a region of 20,000 people would give you 14 bits of information about that person.

Birth dates are complicated because age distribution is uneven. Knowing that someone’s birth date was over a century ago is highly informative, much more so than knowing it was a couple decades ago. That’s why the Safe Harbor provisions do not allow including age, much less birth date, for people over 90.

Birthdays are simpler than birth dates. Birthdays are not perfectly evenly distributed throughout the year, but they’re close enough for our purposes. If we ignore leap years, a birthday contains -log2(1/365) or about 8.5 bits of information. If we consider leap years, knowing someone was born on a leap day gives us two extra bits of information.

Independent information is additive. I don’t expect there’s much correlation between sex, geographical region, and birthday, so you could add up the bits from each of these information sources. So if you know someone’s sex, their zip code (assuming 7,500 people), and their birthday (not a leap day), then you have 25 bits of information, which may be enough to identify them.
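The arithmetic in this post can be collected into a few lines. This is a sketch using the same rough figures as the text (a population of 326,000,000 and 7,500 people per zip code); the function names are mine.

```python
from math import log2

def bits(prob):
    """Information content, in bits, of an outcome with probability prob."""
    return -log2(prob)

def entropy(probs):
    """Average information, in bits, over a set of outcome probabilities."""
    return sum(p * bits(p) for p in probs if p > 0)

sex_bits = bits(1 / 2)
zip_bits = bits(7_500 / 326_000_000)
birthday_bits = bits(1 / 365)

# Independent sources add.
total_bits = sex_bits + zip_bits + birthday_bits
tongue_entropy = entropy([0.75, 0.25])

print(sex_bits, zip_bits, birthday_bits, total_bits)
print(tongue_entropy)
```

The total comes out just under 25 bits, matching the figure above.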

This post didn’t consider correlated information. For example, suppose you know someone’s zip code and primary language. Those two pieces of information together don’t provide as much information as the sum of the information they provide separately because language and location are correlated. I may discuss the information content of correlated information in a future post. (Update: Here is a post on correlated pairs of data.)

Related: HIPAA de-identification

Negative correlation introduced by success

Suppose you measure people on two independent attributes, X and Y, and take those for whom X+Y is above some threshold. Then even though X and Y are uncorrelated in the full population, they will be negatively correlated in your sample.

This article gives the following example. Suppose beauty and acting ability were uncorrelated. Knowing how attractive someone is would give you no advantage in guessing their acting ability, and vice versa. Suppose further that successful actors have a combination of beauty and acting ability. Then among successful actors, the beautiful would tend to be poor actors, and the unattractive would tend to be good actors.

Here’s a little Python code to illustrate this. We take two independent attributes, distributed like IQs, i.e. normal with mean 100 and standard deviation 15. As the threshold on the sum of the two attributes increases, the correlation between the two attributes in the selected sample becomes more negative.

from numpy import arange
from scipy.stats import norm, pearsonr
import matplotlib.pyplot as plt

# Correlation.
# The function pearsonr returns correlation and a p-value.
def corr(x, y):
    return pearsonr(x, y)[0]

x = norm.rvs(100, 15, 10000)
y = norm.rvs(100, 15, 10000)
z = x + y

span = arange(80, 260, 10)
c = [ corr( x[z > low], y[z > low] ) for low in span ]

plt.plot( span, c )
plt.xlabel( "minimum sum" )
plt.ylabel( "correlation coefficient" )
plt.show()