# Bits of information in a US zip code

If you know someone’s US zip code, how much do you know about them? We can use entropy to measure the amount of information in bits.

To quantify the amount of information in a zip code, we need to know how many zip codes there are, and how evenly people are divided into zip codes.

There are about 43,000 zip codes in the US. The number fluctuates over time due to small adjustments.

Average information is maximized by dividing people into categories as evenly as possible. The information revealed about one particular person is maximized by dividing people into categories as unevenly as possible. To see this, suppose there were only two zip codes. The information we’d expect to learn from a zip code would be maximized if we divided people into two equal groups. Suppose on the other hand you were in one zip code and everyone else in the other. On average, zip code would reveal very little about someone, though it would reveal a lot about you!

If everyone were divided evenly into one of 43,000 zip codes, the amount of information revealed by knowing someone’s zip code would be about 15.4 bits, a little more information than asking 15 independent yes/no questions, each with equally likely answers.
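The 15.4 figure is just the base-2 logarithm of the number of zip codes. A quick check, assuming the round count of 43,000 quoted above:

```python
import math

n_zip_codes = 43_000  # approximate count quoted in the text, not an exact figure
bits_uniform = math.log2(n_zip_codes)

print(round(bits_uniform, 2))  # 15.39
```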

But zip codes are not evenly populated. How much information is there in an actual five-digit zip code? To answer that question we need to know the population of each zip code. That’s a little tricky. Zip codes represent mail delivery points, not geographical areas. Technically the US Census Bureau tracks population by ZCTA (Zip Code Tabulation Area) rather than zip code per se. Population by ZCTA is freely available, but difficult to find. I gave up trying to find the data from official sources but was able to find it here.

We can go through the data and find the probability p of someone living in each ZCTA and add up −p log₂ p over each area. When we do, we find that a ZCTA contains 13.83 bits of information. (We knew it had to be less than 15.4 because uneven distribution reduces entropy.)
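The calculation itself is short once you have a list of populations. Here is a minimal sketch; the function name `entropy_bits` and the toy populations are made up for illustration and are not the ZCTA data used for the 13.83 figure.

```python
import numpy as np

def entropy_bits(populations):
    """Shannon entropy, in bits, of the distribution implied by group counts."""
    counts = np.asarray(populations, dtype=float)
    counts = counts[counts > 0]      # empty areas contribute nothing to the sum
    p = counts / counts.sum()        # probability of living in each area
    return float(-np.sum(p * np.log2(p)))

# Hypothetical populations, for illustration only
toy = [50_000, 30_000, 15_000, 5_000]
even = [25_000] * 4

print(entropy_bits(toy))   # less than 2 bits because the split is uneven
print(entropy_bits(even))  # an even four-way split gives log2(4) = 2 bits
```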

The Safe Harbor provision of US HIPAA law lists zip codes as a quasi-identifier. Five-digit zip codes do not fall under Safe Harbor. Three-digit zip codes (the first three digits of five-digit zip codes) do fall under Safe Harbor, mostly. Some areas are so sparsely populated that even three-digit zip code areas are considered too informative. Any three-digit zip code with fewer than 20,000 people is excluded. You can find a snapshot of the list here, though the list may change over time.

If we repeat our calculation for three-digit zip codes, we find that they carry about 9.17 bits of information. It makes little difference to the result whether you include sparse regions, exclude them, or lump them all into one region.

See the next post on information contained in age, birthday, and birth date.

# Twenty questions and conditional probability

The previous post compared bits of information to answers in a game of Twenty Questions.

The optimal strategy for playing Twenty Questions is for each question to split the remaining possibilities in half. There are a couple of ways to justify this strategy: minmax and average.

The minmax approach is to minimize the worst thing that could happen. The worst thing that could happen after asking a question is for your subject to be in the larger half of possibilities. For example, asking whether someone is left-handed is not a good strategy: the worst-case scenario is “no,” in which case you’ve only narrowed your possibilities by 10%. The best minmax approach is to split the population in half with each question.

The best average approach is also to split the possibilities in half each time. With the handedness example, you learn more if the answer is “yes,” but there’s only a 10% chance that the answer is yes. There’s a 10% chance of gaining 3.322 bits of information, but a 90% chance of only gaining 0.152 bits. So the expected number of bits, the entropy, is 0.469 bits. Entropy is maximized when all outcomes are equally likely. That means you can’t learn more than 1 bit of information on average from a yes/no question, and you learn the most when both possibilities are equally likely.
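The numbers in the handedness example can be reproduced in a few lines. This sketch assumes, as the text does, that 10% of people are left-handed; the helper names are mine.

```python
import math

def surprisal_bits(p):
    """Bits of information gained when an event of probability p occurs."""
    return -math.log2(p)

def binary_entropy_bits(p):
    """Expected bits from a yes/no question whose 'yes' has probability p."""
    return p * surprisal_bits(p) + (1 - p) * surprisal_bits(1 - p)

p_left = 0.10  # assumed share of left-handers

print(round(surprisal_bits(p_left), 3))       # 3.322
print(round(surprisal_bits(1 - p_left), 3))   # 0.152
print(round(binary_entropy_bits(p_left), 3))  # 0.469
print(binary_entropy_bits(0.5))               # 1.0
```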

Now suppose you want to ask about height and sex. As in the previous post, we assume men and women’s heights are normally distributed with means 70 and 64 inches respectively, and both have standard deviation 3 inches.

If you ask whether a person is taller than 67 inches, you split a mixed population of men and women in half. You will learn 1 bit of information from this question, but you’ve put yourself in a suboptimal position for the next question. A height of at least 67 inches selects half of the adult population in general, but it selects a majority of men and a minority of women. And as we discussed above, uneven splits are suboptimal, in the worst case and on average.

If you’re going to ask about height and sex, ask about sex first. If the person is female, ask next whether her height is above 64 inches. But if the person is male, ask whether his height is above 70 inches. That is, you want to split the population evenly at each step conditioning on your previous answer. A cutoff of 67 inches is optimal unconditionally, but suboptimal if you condition on sex.
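To put a number on how much the order of questions matters, here is a rough sketch. It assumes an equal mix of men and women in addition to the height distributions above, and it measures how much a follow-up question about sex can tell you once you know someone is taller than 67 inches.

```python
import math
from scipy.stats import norm

men = norm(70, 3)    # heights in inches, as assumed in the text
women = norm(64, 3)

def binary_entropy_bits(p):
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Probability of being taller than 67 inches within each group
p_tall_men = men.sf(67)      # about 0.84
p_tall_women = women.sf(67)  # about 0.16

# Assumption: the population is half men, half women
p_tall = 0.5 * (p_tall_men + p_tall_women)

# Given "taller than 67 inches", sex is no longer a 50/50 question
p_male_given_tall = 0.5 * p_tall_men / p_tall
bits_from_sex_after_height = binary_entropy_bits(p_male_given_tall)

print(p_tall)                      # 0.5, up to rounding
print(bits_from_sex_after_height)  # roughly 0.63 bits rather than 1
```

Under these assumptions, asking about height at 67 inches and then about sex yields about 1.63 bits in total, while asking about sex and then about height at the sex-specific median yields 2 bits.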

The optimal strategy for Twenty Questions is to ask a question with probability of 1/2 being true, conditional on all previous data. You might get lucky with uneven conditional splits, but on average, and in the worst case, you won’t do as well.

# Quantifying information gain in beta-binomial Bayesian model

The beta-binomial model is the “hello world” example of Bayesian statistics. I would call it a toy model, except it is actually useful. It’s not nearly as complicated as most models used in application, but it illustrates the basics of Bayesian inference. Because it’s a conjugate model, the calculations work out trivially.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the a and b parameters of the prior and the posterior to compute how much information was gained.

    from scipy.integrate import quad
    from scipy.stats import beta
    from numpy import log2

    def infogain(post_a, post_b, prior_a, prior_b):
        # Densities of the posterior (p) and the prior (q)
        p = beta(post_a, post_b).pdf
        q = beta(prior_a, prior_b).pdf

        # K-L divergence in bits, integrated over the support [0, 1]
        (info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
        return info


This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine quad needs some help when the beta distributions become more concentrated.
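One way to sidestep the integration problem is to use the known closed form for the K-L divergence between two beta distributions, which involves only log-beta and digamma functions. The sketch below is an alternative to the numerical integration above, not part of the original code; the name `infogain_closed_form` is mine.

```python
from math import log
from scipy.special import betaln, digamma

def infogain_closed_form(post_a, post_b, prior_a, prior_b):
    """K-L divergence, in bits, from a Beta prior to a Beta posterior."""
    nats = (
        betaln(prior_a, prior_b) - betaln(post_a, post_b)
        + (post_a - prior_a) * digamma(post_a)
        + (post_b - prior_b) * digamma(post_b)
        + (prior_a - post_a + prior_b - post_b) * digamma(post_a + post_b)
    )
    return float(nats / log(2))  # convert nats to bits

# Large parameters, where a generic integrator may struggle
print(infogain_closed_form(4000, 7000, 3000, 7000))
```

Because nothing is integrated numerically, concentrated distributions pose no special difficulty here.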

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so you’re expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

    print( infogain(4, 7, 3, 7) )
    print( infogain(3, 8, 3, 7) )


The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.

# Why is Kullback-Leibler divergence not a distance?

The Kullback-Leibler divergence between two probability distributions is a measure of how different the two distributions are. It is sometimes called a distance, but it’s not a distance in the usual sense because it’s not symmetric. At first this asymmetry may seem like a bug, but it’s a feature. We’ll explain why it’s useful to measure the difference between two probability distributions in an asymmetric way.

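The Kullback-Leibler divergence between two random variables X and Y with densities $f_X$ and $f_Y$ is defined as

$$KL(X \,\|\, Y) = -\int f_X(x) \, \log \frac{f_Y(x)}{f_X(x)} \, dx$$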

This is pronounced/interpreted several ways:

• The divergence from Y to X
• The relative entropy of X with respect to Y
• How well Y approximates X
• The information gain going from the prior Y to the posterior X
• The average surprise in seeing Y when you expected X

A theorem of Gibbs proves that K-L divergence is non-negative. It’s clearly zero if X and Y have the same distribution.

The K-L divergence of two random variables is an expected value, and so it matters which distribution you’re taking the expectation with respect to. That’s why it’s asymmetric.

As an example, consider the probability densities below, one exponential and one gamma with a shape parameter of 2.

The two densities differ mostly on the left end. The exponential distribution believes this region is likely while the gamma does not. This means that an expectation with respect to the exponential distribution will weigh things in this region more heavily. In an information-theoretic sense, an exponential is a better approximation to a gamma than the other way around.

Here’s some Python code to compute the divergences.

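    from scipy.integrate import quad
    from scipy.stats import expon, gamma
    from numpy import inf

    def KL(X, Y):
        # Integrand of the K-L divergence of X with respect to Y
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        # Both densities live on the positive half line
        return quad(f, 0, inf)

    e = expon
    g = gamma(a = 2)

    print( KL(e, g) )
    print( KL(g, e) )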


This returns

    (0.5772156649008394, 1.3799968612282498e-08)
    (0.4227843350984687, 2.7366807708872898e-09)


The first element of each pair is the integral and the second is the error estimate. So apparently both integrals have been computed accurately, and the first is clearly larger. This backs up our expectation that it’s more surprising to see a gamma when expecting an exponential than vice versa.
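These two numbers can also be checked analytically. With an exponential density e^(−x) and a gamma density x e^(−x), the log ratio of the densities is just log x, so the first divergence is −∫ e^(−x) log x dx = γ, the Euler-Mascheroni constant, and the second is ∫ x e^(−x) log x dx = 1 − γ. A small sketch comparing the reported integrals with those closed forms:

```python
import numpy as np

# Values copied from the numerical output quoted in the text
quoted_e_to_g = 0.5772156649008394
quoted_g_to_e = 0.4227843350984687

# Closed forms: gamma and 1 - gamma, with gamma the Euler-Mascheroni constant
err_first = abs(quoted_e_to_g - np.euler_gamma)
err_second = abs(quoted_g_to_e - (1 - np.euler_gamma))

print(err_first)   # well inside the integrator's reported error estimate
print(err_second)
```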

Although K-L divergence is asymmetric in general, it can be symmetric. For example, suppose X and Y are normal random variables with the same variance but different means. Then it would be equally surprising to see either one when expecting the other. You can verify this in the code above by changing the KL function to integrate over the whole real line

    def KL(X, Y):
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        return quad(f, -inf, inf)


and trying an example.

    from scipy.stats import norm

    n1 = norm(1, 1)
    n2 = norm(2, 1)

    print( KL(n1, n2) )
    print( KL(n2, n1) )


This returns

    (0.4999999999999981, 1.2012834963423225e-08)
    (0.5000000000000001, 8.106890774205374e-09)


and so both integrals are equal to within the error in the numerical integration.
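The value 0.5 is what the standard closed form predicts: for two normal distributions with a common standard deviation σ, the K-L divergence is (μ₁ − μ₂)²/(2σ²), which depends on the means only through their squared difference and so is symmetric. A tiny sketch, with a helper name of my own choosing:

```python
def kl_normal_same_variance(mu_x, mu_y, sigma):
    """K-L divergence in nats between N(mu_x, sigma^2) and N(mu_y, sigma^2)."""
    return (mu_x - mu_y) ** 2 / (2 * sigma ** 2)

print(kl_normal_same_variance(1, 2, 1))  # 0.5
print(kl_normal_same_variance(2, 1, 1))  # 0.5
```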

# How much do you expect to learn from that experiment?

Bayesian statisticians often talk about models “learning” as data accumulate. Here’s an example that applies information theory to quantify how much you can learn from an experiment using the same likelihood function but two different priors: a conjugate prior and a robust prior.

Here’s an example from a paper Luis Pericchi and I wrote recently. Suppose X ~ Normal(θ, 1) where the prior on θ is either a standard Cauchy distribution or a normal distribution with mean 0 and variance 2.19. (The variance on the normal was chosen following an example by Jim Berger so that both priors put half their mass on the interval [-1, 1].)

The expected information gain from a single observation using the normal (conjugate) prior was 0.58. The corresponding gain for the Cauchy (robust) prior was 1.20. Because robust priors are more responsive to data, the expected gain in information is larger (in this case twice as large) when using these priors.
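The figure for the normal prior can be reproduced under one assumption that the summary above does not spell out: that "expected information gain" means the expected K-L divergence from prior to posterior, measured in nats. For a normal prior with variance τ² and a normal likelihood with variance σ², that quantity is the mutual information between θ and X, ½ log(1 + τ²/σ²). The Cauchy case has no comparably simple formula and is not attempted here.

```python
import math

prior_variance = 2.19      # variance of the normal prior, from the text
sampling_variance = 1.0    # X ~ Normal(theta, 1)

# Assumed definition: expected K-L divergence from prior to posterior, in nats
expected_gain_nats = 0.5 * math.log(1 + prior_variance / sampling_variance)

print(round(expected_gain_nats, 2))  # 0.58
```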

Related: Quantifying information content

# Learning is not the same as gaining information

Learning is not the same as just gaining information. Sometimes learning means letting go of previously held beliefs. While this is true in life in general, my point here is to show how this holds true when using the mathematical definition of information.

The information content of a probability density function p(x) is given by

$$\int p(x) \, \log p(x) \, dx$$

Suppose we have a Beta(2, 6) prior on the probability of success for a binary outcome.

The prior density has information content 0.597. Then suppose we observe a success. The posterior density is distributed as Beta(3, 6). The posterior density has information 0.516, less information than the prior density.
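These two values can be checked with SciPy, assuming information content means the integral of p log p with natural logarithms, that is, the negative of the differential entropy. The helper name `information` is mine.

```python
from scipy.stats import beta

def information(dist):
    """Integral of p log p (natural log): the negative differential entropy."""
    return -float(dist.entropy())

prior_info = information(beta(2, 6))
posterior_info = information(beta(3, 6))

print(round(prior_info, 3))      # 0.597
print(round(posterior_info, 3))  # 0.516
```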

Observing a success pulled the posterior density toward the right. The posterior density is a little more diffuse than the prior and so has lower information content. In that sense, we know less than before we observed the data! Actually, we’re less certain than we were before observing the data. But if the true probability of response is larger than our prior would indicate, we’re closer to the truth by becoming less confident of our prior belief, and we’ve learned something.

# Probability and information

E. T. Jaynes gave a speech entitled A Backward Look to the Future in which he looked back on his long career as a physicist and statistician. The speech contains several quotes related to my recent post on what a probability means.

Jaynes advocated the view of probability theory as logic extended to include reasoning on incomplete information. Probability need not have anything to do with randomness. Jaynes believed that frequency interpretations of probability are unnecessary and misleading.

… think of probability theory as extended logic, because then probability
distributions are justified in terms of their demonstrable information content, rather than their imagined—and as it now turns out, irrelevant—frequency connections.

He concludes with this summary of his approach to probability.

As soon as we recognize that probabilities do not describe reality—only our information about reality—the gates are wide open to the optimal solution of problems of reasoning from that information.