Handedness, introversion, height, blood type, and PII

I’ve had data privacy on my mind a lot lately because I’ve been doing some consulting projects in that arena.

When I saw a tweet from Tim Hopper a little while ago, my first thought was “How many bits of PII is that?”. [1]

Let’s see. There’s some small correlation between these characteristics, but let’s say they’re independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left handed. But we’ll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness caries -log2(0.1) = 3.3 bits of information.

I don’t know how many people identify as introverts. I believe I’m a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick “introvert” or “extrovert,” maybe half each. So we’ve got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the most rare, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we’re headed that way, but as noted above, this is just a quick and dirty calculation.

Heights in general don’t follow a normal distribution, but heights for men and women separately follow a normal. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then the probability of a man being taller than 6′ 5″ would be about 0.001 and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.0005, corresponding to about 11 bits of information.

All told, there are 16.7 bits of information in the tweet above, as much information as you’d get after 16 or 17 questions of the game Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered affirmative.


[1] PII = Personally Identifiable Information

[2] There are certainly women at least 6′ 5″. I can think of at least one woman I know who may be that tall. So the probability shouldn’t be less than 1 in 7 billion. But the normal approximation gives a probability of 8.8 × 10-15. This is an example of where the normal distribution assumption breaks down in the extremes.

Pareto distribution and Benford’s law

The Pareto probability distribution has density

f(x) = \frac{a}{x^{a+1}}

for x ≥ 1 where a > 0 is a shape parameter. The Pareto distribution and the Pareto principle (i.e. “80-20” rule) are named after the same person, the Italian economist Vilfredo Pareto.

Samples from a Pareto distribution obey Benford’s law in the limit as the parameter a goes to zero. That is, the smaller the parameter a, the more closely the distribution of the first digits of the samples come to following the distribution known as Benford’s law.

Here’s an illustration of this comparing the distribution of 1,000 random samples from a Pareto distribution with shape a = 1 and shape a = 0.2 with the counts expected under Benford’s law.

Distribution of leading digits of Pareto samples in base 10

Note that this has nothing to do with base 10 per se. If we look at the leading digits as expressed in any other base, such as base 16 below, we see the same pattern.

Distribution of leading digits of Pareto samples in base 16

More posts on Benford’s law

More posts on Pareto

Random number generation posts

Random number generation is typically a two step process: first generate a uniformly distributed value, then transform that value to have the desired distribution. The former is the hard part, but also the part more likely to have been done for you in a library. The latter is relatively easy in principle, though some distributions are hard to (efficiently) sample from.

Here are some posts on testing a uniform RNG.

Here’s a book chapter I wrote on testing the transformation of a uniform RNG into some other distribution.

A few posts on manipulating a random number generator.

And finally, a post on a cryptographically secure random number generator.

Quantifying information gain in beta-binomial Bayesian model

The beta-binomial model is the “hello world” example of Bayesian statistics. I would call it a toy model, except it is actually useful. It’s not nearly as complicated as most models used in application, but it illustrates the basics of Bayesian inference. Because it’s a conjugate model, the calculations work out trivially.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the a and b parameters of the prior and the posterior to compute how much information was gained.

    from scipy.integrate import quad
    from scipy.stats import beta as beta
    from scipy import log2

    def infogain(post_a, post_b, prior_a, prior_b):

        p = beta(post_a, post_b).pdf
        q = beta(prior_a, prior_b).pdf

        (info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
        return info

This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine quad needs some help when the beta distributions become more concentrated.

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so your expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

    print( infogain(4, 7, 3, 7) )
    print( infogain(3, 8, 3, 7) )

The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.

Database anonymization for testing

How do you create a database for testing that is like your production database? It depends on in what way you want the test database to be “like” the production one.

Replacing sensitive data

Companies often use an old version of their production database for testing. But what if the production database has sensitive information that software developers and testers should not have access to?

You can’t completely remove customer phone numbers from the database, for example, if your software handles customer phone numbers. You have to replace in sensitive data with modified data. The question becomes how to modify it. Three approaches would be

  1. Use the original data.
  2. Generate completely new artificial data.
  3. Use the real data as a guide to generating new data.

We’ll assume the first option is off the table and consider the pros and cons of the other two options.

For example, suppose you collect customer ages. You could replace customer age with a random two-digit number. That’s fine as far as making sure that forms can display two-digit numbers. But maybe the age values matter. Maybe you want your fictional customers in the test database to have the same age distribution as your real customers. Or maybe you want your fictional customer ages to be correlated with other attributes so that you don’t have 11 year-old retirees or 98 year-old clients who can’t legally purchase alcohol.

Random vs realistic

There are pros and cons to having a realistic test database. A database filled with randomly generated data is likely to find more bugs, but a realistic database is likely to find more important bugs.

Randomly generated data may contain combinations that have yet to occur in the production data, combinations that will cause an error when they do come up in production. Maybe you’ve never sold your product to someone in Louisiana, and there’s a latent bug that will show up the first time someone from Louisiana does order. (For example, Louisiana retains vestiges of French law that make it different from all other states.)

On the other hand, randomly generated data may not find the bugs that affect the most customers. You might want the values in your test database to be distributed similarly to the values in real data so that bugs come up in testing with roughly the same frequency as in production. In that case, you probably want the joint distributions to match and not just the unconditional distributions. If you just match the latter, you could run into oddities such as a large number of teenage retirees as mentioned above.

So do you want a random test database or a realistic test database? Maybe both. It depends on your purposes and priorities. You might want to start by testing against a realistic database so that you first find the bugs that are likely to affect the most number of customers. Then maybe you switch to a randomized database that is more effective at flushing out problems with edge cases.

How to make a realistic test database

So how would you go about creating a realistic test database that protects customer privacy? The answer depends on several factors. First of all, it depends on what aspects of the real data you want to preserve. Maybe verisimilitude is more important for some fields than others. Once you decide what aspects you want your test database to approximate, how well do you need to approximate them? If you want to do valid statistical analysis on the test database, you may need something sophisticated like differential privacy. But if you just want moderately realistic test cases, you can do something much simpler.

Finally, you have to address your privacy-utility trade-off. What kinds of privacy protection are you ethically and legally obligated to provide? For example, is your data consider PHI under HIPAA regulation? Once your privacy obligations are clear, you look for ways to maximize your utility subject to these privacy constraints.

If you’d like help with this process, let’s talk. I can help you determine what your obligations are and how best to meet them while meeting your business objectives.

Quantifying privacy loss in a statistical database


In the previous post we looked at a simple randomization procedure to obscure individual responses to yes/no questions in a way that retains the statistical usefulness of the data. In this post we’ll generalize that procedure, quantify the privacy loss, and discuss the utility/privacy trade-off.

More general randomized response

Suppose we have a binary response to some question as a field in our database. With probability t we leave the value alone. Otherwise we replace the answer with the result of a fair coin toss. In the previous post, what we now call t was implicitly equal to 1/2. The value recorded in the database could have come from a coin toss and so the value is not definitive. And yet it does contain some information. The posterior probability that the original answer was 1 (“yes”) is higher if a 1 is recorded. We did this calculation for t = 1/2 last time, and here we’ll look at the result for general t.

If t = 0, the recorded result is always random. The field contains no private information, but it is also statistically useless. At the opposite extreme, t = 1, the recorded result is pure private information and statistically useful. The closer t is to 0, the more privacy we have, and the closer t is to 1, the more useful the data is. We’ll quantify this privacy/utility trade-off below.

Privacy loss

You can go through an exercise in applying Bayes theorem as in the previous post to show that the probability that the original response is 1, given that the recorded response is 1, is

\frac{(t+1) p}{2tp -t + 1}

where p is the overall probability of a true response of 1.

The privacy loss associated with an observation of 1 is the gain in information due to that observation. Before knowing that a particular response was 1, our estimate that the true response was 1 would be p; not having any individual data, we use the group mean. But after observing a recorded response of 1, the posterior probability is the expression above. The information gain is the log base 2 of the ratio of these values:

\log_2 \left( \frac{(t+1) p}{2tp - t + 1} \middle/ \ p \right) = \log_2\left( \frac{(t+1)}{2tp - t + 1} \right)

When t = 0, the privacy loss is 0. When t = 1, the loss is -log2(p) bits, i.e. the entire information contained in the response. When t = 1/2, the loss is -log2(3/(2p + 1)) bits.

Privacy / utility trade-off

We’ve looked at the privacy cost of setting t to various values. What are the statistical costs? Why not make t as small as possible? Well, 0 is a possible value of t, corresponding to complete loss of statistical utility. So we’d expect that small positive values of t make it harder to estimate p.

Each recorded response is a 1 with probability tp + (1 – t)/2. Suppose there are N database records and let S be the sum of the recorded values. Then our estimator for p is

\hat{p} = \frac{\frac{S}{N} - \frac{1-t}{2}}{t}

The variance of this estimator is inversely proportional to t, and so the width of our confidence intervals for p are proportional to 1/√t. Note that the larger N is, the smaller we can afford to make t.


Previous related posts:

Next up: Adding Laplace or Gaussian noise and differential privacy

Randomized response, privacy, and Bayes theorem

blurred lights

Suppose you want to gather data on an incriminating question. For example, maybe a statistics professor would like to know how many students cheated on a test. Being a statistician, the professor has a clever way to find out what he wants to know while giving each student deniability.

Randomized response

Each student is asked to flip two coins. If the first coin comes up heads, the student answers the question truthfully, yes or no. Otherwise the student reports “yes” if the second coin came up heads and “no” it came up tails. Every student has deniability because each “yes” answer may have come from an innocent student who flipped tails on the first coin and heads on the second.

How can the professor estimate p, the proportion of students who cheated? Around half the students will get a head on the first coin and answer truthfully; the rest will look at the second coin and answer yes or no with equal probability. So the expected proportion of yes answers is Y = 0.5p + 0.25, and we can estimate p as 2Y – 0.5.

Database anonymization

The calculations above assume that everyone complied with the protocol, which may not be reasonable. If everyone were honest, there’d be no reason for this exercise in the first place. But we could imagine another scenario. Someone holds a database with identifiers and answers to a yes/no question. The owner of the database could follow the procedure above to introduce randomness in the data before giving the data over to someone else.

Information contained in a randomized response

What can we infer from someone’s randomized response to the cheating question? There’s nothing you can infer with certainty; that’s the point of introducing randomness. But that doesn’t mean that the answers contain no information. If we completely randomized the responses, dispensing with the first coin flip, then the responses would contain no information. The responses do contain information, but not enough to be incriminating.

Let C be a random variable representing whether someone cheated, and let R be their response, following the randomization procedure above. Given a response R = 1, what is the probability p that C = 1, i.e. that someone cheated? This is a classic application of Bayes’ theorem.

\begin{eqnarray*} P(C=1 \mid R = 1) &=& \frac{P(R=1 \mid C=1) P(C=1)}{P(R=1\mid C=1)P(C=1) + P(R=1\mid C=0)P(C=0)} \\ &=& \frac{\frac{3}{4} p}{\frac{3}{4} p + \frac{1}{4}(1-p)} \\ &=& \frac{3p}{2p+1} \end{eqnarray*}

If we didn’t know someone’s response, we would estimate their probability of having cheated as p, the group average. But knowing that their response was “yes” we update our estimate to 3p / (2p + 1). At the extremes of p = 0 and p = 1 these coincide. But for any value of p strictly between 0 and 1, our estimate goes up. That is, the probability that someone cheated, conditional on knowing they responded “yes”, is higher than the unconditional probability. In symbols, we have

P(C = 1 | R = 1) > P(C = 1)

when 0 < < 1. The difference between the left and right sides above is maximized when p = (√3 – 1)/2 = 0.366. That is, a “yes” response tells us the most when about 1/3 of the students cheated. When p = 0.366, P(= 1 | R= 1) = 0.634, i.e. the posterior probability is almost twice the prior probability.

You could go through a similar exercise with Bayes theorem to show that P(C = 1 | R = 0) = p/(3 – 2p), which is less than p provided 0 < p < 1. So if someone answers “yes” to cheating, that does make it more likely that the actually cheated, but not so much more that you can justly accuse them of cheating. (Unless p = 1, in which case you’re in the realm of logic rather than probability: if everyone cheated, then you can conclude that any individual cheated.)

Update: See the next post for a more general randomization scheme and more about the trade-off between privacy and utility. The post after that gives an overview of randomization for more general kinds of data.

If you would like help with database de-identification, please let me know.

Quantifying the information content of personal data

It can be surprisingly easy to identify someone from data that’s not directly identifiable. One commonly cited result is that the combination of birth date, zip code, and sex is enough to identify 87% of Americans. This post will look at how to quantify the amount of information contained in such data.

If the answer to a question has probability p, then it contains -log2 p bits of information. Knowing someone’s sex gives you 1 bit of information because -log2(1/2) = 1.

Knowing whether someone can roll their tongue could give you more or less information than knowing their sex. Estimates vary, but say 75% can roll their tongue. Then knowing that someone can roll their tongue gives you 0.415 bits of information, but knowing that they cannot roll their tongue gives you 2 bits of information.

On average, knowing someone’s tongue rolling ability gives you less information than knowing their sex. The average amount of information, or entropy, is

0.75(-log2 0.75) + 0.25(-log2 0.25) = 0.81.

Entropy is maximized when all outcomes are equally likely. But for identifiability, we’re concerned with maximum information as well as average information.

Knowing someone’s zip code gives you a variable amount of information, less for densely populated zip codes and more for sparsely populated zip codes. An average zip code contains about 7,500 people. If we assume a US population of 326,000,000, this means a typical zip code would give us about 15.4 bits of information.

The Safe Harbor provisions of US HIPAA regulations let you use the first three digits of someone’s zip code except when this would represent less than 20,000 people, as it would in several sparsely populated areas. Knowing that an American lives in a region of 20,000 people would give you 14 bits of information about that person.

Birth dates are complicated because age distribution is uneven. Knowing that someone’s birth date was over a century ago is highly informative, much more so than knowing it was a couple decades ago. That’s why the Safe Harbor provisions do not allow including age, much less birth date, for people over 90.

Birthdays are simpler than birth dates. Birthdays are not perfectly evenly distributed throughout the year, but they’re close enough for our purposes. If we ignore leap years, a birthday contains -log2(1/365) or about 8.5 bits of information. If we consider leap years, knowing someone was born on a leap day gives us two extra bits of information.

Independent information is additive. I don’t expect there’s much correlation between sex, geographical region, and birthday, so you could add up the bits from each of these information sources. So if you know someone’s sex, their zip code (assuming 7,500 people), and their birthday (not a leap day), then you have 25 bits of information, which may be enough to identify them.

This post didn’t consider correlated information. For example, suppose you know someone’s zip code and primary language. Those two pieces of information together don’t provide as much information as the sum of the information they provide separately because language and location are correlated. I may discuss the information content of correlated information in a future post. (Update: Here is a post on correlated pairs of data.)

RelatedHIPAA de-identification

Negative correlation introduced by success

Suppose you measure people on two independent attributes, X and Y, and take those for whom X+Y is above some threshold. Then even though X and Y are uncorrelated in the full population, they will be negatively correlated in your sample.

This article gives the following example. Suppose beauty and acting ability were uncorrelated. Knowing how attractive someone is would give you no advantage in guessing their acting ability, and vice versa. Suppose further that successful actors have a combination of beauty and acting ability. Then among successful actors, the beautiful would tend to be poor actors, and the unattractive would tend to be good actors.

Here’s a little Python code to illustrate this. We take two independent attributes, distributed like IQs, i.e. normal with mean 100 and standard deviation 15. As the sum of the two attributes increases, the correlation between the two attributes becomes more negative.

from numpy import arange
from scipy.stats import norm, pearsonr
import matplotlib.pyplot as plt

# Correlation.
# The function pearsonr returns correlation and a p-value.
def corr(x, y):
    return pearsonr(x, y)[0]

x = norm.rvs(100, 15, 10000)
y = norm.rvs(100, 15, 10000)
z = x + y

span = arange(80, 260, 10)
c = [ corr( x[z > low], y[z > low] ) for low in span ]

plt.plot( span, c )
plt.xlabel( "minimum sum" )
plt.ylabel( "correlation coefficient" )

Bayesian methods at Bletchley Park

From Nick Patterson’s interview on Talking Machines:

GCHQ in the ’70s, we thought of ourselves as completely Bayesian statisticians. All our data analysis was completely Bayesian, and that was a direct inheritance from Alan Turing. I’m not sure this has ever really been published, but Turing, almost as a sideline during his cryptoanalytic work, reinvented Bayesian statistics for himself. The work against Enigma and other German ciphers was fully Bayesian. …

Bayesian statistics was an extreme minority discipline in the ’70s. In academia, I only really know of two people who were working majorly in the field, Jimmy Savage … in the States and Dennis Lindley in Britain. And they were regarded as fringe figures in the statistics community. It’s extremely different now. The reason is that Bayesian statistics works. So eventually truth will out. There are many, many problems where Bayesian methods are obviously the right thing to do. But in the ’70s we understood that already in Britain in the classified environment.

Alan Turing

Testing the PCG random number generator

M. E. O’Neill’s PCG family of random number generators looks very promising. It appears to have excellent statistical and cryptographic properties. And it takes remarkably little code to implement. (PCG stands for Permuted Congruential Generator.)

The journal article announcing PCG gives the results of testing it with the TestU01 test suite. I wanted to try it out by testing it with the DIEHARDER test suite (Robert G. Brown’s extension of George Marsaglia’s DIEHARD test suite) and the NIST Statistical Test Suite. I used what the generator’s website calls the “minimal C implementation.”

The preprint of the journal article is dated 2015 but apparently hasn’t been published yet.

Update: See the very informative note by the author of PCG in the comments below.

For the NIST test suite, I generated 10,000,000 bits and divided them into 10 streams.

For the DIEHARDER test suite, I generated 800,000,000 unsigned 32-bit integers. (DIEHARDER requires a lot of random numbers as input.)

For both test suites I used the seed (state) 20170707105851 and sequence constant (inc) 42.

The PCG generator did well on all the NIST tests. For every test, at least least 9 out of 10 streams passed. The test authors say you should expect at least 8 out of 10 streams to pass.

Here’s an excerpt from the results. You can find the full results here.

  2   0   2        0  0.213309     10/10   Frequency
  0   0   1        3  0.534146     10/10   BlockFrequency
  3   0   0        0  0.350485     10/10   CumulativeSums
  1   1   0        2  0.350485     10/10   CumulativeSums
  0   2   2        1  0.911413     10/10   Runs
  0   0   1        1  0.534146     10/10   LongestRun
  0   1   2        0  0.739918     10/10   Rank
  0   4   0        0  0.122325     10/10   FFT
  1   0   0        1  0.000439     10/10   NonOverlappingTemplate
  2   1   0        0  0.350485      9/10   NonOverlappingTemplate
  0   2   1        0  0.739918     10/10   OverlappingTemplate
  1   1   0        2  0.911413     10/10   Universal
  1   1   0        0  0.017912     10/10   ApproximateEntropy
  1   0   1        1     ----       3/4    RandomExcursions
  0   0   0        1     ----       4/4    RandomExcursions
  2   0   0        0     ----       4/4    RandomExcursionsVariant
  0   0   3        0     ----       4/4    RandomExcursionsVariant
  1   2   3        0  0.350485      9/10   Serial
  1   1   1        0  0.739918     10/10   Serial
  1   2   0        0  0.911413     10/10   LinearComplexity


The DIEHARDER suite has 31 kinds tests, some of which are run many times, making a total of 114 tests. Out of the 114 tests, two returned a weak pass for the PCG input and all the rest passed. A few weak passes are to be expected from running so many tests and so this isn’t a strike against the generator. In fact, it might be suspicious if no tests returned a weak pass.

Here’s an edited version of the results. The full results are here.

        test_name   |ntup| tsamples |psamples|  p-value |Assessment
   diehard_birthdays|   0|       100|     100|0.46682782|  PASSED
      diehard_operm5|   0|   1000000|     100|0.83602120|  PASSED
  diehard_rank_32x32|   0|     40000|     100|0.11092547|  PASSED
    diehard_rank_6x8|   0|    100000|     100|0.78938803|  PASSED
   diehard_bitstream|   0|   2097152|     100|0.81624396|  PASSED
        diehard_opso|   0|   2097152|     100|0.95589325|  PASSED
        diehard_oqso|   0|   2097152|     100|0.86171368|  PASSED
         diehard_dna|   0|   2097152|     100|0.24812341|  PASSED
diehard_count_1s_str|   0|    256000|     100|0.75417270|  PASSED
diehard_count_1s_byt|   0|    256000|     100|0.25725000|  PASSED
 diehard_parking_lot|   0|     12000|     100|0.59288414|  PASSED
    diehard_2dsphere|   2|      8000|     100|0.79652706|  PASSED
    diehard_3dsphere|   3|      4000|     100|0.14978100|  PASSED
     diehard_squeeze|   0|    100000|     100|0.35356584|  PASSED
        diehard_sums|   0|       100|     100|0.04522121|  PASSED
        diehard_runs|   0|    100000|     100|0.39739835|  PASSED
        diehard_runs|   0|    100000|     100|0.99128296|  PASSED
       diehard_craps|   0|    200000|     100|0.64934221|  PASSED
       diehard_craps|   0|    200000|     100|0.27352733|  PASSED
 marsaglia_tsang_gcd|   0|  10000000|     100|0.10570816|  PASSED
 marsaglia_tsang_gcd|   0|  10000000|     100|0.00267789|   WEAK
         sts_monobit|   1|    100000|     100|0.98166534|  PASSED
            sts_runs|   2|    100000|     100|0.05017630|  PASSED
          sts_serial|   1|    100000|     100|0.95153782|  PASSED
          sts_serial|  16|    100000|     100|0.59342390|  PASSED
         rgb_bitdist|   1|    100000|     100|0.50763759|  PASSED
         rgb_bitdist|  12|    100000|     100|0.98576422|  PASSED
rgb_minimum_distance|   2|     10000|    1000|0.23378443|  PASSED
rgb_minimum_distance|   5|     10000|    1000|0.13215367|  PASSED
    rgb_permutations|   2|    100000|     100|0.54142546|  PASSED
    rgb_permutations|   5|    100000|     100|0.96040216|  PASSED
      rgb_lagged_sum|   0|   1000000|     100|0.66587166|  PASSED
      rgb_lagged_sum|  31|   1000000|     100|0.00183752|   WEAK
      rgb_lagged_sum|  32|   1000000|     100|0.13582393|  PASSED
     rgb_kstest_test|   0|     10000|    1000|0.74708548|  PASSED
     dab_bytedistrib|   0|  51200000|       1|0.30789191|  PASSED
             dab_dct| 256|     50000|       1|0.89665788|  PASSED
        dab_filltree|  32|  15000000|       1|0.67278231|  PASSED
        dab_filltree|  32|  15000000|       1|0.35348003|  PASSED
       dab_filltree2|   0|   5000000|       1|0.18749029|  PASSED
       dab_filltree2|   1|   5000000|       1|0.92600020|  PASSED

Effective sample size for MCMC

In applications we’d like to draw independent random samples from complicated probability distributions, often the posterior distribution on parameters in a Bayesian analysis. Most of the time this is impractical.

MCMC (Markov Chain Monte Carlo) gives us a way around this impasse. It lets us draw samples from practically any probability distribution. But there’s a catch: the samples are not independent. This lack of independence means that all the familiar theory on convergence of sums of random variables goes out the window.

There’s not much theory to guide assessing the convergence of sums of MCMC samples, but there are heuristics. One of these is effective sample size (ESS). The idea is to have a sort of “exchange rate” between dependent and independent samples. You might want to say, for example, that 1,000 samples from a certain Markov chain are worth about as much as 80 independent samples because the MCMC samples are highly correlated. Or you might want to say that 1,000 samples from a different Markov chain are worth about as much as 300 independent samples because although the MCMC samples are dependent, they’re weakly correlated.

Here’s the definition of ESS:

\mbox{ESS} = \frac{n}{1 + 2\sum_{k=1}^\infty \rho(k)}

where n is the number of samples and ρ(k) is the correlation at lag k.

This behaves well in the extremes. If your samples are independent, your effective samples size equals the actual sample size. If the correlation at lag k decreases extremely slowly, so slowly that the sum in the denominator diverges, your effective sample size is zero.

Any reasonable Markov chain is between the extremes. Zero lag correlation is too much to hope for, but ideally the correlations die off fast enough that the sum in the denominator not only converges but also isn’t a terribly large value.

I’m not sure who first proposed this definition of ESS. There’s a reference to it in Handbook of Markov Chain Monte Carlo where the authors cite a paper [1] in which Radford Neal mentions it. Neal cites B. D. Ripley [2].

Click to learn more about Bayesian statistics consulting


[1] Markov Chain Monte Carlo in Practice: A Roundtable Discussion. Robert E. Kass, Bradley P. Carlin, Andrew Gelman and Radford M. Neal. The American Statistician. Vol. 52, No. 2 (May, 1998), pp. 93-100

[2] Stochlastic Simulation, B. D. Ripley, 1987.

Why do linear prediction confidence regions flare out?

Suppose you’re tracking some object based on its initial position x0 and initial velocity v0. The initial position and initial velocity are estimated from normal distributions with standard deviations σx and σv. (To keep things simple, let’s assume our object is moving in only one dimension and that the distributions around initial position and velocity are independent.)

The confidence region for the object flares out over time, something like the bell of a trumpet.

Why does the region get larger? Because there’s uncertainty in the velocity, and the velocity gets multiplied by elapsed time.

Why isn’t the confidence region a cone? Because that would ignore the uncertainty in the initial position. The result would be too small.

Why isn’t the confidence region a truncated cone? That’s not a bad approximation, though it’s a bit too large. If we ignore probability for a moment and treat confidence intervals as deterministic limits, then we get a truncated cone. For example, suppose assume position and velocity are each within two standard deviations of their estimates. Then we’d estimate position to be between x0 – 2σx + (v0 – 2σv) t on the low end and x0 + 2σx + (v0 + 2σv) t on the high end. This is only an approximation because we’ve ignored probability, and it’s pessimistic because it assumes extreme error values for both estimates at the same time.

So what is the confidence region? It’s some where between the cone and the truncated cone.

The position xt v is the sum of two random variables. The first has variance σx² and the second has variance t² σv². Variances of independent random variables add, so the standard deviation for the sum is

√(σx² + t² σv²) = t √(σx² / t² + σv²)

Note that as t increases, the latter approaches t σv from above. Ignoring the uncertainty in initial position underestimates standard deviation, but the relative error decreases as t increases.

For large t, a confidence interval for position at time t is approximately proportional to t, so the width of the confidence intervals over time look like a cone. But from small t, the dependence on t is less linear and more curved.

Extreme beta distributions

A beta probability distribution has two parameters, a and b. You can think of these as the number of successes and failures out of a+b trials. The PDF of a beta distribution is approximately normal if a and b are approximately equal and ab is large.

If a and b are close, they don’t have to be very large for the beta PDF to be approximately normal. (In all the plots below, the solid blue line is the beta distribution and the dashed orange line is the normal distribution with the same mean and variance.)

beta(9, 11) PDF vs normal

On the other hand, when a and b are very different, the beta distribution can be skewed and far from normal. Note that ab is the same in the example above and below.

beta(2, 18) PDF vs normal

Why the sharp corner above? The beta distribution is only defined on the interval [0, 1] and so the PDF is zero for negative values.

An application came up today that raised an interesting question: What if ab is very large, but a and b are very different? The former works in favor of the normal approximation but the latter works against it.

The application had a low probability of success but a very large number of trials. Specifically, a + b would be on the order of a million, but a would be less than 500. Does the normal approximation hold for these kinds of numbers? Here are some plots to see.

extreme beta distribution pdfs

When a = 500 the normal approximation is very good. It’s still pretty good when a = 50, but not good at all when a = 5.

Update: Mike Anderson suggested using the skewness to quantify how far a beta is from normal. Good idea.

The skewness of a beta(a, b) distribution is

2(ba)√(a + b + 1) / (a + b + 2) √(ab)

Let Nab and assume N is large and a is small, so that NN + 2,  b – a, and N – a are all approximately equal in ratios. Then the skewness is approximately 2 /√a. In other words, once the number of trials is sufficiently large, sknewness hardly depends on the number of trials and only depends on the number of successes.