Categorical Data Analysis

Categorical data analysis could mean a couple different things. One is analyzing data that falls into unordered categories (e.g. red, green, and blue) rather than numerical values (e..g. height in centimeters).

Another is using category theory to assist with the analysis of data. Here “category” means something more sophisticated than a list of items you might choose from in a drop-down menu. Instead we’re talking about applied category theory.

So we have ((categorical data) analysis) and (categorical (data analysis)), i.e. analyzing categorical data and categorically analyzing data. The former is far, far more common.

I ran across Alan Agresti’s classic book the other day in a used book store. The image below if from the third (2012) edition. The book store had the 1st (1990) edition with a more austere cover.

I bought Agresti’s book because it’s a good reference to have. But I was a little disappointed. My first thought was  that someone has written a book on category theory and statistics, which is not the case, as far as I know.

The main reference for category theory and statistics is Peter McCullagh’s 2002 paper What is a statistical model? That paper raised a lot of interesting ideas, but the statistics community did not take McCullagh’s bait.

commutative diagram for statistical models

Maybe this just wasn’t a fruitful idea. I suspect it is a fruitful idea, but the number of people available to develop it, conversant in both statistics and category theory, is very small. I’ve seen category theory used in mathematical modeling more generally, but not in statistics per se.

At its most basic, category theory asks you to be explicit about the domain and range (codomain) of functions. It would be very helpful if statisticians merely did this. Statistical notation is notoriously bad at where a function goes from and to, or even when a function is a function. Just 0th level category theory, defining categories, would be useful. Maybe it would be useful to go on to identifying limits or adjoints, but simply being explicit about “from” and “to” would be a good start.

Category theory is far too abstract to completely carry out a statistical analysis. But it can prompt you to ask questions that check whether your model has any inconsistencies you hadn’t noticed. The idea of a “categorical error” doesn’t differ that much moving from its philosophical meaning under Aristotle to its mathematical meaning under MacLane. Nor does the idea of something being “natural.” One of the primary motivations for creating category theory was to come up with a rigorous definition of what it means for something in math to be “natural.”

Asymmetric surprise

Motivating example: planet spacing

My previous post showed that planets are roughly evenly distributed on a log scale, not just in our solar system but also in extrasolar planetary systems. I hadn’t seen this before I stumbled on it by making some plots.

I didn’t think it was an original discovery—I assume someone did this exercise immediately when systems with several planets were discovered—but I didn’t know what this observation was called. I now know it’s known as the Titius-Bode law, a generalization of an observation about our solar system by Messrs. Titius and Bode a couple centuries ago. See, for example, [1].

Several people were skeptical of the claim that planets are distributed according to a power law and pointed out that uniformly distributed points can look fairly evenly distributed on a logarithmic scale. Which is true, and gets to the topic I want to discuss in this post. Planets are not spaced like uniform random samples (see [1]) and yet it reasonable, at first glance, to ask whether they are.

Asymmetric surprise

If you’re expecting a power law, and you’re given uniformly distributed data, it doesn’t look too surprising. On the other hand, if you’re expecting uniformly distributed data and you see data distributed according to a power law, you are surprised. I’ll formalize this below.

If you’ve ever tried to make a scaled model of our solar system, you were probably surprised that the planets are far from uniformly spaced. A scaled model of our solar system, say at a museum, is likely to position a few of the inner planets to scale, and then use text to explain where the outer planets should be. For example, there may be a footnote saying “And if everything were to scale, Pluto would be behind the Exxon station at the end of the street.” This is an example of implicitly expected a uniform distribution and receiving data distributed according to a power law.

Some people suspected that I was doing the opposite. By plotting distances on a log scale, I’m implicitly expected a power law distribution. Maybe the data were roughly uniform, but I fooled myself into seeing a power law.

Quantifying surprise

The Kullback-Liebler divergence from Y to X, written KL(X || Y), is the average surprise of seeing Y when you expected X. That’s one of the interpretations. See this post for more interpretations.

In general, Kullback-Liebler divergence is not symmetric. The divergence from X to Y typically does not equal the divergence from Y to X. The discussion above claims that the surprise from seeing power law data when expecting a uniform distribution is greater than the surprise from seeing uniform data when expected a power law distribution. We show below that this is true.

Let X be random variable uniformly distributed on [0, 1] and let Y be a random variable with distribution proportional to xα on the same interval. (The proportionality constant necessary to make the probability integrate to 1 is α + 1.) We will show that KL(X || Y) is greater than KL(Y || X).

First we calculate the two divergences.

\begin{eqnarray*} \mathrm{KL}(X || Y) &=& - \int_0^1 f_X(x) \, \log\left(\frac{f_Y(x)}{f_X(x)} \right) \, dx \\ &=& -\int_0^1 1 \cdot \left( \log(\alpha+1) + \alpha \log x - \log 1 \right) \, dx \\ &=& \alpha - \log(\alpha+1) \end{eqnarray*}


\begin{eqnarray*} \mathrm{KL}(Y || X) &=& - \int_0^1 f_Y(x) \, \log\left(\frac{f_X(x)}{f_Y(x)} \right) \, dx \\ &=& -\int_0^1 (\alpha + 1)x^\alpha \left(\log 1 -\log(\alpha+1) - \alpha \log x \right) \, dx \\ &=& \log(\alpha+1) - \frac{\alpha}{1 + \alpha} \end{eqnarray*}

And here is a plot comparing the two results as a function of the exponent α.

Related posts


[1] Timothy Bovaird, Charles H. Lineweaver; Exoplanet predictions based on the generalized Titius–Bode relation, Monthly Notices of the Royal Astronomical Society, Volume 435, Issue 2, 21 October 2013, Pages 1126–1138,

Hypothesis testing vs estimation

I was looking at my daughter’s statistics homework recently, and there were a pair of questions about testing the level of lead in drinking water. One question concerned testing whether the water was safe, and the other concerned testing whether the water was unsafe.

There’s something bizarre, even embarrassing, about this. You want to do two things: estimate the amount of lead, and decide what to do in response. But instead of simply doing just that, you do this arcane dance of choosing two hypotheses, one natural and one arbitrary, and treating the two asymmetrically, depending on which one you call the null and which you call the alternative. This asymmetry is the reason you make a distinction between testing whether the water is safe and testing whether it is unsafe.

It’s a weird tangle of estimation and decision making. The decision-making rules implicit in the procedure are not at all transparent. And even though you are testing the level of lead, you’re doing so indirectly.

The Bayesian approach to the problem is much easier to understand. You estimate the probability distribution for the concentration of lead based on all available information. You can plot this distribution and show it to civil engineers, politicians, or anybody else who needs to make a decision. Non-statisticians are much more likely to understand such a plot than the nuances of null and alternative hypotheses, significance, power, and whether you’re testing for safety versus testing for non-safety. (Statisticians are more likely to understand estimation as well.)

In the homework problems, the allowable level of lead was 15 ppm. After obtaining the posterior distribution on the concentration of lead, you could simply estimate the probability that the concentration is above 15 ppm. But you could also calculate the probability that the concentration lies in any other range you’re interested in.

Classical statistics does not allow such probability calculations. Even a confidence interval, something that looks like a probability statement about the concentration of lead, is actually a probability statement about the statistical process being used and not a probability statement about lead concentration per se.

Generalized normal distribution and kurtosis

The generalized normal distribution adds an extra parameter β to the normal (Gaussian) distribution. The probability density function for the generalized normal distribution is

\frac{\beta}{2\sigma \Gamma\left(\frac{1}{\beta}\right )} \exp\left(-\left|\frac{x-\mu}{\sigma} \right|^\beta \right)

Here the location parameter μ is the mean, but the scaling factor σ is not the standard deviation unless β = 2.

For small values of the shape parameter β, the distribution is more sharply pointed in the middle and decays more slowly in the tails. We say the tails are “thick” or “heavy.” When β = 1 the generalized normal distribution reduces to the Laplace distribution.

Here are examples with μ = 0 and σ = 1.

The normal distribution is a special case corresponding to β = 2. Large values of β make the distribution flatter on top and thinner (lighter) in the tails. Again μ = 0 and σ = 1 in the plots below.

thick-tailed generalized normal densities

One way to measure the thickness of probability distribution tails is kurtosis. The normal distribution has kurtosis equal to 3. Smaller values of kurtosis correspond to thinner tails and larger values to thicker tails.

There’s a common misunderstanding that kurtosis measures how pointy the distribution is in the middle. Often that’s the case, and in fact that’s the case for the generalized normal distribution. But it’s not true in general. It’s possible for a distribution to be flat on top and have heavy tails or pointy on top and have thin tails.

Distributions with thinner tails than the normal are called “platykurtic” and distributions with thicker tails than the normal are called “leptokurtic.” The names were based on the misunderstanding mentioned above. The platy– prefix means broad, but it’s not the tails that are broader, it’s the middle. Similarly, the lepto– prefix means “thin”, referring to being pointy in the middle. But leptokurtic distributions have thicker tails!

The kurtosis of the generalized normal distribution is given by

\frac{ \Gamma\left( \frac{5}{\beta} \right ) \Gamma\left( \frac{1}{\beta} \right ) }{\Gamma\left(\frac{3}{\beta}\right)^2}

We can use that to visualize how the kurtosis varies as a function of the shape parameter β.

The Laplace distribution (β = 1) has kurtosis 6 and the normal distribution (β = 2) has kurtosis 3.

You can use the fact that Γ(x) ~ 1/x for small x to show that in the limit as β goes to infinity, the kurtosis approaches 9/5.

Related post: Computing skewness and kurtosis in one pass

Generating Laplace random variables

Differential privacy adds Laplace-distributed random noise to data to protect individual privacy. (More on that here.) Although it’s simple to generate Laplacian random values, the Laplace distribution is not always one of the built-in options for random number generation libraries.

The Laplace distribution with scale β has density

f(x) = \frac{1}{2\beta} \exp\left(-\frac{|x|}{\beta} \right )

The Laplace distribution is also called the double exponential because it looks like two mirror-image exponential distributions glued together.

Note that the scale β is not the standard deviation. The standard deviation is √2 β.

To generate samples from a Laplace distribution with scale β, generate two independent exponential samples with mean β and return their difference.

If you don’t have an API for generating exponential random values, generate uniform random values and return the negative of the log. That will produce exponential values with mean 1. To make random values with mean β, just multiply the results by β.

If you want to generate Laplace values in Python, you could simply use the laplace function in scipy.stats. But I’ll write a generator from scratch just to show what you might do in another environment where you didn’t have exponential or Laplace generators.

    from math import log
    from random import random

    def exp_sample(mean): 
        return -mean*log(random())

    def laplace(scale):
        e1 = exp_sample(scale)
        e2 = exp_sample(scale)
        return e1 - e2

Related: Stand-alone numerical code, useful when you need a few common mathematical functions but are in an environment that doesn’t provide them, or when you want to avoid adding a library to your project.

Bits of information in age, birthday, and birthdate

birthday party

The previous post looked at how much information is contained in zip codes. This post will look at how much information is contained in someone’s age, birthday, and birth date. Combining zip code with birthdate will demonstrate the plausibility of Latanya Sweeney’s famous result [1] that 87% of the US population can be identified based on zip code, sex, and birth date.


Birthday is the easiest. There is a small variation in the distribution of birthdays, but this doesn’t matter for our purposes. The amount of information in a birthday, to three significant figures, is 8.51 bits, whether you include or exclude leap days. You can assume all birthdays are equally common, or use actual demographic data. It only makes a difference in the 3rd decimal place.


I’ll be using the following age distribution data found on Wikipedia.

| Age range | Population |
|  0– 4     |   20201362 |
|  5– 9     |   20348657 |
| 10–14     |   20677194 |
| 15–19     |   22040343 |
| 20–24     |   21585999 |
| 25–29     |   21101849 |
| 30–34     |   19962099 |
| 35–39     |   20179642 |
| 40–44     |   20890964 |
| 45–49     |   22708591 |
| 50–54     |   22298125 |
| 55–59     |   19664805 |
| 60–64     |   16817924 |
| 65–69     |   12435263 |
| 70–74     |    9278166 |
| 75–79     |    7317795 |
| 80–84     |    5743327 |
| 85+       |    5493433 |

To get data for each particular age, I’ll assume ages are evenly distributed in each group, and I’ll assume the 85+ group consists of people from ages 85 to 92. [2]

With these assumptions, there are 6.4 bits of information in age. This seems plausible: if all ages were uniformly distributed between 0 and 63, there would be exactly 6 bits of information since 26 = 64.

Birth date

If we assume birth days are uniformly distributed within each age, then age and birth date are independent. The information contained in the birth date would be the sum of the information contained in birthday and age, or 8.5 + 6.4 = 14.9 bits.

Zip code, sex, and age

The previous post showed there are 13.8 bits of information in a zip code. There are about an equal number of men and women, so sex adds 1 bit. So zip code, sex, and birth date would give a total of 29.7 bits. Since the US population is between 228 and 229, it’s plausible that we’d have enough information to identify everyone.

We’ve made a number of simplifying assumptions. We were a little fast and loose with age data, and we’ve assumed independence several times. We know that sex and age are not independent: more babies are boys, but women live longer. Still, Latanya Sweeney found empirically that you can identify 87% of Americans using the combination of zip code, sex, and birth date [1]. Her study was based on 1990 census data, and at that time the US population was a little less than 228.

Related posts


[1] Latanya Sweeney. “Simple Demographics Often Identify People Uniquely”. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh 2000. Available here.

[1] Bob Wells and Mel Tormé. “The Christmas Song.” Commonly known as “Chestnuts Roasting on an Open Fire.”

Bits of information in a US zip code

Mr. Zip

If you know someone’s US zip code, how much do you know about them? We can use entropy to measure the amount of information in bits.

To quantify the amount of information in a zip code, we need to know how many zip codes there are, and how evenly people are divided into zip codes.

There are about 43,000 zip codes in the US. The number fluctuates over time due to small adjustments.

Average information is maximized by dividing people into categories as evenly as possible. Maximum information about one person is maximized by dividing people into categories as unevenly as possible. To see this, suppose there were only two zip codes. The information we’d expect to learn from a zip code would be maximized if we divided people into two equal groups. Suppose on the other hand you were in one zip code and everyone else in the other. On average, zip code would reveal very little about someone, though it would reveal a lot about you!

If everyone were divided evenly into one of 43,000 zip codes, the amount of information revealed by knowing someone’s zip code would be about 15.4 bits, a little more information than asking 15 independent yes/no questions, each with equally likely answers.

But zip codes are not evenly populated. How much information is there in an actual five-digit zip code? To answer that question we need to know the population of each zip code. That’s a little tricky. Zip codes represent mail delivery points, not geographical areas. Technically the US Census Bureau tracks population by ZCTA (Zip Code Tabulation Area) rather than zip code per se. Population by ZCTA is freely available, but difficult to find. I gave up trying to find the data from official sources but was able to find it here.

We can go through the data and find the probability p of someone living in each ZCTA and add up –p log2p over each area. When we do, we find that a ZTCA contains 13.83 bits of information. (We knew it had to be less than 15.4 because uneven distribution reduces entropy.)

The Safe Harbor provision of US HIPAA law lists zip codes as a quasi-identifier. Five digit zip codes do not fall under Safe Harbor. Three digit zip codes (the first three digits of five digit zip codes) do fall under Safe Harbor, mostly. Some areas are so sparsely populated that even three-digit zip code areas are considered too informative. Any three-digit zip code with fewer than 20,000 people is excluded. You can find a snapshot of the list here, though the list may change over time.

If we repeat our calculation for three-digit zip codes, we find that they carry about 9.17 bits of information. It makes little difference to the result whether you include sparse regions, exclude them, or lump them all into one region.

See the next post on information contained in age, birthday, and birth date.

Related posts:

Summing random powers up to a threshold

Pick random numbers uniformly between 0 and 1, adding them as you go, and stop when you get a result bigger than 1. How many numbers would you expect to add together on average?

You need at least two samples, and often two are enough, but you might get any number, and those larger numbers will pull the expected value up.

Here’s a simulation program in Python.

    from random import random
    from collections import Counter

    N = 1000000
    c = Counter()
    for _ in range(N):
        x = 0
        steps = 0
        while x < 1:
            x += random()
            steps += 1
        c[steps] += 1

    print( sum([ k*c[k] for k in c.keys() ])/N )

When I ran it I got 2.718921. There’s a theoretical result first proved by W. Weissblum that the expected value is e = 2.71828… Our error was on the order of 1/√N, which is what we’d expect from the central limit theorem.

Now we can explore further in a couple directions. We could take a look at a the distribution of the number steps, not just its expected value. Printing c shows us the raw data.

        2: 499786, 
        3: 333175, 
        4: 125300, 
        5:  33466, 
        6:   6856, 
        7:   1213, 
        8:    172, 
        9:     29, 
        10:     3

And here’s a plot.

We could also generalize the problem by taking powers of the random numbers. Here’s what we get when we use exponents 1 through 20.

expected steps as exponents increase

There’s a theoretical result that the expected number of steps is asymptotically equal to cn where

c = \exp(\exp(-\gamma)) - \int_{-\infty}^\infty \frac{\exp(-\exp(u)) }{ (u+ \gamma)^2 + \pi^2}\, du

I computed c = 1.2494. The plot above shows that the dependence on the exponent n does look linear. The simulation results appear to be higher than the asymptotic prediction by a constant, but that’s consistent with the asymptotic prediction since relative to n, a constant goes away as n increases.

Reference for theoretical results: D. J. Newman and M. S. Klamkin. Expectations for Sums of Powers. The American Mathematical Monthly, Vol. 66, No. 1 (Jan., 1959), pp. 50-51

Quantifying normal approximation accuracy

Probability is full of theorems that say that probability density approximates another as some parameter becomes large. All the dashed lines in the diagram below indicate a relationship like this.


You can find details of what everything in the diagram means here.

How can you quantify these approximations? One way is to use Kullback-Leibler divergence.  In this post I’ll illustrate this for the normal approximation to the beta and gamma distributions.

The Kullback-Leibler divergence between two random variables X and Y is defined as

KL(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

We will compute this integral numerically in the code below to create graphs of how K-L divergence varies with parameters.

Here are the imports we’ll need.

    from scipy.integrate import quad
    from scipy.stats import beta, gamma, norm
    from scipy import inf
    import matplotlib.pyplot as plt

Beta distribution

As the shape parameters of a beta distribution become large, the probability distribution becomes approximately normal (Gaussian).

Here is code that will take the two shape parameters of a beta distribution, construct a normal approximation by moment matching, and compute the quality of the approximation as measured by Kullback-Leibler divergence.

    def KL_beta_norm(a, b):
        b = beta(a, b)
        n = norm(b.mean(), b.std())
        f = lambda x: -b.pdf(x)*(n.logpdf(x) - b.logpdf(x))    
        integral, error = quad(f, 0, 1)
        return integral

And here we make our plot.

    x = [2**i for i in range(11)]
    y = [KL_beta_norm(t, 2*t) for t in x]
    plt.plot(x, y)
    plt.ylabel("KL divergence")
    plt.title("Comparing beta(a, 2a) and its normal approximation")

The result is nearly linear on a log-log scale.

Kullback Leibler divergence from normal approximation to beta

I made the b parameter twice the a parameter to show that you don’t need symmetry. When you do have symmetry, i.e a = b, the approximation quality is better and the graph is even straighter.

Gamma distribution

As the shape parameter of a gamma distribution increases, the probability density becomes more and more like that of a normal distribution. We can quantify this with the following code.

    def KL_gamma_norm(shape):
        g = gamma(shape)
        n = norm(g.mean(), g.std())
        f = lambda x: -g.pdf(x)*(n.logpdf(x) - g.logpdf(x))
        mode = max(0, shape-1)
        integral1, error1 = quad(f, 0, mode)
        integral2, error2 = quad(f, mode, inf)    
        return integral1 + integral2

The integration code is a little more complicated this time. For small shape parameters, code analogous to that for the beta distribution will work just fine. But for larger parameters, the integration fails. The numerical integration routine needs a little help. The largest contribution to the integral is located near the mode of the gamma distribution. For large shape parameters, the integration routine misses this peak and grossly underestimates the integral. We break the integral into two pieces at the mode of the gamma distribution so the integration routine can’t miss it. This is a small example of why numerical integration cannot be completely automated. You have to know something about what you’re integrating.

(The quad() function has a parameter points to let you tell it about points of interest like this, but it doesn’t work when one of the limits of integration is infinite.)

The plotting code is essentially the same as that for the beta distribution. As before, the plot is linear on a log-log scale.

Kullback Leibler divergence for normal approximation to gamma

You could do a similar analysis on the other approximations in the distribution relationship diagram above.

Related posts

Scaling and Monte Carlo integration methods

Here’s an apparent paradox. You’ll hear that Monte Carlo methods are independent of dimension, and that they scale poorly with dimension. How can both statements be true?

The most obvious way to compute multiple integrals is to use product methods, analogous to the way you learn to compute multiple integrals by hand. Unfortunately the amount of work necessary with this approach grows exponentially with the number of variables, and so the approach is impractical beyond modest dimensions.

The idea behind Monte Carlo integration, or integration by darts, is to estimate the area under a (high dimensional) surface by taking random points and counting how many fall below the surface. More details here. The error goes down like the reciprocal of the square root of the number of points. The error estimate has nothing to do with the dimension per se. In that sense Monte Carlo integration is indeed independent of dimension, and for that reason is often used for numerical integration in dimensions too high to use product methods.

Suppose you’re trying to estimate what portion of a (high dimensional) box some region takes up, and you know a priori that the proportion is in the ballpark of 1/2. You could refine this to a more accurate estimate with Monte Carlo, throwing darts at the box and seeing how many land in the region of interest.

But in practice, the regions we want to estimate are small relative to any box we’d put around them. For example, suppose you want to estimate the volume of a unit ball in n dimensions by throwing darts at the unit cube in n dimensions. When n = 10, only about 1 dart in 400 will land in the ball. When n = 100, only one dart out of 1070 will land inside the ball. (See more here.) It’s very likely you’d never have a dart land inside the ball, no matter how much computing power you used.

If no darts land inside the region of interest, then you would estimate the volume of the region of interest to be zero. Is this a problem? Yes and no. The volume is very small, and the absolute error in estimating a small number to be zero is small. But the relative is 100%.

(For an analogous problem, see this discussion about the approximation sin(θ) = θ for small angles. It’s not just saying that one small number is approximately equal to another small number: all small numbers are approximately equal in absolute error! The small angle approximation has small relative error.)

If you want a small relative error in Monte Carlo integration, and you usually do, then you need to come up with a procedure that puts a larger proportion of integration points in the region of interest. One such technique is importance sampling, so called because it puts more samples in the important region. The closer the importance sampler fits the region of interest, the better the results will be. But you may not know enough a priori to create an efficient importance sampler.

So while the absolute accuracy of Monte Carlo integration does not depend on dimension, the problems you want to solve with Monte Carlo methods typically get harder with dimension.

Bounding the 3rd moment by the 4th moment

For a random variable X, the kth moment of X is the expected value of Xk.

For any random variable X with 0 mean, or negative mean, there’s an inequality that bounds the 3rd moment, m3 in terms of the 4th moment, m4:

m_3 \leq \left(\frac{4}{27} \right )^{1/4} m_4^{3/4}

The following example shows that this bound is the best possible.


u = (1 – √ 3)/√ 2
v = (1 + √ 3)/√ 2
p = (3 + √ 3) / 6

and suppose Xu with probability p and Xv with probability q = 1 – p. Then X has mean 0, and you can show that exact equality holds in the inequality above.

Here’s some Mathematica code to verify this claim.

    u = (1 - Sqrt[3])/Sqrt[2]
    v = (1 + Sqrt[3])/Sqrt[2]
    p = (Sqrt[3] + 1)/(2 Sqrt[3])
    q = 1 - p
    m3 = p u^3 + q v^3
    m4 = p u^4 + q v^4
    Simplify[ (4/27)^(1/4) m4^(3/4) / m3 ]

As hoped, the code returns 1.

Reference: Iosif Pinelis, Relations Between the First Four Moments, American Mathematical Monthly, Vol 122, No 5, pp 479-481.

Distribution of matches between two shuffled decks

Take two desks of cards and shuffle them. They can be standard 52-card decks, though the number of cards in the decks doesn’t matter as long as they’re the same and the decks are fairly large.

Now count the number of times the two desks match, i.e. how many times the same card is in the same position in both desks. The number of matches is random, and its distribution is approximately Poisson with mean 1. Let’s do a simulation and see how close the results come to the predicted outcome.

Here’s the Python code:

import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

def count_zeros(x):
    return len(x[x==0])

num_reps = 10000
deck_size = 52

matches = np.zeros(deck_size+1, dtype=int)

# Simulation
for _ in range(num_reps):

    # Shuffle two decks
    a = np.random.permutation(deck_size)
    b = np.random.permutation(deck_size)

    # Count how often they match
    num_matches = count_zeros(a-b)
    matches[ num_matches ] += 1

# Cut off outputs too small to see
cutoff = 8

# Matches predicted by a Poisson distribution with mean 1.
predicted = [num_reps*poisson(1).pmf(i) for i in range(cutoff)]

# Plot results
x = np.arange(cutoff)
w = 0.3 # bar width, matches[0:cutoff], w), predicted, w)
plt.legend(["actual", "predicted"])

And here’s the output based on 10,000 simulations:

Plot showing good fit between predicted and actual

About 1/3 of the time, you get no matches, another 1/3 of the time you get one match, and the rest of the time you get more. More precisely, according to the Poisson model zero matches and one match are both have probability 1/e.

Making sense of a probability problem in the WSJ

Wall Street Journal clipping

Someone wrote to me the other day asking if I could explain a probability example from the Wall Street Journal. (“Proving Investment Success Takes Time,” Spencer Jakab, November 25, 2017.)

Victor Haghani … and two colleagues told several hundred acquaintances who worked in finance that they would flip two coins, one that was normal and the other that was weighted so it came up heads 60% of the time. They asked the people how many flips it would take them to figure out, with a 95% confidence level, which one was the 60% coin. Told to give a “quick guess,” nearly a third said fewer than 10 flips, while the median response was 40. The correct answer is 143.

The anecdote is correct in spirit: it takes longer to discover the better of two options than most people suppose. But it’s jarring to read the answer is precisely 143 when the question hasn’t been stated clearly.

How many flips would it take to figure out which coin is better with a 95% confidence level? For starters, the answer would have to be a distribution, not a single number. You might quickly come to the right conclusion. You might quickly come to the wrong conclusion. You might flip coins for a long time and never come to a conclusion. Maybe there is a way a formulating the problem so that so that the expected value of the distribution is 143.

How are you to go about flipping the coins? Do you flip both of them, or just flip one coin? For example, you might flip both coins until you are confident that one is better, and conclude that the better one is the one that was designed to come up heads 60% of the time. Or you could just flip one of them and test the hypothesis Prob(heads) = 0.5 versus the alternative Prob(heads) = 0.6. Or maybe you flip one coin two times for every one time you flip the other. Etc.

What do you mean by “95% confidence level”? Is this a frequentist confidence interval? And do you compute the (Bayesian) predictive probability of arriving at such a confidence level? Are you computing the (Bayesian) posterior model probabilities of two models, one in which the first coin has probability of heads 0.5 and the second has probability 0.6 versus the opposite model?

Do you assume that you know a priori that one coin has probability of heads 0.5 and the other 0.6, or do you not assume this and just want to find the coin with higher probability of heads, and evaluate such a model when in fact the probabilities of heads are as stated?

Are you conducting an experiment with a predetermined sample size of 143? Or are you continuous monitoring the data, stopping when you reach your conclusion?

I leave it as an exercise to the reader to implement the various alternatives suggested above and see whether one of them produces 143 as a result. (I did a a back-of-the-envelope calculation that suggests there is one.) So the first question is to reverse engineer which problem statement the article was based on. The second question is to decide which problem formulation you believe would be most appropriate in the context of the article.

Hermite polynomials, expected values, and integration

In a previous post, I alluded to using Hermite polynomials in conjunction with higher-order Laplace approximation. In this post I’ll expand on what that means.

Hermite polynomials are orthogonal polynomials over the real line with respect to the weight given by the standard normal distribution. (There are two conventions for defining Hermite polynomials, what Wikipedia calls the physicist convention and the probabilist convention. We’re using the latter here.)

The first few Hermite polynomials are 1, x, x2 – 1, x3 – 3x, … . You can find the rest of the Hermite polynomials via the recurrence relation

Hn+1(x) = x Hnn Hn-1(x).

You could think of the Hermite polynomials as the right basis to use when working with normal probability distributions. Writing a polynomial as a linear combination of Hermite polyn0mials is a change of basis that makes integration easy:

\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty \left(\sum_{k=0}^N a_k\,H_k(x)\right) \exp(-x^2/2) \,dx = \sum_{k=0}^N a_k \, [k\, \mathrm{ even} ] \,k!!

Here [k even] is the function that returns 1 if k is even and 0 otherwise. This notation was introduced by Kenneth Iverson in the APL programming language and has become moderately common in mathematics.


How can a statistician help a lawyer?

I’ll be presenting at a webinar on Wednesday, December 13 at 1:00 PM Eastern. The title of the presentation is “Seven questions a statistician and answer for an attorney.”

I will discuss, among other things, when common sense applies and when correct analysis can be counter-intuitive. There will be ample time at the end of the presentation for Q & A.

If you’re interested in attending, you can register here.

ACEDS association of certified e-discovery specialists and LexInsight