Relative error in the central limit theorem

If you average a large number of independent copies of the same random variable, the central limit theorem says the average will be approximately normal. That is, the absolute error in approximating the density of the average by the density of a normal random variable will be small. (Terms and conditions apply. See notes here.)

But the central limit theorem says nothing about relative error. Relative error can diverge to infinity while absolute error converges to zero. We’ll illustrate this with an example.

The average of N independent exponential(1) random variables has a gamma distribution with shape N and scale 1/N.
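This fact can be sanity-checked by simulation. The following sketch (my addition, with an arbitrary seed) draws many averages of N = 16 exponential(1) samples and compares their mean and variance with those of a gamma distribution with shape N and scale 1/N, which are 1 and 1/N respectively.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(20230514)  # arbitrary seed
N = 16
avgs = rng.exponential(scale=1.0, size=(100_000, N)).mean(axis=1)

g = gamma(N, scale=1/N)       # claimed distribution of the average
print(avgs.mean(), g.mean())  # both close to 1
print(avgs.var(), g.var())    # both close to 1/16
```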

As N increases, the average becomes more like a normal in distribution. That is, the absolute error in approximating the distribution function of a gamma random variable with that of a normal random variable decreases. (Note that we’re talking about distribution functions (CDFs) and not densities (PDFs). The previous post discussed a surprise with density functions in this example.)

The following plot shows that the difference between the distribution functions gets smaller as N increases.

But when we look at the ratio of the tail probabilities, that is Pr(X > t) / Pr(Y > t) where X is the average of N exponential random variables and Y is the corresponding normal approximation from the central limit theorem, we see that the ratios diverge, and they diverge faster as N increases.
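Here is a back-of-the-envelope reason for the divergence, not spelled out above: up to polynomial factors, the gamma tail decays like exp(−Nt) while the normal tail decays like exp(−N(t − 1)²/2), so their ratio grows without bound:

\frac{\mbox{Pr}(X > t)}{\mbox{Pr}(Y > t)} \sim C_N\, t^{N-1} \exp\left(N\left(\frac{(t-1)^2}{2} - t\right)\right) \to \infty

where C_N collects constants that do not depend on t. The exponent grows like Nt²/2, and grows faster for larger N, matching the plot.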

To make it clear what’s being plotted, here is the Python code used to draw the graphs above.

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gamma, norm

def tail_ratio(ns):

    plt.figure()
    x = np.linspace(0, 4, 400)

    for n in ns:
        gsf = gamma.sf(x, n, scale=1/n)
        nsf = norm.sf(x, loc=1, scale=np.sqrt(1/n))
        plt.plot(x, gsf/nsf)

    plt.yscale("log")
    plt.legend(["n = {}".format(n) for n in ns])
    plt.savefig("gamma_normal_tail_ratios.svg")

def cdf_error(ns):

    plt.figure()
    x = np.linspace(0, 6, 400)

    for n in ns:
        gcdf = gamma.cdf(x, n, scale=1/n)
        ncdf = norm.cdf(x, loc=1, scale=np.sqrt(1/n))
        plt.plot(x, gcdf - ncdf)

    plt.legend(["n = {}".format(n) for n in ns])
    plt.savefig("gamma_normal_cdf_diff.svg")

ns = [1, 4, 16]
tail_ratio(ns)
cdf_error(ns)

Central limit theorem and Runge phenomena

I was playing around with something this afternoon and stumbled on something like the Gibbs phenomenon or the Runge phenomenon for the central limit theorem.

The first place most people encounter the Gibbs phenomenon is in the Fourier series of a step function. The Fourier series develops “bat ears” near the discontinuity. Here’s an example I blogged about before, not with Fourier series but with analogous Chebyshev series.

Gibbs phenomena for Chebyshev interpolation

The series converges rapidly in the middle of the flat parts, but under-shoots and over-shoots near the jumps in the step function.

The Runge phenomenon is similar: interpolating polynomials under- and over-shoot the function they’re approximating.

Runge example

Both plots above come from this post.

Here’s the example I ran across with the central limit theorem. The distribution of the average of a set of exponential random variables converges to the distribution of a normal random variable. The nice thing about the exponential distribution is that the averages have a familiar distribution: a gamma distribution. If each exponential has mean 1, the average has a gamma distribution with shape N and scale 1/N. The central limit theorem says this is converging in distribution to a normal distribution with the same mean and variance.

The plot below shows the difference between the density function of the average of N exponential random variables and the density function for its normal approximation, for N = 10 and for N = 400.

Notice that the orange line, corresponding to N = 400, is very flat most of the time. That is, the normal approximation fits very well. But there’s this spike in the middle, something reminiscent of the Gibbs or Runge phenomena. Going from 10 to 400 samples, the average error decreases quite a bit, but the maximum error doesn’t go down much at all.

If you go back and look at the central limit theorem, or its more quantitative counterpart the Berry-Esseen theorem, you’ll notice that it applies to the distribution function, not the density; in other words, the CDF, not the PDF. I think the density functions do converge in this case because the exponential distribution has a smooth density, but the rate of convergence depends on the norm. It looks like the convergence is fast in square (L²) norm, but slow in sup norm. A little experiment shows that this is indeed the case.
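Here’s one way to run that experiment (my sketch, not the original code): evaluate the two densities on a fine grid around the mean and compute discrete L² and sup norms of the difference.

```python
import numpy as np
from scipy.stats import gamma, norm

def density_diff_norms(N):
    # grid of width 16 standard deviations, centered at the mean
    x = 1 + np.linspace(-8, 8, 4001)/np.sqrt(N)
    diff = gamma.pdf(x, N, scale=1/N) - norm.pdf(x, loc=1, scale=1/np.sqrt(N))
    dx = x[1] - x[0]
    return np.sqrt(np.sum(diff**2)*dx), np.abs(diff).max()

for N in [10, 400, 10_000]:
    print(N, density_diff_norms(N))
```

The L² norm of the error shrinks as N grows, while the sup norm barely moves.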

Maybe the sup norm doesn’t converge to zero at all, i.e. the densities don’t converge uniformly. It looks like the sup norm may be headed toward a horizontal asymptote, just like Gibbs phenomena.

Update: It seems we do not have uniform convergence. If we let N = 1,000,000, the sup norm of the error is 0.1836. It appears the sup norm of the error is approaching a limit of approximately this value.
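Here is a sketch (my addition) of where that constant could come from, via the first Edgeworth correction. The density of the standardized average differs from the standard normal density φ by roughly φ(u) γ₁(u³ − 3u)/(6√N), where γ₁ = 2 is the skewness of the exponential distribution. Rescaling back to the average itself multiplies densities by √N, so the error does not vanish; its sup tends to

\max_u \frac{\left|\phi(u)\left(u^3 - 3u\right)\right|}{3} \approx 0.1836

which matches the number above.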


Computing extreme normal tail probabilities

Let me say up front that relying on the normal distribution as an accurate model of extreme events is foolish under most circumstances. The main reason to calculate the probability of, say, a 40 sigma event is to show how absurd it is to talk about 40 sigma events. See my previous post on six-sigma events for an explanation.

For this post I’ll be looking at two-tailed events, the probability that a normal random variable is either less than –kσ or greater than kσ. If you’re only interested in one of these two probabilities, divide by 2. Also, since the results are independent of σ, let’s assume σ = 1 for convenience.

The following Python code will print a table of k-sigma event probabilities.

    from scipy.stats import norm

    for k in range(1, 40):
        print(k, 2*norm.cdf(-k))

This shows, for example, that a “25 sigma event,” something I’ve heard people talk about with a straight face, has a probability of 6 × 10^-138.

The code above reports that a 38 or 39 sigma event has probability 0, exactly 0. That’s because the actual value, while not zero, is so close to zero that floating point precision can’t tell the difference. This happens around 10^-308.

What if, despite all the signs warning hic sunt dracones you want to compute even smaller probabilities? Then for one thing you’ll need to switch over to log scale in order for the results to be representable as floating point numbers.
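SciPy can do this directly: norm.logsf computes the log of the survival function accurately even where sf itself underflows to zero. A small sketch:

```python
import numpy as np
from scipy.stats import norm

def log10_two_tail(k):
    # log10 of Pr(|Z| > k), computed entirely on the log scale
    return (np.log(2) + norm.logsf(k)) / np.log(10)

print(log10_two_tail(38))   # roughly -315.24
print(log10_two_tail(100))
```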

Exactly computing these extreme probabilities is challenging, but there are convenient upper and lower bounds on the probabilities that can be derived from Abramowitz and Stegun, equation 7.1.13 with a change of variables. See notes here.

\frac{1}{t + \sqrt{t^2 + 4}} \leq \frac{1}{2}\sqrt{\frac{\pi}{2}}\, e^{t^2/2}\, \mbox{Pr}(|Z| > t) \leq \frac{1}{t + \sqrt{t^2 + 8/\pi}}

We can use these bounds to compute upper and lower bounds on the base 10 logs of the tail probabilities.

    from numpy import log, sqrt, pi

    def core(t, c):
        # log10 of 2*sqrt(2/pi) * exp(-t**2/2) / (t + sqrt(t**2 + c))
        x = 2*sqrt(2/pi)/(t + sqrt(t**2 + c))
        ln_p = -0.5*t**2 + log(x)
        return ln_p/log(10)

    def log10_upper(t):
        return core(t, 8/pi)

    def log10_lower(t):
        return core(t, 4)

This tells us that the log base 10 of the probability of a normal random variable being more than 38 standard deviations away from its mean is between -315.23876 and -315.23865. The upper and lower bounds agree to roughly seven significant figures, and the accuracy only improves as k gets larger. So for large arguments, we can use either the upper or lower bound as an accurate approximation.

The code above was used to compute this table of tail probabilities for k = 1 to 100 standard deviations.

Six sigma events

I saw on Twitter this afternoon a paraphrase of a quote from Nassim Taleb to the effect that if you see a six-sigma event, that’s evidence that it wasn’t really a six-sigma event.

What does that mean? Six sigma means six standard deviations away from the mean of a probability distribution, sigma (σ) being the common notation for a standard deviation. Moreover, the underlying distribution is implicitly a normal (Gaussian) distribution; people don’t commonly talk about “six sigma” in the context of other distributions [1]. Here’s a table to indicate the odds against a k-sigma event for various k.

        |-------+-----------------|
        | Sigma | Odds            |
        |-------+-----------------|
        |     1 | 2 : 1           |
        |     2 | 21 : 1          |
        |     3 | 370 : 1         |
        |     4 | 16,000 : 1      |
        |     5 | 1,700,000 : 1   |
        |     6 | 500,000,000 : 1 |
        |-------+-----------------|
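The table can be reproduced, up to rounding, with a few lines of Python. The odds against an event of probability p are (1 − p)/p to 1.

```python
from scipy.stats import norm

for k in range(1, 7):
    p = 2*norm.sf(k)            # two-tailed probability of a k-sigma event
    print(k, round((1 - p)/p))  # odds against
```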

If you see something that according to your assumptions should happen twice in a billion tries, maybe you’ve seen something extraordinarily rare, or maybe your assumptions were wrong. Taleb’s comment suggests the latter is more likely.

Bayes rule and Bayes factors

You could formalize this with Bayes rule. For example, suppose you’re 99% sure the thing you’re looking at has a normal distribution with variance 1, but you’re willing to concede there’s a 1% chance that what you’re looking at has a heavier-tailed distribution, say a Student t distribution with 10 degrees of freedom, rescaled to also have variance 1.

normal distribution vs t with 10 dof

It’s hard to tell the two distributions apart, especially in the tails. But although both are small in the tails, the normal is relatively much smaller.

Now suppose you’ve seen an observation greater than 6. The Bayes factor in favor of the t distribution hypothesis is about 27,000. This means that even though before seeing any data you thought the odds were 99 to 1 in favor of the data coming from a normal distribution, after seeing such a large observation you would put the odds at about 272 to 1 in favor of the t distribution.

If you allow a small possibility that your assumption of a normal distribution is wrong (see Cromwell’s rule) then seeing an extreme event will radically change your mind. You don’t have to think the heavier-tailed distribution is equally likely, just a possibility. If you did think a priori that both possibilities were equally likely, the posterior odds for the t distribution would be 27,000 to 1.
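Here’s how the Bayes factor calculation can be carried out (my sketch; the variance-1 rescaling uses the fact that a t with ν degrees of freedom has variance ν/(ν − 2)).

```python
import numpy as np
from scipy.stats import norm, t

nu = 10
scale = np.sqrt((nu - 2)/nu)       # rescale t so its variance is 1
p_t = t.sf(6, df=nu, scale=scale)  # tail probability under the t model
p_norm = norm.sf(6)                # tail probability under the normal model

bayes_factor = p_t/p_norm          # roughly 27,000
posterior_odds = bayes_factor/99   # with 99:1 prior odds for the normal
print(bayes_factor, posterior_odds)
```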

In this example we’re comparing the normal distribution to a very specific and somewhat arbitrary alternative. Our alternative was just an example. You could have picked a wide variety of alternatives that would have given a qualitatively similar result, reversing your a priori confidence in a normal model.

By the way, a t distribution with 10 degrees of freedom is not a very heavy-tailed distribution. It has heavier tails than a normal for sure, but not nearly as heavy as a Cauchy, which corresponds to a t with only one degree of freedom. If we had used a distribution with a heavier tail, the posterior odds in favor of that distribution would have been higher.

***

[1] A six-sigma event isn’t that rare unless your probability distribution is normal. By Chebyshev’s inequality, the probability of landing six or more standard deviations from the mean is at most 1/36 for any distribution with finite variance. The rarity of six-sigma events comes from the assumption of a normal distribution more than from the number of sigmas per se.

Robustness and tests for equal variance

The two-sample t-test is a way to test whether two data sets come from distributions with the same mean. I wrote a few days ago about how the test performs under ideal circumstances, as well as less than ideal circumstances.

This is an analogous post for testing whether two data sets come from distributions with the same variance. Statistics textbooks often present the F-test for this task, then warn in a footnote that the test is highly dependent on the assumption that both data sets come from normal distributions.

Sensitivity and robustness

Statistics texts give too little attention to robustness in my opinion. Modeling assumptions never hold exactly, so it’s important to know how procedures perform when assumptions don’t hold exactly. Since the F-test is one of the rare instances where textbooks warn about a lack of robustness, I expected the F-test to perform terribly under simulation, relative to its recommended alternatives Bartlett’s test and Levene’s test. That’s not exactly what I found.

Simulation design

For my simulations I selected 35 samples from each of two distributions. I selected significance levels for the F-test, Bartlett’s test, and Levene’s test so that each would have roughly a 5% error rate under a null scenario, both sets of data coming from the same distribution, and a 20% error rate under an alternative scenario.

I chose my initial null and alternative scenarios to use normal (Gaussian) distributions, i.e. to satisfy the assumptions of the F-test. Then I used the same designs for data coming from a heavy-tailed distribution to see how well each of the tests performed.

For the normal null scenario, both data sets were drawn from a normal distribution with mean 0 and standard deviation 15. For the normal alternative scenario I used normal distributions with standard deviations 15 and 25.
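The post doesn’t include its simulation code. Here is one way such a simulation might be set up (my sketch, with an arbitrary seed and trial count, so it won’t reproduce the tables below exactly), using scipy.stats.bartlett, scipy.stats.levene, and a hand-rolled two-sided F-test:

```python
import numpy as np
from scipy.stats import f as f_dist, bartlett, levene

rng = np.random.default_rng(42)  # arbitrary seed

def f_test_pvalue(x, y):
    # classic two-sided F-test for equal variances
    F = np.var(x, ddof=1)/np.var(y, ddof=1)
    dfx, dfy = len(x) - 1, len(y) - 1
    return 2*min(f_dist.sf(F, dfx, dfy), f_dist.cdf(F, dfx, dfy))

def rejection_rate(sd1, sd2, alpha, pvalue, n=35, trials=2000):
    reject = 0
    for _ in range(trials):
        x = rng.normal(0, sd1, n)
        y = rng.normal(0, sd2, n)
        reject += pvalue(x, y) < alpha
    return reject/trials

print(rejection_rate(15, 15, 0.05, f_test_pvalue))                       # Type I error rate
print(rejection_rate(15, 25, 0.05, lambda x, y: bartlett(x, y).pvalue))  # power
```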

Normal distribution calibration

Here are the results from the normal distribution simulations.

|----------+-------+--------+---------|
| Test     | Alpha | Type I | Type II |
|----------+-------+--------+---------|
| F        |  0.13 | 0.0390 |  0.1863 |
| Bartlett |  0.04 | 0.0396 |  0.1906 |
| Levene   |  0.06 | 0.0439 |  0.2607 |
|----------+-------+--------+---------|

Here the Type I column is the proportion of times the test incorrectly concluded that identical distributions had unequal variances. The Type II column reports the proportion of times the test failed to conclude that distributions with different variances indeed had unequal variances. Results were based on simulating 10,000 experiments.

The three tests had roughly equal operating characteristics. The only difference that stands out above simulation noise is that the Levene test had larger Type II error than the other tests when calibrated to have the same Type I error.

To calibrate the operating characteristics, I used alpha levels 0.13, 0.04, and 0.06 respectively for the F, Bartlett, and Levene tests.

Heavy-tail simulation results

Next I used the design parameters above, i.e. the alpha levels for each test, but drew data from distributions with a heavier tail. For the null scenario, both data sets were drawn from a Student t distribution with 4 degrees of freedom and scale 15. For the alternative scenario, the scale of one of the distributions was increased to 25. Here are the results, again based on 10,000 simulations.

|----------+-------+--------+---------|
| Test     | Alpha | Type I | Type II |
|----------+-------+--------+---------|
| F        |  0.13 | 0.2417 |  0.2852 |
| Bartlett |  0.04 | 0.2165 |  0.2859 |
| Levene   |  0.06 | 0.0448 |  0.4537 |
|----------+-------+--------+---------|

The operating characteristics degraded when drawing samples from a heavy-tailed distribution, t with 4 degrees of freedom, but they didn’t degrade uniformly.

Compared to the F-test, the Bartlett test had slightly better Type I error and the same Type II error.

The Levene test had a much lower Type I error than the other tests, hardly higher than it was when drawing from a normal distribution, but it had a higher Type II error.

Conclusion: The F-test is indeed sensitive to departures from the Gaussian assumption, but Bartlett’s test doesn’t seem much better in these particular scenarios. Levene’s test, however, does perform better than the F-test, depending on the relative importance you place on Type I and Type II error.


Two-sample t-test and robustness

A two-sample t-test is intended to determine whether there’s evidence that two samples have come from distributions with different means. The test assumes that both samples come from normal distributions.

Robust to non-normality, not to asymmetry

It is fairly well known that the t-test is robust to departures from a normal distribution, as long as the actual distribution is symmetric. That is, the test works more or less as advertised as long as the distribution is symmetric like a normal distribution, but it may not work as expected if the distribution is asymmetric.

This post will explore the robustness of the t-test via simulation. How far can you be from a normal distribution and still do OK? Can you have any distribution as long as it’s symmetric? Does a little asymmetry ruin everything? If something does go wrong, how does it go wrong?

Experiment design

For the purposes of this post, we’ll compare the null hypothesis that two groups both have mean 100 to the alternative hypothesis that one group has mean 100 and the other has mean 110. We’ll assume both distributions have a standard deviation of 15. There is a variation on the two-sample t-test that does not assume equal variance, but for simplicity we will keep the variance the same in both groups.

We’ll do a typical design, 0.05 significance and 0.80 power. That is, under the null hypothesis, when the two groups do come from the same distribution, we expect the test to wrongly conclude that the two distributions are different 5% of the time. And under the alternative hypothesis, when the groups come from distributions with means 100 and 110, we expect the test to conclude that the two means are indeed different 80% of the time.

Under these assumptions, you’d need a sample of 36 in each group.
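The post states the sample size without showing the calculation. The usual normal-approximation formula for a two-sample design is n = 2(z₁₋α/₂ + z_power)² σ²/δ² per group; a sketch:

```python
import math
from scipy.stats import norm

sigma, delta = 15, 10   # common standard deviation, difference in means
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha/2)  # about 1.96
z_power = norm.ppf(power)        # about 0.84
n = 2*(z_alpha + z_power)**2 * sigma**2 / delta**2
print(math.ceil(n))  # 36
```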

Python simulation code

Here’s the code we’ll use for our simulation.

    from scipy.stats import norm, t, gamma, uniform, ttest_ind

    num_per_group = 36

    def simulate_trials(num_trials, group1, group2):
        num_reject = 0
        for _ in range(num_trials):
            a = group1.rvs(num_per_group)
            b = group2.rvs(num_per_group)
            stat, pvalue = ttest_ind(a, b)
            if pvalue < 0.05:
                num_reject += 1
        return num_reject

Normal simulation

Normal distributions

Let’s see how the two-sample t-test works under ideal conditions by simulating from the normal distributions that the method assumes.

First we simulate from the null, i.e. we draw the data for both groups from the same distribution.

    n1 = norm(100, 15)
    n2 = norm(100, 15)
    print( simulate_trials(1000, n1, n2) )

Out of 1000 experiment simulations, the method rejected the null 54 times, very close to the expected number of 50 based on 0.05 significance.

Next we simulate under the alternative, i.e. we draw data from different distributions.

    n1 = norm(100, 15)
    n2 = norm(110, 15)
    print( simulate_trials(1000, n1, n2) )

This time the method rejected the null 804 times in 1000 simulations, very close to the expected 80% based on the power.

Gamma simulation

Gamma distributions

Next we draw data from a gamma distribution. First we draw both groups from a gamma distribution with mean 100 and standard deviation 15. This requires a shape parameter of 44.44 and a scale of 2.25.

    g1 = gamma(a = 44.44, scale=2.25)
    g2 = gamma(a = 44.44, scale=2.25)
    print( simulate_trials(1000, g1, g2) )

Under the null simulation, we rejected the null 64 times out of 1000 simulations, higher than expected, but not that remarkable.

Under the alternative simulation, we pick the second gamma distribution to have mean 110 and standard deviation 15, which requires shape 53.77 and scale 2.045.

    g1 = gamma(a = 44.44, scale=2.25)
    g2 = gamma(a = 53.77, scale=2.045)
    print( simulate_trials(1000, g1, g2) )

We rejected the null 805 times in 1000 simulations, in line with what you’d expect from the power.

Gamma distributions

When the gamma distribution has such large shape parameters, the distributions are approximately normal, though slightly asymmetric. To increase the asymmetry, we’ll use a couple of gamma distributions with smaller mean but shifted over to the right. Under the null, we create two gamma distributions with mean 10 and variance 15, then shift them to the right by 90.

    g1 = gamma(a = 6.67, scale=1.5, loc=90)
    g2 = gamma(a = 6.67, scale=1.5, loc=90)
    print( simulate_trials(1000, g1, g2) )

Under this simulation we reject the null 56 times out of 1000 simulations, in line with what you’d expect.

For the alternative simulation, we pick the second distribution to have a mean of 20 and standard deviation 15, and then shift it to the right by 90 so that it has mean 110. This distribution has quite a long tail.

    g1 = gamma(a = 6.67, scale=1.5, loc=90)
    g2 = gamma(a = 1.28, scale=11.25, loc=90)
    print( simulate_trials(1000, g1, g2) )

This time we rejected the null 499 times in 1000 simulations. This is a serious departure from what’s expected. Under the alternative hypothesis, we reject the null 50% of the time rather than the 80% we’d expect. If higher mean is better, this means that half the time you’d fail to conclude that the better group really is better.

Uniform distribution simulation

Next we use uniform random samples from the interval [74, 126]. This is a symmetric distribution with mean 100 and standard deviation 15. When we draw both groups from this distribution we rejected the null 48 times, in line with what we’d expect.

    u1 = uniform(loc=74, scale=52)
    u2 = uniform(loc=74, scale=52)
    print( simulate_trials(1000, u1, u2) )

If we move the second distribution over by 10, drawing from [84, 136], we rejected the null 790 times out of 1000, again in line with what you’d get from a normal distribution.

    u1 = uniform(loc=74, scale=52)
    u2 = uniform(loc=84, scale=52)
    print( simulate_trials(1000, u1, u2) )

In this case, we’ve made a big departure from normality and the test still worked as expected. But that’s not always the case, as in the t(3) distribution below.

Student-t simulation

Finally we simulate from a Student-t distribution. This is a symmetric distribution, but heavier in the tails than the normal distribution.

t distributions with 6 dof

First, we simulate from a t distribution with 6 degrees of freedom and scale 12.25, making the standard deviation 15. We shift the location of the distribution by 100 to make the mean 100.

    t1 = t(df=6, loc=100, scale=12.25)
    t2 = t(df=6, loc=100, scale=12.25)
    print( simulate_trials(1000, t1, t2) )

When both groups come from this distribution, we rejected the null 46 times. When we shifted the second distribution to have mean 110, we rejected the null 777 times out of 1000.

    t1 = t(df=6, loc=100, scale=12.25)
    t2 = t(df=6, loc=110, scale=12.25)
    print( simulate_trials(1000, t1, t2) )

In both cases, the results are in line what we’d expect given a 5% significance level and 80% power.

t distributions with 3 dof

A t distribution with 6 degrees of freedom has a heavy tail compared to a normal. Let’s try making the tail even heavier by using 3 degrees of freedom. This time we set the scale to 15. Note that this does not keep the standard deviation at 15: a t distribution with ν degrees of freedom and scale s has variance s²ν/(ν − 2), so here the standard deviation is 15√3 ≈ 26.

    t1 = t(df=3, loc=100, scale=15)
    t2 = t(df=3, loc=100, scale=15)
    print( simulate_trials(1000, t1, t2) )

When we draw both samples from the same distribution, we rejected the null 37 times out of 1000, less than the 50 times we’d expect.

    t1 = t(df=3, loc=100, scale=15)
    t2 = t(df=3, loc=110, scale=15)
    print( simulate_trials(1000, t1, t2) )

When we shift the second distribution over to have mean 110, we rejected the null 463 times out of 1000, far less than the 80% rejection you’d expect if the data were normally distributed.

Just looking at a plot of the PDFs, a t(3) looks much closer to a normal distribution than a uniform distribution does. But the tail behavior is different. The tails of the uniform are as thin as you can get—they’re zero!—while the t(3) has heavy tails.

These two examples show that you can replace the normal distribution with a moderately heavy-tailed symmetric distribution, but don’t overdo it. When the data come from a heavy-tailed distribution, even one that is symmetric, the two-sample t-test may not have the operating characteristics you’d expect.

Sensitivity of logistic regression prediction on coefficients

The output of a logistic regression model is a function that predicts the probability of an event as a function of the input parameter. This post will only look at a simple logistic regression model with one predictor, but similar analysis applies to multiple regression with several predictors.

p(x) = \frac{1}{1 + \exp(-(a + bx))}

Here’s a plot of such a curve when a = 3 and b = 4.

Flattest part

The curvature of the logistic curve is small at both extremes. As x comes in from negative infinity, the curvature increases, then decreases to zero, then increases again, then decreases as x goes to positive infinity. We quantified this statement in another post where we calculate the curvature. The curvature is zero at the point where the second derivative of p

p''(x) = \frac{b^2 \exp(a + bx)\left(1 - \exp(a + bx)\right)}{(1 + \exp(a + bx))^3}

is zero, which occurs when x = –a/b. At that point p = 1/2, so the curve is flattest where the probability crosses 1/2. In the graph above, this happens at x = -0.75.

A little calculation shows that the slope at the flattest part of the logistic curve is simply b/4.

Sensitivity to parameters

Now how much does the probability prediction p(x) change as the parameter a changes? We now need to consider p as a function of three variables, i.e. we need to consider a and b as additional variables. The marginal change in p in response to a change in a is the partial derivative of p with respect to a.

To know where this is maximized with respect to x, we take the partial derivative of the above expression with respect to x

\frac{\partial^2 p}{\partial x\, \partial a} = \frac{b \exp(a + bx)\left(1 - \exp(a + bx)\right)}{(1 + \exp(a + bx))^3}

which is zero when  x = –a/b, the same place where the logistic curve is flattest. And the partial of p with respect to a at that point is simply 1/4, independent of b. So a small change Δa results in a change of approximately Δa/4 at the flattest part of the logistic curve and results in less change elsewhere.

What about the dependence on b? That’s more complicated. The rate of change of p with respect to b is

\frac{\partial p}{\partial b} = \frac{\exp(a + bx) x }{(1 + \exp(a + bx))^2}

and this is maximized where

\frac{\partial^2 p}{\partial x \partial b} = 0

which in turn requires solving a nonlinear equation. This is easy to do numerically in a specific case, but not easy to work with analytically in general.

However, we can easily say how p changes with b near the point x = –a/b. This is not where the partial of p with respect to b is maximized, but it’s a place of interest because it has come up twice above. At that point the derivative of p with respect to b is –a/(4b). So if a and b have the same sign, then a small increase in b will result in a small decrease in p and vice versa.
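The three sensitivities at the flattest point can be checked numerically with finite differences (my sketch, using the a = 3, b = 4 example from the plot):

```python
import numpy as np

def p(x, a, b):
    # logistic curve from the post
    return 1/(1 + np.exp(-(a + b*x)))

a, b = 3.0, 4.0
x0 = -a/b   # flattest point
h = 1e-6

slope = (p(x0 + h, a, b) - p(x0 - h, a, b))/(2*h)  # should be b/4
dp_da = (p(x0, a + h, b) - p(x0, a - h, b))/(2*h)  # should be 1/4
dp_db = (p(x0, a, b + h) - p(x0, a, b - h))/(2*h)  # should be -a/(4b)
print(slope, dp_da, dp_db)
```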

Obesity index: Measuring the fatness of probability distribution tails

A probability distribution is called “fat tailed” if its probability density goes to zero slowly. Slowly relative to what? That is often implicit and left up to context, but generally speaking the exponential distribution is the dividing line. Probability densities that decay faster than the exponential distribution are called “thin” or “light,” and densities that decay more slowly are called “thick,” “heavy,” or “fat,” or more technically “subexponential.” The distinction is important because fat-tailed distributions tend to defy our intuition.

One surprising property of heavy-tailed (subexponential) distributions is the single big jump principle. Roughly speaking, most of the contribution to the sum of several heavy-tailed random variables comes from the largest of the samples. To be more specific, let “several” = 4, for reasons that will be apparent soon, though the result is true for any n. As x goes to infinity, the probability that

X1 + X2 + X3 + X4

is larger than x is asymptotically equal to the probability that

max(X1, X2, X3, X4)

is larger than x.

The idea behind the obesity index [1] is to turn the theorem above around, making it an empirical measure of how thick a distribution’s tail is. If you draw four samples from a random variable and sort them, the obesity index is the probability that the sum of the max and min, X1 + X4, is greater than the sum of the middle samples, X2 + X3.

The obesity index could be defined for any distribution, but it only measures what the name implies for right-tailed distributions. For any symmetric distribution, the obesity index is 1/2. A Cauchy distribution is heavy-tailed, but it has two equally heavy tails, and so its obesity index is the same as the normal distribution, which has two light tails.

Note that location and scale parameters have no effect on the obesity index; shifting and scaling affect all the X values the same, so they don’t change the probability that X1 + X4 is greater than X2 + X3.

To get an idea of the obesity index in action, we’ll look at the normal, exponential, and Cauchy distributions, since these are the canonical examples of thin, medium, and thick tailed distributions. But for reasons explained above, we’ll actually look at the folded normal and folded Cauchy distributions, i.e. we’ll take their absolute values to create right-tailed distributions.

To calculate the obesity index exactly you’d need to do analytical calculations with order statistics. We’ll simulate the obesity index because that’s easier. It’s also more in the spirit of calculating the obesity index from data.

    from scipy.stats import norm, expon, cauchy

    def simulate_obesity(dist, N):
        data = abs(dist.rvs(size=(N,4)))
        count = 0
        for row in range(N):
            X = sorted(data[row])
            if X[0] + X[3] > X[1] + X[2]:
                count += 1
        return count/N

    for dist in [norm, expon, cauchy]:
        print( simulate_obesity(dist, 10000) )

When I ran the Python code above, I got

    0.6692
    0.7519
    0.8396

This ranks the three distributions in the anticipated order of tail thickness.

Note that the code above takes the absolute value of the random samples. This lets us pass in ordinary (unfolded) versions of the normal and Cauchy distributions, and it’s redundant for any distribution like the exponential that’s already positive-valued.

[I found out after writing this blog post that SciPy now has foldnorm and foldcauchy, but they don’t seem to work like I expect.]

Let’s try it on a few more distributions. Lognormal is between exponential and Cauchy in thickness. A Pareto distribution with parameter b has density going to zero like x^(-1-b), and so we expect a Pareto distribution to have a smaller obesity index than Cauchy when b is greater than 1, and a larger index when b is less than one. Once again the simulation results are what we’d expect.

The code

    from scipy.stats import lognorm, pareto

    # lognorm requires a shape parameter; s=1 gives a standard lognormal
    for dist in [lognorm(1), pareto(2), pareto(0.5)]:
        print( simulate_obesity(dist, 10000) )

returns

    0.7766
    0.8242
    0.9249

By this measure, lognormal is just a little heavier than exponential. Pareto(2) comes in lighter than Cauchy, but not by much, and Pareto(0.5) comes in heavier.

Since the obesity index is a probability, it will always return a value between 0 and 1. Maybe it would be easier to interpret if we did something like take the logit transform of the index to spread the values out more. Then the distinctions between Pareto distributions of different orders, for example, might match intuition better.
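For instance, the logit transform is one line (my sketch; the inputs below are the simulated indices reported above):

```python
import numpy as np

def logit(p):
    # maps (0, 1) to the whole real line, stretching out values near 0 and 1
    return np.log(p/(1 - p))

for p in [0.7519, 0.7766, 0.8242, 0.9249]:
    print(round(float(logit(p)), 2))
```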

[1] Roger M. Cooke et al. Fat-Tailed Distributions: Data, Diagnostics and Dependence. Wiley, 2014.

Categorical Data Analysis

Categorical data analysis could mean a couple of different things. One is analyzing data that falls into unordered categories (e.g. red, green, and blue) rather than numerical values (e.g. height in centimeters).

Another is using category theory to assist with the analysis of data. Here “category” means something more sophisticated than a list of items you might choose from in a drop-down menu. Instead we’re talking about applied category theory.

So we have ((categorical data) analysis) and (categorical (data analysis)), i.e. analyzing categorical data and categorically analyzing data. The former is far, far more common.

I ran across Alan Agresti’s classic book the other day in a used book store. The image below is from the third (2012) edition. The book store had the first (1990) edition with a more austere cover.

I bought Agresti’s book because it’s a good reference to have. But I was a little disappointed. My first thought had been that someone had written a book on category theory and statistics, which is not the case, as far as I know.

The main reference for category theory and statistics is Peter McCullagh’s 2002 paper What is a statistical model? That paper raised a lot of interesting ideas, but the statistics community did not take McCullagh’s bait.

commutative diagram for statistical models

Maybe this just wasn’t a fruitful idea. I suspect it is a fruitful idea, but the number of people available to develop it, conversant in both statistics and category theory, is very small. I’ve seen category theory used in mathematical modeling more generally, but not in statistics per se.

At its most basic, category theory asks you to be explicit about the domain and range (codomain) of functions. It would be very helpful if statisticians merely did this. Statistical notation is notoriously bad at specifying where a function goes from and to, or even at making clear when a function is a function. Just 0th level category theory, defining categories, would be useful. Maybe it would be useful to go on to identifying limits or adjoints, but simply being explicit about “from” and “to” would be a good start.

Category theory is far too abstract to completely carry out a statistical analysis. But it can prompt you to ask questions that check whether your model has any inconsistencies you hadn’t noticed. The idea of a “categorical error” doesn’t differ that much moving from its philosophical meaning under Aristotle to its mathematical meaning under Mac Lane. Nor does the idea of something being “natural.” One of the primary motivations for creating category theory was to come up with a rigorous definition of what it means for something in math to be “natural.”

Asymmetric surprise

Motivating example: planet spacing

My previous post showed that planets are roughly evenly distributed on a log scale, not just in our solar system but also in extrasolar planetary systems. I hadn’t seen this before I stumbled on it by making some plots.

I didn’t think it was an original discovery—I assume someone did this exercise immediately when systems with several planets were discovered—but I didn’t know what this observation was called. I now know it’s known as the Titius-Bode law, a generalization of an observation about our solar system by Messrs. Titius and Bode a couple centuries ago. See, for example, [1].

Several people were skeptical of the claim that planets are distributed according to a power law and pointed out that uniformly distributed points can look fairly evenly distributed on a logarithmic scale. That is true, and it gets to the topic I want to discuss in this post. Planets are not spaced like uniform random samples (see [1]), and yet it is reasonable, at first glance, to ask whether they are.

Asymmetric surprise

If you’re expecting a power law, and you’re given uniformly distributed data, it doesn’t look too surprising. On the other hand, if you’re expecting uniformly distributed data and you see data distributed according to a power law, you are surprised. I’ll formalize this below.

If you’ve ever tried to make a scaled model of our solar system, you were probably surprised that the planets are far from uniformly spaced. A scaled model of our solar system, say at a museum, is likely to position a few of the inner planets to scale, and then use text to explain where the outer planets should be. For example, there may be a footnote saying “And if everything were to scale, Pluto would be behind the Exxon station at the end of the street.” This is an example of implicitly expecting a uniform distribution and receiving data distributed according to a power law.

Some people suspected that I was doing the opposite. By plotting distances on a log scale, I was implicitly expecting a power law distribution. Maybe the data were roughly uniform, but I fooled myself into seeing a power law.

Quantifying surprise

The Kullback-Leibler divergence from Y to X, written KL(X || Y), is the average surprise of seeing Y when you expected X. That’s one of the interpretations. See this post for more interpretations.

In general, Kullback-Leibler divergence is not symmetric. The divergence from X to Y typically does not equal the divergence from Y to X. The discussion above claims that the surprise from seeing power law data when expecting a uniform distribution is greater than the surprise from seeing uniform data when expecting a power law distribution. We show below that this is true.

Let X be a random variable uniformly distributed on [0, 1] and let Y be a random variable with density proportional to x^α on the same interval. (The proportionality constant necessary to make the density integrate to 1 is α + 1.) We will show that KL(X || Y) is greater than KL(Y || X).

First we calculate the two divergences.

\begin{eqnarray*} \mathrm{KL}(X || Y) &=& - \int_0^1 f_X(x) \, \log\left(\frac{f_Y(x)}{f_X(x)} \right) \, dx \\ &=& -\int_0^1 1 \cdot \left( \log(\alpha+1) + \alpha \log x - \log 1 \right) \, dx \\ &=& \alpha - \log(\alpha+1) \end{eqnarray*}

and

\begin{eqnarray*} \mathrm{KL}(Y || X) &=& - \int_0^1 f_Y(x) \, \log\left(\frac{f_X(x)}{f_Y(x)} \right) \, dx \\ &=& -\int_0^1 (\alpha + 1)x^\alpha \left(\log 1 -\log(\alpha+1) - \alpha \log x \right) \, dx \\ &=& \log(\alpha+1) - \frac{\alpha}{1 + \alpha} \end{eqnarray*}

And here is a plot comparing the two results as a function of the exponent α.
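To reproduce that comparison, we can evaluate the two closed forms above over a range of α. Here is a sketch using NumPy (variable names are mine); the plotting itself follows the same matplotlib pattern used earlier in the post.

```python
import numpy as np

alpha = np.linspace(0.01, 5, 200)

# KL(X || Y) = alpha - log(alpha + 1), from the first calculation above
kl_xy = alpha - np.log(alpha + 1)

# KL(Y || X) = log(alpha + 1) - alpha/(1 + alpha), from the second
kl_yx = np.log(alpha + 1) - alpha / (1 + alpha)

# the asymmetry: expecting uniform and seeing a power law is more surprising
assert np.all(kl_xy >= kl_yx - 1e-12)
```

Both divergences vanish as α goes to 0, as they should, since then Y reduces to the uniform distribution X.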

Related posts

***

[1] Timothy Bovaird, Charles H. Lineweaver; Exoplanet predictions based on the generalized Titius–Bode relation, Monthly Notices of the Royal Astronomical Society, Volume 435, Issue 2, 21 October 2013, Pages 1126–1138, https://doi.org/10.1093/mnras/stt1357

Hypothesis testing vs estimation

I was looking at my daughter’s statistics homework recently, and there were a pair of questions about testing the level of lead in drinking water. One question concerned testing whether the water was safe, and the other concerned testing whether the water was unsafe.

There’s something bizarre, even embarrassing, about this. You want to do two things: estimate the amount of lead, and decide what to do in response. But instead of simply doing just that, you do this arcane dance of choosing two hypotheses, one natural and one arbitrary, and treating the two asymmetrically, depending on which one you call the null and which you call the alternative. This asymmetry is the reason you make a distinction between testing whether the water is safe and testing whether it is unsafe.

It’s a weird tangle of estimation and decision making. The decision-making rules implicit in the procedure are not at all transparent. And even though you are testing the level of lead, you’re doing so indirectly.

The Bayesian approach to the problem is much easier to understand. You estimate the probability distribution for the concentration of lead based on all available information. You can plot this distribution and show it to civil engineers, politicians, or anybody else who needs to make a decision. Non-statisticians are much more likely to understand such a plot than the nuances of null and alternative hypotheses, significance, power, and whether you’re testing for safety versus testing for non-safety. (Statisticians are more likely to understand estimation as well.)

In the homework problems, the allowable level of lead was 15 ppm. After obtaining the posterior distribution on the concentration of lead, you could simply estimate the probability that the concentration is above 15 ppm. But you could also calculate the probability that the concentration lies in any other range you’re interested in.
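With a posterior in hand, these calculations are one-liners. As a purely hypothetical sketch, suppose the posterior on lead concentration were normal with mean 12 ppm and standard deviation 2 ppm (numbers I made up for illustration):

```python
from scipy.stats import norm

# hypothetical posterior distribution on lead concentration, in ppm
posterior = norm(loc=12, scale=2)

# probability the concentration exceeds the 15 ppm limit
p_over = posterior.sf(15)

# or the probability it lies in any other range of interest
p_mid = posterior.cdf(14) - posterior.cdf(10)

print(p_over, p_mid)
```

A real analysis would produce the posterior from a prior and the measurement data; the point here is only that once you have it, any probability statement about the concentration is a direct computation.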

Classical statistics does not allow such probability calculations. Even a confidence interval, something that looks like a probability statement about the concentration of lead, is actually a probability statement about the statistical process being used and not a probability statement about lead concentration per se.

Generalized normal distribution and kurtosis

The generalized normal distribution adds an extra parameter β to the normal (Gaussian) distribution. The probability density function for the generalized normal distribution is

\frac{\beta}{2\sigma \Gamma\left(\frac{1}{\beta}\right )} \exp\left(-\left|\frac{x-\mu}{\sigma} \right|^\beta \right)

Here the location parameter μ is the mean, but the scaling factor σ is not the standard deviation unless β = 2.

For small values of the shape parameter β, the distribution is more sharply pointed in the middle and decays more slowly in the tails. We say the tails are “thick” or “heavy.” When β = 1 the generalized normal distribution reduces to the Laplace distribution.

Here are examples with μ = 0 and σ = 1.

The normal distribution is a special case corresponding to β = 2. Large values of β make the distribution flatter on top and thinner (lighter) in the tails. Again μ = 0 and σ = 1 in the plots below.

thick-tailed generalized normal densities

One way to measure the thickness of probability distribution tails is kurtosis. The normal distribution has kurtosis equal to 3. Smaller values of kurtosis correspond to thinner tails and larger values to thicker tails.

There’s a common misunderstanding that kurtosis measures how pointy the distribution is in the middle. Often that’s the case, and in fact that’s the case for the generalized normal distribution. But it’s not true in general. It’s possible for a distribution to be flat on top and have heavy tails or pointy on top and have thin tails.

Distributions with thinner tails than the normal are called “platykurtic” and distributions with thicker tails than the normal are called “leptokurtic.” The names were based on the misunderstanding mentioned above. The platy– prefix means broad, but it’s not the tails that are broader, it’s the middle. Similarly, the lepto– prefix means “thin”, referring to being pointy in the middle. But leptokurtic distributions have thicker tails!

The kurtosis of the generalized normal distribution is given by

\frac{ \Gamma\left( \frac{5}{\beta} \right ) \Gamma\left( \frac{1}{\beta} \right ) }{\Gamma\left(\frac{3}{\beta}\right)^2}

We can use that to visualize how the kurtosis varies as a function of the shape parameter β.

The Laplace distribution (β = 1) has kurtosis 6 and the normal distribution (β = 2) has kurtosis 3.

You can use the fact that Γ(x) ~ 1/x for small x to show that in the limit as β goes to infinity, the kurtosis approaches 9/5.

Related post: Computing skewness and kurtosis in one pass

Generating Laplace random variables

Differential privacy adds Laplace-distributed random noise to data to protect individual privacy. (More on that here.) Although it’s simple to generate Laplacian random values, the Laplace distribution is not always one of the built-in options for random number generation libraries.

The Laplace distribution with scale β has density

f(x) = \frac{1}{2\beta} \exp\left(-\frac{|x|}{\beta} \right )

The Laplace distribution is also called the double exponential because it looks like two mirror-image exponential distributions glued together.

Note that the scale β is not the standard deviation. The standard deviation is √2 β.

To generate samples from a Laplace distribution with scale β, generate two independent exponential samples with mean β and return their difference.

If you don’t have an API for generating exponential random values, generate uniform random values and return the negative of the log. That will produce exponential values with mean 1. To make random values with mean β, just multiply the results by β.

If you want to generate Laplace values in Python, you could simply use the laplace function in scipy.stats. But I’ll write a generator from scratch just to show what you might do in another environment where you didn’t have exponential or Laplace generators.

    from math import log
    from random import random

    def exp_sample(mean):
        # 1 - random() lies in (0, 1], so the log is never taken at 0
        return -mean*log(1.0 - random())

    def laplace(scale):
        e1 = exp_sample(scale)
        e2 = exp_sample(scale)
        return e1 - e2
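As a quick sanity check (my addition, not part of the original snippet), the sample standard deviation of these values should come out near √2 times the scale:

```python
from math import log, sqrt
from random import random, seed
from statistics import stdev

def exp_sample(mean):
    # 1 - random() lies in (0, 1], so the log is never taken at 0
    return -mean*log(1.0 - random())

def laplace(scale):
    return exp_sample(scale) - exp_sample(scale)

seed(20170201)  # fix the seed so the check is reproducible
samples = [laplace(1.0) for _ in range(100_000)]
print(stdev(samples))  # should be close to sqrt(2) ≈ 1.414
```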

Related: Stand-alone numerical code, useful when you need a few common mathematical functions but are in an environment that doesn’t provide them, or when you want to avoid adding a library to your project.

Bits of information in age, birthday, and birthdate


The previous post looked at how much information is contained in zip codes. This post will look at how much information is contained in someone’s age, birthday, and birth date. Combining zip code with birthdate will demonstrate the plausibility of Latanya Sweeney’s famous result [1] that 87% of the US population can be identified based on zip code, sex, and birth date.

Birthday

Birthday is the easiest. There is a small variation in the distribution of birthdays, but this doesn’t matter for our purposes. The amount of information in a birthday, to three significant figures, is 8.51 bits, whether you include or exclude leap days. You can assume all birthdays are equally common, or use actual demographic data. It only makes a difference in the 3rd decimal place.
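The figure is just a base-2 logarithm, since birthdays are nearly uniform:

```python
from math import log2

# information in a uniformly distributed birthday, with and without leap day
print(log2(365))  # 8.512 to four significant figures
print(log2(366))  # 8.516
```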

Age

I’ll be using the following age distribution data found on Wikipedia.

|-----------+------------|
| Age range | Population |
|-----------+------------|
|  0– 4     |   20201362 |
|  5– 9     |   20348657 |
| 10–14     |   20677194 |
| 15–19     |   22040343 |
| 20–24     |   21585999 |
| 25–29     |   21101849 |
| 30–34     |   19962099 |
| 35–39     |   20179642 |
| 40–44     |   20890964 |
| 45–49     |   22708591 |
| 50–54     |   22298125 |
| 55–59     |   19664805 |
| 60–64     |   16817924 |
| 65–69     |   12435263 |
| 70–74     |    9278166 |
| 75–79     |    7317795 |
| 80–84     |    5743327 |
| 85+       |    5493433 |
|-----------+------------|

To get data for each particular age, I’ll assume ages are evenly distributed in each group, and I’ll assume the 85+ group consists of people from ages 85 to 92. [2]

With these assumptions, there are 6.4 bits of information in age. This seems plausible: if all ages were uniformly distributed between 0 and 63, there would be exactly 6 bits of information since 2^6 = 64.
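Here is a sketch of the entropy calculation, using the table above and the stated assumptions (ages uniform within each group, with the 85+ group spread over ages 85 to 92):

```python
from math import log2

# (group width in years, population) from the table above; 85+ treated as 85-92
groups = [
    (5, 20201362), (5, 20348657), (5, 20677194), (5, 22040343),
    (5, 21585999), (5, 21101849), (5, 19962099), (5, 20179642),
    (5, 20890964), (5, 22708591), (5, 22298125), (5, 19664805),
    (5, 16817924), (5, 12435263), (5, 9278166), (5, 7317795),
    (5, 5743327), (8, 5493433),
]

total = sum(pop for _, pop in groups)

# spread each group evenly over its years, then sum -p log2(p) over all ages
entropy = 0.0
for years, pop in groups:
    p = pop / total / years  # probability of each single age in the group
    entropy -= years * p * log2(p)

print(entropy)  # about 6.4 bits
```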

Birth date

If we assume birth days are uniformly distributed within each age, then age and birth date are independent. The information contained in the birth date would be the sum of the information contained in birthday and age, or 8.5 + 6.4 = 14.9 bits.

Zip code, sex, and birth date

The previous post showed there are 13.8 bits of information in a zip code. There are about an equal number of men and women, so sex adds 1 bit. So zip code, sex, and birth date would give a total of 29.7 bits. Since the US population is between 2^28 and 2^29, it’s plausible that we’d have enough information to identify everyone.
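The arithmetic, with a base-2 logarithm for comparison to population size (308,745,538 is the 2010 census count):

```python
from math import log2

bits = 13.8 + 1 + 14.9    # zip code + sex + birth date
print(bits)               # 29.7

# the 2010 US census population, for comparison
print(log2(308_745_538))  # between 28 and 29
```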

We’ve made a number of simplifying assumptions. We were a little fast and loose with age data, and we’ve assumed independence several times. We know that sex and age are not independent: more babies are boys, but women live longer. Still, Latanya Sweeney found empirically that you can identify 87% of Americans using the combination of zip code, sex, and birth date [1]. Her study was based on 1990 census data, and at that time the US population was a little less than 2^28.

Related posts

***

[1] Latanya Sweeney. “Simple Demographics Often Identify People Uniquely”. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh 2000. Available here.

[2] Bob Wells and Mel Tormé. “The Christmas Song.” Commonly known as “Chestnuts Roasting on an Open Fire.”

Bits of information in a US zip code


If you know someone’s US zip code, how much do you know about them? We can use entropy to measure the amount of information in bits.

To quantify the amount of information in a zip code, we need to know how many zip codes there are, and how evenly people are divided into zip codes.

There are about 43,000 zip codes in the US. The number fluctuates over time due to small adjustments.

Average information is maximized by dividing people into categories as evenly as possible. Information about one particular person, on the other hand, is maximized by dividing people into categories as unevenly as possible. To see this, suppose there were only two zip codes. The information we’d expect to learn from a zip code would be maximized if we divided people into two equal groups. Suppose on the other hand you were in one zip code and everyone else in the other. On average, zip code would reveal very little about someone, though it would reveal a lot about you!

If everyone were divided evenly into one of 43,000 zip codes, the amount of information revealed by knowing someone’s zip code would be about 15.4 bits, a little more information than asking 15 independent yes/no questions, each with equally likely answers.
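The 15.4 figure is just the base-2 logarithm of the number of zip codes:

```python
from math import log2

print(log2(43_000))  # about 15.39 bits
print(2**15, 2**16)  # 43,000 falls between 15 and 16 yes/no questions
```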

But zip codes are not evenly populated. How much information is there in an actual five-digit zip code? To answer that question we need to know the population of each zip code. That’s a little tricky. Zip codes represent mail delivery points, not geographical areas. Technically the US Census Bureau tracks population by ZCTA (Zip Code Tabulation Area) rather than zip code per se. Population by ZCTA is freely available, but difficult to find. I gave up trying to find the data from official sources but was able to find it here.

We can go through the data, find the probability p of someone living in each ZCTA, and add up −p log2(p) over all areas. When we do, we find that a ZCTA contains 13.83 bits of information. (We knew it had to be less than 15.4 because uneven distribution reduces entropy.)
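The calculation is ordinary Shannon entropy. A minimal version, with a toy distribution to illustrate that unevenness reduces entropy:

```python
from math import log2

def entropy(probs):
    # Shannon entropy in bits; zero-probability cells contribute nothing
    return -sum(p * log2(p) for p in probs if p > 0)

# an even split over four categories gives the maximum, 2 bits...
print(entropy([0.25, 0.25, 0.25, 0.25]))

# ...while an uneven split over the same four categories gives less
print(entropy([0.7, 0.1, 0.1, 0.1]))  # about 1.36 bits
```

Applying the same function to the ZCTA population shares gives the 13.83 figure.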

The Safe Harbor provision of US HIPAA law lists zip codes as a quasi-identifier. Five digit zip codes do not fall under Safe Harbor. Three digit zip codes (the first three digits of five digit zip codes) do fall under Safe Harbor, mostly. Some areas are so sparsely populated that even three-digit zip code areas are considered too informative. Any three-digit zip code with fewer than 20,000 people is excluded. You can find a snapshot of the list here, though the list may change over time.

If we repeat our calculation for three-digit zip codes, we find that they carry about 9.17 bits of information. It makes little difference to the result whether you include sparse regions, exclude them, or lump them all into one region.

See the next post on information contained in age, birthday, and birth date.

Related posts: