Sufficient statistic paradox

A sufficient statistic summarizes a set of data. If the data come from a known distribution with an unknown parameter, then the sufficient statistic carries as much information as the full data [0]. For example, given a set of n coin flips, the number of heads and the number of flips are sufficient statistics. More on sufficient statistics here.

There is a theorem by Koopman, Pitman, and Darmois that essentially says that useful sufficient statistics exist only if the data come from a class of probability distributions known as normal families [1]. This leads to what Persi Diaconis [2] labeled a paradox.

Exponential families are quite a restricted family of measuures. Modern statistics deals with far richer classes of probabilities. This suggests a kind of paradox. If statistics is to be of any real use it must provide ways of boiling down great masses of data to a few humanly interpretable numbers. The Koopman-Pitman-Darmois theorem suggests this is impossible unless nature follows highly specialized laws which no one really believes.

Diaconis suggests two ways around the paradox. First, the theorem he refers to concerns sufficient statistics of a fixed size; it doesn’t apply if the summary size varies with the data size. Second, and more importantly, the theorem says nothing about a summary containing approximately as much information as the full data.


[0] Students may naively take this to mean that all you need is sufficient statistics. No, it says that if you know the distribution the data come from then all you need is the sufficient statistics. You cannot test whether a model fits given sufficient statistics that assume the model. For example, mean and variance are sufficient statistics assuming data come from a normal distribution. But knowing only the mean and variance, you can’t assess whether a normal distribution model fits the data.

[1] This does not mean the data have to come from the normal distribution. The normal family of distributions includes the normal (Gaussian) distribution, but other distributions as well.

[2] Persi Diaconis. Sufficiency as Statistical Symmetry. Proceedings of the AMS Centennial Symposium. August 8–12, 1988.

A new take on the birthday problem

Vitalii Tymchyshyn and Andrii Khlevniuk posted a new paper here entitled “On the average number of birthdays in birthday paradox setup.” This paper gives a different perspective on the birthday problem and a new proof.

In the usual formation of the birthday problem, one asks the probability that everyone in a group of size n has a different birthday, or one asks how big n needs to be for the probability to be approximately 1/2 that two people share a birthday. Tymchyshyn and Khlevniuk ask about the expected number of unique birthdays in a group of size n and show that it equals

(1 – αn) / (1 – α)

where α = 364/365.

Variations on the birthday problem often come up in practice. For example, you often need to consider how likely it is that a hash function will map set of inputs to unique values. There’s a rule of thumb that if there are N possible hash values, then there’s a 50-50 chance of a collision if you hash √N values.

Let α = (N-1)/N and p = 1/N. If we take the first three terms in the Taylor series for (1 – p)n we see that the expected number of unique hash values after hashing n inputs is approximately

nn(n – 1) p/2.

If we choose n = √N, then the expected number of unique hash values is approximately

n – 1/2,

which is consistent with having about a 0.5 probability of one collision.

Related probability posts

Convergence rate of random walk on integers mod p

Consider a random walk on the integers mod p. At each step, you either take a step forward or a step back, with equal probability. After just a few steps, you’ll have to be near where you started. After a few more steps, you could be far from your starting point, but you’re probably not. But eventually, every point is essentially equally likely.

How many steps would you have to take before your location is uniformly distributed? There’s a theorem that says you need about p² steps. We’ll give the precise statement of the theorem shortly, but first let’s do a simulation.

We’ll take random walks on the integers mod 25. You could think of this as a walk on a 24-hour clock, with one extra hour squeezed in. Suppose we want to take N steps. We would generate a +1 or -1 at random, add that to our current position, and reduce mod 25, doing that N times. That would give us one outcome of a random walk with N steps. We could do this over and over, keeping track of where each random walk ended up, to get an idea how the end of the walk is distributed.

We will do an optimization to speed up the simulation. Suppose we generated all our +1s and -1s at once. Then we just need to add up the number of +1’s and subtract the number of -1’s. The number n of +1’s is a binomial(N, 1/2) random variable, and the number of +1’s minus the number of -1’s is 2nN. So our final position will be

(2nN) mod 25.

Here’s our simulation code.

    from numpy import zeros
    from numpy.random import binomial
    import matplotlib.pyplot as plt

    p = 25
    N = p*p//2
    reps = 100000

    count = zeros(p)

    for _ in range(reps):
        n = 2*binomial(N, 0.5) - N
        count[n%p] += 1, count/reps)

After doing half of p² steps the distribution is definitely not uniform.

Distribution after p^2/2 steps

But after p² steps the distribution is close to uniform, close enough that you start to wonder how much of the lack of uniformity is due to simulation.

Distribution after p^2 steps

Now here’s the theorem I promised, taken from Group Representations in Probability and Statistics by Persi Diaconis.

For np² with p odd and greater than 7, the maximum difference between the random walk density after n steps and the uniform density is no greater than exp(- π²n / 2p²).

There is no magic point at which the distribution suddenly becomes uniform. On the other hand, the difference between the random walk density and the uniform density does drop sharply at around p² steps. Between p²/2 steps and p² steps the difference drops by almost a factor of 12.

More random walk posts

Random sample overlap

Here’s a simple probability problem with an answer you may find surprising. (The statement of the problem and solution are simple. The proof is not as simple.)

Suppose you draw 1,000 serial numbers at random from a set of 1,000,000. Then you make another random sample of 1,000. How likely is it that no numbers will be the same on both lists?

To make the problem slightly more general, take two samples of size n from a population of size n² where n is large.

The probability of no overlap in the two samples is approximately 1/e. That is, in the limit as n approaches infinity, the probability converges to 1/e.

For n = 1,000 the approximation is good to three figures.


There are Binomial(n², n) ways to choose a sample of size n from a population of size n². After we’ve drawn the first sample, there are Binomial(n² – n, n) ways to draw a new sample that doesn’t overlap with the first one. The probability of such a sample is

\begin{align*} \left.\binom{n^2 - n}{n} \middle/ \binom{n^2}{n}\right. &= \frac{(n^2 - n)!}{(n^2 - 2n)!} \frac{(n^2 - n)!}{(n^2)!} \\ & = \frac{(n^2 -n)(n^2 - n -1) \cdots (n^2 - 2n+1)}{n^2 (n^2-1) \cdots (n^2 - n+1)} \\ &= \left( \frac{n^2 - n}{n^2} \right) \left( \frac{n^2 - n - 1}{n^2 -1} \right) \cdots \left( \frac{n^2 - 2n + 1}{n^2 - n+1} \right) \\ \end{align*}

The fractions on the last line are in decreasing order, and so their product is less than the first one raised to the nth power, and greater than the last one raised to the nth power.

By taking logs one can show that

\lim_{n\to\infty} \left( \frac{n^2 - n}{n^2} \right)^n = \lim_{n\to\infty} \left( \frac{n^2 - 2n + 1}{n^2 - n+1} \right)^n = \frac{1}{e}

Since upper and lower bounds on the probability converge to 1/e, the probability converges to 1/e.


This problem is reminiscent of the birthday problem, but it’s a little different because the samples are draw without replacement.

Incidentally, I didn’t make up this problem out of thin air. It came up naturally in the course of a consulting project this week.

Data Science and Star Science

I recently got a review copy of Statistics, Data Mining, and Machine Learning in Astronomy. I’m sure the book is especially useful to astronomers, but those of us who are not astronomers could use it as a survey of data analysis techniques, especially using Python tools, where all the examples happen to come from astronomy. It covers a lot of ground and is pleasant to read.

What is a privacy budget?

The idea behind differential privacy is that it doesn’t make much difference whether your data is in a data set or not. How much difference your participation makes is made precise in terms of probability statements. The exact definition doesn’t for this post, but it matters that there is an exact definition.

Someone designing a differentially private system sets an upper limit on the amount of difference anyone’s participation can make. That’s the privacy budget. The system will allow someone to ask one question that uses the whole privacy budget, or a series of questions whose total impact is no more than that one question.

If you think of a privacy budget in terms of money, maybe your privacy budget is $1.00. You could ask a single $1 question if you’d like, but you couldn’t ask any more questions after that. Or you could ask one $0.30 question and seven $0.10 questions.

Some metaphors are dangerous, but the idea of comparing cumulative privacy impact to a financial budget is a good one. You have a total amount you can spend, and you can chose how you spend it.

The only problem with privacy budgets is that they tend to be overly cautious because they’re based on worst-case estimates. There are several ways to mitigate this. A simple way to stretch privacy budgets is to cache query results. If you ask a question twice, you get the same answer both times, and you’re only charged once.

(Recall that differential privacy adds a little random noise to query results to protect privacy. If you could ask the same question over and over, you could average your answers, reducing the level of added noise, and so a differentially private system will rightly charge you repeatedly for repeated queries. But if the system adds noise once and remembers the result, there’s no harm in giving you back that same answer as often as you ask the question.)

A more technical way to get more from a privacy budget is to use Rényi differential privacy (RDP) rather than the original ε-differential privacy. The former simplifies privacy budget accounting due to simple composition rules, and makes privacy budgets stretch further by leaning away from worst-case analysis a bit and leaning toward average-case analysis. RDP depends on a tuning parameter that includes ε-differential privacy, so one can control how much RDP acts like ε-differential privacy by adjusting that parameter.

There are other ways to stretch privacy budgets as well. The net effect is that when querying a large database, you can often ask all the questions like, and get sufficiently accurate answers, without worrying about privacy budget.

More mathematical privacy posts

Hyperexponential and hypoexponential distributions

There are a couple different ways to combine random variables into a new random variable: means and mixtures. To take the mean of X and Y you average their values. To take the mixture of X and Y you average their densities. The former makes the tails thinner. The latter makes the tails thicker. When X and Y are exponential random variables, the mean has a hypoexponential distribution and the mixture has a hyperexponential distribution.

Hypoexponential distributions

Suppose X and Y are exponentially distributed with mean μ. Then their sum X + Y has a gamma distribution with shape 2 and scale μ. The sum has mean 2μ and variance 2μ². The coefficient of variation, the ratio of the standard deviation to the mean, is 1/√2. The hypoexponential distribution is so-called because its coefficient of variation is less than 1, whereas an exponential distribution has coefficient of variation 1 because the mean and standard deviation are always the same.

The means of X and Y don’t have to be the same. When they’re different, the sum does not have a gamma distribution, and so hypoexponential distributions are more general than gamma distributions.

A hypoexponential random variable can also be the sum of more than two exponentials. If it is the sum of n exponentials, each with the same mean, then the coefficient of variation is 1/√n. In general, the coefficient of variation for a hypoexponential distribution is the coefficient of variation of the means [1].

In the introduction we talked about means rather than sums, but it makes no difference to the coefficient of variation because going from sum to mean divides the mean and the standard deviation by the same amount.

Hyperexponential distributions

Hyperexponential random variables are constructed as a mixture of exponentials rather than an average. Instead selecting a value from X and a value from Y, we select a value from X or a value from Y. Given a mixture probability p, we choose a sample from X with probability p and a value from Y with probability 1 – p.

The density function for a mixture is a an average of the densities of the two components. So if X and Y have density functions fX and fY, then the mixture has density

p fX + (1 – p) fY

If you have more than two random variables, the distribution of their mixture is a convex combination of their individual densities. The coefficients in the convex combination are the probabilities of selecting each random variable.

If X and Y are exponential with means μX and μY, and we have a mixture that selects X with probability p, then then mean of the mixture is the mixture of the means

μ = p μX + (1 – p) μY

which you might expect, but the variance

σ² = p μX ² + (1 – p) μY ² + p(1 – p)(μX – μY

is not quite analogous because of the extra p(1 – p)(μX – μY)² term at the end. If μX = μY this last term drops out and the coefficient of variation is 1: mixing two identically distributed random variables doesn’t do anything to the distribution. But when the means are different, the coefficient of variation is greater than 1 because of the extra term in the variance of the mixture.


Suppose μX = 2 and μY = 8. Then the average of X and Y has mean 5, and so does an equal mixture of X and Y.

The average of X and Y has standard deviation √17, and so the coefficient of variation is √17/5 = 0.8246.

An exponential distribution with mean 5 would have standard deviation 5, and so the coefficient of variation 1.

An equal mixture of X and Y has standard deviation √43, and so the coefficient of variation is √43/5 = 1.3114.

More probability distribution posts

[1] If exponential random variables Xi have means μi, then the coefficient of variation of their sum (or average) is

√(μ1² + μ2² + … + μn²) / (μ1 + μ2 + … + μn)

Fat tails and the t test

Suppose you want to test whether something you’re doing is having any effect. You take a few measurements and you compute the average. The average is different than what it would be if what you’re doing had no effect, but is the difference significant? That is, how likely is it that you might see the same change in the average, or even a greater change, if what you’re doing actually had no effect and the difference is due to random effects?

The most common way to address this question is the one-sample t test. “One sample” doesn’t mean that you’re only taking one measurement. It means that you’re taking a set of measurements, a sample, from one thing. You’re not comparing measurements from two different things.

The t test assumes that the data are coming from some source with a normal (Gaussian) distribution. The Gaussian distribution has thin tails, i.e. the probability of seeing a value far from the mean drops precipitously as you move further out. What if the data are actually coming from a distribution with heavier tails, i.e. a distribution where the probability of being far from the mean drops slowly?

With fat tailed data, the t test loses power. That is, it is less likely to reject the null hypothesis, the hypothesis that the mean hasn’t changed, when it should. First we will demonstrate by simulation that this is the case, then we’ll explain why this is to be expected from theory.


We will repeatedly draw a sample of 20 values from a distribution with mean 0.8 and test whether the mean of that distribution is not zero by seeing whether the t test produces a p-value less than the conventional cutoff of 0.05. We will increase the thickness of the distribution tails and see what that does to our power, i.e. the probability of correctly rejecting the hypothesis that the mean is zero.

We will fatten the tails of our distribution by generating samples from a Student t distribution and decreasing the number of degrees of freedom: as degrees of freedom go down, the weight of the tail goes up.

With a large number of degrees of freedom, the t distribution is approximately normal. As the number of degrees of freedom decreases, the tails get fatter. With one degree of freedom, the t distribution is a Cauchy distribution.

Here’s our Python code:

from scipy.stats import t, ttest_1samp

n = 20
N = 1000

for df in [100, 30, 10, 5, 4, 3, 2, 1]:
    rejections = 0
    for _ in range(N):
        y = 0.8 + t.rvs(df, size=n)
        stat, p = ttest_1samp(y, 0)
        if p < 0.05:
            rejections += 1
    print(df, rejections/N)

And here’s the output:

100 0.917
 30 0.921 
 10 0.873 
  5 0.757  
  4 0.700    
  3 0.628  
  2 0.449  
  1 0.137  

When the degrees of freedom are high, we reject the null about 90% of the time, even for degrees of freedom as small as 10. But with one degree of freedom, i.e. when we’re sampling from a Cauchy distribution, we only reject the null around 14% of the time.


Why do fatter tails lower the power of the t test? The t statistic is

\frac{\bar{y} - \mu_0}{s / \sqrt{n}}

where y bar is the sample average, μ0 is the mean under the null hypothesis (μ0 = 0 in our example), s is the sample standard deviation, and n is the sample size.

As distributions become fatter in the tails, the sample standard deviation increases. This means the denominator in the t statistic gets larger and so the t statistic gets smaller. The smaller the t statistic, the greater the probability that the absolute value of a t random variable is greater than the statistic, and so the larger the p-value.

t statistic, t distribution, t test

There are a lot of t‘s floating around in this post. I’ll finish by clarifying what the various t things are.

The t statistic is the thing we compute from our data, given by the expression above. It is called a t statistic because if the hypotheses of the test are satisfied, this statistic has a t distribution with n-1 degrees of freedom. The t test is a hypothesis test based on the t statistic and its distribution. So the t statistic, the t distribution, and the t test are all closely related.

The t family of probability distributions is a convenient example of a family of distributions whose tails get heavier or lighter depending on a parameter. That’s why in the simulation we drew samples from a t distribution. We didn’t need to, but it was convenient. We would get similar results if we sampled from some other distribution whose tails get thicker, and so variance increases, as we vary some parameter.

More probability posts

Testing Rupert Miller’s suspicion

I was reading Rupert Miller’s book Beyond ANOVA when I ran across this line:

I never use the Kolmogorov-Smirnov test (or one of its cousins) or the χ² test as a preliminary test of normality. … I have a feeling they are more likely to detect irregularities in the middle of the distribution than in the tails.

Rupert wrote these words in 1986 when it would have been difficult to test is hunch. Now it’s easy, and so I wrote up a little simulation to test whether his feeling was justified. I’m sure this has been done before, but it’s easy (now—it would not have been in 1986) and so I wanted to do it myself.

I’ll compare the Kolmogorov-Smirnov test, a popular test for goodness-of-fit, with the Shapiro-Wilks test that Miller preferred. I’ll run each test 10,000 times on non-normal data and count how often each test produces a p-value less than 0.05.

To produce departures from normality in the tails, I’ll look at samples from a Student t distribution. This distribution has one parameter, the number of degrees of freedom. The fewer degrees of freedom, the thicker the tails and so the further from normality in the tails.

Then I’ll look at a mixture of a normal and uniform distribution. This will have thin tails like a normal distribution, but will be flatter in the middle.

If Miller was right, we should expect the Shapiro-Wilks to be more sensitive for fat-tailed t distributions, and the K-S test to be more sensitive for mixtures.

First we import some library functions we’ll need and define our two random sample generators.

from numpy import where
from scipy.stats import *

def mixture(p, size=100):
    u = uniform.rvs(size=size)
    v = uniform.rvs(size=size)
    n = norm.rvs(size=size)
    x = where(u < p, v, n)
    return x

def fat_tail(df, size=100):
    return t.rvs(df, size=size)

Next is the heart of the code. It takes in a sample generator and compares the two tests, Kolmogorov-Smirnov and Shapiro-Wilks, on 10,000 samples of 100 points each. It returns what proportion of the time each test detected the anomaly at the 0.05 level.

def test(generator, parameter):

    ks_count = 0
    sw_count = 0

    N = 10_000
    for _ in range(N):
        x = generator(parameter, 100)

        stat, p = kstest(x, "norm")
        if p < 0.05:
            ks_count += 1
        stat, p = shapiro(x)
        if p < 0.05:
            sw_count += 1
    return (ks_count/N, sw_count/N)

Finally, we call the test runner with a variety of distributions.

for df in [100, 10, 5, 2]:
    print(test(fat_tail, df))

for p in [0.05, 0.10, 0.15, 0.2]:

Note that the t distribution with 100 degrees of freedom is essentially normal, at least as far as a sample of 100 points can tell, and so we should expect both tests to report a lack of fit around 5% of the time since we’re using 0.05 as our cutoff.

Here’s what we get for the fat-tailed samples.

(0.0483, 0.0554)
(0.0565, 0.2277)
(0.1207, 0.8799)
(0.8718, 1.0000)   

So with 100 degrees of freedom, we do indeed reject the null hypothesis of normality about 5% of the time. As the degrees of freedom decrease, and the fatness of the tails increases, both tests reject the null hypothesis of normality more often. However, in each chase the Shapiro-Wilks test picks up on the non-normality more often than the K-S test, about four times as often with 10 degrees of freedom and about seven times as often with 5 degrees of freedom. So Miller was right about the tails.

Now for the middle. Here’s what we get for mixture distributions.

(0.0731, 0.0677)
(0.1258, 0.1051)
(0.2471, 0.1876)
(0.4067, 0.3041)

We would expect both goodness of fit tests to increase their rejection rates as the mixture probability goes up, i.e. as we sample from the uniform distribution more often. And thatis what we see. But the K-S test outperforms the S-W test each time. Both test have rejection rates that increase with the mixture probability, but the rejection rates increase faster for the K-S test. Miller wins again.

More posts on hypothesis testing

Estimating vocabulary size with Heaps’ law

Heaps’ law says that the number of unique words in a text of n words is approximated by

V(n) = K nβ

where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 an 0.6.

(Note that it’s Heaps’ law, not Heap’s law. The law is named after Harold Stanley Heaps. However, true to Stigler’s law of eponymy, the law was first observed by someone else, Gustav Herdan.)

I’ll demonstrate Heaps law looking at books of the Bible and then by looking at novels of Jane Austen. I’ll also look at unique words, what linguists call “hapax legomena.”

Demonsrating Heaps law

For a collection of related texts, you can estimate the parameters K and β from data. I decided to see how well Heaps’ law worked in predicting the number of unique words in each book of the Bible. I used the King James Version because it is easy to download from Project Gutenberg.

I converted each line to lower case, replaced all non-alphabetic characters with spaces, and split the text on spaces to obtain a list of words. This gave the following statistics:

    | Book       |     n |    V |
    | Genesis    | 38520 | 2448 |
    | Exodus     | 32767 | 2024 |
    | Leviticus  | 24621 | 1412 |
    | III John   |   295 |  155 |
    | Jude       |   609 |  295 |
    | Revelation | 12003 | 1283 |

The parameter values that best fit the data were K = 10.64 and β = 0.518, in keeping with the typical ranges of these parameters.

Here’s a sample of how the actual vocabulary size and predicted vocabulary size compare.

    | Book       | True | Model |
    | Genesis    | 2448 |  2538 |
    | Exodus     | 2024 |  2335 |
    | Leviticus  | 1412 |  2013 |
    | III John   |  155 |   203 |
    | Jude       |  295 |   296 |
    | Revelation | 1283 |  1387 |

Here’s a visual representation of the results.

KJV bible total words vs distinct words

It looks like the predictions are more accurate for small books, and that’s true on an absolute scale. But the relative error is actually smaller for large books as we can see by plotting again on a log-log scale.

KJV bible total words vs distinct words

Jane Austen novels

It’s a little surprising that Heaps’ law applies well to books of the Bible since the books were composed over centuries and in two different languages. On the other hand, the same committee translated all the books at the same time. Maybe Heaps’ law applies to translations better than it applies to the original texts.

I expect Heaps’ law would fit more closely if you looked at, say, all the novels by a particular author, especially if the author wrote all the books in his or her prime. (I believe I read that someone did a vocabulary analysis of Agatha Christie’s novels and detected a decrease in her vocabulary in her latter years.)

To test this out I looked at Jane Austen’s novels on Project Gutenberg. Here’s the data:

    | Novel                 |      n |    V |
    | Northanger Abbey      |  78147 | 5995 |
    | Persuasion            |  84117 | 5738 |
    | Sense and Sensibility | 120716 | 6271 |
    | Pride and Prejudice   | 122811 | 6258 |
    | Mansfield Park        | 161454 | 7758 |
    | Emma                  | 161967 | 7092 |

The parameters in Heaps’ law work out to K = 121.3 and β = 0.341, a much larger K than before, and a smaller β.

Here’s a comparison of the actual and predicted vocabulary sizes in the novels.

    | Novel                 | True | Model |
    | Northanger Abbey      | 5995 |  5656 |
    | Persuasion            | 5738 |  5799 |
    | Sense and Sensibility | 6271 |  6560 |
    | Pride and Prejudice   | 6258 |  6598 |
    | Mansfield Park        | 7758 |  7243 |
    | Emma                  | 7092 |  7251 |

If a suspected posthumous manuscript of Jane Austen were to appear, a possible test of authenticity would be to look at its vocabulary size to see if it is consistent with her other works. One could also look at the number of words used only once, as we discuss next.

Hapax legomenon

In linguistics, a hapax legomenon is a word that only appears once in a given context. The term comes comes from a Greek phrase meaning something said only once. The term is often shortened to just hapax.

I thought it would be interesting to look at the number of hapax legomena in each book since I could do it with a minor tweak of the code I wrote for the first part of this post.

Normally if someone were speaking of hapax legomena in the context of the Bible, they’d be looking at unique words in the original languages, i.e. Hebrew and Greek, not in English translation. But I’m going with what I have at hand.

Here’s a plot of the number of haxap in each book of the KJV compared to the number of words in the book.

Hapax logemenon in Bible, linear scale

This looks a lot like the plot of vocabulary size and total words, suggesting the number of hapax also follow a power law like Heaps law. This is evident when we plot again on a logarithmic scale and see a linear relation.

Number of hapax logemena on a log-log scale

Just to be clear on the difference between two analyses this post, in the first we looked at vocabulary size, the number of distinct words in each book. In the second we looked at words that only appear once. In both cases we’re counting unique words, but unique in different senses. In the first analysis, unique means that each word only counts once, no matter how many times it’s used. In the second, unique means that a work only appears once.

Related posts