Sample size calculation

If you’re going to run a test on rabbits, you have to decide how many rabbits you’ll use. This is your sample size. A lot of what statisticians do in practice is calculate sample sizes.

A researcher comes to talk to a statistician. The statistician asks what effect size the researcher wants to detect. Do you think the new thing will be 10% better than the old thing? If so, you’ll need to design an experiment with enough subjects to stand a good chance of detecting a 10% improvement. Roughly speaking, sample size is inversely proportional to the square of effect size. So if you want to detect a 5% improvement, you’ll need 4 times as many subjects as if you want to detect a 10% improvement.

You’re never guaranteed to detect an improvement. The race is not always to the swift, nor the battle to the strong. So it’s not enough to think about what kind of effect size you want to detect, you also have to think about how likely you want to be to detect it.

Here’s what often happens in practice. The researcher makes an arbitrary guess at what effect size she expects to see. Then initial optimism may waver and she decides it would be better to design the experiment to detect a more modest effect size. When asked how high she’d like her chances to be of detecting the effect, she thinks 100% but says 95% since it’s necessary to tolerate some chance of failure.

The statistician comes back and says the researcher will need a gargantuan sample size. The researcher says this is far outside her budget. The statistician asks what the budget is, and what the cost per subject is, and then the real work begins.

The sample size the negotiation will converge on is the budget divided by the cost per sample. The statistician will fiddle with the effect size and probability of detecting it until the inevitable sample size is reached. This sample size, calculated to 10 decimal places and rounded up to the next integer, is solemnly reported with a post hoc justification containing no mention of budgets.

Sample size is always implicitly an economic decision. If you’re willing to make it explicitly an economic decision, you can compute the expected value of an experiment by placing a value on the possible outcomes. You make some assumptions—you always have to make assumptions—and calculate the probability under various scenarios of reaching each conclusion for various sample sizes, and select the sample size that leads to the best expected value.

More on experimental design

[1] There are three ways an A/B test can turn out: A wins, B wins, or there isn’t a clear winner. There’s a tendency to not think enough about the third possibility. Interim analysis often shuts down an experiment not because there’s a clear winner, but because it’s becoming clear there is unlikely to be a winner.

Truncated distributions vs clipped distributions

In the previous post, I looked at truncated probability distributions. A truncated normal distribution, for example, lives on some interval and has a density proportional to a normal distribution; the proportionality constant is whatever it has to be to make the density integrate to 1.

Truncated distributions

Suppose you wanted to generate random samples from a normal distribution truncated to [a, b]. How would you go about it? You could generate a sample x from the (full) normal distribution, and if axb, then return x. Otherwise, try again. Keep trying until you get a sample inside [a, b], and when you get one, return it [1].

This may be the most efficient way to generate samples from a truncated distribution. It’s called the accept-reject method. It works well if the probability of a normal sample being in [a, b] is high. But if the interval is small, a lot of samples may be rejected before one is accepted, and there are more efficient algorithms. You might also use a different algorithm if you want consistent run time more than minimal average run time.

Clipped distributions

Now consider a different kind of sampling. As before, we generate a random sample x from the full normal distribution and if axb we return it. But if x > b we return b. And if x < a we return a. That is, we clip x to be in [a, b]. I’ll call this a clipped normal distribution. (I don’t know whether this name is standard, or even if there is a standard name for this.)

Clipped distributions are not truncated distributions. They’re really a mixture of three distributions. Start with a continuous random variable X on the real line, and let Y be X clipped to [a, b]. Then Y is a mixture of a continuous distribution on [a, b] and two point masses: Y takes on the value a with probability P(X < a) and Y takes on the value b with probability P(X > b).

Clipped distributions are mostly a nuisance. Maybe the values are clipped due to the resolution of some sensor: nobody wants to clip the values, that’s just a necessary artifact. Or maybe values are clipped because some form only allows values within a given range; maybe the person designing the form didn’t realize the true range of possible values.

It’s common to clip extreme data values while deidentifying data to protect privacy. This means that not only is some data deliberately inaccurate, it also means that the distribution changes slightly since it’s now a clipped distribution. If that’s unacceptable, there are other ways to balance privacy and statistical utility.

Related posts

[1] Couldn’t this take forever? Theoretically, but with probability zero. The number of attempts follows a geometric distribution, and you could use that to find the expected number of attempts.

The truncated Cauchy paradox

The Cauchy distribution is the probability distribution on the real line with density proportional to 1/(1 + x²). It comes up often as an example of a fat-tailed distribution, one that can wreak havoc on intuition and on applications. It has no mean and no variance.

The truncated Cauchy distribution has the same density, with a different proportionality constant, over a finite interval [a, b]. It’s a well-behaved distribution, with a mean, variance, and any other moment you might care about.

Here’s a sort of paradox. The Cauchy distribution is pathological, but the truncated Cauchy distribution is perfectly well behaved. But if a is a big negative number and b is a big positive number, surely the Cauchy distribution on [a, b] is sorta like the full Cauchy distribution.

I will argue here that in some sense the truncated Cauchy distribution isn’t so well behaved after all, and that a Cauchy distribution truncated to a large interval is indeed like a Cauchy distribution.

You can show that the variance of the truncated Cauchy distribution over a large interval is approximately equal to the length of the interval. So if you want to get around the problems of a Cauchy distribution by truncating it to live on a large interval, you’ve got a problem: it matters very much how you truncate it. For example, you might think “It doesn’t matter, make it [-100, 100]. Or why not [-1000, 1000] just to be sure it’s big enough.” But the variances of the two truncations differ by a factor of 10.

Textbooks usually say that the Cauchy distribution has no variance. It’s more instructive to say that it has infinite variance. And as the length of the interval you truncate the Cauchy to approaches infinity, so does its variance.

The mean of the Cauchy distribution does not exist. It fails to exist in a different way than the variance. The integral defining the variance of a Cauchy distribution diverges to +∞. But the integral defining the mean of the Cauchy simply does not exist. You can get different values of the integral depending on how you let the end points go off to infinity. In fact, you could get any value you want by specifying a particular way for truncation interval to grow to the real line.

So once again you can’t simply say “Just truncate it to a big interval” because the mean depends on how you choose your interval.

If we were working with a thin-tailed distribution, like a normal, and truncating it to a big interval, the choice of interval would make little difference. If you truncate a normal distribution to [-10, 5], or [-33, 42], or [-2000, 1000] makes almost no difference to the mean or the variance. But for the Cauchy distribution, it makes a substantial difference.


I’ve experimented with something new in this post: it’s deliberately short on details. Lots of words, no equations. Everything in this post can be made precise, and you may consider it an exercise to make everything precise if you’re so inclined.

By leaving out the details, I hope to focus attention on the philosophical points of the post. I expect this will go over well with people who don’t want to see the details, and with people who can easily supply the details, but maybe not with people in between.

Expected length of longest common DNA substrings

If we have two unrelated sequences of DNA, how long would we expect the longest common substring of the two sequences to be? Since DNA sequences come from a very small alphabet, just four letters, any two sequences of moderate length will have some common substring, but how long should we expect it to be?

Motivating problem

There is an ongoing controversy over whether the SARS-CoV-2 virus contains segments of the HIV virus. The virologist who won the Nobel Prize for discovering the HIV virus, Luc Montagnier, claims it does, and other experts claim it does not.

I’m not qualified to have an opinion on this dispute, and do not wish to discuss the dispute per se. But it brought to mind a problem I’d like to say something about.

This post will look at a simplified version of the problem posed by the controversy over the viruses that cause COVID-19 and AIDS. Namely, given two random DNA sequences, how long would you expect their longest common substring to be?

I’m not taking any biology into account here other than using an alphabet of four letters. And I’m only looking at exact substring matches, not allowing for any insertions or deletions.

Substrings and subsequences

There are two similar problems in computer science that could be confused: the longest common substring problem and the longest common subsequence problem. The former looks for uninterrupted matches and the latter allows for interruptions in matches in either input.

For example, “banana” is a subsequence of “bandana” but not a substring. The longest common substring of “banana” and “anastasia” is “ana” but the longest common subsequence is “anaa.” Note that the latter is not a substring of either word, but a subsequence of both.

Here we are concerned with substrings, not subsequences. Regarding the biological problem of comparing DNA sequences, it would seem that a substring is a strong form of similarity, but maybe too demanding since it allows no insertions. Also, the computer science idea of a subsequence might be too general for biology since it allows arbitrary insertions. A biologically meaningful comparison would be somewhere in between.

Simulation results

I will state theoretical results in the next section, but I’d like to begin with a simulation.

First, some code to generate a DNA sequence of length N.

    from numpy.random import choice

    def dna_sequence(N):
        return "".join(choice(list("ACGT"), N))

Next, we need a way to find the longest common substring. We’ll illustrate how SequenceMatcher works before using it in a simulation.

    from difflib import SequenceMatcher

    N = 12
    seq1 = dna_sequence(N)
    seq2 = dna_sequence(N)   
    # "None" tells it to not consider any characters junk
    s = SequenceMatcher(None, seq1, seq2)

    print(s.find_longest_match(0, N, 0, N))

This produced

    Match(a=5, b=1, size=5)

meaning that the first match of maximum length, ATCAC in this case, starts in position 5 of the first string and position 1 of the second string (indexing from 0).

Now let’s run a simulation to estimate the expected length of the longest common substring in two sequences of DNA of length 300. We’ll do 1000 replications of the simulation to compute our average.

    def max_common_substr_len(a, b):
        # The None's turn off junk detection heuristic
        s = SequenceMatcher(None, a, b, None)
        m = s.find_longest_match(0, len(a), 0, len(b))
        return m.size

    seqlen = 300
    numreps = 1000
    avg = 0
    for _ in range(numreps):
        a = dna_sequence(seqlen)
        b = dna_sequence(seqlen)
        avg += max_common_substr_len(a, b)
    avg /= numreps

When I ran this code I got 7.951 as the average. Here’s a histogram of the results.

Expected substring length

Arratia and Waterman [1] published a paper in 1985 looking at the longest common substring problem for DNA sequences, assuming each of the four letters in the DNA alphabet are uniformly and independently distributed.

They show that for strings of length N coming from an alphabet of size b, the expected length of the longest match is asymptotically

2 logb(N).

This says in our simulation above, we should expect results around 2 log4(300) = 8.229.

There could be a couple reasons our simulation results were smaller than the theoretical results. First, it could be a simulation artifact, but that’s unlikely. I repeated the simulation several times and consistently got results below the theoretical expected value. I expect Arratia and Waterman’s result, although exact in the limit, is biased upward for finite n.

Update: The graph below plots simulation results for varying sequence sizes compared to the asymptotic value, the former always below the latter.

Simulation vs asymptotic estimate

Arratia and Waterman first assume i.i.d. probabilities to derive the result above, then prove a similar theorem under the more general assumption that the sequences are Markov chains with possibly unequal transition probabilities. The fact that actual DNA sequences are not exactly uniformly distributed and not exactly independent doesn’t make much difference to their result.

Arratia and Waterman also show that allowing for a finite number of deletions, or a number of deletions that grows sufficiently slowly as a function of n, does not change their asymptotic result. Clearly allowing deletions makes a difference for finite n, but the difference goes away in the limit.

Related posts

[1] Richard Arratia and Michael S. Waterman. An Erdős-Rényi Law with Shifts. Advances in Mathematics 55, 13-23 (1985).


Underestimating risk

Construction site

When I hear that a system has a one in a trillion (1,000,000,000,000) chance of failure, I immediately translate that in my mind to “So, optimistically the system has a one in a million (1,000,000) chance of failure.”

Extremely small probabilities are suspicious because they often come from one of two errors:

  1. Wrongful assumption of independence
  2. A lack of imagination

Wrongfully assuming independence

The Sally Clark case is an infamous example of a woman’s life being ruined by a wrongful assumption of independence. She had two children die of what we would call in the US sudden infant death syndrome (SIDS) and what was called in her native UK “cot death.”

The courts reasoned that the probability of two children dying of SIDS was the square of the probability of one child dying of SIDS. The result, about one chance in 73,000,000, was deemed to be impossibly small, and Sally Clark was sent to prison for murder. She was released from prison years later, and drank herself to death.

Children born to the same parents and living in the same environment hardly have independent risks of anything. If the probability of losing one child to SIDS is 1/8500, the conditional probability of losing a sibling may be small, but surely not as small as 1/8500.

The Sally Clark case only assumed two events were independent. By naively assuming several events are independent, you can start with larger individual probabilities and end up with much smaller final probabilities.

As a rule of thumb, if a probability uses a number name you’re not familiar with (such as septillion below) then there’s reason to be skeptical.

Lack of imagination

It is possible to legitimately calculate extremely small probabilities, but often this is by calculating the probability of the wrong thing, by defining the problem too narrowly. If a casino owner believes that the biggest risk to his business is dealing consecutive royal flushes, he’s not considering the risk of a tiger mauling a performer.

A classic example of a lack of imagination comes from cryptography. Amateurs who design encryption systems assume that an attacker must approach a system the same way they do. For example, suppose I create a simple substitution cipher by randomly permuting the letters of the English alphabet. There are over 400 septillion (4 × 1026) permutations of 26 letters, and so the chances of an attacker guessing my particular permutation are unfathomably small. And yet simple substitution ciphers are so easy to break that they’re included in popular books of puzzles. Cryptogram solvers do not randomly permute the alphabet until something works.

Professional cryptographers are not nearly so naive, but they have to constantly be on guard for more subtle forms of the same fallacy. If you create a cryptosystem by multiplying large prime numbers together, it may appear that an attacker would have to factor that product. That’s the idea behind RSA encryption. But in practice there are many cases where this is not necessary due to poor implementation. Here are three examples.

If the calculated probability of failure is infinitesimally small, the calculation may be correct but only address one possible failure mode, and not the most likely failure mode at that.

More posts on risk management

Reasoning under uncertainty

Reasoning under uncertainty sounds intriguing. Brings up images of logic, philosophy, and artificial intelligence.

Statistics sounds boring. Brings up images of tedious, opaque calculations followed by looking some number in a table.

But statistics is all about reasoning under uncertainty. Many people get through required courses in statistics without ever hearing that, or at least without ever appreciating that. Rote calculations are easy to teach and easy to grade, so introductory courses focus on that.

Statistics is more interesting than it may sound. And reasoning under uncertainty is less glamorous than it may sound.

Conditional independence notation

Ten years ago I wrote a blog post that concludes with this observation:

The ideas of being relatively prime, independent, and perpendicular are all related, and so it makes sense to use a common symbol to denote each.

This post returns to that theme, particularly looking at independence of random variables.


Graham, Knuth, and Patashnik proposed using ⊥ for relatively prime numbers in their book Concrete Mathematics, at least by the second edition (1994). Maybe it was in their first edition (1988), but I don’t have that edition.

Philip Dawid proposed a similar symbol ⫫ for (conditionally) independent random variables in 1979 [1].

We write X ⫫ Y to denote that X and Y are independent.

As explained here, independent random variables really are orthogonal in some sense, so it’s a good notation.


The symbol ⫫ (Unicode 2AEB, DOUBLE TACK UP) may or may not show up in your browser; it’s an uncommon character and your font may not have a glyph for it.

There’s no command in basic LaTeX for the symbol. You can enter the Unicode character in XeTeX, and there are several other alternatives discussed here. A simple work-around is to use


This says to take two perpendicular symbols, and kern them together by inserting three negative spaces between them.

The package MsSymbol has a command \upmodels to produce ⫫. Why “upmodels”? Because it is a 90° counterclockwise rotation of the \models symbol ⊧ from logic.

To put a strike through ⫫ in LaTeX to denote dependence, you can use \nupmodels from the MsSymbol package or if you’re not using a package you could use the following.


Graphoid axioms

As an example of where you might see the ⫫ symbol used for conditional independence, the table below gives the graphoid axioms for conditional independence. (They’re theorems, not axioms, but they’re called axioms because you could think of them as axioms for working with conditional independence at a higher level of abstraction.)

\begin{align*} \mbox{Symmetry: } & (X \perp\!\!\!\perp Y \mid Z) \implies (Y \perp\!\!\!\perp X \mid Z) \\ \mbox{Decomposition: } & (X \perp\!\!\!\perp YW \mid Z) \implies (X \perp\!\!\!\perp Y \mid Z) \\ \mbox{Weak union: } & (X \perp\!\!\!\perp YW \mid Z) \implies (X \perp\!\!\!\perp Y \mid ZW) \\ \mbox{Contraction: } & (X \perp\!\!\!\perp Y \mid Z) \,\wedge\, (X \perp\!\!\!\perp W \mid ZY)\implies (X \perp\!\!\!\perp YW \mid Z) \\ \mbox{Intersection: } & (X \perp\!\!\!\perp W \mid ZY) \,\wedge\, (X \perp\!\!\!\perp Y \mid ZW)\implies (X \perp\!\!\!\perp YW \mid Z) \\ \end{align*}

Note that the independence symbol ⫫ has higher precedence than the conditional symbol |. That is, X ⫫ Y | Z means X is independent of Y, once you condition on Z.

The axioms above are awfully dense, but they make sense when expanded into words. For example, the symmetry axiom says that if knowledge of Z makes Y irrelevant to X, it also makes X irrelevant to Y. The decomposition axiom says that if knowing Z makes the combination of Y and W irrelevant to X, then knowing Z makes Y alone irrelevant to X.

The intersection axiom requires strictly positive probability distributions, i.e. you can’t have events with probability zero.

More on conditional probability

[1] AP Dawid. Conditional Independence in Statistical Theory. Journal of the Royal Statistical Society. Series B (Methodological), Vol. 41, No. 1 (1979), pp. 1-31

Three composition theorems for differential privacy

This is a brief post, bringing together three composition theorems for differential privacy.

  1. The composition of an ε1-differentially private algorithm and an ε2-differentially private algorithm is an (ε12)-differentially private algorithm.
  2. The composition of an (ε1, δ1)-differentially private algorithm and an (ε2, δ2)-differentially private algorithm is an (ε12, δ12)-differentially private algorithm.

The three composition rules can be summarized briefly as follows:

ε1 ∘ ε2 → (ε1 + ε2)
1, δ1) ∘ (ε2, δ2) → (ε12, δ12)
(α, ε1) ∘ (α, ε2) → (α, ε12)

What is the significance of these composition theorems? In short, ε-differential privacy and Rényi differential privacy compose as one would hope, but (ε, δ)-differential privacy does not.

The first form of differential privacy proposed was ε-differential privacy. It is relatively easy to interpret, composes nicely, but can be too rigid.

If you have Gaussian noise, for example, you are lead naturally to (ε, δ)-differential privacy. The δ term is hard to interpret. Roughly speaking you could think  it as the probability that ε-differential privacy fails to hold. Unfortunately with (ε, δ)-differential privacy the epsilons add and so do the deltas. We would prefer that δ didn’t grow with composition.

Rényi differential privacy is a generalization of ε-differential privacy that uses a family of information measures indexed by α to measure the impact of a single row being or not being in a database. The case of α = ∞ corresponds to ε-differential privacy, but finite values of α tend to be less pessimistic. The nice thing about the composition theorem for Rényi differential privacy is that the α parameter doesn’t change, unlike the δ parameter in (ε, δ)-differential privacy.

Product of copulas

A few days ago I wrote a post about copulas and operations on them that have a group structure. Here’s another example of group structure for copulas. As in the previous post I’m just looking at two-dimensional copulas to keep things simple.

Given two copulas C1 and C2, you can define a sort of product between them by

(C_1 * C_2)(u,v) = \int_0^1 D_2C_1(u,t)\,\, D_1C_2(t,v) \,\, dt

Here Di is the partial derivative with respect to the ith variable.

The product of two copulas is another copula. This product is associative but not commutative. There is an identity element, so copulas with this product form a semigroup.

The identity element is the copula

M(u,v) = \min\{u, v\}

that is,

M * C = C * M = C

for any copula C.

The copula M is important because it is the upper bound for the Fréchet-Hoeffding bounds: For any copula C,

\max\{u+v-1, 0\}\leq C(u,v) \leq \min\{u, v\}

There is also a sort of null element for our semigroup, and that is the independence copula

\Pi(u,v) = uv

It’s called the independence copula because it’s the copula for two independent random variables: their joint CDF is the product of their individual CDFs. It acts like a null element because

\Pi * C = C * \Pi = \Pi

This tells us we have a semigroup and not a group: the independence copula cannot have an inverse.

Reference: Roger B. Nelsen. An Introduction to Copulas.

Group symmetry of copula operations

You don’t often see references to group theory in a statistics book. Not that there aren’t symmetries in statistics that could be described in terms of groups, but this isn’t often pointed out.

Here’s an example from An Introduction to Copulas by Roger Nelsen.

Show that under composition the set of operations of forming the survival copula, the dual of a copula, and the co-copula of a given copula, along with the identity (i.e., ^, ~, *, and i) yields the dihedral group.

Nelsen gives the following multiplication table for copula operations.

    o | i ^ ~ *
    i | i ^ ~ *
    ^ | ^ i * ~
    ~ | ~ * i ^
    * | * ~ ^ i

The rest of this post explains what a copula is and what the operations above are.

What is a copula?

At a high level, a copula is a mathematical device for modeling the dependence between random variables. Sklar’s theorem says you can express the joint distribution of a set of random variables in terms of their marginal distributions and a copula. If the distribution functions are continuous, the copula is unique.

The precise definition of a copula is technical. We’ll limit ourselves to copulas in two dimensions to make things a little simpler.

Let I be the unit interval [0, 1]. Then a (two-dimensional) copula is a function from I × I to I  that satisfies

\begin{align*} C(0, v) &= 0\\ C(u, 0) &= 0\\ C(u, 1) &= u\\ C(1, v) &= v \end{align*}

and is 2-increasing.

The idea of a 2-increasing function is that “gradients point northeast.” Specifically, for all points (x1, y1) and (x2, y2) with x1x2 and y1y2, we have

C(x_2, y_2) - C(x_2, y_1) - C(x_1, y_2) + C(x_1, y_1) \,\geq\, 0

The definition of copula makes no mention of probability, but the 2-increasing condition says that C acts like the joint CDF of two random variables.

Survival copula, dual copula, co-copula

For a given copula C, the corresponding survival copula, dual copula, and co-copula are defined by

\begin{align*} \hat{C}(u, v) &= u + v - 1 + C(1-u, 1-v) \\ \tilde{C}(u, v) &= u + v - C(u,v) \\ C^*(u,v) &= 1 - C(1-u, 1-v) \end{align*}


The reason for the name “survival” has to do with a survival function, i.e. complementary CDF of a random variable. The survival copula is another copula, but the dual copula and co-copulas aren’t actually copulas.

This post hasn’t said much too about motivation or application—that would take a lot more than a short blog post—but it has included enough that you could verify that the operations do compose as advertised.

Update: See this post for more algebraic structure for copulas, a sort of convolution product.