The negative binomial distribution and Pascal’s triangle

The Poisson probability distribution gives a simple, elegant model for count data. You can even derive from certain assumptions that data must have a Poisson distribution. Unfortunately reality doesn’t often go along with those assumptions.

A Poisson random variable with mean λ also has variance λ. But it’s often the case that data that would seem to follow a Poisson distribution has a variance greater than its mean. This phenomenon is called over-dispersion: the dispersion (variance) is larger than a Poisson distribution assumption would allow.

One way to address over-dispersion is to use a negative binomial distribution. This distribution has two parameters, r and p, and has the following probability mass function (PMF).

P(X = x) = \binom{r + x - 1}{x} p^r(1-p)^x

As the parameter r goes to infinity, with the mean held fixed, the negative binomial distribution converges to a Poisson distribution. So you can think of the negative binomial distribution as a generalization of the Poisson distribution.
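Here's a quick numerical sketch of the convergence using scipy, whose nbinom distribution uses the same (r, p) parameterization as the PMF above. Holding the mean at λ means taking p = r/(r + λ); the value λ = 4 below is an arbitrary choice.

    import numpy as np
    from scipy.stats import nbinom, poisson

    lam = 4.0                      # target mean, chosen arbitrarily
    k = np.arange(20)
    for r in [1, 10, 100, 1000]:
        p = r / (r + lam)          # keeps the mean fixed at lam
        gap = np.max(np.abs(nbinom.pmf(k, r, p) - poisson.pmf(k, lam)))
        print(r, gap)              # the gap shrinks as r grows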

These notes go into the negative binomial distribution in some detail, including where its name comes from.

If the parameter r is a positive integer, then the binomial coefficients in the PMF for the negative binomial distribution are on the rth diagonal of Pascal’s triangle.

Pascal's triangle

The case r = 1 corresponds to the first diagonal, the one consisting of all 1s. The case r = 2 corresponds to the second diagonal, consisting of consecutive integers. The case r = 3 corresponds to the third diagonal, the one consisting of triangular numbers. And so forth.
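A three-line check of this correspondence, using Python’s math.comb for the binomial coefficients in the PMF:

    from math import comb

    for r in [1, 2, 3]:
        print(r, [comb(r + x - 1, x) for x in range(8)])
    # 1 [1, 1, 1, 1, 1, 1, 1, 1]       all 1s
    # 2 [1, 2, 3, 4, 5, 6, 7, 8]       consecutive integers
    # 3 [1, 3, 6, 10, 15, 21, 28, 36]  triangular numbers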

Related posts

Variance matters more than mean in the extremes

Suppose you have two normal random variables, X and Y, and that the variance of X is less than the variance of Y.

Let M be an equal mixture of X and Y. That is, to sample from M, you first choose X or Y with equal probability, then you draw a sample from the random variable you chose.

Now suppose you’ve observed an extreme value of M. Then it is more likely that the value came from Y. The means of X and Y don’t matter, other than determining the cutoff for what “extreme” means.

High-level math

To state things more precisely, there is some value t such that the posterior probability that a sample m from M came from Y, given that |m| > t, is greater than the posterior probability that m came from X.

Let’s just look at the right-hand tails, even though the principle applies to both tails. If X and Y have the same variance, but the mean of X is greater, then larger values of M are more likely to have come from X. Now suppose the variance of Y is larger. As you go further out in the right tail of M, the posterior probability of an extreme value having come from Y increases, and eventually it surpasses the posterior probability of the sample having come from X. If X has a larger mean than Y, that will delay the point at which the posterior probability of Y passes the posterior probability of X, but eventually variance matters more than mean.

Detailed math

Let’s give a name to the random variable that determines whether we choose X or Y. Let’s call it C for coin flip, and assume C takes on 0 and 1 each with probability 1/2. If C = 0 we sample from X and if C = 1 we sample from Y. We want to compute the probability P(C = 1 | M ≥ t).

Without loss of generality we can assume X has mean 0 and variance 1. (Otherwise transform X and Y by subtracting off the mean of X then divide by the standard deviation of X.) Denote the mean of Y by μ and the standard deviation by σ.

From Bayes’ theorem we have

\text{P}(C = 1 \mid M \geq t) = \frac{ \text{P}(Y \geq t) }{ \text{P}(X \geq t) + \text{P}(Y \geq t) } = \frac{\Phi^c\left(\frac{t-\mu}{\sigma}\right)}{\Phi^c(t) + \Phi^c\left(\frac{t-\mu}{\sigma}\right)}

where Φᶜ(t) = P(Z ≥ t) for a standard normal random variable Z.

Similarly, to compute P(C = 1 | M ≤ t), just flip the direction of the inequality signs and replace Φᶜ(t) = P(Z ≥ t) with Φ(t) = P(Z ≤ t).

The calculation for P(C = 1 | |M| ≥ t) is similar.

Example

Suppose Y has mean −2 and variance 10. The blue curve shows that a large negative sample from M very likely comes from Y, and the orange curve shows that a large positive sample very likely comes from Y as well.

The dip in the orange curve shows the transition zone where X‘s advantage due to a larger mean gives way to the disadvantage of a smaller variance. This illustrates that the posterior probability of Y eventually increases, but not necessarily monotonically.

Here’s a plot showing the probability of a sample having come from Y depending on its absolute value.
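Here is a sketch of how such a plot could be computed with scipy, using the example’s parameters. The function names are mine, not from any library.

    import numpy as np
    from scipy.stats import norm

    mu, sigma = -2, np.sqrt(10)    # parameters of Y; X is standard normal

    def posterior_Y_right_tail(t):
        """P(C = 1 | M >= t): chance a right-tail sample came from Y."""
        y = norm.sf((t - mu) / sigma)
        return y / (norm.sf(t) + y)

    def posterior_Y_abs(t):
        """P(C = 1 | |M| >= t), combining both tails."""
        y = norm.sf((t - mu) / sigma) + norm.cdf((-t - mu) / sigma)
        return y / (2 * norm.sf(t) + y)

    for t in [0, 1, 2, 4, 6]:
        print(t, posterior_Y_right_tail(t), posterior_Y_abs(t))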

Related posts

Why do medical tests always have error rates?

Most people implicitly assume medical tests are infallible. If they test positive for X, they assume they have X. Or if they test negative for X, they’re confident they don’t have X. Neither is necessarily true.

Someone recently asked me why medical tests always have an error rate. It’s a good question.

A test is necessarily a proxy, a substitute for something else. You don’t run a test to determine whether someone has a gunshot wound: you just look.

A test for a disease is based on a pattern someone discovered. People who have a certain condition usually have a certain chemical in their blood, and people who do not have the condition usually do not have that chemical in their blood. But it’s not a direct observation of the disease.

“Usually” is as good as you can do in biology. It’s difficult to make universal statements. When I first started working with genetic data, I said something to a colleague about genetic sequences being made of A, C, G, and T. I was shocked when he replied “Usually. This is biology. Nothing always holds.” It turns out some viruses have Zs (aminoadenine) rather than As (adenine).

Error rates may be very low and still be misleading. A positive test for a rare disease is usually a false positive, even if the test is highly accurate. This is a canonical example of the need to use Bayes’ theorem. The details are written up here.
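To make that concrete, here is the standard Bayes’ theorem calculation with made-up but plausible numbers: a disease with 0.1% prevalence and a test that is 99% sensitive and 99% specific.

    prevalence  = 0.001   # hypothetical rare disease
    sensitivity = 0.99    # P(test positive | disease)
    specificity = 0.99    # P(test negative | no disease)

    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_positive
    print(ppv)   # about 0.09: most positives are false positives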

The more often you test for something, the more often you’ll find it, correctly or incorrectly. The more conditions you test, the more conditions you find, correctly or incorrectly.

Wouldn’t it be great if your toilet had a built-in lab that tested you for hundreds of diseases several times a day? No, it would not! The result would be frequent false positives, leading to unnecessary anxiety and expense.

Up to this point we’ve discussed medical tests, but the same applies to any kind of test. Surveillance is a kind of test, one that also has false positives and false negatives. The more often Big Brother reads your email, the more likely it is that he will find something justifying a knock on your door.

Related posts

Rényi’s parking constant

Parallel parking photo by IFCAR, Public Domain

Imagine parallel parking is available along the shoulder of a road, but no parking spaces are marked.

The first person to park picks a spot on the shoulder at random. Then another car also chooses a spot along the shoulder at random, with the constraint that the second car can’t overlap the first. This process continues until no more cars can park. How many people can park this way?

Assume all cars have the same length, and we choose our units so that cars have length 1. The expected number of cars that can park is a function M(x) of the length of the parking strip x. Clearly if x < 1 then M(x) = 0. Alfréd Rényi [1] found that for x ≥ 1, M(x) satisfies the equation

M(x) = 1 + \frac{2}{x-1}\int_0^{x-1} M(t)\, dt

This is an unusual equation, difficult to work with because it defines M only implicitly. It’s not even clear that the equation has a solution. But it does, and the ratio of M(x) to x approaches a constant as x increases.

m = \lim_{x\to\infty} \frac{M(x)}{x} = 0.74759\ldots

The number m is known as Rényi’s parking constant.

This says for a long parking strip, parking sequentially at random will allow about 3/4 as many cars to park as if you were to pack the cars end-to-end.
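Rényi’s model is easy to simulate. The sketch below is mine, not from Rényi’s paper; it exploits the fact that once a car parks, the free segments to its left and right fill up independently.

    import random

    def park(length):
        """Number of unit cars that park at random on a strip of given length."""
        if length < 1:
            return 0
        u = random.uniform(0, length - 1)          # left end of the next car
        return 1 + park(u) + park(length - u - 1)  # the two segments fill independently

    x, trials = 100, 10_000
    print(sum(park(x) for _ in range(trials)) / (trials * x))  # roughly 0.75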

More posts on Rényi’s work

[1] Steven R. Finch. Mathematical Constants. Cambridge University Press, 2003.

Fitting a line without an intercept term

The other day I was looking at how many lumens LED lights put out per watt. I found some data on Wikipedia, and as you might expect the relation between watts and lumens is basically linear, though not quite.

If you were to do a linear regression on the data you’d get a relation

lumens = a × watts + b

where the intercept term b is not zero. But this doesn’t make sense: a light bulb that is turned off doesn’t produce light, and it certainly doesn’t produce negative light. [1]

You may be able to fit the regression and ignore b; it’s probably small. But what if you wanted to require that b = 0? Some regression software will allow you to specify a zero intercept, and some will not. But it’s easy enough to compute the slope a without using any regression software.

Let x be the vector of input data, the wattage of the LED bulbs. And let y be the corresponding light output in lumens. The regression line uses the slope a that minimizes

(a xy)² = a² x · x − 2a x · y + y · y.

Setting the derivative with respect to a to zero shows

a = x · y / x · x
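In numpy this comes down to one line. The wattage and lumen figures below are made-up stand-ins for the Wikipedia data, chosen to be roughly realistic.

    import numpy as np

    x = np.array([4.0, 6.0, 9.0, 13.0, 17.0])  # watts (hypothetical)
    y = np.array([410, 600, 880, 1300, 1700])  # lumens (hypothetical)

    a = np.dot(x, y) / np.dot(x, x)  # slope of the no-intercept model y = ax
    print(a)                         # about 100 lumens per watt

Equivalently, np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None) fits the same no-intercept model.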

Now there’s more to regression than just line fitting. A proper regression analysis would look at residuals, confidence intervals, etc. But the calculation above was good enough to conclude that LED lights put out about 100 lumens per watt.

It’s interesting that making the model more realistic, i.e. requiring b = 0, is either a complication or a simplification, depending on your perspective. It complicates using software, but it simplifies the math.

Related posts

[1] The orange line in the image above is the least squares fit for the model y = ax, but it’s not quite the same line you’d get if you fit the model y = ax + b.

Small probabilities add, big ones don’t

A video has been making the rounds in which a well-known professor [1] says that if something has a 20% probability of happening in one attempt, then it has a 40% chance of happening in two attempts, a 60% chance in happening in three attempts, etc.

This is wrong, but it’s a common mistake. And one reason it’s common is that a variation on the mistake is approximately correct, which we will explain shortly.

It’s obvious the reasoning in the opening paragraph is wrong when you extend it to five, or especially six, attempts. Are you certain to succeed after five attempts? What does it even mean that you have a 120% chance of success after six attempts?!

But let’s reduce the probabilities in the opening paragraph. If there’s a 2% chance of success on your first attempt, is there a 4% chance of success in two attempts and a 6% chance of success in three attempts? Yes, approximately.

Two attempts

Here is the correct formula for the probability of an event happening in two tries.

P(A \cup B) = P(A) + P(B) - P(A\cap B)

In words, the probability of A or B happening equals the probability of A happening, plus the probability of B happening, minus the probability of A and B both happening. The last term is a correction term. Without it, you’re counting some possibilities twice.

So if the probability of success on each attempt is 0.02, the probability of success on two attempts is

0.02 + 0.02 − 0.0004 = 0.0396 ≈ 0.04.

When the probabilities of A and B are each small, the probability of A and B both happening is an order of magnitude smaller, assuming independence [2]. The smaller the probabilities of A and B, the less the correction term matters.

If the probability of success on each attempt is 0.2, the probability of success after two attempts is 0.36. Simply adding probabilities and neglecting the correction term is incorrect, but not terribly far from correct in this case.

Three attempts

When you consider more attempts, things get more complicated. The probability of success after three attempts is given by

\begin{align*} P(A \cup B \cup C) &= P(A) + P(B) + P(C) \\ &- P(A\cap B) - P(B \cap C) - P(A \cap C) \\ &+ P(A \cap B \cap C) \end{align*}

as I discuss here. Adding the probabilities of success separately overestimates the correct probability, so you correct by subtracting the probabilities of pairs of successes. But this over-corrects, so you need to add back in the probability of all three succeeding.

If A, B, and C all have a 20% probability, the probability of A or B or C happening is 48.8%, not 60%, again assuming independence.

The error from naively adding probabilities increases as the number of probabilities increases.

n attempts

Now let’s look at the general case. Suppose your probability of success on each attempt is p. Then your probability of failure on each independent attempt is 1 − p. The probability of at least one success out of n attempts is the complement of the probability of all failures, i.e.

1 - (1 - p)^n

When p is small, and when n is small, we can approximate this by np. That’s why naively adding probabilities works when the probabilities are small and there aren’t many of them. Here’s a way to say this precisely using the binomial theorem.

\begin{align*} 1 - (1 - p)^n &= 1 - \left(1 - np + {n \choose 2}p^2 - {n \choose 3}p^3 + \cdots\right ) \\ &= np + {\cal O}(p^2) \end{align*}

The exact probability is np plus (n − 1) terms that involve higher powers of p. When p and n are sufficiently small, these terms can be ignored.
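A few lines of Python make the approximation error concrete, including the impossible “120%” case from earlier.

    # naive sum n*p versus the exact probability 1 - (1-p)^n
    for p in [0.02, 0.2]:
        for n in [2, 3, 6]:
            print(p, n, n * p, 1 - (1 - p)**n)
    # p = 0.02: 0.04 vs 0.0396, 0.06 vs 0.0588, 0.12 vs 0.1142
    # p = 0.2:  0.4 vs 0.36,    0.6 vs 0.488,   1.2 (!) vs 0.738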


[1] I’m deliberately not saying who. My point here is not to rub his nose in his mistake. This post will be online long after the particular video has been forgotten.

[2] Assuming A and B are independent. This is not always the case, and wrongly assuming independence can have disastrous consequences as I discuss here, but that’s a topic for another day.

Logistic regression quick takes

This post is a series of quick thoughts related to logistic regression. It starts with this article on moving between logit and probability scales.

***

Logistic regression models the probability of a yes/no event occurring. It gives you more information than a model that simply tries to classify yeses and nos. I advised a client to move from an uninterpretable classification method to logistic regression and they were so excited about the result that they filed a patent on it.

It’s too late to patent logistic regression, but they filed a patent on the application of logistic regression to their domain. I don’t know whether the patent was ever granted.

***

The article cited above is entitled “Rough approximations to move between logit and probability scales.” Here is a paragraph from the article giving its motivation.

When working in this space, it’s helpful to have some approximations to move between the logit and probability scales. (As an analogy, it is helpful to know that for a normal distribution, the interval ± 2 standard deviations around the mean covers about 95% of the distribution, while ± 3 standard deviations covers about 99%.)

Here are half the results from the post; the other half follow by symmetry.

    |-------+-------|
    |  prob | logit |
    |-------+-------|
    | 0.500 |     0 |
    | 0.750 |     1 |
    | 0.900 |     2 |
    | 0.950 |     3 |
    | 0.999 |     7 |
    |-------+-------|

Zero on the logit scale corresponds exactly to a probability of 0.5. The other values are approximate.

When I say the rest of the table follows by symmetry, I’m alluding to the fact that

logit(1 − p) = − logit(p).

So, for example, because logit(0.999) ≈ 7, logit(0.001) ≈ −7.
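The whole table is easy to check with scipy:

    from scipy.special import logit

    for p in [0.5, 0.75, 0.9, 0.95, 0.999]:
        print(p, logit(p))
    # logits: 0.00, 1.10, 2.20, 2.94, 6.91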

***

The post reminded me of the decibel scale. As I wrote in this post, “It’s a curious and convenient fact that many decibel values are close to integers.”

  • 3 dB ≈ 2
  • 6 dB ≈ 4
  • 7 dB ≈ 5
  • 9 dB ≈ 8

I was curious whether the logit / probability approximations were as accurate as these decibel approximations. Alas, they are not. They are rough approximations, as advertised in the title, but still useful.

***

The post also reminded me of a comment by Andrew Gelman and Jennifer Hill on why natural logs are natural for regression.


MCMC and the coupon collector problem

Bob Carpenter wrote today about how Markov chains cannot thoroughly cover high-dimensional spaces, and that they do not need to. That’s kinda the point of Monte Carlo integration. If you want systematic coverage, you can/must sample systematically, and that’s impractical in high dimensions.

Bob gives the example that if you want to get one integration point in each quadrant of a 20-dimensional space, you need a million samples. (2²⁰ to be precise.) But you may be able to get sufficiently accurate results with a lot less than a million samples.

If you wanted to be certain to have one sample in each quadrant, you could sample at (±1, ±1, ±1, …, ±1). But if for some weird reason you wanted to sample randomly and hit each quadrant, you have a coupon collector problem. The expected number of samples to hit each of N cells with uniform [1] random sampling with replacement is approximately

N( log(N) + γ )

where γ is Euler’s constant. So if N = 2²⁰, the expected number of draws would be over 15 million.
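Here’s the arithmetic, along with a small simulation as a sanity check. (The formula is asymptotic; the exact expected value is N times the Nth harmonic number.)

    import random
    import numpy as np

    N = 2**20
    gamma = 0.5772156649015329              # Euler's constant
    print(N * (np.log(N) + gamma))          # about 15.1 million

    # sanity check against simulation for a small N
    def draws_to_cover(n):
        seen, count = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            count += 1
        return count

    n = 64
    print(np.mean([draws_to_cover(n) for _ in range(5000)]),
          n * (np.log(n) + gamma))          # both close to 303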

Related posts

[1] We’re assuming each cell is sampled independently with equal probability each time. Markov chain Monte Carlo is more complicated than that because the draws are not independent.

A surprising result about surprise index

Surprise index

Warren Weaver [1] introduced what he called the surprise index to quantify how surprising an event is. At first it might seem that the probability of an event is enough for this purpose: the lower the probability of an event, the more surprise when it occurs. But Weaver’s notion is more subtle than this.

Let X be a discrete random variable taking non-negative integer values such that

\text{Prob}(X = i) = p_i

Then the surprise index of the ith event is defined as

S_i = \frac{\sum_{j=0}^\infty p_j^2 }{p_i}

Note that if X takes on values 0, 1, 2, …, N − 1, all with equal probability 1/N, then Sᵢ = 1, independent of N. If N is very large, each outcome is rare but not surprising: because all events are equally rare, no specific event is surprising.

Now let X be the number of legs a human selected at random has. Then p₂ ≈ 1, and so the numerator in the definition of Sᵢ is approximately 1. Thus S₂ is approximately 1, but Sᵢ is large for any value of i ≠ 2.
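Here’s the definition in code, applied to a made-up distribution for the number of legs; the probabilities are invented for illustration.

    import numpy as np

    # invented probabilities for 0, 1, 2, or 3 legs
    p = np.array([0.001, 0.004, 0.99, 0.005])
    S = np.sum(p**2) / p      # surprise index of each outcome
    print(S)                  # S_2 is about 1; the others are in the hundreds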

The hard part of calculating the surprise index is computing the sum in the numerator. This is the same calculation that occurs in many contexts: Friedman’s index of coincidence, collision entropy in physics, Rényi entropy in information theory, etc.

Poisson surprise index

Weaver comments that he tried calculating his surprise index for Poisson and binomial random variables and had to admit defeat. As he colorfully says in a footnote:

I have spent a few hours trying to discover that someone else had summed these series and spent substantially more trying to do it myself; I can only report failure, and a conviction that it is a dreadfully sticky mess.

A few years later, however, R. M. Redheffer [2] was able to solve the Poisson case. His derivation is extremely terse. Redheffer starts with the generating function for the Poisson

e^{-\lambda} e^{\lambda x} = \sum p_mx^m

and then says

Let x = e^iθ; then e^−iθ; multiply; integrate from 0 to 2π and simplify slightly to obtain

\sum p_m^2 = (e^{-2\lambda}/\pi) \int_0^\pi e^{2\lambda \cos \theta}\, d\theta

The integral on the right is recognized as the zero-order Bessel function …

Redheffer then “recognizes” an expression involving a Bessel function. He acknowledges in a footnote that a colleague, M. V. Cerrillo, was responsible for recognizing the Bessel function.

It is surprising that the problem Weaver describes as a “dreadfully sticky mess” has a simple solution. It is also surprising that a Bessel function would pop up in this context. Bessel functions arise frequently in solving differential equations but not that often in probability and statistics.

Unpacking Redheffer’s derivation

When Redheffer says “Let x = e^iθ; then e^−iθ; multiply; integrate from 0 to 2π” he means that we should evaluate both sides of the Poisson generating function equation at these two values of x, multiply the results, and average both sides over the interval [0, 2π].

On the right hand side this means calculating

\frac{1}{2\pi} \int_0^{2\pi}\left(\sum_{m=0}^\infty p_m \exp(im\theta)\right) \left(\sum_{n=0}^\infty p_n \exp(-in\theta)\right)\, d\theta

This reduces to

\sum_{m=0}^\infty p_m^2

because

\frac{1}{2\pi}\int_0^{2\pi} \exp(i(m - n)\theta) \, d\theta = \delta_{mn}

i.e. the integral evaluates to 1 when m = n but otherwise equals zero.

On the left hand side we have

\begin{align*} \frac{1}{2\pi} \int_0^{2\pi} \exp(-2\lambda) \exp(\lambda(e^{i\theta} + e^{-i\theta})) \, d\theta &= \frac{1}{2\pi} \int_0^{2\pi} \exp(-2\lambda) \exp(2\lambda \cos \theta) \, d\theta \\ &= \frac{e^{-2\lambda}}{\pi} \int_0^{\pi} \exp(2\lambda \cos \theta) \, d\theta \end{align*}

Cerrillo’s contribution was to recognize the integral as the Bessel function J₀ evaluated at −2iλ, or equivalently the modified Bessel function I₀ evaluated at 2λ. This follows directly from equations 9.1.18 and 9.6.16 in Abramowitz and Stegun.

Putting it all together we have

\sum_{m=0}^\infty p_m^2 = \frac{e^{-2\lambda}}{\pi} \int_0^{\pi} \exp(2\lambda \cos \theta) \, d\theta = e^{-2\lambda} J_0(-2i\lambda) = e^{-2\lambda} I_0(2\lambda)

Using the asymptotic properties of I₀, Redheffer notes that for large values of λ,

\sum_{m=0}^\infty p_m^2 \sim \frac{1}{2\sqrt{\pi\lambda}}
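Both the closed form and the asymptotic approximation are easy to verify numerically. Note that scipy’s i0e(z) computes exactly the scaled Bessel function e^−z I₀(z) that appears here.

    import numpy as np
    from scipy.stats import poisson
    from scipy.special import i0e

    lam = 10.0
    m = np.arange(200)                       # enough terms for convergence
    direct = np.sum(poisson.pmf(m, lam)**2)
    bessel = i0e(2 * lam)                    # e^{-2 lam} I_0(2 lam)
    asymp = 1 / (2 * np.sqrt(np.pi * lam))
    print(direct, bessel, asymp)             # 0.0898, 0.0898, 0.0892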

[1] Warren Weaver, “Probability, rarity, interest, and surprise,” The Scientific Monthly, Vol 67 (1948), p. 390.

[2] R. M. Redheffer. A Note on the Surprise Index. The Annals of Mathematical Statistics, March 1951, Vol. 22, No. 1, pp. 128–130.

Estimating an author’s vocabulary

How would you estimate the size of an author’s vocabulary? Suppose you have analyzed the author’s available works and found n words, x of which are unique. Then you know the author’s vocabulary was at least x, but it’s reasonable to assume that the author may have known words he never used in writing, or at least not in works you have access to.

Brainerd [1] suggested the following estimator based on a Markov chain model of language. The estimated vocabulary is the number N satisfying the equation

\sum_{j=0}^{x-1}\left(1 - \frac{j}{N}\right)^{-1} = n

The left side is a decreasing function of N, so you could solve the equation by finding values of N that make the sum larger and smaller than n, then using a bisection algorithm.
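Here’s a minimal sketch of that procedure; the bracketing logic and the word counts in the example are my own choices.

    import numpy as np

    def lhs(N, x):
        j = np.arange(x)
        return np.sum(1.0 / (1.0 - j / N))

    def estimate_vocabulary(x, n, steps=100):
        """Solve sum_{j=0}^{x-1} (1 - j/N)^{-1} = n for N by bisection.
        Assumes x < n, i.e. at least one word is repeated."""
        lo = x - 1 + 1e-9      # the sum blows up as N approaches x - 1
        hi = 2.0 * x
        while lhs(hi, x) > n:  # expand until the root is bracketed
            hi *= 2
        for _ in range(steps):
            mid = (lo + hi) / 2
            if lhs(mid, x) > n:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # e.g. 5,000 distinct words in a 20,000-word sample
    print(estimate_vocabulary(5000, 20000))   # about 5,100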

We can see that the model is qualitatively reasonable. If every word is unique, i.e. x = n, then the solution is N = ∞: if you haven’t seen any repetition, for all you know the author could keep writing new words indefinitely. As the amount of repetition increases, the estimate of N decreases.

Brainerd’s model is simple, but it tends to underestimate vocabulary. More complicated models might do a better job.

Problems analogous to estimating vocabulary size come up in other applications. For example, an ecologist might want to estimate the number of new species left to be found based on the number of species seen so far. In my work in data privacy I occasionally have to estimate diversity in a population based on diversity in a sample. Both of these examples are analogous to estimating potential new words based on the words you’ve seen.

[1] Brainerd, B. On the relation between types and tokens in literary text. J. Appl. Prob. 9, pp. 507-5