The negative binomial distribution is interesting because it illustrates a common progression of statistical thinking. My aim here is to tell a story, not to give details; the details are available here. The following gives a progression of three perspectives.
First view: Counting
The origin of the negative binomial is very concrete. It is unfortunate that the name makes the distribution seem more abstract than it is. (What could possibly be negative about a binomial distribution? Sounds like abstract nonsense.)
Suppose you have decided to practice basketball free throws. You’ve decided to practice until you have made 20 free throws. If your probability of making a single free throw is p, how many shots will you have to attempt before you make your goal of 20 successes? Obviously you’ll need at least 20 attempts, but you might need a lot more. What is the expected number of attempts you would need? What’s the probability that you’ll need more than 50 attempts? These questions could be answered by using a negative binomial distribution. A negative binomial probability distribution with parameters r and p gives the probabilities of various numbers of failures before the rth success when each attempt has probability of success p.
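The two free-throw questions can be answered with a few lines of code. The following is a minimal sketch using only the Python standard library. The target r = 20 comes from the example above, but the success probability p = 0.5 is a made-up value for illustration, and the function name nb_pmf is my own.

```python
from math import comb

def nb_pmf(k: int, r: int, p: float) -> float:
    """Probability of exactly k failures before the r-th success."""
    return comb(k + r - 1, k) * p**r * (1 - p)**k

r = 20    # target number of made free throws (from the example)
p = 0.5   # hypothetical probability of making a single free throw

# Expected attempts = r successes + r(1-p)/p expected failures = r/p.
expected_attempts = r / p

# Needing more than 50 attempts means more than 50 - r = 30 failures.
prob_more_than_50 = 1 - sum(nb_pmf(k, r, p) for k in range(50 - r + 1))

print("expected attempts:", expected_attempts)
print("P(more than 50 attempts):", prob_more_than_50)
```

A useful sanity check: needing more than 50 attempts is the same event as making fewer than 20 of your first 50 shots, so the second number should match an ordinary binomial tail probability.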
Second view: Cheap generalization
After writing down the probability mass function for the negative binomial distribution as described above, somebody noticed that the number r didn’t necessarily have to be an integer. The distribution was motivated by integer values of r, counting the number of failures before the rth success, but the resulting formula makes sense even when r is not an integer. It doesn’t make sense to wait for 2.87 successes; you can’t interpret the formula as counting events unless r is an integer, but the formula is still mathematically valid.
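To see that the formula really does survive a non-integer r, here is a small sketch that replaces the factorials in the binomial coefficient with gamma functions. The function name nb_pmf_real and the values r = 2.87 and p = 0.4 are illustrative assumptions, not anything prescribed by the distribution.

```python
from math import exp, lgamma, log

def nb_pmf_real(k: int, r: float, p: float) -> float:
    """Negative binomial pmf for real r > 0 and 0 < p < 1.

    Uses Gamma(k + r) / (k! Gamma(r)) in place of the integer
    binomial coefficient, computed on the log scale for stability.
    """
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))

r, p = 2.87, 0.4   # hypothetical values; r is deliberately not an integer

# Truncate the infinite sums; the tail beyond k = 500 is negligible here.
total = sum(nb_pmf_real(k, r, p) for k in range(500))
mean = sum(k * nb_pmf_real(k, r, p) for k in range(500))

print("sum of probabilities:", total)
print("mean:", mean, "vs r(1-p)/p =", r * (1 - p) / p)
```

The probabilities still sum to one and the mean still follows the formula r(1-p)/p, even though "2.87 successes" has no counting interpretation.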
The probability mass function involves a binomial coefficient. These coefficients were first developed for integer arguments but later extended to real and even complex arguments. See these notes for definitions and these notes for how to calculate the general coefficients. The probability mass function can be written most compactly when one of the binomial coefficients has a negative argument. See page two of these notes for an explanation. There’s no intuitive explanation of the negative argument. It’s just a consequence of some algebra.
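The algebra behind the negative argument can be checked numerically. The sketch below assumes the usual definition of the generalized binomial coefficient, a(a-1)...(a-k+1)/k! for real a and non-negative integer k, and verifies that the coefficient counting arrangements of k failures and r successes equals (-1)^k times the coefficient with upper argument -r. The function name gen_binom is my own.

```python
from math import comb, factorial

def gen_binom(a: float, k: int) -> float:
    """Generalized binomial coefficient a(a-1)...(a-k+1) / k!."""
    numerator = 1.0
    for i in range(k):
        numerator *= a - i
    return numerator / factorial(k)

# Integer case: compare against the ordinary binomial coefficient.
r, k = 5, 3
ordinary = comb(k + r - 1, k)
via_negative = (-1) ** k * gen_binom(-r, k)
print(ordinary, via_negative)

# Non-integer case: the same identity holds for real r.
r_real, k_real = 2.87, 4
lhs = gen_binom(k_real + r_real - 1, k_real)
rhs = (-1) ** k_real * gen_binom(-r_real, k_real)
print(lhs, rhs)
```

The factor (-1)^k can then be absorbed into the (1-p)^k term of the probability mass function, which is where the "negative" in the name comes from.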
What’s the point in using non-integer values of r? Just because we can? No, there are practical reasons, and that leads to our third view.
Third view: Modeling overdispersion
Next we take the distribution above and forget where it came from. It was motivated by counting successes and failures, but now we forget about that and imagine the distribution falling from the sky in its general form described above. What properties does it have?
The negative binomial distribution turns out to have a very useful property. It can be seen as a generalization of the Poisson distribution. (See this distribution chart. Click on the dashed arrow between the negative binomial and Poisson boxes.)
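One way to see the connection numerically: hold the mean fixed and let r grow. The variance of the negative binomial is then mean + mean²/r, which shrinks toward the mean, and the probabilities approach those of a Poisson distribution. The sketch below is an illustration under assumed values (a mean of 20 and a handful of r values chosen by me), using only the standard library.

```python
from math import exp, lgamma, log

def nb_pmf_mean(k: int, mean: float, r: float) -> float:
    """Negative binomial pmf with the given mean; p = r / (r + mean)."""
    p = r / (r + mean)
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))

def poisson_pmf(k: int, mean: float) -> float:
    """Poisson pmf computed on the log scale."""
    return exp(k * log(mean) - mean - lgamma(k + 1))

mean = 20.0                       # hypothetical mean
r_values = (5, 50, 5000, 500000)  # hypothetical, increasing r

gaps = []
for r in r_values:
    # Largest pointwise difference between the two pmfs over k = 0..99.
    gap = max(abs(nb_pmf_mean(k, mean, r) - poisson_pmf(k, mean))
              for k in range(100))
    gaps.append(gap)
    print(f"r = {r:>6}  variance = {mean + mean**2 / r:.4f}  max gap = {gap:.2e}")
```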
The Poisson is the simplest distribution for modeling count data. It is in some sense a very natural distribution and it has nice theoretical properties. However, the Poisson distribution has one severe limitation: its variance is equal to its mean. There is no way to increase the variance without increasing the mean. Unfortunately, in many data sets the variance is larger than the mean. That’s where the negative binomial comes in. When modeling count data, first try the simplest thing that might work, the Poisson. If that doesn’t work, try the next simplest thing, negative binomial.
When viewing the negative binomial this way, as a generalization of the Poisson, it helps to use a new parameterization. The parameters r and p are no longer directly important. For example, if we have empirical data with mean 20.1 and variance 34.7, we would naturally be interested in the negative binomial distribution with this mean and variance. We would like a parameterization that reflects the mean and variance more directly and that makes the connection with the Poisson more transparent. That is indeed possible, and is described in these notes.
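For the example with mean 20.1 and variance 34.7, the matching r and p can be found by simple moment matching. This is a sketch, not necessarily the parameterization the notes use: it relies on the standard facts that the mean is r(1-p)/p and the variance is r(1-p)/p², which give p = mean/variance and r = mean²/(variance - mean). The function name nb_from_mean_var is my own.

```python
def nb_from_mean_var(mean: float, var: float) -> tuple[float, float]:
    """Return (r, p) for the negative binomial with the given mean and variance."""
    if mean <= 0 or var <= mean:
        raise ValueError("need 0 < mean < variance (overdispersed counts)")
    p = mean / var
    r = mean**2 / (var - mean)
    return r, p

r, p = nb_from_mean_var(20.1, 34.7)
print("r =", r, "p =", p)

# Round trip: recover the mean and variance from r and p.
print("mean:", r * (1 - p) / p)
print("variance:", r * (1 - p) / p**2)
```

Here r comes out near 27.7 and p near 0.58. Note that r is not an integer, which is exactly why the generalization to real r matters in practice.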
Update: Here’s a new post giving a fourth view of the negative binomial distribution — a continuous mixture of Poisson distributions. This view explains why the negative binomial is related to the Poisson and yet has greater variance.
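The mixture view is easy to simulate. In the sketch below, each count is drawn from a Poisson distribution whose rate is itself drawn from a gamma distribution with shape r and scale (1-p)/p. The values r = 4 and p = 0.25, the seed, and the sample size are arbitrary choices of mine, and the Poisson sampler is Knuth's simple multiplication method, which is adequate for moderate rates but not something to use for large ones.

```python
import random
from math import exp
from statistics import mean, variance

def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) value by Knuth's multiplication method."""
    threshold = exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def gamma_poisson_sample(r: float, p: float, rng: random.Random) -> int:
    """Draw a Poisson count whose rate is Gamma(shape=r, scale=(1-p)/p)."""
    lam = rng.gammavariate(r, (1 - p) / p)
    return poisson_sample(lam, rng)

rng = random.Random(0)   # fixed seed so the run is repeatable
r, p = 4.0, 0.25         # hypothetical parameters
draws = [gamma_poisson_sample(r, p, rng) for _ in range(20_000)]

sample_mean = mean(draws)
sample_var = variance(draws)
print("sample mean:", sample_mean, "theory:", r * (1 - p) / p)
print("sample variance:", sample_var, "theory:", r * (1 - p) / p**2)
```

The sample variance lands well above the sample mean: the extra spread comes from the rate itself varying from draw to draw.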
Related links:
Notes on the negative binomial distribution
General binomial coefficients
Diagram of distribution relationships
Upper and lower bounds on binomial coefficients

3 comments:
EastwoodDC 10.25.09 at 20:09
I had never thought about the NB being a generalization of the Poisson, but that is a useful way to think of it. A colleague of vast experience tells me he has never fit a Poisson model where overdispersion was not a problem.
Jan Galkowski 01.01.10 at 19:03
I’m wondering what the general expression for the variance of a mixture distribution might look like. To be specific, suppose X(1), X(2), …, X(n) are r.v. each having means u(1), u(2), …, u(n), respectively, and variances v(1), v(2), …, v(n). The k-th distribution is chosen with probability p(k), and the choice is mutually exclusive. So, in each case, it’s a multinomial gating which of the n distributions are used. In the case of n = 2,
VAR[...] = p(1) v(1) + p(2) v(2) + p(1) p(2) [u(1) - u(2)]^2.
The question is, what does this look like in general? It had been suggested to me that the general expression, given the intermediate definition,
Y(i) = P(S == i) X(i)
for the variance looks like
SUM-over-i VAR[Y(i)] + 2 SUM-over-i-not-equal-j COV[Y(i),Y(j)]
where, of course, COV[U,V] = E[U V] – E[U] E[V]
Here, since Y(i) and Y(j) are mutually exclusive, E[Y(i) Y(j)] is presumably zero.
But I cannot reconcile this, in the case of two r.v., with the n = 2 expression above.
Any suggestions?
John Myles White 01.12.10 at 12:11
This was incredibly useful, John. I’ve known just a little about the negative binomial distribution for a while now after reading the start of “Inference and Disputed Authorship,” but never fully appreciated the distribution’s place. Thanks for giving such a useful buildup of motivating intuitions for it.