The negative binomial distribution is interesting because it illustrates a common progression of statistical thinking. **My aim here is to tell a story, not to give details;** the details are available here. The following gives a progression of three perspectives.

### First view: Counting

The origin of the negative binomial is very concrete. It is unfortunate that the name makes the distribution seem more abstract than it is. (What could possibly be negative about a binomial distribution? Sounds like abstract nonsense.)

Suppose you have decided to practice basketball free throws. You’ve decided to practice until you have made 20 free throws. If your probability of making a single free throw is *p*, how many shots will you have to attempt before you make your goal of 20 successes? Obviously you’ll need at least 20 attempts, but you might need a lot more. What is the expected number of attempts you would need? What’s the probability that you’ll need more than 50 attempts? These questions could be answered by using a negative binomial distribution. A negative binomial probability distribution with parameters *r* and *p* gives the probabilities of various numbers of failures before the *r*th success when each attempt has probability of success *p*.
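As a rough sketch of how those two questions could be answered numerically, the snippet below uses only the Python standard library. The success probability `p = 0.6` is an assumed value chosen for illustration (the post leaves *p* unspecified), and the function names are hypothetical.

```python
from math import comb

def nb_pmf(k: int, r: int, p: float) -> float:
    """P(exactly k failures before the r-th success), for integer r."""
    return comb(k + r - 1, k) * p**r * (1 - p)**k

def expected_attempts(r: int, p: float) -> float:
    """Expected total attempts: r successes plus r(1-p)/p expected failures."""
    return r + r * (1 - p) / p

def prob_more_attempts_than(n: int, r: int, p: float) -> float:
    """P(more than n attempts are needed to reach r successes)."""
    max_failures = n - r  # at most this many failures fit inside n attempts
    if max_failures < 0:
        return 1.0
    return 1.0 - sum(nb_pmf(k, r, p) for k in range(max_failures + 1))

p = 0.6  # assumed free-throw success probability, not from the post
print(expected_attempts(20, p))
print(prob_more_attempts_than(50, 20, p))
```

Needing more than 50 attempts is the same event as making fewer than 20 of the first 50 shots, so the second number can be cross-checked against a binomial tail.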

### Second view: Cheap generalization

After writing down the probability mass function for the negative binomial distribution as described above, somebody noticed that the number *r* didn’t necessarily have to be an integer. The distribution was *motivated* by integer values of *r*, counting the number of failures before the *r*th success, but the resulting formula makes sense even when *r* is not an integer. It doesn’t make sense to wait for 2.87 successes; you can’t interpret the formula as counting events unless *r* is an integer, but the formula is still mathematically valid.

The probability mass function involves a binomial coefficient. These coefficients were first developed for integer arguments but later extended to real and even complex arguments. See these notes for definitions and these notes for how to calculate the general coefficients. The probability mass function can be written most compactly when one of the binomial coefficients has a negative argument. See page two of these notes for an explanation. There’s no intuitive explanation of the negative argument. It’s just a consequence of some algebra.
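For readers who want the algebra in one place, here is a sketch of the standard form of the probability mass function, with *k* denoting the number of failures before the *r*th success (the symbol *k* is introduced here for illustration). The gamma-function expression is what lets *r* be any positive real number.

```latex
\begin{align*}
P(X = k) &= \binom{k + r - 1}{k}\, p^r (1 - p)^k
          = (-1)^k \binom{-r}{k}\, p^r (1 - p)^k, \qquad k = 0, 1, 2, \ldots \\
\binom{-r}{k} &= \frac{(-r)(-r-1)\cdots(-r-k+1)}{k!}
               = (-1)^k\, \frac{r(r+1)\cdots(r+k-1)}{k!} \\
\binom{k + r - 1}{k} &= \frac{\Gamma(k + r)}{k!\,\Gamma(r)} \quad \text{for real } r > 0
\end{align*}
```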

What’s the point in using non-integer values of *r*? Just because we can? No, there are practical reasons, and that leads to our third view.

### Third view: Modeling overdispersion

Next we take the distribution above and forget where it came from. It was motivated by counting successes and failures, but now we forget about that and imagine the distribution falling from the sky in its general form described above. What properties does it have?

The negative binomial distribution turns out to have a very useful property. It can be seen as a generalization of the Poisson distribution. (See this distribution chart. Click on the dashed arrow between the negative binomial and Poisson boxes.)

The Poisson is the simplest distribution for modeling count data. It is in some sense a very natural distribution and it has nice theoretical properties. However, the Poisson distribution has one severe limitation: its variance is equal to its mean. There is no way to increase the variance without increasing the mean. Unfortunately, in many data sets the variance is larger than the mean. That’s where the negative binomial comes in. When modeling count data, first try the simplest thing that might work, the Poisson. If that doesn’t work, try the next simplest thing, negative binomial.

When viewing the negative binomial this way, a generalization of the Poisson, it helps to use a new parameterization. The parameters *r* and *p* are no longer directly important. For example, if we have empirical data with mean 20.1 and variance 34.7, we would naturally be interested in the negative binomial distribution with this mean and variance. We would like a parametrization that reflects more directly the mean and variance and one that makes the connection with the Poisson more transparent. That is indeed possible, and is described in these notes.
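One simple way to get from a mean and variance back to *r* and *p* is to match moments, using mean = *r*(1 − *p*)/*p* and variance = *r*(1 − *p*)/*p*². The sketch below does this for the mean 20.1 and variance 34.7 mentioned above; it is a moment-matching illustration, not necessarily the parameterization the linked notes use, and the function names are hypothetical. Note that the resulting *r* is not an integer.

```python
from math import exp, lgamma, log

def nb_params_from_moments(mean: float, variance: float) -> tuple[float, float]:
    """Solve mean = r(1-p)/p and variance = r(1-p)/p^2 for (r, p)."""
    if variance <= mean:
        raise ValueError("negative binomial requires variance > mean")
    p = mean / variance
    r = mean * mean / (variance - mean)
    return r, p

def nb_pmf(k: int, r: float, p: float) -> float:
    """P(X = k) for real r > 0, using log-gamma for the generalized coefficient."""
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))

r, p = nb_params_from_moments(20.1, 34.7)
print(r, p)                                      # r is non-integer here
print(sum(nb_pmf(k, r, p) for k in range(500)))  # should be very close to 1
```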

**Update**: Here’s a new post giving a fourth view of the negative binomial distribution — a continuous mixture of Poisson distributions. This view explains why the negative binomial is related to the Poisson and yet has greater variance.

I had never thought about the NB being a generalization of the Poisson, but that is a useful way to think of it. A colleague of vast experience tells me he has never fit a Poisson model where overdispersion was not a problem.

I’m wondering what the general expression for the variance of a mixture distribution might look like. To be specific, suppose X(1), X(2), …, X(n) are r.v. each having means u(1), u(2), …, u(n), respectively, and variances v(1), v(2), …, v(n). The k-th distribution is chosen with probability p(k), and the choice is mutually exclusive. So, in each case, it’s a multinomial gating which of the n distributions are used. In the case of n = 2,

VAR[…] = p(1) v(1) + p(2) v(2) + p(1) p(2) [u(1) – u(2)]^2.

The question is, what does this look like in general? It had been suggested to me that the general expression, given the intermediate definition,

Y(i) = P(S == i) X(i)

for the variance looks like

SUM-over-i VAR[Y(i)] + 2 SUM-over-i-not-equal-j COV[Y(i),Y(j)]

where, of course, COV[U,V] = E[U V] – E[U] E[V]

Here, since Y(i) and Y(j) are mutually exclusive, E[Y(i) Y(j)] is presumably zero.

But I cannot reconcile this in the case of two r.v. with n = 2 expression.

Any suggestions?
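A possible way to reconcile the two expressions, offered as a sketch rather than a definitive answer: assume Y(i) is meant to be the indicator of S == i times X(i), not the probability times X(i). Then E[Y(i) Y(j)] is indeed zero for i ≠ j, but COV[Y(i), Y(j)] = −p(i) p(j) u(i) u(j) is not, and each unordered pair should be counted twice, not each ordered pair. Under those assumptions the general variance is SUM p(k) v(k) + SUM p(k) [u(k) − m]^2 with m = SUM p(k) u(k), which reduces to the n = 2 expression. The function names below are hypothetical.

```python
def mixture_variance(p, u, v):
    """Law of total variance: within-component plus between-component spread."""
    m = sum(pk * uk for pk, uk in zip(p, u))
    within = sum(pk * vk for pk, vk in zip(p, v))
    between = sum(pk * (uk - m) ** 2 for pk, uk in zip(p, u))
    return within + between

def mixture_variance_pairwise(p, u, v):
    """Same quantity, with the between part written over pairs i < j."""
    n = len(p)
    within = sum(pk * vk for pk, vk in zip(p, v))
    between = sum(p[i] * p[j] * (u[i] - u[j]) ** 2
                  for i in range(n) for j in range(i + 1, n))
    return within + between

def mixture_variance_indicator(p, u, v):
    """Sum of VAR[Y(i)] plus COV[Y(i), Y(j)] over ordered pairs i != j,
    where Y(i) = 1{S == i} * X(i) (an assumed reading of the comment)."""
    n = len(p)
    var_sum = sum(p[i] * (v[i] + u[i] ** 2) - (p[i] * u[i]) ** 2 for i in range(n))
    cov_sum = sum(-p[i] * p[j] * u[i] * u[j]
                  for i in range(n) for j in range(n) if i != j)
    return var_sum + cov_sum

# Made-up weights, means, and variances purely for checking agreement.
p, u, v = [0.2, 0.5, 0.3], [1.0, 4.0, 10.0], [2.0, 3.0, 5.0]
print(mixture_variance(p, u, v),
      mixture_variance_pairwise(p, u, v),
      mixture_variance_indicator(p, u, v))
```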

This was incredibly useful, John. I’ve known just a little about the negative binomial distribution for a while now after reading the start of “Inference and Disputed Authorship,” but never fully appreciated the distribution’s place. Thanks for giving such a useful buildup of motivating intuitions for it.

You may want to look at my 2007 book, “Negative Binomial Regression” (Cambridge University Press). I am nearly finished with a substantially expanded second edition that will come out at about 500 pages in length. I use Stata and R for examples, many of which I have written. I’ve seen many data sets that are equi-dispersed; in fact, I’ve seen underdispersed Poisson models. These of course cannot be modeled using the traditional NB procedure. But a generalized Poisson, as well as hurdle models and double Poisson models, can estimate under-dispersed count data.

There is a wide variety of NB models, including one that is not a Poisson-gamma mixture, but rather derives directly from the PDF as understood in the first view mentioned by John. I have called this the NB-C parameterization, and it is not related to the Poisson at all. On the other hand, with the variance defined as mu+alpha*mu^2, where alpha is the overdispersion parameter and mu is the mean, the traditional NB, known in the literature as NB2, is parameterized such that the Poisson model is an NB with alpha=0. A geometric model is NB with alpha=1. An NB2 model can be extra-dispersed as well, either under- or overdispersed. The data can in fact be Poisson overdispersed but NB under-dispersed. How to identify possible overdispersion (correlation) and its causes is important in deciding which type of count model to use for a given data situation. If you are really interested in this subject, email me and I can perhaps refer you to sources that will help you understand the family of negative binomial models a bit better. Hilbe@asu.edu
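To connect the NB2 variance function mu+alpha*mu^2 with the *r* and *p* of the original post, here is a small sketch assuming the usual correspondence r = 1/alpha and p = r/(r + mu). The function names are hypothetical and are not taken from the book or from any package.

```python
def nb2_variance(mu: float, alpha: float) -> float:
    """NB2 variance function; alpha = 0 gives the Poisson variance mu."""
    return mu + alpha * mu**2

def nb2_to_r_p(mu: float, alpha: float) -> tuple[float, float]:
    """Convert (mu, alpha) to (r, p), assuming r = 1/alpha and p = r/(r + mu)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive; alpha -> 0 is the Poisson limit")
    r = 1.0 / alpha
    p = r / (r + mu)
    return r, p

# alpha = 1 gives r = 1, i.e. the geometric case
r, p = nb2_to_r_p(mu=3.0, alpha=1.0)
print(r, p)
print(r * (1 - p) / p**2, nb2_variance(3.0, 1.0))  # the two variances agree
```

With alpha = 1 and mu = 3, both expressions give a variance of 12, compared with 3 for a Poisson with the same mean.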

Thanks John for your work. The content is a well-organized refresher for me, and it is the first time I feel that I understand the naming of the “negative” binomial. Do you plan to write something about where the name “hypergeometric” comes from? Thanks.

I don’t really know how the hypergeometric *distribution* gets its name, but I know how the hypergeometric *series* was named. Hypergeometric series are a generalization of the geometric series, and I would assume the hypergeometric *distribution* has some connection to the hypergeometric *series*.

According to Bishop, Fienberg, and Holland in *Discrete Multivariate Analysis*, Springer, 2007, section 13.5, pages 448-449, the hypergeometric distribution gets its name because hypergeometric probabilities are scaled terms from the hypergeometric series. This presentation context is symbolically challenged, so I just refer the reader to equations (13.5-1) through (13.5-5) of the cited text.

By the way, I find this text *very* useful and illuminating, in many ways. I came to it through its extension of the Fisher Exact Test to instances where the *extended hypergeometric distribution* is more appropriate for tables of counts, where the model suggests the populations compared may not have the same probabilities of appearance. There’s a lot more to this book.

I have also looked at both Fienberg, *The Analysis of Cross-Classified Categorical Data* (2nd edition, Springer, 2007), which is useful, and Congdon’s *Bayesian Models for Categorical Data*, Wiley, 2005. I find Congdon less useful, even if he incorporates some of Epstein and Fienberg, “Bayesian estimation in multidimensional contingency tables” (from *Computer Science and Statistics: Proceedings of the 23rd Symposium on the Interface*, Keramidas, E, editor, Interface Foundation of North America: Fairfax, 37-47).

Hope this helps.

Thanks Jan Galkowski for your suggested chapter. I read it through our library links. It does make sense that the hypergeometric distribution got its name from its connection to the hypergeometric series by a scaling factor. However, the hypergeometric series and hypergeometric function arose in the study of second-order linear ordinary differential equations, so it seems mysterious why statisticians would want to bring the context of physics into statistics. Do you have more ideas? I guess there might be some anecdote beyond coincidence. Thanks.

Hypergeometric functions have many faces. Yes, they were studied first because of their connection to differential equations. But later people realized they were useful in combinatorics as generating functions. This latter perspective is easier to connect to probability. Think about the combinatorial properties of the coefficients in the series rather than the analytical properties of the function.