Sometimes you can approximate a binomial distribution with a normal distribution. Under the right conditions, a Binomial(n, p) random variable is approximately normal with the same mean and variance, i.e. mean np and variance np(1-p). The approximation works best when n is large and p is near 1/2.
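Here is a minimal Python sketch of that comparison in a favorable case, n = 100 and p = 1/2. The function names are mine, and the half-unit continuity correction is an assumption the text above doesn't mention; it's just a common way to compare a discrete distribution with a continuous one.

```python
import math

def binom_pmf(k, n, p):
    """Exact Binomial(n, p) probability of exactly k successes."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def normal_approx_pmf(k, n, p):
    """Approximate P(X = k) by the normal probability of [k - 0.5, k + 0.5].

    The normal has the binomial's mean np and variance np(1-p).
    The +/- 0.5 continuity correction is an assumption of this sketch.
    """
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return normal_cdf(k + 0.5, mean, sd) - normal_cdf(k - 0.5, mean, sd)

n, p = 100, 0.5
for k in (40, 50, 60):
    exact = binom_pmf(k, n, p)
    approx = normal_approx_pmf(k, n, p)
    rel_err = abs(approx - exact) / exact
    print(f"k={k}: exact={exact:.6f} approx={approx:.6f} rel_err={rel_err:.2e}")
```

With n large and p at 1/2, the exact and approximate values should agree closely near the mean.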

This afternoon I was reading a paper that used a normal approximation to a binomial when n was around 10 and p around 0.001.  The relative error was enormous. The paper used the approximation to find an analytical expression for something else and the error propagated.

A common rule of thumb is that the normal approximation works well when np > 5 and n(1-p) > 5.  This says that the closer p is to 0 or 1, the larger n needs to be. In this case p was very small, but n was not large enough to compensate since np was on the order of 0.01, far less than 5.
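To make the rule concrete, here is a rough sketch that checks it for n = 10 and p = 0.001 and then compares the exact probability of at least one success with a normal estimate. The choice of P(X ≥ 1) as the quantity to compare is mine, purely for illustration; it is not what the paper computed. The helper names and the continuity correction are likewise assumptions.

```python
import math

def rule_of_thumb_ok(n, p, threshold=5):
    """True when both np and n(1-p) exceed the threshold (5 by default)."""
    return n * p > threshold and n * (1 - p) > threshold

def exact_at_least_one(n, p):
    """Exact P(X >= 1) for X ~ Binomial(n, p)."""
    return 1 - (1 - p)**n

def normal_upper_tail(x, n, p):
    """P(N >= x) for a normal with mean np and variance np(1-p).

    Uses erfc so that very small tail probabilities keep their precision.
    """
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))

n, p = 10, 0.001
print("np =", n * p, "n(1-p) =", n * (1 - p))
print("rule of thumb satisfied:", rule_of_thumb_ok(n, p))

exact = exact_at_least_one(n, p)
# Continuity-corrected estimate of P(X >= 1); the 0.5 shift is an assumption.
approx = normal_upper_tail(0.5, n, p)
print("exact:", exact)
print("normal approximation:", approx)
print("relative error:", abs(approx - exact) / exact)
```

The rule fails here, and the normal estimate comes out orders of magnitude smaller than the exact probability.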

Another rule of thumb is that normal approximations in general hold well near the center of the distribution but not in the tails. In particular the relative error in the tails can be unbounded. This paper was looking out toward the tails, and relative error mattered.
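The tail behavior shows up even when the rule of thumb is comfortably satisfied. The sketch below, again with my own function names and an assumed continuity correction, compares exact and approximate upper tail probabilities for n = 100 and p = 1/2 as the cutoff moves away from the mean.

```python
import math

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def normal_upper_tail(x, n, p):
    """P(N >= x) for a normal with mean np and variance np(1-p)."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))

n, p = 100, 0.5
rel_errors = []
for k in (60, 70, 80, 90):
    exact = binom_upper_tail(k, n, p)
    # Continuity correction (an assumption of this sketch): P(X >= k) ~ P(N >= k - 0.5)
    approx = normal_upper_tail(k - 0.5, n, p)
    rel_err = abs(approx - exact) / exact
    rel_errors.append(rel_err)
    print(f"k={k}: exact={exact:.3e} approx={approx:.3e} rel_err={rel_err:.3g}")
```

The absolute error shrinks as k grows because both probabilities become tiny, but the relative error keeps growing, which is what matters if you need the tail probability itself.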

For more details, see these notes on the normal approximation to the binomial.

## 3 thoughts on “Bad normal approximation”

1. I think this is why many A/B testing tools will tell you that you have significant results very early on, then the significance will disappear, and often return later. Obviously part of this is because these significance tests aren’t intended for repeated measurements. But I think it’s also because a z-test is so convenient that you don’t want to have to switch to a t-test some of the time.

2. katastrofa

Your post reminded me of the “proxy integration” approximation used for CDO pricing, which did exactly that: used the normal distribution to approximate a quasi-binomial one in the tails.

Quasi-binomial, because it counts the number of successes in n trials with a different (but small) success probability for each trial.

3. Bram Lap

Nice article. I’m currently trying to get a feeling for how data behaves in various distributions and what estimations/estimators are useful. Do you have any recommendations regarding literature about these kinds of things?