The normal distribution pops up everywhere in statistics. Contrary to popular belief, the name does not come from “normal” as in “conventional.” Instead, the term comes from a detail in a proof by Gauss, discussed below, in which he showed that two things were perpendicular in a sense.

(The word “normal” originally meant “at a right angle,” going back to the Latin word *normalis* for a carpenter’s square. Later the word took on the metaphorical meaning of something in line with custom. Mathematicians sometimes use “normal” in the original sense of being orthogonal.)

The mistaken etymology persists because the normal distribution *is* conventional. Statisticians often assume anything random has a normal distribution by default. While this assumption is not always justified, it often works remarkably well. This post gives four lines of reasoning that lead naturally to the normal distribution.

1) The earliest characterization of the normal distribution is the central limit theorem, going back to Abraham de Moivre. Roughly speaking, this theorem says that if you average enough independent random variables together, even if they’re not normal, in the limit the average is normal. But this justification for assuming normal distributions everywhere has a couple of problems. First, the convergence in the central limit theorem may be slow, depending on what is being averaged. Second, if you relax the hypotheses of the central limit theorem, other stable distributions with thicker tails also satisfy a sort of central limit theorem. The characterizations given below are more satisfying because they do not rely on limit theorems.
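A minimal numerical sketch of the theorem (assuming NumPy is available): average many draws from a decidedly non-normal distribution and check that the standardized average behaves like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Average n draws from a non-normal distribution: uniform on [0, 1].
n = 100
averages = rng.uniform(0, 1, size=(50_000, n)).mean(axis=1)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the average of n draws
# has mean 1/2 and standard deviation sqrt(1 / (12 n)).
z = (averages - 0.5) / np.sqrt(1 / (12 * n))

# A standard normal puts about 68.3% of its mass within one standard deviation.
frac = np.mean(np.abs(z) < 1)
print(f"fraction within 1 sd: {frac:.3f}")  # close to 0.683
```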

2) The astronomer William Herschel discovered the simplest characterization of the normal. He wanted to characterize the errors in astronomical measurements. He assumed (1) the distribution of errors in the *x* and *y* directions must be independent, and (2) the distribution of errors must be independent of angle when expressed in polar coordinates. These are very natural assumptions for an astronomer, and the only solution is a product of the same normal distribution in *x* and *y*. James Clerk Maxwell came up with an analogous derivation in three dimensions when modeling gas dynamics.
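Herschel’s two assumptions can be checked in reverse (a sketch assuming NumPy): if the errors in *x* and *y* are independent normals, the direction of the error vector comes out uniform over angle.

```python
import numpy as np

rng = np.random.default_rng(1)

# Independent normal errors in the x and y directions ...
x = rng.standard_normal(200_000)
y = rng.standard_normal(200_000)

# ... give an error direction that is uniform in angle.
angles = np.arctan2(y, x)  # values in (-pi, pi]

# Crude uniformity check: each quadrant should receive about 1/4 of the mass.
quadrants = [np.mean((angles >= lo) & (angles < lo + np.pi / 2))
             for lo in (-np.pi, -np.pi / 2, 0, np.pi / 2)]
print([round(q, 3) for q in quadrants])  # each close to 0.25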

3) Carl Friedrich Gauss came up with the characterization of the normal distribution that caused it to be called the “Gaussian” distribution. There are two strategies for estimating the mean of a random variable from a sample: the arithmetic mean of the samples, and the maximum likelihood value. Only for the normal distribution do these coincide.
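A numerical sketch of Gauss’s observation (assuming NumPy): for normal data, a brute-force maximum-likelihood search over the location parameter lands on the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(10.0, 2.0, size=500)

# For a normal with known scale, maximizing the likelihood over the location
# mu is the same as minimizing sum((x_i - mu)^2); the scale drops out.
grid = np.linspace(sample.min(), sample.max(), 20_001)
log_lik = -((sample[None, :] - grid[:, None]) ** 2).sum(axis=1)
mle = grid[np.argmax(log_lik)]

print(mle, sample.mean())  # the two agree to within the grid resolution
```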

4) The final characterization listed here is in terms of entropy. For a specified mean and variance, the probability density with the greatest entropy (least information) is the normal distribution. I don’t know who discovered this result, but I read it in C. R. Rao’s book. Perhaps it’s his result. If anyone knows, please let me know and I’ll update this post. For advocates of maximum entropy this is the most important characterization of the normal distribution.
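The closed-form differential entropies make the comparison concrete (a sketch using only the Python standard library; the entropy formulas are standard): among distributions with variance 1, the normal’s entropy exceeds, for example, the uniform’s and the Laplace’s.

```python
import math

sigma = 1.0  # common standard deviation for all three distributions

# Normal: h = (1/2) ln(2 pi e sigma^2)
h_normal = 0.5 * math.log(2 * math.pi * math.e * sigma**2)

# Uniform with the same variance: width w = sigma * sqrt(12), h = ln(w)
h_uniform = math.log(sigma * math.sqrt(12))

# Laplace with the same variance: scale b = sigma / sqrt(2), h = 1 + ln(2b)
h_laplace = 1 + math.log(2 * sigma / math.sqrt(2))

print(h_normal, h_uniform, h_laplace)  # the normal comes out largest
```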

**Related post**: How the Central Limit Theorem began


“There are two strategies for estimating the mean of a random variable from a sample: the arithmetic mean of the samples, and the maximum likelihood value. Only for the normal distribution do these coincide.”

Can you link to references so that we can read more about these characterizations? For example, I’d never heard the quoted characterization of the normal distribution, and I’d like to read more.

Sure. (1) is in any book on probability. (2) and (3) can be found in “Probability Theory: The Logic of Science” by E. T. Jaynes, chapter 7. (4) can be found in Rao’s book linked to in my post, page 162.

Why then do we sometimes prefer the tanh distribution or the log-normal distribution?

Regards

The Gaussian does not fit distributions such as granulometric ones well.

I would prefer the tanh distribution or the log-normal distribution.

Median, mode, and mean tend not to be equal.

I once sat through a lecture by simulationist Averill Law (author of a seminal modeling and simulation text) where he advised the students that nothing in any system he had simulated followed the normal distribution. His assertion was that nothing anywhere followed the normal distribution.

I suppose this means he never tried to model errors.

Great post.

Thanks, Jason.

Many things are normal in the middle, but not many things are normal in the tails. So it all depends on how you’re going to use the normal distribution. For example, I wrote a pair of posts, one explaining why human heights are normally distributed and another explaining why they are not. It all depends on what question you’re asking.

It is very simple; sometimes negative values are prohibited. So how can the Gaussian distribution work in these kinds of situations?

Agustín: The fact that the normal distribution can be negative is not such a handicap in application. Obviously it could be for some applications, but often negative values are so rare that this isn’t a problem. For example, women’s heights are normally distributed with mean 64 inches and standard deviation 3 inches. That distribution could have negative values, but they would be about 21 standard deviations away from the mean, something with probability less than 10^-100.
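The tail probability can be computed directly (a sketch using only the standard library, with the mean and standard deviation quoted above):

```python
import math

# P(X < 0) for X ~ Normal(mean=64, sd=3): the left tail more than
# 21 standard deviations below the mean.
mean, sd = 64.0, 3.0
z = mean / sd
p_negative = 0.5 * math.erfc(z / math.sqrt(2))
print(p_negative)  # on the order of 1e-101, comfortably below 10^-100
```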

Sometimes people use a truncated normal, a normal distribution restricted to an interval, to avoid negative values. Sometimes people use a log normal distribution, which is equivalent to assuming the log of the data is normally distributed. Or you could use a gamma distribution with a large shape parameter; it’s approximately normal but positive valued.
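A quick sketch of the last suggestion (assuming NumPy): a gamma distribution with a large shape parameter is strictly positive yet behaves very much like a normal.

```python
import numpy as np

rng = np.random.default_rng(3)

# Gamma(shape=k, scale=1) has mean k, sd sqrt(k), and skewness 2/sqrt(k),
# so for large k it is nearly normal -- but always positive.
shape = 100.0
draws = rng.gamma(shape, 1.0, size=100_000)

standardized = (draws - shape) / np.sqrt(shape)
frac = np.mean(np.abs(standardized) < 1)

print(draws.min() > 0)             # strictly positive
print(f"within 1 sd: {frac:.3f}")  # close to 0.683, as for a normal
```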

The normal distribution maximising entropy is in Shannon’s seminal 1948 paper “A Mathematical Theory of Communication” (Section 20). Shannon founded all this information theory stuff, so I’d be surprised if it appears anywhere before that.

Can you say more about the two things that were perpendicular that led to the distribution being called Normal? (In the UK, we routinely use the term “normal” to describe things that are perpendicular.)

Hi John,

I should probably look at the references you mention, but I’m going to fire away my query anyway!

My query is about the claim in (3) that the MLE for the mean of a random variable and the sample mean coincide only for the normal. Aren’t there other distributions where this is true too (Poisson, exponential, Bernoulli, …)?

I would also be interested in the reference discussing the original use of “normal” (in the sense of orthogonal or perpendicular) in this context.

Mark: You are correct. There must be some additional hypothesis, and now I don’t know what it was.

I am using the tanh internal distribution. I think it is a camouflaged gamma distribution.

Tanh[(x/p)^n]. It produces the p parameter (76.1%) in a satisfactory way.

The mode is calculated in a fast iteration routine.

The mean and raw moments are calculated by a simple gamma function (of my own).

Sometimes it is difficult to give a suitable explanation when n < 1 because it does not produce a mode; it seems I am looking at a Poisson random distribution. I am not sure of it.

I want to consult you about the internal tanh distribution:

Y = tanh[(x/p)^n]

After performing a lot of calculus to evaluate the mean, I produced “an approximate solution”:

µ = p * Gamma[0.71158 + 0.68926/n] / Gamma[0.71158]

If I want to calculate the kth raw moment I apply:

µ = (p^k) * Gamma[0.71158 + k * 0.68926/n] / Gamma[0.71158]

Are those expressions strong solutions in the n > 2 zone?

One of my sons told me that these expressions are too exact to be approximations.
