“Which distribution describes my data?” Variations on that question pop up regularly on various online forums. Sometimes the person asking the question is looking for a goodness of fit test but doesn’t know the jargon “goodness of fit.” But more often they have something else in mind. They’re thinking of some list of familiar, named distribution families — normal, gamma, Poisson, etc. — and want to know which distribution from this list best fits their data. So the real question is something like the following:
Which distribution from the well-known families of probability distributions fits my data best?
Statistics classes can give the impression that there is a short list of probability distribution families, say the list in the index of the textbook for that class, and that something from one of those families will always fit any data set. This impression starts to seem absurd when stated explicitly. It raises two questions.
- What exactly is the list of well-known distributions?
- Why should a distribution from this list fit your data?
As for the first question, there is some consensus as to what the well-known distributions are. The distribution families in this diagram would make a good start. But the question of which distributions are “well known” is a sociological question, not a mathematical one. There’s nothing intrinsic to a distribution that makes it well known. For example, most statisticians would consider the Kumaraswamy distribution obscure and the beta distribution well known, even though the two are analytically similar.
You could argue that the canonical set of distributions is somewhat natural by a chain of relations. The normal distribution is certainly natural due to the central limit theorem. The chi-squared distribution is natural because the square of a normal random variable has a chi-squared distribution. The F distribution is related to the ratio of chi-squared variables, so perhaps it ought to be included. And so on and so forth. But each link in the chain is a little weaker than the previous. Also, why this chain of relationships and not some other?
Alternatively, you could argue that the distributions that made the canon are there because they have been found useful in practice. And so they have. But had people been interested in different problems, a somewhat different set of distributions would have been found useful.
Now on to the second question: Why should a famous distribution fit a particular data set?
Suppose a police artist asked a witness which U.S. president a criminal most closely resembled. The witness might respond
Well, she didn’t look much like any of them, but if I have to pick one, I’d pick John Adams.
The U.S. presidents form a convenient set of faces. You can find posters of their faces in many classrooms. The U.S. presidents are historically significant, but a police artist would do better to pick a different set of faces as a first pass in making a sketch.
I’m not saying it is unreasonable to want to fit a famous distribution to your data. Given two distributions that fit the data equally well, go with the more famous distribution. This is a sort of celebrity version of Occam’s razor. It’s convenient to use distributions that other people recognize. Famous distributions often have nice mathematical properties and widely available software implementations. But the list of famous distributions can form a Procrustean bed that we force our data to fit.
The extreme of Procrustean statistics is a list of well-known distributions with only one item: the normal distribution. Researchers often apply a normal distribution where it doesn’t fit at all. More dangerously, experienced statisticians can assume a normal distribution when the lack of fit isn’t obvious. If you implicitly assume a normal distribution, then any data point that doesn’t fit the distribution is an outlier. Throw out the outliers and the normal distribution fits well! Nassim Taleb calls the normal distribution the “Great Intellectual Fraud” in his book The Black Swan because people so often assume the distribution fits when it does not.
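To make the outlier effect concrete, here is a minimal simulation sketch in Python with scipy; the heavy-tailed t sample and the two-standard-deviation cutoff are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# A heavy-tailed sample: Student's t with 3 degrees of freedom, not normal.
x = stats.t.rvs(df=3, size=1000, random_state=rng)

# Shapiro-Wilk soundly rejects normality on the raw data.
print("p-value, raw data:", stats.shapiro(x).pvalue)

# Declare everything beyond two standard deviations an "outlier" and drop it.
trimmed = x[np.abs(x - x.mean()) < 2 * x.std()]

# The trimmed data look far more normal; the p-value is typically
# orders of magnitude larger than before.
print("p-value, trimmed: ", stats.shapiro(trimmed).pvalue)
```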
“Procrustean statistics.” Heh, I’ll have to adopt that.
It is good to remember that we do this implicitly when we summarize data with mean and sd. While mean and sd are useful in many cases (these estimates have meaning for many distributions), we still tend to think in terms of mean +/- 2*sd. For this reason, I have become disenchanted with the ubiquitous summary table in clinical trials.
These are all good points which I’d hesitate to disagree with. But I will anyway.
Lots of things have a normal distribution for good reason – they are a sum of a large number of small independent effects. The problem is more that the normal distribution is a victim of its own popularity, and is more widely used than it deserves. Some of the other distributions on the list of usual suspects are there for good reason too – binomial and exponential, for example.
A slightly different question to the one you posed is:
3) What distribution are my statistics?
because if we know that, we can make inferences. Normal is popular for some sound theoretical reasons, although non-normality will bite you if you’re complacent. This third question motivates a lot of robust statistics. One way of looking at robust statistics is that they take data with an unknown – and perhaps unknowable – distribution, and compute a statistic that nonetheless has a known distribution (e.g. the Mann-Whitney test).
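To illustrate that last point, here is a small sketch in Python with scipy; the lognormal samples are invented stand-ins for data of unknown distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two samples from a skewed distribution we pretend not to know.
a = rng.lognormal(mean=0.0, sigma=1.0, size=30)
b = rng.lognormal(mean=0.5, sigma=1.0, size=30)

# The Mann-Whitney U statistic has a known null distribution no matter
# what continuous distribution the data actually came from.
u, p = stats.mannwhitneyu(a, b)
print(f"U = {u}, p = {p:.4f}")
```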
William: Good point with question #3. Classical distributions may fit the derived statistics better than they fit the underlying data.
“Why should a famous distribution fit a particular data set?”
Because there are so many.
If forcing any problem or data to fit a “Gaussian assumption” is doing “Procrustean statistics,” then “Reverse Procrustean statistics” would very much look like using the Weibull distribution to fit one’s data, right?
;)
PS: thanks for this very interesting blog, by the way !
I’m under the impression that the reason we fit distributions to our data at all is largely for historical reasons. Do you have a sense of whether this is true?
My hunch: If cheap computation had been available to Fisher and the Pearsons and Neyman, I bet that something like bootstrapping would be the standard Stats 101 approach today. You’d let the data speak for itself and make as few assumptions as you can.
But since that was computationally impractical, they had to take an analytical approach instead. In order to answer William’s question #3 and make reasonable inferences, they used formalized distributions because their properties can be studied analytically. As for your question #2, there’s often no reason that any famous distribution should fit a given dataset… but without cheap computers you’ve got to fit it to something, and it might as well be something that’s been well studied; hence there has to exist some list of famous distributions, as per your question #1. And now the standard methods based on distributional assumptions have been taught in Stats 101 for decades, so the comparatively new resampling/nonparametric approaches are “extras” for special situations rather than the defaults they probably should be.
So: if you re-ran the past hundred years we might get a slightly different set of canonical distributions; but if you rewound the clock a hundred years and gave them computers, we’d see resampling etc instead of parametrics in Stats 101 today.
Does that sound plausible or am I totally off the mark?
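For readers who haven’t seen the technique Jerzy mentions, here is a minimal percentile-bootstrap sketch in Python (numpy only; the skewed exponential sample is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=50)  # a skewed sample; no normality assumed

# Percentile bootstrap for the mean: resample with replacement, recompute
# the statistic, and read the confidence interval off the empirical
# distribution of the resampled statistics.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```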
Jerzy: Interesting. It seems quite plausible that bootstrapping would have been much more common if computers had pre-dated the standardization of statistics.
I also think Bayesian statistics would have been more popular, perhaps even the dominant paradigm, had computation been available sooner. Bayesian statistics was developed before “classical” statistics, but did not become widely adopted because it usually requires evaluating integrals that can’t be computed in closed form. Bayesian statistics had a renaissance beginning in the 1980s as computers and algorithms made computing these integrals easier.
I suspect that it’s no accident that the majority of “canonical” distributions belong to the exponential family, so all are more or less tractable to some sort of symbolic (as opposed to numerical) analysis.
I advise my students to consider the underlying phenomenon they’re trying to model before they start zeroing in on some name-brand distribution. Averages, waiting times, and species diversity each call for very different families of distributions, and these only scratch the surface.
Of course, if model selection were easy, we statisticians might be out of a job…
Most populations can be adequately described by one of four basic distributions: normal, lognormal, triangular, or uniform. For more complex problems, researchers will find a multitude of distributions available in the literature. For those interested, I would recommend this free eBook as a handy general reference:
http://www.vosesoftware.com/content/ebook.pdf
Contact me directly if I can be of further help with specific distribution issues through my website at:
http://www.mckibbinusa.com
Thank you for the opportunity to comment…
William: The reference you linked to could be handy, but it may encourage the kind of thinking I’m warning against in this post. It’s the wrong starting point to ask which of the 50 distributions in the book best fits your data.
John, I agree with you. As I said, most populations can be described by simple normal, lognormal, triangular, and uniform distributions, which should generally be the starting point for analysis. Also, validating the model’s variables and framework is generally a more important challenge than identifying that one “perfect” distribution that describes the independent variables. Modeling is as much art as it is science, and I often find that too much time goes into “fitting” the probability distributions, and not enough goes into formulating, testing, and validating the model framework itself. Thanks again for the opportunity to comment…
John: I hadn’t heard the phrase “Procrustean statistics” before and I’d like to think I coined it, but alas, a quick Google search shows it’s been around a long time.
I am an ORMS type and not a statistician. One of the reasons I fit distributions is that I want to have an idea of what is going on in the unobserved tail of the distribution. Saying “The last couple of products had certain properties that were approximately gamma or Weibull distributed, and the current product seems to be as well, but each of the products has slightly different parameters” tends to be much better received than saying “Let’s run the prototypes for another five hundred years, and we’ll have a better idea.” I realize that this is partially what The Black Swan was complaining about, but it is the best I know how to do.
I know very little about modern statistical techniques, I can barely spell MCMC. How is the behavior in the tail estimated without fitting a distribution?
deinst: There are statistical ways of estimating tails by looking at extremes etc. But these methods break down at some point and you have to rely on non-probabilistic methods for capping losses. For example, rather than estimating the probability of someone having a $100,000,000 medical claim, insurance companies simply place a cap on coverage, say a lifetime $1,000,000 maximum benefit.
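A quick simulation in Python makes the point (the Pareto tail and dollar scaling are made up): with a heavy tail the sample mean is unstable from run to run, but the capped mean is bounded and well behaved.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical claims: a Pareto tail with infinite variance, scaled to dollars.
claims = (rng.pareto(a=1.5, size=100_000) + 1) * 10_000

cap = 1_000_000  # lifetime maximum benefit
capped = np.minimum(claims, cap)

print(f"mean uncapped claim: ${claims.mean():,.0f}")  # varies wildly with the seed
print(f"mean capped claim:   ${capped.mean():,.0f}")  # stable and bounded by the cap
```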
> Most populations can be adequately described by one of four basic distributions: normal, lognormal, triangular, or uniform.
Hmm. That reminds me of a rule of thumb told to me by an engineer: “Everything is linear on a log-log scale with a fat marker pen!”
“insurance companies simply place a cap on coverage, say a lifetime $1,000,000 maximum benefit” Yes, they do. But soon that will be illegal.
How about the related question: What are the legitimate reasons for wanting to fit a known distribution to your data?
For a non-statistician, knowing a population fits a known distribution allows one to quickly read the properties of that distribution and make better decisions based on the data. A means to determine what well-known distribution (if any) a population fits, particularly as opposed to just assuming all data forms a normal distribution, will thus allow us to better understand the characteristics of our data and make better decisions based on it.
To give a real example, web analytics very commonly end up with a pile of page access times and other numbers. People frequently assume any arbitrary set of values forms a normal distribution and then reason based on the properties of a normal distribution and ultimately make decisions based on this. If the values were actually not a very good fit for a normal distribution, it seems this could lead to poor decision making.
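A toy example in Python of how this bites (the lognormal load times are invented): estimate a 99th-percentile response time under a normal assumption and compare it with the empirical value.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical page load times in milliseconds: skewed, like most access times.
times = rng.lognormal(mean=5.0, sigma=0.8, size=10_000)

# 99th percentile under a (wrong) normal assumption vs. the empirical value.
normal_p99 = times.mean() + 2.326 * times.std()
empirical_p99 = np.percentile(times, 99)

print(f"normal-theory p99: {normal_p99:.0f} ms")
print(f"empirical p99:     {empirical_p99:.0f} ms")  # noticeably larger
```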
Here’s a question. Consider the space of all possible distributions. What % of the space do the famous distributions occupy?
I would guess it’s a set of measure zero, but I don’t know (also I don’t think distributions can fit in a set — can they?).
(I was just reading a preview of “Iceberg Risk” last night … on Wilmott.com … which is why this question popped to mind. Oh, did you think that the odds of three heads by three uncorrelated fair coins was 1/8? Sorry.)
Chris: Famous distributions are described by a finite number of parameters, so yes, they would make up a set of measure zero within the space of probability distributions.
More importantly, famous distributions make uneven approximations. Think of the space of continuous functions over an interval. Polynomials span a subspace of measure zero, but they’re dense: you can approximate any continuous function with a polynomial. Famous distributions don’t approximate the space of distributions the same way.
For example, most famous distributions are unimodal. The only exception that comes to mind is the beta distribution, which can be bimodal for some parameters. So depending on your definition of “famous,” it may be that no famous distribution can approximate a distribution with three modes. (But if you allow mixtures of famous distributions, that opens up a whole new realm of possibilities.)
Exactly what I wanted to know. “Iceberg Risk” is about mixtures of distributions.
But so how well do they stack up? I drew a picture of N(0,1) + N(-3,1) in R and it barely looked skewed. Certainly not bimodal.
Chris: A mixture of normals can take on quite a variety of shapes. See some examples here. If the difference in means is large relative to the standard deviations, the result will be bimodal.
I imagine you can get good approximation results for mixtures. Mixtures of probability distributions are analogous to wavelet bases, and there you can get good approximation results by shifting and scaling just one function and taking linear combinations.
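One way to check the bimodality numerically, as a quick Python sketch: evaluate the mixture density on a grid and count its local maxima. With means three standard deviations apart, the 50-50 mixture of N(0,1) and N(-3,1) does have two modes, though the dip between them is shallow, which may be why Chris’s plot looked nearly unimodal.

```python
import numpy as np
from scipy import stats

# Density of an equal-weight mixture of N(0,1) and N(-3,1) on a grid.
x = np.linspace(-7, 4, 2001)
pdf = 0.5 * stats.norm.pdf(x, loc=0) + 0.5 * stats.norm.pdf(x, loc=-3)

# Count interior local maxima of the density: these are its modes.
is_mode = (pdf[1:-1] > pdf[:-2]) & (pdf[1:-1] > pdf[2:])
print("modes at approximately:", x[1:-1][is_mode])  # two modes, near -3 and 0
```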
Can I determine which distribution a set of data best approximates using Excel or some other software?
John: many thanks for the post and blog. I have a simple question.
We calculate residuals to check a linear model’s assumptions. Normality is one of them. With the help of graphical (Q-Q plot, etc.) and numerical (Shapiro-Wilk statistic, etc.) techniques we decide whether our data come from a normally distributed population or not. If yes, we choose parametric tests; otherwise we use nonparametric approaches.
Your post and some comments say that we should pay attention when we fit different distributions to our data. But what if a beta or gamma distribution fits my residuals well? Say, if I’ve got a sample drawn from a (non-normal) beta distributed population, how does it affect my statistical data analysis flow? Thanks a lot.
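As a concrete starting point for the usual checks, here is a sketch in Python with scipy (the gamma “residuals” are simulated purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
# Stand-in residuals from some fitted model: skewed, clearly non-normal.
residuals = rng.gamma(shape=2.0, scale=1.0, size=200)
residuals -= residuals.mean()

# Numerical check: Shapiro-Wilk test of normality.
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Graphical check: normal Q-Q plot; systematic curvature signals non-normality.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```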
On a related note:
One can check whether a sample (x) might be represented by a certain distribution using the Kolmogorov-Smirnov test, correct?
However, when I try any distribution besides the normal ‘norm’ in scipy.stats I get an error saying not enough arguments, but I am not sure what I am missing…
kstest(x, 'genextreme')
kstest(logx, 'pearson3')
after importing kstest from scipy.stats
where x is a data set of 100 points
insight?
Thanks!
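For what it’s worth, the error most likely arises because 'genextreme' and 'pearson3' have required shape parameters, which kstest accepts through its args keyword. A sketch of the common fit-then-test approach (note the caveat in the comments):

```python
import numpy as np
from scipy import stats
from scipy.stats import kstest

rng = np.random.default_rng(5)
x = stats.genextreme.rvs(c=0.1, size=100, random_state=rng)  # stand-in data

# genextreme has a required shape parameter, so kstest needs it via args=.
# Common approach: fit the parameters, then test against the fitted
# distribution. Caveat: estimating the parameters from the same data makes
# the standard KS p-value optimistic.
params = stats.genextreme.fit(x)  # returns (c, loc, scale)
print(kstest(x, "genextreme", args=params))
```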
John: I meant in general – what if the statistical population follows a different distribution than the one we justified from the sample we tested? (We can’t check all distributions, right?) How will this discrepancy affect the conclusions of our statistical analysis?
But if I go back to your question – if I got it right – we can check a distribution’s fit with a Q-Q plot in SAS using PROC UNIVARIATE, for example (http://blogs.sas.com/content/iml/2011/10/28/modeling-the-distribution-of-data-create-a-qq-plot.html).