Someone asked me yesterday how people justify probability distribution assumptions. Sometimes the most mystifying assumption is the first one: “Assume X is normally distributed …” Here are a few answers.
- Sometimes distribution assumptions are not justified.
- Sometimes distributions can be derived from fundamental principles. For example, there are axioms that uniquely specify a Poisson distribution.
- Sometimes distributions are justified on theoretical grounds. For example, large samples and the central limit theorem together may justify assuming that something is normally distributed.
- Often the choice of distribution is somewhat arbitrary, chosen by intuition or for convenience, and then empirically shown to work well enough.
- Sometimes a distribution can be a bad fit and still work well, depending on what you’re asking of it.
The last point is particularly interesting. It’s not hard to imagine that a poor fit would produce poor results. It’s surprising when a poor fit produces good results. Here’s an example of the latter.
Suppose you are testing a new drug and hoping that it improves how long patients live. You want to stop the clinical trial early if it looks like patients are living no longer than they would have on standard treatment. There is a Bayesian method for monitoring such experiments that assumes survival times have an exponential distribution. But survival times are not exponentially distributed, not even close.
The method works well because of the question being asked. The method is not being asked to accurately model the distribution of survival times for patients in the trial. It is only being asked to determine whether a trial should continue or stop, and it does a good job of doing so. Simulations show that the method makes the right decision with high probability, even when the actual survival times are not exponentially distributed.
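Here is a minimal sketch of that kind of simulation. It is not the specific monitoring method referred to above. It assumes a conjugate gamma prior on the exponential rate, no censoring, interim looks after every 10 patients, and a stopping threshold of 0.05; all function names and parameter values are hypothetical choices for illustration. The simulated survival times come from a Weibull distribution with shape 2, which is clearly not exponential, while the monitoring rule pretends they are exponential.

```python
import math
import random


def gamma_cdf_int_shape(x, shape, rate):
    """P(G <= x) for G ~ Gamma(shape, rate) with a positive integer shape.

    Uses the Erlang identity P(G <= x) = P(Poisson(rate * x) >= shape).
    """
    mu = rate * x
    term = math.exp(-mu)  # Poisson pmf at 0
    below = 0.0
    for i in range(shape):
        below += term          # add pmf(i)
        term *= mu / (i + 1)   # advance to pmf(i + 1)
    return 1.0 - below


def prob_new_beats_standard(times, standard_mean, prior_shape=2):
    """Posterior P(mean survival > standard_mean) under an exponential model.

    Assumed prior: rate ~ Gamma(prior_shape, prior_shape * standard_mean),
    so the prior mean of the rate equals the standard treatment's rate.
    """
    post_shape = prior_shape + len(times)
    post_rate = prior_shape * standard_mean + sum(times)
    # mean > standard_mean  is the same event as  rate < 1 / standard_mean
    return gamma_cdf_int_shape(1.0 / standard_mean, post_shape, post_rate)


def run_trial(true_mean, standard_mean, rng, max_patients=60, cohort=10,
              stop_below=0.05, weibull_shape=2.0):
    """Simulate one trial with Weibull survival times; return True if stopped early."""
    # Pick the Weibull scale so the true mean survival is true_mean.
    scale = true_mean / math.gamma(1.0 + 1.0 / weibull_shape)
    times = []
    while len(times) < max_patients:
        times.extend(rng.weibullvariate(scale, weibull_shape) for _ in range(cohort))
        if prob_new_beats_standard(times, standard_mean) < stop_below:
            return True
    return False


def estimate_stop_rate(true_mean, standard_mean, n_trials=2000, seed=0):
    """Fraction of simulated trials that stop early."""
    rng = random.Random(seed)
    stops = sum(run_trial(true_mean, standard_mean, rng) for _ in range(n_trials))
    return stops / n_trials


if __name__ == "__main__":
    standard = 12.0  # hypothetical mean survival (months) on standard treatment
    print("stop rate, inferior drug (mean 6): ", estimate_stop_rate(6.0, standard))
    print("stop rate, superior drug (mean 18):", estimate_stop_rate(18.0, standard))
```

Under these assumptions the rule should stop nearly every trial of the inferior drug and almost none of the superior one, even though the exponential model is a poor description of the individual survival times. The decision depends mostly on total observed survival time, which is why the misfit matters little here.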
4 thoughts on “How do you justify that distribution?”
Good points. Sometimes distributions are chosen based on maximum entropy.
Dear John D. Cook!
Some aspects of the problem are discussed in this paper: “Some Problems of Identification of Random Values Distribution Models When Using the Modern Software” (in Russian).
In my opinion, the choice of distribution cannot be arbitrary. The analyst must have some reasoning for the distribution chosen. However, I agree that the reasons can be grounded in data, theory, expert opinion, or simply a mandate. Whatever the reason, the analyst must be able to footnote a distribution choice with reasoning other than “I selected this distribution from a gallery screen.” Even a triangular distribution can be justified if data or other rational reasoning fails. The sad truth is that many models I audit are assigned distributions that are selected by the analyst for no reason at all…
A typical approach in my domain (quantitative finance) is to build a basic model using a normal or log-normal distribution, and then “fudge it” by adding stuff on top to account for the fact that the market doesn’t follow the assumption of normality (or log-normality). It turns out that the users of the models (traders, risk managers, regulators) prefer it that way, because people now have a lot of intuition about Gaussian processes, and it’s just easier to expand this intuition incrementally by including the parameters that cause the model to deviate from Gaussian behaviour (the famous “volatility smile”, for example). This preference and ease of tractability are the main reasons why models using e.g. alpha-stable fat-tailed processes never went much beyond the academic literature.