# How do you justify that distribution?

Someone asked me yesterday how people justify probability distribution assumptions. Sometimes the most mystifying assumption is the first one: “Assume X is normally distributed …” Here are a few answers.

1. Sometimes distribution assumptions are not justified.
2. Sometimes distributions can be derived from fundamental principles. For example, there are axioms that uniquely specify a Poisson distribution.
3. Sometimes distributions are justified on theoretical grounds. For example, large samples and the central limit theorem together may justify assuming that something is normally distributed.
4. Often the choice of distribution is somewhat arbitrary, chosen by intuition or for convenience, and then empirically shown to work well enough.
5. Sometimes a distribution can be a bad fit and still work well, depending on what you’re asking of it.

The last point is particularly interesting. It’s not hard to imagine that a poor fit would produce poor results. It’s surprising when a poor fit produces good results. Here’s an example of the latter.

Suppose you are testing a new drug and hoping that it improves how long patients live. You want to stop the clinical trial early if it looks like patients are living no longer than they would have on standard treatment. There is a Bayesian method for monitoring such experiments that assumes survival times have an exponential distribution. But survival times are not exponentially distributed, not even close.

The method works well because of the question being asked. The method is not being asked to accurately model the distribution of survival times for patients in the trial. It is only being asked to determine whether a trial should continue or stop, and it does a good job of doing so. Simulations show that the method makes the right decision with high probability, even when the actual survival times are not exponentially distributed.

## 4 thoughts on “How do you justify that distribution?”

1. Good points. Sometimes distributions are chosen based on maximum entropy.

Dear John D. Cook!
In this paper some aspects of the problem are discussed.

SOME PROBLEMS OF IDENTIFICATION OF RANDOM VALUES DISTRIBUTION
MODELS WHEN USING THE MODERN SOFTWARE (in Russian)
http://www.rae.ru/use/pdf/2011/11/11.pdf

Sinserely Yours,