If all you know about a person is that he or she is around 5′ 7″, it’s a toss-up whether this person is male or female. If you know someone is over 6′ tall, they’re probably male. If you hear they are over 7′ tall, they’re almost certainly male. This is a consequence of heights having a thin-tailed probability distribution. Thin, medium, and thick tails all behave differently for the attribution problem as we will show below.
Thin tails: normal
Suppose you have an observation that either came from X or Y, and a priori you believe that both are equally probable. Assume X and Y are both normally distributed with standard deviation 1, but that the mean of X is 0 and the mean of Y is 1. The probability you assign to X and Y after seeing data will change, depending on what you see. The larger the value you observe, the more sure you can be that it came from Y.
This plot shows the posterior probability that an observation came from Y, given that we know the observation is greater than the value on the horizontal axis.
Suppose I’ve seen the exact value, and it’s 4, but all I tell you is that it’s bigger than 2. Then you would say it probably came from Y. When I go back and tell you that in fact it’s bigger than 3, you would be more sure it came from Y. The more information I give you, the more convinced you are. This isn’t the case with other probability distributions.
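This behavior is easy to check numerically. Here is a minimal sketch (my own illustration, with function names I’ve made up) that computes the posterior probability that the sample came from Y given a lower bound, using the standard normal survival function:

```python
from math import erfc, sqrt

def norm_sf(z):
    """Survival function P(N(0,1) > z) of the standard normal."""
    return 0.5 * erfc(z / sqrt(2))

def posterior_normal(t):
    """P(sample came from Y | sample > t), where X ~ N(0,1), Y ~ N(1,1),
    and both are equally probable a priori."""
    sf_x = norm_sf(t)      # P(X > t)
    sf_y = norm_sf(t - 1)  # P(Y > t)
    return sf_y / (sf_x + sf_y)

# The posterior increases with the lower bound:
print(posterior_normal(2))  # ≈ 0.87
print(posterior_normal(3))  # ≈ 0.94
```

Raising the lower bound from 2 to 3 pushes the posterior probability up, matching the story above.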
Thick tails: Cauchy
Now let’s suppose X and Y have a Cauchy distribution with unit scale, with X having mode 0 and Y having mode 1. The plot below shows how likely it is that our observation came from Y given a lower bound on the observation.
We are most confident that our data came from Y when we know that our data is greater than 1. But the larger our lower bound is, the further we look out in the tails, the less confident we are! If we know, for example, that our data is at least 5, then we still think that it’s more likely that it came from Y than from X, but we’re not as sure.
As above, suppose I’ve seen the data value but only tell you lower bounds on its value. Suppose I see a value of 4, but only tell you it’s positive. When I come back and tell you that the value is bigger than 1, your confidence goes up that the sample came from Y. But as I give you more information, telling you that the sample is bigger than 2, then bigger than 3, your confidence that it came from Y goes down, just the opposite of the normal case.
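The same kind of sketch for the Cauchy case (again my own illustration, using the closed-form Cauchy survival function) shows the opposite behavior:

```python
from math import atan, pi

def cauchy_sf(z):
    """Survival function P(C > z) of the standard Cauchy distribution."""
    return 0.5 - atan(z) / pi

def posterior_cauchy(t):
    """P(sample came from Y | sample > t), where X and Y are unit-scale
    Cauchy with modes 0 and 1, equally probable a priori."""
    sf_x = cauchy_sf(t)      # P(X > t)
    sf_y = cauchy_sf(t - 1)  # P(Y > t)
    return sf_y / (sf_x + sf_y)

# Confidence peaks at a lower bound of 1, then decreases toward 1/2:
for t in [1, 2, 3, 5, 50]:
    print(t, posterior_cauchy(t))
```

As the lower bound grows, the posterior probability falls back toward 1/2: the observation stays (slightly) more consistent with Y, but the evidence weakens.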
What accounts for the very different behavior?
In the normal example, seeing a value of 5 or more from Y is unlikely, but seeing a value so large from X is very unlikely. Both tails are getting smaller as you move to the right, but in relative terms, the tail of X is getting thinner much faster than the tail of Y.
In the Cauchy example, the value of both tails gets smaller as you move to the right, but the relative difference between the two tails is decreasing. Seeing a value greater than 10, say, from Y is unlikely, but it would only be slightly less likely from X.
Medium tails: Laplace
In between thin tails and thick tails are medium tails. The tails of the Laplace (double exponential) distribution decay exponentially. Exponential tails are often considered the boundary between thin tails and thick tails, or between super-exponential and sub-exponential tails.
Suppose you have two Laplace random variables with the same scale, with X centered at 0 and Y centered at 1. What is the posterior probability that a sample came from Y rather than X, given that it’s at least some value z > 1? It’s constant! Specifically, it’s e/(1 + e).
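A quick numerical check of this claim (again an illustrative sketch of my own; for a lower bound beyond both centers, the Laplace survival function is a pure exponential, so the ratio of tails is constant):

```python
from math import e, exp

def laplace_sf(z, mu):
    """Survival function of a Laplace(mu, 1) variable, valid for z >= mu."""
    return 0.5 * exp(-(z - mu))

def posterior_laplace(t):
    """P(sample came from Y | sample > t), where X ~ Laplace(0,1),
    Y ~ Laplace(1,1), equally probable a priori; valid for t > 1."""
    sf_x = laplace_sf(t, 0)  # P(X > t)
    sf_y = laplace_sf(t, 1)  # P(Y > t)
    return sf_y / (sf_x + sf_y)

# Constant for every lower bound t > 1:
print(posterior_laplace(2), e / (1 + e))  # both ≈ 0.731
```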
In the normal and Cauchy examples, it didn’t really matter that one distribution was centered at 0 and the other at 1. We’d get the same qualitative behavior no matter what the shift between the two distributions. The limit would tend to 1 for the normal distribution and 1/2 for the Cauchy. The constant value of the posterior probability with the Laplace example depends on the size of the shift between the two.
We’ve assumed that X and Y are equally likely a priori. The limiting value in the normal case does not depend on the prior probabilities of X and Y as long as they’re both positive. The prior probabilities will affect the limiting values for the Cauchy and Laplace case.
For anyone who wants a more precise formulation of the examples above, let B be a non-degenerate Bernoulli random variable and define Z = BX + (1-B)Y. We’re computing the conditional probability Pr(B = 0 | Z > z) using Bayes’ theorem. If X and Y are normally distributed, the limit of Pr(B = 0 | Z > z) as z goes to infinity is 1. If X and Y are Cauchy distributed, the limit is the unconditional probability Pr(B = 0).
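This formulation can be sketched directly. The code below (my own illustration; the prior value 0.2 is an arbitrary choice made to show the dependence on the prior) computes Pr(B = 0 | Z > z) by Bayes’ theorem for both the normal and Cauchy cases:

```python
from math import atan, erfc, pi, sqrt

def posterior(sf, z, p0=0.2):
    """Pr(B = 0 | Z > z) by Bayes' theorem, where Z = B*X + (1-B)*Y,
    Pr(B = 0) = p0, X has survival function sf, and Y is X shifted by 1."""
    num = p0 * sf(z - 1)          # prior times P(Y > z)
    den = (1 - p0) * sf(z) + num  # total probability P(Z > z)
    return num / den

norm_sf = lambda z: 0.5 * erfc(z / sqrt(2))  # standard normal tail
cauchy_sf = lambda z: 0.5 - atan(z) / pi     # standard Cauchy tail

# Normal: the limit is 1, regardless of the prior.
print(posterior(norm_sf, 8))       # close to 1
# Cauchy: the limit is the prior Pr(B = 0) = 0.2.
print(posterior(cauchy_sf, 1000))  # close to 0.2
```

With a prior of 0.2 rather than 0.5, the normal posterior still tends to 1, while the Cauchy posterior tends back to 0.2, illustrating the previous paragraph’s point about priors.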
In the normal case, as z goes to infinity, the distribution of B carries no useful information.
In the Cauchy case, as z goes to infinity, the observation z carries no useful information.