Convenient and innocuous priors

Andrew Gelman has some interesting comments on non-informative priors this morning. Rather than thinking of the prior as a static thing, think of it as a way to prime the pump.

… a non-informative prior is a placeholder: you can use the non-informative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information. …

At first this may sound like tweaking your analysis until you get the conclusion you want. It’s like the old joke about consultants: the client asks what 2+2 equals and the consultant counters by asking the client what he wants it to equal. But that’s not what Andrew is recommending.

A prior distribution cannot strictly be non-informative, but there are common intuitive notions of what it means to be non-informative. It may be helpful to substitute “convenient” or “innocuous” for “non-informative.” My take on Andrew’s advice is something like this.

Start with a prior distribution that’s easy to use and that nobody is going to give you grief for using. Maybe the prior doesn’t make much difference. But if your convenient/innocuous prior leads to too vague a conclusion, go back and use a more realistic prior, one that requires more effort or risks more criticism.

It’s odd that realistic priors can be more controversial than unrealistic priors, but that’s been my experience. It’s OK to be unrealistic as long as you’re conventional.

Levels of uncertainty

The other day I heard someone say something like the following:

I can’t believe how people don’t understand probability. They don’t realize that if a coin comes up heads 20 times, on the next flip there’s still a 50-50 chance of it coming up tails.

But if I saw a coin come up heads 20 times, I’d suspect it would come up heads the next time.

There are two levels of uncertainty here. IF the probability of a coin coming up heads is θ = 1/2 and the tosses are independent, then yes, the probability of a head is 1/2 each time, regardless of how many heads have shown before. The parameter θ models our uncertainty regarding which side will show after a toss of the coin. That’s the first level of uncertainty.

But what about our uncertainty in the value of θ? Twenty flips showing the same side up should cause us to question whether θ really is 1/2. Maybe it’s a biased coin and θ is greater than 1/2. Or maybe it really is a fair coin and we’ve just seen a one-in-a-million event. (Such events do happen, but only one in a million times.) Our uncertainty regarding the value of θ is a second level of uncertainty.

Frequentist statistics approaches these two kinds of uncertainty differently. That approach says that θ is a constant but unknown quantity. Probability describes the uncertainty regarding the coin toss given some θ but not the uncertainty regarding θ. The Bayesian models all uncertainty using probability. So the outcome of the coin toss given θ is random, but θ itself is also random. It’s turtles all the way down.
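
To make the second level of uncertainty concrete, here is a minimal Bayesian sketch in Python. The uniform Beta(1, 1) prior on θ is my choice for illustration, not something fixed by the problem.

    from scipy.stats import beta

    # Observed data: 20 heads, no tails
    heads, tails = 20, 0

    # A uniform Beta(1, 1) prior on theta updates to a Beta(21, 1) posterior
    posterior = beta(1 + heads, 1 + tails)

    # The predictive probability of heads on the next flip is the
    # posterior mean of theta
    print(posterior.mean())  # 21/22, about 0.95

After 20 straight heads, the predictive probability of another head is about 95%, which matches the intuition that the coin is probably not fair.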

It’s possible to have different degrees of uncertainty at each level. You could, for example, calculate the probability of some quantum event very accurately. If that probability is near 1/2, there’s a lot of uncertainty regarding the event itself, but little uncertainty about the parameter. High uncertainty at the first level, low uncertainty at the second. If you warp a coin, it may not be apparent what effect that will have on the probability of the outcome. Now there’s significant uncertainty at both the first and second levels.

We’ve implicitly assumed that a single parameter θ describes the uncertainty in a coin toss outcome. Maybe that’s not true. Maybe the person tossing the coin has the ability to influence the outcome. (Some very skilled people can. I’ve heard rumors that Persi Diaconis is good at this.) Now we have a third level of uncertainty, uncertainty regarding our model and not just its parameter.

If you’re sure that a parameter θ describes the coin toss, but you don’t know θ, then the coin toss outcome is a known unknown and θ is an unknown unknown, a second-order uncertainty. More often, though, people use the term “unknown unknown” to describe a third-order uncertainty: unforeseen factors that are not included in a model, not even as uncertain parameters.

Bayes : Python :: Frequentist : Perl

Bayesian statistics is to Python as frequentist statistics is to Perl.

Perl has the slogan “There’s more than one way to do it,” abbreviated TMTOWTDI and pronounced “tim toady.” Perl prides itself on variety.

Python takes the opposite approach. The Zen of Python says “There should be one — and preferably only one — obvious way to do it.” Python prides itself on consistency.

Frequentist statistics has a variety of approaches and criteria for various problems. Bayesian critics call this “adhockery.”

Bayesian statistics has one way to do everything: write down a likelihood function and prior distribution, then add data and compute a posterior distribution. This is sometimes called “turning the Bayesian crank.”
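
Here is what turning the crank can look like in a few lines of Python, a sketch using a grid approximation with a flat prior and made-up data: 7 successes in 10 binary trials.

    import numpy as np

    theta = np.linspace(0, 1, 1001)          # grid over the parameter
    prior = np.ones_like(theta)              # flat prior
    likelihood = theta**7 * (1 - theta)**3   # binomial likelihood, up to a constant
    posterior = prior * likelihood           # the crank: prior times likelihood
    posterior /= np.trapz(posterior, theta)  # normalize to integrate to 1

    print(np.trapz(theta * posterior, theta))  # posterior mean, about 0.67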

A statistical problem with “nothing to hide”

One problem with the nothing-to-hide argument is that it assumes innocent people will be exonerated certainly and effortlessly. That is, it assumes that there are no errors, or if there are, they are resolved quickly and easily.

Suppose the probability of correctly analyzing an email or phone call is not 100% but 99.99%. In other words, there’s one chance in 10,000 of an innocent message appearing incriminating. Imagine authorities analyzing one message each from 300,000,000 people, roughly the population of the United States. Then around 30,000 innocent people will have some ’splaining to do. They will have to interrupt their dinner to answer questions from an agent knocking on their door, or maybe they’ll spend a few weeks in custody. If the legal system is 99.99% reliable, then three of them will go to prison.

Now suppose false positives are really rare, one in a million. If you analyze 100 messages from each person rather than just one, you’re approximately back to the scenario above.
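
The arithmetic above takes only a few lines of Python to check:

    population = 300_000_000  # rough US population

    # One message per person, one error in 10,000
    print(population * 1 * 1e-4)    # 30,000 innocent people flagged

    # 100 messages per person, one error in a million
    print(population * 100 * 1e-6)  # 30,000 again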

Scientists call indiscriminately looking through large amounts of data “a fishing expedition” or “data dredging.” One way to mitigate the problem of massive false positives from data dredging is to demand a hypothesis: before you look through the data, say what you’re hoping to prove and why you think it’s plausible.

The legal analog of a plausible hypothesis is a search warrant. In statistical terms, “probable cause” is a judge’s estimation that the prior probability of a hypothesis is moderately high. Requiring scientists to have a hypothesis and requiring law enforcement to have a search warrant both dramatically reduce the number of false positives.

A priori overfitting

The term overfitting usually describes fitting too complex a model to available data. But it is possible to overfit a model before there are any data.

An experimental design, such as a clinical trial, proposes some model to describe the data that will be collected. For simple, well-known models the behavior of the design may be known analytically. For more complex or novel methods, the behavior is evaluated via simulation.

If an experimental design makes strong assumptions about data, and is then simulated with scenarios that follow those assumptions, the design should work well. So designs must be evaluated using scenarios that do not exactly follow the model assumptions. Here lies a dilemma: how far should scenarios deviate from model assumptions? If they do not deviate at all, you don’t have a fair evaluation. But deviating too far is unreasonable as well: no method can be expected to work well when its assumptions are flagrantly violated.

With complex designs, it may not be clear to what extent scenarios deviate from modeling assumptions. The method may be robust to some kinds of deviations but not to others. Simulation scenarios for complex designs are samples from a high dimensional space, and it is impossible to adequately explore a high dimensional space with a small number of points. Even if these scenarios were chosen at random—which would be an improvement over manually selecting scenarios that present a method in the best light—how do you specify a probability distribution on the scenarios? You’re back to a variation on the previous problem.

Once you have the data in hand, you can try a complex model and see how well it fits. But with experimental design, the model is determined before there are any data, and thus there is no possibility of rejecting the model for being a poor fit. You might decide after it’s too late, after the data have been collected, that the model was a poor fit. However, retrospective model criticism is complicated for adaptive experimental designs because the model influenced which data were collected.

This is especially a problem for one-of-a-kind experimental designs. When evaluating experimental designs — not the data in the experiment but the experimental design itself — each experiment is one data point. With only one data point, it’s hard to criticize a design. This means we must rely on simulation, where it is possible to obtain many data points. However, this brings us back to the arbitrary choice of simulation scenarios. In this case there are no empirical data to test the model assumptions.

Offended by conditional probability

It’s a simple rule of probability that if A makes B more likely, B makes A more likely. That is, if the conditional probability of A given B is larger than the probability of A alone, the conditional probability of B given A is larger than the probability of B alone. In symbols,

Prob( A | B ) > Prob( A ) ⇒ Prob( B | A ) > Prob( B ).

The proof is trivial: Apply the definition of conditional probability and observe that if Prob( AB ) / Prob( B ) > Prob( A ), then Prob( AB ) / Prob( A ) > Prob( B ).

Let A be the event that someone was born in Arkansas and let B be the event that this person has been president of the United States. There are five living current and former US presidents, and one of them, Bill Clinton, was born in Arkansas, a state with about 1% of the US population. Knowing that someone has been president increases your estimation of the probability that this person is from Arkansas. Similarly, knowing that someone is from Arkansas should increase your estimation of the chances that this person has been president.

The chances that an American selected at random has been president are very small, but as small as this probability is, it goes up if you know the person is from Arkansas. In fact, it goes up by the same factor as the conditional probability in the other direction. Knowing that someone has been president increases their probability of being from Arkansas by a factor of 20, so knowing that someone is from Arkansas increases the probability that they have been president by a factor of 20 as well. This is because

Prob( A | B ) / Prob( A ) = Prob( B | A ) / Prob( B ).
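
Here is a quick numerical check of this identity, using the rough numbers from the example: a 1% chance of being born in Arkansas, five living presidents, and one of the five born there.

    p_B = 5 / 300_000_000  # P(president), five out of a rough US population
    p_A = 0.01             # P(born in Arkansas)
    p_AB = p_B * (1 / 5)   # P(president and born in Arkansas)

    lhs = (p_AB / p_B) / p_A  # P(A | B) / P(A)
    rhs = (p_AB / p_A) / p_B  # P(B | A) / P(B)
    print(lhs, rhs)           # both 20.0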

This isn’t controversial when we’re talking about presidents and where they were born. But it becomes more controversial when we apply the same reasoning, for example, to deciding who should be screened at airports.

When I jokingly said that being an Emacs user makes you a better programmer, it appears a few Vim users got upset. Whether they were serious or not, it does seem that they thought “Hey, what does that say about me? I use Vim. Does that mean I’m a bad programmer?”

Assume for the sake of argument that Emacs users are better programmers, i.e.

Prob( good programmer | Emacs user )  >  Prob( good programmer ).

We’re not assuming that Emacs users are necessarily better programmers, only that a larger proportion of Emacs users are good programmers. And we’re not saying anything about causality, only probability.

Does this imply that being a Vim user lowers your chance of being a good programmer? i.e.

Prob( good programmer | Vim user )  <  Prob( good programmer )?

No, because being a Vim user is a specific alternative to being an Emacs user, and there are programmers who use neither Emacs nor Vim. What the above statement about Emacs would imply is that

Prob( good programmer | not an Emacs user )  <  Prob( good programmer ).

That is, if knowing that someone uses Emacs increases the chances that they are a good programmer, then knowing that they are not an Emacs user does indeed lower the chances that they are a good programmer, if we have no other information. In general

Prob( A | B ) > Prob( A ) ⇒ Prob( A | not B ) < Prob( A ).

This follows from the law of total probability: Prob( A ) is a weighted average of Prob( A | B ) and Prob( A | not B ), so if one of the conditional probabilities is above the average, the other must be below it.

To take a more plausible example, suppose that spending four years at MIT obtaining a computer science degree makes you a better programmer. Then knowing that someone has a CS degree from MIT increases the probability that this person is a good programmer. But if that’s true, it must also be true that absent any other information, knowing that someone does not have a CS degree from MIT decreases the probability that this person is a good programmer. If a larger proportion of good programmers come from MIT, then a smaller proportion must not come from MIT.

* * *

This post uses the ideas of information and conditional probability interchangeably. If you’d like to read more on that perspective, I recommend Probability Theory: The Logic of Science by E. T. Jaynes.

Closet Bayesian

When I was a grad student, a statistics postdoc confided to me that he was a “closet Bayesian.” This sounded absolutely bizarre. Why would someone be secretive about his preferred approach to statistics? I could not imagine someone whispering that although she’s doing her thesis in algebra, she’s secretly interested in differential equations.

I knew nothing about statistics at the time and was surprised to find that there was a bitter rivalry between two schools of statistics. The rivalry is still there, though it’s not as bitter as it once was.

I find it grating when someone asks “Are you a Bayesian?” It implies an inappropriate degree of commitment and exclusivity. Bayesian statistics is just a tool. Statistics itself is just a tool, one way of understanding the world.

My car has a manual transmission. I prefer manual transmissions. But if someone asked whether I was a manual transmissionist, I’d look at them like they’re crazy. I don’t have any moral objections to automatic transmissions.

I evaluate a car by how well it works. And for most purposes, I prefer the way a manual transmission works. But when I’m teaching one of my kids to drive, we go out in my wife’s car with an automatic transmission. Similarly, I evaluate a mathematical model (statistical or otherwise) by how it works for a given purpose. Sometimes a Bayesian and a frequentist approach lead to the same conclusions, but the latter is easier to understand or implement. Sometimes a Bayesian method leads to a better result because it can use more information or is easier to interpret. Sometimes it’s a toss-up and I use a Bayesian approach because it’s more familiar, just like my old car.

Sleeper theorems

I’m using the term “sleeper” here for a theorem that is far more important than it seems, something that you may not appreciate for years after you first see it.

Bayes’ theorem

The first such theorem that comes to mind is Bayes’ theorem. I remember being unsettled by this theorem when I took my first probability course. I found it easy to prove but hard to understand. I couldn’t decide whether it was trivial or profound. Then years later I found myself using Bayes’ theorem routinely.

The key insight of Bayes’ theorem is that it gives you a way to turn probabilities around. That is, it lets you compute the probability of A given B from the probability of B given A. That may not seem so important, but it’s vital in application. It’s often easy to compute the probability of data given an hypothesis, but we need to know the probability of an hypothesis given data. Those unfamiliar with Bayes’ theorem often get probabilities backward.
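
As an illustration, here is a sketch with a hypothetical diagnostic test, where the sensitivity P(positive | disease) is easy to measure but P(disease | positive) is what a patient actually wants to know. All the numbers are made up.

    p_disease = 0.01            # prior probability of disease
    p_pos_given_disease = 0.99  # sensitivity
    p_pos_given_healthy = 0.05  # false positive rate

    # Total probability of a positive test result
    p_pos = (p_pos_given_disease * p_disease
             + p_pos_given_healthy * (1 - p_disease))

    # Bayes' theorem turns P(positive | disease) into P(disease | positive)
    print(p_pos_given_disease * p_disease / p_pos)  # about 0.17

Even with a 99% sensitive test, a positive result here only raises the probability of disease to about 17%, the kind of result that surprises people who get the probabilities backward.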

Jensen’s inequality

Another sleeper theorem is Jensen’s inequality: If φ is a convex function and X is a random variable, φ( E(X) ) ≤ E( φ(X) ). In words, φ at the expected value of X is less than the expected value of φ of X. Like Bayes’ theorem, it’s a way of turning things around. If the convex function φ represents your gain from some investment, Jensen’s inequality says that randomness is good for you; variability in X is to your advantage on average. But if φ is concave, variability works against you.

Sam Savage’s book The Flaw of Averages is all about the difference between φ( E(X) ) and E( φ(X) ). When φ is linear, they’re equal. But in general they’re different and there’s not much you can say about the relation of the two. However, when φ is convex or concave, you can say what the direction of the difference is.
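
A short simulation shows the direction of the difference. Here φ(x) = x² is convex and X is exponential with mean 1, so φ( E(X) ) = 1 while E( φ(X) ) = 2; both the function and the distribution are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.exponential(scale=1.0, size=1_000_000)

    phi = np.square  # a convex function

    print(phi(X.mean()))  # phi(E(X)), about 1
    print(phi(X).mean())  # E(phi(X)), about 2, larger as Jensen predicts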

I’ve just started reading Nassim Taleb’s new book Antifragile, and it seems to be an extended meditation on Jensen’s inequality. Systems with concave returns are fragile; they are harmed by variability. Systems with convex returns are antifragile; they benefit from variability.

Other examples

What are some more examples of sleeper theorems?

Product of normal PDFs

The product of two normal PDFs is proportional to a normal PDF. This is well known in Bayesian statistics because a normal likelihood times a normal prior gives a normal posterior. But because Bayesian applications don’t usually need to know the proportionality constant, it’s a little hard to find. I needed to calculate this constant, so I’m recording the result here for my future reference and for anyone else who might find it useful.

Denote the normal PDF by

\phi(x; m, s) = \frac{1}{\sqrt{2\pi} s} \exp\left(-\frac{(x-m)^2}{2s^2}\right)

Then the product of two normal PDFs is given by the equation

\phi(x; \mu_1, \sigma_1) \, \phi(x; \mu_2, \sigma_2) = \phi\left(\mu_1; \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}\right) \,\phi(x; \mu, \sigma)

where

 \mu = \frac{ \sigma_1^{-2} \mu_1 + \sigma_2^{-2} \mu_2}{\sigma_1^{-2} + \sigma_2^{-2} }

and

 \sigma^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}

Note that the product of two normal random variables is not normal, but the product of their PDFs is proportional to the PDF of another normal.
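
Here is a numerical check of the identity above, a sketch with arbitrary parameter values:

    import numpy as np
    from scipy.stats import norm

    mu1, s1 = 1.0, 2.0
    mu2, s2 = -0.5, 1.5
    x = 0.7  # arbitrary test point

    # Left side: product of the two PDFs
    lhs = norm.pdf(x, mu1, s1) * norm.pdf(x, mu2, s2)

    # Right side: constant times a normal PDF with the combined parameters
    mu = (mu1 / s1**2 + mu2 / s2**2) / (1 / s1**2 + 1 / s2**2)
    s = np.sqrt(s1**2 * s2**2 / (s1**2 + s2**2))
    const = norm.pdf(mu1, mu2, np.sqrt(s1**2 + s2**2))
    rhs = const * norm.pdf(x, mu, s)

    print(lhs, rhs)  # the two sides agree to machine precision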

Shifting probability distributions

One reason the normal distribution is easy to work with is that you can vary the mean and variance independently. With other distribution families, the mean and variance may be linked in some nonlinear way.

I was looking for a faster way to compute Prob(X > Y + δ) where X and Y are independent inverse gamma random variables. If δ were zero, the probability could be computed analytically. But when δ is positive, the calculation requires numerical integration. When the calculation is in the inner loop of a simulation, most of the simulation’s time is spent doing the integration.

Let Z = Y + δ. If Z were another inverse gamma random variable, we could compute Prob(X > Z) quickly and accurately without integration. Unfortunately, Z is not an inverse gamma. But it is approximately an inverse gamma, at least if Y has a moderately large shape parameter, which it always does in my applications. So let Z be inverse gamma with parameters to match the mean and variance of Y + δ. Then Prob(X > Z) is a good approximation to Prob(X > Y + δ).
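
Here is a sketch of the moment-matching step, assuming scipy’s parameterization, in which an inverse gamma with shape a and scale b has mean b/(a - 1) and variance b²/((a - 1)²(a - 2)). The parameter values are made up.

    import numpy as np
    from scipy.stats import invgamma

    def match_inverse_gamma(mean, var):
        # Invert mean = b/(a - 1) and var = mean^2/(a - 2) for shape and scale
        a = mean**2 / var + 2
        b = mean * (a - 1)
        return a, b

    a, b, delta = 10.0, 5.0, 1.0  # Y ~ inverse gamma(a, b), shifted by delta
    y_mean = b / (a - 1)
    y_var = b**2 / ((a - 1)**2 * (a - 2))

    # Approximate Z = Y + delta by an inverse gamma with the same mean and variance
    shape, scale = match_inverse_gamma(y_mean + delta, y_var)

    # Sanity check by simulation
    rng = np.random.default_rng(42)
    exact = invgamma.rvs(a, scale=b, size=100_000, random_state=rng) + delta
    approx = invgamma.rvs(shape, scale=scale, size=100_000, random_state=rng)
    print(exact.mean(), approx.mean())  # should agree closely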

For more details, see Fast approximation of inverse gamma inequalities.
