Convenient and innocuous priors

Andrew Gelman has some interesting comments on non-informative priors this morning. Rather than thinking of the prior as a static thing, think of it as a way to prime the pump.

… a non-informative prior is a placeholder: you can use the non-informative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information. …

At first this may sound like tweaking your analysis until you get the conclusion you want. It’s like the old joke about consultants: the client asks what 2+2 equals and the consultant counters by asking the client what he wants it to equal. But that’s not what Andrew is recommending.

A prior distribution cannot strictly be non-informative, but there are common intuitive notions of what it means to be non-informative. It may be helpful to substitute “convenient” or “innocuous” for “non-informative.” My take on Andrew’s advice is something like this.

Start with a prior distribution that’s easy to use and that nobody is going to give you grief for using. Maybe the prior doesn’t make much difference. But if your convenient/innocuous prior leads to too vague a conclusion, go back and use a more realistic prior, one that requires more effort or risks more criticism.

It’s odd that realistic priors can be more controversial than unrealistic priors, but that’s been my experience. It’s OK to be unrealistic as long as you’re conventional.

Levels of uncertainty

The other day I heard someone say something like the following:

I can’t believe how people don’t understand probability. They don’t realize that if a coin comes up heads 20 times, on the next flip there’s still a 50-50 chance of it coming up tails.

But if I saw a coin come up heads 20 times, I’d suspect it would come up heads the next time.

There are two levels of uncertainty here. If the probability of a coin coming up heads is θ = 1/2 and the tosses are independent, then yes, the probability of a head is 1/2 each time, regardless of how many heads have shown before. The parameter θ models our uncertainty regarding which side will show after a toss of the coin. That’s the first level of uncertainty.

But what about our uncertainty in the value of θ? Twenty flips showing the same side up should cause us to question whether θ really is 1/2. Maybe it’s a biased coin and θ is greater than 1/2. Or maybe it really is a fair coin and we’ve just seen a one-in-a-million event. (Such events do happen, but only one in a million times.) Our uncertainty regarding the value of θ is a second level of uncertainty.

Frequentist statistics approaches these two kinds of uncertainty differently. That approach says that θ is a constant but unknown quantity. Probability describes the uncertainty regarding the coin toss given some θ but not the uncertainty regarding θ. The Bayesian models all uncertainty using probability. So the outcome of the coin toss given θ is random, but θ itself is also random. It’s turtles all the way down.
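
Here’s a minimal sketch of how a Bayesian handles that second level, assuming a uniform Beta(1, 1) prior on θ: after twenty heads in a row the posterior is Beta(21, 1), which puts almost no mass near 1/2.

    from scipy.stats import beta

    # Uniform Beta(1, 1) prior on theta, then 20 heads and 0 tails observed.
    posterior = beta(1 + 20, 1 + 0)

    print(posterior.mean())     # 21/22 ≈ 0.95: expect heads next time
    print(posterior.cdf(0.5))   # P(theta <= 1/2 | data) = 0.5^21 ≈ 5e-7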

It’s possible to have different degrees of uncertainty at each level. You could, for example, calculate the probability of some quantum event very accurately. If that probability is near 1/2, there’s a lot of uncertainty regarding the event itself, but little uncertainty about the parameter. High uncertainty at the first level, low uncertainty at the next level. If you warp a coin, it may not be apparent what effect that will have on the probability of the outcome. Now there’s significant uncertainty at the first and second level.

We’ve implicitly assumed that a single parameter θ describes the uncertainty in a coin toss outcome. Maybe that’s not true. Maybe the person tossing the coin has the ability to influence the outcome. (Some very skilled people can. I’ve heard rumors that Persi Diaconis is good at this.) Now we have a third level of uncertainty, uncertainty regarding our model and not just its parameter.

If you’re sure that a parameter θ describes the coin toss, but you don’t know θ, then the coin toss outcome is a known unknown and θ is an unknown unknown, a second-order uncertainty. More often, though, people use the term “unknown unknown” to describe a third-order uncertainty: unforeseen factors that are not included in a model, not even as uncertain parameters.

Bayes : Python :: Frequentist : Perl

Bayesian statistics is to Python as frequentist statistics is to Perl.

Perl has the slogan “There’s more than one way to do it,” abbreviated TMTOWTDI and pronounced “tim toady.” Perl prides itself on variety.

Python takes the opposite approach. The Zen of Python says “There should be one — and preferably only one — obvious way to do it.” Python prides itself on consistency.

Frequentist statistics has a variety of approaches and criteria for various problems. Bayesian critics call this “adhockery.”

Bayesian statistics has one way to do everything: write down a likelihood function and prior distribution, then add data and compute a posterior distribution. This is sometimes called “turning the Bayesian crank.”
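
Here’s a minimal sketch of the crank, using a grid approximation and a made-up binomial example just to show the mechanics:

    import numpy as np
    from scipy.stats import binom

    # Grid of parameter values and a (uniform) prior over the grid.
    theta = np.linspace(0.001, 0.999, 999)
    prior = np.ones_like(theta) / len(theta)

    # Data: 7 successes in 10 trials. Likelihood evaluated at each grid point.
    likelihood = binom.pmf(7, 10, theta)

    # Turn the crank: posterior is proportional to likelihood times prior.
    posterior = likelihood * prior
    posterior /= posterior.sum()

    print(theta[np.argmax(posterior)])   # posterior mode, near 0.7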

A statistical problem with “nothing to hide”

One problem with the nothing-to-hide argument is that it assumes innocent people will be exonerated certainly and effortlessly. That is, it assumes that there are no errors, or if there are, they are resolved quickly and easily.

Suppose the probability of correctly analyzing an email or phone call is not 100% but 99.99%. In other words, there’s one chance in 10,000 of an innocent message being flagged as incriminating. Imagine authorities analyzing one message each from 300,000,000 people, roughly the population of the United States. Then around 30,000 innocent people will have some ‘splaining to do. They will have to interrupt their dinner to answer questions from an agent knocking on their door, or maybe they’ll spend a few weeks in custody. If the legal system is 99.99% reliable, then three of them will go to prison.

Now suppose false positives are really rare, one in a million. If you analyze 100 messages from each person rather than just one, you’re approximately back to the scenario above.
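
The arithmetic behind both scenarios, as a quick sanity check:

    population = 300_000_000

    # One message per person, error rate 1 in 10,000.
    print(population * 1e-4)            # 30,000 innocent people flagged

    # 100 messages per person, error rate 1 in 1,000,000.
    p_person = 1 - (1 - 1e-6) ** 100    # ≈ 1e-4 chance per person
    print(population * p_person)        # ≈ 30,000 again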

Scientists call indiscriminately looking through large amounts of data “a fishing expedition” or “data dredging.” One way to mitigate the problem of massive false positives from data dredging is to demand a hypothesis: before you look through the data, say what you’re hoping to prove and why you think it’s plausible.

The legal analog of a plausible hypothesis is a search warrant. In statistical terms, “probable cause” is a judge’s estimation that the prior probability of a hypothesis is moderately high. Requiring scientists to have a hypothesis and requiring law enforcement to have a search warrant both dramatically reduce the number of false positives.

Related:

You commit three felonies a day
You do too have something to hide


A priori overfitting

The term overfitting usually describes fitting too complex a model to available data. But it is possible to overfit a model before there are any data.

An experimental design, such as a clinical trial, proposes some model to describe the data that will be collected. For simple, well-known models the behavior of the design may be known analytically. For more complex or novel methods, the behavior is evaluated via simulation.

If an experimental design makes strong assumptions about data, and is then simulated with scenarios that follow those assumptions, the design should work well. So designs must be evaluated using scenarios that do not exactly follow the model assumptions. Here lies a dilemma: how far should scenarios deviate from model assumptions? If they do not deviate at all, you don’t have a fair evaluation. But deviating too far is unreasonable as well: no method can be expected to work well when its assumptions are flagrantly violated.

With complex designs, it may not be clear to what extent scenarios deviate from modeling assumptions. The method may be robust to some kinds of deviations but not to others. Simulation scenarios for complex designs are samples from a high dimensional space, and it is impossible to adequately explore a high dimensional space with a small number of points. Even if these scenarios were chosen at random — which would be an improvement over manually selecting scenarios that present a method in the best light — how do you specify a probability distribution on the scenarios? You’re back to a variation on the previous problem.

Once you have the data in hand, you can try a complex model and see how well it fits. But with experimental design, the model is determined before there are any data, and thus there is no possibility of rejecting the model for being a poor fit. You might decide after it’s too late, after the data have been collected, that the model was a poor fit. However, retrospective model criticism is complicated for adaptive experimental designs because the model influenced which data were collected.

This is especially a problem for one-of-a-kind experimental designs. When evaluating experimental designs — not the data in the experiment but the experimental design itself — each experiment is one data point. With only one data point, it’s hard to criticize a design. This means we must rely on simulation, where it is possible to obtain many data points. However, this brings us back to the arbitrary choice of simulation scenarios. In this case there are no empirical data to test the model assumptions.

Related posts:

Robustness of simple rules
Small data
Works well versus well understood

Offended by conditional probability

It’s a simple rule of probability that if A makes B more likely, B makes A more likely. That is, if the conditional probability of A given B is larger than the probability of A alone, then the conditional probability of B given A is larger than the probability of B alone. In symbols,

Prob( A | B ) > Prob( A ) ⇒ Prob( B | A ) > Prob( B ).

The proof is trivial: Apply the definition of conditional probability and observe that if Prob( A ∩ B ) / Prob( B ) > Prob( A ), then Prob( A ∩ B ) / Prob( A ) > Prob( B ).

Let A be the event that someone was born in Arkansas and let B be the event that this person has been president of the United States. There are five living current and former US presidents, and one of them, Bill Clinton, was born in Arkansas, a state with about 1% of the US population. Knowing that someone has been president increases your estimation of the probability that this person is from Arkansas. Similarly, knowing that someone is from Arkansas should increase your estimation of the chances that this person has been president.

The chances that an American selected at random has been president are very small, but as small as this probability is, it goes up if you know the person is from Arkansas. In fact, it goes up by the same factor as the probability in the other direction. Knowing that someone has been president increases the probability that they are from Arkansas by a factor of 20, so knowing that someone is from Arkansas increases the probability that they have been president by a factor of 20 as well. This is because

Prob( A | B ) / Prob( A ) = Prob( B | A ) / Prob( B ).
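
Here’s a quick check with the rough numbers above, treating 300 million as the US population, 1% as the fraction born in Arkansas, and one of five presidents as Arkansan:

    population = 300_000_000
    p_arkansas = 0.01                  # P(A): born in Arkansas
    p_president = 5 / population       # P(B): has been president
    p_arkansas_given_president = 1/5   # P(A | B): one of five presidents

    # P(B | A) = P(A | B) P(B) / P(A), from the definition of conditional probability.
    p_president_given_arkansas = p_arkansas_given_president * p_president / p_arkansas

    print(p_arkansas_given_president / p_arkansas)    # 20
    print(p_president_given_arkansas / p_president)   # 20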

This isn’t controversial when we’re talking about presidents and where they were born. But it becomes more controversial when we apply the same reasoning, for example, to deciding who should be screened at airports.

When I jokingly said that being an Emacs user makes you a better programmer, it appears a few Vim users got upset. Whether they were serious or not, it does seem that they thought “Hey, what does that say about me? I use Vim. Does that mean I’m a bad programmer?”

Assume for the sake of argument that Emacs users are better programmers, i.e.

Prob( good programmer | Emacs user )  >  Prob( good programmer ).

We’re not assuming that Emacs users are necessarily better programmers, only that a larger proportion of Emacs users are good programmers. And we’re not saying anything about causality, only probability.

Does this imply that being a Vim user lowers your chance of being a good programmer? i.e.

Prob( good programmer | Vim user )  <  Prob( good programmer )?

No, because being a Vim user is a specific alternative to being an Emacs user, and there are programmers who use neither Emacs nor Vim. What the above statement about Emacs would imply is that

Prob( good programmer | not an Emacs user )  <  Prob( good programmer ).

That is, if knowing that someone uses Emacs increases the chances that they are a good programmer, then knowing that they are not an Emacs user does indeed lower the chances that they are a good programmer, if we have no other information. In general

Prob( A | B ) > Prob( A ) ⇒ Prob( A | not B ) < Prob( A ).
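
One way to see this is the law of total probability:

Prob( A ) = Prob( A | B ) Prob( B ) + Prob( A | not B ) Prob( not B ).

The right side is a weighted average of Prob( A | B ) and Prob( A | not B ), so if the first is above Prob( A ), the second must be below it.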

To take a more plausible example, suppose that spending four years at MIT obtaining a computer science degree makes you a better programmer. Then knowing that someone has a CS degree from MIT increases the probability that this person is a good programmer. But if that’s true, it must also be true that absent any other information, knowing that someone does not have a CS degree from MIT decreases the probability that this person is a good programmer. If a larger proportion of good programmers come from MIT, then a smaller proportion must not come from MIT.

***

This post uses the ideas of information and conditional probability interchangeably. If you’d like to read more on that perspective, I recommend Probability Theory: The Logic of Science by E. T. Jaynes.

Closet Bayesian

When I was a grad student, a statistics postdoc confided to me that he was a “closet Bayesian.” This sounded absolutely bizarre. Why would someone be secretive about his preferred approach to statistics? I could not imagine someone whispering that although she’s doing her thesis in algebra, she’s secretly interested in differential equations.

I knew nothing about statistics at the time and was surprised to find that there was a bitter rivalry between two schools of statistics. The rivalry is still there, though it’s not as bitter as it once was.

I find it grating when someone asks “Are you a Bayesian?” It implies an inappropriate degree of commitment and exclusivity. Bayesian statistics is just a tool. Statistics itself is just a tool, one way of understanding the world.

My car has a manual transmission. I prefer manual transmissions. But if someone asked whether I was a manual transmissionist, I’d look at them like they’re crazy. I don’t have any moral objections to automatic transmissions.

I evaluate a car by how well it works. And for most purposes, I prefer the way a manual transmission works. But when I’m teaching one of my kids to drive, we go out in my wife’s car with an automatic transmission. Similarly, I evaluate a mathematical model (statistical or otherwise) by how well it works for a given purpose. Sometimes a Bayesian and a frequentist approach lead to the same conclusions, but the latter is easier to understand or implement. Sometimes a Bayesian method leads to a better result because it can use more information or is easier to interpret. Sometimes it’s a toss-up and I use a Bayesian approach because it’s more familiar, just like my old car.

Related post: Bayes isn’t magic

Sleeper theorems

I’m using the term “sleeper” here for a theorem that is far more important than it seems, something that you may not appreciate for years after you first see it.

The first such theorem that comes to mind is Bayes theorem. I remember being unsettled by this theorem when I took my first probability course. I found it easy to prove but hard to understand. I couldn’t decide whether it was trivial or profound. Then years later I found myself using Bayes theorem routinely.

The key insight of Bayes theorem is that it gives you a way to turn probabilities around. That is, it lets you compute the probability of A given B from the probability of B given A. That may not seem so important, but it’s vital in application. It’s often easy to compute the probability of data given an hypothesis, but we need to know the probability of an hypothesis given data. Those unfamiliar with Bayes theorem often get probabilities backward.

Another sleeper theorem is Jensen’s inequality: If φ is a convex function and X is a random variable, φ( E(X) ) ≤ E( φ(X) ). In words, φ at the expected value of X is less than the expected value of φ of X. Like Bayes’ theorem, it’s a way of turning things around. If the function φ represents your gain from some investment, Jensen’s inequality says that randomness is good for you; variability in X is to your advantage on average. But if φ is concave, variability works against you.
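
Here’s a quick Monte Carlo illustration with the convex function φ(x) = exp(x) and X standard normal, where the exact values are exp(0) = 1 and exp(1/2) ≈ 1.65:

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.standard_normal(1_000_000)

    print(np.exp(x.mean()))    # phi(E X): about exp(0) = 1
    print(np.exp(x).mean())    # E phi(X): about exp(1/2) ≈ 1.65, larger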

Sam Savage’s book The Flaw of Averages is all about the difference between φ( E(X) ) and E( φ(X) ). When φ is linear, they’re equal. But in general they’re different and there’s not much you can say about the relation of the two. However, when φ is convex or concave, you can say what the direction of the difference is.

I’ve just started reading Nassim Taleb’s new book Antifragile, and it seems to be an extended meditation on Jensen’s inequality. Systems with concave returns are fragile; they are harmed by variability. Systems with convex returns are antifragile; they benefit from variability.

What are some more examples of sleeper theorems?

Related posts:

Mortgages, banks, and Jensen’s inequality
Interpreting statistics

Product of normal PDFs

The product of two normal PDFs is proportional to a normal PDF. This is well known in Bayesian statistics because a normal likelihood times a normal prior gives a normal posterior. But because Bayesian applications don’t usually need to know the proportionality constant, it’s a little hard to find. I needed to calculate this constant, so I’m recording the result here for my future reference and for anyone else who might find it useful.
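
Here’s a numerical check of the standard result, assuming the constant is the normal density with mean μ2 and variance σ1² + σ2², evaluated at μ1:

    import numpy as np
    from scipy.stats import norm

    mu1, s1 = 2.0, 3.0
    mu2, s2 = 5.0, 1.5

    # Product of the two PDFs equals C times a normal PDF with these parameters.
    v = (s1**2 * s2**2) / (s1**2 + s2**2)
    mu = (mu1 * s2**2 + mu2 * s1**2) / (s1**2 + s2**2)
    C = norm.pdf(mu1, loc=mu2, scale=np.sqrt(s1**2 + s2**2))

    x = 1.7   # any test point
    print(norm.pdf(x, mu1, s1) * norm.pdf(x, mu2, s2))
    print(C * norm.pdf(x, mu, np.sqrt(v)))   # agrees to floating point precision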

Shifting probability distributions

One reason the normal distribution is easy to work with is that you can vary the mean and variance independently. With other distribution families, the mean and variance may be linked in some nonlinear way.

I was looking for a faster way to compute Prob(X > Y + δ) where X and Y are independent inverse gamma random variables. If δ were zero, the probability could be computed analytically. But when δ is positive, the calculation requires numerical integration. When the calculation is in the inner loop of a simulation, most of the simulation’s time is spent doing the integration.

Let Z = Y + δ. If Z were another inverse gamma random variable, we could compute Prob(X > Z) quickly and accurately without integration. Unfortunately, Z is not an inverse gamma. But it is approximately an inverse gamma, at least if Y has a moderately large shape parameter, which it always does in my applications. So let Z be inverse gamma with parameters to match the mean and variance of Y + δ. Then Prob(X > Z) is a good approximation to Prob(X > Y + δ).
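
Here’s a sketch of the moment-matching step in SciPy’s shape/scale parameterization (mean b/(a − 1), variance b²/((a − 1)²(a − 2))), with hypothetical parameter values; the last line of the function uses the identity that reduces a comparison of two independent inverse gammas to a beta probability.

    import numpy as np
    from scipy.stats import invgamma, beta

    def prob_x_gt_y_plus_delta(ax, bx, ay, by, delta):
        """Approximate P(X > Y + delta) for independent inverse gammas
        X ~ IG(ax, scale=bx) and Y ~ IG(ay, scale=by)."""
        # Mean and variance of Y + delta (the shift changes only the mean).
        m = by / (ay - 1) + delta
        v = by**2 / ((ay - 1)**2 * (ay - 2))
        # Inverse gamma Z with the same mean and variance.
        az = 2 + m**2 / v
        bz = m * (az - 1)
        # P(X > Z) for independent inverse gammas via a beta CDF.
        return beta.cdf(bx / (bx + bz), ax, az)

    # Sanity check against simulation with made-up parameters.
    ax, bx, ay, by, delta = 20, 19, 25, 24, 0.05
    x = invgamma(ax, scale=bx).rvs(200_000, random_state=1)
    y = invgamma(ay, scale=by).rvs(200_000, random_state=2)
    print(prob_x_gt_y_plus_delta(ax, bx, ay, by, delta), np.mean(x > y + delta))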

For more details, see Fast approximation of inverse gamma inequalities.

Related posts:

Fast approximation of beta inequalities
Gamma inequalities

Fast approximation of beta inequalities

A beta distribution has an approximate normal shape if its parameters are large, and so you could use normal approximations to compute beta inequalities. The corresponding normal inequalities can be computed in closed form.

This works surprisingly well. Even when the beta parameters are small and the normal approximation is a bad fit, the corresponding inequality approximation is pretty good.
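
Here’s a sketch of the idea, with made-up parameters: match each beta’s mean and variance with a normal, then the difference of the two approximating normals is normal, and P(X > Y) is a single normal CDF evaluation.

    import numpy as np
    from scipy.stats import beta, norm

    def beta_gt_approx(a, b, c, d):
        """Approximate P(X > Y) for independent X ~ Beta(a, b), Y ~ Beta(c, d)."""
        mx, vx = beta.stats(a, b, moments="mv")
        my, vy = beta.stats(c, d, moments="mv")
        return norm.cdf((mx - my) / np.sqrt(vx + vy))

    # Even with small parameters, the approximation is reasonable.
    a, b, c, d = 3, 5, 2, 7
    x = beta(a, b).rvs(500_000, random_state=1)
    y = beta(c, d).rvs(500_000, random_state=2)
    print(beta_gt_approx(a, b, c, d), np.mean(x > y))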

For more details, see the tech report Fast approximation of beta inequalities.

Related post:

Beta inequalities in R

How do you justify that distribution?

Someone asked me yesterday how people justify probability distribution assumptions. Sometimes the most mystifying assumption is the first one: “Assume X is normally distributed …” Here are a few answers.

  1. Sometimes distribution assumptions are not justified.
  2. Sometimes distributions can be derived from fundamental principles. For example, there are axioms that uniquely specify a Poisson distribution.
  3. Sometimes distributions are justified on theoretical grounds. For example, large samples and the central limit theorem together may justify assuming that something is normally distributed.
  4. Often the choice of distribution is somewhat arbitrary, chosen by intuition or for convenience, and then empirically shown to work well enough.
  5. Sometimes a distribution can be a bad fit and still work well, depending on what you’re asking of it.

The last point is particularly interesting. It’s not hard to imagine that a poor fit would produce poor results. It’s surprising when a poor fit produces good results. Here’s an example of the latter.

Suppose you are testing a new drug and hoping that it improves how long patients live. You want to stop the clinical trial early if it looks like patients are living no longer than they would have on standard treatment. There is a Bayesian method for monitoring such experiments that assumes survival times have an exponential distribution. But survival times are not exponentially distributed, not even close.

The method works well because of the question being asked. The method is not being asked to accurately model the distribution of survival times for patients in the trial. It is only being asked to determine whether a trial should continue or stop, and it does a good job of doing so. As the simulations in this paper show, the method makes the right decision with high probability, even when the actual survival times are not exponentially distributed.

Related posts:

What distribution does my data have?
Four views of the negative binomial

Vague priors are informative

Data analysis has to start from some set of assumptions. Bayesian prior distributions drive some people crazy because they make assumptions explicit that people prefer to leave implicit. But there’s no escaping the need to make some sort of prior assumptions, whether you’re doing Bayesian statistics or not.

One attempt to avoid specifying a prior distribution is to start with a “non-informative” prior. David Hogg gives a good explanation of why this doesn’t accomplish what some think it does.

In practice, investigators often want to “assume nothing” and put a very or infinitely broad prior on the parameters; of course putting a broad prior is not equivalent to assuming nothing, it is just as severe an assumption as any other prior. For example, even if you go with a very broad prior on the parameter a, that is a different assumption than the same form of very broad prior on a² or on arctan(a). The prior doesn’t just set the ranges of parameters, it places a measure on parameter space. That’s why it is so important.
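
A quick simulation of Hogg’s point, using a flat prior on a over [0, 10]: the induced prior on a² is far from flat.

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.uniform(0, 10, 1_000_000)   # "non-informative" flat prior on a
    a_sq = a**2                         # induced prior on a^2

    # If the prior on a^2 were flat on [0, 100], half its mass would be below 50.
    print(np.mean(a_sq < 50))   # ≈ 0.71
    print(np.mean(a_sq < 25))   # = P(a < 5) = 0.5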

Related posts:

Bad logic, but good statistics
Bayesian view of Amazon resellers
Bayes isn’t magic

Avoiding underflow in Bayesian computations

Here’s a common problem that arises in Bayesian computation. Everything works just fine until you have more data than you’ve seen before. Then suddenly you start getting infinite, NaN, or otherwise strange results. This post explains what might be wrong and how to fix it.

A posterior density is (proportional to) a likelihood function times a prior distribution. The likelihood function is a product, with one term per data point. If these terms are less than 1, and you multiply enough of them together, the result will be too small to represent in a floating point number and your calculation will underflow to zero. Then subsequent operations with this number, such as dividing it by another number that has also underflowed to zero, may produce an infinite or NaN result.

The instinctive reaction regarding underflow or overflow is to use logs. And often that works. If you wanted to know where the maximum of the posterior density occurs, you could find the maximum of the logarithm of the posterior density. But in Bayesian computations you often need to integrate the posterior density times some functions. Now you cannot just work with logs because logs and integrals don’t commute.

One way around the problem is to multiply the integrand by a constant so large that there is no danger of underflow. Multiplying by constants does commute with integration.

So suppose your integrand is on the order of 10^-400, far below the smallest representable number. Do you need extended precision arithmetic? No, you just need to understand your problem.

If you multiply your integrand by 10^400 before integrating, then your integrand is roughly 1 in magnitude. Then do your integration, and remember that the result is 10^400 times the actual value.

You could take the following steps.

  1. Find m, the maximum of the log of the integrand.
  2. Let I be the integral of exp( log of the integrand − m ).
  3. Keep track that your actual integral is exp(m) I, or that its log is m + log I.
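
Here’s what those steps look like in code, with a made-up log integrand whose values are far too negative to exponentiate directly:

    import numpy as np
    from scipy.integrate import quad

    # Hypothetical log integrand; exp of it underflows to zero everywhere.
    def log_integrand(theta):
        return -2000.0 - 0.5 * (theta - 3.0)**2

    # Step 1: rough estimate of the maximum of the log integrand.
    grid = np.linspace(-10, 10, 1001)
    m = max(log_integrand(t) for t in grid)

    # Step 2: integrate exp(log integrand - m); values near the max are about 1.
    I, _ = quad(lambda t: np.exp(log_integrand(t) - m), -np.inf, np.inf)

    # Step 3: the actual integral is exp(m) * I; keep its log to avoid underflow.
    print(m + np.log(I))   # about -2000 + log(sqrt(2 pi)) ≈ -1999.08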

Note that you can be extremely sloppy in this process. You don’t need an accurate estimate of the maximum of the integrand per se. If you’re within a few dozen orders of magnitude, for example, that could be sufficient to carry out your integration without underflow.

One way to estimate the maximum is to use a frequentist estimator of your parameters as an approximate MLE, and assume that this is approximately where your posterior density takes on its maximum value. This might actually be very accurate, but it doesn’t need to be.

Note also that it’s OK if some of the evaluations of your integrand underflow to zero. You just don’t want the entire integral to underflow to zero. Places where the integrand is many orders of magnitude less than its maximum don’t contribute to the integration anyway. (It’s important that your integration pays attention to the region where the integrand is largest. A naive integration could entirely miss the most important region and completely underestimate the integral. But that’s a matter for another blog post.)

Monkeying with Bayes' theorem

In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.

So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes’ theorem] to the 1.5 power, you get a better result.

In other words, if we change Bayes’ theorem (!) the algorithm works better. He goes on to explain

Now should we dig up Bayes and notify him that he was wrong? No, I don’t think that’s it. …

I imagine most statisticians would respond that this cannot possibly be right. While it appears to work, there must be some underlying reason why and we should find that reason before using an algorithm based on an ad hoc tweak.

While such a reaction is understandable, it’s also a little hypocritical. Statisticians are constantly drawing inference from empirical data without understanding the underlying mechanisms that generate the data. When analyzing someone else’s data, a statistician will say that of course we’d rather understand the underlying mechanism than fit statistical models, that’s just not always possible. Reality is too complicated and we’ve got to do the best we can.

I agree, but that same reasoning applied at a higher level of abstraction could be used to accept Norvig’s translation algorithm. Here’s this model (derived from spurious math, but we’ll ignore that). Let’s see empirically how well it works.
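
For concreteness, here’s a toy sketch of the tweak, with made-up probability tables standing in for the real language and channel models, and with the exponent applied to the language model factor (one reading of “that first factor”):

    # Hypothetical language model P(e) and channel model P(f | e) for one input f.
    p_e = {"the": 0.1, "thee": 0.01}            # P(e)
    p_f_given_e = {"the": 0.04, "thee": 0.8}    # P(f | e)

    def best_candidate(power):
        # power = 1.0 is plain Bayes; power = 1.5 is the tweak Norvig describes.
        return max(p_e, key=lambda e: p_e[e] ** power * p_f_given_e[e])

    print(best_candidate(1.0))   # "thee": 0.01 * 0.8 beats 0.1 * 0.04
    print(best_candidate(1.5))   # "the": the exponent shifts weight toward P(e)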