Anthony O’Hagan’s book Bayesian Inference lists four basic principles of Bayesian statistics at the end of the first chapter:
- Prior information. Bayesian statistics provides a systematic way to incorporate what is known about parameters before an experiment is conducted. As a colleague of mine says, if you’re going to measure the distance to the moon, you know not to pick up a yard stick. You always know something before you do an experiment.
- Subjective probability. Some Bayesians don’t agree with the subjective probability interpretation, but most do, in practice if not in theory. If you write down reasonable axioms for quantifying degrees of belief, you inevitably end up with Bayesian statistics.
- Self-consistency. Even critics of Bayesian statistics acknowledge that Bayesian statistics has a rigorous self-consistent foundation. As O’Hagan says in his book, the difficulties with Bayesian statistics are practical, not foundational, and the practical difficulties are being resolved.
- No adhockery. Bruno de Finetti coined the term “adhockery” to describe the profusion of frequentist methods. More on this below.
This year I’ve had the chance to teach a mathematical statistics class primarily focusing on frequentist methods. Teaching frequentist statistics has increased my appreciation for Bayesian statistics. In particular, I better understand the criticism of frequentist adhockery.
For example, consider point estimation. Frequentist statistics to some extent has standardized on minimum variance unbiased estimators as the gold standard. But why? And what do you do when such estimators don’t exist?
Why focus on unbiased estimators? Granted, lack of bias sounds like a good thing to have. All things being equal, it would be better to be unbiased than biased. But all things are not equal. Sometimes unbiased estimators are ridiculous. Why only consider biased vs. unbiased, a binary choice, rather than degree of bias, a continuous choice? Efficiency is also important, and someone may reasonably accept a small amount of bias in exchange for a large increase in efficiency.
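As a rough illustration of that trade-off (not an example from the post itself), consider estimating the variance of normal data by dividing the sum of squared deviations by n - 1, n, or n + 1. Only the first is unbiased, yet under the normality assumption the biased versions have smaller mean squared error. The function names and the choice n = 10 below are made up for this sketch.

```python
import random


def exact_bias_and_mse(n, divisor, sigma2=1.0):
    """Exact bias and MSE of S/divisor as an estimator of sigma2.

    S is the sum of squared deviations from the sample mean of n
    i.i.d. normal observations, so S/sigma2 is chi-squared with
    n - 1 degrees of freedom (mean n - 1, variance 2(n - 1)).
    """
    mean = (n - 1) * sigma2 / divisor
    variance = 2 * (n - 1) * sigma2**2 / divisor**2
    bias = mean - sigma2
    return bias, variance + bias**2


def simulated_mse(n, divisor, sigma2=1.0, trials=20000, seed=0):
    """Monte Carlo check of the MSE formula (assumes normal data)."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sd) for _ in range(n)]
        xbar = sum(xs) / n
        s = sum((x - xbar) ** 2 for x in xs)
        total += (s / divisor - sigma2) ** 2
    return total / trials


if __name__ == "__main__":
    n = 10
    for divisor in (n - 1, n, n + 1):
        bias, mse = exact_bias_and_mse(n, divisor)
        print(f"divisor={divisor}: bias={bias:.4f}, MSE={mse:.4f}")
```

For n = 10 and unit variance, the formula gives an MSE of 2/9 for the unbiased estimator and 2/11 for the n + 1 version: a modest bias buys a noticeable reduction in error. Whether that exchange is worthwhile is exactly the kind of judgment call the paragraph above describes.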
Why minimize expected mean squared error? Efficiency in classical statistics is typically measured by expected mean squared error. But why not minimize some other measure of error? Why use an exponent of 2 and not 1, or 4, or 2.738? Or why limit yourself to power functions at all? The theory is simplest for squared error, and while this is a reasonable choice in many applications, it is still an arbitrary choice.
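To make the point about exponents concrete, here is a small sketch (the data set and function name are invented for illustration) that finds the value a minimizing the sum of |x - a|^p over a sample. With p = 1 the minimizer is the median, with p = 2 it is the mean, and larger exponents pull the answer further toward the outlier.

```python
def minimize_power_loss(data, p, iterations=200):
    """Return the a that minimizes sum(|x - a|**p) over the data.

    Uses ternary search, which is valid here because the loss is
    convex in a for any p >= 1. Not intended for p < 1.
    """
    def loss(a):
        return sum(abs(x - a) ** p for x in data)

    lo, hi = min(data), max(data)
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loss(m1) < loss(m2):
            hi = m2  # the minimum cannot lie to the right of m2
        else:
            lo = m1  # the minimum cannot lie to the left of m1
    return (lo + hi) / 2


if __name__ == "__main__":
    data = [1, 2, 3, 4, 100]  # hypothetical sample with one outlier
    for p in (1, 2, 4):
        print(f"p={p}: best estimate is about {minimize_power_loss(data, p):.3f}")
```

Each exponent yields a different "best" estimate from the same data, so choosing squared error is a substantive decision and not just a computational convenience.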
How much emphasis should be given to robustness? Once you consider robustness, there are infinitely many ways to compromise between efficiency and robustness.
Many frequentists are asking the same questions and are investigating alternatives. But I believe these alternatives are exactly what de Finetti had in mind: there are an infinite number of ad hoc choices you can make. Bayesian methods are criticized because prior distributions are explicitly subjective. But there are myriad subjective choices that go into frequentist statistics as well, though these choices are often implicit.
There is a great deal of latitude in Bayesian statistics as well, but the latitude is confined to fit within a universal framework: specify a likelihood and prior distribution, then update the model with data to compute the posterior distribution. There are many ways to construct a likelihood (exactly as in frequentist statistics), many ways to specify a prior, and many ways to summarize the information contained in the posterior distribution. But the basic framework is fixed. (In fact, the framework is inevitable given certain common-sense rules of inference.)
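A minimal sketch of that fixed framework, using the textbook conjugate case of a binomial likelihood with a beta prior (my choice of example, not one taken from the post): a Beta(a, b) prior combined with s successes and f failures gives a Beta(a + s, b + f) posterior. The class name and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaDistribution:
    """Beta(a, b) distribution over an unknown success probability."""
    a: float
    b: float

    def update(self, successes, failures):
        """Posterior after observing binomial data (conjugate update)."""
        return BetaDistribution(self.a + successes, self.b + failures)

    @property
    def mean(self):
        return self.a / (self.a + self.b)

    @property
    def mode(self):
        # Formula holds only when a > 1 and b > 1.
        return (self.a - 1) / (self.a + self.b - 2)


if __name__ == "__main__":
    prior = BetaDistribution(2, 2)     # hypothetical, mildly informative prior
    posterior = prior.update(7, 3)     # 7 successes, 3 failures
    print(posterior, posterior.mean, posterior.mode)

    # Updating in two batches gives the same posterior as updating once.
    assert prior.update(4, 1).update(3, 2) == posterior
```

The prior and the posterior summary (mean, mode, or something else) are still choices, but the update step in the middle is not: it is the same whether the data arrive all at once or in pieces.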
Hmmm… Everywhere I’ve worked, I have soon developed a reputation for being the guy to go ask when a question related to statistics comes up. I was sort of aware that this had more to do with the general lack of statistical education, even among populations of engineers, physicists and mathematicians, than with the depth and breadth of my knowledge. But even with those precautions, I did have a sense of pride about it…
And here comes John, letting me know that I’m no more than an old fashioned frequentist… Any suggestions on where to start reading to redo myself and become a cool and chic Bayesian?
Here’s a good book on the philosophy of Bayesian statistics: Scientific Reasoning, The Bayesian Approach. For a more operational book, one of the most popular is Bayesian Data Analysis.
The only elementary (i.e. pre-calculus) book on Bayesian statistics I know of is Statistics: A Bayesian Approach. Since it doesn’t use calculus, it has to give lots of tangible examples, drawing chips from bowls etc. It might help build intuition more than an advanced book.
Jaime 04.09.09 at 01:53:
I laughed out loud at this, because of a comment I heard from a speaker at a conference long ago … he was a Bayesian and it was a Bayesian conference. During his talk he expressed the opinion that most Bayesian conferences primarily involve sitting around drinking beer and congratulating each other on having had the good sense to become Bayesians.
One of my favorite professors, a dyed-in-the-wool Aristotelian Frequentist, likened the spread of Bayesian statistics to the pod people in “Invasion of the Body Snatchers” … first one person in the department becomes one, then a couple more, and finally the whole department. He was one of the funniest professors I’ve had of any field.
And one for the hardcore geeks:
At another conference, this one neither particularly Bayesian nor Frequentist, two statisticians gave a presentation on estimating the failure time distribution for the nuclear waste containment systems which were proposed to be installed in Yucca Mountain. Given that they were designed to last at least 10,000 years, this presented a problem for pure Frequentist methods. Naturally, they adopted Bayesian methods, which happened to be relatively new to them. Nothing unusual there. But one of them concluded with some remarks about how much he liked Bayesian methods. He mentioned that in one episode of Star Trek: The Next Generation, the character Data says something about a “Bayesian millennium” [I can’t remember the reference exactly], which the speaker thought was wonderful, and ended with a loud toast to the coming Bayesian millennium!
John, you’ve outdone yourself with these last three columns. Your writing is always worth reading, but I’m knocked out by your technical depth, clarity, and thoroughness. Thanks for teaching. Sincerely,
Dr. Cook:
I arrived at your site via a Python-related search, and was pleased to find your Bayesian blog-site (onto which I just landed). My initiation was Edwin Jaynes’s book, which I haven’t yet finished. Jaynes is quite right about physics, and convinced at least me about the Bayesian approach to probability. Have you seen Jaynes’s book? (Sorry, I haven’t read your blogs yet.)
— George Hacken
George, Thank you for your note. Yes, I’ve read Jaynes’s book. I have some reservations about it, but I like his basic approach, especially his comment that often probabilities don’t describe reality but our knowledge of reality.
If someone wants to insist that a parameter is constant but unknown, fine. It’s a constant. But we’re uncertain about its value. If we quantify our uncertainty about that constant according to certain reasonable axioms, our uncertainties obey the laws of probability. Let’s go ahead and call them probabilities. And now you have Bayesian statistics.
You can search on “Jaynes” to find the posts where I’ve mentioned his book.
I find the Bayesian/frequentist dichotomy rather odd, but maybe I have it all wrong. To my mind, a core of the distinction has to do with contingent probability: if I believe I spotted a lynx in my back yard, it is very interesting to know that there are only a handful of lynxes in my state. With that prior information in mind, suddenly my belief in my personal lynx sighting drops dramatically. This seems to me the bridge between the two ways of approaching a problem. Second, I believe that frequentism often comes with a strange confusion about what a p value really is. It is almost as if frequentists believe that unlikely events must have a cause beyond pure chance, a thought error not found as much in the Bayesian camp…
I thought this was a good post, but I had a serious issue with one comment in particular:
“Why only consider biased vs. unbiased, a binary choice, rather than degree of bias, a continuous choice? Efficiency is also important, and someone may reasonably accept a small amount of bias in exchange for a large increase in efficiency.”
I ask in return: “If everything is inferred, why should the soundness of Bayesian inference be based on a binary notion of truth and falsity? Should there not be degrees of correctness for a theory of inference? Therefore, shouldn’t Bayesian inference be capable of supplying a proof for its own correctness? Yet Bayesian inference lacks the ability to ‘prove,’ should one’s axioms be taken as anything but absolutely true. It also lacks the ability for universal generalization. Seems like Bayesian inference would have a tough time inferring that one should use Bayesian inference. Yet, is there a system that can do any better? Prove (or infer) that there isn’t and then I’m sold.”