Anthony O’Hagan’s book Bayesian Inference lists four basic principles of Bayesian statistics at the end of the first chapter:
- Prior information. Bayesian statistics provides a systematic way to incorporate what is known about parameters before an experiment is conducted. As a colleague of mine says, if you’re going to measure the distance to the moon, you know not to pick up a yard stick. You always know something before you do an experiment.
- Subjective probability. Some Bayesians don’t agree with the subjective probability interpretation, but most do, in practice if not in theory. If you write down reasonable axioms for quantifying degrees of belief, you inevitably end up with Bayesian statistics.
- Self-consistency. Even critics of Bayesian statistics acknowledge that Bayesian statistics has a rigorous self-consistent foundation. As O’Hagan says in his book, the difficulties with Bayesian statistics are practical, not foundational, and the practical difficulties are being resolved.
- No adhockery. Bruno de Finetti coined the term “adhockery” to describe the profusion of frequentist methods. More on this below.
This year I’ve had the chance to teach a mathematical statistics class primarily focusing on frequentist methods. Teaching frequentist statistics has increased my appreciation for Bayesian statistics. In particular, I better understand the criticism of frequentist adhockery.
For example, consider point estimation. Frequentist statistics to some extent has standardized on minimum variance unbiased estimators as the gold standard. But why? And what do you do when such estimators don’t exist?
Why focus on unbiased estimators? Granted, lack of bias sounds like a good thing to have. All things being equal, it would be better to be unbiased than biased. But all things are not equal. Sometimes unbiased estimators are ridiculous. Why only consider biased vs. unbiased rather, a binary choice, rather than degree of bias, a continuous choice? Efficiency is also important, and someone may reasonably accept a small amount of bias in exchange for a large increase in efficiency.
Why minimize expected mean squared error? Efficiency in classical statistics is typically measured by expected mean squared error. But why not minimize some other measure of error? Why use an exponent of 2 and not 1, or 4, or 2.738? Or why limit yourself to power functions at all? The theory is simplest for squared error, and while this is a reasonable choice in many applications, it is still an arbitrary choice.
How much emphasis should be given to robustness? Once you consider robustness, there are infinitely many ways to compromise between efficiency and robustness.
Many frequentists are asking the same questions and are investigating alternatives. But I believe these alternatives are exactly what de Finetti had in mind: there are an infinite number of ad hoc choices you can make. Bayesian methods are criticized because prior distributions are explicitly subjective. But there are myriad subjective choices that go into frequentist statistics as well, though these choices are often implicit.
There is a great deal of latitude in Bayesian statistics as well, but the latitude is confined to fit within a universal framework: specify a likelihood and prior distribution, then update the model with data to compute the posterior distribution. There are many ways to construct a likelihood (exactly as in frequentist statistics), many ways to specify a prior, and many ways to summarize the information contained in the posterior distribution. But the basic framework is fixed. (In fact, the framework is inevitable given certain common-sense rules of inference.)