Occam’s razor and Bayes’ theorem

Occam’s razor says that if two models fit equally well, the simpler model is likely to be a better description of reality. Why should that be?

A paper by Jim Berger suggests a Bayesian justification of Occam’s razor: simpler hypotheses have higher posterior probabilities when they fit well.

A simple model makes sharper predictions than a more complex model. For example, consider fitting a linear model and a cubic model. The cubic model is more general and can fit a wider range of data sets. The linear model is more restrictive and hence easier to falsify. But when the linear and cubic models both fit, Bayes’ theorem “rewards” the linear model for making a bolder prediction. See Berger’s paper for details and examples.
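
The linear-versus-cubic comparison can be checked numerically. The following is a minimal sketch and is not taken from Berger’s paper. It assumes polynomial models with a zero-mean Gaussian prior on the coefficients and Gaussian noise of known standard deviation, so that the marginal likelihood (the "evidence") has a closed form. The function name `log_evidence`, the values of `noise_sd` and `prior_sd`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def log_evidence(x, y, degree, noise_sd=0.1, prior_sd=10.0):
    """Log marginal likelihood of a polynomial model of the given degree.

    Assumed model (illustrative): y = Phi @ w + noise, with
    w ~ N(0, prior_sd^2 I) and noise ~ N(0, noise_sd^2 I).
    Integrating out w gives y ~ N(0, noise_sd^2 I + prior_sd^2 Phi Phi^T).
    """
    Phi = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    cov = noise_sd**2 * np.eye(len(x)) + prior_sd**2 * (Phi @ Phi.T)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (quad + logdet + len(x) * np.log(2 * np.pi))


rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
noise = rng.normal(0.0, 0.1, size=x.size)

y_line = 1.0 + 2.0 * x + noise                  # truly linear data
y_cubic = 1.0 + 2.0 * x - 3.0 * x**3 + noise    # data with a real cubic term

# Log Bayes factor, linear vs. cubic (positive favors the linear model).
log_bf_line = log_evidence(x, y_line, 1) - log_evidence(x, y_line, 3)
log_bf_cubic = log_evidence(x, y_cubic, 1) - log_evidence(x, y_cubic, 3)

print("linear data, log Bayes factor (linear vs cubic):", log_bf_line)
print("cubic data,  log Bayes factor (linear vs cubic):", log_bf_cubic)
```

On the linear data both models fit, yet the log Bayes factor should come out positive: the cubic model spreads its prior predictive probability over many curves that did not occur. On the data with a genuine cubic term the sign should flip, since the linear model has been falsified.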

From the conclusion of the paper:

Ockham’s razor, far from being merely an ad hoc principle, can under many practical situations in science be justified as a consequence of Bayesian inference. Bayesian analysis can shed new light on what the notion of “simplest” hypothesis consistent with the data actually means.

14 thoughts on “Occam’s razor and Bayes’ theorem”

william

12 January 2011 at 04:19

On a similar note, you can try the Bayesian information criterion (BIC) the next time you need to select a model to fit data.
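
As a rough sketch of what that could look like for a linear-versus-cubic comparison: assuming least-squares polynomial fits with Gaussian noise of unknown variance, BIC reduces, up to a constant shared by all models, to n ln(RSS/n) + k ln(n), where k is the number of fitted coefficients. The helper `bic` and the synthetic data below are illustrative assumptions. Lower BIC is better.

```python
import numpy as np


def bic(x, y, degree):
    """BIC of a least-squares polynomial fit, up to a model-independent constant."""
    n = len(x)
    k = degree + 1  # number of polynomial coefficients
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    return n * np.log(rss / n) + k * np.log(n)


rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
noise = rng.normal(0.0, 0.1, size=x.size)

y_line = 1.0 + 2.0 * x + noise
y_cubic = 1.0 + 2.0 * x - 3.0 * x**3 + noise

for name, y in [("linear data", y_line), ("cubic data", y_cubic)]:
    print(name, "BIC(linear) =", bic(x, y, 1), "BIC(cubic) =", bic(x, y, 3))
```

On the linear data the two extra coefficients of the cubic fit cost 2 ln(n) and typically buy very little reduction in RSS, so the linear model usually has the lower BIC. On the cubic data the linear fit is poor and the cubic model wins.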

Sunny

12 January 2011 at 08:36

BIC is good only when the true model is assumed to be in the model space. However, this may not be true all the time, and BIC may then lead to bad results; cf. here.

Alejandro Weinstein

12 January 2011 at 08:38

Chapter 28 of “Information Theory, Inference, and Learning Algorithms” by David MacKay has an analysis along the same lines. You can get an electronic copy here:

http://www.inference.phy.cam.ac.uk/mackay/itila/

John

12 January 2011 at 08:44

Sunny and Alejandro: Thanks for the links.

Daniel Black

12 January 2011 at 08:53

Some years ago, I intended to write up an ambitious blog post about Occam’s/Ockham’s Razor, and still may. While investigating it a bit, particularly with the help of a chap who goes by Hugo Holbling at The Galilean Library (an online forum largely focused on philosophy), I came across the idea of the theory-ladenness of our hypotheses. Of course, we probably agree that our approach to investigation is colored by our experience and current knowledge, but this notion makes a stronger statement: that we are so crippled by the current state of our understanding that any assessment of best-fit and which variables are superfluous is highly suspect. That is, while Occam’s Razor may be a helpful rule of thumb, it may easily lead one to false conclusivity.

What would Bayes’ Theorem say about this, if anything? I’m not asking rhetorically, mind you.

Bill Casarin

12 January 2011 at 10:57

There is a series of articles written by Eliezer Yudkowsky on these and related topics that readers may find interesting!

jsalvatier

12 January 2011 at 21:12

The LessWrong wiki article (http://wiki.lesswrong.com/wiki/Occam's_razor) on this topic also links to a number of related articles.

Yaroslav Bulatov

12 January 2011 at 22:15

I especially like their objective interpretation of Occam’s razor through formula 5. It seems remarkable that you can get a strong statement like “Einstein’s model is at least 15 times more likely than previous” that is Bayesian yet prior independent.

Andrew Gelman

13 January 2011 at 08:57

Color me skeptical.

Daniel Bilar

9 January 2012 at 13:23

Hello

Tangentially related, but since a poster mentioned BIC: have a look at MIC (maximal information coefficient), http://www.sciencemag.org/content/334/6062/1502

Most scientists will be familiar with the use of Pearson’s correlation coefficient r to measure the strength of association between a pair of variables: for example, between the height of a child and the average height of their parents (r ≈ 0.5; see the figure, panel A), or between wheat yield and annual rainfall (r ≈ 0.75, panel B). However, Pearson’s r captures only linear association, and its usefulness is greatly reduced when associations are nonlinear. What has long been needed is a measure that quantifies associations between variables generally, one that reduces to Pearson’s in the linear case, but that behaves as we’d like in the nonlinear case. On page 1518 of this issue, Reshef et al. (1) introduce the maximal information coefficient, or MIC, that can be used to determine nonlinear correlations in data sets equitably

Gabriel

10 January 2012 at 14:19

Chapter 28 of MacKay’s ITILA (free ebook btw) is crystal clear on exactly that: there is no need to have lower priors for higher complexity hypotheses. Because there is only a finite density (of 1.0) to be spread on the space of possible datasets, lower expressiveness hypotheses will be sharper and so have a higher posterior probability.

_winnie

11 January 2012 at 04:57

Solomonoff induction: http://www.mdpi.com/1099-4300/13/6/1076/pdf
Solomonoff gives a precise theoretical prior probability for models, using Kolmogorov complexity. It can be estimated by some “XXX information criteria” or ZIP compression.

Michael

28 February 2018 at 06:31

What the heck are ‘bolder predictions’? I like this article. But the vague language is putting me off.

John

28 February 2018 at 13:13

A prediction is bolder if it is more specific, that is, if it has lower probability.
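
A toy illustration, with two made-up models of a single die roll: a "bold" model that says the roll will be a 3, and a "vague" model that says every face is equally likely. The model names, the equal prior weights, and the helper functions are assumptions chosen only for this sketch.

```python
from fractions import Fraction


def likelihood(model, outcome):
    """Probability each hypothetical model assigns to the observed face."""
    if model == "bold":
        # The bold model stakes everything on a 3.
        return Fraction(1) if outcome == 3 else Fraction(0)
    # The vague model spreads its probability evenly over six faces.
    return Fraction(1, 6)


def posterior(outcome, prior):
    """Bayes' theorem over the two models for one observed outcome."""
    joint = {m: p * likelihood(m, outcome) for m, p in prior.items()}
    total = sum(joint.values())
    return {m: j / total for m, j in joint.items()}


prior = {"bold": Fraction(1, 2), "vague": Fraction(1, 2)}

post_3 = posterior(3, prior)  # both models are consistent with a 3
post_4 = posterior(4, prior)  # a 4 falsifies the bold model

print("after observing 3:", post_3)
print("after observing 4:", post_4)
```

When a 3 comes up, both models fit, but the bold model ends with posterior probability 6/7 because it assigned the outcome six times as much probability. When a 4 comes up, the bold model is falsified and its posterior drops to 0.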
