Occam’s razor says that if two models fit equally well, the simpler model is likely to be a better description of reality. Why should that be?
A paper by Jim Berger suggests a Bayesian justification of Occam’s razor: simpler hypotheses have higher posterior probabilities when they fit well.
A simple model makes sharper predictions than a more complex model. For example, consider fitting a linear model and a cubic model. The cubic model is more general and can fit a wider range of data sets. The linear model is more restrictive and hence easier to falsify. But when the linear and cubic models both fit, Bayes’ theorem “rewards” the linear model for making a bolder prediction. See Berger’s paper for details and examples.
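To make the linear-versus-cubic comparison concrete, here is a minimal sketch in Python. It is not taken from Berger’s paper. It assumes Gaussian noise with a known standard deviation and independent zero-mean Gaussian priors on the polynomial coefficients, so that the marginal likelihood (the evidence) of each model has a closed form. The function name `log_evidence`, the values of `prior_sd` and `noise_sd`, and the simulated data are all illustrative choices.

```python
import numpy as np

def log_evidence(x, y, degree, prior_sd=10.0, noise_sd=0.1):
    """Log marginal likelihood of y under a polynomial model of the given degree.

    Assumption: coefficients are independent N(0, prior_sd^2) and the noise is
    N(0, noise_sd^2), so y ~ N(0, noise_sd^2 * I + prior_sd^2 * Phi @ Phi.T).
    """
    Phi = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    cov = noise_sd**2 * np.eye(len(x)) + prior_sd**2 * Phi @ Phi.T
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (quad + logdet + len(x) * np.log(2 * np.pi))

# Simulated data that really do come from a straight line.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

log_ev_linear = log_evidence(x, y, degree=1)
log_ev_cubic = log_evidence(x, y, degree=3)

# A positive log Bayes factor favors the linear model.
log_bayes_factor = log_ev_linear - log_ev_cubic
print(log_ev_linear, log_ev_cubic, log_bayes_factor)
```

Both models can fit these data, but the cubic model has spread its probability over many more possible data sets, so its evidence comes out lower. The size of the gap depends on the assumed prior width, which is one reason to treat the numbers as illustrative only.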
From the conclusion of the paper:
Ockham’s razor, far from being merely an ad hoc principle, can under many practical situations in science be justified as a consequence of Bayesian inference. Bayesian analysis can shed new light on what the notion of “simplest” hypothesis consistent with the data actually means.
14 thoughts on “Occam’s razor and Bayes’ theorem”
On a similar note, you can try the Bayesian information criterion (BIC) the next time you need to select a model to fit data.
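As a rough sketch of what that could look like, the following Python computes BIC for polynomial fits. It assumes Gaussian errors, in which case BIC reduces, up to an additive constant, to n ln(RSS/n) + k ln(n), where RSS is the residual sum of squares and k is the number of fitted parameters. The helper names and the simulated data are hypothetical.

```python
import numpy as np

def gaussian_bic(rss, n, k):
    """BIC up to an additive constant for least squares with Gaussian errors."""
    return n * np.log(rss / n) + k * np.log(n)

def polynomial_bic(x, y, degree):
    """Fit a polynomial by least squares and return its BIC."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = float(residuals @ residuals)
    k = degree + 2  # polynomial coefficients plus the noise variance
    return gaussian_bic(rss, len(x), k)

# Simulated data from a straight line with Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

for degree in (1, 2, 3):
    print(degree, polynomial_bic(x, y, degree))  # lower BIC is preferred
```

Each extra coefficient costs ln(n) in the penalty term, so a higher-degree fit is preferred only if it reduces the residual sum of squares by enough to pay for it.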
BIC works well only when the true model is in the model space, which is not always the case. When it is not, BIC may lead to bad results: cf: here.
Chapter 28 of “Information Theory, Inference, and Learning Algorithms” by David MacKay has an analysis along the same lines. You can get an electronic copy here:
Sunny and Alejandro: Thanks for the links.
Some years ago, I intended to write up an ambitious blog post about Occam’s/Ockham’s Razor, and still may. While investigating it a bit, particularly with the help of a chap who goes by Hugo Holbling at The Galilean Library (an online forum largely focused on philosophy), I came across the idea of the theory-ladenness of our hypotheses. Of course, we probably agree that our approach to investigation is colored by our experience and current knowledge, but this notion makes a stronger statement: that we are so crippled by the current state of our understanding that any assessment of best-fit and which variables are superfluous is highly suspect. That is, while Occam’s Razor may be a helpful rule of thumb, it may easily lead one to a false sense of conclusiveness.
What would Bayes’ Theorem say about this, if anything? I’m not asking rhetorically, mind you.
There is a series of articles written by Eliezer Yudkowsky on these and related topics that readers may find interesting!
The LessWrong wiki article on this topic (http://wiki.lesswrong.com/wiki/Occam's_razor) also links to a number of related articles.
I especially like their objective interpretation of Occam’s razor through formula 5. It seems remarkable that you can get a strong statement like “Einstein’s model is at least 15 times more likely than previous” that is Bayesian yet prior independent.
Color me skeptical.
Tangentially related, but since a poster mentioned BIC: Have a look at MIC (maximal information coefficient), http://www.sciencemag.org/content/334/6062/1502
Most scientists will be familiar with the use of Pearson’s correlation coefficient r to measure the strength of association between a pair of variables: for example, between the height of a child and the average height of their parents (r ≈ 0.5; see the figure, panel A), or between wheat yield and annual rainfall (r ≈ 0.75, panel B). However, Pearson’s r captures only linear association, and its usefulness is greatly reduced when associations are nonlinear. What has long been needed is a measure that quantifies associations between variables generally, one that reduces to Pearson’s in the linear case, but that behaves as we’d like in the nonlinear case. On page 1518 of this issue, Reshef et al. (1) introduce the maximal information coefficient, or MIC, that can be used to determine nonlinear correlations in data sets equitably.
Chapter 28 of MacKay’s ITILA (free ebook btw) is crystal clear on exactly that: there is no need to have lower priors for higher complexity hypotheses. Because each hypothesis has only a finite amount of probability (a total of 1.0) to spread over the space of possible datasets, less expressive hypotheses concentrate it more sharply, and so have a higher posterior probability when the data land where they predicted.
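A toy version of this argument can be sketched in a few lines of Python. The setup is hypothetical and not taken from MacKay’s book: a "simple" hypothesis spreads its probability uniformly over 10 possible datasets, a "complex" one spreads it uniformly over 100, and both get the same prior.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' theorem over a finite set of hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: p / total for h, p in joint.items()}

# Hypothetical supports: each hypothesis spreads probability 1 uniformly
# over the datasets it allows.
supports = {"simple": range(0, 10), "complex": range(0, 100)}
observed = 7  # a dataset that both hypotheses allow

likelihoods = {
    h: Fraction(1, len(s)) if observed in s else Fraction(0)
    for h, s in supports.items()
}
priors = {"simple": Fraction(1, 2), "complex": Fraction(1, 2)}  # equal priors

post = posterior(priors, likelihoods)
print(post)  # simple: 10/11, complex: 1/11
```

No penalty for complexity was put into the priors. The simple hypothesis wins only because it assigned ten times more probability to the dataset that was actually observed. Had the observation been, say, 42, the simple hypothesis would have been ruled out entirely.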
Solomonoff induction: http://www.mdpi.com/1099-4300/13/6/1076/pdf
Solomonoff gives a precise theoretical prior probability for models, using Kolmogorov complexity. It can be estimated by some “XXX information criteria” or ZIP-compression.
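As a rough illustration of the compression idea, the sketch below uses Python’s standard `zlib` module to compare the compressed sizes of a highly regular byte string and a pseudo-random one. Compressed length is only a crude upper bound on Kolmogorov complexity, which is itself uncomputable, so this is a heuristic and not Solomonoff’s prior. The example strings are made up.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Number of bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

regular = b"01" * 5000  # 10,000 bytes with an obvious short description

rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(10000))  # 10,000 pseudo-random bytes

print("regular:", compressed_size(regular))
print("noisy:  ", compressed_size(noisy))
```

The regular string compresses to a tiny fraction of its length, while the pseudo-random one barely compresses at all. Note that the pseudo-random string actually has a short description too (the generator and its seed), which zlib cannot find. That is one reason compression only bounds complexity from above.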
What the heck are ‘bolder predictions’? I like this article. But the vague language is putting me off.
A prediction is bolder if it is more specific, that is, if it has lower probability of turning out true before the data are seen.
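One way to put numbers on this, as a sketch: suppose, purely for illustration, that before seeing data a quantity is believed to follow a standard normal distribution. A prediction that it lands in a narrow interval has lower prior probability than a prediction that it lands in a wide one, and so is bolder in this sense. The distribution and the intervals below are assumptions made for the example.

```python
import math

def normal_interval_probability(low, high):
    """P(low < Z < high) for a standard normal Z, via the error function."""
    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cdf(high) - cdf(low)

p_vague = normal_interval_probability(-2.0, 2.0)  # "somewhere between -2 and 2"
p_bold = normal_interval_probability(-0.5, 0.5)   # "between -0.5 and 0.5"

for name, p in (("vague", p_vague), ("bold", p_bold)):
    # Surprisal in bits: how much is learned if the prediction comes true.
    print(f"{name}: probability {p:.3f}, surprisal {-math.log2(p):.2f} bits")
```

The bold prediction is more likely to fail, but if it comes true it has told us more, and Bayes’ theorem gives it correspondingly more credit.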