by John on September 4, 2008
Bayesian statisticians often talk about models “learning” as data accumulate. Here’s an example that applies information theory to quantify how much you can learn from an experiment using the same likelihood function but two different priors: a conjugate prior and a robust prior.
Here’s an example from a paper Luis Pericchi and I wrote recently. Suppose X ~ Normal(θ, 1) where the prior on θ is either a standard Cauchy distribution or a normal distribution with mean 0 and variance 2.19. (The variance on the normal was chosen following an example by Jim Berger so that both priors put half their mass on the interval [-1, 1].)
The expected information gain from a single observation using the normal (conjugate) prior was 0.58. The corresponding gain for the Cauchy (robust) prior was 1.20. Because robust priors are more responsive to data, the expected gain in information is larger (in this case twice as large) when using these priors.
Learning is not the same as just gaining information. Sometimes learning means letting go of previously held beliefs. While this is true in life in general, my point here is to show how this holds true when using the mathematical definition of information.
The information content of a probability density function p(x) is given by

Suppose we have a Beta(2, 6) prior on the probability of success for a binary outcome.

The prior density has information content 0.597. Then suppose we observe a success. The posterior density is distributed as Beta(3, 6). The posterior density has information 0.516, less information than the prior density.

Observing a success pulled the posterior density toward the right. The posterior density is a little more diffuse than the prior and so has lower information content. In that sense, we know less than before we observed the data! Actually, we’re less certain than we were before observing the data. But if the true probability of response is larger than our prior would indicate, we’re closer to the truth by becoming less confident of our prior belief, and we’ve learned something.
E. T. Jaynes gave a speech entitled A Backward Look to the Future in which he looked back on his long career as a physicist and statistician. The speech contains several quotes related to my recent post on what a probability means.
Jaynes advocated the view of probability theory as logic extended to include reasoning on incomplete information. Probability need not have anything to do with randomness. Jaynes believed that frequency interpretations of probability are unnecessary and misleading.
… think of probability theory as extended logic, because then probability
distributions are justified in terms of their demonstrable information content, rather than their imagined — and as it now turns out, irrelevant — frequency connections.
He concludes with this summary of his approach to probability.
As soon as we recognize that probabilities do not describe reality — only our information about reality — the gates are wide open to the optimal solution of problems of reasoning from that information.