In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.
So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes’ theorem] to the 1.5 power, you get a better result.
In other words, if we change Bayes’ theorem (!) the algorithm works better. He goes on to explain:
Now should we dig up Bayes and notify him that he was wrong? No, I don’t think that’s it. …
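To make the tweak concrete, here is a minimal sketch of the decision rule Norvig describes, written in Python. The model functions p_english and p_french_given_english and the candidate list are placeholders of my own, not anything from the talk.

import math

def translate(f, candidates, p_english, p_french_given_english, alpha=1.5):
    # Noisy-channel decoding sketch: choose the English candidate e that
    # maximizes p(e)**alpha * p(f | e). With alpha = 1 this is the ordinary
    # Bayes-rule decision (the constant p(f) doesn't change the argmax);
    # Norvig's tweak sets alpha = 1.5.
    def score(e):
        return alpha * math.log(p_english(e)) + math.log(p_french_given_english(f, e))
    return max(candidates, key=score)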
I imagine most statisticians would respond that this cannot possibly be right. While it appears to work, there must be some underlying reason why it works, and we should find that reason before using an algorithm based on an ad hoc tweak.
While such a reaction is understandable, it’s also a little hypocritical. Statisticians are constantly drawing inference from empirical data without understanding the underlying mechanisms that generate the data. When analyzing someone else’s data, a statistician will say that of course we’d rather understand the underlying mechanism than fit statistical models; it’s just not always possible. Reality is too complicated and we’ve got to do the best we can.
I agree, but that same reasoning applied at a higher level of abstraction could be used to accept Norvig’s translation algorithm. Here’s this model (derived from spurious math, but we’ll ignore that). Let’s see empirically how well it works.
I think the pea was slipped under the other walnut shell when Norvig took the step from “Pr” to “p” — that is, from the population probability to the frequency within a sample. He did offer a very reasonable explanation when he said “I think what’s going on here is that we have more confidence in the first model” that is, we have a much larger sample of English texts than of known French-to-English translations, and hence more confidence that the frequency of a given English string reflects its “true” probability. The assumption of equivalent confidence is necessary to justify application of Bayes’ theorem to any finite sample. I have a truly marvelous proof, but this comment box is too small to contain it.
Reminds me of something I saw a few years ago: a student came to a meeting with pretty bad translation results when correctly using Bayes’ rule. But when they used the “wrong” likelihood function, the results improved dramatically. The lesson one of the senior researchers gave: “hacking works”. The whole thing is one big hack; formulating it in terms of likelihoods and Bayes’ rule is really less of a formalism and more of a framework that provides some constraints that are useful for limiting the search space. But those constraints may also cut off useful lines of inquiry, and we only find out when we’re willing to violate them.
I think the way you presented this algorithm is misguided. (1) Norvig is not really raising the probabilities to the 1.5 power, he’s raising his estimates of those probabilities to that power, which can easily compensate for the estimates of the other probabilities having higher variance than these. (2) While Bayes’ theorem describes a way of obtaining the actual posterior probability, maximizing that is only loosely related to whatever downstream loss function you actually care about, and there are decision-theoretic reasons to add extra parameters (a temperature, in this case) to your model to improve a downstream loss.
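Concretely, the decision-theoretic view just treats the exponent as one more knob tuned against held-out data. A rough sketch, where decode and loss are hypothetical stand-ins for the translation decoder and the downstream evaluation metric:

def tune_alpha(dev_pairs, decode, loss, grid=(1.0, 1.25, 1.5, 1.75, 2.0)):
    # Pick the exponent that minimizes the downstream loss on held-out
    # (foreign sentence, reference translation) pairs.
    def dev_loss(alpha):
        return sum(loss(decode(f, alpha=alpha), e_ref) for f, e_ref in dev_pairs)
    return min(grid, key=dev_loss)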
It’s not just sample bias, it’s modeling error. Statistical MT models make fairly silly independence assumptions when it comes to modeling Pr(e) and Pr(f|e). In particular, the independence assumptions made when modeling Pr(e) are empirically “less bad” than the ones made when modeling Pr(f|e).
Should we be surprised or conclude that Bayes was in error? No, for the same reason we aren’t surprised when we find that logistic regression outperforms naive Bayes.
I agree with the comments above. Overlearned Bayesians tend to forget the simple premise for predictive optimality: both the likelihood (observation model) and the prior (the true parameter distribution) need to be correct. In the present case neither is correct, hence an arbitrary modification of either can, with good intuition or luck, lead to improvements in test error (ref: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=4834).
I think the next step would be to either a) investigate why the powered version of the posterior predicts better – does it point to a particular inadequacy of the model specification and can that be fixed, or does it indicate serious prior-data disagreements or an unintentionally strong prior, or b) accept that we have this heuristic that works really well. But I don’t think you should conflate b) with having a model, at least not in the strictest sense. Of course, maybe you don’t need one if you’re just interested in predicting/classifying. But I would find it difficult (and probably ill-advised) to interpret probability statements derived from Norvig’s “posterior” (I didn’t watch the whole talk so I don’t mean to suggest that he advocates that, but it is something you give up when you move away from a bona fide probability model and Bayesian procedure).
Chris D. hit the nail on the head: the answer to Jared’s question (a) is that there are naive independence assumptions that lead to gross mis-estimates of word or sound probabilities by failing to take into account the context of who’s speaking, what they’re speaking about, whom they’re speaking to, or even where they are syntactically in a phrase.
By raising a probability term to a fractional power, you’re compensating somewhat for the unmodeled correlations among words (or sounds).
For instance, naive Bayes text classifiers assume independence of the words, so that
p(x1,…,xN) = p(x1) * … * p(xN).
The calibrated model p2 just defines the joint probability by
p2(x1,…,xN) = p(x1,…,xN)^alpha
You can get even better results from naive Bayes by using document-length normalization. If you want to treat a length N document as if it were length K, you can raise to the K/N power. You can think of this as another “hack” on Bayes’s rule, raising p(x1,…,xN) to the K/N power. But you can also think of it as a different joint model, p2, defined by
p2(x1,…,xN) propto p(x1,…,xN)^(K/N)
Because it’s used for first-best classification, you rarely see anyone calculate the normalizing constant.
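Putting both hacks together, a rough sketch of the scoring function (the class log-prior and per-word log-probability table are hypothetical inputs):

def nb_score(words, log_prior, word_logprob, alpha=1.0, target_len=None):
    # Naive Bayes class score with the two hacks above: an exponent alpha on
    # the likelihood, and optional length normalization that treats a
    # length-N document as if it were length target_len (i.e. raises the
    # likelihood to the target_len/N power).
    n = len(words)
    loglik = sum(word_logprob[w] for w in words)  # independence assumption
    if target_len is not None and n > 0:
        loglik *= target_len / n
    return log_prior + alpha * loglik

# e.g. nb_score(["free", "money"], log_prior=-1.2,
#               word_logprob={"free": -2.0, "money": -3.0},
#               alpha=0.7, target_len=100)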
Mosteller and Wallace (1964, yes 1964) knew about the overdispersion of word counts and used negative binomial models on a word-by-word basis. Another way to account for this kind of overdispersion is to take the mixtures implied by something like a negative binomial seriously and treat them as latent random effects. For instance, you can think of a latent Dirichlet allocation (LDA) model this way, with each document being modeled as a mixture of multinomials determined by a latent document parameter theta.
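For what it’s worth, the overdispersion is easy to see in simulation: a Gamma-Poisson mixture (which is marginally negative binomial) has variance well above its mean, unlike a plain Poisson with the same mean. The rate and shape below are made-up numbers, just for illustration.

import numpy as np

rng = np.random.default_rng(0)
mean_rate = 2.0

# Plain Poisson: variance roughly equals the mean.
poisson_counts = rng.poisson(mean_rate, size=100_000)

# Gamma-Poisson mixture (marginally negative binomial): the per-document
# rate varies, so counts are overdispersed relative to their mean.
shape = 0.5
rates = rng.gamma(shape, mean_rate / shape, size=100_000)
negbin_counts = rng.poisson(rates)

print(poisson_counts.mean(), poisson_counts.var())  # roughly 2.0, 2.0
print(negbin_counts.mean(), negbin_counts.var())    # roughly 2.0, well above 2.0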
…and we should find that reason before using an algorithm based on an ad hoc tweak…
True. Unfortunately business doesn’t work that way :(
Pavel: Science doesn’t either.
Can it be some sort of smoothing?
Bayes’ theorem is a remarkable thinking tool that has sparked something of a revolution, and I think the tribute is justified. I’ve tried to make the subject fun by putting together a series of YouTube videos entitled “Bayes’ Theorem for Everyone”. They are non-mathematical and easy to understand, and I explain why Bayes’ theorem is important in almost every field. Bayes’ theorem sets the limit for how much we can learn from observations and how confident we should be about our opinions, and it points out the quickest way to the answers, identifying irrationality and sloppy thinking along the way. All this with mathematical precision and foundation. Please check out the first one:
http://www.youtube.com/watch?v=XR1zovKxilw