In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.
So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes' theorem] to the 1.5 power, you get a better result.
In other words, if we change Bayes’ theorem (!) the algorithm works better. He goes on to explain
Now should we dig up Bayes and notify him that he was wrong? No, I don’t think that’s it. …
I imagine most statisticians would respond that this cannot possibly be right. While it appears to work, there must be some underlying reason why and we should find that reason before using an algorithm based on an ad hoc tweak.
While such a reaction is understandable, it’s also a little hypocritical. Statisticians are constantly drawing inference from empirical data without understanding the underlying mechanisms that generate the data. When analyzing someone else’s data, a statistician will say that of course we’d rather understand the underlying mechanism than fit statistical models, that’s just not always possible. Reality is too complicated and we’ve got to do the best we can.
I agree, but that same reasoning applied at a higher level of abstraction could be used to accept Norvig’s translation algorithm. Here’s this model (derived from spurious math, but we’ll ignore that). Let’s see empirically how well it works.