Bayes factors vs p-values

Bayesian analysis and Frequentist analysis often lead to the same conclusions by different routes. But sometimes the two forms of analysis lead to starkly different conclusions.

The following illustration of this difference comes from a talk by Luis Pericchi last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.

In an experiment to test the existence of extra sensory perception (ESP), researchers wanted to see whether a person could influence some process that emitted binary data. (I’m going from memory on the details here, and I have not found Bernardo’s original paper. However, you could ignore the experimental setup and treat the following as hypothetical. The point here is not to investigate ESP but to show how Bayesian and Frequentist approaches could lead to opposite conclusions.)

The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is p = 0.5. The alternative hypothesis was that p is not 0.5. There were N = 104,490,000 bits emitted during the experiment, and s = 52,263,471 were 1’s. The p-value, the probability of an imbalance this large or larger under the assumption that p = 0.5, is 0.0003. Such a tiny p-value would be regarded as extremely strong evidence in favor of ESP given the way p-values are commonly interpreted.

The Bayes factor, however, is 18.7, meaning that the data are about 19 times more likely under the null hypothesis than under the alternative. The alternative in this example uses Jeffreys’ prior, Beta(0.5, 0.5), for p.

So given the data and assumptions in this example, the Frequentist concludes there is very strong evidence for ESP while the Bayesian concludes there is strong evidence against ESP.

The following Python code shows how one might calculate the p-value and Bayes factor.

import numpy as np
from scipy.stats import binom
from scipy.special import betaln

N = 104490000
s = 52263471

# sf is the survival function, i.e. the complementary CDF.
# sf(s-1) gives P(X >= s); multiply by 2 for the two-sided test.
print("p-value: ", 2*binom.sf(s-1, N, 0.5))

# Compute the log of the Bayes factor to avoid underflow.
# Null: p = 0.5. Alternative: p ~ Beta(0.5, 0.5), Jeffreys' prior.
logbf = N*np.log(0.5) - betaln(s+0.5, N-s+0.5) + betaln(0.5, 0.5)
print("Bayes factor: ", np.exp(logbf))

16 thoughts on “Bayes factors vs p-values”

  1. John K. Kruschke

    By “this example” I meant ESP generally, and more recently the kerfuffle over Bem’s ESP article, cited in the paper linked in the previous comment.

  2. Take any study that claims to support ESP, search and replace ESP with something people find plausible, and there will be no criticism. You could say this is confirmation bias. Or you could say everyone is really a Bayesian, and most of us have a highly informative prior belief that ESP is bunk. The former argument says people are irrational, the latter says they are rational, and yet both arguments are largely the same!

    Pericchi’s argument was that p-values behave paradoxically for large samples and should be adjusted for sample size. He shows how to do this adjustment so that p-values and Bayes factors agree asymptotically.
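
A minimal numerical sketch of that sample-size effect (not Pericchi’s actual adjustment, just an illustration of Lindley’s paradox): hold the z-score fixed at roughly the value in the ESP data and let N grow. The two-sided p-value stays the same while the Bayes factor in favor of the null grows, roughly like the square root of N.

import numpy as np
from scipy.stats import norm
from scipy.special import betaln

z = 3.6   # approximately the z-score in the ESP example
for N in [10**4, 10**6, 10**8]:
    s = int(round(N/2 + z*np.sqrt(N)/2))   # data with that fixed z-score
    pvalue = 2*norm.sf(z)                  # about 0.0003, independent of N
    logbf = N*np.log(0.5) - betaln(s+0.5, N-s+0.5) + betaln(0.5, 0.5)
    print(N, pvalue, np.exp(logbf))        # Bayes factor for the null grows with N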

  3. Right. My point was simply that the Bayes factor can change dramatically when a theoretically meaningful alternative prior is used. In the Pericchi example, the proportion of “heads” in the sample was barely bigger than 50%. If this were a real domain of research, apparently with enormous sample sizes, then the researchers would know in advance, from previous experience, that the phenomenon in question is barely bigger than chance. If the alternative prior is supposed to represent the alternative hypothesis, then the alternative prior should express the prior knowledge. The researchers’ hypothesis might, therefore, be better expressed by a beta(5010,4990) prior than by a beta(0.5,0.5) prior. Then the Bayes factor is 0.15 for the null; that is, against the null. [By the way, I get a Bayes factor of 18.69 for the beta(0.5,0.5) prior.] This does not contradict anything you’ve said; it’s just a reminder that Bayes factors are only as meaningful as the hypotheses being compared.
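
For what it’s worth, both of the commenter’s numbers can be reproduced by a small variation of the code in the post, replacing the Beta(0.5, 0.5) prior with a general Beta(a, b); this is just a sketch of that check.

import numpy as np
from scipy.special import betaln

N, s = 104490000, 52263471

def bayes_factor_for_null(a, b):
    # Bayes factor of the null (p = 0.5) versus an alternative with p ~ Beta(a, b)
    logbf = N*np.log(0.5) - betaln(s + a, N - s + b) + betaln(a, b)
    return np.exp(logbf)

print(bayes_factor_for_null(0.5, 0.5))    # about 18.7, the Jeffreys-prior value
print(bayes_factor_for_null(5010, 4990))  # about 0.15, now favoring the alternative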

  4. Hello,
    I am almost sure that the Bernardo reference is the following:

    Integrated Objective Bayesian Estimation and Hypothesis Testing, José M. Bernardo; in Bayesian Statistics 9, J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West (eds.), Oxford University Press, 2010.

    See example 7, after theorem 5
    http://www4.stat.ncsu.edu/~ghosal/papers/Bernardo.pdf

  5. Isn’t the p-value issue one of the amount of data being so large that even tiny differences become statistically significant, whether practically meaningful or not? I ran a one-proportion test in Minitab and found that against the null of true p = 0.5, the p-value was indeed at or near zero, and the 95% confidence interval for the true proportion is (0.500081, 0.500273). I think if I were the frequentist researcher I would not be celebrating my discovery of ESP. :-)
    But nonetheless, great warning to be careful no matter what tool one is using.
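
That interval can be checked with a plain normal-approximation (Wald) interval, which may not be exactly what Minitab computes but agrees to the digits quoted; a minimal sketch:

import numpy as np

N, s = 104490000, 52263471
phat = s / N
se = np.sqrt(phat*(1 - phat)/N)
print(phat - 1.96*se, phat + 1.96*se)   # roughly (0.500081, 0.500273)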

  6. You may have mentioned it and I missed it, but I believe that the technical term is that p-values for a point hypothesis are not consistent. Not sure if this is (as much of) a problem for a one-tailed hypothesis or not.

  7. Thanks for presenting this example.
    Lindley’s paradox (Wikipedia) has a nice discussion of why the results differ and are not contradictory: the Frequentist finds that the null hypothesis is a poor explanation for the observation, whereas the Bayesian finds that the null hypothesis is a far better explanation for the observation than the alternative.

  8. First time I’ve come across Lindley’s paradox. I can follow the numbers but can’t quite get my head around how increasing sample size can increase the probability of falsely accepting an alternative hypothesis… it seems completely contrary to my understanding of hypothesis testing. Any chance of a follow-up post?

  9. I don’t understand what the problem with the p-value is here. It seems like such a sample would in fact be strong evidence for ESP, because such samples are really unlikely under the null. What’s the problem? Isn’t that how it should be?

  10. A problem with p-values is that data can be unlikely under the null, but even more unlikely under the alternative.

  11. “The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is p = 0.5.” – isn’t this false? The null hypothesis is ONLY that p = 0.5, not also that the individual had no influence. Therefore, there is only strong evidence in favour of p ≠ 0.5 under the conditions observed during the experiment, not necessarily strong evidence for ESP.

    Given there’s no mention of a control (what happens when nobody is around to influence the bit stream?), the experimental set-up doesn’t tell you anything about ESP at all.

    As for the Bayesian approach, Beta(0.5,0.5) is an incredibly conservative prior for a process that should (presumably) be strongly expected to have p = 0.5. No wonder the null is favoured.

    Rather than illustrating differing conclusions made by frequentist and Bayesian approaches, the experimental setup appears to simply show how not to do either.

  12. I disagree with some of the above. The null hypothesis has NOTHING to do with the probability of ‘1’ or ‘0’; the null hypothesis is that “the subject’s attempt to influence the distribution did not change the distribution.”

    The likely reason that the p-value was so small is that random number generators do not output perfectly random numbers, and large populations yield very sensitive p-value tests.

    IMO, a better experimental design assumes nothing about the underlying distribution of ‘0’ and ‘1’ – simply measure the frequencies for a control run, then measure the frequencies when “ESP is being attempted”, and do something like a chi-squared test on those two distributions.
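
A sketch of that suggested design, with entirely made-up counts standing in for a real control run and an “ESP attempted” run:

from scipy.stats import chi2_contingency

# Hypothetical counts of [ones, zeros] in each condition
control       = [5001200, 4998800]
esp_attempted = [5002300, 4997700]

chi2, pvalue, dof, expected = chi2_contingency([control, esp_attempted])
print(chi2, pvalue)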

  13. RE: “The p-value, the probability of an imbalance this large or larger under the assumption that p = 0.5, is 0.0003. Such a tiny p-value would be regarded as extremely strong evidence in favor of ESP given the way p-values are commonly interpreted.”

    The probability that an infinite population would yield a value different from 0.5 (given the 104+ million actual sample size of bits) is only 0.03%!!! Therefore, the hypothesis of an ESP effect would be rejected at a confidence level of 99.5% by frequentist theory (i.e., Fisher’s theory, treating the confidence level as a matter of choice, not fixed at 95%). Moreover, if there had been, in fact, an ESP effect, it was so small as to be of negligible importance for most ordinary situations.
