If there’s evidence that an animal is a bear, you’d think there’s even more evidence that it’s a mammal. It turns out that p-values fail this common-sense criterion as a measure of evidence.
I just ran across a paper by Mark Schervish¹ that contains a criticism of p-values I had not seen before. p-values are commonly used as measures of evidence despite the protests of many statisticians. It seems reasonable that a measure of evidence would have the following property: if a hypothesis H implies another hypothesis H′, then evidence in favor of H′ should be at least as great as evidence in favor of H.
Here’s one of the examples from Schervish’s paper. Suppose data come from a normal distribution with variance 1 and unknown mean μ. Let H be the hypothesis that μ is contained in the interval (-0.5, 0.5), and let H′ be the hypothesis that μ is contained in the interval (-0.82, 0.52). Since the first interval is contained in the second, H implies H′. Now suppose you observe x = 2.18. The p-value for H is 0.0502 and the p-value for H′ is 0.0498. This says there is more evidence to support the hypothesis H that μ is in the smaller interval than there is to support the hypothesis H′ that μ is in the larger interval. If we adopt α = 0.05 as the cutoff for significance, we would reject the hypothesis that -0.82 < μ < 0.52 but accept the hypothesis that -0.5 < μ < 0.5. We’re willing to accept that we’ve found a bear, but doubtful that we’ve found a mammal.
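The p-values above come from the uniformly most powerful unbiased (UMPU) test of an interval null hypothesis. Here is a sketch of that calculation, my own reconstruction rather than code from the paper: the level-α UMPU test accepts when c₁ ≤ x ≤ c₂, with the cutoffs chosen so the acceptance probability equals 1 − α at both endpoints of the interval, and the p-value is the smallest α at which the observation is rejected.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def umpu_p_value(x, a, b):
    """p-value for H0: a <= mu <= b given one observation x ~ N(mu, 1),
    based on the UMPU test.

    The level-alpha test accepts when c1 <= x <= c2, with cutoffs chosen
    so that P_mu(accept) = 1 - alpha at both mu = a and mu = b.  For an
    observation above the interval (as in Schervish's example) the p-value
    is found by setting c2 = x and solving for the matching lower cutoff.
    """
    assert x > b, "this sketch handles only observations above the interval"
    # g(c1) = P_a(c1 <= X <= x) - P_b(c1 <= X <= x); its unique root
    # below a is the lower cutoff.  g is decreasing there, so bisect.
    g = lambda c1: (Phi(x - a) - Phi(c1 - a)) - (Phi(x - b) - Phi(c1 - b))
    lo, hi = -50.0, a  # g(lo) > 0, g(a) < 0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    c1 = 0.5 * (lo + hi)
    return 1.0 - (Phi(x - a) - Phi(c1 - a))

print(round(umpu_p_value(2.18, -0.5, 0.5), 4))    # 0.0502
print(round(umpu_p_value(2.18, -0.82, 0.52), 4))  # 0.0498
```

For the normal with known variance the bisection lands on c₁ = a + b − x, so the p-value reduces to (1 − Φ(x − a)) + Φ(b − x); the numerical root-finding is kept only to show the general construction.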
¹ Mark J. Schervish. “P Values: What They Are and What They Are Not.” The American Statistician, August 1996, Vol. 50, No. 3.
Update: I added the details of the p-value calculation here.