If there’s evidence that an animal is a bear, you’d think there’s even more evidence that it’s a mammal. It turns out that *p*-values fail this common sense criterion as a measure of evidence.

I just ran across a paper by Mark Schervish^{1} that contains a criticism of *p*-values I had not seen before. *p*-values are commonly used as measures of evidence despite the protests of many statisticians. It seems reasonable that a measure of evidence would have the following property: if a hypothesis *H* implies another hypothesis *H*′, then evidence in favor of *H*′ should be at least as great as evidence in favor of *H*.

Here’s one of the examples from Schervish’s paper. Suppose data come from a normal distribution with variance 1 and unknown mean μ. Let *H* be the hypothesis that μ is contained in the interval (-0.5, 0.5), and let *H*′ be the hypothesis that μ is contained in the larger interval (-0.82, 0.52). Now suppose you observe *x* = 2.18. The *p*-value for *H* is 0.0502 and the *p*-value for *H*′ is 0.0498. This says there is more evidence to support the hypothesis *H* that μ is in the smaller interval than there is to support the hypothesis *H*′ that μ is in the larger interval. If we adopt α = 0.05 as the cutoff for significance, we would reject the hypothesis that -0.82 < μ < 0.52 but accept the hypothesis that -0.5 < μ < 0.5. We’re willing to accept that we’ve found a bear, but doubtful that we’ve found a mammal.
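For readers who want to check the arithmetic, the two values above are reproduced by computing, for an observation *x* falling above an interval null (*a*, *b*), the quantity Φ(*b* − *x*) + Φ(*a* − *x*), where Φ is the standard normal CDF. This is a sketch: the formula is the one consistent with the numbers quoted here, not a verbatim transcription from the paper.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def interval_p_value(a, b, x):
    # p-value for H: mu in (a, b), one observation x from N(mu, 1),
    # for an observation above the interval (x >= b).
    # This form reproduces the two values quoted in the post.
    return Phi(b - x) + Phi(a - x)

x = 2.18
p_H  = interval_p_value(-0.5,  0.5,  x)   # H:  -0.5  < mu < 0.5
p_Hp = interval_p_value(-0.82, 0.52, x)   # H': -0.82 < mu < 0.52
print(round(p_H, 4), round(p_Hp, 4))      # 0.0502 0.0498
```

Note that the smaller interval yields the larger *p*-value, which is the anomaly the post describes.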

^{1} Mark J. Schervish. “*P* Values: What They Are and What They Are Not.” *The American Statistician*, August 1996, Vol. 50, No. 3.

**Update**: I added the details of the *p*-value calculation here.


I’m confused: does this anomaly come up because the larger hypothesis interval is skewed further away from the observation than the smaller interval?

I am confused by the confidence intervals. It looks like 0.52 > 0.50 (the top of the 95% confidence interval). Is it not the case that all Bears are Mammals? If so, should the smaller confidence interval not be nested inside the larger one?

Or, of course, I could have misread the two intervals and look like a fool. My apologies.

While I agree with the basic premise that p-values can be misleading or inconsistent, because such a stretch is required to set this up, I don’t think that it is a great example of why I should be worried about the issue in practice.

Where did these two (hypothetical) intervals come from? Presumably, they represent 95% CIs for a population sample of some quantity in the subtype (bears), which was found to have mean 0 and SD 0.255; and in the type (mammals), having mean -0.15 and SD 0.342.

So, first, the known measurement variation (SD = 1) is at least 3 times that of the observed population variation for either the type or the subtype; not an unimaginable situation if the population estimates were derived from repeated measures, but in that case, we haven’t been given the relevant intervals. Second, given either a N(0, 0.255) or N(-0.15, 0.342), the probability of observing 2.18 or greater is close to 0; i.e., you’re extremely unlikely to observe 2.18 if measuring an actual mammal.

Nice counterexample.
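The back-of-envelope numbers in the comment above check out. In this sketch, the 1.96 multiplier and the reading of each interval as a 95% CI are the commenter’s assumptions, not anything from the paper:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Reading each interval as a 95% CI, half-width / 1.96 gives the implied SD.
sd_bear   = 0.5 / 1.96    # half-width of (-0.5, 0.5)   -> about 0.255
sd_mammal = 0.67 / 1.96   # half-width of (-0.82, 0.52) -> about 0.342

# Tail probability of observing 2.18 or more under N(-0.15, sd_mammal^2)
tail = 1 - Phi((2.18 - (-0.15)) / sd_mammal)
print(round(sd_bear, 3), round(sd_mammal, 3), tail)
```

The tail probability comes out on the order of 10⁻¹², i.e. effectively zero, as the commenter says.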

OK – I’m missing something here… probably because I haven’t had my morning coffee yet and I’m a frequentist. What I want to do is take the difference between the observation (2.18) and the edge of the interval (0.5 or 0.52), and normalize it (divide by 1, in this case). This gives 1.68 or 1.66. I’m inclined toward two-tailed tests (since it could be lower, too), giving p-values around 0.1, and yielding more evidence for not being in the tighter interval (not-a-bear) than for not being in the wider interval (not-a-mammal). So I’m not seeing an inconsistency.

Of course, this completely ignores the lengths of the intervals, so I’m guessing that if I were to treat these as characterizing a (uniform?) prior, I’d get results closer to those of the post….
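The arithmetic in that comment can be checked directly. This is a sketch of the commenter’s approach (treating the nearest interval endpoint as a point null and doubling the one-sided tail), not the calculation from Schervish’s paper:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

x = 2.18
# Distance from the observation to each interval's upper edge, in SDs (SD = 1)
for edge in (0.5, 0.52):
    z = x - edge                      # 1.68 and 1.66
    p_two_sided = 2 * (1 - Phi(z))
    print(edge, round(z, 2), round(p_two_sided, 3))
```

This gives roughly 0.093 and 0.097: both “around 0.1” as the commenter says, with the tighter interval producing the smaller p-value, so no ordering anomaly appears under this approach.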

David Stivers: This is not true!

If you repeat the experiment with random values of mu_1, mu_2, mu_1′, mu_2′, you will see that the p-values of the first test are larger than those of the second test about 1 out of 5 times.

Best,

I’ve updated the post to link to the expression Schervish uses for his p-value calculation.

@kav: I’m sorry, I don’t follow.

David Stivers: Sorry, my bad.

> “So, first, the known measurement variation (SD = 1) is at least 3 times that of the observed population variation…”

I too thought that Schervish’s values may not be representative, so I ran his experiment, this time using many randomly generated ranges and mean points (i.e. the 2.18). I find that in about 20% of the cases, the p-values associated with the tighter range are smaller than those associated with the larger range. So, Schervish’s point holds even for milder values of the mean point.

Sorry, that should read “in about 20% of the cases, the p-values associated with the tighter range are *larger*”.
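A sketch of the kind of simulation described in the comments above. The interval-generation scheme here is my own assumption, and the p-value form is the one that reproduces the 0.0502 and 0.0498 quoted in the post:

```python
import random
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def interval_p_value(a, b, x):
    # p-value for H: mu in (a, b), one observation x from N(mu, 1), x >= b
    return Phi(b - x) + Phi(a - x)

random.seed(0)
n, anomalies = 10_000, 0
for _ in range(n):
    # Nested intervals: (a, b) strictly inside (A, B)
    A, a, b, B = sorted(random.uniform(-2, 2) for _ in range(4))
    x = B + random.uniform(0, 3)      # observation above both intervals
    if interval_p_value(a, b, x) > interval_p_value(A, B, x):
        anomalies += 1
print(f"{anomalies / n:.0%}")         # fraction with the reversed ordering
```

The exact fraction depends on how the intervals and observations are generated, but the reversed ordering (tighter interval, larger p-value) shows up in a nontrivial share of draws, consistent with the roughly one-in-five figure reported above.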

Can you elaborate more on the p-values? I get 0.047 for the bigger interval and 0.043 for the smaller one.

Nqkio: Yes. Please see the formula for the p-values given here. I’ve verified the values in the example using this formula.

I think you need to change “inconsistent” to “incoherent.” One is a clear, well-defined term and one is entirely subjective and loaded. Can you guess which one you used and which one the author himself used?

Hi John,

First off, let me say I’ve enjoyed your blog.

Can you provide a link to the article?

WJC: Thanks.

The American Statistician does not make its articles publicly available, so I can’t provide a link. You can access the journal’s archives if you are an ASA member. The article is also available via JSTOR; your library may have access.

@wjc: Just google for it, using Google Scholar and including the search term ‘pdf’.

This is weird.

I have been taught that the p-value is a measure of the incompatibility between data and hypothesis. It is only used to disprove a (null) hypothesis. We do not accept all the things that Neyman and Pearson proposed.

I also learned that the p-value is a relative measure, with a vague quantitative interpretation. Its numerical value is relative to the experimental conduct and to the hypothesis. Numerical comparisons of two p-values are meaningless if the sample sizes of the two experiments are different, or if the widths of the two interval null hypotheses are different.

So the author makes use of p-values in a completely bogus way, gets a completely bogus result, and somehow it is the p-values’ fault… nice.