p-values are inconsistent

If there’s evidence that an animal is a bear, you’d think there’s at least as much evidence that it’s a mammal. It turns out that p-values fail this common-sense criterion as a measure of evidence.

I just ran across a paper by Mark Schervish¹ that contains a criticism of p-values I had not seen before. p-values are commonly used as measures of evidence despite the protests of many statisticians. It seems reasonable that a measure of evidence would have the following property. If a hypothesis H implies another hypothesis H′, then evidence in favor of H′ should be at least as great as evidence in favor of H.

Here’s one of the examples from Schervish’s paper. Suppose data come from a normal distribution with variance 1 and unknown mean μ. Let H be the hypothesis that μ is contained in the interval (−0.5, 0.5). Let H′ be the hypothesis that μ is contained in the interval (−0.82, 0.52). Since the first interval lies entirely inside the second, H implies H′. Then suppose you observe x = 2.18. The p-value for H is 0.0502 and the p-value for H′ is 0.0498. This says there is more evidence to support the hypothesis H that μ is in the smaller interval than there is to support the hypothesis H′ that μ is in the larger interval. If we adopt α = 0.05 as the cutoff for significance, we would reject the hypothesis that −0.82 < μ < 0.52 but accept the hypothesis that −0.5 < μ < 0.5. We’re willing to accept that we’ve found a bear, but doubtful that we’ve found a mammal.
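The two p-values can be checked with a few lines of Python. This is only a sketch: the post does not spell out the formula, so the code assumes the p-value for an interval hypothesis a ≤ μ ≤ b with one observation x from N(μ, 1) is Φ(a − x) + Φ(b − x) when x is at or above the midpoint of the interval, and Φ(x − a) + Φ(x − b) otherwise, where Φ is the standard normal CDF. The function names are made up for illustration.

```python
from math import erfc, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), via the complementary error function."""
    return 0.5 * erfc(-z / sqrt(2.0))


def interval_p_value(x: float, lo: float, hi: float) -> float:
    """p-value for H: lo <= mu <= hi given one observation x ~ N(mu, 1).

    Assumed formula (not stated in the post itself): add the two tail
    areas measured from the interval endpoints, taken on the side of
    the interval's midpoint where x falls.
    """
    if x < (lo + hi) / 2.0:
        return normal_cdf(x - lo) + normal_cdf(x - hi)
    return normal_cdf(lo - x) + normal_cdf(hi - x)


x = 2.18
p_h = interval_p_value(x, -0.5, 0.5)          # smaller interval, H
p_h_prime = interval_p_value(x, -0.82, 0.52)  # larger interval, H'

print(f"p-value for H:  {p_h:.4f}")        # 0.0502
print(f"p-value for H': {p_h_prime:.4f}")  # 0.0498
```

Under this formula, moving the near endpoint from 0.5 to 0.52 raises its tail term only slightly, while moving the far endpoint from −0.5 to −0.82 lowers its tail term by more, so the wider interval ends up with the smaller p-value.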

¹ Mark J. Schervish. “P values: What They Are and What They Are Not.” The American Statistician, August 1996, Vol. 50, No. 3.

Update: I added the details of the p-value calculation here.

20 thoughts on “p-values are inconsistent”

John Myles White

3 March 2010 at 14:05

I’m confused: does this anomaly come up because the larger hypothesis interval is skewed further away from the observation than the smaller interval?

Joseph Delaney

3 March 2010 at 16:12

I am confused by the confidence intervals. It looks like 0.52 > 0.50 (the top of the 95% confidence interval). Is it not the case that all Bears are Mammals? If so, should the smaller confidence interval not be nested inside the larger one?

Joseph Delaney

3 March 2010 at 16:13

Or, of course, I could have misread the two intervals and look like a fool. My apologies.

David Stivers

3 March 2010 at 16:16

While I agree with the basic premise that p-values can be misleading or inconsistent, because such a stretch is required to set this up, I don’t think that it is a great example of why I should be worried about the issue in practice.

Where did these two (hypothetical) intervals come from? Presumably, they represent 95% CI for a population sample of some quantity in the subtype (bears), which was found to have mean 0 and SD 0.255; and in the type (mammals), having mean -0.15 and SD 0.342.

So, first, the known measurement variation (SD=1) is at least 3 times that of the observed population variation for either the type or the subtype; not an unimaginable situation if the population estimates were derived from repeated measures, but in that case, we haven’t been given the relevant intervals. Second, given either a N(0, 0.255) or N(-0.15, 0.342), the probability of observing 2.18 or greater is close to 0; i.e., you’re extremely unlikely to observe 2.18 if measuring an actual mammal.

efrique

3 March 2010 at 20:29

Nice counterexample.

EnlightenedDuck

4 March 2010 at 08:38

OK – I’m missing something here….probably because I haven’t had my morning coffee yet and I’m a frequentist. What I want to do is take the difference between the observation (2.18), and the edge of the interval (.5 or .52), and normalize it (divide by 1, in this case). This gives us 1.68 or 1.66. I’m inclined towards 2-tailed tests (since it could be lower, too), giving p-values around 0.1. And yielding more evidence for not being in the tighter interval (not-a-bear), rather than not being in the wider interval (not-a-mammal). So I’m not seeing an inconsistency.

Of course, this completely ignores the lengths of the intervals, so I’m guessing that if I were to treat these as characterizing a (uniform?) prior, I’d get results closer to those of the post….

kav

4 March 2010 at 08:43

David Stivers: This is not true!

If you repeat the experiment with random values of mu_1, mu_2, mu_1′, mu_2′ you will see that the p-values of the first test are larger than those of the second test about 1 out of 5 times.

Best,

John

4 March 2010 at 10:03

I’ve updated the post to link to the expression Schervish uses for his p-value calculation.

David Stivers

4 March 2010 at 10:06

@kav: I’m sorry, I don’t follow.

kav

4 March 2010 at 11:40

David Stivers: Sorry, my bad.

“So, first, the known measurement variation (SD=1) is at least 3 times that of the observed population variation…”

I too thought that Schervish’s values may not be representative, so I ran his experiment, this time using many randomly generated ranges and mean points (i.e. the 2.18). I find that in about 20% of the cases, the p-values associated with the tighter range are smaller than those associated with the larger range. So, Schervish’s point holds even for milder values of the mean point.

kav

4 March 2010 at 11:41

sorry, should read “in about 20% of the cases, the p-values associated with the tighter range are *larger*”

Nqkoi

4 March 2010 at 12:54

Can you elaborate more on the p-values? I get 0.047 for the bigger interval and 0.043 for the smaller one.

John

4 March 2010 at 13:30

Nqkoi: Yes. Please see the formula for the p-values given here. I’ve verified the values in the example using this formula.

Will

4 March 2010 at 15:20

I think you need to change ‘inconsistent’ to ‘incoherent’. One is a clear, well-defined term and one is entirely subjective and loaded. Can you guess which one you used and which one the author himself used?

WJC

4 March 2010 at 21:38

Hi John,

First off, let me say I’ve enjoyed your blog.
Can you provide a link to the article?

John

4 March 2010 at 21:55

WJC: Thanks.

The American Statistician does not make its articles publicly available, so I can’t provide a link. You can access the journal’s archives if you are an ASA member. Also, the article is available via JSTOR; your library may have access to JSTOR.

prairiedock

5 March 2010 at 09:22

@wjc: Just google for it, using Google Scholar and including the search term ‘pdf’.

wei

22 April 2010 at 08:18

this is weird.

I have been taught that the p-value is a measure of the incompatibility between data and hypothesis. It is only used to disprove a (null) hypothesis. We do not accept all the things that Neyman and Pearson had proposed.

I also learned that the p-value is a relative measure, with a vague quantitative interpretation. Its numerical value is relative to the experimental conduct and to the hypothesis. Numerical comparisons of two p-values are meaningless if the sample sizes of the two experiments are different, or if the widths of the two interval null hypotheses are different.

Fran

21 March 2013 at 15:30

So the author makes use of p-values in a completely bogus way, gets a completely bogus result and, somehow, it is the p-values’ fault… nice.

Nick Adams

15 July 2020 at 14:33

The conclusion of this paper is invalid.
First, he argues that a point null hypothesis (say mu = 0) and a dividing hypothesis (say mu < 0) are simply instances of interval hypotheses where the interval width tends to zero and infinity respectively. This seems reasonable. Then he clearly demonstrates that two interval hypotheses can be incoherent when one is entirely nested inside the other. Finally, he concludes that this incoherence must then apply to point and dividing hypotheses as well. However, it is not possible to entirely nest one dividing hypothesis within another nor to nest point hypotheses at all. The incoherence only applies to interval hypotheses.
Incidentally, if two dividing hypotheses (two 1-sided tests) are used instead of an interval hypothesis the incoherence disappears.

