p-values are inconsistent

by John on March 3, 2010

If there’s evidence that an animal is a bear, you’d think there’s even more evidence that it’s a mammal. It turns out that p-values fail this common-sense criterion as a measure of evidence.

I just ran across a paper by Mark Schervish [1] that contains a criticism of p-values I had not seen before. p-values are commonly used as measures of evidence despite the protests of many statisticians. It seems reasonable that a measure of evidence would have the following property. If a hypothesis H implies another hypothesis H’, then evidence in favor of H’ should be at least as great as evidence in favor of H.

Here’s one of the examples from Schervish’s paper. Suppose data come from a normal distribution with variance 1 and unknown mean μ. Let H be the hypothesis that μ is contained in the interval (-0.5, 0.5). Let H’ be the hypothesis that μ is contained in the interval (-0.82, 0.52). Since the first interval lies inside the second, H implies H’. Then suppose you observe x = 2.18. The p-value for H is 0.0502 and the p-value for H’ is 0.0498. This says there is more evidence to support the hypothesis H that μ is in the smaller interval than there is to support the hypothesis H’ that μ is in the larger interval. If we adopt α = 0.05 as the cutoff for significance, we would reject the hypothesis that -0.82 < μ < 0.52 but accept the hypothesis that -0.5 < μ < 0.5. We’re willing to accept that we’ve found a bear, but doubtful that we’ve found a mammal.
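
The two p-values can be checked with a few lines of Python. This is only a sketch: the formula is not spelled out in this post, so treat it as an assumption. For a single observation x from N(μ, 1) and the hypothesis a ≤ μ ≤ b, the p-value is taken to be Φ(x − a) + Φ(x − b) when x falls below the midpoint of the interval and Φ(a − x) + Φ(b − x) otherwise, where Φ is the standard normal CDF. The function names `phi` and `interval_p_value` are made up for this illustration.

```python
from math import erfc, sqrt


def phi(z):
    """Standard normal CDF, computed from the complementary error function."""
    return 0.5 * erfc(-z / sqrt(2.0))


def interval_p_value(x, lo, hi):
    """p-value for H: lo <= mu <= hi, given one observation x ~ N(mu, 1).

    Assumed formula (not stated in the post itself): add the two tail
    probabilities measured from the interval endpoints, on whichever
    side of the interval midpoint the observation falls.
    """
    if x < (lo + hi) / 2.0:
        return phi(x - lo) + phi(x - hi)
    return phi(lo - x) + phi(hi - x)


x = 2.18
p_small = interval_p_value(x, -0.5, 0.5)    # H:  mu in (-0.5, 0.5)
p_large = interval_p_value(x, -0.82, 0.52)  # H': mu in (-0.82, 0.52)

print(f"p-value for H  = {p_small:.4f}")
print(f"p-value for H' = {p_large:.4f}")
```

Under that assumed formula the smaller interval gets the larger p-value, which is the ordering described above.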

[1] Mark J. Schervish. “P values: What They Are and What They Are Not.” The American Statistician, August 1996, Vol. 50, No. 3.

Update: I added the details of the p-value calculation here.



{ 18 comments }

1

John Myles White 03.03.10 at 14:05

I’m confused: does this anomaly come up because the larger hypothesis interval is skewed further away from the observation than the smaller interval?

2

Joseph Delaney 03.03.10 at 16:12

I am confused by the confidence intervals. It looks like 0.52 > 0.50 (the top of the 95% confidence interval). Is it not the case that all Bears are Mammals? If so, should the smaller confidence interval not be nested inside the larger one?

3

Joseph Delaney 03.03.10 at 16:13

Or, of course, I could have misread the two intervals and look like a fool. My apologies.

4

David Stivers 03.03.10 at 16:16

While I agree with the basic premise that p-values can be misleading or inconsistent, because such a stretch is required to set this up, I don’t think that it is a great example of why I should be worried about the issue in practice.

Where did these two (hypothetical) intervals come from? Presumably, they represent 95% CI for a population sample of some quantity in the subtype (bears), which was found to have mean 0 and SD 0.255; and in the type (mammals), having mean -0.15 and SD 0.342.

So, first, the known measurement variation (SD=1) is at least 3 times that of the observed population variation for either the type or the subtype; not an unimaginable situation if the population estimates were derived from repeated measures, but in that case, we haven’t been given the relevant intervals. Second, given either a N(0, 0.255) or N(-0.15, 0.342), the probability of observing 2.18 or greater is close to 0; i.e., you’re extremely unlikely to observe 2.18 if measuring an actual mammal.

5

efrique 03.03.10 at 20:29

Nice counterexample.

6

EnlightenedDuck 03.04.10 at 08:38

OK – I’m missing something here….probably because I haven’t had my morning coffee yet and I’m a frequentist. What I want to do is take the difference between the observation (2.18), and the edge of the interval (.5 or .52), and normalize it (divide by 1, in this case). This gives us 1.68 or 1.66. I’m inclined towards 2-tailed tests (since it could be lower, too), giving p-values around 0.1. And yielding more evidence for not being in the tighter interval (not-a-bear), rather than not being in the wider interval (not-a-mammal). So I’m not seeing an inconsistency.

Of course, this completely ignores the lengths of the intervals, so I’m guessing that if I were to treat these as characterizing a (uniform?) prior, I’d get results closer to those of the post….
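
The arithmetic in this comment can be reproduced with a short Python sketch. It treats the upper endpoint of each interval as a point null and computes a two-sided normal p-value from the distance to the observation, which is the commenter's calculation and not the interval test used in the post. The helper name `two_sided_p` is made up for this illustration.

```python
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided standard normal tail probability, P(|Z| >= |z|)."""
    return erfc(abs(z) / sqrt(2.0))


x = 2.18       # the observation
sigma = 1.0    # known standard deviation

results = {}
for upper in (0.5, 0.52):
    z = (x - upper) / sigma          # distance from the interval edge, in SDs
    results[upper] = two_sided_p(z)
    print(f"edge = {upper}: z = {z:.2f}, two-sided p = {results[upper]:.3f}")
```

Measured this way, the tighter interval gets the smaller p-value, which matches the ordering the commenter describes.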

7

kav 03.04.10 at 08:43

David Stivers: This is not true!

If you repeat the experiment with random values of μ_1, μ_2, μ_1′, μ_2′, you will see that the p-values of the first test are larger than those of the second test about 1 out of 5 times.

Best,

8

John 03.04.10 at 10:03

I’ve updated the post to link to the expression Schervish uses for his p-value calculation.

9

David Stivers 03.04.10 at 10:06

@kav: I’m sorry, I don’t follow.

10

kav 03.04.10 at 11:40

David Stivers: Sorry, my bad.

“So, first, the known measurement variation (SD=1) is at least 3 times that of the observed population variation…”

I too thought that Schervish’s values may not be representative, so I ran his experiment, this time using many randomly generated ranges and mean points (i.e. the 2.18). I find that in about 20% of the cases, the p-values associated with the tighter range are smaller than those associated with the larger range. So, Schervish’s point holds even for milder values of the mean point.

11

kav 03.04.10 at 11:41

sorry, should read “in about 20% of the cases, the p-values associated with the tighter range are *larger*”
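
An experiment along these lines could be sketched in Python as follows. The comments do not say how the ranges and observations were generated, so the uniform draws below are assumptions, as is the interval p-value formula (Φ(x − a) + Φ(x − b) below the interval midpoint, Φ(a − x) + Φ(b − x) otherwise). The fraction it prints depends on those choices and is not meant to reproduce the 20% figure.

```python
import random
from math import erfc, sqrt


def phi(z):
    """Standard normal CDF."""
    return 0.5 * erfc(-z / sqrt(2.0))


def interval_p_value(x, lo, hi):
    """Assumed p-value for H: lo <= mu <= hi, one observation x ~ N(mu, 1)."""
    if x < (lo + hi) / 2.0:
        return phi(x - lo) + phi(x - hi)
    return phi(lo - x) + phi(hi - x)


def fraction_incoherent(trials=20_000, seed=0):
    """Fraction of random trials where the nested (tighter) interval
    gets a larger p-value than the interval that contains it."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        # Outer interval: two uniform endpoints on [-1, 1] (an assumption).
        a, b = sorted(rng.uniform(-1.0, 1.0) for _ in range(2))
        # Inner interval: two uniform endpoints inside the outer interval.
        c, d = sorted(rng.uniform(a, b) for _ in range(2))
        # Observation: uniform on [-3, 3] (also an assumption).
        x = rng.uniform(-3.0, 3.0)
        if interval_p_value(x, c, d) > interval_p_value(x, a, b):
            count += 1
    return count / trials


frac = fraction_incoherent()
print(f"tighter interval had the larger p-value in {frac:.1%} of trials")
```

Changing the ranges for the endpoints or the observation will change the fraction, so it is worth trying a few settings before reading much into any single number.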

12

Nqkoi 03.04.10 at 12:54

Can you elaborate more on the p-values? I get 0.047 for the bigger interval and 0.043 for the smaller one.

13

John 03.04.10 at 13:30

Nqkoi: Yes. Please see the formula for the p-values given here. I’ve verified the values in the example using this formula.

14

Will 03.04.10 at 15:20

I think you need to change ‘inconsistent’ to incoherent. One has a clear well defined term and one is entirely subjective and loaded. Can you guess which one you used and which one the author himself used?

15

WJC 03.04.10 at 21:38

Hi John,

First off, let me say I’ve enjoyed your blog.
Can you provide a link to the article?

16

John 03.04.10 at 21:55

WJC: Thanks.

The American Statistician does not make their articles publicly available, so I can’t provide a link. You can access the journal’s archives if you are an ASA member. Also, the article is available via JSTOR; your library may have access to JSTOR.

17

prairiedock 03.05.10 at 09:22

@wjc: Just google for it, using Google Scholar and including the search term ‘pdf’.

18

wei 04.22.10 at 08:18

this is weird.

I have been taught that a p-value is a measure of the incompatibility between data and hypothesis. It is only used to disprove a (null) hypothesis. We do not accept all the things that Neyman and Pearson had proposed.

I also learned that a p-value is a relative measure, with a vague quantitative interpretation. Its numerical value is relative to the experimental conduct and to the hypothesis. Numerical comparisons of two p-values are meaningless if the sample sizes of the 2 experiments are different, or if the widths of the 2 interval null hypotheses are different.
