The following list summarizes five criticisms of significance testing as it is commonly practiced.

- Andrew Gelman: In reality, null hypotheses are nearly always false. Is drug *A* identically effective as drug *B*? Certainly not. You know before doing an experiment that there must be *some* difference that would show up given enough data.
- Jim Berger: A small *p*-value means the data were unlikely under the null hypothesis. Maybe the data were just as unlikely under the alternative hypothesis. Comparisons of hypotheses should be conditional on the data.
- Stephen Ziliak and Deirdre McCloskey: Statistical significance is not the same as scientific significance. The most important question for science is the size of an effect, not whether the effect exists.
- William Gosset: Statistical error is only one component of real error, maybe a small component. When you actually conduct multiple experiments rather than speculate about hypothetical experiments, the variability of your data goes up.
- John Ioannidis: Small *p*-values do not mean small probability of being wrong. In one review, 74% of studies with *p*-value 0.05 were found to be wrong.

I’ve never understood the alternative though. Confidence intervals? What if the best prediction a theory can make is: A and B should be different. In my field (psycholinguistics), we’re not at the point yet where we can say, under condition A, I expect a 250 ms reaction time, and under B a 300 ms reaction time.

I’m happy to accept that significance testing is a bad tool. What is a good replacement?

Using Bayes factors — the ratio of the probabilities of the data under the two competing hypotheses — gets around some of the weaknesses of significance testing, especially the criticisms of Jim Berger and John Ioannidis. In fact, part of the problem with the naive use of p-values is that people often think that they *are* Bayes factors, without using that term. Bayes factors have an intuitive interpretation in terms of decibels, by analogy to sound intensity.
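As a concrete illustration of the decibel reading, here is a minimal sketch comparing two point hypotheses about a coin. The 60-heads-in-100-flips data and the hypothesized probabilities are invented for illustration:

```python
import math

def log10_bayes_factor(heads, flips, p1, p0=0.5):
    # log10 of the likelihood ratio P(data | H1) / P(data | H0);
    # the binomial coefficient is common to both terms and cancels.
    tails = flips - heads
    ll1 = heads * math.log10(p1) + tails * math.log10(1 - p1)
    ll0 = heads * math.log10(p0) + tails * math.log10(1 - p0)
    return ll1 - ll0

def decibels(log10_bf):
    # Evidence in decibels, by analogy to sound intensity:
    # 10 * log10(Bayes factor)
    return 10 * log10_bf

lb = log10_bayes_factor(60, 100, p1=0.6)
print(round(10 ** lb, 2), round(decibels(lb), 1))  # 7.49 8.7
```

So 60 heads in 100 flips favors p = 0.6 over p = 0.5 by a factor of about 7.5, or roughly 9 dB of evidence.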

Significance testing tends to be overused. Everyone wants that p-value for their publication (and the journals demand it), so a lot of hypothesis tests are forced when a simple estimate might have been more appropriate. It’s important to remember that a significance test is not the end of the investigation, just a step in the process.

@Stephen & Deirdre:

Right! Not only scientific significance: economic significance is also something other than statistical significance, and much more important.

In commercial data mining, we work with millions of customers. Almost any effect, no matter how tiny, turns out to be statistically significant.

We don’t care. The model must increase the expected surplus sales value, i.e. the economic significance must be worthwhile.

It is disappointing (though not surprising) to see these “criticisms” from years ago without any reply, even though numerous replies exist in the literature. Just a word on the main alleged problems:

Andrew Gelman: If null hypotheses are presumed false, then why do Bayesian hypothesis tests assign a spiked degree of belief to them? They do so in order to get a disagreement with p-values, but once again, this just shows what’s wrong with the Bayesian hypothesis tests that assign spiked priors to the null (I’m not saying Gelman endorses them).

Jim Berger: This is not what a small p-value means, and it would be a fallacious use of tests to take a small p-value as evidence FOR an alternative that was just as unlikely. The inference to the alternative would have passed a highly INsevere test. There is no argument for claiming that “comparisons of hypotheses should be conditional on the data,” but there is plenty of evidence that this robs one of the ability to calculate any error probabilities. Now that Berger has himself abandoned the likelihood principle, apparently he agrees.

To Stephen Ziliak and Deirdre McCloskey: These authors join others in the ill-founded hysteria about the secret “cults” of significance testers that are allegedly undermining science. Any even one-quarter-way respectable treatment of frequentist statistical significance tests blatantly announces that statistical significance is not the same as scientific significance. On the business of “effect size,” there are many ways to interpret statistical tests so that the inferences are clearly in terms of the discrepancies for which there is and isn’t evidence. These authors do not read such defenders, since they attack D. Mayo on these grounds, even though she has developed a clear way of reporting on discrepancies (one that is superior, by the way, to things like confidence intervals).

To John Ioannidis: It is well known that small p-values do not mean small probability of being wrong, but this fails utterly to show that we should instead be using posteriors (regardless of how they are interpreted). He needs to tell us where he gets the prior that leads to the 74% posterior—I guarantee it involves assigning a probability of p to a hypothesis because it has been randomly selected from an “urn” of hypotheses, p% of which are believed to be true! This is a fallacious and highly irrelevant calculation that leads to absurd results. For references, see D. Mayo.

Gelman has often said, as recently as yesterday, that testing a point-null hypothesis via Bayesian techniques is as bad as testing one via frequentist techniques.

John, how do you read the Gelman critique? I’m not sure I understand it. I was just reading an article by Brad DeLong and Kevin Lang, “Are all economic hypotheses false?”, and they point out that

I haven’t read Popper, so I’m wondering about this clause. Does this conflict with Gelman’s critique?

Jeffrey: I take Gelman’s critique to refer specifically to null hypotheses of equality, not null hypotheses in general.

People commonly assume that strong evidence against exact equality of two values is equivalent to pretty good evidence of an important gap between the two values. This isn’t necessarily so. There’s some reason for this belief: It’s easier to show statistical significance when the effect size is larger. But it’s also easier to show significance when you’ve got a lot of data. And to point #3, statistical significance is not the same as scientific significance.
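The sample-size point can be made concrete with the textbook two-sample z statistic. This is only a sketch: the 0.01-standard-deviation “effect” and the per-group sample sizes below are made up for illustration:

```python
import math

def z_and_p(diff, sd, n):
    # Two-sample z statistic for a true mean difference `diff`
    # (common standard deviation `sd`, n observations per group),
    # and its two-sided p-value from the normal distribution.
    z = diff / (sd * math.sqrt(2.0 / n))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# A scientifically trivial effect of 0.01 standard deviations:
# it only reaches "significance" once the samples get huge.
for n in (100, 10_000, 1_000_000):
    z, p = z_and_p(0.01, 1.0, n)
    print(n, round(z, 2), f"{p:.3g}")
```

With n = 100 per group the p-value is near 1; with a million observations per group the same tiny effect yields a vanishingly small p-value.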

John: Thanks for your reply.

I see the point about the interplay between sample size, effect size, and statistical significance. The way I was taught frequentist methods is to assume a null that is unlikely, and choose an alternative hypothesis that is mutually exclusive and exhaustive with the null. For example, if I’m testing whether or not adding piece-rates improves worker productivity, my null is that there is no difference in mean productivity between the treatment and control.

If I find a statistically significant difference, that tells me that a difference exists. It isn’t clear that the null is necessarily impossible, though I agree it is highly improbable. That means the test can only tell me that the incentive changed productivity, not in which direction.

I rely on the confidence interval to tell me that, and to tell me precisely how far away from zero the effect could be (under classical assumptions) in both best and worst cases.
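That use of a confidence interval can be sketched under classical normal-approximation assumptions. The productivity means and standard errors below are hypothetical numbers, not from any real experiment:

```python
import math

def mean_diff_ci(mean_t, mean_c, se_t, se_c, z=1.96):
    # 95% confidence interval for the difference in means between
    # treatment and control, assuming independent samples and a
    # normal approximation (classical assumptions).
    diff = mean_t - mean_c
    se = math.sqrt(se_t**2 + se_c**2)
    return diff - z * se, diff + z * se

# Hypothetical productivity figures (units per day):
lo, hi = mean_diff_ci(mean_t=52.0, mean_c=50.0, se_t=0.6, se_c=0.6)
print(round(lo, 2), round(hi, 2))  # 0.34 3.66
```

The interval excludes zero (so the effect is “significant” at the 5% level), but more usefully it bounds the plausible effect: anywhere from roughly a third of a unit to well over three units per day.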

I’m just having a hard time figuring out why the impossibility (improbability) of the null should matter for the test. I’m far more convinced of the other critiques. I’m also willing to believe authors regularly state more than the test allows them to, but it’s the job of the reader, ultimately, to figure out if they’re being snow-jobbed.

I think we should follow Robert Abelson’s advice and rely on the MAGIC criteria instead:

- Magnitude – how big is it? Large effects are impressive; small effects are not.

- Articulation – how precise is it? Precise statements are more impressive than vague ones. Statements with many exceptions are less impressive than those with fewer.

- Generality – how big a group does it apply to? Statements that apply to a large group are more impressive than those that apply to a small one.

- Interestingness – interesting effects are more impressive.

- Credibility – extraordinary claims require extraordinary evidence.

Peter: Too complicated. Can’t you just give me a number? :)

In all seriousness, I think this is at the root of the problem. People want a single number, even better a yes/no answer, and they’ll grasp at anything that appears to give it to them.

John: 42.

Glad to see this come up again. Is that “D. Mayo” comment just flamebait? D. Mayo is a brilliant woman, so it seems implausible that she would write a comment on a website in which she refers to herself in the third person.

On the general criticism of the p-value, especially given the very recent publication from the ASA (2016), this totally does not make any sense to me. In my opinion, the p-value is a useful tool. It is just the people who use it mistakenly, probably due to a lack of fundamental knowledge of it and of how to use it.

A simple analogy for all of these criticisms: it is like blaming the car for the car accident! That totally does not make any sense!