<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Five criticisms of significance testing</title>
	<atom:link href="http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/</link>
	<description>The blog of John D. Cook</description>
	<lastBuildDate>Sat, 11 Feb 2012 22:42:11 -0500</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Jeffrey Horn</title>
		<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/comment-page-1/#comment-114227</link>
		<dc:creator>Jeffrey Horn</dc:creator>
		<pubDate>Fri, 11 Nov 2011 23:42:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=902#comment-114227</guid>
		<description>John: Thanks for your reply.

I see the point about the interplay between sample size, effect size, and statistical significance. The way I was taught frequentist methods is to assume a null that is unlikely, and choose an alternate hypothesis that is mutually exclusive and exhaustive with the null. For example, if I&#039;m testing whether or not adding piece-rates improves worker productivity, my null is that there is no difference in mean productivity between the treatment and control.

If I find a statistically significant difference, that tells me that a difference exists. It isn&#039;t clear that the null is necessarily impossible, though I agree it is highly improbable. That means the test can only tell me that the incentive changed productivity, not in which direction. 

I rely on the confidence interval to tell me that, and to tell me precisely how far away from zero the effect could be (under classical assumptions) in both best and worst cases.

I&#039;m just having a hard time figuring out why the impossibility (improbability) of the null should matter for the test. I&#039;m far more convinced of the other critiques. I&#039;m also willing to believe authors regularly state more than the test allows them to, but it&#039;s the job of the reader, ultimately, to figure out if they&#039;re being snow-jobbed.</description>
		<content:encoded><![CDATA[<p>John: Thanks for your reply.</p>
<p>I see the point about the interplay between sample size, effect size, and statistical significance. The way I was taught frequentist methods is to assume a null that is unlikely, and choose an alternate hypothesis that is mutually exclusive and exhaustive with the null. For example, if I&#8217;m testing whether or not adding piece-rates improves worker productivity, my null is that there is no difference in mean productivity between the treatment and control.</p>
<p>If I find a statistically significant difference, that tells me that a difference exists. It isn&#8217;t clear that the null is necessarily impossible, though I agree it is highly improbable. That means the test can only tell me that the incentive changed productivity, not in which direction. </p>
<p>I rely on the confidence interval to tell me that, and to tell me precisely how far away from zero the effect could be (under classical assumptions) in both best and worst cases.</p>
<p>I&#8217;m just having a hard time figuring out why the impossibility (improbability) of the null should matter for the test. I&#8217;m far more convinced of the other critiques. I&#8217;m also willing to believe authors regularly state more than the test allows them to, but it&#8217;s the job of the reader, ultimately, to figure out if they&#8217;re being snow-jobbed.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/comment-page-1/#comment-113810</link>
		<dc:creator>John</dc:creator>
		<pubDate>Fri, 11 Nov 2011 00:42:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=902#comment-113810</guid>
		<description>Jeffrey: I take Gelman&#039;s critique to refer specifically to null hypotheses of equality, not null hypotheses in general. 

People commonly assume that strong evidence against exact equality of two values is equivalent to pretty good evidence of an important gap between the two values.  This isn&#039;t necessarily so. There&#039;s some reason for this belief: It&#039;s easier to show statistical significance when the effect size is larger. But it&#039;s also easier to show significance when you&#039;ve got a lot of data. And to point #3, statistical significance is not the same as scientific significance.</description>
		<content:encoded><![CDATA[<p>Jeffrey: I take Gelman&#8217;s critique to refer specifically to null hypotheses of equality, not null hypotheses in general. </p>
<p>People commonly assume that strong evidence against exact equality of two values is equivalent to pretty good evidence of an important gap between the two values.  This isn&#8217;t necessarily so. There&#8217;s some reason for this belief: It&#8217;s easier to show statistical significance when the effect size is larger. But it&#8217;s also easier to show significance when you&#8217;ve got a lot of data. And to point #3, statistical significance is not the same as scientific significance.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeffrey Horn</title>
		<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/comment-page-1/#comment-113758</link>
		<dc:creator>Jeffrey Horn</dc:creator>
		<pubDate>Thu, 10 Nov 2011 22:30:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=902#comment-113758</guid>
		<description>John, how do you read the Gelman critique? I&#039;m not sure I understand it. I was just reading an article from Brad DeLong and Kevin Lang, &lt;a href=&quot;http://www.jstor.org/stable/2138833&quot; rel=&quot;nofollow&quot;&gt;&quot;Are all economic hypotheses false?&quot;&lt;/a&gt;, and they point out that
&lt;blockquote&gt; Although the falsificationist view of scientific methodology (Popper 1959) stresses the importance of specifying test events that cannot occur under the null hypothesis...&lt;/blockquote&gt;

I haven&#039;t read Popper, so I&#039;m wondering about this clause. Does this conflict with Gelman&#039;s critique?</description>
		<content:encoded><![CDATA[<p>John, how do you read the Gelman critique? I&#8217;m not sure I understand it. I was just reading an article from Brad DeLong and Kevin Lang, <a href="http://www.jstor.org/stable/2138833" rel="nofollow">&#8220;Are all economic hypotheses false?&#8221;</a>, and they point out that</p>
<blockquote><p> Although the falsificationist view of scientific methodology (Popper 1959) stresses the importance of specifying test events that cannot occur under the null hypothesis&#8230;</p></blockquote>
<p>I haven&#8217;t read Popper, so I&#8217;m wondering about this clause. Does this conflict with Gelman&#8217;s critique?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/comment-page-1/#comment-74232</link>
		<dc:creator>John</dc:creator>
		<pubDate>Sun, 03 Apr 2011 19:15:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=902#comment-74232</guid>
		<description>Gelman has often said, as recently as &lt;a href=&quot;http://www.stat.columbia.edu/~cook/movabletype/archives/2011/04/so-called_bayes.html&quot; rel=&quot;nofollow&quot;&gt;yesterday&lt;/a&gt;, that testing a point-null hypothesis via Bayesian techniques is as bad as testing one via frequentist techniques.</description>
		<content:encoded><![CDATA[<p>Gelman has often said, as recently as <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2011/04/so-called_bayes.html" rel="nofollow">yesterday</a>, that testing a point-null hypothesis via Bayesian techniques is as bad as testing one via frequentist techniques.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Deborah Mayo</title>
		<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/comment-page-1/#comment-74231</link>
		<dc:creator>Deborah Mayo</dc:creator>
		<pubDate>Sun, 03 Apr 2011 19:08:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=902#comment-74231</guid>
		<description>It is disappointing (though not surprising) to see these &quot;criticisms&quot; from years ago without any reply, even though numerous replies exist in the literature.  Just a word on the main alleged problems:
	Andrew Gelman: If null hypotheses are presumed false, then why do Bayesian hypothesis tests assign a spiked degree of belief to them?  They do so in order to get a disagreement with p-values, but once again, this just shows what’s wrong with the Bayesian hypothesis tests that assign spiked priors to the null (I&#039;m not saying Gelman endorses them).
	Jim Berger: This is not what a small p-value means, and it would be a fallacious use of tests to take a small p-value as evidence FOR an alternative that was just as unlikely.  The inference to the alternative would have passed a highly INsevere test.  There is no argument for claiming that “comparisons of hypotheses should be conditional on the data”, but plenty of evidence that this robs one of calculating any error probabilities.  Now that Berger has himself abandoned the likelihood principle, apparently he agrees.
	To Stephen Ziliak and Deirdre McCloskey:  These authors join others in the ill-founded hysteria about the secret “cults” of significance testers that are allegedly undermining science.  Any even one-quarter way respectable treatment of statistical significance tests of the frequentist variety blatantly announces that statistical significance is not the same as scientific significance.  On the business of “effect size”, there are many ways to interpret statistical tests so that the inferences are clearly in terms of the discrepancies for which there is and isn’t evidence.  These authors do not read such defenders, since they attack D. Mayo on these grounds, even though she has developed a clear way of reporting on discrepancies (that is superior, by the way, to things like confidence intervals).
	To John Ioannidis: It is well known that small p-values do not mean small probability of being wrong, but this fails utterly to show that we should instead be using posteriors (regardless of how they are interpreted).  He needs to tell us where he gets the prior that leads to the 74% posterior---I guarantee it involves assigning a probability of p to a hypothesis because it has been randomly selected from an “urn” of hypotheses, p% of which are believed to be true!  This is a fallacious and highly irrelevant calculation that leads to absurd results.  For references, see D. Mayo.</description>
		<content:encoded><![CDATA[<p>It is disappointing (though not surprising) to see these &#8220;criticisms&#8221; from years ago without any reply, even though numerous replies exist in the literature.  Just a word on the main alleged problems:<br />
	Andrew Gelman: If null hypotheses are presumed false, then why do Bayesian hypothesis tests assign a spiked degree of belief to them?  They do so in order to get a disagreement with p-values, but once again, this just shows what’s wrong with the Bayesian hypothesis tests that assign spiked priors to the null (I&#8217;m not saying Gelman endorses them).<br />
	Jim Berger: This is not what a small p-value means, and it would be a fallacious use of tests to take a small p-value as evidence FOR an alternative that was just as unlikely.  The inference to the alternative would have passed a highly INsevere test.  There is no argument for claiming that “comparisons of hypotheses should be conditional on the data”, but plenty of evidence that this robs one of calculating any error probabilities.  Now that Berger has himself abandoned the likelihood principle, apparently he agrees.<br />
	To Stephen Ziliak and Deirdre McCloskey:  These authors join others in the ill-founded hysteria about the secret “cults” of significance testers that are allegedly undermining science.  Any even one-quarter way respectable treatment of statistical significance tests of the frequentist variety blatantly announces that statistical significance is not the same as scientific significance.  On the business of “effect size”, there are many ways to interpret statistical tests so that the inferences are clearly in terms of the discrepancies for which there is and isn’t evidence.  These authors do not read such defenders, since they attack D. Mayo on these grounds, even though she has developed a clear way of reporting on discrepancies (that is superior, by the way, to things like confidence intervals).<br />
	To John Ioannidis: It is well known that small p-values do not mean small probability of being wrong, but this fails utterly to show that we should instead be using posteriors (regardless of how they are interpreted).  He needs to tell us where he gets the prior that leads to the 74% posterior&#8212;I guarantee it involves assigning a probability of p to a hypothesis because it has been randomly selected from an “urn” of hypotheses, p% of which are believed to be true!  This is a fallacious and highly irrelevant calculation that leads to absurd results.  For references, see D. Mayo.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: zyxo</title>
		<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/comment-page-1/#comment-60992</link>
		<dc:creator>zyxo</dc:creator>
		<pubDate>Mon, 17 Jan 2011 17:13:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=902#comment-60992</guid>
		<description>@Stephen &amp; Deirdre:
Right! Not only scientific significance but also economic significance is something other than statistical significance, and much more important.
In commercial data mining, we work with millions of customers.  Almost any effect, no matter how tiny, turns out to be statistically significant.
We don&#039;t care.  The model must increase the expected surplus sales value, i.e. the economic significance must be worthwhile.</description>
		<content:encoded><![CDATA[<p>@Stephen &amp; Deirdre:<br />
Right! Not only scientific significance but also economic significance is something other than statistical significance, and much more important.<br />
In commercial data mining, we work with millions of customers.  Almost any effect, no matter how tiny, turns out to be statistically significant.<br />
We don&#8217;t care.  The model must increase the expected surplus sales value, i.e. the economic significance must be worthwhile.</p>
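<p>A minimal sketch of that point, assuming a two-sample z-test with known standard deviation, equal group sizes, and an observed difference exactly equal to the true effect. The function name and all numbers are hypothetical illustrations, not figures from any real data set:</p>

```python
import math


def two_sided_p_value(effect, sd, n_per_group):
    """Two-sided p-value of a two-sample z-test (known sd, equal group sizes).

    Assumes the observed difference in means equals `effect` exactly.
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = effect / standard_error
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical effect: one hundredth of a standard deviation.
tiny_effect = 0.01
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(tiny_effect, 1.0, n))
```

<p>With the effect held fixed, the p-value moves from far above 0.05 at 100 customers per group to far below it at a million per group, although the effect itself is no larger.</p>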
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan</title>
		<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/comment-page-1/#comment-10028</link>
		<dc:creator>Dan</dc:creator>
		<pubDate>Fri, 21 Nov 2008 05:35:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=902#comment-10028</guid>
		<description>Significance testing tends to be overused. Everyone wants that p-value for their publication (and the journals demand it), so a lot of hypothesis tests are forced when a simple estimate might have been more appropriate. It is important to remember that a significance test is not the end of the investigation, just a step in the process.</description>
		<content:encoded><![CDATA[<p>Significance testing tends to be overused. Everyone wants that p-value for their publication (and the journals demand it), so a lot of hypothesis tests are forced when a simple estimate might have been more appropriate. It is important to remember that a significance test is not the end of the investigation, just a step in the process.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/comment-page-1/#comment-9928</link>
		<dc:creator>John</dc:creator>
		<pubDate>Wed, 19 Nov 2008 04:02:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=902#comment-9928</guid>
		<description>Using Bayes factors (the ratio of the probabilities of the data under the two hypotheses, which equals the posterior odds when the prior odds are even) gets around some of the weaknesses of significance testing, especially the criticisms of Jim Berger and John Ioannidis. In fact, part of the problem with the naive use of p-values is that people often think that they &lt;i&gt;are&lt;/i&gt; Bayes factors, without using that term.

Bayes factors have an intuitive interpretation in terms of &lt;a href=&quot;http://www.johndcook.com/blog/2008/03/11/how-loud-is-the-evidence/&quot; rel=&quot;nofollow&quot;&gt;decibels&lt;/a&gt; by analogy to sound intensity.</description>
		<content:encoded><![CDATA[<p>Using Bayes factors (the ratio of the probabilities of the data under the two hypotheses, which equals the posterior odds when the prior odds are even) gets around some of the weaknesses of significance testing, especially the criticisms of Jim Berger and John Ioannidis. In fact, part of the problem with the naive use of p-values is that people often think that they <i>are</i> Bayes factors, without using that term.</p>
<p>Bayes factors have an intuitive interpretation in terms of <a href="http://www.johndcook.com/blog/2008/03/11/how-loud-is-the-evidence/" rel="nofollow">decibels</a> by analogy to sound intensity.</p>
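<p>A minimal sketch of the decibel scale, assuming the common convention of ten times the base-10 logarithm of the Bayes factor. The function name is a hypothetical illustration:</p>

```python
import math


def evidence_in_decibels(bayes_factor):
    """Express a Bayes factor on a decibel scale: 10 * log10(bayes_factor).

    Assumed convention, by analogy to sound intensity ratios.
    """
    if bayes_factor <= 0:
        raise ValueError("a Bayes factor must be positive")
    return 10.0 * math.log10(bayes_factor)


for bf in (1, 2, 10, 100):
    print(bf, evidence_in_decibels(bf))
```

<p>Under this convention a Bayes factor of 1 is 0 dB (no evidence either way), a factor of 10 is 10 dB, and a factor of 100 is 20 dB.</p>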
]]></content:encoded>
	</item>
	<item>
		<title>By: Pedro</title>
		<link>http://www.johndcook.com/blog/2008/11/18/five-criticisms-of-significance-testing/comment-page-1/#comment-9927</link>
		<dc:creator>Pedro</dc:creator>
		<pubDate>Wed, 19 Nov 2008 03:50:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=902#comment-9927</guid>
		<description>I&#039;ve never understood the alternative though. Confidence intervals? What if the best prediction a theory can make is: A and B should be different. In my field (psycholinguistics), we&#039;re not at the point yet where we can say, under condition A, I expect a 250 ms reaction time, and under B a 300 ms reaction time.

I&#039;m happy to accept that significance testing is a bad tool. What is a good replacement?</description>
		<content:encoded><![CDATA[<p>I&#8217;ve never understood the alternative though. Confidence intervals? What if the best prediction a theory can make is: A and B should be different. In my field (psycholinguistics), we&#8217;re not at the point yet where we can say, under condition A, I expect a 250 ms reaction time, and under B a 300 ms reaction time.</p>
<p>I&#8217;m happy to accept that significance testing is a bad tool. What is a good replacement?</p>
]]></content:encoded>
	</item>
</channel>
</rss>


