<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Endeavour &#187; Statistics</title>
	<atom:link href="http://www.johndcook.com/blog/category/statistics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.johndcook.com/blog</link>
	<description>John D. Cook</description>
	<lastBuildDate>Thu, 16 May 2013 13:09:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Bad normal approximation</title>
		<link>http://www.johndcook.com/blog/2013/04/23/bad-normal-approximation/</link>
		<comments>http://www.johndcook.com/blog/2013/04/23/bad-normal-approximation/#comments</comments>
		<pubDate>Tue, 23 Apr 2013 20:59:43 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=13292</guid>
		<description><![CDATA[Sometimes you can approximate a binomial distribution with a normal distribution. Under the right conditions, a Binomial(n, p) has approximately the distribution of a normal with the same mean and variance, i.e. mean np and variance np(1-p). The approximation works<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2013/04/23/bad-normal-approximation/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Sometimes you can approximate a binomial distribution with a normal distribution. Under the right conditions, a Binomial(<em>n</em>, <em>p</em>) has approximately the distribution of a normal with the same mean and variance, i.e. mean <em>np</em> and variance <em>np</em>(1-<em>p</em>). The approximation works best when <em>n</em> is large and <em>p</em> is near 1/2.</p>
<p>This afternoon I was reading a paper that used a normal approximation to a binomial when <em>n</em> was around 10 and <em>p</em> around 0.001.  The relative error was enormous. The paper used the approximation to find an analytical expression for something else and the error propagated.</p>
<p><strong>A common rule of thumb</strong> is that the normal approximation works well when <em>np</em> &gt; 5 and <em>n</em>(1-<em>p</em>) &gt; 5.  This says that the closer <em>p</em> is to 0 or 1, the larger <em>n</em> needs to be. In this case <em>p</em> was very small, but <em>n</em> was not large enough to compensate since <em>np</em> was on the order of 0.01, far less than 5.</p>
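<p>The arithmetic behind that rule is easy to sketch in Python. The function names below are my own, not from the paper, and the threshold of 5 is just the rule of thumb quoted above.</p>
<pre>
from math import floor

def rule_of_thumb_margin(n, p):
    '''Return min(np, n(1-p)); the rule of thumb asks for this to exceed 5.'''
    return min(n*p, n*(1 - p))

def smallest_n(p, threshold=5.0):
    '''Smallest n for which both np and n(1-p) exceed the threshold.'''
    return floor(threshold / min(p, 1 - p)) + 1

print(rule_of_thumb_margin(10, 0.001))   # about 0.01, far below 5
print(smallest_n(0.001))                 # n would need to be in the thousands
print(smallest_n(0.5))                   # p = 1/2 needs the smallest n
</pre>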
<p><strong>Another rule of thumb</strong> is that normal approximations in general hold well near the center of the distribution but not in the tails. In particular the <em>relative</em> error in the tails can be unbounded. This paper was looking out toward the tails, and relative error mattered.</p>
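<p>To get a feel for the size of the error, here&#8217;s a rough sketch using SciPy. The post doesn&#8217;t say which quantity the paper computed, so the tail probability that X is at least 2 is only an assumed stand-in, and the continuity correction is my choice.</p>
<pre>
from math import sqrt
from scipy.stats import binom, norm

def tail_relative_error(n, p, k):
    '''Relative error of the normal approximation to the probability
    that a Binomial(n, p) random variable is at least k.'''
    exact = binom.sf(k - 1, n, p)            # sf(k-1) is P(X at least k)
    mean = n*p
    sd = sqrt(n*p*(1 - p))
    # Normal with the same mean and variance, with continuity correction
    approx = norm.sf(k - 0.5, loc=mean, scale=sd)
    return abs(approx - exact) / exact

print(tail_relative_error(10, 0.001, 2))     # small n, tiny p, out in the tail
print(tail_relative_error(100, 0.5, 60))     # large n, p = 1/2
</pre>
<p>In the first case the normal approximation puts essentially no mass in the tail, so the relative error is close to 100%. In the second case it is on the order of a percent.</p>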
<p>For more details, see these notes on the <a href="http://www.johndcook.com/normal_approx_to_binomial.html">normal approximation to the binomial</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2013/04/23/bad-normal-approximation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Moments of mixtures</title>
		<link>http://www.johndcook.com/blog/2013/04/18/moments-of-mixtures-skewness-kurtosis/</link>
		<comments>http://www.johndcook.com/blog/2013/04/18/moments-of-mixtures-skewness-kurtosis/#comments</comments>
		<pubDate>Thu, 18 Apr 2013 16:10:15 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=13271</guid>
		<description><![CDATA[I needed to compute the higher moments of a mixture distribution for a project I&#8217;m working on. I&#8217;m writing up the code here in case anyone else finds this useful. (And in case I&#8217;ll find it useful in the future.)<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2013/04/18/moments-of-mixtures-skewness-kurtosis/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I needed to compute the higher moments of a mixture distribution for a project I&#8217;m working on. I&#8217;m writing up the code here in case anyone else finds this useful. (And in case I&#8217;ll find it useful in the future.) I&#8217;ll include the central moments first. From there it&#8217;s easy to compute skewness and kurtosis.</p>
<p>Suppose <em>X</em> is a mixture of <em>n</em> random variables <em>X</em><sub><em>i</em></sub> with weights <em>w</em><sub><em>i</em></sub>, non-negative numbers adding to 1. Then the <em>j</em>th central moment of <em>X</em> is given by</p>
<p style="text-align: center;"><img alt="E[(X - \mu)^j] = \sum_{i=1}^n \sum_{k=0}^j {j \choose k} (\mu_i - \mu)^{j-k} w_i E[(X_i- \mu_i)^k]" src="http://www.johndcook.com/mixture_moments.png" width="387" height="51" /></p>
<p>where μ<sub><em>i</em></sub> is the mean of <em>X</em><sub><em>i</em></sub>.</p>
<p>In my particular application, I&#8217;m interested in a mixture of normals and so the code below computes the moments for a mixture of normals. It could easily be modified for other distributions.</p>
<pre>
# comb and factorialk live in scipy.special; scipy.misc no longer provides them
from scipy.special import comb, factorialk

def mixture_central_moment(mixture, moment):

    '''Compute the higher moments of a mixture of normal rvs.
    mixture is a list of (mu, sigma, weight) triples.
    moment is the central moment to compute.'''

    mix_mean = sum( [w*m for (m, s, w) in mixture] )

    mixture_moment = 0.0
    for triple in mixture:
        mu, sigma, weight = triple
        for k in range(moment+1):
            prod = comb(moment, k) * (mu-mix_mean)**(moment-k)
            prod *= weight*normal_central_moment(sigma, k)
            mixture_moment += prod

    return mixture_moment


def normal_central_moment(sigma, moment):

    '''Central moments of a normal distribution'''

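    if moment == 0:
        # The zeroth moment is 1. Handle it explicitly rather than
        # relying on how the library evaluates (-1)!!
        return 1.0
    if moment % 2 == 1:
        return 0.0
    else:
        # If Z is a std normal and n is even, E(Z^n) == (n-1)!!
        # So E (sigma Z)^n = sigma^n (n-1)!!
        return sigma**moment * factorialk(moment-1, 2, exact=True)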
</pre>
<p>Once we have code for central moments, it&#8217;s simple to add code for computing skewness and kurtosis.</p>
<pre>
def mixture_skew(mixture):

    variance = mixture_central_moment(mixture, 2)
    third = mixture_central_moment(mixture, 3)
    return third / variance**(1.5)

def mixture_kurtosis(mixture):

    variance = mixture_central_moment(mixture, 2)
    fourth = mixture_central_moment(mixture, 4)
    return fourth / variance**2 - 3.0
</pre>
<p>Here&#8217;s an example of how the code might be used.</p>
<pre>
# Test on a mixture of 30% Normal(-2, 1) and 70% Normal(1, 3)
mixture = [(-2, 1, 0.3), (1, 3, 0.7)]

print("Skewness = ", mixture_skew(mixture))
print("Kurtosis = ", mixture_kurtosis(mixture))
</pre>
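<p>One way to sanity check results like these is by simulation. The sketch below is self-contained and doesn&#8217;t use the functions above. It draws samples from the same mixture with NumPy and computes sample skewness and excess kurtosis with <code>scipy.stats</code>. The function name, sample size, and seed are arbitrary choices of mine. Working the formula by hand for this mixture gives a skewness of about 0.52 and an excess kurtosis very close to 0, so the sample values should land near those.</p>
<pre>
import numpy as np
from scipy.stats import skew, kurtosis

def sample_mixture(mixture, size, seed=0):
    '''Draw samples from a mixture of normals.
    mixture is a list of (mu, sigma, weight) triples.'''
    rng = np.random.default_rng(seed)
    mus, sigmas, weights = (np.array(col, dtype=float) for col in zip(*mixture))
    # Pick a component for each sample, then draw from that component
    idx = rng.choice(len(mixture), size=size, p=weights)
    return rng.normal(mus[idx], sigmas[idx])

mixture = [(-2, 1, 0.3), (1, 3, 0.7)]
x = sample_mixture(mixture, 1000000)

print("Sample skewness = ", skew(x))
print("Sample kurtosis = ", kurtosis(x))   # excess kurtosis, i.e. minus 3
</pre>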
<p><strong>Related post</strong>: <a href="http://www.johndcook.com/blog/2012/11/06/general-formula-for-normal-moments/">General formula for normal moments</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2013/04/18/moments-of-mixtures-skewness-kurtosis/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Data calls the model&#8217;s bluff</title>
		<link>http://www.johndcook.com/blog/2013/03/05/data-calls-the-models-bluff/</link>
		<comments>http://www.johndcook.com/blog/2013/03/05/data-calls-the-models-bluff/#comments</comments>
		<pubDate>Tue, 05 Mar 2013 14:34:04 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=13063</guid>
		<description><![CDATA[I hear a lot of people saying that simple models work better than complex models when you have enough data. For example, here&#8217;s a tweet from Giuseppe Paleologo this morning: Isn&#8217;t it ironic that almost all known results in asymptotic<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2013/03/05/data-calls-the-models-bluff/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I hear a lot of people saying that simple models work better than complex models when you have enough data. For example, here&#8217;s a tweet from <a href="https://twitter.com/gappy3000">Giuseppe Paleologo</a> this morning:</p>
<blockquote><p>Isn&#8217;t it ironic that almost all known results in asymptotic statistics don&#8217;t scale well with data?</p></blockquote>
<p>There are several things people could mean when they say that complex models don&#8217;t scale well.</p>
<p>First, they may mean that the <em>implementation</em> of complex models doesn&#8217;t scale. The computational effort required to fit the model increases disproportionately with the amount of data.</p>
<p>Second, they could mean that complex models aren&#8217;t necessary. A complex model might do even better than a simple model, but simple models work well enough given lots of data.</p>
<p>A third possibility, less charitable than the first two, is that the complex models are a bad fit, and this becomes apparent given enough data. The data calls the model&#8217;s bluff. If a statistical model performs poorly with lots of data, it must have performed poorly with a small amount of data too, but you couldn&#8217;t tell. It&#8217;s simple over-fitting.</p>
<p>I believe that&#8217;s what Giuseppe had in mind in his remark above. When I replied that the problem is modeling error, he said &#8220;Yes, big time.&#8221; The results of asymptotic statistics scale beautifully <em>when the model is correct</em>. But giving a poorly fitting model more data isn&#8217;t going to make it perform better.</p>
<p>The wrong conclusion would be to say that complex models work well for small data. I think the conclusion is that you <em>can&#8217;t tell</em> that complex models are <em>not</em> working well with small data. It&#8217;s a researcher&#8217;s paradise. You can fit a sequence of ever more complex models, getting a publication out of each. Evaluate your model using simulations based on your assumptions and you can avoid the accountability of the real world.</p>
<p>If the robustness of simple models is important with huge data sets, it&#8217;s even more important with small data sets.</p>
<p>Model complexity should increase with data, not decrease. I don&#8217;t mean that it should necessarily increase, but that it could. With more data, you have the ability to test the fit of more complex models. When people say that simple models scale better, they may mean that they haven&#8217;t been able to do better, that the data has exposed the problems with other things they&#8217;ve tried.</p>
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2011/11/01/floating-point-worries/">Floating point error is the least of my worries</a><br />
<a href="http://www.johndcook.com/blog/2013/03/05/robustness-of-equal-weights/">Robustness of equal weights</a><br />
<a href="http://www.johndcook.com/blog/2011/01/12/occams-razor-bayes-theorem/">Occam&#8217;s razor and Bayes&#8217; theorem</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2013/03/05/data-calls-the-models-bluff/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Robustness of equal weights</title>
		<link>http://www.johndcook.com/blog/2013/03/05/robustness-of-equal-weights/</link>
		<comments>http://www.johndcook.com/blog/2013/03/05/robustness-of-equal-weights/#comments</comments>
		<pubDate>Tue, 05 Mar 2013 13:00:10 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=13057</guid>
		<description><![CDATA[In Thinking, Fast and Slow, Daniel Kahneman comments on The robust beauty of improper linear models in decision making by Robyn Dawes. According to Dawes, or at least Kahneman&#8217;s summary of Dawes, simply averaging a few relevant predictors may work<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2013/03/05/robustness-of-equal-weights/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>In <a href="http://www.amazon.com/gp/product/0374533555/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0374533555&amp;linkCode=as2&amp;tag=theende-20">Thinking, Fast and Slow</a>, Daniel Kahneman comments on <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.188.5825">The robust beauty of improper linear models in decision making</a> by Robyn Dawes. According to Dawes, or at least Kahneman&#8217;s summary of Dawes, simply averaging a few relevant predictors may work as well as or better than a proper regression model.</p>
<blockquote><p>One can do just as well by selecting a set of scores that have some validity for predicting the outcome and adjusting the values to make them comparable (by using standard scores or ranks). A formula that combines these predictors with equal weights is likely to be just as accurate in predicting new cases as the multiple-regression model that was optimal in the original sample. More recent research went further: formulas that assign equal weights to all the predictors are often superior, because they are not affected by accidents of sampling.</p></blockquote>
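<p>Here&#8217;s a small simulation sketch of the comparison described in the quote. Everything in it is an assumption for illustration: the true coefficients, the noise level, the sample sizes, and the function name are all made up, and it is not a reconstruction of anything in Dawes&#8217; paper. It fits a multiple regression on a small sample, builds an equal-weight sum of standardized predictors, and reports how well each correlates with the outcome on new cases.</p>
<pre>
import numpy as np

def compare_equal_weights(n_train=20, n_test=10000, seed=0):
    '''Return out-of-sample correlations (regression, equal weights).'''
    rng = np.random.default_rng(seed)
    true_coef = np.array([0.5, 0.4, 0.3, 0.2])   # assumed, all positive

    def draw(n):
        X = rng.standard_normal((n, true_coef.size))
        y = X @ true_coef + rng.standard_normal(n)
        return X, y

    X_train, y_train = draw(n_train)
    X_test, y_test = draw(n_test)

    # Multiple regression with intercept, fit on the small training sample
    A = np.column_stack([np.ones(n_train), X_train])
    beta = np.linalg.lstsq(A, y_train, rcond=None)[0]
    ols_pred = beta[0] + X_test @ beta[1:]

    # Equal weights: convert to standard scores, then add them up
    z_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
    equal_pred = z_test.sum(axis=1)

    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    return corr(ols_pred, y_test), corr(equal_pred, y_test)

print(compare_equal_weights())
</pre>
<p>Which approach comes out ahead depends on the seed, the training sample size, and how unequal the true coefficients are, so it&#8217;s worth varying those before drawing conclusions.</p>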
<p>If the data really do come from an approximately linear system, and you&#8217;ve identified the correct variables, then linear regression is optimal in some sense. If a simple-minded approach works nearly as well, one of these assumptions is wrong.</p>
<ol>
<li>Maybe the system isn&#8217;t approximately linear. In that case it would not be surprising that the best fit of an inappropriate model doesn&#8217;t work better than a crude fit.</li>
<li>Maybe the linear regression model is missing important predictors or has some extraneous predictors that are adding noise.</li>
<li>Maybe the system is linear, you&#8217;ve identified the right variables, but the application of your model is robust to errors in the coefficients.</li>
</ol>
<p>Regarding the first point, it can be hard to detect nonlinearities when you have several regression variables. It is especially hard to find nonlinearities when you assume that they must not exist.</p>
<p>Regarding the last point, depending on the purpose you put your model to, an accurate fit might not be that important. If the regression model is being used as a classifier, for example, maybe you could do about as good a job at classification with a crude fit.</p>
<p>The context of Dawes&#8217; paper, and Kahneman&#8217;s commentary on it, is a discussion of clinical judgment versus simple formulas. Neither author is discouraging regression but rather saying that a simple formula can easily outperform clinical judgment in some circumstances.</p>
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2012/09/17/robustness-of-simple-rules/">The robustness of simple rules</a><br />
<a href="http://www.johndcook.com/blog/2011/01/24/more-theoretical-power-less-real-power/">More theoretical power, less real power</a><br />
<a href="http://www.johndcook.com/blog/2013/03/05/data-calls-the-models-bluff/">Data calls the model&#8217;s bluff</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2013/03/05/robustness-of-equal-weights/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Offended by conditional probability</title>
		<link>http://www.johndcook.com/blog/2013/02/13/offended-by-conditional-probability/</link>
		<comments>http://www.johndcook.com/blog/2013/02/13/offended-by-conditional-probability/#comments</comments>
		<pubDate>Wed, 13 Feb 2013 19:33:06 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Bayesian]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12886</guid>
		<description><![CDATA[It&#8217;s a simple rule of probability that if A makes B more likely, B makes A more likely. That is, if the conditional probability of A given B is larger than the probability of A alone, then the conditional probability<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2013/02/13/offended-by-conditional-probability/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>It&#8217;s a simple rule of probability that if A makes B more likely, B makes A more likely. That is, if the conditional probability of A given B is larger than the probability of A alone, then the conditional probability of B given A is larger than the probability of B alone. In symbols,</p>
<p style="padding-left: 30px;">Prob( A | B ) &gt; Prob( A ) ⇒ Prob( B | A ) &gt; Prob( B ).</p>
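<p>The proof is trivial: Apply the definition of conditional probability and observe that if Prob( A ∩ B ) / Prob( B ) &gt; Prob( A ), then Prob( A ∩ B ) / Prob( A ) &gt; Prob( B ). Both inequalities say the same thing as Prob( A ∩ B ) &gt; Prob( A ) Prob( B ), assuming A and B each have positive probability.</p>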
<p>Let A be the event that someone was born in Arkansas and let B be the event that this person has been president of the United States. There are five living current and former US presidents, and one of them, Bill Clinton, was born in Arkansas, a state with about 1% of the US population. Knowing that someone has been president increases your estimation of the probability that this person is from Arkansas. Similarly, knowing that someone is from Arkansas should increase your estimation of the chances that this person has been president.</p>
<p>The chances that an American selected at random has been president are very small, but as small as this probability is, it goes up if you know the person is from Arkansas. In fact, it goes up by the same proportion as the opposite probability. Knowing that someone has been president increases their probability of being from Arkansas by a factor of 20, so knowing that someone is from Arkansas increases the probability that they have been president by a factor of 20 as well. This is because</p>
<p style="padding-left: 30px;">Prob( A | B ) / Prob( A ) = Prob( B | A ) / Prob( B ).</p>
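<p>Here&#8217;s the presidential example worked out with exact fractions. The population figure is an assumed round number. It cancels out of both ratios, so its exact value doesn&#8217;t matter.</p>
<pre>
from fractions import Fraction

population = 315000000           # assumed round figure for the US population
presidents = 5                   # living current and former presidents
arkansans = population // 100    # about 1% of the population
arkansan_presidents = 1

p_a = Fraction(arkansans, population)             # Prob( A )
p_b = Fraction(presidents, population)            # Prob( B )
p_ab = Fraction(arkansan_presidents, population)  # Prob( A and B )

lift_a_given_b = (p_ab / p_b) / p_a    # Prob( A | B ) / Prob( A )
lift_b_given_a = (p_ab / p_a) / p_b    # Prob( B | A ) / Prob( B )

print(lift_a_given_b, lift_b_given_a)  # prints 20 20
</pre>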
<p>This isn&#8217;t controversial when we&#8217;re talking about presidents and where they were born. But it becomes more controversial when we apply the same reasoning, for example, to deciding who should be screened at airports.</p>
<p>When I jokingly said that being an Emacs user makes you a better programmer, it appears a few Vim users got upset. Whether they were serious or not, it does seem that they thought &#8220;Hey, what does that say about me? I use Vim. Does that mean I&#8217;m a bad programmer?&#8221;</p>
<p>Assume for the sake of argument that Emacs users are better programmers, i.e.</p>
<p style="padding-left: 30px;">Prob( good programmer | Emacs user )  &gt;  Prob( good programmer ).</p>
<p>We&#8217;re not assuming that Emacs users are necessarily better programmers, only that a larger proportion of Emacs users are good programmers. And we&#8217;re not saying anything about causality, only probability.</p>
<p>Does this imply that being a Vim user lowers your chance of being a good programmer? i.e.</p>
<p style="padding-left: 30px;">Prob( good programmer | Vim user )  &lt;  Prob( good programmer )?</p>
<p>No, because being a Vim user is a specific alternative to being an Emacs user, and there are programmers who use neither Emacs nor Vim. What the above statement about Emacs <em>would</em> imply is that</p>
<p style="padding-left: 30px;">Prob( good programmer | not an Emacs user )  &lt;  Prob( good programmer ).</p>
<p>That is, if knowing that someone uses Emacs increases the chances that they are a good programmer, then knowing that they are not an Emacs user does indeed lower the chances that they are a good programmer, <em>if we have no other information</em>. In general</p>
<p style="padding-left: 30px;">Prob( A | B ) &gt; Prob( A ) ⇒ Prob( A | not B ) &lt; Prob( A ).</p>
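<p>The reason is that Prob( A ) is a weighted average of Prob( A | B ) and Prob( A | not B ), so if one is above the average the other has to be below it. Here&#8217;s a sketch with made-up numbers. The editor shares and the proportions of good programmers are purely hypothetical, chosen so that Vim users also come out above average.</p>
<pre>
from fractions import Fraction as F

# Hypothetical numbers: editor -> (share of programmers, Prob( good | editor ))
groups = {
    "Emacs": (F(1, 5), F(1, 2)),
    "Vim":   (F(1, 5), F(2, 5)),
    "other": (F(3, 5), F(1, 5)),
}

# Law of total probability: Prob( good ) is the share-weighted average
p_good = sum(share * p for share, p in groups.values())

def p_good_given_not(editor):
    '''Prob( good | not a user of the given editor ).'''
    rest = [(s, p) for name, (s, p) in groups.items() if name != editor]
    return sum(s * p for s, p in rest) / sum(s for s, p in rest)

print("Prob( good ) =", p_good)                                   # 3/10
print("Prob( good | Emacs ) =", groups["Emacs"][1])               # 1/2
print("Prob( good | Vim ) =", groups["Vim"][1])                   # 2/5
print("Prob( good | not Emacs ) =", p_good_given_not("Emacs"))    # 1/4
</pre>
<p>With these numbers, Emacs users and Vim users are both more likely than average to be good programmers, while people who don&#8217;t use Emacs are less likely than average.</p>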
<p>To take a more plausible example, suppose that spending four years at MIT obtaining a computer science degree makes you a better programmer. Then knowing that someone has a CS degree from MIT increases the probability that this person is a good programmer. But if that&#8217;s true, it must also be true that <strong>absent any other information</strong>, knowing that someone does not have a CS degree from MIT decreases the probability that this person is a good programmer. If a larger proportion of good programmers come from MIT, then a smaller proportion must not come from MIT.</p>
<p style="text-align: center;">***</p>
<p>This post uses the ideas of information and conditional probability interchangeably. If you&#8217;d like to read more on that perspective, I recommend <a href="http://www.amazon.com/gp/product/0521592712/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0521592712&amp;linkCode=as2&amp;tag=theende-20">Probability Theory: The Logic of Science</a> by E. T. Jaynes.</p>
<p><a href="http://www.amazon.com/gp/product/0521592712/ref=as_li_ss_il?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0521592712&amp;linkCode=as2&amp;tag=theende-20"><img src="http://ws.assoc-amazon.com/widgets/q?_encoding=UTF8&amp;ASIN=0521592712&amp;Format=_SL160_&amp;ID=AsinImage&amp;MarketPlace=US&amp;ServiceVersion=20070822&amp;WS=1&amp;tag=theende-20" border="0" alt="" /></a><img style="border:none !important; margin:0px !important;" src="http://www.assoc-amazon.com/e/ir?t=theende-20&amp;l=as2&amp;o=1&amp;a=0521592712" border="0" alt="" width="1" height="1" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2013/02/13/offended-by-conditional-probability/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Visualization, modeling, and surprises</title>
		<link>http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/</link>
		<comments>http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/#comments</comments>
		<pubDate>Fri, 08 Feb 2013 04:01:29 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12844</guid>
		<description><![CDATA[This afternoon Hadley Wickham gave a great talk on data analysis. Here&#8217;s a paraphrase of something profound he said. Visualization can surprise you, but it doesn&#8217;t scale well. Modelling scales well, but it can&#8217;t surprise you. Visualization can show you<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>This afternoon <a href="http://had.co.nz/">Hadley Wickham</a> gave a great talk on data analysis. Here&#8217;s a paraphrase of something profound he said.</p>
<blockquote><p>Visualization can surprise you, but it doesn&#8217;t scale well.<br />
Modelling scales well, but it can&#8217;t surprise you.</p></blockquote>
<p>Visualization can show you something in your data that you didn&#8217;t expect. But some things are hard to see, and visualization is a slow, human process.</p>
<p>Modeling might tell you something slightly unexpected, but your choice of model restricts what you&#8217;re going to find once you&#8217;ve fit it.</p>
<p>So you iterate. Visualization suggests a model, and then you use your model to factor out some feature of the data. Then you visualize again.</p>
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2011/12/01/amputating-reality/">Amputating reality</a><br />
<a href="http://www.johndcook.com/blog/2012/11/07/r-without-hadley-wickham/">R without Hadley Wickham</a><br />
<a href="http://www.johndcook.com/blog/2009/08/31/the-iot-test/">The IOT test</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Statistics stories wanted</title>
		<link>http://www.johndcook.com/blog/2013/01/18/statistics-stories-wanted/</link>
		<comments>http://www.johndcook.com/blog/2013/01/18/statistics-stories-wanted/#comments</comments>
		<pubDate>Fri, 18 Jan 2013 16:31:24 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12765</guid>
		<description><![CDATA[Andrew Gelman is trying to collect 365 stories about life as a statistician: So here’s the plan. 365 of you write vignettes about your statistical lives. Get into the nitty gritty—tell me what you do, and why you’re doing it.<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2013/01/18/statistics-stories-wanted/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Andrew Gelman is trying to collect 365 stories about life as a statistician:</p>
<blockquote><p>So here’s the plan. 365 of you write vignettes about your statistical lives. Get into the nitty gritty—tell me what you do, and why you’re doing it. I’ll collect these and then post them at the Statistics Forum, one a day for a year. I think that could be great, truly a unique resource into what statistics and quantitative research is really like. Also it will be perfect for the Statistics Forum: people will want to tune in every day to see what comes next.</p></blockquote>
<p>If you&#8217;re interested in contributing, please contact Andrew. You can read more about the project <a href="http://andrewgelman.com/2013/01/wanted-365-stories-of-statistics/">here</a> and you can find Andrew&#8217;s contact info <a href="http://www.stat.columbia.edu/~gelman/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2013/01/18/statistics-stories-wanted/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Elementary statistics book recommendation</title>
		<link>http://www.johndcook.com/blog/2013/01/12/elementary-statistics-book/</link>
		<comments>http://www.johndcook.com/blog/2013/01/12/elementary-statistics-book/#comments</comments>
		<pubDate>Sat, 12 Jan 2013 14:34:53 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11820</guid>
		<description><![CDATA[I&#8217;ve thought about making a personal FAQ page. If I do, one of the questions would be what elementary statistics book I recommend. Unfortunately, I don&#8217;t have an answer for that one. I haven&#8217;t seen such a book I&#8217;d recommend<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2013/01/12/elementary-statistics-book/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve thought about making a personal FAQ page. If I do, one of the questions would be what elementary statistics book I recommend. Unfortunately, I don&#8217;t have an answer for that one. I haven&#8217;t seen such a book I&#8217;d recommend enthusiastically. </p>
<p>When asked for book recommendations, people will often recommend the textbook used in a course they had. But I never had an elementary statistics course. I had a PhD in math before I became interested in statistics, so I learned statistics from more advanced books. I&#8217;ve looked at a number of elementary books, but I haven&#8217;t found one I&#8217;m excited about.</p>
<p>Elementary statistics books may do more harm than good. They often brush difficulties under the rug. They avoid mathematical and philosophical details. They don&#8217;t define terms carefully, and even say things that are false. And they imply that statistical analysis is a matter of applying a set of rules by rote. (And it is, for many statisticians. But that&#8217;s a topic for another time.) If a statistics book doesn&#8217;t have fairly steep prerequisites, it will be hard for it not to be misleading.</p>
<p>This leads to another frequently asked question: Do I intend to write my own elementary statistics book? No. I don&#8217;t know whether I could write such a book that I&#8217;d be proud of. And if I could, it would take more time than I could afford to devote to it at this point in my life. </p>
<p>(I&#8217;ll write soon about what &#8220;this point in my life&#8221; is. If you don&#8217;t want to wait, here&#8217;s the news in a <a href="https://plus.google.com/u/0/107357401349156916755/posts/Jzz8P1c2H5i">nutshell</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2013/01/12/elementary-statistics-book/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
		<item>
		<title>Closet Bayesian</title>
		<link>http://www.johndcook.com/blog/2013/01/03/closet-bayesian/</link>
		<comments>http://www.johndcook.com/blog/2013/01/03/closet-bayesian/#comments</comments>
		<pubDate>Thu, 03 Jan 2013 15:33:16 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Bayesian]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12436</guid>
		<description><![CDATA[When I was a grad student, a statistics postdoc confided to me that he was a &#8220;closet Bayesian.&#8221; This sounded absolutely bizarre. Why would someone be secretive about his preferred approach to statistics? I could not imagine someone whispering that<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2013/01/03/closet-bayesian/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>When I was a grad student, a statistics postdoc confided to me that he was a &#8220;closet Bayesian.&#8221; This sounded absolutely bizarre. Why would someone be secretive about his preferred approach to statistics? I could not imagine someone whispering that although she&#8217;s doing her thesis in algebra, she&#8217;s secretly interested in differential equations.</p>
<p>I knew nothing about statistics at the time and was surprised to find that there was a bitter rivalry between two schools of statistics. The rivalry is still there, though it&#8217;s not as bitter as it once was.</p>
<p>I find it grating when someone asks &#8220;Are you a Bayesian?&#8221; It implies an inappropriate degree of commitment and exclusivity. Bayesian statistics is just a tool. Statistics itself is just a tool, one way of understanding the world.</p>
<p>My car has a manual transmission. I prefer manual transmissions. But if someone asked whether I was a manual transmissionist, I&#8217;d look at them like they&#8217;re crazy. I don&#8217;t have any moral objections to automatic transmissions.</p>
<p>I evaluate a car by how well it works. And for most purposes, I prefer the way a manual transmission works. But when I&#8217;m teaching one of my kids to drive, we go out in my wife&#8217;s car with an automatic transmission. Similarly, I evaluate a mathematical model (statistical or otherwise) by how it works for a given purpose. Sometimes a Bayesian and a frequentist approach lead to the same conclusions, but the latter is easier to understand or implement. Sometimes a Bayesian method leads to a better result because it can use more information or is easier to interpret. Sometimes it&#8217;s a toss-up and I use a Bayesian approach because it&#8217;s more familiar, just like my old car.</p>
<p><strong>Related post</strong>: <a href="http://www.johndcook.com/blog/2011/09/06/bayes-isnt-magic/">Bayes isn&#8217;t magic</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2013/01/03/closet-bayesian/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Statistics blogosphere</title>
		<link>http://www.johndcook.com/blog/2012/11/13/statistics-blogosphere/</link>
		<comments>http://www.johndcook.com/blog/2012/11/13/statistics-blogosphere/#comments</comments>
		<pubDate>Tue, 13 Nov 2012 11:05:48 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12443</guid>
		<description><![CDATA[John Johnson did an analysis of the statistics blogosphere for the Coursera Social Networking Analysis class. His blog post about the analysis lists some of the lessons he learned from the project. It also includes a link to his paper<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/11/13/statistics-blogosphere/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>John Johnson did an analysis of the statistics blogosphere for the Coursera Social Networking Analysis class. His <a href="http://realizationsinbiostatistics.blogspot.com/2012/11/analysis-of-statistics-blogosphere.html">blog post</a> about the analysis lists some of the lessons he learned from the project. It also includes a link to his <a href="https://github.com/randomjohn/project/blob/master/out/Communities%20within%20the%20statistics%20blog%20network.pdf">paper</a> and to the <a href="https://github.com/randomjohn/project">Python code</a> used to do the analysis.</p>
<p style="text-align:center"><img src="http://www.johndcook.com/blogosphere.png" alt="statistics blogosphere" width="400" height="425" /></p>
<p>Image from Figure 6 of John&#8217;s paper.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/11/13/statistics-blogosphere/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Random inequalities and Edgeworth approximation</title>
		<link>http://www.johndcook.com/blog/2012/11/07/edgeworth-random-inequalities/</link>
		<comments>http://www.johndcook.com/blog/2012/11/07/edgeworth-random-inequalities/#comments</comments>
		<pubDate>Wed, 07 Nov 2012 17:04:39 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12374</guid>
		<description><![CDATA[I&#8217;ve written a lot about random inequalities. That&#8217;s because computers spend a lot of time computing random inequalities in the inner loop of simulations. I&#8217;m looking for ways to speed this up. Here&#8217;s my latest idea: Approximating random inequalities with<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/11/07/edgeworth-random-inequalities/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve written a lot about random inequalities. That&#8217;s because computers spend a lot of time computing random inequalities in the inner loop of simulations. I&#8217;m looking for ways to speed this up.</p>
<p>Here&#8217;s my latest idea: <a href="http://www.johndcook.com/edgeworth.pdf">Approximating random inequalities with Edgeworth expansions</a></p>
<p>An Edgeworth expansion is like a Fourier series, except you use derivatives of the normal density as your basis rather than sine functions. Sometimes the full Edgeworth expansion does not converge and yet the first few terms make a good approximation. The tech report explicitly considers Edgeworth approximations with just two terms, but demonstrates the integration tricks necessary to use more terms. The result is computed in closed form, no numerical integration required, and so may be much faster than other approaches.</p>
<p>One advantage of the Edgeworth approach is that it only depends on the moments of the distributions in the inequality. This means it provides an approximation that&#8217;s waiting to be used on new families of distributions. But because it&#8217;s not specific to a distribution family, its performance in a particular case needs to be explored. In the case of <a href="http://www.johndcook.com/blog/2012/10/15/beta-inequalities/">beta distributions</a>, for example, even a single-term approximation does pretty well.</p>
<p><strong>More blog posts on random inequalities</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2008/07/26/random-inequalities-i/">Introduction</a><br />
<a href="http://www.johndcook.com/blog/2008/07/26/random-inequalities-ii-analytical-results/">Analytical results</a><br />
<a href="http://www.johndcook.com/blog/2008/07/26/random-inequalities-iii-numerical-results/">Numerical results</a><br />
<a href="http://www.johndcook.com/blog/2008/08/09/random-inequalities-iv-cauchy-distributions/">Cauchy distributions</a><br />
<a href="http://www.johndcook.com/blog/2008/08/21/random-inequalities-v-beta-distributions/">Beta distributions</a><br />
<a href="http://www.johndcook.com/blog/2008/08/30/random-inequalities-vi-gamma-distributions/">Gamma distributions</a><br />
<a href="http://www.johndcook.com/blog/2008/09/06/random-inequalities-vii-three-or-more-variables/">Three or more random variables</a><br />
<a href="http://www.johndcook.com/blog/2009/07/13/random-inequalities-folded-normal/">Folded normals</a><br />
<a href="http://www.johndcook.com/blog/2011/09/27/bayesian-amazon/">A Bayesian view of Amazon Resellers</a><br />
<a href="http://www.johndcook.com/blog/2012/10/15/beta-inequalities/">Fast approximation of beta inequalities</a><br />
<a href="http://www.johndcook.com/blog/2012/10/17/shifting-probability-distributions/">Shifting probability distributions</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/11/07/edgeworth-random-inequalities/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Product of normal PDFs</title>
		<link>http://www.johndcook.com/blog/2012/10/29/product-of-normal-pdfs/</link>
		<comments>http://www.johndcook.com/blog/2012/10/29/product-of-normal-pdfs/#comments</comments>
		<pubDate>Mon, 29 Oct 2012 20:10:39 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Bayesian]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12334</guid>
		<description><![CDATA[The product of two normal PDFs is proportional to a normal PDF. This is well known in Bayesian statistics because a normal likelihood times a normal prior gives a normal posterior. But because Bayesian applications don&#8217;t usually need to know<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/10/29/product-of-normal-pdfs/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>The product of two normal PDFs is proportional to a normal PDF. This is well known in Bayesian statistics because a normal likelihood times a normal prior gives a normal posterior. But because Bayesian applications don&#8217;t usually need to know the proportionality constant, it&#8217;s a little hard to find. I needed to calculate this constant, so I&#8217;m recording the result here for my future reference and for anyone else who might find it useful.<span id="more-12334"></span></p>
<p>Denote the normal PDF by</p>
<p style="text-align:center"><img src="http://www.johndcook.com/phidef.png" alt="\phi(x; m, s) = \frac{1}{\sqrt{2\pi} s} \exp\left(-\frac{(x-m)^2}{2s^2}\right)" width="269" height="44" /></p>
<p>Then the product of two normal PDFs is given by the equation</p>
<p style="text-align:center"><img src="http://www.johndcook.com/phiprod.png" alt="\phi(x; \mu_1, \sigma_1) \, \phi(x; \mu_2, \sigma_2) = \phi(\mu_1; \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}) \, \phi(x; \mu, \sigma)" width="392" height="34" /></p>
<p>where</p>
<p style="text-align:center"><img src="http://www.johndcook.com/phimu.png" alt="\mu = \frac{\sigma_1^{-2} \mu_1 + \sigma_2^{-2} \mu_2}{\sigma_1^{-2} + \sigma_2^{-2}}" width="140" height="44" /></p>
<p>and</p>
<p style="text-align:center"><img src="http://www.johndcook.com/phisigma.png" alt="\sigma^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}" width="96" height="42" /></p>
<p>Note that the product of two normal random variables is not normal, but the product of their PDFs is proportional to the PDF of another normal.</p>
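<p>Here is a quick numerical check of the identity, sketched in R. The function names <code>lhs</code> and <code>rhs</code> and the parameter values are arbitrary choices for illustration, not part of the derivation.</p>

```r
# Left side: pointwise product of two normal densities.
lhs <- function(x, mu1, sigma1, mu2, sigma2) {
    dnorm(x, mu1, sigma1) * dnorm(x, mu2, sigma2)
}

# Right side: proportionality constant times a single normal density.
rhs <- function(x, mu1, sigma1, mu2, sigma2) {
    # Precision-weighted mean and combined standard deviation.
    mu    <- (mu1/sigma1^2 + mu2/sigma2^2) / (1/sigma1^2 + 1/sigma2^2)
    sigma <- sqrt(sigma1^2 * sigma2^2 / (sigma1^2 + sigma2^2))
    # The constant is itself a normal density, evaluated at mu1.
    const <- dnorm(mu1, mu2, sqrt(sigma1^2 + sigma2^2))
    const * dnorm(x, mu, sigma)
}

x <- seq(-5, 5, by = 0.5)
# Arbitrary illustrative parameters; the difference should be negligible.
max(abs(lhs(x, 1, 2, -0.5, 0.7) - rhs(x, 1, 2, -0.5, 0.7)))
```

<p>If the formula has been transcribed correctly, the two sides should agree up to floating point error at every point in <code>x</code>.</p>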
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/10/29/product-of-normal-pdfs/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Sun, milk, red meat, and least-squares</title>
		<link>http://www.johndcook.com/blog/2012/10/26/sun-milk-red-meat-and-least-squares/</link>
		<comments>http://www.johndcook.com/blog/2012/10/26/sun-milk-red-meat-and-least-squares/#comments</comments>
		<pubDate>Fri, 26 Oct 2012 17:18:14 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>
		<category><![CDATA[Science]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12317</guid>
		<description><![CDATA[I thought this tweet from @WoodyOsher was pretty funny. Everything our parents said was good is bad. Sun, milk, red meat &#8230; the least-squares method. I wouldn&#8217;t say these things are bad, but they are now viewed more critically than<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/10/26/sun-milk-red-meat-and-least-squares/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I thought this tweet from <a href="https://twitter.com/woodyosher">@WoodyOsher</a> was pretty funny.</p>
<blockquote><p>Everything our parents said was good is bad. Sun, milk, red meat &#8230; the least-squares method.</p></blockquote>
<p>I wouldn&#8217;t say these things are <em>bad</em>, but they are now viewed more critically than they were a generation ago.</p>
<p>Sun exposure may be an apt example since it has alternately been seen as good or bad throughout history. The latest I&#8217;ve heard is that moderate sun exposure may lower your risk of cancer, even skin cancer, presumably because of vitamin D production. And sunlight appears to reduce your risk of multiple sclerosis since MS is more prevalent at higher latitudes. But like milk, red meat, or the least-squares method, you can overdo it.</p>
<p><strong>More on least squares</strong>: <a href="http://www.johndcook.com/blog/2011/01/27/when-it-works-it-works-really-well/">When it works, it works really well</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/10/26/sun-milk-red-meat-and-least-squares/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Shifting probability distributions</title>
		<link>http://www.johndcook.com/blog/2012/10/17/shifting-probability-distributions/</link>
		<comments>http://www.johndcook.com/blog/2012/10/17/shifting-probability-distributions/#comments</comments>
		<pubDate>Wed, 17 Oct 2012 13:45:38 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Bayesian]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12266</guid>
		<description><![CDATA[One reason the normal distribution is easy to work with is that you can vary the mean and variance independently. With other distribution families, the mean and variance may be linked in some nonlinear way. I was looking for a<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/10/17/shifting-probability-distributions/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>One reason the normal distribution is easy to work with is that you can vary the mean and variance independently. With other distribution families, the mean and variance may be linked in some nonlinear way.</p>
<p>I was looking for a faster way to compute Prob(<em>X</em> &gt; <em>Y</em> + δ) where <em>X</em> and <em>Y</em> are independent inverse gamma random variables. If δ were zero, the probability could be computed analytically. But when δ is positive, the calculation requires numerical integration. When the calculation is in the inner loop of a simulation, most of the simulation&#8217;s time is spent doing the integration.</p>
<p>Let <em>Z</em> = <em>Y</em> + δ. If <em>Z</em> were another inverse gamma random variable, we could compute Prob(<em>X</em> &gt; <em>Z</em>) quickly and accurately without integration. Unfortunately, <em>Z</em> is not an inverse gamma. But it is <em>approximately</em> an inverse gamma, at least if <em>Y</em> has a moderately large shape parameter, which it always does in my applications. So replace <em>Z</em> with an inverse gamma random variable whose parameters match the mean and variance of <em>Y</em> + δ. Then Prob(<em>X</em> &gt; <em>Z</em>) is a good approximation to Prob(<em>X</em> &gt; <em>Y</em> + δ).</p>
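<p>The moment matching step can be sketched in a few lines of R. This assumes the shape/scale parameterization of the inverse gamma, with mean scale/(shape - 1) and variance scale^2/((shape - 1)^2 (shape - 2)). The function name <code>shifted.invgamma</code> is hypothetical, and the tech report may organize the calculation differently.</p>

```r
# Hypothetical helper: parameters of the inverse gamma that matches
# the mean and variance of Y + delta, where Y ~ InverseGamma(shape, scale).
# Requires shape > 2 so that the variance of Y is finite.
shifted.invgamma <- function(shape, scale, delta)
{
    stopifnot(shape > 2, scale > 0)
    m <- scale/(shape - 1) + delta              # mean of Y + delta
    v <- scale^2/((shape - 1)^2*(shape - 2))    # a shift leaves the variance unchanged
    new.shape <- m^2/v + 2                      # solve v = m^2/(shape - 2)
    new.scale <- m*(new.shape - 1)              # solve m = scale/(shape - 1)
    c(shape = new.shape, scale = new.scale)
}

# Arbitrary illustrative values.
shifted.invgamma(10, 9, 0.5)
```

<p>With <code>delta</code> equal to zero the function returns the original parameters, which is a handy sanity check.</p>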
<p>For more details, see <a href="http://www.johndcook.com/GammaApprox.pdf">Fast approximation of inverse gamma inequalities</a>.</p>
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2012/10/15/beta-inequalities/">Fast approximation of beta inequalities</a><br />
<a href="http://www.johndcook.com/blog/2008/08/30/random-inequalities-vi-gamma-distributions/">Gamma inequalities</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/10/17/shifting-probability-distributions/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Beta inequalities in R</title>
		<link>http://www.johndcook.com/blog/2012/10/11/beta-inequalities-in-r/</link>
		<comments>http://www.johndcook.com/blog/2012/10/11/beta-inequalities-in-r/#comments</comments>
		<pubDate>Thu, 11 Oct 2012 15:19:00 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>
		<category><![CDATA[Rstats]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12241</guid>
		<description><![CDATA[Someone asked me yesterday for R code to compute the probability P(X &#62; Y + δ) where X and Y are independent beta random variables. I&#8217;m posting the solution here in case it benefits anyone else. For an example of<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/10/11/beta-inequalities-in-r/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Someone asked me yesterday for R code to compute the probability P(<em>X</em> &gt; <em>Y</em> + δ) where <em>X</em> and <em>Y</em> are independent beta random variables. I&#8217;m posting the solution here in case it benefits anyone else.</p>
<p>For an example of why you might want to compute this probability, see <a href="http://www.johndcook.com/blog/2011/09/27/bayesian-amazon/">A Bayesian view of Amazon resellers</a>.<br />
<span id="more-12241"></span></p>
<p>Let <em>X</em> be a Beta(a, b) random variable and <em>Y</em> be a Beta(c, d) random variable. Denote PDFs by <em>f</em> and CDFs by <em>F</em>. Then the probability we need is</p>
<p style="text-align:center"><img src="http://www.johndcook.com/beta_ineq.png" alt="P(X &gt; Y + \delta) &amp;=&amp; \int_\delta^1 \int_0^{x-\delta} f_X(x) \, f_Y(y) \, dy \, dx \\ &amp;=&amp; \int_\delta^1 f_X(x) \, F_Y(x-\delta) \, dx" width="340" height="89" /></p>
<p>If you just need to compute this probability a few times, here is a <a href="https://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=9">desktop application</a> to compute random inequalities.</p>
<p>But if you need to do this computation repeatedly inside R code, you could use the following.</p>
<pre>beta.ineq &lt;- function(a, b, c, d, delta)
{
    integrand &lt;- function(x) { dbeta(x, a, b)*pbeta(x-delta, c, d) }
    integrate(integrand, delta, 1, rel.tol=1e-4)$value
}</pre>
<p>The code is as good or as bad as R&#8217;s <code>integrate</code> function. It&#8217;s probably accurate enough as long as none of the parameters <em>a</em>, <em>b</em>, <em>c</em>, or <em>d</em> are near zero. When one or more of these parameters is small, the integral is harder to compute numerically.</p>
<p>There is no error checking in the code above. A more robust version would verify that all parameters are positive and that <code>delta</code> is less than 1.</p>
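<p>A minimal sketch of such a version follows, assuming it is acceptable to stop with an error on bad input. The name <code>beta.ineq.checked</code> is hypothetical, and starting the integral at <code>max(delta, 0)</code> is an added assumption for handling negative <code>delta</code>.</p>

```r
beta.ineq.checked <- function(a, b, c, d, delta)
{
    # Validate inputs: positive beta parameters and delta below 1.
    stopifnot(a > 0, b > 0, c > 0, d > 0, delta < 1)
    integrand <- function(x) { dbeta(x, a, b)*pbeta(x - delta, c, d) }
    # dbeta is zero below 0, so start at max(delta, 0) when delta is negative.
    integrate(integrand, max(delta, 0), 1, rel.tol=1e-4)$value
}

# Arbitrary illustrative values.
beta.ineq.checked(3, 2, 2, 3, 0.1)
```

<p>When <em>X</em> and <em>Y</em> have the same distribution and <code>delta</code> is zero, the result should be close to 1/2 by symmetry, which makes a convenient test case.</p>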
<p>Here&#8217;s the solution to the corresponding problem for gamma random variables, provided <code>delta</code> is zero: <a href="http://www.johndcook.com/blog/2011/03/15/a-support-one-liner/">A support one-liner</a>.</p>
<p>And here is a series of blog posts on random inequalities.</p>
<p><a href="http://www.johndcook.com/blog/2008/07/26/random-inequalities-i/">Introduction</a><br />
<a href="http://www.johndcook.com/blog/2008/07/26/random-inequalities-ii-analytical-results/">Analytical results</a><br />
<a href="http://www.johndcook.com/blog/2008/07/26/random-inequalities-iii-numerical-results/">Numerical results</a><br />
<a href="http://www.johndcook.com/blog/2008/08/09/random-inequalities-iv-cauchy-distributions/">Cauchy distributions</a><br />
<a href="http://www.johndcook.com/blog/2008/08/21/random-inequalities-v-beta-distributions/">Beta distributions</a><br />
<a href="http://www.johndcook.com/blog/2008/08/30/random-inequalities-vi-gamma-distributions/">Gamma distributions</a><br />
<a href="http://www.johndcook.com/blog/2008/09/06/random-inequalities-vii-three-or-more-variables/">Three or more random variables</a><br />
<a href="http://www.johndcook.com/blog/2009/07/13/random-inequalities-folded-normal/">Folded normals</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/10/11/beta-inequalities-in-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Personalized medicine</title>
		<link>http://www.johndcook.com/blog/2012/10/10/personalized-medicine/</link>
		<comments>http://www.johndcook.com/blog/2012/10/10/personalized-medicine/#comments</comments>
		<pubDate>Wed, 10 Oct 2012 11:58:20 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Clinical trials]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12233</guid>
		<description><![CDATA[When I hear someone say &#8220;personalized medicine&#8221; I want to ask &#8220;as opposed to what?&#8221; All medicine is personalized. If you are in an emergency room with a broken leg and the person next to you is lapsing into a<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/10/10/personalized-medicine/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>When I hear someone say &#8220;personalized medicine&#8221; I want to ask &#8220;as opposed to what?&#8221;</p>
<p>All medicine is personalized. If you are in an emergency room with a broken leg and the person next to you is lapsing into a diabetic coma, the two of you will be treated differently.</p>
<p>The aim of personalized medicine is to increase the <em>degree</em> of personalization, not to introduce personalization. In particular, there is the popular notion that it will become routine to sequence your DNA any time you receive medical attention, and that this sequence data will enable treatment uniquely customized for you. All we have to do is collect a lot of data and let computers sift through it. There are numerous reasons why this is incredibly naive. Here are three to start with.</p>
<ul>
<li>Maybe the information relevant to treating your malady is in how DNA is expressed, not in the DNA per se, in which case a sequence of your genome would be useless. Or maybe the most important information is not genetic at all. <a href="http://www.johndcook.com/blog/2009/02/18/the-data-may-not-contain-the-answer/">The data may not contain the answer</a>.</li>
<li>Maybe the information a doctor needs is not in one gene but in the interaction of 50 genes or 100 genes. Unless a small number of genes are involved, there is no way to explore the combinations by brute force. For example, the number of ways to select 5 genes out of 20,000 is 26,653,335,666,500,004,000. The number of ways to select 32 genes is over a googol, and <a href="http://www.johndcook.com/blog/2010/10/13/googol/">there isn&#8217;t a googol of anything in the universe</a>. Moore&#8217;s law will not get us around this impasse.</li>
<li>Most clinical trials use no biomarker information at all. It is exceptional to incorporate information from one biomarker. Investigating a handful of biomarkers in a single trial is statistically dubious. Blindly exploring tens of thousands of biomarkers is out of the question, at least with current approaches.</li>
</ul>
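<p>The combinatorial counts in the second point are easy to reproduce. Here is a sketch in R. The helper name <code>log10.choose</code> is hypothetical, and working on the log scale is just one convenient way to compare against a googol (10^100).</p>

```r
# Number of ways to choose 5 genes out of 20,000.
# choose() returns a double, so only about 16 digits are significant.
format(choose(20000, 5), big.mark = ",", scientific = FALSE)

# Work on the log10 scale to compare against a googol (10^100).
log10.choose <- function(n, k) lchoose(n, k)/log(10)
log10.choose(20000, 31)   # below 100
log10.choose(20000, 32)   # above 100
```

<p>So 32 is the smallest number of genes for which the count exceeds a googol.</p>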
<p>Genetic technology has the potential to incrementally increase the degree of personalization in medicine. But these discoveries will require new insight, and not simply more data and more computing power.</p>
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2010/08/25/predicting-height-using-genes/">Predicting height from genes</a><br />
<a href="http://www.johndcook.com/blog/2008/12/06/why-microarray-studies-are-often-wrong/">Why microarray studies are often wrong</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/10/10/personalized-medicine/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>How do you justify that distribution?</title>
		<link>http://www.johndcook.com/blog/2012/09/19/distribution-assumption/</link>
		<comments>http://www.johndcook.com/blog/2012/09/19/distribution-assumption/#comments</comments>
		<pubDate>Wed, 19 Sep 2012 12:22:37 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Clinical trials]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Bayesian]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12111</guid>
		<description><![CDATA[Someone asked me yesterday how people justify probability distribution assumptions. Sometimes the most mystifying assumption is the first one: &#8220;Assume X is normally distributed &#8230;&#8221; Here are a few answers. Sometimes distribution assumptions are not justified. Sometimes distributions can be<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/09/19/distribution-assumption/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Someone asked me yesterday how people justify probability distribution assumptions. Sometimes the most mystifying assumption is the first one: &#8220;Assume X is normally distributed &#8230;&#8221; Here are a few answers.</p>
<ol>
<li>Sometimes distribution assumptions are not justified.</li>
<li>Sometimes distributions can be derived from fundamental principles. For example, there are axioms that uniquely specify a Poisson distribution.</li>
<li>Sometimes distributions are justified on theoretical grounds. For example, large samples and the <a href="http://www.johndcook.com/blog/2010/01/05/how-the-central-limit-theorem-began/">central limit theorem</a> together may justify assuming that something is normally distributed.</li>
<li>Often the choice of distribution is somewhat arbitrary, chosen by intuition or for convenience, and then empirically shown to work well enough.</li>
<li>Sometimes a distribution can be a bad fit and still work well, depending on what you&#8217;re asking of it.</li>
</ol>
<p>The last point is particularly interesting. It&#8217;s not hard to imagine that a poor fit would produce poor results. It&#8217;s surprising when a poor fit produces good results. Here&#8217;s an example of the latter.</p>
<p>Suppose you are testing a new drug and hoping that it improves how long patients live. You want to stop the clinical trial early if it looks like patients are living no longer than they would have on standard treatment. There is a Bayesian <a href="https://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=63">method</a> for monitoring such experiments that assumes survival times have an exponential distribution. But survival times are not exponentially distributed, not even close.</p>
<p>The method works well <em>because of the question being asked</em>. The method is <em>not</em> being asked to accurately model the distribution of survival times for patients in the trial. It is only being asked to determine whether a trial should continue or stop, and it does a good job of doing so. As the simulations in <a href="http://biostats.bepress.com/mdandersonbiostat/paper16/">this paper</a> show, the method makes the right decision with high probability, even when the actual survival times are not exponentially distributed.</p>
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2010/08/11/what-distribution-does-my-data-have/">What distribution does my data have?</a><br />
<a href="http://www.johndcook.com/blog/2009/09/22/negative-binomial-distribution/">Four views of the negative binomial</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/09/19/distribution-assumption/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Accuracy versus perceived accuracy</title>
		<link>http://www.johndcook.com/blog/2012/09/18/perceived-accuracy/</link>
		<comments>http://www.johndcook.com/blog/2012/09/18/perceived-accuracy/#comments</comments>
		<pubDate>Tue, 18 Sep 2012 12:00:30 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12106</guid>
		<description><![CDATA[Commercial weather forecasters need to be accurate, but they also need to be perceived as being accurate, and sometimes the latter trumps the former. For instance, the for-profit weather forecasters rarely predict exactly a 50% chance of rain, which might<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/09/18/perceived-accuracy/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Commercial weather forecasters need to be accurate, but they also need to be <em>perceived</em> as being accurate, and sometimes the latter trumps the former.</p>
<blockquote><p>For instance, the for-profit weather forecasters rarely predict exactly a 50% chance of rain, which might seem wishy-washy and indecisive to customers. Instead, they&#8217;ll flip a coin and round up to 60, or down to 40, even though this makes the forecasts both less accurate and less honest.</p></blockquote>
<p>Forecasters also exaggerate small chances of rain, such as reporting 20% when they predict 5%.</p>
<blockquote><p>People notice one type of mistake &#8212; the failure to predict rain &#8212; more than another kind, false alarms. If it rains when it isn&#8217;t supposed to, they curse the weatherman for ruining their picnic, whereas an unexpectedly sunny day is taken as a serendipitous bonus.</p></blockquote>
<p>From <a id="static_txt_preview" href="http://www.amazon.com/gp/product/159420411X/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=159420411X&amp;linkCode=as2&amp;tag=theende-20" target="_blank">The Signal and the Noise</a>. The book gets some of its data from Eric Floehr of ForecastWatch. Read my interview with Eric <a href="http://www.johndcook.com/blog/2011/04/12/weather-forecast-accuracy/">here</a>.</p>
<p><a href="http://www.amazon.com/gp/product/159420411X/ref=as_li_ss_il?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=159420411X&amp;linkCode=as2&amp;tag=theende-20"><img src="http://ws.assoc-amazon.com/widgets/q?_encoding=UTF8&amp;ASIN=159420411X&amp;Format=_SL160_&amp;ID=AsinImage&amp;MarketPlace=US&amp;ServiceVersion=20070822&amp;WS=1&amp;tag=theende-20" border="0" alt="" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/09/18/perceived-accuracy/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Robustness of simple rules</title>
		<link>http://www.johndcook.com/blog/2012/09/17/robustness-of-simple-rules/</link>
		<comments>http://www.johndcook.com/blog/2012/09/17/robustness-of-simple-rules/#comments</comments>
		<pubDate>Mon, 17 Sep 2012 14:50:19 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12104</guid>
		<description><![CDATA[In his speech The dog and the frisbee, Andrew Haldane argues that simple models often outperform complex models in complex situations. He cites as examples sports prediction, diagnosing heart attacks, locating serial criminals, picking stocks, and  understanding spending patterns. The<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/09/17/robustness-of-simple-rules/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>In his speech <a href="http://www.bis.org/review/r120905a.pdf?frames=0">The dog and the frisbee</a>, Andrew Haldane argues that simple models often outperform complex models in complex situations. He cites as examples sports prediction, diagnosing heart attacks, locating serial criminals, picking stocks, and understanding spending patterns. The gist of his argument is this:</p>
<blockquote><p>Complex environments often instead call for simple decision rules. That is because these rules are more robust to ignorance.</p></blockquote>
<p>And yet behind every complex set of rules is a paper showing that it outperforms simple rules, under conditions of its author&#8217;s choosing. That is, the person proposing the complex model picks the scenarios for comparison. Unfortunately, the world throws at us scenarios not of our choosing. Simpler methods may perform better when model assumptions are violated. And model assumptions are always violated, at least to some extent.</p>
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2011/01/24/more-theoretical-power-less-real-power/">More theoretical power, less real power</a><br />
<a href="http://www.johndcook.com/blog/2011/05/25/crude-models/">Advantages of crude models</a><br />
<a href="http://www.johndcook.com/blog/2009/03/11/robust-statistics/">Canonical examples from robust statistics</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/09/17/robustness-of-simple-rules/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>True versus Publishable</title>
		<link>http://www.johndcook.com/blog/2012/09/13/true-versus-publishable/</link>
		<comments>http://www.johndcook.com/blog/2012/09/13/true-versus-publishable/#comments</comments>
		<pubDate>Thu, 13 Sep 2012 11:55:05 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Science]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12070</guid>
		<description><![CDATA[This weekend John Myles White and I discussed true versus publishable results in the comments to an earlier post. Methods that make stronger modeling assumptions lead to more statistical confidence, but less actual confidence. That is, they are more likely<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/09/13/true-versus-publishable/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>This weekend John Myles White and I discussed true versus publishable results in the comments to an <a href="http://www.johndcook.com/blog/2012/09/07/limits-of-statistics/">earlier post</a>. Methods that make stronger modeling assumptions lead to more statistical confidence, but less actual confidence. That is, they are more likely to produce positive results, but less likely to produce correct results.</p>
<blockquote><p><strong>JDC</strong>: If some scientists were more candid, they’d say “I don’t care whether my results are <em>true</em>, I care whether they’re <em>publishable</em>. So I need my <em>p</em>-value less than 0.05. Make as strong assumptions as you have to.”</p>
<p><strong>JMW</strong>: My sense of statistical education in the sciences is basically Upton Sinclair’s view of the Gilded Age: “It is difficult to get a man to  understand something when his salary depends upon his not understanding it.”</p></blockquote>
<p>Perhaps I should have said that scientists <em>know</em> that their conclusions are true. They just need the statistics to confirm what they know.</p>
<p>Brian Nosek talks about this theme on the <a href="http://www.econtalk.org/archives/2012/09/nosek_on_truth.html">EconTalk podcast</a>. He discusses the conflict of interest between creating publishable results and trying to find out what is actually true. However, he doesn&#8217;t just grouse about the problem; he offers specific suggestions for how to improve scientific publishing.</p>
<p><strong>Related post</strong>: <a href="http://www.johndcook.com/blog/2011/01/24/more-theoretical-power-less-real-power/">More theoretical power, less real power</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/09/13/true-versus-publishable/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Limits of statistics</title>
		<link>http://www.johndcook.com/blog/2012/09/07/limits-of-statistics/</link>
		<comments>http://www.johndcook.com/blog/2012/09/07/limits-of-statistics/#comments</comments>
		<pubDate>Fri, 07 Sep 2012 12:37:59 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12054</guid>
		<description><![CDATA[When statisticians analyze data, they don&#8217;t just look at the data you bring to them. They also consider hypothetical data that you could have brought. In other words, they consider what could have happened as well as what actually<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/09/07/limits-of-statistics/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>When statisticians analyze data, they don&#8217;t just look at the data you bring to them. They also consider hypothetical data that you could have brought. In other words, they consider what could have happened as well as what actually did happen.</p>
<p>This may seem strange, and sometimes it does lead to strange conclusions. But often it is undeniably the right thing to do. It also leads to endless debates among statisticians. The cause of the debates lies at the root of statistics.</p>
<p>The central dogma of statistics is that data should be viewed as realizations of random variables. This has been a very fruitful idea, but it has its limits. It&#8217;s a reification of the world. And like all reifications, it eventually becomes invisible to those who rely on it.</p>
<p>Data are what they are. In order to think of the data as having come from a random process, you have to construct a hypothetical process that could have produced the data. Sometimes there is near universal agreement on how this should be done. But often different statisticians create different hypothetical worlds in which to place the data. This is at the root of such arguments as how to handle multiple testing.</p>
<p>You can debunk any conclusion by placing the data in a large enough hypothetical model. Suppose it&#8217;s Jake&#8217;s birthday, and when he comes home, there are Scrabble tiles on the floor spelling out &#8220;Happy birthday Jake.&#8221; You might conclude that someone arranged the tiles to leave him a birthday greeting. But if you are so inclined, you could attribute the apparent pattern to chance. You could argue that there are many people around the world who have dropped bags of Scrabble tiles, and eventually something like this was bound to happen. If that seems to be an inadequate explanation, you could take a &#8220;many worlds&#8221; approach and posit entire new universes. Not only are people dropping Scrabble tiles in this universe, they&#8217;re dropping them in countless other universes too. We&#8217;re only remarking on Jake&#8217;s apparent birthday greeting because we happen to inhabit the universe in which it happened.</p>
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2011/12/01/amputating-reality/">Amputating reality</a><br />
<a href="http://www.johndcook.com/blog/2011/09/30/just-an-approximation/">Just an approximation</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/09/07/limits-of-statistics/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Unprincipled analysis</title>
		<link>http://www.johndcook.com/blog/2012/08/13/unprincipled-analysis/</link>
		<comments>http://www.johndcook.com/blog/2012/08/13/unprincipled-analysis/#comments</comments>
		<pubDate>Mon, 13 Aug 2012 16:29:09 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11911</guid>
		<description><![CDATA[The other day I started to call someone&#8217;s data analysis &#8220;unprincipled&#8221; until I realized how harsh that sounds. I wanted to convey that an analysis seemed ad hoc, not based on general principles. Then I realized that &#8220;unprincipled&#8221; implies someone<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/08/13/unprincipled-analysis/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>The other day I started to call someone&#8217;s data analysis &#8220;unprincipled&#8221; until I realized how harsh that sounds. I wanted to convey that an analysis seemed <em>ad hoc</em>, not based on general principles. Then I realized that &#8220;unprincipled&#8221; implies someone is lacking <em>moral</em> principles rather than statistical principles so I changed my wording.</p>
<p><strong>Related post</strong>: <a href="http://www.johndcook.com/blog/2011/05/10/well-understood/">Works well versus well understood</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/08/13/unprincipled-analysis/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Vague priors are informative</title>
		<link>http://www.johndcook.com/blog/2012/08/05/vague-priors-are-informative/</link>
		<comments>http://www.johndcook.com/blog/2012/08/05/vague-priors-are-informative/#comments</comments>
		<pubDate>Sun, 05 Aug 2012 21:30:30 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Bayesian]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11643</guid>
		<description><![CDATA[Data analysis has to start from some set of assumptions. Bayesian prior distributions drive some people crazy because they make assumptions explicit that people prefer to leave implicit. But there&#8217;s no escaping the need to make some sort of prior<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/08/05/vague-priors-are-informative/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Data analysis has to start from some set of assumptions. Bayesian prior distributions drive some people crazy because they make assumptions explicit that people prefer to leave implicit. But there&#8217;s no escaping the need to make some sort of prior assumptions, whether you&#8217;re doing Bayesian statistics or not.</p>
<p>One attempt to avoid specifying a prior distribution is to start with a &#8220;non-informative&#8221; prior. <a href="http://arxiv.org/abs/1205.4446v1">David Hogg</a> gives a good explanation of why this doesn&#8217;t accomplish what some think it does.</p>
<blockquote><p>In practice, investigators often want to “assume nothing” and put a very or infinitely broad prior on the parameters; of course putting a broad prior is not equivalent to assuming nothing, it is just as severe an assumption as any other prior. For example, even if you go with a very broad prior on the parameter <em>a</em>, that is a different assumption than the same form of very broad prior on <em>a</em><sup>2</sup> or on arctan(<em>a</em>). The prior doesn’t just set the ranges of parameters, it places a measure on parameter space. That’s why it is so important.</p></blockquote>
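<p>Here is a small worked example of the point about measures, as a sketch under stated assumptions. The range [0, 10] and the function names are hypothetical, chosen only for illustration. The code compares the probability that <em>a</em><sup>2</sup> is at most 50 under a flat prior on <em>a</em> with the same probability under a flat prior on <em>a</em><sup>2</sup>.</p>

```python
import math

UPPER = 10.0   # hypothetical range: the parameter a lies in [0, UPPER]

def prob_flat_on_a(t):
    """P(a**2 is at most t) when a is uniform on [0, UPPER]."""
    return min(math.sqrt(t), UPPER) / UPPER

def prob_flat_on_a_squared(t):
    """P(a**2 is at most t) when a**2 is uniform on [0, UPPER**2]."""
    return min(t, UPPER**2) / UPPER**2

print(prob_flat_on_a(50.0))          # about 0.707
print(prob_flat_on_a_squared(50.0))  # 0.5
```

<p>Both priors cover the same range of parameter values, yet they assign different probabilities to the same event. Neither one assumes nothing.</p>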
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2011/11/28/bad-logic-but-good-statistics/">Bad logic, but good statistics</a><br />
<a href="http://www.johndcook.com/blog/2011/09/27/bayesian-amazon/">Bayesian view of Amazon resellers</a><br />
<a href="http://www.johndcook.com/blog/2011/09/06/bayes-isnt-magic/">Bayes isn&#8217;t magic</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/08/05/vague-priors-are-informative/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Avoiding underflow in Bayesian computations</title>
		<link>http://www.johndcook.com/blog/2012/07/26/avoiding-underflow-in-bayesian-computations/</link>
		<comments>http://www.johndcook.com/blog/2012/07/26/avoiding-underflow-in-bayesian-computations/#comments</comments>
		<pubDate>Thu, 26 Jul 2012 12:00:31 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Bayesian]]></category>
		<category><![CDATA[Integration]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11800</guid>
		<description><![CDATA[Here&#8217;s a common problem that arises in Bayesian computation. Everything works just fine until you have more data than you&#8217;ve seen before. Then suddenly you start getting infinite, NaN, or otherwise strange results. This post explains what might be wrong<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/07/26/avoiding-underflow-in-bayesian-computations/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Here&#8217;s a common problem that arises in Bayesian computation. Everything works just fine until you have more data than you&#8217;ve seen before. Then suddenly you start getting infinite, NaN, or otherwise strange results. This post explains what might be wrong and how to fix it.</p>
<p>A posterior density is (proportional to) a likelihood function times a prior distribution. The likelihood function is a product with one factor per data point. If these factors are less than 1, and you multiply enough of them together, the result will be too small to represent as a floating point number and your calculation will underflow to zero. Then subsequent operations with this number, such as dividing it by another number that has also underflowed to zero, may produce an infinite or NaN result.</p>
<p>The instinctive reaction regarding underflow or overflow is to use logs. And often that works. If you wanted to know where the maximum of the posterior density occurs, you could find the maximum of the <em>logarithm</em> of the posterior density. But in Bayesian computations you often need to <em>integrate</em> the posterior density times some functions. Now you cannot just work with logs because logs and integrals don&#8217;t commute.</p>
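<p>The underflow and the log fix described above can be seen in a few lines. This is only an illustrative sketch: the 400 likelihood factors of 0.1 are made up, not taken from any real model.</p>

```python
import math

# Hypothetical likelihood factors, one per observation.
likelihood_factors = [0.1] * 400

# Multiplying directly: the true value is 1e-400, which underflows.
product = 1.0
for factor in likelihood_factors:
    product *= factor
print(product)        # 0.0

# Summing logs instead keeps the same information in a representable number.
log_product = sum(math.log(factor) for factor in likelihood_factors)
print(log_product)    # about -921.03
```

<p>The log of the product is a perfectly ordinary number even though the product itself cannot be stored.</p>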
<p>One way around the problem is to multiply the integrand by a constant so large that there is no danger of underflow. Multiplying by constants <em>does</em> commute with integration.</p>
<p>So suppose your integrand is on the order of 10^-400, far below the smallest positive number a standard double can represent (roughly 10^-308 for normalized values). Do you need extended precision arithmetic? <a href="http://www.johndcook.com/blog/2012/07/19/53-bits-ought-to-be-enough-for-anybody/">No</a>, you just need to understand your problem.</p>
<p>If you multiply your integrand by 10^400 before integrating, then your integrand is roughly 1 in magnitude. Then do your integration, and remember that the result is 10^400 times the actual value.</p>
<p>You could take the following steps.</p>
<ol>
<li>Find <em>m</em>, the maximum of the log of the integrand.</li>
<li>Let <em>I</em> be the integral of exp( log of the integrand &#8211; <em>m</em> ).</li>
<li>Keep track that your actual integral is exp(<em>m</em>) <em>I</em>, or that its log is <em>m</em> + log <em>I</em>.</li>
</ol>
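<p>The three steps above can be sketched as follows. This is a minimal illustration, not a production integrator: the function name <code>log_integral</code>, the trapezoid rule, and the test integrand are all assumptions made for the example. The maximum is simply taken over the grid points, which is as sloppy as the method allows.</p>

```python
import math

def log_integral(log_f, lo, hi, n=10000):
    """Return the log of the integral of exp(log_f(x)) over [lo, hi]."""
    h = (hi - lo) / n
    log_values = [log_f(lo + i * h) for i in range(n + 1)]
    # Step 1: a rough maximum of the log of the integrand.
    m = max(log_values)
    # Step 2: integrate the rescaled integrand with the trapezoid rule.
    ys = [math.exp(v - m) for v in log_values]
    scaled_integral = h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))
    # Step 3: the log of the actual integral is m + log I.
    return m + math.log(scaled_integral)

def log_tiny_gaussian(x):
    """Log of a hypothetical integrand: exp(-1000) * exp(-x**2 / 2)."""
    return -1000.0 - 0.5 * x * x

result = log_integral(log_tiny_gaussian, -10.0, 10.0)
exact = -1000.0 + 0.5 * math.log(2.0 * math.pi)
print(result, exact)
```

<p>Evaluating the integrand directly would give zero everywhere, since exp(-1000) underflows. After rescaling, the computed log of the integral agrees closely with the exact value.</p>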
<p>Note that you can be extremely sloppy in this process. You don&#8217;t need an accurate estimate of the maximum of the integrand <em>per se</em>. If you&#8217;re within a few dozen orders of magnitude, for example, that could be sufficient to carry out your integration without underflow.</p>
<p>One way to estimate the maximum is to use a frequentist estimator of your parameters as an approximate MLE, and assume that this is approximately where your posterior density takes on its maximum value. This might actually be very accurate, but it doesn&#8217;t need to be.</p>
<p>Note also that it&#8217;s OK if some of the evaluations of your integrand underflow to zero. You just don&#8217;t want the entire integral to underflow to zero. Places where the integrand is many orders of magnitude less than its maximum don&#8217;t contribute to the integration anyway. (It&#8217;s important that your integration pays attention to the region where the integrand is largest. A naive integration could entirely miss the most important region and completely underestimate the integral. But that&#8217;s a matter for another blog post.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/07/26/avoiding-underflow-in-bayesian-computations/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Computing log gamma differences</title>
		<link>http://www.johndcook.com/blog/2012/07/14/log-gamma-differences/</link>
		<comments>http://www.johndcook.com/blog/2012/07/14/log-gamma-differences/#comments</comments>
		<pubDate>Sat, 14 Jul 2012 12:43:48 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[SciPy]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11711</guid>
		<description><![CDATA[Statistical computing often involves working with ratios of factorials. These factorials are often too big to fit in a floating point number, and so we work with logarithms. So if we need to compute log(a! / b!), we call software<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/07/14/log-gamma-differences/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Statistical computing often involves working with ratios of factorials. These factorials are often too big to fit in a floating point number, and so we work with logarithms. So if we need to compute log(<em>a</em>! / <em>b</em>!), we call software that calculates log(<em>a</em>!) and log(<em>b</em>!) directly without computing <em>a</em>! or  <em>b</em>! first. More on that <a href="http://www.johndcook.com/blog/2008/04/24/how-to-calculate-binomial-probabilities/">here</a>. But sometimes this doesn&#8217;t work.</p>
<p><span id="more-11711"></span>Suppose <em>a</em> = 10^40 and <em>b</em> = <em>a</em> &#8211; 10^10. Our first problem is that we cannot even compute <em>b</em> directly. Since <em>a</em> &#8211; <em>b</em> is 30 orders of magnitude smaller than <em>a</em>, we&#8217;d need about 100 bits of precision to begin to tell <em>a</em> and <em>b</em> apart, about twice as much as a standard  floating point number. (Why 100? That&#8217;s the log base 2 of 10^30. And how much precision does a floating point number have? See <a href="http://www.johndcook.com/blog/2009/04/06/anatomy-of-a-floating-point-number/">here</a>.)</p>
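<p>A quick check of these claims in ordinary double precision, as an illustrative sketch using only the Python standard library:</p>

```python
import math
import sys

a = 1e40
b = a - 1e10                      # the subtraction is lost to rounding

print(a == b)                     # True: a and b are indistinguishable
print(math.log2(10**30))          # about 99.66 bits needed
print(sys.float_info.mant_dig)    # 53 bits available in a double
```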
<p>Our next problem is that even if we could accurately compute <em>b</em>, the log gamma function is going to be approximately the same for <em>a</em> and <em>b</em>, and so the difference will be highly inaccurate.</p>
<p>So what do we do? We could use some sort of extended precision package, but that is not necessary. There&#8217;s an elegant solution using ordinary precision.</p>
<p>Let <em>f</em>(<em>x</em>) = log(Γ(<em>x</em>)). The mean value theorem says that</p>
<p style="padding-left: 30px;"><em>f</em>(<em>a</em>+1) &#8211; <em>f</em>(<em>b</em>+1) = (<em>a</em>-<em>b</em>)<em> f</em>&#8242;(<em>c</em>+1)</p>
<p>for some <em>c</em> between <em>b</em> and <em>a</em>. We don&#8217;t need to compute <em>b</em>, only <em>a</em> &#8211; <em>b</em>, which we know is 10^10. The derivative of the log of the gamma function is called the digamma function, written ψ(<em>x</em>). So</p>
<p style="padding-left: 30px;">log(<em>a</em>! / <em>b</em>!) = 10^10 ψ(<em>c</em>+1).</p>
<p>But what is <em>c</em>? The mean value theorem only says that the equation above is true for some <em>c</em>. What <em>c</em> do we use? It doesn&#8217;t matter. Since <em>a</em> and <em>b</em> are so relatively close, the digamma function will take on essentially the same value for any value between <em>a</em>+1 and <em>b</em>+1. Therefore</p>
<p style="padding-left: 30px;">log(<em>a</em>! / <em>b</em>!) ≈ 10^10 ψ(<em>a</em>).</p>
<p>We could compute this in two lines of Python:</p>
<p style="padding-left: 30px;"><code>from scipy.special import digamma</code><br />
<code>print(1e10*digamma(1e40))</code></p>
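<p>As a sanity check, here is a sketch that compares the approximation with direct log gamma evaluation at a size where the direct method still works. To keep it free of dependencies it does not call SciPy. Instead, <code>digamma_large</code> is a hypothetical stand-in that uses the first terms of the standard asymptotic series ψ(<em>x</em>) ≈ log <em>x</em> &#8211; 1/(2<em>x</em>) &#8211; 1/(12<em>x</em>^2), which is adequate only for large <em>x</em>.</p>

```python
import math

def digamma_large(x):
    """Leading terms of the asymptotic series for digamma; large x only."""
    return math.log(x) - 1.0 / (2.0 * x) - 1.0 / (12.0 * x * x)

def log_factorial_ratio(a, diff):
    """Approximate log(a! / (a - diff)!) by diff * digamma(a)."""
    return diff * digamma_large(a)

# Moderate case: direct evaluation is still accurate here.
a, diff = 1.0e6, 10.0
direct = math.lgamma(a + 1.0) - math.lgamma(a - diff + 1.0)
approx = log_factorial_ratio(a, diff)
print(direct, approx)

# The case from the post, where direct evaluation is hopeless.
huge = log_factorial_ratio(1e40, 1e10)
print(huge)   # about 9.21e11
```

<p>In the moderate case the two numbers agree to several significant figures, which lends some confidence to the result in the extreme case.</p>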
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2010/08/16/how-to-compute-log-factorial/">How to compute log factorial</a><br />
<a href="http://www.johndcook.com/blog/2010/06/07/math-library-functions-that-seem-unnecessary/">Math library functions that seem unnecessary</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/07/14/log-gamma-differences/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Wrong and unnecessary</title>
		<link>http://www.johndcook.com/blog/2012/07/07/wrong-and-unnecessary/</link>
		<comments>http://www.johndcook.com/blog/2012/07/07/wrong-and-unnecessary/#comments</comments>
		<pubDate>Sat, 07 Jul 2012 22:08:27 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11657</guid>
		<description><![CDATA[David Hogg on linear regression: … in almost all cases in which scientists fit a straight line to their data, they are doing something that is simultaneously wrong and unnecessary. It is wrong because … linear relationship is exceedingly rare.<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/07/07/wrong-and-unnecessary/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p><a href="http://arxiv.org/abs/1008.4686">David Hogg</a> on linear regression:</p>
<blockquote><p>… in almost all cases in which scientists fit a straight line to their data, they are doing something that is simultaneously <em>wrong</em> and <em>unnecessary</em>. It is wrong because … linear relationship is exceedingly rare.</p>
<p>Even if the investigator doesn&#8217;t care that the fit is wrong, it is likely to be unnecessary. Why? Because it is rare that … the important result … is the slope and intercept of a best-fit line! Usually the full distribution of data is much more rich, informative, and important than any simple metrics made by fitting an overly simple model.</p>
<p>That said, it must be admitted that one of the most effective ways to communicate scientific results is with catchy punchlines and compact, approximate representations, even when those are unjustified and unnecessary.</p></blockquote>
<p><strong>Related post</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2012/07/05/responsible-data-analysis/">Responsible data analysis</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/07/07/wrong-and-unnecessary/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Responsible data analysis</title>
		<link>http://www.johndcook.com/blog/2012/07/05/responsible-data-analysis/</link>
		<comments>http://www.johndcook.com/blog/2012/07/05/responsible-data-analysis/#comments</comments>
		<pubDate>Thu, 05 Jul 2012 12:51:06 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11641</guid>
		<description><![CDATA[David Hogg on responsible data analysis: The key idea is that the result of responsible data analysis is not an answer but a distribution over answers. Data are inherently noisy and incomplete; they never answer your question precisely. So no<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/07/05/responsible-data-analysis/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p><a href="http://arxiv.org/abs/1205.4446v1">David Hogg</a> on responsible data analysis:</p>
<blockquote><p>The key idea is that the result of responsible data analysis is not an <em>answer</em> but a <em>distribution over answers</em>. Data are inherently noisy and incomplete; they never answer your question precisely. So no single number … will adequately represent the result of a data analysis. Results are always pdfs (or full likelihood functions); we must embrace that.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/07/05/responsible-data-analysis/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Window dressing for ideological biases</title>
		<link>http://www.johndcook.com/blog/2012/06/19/ideological-biases/</link>
		<comments>http://www.johndcook.com/blog/2012/06/19/ideological-biases/#comments</comments>
		<pubDate>Tue, 19 Jun 2012 23:56:01 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Economics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11591</guid>
		<description><![CDATA[From Russ Roberts on the latest EconTalk podcast: … this is really embarrassing as a professional economist — but I&#8217;ve come to believe that there may be no examples … of where a sophisticated multivariate econometric analysis … where important<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/06/19/ideological-biases/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>From Russ Roberts on the latest <a href="http://www.econtalk.org/archives/2012/06/manzi_on_knowle.html">EconTalk</a> podcast:</p>
<blockquote><p>… this is really embarrassing as a professional economist — but I&#8217;ve come to believe that there may be no examples … of where a sophisticated multivariate econometric analysis … where important policy issues are at stake, has led to a consensus. Where somebody says: Well, I guess I was wrong. Where somebody on the other side of the issue says: Your analysis, you&#8217;ve got a significant coefficient there — I&#8217;m wrong. No. They always say: You left this out, you left that out. And they&#8217;re right, of course. And then they can redo the analysis and show that in fact — and so what that means is that the tools, instead of leading to certainty and improved knowledge about the usefulness of policy interventions, are <strong>merely window dressing for ideological biases</strong> that are pre-existing in the case of the researchers.</p></blockquote>
<p>Emphasis added.</p>
<p><strong>Related post</strong>: <a href="http://www.johndcook.com/blog/2012/03/27/numerous-studies-have-confirmed/">Numerous studies have confirmed …</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/06/19/ideological-biases/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Methods that get used</title>
		<link>http://www.johndcook.com/blog/2012/06/18/methods-that-get-used/</link>
		<comments>http://www.johndcook.com/blog/2012/06/18/methods-that-get-used/#comments</comments>
		<pubDate>Mon, 18 Jun 2012 23:02:05 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11587</guid>
		<description><![CDATA[I have a conjecture regarding statistical methods: The probability of a method being used drops by at least a factor of 2 for every parameter that has to be determined by trial-and-error. A method could have a dozen inputs, and<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/06/18/methods-that-get-used/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I have a conjecture regarding statistical methods:</p>
<p style="padding-left: 30px;">The probability of a method being used drops by at least a factor of 2 for every parameter that has to be determined by trial-and-error.</p>
<p>A method could have a dozen inputs, and if they&#8217;re all intuitively meaningful, it might be adopted. But if there is even one parameter that requires trial-and-error fiddling to set, the probability of use drops sharply. As the number of non-intuitive parameters increases, the probability of anyone other than the method&#8217;s author using the method rapidly drops to zero.</p>
<p>John Tukey said that the <em>practical</em> power of a statistical test is its <em>statistical</em> power times the probability that someone will use it. Therefore practical power decreases exponentially with the number of non-intuitive parameters.</p>
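<p>To put numbers on the conjecture, here is a toy calculation. The statistical power of 0.9, the baseline probability of use of 0.8, and the function name are all hypothetical. Since the conjecture says the probability drops by <em>at least</em> a factor of 2 per parameter, the result is an upper bound.</p>

```python
def practical_power_bound(statistical_power, base_use_prob, n_fiddly_params):
    """Upper bound on practical power under the halving conjecture."""
    return statistical_power * base_use_prob * 0.5 ** n_fiddly_params

# Hypothetical method: power 0.9, and an 80% chance of use with no fiddly parameters.
for k in range(5):
    print(k, practical_power_bound(0.9, 0.8, k))
```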
<p><strong>Related post</strong>: <a href="http://www.johndcook.com/blog/2009/07/31/software-that-gets-used/">Software that gets used</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/06/18/methods-that-get-used/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Small data</title>
		<link>http://www.johndcook.com/blog/2012/06/06/small-data/</link>
		<comments>http://www.johndcook.com/blog/2012/06/06/small-data/#comments</comments>
		<pubDate>Wed, 06 Jun 2012 11:59:08 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Clinical trials]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Probability and Statistics]]></category>

		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=11187</guid>
		<description><![CDATA[Big data is getting a lot of buzz lately, but small data is interesting too. In some ways it&#8217;s more interesting. Because of limit theorems, a lot of things become dull in the large that are more interesting in the<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.johndcook.com/blog/2012/06/06/small-data/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Big data is getting a lot of buzz lately, but small data is interesting too. In some ways it&#8217;s more interesting. Because of limit theorems, a lot of things become dull in the large that are more interesting in the small.</p>
<p>When working with small data sets you have to accept that you will very often draw the wrong conclusion. You just can&#8217;t have high confidence in inference drawn from a small amount of data, unless you can do <a href="http://www.johndcook.com/blog/2011/09/06/bayes-isnt-magic/">magic</a>. But you do the best you can with what you have. You have to be content with the accuracy of your method relative to the amount of data available.</p>
<p>For example, a clinical trial may try to find the optimal dose of some new drug by giving the drug to only 30 patients. When you have five doses to test and only 30 patients, you&#8217;re just not going to find the right dose very often. You might want to assign 6 patients to each dose, but you can&#8217;t count on that. For safety reasons, you have to start at the lowest dose and work your way up cautiously, and that usually results in uneven allocation to doses, and thus less statistical power. And you might not treat all 30 patients. You might decide &#8212; possibly incorrectly &#8212; to stop the trial early because it appears that all doses are too toxic or ineffective. (This gives a glimpse of why testing drugs on people is a harder statistical problem than testing fertilizers on crops.)</p>
<p>Maybe your method finds the right answer 60% of the time, hardly a satisfying performance. But if alternative methods find the right answer 50% of the time under the same circumstances, your 60% looks great by comparison.</p>
<p><strong>Related posts</strong>:</p>
<p><a href="http://www.johndcook.com/blog/2010/02/25/the-law-of-medium-numbers/">The law of medium numbers</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johndcook.com/blog/2012/06/06/small-data/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
