Five criticisms of significance testing

The following list summarizes five criticisms of significance testing as it is commonly practiced.

  1. Andrew Gelman: In reality, null hypotheses are nearly always false. Is drug A exactly as effective as drug B? Certainly not. You know before doing an experiment that there must be some difference that would show up given enough data.
  2. Jim Berger: A small p-value means the data were unlikely under the null hypothesis. Maybe the data were just as unlikely under the alternative hypothesis. Comparisons of hypotheses should be conditional on the data.
  3. Stephen Ziliak and Deirdre McCloskey: Statistical significance is not the same as scientific significance. The most important question for science is the size of an effect, not whether the effect exists.
  4. William Gosset: Statistical error is only one component of real error, maybe a small component. When you actually conduct multiple experiments rather than speculate about hypothetical experiments, the variability of your data goes up.
  5. John Ioannidis: Small p-values do not mean a small probability of being wrong. In one review, 74% of studies with a p-value of 0.05 were found to be wrong.

Creativity for left-brained people

I’ve written a fair amount about creativity on this blog, though I’ve written much more about other topics. I created a Squidoo lens to pull together some of my creativity posts. Maybe that page will appeal to some people who would otherwise be put off by the more technical posts here. [Update: apparently the Squidoo lens has gone away.]

The Squidoo page organizes posts into three categories: creativity, simplicity and complexity, and innovation. (*) There is a brief description for each link in each category, usually a single sentence.

In case you haven’t heard of Squidoo, it’s a project started by Seth Godin that encourages people to write about narrow topics they enjoy. One of their slogans is that everyone’s an expert at something. It’s fun to browse around and see what obscure topics people have written lenses for. They’re up to over 700,000 pages.

(*) Three cheers for the Oxford comma.

Fast way to test whether a number is a square

A question came up on StackOverflow today regarding testing whether integers were perfect squares. The person asking the question said his first idea was to take the (floating point) square root, round the result to the nearest integer, square that, and see whether he got the original number back.  But he thought maybe there was a faster way that avoided floating point arithmetic.

My suggestion was to first look at the number in hexadecimal (base 16) and see whether the number could be ruled out as a square before proceeding.  Squares in base 16 end in 0, 1, 4, or 9. (You can see this by brute force: square every number from 0 to 15 and take the remainder dividing by 16.) You can find the end of the number in hex by a fast bit operation, then rule out 12 out of every 16 numbers. Then do the square root and square procedure on the 4 out of 16 numbers that slip through the hexadecimal filter.  Here’s C++ code for the algorithm.

#include <cmath>

int PerfectSquare(int n)
{
    int h = n & 0xF; // last hexadecimal "digit"
    if (h > 9)
        return 0; // return immediately in 6 cases out of 16

    // Take advantage of Boolean short-circuit evaluation.
    // Squares can only end in 0, 1, 4, or 9 in hex.
    if (h != 2 && h != 3 && h != 5 && h != 6 && h != 7 && h != 8)
    {
        // take the square root if you must
        int t = (int) std::floor(std::sqrt((double) n) + 0.5);
        return t*t == n;
    }
    return 0;
}

This algorithm ran about twice as fast as always taking the square root. The analogous C# code also ran about twice as fast as the more direct method.

What about looking at other bases? We want to use bases that are powers of 2 so we can get the last “digit” quickly. Half the numbers in base 4 are potential squares. Three out of eight numbers in base 8 are potential squares. Four out of 16 are potential squares in base 16. So taking smaller powers of 2 is probably not faster. Seven out of 32 numbers are potential squares base 32, a slightly lower ratio than base 16, but at the cost of more comparisons. I haven’t benchmarked using bases other than 16, but I doubt they are faster. If any are faster, I imagine the difference is small.

Here are a couple of number theory problems that come out of the problem above. First, how many numbers have square roots mod 2ⁿ, i.e. for how many values y does x² ≡ y (mod 2ⁿ) have a solution? Call that number g(n). For example, g(3) = 3, g(4) = 4, g(5) = 7, and g(6) = 12. Is there a simple formula for g(n)? Second, what is the minimum value of g(n)/2ⁿ?
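For small n, g(n) is easy to tabulate by brute force. Here’s a quick sketch, my own code rather than anything from the discussion, that reproduces the values above.

#include <cstdio>
#include <set>

// Count the distinct values of x^2 mod 2^n, i.e. g(n) in the notation above.
int main()
{
    for (int n = 3; n <= 10; n++)
    {
        unsigned long m = 1UL << n; // modulus 2^n
        std::set<unsigned long> squares;
        for (unsigned long x = 0; x < m; x++)
            squares.insert((x * x) % m);
        std::printf("g(%d) = %lu\n", n, (unsigned long) squares.size());
    }
    return 0;
}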

Update: See Michael Lugo’s solution to the number theory questions in the comments below or more discussion here.

Update: According to the StackOverflow discussion, the base 64 technique was a little faster than the base 16 technique, at least when implemented in C#. The relative speeds of variations on the algorithm depend on what language you’re using.
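For illustration, here’s a sketch of what the base 64 variant might look like. This is my own C++ code, not the code from the StackOverflow discussion: it packs the 12 possible square endings mod 64 into a 64-bit mask, so the last six bits of n rule out 52 of every 64 candidates before any floating point work.

#include <cmath>
#include <cstdint>

// Sketch of a base 64 filter (not the StackOverflow code). Bit i of `mask`
// is set iff i is a square mod 64; there are 12 such residues.
// Assumes n is nonnegative.
int PerfectSquare64(int n)
{
    static const std::uint64_t mask =
        (1ULL << 0)  | (1ULL << 1)  | (1ULL << 4)  | (1ULL << 9)  |
        (1ULL << 16) | (1ULL << 17) | (1ULL << 25) | (1ULL << 33) |
        (1ULL << 36) | (1ULL << 41) | (1ULL << 49) | (1ULL << 57);

    if (!(mask & (1ULL << (n & 63))))
        return 0; // the last six bits rule out 52 of every 64 numbers

    int t = (int) std::floor(std::sqrt((double) n) + 0.5);
    return t*t == n;
}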

Origin of "statistically significant"

The words signify and signal come from the Latin word signum for sign. The phrase “statistically significant” was coined to indicate the existence of a statistical signal that an effect was present. The word significant in this context is not meant to imply that an effect is large or important. David Salsburg elaborates on this in his book The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century.

The word was used in its late-nineteenth-century meaning, which is simply that the computation signified or showed something. As the English language entered the twentieth century, the word significant began to take on other meanings, until it took on its current meaning, implying something very important. … Unfortunately, those who use statistical analysis often treat a significant test statistic as implying something closer to the modern meaning of the word.

An effect is statistically significant if it is considered unlikely to be a coincidence. An effect can be highly significant (very unlikely to be a coincidence) and yet be very small. Also, calling an effect “statistically significant” does not imply that anyone necessarily thinks the effect is consequential.

Here’s an example from astronomy. Newtonian mechanics describes the motions of the planets pretty well, but not perfectly. There is a measurable contribution from relativistic effects when studying the orbit of Mercury. The effect of relativity in this case would be highly statistically significant in an experiment conducted carefully enough to measure it. I don’t know whether the effect is significant in the sense that NASA would need to take it into consideration when sending a probe to Mercury; it may not be. I imagine the contributions of relativity to describing the orbits of the rest of the planets are not significant to NASA engineers, though they would be statistically significant if they could be measured.

10,000 spam comments blocked

The spam filter on my blog has now blocked over 10,000 comments since I installed it. As far as I know, there have only been a couple false positives (legitimate messages flagged as spam). Occasionally I’ll skim the blocked comments for false positives, but I usually delete them all without looking.

So far the number of false negatives (spam comments that slip through the filter) has been manageable.  However, the total amount of spam and the number of false negatives has increased as this blog has become more popular.

For reasons I don’t understand, one post has attracted far more spam than any other: One program to rule them all.

A pragmatist takes the witness stand

The idea for my previous post came from the following quote from Josiah Royce, responding to William James. James embraced philosophical pragmatism, the view that whatever is expedient is true.

Would you then get on the witness stand in a court of law and swear to tell ‘the expedient, the whole expedient and nothing but the expedient, so help me future experience’…?

Philosophical pragmatism is not simply pragmatism in the sense of valuing the practical. Instead it is the idea that the only test of truth is what works. But “works” by what criteria? Pragmatism is inadequate as a philosophical foundation because you need other criteria to decide what it means for something to “work.”

A relativist takes the witness stand

Here’s how a relativist would administer a judicial oath.

Do you solemnly swear to tell what’s true for you, the whole of what’s true for you, and only what’s true for you?

Would that make you feel a little uneasy if the person being sworn in were testifying against you? Assuming you’re innocent, wouldn’t you rather they just swear to tell the truth?

Update: See further discussion in A pragmatist takes the witness stand.

When the normal approximation for Student t isn’t good enough

Folk wisdom says that for all practical purposes, a Student-t distribution with 30 or more degrees of freedom is a normal distribution. Well, not for all practical purposes.

For 30 or more degrees of freedom, the error in approximating the PDF or CDF of a Student-t distribution with a normal is less than 0.005. So for many applications, the n > 30 rule of thumb is appropriate. (See these notes for details.)

However, sometimes you need the quantiles of a t distribution, such as when finding confidence intervals. In that case you don’t evaluate the CDF of a Student-t distribution per se but rather the inverse of the CDF, and there the error in the normal approximation is larger than I expected.

Say you’re computing a 95% confidence interval for the mean of a set of 31 data points. You first find t* such that P(t > t*) = 0.025, where t is a Student-t random variable with 31 - 1 = 30 degrees of freedom. Your confidence interval is the sample mean ± t* s/√n, where s is the sample standard deviation. For 30 degrees of freedom, t* = 2.04. If you used the normal approximation, you’d get 1.96 instead of 2.04, a relative error of about 4%, which means the width of your confidence interval would be off by about 4%. So while the error in the normal approximation to the CDF is less than 0.005 for n > 30, the relative error in the normal approximation to the inverse CDF is an order of magnitude greater. Also, the error increases with the confidence level: for a 99% confidence interval, the error is about 6.3%.

It may be that none of this is a problem. If you only have 31 data points, there’s a fair amount of uncertainty in your estimate of the mean, and there’s no point in quantifying with great precision an estimate of how uncertain you are! Modeling assumptions are probably a larger source of error than the normal approximation to the Student-t.  But as a numerical problem, it’s interesting that the approximation error may be larger than expected. For n = 300, the error in the normal approximation to t* is about 0.4%. This means the error in the normal approximation to the inverse CDF is as good at n=300 as the normal approximation to the CDF itself is at n=30.
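To make the comparison above concrete, here is a minimal sketch in C++, assuming the Boost.Math library is available (an assumption on my part; the post doesn’t rely on any particular library). It computes the Student-t and normal quantiles behind a 95% confidence interval with 30 degrees of freedom.

#include <iostream>
#include <boost/math/distributions/normal.hpp>
#include <boost/math/distributions/students_t.hpp>

int main()
{
    using namespace boost::math;

    students_t t_dist(30); // 31 data points -> 30 degrees of freedom
    normal     z_dist;     // standard normal

    double t_star = quantile(t_dist, 0.975); // roughly 2.04
    double z_star = quantile(z_dist, 0.975); // roughly 1.96

    std::cout << "t* = " << t_star
              << ", z* = " << z_star
              << ", relative error = " << (t_star - z_star) / t_star
              << std::endl;
    return 0;
}

Replacing 0.975 with 0.995 gives the 99% interval comparison mentioned above.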

Errors in math papers not a big deal?

Daniel Lemire wrote a blog post this morning that ties together a couple of themes previously discussed here.

Most published math papers contain errors, and yet there have been surprisingly few “major screw-ups” as defined by Mark Dominus. Daniel Lemire’s post quotes Doron Zeilberger on why these frequent errors are often benign.

Most mathematical papers are leaves in the web of knowledge, that no one reads, or will ever use to prove something else. The results that are used again and again are mostly lemmas, that while a priori non-trivial, once known, their proof is transparent. (Zeilberger’s Opinion 91)

Those papers that are “branches” rather than “leaves” receive more scrutiny and are more likely to be correct.

Zeilberger says lemmas get reused more than theorems. This dovetails with Mandelbrot’s observation mentioned a few weeks ago.

Many creative minds overrate their most baroque works, and underrate the simple ones. When history reverses such judgments, prolific writers come to be best remembered as authors of “lemmas,” of propositions they had felt “too simple” in themselves and had to be published solely as preludes to forgotten theorems.

There are obvious analogies to software.  Software that many people use has fewer bugs than software that few people use, just as theorems that people build on have fewer bugs than “leaves in the web of knowledge.” Useful subroutines and libraries are more likely to be reused than complete programs. And as Donald Knuth pointed out, re-editable code is better than black-box reusable code.

Everybody knows that software has bugs, but not everyone realizes how buggy theorems are. Bugs in software are more obvious because paper doesn’t abort. Proofs and programs are complementary forms of validation. Attempting to prove the correctness of an algorithm certainly reduces the chances of a bug, but proofs are fallible as well. Again quoting Knuth, he once said “Beware of bugs in the above code; I have only proved it correct, not tried it.” Not only can programs benefit from being more proof-like, proofs can benefit from being more program-like.

Upper and lower bounds on binomial coefficients

Binomial coefficients are easy to motivate and define. In its simplest form, the binomial coefficient C(n, k) is the number of ways to choose k things from a set of n things, and it equals n!/(k! (n-k)!). On the other hand, binomial coefficients can be hard to work with when proving theorems.

In applications, such as establishing bounds on an algorithm’s run time, it’s often enough to have upper or lower bounds on binomial coefficients that are easier to work with. Here’s a pair of such bounds.

\left( \frac{n}{k} \right)^k \leq {n \choose k} \leq \left( \frac{en}{k} \right)^k

for positive integers n and k with 1 ≤ k ≤ n.

The bounds above are easy to remember due to the typographical similarity between the three terms. Start with the binomial coefficient symbol. To get a lower bound, add a horizontal bar to turn the coefficient into a fraction and add an exponent of k. To get an upper bound, insert a factor of “e” inside the lower bound.

I found the bounds above in Introduction to Algorithms. The statement and proof were limited to integer arguments, but binomial coefficients can be defined more generally and the bounds appear to be valid for real values of n and k, not just integer values.

How good are the bounds? Here’s a plot of the upper and lower bounds as well as the true value. Because binomial coefficients can get very large, I plotted the logarithms of the bounds and true values. In this plot n = 100 and k varies between 1 and 100 (including non-integer values).

[Plot: log of the binomial coefficient and its bounds, n = 100]

The lower bound is exact at the left end and the right end and is worse in the middle.

The upper bound is quite good for small k but terrible for large k. However, we could easily improve the upper bound by using the fact that binomial(n, k) = binomial(n, n-k). For k larger than n/2, we could apply the upper bound to n-k rather than k. We could do the same for the lower bound, though the benefit in doing so is not as great.

Here’s the plot that uses the better of the two bounds, switching from binomial(n, k) to binomial(n, n-k) for k > n/2.

[Plot: log of the binomial coefficient and its bounds, using the better of k or n-k, n = 100]

I experimented briefly with other values of n and got very similar plots.
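Here’s a small sketch of the comparison, my own code rather than anything from the post. It uses the log-gamma function to compute log C(n, k) and applies the symmetry C(n, k) = C(n, n-k) to tighten both bounds as described above.

#include <algorithm>
#include <cmath>
#include <cstdio>

// log of the binomial coefficient C(n, k), computed via lgamma
double log_binomial(double n, double k)
{
    return std::lgamma(n + 1) - std::lgamma(k + 1) - std::lgamma(n - k + 1);
}

int main()
{
    double n = 100;
    double ks[] = {1, 10, 50, 90, 99};
    for (double k : ks)
    {
        double m = std::min(k, n - k);            // use symmetry for k > n/2
        double lower = m * std::log(n / m);       // log of (n/m)^m
        double upper = m * (std::log(n / m) + 1); // log of (en/m)^m
        std::printf("k = %3g: %8.3f <= %8.3f <= %8.3f\n",
                    k, lower, log_binomial(n, k), upper);
    }
    return 0;
}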

Related post: How to compute binomial coefficients

Inside Steve Jobs' brain

Last week I read Inside Drucker’s Brain and this week I’m reading Inside Steve’s Brain. The two books are written by different authors about very different men. But one theme that both books have in common is an appreciation for killing projects, even moderately successful projects, in order to focus on even more important projects.

So far, the thing I’ve found most impressive is how Steve Jobs dramatically reduced Apple’s product offerings in his plan to turn around the company.

When Jobs took over, Apple sold about forty different products … The lineup of computers was particularly baffling. There were several major lines … each with a dozen different models. But there was little to distinguish between the models except for their confusing product names.

The book goes on to describe Steve Jobs’ response.

Jobs drew a very simple two-by-two grid on the whiteboard. Across the top he wrote “Consumer” and “Professional,” and down the side, “Portable” and “Desktop.” Here was Apple’s new product strategy. Just four machines …

Sometimes I feel like I need to do something like that in my personal life, cutting my number of projects down by an order of magnitude.

I’m curious what the rest of the book holds. As I write this, most of the book’s reviews on Amazon are positive, four stars out of five on average. There’s only a single one-star review, but it’s worth considering. In a nutshell, the critic says the book is poorly written and contains little new material. He may be right. I’ll grant that the book isn’t the best prose I’ve ever read. But it doesn’t matter to me whether it’s original; I don’t know much about Steve Jobs, so the book’s content is new to me.