Sample size calculation: power, budget, and economics

If you’re going to run a test on rabbits, you have to decide how many rabbits you’ll use. This is your sample size. A lot of what statisticians do in practice is calculate sample sizes.

A researcher comes to talk to a statistician. The statistician asks what effect size the researcher wants to detect. Do you think the new thing will be 10% better than the old thing? If so, you’ll need to design an experiment with enough subjects to stand a good chance of detecting a 10% improvement. Roughly speaking, sample size is inversely proportional to the square of effect size. So if you want to detect a 5% improvement, you’ll need 4 times as many subjects as if you want to detect a 10% improvement.

You’re never guaranteed to detect an improvement. The race is not always to the swift, nor the battle to the strong. So it’s not enough to think about what kind of effect size you want to detect, you also have to think about how likely you want to be to detect it.

Here’s what often happens in practice. The researcher makes an arbitrary guess at what effect size she expects to see. Then initial optimism may waver and she decides it would be better to design the experiment to detect a more modest effect size. When asked how high she’d like her chances to be of detecting the effect, she thinks 100% but says 95% since it’s necessary to tolerate some chance of failure.

The statistician comes back and says the researcher will need a gargantuan sample size. The researcher says this is far outside her budget. The statistician asks what the budget is, and what the cost per subject is, and then the real work begins.

The sample size the negotiation will converge on is the budget divided by the cost per sample. The statistician will fiddle with the effect size and probability of detecting it until the inevitable sample size is reached. This sample size, calculated to 10 decimal places and rounded up to the next integer, is solemnly reported with a post hoc justification containing no mention of budgets.

Sample size is always implicitly an economic decision. If you’re willing to make it explicitly an economic decision, you can compute the expected value of an experiment by placing a value on the possible outcomes. You make some assumptions—you always have to make assumptions—and calculate the probability under various scenarios of reaching each conclusion for various sample sizes, and select the sample size that leads to the best expected value.

More on experimental design

[1] There are three ways an A/B test can turn out: A wins, B wins, or there isn’t a clear winner. There’s a tendency to not think enough about the third possibility. Interim analysis often shuts down an experiment not because there’s a clear winner, but because it’s becoming clear there is unlikely to be a winner.