Suppose you’re proofreading a book. If you’ve read 20 pages and found 7 typos, you might reasonably estimate that the chances of a page having a typo are 7/20. But what if you’ve read 20 pages and found no typos? Are you willing to conclude that the chances of a page having a typo are 0/20, i.e. the book has absolutely no typos?

To take another example, suppose you are testing children for perfect pitch. You’ve tested 100 children so far and haven’t found any with perfect pitch. Do you conclude that children don’t have perfect pitch? You know that some do because you’ve heard of instances before. But your data suggest perfect pitch in children is at least rare. But how rare?

The **rule of three** gives a quick and dirty way to estimate these kinds of probabilities. It says that if you’ve tested *N* cases and haven’t found what you’re looking for, a reasonable estimate is that the probability is less than 3/*N*. So in our proofreading example, if you haven’t found any typos in 20 pages, you could estimate that the probability of a page having a typo is less than 15%. In the perfect pitch example, you could conclude that fewer than 3% of children have perfect pitch.

Note that the rule of three says that your probability estimate goes down in proportion to the number of cases you’ve studied. If you’d read 200 pages without finding a typo, your estimate would drop from 15% to 1.5%. But it doesn’t suddenly drop to zero. I imagine most people would harbor a suspicion that there may be typos even though they haven’t seen any in the first few pages. But at some point they might say “I’ve read so many pages without finding any errors, there must not be any.” The situation is a little different with the perfect pitch example, however, because you may know before you start that the probability cannot be zero.

If the sight of math makes you squeamish, you might want to stop reading now. Just remember that if you haven’t seen something happen in *N* observations, a good estimate is that the chances of it happening are less than 3/*N*.

What makes the rule of three work? Suppose the probability of what you’re looking for is *p*. If we want a 95% confidence interval, we want to find the largest *p* so that the probability of no successes out of *n* trials is 0.05, i.e. we want to solve (1-*p*)^*n* = 0.05 for *p*. Taking logs of both sides, *n* log(1-*p*) = log(0.05) ≈ -3. Since log(1-*p*) is approximately -*p* for small values of *p*, we have *p* ≈ 3/*n*.
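To see how close the approximation is, here is a minimal sketch that compares 3/*n* with the exact solution of (1-*p*)^*n* = 0.05. The function names are mine, chosen for illustration, and the log is the natural log.

```python
def rule_of_three(n):
    """Approximate 95% upper bound on p after n trials with no events."""
    return 3.0 / n


def exact_upper_bound(n, alpha=0.05):
    """Exact solution of (1 - p)**n = alpha for p."""
    return 1.0 - alpha ** (1.0 / n)


for n in (20, 100, 200, 1000):
    approx = rule_of_three(n)
    exact = exact_upper_bound(n)
    print(f"n={n:5d}  3/n={approx:.4f}  exact={exact:.4f}")
```

The exact bound is always a little smaller than 3/*n*, so the rule of three errs slightly on the conservative side, and the gap shrinks as *n* grows.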

The derivation above gives the frequentist perspective. I’ll now give the Bayesian derivation of the same result. Then you can say “*p* is probably less than 3/*N*” in clear conscience since Bayesians are allowed to make such statements.

Suppose you start with a uniform prior on *p*. The posterior distribution on *p* after having seen 0 successes and *N* failures is beta(1, *N*+1). If you calculate the posterior probability of *p* being less than 3/*N* you get an expression that approaches 1 – exp(-3) as *N* gets large, and 1 – exp(-3) ≈ 0.95.
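The Bayesian calculation can be checked numerically without any special libraries, because a beta(1, *b*) distribution has the closed-form CDF 1 – (1 – *x*)^*b*. The sketch below uses that fact; the function name is an illustrative choice, and it assumes *N* ≥ 3 so that 3/*N* is a valid probability.

```python
import math


def posterior_prob_below(n, bound=None):
    """P(p < bound | 0 successes in n trials) under a uniform prior.

    The posterior is beta(1, n + 1), whose CDF is 1 - (1 - x)**(n + 1).
    By default the bound is the rule-of-three value 3/n.
    """
    if bound is None:
        bound = 3.0 / n
    return 1.0 - (1.0 - bound) ** (n + 1)


for n in (20, 100, 1000, 10**6):
    print(f"N={n:8d}  P(p < 3/N) = {posterior_prob_below(n):.5f}")

print(f"limit 1 - exp(-3) = {1.0 - math.exp(-3.0):.5f}")
```

For small *N* the posterior probability is somewhat above 0.95, and it settles toward 1 – exp(-3) as *N* increases.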

**Update**: Italian translation of this post.


Nancy, I think that’s a coincidence. Rather, this rule seems to come from the fact that log(20) is about 3, i.e. that e^3 is about 20.

A nice explanation! Although you don’t say it explicitly above, the Rule of 3 seems to come essentially from the fact that *e* is approximately 3 (*e* = 2.71828…), so ln(3) is very close to 1. Some readers may not realize when you discuss taking the log of both sides, that you mean the natural log (base *e*), not the log base 10. Thanks.

What makes the rule of three tick is that log(0.05) ≈ -3 or equivalently exp(-3) ≈ 0.05. Since log(0.05) = -2.996, this is a good approximation.
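The same reasoning works for other confidence levels: the 3 is just -log(0.05), so a different level gives a different numerator. A small sketch, with a function name of my own choosing:

```python
import math


def multiplier(confidence):
    """Numerator k in the bound k/N, i.e. k = -ln(1 - confidence)."""
    return -math.log(1.0 - confidence)


for c in (0.90, 0.95, 0.99):
    print(f"{c:.0%} confidence: p < {multiplier(c):.3f}/N")
```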

Sorry, I was reading quickly earlier. Yes, you are both of course correct. The confidence level of 95% is what leads to your “Rule of 3.” Thanks, Nancy

Great post. This is a really useful and practical application of probability.

One consideration came to mind.

Auto-correlation of events over time and issues of random sampling: In some domains events are not independent.

For example, if you adopted a procedure for writing a book where the early sections received more editing than the final sections, the lack of errors in the early sections may not be representative of the final sections.

Likewise, if the occurrence of errors were correlated, such that a page with one error was much more likely to have several errors, then the effective sample provided by a given number of pages would be less than implied assuming independence.

Thus, in the case of assessing errors in a book, it might be useful to randomly sample pages.

Not many children have perfect pitch because their hands aren’t yet large enough to adequately grasp the baseball. Some can still throw the ball at quite high velocity, though.

(grin)

In the example of proofreading a book, how do we interpret typo-free sentences versus typo-free pages? For example, if the first 200 pages are typo-free, we can say the chance of a typo is probably less than 1.5%. But if just the first 200 _sentences_ are typo-free, we can also say the chance of a typo is probably less than 1.5%. That seems counterintuitive. It’s similar to saying checking the first 200 pages of a 400-page novel gives you the same predictive power as checking the first 200 pages of the entire Encyclopedia Britannica. Can you explain how I can interpret this? Thanks

MJ: If the first 200 *pages* have no typos, you can conclude that the probability of a *page* having a typo is probably less than 1.5%. If you change the units of your data (pages to sentences) you also change the units of your conclusion. In both cases, your conclusion is “less than 1.5%”, but you’re referring to two different probabilities, probability of a *page* having a typo or the probability of a *sentence* having a typo.

Note also that I am not estimating the probability that a work has no typos. I’m estimating the probability of a page having a typo. The expected number of typos in a book will indeed increase in proportion to the size of the book.

At first you suggested that if I found 7 errors in the first 20 pages of text that I might reasonably conclude that the chances of finding a page with an error on it is 7/20.

I disagree. 7/20 is the ratio between the number of errors and the number of pages read, not the ratio of the number of pages with errors on them to the number of pages read. I would not assume that on average 7 out of 20 pages have an error, as you suggested, but rather that out of 20 pages I should count 7 errors.

You presumed that all errors are evenly distributed, give or take. I think it’s conceivable that most errors are clustered about each other. I’m inclined to believe that the probability of finding a page with an error is less than the probability of finding an error…

It’s late right now and I can’t explain myself well. But it’s making sense to me, at least at the moment.

It’s just that the probability you computed, 7/20, is a different answer to a different question than the one it purports to answer. We are talking about different probabilities of different events. It’s possible that there is a condition which skews the probability distribution.

I’m curious about the rule of three.

Probabilities don’t change from one trial to the next, right? A die still has a 1/6 chance of coming up a 3 as it does a 4.

So what if I found one error on the first page, then for the next three hundred pages after it I find no errors at all. Is the probability 1/301? Or can I take those three hundred pages as its own segment for independent analysis, and conclude the probability 3/300 = 1/100?

Notice how the event happening once already has decreased its probability. Had it not happened at all, the probability would have been higher.

Explain this, please

Using the same reasoning, *p* = 3.95/*n* gets you to 98% confidence.

I ran across a similar idea in New Yorker magazine. The question was how to come up with confidence limits on the end of existence of some property which currently exists, based on how long this property has existed. The assumption was that the current point in time is uniformly distributed over the lifespan of the property. You can calculate a 95% confidence interval on the ending time under these assumptions. The author, who is not a statistician as far as I know, goes on to give many examples (such as the closing of a long-running Broadway show, the closing of a restaurant, etc.) where the end point did indeed fall in the confidence interval. IMO this is OK as far as it goes, but I think really all you’re doing is constructing uselessly large confidence intervals.

Your construction above seems much more useful, though, and it reminds me of a method for estimating the total number of critters in a patch of dirt. You start finding them and as you are counting, you graph the total number discovered with respect to time spent searching. It is possible to easily estimate when you have discovered 80% of the total, at which point you can stop and produce your estimate of the total. I always thought it was a very elegant and extremely practical method for biological field work, and I’m sure it can be widely applied beyond biology.

Regarding perfect pitch, in my locale of Houston, Texas, USA most of the houses, and all new ones, have roofs designed to bear alpine snow loads. Yet it almost never snows here. Of course, roofs with a steeper pitch are more dangerous to walk on, require more materials to build, and are much more susceptible to high winds, which we *do* get here when hurricanes come.

Why do this? I don’t know. Maybe it is architectural fashion. Maybe the builders have nephews who run roofing crews or manufacture roofing tiles. To me it is utterly perplexing. In the desert houses used to lack pitched roofs altogether. I wonder if they are now building them for snow, too.

I’m sorry if it’s obvious to others, but can you explain the logic behind the following:

“If we want a 95% confidence interval, we want to find the largest p so that the probability of no successes out of n trials is 0.05…”

I get that this gives you the probability of no successes as being 0.05, but I’m not sure how that relates to a 95% confidence level (beyond that 1-0.05 is 0.95 of course). In other words, why not set (1-p)^n = 0.95?

What does it mean that you can say “p is probably less than 3/N” in clear conscience since Bayesians are allowed to make such statements? Thanks.

Sara: In frequentist statistics, you cannot make probability statements about parameters. So you can’t say anything about p using the word “probably.”

In Bayesian statistics, the uncertainty in unknown parameters is represented by probability densities, so there are no difficulties in saying p is *probably* in some interval.

This rule could be very useful for me but does it work only for the probability of a missed single observation? For example, when I am searching for an abnormality, I need to find at least 3 to conclude the item sampled is abnormal. Coincidentally, we usually sample 20 items. If I have found no abnormality in 20 samples, I assume the rule says there is a 15% chance I have missed one abnormality, but what about having missed 3? Or, if I have found only 2/20, what can be said of the probability of missing the required third abnormality? Thank you. I hope you can help.

But the “Rule of succession” ( http://en.wikipedia.org/wiki/Rule_of_succession ) says that in Bayesian inference with a uniform prior, the probability of an error in the next trial is (0+1)/(N+2), not 3/N?

Why did you estimate a more complex thing, like “probability that parameter p is less than 0.05”, when we can just estimate the “probability of error on next page”? Maybe this is closer to “Estimating the chances of something that hasn’t happened yet”.

Lot of thanks for your great blog!
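The two numbers in this question answer different questions about the same beta(1, N+1) posterior, and a short sketch makes the contrast concrete. The rule of succession is the posterior mean, 1/(N+2), while the rule of three approximates the point below which 95% of the posterior mass lies. The function names here are illustrative only.

```python
def next_trial_probability(n):
    """Posterior mean of beta(1, n + 1): the rule-of-succession value 1/(n + 2)."""
    return 1.0 / (n + 2)


def posterior_upper_bound(n, level=0.95):
    """Value x with P(p < x) = level under the beta(1, n + 1) posterior.

    Obtained by inverting the CDF 1 - (1 - x)**(n + 1).
    """
    return 1.0 - (1.0 - level) ** (1.0 / (n + 1))


for n in (20, 100, 1000):
    mean = next_trial_probability(n)
    upper = posterior_upper_bound(n)
    print(f"N={n:5d}  mean={mean:.4f}  95% bound={upper:.4f}  3/N={3.0 / n:.4f}")
```

The mean is a single best guess for the next trial; the bound is a statement about how large *p* could plausibly be, which is why it is roughly three times larger.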

In frequentist approach, shouldn’t you estimate parameter of all possible outcomes?

As far as I know, a confidence interval is tied to the algorithm of its calculation, so that the calculation gives an interval containing “p” in 95% of cases when running the same model.

But your algorithm is not complete, it does not say which interval should be chosen for case when errors did happen.

Could you please give more details on how you justify this “Since log(1-p) is approximately -p for small values of p (…)”

Joe: The Taylor series for f(x) = log(1 + x) centered at 0 is x – x^2/2 + x^3/3 – …

So for small x, log(1 + x) is approximately x, with remainder on the order of x^2. Now substitute x = -p.
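A quick numerical check of this approximation, as a sketch: the error of replacing log(1 - p) with -p should be on the order of p^2 (about p^2/2 from the next Taylor term). The helper name is hypothetical.

```python
import math


def approximation_error(p):
    """Absolute error of approximating log(1 - p) by -p."""
    return abs(math.log1p(-p) + p)


for p in (0.1, 0.01, 0.001):
    exact = math.log1p(-p)
    err = approximation_error(p)
    print(f"p={p:<6}  log(1-p)={exact:.6f}  -p={-p:.6f}  error={err:.2e}")
```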

Citation

“Joe: The Taylor series for f(x) = log(1 + x) centered at 0 is x – x^2/2 + x^3/3 – …

So for small x, log(1 + x) is approximately x, with remainder on the order of x^2. Now substitute x = -p.”

Brilliant. Thank you.