How many errors are left to find?

There’s a simple statistic called the Lincoln Index that lets you estimate the total number of errors based on the number of errors found. I’ll explain what the Lincoln Index is, why it works, give some code for playing with it, and discuss how it applies to software testing.

What is the Lincoln Index?

Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn’t very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.

Suppose two testers independently search for bugs. Let E1 be the number of errors the first tester finds and E2 the number of errors the second tester finds. Let S be the number of errors both testers find. The Lincoln Index estimates the total number of errors as E1 E2/S. You can find historical background on the Lincoln Index here.

How does the index work?

Suppose there are n bugs and the two testers find bugs with probability p1 and p2 respectively. You’d expect the two testers to find around np1 and np2 bugs. If you assume the probabilities of each tester finding a bug are independent, you’d expect the testers to find around np1 p2 bugs in common. That says E1*E2/S would be around

(n2 p1 p2) / (n p1 p2) = n.

The probabilities of each tester finding a bug cancel out leaving only n, the total number of bugs.

Simulation code

Here’s some Python code for simulating estimates using the Lincoln Index.


from random import random

def find_error(p):
    "Find an error with probability p"
    if random() < p:
        return 1
    return 0

def simulate(true_error_count, p1, p2, reps=10000):
    """Simulate Lincoln's method for estimating errors
    given the true number of errors, each person's probability
    of finding an error, and the number of simulations to run."""
    estimation_error_sum = 0
    for rep in xrange(reps):
        caught1 = 0
        caught2 = 0
        caught_both = 0
        for error in xrange(true_error_count):
            found1 = find_error(p1)
            found2 = find_error(p2)
            caught1 += found1
            caught2 += found2
            caught_both += found1*found2
        estimate = caught1*caught2 / float(caught_both)
        estimation_error_sum += abs(estimate - true_error_count)
    return estimation_error_sum / float(reps)

I used this to simulate the case of two testers, one with a 30% chance of finding a bug and the other with a 40% chance, and a total of 100 bugs. I simulated the Lincoln Index 1,000 times, keeping track of the absolute error in the estimates. The code to do this was simulate(100, 0.30, 0.40, 1000). On average, the Lincoln index over- or under-estimated the number of bugs by about 16. This is a good estimate considering each tester greatly under-estimated the number of bugs.

If you didn’t think about using something like the Lincoln Index, in the previous example one tester would find around 30 bugs and the other around 40. The two lists might have 10 bugs in common, so you’d estimate the total number at 60, far short of 100. But the Lincoln index would often find estimates between 84 and 116.

Note that it is possible that the testers won’t find any of the same bugs. In that case the Lincoln Index cannot be computed and the code will divide by zero. But this is unlikely unless the p’s are small and n is small.

Software testing

Does the Lincoln Index actually provide a good bug count estimate? That depends on how well the assumptions are met. The index assumes all bugs are equally hard for a given tester to find. It does not assume that both testers are equally skilled, but it does assume that their chances of finding a bug are independent. In other words, tester A is no more or less likely to find a bug just because tester B found it.

The most questionable assumption is that all bugs are equally hard to find. That’s usually not true. But it may be true that all bugs of a certain kind are equally hard to find. For example, spelling errors may be easier to find than validation oversights, but the Lincoln Index might be good for estimating separately how many spelling errors or validation errors there are.

The index might provide a rough rule of thumb even if the assumptions it that go into it are violated. For example, suppose one tester found 15 bugs and another found 20. But only 3 of the bugs were the same. A naive estimate would say since there are 32 unique bugs found, there must be around that many in total. But the Lincoln Index would estimate 100 bugs. Maybe the Lincoln estimate is not at all accurate, but it does tell you to be worried that there may be a lot more bugs to find since the overlap between the two bug lists was so small.

Related post:

Estimating the chances of something that hasn’t happened yet

19 thoughts on “How many errors are left to find?

  1. John, in the text you gave 30%, 40% as an example, but the sample call line has 25%, 30%. Just curious which one you actually generated your statistics with.

    The derivative with respect to each of the E’s is constant, so a deviation in E from the expected value will impose an corresponding percentage change in the value n from the true value. You can always hope that the two deviations in E will have opposite signs and still give a good n.

    It looks like the variation in S will be more critical, just because it is a smaller value to start with. It’s still proportional for small changes.

    It would be interesting to work up what the correction factor is as a function of the assumed correlation between the E’s. You could then determine how robust your n was under varying assumptions.

  2. Walt: Thanks for catching my mistake. I updated the post with a correction. The simulation results were for 30 and 40 percent.

  3. A short MATLAB version:
    lincoln = @(a,b) sum(a).*sum(b)./sum(a&b);

    simulate=@(eCount,p1,p2,reps) …
    lincoln(rand(eCount,reps)<p1,rand(eCount,reps)<p2);

    simulate(100,0.4,0.3,100)

    See what happens when you vary p1 and p2:
    f=@(p,q) mean(simulate(100,p,q,1e3));
    x=linspace(0,1,20);
    [XX,YY]=meshgrid(x,x);
    PP = arrayfun(f,XX,YY);
    surf(XX,YY,PP)

  4. This is very similar to Capture-Recapture (Mark-and-Recapture) estimates used for estimating population sizes in wildlife studies. — And a Google search shows me it is exactly the same thing — Silly me, but it’s been a while since I looked at this.

    This can apply to more than just programming; In doing statistical analysis, I’m always on the lookout for possible problems in the data. Given that I occasionally find problems that others have missed, I often worry about the ones that got away, and how many studies are skewed by bad data.

  5. Lincoln Index, I love it!

    EastwoodDC is right about it being the simple form of the Petersen capture-recapture estimator for population size. The form you give is a biased estimator, you can “unbias” it by using N = (E1+1)(E2+1)/(S+1) -1. Lots more on this type of estimator in Krebs, Ecological Methodology

    I’ll add this to my lecture on capture-recapture so my bio students can see they have something in common with the comp sci kids.

  6. I’m not sure that how hard are to find the bugs is relevant. What does it mean, “hard to find”? My interpretation is that a bug is harder to find when it is found only by more skilled testers.
    Given that the index is independant from the skills of the testers, if my interpretation is right, it is therefore also independent from the nature of the bugs.

  7. @Astrobe: The problem is correlation at the bug level. The results are independent of the skill of the testers, but not of the difficulty of the bugs. That’s why John points out that “The index assumes all bugs are equally hard for a given tester to find. ”

    There’s also a lot of relevant literature in epidemiology (where they need to estimate population prevalences of diseases with imperfect tests and population samples) and in educational testing (where you want to evaluate both students and exam questions). I (and others) have been applying similar models to Mechanical Turk and other crowdsourced data annotation problems.

    With enough testers, you can start estimating the difficulty of the items using something like an item-response model. You can also tease apart accuracy versus bias, though presumably with bugs the idea is that it’s all-or-nothing and there’s an absolute notion of what constitutes a bug.

  8. Another problem is the assumption that testers find bugs independently. In the real world testers talk to each other and if A tells B about a bug B either won’t look for that bug or won’t report it.

  9. James: True, testers collaborate and share bugs. But a company could do something like the following.

    Ask the testers to work independently for a short time at the beginning of a project. Then compare bug lists and estimate the number of bugs left to find. Then work together as usual to find the rest of the bugs.

  10. Nice work.
    Like you mentioned, this index is probably not very useful in actually estimating the number of bugs.

    For a relatively bug free strong system, two testers finding a unique bug each would mean the index is infinity.

    I think the index can be a suggestion if the testers need to work together or not.

    If the index is a low number for a couple of developers, they probably test the same way and don’t need to work together. If the index is high (or infinite) for two developers, they must sit and test together.

  11. Munim: actually the index only works if the testers are searching separately.

    Finding the same bug under an assumption that each bug would be found randomly with something along the same probability gives an estimator for the total number of bugs by conditional probability.

    When the index yields infinity it is because no reasonable estimate of the upper bound exists. When two testers find a disjoint set of bugs, you have no relative scale to fall back on. The the absence of overlap, you need some other form of information to estimate overall bug count.

  12. I recall that Petersen estimator has a lot of the same problems that Michael B. mentions in his post, and more, because animal populations are literally moving targets. I don’t agree to his criticism that it is a silly idea though. It’s a simple estimator based on very simple assumptions. Is it flawed? Yes. As we well know, “All models are wrong, but some are useful.” If you need to make this sort of estimate, and are willing to accept the accompanying assumptions, it is still the right approach to the problem.

  13. John, this is an interesting post.
    You are right when you say that there many assumtions under it.
    The ones you mention (all bugs are equally hard, chances of finding a bug are independent) but also a lot of other assumptions, that makes the index irrelevant.

    The asumptions that “bugs of a certain kind to be equally hard to find”, for example, is doubtful.
    The text also (wrongly) assumes that:
    - Testing only involves finding bugs
    - There is a finite number of bugs
    - There’s value in finding all bugs
    - There is a clear definition of what a bug is
    - The testing task is done when “all bugs are found”
    - Testers are able (or try) to estimate the number of bugs when they test or finish testing
    - Knowing all these numbers (if they were real) would have any benefit.

    I can’t see any benefit of using the index a real project. If anything, it looks dangerous.
    Do you have a real-life experience of using it in a project? What kind of conclusions were drawn?

Comments are closed.