Bugs, features, and risk

All software has bugs. Someone has estimated that production code has about one bug per 100 lines. Of course there’s some variation in this number. Some software is a lot worse, and some is a little better.

But bugs-per-line-of-code is not very useful for assessing risk. The risk of a bug is the probability of running into it multiplied by its impact. Some lines of code are far more likely to execute than others, and some bugs are far more consequential than others.

Devoting equal effort to testing all lines of code would be wasteful. You’re not going to find all the bugs anyway, so you should concentrate on the parts of the code that are most likely to run and that would produce the greatest harm if they were wrong.
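As a rough illustration of weighting probability by impact, here is a minimal Python sketch. The code paths and all of the numbers are invented for the example; in practice both the probabilities and the impacts would be rough estimates at best.

```python
# Hypothetical code paths: (name, probability of hitting a bug, impact if hit).
# All names and numbers are invented for illustration only.
paths = [
    ("export_report", 0.10, 5),
    ("legacy_import", 0.001, 300),
    ("login", 0.02, 100),
]

def risk(probability, impact):
    """Risk as described in the text: probability times impact."""
    return probability * impact

# Rank paths from highest to lowest risk to decide where to test first.
ranked = sorted(paths, key=lambda p: risk(p[1], p[2]), reverse=True)
for name, probability, impact in ranked:
    print(f"{name}: risk = {risk(probability, impact):.2f}")
```

Note that the rarely used path with a large impact can still rank below a frequently used path with a modest one. Neither factor alone determines where testing effort should go.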

However, here’s a complication. The probability of running into a bug can change over time as people use the software in new ways. For whatever reason, people want to use features that had not been exercised before. When they do so, they’re likely to uncover new bugs.

(This helps explain why everyone thinks his preferred software is more reliable than others. When you’re a typical user, you tread the well-tested paths. You also learn, often subconsciously, to avoid buggy paths. When you bring your expectations from an old piece of software to a new one, you’re more likely to uncover bugs.)

Even though usage patterns change, they don’t change arbitrarily. It’s still the case that some code is far more likely than other code to execute.

Good software developers think ahead. They solve more than they’re asked to solve. They think “I’m going to go ahead and include this other case while I’m at it in case they need it later.” They’re heroes when it turns out their guesses about future needs were correct.

But there’s a downside to this initiative. You pay for what you don’t use. Every speculative feature either has to be tested, incurring more expense up front, or delivered untested, incurring more risk. This suggests it’s better to disable unused features.

You cannot avoid speculation entirely. Writing maintainable software requires speculating well, anticipating and preparing for change. Good software developers place good bets, and these tend to be small bets, going to a little extra effort to make software much more flexible. As with bugs, you have to consider probabilities and consequences: how likely is this part of the software to change, and how much effort will it take to prepare for that change?

Developers learn from experience what aspects of software are likely to change and they prepare for that change. But then they get angry at a rookie who wastes a lot of time developing some unnecessary feature. They may not realize that the rookie is doing the same thing they are, but with a less informed idea of what’s likely to be needed in the future.

Disputes between developers often involve hidden assumptions about probabilities. Whether some aspect of the software is responsible preparation for maintenance or wasteful gold plating depends on your idea of what’s likely to happen in the future.

Related: Why programmers write unneeded code

Software exoskeletons

There’s a major divide between the way scientists and programmers view the software they write.

Scientists see their software as a kind of exoskeleton, an extension of themselves. Think Dr. Octopus. The software may do heavy lifting, but the scientists remain actively involved in its use. The software is a tool, not a self-contained product.

Spiderman versus Dr. Ock

Programmers see their software as something they will hand over to someone else, more like building a robot than an exoskeleton. Programmers believe it’s their job to encapsulate intelligence in software. If users have to depend on programmers after the software is written, the programmers didn’t finish their job.

I work with scientists and programmers, often bridging the gaps between the two cultures. One point of tension is defining when a project is done. To a scientist, the software is done when they get what they want out of it, such as a table of numbers for a paper. Professional programmers give more thought to reproducibility, maintainability, and correctness. Scientists think programmers are anal retentive. Programmers think scientists are cowboys.

Programmers need to understand that sometimes a program really only needs to run once, on one set of input, with expert supervision. Scientists need to understand that prototype code may need a complete rewrite before it can be used in production.

The real tension comes when a piece of research software is suddenly expected to be ready for production. The scientist will say “the code has already been written” and can’t imagine it would take much work, if any, to prepare the software for its new responsibilities. They don’t understand how hard it is for an engineer to turn an exoskeleton into a self-sufficient robot.

Pilots and pair programming

From Outliers by Malcolm Gladwell:

In commercial airlines, captains and first officers split the flying duties equally. But historically, crashes have been far more likely to happen when the captain is in the “flying seat.” At first this seems to make no sense, since the captain is almost always the pilot with the most experience. … Planes are safer when the least experienced pilot is flying, because it means the second pilot isn’t going to be afraid to speak up.

The context of this excerpt is an examination of airplane crashes in which the copilot was aware of the pilot’s errors but did not speak up assertively.

I wonder whether an analogous result holds for pair programming. Do more bugs slip into the code when the more experienced programmer has the keyboard? The German Aerospace Center (DLR) thinks so. It pairs junior and senior programmers. The junior programmer writes all the code while the senior programmer watches.

How to test a random number generator

Last year I wrote a chapter for O’Reilly’s book Beautiful Testing (ISBN 0596159811). The publisher gave each of us permission to post our chapters online, and so here is Chapter 10: How to test a random number generator.

Update: The chapter linked to above describes how to test transformations of a trusted uniform random number generator. For example, maybe your programming language provides a way to generate numbers between 0 and 1 uniformly, and you have written code to transform that into a normal distribution. That’s the most common case. Few people write their own core RNG; most bootstrap the core RNG into other kinds of RNG.
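To make the common case concrete, here is a minimal Python sketch of testing a transformation of a trusted uniform generator. It is not taken from the chapter. It assumes the Box-Muller transform as the uniform-to-normal transformation, and the function names are made up for illustration. The check is deliberately crude: it only compares the sample mean and variance to the values expected for a standard normal.

```python
import math
import random

def normal_from_uniform(rng):
    """Box-Muller transform: turn two uniform draws into one standard normal draw."""
    u1 = 1.0 - rng.random()  # shift from [0, 1) to (0, 1] so log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def sample_mean_and_variance(n, seed=0):
    """Draw n samples from a seeded generator and summarize them."""
    rng = random.Random(seed)
    samples = [normal_from_uniform(rng) for _ in range(n)]
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, variance

mean, variance = sample_mean_and_variance(100_000)
# For a standard normal, expect a mean near 0 and a variance near 1.
print(mean, variance)
```

Matching the first two moments is a necessary check, not a sufficient one. A wrong distribution can have the right mean and variance, so a real test suite would also compare the shape of the distribution.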

If you need to test a uniform random number generator, I can help with that.

Buggy code is biased code

As bad as corporate software may be, academic software is usually worse. I’ve worked in industry and in academia and have seen first-hand how much lower the quality bar is in academia. And I’m not the only one who has noticed this.

Why does this matter? Because buggy code is biased code. Bugs that cause the software to give unwanted results are more likely to be noticed and fixed. Bugs that cause software to produce expected results are more likely to remain in place.

If your software simulates some complex phenomenon, you don’t know what it’s supposed to do; that’s why you’re simulating. Errors are easier to spot in consumer software. A climate model needs a higher level of quality assurance than a word processor because bugs in the latter are more obvious. Genomic analysis may contain egregious errors and no one would ever know, but a bug in an MP3 player is sure to annoy users.

You have to test simulation software carefully. You have to test special cases and individual components to have any confidence in the final output. You can’t just look at the results and say “Yeah, that’s about what I expected.”

Acknowledging problems versus solving problems

People want their problems acknowledged more than they want them solved, at least at first. That’s one of the points from Thomas Limoncelli’s book Time Management for System Administrators.

Suppose two system administrators get an email about similar problems. The first starts working on the problem right away and replies to the email a couple hours later saying the problem is fixed. The second replies immediately to say he understands the problem and will resolve it first thing tomorrow. The second system administrator will be more popular.

Of course people want their problems solved, and sooner is better than later. But first they want to know someone is listening. Sometimes that’s all they want.

How many errors are left to find?

There’s a simple statistic called the Lincoln Index that lets you estimate the total number of errors based on the number of errors found. I’ll explain what the Lincoln Index is and why it works, give some code for playing with it, and discuss how it applies to software testing.

What is the Lincoln Index?

Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn’t very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.

Suppose two testers independently search for bugs. Let E1 be the number of errors the first tester finds and E2 the number of errors the second tester finds. Let S be the number of errors both testers find. The Lincoln Index estimates the total number of errors as

E1 E2/S.
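The formula above can be sketched in a few lines of Python. This version assumes each tester’s findings are available as a set of bug identifiers; the function name and the labels are hypothetical.

```python
def lincoln_index(found_by_first, found_by_second):
    """Estimate the total number of errors from two testers' findings.

    Each argument is a set of bug identifiers (hypothetical labels).
    """
    e1 = len(found_by_first)
    e2 = len(found_by_second)
    s = len(found_by_first & found_by_second)  # errors both testers found
    if s == 0:
        raise ValueError("no overlap between testers; the index is undefined")
    return e1 * e2 / s

first = {"A", "B", "C", "D"}
second = {"C", "D", "E", "F", "G", "H"}
print(lincoln_index(first, second))  # 4 * 6 / 2
```

Here the testers found 8 distinct bugs between them, but the small overlap pushes the estimate of the total up to 12.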

You can find historical background on the Lincoln Index here.

How does the index work?

Suppose there are n bugs and the two testers find bugs with probability p1 and p2 respectively. You’d expect the two testers to find around n p1 and n p2 bugs. If you assume the probabilities of each tester finding a bug are independent, you’d expect the testers to find around n p1 p2 bugs in common. That says

E1 E2/S

would be around

(n^2 p1 p2) / (n p1 p2) = n.

The probabilities of each tester finding a bug cancel out leaving only n, the total number of bugs.

Simulation code

Here’s some Python code for simulating estimates using the Lincoln Index.

from random import random

def find_error(p):
    "Find an error with probability p"
    if random() < p:
        return 1
    return 0

def simulate(true_error_count, p1, p2, reps=10000):
    """Simulate Lincoln's method for estimating errors
    given the true number of errors, each person's probability
    of finding an error, and the number of simulations to run."""
    estimation_error_sum = 0
    for rep in range(reps):
        caught1 = 0
        caught2 = 0
        caught_both = 0
        for error in range(true_error_count):
            found1 = find_error(p1)
            found2 = find_error(p2)
            caught1 += found1
            caught2 += found2
            caught_both += found1*found2

        # Raises ZeroDivisionError if the testers found no bugs in common.
        estimate = caught1*caught2 / caught_both
        estimation_error_sum += abs(estimate - true_error_count)
    return estimation_error_sum / reps

I used this to simulate the case of two testers, one with a 30% chance of finding a bug and the other with a 40% chance, and a total of 100 bugs. I simulated the Lincoln Index 1,000 times, keeping track of the absolute error in the estimates. The code to do this was simulate(100, 0.30, 0.40, 1000). On average, the Lincoln Index over- or under-estimated the number of bugs by about 16. This is a good estimate considering each tester greatly under-estimated the number of bugs.

Without something like the Lincoln Index, you might reason as follows. In the previous example one tester would find around 30 bugs and the other around 40. The two lists might have 10 bugs in common, so you’d count 60 distinct bugs and estimate the total at 60, far short of 100. But the Lincoln Index would often produce estimates between 84 and 116.

Note that it is possible that the testers won’t find any of the same bugs. In that case the Lincoln Index cannot be computed and the code will divide by zero. But this is unlikely unless the p’s are small and n is small.
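To put a number on “unlikely”: under the same independence assumptions, each bug is found by both testers with probability p1 p2, so the chance that none of the n bugs is found by both is (1 - p1 p2)^n. The short Python sketch below evaluates that expression; the function name is made up for the example.

```python
def prob_no_overlap(n, p1, p2):
    """Probability that two independent testers share no bugs among n."""
    return (1 - p1 * p2) ** n

# The simulated case from the text: 100 bugs, 30% and 40% detection rates.
print(prob_no_overlap(100, 0.30, 0.40))

# A case with few bugs and weak testers, where no overlap is the norm.
print(prob_no_overlap(10, 0.10, 0.10))
```

In the first case the probability is negligible. In the second it is above 90%, so the index would usually be undefined.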

Software testing

Does the Lincoln Index actually provide a good bug count estimate? That depends on how well the assumptions are met. The index assumes all bugs are equally hard for a given tester to find. It does not assume that both testers are equally skilled, but it does assume that their chances of finding a bug are independent. In other words, tester A is no more or less likely to find a bug just because tester B found it.

The most questionable assumption is that all bugs are equally hard to find. That’s usually not true. But it may be true that all bugs of a certain kind are equally hard to find. For example, spelling errors may be easier to find than validation oversights, but the Lincoln Index might be good for estimating separately how many spelling errors or validation errors there are.

The index might provide a rough rule of thumb even if the assumptions that go into it are violated. For example, suppose one tester found 15 bugs and another found 20, but only 3 of the bugs were the same. A naive estimate would say that since 32 unique bugs were found, there must be around that many in total. But the Lincoln Index would estimate 100 bugs. Maybe the Lincoln estimate is not at all accurate, but it does tell you to be worried that there may be a lot more bugs to find, since the overlap between the two bug lists was so small.

Related post: Estimating the chances of something that hasn’t happened yet

For daily posts on probability, follow @ProbFact on Twitter.

Dynamic typing and anti-lock brakes

When we make one part of our lives safer, we tend to take more chances somewhere else. Psychologists call this tendency risk homeostasis.

One of the studies often cited to support the theory of risk homeostasis involved German cab drivers. Drivers in the experimental group were given cabs with anti-lock brakes while drivers in the control group were given cabs with conventional brakes. There was no difference in the rate of crashes between the two groups. The drivers who had better brakes drove less carefully.

Risk homeostasis may explain why dynamic programming languages such as Python aren’t as dangerous as critics suppose.

Advocates of statically typed programming languages argue that it is safer to have static type checking than to not have it. Would you rather the computer catch some of your errors or not? I’d rather it catch some of my errors, thank you. But this argument assumes two things:

  1. static type checking comes at no cost, and
  2. static type checking has no impact on programmer behavior.

Advocates of dynamic programming languages have mostly focused on the first assumption. They argue that static typing requires so much extra programming effort that it is not worth the cost. I’d like to focus on the second assumption. Maybe the presence or absence of static typing changes programmer behavior.

Maybe a lack of static type checking scares dynamic language programmers into writing unit tests. Or to turn things around, perhaps static type checking lulls programmers into thinking they do not need unit tests. Maybe static type checking is like anti-lock brakes.

Nearly everyone would agree that static type checking does not eliminate the need for unit testing. Someone accustomed to working in a statically typed language might say “I know the compiler isn’t going to catch all my errors, but I’m glad that it catches some of them.” Static typing might not eliminate the need for unit testing, but it may diminish the motivation for unit testing. The lack of compile-time checking in dynamic languages may inspire developers to write more unit tests.

See Bruce Eckel’s article Strong Typing vs. Strong Testing for more discussion of static typing and unit testing.

Update: I’m not knocking statically typed languages. I spend most of my coding time in such languages and I’m not advocating that we get rid of static typing in order to scare people into unit testing.

I wanted to address the question of what programmers do, not what they should do. In that sense, this post is more about psychology than software engineering. (Though I believe a large part of software engineering is in fact psychology as I’ve argued here.) Do programmers who work in dynamic languages write more tests? If so, does risk homeostasis help explain why?

Finally, I appreciate the value of unit testing. I’ve spent most of the last couple days writing unit tests. But there is a limit to the kinds of bugs that unit tests can catch. Unit tests are good at catching errors in code that has been written, but most errors come from code that should have been written but wasn’t. See Software sins of omission.

Rewarding complexity

Clay Shirky wrote an insightful article recently entitled The Collapse of Complex Business Models. The last line of the article contains the observation

… when the ecosystem stops rewarding complexity, it is the people who figure out how to work simply in the present, rather than the people who mastered the complexities of the past, who get to say what happens in the future.

It’s interesting to think how ecosystems reward complexity or simplicity.

Academia certainly rewards complexity. Coming up with ever more complex models is the safest road to tenure and fame. Simplification is hard work and isn’t good for your paper count.

Political pundits are rewarded for complex analysis, though politicians are rewarded for oversimplification.

The software market has rewarded complexity, but that may be changing.

Maintenance costs

No engineered structure is designed to be built and then neglected or ignored. — Henry Petroski

The quote above comes from Henry Petroski’s recent interview on Tech Nation. In the same interview, Petroski says that a common rule of thumb is that maintenance costs about 4% of construction cost per year. For a structure as old as the Golden Gate Bridge (completed in 1937), for example, that’s a lot of 4%’s.

Golden Gate Bridge

Painting the bridge has cost far more than building it. The bridge is painted continuously: as soon as the painters reach the end of the bridge, they turn around and start over. The engineers who designed the bridge knew this would happen. When you build something out of steel and put it outside, it will need to be painted. It was all part of the design.

Image credit: Wikipedia

Related links:

Two kinds of software challenges
Do you really want to be indispensable?
Upcoming Y2K-like problems
The Essential Engineer, Henry Petroski’s new book