Posts tagged as:

Quality

Bugs, features, and risk

by John on January 12, 2012

All software has bugs. Someone has estimated that production code has about one bug per 100 lines. Of course there’s some variation in this number. Some software is a lot worse, and some is a little better.

But bugs-per-line-of-code is not very useful for assessing risk. The risk of a bug is the probability of running into it multiplied by its impact. Some lines of code are far more likely to execute than others, and some bugs are far more consequential than others.

Devoting equal effort to testing all lines of code would be wasteful. You’re not going to find all the bugs anyway, so you should concentrate on the parts of the code that are most likely to run and that would produce the greatest harm if they were wrong.

However, here’s a complication. The probability of running into a bug can change over time as people use the software in new ways. For whatever reason people to want to use features that had not been exercised before. When they do so, they’re likely to uncover new bugs.

(This helps explain why everyone thinks his preferred software is more reliable than others. When you’re a typical user, you tread the well-tested paths. You also learn, often subconsciously, to avoid buggy paths. When you bring your expectations from an old piece of software to a new one, you’re more likely to uncover bugs.)

Even though usage patterns change, they don’t change arbitrarily. It’s still the case that some code is far more likely than other code to execute.

Good software developers think ahead. They solve more than they’re asked to solve. They think “I’m going to go ahead and include this other case while I’m at it in case they need it later.” They’re heroes when it turns out their guesses about future needs were correct.

But there’s a downside to this initiative. You pay for what you don’t use. Every speculative feature either has to be tested, incurring more expense up front, or delivered untested, incurring more risk. This suggests its better to disable unused features.

You cannot avoid speculation entirely. Writing maintainable software requires speculating well, anticipating and preparing for change. Good software developers place good bets, and these tend to be small bets, going to a little extra effort to make software much more flexible. As with bugs, you have to consider probabilities and consequences: how likely is this part of the software to change, and how much effort will it take to prepare for that change?

Developers learn from experience what aspects of software are likely to change and they prepare for that change. But then they get angry at a rookie who wastes a lot of time developing some unnecessary feature. They may not realize that the rookie is doing the same thing they are, but with a less informed idea of what’s likely to be needed in the future.

Disputes between developers often involve hidden assumptions about probabilities. Whether some aspect of the software is responsible preparation for maintenance or wasteful gold plating depends on your idea of what’s likely to happen in the future.

Related post: Why programmers write unneeded code

{ 10 comments }

Software exoskeletons

by John on July 21, 2011

There’s a major divide between the way scientists and programmers view the software they write.

Scientists see their software as a kind of exoskeleton, an extension of themselves. Think Dr. Octopus. The software may do heavy lifting, but the scientists remain actively involved in its use. The software is a tool, not a self-contained product.

Spiderman versus Dr. Ock

Programmers see their software as something they will hand over to someone else, more like building a robot than an exoskeleton. Programmers believe it’s their job to encapsulate intelligence in software. If users have to depend on programmers after the software is written, the programmers didn’t finish their job.

I work with scientists and programmers, often bridging the gaps between the two cultures. One point of tension is defining when a project is done. To a scientist, the software is done when they get what they want out of it, such as a table of numbers for a paper. Professional programmers give more thought to reproducibility, maintainability, and correctness. Scientists think programmers are anal retentive. Programmers think scientists are cowboys.

Programmers need to understand that sometimes a program really only needs to run once, on one set of input, with expert supervision. Scientists need to understand that prototype code may need a complete rewrite before it can be used in production.

The real tension comes when a piece of research software is suddenly expected to be ready for production. The scientist will say “the code has already been written” and can’t imagine it would take much work, if any, to prepare the software for its new responsibilities. They don’t understand how hard it is for an engineer to turn an exoskeleton into a self-sufficient robot.

Related posts:

Good, fast, cheap: Can you really pick two?
Buggy code is biased code

{ 52 comments }

Pilots and pair programming

by John on January 28, 2011

From Outliers by Malcolm Gladwell:

In commercial airlines, captains and first officers split the flying duties equally. But historically, crashes have been far more likely to happen when the captain is in the “flying seat.” At first this seems to make no sense, since the captain is almost always the pilot with the most experience. … Planes are safer when the least experienced pilot is flying, because it means the second pilot isn’t going to be afraid to speak up.

The context of this excerpt is an examination of airplane crashes in which the copilot was aware of the pilot’s errors but did not speak up assertively.

I wonder whether an analogous result holds for pair programming. Do more bugs slip into the code when the more experienced programmer has the keyboard? The German aerospace company DLR thinks so. The company pairs junior and senior programmers. The junior programmer writes all the code while the senior programmer watches.

Related posts:

Software in space
Sleep debt and industrial accidents

{ 4 comments }

How to test a random number generator

by John on December 6, 2010

Last year I wrote a chapter for O’Reilly’s book Beautiful Testing. The publisher gave each of us permission to post our chapters electronically, and so here is Chapter 10: How to test a random number generator.

Beautiful Testing: Leading Professionals Reveal How They Test

{ 16 comments }

Buggy code is biased code

by John on October 19, 2010

As bad as corporate software may be, academic software is usually worse. I’ve worked in industry and in academia and have seen first-hand how much lower the quality bar is in academia. And I’m not the only one who has noticed this.

Why does this matter? Because buggy code is biased code. Bugs that cause the software to give unwanted results are more likely to be noticed and fixed. Bugs that cause software to produce expected results are more likely to remain in place.

If your software simulates some complex phenomena, you don’t know what it’s supposed to do; that’s why you’re simulating. Errors are easier to spot in consumer software. A climate model needs a higher level of quality assurance than a word processor because bugs in the latter are more obvious. Genomic analysis may contain egregious errors and no one ever know, but a bug in an MP3 player is sure to annoy users.

You have to test simulation software carefully. You have to test special cases and individual components to have any confidence in the final output. You can’t just look at the results and say “Yeah, that’s about what I expected.”

Related posts:

Popular research areas produce more false results
Errors in math papers not a big deal?
Writes large correct programs

{ 11 comments }

People want their problems acknowledged more than they want them solved, at least at first. That’s one of the points from Thomas Limoncelli’s book Time Management for System Administrators.

Suppose two system administrators get an email about similar problems. The first starts working on the problem right away and replies to the email a couple hours later saying the problem is fixed. The second replies immediately to say he understands the problem and will resolve it first thing tomorrow. The second system administrator will be more popular.

Of course people want their problems solved, and sooner is better than later. But first they want to know someone is listening. Sometimes that’s all they want.

Related posts:

Email isn’t the problem
Rethinking interruptions

{ 7 comments }

How many errors are left to find?

by John on July 13, 2010

There’s a simple statistic called the Lincoln Index that lets you estimate the total number of errors based on the number of errors found. I’ll explain what the Lincoln Index is, why it works, give some code for playing with it, and discuss how it applies to software testing.

What is the Lincoln Index?

Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn’t very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.

Suppose two testers independently search for bugs. Let E1 be the number of errors the first tester finds and E2 the number of errors the second tester finds. Let S be the number of errors both testers find. The Lincoln Index estimates the total number of errors as E1 E2/S. You can find historical background on the Lincoln Index here.

How does the index work?

Suppose there are n bugs and the two testers find bugs with probability p1 and p2 respectively. You’d expect the two testers to find around np1 and np2 bugs. If you assume the probabilities of each tester finding a bug are independent, you’d expect the testers to find around np1 p2 bugs in common. That says E1*E2/S would be around

(n2 p1 p2) / (n p1 p2) = n.

The probabilities of each tester finding a bug cancel out leaving only n, the total number of bugs.

Simulation code

Here’s some Python code for simulating estimates using the Lincoln Index.


from random import random

def find_error(p):
    "Find an error with probability p"
    if random() < p:
        return 1
    return 0

def simulate(true_error_count, p1, p2, reps=10000):
    """Simulate Lincoln's method for estimating errors
    given the true number of errors, each person's probability
    of finding an error, and the number of simulations to run."""
    estimation_error_sum = 0
    for rep in xrange(reps):
        caught1 = 0
        caught2 = 0
        caught_both = 0
        for error in xrange(true_error_count):
            found1 = find_error(p1)
            found2 = find_error(p2)
            caught1 += found1
            caught2 += found2
            caught_both += found1*found2
        estimate = caught1*caught2 / float(caught_both)
        estimation_error_sum += abs(estimate - true_error_count)
    return estimation_error_sum / float(reps)

I used this to simulate the case of two testers, one with a 30% chance of finding a bug and the other with a 40% chance, and a total of 100 bugs. I simulated the Lincoln Index 1,000 times, keeping track of the absolute error in the estimates. The code to do this was simulate(100, 0.30, 0.40, 1000). On average, the Lincoln index over- or under-estimated the number of bugs by about 16. This is a good estimate considering each tester greatly under-estimated the number of bugs.

If you didn’t think about using something like the Lincoln Index, in the previous example one tester would find around 30 bugs and the other around 40. The two lists might have 10 bugs in common, so you’d estimate the total number at 60, far short of 100. But the Lincoln index would often find estimates between 84 and 116.

Note that it is possible that the testers won’t find any of the same bugs. In that case the Lincoln Index cannot be computed and the code will divide by zero. But this is unlikely unless the p’s are small and n is small.

Software testing

Does the Lincoln Index actually provide a good bug count estimate? That depends on how well the assumptions are met. The index assumes all bugs are equally hard for a given tester to find. It does not assume that both testers are equally skilled, but it does assume that their chances of finding a bug are independent. In other words, tester A is no more or less likely to find a bug just because tester B found it.

The most questionable assumption is that all bugs are equally hard to find. That’s usually not true. But it may be true that all bugs of a certain kind are equally hard to find. For example, spelling errors may be easier to find than validation oversights, but the Lincoln Index might be good for estimating separately how many spelling errors or validation errors there are.

The index might provide a rough rule of thumb even if the assumptions it that go into it are violated. For example, suppose one tester found 15 bugs and another found 20. But only 3 of the bugs were the same. A naive estimate would say since there are 32 unique bugs found, there must be around that many in total. But the Lincoln Index would estimate 100 bugs. Maybe the Lincoln estimate is not at all accurate, but it does tell you to be worried that there may be a lot more bugs to find since the overlap between the two bug lists was so small.

Related post:

Estimating the chances of something that hasn’t happened yet

{ 19 comments }

Dynamic typing and anti-lock brakes

by John on June 9, 2010

When we make one part of our lives safer, we tend to take more chances somewhere else. Psychologists call this tendency risk homeostasis.

One of the studies often cited to support the theory of risk homeostasis involved German cab drivers. Drivers in the experimental group were given cabs with anti-lock brakes while drivers in the control group were given cabs with conventional brakes. There was no difference in the rate of crashes between the two groups. The drivers who had better brakes drove less carefully.

Risk homeostasis may explain why dynamic programming languages such as Python aren’t as dangerous as critics suppose.

Advocates of statically typed programming languages argue that it is safer to have static type checking than to not have it. Would you rather the computer to catch some of your errors or not? I’d rather it catch some of my errors, thank you. But this argument assumes two things:

  1. static type checking comes at no cost, and
  2. static type checking has no impact on programmer behavior.

Advocates of dynamic programming languages have mostly focused on the first assumption. They argue that static typing requires so much extra programming effort that it is not worth the cost. I’d like to focus on the second assumption. Maybe the presence or absence of static typing changes programmer behavior.

Maybe a lack of static type checking scares dynamic language programmers into writing unit tests. Or to turn things around, perhaps static type checking lulls programmers into thinking they do not need unit tests. Maybe static type checking is like anti-lock brakes.

Nearly everyone would agree that static type checking does not eliminate the need for unit testing. Someone accustomed to working in a statically typed language might say “I know the compiler isn’t going to catch all my errors, but I’m glad that it catches some of them.” Static typing might not eliminate the need for unit testing, but it may diminish the motivation for unit testing. The lack of compile-time checking in dynamic languages may inspire developers to write more unit tests.

See Bruce Eckel’s article Strong Typing vs. Strong Testing for more discussion of the static typing and unit testing.

Update: I’m not knocking statically typed languages. I spend most of my coding time in such languages and I’m not advocating that we get rid of static typing in order to scare people into unit testing.

I wanted to address the question of what programmers do, not what they should do. In that sense, this post is more about psychology than software engineering. (Though I believe a large part of software engineering is in fact psychology as I’ve argued here.) Do programmers who work in dynamic languages write more tests? If so, does risk homeostasis help explain why?

Finally, I appreciate the value of unit testing. I’ve spent most of the last couple days writing unit tests. But there is a limit to the kinds of bugs that unit tests can catch. Unit tests are good at catching errors in code that has been written, but most errors come from code that should have been written. See Software sins of omission.

Related posts:

All languages equally complex
Reasoning about code

{ 25 comments }

Rewarding complexity

by John on April 5, 2010

Clay Shirky wrote an insightful article recently entitled The Collapse of Complex Business Models. The last line of the article contains the observation

… when the ecosystem stops rewarding complexity, it is the people who figure out how to work simply in the present, rather than the people who mastered the complexities of the past, who get to say what happens in the future.

It’s interesting to think how ecosystems reward complexity or simplicity.

Academia certainly rewards complexity. Coming up with ever more complex models is the safest road to tenure and fame. Simplification is hard work and isn’t good for your paper count.

Political pundits are rewarded for complex analysis, though politicians are rewarded for oversimplification.

The software market has rewarded complexity, but that may be changing. There’s a growing demand for simpler products, and software vendors are responding. For example, no one has ever accused Microsoft of having a minimalist aesthetic, but Window Phone 7 looks like a bold departure from Microsoft’s bloatware past.

Related posts:

Organizational scar tissue
Good, fast, or cheap: Can you really pick two?
Maybe NASA could use some buggy software

{ 7 comments }

Maintenance costs

by John on March 31, 2010

No engineered structure is designed to be built and then neglected or ignored. — Henry Petroski

The quote above comes from Henry Petroski’s recent interview on Tech Nation. In the same interview, Petroski says that a common rule of thumb is that maintenance costs about 4% of construction cost per year. For a structure as old as the Golden Gate Bridge (completed in 1937), for example, that’s a lot of 4%’s.

Golden Gate Bridge

Painting the bridge has cost far more than building it. The bridge is painted continuously: as soon as the painters reach the end of the bridge, they turn around and start over. The engineers who designed the bridge knew this would happen. When you build something out of steel and put it outside, it will need to be painted. It was all part of the design.

Image credit: Wikipedia

Related links:

Two kinds of software challenges
Do you really want to be indispensable?
Upcoming Y2K-like problems
The Essential Engineer, Henry Petroski’s new book

{ 3 comments }

Sleep debt and industrial accidents

by John on February 2, 2010

From The Power of Full Engagement:

… every one of the great industrial disasters of the past twenty years — Chernobyl, the Exxon Valdez, Bhopal, Three Mile Island — occurred in the middle of the night. For the most part, those in charge had worked very long hours and built up considerable sleep debt.

{ 0 comments }

Software sins of omission

by John on January 12, 2010

The Book of Common Prayer contains the confession

… we have left undone those things which we ought to have done, and we have done those things which we ought not to have done.

The things left undone are called sins of omission; things which ought not to have been done are called sins of commission.

In software testing and debugging, we focus on sins of commission, code that was implemented incorrectly. But according to Robert Glass, the majority of bugs are sins of omission. In Frequently Forgotten Fundamental Facts about Software Engineering Glass says

Roughly 35 percent of software defects emerge from missing logic paths, and another 40 percent are from the execution of a unique combination of logic paths.

If these figures are correct, three out of four software bugs are sins of omission, errors due to things left undone. These are bugs due to contingencies the developers did not think to handle. Three quarters seems like a large proportion, but it is plausible. I know I’ve often written plenty of bugs that amounted to not considering enough possibilities, particularly in graphical user interface software. It’s hard to think of everything a user might do and all the ways a user might arrive at a particular place. (When I first wrote user interface applications, my reaction to a bug report would be “Why would anyone do that?!” If everyone would just use my software the way I do, everything would be OK. )

It matters whether bugs are sins of omission or sins of commission. Different kinds of bugs are caught by different means. Developers have come to appreciate the value of unit testing lately, but unit tests primarily catch sins of commission. If you didn’t think to program something in the first place, you’re not likely to think to write a test for it. Complete test coverage could only find 25% of a projects bugs if you assume 75% of the bugs come from code that no one thought to write.

The best way to spot sins of omission is a fresh pair of eyes. As Glass says

Rigorous reviews are more effective, and more cost effective, than any other error-removal strategy, including testing. But they cannot and should not replace testing.

One way to combine the benefits of unit testing and code reviews would be to have different people write the unit tests and the production code.

Related posts:

The most subtle of the seven deadly sins
Shallow bugs versus reported bugs
Negative space in operating systems

{ 11 comments }

Counterfeit coins and rare diseases

by John on November 12, 2009

Here’s a puzzle I saw a long time ago that came to mind recently.

You have a bag of 27 coins. One of these coins is counterfeit and the rest are genuine. The genuine coins all weigh exactly the same, but the counterfeit coin is lighter. You have a simple balance. How can you find the counterfeit coin while using the scale the least number of times?

The surprising answer is that the false coin can be found while only using the scales only three times. Here’s how. Put nine coins on each side of the balance. If one side is lighter, the counterfeit is on that side; otherwise, it is one of the nine not on the scales. Now that you’ve narrowed it down to nine coins, apply the same idea recursively by putting three of the suspect coins on each side of the balance. The false coin is now either on the lighter side if the scales do not balance or one of the three remaining coins if the scales do balance. Now apply the same idea one last time to find which of the remaining three coins is the counterfeit. In general, you can find one counterfeit in 3n coins by using the scales n times.

The counterfeit coin problem came to mind when I was picking out homework problems for a probability class and ran into the following (problem 4.56 here):

A large number, N = mk, of people are subject to a blood test. This can be administered in two ways.

  1. Each person can be tested separately. In this case N tests are required.
  2. The blood samples of k people can be pooled and analyzed together. If the test is negative, this one test suffices for k people. If the test is positive, each of the k people must be tested separately, and, in all, k+1 test are required for the k people.

Suppose each person being tested has the disease with probability p. If the disease is rare, i.e. p is sufficiently small, the second approach will be more efficient. Consider the extremes. If p = 0, the first approach takes mk tests and the second approach takes only m tests. At the other extreme, if p = 1, the first approach still takes mk tests but the second approach now takes m(k+1) tests.

The homework problem asks for the expected number of tests used with each approach as a function of p for fixed k. Alternatively, you could assume that you always use the second method but need to find the optimal value of k. (This includes the possibility that k=1, which is equivalent to using the first method.)

I’d be curious to know whether these algorithms have names. I suspect computer scientists have given the coin testing algorithm a name. I also suspect the idea of pooling blood samples has several names, possibly one name when it is used in literally testing blood samples and other names when the same idea is applied to analogous testing problems.

{ 7 comments }

How to test a random number generator

by John on October 27, 2009

Random number generators are challenging to test.

  • The output is supposed to be unpredictable, so how do you know when the generator working correctly?
  • Your tests will fail occasionally, but how do you decide whether they’re failing too often?
  • What kinds of errors are most common when writing random number generation software?

These are some of the questions I address in Chapter 10 of Beautiful Testing.

Beautiful Testing: Leading Professionals Reveal How They Test

The book is now in stock at Amazon. It is supposed to be in book stores by Friday. All profits from Beautiful Testing go to Nothing But Nets, a project to distribute anti-malarial bed nets.

{ 2 comments }

Reviewing catch blocks

by John on October 22, 2009

Here’s an interesting exercise. If you’re writing code in a language like C# or C++ that has catch statements, write a script to report all catch blocks. You might be surprised at what you find. Some questions to ask:

  • Do catch blocks swallow exceptions and thus mask problems?
  • Is information lost by catching an exception and throwing a new one?
  • Are exceptions logged appropriately?
  • Are notification messages grammatically correct and helpful?

Here’s a PowerShell script that will report all catch statements plus the five lines following the catch statement.

Related post:

Finding embarrassing and unhelpful error messages

{ 1 comment }