Defensible software

It’s not enough for software to be correct. It has to be defensible.

I’m not thinking of defending against malicious hackers. I’m thinking about defending against sincere critics. I can’t count how many times someone was absolutely convinced that software I had a hand in was wrong when in fact it was performing as designed.

In order to defend software, you have to understand what it does. Not just one little piece of it, but the whole system. You need to understand it better than the people who commissioned it: the presumed errors may stem from unforeseen consequences of the specification.

Related post: The buck stops with the programmer

Hunt down bad error messages

My printer is unable to clean 51. It won’t work, and all it says is “Unable to Clean 51.”

Here’s my suggestion for finding such useless error messages in a code review: Write a script to extract all string literals from your source code, then read over the output. The beauty of this approach is that the reviewer sees the text without the context of the surrounding code, just as the user does. Ideally someone who is not a programmer would review the output strings. I would hope someone who ran across “Unable to Clean 51” would flag that as something that doesn’t make sense.

It’s not hard to write a script to pull out string literals. If you’d like, you could use the script that accompanies my Code Project article PowerShell Script for Reviewing Text Shown to Users. That script tries to filter out strings that are not user output, such as file paths. It’s a pretty crude script, attempting to handle source code written in several languages. It will have a few false positives, and a few false negatives, but it works quite well for a short script. (Code Project’s “browse source” tab doesn’t work for PowerShell source. You’ll have to download the code to read it.)
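
If you want a feel for what such a script does, here’s a rough sketch in Python rather than PowerShell. The regular expression and the filtering heuristics are my own guesses at what separates user messages from paths and identifiers; the script in the article is more careful.

import re
import sys

# Crude pattern for string literals in C-like languages.
STRING_LITERAL = re.compile(
    r'"(?:[^"\\]|\\.)*"'       # double-quoted literals
    r"|'(?:[^'\\]|\\.)*'"      # single-quoted literals
)

def looks_like_output(s):
    "Guess whether a string is shown to users."
    if len(s) < 5:                 # too short to be a message
        return False
    if "\\" in s or "/" in s:      # probably a file path
        return False
    return " " in s                # messages usually contain spaces

for filename in sys.argv[1:]:
    with open(filename, encoding="utf-8", errors="ignore") as f:
        for match in STRING_LITERAL.finditer(f.read()):
            literal = match.group(0)[1:-1]    # strip the quotes
            if looks_like_output(literal):
                print(literal)

Save it as, say, extract_strings.py, run it over your source files, and read the output the way a user would: without the surrounding code.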

Whenever I recommend this script, I run into a few objections that I’ll address below.

Q: Wouldn’t it be better if programmers put all user output text in string tables rather than putting user output text in the main body of their code?

A: Yes, but that takes more effort, and it’s not how most programmers work. And even if you have a policy to put all output strings in a resource table, you’d need something like this script to enforce the policy.

Q: Wouldn’t it be better to have a real parser extract the strings rather than doing regular expression guesswork?

A: Sure.

Q: Wouldn’t it be better to use a spell checker that’s built into your IDE?

A: No. The purpose of the review is not just to check spelling. You also want to catch grammar errors and unhelpful messages.

Q: Wouldn’t it be better to do a complete code review?

A: Yes and no. Complete code reviews are great for overall code quality. But they take a lot of effort and don’t happen often. A review of just the extracted strings takes far, far less time. Also, viewing the output strings out of context is better for catching unhelpful or ungrammatical messages.

Q: Isn’t this simplistic? Doesn’t every company do something like this?

A: If they do, why do I continually find spelling errors, grammatical errors, and unhelpful messages in the software I use?

How things break

Venkatesh Rao wrote a blog post today, Stress Failures versus Decay Failures. It reminded me of three other resources I recommend on how things break. The first is about how things literally break, for example why the steel in the Titanic was brittle.

The other two are about how complex systems break.

Bugs, features, and risk

All software has bugs. Someone has estimated that production code has about one bug per 100 lines. Of course there’s some variation in this number. Some software is a lot worse, and some is a little better.

But bugs-per-line-of-code is not very useful for assessing risk. The risk of a bug is the probability of running into it multiplied by its impact. Some lines of code are far more likely to execute than others, and some bugs are far more consequential than others.

Devoting equal effort to testing all lines of code would be wasteful. You’re not going to find all the bugs anyway, so you should concentrate on the parts of the code that are most likely to run and that would produce the greatest harm if they were wrong.
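
Here’s a toy illustration of that risk calculation. The paths and numbers below are invented, but they show why a rare, catastrophic path can still rank below a common, modest one.

# Rank code paths by (probability of execution) x (harm if wrong).
# The names and numbers are made up for illustration.
paths = [
    # (name, probability a user hits it, cost if it fails)
    ("save document",          0.90, 100),
    ("export to PDF",          0.20,  40),
    ("recover corrupted file", 0.01, 500),
]

for name, prob, impact in sorted(paths, key=lambda p: -p[1] * p[2]):
    print(f"{name:25s} expected risk = {prob * impact:6.1f}")

Here the rarely exercised recovery path, despite its high impact, ranks below saving a document, which nearly everyone does.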

However, here’s a complication. The probability of running into a bug can change over time as people use the software in new ways. For whatever reason, people want to use features that had not been exercised before. When they do, they’re likely to uncover new bugs.

(This helps explain why everyone thinks his preferred software is more reliable than others. When you’re a typical user, you tread the well-tested paths. You also learn, often subconsciously, to avoid buggy paths. When you bring your expectations from an old piece of software to a new one, you’re more likely to uncover bugs.)

Even though usage patterns change, they don’t change arbitrarily. It’s still the case that some code is far more likely than other code to execute.

Good software developers think ahead. They solve more than they’re asked to solve. They think “I’m going to go ahead and include this other case while I’m at it in case they need it later.” They’re heroes when it turns out their guesses about future needs were correct.

But there’s a downside to this initiative. You pay for what you don’t use. Every speculative feature either has to be tested, incurring more expense up front, or delivered untested, incurring more risk. This suggests it’s better to disable unused features.

You cannot avoid speculation entirely. Writing maintainable software requires speculating well, anticipating and preparing for change. Good software developers place good bets, and these tend to be small bets, going to a little extra effort to make software much more flexible. As with bugs, you have to consider probabilities and consequences: how likely is this part of the software to change, and how much effort will it take to prepare for that change?

Developers learn from experience what aspects of software are likely to change and they prepare for that change. But then they get angry at a rookie who wastes a lot of time developing some unnecessary feature. They may not realize that the rookie is doing the same thing they are, but with a less informed idea of what’s likely to be needed in the future.

Disputes between developers often involve hidden assumptions about probabilities. Whether some aspect of the software is responsible preparation for maintenance or wasteful gold plating depends on your idea of what’s likely to happen in the future.

Related: Why programmers write unneeded code

Software exoskeletons

There’s a major divide between the way scientists and programmers view the software they write.

Scientists see their software as a kind of exoskeleton, an extension of themselves. Think Dr. Octopus. The software may do heavy lifting, but the scientists remain actively involved in its use. The software is a tool, not a self-contained product.

[Image: Spider-Man versus Doctor Octopus]

Programmers see their software as something they will hand over to someone else, more like building a robot than an exoskeleton. Programmers believe it’s their job to encapsulate intelligence in software. If users have to depend on programmers after the software is written, the programmers didn’t finish their job.

I work with scientists and programmers, often bridging the gaps between the two cultures. One point of tension is defining when a project is done. To a scientist, the software is done when they get what they want out of it, such as a table of numbers for a paper. Professional programmers give more thought to reproducibility, maintainability, and correctness. Scientists think programmers are anal retentive. Programmers think scientists are cowboys.

Programmers need to understand that sometimes a program really only needs to run once, on one set of input, with expert supervision. Scientists need to understand that prototype code may need a complete rewrite before it can be used in production.

The real tension comes when a piece of research software is suddenly expected to be ready for production. The scientist will say “the code has already been written” and can’t imagine it would take much work, if any, to prepare the software for its new responsibilities. They don’t understand how hard it is for an engineer to turn an exoskeleton into a self-sufficient robot.

More software development posts

Pilots and pair programming

From Outliers by Malcolm Gladwell:

In commercial airlines, captains and first officers split the flying duties equally. But historically, crashes have been far more likely to happen when the captain is in the “flying seat.” At first this seems to make no sense, since the captain is almost always the pilot with the most experience. … Planes are safer when the least experienced pilot is flying, because it means the second pilot isn’t going to be afraid to speak up.

The context of this excerpt is an examination of airplane crashes in which the copilot was aware of the pilot’s errors but did not speak up assertively.

I wonder whether an analogous result holds for pair programming. Do more bugs slip into the code when the more experienced programmer has the keyboard? The German Aerospace Center (DLR) thinks so. It pairs junior and senior programmers: the junior programmer writes all the code while the senior programmer watches.

Related posts

How to test a random number generator

Last year I wrote a chapter for O’Reilly’s book Beautiful Testing (ISBN 0596159811). The publisher gave each of us permission to post our chapters online, and so here is Chapter 10: How to test a random number generator.

Update: The chapter linked to above describes how to test transformations of a trusted uniform random number generator. For example, maybe your programming language provides a way to generate numbers between 0 and 1 uniformly, and you have written code to transform that into a normal distribution. That’s the most common case. Few people write their own core RNG; most bootstrap the core RNG into other kinds of RNG.
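
Here’s a minimal sketch of the kind of test the chapter covers, assuming scipy is available: generate normal samples by transforming uniform ones (here via the Box-Muller transform) and compare them to the normal distribution with a Kolmogorov-Smirnov test.

import math
import random
from scipy.stats import kstest

def box_muller(n):
    "Generate n standard normal samples by transforming uniforms."
    samples = []
    for _ in range(n // 2):
        u1 = 1.0 - random.random()   # in (0, 1], avoids log(0)
        u2 = random.random()
        r = math.sqrt(-2.0 * math.log(u1))
        samples.append(r * math.cos(2.0 * math.pi * u2))
        samples.append(r * math.sin(2.0 * math.pi * u2))
    return samples

# Compare the samples to the standard normal distribution.
statistic, p_value = kstest(box_muller(10000), "norm")
print(f"KS statistic = {statistic:.4f}, p-value = {p_value:.4f}")

A single small p-value doesn’t prove the code is wrong, since one will turn up occasionally by chance, but consistently tiny p-values point to a bug in the transformation.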

If you need to test a uniform random number generator, I can help with that.

Buggy code is biased code

As bad as corporate software may be, academic software is usually worse. I’ve worked in industry and in academia and have seen first-hand how much lower the quality bar is in academia. And I’m not the only one who has noticed this.

Why does this matter? Because buggy code is biased code. Bugs that cause the software to give unwanted results are more likely to be noticed and fixed. Bugs that cause software to produce expected results are more likely to remain in place.

If your software simulates some complex phenomenon, you don’t know what it’s supposed to do; that’s why you’re simulating. Errors are easier to spot in consumer software. A climate model needs a higher level of quality assurance than a word processor because bugs in the latter are more obvious. Genomic analysis may contain egregious errors and no one would ever know, but a bug in an MP3 player is sure to annoy users.

You have to test simulation software carefully. You have to test special cases and individual components to have any confidence in the final output. You can’t just look at the results and say “Yeah, that’s about what I expected.”
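
One way to do that is to check each component against a special case with a known closed-form answer. Here’s a minimal sketch, using an invented component whose exact answer is known: the mean of an exponential distribution with rate 2 is exactly 1/2.

import random

def estimate_mean_exponential(rate, n=100000):
    "Monte Carlo estimate of the mean of an exponential distribution."
    return sum(random.expovariate(rate) for _ in range(n)) / n

# If the estimate is far from the known answer, the component is
# buggy, no matter how plausible the full simulation's output looks.
estimate = estimate_mean_exponential(2.0)
assert abs(estimate - 0.5) < 0.01, f"expected ~0.5, got {estimate:.4f}"
print(f"mean estimate = {estimate:.4f} (exact: 0.5)")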

Related posts

Acknowledging problems versus solving problems

People want their problems acknowledged more than they want them solved, at least at first. That’s one of the points from Thomas Limoncelli’s book Time Management for System Administrators.

Suppose two system administrators get an email about similar problems. The first starts working on the problem right away and replies to the email a couple hours later saying the problem is fixed. The second replies immediately to say he understands the problem and will resolve it first thing tomorrow. The second system administrator will be more popular.

Of course people want their problems solved, and sooner is better than later. But first they want to know someone is listening. Sometimes that’s all they want.

Related posts

How many errors are left to find?

There’s a simple statistic called the Lincoln Index that lets you estimate the total number of errors based on the number of errors found. I’ll explain what the Lincoln Index is, why it works, give some code for playing with it, and discuss how it applies to software testing.

What is the Lincoln Index?

Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn’t very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.

Suppose two testers independently search for bugs. Let E1 be the number of errors the first tester finds and E2 the number of errors the second tester finds. Let S be the number of errors both testers find. The Lincoln Index estimates the total number of errors as

E1 × E2 / S.

You can find historical background on the Lincoln Index here.

How does the index work?

Suppose there are n bugs and the two testers find bugs with probability p1 and p2 respectively. You’d expect the two testers to find around n p1 and n p2 bugs. If you assume the probabilities of each tester finding a bug are independent, you’d expect the testers to find around n p1 p2 bugs in common. That says

E1 × E2 / S

would be around

(n² p1 p2) / (n p1 p2) = n.

The probabilities of each tester finding a bug cancel out leaving only n, the total number of bugs.
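
For example, suppose n = 100, p1 = 0.3, and p2 = 0.4. You’d expect E1 ≈ 30, E2 ≈ 40, and S ≈ 100 × 0.3 × 0.4 = 12, giving an estimate of 30 × 40 / 12 = 100.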

Simulation code

Here’s some Python code for simulating estimates using the Lincoln Index.

from random import random

def find_error(p):
    "Find an error with probability p."
    if random() < p:
        return 1
    return 0

def simulate(true_error_count, p1, p2, reps=10000):
    """Simulate Lincoln's method for estimating errors
    given the true number of errors, each person's probability
    of finding an error, and the number of simulations to run."""
    estimation_error_sum = 0
    for rep in range(reps):
        caught1 = 0
        caught2 = 0
        caught_both = 0
        for error in range(true_error_count):
            found1 = find_error(p1)
            found2 = find_error(p2)
            caught1 += found1
            caught2 += found2
            caught_both += found1 * found2
        # caught_both can be zero, in which case the Lincoln Index
        # is undefined and the line below divides by zero.
        estimate = caught1 * caught2 / caught_both
        estimation_error_sum += abs(estimate - true_error_count)
    return estimation_error_sum / reps

I used this to simulate the case of two testers, one with a 30% chance of finding a bug and the other with a 40% chance, and a total of 100 bugs. I simulated the Lincoln Index 1,000 times, keeping track of the absolute error in the estimates. The code to do this was simulate(100, 0.30, 0.40, 1000). On average, the Lincoln Index over- or under-estimated the number of bugs by about 16. This is a good estimate considering each tester greatly under-estimated the number of bugs.

Without something like the Lincoln Index, you might reason as follows. In the previous example, one tester would find around 30 bugs and the other around 40. The two lists might have 10 bugs in common, so you’d estimate the total number at 60, far short of 100. But the Lincoln Index would often produce estimates between 84 and 116.

Note that it is possible that the testers won’t find any of the same bugs. In that case the Lincoln Index cannot be computed and the code will divide by zero. But this is unlikely unless the p's are small and n is small.

Software testing

Does the Lincoln Index actually provide a good bug count estimate? That depends on how well the assumptions are met. The index assumes all bugs are equally hard for a given tester to find. It does not assume that both testers are equally skilled, but it does assume that their chances of finding a bug are independent. In other words, tester A is no more or less likely to find a bug just because tester B found it.

The most questionable assumption is that all bugs are equally hard to find. That’s usually not true. But it may be true that all bugs of a certain kind are equally hard to find. For example, spelling errors may be easier to find than validation oversights, but the Lincoln Index might be good for estimating separately how many spelling errors or validation errors there are.

The index might provide a rough rule of thumb even if the assumptions that go into it are violated. For example, suppose one tester found 15 bugs and another found 20, but only 3 of the bugs were the same. A naive estimate would say that since there are 32 unique bugs found, there must be around that many in total. But the Lincoln Index would estimate 100 bugs. Maybe the Lincoln estimate is not at all accurate, but it does tell you to be worried that there may be a lot more bugs to find, since the overlap between the two bug lists was so small.

Related post: Estimating the chances of something that hasn’t happened yet