Software quality: better in practice than in theory

Sir Tony Hoare

C. A. R. Hoare wrote an article How Did Software Get So Reliable Without Proof? in 1996 that still sounds contemporary for the most part.

In the 1980’s many believed that programs could not get much bigger unless we started using formal proof methods. The argument was that bugs are fairly common, and that each bug has the potential to bring a system down. Therefore the only way to build much larger systems was to rely on formal methods to catch bugs. And yet programs continued to get larger and formal methods never caught on. Hoare asks

Why have twenty years of pessimistic predictions been falsified?

Another twenty years later we can ask the same question. Systems have gotten far larger, and formal methods have not become common. Formal methods are used—more on that shortly—but have not become common.

Better in practice than in theory

It’s interesting that Hoare was the one to write this paper. He is best known for the quicksort, a sorting algorithm that works better in practice than in theory! Quicksort is commonly used in practice, even though has terrible worst-case efficiency, because its average efficiency has optimal asymptotic order [1], and in practice it works better than other algorithms with the same asymptotic order.

Economic considerations

It is logically possible that the smallest bug could bring down a system. And there have been examples, such as the Mars Climate Orbiter, where a single bug did in fact lead to complete failure. But this is rare. Most bugs are inconsequential.

Some will object “How can you be so blasé about bugs? A bug crashed a $300 million probe!” But what is the realistic alternative? Would spending an additional billion dollars on formal software verification have prevented the crash? Possibly, though not certainly, and the same money could send three more missions to Mars. (More along these lines here.)

It’s all a matter of economics. Formal verification is extremely tedious and expensive. The expense is worth it in some settings and not in others. The software that runs pacemakers is more critical than the software that runs a video game. For most software development, less formal methods have proved more cost effective at achieving acceptable quality: code reviews, unit tests, integration testing, etc.

Formal verification

I have some experience with formal software verification, including formal methods software used by NASA. When someone says that software has been formally verified, there’s an implicit disclaimer. It’s usually the algorithms have been formally verified, not the implementation of those algorithms in software. Also, maybe not all the algorithms have been verified, but say 90%, the remaining 10% being too difficult to verify. In any case, formally verified software can and has failed. Formal verification greatly reduces the probability of encountering a bug, but it does not reduce the probability to zero.

There has been a small resurgence of interest in formal methods since Hoare wrote his paper. And again, it’s all about economics. Theorem proving technology has improved over the last 20 years. And software is being used in contexts where the consequences of failure are high. But for most software, the most economical way to achieve acceptable quality is not through theorem proving.

There are also degrees of formality. Full theorem proving is extraordinarily tedious. If I remember correctly, one research group said that they could formally verify about one page of a mathematics textbook per man-week. But there’s a continuum between full formality and no formality. For example, you could have formal assurance that your software satisfies certain conditions, even if you can’t formally prove that the software is completely correct. Where you want to be along this continuum of formality is again a matter of economics. It depends on the probability and consequences of errors, and the cost of reducing these probabilities.

Related posts

[1] The worst-case performance of quicksort is O(n²) but the average performance is O(n log n).

Photograph of C. A. R. Hoare by Rama, Wikimedia Commons, Cc-by-sa-2.0-fr, CC BY-SA 2.0 fr

Ultra-reliable software

From a NASA page advocating formal methods:

We are very good at building complex software systems that work 95% of the time. But we do not know how to build complex software systems that are ultra-reliably safe (i.e. P_f < 10^-7/hour).

Emphasis added.

Developing medium-reliability and high-reliability software are almost entirely different professions. Using typical software development procedures on systems that must be ultra-reliable would invite disaster. But using extremely cautious development methods on systems that can afford to fail relatively often would be an economic disaster.

Related post: Formal validation methods let you explore the corners

 

To err is human, to catch an error shows expertise

Experts are OK at avoiding mistakes, but they’re even better at recognizing and fixing mistakes.

If you ask an elementary student something like “How can we know that the answer to this problem cannot be 8769?” they might only be able to say “Because the correct answer is 8760.” That is, the only tool they have for checking an result is to compare it to a known quantity. A more sophisticated student might be able to say something like “We know the result must be even” or “The result cannot be more than 8764” for  reasons that come from context.

Experienced programmers write fewer bugs than rookies. But more importantly, the bugs they do write don’t live as long. The experienced programmer may realize almost immediately that a piece of code cannot be right, while the rookie may not realize there’s a problem until the output is obviously wrong. The more experienced programmer might notice that the vertical alignment of the code looks wrong, or that incompatible pieces are being used together, or even that the code “smells wrong” for reasons he or she can’t articulate.

An engineer might know that an answer is wrong because it has the wrong magnitude or the wrong dimensions. Or maybe a result violates a conservation law. Or maybe the engineer thinks “That’s not the way things are done, and I bet there’s a good reason why.”

“Be more careful” only goes so far. Experts are not that much more careful than novices. We can only lower our rate of mistakes so much. There’s much more potential for being able to recognize mistakes than to prevent them.

A major step in maturing as a programmer is accepting the fact that you’re going to make mistakes fairly often. Maybe you’ll introduce a bug for every 10 lines of code, at least one for every 100 lines. (Rookies write more bugs than this but think they write fewer.) Once you accept this, you begin to ask how you can write code to make bugs stand out. Given two approaches, which is more likely to fail visibly if is it’s wrong? How can I write code so that logic errors are more likely to show up as compile errors? How am I doing to debug this when it breaks?

Theory pays off in the long run. Abstractions that students dismiss as impractical probably are impractical at first. But in the long run, these abstractions may prevent or catch errors. I’ve come to see the practicality in many things that I used to dismiss as pedantic: dimensional analysis, tensor properties, strongly typed programming languages, category theory, etc. A few weeks into a chemistry class I learned the value of dimensional analysis. It has taken me much longer to appreciate category theory.

Related posts:

Example of unit testing R code with testthat

Here’s a little example of using Hadley Wickham’s testthat package for unit testing R code. You can read more about testthat here.

The function below computes the real roots of a quadratic. All that really matters for our purposes is that the function can return 0, 1, or 2 numbers and it could raise an error.

real.roots <- function(a, b, c)
{
    if (a == 0.)
        stop("Leading term cannot be zero")

    d = b*b - 4*a*c # discriminant

    if (d < 0)
       rr = c()
    else if (d == 0)
       rr = c( -b/(2*a) )
    else
        rr = c( (-b - sqrt(d))/(2*a), 
                (-b + sqrt(d))/(2*a)  )

    return(rr)
}

To test this code with testthat we create another file for tests. The name of the file should begin with test so that testthat can recognize it as a file of test code. So let name the file containing the code above real_roots.R and the file containing its tests test_real_roots.R.

The test file needs to read in the file being tested.

source("real_roots.R")

Now let’s write some tests for the case of a quadratic with two real roots.

test_that("Distinct roots", {

    roots <- real.roots(1, 7, 12)

    expect_that( roots, is_a("numeric") )
    expect_that( length(roots), equals(2) )
    expect_that( roots[1] < roots[2], is_true() )
})

This tests that we get back two numbers and that they are sorted in increasing order.

Next we find the roots of (x + 3000)2 = x2 + 6000x + 9000000. We’ll test whether we get back -3000 as the only root. In general you can’t expect to get an exact answer, though in this case we do since the root is an integer. But we’ll show in the next example how to test for equality with a given tolerance.

test_that("Repeated root", {

    roots <- real.roots(1, 6000, 9000000)

    expect_that( length(roots), equals(1) )

    expect_that( roots, equals(-3000) )

    # Test whether ABSOLUTE error is within 0.1 
    expect_that( roots, equals(-3000.01, tolerance  = 0.1) )

    # Test whether RELATIVE error is within 0.1
    # To test relative error, set 'scale' equal to expected value.
    # See base R function all.equal for optional argument documentation.
    expect_equal( roots, -3001, tolerance  = 0.1, scale=-3001) 
})

To show how to test code that should raise an error, we’ll find the roots of 2x + 3, which isn’t a quadratic. Notice that you can test whether any error is raised or you can test whether the error message matches a given regular expression.

test_that("Polynomial must be quadratic", {

    # Test for ANY error                     
    expect_that( real.roots(0, 2, 3), throws_error() )

    # Test specifically for an error string containing "zero"
    expect_that( real.roots(0, 2, 3), throws_error("zero") )

    # Test specifically for an error string containing "zero" or "Zero" using regular expression
    expect_that( real.roots(0, 2, 3), throws_error("[zZ]ero") )
})

Finally, here are a couple tests that shouldn’t pass.

test_that("Bogus tests", {

    x <- c(1, 2, 3)

    expect_that( length(x), equals(2.7) )
    expect_that( x, is_a("data.frame") )
})

To run the tests, you can run test_dir or test_file. If you are at the R command line and your working directory is the directory containing the two files above, you could run the tests with test_dir("."). In this case we have only one file of test code, but if we had more test files test_dir would find them, provided the file names begin with test.

* * *

Related: Help integrating R into your environment

Dogfooding

Dogfooding refers companies using their own software. According to Wikipedia,

In 1988, Microsoft manager Paul Maritz sent Brian Valentine, test manager for Microsoft LAN Manager, an email titled “Eating our own Dogfood”, challenging him to increase internal usage of the company’s product. From there, the usage of the term spread through the company.

Dogfooding is a great idea, but it’s no substitute for usability testing. I get the impression that some products, if they’re tested at all, are tested by developers intimately familiar with how they’re intended to be used.

If your company is developing consumer software, it’s not dogfooding if just the developers use it. It’s dogfooding when people in sales and accounting use it. But that’s still no substitute for getting people outside the company to use it.

Dogfooding doesn’t just apply to software development. Whenever I buy something with inscrutable assembly instructions, I wonder why the manufacturer didn’t pay a couple people off the street to put the thing together on camera.

* * *

For a daily dose of computer science and related topics, follow @CompSciFact on Twitter.

CompSciFact twitter icon

Defensible software

It’s not enough for software to be correct. It has to be defensible.

I’m not thinking of defending against malicious hackers. I’m thinking about defending against sincere critics. I can’t count how many times someone was absolutely convinced that software I had a hand in was wrong when it in fact it was performing as designed.

In order to defend software, you have to understand what it does. Not just one little piece of it, but the whole system. You need to understand it better than the people who commissioned it: the presumed errors may stem from unforeseen consequences of the specification.

Related post: The buck stops with the programmer

Hunt down bad error messages

My printer is unable to clean 51. It won’t work, and all it says is “Unable to Clean 51.”

Here’s my suggestion for finding such useless error messages in a code review: Write a script to extract all string literals from your source code, then read over the output. The beauty of this approach is that the reviewer sees the text without the context of the surrounding code, just as the user does. Ideally someone who is not a programmer would review the output strings. I would hope someone who ran across “Unable to Clean 51” would flag that as something that doesn’t make sense.

It’s not hard to write a script to pull out string literals. If you’d like, you could use the script that accompanies my Code Project article PowerShell Script for Reviewing Text Shown to Users. That script tries to filter out strings that are not user output, such as file paths. It’s a pretty crude script, attempting to handle source code written in several languages. It will have a few false positives, and a few false negatives, but it works quite well for a short script. (Code Project’s “browse source” tab doesn’t work for PowerShell source. You’ll have to download the code to read it.)

Whenever I recommend this script, I run into a few objections that I’ll address below.

Q: Wouldn’t it be better if programmers put all user output text in string tables rather than putting user output text in the main body of their code?

A: Yes, but that takes more effort, and it’s not how most programmers work. And even if you have a policy to put all output strings in a resource table, you’d need something like this script to enforce the policy.

Q: Wouldn’t it be better to have a real parser extract the strings rather than doing regular expression guesswork?

A: Sure.

Q: Wouldn’t it be better to use a spell checker that’s built into your IDE?

A: No. The purpose of the review is not just to check spelling. You also want to catch grammar errors and unhelpful messages.

Q: Wouldn’t it be better to do a complete code review?

A: Yes and no. Complete code reviews are great for overall code quality. But they take a lot of effort and don’t happen often. A review of just the extracted strings takes far, far less time. Also, viewing the output strings out of context is better for catching unhelpful or ungrammatical messages.

Q: Isn’t this simplistic? Doesn’t every company do something like this?

A: If they do, why do I continually find spelling errors, grammatical errors, and unhelpful messages in the software I use?

How things break

Venkatesh Rao wrote a blog post today Stress Failures versus Decay Failures. It reminded me of three other resources I recommend on how things break. The first is about how things literally break. For example, why the steel in the Titanic was brittle.

The other two are about how complex systems break.

Bugs, features, and risk

All software has bugs. Someone has estimated that production code has about one bug per 100 lines. Of course there’s some variation in this number. Some software is a lot worse, and some is a little better.

But bugs-per-line-of-code is not very useful for assessing risk. The risk of a bug is the probability of running into it multiplied by its impact. Some lines of code are far more likely to execute than others, and some bugs are far more consequential than others.

Devoting equal effort to testing all lines of code would be wasteful. You’re not going to find all the bugs anyway, so you should concentrate on the parts of the code that are most likely to run and that would produce the greatest harm if they were wrong.

However, here’s a complication. The probability of running into a bug can change over time as people use the software in new ways. For whatever reason people to want to use features that had not been exercised before. When they do so, they’re likely to uncover new bugs.

(This helps explain why everyone thinks his preferred software is more reliable than others. When you’re a typical user, you tread the well-tested paths. You also learn, often subconsciously, to avoid buggy paths. When you bring your expectations from an old piece of software to a new one, you’re more likely to uncover bugs.)

Even though usage patterns change, they don’t change arbitrarily. It’s still the case that some code is far more likely than other code to execute.

Good software developers think ahead. They solve more than they’re asked to solve. They think “I’m going to go ahead and include this other case while I’m at it in case they need it later.” They’re heroes when it turns out their guesses about future needs were correct.

But there’s a downside to this initiative. You pay for what you don’t use. Every speculative feature either has to be tested, incurring more expense up front, or delivered untested, incurring more risk. This suggests its better to disable unused features.

You cannot avoid speculation entirely. Writing maintainable software requires speculating well, anticipating and preparing for change. Good software developers place good bets, and these tend to be small bets, going to a little extra effort to make software much more flexible. As with bugs, you have to consider probabilities and consequences: how likely is this part of the software to change, and how much effort will it take to prepare for that change?

Developers learn from experience what aspects of software are likely to change and they prepare for that change. But then they get angry at a rookie who wastes a lot of time developing some unnecessary feature. They may not realize that the rookie is doing the same thing they are, but with a less informed idea of what’s likely to be needed in the future.

Disputes between developers often involve hidden assumptions about probabilities. Whether some aspect of the software is responsible preparation for maintenance or wasteful gold plating depends on your idea of what’s likely to happen in the future.

Related:

Software exoskeletons

There’s a major divide between the way scientists and programmers view the software they write.

Scientists see their software as a kind of exoskeleton, an extension of themselves. Think Dr. Octopus. The software may do heavy lifting, but the scientists remain actively involved in its use. The software is a tool, not a self-contained product.

Spiderman versus Dr. Ock

Programmers see their software as something they will hand over to someone else, more like building a robot than an exoskeleton. Programmers believe it’s their job to encapsulate intelligence in software. If users have to depend on programmers after the software is written, the programmers didn’t finish their job.

I work with scientists and programmers, often bridging the gaps between the two cultures. One point of tension is defining when a project is done. To a scientist, the software is done when they get what they want out of it, such as a table of numbers for a paper. Professional programmers give more thought to reproducibility, maintainability, and correctness. Scientists think programmers are anal retentive. Programmers think scientists are cowboys.

Programmers need to understand that sometimes a program really only needs to run once, on one set of input, with expert supervision. Scientists need to understand that prototype code may need a complete rewrite before it can be used in production.

The real tension comes when a piece of research software is suddenly expected to be ready for production. The scientist will say “the code has already been written” and can’t imagine it would take much work, if any, to prepare the software for its new responsibilities. They don’t understand how hard it is for an engineer to turn an exoskeleton into a self-sufficient robot.

Related posts: