Distracted by the hard part

Last night I was helping my daughter with calculus homework. I told her that a common mistake was to forget what the original problem was after getting absorbed in sub-problems that have to be solved. I saw this over and over when I taught college.

Then a few minutes later, we both did exactly what I warned her against. She took the answer to a difficult sub-problem to be the final answer. I checked her work and confirmed that it was correct, until I saw we hadn’t actually answered the original question.

As I was waking up this morning, I realized I was about to make the same mistake on a client’s project. The goal was to write software to implement a function f which is a trivial composition of two other functions g and h. These two functions took a lot of work, including a couple levels of code generation. I felt I was done after testing g and h, but I forgot to write tests for f, the very thing I was asked to deliver.

This is a common pattern that goes beyond calculus homework and software development. It’s why checklists are so valuable. We resist checklists because they insult our intelligence, and yet they greatly reduce errors. Experienced people in every field can skip a step, most likely a simple step, without some structure to help them keep track.

Converting between nines and sigmas

Nines and sigmas are two ways to measure quality. You’ll hear something has four or five nines of reliability or that some failure is a five sigma event. What do these mean, and how do you convert between them?


If a system has five nines of availability, that means the probability of the system being up is 99.999%. Equivalently, the probability of it being down is 0.00001.

\underbrace{\mbox{99.999}}_{\mbox{{\normalsize five 9's}}}\%

In general, n nines of availability means the probability of failure is 10⁻ⁿ.

If a system has s sigmas of reliability, that means the probability of failure is the same as the probability of a Gaussian random variable being s standard deviations above its mean [1].

Conversion formulas

Let Φ be the cumulative distribution function for a standard normal, i.e. a Gaussian random variable with mean zero and standard deviation 1. Then s sigmas corresponds to n nines, where

n = -log10(Φ(-s))

and

s = -Φ⁻¹(10⁻ⁿ).
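
As a rough illustration, the exact conversions can be sketched in Python using only the standard library. The function names are made up for this example. The tail probability Φ(-s) is computed with the complementary error function, Φ(-s) = erfc(s/√2)/2, which keeps its accuracy far out in the tail.

```python
import math
from statistics import NormalDist

def nines_from_sigmas(s):
    """n = -log10(Phi(-s)), with Phi(-s) evaluated as erfc(s / sqrt(2)) / 2."""
    tail = 0.5 * math.erfc(s / math.sqrt(2))
    return -math.log10(tail)

def sigmas_from_nines(n):
    """s = -Phi^{-1}(10^{-n}); requires n > 0 so the probability is below 1."""
    return -NormalDist().inv_cdf(10.0 ** -n)

if __name__ == "__main__":
    for s in range(1, 7):
        print(f"{s} sigmas is about {nines_from_sigmas(s):.2f} nines")
    for n in range(1, 7):
        print(f"{n} nines is about {sigmas_from_nines(n):.2f} sigmas")
```

With these definitions, three sigmas comes out a little under three nines, and five nines a little over four sigmas.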

We’ll give approximate formulas in just a second that don’t involve Φ but just use functions on a basic calculator.

Here’s a plot showing the relationship between nines and sigmas.

The plot looks a lot like a quadratic, and in fact if we take the square root we get a plot that’s visually indistinguishable from a straight line. This leads to very good approximations

n ≈ (0.47 + 0.42 s)²

and

s ≈ 2.37 √n − 1.12.

The approximation is so good that it’s hard to see the difference between it and the exact value in a plot.
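
Without the plot, a quick numerical check gives a feel for the fit. The sketch below compares the exact value of n with the approximation n ≈ (0.47 + 0.42 s)², and also applies the inverse approximation s ≈ 2.37 √n − 1.12. The function names are invented for the example.

```python
import math

def exact_nines(s):
    """Exact n = -log10(Phi(-s)), using Phi(-s) = erfc(s / sqrt(2)) / 2."""
    return -math.log10(0.5 * math.erfc(s / math.sqrt(2)))

def approx_nines(s):
    """Approximation n ~ (0.47 + 0.42 s)^2."""
    return (0.47 + 0.42 * s) ** 2

def approx_sigmas(n):
    """Approximation s ~ 2.37 sqrt(n) - 1.12."""
    return 2.37 * math.sqrt(n) - 1.12

if __name__ == "__main__":
    for s in range(1, 7):
        print(f"s = {s}: exact n = {exact_nines(s):.3f}, approx n = {approx_nines(s):.3f}")
```

For s from 1 to 6 the two values of n differ by less than 0.2, which fits the one-significant-figure use these measures get.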

These approximations are more than adequate since nines and sigmas are crude measures, not accurate to more than one significant figure [2].

[1] Here I’m considering the one-tailed case, the probability of being so many standard deviations above the mean. You could consider the two-tailed version, where you look at the probability of being so many standard deviations above or below the mean. The two-tailed probability is simply twice the one-tailed probability by symmetry.

[2] As I’ve written elsewhere, I’m skeptical of the implicit normal distribution assumption, particularly for rare events. The normal distribution is often a good modeling assumption in the middle, but not so often in the tails. Going out as far as six sigmas is dubious, and so the plot above covers as much range as is practical and then some.

Software quality: better in practice than in theory

Sir Tony Hoare

C. A. R. Hoare wrote an article, “How Did Software Get So Reliable Without Proof?”, in 1996 that still sounds contemporary for the most part.

In the 1980s many believed that programs could not get much bigger unless we started using formal proof methods. The argument was that bugs are fairly common, and that each bug has the potential to bring a system down. Therefore the only way to build much larger systems was to rely on formal methods to catch bugs. And yet programs continued to get larger and formal methods never caught on. Hoare asks

Why have twenty years of pessimistic predictions been falsified?

Another twenty years later we can ask the same question. Systems have gotten far larger. Formal methods are used (more on that shortly) but have not become common.

Better in practice than in theory

It’s interesting that Hoare was the one to write this paper. He is best known for quicksort, a sorting algorithm that works better in practice than in theory! Quicksort is commonly used in practice, even though it has terrible worst-case efficiency, because its average efficiency has optimal asymptotic order [1], and in practice it works better than other algorithms with the same asymptotic order.
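
To make the gap between worst case and average case concrete, here is a small Python sketch. It is a textbook quicksort that always picks the first element as the pivot, chosen because it makes the worst case easy to trigger. It is not how production libraries implement the algorithm. The function counts one comparison for each element partitioned against a pivot.

```python
import random

def quicksort(items):
    """Return (sorted copy, comparison count) using the first element as pivot."""
    if len(items) <= 1:
        return list(items), 0
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    left, left_count = quicksort(smaller)
    right, right_count = quicksort(larger)
    # Each element of `rest` is charged one comparison against the pivot.
    return left + [pivot] + right, len(rest) + left_count + right_count

n = 200
_, worst = quicksort(list(range(n)))       # already sorted: worst case for this pivot rule
shuffled = list(range(n))
random.Random(0).shuffle(shuffled)         # fixed seed so the run is repeatable
_, typical = quicksort(shuffled)

if __name__ == "__main__":
    print("sorted input:  ", worst, "comparisons")
    print("shuffled input:", typical, "comparisons")
```

On already sorted input this pivot rule makes n(n − 1)/2 comparisons, while a shuffled input of the same size needs far fewer.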

Economic considerations

It is logically possible that the smallest bug could bring down a system. And there have been examples, such as the Mars Climate Orbiter, where a single bug did in fact lead to complete failure. But this is rare. Most bugs are inconsequential.

Some will object “How can you be so blasé about bugs? A bug crashed a $300 million probe!” But what is the realistic alternative? Would spending an additional billion dollars on formal software verification have prevented the crash? Possibly, though not certainly, and the same money could send three more missions to Mars. (More along these lines here.)

It’s all a matter of economics. Formal verification is extremely tedious and expensive. The expense is worth it in some settings and not in others. The software that runs pacemakers is more critical than the software that runs a video game. For most software development, less formal methods have proved more cost effective at achieving acceptable quality: code reviews, unit tests, integration testing, etc.

Formal verification

I have some experience with formal software verification, including formal methods software used by NASA. When someone says that software has been formally verified, there’s an implicit disclaimer. Usually it’s the algorithms that have been formally verified, not the implementation of those algorithms in software. Also, maybe not all the algorithms have been verified, but say 90%, the remaining 10% being too difficult to verify. In any case, formally verified software can fail and has failed. Formal verification greatly reduces the probability of encountering a bug, but it does not reduce the probability to zero.

There has been a small resurgence of interest in formal methods since Hoare wrote his paper. And again, it’s all about economics. Theorem proving technology has improved over the last 20 years. And software is being used in contexts where the consequences of failure are high. But for most software, the most economical way to achieve acceptable quality is not through theorem proving.

There are also degrees of formality. Full theorem proving is extraordinarily tedious. If I remember correctly, one research group said that they could formally verify about one page of a mathematics textbook per man-week. But there’s a continuum between full formality and no formality. For example, you could have formal assurance that your software satisfies certain conditions, even if you can’t formally prove that the software is completely correct. Where you want to be along this continuum of formality is again a matter of economics. It depends on the probability and consequences of errors, and the cost of reducing these probabilities.

[1] The worst-case performance of quicksort is O(n²) but the average performance is O(n log n).

Photograph of C. A. R. Hoare by Rama, Wikimedia Commons, CC BY-SA 2.0 FR

Ultra-reliable software

From a NASA page advocating formal methods:

We are very good at building complex software systems that work 95% of the time. But we do not know how to build complex software systems that are ultra-reliably safe (i.e. P_f < 10^-7/hour).

Emphasis added.

Developing medium-reliability and high-reliability software are almost entirely different professions. Using typical software development procedures on systems that must be ultra-reliable would invite disaster. But using extremely cautious development methods on systems that can afford to fail relatively often would be an economic disaster.

Related post: Formal validation methods let you explore the corners


To err is human, to catch an error shows expertise

Experts are OK at avoiding mistakes, but they’re even better at recognizing and fixing mistakes.

If you ask an elementary student something like “How can we know that the answer to this problem cannot be 8769?” they might only be able to say “Because the correct answer is 8760.” That is, the only tool they have for checking a result is to compare it to a known quantity. A more sophisticated student might be able to say something like “We know the result must be even” or “The result cannot be more than 8764” for reasons that come from context.

Experienced programmers write fewer bugs than rookies. But more importantly, the bugs they do write don’t live as long. The experienced programmer may realize almost immediately that a piece of code cannot be right, while the rookie may not realize there’s a problem until the output is obviously wrong. The more experienced programmer might notice that the vertical alignment of the code looks wrong, or that incompatible pieces are being used together, or even that the code “smells wrong” for reasons he or she can’t articulate.

An engineer might know that an answer is wrong because it has the wrong magnitude or the wrong dimensions. Or maybe a result violates a conservation law. Or maybe the engineer thinks “That’s not the way things are done, and I bet there’s a good reason why.”

“Be more careful” only goes so far. Experts are not that much more careful than novices. We can only lower our rate of mistakes so much. There’s much more potential for being able to recognize mistakes than to prevent them.

A major step in maturing as a programmer is accepting the fact that you’re going to make mistakes fairly often. Maybe you’ll introduce a bug for every 10 lines of code, at least one for every 100 lines. (Rookies write more bugs than this but think they write fewer.) Once you accept this, you begin to ask how you can write code to make bugs stand out. Given two approaches, which is more likely to fail visibly if it’s wrong? How can I write code so that logic errors are more likely to show up as compile errors? How am I going to debug this when it breaks?
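
Here is one small sketch of what “fail visibly” can look like, written in Python. The Quantity class and the helper function are hypothetical, invented for this illustration. Adding bare numbers with mismatched units gives a plausible-looking wrong answer, while adding tagged quantities raises an error at the point of the mistake. Python has no compile step, so the failure shows up at run time. A statically typed language could catch the same mistake earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    """A number tagged with its unit (hypothetical helper for illustration)."""
    value: float
    unit: str

    def __add__(self, other):
        # Refuse to combine values whose units differ, rather than guessing.
        if not isinstance(other, Quantity) or other.unit != self.unit:
            raise TypeError(f"cannot add {self!r} and {other!r}")
        return Quantity(self.value + other.value, self.unit)

def adding_raises(a, b):
    """Return True if a + b fails visibly with a TypeError."""
    try:
        a + b
    except TypeError:
        return True
    return False

length_m = Quantity(3.0, "m")
length_ft = Quantity(10.0, "ft")

silent_sum = 3.0 + 10.0                           # bare floats: wrong, but no complaint
visible_failure = adding_raises(length_m, length_ft)
```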

Theory pays off in the long run. Abstractions that students dismiss as impractical probably are impractical at first. But in the long run, these abstractions may prevent or catch errors. I’ve come to see the practicality in many things that I used to dismiss as pedantic: dimensional analysis, tensor properties, strongly typed programming languages, category theory, etc. A few weeks into a chemistry class I learned the value of dimensional analysis. It has taken me much longer to appreciate category theory.

Example of unit testing R code with testthat

Here’s a little example of using Hadley Wickham’s testthat package for unit testing R code.

The function below computes the real roots of a quadratic. All that really matters for our purposes is that the function can return 0, 1, or 2 numbers and it could raise an error.

    real.roots <- function(a, b, c)
    {
        if (a == 0.)
            stop("Leading term cannot be zero")

        d = b*b - 4*a*c # discriminant

        if (d < 0)
            rr = c()                       # no real roots
        else if (d == 0)
            rr = c( -b/(2*a) )             # one repeated root
        else
            rr = c( (-b - sqrt(d))/(2*a),  # two distinct roots
                    (-b + sqrt(d))/(2*a)  )

        return(rr)
    }


To test this code with testthat we create another file for tests. The name of the file should begin with test so that testthat can recognize it as a file of test code. So let’s name the file containing the code above real_roots.R and the file containing its tests test_real_roots.R.

The test file needs to read in the file being tested, for example with source("real_roots.R").


Now let’s write some tests for the case of a quadratic with two real roots.

    test_that("Distinct roots", {

        roots <- real.roots(1, 7, 12)

        expect_that( roots, is_a("numeric") )
        expect_that( length(roots), equals(2) )
        expect_that( roots[1] < roots[2], is_true() )
    })

This tests that we get back two numbers and that they are sorted in increasing order.

Next we find the roots of (x + 3000)² = x² + 6000x + 9000000. We’ll test whether we get back -3000 as the only root. In general you can’t expect to get an exact answer, though in this case we do since the root is an integer. The same example also shows how to test for equality within a given tolerance.

    test_that("Repeated root", {

        roots <- real.roots(1, 6000, 9000000)

        expect_that( length(roots), equals(1) )

        expect_that( roots, equals(-3000) )

        # Without 'scale', all.equal compares RELATIVE error for values
        # that are not close to zero, so this tests relative error within 0.1.
        expect_that( roots, equals(-3001, tolerance  = 0.1) )

        # To test ABSOLUTE error, set 'scale' to 1.
        # See base R function all.equal for optional argument documentation.
        expect_equal( roots, -3000.01, tolerance  = 0.1, scale = 1)
    })

To show how to test code that should raise an error, we’ll find the roots of 2x + 3, which isn’t a quadratic. Notice that you can test whether any error is raised or you can test whether the error message matches a given regular expression.

    test_that("Polynomial must be quadratic", {

        # Test for ANY error
        expect_that( real.roots(0, 2, 3), throws_error() )

        # Test specifically for an error string containing "zero"
        expect_that( real.roots(0, 2, 3), throws_error("zero") )

        # Test specifically for an error string containing "zero" or "Zero" using regular expression
        expect_that( real.roots(0, 2, 3), throws_error("[zZ]ero") )
    })

Finally, here are a couple tests that shouldn’t pass.

    test_that("Bogus tests", {

        x <- c(1, 2, 3)

        expect_that( length(x), equals(2.7) )
        expect_that( x, is_a("data.frame") )
    })

To run the tests, you can run test_dir or test_file. If you are at the R command line and your working directory is the directory containing the two files above, you could run the tests with test_dir("."). In this case we have only one file of test code, but if we had more test files test_dir would find them, provided the file names begin with test.

* * *

Related: Help integrating R into your environment


Dogfooding refers to companies using their own software. According to Wikipedia,

In 1988, Microsoft manager Paul Maritz sent Brian Valentine, test manager for Microsoft LAN Manager, an email titled “Eating our own Dogfood”, challenging him to increase internal usage of the company’s product. From there, the usage of the term spread through the company.

Dogfooding is a great idea, but it’s no substitute for usability testing. I get the impression that some products, if they’re tested at all, are tested by developers intimately familiar with how they’re intended to be used.

If your company is developing consumer software, it’s not dogfooding if just the developers use it. It’s dogfooding when people in sales and accounting use it. But that’s still no substitute for getting people outside the company to use it.

Dogfooding doesn’t just apply to software development. Whenever I buy something with inscrutable assembly instructions, I wonder why the manufacturer didn’t pay a couple people off the street to put the thing together on camera.

Defensible software

It’s not enough for software to be correct. It has to be defensible.

I’m not thinking of defending against malicious hackers. I’m thinking about defending against sincere critics. I can’t count how many times someone was absolutely convinced that software I had a hand in was wrong when in fact it was performing as designed.

In order to defend software, you have to understand what it does. Not just one little piece of it, but the whole system. You need to understand it better than the people who commissioned it: the presumed errors may stem from unforeseen consequences of the specification.

Related post: The buck stops with the programmer

Hunt down bad error messages

My printer is unable to clean 51. It won’t work, and all it says is “Unable to Clean 51.”

Here’s my suggestion for finding such useless error messages in a code review: Write a script to extract all string literals from your source code, then read over the output. The beauty of this approach is that the reviewer sees the text without the context of the surrounding code, just as the user does. Ideally someone who is not a programmer would review the output strings. I would hope someone who ran across “Unable to Clean 51” would flag that as something that doesn’t make sense.

It’s not hard to write a script to pull out string literals. If you’d like, you could use the script that accompanies my Code Project article “PowerShell Script for Reviewing Text Shown to Users.” That script tries to filter out strings that are not user output, such as file paths. It’s a pretty crude script, attempting to handle source code written in several languages. It will have a few false positives, and a few false negatives, but it works quite well for a short script. (Code Project’s “browse source” tab doesn’t work for PowerShell source. You’ll have to download the code to read it.)
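
To give a sense of how little code the idea needs, here is a minimal Python sketch. It is not the PowerShell script mentioned above, and its names and its filter are assumptions made for illustration. It uses a regular expression to pull out double-quoted literals, then keeps only those that contain a letter and a space, on the guess that identifiers and file names rarely look like that.

```python
import re

# A double-quoted literal on one line, allowing backslash-escaped characters inside.
STRING_LITERAL = re.compile(r'"((?:[^"\\\n]|\\.)*)"')

def looks_like_user_text(text):
    """Crude guess: text shown to users has at least one letter and one inner space."""
    return " " in text.strip() and any(ch.isalpha() for ch in text)

def extract_user_strings(source):
    """Return the string literals in source that look like messages for users."""
    literals = (match.group(1) for match in STRING_LITERAL.finditer(source))
    return [text for text in literals if looks_like_user_text(text)]

sample = '''
show_error("Unable to Clean 51");
path = "config.ini";
log("Printer is offline. Check the cable.");
'''

if __name__ == "__main__":
    for message in extract_user_strings(sample):
        print(message)
```

Like any regular expression approach, this one is guesswork: it will miss strings built at run time and will pick up some literals that users never see.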

Whenever I recommend this script, I run into a few objections that I’ll address below.

Q: Wouldn’t it be better if programmers put all user output text in string tables rather than putting user output text in the main body of their code?

A: Yes, but that takes more effort, and it’s not how most programmers work. And even if you have a policy to put all output strings in a resource table, you’d need something like this script to enforce the policy.

Q: Wouldn’t it be better to have a real parser extract the strings rather than doing regular expression guesswork?

A: Sure.

Q: Wouldn’t it be better to use a spell checker that’s built into your IDE?

A: No. The purpose of the review is not just to check spelling. You also want to catch grammar errors and unhelpful messages.

Q: Wouldn’t it be better to do a complete code review?

A: Yes and no. Complete code reviews are great for overall code quality. But they take a lot of effort and don’t happen often. A review of just the extracted strings takes far, far less time. Also, viewing the output strings out of context is better for catching unhelpful or ungrammatical messages.

Q: Isn’t this simplistic? Doesn’t every company do something like this?

A: If they do, why do I continually find spelling errors, grammatical errors, and unhelpful messages in the software I use?

How things break

Venkatesh Rao wrote a blog post today, “Stress Failures versus Decay Failures.” It reminded me of three other resources I recommend on how things break. The first is about how things literally break. For example, why the steel in the Titanic was brittle.

The other two are about how complex systems break.