# ASQ/ANSI Z1.4 sampling procedures

I mentioned the other day that the US military standard MIL-STD-105 for statistical sampling procedures lives on in the ASQ/ANSI standard Z1.4. The Department of Defense cancelled their own standard in 1995 in favor of adopting civilian standards, in particular ASQ/ANSI Z1.4.

There are two main differences between the military standard and its replacement. First, the military standard is free and the ANSI standard costs $199. Second, the ANSI standard has better typography. Otherwise the two standards are remarkably similar.

For example, the screenshot from MIL-STD-105 that I posted the other day appears verbatim in ASQ/ANSI Z1.4, except with better typesetting. The table even has the same name: “Table II-B Single sampling plans for tightened inspection (Master table).” Since the former is public domain and the latter is copyrighted, I’ll repeat my screenshot of the former.

Everything I said about the substance of the military standard applies to the ANSI standard. The two give objective, checklist-friendly statistical sampling plans and acceptance criteria.

The biggest strength and biggest weakness of these plans is the lack of nuance. One could create more sophisticated statistical designs that are more powerful, but then you lose the simplicity of a more regimented approach. A company could choose to go down both paths, using more informative statistical models for internal use and reporting results from the tests dictated by standards. For example, a company could create a statistical model that takes more information into account and use it to assess the predictive probability of passing the tests required by sampling standards.

## Related links

# Using mean range method to measure variability

The most common way to measure variability, at least for data coming from a normal distribution, is standard deviation. Another less common approach is to use mean range. Standard deviation is mathematically simple but operationally a little complicated. Mean range, on the other hand, is complicated to analyze mathematically but operationally very simple.
## ASQ/ANSI Z1.9

The ASQ/ANSI Z1.9 standard, Sampling Procedures and Tables for Inspection by Variables for Percent Nonconforming, gives several options for measuring variability, and one of these is the mean range method. Specifically, several samples of five items each are drawn, and the average of the ranges is the variability estimate. The ANSI Z1.9 standard grew out of, and is still very similar to, the US military standard MIL-STD-414 from 1957. The ANSI standard, last updated in 2018, is not that different from the military standard from six decades earlier.

The mean range is obviously simple to carry out: take five samples, subtract the smallest value from the largest, and write that down. Repeat this a few times and average the numbers you wrote down. No squares, no square roots, easy to carry out manually. This was obviously a benefit in 1957, but not as much now that computers are ubiquitous. The more important advantage today is that the mean range can be more robust for heavy-tailed data. More on that here.

## Probability distribution

The distribution of the range of a sample is not simple to write down, even when the samples come from a normal distribution. There are nice asymptotic formulas for the range as the number of samples goes to infinity, but five is a bit far from infinity [1]. This is a problem that was thoroughly studied decades ago. The random variable obtained by averaging k ranges from samples of n elements each is denoted W̄ with subscripts n and k, or sometimes simply W̄ without the subscripts.

## Approximate distribution

There are several useful approximations for the distribution of this statistic. Pearson [2] evaluated several proposed approximations and found the best one for n < 10 (as it is in our case, with n = 5) to be a scaled chi distribution,

W̄ ≈ cσ χ_ν / √ν,

where χ_ν is a chi-distributed random variable with ν degrees of freedom. Here σ is the standard deviation of the population being sampled, and the values of c and ν vary with n and k. For example, when n = 5 and k = 4, the value of ν is 14.7 and the value of c is 2.37.
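As a quick sanity check on those constants (my own illustration, not part of either standard), here is a short Python simulation comparing the empirical mean of the mean range, for n = 5 and k = 4, with the mean of the approximating scaled chi distribution:

```python
import math
import random

random.seed(42)

n, k = 5, 4         # sample size and number of samples, as in the text
c, nu = 2.37, 14.7  # Pearson's constants for n = 5, k = 4, as in the text

def mean_range(n, k):
    """Average of k ranges, each from a sample of n standard normals."""
    ranges = []
    for _ in range(k):
        xs = [random.gauss(0, 1) for _ in range(n)]
        ranges.append(max(xs) - min(xs))
    return sum(ranges) / k

reps = 10000
emp = sum(mean_range(n, k) for _ in range(reps)) / reps

# Mean of c*sigma*chi_nu/sqrt(nu) with sigma = 1, using
# E[chi_nu] = sqrt(2) * Gamma((nu+1)/2) / Gamma(nu/2).
pred = c * math.sqrt(2 / nu) * math.exp(math.lgamma((nu + 1) / 2)
                                        - math.lgamma(nu / 2))

print(emp, pred)  # both come out near 2.33
```

The two numbers agree to within simulation error, which is some evidence that the approximation and its constants are in the right ballpark.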
The value of c doesn’t change much as k gets larger, though the value of ν does [3].

Note that the approximating distribution is chi, not the more familiar chi square. (I won’t forget a bug I had to chase down years ago that was the result of seeing χ in a paper and reading it as χ². Double, triple, quadruple check, everything looks right. Oh wait …)

For particular values of n and k you could use the approximate distribution to find how to scale mean range to a comparable standard deviation, and to assess the relative efficiency of the two methods of measuring variation. The ANSI/ASQ Z1.9 standard gives tables for acceptance based on mean range, but doesn’t go into statistical details.

## Related posts

[1] Of course every finite number is far from infinity. But the behavior at some numbers is quite close to the behavior at infinity. Asymptotic estimates are not just an academic exercise. They can give very useful approximations for finite inputs—that’s why people study them—sometimes even for inputs as small as five.

[2] E. S. Pearson (1952). Comparison of two approximations to the distribution of the range in small samples from normal populations. Biometrika 39, 130–136.

[3] H. A. David (1970). Order Statistics. John Wiley and Sons.

# Military Standard 105

Military Standard 105 (MIL-STD-105) is the granddaddy of sampling acceptance standards. The standard grew out of work done at Bell Labs in the 1930s and was first published during WWII. There were five updates to the standard, the last edition being MIL-STD-105E, published in 1989. In 1995 the standard was officially cancelled when the military decided to go with civilian quality standards moving forward. Military Standard 105 lives on through its progeny, such as ANSI/ASQ Z1.4, ASTM E2234, and ISO 2859-1.
There’s been an interesting interaction between civilian and military standards: civilian organizations adopted military standards, then the military adopted civilian standards (which evolved from military standards).

From a statistical perspective, it seems a little odd that sampling procedures are given without as much context as an experiment designed from scratch might have. But obviously a large organization, certainly an army, must have standardized procedures. A procurement department cannot review thousands of boutique experiment designs the way a cancer center reviews sui generis clinical trial designs. They have to have objective procedures that do not require a statistician to evaluate. Of course manufacturers need objective standards too. The benefits of standardization outweigh the potential benefits of customization: economic efficiency trumps the increase in statistical efficiency that might be obtained from a custom sampling approach.

Although the guidelines in MIL-STD-105 are objective, they’re also flexible. For example, instead of dictating a single set of testing procedures, the standard gives normal, tightened, and reduced procedures. The level of testing can go up or down based on experience. During normal inspection, if two out of five consecutive lots have been rejected, then testing switches to tightened procedures. Then after five consecutive lots have been accepted, normal testing can resume. And if certain conditions are met under normal procedures, the manufacturer can relax to reduced procedures [1]. The procedures are adaptive, but there are explicit rules for doing the adapting.

This is very similar to my experience with adaptive clinical trial designs. Researchers often think that “adaptive” means flying by the seat of your pants, making subjective decisions as a trial progresses. But an adaptive statistical design is still a statistical design.
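To show how mechanical these switching rules are, here is a small Python sketch of the normal/tightened logic described above. This is my own simplified illustration, not the standard's full rules: reduced inspection and discontinuation conditions are omitted.

```python
from enum import Enum

class Level(Enum):
    NORMAL = "normal"
    TIGHTENED = "tightened"

class Inspection:
    """Simplified sketch of MIL-STD-105 normal/tightened switching."""

    def __init__(self):
        self.level = Level.NORMAL
        self.recent = []          # outcomes of the last few lots under normal
        self.accepted_streak = 0  # consecutive acceptances under tightened

    def record(self, accepted):
        if self.level is Level.NORMAL:
            # Switch to tightened if 2 of the last 5 lots were rejected.
            self.recent = (self.recent + [accepted])[-5:]
            if self.recent.count(False) >= 2:
                self.level = Level.TIGHTENED
                self.accepted_streak = 0
        else:
            # Return to normal after 5 consecutive accepted lots.
            self.accepted_streak = self.accepted_streak + 1 if accepted else 0
            if self.accepted_streak >= 5:
                self.level = Level.NORMAL
                self.recent = []
```

The point of the sketch is that every transition is triggered by an explicit, objective rule; nothing requires judgment at inspection time.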
The conduct of the experiment may change over time, but only according to objective statistical criteria set out in advance.

MIL-STD-105 grew out of the work of smart people, such as Harold Dodge at Bell Labs, thinking carefully about the consequences of the procedures. Although the procedures have all the statistical lingo stripped away—do this thing this many times, rather than looking at χ² values—statistical thought went into creating these procedures.

## Related links

[1] This isn’t the most statistically powerful approach because it throws away information. It only considers whether batches met an acceptance standard; it doesn’t use the data on how many units passed or failed. The units in one batch are not interchangeable with units in another batch, but neither are they unrelated. A more sophisticated approach might use a hierarchical model that captured units within batches. But as stated earlier, you can’t have someone in procurement review hierarchical statistical analyses; you need simple rules.

# Distracted by the hard part

Last night I was helping my daughter with calculus homework. I told her that a common mistake was to forget what the original problem was after getting absorbed in sub-problems that have to be solved. I saw this over and over when I taught college.

Then a few minutes later, we both did exactly what I warned her against. She took the answer to a difficult sub-problem to be the final answer. I checked her work and confirmed that it was correct, until I saw we hadn’t actually answered the original question.

As I was waking up this morning, I realized I was about to make the same mistake on a client’s project. The goal was to write software to implement a function f which is a trivial composition of two other functions g and h. These two functions took a lot of work, including a couple levels of code generation. I felt I was done after testing g and h, but I forgot to write tests for f, the very thing I was asked to deliver.
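In code form, the trap looks something like this. The functions here are hypothetical stand-ins; the point is only the shape of the mistake.

```python
# Hypothetical toy example: g and h are the hard sub-problems,
# but f, their composition, is the actual deliverable.

def g(x):
    return x + 1    # stands in for something hard-won

def h(x):
    return 2 * x    # likewise

def f(x):
    return g(h(x))  # trivial composition, easy to leave untested

# Testing the parts is not the same as testing the whole.
assert g(1) == 2
assert h(3) == 6
assert f(3) == 7    # the test that's easy to forget
```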
This is a common pattern that goes beyond calculus homework and software development. It’s why checklists are so valuable. We resist checklists because they insult our intelligence, and yet they greatly reduce errors. Without some structure to help them keep track, experienced people in every field can skip a step, most likely a simple step.

## Related posts

# Converting between nines and sigmas

Nines and sigmas are two ways to measure quality. You’ll hear that something has four or five nines of reliability or that some failure is a five sigma event. What do these mean, and how do you convert between them?

## Definitions

If a system has five nines of availability, that means the probability of the system being up is 99.999%. Equivalently, the probability of it being down is 0.00001. In general, n nines of availability means the probability of failure is 10^-n.

If a system has s sigmas of reliability, that means the probability of failure is the same as the probability of a Gaussian random variable being s standard deviations above its mean [1].

## Conversion formulas

Let Φ be the cumulative distribution function for a standard normal, i.e. a Gaussian random variable with mean zero and standard deviation 1. Then s sigmas corresponds to n nines, where

n = -log10(Φ(-s))

and

s = -Φ^-1(10^-n).

We’ll give approximate formulas in just a second that don’t involve Φ but just use functions on a basic calculator.

Here’s a plot showing the relationship between nines and sigmas. The plot looks a lot like a quadratic, and in fact if we take the square root we get a plot that’s visually indistinguishable from a straight line. This leads to very good approximations

n ≈ (0.47 + 0.42 s)²

and

s ≈ 2.37 √n − 1.12.

The approximation is so good that it’s hard to see the difference between it and the exact value in a plot. These approximations are more than adequate since nines and sigmas are crude measures, not accurate to more than one significant figure [2].
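The conversions are easy to sketch in Python. This is a quick illustration, not production code; it uses the identity Φ(x) = erfc(-x/√2)/2 for the exact direction and the approximations above for the calculator-friendly direction.

```python
import math

def phi(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def nines_from_sigmas(s):
    """Exact conversion: n = -log10(P(Z > s))."""
    return -math.log10(phi(-s))

def sigmas_from_nines_approx(n):
    """Approximation from the text: s ≈ 2.37*sqrt(n) - 1.12."""
    return 2.37 * math.sqrt(n) - 1.12

def nines_from_sigmas_approx(s):
    """Approximation from the text: n ≈ (0.47 + 0.42*s)^2."""
    return (0.47 + 0.42 * s) ** 2

# Five nines corresponds to roughly 4.3 sigmas (one-tailed).
print(nines_from_sigmas(4.265))        # close to 5
print(sigmas_from_nines_approx(5))     # close to 4.2
```

The exact and approximate answers differ in the first decimal place, which, as noted above, is as much accuracy as these measures deserve.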
## Related posts

[1] Here I’m considering the one-tailed case, the probability of being so many standard deviations above the mean. You could consider the two-tailed version, where you look at the probability of being so many standard deviations above or below the mean. The two-tailed probability is simply twice the one-tailed probability by symmetry.

[2] As I’ve written elsewhere, I’m skeptical of the implicit normal distribution assumption, particularly for rare events. The normal distribution is often a good modeling assumption in the middle, but not so often in the tails. Going out as far as six sigmas is dubious, and so the plot above covers as much range as is practical and then some.

# Software quality: better in practice than in theory

C. A. R. Hoare wrote an article, How Did Software Get So Reliable Without Proof?, in 1996 that still sounds contemporary for the most part.

In the 1980s many believed that programs could not get much bigger unless we started using formal proof methods. The argument was that bugs are fairly common, and that each bug has the potential to bring a system down. Therefore the only way to build much larger systems was to rely on formal methods to catch bugs. And yet programs continued to get larger and formal methods never caught on. Hoare asks

Why have twenty years of pessimistic predictions been falsified?

Another twenty years later we can ask the same question. Systems have gotten far larger, and formal methods have not become common. Formal methods are used—more on that shortly—but have not become common.

## Better in practice than in theory

It’s interesting that Hoare was the one to write this paper. He is best known for quicksort, a sorting algorithm that works better in practice than in theory!
Quicksort is commonly used in practice, even though it has terrible worst-case efficiency, because its average efficiency has optimal asymptotic order [1], and in practice it works better than other algorithms with the same asymptotic order.

## Economic considerations

It is logically possible that the smallest bug could bring down a system. And there have been examples, such as the Mars Climate Orbiter, where a single bug did in fact lead to complete failure. But this is rare. Most bugs are inconsequential.

Some will object “How can you be so blasé about bugs? A bug crashed a $300 million probe!” But what is the realistic alternative? Would spending an additional billion dollars on formal software verification have prevented the crash? Possibly, though not certainly, and the same money could send three more missions to Mars. (More along these lines here.)

It’s all a matter of economics. Formal verification is extremely tedious and expensive. The expense is worth it in some settings and not in others. The software that runs pacemakers is more critical than the software that runs a video game. For most software development, less formal methods have proved more cost effective at achieving acceptable quality: code reviews, unit tests, integration testing, etc.

## Formal verification

I have some experience with formal software verification, including formal methods software used by NASA. When someone says that software has been formally verified, there’s an implicit disclaimer. It’s usually that the algorithms have been formally verified, not the implementation of those algorithms in software. Also, maybe not all the algorithms have been verified, but say 90%, the remaining 10% being too difficult to verify. In any case, formally verified software can and has failed. Formal verification greatly reduces the probability of encountering a bug, but it does not reduce the probability to zero.

There has been a small resurgence of interest in formal methods since Hoare wrote his paper. And again, it’s all about economics. Theorem proving technology has improved over the last 20 years. And software is being used in contexts where the consequences of failure are high. But for most software, the most economical way to achieve acceptable quality is not through theorem proving.

There are also degrees of formality. Full theorem proving is extraordinarily tedious. If I remember correctly, one research group said that they could formally verify about one page of a mathematics textbook per man-week. But there’s a continuum between full formality and no formality. For example, you could have formal assurance that your software satisfies certain conditions, even if you can’t formally prove that the software is completely correct. Where you want to be along this continuum of formality is again a matter of economics. It depends on the probability and consequences of errors, and the cost of reducing these probabilities.

## Related posts

[1] The worst-case performance of quicksort is O(n²), but the average performance is O(n log n).
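To make the footnote concrete, here is a small Python experiment of my own: a naive quicksort with a first-element pivot, instrumented to count comparisons. Sorted input triggers the quadratic worst case, while shuffled input shows typical n log n behavior.

```python
import random

def quicksort(xs):
    """Naive quicksort with a first-element pivot.
    Returns the sorted list and the number of comparisons made."""
    if len(xs) <= 1:
        return list(xs), 0
    pivot, rest = xs[0], xs[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    ls, lc = quicksort(left)
    rs, rc = quicksort(right)
    # Each element of rest is compared against the pivot twice above.
    return ls + [pivot] + rs, lc + rc + 2 * len(rest)

random.seed(0)
n = 100
already_sorted = list(range(n))
shuffled = random.sample(range(n), n)

_, worst = quicksort(already_sorted)  # pivot is always the minimum: O(n^2)
_, typical = quicksort(shuffled)      # random order: O(n log n) on average

print(worst, typical)  # worst is n(n-1) = 9900; typical is far smaller
```

Production implementations avoid the bad case by choosing pivots more carefully, but the asymptotic worst case remains quadratic; it is just vanishingly unlikely in practice.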

Photograph of C. A. R. Hoare by Rama, Wikimedia Commons, Cc-by-sa-2.0-fr, CC BY-SA 2.0 fr

# Ultra-reliable software

From a NASA page advocating formal methods:

We are very good at building complex software systems that work 95% of the time. But we do not know how to build complex software systems that are ultra-reliably safe (i.e. P_f < 10^-7/hour).

Developing medium-reliability and high-reliability software are almost entirely different professions. Using typical software development procedures on systems that must be ultra-reliable would invite disaster. But using extremely cautious development methods on systems that can afford to fail relatively often would be an economic disaster.

# To err is human, to catch an error shows expertise

Experts are OK at avoiding mistakes, but they’re even better at recognizing and fixing mistakes.

If you ask an elementary student something like “How can we know that the answer to this problem cannot be 8769?” they might only be able to say “Because the correct answer is 8760.” That is, the only tool they have for checking a result is to compare it to a known quantity. A more sophisticated student might be able to say something like “We know the result must be even” or “The result cannot be more than 8764” for reasons that come from context.

Experienced programmers write fewer bugs than rookies. But more importantly, the bugs they do write don’t live as long. The experienced programmer may realize almost immediately that a piece of code cannot be right, while the rookie may not realize there’s a problem until the output is obviously wrong. The more experienced programmer might notice that the vertical alignment of the code looks wrong, or that incompatible pieces are being used together, or even that the code “smells wrong” for reasons he or she can’t articulate.

An engineer might know that an answer is wrong because it has the wrong magnitude or the wrong dimensions. Or maybe a result violates a conservation law. Or maybe the engineer thinks “That’s not the way things are done, and I bet there’s a good reason why.”

“Be more careful” only goes so far. Experts are not that much more careful than novices. We can only lower our rate of mistakes so much. There’s much more potential for being able to recognize mistakes than to prevent them.

A major step in maturing as a programmer is accepting the fact that you’re going to make mistakes fairly often. Maybe you won’t introduce a bug for every 10 lines of code, but you will write at least one for every 100 lines. (Rookies write more bugs than this but think they write fewer.) Once you accept this, you begin to ask how you can write code to make bugs stand out. Given two approaches, which is more likely to fail visibly if it’s wrong? How can I write code so that logic errors are more likely to show up as compile errors? How am I going to debug this when it breaks?
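As a hypothetical illustration of failing visibly, consider two ways to look up a handler from a config string. All the names here are invented for the example.

```python
# Hypothetical map from a format name to a handler function.
handlers = {"json": str, "csv": repr}

def get_handler_quiet(fmt):
    # A typo like "jsn" silently returns None; the failure surfaces
    # later and far away, when None is finally called.
    return handlers.get(fmt)

def get_handler_loud(fmt):
    # The same typo raises KeyError here, at the point of the mistake.
    return handlers[fmt]
```

The second version is no more careful than the first, but when it is wrong it is loudly wrong, which is the property you want.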

Theory pays off in the long run. Abstractions that students dismiss as impractical probably are impractical at first. But in the long run, these abstractions may prevent or catch errors. I’ve come to see the practicality in many things that I used to dismiss as pedantic: dimensional analysis, tensor properties, strongly typed programming languages, category theory, etc. A few weeks into a chemistry class I learned the value of dimensional analysis. It has taken me much longer to appreciate category theory.

# Example of unit testing R code with testthat

Here’s a little example of using Hadley Wickham’s testthat package for unit testing R code.

The function below computes the real roots of a quadratic. All that really matters for our purposes is that the function can return 0, 1, or 2 numbers and it could raise an error.

    real.roots <- function(a, b, c)
    {
        if (a == 0.)
            stop("Leading term cannot be zero")

        d = b*b - 4*a*c # discriminant

        if (d < 0)
            rr = c()
        else if (d == 0)
            rr = c( -b/(2*a) )
        else
            rr = c( (-b - sqrt(d))/(2*a),
                    (-b + sqrt(d))/(2*a) )

        return(rr)
    }

To test this code with testthat we create another file for tests. The name of the file should begin with test so that testthat can recognize it as a file of test code. So let’s name the file containing the code above real_roots.R and the file containing its tests test_real_roots.R.

The test file needs to read in the file being tested.

    source("real_roots.R")

Now let’s write some tests for the case of a quadratic with two real roots.

    test_that("Distinct roots", {
        roots <- real.roots(1, 7, 12)

        expect_that( roots, is_a("numeric") )
        expect_that( length(roots), equals(2) )
        expect_that( roots[1] < roots[2], is_true() )
    })

This tests that we get back two numbers and that they are sorted in increasing order.

Next we find the roots of (x + 3000)² = x² + 6000x + 9000000. We’ll test whether we get back -3000 as the only root. In general you can’t expect to get an exact answer, though in this case we do since the root is an integer. But we’ll show in the next example how to test for equality with a given tolerance.

    test_that("Repeated root", {
        roots <- real.roots(1, 6000, 9000000)

        expect_that( length(roots), equals(1) )
        expect_that( roots, equals(-3000) )

        # Test whether ABSOLUTE error is within 0.1
        expect_that( roots, equals(-3000.01, tolerance = 0.1) )

        # Test whether RELATIVE error is within 0.1
        # To test relative error, set 'scale' equal to the expected value.
        # See the base R function all.equal for documentation of the optional arguments.
        expect_equal( roots, -3001, tolerance = 0.1, scale = -3001 )
    })

To show how to test code that should raise an error, we’ll find the roots of 2x + 3, which isn’t a quadratic. Notice that you can test whether any error is raised or you can test whether the error message matches a given regular expression.

    test_that("Polynomial must be quadratic", {
        # Test for ANY error
        expect_that( real.roots(0, 2, 3), throws_error() )

        # Test specifically for an error string containing "zero"
        expect_that( real.roots(0, 2, 3), throws_error("zero") )

        # Test for an error string containing "zero" or "Zero" using a regular expression
        expect_that( real.roots(0, 2, 3), throws_error("[zZ]ero") )
    })

Finally, here are a couple tests that shouldn’t pass.

    test_that("Bogus tests", {
        x <- c(1, 2, 3)

        expect_that( length(x), equals(2.7) )
        expect_that( x, is_a("data.frame") )
    })

To run the tests, you can run test_dir or test_file. If you are at the R command line and your working directory is the directory containing the two files above, you could run the tests with test_dir("."). In this case we have only one file of test code, but if we had more test files test_dir would find them, provided the file names begin with test.

* * *

# Dogfooding

Dogfooding refers to companies using their own software. According to Wikipedia,

In 1988, Microsoft manager Paul Maritz sent Brian Valentine, test manager for Microsoft LAN Manager, an email titled “Eating our own Dogfood”, challenging him to increase internal usage of the company’s product. From there, the usage of the term spread through the company.

Dogfooding is a great idea, but it’s no substitute for usability testing. I get the impression that some products, if they’re tested at all, are tested by developers intimately familiar with how they’re intended to be used.

If your company is developing consumer software, it’s not dogfooding if just the developers use it. It’s dogfooding when people in sales and accounting use it. But that’s still no substitute for getting people outside the company to use it.

Dogfooding doesn’t just apply to software development. Whenever I buy something with inscrutable assembly instructions, I wonder why the manufacturer didn’t pay a couple people off the street to put the thing together on camera.