Military Standard 105

Military Standard 105 (MIL-STD-105) is the grand daddy of sampling acceptance standards. The standard grew out of work done at Bell Labs in the 1930s and was first published during WWII. There were five updates to the standard, the last edition being MIL-STD-105E, published in 1989.

In 1995 the standard was officially cancelled when the military decided to go with civilian quality standards moving forward. Military Standard 105 lives on through its progeny, such as ANSI/ASQ Z1.4, ASTM E2234, and ISO 2859-1. There’s been an interesting interaction between civilian and military standards: Civilian organizations adopted military standards, then the military adopted civilian standards (which evolved from military standards).

From a statistical perspective, it seems a little odd that sampling procedures are given without as much context as an experiment designed from scratch might have. But obviously a large organization, certainly an army, must have standardized procedures. A procurement department cannot review thousands of boutique experiment designs the way a cancer center reviews sui generis clinical trial designs. They have to have objective procedures that do not require a statistician to evaluate.

MIL-STD-105E Table II-B

Of course manufacturers need objective standards too. The benefits of standardization outweigh the potential benefits of customization: economic efficiency trumps the increase in statistical efficiency that might be obtained from a custom sampling approach.

Although the guidelines in MIL-STD-105 are objective, they’re also flexible. For example, instead of dictating a single set of testing procedures, the standard gives normal, tightened, and reduced procedures. The level of testing can go up or down based on experience. During normal inspection, if two out of five consecutive lots have been rejected, then testing switches to tightened procedures. Then after five consecutive lots have been accepted, normal testing can resume. And if certain conditions are met under normal procedures, the manufacturer can relax to reduced procedures [1]. The procedures are adaptive, but there are explicit rules for doing the adapting.

This is very similar to my experience with adaptive clinical trial designs. Researchers often think that “adaptive” means flying by the seat of your pants, making subjective decisions as a trial progresses. But an adaptive statistical design is still a statistical design. The conduct of the experiment may change over time, but only according to objective statistical criteria set out in advance.

MIL-STD-105 grew out of the work of smart people, such as Harold Dodge at Bell Labs, thinking carefully about the consequences of the procedures. Although the procedures have all the statistical lingo stripped away—do this thing this many times, rather than looking at χ² values—statistical thought went into creating these procedures.

MIL-STD-105E Table X Q

Related links

[1] This isn’t the most statistically powerful approach because it throws away information. It only considers whether batches met an acceptance standard; it doesn’t use the data on how many units passed or failed. The units in one batch are not interchangeable with units in another batch, but neither are they unrelated. A more sophisticated approach might use a hierarchical model that captured units within batches. But as stated earlier, you can’t have someone in procurement review hierarchical statistical analyses; you need simple rules.

Using Python as a statistical calculator

This post is for someone unfamiliar with SciPy who wants to use Python to do basic statistical calculations. More specifically, this post will look at working with common families of random variables—normal, exponential, gamma, etc.—and how to calculate probabilities with these.

All the statistical functions you need will be in the stats subpackage of SciPy. You could import everything with

    >>> from scipy.stats import *

This will make software developers cringe because it’s good software development practice to only import what you need [1]. But we’re not writing software here; we’re using Python as a calculator.

Distributions

Every distribution in SciPy has a location parameter loc which defaults to 0, and scale parameter scale that defaults to 1.

The table below lists additional parameters that some common distributions have.

Distribution SciPy name    Parameters
normal norm
exponential expon
beta beta shape1, shape2
binomial binom size, prob
chi-squared chi2 df
F f df1, df2
gamma gamma shape
hypergeometric hypergeom M, n, N
Poisson poisson lambda
Student t t df

When using any statistical software its good to verify that the software is using the parameterization that you expect. For example, the exponential distribution can be parameterized by rate or by scale. SciPy does everything by scale. One way to test the parameterization is to calculate the mean. For example, if the mean of an exp(100) random variable is 100, you’re software is using the scale paraemterization. If the mean is 1/100, it;s using the rate.

Functions

For a random variable X from one of the families above, we’ll show how to compute Prob(Xx), Prob(Xx), and how to solve for x given one of these probabilities. And for discrete distributions, we’ll show how to find Prob(X = p).

CDF: Cumulative distribution function

The probability Prob(Xx) is the CDF of X at x. For example, the probability that a standard normal random variable is less than 2 is

    norm.cdf(2)

We didn’t have to specify the location or scale because the standard normal uses default parameters. If X had mean 3 and standard deviation 10 we’d call

    norm.cdf(2, loc=3, scale=10)

For another example, suppose we’re working with a gamma distribution with shape 3 and scale 5 and want to know the probability of such a random variable taking on a value less than 2. Then we’d call

    gamma.cdf(2, 3, scale=5)

The second argument, 2, is the shape parameter.

CCDF: Complementary cumulative distribution function

The probability Prob(Xx) is the complementary CDF of X at x, which is 1 minus the CDF at x. SciPy uses the name sf for the CCDF. Here “sf” stands for survival function.

Why have both a CDF and CCDF function? One reason is convenience, but a more important reason is numerical accuracy. If a probability p is very near 1, then 1 – p will be partially or completely inaccurate.

For example, let X be a standard normal random variable. The probability that X is greater than 9 is

    norm.sf(9)

which is on the order of 10-19. But if we compute

    1 - norm.cdf(9)

we get exactly 0. Floating point numbers have 16 figures of precision, and we need 19 to represent the difference between our probability and 1. More on why mathematical libraries have functions that seem unnecessary at first here.

Inverse CDF

SciPy calls the inverse CDF function ppf for percentile point function.

For example, suppose X is a gamma random variable with shape 3 and scale 5, and we want to solve for the value of x such that Prob(Xx) is 0.7. We could do this with

    gamma.ppf(0.7, 3, scale=5)

Inverse CCDF

SciPy calls the inverse CCDF function isf for inverse survival function. So, for example,

    gamma.isf(0.3, 3, scale=5)

should return the same value as the example above because the point where the right tail has probability 0.3 is the same as the point where the left tail has probability 0.7.

Probability mass function

For a discrete random variable X we can compute the probability mass function, the probability that X takes on a value x. Unsurprisingly SciPy calls this function pmf.

For example, the probability of seeing 6 heads out of 10 flips of a fair coin is computed by

    binom.pmf(6, 10, 0.5)

Getting help

To find out more about a probability distribution, call help with that distribution as an argument. For example, you could call

    help(hypergeom)

to read more about the hypergeometric distribution.

For much more information, including topics not addressed here, see the documentation for scipy.stats.

***

[1] The reason to not import more than you need is to reduce namespace confusion. If you see a function foo, is that the foo function from package A or from package B. The most common way this comes up in statistics is confusion over the gamma function and the gamma distribution. In SciPy, the gamma function is in scipy.special while the gamma distribution is in scipy.stats. If you’ve imported everything from scipy.stats then gamma means scipy.stats.gamma. You could bring the gamma function into your environment by importing it with another name. For example

    from scipy.special import gamma as gammaf

would import the gamma function giving it the name gammaf.

Saving money on big queries

server room

I was surprised the first time a client told me that a query would cost them $100,000 to run. If you think about querying a database on your laptop, a long query would take a minute, and what’s the cost of a minute’s worth of electricity? Too small to meter.

But some of my clients have a LOT of data. (Some also come to me with spreadsheets of data.) If you have billions of records hosted in a cloud database like Snowflake, the cost of a query isn’t negligible, especially if it involves complex joins.

There are three ways to reduce the cost of expensive queries.

First of all, it may be possible to solve a different way the problem you’re trying to solve with the expensive query. Maybe the query is the most direct approach but there’s a more economical indirect approach.

Second, it may be possible to rewrite the query into a logically equivalent query that runs faster.

Third, it may be possible to save a tremendous amount of money by tolerating a small probability of error. For example, you may be able to use reservoir sampling rather than an exhaustive query.

This last approach is often misunderstood. Why would you tolerate any chance of error when you could have the exact answer? One reason is that the exact answer might cost 1000x as much to obtain. More importantly, the “exact” result may be the exact answer to a query that is trying estimate something else. The probability of error induced by random sampling may be small relative to the probability of error intrinsic in the problem being solved.

If you’d like me to take a look at how I could reduce your query costs, let’s talk.

Related posts

Efficiently testing a black box

Suppose you have a black box that takes three bits as input and produces one bit as output. You could think of the input bits as positions of toggle switches, and the output bit as a light attached to the box that is either on or off.

Full factorial design

Now suppose that only one combination of 3 bits produces a successful output. There’s one way to set the switches that makes the light turn on. You can find the successful input by brute force if you test all 2³ = 8 possible inputs. In statistical lingo, you are conducting an experiment with a factorial design, i.e. you test all combinations of inputs.

See the Python code below for a text version of this design

In the chart above, each row is an experimental run and each column is a bit. I used – and + rather than 0 and 1 because it is conventional in this topic to use a minus sign to indicate that a factor is not present and a plus sign to indicate that it is present.

Fractional factorial design

Now suppose your black box takes 4 bits as inputs, but only 3 of them matter. One of the bits does nothing, but you don’t know which bit that is. You could use a factorial design again, testing all 24 = 16 possible inputs. But there’s a more clever approach that requires only 8 guesses. In statistical jargon, this is a fractional factorial design.

See the Python code below for a text version of this design

No matter which three bits the output depends on, all combinations of those three bits will be tested. Said another way, if you delete any one of the four columns, the remaining columns contain all combinations of those bits.

Replications

Now suppose your black box takes 8 bits. Again only 3 bits matter, but you don’t know which 3. How many runs do you need to do to be certain of finding the successful combination of 3 bits? It’s clear that you need at least 8 runs: if you know that the first three bits are the important ones, for example, you still need 8 runs. And it’s also clear that you could go back to brute force and try all 28 = 256 possible inputs, but the example above raises your hopes that you could get by with less than 256 runs. Could you still get by with 8? That’s too much to hope for, but you could use only 16 runs.

--------, +----+++, -+--+-++, ++--++--, --+-+++-, +-+-+--+, -++--+-+, +++---+-, ---+++-+, +--++-+-, -+-+-++-, ++-+---+, --++--++, +-++-+--, -++++---, ++++++++

Note that since this design works for 8 factors, it also works for fewer factors. If you had 5, 6, or 7 factors, you could use the first 5, 6, or 7 columns of the design above.

This design has some redundancy: every combination of three particular bits is tested twice. This is unfortunate in our setting because we are assuming the black box is deterministic: the right combination of switches will always turn the light on. But what if the right combination of switches probably turns on the light? Then redundancy is a good thing. If there’s an 80% chance that the right combination will work, then there’s a 96% chance that at least one of the two tests of the right combination will work.

Fractional factorial experiment designs are usually used in with the assumption that there are random effects, and so redundancy is a good thing.

You want to test each main effect, i.e. each single bit, and interaction effects, i.e. combinations of bits, such as pairs of bits or triples of bits. But you assume that not all possible interactions are important; otherwise you’d need a full factorial design. You typically hit diminishing return with interactions quickly: pairs of effects are often important, combinations of three effects are less common, and rarely would an experiment consider forth order interactions.

If only main effects and pairs of main effects matter, and you have a moderately large number of factors n, a fractional factorial design can let you use a lot less than 2n runs while giving you as many replications of main and interaction effects as you want.

Verification

The following Python code verifies that the designs above have the claimed properties.

    import numpy as np
    from itertools import combinations
    
    def verify(matrix, k):
        "verify that ever choice of k columns has 2^k unique rows"
        nrows, ncols = matrix.shape
        for (a, b, c) in combinations(range(ncols), k):
            s = set()
            for r in range(nrows):
                s.add((matrix[r,a], matrix[r,b], matrix[r,c]))
            if len(s) != 2**k:
                print("problem with columns", a, b, c)
                print("number of unique elements: ", len(s))
                print("should be", 2**k)
                return
        print("pass")
    
    m = [
        [-1, -1, -1, -1],
        [-1, -1, +1, +1],
        [-1, +1, -1, +1],
        [-1, +1, +1, -1],
        [+1, -1, -1, +1],
        [+1, -1, +1, -1],
        [+1, +1, -1, -1],
        [+1, +1, +1, +1]
    ]
    
    verify(np.matrix(m), 3)
    
    m = [
        [-1, -1, -1, -1, -1, -1, -1, -1],
        [+1, -1, -1, -1, -1, +1, +1, +1],
        [-1, +1, -1, -1, +1, -1, +1, +1],
        [+1, +1, -1, -1, +1, +1, -1, -1],
        [-1, -1, +1, -1, +1, +1, +1, -1],
        [+1, -1, +1, -1, +1, -1, -1, +1],
        [-1, +1, +1, -1, -1, +1, -1, +1],
        [+1, +1, +1, -1, -1, -1, +1, -1],
        [-1, -1, -1, +1, +1, +1, -1, +1],
        [+1, -1, -1, +1, +1, -1, +1, -1],
        [-1, +1, -1, +1, -1, +1, +1, -1],
        [+1, +1, -1, +1, -1, -1, -1, +1],
        [-1, -1, +1, +1, -1, -1, +1, +1],
        [+1, -1, +1, +1, -1, +1, -1, -1],
        [-1, +1, +1, +1, +1, -1, -1, -1],
        [+1, +1, +1, +1, +1, +1, +1, +1],
    ]
    
    verify(np.matrix(m), 3)

Related services

Regex to match ICD-11 code

ICD codes are diagnostic codes created by the WHO. (Three TLAs in just the opening paragraph!)

The latest version, ICD-11, went into effect in January of this year. A few countries are using ICD-11 now; it’s expected to be at least a couple years before the US moves from ICD-10 to ICD-11. (I still see ICD-9 data even though ICD-10 came out in 1994.)

One way that ICD-11 codes differ from ICD-10 codes is that the new codes do not use the letters I or O in order to prevent possible confusion with the digits 1 and 0. In the code below, “alphabetic” and “alphanumeric” implicitly exclude the letters I and O.

Another way the codes differ is the that the second character in an ICD-10 is a digit whereas the second character in an ICD-11 code is a letter.

What follows is a heavily-commented regular expression for matching ICD-11 codes, along with a few tests to show that the regex matches things it should and does not match things it should not.

Of course you could verify an ICD-11 code by searching against an exhaustive list of such codes, but the following is much simpler and may match some false positives. However, it is future-proof against false negatives: ICD-11 codes added in the future will conform to the pattern in the regular expression.

import re

icd11_re = re.compile(r"""
    ^                  # beginning of string
    [A-HJ-NP-Z0-9]     # alphanumeric
    [A-HJ-NP-Z]        # alphabetic
    [0-9]              # digit
    [A-HJ-NP-Z0-9]     # alphanumeric
    ((\.               # optional starting with .
    [A-HJ-NP-Z0-9])    # alphanumeric
    [A-HJ-NP-Z0-9]?)?  # optional further refinement
    $                  # end of string
    """, re.VERBOSE)

good = [
    "ND52",   # fracture of arm, level unspecified
    "9D00.3", # presbyopia
    "8B60.Y", # other specified increased intercranial pressure
    "DB98.7Z" # portal hypertension, unspecified
]

bad = [
    "ABCD",    # third character must be digit
    "AB3D.",   # dot must be followed by alphanumeric
    "9D0O.3",  # letter 'O' should be number 0
    "DB9872",  # missing dot
    "AB3",     # too short
    "DB90.123" # too long
]

for g in good:
    assert(icd11_re.match(g))
for b in bad:
    assert(icd11_re.match(b) == None)

Related posts

The Very Model of a Professor Statistical

The last chapter of George Box’s book Improving Almost Anything contains the lyrics to “I Am the Very Model of a Professor Statistical,” to be sung to the tune of “I Am the Very Model of a Modern Major General” by Gilbert & Sullivan.

Here’s the original:

The original song has a few funny math-related lines.

I’m very well acquainted, too, with matters mathematical,
I understand equations, both the simple and quadratical,
About binomial theorem I’m teeming with a lot o’ news,
With many cheerful facts about the square of the hypotenuse.

I’m very good at integral and differential calculus;
I know the scientific names of beings animalculous:
In short, in matters vegetable, animal, and mineral,
I am the very model of a modern Major-General.

Here are a few lines from George Box’s version.

I relentlessly uncover any aberrant contingency
I strangle it with rigor and stifle it with stringency
I understand the different symbols be they Roman, Greek, or cuneiform
And every distribution from the Cauchy to the uniform.

With derivation rigorous each lemma I can justify
My every estimator I am careful to robustify
In short in matters logical, mathematical, idealistical
I am the very model of a professor statistical.

Gilbert & Sullivan have come up on this blog a couple other times:

George Box has come up too, but only once. (I’m surprised he hasn’t come up more; I should rectify that.) This post has a great quote from Box: “To find out what happens to a system when you interfere with it, you have to interfere with it (and not just passively observe it).”

HCPCS (“hick pics”) codes

HCPCS stands for Healthcare Common Procedure Coding System. HCPCS codes are commonly pronounced like “hick pics.” I was curious/afraid to see what DALL-E would create with the prompt “HCPCS hick pics” and was pleasantly surprised that it produced the image above. More on that latter.

Searching for medical codes

I occasionally need to search for medical codes—ICD-9 codes, ICD-10 codes, SNOMED codes, etc.—including HCPCS codes. Here’s some data that is supposed to contain a certain kind of code; does it? Or, more importantly, this text is supposed to have been scrubbed of medical codes; was it?

The most accurate way to search for HCPCS codes would be to have a complete list of codes and search for each one. There are a couple reasons why this isn’t so useful. First of all, I’m doing a quick scan, not writing data validation software. So a quick-and-dirty regular expression is fine. In fact, it’s better: I may be more interested in finding things that look like HCPCS codes than actual HCPCS codes. The former would include some typos or codes that were added after my script was written.

Copyright

Another advantage of using a regular expression that approximates HCPCS codes is that HCPCS codes are copyrighted. I don’t know whether a script using a list of HCPCS codes would be fair use, but it doesn’t matter if I’m using a regular expression.

According to Wikipedia, “CPT is a registered trademark of the American Medical Association, and its largest single source of income.” CPT is part of HCPCS, namely Level I. It is controversial that CPT codes are a commercial product whose use is required by law.

Code patterns

There are three levels of HCPCS codes. Level I is synonymous with CPT and consists of five-digit numbers. It’s easy to search for five-digit numbers, say with the regular expression \d{5}, but of course many five-digit numbers are not medical codes. But if text that has supposedly been stripped of medical codes (and zip codes) has a lot of five-digit numbers, it could be something to look into.

Level II and Level III codes are five-character codes consisting of four digits and either an F or a T. So you might search on \d{4}[FT]. While a five-digit number is likely to be a false positive when searching for HCPCS codes, four-digits following by an F or T is more likely to be a hit.

About the image

I edited the image that DALL-E produced two ways. First, I cropped it vertically; the original image had a lot more grass at the bottom.

Second, I lowered the resolution. The images DALL-E generates are very large. Even after cutting off the lower half of the image, the remaining part was over a megabyte. I used Squoosh to reduce the image to 85 kb. Curiously, when I opened the squooshed output in GIMP and lowered the resolution further, the image actually got larger, so went back to the output from Squoosh.

The image looks more natural than anything I’ve made using DALL-E. I look at the image and think it would be nice to be there. But if you look at the barn closely, it has a little of that surreal quality that DALL-E images usually have. This is not a photograph or a painting.

This discussion of AI-generated images is a bit of a rabbit trail, but it does tie in with searching for medical codes. Both are a sort of forensic criticism: is this what it’s supposed to be or is something out of place? When looking for errors or fraud, you have to look at data similarly to the way you might analyze an image.

Related posts

Greek letter paradox

The Greek letter paradox is seeing the same symbol in two contexts and assuming it means the same thing. Maybe it’s used in many contexts, but I first heard it in the context of comparing statistical models.

I used the phrase in my previous post, looking at

α exp(5t) + β t exp(5t)

and

α exp(4.999 t) + β exp(5.001 t).

In both cases, α is the coefficient of a term equal to or approximately equal to exp(5t), so in some sense α means the same thing in both equations. But as that post shows, the value of α depends on its full context. In that sense, it’s a coincidence that the same letter is used in both equations.

When the two functions above are solutions to a differential equation and a tiny perturbation in the same equation, the values of α and β are very different even though the resulting functions are nearly equal (for moderately small t).

Conspicuously missing data

I was working on a report for a client this afternoon when I remembered this comic from Spiked Math.

Waitress: Does everyone want a beer? Logician 1: I don't know. Logician 2: I don't know. Logician 3: Yes!

I needed to illustrate the point that revealing information about one person or group can reveal information on other people or other groups. If you give your genetic information to a company, for example, you also give that company (and every entity they share your data with) information about your relatives.

This comic doesn’t illustrate the point I had in mind, but it does illustrate a related point. The third logician didn’t reveal the preferences of the first two, though it looks like that at first. Actually, the first two implicitly reported their own preferences.

If the first logician did not want a beer, he or she could have said “No” to the question “Does everyone want a beer?” Answering this question with “I don’t know” is tantamount to answering the question “Do you want a beer?” with “Yes.” What appears to be a non-committal answer is a definite answer on closer examination.

One of the Safe Harbor provisions under HIPAA is that data may not contain sparsely populated three-digit zip codes. Sometimes databases will replace sparse zip codes with nulls. But if the same database reports a person’s state, and the state only has one sparse zip code, then the data effectively lists all zip codes. Here the suppressed zip code is conspicuous by its absence. The null value itself didn’t reveal the zip code, nor did the state, but the combination did.

A naive approach to removing sensitive data can be about as effective as bleeping out profanity: it’s not hard to infer what was removed.

Related posts

Order statistics for normal distributions

The last couple posts have touched on order statistics for normal random variables. I wrote the posts quickly and didn’t go into much detail, and more detail would be useful.

Given two integers nr ≥ 1, define E(r, n) to be the rth order statistic of n samples from standard normal random variables. That is, if we were to take n samples and then sort them in increasing order, the expected value of the rth sample is E(r, n).

We can compute E(r, n) exactly by

E(r, n) = \frac{n!}{(r-1)!\, (n-r)!} \int_{-\infty}^\infty x (1 - \Phi(x))^{r-1}\,\Phi(x)^{n-r}\, \phi(x) \, dx

where φ and Φ are the PDF and CDF of a standard normal random variable respectively. We can numerically evaluate E(r, n) by numerically evaluating its defining integral.

The previous posts have used

dn = E(n, n) – E(1, n) = 2 E(n, n).

The second equality above follows by symmetry.

We can compute dn in Mathematica as follows.

    Phi[x_] := (1 + Erf[x/Sqrt[2]])/2    
    phi[x_] := Exp[-x^2/2]/Sqrt[2 Pi]
    d[n_] := 2 n NIntegrate[
        x Phi[x]^(n - 1) phi[x], {x, -Infinity, Infinity}]

And we can reproduce the table here by

    Table[1/d[n], {n, 2, 10}]

Finally, we can see how dn behaves for large n by calling

    ListPlot[Table[d[n], {n, 1, 1000}]]

to produce the following graph.

See the next post for approximations to dn.