Efficiently testing a black box

Suppose you have a black box that takes three bits as input and produces one bit as output. You could think of the input bits as positions of toggle switches, and the output bit as a light attached to the box that is either on or off.

Full factorial design

Now suppose that only one combination of 3 bits produces a successful output. There’s one way to set the switches that makes the light turn on. You can find the successful input by brute force if you test all 2³ = 8 possible inputs. In statistical lingo, you are conducting an experiment with a factorial design, i.e. you test all combinations of inputs.

See the Python code below for a text version of this design

In the chart above, each row is an experimental run and each column is a bit. I used − and + rather than 0 and 1 because it is conventional in this topic to use a minus sign to indicate that a factor is not present and a plus sign to indicate that it is present.
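Here is a minimal Python sketch (an illustration, not part of the original chart) that generates the full 2³ factorial design, one run per line, using itertools.product.

    from itertools import product

    # full 2^3 factorial design: every combination of three ± settings
    for run in product("-+", repeat=3):
        print("".join(run))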

Fractional factorial design

Now suppose your black box takes 4 bits as inputs, but only 3 of them matter. One of the bits does nothing, but you don’t know which bit that is. You could use a factorial design again, testing all 2⁴ = 16 possible inputs. But there’s a more clever approach that requires only 8 guesses. In statistical jargon, this is a fractional factorial design.

See the Python code below for a text version of this design

No matter which three bits the output depends on, all combinations of those three bits will be tested. Said another way, if you delete any one of the four columns, the remaining columns contain all combinations of those bits.
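One standard way to construct such a design is to start with the full 2³ factorial in three columns and let the fourth column be the product of the first three. Here is a sketch; it reproduces the 8-run design verified in the Python code at the end of the post.

    from itertools import product

    # 2^(4-1) fractional factorial: the fourth factor is the product of the first three
    design = [(a, b, c, a*b*c) for (a, b, c) in product([-1, +1], repeat=3)]
    for run in design:
        print("".join("-" if x < 0 else "+" for x in run))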

Replications

Now suppose your black box takes 8 bits. Again only 3 bits matter, but you don’t know which 3. How many runs do you need to be certain of finding the successful combination of 3 bits? It’s clear that you need at least 8 runs: if you knew that the first three bits were the important ones, for example, you would still need 8 runs. And it’s also clear that you could go back to brute force and try all 2⁸ = 256 possible inputs, but the example above raises your hopes that you could get by with fewer than 256 runs. Could you still get by with 8? That’s too much to hope for, but you could get by with only 16 runs. Here is one such design, with one run per row:

    --------
    +----+++
    -+--+-++
    ++--++--
    --+-+++-
    +-+-+--+
    -++--+-+
    +++---+-
    ---+++-+
    +--++-+-
    -+-+-++-
    ++-+---+
    --++--++
    +-++-+--
    -++++---
    ++++++++

Note that since this design works for 8 factors, it also works for fewer factors. If you had 5, 6, or 7 factors, you could use the first 5, 6, or 7 columns of the design above.

This design has some redundancy: every combination of three particular bits is tested twice. This is unfortunate in our setting because we are assuming the black box is deterministic: the right combination of switches will always turn the light on. But what if the right combination of switches only probably turns the light on? Then redundancy is a good thing. If there’s an 80% chance that the right combination will work on any given test, then the chance that both tests of that combination fail is 0.2² = 0.04, and so there’s a 96% chance that at least one of the two tests of the right combination will work.

Fractional factorial experiment designs are usually used with the assumption that there are random effects, and so redundancy is a good thing.

You want to test each main effect, i.e. each single bit, and interaction effects, i.e. combinations of bits, such as pairs of bits or triples of bits. But you assume that not all possible interactions are important; otherwise you’d need a full factorial design. You typically hit diminishing returns with interactions quickly: pairs of effects are often important, combinations of three effects are less common, and rarely would an experiment consider fourth order interactions.

If only main effects and pairs of main effects matter, and you have a moderately large number of factors n, a fractional factorial design can let you use far fewer than 2ⁿ runs while giving you as many replications of main and interaction effects as you want.
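As a rough sanity check on that claim, here is a back-of-the-envelope count (mine, not from the original discussion) of the parameters needed to fit an intercept, all main effects, and all two-factor interactions. A design that resolves them needs at least that many runs, which is far smaller than 2ⁿ.

    from math import comb

    # intercept + main effects + two-factor interactions, compared with 2^n runs
    for n in (5, 10, 15, 20):
        print(n, 1 + n + comb(n, 2), 2**n)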

Verification

The following Python code verifies that the designs above have the claimed properties.

    import numpy as np
    from itertools import combinations
    
    def verify(matrix, k):
        "verify that every choice of k columns has 2^k unique rows"
        nrows, ncols = matrix.shape
        for cols in combinations(range(ncols), k):
            s = set()
            for r in range(nrows):
                s.add(tuple(matrix[r, c] for c in cols))
            if len(s) != 2**k:
                print("problem with columns", *cols)
                print("number of unique elements:", len(s))
                print("should be", 2**k)
                return
        print("pass")
    
    m = [
        [-1, -1, -1, -1],
        [-1, -1, +1, +1],
        [-1, +1, -1, +1],
        [-1, +1, +1, -1],
        [+1, -1, -1, +1],
        [+1, -1, +1, -1],
        [+1, +1, -1, -1],
        [+1, +1, +1, +1]
    ]
    
    verify(np.array(m), 3)
    
    m = [
        [-1, -1, -1, -1, -1, -1, -1, -1],
        [+1, -1, -1, -1, -1, +1, +1, +1],
        [-1, +1, -1, -1, +1, -1, +1, +1],
        [+1, +1, -1, -1, +1, +1, -1, -1],
        [-1, -1, +1, -1, +1, +1, +1, -1],
        [+1, -1, +1, -1, +1, -1, -1, +1],
        [-1, +1, +1, -1, -1, +1, -1, +1],
        [+1, +1, +1, -1, -1, -1, +1, -1],
        [-1, -1, -1, +1, +1, +1, -1, +1],
        [+1, -1, -1, +1, +1, -1, +1, -1],
        [-1, +1, -1, +1, -1, +1, +1, -1],
        [+1, +1, -1, +1, -1, -1, -1, +1],
        [-1, -1, +1, +1, -1, -1, +1, +1],
        [+1, -1, +1, +1, -1, +1, -1, -1],
        [-1, +1, +1, +1, +1, -1, -1, -1],
        [+1, +1, +1, +1, +1, +1, +1, +1],
    ]
    
    verify(np.array(m), 3)


Regex to match ICD-11 code

ICD codes are diagnostic codes created by the WHO. (Three TLAs in just the opening paragraph!)

The latest version, ICD-11, went into effect in January 2022. A few countries are using ICD-11 now; it’s expected to be at least a couple years before the US moves from ICD-10 to ICD-11. (I still see ICD-9 data even though ICD-10 came out in 1994.)

One way that ICD-11 codes differ from ICD-10 codes is that the new codes do not use the letters I or O in order to prevent possible confusion with the digits 1 and 0. In the code below, “alphabetic” and “alphanumeric” implicitly exclude the letters I and O.

Another way the codes differ is that the second character in an ICD-10 code is a digit, whereas the second character in an ICD-11 code is a letter.

What follows is a heavily-commented regular expression for matching ICD-11 codes, along with a few tests to show that the regex matches things it should and does not match things it should not.

Of course you could verify an ICD-11 code by searching against an exhaustive list of such codes, but the following is much simpler, though it may match some false positives. However, it is future-proof against false negatives: ICD-11 codes added in the future will conform to the pattern in the regular expression.

    import re

    icd11_re = re.compile(r"""
        ^                  # beginning of string
        [A-HJ-NP-Z0-9]     # alphanumeric
        [A-HJ-NP-Z]        # alphabetic
        [0-9]              # digit
        [A-HJ-NP-Z0-9]     # alphanumeric
        ((\.               # optional starting with .
        [A-HJ-NP-Z0-9])    # alphanumeric
        [A-HJ-NP-Z0-9]?)?  # optional further refinement
        $                  # end of string
        """, re.VERBOSE)

    good = [
        "ND52",   # fracture of arm, level unspecified
        "9D00.3", # presbyopia
        "8B60.Y", # other specified increased intracranial pressure
        "DB98.7Z" # portal hypertension, unspecified
    ]

    bad = [
        "ABCD",    # third character must be digit
        "AB3D.",   # dot must be followed by alphanumeric
        "9D0O.3",  # letter 'O' should be number 0
        "DB9872",  # missing dot
        "AB3",     # too short
        "DB90.123" # too long
    ]

    for g in good:
        assert icd11_re.match(g)
    for b in bad:
        assert icd11_re.match(b) is None


The Very Model of a Professor Statistical

The last chapter of George Box’s book Improving Almost Anything contains the lyrics to “I Am the Very Model of a Professor Statistical,” to be sung to the tune of “I Am the Very Model of a Modern Major General” by Gilbert & Sullivan.


The original song has a few funny math-related lines.

I’m very well acquainted, too, with matters mathematical,
I understand equations, both the simple and quadratical,
About binomial theorem I’m teeming with a lot o’ news,
With many cheerful facts about the square of the hypotenuse.

I’m very good at integral and differential calculus;
I know the scientific names of beings animalculous:
In short, in matters vegetable, animal, and mineral,
I am the very model of a modern Major-General.

Here are a few lines from George Box’s version.

I relentlessly uncover any aberrant contingency
I strangle it with rigor and stifle it with stringency
I understand the different symbols be they Roman, Greek, or cuneiform
And every distribution from the Cauchy to the uniform.

With derivation rigorous each lemma I can justify
My every estimator I am careful to robustify
In short in matters logical, mathematical, idealistical
I am the very model of a professor statistical.

Gilbert & Sullivan have come up on this blog a couple of other times.

George Box has come up too, but only once. (I’m surprised he hasn’t come up more; I should rectify that.) This post has a great quote from Box: “To find out what happens to a system when you interfere with it, you have to interfere with it (and not just passively observe it).”

HCPCS (“hick pics”) codes


HCPCS stands for Healthcare Common Procedure Coding System. HCPCS codes are commonly pronounced like “hick pics.”

Searching for medical codes

I occasionally need to search for medical codes—ICD-9 codes, ICD-10 codes, SNOMED codes, etc.—including HCPCS codes. Here’s some data that is supposed to contain a certain kind of code; does it? Or, more importantly, this text is supposed to have been scrubbed of medical codes; was it?

The most accurate way to search for HCPCS codes would be to have a complete list of codes and search for each one. There are a couple reasons why this isn’t so useful. First of all, I’m doing a quick scan, not writing data validation software. So a quick-and-dirty regular expression is fine. In fact, it’s better: I may be more interested in finding things that look like HCPCS codes than actual HCPCS codes. The former would include some typos or codes that were added after my script was written.

Copyright

Another advantage of using a regular expression that approximates HCPCS codes is that HCPCS codes are copyrighted. I don’t know whether a script using a list of HCPCS codes would be fair use, but it doesn’t matter if I’m using a regular expression.

According to Wikipedia, “CPT is a registered trademark of the American Medical Association, and its largest single source of income.” CPT is part of HCPCS, namely Level I. It is controversial that CPT codes are a commercial product whose use is required by law.

Code patterns

There are three levels of HCPCS codes. Level I is synonymous with CPT and consists of five-digit numbers. It’s easy to search for five-digit numbers, say with the regular expression \d{5}, but of course many five-digit numbers are not medical codes. But if text that has supposedly been stripped of medical codes (and zip codes) has a lot of five-digit numbers, it could be something to look into.

Level II and Level III codes are five-character codes consisting of four digits and either an F or a T. So you might search on \d{4}[FT]. While a five-digit number is likely to be a false positive when searching for HCPCS codes, four digits followed by an F or T is more likely to be a hit.
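Here is a sketch of the kind of quick scan described above. The word-boundary anchors and the sample text are just for illustration; the point is to flag candidates for review, not to validate codes.

    import re

    level1  = re.compile(r"\b\d{5}\b")      # Level I / CPT: five digits
    level23 = re.compile(r"\b\d{4}[FT]\b")  # four digits followed by F or T

    text = "Charges: 99213 on 3/14, follow-up code 1126F, zip 77005."
    print(level1.findall(text))   # ['99213', '77005'] (the zip code is a false positive)
    print(level23.findall(text))  # ['1126F']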


Greek letter paradox

The Greek letter paradox is seeing the same symbol in two contexts and assuming it means the same thing. Maybe it’s used in many contexts, but I first heard it in the context of comparing statistical models.

I used the phrase in my previous post, looking at

α exp(5t) + β t exp(5t)

and

α exp(4.999 t) + β exp(5.001 t).

In both cases, α is the coefficient of a term equal to or approximately equal to exp(5t), so in some sense α means the same thing in both equations. But as that post shows, the value of α depends on its full context. In that sense, it’s a coincidence that the same letter is used in both equations.

When the two functions above are solutions to a differential equation and a tiny perturbation in the same equation, the values of α and β are very different even though the resulting functions are nearly equal (for moderately small t).
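To make this concrete, here is a small numerical sketch. The initial conditions y(0) = 1 and y′(0) = 0 are a hypothetical choice for illustration, not taken from the previous post.

    import numpy as np

    # Repeated root 5: y = a exp(5t) + b t exp(5t), so y(0) = a and y'(0) = 5a + b
    a1, b1 = np.linalg.solve([[1, 0], [5, 1]], [1, 0])

    # Nearby distinct roots: y = a exp(4.999 t) + b exp(5.001 t)
    a2, b2 = np.linalg.solve([[1, 1], [4.999, 5.001]], [1, 0])

    print(a1, b1)  # 1.0, -5.0
    print(a2, b2)  # roughly 2500.5 and -2499.5

    # yet the two functions are nearly equal for moderately small t
    t = np.linspace(0, 1, 101)
    y1 = a1*np.exp(5*t) + b1*t*np.exp(5*t)
    y2 = a2*np.exp(4.999*t) + b2*np.exp(5.001*t)
    print(np.max(np.abs(y1 - y2)))  # small compared to max |y1|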

Conspicuously missing data

I was working on a report for a client this afternoon when I remembered this comic from Spiked Math.

Waitress: Does everyone want a beer? Logician 1: I don't know. Logician 2: I don't know. Logician 3: Yes!

I needed to illustrate the point that revealing information about one person or group can reveal information on other people or other groups. If you give your genetic information to a company, for example, you also give that company (and every entity they share your data with) information about your relatives.

This comic doesn’t illustrate the point I had in mind, but it does illustrate a related point. The third logician didn’t reveal the preferences of the first two, though it looks like that at first. Actually, the first two implicitly reported their own preferences.

If the first logician did not want a beer, he or she could have said “No” to the question “Does everyone want a beer?” Answering this question with “I don’t know” is tantamount to answering the question “Do you want a beer?” with “Yes.” What appears to be a non-committal answer is a definite answer on closer examination.

One of the Safe Harbor provisions under HIPAA is that data may not contain sparsely populated three-digit zip codes. Sometimes databases will replace sparse zip codes with nulls. But if the same database reports a person’s state, and the state only has one sparse zip code, then the data effectively lists all zip codes. Here the suppressed zip code is conspicuous by its absence. The null value itself didn’t reveal the zip code, nor did the state, but the combination did.

A naive approach to removing sensitive data can be about as effective as bleeping out profanity: it’s not hard to infer what was removed.


Order statistics for normal distributions

The last couple posts have touched on order statistics for normal random variables. I wrote the posts quickly and didn’t go into much detail, and more detail would be useful.

Given two integers n ≥ r ≥ 1, define E(r, n) to be the expected value of the rth order statistic of n samples from a standard normal random variable. That is, if we were to take n samples and sort them in increasing order, E(r, n) is the expected value of the rth sample.

We can compute E(r, n) exactly by

E(r, n) = \frac{n!}{(r-1)!\, (n-r)!} \int_{-\infty}^\infty x \,\Phi(x)^{r-1}\, (1 - \Phi(x))^{n-r}\, \phi(x) \, dx

where φ and Φ are the PDF and CDF of a standard normal random variable respectively. We can evaluate E(r, n) numerically by applying numerical integration to its defining integral.
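For example, here is a Python sketch (mine, using SciPy rather than the Mathematica below) that evaluates the integral by quadrature.

    from math import factorial
    import numpy as np
    from scipy import integrate
    from scipy.stats import norm

    def expected_order_stat(r, n):
        "Expected value of the rth smallest of n standard normal samples."
        coef = factorial(n) // (factorial(r - 1) * factorial(n - r))
        f = lambda x: x * norm.cdf(x)**(r - 1) * norm.sf(x)**(n - r) * norm.pdf(x)
        return coef * integrate.quad(f, -np.inf, np.inf)[0]

    print(expected_order_stat(10, 10))  # expected maximum of 10 samples, about 1.54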

The previous posts have used

d_n = E(n, n) − E(1, n) = 2 E(n, n).

The second equality above follows from the symmetry of the normal distribution about zero, which implies E(1, n) = −E(n, n).

We can compute d_n in Mathematica as follows.

    Phi[x_] := (1 + Erf[x/Sqrt[2]])/2    
    phi[x_] := Exp[-x^2/2]/Sqrt[2 Pi]
    d[n_] := 2 n NIntegrate[
        x Phi[x]^(n - 1) phi[x], {x, -Infinity, Infinity}]

And we can reproduce the table of 1/d_n values in the post “Estimating standard deviation from range” below by

    Table[1/d[n], {n, 2, 10}]

Finally, we can see how d_n behaves for large n by calling

    ListPlot[Table[d[n], {n, 1, 1000}]]

to plot d_n as a function of n.

See the next post for approximations to d_n.

Range trick for larger sample sizes

The previous post showed that the standard deviation of a sample of size n can be well estimated by dividing the sample range by a constant d_n that depends on n.

The method works well for relatively small n. This may sound strange: typically statistical methods work better for large samples, not small samples. And indeed this method would work better for larger samples. However, we’re not so much interested in the efficiency of the method per se as in its efficiency relative to the standard way of estimating standard deviation. For small samples, neither method is very accurate, and the two appear to work about equally well.

If we want to use the range method for larger n, there are a couple of questions: how well does the method work, and how do we calculate d_n?

Simple extension

Ashley Kanter left a comment on the previous post saying

d_n = 3 log n^0.75 (where the log is base 10) seems to perform quite well even for larger n.

Ashley doesn’t say where this came from. Maybe it’s an empirical fit.

The constant d_n is the expected value of the range of n samples from a standard normal random variable. You could find a (complicated) expression for this and then find a simpler expression using an asymptotic approximation. Maybe I’ll try that later, but for now I need to wrap up this post and move on to client work.

Note that

3 log10 n^0.75 = 0.977 loge(n)

and so we could use

d_n = log n

where log is natural log. This seems like something that might fall out of an asymptotic approximation. Maybe Ashley empirically discovered the first term of a series approximation.

Update: See this post for a more detailed exploration of how well log n, square root of n, and another method approximate d_n.

Update 2: Ashley Kanter’s approximation was supposed to be 3 (log10 n)^0.75 rather than 3 log10(n^0.75), and it is a very good approximation. This is also addressed in the link in the first update.
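Here is a quick numerical check of that claim (my own sketch), computing d_n by quadrature and comparing it with 3 (log10 n)^0.75 and with the natural log.

    import numpy as np
    from scipy import integrate
    from scipy.stats import norm

    def d(n):
        "Expected range of n standard normal samples, i.e. 2 E(max)."
        f = lambda x: x * norm.cdf(x)**(n - 1) * norm.pdf(x)
        return 2 * n * integrate.quad(f, -np.inf, np.inf)[0]

    for n in (20, 50, 100):
        print(n, d(n), 3*np.log10(n)**0.75, np.log(n))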

Simulation

Here’s some Python code to try things out.

    import numpy as np
    from scipy.stats import norm

    np.random.seed(20220309)

    n = 20
    for _ in range(5):
        x = norm.rvs(size=n)               # n samples from a standard normal
        w = x.max() - x.min()              # sample range
        print(x.std(ddof=1), w/np.log(n))  # sample std dev vs. range estimate with d_n ≈ log n

And here are the results.

    |   std | w/d_n |
    |-------+-------|
    | 0.930 | 1.340 |
    | 0.919 | 1.104 |
    | 0.999 | 1.270 |
    | 0.735 | 0.963 |
    | 0.956 | 1.175 |

It seems the range method has more variance, though notice in the fourth row that the standard estimate can occasionally wander pretty far from the theoretical value of 1 as well.

We get similar results when we increase n to 50.

    |   std | w/d_n |
    |-------+-------|
    | 0.926 | 1.077 |
    | 0.889 | 1.001 |
    | 0.982 | 1.276 |
    | 1.038 | 1.340 |
    | 1.025 | 1.209 |

Not-so-simple extension

There are ways to improve the range method, if by “improve” you mean make it more accurate. One way is to divide the sample into random partitions, apply the method to each partition, and average the results. If you’re going to do this, partitions of size 8 are optimal [1]. However, the main benefit of the range method [2] is its simplicity.


[1] F. E. Grubbs and C. L. Weaver (1947). The best unbiased estimate of a population standard deviation based on group ranges. Journal of the American Statistical Association, 42, pp. 224–241.

[2] The main advantage now is its simplicity. When the method was discovered, it reduced manual calculation, and so it could have been worthwhile to make it a little more complicated, as long as the calculation effort was still less than that of the standard method.

Estimating standard deviation from range

Suppose you have a small number of samples, say between 2 and 10, and you’d like to estimate the standard deviation σ of the population these samples came from. Of course you could compute the sample standard deviation, but there is a simple and robust alternative.

Let W be the range of our samples, the difference between the largest and smallest value. Think “w” for “width.” Then

W / d_n

is an unbiased estimator of σ, where the constants d_n can be looked up in a table [1].

    |  n | 1/d_n |
    |----+-------|
    |  2 | 0.886 |
    |  3 | 0.591 |
    |  4 | 0.486 |
    |  5 | 0.430 |
    |  6 | 0.395 |
    |  7 | 0.370 |
    |  8 | 0.351 |
    |  9 | 0.337 |
    | 10 | 0.325 |

The constants d_n (whose reciprocals 1/d_n are tabulated above) were calculated from the expected value of W/σ for normal random variables, but the method may be used on data that do not come from a normal distribution.

Let’s try this out with a little Python code. First we’ll take samples from a standard normal distribution, so the population standard deviation is 1. We’ll draw ten samples at a time, repeating five times, and estimate the standard deviation two ways: by the method above and by the sample standard deviation.

    from scipy.stats import norm, gamma

    for _ in range(5):
        x = norm.rvs(size=10)          # ten samples from a standard normal
        w = x.max() - x.min()          # sample range W
        print(w*0.325, x.std(ddof=1))  # 0.325 = 1/d_10; range estimate first, then sample std dev

Here’s the output:

    | w/d_n |   std |
    |-------+-------|
    | 1.174 | 1.434 |
    | 1.205 | 1.480 |
    | 1.173 | 0.987 |
    | 1.154 | 1.277 |
    | 0.921 | 1.083 |

Just from this example it seems the range method does about as well as the sample standard deviation.

For a non-normal example, let’s repeat our exercise using a gamma distribution with shape 4, which has standard deviation 2.
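The code for the gamma case isn’t shown above, but it would look something like the following sketch (the scale parameter of 1 and the absence of a seed are assumptions). The results below are from one such run.

    from scipy.stats import gamma

    # gamma with shape 4 and scale 1 has variance 4, hence standard deviation 2
    for _ in range(5):
        x = gamma.rvs(4, size=10)
        w = x.max() - x.min()
        print(w*0.325, x.std(ddof=1))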

    | w/d_n |   std |
    |-------+-------|
    | 2.009 | 1.827 |
    | 1.474 | 1.416 |
    | 1.898 | 2.032 |
    | 2.346 | 2.252 |
    | 2.566 | 2.213 |

Once again, it seems both methods do about equally well. In both examples the uncertainty due to the small sample size is more important than the difference between the two methods.

Update: To calculate d_n for other values of n, see this post.

[1] Source: H. A. David. Order Statistics. John Wiley and Sons, 1970.

Find log normal parameters for given mean and variance

Earlier today I needed to solve for log normal parameters that yield a given mean and variance. I’m going to save the calculation here in case I need it in the future or in case a reader needs it. The derivation is simple, but in the heat of the moment I’d rather look it up and keep going with my train of thought.

NB: The parameters μ and σ² of a log normal distribution are not the mean and variance of the distribution; they are the mean and variance of its log.

If m is the mean and v is the variance then

\begin{align*} m &= \exp(\mu + \sigma^2/2) \\ v &= (\exp(\sigma^2) - 1) \exp(2\mu + \sigma^2) \end{align*}

Notice that m² = exp(2μ + σ²), which is the second factor in the expression for v.

Then

\frac{v}{m^2} = \exp(\sigma^2) -1

and so

\sigma^2 = \log\left(\frac{v}{m^2} + 1 \right)

and once you have σ² you can find μ by

\mu = \log m - \sigma^2/2

Here’s Python code to implement the above.

    from numpy import log
    def solve_for_log_normal_parameters(mean, variance):
        sigma2 = log(variance/mean**2 + 1)
        mu = log(mean) - sigma2/2
        return (mu, sigma2)

And here’s a little test code for the code above.

    from numpy import exp
    from scipy.stats import lognorm

    mean = 3.4
    variance = 5.6

    mu, sigma2 = solve_for_log_normal_parameters(mean, variance)

    X = lognorm(scale=exp(mu), s=sigma2**0.5)
    assert(abs(mean - X.mean()) < 1e-10)
    assert(abs(variance - X.var()) < 1e-10)
