Efficiently testing a black box

Posted on 10 October 2022 by John

Suppose you have a black box that takes three bits as input and produces one bit as output. You could think of the input bits as positions of toggle switches, and the output bit as a light attached to the box that is either on or off.

Full factorial design

Now suppose that only one combination of 3 bits produces a successful output. There’s one way to set the switches that makes the light turn on. You can find the successful input by brute force if you test all 2³ = 8 possible inputs. In statistical lingo, you are conducting an experiment with a factorial design, i.e. you test all combinations of inputs.

See the Python code below for a text version of this design

In the chart above, each row is an experimental run and each column is a bit. I used − and + rather than 0 and 1 because it is conventional in this topic to use a minus sign to indicate that a factor is not present and a plus sign to indicate that it is present.

Fractional factorial design

Now suppose your black box takes 4 bits as inputs, but only 3 of them matter. One of the bits does nothing, but you don’t know which bit that is. You could use a factorial design again, testing all 2⁴ = 16 possible inputs. But there’s a more clever approach that requires only 8 guesses. In statistical jargon, this is a fractional factorial design.

See the Python code below for a text version of this design

No matter which three bits the output depends on, all combinations of those three bits will be tested. Said another way, if you delete any one of the four columns, the remaining columns contain all combinations of those bits.

Replications

Now suppose your black box takes 8 bits. Again only 3 bits matter, but you don’t know which 3. How many runs do you need to do to be certain of finding the successful combination of 3 bits? It’s clear that you need at least 8 runs: if you know that the first three bits are the important ones, for example, you still need 8 runs. And it’s also clear that you could go back to brute force and try all 2⁸ = 256 possible inputs, but the example above raises your hopes that you could get by with less than 256 runs. Could you still get by with 8? That’s too much to hope for, but you could use only 16 runs.

--------, +----+++, -+--+-++, ++--++--, --+-+++-, +-+-+--+, -++--+-+, +++---+-, ---+++-+, +--++-+-, -+-+-++-, ++-+---+, --++--++, +-++-+--, -++++---, ++++++++

Note that since this design works for 8 factors, it also works for fewer factors. If you had 5, 6, or 7 factors, you could use the first 5, 6, or 7 columns of the design above.

This design has some redundancy: every combination of three particular bits is tested twice. This is unfortunate in our setting because we are assuming the black box is deterministic: the right combination of switches will always turn the light on. But what if the right combination of switches probably turns on the light? Then redundancy is a good thing. If there’s an 80% chance that the right combination will work, then there’s a 96% chance that at least one of the two tests of the right combination will work.

Fractional factorial experiment designs are usually used with the assumption that there are random effects, and so redundancy is a good thing.

You want to test each main effect, i.e. each single bit, and interaction effects, i.e. combinations of bits, such as pairs of bits or triples of bits. But you assume that not all possible interactions are important; otherwise you’d need a full factorial design. You typically hit diminishing returns with interactions quickly: pairs of effects are often important, combinations of three effects are less common, and rarely would an experiment consider fourth order interactions.

If only main effects and pairs of main effects matter, and you have a moderately large number of factors n, a fractional factorial design can let you use a lot less than 2ⁿ runs while giving you as many replications of main and interaction effects as you want.

Verification

The following Python code verifies that the designs above have the claimed properties.

    import numpy as np
    from itertools import combinations
    
    def verify(matrix, k):
        "verify that every choice of k columns has 2^k unique rows"
        nrows, ncols = matrix.shape
        for (a, b, c) in combinations(range(ncols), k):
            s = set()
            for r in range(nrows):
                s.add((matrix[r,a], matrix[r,b], matrix[r,c]))
            if len(s) != 2**k:
                print("problem with columns", a, b, c)
                print("number of unique elements: ", len(s))
                print("should be", 2**k)
                return
        print("pass")
    
    m = [
        [-1, -1, -1, -1],
        [-1, -1, +1, +1],
        [-1, +1, -1, +1],
        [-1, +1, +1, -1],
        [+1, -1, -1, +1],
        [+1, -1, +1, -1],
        [+1, +1, -1, -1],
        [+1, +1, +1, +1]
    ]
    
    verify(np.matrix(m), 3)
    
    m = [
        [-1, -1, -1, -1, -1, -1, -1, -1],
        [+1, -1, -1, -1, -1, +1, +1, +1],
        [-1, +1, -1, -1, +1, -1, +1, +1],
        [+1, +1, -1, -1, +1, +1, -1, -1],
        [-1, -1, +1, -1, +1, +1, +1, -1],
        [+1, -1, +1, -1, +1, -1, -1, +1],
        [-1, +1, +1, -1, -1, +1, -1, +1],
        [+1, +1, +1, -1, -1, -1, +1, -1],
        [-1, -1, -1, +1, +1, +1, -1, +1],
        [+1, -1, -1, +1, +1, -1, +1, -1],
        [-1, +1, -1, +1, -1, +1, +1, -1],
        [+1, +1, -1, +1, -1, -1, -1, +1],
        [-1, -1, +1, +1, -1, -1, +1, +1],
        [+1, -1, +1, +1, -1, +1, -1, -1],
        [-1, +1, +1, +1, +1, -1, -1, -1],
        [+1, +1, +1, +1, +1, +1, +1, +1],
    ]
    
    verify(np.matrix(m), 3)

Related services

Which one is subharmonic?

Posted on 8 October 2022 by John

The Laplace operator Δ of a function of n variables is defined by

$\Delta f = \sum_{i=1}^n \frac{\partial^2 f}{\partial x^2_i}$

If Δ f = 0 in some region Ω, f is said to be harmonic on Ω. In that case f takes on its maximum and minimum over Ω at locations on the boundary ∂Ω of Ω. Here is an example of a harmonic function over a square which clearly takes on its maximum on two sides of the boundary and its minimum on the other two sides.

Plot of x^2 - y^2 over [-1,1] cross [-1,1].

The theorem above can be split into two theorems and generalized:

If Δ f ≥ 0, then f takes on its maximum on ∂Ω.

If Δ f ≤ 0, then f takes on its minimum on ∂Ω.

These two theorems are called the maximum principle and minimum principle respectively.

Now just as functions with Δf equal to zero are called harmonic, functions with Δf non-negative or non-positive are called subharmonic and superharmonic. Or is it the other way around?

If Δ f ≥ 0 in Ω, then f is called subharmonic in Ω. And if Δ f ≤ 0 then f is called superharmonic. Equivalently, f is superharmonic if −f is subharmonic.

The names subharmonic and superharmonic may seem backward: the theorem with the greater than sign is for subharmonic functions, and the theorem with the less than sign is for superharmonic functions. Shouldn’t the sub-thing be less than something and the super-thing greater?

Indeed they are, but you have to look f and not Δf. That’s the key.

If a function f is subharmonic on Ω, then f is below the harmonic function interpolating f from ∂Ω into the interior of Ω. That is, if g satisfies Laplace’s equation

$\begin{align*} \Delta g &= 0 \mbox{ on }\Omega \\ g &= f \mbox{ on } \partial\Omega \end{align*}$

then f ≤ g on Ω.

For example, let f(x) = ||x|| and let Ω be the unit ball in ℝⁿ. Then Δ f ≥ 0 and so f is subharmonic. (The norm function f has a singularity at the origin, but this example can be made rigorous.) Now f is constantly 1 on the boundary of the ball, and the constant function 1 is the unique solution of Laplace’s equation on the unit ball with such boundary condition, and clearly f is less than 1 on the interior of the ball.

Ideograph numerals

Posted on 7 October 2022 by John

This post is a follow on to my previous post on Unicode numbers. I always welcome feedback from readers, but I especially welcome it here because I’m walking into an area I know next to nothing about.

Consecutive code points

Unicode generally assigns code points to number-like things in consecutive order. For example, the Python code

    for n in range(1,10):
        print(chr(0x30+n), chr(0x24f4+n), chr(0x215f+n))

prints

    1 ⓵ Ⅰ
    2 ⓶ Ⅱ
    3 ⓷ Ⅲ
    4 ⓸ Ⅳ
    5 ⓹ Ⅴ
    6 ⓺ Ⅵ
    7 ⓻ Ⅶ
    8 ⓼ Ⅷ
    9 ⓽ Ⅸ

showing that ASCII digits, circled numerals, and Roman numerals are encoded consecutively.

Parenthesized and circled ideographs

So the same is probably true for ideographs representing digits, right?

一二三四五六七八九㈠㈡㈢㈣㈤㈥㈦㈧㈨㊀㊁㊂㊃㊄㊅㊆㊇㊈

No, but before we get into that, the following code shows that parenthesized ideographs and circled ideographs for digits are numbered consecutively. The code

    from unicodedata import numeric, name

    for i in range(1, 10):
        cp = 0x321f + i
        ch = chr(cp)
        print(ch, hex(cp), numeric(ch), name(ch))
    
    for i in range(1, 10):
        cp = 0x327f + i
        ch = chr(cp)
        print(ch, hex(cp), numeric(ch), name(ch))

prints

    ㈠ 0x3220 1.0 PARENTHESIZED IDEOGRAPH ONE
    ㈡ 0x3221 2.0 PARENTHESIZED IDEOGRAPH TWO
    ㈢ 0x3222 3.0 PARENTHESIZED IDEOGRAPH THREE
    ㈣ 0x3223 4.0 PARENTHESIZED IDEOGRAPH FOUR
    ㈤ 0x3224 5.0 PARENTHESIZED IDEOGRAPH FIVE
    ㈥ 0x3225 6.0 PARENTHESIZED IDEOGRAPH SIX
    ㈦ 0x3226 7.0 PARENTHESIZED IDEOGRAPH SEVEN
    ㈧ 0x3227 8.0 PARENTHESIZED IDEOGRAPH EIGHT
    ㈨ 0x3228 9.0 PARENTHESIZED IDEOGRAPH NINE
    ㊀ 0x3280 1.0 CIRCLED IDEOGRAPH ONE
    ㊁ 0x3281 2.0 CIRCLED IDEOGRAPH TWO
    ㊂ 0x3282 3.0 CIRCLED IDEOGRAPH THREE
    ㊃ 0x3283 4.0 CIRCLED IDEOGRAPH FOUR
    ㊄ 0x3284 5.0 CIRCLED IDEOGRAPH FIVE
    ㊅ 0x3285 6.0 CIRCLED IDEOGRAPH SIX
    ㊆ 0x3286 7.0 CIRCLED IDEOGRAPH SEVEN
    ㊇ 0x3287 8.0 CIRCLED IDEOGRAPH EIGHT
    ㊈ 0x3288 9.0 CIRCLED IDEOGRAPH NINE

CJK Unified Ideographs

Now let’s take the parentheses and circles off.

The following code shows that the CJK unified ideographs for digits are not digits (!) according to Unicode, but they are numeric. It also shows that their code points are not assigned in any apparent order.

    numerals = "一二三四五六七八九十"
    for n in numerals:
        print(n, hex(ord(n)), n.isdigit(), numeric(n))

This outputs the following.

    一 0x4e00 False 1.0
    二 0x4e8c False 2.0
    三 0x4e09 False 3.0
    四 0x56db False 4.0
    五 0x4e94 False 5.0
    六 0x516d False 6.0
    七 0x4e03 False 7.0
    八 0x516b False 8.0
    九 0x4e5d False 9.0
    十 0x5341 False 10.0

I assume the ordering of ideographs in Unicode has its own internal logic (with exceptions and historical quirks) that I know nothing about. If anyone knows of any patterns of how code points are assigned to ideographs, please let me know.

The names of the characters above say nothing about what the characters mean. For example, the official Unicode name for 九 (U+4E5D) is CJK UNIFIED IDEOGRAPH-4E5D. The name says nothing about the ideograph representing the digit 9, though the numeric property of the digit is indeed 9. My guess is that when that character represents a digit, it represents 9, but maybe it can mean other things in other contexts.

Unicode numbers

Posted on 7 October 2022 by John

There are 10 digits in ASCII, and I bet you can guess what they are. In ASCII, a digit is a decimal is a number.

Things are much wilder in Unicode. There are hundreds of decimals, digits, and numeric characters, and they’re different sets.

꩓ ٦ ³ ⓶ ₅ ⅕ Ⅷ ㊈

The following Python code loops through all possible Unicode characters, extracting the set of decimals, digits, and numbers.

    numbers  = set()
    decimals = set() 
    digits   = set()

    for i in range(1, 0x110000):
        ch = chr(i)
        if ch.isdigit():
            digits.add(ch)
        if ch.isdecimal():
            decimals.add(ch)
        if ch.isnumeric():
            numbers.add(ch)

These sets are larger than you may expect. The code

    print(len(decimals), len(digits), len(numbers))

tells us that the size of the three sets are 650, 778, and 1862 respectively.

The following code verifies that decimals are a proper subset of digits and that digits are a proper subset of numerical characters.

    assert(decimals < digits < numbers)

Now let’s look at the characters in the image above. The following code describes what each character is and how it is classified. The first three characters are digits, the next three are decimals but not digits, and the last three are numeric but not decimals.

    from unicodedata import name
    for c in "꩓٦":
        print(name(c))
        assert(c.isdecimal())
    for c in "³⓶₅":
        print(name(c))    
        assert(c.isdigit() and not c.isdecimal())
    for c in "⅕Ⅷ㊈":
        print(name(c))    
        assert(c.isnumeric() and not c.isdigit())

The names of the characters are

MATHEMATICAL DOUBLE-STRUCK DIGIT EIGHT
CHAM DIGIT THREE
ARABIC-INDIC DIGIT SIX
SUPERSCRIPT THREE
DOUBLE CIRCLED DIGIT TWO
SUBSCRIPT FIVE
VULGAR FRACTION ONE FIFTH
ROMAN NUMERAL EIGHT
CIRCLED IDEOGRAPH NINE

Update: See the next post on ideographic numerals.

Update: There are 142 distinct numbers that correspond to the numerical value associated with a Unicode character. This page gives a list of the values and an example of each value.

Regex to match ICD-11 code

Posted on 6 October 2022 by John

ICD codes are diagnostic codes created by the WHO. (Three TLAs in just the opening paragraph!)

The latest version, ICD-11, went into effect in January of this year. A few countries are using ICD-11 now; it’s expected to be at least a couple years before the US moves from ICD-10 to ICD-11. (I still see ICD-9 data even though ICD-10 came out in 1994.)

One way that ICD-11 codes differ from ICD-10 codes is that the new codes do not use the letters I or O in order to prevent possible confusion with the digits 1 and 0. In the code below, “alphabetic” and “alphanumeric” implicitly exclude the letters I and O.

Another way the codes differ is the that the second character in an ICD-10 is a digit whereas the second character in an ICD-11 code is a letter.

What follows is a heavily-commented regular expression for matching ICD-11 codes, along with a few tests to show that the regex matches things it should and does not match things it should not.

Of course you could verify an ICD-11 code by searching against an exhaustive list of such codes, but the following is much simpler and may match some false positives. However, it is future-proof against false negatives: ICD-11 codes added in the future will conform to the pattern in the regular expression.

import re

icd11_re = re.compile(r"""
    ^                  # beginning of string
    [A-HJ-NP-Z0-9]     # alphanumeric
    [A-HJ-NP-Z]        # alphabetic
    [0-9]              # digit
    [A-HJ-NP-Z0-9]     # alphanumeric
    ((\.               # optional starting with .
    [A-HJ-NP-Z0-9])    # alphanumeric
    [A-HJ-NP-Z0-9]?)?  # optional further refinement
    $                  # end of string
    """, re.VERBOSE)

good = [
    "ND52",   # fracture of arm, level unspecified
    "9D00.3", # presbyopia
    "8B60.Y", # other specified increased intercranial pressure
    "DB98.7Z" # portal hypertension, unspecified
]

bad = [
    "ABCD",    # third character must be digit
    "AB3D.",   # dot must be followed by alphanumeric
    "9D0O.3",  # letter 'O' should be number 0
    "DB9872",  # missing dot
    "AB3",     # too short
    "DB90.123" # too long
]

for g in good:
    assert(icd11_re.match(g))
for b in bad:
    assert(icd11_re.match(b) == None)

The Very Model of a Professor Statistical

Posted on 5 October 2022 by John

The last chapter of George Box’s book Improving Almost Anything contains the lyrics to “I Am the Very Model of a Professor Statistical,” to be sung to the tune of “I Am the Very Model of a Modern Major General” by Gilbert & Sullivan.

Here’s the original:

The original song has a few funny math-related lines.

I’m very well acquainted, too, with matters mathematical,
I understand equations, both the simple and quadratical,
About binomial theorem I’m teeming with a lot o’ news,
With many cheerful facts about the square of the hypotenuse.

I’m very good at integral and differential calculus;
I know the scientific names of beings animalculous:
In short, in matters vegetable, animal, and mineral,
I am the very model of a modern Major-General.

Here are a few lines from George Box’s version.

I relentlessly uncover any aberrant contingency
I strangle it with rigor and stifle it with stringency
I understand the different symbols be they Roman, Greek, or cuneiform
And every distribution from the Cauchy to the uniform.

With derivation rigorous each lemma I can justify
My every estimator I am careful to robustify
In short in matters logical, mathematical, idealistical
I am the very model of a professor statistical.

Gilbert & Sullivan have come up on this blog a couple other times:

George Box has come up too, but only once. (I’m surprised he hasn’t come up more; I should rectify that.) This post has a great quote from Box: “To find out what happens to a system when you interfere with it, you have to interfere with it (and not just passively observe it).”

Sum of squares mod n uniformly distributed

Posted on 5 October 2022 by John

Let s be an integer equal to at least 5 and let n be a positive integer.

Look at all tuples of s integers, each integer being between 0 and n − 1 inclusive, and look at the sum of their squares mod n. About 1/n will fall into each possible value.

So, for example, if you look at a lists of 6 digits, sum the squares of the digits in each list, then about 1/10 of the results will end in 0, about 1/10 will end in 1, about 1/10 will end in 2, etc.

This is the gist of a theorem discussed on Math Overflow here.

And here is Python code to demonstrate it.

    from itertools import product

    n = 10
    s = 6

    counts = [0]*n
    for p in product(range(n), repeat=s):
        counts[ sum(x**2 for x in p) % n ] += 1

    print(counts)

The result is

    [103200, 99200, 99200, 99200, 99200, 
     103200, 99200, 99200, 99200, 99200]

and so between 99,200 and 103,200 sums end in each digit.

Let’s run it again with n = 11 and s = 5. Now we get

    [14641, 14762, 14520, 14762, 14762, 14762, 
     14520, 14520, 14520, 14762, 14520]

which shows the counts are between 14,520 and 14,762. If they were perfectly uniformly distributed we’d expect 11⁴ = 14,641 in each bin. The maximum deviation from a uniform distribution is 0.83%.

The code becomes impractical as s or n get larger, but we can try s = 7 and keep n = 11. Then we get a maximum deviation from uniformity of 0.075%, i.e. about 10 times closer to being uniform.

Quirks in Mathematica’s administrative division data for Mexico

Posted on 5 October 2022 by John

If you ask Mathematica for a list of Mexican states via

    CountryData["Mexico", "RegionNames"]

you will get a list of strings:

    "Aguascalientes", "Baja California", ..., "Zacatecas"}

However, when you try to turn this into a list of objects representing these states via

    states = Entity["AdministrativeDivision", {#, "Mexico"}] & /@ 
                 CountryData["Mexico", "RegionNames"]

something strange happens. Some items in the list are turned into useful objects, and some are uninterpreted symbols.

For example, Aguascalientes is recognized as an administrative division, but Baja California is not. It recognizes Oaxaca but not Nuevo León. The pattern is that states with a space are not recognized. There is an inconsistency in Mathematica: output names do not always match input names. To create the object representing Baja California, you need to pass in the string BajaCalifornia with no space.

    Entity["AdministrativeDivision", {"BajaCalifornia", "Mexico"}]

OK, so let’s remove spaces before we try to create a list of geographic objects.

    names = StringReplace[#, " " -> ""] & /@ 
               CountryData["Mexico", "RegionNames"]

This mostly works, but it trips up on Mexico City. The output name for the region is Ciudad de México, but Mathematica does not recognize CiudaddeMéxico as an administrative division. Mathematica does recognize MexicoCity as the name of a city but not as the name of an administrative division.

Changing CiudaddeMéxico to MexicoCity in the list of names did not fix the problem. But when I directly edited the list of state objects by replacing the uninterpreted value with the output running

    Entity["AdministrativeDivision", {"MexicoCity", "Mexico"}]

by itself everything worked. Then I was able to find a Traveling Salesman tour as in earlier posts (Africa, Americas, Eurasia and Oceania, Canada).

Traveling Salesman tour of Mexico

The tour is

Baja California
Baja California Sur
Sinaloa
Durango
Zacatecas
Aguascalientes
Nayarit
Jalisco
Colima
Michoacán
México
Mexico City
Morelos
Guerrero
Oaxaca
Chiapas
Tabasco
Campeche
Quintana Roo
Yucatán
Veracruz
Puebla
Tlaxcala
Hidalgo
Querétaro
Guanajuato
San Luis Potosí
Tamaulipas
Nuevo León
Coahuila
Chihuahua
Sonora

The tour is 8,343 kilometers.

Visualizing Swedish vowels

Posted on 4 October 2022 by John

A few days ago I wrote a post comparing English and Japanese vowel sounds in a 2D chart. In this post I’d like to do something similar for English and Swedish. As before the data come from [1].

A friend of mine who learned Swedish would joke about how terribly he had to contort his mouth to speak the language. Swedish vowels are objectively difficult for non-native speakers as can be seek in a vowel chart. The vertical axis runs from closed sounds on top to open sounds on the bottom. The horizontal axis runs from front vowels on the left to back vowels on the right.

Swedish vowel sounds

There are a lot of vowel sounds, and many of them are clustered close together. Japanese, by contrast, has only five vowel sounds, and they’re widely spread apart.

Japanese vowel sounds

The vowel charts for Spanish and Hebrew look fairly similar to the chart for Japanese above: five vowels spread out in roughly the same locations.

It wouldn’t matter so much that Swedish has a lot of tightly clustered vowel sounds if your native language has the same sounds, but the following chart shows that English and Swedish vowels are quite different. The red x’s mark English vowel locations and the blue dots mark Swedish vowels.

Swedish and English vowel sounds

[1] Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, 2021.

A traveling salesman tour of Canada

Posted on 4 October 2022 by John

Here is a Traveling Salesman tour of Canada’s provinces and territories created by Mathematica. This is the shortest path connecting the geographic centers of the regions.

Here is a much larger (4.5 MB) PDF file of the same map with higher resolution.

Starting in the northwest, the tour is

Yukon
Northwest Territories
Nunavut
Quebec
Newfoundland and Labrador
Prince Edward Island
Nova Scotia
New Brunswick
Ontario
Manitoba
Saskatchewan
Alberta
British Columbia

The tour is 11,070 km.

For more tours like this, see my earlier posts on tours of

Update: Here is an adjacency network for Canadian provinces and territories.

This is an SVG image so you can scale it to make it easier to read if you’d like.

Month: October 2022

Efficiently testing a black box

Full factorial design

Fractional factorial design

Replications

Verification

Related services

Which one is subharmonic?

Related posts

Ideograph numerals

Consecutive code points

Parenthesized and circled ideographs

CJK Unified Ideographs

Unicode numbers

Related posts

Regex to match ICD-11 code

Related posts

The Very Model of a Professor Statistical

Sum of squares mod n uniformly distributed

Quirks in Mathematica’s administrative division data for Mexico

Traveling Salesman tour of Mexico

Visualizing Swedish vowels

A traveling salesman tour of Canada