ADFGVX cipher and Morse code separation

A century ago the German army used a field cipher that transmitted messages using only six letters: A, D, F, G, V, and X. These letters were chosen because their Morse code representations were distinct, thus reducing transmission error.

The ADFGVX cipher was an extension of an earlier ADFGV cipher. The ADFGV cipher was based on a 5 by 5 grid of letters. The ADFGVX extended the method to a 6 by 6 grid of letters and digits. A message was first encoded using the grid coordinates of the letters, then a transposition cipher was applied to the sequence of coordinates.

This post revisits the design of the ADFGVX cipher. Not the encryption method itself, but the choice of letters used for transmission. How would you quantify the difference between two Morse code characters? Given that method of quantification, how good was the choice of ADFGV or its extension ADFGVX? Could the Germans have done better?

Quantifying separation

There are several possible ways to quantify how distinct two Morse code signals are.

Time signal difference

My first thought was to compare the signals as a function of time.

There are differing conventions for how long a dot or dash should be, and how long the space between dots and dashes should be. For this post, I will assume a dot is one unit of time, a dash is three units of time, and the space between dots or dashes is one unit of time.

The letter A is represented by a dot followed by a dash. I’ll represent this as 10111: on for one unit of time for the dot, off for one unit of time for the space between the dot and the dash, and on for three units of time for the dash. D is dash dot dot, so that would be 1110101.

We could quantify the difference between two letters in Morse code as the Hamming distance between their representations as 0s and 1s, i.e. the number of positions in which the two letters differ. To compare A and D, for example, I’ll pad the A with a couple zeros on the end to make it the same length as D.

    A: 1011100
    D: 1110101
        x x  x

The distance is 3 because the two sequences differ in three positions. (Looking back at the previous post on popcount, you could compute the distance as the popcount of the XOR of the two bit patterns.)

A problem with this approach is that it seems to underestimate the perceived difference between F and G.

    F: ..-. 1010111010
    G: --.  1110111010
             x

These only differ in the second bit, but they sound fairly different.

Symbolic difference

The example above suggests maybe we should compare the sequence of dots and dashes themselves rather than compare their corresponding time signals. By this measure F and G are distance 4 apart since they differ in every position.

Other possibilities

Comparing the symbol difference may over-estimate the difference between U (..-) and V (...-). We should look at some combination of time signal difference and symbolic difference.

Or maybe the thing to do would be to look at something like the edit distance between letters. We could say that U and V are close because it only takes inserting a dot to turn a U into a V.

Was ADFGV optimal?

There are several choices of letters that would have been better than ADFGV by either way of measuring distance. For example, CELNU has better separation and takes about 14% less time to transmit than ADFGV.

Here are a couple tables that give the time distance (dT) and the symbol distance (dS) for ADFGV and for CELNU.

    |------+----+----|
    | Pair | dT | dS |
    |------+----+----|
    | AD   |  3 |  3 |
    | AF   |  4 |  3 |
    | AG   |  5 |  2 |
    | AV   |  4 |  3 |
    | DF   |  3 |  3 |
    | DG   |  2 |  1 |
    | DV   |  3 |  2 |
    | FG   |  1 |  4 |
    | FV   |  2 |  2 |
    | GV   |  3 |  3 |
    |------+----+----|
    
    |------+----+----|
    | Pair | dT | dS |
    |------+----+----|
    | CE   |  7 |  4 |
    | CL   |  4 |  3 |
    | CN   |  4 |  2 |
    | CU   |  5 |  2 |
    | EL   |  5 |  3 |
    | EN   |  3 |  2 |
    | EU   |  4 |  2 |
    | LN   |  4 |  4 |
    | LU   |  3 |  3 |
    | NU   |  3 |  2 |
    |------+----+----|

For six letters, CELNOU is faster to transmit than ADFGVX. It has a minimum time distance separation of 3 and a minimum symbol distance 2.

Time spread

This is an update in response to a comment that suggested instead of minimizing transmission time of a set of letters, you might want to pick letters that are most similar in transmission time. It takes much longer to transmit C (-.-.) than E (.), and this could make CELNU harder to transcribe than ADFGV.

So I went back to the script I’m using and added time spread, the maximum transmission time minus the minimum transmission time, as a criterion. The ADFGV set has a spread of 4 because V takes 4 time units longer to transmit than A. CELNU has a spread of 10.

There are 210 choices of 5 letters that have time distance greater than 1, symbol distance greater than 1, and spread equal to 4. That is, these candidates are more distinct than ADFGV and have the same spread.

It takes 44 time units to transmit ADFGV. Twelve of the 210 candidates identified above require 42 or 40 time units. There are five that take 40 time units:

  • ABLNU
  • ABNUV
  • AGLNU
  • AGNUV
  • ALNUV

Looking at sets of six letters, there are 464 candidates that have better separation than ADFGVX and equal time spread. One of these, AGLNUX, is an extension of one of the 5-letter candidates above.

The best 6-letter are ABLNUV and AGLNUV. They are better than ADFGVX by all the criteria discussed above. They both have time distance separation 2 (compared to 1), symbol distance separation 2 (compared to 1), time spread 4, (compared to 6) and transmission time 50 (compared to 56).

More Morse code posts

More on attacking combination locks

A couple weeks ago I wrote about how De Bruijn sequences can be used to attack locks where there is no “enter” key, i.e. the lock will open once the right symbols have been entered.

Here’s a variation on this theme: what about locks that let you press more than one button at a time? [1]

Lock that lets you push more than one button at a time

You could just treat this as if it were a keypad with more buttons. For example, suppose you can push either one or two buttons at a time on the lock pictured above. Then you could treat this as a lock with 15 buttons: the five actual buttons and 10 virtual buttons corresponding to the ten ways you could choose 2 buttons out of 5 to press at the same time.

Maybe a lock would let you press even more buttons at once. I don’t think I’ve ever used a combination that required more than two buttons at once, but in principle you could push up to five buttons at once in the photo above, for a total of 31 possible button combinations. And since 31 is prime, you could use the algorithm here to generate the corresponding De Bruijn sequence.

If you know how long the password is, you can try the De Bruijn sequence for passwords of that length. But what if you don’t know the length of the password a priori? You could try the De Bruijn sequence for passwords of length 1, then length 2, then length 3, etc. But is there a more efficient way?

If there are k button combinations and the password has length n, then the optimal solution is to start entering a De Bruijn sequence B(k, n) until the lock opens. If you know the password has length no more than 4, then you could try a B(k, 4) sequence, and if the password is actually shorter than 4, say length 3, you’ll still open it.

But what if you’ve tried a B(k, 4) sequence and it turns out the password has length 5? You could do better than starting over with a B(k, 5) sequence, because some of the substrings in that B(k, 5) sequence will have already been tried. But how could you do this systematically? If you don’t know the length of the password, how could you do better than trying B(k, 1), then B(k, 2), then B(k, 3) etc.?

Related posts

[1] For this post, I’m assuming the lock will open as soon as you enter the right sequence of buttons. For example, if the pass code is 345 and you enter 12345 the lock will open. I don’t know whether these locks work that way. Maybe you have to turn the handle, which would effectively act as an enter key. But maybe there’s a way to listen to the lock so that you’ll know when the combination has been entered, before you twist the handle.

Update: According to the first comment below, locks of the kind featured in the photo only allow each button to be used once. That puts severe limits on the number of possible combinations, and the approach outlined here would be unnecessary. However, the post brings up more general questions that could be useful in another setting.

Linear feedback shift registers

The previous post looked at an algorithm for generating De Bruijn sequences B(k, n) where k is a prime number. These are optimal sequences that contain every possible consecutive sequence of n symbols from an alphabet of size k. As noted near the end of the post, the case k = 2 is especially important in application, i.e. binary sequences.

If we set k = 2, the generating algorithm is an example of a linear feedback shift register (LFSR) sequence.

Here’s the algorithm from the previous post:

  1. If (x1, x2, …, xn ) = (0,0, …, 0), return c1.
  2. If (x1, x2, …, xn ) = (1,0, …, 0), return 0.
  3. Otherwise return c1x1 + c2x2 + … cnxn mod k.

Note that the last line says to return a linear combination of the previous symbols. That is, we operate on the latest n symbols, saving them to registers. We take the feedback from the previous outputs and compute a linear combination. We then shift the register content over by one and add the new output on the end. Hence the name “linear feedback shift register” sequence.

Note that the LFSR algorithm is a linear operator over the field GF(2), except for the special cases in steps 1 and 2. The algorithm couldn’t be entirely linear because it would get stuck; It would produce nothing but zeros forevermore once it encountered an input sequence of all zeros. So technically a LFSR is an “nearly always linear feedback shift register.” It’s linear for 2n – 2 inputs and nonlinear for 2 special inputs.

A LFSR is more general than a binary De Bruijn sequence. For some choices of linear coefficients the output is a De Bruijn sequence, but not for others. If you associate the linear coefficients with polynomial coefficient (with a sign change, as noted in the previous post) then the LFSR sequence is a De Bruijn sequence if and only if the polynomial is a primitive polynomial of degree n.

A few months ago I wrote about LFSRs in the context of random number generation. LFSRs make very efficient random number generators, but they’re not appropriate for cryptography because their linear structure makes them vulnerable to attack. The idea of shrinking generators is use one random number generator to sample another generator. The two generators can both be LFSRs, but the combined generator is nonlinear because the sampling mechanism is nonlinear. The net result is that you can combine two efficient but insecure generators to create a new generator that is efficient and secure.

Related posts

Generating De Bruijn cycles with primitive polynomials

Last week I wrote about a way to use De Bruijn cycles. Today I’ll write about a way to generate De Bruijn cycles.

De Bruijn cycles

Start with an alphabet of k symbols. B(k, n) is the set of cycles of that contain every sequence of n symbols, and is as short as possible. Since there are kn possible sequences of n symbols, and every one corresponds to some starting position in the De Bruijn cycle, an element of B(k, n) has to have at least kn symbols. In fact, the elements of B(k, n) have exactly kn symbols.

It’s not obvious that B(k, n) should be non-empty for all k and n, but it is. And there are algorithms that will take a k and an n and return an element of B(k, n).

The post from last week gave the example of a sequence in B(4, 3) that contains all triples of DNA base pairs:

AAACAAGAATACCACGACTAGCAGGAGTATCATGATTCCCGCCTCGGCGTCTGCTTGGGTGTTT

Generating De Bruijn cycles

When k is a prime number, i.e. we’re working over an alphabet with a prime number of symbols, it is particularly simple generate De Bruijn sequences [1]. For example, let’s work over the alphabet {0, 1, 2}. Then the following code will produce the next symbol in a sequence in B(3, 4).

    def B34(a,b,c,d):
        if (a,b,c,d) == (0,0,0,0):
            return 1
        if (a,b,c,d) == (1,0,0,0):
            return 0
        return (a+b) % 3

We can initialize the sequence wherever we like since it produces a cycle, but if we start with 0000 we get the following:

000010011012110021020122101011112220112120002002202122001201021120202222111022121

You can verify that every sequence of four elements from {0, 1, 2} is in there somewhere, provided you wrap the end around. For example, 1000 can be found by starting in the last position.

Where did the algorithm above come from? How would you create an analogous algorithm for other values of k and n?

The algorithm goes back to Willem Mantel in 1894, and can be found, for example, in Knuth’s TAOCP Volume 4. Here is Mantel’s algorithm to generate an element of B(k, n) where k is prime. The function takes the latest n symbols in the De Bruijn cycle and returns the next symbol.

  1. If (x1, x2, …, xn ) = (0,0, …, 0), return c1.
  2. If (x1, x2, …, xn ) = (1,0, …, 0), return 0.
  3. Otherwise return c1x1 + c2x2 + … cnxn mod k.

In our example above, c1 = c2 = 1 and c3 = c4 = 0, but how do you find the c‘s in general?

Primitive polynomials

To find the c‘s, first find a primitive polynomial of degree n over the prime field with k elements. Then the c‘s are the coefficients of the polynomial, with a sign change, if you write the polynomial in the following form.

x^n - c_n x^{n-1} - \cdots - c_2x - c_1

In the example above, I started with the polynomial

x^4 + 2x + 2

We can read the coefficients off this polynomial: c1 = c2 = -2 and c3 = c4 = 0. Since -1 and 2 are the same working mod 3, I used c1 = c2 = 1 above.

Backing up a bit, what is a primitive polynomial and how do you find them?

A primitive polynomial of degree n with coefficients in GF(k), the finite field with k elements, has leading coefficient 1 and has a root α that generates the multiplicative group of GF(kn). That is, every nonzero element of GF(kn) can be written as a power of α.

In my example, I found a primitive polynomial in GF(34) by typing polynomials into Mathematica until I found one that worked.

    In[1]:= PrimitivePolynomialQ[x^4 + 2 x + 2, 3]

    Out[1]= True

Since coefficients can only be 0, 1, or 2 when you’re working mod 3, it only took a few guesses to find a primitive polynomial [2].

Brute force guessing works fine k and n are small, but clearly isn’t practical in general. There are algorithms for searching for primitive polynomials, but I’m not familiar with them.

The case where k = 2 and n may be large is particularly important in applications, and you can find where people have tabulated primitive binary polynomials, primitive polynomials in GF(2n). It’s especially useful to find primitive polynomials with a lot of zero coefficients because, for example, this leads to less computation when producing De Bruijn cycles.

Finding sparse primitive binary polynomials is its own field of research. See, for example, Richard Brent’s project to find primitive binary trinomials, i.e. primitive binary polynomials with only three non-zero coefficients.

More on binary sequences in the next post on linear feedback shift registers.

***

[1] The same algorithm can be used when k is not prime but a prime power because you can equate sequence of length n from an alphabet with k = pm elements with a sequence of length mn from an alphabet with p elements. For example, 4 is not prime, but we could have generated a De Bruijn sequence for DNA basepair triples by looking for binary sequences of length 6 and using 00 = A, 01 = C, 10 = G, and 11 = T.

[2] The number of primitive polynomials in GF(pn) is φ(pn – 1)/m where φ is Euler’s totient function. So in our case there were 8 polynomials we could have found.

Golay codes

Photo of Jupiter taken by Voyager

Suppose you want to sent pictures from Jupiter back to Earth. A lot could happen as a bit travels across the solar system, and so you need some way of correcting errors, or at least detecting errors.

The simplest thing to do would be to transmit photos twice. If a bit is received the same way both times, it’s likely to be correct. If a bit was received differently each time, you know one of them is wrong, but you don’t know which one. So sending the messages twice doesn’t let you correct any errors, but it does let you detect some errors.

There’s a much better solution, one that the Voyager probes actually used, and that is to use a Golay code. Take the bits of your photo in groups of 12, think of each group as a row vector, and multiply it by the following matrix, called the generator matrix:

100000000000101000111011
010000000000110100011101
001000000000011010001111
000100000000101101000111
000010000000110110100011
000001000000111011010001
000000100000011101101001
000000010000001110110101
000000001000000111011011
000000000100100011101101
000000000010010001110111
000000000001111111111110 

(You might see different forms of the generator matrix in different publications.)

That’s the (extended binary) Golay code in a nutshell. It takes groups of 12 bits and returns groups of 24 bits. It doubles the size of your transmission, just like transmitting every image twice, but you get more bang for your bits. You’ll be able to correct up to 3 corrupted bits per block of 12 and detect more.

Matrix multiplication here is done over the field with two elements. This means multiplication and addition of 0’s and 1’s works as in the integers, except 1 + 1 = 0.

Each pair of rows in the matrix above differ in 8 positions. The messages produced by multiplication are linear combinations of these rows, and they each differ in at least 8 positions.

When you receive a block of 24 bits, you know that there has been some corruption if the bits you receive are not in the row space of the matrix. If three or fewer bits have been corrupted, you can correct for the errors by replacing the received bits by the closest vector in the row space of the matrix.

If four bits have been corrupted, the received bits may be equally close to two different possible corrections. In that case you’ve detected the error, but you can’t correct it with certainty. If five bits are corrupted, you’ll be able to detect that, but if you attempt to correct the errors, you could mistakenly interpret the received bits to be a corruption of three bits in another vector.

Going deeper

In one sense, Golay codes are very simple: just multiply by a binary matrix. But there’s a lot going on beneath the surface. Where did the magic matrix above come from? It’s rows differ in eight positions, but how do you know that all linear combinations for the rows also differ in at least eight positions? Also, how would you implement it in practice? There are more efficient approaches than to directly carry out a matrix multiplication.

To give just a hint of the hidden structure in the generator matrix, split the matrix half, giving an identity matrix on the left and a matrix M on the right.

101000111011
110100011101
011010001111
101101000111
110110100011
111011010001
011101101001
001110110101
000111011011
100011101101
010001110111
111111111110

The 1’s along the last row and last column have to do with why this is technically an “extended” Golay code: the “perfect” Golay code has length 23, but an extra check sum bit was added. Here “perfect” means optimal in a sense that I may get into in another post. (Update: See this post.) Let’s strip off the last row and last column.

10100011101
11010001110
01101000111
10110100011
11011010001
11101101000
01110110100
00111011010
00011101101
10001110110
01000111011

The first row corresponds to quadratic residues mod 11. That is, if you number the columns starting from 0, the zeros are in columns 1, 3, 4, 5, and 9. These are the non-zero squares mod 11. Also, the subsequent rows are rotations of the first row.

Golay codes are practical for error correction—they were used to transmit the photo at the top of the post back to Earth—but they also have deep connections to other parts of math, including sphere packing and sporadic groups.

Update: Ternary Golay codes

Related posts

Detecting typos with the four color theorem

In my previous post on VIN numbers, I commented that if a check sum has to be one of 11 characters, it cannot detect all possible changes to a string from an alphabet of 33 characters. The number of possible check sum characters must be at least as large as the number of possible characters in the string.

Now suppose you wanted to create a check sum for text typed on a computer keyboard. You want to detect any change where a single letter was wrongly typed by using an adjacent letter.

You don’t need many characters for the check sum because you’re not trying to detect arbitrary changes, such as typing H for A on a QWERTY keyboard. You’re only trying to detect, for example, if someone typed Q, W, S, or Z for A. In fact you would only need one of five characters for the check sum.

Here’s how to construct the check sum. Think of the keys of the keyboard as a map, say by drawing boundaries through the spaces between the keys. By the four color theorem, you can assign the numbers 0, 1, 2, and 3 to each key so that no two adjacent keys have the same number. Concatenate all these digits and interpret the result as a base 4 number. Then take the remainder when the number is divided by 5. That’s your check sum. As proved here, this will detect any typo that hits an adjacent key. It will also detect transpositions of adjacent keys.

Note that this doesn’t assume anything about the particular keyboard. You could have any number of keys, and the keys could have any shape. You could even define “adjacent” in some non-geometrical way as long as your adjacency graph is planar.

Vehicle Identification Number (VIN) check sum

VIN number on an old car

A VIN (vehicle identification number) is a string of 17 characters that uniquely identifies a car or motorcycle. These numbers are used around the world and have three standardized formats: one for North America, one for the EU, and one for the rest of the world.

Letters that resemble digits

The characters used in a VIN are digits and capital letters. The letters I, O, and Q are not used to avoid confusion with the numerals 0, 1, and 9. So if you’re not sure whether a character is a digit or a letter, it’s probably a digit.

It would have been better to exclude S than Q. A lower case q looks sorta like a 9, but VINs use capital letters, and an S looks like a 5.

Check sum

The various parts of a VIN have particular meanings, as documented in the Wikipedia article on VINs. I want to focus on just the check sum, a character whose purpose is to help detect errors in the other characters.

Of the three standards for VINs, only the North American one requires a check sum. The check sum is in the middle of the VIN, the 9th character.

Algorithm

The scheme for computing the check sum is both complicated and weak. The end result is either a digit or an X. There are 33 possibilities for each character (10 digits + 23 letters) and 11 possibilities for a check sum, so the check sum cannot possibly detect all changes to even a single character.

The check sum is computed by first converting all letters to digits, computing a weighted sum of the 17 digits, and taking the remainder by 11. The weights for the 17 characters are

8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8 ,7 ,6, 5, 4, 3, 2

I don’t see any reason for these weights other than that adjacent weights are different, which is enough to detect transposition of consecutive digits (and characters might not be digits). Maybe the process was deliberately complicated in an attempt to provide a little security by obscurity.

Historical quirk

There’s an interesting historical quirk in how the letters are converted to digits: each letter is mapped to a digit with in a somewhat complicated way. More on that here. It has to so with gaps in the EBCDIC values of letters. Letters are not mapped to their EBCDIC values per se, but there are jumps that are explained by corresponding jumps in EBCDIC.

EBCDIC?! Why not ASCII? Both EBCDIC and ASCII go back to 1963. VINs date back to 1954 in the US but were standardized in 1981. Presumably the check sum algorithm using EBCDIC digits became a de facto standard before ASCII was ubiquitous.

A better check sum

Any error detection scheme that uses 11 characters to detect changes in 33 characters is necessarily weak.

A much better approach would be to use a slight variation on the check sum algorithm Douglas Crockford recommended for base 32 encoding described here. Crockford says to take a string of characters from an alphabet of 32 characters, interpret it as a base 32 number, and take the remainder by 37 as the check sum. The same algorithm would work for an alphabet of 33 characters. All that matters is that the number of possible characters is less than 37.

Since the check sum is a number between 0 and 36 inclusive, you need 37 characters to represent it. Crockford recommended using the symbols *, ~, $, =, and U for extra symbols in his base 32 system. His system didn’t use U, and VIN numbers do. But we only need four more characters, so we could use *, ~, $, and =.

The drawback to this system is that it requires four new symbols. The advantage is that any change to a single character would be detected, as would any transposition of adjacent characters. This is proved here.

Related posts

How UTF-8 works

UTF-8 is a clever way of encoding Unicode text. I’ve mentioned it a couple times lately, but I haven’t blogged about UTF-8 per se. Here goes.

The problem UTF-8 solves

US keyboards can often produce 101 symbols, which suggests 101 symbols would be enough for most English text. Seven bits would be enough to encode these symbols since 27 = 128, and that’s what ASCII does. It represents each character with 8 bits since computers work with bits in groups of sizes that are powers of 2, but the first bit is always 0 because it’s not needed. Extended ASCII uses the left over space in ASCII to encode more characters.

A total of 256 characters might serve some users well, but it wouldn’t begin to let you represent, for example, Chinese. Unicode initially wanted to use two bytes instead of one byte to represent characters, which would allow for 216 = 65,536 possibilities, enough to capture a lot of the world’s writing systems. But not all, and so Unicode expanded to four bytes.

If you were to store English text using two bytes for every letter, half the space would be wasted storing zeros. And if you used four bytes per letter, three quarters of the space would be wasted. Without some kind of encoding every file containing English test would be two or four times larger than necessary. And not just English, but every language that can represented with ASCII.

UTF-8 is a way of encoding Unicode so that an ASCII text file encodes to itself. No wasted space, beyond the initial bit of every byte ASCII doesn’t use. And if your file is mostly ASCII text with a few non-ASCII characters sprinkled in, the non-ASCII characters just make your file a little longer. You don’t have to suddenly make every character take up twice or four times as much space just because you want to use, say, a Euro sign € (U+20AC).

How UTF-8 does it

Since the first bit of ASCII characters is set to zero, bytes with the first bit set to 1 are unused and can be used specially.

When software reading UTF-8 comes across a byte starting with 1, it counts how many 1’s follow before encountering a 0. For example, in a byte of the form 110xxxxx, there’s a single 1 following the initial 1. Let n be the number of 1’s between the initial 1 and the first 0. The remaining bits in this byte and some bits in the next n bytes will represent a Unicode character. There’s no need for n to be bigger than 3 for reasons we’ll get to later. That is, it takes at most four bytes to represent a Unicode character using UTF-8.

So a byte of the form 110xxxxx says the first five bits of a Unicode character are stored at the end of this byte, and the rest of the bits are coming in the next byte.

A byte of the form 1110xxxx contains four bits of a Unicode character and says that the rest of the bits are coming over the next two bytes.

A byte of the form 11110xxx contains three bits of a Unicode character and says that the rest of the bits are coming over the next three bytes.

Following the initial byte announcing the beginning of a character spread over multiple bytes, bits are stored in bytes of the form 10xxxxxx. Since the initial bytes of a multibyte sequence start with two 1 bits, there’s no ambiguity: a byte starting with 10 cannot mark the start of a new multibyte sequence. That is, UTF-8 is self-punctuating.

So multibyte sequences have one of the following forms.

    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If you count the x’s in the bottom row, there are 21 of them. So this scheme can only represent numbers with up to 21 bits. Don’t we need 32 bits? It turns out we don’t.

Although a Unicode character is ostensibly a 32-bit number, it actually takes at most 21 bits to encode a Unicode character for reasons explained here. This is why n, the number of 1’s following the initial 1 at the beginning of a multibyte sequence, only needs to be 1, 2, or 3. The UTF-8 encoding scheme could be extended to allow n = 4, 5, or 6, but this is unnecessary.

Efficiency

UTF-8 lets you take an ordinary ASCII file and consider it a Unicode file encoded with UTF-8. So UTF-8 is as efficient as ASCII in terms of space. But not in terms of time. If software knows that a file is in fact ASCII, it can take each byte at face value, not having to check whether it is the first byte of a multibyte sequence.

And while plain ASCII is legal UTF-8, extended ASCII is not. So extended ASCII characters would now take two bytes where they used to take one. My previous post was about the confusion that could result from software interpreting a UTF-8 encoded file as an extended ASCII file.

Related posts

National Drug Code (NDC)

The US Food and Drug Administration tracks drugs using an identifier called the NDC or National Drug Code. It is described as a 10-digit code, but it may be more helpful to think of it as a 12-character code.

An NDC contains 10 digits, separated into three segments by two dashes. The three segments are the labeler code, product code, and package code. The FDA assigns the labeler codes to companies, and each company assigns its own product and package codes.

Format

The segments are of variable length and so the dashes are significant. The labeler code could be 4 or 5 digits. The product code could be 3 or 4 digits, and the package code could be 1 or 2 digits. The total number of digits is must be 10, so there are three possible combinations:

  • 4-4-2
  • 5-3-2
  • 5-4-1.

There’s no way to look at just the digits and know how to separate them into three segments. My previous post looked at self-punctuating codes. The digits of NDC codes are not self-punctuating because they require the dashes. The digit combinations are supposed to be unique, but you can’t tell how to parse a set of digits from the digits alone.

Statistics

I downloaded the NDC data from the FDA to verify whether the codes work as documented, and to see the relative frequency of various formats.

(The data change daily, so you may get different results if you do try this yourself.)

Format

All the codes were 12 characters long, and all had the documented format as verified by the regular expression [1]

    \d{4,5}-\d{3,4}-\d{1,2}

Uniqueness exception

I found one exception to the rule that the sequence of digits should be unique. The command

    sed "s/-//g" ndc.txt | sort | uniq -d

returned 2950090777.

The set of NDC codes contained both 29500-907-77 and 29500-9077-7.

Distribution

About 60% of the codes had the form 5-3-2. About 30% had the form 5-4-1, and the remaining 10% had the form 4-4-2.

There were a total of 252,355 NDC codes with 6,532 different lablelers (companies).

There were 9448 NDC codes associated with the most prolific labeler. The 1,424 least prolific labelers had only one DNC code. In Pareto-like fashion, the top 20% of labelers accounted for about 90% of the codes.

Related posts

[1] Programming languages like Python or Perl will recognize this regular expression, but by default grep does not support \d for digits. The Gnu implementation of grep with the -P option will. It will also understand notation like {4,5} to mean a pattern is repeated 4 or 5 times, with or without -P, but I don’t think other implementations of grep necessarily will.

Prefix code examples

In many offices, you can dial a three digit number to reach someone else in the office. In such offices, you usually have to dial 9 to reach an outside number. There’s no ambiguity because no one can have an extension that begins with 9. After you’ve entered three digits, the phone system knows whether you’ve dialed an in-office extension or the first three digits of an outside call.

This is an example of a prefix code: No valid phone number is a prefix of another valid phone number. We’ll look at a few more examples of prefix codes in the context of phone numbers, then look at Roman numerals, Morse code, Unicode encoding, data compression, and RPN calculators.

It used to be true in the US that you could dial four or five digits for a local call, seven digits for a call within the area code, and ten digits for a long distance phone call. This didn’t cause any ambiguity because no local number would begin with the digits of your area code. You had to dial a 1 before dialing a long distance number, and no local or area code numbers begin with 1.

There are still parts of the US where you can dial either a seven digit number or a ten digit number. In most of the US you always enter 10 digits. This is a trivial form of prefix coding because all fixed-length codes are prefix codes.

A final example of prefix codes related to telephony are country calling codes. These codes have varying lengths, but phone exchanges can know when the country code stops and when the number within a country starts because no prefix of a country code is a valid country code.

Prefix codes are sometimes called self-punctuating codes. This is because you don’t need an additional symbol, a form of punctuation, to mark the end of codes.

We usually think of Morse code as a system of two symbols: dots and dashes. But it’s really a system of four symbols: dots, dashes, short space between letters, and longer space between words. As a system of only dots and dashes, Morse code is not self-punctuating. Without spaces, you couldn’t tell, for example, whether dot dash represented an A (dot dash) , or an E (dot) followed by a T (dash). It’s possible to design a prefix code with just two symbols, but that’s not what Morse did.

Q codes are not part of Morse code per se but are often used in Morse code. These are three letter codes beginning with Q. For example, QRP is an abbreviation for the question “Should I reduce power?”. Q codes are prefix codes because they have fixed length. If QR were a valid Q code by itself, that would ruin the prefix property; a recipient would not know whether to interpret QR as a complete code until listening for the next letter.

The letter components of Roman numerals are not a prefix code because you can’t tell whether a letter stands for a positive or negative amount until you read the next letter. For example, in CLX the X represents 10 but in CXL the X represents -10. If you wrote Roman numerals backward, the letters would form a prefix code.

In the previous post, I discussed UTF-16 encoding of Unicode. The way UTF-16 encodes characters outside the Basic Multilingual Plane makes it a prefix code; the meaning of a surrogate doesn’t depend on any down-stream information. UTF-8 is also a prefix code, which I discuss in detail here.

You may have seen Huffman coding, a form of data compression that uses a prefix code.

Reverse Polish Notation is an example of prefix coding. An RPN calculator doesn’t need parentheses for punctuation. You can enter calculations unambiguously with just digits and arithmetic operators because the meaning of a computation does not depend on any future input. Prefix codes are sometimes called instantaneous codes because of this feature.

Related posts