Please update your RSS subscription

RSS icon

When I started this blog I routed my RSS feed through Feedburner, and now Feedburner is no longer working for my site.

If you subscribed by RSS, please check the feed URL. It should be

https://www.johndcook.com/blog/feed

which was previously forwarded to a Feedburner URL. If you subscribe directly to the feed at my domain, it should work. I had some problems with it earlier, but I believe they’ve been fixed.

Also, if you subscribed by email, you went through Feedburner. You won’t receive any more posts by email that way. I’m considering possible replacements.

In the meantime, you can sign up for monthly highlights from the blog via email. Each month I highlight the three or four most popular posts. Once in a while I may add a paragraph about what I’m up to. Short and sweet.

You can also find me on Twitter.

Big O tilde notation

There’s a variation on Landau’s big-O notation [1] that’s starting to become more common, one that puts a tilde on top of the O. At first it looks like a typo, a stray diacritic mark. What does that mean? In short,

O(h(n) logᵏ n) = Õ(h(n))

That is, big O tilde notation ignores logarithmic factors. For example, the FFT algorithm computes the discrete Fourier transform of a sequence of length n in

O(n log n)

steps, and so you could write this as Õ(n). This notation comes up frequently in computational number theory where algorithms often have a mix of polynomial and logarithmic terms.

A couple of days ago I blogged about algorithms for multiplying large numbers. The Schönhage-Strassen algorithm has a run time on the order of

O(n log(n) log(log(n))),

which you could write as simply Õ(n).

Shor’s quantum algorithm for factoring n-bit integers has a run time of

O(n² log(n) log(log(n))),

which we could write as Õ(n²). The fast Euclidean algorithm improves on the ancient Euclidean algorithm by reducing run time from O(n²) down to

O(n log²(n) log(log(n))),

which could be written simply as Õ(n).

The definition at the top of the post says we can ignore powers of the logarithm, but the previous examples contained iterated logarithms. This is permissible because log(x) < x, and so log(log(x)) < log(x): an iterated logarithm grows no faster than a single logarithm, so it too is absorbed into the logarithmic factors that Õ ignores. [2]

Related posts

[1] Big O notation can be confusing at first. For example, the equal sign doesn’t have its standard meaning. For more details, see these notes.

[2] Sometimes in mathematics a superscript on a function is an exponent to be applied to the output, and sometimes it indicates the number of times a function should be iterated. That is, f²(x) could mean f(x)² or f( f(x) ). The former is the convention for logarithms, and we follow that convention here.

Unstructured data is an oxymoron

messy workshop

Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.

Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.

Non-tabular data could mean anything other than tabular data, but in practice it often means text, or it could mean data with a graph structure or some other structure.

More productive discussions

My point here isn’t to quibble over language usage but to offer a constructive suggestion: say what structure data has, not what structure it doesn’t have.

Discussions about “unstructured data” are often unproductive because two people can use the term, with two different ideas of what it means, and think they’re in agreement when they’re not. Maybe an executive and a sales rep shake hands on an agreement that isn’t really an agreement.

Eventually there will have to be a discussion of what structure data actually has rather than what structure it lacks, and to what degree that structure is exploitable. Having that discussion sooner rather than later can save a lot of money.

Free text fields

One form of “unstructured” data is free text fields. These fields are not free of structure. They usually contain prose, written in a particular language, or at most in a small number of languages. That’s a start. There should be more exploitable structure from context. Is the text a pathology report? A Facebook status? A legal opinion?

Clients will ask how to de-identify free text fields. You can’t. If the text is truly free, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.

For example, a program may search for and mask probable names. Such a program would find “Elizabeth” but might fail to find “the queen.” Since there are only a couple of queens [1], this would be a privacy breach. Such software would also have false positives, such as masking the name of the ocean liner Queen Elizabeth 2. [2]
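
To make the idea concrete, here is a toy sketch of name masking. The name list, the regular expression, and the [NAME] placeholder are all made up for illustration; a real de-identification tool would be far more sophisticated.

    import re

    known_names = {"Elizabeth", "Margrethe"}   # hypothetical lookup list

    def mask_names(text):
        # mask capitalized words that appear in the name list
        return re.sub(r"\b[A-Z][a-z]+\b",
                      lambda m: "[NAME]" if m.group() in known_names else m.group(),
                      text)

    print(mask_names("Elizabeth sailed on the Queen Elizabeth 2."))

This masks both occurrences of “Elizabeth,” including the ship’s name, which is exactly the kind of false positive described above, and it would miss “the queen” entirely.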

Related posts

[1] The Wikipedia list of current sovereign monarchs lists only two women, Queen Elizabeth II of the UK and Queen Margrethe II of Denmark.

[2] The ship, also known as QE2, is Queen Elizabeth 2, while the monarch is Queen Elizabeth II.

How fast can you multiply really big numbers?

How long does it take to multiply very large integers? Using the algorithm you learned in elementary school, it takes O(n²) operations to multiply two n-digit numbers. But for large enough numbers it pays to carry out multiplication very differently, using FFTs.

If you’re multiplying integers with tens of thousands of decimal digits, the most efficient algorithm is the Schönhage-Strassen algorithm, which takes

O(n log n  log log n)

operations. For smaller numbers, another algorithm, such as Karatsuba’s algorithm, might be faster. And for impractically large numbers, Fürer’s algorithm is faster.

What is impractically large? Let’s say a number is impractically large if storing it requires more than a billion dollars worth of RAM. If I did my back-of-the-envelope math correctly, you can buy enough RAM to store about 2⁵⁷ bits for about a billion dollars. The Schönhage-Strassen algorithm is more efficient than Fürer’s algorithm for numbers with less than 2⁶⁴ bits.
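
The post doesn’t include code, but you can get a feel for sub-quadratic multiplication from Python itself: CPython’s built-in integers switch from schoolbook multiplication to Karatsuba for large operands. Here’s a minimal timing sketch; the sizes and timing method are arbitrary.

    import random, time

    for digits in [10_000, 20_000, 40_000, 80_000]:
        a = random.randrange(10**(digits - 1), 10**digits)
        b = random.randrange(10**(digits - 1), 10**digits)
        t0 = time.perf_counter()
        a * b
        print(digits, time.perf_counter() - t0)

If the time grows by roughly a factor of 3 each time the number of digits doubles, rather than a factor of 4, that’s Karatsuba’s exponent log₂ 3 ≈ 1.585 showing through.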

Related post: How fast can you multiply matrices?

Stigler’s law and human nature

Stigler’s law of eponymy states that no scientific discovery is named after the first person to discover it. Stephen Stigler acknowledged that he was not the first to realize this.

Of course this is just an aphorism. Sometimes discoveries are indeed named after their discoverers. But the times when this isn’t the case are more interesting.

Stigler’s law reveals something about human nature. The people who get credit for an idea are the ones who make their discovery known. They work in the open rather than in obscurity or secrecy. They may be the first to explain an idea well, or the first to popularize it, or the first to apply it to a problem that people care about. Apparently people value these things more than they value strict accounts of historical priority.

Related post: Stigler’s law and Avogadro’s number

Projecting Unicode to ASCII

Sometimes you need to downgrade Unicode text to more restricted ASCII text. For example, while working on my previous post, I was surprised that there didn’t appear to be an asteroid named after Poincaré. There is one, but it was listed as Poincare in my list of asteroid names.

Python module

I used the Python module unidecode to convert names to ASCII before searching, and that fixed the problem. Here’s a small example showing how the code works.

    import unidecode

    for x in ["Poincaré", "Gödel"]:
        print(x, unidecode.unidecode(x))

This produces

    Poincaré Poincare
    Gödel Godel

Installing the unidecode module also installs a command line utility by the same name. So you could, for example, pipe text to that utility.

As someone pointed out on Hacker News, this isn’t so impressive for Western languages:

But if you need to project Arabic, Russian or Chinese, unidecode is close to black magic:

    >>> from unidecode import unidecode
    >>> unidecode("北亰")
    'Bei Jing '

(Someone has said in the comments that 北亰 is a typo and should be 北京. I can’t say whether this is right, but I can say that unidecode transliterates both to “Bei Jing.”)

Projections

I titled this post “Projecting Unicode to ASCII” because this code is a projection in the mathematical sense. A projection is a function P such that for all inputs x,

P(P(x)) = P(x).

That is, applying the function twice does the same thing as applying the function once. The name comes from projection in the colloquial sense, such as projecting a three dimensional object onto a two dimensional plane. An equivalent term is to say P is idempotent. [1]

The unidecode function maps the full range of Unicode characters into the range 0x00 to 0x7F, and if you apply it to a character already in that range, the function leaves it unchanged. So the function is a projection, or you could say the function is idempotent.
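
Here’s a quick empirical check of the projection property, using a few sample strings of my own choosing:

    from unidecode import unidecode

    for s in ["Poincaré", "Gödel", "北亰", "plain ASCII"]:
        once = unidecode(s)
        twice = unidecode(once)
        assert once == twice   # applying the map a second time changes nothing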

Projection is such a simple condition that it hardly seems worth giving it a name. And yet it is extremely useful. A general principle in user interface design is to make something a projection if the user expects it to be a projection. Users probably don’t have the vocabulary to say “I expected this to be a projection” but they’ll be frustrated if something is almost a projection but not quite.

For example, if software has a button to convert an image from color to grayscale, it would be surprising if (accidentally) clicking the button a second time had any effect. It would be unexpected if it returned the original color image, and it would be even more unexpected if it did something else, such as keeping the image in grayscale but lowering the resolution.

Related posts

[1] The term “idempotent” may be used more generally than “projection,” the latter being more common in linear algebra. Some people may think of a projection as a linear idempotent function. We’re not exactly doing linear algebra here, but people do think of portions of Unicode geometrically, speaking of “planes.”

Asteroids named after mathematicians

asteroid image

This evening I stumbled on the fact that John von Neumann and Fibonacci both have asteroids named after them. Then I wondered how many more famous mathematicians have asteroids named after them. As it turns out, most of them: Euclid, Gauss, Cauchy, Noether, Gödel, Ramanujan, … It’s easier to look at the names that are not assigned to asteroids.

I wrote a little script to search for the 100 greatest mathematicians (according to James Allen’s list) and look for names missing from a list of named asteroids. Here are the ones that were missing, in order of Allen’s list.

So of the top 100 list of mathematicians, 80 have asteroids named after them and the 20 above do not.

Alexandre Grothendieck (11th on James Allen’s list) is the greatest mathematician not to have an asteroid named in his honor.
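
The script itself isn’t reproduced in the post. A minimal sketch of the idea, with both name lists stubbed out for illustration (the real inputs would be Allen’s ranking and the list of named asteroids):

    from unidecode import unidecode

    # abbreviated stand-ins for the real lists
    mathematicians = ["Euclid", "Gauss", "Cauchy", "Noether", "Gödel",
                      "Poincaré", "Grothendieck"]
    asteroid_names = {"Euclid", "Gauss", "Cauchy", "Noether", "Godel", "Poincare"}

    asteroid_ascii = {unidecode(name) for name in asteroid_names}
    missing = [m for m in mathematicians if unidecode(m) not in asteroid_ascii]
    print(missing)   # with these toy lists: ['Grothendieck']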

Why are dates of service prohibited under HIPAA’s Safe Harbor provision?

calendar

The HIPAA Privacy Rule offers two ways to say that data has been de-identified: Safe Harbor and expert determination. This post is about the former. I help companies with the latter.

Safe Harbor provision

The Safe Harbor provision lists 18 categories of data that would cause a data set to not be considered de-identified unless an expert determines the data does not pose a significant re-identification risk.

Some of the items prohibited by Safe Harbor are obvious: telephone number, email address, social security number, etc. Others are not so obvious. In order for data to fall under the Safe Harbor provision, one must remove

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 …

Why are these dates a problem? Birth dates are clearly useful in identifying individuals; when combined with zip code and sex, they give enough information to uniquely identify 87% of Americans. (More details here.) But why admission or discharge dates?

Public information on dates

Latanya Sweeney demonstrated here how dates of hospital admission can be used to identify individuals. She purchased a set of anonymized health records for the state of Washington for 2011 and compared the records to newspaper stories. She simply did a LexisNexis search on the term “hospitalized” to find news stories about people who were hospitalized, then searched the medical records for the personal details given in the newspaper articles.

In the discussion section of her article Sweeney points out that although she searched newspapers, one could get similar information from other sources, such as employee leave records or from a record of someone asking to pay a bill late due to illness.

Randomized dates

There are ways to retain the information in dates without jeopardizing privacy. For example, one could jitter the dates by adding a random offset. However, the way to do this depends on context and can be subtle. For example, Netflix jittered the dates in its Netflix Prize data set by +/- two weeks, but this was not enough to prevent a privacy breach [1]. And if you add too much randomness, the utility of the data degrades. That’s why the HIPAA Privacy Rule includes the provision to obtain expert determination that your procedures are adequate in your context.
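
Here’s a bare-bones illustration of date jittering. The ±14-day window is arbitrary; choosing the right amount and kind of noise is exactly the context-dependent part.

    import random
    from datetime import date, timedelta

    def jitter(d, max_days=14):
        return d + timedelta(days=random.randint(-max_days, max_days))

    print(jitter(date(2011, 3, 15)))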

Related posts

[1] Arvind Narayanan and Vitaly Shmatikov. Robust De-anonymization of Large Sparse Datasets, or How to Break Anonymity of the Netflix Prize Dataset.

Exponential sums in 2019

I’ve made a small change in my exponential sum page. I’ll need to give a little background before explaining the change.

First of all, you can read exactly what these exponential sums are here. These plots can be periodic in two senses. The first is simply repeating the same sequence of points. The second is repeatedly producing the same pattern of points, but shifted. Here are a couple examples from last year.

The sum for New Year’s Eve was periodic in the first sense. Plotting more points would make no difference.

The sum for Christmas Eve was periodic in the latter sense, repeating the pattern 12 times as it moves down the page. If we increased the value of N in the formula that generates the plots, we’d see more repetitions extending further down.

Since 19 is a prime, there will be more exponential sums this year that are periodic in this latter sense. I doubled the value of N so the plot will always display at least two periods.

The fact that 19 is prime also means there will be more intricate plots this year. For example, here’s the plot for March 17, 2019

compared to the plot for March 17, 2018.
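
For readers who want to experiment, here is a rough sketch of how such a plot can be generated. I’m assuming the formula from the exponential sum page, partial sums of exp(2πi(n/m + n²/d + n³/y)) where m is the month, d the day, and y the two-digit year; see that page for the exact definition.

    import numpy as np
    import matplotlib.pyplot as plt

    m, d, y = 3, 17, 19                    # March 17, 2019
    N = 2 * m * d * y                      # enough terms to show the pattern repeating
    n = np.arange(N + 1, dtype=float)
    z = np.exp(2j * np.pi * (n/m + n**2/d + n**3/y))
    s = np.cumsum(z)

    plt.plot(s.real, s.imag)
    plt.axis("equal")
    plt.show()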

There will also be a few simpler but strange plots. The plot for January 19 is unlike anything I’ve seen since starting this page.

Related posts

Goldbach’s conjecture, Lagrange’s theorem, and 2019

The previous post showed how to find all groups whose order is a product of two primes using 2019 as an example. Here are a couple more observations along the same line, illustrating the odd Goldbach conjecture and Lagrange’s four-square theorem with 2019.

Odd Goldbach Conjecture

Goldbach’s conjecture says that every even number greater than 2 can be written as the sum of two primes. The odd Goldbach conjecture, a.k.a. the weak Goldbach conjecture, says that every odd number greater than 5 can be written as the sum of three primes. The odd Goldbach conjecture isn’t really a conjecture anymore because Harald Helfgott proved it in 2013, though the original Goldbach conjecture remains unsettled.

The odd Goldbach conjecture says it should be possible to write 2019 as the sum of three primes. And in fact there are 2,259 ways to write 2019 as a sum of three primes, listing the primes in non-decreasing order.

      3 +   5 + 2011
      3 +  13 + 2003
      3 +  17 + 1999
          ...
    659 + 659 +  701
    659 + 677 +  701
    673 + 673 +  673
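
Here is a short sketch that reproduces the count, using sympy to generate the primes; any prime sieve would do.

    from sympy import primerange

    n = 2019
    primes = list(primerange(2, n))
    prime_set = set(primes)

    count = 0
    for i, a in enumerate(primes):
        if 3 * a > n:
            break
        for b in primes[i:]:
            c = n - a - b
            if c < b:
                break
            if c in prime_set:
                count += 1

    print(count)   # the post reports 2,259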

Lagrange’s four-square theorem

Lagrange’s four-square theorem says that every non-negative integer can be written as the sum of four squares. 2019 is a non-negative integer, so it can be written as the sum of four squares. In fact there are 66 ways to write 2019 as a sum of four squares.

     0  1 13 43
     0  5 25 37
     0  7 11 43
         ...
    16 19 21 31
    17 23 24 25
    19 20 23 27
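
And a corresponding sketch for the four-square count, treating representations as unordered (non-decreasing), just as in the list above.

    from itertools import combinations_with_replacement
    from math import isqrt

    n = 2019
    count = sum(1 for a, b, c, d in combinations_with_replacement(range(isqrt(n) + 1), 4)
                if a*a + b*b + c*c + d*d == n)
    print(count)   # the post reports 66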

Sometimes there is a unique way to write a number as the sum of four squares. The last time a year could be written uniquely as a sum of four squares was 1536, and the next time will be in 2048.

Related posts

Groups of order 2019

How many groups have 2019 elements? What are these groups?

2019 is a semiprime, i.e. the product of two primes, 3 and 673. For every semiprime s, there are either one or two distinct groups of order s.

As explained here, if s = pq with p > q, all groups of order s are isomorphic if q is not a factor of p-1. If q does divide p-1 then there are exactly two non-isomorphic groups of order s, one abelian and one non-abelian. In our case, 3 does divide 672, so there are two groups of order 2019. The first is easy: the cyclic group of order 2019. The second is more complex.

You could take the direct product of the cyclic groups of order 3 and 673, but that turns out to be isomorphic to the cyclic group of order 2019. But if you take the semidirect product you get the other group of order 2019, the non-abelian one.

Semidirect products

Starting with two groups G and H, the direct product G × H is the Cartesian product of G and H with multiplication defined componentwise. The semidirect product of G and H, written [1]

G ⋊ H

also starts with the Cartesian product of G and H but defines multiplication differently.

In our application, G will be the integers mod 673 with addition and H will be a three-element subgroup of the integers mod 673 with multiplication [2]. Let H be the set {1, 255, 417} with respect to multiplication [3]. Note that 1 is its own inverse and 255 and 417 are inverses of each other.

Product

Now the group product of (g₁, h₁) and (g₂, h₂) is defined to be

(g₁ + h₁⁻¹g₂, h₁h₂)

So, for example, the product of (5, 255) and (334, 417) is (5 + 417*334, 255*417) which reduces to (645, 1) working mod 673.

(We haven’t defined the semidirect product in general, but the procedure above suffices to define the semidirect product for any two groups of prime order, and so it is sufficient to find all groups of semiprime order.)

Note that our group is non-abelian. For example, if we reverse the order of multiplication above we get (263, 1).

Identity

The identity element is just the pair consisting of the identity elements from G and H. In our case, this is (0, 1) because 0 is the additive identity and 1 is the multiplicative identity.

Inverse

The inverse of an element (g, h) is given by

(-gh, h⁻¹).

So, for example, the inverse of (600, 255) is (444, 417).

Python code

The goal of this post is to be concrete rather than general.

So to make everything perfectly explicit, we’ll write a little Python code implementing the product and inverse.

    
    def hinv(h):
        if h == 255:
            return 417
        if h == 417:
            return 255
        if h == 1:
            return 1
    
    def prod(x, y):
        g1, h1 = x
        g2, h2 = y
        g3 = (g1 + hinv(h1)*g2) % 673
        h3 = (h1*h2) % 673
        return (g3, h3)
    
    def group_inv(x):
        g, h = x
        return ((-g*h)%673, hinv(h))
    
    x = (5, 255)
    y = (334, 417)
    print(prod(x, y))
    print(prod(y, x))
    print(group_inv((600, 255)))

The following code verifies that our group satisfies the axioms of a group.

    from itertools import product
    
    identity = (0, 1)
    h_list = [1, 255, 417]
    
    def elem(x):
        g, h = x
        g_ok = (0 <= g <= 672)
        h_ok = (h in h_list)
        return (g_ok and h_ok)
    
    # materialize the group as a list; itertools.product returns an iterator
    # that would otherwise be exhausted after the first pass over it
    group = list(product(range(673), h_list))
    assert(len(group) == 2019)
    
    # closed under multiplication
    for x in group:
        for y in group:
            assert( elem(prod(x, y)) )
    
    # multiplication is associative (note: this triple loop checks 2019**3 products, so it takes a long time)
    for x in group:
        for y in group:
            for z in group:
                xy_z = prod(prod(x, y),z)
                x_yz = prod(x, prod(y,z))
                assert(xy_z == x_yz)
    
    # identity acts like it's supposed to
    for x in group:
        assert( prod(x, identity) == x )
        assert( prod(identity, x) == x )
    
    # every element has an inverse
    for x in group:
        ginv = group_inv(x)
        assert( elem(ginv) )
        assert( prod(x, ginv) == identity )
        assert( prod(ginv, x) == identity )

Related posts

[1] The symbol for semidirect product is ⋊. It’s U+22CA in Unicode and \rtimes in LaTeX.

[2] In general, the semidirect product depends on a choice of an action of the group H on the group G. Here the action is multiplication by an element of H. Different actions can result in different groups. Sometimes the particular choice of action is made explicit as a subscript on the ⋊ symbol.

[3] How did I find these numbers? There are 672 non-zero numbers mod 673, so I picked a number, it happened to be 5, and raised it to the powers 672/3 and 2*672/3.
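
For what it’s worth, footnote [3]’s computation is one line of Python each. With base 5 these powers give exactly the two non-identity elements used above; a different base could give 1, in which case you’d pick another.

    print(pow(5, 672 // 3, 673))       # 255
    print(pow(5, 2 * 672 // 3, 673))   # 417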

 

Flattest US states

US physical map

I read somewhere that, contrary to popular belief, Kansas is not the flattest state in the US. Instead, Florida is the flattest, and Kansas is several notches further down the list.

(Update: Nevertheless, Kansas is literally flatter than a pancake. Thanks to Andreas Krause for the link.)

How would you measure the flatness of a geographic region? The simplest approach would be to look at the range of elevation from the highest point to the lowest point, so that’s what I did. Here are elevations and differences for each state, measured in feet.

|----------------+-------+------+-------|
| State          |  High |  Low |  Diff |
|----------------+-------+------+-------|
| Florida        |   345 |    0 |   345 |
| Delaware       |   450 |    0 |   450 |
| Louisiana      |   535 |   -7 |   542 |
| Mississippi    |   807 |    0 |   807 |
| Rhode Island   |   814 |    0 |   814 |
| Indiana        |  1257 |  322 |   935 |
| Illinois       |  1237 |  279 |   958 |
| Ohio           |  1549 |  456 |  1093 |
| Iowa           |  1670 |  479 |  1191 |
| Wisconsin      |  1952 |  581 |  1371 |
| Michigan       |  1982 |  571 |  1411 |
| Missouri       |  1772 |  230 |  1542 |
| Minnesota      |  2303 |  600 |  1703 |
| New Jersey     |  1804 |    0 |  1804 |
| Connecticut    |  2382 |    0 |  2382 |
| Alabama        |  2405 |    0 |  2405 |
| Arkansas       |  2756 |   56 |  2700 |
| North Dakota   |  3507 |  751 |  2756 |
| Pennsylvania   |  3215 |    0 |  3215 |
| Kansas         |  4042 |  679 |  3363 |
| Maryland       |  3363 |    0 |  3363 |
| Massachusetts  |  3491 |    0 |  3491 |
| South Carolina |  3563 |    0 |  3563 |
| Kentucky       |  4144 |  256 |  3888 |
| Vermont        |  4396 |   95 |  4301 |
| Nebraska       |  5427 |  840 |  4587 |
| West Virginia  |  4865 |  240 |  4625 |
| Oklahoma       |  4977 |  289 |  4688 |
| Georgia        |  4787 |    0 |  4787 |
| Maine          |  5269 |    0 |  5269 |
| New York       |  5348 |    0 |  5348 |
| Virginia       |  5732 |    0 |  5732 |
| South Dakota   |  7247 |  968 |  6279 |
| New Hampshire  |  6293 |    0 |  6293 |
| Tennessee      |  6647 |  177 |  6470 |
| North Carolina |  6690 |    0 |  6690 |
| Texas          |  8753 |    0 |  8753 |
| New Mexico     | 13169 | 2844 | 10325 |
| Wyoming        | 13812 | 3100 | 10712 |
| Montana        | 12808 | 1801 | 11007 |
| Colorado       | 14439 | 3314 | 11125 |
| Oregon         | 11247 |    0 | 11247 |
| Utah           | 13527 | 2001 | 11526 |
| Idaho          | 12667 |  712 | 11955 |
| Arizona        | 12631 |   69 | 12562 |
| Nevada         | 13146 |  479 | 12667 |
| Hawaii         | 13806 |    0 | 13806 |
| Washington     | 14419 |    0 | 14419 |
| California     | 14505 | -282 | 14787 |
| Alaska         | 20335 |    0 | 20335 |
|----------------+-------+------+-------|

By the measure used in the table above, Florida is about 10 times flatter than Kansas.

The centroid of the continental US is in Kansas, and if you look at the center of the map above, you’ll see that there’s an elevation change across Kansas.

If I remember correctly, the source I saw said that Kansas was something like 9th, though in the table above it’s 20th. Maybe that source measured flatness differently. If you had a grid of measurements throughout each state, you could compute something like a Sobolev norm that accounts for gradients.

Naked eye view vs photos of the northern lights

My daughter Elizabeth recently photographed the northern lights (aurora borealis) in Tromsø, Norway, about 3° above the Arctic Circle.

Northern lights, Tromsø, 2018

I haven’t seen the northern lights in person, and I didn’t know until she told me that the lights appear gray to the naked eye, like smoke. Sometimes the lights have a hint of color, but the color is nowhere near as intense as in photographs. Photos like the one above are not false color images. The green light is really there, even though it is not nearly as vivid to the naked eye as it is to a camera.

The reason the aurora appear gray is that human eyes don’t see in color well in dim light. At very low light levels, our vision is based entirely on rods, photoreceptor cells that are very sensitive to light but not to color. If aurora are bright enough, cones are activated and color becomes visible.

Why green in particular? It comes from excited oxygen atoms, emitting radiation at a wavelength of 557.7 nm. Other colors can be found in aurora, but green is most common for physical and physiological reasons. The physical reasons involve radiation and the chemical composition of the atmosphere. The physiological reason is that human vision is most sensitive around the frequency of green light.

Related posts

Check sums and error detection

The previous post looked at Crockford’s base 32 encoding, a minor variation on the way math conventionally represents base 32 numbers, with concessions for human use. By not using the letter O, for example, it avoids confusion with the digit 0.

Crockford recommends the following check sum procedure, a simple error detection code:

The check symbol encodes the number modulo 37, 37 being the least prime number greater than 32.

That is, we take the remainder when the base 32 number is divided by 37 and append the result to the original encoded number. The remainder could be larger than 31, so we need to expand our alphabet of symbols. Crockford recommends using the symbols *, ~, $, =, and U to represent remainders of 32, 33, 34, 35, and 36.

Crockford says his check sum will “detect wrong-symbol and transposed-symbol errors.” We will show that this is the case in the proof below.

Python example

Here’s a little Python code to demonstrate how the checksum works.

    from base32_crockford import encode, decode

    s = "H88CMK9BVJ1V"
    w = "H88CMK9BVJ1W" # wrong last char
    t = "H88CMK9BVJV1" # transposed last chars

    def append_checksum(s):
        return encode(decode(s), checksum=True)

    print(append_checksum(s))
    print(append_checksum(w))
    print(append_checksum(t))

This produces the following output.

    H88CMK9BVJ1VP
    H88CMK9BVJ1WQ
    H88CMK9BVJV1E

The checksum character of the original string is P. When the last character is changed, the checksum changes from P to Q. Similarly, transposing the last two characters changes the checksum from P to E.

The following code illustrates that the check sum can be a non-alphabetic character.

    s = "H88CMK9BVJ10"
    n = decode(s)
    r = n % 37
    print(r)
    print(encode(n, checksum=True))

This produces

    32
    H88CMK9BVJ10*

As we said above, a remainder of 32 is represented by *.
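
To see the mechanics without the library, here is a hand-rolled sketch of the check symbol computation. The 32-character alphabet is Crockford’s, and the five extra symbols are the ones listed above for remainders 32 through 36.

    alphabet = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"   # Crockford's 32 digit symbols
    check_alphabet = alphabet + "*~$=U"              # symbols for remainders 32..36

    def check_symbol(s):
        n = 0
        for c in s:                        # decode the base 32 string
            n = 32 * n + alphabet.index(c)
        return check_alphabet[n % 37]      # remainder mod 37 picks the check symbol

    print(check_symbol("H88CMK9BVJ1V"))    # P, matching the library output above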

Proof

If you change one character in a base 32 number, its remainder by 37 will change as well, and so the check sum changes.

Specifically, if you change the nth digit from the right, counting from 0, by an amount k, then you change the number by k 32ⁿ. Since 0 < |k| < 32, k is not divisible by 37, nor is 32ⁿ. Because 37 is prime, k 32ⁿ is not divisible by 37 [1]. The same argument holds if we replace 37 by any larger prime.

Now what about transpositions? If you swap consecutive digits a and b in a number, you also change the remainder by 37 (or any larger prime) and hence the check sum.

Again, let’s be specific. Suppose we transpose the nth and (n+1)st digits from the right, again counting from 0. Denote these digits by a and b respectively. Then swapping these two digits changes the number by

(b 32ⁿ⁺¹ + a 32ⁿ) − (a 32ⁿ⁺¹ + b 32ⁿ) = 31(b − a) 32ⁿ

If a ≠ b, then b − a is a number between -31 and 31, but not 0, and so neither 31 nor b − a is divisible by 37. Neither is any power of 32 divisible by 37, so we’ve changed the remainder by 37, i.e. changed the check sum. And as before, the same argument works for 37 or any larger prime.

Related posts

[1] A prime p divides a product ab only if it divides a or it divides b. This isn’t true for composite numbers. For example, 6 divides 4*9 = 36, but 6 doesn’t divide 4 or 9.

Base 32 and base 64 encoding

Math has a conventional way to represent numbers in bases larger than 10, and software development has a couple variations on this theme that are only incidentally mathematical.

Math convention

By convention, math books typically represent numbers in bases larger than 10 by using letters as new digit symbols following 9. For example, base 16 would use 0, 1, 2, …, 9, A, B, C, D, E, and F as its “digits.” This works for bases up to 36; base 36 would use all the letters of the alphabet. There’s no firm convention for whether to use upper or lower case letters.

Base 64 encoding

The common use for base 64 encoding isn’t to represent bits as numbers per se, but to have an efficient way to transmit bits in a context that requires text characters.

There are around 100 possible characters on a keyboard, and 64 is the largest power of 2 less than 100 [1], and so base 64 is the most dense encoding using common characters in a base that is a power of 2.

Base 64 encoding does not follow the math convention of using the digits first and then adding more symbols; it’s free not to because there’s no intention of treating the output as numbers. Instead, the capital letters A through Z represent the numbers 0 through 25, the lower case letters a through z represent the numbers 26 through 51, and the digits 0 through 9 represent the numbers 52 through 61. The symbol + is used for 62 and / is used for 63.
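
The standard library’s base64 module uses exactly this alphabet, so a one-liner shows it in action:

    import base64

    print(base64.b64encode(b"hello world"))   # b'aGVsbG8gd29ybGQ='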

Crockford’s base 32 encoding

Douglas Crockford proposed an interesting form of base 32 encoding. His encoding mostly follows the math convention: 0, 1, 2, …, 9, A, B, …, except he does not use the letters I, L, O, and U. This eliminates the possibility of confusing i, I, or l with 1, or confusing O with 0. Crockford had one more letter he could eliminate, and he chose U in order to avoid an “accidental obscenity.” [2]

Crockford’s base 32 encoding is a compromise between efficiency and human legibility. It is more efficient than hexadecimal, representing 25% more bits per character. It’s less efficient than base 64, representing 17% fewer bits per character, but is more legible than base 64 encoding because it eliminates commonly confused characters.

His encoding is also case insensitive. He recommends using only capital letters for output, but permitting upper or lower case letters in input. This is in the spirit of Postel’s law, also known as the robustness principle:

Be conservative in what you send, and liberal in what you accept.

See the next post for an explanation of Crockford’s check sum proposal.

A password generator

Here’s a Python script to generate passwords using Crockford’s base 32 encoding.

    from secrets import randbits
    from base32_crockford import encode

    def gen_pwd(numbits):
        print(encode(randbits(numbits)))

For example, gen_pwd(60) would create a 12-character password with 60 bits of entropy, and this password would be free of commonly confused characters.

Related posts

[1] We want to use powers of 2 because it’s easy to convert between base 2 and base 2ⁿ: start at the right end and convert bits in groups of n. For example, to convert a binary string to hexadecimal (base 2⁴ = 16), convert groups of four bits each to hexadecimal. So to convert the binary number 101111001 to hex, we break it into 1 0111 1001 and convert each piece to hex, with 1 -> 1, 0111 -> 7, and 1001 -> 9, to find 101111001 -> 179. If we used a base that is not a power of 2, the conversion would be more complicated and not so localized.

[2] All the words on George Carlin’s infamous list include either an I or a U, and so none can result from Crockford’s base 32 encoding. If one were willing to risk accidental obscenities, it would be good to put U back in and remove S since the latter resembles 5, particularly in some fonts.