Projecting Unicode to ASCII

Sometimes you need to downgrade Unicode text to more restricted ASCII text. For example, while working on my previous post, I was surprised that there didn’t appear to be an asteroid named after Poincaré. There is one, but it was listed as Poincare in my list of asteroid names.

Python module

I used the Python module unidecode to convert names to ASCII before searching, and that fixed the problem. Here’s a small example showing how the code works.

    import unidecode

    for x in ["Poincaré", "Gödel"]:
        print(x, unidecode.unidecode(x))

This produces

    Poincaré Poincare
    Gödel Godel

Installing the unidecode module also installs a command line utility by the same name. So you could, for example, pipe text to that utility.

As someone pointed out on Hacker News, this isn’t so impressive for Western languages,

But if you need to project Arabic, Russian or Chinese, unidecode is close to black magic:

>>> from unidecode import unidecode
>>> unidecode("北亰")
'Bei Jing '

(Someone has said in the comments that 北亰 is a typo and should be 北京. I can’t say whether this is right, but I can say that unidecode transliterates both to “Bei Jing.”)

Projections

I titled this post “Projecting Unicode to ASCII” because this code is a projection in the mathematical sense. A projection is a function P such that for all inputs x,

PP(x) ) = P(x).

That is, applying the function twice does the same thing as applying the function once. The name comes from projection in the colloquial sense, such as projecting a three dimensional object onto a two dimensional plane. An equivalent term is to say P is idempotent. [1]

The unidecode function maps the full range of Unicode characters into the range 0x00 to 0x7F, and if you apply it to a character already in that range, the function leaves it unchanged. So the function is a projection, or you could say the function is idempotent.

Projection is such a simple condition that it hardly seems worth giving it a name. And yet it is extremely useful. A general principle in user interface to design is to make something a projection if the user expects it to be a projection. Users probably don’t have the vocabulary to say “I expected this to be a projection” but they’ll be frustrated if something is almost a projection but not quite.

For example, if software has a button to convert an image from color to grayscale, it would be surprising if (accidentally) clicking button a second time had any effect. It would be unexpected if it returned the original color image, and it would be even more unexpected if it did something else, such as keeping the image in grayscale but lowering the resolution.

Related posts

[1] The term “idempotent” may be used more generally than “projection,” the latter being more common in linear algebra. Some people may think of a projection as linear idempotent function. We’re not exactly doing linear algebra here, but people do think of portions of Unicode geometrically, speaking of “planes.”

Groups of order 2019

How many groups have 2019 elements? What are these groups?

2019 is a semiprime, i.e. the product of two primes, 3 and 673. For every semiprime s, there are either one or two distinct groups of order s.

As explained here, if spq with pq, all groups of order s are isomorphic if q is not a factor of p-1. If q does divide p-1 then there are exactly two non-isomorphic groups of order s, one abelian and one non-abelian. In our case, 3 does divide 672, so there are two groups of order 2019. The first is easy: the cyclic group of order 2019. The second is more complex.

You could take the direct product of the cyclic groups of order 3 and 673, but that turns out to be isomorphic to the cyclic group of order 2019. But if you take the semidirect product you get the other group of order 2019, the non-abelian one.

Semidirect products

Starting with two groups G and H, the direct product G × H is the Cartesian product of G and H with multiplication defined componentwise. The semidirect product of G and H, written [1]

G \rtimes H

also starts with the Cartesian product of G and H but defines multiplication differently.

In our application, G will be the integers mod 673 with addition and H will be a three-element subgroup of the integers mod 673 with multiplication [2]. Let H be the set {1, 255, 417} with respect to multiplication [3]. Note that 1 is its own inverse and 255 and 417 are inverses of each other.

Product

Now the group product of (g1, h1) and (g2, h2) is defined to be

(g1 + h1-1g2, h1 h2)

So, for example, the product of (5, 255) and (334, 417) is (5 + 417*334, 255*417) which reduces to (645, 1) working mod 673.

(We haven’t defined the semidirect product in general, but the procedure above suffices to define the semidirect product for any two groups of prime order, and so it is sufficient to find all groups of semiprime order.)

Note that our group is non-abelian. For example, if we reverse the order of multiplication above we get (263, 1).

Identity

The identity element is just the pair consisting of the identity elements from G and H. In our case, this is (0, 1) because 0 is the additive identity and 1 is the multiplicative identity.

Inverse

The inverse of an element (gh) is given by

(-ghh-1).

So, for example, the inverse of (600, 255) is (444, 417).

Python code

The goal of this post is to be concrete rather than general.

So to make everything perfectly explicit, we’ll write a little Python code implementing the product and inverse.

    
    def hinv(h):
        if h == 255:
            return 417
        if h == 417:
            return 255
        if h == 1:
            return 1
    
    def prod(x, y):
        g1, h1 = x
        g2, h2 = y
        g3 = (g1 + hinv(h1)*g2) % 673
        h3 = (h1*h2) % 673
        return (g3, h3)
    
    def group_inv(x):
        g, h = x
        return ((-g*h)%673, hinv(h))
    
    x = (5, 255)
    y = (334, 417)
    print(prod(x, y))
    print(prod(y, x))
    print(group_inv((600, 255)))

The following code verifies that our group satisfies the axioms of a group.

    from itertools import product
    
    identity = (0, 1)
    h_list = [1, 255, 417]
    
    def elem(x):
        g, h = x
        g_ok = (0 <= g <= 672)
        h_ok = (h in h_list)
        return (g_ok and h_ok)
    
    group = product(range(673), h_list)
    assert(len([g for g in group]) == 2019)
    
    # closed under multiplicaton    
    for x in group:
        for y in group:
            assert( elem(prod(x, y)) )
    
    # multiplication is associative
    for x in group:
        for y in group:
            for z in group:
                xy_z = prod(prod(x, y),z)
                x_yz = prod(x, prod(y,z))
                assert(xy_z == x_yz)
    
    # identity acts like it's supposed to
    for x in group:
        assert( prod(x, identity) == x )
        assert( prod(identity, x) == x )
    
    # every element has an inverse
    for x in group:
        ginv = group_inv(x)
        assert( elem(ginv) )
        assert( prod(x, ginv) == identity )
        assert( prod(ginv, x) == identity )

Related posts

[1] The symbol for semidirect product is ⋊. It’s U+22CA in Unicode and \rtimes in LaTeX.

[2] In general, the semidirect product depends on a choice of an action of the group H on the group G. Here the action is multiplication by an element of H. Different actions can result in different groups. Sometimes the particular choice of action is made explicit as a subscript on the ⋊ symbol.

[3] How did I find these numbers? There are 672 non-zero numbers mod 673, so I picked a number, it happened to be 5, and raised it to the powers 672/3 and 2*672/3.

 

Check sums and error detection

The previous post looked at Crockford’s base 32 encoding, a minor variation on the way math conventionally represents base 32 numbers, with concessions for human use. By not using the letter O, for example, it avoids confusion with the digit 0.

Crockford recommends the following check sum procedure, a simple error detection code:

The check symbol encodes the number modulo 37, 37 being the least prime number greater than 32.

That is, we take the remainder when the base 32 number is divided by 37 and append the result to the original encoded number. The remainder could be larger than 31, so we need to expand our alphabet of symbols. Crockford recommends using the symbols *, ~, $, =, and U to represent remainders of 32, 33, 34, 35, and 36.

Crockford says his check sum will “detect wrong-symbol and transposed-symbol errors.” We will show that this is the case in the proof below.

Python example

Here’s a little Python code to demonstrate how the checksum works

    from base32_crockford import encode, decode

    s = "H88CMK9BVJ1V"
    w = "H88CMK9BVJ1W" # wrong last char
    t = "H88CMK9BVJV1" # transposed last chars

    def append_checksum(s):
        return encode(decode(s), checksum=True)

    print(append_checksum(s))
    print(append_checksum(w))
    print(append_checksum(t))

This produces the following output.

    H88CMK9BVJ1VP
    H88CMK9BVJ1WQ
    H88CMK9BVJV1E

The checksum character of the original string is P. When the last character is changed, the checksum changes from P to Q. Similarly, transposing the last two characters changes the checksum from P to E.

The following code illustrates that the check sum can be a non-alphabetic character.

    s = "H88CMK9BVJ10"
    n = decode(s)
    r = n % 37
    print(r)
    print(encode(n, checksum=True))

This produces

    32
    H88CMK9BVJ10*

As we said above, a remainder of 32 is represented by *.

Proof

If you change one character in a base 32 number, its remainder by 37 will change as well, and so the check sum changes.

Specifically, if you change the nth digit from the right, counting from 0, by an amount k, then you change the number by a factor of k 2n. Since 0 < k < 32, k is not divisible by 37, nor is 2n. Because 37 is prime, k 2n is not divisible by 37 [1]. The same argument holds if we replace 37 by any larger prime.

Now what about transpositions? If you swap consecutive digits a and b in a number, you also change the remainder by 37 (or any larger prime) and hence the check sum.

Again, let’s be specific. Suppose we transpose the nth and n+1st digits from the right, again counting from 0. Denote these digits by a and b respectively. Then swapping these two digits changes the number by an amount

(b 2n+1 + a 2n) – (a 2n+1 + b 2n) = (b – a) 2n

If ab, then b – a is a number between -31 and 31, but not 0, and so b – a is not divisible by 37. Neither is any power of 2 divisible by 37, so we’ve changed the remainder by 37, i.e. changed the check sum. And as before, the same argument works for any prime larger than 47.

Related posts

[1] A prime p divides a product ab only if it divides a or it divides b. This isn’t true for composite numbers. For example, 6 divides 4*9 = 36, but 6 doesn’t divide 4 or 9.

Base 32 and base 64 encoding

Math has a conventional way to represent numbers in bases larger than 10, and software development has a couple variations on this theme that are only incidentally mathematical.

Math convention

By convention, math books typically represent numbers in bases larger than 10 by using letters as new digit symbols following 9. For example, base 16 would use 0, 1, 2, …, 9, A, B, C, D, E, and F as its “digits.” This works for bases up to 36; base 36 would use all the letters of the alphabet. There’s no firm convention for whether to use upper or lower case letters.

Base 64 encoding

The common use for base 64 encoding isn’t to represent bits as numbers per se, but to have an efficient way to transmit bits in a context that requires text characters.

There are around 100 possible characters on a keyboard, and 64 is the largest power of 2 less than 100 [1], and so base 64 is the most dense encoding using common characters in a base that is a power of 2.

Base 64 encoding does not follow the math convention of using the digits first and then adding more symbols; it’s free not to because there’s no intention of treating the output as numbers. Instead, the capital letters A through Z represent the numbers 0 though 25, the lower case letters a through z represent the numbers 26 through 51, and the digits 0 through 9 represent the numbers 52 through 61. The symbol + is used for 62 and / is used for 63.

Crockford’s base 32 encoding

Douglas Crockford proposed an interesting form of base 32 encoding. His encoding mostly follows the math convention: 0, 1, 2, …, 9, A, B, …, except he does not use the letters I, L, O, and U. This eliminates the possibility of confusing i, I, or l with 1, or confusing O with 0. Crockford had one more letter he could eliminate, and he chose U in order to avoid an “accidental obscenity.” [2]

Crockford’s base 32 encoding is a compromise between efficiency and human legibility. It is more efficient than hexadecimal, representing 25% more bits per character. It’s less efficient than base 64, representing 17% fewer bits per character, but is more legible than base 64 encoding because it eliminates commonly confused characters.

His encoding is also case insensitive. He recommends using only capital letters for output, but permitting upper or lower case letters in input. This is in the spirit of Postel’s law, also known as the robustness principle:

Be conservative in what you send, and liberal in what you accept.

See the next post for an explanation of Crockford’s check sum proposal.

A password generator

Here’s a Python script to generate passwords using Crockford’s base 32 encoding.

    from secrets import randbits
    from base32_crockford import encode

    def gen_pwd(numbits):
        print(encode(randbits(numbits)))

For example, gen_pwd(60) would create a 12-character password with 60-bits of entropy, and this password would be free of commonly confused characters.

Related posts

[1] We want to use powers of 2 because it’s easy to convert between base 2 and base 2n: start at the right end and convert bits in groups of n. For example, to convert a binary string to hexadecimal (base 24 = 16), convert groups of four bits each to hexadecimal. So to convert the binary number 101111001 to hex, we break it into 1 0111 1001 and convert each piece to hex, with 1 -> 1, 0111 -> 7, and 1001 -> 9, to find 101111001 -> 179. If we a base that is not a power of 2, the conversion would be more complicated and not so localized.

[2] All the words on George Carlin’s infamous list include either an I or a U, and so none can result from Crockford’s base 32 encoding. If one were willing to risk accidental obscenities, it would be good to put U back in and remove S since the latter resembles 5, particularly in some fonts.

RSA with one shared prime

The RSA encryption setup begins by finding two large prime numbers. These numbers are kept secret, but their product is made public. We discuss below just how difficult it is to recover two large primes from knowing their product.

Suppose two people share one prime. That is, one person chooses primes p and q and the other chooses p and r, and qr. (This has happened [1].) Then you can easily find p. All three primes can be obtained in less than a millisecond as illustrated by the Python script below.

In a way it would be better to share both your primes with someone else than to share just one. If both your primes are the same as someone else’s, you could read each other’s messages, but a third party attacker could not. If you’re lucky, the person you share primes with doesn’t know that you share primes or isn’t interested in reading your mail. But if that person’s private key is ever disclosed, so is yours.

Python code

Nearly all the time required to run the script is devoted to finding the primes, so we time just the factorization.

    from secrets import randbits
    from sympy import nextprime, gcd
    from timeit import default_timer as timer
    
    numbits = 2048
    
    p = nextprime(randbits(numbits))
    q = nextprime(randbits(numbits))
    r = nextprime(randbits(numbits))                                      
    
    m = p*q
    n = p*r
    
    t0 = timer()
    g = gcd(m, n)
    assert(p == g)
    assert(q == m//p)
    assert(r == n//p)
    t1 = timer()
    
    # Print time in milliseconds
    print(1000*(t1 - t0))

Python notes

There are a few noteworthy things about the Python code.

First, it uses a cryptographic random number generator, not the default generator, to create 2048-bit random numbers.

Second, it uses the portable default_timer which chooses the appropriate timer on different operating systems. Note that it returns wall clock time, not CPU time.

Finally, it uses integer division // to recover q and r. If you use the customary division operator Python will carry out integer division and attempt to cast the result to a float, resulting in the error message “OverflowError: integer division result too large for a float.”

GCD vs factoring

If you wanted to find the greatest common divisor of two small numbers by hand, you’d probably start by factoring the two numbers. But as Euclid knew, you don’t have to factor two numbers before you can find the greatest common factor of the numbers. For large numbers, the Euclidean algorithm is orders of magnitude faster than factoring.

Nobody has announced being able to factor a 1024 bit number yet. The number m and n above have four times as many bits. The largest of the RSA numbers factored so far has 768 bits. This was done in 2009 and took approximately two CPU millennia, i.e. around 2000 CPU years.

Fast Euclidean algorithm

The classic Euclidean algorithm takes O(n²) operations to find the greatest common divisor of two integers of length n. But there are modern sub-quadratic variations on the Euclidean algorithm that take

O(n log²(n) log(log(n)))

operations, which is much faster for very large numbers. I believe SymPy is using the classic Euclidean algorithm, but it could transparently switch to using the fast Euclidean algorithm for large inputs.

Related posts

[1] As noted in the comments, this has happened due to faulty software implementation, but it shouldn’t happen by chance. By the prime number theorem, the number of n-bit primes is approximately 2n / (n log 2). If M people choose 2M primes each n bits long, the probability of two of these primes being the same is roughly

22-n M n log 2.

If M = 7.5 billion (the current world population) and n = 1024, the probability of two primes being the same is about 10-295. This roughly the probability of winning the Powerball jackpot 40 times in a row.

Simulating identification by zip code, sex, and birthdate

As mentioned in the previous post, Latanya Sweeney estimated that 87% of Americans can be identified by the combination of zip code, sex, and birth date. We’ll do a quick-and-dirty estimate and a simulation to show that this result is plausible. There’s no point being too realistic with a simulation because the actual data that Sweeney used is even more realistic. We’re just showing that her result is reasonable.

Quick estimate

Suppose average life expectancy is around 78 years. If everyone is between 0 and 78, then there are 78*365 possible birth dates and twice that many combinations of birth date and sex.

What’s the average population of a US zip code? We’ll use 9,330 for reasons explained in [1].

We have 56,940 possible birth date and sex combinations for 9,330 people. There have to be many unused birth date and sex combinations in a typical zip code, and it’s plausible that many combinations will only be used once. We’ll run a Python simulation to see just how many we’d expect to be used one time.

Python simulation

The array demo below will keep track of the possible demographic values, i.e. combinations of birth date and sex. We’ll loop over the population of the zip code, randomly assigning everyone to a demographic value, then see what proportion of demographic values is only used once.

    from random import randrange
    from numpy import zeros
    
    zctasize = 9330
    demosize = 365*78*2
    demo = zeros(demosize)
    
    for _ in range(zctasize):
        d = randrange(demosize)
        demo[d] += 1
    
    unique = len(demo[demo == 1])
    print(unique/zctasize)

I ran this simulation 10 times and got values ranging from 84.3% to 85.7%.

Analytic solution

As Road White points out in the comments, you can estimate the number of unique demographics by a probability calculation.

Suppose there are z inhabitants in our zip code and d demographic categories. We’re assuming here (and above) that all demographic categories are equally likely, even though that’s not true of birth dates.

We start by looking at a particular demographic category. The probability that exactly one person falls in that category is

\frac{z}{d}\left(1 - \frac{1}{d}\right)^{z-1}

To find the expected number of demographic slots with exactly one entry we multiply by d, and to get the proportion of p of people this represents we divide by z.

\log p = (z-1)\log\left(1 - \frac{1}{d}\right) \approx - \frac{z}{d}

and so

p \approx \exp(-z/d)

which in our case is 84.8%, consistent with our simulation above.

Here we used the Taylor series approximation

\log(1 + x) = x + {\cal O}(x^2)

If z is of the same magnitude as d or smaller, then the error in the approximation above is O(1/d).

You could also work this as a Poisson distribution, then condition on the probability of a slot being occupied.

By the way, a similar calculation shows that the expected number of demographic slots containing two people is r exp(-r)/2 where rz/d. So while 85% can be uniquely identified, another 7% can be narrowed down to two possibilities.

Related posts

[1] I don’t know, but I do know the average population of a “zip code tabulation area” or ZCTA, and that’s 9,330 according to the data linked to here. As I discuss in that post, the Census Bureau reports population by ZTCA, not by zip code per se, for several reasons. Zip codes can cross state boundaries, they are continually adjusted, and some zip codes contain only post office boxes.

Sine of a googol

How do you evaluate the sine of a large number in floating point arithmetic? What does the result even mean?

Sine of a trillion

Let’s start by finding the sine of a trillion (1012) using floating point arithmetic. There are a couple ways to think about this. The floating point number t = 1.0e12 can only be represented to 15 or 16 significant decimal figures (to be precise, 53 bits [1]), and so you could think of it as a representative of the interval of numbers that all have the same floating point representation. Any result that is the sine of a number in that interval should be considered correct.

Another way to look at this is to say that t can be represented exactly—its binary representation requires 42 bits and we have 53 bits of significand precision available—and so we expect sin(t) to return the sine of exactly one trillion, correct to full precision.

It turns out that IEEE arithmetic does the latter, computing sin(1e12) correctly to 16 digits.

Here’s the result in Python

    >>> sin(1.0e12)
    -0.6112387023768895

and verified by Mathematica by computing the value to 20 decimal places

    In:= N[Sin[10^12], 20]
    Out= -0.61123870237688949819

Range reduction

Note that the result above is not what you’d get if you were first to take the remainder when dividing by 2π and then taking the sine.

    >>> sin(1.0e12 % (2*pi))
    -0.6112078499756778

This makes sense. The result of dividing a trillion by the floating point representation of 2π, 159154943091.89536, is correct to full floating point precision. But most of that precision is in the integer part. The fractional part is only has five digits of precision, and so we should expect the result above to be correct to at most five digits. In fact it’s accurate to four digits.

When your processor computes sin(1e12) it does not naively take the remainder by 2π and then compute the sine. If it did, you’d get four significant figures rather than 16.

We started out by saying that there were two ways of looking at the problem, and according to the first one, returning only four significant figures would be quite reasonable. If we think of the value t as a measured quantity, measured to the precision of floating point arithmetic (though hardly anything can be measured that accurately), then four significant figures would be all we should expect. But the people who designed the sine function implementation on your processor did more than they might be expected to do, finding the sine of a trillion to full precision. They do this using a range reduction algorithm that retains far more precision than naively doing a division. (I worked on this problem a long time ago.)

Sine of a googol?

What if we try to take the sine of a ridiculously large number like a googol (10100)? This won’t work because a googol cannot be represented exactly as a floating point number (i.e. as a IEEE 754 double). It’s not too big; floating point numbers can be as large as around 10308. The problem is that a number that big cannot be represented to full precision. But we can represent 2333 exactly, and a googol is between 2332 and 2333. And amazingly, Python’s sine function (or rather the sine function built into the AMD processor on my computer) returns the correct result to full precision.

    >>> sin(2**333)
    0.9731320373846353

How do we know this is right? I verified it in Mathematica:

    In:= Sin[2.0^333]
    Out= 0.9731320373846353

How do we know Mathematica is right? We can do naive range reduction using extended precision, say 200 decimal places.

    In:= p = N[Pi, 200]
    In:= theta = x - IntegerPart[x/ (2 p)] 2 p
    Out= 1.8031286178334699559384196689…

and verify that the sine of the range reduced value is correct.

    In:= Sin[theta]
    Out= 0.97313203738463526…

Interval arithmetic interpretation

Because floating point numbers have 53 bits of precision, all real numbers between 256 – 22 and 256 + 22 are represented as the floating point number 256. This is a range of width 8, wider than 2π, and so the sines of the numbers in this range cover the possible values of sine, i.e. [-1, 1]. So if you think of a floating point number as a sample from the set of all real numbers with the same binary representation, every possible value of sine is a valid return value for numbers bigger than 256. But if you think of a floating point number as representing itself exactly, then it makes sense to ask for the sine of numbers like 2333 and larger, up to the limits of floating point representation [2].

Related posts

[1] An IEEE 754 double has 52 significand bits, but these bits can represent 53 bits of precision because the first bit of the fractional part is always 1, and so it does not need to be represented. See more here.

[2] The largest exponent of an IEEE double is 1023, and the largest significand is 2 – 2-53 (i.e. all 1’s), so the largest possible value of a double is

(253 – 1)21024-53

and in fact the Python expression sin((2**53 - 1)*2**(1024-53)) returns the correct value to 16 significant figures.

Ellipsoid distance on Earth

globe

To first approximation, Earth is a sphere. But it bulges at the equator, and to second approximation, Earth is an oblate spheroid. Earth is not exactly an oblate spheroid either, but the error in the oblate spheroid model is about 100x smaller than the error in the spherical model.

Finding the distance between two points on a sphere is fairly simple. Here’s a calculator to compute the distance, and here’s a derivation of the formula used in the calculator.

Finding the distance between two points on an ellipsoid is much more complicated. (A spheroid is a kind of ellipsoid.) Wikipedia gives a description of Vincenty’s algorithm for finding the distance between two points on Earth using an oblate spheroid model (specifically WGS-84). I’ll include a Python implementation below.

Comparison with spherical distance

How much difference does it make when you calculate difference on an oblate spheroid rather than a sphere? To address that question I looked at the coordinates of several cities around the world using the CityData function in Mathematica. Latitude is in degrees north of the equator and longitude is in degrees east of the prime meridian.

    |--------------+--------+---------|
    | City         |    Lat |    Long |
    |--------------+--------+---------|
    | Houston      |  29.78 |  -95.39 |
    | Caracas      |  10.54 |  -66.93 |
    | London       |  51.50 |   -0.12 |
    | Tokyo        |  35.67 |  139.77 |
    | Delhi        |  28.67 |   77.21 |
    | Honolulu     |  21.31 | -157.83 |
    | Sao Paulo    | -23.53 |  -46.63 |
    | New York     |  40.66 |  -73.94 |
    | Los Angeles  |  34.02 | -118.41 |
    | Cape Town    | -33.93 |   18.46 |
    | Sydney       | -33.87 |  151.21 |
    | Tromsø       |  69.66 |   18.94 |
    | Singapore    |   1.30 |  103.85 |
    |--------------+--------+---------|

Here are the error extremes.

The spherical model underestimates the distance from London to Tokyo by 12.88 km, and it overestimates the distance from London to Cape Town by 45.40 km.

The relative error is most negative for London to New York (-0.157%) and most positive for Tokyo to Sidney (0.545%).

Update: The comparison above used the same radius for both the spherical and ellipsoidal models. As suggested in the comments, a better comparison would use the mean radius (average of equatorial and polar radii) in the spherical model rather than the equatorial radius.

With that change the most negative absolute error is between Tokyo and New York at -30,038 m. The most negative relative error is between London and New York at -0.324%.

The largest positive error, absolute and relative, is between Tokyo and Sydney. The absolute error is 29,289 m and the relative error is 0.376%.

Python implementation

The code below is a direct implementation of the equations in the Wikipedia article.

Note that longitude and latitude below are assumed to be in radians. You can convert from degrees to radians with SciPy’s deg2rad function.

from scipy import sin, cos, tan, arctan, arctan2, arccos, pi

def spherical_distance(lat1, long1, lat2, long2):
    phi1 = 0.5*pi - lat1
    phi2 = 0.5*pi - lat2
    r = 0.5*(6378137 + 6356752) # mean radius in meters
    t = sin(phi1)*sin(phi2)*cos(long1-long2) + cos(phi1)*cos(phi2)
    return r * arccos(t)

def ellipsoidal_distance(lat1, long1, lat2, long2):

    a = 6378137.0 # equatorial radius in meters 
    f = 1/298.257223563 # ellipsoid flattening 
    b = (1 - f)*a 
    tolerance = 1e-11 # to stop iteration

    phi1, phi2 = lat1, lat2
    U1 = arctan((1-f)*tan(phi1))
    U2 = arctan((1-f)*tan(phi2))
    L1, L2 = long1, long2
    L = L2 - L1

    lambda_old = L + 0

    while True:
    
        t = (cos(U2)*sin(lambda_old))**2
        t += (cos(U1)*sin(U2) - sin(U1)*cos(U2)*cos(lambda_old))**2
        sin_sigma = t**0.5
        cos_sigma = sin(U1)*sin(U2) + cos(U1)*cos(U2)*cos(lambda_old)
        sigma = arctan2(sin_sigma, cos_sigma) 
    
        sin_alpha = cos(U1)*cos(U2)*sin(lambda_old) / sin_sigma
        cos_sq_alpha = 1 - sin_alpha**2
        cos_2sigma_m = cos_sigma - 2*sin(U1)*sin(U2)/cos_sq_alpha
        C = f*cos_sq_alpha*(4 + f*(4-3*cos_sq_alpha))/16
    
        t = sigma + C*sin_sigma*(cos_2sigma_m + C*cos_sigma*(-1 + 2*cos_2sigma_m**2))
        lambda_new = L + (1 - C)*f*sin_alpha*t
        if abs(lambda_new - lambda_old) <= tolerance:
            break
        else:
            lambda_old = lambda_new

    u2 = cos_sq_alpha*((a**2 - b**2)/b**2)
    A = 1 + (u2/16384)*(4096 + u2*(-768+u2*(320 - 175*u2)))
    B = (u2/1024)*(256 + u2*(-128 + u2*(74 - 47*u2)))
    t = cos_2sigma_m + 0.25*B*(cos_sigma*(-1 + 2*cos_2sigma_m**2))
    t -= (B/6)*cos_2sigma_m*(-3 + 4*sin_sigma**2)*(-3 + 4*cos_2sigma_m**2)
    delta_sigma = B * sin_sigma * t
    s = b*A*(sigma - delta_sigma)

    return s

Clearing up the confusion around Jacobi functions

The Jacobi elliptic functions sn and cn are analogous to the trigonometric functions sine and cosine. The come up in applications such as nonlinear oscillations and conformal mapping. Unfortunately there are multiple conventions for defining these functions. The purpose of this post is to clear up the confusion around these different conventions.

Plot of Jacobi sn

The image above is a plot of the function sn [1].

Modulus, parameter, and modular angle

Jacobi functions take two inputs. We typically think of a Jacobi function as being a function of its first input, the second input being fixed. This second input is a “dial” you can turn that changes their behavior.

There are several ways to specify this dial. I started with the word “dial” rather than “parameter” because in this context parameter takes on a technical meaning, one way of describing the dial. In addition to the “parameter,” you could describe a Jacobi function in terms of its modulus or modular angle. This post will be a Rosetta stone of sorts, showing how each of these ways of describing a Jacobi elliptic function are related.

The parameter m is a real number in [0, 1]. The complementary parameter is m‘ = 1 – m. Abramowitz and Stegun, for example, write the Jacobi functions sn and cn as sn(um) and cn(um). They also use m1 = rather than m‘ to denote the complementary parameter.

The modulus k is the square root of m. It would be easier to remember if m stood for modulus, but that’s not conventional. Instead, m is for parameter and k is for modulus. The complementary modulus k‘ is the square root of the complementary parameter.

The modular angle α is defined by m = sin² α.

Note that as the parameter m goes to zero, so does the modulus k and the modular angle α. As any one of these three goes to zero, the Jacobi functions sn and cn converge to their counterparts sine and cosine. So whether your dial is the parameter, modulus, or modular angle, sn converges to sine and cn converges to cosine as you turn the dial toward zero.

As the parameter m goes to 1, so does the modulus k, but the modular angle α goes to π/2. So if your dial is the parameter or the modulus, it goes to 1. But if you think of your dial as modular angle, it goes to π/2. In either case, as you turn the dial to the right as far as it will go, sn converges to hyperbolic secant, and cn converges to the constant function 1.

Quarter periods

In addition to parameter, modulus, and modular angle, you’ll also see Jacobi function described in terms of K and K‘. These are called the quarter periods for good reason. The functions sn and cn have period 4K as you move along the real axis, or move horizontally anywhere in the complex plane. They also have period 4iK‘. That is, the functions repeat when you move a distance 4K‘ vertically [2].

The quarter periods are a function of the modulus. The quarter period K along the real axis is

K(m) = \int_0^{\pi/2} \frac{d\theta}{\sqrt{1-m\sin^2\theta}}

and the quarter period K‘ along the imaginary axis is given by K‘(m) = K(m‘) = K(1 – m).

The function K(m) is known as “the complete elliptic integral of the first kind.”

Amplitude

So far we’ve focused on the second input to the Jacobi functions, and three conventions for specifying it.

There are two conventions for specifying the first argument, written either as φ or as u. These are related by

u = \int_0^{\varphi} \frac{d\theta}{\sqrt{1-m\sin^2\theta}}

The angle φ is called the amplitude. (Yes, it’s an angle, but it’s called an amplitude.)

When we said above that the Jacobi functions had period 4K, this was in terms of the variable u. Note that when φ = π/2, uK.

Jacobi elliptic functions in Mathematica

Mathematica uses the u convention for the first argument and the parameter convention for the second argument.

The Mathematica function JacobiSN[u, m] computes the function sn with argument u and parameter m. In the notation of A&S, sn(um).

Similarly, JacobiCN[u, m] computes the function cn with argument u and parameter m. In the notation of A&S, cn(um).

We haven’t talked about the Jacobi function dn up to this point, but it is implemented in Mathematica as JacobiDN[u, m].

The function that solves for the amplitude φ as a function of u is JacobiAmplitude[um m].

The function that computes the quarter period K from the parameter m is EllipticK[m].

Jacobi elliptic functions in Python

The SciPy library has one Python function that computes four mathematical functions at once. The function scipy.special.ellipj takes two arguments, u and m, just like Mathematica, and returns sn(um), cn(um), dn(um), and the amplitude φ(um).

The function K(m) is implemented in Python as scipy.special.ellipk.

Related posts

[1] The plot was made using JacobiSN[0.5, z] and the function ComplexPlot described here.

[2] Strictly speaking, 4iK‘ is a period. It’s the smallest vertical period for cn, but 2iK‘ is the smallest vertical period for sn.

Koide’s coincidence

The p-norm of a vector is defined to be the pth root of the sum of its pth powers.

|| x ||_p = \left(\sum_{i=1}^n x_i^p\right )^{1/p}

Such norms occur frequently in application [1]. Yoshio Koide discovered in 1981 that if you take the masses of the electron, muon, and tau particles, the ratio of the 1 norm to the 1/2 norm is very nearly 2/3. Explicitly,

\frac{m_e + m_{\mu} + m_{\tau}}{(\sqrt{m_e}+\sqrt{m_{\mu}}+\sqrt{m_{\tau}})^2} \approx \frac{2}{3}

to at least four decimal places. Since the ratio is tantalizingly close to 2/3, some believe there’s something profound going on here and the value is exactly 2/3, but others believe it’s just a coincidence.

The value of 2/3 is interesting for two reasons. Obviously it’s a small integer ratio. But it’s also exactly the midpoint between the smallest and largest possible value. More on that below.

Is the value 2/3 within the measure error of the constants? We’ll investigate that with a little Python code.

Python code

The masses of particles are available in the physical_constants dictionary in scipy.constants. For each constant, the dictionary contains the best estimate of the value, the units, and the uncertainty (standard deviation) in the measurement [2].

    from scipy.constants import physical_constants as pc
    from scipy.stats import norm
    
    def pnorm(v, p):
        return sum([x**p for x in v])**(1/p)
    
    def f(v):
        return pnorm(v, 1) / pnorm(v, 0.5)
    
    m_e = pc["electron mass"]
    m_m = pc["muon mass"]
    m_t = pc["tau mass"]
    
    v0 = [m_e[0], m_m[0], m_t[0]]
    
    print(f(v0))

This says that the ratio of the 1 norm and 1/2 norm is 0.666658, slightly less than 2/3. Could the value be exactly 2/3 within the resolution of the measurements? How could we find out?

Measurement uncertainty

The function f above is minimized when its arguments are all equal, and maximized when its arguments are maximally unequal. To see this, note that f(1, 1, 1) = 1/3 and f(1, 0, 0) = 1. You can prove that those are indeed the minimum and maximum values. To see if we can make f larger, we want to increase the largest value, the mass of tau, and decrease the others. If we move each value one standard deviation in the desired direction, we get

    v1 = [m_e[0] - m_e[2], 
          m_m[0] - m_m[2], 
          m_t[0] + m_t[2]]
    print(f(v1))

which returns 0.6666674, just slightly bigger than 2/3. Since the value can be bigger than 2/3, and less than 2/3, the intermediate value theorem says there are values of the constants within one standard deviation of their mean for which we get exactly 2/3.

Now that we’ve shown that it’s possible to get a value above 2/3, how likely is it? We can do a simulation, assuming each measurement is normally distributed.

    N = 1000
    above = 0
    for _ in range(N):
        r_e = norm(m_e[0], m_e[2]).rvs()
        r_m = norm(m_m[0], m_m[2]).rvs()
        r_t = norm(m_t[0], m_t[2]).rvs()
        t = f([r_e, r_m, r_t])
        if t > 2/3:
            above += 1
    print(above)

When we I ran this, I got 168 values above 2/3 and the rest below. So based solely on our calculations here, not taking into account any other information that may be important, it’s plausible that Koide’s ratio is exactly 2/3.

***

[1] Strictly speaking, we should take the absolute values of the vector components. Since we’re talking about masses here, I simplified slightly by assuming the components are non-negative.

Also, what I’m calling the p “norm” is only a norm if p is at least 1. Values of p less than 1 do occur in application, even though the functions they define are not norms. I’m pretty sure I’ve blogged about such an application, but I haven’t found the post.

[2] The SciPy library obtained its values for the constants and their uncertainties from the CODATA recommended values, published in 2014.

Distribution of carries

Suppose you add two long numbers together. As you work your way toward the result, adding digits from right to left, sometimes you carry a 1 and sometimes you don’t. How often do you carry 1?

Now suppose you add three numbers at a time. Now your carry might be 0, 1, or 2. How often does each appear? How do things change if you’re not working in base 10?

John Holte goes into this problem in detail in [1]. We’ll only look at his first result here.

Suppose we’re adding m numbers together in base b, with each digit being uniformly distributed. Then at each step of the addition process, the probability distribution of the amount we carry out only depends on the amount we carried in from the previous step. In other words, the carries form a Markov chain!

This means that we can do more than describe the distribution of the amounts carried. We can specify the transition probabilities for carries, i.e. the distribution on the amount carried out, conditional on the the amount carried in.

Examples

When we add two base ten numbers, m = 2 and b = 10, we have the transition matrix

\begin{pmatrix} 0.55 & 0.45 \\ 0.45 & 0.55 \end{pmatrix}

This means that on average, we’re as likely to carry 0 as to carry 1. But more specifically, there’s a 55% chance that we’ll carry a 1 if we carried a 1 on the previous step, and also a 55% chance that we won’t have a carry this time if we didn’t have one last time.

If we add three numbers, things get a little more interesting. Now the transition matrix is

\begin{pmatrix} 0.220 & 0.660 & 0.120 \\ 0.165 & 0.670 & 0.165 \\ 0.120 & 0.660 & 0.220 \end{pmatrix}

This gives the probabilities of each out carry for each in carry. We can compute the long run distribution from the matrix above to find that when adding three long numbers, we’ll carry 0 or 1 with probability 1/6 each, and carry 1 with probability 2/3.

When adding three binary numbers, the transition matrix is fairly different than for base 10

\begin{pmatrix} 0.500 & 0.500 & 0.000 \\ 0.125 & 0.750 & 0.125 \\ 0.000 & 0.500 & 0.500 \\ \end{pmatrix}

but the long run distributions are the same: carry a 1 two thirds of the time, and split the remaining probability equally between carrying 0 and 2.

The transition matrices don’t change much for larger bases; base 100 doesn’t look much different than base 100, for example.

Finally, let’s do an example adding five numbers in base 10. The transition matrix is

\begin{pmatrix} 0.020 & 0.305 & 0.533 & 0.140 & 0.003 \\ 0.013 & 0.259 & 0.548 & 0.176 & 0.005 \\ 0.008 & 0.216 & 0.553 & 0.216 & 0.008 \\ 0.005 & 0.176 & 0.548 & 0.260 & 0.013 \\ 0.003 & 0.140 & 0.533 & 0.305 & 0.020 \\ \end{pmatrix}

and the long run distribution is 1/60, 13/60, 33/60, 13/60, and 1/60 for carries of 0 through 4.

General case

In general, when adding m numbers in base b, Holte gives the transition probabilities as

\pi_{ij} = b^{-m} \sum_{r=0}^{j-\lfloor i/b\rfloor}(-1)^r {m+1\choose r}{m - i -1 + b(j-r+1) \choose m}

Here is some Python code to compute the transition matrices.

from scipy import floor, zeros, array
from scipy.special import comb
from numpy.linalg import matrix_power

# transition probability
def pi(i, j, b, m):
    s = 0
    for r in range(0, j - int(floor(i/b)) + 1):
        s += (-1)**r * comb(m+1, r) * comb(m-i-1 +b*(j-r+1), m)
    return s/b**m

def transition_matrix(b, m):
    P =  [ [pi(i, j, b, m) for j in range(m)]
            for i in range(m) ]
    return array(P)

You can find the long run probabilities by raising the transition matrix to a large power.

Related posts

[1] John M. Holte. The American Mathematical Monthly, Vol. 104, No. 2 (Feb., 1997), pp. 138-149

John Conway’s subprime fib sequence

Here’s how to construct what John Horton Conway calls his “subprime fib” sequence. Seed the sequence with any two positive integers. Then at each step, sum the last two elements of the sequence. If the result is prime, append it to the sequence. Otherwise divide it by its smallest prime factor and append that.

If we start the sequence with (1, 1) we get the sequence

1, 1, 2, 3, 5, 4, 3, 7, 5, 6, 11, 17, 14, 31, 15, 23, 19, 21, 20, 41, 61, 51, 56, 107, 163, 135, 149, 142, 97, 239, 168, 37, 41, 39, 40, 79, 17, 48, 13, 61, 37, 49, 43, 46, 89, 45, 67, 56, 41, 97, 69, 83, 76, 53, 43, 48, 13,

We know it will keep repeating because 48 followed by 13 has occurred before, and the “memory” of the sequence only stretches back two steps.

If we start with (6, 3) we get

6, 3, 3, 3, …

It’s easy to see that if the sequence ever repeats a value n then it will produce only that value from then on, because n + n = 2n is obviously not prime, and its smallest prime factor is 2.

Conway believes that this sequence will always get stuck in a cycle but this has not been proven.

Here’s Python code to try out Conway’s conjecture:

from sympy import isprime, primefactors

def subprime(a, b):

    seq = [a, b]
    repeated = False
    while not repeated:
        s = seq[-1] + seq[-2]
        if not isprime(s):
            s //= primefactors(s)[0]
        repeated = check_repeat(seq, s)
        seq.append(s)
    return seq

def check_repeat(seq, s):
    
    if not s in seq:
        return False
    
    if seq[-1] == s:
        return True
    
    # all previous occurances of the last element in seq
    indices = [i for i, x in enumerate(seq) if x == seq[-1]]
    
    for i in indices[:-1]:
        if seq[i+1] == s:
            return True
        
    return False

I’ve verified by brute force that the sequence repeats for all starting values less than 1000.

Related posts:

Circular law for random matrices

Suppose you create a large matrix M by filling its components with random values. If has size n by n, then we require the probability distribution for each entry to have mean 0 and variance 1/n. Then the Girko-Ginibri circular law says that the eigenvalues of M are approximately uniformly distributed in the unit disk in the complex plane. As the size n increases, the distribution converges to a uniform distribution on the unit disk.

The probability distribution need not be normal. It can be any distribution, shifted to have mean 0 and scaled to have variance 1/n, provided the tail of the distribution isn’t so thick that the variance doesn’t exist. If you don’t scale the variance to 1/n you still get a circle, just not a unit circle.

We’ll illustrate the circular law with a uniform distribution. The uniform distribution has mean 1/2 and variance 1/12, so we will subtract 1/2 and multiply each entry by √(12/n).

Here’s our Python code:

    import matplotlib.pyplot as plt
    import numpy as np
    
    n = 100
    M = np.random.random((n,n)) - 0.5
    M *= (12/n)**0.5
    w = np.linalg.eigvals(M)
    plt.scatter(np.real(w), np.imag(w))
    plt.axes().set_aspect(1)
    plt.show()

When n=100 we get the following plot.

eigenvalues of 100 by 100 matrix

When n=1000 we can see the disk filling in more.

eigenvalues of 1000 by 1000 matrix

Note that the points are symmetric about the real axis. All the entries of M are real, so its characteristic polynomial has all real coefficients, and so its roots come in conjugate pairs. If we randomly generated complex entries for M we would not have such symmetry.

Related post: Fat-tailed random matrices

3D chess knight moves

I’ve written before about a knight’s random walk on an ordinary chess board. In this post I’d like to look at the generalization to three dimensions (or more).

three dimensional lattice

So what do we mean by 3D chess? For this post, we’ll have a three dimensional lattice of possible positions, of size 8 by 8 by 8. You could think of this as a set of 8 ordinary chess boards stacked vertically. To generalize a knight’s move to this new situation, we’ll say that a knight move consists of moving 2 steps in one direction and 1 step in an orthogonal direction. For example, he might move up two levels and over one position horizontally.

Suppose our knight walks randomly through the chess lattice. At each point, he evaluates all possible moves and chooses one randomly with all possible moves having equal probability. How long on average will it take our knight to return to where he started?

As described in the post about the two dimensional problem, we can find the average return time using a theorem about Markov chains.

The solution is to view the problem as a random walk on a graph. The vertices of the graph are the squares of a chess board and the edges connect legal knight moves. The general solution for the time to first return is simply 2N/k where N is the number of edges in the graph, and k is the number of edges meeting at the starting point.

The problem reduces to counting N and k. This is tedious in two dimensions, and gets harder in higher dimensions. Rather than go through a combinatorial argument, I’ll show how to compute the result with a little Python code.

To count the number of edges N, we’ll add up the number of edges at each node in our graph, and then divide by 2 since this process will count each edge twice. We will iterate over our lattice, generate all potential moves, and discard those that would move outside the lattice.

from numpy import all
from itertools import product, combinations

def legal(v):
    return all([ 0 <= t < 8 for t in v])

count = 0
for position in product(range(8), range(8), range(8)):
    # Choose two directions
    for d in combinations(range(3), 2):
        # Move 1 step in one direction
        # and 2 steps in the other.
        for step in [1, 2]:
            for sign0 in [-1, 1]:
                for sign1 in [-1, 1]:
                    move = list(position)
                    move[d[0]] += sign0*step
                    move[d[1]] += sign1*(3-step)
                    if legal(move):
                        count += 1
print(count // 2)

This tells us that there are N = 4,032 nodes in our graph of possible moves. The number of starting moves k depends on our starting point. For example, if we start at a corner, then we have 6 possibilities. In this case we should expect our knight to return to his starting point in an average of 2*4032/6 = 1,344 moves.

We can easily modify the code above. To look at different size lattices, we could change all the 8’s above. The function legal would need more work if the lattice was not the same size in each dimensions.

We could also look at four dimensional chess by adding one more range to the position loop and changing the combinations to come from range(4) rather than range(3). In case you’re curious, in four dimensions, the graph capturing legal knight moves in an 8 by 8 by 8 by 8 lattice would have 64,512 edges. If our knight started in a corner, he’d have 12 possible starting moves, so we’d expect him to return to his starting position on average after 5,275 moves.

Probability of coprime sets

The latest blog post from Gödel’s Lost Letter and P=NP looks at the problem of finding relatively prime pairs of large numbers. In particular, they want a deterministic algorithm. They mention in passing that the probability of a set of k large integers being relatively prime (coprime) is 1/ζ(k) where ζ is the Riemann zeta function. This blog post will look closer at that statement.

How large is large?

What does it mean that the numbers are large? That they are large enough that the asymptotic result given by the zeta function is accurate enough. We’ll explore this with simulation code shortly.

The exact statement is that in the limit as N goes to infinity, the probability of k integers chosen uniformly from 1 to N being coprime converges to 1/ζ(k). I won’t go into the proof here, but in a nutshell, it falls out of Euler’s product formula for the zeta function.

What does it mean for a set of numbers to be coprime?

The probability above seemed wrong to me at first. The function ζ(k) decreases as k increases, and so its reciprocal increases. That means it’s more likely that a set of numbers is coprime as the size of the set increases. But the larger the set of numbers, the more likely it is that two will have a common factor, so shouldn’t the probability that they’re coprime go down with k, not up?

The resolution to the apparent contradiction is that I had the wrong idea of coprime in mind. Two integers are coprime if they have no prime factors in common. How do you generalize that to more than two integers? You could define a set of numbers to be coprime if every pair of numbers in the set is coprime. That’s the definition I had in mind. But that’s not the standard definition. The right definition here is that the greatest common divisor of the set is 1. For example, (6, 14, 21) would be coprime because no single prime divides all three numbers, even though no pair of numbers from the set is coprime.

Python simulation

Let’s write some Python code to try this out. We’ll randomly generate some sets of large numbers and see how often they’re coprime.

How do you generate large random integers in Python? You can use the function getrandbits to generate numbers as large as you like. In the code below we’ll generate 128-bit integers.

How do you compute the greatest common divisor in Python? There’s a library function gcd, but it only takes two integers. We’ll use the reduce function to fold gcd over a list of integers.

How do you compute the zeta function in Python? SciPy doesn’t have an implementation of ζ(x) per se, but it does have an implementation of ζ(x)-1. Why not just implement ζ itself? Because for large x, ζ(x) is close to 1, so more precision can be saved by computing ζ – 1.

Putting these pieces together, here’s code to randomly generate triples of 128-bit integers, see how often they’re coprime, and compare the result to 1/ζ(3).

    from random import getrandbits
    from fractions import gcd
    from functools import reduce
    from scipy.special import zetac
    
    def riemann_zeta(x):
        return zetac(x) + 1
    
    k = 3
    size = 128
    N = 10000
    
    coprime_count = 0
    for _ in range(N):
        x = [getrandbits(size) for _ in range(k)]
        if reduce(gcd, x) == 1:
            coprime_count += 1
    print(coprime_count/N)
    print(1/riemann_zeta(k))

When I ran this I got

    0.8363
    0.831907372581

Simulation agrees with theory to two decimal places, which is just what we should expect from 10,000 repetitions. (We’d expect error on the order of 1/√N.)

Here’s what I got when I changed k to 4:

    0.9234
    0.923938402922

And here’s what I got for k set to 5:

    0.9632
    0.964387340429

Related posts