So far 51 Mersenne primes have been discovered [1]. Maybe that’s all there are, but it is conjectured that there are infinitely many Mersenne primes. In fact, it has been conjectured that as *x* increases, the number of primes *p* ≤ *x* such that 2^{p} − 1 is prime is asymptotically

*e*^{γ} log *x* / log 2

where γ is the Euler-Mascheroni constant. For a heuristic derivation of this conjecture, see Conjecture 3.20 in *Not Always Buried Deep*.

How does the actual number of Mersenne primes compare to the number predicted by the conjecture? We’ll construct a plot below using Python. Note that the conjecture is asymptotic, and so it could make poor predictions for now and still be true for much larger numbers. But it appears to make fairly good predictions over the range where we have discovered Mersenne primes.

import numpy as np
import matplotlib.pyplot as plt

# p's for which 2^p - 1 is prime.
# See https://oeis.org/A000043
ps = [2, 3, 5, ... , 82589933]

# x has 200 points from 10^1 to 10^8
# spaced evenly on a logarithmic scale
x = np.logspace(1, 8, 200)

# number of p's less than x such that 2^p - 1 is prime
actual = [np.searchsorted(ps, t) for t in x]

exp_gamma = np.exp(0.5772156649)
predicted = [exp_gamma*np.log2(t) for t in x]

plt.plot(x, actual)
plt.plot(x, predicted, "--")
plt.xscale("log")
plt.xlabel("p")
plt.ylabel(r"Mersenne primes $\leq 2^p-1$")
plt.legend(["actual", "predicted"])
plt.show()

[1] Fifty-one Mersenne primes have been verified, but they may not be consecutive: it has not yet been verified that there are no undiscovered Mersenne primes between the 47th and 51st known ones. The plot in this post assumes the known Mersenne primes are consecutive, and so it is speculative toward the right end.

Alex Kontorovich posted a long thread on Twitter last week explaining why he thinks the Collatz conjecture might be false. His post came a few days after Tao’s paper was posted giving new partial results toward proving the Collatz conjecture is true.

Tao cites earlier work by Kontorovich. The two authors don’t contradict each other. Both are interested in the Collatz problem and have produced very technical papers studying the problem from different angles. See Kontorovich’s papers from 2006 and 2009.

Kontorovich’s 2009 paper looks at both the 3*n* + 1 problem and the 5*n* + 1 problem, the latter simply replacing the “3” in the Collatz conjecture with a “5”. The 5*n* + 1 problem has cycles, such as {13, 66, 33, 166, 83, 416, 208, 104, 52, 26}. It is conjectured that the sequence produced by the 5*n* + 1 problem diverges for almost all inputs.
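As a quick sketch, tracing the orbit of 13 under the 5*n* + 1 map (halve if even, multiply by 5 and add 1 if odd) shows the cycle, including the intermediate even values:

```python
def step(n):
    # One step of the 5n + 1 map: halve if even, else multiply by 5 and add 1.
    return n // 2 if n % 2 == 0 else 5 * n + 1

# Follow the orbit of 13 until a value repeats.
seen = []
n = 13
while n not in seen:
    seen.append(n)
    n = step(n)

print(seen)  # [13, 66, 33, 166, 83, 416, 208, 104, 52, 26]
```

The loop stops when it returns to 13, confirming the cycle.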

In Perl, if you have a variable `$x` which equals 42, then the string

"The answer is $x."

will expand to “The answer is 42.” Perl requires variables to start with *sigils*, like the `$` in front of scalar variables. Sigils are widely considered to be ugly, but they have their benefits. Here, for example, `$x` is clearly a variable name, whereas `x` would not be.

You can do something similar to Perl’s string interpolation in Python with so-called f-strings. If you put an `f` in front of an opening quotation mark, an expression in braces will be replaced with its value.

>>> x = 42
>>> f"The answer is {x}."
'The answer is 42.'

You could also say

>>> f"The answer is {6*7}."

for example. The f-string is just a string; it’s only printed because we’re working from the Python REPL.

The `glue` package for R lets you do something very similar to Python’s f-strings.

> library(glue)
> x <- 42
> glue("The answer is {x}.")
The answer is 42.
> glue("The answer is {6*7}.")
The answer is 42.

As with f-strings, `glue` returns a string. It doesn’t print the string, though the string is displayed because we’re working from the REPL, the R REPL in this case.

Now suppose you wanted to create a check sum for text typed on a computer keyboard. You want to detect any change where a single key was wrongly typed by using an adjacent key.

You don’t need many characters for the check sum because you’re not trying to detect arbitrary changes, such as typing H for A on a QWERTY keyboard. You’re only trying to detect, for example, if someone typed Q, W, S, or Z for A. In fact you would only need one of five characters for the check sum.

Here’s how to construct the check sum. Think of the keys of the keyboard as a map, say by drawing boundaries through the spaces between the keys. By the four color theorem, you can assign the numbers 0, 1, 2, and 3 to each key so that no two adjacent keys have the same number. Concatenate all these digits and interpret the result as a base 4 number. Then take the remainder when the number is divided by 5. That’s your check sum. As proved here, this will detect any typo that hits an adjacent key. It will also detect transpositions of adjacent keys.
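Here is a minimal sketch of the scheme in Python. The coloring dictionary below covers only a few keys and is made up for illustration; a real scheme would four-color the entire keyboard. Changing one base-4 digit changes the number by *d*·4^{k} with 0 < *d* < 4, and since 4 ≡ −1 (mod 5), that change is nonzero mod 5, which is why a single adjacent-key typo is always caught.

```python
# Hypothetical 4-coloring of a few keys; a full scheme would color every key
# so that no two physically adjacent keys share a number (possible by the
# four color theorem).
color = {"A": 0, "S": 1, "D": 0, "F": 1, "Q": 2, "W": 3, "E": 2}

def checksum(text):
    # Interpret the concatenated colors as a base 4 number, then reduce mod 5.
    n = 0
    for ch in text:
        n = 4 * n + color[ch]
    return n % 5

print(checksum("ASDF"))  # 2
print(checksum("ASSF"))  # 1: the adjacent-key typo D -> S changes the check sum
```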

Note that this doesn’t assume anything about the particular keyboard. You could have any number of keys, and the keys could have any shape. You could even define “adjacent” in some non-geometrical way as long as your adjacency graph is planar.

A VIN (vehicle identification number) is a string of 17 characters that uniquely identifies a car or motorcycle. These numbers are used around the world and have three standardized formats: one for North America, one for the EU, and one for the rest of the world.

The characters used in a VIN are digits and capital letters. The letters I, O, and Q are not used to avoid confusion with the numerals 0, 1, and 9. So if you’re not sure whether a character is a digit or a letter, it’s probably a digit.

It would have been better to exclude S than Q. A lower case q looks sorta like a 9, but VINs use capital letters, and an S looks like a 5.

The various parts of a VIN have particular meanings, as documented in the Wikipedia article on VINs. I want to focus on just the check sum, a character whose purpose is to help detect errors in the other characters.

Of the three standards for VINs, only the North American one requires a check sum. The check sum is in the middle of the VIN, the 9th character.

The scheme for computing the check sum is both complicated and weak. The end result is either a digit or an X. There are 33 possibilities for each character (10 digits + 23 letters) and 11 possibilities for a check sum, so the check sum cannot possibly detect all changes to even a single character.

The check sum is computed by first converting all letters to digits, computing a weighted sum of the 17 digits, and taking the remainder by 11. The weights for the 17 characters are

8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2

I don’t see any reason for these weights other than that adjacent weights are different, which is enough to detect transposition of consecutive digits (and characters might not be digits). Maybe the process was deliberately complicated in an attempt to provide a little security by obscurity.

There’s an interesting historical quirk in how the letters are converted to digits: each letter is mapped to the last digit of its **EBCDIC** code.

EBCDIC?! Why not ASCII? Both EBCDIC and ASCII go back to 1963. VINs date back to 1954 in the US but were standardized in 1981. Presumably the check sum algorithm using EBCDIC digits became a de facto standard before ASCII was ubiquitous.
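The whole scheme fits in a few lines of Python. The transliteration table below (A–H → 1–8, J–N → 1–5, P → 7, R → 9, S–Z → 2–9) is the standard mapping, i.e. the last digit of each letter’s EBCDIC code, and the example VIN is a commonly cited test case; treat both as assumptions to check against the official documentation.

```python
# Standard VIN transliteration: each letter maps to the last digit of its
# EBCDIC code. I, O, and Q never appear in VINs.
TRANSLIT = {c: v for c, v in zip("ABCDEFGH", range(1, 9))}
TRANSLIT.update(zip("JKLMN", range(1, 6)))
TRANSLIT.update({"P": 7, "R": 9})
TRANSLIT.update(zip("STUVWXYZ", range(2, 10)))

WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]

def check_digit(vin):
    # Weighted sum of the transliterated characters, reduced mod 11.
    # The 9th character (weight 0) is the check digit itself; 10 becomes X.
    total = sum(w * (int(c) if c.isdigit() else TRANSLIT[c])
                for w, c in zip(WEIGHTS, vin))
    r = total % 11
    return "X" if r == 10 else str(r)

print(check_digit("1M8GDM9AXKP042788"))  # X, matching the 9th character
```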

Any error detection scheme that uses 11 characters to detect changes in 33 characters is necessarily weak.

A much better approach would be to use a slight variation on the check sum algorithm Douglas Crockford recommended for base 32 encoding described here. Crockford says to take a string of characters from an alphabet of 32 characters, interpret it as a base 32 number, and take the remainder by 37 as the check sum. The same algorithm would work for an alphabet of 33 characters. All that matters is that the number of possible characters is less than 37.

Since the check sum is a number between 0 and 36 inclusive, you need 37 characters to represent it. Crockford recommended the symbols *, ~, $, =, and U as the extra symbols in his base 32 system. His system didn’t use the letter U, but VINs do. Here we only need four more characters, so we could use *, ~, $, and =.

The drawback to this system is that it requires four new symbols. The advantage is that any change to a single character would be detected, as would any transposition of adjacent characters. This is proved here.
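A minimal sketch of the Crockford-style check sum adapted to a 33-character alphabet follows. The ordering of the alphabet and the choice of extra symbols are assumptions for illustration, not part of any standard. A single-character change alters the value by Δ·33^{k} with 0 < Δ < 37, which is never 0 mod 37, and a transposition of adjacent characters changes it by a multiple of 32·33^{k}, also never 0 mod 37.

```python
# VIN's 33 characters (digits plus letters except I, O, Q), in an assumed order.
ALPHABET = "0123456789ABCDEFGHJKLMNPRSTUVWXYZ"
# 37 check symbols: the 33 above plus four extras, as suggested in the post.
CHECK_SYMBOLS = ALPHABET + "*~$="

def crockford_check(s):
    # Interpret s as a base-33 number and reduce mod 37.
    n = 0
    for ch in s:
        n = 33 * n + ALPHABET.index(ch)
    return CHECK_SYMBOLS[n % 37]

print(crockford_check("ABC123"))
```

Any single-character substitution or adjacent transposition changes the check symbol, e.g. `crockford_check("ABC124")` and `crockford_check("1ABC23")` both differ from the value above.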

The Collatz conjecture, also known as the 3n+1 problem, asks whether the following function terminates for all positive integer arguments *n*.

def collatz(n):
    if n == 1:
        return 1
    elif n % 2 == 0:
        return collatz(n // 2)  # integer division keeps n an integer
    else:
        return collatz(3*n + 1)

In words, this says to start with a positive integer. Repeatedly either divide it by 2 if it’s even, or multiply it by 3 and add 1 if it’s odd. Will this sequence always reach 1?

The Collatz conjecture is a great example of how hard it can be to thoroughly understand even a few lines of code.

Terence Tao announced today that he has new partial results toward proving the Collatz conjecture. His blog post and arXiv paper are both entitled “Almost all Collatz orbits attain almost bounded values.”

When someone like Tao uses the word “almost,” it is a term of art, a common word used as a technical term. He is using “almost” as it is used as a technical term in number theory, which is different from the way the word is used technically in measure theory.

I get email routinely from people who believe they have a proof of the Collatz conjecture. These emails are inevitably from amateurs. The proofs are always short, elementary, and self-contained.

The contrast with Tao’s result is stark. Tao has won the Fields Medal, arguably the highest prize in mathematics [1], and a couple dozen other awards. Amateurs can and do solve open problems, but it’s uncommon.

Tao’s proof is 48 pages of dense, advanced mathematics, building on the work of other researchers. Even so, he doesn’t claim to have a complete proof, but partial results. That is how big conjectures typically fall: by numerous people chipping away at them, building on each other’s work.

[1] Some say the Abel prize is more prestigious because it’s more of a lifetime achievement award. Surely Tao will win that one too when he’s older.

US keyboards can often produce 101 symbols, which suggests 101 symbols would be enough for most English text. Seven bits would be enough to encode these symbols since 2^{7} = 128, and that’s what **ASCII** does. It represents each character with 8 bits since computers work with bits in groups of sizes that are powers of 2, but the first bit is always 0 because it’s not needed. **Extended ASCII** uses the left over space in ASCII to encode more characters.

A total of 256 characters might serve some users well, but it wouldn’t begin to let you represent, for example, Chinese. Unicode initially wanted to use two bytes instead of one byte to represent characters, which would allow for 2^{16} = 65,536 possibilities, enough to capture a lot of the world’s writing systems. But not all, and so Unicode expanded to four bytes.

If you were to store English text using two bytes for every letter, half the space would be wasted storing zeros. And if you used four bytes per letter, three quarters of the space would be wasted. *Without some kind of encoding, every file containing English text would be two or four times larger than necessary*. And not just English, but every language that can be represented with ASCII.

UTF-8 is a way of encoding Unicode so that an ASCII text file encodes to itself. No wasted space, beyond the initial bit of every byte ASCII doesn’t use. And if your file is mostly ASCII text with a few non-ASCII characters sprinkled in, the non-ASCII characters just make your file a little longer. You don’t have to suddenly make every character take up twice or four times as much space just because you want to use, say, a Euro sign € (U+20AC).

Since the first bit of ASCII characters is set to zero, bytes with the first bit set to 1 are unused and can be used specially.

When software reading UTF-8 comes across a byte starting with 1, it counts how many 1’s follow before encountering a 0. For example, in a byte of the form 110xxxxx, there’s a single 1 following the initial 1. Let *n* be the number of 1’s between the initial 1 and the first 0. The remaining bits in this byte and some bits in the next *n* bytes will represent a Unicode character. There’s no need for *n* to be bigger than 3 for reasons we’ll get to later. That is, it takes at most four bytes to represent a Unicode character using UTF-8.

So a byte of the form 110xxxxx says the first five bits of a Unicode character are stored at the end of this byte, and the rest of the bits are coming in the next byte.

A byte of the form 1110xxxx contains four bits of a Unicode character and says that the rest of the bits are coming over the next two bytes.

A byte of the form 11110xxx contains three bits of a Unicode character and says that the rest of the bits are coming over the next three bytes.

Following the initial byte announcing the beginning of a character spread over multiple bytes, bits are stored in bytes of the form 10xxxxxx. Since the initial bytes of a multibyte sequence start with two 1 bits, there’s no ambiguity: a byte starting with 10 cannot mark the start of a new multibyte sequence. That is, UTF-8 is self-punctuating.

So multibyte sequences have one of the following forms.

110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If you count the x’s in the bottom row, there are 21 of them. So this scheme can only represent numbers with up to 21 bits. Don’t we need 32 bits? It turns out we don’t.

Although a Unicode character is ostensibly a 32-bit number, it actually takes at most 21 bits to encode a Unicode character for reasons explained here. This is why *n*, the number of 1’s following the initial 1 at the beginning of a multibyte sequence, only needs to be 1, 2, or 3. The UTF-8 encoding scheme could be extended to allow *n* = 4, 5, or 6, but this is unnecessary.
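The Euro sign mentioned earlier makes a concrete example. U+20AC needs 14 bits, more than the 11 bits a two-byte sequence can hold, so it gets the three-byte form:

```python
# U+20AC is 0b0010000010101100, which needs the three-byte form
# 1110xxxx 10xxxxxx 10xxxxxx.
b = "€".encode("utf-8")
print(b.hex())  # e282ac
print([format(byte, "08b") for byte in b])
# ['11100010', '10000010', '10101100']
```

Stripping the marker bits leaves the payload 0010 000010 101100, which is 0x20AC again.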

UTF-8 lets you take an ordinary ASCII file and consider it a Unicode file encoded with UTF-8. So UTF-8 is as efficient as ASCII in terms of space. But not in terms of time. If software knows that a file is in fact ASCII, it can take each byte at face value, not having to check whether it is the first byte of a multibyte sequence.

And while plain ASCII is legal UTF-8, extended ASCII is not. So extended ASCII characters would now take two bytes where they used to take one. My previous post was about the confusion that could result from software interpreting a UTF-8 encoded file as an extended ASCII file.

Say the CSV file looked like this:

foo,bar
1,2
3,4

I read the file into R with

df <- read.csv("foobar.csv", header=TRUE)

and could access the second column as `df$bar` but could not access the first column as `df$foo`. What’s going on?

When I ran `names(df)` it showed me that the first column was named not `foo` but `ï..foo`. I opened the CSV file in a hex editor and saw this:

efbb bf66 6f6f 2c62 6172 0d0a 312c 320d

The ASCII code for f is 0x66, o is 0x6f, etc. and so the file makes sense, starting with the fourth byte.

If you saw my post about Unicode the other day, you may have seen Daniel Lemire’s comment:

There are various byte-order masks like EF BB BF for UTF-8 (unused).

Aha! The first three bytes of my data file are exactly the byte order mark that Daniel mentioned. These bytes are intended to announce that the file should be read as UTF-8, a way of encoding Unicode that is equivalent to ASCII if the characters in the file are in the range of ASCII.

Now we can see where the funny characters in front of “foo” came from. Instead of interpreting EF BB BF as a byte order mark, R interpreted the first byte 0xEF as U+00EF, “Latin Small Letter I with Diaeresis.” I don’t know how BB and BF became periods (U+002E). But if I dump the file to a Windows command prompt, I see the first line as

ï»¿foo,bar

with the first three characters being the Unicode characters U+00EF, U+00BB, and U+00BF.
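A quick sketch in Python (rather than R) shows the same bytes read two ways: misread as Latin-1, the BOM turns into three stray characters, while Python’s `utf-8-sig` codec strips it.

```python
# The first bytes of the file, as seen in the hex editor.
data = bytes.fromhex("efbbbf666f6f2c626172")

# Misread as Latin-1, the BOM bytes become three stray characters.
print(data.decode("latin-1"))    # ï»¿foo,bar

# The utf-8-sig codec strips the byte order mark.
print(data.decode("utf-8-sig"))  # foo,bar
```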

How to fix the encoding problem with R? The `read.csv` function has an optional `encoding` parameter. I tried setting this parameter to “utf-8” and “utf8”. Neither made any difference. I looked at the R documentation, and it seems I need to set it to “UTF-8”. When I did that, the name of the first column became `X.U.FEFF.foo` [1]. I don’t know what’s up with that, except FEFF is the byte order mark (BOM) I mentioned in my Unicode post.

Apparently my troubles started when I exported my Excel file as CSV UTF-8. I converted the UTF-8 file to ASCII using Notepad and everything worked. I also could have saved the file directly to ASCII. If you look at the list of Excel export options, you’ll first see CSV UTF-8 (that’s why I picked it), but if you go further down you’ll see an option that’s simply CSV, implicitly in ASCII.

Unicode is great when it works. This blog is Unicode encoded as UTF-8, as are most pages on the web. But then you run into weird things like the problem described in this post. Does the fault lie with Excel? With R? With me? I don’t know, but I do know that the problem goes away when I stick to ASCII.

***

[1] A couple people pointed out in the comments that you could use `fileEncoding="UTF-8-BOM"` to fix the problem. This works, though I didn’t see it in the documentation the first time. The `read.csv` function takes an `encoding` parameter that appears to be for this purpose, but it is a decoy. You need the `fileEncoding` parameter. With enough persistence you’ll eventually find that `"UTF-8-BOM"` is a possible value for `fileEncoding`.

Researchers quantified the information content per syllable in 17 different languages by calculating Shannon entropy. When you multiply the information per syllable by the number of syllables per second, you get around 39 bits per second across a wide variety of languages.

If a language has *N* possible syllables, and the probability of the *i*th syllable occurring in speech is *p*_{i}, then the average information content of a syllable, as measured by Shannon entropy, is

*H* = −Σ_{i=1}^{N} *p*_{i} log_{2} *p*_{i}

For example, if a language had only eight possible syllables, all equally likely, then each syllable would carry 3 bits of information. And in general, if there were 2^{n} syllables, all equally likely, then the information content per syllable would be *n* bits. Just like *n* zeros and ones, hence the term bits.

Of course not all syllables are equally likely to occur, and so it’s not enough to know the number of syllables; you also need to know their relative frequency. For a fixed number of syllables, the more evenly the frequencies are distributed, the more information is carried per syllable.
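Shannon entropy is short to compute. A minimal sketch, using the eight-syllable example above and a skewed distribution over the same eight syllables for contrast:

```python
from math import log2

def entropy(probs):
    # Shannon entropy in bits: -sum of p_i log2(p_i), skipping zero probabilities.
    return -sum(p * log2(p) for p in probs if p > 0)

# Eight equally likely syllables carry 3 bits each.
print(entropy([1/8] * 8))  # 3.0

# A skewed distribution over eight syllables carries less information.
print(entropy([1/2] + [1/14] * 7))
```

The uniform distribution maximizes the entropy for a fixed number of syllables, which is the point made above.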

If ancient languages conveyed information at 39 bits per second, as a variety of modern languages do, one could calculate the entropy of the language’s syllables and divide 39 by the entropy to estimate how many syllables the speakers spoke per second.

According to this overview of the research,

Japanese, which has only 643 syllables, had an information density of about 5 bits per syllable, whereas English, with its 6949 syllables, had a density of just over 7 bits per syllable. Vietnamese, with its complex system of six tones (each of which can further differentiate a syllable), topped the charts at 8 bits per syllable.

One could do the same calculations for Latin, ancient Greek, or Anglo-Saxon that the researchers did for Japanese, English, and Vietnamese.

If all 643 syllables of Japanese were equally likely, the language would convey −log_{2}(1/643) = 9.3 bits of information per syllable. The overview says Japanese carries 5 bits per syllable, and so the efficiency of the language is 5/9.3 or about 54%.

If all 6949 syllables of English were equally likely, a syllable would carry 12.7 bits of information. Since English carries around 7 bits of information per syllable, the efficiency is 7/12.7 or about 55%.

Taking a wild guess by extrapolating from only two data points, maybe around 55% efficiency is common. If so, you could estimate the entropy per syllable of a language just from counting syllables.

Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

The version number is a good reminder. I’m used to the command `python` bringing up Python 3+, so seeing the text above would remind me that on that computer I need to type `python3` rather than simply `python`.

But if you’re working at the command line and jumping over to Python for a quick calculation, the start up verbiage separates your previous work from your current work by a few lines. This isn’t such a big deal with Python, but it is with R:

R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

By the time you see all that, your previous work may have scrolled out of sight.

There’s a simple solution: use the option `-q` for quiet mode. Then you can jump in and out of your REPL with a minimum of ceremony and keep your previous work on screen.

For example, the following shows how you can use Python and bc without a lot of wasted vertical space.

> python -q
>>> 3+4
7
>>> quit()
> bc -q
3+4
7
quit

Python added the `-q` option in version 3, which the example above uses. Python 2 does not have an explicit quiet mode option, but Mike S points out a clever workaround in the comments. You can open a Python 2 REPL in quiet mode by using the following.

python -ic ""

The combination of the `-i` and `-c` options tells Python to run the following script and then enter interpreter mode. In this case the script is just the empty string, so Python does nothing but quietly enter the interpreter.

R has a quiet mode option, but by default R has the annoying habit of asking whether you want to save a workspace image when you quit.

> R.exe -q
> 3+4
[1] 7
> quit()
Save workspace image? [y/n/c]: n

I have *never* wanted R to save a workspace image; I just don’t work that way. I’d rather keep my state in scripts. So I alias R to launch with the `--no-save` option.

So if you launch R with `-q` and `--no-save`, it takes up no more vertical space than Python or bc.

Variable names in the `bc` programming language cannot contain capital letters. I think I understand why: capital letters are reserved for hexadecimal constants, though in a weird sort of way.

At first, variable names in `bc` could only be one letter long. (This is still the case in the POSIX version of `bc` but not in Gnu `bc`.) And since A through F were reserved, you might as well make things simple and just reserve all capital letters. Maybe that was the thinking.

If you enter A at the `bc` prompt, you get back 10. Enter B and you get 11, etc. So `bc` assumes a number containing a hex character is a hex number, right? Actually no. It assumes that any *single* letter that could be a hex digit is one. But in numbers with multiple digits, it interprets letters as 9’s. Yes, 9’s.

The full story is a little more complicated. `bc` will work in multiple bases, and it lets you set the input and output bases with the variables `ibase` and `obase` respectively. Both are set to 10 by default. When a number contains multiple characters, letters less than `ibase` are interpreted as you’d expect. But letters greater than or equal to `ibase` are interpreted as `ibase` − 1.

So in base 12 in a number represented by more than one character, A means 10 and B means 11. But C, D, E, and F also mean 11. For example, A0 is 120 and BB is 143. But CC is also 143.

If `ibase` is set to 10, then the expression `E == F` evaluates to false, because 14 does not equal 15. But the expression `EE == FF` evaluates to true, because 99 equals 99.
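The rule can be mimicked in a few lines of Python. This is a sketch that simulates the behavior described above; it does not call `bc` itself.

```python
DIGITS = "0123456789ABCDEF"

def bc_value(s, ibase=10):
    # A single letter keeps its hex value, even if it exceeds ibase.
    if len(s) == 1:
        return DIGITS.index(s)
    # In multi-character numbers, digits >= ibase are clamped to ibase - 1.
    n = 0
    for ch in s:
        n = ibase * n + min(DIGITS.index(ch), ibase - 1)
    return n

print(bc_value("A0", 12))  # 120
print(bc_value("BB", 12))  # 143
print(bc_value("CC", 12))  # 143: C clamps to 11 in base 12
print(bc_value("EE") == bc_value("FF"))  # True: both clamp to 99 in base 10
```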

If you set `ibase` to 16, then you’re in hex mode and the letters A through F behave exactly as expected.

If you want to go back to base 10, you need to set `ibase` to A, not 10. If you’re in hex mode, every number you enter is interpreted in hex, and so “10” is interpreted as the number we usually write as 16. In any base, setting `ibase` to 10 does nothing because it sets the base equal to the base.

Isaac Asimov [1] asked how many terms of the series

π = 4 − 4/3 + 4/5 − 4/7 + 4/9 − …

you would have to sum before doing better than the approximation

π ≈ 355/113.

A couple years later Richard Johnsonbaugh [2] answered Asimov’s question in the course of an article on techniques for computing the sum of series. Johnsonbaugh said you would need at least *N* = 3,748,630 terms.

Johnsonbaugh’s answer is based on exact calculations. I wondered how well you’d do with *N* terms using ordinary floating point arithmetic. Would there be so much rounding error that the result is terrible?

I wrote the most direct implementation in Python, with no tricks to improve the accuracy.

from math import pi

s = 0
N = 3748630

for n in range(1, N+1):
    s += (-1)**(n+1) * 4/(2*n - 1)

I intended to follow this up by showing that you could do better by summing all the positive and negative terms separately, then doing one subtraction at the end. But the naive version actually does quite well. It’s essentially as accurate as 355/113, with both approximations having an error of 2.66764 × 10^{-7}.
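To make the comparison concrete, here is a sketch that sums the same *N* terms (toggling a sign variable rather than computing powers of −1, for speed) and compares the two errors. It takes a few seconds to run.

```python
from math import pi

# Sum the first N terms of the Leibniz series.
N = 3748630
s = 0.0
sign = 1.0
for n in range(1, N + 1):
    s += sign * 4 / (2 * n - 1)
    sign = -sign

series_error = abs(s - pi)
fraction_error = abs(355 / 113 - pi)
print(series_error, fraction_error)  # both approximately 2.7e-07
```

The two errors agree to several significant figures, so rounding error in the naive sum is negligible at this scale.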

Next, I translated my program to `bc` [3] so I could control the precision. `bc` lets you specify your desired precision with its `scale` parameter.

scale = 20
pi = 4*a(1)
s = 0
m = 3748630

for (n = 1; n <= m; n++) {
    s += (-1)^(n+1) * 4/(2*n - 1)
}

Floating point precision is between 15 and 16 decimal places. I added more precision by setting `scale` to 20, i.e. carrying out calculations to 20 decimal places, and summed the series again.

The absolute error in the series was less than the error in 355/113 in the 14th decimal place. When I used one less term in the series, its error was larger than the error in 355/113 in the 14th decimal place. In other words, the calculations suggest Johnsonbaugh found exactly the minimum number of terms needed.

I doubt Johnsonbaugh ever verified his result computationally. He doesn’t mention computer calculations in his paper [4], and it would have been difficult in 1979 to have access to the necessary hardware and software.

If he had access to an Apple II at the time, it would have run at 1 MHz. My calculation took around a minute to run on a 2 GHz laptop, so I’m guessing the same calculation would have taken a day or two on an Apple II. This assumes he could find extended precision software like `bc` on an Apple II, which is doubtful.

The `bc` programming language had been written four years earlier, so someone could have run a program like the one above on a Unix machine somewhere. However, such machines were hard to find, and their owners would have been reluctant to give up a couple days of compute time for a guest to run a frivolous calculation.

[1] Isaac Asimov, Asimov on Numbers, 1977.

[2] Richard Johnsonbaugh, Summing an Alternating Series. The American Mathematical Monthly, Vol. 86, No. 8, pp. 637–648.

[3] Notice that *N* from the Python program became *m* in `bc`. I’ve used `bc` occasionally for years, and didn’t know until now that you cannot use capital letters for variables in standard `bc`. I guess I never tried before. The next post explains why `bc` doesn’t allow capital letters in variable names.

[4] Johnsonbaugh’s paper does include some numerical calculations, but he only sums up 500 terms, not millions of terms, and it appears he only uses ordinary precision, not extended precision.

An NDC (National Drug Code) contains 10 digits, separated into three segments by two dashes. The three segments are the labeler code, product code, and package code. The FDA assigns the labeler codes to companies, and each company assigns its own product and package codes.

The segments are of variable length, and so the dashes are significant. The labeler code could be 4 or 5 digits, the product code 3 or 4 digits, and the package code 1 or 2 digits. The total number of digits must be 10, so there are three possible combinations:

- 4-4-2
- 5-3-2
- 5-4-1.

There’s no way to look at just the digits and know how to separate them into three segments. My previous post looked at self-punctuating codes. The digits of NDC codes are not self-punctuating because they require the dashes. The digit combinations are supposed to be unique, but you can’t tell how to parse a set of digits from the digits alone.

I downloaded the NDC data from the FDA to verify whether the codes work as documented, and to see the relative frequency of various formats.

(The data change daily, so you may get different results if you do try this yourself.)

All the codes were 12 characters long, and all had the documented format as verified by the regular expression [1]

\d{4,5}-\d{3,4}-\d{1,2}
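A minimal Python sketch of the same check, using the regular expression above plus the 10-digit constraint (the names `segments` and `ndc_pattern` are mine, for illustration):

```python
import re

ndc_pattern = re.compile(r"\d{4,5}-\d{3,4}-\d{1,2}")

def segments(code):
    # Return the segment lengths if the code matches the documented format
    # and its digits total 10; otherwise return None.
    if not ndc_pattern.fullmatch(code):
        return None
    parts = [len(p) for p in code.split("-")]
    return parts if sum(parts) == 10 else None

print(segments("29500-907-77"))   # [5, 3, 2]
print(segments("29500-9077-7"))   # [5, 4, 1]
print(segments("295-0090-777"))   # None: no 3-digit labeler codes
```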

I found one exception to the rule that the sequence of digits should be unique. The command

sed "s/-//g" ndc.txt | sort | uniq -d

returned 2950090777.

The set of NDC codes contained both 29500-907-77 and 29500-9077-7.

About 60% of the codes had the form 5-3-2. About 30% had the form 5-4-1, and the remaining 10% had the form 4-4-2.

There were a total of 252,355 NDC codes with 6,532 different labelers (companies).

There were 9,448 NDC codes associated with the most prolific labeler. The 1,424 least prolific labelers had only one NDC code each. In Pareto-like fashion, the top 20% of labelers accounted for about 90% of the codes.

[1] Programming languages like Python or Perl will recognize this regular expression, but by default `grep` does not support `\d` for digits. The Gnu implementation of `grep` with the `-P` option will. It will also understand notation like `{4,5}` to mean a pattern is repeated 4 or 5 times, with or without `-P`, but I don’t think other implementations of `grep` necessarily will.

This is an example of a **prefix code**: No valid phone number is a prefix of another valid phone number. We’ll look at a few more examples of prefix codes in the context of phone numbers, then look at Roman numerals, Morse code, Unicode encoding, data compression, and RPN calculators.

It used to be true in the US that you could dial four or five digits for a local call, seven digits for a call within the area code, and ten digits for a long distance phone call. This didn’t cause any ambiguity because no local number would begin with the digits of your area code. You had to dial a 1 before dialing a long distance number, and no local or area code numbers begin with 1.

There are still parts of the US where you can dial either a seven digit number or a ten digit number. In most of the US you always enter 10 digits. This is a trivial form of prefix coding because all **fixed-length codes** are prefix codes.

A final example of prefix codes related to telephony are **country calling codes**. These codes have varying lengths, but phone exchanges can know when the country code stops and when the number within a country starts because no prefix of a country code is a valid country code.

Prefix codes are sometimes called **self-punctuating codes**. This is because you don’t need an additional symbol, a form of punctuation, to mark the end of codes.
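The prefix property is easy to test mechanically. A sketch, using the fact that in lexicographic order a prefix sorts immediately before any of its extensions:

```python
def is_prefix_free(codes):
    # No codeword may be a proper prefix of another. After sorting, it is
    # enough to compare adjacent pairs: if any codeword is a prefix of a
    # later one, it is a prefix of its immediate successor.
    codes = sorted(codes)
    return not any(b.startswith(a) for a, b in zip(codes, codes[1:]))

# Morse letters as dots and dashes alone are not prefix-free: E (.) is a
# prefix of A (.-).
print(is_prefix_free([".-", ".", "-"]))          # False

# Fixed-length codes are always prefix-free.
print(is_prefix_free(["00", "01", "10", "11"]))  # True
```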

We usually think of **Morse code** as a system of two symbols: dots and dashes. But it’s really a system of four symbols: dots, dashes, short spaces between letters, and longer spaces between words. As a system of only dots and dashes, Morse code is not self-punctuating. Without spaces, you couldn’t tell, for example, whether *dot dash* represented an A, or an E (*dot*) followed by a T (*dash*). It’s possible to design a prefix code with just two symbols, but that’s not what Morse did.

**Q codes** are not part of Morse code per se but are often used in Morse code. These are three letter codes beginning with Q. For example, QRP is an abbreviation for the question “Should I reduce power?”. Q codes are prefix codes because they have fixed length. If QR were a valid Q code by itself, that would ruin the prefix property; a recipient would not know whether to interpret QR as a complete code until listening for the next letter.

The letter components of **Roman numerals** are not a prefix code because you can’t tell whether a letter stands for a positive or negative amount until you read the next letter. For example, in CLX the X represents 10 but in CXL the X represents -10. If you wrote Roman numerals backward, the letters would form a prefix code.

In the previous post, I discussed **UTF-16** encoding of Unicode. The way UTF-16 encodes characters outside the Basic Multilingual Plane makes it a prefix code; the meaning of a surrogate doesn’t depend on any down-stream information. **UTF-8** is also a prefix code, which I discuss in detail here.

You may have seen **Huffman coding**, a form of data compression that uses a prefix code.

**Reverse Polish Notation** is an example of prefix coding. An RPN calculator doesn’t need parentheses for punctuation. You can enter calculations unambiguously with just digits and arithmetic operators because the meaning of a computation does not depend on any future input. Prefix codes are sometimes called **instantaneous codes** because of this feature.

The previous post showed how the number of Unicode characters has grown over time.

You’ll notice there was a big jump between versions 3.0 and 3.1. That will be important later on.

Unicode started out relatively small then became much more ambitious. Are they going to run out of room? How many possible Unicode characters are there?

**Short answer**: There are 1,111,998 possible Unicode characters.

**Longer answer**: There are 17×2^{16} – 2048 – 66 = 1,111,998 possible Unicode characters: seventeen 16-bit planes, with 2048 values reserved as surrogates, and 66 reserved as non-characters. More on this below.

Going one level of detail deeper, which numbers correspond to Unicode characters?

The hexadecimal numbers 0 through 10FFFF are potential Unicode characters, with the exception of surrogates and non-characters.

Unicode is divided into 17 **planes**. The first two hexadecimal “digits” indicate the plane, and the last four indicate a value within the plane. The first plane is known as the **BMP**, the **Basic Multilingual Plane**. The rest are known as **supplemental planes**.

The **surrogates** are D800 through DBFF and DC00 through DFFF. The first range of 1024 values is known as the high surrogates, and the second range of 1024 as the low surrogates.

The **non-characters** are FDD0 through FDEF and the last two values in each plane: FFFE, FFFF, 1FFFE, 1FFFF, 2FFFE, 2FFFF, …, 10FFFE, 10FFFF. This is one range of 32 non-characters, plus 34 coming from the end of each plane, for a total of 66.
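As a sanity check on the arithmetic above, here’s a short Python sketch (the helper names are mine) that counts the values from 0 through 10FFFF after excluding surrogates and non-characters:

```python
def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp):
    # FDD0 through FDEF, plus the last two code points of each plane
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFF) >= 0xFFFE

count = sum(1 for cp in range(0x110000)
            if not is_surrogate(cp) and not is_noncharacter(cp))
print(count)  # 1111998
```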

Why are there only 17 planes? And what are these mysterious surrogates and non-characters? What purpose do they serve?

The limitations of UTF-16 encoding explain why 17 planes and why surrogates. Non-characters require a different explanation.

This post mentioned at the top that the size of Unicode jumped between versions 3.0 and 3.1. Significantly, the size went from less than 2^{16} to more than 2^{16}. Unicode broke out of the Basic Multilingual Plane.

Unicode needed a way to represent more than 2^{16} characters using groups of 16 bits. The solution to this problem was UTF-16 encoding. With this encoding, the surrogate values listed above do not represent characters per se but are a kind of pointer to further values.

Sixteen supplemental planes would take 20 bits to describe, 4 to indicate the plane and 16 for the values within the plane. The idea was to use a high surrogate to represent the first 10 bits and a low surrogate to represent the last 10 bits. The values DC00 through DFFF and D800 through DBFF were unassigned at the time, so they were picked for surrogates.

In a little more detail, a character in one of the supplemental planes is represented by a hexadecimal number between 1 0000 and 10 FFFF. If we subtract off 1 0000 we get a number between 0000 and FFFFF, a 20-bit number. Take the first 10 bits and add them to D800 to get a high surrogate value. Take the last 10 bits and add them to DC00 to get a low surrogate value. This pair of surrogate values represents the value in one of the supplemental planes.
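The arithmetic above is easy to express in Python. Here’s a sketch (the function name is mine), with a check against Python’s built-in UTF-16 encoder:

```python
def surrogate_pair(codepoint):
    """Split a supplemental-plane code point into (high, low) surrogates."""
    assert 0x10000 <= codepoint <= 0x10FFFF
    v = codepoint - 0x10000          # a 20-bit number
    high = 0xD800 + (v >> 10)        # first 10 bits
    low = 0xDC00 + (v & 0x3FF)       # last 10 bits
    return high, low

# U+1F600, a supplemental-plane character, encodes as the pair D83D DE00
print([hex(x) for x in surrogate_pair(0x1F600)])
```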

When you encounter a surrogate value, you don’t need any further context to tell what it is. You don’t need to look upstream for some indication of how the bits are to be interpreted. It cannot be a BMP character, and there’s no doubt whether it is the beginning or end of a pair of surrogate values since the high surrogates and low surrogates are in different ranges.

UTF-16 can only represent 17 planes, and the Unicode Consortium decided they would not assign values that cannot be represented in UTF-16. So that’s why there are 17 planes.

That leaves the non-characters. Why are a few values reserved to never be used for characters?

One use for non-characters is to return a **null value** as an error indicator, analogous to a NaN or not-a-number in floating point calculations. A program might return FFFF, for example, to indicate that it was unable to read a character.

Another use for special non-characters is to imply which encoding method is used. For reasons that are too complicated to get into here, computers do not always store the bytes within a word in increasing order. In so-called “little endian” order, lower order bytes are stored before higher order bytes. (“Big endian” and “little endian” are allusions to the two factions in Gulliver’s Travels that crack boiled eggs on their big end and little end respectively.)

The **byte order mark** FEFF is inserted at the beginning of a file or stream to imply byte ordering. If it is received in the order FEFF then the byte stream is inferred to be using the big endian convention. But if it is received in the order FFFE then little endian is inferred because FFFE cannot be a character.
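Here’s a minimal Python sketch of that inference; the function name is hypothetical, not part of any standard library:

```python
def infer_byte_order(data):
    """Infer UTF-16 byte order from a leading byte order mark."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"     # BOM read back as FEFF
    if data[:2] == b"\xff\xfe":
        return "little-endian"  # read back as FFFE, which cannot be a character
    return "unknown"

print(infer_byte_order(b"\xfe\xff" + "hi".encode("utf-16-be")))  # big-endian
```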

The preceding paragraphs give a justification for at least two non-characters, FFFF and FFFE, but it’s not clear why 66 are reserved. There could be reasons for each plane to have its own FFFF and FFFE, which would account for 34 non-characters. I’m not clear on why FDD0 through FDEF are non-characters, though I vaguely remember there being some historical reason. In any case, people are free to use the non-characters however they see fit.

The first version of Unicode, published in 1991, had 7,191 characters. Now the latest version has 137,994 characters and so is about 19 times bigger. Here’s a plot of the number of characters in Unicode over time.

Here’s a slightly different plot where the horizontal axis is version number rather than time.

There’s plenty of room left in Unicode. The maximum number of possible Unicode characters is 1,111,998 for reasons I get into here.

]]>I am endlessly delighted by the hopeless task that the Unicode Consortium has created for themselves. … They started out just trying to unify a couple different character sets. And before they quite realized what was happening, they were grappling with decisions at the heart of how we use language, no matter how hard they tried to create policies to avoid these problems. It’s just a fun example of how weird language is and how hard human communication is and how you really can’t really get around those problems. … These are really hard problems and I do not envy them.

Reminds me of Jeffrey Snover’s remark about problems vs dilemmas: problems can be solved, but dilemmas can only be managed. Unicode faces a host of dilemmas.

Regarding Munroe’s comment about Unicode starting out small and getting more ambitious, see the next post for a plot of the number of characters as a function of time and of version number.

This post goes through an example in detail that shows how to manage special characters in several different contexts.

I recently needed to write a regular expression [1] to escape TeX special characters. I’m reading in text like `ICD9_CODE` and need to make that `ICD9\_CODE` so that TeX will understand the underscore to be a literal underscore, not a subscript instruction.

Underscore isn’t the only special character in TeX. It has ten special characters:

\ { } $ & # ^ _ % ~

The two that people most commonly stumble over are probably `$` and `%` because these are fairly common in ordinary prose. Since `%` begins a comment in TeX, importing a percent sign without escaping it will fail silently. The result is syntactically valid. It just effectively cuts off the remainder of the line.

So whenever my script sees a TeX special character that isn’t already escaped, I’d like it to escape it.

First I need to tell Python what the special characters are for TeX:

special = r"\\{}$&#^_%~"

There’s something interesting going on here. Most of the characters that are special to TeX are not special to Python. But backslash is special to both. Backslash is also special to regular expressions. The `r` prefix in front of the quotes tells Python this is a “raw” string and that it should not interpret backslashes as special. It’s saying “I literally want a string that begins with two backslashes.”

Why two backslashes? Wouldn’t one do? We’re about to use this string inside a regular expression, and backslashes are special there too. More on that shortly.

Here’s my regular expression:

re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)

I want special characters that have not already been escaped, so I’m using a negative lookbehind pattern. Negative lookbehind expressions begin with `(?<!` and end with `)`. So if, for example, I wanted to look for the string “ball” but only if it’s not preceded by “charity” I could use the regular expression

(?<!charity )ball

This expression would match “foot ball” or “foosball” but not “charity ball”.

Our lookbehind expression is complicated by the fact that the thing we’re looking back for is a special character. We’re looking for a backslash, which is a special character for regular expressions [2].

After looking behind for a backslash and making sure there isn’t one, we look for our special characters. The reason we used two backslashes in defining the variable `special` is so the regular expression engine would see two backslashes and interpret that as one literal backslash.

The second argument to `re.sub` tells it what to replace its match with. We put parentheses around the character class listing TeX special characters because we want to capture it to refer to later. Captures are referred to by position, so the first capture is \1, the second is \2, etc.

We want to tell `re.sub` to put a backslash in front of the first capture. Since backslashes are special to the regular expression engine, we send it `\\` to represent a literal backslash. When we follow this with `\1` for the first capture, the result is `\\\1` as above.

We can test our code above with the following.

line = r"a_b $200 {x} %5 x\y"

and get

a\_b \$200 \{x\} \%5 x\\y

which would cause TeX to produce output that looks like

a_b $200 {x} %5 x\y.

Note that we used a raw string for our test case. That was only necessary for the backslash near the end of the string. Without that we could have dropped the `r` in front of the opening quote.

Note that you don’t have to use raw strings. You could just escape your special characters with backslashes. But we’ve already got a lot of backslashes here. Without raw strings we’d need even more. Without raw strings we’d have to say

special = "\\\\{}$&#^_%~"

starting with four backslashes: Python reduces them to two, and the regular expression engine reduces those to one literal backslash.
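Putting the pieces above together, the whole script condenses to a short function. This is just the code developed in this post, wrapped up so it can be tested; the function name is mine:

```python
import re

special = r"\\{}$&#^_%~"   # TeX special characters, with backslash doubled

def escape_tex(line):
    # escape any TeX special character not already preceded by a backslash
    return re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)

print(escape_tex(r"a_b $200 {x} %5 x\y"))  # a\_b \$200 \{x\} \%5 x\\y
```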

- Four tips for learning regular expressions
- Unicode / LaTeX conversion
- Daily regular expression tips via Twitter

[1] Whenever I write about using regular expressions someone will complain that my solution isn’t completely general and that they can create input that will break my code. I understand that, but it works for me in my circumstances. I’m just writing scripts to get my work done, not claiming to have written hardened production software for anyone else to use.

[2] Keep context in mind. We have three languages in play: TeX, Python, and regular expressions. One of the keys to understanding regular expressions is to see them as a small language embedded inside other languages like Python. So whenever you hear a character is special, ask yourself “Special to whom?”. It’s especially confusing here because backslash is special to all three languages.

]]>In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard format for exchanging data. Some people like this, some lament this, but that’s the way it is.

A minor variation on comma-separated values is tab-separated values [1].

Why use standard Unix utilities? I’ll point out some of their quirks, which are arguments for using something else. But the assumption here is that you don’t want to use something else.

Maybe you already know the standard utilities and don’t think that learning more specialized tools is worth the effort.

Maybe you’re already at the command line and in a command line state of mind, and don’t want to interrupt your work flow by doing something else.

Maybe you’re on a computer where you don’t have the ability to install any software and so you need to work with what’s there.

Whatever your reasons, we’ll go with the assumption that we’re committed to using commands that have been around for decades.

The tools I want to look at are `cut`, `sort`, and `awk`. I wrote about cut the other day and apparently the post struck a chord with some readers. This post is a follow-up to that one.

These three utilities are standard on Unix-like systems. You can also download them for Windows from GOW. The port of `sort` will be named `gsort` in order to not conflict with the native Windows `sort` function. There’s no need to rename the other two utilities since they don’t have counterparts that ship with Windows.

The `sort` command is simple and useful. There are just a few options you’ll need to know about. The utility sorts fields as text by default, but the `-n` option tells it to sort numerically.

Since we’re talking about CSV files, you’ll need to know that `-t,` is the option to tell `sort` that fields are separated by commas rather than white space. And to specify which field to sort on, you give it the `-k` option.

The last utility, `awk`, is more than a utility. It’s a small programming language. But it works so well from the command line that you can almost think of it as a command line utility. It’s very common to pipe output to an awk program that’s only a few characters long.

You can get started quickly with `awk` by reading Greg Grothaus’ article Why you should learn just a little awk.

Now for the bad news: these programs are inconsistent in their options. The two most common things you’ll need to do when working with CSV files is to set your field delimiter to a comma and specify what field you want to grab. Unfortunately this is done differently in every utility.

`cut` uses `-d` or `--delimiter` to specify the field delimiter and `-f` or `--fields` to specify fields. Makes sense.

`sort` uses `-t` or `--field-separator` to specify the field delimiter and `-k` or `--key` to specify the field. When you’re talking about sorting things, it’s common to call the fields *keys*, and so the way `sort` specifies fields makes sense in context. I see no reason for `-t` other than `-f` was already taken. (In sorting, you talk about folding upper case to lower case, so `-f` stands for *fold*.)

`awk` uses `-F` or `--field-separator` to specify the field delimiter. At least the verbose option is consistent with `sort`. Why `-F` for the short option instead of `-f`? The latter was already taken for *file*. To tell `awk` to read a program from a file rather than the command line you use the `-f` option.

`awk` handles fields differently than `cut` and `sort`. Because it is a programming language designed to parse delimited text files, each field has a built-in variable: `$1` holds the content of the first field, `$2` the second, etc.

The following compact table summarizes how you tell each utility that you’re working with comma-separated files and that you’re interested in the second field.

|------+-----+-----|
| cut  | -d, | -f2 |
| sort | -t, | -k2 |
| awk  | -F, | $2  |
|------+-----+-----|

Some will object that the inconsistencies documented above are a good example of why you shouldn’t work with CSV files using `cut`, `sort`, and `awk`. You could use other command line utilities designed for working with CSV files. Or pull your CSV file into R or Pandas. Or import it somewhere to work with it in SQL. Etc.

[1] Things get complicated if you have a CSV file and fields contain commas inside strings. Tab-separated files are more convenient in this case, unless, of course, your strings contain tabs. The utilities mentioned here all support tab as a delimiter by default.

]]>So if you want to deidentify a data set, the HIPAA Safe Harbor provision says you should chop off the last two digits of a zip code. And even though three-digit zip codes are larger than five-digit zip codes on average, some three-digit zip codes are still sparsely populated.

But if you use three-digit zip codes, and cut out sparsely populated zip3s, then you’re OK, right?

Well, there’s still a problem if you also report state. Ordinarily a zip3 fits within one state, but not always.

Five digit zip codes are each entirely contained within a state as far as I know. But three-digit zip codes can straddle state lines. For example, about 200,000 people live in the three-digit zip code 834. The vast majority of these are in Idaho, but about 500 live in zip code 83414 which is in Wyoming. Zip code 834 is not sparsely populated, and doesn’t give much information about an individual by itself. But it is conditionally sparsely populated. It does carry a lot of information about an individual if that person lives in Wyoming.

On average, a three-digit zip code covers about 350,000 people. And so most of the time, the combination of zip3 and state covers 350,000 people. But in the example above, the combination of zip3 and state might narrow down to 500 people. In a group that small, birthday (just day of the year, not the full date) is enough to uniquely identify around 25% of the population. [1]

[1] The 25% figure came from exp(-500/365). See this post for details.
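A quick numerical check of the footnote’s approximation:

```python
import math

n, days = 500, 365
# probability that a given person's birthday is shared by no one else
# in a group of n people: (1 - 1/365)^(n-1), well approximated by exp(-n/365)
p_unique = (1 - 1/days) ** (n - 1)
print(round(p_unique, 3))           # about 0.25
print(round(math.exp(-n/days), 3))  # the exp(-500/365) approximation
```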

]]>head data.csv

to look at the first few lines of the file and got this back:

That was not at all helpful. The part I was interested in was at the beginning, but that part scrolled off the screen quickly. To see just how wide the lines are I ran

head -n 1 data.csv | wc

and found that the first line of the file is 4822 characters long.

How can you see just the first part of long lines? Use the `cut` command. It comes with Linux systems and you can download it for Windows as part of GOW.

You can see the first 30 characters of the first few lines by piping the output of `head` to `cut`.

head data.csv | cut -c -30

This shows

"GEO_ID","NAME","DP05_0001E","
"id","Geographic Area Name","E
"8600000US01379","ZCTA5 01379"
"8600000US01440","ZCTA5 01440"
"8600000US01505","ZCTA5 01505"
"8600000US01524","ZCTA5 01524"
"8600000US01529","ZCTA5 01529"
"8600000US01583","ZCTA5 01583"
"8600000US01588","ZCTA5 01588"
"8600000US01609","ZCTA5 01609"

which is much more useful. The syntax `-30` says to show up to the 30th character. You could do the opposite with `30-` to show everything starting with the 30th character. And you can show a range, such as 20-30 to show the 20th through 30th characters.

You can also use `cut` to pick out fields with the `-f` option. The default delimiter is tab, but our file is delimited with commas so we need to add `-d,` to tell it to split fields on commas.

We could see just the second column of data, for example, with

head data.csv | cut -d, -f 2

This produces

"NAME"
"Geographic Area Name"
"ZCTA5 01379"
"ZCTA5 01440"
"ZCTA5 01505"
"ZCTA5 01524"
"ZCTA5 01529"
"ZCTA5 01583"
"ZCTA5 01588"
"ZCTA5 01609"

You can also specify a range of fields, say by replacing 2 with 3-4 to see the third and fourth columns.

The humble `cut` command is a good one to have in your toolbox.

]]>

*V*(*n*) = *K* *n*^{β}

where *K* is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, *K* is often between 10 and 100 and β is often between 0.4 and 0.6.

(Note that it’s Heaps’ law, not Heap’s law. The law is named after Harold Stanley Heaps. However, true to Stigler’s law of eponymy, the law was first observed by someone else, Gustav Herdan.)

I’ll demonstrate Heaps’ law by looking at books of the Bible and then by looking at novels of Jane Austen. I’ll also look at unique words, what linguists call “hapax legomena.”

For a collection of related texts, you can estimate the parameters *K* and β from data. I decided to see how well Heaps’ law worked in predicting the number of unique words in each book of the Bible. I used the King James Version because it is easy to download from Project Gutenberg.

I converted each line to lower case, replaced all non-alphabetic characters with spaces, and split the text on spaces to obtain a list of words. This gave the following statistics:

|------------+-------+------|
| Book       | n     | V    |
|------------+-------+------|
| Genesis    | 38520 | 2448 |
| Exodus     | 32767 | 2024 |
| Leviticus  | 24621 | 1412 |
...
| III John   | 295   | 155  |
| Jude       | 609   | 295  |
| Revelation | 12003 | 1283 |
|------------+-------+------|

The parameter values that best fit the data were *K* = 10.64 and β = 0.518, in keeping with the typical ranges of these parameters.
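The fit can be done with ordinary least squares on a log-log scale, since log *V* = log *K* + β log *n*. The author doesn’t show his fitting code, so here is one way it could be done, sketched in NumPy with synthetic data standing in for the actual word counts:

```python
import numpy as np

def fit_heaps(n, V):
    # Heaps' law V = K n^beta is linear after taking logs:
    # log V = log K + beta * log n
    beta, logK = np.polyfit(np.log(n), np.log(V), 1)
    return np.exp(logK), beta

# synthetic data generated with K = 10, beta = 0.5
n = np.array([1000, 5000, 20000, 40000])
K, beta = fit_heaps(n, 10 * n**0.5)
print(round(K, 2), round(beta, 3))  # recovers 10.0 and 0.5
```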

Here’s a sample of how the actual vocabulary size and predicted vocabulary size compare.

|------------+------+-------|
| Book       | True | Model |
|------------+------+-------|
| Genesis    | 2448 | 2538  |
| Exodus     | 2024 | 2335  |
| Leviticus  | 1412 | 2013  |
...
| III John   | 155  | 203   |
| Jude       | 295  | 296   |
| Revelation | 1283 | 1387  |
|------------+------+-------|

Here’s a visual representation of the results.

It looks like the predictions are more accurate for small books, and that’s true on an absolute scale. But the relative error is actually smaller for large books as we can see by plotting again on a log-log scale.

It’s a little surprising that Heaps’ law applies well to books of the Bible since the books were composed over centuries and in two different languages. On the other hand, the same committee translated all the books at the same time. Maybe Heaps’ law applies to translations better than it applies to the original texts.

I expect Heaps’ law would fit more closely if you looked at, say, all the novels by a particular author, especially if the author wrote all the books in his or her prime. (I believe I read that someone did a vocabulary analysis of Agatha Christie’s novels and detected a decrease in her vocabulary in her latter years.)

To test this out I looked at Jane Austen’s novels on Project Gutenberg. Here’s the data:

|-----------------------+--------+------|
| Novel                 | n      | V    |
|-----------------------+--------+------|
| Northanger Abbey      | 78147  | 5995 |
| Persuasion            | 84117  | 5738 |
| Sense and Sensibility | 120716 | 6271 |
| Pride and Prejudice   | 122811 | 6258 |
| Mansfield Park        | 161454 | 7758 |
| Emma                  | 161967 | 7092 |
|-----------------------+--------+------|

The parameters in Heaps’ law work out to *K* = 121.3 and β = 0.341, a much larger *K* than before, and a smaller β.

Here’s a comparison of the actual and predicted vocabulary sizes in the novels.

|-----------------------+------+-------|
| Novel                 | True | Model |
|-----------------------+------+-------|
| Northanger Abbey      | 5995 | 5656  |
| Persuasion            | 5738 | 5799  |
| Sense and Sensibility | 6271 | 6560  |
| Pride and Prejudice   | 6258 | 6598  |
| Mansfield Park        | 7758 | 7243  |
| Emma                  | 7092 | 7251  |
|-----------------------+------+-------|

If a suspected posthumous manuscript of Jane Austen were to appear, a possible test of authenticity would be to look at its vocabulary size to see if it is consistent with her other works. One could also look at the number of words used only once, as we discuss next.

In linguistics, a hapax legomenon is a word that only appears once in a given context. The term comes from a Greek phrase meaning something said only once. The term is often shortened to just hapax.

I thought it would be interesting to look at the number of hapax legomena in each book since I could do it with a minor tweak of the code I wrote for the first part of this post.
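Here’s a sketch of that kind of counting code, using the same preprocessing described earlier: lower-case the text, replace non-alphabetic characters with spaces, and split on spaces. It’s my reconstruction, not the author’s script:

```python
import re
from collections import Counter

def vocab_and_hapax(text):
    """Return (vocabulary size, number of hapax legomena) for a text."""
    words = re.sub("[^a-z]", " ", text.lower()).split()
    counts = Counter(words)
    vocab = len(counts)                                # distinct words
    hapax = sum(1 for c in counts.values() if c == 1)  # words used only once
    return vocab, hapax
```

For example, `vocab_and_hapax("The cat sat on the mat.")` returns `(5, 4)`: five distinct words, four of which appear only once.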

Normally if someone were speaking of hapax legomena in the context of the Bible, they’d be looking at unique words in the original languages, i.e. Hebrew and Greek, not in English translation. But I’m going with what I have at hand.

Here’s a plot of the number of hapax in each book of the KJV compared to the number of words in the book.

This looks a lot like the plot of vocabulary size and total words, suggesting the number of hapax also follows a power law like Heaps’ law. This is evident when we plot again on a logarithmic scale and see a linear relation.

Just to be clear on the difference between the two analyses in this post: in the first we looked at vocabulary size, the number of distinct words in each book. In the second we looked at words that only appear once. In both cases we’re counting unique words, but unique in different senses. In the first analysis, unique means that each word only *counts* once, no matter how many times it’s used. In the second, unique means that a word only *appears* once.

The **Riemann mapping theorem** says that any simply connected proper open subset of the plane is conformally equivalent to the interior of the unit disk. Simply connected means that the set has no holes [2]. Proper means that the set is not the entire plane [3].

Two regions that are conformally equivalent to the disk are conformally equivalent to each other. That means if you have a region like a Mickey Mouse silhouette

and another region like the Batman logo

then there’s a conformal map between them. If you specify a point in the domain and where it goes in the range, and you specify its derivative at that point, then the conformal map is unique.

Now this only applies to open sets, so the boundary of the figures is not included. The boundaries of the figures above clearly have different angles, so it wouldn’t be possible to extend the conformal map to the boundaries. But move in from the boundary by any positive distance and the map will be conformal.

Conformal maps are useful transformations in applications because they’re very nice functions to work with. In theory, you could write out a function that maps Mickey Mouse to Batman and one that goes the other way around. These functions would be difficult if not impossible to write down analytically, but there is software to find conformal maps numerically. For simple geometric shapes there are reference books that give exact conformal maps.

The Riemann mapping theorem applies to regions with no holes. What if we take two regions that do have a hole in them? To keep things simple, we will look at two annular regions, that is, each is the region between two circles. For *i* = 1 and 2 let *A*_{i} be the region between two circles centered at the origin, with inner radius *r*_{i} and outer radius *R*_{i}. Are these regions conformally equivalent?

Clearly they are if *R*_{1}/*r*_{1} = *R*_{2}/*r*_{2}, i.e. if one annulus is a magnification of the other. In that case the conformal map is given by mapping *z* to (*r*_{2}/*r*_{1}) *z*.
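Here’s a quick numerical illustration of this map (my own check, not from the post): scaling carries points of one annulus into the other when the radii have the same ratio.

```python
import cmath

r1, R1 = 1.0, 2.0   # first annulus: 1 < |z| < 2
r2, R2 = 3.0, 6.0   # second annulus: 3 < |w| < 6, same ratio R/r
scale = r2 / r1

# sample points on the circle |z| = 1.5 inside the first annulus
samples = [1.5 * cmath.exp(2j * cmath.pi * k / 8) for k in range(8)]
images = [scale * z for z in samples]

# every image lands inside the second annulus, on the circle |w| = 4.5
assert all(r2 < abs(w) < R2 for w in images)
```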

Here’s a big surprise: this is the only way the two annuli can be conformally equivalent! That is, if *R*_{1}/*r*_{1} does not equal *R*_{2}/*r*_{2} then there is no conformal map between the two regions.

So suppose you have two rings, both with inner radius 1. One has outer radius 2 and the other has outer radius 3. Then these rings are not conformally equivalent. They look a lot more like each other than do Mickey Mouse and Batman, but the latter are conformally equivalent and the former are not.

Almost any two regions without holes are conformally equivalent, but it’s much harder for two regions with holes to be conformally equivalent.

[1] Conformal maps also preserve orientation. So if you move clockwise from the first curve to the second in the domain, you move clockwise from the image of the first to the image of the second by the same angle.

A function of a complex variable is conformal if and only if it is *holomorphic*, i.e. analytic on the interior of its domain, with non-zero derivative. Some authors leave out the orientation preserving part of the definition; under their weaker definition a conformal map could also be the conjugate of a holomorphic function.

[2] A rigorous way to say that a region has no holes is to say its fundamental group is trivial. Any loop inside the region is homotopic to a point, i.e. you can continuously shrink it to a point without leaving the region.

[3] Liouville’s theorem says an analytic function that is bounded on the entire plane must be constant. That’s why the Riemann mapping theorem says a region must be a proper subset of the plane, not the whole plane.

]]>Suppose that you are in love with a lady on Neptune and that she returns the sentiment. It will be some consolation for the melancholy separation if you can say to yourself at some—possibly prearranged—moment, “She is thinking of me now.” Unfortunately a difficulty has arisen because we have had to abolish Now … She will have to think of you continuously for eight hours on end in order to circumvent the ambiguity of “Now.”

This reminded me of The Book of Strange New Things. This science fiction novel has several themes, but one is the effect of separation on a relationship. Even if you have faster-than-light communication, how does it affect you to know that your husband or wife is light years away? The communication delay might be no more than was common before the telegraph, but the psychological effect is different.

**Related post**: Eddington’s constant

He says at one point “If I were a Christian …” implying that he is not, but his philosophy of software echoes the Christian idea of grace, a completely free gift rather than something earned. If you want to use my software without giving anything back in return, enjoy. If you’re motivated by gratitude, not obligation, to give something back, that’s great. Works follow grace. Works don’t earn grace.

I was thinking about making a donation to a particular open source project that has been important to my business when I stumbled on DHH’s talk. While watching it, I reconsidered that donation. The software is freely given, no strings attached. I don’t take anything from anyone else by using it. Etc. Then I made the donation anyway, out of a sense of gratitude rather than a sense of obligation.

My biggest contributions to open source software have been unconscious. I had no idea that code from this blog was being used in hundreds of open source projects until Tim Hopper pointed it out.

Most contributions to open source software are in kind, i.e. contributing code. But cash helps too. Here are a couple ideas if you’d like to donate a little cash.

You could buy some swag with a project logo on it, especially some of the more overpriced swag. Maybe your company’s rules would allow such a purchase even though they wouldn’t allow a donation. It feels odd to deliberately buy something overpriced—they want how much for this coffee mug?!—but that’s how the project can make a little money.

If you’d like to make a cash donation, but you’re hesitating because it’s not tax deductible, here’s a possibility: deduct your own tax. Give the after-tax amount that corresponds to the amount you would have given before taxes. For example, suppose you’d like to donate $100 cash if it were tax deductible, and suppose that your marginal tax rate is 30%. Then donate $70. It’s the same to you as donating $100 and saving $30 on your taxes.
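The arithmetic above is simple enough to do in your head, but here is a sketch of it as a function (the function name is my own, not from any tax software):

```python
def after_tax_donation(pretax_amount, marginal_rate):
    """Amount to give so the out-of-pocket cost matches what a
    tax-deductible gift of pretax_amount would have cost after taxes."""
    return pretax_amount * (1 - marginal_rate)

# Donating $70 outright costs you the same as donating $100
# and deducting it at a 30% marginal rate.
print(after_tax_donation(100, 0.30))
```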

Disclaimer: I am not an accountant or a tax attorney. And in case it ever needs to be said, I’m not a lepidopterist, auto mechanic, or cosmetologist either.

Both are true, but they involve limits in different spaces. Weierstrass’ theorem applies to convergence over a compact interval of the real line, and Morera’s theorem applies to convergence over compact sets in the complex plane. The uniform limit of polynomials is better behaved when it has to be uniform in two dimensions rather than just one.

This post is a sort of variation on the post mentioned above, again comparing convergence over the real line and convergence in the complex plane, this time looking at what happens when you conjugate variables [1].

There’s an abstract version of Weierstrass’ theorem due to Marshall Stone known as the Stone-Weierstrass theorem. It generalizes the real interval of Weierstrass’ theorem to any compact Hausdorff space [2]. The compact Hausdorff space we care about for this post is the unit disk in the complex plane.

There are two versions of the Stone-Weierstrass theorem, one for real-valued functions and one for complex-valued functions. I’ll state the theorems below for those who are interested, but my primary interest here is a corollary to the complex Stone-Weierstrass theorem. It says that any continuous complex-valued function on the unit disk can be approximated as closely as you like with polynomials in *z* and the conjugate of *z* with complex coefficients, i.e. by polynomials of the form

Σ_{j, k ≥ 0} *c*_{jk} *z*^{j} z̄^{k}

When you throw in conjugates, things change a lot. The uniform limit of polynomials in *z* alone must be analytic, very well behaved. But the uniform limit of polynomials in *z* and *z* conjugate is merely continuous. It can have all kinds of sharp edges etc.

Conjugation opens up a lot of new possibilities, for better or for worse. As I wrote about here, an analytic function can only do two things to a tiny disk: stretch it or rotate it. It cannot flip it over, as conjugation does.

By adding or subtracting a variable and its conjugate, you can separate out the real and imaginary parts. The parts are no longer inextricably linked, and this allows much more general functions. The magic of complex analysis, the set of theorems that seem too good to be true, depends on the real and imaginary parts being tightly coupled.

***

Now for those who are interested, the statement of the Stone-Weierstrass theorems.

Let *X* be a compact Hausdorff space. The set of real or complex valued functions on *X* forms an algebra. That is, it is closed under taking linear combinations and products of its elements.

The real Stone-Weierstrass theorem says a subalgebra of the continuous real-valued functions on *X* is dense if it contains a non-zero constant function and if it separates points. The latter condition means that for any two distinct points, you can find a function in the subalgebra that takes different values at the two points.

If we take the interval [0, 1] as our Hausdorff space, the Stone-Weierstrass theorem says that the subalgebra of polynomials is dense.

The complex Stone-Weierstrass theorem says a subalgebra of the continuous complex-valued functions on *X* is dense if it contains a non-zero constant function, separates points, and is closed under taking conjugates. The statement above about polynomials in *z* and *z* conjugate follows immediately.

***

[1] For a complex variable *z* of the form *a* + *bi* where *a* and *b* are real numbers and *i* is the imaginary unit, the conjugate is *a* – *bi*.

[2] A Hausdorff space is a very general kind of topological space. All it requires is that for any two distinct points in the space, you can find disjoint open sets that each contain one of the points.

In many cases, a Hausdorff space is the most general setting where you can work without running into difficulties, especially if it is compact. You can often prove something under much stronger assumptions, then reuse the proof to prove the same result in a (compact) Hausdorff space.

See this diagram of 34 kinds of topological spaces, moving from strongest assumptions at the top to weakest at the bottom. Hausdorff is almost on bottom. The only thing weaker on that diagram is *T*_{1}, also called a Fréchet space.

In a *T*_{1} space, you can separate points, but not simultaneously. That is, given two points *p*_{1} and *p*_{2}, you can find an open set *U*_{1} around *p*_{1} that does not contain *p*_{2}, and you can find an open set *U*_{2} around *p*_{2} that does not contain *p*_{1}. But you cannot necessarily find these sets in such a way that *U*_{1} and *U*_{2} are disjoint. If you could always choose these open sets to be disjoint, you’d have a Hausdorff space.

Here’s an example of a space that is *T*_{1} but not Hausdorff. Let *X* be an infinite set with the cofinite topology. That is, open sets are those sets whose complements are finite. The complement of {*p*_{1}} is an open set containing *p*_{2} and vice versa. But you cannot find disjoint open sets separating two points because there are no disjoint non-empty open sets: the intersection of any pair of non-empty open sets must contain an infinite number of points.

The **naive** haven’t heard of power laws or don’t know much about them. They probably tend to expect things to be more uniformly distributed than they really are.

The **enthusiasts** have read books about fractals, networks, power laws, etc. and see power laws everywhere.

The **skeptics** say the prevalence of power laws is overestimated.

Within the skeptics are the **pedants**. They’ll point out that nothing exactly follows a power law over the entirety of its range, which is true but unhelpful. Nothing follows any theoretical distribution exactly and everywhere. This is true of the normal distribution as much as it is true of power laws, but that doesn’t mean that normal distributions and power laws aren’t useful models.

If you asked a power law enthusiast how zip code populations are distributed, their first guess would of course be a power law. They might suppose that 80% of the US population lives in 20% of the zip codes, following Pareto’s 80-20 rule.

The enthusiasts wouldn’t be too far off. They would do better than the uniformitarians who might expect that close to 20% of the population lives in 20% of the zip codes.

It turns out that about 70% of Americans live in the top 20% of zip codes ranked by population. To include 80% of the population, you have to include the top 27% of zip codes.
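Percentages like these are easy to compute once you have the population of each zip code. Here’s a minimal sketch, with a helper function of my own invention and toy data standing in for the actual census figures:

```python
import numpy as np

def top_share(populations, fraction=0.2):
    """Fraction of the total population living in the most
    populous `fraction` of areas."""
    pops = np.sort(np.asarray(populations, dtype=float))[::-1]  # largest first
    k = int(len(pops) * fraction)
    return pops[:k].sum() / pops.sum()

# A perfectly uniform distribution puts exactly 20% of people
# in the top 20% of areas.
print(top_share([100] * 50))  # 0.2
```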

But a power law implies more than just the 80-20 rule. It implies that something like the 80-20 rule holds on multiple scales, and that’s not true of zip codes.

The signature of a power law is a straight line on a log-log plot. Power laws never hold exactly and everywhere, but a lot of things approximately follow a power law over a useful range, typically in the middle.

What do we get if we make a log-log plot of zip code population?
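A sketch of how such a plot could be made, assuming an array of zip code populations. Since the census ZCTA file isn’t included here, the data below is a hypothetical lognormal stand-in:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Stand-in data: real code would load ZCTA populations from census files.
rng = np.random.default_rng(20191112)
populations = rng.lognormal(mean=8, sigma=1.5, size=33_000)

# Rank-size plot on log-log axes: a power law would be a straight line.
pops = np.sort(populations)[::-1]
ranks = np.arange(1, len(pops) + 1)
plt.plot(ranks, pops)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("rank")
plt.ylabel("population")
plt.savefig("zipcode_rank_size.png")
```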

This looks more like a squircle than a straight line.

If we zoom in on just part of the plot, say the largest 2,000 zip codes [1], we get something that has a flat spot, but the plot bows outward, and continues to bow outward as we zoom in on different parts of it.

Why is the distribution of zip code populations not a power law? One reason is that zip codes are artificial. They were designed to make it easier to distribute mail, and so there was a deliberate effort to make them somewhat uniform in population. The population of the largest zip code is only 12 times the average.

You’re more likely to see power laws in something organic like city populations. The largest city in the US is nearly 1,400 times larger than the average city [2].

Another reason zip code populations do not follow a power law is that zip codes are somewhat of a geographical designation. I say “somewhat” because the relationship between zip codes and geography is complicated. See [1].

But because there’s at least some attempt to accommodate geography, and because the US population is very thin in large regions of the country, there are many sparsely populated zip codes, even when you roll them up from 5 digits to just 3 digits, and that’s why the curve drops very sharply at the end.

- Identification using birthday, sex, and zip code
- Passwords and power laws
- Estimating power law exponents

[1] This post is based on ZCTAs (Zip Code Tabulation Areas) according to the 2010 census. ZCTAs are not exactly zip codes for reasons explained here.

[2] It depends on how you define “city,” but the average local jurisdiction population in the United States is around 6,200 and the population of New York City is around 8,600,000, which is 1,400 times larger.

Before that, I wrote a post on Bernstein’s proof that used his eponymous polynomials to prove Weierstrass’ theorem. This is my favorite proof because it’s an example of using results from probability to prove a statement that has nothing to do with randomness.

This morning I’ll present one more way to prove the approximation theorem, this one due to Landau.

The Landau kernel is defined as

*L*_{n}(*x*) = (1 − *x*²)^{n}

Denote its integral by

*k*_{n} = ∫_{−1}^{1} (1 − *x*²)^{n} *dx*

Let *f*(*x*) be any continuous function on [-1, 1]. Then the convolution of the normalized Landau kernel with *f* gives a sequence of polynomial approximations that converge uniformly to *f*. By “normalized” I mean dividing the kernel by its integral so that it integrates to 1.

For each *n*,

*p*_{n}(*x*) = (1/*k*_{n}) ∫_{−1}^{1} *f*(*t*) (1 − (*x* − *t*)²)^{n} *dt*

is a polynomial in *x* of degree 2*n*, and as *n* goes to infinity this converges uniformly to *f*(*x*).
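Here is a numerical sketch of the construction (my own illustration, not code from the original post): convolving *f*(*x*) = |*x*| with the normalized kernel, using a simple Riemann sum, and watching the result approach *f* as *n* grows.

```python
import numpy as np

def landau_approx(f, x, n, num=4001):
    """Approximate f(x) by (1/k_n) times the integral over [-1, 1]
    of f(x + t) (1 - t^2)^n dt, evaluated with a Riemann sum."""
    t = np.linspace(-1, 1, num)
    dt = t[1] - t[0]
    w = (1 - t**2) ** n
    kn = w.sum() * dt  # numerical value of the normalizing constant k_n
    return (f(x + t) * w).sum() * dt / kn

# The kernel concentrates near 0 as n grows, so the convolution
# gets closer and closer to f(0.25) = 0.25.
for n in [1, 10, 100, 1000]:
    print(n, landau_approx(np.abs, 0.25, n))
```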

There are a few connections I’d like to mention. First, the normalized Landau kernel is essentially a beta distribution density, just scaled to live on [-1, 1] rather than [0, 1].

And as with Bernstein’s proof of the Weierstrass approximation theorem, you could use probability to prove Landau’s result. Namely, you could use the fact that for two independent random variables *X* and *Y*, the PDF of their sum is the convolution of their PDFs.

The normalizing constants *k*_{n} have a simple closed form in terms of double factorials:

*k*_{n} = 2 (2*n*)!! / (2*n* + 1)!!
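The double factorial expression for *k*_{n} is easy to check by hand for small *n*. A sketch, with a hand-rolled double factorial since Python’s math module doesn’t provide one (both function names are mine):

```python
from fractions import Fraction

def double_factorial(m):
    """m!! = m * (m - 2) * (m - 4) * ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def k(n):
    # k_n = 2 * (2n)!! / (2n + 1)!!
    return Fraction(2 * double_factorial(2 * n), double_factorial(2 * n + 1))

# k_2 should equal the integral of (1 - x^2)^2 over [-1, 1],
# which works out to 2 - 4/3 + 2/5 = 16/15.
print(k(2))
```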

I don’t know which Landau is responsible for the Landau kernel. I’ve written before about Edmund Landau and his Big O notation, and I wrote about Lev Landau and his license plate game. Edmund was a mathematician, so it makes sense that he might be the one to come up with another proof of Weierstrass’ theorem. Lev was a physicist, and I could imagine he would be interested in the Landau kernel as an approximation to the delta function.

If you know which of these Landaus, or maybe another, is behind the Landau kernel, please let me know.

**Update**: Someone sent me this paper which implies Edmund Landau is the one we’re looking for.

The post mentioned above uses a proof by Bernstein. And in that post I used the absolute value function as an example. Not only is |*x*| an example, you could go the other way around and use it as a step in the proof. That is, there is a proof of the Weierstrass approximation theorem that starts by proving the special case of |*x*| and then uses that result to build a proof for the general case.

There have been many proofs of Weierstrass’ theorem, and recently I ran across a proof due to Lebesgue. Here I’ll show how Lebesgue constructed a sequence of polynomials approximating |*x*|. It’s like pulling a rabbit out of a hat.

The starting point is the binomial theorem. If *x* and *y* are real numbers with |*x*| > |*y*| and *r* is any real number, then

(*x* + *y*)^{r} = Σ_{k=0}^{∞} (*r* choose *k*) *x*^{r−k} *y*^{k}.

Now apply the theorem substituting 1 for *x* and *x*² – 1 for *y* above and you get

|*x*| = (1 + (*x*² − 1))^{1/2} = Σ_{k=0}^{∞} (1/2 choose *k*) (*x*² − 1)^{k}

The partial sums of the right hand side are a sequence of polynomials converging to |*x*| on the interval [-1, 1].
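Here is a sketch of those partial sums in code, using a generalized binomial coefficient (both helper names are my own):

```python
def gen_binom(r, k):
    """Generalized binomial coefficient (r choose k) for real r:
    r (r-1) ... (r-k+1) / k!"""
    result = 1.0
    for i in range(k):
        result *= (r - i) / (i + 1)
    return result

def abs_approx(x, n):
    """Partial sum of the series for (1 + (x^2 - 1))^(1/2),
    a polynomial in x of degree 2n."""
    return sum(gen_binom(0.5, k) * (x * x - 1) ** k for k in range(n + 1))

# Converges to |x| on [-1, 1]; for example at x = 0.5 and x = -0.5.
print(abs_approx(0.5, 100), abs_approx(-0.5, 100))
```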

***

If you’re puzzled by the binomial coefficient with a top number that isn’t a positive integer, see the general definition of binomial coefficients. The top number can even be complex, and indeed the binomial theorem holds for complex *r*.

You might also be puzzled by the binomial theorem being an infinite sum. Surely if *r* is a positive integer we should get the more familiar binomial theorem which is a *finite* sum. And indeed we do. The general definition of binomial coefficients ensures that if *r* is a positive integer, all the binomial coefficients with *k* > *r* are zero.