This morning on Twitter, Alexander Bogomolny posted a link to his article that gives examples of words that are prime numbers when interpreted as numbers in base 36. Some examples are “Brooklyn”, “paleontologist”, and “deodorant.” (Numbers in base 36 are written using 0, 1, 2, …, 9, A, B, C, …, Z as “digits.”)

Tim Hopper replied with a snippet of Mathematica code that lists all words with up to four letters that correspond to base 36 primes.

Rest[ Flatten[ Union[ DictionaryLookup /@ IntegerString[ Table[Prime[n], {n, 1, 300000}], 36]]]]
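For readers without Mathematica, a rough Python analogue of this kind of search might look like the following. It is only a sketch: `is_prime`, `base36_value`, and `prime_words` are names made up for illustration, the word list is a tiny hypothetical sample rather than a real dictionary, and it filters a word list directly instead of enumerating primes as Tim’s snippet does. It relies on Python’s built-in `int(word, 36)`, which reads the letters A through Z as the digits 10 through 35.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for the short words considered here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def base36_value(word: str) -> int:
    """Interpret a word as a base 36 number (case-insensitive)."""
    return int(word, 36)


def prime_words(words):
    """Keep the purely alphabetic ASCII words whose base 36 value is prime."""
    return [
        w for w in words
        if w.isascii() and w.isalpha() and is_prime(base36_value(w))
    ]


# Hypothetical sample list, not a real dictionary.
sample = ["a", "b", "ad", "an", "be", "hi"]
print(prime_words(sample))
```

Trial division is fine for words of four or five letters but would be far too slow for something like “paleontologist”; a probabilistic primality test would be the usual substitute there.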

That made me wonder whether you could estimate how many such words there are without doing an exhaustive search.

The Prime Number Theorem says that the probability of a number less than N being prime is approximately 1/log(N), where log is the natural logarithm. If we knew how many English words there were of a certain length, then we could guess that about 1/log(N) of those words would be prime when interpreted as base 36 numbers. This assumes that being an English word and being prime are independent events, which may be approximately true.

How well would our guess have worked on Tim’s example? He prints out all the words corresponding to the first 300,000 primes. The last of these primes is 4,256,233. The exact probability that a number less than that upper limit is prime is then

300,000 / 4,256,233 ≈ 0.07.
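The ratio above can be checked by counting primes directly. The sketch below uses a sieve of Eratosthenes; `count_primes` and `prime_density` are illustrative names, and the limit of one million is an arbitrary choice to keep the run short. Passing 4,256,233 instead should reproduce the count of 300,000 if that is indeed the 300,000th prime, as stated above.

```python
import math


def count_primes(limit: int) -> int:
    """Count the primes <= limit with a sieve of Eratosthenes."""
    if limit < 2:
        return 0
    flags = bytearray([1]) * (limit + 1)
    flags[0] = flags[1] = 0
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            # Cross off the multiples of p, starting at p*p.
            flags[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return sum(flags)


def prime_density(limit: int) -> float:
    """Fraction of the integers 1..limit that are prime."""
    return count_primes(limit) / limit


N = 1_000_000  # arbitrary limit chosen for illustration
print(f"sieve: {prime_density(N):.4f}   1/log(N): {1 / math.log(N):.4f}")
```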

There are about 4200 English words with four or fewer letters. (I found this out by running

grep -ciE '^[a-z]{1,4}$'

on the `words` file on a Linux box. See similar tricks here.) If we estimate that 7% of these are prime, we’d expect 294 words from Tim’s program. His program produces 275 words, so our prediction is pretty good.

If we didn’t know the exact probability of a number in our range being prime, we could have estimated the probability at

1/log(4,256,233) ≈ 0.0655

using the Prime Number Theorem. Using this approximation we’d estimate 4200*0.0655 = 275.1 words; our estimate would be exactly correct! There’s good reason to believe our estimate would be reasonably close, but we got lucky to get this close.
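The arithmetic behind both estimates fits in a few lines of Python. The constants are the figures quoted in this post (the word count is the rough 4200, not an exact dictionary count), and the variable names are just for illustration. Without rounding the density to 7%, the first estimate comes out slightly higher than 294.

```python
import math

WORD_COUNT = 4200        # rough count of words with four or fewer letters
PRIME_COUNT = 300_000    # number of primes Tim's snippet enumerates
UPPER_LIMIT = 4_256_233  # the last of those primes, per the post

# Density from the actual prime count versus the PNT approximation.
exact_density = PRIME_COUNT / UPPER_LIMIT
pnt_density = 1 / math.log(UPPER_LIMIT)

estimate_exact = WORD_COUNT * exact_density
estimate_pnt = WORD_COUNT * pnt_density

print(f"exact density {exact_density:.4f} -> about {estimate_exact:.0f} words")
print(f"PNT density   {pnt_density:.4f} -> about {estimate_pnt:.0f} words")
```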


I didn’t actually test to see if that gets all the words up to four characters. I just chose 300000 because it ran relatively quickly on my laptop.

Heh!

I discovered a couple of years back that my full name is a prime number in b-36.

Why I’m telling you this, when I devoutly avoid having my full name anywhere on the internet, is another story 🙂

Nice to see someone informing (at least a cross-section of) the world about the beauty of alternate bases

So I wonder, is there a meaningful word which is a composite of two meaningful words in b-36?

s/lsits/lists/

@caner: hehe. Cheeky 🙂

Waldir: Thanks for leaving a `sed` program in the comments. 🙂

Why 36 and not just 26 (A to Z)?

It seems to me quite artificial to include digits inside the alphabet.

I do not know real words with digits embedded.

I’d argue that the least artificial alphabet for such endeavours would not have 26 or 36 characters, but 22.

4,256,233 seems to be a little bit too big for the 4-character word problem. If we take 1,679,616 (i.e. 36^4) instead, then the PNT gives the probability 1/ln(1,679,616) ≈ 0.0698, and 293 expected prime words.

It surprises me that base 36 matches the PNT so closely. There are all kinds of possibilities based on letter frequencies for the last digit. Are even last letters significantly less common than odd ones in the database of words?

Nice article. Thank you for submitting this article to the Math and Multimedia carnival.

It’s interesting that the Mac dictionary has 5628 words of up to 4 characters, compared to 4221 on Ubuntu.

Another thing: if we do the same analysis for 5-character words, the difference between the estimated and actual number of prime words is bigger. Mathematica returns 477 prime words, but the PNT gives ≈578 on Ubuntu and ≈788 on Mac.

The close estimate for 4-character prime words is just luck.

Andrey: I estimated the number of words with <= 4 letters via Ubuntu just because it was easy, but that introduced some inconsistency. To be consistent, the number of words and the number of prime words should be taken from the same source. The question then is given a base 36 number that forms a word (by whatever dictionary you use to define "word"), what is the probability that the number is prime? Is that probability approximately what one would estimate from the PNT?

John: it’s a good point about using a single source of data! The Mathematica dictionary seems to be smaller than the Mac dictionary.

I recalculated the actual and estimated probabilities using data from the Mac dictionary. Here is the graph of their functions.