Numbers don’t typically have many prime factors

Suppose you select a 100-digit number at random. How many distinct prime factors do you think it would have?

The answer is smaller than you might think: most likely between 5 and 6.

The function ω(n) returns the number of distinct prime factors of n [1]. A theorem of Hardy and Ramanujan says that as N goes to infinity, the average value of ω(n) for positive integers up to N is log log N.

Since log log 10¹⁰⁰ = 5.43…, we’d expect 100-digit numbers to have between 5 and 6 distinct prime factors.

The above calculation gives the average number of distinct prime factors for all numbers with up to 100 digits. But if we redid the calculation looking at numbers between 10⁹⁹ and 10¹⁰⁰ it would only make a difference in the third decimal place.
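Here is a minimal sketch of both calculations, using only the standard library. It assumes the leading-order estimate that the sum of ω(n) up to N is N log log N; the variable names are mine.

```python
from math import log

# Leading-order average of omega(n) for n up to N = 10^100.
# log(10^100) = 100 log 10, so no huge numbers are needed.
avg_upto_100 = log(100 * log(10))
avg_upto_99 = log(99 * log(10))

# Average over 10^99 < n <= 10^100, treating N log log N as the sum of omega(n) up to N.
# Numerator and denominator are both divided by 10^99 to stay in floating point range.
avg_100_digit = (10 * avg_upto_100 - avg_upto_99) / 9

print(avg_upto_100, avg_100_digit)
```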

Let’s look at a much smaller example where we can tally the values of ω(n), numbers from 2 to 1,000,000.

Most numbers with up to six digits have two or three distinct prime factors, which is consistent with log log 10⁶ ≈ 2.6.

There are 2,285 six-digit numbers that have six distinct prime factors. Because this is out of a million numbers, it corresponds to a bar in the graph above that is barely visible.

There are 8 six-digit numbers with 7 distinct prime factors and none with more factors.
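The tally can be reproduced with a simple sieve. The following is a sketch in plain Python rather than the code behind the original plot; the helper name omega_sieve is my own.

```python
from collections import Counter
from math import log

def omega_sieve(N):
    """Return a list w with w[n] = number of distinct prime factors of n, for 0 <= n <= N."""
    w = [0] * (N + 1)
    for p in range(2, N + 1):
        if w[p] == 0:  # no smaller prime divides p, so p is prime
            for m in range(p, N + 1, p):
                w[m] += 1
    return w

N = 1_000_000
w = omega_sieve(N)
tally = Counter(w[2:])  # counts for n = 2, ..., N
for k in sorted(tally):
    print(k, tally[k])
print("mean:", sum(w[2:]) / (N - 1), " log log N:", log(log(N)))
```

The sieve visits each number once per distinct prime factor, so it does roughly N log log N increments in total.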

The Hardy-Ramanujan theorem established the mean of ω(n) for large numbers. The Erdős–Kac theorem goes further and says roughly that ω(n) is normally distributed with mean and variance log log n. More on this here.

So returning to our example of 100-digit numbers, the Hardy-Ramanujan theorem implies these numbers would have between 5 and 6 distinct prime factors on average, and the Erdős–Kac theorem implies that about 95% of such numbers have 10 or fewer distinct prime factors. The maximum number of distinct prime factors is 53, because the product of the first 54 primes already has 101 digits.
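The maximum can be checked by multiplying successive primes until the product no longer fits in the given number of digits, since the smallest number with k distinct prime factors is the product of the first k primes. A sketch, with a function name of my own choosing:

```python
def max_distinct_prime_factors(digits):
    """Largest k such that the product of the first k primes has at most `digits` digits."""
    limit = 10 ** digits      # numbers with at most `digits` digits are below this
    product, count, candidate = 1, 0, 2
    while True:
        # trial division is plenty fast for the few dozen small primes needed here
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            if product * candidate >= limit:
                return count
            product *= candidate
            count += 1
        candidate += 1

print(max_distinct_prime_factors(100))
```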

[1] SymPy has a function primeomega, but it does not compute ω(n). Instead, it computes Ω(n), the number of prime factors of n counted with multiplicity. So, for example, ω(12) = 2 but Ω(12) = 3. The SymPy function to compute ω(n) is called primenu.

Leading digits of primes

How are the first digits of primes distributed? Do some digits appear as first digits of primes more often than others? How should we even frame the problem?

There are an infinite number of primes that begin with each digit, so the cardinalities of the sets of primes beginning with each digit are the same. But cardinality isn’t the right tool for the job.

The customary way to (try to) approach problems like this is to pose a question for integers up to N and then let N go to infinity. This is called natural density. But this limit doesn’t always converge, and I don’t believe it converges in our case.

Logarithmic density

An alternative is to compute logarithmic density. This measure exists in cases where natural density does not. And when natural density does exist, logarithmic density may give the same result [1].

The logarithmic density of a sequence A is defined by

\lim_{N\to\infty}\frac{1}{\log N}\sum_{x \in A, \, x \leq N} \frac{1}{x}

We want to know about the relative logarithmic density of primes that start with 1, for example. The relative density of a sequence A relative to a sequence B is defined by

\lim_{N\to\infty} \frac{\sum_{x \in A, \, x \leq N} \frac{1}{x}} {\sum_{x \in B, \, x \leq N} \frac{1}{x}}

We’d like to know about the logarithmic density of primes that start with 1, 2, 3, …, 9 relative to all primes. The exact value is known [2], and we will get to that shortly, but first let’s try to anticipate the result by empirical calculation.

We want to look at the first million primes, and we could do that by calling the function prime with the integers 1 through 1,000,000. But it is much more efficient to ask SymPy to find the millionth prime and then find all primes up to that prime.

    from sympy import prime, primerange
    import numpy as np

    first_digit = lambda n: int(str(n)[0])

    density = np.zeros(10)
    N = 1_000_000
    for x in primerange(prime(N+1)):
        density[0] += 1/x
        density[first_digit(x)] += 1/x
    density /= density[0]

Here’s a histogram of the results.

Clearly the distribution is uneven, and in general more primes begin with small digits than large digits. But the distribution is somewhat irregular. That’s how things work with primes: they act something like random numbers, except when they don’t. This sort of amphibious nature, between regularity and irregularity, makes primes interesting.

The limits defining the logarithmic relative density do exist, though the fact that even a million terms doesn’t produce a smooth histogram suggests the convergence is not too rapid.

In the limit we get that the density of primes with first digit d is

\log_{10}\left(1 + \frac{1}{d}\right)

The densities we’d see in the limit are plotted as follows.

This is exactly the density given by Benford’s law. However, this is not exactly Benford’s law because we are using a different density, logarithmic relative density rather than natural density [3].
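To put numbers next to the limiting values, here is a sketch that prints the empirical relative logarithmic density beside log₁₀(1 + 1/d). It uses a plain sieve instead of SymPy and, as an assumption to keep it quick, stops at primes below 1,000,000 rather than the first million primes.

```python
from math import log10

def primes_below(n):
    """Sieve of Eratosthenes: list of primes less than n."""
    is_prime = bytearray([1]) * n
    is_prime[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            is_prime[p * p :: p] = bytearray(len(range(p * p, n, p)))
    return [i for i in range(n) if is_prime[i]]

# Relative logarithmic density of each leading digit among primes below the cutoff
cutoff = 1_000_000
weight = [0.0] * 10
for p in primes_below(cutoff):
    weight[int(str(p)[0])] += 1 / p
total = sum(weight)

for d in range(1, 10):
    print(d, round(weight[d] / total, 4), round(log10(1 + 1 / d), 4))
```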

[1] The Davenport–Erdős theorem says that for some kinds of sequences, if the natural density exists, logarithmic density also exists and will give the same result.

[2] R. E. Whitney. Initial Digits for the Sequence of Primes. The American Mathematical Monthly, Vol. 79, No. 2 (Feb., 1972), pp. 150–152

[3] R. L. Duncan showed that the leading digits of integers satisfy the logarithmic density version of Benford’s law.

Recognizing three-digit primes

If a three-digit number looks like it might be prime, there’s about a 2 in 3 chance that it is.

To be more precise about what it means for a number to “look like a prime,” let’s say that a number is obviously composite if it is divisible by 2, 3, 5, or 11. Then the following Python code quantifies the claim above.

    from sympy import gcd, isprime

    obviously_composite = lambda n: gcd(n, 2*3*5*11) > 1

    primes = 0
    nonobvious = 0

    for n in range(100, 1000):
        if not obviously_composite(n):
            nonobvious += 1
            if isprime(n):
                primes += 1
    print(primes, nonobvious)

This shows that out of 218 numbers that are not obviously composite, 143 are prime.

This is a fairly conservative estimate. It doesn’t consider 707 an obvious composite, for example, even though it’s pretty clear that 707 is divisible by 7. And if you recognize squares like 169, you can add a few more numbers to your list of obviously composite numbers.
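To see how much a sharper eye helps, the sketch below compares the rule above with a hypothetical stricter rule that also rejects multiples of 7 and perfect squares. It uses only the standard library, and the function stricter is my own illustration, not part of the original code.

```python
from math import gcd, isqrt

def is_prime(n):
    """Trial division; fine for three-digit numbers."""
    return n > 1 and all(n % d for d in range(2, isqrt(n) + 1))

def obviously_composite(n, small_primes_product=2 * 3 * 5 * 11):
    return gcd(n, small_primes_product) > 1

def stricter(n):
    # Hypothetical stricter rule: also catch multiples of 7 and perfect squares
    return obviously_composite(n, 2 * 3 * 5 * 7 * 11) or isqrt(n) ** 2 == n

for rule in (obviously_composite, stricter):
    candidates = [n for n in range(100, 1000) if not rule(n)]
    primes = sum(is_prime(n) for n in candidates)
    print(rule.__name__, primes, len(candidates))
```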

Overpowered proof that π is transcendental

There is no polynomial with rational coefficients that evaluates to 0 at π. That is, π is a transcendental number, not an algebraic number. This post will prove this fact as a corollary of a more advanced theorem. There are proofs that are more elementary and direct, but the proof given here is elegant.

A complex number z is said to be algebraic if it is the root of a polynomial with rational coefficients. The set of all algebraic numbers forms a field F.

The Lindemann-Weierstrass theorem says that if

α1, α2, …, αn

is a set of distinct algebraic numbers, then their exponentials

exp(α1), exp(α2), …, exp(αn)

are linearly independent. That is, no linear combination of these numbers with rational coefficients is equal to 0 unless all the coefficients are 0.

Assume π is algebraic. Then πi would be algebraic, because i is algebraic and the product of algebraic numbers is algebraic.

Certainly 0 is algebraic, and so the Lindemann-Weierstrass theorem would say that exp(πi) and exp(0) are linearly independent. But these two numbers are not independent because

exp(πi) + exp(0) = -1 + 1 = 0.

So we have a proof by contradiction that π is not algebraic, i.e. π is transcendental.
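The identity the proof hinges on is easy to sanity check numerically. This is only an illustration in floating point, not part of the proof:

```python
import cmath

# exp(πi) + exp(0) should vanish; in floating point it is zero up to rounding error
residual = cmath.exp(1j * cmath.pi) + cmath.exp(0)
print(abs(residual))
```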

I found this proof in Excursions in Number Theory, Algebra, and Analysis by Kenneth Ireland and Al Cuoco.


Piranhas and prime factors

The piranha problem says an event cannot be highly correlated with a large number of independent predictors. If you have a lot of strong predictors, they must predict each other, analogous to having too many piranhas in a small body of water: they start to eat each other.

The piranha problem is subtle. It can sound obvious or mysterious, depending on how you state it. You can find several precise formulations of the piranha problem here.

Prime piranhas

An analog of the piranha problem in number theory is easier to grasp. A number N cannot have two prime factors both larger than its square root, nor can it have three prime factors all larger than its cube root. This observation is simple, obvious, and useful.

For example, if N is a three-digit number, then the smallest prime factor of N cannot be larger than 31 unless N is prime. And if N has three prime factors, at least one of these must be less than 10, which means it must be 2, 3, 5, or 7.
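Both claims about three-digit numbers can be verified by brute force. The sketch below uses plain trial division; the helper prime_factors is my own and returns factors with multiplicity.

```python
def prime_factors(n):
    """Prime factors of n with multiplicity, by trial division up to sqrt(n)."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever is left is prime
    return factors

for n in range(100, 1000):
    f = prime_factors(n)
    if len(f) >= 2:
        assert f[0] <= 31   # composite: smallest prime factor is at most sqrt(999)
    if len(f) >= 3:
        assert f[0] < 10    # three or more prime factors: one of them is 2, 3, 5, or 7

print(prime_factors(961), prime_factors(7 * 11 * 11))
```

The number 961 = 31² shows that the bound of 31 is attained.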

There are various tricks for testing divisibility by small primes. The tricks for testing divisibility by 2, 3, and 5 are well known. Tricks for testing divisibility by 7, 11, and 13 are moderately well known. Tests for divisibility by larger primes are more arcane.

Our piranha-like observation about prime factors implies that if you know ways to test divisibility by primes less than p, then you can factor all numbers up to p² and most numbers up to p³. The latter part of this statement is fuzzy, and so we’ll explore it a little further.

How much is “most”?

For a given prime p, what proportion of numbers less than p³ have two factors larger than p? We can find out with the following Python code.

    from sympy import factorint

    def density(p, N=None):
        # Proportion of n < N with exactly two distinct prime factors larger than p
        if N is None:
            N = p**3
        count = 0
        for n in range(1, N):
            factors = factorint(n).keys()
            if len([k for k in factors if k > p]) == 2:
                count += 1
        return count/N

The code is a little more general than necessary because in a moment we would like to consider a range that doesn’t necessarily end at p³.

Let’s plot the function above for the primes less than 100.

Short answer: “most” means roughly 90% for primes between 20 and 100.

The results are very similar if we pass in a value of N greater than p³. About 9% of numbers less than 1,000 have two prime factors greater than 10, and about 12% of numbers less than 1,000,000 have two prime factors greater than 100.

Density of safe primes

Sean Connolly asked in a comment yesterday about the density of safe primes. Safe primes are so named because Diffie-Hellman encryption systems based on such primes are safe from a particular kind of attack. More on that here.

If q and p = 2q + 1 are both prime, q is called a Sophie Germain prime and p is a safe prime. We could phrase Sean’s question in terms of Sophie Germain primes because every safe prime corresponds to a Sophie Germain prime.

It is unknown whether there are infinitely many Sophie Germain primes, so conceivably there are only a finite number of safe primes. But the number of Sophie Germain primes less than N is conjectured to be approximately

1.32 N / (log N)².

See details here.

Sean asks specifically about the density of safe primes with 19,000 digits. The density of Sophie Germain primes with 19,000 digits or less is conjectured to be about

1.32/(log 10¹⁹⁰⁰⁰)² = 1.32/(19000 log 10)² = 6.9 × 10⁻¹⁰.

So the chances that a 19,000-digit number is a safe prime are on the order of one in a billion.
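The arithmetic is easy to reproduce without ever forming a 19,000-digit number, since log 10ⁿ = n log 10. A sketch, where the function name and the default constant 1.32 are just the values used above:

```python
from math import log

def sophie_germain_density(digits, c=1.32):
    """Conjectured density c / (log N)^2 at N = 10^digits."""
    log_N = digits * log(10)   # log(10^digits) without forming the huge number
    return c / log_N ** 2

d = sophie_germain_density(19_000)
print(d, 1 / d)
```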

Famous constants and the Gumbel distribution

The Gumbel distribution, named after Emil Julius Gumbel (1891–1966), is important in statistics, particularly in studying the maximum of random variables. It comes up in machine learning in the so-called Gumbel-max trick. It also comes up in other applications such as in number theory.

For this post, I wanted to point out how a couple famous constants are related to the Gumbel distribution.

Gumbel distribution

The standard Gumbel distribution is most easily described by its cumulative distribution function

F(x) = exp( −exp(−x) ).

You can introduce a location parameter μ and scale parameter β the usual way, replacing x with (x − μ)/β in the CDF; the corresponding density picks up an extra factor of 1/β.

Here’s a plot of the density.

Euler-Mascheroni constant γ

The Euler-Mascheroni constant γ comes up frequently in applications.

The constant γ comes up in the context of the Gumbel distribution two ways. First, the mean of the standard Gumbel distribution is γ. Second, the entropy of a standard Gumbel distribution is γ + 1.

Apéry’s constant ζ(3)

The values of the Riemann zeta function ζ(z) at positive even integers have closed-form expressions given here, but the values at odd integers do not. The value of ζ(3) is known as Apéry’s constant because Roger Apéry proved in 1978 that ζ(3) is irrational.

Like the Euler-Mascheroni constant, Apéry’s constant has come up here multiple times.

The connection of the Gumbel distribution to Apéry’s constant is that the skewness of the distribution is

12√6 ζ(3)/π³.
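Both the mean and the skewness can be checked numerically. The sketch below inverts the CDF, x = −log(−log u), and averages over evenly spaced values of u. This is a crude midpoint rule, chosen for brevity rather than accuracy, so expect agreement to only a few decimal places.

```python
from math import log, pi, sqrt

EULER_GAMMA = 0.5772156649015329
zeta3 = sum(1 / k**3 for k in range(1, 100_001))   # Apéry's constant, truncated series

# Inverse CDF of the standard Gumbel: F(x) = u  <=>  x = -log(-log(u))
n = 200_000
xs = [-log(-log((i + 0.5) / n)) for i in range(n)]   # midpoints of n equal-probability bins

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
skew = sum((x - mean) ** 3 for x in xs) / n / var ** 1.5

print(mean, EULER_GAMMA)
print(skew, 12 * sqrt(6) * zeta3 / pi ** 3)
```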

Prime numbers and Taylor’s law

The previous post commented that although the digits in the decimal representation of π are not random, it is sometimes useful to think of them as random. Similarly, it is often useful to think of prime numbers as being randomly distributed.

If prime numbers were samples from a random variable, it would be natural to look into the mean and variance of that random variable. We can’t just compute the mean of all primes, but we can compute the mean and variance of all primes less than an upper bound x.

Let M(x) be the mean of all primes less than x and let V(x) be the corresponding variance. Then we have the following asymptotic results:

M(x) ~ x / 2


V(x) ~ x²/12.

We can investigate how well these limiting results fit for finite x with the following Python code.

    from sympy import sieve
    def stats(x):
        s = 0
        ss = 0
        count = 0
        for p in sieve.primerange(x):
            s += p
            ss += p**2
            count += 1
        mean = s / count
        variance = ss/count - mean**2
        return (mean, variance)

So, for example, when x = 1,000 we get a mean of 453.14, a little less than the predicted value of 500. We get a variance of 88389.44, a bit more than the predicted value of 83333.33.

When x = 1,000,000 we get closer to values predicted by the limiting formula. We get a mean of 478,361, still less than the prediction of 500,000, but closer. And we get a variance of 85,742,831,604, still larger than the prediction 83,333,333,333, but again closer. (Closer here means the ratios are getting closer to 1; the absolute difference is actually getting larger.)

Taylor’s law

Taylor’s law is named after ecologist Lionel Taylor (1924–2007) who proposed the law in 1961. Taylor observed that variance and mean are often approximately related by a power law independent of sample size, that is

V(x) ≈ a M(x)^b

independent of x.

Taylor’s law is an empirical observation in ecology, but it is a theorem when applied to the distribution of primes. According to the asymptotic results above, we have a = 1/3 and b = 2 in the limit as x goes to infinity. Let’s use the code above to look at the ratio

V(x) / (a M(x)^b)

for increasing values of x.

If we let x = 10^k for k = 1, 2, 3, …, 8 we get ratios

0.612, 1.392, 1.291, 1.207, 1.156, 1.124, 1.102, 1.087

which are slowly converging to 1.
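Here is one way the ratios could be computed. This sketch swaps SymPy's sieve for a small standard-library sieve and stops at k = 6 to keep the run short; the function names are mine.

```python
def primes_below(n):
    """Sieve of Eratosthenes: list of primes less than n."""
    flags = bytearray([1]) * n
    flags[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if flags[p]:
            flags[p * p :: p] = bytearray(len(range(p * p, n, p)))
    return [i for i in range(n) if flags[i]]

def taylor_ratio(x, a=1/3, b=2):
    """V(x) / (a M(x)^b) for the primes less than x."""
    ps = primes_below(x)
    mean = sum(ps) / len(ps)
    variance = sum(p * p for p in ps) / len(ps) - mean ** 2
    return variance / (a * mean ** b)

for k in range(1, 7):
    print(k, round(taylor_ratio(10 ** k), 3))
```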

Reference: Joel E. Cohen. Statistics of Primes (and Probably Twin Primes) Satisfy Taylor’s Law from Ecology. The American Statistician, Vol. 70, No. 4 (November 2016), pp. 399–404

The coupon collector problem and π

How far do you have to go down the decimal digits of π until you’ve seen all the digits 0 through 9?

We can print out the first few digits of π and see that there’s no 0 until the 32nd decimal place.

    3.14159265358979323846264338327950

It’s easy to verify that the remaining digits occur before the 0, so the answer is 32.

Now suppose we want to look at pairs of digits. How far out do we have to go until we’ve seen all pairs of digits (or base 100 digits if you want to think of it that way)? And what about triples of digits?

We know we’ll need at least 100 pairs, and at least 1000 triples, so this has gotten bigger than we want to do by hand. So here’s a little Python script that will do the work for us.

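    from mpmath import mp
    mp.dps = 30_000
    s = str(mp.pi)[2:]  # decimal digits of π after the "3."
    for k in [1, 2, 3]:
        # split the digits into consecutive non-overlapping k-tuples
        tuples = [s[i:i+k] for i in range(0, len(s), k)]
        d = dict()
        i = 0
        # read tuples until all 10**k possible values have appeared
        while len(d) < 10**k:
            d[tuples[i]] = 1
            i += 1
        print(i)

The output:

    32
    396
    6076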
This confirms that by the 32nd decimal place we will have seen all 10 possible digits. It says we need 396 pairs of digits before we see all 100 possible digit pairs, and we’ll need 6076 triples before we’ve seen all possible triples.

We could have used the asymptotic solution to the “coupon collector problem” to approximately predict the results above.

Suppose you have an urn with n uniquely labeled balls. You randomly select one ball at a time, return the ball to the urn, and select randomly again. The coupon collector problem asks how many draws you’ll have to make before you’ve selected each ball at least once.

The expected value for the number of draws is

n H_n

where H_n is the nth harmonic number. For large n this is approximately equal to

n(log n + γ)

where γ is the Euler-Mascheroni constant. (More on the gamma constant here.)

Now assume the digits of π are random. Of course they’re not random, but random is as random does. We can get useful estimates by making the modeling assumption that the digits behave like a random sequence.

The solution to the coupon collector problem says we’d expect, on average, to sample about 29 digits before we see each digit, 518 pairs before we see each pair, and 7485 triples before we see each triple. “On average” doesn’t mean much since there’s only one π, but you could interpret this as saying what you’d expect if you repeatedly chose real numbers at random and looked at their digits, assuming the normal number conjecture.

The variance of the number of draws needed is asymptotically π² n²/6, so the number of draws will usually be in an interval of the expected value ± 2n.
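The estimates can be tabulated with a few lines of Python. This sketch prints the exact expectation n H_n, the asymptotic approximation n(log n + γ), and the asymptotic standard deviation πn/√6 for n = 10, 100, and 1000; the function name is my own.

```python
from math import log, pi, sqrt

EULER_GAMMA = 0.5772156649015329

def expected_draws(n):
    """Exact expected number of draws, n * H_n."""
    return n * sum(1 / k for k in range(1, n + 1))

for n in (10, 100, 1000):
    exact = expected_draws(n)
    approx = n * (log(n) + EULER_GAMMA)
    sd = pi * n / sqrt(6)          # asymptotic standard deviation
    print(n, round(exact, 1), round(approx, 1), round(sd, 1))
```

The standard deviation is about 1.28n, which is why ± 2n covers most outcomes.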

If you want the details of the coupon collector problem, not just the expected value but the probabilities for different number of draws, see Sampling with replacement until you’ve seen everything.


Numbering minor league baseball teams

Last week I wrote about how to number MLB teams so that the number n told you where they are in the league hierarchy:

  • n % 2 tells you the league, American or National
  • n % 3 tells you the division: East, Central, or West
  • n % 5 is unique within a league/division combination.

Here n % m denotes n mod m, the remainder when n is divided by m.

This post will do something similar for minor league teams.

There are four minor league teams associated with each major league team. If we wanted to number them analogously, we’d need to do something a little different because we cannot specify n % 2 and n % 4 independently. We’d need an approach that is a hybrid of what we did for the NFL and MLB.

We could specify the league and the rank within the minor leagues by three bits: one bit for National or American league, and two bits for the rank:

  • 00 for A
  • 01 for High A
  • 10 for AA
  • 11 for AAA

It will be convenient later on if we make the ranks the most significant bits and the league the least significant bit.

So to place a minor league team on a list, we could write down the numbers 1 through 120, and for each n, calculate r = n % 8, d = n % 3, and k = n % 5.

The latest episode of 99% Invisible is called RoboUmp, a show about automating umpire calls. As part of the story, the show discusses the whimsical names of minor league teams and how the names allude to their location. For example, the El Paso Chihuahuas are located across the border from the Mexican state of Chihuahua and their mascot is a chihuahua dog. (The dog was named after the state.)

The El Paso Chihuahuas are the AAA team associated with the San Diego Padres, a team in the National League West, team #3 in the order listed in the MLB post. The number n for the Chihuahuas must equal 7 mod 8, i.e. 111₂: the first two bits for AAA and the last bit for the National League. We also require n to be 2 mod 3 because it’s in the West, and n = 3 mod 5 because the Padres are #3 in the list of National League West teams in our numbering. It works out that n = 23.

How do minor league and major league numbers relate? They have to be congruent mod 30. They have to have the same parity since they represent the same league, and must be congruent mod 3 because they are in the same division. And they must be congruent mod 5 to be in the same place in the list of associated major league teams.

So to calculate a minor league team’s number, start with the corresponding major league number, and add multiples of 30 until you get the right value mod 8.

For example, the Houston Astros are number 20 in the list from the earlier post. The Triple-A team associated with the Astros is the Sugar Land Space Cowboys. The number n for the Space Cowboys must be 6 mod 8 because 6 = 110₂, and they’re a Triple-A team (11) in the American League (0). So n = 110.

The Astros’ Double-A team, the Corpus Christi Hooks, needs to have a number equal to 100₂ = 4 mod 8, so n = 20. The High-A team, the Asheville Tourists, is 50 and the Single-A team, the Fayetteville Woodpeckers, is 80.

You can determine what major league team is associated with a minor league team by taking the remainder by 30. For example, the Rocket City Trash Pandas has number 77, so they’re associated with the major league team with number 17, which is the Los Angeles Angels. The remainder when 77 is divided by 8 is 5 = 101₂, which tells you they’re a Double-A team since the high order bits are 1 and 0.
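The procedure is short enough to write out in code. The sketch below assumes major league teams are numbered 1 through 30, as in the earlier post, and that the league bit equals the major league number mod 2; the function names and the RANKS list are my own.

```python
RANKS = ["A", "High A", "AA", "AAA"]   # index = two-bit rank code from the list above

def minor_league_number(major, rank):
    """Number in 1..120 congruent to `major` mod 30 with the right rank and league bits mod 8."""
    r = 2 * RANKS.index(rank) + major % 2      # rank in the high bits, league in the low bit
    return next(n for n in range(1, 121) if n % 30 == major % 30 and n % 8 == r)

def decode(n):
    """Return (major league team number, rank) for a minor league team number."""
    major = n % 30 or 30
    return major, RANKS[(n % 8) >> 1]

print(minor_league_number(23, "AAA"))                 # El Paso Chihuahuas
print([minor_league_number(20, r) for r in RANKS])    # Astros affiliates, A through AAA
print(decode(77))                                     # Rocket City Trash Pandas
```

A solution always exists because the league bit matches the parity of the major league number, and it is unique in 1 through 120 because 120 is the least common multiple of 8 and 30.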