This post uses a hypothesis test for proportions to look at a couple conjectures in number theory. It is similar to my earlier post on the chi-square test and prime remainders. You could read this as a post on statistics or a post on number theory, depending on which you’re less familiar with.

Using statistical tests on number theory problems is kind of odd. There’s nothing random going on, so in that since the whole enterprise is unjustified. Nevertheless, statistical tests can be suggestive. They certainly don’t prove theorems, but they can give reason to suspect a theorem is true or false. In that sense, applying statistical tests to number theory isn’t all that different from applying them to more traditional settings.

First we’ll look at the remainders of primes modulo 4. Except for 2, all primes are odd, and so they either have remainder 1 or 3 when divided by 4. Brian Hayes wrote recently that Chebyshev noticed in the 1850’s that there seems to be more primes with remainder 3. Is the imbalance larger than one would expect to see from fair coin tosses?

Here’s some Python code to find the proportion of the first million primes (after 2) that have remainder 3 when divided by 4.

from sympy import prime from scipy import sqrt n = 1000000 rem3 = 0 for i in range (2, n+2): if prime(i) % 4 == 3: rem3 += 1 p_hat = rem3/n

This shows that of the first million odd primes, 500,202 are congruent to 3 mod 4. Would it be unusual for a coin to come up heads this many times in a million flips? To find out we’d compute the z-statistic:

Here *p* is the proportion under the null hypothesis, *q* = 1 – *p*, and *n* is the sample size. In our case, the null hypothesis is *p* = 0.5 and *n* = 1,000,000. [1]

The code

p = 0.5 q = 1 - p z = (p_hat - p)/sqrt(p*q/n)

shows that z = 0.404, hardly a suspiciously large value. If we were looking at random values we’d see a z-score this large or larger 34% of the time. So in this case doesn’t suggest much.

[1] The derivation of the z statistic is fairly quick. If the proportion of successes is *p*, then the number of successes out of *n* trials is binomial(*n*, *p*). For large *n*, this is has approximately the same distribution as a normal distribution with the same mean and variance, mean *np* and variance *npq*. The *proportion* of successes then has approximately mean *p* and standard deviation √(*pq/n*). Subtracting the mean and dividing by the standard deviation normalizes the distribution to have mean 0 and variance 1. So under the null hypothesis, the z statistic has a standard normal distribution.

It’s more efficient to use sympy.nextprime, which will iterate primes from a sieve, than to iterate all numbers and do a primality test on each one.

Ah sorry, I thought you were using isprime(), not prime(). What you are doing is fine.

Wouldn’t it be more appropriate to use a 2-tails Z test so that the probability to obtain more than 500202 or less than 499798 heads is 68%?