# Finding large pseudoprimes

Fermat’s little theorem says that if p is a prime number, then for any integer b,

bp-1 = 1 (mod p).

This gives a necessary but not sufficient test for a number to be prime. A number that satisfies the equation above but is not prime is called a pseudoprime  for base b. For example, 341 is a pseudoprime base 2 because

2340 = 1 mod 341

and clearly 341 = 11*31.

How might you go about hunting for pseudoprimes? This page gives a way to find pseduoprimes base 2, quoting the following theorem.

If both numbers q and 2q – 1 are primes and n = q(2q-1) then 2n-1 = 1 (mod n) if and only if q is of the form 12k + 1.

Here’s Python code that will find pseudoprimes of a given size in bits.

    from secrets import randbits
from sympy import isprime

def find_pseudoprime(numbits):
n = numbits // 2
while True:
q = randbits(n)
if isprime(q) and q%12 == 1 and isprime(2*q-1):
return q*(2*q-1)

print( find_pseudoprime(256) )


This returned

47362882341088596725068562696893704769436677460225591859092704246296157080253

in under a second.

Clearly the return value of find_pseudoprime is composite since it is computed as the product of two integers. You could confirm it returns a pseudoprime by computing

    pow(2, p-1, p)


on the return value to verify that it equals 1.

***

 Sometimes such a number is called a Fermat pseudoprime to distinguish it from other kinds of pseudo primes, i.e. numbers that satisfy other necessary conditions for being prime without actually being prime. For example, Perrin pseudoprimes. See the next post for more examples.

# Estimating the proportion of smooth numbers A number is said to be “smooth” if all its prime factors are small. To make this precise, a number is said to be y-smooth if it only has prime factors less than or equal to y. So, for example, 1000 is 5-smooth.

The de Bruijn function φ(x, y) counts the number of y-smooth positive integers up to x, and so φ(x, y)/x is the probability that a number between 1 and x is y-smooth.

It turns out that

φ(x, x1/u)/x ≈ 1/uu.

This means, for example, that the proportion of numbers with less than 100 digits whose factors all have less than 20 digits is roughly 1/55 = 1/3125.

Source: The Joy of Factoring

Here’s a little Python code to experiment with this approximation. We’ll look at the proportion of 96-bit numbers whose largest prime factor has at most 32 bits.

    from secrets import randbits
from sympy.ntheory.factor_ import smoothness

smooth = 0
for _ in range(100):
x = randbits(96)
s = smoothness(x)
if s < 2**32:
smooth += 1
print("Num small:", smooth)


The SymPy function smoothness returns a pair, and the first element of the pair is the largest prime dividing its argument.

In our case u = 3, and so we’d expect about 1/27 of our numbers to have no factors larger than 32 bits. I ran this five times, and got 8, 2, 5, 3, and 7. From the approximation above we’d expect results around 4, so our results are not surprising.

# Generating Python code from SymPy

Yesterday I wrote about Householder’s higher-order generalizations of Newton’s root finding method. For n at least 2, define and iterate Hn to find a root of f(x). When n = 2, this is Newton’s method. In yesterday’s post I used Mathematica to find expressions for H3 and H4, then used Mathematica’s FortranForm[] function to export Python code. (Mathematica doesn’t have a function to export Python code per se, but the Fortran syntax was identical in this case.)

Aaron Muerer pointed out that it would have been easier to generate the Python code in Python using SymPy to do the calculus and labdify() to generate the code. I hadn’t heard of lambdify before, so I tried out his suggestion. The resulting code is nice and compact.

    from sympy import diff, symbols, lambdify

def f(x, a, b):
return x**5 + a*x + b

def H(x, a, b, n):
x_, a_, b_ = x, a, b
x, a, b = symbols('x a b')
expr = diff(1/f(x,a,b), x, n-2) / \
diff(1/f(x,a,b), x, n-1)
g = lambdify([x,a,b], expr)
return x_ + (n-1)*g(x_, a_, b_)


This implements all the Hn at once. The previous post implemented three of the Hn separately.

The first couple lines of H require a little explanation. I wanted to use the same names for the numbers that the function H takes and the symbols that SymPy operated on, so I saved the numbers to local variables.

This code is fine for a demo, but in production you’d want to generate the function g once (for each n) and save the result rather than generating it on every call to H.

# Software to factor integers

In my previous post, I showed how changing one bit of a semiprime (i.e. the product of two primes) creates an integer that can be factored much faster. I started writing that post using Python with SymPy, but moved to Mathematica because factoring took too long.

## SymPy vs Mathematica

When I’m working in Python, SymPy lets me stay in Python. I’ll often use SymPy for a task that Mathematica could do better just so I can stay in one environment. But sometimes efficiency is a problem.

SymPy is written in pure Python, for better and for worse. When it comes to factoring large integers, it’s for worse. I tried factoring a 140-bit integer with SymPy, and killed the process after over an hour. Mathematica factored the same integer in 1/3 of a second.

## Mathematica vs PARI/GP

The previous post factors 200-bit semiprimes. The first example, N = pq where

p = 1078376712338123201911958185123
q = 1126171711601272883728179081277

took 99.94 seconds to factor using Mathematica. A random sample of 13 products of 100-bit primes and they took an average of 99.1 seconds to factor.

Using PARI/GP, factoring the value of N above took 11.4 seconds to factor. I then generated a sample of 10 products of 100-bit primes and on average they took 10.4 seconds to factor using PARI/GP.

So in these examples, Mathematica is several orders of magnitude faster than SymPy, and PARI/GP is one order of magnitude faster than Mathematica.

It could be that the PARI/GP algorithms are relatively better at factoring semiprimes. To compare the efficiency of PARI/GP and Mathematica on non-semiprimes, I repeated the exercise in the previous post, flipping each bit of N one at a time and factoring.

This took 240.3 seconds with PARI/GP. The same code in Mathematica took 994.5 seconds. So in this example, PARI/GP is about 4 times faster where as for semiprimes it was 10 times faster.

## Python and PARI

There is a Python interface to PARI called cypari2. It should offer the convenience of working in Python with the efficiency of PARI. Unfortunately, the installation failed on my computer. I think SageMath interfaces Python to PARI but I haven’t tried it.

# Feller-Tornier constant

Here’s kind of an unusual question: What is the density of integers that have an even number of prime factors with an exponent greater than 1?

To define the density, you take the proportion up to an integer N then take the limit as N goes to infinity.

It’s not obvious that the limit should exist. But if it does exist, you might guess that it’s 1/2; it’s a question involving whether something is even or odd, and so even and odd occurrences might balance each other out.

But the result, known as the Feller-Tonier constant, is bigger than 1/2. This seems more reasonable when you consider that zero is an even number and a lot of numbers may not have any prime factors with exponent bigger than 1.

Here’s a little Python code to compute the ratio for N = 106.

    from sympy.ntheory import factorint

def f(n):
v = factorint(n).values()
n = len([x for x in v if x > 1])
return n % 2 == 0

c = 0
N = 1000000
for n in range(N):
if f(n):
c += 1
print (c/N)


This gives a ratio of 0.661344. The limiting value is 0.661370494….

## Computing the Feller-Tornier constant

The code above gives an estimate for the Feller-Tornier constant, but how accurate is it? Is there a more efficient way to estimate it that doesn’t require factoring a lot of numbers?

The constant is given analytically by where pn is the nth prime.

Here’s some Python code to estimate the Feller-Tornier constant using the expression above.

    N = 1000000
product = 1
for p in primerange(2, N):
product *= 1 - 2*p**-2
print(0.5*(1+product))


Note that we used primerange rather than prime(n). If you need to generate a range of prime numbers, the former is much more efficient.

The code gives us an estimate of 0.66131707, which is good to seven decimal places. Furthermore, we know this is an upper bound on the exact value because we’ve left out terms in the infinite product that are less than 1.

One could calculate a lower bound as well and thus know an interval that must contain the true value. The lower bound is left as an exercise to the reader.

# An attack on RSA with exponent 3

As I noted in this post, RSA encryption is often carried out reusing exponents. Sometimes the exponent is exponent 3, which is subject to an attack we’ll describe below . (The most common exponent is 65537.)

Suppose the same message m is sent to three recipients and all three use exponent e = 3. Each recipient has a different modulus Ni, and each will receive a different encrypted message

ci = m³ mod Ni.

Someone with access to c1, c2, and c3 can recover the message m as follows. We can assume each modulus Ni is relatively prime to the others, otherwise we can recover the private keys using the method described here. Since the moduli are relatively prime, we can solve the three equations for m³ using the Chinese Remainder Theorem. There is a unique x < N1 N2 N3 such that

x = c1 mod N1
x = c2 mod N2
x = c3 mod N3

and m is simply the cube root of x. What makes this possible is knowing m is a positive integer less than each of the Ns, and that x < N1 N2 N3. It follows that we can simply take the cube root in the integers and not the cube root in modular arithmetic.

This is an attack on “textbook” RSA because the weakness in this post could be avoiding by real-world precautions such as adding random padding to each message so that no two recipients are sent the exact same message.

By the way, a similar trick works even if you only have access to one encrypted message. Suppose you’re using a 2048-bit modulus N and exchanging a 256-bit key. If you message m is simply the key without padding, then m³ is less than N, and so you can simply take the cube root of the encrypted message in the integers.

## Python example

Here we’ll work out a specific example using realistic RSA moduli.

    from secrets import randbits, randbelow
from sympy import nextprime
from sympy.ntheory.modular import crt

def modulus():
p = nextprime(randbits(2048))
q = nextprime(randbits(2048))
return p*q

N = [modulus() for _ in range(3)]
m = randbelow(min(N))
c = [pow(m, 3, N[i]) for i in range(3)]
x = crt(N, c)

assert(cbrt(x) == m) # integer cube root


Note that crt is the Chinese Remainder Theorem. It returns a pair of numbers, the first being the solution we’re after, hence the  after the call.

The script takes a few seconds to run. Nearly all the time goes to finding the 2048-bit (617-digit) primes that go into the moduli. Encrypting and decrypting m takes less than a second.

## Related posts

 I don’t know who first discovered this line of attack, but you can find it written up here. At least in the first edition; the link is to the 2nd edition which I don’t have.

# RSA with Pseudoprimes

## RSA setup

Recall the setup for RSA encryption given in the previous post.

1. Select two very large prime numbers p and q.
2. Compute n = pq and φ(n) = (p – 1)(q – 1).
3. Choose an encryption key e relatively prime to φ(n).
4. Calculate the decryption key d such that ed = 1 (mod φ(n)).
5. Publish e and n, and keep dp, and q secret.

φ is Euler’s totient function, defined here.

There’s a complication in the first step. Maybe you think the numbers p and q are prime, but they’re not. In that case the calculation of φ in step 2 is wrong.

## Pseudoprimes

The numbers p and q need to be “very large,” where the definition of what constitutes large changes over time due to Moore’s law and progress with factoring algorithms. Currently p and q would typically have at least 2048 bits each. It’s easy to find numbers that large that are probably prime, but it’s difficult to be certain.

At the moment, the fastest way to test for primes has a small chance making a false positive error, but no chance of a false negative. That is, if the test says a number is composite, it is certainly composite. But if the test says a number may be prime, there is a small chance that it is not prime. (Details here.) So in practice RSA starts by finding two large probable primes or pseudoprimes.

Discussions of RSA encryption often point out that large pseudoprimes are very rare, so it isn’t a problem that RSA starts with pseudoprimes. But what does that mean? Is there a one in a trillion chance that your private key won’t work and nobody can send you messages? Or that you can receive some messages and not others?

## Encryption and decryption

RSA encryption works by taking a message m and raising it to the exponent e modulo n where e and n are defined at the top of the post. To decrypt the message, you raise it to the exponent d modulo n where d is your private decryption key. Because d was computed to satisfy

ed = 1 (mod φ(n)),

Euler’s theorem says that we’ll get our message back. We’ll give an example below.

## What if p or q are pseudoprime?

If p and q are prime, then φ(n) = (p – 1)(q – 1). But if we’re wrong in our assumption that one of these factors is prime, our calculation of φ(n) will be wrong. Will our encryption and decryption process work anyway? Maybe.

We’ll do three examples below, all using small numbers. We’ll use e = 257 as our public encryption exponent and m = 42 as the message to encrypt in all examples.

In the first example p and q are indeed prime, and everything works as it should. In the next two examples we will replace p with a pseudoprime. In the second example everything works despite our error, but in the third example decryption fails.

### Example 1: p and q prime

We’ll start with p = 337 and q = 283. Both are prime. The Python code below shows that d = 60833 and the encrypted message is 45431. Everything works as advertised.

### Example 2: p pseudoprime

Now we use p = 341 and q = 283. Here p is a pseudoprime for base 2, i.e.

2340 = 1 mod 341

and so 341 passes Fermat’s primality test  for b = 2. Now d = 10073 and the encrypted message is 94956. Decrypting the encrypted message works, even though p is not prime and our calculation for φ is wrong. In fact, the process works not just for our message m = 42 but for every message.

### Example 3: p pseudoprime

Here again we let p be the pseudoprime 341 but set q to the prime 389. Now d = 6673, the encrypted message is 7660, but decrypting the encrypted message returns 55669, not 42 as we started with. Decryption fails for all other messages as well.

If we use the correct value for φ(pq) the example works. That is, if we use

φ(341*389) = φ(11*31*389) = 10*30*388

rather than the incorrect value 340*388 the decryption process recovers our original message.

## What can go wrong

The examples above show that if we mistakenly think one of our numbers is a prime when it is only a pseudoprime, decryption might succeed or it might fail. In either case, we assume npq has two prime factors when in fact it has more, and so n is easier to factor than we thought.

## Python code

Here’s the code for the examples above.

    from sympy import gcd, mod_inverse

message = 42
e = 257

def example(p, q):
n = p*q
phi = (p-1)*(q-1)
assert( gcd(e, phi) == 1 )

d = mod_inverse(e, phi)
assert( d*e % phi == 1 )

encrypted = pow(message, e, n)
decrypted = pow(encrypted, d, n)
return (message == decrypted)

print(example(337, 283))
print(example(341, 283))
print(example(341, 389))


## Related posts

 Fermat’s primality test is explained here. In our example we only tried one base, b = 2. If we had also tried b = 3 we would have discovered that 341 is not prime. In practice one would try several different bases. Using the heuristic that failures are independent for each base, it’s very unlikely that a composite number would be a pseudoprime for each of, say, 5o different bases.

# Fermat’s factoring trick and cryptography

Many encryption algorithms rely on the difficulty of factoring a large number n. If you want to make n hard to factor, you want it to have only two factors. Otherwise, the more factors n has, the smaller the smallest factor must be.

So if you want n to be the product of two large primes, p and q, you want to pick these primes to be roughly the same size so that the smaller factor is as large as possible. If you’re limited on the size of n, then you want p and q to be roughly of size √n. But not too close to √n. You may see in a description of a cryptographic algorithm, such as RSA, “Pick two large primes p and q, but not too close together, …” Why is that?

The answer goes back to Fermat (1607–1665). His factoring trick is to start with an odd composite n and look for numbers a and b such that

n = a² – b²

because if you can do that, then

n = (ab)(a – b).

This trick always works , but it’s only practical when the factors are close together. If they are close together, you can do a brute force search for a and b. But otherwise you’re better off doing something else.

## Small example

To give an example, suppose we want to factor n = 12319. Then √n = 110.99, so we can start looking for a and b by trying a = 111. We increment a until a² – n is a square, b².

Now 111² – 12319 = 2, so that didn’t work. But 112² – 12319 = 225, which is a square, and so we take a = 112 and b = 15. This tells us p = 112+15 = 127 and q = 112 – 15 = 97.

## Larger example with Python code

Now let’s do a larger example, too big to do by hand and more similar to a real application.

    from sympy import sqrt, log, ceiling, Integer

def is_square(n):
return type(sqrt(n)) == Integer

def fermat_factor(n):
num_digits = int(log(n, 10).evalf() + 1)
a = ceiling( sqrt(n).evalf(num_digits) )

counter = 0
while not is_square(a*a - n):
a += 1
counter += 1

b = sqrt(a*a - n)
return(a+b, a-b, counter)

p = 314159200000000028138418196395985880850000485810513
q = 314159200000000028138415196395985880850000485810479
print( fermat_factor(p*q) )


This recovers p and q in 3,580 iterations. Note that the difference in p and q is large in absolute terms, approximately 3 × 1027, but small relative to p and q.

## Related posts

 If n = pq, then you can set a = (p + q)/2 and b = (p – q)/2.

# Empirically testing the Chowla conjecture

Terry Tao’s most recent blog post looks at the Chowla conjecture theoretically. This post looks at the same conjecture empirically using Python. (Which is much easier!)

The Liouville function λ(n) is (-1)Ω(n) where Ω(n) is the number of prime factors of n counted with multiplicity. So, for example, Ω(9) = 2 because even though 9 has only one distinct prime factor, that factor counts twice.

Given some set of k strides h1, h2, …, hk, define

f(n) = λ(n + h1) λ(n + h1) … λ(n + hk).

The Chowla conjecture says that the average of the first N values of f(n) tends to zero as N goes to infinity. This has been proven for k = 1 but not for larger k.

If f(n) acts like a Bernoulli random variable, our experiments might increase our confidence in the conjecture, but they wouldn’t prove anything. Unexpected results wouldn’t prove anything either, but a departure from random behavior might help find a counterexample if the conjecture is false.

We’re going to be evaluating the Liouville function repeatedly at the same arguments, so it will save some compute time if we tabulate its values. This will also let us use some compact Python notation to average f. We’ll need to tabulate the function up to Nhk.

In the code below, maxstride is an upper bound on the strides hk we may look at. SymPy has a function primeomega that calculates Ω(n) so we might as well use it. If you wanted to use a very large value of N, you might want to fill the array of Liouville function values using a more direct approach that avoids all the factoring implicit in calls to primeomega.

    from sympy import primeomega
from numpy import empty

N = 10000
maxstride = 100

liouville = empty(N + maxstride)
liouville = 1
for n in range(1, len(liouville)):
liouville[n] = (-1)**primeomega(n)


The following code looks at the Chowla conjecture for h1 = 0 and h2 ranging over 1, 2, …, 99.

    average = empty(maxstride-1)
for s in range(1, maxstride):
average[s-1] = (liouville[0:N] * liouville[s:N+s]).sum()/N


If the Liouville function is distributed like a random variable taking on -1 and 1 with equal probability, we’d expect the averages to vary like samples form a normal distribution with mean 0 and variance 1/(2N).

    print(average.mean())
print(average.std())
print( (2*N)**-0.5 )


This returns

    0.00141
0.00851
0.00707


and so the means are indeed near zero, and the standard deviation of the samples is about what we’d expect.

What if we replace Ω(n), the number of prime factors with multiplicity, with ω(n), the number of distinct prime factors? The averages above are still small, but the sample variance is about twice as big as before. I’ve seen this pattern with several different large values of N, so I suspect there’s something going on here.

(I didn’t see a function in SymPy corresponding to ω(n), but you can create your own with len(factorint(n)).

# Predicting when an RNG will output a given value

A few days ago I wrote about how to pick the seed of a simple random number generator so that a desired output came n values later. The number n was fixed and we varied the seed. In this post, the seed will be fixed and we’ll solve for n. In other words, we ask when a pseudorandom sequence will produce a given value.

In principle you could just run the RNG until you get the output you’re looking for, but we’ll assume such a brute force approach is not feasible or at least not fast enough.

If a LCG (linear congruential generator) has seed z, multiplier a, and modulus m, then the nth output is an z reduced mod m. So our task is to solve

x = an z mod m

for n. If we forget for a moment that we’re working with integers mod m, we see that the solution is

n = loga (x / z)

We can actually make this work if we interpret division by z to mean multiplication by the inverse of z mod m and if we interpret the logarithm to be a discrete logarithm. For more on discrete logarithms and one algorithm for computing them, see this post.

In an earlier post I used  a = 742938285 and m = 231 – 1 = 2147483647. We set n = 100 and solved for z to make the 100th output equal to 20170816, the date of that post. It turned out that z = 1898888478.

Now let’s set the seed z = 1898888478 and ask when the LCG sequence will take on the value x = 20170816. Of course we know that n will turn out to be 100, but let’s pretend we don’t know that. We’ll write a little Python script to find n.

I expect there’s a simple way to compute modular inverses using SymPy, but I haven’t found it, so I used some code from StackOverflow.

The following Python code produces n = 100, as expected.

from sympy.ntheory import discrete_log

def egcd(a, b):
if a == 0:
return (b, 0, 1)
else:
g, y, x = egcd(b % a, a)
return (g, x - (b // a) * y, y)

def modinv(a, m):
g, x, y = egcd(a, m)
if g != 1:
raise Exception('modular inverse does not exist')
else:
return x % m

a = 742938285
z = 1898888478
m = 2**31 - 1
x = 20170816
zinv = modinv(z, m)

n = discrete_log(m, x*zinv, a)
print(n)