Time to factor big integers Python and Mathematica

This post will look at the time required to factor n − 1 each of the following prime numbers in Python (SymPy) and Mathematica. The next post will explain why I wanted to factor these numbers.

p = 2254 + 4707489544292117082687961190295928833
q = 2254 + 4707489545178046908921067385359695873
r = 2254 + 45560315531419706090280762371685220353
s = 2254 + 45560315531506369815346746415080538113

Here are the timing results.

    |   |   Python | Mathematica |
    |---+----------+-------------|
    | p |    0.913 |       0.616 |
    | q |    0.003 |       0.002 |
    | r |  582.107 |      14.915 |
    | s | 1065.925 |      20.763 |

This is hardly a carefully designed benchmark, but it’s enough to suggest Mathematica can be a couple orders of magnitude faster than Python.

Here are the factorizations.

p − 1 = 234 × 3 × 4322432633228119 × 129942003317277863333406104563609448670518081918257
q − 1 = 233 × 3 × 5179 × 216901160674121772178243990852639108850176422522235334586122689
r − 1 = 232 × 32 × 463 × 539204044132271846773 × 8999194758858563409123804352480028797519453
s − 1 = 232 × 32 × 1709 × 24859 × 1690502597179744445941507 × 10427374428728808478656897599072717

Monero’s elliptic curve

Monero logo
Digital signatures often use elliptic curves. For example, Bitcoin and Ethereum use the elliptic curve secp256k1 [1]. This post will discuss the elliptic curve Ed25519 [2] using in Monero and in many other applications.

Ed25519 has the equation

y² − x² = 1 + d x² y²

over the finite field Fq where q = 2255 − 19 and d = −121665/121666. The general form of the equation above makes Ed25519 a “twisted Edwards curve.”

The expression for d is not the rational number it appears to be. Think of it as

d = −121665 × 121666−1

where the multiplication and the multiplicative inverse are calculated mod q.

We could calculate d in Python as follows.

>>> q = 2**255 - 19
>>> d = (-121665*pow(121666, -1, q)) % q
>>> d 
37095705934669439343138083508754565189542113879843219016388785533085940283555

The equation above does not look like an elliptic curve if you think of an elliptic curve having the form

y² = x³ + ax + b.

But that form, known as Weierstrass form, is not the most general definition of an elliptic curve. Elliptic curves can be written in that form [3] but they do not have to be. There are computational advantages to writing curves in the form

y² − x² = 1 + d x² y²

when possible rather than in Weierstrass form [4].

Elliptic curve digital signatures require a specified base point. The Monero whitepaper describes the base point simply as

G = (x, −4/5).

That’s leaving out a fair bit of detail. First of all, 4/5 is interpreted as 4 times the multiplicative inverse of 5 mod q, similar to the calculation for d above.

>>> y = (-4*pow(5, -1, q)) % q; y
11579208923731619542357098500868790785326998466564056403945758400791312963989

Now we have two tasks. How do we solve for x, and which solution do we take?

We need to solve

x² = (y² − 1)/(1 + d y²) mod q.

We can do this in Python as follows.

>>> from sympy import sqrt_mod
>>> roots = sqrt_mod((y**2 - 1)*pow(1 + d*y**2, -1, q), q, all_roots=True)
>>> roots 
[15112221349535400772501151409588531511454012693041857206046113283949847762202,
42783823269122696939284341094755422415180979639778424813682678720006717057747]

So which root to we choose? The convention is to use the even solution, the first one above. (The two roots add to 0 mod q; one will be odd and one will be even because q is odd.)

We can verify that (x, y) is on the Ed25519 elliptic curve:

>>> ((y**2 - x**2) - (1 + d*x**2 * y**2)) % q
0

Related posts

[1] The name secp256k1 was created as follows. The “sec” comes from “Standards for Efficient Cryptography,” referring to the group that specified the curve parameters. The “p” means the curve is over a finite field of prime order. The order of the curve is slightly less than 2256. The “k” indicates that this is a Koblitz curve.

[2] The “Ed” part of the name refers to Harold Edwards, the mathematician who studied the family of elliptic curves now known as Edwards curves. The “25519” part of the name refers to the fact that the curve is over the finite field with 2255 − 19 elements.

[3] Provided the elliptic curve is not over a field of characteristic 2 or 3.

[4] Group operations can be implemented more efficiently and the point at infinity can be handled without exception logic.

Factoring pseudoprimes

Fermat’s little theorem says that if p is a prime number, then for any positive integer b < p we have

bp−1 = 1 (mod p).

This theorem gives a necessary but not sufficient condition for a number to be prime.

Fermat’s primality test

The converse of Fermat’s little theorem is not always true, but it’s often true. That is, if there exists some base 1 < b < n such that

bn−1 = 1 (mod n)

then n is likely to be prime. There are examples where the equation above holds for a pair (b, n) even though n is not prime, and in that case n is called a pseudoprime to the base b.

If you’re searching for large primes, say for use in encryption, then you’d begin by applying Fermat’s little theorem with a few small values of b. This is because although Fermat’s test can’t prove that a number is prime, it can prove that a number is not prime.

For a small example, suppose you wanted to test whether 50621 is prime. You could start by applying Fermat’s test with b = 2 as in the following Python code.

>>> n = 50621
>>> 2**(n−1) % n
9605

Since the result is not 1, we know 50621 is not prime. This doesn’t tell us what the factors of 50621 are, but we know that it has nontrivial factors. We say 2 is a witness that the number 50621 is not prime.

Next, let’s see whether 294409 might be prime.

>>> n = 294409
>>> 2**(n – 1) % n
1

This tells us 294409 might be prime. It has passed a test that filters out a lot of composite numbers. What now? We could try other values of b: 3, 5, 7, 11, …. This will not resolve the question of whether 294409 is prime unless we keep going until we try 37. And in fact 37 is the smallest factor of 294409. Our number 294409 is a Carmichael number, a composite number n that passes Fermat’s primality test for all bases b relatively prime to n.

Note that it would be more efficient to use pow(b, n − 1, n) rather than 2**(n − 1) % n because the former takes advantage of the fact that we don’t need to compute 2n−1 per se and can reduce all intermediate calculations mod n.

Factoring pseudoprimes

Now suppose we have a number n that has passed Fermat’s primality test for some base b and we suspect that n is a pseudoprime. If we want to (try to) factor n, knowing that it is a pseudoprime to the base b gives us a head start. We can exploit the fact that we know b to factor n in polynomial time, unless n is a strong pseudoprime.

Suppose we have a number n that we suspect is a pseudoprime to the base b, and we’re smart enough to at least check that n is an odd number, then we begin by pulling out all the factors of 2 that we can from n − 1:

n − 1 = 2e f.

Next consider the set of numbers

bkf

for k = 1, 2, 4, …, 2e. Let x be the smallest of these numbers which is not congruent to 1 mod n. The existence of such an x is essentially the definition of strong pseudoprime [1].

Then gcd(x − 1, n) and gcd(x + 1, n) are factors of n. This is theorem 10.4 of [2].

Python example

Let n = 873181. This is a pseudoprime to the base b = 3, which we can confirm by seeing that pow(3, n−1, n) returns 1.

Now 873180 is divisible by 4 but not by 8, so e = 2. So the theorem above says we should compute

>>> b, e = 3, 2
>>> [pow(b, f*2**k, n) for k in range(e+1)]

This produces [2643, 1, 1]. So x = 2643,

>> x = 2643
>>> from sympy import gcd
>>> gcd(x−1, n)
1321
>>> gcd(x+1, n)
661

shows that 1321 and 661 are both factors of 873181.

Related posts

[1] Definition of strong pseudoprime. A strong pseudo prime to base b is a composite odd integer m such that if m − 1 = 2ef  with f odd, then either bf = 1 (mod m) or bf2c ≡ −1 (mod m) for some 0 ≤ c < e.

[2] The Joy of Factoring by Samuel S. Wagstaff, Jr.

Number of groups of squarefree order

This post is a sort of footnote to the previous post, Estimating the number of groups of a given order.

The following is taken from an answer to a question on Stack Exchange.

In general there is no formula f(n) for the number of groups of order up to isomorphism. However, if n is squarefree (no prime to the power 2 divides n), then Otto Hölder in 1893 proved the following amazing formula

f(n) = \sum_{m \mid n} \prod_p \frac{p^{c(p)} - 1}{p-1}

where p is a prime divisor of n/m and c(p) is the number of prime divisors q of m that satisfy q = 1 (mod p).

In a sense that can be made precise, about 60% of integers are square free [1]. So Hölder’s formula is often useful, but there are a lot of values it can’t compute.

In this post I develop Python code to implement Hölder’s formula. I’ve tested the code below by comparing it to a table of f(n) for n = 1, 2, 3, …, 100.

from sympy import mobius, divisors, factorint

def squarefree(n):
    return n > 1 and mobius(n) != 0

def prime_divisors(n):
    return list(factorint(n).keys())

def c(p, m):
    return len( [q for q in prime_divisors(m) if q % p == 1] )

def holder(n):
    if not squarefree(n):
        print("Sorry. I can't help you.")
        return None

    s = 0
    for m in divisors(n):
        prod = 1
        for p in prime_divisors(n // m):
            prod *= (p**c(p, m) - 1) // (p - 1)
        s += prod
        
    return s

[1] The number of square-free integers between 1 and x is asymptotically equal to 6x/π² as x → ∞.

A connected topology for the integers

You can define a topology on the positive integers by choosing as an open basis sets the series of the form an + b where a and b are relatively prime positive integers. Solomon Golumb defines this topology in [1] and proves that it is connected.

But that paper doesn’t prove that proposed basis really is a basis. That is implicitly an exercise for the reader, and we carry out that exercise here. The proof will say that certain things can be computed, and we’ll go a step further an actually compute them with Python.

Proof

To prove that these sets form the basis of a topology, we need to establish two things.

  1. Every point is contained in some basis element.
  2. The intersection of basis elements is another basis element or empty.

The first is easy. For any positive number x, the sequence (x + 1)n + x contains x, and x + 1 is relatively prime to x. Any other number relatively prime to x would work just as well.

The second is more interesting. Consider two sequences: an + b and cm + d. A number x in the intersection satisfies the pair of equations

x = b mod a
x = d mod c.

If a and c are relatively prime, the Chinese Remainder Theorem says that the equation has a solution y, and the solutions are unique mod ac. So the intersection of the two series is acn + y.

If a and c are not relatively prime, let r be their greatest common divisor. Then the intersection of an + b and cm + d is another sequence of the same form if b and d are congruent mod r and empty otherwise.

Separating points

A topological space is connected if it is not the union of two disjoint open sets. An easy way to make a space connected is to not have any disjoint open sets. But Golumb’s topology has plenty of disjoint open sets.

In fact, his topology is Hausdorff, meaning that any two points can be separated by disjoint open sets. The proof is simple. Take two points s and t. Let p be any prime bigger than both s and t, and consider the two sets pn + s and pn + t.

Python calculation

Our proof split into two cases: when the strides of the two series are relatively prime and when they are not.

Relatively prime strides

To get concrete, suppose we have the sequences

2, 9, 16, 23, …

and

3, 14, 25, 35, …

or in other words 7n + 2 and 11n + 3. According to the theory above, these sequences intersect because 7 and 11 are relatively prime, and the intersection should be of the form 77n + y, and we should be able to compute y from the Chinese Remainder Theorem. We will use the function crt from SymPy. (See another example of using this function here.)

    >>> from sympy.ntheory.modular import crt
    >>> crt([7,11], [2, 3], symmetric=False)
    >>> (58, 77)

This reports that y = 58.

Now let’s verify that the intersection of our two series looks like 77n + 58.

    >>> A = set(2+7*n for n in range(100))
    >>> B = set(3+11*n for n in range(100))
    >>> sorted(A.intersection(B))
    [58, 135, 212, 289, 366, 443, 520, 597, 674]

Strides with a common factor: empty intersection

Now for the case where a and c have greatest common divisor r > 1. Consider, for example,

1, 16, 31, 46, …

and

2, 14, 26, 38, …

The sequences 15n + 1 and 12m + 2 do not intersect. The greatest common divisor of 15 and 12 is 3. Numbers in the first series are congruent to 1 mod 3 and numbers in the second series are congruent to 2 mod 3, so no number can be in both. If we didn’t realize this, we could call

    crt([15,12], [1, 2], symmetric=False)

and see that it returns None.

Strides with a common factor: finding the intersection

But now let’s look at

1, 16, 31, 46, …

and

13, 25, 37, 49, …

Now both sequences are congruent to 1 mod 3, so it’s at least feasible that the two series may intersect. And in fact they do.

    >>> crt([15,12], [1, 13], symmetric=False)
    (1, 60)

This says that the set of solutions are all congruent to 1 mod 60. Now 1 itself is not a solution, but 61 is, and so is 60n + 1 for all positive integers n. We can illustrate this as before by computing the intersection of the two series.

    >>> A = set(1+15*n for n in range(30))
    >>> B = set(13+12*n for n in range(30))
    >>> sorted(A.intersection(B))
    [61, 121, 181, 241, 301, 361]

Related posts

[1] Solomon W. Golomb. A Connected Topology for the Integers. The American Mathematical Monthly , Oct., 1959, pp. 663-665

Finding large pseudoprimes

Fermat’s little theorem says that if p is a prime number, then for any integer b,

bp−1 = 1 (mod p).

This gives a necessary but not sufficient test for a number to be prime. A number that satisfies the equation above but is not prime is called a pseudoprime [1] for base b. For example, 341 is a pseudoprime base 2 because

2340 = 1 mod 341

and clearly 341 = 11*31.

How might you go about hunting for pseudoprimes? This page gives a way to find pseduoprimes base 2, quoting the following theorem.

If both numbers q and 2q − 1 are primes and n = q(2q − 1) then 2n−1 = 1 (mod n) if and only if q is of the form 12k + 1.

Here’s Python code that will find pseudoprimes of a given size in bits.

    from secrets import randbits
    from sympy import isprime

    def find_pseudoprime(numbits):
        n = numbits // 2
        while True:
            q = randbits(n)    
            if isprime(q) and q%12 == 1 and isprime(2*q-1):
                return q*(2*q-1)

    print( find_pseudoprime(256) )

This returned

47362882341088596725068562696893704769436677460225591859092704246296157080253

in under a second.

Clearly the return value of find_pseudoprime is composite since it is computed as the product of two integers. You could confirm it returns a pseudoprime by computing

    pow(2, p-1, p)

on the return value to verify that it equals 1.

***

[1] Sometimes such a number is called a Fermat pseudoprime to distinguish it from other kinds of pseudo primes, i.e. numbers that satisfy other necessary conditions for being prime without actually being prime. For example, Perrin pseudoprimes. See the next post for more examples.

Estimating the proportion of smooth numbers

river rocks

A number is said to be “smooth” if all its prime factors are small. To make this precise, a number is said to be y-smooth if it only has prime factors less than or equal to y. So, for example, 1000 is 5-smooth.

The de Bruijn function φ(x, y) counts the number of y-smooth positive integers up to x, and so φ(x, y)/x is the probability that a number between 1 and x is y-smooth.

It turns out that

φ(x, x1/u)/x ≈ 1/uu.

This means, for example, that the proportion of numbers with less than 100 digits whose factors all have less than 20 digits is roughly 1/55 = 1/3125.

Source: The Joy of Factoring

Here’s a little Python code to experiment with this approximation. We’ll look at the proportion of 96-bit numbers whose largest prime factor has at most 32 bits.

    from secrets import randbits
    from sympy.ntheory.factor_ import smoothness

    smooth = 0
    for _ in range(100):
        x = randbits(96)
        s = smoothness(x)[0]
        if s < 2**32:
            smooth += 1
    print("Num small:", smooth)

The SymPy function smoothness returns a pair, and the first element of the pair is the largest prime dividing its argument.

In our case u = 3, and so we’d expect about 1/27 of our numbers to have no factors larger than 32 bits. I ran this five times, and got 8, 2, 5, 3, and 7. From the approximation above we’d expect results around 4, so our results are not surprising.

Generating Python code from SymPy

Yesterday I wrote about Householder’s higher-order generalizations of Newton’s root finding method. For n at least 2, define

H_n(x) = x + (n-1) \frac{\left( \frac{1}{f(x)}\right)^{(n-2)}}{\left( \frac{1}{f(x)}\right)^{(n-1)}}

and iterate Hn to find a root of f(x). When n = 2, this is Newton’s method. In yesterday’s post I used Mathematica to find expressions for H3 and H4, then used Mathematica’s FortranForm[] function to export Python code. (Mathematica doesn’t have a function to export Python code per se, but the Fortran syntax was identical in this case.)

Aaron Muerer pointed out that it would have been easier to generate the Python code in Python using SymPy to do the calculus and labdify() to generate the code. I hadn’t heard of lambdify before, so I tried out his suggestion. The resulting code is nice and compact.

    from sympy import diff, symbols, lambdify

    def f(x, a, b):
        return x**5 + a*x + b

    def H(x, a, b, n):
        x_, a_, b_ = x, a, b
        x, a, b = symbols('x a b')
        expr = diff(1/f(x,a,b), x, n-2) / \
               diff(1/f(x,a,b), x, n-1)
        g = lambdify([x,a,b], expr)
        return x_ + (n-1)*g(x_, a_, b_)

This implements all the Hn at once. The previous post implemented three of the Hn separately.

The first couple lines of H require a little explanation. I wanted to use the same names for the numbers that the function H takes and the symbols that SymPy operated on, so I saved the numbers to local variables.

This code is fine for a demo, but in production you’d want to generate the function g once (for each n) and save the result rather than generating it on every call to H.

Software to factor integers

In my previous post, I showed how changing one bit of a semiprime (i.e. the product of two primes) creates an integer that can be factored much faster. I started writing that post using Python with SymPy, but moved to Mathematica because factoring took too long.

SymPy vs Mathematica

When I’m working in Python, SymPy lets me stay in Python. I’ll often use SymPy for a task that Mathematica could do better just so I can stay in one environment. But sometimes efficiency is a problem.

SymPy is written in pure Python, for better and for worse. When it comes to factoring large integers, it’s for worse. I tried factoring a 140-bit integer with SymPy, and killed the process after over an hour. Mathematica factored the same integer in 1/3 of a second.

Mathematica vs PARI/GP

The previous post factors 200-bit semiprimes. The first example, N = pq where

p = 1078376712338123201911958185123
q = 1126171711601272883728179081277

took 99.94 seconds to factor using Mathematica. A random sample of 13 products of 100-bit primes and they took an average of 99.1 seconds to factor.

Using PARI/GP, factoring the value of N above took 11.4 seconds to factor. I then generated a sample of 10 products of 100-bit primes and on average they took 10.4 seconds to factor using PARI/GP.

So in these examples, Mathematica is several orders of magnitude faster than SymPy, and PARI/GP is one order of magnitude faster than Mathematica.

It could be that the PARI/GP algorithms are relatively better at factoring semiprimes. To compare the efficiency of PARI/GP and Mathematica on non-semiprimes, I repeated the exercise in the previous post, flipping each bit of N one at a time and factoring.

This took 240.3 seconds with PARI/GP. The same code in Mathematica took 994.5 seconds. So in this example, PARI/GP is about 4 times faster where as for semiprimes it was 10 times faster.

Python and PARI

There is a Python interface to PARI called cypari2. It should offer the convenience of working in Python with the efficiency of PARI. Unfortunately, the installation failed on my computer. I think SageMath interfaces Python to PARI but I haven’t tried it.

More on prime factoring

Feller-Tornier constant

Here’s kind of an unusual question: What is the density of integers that have an even number of prime factors with an exponent greater than 1?

To define the density, you take the proportion up to an integer N then take the limit as N goes to infinity.

It’s not obvious that the limit should exist. But if it does exist, you might guess that it’s 1/2; it’s a question involving whether something is even or odd, and so even and odd occurrences might balance each other out.

But the result, known as the Feller-Tonier constant, is bigger than 1/2. This seems more reasonable when you consider that zero is an even number and a lot of numbers may not have any prime factors with exponent bigger than 1.

Here’s a little Python code to compute the ratio for N = 106.

    from sympy.ntheory import factorint

    def f(n):
        v = factorint(n).values()
        n = len([x for x in v if x > 1])
        return n % 2 == 0

    c = 0
    N = 1000000
    for n in range(N):
        if f(n):
            c += 1
    print (c/N)

This gives a ratio of 0.661344. The limiting value is 0.661370494….

Computing the Feller-Tornier constant

The code above gives an estimate for the Feller-Tornier constant, but how accurate is it? Is there a more efficient way to estimate it that doesn’t require factoring a lot of numbers?

The constant is given analytically by

c = \frac{1}{2} + \frac{1}{2}\prod_{n=1}^\infty \left(1 - \frac{2}{p_n^2} \right )

where pn is the nth prime.

Here’s some Python code to estimate the Feller-Tornier constant using the expression above.

    N = 1000000
    product = 1
    for p in primerange(2, N):
        product *= 1 - 2*p**-2
    print(0.5*(1+product))

Note that we used primerange rather than prime(n). If you need to generate a range of prime numbers, the former is much more efficient.

The code gives us an estimate of 0.66131707, which is good to seven decimal places. Furthermore, we know this is an upper bound on the exact value because we’ve left out terms in the infinite product that are less than 1.

One could calculate a lower bound as well and thus know an interval that must contain the true value. The lower bound is left as an exercise to the reader.