A connected topology for the integers

You can define a topology on the positive integers by choosing as basic open sets the arithmetic sequences an + b where a and b are relatively prime positive integers. Solomon Golomb defines this topology in [1] and proves that it is connected.

But that paper doesn’t prove that the proposed basis really is a basis. That is implicitly an exercise for the reader, and we carry out that exercise here. The proof will say that certain things can be computed, and we’ll go a step further and actually compute them with Python.

Proof

To prove that these sets form the basis of a topology, we need to establish two things.

  1. Every point is contained in some basis element.
  2. The intersection of basis elements is another basis element or empty.

The first is easy. For any positive integer x, the sequence (x+1)n + x contains x, and x+1 is relatively prime to x. Any other number relatively prime to x would work just as well.

The second is more interesting. Consider two sequences: an + b and cm + d. A number x in the intersection satisfies the pair of equations

x = b mod a
x = d mod c.

If a and c are relatively prime, the Chinese Remainder Theorem says that the pair of equations has a solution y, and the solutions are unique mod ac. So the intersection of the two series is acn + y.

If a and c are not relatively prime, let r be their greatest common divisor. Then the intersection of an + b and cm + d is another sequence of the same form, with stride the least common multiple of a and c, if b and d are congruent mod r, and empty otherwise.

Separating points

A topological space is connected if it is not the union of two nonempty disjoint open sets. An easy way to make a space connected is to not have any disjoint open sets. But Golomb’s topology has plenty of disjoint open sets.

In fact, his topology is Hausdorff, meaning that any two points can be separated by disjoint open sets. The proof is simple. Take two distinct points s and t. Let p be any prime bigger than both s and t, and consider the two sets pn + s and pn + t. These are basis elements because p is relatively prime to both s and t, and they are disjoint because s and t are not congruent mod p.
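
For example, with s = 3, t = 5, and p = 7, we can check empirically that the two sets are disjoint, at least over finitely many terms:

    >>> A = set(7*n + 3 for n in range(100))
    >>> B = set(7*n + 5 for n in range(100))
    >>> A.intersection(B)
    set()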

Python calculation

Our proof split into two cases: when the strides of the two series are relatively prime and when they are not.

Relatively prime strides

To get concrete, suppose we have the sequences

2, 9, 16, 23, …

and

3, 14, 25, 36, …

or in other words 7n + 2 and 11n + 3. According to the theory above, these sequences intersect because 7 and 11 are relatively prime, and the intersection should be of the form 77n + y, and we should be able to compute y from the Chinese Remainder Theorem. We will use the function crt from SymPy. (See another example of using this function here.)

    >>> from sympy.ntheory.modular import crt
    >>> crt([7,11], [2, 3], symmetric=False)
    (58, 77)

This reports that y = 58.

Now let’s verify that the intersection of our two series looks like 77n + 58.

    >>> A = set(2+7*n for n in range(100))
    >>> B = set(3+11*n for n in range(100))
    >>> sorted(A.intersection(B))
    [58, 135, 212, 289, 366, 443, 520, 597, 674]

Strides with a common factor: empty intersection

Now for the case where a and c have greatest common divisor r > 1. Consider, for example,

1, 16, 31, 46, …

and

2, 14, 26, 38, …

The sequences 15n + 1 and 12m + 2 do not intersect. The greatest common divisor of 15 and 12 is 3. Numbers in the first series are congruent to 1 mod 3 and numbers in the second series are congruent to 2 mod 3, so no number can be in both. If we didn’t realize this, we could call

    crt([15,12], [1, 2], symmetric=False)

and see that it returns None.

Strides with a common factor: finding the intersection

But now let’s look at

1, 16, 31, 46, …

and

13, 25, 37, 49, …

Now both sequences are congruent to 1 mod 3, so it’s at least feasible that the two series may intersect. And in fact they do.

    >>> crt([15,12], [1, 13], symmetric=False)
    (1, 60)

This says that the solutions are exactly the numbers congruent to 1 mod 60. Now 1 itself is not in both sequences, but 61 is, and so is 60n + 1 for every positive integer n. We can illustrate this as before by computing the intersection of the two series.

    >>> A = set(1+15*n for n in range(30))
    >>> B = set(13+12*n for n in range(30))
    >>> sorted(A.intersection(B))
    [61, 121, 181, 241, 301, 361]


[1] Solomon W. Golomb. A Connected Topology for the Integers. The American Mathematical Monthly, Vol. 66 (Oct. 1959), pp. 663–665.

Finding large pseudoprimes

Fermat’s little theorem says that if p is a prime number, then for any integer b not divisible by p,

b^(p-1) = 1 (mod p).

This gives a necessary but not sufficient test for a number to be prime. A number that satisfies the equation above but is not prime is called a pseudoprime [1] for base b. For example, 341 is a pseudoprime base 2 because

2^340 = 1 mod 341

and clearly 341 = 11*31.
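
Both claims are easy to check in Python:

    >>> pow(2, 340, 341)
    1
    >>> 11*31
    341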

How might you go about hunting for pseudoprimes? This page gives a way to find pseudoprimes base 2, quoting the following theorem.

If both numbers q and 2q – 1 are primes and n = q(2q-1) then 2^(n-1) = 1 (mod n) if and only if q is of the form 12k + 1.

Here’s Python code that will find pseudoprimes of a given size in bits.

    from secrets import randbits
    from sympy import isprime

    def find_pseudoprime(numbits):
        # q gets half the bits, so q*(2q - 1) has about numbits bits
        n = numbits // 2
        while True:
            q = randbits(n)
            # conditions from the theorem: q prime, q = 12k + 1, 2q - 1 prime
            if isprime(q) and q%12 == 1 and isprime(2*q-1):
                return q*(2*q-1)

    print( find_pseudoprime(256) )

This returned

47362882341088596725068562696893704769436677460225591859092704246296157080253

in under a second.

Clearly the return value of find_pseudoprime is composite since it is computed as the product of two integers. You could confirm it returns a pseudoprime by computing

    pow(2, p-1, p)

on the return value to verify that it equals 1.
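
Spelled out, the check might look like this, reusing the isprime import from above:

    p = find_pseudoprime(256)
    assert not isprime(p)       # composite by construction
    assert pow(2, p-1, p) == 1  # and yet it passes Fermat's test base 2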

***

[1] Sometimes such a number is called a Fermat pseudoprime to distinguish it from other kinds of pseudoprimes, i.e. numbers that satisfy other necessary conditions for being prime without actually being prime. For example, Perrin pseudoprimes. See the next post for more examples.

Estimating the proportion of smooth numbers


A number is said to be “smooth” if all its prime factors are small. To make this precise, a number is said to be y-smooth if it only has prime factors less than or equal to y. So, for example, 1000 is 5-smooth.
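
We can confirm this with SymPy: factorint returns the prime factorization as a dictionary, and every prime factor of 1000 is at most 5.

    >>> from sympy import factorint
    >>> factorint(1000)
    {2: 3, 5: 3}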

The de Bruijn function φ(x, y) counts the number of y-smooth positive integers up to x, and so φ(x, y)/x is the probability that a number between 1 and x is y-smooth.

It turns out that

φ(x, x^(1/u))/x ≈ 1/u^u.

This means, for example, that the proportion of numbers with less than 100 digits whose factors all have less than 20 digits is roughly 1/5^5 = 1/3125.

Source: The Joy of Factoring

Here’s a little Python code to experiment with this approximation. We’ll look at the proportion of 96-bit numbers whose largest prime factor has at most 32 bits.

    from secrets import randbits
    from sympy.ntheory.factor_ import smoothness

    smooth = 0
    for _ in range(100):
        x = randbits(96)
        s = smoothness(x)[0]
        if s < 2**32:
            smooth += 1
    print("Num small:", smooth)

The SymPy function smoothness returns a pair, and the first element of the pair is the largest prime dividing its argument.

In our case u = 3, and so we’d expect about 1/27 of our numbers to have no factors larger than 32 bits. I ran this five times, and got 8, 2, 5, 3, and 7. From the approximation above we’d expect results around 4, so our results are not surprising.

Generating Python code from SymPy

Yesterday I wrote about Householder’s higher-order generalizations of Newton’s root finding method. For n at least 2, define

H_n(x) = x + (n-1) \frac{\left( \frac{1}{f(x)}\right)^{(n-2)}}{\left( \frac{1}{f(x)}\right)^{(n-1)}}

and iterate Hn to find a root of f(x). When n = 2, this is Newton’s method. In yesterday’s post I used Mathematica to find expressions for H3 and H4, then used Mathematica’s FortranForm[] function to export Python code. (Mathematica doesn’t have a function to export Python code per se, but the Fortran syntax was identical in this case.)

Aaron Meurer pointed out that it would have been easier to generate the Python code in Python, using SymPy to do the calculus and lambdify() to generate the code. I hadn’t heard of lambdify before, so I tried out his suggestion. The resulting code is nice and compact.

    from sympy import diff, symbols, lambdify

    def f(x, a, b):
        return x**5 + a*x + b

    def H(x, a, b, n):
        x_, a_, b_ = x, a, b
        x, a, b = symbols('x a b')
        expr = diff(1/f(x,a,b), x, n-2) / \
               diff(1/f(x,a,b), x, n-1)
        g = lambdify([x,a,b], expr)
        return x_ + (n-1)*g(x_, a_, b_)

This implements all the Hn at once. The previous post implemented three of the Hn separately.

The first couple lines of H require a little explanation. I wanted to use the same names for the numbers that the function H takes and the symbols that SymPy operates on, so I saved the numbers to local variables.

This code is fine for a demo, but in production you’d want to generate the function g once (for each n) and save the result rather than generating it on every call to H.
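
Here’s a sketch of one way to do that, using functools.lru_cache so the symbolic differentiation and code generation happen only on the first call for each n:

    from functools import lru_cache
    from sympy import diff, symbols, lambdify

    def f(x, a, b):
        return x**5 + a*x + b

    @lru_cache(maxsize=None)
    def make_g(n):
        # build and cache the compiled function once per n
        x, a, b = symbols('x a b')
        expr = diff(1/f(x,a,b), x, n-2) / \
               diff(1/f(x,a,b), x, n-1)
        return lambdify([x,a,b], expr)

    def H(x, a, b, n):
        return x + (n-1)*make_g(n)(x, a, b)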

Software to factor integers

In my previous post, I showed how changing one bit of a semiprime (i.e. the product of two primes) creates an integer that can be factored much faster. I started writing that post using Python with SymPy, but moved to Mathematica because factoring took too long.

SymPy vs Mathematica

When I’m working in Python, SymPy lets me stay in Python. I’ll often use SymPy for a task that Mathematica could do better just so I can stay in one environment. But sometimes efficiency is a problem.

SymPy is written in pure Python, for better and for worse. When it comes to factoring large integers, it’s for worse. I tried factoring a 140-bit integer with SymPy, and killed the process after over an hour. Mathematica factored the same integer in 1/3 of a second.

Mathematica vs PARI/GP

The previous post factors 200-bit semiprimes. The first example, N = pq where

p = 1078376712338123201911958185123
q = 1126171711601272883728179081277

took 99.94 seconds to factor using Mathematica. A random sample of 13 products of 100-bit primes took an average of 99.1 seconds each to factor.

Using PARI/GP, factoring the value of N above took 11.4 seconds. I then generated a sample of 10 products of 100-bit primes, and on average they took 10.4 seconds each to factor using PARI/GP.

So in these examples, Mathematica is several orders of magnitude faster than SymPy, and PARI/GP is one order of magnitude faster than Mathematica.

It could be that the PARI/GP algorithms are relatively better at factoring semiprimes. To compare the efficiency of PARI/GP and Mathematica on non-semiprimes, I repeated the exercise in the previous post, flipping each bit of N one at a time and factoring.

This took 240.3 seconds with PARI/GP. The same computation in Mathematica took 994.5 seconds. So in this example, PARI/GP is about 4 times faster, whereas for semiprimes it was about 10 times faster.

Python and PARI

There is a Python interface to PARI called cypari2. It should offer the convenience of working in Python with the efficiency of PARI. Unfortunately, the installation failed on my computer. I think SageMath interfaces Python to PARI but I haven’t tried it.
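
For what it’s worth, if the installation does succeed on your machine, basic usage looks something like the sketch below. I haven’t been able to run it, so treat it as an untested outline.

    import cypari2

    pari = cypari2.Pari()
    # factor the semiprime N from the example above
    print(pari.factor(1078376712338123201911958185123 *
                      1126171711601272883728179081277))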


Feller-Tornier constant

Here’s kind of an unusual question: What is the density of integers that have an even number of prime factors with an exponent greater than 1?

To define the density, you take the proportion up to an integer N then take the limit as N goes to infinity.

It’s not obvious that the limit should exist. But if it does exist, you might guess that it’s 1/2; it’s a question involving whether something is even or odd, and so even and odd occurrences might balance each other out.

But the result, known as the Feller-Tornier constant, is bigger than 1/2. This seems more reasonable when you consider that zero is an even number and a lot of numbers may not have any prime factors with exponent bigger than 1.

Here’s a little Python code to compute the ratio for N = 10^6.

    from sympy.ntheory import factorint

    def f(n):
        # exponents in the prime factorization of n
        exponents = factorint(n).values()
        # count the prime factors appearing with exponent > 1
        k = len([e for e in exponents if e > 1])
        return k % 2 == 0

    c = 0
    N = 1000000
    for n in range(1, N+1):
        if f(n):
            c += 1
    print(c/N)

This gives a ratio of 0.661344. The limiting value is 0.661317049….

Computing the Feller-Tornier constant

The code above gives an estimate for the Feller-Tornier constant, but how accurate is it? Is there a more efficient way to estimate it that doesn’t require factoring a lot of numbers?

The constant is given analytically by

c = \frac{1}{2} + \frac{1}{2}\prod_{n=1}^\infty \left(1 - \frac{2}{p_n^2} \right )

where pn is the nth prime.

Here’s some Python code to estimate the Feller-Tornier constant using the expression above.

    from sympy import primerange

    N = 1000000
    product = 1
    for p in primerange(2, N):
        product *= 1 - 2*p**-2
    print(0.5*(1+product))

Note that we used primerange rather than prime(n). If you need to generate a range of prime numbers, the former is much more efficient.

The code gives us an estimate of 0.66131707, which is good to seven decimal places. Furthermore, we know this is an upper bound on the exact value because we’ve left out terms in the infinite product that are less than 1.

One could calculate a lower bound as well and thus know an interval that must contain the true value. The lower bound is left as an exercise to the reader.

An attack on RSA with exponent 3

As I noted in this post, RSA encryption is often carried out reusing exponents. Sometimes the exponent is 3, which is subject to an attack we’ll describe below [1]. (The most common exponent is 65537.)

Suppose the same message m is sent to three recipients and all three use exponent e = 3. Each recipient has a different modulus Ni, and each will receive a different encrypted message

ci = m³ mod Ni.

Someone with access to c1, c2, and c3 can recover the message m as follows. We can assume each modulus Ni is relatively prime to the others; otherwise we can recover the private keys using the method described here. Since the moduli are relatively prime, we can solve the three equations for m³ using the Chinese Remainder Theorem. There is a unique x < N1 N2 N3 such that

x = c1 mod N1
x = c2 mod N2
x = c3 mod N3

and m is simply the cube root of x. What makes this possible is knowing m is a positive integer less than each of the Ns, and that x < N1 N2 N3. It follows that we can simply take the cube root in the integers and not the cube root in modular arithmetic.

This is an attack on “textbook” RSA because the weakness in this post could be avoided by real-world precautions such as adding random padding to each message so that no two recipients are sent the exact same message.

By the way, a similar trick works even if you only have access to one encrypted message. Suppose you’re using a 2048-bit modulus N and exchanging a 256-bit key. If your message m is simply the key without padding, then m³ is less than N, and so you can simply take the cube root of the encrypted message in the integers.
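
Here’s a sketch of that scenario. The modulus below is a hypothetical stand-in for a real public key, and the “key” m is just 256 random bits.

    from secrets import randbits
    from sympy import integer_nthroot, nextprime

    # hypothetical 2048-bit RSA modulus
    N = nextprime(randbits(1024)) * nextprime(randbits(1024))

    m = randbits(256)  # unpadded 256-bit key
    c = pow(m, 3, N)   # m**3 < N, so the mod does nothing

    root, exact = integer_nthroot(c, 3)
    assert exact and root == m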

Python example

Here we’ll work out a specific example using realistic RSA moduli.

    from secrets import randbits, randbelow
    from sympy import cbrt, nextprime
    from sympy.ntheory.modular import crt
    
    def modulus():
        p = nextprime(randbits(2048))
        q = nextprime(randbits(2048))
        return p*q
    
    N = [modulus() for _ in range(3)]
    m = randbelow(min(N))
    c = [pow(m, 3, N[i]) for i in range(3)]
    x = crt(N, c)[0]
    
    assert(cbrt(x) == m) # integer cube root

Note that crt implements the Chinese Remainder Theorem. It returns a pair of numbers, the first being the solution we’re after, hence the [0] after the call.

The script takes a few seconds to run. Nearly all the time goes to finding the 2048-bit (617-digit) primes that go into the moduli. Encrypting and decrypting m takes less than a second.


[1] I don’t know who first discovered this line of attack, but you can find it written up here. At least in the first edition; the link is to the 2nd edition which I don’t have.

RSA with Pseudoprimes

RSA setup

Recall the setup for RSA encryption given in the previous post.

  1. Select two very large prime numbers p and q.
  2. Compute n = pq and φ(n) = (p – 1)(q – 1).
  3. Choose an encryption key e relatively prime to φ(n).
  4. Calculate the decryption key d such that ed = 1 (mod φ(n)).
  5. Publish e and n, and keep d, p, and q secret.

φ is Euler’s totient function, defined here.

There’s a complication in the first step. Maybe you think the numbers p and q are prime, but they’re not. In that case the calculation of φ in step 2 is wrong.

Pseudoprimes

The numbers p and q need to be “very large,” where the definition of what constitutes large changes over time due to Moore’s law and progress with factoring algorithms. Currently p and q would typically have at least 2048 bits each. It’s easy to find numbers that large that are probably prime, but it’s difficult to be certain.

At the moment, the fastest way to test primality has a small chance of making a false positive error, but no chance of a false negative. That is, if the test says a number is composite, it is certainly composite. But if the test says a number may be prime, there is a small chance that it is not prime. (Details here.) So in practice RSA starts by finding two large probable primes or pseudoprimes.

Discussions of RSA encryption often point out that large pseudoprimes are very rare, so it isn’t a problem that RSA starts with pseudoprimes. But what does that mean? Is there a one in a trillion chance that your private key won’t work and nobody can send you messages? Or that you can receive some messages and not others?

Encryption and decryption

RSA encryption works by taking a message m and raising it to the exponent e modulo n where e and n are defined at the top of the post. To decrypt the message, you raise it to the exponent d modulo n where d is your private decryption key. Because d was computed to satisfy

ed = 1 (mod φ(n)),

Euler’s theorem says that we’ll get our message back. We’ll give an example below.

What if p or q is a pseudoprime?

If p and q are prime, then φ(n) = (p – 1)(q – 1). But if we’re wrong in our assumption that one of these factors is prime, our calculation of φ(n) will be wrong. Will our encryption and decryption process work anyway? Maybe.

We’ll do three examples below, all using small numbers. We’ll use e = 257 as our public encryption exponent and m = 42 as the message to encrypt in all examples.

In the first example p and q are indeed prime, and everything works as it should. In the next two examples we will replace p with a pseudoprime. In the second example everything works despite our error, but in the third example decryption fails.

Example 1: p and q prime

We’ll start with p = 337 and q = 283. Both are prime. The Python code below shows that d = 60833 and the encrypted message is 45431. Everything works as advertised.
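
Using the code at the bottom of the post, here are the intermediate values for this example:

    >>> from sympy import mod_inverse
    >>> p, q = 337, 283
    >>> mod_inverse(257, (p-1)*(q-1))
    60833
    >>> pow(42, 257, p*q)
    45431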

Example 2: p pseudoprime

Now we use p = 341 and q = 283. Here p is a pseudoprime for base 2, i.e.

2^340 = 1 mod 341

and so 341 passes Fermat’s primality test [1] for b = 2. Now d = 10073 and the encrypted message is 94956. Decrypting the encrypted message works, even though p is not prime and our calculation for φ is wrong. In fact, the process works not just for our message m = 42 but for every message.

Example 3: p pseudoprime

Here again we let p be the pseudoprime 341 but set q to the prime 389. Now d = 6673, the encrypted message is 7660, but decrypting the encrypted message returns 55669, not 42 as we started with. Decryption fails for all other messages as well.

If we use the correct value for φ(pq) the example works. That is, if we use

φ(341*389) = φ(11*31*389) = 10*30*388

rather than the incorrect value 340*388, the decryption process recovers our original message.
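
Here’s a quick check of that claim:

    >>> from sympy import mod_inverse
    >>> n = 341*389
    >>> d = mod_inverse(257, 10*30*388)  # using the correct value of φ
    >>> pow(pow(42, 257, n), d, n)
    42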

What can go wrong

The examples above show that if we mistakenly think one of our numbers is a prime when it is only a pseudoprime, decryption might succeed or it might fail. In either case, we assume n = pq has two prime factors when in fact it has more, and so n is easier to factor than we thought.

Python code

Here’s the code for the examples above.

    from sympy import gcd, mod_inverse
    
    message = 42
    e = 257
    
    def example(p, q):
        n = p*q
        phi = (p-1)*(q-1)
        assert( gcd(e, phi) == 1 )
        
        d = mod_inverse(e, phi)
        assert( d*e % phi == 1 )
    
        encrypted = pow(message, e, n)
        decrypted = pow(encrypted, d, n)
        return (message == decrypted)
    
    print(example(337, 283))  # True:  both factors prime
    print(example(341, 283))  # True:  works despite p being a pseudoprime
    print(example(341, 389))  # False: decryption fails


[1] Fermat’s primality test is explained here. In our example we only tried one base, b = 2. If we had also tried b = 3 we would have discovered that 341 is not prime. In practice one would try several different bases. Using the heuristic that failures are independent for each base, it’s very unlikely that a composite number would be a pseudoprime for each of, say, 50 different bases.

Fermat’s factoring trick and cryptography

Many encryption algorithms rely on the difficulty of factoring a large number n. If you want to make n hard to factor, you want it to have only two factors. Otherwise, the more factors n has, the smaller the smallest factor must be.

So if you want n to be the product of two large primes, p and q, you want to pick these primes to be roughly the same size so that the smaller factor is as large as possible. If you’re limited on the size of n, then you want p and q to be roughly of size √n. But not too close to √n. You may see in a description of a cryptographic algorithm, such as RSA, “Pick two large primes p and q, but not too close together, …” Why is that?

The answer goes back to Fermat (1607–1665). His factoring trick is to start with an odd composite n and look for numbers a and b such that

n = a² – b²

because if you can do that, then

n = (a + b)(a – b).

This trick always works [1], but it’s only practical when the factors are close together. If they are close together, you can do a brute force search for a and b. But otherwise you’re better off doing something else.

Small example

To give an example, suppose we want to factor n = 12319. Then √n = 110.99, so we can start looking for a and b by trying a = 111. We increment a until a² – n is a square, b².

Now 111² – 12319 = 2, so that didn’t work. But 112² – 12319 = 225, which is a square, and so we take a = 112 and b = 15. This tells us p = 112+15 = 127 and q = 112 – 15 = 97.

Larger example with Python code

Now let’s do a larger example, too big to do by hand and more similar to a real application.

    from sympy import sqrt, log, ceiling, Integer

    def is_square(n):
        # sympy's sqrt returns an Integer exactly when n is a perfect square
        return type(sqrt(n)) == Integer

    def fermat_factor(n):
        # evaluate sqrt(n) with enough precision to start just above it
        num_digits = int(log(n, 10).evalf() + 1)
        a = ceiling( sqrt(n).evalf(num_digits) )

        counter = 0
        while not is_square(a*a - n):
            a += 1
            counter += 1

        b = sqrt(a*a - n)
        return (a+b, a-b, counter)

    p = 314159200000000028138418196395985880850000485810513
    q = 314159200000000028138415196395985880850000485810479
    print( fermat_factor(p*q) )

This recovers p and q in 3,580 iterations. Note that the difference between p and q is large in absolute terms, approximately 3 × 10^27, but small relative to p and q.
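
As a sanity check, the same function recovers the small example worked by hand above, finding the factors after a single increment of a:

    >>> fermat_factor(12319)
    (127, 97, 1)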


[1] If n = pq, then you can set a = (p + q)/2 and b = (p – q)/2.


Empirically testing the Chowla conjecture

Terry Tao’s most recent blog post looks at the Chowla conjecture theoretically. This post looks at the same conjecture empirically using Python. (Which is much easier!)

The Liouville function λ(n) is (-1)^Ω(n) where Ω(n) is the number of prime factors of n counted with multiplicity. So, for example, Ω(9) = 2 because even though 9 has only one distinct prime factor, that factor counts twice.

Given some set of k strides h1, h2, …, hk, define

f(n) = λ(n + h1) λ(n + h2) … λ(n + hk).

The Chowla conjecture says that the average of the first N values of f(n) tends to zero as N goes to infinity. This has been proven for k = 1 but not for larger k.

If f(n) acts like a Bernoulli random variable, our experiments might increase our confidence in the conjecture, but they wouldn’t prove anything. Unexpected results wouldn’t prove anything either, but a departure from random behavior might help find a counterexample if the conjecture is false.

We’re going to be evaluating the Liouville function repeatedly at the same arguments, so it will save some compute time if we tabulate its values. This will also let us use some compact Python notation to average f. We’ll need to tabulate the function up to N + hk.

In the code below, maxstride is an upper bound on the strides hk we may look at. SymPy has a function primeomega that calculates Ω(n) so we might as well use it. If you wanted to use a very large value of N, you might want to fill the array of Liouville function values using a more direct approach that avoids all the factoring implicit in calls to primeomega.

    from sympy import primeomega
    from numpy import empty

    N = 10000
    maxstride = 100

    liouville = empty(N + maxstride)
    liouville[0] = 1  # λ(0) is not defined; placeholder so index n holds λ(n)
    for n in range(1, len(liouville)):
        liouville[n] = (-1)**primeomega(n)

The following code looks at the Chowla conjecture for h1 = 0 and h2 ranging over 1, 2, …, 99.

    average = empty(maxstride-1)
    for s in range(1, maxstride):
        average[s-1] = (liouville[0:N] * liouville[s:N+s]).sum()/N

If the Liouville function is distributed like a random variable taking on -1 and 1 with equal probability, we’d expect the averages to vary like samples from a normal distribution with mean 0 and variance 1/(2N).

    print(average.mean())
    print(average.std())
    print( (2*N)**-0.5 )

This returns

    0.00141
    0.00851
    0.00707

and so the means are indeed near zero, and the standard deviation of the samples is about what we’d expect.

What if we replace Ω(n), the number of prime factors with multiplicity, with ω(n), the number of distinct prime factors? The averages above are still small, but the sample variance is about twice as big as before. I’ve seen this pattern with several different large values of N, so I suspect there’s something going on here.

(I didn’t see a function in SymPy corresponding to ω(n), but you can create your own with len(factorint(n)).)
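
Concretely, that variation amounts to changing one line in the table-building loop above:

    from sympy import factorint

    # ω(n) = number of distinct prime factors of n
    for n in range(1, len(liouville)):
        liouville[n] = (-1)**len(factorint(n))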