Solving hard problems

We help companies solve hard problems in mathematics, statistics, and computing. Let’s explore how we might work together.

Race between primes of the forms 4k + 1 and 4k + 3

Posted on 15 February 2026 by John

The last few posts have looked at expressing an odd prime p as a sum of two squares. This is possible if and only if p is of the form 4k + 1. I illustrated an algorithm for finding the squares with p = 2²⁵⁵ − 19, a prime that is used in cryptography. It is being used in bringing this page to you if the TLS connection between my server and your browser is uses Curve25519 or Ed25519.

World records

I thought about illustrating the algorithm with a larger prime too, such as a world record. But then I realized all the latest record primes have been of the form 4k + 3 and so cannot be written as a sum of squares. Why is p mod 4 equal to 3 for all the records? Are more primes congruent to 3 than to 1 mod 4? The answer to that question is subtle; more on that shortly.

More record primes are congruent to 3 mod 4 because Mersenne primes are easier to find, and that’s because there’s an algorithm, the Lucas-Lehmer test, that can test whether a Mersenne number is prime more efficiently than testing general numbers. Lucas developed his test in 1878 and Lehmer refined it in 1930.

Since the time Lucas first developed his test, the largest known prime has always been a Mersenne prime, with exceptions in 1951 and in 1989.

Chebyshev bias

So, are more primes congruent to 3 mod 4 than are congruent to 1 mod 4?

Define the function f(n) to be the ratio of the number of primes in each residue class.

f(n) = (# primes p < n with p = 3 mod 4) / (# primes p < n with p = 1 mod 4)

As n goes to infinity, the function f(n) converges to 1. So in that sense the number of primes in each category are equal.

If we look at the difference rather than the ratio we get a more subtle story. Define the lead function to be how much the count of primes equal to 3 mod 4 leads the number of primes equal to 1 mod 4.

g(n) = (# primes p < n with p = 3 mod 4) − (# primes p < n with p = 1 mod 4)

For any n, f(n) > 1 if and only if g(n) > 0. However, as n goes to infinity the function g(n) does not converge. It oscillates between positive and negative infinitely often. But g(n) is positive for long stretches. This phenomena is known as Chebyshev bias.

Visualizing the lead function

We can calculate the lead function at primes with the following code.

from numpy import zeros
from sympy import primepi, primerange

N = 1_000_000
leads = zeros(primepi(N) + 1)
for index, prime in enumerate(primerange(2, N), start=1):
    leads[index] = leads[index - 1] + prime % 4 - 2

Here is a list of the primes at which the lead function is zero, i.e. when it changes sign.

[   0,     1,     3,     7,    13,    89,  2943,  2945,  2947,
 2949,  2951,  2953, 50371, 50375, 50377, 50379, 50381, 50393,
50413, 50423, 50425, 50427, 50429, 50431, 50433, 50435, 50437,
50439, 50445, 50449, 50451, 50503, 50507, 50515, 50517, 50821,
50843, 50853, 50855, 50857, 50859, 50861, 50865, 50893, 50899,
50901, 50903, 50905, 50907, 50909, 50911, 50913, 50915, 50917,
50919, 50921, 50927, 50929, 51119, 51121, 51123, 51127, 51151,
51155, 51157, 51159, 51161, 51163, 51177, 51185, 51187, 51189,
51195, 51227, 51261, 51263, 51285, 51287, 51289, 51291, 51293,
51297, 51299, 51319, 51321, 51389, 51391, 51395, 51397, 51505,
51535, 51537, 51543, 51547, 51551, 51553, 51557, 51559, 51567,
51573, 51575, 51577, 51595, 51599, 51607, 51609, 51611, 51615,
51617, 51619, 51621, 51623, 51627]

This is OEIS sequence A038691.

Because the lead function changes more often in some regions than others, it’s best to plot the function over multiple ranges.

The lead function is more often positive than negative. And yet it is zero infinitely often. So while the count of primes with remainder 3 mod 4 is usually ahead, the counts equal out infinitely often.

Wagon’s algorithm in Python

Posted on 14 February 2026 by John

The last three posts have been about Stan Wagon’s algorithm for finding x and y satisfying

x² + y² = p

where p is an odd prime.

The first post in the series gives Gauss’ formula for a solution, but shows why it is impractical for large p. The bottom of this post introduces Wagon’s algorithm and says that it requires two things: finding a quadratic non-residue mod p and a variation on the Euclidean algorithm.

The next post shows how to find a quadratic non-residue.

The reason Wagon requires a non-residue is because he needs to find a square root of −1 mod p. The previous post showed how that’s done.

In this post we will complete Wagon’s algorithm by writing the modified version of the euclidean algorithm.

Suppose p is an odd prime, and we’ve found x such that x² = −1 mod p as in the previous posts. The last step in Wagon’s algorithm is to apply the Euclidean algorithm to x and p and stop when the numbers are both less than √p.

When we’re working with large integers, how do we find square roots? Maybe p and even √p are too big to represent as a floating point number, so we can’t just apply the sqrt function. Maybe p is less than the largest floating point number (around 10³⁰⁸) but the sqrt function doesn’t have enough precision. Floats only have 53 bits of precision, so an integer larger than 2⁵³ cannot necessarily be represented entirely accurately.

The solution is to use the isqrt function, introduced in Python 3.8. It returns the largest integer less than the square root of its argument.

Now we have everything necessary to finish implementing Wagon’s algorithm.

from sympy import legendre_symbol, nextprime
from math import isqrt

def find_nonresidue(p):
    q = 2
    while legendre_symbol(q, p) == 1:
        q = nextprime(q)
    return q

def my_euclidean_algorithm(a, b, stop):
    while a > stop:
        a, b = b, a % b
    return (a, b)

def find_ab(p):
    assert(p % 4 == 1)
    k = p // 4
    c = find_nonresidue(p)
    x = pow(c, k, p)
    return my_euclidean_algorithm(p, x, isqrt(p))

Let’s use this to find a and b such that x² + y² = p where p = 2²⁵⁵ − 19.

>>> a, b = find_ab(p := 2**255 - 19)
>>> a
230614434303103947632580767254119327050
>>> b
68651491678749784955913861047835464643
>>> a**2 + b**2 - p
0

Finis.

Finding a square root of -1 mod p

Posted on 14 February 2026 by John

If p is an odd prime, there is a theorem that says

x² = −1 mod p

has a solution if and only if p = 1 mod 4. When a solution x exists, how do you find it?

The previous two posts have discussed Stan Wagon’s algorithm for expressing an odd prime p as a sum of two squares. This is possible if and only if p = 1 mod 4, the same condition on p for −1 to have a square root.

In the previous post we discussed how to find c such that c does not have a square root mod p. This is most of the work for finding a square root of −1. Once you have c, set

x = c^k

where p = 4k + 1.

For example, let’s find a square root of −1 mod p where p = 2²⁵⁵ − 19. We found in the previous post that c = 2 is a non-residue for this value of p.

>>> p = 2**255 - 19
>>> k = p // 4
>>> x = pow(2, k, p)

Let’s view x and verify that it is a solution.

>>> x
19681161376707505956807079304988542015446066515923890162744021073123829784752
>>> (x**2 + 1) % p
0

Sometimes you’ll see a square root of −1 mod p written as i. This makes sense in context, but it’s a little jarring at first since here i is an integer, not a complex number.

The next post completes this series, giving a full implementation of Wagon’s algorithm.

Finding a non-square mod p

Posted on 14 February 2026 by John

The previous post briefly mentioned Stan Wagon’s algorithm for expressing an odd prime p as a sum of two squares when it is possible (i.e. when p = 1 mod 4). Wagon’s algorithm requires first finding a non-square mod p, i.e. a number c such that c ≠ d² mod p for any d in 1, 2, 3, … p − 1.

Wagon recommends searching for c by testing consecutive primes q as candidates. You can test whether a number q is a square mod p by computing the Legendre symbol, which is implemented in SymPy as legendre_symbol(q, p). This returns 1 if q is a quadratic residue mod p and −1 if it is not.

Here’s Python code for doing the search.

from sympy import *

def find_nonresidue(p):
    q = 2
    while legendre_symbol(q, p) == 1:
        q = nextprime(q)
    return q

Let’s start with p = 2²⁵⁵ − 19. This prime comes up often in cryptography, such as Curve25519. Now p = 1 mod 4, so Wagon’s algorithm applies.

The code above returns 2, i.e. the first thing we tried worked. That was kinda disappointingly easy.

Here’s another large prime used in cryptography, in the NIST curve P-384.

p = 2³⁸⁴ − 2¹²⁸ − 2⁹⁶ + 2³² − 1

For this value of p, find_nonresidue(p) returns 19.

Why test prime as candidates for non-residues? You could pick candidates at random, and there’s a probability 1/2 that any candidate will work, since half the integers less than p are residues and half are non-residues. It’s very likely that you’d find a solution quickly, but it’s not guaranteed. In principle, you could try a thousand candidates before one works, though that’s vanishingly unlikely. If you test consecutive primes, there is a theoretical limit on how many guesses you need to make, if the Extended Riemann Hypothesis is true.

The next post explains why we wanted to find a non-residue.

Expressing a prime as the sum of two squares

Posted on 12 February 2026 by John

I saw where Elon Musk posted Grok’s answer to the prompt “What are the most beautiful theorems.” I looked at the list, and there were no surprises, as you’d expect from a program that works by predicting the most likely sequence of words based on analyzing web pages.

There’s only one theorem on the list that hasn’t appeared on this blog, as far as I can recall, and that’s Fermat’s theorem that an odd prime p can be written as the sum of two squares if and only if p = 1 mod 4. The “only if” direction is easy [1] but the “if” direction takes more effort to prove.

If p is a prime and p = 1 mod 4, Fermat’s theorem guarantees the existence of x and y such that

Gauss’ formula

Stan Wagon [2] gave an algorithm for finding a pair (x, y) to satisfy the equation above [2]. He also presents “a beautiful formula due to Gauss” which “does not seem to be of any value for computation.” Gauss’ formula says that if p = 4k + 1, then a solution is

$\begin{align*} x &= \frac{1}{2} \binom{2k}{k} \pmod p \\ y &= (2k)\!!\, x \pmod p \end{align*}$

For x and y we choose the residues mod p with |x| and |y| less than p/2.

Why would Wagon say Gauss’ formula is computationally useless? The number of multiplications required is apparently on the order of p and the size of the numbers involved grows like p!.

You can get around the problem of intermediate numbers getting too large by carrying out all calculations mod p, but I don’t see a way of implementing Gauss’ formula with less than O(p) modular multiplications [3].

Wagon’s algorithm

If we want to express a large prime p as a sum of two squares, an algorithm requiring O(p) multiplications is impractical. Wagon’s algorithm is much more efficient.

You can find the details of Wagon’s algorithm in [3], but the two key components are finding a quadratic non-residue mod p (a number c such that c ≠ x² mod p for any x) and the Euclidean algorithm. Since half the numbers between 1 and p − 1 are quadratic non-residues, you’re very likely to find a non-residue after a few attempts.

[1] The square of an integer is either equal to 0 or 1 mod 4, so the sum of two squares cannot equal 3 mod 4.

[2] Stan Wagon. The Euclidean Algorithm Strikes Again. The American Mathematical Monthly, Vol. 97, No. 2 (Feb., 1990), pp. 125-129.

[3] Wilson’s theorem gives a fast way to compute (n − 1)! mod n. Maybe there’s some analogous identity that could speed up the calculation of the necessary factorials mod p, but I don’t know what it would be.

Aligning one matrix with another

Posted on 11 February 2026 by John

Suppose you have two n × n matrices, A and B, and you would like to find a rotation matrix Ω that lines up B with A. That is, you’d like to find Ω such that

A = ΩB.

This is asking too much, except in the trivial case of A and B being 1 × 1 matrices. You could view the matrix equation above as a set of n² equations in real numbers, but the space of orthogonal matrices only has n(n − 1) degrees of freedom [1].

When an equation doesn’t have an exact solution, the next best thing is to get as close as possible to a solution, typically in a least squares sense. The orthogonal Procrustes problem is to find an orthogonal matrix Ω minimizing the distance between A and ΩB That is, we want to minimize

|| A − ΩB ||

subject to the constraint that Ω is orthogonal. The matrix norm used in this problem is the Frobenius norm, the sum of the squares of the matrix entries. The Frobenius norm is the 2-norm if we straighten the matrices into vectors of dimension n².

Peter Schönemann found a solution to the orthogonal Procrustes problem in 1964. His solution involves singular value decomposition (SVD). This shouldn’t be surprising since SVD solves the problem of finding the closest thing to an inverse of an non-invertible matrix. (More on that here.)

Schönemann’s solution is to set M = AB^T and find its singular value decomposition

M = UΣV^T.

Then

Ω = UV^T.

Python code

The following code illustrates solving the orthogonal Procrustes problem for random matrices.

import numpy as np

n = 3

# Generate random n x n matrices A and B
rng = np.random.default_rng(seed=20260211) 
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Compute M = A * B^T
M = A @ B.T

# SVD: M = U * Sigma * V^T
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# R = U * V^T
R = U @ Vt

# Verify that R * R^T is very nearly the identity matrix
print("||R^T R - I||_F =", np.linalg.norm(R.T @ R - np.eye(n), ord="fro"))

In this example the Frobenius norm between RR^T and I is 4 × 10⁻¹⁶, so essentially RR^T = I to machine precision.

[1] Every column of an orthogonal matrix Ω must have length 1, so that gives n constraints. Furthermore, each pair of columns must be orthogonal, which gives n choose 2 more constraints. We start with Ω having n² degrees of freedom, but then remove n and then n(n − 1)/2 degrees of freedom.

n² − n − n(n − 1)/2 = n(n − 1)/2

Computing large Fibonacci numbers

Posted on 8 February 2026 by John

The previous post discussed two ways to compute the nth Fibonacci number. The first is to compute all the Fibonacci numbers up to the nth iteratively using the defining property of Fibonacci numbers

F_{n + 2} = F_n + F_{n + 1}

with extended integer arithmetic.

The second approach is to use Binet’s formula

F_n = round( φⁿ / √ 5 )

where φ is the golden ratio.

It’s not clear which approach is more efficient. You might naively say that the iterative approach has run time O(n) while Binet’s formula is O(1). That doesn’t take into account how much work goes into each step, but it does suggest that eventually Binet wins.

The relative efficiency of each algorithm depends on how it is implemented. In this post I will compare using Python’s integer arithmetic and the mpmath library for floating point. Here’s my code for both methods.

from math import log10
import mpmath as mp

def fib_iterate(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def digits_needed(n):

    phi = (1 + 5**0.5) / 2
    return int(n*log10(phi) - 0.5*log10(5)) + 1

def fib_mpmath(n, guard_digits=30):

    digits = digits_needed(n)

    # Set decimal digits of precision
    mp.mp.dps = digits + guard_digits

    sqrt5 = mp.sqrt(5)
    phi = (1 + sqrt5) / 2
    x = (phi ** n) / sqrt5

    return int(mp.nint(x))

Next, here’s some code to compare the run times.

def compare(n):
    
    start = time.perf_counter()
    x = fib_iterate(n)
    elapsed = time.perf_counter() - start
    print(elapsed)

    start = time.perf_counter()
    y = fib_mpmath(n)
    elapsed = time.perf_counter() - start
    print(elapsed)

    if (x != y):
        print("Methods produced different results.")

This code shows that the iterate approach is faster for n = 1,000 but Binet’s method is faster for n = 10,000.

>>> compare(1_000)
0.0002502090001144097
0.0009207079999669077
>>> compare(10_000)
0.0036547919999065925
0.002145750000181579

For larger n, the efficiency advantage of Binet’s formula becomes more apparent.

>>> compare(1_000_000)
11.169050417000108
2.0719056249999994

Guard digits and correctness

There is one unsettling problem with the function fib_mpmath above: how many guard digits do you need? To compute a number correctly to 100 significant figures, for example, requires more than 100 digits of working precision. How many more? It depends on the calculation. What about our calculation?

If we compute the 10,000th Fibonacci number using fib_mpmath(10_000, 2), i.e. with 2 guard digits, we get a result that is incorrect in the last digit. To compute the 1,000,000th Fibonacci number correctly, we need 5 guard digits.

We don’t need many guard digits, but we’re guessing at how many we need. How might we test whether we’ve guessed correctly? One way would be to compute the result using fib_iterate and compare results, but that defeats the purpose of using the more efficient fib_mpmath.

If floating point calculations produce an incorrect result, the error is likely to be in the least significant digits. If we knew that the last digit was correct, that would give us more confidence that all the digits are correct. More generally, we could test the result mod m. I discussed Fibonacci numbers mod m in this post.

When m = 10, the last digits of the Fibonacci numbers have a cycle of 60. So the Fibonacci numbers with index n and with index n mod 60 should be the same.

The post I just mentioned links to a paper by Niederreiter that says the Fibonacci numbers ore evenly distributed mod m if and only if m is a power of 5, in which case the cycle length is 4m.

The following code could be used as a sanity check on the result of fig_mpmath.

def mod_check(n, Fn):
    k = 3
    base = 5**k
    period = 4*base
    return Fn % base == fib_iterate(n % period) % base

With k = 3, we are checking the result of our calculation mod 125. It is unlikely that an incorrect result would be correct mod 125. It’s hard to say just how unlikely. Naively we could say there’s 1 chance in 125, but that ignores the fact that the errors are most likely in the least significant bits. The chances of an incorrect result being correct mod 125 would be much less than 1 in 125. For more assurance you could use a larger power of 5.

Fibonacci numbers and time-space tradeoffs

Posted on 8 February 2026 by John

A few days ago I wrote about Fibonacci numbers and certificates. As I pointed out in the article, there’s no need to certify Fibonacci numbers, but the point of the post was to illustrate the idea of a solution certificate in a simple context. Practical uses of certificates are more complicated.

This time I want to use Fibonacci numbers to illustrate space tradeoffs. The post on Fibonacci certificates imagined providing someone a pair (F, r) where F is a large Fibonacci number, and r is a certificate proving that F is indeed a Fibonacci number. The goal was to minimize the computational effort to verify F. All the recipient needed to do was compute

| 5F² − r² |.

The number F is a Fibonacci number if and only if this number equals 4. The problem with this scenario is that F and r are both large numbers. They require transmitting and storing a lot of bits.

A much more space-efficient approach would be to transmit the index of the Fibonacci number and have the user compute the number. The example in the certificate post was (12586269025, 28143753123). Since 12586269025 is the 50th Fibonacci number, I could communicate it to someone by simply transmitting the number 50. That saves space, but it puts more computational burden on the recipient.

Fibonacci numbers grow exponentially with index size, and so the size of the nth Fibonacci number in bits is proportional to n. But the number of bits in n is proportional to log n. When n is large, the difference is dramatic.

How many bits are in the 1,000,000th Fibonacci number? The nth Fibonacci number is φⁿ/√5 rounded to the nearest integer, so the number of bits in the millionth Fibonacci number would be

log₂ (φ^1000000/√5) = 1000000 log₂ φ − 0.5 log₂ 5

which is roughly 700,000. By contrast, one million is a 20 bit number. So transmitting “1000000” is far more efficient than transmitting the millionth Fibonacci number.

What does it take to compute the nth Fibonacci number? For small n, it’s fast and easy to compute the Fibonacci numbers up to n sequentially using the definition of the sequence. For large enough n, it would be faster to compute φⁿ/√5. However, the former requires (extended) integer arithmetic, and the latter requires (extended) floating point arithmetic. It’s not clear where the crossover point would be where floating point would be more efficient. That’s the topic of the next post.

Minimum of cosine sum

Posted on 7 February 2026 by John

Suppose f(x) is the sum of terms of the form cos(kx) where k is an integer from a set A with n elements.

$f_A(x) = \sum_{k \in A} \cos(kx)$

Then the maximum value of f is f(0) = n. But what is the minimum value of f?

The Chowla cosine conjecture says that the minimum should be less than −√n for large n. For now the best proven results are much smaller in absolute value [1].

I was playing around with this problem, and the first thing I thought of was to let the set A be the first n primes. This turned out to not be the most interesting example. Since all the primes except for the first are odd, and cos(kπ) = −1 for odd k, the minimum is 2 −n and occurs at π.

Here’s a plot where A is the set of primes less than 100.

For the cosine conjecture to be interesting, the set A should contain a mix of even and odd numbers.

Here’s a plot with A equal to a random selection of 25 points between 1 and 100. (I chose 25 because there are 25 primes less than 100.)

Here’s the Python code I used to generate the two sets A and the function to plot.

import numpy as np
from sympy import prime

def f(x, A):
    return sum([np.cos(k*x) for k in A])

n = 25
A_prime = [prime(i) for i in range(1, n+1)]
np.random.seed(20260207)
A_random = np.random.choice(range(1, 101), size=n, replace=False)

If you wanted to explore the Chowla conjecture numerically, direct use of minimization software is impractical. As you can tell from the plots above, there are a lot of local minima. If the values in A are not too large, you can look at a plot to see approximately where the minimum occurs, then use a numerical method to find the minimum in this region, but that doesn’t scale.

Here’s an approach that would scale better. You could find all the zeros of the derivative of f_A and evaluate the function at each. One of these is the minimum. The derivative is a sum of sines with integer frequencies, and so it could be written as a polynomial in z = exp(ix) [2]. You could find all the zeros of this polynomial using the QR algorithm as discussed in the previous post.

[1] Benjamin Bedert. Polynomial bounds for the Chowla cosine problem. arXiv

[2] You get a polynomial of degree n in z and 1/z. Then multiply by z²ⁿ to get a polynomial in z only of degree 2n.

Eigenvalue homework problems are backward

Posted on 6 February 2026 by John

Classroom

When you take a linear algebra course and get to the chapter on eigenvalues, your homework problems will include a small matrix A and you will be asked to find the eigenvalues. You do this by computing the determinant

det(A − λI) = P(λ)

and getting P(λ), a polynomial in λ. The roots of P are the eigenvalues of A.

Either A will be a 2 × 2 matrix, in which case you can find the roots using the quadratic formula, or the matrix will have been carefully selected so that P(λ) will be easy to factor. Otherwise, finding the roots of a polynomial is hard.

Real world

Numerical algorithms to find eigenvalues have gotten really good. In practice, you don’t compute determinants or find roots of polynomials. Instead you do something like the QR algorithm.

Finding all the roots of a polynomial is a challenging problem, and so what you might do in practice is find the roots by constructing a matrix, called the companion matrix, whose eigenvalues correspond to the roots you’re after.

Summary

As a classroom exercise, you calculate roots of polynomials to find eigenvalues.

In the real world, you might use an eigenvalue solver to find the roots of polynomials.

I wrote a similar post a few years ago. It explains that textbooks definite hyperbolic functions using e^x, but you might want to compute e^x using hyperbolic functions.

Solving hard problems

World records

Chebyshev bias

Visualizing the lead function

Gauss’ formula

Wagon’s algorithm

Python code

Related posts

Guard digits and correctness

Classroom

Real world

Summary