Daniel Bernstein’s Curve25519 is the elliptic curve
y² = x³ + 486662x² + x
over the prime field with order p = 2255 – 19. The curve is a popular choice in elliptic curve cryptography because its design choices are transparently justified  and because cryptography over the curve can be implemented very efficiently. This post will concentrate on one of the tricks that makes ECC over Curve25519 so efficient.
Curve25519 was designed for fast and secure cryptography. One of the things that make it fast is the clever way Bernstein carries out arithmetic mod 2255 – 19 which he describes here.
Bernstein represents numbers mod 2255 – 19 by polynomials whose value at 1 gives the number. That alone is not remarkable, but his choice of representation seems odd until you learn why it was chosen. Each number is represented as a polynomial of the form
∑ ui xi
where each ui is an integer multiple ki of 2⌈25.5i⌉, and each ki is an integer between -225 and 225 inclusive.
Why this limitation on the k‘s? Pentium cache optimization. In Bernstein’s words:
Why split 255-bit integers into ten 26-bit pieces, rather than nine 29-bit pieces or eight 32-bit pieces? Answer: The coefficients of a polynomial product do not fit into the Pentium M’s fp registers if pieces are too large. The cost of handling larger coefficients outweighs the savings of handling fewer coefficients.
And why unevenly spaced powers of 2: 1, 226, 251, 277, …, 2230? Some consecutive exponents differ by 25 and some by 26. This looks sorta like a base 225 or base 226 representation, but is a mixture of both. Bernstein answers this in his paper.
Bernstein answers this question as well.
Given that there are 10 pieces, why use radix 225.5 rather than, e.g., radix 225 or radix 226? Answer: My ring R contains 2255x10 − 19, which represents 0 in Z/(2255 − 19). I will reduce polynomial products modulo 2255x10 – 19 to eliminate the coefficients of x10, x11, etc. With radix 225 , the coefficient of x10 could not be eliminated. With radix 226, coefficients would have to be multiplied by 2519 rather than just 19, and the results would not fit into an fp register.
There are a few things to unpack here.
Remember that we’re turning polynomials in to numbers by evaluating them at 1. So when x = 1, 2255x10 – 19 = p = 2255 – 19, which is the zero in the integers mod 2255 – 19.
If we were using base (radix) 225 , the largest number we could represent with a 9th degree polynomial with the restrictions above would be 2250 , so we’d need a 10th degree polynomial; we couldn’t eliminate terms containing x10.
I don’t yet see why working with radix 226 would overflow an fp register. If you do see why, please leave an explanation in the comments.
 When a cryptographic method has an unjustified parameter, it invites suspicion that the parameter was chosen to create an undocumented back door. This is not the case with Curve25519. For example, why does it use p = 2255 – 19? It’s efficient to use a prime close to a large power of 2, and this p is the closes prime to 2255. The coefficient 486662 is not immediately obvious, but Bernstein explains in his paper how it was the smallest integer that met his design criteria.
2 thoughts on “Efficient modular arithmetic technique for Curve25519”
My understanding of why he doesn’t use radix 2^26:
Imagine you’re multiplying two polynomials. Using radix 2^26, you would have terms like
(j * 2^234 * x^9)(k * 2^26 * x^1) = jk * 2^260 * x^10
= 19 * jk * 2^5
(using the fact that 2^255 = 19 in this field). Hence, the computer may have to deal with numbers as large as 19 * 2^55. (Terms with higher powers of x are dealt with similarly, leaving behind some powers of x after you factor out the 2^255 * x^10.)
Using alternating radices 2^26 and 2^25, the worst terms (in the sense of requiring the largest numbers to store them) are similar to
(j * 2^230 * x^9)(k * 2^26 * x^1) = jk * 2^260 * x^10
= 19 * jk * 2,
and the largest number the computer has to deal with is 19 * 2^51.
The problem with base-2^26 is that it would require more frequent reduction of intermediate values. There are two kinds of reduction in the Curve25519 implementation: coefficient reduction and degree reduction. In Bernstein’s implementation, they are performed simultaneously after each multiplication, and not performed after additions.
Coefficient reduction is propagating carries between digits so that each digit is in the range ±2^25. Bernstein does not perform this after additions, allowing digits to range up to ±2^26. The big blowup happens during multiplication, when double-width partial products are formed. The partial product of two such addition results will be in the range ±2^52, which fits into the ±2^53 range of IEEE double representable integers.
But to compute the product of two 10-digit numbers, digit 10 is the sum of 10 partial products. This would exceed IEEE double range, but Bernetein uses the 80×87 temporary real format (“long double” in C) with a 64-bit mantissa, giving plenty of range.
The second kind of reduction is degree reduction. This reduces the 19-digit intermediate result back to 10 digits (which Bernstein describes as a polynomial of degree 9). This entails multiplying digit 10 by 19 and adding it to digit 0.
So the total is in the range ±191*2^52, which fits into 60 bits + sign.
If, however, we had used radix-2^26, then digit 10 would have had to be multiplied by 2^260 mod 2^255-19 = 2^5*19 = 608 before being added to digit 0. This would have produced a total range of ±6081*2^52, requiring 65 bits + sign.
Of course, it would fit if additions were followed by coefficient reduction, but that would slow and bloat the implementation.