Efficient arithmetic mod 2^255 - 19 for ECC Curve25519

Daniel Bernstein’s Curve25519 is the elliptic curve

y² = x³ + 486662x² + x

over the prime field with order p = 2²⁵⁵ – 19. The curve is a popular choice in elliptic curve cryptography because its design choices are transparently justified [1] and because cryptography over the curve can be implemented very efficiently. This post will concentrate on one of the tricks that makes ECC over Curve25519 so efficient.

Curve25519 was designed for fast and secure cryptography. One of the things that make it fast is the clever way Bernstein carries out arithmetic mod 2²⁵⁵ – 19 which he describes here.

Bernstein represents numbers mod 2²⁵⁵ – 19 by polynomials whose value at 1 gives the number. That alone is not remarkable, but his choice of representation seems odd until you learn why it was chosen. Each number is represented as a polynomial of the form

∑ u_i xⁱ

where each u_i is an integer multiple k_i of 2^⌈25.5i⌉, and each k_i is an integer between −2²⁵ and 2²⁵ inclusive.

Why this limitation on the k‘s? Pentium cache optimization. In Bernstein’s words:

Why split 255-bit integers into ten 26-bit pieces, rather than nine 29-bit pieces or eight 32-bit pieces? Answer: The coefficients of a polynomial product do not fit into the Pentium M’s fp registers if pieces are too large. The cost of handling larger coefficients outweighs the savings of handling fewer coefficients.

And why unevenly spaced powers of 2: 1, 2²⁶, 2⁵¹, 2⁷⁷, …, 2²³⁰? Some consecutive exponents differ by 25 and some by 26. This looks sorta like a base 2²⁵ or base 2²⁶ representation, but is a mixture of both. Bernstein answers this in his paper.

Bernstein answers this question as well.

Given that there are 10 pieces, why use radix 2^25.5 rather than, e.g., radix 2²⁵ or radix 2²⁶? Answer: My ring R contains 2²⁵⁵x¹⁰ − 19, which represents 0 in Z/(2²⁵⁵ − 19). I will reduce polynomial products modulo 2²⁵⁵x¹⁰ − 19 to eliminate the coefficients of x¹⁰, x¹¹, etc. With radix 2²⁵ , the coefficient of x¹⁰ could not be eliminated. With radix 2²⁶, coefficients would have to be multiplied by 2⁵19 rather than just 19, and the results would not fit into an fp register.

There are a few things to unpack here.

Remember that we’re turning polynomials in to numbers by evaluating them at 1. So when x = 1, 2²⁵⁵x¹⁰ − 19 = p = 2²⁵⁵ − 19, which is the zero in the integers mod 2²⁵⁵ − 19.

If we were using base (radix) 2²⁵ , the largest number we could represent with a 9th degree polynomial with the restrictions above would be 2²⁵⁰ , so we’d need a 10th degree polynomial; we couldn’t eliminate terms containing x¹⁰.

I don’t yet see why working with radix 2²⁶ would overflow an fp register. If you do see why, please leave an explanation in the comments.

[1] When a cryptographic method has an unjustified parameter, it invites suspicion that the parameter was chosen to create an undocumented back door. This is not the case with Curve25519. For example, why does it use p = 2²⁵⁵ − 19? It’s efficient to use a prime close to a large power of 2, and this p is the closes prime to 2²⁵⁵. The coefficient 486662 is not immediately obvious, but Bernstein explains in his paper how it was the smallest integer that met his design criteria.

2 thoughts on “Efficient arithmetic on Curve25519”

Nathan Hannon

9 March 2019 at 21:51

My understanding of why he doesn’t use radix 2^26:

Imagine you’re multiplying two polynomials. Using radix 2^26, you would have terms like
(j * 2^234 * x^9)(k * 2^26 * x^1) = jk * 2^260 * x^10
= 19 * jk * 2^5
(using the fact that 2^255 = 19 in this field). Hence, the computer may have to deal with numbers as large as 19 * 2^55. (Terms with higher powers of x are dealt with similarly, leaving behind some powers of x after you factor out the 2^255 * x^10.)

Using alternating radices 2^26 and 2^25, the worst terms (in the sense of requiring the largest numbers to store them) are similar to
(j * 2^230 * x^9)(k * 2^26 * x^1) = jk * 2^260 * x^10
= 19 * jk * 2,
and the largest number the computer has to deal with is 19 * 2^51.

George Spelvin

1 April 2023 at 13:47

The problem with base-2^26 is that it would require more frequent reduction of intermediate values. There are two kinds of reduction in the Curve25519 implementation: coefficient reduction and degree reduction. In Bernstein’s implementation, they are performed simultaneously after each multiplication, and not performed after additions.

Coefficient reduction is propagating carries between digits so that each digit is in the range ±2^25. Bernstein does not perform this after additions, allowing digits to range up to ±2^26. The big blowup happens during multiplication, when double-width partial products are formed. The partial product of two such addition results will be in the range ±2^52, which fits into the ±2^53 range of IEEE double representable integers.

But to compute the product of two 10-digit numbers, digit 10 is the sum of 10 partial products. This would exceed IEEE double range, but Bernetein uses the 80×87 temporary real format (“long double” in C) with a 64-bit mantissa, giving plenty of range.

The second kind of reduction is degree reduction. This reduces the 19-digit intermediate result back to 10 digits (which Bernstein describes as a polynomial of degree 9). This entails multiplying digit 10 by 19 and adding it to digit 0.

So the total is in the range ±191*2^52, which fits into 60 bits + sign.

If, however, we had used radix-2^26, then digit 10 would have had to be multiplied by 2^260 mod 2^255-19 = 2^5*19 = 608 before being added to digit 0. This would have produced a total range of ±6081*2^52, requiring 65 bits + sign.

Oops!

Of course, it would fit if additions were followed by coefficient reduction, but that would slow and bloat the implementation.

Comments are closed.

Related posts

2 thoughts on “Efficient arithmetic on Curve25519”