Efficient modular arithmetic technique for Curve25519

Daniel Bernstein’s Curve25519 is the elliptic curve

y² = x³ + 486662x² + x

over the prime field with order p = 2²⁵⁵ – 19. The curve is a popular choice in elliptic curve cryptography because its design choices are transparently justified [1] and because cryptography over the curve can be implemented very efficiently. This post will concentrate on one of the tricks that makes ECC over Curve25519 so efficient.

Curve25519 was designed for fast and secure cryptography. One of the things that make it fast is the clever way Bernstein carries out arithmetic mod 2²⁵⁵ – 19, which he describes here.

Bernstein represents numbers mod 2²⁵⁵ – 19 by polynomials whose value at 1 gives the number. That alone is not remarkable, but his choice of representation seems odd until you learn why it was chosen. Each number is represented as a polynomial of the form

∑ uᵢ xⁱ

where each uᵢ is an integer multiple kᵢ of 2⌈25.5i⌉, and each kᵢ is an integer between –2²⁵ and 2²⁵ inclusive.

Why this limitation on the k’s? Pentium floating point optimization. In Bernstein’s words:

Why split 255-bit integers into ten 26-bit pieces, rather than nine 29-bit pieces or eight 32-bit pieces? Answer: The coefficients of a polynomial product do not fit into the Pentium M’s fp registers if pieces are too large. The cost of handling larger coefficients outweighs the savings of handling fewer coefficients.

And why unevenly spaced powers of 2: 1, 2²⁶, 2⁵¹, 2⁷⁷, …, 2²³⁰? Some consecutive exponents differ by 25 and some by 26. This looks sorta like a base 2²⁵ or base 2²⁶ representation, but is a mixture of both. Bernstein answers this question in his paper as well.

Given that there are 10 pieces, why use radix 2²⁵·⁵ rather than, e.g., radix 2²⁵ or radix 2²⁶? Answer: My ring R contains 2²⁵⁵x¹⁰ − 19, which represents 0 in Z/(2²⁵⁵ − 19). I will reduce polynomial products modulo 2²⁵⁵x¹⁰ – 19 to eliminate the coefficients of x¹⁰, x¹¹, etc. With radix 2²⁵, the coefficient of x¹⁰ could not be eliminated. With radix 2²⁶, coefficients would have to be multiplied by 2⁵ · 19 rather than just 19, and the results would not fit into an fp register.

There are a few things to unpack here.

Remember that we’re turning polynomials into numbers by evaluating them at 1. So when x = 1, 2²⁵⁵x¹⁰ – 19 = 2²⁵⁵ – 19 = p, which is zero in the integers mod 2²⁵⁵ – 19.

If we were using base (radix) 2²⁵, the largest number we could represent with a 9th degree polynomial with the restrictions above would be 2²⁵⁰, so we’d need a 10th degree polynomial; we couldn’t eliminate terms containing x¹⁰.
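
Here’s a short Python sketch (not Bernstein’s code) of the representation: it splits a number mod 2²⁵⁵ – 19 into ten limbs with weights 2⌈25.5i⌉ and reassembles it by evaluating at 1. For simplicity it uses unsigned limbs, whereas Bernstein uses signed coefficients kᵢ.

    from math import ceil

    p = 2**255 - 19
    weights = [ceil(25.5 * i) for i in range(10)]  # 0, 26, 51, 77, ..., 230

    def split(n):
        # limb i holds the bits of n from position weights[i] up to the next weight
        limbs = []
        for i in range(10):
            width = weights[i + 1] - weights[i] if i < 9 else 255 - weights[9]
            limbs.append((n >> weights[i]) & ((1 << width) - 1))
        return limbs

    def join(limbs):
        # evaluate the polynomial at x = 1: sum of k_i * 2**ceil(25.5 i)
        return sum(k << w for k, w in zip(limbs, weights))

    n = 2**254 + 12345
    assert join(split(n)) == n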

I don’t yet see why working with radix 226 would overflow an fp register. If you do see why, please leave an explanation in the comments.


[1] When a cryptographic method has an unjustified parameter, it invites suspicion that the parameter was chosen to create an undocumented back door. This is not the case with Curve25519. For example, why does it use p = 2²⁵⁵ – 19? It’s efficient to use a prime close to a large power of 2, and this p is the closest prime to 2²⁵⁵. The coefficient 486662 is not immediately obvious, but Bernstein explains in his paper how it was the smallest integer that met his design criteria.

An attack on RSA with exponent 3

As I noted in this post, RSA encryption is often carried out with reused exponents. Sometimes the exponent is 3, which is subject to an attack we’ll describe below [1]. (The most common exponent is 65537.)

Suppose the same message m is sent to three recipients and all three use exponent e = 3. Each recipient has a different modulus Ni, and each will receive a different encrypted message

ci = m³ mod Ni.

Someone with access to c1, c2, and c3 can recover the message m as follows. We can assume each modulus Ni is relatively prime to the others, otherwise we can recover the private keys using the method described here. Since the moduli are relatively prime, we can solve the three equations for m³ using the Chinese Remainder Theorem. There is a unique x < N1 N2 N3 such that

x = c1 mod N1
x = c2 mod N2
x = c3 mod N3

and m is simply the cube root of x. What makes this possible is knowing that m is a positive integer less than each of the Ns: then m³ < N1 N2 N3, and since x is the unique solution below N1 N2 N3, we have x = m³ exactly. It follows that we can take the cube root in the integers rather than in modular arithmetic.

This is an attack on “textbook” RSA because the weakness in this post could be avoided by real-world precautions such as adding random padding to each message so that no two recipients are sent the exact same message.

By the way, a similar trick works even if you only have access to one encrypted message. Suppose you’re using a 2048-bit modulus N and exchanging a 256-bit key. If your message m is simply the key without padding, then m³ is a 768-bit number, far less than N, and so you can simply take the cube root of the encrypted message in the integers.
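
Here’s a minimal sketch of that single-message attack, using sympy’s integer_nthroot for the integer cube root:

    from secrets import randbits
    from sympy import nextprime, integer_nthroot

    N = nextprime(randbits(1024)) * nextprime(randbits(1024))  # roughly 2048-bit modulus
    m = randbits(256)        # unpadded 256-bit key
    c = pow(m, 3, N)         # m**3 < 2**768 << N, so no modular reduction happens
    assert integer_nthroot(c, 3) == (m, True)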

Python example

Here we’ll work out a specific example using realistic RSA moduli.

    from secrets import randbits, randbelow
    from sympy import nextprime, integer_nthroot
    from sympy.ntheory.modular import crt
    
    def modulus():
        p = nextprime(randbits(2048))
        q = nextprime(randbits(2048))
        return p*q
    
    N = [modulus() for _ in range(3)]
    m = randbelow(min(N))
    c = [pow(m, 3, N[i]) for i in range(3)]
    x = crt(N, c)[0]
    
    assert integer_nthroot(x, 3) == (m, True) # integer cube root

Note that crt implements the Chinese Remainder Theorem. It returns a pair of numbers, the first being the solution we’re after, hence the [0] after the call. Similarly, integer_nthroot returns a pair: the integer root and a flag indicating whether the root is exact.

The script takes a few seconds to run. Nearly all the time goes to finding the 2048-bit (617-digit) primes that go into the moduli. Encrypting and decrypting m takes less than a second.


[1] I don’t know who first discovered this line of attack, but you can find it written up here. At least in the first edition; the link is to the 2nd edition which I don’t have.

Public key encryption based on squares and non-squares

The RSA encryption algorithm depends indirectly on the assumption that factoring the product of large primes is hard. The algorithm presented here, invented by Shafi Goldwasser and Silvio Micali, depends on the same assumption but in a different way. The Goldwasser-Micali (GM) algorithm is more direct than RSA, though it is also less efficient.

One thing that makes GM interesting is that it allows a form of computing on encrypted data that we’ll describe below.

GM in a nutshell

To create a public key, find two large primes p and q and publish N = pq. (There’s one more piece we’ll get to shortly.) You keep p and q private, but publish N, much like with RSA.

Someone can send you a message, one bit at a time, by sending you numbers that either do or do not have a square root mod N.

Sending a 0

If someone wants to send you a 0, they send you a number that has a square root mod N. This is easy to do: they select a number between 1 and N at random, square it mod N, and send you the result.

Determining whether a random number is a square mod N is easy if and only if you know how to factor N. [1]

When you receive the number, you can quickly tell that it is a square because you know how to factor N. The sender knows that it’s a square because he got it by squaring something. You can produce a square without knowing how to factor N, but it’s computationally infeasible to start with a given number and tell whether it’s a square mod N, unless you know the factorization of N.

Sending a 1

Sending a 1 bit is a little more involved. How can someone who cannot factor N produce a number that’s not a square? That’s actually not feasible without some extra information. The public key is not just N. It’s also a number z that is not a square mod N. So the full public key is two numbers, N and z.

To generate a non-square, you first generate a square then multiply it by z.

Example

Suppose you choose p = 314159 and q = 2718281. (Yes, p is a prime. See the post on pi primes. And q comes from the first few digits of e.) In practice you’d choose p and q to be very large, hundreds of digits, and you wouldn’t pick them to have a cute pattern like we did here. You publish N = pq = 853972440679 and imagine it’s too large for anyone to factor (which may be true for someone armed with only pencil and paper).

Next you need to find a number z that is not a square mod N. You can do that by trying numbers at random, testing them with Legendre symbols, until you find one that is not a square mod p and not a square mod q. It turns out z = 400005 will work.

So you tell the world your public key is (853972440679, 400005).

Someone wanting to send you a 0 bit chooses a number between 1 and N = 853972440679, say 731976377724. Then they square it and take the remainder by N to get 592552305778, and so they send you 592552305778. You can tell, using Legendre symbols, that this is a square mod p and mod q, so it’s a square mod N.

If they had wanted to send you a 1, they could have sent you 592552305778 * 400005 mod N = 41827250972, which you could tell isn’t a square mod N.
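
To see the whole scheme in code, here’s a minimal Python sketch built on the toy primes above. Rather than assuming z = 400005 it searches for its own non-square, and it uses sympy’s legendre_symbol for the quadratic residue tests.

    from secrets import randbelow
    from sympy.ntheory import legendre_symbol

    p, q = 314159, 2718281
    N = p * q

    # find a z that is a non-square mod p and mod q
    z = 2
    while legendre_symbol(z, p) != -1 or legendre_symbol(z, q) != -1:
        z += 1

    def encrypt(bit):
        r = randbelow(N - 1) + 1   # random r, overwhelmingly likely coprime to N
        c = r * r % N              # a square: encrypts a 0
        return c * z % N if bit else c

    def decrypt(c):
        # c is a square mod N iff it is a square mod p and mod q
        return 0 if legendre_symbol(c, p) == 1 and legendre_symbol(c, q) == 1 else 1

    for b in (0, 1, 1, 0):
        assert decrypt(encrypt(b)) == b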

Homomorphic encryption

Homomorphic encryption lets you compute things on encrypted data without having to first decrypt it. The GM encryption algorithm is homomorphic in the sense that you can compute an encrypted form of the XOR of two bits from an encrypted form of each bit. Specifically, if c1 and c2 are encrypted forms of bits b1 and b2, then c1c2 is an encrypted form of b1 ⊕ b2. Let’s see why this is, and where there’s a small wrinkle.

Suppose our two bits are both 0s. Then c1 and c2 are squares mod N, and c1 c2 is a square mod N.

Now suppose one bit is a 0 and the other is a 1. Then either c1 is a square mod N and c2 isn’t or vice versa, but in either case their product is not a square mod N.

Finally suppose both our bits are 1s. Since 1⊕1 = 0, we’d like to say that c1 c2 is a square mod N. Is it?

The product of two non-squares is not necessarily a non-square. For example, 2 and 3 are not squares mod 35, and neither is their product 6 [2]. But if we followed the recipe above, and calculated c1 and c2 both by multiplying a square by the z in the public key, then we’re OK. That is, if c1 = x²z and c2 = y²z, then c1c2 = x²y²z², which is a square. So if you produce non-squares by following the recipe, you get the homomorphic property. If you somehow produce non-squares some other way, they might not work.
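
Continuing the sketch above, you can check the homomorphic property directly:

    c1, c2 = encrypt(1), encrypt(1)
    assert decrypt(c1 * c2 % N) == 0   # 1 XOR 1 = 0, since both ciphertexts used z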


[1] As far as we know. There may be an efficient way to tell whether x is a square mod N without factoring N, but no such method has been published. The problem of actually finding modular square roots is equivalent to factoring, but simply telling whether modular square roots exist, without having to produce the roots, may be easier.

If quantum computing becomes practical, then factoring will be efficient and so telling whether numbers are squares modulo a composite number will be efficient.

[2] You could find all the squares mod 35 by hand, or you could let Python do it for you:

>>> set([x*x % 35 for x in range(35)])
{0, 1, 4, 9, 11, 14, 15, 16, 21, 25, 29, 30}

Base 58 encoding and Bitcoin addresses

A few weeks ago I wrote about base32 and base64 encoding. I’ll review these quickly then discuss base58 and its use in Bitcoin.

Base32 and Base64

All three methods have the goal of compactly representing large numbers while maintaining readability. Douglas Crockford’s base32 encoding is the most conservative: it’s case-insensitive and it does not use the letters I, L, O, or U. The first three letters are omitted because of visual similarity to digits, and the last to avoid “accidental obscenities.”

Base 64 is not concerned with avoiding visual similarities, and uses the full upper and lower case alphabet, plus two more symbols, + and /.

Base58

Base58 is nearly as efficient as base64, but more concerned about confusing letters and numbers. The number 1, the lower case letter l, and the upper case letter I all look similar, so base58 retains the digit 1 and does not use the lower case letter l or the capital letter I.

The number 0 looks like the lower case letter o and the upper case letter O. Here base58 makes an unusual choice: it keeps the lower case letter o, but does not use the digit 0 or the capital letter O. This is odd because every other encoding that I can think of keeps the 10 digits and differs over what letters to use.

Bases like 32 and 64 have the advantage of being trivial to convert back and forth with binary. To convert a binary number to base 2ⁿ, you start at the least significant end and convert groups of n bits. Since 58 is not a power of 2, converting to base 58 is more involved, as the sketch below shows.
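
Here’s a sketch of base58 encoding by repeated division, using the Bitcoin alphabet (note the missing 0, O, I, and l):

    ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

    def base58_encode(n):
        if n == 0:
            return ALPHABET[0]
        digits = []
        while n > 0:
            n, r = divmod(n, 58)
            digits.append(ALPHABET[r])
        return "".join(reversed(digits))

    assert len(base58_encode(2**200 - 1)) == 35   # matches the estimate below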

Bitcoin addresses

Bitcoin addresses are written in base58, and in fact base58 was developed for Bitcoin.

A Bitcoin address is a 25 byte (200 bit) number. Now

log₅₈ 2²⁰⁰ ≈ 34.14

and so it may take up to 35 characters to represent a Bitcoin address in base58. Using base64 would have taken up to 34 characters, so base58 pays a very small price for preventing a class of errors relative to base64. Base32 would require 40 characters.

As noted above, converting between binary and base58 is more complicated than converting between binary and either base32 or base64. However, converting to base58 is trivial compared to everything else that goes into forming a Bitcoin address. The steps, documented here, involve taking an ECDSA public key, applying a secure hash function three times, and appending a checksum.


Google Adiantum and the ChaCha RNG

The ChaCha cryptographic random number generator is in the news thanks to Google’s Adiantum project. I’ll discuss what’s going on, but first a little background.

Adiantum, a maidenhair fern

The name of the project comes from a genus of fern. More on that below as well.

One-time pads

The one-time pad is a provably unbreakable way to encrypt things. You create a sheet of random bits and give your counterpart an exact copy. Then when it comes time for you to send an encrypted message, you convert your message to a stream of bits, XOR your message with the random bits you exchanged previously, and send the result. The recipient then takes the XOR of the received message with the pad of random bits, and recovers the original message.
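
The whole scheme fits in a few lines of Python. A minimal sketch:

    from secrets import token_bytes

    message = b"ATTACK AT DAWN"
    pad = token_bytes(len(message))                       # truly random shared pad
    cipher = bytes(m ^ k for m, k in zip(message, pad))   # sender XORs with the pad
    plain = bytes(c ^ k for c, k in zip(cipher, pad))     # receiver XORs again
    assert plain == message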

This is called a one-time pad because it’s a pad of bits that you can only use one time. If you reuse a pad, it’s no longer unbreakable.

One-time pads are impractical for a couple reasons. First, it’s hard to generate truly random bits, especially in bulk. Second, exchanging the pads is almost as difficult as exchanging messages.

Stream ciphers

So here’s a bright idea: we’ll get around both of the problems with one-time pads by using pseudorandom bits rather than truly random bits! Then both parties can generate the pad bits themselves from a shared key.

Many people have had this idea, and it’s not necessarily a bad one. It’s called a stream cipher. The problem is that most pseudorandom number generators are not up to the task. You need a cryptographically secure RNG, and most RNGs are far from secure. The ChaCha RNG, however, appears to be good enough to use in a stream cipher, given enough rounds of scrambling [1], and Google is using it for full disk encryption in Android devices.
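
As an illustration, here’s ChaCha20 used as a stream cipher via the third-party Python cryptography package. (This is just a demonstration of the primitive, not how Adiantum itself is implemented.)

    from secrets import token_bytes
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

    key = token_bytes(32)     # 256-bit key
    nonce = token_bytes(16)   # 128-bit nonce; never reuse a nonce with the same key

    chacha = algorithms.ChaCha20(key, nonce)
    ciphertext = Cipher(chacha, mode=None).encryptor().update(b"attack at dawn")
    plaintext = Cipher(chacha, mode=None).decryptor().update(ciphertext)
    assert plaintext == b"attack at dawn"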

Full disk encryption

If you forget your password to your computer, you may not be able to access your data, but a thief still could by removing the hard drive and accessing it from another computer. That is, unless the disk is encrypted.

Full disk encryption on a laptop, such as BitLocker on Windows or FileVault on OSX, is usually implemented via AES encryption with hardware acceleration. If you don’t have special hardware for encryption, AES can be too slow.

Adiantum: ChaCha encryption on Android

On low-end devices, ChaCha encryption can be around 5x faster than AES. So Google is using ChaCha for Android devices, in what it calls Adiantum.

You can read the technical details in [2], and you can read more about the ChaCha random number generator in [3].

So where does the name Adiantum come from? It’s the name of a genus of maidenhair ferns, which in the Victorian language of flowers symbolize sincerity and discretion.


[1] Adiantum uses ChaCha with 12 rounds. TLS 1.3 uses ChaCha with 20 rounds.

[2] Adiantum: length-preserving encryption for entry-level processors by Google employees Paul Crowley and Eric Biggers.

[3] IRTF RFC 8439: ChaCha20 and Poly1305 for IETF Protocols

Sharing secrets with polynomials

This post will present a couple ways to share secrets using polynomials. We have a group of n people who want to share a secret between them so that k of them will have to cooperate in order to unlock the secret. For example, maybe a committee of n = 5 wants to require the cooperation of at least k = 3 members.

Shamir’s method

Adi Shamir came up with the idea of using polynomials to share secrets as follows. First, encode the secret you want to share as an integer a0. Next, generate m = k-1 other random integers a1 through am and use these as coefficients of a polynomial f of degree m:

f(x) = a_0 + a_1x + a_2x^2 + \cdots + a_mx^m

A trusted party generates n random values of x and gives each person an x and the corresponding value of f(x). Since m+1 points completely determine an mth degree polynomial, if k = m+1 people share their data, they can recover f, and thus recover the secret number a0. This can be done efficiently, for example, by using Lagrange interpolation, as in the sketch below. But with fewer than k data points, the polynomial remains undetermined.

In practice we’d work over the integers modulo a large prime p. While fewer than k data points will not let someone completely determine the polynomial f, they would narrow down the possible coefficients if we were working over the integers. Working modulo a large prime instead reveals less information.
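
Here’s a sketch of both halves, distributing shares and recovering the secret by Lagrange interpolation, working modulo the Mersenne prime 2¹²⁷ – 1 (an arbitrary choice for the demo):

    from secrets import randbelow

    p = 2**127 - 1   # a large prime; the secret must be less than p

    def make_shares(secret, k, n):
        # random polynomial of degree k-1 with constant term equal to the secret
        coeffs = [secret] + [randbelow(p) for _ in range(k - 1)]
        f = lambda x: sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p
        return [(x, f(x)) for x in range(1, n + 1)]  # random x's in practice

    def recover(shares):
        # Lagrange interpolation at x = 0, all arithmetic mod p
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if j != i:
                    num = num * -xj % p
                    den = den * (xi - xj) % p
            secret = (secret + yi * num * pow(den, -1, p)) % p
        return secret

    shares = make_shares(secret=271828, k=3, n=5)
    assert recover(shares[:3]) == 271828   # any 3 of the 5 shares suffice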

Verifiable secret sharing

There’s a possible problem with Shamir’s method. Maybe the trusted party made a mistake. Or maybe the trusted party was dishonest and shouldn’t have been trusted. How can the parties verify that they have been given valid data without unlocking the secret? Seems we’re at a logical impasse since you’d have to recover the polynomial to know if your points are on the polynomial.

Paul Feldman came up with a way to assure the participants that the secret can be unlocked without giving them the information to unlock it. The trick is to give everyone data that in principle would let them determine the polynomial, but in practice would not.

We choose a large prime p such that p-1 has a large prime factor q [1]. Then the multiplicative group of non-zero integers mod p has a subgroup of order q. Let g be a generator of that subgroup. The idea is to let everyone verify that

y_i = f(x_i)

for their given (xi, yi) by letting them verify that

g^{y_i} = g^{f(x_i)}

where all calculations are carried out mod p. Our trusted party does this by computing

A_i \equiv g^{a_i}\pmod{p}

for each coefficient ai and letting everyone know g and each of the Ai’s.

In principle, anyone could solve for a0 if they know A0. But in practice, provided q is large enough, this would not be possible because doing so would require solving the discrete logarithm problem, which is computationally difficult. It’s possible to compute discrete logarithms for small q, but the difficulty goes up quickly as q gets larger.

How do the Ai’s let everyone verify that their (xi, yi) data is correct?

Each person can verify that

g^{y_i} = \prod_{j=0}^m A_j^{x_i^j} 

using the public data and their personal data, and so they can verify that

g^{y_i} = \prod_{j=0}^m A_j^{x_i^j} = \prod_{j=0}^m g^{a_j x_i^j} = g^{f(x_i)} 
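
Here’s a toy numerical check of that identity with q = 11, p = 23 = 2q + 1, and g = 4, which generates the subgroup of order 11 mod 23. The parameters are far too small to be secure; they’re chosen so the arithmetic is easy to follow.

    p, q, g = 23, 11, 4    # toy parameters: q divides p - 1 and g has order q mod p
    a = [7, 3, 5]          # polynomial coefficients mod q; a[0] is the secret
    A = [pow(g, ai, p) for ai in a]   # public commitments A_i = g**a_i mod p

    x = 2
    y = sum(ai * x**i for i, ai in enumerate(a)) % q   # one participant's share f(x)

    lhs = pow(g, y, p)
    rhs = 1
    for j, Aj in enumerate(A):
        rhs = rhs * pow(Aj, x**j, p) % p
    assert lhs == rhs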


[1] Conceptually you pick p‘s until you find one so that p-1 has a large prime factor q. In practice, you’d do it the other way around: search for large primes q until you find one such that, say, 2q + 1 is also prime.

Regression, modular arithmetic, and PQC

Linear regression

Suppose you have a linear regression with a couple predictors and no intercept term:

β1x1 + β2x2 = y + ε

where the x‘s are inputs, the β are fixed but unknown, y is the output, and ε is random error.

Given n observations (x1, x2, y + ε), linear regression estimates the parameters β1 and β2.

I haven’t said, but I implicitly assumed all the above numbers are real. Of course they’re real. It would be strange if they weren’t!

Learning with errors

Well, we’re about to do something strange. We’re going to pick a prime number p and do our calculations modulo p except for the addition of the error ε. Our inputs (x1, x2) are going to be pairs of integers. Someone is going to compute

r = β1x1 + β2x2 mod p

where β1 and β2 are secret integers. Then they’re going to tell us

r/p + ε

where ε is a random variable on the interval [0, 1].  We give them n pairs (x1, x2) and they give back n values of r/p with noise added. Our job is to infer the βs.

This problem is called learning with errors or LWE. It’s like linear regression, but much harder when the problem size is bigger. Instead of just two inputs, we could have m inputs with m secret coefficients where m is large. Depending on the number of variables m, the number of equations n, the modulus p, and the probability distribution on ε, the problem may be possible to solve but computationally very difficult.

Why is it so difficult? Working mod p is discontinuous. A little bit of error might completely change our estimate of the solution. If n is large enough, we could recover the coefficients anyway, using something like least squares. But how would we carry that out? If m and p are small we can just try all pᵐ possibilities, as in the sketch below, but that’s not going to be practical if m and p are large.
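
Here’s a toy sketch with m = 2 and a small p where brute force is still feasible. The noise distribution (uniform on a small interval) is an arbitrary choice for the demo.

    import random

    p, n = 13, 20
    beta = (4, 11)   # the secret coefficients
    X = [(random.randrange(p), random.randrange(p)) for _ in range(n)]
    obs = [((beta[0]*x1 + beta[1]*x2) % p) / p + random.uniform(0, 0.05)
           for x1, x2 in X]

    def loss(b1, b2):
        # sum of squared residuals against the noisy observations
        return sum((((b1*x1 + b2*x2) % p) / p - o)**2
                   for (x1, x2), o in zip(X, obs))

    # brute force over all p**m = 13**2 candidate coefficient pairs
    candidates = ((b1, b2) for b1 in range(p) for b2 in range(p))
    print(min(candidates, key=lambda b: loss(*b)))   # almost always (4, 11)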

In linear regression, we assume there is some (approximately) linear process out in the real world that we’re allowed to observe with limited accuracy. Nobody’s playing a game with us; that’s just how data come to us. But with LWE, we are playing a game that someone has designed to be hard. Why? For cryptography. In particular, quantum-resistant cryptography.

Post Quantum Cryptography

Variations on LWE are the basis for several proposed encryption algorithms that are believed to be secure even if an adversary has access to a quantum computer.

The public key encryption systems in common use today would all be breakable if quantum computing becomes practical. They depend on mathematical problems like factoring and discrete logarithms being computationally difficult, which they appear to be with traditional computing resources. But we know that these problems could be solved in polynomial time on a quantum computer with Shor’s algorithm. But LWE is a hard problem, even on a quantum computer. Or so we suspect.

The US government’s National Institute of Standards and Technology (NIST) is holding a competition to identify quantum-resistant encryption algorithms. Last month they announced 26 algorithms that made it to the second round. Many of these algorithms depend on LWE or variations.

One variation is LWR (learning with rounding) which uses rounding rather than adding random noise. There are also ring-based counterparts RLWE and RLWR which add random errors and use rounding respectively. And there are polynomial variations such as poly-LWE which uses a polynomial-based learning with errors problem. The general category for these methods is lattice methods.

Lattice methods

Of the public-key algorithms that made it to the second round of the NIST competition, 9 out of 17 use lattice-based cryptography:

  • CRYSTALS-KYBER
  • FrodoKEM
  • LAC
  • NewHope
  • NTRU
  • NTRU Prime
  • Round5
  • SABER
  • Three Bears

Also, two of the nine digital signature algorithms are based on lattice problems:

  • CRYSTALS-DILITHIUM
  • FALCON

Based purely on the names, and not on the merits of the algorithms, I hope the winner is one of the methods with a science fiction allusion in the name.


What is an elliptic curve?

Elliptic curves are pure and applied, concrete and abstract, simple and complex.

Elliptic curves have been studied for many years by pure mathematicians with no intention to apply the results to anything outside math itself. And yet elliptic curves have become a critical part of applied cryptography.

Elliptic curves are very concrete. There are some subtleties in the definition—more on that in a moment—but they’re essentially the set of points satisfying a simple equation. And yet a lot of extremely abstract mathematics has been developed out of necessity to study these simple objects. And while the objects are in some sense simple, the questions that people naturally ask about them are far from simple.

Preliminary definition

A preliminary definition of an elliptic curve is the set of points satisfying

y² = x³ + ax + b.

This is a theorem, not a definition, and it requires some qualifications. The values x, y, a, and b come from some field, and that field is an important part of the definition of an elliptic curve. If that field is the real numbers, then all elliptic curves do have the form above, known as the Weierstrass form. For fields of characteristic 2 or 3, the Weierstrass form isn’t general enough. Also, we require that

4a³ + 27b² ≠ 0.

The other day I wrote about Curve1174, a particular elliptic curve used in cryptography. The points on this curve satisfy

x² + y² = 1 – 1174 x² y²

This equation does not specify an elliptic curve if we’re working over real numbers. But Curve1174 is defined over the integers modulo p = 2²⁵¹ – 9. There it is an elliptic curve. It is equivalent to a curve in Weierstrass form, though that’s not true when working over the reals. So whether an equation defines an elliptic curve depends on the field the constituents come from.
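
To make this concrete, here’s a quick check that points satisfy the Curve1174 equation mod p. The point (0, 1) is the identity element of the curve’s group law, as discussed below.

    # check points on Curve1174: x² + y² = 1 - 1174 x² y² over the integers mod p
    p = 2**251 - 9

    def on_curve(x, y):
        return (x*x + y*y) % p == (1 - 1174 * x*x * y*y) % p

    assert on_curve(0, 1)        # the identity point of the group law
    assert on_curve(0, p - 1)    # -1 is congruent to p - 1 mod p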

Not an ellipse, not a curve

An elliptic curve is not an ellipse, and it may not be a curve in the usual sense.

There is a connection between elliptic curves and ellipses, but it’s indirect. Elliptic curves are related to the integrals you would write down to find the length of a portion of an ellipse.

Working over the real numbers, an elliptic curve is a curve in the geometric sense. Working over a finite field, an elliptic curve is a finite set of points, not a continuum. Working over the complex numbers, an elliptic curve is a two-dimensional surface. The name “curve” is extended by analogy to elliptic curves over general fields.

Final definition

In this section we’ll give the full definition of an elliptic curve, though we’ll be deliberately vague about some of the details.

The definition of an elliptic curve is not in terms of equations of a particular form. It says an elliptic curve is a

  • smooth,
  • projective,
  • algebraic curve,
  • of genus one,
  • having a specified point O.

Working over real numbers, smoothness can be specified in terms of derivatives. But what does smoothness mean when working over a finite field? You take the derivative equations from the real case and extend them by analogy to other fields. You can “differentiate” polynomials in settings where you can’t take limits by defining derivatives algebraically. (The condition 4a³ + 27b² ≠ 0 above is to guarantee smoothness.)

Informally, projective means we add “points at infinity” as necessary to make things more consistent. Formally, we’re not actually working with pairs of coordinates (x, y) but equivalence classes of triples of coordinates (x, y, z). You can usually think in terms of pairs of values, but the extra value is there when you need it to deal with points at infinity. More on that here.

An algebraic curve is the set of points satisfying a polynomial equation.

The genus of an algebraic curve is roughly the number of holes it has. Over the complex numbers, the genus of an algebraic curve really is the number of holes. As with so many ideas in algebra, a theorem from a familiar context is taken as a definition in a more general context.

The specified point O, often the point at infinity, is the location of the identity element for the group addition. In the post on Curve1174, we go into the addition in detail, and the zero point is (0, 1).

In elliptic curve cryptography, it’s necessary to specify another point, a base point, which is the generator for a subgroup. This post gives an example, specifying the base point on secp256k1, a curve used in the implementation of Bitcoin.

Microsoft replacing SHA-1

According to this article, Microsoft is patching Windows 7 and Windows Server 2008 to look for SHA-2 hashes of updates. These older versions of Windows have been using SHA-1, while newer versions are already using SHA-2.

This is a good move, but unnecessary. Here’s what I mean by that. The update was likely unnecessary for reasons I’ll explain below, but it was easy to do, and it increased consistency across Microsoft’s product line. It’s also good PR.

What are SHA-1 and SHA-2?

Let’s back up a bit. SHA-1 and SHA-2 are secure hash functions [1]. They take a file, in this case a Microsoft software update, and return a relatively small number, small relative to the original file size. In the case of SHA-1, the result is 160 bits (20 bytes).  They’re designed so that if a file is changed, the function value is nearly certain to change. That is, it’s extremely unlikely that a change to the file would not result in a change to the hash value.
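
You can see this behavior with Python’s hashlib: changing a single character completely changes the digest.

    import hashlib

    data = b"example software update"
    print(hashlib.sha1(data).hexdigest())     # 160-bit digest, 40 hex characters
    print(hashlib.sha256(data).hexdigest())   # SHA-2 family, 256-bit digest

    # a one-character change produces a completely different digest
    print(hashlib.sha256(b"Example software update").hexdigest())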

The concern isn’t accidental changes. The probability of accidentally producing two files with the same hash function value is tiny as I show here.

The concern is a clever attacker who could modify the software update in such a way that the hash function remains unchanged, bypassing the hash as a security measure. That would be harder to do with SHA-2 than with SHA-1, hence Microsoft’s decision years ago to move to SHA-2 for new versions of the operating system, and its recent decision to make the change retroactive.

How hard is it to produce collisions?

By a collision we mean two files that hash to the same value. It’s obvious from the pigeon hole principle [2] that collisions are possible, but how hard are they to produce deliberately?

Google demonstrated two years ago that it could produce two PDF files with the same SHA-1 hash value. But doing so required over 6,500 years of CPU time running in parallel [3]. Also, Google started with a file designed to make collisions possible. According to their announcement,

We started by creating a PDF prefix specifically crafted to allow us to generate two documents with arbitrary distinct visual contents, but that would hash to the same SHA-1 digest.

It would be harder to start with a specified input, such as a software update file and generate a collision. It would be harder still to generate a collision that had some desired behavior.

According to this page, it’s known how to tamper with two files simultaneously so that they will have the same SHA-1 hash values. This is what Google did, at the cost of thousands of CPU years. But so far, nobody has been able to start with a given file and create another file with the same SHA-1 value.

As I said at the beginning, it made sense for Microsoft to decide to move from SHA-1 to SHA-2 because the cost of doing so was small. But the use of SHA-1 hash codes is probably not the biggest security risk in Windows 7.


[1] SHA-1 is a hash function, but SHA-2 is actually a family of hash functions: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. All are believed to provide at least 112 bits of security, while SHA-1 provides less than 63.

The SHA-x functions output x bits. The SHA-x/y functions use x bits of internal state and output y bits. To be consistent with this naming convention, SHA-1 should be called SHA-160.

[2] The pigeon hole principle says that if you put more than n things into n boxes, one of the boxes has to have more than one thing. If you hash files of more than n bits to n-bit numbers, at least two files have to go to the same value.

[3] If you were to rent this much CPU time in the cloud at 5 cents per CPU hour, it would cost about $2,800,000. If the only obstacle were the cost of computing resources, someone might be willing to pay that to tamper with a Microsoft update.

Hash function menagerie

Here’s an oversimplified survey of cryptographic hash functions: Everyone used to use MD5, now they use some variation on SHA.

There’s some truth to that. MD5 was very popular, and remains popular years after it was proven insecure. And now variations on SHA like SHA-1 and SHA-256 are commonly used. But there are a lot more cryptographic hash functions in common use.