The web is all abuzz about how SHA1 is “broken,” “a failure,” “obsolete,” etc.

It is supposed to be computationally impractical to create two documents that have the same secure hash code, and yet Google has demonstrated that they have done just that for the SHA1 algorithm.

I’d like to make two simple observations to put this in perspective.

This is not a surprise. Cryptography experts have suspected since 2005 that SHA1 was vulnerable and recommended using other algorithms. The security community has been gleeful about Google’s announcement. They feel vindicated for telling people for years not to use SHA1.

This took a lot of work, both in terms of research and computing. Crypto researchers have been trying to break SHA1 for 22 years. And according to their announcement, these are the resources Google had to use to break SHA1:

Nine quintillion (9,223,372,036,854,775,808) SHA1 computations in total

6,500 years of CPU computation to complete the attack first phase

110 years of GPU computation to complete the second phase
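To put the first figure in perspective, here is a quick arithmetic check. It assumes SHA1's 160-bit output size, which is not stated above, and compares the reported work to a generic birthday-style collision search, which needs on the order of the square root of the number of possible hash values.

```python
# Figure quoted above, plus one assumption: SHA1's 160-bit output size.
attack_work = 9_223_372_036_854_775_808   # SHA1 computations reported by Google
sha1_output_bits = 160                     # assumption: not stated in the text above

# The quoted total is exactly a power of two.
assert attack_work == 2**63

# A generic collision search on a 160-bit hash needs on the order of 2^80 hashes.
generic_work = 2 ** (sha1_output_bits // 2)

speedup = generic_work // attack_work
print(f"attack work   = 2^{attack_work.bit_length() - 1}")
print(f"generic bound = 2^{sha1_output_bits // 2}")
print(f"speedup       = 2^{speedup.bit_length() - 1} = {speedup:,}")
```

So the attack is a real shortcut over brute force, but the remaining 2^63 hash computations are still an enormous amount of work.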

While SHA1 is no longer recommended, it’s hardly a failure. I don’t imagine it would take 22 years of research and millions of CPU hours to break into my car.

I’m not saying people should use SHA1. Just a few weeks ago I advised a client not to use SHA1. Why not use a better algorithm (or at least what is currently widely believed to be a better algorithm) like SHA3? But I am saying it’s easy to exaggerate what it means to say SHA1 has failed.

A logarithm is the inverse of an exponential, and so we can generalize the idea of a logarithm wherever we can generalize the idea of an exponential. In particular, we can raise elements to exponents in a discrete group, and that leads to the definition of a discrete logarithm.

Diffie-Hellman public key cryptography is based on the assumption that discrete logarithms are hard to compute. There are algorithms to compute discrete logarithms that are much faster than brute force. For example, baby-step giant-step is a fairly simple algorithm. There are more efficient algorithms as well, but the best known algorithms are still much slower than raising numbers to powers. Whenever you find something that is much harder to undo than to do, it might be useful in cryptography, and that is the case with discrete logs.
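Here is a minimal sketch of baby-step giant-step for integers mod a prime p. The function name `discrete_log` and the example numbers are my own choices for illustration. The idea is to write the unknown exponent as x = i*m + j with m about √p, tabulate the "baby steps" g^j, then take "giant steps" of size m until one lands in the table.

```python
from math import isqrt

def discrete_log(g, h, p):
    """Return some x with pow(g, x, p) == h, or None if no such x exists.

    Baby-step giant-step: about sqrt(p) time and memory rather than about p
    for brute force. Assumes p is prime and g is not divisible by p.
    """
    m = isqrt(p - 1) + 1            # m*m >= p - 1, so any exponent is i*m + j with i, j < m

    # Baby steps: table mapping g^j -> j for 0 <= j < m.
    baby = {}
    value = 1
    for j in range(m):
        baby.setdefault(value, j)   # keep the smallest j for each value
        value = value * g % p

    # Giant steps: multiply h by g^(-m) repeatedly and look for a table hit.
    factor = pow(g, -m, p)          # negative exponents with a modulus need Python 3.8+
    gamma = h % p
    for i in range(m):
        if gamma in baby:
            return i * m + baby[gamma]
        gamma = gamma * factor % p
    return None

# Usage: recover an exponent modulo the Mersenne prime 2^31 - 1.
p = 2**31 - 1
h = pow(7, 1234567, p)
x = discrete_log(7, h, p)
print(x, pow(7, x, p) == h)
```

This takes tens of thousands of steps instead of billions, which is why real systems use primes with hundreds or thousands of digits rather than ten.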

Diffie-Hellman encryption requires users to compute exponentials and presumably requires attackers to compute discrete logs. I say “presumably” because it’s a fatal error in cryptography to assume an attacker has to solve the problem you think he’d have to solve. For example, you can create a simple encryption scheme by permuting the alphabet and encrypting each letter to its counterpart in the permutation. Someone might naively think “No one can break this because they’d have to try 26! permutations of the alphabet, over 400 million million million million possibilities!” Except that’s not how anyone approaches a substitution cipher. If it were, you wouldn’t see cryptograms in puzzle books.
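To make "users compute exponentials" concrete, here is a toy sketch of the Diffie-Hellman exchange. The prime 2039, the base 4, and the names Alice and Bob are assumptions chosen purely for illustration; a modulus this small offers no security at all.

```python
import secrets

# Toy public parameters (assumptions for illustration, far too small for real use).
p = 2039    # a small prime; 2039 = 2*1019 + 1
g = 4       # public base

# Each party picks a private exponent and publishes g^exponent mod p.
a = secrets.randbelow(p - 3) + 2     # Alice's secret, in [2, p - 2]
b = secrets.randbelow(p - 3) + 2     # Bob's secret, in [2, p - 2]
A = pow(g, a, p)                     # sent from Alice to Bob
B = pow(g, b, p)                     # sent from Bob to Alice

# Each side raises the other's public value to its own secret exponent.
alice_key = pow(B, a, p)
bob_key = pow(A, b, p)
assert alice_key == bob_key          # both equal g^(a*b) mod p
print("shared key:", alice_key)
```

An eavesdropper sees p, g, A, and B. Recovering a from A is a discrete logarithm problem, though as noted above there is no guarantee that is the only line of attack.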

As far as we know, discrete logarithms are hard to compute when working over integers mod p where p is a large prime, except for primes that have certain properties. We’ll look at what those properties are below and how to avoid them.

For a prime p, the integers mod p form a finite field. They are a group under addition, and the non-zero elements form a group under multiplication. It’s the multiplicative group we care about here. This group has order p-1, i.e. it has p-1 elements.

A group of prime order has no proper nontrivial subgroups. But a group of composite order does. And our multiplicative group has order p-1, which is composite. (Except for p = 3, and cryptography depends on primes far, far bigger than 3.)

Sylow’s theorems tell us something about what kinds of subgroups a group must have. If s is prime and s^{k} is a factor of the order of our group, then the group has a subgroup of order s^{k}. We don’t want our multiplicative group to have any small-order subgroups because these would make it easier to compute discrete logarithms.

A safe prime p has the form 2q + 1 where q is also prime. Diffie-Hellman chooses safe primes for moduli because this means the multiplicative group of order p-1 = 2q has no small subgroups. (It has two small subgroups, {1} and {1, -1}, but these can easily be avoided. The algorithm requires picking a generator g, and as long as you don’t pick g to be 1 or -1 mod p, then g generates a group of order q or 2q, and if p is gigantic, so is q.) Because q is prime, the subgroup of order q does not have any further nontrivial subgroups.
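A brute-force tally makes the subgroup structure visible for tiny primes. This sketch uses helper names of my own (`element_order`, `order_counts`) and compares the safe prime 23 = 2*11 + 1 with 31, which is not safe because 31 - 1 = 2 * 3 * 5.

```python
def element_order(g, p):
    """Multiplicative order of g modulo the prime p (brute force, fine for tiny p)."""
    order, value = 1, g % p
    while value != 1:
        value = value * g % p
        order += 1
    return order

def order_counts(p):
    """Map each order that occurs to the number of nonzero residues mod p having it."""
    counts = {}
    for g in range(1, p):
        k = element_order(g, p)
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))

print(order_counts(23))   # safe prime: only orders 1, 2, 11, 22 occur
print(order_counts(31))   # not safe: elements of order 3, 5, 6, ... appear
```

For p = 23, every element other than 1 and -1 has order 11 or 22, so it generates a subgroup with at least q = 11 elements. For p = 31 there are elements whose powers cycle after only 3 or 5 steps.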

A hash function maps arbitrarily long input strings to fixed-length outputs. For example, SHA-256 maps its input to a string of 256 bits. A cryptographically secure hash function h is a one-way function, i.e. given a message m it’s easy to compute h(m) but it’s not practical to go the other way, to learn anything about m from h(m). Secure hash functions are useful for message authentication codes because it is practically impossible to modify m without changing h(m).

Ideally, a secure hash is “indistinguishable from a random mapping.” [1] So if a hash function has a range of size N, how many items can we send through the hash function before we can expect two items to have the same hash value? By the pigeonhole principle, we know that if we hash N+1 items, two of them are certain to have the same hash value. But it’s likely that a much smaller number of inputs will lead to a collision, two items with the same hash value.

The famous birthday problem illustrates this. You could think of birthdays as a random mapping of people into 366 possible values [2]. In a room of 366 or fewer people, it’s possible that everyone has a different birthday. But in a group of 23 people, there are even odds that two people have the same birthday.
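The birthday probabilities are easy to compute directly. This sketch assumes all days are equally likely, which real birthdays are not, and the function name `collision_probability` is my own.

```python
def collision_probability(people, days=366):
    """Probability that at least two of `people` share a birthday, assuming uniform days."""
    p_unique = 1.0
    for k in range(people):
        p_unique *= (days - k) / days   # person k+1 must avoid the k birthdays already taken
    return 1.0 - p_unique

for n in (10, 22, 23, 30):
    print(n, round(collision_probability(n), 3))
```

The probability crosses one half between 22 and 23 people whether you count 365 or 366 possible days.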

Variations on the birthday problem come up frequently, for example in seeding random number generators. And importantly for this post, the birthday problem comes up in attacking hash functions.

When N is large, it is likely that hashing √N values will lead to a collision. We prove this below.

Proof

The proof below is a little informal. It could be made more formal by replacing the approximate equalities with equalities and adding the necessary little-o terms.

Suppose we’re hashing n items to a range of size N = n^{2}. The exact probability that all n items have unique hash values is given here. Taking the log of both sides gives us the first line of the proof below.

The first approximation is based on the first three terms in the asymptotic expansion for log Γ given here, applied to both log gamma expressions. (The third terms from the two asymptotic series are the same so they cancel out.) The second line isn’t exactly what you’d get by applying the asymptotic expansion. It’s been simplified a little. The neglected terms are not a mistake but terms that can be left out because they go to zero.

The second approximation comes from using the first two terms in the power series for log(1 + x). One term isn’t enough since that would reduce to zero. The final approximation is simply taking the limit as n goes to infinity. Concretely, we’re willing to say that a billion and one divided by a billion is essentially 1.
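The chain of approximations described above can be written out as follows. This is a reconstruction from the prose, so the intermediate lines may be arranged differently from the original derivation. It assumes P denotes the probability that all n hash values are distinct, P = N! / ((N - n)! N^n) with N = n^2, and it uses the Stirling-type expansion of log Γ with the terms that vanish as n grows already dropped.

```latex
\begin{align*}
\log P
 &= \log\Gamma(n^2 + 1) - \log\Gamma(n^2 - n + 1) - n \log n^2 \\
 &\approx \left(n^2 - n + \tfrac{1}{2}\right) \log\frac{n^2}{n^2 - n} - n \\
 &= \left(n^2 - n + \tfrac{1}{2}\right) \log\left(1 + \frac{1}{n-1}\right) - n \\
 &\approx \left(n^2 - n + \tfrac{1}{2}\right) \left(\frac{1}{n-1} - \frac{1}{2(n-1)^2}\right) - n \\
 &= -\frac{n}{2(n-1)} + \frac{1}{2(n-1)} - \frac{1}{4(n-1)^2} \\
 &\approx -\frac{1}{2}
\end{align*}
```

Keeping only the first term of the series for log(1 + x) would give (n^2 - n)/(n - 1) - n = 0, which is why the second term is needed.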

Conclusions

So the probability of no collisions is exp(-1/2) or about 60%, which means there’s a 40% chance of at least one collision. As a rule of thumb, a hash function with range of size N can hash on the order of √N values before running into collisions.
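As a sanity check on the limit, the probability can be computed for moderate n by summing logs of the individual factors 1 - k/N rather than using gamma functions. This is a sketch with a function name of my own, not the calculation used in the proof.

```python
from math import exp, log1p

def prob_all_unique(n, N):
    """P(n uniformly random values in a range of size N are all distinct)."""
    # Sum the logs of (1 - k/N); log1p keeps precision when k/N is tiny.
    log_p = sum(log1p(-k / N) for k in range(1, n))
    return exp(log_p)

for n in (10, 100, 1000, 10_000):
    print(n, prob_all_unique(n, n * n))
print("limit:", exp(-0.5))
```

The computed values settle toward exp(-1/2) as n grows, in line with the asymptotic argument.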

This means that with a 64-bit hash function, there’s about a 40% chance of collisions when hashing 2^{32} or about 4 billion items. If the output of the hash function is discernibly different from random, the probability of collisions may be higher. A 64-bit hash function cannot be secure since an attacker could easily hash 4 billion items. A 256-bit or 512-bit hash could in principle be secure since one could expect to hash far more items before collisions are likely. Whether a particular algorithm like SHA3-512 is actually secure is a matter for cryptologists, but it is at least feasible that a hash with a 512-bit range could be secure, based on the size of its range, while a 64-bit hash cannot be.
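To see how quickly a small range runs into collisions, here is a sketch that truncates SHA-256 to 32 bits and hashes the strings "0", "1", "2", ... until two of them collide. The truncation and the choice of inputs are assumptions made for the demonstration; with N = 2^32 we would expect a collision after something on the order of √N = 65,536 hashes.

```python
import hashlib

def truncated_hash(data: bytes, bits: int) -> int:
    """First `bits` bits of SHA-256(data) as an integer (bits must be a multiple of 8)."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[: bits // 8], "big")

def find_collision(bits: int):
    """Hash "0", "1", "2", ... until two inputs collide; return (i, j, hashes computed)."""
    seen = {}                        # truncated hash -> first input that produced it
    i = 0
    while True:
        h = truncated_hash(str(i).encode(), bits)
        if h in seen:
            return seen[h], i, i + 1
        seen[h] = i
        i += 1

first, second, attempts = find_collision(32)
print(f"inputs {first} and {second} collide after {attempts} hashes; sqrt(N) = {2**16}")
```

This runs in well under a second on a laptop. Doubling the number of bits squares the expected work, which is the sense in which a 64-bit range is within reach and a 256-bit range is not.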

Numerical calculation

We used an asymptotic argument above rather than numerically evaluating the probabilities because this way we get a more general result. But even if we were only interested in a fixed but large n, we’d run into numerical problems. This is one of those not uncommon cases where a pencil-and-paper approximation is actually more accurate than direct calculation with no (explicit) approximations.

There are numerous numerical problems with direct calculation of the collision probability. First, without taking logarithms we’d run into overflow and underflow. Second, for large enough n, n^{2} – n = n^{2} in floating point representation. IEEE 754 doubles have 53 bits of precision. This isn’t enough to distinguish values that differ, say, in the 128th bit. Finally, the two log gamma terms are large, nearly equal numbers. The cardinal rule of numerical analysis is to avoid subtracting nearly equal numbers. If two numbers agree to k bits, you could lose k bits of precision in carrying out their difference. See Avoiding overflow, underflow, and loss of precision for more along these lines.
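Two of these problems are easy to reproduce. The sketch below first shows n^2 - n and n^2 rounding to the same double, then evaluates the log probability directly with `math.lgamma`. The function name `naive_log_prob` is my own, and the exact garbage it prints for large n may vary by platform, so no particular output is claimed.

```python
from math import lgamma, log

n = 2**60
# Exact integer arithmetic sees the difference...
assert n * n - n != n * n
# ...but a double carries only 53 bits of precision, so both round to the same float.
assert float(n * n - n) == float(n * n)

def naive_log_prob(n):
    """Direct evaluation of log P(no collision) for n items in a range of size n^2.

    Subtracts two huge, nearly equal log-gamma values, so precision is lost for large n.
    """
    N = float(n * n)
    return lgamma(N + 1) - lgamma(N - n + 1) - n * log(N)

for n in (10**2, 10**4, 10**6, 10**8):
    print(n, naive_log_prob(n))    # the true value approaches -0.5 as n grows
```

For small n the result is close to -0.5. As n grows, the log-gamma values become so large that the spacing between adjacent doubles exceeds the quantity we are trying to compute.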

Notes

[1] Cryptography Engineering by Ferguson, Schneier, and Kohno

[2] Leap year of course complicates things since February 29 birthdays are less common than other birthdays. Another complication is that birthdays are not entirely evenly distributed for the other days of the year. But these complications don’t ruin the party trick: In a room of 30 people, two people usually share a birthday.

I spoke with Manuel Blum this afternoon about his password scheme described here. This post is a few footnotes based on that conversation.

When I mentioned that some people had reacted to the original post saying the scheme was too hard, Blum said that he has taught the scheme to a couple of children, 6 and 9 years old, who can use it.

He also said that many people have asked for his slide summarizing the method and asked if I could post it. You can save the image below to get the full-sized slide.

This slide and my blog post both use a 3-digit password for illustration, though obviously a 3-digit password would be easy to guess by brute force. I asked Blum how long a password using his scheme would need to be so that no one with a laptop would be able to break it. He said that 12 digits should be enough. Note that this assumes the attacker has access to many of your passwords created using the scheme, which would be highly unlikely.


The most recent guest on Jon Udell’s Interviews with Innovators podcast was Marco Barulli. Barulli discusses Clipperz, a zero-knowledge password management application. The software encrypts passwords (or arbitrary text) using client-side JavaScript. Your data is encrypted on your computer and the encrypted version is uploaded to the Clipperz server. When you want to retrieve your data, the encrypted data is downloaded to your computer and decrypted there. Your unencrypted data never leaves your computer.

For more information, see the article on Jon Udell’s blog or listen to the podcast. Even if you’re not interested in using the Clipperz product, you may find the discussion of JavaScript and cryptography techniques interesting.