A hash function maps arbitrarily long input strings to fixed-length outputs. For example, SHA-256 maps its input to a string of 256 bits. A cryptographically secure hash function h is a one-way function, i.e. given a message m it’s easy to compute h(m) but it’s not practical to go the other way, to learn anything about m from h(m). Secure hash functions are useful for message authentication codes (MACs) because it is practically impossible to modify m without changing h(m).
Ideally, a secure hash is “indistinguishable from a random mapping.” [1] So if a hash function has a range of size N, how many items can we send through the hash function before we can expect two items to have same hash value? By the pigeon hole principle, we know that if we hash N + 1 items, two of them are certain to have the same hash value. But it’s likely that a much smaller number of inputs will lead to a collision, two items with the same hash value.
The famous birthday problem illustrates this. You could think of birthdays as a random mapping of people into 366 possible values [2]. In a room of less than 366 people, it’s possible that everyone has a different birthday. But in a group of 23 people, there are even odds that two people have the same birthday.
Variations on the birthday problem come up frequently. For example, in seeding random number generators. And importantly for this post, the birthday problem is the basis for birthday attacks against secure hash functions.
When N is large, it is likely that hashing √N values will lead to a collision. We prove this below.
Proof
The proof below is a little informal. It could be made more formal by replacing the approximate equalities with equalities and adding the necessary little-o terms.
Suppose we’re hashing n items to a range of size N = n2. The exact probability that all n items have unique hash values is given in here. Taking the log of both sides gives us the first line of the proof below.
The first approximation is based on the first three terms in the asymptotic expansion for log Γ given here, applied to both log gamma expressions. (The third terms from the two asymptotic series are the same so they cancel out.) The second line isn’t exactly what you’d get by applying the asymptotic expansion. It’s been simplified a little. The neglected terms are not a mistake but terms that can be left out because they go to zero.
The second approximation comes from using the first two terms in the power series for log(1 + x). One term isn’t enough since that would reduce to zero. The final approximation is simply taking the limit as n goes to infinity. Concretely, we’re willing to say that a billion and one divided by a billion is essentially 1.
Conclusions
So the probability of no collisions is exp(−1/2) or about 60%, which means there’s a 40% chance of at least one collision. As a rule of thumb, a hash function with range of size N can hash on the order of √N values before running into collisions.
This means that with a 64-bit hash function, there’s about a 40% chance of collisions when hashing 232 or about 4 billion items. If the output of the hash function is discernibly different from random, the probability of collisions may be higher. A 64-bit hash function cannot be secure since an attacker could easily hash 4 billion items. A 256-bit or 512-bit hash could in principle be secure since one could expect to hash far more items before collisions are likely. Whether a particular algorithm like SHA3-512 is actually secure is a matter for cryptologists, but it is at least feasible that a hash with a 512-bit range could be secure, based on the size of its range, while a 64-bit hash cannot be.
Numerical calculation
We used an asymptotic argument above rather than numerically evaluating the probabilities because this way we get a more general result. But even if we were only interested in a fix but large n, we’d run into numerical problems. This is one of those not uncommon cases where a pencil-and-paper approximation is actually more accurate than direct calculation with no (explicit) approximations.
There are numerous numerical problems with direct calculation of the collision probability. First, without taking logarithms we’d run into overflow and underflow. Second, for large enough n, n2 − n = n2 in floating point representation. IEEE 754 doubles have 53 bits of precision. This isn’t enough to distinguish values that differ, say, in the 128th bit. Finally, the two log gamma terms are large, nearly equal numbers. The cardinal rule of numerical analysis is to avoid subtracting nearly equal numbers. If two numbers agree to k bits, you could lose k bits of precision in carrying out their difference. See Avoiding overflow, underflow, and loss of precision for more along these lines.
Notes
[1] Cryptography Engineering by Ferguson, Schneier, and Kohno
[2] Leap year of course complicates things since February 29 birthdays are less common than other birthdays. Another complication is that birthdays are not entirely evenly distributed for the other days of the year. But these complications don’t ruin the party trick: In a room of 30 people, two people usually share a birthday.
A useful rule of thumb: If you try to store N pigeons in M pigeonholes randomly, then for large enough M, the distribution of pigeons-in-holes is well-approximated by a Poisson distribution with λ = N / M.
If you work with hash tables and designing hash functions a lot, this is a very useful approximation.
I think the third equation from the bottom in the “Proof” section is missing the trailing “-n”. That’s what cancels the first term in the parentheses to give the following equation. See if you don’t agree.
Here is another way to look at birthday paradox
https://www.researchgate.net/publication/333150017_On_the_average_number_of_birthdays_in_birthday_paradox_setup