A cryptographic hash is also known as a one-way function because given an input x, one can quickly compute the hash h(x), but it is extremely time-consuming to try to recover x if you only know h(x).
Even if the hashing algorithm is considered “broken,” it may take an enormous effort to break it. Google demonstrated that they could break a SHA-1 hash, but they used a GPU-century of compute power to do so. Attacks have become more efficient since then, but it still takes many orders of magnitude less work to compute a hash than to attempt to invert it. [1]
However, if you know that the hash value comes from a small set of possible inputs, brute force can discover which one. I wrote about this a couple years ago in the post Hashing names does not protect privacy. You could, for example, easily create a table of the hashes of all nine-digit social security numbers.
I often explain this to clients who have been told that hashed data is “encrypted.” This is subtle, because the data is encrypted, in a way, but not in the way they think.
A paper came out a few weeks ago that hashed 118 billion phone numbers.
The limited amount of possible mobile phone numbers combined with the rapid increase in affordable storage capacity makes it feasible to create key-value databases of phone numbers indexed by their hashes and then to perform constant-time lookups for each given hash value. We demonstrate this by using a high-performance cluster to create an in-memory database of all 118 billion possible mobile phone numbers from [reference] (i.e., mobile phone numbers allowed by Google’s
libphonenumber
and the WhatsApp registration API) paired with their SHA-1 hashes.
The authors were able to use this data base to query
10% of US mobile phone numbers for WhatsApp and 100% for Signal. For Telegram we find that its API exposes a wide range of sensitive information, even about numbers not registered with the service.
Related posts
- Notes on computing hash functions
- Probability of secure hash collisions
- Privacy implications of computed IDs
- Data privacy consulting
[1] Hash functions are not invertible, even in theory, in the sense of a unique x leading to a hash value h(x). Suppose you’re computing a 256-bit hash on files that are one kilobyte (8192 bits). If you’re mapping a space of 28192 possible files into a space of 2256 possible hash values, the mapping cannot be one-to-one. However, if you know the inputs are not random bits but German prose, and you find a file of German prose that has a matching hash value, you’ve almost certainly recovered the file that led to the hash value.
Back in the Day, I gave my CS undergrads an assignment to create a probabilistic spell-checker based on hashing. Dictionary entries were hashed to 8 different tables with a variety of (frankly lousy) hash functions. But, a misspelled word would need 8 ‘hits” to be accepted, so it worked pretty well. Not super practical, but a good workout for my students, who thought this was a pretty cool assignment.