Hashing names does not protect privacy

Secure hash functions are practically impossible to reverse, but only if the input is unrestricted.

If you generate 256 random bits and apply a secure 256-bit hash algorithm, an attacker wanting to recover your input can’t do much better than brute force, hashing 256-bit strings hoping to find one that matches your hash value. Even then, the attacker may have found a collision, another string of bits that happens to have the same hash value.

But if you know the input comes from a restricted set, a set small enough to search by brute force, then hash functions are reversible. This is known as a rainbow table attack. If I know that you’ve either hashed “yes” or “no,” for example, then I can apply the hash function to both and see which one it was.

Hashing PII

Suppose someone has attempted to anonymize a data set by hashing personally identifying information (PII) such as name, phone number, etc. These inputs come from a small enough space that a brute force search is easy.

For instance, suppose someone has applied a cryptographic hash to first names. Then all an attacker needs to do is find a list of common names, hash them all, and see which hash values match. I searched for a list of names and found this, a list of the 1000 most popular baby girl and boy names in California in 2017.

The data set was compiled based on 473,441 births. Of those births, 366,039 had one of the 2,000 names. That is, 77% of the babies had one of the 1,000 most common names for their sex.

I wrote a little script to read in all the names and compute a SHA256 hash. The program took a fraction of a second to run. With the output of this program, I can’t re-identify every first name in a data set, but I could re-identify 77% of them, assuming my list of names is representative [1].

Now, for example, if I see the hash value

96001f228724d6a56db13d147a9080848103cf7d67bf08f53bda5d038777b2e6

in the data set, I can look this value up in my list of 2000 hash values and see that it is the hashed value of ACHILLES [2].

If you saw the hash value above and had no idea where it came from—it could be the hash of a JPEG image file for all you know—it would be hopeless to try to figure out what produced it. But if you suspect it’s the hash of a first name, it’s trivial to reverse.

Hashing SSNs

Hashing numbers is simpler but more computationally intense. I had to do a little research to find a list of names, but I know that social security numbers are simply 9-digit numbers. There are only a billion possible nine-digit numbers, so it’s feasible to hash them all to make a look-up table.

I tried this using a little Python script on an old laptop. In 20 seconds I was able to create a file with the hash values of a million SSNs, so presumably I could have finished the job in about five and a half hours. Of course it would take even less time with more efficient code or a more powerful computer.

Using a key

One way to improve the situation would be to use a key, a random string of bits that you combine with values before hashing. An attacker not knowing your secret key value could not do something as simple as what was described above.

However, in a large data set, such as one from a data breach, an attacker could apply frequency analysis to get some idea how hash values correspond to names. Hash values that show up most frequently in the data probably correspond to popular names etc. This isn’t definitive, but it is useful information. You might be able to tell, for example, that someone has a common first name and a rare last name. This could help narrow down the possibilities for identifying someone.

Adding salt

Several people have commented that the problem goes away if you use a unique salt value for each record. In some ways this is true, but in others it is not.

If you do use a unique salt per record, and save the salt with the data, then you can still do a brute force attack. The execution time is now quadratic rather than linear, but still feasible.

If you throw the salt away, then you’ve effectively replaced each identifier with a random string. Then you’ve effectively removed these columns since they’re filled with useless random noise. Why not just delete them?

You could use a unique salt per user rather than per record. Then you couldn’t identify a given record, but you could tell when two records belong to the same person. But in this case, why not just assign random IDs to each user? Use the salt itself as the ID. No need to use a hash function.

Custom hash functions

Note that the argument above for keys applies to using a custom hashing algorithm, either something you write from scratch or by combining rounds of established methods.

Kirchoff’s principle advises against relying on security-by-obscurity. It says that you should assume that your algorithm will become known, which often turns out to be the case. But no matter: any hashing scheme that always maps the same input to the same output is vulnerable to frequency analysis.

More privacy posts

[1] Name frequencies change over time and space, but I imagine the data from California in 2017 would be adequate to identify most Americans of any age and in any state. Certainly this would not meet the HIPAA standard that “the risk is very small that the information could be used … to identify an individual.”

[2] The California baby name data set standardized all names to capital letters and removed diacritical marks. With a little extra effort, one could hash variations such as JOSE, JOSÉ, Jose, and José.

 

 

9 thoughts on “Hashing names does not protect privacy

  1. Johnathan Corgan

    $ echo ACHILLES | sha256sum
    64a16f3fc6000f4ad755369b12549998c78ba9437ab314aa88ef1e890297668c

    $ echo -n ACHILLES | sha256sum
    96001f228724d6a56db13d147a9080848103cf7d67bf08f53bda5d038777b2e6

    It took me an embarrassingly long time to figure that out :-)

  2. Tony Zbaraschuk

    And SSNs are not random 9-digit numbers; they have quite a few very fixed elements. So it would be easier to hash that set than to hash 10^9 numbers.

  3. Jakub Narębski

    Wouldn’t using a *salt* solve some of those issues? If each result of hashing is really salt+hash(salt+data), then rainbow tables wouldn’t work. One would still be able to check all the possible values of data, but now one would have to do it per entry.

  4. What if you use part of the PII as salt? So let’s say you have a first name besides the SSN. Then you do `hash(hash(name) + ssn)`, throw away both name and ssn, and store this result as the unique identifier. The result is not a random number any more. If the algorithm is something like bcrypt or PBKDF2 this will really complicate creating rainbow tables.

  5. @Michiel Hendriks: You really don’t want to do this, because you can then simultaneously attack all Michaels at once, using one common rainbow table: `hash(hash(‘Michael’) + ssn)`

    A unique salt `hash(salt + ssn)` prevents that and forces you to create a separate rainbow table for every single record in your database.

  6. > But in this case, why not just assign random IDs to each user? Use the salt itself as the ID. No need to use a hash function.

    Isn’t the point of such a database to recognize existing entries without knowing what the entry contains? In such a case, what is the point in saving only the salt as an ID, and throwing away the PII? To find my entries, I would need another database relating the ID and the PII, and we are back to square one.

Comments are closed.