# Privacy implications of hashing data

Cryptographic hash functions are also known as one-way functions because given an input x, one can easily compute its hashed value f(x), but it is impractical to recover x from knowing f(x).

However, if we know that x comes from a small universe of possible values, our one-way function can effectively become a two-way function, i.e. it may be possible to start with f(x) and recover x using a rainbow table attack.

It’s possible to defend against a rainbow attack yet still leak information about the probability distribution of values of x given f(x).

For privacy protection, hashing is better than not hashing. Keyed hashing is better than unkeyed hashing. But even keyed hashing can leak information.

## Rainbow tables

Suppose a data set contains SHA-256 hashed values of US Social Security Numbers (SSNs). Since SSNs have 10 digits, there are 10 billion possible SSNs. It would be possible to hash all possible SSNs and create a lookup table, known as a rainbow table. There are three things that make the rainbow table attack possible in this example:

1. The range of possible inputs is known and relatively small.
2. The hashing algorithm is known.
3. The hashing algorithm can be computed quickly.

There are a couple ways to thwart a rainbow table attack. Assuming we have no control of (1) above, we can alter (2) or (3).

### Keyed hashing

A way to alter (2) is to use a keyed hash algorithm. For example, if we XOR the SSNs with a key before applying SHA-256, an attacker cannot construct a rainbow table without knowing the key. The attacker may know the core hash algorithm we are using, but they do not know the entire algorithm because the key is part of the algorithm.

### Expensive hashing

A way to alter (3) is to use hashing algorithms that are expensive to compute, such as Argon2. The idea is that such hash functions can be computed quickly enough for legitimate use cases, but not for brute-force attacks.

The time required to create a rainbow table is the product of the time to compute a single hash and the size of the set of possible inputs. If it took an hour to compute a hash value, but there are only 10 possible values, then an attacker could compute a rainbow table in 10 hours.

## Leaking attribute probabilities

Now suppose we’re using a keyed hashing algorithm. Maybe we’re paranoid and use an expensive hashing algorithm on top of that. What can go wrong now?

If we’re hashing values that each appear one time then we’re OK. If we apply a keyed hash to primary keys in a database, such as patient ID, then the hashed ID does not reveal the patient ID. But if we hash attributes associated with that patient ID, things are different.

### Frequency analysis

Suppose US state is an attribute in your database, and you hash this value. No matter how secure and how expensive your hash algorithm is, each state will have a unique hash value. If the database contains a geographically representative sample of the US population, then the hashed state value that appears most often is probably the most populous state, i.e. California. The second most common hashed state value probably corresponds to Texas.

Things are fuzzier on the low end. The hashed state value appearing least often may not correspond to Wyoming, but it very likely does not correspond to Florida, for example.

In short, you can infer the state values using the same kind of frequency analysis you’d use to solve a simple substitution cipher in a cryptogram puzzle. The larger the data set, the more closely the empirical order will likely align with the population order.

### Maybe this is OK?

Frequency analysis makes it possible to infer (with some uncertainty) the most common values. But the most common values are also the least informative. Knowing someone is from California narrows down their identity less than knowing that they are from Wyoming.

Frequency analysis is less specific for less common values. It might not tell you with much confidence that someone is from Wyoming, but it might tell you with some confidence that they come from a low-population state. However, since there are several low-population states, knowing someone is from such a state, without knowing which particular state, isn’t so informative.

Data privacy depends on context. Knowing what state someone is likely from may or may not be a problem depending on what other information is available.