Deidentification, anonymization, and pseudonymization

What is the difference between deidentification, anonymization, and pseudonymization? I don’t know, without asking for clarification, and neither does anyone else.

There are people who will tell you they know the difference, but they’re choosing a particular set of definitions. And two people who both “know” the difference between the terms can of course come to different conclusions.

Here’s an excerpt from a US federal government publication [1] showing that I’m not alone in thinking that the terms can be used either interchangeably or with technical distinctions in mind.

Some authors and publications use the terms “de-identification” and “anonymization” interchangeably. Others use “de-identification” to describe a process and “anonymization” to denote a specific kind of de-identification that cannot be reversed. In some healthcare contexts the terms “de-identification” and “pseudonymization” are treated equivalently, with the term “anonymization” being used to indicate that the mapping pseudonyms to subject identities has been erased.

Most of my clients use deidentification and anonymization to mean the same thing. My clients are primarily American companies or international companies wanting to comply with US privacy laws. The GDPR speaks of pseudonymization, but there is a lot of confusion around what that term means.

What difference does it make if a pseudonym is assigned to a database record as long as the assignment process is uninformative? For example, how is a woman’s data any more or less private if, where there was no name, we arbitrarily assign the name Jane Doe? It does matter, however, that the process is uninformative. If the data did not report patient sex, then calling the women in the database Jane Doe adds information.

If by “pseudonym” you mean a unique identifier, rather than a pseudonym in the colloquial sense, then things get more subtle. Often a unique but otherwise meaningless identifier is used to normalize a database, reducing redundancy by splitting data into multiple tables. The identifier tells you how to sew the information from separate tables back together, but otherwise reveals nothing about the individual.
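To make the normalization idea concrete, here is a minimal sketch in Python. The records, field names, and the `normalize` and `rejoin` helpers are all hypothetical, invented for illustration. The identifier is drawn at random, so it carries no information about the person; its only job is to say which rows belong together.

```python
import secrets

# Hypothetical flat records: each row repeats the patient's details.
visits = [
    {"name": "Alice Smith", "birth_year": 1980, "visit": "2023-01-05", "diagnosis": "flu"},
    {"name": "Alice Smith", "birth_year": 1980, "visit": "2023-03-12", "diagnosis": "sprain"},
    {"name": "Bob Jones", "birth_year": 1975, "visit": "2023-02-20", "diagnosis": "flu"},
]

def normalize(rows):
    """Split flat rows into a people table and a visits table linked by a random ID."""
    ids = {}     # (name, birth_year) -> surrogate identifier
    people = {}  # surrogate identifier -> identifying fields, stored once
    events = []  # visit rows that carry only the surrogate identifier
    for row in rows:
        key = (row["name"], row["birth_year"])
        if key not in ids:
            pid = secrets.token_hex(8)  # random, so it reveals nothing by itself
            ids[key] = pid
            people[pid] = {"name": row["name"], "birth_year": row["birth_year"]}
        events.append(
            {"person_id": ids[key], "visit": row["visit"], "diagnosis": row["diagnosis"]}
        )
    return people, events

def rejoin(people, events):
    """Sew the two tables back together using the identifier."""
    return [
        {**people[e["person_id"]], "visit": e["visit"], "diagnosis": e["diagnosis"]}
        for e in events
    ]

people, events = normalize(visits)
```

The `events` table alone contains no names, but anyone who also holds `people` can reconstruct the original rows with `rejoin`. Whether the identifier counts as a privacy risk depends on who has access to which table.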

Then there’s tokenization, which is a kind of gray zone. A tokenization service will assign identifiers to database rows, indicating that a row of data about someone in one database probably corresponds to a row of data about the same person in another database. The token itself contains no information (it may be created by applying a cryptographic hash function to some data), but it facilitates the creation of information by linking the databases together. It may or may not be possible to identify someone in the combined linked data, even if it was not possible to identify anyone in the two databases separately.
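Here is a rough sketch of how hash-based tokens can link two datasets. It is an illustration under stated assumptions, not a description of any particular tokenization service: the key, the datasets, the field names, and the `token`, `tokenize`, and `link` helpers are all made up, and the choice of a keyed hash (HMAC-SHA256) over a plain hash is my own assumption.

```python
import hashlib
import hmac

# Assumption: the token service holds a secret key, so outsiders cannot
# recompute tokens from guessed names and birth dates.
SECRET_KEY = b"hypothetical-key-held-by-the-token-service"

def token(name, birth_date):
    """Derive a token from lightly normalized identifying fields."""
    normalized = f"{name.strip().lower()}|{birth_date}".encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()

# Two hypothetical datasets held by different parties.
pharmacy = [
    {"name": "Alice Smith", "birth_date": "1980-04-02", "drug": "insulin"},
    {"name": "Bob Jones", "birth_date": "1975-11-30", "drug": "statin"},
]
claims = [
    {"name": "alice smith ", "birth_date": "1980-04-02", "zip3": "770"},
    {"name": "Carol White", "birth_date": "1990-07-15", "zip3": "021"},
]

def tokenize(rows):
    """Replace the identifying fields in each row with a token."""
    out = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k not in ("name", "birth_date")}
        out.append({"token": token(row["name"], row["birth_date"]), **rest})
    return out

def link(a, b):
    """Combine rows from the two tokenized datasets that share a token."""
    by_token = {row["token"]: row for row in b}
    return [{**row, **by_token[row["token"]]} for row in a if row["token"] in by_token]

linked = link(tokenize(pharmacy), tokenize(claims))
```

Neither tokenized dataset contains a name, yet `linked` holds a row that pairs a drug with a partial ZIP code for the same person. That combined row is new information that neither database held separately, which is why the privacy of the linked data has to be evaluated separately from the privacy of its inputs.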

[1] Simson L. Garfinkel, NIST Internal Report 8053, De-Identification of Personal Information

***

If you need help with data deidentification, anonymization, or pseudonymization, we can help you clarify what these words mean in your context and help you achieve the level of data privacy protection you need.
