Privacy consulting: Anonymized data, statistical databases

How do you anonymize a database? Is it enough to remove obviously identifiable data such as names and addresses? Unfortunately no.

Maybe you need to comply with privacy laws such as HIPAA (Health Insurance Portability and Accountability Act) in the US or GDPR (General Data Protection Regulation) in the EU. These laws require data to be anonymized before it can be shared. It is not enough that obvious identifiers are removed. It must be highly unlikely that individuals in the data can be re-identified.

Re-identification can be subtle

It can be surprisingly easy to identify someone from data that looks anonymous. For example, the combination of zip code, sex, and date of birth may be enough to identify someone in the US. While it’s obvious that someone stating her zip code is giving away some identifying information, it is possible to leak identifying information in more subtle ways. In one famous case study, anonymous movie ratings were combined with other publicly available information to eventually obtain medical records.
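
To make the zip code, sex, and date of birth example concrete, here is a minimal sketch that counts how many records share each combination of those fields. The records, field names, and helper functions are hypothetical and made up for illustration. A record whose combination appears only once is a candidate for re-identification by anyone who knows those three facts about a person.

```python
from collections import Counter

# Hypothetical toy records: names are already removed, but the
# quasi-identifiers (zip, sex, date of birth) are still present.
records = [
    {"zip": "77005", "sex": "F", "dob": "1980-03-14", "diagnosis": "A"},
    {"zip": "77005", "sex": "F", "dob": "1980-03-14", "diagnosis": "B"},
    {"zip": "77005", "sex": "M", "dob": "1975-11-02", "diagnosis": "C"},
    {"zip": "77030", "sex": "F", "dob": "1991-07-30", "diagnosis": "D"},
]

QUASI_IDENTIFIERS = ("zip", "sex", "dob")


def group_sizes(rows, keys=QUASI_IDENTIFIERS):
    """Count how many rows share each combination of quasi-identifiers."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_rows(rows, keys=QUASI_IDENTIFIERS):
    """Return the rows whose quasi-identifier combination is unique."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


sizes = group_sizes(records)
smallest_group = min(sizes.values())  # the "k" in k-anonymity terms
singled_out = unique_rows(records)

print("smallest group size:", smallest_group)
print("records that stand alone:", len(singled_out))
```

A smallest group size of 1 means at least one person stands alone in the data. This check only covers the fields you think to list, so it does not catch the subtler leaks described above.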

I can help

De-identification and re-identification are ultimately a matter of probability and statistics. My education and experience are ideally suited for these projects. I have taught at several universities, worked as a statistician for the world’s largest cancer center, and have been an independent consultant for years. I have helped companies large and small protect the privacy of individuals while meeting their business needs.

Tools and techniques

Sometimes anonymization is simple: just remove certain data fields or restrict access to the data. Other times matters are more complex. Protecting privacy while keeping the value of the data may require secure hashing, encryption, or randomization. Intuitive decision making is inadequate in complex projects. There are mathematical frameworks that can guide you through such projects, helping you avoid unforeseen pitfalls.
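
As one illustration of secure hashing, the sketch below replaces an identifier with a keyed hash (HMAC-SHA256) using only the Python standard library. The function name, the sample identifiers, and the key handling are assumptions for illustration, not a recommendation for any particular project. A key is used because a plain hash of a value drawn from a small set, such as a phone number, can be reversed by hashing every possible value.

```python
import hashlib
import hmac
import secrets


def pseudonymize(identifier: str, key: bytes) -> str:
    """Map an identifier to a stable token using HMAC-SHA256.

    The same identifier and key always give the same token, so records
    can still be linked, but the token cannot be recomputed by someone
    who does not hold the key.
    """
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: in practice the key is generated once and stored securely,
# separate from the data. Here it is generated on the fly for the demo.
key = secrets.token_bytes(32)

token_a = pseudonymize("patient-0001", key)
token_b = pseudonymize("patient-0002", key)

print(token_a)
print(token_b)
```

Hashing an identifier does nothing about the other fields in a record, so this is one tool among several, not anonymization by itself.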

One way to preserve privacy is to add randomness to the data, making individual records private while maintaining statistical information. What kind of randomness should you add? How much? I can help with these questions. Small projects may need only simple randomization (or no randomization at all) while large, high-profile projects can use differential privacy for strong theoretical guarantees of privacy.
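
To give a feel for what adding randomness looks like, here is a minimal sketch of the Laplace mechanism from differential privacy, applied to a single count. It assumes a counting query where adding or removing one person changes the answer by at most 1. The function name and the parameter values are illustrative only. The parameter epsilon controls the tradeoff: smaller values mean more noise and more privacy.

```python
import random


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return the count plus Laplace noise with scale 1/epsilon.

    Assumes the query has sensitivity 1, i.e. one person changes the
    true count by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The difference of two independent exponential variables with rate
    # epsilon follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


rng = random.Random()
print(noisy_count(100, epsilon=0.5, rng=rng))
print(noisy_count(100, epsilon=0.5, rng=rng))
```

Each call returns a different answer, but the answers average out to the true count. This is a teaching sketch: a production system also has to track the total privacy budget across queries and guard against floating-point weaknesses in naive noise sampling.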

Whatever the size of your project, I can help you protect user privacy while meeting your business needs.


Trusted consultant to some of the world’s leading companies

Amazon, Facebook, Google, US Army Corps of Engineers, Amgen, Microsoft, Hitachi Data Systems