The California Consumer Privacy Act, or CCPA, takes effect January 1, 2020, less than six months from now. What does the act say about using deidentified data?
First of all, I am not a lawyer; I work for lawyers, advising them on matters where law touches statistics. This post is not legal advice, but my attempt to parse the CCPA, ignoring details best left up to others.
In my opinion, the CCPA is more vague than HIPAA, but not as vague as GDPR. It contains some clear language about using deidentified data, but that language is scattered throughout the act.
Deidentified data
Where to start? Section 1798.145 says
The obligations imposed on businesses by this title shall not restrict a business’s ability to … collect, use, retain, sell, or disclose consumer information that is deidentified or in the aggregate consumer information.
The act discusses identifiers and, more importantly, probabilistic identifiers, a topic I wrote about earlier. The term “probabilistic identifier” is potentially very broad; see my earlier post for a discussion.
Aggregate consumer information
So what is “aggregate consumer information”? Section 1798.140 says that
For purposes of this title: (a) “Aggregate consumer information” means information that relates to a group or category of consumers, from which individual consumer identities have been removed, that is not linked or reasonably linkable to any consumer or household, including via a device. “Aggregate consumer information” does not mean one or more individual consumer records that have been deidentified.
So aggregate consumer information is different from deidentified information.
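To make the distinction concrete, here is a minimal sketch of my own, not anything taken from the act. The field names `zip3` and `age_band` and the toy records are made up for illustration. The first table still has one row per consumer, even with direct identifiers stripped; the second has only group-level counts. Whether either one meets the statutory definitions in a given situation is a legal question.

```python
from collections import Counter

# Hypothetical individual-level records with direct identifiers already removed.
# Each row still describes one consumer, so this is NOT aggregate information.
records = [
    {"zip3": "945", "age_band": "30-39"},
    {"zip3": "945", "age_band": "30-39"},
    {"zip3": "945", "age_band": "40-49"},
    {"zip3": "100", "age_band": "30-39"},
]

# Group-level summary: one count per (zip3, age_band) category.
# No row corresponds to a single consumer record.
aggregate = Counter((r["zip3"], r["age_band"]) for r in records)

for group, count in sorted(aggregate.items()):
    print(group, count)
```

Note that a group containing only one person would still point to an individual, which is one reason counting alone does not settle whether data is “reasonably linkable” to a consumer.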
Pseudonymization
Later on (subsection (s) of the same section) the act says
Research with personal information … shall be … (2) Subsequently pseudonymized and deidentified, or deidentified and in the aggregate, such that the information cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer.
What? We’ve seen “in the aggregate” before, and presumably this use is related to the “aggregate consumer information” defined above. But what does pseudonymized mean? Backing up to subsection (r) we have
“Pseudonymize” or “Pseudonymization” means the processing of personal information in a manner that renders the personal information no longer attributable to a specific consumer without the use of additional information, provided that the additional information is kept separately and is subject to technical and organizational measures to ensure that the personal information is not attributed to an identified or identifiable consumer.
That sounds a lot like deidentification to me, but slightly weaker. It implies that a company can retain the ability to re-identify an individual, as long as the means of doing so “is kept separately and is subject to technical and organizational measures.” I’m speculating here, but it seems like that might mean, for example, that a restricted part of a company might apply a secure hash function to data, and another part of the company sees the results and analyzes the data. Then again, the law says “pseudonymized and deidentified,” so who knows what that means. More on the confusion around pseudonymization here.
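Here is a minimal sketch of the hashing arrangement speculated about above. It is my own illustration, not a method the act prescribes, and I make no claim that it satisfies the statute. I have swapped in a keyed hash (HMAC-SHA256) for a plain hash, on the assumption that the secret key can play the role of the “additional information” that is kept separately. The function names, the `email` field, and the toy records are all hypothetical.

```python
import hashlib
import hmac
import secrets


def make_key() -> bytes:
    """Generate a secret key; assumed to be held only by the restricted group."""
    return secrets.token_bytes(32)


def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), hex encoded."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Restricted side: holds the key and the raw identifiers.
key = make_key()
records = [
    {"email": "alice@example.com", "purchase": 30},
    {"email": "bob@example.com", "purchase": 12},
    {"email": "alice@example.com", "purchase": 7},
]

# Analyst side: sees only pseudonyms and the fields needed for analysis.
shared = [
    {"pid": pseudonymize(r["email"], key), "purchase": r["purchase"]}
    for r in records
]
```

The same email always maps to the same pseudonym, so the analysts can still group purchases by customer. Without the key they cannot recompute a pseudonym from a guessed email address, which is the reason for using a keyed hash rather than a plain one here.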
The CCPA was written and passed in a hurry with the expectation of being amended later, and it shows.
Update: The CCPA was amended on September 25, 2020 to say that if data is deidentified under HIPAA, it will be considered deidentified under CCPA. Consult an attorney for details.
Compliance
How can you know whether you comply with CCPA’s requirements for pseudonymizing, deidentifying, and aggregating data? A lawyer would have to tell you how the law applies to your situation. As I said above, I’m not a lawyer. But I can recommend lawyers working in this space. And I can work with your lawyer on the technical aspects: what methods are commonly used, how privacy risk is quantified, etc.