Probabilistic Identifiers in CCPA

The CCPA, the California Consumer Privacy Act, was passed in 2018 and goes into effect on January 1, 2020. And just as the GDPR impacts businesses outside Europe, the CCPA will impact businesses outside California.

The law specifically mentions probabilistic identifiers.

“Probabilistic identifier” means the identification of a consumer or a device to a degree of certainty of more probable than not based on any categories of personal information included in, or similar to, the categories enumerated in the definition of personal information.

So a probabilistic identifier would be anything that gives you better than a 50% chance of guessing personal data fields [1]. That could be really broad. For example, the fact that you’re reading this blog post makes it “more probable than not” that you have a college degree, and education is one of the categories mentioned in the law.

Personal information

What are these enumerated categories of personal information mentioned above? They start out specific:

Identifiers such as a real name, alias, postal address, unique personal identifier, online identifier Internet Protocol address, email address, …

but then they get more vague:

purchasing or consuming histories or tendencies … interaction with an Internet Web site … professional or employment-related information.

And in addition to the vague categories are “any categories … similar to” these.

Significance

What is the significance of a probabilistic identifier? That’s hard to say. A large part of the CCPA is devoted to definitions, and some of these definitions don’t seem to be used. Maybe this is a consequence of the bill being rushed to a vote in order to avoid a ballot initiative. Maybe the definitions were included in case they’re needed in a future amended version of the law.

The CCPA seems to give probabilistic identifiers the same status as deterministic identifiers:

“Unique identifier” or “Unique personal identifier” means … or probabilistic identifiers that can be used to identify a particular consumer or device.

That seems odd. Data that can give you a “more probable than not” guess at someone’s “purchasing or consuming histories” hardly seems like a unique identifier.

Devices

It’s interesting that the CCPA says “a particular consumer or device.” That would seem to include browser fingerprinting. That could be a big deal. Identifying devices, but not directly people, is a major industry.
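
To see how a fingerprint could clear the “more probable than not” bar for a device, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the device records are made up, the three attributes are hypothetical stand-ins for whatever a real fingerprint collects, and the rule that every device sharing a fingerprint is equally likely to be the right one is a simplification. It says nothing about how a court or regulator would measure the probability.

```python
from collections import Counter

# Hypothetical device records: (browser, screen size, time zone).
DEVICES = [
    ("Firefox", "1920x1080", "UTC-8"),
    ("Firefox", "1920x1080", "UTC-8"),
    ("Firefox", "1920x1080", "UTC-8"),
    ("Chrome", "1366x768", "UTC-8"),
    ("Chrome", "1366x768", "UTC-5"),
    ("Safari", "2560x1440", "UTC-8"),
]


def identification_probability(fingerprint, population):
    """Chance of picking the right device among those sharing a fingerprint.

    Assumes every device with the same fingerprint is equally likely
    to be the one we are looking for.
    """
    matches = Counter(population)[fingerprint]
    return 1 / matches if matches else 0.0


def more_probable_than_not(fingerprint, population):
    """True when the identification probability is strictly above 0.5."""
    return identification_probability(fingerprint, population) > 0.5


for fp in sorted(set(DEVICES)):
    p = identification_probability(fp, DEVICES)
    print(fp, round(p, 3), more_probable_than_not(fp, DEVICES))
```

Under these assumptions a fingerprint shared by three devices falls short of the threshold, while a fingerprint held by a single device identifies it with certainty. A fingerprint shared by exactly two devices gives a probability of exactly one half, which is not more probable than not.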

[1] Nothing in this blog post is legal advice. I’m not a lawyer and I don’t give legal advice. I enjoy working with lawyers because the division of labor is clear: they do law and I do math.

One thought on “Probabilistic Identifiers in CCPA”

  1. IANAL, but the way I would read the paragraph defining probabilistic identifiers is that any piece of personal information, or combination thereof, is a probabilistic identifier if and only if it identifies a specific person or device with a probability of more than 0.5. If this is correct, then arbitrary data that give a 51% chance of guessing a person’s purchasing history would not qualify as a probabilistic identifier, nor would information that gives a 51% chance that someone has a college degree. If, however, the combination of purchasing history and information about his/her education gives a 51% chance of identifying a specific person, that would count as a probabilistic identifier.

    In my mind, the tricky part is what it means for there to be a probability greater than 50% that a record belongs to a specific person. A probability statement about a specific individual only makes sense in a Bayesian framework, which means that we have to ask what the prior distribution is. The most obvious prior distribution is to assign equal probability to all individuals. However, in real life most people are not working from such a prior distribution, but might have some other information about the person they want to identify. If someone had a prior distribution that was uniform on 10,000 people, and after viewing the record concluded that it belonged to me with a probability of 0.6, was my privacy breached? I would probably feel like it was. What if the prior distribution was uniform on 100 people? 10 people?
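
The dependence on the prior in the comment above can be made concrete with a short calculation. This is only a sketch under stated assumptions: the prior is uniform over n candidates, and the record is r times more likely to have come from the target than from any one of the other candidates. The function names and the single likelihood ratio r are hypothetical simplifications, not anything the CCPA defines. With those assumptions, Bayes’ theorem gives a posterior of r / (r + n - 1).

```python
def posterior_for_target(n_candidates, likelihood_ratio):
    """Posterior probability that a record belongs to the target person.

    Assumes a uniform prior over n_candidates people and that the record
    is likelihood_ratio times more likely under the target than under
    each of the other candidates.
    """
    if n_candidates < 1 or likelihood_ratio <= 0:
        raise ValueError("need n_candidates >= 1 and likelihood_ratio > 0")
    return likelihood_ratio / (likelihood_ratio + n_candidates - 1)


def ratio_needed(n_candidates, posterior):
    """Likelihood ratio required to reach a given posterior (same assumptions)."""
    if not 0 < posterior < 1:
        raise ValueError("posterior must be strictly between 0 and 1")
    return posterior * (n_candidates - 1) / (1 - posterior)


# How strong must the evidence be to reach a posterior of 0.6?
for n in (10, 100, 10_000):
    print(n, ratio_needed(n, 0.6))
```

Under these assumptions, reaching a posterior of 0.6 takes a likelihood ratio of about 13.5 when the prior covers 10 people, about 148.5 for 100 people, and about 14,998.5 for 10,000 people. The same final probability can reflect very different amounts of information in the record, depending on what the observer already knew.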
