Three-digit zip codes and data privacy

Birth date, sex, and five-digit zip code are enough information to uniquely identify a large majority of Americans. See more on this here.

So if you want to deidentify a data set, the HIPAA Safe Harbor provision says you should chop off the last two digits of a zip code, keeping only the first three. And even though a three-digit zip code covers a larger population than a five-digit zip code, some three-digit zip codes are still sparsely populated.

But if you use three-digit zip codes, and cut out sparsely populated zip3s, then you’re OK, right?

Well, there’s still a problem if you also report state. Ordinarily a zip3 fits within one state, but not always.

Five-digit zip codes are each entirely contained within a state, as far as I know. But three-digit zip codes can straddle state lines. For example, about 200,000 people live in the three-digit zip code 834. The vast majority of them are in Idaho, but about 500 live in zip code 83414, which is in Wyoming. Zip code 834 is not sparsely populated, and by itself it doesn’t give much information about an individual. But it is conditionally sparsely populated: it carries a lot of information about an individual if that person lives in Wyoming.

On average, a three-digit zip code covers about 350,000 people, and so most of the time the combination of zip3 and state covers roughly that many people as well. But in the example above, the combination of zip3 and state might narrow things down to about 500 people. In a group that small, birthday (just day of the year, not the full date) is enough to uniquely identify around 25% of the population. [1]
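As a rough check of that figure, here is the approximation from footnote [1] computed directly, a quick sketch in Python that takes the group size to be 500 and ignores leap days:

    from math import exp

    group_size = 500   # people sharing the same zip3 and state
    days = 365         # possible birthdays, ignoring leap days

    # Probability that a given person's birthday is shared by no one
    # else in the group, roughly exp(-n/365) for a group of size n.
    print(exp(-group_size / days))   # about 0.25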


[1] The 25% figure came from exp(-500/365). See this post for details.

Hashing names does not protect privacy

Secure hash functions are practically impossible to reverse, but only if the input is unrestricted.

If you generate 256 random bits and apply a secure 256-bit hash algorithm, an attacker wanting to recover your input can’t do much better than brute force, hashing 256-bit strings and hoping to find one that matches your hash value. Even then, the attacker may only have found a collision, another string of bits that happens to have the same hash value.

But if you know the input comes from a restricted set, a set small enough to search by brute force, then hash functions are effectively reversible. If I know that you’ve hashed either “yes” or “no,” then I can apply the hash function to both and see which one it was.
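Here’s what that looks like in practice, a minimal sketch in Python assuming the unknown input is either “yes” or “no”:

    import hashlib

    def sha256_hex(s):
        return hashlib.sha256(s.encode()).hexdigest()

    observed = sha256_hex("no")   # a hash value we want to reverse

    # The input space is tiny, so try every candidate.
    for candidate in ["yes", "no"]:
        if sha256_hex(candidate) == observed:
            print("The input was", candidate)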

Hashing PII

Suppose someone has attempted to anonymize a data set by hashing personally identifying information (PII) such as name, phone number, etc. These inputs come from a small enough space that a brute force search is easy.

For instance, suppose someone has applied a cryptographic hash to first names. Then all an attacker needs to do is find a list of common names, hash them all, and see which hash values match. I searched for a list of names and found this, a list of the 1000 most popular baby girl and boy names in California in 2017.

The data set was compiled based on 473,441 births. Of those births, 366,039 had one of the 2,000 names. That is, 77% of the babies had one of the 1,000 most common names for their sex.

I wrote a little script to read in all the names and compute a SHA256 hash. The program took a fraction of a second to run. With the output of this program, I can’t re-identify every first name in a data set, but I could re-identify 77% of them, assuming my list of names is representative [1].
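The script itself isn’t reproduced here, but something along these lines would do the job. The file name names.txt and its format (one name per line) are assumptions for illustration.

    import hashlib

    # Build a lookup table from SHA-256 hash value back to name.
    hash_to_name = {}
    with open("names.txt") as f:           # hypothetical file, one name per line
        for line in f:
            name = line.strip().upper()    # the data set stores names in capital letters
            digest = hashlib.sha256(name.encode()).hexdigest()
            hash_to_name[digest] = name

    # Any hash value seen in a "de-identified" data set can now be
    # looked up directly, provided the name is on the list.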

Now, for example, if I see the hash value

96001f228724d6a56db13d147a9080848103cf7d67bf08f53bda5d038777b2e6

in the data set, I can look this value up in my list of 2000 hash values and see that it is the hashed value of ACHILLES [2].

If you saw the hash value above and had no idea where it came from—it could be the hash of a JPEG image file for all you know—it would be hopeless to try to figure out what produced it. But if you suspect it’s the hash of a first name, it’s trivial to reverse.

Hashing SSNs

Hashing numbers is simpler but more computationally intensive. I had to do a little research to find a list of names, but I know that Social Security numbers are simply nine-digit numbers. There are only a billion possible nine-digit numbers, so it’s feasible to hash them all to make a look-up table.

I tried this using a little Python script on an old laptop. In 20 seconds I was able to create a file with the hash values of a million SSNs, so presumably I could have finished the job in about five and a half hours. Of course it would take even less time with more efficient code or a more powerful computer.
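My script isn’t included here either, but a sketch of the idea follows. The output file name and the formatting of SSNs as plain nine-digit strings with no dashes are assumptions.

    import hashlib

    # Hash nine-digit numbers and write a lookup table to disk.
    # Shown for the first million; the full billion is the same loop.
    with open("ssn_hashes.txt", "w") as out:
        for n in range(1_000_000):
            ssn = f"{n:09d}"               # zero-padded nine-digit string
            digest = hashlib.sha256(ssn.encode()).hexdigest()
            out.write(digest + "\t" + ssn + "\n")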

Using a key

One way to improve the situation would be to use a key, a random string of bits that you combine with values before hashing. An attacker not knowing your secret key value could not do something as simple as what was described above.
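One standard way to combine a key with the input is a keyed hash such as HMAC; here’s a minimal sketch, where the key value is a placeholder for a secret the attacker doesn’t have.

    import hmac
    import hashlib

    secret_key = b"a-long-random-secret"   # placeholder; keep the real key secret

    def keyed_hash(name):
        # HMAC-SHA256 of the name under the secret key
        return hmac.new(secret_key, name.encode(), hashlib.sha256).hexdigest()

    # Without the key, an attacker can't precompute a table of name hashes.
    print(keyed_hash("ACHILLES"))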

However, in a large data set, such as one from a data breach, an attacker could apply frequency analysis to get some idea how hash values correspond to names. Hash values that show up most frequently in the data probably correspond to popular names etc. This isn’t definitive, but it is useful information. You might be able to tell, for example, that someone has a common first name and a rare last name. This could help narrow down the possibilities for identifying someone.

Adding salt

Several people have commented that the problem goes away if you use a unique salt value for each record. In some ways this is true, but in others it is not.

If you do use a unique salt per record, and save the salt with the data, then you can still do a brute force attack. The execution time is now quadratic rather than linear, but still feasible.
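A sketch of that attack, assuming each record stores its salt alongside the hash and that the salt was simply prepended to the name before hashing (both assumptions for illustration):

    import hashlib

    candidate_names = ["ACHILLES", "EMMA", "NOAH"]   # in practice, a long list

    def crack(records):
        # records is a list of (salt, hash) pairs visible to the attacker.
        # One hash per (record, candidate) pair rather than one per candidate.
        for salt, target in records:
            for name in candidate_names:
                if hashlib.sha256((salt + name).encode()).hexdigest() == target:
                    print(salt, "->", name)
                    break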

If you throw the salt away, then you’ve effectively replaced each identifier with a random string. At that point you’ve essentially removed these columns, since they’re filled with useless random noise. Why not just delete them?

You could use a unique salt per user rather than per record. Then you couldn’t identify a given record, but you could tell when two records belong to the same person. But in this case, why not just assign random IDs to each user? Use the salt itself as the ID. No need to use a hash function.

Custom hash functions

Note that the argument above about keys also applies to custom hashing algorithms, whether something you write from scratch or a combination of rounds of established methods.

Kerckhoffs’s principle advises against relying on security by obscurity. It says that you should assume your algorithm will become known, which often turns out to be the case. But no matter: any hashing scheme that always maps the same input to the same output is vulnerable to frequency analysis.


[1] Name frequencies change over time and space, but I imagine the data from California in 2017 would be adequate to identify most Americans of any age and in any state. Certainly this would not meet the HIPAA standard that “the risk is very small that the information could be used … to identify an individual.”

[2] The California baby name data set standardized all names to capital letters and removed diacritical marks. With a little extra effort, one could hash variations such as JOSE, JOSÉ, Jose, and José.

What does CCPA say about de-identified data?

The California Consumer Privacy Act, or CCPA, takes effect January 1, 2020, less than six months from now. What does the act say about using deidentified data?

First of all, I am not a lawyer; I work for lawyers, advising them on matters where law touches statistics. This post is not legal advice, but my attempt to parse the CCPA, ignoring details best left up to others.

In my opinion, the CCPA is more vague than HIPAA, but not as vague as GDPR. It contains some clear language about using deidentified data, but that language is scattered throughout the act.

Deidentified data

Where to start? Section 1798.145 says

The obligations imposed on businesses by this title shall not restrict a business’s ability to … collect, use, retain, sell, or disclose consumer information that is deidentified or in the aggregate consumer information.

The act discusses identifiers and, more importantly, probabilistic identifiers, a topic I wrote about earlier. The latter term is potentially very broad; see my earlier post for a discussion.

Aggregate consumer information

So what is “aggregate consumer information”? Section 1798.140 says that

For purposes of this title: (a) “Aggregate consumer information” means information that relates to a group or category of consumers, from which individual consumer identities have been removed, that is not linked or reasonably linkable to any consumer or household, including via a device. “Aggregate consumer information” does not mean one or more individual consumer records that have been deidentified.

So aggregate consumer information is different from deidentified information.

Pseudonymization

Later on (subsection (s) of the same section) the act says

Research with personal information … shall be … (2) Subsequently pseudonymized and deidentified, or deidentified and in the aggregate, such that the information cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer.

What? We’ve seen “in the aggregate” before, and presumably that earlier use has something to do with its appearance above. But what does pseudonymized mean? Backing up to subsection (r) we have

“Pseudonymize” or “Pseudonymization” means the processing of personal information in a manner that renders the personal information no longer attributable to a specific consumer without the use of additional information, provided that the additional information is kept separately and is subject to technical and organizational measures to ensure that the personal information is not attributed to an identified or identifiable consumer.

That sounds a lot like deidentification to me, but slightly weaker. It implies that a company can retain the ability to re-identify an individual, as long as the means of doing so “is kept separately and is subject to technical and organizational measures.” I’m speculating here, but it seems like that might mean, for example, that a restricted part of a company might apply a secure hash function to data, and another part of the company sees the results and analyzes the data. Then again, the law says “pseudonymized and deidentified,” so who knows what that means.

The CCPA was written and passed in a hurry with the expectation of being amended later, and it shows.

Compliance

How can you know whether you comply with CCPA’s requirements for pseudonymizing, deidentifying, and aggregating data? A lawyer would have to tell you how the law applies to your situation. As I said above, I’m not a lawyer. But I can recommend lawyers working in this space. And I can work with your lawyer on the technical aspects: what methods are commonly used, how privacy risk is quantified, etc.


Protecting privacy while keeping detailed date information

A common attempt to protect privacy is to truncate dates to just the year. For example, the Safe Harbor provision of the HIPAA Privacy Rule says to remove “all elements of dates (except year) for dates that are directly related to an individual …” This restriction exists because dates of service can be used to identify people as explained here.

Unfortunately, truncating dates to just the year ruins the utility of some data. For example, suppose you have a database of millions of individuals and you’d like to know how effective an ad campaign was. If all you have is dates to the resolution of years, you can hardly answer such questions. You could tell if sales of an item were up from one year to the next, but you couldn’t see, for example, what happened to sales in the weeks following the campaign.

With differential privacy you can answer such questions without compromising individual privacy. In this case you would retain accurate date information, but not allow direct access to this information. Instead, you’d allow queries based on date information.

Differential privacy would add just enough randomness to the results to prevent the kind of privacy attacks described here, but should give you sufficiently accurate answers to be useful. We said there were millions of individuals in the database, and the amount of noise added to protect privacy goes down as the size of the database goes up [1].
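For example, here is a minimal sketch of what a differentially private counting query over dates might look like. The function name, the ε value, and the use of the Laplace mechanism here are illustrative assumptions, not a description of any particular product.

    import numpy as np

    def private_count(dates, start, end, epsilon=0.1):
        # A counting query has sensitivity 1: adding or removing one person
        # changes the true count by at most 1, so Laplace(1/epsilon) noise suffices.
        true_count = sum(start <= d <= end for d in dates)
        return true_count + np.random.laplace(loc=0.0, scale=1.0/epsilon)

With millions of records, noise on the order of 1/ε is negligible compared to counts in the thousands, which is the sense in which a larger database allows more accurate answers.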

There’s no harm in retaining full dates, provided you trust the service that is managing differential privacy, because these dates cannot be retrieved directly. This may be hard to understand if you’re used to traditional deidentification methods that redact rows in a database. Differential privacy doesn’t let you retrieve rows at all, so it’s not necessary to deidentify the rows per se. Instead, differential privacy lets you pose queries, and adds as much noise as necessary to keep the results of the queries from threatening privacy. As I show here, differential privacy is much more cautious about protecting individual date information than truncating to years or even a range of years.

Traditional deidentification methods focus on inputs. Differential privacy focuses on outputs. Data scientists sometimes resist differential privacy because they are accustomed to thinking about the inputs, focusing on what they believe they need to do their job. It may take a little while for them to understand that differential privacy provides a way to accomplish what they’re after, though it does require them to change how they work. This may feel restrictive at first, and it is, but it’s not unnecessarily restrictive. And it opens up new possibilities, such as being able to retain detailed date information.

***

[1] The amount of noise added depends on the sensitivity of the query. If you were asking questions about a common treatment in a database of millions, the necessary amount of noise to add to query results should be small. If you were asking questions about an orphan drug, the amount of noise would be larger. In any case, the amount of noise added is not arbitrary, but calibrated to match the sensitivity of the query.

Internet privacy as seen from 1975

Science fiction authors set stories in the future, but they don’t necessarily try to predict the future, and so it’s a little odd to talk about what they “got right.” Getting something right implies they were making a prediction rather than imagining a setting of a story.

However, sometimes SF authors do indeed try to predict the future. This seems to have been at least somewhat the case with John Brunner and his 1975 novel The Shockwave Rider because he cites futurist Alvin Toffler in his acknowledgement.

The Shockwave Rider derives in large part from Alvin Toffler’s stimulating study Future Shock, and in consequence I’m much obliged to him.

In light of Brunner’s hat tip to Toffler, I think it’s fair to talk about what he got right, or possibly what Toffler got right. Here’s a paragraph from the dust jacket that seemed prescient.

Webbed in a continental data-net that year by year draws tighter as more and still more information is fed to it, most people are apathetic, frightened, resigned to what ultimately will be a total abolishment of individual privacy. A whole new reason has been invented for paranoia: it is beyond doubt — whoever you are! — that someone, somewhere, knows something about you that you wanted to keep a secret … and you stand no chance of learning what it is.


Comparing Truncation to Differential Privacy

Traditional methods of data de-identification obscure data values. For example, you might truncate a date to just the year.

Differential privacy obscures query results by injecting enough noise to keep from revealing information about an individual.

Let’s compare two approaches for de-identifying a person’s age: truncation and differential privacy.

Truncation

First consider truncating birth date to year. For example, anyone born between January 1, 1955 and December 31, 1955 would be recorded as being born in 1955. This effectively produces a 100% confidence interval that is one year wide.

Next we’ll compare this to a 95% confidence interval using ε-differential privacy.

Differential privacy

Differential privacy adds noise in proportion to the sensitivity Δ of a query. Here sensitivity means the maximum impact that one record could have on the result. For example, a query that counts records has sensitivity 1.

Suppose people live to a maximum of 120 years. Then in a database with n records [1], one person’s presence in or absence from the database would change the average age by no more than 120/n years, the worst case corresponding to the extremely unlikely event of a database of n-1 newborns and one 120-year-old.

Laplace mechanism and CIs

The Laplace mechanism implements ε-differential privacy by adding noise with a Laplace(Δ/ε) distribution, which in our example means Laplace(120/nε).

A 95% confidence interval for a Laplace distribution with scale b centered at 0 is

[b log 0.05, –b log 0.05]

which is very nearly

[-3b, 3b].

In our case b = 120/nε, and so a 95% confidence interval for the noise we add would be [-360/nε, 360/nε].

When n = 1000 and ε = 1, this means we’re adding noise that’s usually between -0.36 and 0.36, i.e. we know the average age to within about 4 months. But if n = 1, our confidence interval is the true age ± 360. Since this is wider than the a priori bounds of [0, 120], we’d truncate our answer to be between 0 and 120. So we could query for the age of an individual, but we’d learn nothing.
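Here is a quick check of those numbers; the helper function name is just for this sketch.

    from math import log

    def ci_halfwidth(n, epsilon, max_age=120):
        b = max_age / (n * epsilon)    # Laplace scale, from sensitivity 120/n
        return -b * log(0.05)          # 95% of the noise falls within +/- this

    print(ci_halfwidth(1000, 1))   # about 0.36 years
    print(ci_halfwidth(1, 1))      # about 360 years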

Comparison with truncation

The width of our confidence interval for an individual’s age (the n = 1 case above) is 720/ε, and so to get a confidence interval one year wide, as we get with truncation, we would set ε = 720. Ordinarily ε is much smaller than 720 in applications, say between 1 and 10, which means differential privacy reveals far less information than truncation does.

Even if you truncate age to decade rather than year, this still reveals more information than differential privacy provided ε < 72.


[1] Ordinarily even the number of records in the database is kept private, but we’ll assume here that for some reason we know the number of rows a priori.

State privacy laws to watch


A Massachusetts court ruled this week that obtaining real-time cell phone location data requires a warrant.

Utah has passed a law that goes into effect next month that goes further. Police in Utah will need a warrant to obtain location data or to search someone’s electronic files. (Surely electronic files are the contemporary equivalent of one’s “papers” under the Fourth Amendment.)

Vermont passed the nation’s first data broker law. It requires data brokers to register with the state and to implement security measures, but as far as I have read it doesn’t put much restriction on what they can do.

Texas law expands HIPAA’s notion of a “covered entity” so that it applies to basically anyone handling PHI (protected health information).

California’s CCPA law goes into effect on January 1, 2020. In some ways it’s analogous to GDPR. It will be interesting to see what the law ultimately means in practice. It’s expected that the state legislature will amend the law, and we’ll have to wait on precedents to find out in detail what the law prohibits and allows.

Update: Nevada passed a CCPA-like law May 23, 2019 which takes effect October 1, 2019, i.e. before the CCPA.

Update: Maine passed a bill May 30, 2019 that prohibits ISPs from selling browsing data without consent.


Safe Harbor and the calendar rollover problem


Data privacy is subtle and difficult to regulate. The lawmakers who wrote the HIPAA privacy regulations took a stab at what would protect privacy when they crafted the “Safe Harbor” list. The list is neither necessary nor sufficient, depending on context, but it’s a start.

Extreme values of any measurement are more likely to lead to re-identification. Age in particular may be newsworthy. For example, a newspaper might run a story about a woman in the community turning 100. For this reason, the Safe Harbor provisions require that ages of 90 and above be lumped together. Specifically,

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.

One problem with this rule is that “age 90” is a moving target. Suppose that last year, in 2018, a data set recorded that a woman was born in 1930 and had a child in 1960. This data set was considered de-identified under the Safe Harbor provisions and published in a medical journal. On New Year’s Day 2019, does that data suddenly become sensitive? Or on New Year’s Day 2020? Should the journal retract the paper?!

No additional information is conveyed by the passage of time per se. However, if we knew in 2018 that the woman in question was still alive, and we also know that she’s alive now in 2019, we have more information. Knowing that someone born in 1930 is alive in 2019 is more informative than knowing that the same person was alive in 2018; there are fewer people in the former category than in the latter category.

The hypothetical journal article, committed to print in 2018, does not become more informative in 2019. But an online version of the article, revised with new information in 2019 implying that the woman in question is still alive, does become more informative.

No law can cover every possible use of data, and it would be a bad idea to try. Such a law would be both overly restrictive in some cases and not restrictive enough in others. HIPAA’s expert determination provision allows a statistician to say, for example, that the above scenario is OK, even though it doesn’t satisfy the letter of the Safe Harbor rule.


Data privacy Twitter account

My newest Twitter account is Data Privacy (@data_tip). There I post tweets about ways to protect your privacy, statistical disclosure limitation, etc.

I had a clever idea for the icon, or so I thought. I started with the default Twitter icon, a sort of stylized anonymous person, and colored it with the same blue and white theme as the rest of my Twitter accounts. I think it looked so much like the default icon that most people didn’t register that it had been customized. It looked like an unpopular account, unlikely to post much content.

Now I’ve changed to the new icon below, and the number of followers is increasing.
[Image: data tip icon]


Covered entities: TMPRA extends HIPAA

The US HIPAA law only protects the privacy of health data held by “covered entities,” which essentially means health care providers and insurance companies. If you give your heart monitoring data or DNA to your doctor, it comes under HIPAA. If you give it to Fitbit or 23andMe, it does not. Government entities are not covered by HIPAA either, a fact that Latanya Sweeney exploited to demonstrate how service dates can be used to identify individuals.

Texas passed the Texas Medical Records Privacy Act (a.k.a. HB 300 or TMPRA) to close this gap. Texas has a much broader definition of covered entity. In a nutshell, Texas law defines a covered entity to include anyone “assembling, collecting, analyzing, using, evaluating, storing, or transmitting protected health information.” The full definition, available here, says

“Covered entity” means any person who:

(A) for commercial, financial, or professional gain, monetary fees, or dues, or on a cooperative, nonprofit, or pro bono basis, engages, in whole or in part, and with real or constructive knowledge, in the practice of assembling, collecting, analyzing, using, evaluating, storing, or transmitting protected health information. The term includes a business associate, health care payer, governmental unit, information or computer management entity, school, health researcher, health care facility, clinic, health care provider, or person who maintains an Internet site;

(B) comes into possession of protected health information;

(C) obtains or stores protected health information under this chapter; or

(D) is an employee, agent, or contractor of a person described by Paragraph (A), (B), or (C) insofar as the employee, agent, or contractor creates, receives, obtains, maintains, uses, or transmits protected health information.
