Safe Harbor ain’t gonna cut it

There are two ways to deidentify data to satisfy HIPAA:

  • Safe Harbor, § 164.514(b)(2), and
  • Expert Determination, § 164.514(b)(1).

And for reasons explained here, you may need to be concerned with HIPAA even if you’re not a “covered entity” under the statute.

To comply with Safe Harbor, your data may not contain any of eighteen categories of information. Most of these are obvious: direct identifiers such as name, phone number, email address, etc. But some restrictions under Safe Harbor are less obvious and more difficult to comply with.

For example, under Safe Harbor you need to remove

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.

This would make it impossible, for example, to look at seasonal trends in medical procedures because you would only have data to the resolution of a year. But with a more sophisticated approach, e.g. differential privacy, it would be possible to answer such questions while providing better privacy for individuals. See how here.
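
To give a rough idea of what that looks like, here is a minimal sketch in Python of the Laplace mechanism applied to monthly counts. The counts, the function name, and the ε value are made up for illustration; the point is that noise on the order of a few counts leaves a seasonal pattern in the hundreds clearly visible while masking any one individual's contribution.

    import numpy as np

    def dp_count(true_count, epsilon):
        # A counting query changes by at most 1 when one person is added or
        # removed, so Laplace noise with scale 1/epsilon suffices.
        return true_count + np.random.laplace(scale=1.0 / epsilon)

    # Hypothetical monthly procedure counts and a made-up epsilon per query
    monthly_counts = [310, 295, 402, 380, 365, 330, 290, 285, 340, 420, 450, 470]
    noisy_counts = [dp_count(c, epsilon=0.5) for c in monthly_counts]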

If you need to comply with HIPAA, or analogous state laws such as the Texas Medical Records Privacy Act (TMRPA), and you can’t follow Safe Harbor, your alternative is expert determination. If you’d like to discuss expert determination, let’s talk.

Why HIPAA matters even if you’re not a “covered entity”


The HIPAA privacy rule only applies to “covered entities.” This generally means insurance plans, healthcare clearinghouses, and medical providers. If your company is using health information but isn’t a covered entity per the HIPAA statute, there are a couple of reasons you might still need to pay attention to HIPAA [1].

The first is that state laws may be broader than federal laws. For example, the Texas Medical Records Privacy Act extends the definition of covered entity to any business “assembling, collecting, analyzing, using, evaluating, storing, or transmitting protected health information.” So even if the US government does not consider your business to be a covered entity, the State of Texas might.

The second is that more recent privacy laws look to HIPAA. For example, it’s not clear yet what exactly California’s new privacy legislation, CCPA, will mean in practice, even though the law went into effect at the beginning of the year. Because HIPAA is well established and has extensive guidance documentation, companies needing to comply with CCPA are looking to HIPAA for precedent.

The connection between CCPA and HIPAA may be formalized into more than an analogy. There is a proposed amendment to CCPA that would introduce HIPAA-like expert determination for CCPA. (Update: This amendment, AB 713, was signed into law September 25, 2020.)

If you would like to discuss HIPAA deidentification or data privacy more generally, let’s talk.


[1] I advise lawyers on statistical matters, but I am not a lawyer. Nothing here should be considered legal advice. Ask your legal counsel if you need to comply with HIPAA, or with state laws analogous to HIPAA.

 

CCPA and expert determination

California’s new CCPA (California Consumer Privacy Act) may become more like HIPAA. In particular, a proposed amendment would apply HIPAA’s standards of expert determination to CCPA.

According to this article,

The California State Senate’s Health Committee recently approved California AB 713, which would amend the California Consumer Privacy Act (CCPA) to except from CCPA requirements additional categories of health information, including data de-identified in accordance with HIPAA and certain medical research data.

Some businesses have been looking to HIPAA by analogy for how to comply with CCPA. HIPAA has been on the books much longer, and what it means to comply with HIPAA is more clearly stated, in regulation itself and in guidance documents. AB 713 would make this appeal to HIPAA more than an analogy.

In particular, CCPA would now have a notion of expert determination. AB 713 explicitly refers to

The deidentification methodology described in Section 164.514(b)(1) of Title 45 of the Code of Federal Regulations, commonly known as the HIPAA expert determination method.

Emphasis added. Taken from 1798.130 (a)(5)(D)(i).

Update: California’s governor signed AB 713 into law on September 25, 2020.

Parsing AB 713

The amendment is hard to read because it doesn’t contain many complete sentences. The portion quoted above doesn’t have a verb. We have to go up to (a) in the hierarchy before we can find a clear subject and verb:

… a business shall …

It’s not clear to me what the amendment is saying. Rather than trying to parse this myself, I’ll quote what the article linked above says.

AB 713 would except from CCPA requirements de-identified health information when … The information is de-identified in accordance with a HIPAA de-identification method [and two other conditions].

Expert determination

I am not a lawyer; I advise lawyers on statistical matters. I offer statistical advice, not legal advice.

If your lawyer determines that you need HIPAA-style expert determination to comply with CCPA, I can help. I have provided expert determination for many companies and would welcome the opportunity to provide this service for your company as well.

If you’d like to discuss expert determination, either for HIPAA or for CCPA, let’s talk.

Stochastic rounding and privacy

Suppose ages in some database are reported in decades: 0, 10, 20, etc. You need to add a 27-year-old woman to the data set. How do you record her age? A reasonable approach would be to round to the nearest decade. In this case, 27 would be rounded up to 30.

Another approach would be stochastic rounding. In our example, we would round this woman’s age up to 30 with 70% probability and round it down to 20 with 30% probability. The recorded value is a random variable whose expected value is exactly 27.

Suppose we were to add a large number of 27-year-olds to the database. With round-to-nearest, the average value would be 30 because all the recorded values are 30. With stochastic rounding, about 30% of the ages would be recorded as 20 and about 70% would be recorded as 30. The average would likely be close to 27.

Next, suppose we add people of varying ages to the database. Stochastic rounding would record every person’s age using a random variable whose expected value is their age. If someone’s age is d + x, where d is a decade, i.e. a multiple of 10, and 0 < x < 10, then we would record their age as d with probability 1 − x/10 and as d + 10 with probability x/10. There would be no bias in the reported age.
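
Here is a minimal sketch of stochastic rounding in Python, following the description above; the function name and the simulation at the end are mine.

    import random

    def stochastic_round(age, base=10):
        # Write age as d + x with d a multiple of base and 0 <= x < base,
        # then round up with probability x/base and down otherwise, so the
        # expected recorded value equals the true age.
        d = (age // base) * base
        x = age - d
        return d + base if random.random() < x / base else d

    # Adding many 27-year-olds: the average recorded age comes out near 27.
    sample = [stochastic_round(27) for _ in range(100_000)]
    print(sum(sample) / len(sample))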

Round-to-nearest will be biased unless ages are uniformly distributed in each decade. Suppose, for example, our data is on undergraduate students. We would expect a lot more students in their early twenties than in their late twenties.

Now let’s turn things around. Instead of looking at recorded age given actual age, let’s look at actual age given recorded age. Suppose someone’s age is recorded as 30. What does that tell you about them?

With round-to-nearest, it tells you that they certainly are between 25 and 35. With stochastic rounding, they could be anywhere between 20 and 40. The probability distribution on this interval could be computed from Bayes’ theorem, depending on the prior distribution of ages on this interval. That is, if you know in general how ages are distributed over the interval (20, 40), you could use Bayes’ theorem to compute the posterior distribution on age, given that age was recorded as 30.
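
As a sketch of that Bayes’ theorem calculation, assuming purely for illustration a uniform prior over integer ages 20 through 39:

    def likelihood(recorded, age, base=10):
        # P(recorded value | true age) under stochastic rounding
        d = (age // base) * base
        x = age - d
        if recorded == d:
            return 1 - x / base
        if recorded == d + base:
            return x / base
        return 0.0

    ages = range(20, 40)
    prior = {a: 1 / len(ages) for a in ages}   # hypothetical uniform prior

    # Posterior P(age | recorded == 30), by Bayes' theorem
    unnormalized = {a: prior[a] * likelihood(30, a) for a in ages}
    total = sum(unnormalized.values())
    posterior = {a: p / total for a, p in unnormalized.items()}

With a uniform prior the posterior is triangular, peaking at 30; a more realistic prior, like the undergraduate example above, would skew it toward the more common ages.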

Stochastic rounding preserves more information than round-to-nearest on average, but less information in the case of a particular individual.


Computed IDs and privacy implications

Thirty years ago, a lot of US states thought it would be a good idea to compute someone’s driver’s license number (DLN) from their personal information [1]. In 1991, fifteen states simply used your Social Security number as your DLN. Eleven other states computed DLNs by applying a hash function to personal information such as name, birth date, and sex. A few other states based DLNs in part, but not entirely, on personal information.

Presumably things have changed a lot since then. If you know of any states that still do this, please let me know in the comments. Even if states have stopped computing DLNs from personal data, I’m sure many organizations still compute IDs this way.

The article I stumbled on from 1991 gave no hint that encoding personal information into an ID number could be a problem. And at the time it wasn’t as much of a problem as it would be now.

Why is it a problem if IDs are computed from personal data? People don’t realize what information they’re giving away. Maybe they would be willing to give someone their personal information, but not their DLN, or vice versa, not realizing that the two are equivalent. They also don’t realize what information about them someone may already have; a little bit more info may be all an attacker needs. And they don’t realize the potential consequences of their loss of privacy.

In some cases the hashing functions were complicated, but not too complicated to carry out by hand. And even if states were applying a cryptographic hash function, which they certainly were not, this would still be a problem for reasons explained here. If you have a database of personal information, say from voter registration records, you could compute the hash value of everyone in the state, or at least a large enough portion that you stand a good chance of being able to reverse a hashed value.
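
Here is a toy illustration in Python. The ID scheme, names, and fields are made up, and states certainly were not using SHA-256, but the attack is the same regardless of the hash function: enumerate candidate inputs and build a reverse lookup table.

    import hashlib

    def hashed_id(name, birth_date, sex):
        # Hypothetical ID scheme: a hash of personal information
        data = f"{name}|{birth_date}|{sex}".encode()
        return hashlib.sha256(data).hexdigest()[:12]

    # An attacker with a list of personal records, say from voter registration,
    # hashes every record and can then reverse any ID they come across.
    voter_roll = [("Jane Doe", "1970-03-15", "F"), ("John Roe", "1965-11-02", "M")]
    reverse_lookup = {hashed_id(*person): person for person in voter_roll}

    mystery_id = hashed_id("Jane Doe", "1970-03-15", "F")
    print(reverse_lookup.get(mystery_id))   # recovers Jane Doe's information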


[1] Joseph A. Gallian. Assigning Driver’s License Numbers. Mathematics Magazine, Vol. 64, No. 1 (Feb. 1991), pp. 13–22.

What is a privacy budget?


The idea behind differential privacy is that it doesn’t make much difference whether your data is in a data set or not. How much difference your participation makes is made precise in terms of probability statements. The exact definition doesn’t matter for this post, but it matters that there is an exact definition.

Someone designing a differentially private system sets an upper limit on the amount of difference anyone’s participation can make. That’s the privacy budget. The system will allow someone to ask one question that uses the whole privacy budget, or a series of questions whose total impact is no more than that one question.

If you think of a privacy budget in terms of money, maybe your privacy budget is $1.00. You could ask a single $1 question if you’d like, but you couldn’t ask any more questions after that. Or you could ask one $0.30 question and seven $0.10 questions.

Metaphors can be dangerous, but the idea of comparing cumulative privacy impact to a financial budget is a good one. You have a total amount you can spend, and you can choose how you spend it.
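
As a minimal sketch of that accounting, assuming basic sequential composition where the ε costs of successive queries simply add (the class and its interface are made up for illustration):

    class PrivacyBudget:
        def __init__(self, total):
            self.total = total    # total epsilon available
            self.spent = 0.0

        def charge(self, epsilon):
            # Basic composition: the costs of successive queries add up.
            if self.spent + epsilon > self.total:
                raise RuntimeError("privacy budget exceeded")
            self.spent += epsilon

    budget = PrivacyBudget(total=1.0)
    budget.charge(0.3)               # one "30-cent" question
    for _ in range(7):
        budget.charge(0.1)           # seven "10-cent" questions
    # Any further query would raise an error.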

The only problem with privacy budgets is that they tend to be overly cautious because they’re based on worst-case estimates. There are several ways to mitigate this. A simple way to stretch privacy budgets is to cache query results. If you ask a question twice, you get the same answer both times, and you’re only charged once.

(Recall that differential privacy adds a little random noise to query results to protect privacy. If you could ask the same question over and over, you could average your answers, reducing the level of added noise, and so a differentially private system will rightly charge you repeatedly for repeated queries. But if the system adds noise once and remembers the result, there’s no harm in giving you back that same answer as often as you ask the question.)
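
Here is a sketch of what caching might look like for counting queries, reusing the hypothetical PrivacyBudget class above; the Laplace noise scale of 1/ε is the standard choice for queries with sensitivity 1.

    import numpy as np

    class CachingCounter:
        def __init__(self, data, budget):
            self.data = data
            self.budget = budget      # a PrivacyBudget as sketched above
            self.cache = {}

        def count(self, name, predicate, epsilon):
            # Repeating a query returns the cached noisy answer at no charge.
            if name in self.cache:
                return self.cache[name]
            self.budget.charge(epsilon)
            true_count = sum(1 for row in self.data if predicate(row))
            noisy = true_count + np.random.laplace(scale=1.0 / epsilon)
            self.cache[name] = noisy
            return noisy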

A more technical way to get more from a privacy budget is to use Rényi differential privacy (RDP) rather than the original ε-differential privacy. RDP simplifies privacy budget accounting because it has simple composition rules, and it makes privacy budgets stretch further by leaning a bit away from worst-case analysis and toward average-case analysis. RDP depends on a tuning parameter, and ε-differential privacy is a limiting case of RDP, so one can control how much RDP behaves like ε-differential privacy by adjusting that parameter.
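
For example, one standard way to translate an RDP guarantee back into the more familiar (ε, δ) form is the conversion from Mironov’s RDP paper; here is a small sketch of it, with made-up numbers.

    from math import log

    def rdp_to_dp(alpha, eps_rdp, delta):
        # An (alpha, eps_rdp)-RDP guarantee implies
        # (eps_rdp + log(1/delta)/(alpha - 1), delta)-differential privacy.
        return eps_rdp + log(1 / delta) / (alpha - 1)

    print(rdp_to_dp(alpha=10, eps_rdp=0.5, delta=1e-6))   # roughly 2.04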

There are other ways to stretch privacy budgets as well. The net effect is that when querying a large database, you can often ask all the questions you like, and get sufficiently accurate answers, without worrying about the privacy budget.


Amendment to CCPA regarding personal information

California’s new privacy law takes effect January 1, 2020, less than 100 days from now. The bill was written in a hurry in order to prevent a similar measure from appearing as a ballot initiative. The thought was that the state legislature would pass something quickly and then clean it up later with amendments.

Six amendments were passed recently, and the deadline for further amendments has passed. California governor Gavin Newsom has until October 13 to either sign or veto each of the amendments.

This post will look at just one of the six amendments, AB-874, and what it means for personal information. The text of the amendment repeats the text of the original law, so I ran a diff tool on the two in order to see what changed.

Update: Governor Newsom signed amendment AB-874 into law on October 11, 2019.

In a couple of instances, capable was changed to reasonably capable.

“Personal information” means information that identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household. Personal information includes, but is not limited to, the following if it identifies, relates to, describes, is reasonably capable of being associated with, or could be reasonably linked, directly or indirectly, with a particular consumer or household …

“Capable” is awfully broad. Almost anything is capable of being associated with a particular consumer or household, so adding reasonable was reasonable. You see something similar in the HIPAA privacy rule when it speaks of “reasonably available information.”

The amendment also removed a clause that was ungrammatical and nonsensical as far as I can tell:

… “publicly available” means information that is lawfully made available from federal, state, or local government records, if any conditions associated with such information.

The following sentence from the CCPA was also removed in the amendment:

Information is not “publicly available” if that data is used for a purpose that is not compatible with the purpose for which the data is maintained and made available in the government records or for which it is publicly maintained.

I suppose the idea behind removing this line was that data is either publicly available or it’s not. Once information is publicly available, it’s kinda hard to ask people to act as if it’s not publicly available for some uses.

The final change appears to be correcting a mistake. The amendment replaces “Publicly available” with “Personal information” in the following sentence:

“Personal information” does not include consumer information that is deidentified or aggregate consumer information.

It makes no sense to say public information does not include deidentified information. You might deidentify data precisely because you want to make it public. I believe the author of this line of the CCPA meant to say what the amendment says, that deidentified and aggregate information are not considered personal.

***

As I have pointed out elsewhere, I am not a lawyer. Nor am I a lepidopterist, auto mechanic, or cosmetologist. Nothing here should be considered legal advice. Nor should it be considered advice on butterflies, cars, or hair care.


Right to be forgotten


The GDPR’s right-to-be-forgotten has been in the news this week. This post will look at a couple of news stories and how they relate.

Forgetting about a stabbing

On Monday the New York Times ran a story about an Italian news site whose troubles unfolded as a result of resisting requests to hide a story about a stabbing.

In 2008, Umberto Pecoraro stabbed his brother Vittorio in a restaurant with a fish knife. The victim, Vittorio, said that the news story violated his privacy and demanded that it be taken down, citing the right-to-be-forgotten clause in the GDPR. The journalist, Alessandro Biancardi, argued that the public’s right to know outweighed the right to be forgotten, but he lost the argument and lost his business.

The Streisand effect

This story is an example of the Streisand effect, making something more public by trying to keep it private. People around the world now know about the Pecoraro brothers’ fight only because one of them fought to suppress the story. I would never have known about local news from Pescara, Italy, if the NYT hadn’t brought it to my attention [1].

Extending EU law beyond the EU

Will Vittorio Pecoraro now ask the NYT to take down their story? Will he ask me to take down this blog post? Probably not in light of a story in the Los Angeles Times yesterday [2].

France’s privacy regulator had argued that the EU’s right-to-be-forgotten extends outside the EU, but the European Court of Justice ruled otherwise, sorta.

According to the LA Times story, the court ruled that Google did not have to censor search results to comply with the right to be forgotten, but it did need to “put measures in place to discourage internet users from going outside the EU to find the missing information.”

I’ve written about a right to be forgotten before and suggested that it is impractical if not logically impossible. It also seems difficult to have a world wide web subject to local laws. How can the EU both allow its citizens to access the open web and also hide things it doesn’t want them to see? That seems like an unstable equilibrium, with the two stable equilibria being an open web and a Chinese-style closed web.


[1] I also wouldn’t write about it. Although I think it’s impractical to legally require news sites to take down articles, I also think it’s often the decent thing to do. Someone’s life can be seriously harmed by easily accessible records of things they did long ago and regret. I would not stir up the Pecoraro story on my own. But since the NYT has far greater circulation than my blog, the story is already public.

[2] There’s an interesting meta-news angle here. If it’s illegal for Alessandro Biancardi to keep his story about the Pecoraro brothers online, would it be legal for someone inside the EU to write a story about the censoring of the Pecoraro brothers story like the NYT did?

You could argue that the Pecoraro brothers’ story should be taken down because it’s old news. But the battle over whether the story should be taken down is new news. Maybe Vittorio Pecoraro’s privacy outweighs the public’s right to know about the knife fight, but does his privacy outweigh the public’s right to know about news stories being taken down?

 

Three-digit zip codes and data privacy

Birth date, sex, and five-digit zip code are enough information to uniquely identify a large majority of Americans. See more on this here.

So if you want to deidentify a data set, the HIPAA Safe Harbor provision says you should chop off the last two digits of a zip code. And even though three-digit zip codes are larger than five-digit zip codes on average, some three-digit zip codes are still sparsely populated.

But if you use three-digit zip codes, and cut out sparsely populated zip3s, then you’re OK, right?

Well, there’s still a problem if you also report state. Ordinarily a zip3 fits within one state, but not always.

Five digit zip codes are each entirely contained within a state as far as I know [1]. But three-digit zip codes can straddle state lines. For example, about 200,000 people live in the three-digit zip code 834. The vast majority of these are in Idaho, but about 500 live in zip code 83414 which is in Wyoming. Zip code 834 is not sparsely populated, and doesn’t give much information about an individual by itself. But it is conditionally sparsely populated. It does carry a lot of information about an individual if that person lives in Wyoming.

On average, a three-digit zip code covers about 350,000 people. And so most of the time, the combination of zip3 and state covers 350,000 people. But in the example above, the combination of zip3 and state might narrow down to 500 people. In a group that small, birthday (just day of the year, not the full date) is enough to uniquely identify around 25% of the population. [2]


[1] A five-digit zip code can cross a geographic state line, though the USPS sees state borders differently. For example, the zip code 57638 is mostly in South Dakota, but a small portion of it extends slightly into North Dakota. However, the USPS considers addresses in this region to be South Dakota addresses because, from the perspective of a mail carrier, they practically are in South Dakota.

[2] The 25% figure came from exp(-500/365). See this post for details.
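
For what it’s worth, here is the arithmetic in Python, both the exact probability and the exponential approximation used above.

    from math import exp

    # Probability that a given person's day-of-year birthday is shared by
    # none of the other 499 people in a group of 500
    exact = (364 / 365) ** 499    # about 0.254
    approx = exp(-500 / 365)      # about 0.254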