State privacy laws to watch


A Massachusetts court ruled this week that obtaining real-time cell phone location data requires a warrant.

Utah has passed a law, taking effect next month, that goes further. Police in Utah will need a warrant to obtain location data or to search someone’s electronic files. (Surely electronic files are the contemporary equivalent of one’s “papers” under the Fourth Amendment.)

Vermont passed the nation’s first data broker law. It requires data brokers to register with the state and to implement security measures, but as far as I have read it doesn’t put much restriction on what they can do.

Texas law expands HIPAA’s notion of a “covered entity” so that it applies to basically anyone handling PHI (protected health information).

California’s CCPA law goes into effect on January 1. In some ways it’s analogous to GDPR. It will be interesting to see what the law ultimately means in practice. It’s expected that the state legislature will amend the law, and we’ll have to wait on precedents to find out in detail what the law prohibits and allows.


Safe Harbor and the calendar rollover problem


Data privacy is subtle and difficult to regulate. The lawmakers who wrote the HIPAA privacy regulations took a stab at what would protect privacy when they crafted the “Safe Harbor” list. The list is neither necessary nor sufficient, depending on context, but it’s a start.

Extreme values of any measurement are more likely to lead to re-identification. Age in particular may be newsworthy. For example, a newspaper might run a story about a woman in the community turning 100. For this reason, the Safe Harbor provisions require that ages 90 and older be lumped together. Specifically,

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.
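
To make the age portion of this rule concrete, here is a minimal sketch in Python. The function name and the “90+” output format are my own illustrative choices; the regulation only requires that ages over 89 be aggregated into a single category.

    def top_code_age(age):
        # Lump everyone age 90 or older into a single category,
        # per the Safe Harbor provision quoted above.
        return "90+" if age > 89 else str(age)

    print(top_code_age(47))   # 47
    print(top_code_age(93))   # 90+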

One problem with this rule is that “age 90” is a moving target. Suppose that last year, in 2018, a data set recorded that a woman was born in 1930 and had a child in 1960. This data set was considered de-identified under the Safe Harbor provisions and published in a medical journal. On New Year’s Day 2019, does that data suddenly become sensitive? Or on New Year’s Day 2020? Should the journal retract the paper?!

No additional information is conveyed by the passage of time per se. However, if we knew in 2018 that the woman in question was still alive, and we also know that she’s alive now in 2019, we have more information. Knowing that someone born in 1930 is alive in 2019 is more informative than knowing that the same person was alive in 2018; there are fewer people in the former category than in the latter category.

The hypothetical journal article, committed to print in 2018, does not become more informative in 2019. But an online version of the article, revised with new information in 2019 implying that the woman in question is still alive, does become more informative.

No law can cover every possible use of data, and it would be a bad idea to try. Such a law would be both overly restrictive in some cases and not restrictive enough in others. HIPAA’s expert determination provision allows a statistician to say, for example, that the above scenario is OK, even though it doesn’t satisfy the letter of the Safe Harbor rule.


Data privacy Twitter account

My newest Twitter account is Data Privacy (@data_tip). There I post tweets about ways to protect your privacy, statistical disclosure limitation, etc.

I had a clever idea for the icon, or so I thought. I started with the default Twitter icon, a sort of stylized anonymous person, and colored it with the same blue and white theme as the rest of my Twitter accounts. I think it looked so much like the default icon that most people didn’t register that it had been customized. It looked like an unpopular account, unlikely to post much content.

Now I’ve changed to the new icon below, and the number of followers is increasing.
data tip icon


Covered entities: TMPRA extends HIPAA

The US HIPAA law only protects the privacy of health data held by “covered entities,” which essentially means health care providers and insurance companies. If you give your heart monitoring data or DNA to your doctor, it comes under HIPAA. If you give it to Fitbit or 23andMe, it does not. Government entities are not covered by HIPAA either, a fact that Latanya Sweeney exploited to demonstrate how service dates can be used to identify individuals.

Texas passed the Texas Medical Records Privacy Act (a.k.a. HB 300 or TMPRA) to close this gap. Texas has a much broader definition of covered entity. In a nutshell, Texas law defines a covered entity to include anyone “assembling, collecting, analyzing, using, evaluating, storing, or transmitting protected health information.” The full definition, available here, says

“Covered entity” means any person who:

(A) for commercial, financial, or professional gain, monetary fees, or dues, or on a cooperative, nonprofit, or pro bono basis, engages, in whole or in part, and with real or constructive knowledge, in the practice of assembling, collecting, analyzing, using, evaluating, storing, or transmitting protected health information. The term includes a business associate, health care payer, governmental unit, information or computer management entity, school, health researcher, health care facility, clinic, health care provider, or person who maintains an Internet site;

(B) comes into possession of protected health information;

(C) obtains or stores protected health information under this chapter; or

(D) is an employee, agent, or contractor of a person described by Paragraph (A), (B), or (C) insofar as the employee, agent, or contractor creates, receives, obtains, maintains, uses, or transmits protected health information.


Inferring religion from fitness data


Fitness monitors reveal more information than most people realize. For example, it may be possible to infer someone’s religious beliefs from their heart rate data.

If you have location data, it’s trivial to tell whether someone is attending religious services. But you could make a reasonable guess from cardio monitoring data alone.

Muslim prayers occur at five prescribed times a day. If you could detect that someone is kneeling every day at precisely those prescribed times, it’s likely they are Muslim. Maybe they just happen to be stretching while Muslims are praying, but that’s less likely.

It should be possible to detect when a person is singing by looking at fitness data. If you find that someone is singing every Sunday morning, it’s likely they are attending a church service. And if someone is consistently singing on Saturday evenings, they may be attending a large church, likely Catholic, that has added a Saturday evening service. Maybe they just have Saturday evening voice lessons, but attending a church service is more likely.

Maybe you could infer that someone is an observant Jew because they are unusually inactive on Saturdays. Of course a lot of people take it easy on Saturdays. But if someone runs, for example, six days a week but not on Saturdays, something you could certainly tell from fitness data, that’s evidence that they may be Jewish. Not proof, but evidence.

All these inferences are fallible, of course. But that’s the nature of most privacy leaks. They don’t usually offer irrefutable evidence, but they update probabilities. One of the contributions of differential privacy is to acknowledge that all personal data leaks at least a little bit of information, and that it’s better to acknowledge and control the amount of leakage than to pretend it doesn’t exist.

By the way, if you try to keep your Fitbit data from revealing your religion, you might reveal it anyway. This is called the Barbra Streisand effect for reasons explained here. If you take off your Fitbit five times a day, just before the Muslim call to prayer, you’re still giving someone who has access to your data clues to your religious affiliation.


US Census Bureau embraces differential privacy

The US Census Bureau is convinced that traditional methods of statistical disclosure limitation have not done enough to protect privacy. These methods may have been adequate in the past, but it no longer makes sense to implicitly assume that those who would like to violate privacy have limited resources or limited motivation. The Bureau has turned to differential privacy for quantifiable privacy guarantees that are independent of the attacker’s resources and determination.

John Abowd, chief scientist for the US Census Bureau, gave a talk a few days ago (March 4, 2019) in which he discussed the need for differential privacy and how the Bureau is implementing it for the 2020 census. From that talk:

Absolutely the hardest lesson in modern data science is the constraint on publication that the fundamental law of information recovery imposes. I usually call it the death knell for traditional methods of publication, and not just in statistical agencies.


Congress and the Equifax data breach

Dialog from a congressional hearing February 26, 2019.

Representative Katie Porter: My question for you is whether you would be willing to share today your social security, your birth date, and your address at this public hearing.

Equifax CEO Mark Begor: I would be a bit uncomfortable doing that, Congresswoman. If you’d so oblige me, I’d prefer not to.

KP: Could I ask you why you’re unwilling?

MB: Well that’s sensitive information. I think it’s sensitive information that I like to protect, and I think consumers should protect theirs.

KP: My question is then, if you agree that exposing this kind of information, information like that you have in your credit reports, creates harm, therefore you’re unwilling to share it, why are your lawyers arguing in federal court that there was no injury and no harm created by your data breach?


Supercookies


Supercookies, also known as evercookies or zombie cookies, are like browser cookies in that they can be used to track you, but are much harder to remove.

What is a supercookie?

The way I first heard supercookies described was as a cookie that you can appear to delete, but as soon as you do, software rewrites the cookie. Like the Hydra from Greek mythology, cutting off a head does no good because it grows back [1].

This explanation is oversimplified. It doesn’t quite work that way.

A supercookie is not a cookie per se. It’s anything that can be used to uniquely identify your browser: font fingerprinting, flash cache, cached images, browser plugins and preferences, etc. Deleting your cookies has no effect because a supercookie is not a cookie.
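
To illustrate the idea, here is a minimal sketch with made-up attribute names and values: combine whatever stable signals a script can observe into a hash, and you have an identifier that clearing cookies does not touch. Real fingerprinting scripts collect far more signals and run in the browser rather than in Python.

    import hashlib

    # Hypothetical browser attributes; real scripts gather many more signals.
    attributes = {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "fonts": "Arial, Calibri, Courier New",
        "screen": "1920x1080x24",
        "timezone": "America/Chicago",
        "language": "en-US",
    }

    canonical = "|".join(f"{k}={v}" for k, v in sorted(attributes.items()))
    fingerprint = hashlib.sha256(canonical.encode()).hexdigest()
    print(fingerprint[:16])   # stable identifier, unaffected by deleting cookies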

However, a supercookie can work with other code to recreate deleted cookies, and so the simplified description is not entirely wrong. A supercookie could alert web sites that a cookie has been deleted, and allow those sites to replace that cookie, or update the cookie if some browser data has changed.

What about ‘Do Not Track’?

You can ask sites to not track you, but this works on an honor system and is ignored with impunity, even (especially?) by the best known companies.

Apple has announced that it is removing Do Not Track from its Safari browser because the feature is worse than useless. Servers don’t honor it, and it gives a false sense of privacy. Not only that, the DNT setting is one more bit that servers could use to identify you! Because only about 9% of users turn on DNT, knowing that someone has it turned on gives about 3.5 bits of information toward identifying that person.
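
As a back-of-the-envelope check, observing a flag that only about 9% of users set carries −log₂(0.09) ≈ 3.5 bits of information. A one-liner confirms the arithmetic:

    import math

    p = 0.09                # assumed fraction of users with DNT turned on
    print(-math.log2(p))    # about 3.47 bits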

How to remove supercookies

How do you remove supercookies? You can’t. As explained above, a supercookie isn’t a file that can be removed. It’s a procedure for exploiting a combination of data.

You could remove specific ways that sites try to identify you. You could, for example, remove Flash to thwart attempts to exploit Flash’s data, cutting off one head of the Hydra. This might block the way some companies track you, but there are others.

It’s an arms race. As fingerprinting techniques become well known, browser developers and users try to block them, and those intent on identifying you come up with more creative approaches.

The economics of identification

Given the effort companies put into identifying individuals (or at least their devices), it seems it must be worth it. At least companies believe it’s worth it, and for some it probably is. But there are reasons to believe that tracking isn’t as valuable as it seems. For example, this article argues that the most valuable targeting information is freely given. After all, you know who is interested in buying weighted blankets? People who search on weighted blankets!

There have been numerous anecdotal stories recently of companies that have changed their marketing methods in order to comply with GDPR and have increased their sales. These are only anecdotes, but they suggest that at least for some companies, there are profitable alternatives to identifying customers who don’t wish to be identified.


[1] In the Greek myth, cutting off one head of the Hydra caused two heads to grow back. Does deleting a supercookie cause it to come back stronger? Maybe. Clearing your cookies is another user behavior that can be used to fingerprint you.

Normal approximation to Laplace distribution?

I heard the phrase “normal approximation to the Laplace distribution” recently and did a double take. The normal distribution does not approximate the Laplace!

Normal and Laplace distributions

A normal distribution has the familiar bell curve shape. A Laplace distribution, also known as a double exponential distribution, is pointed in the middle, like a pole holding up a circus tent.

Normal and Laplace probability density functions

A normal distribution has very thin tails, i.e. probability density drops very rapidly as you move further from the middle, like exp(-x²). The Laplace distribution has moderate tails [1], going to zero like exp(-|x|).
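
A quick numerical comparison makes the difference in tails concrete. Here I use the standard normal and a Laplace distribution with scale 1; that particular pairing is just a convenient choice for illustration, not a matched approximation.

    from scipy.stats import norm, laplace

    # Density values at increasing distance from the center
    for x in [0.0, 1.0, 2.0, 4.0, 6.0]:
        print(f"{x:4.1f}  normal: {norm.pdf(x):.2e}   Laplace: {laplace.pdf(x):.2e}")

At x = 6 the Laplace density is still around 10⁻³, while the normal density has dropped below 10⁻⁸.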

Normal and Laplace distributions are thus qualitatively very different, both in the center and in the tails. So why would you want to replace one with the other?

Statistics meets differential privacy

The normal distribution is convenient to use in mathematical statistics. Whether it is realistic in application depends on context, but it’s convenient and conventional. The Laplace distribution is convenient and conventional in differential privacy. There’s no need to ask whether it is realistic because Laplace noise is added deliberately; the distribution assumption is exactly correct by construction. (See this post for details.)
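
For readers who haven’t seen it, here is a minimal sketch of the Laplace mechanism the preceding paragraph alludes to. The function name and interface are my own, not from any particular library; the essential point is that the noise is drawn from a Laplace distribution whose scale is the query’s sensitivity divided by ε.

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        # Add Laplace noise with scale = sensitivity / epsilon.
        scale = sensitivity / epsilon
        return true_answer + np.random.default_rng().laplace(0.0, scale)

    # Example: a counting query changes by at most 1 when one person is
    # added or removed, so its sensitivity is 1.
    print(laplace_mechanism(true_answer=1000, sensitivity=1, epsilon=0.5))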

When mathematical statistics and differential privacy combine, it could be convenient to “approximate” a Laplace distribution by a normal distribution [2].

Solving for parameters

So if you wanted to replace a Laplace distribution with a normal distribution, which one would you choose? Both distributions are symmetric about their means, so it’s natural to pick the means to be the same. So without loss of generality, we’ll assume both distributions have mean 0. The question then becomes how to choose the scale parameters.

You could just set the two scale parameters to be the same, but that’s similar to the Greek letter fallacy, assuming two parameters have the same meaning just because they have the same symbol. Because the two distributions have different tail weights, their scale parameters serve different functions.

One way to replace a Laplace distribution with a normal would be to pick the scale parameter of the normal so that the two distributions match at a given quantile. For example, you might want both distributions to have 95% of their probability mass in the same interval.

I’ve written before about how to solve for scale parameters given two quantiles. We find two quantiles of the Laplace distribution, then use the method in that post to find the corresponding normal distribution scale (standard deviation).

The Laplace distribution with scale s has density

f(x) = exp(−|x|/s) / (2s).

If we want to solve for the quantile x such that Prob(X ≤ x) = p, we have

x = −s log(2 − 2p) for p ≥ 1/2.

Using the formula derived in the previously mentioned post,

σ = 2x / (Φ⁻¹(p) − Φ⁻¹(1 − p)) = x / Φ⁻¹(p)

where Φ is the cumulative distribution function of the standard normal.
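
Here’s a quick sanity check of the procedure using scipy, with Laplace scale 1 and p = 0.975 (so that 95% of the probability mass lies in (−x, x)); these particular numbers are just an example.

    from scipy.stats import laplace, norm

    s, p = 1.0, 0.975

    x = laplace.ppf(p, scale=s)    # same as -s*log(2 - 2p) for p >= 1/2
    sigma = x / norm.ppf(p)        # normal scale whose p-quantile is also x

    print(x, sigma)                        # about 3.00 and 1.53
    print(norm.cdf(x, scale=sigma))        # 0.975, as intended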


[1] The normal distribution is the canonical example of a thin-tailed distribution, while exponential tails are conventionally the boundary between thick and thin. “Thick tailed” and “thin tailed” are often taken to mean thicker than exponential and thinner than exponential respectively.

[2] You could use a Gaussian mechanism rather than a Laplace mechanism for similar reasons, but this makes the differential privacy theory more complicated. Rather than working with ε-differential privacy you have to work with (ε, δ)-differential privacy. The latter is messier and harder to interpret.