Safe Harbor and the calendar rollover problem

elderly woman

Data privacy is subtle and difficult to regulate. The lawmakers who wrote the HIPAA privacy regulations took a stab at what would protect privacy when they crafted the “Safe Harbor” list. The list is neither necessary nor sufficient, depending on context, but it’s a start.

Extreme values of any measurement are more likely to lead to re-identification. Age in particular may be newsworthy. For example, a newspaper might run a story about a woman in the community turning 100. For this reason, the Safe Harbor provisions require that ages 90 and over be lumped together. Specifically,

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.

One problem with this rule is that “age 90” is a moving target. Suppose that last year, in 2018, a data set recorded that a woman was born in 1930 and had a child in 1960. This data set was considered de-identified under the Safe Harbor provisions and published in a medical journal. On New Year’s Day 2019, does that data suddenly become sensitive? Or on New Year’s Day 2020? Should the journal retract the paper?!
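To make the moving target concrete, here is a minimal Python sketch of the age rule. The function name `safe_harbor_age` is made up for illustration, and age is approximated from calendar years alone, ignoring month and day.

```python
def safe_harbor_age(birth_year: int, as_of_year: int) -> str:
    """Return an age label with ages 90 and over lumped into one category.

    Age is approximated as the difference of calendar years, i.e. the age
    the person reaches at some point during as_of_year.
    """
    age = as_of_year - birth_year
    return "90+" if age >= 90 else str(age)


# The same birth year falls into a different category as the calendar advances.
for year in (2018, 2019, 2020):
    print(year, safe_harbor_age(1930, year))
```

The birth year 1930 maps to 88 in 2018, 89 in 2019, and the aggregated category 90+ in 2020, even though the recorded data never changed.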

No additional information is conveyed by the passage of time per se. However, if we knew in 2018 that the woman in question was still alive, and we also know that she’s alive now in 2019, we have more information. Knowing that someone born in 1930 is alive in 2019 is more informative than knowing that the same person was alive in 2018; there are fewer people in the former category than in the latter category.

The hypothetical journal article, committed to print in 2018, does not become more informative in 2019. But an online version of the article, revised with new information in 2019 implying that the woman in question is still alive, does become more informative.

No law can cover every possible use of data, and it would be a bad idea to try. Such a law would be both overly restrictive in some cases and not restrictive enough in others. HIPAA’s expert determination provision allows a statistician to say, for example, that the above scenario is OK, even though it doesn’t satisfy the letter of the Safe Harbor rule.

Data privacy Twitter account

My newest Twitter account is Data Privacy (@data_tip). There I post tweets about ways to protect your privacy, statistical disclosure limitation, etc.

I had a clever idea for the icon, or so I thought. I started with the default Twitter icon, a sort of stylized anonymous person, and colored it with the same blue and white theme as the rest of my Twitter accounts. I think it looked so much like the default icon that most people didn’t register that it had been customized. It looked like an unpopular account, unlikely to post much content.

Now I’ve changed to the new icon below, and the number of followers is increasing.
data tip icon

Covered entities: TMRPA extends HIPAA

The US HIPAA law only protects the privacy of health data held by “covered entities,” which essentially means health care providers and insurance companies. If you give your heart monitoring data or DNA to your doctor, it comes under HIPAA. If you give it to Fitbit or 23andMe, it does not. Government entities are not covered by HIPAA either, a fact that Latanya Sweeney exploited to demonstrate how service dates can be used to identify individuals.

Texas passed the Texas Medical Records Privacy Act (a.k.a. HB 300 or TMRPA) to close this gap. Texas has a much broader definition of covered entity. In a nutshell, Texas law defines a covered entity to include anyone “assembling, collecting, analyzing, using, evaluating, storing, or transmitting protected health information.” The full definition, available here, says

“Covered entity” means any person who:

(A) for commercial, financial, or professional gain, monetary fees, or dues, or on a cooperative, nonprofit, or pro bono basis, engages, in whole or in part, and with real or constructive knowledge, in the practice of assembling, collecting, analyzing, using, evaluating, storing, or transmitting protected health information. The term includes a business associate, health care payer, governmental unit, information or computer management entity, school, health researcher, health care facility, clinic, health care provider, or person who maintains an Internet site;

(B) comes into possession of protected health information;

(C) obtains or stores protected health information under this chapter; or

(D) is an employee, agent, or contractor of a person described by Paragraph (A), (B), or (C) insofar as the employee, agent, or contractor creates, receives, obtains, maintains, uses, or transmits protected health information.

Inferring religion from fitness data

woman looking at fitness tracker

Fitness monitors reveal more information than most people realize. For example, it may be possible to infer someone’s religious beliefs from their heart rate data.

If you have location data, it’s trivial to tell whether someone is attending religious services. But you could make a reasonable guess from cardio monitoring data alone.

Muslim prayers occur at five prescribed times a day. If you could detect that someone is kneeling every day at precisely those prescribed times, it’s likely they are Muslim. Maybe they just happen to be stretching while Muslims are praying, but that’s less likely.

It should be possible to detect when a person is singing by looking at fitness data. If you find that someone is singing every Sunday morning, it’s likely they are attending a church service. And if someone is consistently singing on Saturday evenings, they may be attending a large church, likely Catholic, which added a Saturday night service. Maybe they just have Saturday evening voice lessons, but attending a church service is more likely.

Maybe you could infer that someone is an observant Jew because they are unusually inactive on Saturdays. Of course a lot of people take it easy on Saturdays. But if someone runs, for example, six days a week but not on Saturdays, something you could certainly tell from fitness data, that’s evidence that they may be Jewish. Not proof, but evidence.

All these inferences are fallible, of course. But that’s the nature of most privacy leaks. They don’t usually offer irrefutable evidence, but they update probabilities. One of the contributions of differential privacy is to acknowledge that all personal data leaks at least a little bit of information, and it’s better to acknowledge and control the amount of information leak than to pretend it doesn’t exist.
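Here is a small illustration of what “updating probabilities” means, using Bayes’ rule. All the numbers are hypothetical, chosen only to show the mechanics; they are not estimates of anything, and the function name `posterior` is just for this sketch.

```python
def posterior(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Bayes' rule: P(H | E) from P(H), P(E | H), and P(E | not H)."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)


# Hypothetical inputs: a 1% prior that a person has some attribute, and an
# activity pattern seen in 60% of people with the attribute but in only 2%
# of people without it.
print(posterior(0.01, 0.60, 0.02))
```

With these made-up inputs the probability moves from 1% to about 23%. That is far from proof, but it is a large update from a single weak signal.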

By the way, if you try to keep your Fitbit data from revealing your religion, you might reveal it anyway. This is called the Barbra Streisand Effect for reasons explained here. If you take off your Fitbit five times a day, just before the Muslim call to prayer, you’re still giving someone who has access to your data clues to your religious affiliation.

Riffing on mistakes

I mentioned on Twitter yesterday that one way to relieve the boredom of grading math papers is to explore mistakes. If a statement is wrong, what would it take to make it right? Is it approximately correct? Is there some different context where it is correct? Several people said they’d like to see examples, so this blog post is a sort of response.

***

One famous example of this is the so-called Freshman’s Dream theorem:

(a + b)^p = a^p + b^p

This is not true over the real numbers, but it is true, for example, when working with integers mod p.

(More generally, the Freshman’s Dream is true in any ring of characteristic p. This is more than an amusing result; it’s useful in applications of finite fields.)
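A quick brute-force check of the Freshman’s Dream for integers mod n, taking the exponent equal to the modulus, might look like the following sketch. The function name is made up for illustration.

```python
def freshmans_dream_holds(n: int) -> bool:
    """Check (a + b)^n == a^n + b^n (mod n) for all residues a, b mod n."""
    return all(
        pow(a + b, n, n) == (pow(a, n, n) + pow(b, n, n)) % n
        for a in range(n)
        for b in range(n)
    )


for n in (2, 3, 4, 5, 6, 7, 11, 13):
    print(n, freshmans_dream_holds(n))
```

The check passes for the primes in the list and fails for 4 and 6. For example, with a = b = 1 and n = 4, the left side is 16 mod 4 = 0 while the right side is 2.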

***

A common misunderstanding in calculus is that a series converges if its terms converge to zero. The canonical counterexample is the harmonic series. Its terms converge to zero, but the sum diverges.

But this can’t happen in the p-adic numbers. There, if the terms of a series converge to zero, the series converges (though maybe not absolutely).
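Over the real numbers, the behavior of the harmonic series is easy to see numerically. The sketch below compares partial sums with log(n); the helper name `harmonic` is just for illustration.

```python
import math


def harmonic(n: int) -> float:
    """Partial sum 1 + 1/2 + ... + 1/n of the harmonic series."""
    return math.fsum(1 / k for k in range(1, n + 1))


# The terms 1/n shrink to zero, but the partial sums keep growing,
# tracking log(n) plus a constant.
for n in (10, 1000, 100000):
    print(n, 1 / n, harmonic(n), math.log(n))
```

The gap between the partial sum and log(n) settles down to a constant (the Euler-Mascheroni constant, roughly 0.5772), so the partial sums grow without bound, just very slowly.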

***

Here’s something sorta along these lines. It looks wrong, and someone might arrive at it via a wrong understanding, but it’s actually correct.

sin(x - y) sin(x + y) = (sin(x) - sin(y)) (sin(x) + sin(y))
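If the identity sin(x - y) sin(x + y) = (sin(x) - sin(y)) (sin(x) + sin(y)) looks suspicious, a numerical spot check is reassuring. This is a sketch, not a proof: it evaluates both sides at random points and reports the largest discrepancy.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the check is repeatable


def identity_gap(x: float, y: float) -> float:
    """Absolute difference between the two sides of the identity at (x, y)."""
    lhs = math.sin(x - y) * math.sin(x + y)
    rhs = (math.sin(x) - math.sin(y)) * (math.sin(x) + math.sin(y))
    return abs(lhs - rhs)


worst = max(
    identity_gap(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(1000)
)
print(worst)  # expected to be on the order of floating point rounding error
```

Both sides equal sin(x)^2 - sin(y)^2, which is why the check passes.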

***

Odd integers end in odd digits, but that might not be true if you’re not working in base 10. See Odd numbers in odd bases.
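As a small illustration, here is a sketch that writes numbers in an arbitrary base. In an odd base every power of the base is odd, so the parity of a number matches the parity of its digit sum rather than the parity of its last digit. The helper name `digits` is made up for this example.

```python
def digits(n: int, base: int) -> list[int]:
    """Digits of a nonnegative integer n in the given base, most significant first."""
    if n == 0:
        return [0]
    out = []
    while n > 0:
        n, r = divmod(n, base)
        out.append(r)
    return out[::-1]


# In base 3, the odd number 3 is written 10 and the even number 4 is written 11.
print(digits(3, 3), digits(4, 3))

# In an odd base, parity is determined by the digit sum, not the last digit.
base = 7
assert all(n % 2 == sum(digits(n, base)) % 2 for n in range(1000))
```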

***

You can misunderstand how percentages work, but still get a useful result. See Sales tax included.

***

When probabilities are small, you can often get by with adding them together even when strictly speaking they don’t add. See Probability mistake can make a good approximation.
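Here is a sketch of how good the approximation can be. It assumes the events are independent, which is an assumption made for this example and not something the linked post necessarily requires; the function name `prob_any` is also made up.

```python
import math


def prob_any(probs: list[float]) -> float:
    """Exact probability that at least one of several independent events occurs."""
    return 1 - math.prod(1 - p for p in probs)


small = [0.01, 0.02, 0.005]
print(sum(small), prob_any(small))  # the naive sum slightly overestimates

big = [0.5, 0.6]
print(sum(big), prob_any(big))  # for large probabilities the sum can exceed 1
```

For the small probabilities the naive sum is about 0.035 against an exact value of about 0.0347. For the large ones the sum is 1.1, which is not even a probability, while the exact value is 0.8.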

Scaling up differential privacy: lessons from the US Census

The paper Issues Encountered Deploying Differential Privacy describes some of the difficulties the US Census Bureau has run into while deploying differential privacy for the 2020 census. It’s not surprising that they would have difficulties. It’s surprising that they would even consider applying differential privacy on such an enormous scale.

If your data project is smaller than the US Census, you can probably make differential privacy work.

US Census Bureau embraces differential privacy

The US Census Bureau is convinced that traditional methods of statistical disclosure limitation have not done enough to protect privacy. These methods may have been adequate in the past, but it no longer makes sense to implicitly assume that those who would like to violate privacy have limited resources or limited motivation. The Bureau has turned to differential privacy for quantifiable privacy guarantees that are independent of the attacker’s resources and determination.
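To give a feel for what a quantifiable guarantee looks like, here is a sketch of the textbook Laplace mechanism applied to a count. This is a standard introductory example, not a description of the Census Bureau’s implementation, and the function names are made up for illustration. The parameter `epsilon` is the privacy budget: smaller values mean more noise and stronger privacy.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise calibrated to epsilon.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(2019)
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, private_count(1000, epsilon, rng))
```

The amount of noise depends only on `epsilon` and the sensitivity of the query, not on any assumption about what the attacker knows or can compute. A production system would need more care than this sketch, for example in its source of randomness and its handling of floating point arithmetic.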

John Abowd, chief scientist for the US Census Bureau, gave a talk a few days ago (March 4, 2019) in which he discussed the need for differential privacy and how the Bureau is implementing differential privacy for the 2020 census.

Absolutely the hardest lesson in modern data science is the constraint on publication that the fundamental law of information recovery imposes. I usually call it the death knell for traditional method of publication, and not just in statistical agencies.

Congress and the Equifax data breach

Dialog from a congressional hearing on February 26, 2019.

Representative Katie Porter: My question for you is whether you would be willing to share today your social security, your birth date, and your address at this public hearing.

Equifax CEO Mark Begor: I would be a bit uncomfortable doing that, Congresswoman. If you’d so oblige me, I’d prefer not to.

KP: Could I ask you why you’re unwilling?

MB: Well that’s sensitive information. I think it’s sensitive information that I like to protect, and I think consumers should protect theirs.

KP: My question is then, if you agree that exposing this kind of information, information like that you have in your credit reports, creates harm, therefore you’re unwilling to share it, why are your lawyers arguing in federal court that there was no injury and no harm created by your data breach?

Miscellaneous

Image editor

Image editing software is complicated, and I don’t use it often enough to remember how to do much. I like Paint.NET on Windows because it is in a sort of sweet spot for me, more powerful than Paint and much less complicated than Photoshop.

I found out there’s a program Pinta for Linux that was inspired by Paint.NET. (Pinta runs on Windows, Mac, and BSD as well.)

Exponential sum of the day

I have a page that draws a different image every day, based on putting the month, day, and the last two digits of the year into an exponential sum. This year’s images have been more intricate than last year’s because 19 is prime.
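The formula is not spelled out here, so the following is only a sketch under an assumption: that the image plots the partial sums of exp(2πi(n/m + n²/d + n³/y)), where m, d, and y are the month, day, and two-digit year. The function name `expsum_points` is made up for illustration.

```python
import cmath
from math import lcm, pi


def expsum_points(m: int, d: int, y: int, count: int) -> list[complex]:
    """Partial sums of exp(2*pi*i*(n/m + n^2/d + n^3/y)) for n = 0..count-1.

    The phase is reduced mod 1 with exact integer arithmetic before
    converting to floating point, to avoid rounding error for large n.
    """
    denom = m * d * y
    points, total = [], 0j
    for n in range(count):
        numer = (n * d * y + n**2 * m * y + n**3 * m * d) % denom
        total += cmath.exp(2j * pi * numer / denom)
        points.append(total)
    return points


m, d, y = 2, 27, 19      # 2019-02-27
period = lcm(m, d, y)    # under the assumed form, the terms repeat with this period
points = expsum_points(m, d, y, period)
print(period, len(points))
```

Under this assumed form the terms repeat with period lcm(m, d, y), which is 1026 for this date. A prime year shares no factors with the month or day unless the day equals the year, so the period tends to be long, which is one plausible reason a prime year would give more intricate pictures.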

I liked today’s image.

exponential sum for 2019-02-27

The page has a link to details explaining the equation behind the image, and an animate link to let you see the sequence in which the points are traversed.

Podcast interview

Rebecca Herold posted a new episode of her podcast yesterday in which she asks me questions about privacy and artificial intelligence.

Entropy update

I updated my blog post on solving for probability from entropy because Sjoerd Visscher pointed out that a crude approximation I used could be made much more accurate with a minor tweak.

As a bonus, the new error plot looks cool.

approximation error on log scale

Newsletter

My monthly newsletter comes out tomorrow. This newsletter highlights the most popular blog posts of the month.

I used to say something each month about what I’m up to. Then I stopped because it got to be repetitive. Tomorrow I’ll include a few words about projects I have coming up.

The letter S

I was helping my daughter with physics homework last night and she said “Why do they use s for arc length?!” I said that I don’t know, but that it is conventional.

By the way, this section heading is a reference to Donald Knuth’s essay The Letter S where he writes in delightful Knuthian detail about the design of the letter S in TeX. You can find the essay in his book Literate Programming.

What sticks in your head

This morning I read an article by Dennis Felsing about his impressive/intimidating Linux desktop setup. He uses a lot of tools that are not the easiest way to get things done immediately but are long-term productivity investments.

Remembrance of syntax past

Felsing apparently is able to remember the syntax of scores of tools and programming languages. I cannot. Part of the reason is practice. I cannot remember the syntax of any software I don’t use regularly. It’s tempting to say that’s the end of the story: use it or lose it. Everybody has their set of things they use regularly and remember.

But I don’t think that’s all. I remember bits of math that I haven’t used in 30 years. Math fits in my head and sticks. Presumably software syntax sticks in the heads of people who use a lot of software tools.

There is some software syntax I can remember, however, and that’s software closely related to math. As I commented here, it was easy to come back to Mathematica and LaTeX after not using them for a few years.

Imprinting

Imprinting has something to do with this too: it’s easier to remember what we learn when we’re young. Felsing says he started using Linux in 2006, and his site says he graduated college in 2012, so presumably he was a high school or college student when he learned Linux.

When I was a student, my software world consisted primarily of Unix, Emacs, LaTeX, and Mathematica. These are all tools that I quit using for a few years, later came back to, and use today. I probably remember LaTeX and Mathematica syntax in part because I used them when I was a student. (I also think Mathematica in particular has an internal consistency that makes its syntax easier to remember.)

Picking your memory battles

I see the value in Felsing’s choice of tools. Take, for example, the xmonad window manager. I’ve tried it, and I could imagine that it would make you more productive if you mastered it. But I don’t see myself mastering it.

I’ve learned a few tools with lots of arbitrary syntax, e.g. Emacs. But since I don’t have a prodigious memory for such things, I have to limit the number of tools I try to keep loaded in memory. Other things I load as needed, such as a language a client wants me to use that I haven’t used in a while.

Revisiting a piece of math doesn’t feel to me like revisiting a programming language. Brushing up on something from differential equations, for example, feels like pulling a book off a mental shelf. Brushing up on C# feels like driving to a storage unit, bringing back an old couch, and struggling to cram it in the door.

Middle ground

There are things you use so often that you remember their syntax without trying. And there are things you may never use again, and it’s not worth memorizing their syntax just in case. Then there are things in the middle: things you don’t use often enough to remember naturally, but often enough that you’d like to remember them deliberately. Some of these are what I call bicycle skills, things that you can’t learn just-in-time. For things in this middle ground, you might try something like Anki, a flashcard program with spaced repetition.

However, this middle ground should be very narrow, at least in my experience/opinion. For the most part, if you don’t use something often enough to keep it loaded in memory, I’d say either let it go or practice using it regularly.
