Identifying someone from their heartbeat

electrocardiogram of a toddler

How feasible would it be to identify someone from electrocardiogram (EKG, ECG) data? (Apparently the abbreviation “EKG” is more common in America and “ECG” is more common in the UK.)

Electrocardiograms are unique, but unique doesn’t necessarily mean identifiable. Unique data isn’t identifiable without some way to map it to identities. If you shuffle a deck of cards, you will probably produce an arrangement that has never occurred before. But without some sort of registry mapping card deck orders to their shufflers, there’s no chance of identification. (For identification, you’re better off dusting the cards for fingerprints, because there are registries of fingerprints.)

According to one survey [1], researchers have tried a wide variety of methods for identifying people from electrocardiograms. They’ve used time-domain features such as peak amplitudes, slopes, variances, etc., as well as a variety of frequency-domain (DFT) features. It seems that all these methods work moderately well, but none are great, and there’s no consensus regarding which approach is best.
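For a sense of what such features look like, here is a minimal sketch in Python. The array ecg, the sampling rate fs, and the particular features (R-peak amplitudes, RR-interval statistics, low-frequency DFT magnitudes) are my own illustrative choices, not the specific features used in any study in the survey.

    import numpy as np
    from scipy.signal import find_peaks

    def ecg_features(ecg, fs):
        """Extract simple time- and frequency-domain features from an ECG.

        ecg: 1-D numpy array of samples; fs: sampling rate in Hz.
        The features below are illustrative, not taken from the survey.
        """
        # Time domain: locate R peaks (the tall spikes) and summarize them.
        peaks, _ = find_peaks(ecg, height=ecg.mean() + 2 * ecg.std(),
                              distance=int(0.4 * fs))  # at most ~150 bpm
        rr = np.diff(peaks) / fs                        # RR intervals, seconds
        time_features = [ecg[peaks].mean(), ecg[peaks].var(),
                         rr.mean(), rr.var()]

        # Frequency domain: magnitudes of the first few DFT coefficients.
        spectrum = np.abs(np.fft.rfft(ecg - ecg.mean()))
        freq_features = list(spectrum[:10] / len(ecg))

        return np.array(time_features + freq_features)

Identification then amounts to comparing such feature vectors against a gallery of feature vectors from known individuals, e.g. by nearest-neighbor search or a classifier.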

If you have two EKGs on someone, how readily can you tell that they belong to the same person? The answer depends on the size of the set of EKGs you’re comparing against. The studies surveyed in [1] do some sort of similarity search, comparing a single EKG to tens of candidates. The methods surveyed had an overall success rate of around 95%. But these studies were based on small populations; at least at the time of publication, no one had looked at matching a single EKG against thousands of possible matches.

In short, an electrocardiogram can identify someone with high probability once you know that they belong to a relatively small set of people for which you have electrocardiograms.


[1] Antonio Fratini et al. Individual identification via electrocardiogram analysis. Biomed Eng Online. 2015; 14: 78. doi 10.1186/s12938-015-0072-y

Is there a zip code that equals its population?

US stamp from 1973 promoting zip codes

I noticed yesterday that the population in a zip code near me is roughly equal to the zip code itself. So I wondered:

Does any zip code equal its population?

Yes, it’s a silly question. A zip code isn’t a quantity. Populations are always changing. Zip code boundaries are always changing. Etc.

The answer, according to the data I had on hand, is almost.

Smallest absolute error: Zip code 00674 has population 672.

Smallest relative error: Zip code 42301 has population 42319.
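For what it’s worth, once you have a table of populations the search is a few lines of code. The sketch below assumes a CSV file named zcta_population.csv with columns zcta and population; the file name and layout are hypothetical stand-ins for whatever form your Census data takes.

    import csv

    with open("zcta_population.csv") as f:
        rows = [(r["zcta"], int(r["population"])) for r in csv.DictReader(f)]

    abs_err = min(rows, key=lambda r: abs(int(r[0]) - r[1]))
    rel_err = min((r for r in rows if r[1] > 0),
                  key=lambda r: abs(int(r[0]) - r[1]) / r[1])

    print("Smallest absolute error:", abs_err)
    print("Smallest relative error:", rel_err)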

I’ve had to learn some of the intricacies of zip codes in the course of my work on data privacy. I found out that zip codes are more complicated than I ever would have thought.

For one thing, the US Census doesn’t exactly report data by zip code but by zip code tabulation area (ZCTA) for reasons that make sense but are too complicated to get into here. This is another reason why the question posed here is fuzzy; we don’t know the populations of zip codes unless they coincide with ZCTAs.

Update: Several people have told me they expected this post to be an existence argument, such as saying that by the pigeonhole principle some zip code has to equal its population. That’s not the case, at least for real, populated US zip codes. (Though Andrew Gelman pointed out that 00000 is not populated, so there’s that.)

Let P be a total population spread over n zip codes, and assume every zip code is populated. If P = n, then every zip code has exactly one resident, so if one of the zip codes is numbered 1, that zip code equals its population. But other than trivial cases like that, it’s easy to avoid any equalities between zip codes and their populations.

A more interesting question goes in the opposite direction: is it possible that every zip code equals its population? Clearly not if n(n + 1)/2 > P, because the left side is the smallest possible total population when each of n zip codes equals its non-zero population. But it turns out that n(n + 1)/2 ≤ P is sufficient: put j people in zip code j for j = 1 through n − 1, and let k = P − n(n − 1)/2 be the number of people remaining. Put them all in zip code k. Since k ≥ n, k is a zip code that hasn’t been used before, and it equals its population.
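Here is a short sketch of that construction, just to make the bookkeeping concrete.

    def equal_assignment(P, n):
        """Assign P people to n zip codes so each code equals its population.
        Requires n*(n+1)/2 <= P."""
        assert n * (n + 1) // 2 <= P
        assignment = {j: j for j in range(1, n)}  # j people in zip code j
        k = P - n * (n - 1) // 2                  # everyone left over
        assignment[k] = k                         # k >= n, so k is a new code
        return assignment

    print(equal_assignment(20, 4))  # {1: 1, 2: 2, 3: 3, 14: 14}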


This post started out as a Twitter thread.

The image above is a US stamp from 1973 promoting the initial use of zip codes.

Three composition theorems for differential privacy

This is a brief post, bringing together three composition theorems for differential privacy.

  1. The composition of an ε₁-differentially private algorithm and an ε₂-differentially private algorithm is an (ε₁ + ε₂)-differentially private algorithm.
  2. The composition of an (ε₁, δ₁)-differentially private algorithm and an (ε₂, δ₂)-differentially private algorithm is an (ε₁ + ε₂, δ₁ + δ₂)-differentially private algorithm.
  3. The composition of an (α, ε₁)-Rényi differentially private algorithm and an (α, ε₂)-Rényi differentially private algorithm is an (α, ε₁ + ε₂)-Rényi differentially private algorithm.

The three composition rules can be summarized briefly as follows:

ε₁ ∘ ε₂ → (ε₁ + ε₂)
(ε₁, δ₁) ∘ (ε₂, δ₂) → (ε₁ + ε₂, δ₁ + δ₂)
(α, ε₁) ∘ (α, ε₂) → (α, ε₁ + ε₂)

What is the significance of these composition theorems? In short, ε-differential privacy and Rényi differential privacy compose as one would hope, but (ε, δ)-differential privacy does not.

The first form of differential privacy proposed was ε-differential privacy. It is relatively easy to interpret and composes nicely, but it can be too rigid.

If you have Gaussian noise, for example, you are led naturally to (ε, δ)-differential privacy. The δ term is hard to interpret. Roughly speaking, you could think of it as the probability that ε-differential privacy fails to hold. Unfortunately, with (ε, δ)-differential privacy the epsilons add and so do the deltas. We would prefer that δ didn’t grow with composition.

Rényi differential privacy is a generalization of ε-differential privacy that uses a family of information measures indexed by α to measure the impact of a single row being or not being in a database. The case of α = ∞ corresponds to ε-differential privacy, but finite values of α tend to be less pessimistic. The nice thing about the composition theorem for Rényi differential privacy is that the α parameter doesn’t change, unlike the δ parameter in (ε, δ)-differential privacy.
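To make the bookkeeping concrete, here is a minimal sketch of the three rules as arithmetic on privacy parameters. The function names are my own; they don’t come from any particular library.

    def compose_pure(eps1, eps2):
        """epsilon-DP: the epsilons add."""
        return eps1 + eps2

    def compose_approx(eps1, delta1, eps2, delta2):
        """(epsilon, delta)-DP: the epsilons add and, unfortunately, so do the deltas."""
        return eps1 + eps2, delta1 + delta2

    def compose_renyi(alpha, eps1, eps2):
        """Renyi DP at a fixed order alpha: the epsilons add, alpha stays put."""
        return alpha, eps1 + eps2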

Safe Harbor ain’t gonna cut it

There are two ways to deidentify data to satisfy HIPAA:

  • Safe Harbor, § 164.514(b)(2), and
  • Expert Determination, § 164.514(b)(1).

And for reasons explained here, you may need to be concerned with HIPAA even if you’re not a “covered entity” under the statute.

To comply with Safe Harbor, your data may not contain any of eighteen categories of information. Most of these are obvious: direct identifiers such as name, phone number, email address, etc. But some restrictions under Safe Harbor are less obvious and more difficult to comply with.

For example, under Safe Harbor you need to remove

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.
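In code, the date-related part of this rule comes down to something like the sketch below. This is my own illustration of the requirement, not official guidance.

    from datetime import date

    def safe_harbor_year(d):
        """Keep only the year of a date directly related to an individual."""
        return d.year

    def safe_harbor_age(age):
        """Collapse all ages over 89 into a single '90 or older' category."""
        return "90 or older" if age > 89 else age

    print(safe_harbor_year(date(2020, 3, 15)))  # 2020
    print(safe_harbor_age(93))                  # 90 or older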

This rule would make it impossible, for example, to look at seasonal trends in medical procedures, because you would only have data to the resolution of a year. But with a more sophisticated approach, e.g. differential privacy, it would be possible to answer such questions while providing better privacy for individuals. See how here.

If you need to comply with HIPAA, or with analogous state laws such as the Texas Medical Records Privacy Act (TMRPA), and you can’t follow Safe Harbor, your alternative is expert determination. If you’d like to discuss expert determination, let’s talk.

Why HIPAA matters even if you’re not a “covered entity”


The HIPAA privacy rule only applies to “covered entities.” This generally means insurance plans, healthcare clearinghouses, and medical providers. If your company is using health information but isn’t a covered entity per the HIPAA statute, there are a couple of reasons you might still need to pay attention to HIPAA [1].

The first is that state laws may be broader than federal laws. For example, the Texas Medical Records Privacy Act extends the definition of covered entity to any business “assembling, collecting, analyzing, using, evaluating, storing, or transmitting protected health information.” So even if the US government does not consider your business to be a covered entity, the State of Texas might.

The second is that more recent privacy laws look to HIPAA. For example, it’s not clear yet what exactly California’s new privacy legislation, CCPA, will mean in practice, even though the law went into effect at the beginning of the year. Because HIPAA is well established and has a substantial body of guidance documentation, companies needing to comply with CCPA are looking to HIPAA for precedent.

The connection between CCPA and HIPAA may be formalized into more than an analogy. There is a proposed amendment to CCPA that would introduce HIPAA-like expert determination for CCPA. (Update: This amendment, AB 713, was signed into law September 25, 2020.)

If you would like to discuss HIPAA deidentification or data privacy more generally, let’s talk.


[1] I advise lawyers on statistical matters, but I am not a lawyer. Nothing here should be considered legal advice. Ask your legal counsel if you need to comply with HIPAA, or with state laws analogous to HIPAA.

 

CCPA and expert determination

California’s new CCPA (California Consumer Privacy Act) may become more like HIPAA. In particular, a proposed amendment would apply HIPAA’s standards of expert determination to CCPA.

According to this article,

The California State Senate’s Health Committee recently approved California AB 713, which would amend the California Consumer Privacy Act (CCPA) to except from CCPA requirements additional categories of health information, including data de-identified in accordance with HIPAA and certain medical research data.

Some businesses have been looking to HIPAA by analogy for how to comply with CCPA. HIPAA has been on the books much longer, and what it means to comply with HIPAA is more clearly stated, in regulation itself and in guidance documents. AB 713 would make this appeal to HIPAA more than an analogy.

In particular, CCPA would now have a notion of expert determination. AB 713 explicitly refers to

The deidentification methodology described in Section 164.514(b)(1) of Title 45 of the Code of Federal Regulations, commonly known as the HIPAA expert determination method.

Emphasis added. Taken from 1798.130 (a)(5)(D)(i).

Update: California’s governor signed AB 713 into law on September 25, 2020.

Parsing AB 713

The amendment is hard to read because it doesn’t contain many complete sentences. The portion quoted above doesn’t have a verb. We have to go up to (a) in the hierarchy before we can find a clear subject and verb:

… a business shall …

It’s not clear to me what the amendment is saying. Rather than trying to parse this myself, I’ll quote what the article linked above says.

AB 713 would except from CCPA requirements de-identified health information when … The information is de-identified in accordance with a HIPAA de-identification method [and two other conditions].

Expert determination

I am not a lawyer; I advise lawyers on statistical matters. I offer statistical advice, not legal advice.

If your lawyer determines that you need HIPAA-style expert determination to comply with CCPA, I can help. I have provided expert determination for many companies and would welcome the opportunity to provide this service for your company as well.

If you’d like to discuss expert determination, either for HIPAA or for CCPA, let’s talk.

Stochastic rounding and privacy

Suppose ages in some database are reported in decades: 0, 10, 20, etc. You need to add a 27 year old woman to the data set. How do you record her age? A reasonable approach would be round-to-nearest. In this case, 27 would be rounded up to 30.

Another approach would be stochastic rounding. In our example, we would round this woman’s age up to 30 with 70% probability and round it down to 20 with 30% probability. The recorded value is a random variable whose expected value is exactly 27.

Suppose we were to add a large number of 27 year olds to the database. With round-to-nearest, the average value would be 30 because all the values are 30. With stochastic rounding, about 30% of the ages would be recorded as 20 and about 70% would be recorded as 30. The average would likely be close to 27.

Next, suppose we add people of varying ages to the database. Stochastic rounding would record every person’s age using a random variable whose expected value is their age. If someone’s age is d + x, where d is a decade, i.e. a multiple of 10, and 0 < x < 10, then we would record their age as d with probability 1 − x/10 and as d + 10 with probability x/10. There would be no bias in the reported age.
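Here is a minimal sketch of stochastic rounding to decades, under the scheme just described.

    import random

    def stochastic_round_to_decade(age):
        """Round age to a multiple of 10, unbiased in expectation."""
        d = 10 * (age // 10)  # decade below
        x = age - d           # 0 <= x < 10
        return d + 10 if random.random() < x / 10 else d

    # Averaging many rounded copies of 27 should come out near 27.
    sample = [stochastic_round_to_decade(27) for _ in range(10_000)]
    print(sum(sample) / len(sample))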

Round-to-nearest will be biased unless ages are uniformly distributed in each decade. Suppose, for example, our data is on undergraduate students. We would expect a lot more students in their early twenties than in their late twenties.

Now let’s turn things around. Instead of looking at recorded age given actual age, let’s look at actual age given recorded age. Suppose someone’s age is recorded as 30. What does that tell you about them?

With round-to-nearest, it tells you that they are certainly between 25 and 35. With stochastic rounding, they could be anywhere between 20 and 40. The probability distribution on this interval could be computed from Bayes’ theorem, depending on the prior distribution of ages on this interval. That is, if you know in general how ages are distributed over the interval (20, 40), you could use Bayes’ theorem to compute the posterior distribution on age, given that age was recorded as 30.
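Here is a sketch of that calculation for a recorded age of 30, assuming, purely for illustration, a uniform prior on ages 20 through 39.

    import numpy as np

    ages = np.arange(20, 40)                # possible true ages
    prior = np.ones(len(ages)) / len(ages)  # illustrative flat prior

    # Likelihood of being recorded as 30 under stochastic rounding:
    # an age d + x is recorded as d + 10 with probability x/10.
    def p_recorded_30(age):
        if 20 <= age < 30:
            return (age - 20) / 10  # rounded up to 30
        if 30 <= age < 40:
            return (40 - age) / 10  # rounded down to 30
        return 0.0

    likelihood = np.array([p_recorded_30(a) for a in ages])
    posterior = prior * likelihood / (prior * likelihood).sum()
    print((ages * posterior).sum())  # posterior mean: 30.0 for this flat prior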

Stochastic rounding preserves more information than round-to-nearest on average, but less information in the case of a particular individual.


Computed IDs and privacy implications

Thirty years ago, a lot of US states thought it would be a good idea to compute someone’s driver’s license number (DLN) from their personal information [1]. In 1991, fifteen states simply used your Social Security Number as your DLN. Eleven other states computed DLNs by applying a hash function to personal information such as name, birth date, and sex. A few other states based DLNs in part, but not entirely, on personal information.

Presumably things have changed a lot since then. If you know of any states that still do this, please let me know in the comments. Even if states have stopped computing DLNs from personal data, I’m sure many organizations still compute IDs this way.

The 1991 article I stumbled on gave no hint that encoding personal information into an ID number could be a problem. And at the time it wasn’t as much of a problem as it would be now.

Why is it a problem if IDs are computed from personal data? People don’t realize what information they’re giving away. Maybe they would be willing to give someone their personal information, but not their DLN, or vice versa, not realizing that the two are equivalent. They also don’t realize what information about them someone may already have; a little bit more info may be all an attacker needs. And they don’t realize the potential consequences of their loss of privacy.

In some cases the hashing functions were complicated, but not too complicated to carry out by hand. And even if states were applying a cryptographic hash function, which they certainly were not, this would still be a problem for reasons explained here. If you have a database of personal information, say from voter registration records, you could compute the hash value of everyone in the state, or at least a large enough portion that you stand a good chance of being able to reverse a hashed value.
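As a sketch of why even a cryptographic hash wouldn’t help, here is what that attack looks like. The field layout and the use of SHA-256 are assumptions for illustration; they are not the scheme any state actually used.

    import hashlib

    def dln_hash(name, birth_date, sex):
        """Hypothetical scheme: a DLN derived by hashing personal data."""
        data = f"{name}|{birth_date}|{sex}".encode()
        return hashlib.sha256(data).hexdigest()[:12]

    # Suppose these tuples came from public voter registration records.
    voter_records = [("JANE DOE", "1970-01-01", "F"),
                     ("JOHN SMITH", "1985-06-15", "M")]

    # Precompute the hash of every record once...
    lookup = {dln_hash(*rec): rec for rec in voter_records}

    # ...and any hashed DLN observed later is reversed by a table lookup.
    observed = dln_hash("JANE DOE", "1970-01-01", "F")
    print(lookup[observed])  # ('JANE DOE', '1970-01-01', 'F')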


[1] Joseph A. Gallian. Assigning Driver’s License Numbers. Mathematics Magazine, Vol. 64, No. 1 (Feb., 1991), pp. 13-22.

What is a privacy budget?


The idea behind differential privacy is that it doesn’t make much difference whether your data is in a data set or not. How much difference your participation makes is made precise in terms of probability statements. The exact definition doesn’t matter for this post, but it matters that there is an exact definition.

Someone designing a differentially private system sets an upper limit on the amount of difference anyone’s participation can make. That’s the privacy budget. The system will allow someone to ask one question that uses the whole privacy budget, or a series of questions whose total impact is no more than that one question.

If you think of a privacy budget in terms of money, maybe your privacy budget is $1.00. You could ask a single $1 question if you’d like, but you couldn’t ask any more questions after that. Or you could ask one $0.30 question and seven $0.10 questions.

Metaphors can be dangerous, but the idea of comparing cumulative privacy impact to a financial budget is a good one. You have a total amount you can spend, and you can choose how you spend it.

The only problem with privacy budgets is that they tend to be overly cautious because they’re based on worst-case estimates. There are several ways to mitigate this. A simple way to stretch privacy budgets is to cache query results. If you ask a question twice, you get the same answer both times, and you’re only charged once.

(Recall that differential privacy adds a little random noise to query results to protect privacy. If you could ask the same question over and over, you could average your answers, reducing the level of added noise, and so a differentially private system will rightly charge you repeatedly for repeated queries. But if the system adds noise once and remembers the result, there’s no harm in giving you back that same answer as often as you ask the question.)
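Here is a minimal sketch of a budget accountant with result caching, using simple ε-DP composition and the Laplace mechanism. The class and method names are my own, for illustration only.

    import numpy as np

    class PrivacyBudget:
        """Track cumulative epsilon spent; cache answers to repeated queries."""

        def __init__(self, total_epsilon):
            self.remaining = total_epsilon
            self.cache = {}  # query name -> previously released noisy answer

        def query(self, name, true_value, epsilon, sensitivity=1.0):
            if name in self.cache:  # repeated question: same answer, no charge
                return self.cache[name]
            if epsilon > self.remaining:
                raise RuntimeError("privacy budget exhausted")
            self.remaining -= epsilon
            # Laplace mechanism: noise scale = sensitivity / epsilon.
            answer = true_value + np.random.laplace(scale=sensitivity / epsilon)
            self.cache[name] = answer
            return answer

    budget = PrivacyBudget(1.0)
    print(budget.query("patients over 65", 412, epsilon=0.3))
    print(budget.query("patients over 65", 412, epsilon=0.3))  # cached, free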

A more technical way to get more from a privacy budget is to use Rényi differential privacy (RDP) rather than the original ε-differential privacy. The former simplifies privacy budget accounting due to simple composition rules, and makes privacy budgets stretch further by leaning away from worst-case analysis a bit and toward average-case analysis. RDP depends on a tuning parameter α, with ε-differential privacy as a limiting case, so one can control how much RDP acts like ε-differential privacy by adjusting that parameter.

There are other ways to stretch privacy budgets as well. The net effect is that when querying a large database, you can often ask all the questions you like, and get sufficiently accurate answers, without worrying about the privacy budget.
