What is a privacy budget?

The idea behind differential privacy is that it doesn’t make much difference whether your data is in a data set or not. How much difference your participation makes is made precise in terms of probability statements. The exact definition doesn’t matter for this post, but it matters that there is an exact definition.

Someone designing a differentially private system sets an upper limit on the amount of difference anyone’s participation can make. That’s the privacy budget. The system will allow someone to ask one question that uses the whole privacy budget, or a series of questions whose total impact is no more than that one question.

If you think of a privacy budget in terms of money, maybe your privacy budget is $1.00. You could ask a single $1 question if you’d like, but you couldn’t ask any more questions after that. Or you could ask one $0.30 question and seven $0.10 questions.

Some metaphors are dangerous, but the idea of comparing cumulative privacy impact to a financial budget is a good one. You have a total amount you can spend, and you can choose how you spend it.

The only problem with privacy budgets is that they tend to be overly cautious because they’re based on worst-case estimates. There are several ways to mitigate this. A simple way to stretch privacy budgets is to cache query results. If you ask a question twice, you get the same answer both times, and you’re only charged once.

(Recall that differential privacy adds a little random noise to query results to protect privacy. If you could ask the same question over and over, you could average your answers, reducing the level of added noise, and so a differentially private system will rightly charge you repeatedly for repeated queries. But if the system adds noise once and remembers the result, there’s no harm in giving you back that same answer as often as you ask the question.)
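
To make the budget and caching ideas concrete, here is a minimal sketch of a budget accountant. It assumes each query's cost plays the role of ε in a Laplace mechanism with noise scale sensitivity/cost. The class name, method names, and query names are hypothetical, and a real system would need much more care.

```python
import numpy as np

class BudgetedQuerier:
    """Toy privacy-budget accountant with cached answers (illustrative only)."""

    def __init__(self, total_budget, seed=None):
        self.remaining = total_budget
        self.cache = {}
        self.rng = np.random.default_rng(seed)

    def ask(self, name, true_value, cost, sensitivity=1.0):
        # A repeated question gets the stored noisy answer back at no charge.
        if name in self.cache:
            return self.cache[name]
        # Small tolerance so floating point rounding doesn't block the last query.
        if cost > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= cost
        # Assumed mechanism: Laplace noise, so cheaper queries are noisier.
        noisy = true_value + self.rng.laplace(scale=sensitivity / cost)
        self.cache[name] = noisy
        return noisy

q = BudgetedQuerier(total_budget=1.0, seed=0)
first = q.ask("count_over_65", 1234, cost=0.30)   # the $0.30 question
again = q.ask("count_over_65", 1234, cost=0.30)   # same answer, not charged again
print(first == again, round(q.remaining, 2))
```

After the $0.30 question there is room for seven more $0.10 questions, and the next new question after that is refused.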

A more technical way to get more from a privacy budget is to use Rényi differential privacy (RDP) rather than the original ε-differential privacy. RDP simplifies privacy budget accounting through simple composition rules, and it makes privacy budgets stretch further by leaning away from worst-case analysis a bit and toward average-case analysis. RDP depends on a tuning parameter, and ε-differential privacy is recovered as a limiting case of that parameter, so one can control how much RDP acts like ε-differential privacy by adjusting it.
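
Here is a rough sketch of what RDP accounting can look like. It rests on assumptions that are not in this post: queries answered with Gaussian noise of standard deviation sigma and sensitivity 1, for which the RDP cost at order α is α/(2σ²), and the standard conversion from RDP to (ε, δ)-differential privacy. The function names and the values of sigma and delta are made up for illustration.

```python
import math

def rdp_gaussian(alpha, sigma):
    """RDP cost at order alpha of one Gaussian-noise query with sensitivity 1."""
    return alpha / (2 * sigma**2)

def eps_after_k_queries(k, sigma, delta, alphas=None):
    """Compose k queries in RDP, then convert to (epsilon, delta)-DP."""
    alphas = alphas or [1 + x / 10 for x in range(1, 5000)]
    # Composition: at a fixed order alpha, the RDP costs simply add.
    # Conversion: epsilon = rdp_cost + log(1/delta) / (alpha - 1),
    # and we are free to pick the order alpha that gives the smallest epsilon.
    return min(k * rdp_gaussian(a, sigma) + math.log(1 / delta) / (a - 1)
               for a in alphas)

sigma, delta = 20.0, 1e-6
one = eps_after_k_queries(1, sigma, delta)
hundred = eps_after_k_queries(100, sigma, delta)
print(one, hundred, 100 * one)
```

Under these assumptions the ε for 100 queries comes out well below 100 times the ε for one query, which is the sense in which the budget stretches.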

There are other ways to stretch privacy budgets as well. The net effect is that when querying a large database, you can often ask all the questions you like, and get sufficiently accurate answers, without worrying about privacy budget.


Fat tails and the t test

Suppose you want to test whether something you’re doing is having any effect. You take a few measurements and you compute the average. The average is different than what it would be if what you’re doing had no effect, but is the difference significant? That is, how likely is it that you might see the same change in the average, or even a greater change, if what you’re doing actually had no effect and the difference is due to random effects?

The most common way to address this question is the one-sample t test. “One sample” doesn’t mean that you’re only taking one measurement. It means that you’re taking a set of measurements, a sample, from one thing. You’re not comparing measurements from two different things.

The t test assumes that the data are coming from some source with a normal (Gaussian) distribution. The Gaussian distribution has thin tails, i.e. the probability of seeing a value far from the mean drops precipitously as you move further out. What if the data are actually coming from a distribution with heavier tails, i.e. a distribution where the probability of being far from the mean drops slowly?

With fat-tailed data, the t test loses power. That is, it is less likely to reject the null hypothesis, the hypothesis that the mean hasn’t changed, when it should. First we will demonstrate by simulation that this is the case, then we’ll explain why this is to be expected from theory.

Simulation

We will repeatedly draw a sample of 20 values from a distribution with mean 0.8 and test whether the mean of that distribution is not zero by seeing whether the t test produces a p-value less than the conventional cutoff of 0.05. We will increase the thickness of the distribution tails and see what that does to our power, i.e. the probability of correctly rejecting the hypothesis that the mean is zero.

We will fatten the tails of our distribution by generating samples from a Student t distribution and decreasing the number of degrees of freedom: as degrees of freedom go down, the weight of the tail goes up.

With a large number of degrees of freedom, the t distribution is approximately normal. As the number of degrees of freedom decreases, the tails get fatter. With one degree of freedom, the t distribution is a Cauchy distribution.

Here’s our Python code:

from scipy.stats import t, ttest_1samp

n = 20
N = 1000

for df in [100, 30, 10, 5, 4, 3, 2, 1]:
    rejections = 0
    for _ in range(N):
        y = 0.8 + t.rvs(df, size=n)
        stat, p = ttest_1samp(y, 0)
        if p < 0.05:
            rejections += 1
    print(df, rejections/N)

And here’s the output:

100 0.917
 30 0.921 
 10 0.873 
  5 0.757  
  4 0.700    
  3 0.628  
  2 0.449  
  1 0.137  

When the degrees of freedom are high, we reject the null about 90% of the time, even for degrees of freedom as small as 10. But with one degree of freedom, i.e. when we’re sampling from a Cauchy distribution, we only reject the null around 14% of the time.

Theory

Why do fatter tails lower the power of the t test? The t statistic is

\frac{\bar{y} - \mu_0}{s / \sqrt{n}}

where y bar is the sample average, μ0 is the mean under the null hypothesis (μ0 = 0 in our example), s is the sample standard deviation, and n is the sample size.

As distributions become fatter in the tails, the sample standard deviation increases. This means the denominator in the t statistic gets larger and so the t statistic gets smaller. The smaller the t statistic, the greater the probability that the absolute value of a t random variable is greater than the statistic, and so the larger the p-value.
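
The claim about the sample standard deviation is easy to check numerically. The sketch below draws many samples of size 20 and records the median sample standard deviation for several degrees of freedom. The seed, the number of repetitions, and the use of the median as a summary are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)
n, N = 20, 2000

median_s = {}
for df in [100, 10, 3, 1]:
    # N samples of size n, one per row
    samples = t.rvs(df, size=(N, n), random_state=rng)
    # sample standard deviation of each row (ddof=1 matches the t statistic)
    s = samples.std(axis=1, ddof=1)
    median_s[df] = np.median(s)
    print(df, median_s[df])
```

The median is used rather than the mean because with one or two degrees of freedom the variance of the distribution is infinite and the average of s is dominated by a few enormous values.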

t statistic, t distribution, t test

There are a lot of t’s floating around in this post. I’ll finish by clarifying what the various t things are.

The t statistic is the thing we compute from our data, given by the expression above. It is called a t statistic because if the hypotheses of the test are satisfied, this statistic has a t distribution with n-1 degrees of freedom. The t test is a hypothesis test based on the t statistic and its distribution. So the t statistic, the t distribution, and the t test are all closely related.

The t family of probability distributions is a convenient example of a family of distributions whose tails get heavier or lighter depending on a parameter. That’s why in the simulation we drew samples from a t distribution. We didn’t need to, but it was convenient. We would get similar results if we sampled from some other distribution whose tails get thicker, and so variance increases, as we vary some parameter.


Testing Rupert Miller’s suspicion

I was reading Rupert Miller’s book Beyond ANOVA when I ran across this line:

I never use the Kolmogorov-Smirnov test (or one of its cousins) or the χ² test as a preliminary test of normality. … I have a feeling they are more likely to detect irregularities in the middle of the distribution than in the tails.

Rupert wrote these words in 1986 when it would have been difficult to test his hunch. Now it’s easy, and so I wrote up a little simulation to test whether his feeling was justified. I’m sure this has been done before, but it’s easy (now; it would not have been in 1986) and so I wanted to do it myself.

I’ll compare the Kolmogorov-Smirnov test, a popular test for goodness-of-fit, with the Shapiro-Wilk test that Miller preferred. I’ll run each test 10,000 times on non-normal data and count how often each test produces a p-value less than 0.05.

To produce departures from normality in the tails, I’ll look at samples from a Student t distribution. This distribution has one parameter, the number of degrees of freedom. The fewer degrees of freedom, the thicker the tails and so the further from normality in the tails.

Then I’ll look at a mixture of a normal and uniform distribution. This will have thin tails like a normal distribution, but will be flatter in the middle.

If Miller was right, we should expect the Shapiro-Wilk test to be more sensitive for fat-tailed t distributions, and the K-S test to be more sensitive for mixtures.

First we import some library functions we’ll need and define our two random sample generators.

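from numpy import where
# Import only the names used below rather than everything in scipy.stats
from scipy.stats import kstest, norm, shapiro, t, uniform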

def mixture(p, size=100):
    u = uniform.rvs(size=size)
    v = uniform.rvs(size=size)
    n = norm.rvs(size=size)
    x = where(u < p, v, n)
    return x

def fat_tail(df, size=100):
    return t.rvs(df, size=size)

Next is the heart of the code. It takes in a sample generator and compares the two tests, Kolmogorov-Smirnov and Shapiro-Wilk, on 10,000 samples of 100 points each. It returns what proportion of the time each test detected the anomaly at the 0.05 level.

def test(generator, parameter):

    ks_count = 0
    sw_count = 0

    N = 10_000
    for _ in range(N):
        x = generator(parameter, 100)

        stat, p = kstest(x, "norm")
        if p < 0.05:
            ks_count += 1
    
        stat, p = shapiro(x)
        if p < 0.05:
            sw_count += 1
    
    return (ks_count/N, sw_count/N)

Finally, we call the test runner with a variety of distributions.

for df in [100, 10, 5, 2]:
    print(test(fat_tail, df))

for p in [0.05, 0.10, 0.15, 0.2]:
    print(test(mixture,p))

Note that the t distribution with 100 degrees of freedom is essentially normal, at least as far as a sample of 100 points can tell, and so we should expect both tests to report a lack of fit around 5% of the time since we’re using 0.05 as our cutoff.

Here’s what we get for the fat-tailed samples.

(0.0483, 0.0554)
(0.0565, 0.2277)
(0.1207, 0.8799)
(0.8718, 1.0000)   

So with 100 degrees of freedom, we do indeed reject the null hypothesis of normality about 5% of the time. As the degrees of freedom decrease, and the fatness of the tails increases, both tests reject the null hypothesis of normality more often. However, in each case the Shapiro-Wilk test picks up on the non-normality more often than the K-S test, about four times as often with 10 degrees of freedom and about seven times as often with 5 degrees of freedom. So Miller was right about the tails.

Now for the middle. Here’s what we get for mixture distributions.

(0.0731, 0.0677)
(0.1258, 0.1051)
(0.2471, 0.1876)
(0.4067, 0.3041)

We would expect both goodness-of-fit tests to increase their rejection rates as the mixture probability goes up, i.e. as we sample from the uniform distribution more often. And that is what we see. But the K-S test outperforms the S-W test each time. Both tests have rejection rates that increase with the mixture probability, but the rejection rates increase faster for the K-S test. Miller wins again.


National Drug Code (NDC)

The US Food and Drug Administration tracks drugs using an identifier called the NDC or National Drug Code. It is described as a 10-digit code, but it may be more helpful to think of it as a 12-character code.

An NDC contains 10 digits, separated into three segments by two dashes. The three segments are the labeler code, product code, and package code. The FDA assigns the labeler codes to companies, and each company assigns its own product and package codes.

Format

The segments are of variable length and so the dashes are significant. The labeler code could be 4 or 5 digits. The product code could be 3 or 4 digits, and the package code could be 1 or 2 digits. The total number of digits must be 10, so there are three possible combinations:

  • 4-4-2
  • 5-3-2
  • 5-4-1.

There’s no way to look at just the digits and know how to separate them into three segments. My previous post looked at self-punctuating codes. The digits of NDC codes are not self-punctuating because they require the dashes. The digit combinations are supposed to be unique, but you can’t tell how to parse a set of digits from the digits alone.

Statistics

I downloaded the NDC data from the FDA to verify whether the codes work as documented, and to see the relative frequency of various formats.

(The data change daily, so you may get different results if you do try this yourself.)

Format

All the codes were 12 characters long, and all had the documented format as verified by the regular expression [1]

    \d{4,5}-\d{3,4}-\d{1,2}
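
The same check can be written in Python. This is a minimal sketch: the function name ndc_format is made up, and it combines the regular expression above with the 12-character length requirement, which together allow only the three documented layouts.

```python
import re

# Same pattern as above; re.ASCII restricts \d to the digits 0-9.
NDC_PATTERN = re.compile(r"\d{4,5}-\d{3,4}-\d{1,2}", re.ASCII)

def ndc_format(code):
    """Return the segment layout, such as '5-3-2', or None if malformed."""
    # 10 digits plus 2 dashes means 12 characters in total.
    if len(code) != 12 or not NDC_PATTERN.fullmatch(code):
        return None
    return "-".join(str(len(part)) for part in code.split("-"))

print(ndc_format("29500-907-77"))
print(ndc_format("29500-9077-7"))
print(ndc_format("1234-567-8"))   # too short, so not a valid NDC
```

Tallying the return values over a list of codes, for example with collections.Counter, gives the relative frequency of each layout.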

Uniqueness exception

I found one exception to the rule that the sequence of digits should be unique. The command

    sed "s/-//g" ndc.txt | sort | uniq -d

returned 2950090777.

The set of NDC codes contained both 29500-907-77 and 29500-9077-7.
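
For readers who prefer Python to a shell pipeline, here is a sketch of the same duplicate check. The function name is hypothetical, and the third code in the example list is made up to stand in for the rest of the file.

```python
from collections import Counter

def duplicate_digit_strings(codes):
    """Return digit strings that occur more than once after dashes are removed."""
    counts = Counter(code.replace("-", "") for code in codes)
    return sorted(digits for digits, count in counts.items() if count > 1)

codes = ["29500-907-77", "29500-9077-7", "1234-5678-90"]
print(duplicate_digit_strings(codes))
```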

Distribution

About 60% of the codes had the form 5-3-2. About 30% had the form 5-4-1, and the remaining 10% had the form 4-4-2.

There were a total of 252,355 NDC codes with 6,532 different labelers (companies).

There were 9,448 NDC codes associated with the most prolific labeler. The 1,424 least prolific labelers had only one NDC code each. In Pareto-like fashion, the top 20% of labelers accounted for about 90% of the codes.


[1] Programming languages like Python or Perl will recognize this regular expression, but by default grep does not support \d for digits. The GNU implementation of grep with the -P option will. It will also understand notation like {4,5} to mean a pattern is repeated 4 or 5 times, with or without -P, but I don’t think other implementations of grep necessarily will.

Three-digit zip codes and data privacy

Birth date, sex, and five-digit zip code are enough information to uniquely identify a large majority of Americans. See more on this here.

So if you want to deidentify a data set, the HIPAA Safe Harbor provision says you should chop off the last two digits of a zip code. And even though three-digit zip codes are larger than five-digit zip codes on average, some three-digit zip codes are still sparsely populated.

But if you use three-digit zip codes, and cut out sparsely populated zip3s, then you’re OK, right?

Well, there’s still a problem if you also report state. Ordinarily a zip3 fits within one state, but not always.

Five-digit zip codes are each entirely contained within a state as far as I know. But three-digit zip codes can straddle state lines. For example, about 200,000 people live in the three-digit zip code 834. The vast majority of these are in Idaho, but about 500 live in zip code 83414 which is in Wyoming. Zip code 834 is not sparsely populated, and doesn’t give much information about an individual by itself. But it is conditionally sparsely populated. It does carry a lot of information about an individual if that person lives in Wyoming.

On average, a three-digit zip code covers about 350,000 people. And so most of the time, the combination of zip3 and state covers 350,000 people. But in the example above, the combination of zip3 and state might narrow down to 500 people. In a group that small, birthday (just day of the year, not the full date) is enough to uniquely identify around 25% of the population. [1]
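
The 25% figure can be checked with a couple of lines of Python. The sketch assumes birthdays are independent and uniformly distributed over 365 days, which is a simplification. Under that assumption, the probability that none of the other n - 1 people shares a given person's birthday is (364/365)^(n-1), which is approximately exp(-n/365).

```python
import math

n = 500
approx = math.exp(-n / 365)        # the approximation used in the footnote
exact = (364 / 365) ** (n - 1)     # chance nobody else shares a given birthday
print(approx, exact)
```

Both values are about 0.25, in line with the figure quoted above.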


[1] The 25% figure came from exp(-500/365). See this post for details.

Estimating vocabulary size with Heaps’ law

Heaps’ law says that the number of unique words in a text of n words is approximated by

V(n) = K n^β

where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 and 0.6.

(Note that it’s Heaps’ law, not Heap’s law. The law is named after Harold Stanley Heaps. However, true to Stigler’s law of eponymy, the law was first observed by someone else, Gustav Herdan.)

I’ll demonstrate Heaps’ law by looking at books of the Bible and then at novels of Jane Austen. I’ll also look at words that appear only once, what linguists call “hapax legomena.”

Demonstrating Heaps’ law

For a collection of related texts, you can estimate the parameters K and β from data. I decided to see how well Heaps’ law worked in predicting the number of unique words in each book of the Bible. I used the King James Version because it is easy to download from Project Gutenberg.

I converted each line to lower case, replaced all non-alphabetic characters with spaces, and split the text on spaces to obtain a list of words. This gave the following statistics:
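
The preprocessing steps just described could be sketched as follows. This is not the code used for the tables in this post. The function name word_stats is made up, and it treats only the letters a through z as alphabetic.

```python
import re

def word_stats(text):
    """Return (n, V): the total number of words and the number of distinct words."""
    # Lower-case, then turn every non-alphabetic character into a space.
    cleaned = re.sub(r"[^a-z]", " ", text.lower())
    # split() with no argument also discards the empty strings
    # that runs of spaces would otherwise produce.
    words = cleaned.split()
    return len(words), len(set(words))

print(word_stats("In the beginning God created the heaven and the earth."))
```

The example sentence has 10 words, of which 8 are distinct because “the” appears three times.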

    |------------+-------+------|
    | Book       |     n |    V |
    |------------+-------+------|
    | Genesis    | 38520 | 2448 |
    | Exodus     | 32767 | 2024 |
    | Leviticus  | 24621 | 1412 |
                    ...
    | III John   |   295 |  155 |
    | Jude       |   609 |  295 |
    | Revelation | 12003 | 1283 |
    |------------+-------+------|

The parameter values that best fit the data were K = 10.64 and β = 0.518, in keeping with the typical ranges of these parameters.

Here’s a sample of how the actual vocabulary size and predicted vocabulary size compare.

    |------------+------+-------|
    | Book       | True | Model |
    |------------+------+-------|
    | Genesis    | 2448 |  2538 |
    | Exodus     | 2024 |  2335 |
    | Leviticus  | 1412 |  2013 |
                    ...
    | III John   |  155 |   203 |
    | Jude       |  295 |   296 |
    | Revelation | 1283 |  1387 |
    |------------+------+-------|

Here’s a visual representation of the results.

KJV bible total words vs distinct words

It looks like the predictions are more accurate for small books, and that’s true on an absolute scale. But the relative error is actually smaller for large books as we can see by plotting again on a log-log scale.

KJV bible total words vs distinct words

Jane Austen novels

It’s a little surprising that Heaps’ law applies well to books of the Bible since the books were composed over centuries and in two different languages. On the other hand, the same committee translated all the books at the same time. Maybe Heaps’ law applies to translations better than it applies to the original texts.

I expect Heaps’ law would fit more closely if you looked at, say, all the novels by a particular author, especially if the author wrote all the books in his or her prime. (I believe I read that someone did a vocabulary analysis of Agatha Christie’s novels and detected a decrease in her vocabulary in her later years.)

To test this out I looked at Jane Austen’s novels on Project Gutenberg. Here’s the data:

    |-----------------------+--------+------|
    | Novel                 |      n |    V |
    |-----------------------+--------+------|
    | Northanger Abbey      |  78147 | 5995 |
    | Persuasion            |  84117 | 5738 |
    | Sense and Sensibility | 120716 | 6271 |
    | Pride and Prejudice   | 122811 | 6258 |
    | Mansfield Park        | 161454 | 7758 |
    | Emma                  | 161967 | 7092 |
    |-----------------------+--------+------|

The parameters in Heaps’ law work out to K = 121.3 and β = 0.341, a much larger K than before, and a smaller β.
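
One simple way to estimate K and β is to take logs, since log V = log K + β log n is a straight line, and fit that line by least squares. The sketch below does this for the Austen data above. The post doesn’t say how the reported values were fit, so this method is an assumption and its estimates need not agree exactly with the numbers quoted.

```python
import numpy as np

# Word counts n and vocabulary sizes V from the table above
n = np.array([78147, 84117, 120716, 122811, 161454, 161967])
V = np.array([5995, 5738, 6271, 6258, 7758, 7092])

# Least squares line on the log-log scale: slope is beta, intercept is log K
beta, log_K = np.polyfit(np.log(n), np.log(V), 1)
K = np.exp(log_K)
print(K, beta)

# Predicted vocabulary sizes under the fitted law
print(np.round(K * n**beta))
```

A nonlinear least squares fit on the original scale weights the large novels more heavily and will generally give somewhat different parameters.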

Here’s a comparison of the actual and predicted vocabulary sizes in the novels.

    |-----------------------+------+-------|
    | Novel                 | True | Model |
    |-----------------------+------+-------|
    | Northanger Abbey      | 5995 |  5656 |
    | Persuasion            | 5738 |  5799 |
    | Sense and Sensibility | 6271 |  6560 |
    | Pride and Prejudice   | 6258 |  6598 |
    | Mansfield Park        | 7758 |  7243 |
    | Emma                  | 7092 |  7251 |
    |-----------------------+------+-------|

If a suspected posthumous manuscript of Jane Austen were to appear, a possible test of authenticity would be to look at its vocabulary size to see if it is consistent with her other works. One could also look at the number of words used only once, as we discuss next.

Hapax legomenon

In linguistics, a hapax legomenon is a word that only appears once in a given context. The term comes from a Greek phrase meaning something said only once. The term is often shortened to just hapax.

I thought it would be interesting to look at the number of hapax legomena in each book since I could do it with a minor tweak of the code I wrote for the first part of this post.
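
Counting hapax legomena only requires knowing how many times each word occurs. Here is a minimal sketch with a hypothetical function name. It lower-cases the text, replaces non-alphabetic characters with spaces, and splits on whitespace before counting.

```python
import re
from collections import Counter

def hapax_count(text):
    """Return the number of words that occur exactly once in the text."""
    words = re.sub(r"[^a-z]", " ", text.lower()).split()
    counts = Counter(words)
    return sum(1 for c in counts.values() if c == 1)

print(hapax_count("In the beginning God created the heaven and the earth."))
```

In the example sentence every word except “the” appears once, so the count is 7.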

Normally if someone were speaking of hapax legomena in the context of the Bible, they’d be looking at unique words in the original languages, i.e. Hebrew and Greek, not in English translation. But I’m going with what I have at hand.

Here’s a plot of the number of hapax legomena in each book of the KJV compared to the number of words in the book.

Hapax legomena in Bible, linear scale

This looks a lot like the plot of vocabulary size and total words, suggesting the number of hapax legomena also follows a power law like Heaps’ law. This is evident when we plot again on a logarithmic scale and see a linear relation.

Number of hapax legomena on a log-log scale

Just to be clear on the difference between the two analyses in this post, in the first we looked at vocabulary size, the number of distinct words in each book. In the second we looked at words that only appear once. In both cases we’re counting unique words, but unique in different senses. In the first analysis, unique means that each word only counts once, no matter how many times it’s used. In the second, unique means that a word only appears once.


R with Conda

I’ve been unable to get some R libraries to install on my Linux laptop. Two libraries in particular were tseries and tidyverse. The same libraries installed just fine on Windows. (Maybe you need to install Rtools first before installing these on Windows; I don’t remember.)

I use conda all the time with Python, but I hadn’t tried it with R until this evening. Apparently it just works. The libraries I was trying to install have a lot of dependencies, and conda is very good at managing dependencies.

I removed my installation of R and reinstalled from conda:

    conda install r-base

Then I installed tseries with

    conda install r-tseries

and installed tidyverse analogously:

    conda install r-tidyverse

Just prepend r- to the name of the R library you want to install.

I haven’t used it in anger yet, but it seems that everything works just fine.

Dose finding != dose escalation

You’ll often hear Phase I dose-finding trials referred to as dose escalation studies. This is because simple dose-finding methods can only explore in one direction: they can only escalate.

Three-plus-three rule

The most common dose finding method is the 3+3 rule. There are countless variations on this theme, but the basic idea is that you give a dose of an experimental drug to three people. If all three are OK, you go up a dose next time. If two out of three are OK, you give that dose again. If only one out of three is OK, you stop [1].
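
The basic rule as described here fits in a few lines. This sketch only covers the simplified version in the paragraph above, not any of the variations. The function name is made up, and treating zero out of three OK the same as one out of three is an assumption.

```python
def three_plus_three_step(num_ok):
    """Next step given how many of three patients tolerated the current dose."""
    if not 0 <= num_ok <= 3:
        raise ValueError("num_ok must be between 0 and 3")
    if num_ok == 3:
        return "escalate"
    if num_ok == 2:
        return "repeat dose"
    # One OK is covered by the rule above; zero OK is assumed to stop as well.
    return "stop"

for k in [3, 2, 1, 0]:
    print(k, three_plus_three_step(k))
```

Note that none of the possible outcomes leads to a lower dose.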

Deterministic thinking

The 3+3 algorithm implicitly assumes deterministic thinking, at least in part. The assumption is that if three out of three patients respond well, we know the dose is safe [2].

If you increase the dose level and the next three patients experience adverse events, you stop the trial. Why? Because you know that the new dose is dangerous, and you know the previous dose was safe. You can only escalate because you assume you have complete knowledge based on three samples.

But if we treat three patients at a particular dose level and none have an adverse reaction we do not know for certain that the dose level is safe, though we may have sufficient confidence in its safety to try the next dose level. Similarly, if we treat three patients at a dose and all have an adverse reaction, we do not know for certain that the dose is toxic.
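
A little arithmetic shows how weak the evidence from three patients is. If each patient independently has an adverse reaction with probability p, the chance that all three are OK is (1 - p)³. The independence assumption and the particular values of p below are for illustration only.

```python
def prob_all_three_ok(p_toxic):
    """Probability that three independent patients all tolerate the dose."""
    return (1 - p_toxic) ** 3

for p in [0.1, 0.3, 0.5]:
    print(p, round(prob_all_three_ok(p), 3))
```

Even a dose that harms 30% of patients gets a clean three-out-of-three result about a third of the time.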

Bayesian dose-finding

A Bayesian dose-finding method estimates toxicity probabilities given the data available. It might decide at one point that a dose appears safe, then reverse its decision later based on more data. Similarly, it may reverse an initial assessment that a dose is unsafe.
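
As a toy illustration of this kind of reversal, consider a single dose with a uniform Beta(1, 1) prior on its toxicity probability. After observing some toxicities among the patients treated, the posterior is again a beta distribution. This is only a sketch of the idea, not any particular published dose-finding method, and the 0.3 toxicity limit and the function name are made up.

```python
from scipy.stats import beta

def prob_too_toxic(num_toxic, num_treated, limit=0.3, a=1.0, b=1.0):
    """Posterior probability that the toxicity rate exceeds `limit`."""
    # Beta(a, b) prior plus binomial data gives a Beta posterior.
    return beta.sf(limit, a + num_toxic, b + num_treated - num_toxic)

early = prob_too_toxic(0, 3)   # first three patients all OK
later = prob_too_toxic(3, 6)   # then three toxicities in the next three
print(early, later)
```

After the first cohort the dose looks reasonably safe, but the second cohort pushes the posterior probability of excessive toxicity well above one half.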

A dose-finding method based on posterior probabilities of toxicity is not strictly a dose escalation method because it can explore in two directions. It may decide that the next dose level to explore is higher or lower than the current level.

Starting at the lowest dose

In Phase I studies of chemotherapeutics, you conventionally start at the lowest dose. This makes sense. These are toxic agents, and you naturally want to start at a dose you have reason to believe isn’t too toxic. (NB: I say “too toxic” because chemotherapy is toxic. You hope that it’s toxic to a tumor without being too toxic for the patient host.)

But on closer inspection maybe you shouldn’t start at the lowest dose. Suppose you want to test 100 mg, 200 mg, and 300 mg of some agent. Then 100 mg is the lowest dose, and it’s ethical to start at 100 mg. Now what if we add a dose of 50 mg to the possibilities? Did the 100 mg dose suddenly become unethical as a starting dose?

If you have reason to believe that 100 mg is a tolerable dose, why not start with that dose, even if you add a lower dose in case you’re wrong? This makes sense if you think of dose-finding, but not if you think only in terms of dose escalation. If you can only escalate, then it’s impossible to ever give a dose below the starting dose.


[1] I have heard, but I haven’t been able to confirm, that the 3+3 method has its origin in a method proposed by John Tukey during WWII for testing bombs. When testing a mechanical system, like a bomb, there is much less uncertainty than when testing a drug in a human. In a mechanical setting, you may have a lot more confidence from three samples than you would in a medical setting.

[2] How do you explain the situation where one out of three has an adverse reaction? Is the dose safe or not? Here you naturally switch to probabilistic thinking because deterministic thinking leads to a contradiction.

 

Normal approximation to Laplace distribution?

I heard the phrase “normal approximation to the Laplace distribution” recently and did a double take. The normal distribution does not approximate the Laplace!

Normal and Laplace distributions

A normal distribution has the familiar bell curve shape. A Laplace distribution, also known as a double exponential distribution, is pointed in the middle, like a pole holding up a circus tent.

Normal and Laplace probability density functions

A normal distribution has very thin tails, i.e. probability density drops very rapidly as you move further from the middle, like exp(-x²). The Laplace distribution has moderate tails [1], going to zero like exp(-|x|).

So normal and Laplace distributions are qualitatively very different, both in the center and in the tails. So why would you want to replace one by the other?

Statistics meets differential privacy

The normal distribution is convenient to use in mathematical statistics. Whether it is realistic in application depends on context, but it’s convenient and conventional. The Laplace distribution is convenient and conventional in differential privacy. There’s no need to ask whether it is realistic because Laplace noise is added deliberately; the distribution assumption is exactly correct by construction. (See this post for details.)

When mathematical statistics and differential privacy combine, it could be convenient to “approximate” a Laplace distribution by a normal distribution [2].

Solving for parameters

So if you wanted to replace a Laplace distribution with a normal distribution, which one would you choose? Both distributions are symmetric about their means, so it’s natural to pick the means to be the same. So without loss of generality, we’ll assume both distributions have mean 0. The question then becomes how to choose the scale parameters.

You could just set the two scale parameters to be the same, but that’s similar to the Greek letter fallacy, assuming two parameters have the same meaning just because they have the same symbol. Because the two distributions have different tail weights, their scale parameters serve different functions.

One way to replace a Laplace distribution with a normal would be to pick the scale parameter of the normal so that two quantiles match. For example, you might want both distributions to have 95% of their probability mass in the same interval.

I’ve written before about how to solve for scale parameters given two quantiles. We find two quantiles of the Laplace distribution, then use the method in that post to find the corresponding normal distribution scale (standard deviation).

The Laplace distribution with scale s has density

f(x) = exp(-|x|/s) / (2s).

If we want to solve for the quantile x such that Prob(X ≤ x) = p, with p ≥ 1/2, we have

x = –s log(2 – 2p).

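Using the formula derived in the previously mentioned post, applied to the symmetric quantiles -x and x at probabilities 1 – p and p,

σ = 2x / (Φ⁻¹(p) – Φ⁻¹(1 – p)) = x / Φ⁻¹(p)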

where Φ is the cumulative distribution function of the standard normal.
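
Here is a small sketch of the calculation in Python. It uses the fact that a normal distribution with mean 0 and standard deviation σ has p-quantile σ Φ⁻¹(p), so matching the Laplace quantile x at probability p means σ = x / Φ⁻¹(p). The function name and the choice p = 0.975, which matches the central 95% interval, are for illustration.

```python
import math
from scipy.stats import norm

def matching_normal_sigma(s, p=0.975):
    """Standard deviation of the mean-zero normal whose p-quantile
    equals that of a mean-zero Laplace distribution with scale s."""
    x = -s * math.log(2 - 2 * p)   # Laplace quantile, valid for p >= 1/2
    return x / norm.ppf(p)         # normal quantile is sigma * ppf(p)

sigma = matching_normal_sigma(1.0)
print(sigma)
```

By symmetry, matching the upper quantile at p also matches the lower quantile at 1 - p, so the two distributions put the same probability on the interval between them.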


[1] The normal distribution is the canonical example of a thin-tailed distribution, while exponential tails are conventionally the boundary between thick and thin. “Thick tailed” and “thin tailed” are often taken to mean thicker than exponential and thinner than exponential respectively.

[2] You could use a Gaussian mechanism rather than a Laplace mechanism for similar reasons, but this makes the differential privacy theory more complicated. Rather than working with ε-differential privacy you have to work with (ε, δ)-differential privacy. The latter is messier and harder to interpret.

Why are dates of service prohibited under HIPAA’s Safe Harbor provision?


The HIPAA Privacy Rule offers two ways to say that data has been de-identified: Safe Harbor and expert determination. This post is about the former. I help companies with the latter.

Safe Harbor provision

The Safe Harbor provision lists 18 categories of data that would cause a data set to not be considered de-identified unless an expert determines the data does not pose a significant re-identification risk.

Some of the items prohibited by Safe Harbor are obvious: telephone number, email address, social security number, etc. Others are not so obvious. In order for data to fall under the Safe Harbor provision, one must remove

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 …

Why are these dates a problem? Birth dates are clearly useful in identifying individuals; when combined with zip code and sex, they give enough information to uniquely identify 87% of Americans. (More details here.) But why admission or discharge dates?

Public information on dates

Latanya Sweeney demonstrated here how dates of hospital admission can be used to identify individuals. She purchased a set of anonymized health records for the state of Washington for 2011 and compared the records to newspaper stories. She simply did a LexisNexis search on the term “hospitalized” to find news stories about people who were hospitalized, then searched the medical records for the personal details from the newspaper articles.

In the discussion section of her article Sweeney points out that although she searched newspapers, one could get similar information from other sources, such as employee leave records or from a record of someone asking to pay a bill late due to illness.

Randomized dates

There are ways to retain the information in dates without jeopardizing privacy. For example, one could jitter the dates by adding a random offset. However, the way to do this depends on context and can be subtle. For example, Netflix jittered the dates in its Netflix Prize data set by +/- two weeks, but this was not enough to prevent a privacy breach [1]. And if you add too much randomness, the utility of the data degrades. That’s why the HIPAA Privacy Rule includes the provision to obtain expert determination that your procedures are adequate in your context.
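
Mechanically, jittering a date is simple, as the sketch below shows. The function name and the default of 14 days are hypothetical. As noted above, an offset of this size was not enough in the Netflix case, so this illustrates the mechanics only and is not a recommendation.

```python
import random
from datetime import date, timedelta

def jitter_date(d, max_days=14, rng=random):
    """Shift a date by a uniformly random whole number of days in [-max_days, max_days]."""
    return d + timedelta(days=rng.randint(-max_days, max_days))

admitted = date(2011, 3, 15)   # made-up admission date
print(jitter_date(admitted))
```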


[1] Arvind Narayanan and Vitaly Shmatikov. Robust De-anonymization of Large Sparse Datasets, or How to Break Anonymity of the Netflix Prize Dataset.