US Census Bureau embraces differential privacy

The US Census Bureau is convinced that traditional methods of statistical disclosure limitation have not done enough to protect privacy. These methods may have been adequate in the past, but it no longer makes sense to implicitly assume that those who would like to violate privacy have limited resources or limited motivation. The Bureau has turned to differential privacy for quantifiable privacy guarantees that are independent of the attacker’s resources and determination.

John Abowd, chief scientist for the US Census Bureau, gave a talk a few days ago (March 4, 2019) in which he discussed the need for differential privacy and how the Bureau is implementing it for the 2020 census.

Absolutely the hardest lesson in modern data science is the constraint on publication that the fundamental law of information recovery imposes. I usually call it the death knell for traditional methods of publication, and not just in statistical agencies.

Congress and the Equifax data breach

Dialog from a congressional hearing, February 26, 2019.

Representative Katie Porter: My question for you is whether you would be willing to share today your social security, your birth date, and your address at this public hearing.

Equifax CEO Mark Begor: I would be a bit uncomfortable doing that, Congresswoman. If you’d so oblige me, I’d prefer not to.

KP: Could I ask you why you’re unwilling?

MB: Well that’s sensitive information. I think it’s sensitive information that I like to protect, and I think consumers should protect theirs.

KP: My question is then, if you agree that exposing this kind of information, information like that you have in your credit reports, creates harm, therefore you’re unwilling to share it, why are your lawyers arguing in federal court that there was no injury and no harm created by your data breach?

Supercookies

Supercookies, also known as evercookies or zombie cookies, are like browser cookies in that they can be used to track you, but are much harder to remove.

What is a supercookie?

The way I first heard supercookies described was as a cookie that you can appear to delete, but as soon as you do, software rewrites the cookie. Like the Hydra from Greek mythology, cutting off a head does no good because it grows back [1].

This explanation is oversimplified. It doesn’t quite work that way.

A supercookie is not a cookie per se. It’s anything that can be used to uniquely identify your browser: font fingerprinting, Flash cache, cached images, browser plugins and preferences, etc. Deleting your cookies has no effect because a supercookie is not a cookie.

However, a supercookie can work with other code to recreate deleted cookies, and so the simplified description is not entirely wrong. A supercookie could alert web sites that a cookie has been deleted, and allow those sites to replace that cookie, or update the cookie if some browser data has changed.
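To make the idea concrete, here is a minimal sketch of browser fingerprinting in Python. The attribute names and values are hypothetical; the point is that whatever properties a script can observe get hashed into a stable identifier that survives cookie deletion.

    import hashlib
    import json

    # Hypothetical attributes a tracking script might observe about a browser.
    attributes = {
        "fonts": ["Arial", "Georgia", "Inconsolata"],
        "screen": "1920x1080",
        "timezone": "America/Chicago",
        "plugins": ["pdf-viewer"],
    }

    # Derive a stable identifier from the attributes. It changes only if
    # the underlying attributes change, not when cookies are cleared.
    fingerprint = hashlib.sha256(
        json.dumps(attributes, sort_keys=True).encode()
    ).hexdigest()

    print(fingerprint[:16])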

What about ‘Do Not Track’?

You can ask sites to not track you, but this works on an honor system and is ignored with impunity, even (especially?) by the best known companies.

Apple has announced that it is removing Do Not Track from its Safari browser because the feature is worse than useless. Servers don’t honor it, and it gives a false sense of privacy. Not only that, the DNT setting is one more bit that servers could use to identify you! Because only about 9% of users turn on DNT, knowing that someone has it turned on gives about 3.5 bits of information toward identifying that person.
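That 3.5 bits figure is just the surprisal of the setting. A one-line check of the arithmetic, assuming the 9% estimate:

    from math import log2

    # Bits of identifying information from an attribute shared by 9% of users
    print(-log2(0.09))  # about 3.47 bits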

How to remove supercookies

How do you remove supercookies? You can’t. As explained above, a supercookie isn’t a file that can be removed. It’s a procedure for exploiting a combination of data.

You could remove specific ways that sites try to identify you. You could, for example, remove Flash to thwart attempts to exploit Flash’s data, cutting off one head of the Hydra. This might block the way some companies track you, but there are others.

It’s an arms race. As fingerprinting techniques become well known, browser developers and users try to block them, and those intent on identifying you come up with more creative approaches.

The economics of identification

Given the effort companies put into identifying individuals (or at least their devices), it seems it must be worth it. At least companies believe it’s worth it, and for some it probably is. But there are reasons to believe that tracking isn’t as valuable as it seems. This article, for example, argues that the most valuable targeting information is freely given: you know who is interested in buying weighted blankets? People who search on weighted blankets!

There have been numerous anecdotal stories recently of companies that have changed their marketing methods in order to comply with GDPR and have increased their sales. These are only anecdotes, but they suggest that at least for some companies, there are profitable alternatives to identifying customers who don’t wish to be identified.

[1] In the Greek myth, cutting off one head of the Hydra caused two heads to grow back. Does deleting a supercookie cause it to come back stronger? Maybe. Clearing your cookies is another user behavior that can be used to fingerprint you.

Normal approximation to Laplace distribution?

I heard the phrase “normal approximation to the Laplace distribution” recently and did a double take. The normal distribution does not approximate the Laplace!

Normal and Laplace distributions

A normal distribution has the familiar bell curve shape. A Laplace distribution, also known as a double exponential distribution, is pointed in the middle, like a pole holding up a circus tent.

[Figure: normal and Laplace probability density functions]

A normal distribution has very thin tails, i.e. probability density drops very rapidly as you move further from the middle, like exp(-x²). The Laplace distribution has moderate tails [1], going to zero like exp(-|x|).
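Here is a minimal sketch, assuming unit scale parameters for both distributions, that reproduces a comparison like the figure above:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import laplace, norm

    # Plot both densities: the Laplace has a pointed center and heavier tails.
    x = np.linspace(-5, 5, 500)
    plt.plot(x, norm.pdf(x), label="normal")
    plt.plot(x, laplace.pdf(x), label="Laplace")
    plt.legend()
    plt.show()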

So normal and Laplace distributions are qualitatively very different, both in the center and in the tails. Why, then, would you want to replace one with the other?

Statistics meets differential privacy

The normal distribution is convenient to use in mathematical statistics. Whether it is realistic in application depends on context, but it’s convenient and conventional. The Laplace distribution is convenient and conventional in differential privacy. There’s no need to ask whether it is realistic because Laplace noise is added deliberately; the distribution assumption is exactly correct by construction. (See this post for details.)

When mathematical statistics and differential privacy combine, it could be convenient to “approximate” a Laplace distribution by a normal distribution [2].

Solving for parameters

So if you wanted to replace a Laplace distribution with a normal distribution, which one would you choose? Both distributions are symmetric about their means, so it’s natural to pick the means to be the same. So without loss of generality, we’ll assume both distributions have mean 0. The question then becomes how to choose the scale parameters.

You could just set the two scale parameters to be the same, but that’s similar to the Greek letter fallacy, assuming two parameters have the same meaning just because they have the same symbol. Because the two distributions have different tail weights, their scale parameters serve different functions.

One way to replace a Laplace distribution with a normal would be to pick the scale parameter of the normal so that the two distributions agree at two quantiles. For example, you might want both distributions to have 95% of their probability mass in the same interval.

I’ve written before about how to solve for scale parameters given two quantiles. We find two quantiles of the Laplace distribution, then use the method in that post to find the corresponding normal distribution scale (standard deviation).

The Laplace distribution with scale s has density

f(x) = exp(-|x|/s)/2s.

If we want to solve for the quantile x such that Prob(X ≤ x) = p, with p ≥ 1/2, we have

x = −s log(2 − 2p).

Using the formula derived in the previously mentioned post,

σ = 2x / (Φ⁻¹(p) − Φ⁻¹(1 − p)) = x / Φ⁻¹(p)

where Φ is the cumulative distribution function of the standard normal.
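Here is a sketch of the whole calculation, assuming SciPy and matching the central 95% of probability mass:

    from scipy.stats import laplace, norm

    s = 1.0    # Laplace scale parameter
    p = 0.975  # central 95% lies between the 0.025 and 0.975 quantiles

    x = laplace.ppf(p, scale=s)  # same as -s*log(2 - 2p)
    sigma = x / norm.ppf(p)      # normal standard deviation matching that quantile

    print(x, sigma)  # about 3.0 and 1.53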

[1] The normal distribution is the canonical example of a thin-tailed distribution, while exponential tails are conventionally the boundary between thick and thin. “Thick tailed” and “thin tailed” are often taken to mean thicker than exponential and thinner than exponential, respectively.

[2] You could use a Gaussian mechanism rather than a Laplace mechanism for similar reasons, but this makes the differential privacy theory more complicated. Rather than working with ε-differential privacy you have to work with (ε, δ)-differential privacy. The latter is messier and harder to interpret.

Probabilistic Identifiers in CCPA

The CCPA, the California Consumer Privacy Act, was passed last year and goes into effect at the beginning of next year. And just as the GDPR impacts businesses outside Europe, the CCPA will impact businesses outside California.

The law specifically mentions probabilistic identifiers.

“Probabilistic identifier” means the identification of a consumer or a device to a degree of certainty of more probable than not based on any categories of personal information included in, or similar to, the categories enumerated in the definition of personal information.

So anything that gives you better than a 50% chance of guessing personal data fields [1]. That could be really broad. For example, the fact that you’re reading this blog post makes it “more probable than not” that you have a college degree, and education is one of the categories mentioned in the law.

Personal information

What are these enumerated categories of personal information mentioned above? They start out specific:

Identifiers such as a real name, alias, postal address, unique personal identifier, online identifier, Internet Protocol address, email address, …

but then they get more vague:

purchasing or consuming histories or tendencies … interaction with an Internet Web site … professional or employment-related information.

And in addition to the vague categories are “any categories … similar to” these.

Significance

What is the significance of a probabilistic identifier? That’s hard to say. A large part of the CCPA is devoted to definitions, and some of these definitions don’t seem to be used. Maybe this is a consequence of the bill being rushed to a vote in order to avoid a ballot initiative. Maybe the definitions were included in case they’re needed in a future amended version of the law.

The CCPA seems to give probabilistic identifiers the same status as deterministic identifiers:

“Unique identifier” or “Unique personal identifier” means … or probabilistic identifiers that can be used to identify a particular consumer or device.

That seems odd. Data that can give you a “more probable than not” guess at someone’s “purchasing or consuming histories” hardly seems like a unique identifier.

Devices

It’s interesting that the CCPA says “a particular consumer or device.” That would seem to include browser fingerprinting. That could be a big deal. Identifying devices, but not directly people, is a major industry.

[1] Nothing in this blog post is legal advice. I’m not a lawyer and I don’t give legal advice. I enjoy working with lawyers because the division of labor is clear: they do law and I do math.

Font Fingerprinting

Web sites may not be able to identify you, but they can probably identify your web browser. Your browser sends a lot of information back to web servers, and the combination of settings for a particular browser is usually unique. To get an idea of what information we’re talking about, you could take a look at Device Info.

Installed fonts

One of the pieces of information that gets sent back to servers is the list of fonts installed on your device. Your font fingerprint is just one component of your browser fingerprint, but it’s an easy component to understand.

Application fonts

Various applications install their own fonts. If you’ve installed Microsoft Office, for example, that would be evident in your list of fonts. However, Office is ubiquitous, so that information doesn’t go very far toward identifying you. Maybe the lack of fonts installed with Office would be more conspicuous.

Less common software goes further toward identifying you. For example, I have Mathematica on one of my computers, and along with it Mathematica fonts, something that’s not too common.

Personal fonts

Then there are the fonts you’ve installed deliberately, many of them free. Maybe you’ve installed fonts to support various languages, such as Hebrew and Greek fonts for Bible scholars. Maybe you have dyslexia and have installed fonts that are easier for you to read. Maybe you’ve installed a font because it contains technical symbols you need for your work. These increase the chances that your combination of fonts is unique.

Commercial fonts

Maybe you have purchased a few commercial fonts. One of the reasons to buy fonts is to have something that doesn’t look so common. This also makes the font fingerprint of your browser less common.

Moderate obscurity

Servers have to query whether particular fonts are installed. An obscure font would go a long way toward identifying you. But if a font is truly obscure, the server isn’t likely to ask whether it’s installed. So the greatest privacy risk comes from moderately uncommon fonts [1].
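As a back-of-the-envelope sketch, suppose (unrealistically) that font installations were independent; then the surprisals of the fonts a server detects would simply add. The prevalences below are hypothetical.

    from math import log2

    # Hypothetical fraction of browsers with each font installed.
    fonts = {"Arial": 0.95, "Mathematica fonts": 0.01, "Inconsolata": 0.05}

    # Under the (unrealistic) independence assumption, surprisals add.
    bits = sum(-log2(p) for p in fonts.values())
    print(f"{bits:.1f} bits")  # about 11.0 bits from these three fonts

Note that the rare font contributes most of the bits, which is exactly why a server that knows to query for it learns so much.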

Advertising

Your browser fingerprint is probably unique, unless you have a brand new device, or you’ve made a deliberate effort to keep your fingerprint generic. So while a site may not know who you are, it can recognize whether you’ve been there before, and customize the content you receive accordingly. Maybe you’ve looked at the same product three times without buying, and so you get a nudge to encourage you to go ahead and buy.

(It’ll be interesting to see what effect the California Consumer Privacy Act has on this when it goes into effect the beginning of next year.)

What about changes?

Since there’s more than enough information to uniquely identify your browser, fingerprints are robust to changes. Installing a new font won’t throw advertisers off your trail. If you still have the same monitor size, same geographic location, etc., then advertisers can update your fingerprint information to include the new font. You might even get an advertisement for more fonts if they infer you’re a typography aficionado.

[1] Except for a spearphishing attack. A server might check for the presence of fonts that, although uncommon in general, are likely to be on the target’s computer. For example, if someone wanted to detect my browser in particular, they know I have Mathematica fonts installed because I said so above. And they might guess that I have installed the Greek and Hebrew fonts I mentioned. They might also look for obscure fonts I’ve mentioned in the blog, such as Unifont, Andika, and Inconsolata.

Unstructured data is an oxymoron

Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.

Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.

Non-tabular data could mean anything other than tabular data, but in practice it often means text, or it could mean data with a graph structure or some other structure.

More productive discussions

My point here isn’t to quibble over language usage but to offer a constructive suggestion: say what structure data has, not what structure it doesn’t have.

Discussions about “unstructured data” are often unproductive because two people can use the term, with two different ideas of what it means, and think they’re in agreement when they’re not. Maybe an executive and a sales rep shake hands on an agreement that isn’t really an agreement.

Eventually there will have to be a discussion of what structure data actually has rather than what structure it lacks, and to what degree that structure is exploitable. Having that discussion sooner rather than later can save a lot of money.

Free text fields

One form of “unstructured” data is free text fields. These fields are not free of structure. They usually contain prose, written in a particular language, or at most in a small number of languages. That’s a start. There should be more exploitable structure from context. Is the text a pathology report? A Facebook status? A legal opinion?

Clients will ask how to de-identify free text fields. You can’t. If the text is truly free, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.

For example, a program may search for and mask probable names. Such a program would find “Elizabeth” but might fail to find “the queen.” Since there are only a couple queens [1], this would be a privacy breach. Such software would also have false positives, such as masking the name of the ocean liner Queen Elizabeth 2. [2]
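As an illustration of both failure modes (this is a toy, not a de-identification method), here is a sketch of name masking against a hypothetical name list:

    import re

    # Hypothetical name list; real systems use large dictionaries or
    # trained named-entity recognition models.
    NAMES = ["Elizabeth", "Margrethe"]

    def mask_names(text):
        # Replace each known name with a placeholder.
        return re.sub(r"\b(" + "|".join(NAMES) + r")\b", "[NAME]", text)

    print(mask_names("Elizabeth sailed on the Queen Elizabeth 2."))
    # '[NAME] sailed on the Queen [NAME] 2.' -- false positive on the ship's
    # name, while a reference like 'the queen' would slip through entirely.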

[1] The Wikipedia list of current sovereign monarchs lists only two women, Queen Elizabeth II of the UK and Queen Margrethe II of Denmark.

[2] The ship, also known as QE2, is Queen Elizabeth 2, while the monarch is Queen Elizabeth II.

Why are dates of service prohibited under HIPAA’s Safe Harbor provision?

The HIPAA Privacy Rule offers two ways to say that data has been de-identified: Safe Harbor and expert determination. This post is about the former. I help companies with the latter.

Safe Harbor provision

The Safe Harbor provision lists 18 categories of data that would cause a data set to not be considered de-identified unless an expert determines the data does not pose a significant re-identification risk.

Some of the items prohibited by Safe Harbor are obvious: telephone number, email address, social security number, etc. Others are not so obvious. In order for data to fall under the Safe Harbor provision, one must remove

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 …

Why are these dates a problem? Birth dates are clearly useful in identifying individuals; when combined with zip code and sex, they give enough information to uniquely identify 87% of Americans. (More details here.) But why admission or discharge dates?

Public information on dates

Latanya Sweeney demonstrated here how dates of hospital admission can be used to identify individuals. She purchased a set of anonymized health records for the state of Washington for 2011 and compared the records to newspaper stories. She simply did a LexisNexis search on the term “hospitalized” to find news stories about people who were hospitalized, then searched the medical records for the personal details given in the newspaper articles.

In the discussion section of her article Sweeney points out that although she searched newspapers, one could get similar information from other sources, such as employee leave records or from a record of someone asking to pay a bill late due to illness.

Randomized dates

There are ways to retain the information in dates without jeopardizing privacy. For example, one could jitter the dates by adding a random offset. However, the way to do this depends on context and can be subtle: Netflix jittered the dates in its Netflix Prize data set by +/- two weeks, but this was not enough to prevent a privacy breach [1]. And if you add too much randomness, the utility of the data degrades. That’s why the HIPAA Privacy Rule includes the provision to obtain expert determination that your procedures are adequate in your context.
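A minimal sketch of jittering, with the +/- 14-day window borrowed from the Netflix example purely for illustration:

    import random
    from datetime import date, timedelta

    def jitter(d, max_days=14):
        # Shift a date by a uniform random offset in [-max_days, max_days].
        return d + timedelta(days=random.randint(-max_days, max_days))

    print(jitter(date(2011, 6, 15)))

One subtlety is whether to draw one offset per record or one per patient; using a single offset per patient preserves the intervals between that patient’s dates, such as length of stay.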

[1] Arvind Narayanan and Vitaly Shmatikov. Robust De-anonymization of Large Sparse Datasets, or How to Break Anonymity of the Netflix Prize Dataset.

May I have the last four digits of your social?

Imagine this conversation.

“Could you tell me your social security number?”

“Absolutely not! That’s private.”

“OK, how about just the last four digits?”

“Oh, OK. That’s fine.”

When I was in college, professors would post grades by the last four digits of student social security numbers. Now that seems incredibly naive, but no one objected at the time. Using these four digits rather than names would keep your grades private from the most lazy observer but not from anyone willing to put out a little effort.

There’s a widespread belief in the US that your social security number is a deep secret, and that telling someone your social security number gives them power over you akin to a fairy telling someone his true name. On the other hand, we also believe that telling someone just the last four digits of your SSN is harmless. Both are wrong. It’s not that hard to find someone’s full SSN, and revealing the last four digits gives someone a lot of information to use in identifying you.

In an earlier post I looked at how easily most people could be identified by the combination of birth date, sex, and zip code. We’ll use the analytical results from that post to look at how easily someone could be identified by their birthday, state, and the last four digits of their SSN [1]. Note that the previous post used birth date, i.e. including year, whereas here we only look at birthday, i.e. month and day but no year. Note also that there’s nothing special about social security numbers for our purposes. The last four digits of your phone number would provide just as much information.

If you know someone lives in Wyoming, and you know their birthday and the last four digits of their SSN, you can uniquely identify them 85% of the time, and in an additional 7% of cases you can narrow the possibilities down to just two people. In Texas, by contrast, the chance that a birthday and four-digit ID combination is unique is only 0.03%. The chance of narrowing the possibilities to two people is larger, but still only about 0.1%.

Here are results for a few states. Note that even though Texas has between two and three times the population of Ohio, it’s over 100x harder to uniquely identify someone with the information discussed here.

|-----------+------------+--------+--------|
| State     | Population | Unique |  Pairs |
|-----------+------------+--------+--------|
| Texas     | 30,000,000 |  0.03% |  0.11% |
| Ohio      | 12,000,000 |  3.73% |  6.14% |
| Tennessee |  6,700,000 | 15.95% | 14.64% |
| Wyoming   |    600,000 | 84.84% |  6.97% |
|-----------+------------+--------+--------|
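As a check, the table can be reproduced with a simple Poisson model: spread N people uniformly over M = 365 × 10,000 birthday/digit cells, so the average cell occupancy is λ = N/M. Here is a sketch of that model; the formulas below match the table’s columns.

    from math import exp

    M = 365 * 10_000  # birthday x last-four-digits combinations

    def identifiability(population):
        lam = population / M         # average number of people per cell
        unique = exp(-lam)           # chance a given person is alone in a cell
        pairs = lam * exp(-lam) / 2  # matches the "Pairs" column above
        return unique, pairs

    for state, pop in [("Texas", 30_000_000), ("Wyoming", 600_000)]:
        u, pr = identifiability(pop)
        print(f"{state}: unique {u:.2%}, pairs {pr:.2%}")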

[1] In that post we made the dubious simplifying assumption that birth dates were uniformly distributed from 0 to 78 years. This assumption is not accurate, but it was good enough to prove the point that it’s easier to identify people than you might think. Here our assumptions are better founded. Birthdays are nearly uniformly distributed, though there are some slight irregularities. The last four digits of social security numbers are uniformly distributed, though the first digits are correlated with the state.