Sharing data without letting it go

Sharing data

Suppose two companies would like to share data, but they’d also each like to retain ownership of their own data. They’d like to enable querying as if each company had given the other all its data, without actually letting go of its data.

Maybe the two companies are competitors who want to collaborate for a particular project. Or maybe the companies each have data that they are not legally allowed to share with the other. Maybe one company is interested in buying (the data of) the other and would like to have some sort of preview of what they may be buying.

Differential privacy makes this possible, and can be useful even if privacy is not an issue. Suppose the two companies have data on inanimate widgets, not persons, and yet they have privacy-like concerns. They don’t want to hand over row-level data about their widgets, and yet they both want to be able to pose questions about the combined widget data. The situation is analogous to being concerned about the “privacy” of the widgets.

Both companies would deposit data with a trusted third party, and gain access to this data via an API that implements differential privacy. Such APIs let users pose queries but do not allow either side to request row-level data.

How is this possible? What if one party poses a query that unexpectedly turns out to be asking for row-level data? For example, maybe someone asks for the average net worth of customers in Alaska, assuming there are hundreds of such customers, but the data only contains one customer in Alaska. What was intended to be an aggregate query turns out to be a row-level query.

Differential privacy handles this sort of thing automatically. It adds an amount of random noise to each query in proportion to how sensitive the query is. If you ask for what amounts to data about an individual person (or individual widget) the system will add enough noise to the result to prevent revealing row-level information. (The system may also refuse to answer the query; this is done more often in practice than in theory.) But if you ask a question that reveals very little information about a single database row, the amount of noise added will be negligible.
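One standard way to add noise in proportion to sensitivity is the Laplace mechanism. The sketch below is only an illustration, not the API of any particular system: the function names, the net worth cap of one million, and the choice ε = 0.5 are all assumptions, and the group size is treated as public to keep the example short.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def noise_scale(cap: float, group_size: int, epsilon: float) -> float:
    """Laplace scale for the mean of values clamped to [0, cap].

    Changing one row moves the mean by at most cap / group_size,
    so that is the sensitivity; the scale is sensitivity / epsilon.
    """
    return cap / (group_size * epsilon)

def noisy_mean(values, cap: float, epsilon: float, rng: random.Random) -> float:
    """Clamp each value, average, and add Laplace noise sized to the group."""
    clamped = [min(max(v, 0.0), cap) for v in values]
    true_mean = sum(clamped) / len(clamped)
    scale = noise_scale(cap, len(clamped), epsilon)
    return true_mean + laplace_noise(scale, rng)

# Hypothetical numbers: net worth capped at 1,000,000 and epsilon = 0.5.
scale_one_customer = noise_scale(1_000_000, 1, 0.5)      # noise swamps the answer
scale_500_customers = noise_scale(1_000_000, 500, 0.5)   # noise is comparatively small
```

With one customer in Alaska the noise scale is as large as the quantity being measured, so the answer says essentially nothing about that customer. With 500 customers the scale is 500 times smaller.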

The degree of collaboration can be limited up front by setting a privacy budget for each company. (Again, we may not necessarily be talking about the privacy of people. We may be looking at analogous protections on units of information about physical things, such as results of destructive testing of expensive hardware.)

Someone could estimate at the start of the collaboration how large the privacy budget would need to be to allow both companies to satisfy their objectives without being so large as to risk giving away intellectual property that the parties do not wish to exchange. This budget would be spent over the course of the project. When the parties exhaust their privacy budgets, they can discuss whether to allow each other more query budget.
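Here is a minimal sketch of how a budget might be tracked, assuming each query is charged an ε and the charges simply add up. The class and method names are made up for illustration; a real system would likely use more refined accounting.

```python
class BudgetExhausted(Exception):
    """Raised when a query would exceed the agreed privacy budget."""

class PrivacyBudget:
    """Track cumulative epsilon spent by one party (hypothetical helper)."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total - self.spent

    def charge(self, epsilon: float) -> None:
        """Deduct epsilon for one query, or refuse if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:   # small tolerance for float rounding
            raise BudgetExhausted(f"only {self.remaining:.3f} left")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)
budget.charge(0.4)
try:
    budget.charge(0.4)            # would bring the total to 1.2
    third_query_allowed = True
except BudgetExhausted:
    third_query_allowed = False   # time to renegotiate the budget
```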

This arrangement gives both parties the ability to ask questions of the combined data as if they had exchanged data. However, neither party has given up control of its data. They have given each other some level of information inferred from the combined data, but neither gives a single row of data to the other.


Conspicuously missing data

I was working on a report for a client this afternoon when I remembered this comic from Spiked Math.

Waitress: Does everyone want a beer? Logician 1: I don't know. Logician 2: I don't know. Logician 3: Yes!

I needed to illustrate the point that revealing information about one person or group can reveal information on other people or other groups. If you give your genetic information to a company, for example, you also give that company (and every entity they share your data with) information about your relatives.

This comic doesn’t illustrate the point I had in mind, but it does illustrate a related point. The third logician didn’t reveal the preferences of the first two, though it looks like that at first. Actually, the first two implicitly reported their own preferences.

If the first logician did not want a beer, he or she could have said “No” to the question “Does everyone want a beer?” Answering this question with “I don’t know” is tantamount to answering the question “Do you want a beer?” with “Yes.” What appears to be a non-committal answer is a definite answer on closer examination.

One of the Safe Harbor provisions under HIPAA is that data may not contain sparsely populated three-digit zip codes. Sometimes databases will replace sparse zip codes with nulls. But if the same database reports a person’s state, and the state only has one sparse zip code, then the data effectively lists all zip codes. Here the suppressed zip code is conspicuous by its absence. The null value itself didn’t reveal the zip code, nor did the state, but the combination did.
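The inference takes only a few lines of code. The sketch below uses a made-up lookup table of three-digit zip prefixes by state; the state names, prefixes, and sparsity flags are all hypothetical.

```python
# Hypothetical lookup: for each state, its three-digit zip prefixes and
# whether each one is sparsely populated. Not real data.
ZIP3_BY_STATE = {
    "State A": {"123": False, "124": False, "125": True},
    "State B": {"200": True, "201": True, "202": False},
}

def infer_suppressed_zip3(state: str):
    """Return the suppressed prefix if the state pins it down, else None."""
    sparse = [z for z, is_sparse in ZIP3_BY_STATE[state].items() if is_sparse]
    return sparse[0] if len(sparse) == 1 else None

records = [
    {"state": "State A", "zip3": None},    # suppressed, one sparse prefix in state
    {"state": "State B", "zip3": None},    # suppressed, two candidates remain
    {"state": "State A", "zip3": "123"},   # never suppressed
]
recovered = [r["zip3"] or infer_suppressed_zip3(r["state"]) for r in records]
```

The first record’s prefix is fully recovered. The second is only narrowed to two candidates, which is still more than the null was supposed to reveal.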

A naive approach to removing sensitive data can be about as effective as bleeping out profanity: it’s not hard to infer what was removed.


Why target ads at pregnant women

I’m listening to a podcast interviewing Neil Richards, the author of Why Privacy Matters. Richards makes a couple of interesting points about the infamous example of Target figuring out which women were pregnant based on their purchase history.

First, pregnancy is a point at which women are open to trying new things. So if a company can get a woman to buy a baby stroller at their store, they may be able to get her to remain a customer for years to come. (Richards mentioned going off to college as another such milestone, so a barrage of advertising is aimed at first-year college students.)

Second, women understandably freaked out over the targeted ads. So Target hid the ads in with irrelevant ads. They might show a woman ads for lawnmowers and baby wipes. That way the baby wipe ads didn’t seem so on-the-nose. The target audience would see the ad without feeling like they’re being targeted.

Just to be clear, I’m not writing this post to offer how-to advice for doing creepy advertising. The info here is presumably common knowledge in the advertising industry, but it’s not common knowledge for the public.

What’s “differential” about differential privacy?

Interest in differential privacy is growing rapidly. As evidence of this, here’s the result of a Google Ngram search [1] on “differential privacy.”

Graph rapidly rising from 2000 to 2019

When I first mentioned differential privacy to consulting leads a few years ago, not many had heard of it. Now most are familiar with the term, though they may not be clear on what it means.

The term “differential privacy” is kinda mysterious, particularly the “differential” part. Are you taking the derivative of some privacy function?!

In math and statistics, the adjective “differential” usually has something to do with calculus. Differential geometry applies calculus to geometry problems. Differential equations are equations involving the derivatives of the function you’re after.

But differential privacy is different. There is a connection to differential calculus, but it’s further up the etymological tree. Derivatives are limits of difference quotients, and that’s the connection. Differential privacy is about differences, but without any limits. In a nutshell, a system is differentially private if it hardly makes any difference whether your data is in the system or not.

The loose phrase “it hardly makes any difference” can be quantified in terms of probability distributions. You can find a brief explanation of how this works here.
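For reference, here is the standard way to make this precise. The symbols M, D, D′, and S are notation introduced here. A randomized mechanism M is ε-differentially private if, for every pair of databases D and D′ that differ in one person’s data, and for every set S of possible outputs,

```latex
\[
  \Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \, \Pr[\, M(D') \in S \,].
\]
```

When ε is small, the two output distributions are nearly the same, and that is the sense in which it hardly makes any difference whether your data is in the system.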

Differential privacy is an elegant and defensible solution to the problem of protecting personal privacy while also preserving the usefulness of data in the aggregate. But it’s also challenging to understand and implement. If you’re interested in exploring differential privacy for your business, let’s talk.


[1] You can see the search here. I’ve chopped off the vertical axis labels because they’re virtually meaningless. The proportion of occurrences of “differential privacy” among all two-word phrases is extremely small. But the rate of growth in the proportion is large.

Hashing phone numbers

A cryptographic hash is also known as a one-way function because given an input x, one can quickly compute the hash h(x), but it is extremely time-consuming to try to recover x if you only know h(x).

Even if the hashing algorithm is considered “broken,” it may take an enormous effort to break it. Google demonstrated that they could break a SHA-1 hash, but they used a GPU-century of compute power to do so. Attacks have become more efficient since then, but it still takes many orders of magnitude less work to compute a hash than to attempt to invert it. [1]

However, if you know that the hash value comes from a small set of possible inputs, brute force can discover which one. I wrote about this a couple years ago in the post Hashing names does not protect privacy. You could, for example, easily create a table of the hashes of all nine-digit social security numbers.
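Here is a small sketch of such a table. To keep it fast it covers four-digit codes rather than nine-digit numbers, and it uses SHA-1 from Python’s standard library. The function names are made up for the example. The same approach applies to any input space small enough to enumerate.

```python
import hashlib

def sha1_hex(text: str) -> str:
    """Hex digest of the SHA-1 hash of an ASCII string."""
    return hashlib.sha1(text.encode("ascii")).hexdigest()

def build_lookup(num_digits: int) -> dict:
    """Map hash -> input for every zero-padded number with num_digits digits."""
    inputs = (f"{i:0{num_digits}d}" for i in range(10 ** num_digits))
    return {sha1_hex(s): s for s in inputs}

lookup = build_lookup(4)           # 10,000 entries
leaked_hash = sha1_hex("0427")     # pretend this is all the attacker sees
recovered = lookup[leaked_hash]    # one dictionary lookup undoes the hash
```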

I often explain this to clients who have been told that hashed data is “encrypted.” This is subtle, because the data is encrypted, in a way, but not in the way they think.

A paper came out a few weeks ago that hashed 118 billion phone numbers.

The limited amount of possible mobile phone numbers combined with the rapid increase in affordable storage capacity makes it feasible to create key-value databases of phone numbers indexed by their hashes and then to perform constant-time lookups for each given hash value. We demonstrate this by using a high-performance cluster to create an in-memory database of all 118 billion possible mobile phone numbers from [reference] (i.e., mobile phone numbers allowed by Google’s libphonenumber and the WhatsApp registration API) paired with their SHA-1 hashes.

The authors were able to use this database to query

10% of US mobile phone numbers for WhatsApp and 100% for Signal. For Telegram we find that its API exposes a wide range of sensitive information, even about numbers not registered with the service.


[1] Hash functions are not invertible, even in theory, in the sense of a unique x leading to a hash value h(x). Suppose you’re computing a 256-bit hash on files that are one kilobyte (8192 bits). If you’re mapping a space of 2^8192 possible files into a space of 2^256 possible hash values, the mapping cannot be one-to-one. However, if you know the inputs are not random bits but German prose, and you find a file of German prose that has a matching hash value, you’ve almost certainly recovered the file that led to the hash value.

Expert determination for CCPA

US and California flags

California’s CCPA regulation has been amended to say that data considered deidentified under HIPAA is considered deidentified under CCPA. The amendment was proposed last year and was finally signed into law on September 25, 2020.

This is good news because it’s relatively clear what deidentification means under HIPAA compared to CCPA. In particular, HIPAA has two well-established alternatives for determining that data have been adequately deidentified:

  1. Safe Harbor, or
  2. Expert determination.

The latter is especially important because most useful data doesn’t meet the requirements of Safe Harbor.

I provide companies with HIPAA expert determination. And now by extension I can provide expert determination under CCPA.

I’m not a lawyer, and so nothing I write should be considered legal advice. But I work closely with lawyers to provide expert determination. If you would like to discuss how I could help you, let’s talk.

Identifying someone from their heart beat

electrocardiogram of a toddler

How feasible would it be to identify someone based on electrocardiogram (EKG, ECG) data? (Apparently the abbreviation “EKG” is more common in America and “ECG” is more common in the UK.)

Electrocardiograms are unique, but unique doesn’t necessarily mean identifiable. Unique data isn’t identifiable without some way to map it to identities. If you shuffle a deck of cards, you will probably produce an arrangement that has never occurred before. But without some sort of registry mapping card deck orders to their shufflers, there’s no chance of identification. (For identification, you’re better off dusting the cards for fingerprints, because there are registries of fingerprints.)

According to one survey [1], researchers have tried a wide variety of methods for identifying people from electrocardiograms. They’ve used time-domain features such as peak amplitudes, slopes, variances, etc., as well as a variety of frequency-domain (DFT) features. It seems that all these methods work moderately well, but none are great, and there’s no consensus regarding which approach is best.

If you have two EKGs on someone, how readily can you tell that they belong to the same person? The answer depends on the size of the set of EKGs you’re comparing them to. The studies surveyed in [1] do some sort of similarity search, comparing a single EKG to tens of candidates. The methods surveyed had an overall success rate of around 95%. But these studies were based on small populations; at least at the time of publication no one had looked at matching a single EKG against thousands of possible matches.

In short, an electrocardiogram can identify someone with high probability once you know that they belong to a relatively small set of people for which you have electrocardiograms.


[1] Antonio Fratini et al. Individual identification via electrocardiogram analysis. Biomed Eng Online. 2015; 14: 78. doi 10.1186/s12938-015-0072-y

Is there a zip code that equals its population?

US stamp from 1973 promoting zip codes

I noticed yesterday that the population in a zip code near me is roughly equal to the zip code itself. So I wondered:

Does any zip code equal its population?

Yes, it’s a silly question. A zip code isn’t a quantity. Populations are always changing. Zip code boundaries are always changing. Etc.

The answer, according to the data I had on hand, is almost.

Smallest absolute error: Zip code 00674 has population 672.

Smallest relative error: Zip code 42301 has population 42319.
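The search itself is simple. Here is a sketch, assuming the data is a dictionary from zip code strings to populations and that relative error is measured against the zip code read as a number. Only the first two entries come from the results above; the other two are made up to give the search something to reject.

```python
def closest_matches(populations: dict):
    """Return (zip with smallest absolute error, zip with smallest relative error)."""
    def abs_err(z):
        return abs(populations[z] - int(z))

    def rel_err(z):
        return abs_err(z) / int(z)

    # Skip a code like 00000, whose numeric value would make rel_err divide by zero.
    candidates = [z for z in populations if int(z) > 0]
    return min(candidates, key=abs_err), min(candidates, key=rel_err)

populations = {
    "00674": 672,      # from the results above
    "42301": 42319,    # from the results above
    "10001": 21000,    # made-up population
    "77005": 25000,    # made-up population
}
best_abs, best_rel = closest_matches(populations)
```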

I’ve had to learn some of the intricacies of zip codes in the course of my work on data privacy. I found out that zip codes are more complicated than I ever would have thought.

For one thing, the US Census doesn’t exactly report data by zip code but by zip code tabulation area (ZCTA) for reasons that make sense but are too complicated to get into here. This is another reason why the question posed here is fuzzy; we don’t know the populations of zip codes unless they coincide with ZCTAs.

Update: Several people have told me they expected this post to be an existence argument, such as saying by the pigeon hole principle some zip code has to equal its population. That’s not the case, at least for real, populated US zip codes. (Though Andrew Gelman pointed out that 00000 is not populated, so there’s that.)

Let P be a population size with n zip codes and assume all zip codes have to be populated. If P = n and one of the zip codes must be numbered 1, then some zip code must equal its population. But other than trivial cases like that, it’s easy to avoid any equalities between zip codes and their populations.

A more interesting question goes in the opposite direction: is it possible that every zip code equals its population? Clearly not if n(n + 1)/2 > P because the left side is the minimum population possible with n zip codes, each equal to its non-zero population. But it turns out that n(n + 1)/2 ≤ P is sufficient: put i people in zip code i for each i from 1 through n − 1, and let k = P − n(n − 1)/2 be the number of remaining people. Put them all in zip code k. Since k ≥ n > n − 1, k is a zip code that hasn’t been used before.
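The construction is easy to check in code. The sketch below gives zip code i exactly i people for i from 1 through n − 1, then puts the remaining k = P − n(n − 1)/2 people in zip code k. The function name is made up for the example.

```python
def assign_populations(n: int, P: int) -> dict:
    """Return {zip code: population} with n zip codes, total P, each equal to its population."""
    if n < 1:
        raise ValueError("need at least one zip code")
    if n * (n + 1) // 2 > P:
        raise ValueError("not enough people for n distinct populated zip codes")
    assignment = {i: i for i in range(1, n)}   # zip codes 1 through n - 1
    k = P - n * (n - 1) // 2                   # everyone who is left over
    assignment[k] = k                          # k >= n, so k is unused so far
    return assignment

tight = assign_populations(4, 10)   # P equals n(n + 1)/2 exactly
loose = assign_populations(3, 20)   # 17 people left over for the last zip code
```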


This post started out as a Twitter thread.

The image above is a US stamp from 1973 promoting the initial use of zip codes.

Three composition theorems for differential privacy

This is a brief post, bringing together three composition theorems for differential privacy.

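  1. The composition of an ε1-differentially private algorithm and an ε2-differentially private algorithm is an (ε1 + ε2)-differentially private algorithm.
  2. The composition of an (ε1, δ1)-differentially private algorithm and an (ε2, δ2)-differentially private algorithm is an (ε1 + ε2, δ1 + δ2)-differentially private algorithm.
  3. The composition of an (α, ε1)-Rényi differentially private algorithm and an (α, ε2)-Rényi differentially private algorithm is an (α, ε1 + ε2)-Rényi differentially private algorithm.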

The three composition rules can be summarized briefly as follows:

ε1 ∘ ε2 → (ε1 + ε2)
(ε1, δ1) ∘ (ε2, δ2) → (ε1 + ε2, δ1 + δ2)
(α, ε1) ∘ (α, ε2) → (α, ε1 + ε2)
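The same rules can be written as a few lines of bookkeeping. This is only a sketch of the arithmetic, with made-up class names and illustrative parameter values. It shows δ growing under repeated composition while α stays fixed.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class Pure:
    """An epsilon-differential privacy guarantee."""
    eps: float
    def compose(self, other: "Pure") -> "Pure":
        return Pure(self.eps + other.eps)

@dataclass(frozen=True)
class Approx:
    """An (epsilon, delta)-differential privacy guarantee."""
    eps: float
    delta: float
    def compose(self, other: "Approx") -> "Approx":
        return Approx(self.eps + other.eps, self.delta + other.delta)

@dataclass(frozen=True)
class Renyi:
    """An (alpha, epsilon)-Rényi differential privacy guarantee."""
    alpha: float
    eps: float
    def compose(self, other: "Renyi") -> "Renyi":
        if self.alpha != other.alpha:
            raise ValueError("this rule assumes both guarantees use the same alpha")
        return Renyi(self.alpha, self.eps + other.eps)

def compose_all(guarantees):
    """Fold a list of same-type guarantees with the matching rule."""
    return reduce(lambda a, b: a.compose(b), guarantees)

# Ten queries each, with illustrative parameters.
approx_total = compose_all([Approx(0.1, 1e-6)] * 10)   # delta is now about 1e-5
renyi_total = compose_all([Renyi(8.0, 0.1)] * 10)      # alpha is still 8
```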

What is the significance of these composition theorems? In short, ε-differential privacy and Rényi differential privacy compose as one would hope, but (ε, δ)-differential privacy does not.

The first form of differential privacy proposed was ε-differential privacy. It is relatively easy to interpret, composes nicely, but can be too rigid.

If you have Gaussian noise, for example, you are led naturally to (ε, δ)-differential privacy. The δ term is hard to interpret. Roughly speaking you could think of it as the probability that ε-differential privacy fails to hold. Unfortunately with (ε, δ)-differential privacy the epsilons add and so do the deltas. We would prefer that δ didn’t grow with composition.

Rényi differential privacy is a generalization of ε-differential privacy that uses a family of information measures indexed by α to measure the impact of a single row being or not being in a database. The case of α = ∞ corresponds to ε-differential privacy, but finite values of α tend to be less pessimistic. The nice thing about the composition theorem for Rényi differential privacy is that the α parameter doesn’t change, unlike the δ parameter in (ε, δ)-differential privacy.

Safe Harbor ain’t gonna cut it

There are two ways to deidentify data to satisfy HIPAA:

  • Safe Harbor, § 164.514(b)(2), and
  • Expert Determination, § 164.514(b)(1).

And for reasons explained here, you may need to be concerned with HIPAA even if you’re not a “covered entity” under the statute.

To comply with Safe Harbor, your data may not contain any of eighteen categories of information. Most of these are obvious: direct identifiers such as name, phone number, email address, etc. But some restrictions under Safe Harbor are less obvious and more difficult to comply with.

For example, under Safe Harbor you need to remove

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.

This would make it impossible, for example, to look at seasonal trends in medical procedures because you would only have data to the resolution of a year. But with a more sophisticated approach, e.g. differential privacy, it would be possible to answer such questions while providing better privacy for individuals. See how here.
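To see what is lost, here is a sketch of the date and age rule quoted above applied to two records. The field names and records are hypothetical, and the code covers only this one rule, not the other Safe Harbor categories.

```python
from datetime import date

def generalize(record: dict) -> dict:
    """Keep only the year of the admission date and cap ages over 89."""
    age = record["age"]
    return {
        "admission_year": record["admission_date"].year,
        "age": "90+" if age > 89 else age,
    }

summer = generalize({"admission_date": date(2020, 7, 4), "age": 93})
winter = generalize({"admission_date": date(2020, 1, 15), "age": 41})
```

After generalization both admissions carry only the year 2020, so nothing in the output distinguishes a July procedure from a January one.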

If you need to comply with HIPAA, or analogous state laws such as TMRPA, and you can’t follow Safe Harbor, your alternative is expert determination. If you’d like to discuss expert determination, let’s talk.