There’s an old saying that if you want to hide a tree, put it in a forest. An analogous principle in privacy is that a record preserves privacy if it’s like a lot of other records.

*k*-anonymity

The idea of *k*-anonymity is that every database record appears at least *k* times when you restrict your attention to **quasi-identifiers** [1]. If you have a lot of records and few fields, your value of *k* could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If *k* = 1, then *k*-anonymity offers no anonymity. Even when *k* is large, *k*-anonymity might prevent re-identification but still suffer from **attribute disclosure**.

Another problem with *k*-anonymity is that it doesn’t offer group privacy. A database could be *k*-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a **homogeneity attack**.

Or going the other way around, if you know already know something that stands about a group, this could help you identify the record belonging to an individual. That is, the method is subject to a **background knowledge attack**.

One way to address this shortcoming is *l*-diversity. This post won’t go into *l*-diversity because it’s an intermediate step to where we want to go, which is *t*-closeness.

*t*-closeness

The idea of *t*-closeness is that the distribution of sensitive data in every group is not too far from the distribution in the full population. The “*t*” comes from requiring that the distributions be no more than a distance *t* apart in a sense that we’ll define below [2]. If the sensitive data in a group doesn’t stand out, this thwarts the homogeneity attack and the background knowledge attack.

## Earth mover distance

When we say that the distribution on sensitive data within a group is not far from the distribution in the full data, how do we quantify what “far” means? That is, how do we measure the distance between two distributions?

There are a lot of ways to measure the similarity of two probability distributions. A common choice is the Kullback-Leiler divergence, though that’s not what we’ll use here. Instead, *t*-closeness uses the so-called earth mover distance (EMD), also know as the **Wasserstein metric**.

The idea of EMD is to imagine both probability distributions as piles of dirt and calculate the minimum amount of work needed to reshape the first pile so that it has the same shape as the second. The key attribute of EMD is that it takes distance into account.

Suppose your data is some ordered response, 1 through 5. Suppose distribution *X* has probability 0.8 at 1 and 0.05 for the rest of the responses. Distributions *Y* and Z are the same except they have 80% of their probability mass at 2 and at 5 respectively. By some measures, *X* is equally far from *Y* and *Z*, but the earth mover distance would say that *X* is closer to *Y* than to *Z*, which is more appropriate in our setting.

We can calculate the EMD for the example above. To transform *X* into *Y*, we need to move a probability mass of 0.75 from 1 to 2, and so the EMD is 0.75. To transform *X* into *Z* we need to move the same amount of mass, but we need to move it 4x further, and so the EMD is 3.

## Related posts

[1] What exactly is a quasi-identifier? That’s hard to say precisely. An identifier is something that clearly identifies an individual. A phone number, for example, is an identifier. A quasi-identifier is something that helps narrow down who a person is, but isn’t an identifier. For example, someone’s birthday would be a quasi-identifier. But in general “quasi-identifier” is not a precisely defined concept.

[2] It’s common in math to use a variable name as an adjective, as with *k*-anonymity, *l*-diversity, and *t*-closeness. This is unfortunate because it isn’t descriptive and locks in a variable naming convention. Someone used the variable *k* to count the number of redundant database tables, and the name stuck. Similarly the *l* in *l*-diversity counts something.

As a sidenote, the variable names here follow the old FORTRAN convention that variables *i* through *n* represent integers by default, and that all other variables represent real (floating point) numbers by default. The *t* in *t*-closeness is a continuous measure, so the *t* is a real value, while *k* and *l* are integers.