Unstructured data is an oxymoron


Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.

Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.

Non-tabular data could in principle mean anything other than tabular data, but in practice it usually means text. It could also mean data with a graph structure or some other structure.

More productive discussions

My point here isn’t to quibble over language usage but to offer a constructive suggestion: say what structure data has, not what structure it doesn’t have.

Discussions about “unstructured data” are often unproductive because two people can use the term, with two different ideas of what it means, and think they’re in agreement when they’re not. Maybe an executive and a sales rep shake hands on an agreement that isn’t really an agreement.

Eventually there will have to be a discussion of what structure data actually has rather than what structure it lacks, and to what degree that structure is exploitable. Having that discussion sooner rather than later can save a lot of money.

Free text fields

One form of “unstructured” data is free text fields. These fields are not free of structure. They usually contain prose, written in a particular language, or at most in a small number of languages. That’s a start. There should be more exploitable structure from context. Is the text a pathology report? A Facebook status? A legal opinion?

Clients will ask how to de-identify free text fields. You can’t. If the text is truly free, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.

For example, a program may search for and mask probable names. Such a program would find “Elizabeth” but might fail to find “the queen.” Since there are only a couple queens [1], this would be a privacy breach. Such software would also have false positives, such as masking the name of the ocean liner Queen Elizabeth 2. [2]
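
A crude sketch of such a masking program, with a hypothetical name list and placeholder of my own choosing, might look like this. It exhibits exactly the failure modes described above: it catches “Elizabeth” but misses “the queen,” and it falsely masks the name of the ship.

    import re

    KNOWN_NAMES = {"Elizabeth", "Margrethe"}  # hypothetical name list

    def mask_names(text):
        # Mask capitalized words that appear in the name list.
        return re.sub(r"[A-Z][a-z]+",
                      lambda m: "[NAME]" if m.group() in KNOWN_NAMES else m.group(),
                      text)

    print(mask_names("Elizabeth sailed on the Queen Elizabeth 2."))
    # [NAME] sailed on the Queen [NAME] 2.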


[1] The Wikipedia list of current sovereign monarchs lists only two women, Queen Elizabeth II of the UK and Queen Margrethe II of Denmark.

[2] The ship, also known as QE2, is Queen Elizabeth 2, while the monarch is Queen Elizabeth II.

Why are dates of service prohibited under HIPAA’s Safe Harbor provision?


The HIPAA Privacy Rule offers two ways to say that data has been de-identified: Safe Harbor and expert determination. This post is about the former. I help companies with the latter.

Safe Harbor provision

The Safe Harbor provision lists 18 categories of data that must be removed. If any of them remain, the data set is not considered de-identified unless an expert determines the data does not pose a significant re-identification risk.

Some of the items prohibited by Safe Harbor are obvious: telephone number, email address, social security number, etc. Others are not so obvious. In order for data to fall under the Safe Harbor provision, one must remove

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 …

Why are these dates a problem? Birth dates are clearly useful in identifying individuals; when combined with zip code and sex, they give enough information to uniquely identify 87% of Americans. (More details here.) But why admission or discharge dates?

Public information on dates

Latanya Sweeney demonstrated here how dates of hospital admission can be used to identify individuals. She purchased a set of anonymized health records for the state of Washington for 2011 and compared the records to newspaper stories. She simply did a LexisNexis search on the term “hospitalized” to find news stories about people who were hospitalized, then searched the medical records for the personal details given in the newspaper articles.

In the discussion section of her article Sweeney points out that although she searched newspapers, one could get similar information from other sources, such as employee leave records or a record of someone asking to pay a bill late due to illness.

Randomized dates

There are ways to retain the information in dates without jeopardizing privacy. For example, one could jitter the dates by adding a random offset. However, the way to do this depends on context and can be subtle. For example, Netflix jittered the dates in its Netflix Prize data set by +/- two weeks, but this was not enough to prevent a privacy breach [1]. And if you add too much randomness, the utility of the data degrades. That’s why the HIPAA Privacy Rule includes the provision to obtain expert determination that your procedures are adequate in your context.
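
Here is a minimal sketch of date jittering, assuming a uniform offset of up to two weeks as in the Netflix example, and keeping in mind that this was not sufficient in that case. If related dates must keep their relative spacing, such as admission and discharge, you would draw one offset per individual rather than one per date.

    from datetime import date, timedelta
    from random import randint

    def jitter(d, max_days=14):
        # Shift a date by a uniform random offset in [-max_days, max_days].
        return d + timedelta(days=randint(-max_days, max_days))

    print(jitter(date(2011, 3, 15)))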


[1] Arvind Narayanan and Vitaly Shmatikov. Robust De-anonymization of Large Sparse Datasets, or How to Break Anonymity of the Netflix Prize Dataset.

Can I have the last four digits of your social?


Imagine this conversation.

“Could you tell me your social security number?”

“Absolutely not! That’s private.”

“OK, how about just the last four digits?”

“Oh, OK. That’s fine.”

When I was in college, professors would post grades by the last four digits of student social security numbers. Now that seems incredibly naive, but no one objected at the time. Using these four digits rather than names would keep your grades private from the laziest observers, but not from anyone willing to put out a little effort.

There’s a widespread belief in the US that your social security number is a deep secret, and that telling someone your social security number gives them power over you akin to a fairy telling someone his true name. On the other hand, we also believe that telling someone just the last four digits of your SSN is harmless. Both are wrong. It’s not that hard to find someone’s full SSN, and revealing the last four digits gives someone a lot of information to use in identifying you.

In an earlier post I looked at how easily most people could be identified by the combination of birth date, sex, and zip code. We’ll use the analytical results from that post to look at how easily someone could be identified by their birthday, state, and the last four digits of their SSN [1]. Note that the previous post used birth date, i.e. including year, whereas here we only look at birth day, i.e. month and day but no year. Note also that there’s nothing special about social security numbers for our purposes. The last four digits of your phone number would provide just as much information.

If you know someone lives in Wyoming, and you know their birthday and the last four digits of their SSN, you can uniquely identify them 85% of the time, and in an additional 7% of cases you can narrow down the possibilities to just two people. In Texas, by contrast, the chances of a birthday and four-digit ID being unique are 0.03%. The chances of narrowing the possibilities to two people are larger but still only 0.1%.

Here are results for a few states. Note that even though Texas has between two and three times the population of Ohio, it’s over 100x harder to uniquely identify someone with the information discussed here.

|-----------+------------+--------+--------|
| State     | Population | Unique |  Pairs |
|-----------+------------+--------+--------|
| Texas     | 30,000,000 |  0.03% |  0.11% |
| Ohio      | 12,000,000 |  3.73% |  6.14% |
| Tennessee |  6,700,000 | 15.95% | 14.64% |
| Wyoming   |    600,000 | 84.84% |  6.97% |
|-----------+------------+--------+--------|
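
The table can be reproduced with the approximations derived in the simulation post later on: the proportion of unique combinations is about exp(−r) and the proportion narrowed down to pairs is about r exp(−r)/2, where r = z/d, z is the state population, and d = 365 × 10,000 is the number of birthday and four-digit combinations. A quick sketch:

    from math import exp

    d = 365 * 10_000  # birthdays times four-digit endings
    for state, z in [("Texas", 30_000_000), ("Ohio", 12_000_000),
                     ("Tennessee", 6_700_000), ("Wyoming", 600_000)]:
        r = z / d
        print(f"{state}: unique {100*exp(-r):.2f}%, pairs {100*r*exp(-r)/2:.2f}%")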


[1] In that post we made the dubious simplifying assumption that birth dates were uniformly distributed from 0 to 78 years. This assumption is not accurate, but it was good enough to prove the point that it’s easier to identify people than you might think. Here our assumptions are better founded. Birthdays are nearly uniformly distributed, though there are some slight irregularities. The last four digits of social security numbers are uniformly distributed, though the first digits are correlated with the state.

Revealing information by trying to suppress it

FAS posted an article yesterday explaining how blurring military installations out of satellite photos draws attention to them, showing exactly where they are and how big they are. The Russian mapping service Yandex Maps blurred out sensitive locations in Israel and Turkey. As the article says, this is an example of the Streisand effect, named after Barbra Streisand’s failed attempt to suppress photos of her Malibu home. Her efforts to protect her privacy caused the photos to go viral.

A similar issue arises in differential privacy. A practical implementation of differential privacy must rule out some queries a priori in some way that is not data-dependent. Otherwise it could be a violation of differential privacy either to answer or refuse to answer the same query. Short of refusing the query entirely, it could also be informative to reply “Are you sure you want to submit that query? It’s going to use up a lot of your privacy budget.”

Differential privacy adds random noise to query results, but in a more sophisticated way than Yandex blurring out portions of photos. Adding noise selectively reveals information about what the noise is trying to conceal. As the FAS article indicated, by blurring out only the sensitive areas, the blurred maps point out their exact location.

Selective security measures can have a similar effect. By placing greater protection on some information, you can mark that information as being particularly important. Once you realize this, otherwise nonsensical security measures start to make sense: applying uniform security results in excessive protection for some information, but keeps attackers from being able to identify the most valuable targets.


Simulating identification by zip code, sex, and birthdate

As mentioned in the previous post, Latanya Sweeney estimated that 87% of Americans can be identified by the combination of zip code, sex, and birth date. We’ll do a quick-and-dirty estimate and a simulation to show that this result is plausible. There’s no point being too realistic with a simulation because the actual data that Sweeney used is even more realistic. We’re just showing that her result is reasonable.

Quick estimate

Suppose average life expectancy is around 78 years. If everyone is between 0 and 78, then there are 78*365 possible birth dates and twice that many combinations of birth date and sex.

What’s the average population of a US zip code? We’ll use 9,330 for reasons explained in [1].

We have 56,940 possible birth date and sex combinations for 9,330 people. There have to be many unused birth date and sex combinations in a typical zip code, and it’s plausible that many combinations will only be used once. We’ll run a Python simulation to see just how many we’d expect to be used one time.

Python simulation

The array demo below will keep track of the possible demographic values, i.e. combinations of birth date and sex. We’ll loop over the population of the zip code, randomly assigning everyone to a demographic value, then see what proportion of people end up with a demographic value shared by no one else.

    from random import randrange
    from numpy import zeros

    zctasize = 9330      # average population of a ZCTA
    demosize = 365*78*2  # possible birth date and sex combinations
    demo = zeros(demosize)

    # Assign each inhabitant to a random demographic value.
    for _ in range(zctasize):
        d = randrange(demosize)
        demo[d] += 1

    # Count the people who are alone in their demographic value.
    unique = len(demo[demo == 1])
    print(unique/zctasize)

I ran this simulation 10 times and got values ranging from 84.3% to 85.7%.

Analytic solution

As Road White points out in the comments, you can estimate the number of unique demographics by a probability calculation.

Suppose there are z inhabitants in our zip code and d demographic categories. We’re assuming here (and above) that all demographic categories are equally likely, even though that’s not true of birth dates.

We start by looking at a particular demographic category. The probability that exactly one person falls in that category is

\frac{z}{d}\left(1 - \frac{1}{d}\right)^{z-1}

To find the expected number of demographic slots with exactly one entry we multiply by d, and to get the proportion p of people this represents we divide by z, leaving p = (1 - 1/d)^(z-1). Taking logarithms,

\log p = (z-1)\log\left(1 - \frac{1}{d}\right) \approx - \frac{z}{d}

and so

p \approx \exp(-z/d)

which in our case is 84.8%, consistent with our simulation above.

Here we used the Taylor series approximation

\log(1 + x) = x + {\cal O}(x^2)

If z is of the same magnitude as d or smaller, then the error in the approximation above is O(1/d).

You could also work this as a Poisson distribution, then condition on the probability of a slot being occupied.

By the way, a similar calculation shows that the expected number of demographic slots containing exactly two people, as a proportion of the population z, is r exp(-r)/2 where r = z/d. So while 85% can be uniquely identified, another 7% can be narrowed down to two possibilities.
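
As a quick numerical check of both formulas for the zip code example above:

    from math import exp

    r = 9330 / (365*78*2)    # z/d for an average ZCTA
    print(exp(-r))           # about 0.849, matching the simulation
    print(r * exp(-r) / 2)   # about 0.070, the pairs proportion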


[1] I don’t know, but I do know the average population of a “zip code tabulation area” or ZCTA, and that’s 9,330 according to the data linked to here. As I discuss in that post, the Census Bureau reports population by ZCTA, not by zip code per se, for several reasons. Zip codes can cross state boundaries, they are continually adjusted, and some zip codes contain only post office boxes.

No funding for uncomfortable results

In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous. The state of Massachusetts had released data on 135,000 state employees and their families with obvious identifiers removed. However, the data contained zip code, birth date, and sex for each individual. Sweeney was able to cross-reference this data with publicly available voter registration data to find the medical records of then-governor of Massachusetts William Weld.

An estimated 87% of Americans can be identified by the combination of zip code, birth date, and sex. A back-of-the-envelope calculation shows that this should not be surprising, but Sweeney appears to be the first to do this calculation and pursue the results. (Update: See such a calculation in the next post.)

In her paper Only You, Your Doctor, and Many Others May Know, Sweeney says that her research was unwelcome. Over 20 journals turned down her paper on the Weld study, and nobody wanted to fund privacy research that might reach uncomfortable conclusions.

A decade ago, funding sources refused to fund re-identification experiments unless there was a promise that results would likely show that no risk existed or that all problems could be solved by some promising new theoretical technology under development. Financial resources were unavailable to support rigorous scientific studies otherwise.

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.


Poetic description of privacy-preserving analysis

Erlingsson et al. give a poetic description of privacy-preserving analysis in their RAPPOR paper [1]. They say that the goal is to

… allow the forest of client data to be studied, without permitting the possibility of looking at individual trees.


[1] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. Available here.

Rényi Differential Privacy

Differential privacy, specifically ε-differential privacy, gives strong privacy guarantees, but it can be overly cautious by focusing on worst-case scenarios. The generalization (ε, δ)-differential privacy was introduced to make ε-differential privacy more flexible.

Rényi differential privacy (RDP) is a new generalization of ε-differential privacy by Ilya Mironov that is comparable to the (ε, δ) version but has several advantages. For instance, RDP is easier to interpret than (ε, δ)-DP and composes more simply.

Rényi divergence

My previous post discussed Rényi entropy. Rényi divergence is to Rényi entropy what Kullback-Leibler divergence is to Shannon entropy.

Given two random variables X and Y that take on n possible values each with positive probabilities p_i and q_i respectively, the Rényi divergence of X from Y is defined by

D_\alpha(X || Y) = \frac{1}{\alpha -1} \log\left( \sum_{i=1}^n \frac{p_i^\alpha}{q_i^{\alpha -1}}\right)

for α > 0 and not equal to 1. The definition extends to α = 0, 1, or ∞ by taking limits.

As α converges to 1, D_α converges to the Kullback-Leibler divergence.
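
As a quick numerical illustration with made-up distributions, the divergence can be computed directly from the definition, and values of α near 1 approach the Kullback-Leibler divergence:

    import numpy as np

    def renyi_divergence(p, q, alpha):
        # D_alpha(X || Y) = log( sum of p_i^alpha / q_i^(alpha-1) ) / (alpha - 1)
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return np.log(np.sum(p**alpha / q**(alpha - 1))) / (alpha - 1)

    p = [0.2, 0.5, 0.3]
    q = [0.3, 0.3, 0.4]
    for alpha in [2, 1.5, 1.01, 1.0001]:
        print(alpha, renyi_divergence(p, q, alpha))

    # Kullback-Leibler divergence for comparison
    print(sum(pi*np.log(pi/qi) for pi, qi in zip(p, q)))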

Rényi differential privacy

A couple weeks ago I wrote an introduction to differential privacy that started by saying

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not.

The post develops precisely what it means for a query to be differentially private. The bottom line (literally!) was

An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D’,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.

It turns out that this definition is equivalent to saying the Rényi divergence of order ∞ between A(D) and A(D’) is at most ε. That is,

D_\alpha\big(A(D) || A(D')\big) \leq \varepsilon

where α = ∞. The idea of Rényi differential privacy (RDP) is to allow the possibility that α could be finite [1]. That is, an algorithm is (α, ε)-RDP if the Rényi divergence of order α between its outputs on any two neighboring databases is no more than ε.

One of the nice properties of RDP is how simply algorithms compose: The composition of an (α, ε1)-RDP algorithm and an (α, ε2)-RDP algorithm is an (α, ε1 + ε2)-RDP algorithm. This leads to simple privacy budget accounting.

Comparison to (ε, δ)-differential privacy

The composition of ε-DP algorithms is simple: just substitute α = ∞ in the result above for composing RDP algorithms. However, the resulting bound may be overly pessimistic.

The composition of (ε, δ)-DP algorithms is not as simple: both ε and δ change. With RDP, the order α does not change, and the ε values simply add.

Mironov proves in [1] that every RDP algorithm is also an (ε, δ)-DP algorithm. Specifically, Proposition 3 says

If f is an (α, ε)-RDP mechanism, it also satisfies

\left(\varepsilon - \frac{\log \delta}{\alpha - 1}, \delta\right)

differential privacy for any 0 < δ < 1.

This tells us that Rényi differential privacy gives us privacy guarantees that are somewhere between ε-differential privacy and (ε, δ)-differential privacy.
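
In code, the conversion in Proposition 3 is a one-liner. A sketch, with variable names of my own choosing:

    from math import log

    def rdp_to_dp(alpha, eps, delta):
        # (alpha, eps)-RDP implies (eps - log(delta)/(alpha - 1), delta)-DP
        # for any 0 < delta < 1 (Mironov, Proposition 3).
        return eps - log(delta) / (alpha - 1), delta

    print(rdp_to_dp(10, 0.5, 1e-6))  # about (2.035, 1e-6)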


[1] Ilya Mironov, Rényi Differential Privacy. arXiv 1702.07476

International internet privacy law


Scott Hanselman interviewed attorney Gary Nissenbaum in show #647 of Hanselminutes. The title was “How GDPR is effecting the American Legal System.”

Can Europe pass laws constraining American citizens? Didn’t we settle that question in 1776, or at least by 1783? And yet it is inevitable that European law affects Americans. And in fact Nissenbaum argues that every country has the potential to pass internet regulation affecting citizens of every other country in the world.

Hanselman: Doesn’t that imply that we can’t win? There’s two hundred and something plus countries in the world and if any European decides to swing by a web site in Djibouti now they’re going to be subject to laws of Europe?

Nissenbaum: I’ll double down on that. It implies that any country that has users of the internet can create a more stringent law than even the Europeans, and then on the basis of that being the preeminent regulatory body of the world, because it’s a race to who can be the most restrictive. Because the most restrictive is what everyone needs to comply with.

So if Tanzania decides that it is going to be the most restrictive country in terms of the laws … that relate to internet use of their citizens, theoretically, all web sites around the world have to be concerned about that because there are users that could be accessing their web site from Tanzania and they wouldn’t even know it.

Will the “world wide web” someday not be worldwide at all? There has been speculation, for example, that we’ll eventually have at least two webs, one Chinese and one non-Chinese. The web could tear into a lot more pieces than that.

As Nissenbaum says toward the end of the podcast

If anyone assumes there’s a simple way of handling this, they’re probably wrong. It is complicated, and you just have to live with that, because that’s the world we’re in.

Related post: GDPR and the right to be forgotten

Big data and privacy


How does big data impact privacy? Which is a bigger risk to your privacy, being part of a little database or a big database?

Rows vs Columns

People commonly speak of big data in terms of volume—the “four v’s” of big data being volume, variety, velocity, and veracity—but what we’re concerned with here might better be called “area.” We’ll think of our data being in one big table. If there are repeated measures on an individual, think of them as more columns in a denormalized database table.

In what sense is the data big: is it wide or long? That is, if we think of the data as a table with rows for individuals and columns for different fields of information on individuals, are there a lot of rows or a lot of columns?

All else being equal, your privacy goes down as columns go up. The more information someone has about you, the more likely some of it may be used in combination to identify you.

How privacy varies with the number of rows is more complicated. Your privacy could go up or down with the number of rows.

The more individuals in a dataset, the more likely there are individuals like you in the dataset. From the standpoint of k-anonymity, this makes it harder to identify you, but easier to reveal information about you.

Group privacy

For example, suppose there are 50 people who have all the same quasi-identifiers as you do. Say you’re a Native American man in your 40’s and there are 49 others like you. Then someone who knows your demographics can’t be sure which record is yours; they’d have a 2% chance of guessing the right one. The presence of other people with similar demographics makes you harder to identify. On the other hand, their presence makes it more likely that someone could find out something you all have in common. If the data show that middle aged Native American men are highly susceptible to some disease, then the data imply that you are likely to be susceptible to that disease.

Because privacy measures aimed at protecting individuals don’t necessarily protect groups, some minority groups have been reluctant to participate in scientific studies. The Genetic Information Nondiscrimination Act of 2008 (GINA) makes some forms of genetic discrimination illegal, but it’s quite understandable that minority groups might still be reluctant to participate in studies.

Improving privacy with size

Your privacy can improve as a dataset gets bigger. If there are a lot of rows, the data curator can afford a substantial amount of randomization without compromising the value of the data. The noise in the data will not affect statistical conclusions from the data but will protect individual privacy.

With differential privacy, the data is held by a trusted curator. Noise is added not to the data itself but to the results of queries on the data. Noise is added in proportion to the sensitivity of a query. The sensitivity of a query often goes down with the size of the database, and so a differentially private query of a big dataset may only need to add a negligible amount of noise to maintain privacy.
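
To illustrate that last point, here is a sketch of a differentially private mean query using the standard Laplace mechanism. The bounds lo and hi are assumptions needed to bound the sensitivity: the mean of n values confined to [lo, hi] has sensitivity (hi − lo)/n, so the noise shrinks as the dataset grows.

    import numpy as np

    rng = np.random.default_rng()

    def private_mean(data, eps, lo, hi):
        # Clamp values to [lo, hi] so the sensitivity bound holds.
        data = np.clip(data, lo, hi)
        sensitivity = (hi - lo) / len(data)  # shrinks as data grows
        return np.mean(data) + rng.laplace(scale=sensitivity / eps)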

If the dataset is very large, it may be possible to randomize the data itself before it enters the database using randomized response or local differential privacy. With these approaches, there’s no need for a trusted data curator. This wouldn’t be possible with a small dataset because the noise would be too large relative to the size of the data.
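
Here is a minimal sketch of classic randomized response for a yes/no question. Each respondent flips a coin: heads, answer truthfully; tails, answer at random. No curator ever sees a raw answer, yet the true proportion can still be estimated from the noisy ones.

    from random import random

    def randomized_response(truth):
        # Heads: tell the truth. Tails: answer at random.
        if random() < 0.5:
            return truth
        return random() < 0.5

    # If p is the true proportion of yeses, the expected proportion of
    # noisy yeses is p/2 + 1/4, so estimate p as 2*observed - 1/2.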


East Coast Code and West Coast Code

In 1999, Harvard law professor Lawrence Lessig stated that software is a kind of law. The algorithms of tech giants have an influence comparable to, and maybe in some ways greater than, statutory law.

The code is law. … Activists concerned with defending liberty, privacy or access must watch the code coming from the Valley—call it West Coast Code—as much as the code coming from Congress—call it East Coast Code.

I imagine more people would agree with him now than did when he said this a couple decades ago.

If you can call statutory law “code” then you could also call algorithms “law” and speak of East Coast Law and West Coast Law.

Source: The Architecture of Privacy

What is differential privacy?

Differential privacy is a strong form of privacy protection with a solid mathematical definition.

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not. This intuitive idea can be made precise as follows.


Queries and algorithms

First of all, differential privacy is something that applies to queries, not to databases. It could apply to a database if your query is to select everything in the database, but typically you want to run far more specific queries. Differential privacy adds noise to protect privacy, and the broader the query, the more noise must be added, so typically you want queries to be more narrow than “Tell me everything about everything.”

Generally the differential privacy literature speaks of algorithms rather than queries. The two are the same if you have a general idea of what a query is, but using the word algorithm emphasizes that the query need not be a simple SQL query.

Opt-out and neighboring databases

To quantify the idea of “whether your information is included or not” we look at two versions of a database, D and D’, that differ by one row, which you can think of as the row containing your data. Our formalism is symmetric, so we do not specify which database, D or D’, is the one which contains your row and which is missing your row.

We should mention some fine print here. The paragraph above implicitly assumes that your database is just a simple table. In general we can speak of neighboring databases. In the case of a more complex database design, the notion of what it means for two databases to be neighboring is more complex. Differential privacy can be defined whenever a notion of neighboring databases can be defined, so you could, for example, consider differential privacy for a non-relational (NoSQL) database. But for the simple case of a single table, two databases are neighboring if they differ by a row.

However, there’s a subtlety regarding what it means even for two tables to “differ by one row.” Unbounded differential privacy assumes one row has been added to or removed from one of the databases. Bounded differential privacy assumes the two databases have the same number of rows, but the data in one row has been changed. In practice the difference between bounded and unbounded differential privacy doesn’t often matter.

Quantifying difference

We said it “makes little difference” whether an individual’s data is included or not. How do we quantify that statement? Differential privacy is a stochastic procedure, so we need to measure difference in terms of probability.

Whenever mathematicians want to speak of two things being close together, we often use ε as our symbol. While in principle ε could be large, the choice of symbol implies that you should think of it as a quantity you can make as small as you like (though there may be some cost that increases as ε decreases).

We’re going to look at ratios of probabilities, so we want these ratios to be near 1. This means we’ll work with eε = exp(ε) rather than ε itself. A convenient property that falls out of the Taylor series for the exponential function is that if ε is small then exp(ε) is approximately 1 + ε. For example, exp(0.05) = 1.0513, and so if the ratio of two probabilities is bounded by exp(0.05), the probabilities are roughly within 5% of each other.

Definition of differential privacy

Now we can finally state our definition. An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D’,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.
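
For a concrete example, consider a counting query with sensitivity 1, answered with Laplace noise of scale 1/ε, which is the standard Laplace mechanism. A numerical sketch confirms that the ratio of the output densities under two neighboring databases never exceeds exp(ε):

    import numpy as np

    eps = 0.05
    count1, count2 = 100, 101  # counts on neighboring databases differ by 1
    t = np.linspace(50, 150, 100_001)

    # Laplace densities of the noisy answers under each database
    f1 = 0.5 * eps * np.exp(-eps * np.abs(t - count1))
    f2 = 0.5 * eps * np.exp(-eps * np.abs(t - count2))

    print(np.max(f1/f2), np.exp(eps))  # both about 1.0513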

Update: See a follow-on post about Rényi differential privacy, a generalization of differential privacy that measures the difference between output probability distributions using information theory rather than bounding ratios. This includes the definition above as a special case.

Help with differential privacy

If you’d like help understanding and implementing differential privacy, let’s talk.


Earth mover distance and t-closeness


There’s an old saying that if you want to hide a tree, put it in a forest. An analogous principle in privacy is that a record preserves privacy if it’s like a lot of other records.

k-anonymity

The idea of k-anonymity is that every database record appears at least k times when you restrict your attention to quasi-identifiers [1]. If you have a lot of records and few fields, your value of k could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If k = 1, then k-anonymity offers no anonymity. Even when k is large, k-anonymity might prevent re-identification but still suffer from attribute disclosure.
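
Computing the k of a table is straightforward: group by the quasi-identifiers and take the size of the smallest group. A sketch using pandas, with hypothetical column names:

    import pandas as pd

    def k_anonymity(df, quasi_identifiers):
        # k is the size of the smallest group of records that share
        # the same combination of quasi-identifier values.
        return int(df.groupby(quasi_identifiers).size().min())

    # Example: k = k_anonymity(df, ["zip", "sex", "birth_date"])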

Another problem with k-anonymity is that it doesn’t offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.

Or going the other way around, if you already know something that stands out about a group, this could help you identify the record belonging to an individual. That is, the method is subject to a background knowledge attack.

One way to address this shortcoming is l-diversity. This post won’t go into l-diversity because it’s an intermediate step to where we want to go, which is t-closeness.

t-closeness

The idea of t-closeness is that the distribution of sensitive data in every group is not too far from the distribution in the full population. The “t” comes from requiring that the distributions be no more than a distance t apart in a sense that we’ll define below [2]. If the sensitive data in a group doesn’t stand out, this thwarts the homogeneity attack and the background knowledge attack.

Earth mover distance

When we say that the distribution on sensitive data within a group is not far from the distribution in the full data, how do we quantify what “far” means? That is, how do we measure the distance between two distributions?

There are a lot of ways to measure the similarity of two probability distributions. A common choice is the Kullback-Leibler divergence, though that’s not what we’ll use here. Instead, t-closeness uses the so-called earth mover distance (EMD), also known as the Wasserstein metric.

The idea of EMD is to imagine both probability distributions as piles of dirt and calculate the minimum amount of work needed to reshape the first pile so that it has the same shape as the second. The key attribute of EMD is that it takes distance into account.

Suppose your data is some ordered response, 1 through 5. Suppose distribution X has probability 0.8 at 1 and 0.05 for the rest of the responses. Distributions Y and Z are the same except they have 80% of their probability mass at 2 and at 5 respectively. By some measures, X is equally far from Y and Z, but the earth mover distance would say that X is closer to Y than to Z, which is more appropriate in our setting.

We can calculate the EMD for the example above. To transform X into Y, we need to move a probability mass of 0.75 from 1 to 2, and so the EMD is 0.75. To transform X into Z we need to move the same amount of mass, but we need to move it 4x further, and so the EMD is 3.
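
This calculation can be checked numerically. Here is a sketch using SciPy’s wasserstein_distance, which computes the one-dimensional EMD from value positions and weights:

    from scipy.stats import wasserstein_distance

    values = [1, 2, 3, 4, 5]
    X = [0.80, 0.05, 0.05, 0.05, 0.05]
    Y = [0.05, 0.80, 0.05, 0.05, 0.05]
    Z = [0.05, 0.05, 0.05, 0.05, 0.80]

    print(wasserstein_distance(values, values, X, Y))  # 0.75
    print(wasserstein_distance(values, values, X, Z))  # 3.0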


[1] What exactly is a quasi-identifier? That’s hard to say precisely. An identifier is something that clearly identifies an individual. A phone number, for example, is an identifier. A quasi-identifier is something that helps narrow down who a person is, but isn’t an identifier. For example, someone’s birthday would be a quasi-identifier. But in general “quasi-identifier” is not a precisely defined concept.

[2] It’s common in math to use a variable name as an adjective, as with k-anonymity, l-diversity, and t-closeness. This is unfortunate because it isn’t descriptive and locks in a variable naming convention. Someone used the variable k to count the number of records sharing the same quasi-identifiers, and the name stuck. Similarly the l in l-diversity counts something.

As a sidenote, the variable names here follow the old FORTRAN convention that variables whose names begin with the letters I through N represent integers by default, and that all other variables represent real (floating point) numbers by default. The t in t-closeness is a continuous measure, so the t is a real value, while k and l are integers.

Format-preserving encryption (FPE) for privacy

The idea of format-preserving encryption is to encrypt data while keeping its form, a sort of encryption in kind. An encrypted credit card number would look like a credit card number, a string of text would be replaced with a string of text, etc.

Format preserving encryption (FPE) is useful in creating a test or demo database. You want realistic data without having accurate data (at least for sensitive data fields), and using FPE on real data might be the way to go.

If a field is supposed to contain a 10-digit number, say a phone number, you want the test data to also contain a 10-digit number. Otherwise validation software might break. And if that number is a key that links tables together, simply substituting a random number would break the relationships unless the same random replacement was used everywhere. Also, two different clear numbers could be replaced with the same randomly chosen value, a collision that encryption, being invertible, cannot produce. FPE would be a simple way to avoid these problems.
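
Real FPE uses vetted constructions such as NIST’s FF1 and FF3. Purely to illustrate how a cipher can map 10-digit numbers to 10-digit numbers deterministically and invertibly, here is a toy Feistel sketch. It is not secure and is only meant to show the format-preserving idea.

    import hmac, hashlib

    def toy_fpe_10_digits(n, key, rounds=4):
        # Toy Feistel network over two 5-digit halves. Deterministic and
        # invertible, so equal inputs map to equal outputs and key
        # relationships between tables are preserved. NOT a vetted cipher.
        left, right = divmod(n, 10**5)
        for i in range(rounds):
            digest = hmac.new(key, f"{i}:{right}".encode(), hashlib.sha256).digest()
            f = int.from_bytes(digest, "big") % 10**5
            left, right = right, (left + f) % 10**5
        return left * 10**5 + right  # 10 digits, leading zeros allowed

    print(toy_fpe_10_digits(5551234567, b"secret key"))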

FPE is a two-edged sword. It may be desirable to preserve formatting, but it could also cause problems. Using any form of encryption, format-preserving or not, to preserve the database structure could reveal information you don’t intend to reveal.

It’s quite possible to encrypt data and still compromise privacy. If you encrypt data, but not metadata, then you might keep enough information to re-identify individuals. For example, if you encrypt someone’s messages but retain the time stamp on the messages, that might be enough to identify that person.

The meaning of “format-preserving” can vary, and that could inadvertently leak information. What does it mean to encrypt English text in a format-preserving way? It could mean that English words are replaced with English words. If this is done simplistically, then the number of words in the clear text is revealed. If a set of English words is replaced with a variable number of English words, you’re still revealing that the original text was English.

FPE may not reveal anything that wasn’t already known. If you know that a field in a database contains a 9-digit number, then encrypting it as a 9-digit number doesn’t reveal anything per se. But it might be a problem if it reveals that two numbers are the same number.

What about errors? What happens if a field that is supposed to be a 9-digit number isn’t? Maybe it contains a non-digit character, or it contains more than 9 digits. The encryption software should report an error. But if it doesn’t, maybe the encrypted output is not a 9-digit number, revealing that there was an error in the input. Maybe that’s a problem, maybe not. Depends on context.