Differential entropy and privacy

Differential entropy is the continuous analog of Shannon entropy. Given a random variable X with density function fX, the differential entropy of X, denoted h(X), is defined as

h(X) = -\int f_X(x) \log_2 f_X(x)\, dx

where the integration is over the support of fX. You may see differential entropy defined using logarithm to a different base, which changes h(X) by a constant amount.

In [1] the authors defined the privacy of a random variable X, denoted Π(X), as 2 raised to the power h(X).

\Pi(X) = 2^{h(X)}

This post will only look at “privacy” as defined above. Obviously the authors chose the name because of its application to privacy in the colloquial sense, but here we will just look at the mathematics.

Location and scale

It follows directly from the definitions above that location parameters do not effect privacy, and scale parameters change privacy linearly. That is, for σ > 0,

\Pi(\sigma X + \mu) = \sigma \,\Pi(X)

If we divide by standard deviation before (or after) computing privacy then we have a dimensionless quantity. Otherwise there’s more privacy is measuring a quantity in centimeters than in measuring it in inches, which is odd since both contain the same information.

Examples

If X is uniformly distributed on an interval of length a, then h(X) = log2 a and Π(X) = a.

The privacy of a standard normal random variable Z is √(2πe) and so the privacy of a normal random variable with mean μ and variance σ² is σ√(2πe).

The privacy of a standard exponential random variable is 1, so the privacy of an exponential with rate λ is 1/λ.

Bounds

A well-known theorem says that for given variance, differential entropy is maximized by a normal random variable. This means that the privacy of a random variable with variance σ² is bounded above by σ√(2πe).

The privacy of a Cauchy random variable with scale σ is 4πσ, which is greater than σ√(2πe). This is does not contradict the statement above because the scaling parameter of a Cauchy random variable is not its standard deviation. A Cauchy random variable does not have and standard deviation.

Related posts

[1] Agrawal D., Aggrawal C. C. On the Design and Quantification of Privacy-Preserving Data Mining Algorithms, ACM PODS Conference, 2002. (Yes, the first author’s name contains one g and the second author’s name contains two.)

Identifiers depend on context

Can you tell who someone is from their telephone number? That’s kinda the point of telephone numbers, to let you contact someone. And indeed telephone number is one the 18 identifiers under HIPAA Safe Harbor.

But whether any piece of information allows you to identify someone depends on context. If you don’t have access to a phone, or a phone book, or any electronic counterpart of a phone book, then a phone number doesn’t allow you to identify anyone. But once you can call a phone number or enter it into a search engine, then the phone number is identifiable. Maybe.

What if the number belongs to a burner phone? Then it would be harder to learn the identity of the person who owns the number, but not impossible. Maybe you couldn’t learn anything about the owner, but law enforcement officials could. Again identifiability depends on context.

An obvious identifier like a phone number might not be an identifier in some special circumstance. And an apparently useless bit of information might reveal someone’s identity in another circumstance.

HIPAA’s Safe Harbor Rule tries to say apart from context what kinds of data are identifiable. But if you read the Safe Harbor Rule carefully you’ll notice it isn’t so context-free as it seems. The last item in the list of 18 items to remove is “any other unique identifying number, characteristic, or code.” What might be an identifying characteristic? That depends on context.

The 19th rule of HIPAA Safe Harbor

The HIPAA Safe Harbor provision says that data can be considered deidentified if 18 kinds of data are removed or reported at low resolution. At the end of the list of 18 items, there is an extra category, sometimes informally called the 19th rule:

The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.

So if you otherwise meet the letter of the Safe Harbor provision, but you know (or should know) that the data can still be used to identify people represented in the data, then Safe Harbor does not apply.

The Department of Health and Human Services guidance document gives four examples of “when a covered entity would fail to meet the ‘actual knowledge’ provision.” The first concerns a medical record that would reveal someone’s identity by revealing their profession.

Revealing that someone is a plumber would probably not be a privacy risk, but in the HHS example someone’s occupation was listed as a former state university president. If you know what state this person is in, that greatly narrows down the list possibilities. One more detail, such as age, might be enough to uniquely identify this person.

Free text fields, such as physician notes, could easily contain this kind of information. Software that removes obvious names won’t catch this kind of privacy leak.

Not only are intentional free text fields a problem, so are unintentional free text fields. For example, a database field labeled CASENOTES is probably intended to contain free text. But other text fields, particularly if they are wider than necessary to contain the anticipated data, could contain identifiable information.

If you have data that does not fall under the Safe Harbor provision, or if you are not sure the Safe Harbor rules are enough to insure that the data are actually deidentified, let’s talk.

Related posts

This post is not legal advice. My clients are often lawyers, but I am not a lawyer.

Natural language processing and unnatural text

I recently evaluated two software applications designed to find PII (personally identifiable information) in free text using natural language processing. Both failed badly, passing over obvious examples of PII. By contrast, I also tried natural language processing software on a nonsensical poem, it the software did quite well.

Doctor’s notes

It occurred to me later that the software packages to search for PII probably assume “natural language” has the form of fluent prose, not choppy notes by physicians. The notes that I tested did not consist of complete sentences marked up with grammatically correct punctuation. The text may have been transcribed from audio.

Some software packages deidentify medical notes better than others. I’ve seen some work well and some work poorly. I suspect the former were written specifically for their purpose and the latter were more generic.

Jabberwocky

I also tried NLP software on Lewis Carroll’s poem Jabberwocky. It too is unnatural language, but in a different sense.

Jabberwocky uses nonsense words that Carroll invented for the poem, but otherwise it is grammatically correct. The poem is standard English at the level of structure, though not at the level of words. It is the opposite of medical notes that are standard English at the word level (albeit with a high density of technical terms), but not at a structural level.

I used the spaCy natural language processing library on a couple stanzas from Lewis’ poem.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand;
Long time the manxome foe he sought—
So rested he by the Tumtum tree
And stood awhile in thought.

I fed the lines into spaCy and asked it to diagram the lines, indicating parts of speech and dependencies. The software did a good job of inferring the use of even the nonsense words. I gave the software one line at a time rather than a stanza at a time because the latter results in diagrams that are awkwardly wide, too wide to display here. (The spaCy visualization software has a “compact” option, but this option does not make the visualizations much more compact.)

Here are the visualizations of the lines.








And here is the Python code I used to create the diagrams above.

    import spacy
    from spacy import displacy
    from pathlib import Path
    
    nlp = spacy.load("en_core_web_sm")
        
    lines = [
        "Beware the Jabberwock, my son!",
        "The jaws that bite, the claws that catch!",
        "Beware the Jubjub bird",
        "Shun the frumious Bandersnatch!",
        "He took his vorpal sword in hand.",
        "Long time the manxome foe he sought",
        "So rested he by the Tumtum tree",
        "And stood awhile in thought."
    ]
    
    for line in lines:
        doc = nlp(line)
        svg = displacy.render(doc, style="dep", jupyter=False)    
        file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
        output_path = Path(file_name)
        output_path.open("w", encoding="utf-8").write(svg)

Related posts

How an LLM might leak medical data

Machine learning models occasionally memorize training data. Under the right prompt, a model could return portions of the training data verbatim.

If a large language model is trained on deidentified medical data, along with data that overlaps with the medical data, it could potentially leak details of a person’s medical history. I’m not saying that any particular LLM does leak data, but it’s something to watch out for if you are training an LLM.

One way to try to create a data leak would be to prompt the LLM with the data you have and see whether it will fill in more. For example, I asked Bing AI

Tell me about a man from Arkansas with ICD-10 code C50.921.

It came back with the following.

… Once upon a time, there was a man named John who lived in Arkansas. John was a hardworking man who loved spending time with his family and friends. One day, John noticed a lump in his right breast. He went to the doctor and was diagnosed with malignant neoplasm of unspecified site of the right male breast, classified under the ICD-10 code C50.921. …

I chose this diagnosis code because male breast cancer is extremely rare, and so the risk of reidentification is higher than for a more common diagnosis.

The model filled in the detail that the man’s name is John. This is not at all suspicious since John is the one of the most common male names in history. If the man’s name had been Vivek, there would be more cause to suspect the model is telling us about a real man name Vivek, though of course that could be a randomly chosen male name.

If a neural network were training on deidentified medical data, it could link fields together. If the story above had said “John, aged 42, …” the age might have been pulled from an actual patient record.

If the data the network was trained on was deidentified well, even leaking data verbatim should not create more than a very small risk of identification. However, if the data contained tokens linking the records to publicly available information, such as real estate records—this happens—then our hypothetical LLM might reveal more personal details that could be used to narrow down whose data is being leaked.

Related posts

Query, then deidentify

Suppose you have a database of personally identifiable information (PII) and you want to allow someone else to query the data while protecting the privacy of the individuals represented by the data. There are two approaches:

  1. Deidentify, then query
  2. Query, then deidentify

The first approach is to do whatever is necessary to deidentify the data—remove some fields, truncate or randomize others, etc.—and then pose a query to this redacted data.

The second approach is to query the original data, then do whatever is necessary to deidentify the results.

In graphical terms, you can get from raw data to a deidentified result either by following the green arrows or the blue arrows below. In mathematical terms, this diagram does not commute.

The first approach is most common. A company that owns data (a “covered entity” in HIPAA terms) will deidentify it and license it to another company who then queries it. The second approach is becoming more common, where a company will license access to querying their data.

Pros and cons

Which approach is better? If by better you mean more accurate results, it’s always best to query first then deidentify. The order in which you do things matters, and deidentifying as late as possible preserves information.

The situation is analogous to carrying out a sequence of steps on a calculator. If you want your final result to be accurate to two decimal places, you first carry out all your operations to as much precision as you can, then round the final result. If you round your numbers first, you probably will get less accurate results, maybe even useless results.

However, deidentifying data before querying it is better in some non-mathematical ways. Data scientists want the convenience of working with the data with their tools in their environment. They want to possess (a deidentified version of) the data rather than have access to query the (exact) data. They also want the freedom to run ad hoc queries [1].

There are logistical and legal details to work out in order to license access to query data rather than licensing the data. But it is doable, and companies are doing it.

Why query first

When you deidentify data first, you have to guard against every possible use of the data. But when you deidentify data last, you only have to guard against the actual use of the data.

For example, suppose you are considering creating a new clinic and you would like to know how many patients of a certain type live closer to the location you have in mind than the nearest alternative. A data vendor cannot give you exact locations of patients. If they were to release such data, they’d have to obscure the addresses somehow, such as giving you the first three digits of zip codes rather than full addresses. But if you could ask your query of someone holding the full data, they may tell you exactly what you want to know.

Some queries may pose no privacy risk, and the data holder can return exact results. Or they may need to jitter the result a bit in order to protect privacy, for reasons explained here. But it’s better to jitter an exact result than to jitter your data before computing.

How to query first

The query-first approach requires a trusted party to hold the unredacted data. There are a variety of ways the data holder can license access, from simple to sophisticated, and in between.

The simplest approach would be for the data holder to sell reports. Maybe the data holder offers a predetermined set of reports, or maybe they allow requests.

The most sophisticated approach would be to use differential privacy. Clients are allowed to pose any query they wish, and a query manager automatically adds an amount of randomness to the results in proportion to the sensitivity of the query. All this is done automatically according to a mathematical model of privacy with no need for anyone to decide a priori which queries will be allowed.

There are approaches conceptually between pre-determined reports and differential privacy, offering more flexibility than the former and being easier to implement than the latter. There’s a lot of room for creativity in this space.

Related posts

[1] Being able to run ad hoc queries with no privacy budget is certainly simpler, in the same way that an all-you-can-eat buffet is simpler than ordering food à la carte. But it also means the price is higher. Deidentifying an entire data set entails more loss of accuracy that deidentifying a set of queries.

What can you learn from a credit card number?

The first 4 to 6 digits of a credit card number are the bank identification number or BIN. The information needed to decode a BIN is publicly available, with some effort, and so anyone could tell from a credit card number what institution issued it, what bank it draws on, whether its a personal or business card, etc.

Suppose your credit card number was exposed in a data breach. Someone makes a suspicious purchase with your card, the issuer contacts you, you cancel the card, and you get a new card from the same source. The number can no longer be used to make purchases on your account, but what information did it leave behind?

The cancelled number might tell someone where you used to bank, which is probably where you still bank. And it may tell them the first few digits of your new card since the new card is issued by the same institution [1]. If the old BIN doesn’t directly reveal your new BIN, it at least narrows down the possibilities.

The information in your BIN, by itself, will not identify you, but it does provide clues that might lead to identifying you when combined with other information.

Related posts

[1] According to Andrew in the comments, American Express often changes credit card numbers as little as possible when issuing a replacement, changing only one content digit and the checksum.

First names and Bayes’ theorem

Is the woman in this photograph more likely to be named Esther or Caitlin?

black and white photo of an elderly woman

Yesterday Mark Jason Dominus published wrote about statistics on first names in the US from 1960 to 2021. For each year and state, the data tell how many boys and girls were given each name.

Reading the data “forward” you could ask, for example, how common was it for girls born in 1960 to be named Paula. Reading the data “backward” you could ask how likely it is that a woman named Paula was born in 1960.

We do this intuitively. When you hear a name like Blanche you may think of an elderly woman because the name is uncommon now (in the US) but was more common in the past. Sometimes we get a bimodal distribution. Olivia, for example, has made a comeback. If you hear that a female is named Olivia, she’s probably not middle aged. She’s more likely to be in school or retired than to be a soccer mom.

Bayes’ theorem tells us how to turn probabilities around. We could go through Mark’s data and compute the probabilities in reverse. We could quantify the probability that a woman named Paula was born in the 1960s, for example, by adding up the probabilities that she was born in 1960, 1961, …, 1969.

Bayes theorem says

P(age | name) = P(name | age) P(age) / P(name).

Here the vertical bar separates the thing we want the probability of from the thing we assume. P(age | name), for example, is the probability of a particular age, given a particular name.

There is no bar in the probability in the denominator above. P(name) is the overall probability of a particular name, regardless of age.

People very often get probabilities backward; they need Bayes theorem to turn them around. A particular case of this is the prosecutor’s fallacy. In a court of law, the probability of a bit of evidence given that someone is guilty is irrelevant. What matters is the probability that they are guilty given the evidence.

In a paternity case, we don’t need to know the probability of someone having a particular genetic marker given that a certain man is or is not their father. We want to know the probability that a certain man is the father, given that someone has a particular genetic marker. The former probability is not what we’re after, but it is useful information to stick into Bayes’ theorem.

Related posts

Photo by Todd Cravens on Unsplash.

Identifiable to man or machine?

Like the previous post, this post riffs on a photo [1] I stumbled on while looking for something else.

Would it be easier to identify the man in this photo or the man whose photo appeared in the previous post, copied below.

I think it would be easier for a human to recognize the person in the first image. But what about a computer?

We humans identify people most easily by their faces, and especially by their eyes. These features are easier to see in the first photo. But what might we find if we applied some image processing to the two photos? Maybe the green man’s facial features could be exposed by some diligent processing. We see more of the second man’s body. Maybe a computer algorithm could extract more information out of the second image for this reason.

Photographs may, and often do, contain Exif (Exchangeable image file format) metadata, such as the GPS coordinates of the camera at the time the photo was taken. A photo taken when the lens cap on by mistake might contain a good deal of information about the subject even though the photo per se is useless. This information can be useful to the photographer, but it could also pose a privacy risk. Before posting a photo publicly, you might want to strip out the metadata.

As I noted in the previous post, the privacy risk from data depends on context. Suppose the metadata embedded in a photo contains the serial number of the camera. That serial number would not help most people identify a photo(grapher), but it would someone who had access to a database linking serial numbers to the customers who purchased the cameras.

Was the first photo created by actually projecting the red and green lights onto the subject, or were these added in post production? For that matter, was there actually a man who posed for the photo or was the image synthetically generated? A forensic investigation of the photo might be able to answer these questions.

[1] Photo by Sebastian Mark on Unsplash

Privacy and tomography

I ran across the image below [1] when I was searching for something else, and it made me think of a few issues in data privacy.

green grid of light on young man

The green lights cut across the man’s body like tomographic imaging. No one of these green lines would be identifiable, but maybe the combination of the lines is.

We can tell it’s a young man in the image, someone in good shape. But is this image identifiable?

It’s identifiable to the photographer. It’s identifiable to the person in the photo.

If hundreds of men were photographed under similar circumstances, and we knew who those men were, we could probably determine which of them is in this particular photo.

The identifiability of the photo depends on context. The HIPAA Privacy Rule acknowledges that the risk of identification of individuals in a data set depends on who is receiving the data and what information the recipient could reasonably combine the data with. The NSA could probably tell who the person in the photo is; I cannot.

Making photographs like this available is not a good idea if you want to protect privacy, even if its debatable whether the person in the photo can be identified. Other people photographed under the same circumstances might be easier to identify.

Differential privacy takes a more conservative approach. Differential privacy advocates argue that deidentification procedures should not depend on what the recipient knows. We can never be certain what another person does or doesn’t know, or what they might come to know in the future.

Conventional approaches to privacy consider some data more identifiable than other data. Your phone number and your lab test results from a particular time and place are both unique, but the former is readily available public information and the latter is not. Differential privacy purists would say both kinds of data should be treated identically. A Bayesian approach might be to say we need to multiply each risk by a prior probability: the probability of an intruder looking up your phone number is substantially greater than the probability of the intruder accessing your medical records.

Some differential privacy practitioners make compromises, agreeing that not all data is equally sensitive. Purists decry this as irresponsible. But if the only alternatives are pure differential privacy and traditional approaches, many clients will choose the latter, even though an impure approach to differential privacy would have provided better privacy protection.

[1] Photo by Caspian Dahlström on Unsplash