Zero-Concentrated Differential Privacy

Differential privacy can be rigid and overly conservative in practice, and so finding ways to relax pure differential privacy while retaining its benefits is an active area of research. Two approaches to doing this are concentrated differential privacy [1] and Rényi differential privacy [3].

Concentrated differential privacy was used in reporting results from the 2020 US Census; specifically, zero-concentrated differential privacy with Gaussian noise.

Differential privacy quantifies the potential impact of an individual’s participation or lack of participation in a database and seeks to bound the difference. The original proposal for differential privacy and the approaches discussed here differ in how they measure the difference an individual can make. Both concentrated differential privacy (CDP) and Rényi differential privacy (RDP) use Rényi divergence, though they use it in different ways.

In [3], Mironov discusses the similarities and differences between CDP and RDP. (I changed Mironov’s reference numbers to the reference numbers used here.)

The closely related work by Dwork and Rothblum [1], followed by Bun and Steinke [2], explore privacy definitions—Concentrated Differential Privacy and zero-Concentrated Differential Privacy—that are framed using the language of, respectively, subgaussian tails and the Rényi divergence. The main difference between our approaches is that both Concentrated and zero-Concentrated DP require a linear bound on all positive moments of a privacy loss variable. In contrast, our definition applies to one moment at a time. Although less restrictive, it allows for more accurate numerical analyses.

(α, ε)-RDP fixes values of α and ε and requires that the Rényi divergence of order α between the output distributions of a randomized mechanism M applied to two adjacent databases, databases that differ by the data on one individual, is bounded by ε.

D_\alpha(M(x) || M(x')) \leq \varepsilon

Zero-concentrated differential privacy (zCDP) with parameters ε and ρ requires that the Rényi divergence is bounded by ε + ρα for all α in (1, ∞).

D_\alpha(M(x) || M(x')) \leq \varepsilon + \rho\alpha
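
To make the linear bound concrete, here is a small Python sketch, not taken from the references, that checks the zCDP condition for the Gaussian mechanism. It uses the closed-form Rényi divergence between two normal distributions with equal variance; the sensitivity and noise level are arbitrary values chosen for illustration.

    import numpy as np

    def renyi_divergence_gaussian(alpha, delta, sigma):
        # Closed form for D_alpha( N(0, sigma^2) || N(delta, sigma^2) ),
        # i.e. the Gaussian mechanism applied to two adjacent databases
        # whose true query answers differ by delta.
        return alpha * delta**2 / (2 * sigma**2)

    delta = 1.0   # sensitivity of the query
    sigma = 2.0   # standard deviation of the added noise
    rho = delta**2 / (2 * sigma**2)   # claimed zCDP parameter (with epsilon = 0)

    # The zCDP condition should hold for every alpha in (1, infinity).
    for alpha in np.linspace(1.01, 1000, 200):
        assert renyi_divergence_gaussian(alpha, delta, sigma) <= rho * alpha + 1e-12
    print(f"Gaussian mechanism with sigma = {sigma} satisfies rho = {rho} zCDP")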

The pros and cons of zCDP and RDP are complicated. For more details, see the references below.

[1] Cynthia Dwork and Guy Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887, 2016.

[2] Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. arXiv:1605.02065 [cs.CR], 2017.

[3] Ilya Mironov. Rényi Differential Privacy. arXiv:1702.07476 [cs.CR].

Differentially private stochastic gradient descent

Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent.

Gradient descent

We’ll start with gradient descent. Suppose you have a function of several variables f(x) where x is a vector. Then the gradient ∇f(x) points in the direction of greatest increase in f, and its negative −∇f(x) points in the direction of greatest decrease. So you can decrease the function f by moving in the direction −∇f(x). You can turn this observation into an algorithm for minimizing f. Find the gradient, take a step in the opposite direction. Rinse, lather, and repeat. The gradient descent method is also called steepest descent because locally you’re always moving in the direction of steepest descent.

Locally is the key word here. In some finite neighborhood of x, −∇f(x) points in a direction that will decrease the value of f. How large is this neighborhood? Hard to say. How long a step should you take? Also hard to say.

If your function f is strictly convex and attains its minimum, then there is a unique global minimum, and gradient descent with suitably small steps will converge to it.
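
Here is a minimal gradient descent sketch in Python to make the iteration explicit. The quadratic objective, starting point, and fixed step size are arbitrary choices for illustration.

    import numpy as np

    def gradient_descent(grad, x0, step=0.1, iters=200):
        # Repeatedly step in the direction opposite the gradient.
        x = x0
        for _ in range(iters):
            x = x - step * grad(x)
        return x

    # Example: f(x, y) = (x - 1)^2 + 4 y^2 has gradient (2(x - 1), 8y)
    # and a unique minimum at (1, 0).
    grad_f = lambda v: np.array([2 * (v[0] - 1), 8 * v[1]])
    print(gradient_descent(grad_f, np.array([5.0, 3.0])))   # approximately [1, 0]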

Stochastic gradient descent

What could go wrong? If your objective function f is strictly convex, convergence is guaranteed, but convergence may be slow. The method is always locally optimal, but optimal may mean inside a tiny little neighborhood of where you start each iteration.

Gradient descent can meander through a valley, a nearly flat spot of the objective function, taking tiny steps back and forth and not making much progress toward finding the low point.

Another problem is that gradient descent can get stuck in a local minimum. If f is strictly convex then there are no local minima, just one global minimum, but stochastic gradient descent is used to minimize functions that are not convex and that may have many local minima.

So to get unstuck, either from being stuck at a local minimum or from a flat spot, stochastic gradient descent adds randomness.

The objective functions used in training neural networks have many variables, maybe millions or billions, and these functions are far from convex.
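
As a sketch of how the randomness enters, here is stochastic gradient descent applied to a least-squares problem. The synthetic data and the hyperparameters are made up for the example; the point is that each update uses the gradient on a random minibatch rather than the full dataset.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic least-squares problem: find w minimizing the mean of (x_i . w - y_i)^2.
    X = rng.normal(size=(1000, 3))
    w_true = np.array([2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=1000)

    w = np.zeros(3)
    step, batch_size = 0.05, 32
    for _ in range(2000):
        # Each step uses a random minibatch: a noisy but cheap
        # estimate of the full gradient.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w -= step * grad

    print(w)   # should land near w_true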

Differentially private stochastic gradient descent

Now suppose you’re doing machine learning on sensitive data, such as medical records. You want to train a model on the data, but you don’t want the model to memorize and leak personally identifiable information (PII) or protected health information (PHI).

If you were simply querying the data, rather than training a network on it, you could apply differential privacy to queries, adding an amount of noise to each query result calibrated to be just enough randomness to preserve privacy. But training a neural network is far more complicated than running a SQL query.

The core idea of differential privacy is that any individual’s presence or absence from a database must not make much difference, in a rigorously quantified sense. Any outcome that happened with someone in the database could plausibly have happened without that person being in the database. Stochastic gradient descent is already a randomized algorithm, and variations on the algorithm can also provide differential privacy guarantees. See [1] for a seminal paper and [2] for a later refinement.
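
The core update in [2] clips each example’s gradient and adds Gaussian noise before averaging. Here is a rough sketch of that single step; the step size, clipping norm, and noise multiplier are placeholder values, and a real implementation also has to track the total privacy cost across steps.

    import numpy as np

    def dp_sgd_step(w, per_example_grads, rng, step=0.05, clip=1.0, noise_mult=1.1):
        # Clip each example's gradient to norm at most `clip`, sum the clipped
        # gradients, add Gaussian noise scaled to the clipping norm, then average
        # and take an ordinary gradient step.
        clipped = [g / max(1.0, np.linalg.norm(g) / clip) for g in per_example_grads]
        noisy_sum = np.sum(clipped, axis=0) + rng.normal(scale=noise_mult * clip,
                                                         size=w.shape)
        return w - step * noisy_sum / len(per_example_grads)

    # Example with made-up per-example gradients for a 3-parameter model.
    rng = np.random.default_rng(0)
    w = np.zeros(3)
    grads = [np.array([3.0, 0.0, 0.0]), np.array([0.1, -0.2, 0.4])]
    print(dp_sgd_step(w, grads, rng))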

[1] Shuang Song, Kamalika Chaudhuri, Anand D. Sarwate. Stochastic gradient descent with differentially private updates. 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[2] Abadi et al. Deep Learning with Differential Privacy. arXiv:1607.00133 [stat.ML].

Randomized response and local differential privacy

Differential privacy protects user privacy by adding randomness as necessary to the results of queries to a database containing private data. Local differential privacy protects user privacy by adding randomness before the data is inserted into the database.

Using the visualization from this post, differential privacy takes the left and bottom (blue) path through the diagram below, whereas local differential privacy takes the top and right (green) path.

The diagram does not commute. Results are more accurate along the blue path. But this requires a trusted party to hold the identifiable data. Local differential privacy does not require trusting the recipient of the data to keep the data private and so the data must be deidentified before being uploaded. If you have enough data, e.g. telemetry data on millions of customers, then you can statistically afford to randomize your data before storing it.

I gave a simple description of randomized response here years ago. Randomized response gives users plausible deniability because their communicated responses are not deterministically related to their actual responses. That post looked at a randomized response to a simple yes/no question. More generally, you could have a question with k possible answers and randomize each answer to one of ℓ different possibilities. It is not necessary that k = ℓ.

A randomization mechanism is said to be ε-locally differentially private if for all possible pairs of inputs x and x′ and any output y, the ratio of the conditional probabilities of y given x and y given x′ is bounded by exp(ε). So when ε is small, the probability of any given output conditional on each possible input is roughly the same. Importantly, the conditional probabilities are not exactly the same, and so one can recover some information about the unrandomized responses in aggregate via statistical means. However, one cannot confidently infer any individual’s unrandomized response, assuming ε is small.

In the earlier post on randomized response, the randomization mechanism and the inference from the randomized responses were simple. With multiple possible responses, things are more complicated. You could choose different randomization mechanisms and different inference approaches for different contexts and priorities.
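
For the simple yes/no case, here is a sketch of a randomized response mechanism that satisfies ε-local differential privacy, together with the aggregate inference step. The 30% “yes” rate and the number of respondents are made up for the example.

    import numpy as np

    def randomized_response(bit, epsilon, rng):
        # Report the true bit with probability e^eps / (e^eps + 1), otherwise flip it.
        # The ratio of the conditional probabilities of any report given the two
        # possible inputs is then at most e^eps, which is the epsilon-LDP condition.
        p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1)
        return bit if rng.random() < p_truth else 1 - bit

    rng = np.random.default_rng(0)
    epsilon = 1.0
    true_bits = (rng.random(100_000) < 0.3).astype(int)   # 30% of people answer "yes"
    reported = np.array([randomized_response(b, epsilon, rng) for b in true_bits])

    # Aggregate inference: unbias the observed proportion of "yes" reports.
    p = np.exp(epsilon) / (np.exp(epsilon) + 1)
    estimate = (reported.mean() - (1 - p)) / (2 * p - 1)
    print(estimate)   # close to 0.3 with this many respondents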

With local differential privacy, users can share their data without trusting the data recipient to keep the data private; in a real sense the recipient isn’t receiving personal data at all. The recipient is receiving the output of a stochastic process which is weakly correlated with individual data, but isn’t receiving individual data per se.

Local differential privacy scales up well, but it doesn’t scale down well. When ε is small, each data contributor has strong privacy protection, but the aggregate data isn’t very useful unless so many individuals are represented in the data that the randomness added to the responses can largely be statistically removed.

PATE framework for differentially private machine learning

Machine learning models can memorize fragments of their training data and return these fragments verbatim. I’ve seen instances, for example, where I believe an LLM returned phrases verbatim from this site. It’s easy to imagine how medical data might leak this way.

How might you prevent this? And how might you do it in a way that is easy to defend?

One such approach is the PATE framework. PATE stands for Private Aggregation of Teacher Ensembles. PATE was introduced in [1] and refined in [2].

In the PATE framework, you divide your sensitive data into n disjoint subsets and train a “teacher” model on each subset. These subsets are formed so that only one teacher has access to data from a particular individual.

Only these teacher models have direct access to sensitive data, and these models will not be released into production. Instead, the teacher models are used to train a “student” model.

The student model asks questions of the teacher models and so the student model is only indirectly trained on sensitive data. Furthermore, differential privacy is inserted between the student and teacher models as a further layer of privacy protection. So the student model is actually not trained on the answers from individual teacher models but on a noisy aggregate of the teachers’ answers, with an (ideally small) amount of randomness added to further protect privacy. Publicly available data is also added to the training set for the student model.
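
Here is a sketch of the kind of noisy vote aggregation used in [1]: count the teachers’ votes for each label, add Laplace noise to each count, and return the label with the largest noisy count. The number of teachers, the votes, and the noise scale are invented for the example.

    import numpy as np

    def noisy_vote_aggregate(teacher_labels, num_classes, gamma, rng):
        # Count the votes for each class, perturb each count with Laplace noise
        # of scale 1/gamma, and return the class with the largest noisy count.
        counts = np.bincount(teacher_labels, minlength=num_classes)
        noisy_counts = counts + rng.laplace(scale=1.0 / gamma, size=num_classes)
        return int(np.argmax(noisy_counts))

    rng = np.random.default_rng(0)
    votes = np.array([2, 2, 2, 1, 2, 0, 2, 2, 1, 2])   # ten teachers' predicted labels
    print(noisy_vote_aggregate(votes, num_classes=3, gamma=0.5, rng=rng))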

There are a couple clever refinements in [2] that stretch the framework’s privacy budget.

To be more selective, our new mechanisms leverage some pleasant synergies between privacy and utility in PATE aggregation. For example, when teachers disagree, and there is no real consensus, the privacy cost is much higher; however, since such disagreement also suggests that the teachers may not give a correct answer, the answer may simply be omitted.

Differential privacy is a way of quantifying the notion that an individual’s participation or lack of participation in a database makes little difference. If the teacher models disagree, it may be because an individual who is an outlier has had a large influence on the teacher model that was trained on his data. If you protect the privacy of the “teachers” then you protect the privacy of the individuals, since no more than one teacher was trained on any given individual’s data. When there is consensus among the teachers, there’s no need to spend much of the privacy budget.

The authors go on to say

Similarly, teachers may avoid giving an answer where the student already is confidently predicting the right answer.

Since each query uses some amount of the privacy budget, avoiding even asking questions saves budget.

[1] Nicolas Papernot et al. Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data. arXiv:1610.05755.

[2] Nicolas Papernot et al. Scalable Private Learning with PATE. arXiv:1802.08908v1.

Earth mover’s distance

There are many ways to describe the distance between two probability distributions. The previous two posts looked at using the p-norm to measure the difference between the PDFs and using Kullback-Leibler divergence. Earth mover’s distance (EMD) is yet another approach.

Imagine a probability distribution on ℝ² as a pile of dirt. Earth mover’s distance measures how different two distributions are by how much work it would take to reshape the pile of dirt representing one distribution into a pile of dirt representing the other distribution. Unlike KL divergence, earth mover’s distance is symmetric, and so it really is a distance. (EMD is a colorful name for what is more formally known as the Wasserstein metric.)

The concept of t-closeness in data privacy is based on EMD. Deidentification procedures such as k-anonymity that protect individual privacy may not protect group privacy. t-closeness looks at the distribution of values of some attribute within a group and compares it to the distribution over the whole dataset using EMD.

Earth mover’s distance is difficult to compute, or even to rigorously define, when working in several dimensions, but in one dimension it is particularly simple. The 1-Wasserstein distance between two probability distributions is simply the 1-norm distance between the corresponding CDFs.

W_1(X, Y) = \int_{-\infty}^\infty |F_X(x) - F_Y(x)|\, dx

There are p-Wasserstein metrics just as there are p-norms, but the case p = 1 is particularly simple and so we will focus on it for this post.

We can illustrate the univariate Wasserstein metric by returning to a problem in a recent post, namely how to optimally approximate a standard normal by a logistic distribution.

Logistic distribution example

One of the nice things about the logistic distribution is that its CDF is an elementary function. If X is a logistic distribution with mean 0 and scale s then the CDF is

F_X(x) = \frac{1}{1 + \exp(-x/s)}

The CDF of a normal distribution has no elementary form but can be written in terms of the complementary error function. If Z is a standard normal random variable, then

F_Z(x) = \mbox{erfc}( -x/\sqrt{2}) / 2

We get a distance of 0.05926 if we use the value of s  = 0.5513 obtained from moment matching here. The optimal value is s = 0.5867, a little smaller than the optimal values of s when minimizing the 1, 2, and ∞ norms which were around 0.61.
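
Here is a short sketch that computes the distance above numerically and searches for the optimal scale; it should roughly reproduce the numbers quoted. The integration limits and search bounds are arbitrary but generous.

    import numpy as np
    from scipy import stats, integrate, optimize

    def w1_normal_logistic(s):
        # 1-Wasserstein distance between the standard normal and a logistic
        # distribution with mean 0 and scale s: the 1-norm distance between the CDFs.
        integrand = lambda x: abs(stats.norm.cdf(x) - stats.logistic.cdf(x, scale=s))
        return integrate.quad(integrand, -20, 20, limit=200)[0]

    print(w1_normal_logistic(0.5513))   # distance for the moment-matching scale
    res = optimize.minimize_scalar(w1_normal_logistic, bounds=(0.3, 1.0), method="bounded")
    print(res.x, res.fun)               # optimal scale and the corresponding distance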

Differential entropy and privacy

Differential entropy is the continuous analog of Shannon entropy. Given a random variable X with density function f_X, the differential entropy of X, denoted h(X), is defined as

h(X) = -\int f_X(x) \log_2 f_X(x)\, dx

where the integration is over the support of f_X. You may see differential entropy defined using a logarithm to a different base, which changes h(X) by a constant factor.

In [1] the authors defined the privacy of a random variable X, denoted Π(X), as 2 raised to the power h(X).

\Pi(X) = 2^{h(X)}

This post will only look at “privacy” as defined above. Obviously the authors chose the name because of its application to privacy in the colloquial sense, but here we will just look at the mathematics.

Location and scale

It follows directly from the definitions above that location parameters do not affect privacy, and scale parameters change privacy linearly. That is, for σ > 0,

\Pi(\sigma X + \mu) = \sigma \,\Pi(X)

If we divide by the standard deviation before (or after) computing privacy then we have a dimensionless quantity. Otherwise there’s more privacy in measuring a quantity in centimeters than in measuring it in inches, which is odd since both measurements contain the same information.

Examples

If X is uniformly distributed on an interval of length a, then h(X) = log_2 a and Π(X) = a.

The privacy of a standard normal random variable Z is √(2πe) and so the privacy of a normal random variable with mean μ and variance σ² is σ√(2πe).

The privacy of a standard exponential random variable is e, so the privacy of an exponential with rate λ is e/λ.

Bounds

A well-known theorem says that for given variance, differential entropy is maximized by a normal random variable. This means that the privacy of a random variable with variance σ² is bounded above by σ√(2πe).

The privacy of a Cauchy random variable with scale σ is 4πσ, which is greater than σ√(2πe). This does not contradict the statement above because the scaling parameter of a Cauchy random variable is not its standard deviation. A Cauchy random variable does not have a standard deviation.
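
Here is a quick numerical check of the values above. Exponentiating the differential entropy gives the same number whatever logarithm base is used, and scipy’s entropy method returns differential entropy in nats, so we exponentiate base e.

    import numpy as np
    from scipy import stats

    def privacy(dist):
        # dist.entropy() returns differential entropy in nats; exponentiating
        # base e gives the same value as 2 raised to the entropy in bits.
        return np.exp(dist.entropy())

    print(privacy(stats.uniform(scale=3)))   # uniform on an interval of length 3: 3
    print(privacy(stats.norm()))             # standard normal: sqrt(2 pi e), about 4.13
    print(privacy(stats.expon()))            # standard exponential: e, about 2.72
    print(privacy(stats.cauchy(scale=2)))    # Cauchy with scale 2: 8 pi, about 25.1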

[1] Agrawal D., Aggarwal C. C. On the Design and Quantification of Privacy-Preserving Data Mining Algorithms, ACM PODS Conference, 2002. (Yes, the first author’s name contains one g and the second author’s name contains two.)

Identifiers depend on context

Can you tell who someone is from their telephone number? That’s kinda the point of telephone numbers, to let you contact someone. And indeed telephone number is one of the 18 identifiers under HIPAA Safe Harbor.

But whether any piece of information allows you to identify someone depends on context. If you don’t have access to a phone, or a phone book, or any electronic counterpart of a phone book, then a phone number doesn’t allow you to identify anyone. But once you can call a phone number or enter it into a search engine, then the phone number is identifiable. Maybe.

What if the number belongs to a burner phone? Then it would be harder to learn the identity of the person who owns the number, but not impossible. Maybe you couldn’t learn anything about the owner, but law enforcement officials could. Again identifiability depends on context.

An obvious identifier like a phone number might not be an identifier in some special circumstance. And an apparently useless bit of information might reveal someone’s identity in another circumstance.

HIPAA’s Safe Harbor Rule tries to specify, independent of context, what kinds of data are identifiable. But if you read the Safe Harbor Rule carefully you’ll notice it isn’t as context-free as it seems. The last item in the list of 18 items to remove is “any other unique identifying number, characteristic, or code.” What might be an identifying characteristic? That depends on context.

The 19th rule of HIPAA Safe Harbor

The HIPAA Safe Harbor provision says that data can be considered deidentified if 18 kinds of data are removed or reported at low resolution. At the end of the list of 18 items, there is an extra category, sometimes informally called the 19th rule:

The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.

So if you otherwise meet the letter of the Safe Harbor provision, but you know (or should know) that the data can still be used to identify people represented in the data, then Safe Harbor does not apply.

The Department of Health and Human Services guidance document gives four examples of “when a covered entity would fail to meet the ‘actual knowledge’ provision.” The first concerns a medical record that would reveal someone’s identity by revealing their profession.

Revealing that someone is a plumber would probably not be a privacy risk, but in the HHS example someone’s occupation was listed as a former state university president. If you know what state this person is in, that greatly narrows down the list of possibilities. One more detail, such as age, might be enough to uniquely identify this person.

Free text fields, such as physician notes, could easily contain this kind of information. Software that removes obvious names won’t catch this kind of privacy leak.

Not only are intentional free text fields a problem, so are unintentional free text fields. For example, a database field labeled CASENOTES is probably intended to contain free text. But other text fields, particularly if they are wider than necessary to contain the anticipated data, could contain identifiable information.

If you have data that does not fall under the Safe Harbor provision, or if you are not sure the Safe Harbor rules are enough to ensure that the data are actually deidentified, let’s talk.

This post is not legal advice. My clients are often lawyers, but I am not a lawyer.

Natural language processing and unnatural text

I recently evaluated two software applications designed to find PII (personally identifiable information) in free text using natural language processing. Both failed badly, passing over obvious examples of PII. By contrast, I also tried natural language processing software on a nonsensical poem, and the software did quite well.

Doctor’s notes

It occurred to me later that the software packages to search for PII probably assume “natural language” has the form of fluent prose, not choppy notes by physicians. The notes that I tested did not consist of complete sentences marked up with grammatically correct punctuation. The text may have been transcribed from audio.

Some software packages deidentify medical notes better than others. I’ve seen some work well and some work poorly. I suspect the former were written specifically for their purpose and the latter were more generic.

Jabberwocky

I also tried NLP software on Lewis Carroll’s poem Jabberwocky. It too is unnatural language, but in a different sense.

Jabberwocky uses nonsense words that Carroll invented for the poem, but otherwise it is grammatically correct. The poem is standard English at the level of structure, though not at the level of words. It is the opposite of medical notes that are standard English at the word level (albeit with a high density of technical terms), but not at a structural level.

I used the spaCy natural language processing library on a couple of stanzas from Carroll’s poem.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand;
Long time the manxome foe he sought—
So rested he by the Tumtum tree
And stood awhile in thought.

I fed the lines into spaCy and asked it to diagram the lines, indicating parts of speech and dependencies. The software did a good job of inferring the use of even the nonsense words. I gave the software one line at a time rather than a stanza at a time because the latter results in diagrams that are awkwardly wide, too wide to display here. (The spaCy visualization software has a “compact” option, but this option does not make the visualizations much more compact.)

Here are the visualizations of the lines.

[Dependency-parse diagrams for each line omitted.]
And here is the Python code I used to create the diagrams above.

    import spacy
    from spacy import displacy
    from pathlib import Path

    # Load spaCy's small English pipeline.
    nlp = spacy.load("en_core_web_sm")

    lines = [
        "Beware the Jabberwock, my son!",
        "The jaws that bite, the claws that catch!",
        "Beware the Jubjub bird",
        "Shun the frumious Bandersnatch!",
        "He took his vorpal sword in hand.",
        "Long time the manxome foe he sought",
        "So rested he by the Tumtum tree",
        "And stood awhile in thought."
    ]

    for line in lines:
        # Parse the line and render its dependency diagram as SVG.
        doc = nlp(line)
        svg = displacy.render(doc, style="dep", jupyter=False)
        # Name the file after the words in the line, skipping punctuation.
        file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
        Path(file_name).write_text(svg, encoding="utf-8")

How an LLM might leak medical data

Machine learning models occasionally memorize training data. Under the right prompt, a model could return portions of the training data verbatim.

If a large language model is trained on deidentified medical data, along with data that overlaps with the medical data, it could potentially leak details of a person’s medical history. I’m not saying that any particular LLM does leak data, but it’s something to watch out for if you are training an LLM.

One way to try to create a data leak would be to prompt the LLM with the data you have and see whether it will fill in more. For example, I asked Bing AI

Tell me about a man from Arkansas with ICD-10 code C50.921.

It came back with the following.

… Once upon a time, there was a man named John who lived in Arkansas. John was a hardworking man who loved spending time with his family and friends. One day, John noticed a lump in his right breast. He went to the doctor and was diagnosed with malignant neoplasm of unspecified site of the right male breast, classified under the ICD-10 code C50.921. …

I chose this diagnosis code because male breast cancer is extremely rare, and so the risk of reidentification is higher than for a more common diagnosis.

The model filled in the detail that the man’s name is John. This is not at all suspicious since John is one of the most common male names in history. If the man’s name had been Vivek, there would be more cause to suspect the model is telling us about a real man named Vivek, though of course that could be a randomly chosen male name.

If a neural network were trained on deidentified medical data, it could link fields together. If the story above had said “John, aged 42, …” the age might have been pulled from an actual patient record.

If the data the network was trained on was deidentified well, even leaking data verbatim should not create more than a very small risk of identification. However, if the data contained tokens linking the records to publicly available information, such as real estate records—this happens—then our hypothetical LLM might reveal more personal details that could be used to narrow down whose data is being leaked.
