Differentially private stochastic gradient descent

Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent.

Gradient descent

We’ll start with gradient descent. Suppose you have a function of several variables f(x) where x is a vector. Then the gradient ∇f(x) points in the direction of greatest increase in f, and its negative −∇f(x) points in the direction of greatest decrease. So you can decrease the function f by moving in the direction −∇f(x). You can turn this observation into an algorithm for minimizing f. Find the gradient, take a step in the opposite direction. Rinse, lather, and repeat. The gradient descent method is also called steepest descent because locally you’re always moving in the direction of steepest descent.

Locally is the key word here. In some finite neighborhood of x, −∇f(x) points in a direction that will decrease the value of f. How large is this neighborhood? Hard to say. How long a step should you take? Also hard to say.

If your function f is strictly convex, then there is a global minimum, and the gradient descent algorithm will converge to it.

Stochastic gradient descent

What could go wrong? If your objective function f is strictly convex, convergence is guaranteed, but convergence may be slow. The method is always locally optimal, but optimal may mean inside a tiny little neighborhood of where you start each iteration.

Gradient descent can meander through a valley, a nearly flat spot of the objective function, taking tiny steps back and forth and not making much progress toward finding the low point.

Another problem is that gradient descent can get stuck in a local minimum. If f is strictly convex then there are no local minima, just one global minimum, but stochastic gradient descent is used to minimize functions that are not convex and that may have many local minima.

So to get unstuck, either from being stuck at a local minimum or from a flat spot, stochastic gradient descent adds randomness.

The objective functions used in training neural networks have many variables, maybe millions or billions, and these functions are far from convex.

Differentially private stochastic gradient descent

Now suppose you’re doing machine learning on sensitive data, such as medical records. You want to train a model on the data, but you don’t want the model to memorize and leak personally identifiable information (PII) or protected health information (PHI).

If you were simply querying the data, rather than training a network on it, you could apply differential privacy to queries, adding an amount of noise to each query result calibrated to be just enough randomness to preserve privacy. But training a neural network is far more complicated that running a SQL query.

The core idea of differential privacy is that any individual’s presence or absence from a database must not make much difference, in a rigorously quantified sense. Any outcome that happened with someone in the database could plausibly have happened without that person being in the database. Stochastic gradient descent is already a randomized algorithm, and variations on the algorithm can also provide differential privacy guarantees. See [1] for a seminal paper and [2] for a later refinement.

Related posts

[1] Shuang Song, Kamalika Chaudhuri, Anand D. Sarwate. Stochastic gradient descent with differentially private updates. 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

[2] Abadi et al. Deep Learning with Differential Privacy. arXiv:1607.00133 [stat.ML]

Using classical statistics to avoid regulatory burden

On June 29 this year I said on Twitter that companies would start avoiding AI to avoid regulation.

Companies are advertising that their products contain AI. Soon companies may advertise that their projects are AI-free and thus exempt from AI regulations.

I followed that up with an article Three advantages of non-AI models. The third advantage I listed was

Statistical models are not subject to legislation hastily written in response to recent improvements in AI. The chances that such legislation will have unintended consequences are roughly 100%.

Fast forward four months and we now have a long, highly-detailed executive order, Executive Order 14110, effecting all things related to artificial intelligence. Here’s an excerpt:

… the Secretary [of Commerce] shall require compliance with these reporting requirements for: any model that was trained using a quantity of computing power greater than 1026 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 1023 integer or floating-point operations; and any computing cluster that has a set of machines physically co-located in a single datacenter, transitively connected by data center networking of over 100 Gbit/s, and having a theoretical maximum computing capacity of 1020 integer or floating-point operations per second for training AI.

If a classical model can do what you need, you are not subject to any regulations that will flow out of the executive order above, not if these regulations use definitions similar to those in the executive order.

How many floating point operations does it take to train, say, a logistic regression model? It depends on the complexity of the model and the amount of data fed into the model, but it’s not 1020 flops.

Can you replace an AI model with something more classical like a logistic regression model or a Bayesian hierarchical model? Very often. I wouldn’t try to compete with Midjourney for image generation that way, but classical models can work very well on many problems. These models are much simpler—maybe a dozen parameters rather than a billion parameters—and so are much better understood (and so there is less fear of such models that leads to regulation).

I had a client that was using some complicated models to predict biological outcomes. I replaced their previous models with a classical logistic regression model and got better results. The company was so impressed with the improvement that they filed a patent on my model.

If you’d like to discuss whether I could help your company replace a complicated AI model with a simpler statistical model, let’s talk.

Identifiers depend on context

Can you tell who someone is from their telephone number? That’s kinda the point of telephone numbers, to let you contact someone. And indeed telephone number is one the 18 identifiers under HIPAA Safe Harbor.

But whether any piece of information allows you to identify someone depends on context. If you don’t have access to a phone, or a phone book, or any electronic counterpart of a phone book, then a phone number doesn’t allow you to identify anyone. But once you can call a phone number or enter it into a search engine, then the phone number is identifiable. Maybe.

What if the number belongs to a burner phone? Then it would be harder to learn the identity of the person who owns the number, but not impossible. Maybe you couldn’t learn anything about the owner, but law enforcement officials could. Again identifiability depends on context.

An obvious identifier like a phone number might not be an identifier in some special circumstance. And an apparently useless bit of information might reveal someone’s identity in another circumstance.

HIPAA’s Safe Harbor Rule tries to say apart from context what kinds of data are identifiable. But if you read the Safe Harbor Rule carefully you’ll notice it isn’t so context-free as it seems. The last item in the list of 18 items to remove is “any other unique identifying number, characteristic, or code.” What might be an identifying characteristic? That depends on context.

V-statistics

A few days ago I wrote about U-statistics, statistics which can be expressed as the average of a symmetric function over all combinations of elements of a set. V-statistics can be written as an average of over all products of elements of a set.

Let S be a statistical sample of size n and let h be a symmetric function of r elements. The average of h over all subsets of S with r elements is a U-statistic. The average of h over the Cartesian product of S with itself r times

\underbrace{S \times S \times \cdots \times S}_{n \text{ times}}

is a V-statistic.

As in the previous post, let h(x, y) = (xy)²/2. We can illustrate the V-statistic associated with h with Python code as before.

    import numpy as np
    from itertools import product

    def var(xs):
        n = len(xs)
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in product(xs, repeat=2)) / n**2

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs))
    print(var(xs))

This time, however, we iterate over product rather than over combinations. Note also that at the bottom of the code we print

   np.var(xs)

rather than

   np.var(xs, ddof=1)

This means our code here is computing the population variance, not the sample variance. We could make this more explicit by supplying the default value of ddof.

   np.var(xs, ddof=0)

The point of V-statistics is not to calculate them as above, but that they could be calculated as above. Knowing that a statistic is an average of a symmetric function is theoretically advantageous, but computing a statistic this way would be inefficient.

U-statistics are averages of a function h over all subsamples of S of size r without replacement. V-statistics are averages of h over all subsamples of size r with replacement. The difference between sampling with or without replacement goes away as n increases, and so V-statistics have the same asymptotic properties as U-statistics.

Related posts

Moments of Tukey’s g-and-h distribution

John Tukey developed his so-called g-and-h distribution to be very flexible, having a wide variety of possible values of skewness and kurtosis. Although the reason for the distribution’s existence is its range of possible skewness and values, calculating the skewness and kurtosis of the distribution is not simple.

Definition

Let φ be the function of one variable and four parameters defined by

\varphi(x; a, b, g, h) = a + b\left( \frac{\exp(gx) - 1}{g} \right) \exp(hx^2/2)

A random variable Y has a g-and-h distribution if it has the same distribution as φ(Z; a, b, g, h) where Z is a standard normal random variable. Said another way, if Y has a g-and-h distribution then the transformation φ-1 makes the data normal.

The a and b parameters are for location and scale. The name of the distribution comes from the parameters g and h that control skewness and kurtosis respectively.

The transformation φ is invertible but φ-1 does not have a closed-form; φ-1 must be computed numerically. It follows that the density function for Y does not have a closed form either.

Special cases

The g distribution is the g-and-h distribution with h = 0. It generalizes the log normal distribution.

The limit of the g-and-h distribution as g does to 0 is the h distribution.

If g and h are both zero we get the normal distribution.

Calculating skewness and kurtosis

The following method of computing the moments of Y comes from [1].

Define f by

f(g, h, i) = \frac{1}{g^i\sqrt{1 - ih}} \sum_{r=0}^i \binom{i}{r} \exp\left(\frac{((i-r)g)^2}{2(1-ih)}\right)

Then the raw moments of Y are given by

\text{E} \, Y^m = \sum_{i=0}^m \binom{m}{i} a^{m-i}b^i f(g,h,i)

Skewness is the 3rd centralized moment and kurtosis is the 4th centralized moment. Equations for finding centralized moments from raw moments are given here.

Related posts

[1] James B. McDonald and Patrick Turley. Distributional Characteristics: Just a Few More Moments. The American Statistician, Vol. 65, No. 2 (May 2011), pp. 96–103

Symmetric functions and U-statistics

A symmetric function is a function whose value is unchanged under every permutation of its arguments. The previous post showed how three symmetric functions of the sides of a triangle

  • a + b + c
  • ab + bc + ac
  • abc

are related to the perimeter, inner radius, and outer radius. It also mentioned that the coefficients of a cubic equation are symmetric functions of its roots.

This post looks briefly at symmetric functions in the context of statistics.

Let h be a symmetric function of r variables and suppose we have a set S of n numbers where nr. If we average h over all subsets of size r drawn from S then the result is another symmetric function, called a U-statistic. The “U” stands for unbiased.

If h(x) = x then the corresponding U-statistic is the sample mean.

If h(x, y) = (xy)²/2 then the corresponding U-function is the sample variance. Note that this is the sample variance, not the population variance. You could see this as a justification for why sample variance as an n−1 in the denominator while the corresponding term for population variance has an n.

Here is some Python code that demonstrates that the average of (xy)²/2 over all pairs in a sample is indeed the sample variance.

    import numpy as np
    from itertools import combinations

    def var(xs):
        n = len(xs)
        bin = n*(n-1)/2    
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in combinations(xs, 2)) / bin

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs, ddof=1))
    print(var(xs))

Note the ddof term that causes NumPy to compute the sample variance rather than the population variance.

Many statistics can be formulated as U-statistics, and so numerous properties of such statistics are corollaries general results about U-statistics. For example U-statistics are asymptotically normal, and so sample variance is asymptotically normal.

Eccentricity of bivariate normal level sets

Suppose you have a bivariate normal distribution with correlation ρ where

f(x,y) = \frac{1}{2\pi \sqrt{1-\rho^2}} \exp\left(-\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)} \right)

Then the level sets of the density function are ellipses, and the eccentricity e of the ellipses is related to the correlation ρ by

\begin{align*} \rho &= \pm \frac{e^2}{2 - e^2} \\ e &= \sqrt{\frac{2|\rho|}{1 + |\rho|} } \end{align*

Plots

For example, suppose ρ = 0.8.

Here’s a plot of the density function f(x, y).

And here is a plot of the contours, the level sets obtained by slicing the plot of the density horizontally.

Eccentricity and aspect ratio

The equation above says that since ρ = 0.8 the eccentricity e should be √(1.6/1.8) = 0.9428.

As I’ve written about several times, eccentricity is a little hard to interpret. It’s a number between 0 and 1, with 0 being a circle and 1 being a degenerate ellipse collapsed to a line segment. Since 0.9428 is closer to 1 than to 0, you might think that an ellipse with such a large value of e is very far from circular. But this is not the case.

Eccentric literally means off-center. Eccentricity measures how far off-center the foci of an ellipse are. This is related to how round or thin an ellipse is, but only indirectly. It is true that such a large value of e means that the foci are close to the ends of the ellipse. In fact, the foci are located at 94.28% of the distance from the center to the end of the semimajor axes. But the aspect ratio is more moderate.

It’s easier to imagine the shape of an ellipse from the aspect ratio than from the eccentricity. Aspect ratio r is related to eccentricity e by

r = \sqrt{1-e^2}

Using the expression above for eccentricity e as a function of correlation ρ we have

r = \sqrt{\frac{1 - |\rho|}{1 + |\rho|}}

This says our aspect ratio should be 1/3. So although our ellipse is highly eccentric—the foci are near the ends—it isn’t extraordinarily thin. It’s three times longer than it is wide, which matches the contour plot above.

Related posts

Babies and the beta-binomial distribution

About half of children are boys and half are girls, but that doesn’t mean that every couple is equally likely to have a boy or a girl each time they conceive a child. And evidence suggests that indeed the probability of conceiving a girl varies per couple.

I will simplify things for this post and look at a hypothetical situation abstracting away the complications of biology. This post fills in the technical details of a thread I posted on Twitter this morning.

Suppose the probability p that a couple will have a baby girl has a some distribution centered at 0.5 and symmetric about that point. Then half of all births on the planet will be girls, but that doesn’t mean that a particular couple is equally likely to have a boy or a girl.

How could you tell the difference empirically? You couldn’t if every family had one child. But suppose you studied all families with four children, for example. You’d expect 1 in 16 such families to have all boys, and 1 in 16 families to have all girls. If the proportions are higher than that, and they are, then that suggests that the distribution on p, the probability of a couple having a girl, is not constant.

Suppose the probability of a couple having girls has a beta(a, b) distribution. We would expect a and b to be approximately equal, since about half of babies are girls, and we’d expect a and b to be large, i.e. for the distribution be fairly concentrated around 1/2. For example, here’s a plot with ab = 100.

Then the probability distribution for the number of girls in a family of n children is given by a beta-binomial distribution with parameters n, a, and b. That is, the probability of x girls in a family of size n is given by

\text{Prob}(X = x) = \binom{n}{x} \frac{B(x + a, n - x + b)}{B(a, b)}

The mean of this distribution is na/(a+b). And so if a = b then the mean is n/2, half girls and half boys.

But the variance is more interesting. The variance is

\frac{nab(a + b + n)}{(a + b)^2(a + b +1)} = n \,\,\frac{a}{a + b} \,\,\frac{b}{a + b} \,\,\frac{a + b + n}{a + b + 1}

The variance of a binomial, corresponding to a constant p, is np(1-p). In the equation above, p corresponds to a/(a+b), and (1-p) corresponds to b/(a+b). And there’s an extra term,

\frac{a + b + n}{a + b + 1}

which is larger than 1 when n > 1. This says a beta binomial random variable always has more variance than the corresponding binomial distribution with the same mean.

Now suppose a family has had n children, with g girls and ng boys. Then the posterior predictive probability of a girl on the next birth is

\frac{a + g}{a + b + n}

If g = n/2 then this probability is 1/2. But if g > n/2 then the probability is greater than 1/2. And the smaller a and b are, the more the probability exceeds 1/2.

The binomial model is the limit of the beta-binomial model as a and b go to infinity (proportionately). In the limit, the probability above equals a/(a+b), independent of g and n.

Related posts

Can you have confidence in a confidence interval?

“The only use I know for a confidence interval is to have confidence in it.” — L. J. Savage

Can you have confidence in a confidence interval? In practice, yes. In theory, no.

If you have a 95% confidence interval for a parameter θ, can you be 95% sure that θ is in that interval? Sorta.

Frequentist theory

According to frequentist theory, parameters such as θ are fixed constants, though unknown. Probability statements about θ are forbidden. Here’s an interval I. What is the probability that θ is in I? Well, it’s 1 if θ is in the interval and 0 if it is not.

In theory, the probability associated with a confidence interval says something about the process used to create the interval, not about the parameter being estimated. Our θ is an immovable rock. Confidence intervals come and go, some containing θ and others not, but we cannot make probability statements about θ because θ is not a random variable.

Here’s an example of a perfectly valid and perfectly useless way to construct a 95% confidence interval. Take an icosahedron, draw an X on one face, and leave the other faces unmarked. Roll this 20-sided die, and if the X comes up on top, return the empty interval. Otherwise return the entire real line.

The resulting interval, either ø or (-∞, ∞), is a 95% confidence interval. The interval is the result of a process which will contain the parameter θ 95% of the time.

Now suppose I give you the empty set as my confidence interval. What is the probability now that θ is in the empty interval? Zero. What if I give you the real line as my confidence interval. What is the probability that θ is in the interval? One. The probability is either zero or one, but in no case is it 0.95. The probability that a given interval produced this way contains θ is never 95%. But before I hand you a particular result, the probability that the interval will be one that contains θ is 0.95.

Bayesian theory

Confidence intervals are better in practice than the example above. And importantly, frequentist confidence intervals are usually approximately Bayesian credible intervals.

In Bayesian statistics you can make probability statements about parameters. Once again θ is some unknown parameter. How might we express our uncertainty about the value of θ? Probability! Frequentist statistics represents some forms of uncertainty by probability but not others. Bayesian statistics uses probability to model all forms of uncertainty.

Example

Suppose I want to know what percentage of artists are left handed and I survey 400 artists. I find that 127 of artists surveyed were southpaws. A 95% confidence interval, using the most common approach rather than the pathological approach above, is given by

\left(\hat{p} - z_{1-\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}, \hat{p} + z_{1-\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} \right)

This results in a confidence interval of (0.272, 0.363).

Now suppose we redo our analysis using a Bayesian approach. Say we start with a uniform prior on θ. Then the posterior distribution on θ will have a beta(128, 264) distribution.

Now we can say in clear conscience that there is a 94% posterior probability that θ is in the interval (0.272, 0.363).

There are a couple predictable objections at this point. First, we didn’t get exactly 95%. No, we didn’t. But we got very close.

Second, the posterior probability depends on the prior probability. However, it doesn’t depend much on the prior. Suppose you said “I’m pretty sure most people are right handed, maybe 9 out of 10, so I’m going to start with a beta(1, 9) prior.” If so, you would compute the probability of θ being in the interval (0.272, 0.373) to be 0.948. Your a priori knowledge led you to have a little more confidence a posteriori.

Commentary

The way nearly everyone interprets a frequentist confidence interval is not justified by frequentist theory. And yet it can be justified by saying if you were to treat it as a Bayesian credible interval, you’d get nearly the same result.

You can often justify an informal understanding of frequentist statistics on Bayesian grounds. Note, however, that a Bayesian interpretation would not rescue the 95% confidence interval that returns either the empty set or the real line.

Often frequentist and Bayesian analyses reach approximately the same conclusions. A Bayesian can view frequentist techniques as convenient ways to produce approximately correct Bayesian results. And a frequentist can justify using a Bayesian procedure because the procedure has good frequentist properties.

There are times when frequentist and Bayesian results are incompatible, and in that case the Bayesian results typically make more sense in my opinion. But very often other considerations, such as model uncertainty, are much more substantial than the difference between a frequentist and Bayesian analysis.

Related posts

Privacy and tomography

I ran across the image below [1] when I was searching for something else, and it made me think of a few issues in data privacy.

green grid of light on young man

The green lights cut across the man’s body like tomographic imaging. No one of these green lines would be identifiable, but maybe the combination of the lines is.

We can tell it’s a young man in the image, someone in good shape. But is this image identifiable?

It’s identifiable to the photographer. It’s identifiable to the person in the photo.

If hundreds of men were photographed under similar circumstances, and we knew who those men were, we could probably determine which of them is in this particular photo.

The identifiability of the photo depends on context. The HIPAA Privacy Rule acknowledges that the risk of identification of individuals in a data set depends on who is receiving the data and what information the recipient could reasonably combine the data with. The NSA could probably tell who the person in the photo is; I cannot.

Making photographs like this available is not a good idea if you want to protect privacy, even if its debatable whether the person in the photo can be identified. Other people photographed under the same circumstances might be easier to identify.

Differential privacy takes a more conservative approach. Differential privacy advocates argue that deidentification procedures should not depend on what the recipient knows. We can never be certain what another person does or doesn’t know, or what they might come to know in the future.

Conventional approaches to privacy consider some data more identifiable than other data. Your phone number and your lab test results from a particular time and place are both unique, but the former is readily available public information and the latter is not. Differential privacy purists would say both kinds of data should be treated identically. A Bayesian approach might be to say we need to multiply each risk by a prior probability: the probability of an intruder looking up your phone number is substantially greater than the probability of the intruder accessing your medical records.

Some differential privacy practitioners make compromises, agreeing that not all data is equally sensitive. Purists decry this as irresponsible. But if the only alternatives are pure differential privacy and traditional approaches, many clients will choose the latter, even though an impure approach to differential privacy would have provided better privacy protection.

[1] Photo by Caspian Dahlström on Unsplash