Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent.

## Gradient descent

We’ll start with **gradient descent**. Suppose you have a function of several variables *f*(*x*) where *x* is a vector. Then the gradient ∇*f*(*x*) points in the direction of greatest increase in *f*, and its negative −∇*f*(*x*) points in the direction of greatest decrease. So you can decrease the function *f* by moving in the direction −∇*f*(*x*). You can turn this observation into an algorithm for minimizing *f*. Find the gradient, take a step in the opposite direction. Rinse, lather, and repeat. The gradient descent method is also called **steepest descent** because *locally* you’re always moving in the direction of steepest descent.
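Here is a minimal sketch of the idea in Python. The function, starting point, fixed step size, and iteration count are all illustrative choices, not part of any particular library's API.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    """Minimize a function by repeatedly stepping opposite its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)  # move in the direction of steepest descent
    return x

# Minimize f(x, y) = (x - 3)^2 + (y + 1)^2, whose gradient is
# (2(x - 3), 2(y + 1)) and whose minimum is at (3, -1).
grad_f = lambda x: np.array([2 * (x[0] - 3), 2 * (x[1] + 1)])
x_min = gradient_descent(grad_f, [0.0, 0.0])
```

With a fixed step size of 0.1 the error shrinks by a factor of 0.8 each iteration for this particular function, so the iterates converge quickly to (3, −1). A poorly chosen step size could just as easily diverge, which is the point of the caveats below.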

*Locally* is the key word here. In some finite neighborhood of *x*, −∇*f*(*x*) points in a direction that will decrease the value of *f*. How large is this neighborhood? Hard to say. How long a step should you take? Also hard to say.

If your function *f* is strictly convex and attains its minimum, then that minimum is unique, and gradient descent with suitably chosen step sizes will converge to it.

## Stochastic gradient descent

What could go wrong? If your objective function *f* is strictly convex, convergence is guaranteed, but **convergence may be slow**. The method is always locally optimal, but *locally* may mean only within a tiny neighborhood of where each iteration starts.

Gradient descent can meander through a valley, a nearly flat spot of the objective function, taking tiny steps back and forth and not making much progress toward finding the low point.

Another problem is that gradient descent can get **stuck in a local minimum**. If *f* is strictly convex then there are no local minima, just one global minimum, but stochastic gradient descent is used to minimize functions that are not convex and that may have many local minima.

So to get unstuck, either from being stuck at a local minimum or from a flat spot, stochastic gradient descent adds randomness.
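In the standard formulation, the randomness comes from estimating the gradient on a small random subset of the data (a mini-batch) rather than on the full dataset. The sketch below applies this to least-squares linear regression; the data, step size, and batch size are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical regression data: y = X @ w_true + a little label noise
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(d)
step, batch_size = 0.1, 32
for _ in range(500):
    idx = rng.integers(0, n, size=batch_size)      # sample a random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size   # noisy estimate of the gradient
    w -= step * grad                               # step against the estimate
```

Each mini-batch gradient is a noisy but unbiased estimate of the true gradient, and that noise is exactly the randomness that can jiggle the iterates out of flat spots and shallow local minima.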

The objective functions used in training neural networks have **many** variables, maybe millions or billions, and these functions are far from convex.

## Differentially private stochastic gradient descent

Now suppose you’re doing machine learning on sensitive data, such as medical records. You want to train a model on the data, but you don’t want the model to memorize and leak personally identifiable information (PII) or protected health information (PHI).

If you were simply querying the data, rather than training a network on it, you could apply differential privacy to the queries, adding noise to each query result calibrated to be just enough to preserve privacy. But training a neural network is far more complicated than running a SQL query.
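For a counting query, this calibration is simple: adding or removing one person changes a count by at most 1, so Laplace noise with scale 1/ε suffices for ε-differential privacy. A sketch, with a made-up dataset and predicate:

```python
import numpy as np

rng = np.random.default_rng(0)

def private_count(data, predicate, epsilon):
    """Answer a counting query with the Laplace mechanism.

    A count changes by at most 1 when one person is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    true_count = sum(1 for row in data if predicate(row))
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Hypothetical records: how many ages are over 50?
ages = [23, 57, 61, 34, 70, 45, 52]
noisy = private_count(ages, lambda a: a > 50, epsilon=1.0)
```

The training loop of a neural network has no such simple sensitivity analysis, which is what makes DP-SGD more involved than noising a query.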

The core idea of differential privacy is that any individual’s presence in or absence from a database must not make much difference, in a rigorously quantified sense. Any outcome that happened with someone in the database could plausibly have happened without that person being in the database. Stochastic gradient descent is already a randomized algorithm, and variations on the algorithm can also provide differential privacy guarantees. See [1] for a seminal paper and [2] for a later refinement.
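The approach in [1] and [2] bounds each individual’s influence by clipping every example’s gradient to a maximum norm, then adds Gaussian noise to the summed gradient before the update. The sketch below shows one such step for least squares; the clipping norm and noise multiplier are illustrative parameters, and a real implementation would also track the cumulative privacy loss across steps.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_sgd_step(w, X, y, step=0.1, clip=1.0, noise_multiplier=1.0):
    """One DP-SGD step for least squares: clip each example's gradient,
    sum, add Gaussian noise, average, then update the parameters."""
    grads = 2 * (X @ w - y)[:, None] * X            # per-example gradients
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip)   # clip each norm to <= clip
    noisy_sum = grads.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip, size=w.shape)
    return w - step * noisy_sum / len(X)

# Hypothetical batch of data
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w_new = dp_sgd_step(np.zeros(3), X, y)
```

Clipping caps how much any single record can move the model, and the Gaussian noise masks whatever influence remains, which is what lets the per-step privacy loss be quantified.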

## Related posts

- Differential privacy
- Optimization and avoiding computing a matrix inverse
- Solving Kepler’s equation with Newton’s method

[1] Shuang Song, Kamalika Chaudhuri, Anand D. Sarwate. Stochastic gradient descent with differentially private updates. 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

[2] Abadi et al. Deep Learning with Differential Privacy. arXiv:1607.00133 [stat.ML]