Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent.

We’ll start with gradient descent. Suppose you have a function of several variables f(x) where x is a vector. Then the gradient ∇f(x) points in the direction of greatest increase in f, and its negative −∇f(x) points in the direction of greatest decrease. So you can decrease the function f by moving in the direction −∇f(x). You can turn this observation into an algorithm for minimizing f. Find the gradient, take a step in the opposite direction. Rinse, lather, and repeat. The gradient descent method is also called steepest descent because locally you’re always moving in the direction of steepest descent.
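Here is a minimal sketch of that loop in plain Python. The test function, the starting point, the step size, and the names `grad_f` and `gradient_descent` are all assumptions chosen for illustration, not anything prescribed by the method.

```python
def grad_f(x):
    """Gradient of the example function f(x) = (x[0] - 1)**2 + 2*(x[1] + 2)**2."""
    return [2 * (x[0] - 1), 4 * (x[1] + 2)]


def gradient_descent(grad, x0, step=0.1, iterations=100):
    """Repeatedly step in the direction opposite the gradient."""
    x = list(x0)
    for _ in range(iterations):
        g = grad(x)
        # Move against the gradient; `step` is an assumed fixed step size.
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# The example function has its minimum at (1, -2).
x_min = gradient_descent(grad_f, [0.0, 0.0])
print(x_min)
```

A fixed step size happens to work for this simple bowl-shaped function. Choosing the step size in general is the hard part.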

Locally is the key word here. As long as the gradient is not zero, there is some small neighborhood of x in which moving in the direction −∇f(x) decreases the value of f. How large is this neighborhood? Hard to say. How long a step should you take? Also hard to say.
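To see why step length matters, here is a small sketch using f(x) = x², whose gradient is 2x. The function and the three step sizes are assumptions picked to make the behavior easy to see.

```python
def run(step, x=1.0, iterations=20):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    for _ in range(iterations):
        x = x - step * 2 * x
    return x


small = run(0.1)     # each step shrinks x by a factor of 0.8
boundary = run(1.0)  # each step flips the sign of x: no progress at all
large = run(1.1)     # each step multiplies x by -1.2: the iterates blow up

print(small, boundary, large)
```

The direction is the same in all three runs. Only the step length differs, and that alone decides whether the iterates converge, bounce back and forth forever, or diverge.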

If your function f is strictly convex and has a minimum, then that minimum is unique and global, and the gradient descent algorithm will converge to it, provided the step sizes are chosen suitably.

What could go wrong? Even if your objective function f is strictly convex and convergence is guaranteed, convergence may be slow. The method is always locally optimal, but locally may mean inside a tiny little neighborhood of where you start each iteration.

Gradient descent can meander through a valley, a nearly flat spot of the objective function, taking tiny steps back and forth and not making much progress toward finding the low point.
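The following sketch illustrates this on an assumed example, f(x, y) = (x² + 100y²)/2, a long narrow valley running along the x axis. The function, the starting point, and the step size are hypothetical choices. The step is kept just under 0.02 because anything larger makes the y coordinate diverge for this function.

```python
def grad(p):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = p
    return (x, 100.0 * y)


step = 0.019  # assumed; must stay below 2/100 or the y coordinate diverges
p = (1.0, 1.0)
path = [p]
for _ in range(100):
    gx, gy = grad(p)
    p = (p[0] - step * gx, p[1] - step * gy)
    path.append(p)

# y flips sign on every step as the iterates bounce across the valley,
# while x creeps toward 0 and is still about 0.15 after 100 steps.
print(path[1], path[2], p)
```

The steep direction forces a small step, and the small step makes progress along the flat direction painfully slow.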

Another problem is that gradient descent can get stuck in a local minimum. If f is strictly convex then there are no local minima other than the global minimum, but stochastic gradient descent is used to minimize functions that are not convex and that may have many local minima.
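Here is a sketch of getting stuck, using an assumed non-convex function f(x) = x⁴ − 2x² + 0.5x. It has a global minimum near x = −1.06 and a shallower local minimum near x = 0.93. The function, starting points, and step size are illustrative assumptions.

```python
def f(x):
    return x**4 - 2 * x**2 + 0.5 * x


def df(x):
    """Derivative of f."""
    return 4 * x**3 - 4 * x + 0.5


def descend(x, step=0.05, iterations=200):
    for _ in range(iterations):
        x = x - step * df(x)
    return x


left = descend(-1.5)   # ends near the global minimum
right = descend(1.5)   # ends near the shallower local minimum and stays there

print(left, f(left))
print(right, f(right))
```

Which minimum you land in depends entirely on where you start. Nothing in the algorithm tells the second run that a lower point exists elsewhere.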

So to get unstuck, whether from a local minimum or from a flat spot, stochastic gradient descent adds randomness.

The objective functions used in training neural networks have many variables, maybe millions or billions, and these functions are far from convex.