Differentially private stochastic gradient descent

Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent.

Gradient descent

We’ll start with gradient descent. Suppose you have a function of several variables f(x) where x is a vector. Then the gradient ∇f(x) points in the direction of greatest increase in f, and its negative −∇f(x) points in the direction of greatest decrease. So you can decrease the function f by moving in the direction −∇f(x). You can turn this observation into an algorithm for minimizing f. Find the gradient, take a step in the opposite direction. Rinse, lather, and repeat. The gradient descent method is also called steepest descent because locally you’re always moving in the direction of steepest descent.

Locally is the key word here. In some finite neighborhood of x, −∇f(x) points in a direction that will decrease the value of f. How large is this neighborhood? Hard to say. How long a step should you take? Also hard to say.

If your function f is strictly convex, then there is a global minimum, and the gradient descent algorithm will converge to it.

Stochastic gradient descent

What could go wrong? If your objective function f is strictly convex, convergence is guaranteed, but convergence may be slow. The method is always locally optimal, but optimal may mean inside a tiny little neighborhood of where you start each iteration.

Gradient descent can meander through a valley, a nearly flat spot of the objective function, taking tiny steps back and forth and not making much progress toward finding the low point.

Another problem is that gradient descent can get stuck in a local minimum. If f is strictly convex then there are no local minima, just one global minimum, but stochastic gradient descent is used to minimize functions that are not convex and that may have many local minima.

So to get unstuck, either from being stuck at a local minimum or from a flat spot, stochastic gradient descent adds randomness.

The objective functions used in training neural networks have many variables, maybe millions or billions, and these functions are far from convex.

Differentially private stochastic gradient descent

Now suppose you’re doing machine learning on sensitive data, such as medical records. You want to train a model on the data, but you don’t want the model to memorize and leak personally identifiable information (PII) or protected health information (PHI).

If you were simply querying the data, rather than training a network on it, you could apply differential privacy to queries, adding an amount of noise to each query result calibrated to be just enough randomness to preserve privacy. But training a neural network is far more complicated that running a SQL query.

The core idea of differential privacy is that any individual’s presence or absence from a database must not make much difference, in a rigorously quantified sense. Any outcome that happened with someone in the database could plausibly have happened without that person being in the database. Stochastic gradient descent is already a randomized algorithm, and variations on the algorithm can also provide differential privacy guarantees. See [1] for a seminal paper and [2] for a later refinement.

Related posts

[1] Shuang Song, Kamalika Chaudhuri, Anand D. Sarwate. Stochastic gradient descent with differentially private updates. 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

[2] Abadi et al. Deep Learning with Differential Privacy. arXiv:1607.00133 [stat.ML]

Corny AI


Meredith Whittaker posted on Twitter that

In addition to being the best in privacy, Signal is also the best in not subjecting you to corny ‘AI’ features no one asked for or wants.

I love the phrase “corny AI.” That’s exactly what a lot of AI features are.

“Would you like help composing that tweet?”

“No than you, I can write tiny text messages by myself.”

AI is the new Clippy.

I’m sure someone will object that these are early days, and AI applications will get better. That’s probably true, but they’re corny today. Inserting gimmicky, annoying technology now on the basis that future versions might be useful is like serving someone unripe fruit.

“This banana is hard and bitter!”

“Yes, but you should enjoy it because it would have been soft and sweet a few days from now.”

Of course not all AI is corny. For example, GPS has become reliable and unobtrusive. But there’s a rush to add AI just so a company can say their product includes AI. If the AI worked really well, the company would brag about how well the software works, not what technology it uses.


Swish, mish, and serf

Swish, mish, and serf are neural net activation functions. The names are fun to say, but more importantly the functions have been shown to improve neural network performance by solving the “dying ReLU problem.” This happens when a large number of node weights become zero during training and do not contribute further to the learning process. The weights may be “trying” to be negative, but the activation function won’t allow it. The swish family of activation functions allows weights to dip negative.

Softplus can also be used as an activation function, but our interest in softplus here is as part of the definition of mish and serf. Similarly, we’re interested in the sigmoid function here because it’s used in the definition of swish.

Softplus, softmax, softsign

Numerous functions in deep learning are named “soft” this or that. The general idea is to “soften” a non-differentiable function by approximating it with a smooth function.

The plus function is defined by

x^+ =<br /> \left\{<br /> \begin{array}{ll}<br /> x & \mbox{if } x \geq 0 \\<br /> 0 & \mbox{if } x < 0 \end{array} \right.

In machine learning the plus is called ReLU (rectified linear unit) but x+ is conventional mathematical notation.

The softplus function approximates x+ by

\text{softplus}(x) = \log(1 + \exp(x))

We can add a parameter to the softplus function to control how soft it is.

\text{softplus}(x; k) = \frac{1}{k}\log\left(1 + \exp(kx)\right)

As the parameter k increases, softmax becomes “harder.” It more closely approximates x+ but at the cost of the derivative at 0 getting larger.

We won’t need other “soft” functions for the rest of the post, but I’ll mention a couple more while we’re here. Softmax is a smooth approximation to the max function

\text{softmax}(x, y) = \log\left(\exp(x) + \exp(y)\right)

and softsign is a smooth approximation to the sign function.

\text{softsign}(x) = \frac{x}{1 + |x|}

which is differentiable at the origin, despite the absolute value in the denominator. We could add a sharpness parameter k to the softmax and softsign functions like we did above.


The sigmoid function, a.k.a. the logistic curve, is defined by

\sigma(x) = \frac{\exp(x)}{1 + \exp(x)} = \frac{1}{1 + \exp(-x)}

The second and third expressions are clearly equal, but I prefer to think in terms of the former, but the latter is better for numerical calculation because it won’t overflow for large x.

The sigmoid function is the derivative of the softplus function. We could define a sigmoid function with a sharpness parameter k by takind the derivative of the softplus function with sharpness parameter k.


The Swish function was defined in [1] as

\text{swish}(x) = x \, \sigma(x)

Like softplus, the Swish function is asymptotically 0 as x → −∞ and asymptotically equal to x as x → ∞. That is, for large |x|, swish(x) is approximately x+. However, unlike softplus, swish(x) is not monotone. It is negative for negative x, having a minimum around -1.278 [2]. The fact that the activation function can dip into negative territory is what addresses “the dying ReLU problem,”


The Mish function was defined in [3] as

\text{mish}(x) = x \, \tanh(\,\text{softplus}(x)\,)

The mish function has a lot in common with the swish function, but performs better, at least on the examples the author presents.

The mish function is said to be part of the swish function family, as is the serf function below.


The serf function [4] is another variation on the swish function with similar advantages.

\text{serf}(x) = x \, \text{erf}(\,\text{softplus}(x)\,)

The name “serf” comes from “log-Softplus ERror activation Function” but I’d rather pronounce it “zerf” from “x erf” at the beginning of the definition.

The serf function is hard to distinguish visually from the mish function; I left it out of the plot above to keep the plot from being cluttered. Despite the serf and mish functions being close together, the authors in [4] says the former outperforms the latter on some problems.

Numerical evaluation

The most direct implementations of the softplus function can overflow, especially when implemented in low-precision floating point such as 16-bit floats. This means that the mish and serf functions could also overflow.

For large x, exp(x) could overflow, and so computing softplus as

   log(1 + exp(x))

could overflow while the exact value would be well within the range of representable numbers. A simple way to fix this would be to have the softmax function return x when x is so large that exp(x) would overflow.

As mentioned above, the expression for the sigmoid function with exp(-x) in the denominator is better for numerical work than the version with exp(x) in the numerator.

Related posts

[1] Swish: A Self-Gated Activation Function. Prajit Ramachandran∗, Barret Zoph, Quoc V. arXiv:1710.05941v1

[2] Exact form of the minimum location and value in terms of the Lambert W function discussed in the next post.

[3] Mish: A Self Regularized Non-Monotonic Activation Function. Diganta Misra. arXiv:1908.08681v3

[4] SERF: Towards better training of deep neural networks using log-Softplus ERror activation Function. Sayan Nag, Mayukh Bhattacharyya. arXiv:2108.09598v3

Named entity recognition

Named entity recognition (NER) is a task of natural language processing: pull out named things text. It sounds like trivial at first. Just create a giant list of named things and compare against that.

But suppose, for example, University of Texas is on your list. If Texas is also on your list, do you report that you have a named entity inside a named entity? And how do you handle The University of Texas? Do you put it on your list as well? What about UT? Can you tell from context whether UT stands for University of Texas, University of Toronto, or the state of Utah?

Searching for Rice University would be even more fun. The original name of the school was The William Marsh Rice Institute for the Advancement of Letters, Science, and Art. I don’t know whether the name was ever officially changed. A friend who went to Rice told me they had a ridiculous cheer that spelled out every letter in the full name. And of course rice could refer to a grain.

Let’s see what happens when we run the following sentence through spaCy looking for named entities.

Researchers from the University of Texas at Austin organized a picleball game with their colleagues from Rice University on Tuesday.

I deliberately did not capitalize the definite article in front of University of Texas because I suspected spaCy might include the article if it were capitalized but not otherwise. It included the article in either case.

The results depend on the language model used. When I used en_core_web_trf it included at Austin as part of the university name.

When I used the smaller en_core_web_sm model it pulled out Austin as a separate entity.

The tag ORG stands for organization and DATE obviously stands for date. GPE is a little less obvious, standing for geopolitical entity.

When I changed Rice University to simply Rice, spaCy still recognized Rice as an organization. When I changed it to rice with no capitalization, it did not recognize it as an organization.

The other day I stress tested spaCy by giving it some text from Chaucer’s Canterbury Tales. Even though spaCy is trained on Modern English, it did better than I would have expected on Middle English.

Using the en_core_web_trf model it recognizes Engelond and Caunterbury as cities.

When I switched to  en_core_web_sm it still recognized Caunterbury as city, but tagged Engelond as a person.


Trying NLP on Middle English

It’s not fair to evaluate NLP software on a language it wasn’t designed to process, but I wanted to try it anyway.

The models in the spaCy software library were trained on modern English text and not on Middle English. Nevertheless, spaCy does a pretty good job of parsing Chaucer’s Canterbury Tales, written over 600 years ago. I used the model en_core_web_lg in my little experiment below.

The text I used comes from the prologue:

From every shires ende
of Engelond to Caunterbury they wende
the hooly blisful martir for to seke
that hem hath holpen
whan that they were seeke.

The software correctly identifies, for example, wende (went) and seke (seak) as verbs, and seeke (sick) as an adjective. Overall it does a pretty good job. I imagine it would do worse on Middle English text that differed more from Modern English usage, but so would a contemporary human who doesn’t know Middle English.

Related posts

How an LLM might leak medical data

Machine learning models occasionally memorize training data. Under the right prompt, a model could return portions of the training data verbatim.

If a large language model is trained on deidentified medical data, along with data that overlaps with the medical data, it could potentially leak details of a person’s medical history. I’m not saying that any particular LLM does leak data, but it’s something to watch out for if you are training an LLM.

One way to try to create a data leak would be to prompt the LLM with the data you have and see whether it will fill in more. For example, I asked Bing AI

Tell me about a man from Arkansas with ICD-10 code C50.921.

It came back with the following.

… Once upon a time, there was a man named John who lived in Arkansas. John was a hardworking man who loved spending time with his family and friends. One day, John noticed a lump in his right breast. He went to the doctor and was diagnosed with malignant neoplasm of unspecified site of the right male breast, classified under the ICD-10 code C50.921. …

I chose this diagnosis code because male breast cancer is extremely rare, and so the risk of reidentification is higher than for a more common diagnosis.

The model filled in the detail that the man’s name is John. This is not at all suspicious since John is the one of the most common male names in history. If the man’s name had been Vivek, there would be more cause to suspect the model is telling us about a real man name Vivek, though of course that could be a randomly chosen male name.

If a neural network were training on deidentified medical data, it could link fields together. If the story above had said “John, aged 42, …” the age might have been pulled from an actual patient record.

If the data the network was trained on was deidentified well, even leaking data verbatim should not create more than a very small risk of identification. However, if the data contained tokens linking the records to publicly available information, such as real estate records—this happens—then our hypothetical LLM might reveal more personal details that could be used to narrow down whose data is being leaked.

Related posts

Experiments with Bing chat

My two previous posts looked at experiments with ChatGPT and Google Bard. This post will look at redoing the same experiments with Microsoft’s Bing Chat: looking for mnemonic encodings and simplifying Boolean expressions.

When you open up Bing chat you can select a conversational style:

  • More creative
  • More balanced
  • More precise

I chose “more precise” for my experiments.

Mnemonic encodings

Bing seems to “know” more about the Major mnemonic system. In response to my first question about the system it gave a link to the Wikipedia article on the Major system and a couple other relevant links. It gives a more complete table of encodings than ChatGPT did:

  • 0: s, z
  • 1: t, d, th
  • 2: n
  • 3: m
  • 4: r
  • 5: l
  • 6: j, ch, sh
  • 7: c, k, g, q, ck
  • 8: v, f, ph
  • 9: p, b

Directly asking for encodings

It seems to do pretty well when I ask for an encoding of a two-digit number, and when I ask for an encoding of a three-digit number it encodes the first two digits correctly but ignores the third digit. However, when I asked it to encode 345 it did very well:

Using the Major System, the number 345 can be encoded as “molar” or “mailer” or “miller” or “mural” or “marl”, among others.

ChatGPT and Google Bard said that no English word can encode 345, though Google Bard came up with Merlin anyway. (Strictly speaking “Merlin” encodes 3452, but it’s a common practice to just pay attention the first three consonants in words with more than three consonants.)

Asking for sounds

When I asked “Find a word with the sounds g, r, and f” Bing came back with

 One word that contains the sounds g, r, and f is “giraffe” 🦒

When I asked for the sounds l, f, and b, Bard came up with “belief.” When I added “in that order” to my prompt it came back with “bluff” and “flyby,” ignoring the order requirement.

In general Bing AI seems to do better that ChatGPT and Bard at word game questions.

Boolean minimization

Next I tried the Boolean minimization problem that I’d tried with ChatGPT and Bard using the prompt

Minimize the boolean expression ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD

ChatGPT tried a Karnaugh map approach, but botched it. Bard tried an algebraic approach, but also botched it.

Bing tried something different:

The given boolean expression can be simplified using the consensus theorem. The consensus theorem states that for any three variables A, B, and C: AB + AC + BC = AB + AC. Applying this theorem to the given expression, we get

¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD
= ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC
= ¬AB¬C¬D + A( ¬B¬C¬D + ¬B¬CD + ¬BC¬D + ¬BCD) + AB( ¬C¬D) + ABC
= ¬AB( ¬C) ( ¬D) + A( ¬B) ( ¬C( ¬D) + ¬CD) + AB( ¬C( ¬D)) + ABC
= ¬AB( ¬C) ( ¬D) + A( ¬B) ( ¬CD) + AB( ¬C( ¬D)) + ABC
= (A+B)(A+C)(A+D)(B+C)(B+D)(C+D)

This derivation starts out correctly but goes astray about halfway through.


Boolean function minimization with AI

Screenshot of Karnaugh map produced by ChatGPT

I was curious how well LLMs would do at minimizing a Boolean expression, that is, taking a Boolean expression and producing a smaller equivalent expression.

I didn’t expect good performance because this problem is more about logic than recall, but sometimes LLMs surprise you, so I wanted to give it a chance. I thought it might accomplish some reduction in size though not as much as possible. Let’s see what it did.


My prompt for the three platforms I tried—GPT4All, ChatGPT, and Google Bard—was

Minimize the boolean expression ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD

I knew the expression could be minimized to

B¬C¬D + A¬D + AC.


I first tried GPT4All using the Groovy data set. It crashed immediately. I tried a few variations on the same prompt and it crashed right away every time.

This was on a MacBook Pro. I didn’t try it on Windows or Linux.


I’ll go through ChatGPT’s response a little at a time.

To minimize the given boolean expression ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD, we can use Karnaugh maps or Boolean algebra simplification techniques. I’ll demonstrate both methods.

This is a sensible start. Karnaugh maps are a common classroom technique for solving small Boolean minimization problems, so there were lots of examples of Karnaugh maps in the data scraped to train ChatGPT.

Next ChatGPT draws a Karnaugh map. The screenshot at the top of the post was from a different experiment, a version of the experiment described here but containing an error on my part. This problem produced a black image simply containing “CD” but also contained a Karnaugh map in text. Cleaning up the formatting we get the following.

       | 00 | 01 | 11 | 10 |
    00 |  1 |  0 |  1 |  0 |
    01 |  1 |  1 |  1 |  1 |
    11 |  1 |  1 |  1 |  1 |
    10 |  0 |  1 |  0 |  0 | 

This is nonsense. Nine out of sixteen entries are wrong.

ChatGPT concludes

Grouping the adjacent 1’s in the Karnaugh map, we can simplify the expression to: ¬A¬C + ¬A¬D + A¬B + ACD

This is also nonsense. It does not correctly solve the original problem, nor does it solve the problem corresponding to the incorrect Karnaugh map above.

Next let’s see how it does with Boolean algebra. It starts out OK.

Boolean Algebra Method: We can simplify the expression using Boolean algebra identities and rules.

¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD

Rearranging the terms: A¬B¬C¬D + A¬B¬CD + AB¬C¬D + A¬BC¬D + ABC¬D + A¬BCD + ABCD + A¬B¬C¬D

Grouping the terms: A¬B¬C¬D + A¬B¬CD + AB¬C¬D + A¬BC¬D + ABC¬D + A¬BCD + ABCD

Combining like terms: (A + A)¬B¬C¬D + (A + A)¬B¬CD + AB¬C¬D + (A + A)¬BC¬D + ABC¬D + A¬BCD + ABCD

These steps haven’t accomplished anything, but they’re not wrong. Then it goes off the rails.

Using the complement law A + ¬A = 1: ¬B¬C¬D + ¬B¬CD + AB¬C¬D + ¬BC¬D + ABC¬D + A¬BCD + ABCD

The explanation is wrong and the solution is wrong.

The solution is completely wrong, but it looks plausible. At first glance I thought it might be right, or more right than it was. If you were making a movie and you needed to generate electrical engineering homework, this would be adequate for a realistic-looking homework solution.

Google Bard

Google Bard gives a more succinct solution with five steps:

  1. Identify the common terms in the expression. The common terms are ¬AB¬C¬D, A¬B¬CD, and ABC¬D.
  2. Combine the common terms. This gives us the expression (¬AB¬C¬D + A¬B¬CD + ABC¬D) + (A¬B¬C¬D + AB¬C¬D).
  3. Apply the distribution property. This gives us the expression ¬AB¬C¬D + A¬B¬CD + ABC¬D + A¬B¬C¬D + AB¬C¬D.
  4. Identify the redundant terms. The redundant terms are ¬AB¬C¬D and A¬B¬C¬D.
  5. Remove the redundant terms. This gives us the expression A¬B¬CD + ABC¬D.

This too is nonsense.

Large language models and mnemonics

The Major mnemonic system encodes numbers as words in order to make them easier to remember. Digits correspond to consonant sounds (not spellings) as explained here. You can use the system ad hoc, improvising an encoding of a word as needed, or you can memorize canonical encodings of numbers, also known as pegs.

Pegs have couple advantages. For one, they are eventually faster. Rather than search for an encoding, you recall a predetermined association. For another, pegs are better for memorizing numbered lists. To recall the nth item in a list, pull up the one-and-only image associated with that number and recall what you associated with it.

Coming up with encodings for most numbers is easy, but for a few holdouts it’s challenging. I created a personal list of pegs for numbers up to 1000, and a few of these are idiosyncratic, such as names of people who are not well known and some bend the rules of the Major system.

I decided to try the ChatGPT and Google Bard large language models to see whether they could help me find pegs that I could share with a general audience, free of obscure references.

Asking directly about the Major system

ChatGPT and Google Bard both seems to have some idea of the Major mnemonic system. As part of its reply to one of my prompts, Google Bard returned this:

    0: S, Z, soft C
    1: D, T, TH
    2: N, L, M
    3: M
    4: R
    5: L
    6: J, SH, soft CH, DG
    7: K
    8: P
    9: H

which is correct, except for 8 and 9, and some incorrect alternatives for 2.

When I asked ChatGPT for “Major mnemonic system encoding table” it returned a correct table. But when I asked it for encodings of numbers, it used different rules. For example, it seemed to systematically use M for 2.

I asked both systems

Find a word that encodes the number 792 in the Major memory system

ChatGPT came back with KNP and GNP. These aren’t words, and it seems ChatGPT might have dyslexia.

Bard said

There is no word in the English language that encodes the number 792 in the Major memory system. The closest word is “Sink”, but it only encodes the digits 7 and 9.

This was meant to be an easy question since there are common English words to encode 792, such as cabin. Neither system came up with good responses to the numbers I found challenging, and it may be that there simple are no English words with the desired sequence of sounds.

Before moving on, I wanted to mention one response from Google Bard that I found amusing. When I asked

Find a word that encodes 345 using the Major mnemonic system

Bard responded with

… The encoded number 345 is therefore MERLE.

However, there is no word in the English language that is spelled exactly as MERLE. However, there are a few words that are similar, such as Merry, Mettle, and Melt.

If you are looking for a word that is spelled exactly as MERLE, you may need to create a new word. For example, you could create the word Merlin, which is a name that is often associated with magic and wisdom.

Asking for sounds

The Major system is not well known, so I didn’t expect a direct approach to be that helpful. Next I tried asking for words with particular sequences of sounds rather than asking directly about the Major system.

When I asked both systems

Find a word containing the consonant sounds k, b, and n, in that order

ChatGPT came back with cabin and Bard said

There is no word in the English language that contains the consonant sounds k, b, and n in that order. The closest word is “knob”, but the n sound is pronounced as “ng” in this word.

So the point goes to GPT on that one.

Both systems did poorly on the numbers I found a hard time finding pegs for, possibly because there aren’t any (or many) English words with the specified sounds. With easier questions, the results were mixed.

When I asked

Find a word with the sounds g, r, and f

I was thinking giraffe. ChatGPT returned graf and Bard returned graft.

When I asked

Find a word with sounds l, f, and b

I was thinking, for example, lifeboat. But ChatGPT came back with

A word that contains the consonant sounds “l,” “f,” and “b” is “elfishly.”

and Bard replied

The word “blurb” contains the sounds l, f, and b in that order.


Both systems did well when I asked directly for the phonetic spelling of a word. So I tried to get both systems to come up with lifeboat by asking

Words whose IPA phonetic spellings contain l, f, and b

Bard found words that had some of these letters but not all. When I was more specific, asking

Words whose IPA phonetic spellings contain each of the letters l, f, and b

the results weren’t better.

ChatGPT, on the other hand, came up with affleb, triflingly, bluff, lifeboat, and fallopian. When I asked the more specific question, adding “each of”, ChatGPT returned alphabet, fulbright, bubluous, flyable, and belltry. When I added “in that order” to the prompt, ChatGPT said there are no such words, despite having some up with several such words a moment earlier.

Related posts

Three advantages of non-AI models

It’s hard to imagine doing anything like Midjourney’s image generation without neural networks. The same is true of ChatGPT’s text generation. But a lot of business tasks do not require AI, and in fact would be better off not using AI. Three reasons why:

  1. Statistical models are easier to interpret. When a model has a dozen parameters, you can inspect each one. When a model has a billion parameters, not so much.
  2. Statistical models are more robust. Neural networks perform well on average, but fail in bizarre ways, and there is no way to explore all the possible edge cases. The edge cases of a statistical model can be enumerated, examined, and mitigated.
  3. Statistical models are not subject to legislation hastily written in response to recent improvements in AI. The chances that such legislation will have unintended consequences are roughly 100%.