Computing square root floor

Given an integer n, suppose you want to compute ⌊√n⌋, the greatest integer less than or equal to the square root of n. One reason you may want to compute ⌊√n⌋ is to know when you can stop trial division when factoring n. Similarly, ⌊√n⌋ gives you a starting point for Fermat’s factorization algorithm.

With floating point arithmetic

The obvious approach to computing ⌊√n⌋ would be to compute the square root of n as a real number and then round the result down.

Suppose we go down that route. How would we compute √n as a real number? The usual way is to use Newton’s square root method. First you make an initial guess at the square root of n, say a. Then repeatedly update your guess to be the average of your previous guess a and n/a:

a ← (a + n/a) / 2.

This is an ancient algorithm, going back to at least 1000 BC in Babylon. We call it Newton’s method because it is a special case of Newton’s root-finding method, applied to f(x) = x² − n.

Newton’s square root algorithm is implemented in computer hardware to this day. In some sense, we still compute square roots the same way the ancient Babylonians did.
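In code, the iteration might look like the following. This is a minimal sketch, assuming n > 0; the function name, the starting guess, and the fixed iteration count are my illustrative choices, and real implementations use a better initial guess and a convergence test.

def newton_sqrt(n, iterations=20):
    # A sketch of Newton's square root method; assumes n > 0.
    a = n                    # initial guess
    for _ in range(iterations):
        a = (a + n / a) / 2  # replace a with the average of a and n/a
    return a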

Without floating point arithmetic

Our original problem was to find an integer, the largest integer no greater than the square root of the integer n. Can we take advantage of the fact that we’re working with integers?

A general approach to solving problems over the integers is to solve the same problem over the real numbers, then convert the result to an integer. That’s what we did above, but what if we incorporate the “integerification” into Newton’s algorithm itself so that we never work with real numbers?

This is a reasonable idea, but we don’t know a priori that it will work. Sometimes integer problems can be solved by integer analogs of real algorithms, but sometimes not. Integer optimization problems are harder than real number optimization problems because the optimal integer solution cannot in general be found by first finding the optimal real solution and rounding. And even when this approach works, it may not be the most efficient approach.

It turns out that the integer analog of Newton’s method does work. Bressoud [1] gives pseudocode for an algorithm which I write in Python below.

def sqrt_floor(n):
    a = n                       # initial guess, an upper bound on the answer
    b = (n + 1) // 2            # first Newton iterate, starting from a = n
    while b < a:                # the iterates decrease until they converge
        a = b
        b = (a*a + n) // (2*a)  # Newton update: average of a and n/a, rounded down
    return a

Note that in Python the // operator carries out integer division, i.e. a // b computes ⌊a/b⌋.

The somewhat opaque line updating b inside the loop is just Newton’s method, averaging a and n/a, rounded down.
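As a sanity check, we can compare sqrt_floor against math.isqrt, which computes exactly this function and has been in the standard library since Python 3.8.

from math import isqrt

assert all(sqrt_floor(n) == isqrt(n) for n in range(1, 10_000))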

This algorithm could run on hardware with no floating point capability, only integer arithmetic. It will also work on extended precision integers where floating point would fail. For example, the following Python code

from math import factorial, sqrt
sqrt(factorial(1000))

produces an error “OverflowError: int too large to convert to float.” But

sqrt_floor(factorial(1000))

works just fine.

Update: FIPS 186-5

After writing this post, I was looking through the NIST FIPS 186-5 Digital Signature Standard (DSS). In Appendix B.4 the standard gives an algorithm for determining whether a large integer is a square. The algorithm in the appendix is essentially the algorithm above. The line that carries out the Newton method update is not explained, but this post explains where this otherwise mysterious line comes from.
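For example, with sqrt_floor in hand, a perfect square test in the spirit of that appendix is just a few lines. This is my sketch, not the standard’s pseudocode.

def is_square(n):
    r = sqrt_floor(n)
    return r * r == n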

[1] David M. Bressoud. Factorization and Primality Testing. Springer, 1989.

Weak encryption and surveillance

Two of the first things you learn in cryptography are that simple substitution ciphers are very easy to break, and that security by obscurity is a bad idea. This post will revisit both of these ideas.

Security depends on your threat model. If the threat you want to protect against is a human reading your encrypted messages, then simple substitution ciphers are indeed weak. Anyone capable of working a cryptogram in a puzzle book is capable of breaking your encryption.

But if your threat model is impersonal surveillance, things are different. It might not take much to thwart a corporation scanning your email for ideas of what to advertise to you. Even something as simple as rot13 might be enough.

The point of this article is not to recommend rot13, or any other simple substitution cipher, but to elaborate on the idea that evading commercial surveillance is different from evading intelligence agencies. It’s easier to evade bots than spooks.

If you’re the target of a federal investigation, the most sophisticated encryption at your disposal may be inadequate. But if you’re the target of an advertising company, things are much easier.

Rot13

Rot13 works by moving letters ahead 13 positions in the alphabet. The nth letter goes to the (n + 13)th letter mod 26. So A’s become N’s, and N’s become A’s, etc. Rot13 offers no security at all, but it is used to prevent text from being immediately recognizable. A common application is to publish the punchlines of jokes.

Rot13 converts the word “pyrex” to “clerk.” If you use the word “pyrex” in an email, but rot13 encrypt your message, you’re unlikely to see advertisements for glassware unless you also talk about a clerk.
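You can check this with the rot13 codec that ships with Python.

import codecs

print(codecs.encode('pyrex', 'rot13'))  # clerk
print(codecs.encode('clerk', 'rot13'))  # pyrex, since rot13 is its own inverse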

It is conceivable, but doubtful, that impersonal surveillance software would try to detect rot13 encoding even though it would be easy to do. But detecting and decrypting simple substitution ciphers in general, while certainly possible, would not be worth the effort. Security by obscurity could protect you from surveillance because it’s not profitable for mass surveillance to pursue anything obscure. For example, the Playfair cipher, broken over a century ago, would presumably throw off bots.

Modern but deprecated encryption

Simple substitution ciphers are ancient. Modern encryption methods, even deprecated methods like DES, are far more secure. For example, DES (Data Encryption Standard) is considered obsolete because it can be broken by a multiprocessor machine in under 24 hours. However, commercial surveillance is unwilling to spend processor-days to decrypt one person’s communication.

This is not to recommend DES. If it’s just as easy to use much better algorithms like AES (Advanced Encryption Standard) then why not do that? My purpose in bringing up DES is to say that squabbles over the merits of various modern encryption methods are beside the point if your goal is to evade impersonal surveillance.

Everyone agrees DES should be put out to pasture, but it doesn’t matter if you’re just trying to avoid surveillance. Things that not everyone agrees on matter even less.

What matters more than the algorithms is who holds the keys. A system in which you alone hold a DES key may give you more privacy than a system that holds an AES key for you. Companies may want to protect your private data from competitors but have no qualms about using it themselves.

Why surveillance matters

Why go to any effort to evade corporate surveillance?

Companies share data, and companies get hacked. One innocuous piece of data may be the link that connects two databases together. Things that you don’t mind revealing may unlock things you don’t want to reveal.

Related: A statistical problem with nothing to hide

What is the point of a public key fingerprint?

Public key cryptography uses two keys: a private key and a public key. The nature of these keys depends on the encryption scheme, such as whether one is using RSA, ECC, or some other method, but you can think of a key as a long number. A key may be a short list of numbers, but let’s just say a key is a number.

When you generate keys, you get a pair of numbers, one that you keep secret and one that you publicize. Anyone can use your public key to encrypt a message that only you can decrypt, assuming only you have your private key.

If I give you my public key, say I post it on my website, how can you be sure that it’s really my key? Maybe my site has been hacked and I don’t even know it. Or maybe someone tampers with the site content between the time it leaves my server and arrives at your browser (though TLS is supposed to address that).

We could get on a phone call and you could read the key back to me. The problem with this approach is that keys are long. A 4096-bit RSA key, for example, encoded in hexadecimal, is 1024 characters long.

You could read off the last so many characters of the key, maybe telling me that what you received ends in 9F0D 38FA. That would probably be sufficient to tell one key from another—it’s very unlikely you’d have two keys in a keychain that share the same last 32 bits—but that doesn’t mean the previous part of the key hasn’t been tampered with.

What you’d really like is a cryptographic hash of the public key, short enough to conveniently compare, but long enough that it would be infeasible for someone to alter the public key in such a way as to produce the same hash. And that’s exactly what a fingerprint is.

In GnuPG, the GNU implementation of PGP, a fingerprint is a 160-bit hash, which means 40 hexadecimal characters. Manually verifying 40 characters is feasible; manually verifying 1024 characters is not.
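Here is the idea in a few lines of Python. This is only a sketch: GnuPG hashes a specific serialization of the public key packet, not raw key bytes as done here.

import hashlib

def fingerprint(key_bytes):
    # Sketch: hash the key material down to 160 bits, i.e. 40 hex characters.
    return hashlib.sha1(key_bytes).hexdigest().upper()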

Fine-grained file differences

The diff utility compares files by lines, which is often what you’d like it to do. But sometimes you’d like more granularity.

For example, suppose we want to compare two versions of Psalm 23. Here are the first three verses in the King James version:

The Lord is my shepherd; I shall not want.
He maketh me to lie down in green pastures:
he leadeth me beside the still waters.
He restoreth my soul:
he leadeth me in the paths of righteousness
for his name’s sake.

And here are the corresponding lines from a more contemporary translation, the English Standard Version:

The Lord is my shepherd; I shall not want.
He makes me lie down in green pastures.
He leads me beside still waters.
He restores my soul.
He leads me in paths of righteousness
for his name’s sake.

Save these in two files, ps23.kjv and ps23.esv. If we run

diff ps23.kjv ps23.esv

we get

2,5c2,5
< He maketh me to lie down in green pastures:
< he leadeth me beside the still waters.
< He restoreth my soul:
< he leadeth me in the paths of righteousness
---
> He makes me lie down in green pastures.
> He leads me beside still waters.
> He restores my soul.
> He leads me in paths of righteousness

This says that the two files differ in lines 2 through 5; the first and last lines are identical. The output shows lines 2 through 5 from each file but doesn’t show how they differ.

To see more fine-grained differences, such as changing maketh to makes, we can run the version of diff that comes with git.

If we run

git diff --word-diff ps23.kjv ps23.esv

we can compare the files by words rather than by lines. This produces

diff --git a/ps23.kjv b/ps23.esv
index b90b858..be2a1a8 100644
--- a/ps23.kjv
+++ b/ps23.esv
@@ -1,6 +1,6 @@
The Lord is my shepherd; I shall not want.
He [-maketh-]{+makes+} me [-to-] lie down in green [-pastures:-]
[-he leadeth-]{+pastures.+}
{+He leads+} me beside [-the-] still waters.
He [-restoreth-]{+restores+} my [-soul:-]
[-he leadeth-]{+soul.+}
{+He leads+} me in [-the-] paths of righteousness
for his name's sake.

The colors help make the text more readable, assuming you can see the difference between red and green. I assume the color scheme is configurable. But the text is readable without the color highlighting. For example, in the first line we have

[-maketh-]{+makes+}

which means we remove the word maketh and add the word makes.

We can compare the files on an even finer level, comparing by characters rather than words. For example, rather than saying we need to change maketh to makes the software can say we need to change the th ending to s. We can do this by running

git diff  --word-diff-regex=. ps23.kjv ps23.esv

The option --word-diff-regex=. says to use the regular expression . to indicate word boundaries. Since the dot matches any character, this says to chop the lines into individual characters.

diff --git a/ps23.kjv b/ps23.esv
index b90b858..be2a1a8 100644
--- a/ps23.kjv
+++ b/ps23.esv
@@ -1,6 +1,6 @@
The Lord is my shepherd; I shall not want.
He make[-th-]{+s+} me [-to -]lie down in green pastures[-:-]
[-h-]{+.+}
{+H+}e lead[-eth-]{+s+} me beside [-the -]still waters.
He restore[-th-]{+s+} my soul[-:-]
[-h-]{+.+}
{+H+}e lead[-eth-]{+s+} me in [-the -]paths of righteousness
for his name's sake.

As before we have square brackets to indicate what to remove and curly braces to indicate what to add, but now we’re removing and adding letters rather than words.

We can get a more compact display of the differences if we rely on color alone, by adding the --word-diff=color option.

git diff  --word-diff=color --word-diff-regex=. ps23.kjv ps23.esv

produces the following.

diff --git a/ps23.kjv b/ps23.esv
index b90b858..be2a1a8 100644
--- a/ps23.kjv
+++ b/ps23.esv
@@ -1,6 +1,6 @@
The Lord is my shepherd; I shall not want.
He makeths me to lie down in green pastures:
h. He leadeths me beside the still waters.
He restoreths my soul:
h. He leadeths me in the paths of righteousness
for his name's sake.

Equivalently, we can combine the two options

--word-diff=color --word-diff-regex=.

into the one option

--color-words=.

that specifies the word separation regular expression as an option to --color-words.

This may be the most convenient way to see the differences, provided you can distinguish the colors, and don’t need to use the plain text programmatically. Without the colors, maketh in red followed by s in green, for example, becomes simply makeths, and we can no longer be sure what changed.

The ed line editor: bravado, utility, and history

I stumbled on the book Ed Mastery by Michael W. Lucas and couldn’t tell immediately whether it was serious. In a sort of technical version of Poe’s law, Lucas lays on the technical machismo pretty thick, but not thicker than some people do unironically.

Bravado

Here’s a paragraph from early in the book.

Many younger sysadmins naively hoist their pennants to defend overblown, overwrought, overdesigned text editors like ex, vi, or even the impossibly bloated nvi. A few are so lost as to devote themselves to turgid editors meant for mere users, such as vim and Emacs. This way lies not only appalling sysadmin skills, but an absence of moral fiber. As a sysadmin, you must have enough brain power to remember what you typed, to hold your own context in your head, and to truly commune with the machine on a deep and personal level.

By the time I read the reference to moral fiber I was pretty sure the author was being facetious.

Utility

The afterword to the book begins

Okay, come on Lucas, you’re not really serious here… are you?

I am. And I’m not.

This is what I find most interesting: ed is so minimal that it seems masochistic to use it, and yet it is also practical.

I’ve used Emacs for many years, but I’ve been learning the basics of vi because it is lightweight and installed everywhere [1]. But ed is even more lightweight and also installed everywhere.

My attraction to vi was that it’s complementary to Emacs. But ed is even more complementary. ed is so small that you don’t have to prioritize what to learn: just learn every feature. That is emphatically not true of vim or Emacs. Apparently you can learn ed in a day. Peter Krumins said

From my experience, once I had completed this cheat sheet and had it in front of me, I picked ed up in 30 minutes. And then spent a few more hours experimenting and trying various constructs.

History

I also find ed linguistically interesting. A lot of ed syntax is familiar because other tools inherited part of their syntax from ed. For example, sed and grep are direct descendants of ed, and the output of diff is basically ed syntax; with the -e flag the output of diff is exactly ed code. vim is “vi improved,” and vi was a visual interface to ex, which was an extension to ed.

[1] Here “everywhere” means “every Unix-like system.” When I worked in the Windows ecosystem I found this way of speaking arrogant and absurd. The large majority of the world’s desktops run Windows, and something that isn’t installed on Windows is hardly installed “everywhere.” Now I catch myself talking that way.

ASCII armor: send encrypted data in plain text

GnuPG ASCII armor is a way to encode binary data as text and decode the text back into binary. It could be used with any binary data, but most often it is used with cryptographic data: public keys, encrypted messages, etc.

Secure communication over an insecure channel

When people ask whether some medium “supports encryption,” or more specifically whether the medium “supports end-to-end encryption,” they’re asking how convenient it is to use the medium with encrypted data. Any medium can convey encrypted text, but the process may be more or less manual.

You can use Gmail for end-to-end encryption, for example, though I’m sure Google would rather you not. Just encrypt the data on your computer, convert the output to ASCII, paste it into an email, and send. The receiver then copies the ASCII text, converts it to binary, and then decrypts it.

ASCII armor is a way of handling the conversion to and from ASCII, with some niceties such as a checksum and human-friendly formatting, as friendly as formatting can be under the circumstances.

Aside: Coding versus Encryption

There are two kinds of encoding in this context: secret and non-secret. Coding theory is often misunderstood because it is primarily concerned with non-secret encoding, such as error-correcting codes and transmission protocols. ASCII armor belongs to coding theory, not encryption, though it is usually deployed in service of encryption.

ASCII armor is not providing protection from prying eyes. It is providing a relatively convenient way to convey naturally binary data, including a checksum to detect errors in transmission.

How ASCII armor works

At its heart, ASCII armor does base-64 encoding. But there’s more to it than that. There are optional fields, formatting rules, and a checksum.

For an example, we’ll take the first 240 binary digits of π after the “decimal” point.

Here’s some Python code to write the bits to a file. We start with the bits in hexadecimal since that’s more compact. In hex notation, π = 3.243f6a8885a3… For more details, see this post on hexadecimal floating point.

import binascii

data = '243F6A8885A308D313198A2E03707344A4093822299F31D0082EFA98EC4E'
with open('pibits', 'wb') as f:
    f.write(binascii.unhexlify(data))

We can inspect pibits in a hex editor to verify that the bits are correct, say by running xxd.

Enarmor

We can convert our binary file to ASCII using gpg, which stands for GNU Privacy Guard (GnuPG), the GNU implementation of Pretty Good Privacy. Running

gpg --enarmor pibits

produces a file pibits.asc with the following contents.

-----BEGIN PGP ARMORED FILE-----
Comment: Use "gpg --dearmor" for unpacking

JD9qiIWjCNMTGYouA3BzRKQJOCIpnzHQCC76mOxO
=0G7A
-----END PGP ARMORED FILE-----

Dearmor

If we run

gpg --dearmor pibits.asc

we get a file pibits.asc.gpg identical to the original file pibits, which we could verify using diff.

Base 64 encoding

How does the string JD9… above correspond to bits?

A hexadecimal character represents 4 bits, and a base 64 character represents 6 bits. So every 3 hexadecimal characters correspond to 12 bits, or 2 base 64 characters.

The first three hexadecimal characters in our data, 243, correspond to the first two characters in our ASCII armor data, JD, as follows. First of all,

243₁₆ = 001001000011₂

and if we split the bits into groups of six we have 001001₂ = 9₁₀ and 000011₂ = 3₁₀. ASCII armor represents base-64 numbers the same way MIME encoding does: the numbers 0 through 63 are represented by

A, B, C, … Z, a, b, c, …, z, 0, 1, 2, …, 9, +, /

So 9 is encoded as J, and 3 is encoded as D.
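We can verify the whole encoding with Python’s base64 module. The output below matches the body of the armored file, though of course not the checksum line.

import base64
import binascii

data = '243F6A8885A308D313198A2E03707344A4093822299F31D0082EFA98EC4E'
print(base64.b64encode(binascii.unhexlify(data)).decode())
# JD9qiIWjCNMTGYouA3BzRKQJOCIpnzHQCC76mOxO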

The 240 bits of π are represented by the 60 hexadecimal characters 243…C4E, or by the 40 base 64 characters JD…xO. Note that the last character is the letter O, not the decimal 0. (Base 58 avoids this problem by not using typographically similar symbols.)

In this example we had 240 bits, which is a multiple of 6. What if the number of bits is not a multiple of 6? See this post for an answer.

Checksum

After the 40 characters encoding our data follow five more characters, =0G7A. The equal sign is a separator, and 0G7A is a base 64 encoding of a 24-bit CRC checksum. The details of the checksum, and C code for implementing it, are given in RFC 4880 Section 6.1.
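Here is a Python translation of the C code in the RFC; the initial value and the polynomial are the constants given there. Applying it to the bytes of pibits and base 64 encoding the three-byte result should reproduce the =0G7A line above.

import base64

def crc24(data):
    # CRC-24 from RFC 4880 section 6.1
    crc = 0xB704CE                 # CRC24_INIT
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= 0x1864CFB   # CRC24_POLY
    return crc & 0xFFFFFF

with open('pibits', 'rb') as f:
    checksum = crc24(f.read())
print('=' + base64.b64encode(checksum.to_bytes(3, 'big')).decode())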


Why not reuse passwords?

Perhaps you’ve heard that you should not reuse passwords but don’t understand why. After all, you have a really good password, one that nobody would ever guess, so why not use it everywhere?

Your password is not as good as you think

First of all, people who think they have a really good password are usually wrong. They implicitly assume that an attacker would have to recreate the devious process they went through to create the password. This is not the case. A password needs to be hard for an attacker to crack, not hard for the owner to create, and these are not the same thing. The former requires passwords that are long and randomly generated.

Credential stuffing

Suppose you use the same password on sites X and Y. Then site X gets hacked, revealing your password. Now your user name (most likely your email address) and password are part of a list posted online somewhere. Hackers then use scripts to see whether the same credentials will work somewhere else, and they often do. If they try site Y, then your account on Y has been compromised as well.

Now suppose you use a different username at the two sites. This is better, but the hack on X still means that your password has been added to a list of known passwords.

The worst case scenario is a site that stores passwords in the clear. This practice has long been discouraged, but the more times you reuse a password, the higher the chance that you’ll use your password at a site that does not hash passwords.

Hashing

Most sites these days don’t store your password per se but a cryptographic hash of your password. When you enter your password, the site hashes it and compares the result to the hashed value it has on file. If they match, you’re in.
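In outline, the login check looks something like the following sketch. I’m using a bare SHA-256 purely for illustration; real sites use salted, deliberately slow password hashing schemes.

import hashlib

def hash_password(password):
    # Illustration only, not a recommendation to hash passwords this way.
    return hashlib.sha256(password.encode()).hexdigest()

stored_hash = hash_password('correct horse battery staple')

def login(attempt):
    return hash_password(attempt) == stored_hash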

If your hashed password for X is part of a breach, and the hashing algorithms for X and Y are known, an attacker can try hashing a list of known passwords, and maybe one of them will match the hash of your password on Y.

If sites X and Y use different hashing algorithms, then a hash of your password for X is not directly useful to hacking your account on site Y. But it is possible to “unhash” passwords, especially if a site uses a hashing algorithm that has been broken. This takes a lot of computation, but people do it.

Example

This is not hypothetical. It has happened many times. For example, it was part of what led to the recent 23andMe breach. And when a hacker obtains one person’s family tree data, they obtain data on many other people at the same time. If you used a unique, long, randomly generated password on 23andMe but your cousin used password123, then your data may have been stolen.

What to do?

What can you do? You almost have to use a password manager. A strong password is necessarily hard to remember, and memorizing hundreds of strong passwords would be quite a chore. Still, you might want to memorize one or two strong passwords if they protect something really valuable.


Unix linguistics

If you knew that you wanted to learn 10 spoken languages, it would probably be helpful to take a course in linguistics first. Or maybe to have a linguistics course after learning your first or second language. And if the languages are related, it would help to know something about the linguistics of that group of languages in particular. For example, if you wanted to learn several Romance languages, it might be worthwhile to learn at least some Latin first, even if Latin isn’t on the list of languages you want to learn.

In order to become fluent in using the Unix (Linux) command line, you need to learn a dozen related languages. Fortunately none of these languages is anywhere near as large as a spoken language. Regular expressions, for example, are a pattern description language. You can think of vim as a language. And of course sed and awk are programming languages.

As you use various command line utilities you notice similarities between them. Some tool history is fairly well known. For example, it’s well known that grep takes its name from the g/re/p command in ed, and that grep was created by modifying the ed source code. The history of sed is similar. The line editor ed is a common ancestor of grep, sed, and vi, which explains a lot of the similarity between these tools.

There is a large amount of preserved Unix history, but what I have in mind is more linguistics than history. History often accounts for the similarities in syntax, but I’m primarily interested in the similarities themselves rather than the history. A semi-fictional history might be more useful than an accurate history. “This isn’t exactly how things came about, but you could imagine …”

I’ve seen bits and pieces of what I imagine would comprise a course in Unix linguistics. For example, there is a section in the book sed & awk entitled “Awk, by Sed and Grep, out of Ed.”

I’ve used Emacs since college, but I’m learning how to get by in vi. Part of my motivation is to be able to log into a client’s Linux box and be productive without being able to install or configure anything. Although I hardly know anything about vi at this point, I can tell right away that vi has more syntactic similarity to the rest of the Unix ecosystem than Emacs does.

It would be really nice to have a book with a title like “vi for people who have used sed, grep, and less.” Or even better, a tutor who could relate what I don’t know to what I do know. Someone like my Greek professor.

I took one semester of classical Greek in college. The professor, William Nethercut, was amazing. At the beginning of the semester he asked each student what languages they had studied, and customized the rest of the course accordingly. “This feature of Greek is like that feature in French, Susan. And like this feature of Latin, Mike.” I was impressed by his erudition in languages, but even more impressed with his thoughtfulness in relating to each of us individually. If Dr. Nethercut taught a class in the Unix ecosystem, he could say “So, you know this set of tools and want to learn that set of tools. You’ll find that the syntax of this is similar to the syntax of that, but watch out for this difference.”

Randomized response and local differential privacy

Differential privacy protects user privacy by adding randomness as necessary to the results of queries to a database containing private data. Local differential privacy protects user privacy by adding randomness before the data is inserted into the database.

Using the visualization from this post, differential privacy takes the left and bottom (blue) path through the diagram below, whereas local differential privacy takes the top and right (green) path.

The diagram does not commute. Results are more accurate along the blue path. But this requires a trusted party to hold the identifiable data. Local differential privacy does not require trusting the recipient of the data to keep the data private and so the data must be deidentified before being uploaded. If you have enough data, e.g. telemetry data on millions of customers, then you can statistically afford to randomize your data before storing it.

I gave a simple description of randomized response here years ago. Randomized response gives users plausible deniability because their communicated responses are not deterministically related to their actual responses. That post looked at a randomized response to a simple yes/no question. More generally, you could have a question with k possible answers and randomize each answer to one of ℓ different possibilities. It is not necessary that k = ℓ.

A probability distribution is said to be ε-locally differentially private if for all possible pairs of inputs x and x′ and any output y, the ratio of the conditional probabilities of y given x and y given x′ is bounded by exp(ε). So when ε is small, the probability of any given output conditional on each possible input is roughly the same. Importantly, the conditional probabilities are not exactly the same, and so one can recover some information about the unrandomized response in aggregate via statistical means. However, it is not possible to infer any individual’s unrandomized response, assuming ε is small.

In the earlier post on randomized response, the randomization mechanism and the inference from the randomized responses were simple. With multiple possible responses, things are more complicated. You could choose different randomization mechanisms and different inference approaches for different contexts and priorities.
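For example, one standard mechanism, sometimes called k-ary randomized response, reports the true answer with probability e^ε/(e^ε + k − 1) and otherwise reports one of the other answers chosen uniformly. The ratio of any two conditional output probabilities is then exactly e^ε, so the mechanism is ε-locally differentially private. Here is a sketch, taking ℓ = k for simplicity.

import math
import random

def randomized_response(true_answer, answers, epsilon):
    # Report the truth with probability exp(eps)/(exp(eps) + k - 1),
    # otherwise report one of the other k - 1 answers uniformly.
    k = len(answers)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_true:
        return true_answer
    return random.choice([a for a in answers if a != true_answer])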

With local differential privacy, users can share their data without trusting the data recipient to keep the data private; in a real sense the recipient isn’t receiving personal data at all. The recipient is receiving the output of a stochastic process which is weakly correlated with individual data, but isn’t receiving individual data per se.

Local differential privacy scales up well, but it doesn’t scale down well. When ε is small, each data contributor has strong privacy protection, but the aggregate data isn’t very useful unless so many individuals are represented in the data that the randomness added to the responses can largely be statistically removed.


PATE framework for differentially private machine learning

Machine learning models can memorize fragments of their training data and return these fragments verbatim. I’ve seen instances, for example, where I believe an LLM returned phrases verbatim from this site. It’s easy to imagine how medical data might leak this way.

How might you prevent this? And how might you do it in a way that is easy to defend?

One such approach is the PATE framework. PATE stands for Private Aggregation of Teacher Ensembles. PATE was introduced in [1] and refined in [2].

In the PATE framework, you divide your sensitive data into n disjoint subsets and train a “teacher” model on each subset. These subsets are formed so that only one teacher has access to data from a particular individual.

Only these teacher models have direct access to sensitive data, and these models will not be released into production. Instead, the teacher models are used to train a “student” model.

The student model asks questions of the teacher models, and so the student model is only indirectly trained on sensitive data. Furthermore, differential privacy is inserted between the student and teacher models as a further layer of privacy protection. So the student model is actually not trained on the answers from the teacher models but on an aggregate of the teachers’ answers with an (ideally small) amount of randomness thrown in to further protect privacy. Publicly available data is also added to the training set for the student model.
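The aggregation mechanism in [1] is a noisy plurality vote: add Laplace noise to each class’s vote count among the teachers and return the class with the largest noisy count. Here is a sketch, where gamma is a privacy parameter; larger gamma means less noise and weaker privacy.

import numpy as np

def noisy_vote(teacher_labels, num_classes, gamma):
    # Count the teachers' votes for each class, perturb the counts
    # with Laplace noise of scale 1/gamma, and return the noisy winner.
    counts = np.bincount(teacher_labels, minlength=num_classes)
    noisy_counts = counts + np.random.laplace(0.0, 1.0 / gamma, num_classes)
    return int(np.argmax(noisy_counts))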

There are a couple clever refinements in [2] that stretch the framework’s privacy budget.

To be more selective, our new mechanisms leverage some pleasant synergies between privacy and utility in PATE aggregation. For example, when teachers disagree, and there is no real consensus, the privacy cost is much higher; however, since such disagreement also suggests that the teachers may not give a correct answer, the answer may simply be omitted.

Differential privacy is a way of quantifying the notion that an individual’s participation or lack of participation in a database makes little difference. If the teacher models disagree, it may be because an individual who is an outlier has had a large influence on the teacher model that was trained on his data. If you protect the privacy of the “teachers” then you protect the privacy of the individuals, since no more than one teacher was trained on any given individual’s data. When there is consensus among the teachers, there’s no need to spend much of the privacy budget.

The authors go on to say

Similarly, teachers may avoid giving an answer where the student already is confidently predicting the right answer.

Since each query uses some amount of the privacy budget, avoiding even asking questions saves budget.


[1] Nicolas Papernot et al. Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data. arXiv:1610.05755

[2] Nicolas Papernot et al. Scalable Private Learning with PATE. arXiv:1802.08908v1