Weak encryption and surveillance

Two of the first things you learn in cryptography are that simple substitution ciphers are very easy to break, and that security by obscurity is a bad idea. This post will revisit both of these ideas.

Security depends on your threat model. If the threat you want to protect against is a human reading your encrypted messages, then simple substitution ciphers are indeed weak. Anyone capable of working a cryptogram in a puzzle book is capable of breaking your encryption.

But if your threat model is impersonal surveillance, things are different. It might not take much to thwart a corporation scanning your email for ideas of what to advertise to you. Even something as simple as rot13 might be enough.

The point of this article is not to recommend rot13, or any other simple substitution cipher, but to elaborate on the idea that evading commercial surveillance is different from evading intelligence agencies. It’s easier to evade bots than spooks.

If you’re the target of a federal investigation, the most sophisticated encryption at your disposal may be inadequate. But if you’re the target of an advertising company, things are much easier.

Rot13

Rot13 works by moving letters ahead 13 positions in the alphabet. The nth letter goes to the (n + 13)th letter mod 26. So A’s become N’s, and N’s become A’s, etc. Rot13 offers no security at all, but it is used to prevent text from being immediately recognizable. A common application is to publish the punchline of jokes.

Rot13 converts the word “pyrex” to “clerk.” If you use the word “pyrex” in an email, but rot13 encrypt your message, you’re unlikely to see advertisements for glassware unless you also talk about a clerk.
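
Here’s a small Python sketch of rot13, using both the standard library codec and the shift-by-13 rule written out by hand.

```python
import codecs

# Python's standard library includes rot13 as a text transform.
print(codecs.encode("pyrex", "rot_13"))   # clerk

# The same transformation written out: shift each letter 13 places mod 26.
def rot13(text):
    result = []
    for c in text:
        if c.isalpha():
            base = ord('A') if c.isupper() else ord('a')
            result.append(chr((ord(c) - base + 13) % 26 + base))
        else:
            result.append(c)
    return "".join(result)

print(rot13("pyrex"))         # clerk
print(rot13(rot13("pyrex")))  # pyrex -- applying rot13 twice gives back the original
```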

It is conceivable, but doubtful, that impersonal surveillance software would try to detect rot13 encoding even though it would be easy to do. But detecting and decrypting simple substitution ciphers in general, while certainly possible, would not be worth the effort. Security-by-obscurity could protect you from surveillance because it’s not profitable for mass surveillance to pursue anything obscure. For example, the Playfair cipher, broken over a century ago, would presumably throw off bots.

Modern but deprecated encryption

Simple substitution ciphers are ancient. Modern encryption methods, even deprecated methods like DES, are far more secure. For example, DES (Data Encryption Standard) is considered obsolete because it can be broken by a multiprocessor machine in under 24 hours. However, commercial surveillance is unwilling to spend processor-days to decrypt one person’s communication.

This is not to recommend DES. If it’s just as easy to use much better algorithms like AES (Advanced Encryption Standard) then why not do that? My purpose in bringing up DES is to say that squabbles over the merits of various modern encryption methods are beside the point if your goal is to evade impersonal surveillance.

Everyone agrees DES should be put out to pasture, but it doesn’t matter if you’re just trying to avoid surveillance. Things that not everyone agrees on matter even less.

What matters more than the algorithms is who holds the keys. A system in which you alone hold a DES key may give you more privacy than a system that holds an AES key for you. Companies may want to protect your private data from competitors but have no qualms about using it themselves.

Why surveillance matters

Why go to any effort to evade corporate surveillance?

Companies share data, and companies get hacked. One innocuous piece of data may be the link that connects two databases together. Things that you don’t mind revealing may unlock things you don’t want to reveal.

Related: A statistical problem with nothing to hide

Randomize, then humanize

Yesterday I wrote about a way to memorize a random 256-bit encryption key. This isn’t trivial, but it’s doable using memory techniques.

There’s a much easier way to create a memorable encryption key: start with something memorable, then apply a hash function. Why not just do that?
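
Here’s a minimal sketch of that humanize-then-randomize idea, assuming SHA-256 as the hash; the phrase below is just a placeholder. The output looks random, but its strength is only that of the phrase it came from, which is what the rest of this post is about.

```python
import hashlib

# Humanize then randomize: derive a 256-bit key from a memorable phrase.
phrase = "It was the best of times, it was the worst of times"  # placeholder phrase
key = hashlib.sha256(phrase.encode("utf-8")).digest()

print(len(key) * 8)  # 256 bits
print(key.hex())     # looks random, but is only as strong as the phrase
```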

There are two conflicting criteria to satisfy: cryptographic strength and ease of memorization. Any password is a compromise between these two goals.

You get better security if you generate a random password, then try to make it memorable through some technique for making it more palatable to a human mind. You get something easier to remember if you start with something human-friendly and apply some process to make it appear more random.

In a nutshell, you can either randomize then humanize, or humanize then randomize.

Humanize then randomize, or randomize then humanize

You get better ease of use if you humanize then randomize. You get better security if you randomize then humanize.

This morning I ran across a paper by Arnold Reinhold suggesting that people generate 10-letter passwords by first generating 10 random letters, then creating a mnemonic sentence with each word starting with one of the letters. Reinhold says that this leads to a greater variety of passwords than if you were to start with a mnemonic sentence and somehow reduce it to 10 letters. This is an example of the randomize-then-humanize pattern.

Why?

There are more possibilities for an attacker to have to explore if you start with random input.

For example, suppose a site requires an 8-character password and you choose an 8-letter word. There are only about 30,000 English words with eight letters [1], and people are far more likely to choose some of these words than others. If you randomly choose a 3-character password using digits and letters (upper case and lower case), there are 62³ = 238,328 possibilities. A three-character random password is far better than an 8-character word.

In Reinhold’s example, there are 26¹⁰ possible passwords made of 10 lowercase letters. That’s over 100 trillion possibilities. There are certainly fewer than 100 trillion pass phrases that humans are likely to come up with. Say you want to use your favorite sentence from a famous book. Suppose there are 1,000 famous books and each has 10,000 sentences. That’s only 10 million possibilities.
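
The arithmetic behind these comparisons, for anyone who wants to check it:

```python
# Number of possibilities for each strategy discussed above.
words_8_letters   = 30_000          # approximate count of 8-letter English words, see [1]
random_3_chars    = 62**3           # upper and lower case letters plus digits
random_10_letters = 26**10          # Reinhold's scheme
famous_sentences  = 1_000 * 10_000  # 1,000 famous books times 10,000 sentences each

print(f"{random_3_chars:,}")        # 238,328
print(f"{random_10_letters:,}")     # 141,167,095,653,376
print(f"{famous_sentences:,}")      # 10,000,000
```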

Human-generated randomness

People are not that good at generating randomness. Here’s a passage from David Kahn’s book The Codebreakers about the results of asking typists to create pages of random numbers for use in one-time pads.

Interestingly, some pads seem to be produced by typists and not by machines. They show strike-overs and erasures — neither likely to be made by machines. More significant are statistical analyses of the digits. One such pad, for example, has seven times as many groups in which digits in the 1-to-5 group alternate with digits in the 6-to-0 group, like 18293, as a purely random arrangement would have. This suggests that the typist is striking alternately with her left hand (which would type the 1-to-5 group on a Continental machine) and her right (which would type the 6-to-0 group). Again, instead of just half the groups beginning with a low number, which would be expected in a random selection, three quarters of them do, possibly because the typist is spacing with her right hand, then starting a new group with her left. Fewer doubles and triples appear than chance expects.

How hacks work

Websites implicitly use the humanize-then-randomize approach. When you create a password, the site hashes what you type and stores the hashed value. (A naive site might store the actual password.) Then the next time you log in, the site hashes your password input and compares it to the stored value.

If the site is hacked, and the site’s hashing algorithm is known, then many of these passwords can be recovered. This happens routinely. If you apply a hash function to a list of 10,000 common passwords, there are only 10,000 hash values, and you simply search the hacked list for these values. And since people often reuse passwords, someone who knows your password on one site can try that password on another site.
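
Here’s a sketch of the attack, under the simplifying assumption that the breached site stored unsalted SHA-256 hashes; the accounts and passwords below are made up.

```python
import hashlib

# Hypothetical leaked database: username -> unsalted SHA-256 hash of the password.
leaked = {
    "alice@example.com": hashlib.sha256(b"password123").hexdigest(),
    "bob@example.com":   hashlib.sha256(b"x7$Qw9!mZpL2rT4v").hexdigest(),
}

common_passwords = ["123456", "password", "password123", "qwerty", "letmein"]

# Hash each common password once, then look for matches in the leak.
precomputed = {hashlib.sha256(p.encode()).hexdigest(): p for p in common_passwords}

for user, h in leaked.items():
    if h in precomputed:
        print(f"{user} used the common password {precomputed[h]!r}")
    else:
        print(f"{user} not cracked by this list")
```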

If you use a randomly generated password for each site, it’s less likely any individual password will be exposed. And if a password is exposed, a hacker cannot use it on another site.


[1] On my Macbook, grep '^........$' words | wc -l returned 30,001. You’d get different results from searching different word lists, but your results wouldn’t vary too much.

What is the point of a public key fingerprint?

Public key cryptography uses two keys: a private key and a public key. The nature of these keys depends on the encryption scheme, such as whether one is using RSA, ECC, or some other method, but you can think of a key as a long number. A key may be a short list of numbers, but let’s just say a key is a number.

When you generate a key pair, you get two numbers: one that you keep secret and one that you publicize. Anyone can use your public key to encrypt a message that only you can decrypt, assuming only you have your private key.

If I give you my public key, say I post it on my website, how can you be sure that it’s really my key? Maybe my site has been hacked and I don’t even know it. Or maybe someone tampers with the site content between the time it leaves my server and arrives at your browser (though TLS is supposed to address that).

We could get on a phone call and you could read the key back to me. The problem with this approach is that keys are long. A 4096-bit RSA key, for example, encoded in hexadecimal, is 1024 characters long.

You could read off the last so many characters of the key, maybe telling me that what you received ends in 9F0D 38FA. That would probably be sufficient to tell one key from another—it’s very unlikely you’d have two keys in a keychain that share the same last 32 bits—but that doesn’t mean the previous part of the key hasn’t been tampered with.

What you’d really like is a cryptographic hash of the public key, short enough to conveniently compare, but long enough that it would be infeasible for someone to alter the public key in such a way as to produce the same hash. And that’s exactly what a fingerprint is.

In GnuPG, the GNU implementation of PGP, a fingerprint is a 160-bit hash, which means 40 hexadecimal characters. Manually verifying 40 characters is tedious but feasible; manually verifying 1024 characters is not.
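
As a sketch of the idea rather than of the actual OpenPGP fingerprint algorithm (which hashes a specific key-packet encoding, not the raw key bytes), here’s how a short hash stands in for a long key.

```python
import hashlib

# A fingerprint is a short, fixed-length hash of a long public key.
# The key bytes below are a stand-in for a 4096-bit key (1024 hex characters).
public_key_hex = "ab" * 512
fingerprint = hashlib.sha1(bytes.fromhex(public_key_hex)).hexdigest().upper()

print(len(public_key_hex))            # 1024 characters to compare by hand
print(len(fingerprint), fingerprint)  # 40 characters -- feasible to read over the phone
```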

Why not reuse passwords?

Perhaps you’ve heard that you should not reuse passwords but don’t understand why. After all, you have a really good password, one that nobody would ever guess, so why not use it everywhere?

Your password is not as good as you think

First of all, people who think they have a really good password are usually wrong. They implicitly assume that an attacker would have to recreate the devious process they went through to create the password. This is not the case. A password needs to be hard for an attacker to crack, not hard for the owner to create, and these are not the same thing. The former requires passwords that are long and randomly generated.

Credential stuffing

Suppose you use the same password on sites X and Y. Then site X gets hacked, revealing your password. Now your username (most likely your email address) and password are part of a list posted online somewhere. Hackers then use scripts to see whether the same credentials will work somewhere else, and they often do. If they try site Y, then your account on Y has been compromised as well.

Now suppose you use a different username at the two sites. This is better, but the hack on X still means that your password has been added to a list of known passwords.

The worst case scenario is a site that stores passwords in the clear. This practice has long been discouraged, but the more times you reuse a password, the higher the chance that you’ll use your password at a site that does not hash passwords.

Hashing

Most sites these days don’t store your password per se but a cryptographic hash of your password. When you enter your password, the site hashes it and compares the result to the hashed value it has on file. If they match, you’re in.
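
Here’s a sketch of how a careful site might do this, using a salted, deliberately slow hash; the PBKDF2 parameters below are illustrative, not a recommendation.

```python
import hashlib, hmac, os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, hash) to store; the password itself is never stored."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return hmac.compare_digest(digest, stored)  # constant-time comparison

salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("password123", salt, stored))                   # False
```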

If your hashed password for X is part of a breach, and the hashing algorithms for X and Y are known, an attacker can try hashing a list of known passwords and maybe one of them will match the hash of your password on Y.

If sites X and Y use different hashing algorithms, then a hash of your password for X is not directly useful to hacking your account on site Y. But it is possible to “unhash” passwords, especially if a site uses a hashing algorithm that has been broken. This takes a lot of computation, but people do it.

Example

This is not hypothetical. It has happened many times. For example, it was part of what led to the recent 23andMe breach. And when a hacker obtains one person’s family tree data, they obtain data on many other people at the same time. If you used a unique, long, randomly generated password on 23andMe but your cousin used password123, then your data may have been stolen.

What to do?

What can you do? You almost have to use a password manager. A strong password is necessarily hard to remember, and memorizing hundreds of strong passwords would be quite a chore. Still, you might want to memorize one or two strong passwords if they protect something really valuable.


Database reconstruction attacks

In 2018, three researchers from the US Census Bureau published a paper entitled “Understanding Database Reconstruction Attacks on Public Data.” [1] The article showed that private data on many individuals could be reverse engineered from public data.

As I wrote about a few days ago, census blocks are at the bottom of the US Census Bureau’s hierarchy of geographical entities. On average a census block may contain about 40 people, but a block may contain only one person.

In hindsight it seems fairly obvious that data reported at the census block level is vulnerable to re-identification, and yet this doesn’t seem to have been noticed until around 2000. There were some privacy measures in place before then, but it was not yet appreciated that these measures were insufficient to protect privacy.

You can think of each fact about each person as a variable and each reported statistic as an equation. When the number of equations is comparable to the number of variables, it’s possible that the system of equations has a unique solution. (We know a priori that there exists at least one solution, assuming the reported statistics were correctly computed.)

It’s not quite as simple as that, though that is roughly the idea in [1]. The data collected in the census is binary or integer data, which makes database reconstruction easier. Ages, for example, are integers, and typically integers less than 100.

One of the techniques the Census Bureau previously used in an attempt to protect individual privacy was a sort of small cell rule, a rule to not report statistics based on three or fewer individuals. This may or may not help. In the example given in [1], there are 7 people in a hypothetical census block, of whom 4 are adults and an unreported number are minors. Determining the number of minors is left as an exercise for the reader.

The set of equations is more complicated than a set of linear equations. The inference problem is a matter of logic programming or constraint satisfaction. Missing data is not always as trivial to reconstruct as in the preceding paragraph, but missing data can still convey partial information. The very fact that the data is missing tells you something.
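
Here’s a toy brute-force sketch of the constraint-satisfaction idea. The published statistics below are made up and much simpler than the example in [1], but they illustrate how a handful of equations over integer ages can pin down every resident.

```python
from itertools import combinations_with_replacement
from statistics import mean, median

# Hypothetical published statistics for a tiny census block of 3 people.
POPULATION = 3
MEAN_AGE   = 40   # mean age of all residents
MEDIAN_AGE = 30   # median age of all residents
UNDER_18   = 1    # number of minors
ADULT_MEAN = 55   # mean age of residents 18 and over

solutions = []
for ages in combinations_with_replacement(range(100), POPULATION):
    adults = [a for a in ages if a >= 18]
    if (mean(ages) == MEAN_AGE
            and median(ages) == MEDIAN_AGE
            and POPULATION - len(adults) == UNDER_18
            and adults and mean(adults) == ADULT_MEAN):
        solutions.append(ages)

print(solutions)  # [(10, 30, 80)] -- every resident's age, reconstructed
```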

The discrete nature of the data makes the solution process both harder and easier. It makes things harder in the sense of requiring a more complicated solution algorithm, but it makes things easier in the sense of increasing the likelihood that the equations have a unique solution.

This is why the Census Bureau embraced differential privacy for the 2020 census. They had no choice but to do something substantially different than they had done in the past once it became apparent that their previous approach failed rather badly at protecting confidentiality.


[1] Simson Garfinkel, John M. Abowd, Christian Martindale. Understanding Database Reconstruction Attacks on Public Data. ACM Queue, October 2018. The article was also published in Communications of the ACM in March 2019.

Personal information in digital photos

Is it possible to identify the people in the photo above? Maybe. Digital images potentially contain a large amount of metadata that could reveal the photographer’s identity and location. There may also be a surprising number of clues in the photo itself.

EXIF metadata

The standard format for image metadata is EXIF, Exchangeable Image File Format. Some of this information is obviously identifiable, such as fields called CameraOwnerName, Photographer, and ImageEditor. A camera may or may not record such information, and someone may remove it from a photo after it is taken, but this information may be embedded inside the photo.

Similarly, the photo may include information regarding where the photo was taken, such as in the GPSLatitude, GPSLongitude, and GPSAltitude fields. There are also fields for recording when the photo was taken or edited.

A recurring theme in data privacy is that information that is not obviously identifiable may still be used to identify someone. If this data doesn’t do the whole job, it narrows down possibilities to the point that other known information may complete the identification.

For example, the highly technical fields contained in an image could identify the camera equipment. The camera serial number directly identifies the camera, but other fields may indirectly identify the camera.

Similarly, an image without GPS data may still contain indirect location information. For example, there are fields for recording temperature, humidity, and atmospheric pressure. These fields, used in combination with timestamps, could identify a location, or at least narrow down the set of possible locations.

There are many EXIF fields that are allowed to be arbitrarily long ASCII or Unicode (UTF-8) sequences. A program for editing EXIF data would allow someone to copy the contents of Moby Dick into one of these fields.

The next post describes a similar situation for medical images.

Clues in the photo itself

Stripping EXIF data from an image before making it public is a good idea both for privacy and for size. If a free text field does contain Moby Dick, you could make your image 1.2 MB smaller by removing it.
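
Here’s a minimal sketch of inspecting and stripping EXIF data with the Pillow library; the file names are placeholders.

```python
from PIL import Image, ExifTags

# Assumes Pillow is installed and "photo.jpg" exists; both are placeholders.
img = Image.open("photo.jpg")

# Print whatever EXIF tags the file carries, by name where known.
for tag_id, value in img.getexif().items():
    name = ExifTags.TAGS.get(tag_id, tag_id)
    print(name, value)

# Save a copy that carries only the pixel data: rebuild the image from its
# pixels and write it out without passing any metadata along.
clean = Image.new(img.mode, img.size)
clean.putdata(list(img.getdata()))
clean.save("photo_clean.jpg")
```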

However, it’s often possible to detect from the photo itself where the photo was taken. I stumbled on a YouTube channel of someone who identifies photos as a hobby. No doubt there are many such people. The host invites people to send in photos and he uses openly available information to track down where they are.

If you strip the precise time and location information from the metadata, someone may be able to infer approximate replacements from clues in the photo itself such as shadows or seasonal vegetation.

Ordinary people have no idea how much location information can be inferred from a photo. Neither do some people who ought to know better. There was a story a few months ago about a photo at a secret military location whose position was inferred from, among other clues, stars that faintly appeared in the sky near dusk.

Update: As noted in the comments, Facebook has a patent on a way to identify people from the pattern of dust on their camera lenses.


Photo by Evgeniy Prokofiev on Unsplash

What can you learn from a phone number?

What can someone learn about you from your phone number?

The answer depends on what other information someone has. Identifiers always depend on context. To a naked man in a tree [1] the phone number doesn’t carry any information. But to someone with a list of names and phone numbers, some sort of reverse phone number look up, it might tell them your name.

A while back I wrote about area codes and how they are distributed among states. NANPA publicly posts data that goes into greater detail, down to central office codes. Using this data, you can look up the first six digits of a phone number and find where the central office associated with that number is located.

For example, take the phone number 469 863 7090. This is a business phone number, and so you could type it into a search engine and find out exactly whose number it is. But if that weren’t possible, you could look up 469-863 in the NANPA database to find that the number is located in Frisco, Texas. In fact, the number belongs to Sky Rocket Burger. Recommended.
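
Here’s a sketch of such a lookup against a local copy of the central office code data. The file name and column names are hypothetical, so adjust them to match whatever NANPA export you download.

```python
import csv

def lookup(phone: str, path: str = "central_office_codes.csv"):
    """Return (rate center, state) for the NPA-NXX prefix of a 10-digit number."""
    npa, nxx = phone[:3], phone[3:6]
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Column names here are assumptions about the downloaded file.
            if row["NPA"] == npa and row["NXX"] == nxx:
                return row["RateCenter"], row["State"]
    return None

print(lookup("4698637090"))  # something like ("FRISCO", "TX") if the columns match
```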

Now people can move around and keep their mobile phone numbers, so any kind of phone look up may tell you about where someone used to be rather than where they are. That could be even more useful.


[1] A lawyer once told me that his law school professor said that the only thing the interstate commerce clause of the US Constitution doesn’t apply to is a naked man in a tree.

Zero-Concentrated Differential Privacy

Differential privacy can be rigid and overly conservative in practice, and so finding ways to relax pure differential privacy while retaining its benefits is an active area of research. Two approaches to doing this are concentrated differential privacy [1] and Rényi differential privacy [3].

Concentrated differential privacy was used in reporting results from the 2020 US Census. Specifically, zero-concentrated differential privacy with Gaussian noise.

Differential privacy quantifies the potential impact of an individual’s participation or lack of participation in a database and seeks to bound the difference. The original proposal for differential privacy and the approaches discussed here differ in how they measure the difference an individual can make. Both concentrated differential privacy (CDP) and Rényi differential privacy (RDP) use Rényi divergence, though they use it in different ways.

In [3], Mironov discusses the similarities and differences between CDP and RDP. (I changed Mironov’s reference numbers to the reference numbers used here.)

The closely related work by Dwork and Rothblum [1], followed by Bun and Steinke [2], explore privacy definitions—Concentrated Differential Privacy and zero-Concentrated Differential Privacy—that are framed using the language of, respectively, subgaussian tails and the Rényi divergence. The main difference between our approaches is that both Concentrated and zero-Concentrated DP require a linear bound on all positive moments of a privacy loss variable. In contrast, our definition applies to one moment at a time. Although less restrictive, it allows for more accurate numerical analyses.

(α, ε)-RDP fixes values of α and ε and requires that the Rényi divergence of order α between the outputs of a randomized mechanism M applied to two adjacent databases, that is, databases that differ by the data on one individual, is bounded by ε.

D_α(M(x) ‖ M(x′)) ≤ ε

Zero-concentrated differential privacy (zCDP) with parameters ε and ρ requires that the Rényi divergence is bounded by ε + ρα for all α in (1, ∞).

D_α(M(x) ‖ M(x′)) ≤ ε + ρα
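
As a concrete instance, the Gaussian mechanism satisfies ρ-zCDP (the ε = 0 case above) when the noise scale is calibrated to the query’s sensitivity, a result from [2]. Here’s a minimal sketch.

```python
import numpy as np

rng = np.random.default_rng()

def gaussian_mechanism(true_value: float, sensitivity: float, rho: float) -> float:
    """Release a query answer with Gaussian noise calibrated for rho-zCDP.

    Adding N(0, sigma^2) noise with sigma = sensitivity / sqrt(2 * rho)
    satisfies rho-zCDP; see Bun and Steinke [2].
    """
    sigma = sensitivity / np.sqrt(2 * rho)
    return true_value + rng.normal(0, sigma)

# Example: a counting query (sensitivity 1) released with rho = 0.1.
print(gaussian_mechanism(1234, sensitivity=1.0, rho=0.1))
```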

The pros and cons of zCDP and RDP are complicated. For more details, see the references below.


[1] Cynthia Dwork and Guy Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887, 2016.

[2] Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. arXiv:1605.02065 [cs.CR], 2017

[3] Ilya Mironov. Rényi Differential Privacy. arXiv:1702.07476 [cs.CR], 2017.

Differentially private stochastic gradient descent

Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent.

Gradient descent

We’ll start with gradient descent. Suppose you have a function of several variables f(x) where x is a vector. Then the gradient ∇f(x) points in the direction of greatest increase in f, and its negative −∇f(x) points in the direction of greatest decrease. So you can decrease the function f by moving in the direction −∇f(x). You can turn this observation into an algorithm for minimizing f. Find the gradient, take a step in the opposite direction. Rinse, lather, and repeat. The gradient descent method is also called steepest descent because locally you’re always moving in the direction of steepest descent.

Locally is the key word here. In some finite neighborhood of x, −∇f(x) points in a direction that will decrease the value of f. How large is this neighborhood? Hard to say. How long a step should you take? Also hard to say.

If your function f is strictly convex and has a minimum, then that minimum is the unique global minimum, and gradient descent with a suitable step size will converge to it.
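
Here’s a minimal sketch of the iteration; the step size is fixed for simplicity, and choosing it well is exactly the difficulty mentioned above.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iterations=100):
    """Minimize a function by repeatedly stepping against its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iterations):
        x = x - step * grad(x)
    return x

# Example: f(x, y) = (x - 3)^2 + 2(y + 1)^2, with gradient (2(x - 3), 4(y + 1)).
grad_f = lambda v: np.array([2 * (v[0] - 3), 4 * (v[1] + 1)])
print(gradient_descent(grad_f, [0.0, 0.0]))  # approaches [3, -1]
```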

Stochastic gradient descent

What could go wrong? If your objective function f is strictly convex, convergence is guaranteed, but convergence may be slow. The method is always locally optimal, but optimal may mean inside a tiny little neighborhood of where you start each iteration.

Gradient descent can meander through a valley, a nearly flat spot of the objective function, taking tiny steps back and forth and not making much progress toward finding the low point.

Another problem is that gradient descent can get stuck in a local minimum. If f is strictly convex then there are no local minima, just one global minimum, but stochastic gradient descent is used to minimize functions that are not convex and that may have many local minima.

So to get unstuck, either from being stuck at a local minimum or from a flat spot, stochastic gradient descent adds randomness.

The objective functions used in training neural networks have many variables, maybe millions or billions, and these functions are far from convex.
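
The randomness in stochastic gradient descent typically comes from estimating the gradient on a small random batch of the data at each step. Here’s a toy sketch with a made-up one-parameter objective.

```python
import numpy as np

rng = np.random.default_rng()

def sgd(grad_i, x0, n_examples, step=0.1, iterations=1000, batch_size=8):
    """Stochastic gradient descent: at each step, estimate the gradient
    from a random mini-batch of examples rather than the full data set."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iterations):
        batch = rng.choice(n_examples, size=batch_size, replace=False)
        g = np.mean([grad_i(x, i) for i in batch], axis=0)
        x = x - step * g
    return x

# Toy example: minimize the average of f_i(x) = (x - t_i)^2, whose minimizer
# is the mean of the targets t_i.
targets = rng.normal(5.0, 2.0, size=1000)
grad_i = lambda x, i: 2 * (x - targets[i])
print(sgd(grad_i, x0=[0.0], n_examples=len(targets)))  # roughly the mean, about 5
```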

Differentially private stochastic gradient descent

Now suppose you’re doing machine learning on sensitive data, such as medical records. You want to train a model on the data, but you don’t want the model to memorize and leak personally identifiable information (PII) or protected health information (PHI).

If you were simply querying the data, rather than training a network on it, you could apply differential privacy to queries, adding an amount of noise to each query result calibrated to be just enough randomness to preserve privacy. But training a neural network is far more complicated than running a SQL query.

The core idea of differential privacy is that any individual’s presence or absence from a database must not make much difference, in a rigorously quantified sense. Any outcome that happened with someone in the database could plausibly have happened without that person being in the database. Stochastic gradient descent is already a randomized algorithm, and variations on the algorithm can also provide differential privacy guarantees. See [1] for a seminal paper and [2] for a later refinement.
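
Here’s a sketch of the core update from [2]: clip each example’s gradient, average, and add Gaussian noise. The privacy accounting that turns the clipping norm and noise level into an (ε, δ) guarantee is omitted.

```python
import numpy as np

rng = np.random.default_rng()

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD update: clip per-example gradients, average, add Gaussian noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))  # bound each example's influence
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=avg.shape)
    return params - lr * (avg + noise)

# Toy usage with made-up gradients for a 3-parameter model.
params = np.zeros(3)
grads = [np.array([0.5, -2.0, 1.0]), np.array([3.0, 0.1, -0.4])]
print(dp_sgd_step(params, grads))
```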


[1] Shuang Song, Kamalika Chaudhuri, Anand D. Sarwate. Stochastic gradient descent with differentially private updates. 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

[2] Abadi et al. Deep Learning with Differential Privacy. arXiv:1607.00133 [stat.ML]

Randomized response and local differential privacy

Differential privacy protects user privacy by adding randomness as necessary to the results of queries to a database containing private data. Local differential privacy protects user privacy by adding randomness before the data is inserted into the database.

Using the visualization from this post, differential privacy takes the left and bottom (blue) path through the diagram below, whereas local differential privacy takes the top and right (green) path.

The diagram does not commute. Results are more accurate along the blue path. But this requires a trusted party to hold the identifiable data. Local differential privacy does not require trusting the recipient of the data to keep the data private and so the data must be deidentified before being uploaded. If you have enough data, e.g. telemetry data on millions of customers, then you can statistically afford to randomize your data before storing it.

I gave a simple description of randomized response here years ago. Randomized response gives users plausible deniability because their communicated responses are not deterministically related to their actual responses. That post looked at a randomized response to a simple yes/no question. More generally, you could have a question with k possible answers and randomize each answer to one of ℓ different possibilities. It is not necessary that k = ℓ.

A probability distribution is said to be ε-locally differentially private if for all possible pairs of inputs x and x′ and any output y, the ratio of the conditional probabilities of y given x and y given x′ is bounded by exp(ε). So when ε is small, the probability of any given output conditional on each possible input is roughly the same. Importantly, the conditional probabilities are not exactly the same, and so one can recover some information about the unrandomized response in aggregate via statistical means. However, it is not possible to infer any individual’s unrandomized response, assuming ε is small.

In the earlier post on randomized response, the randomization mechanism and the inference from the randomized responses were simple. With multiple possible responses, things are more complicated. You could choose different randomization mechanisms and different inference approaches for different contexts and priorities.
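
One common mechanism for the k-answer case is generalized randomized response, sketched below; the aggregate inference step that corrects for the known randomization probabilities is omitted.

```python
import math, random

def k_randomized_response(true_answer: int, k: int, epsilon: float) -> int:
    """Generalized randomized response over k possible answers (0 .. k-1).

    Report the true answer with probability e^eps / (e^eps + k - 1),
    otherwise report one of the other k - 1 answers uniformly at random.
    The ratio of any two conditional output probabilities is at most e^eps,
    so the mechanism satisfies epsilon-local differential privacy.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_truth:
        return true_answer
    others = [a for a in range(k) if a != true_answer]
    return random.choice(others)

# Example: a question with 4 possible answers, randomized with epsilon = 1.
responses = [k_randomized_response(2, k=4, epsilon=1.0) for _ in range(10)]
print(responses)
```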

With local differential privacy, users can share their data without trusting the data recipient to keep the data private; in a real sense the recipient isn’t receiving personal data at all. The recipient is receiving the output of a stochastic process which is weakly correlated with individual data, but isn’t receiving individual data per se.

Local differential privacy scales up well, but it doesn’t scale down well. When ε is small, each data contributor has strong privacy protection, but the aggregate data isn’t very useful unless so many individuals are represented in the data that the randomness added to the responses can largely be statistically removed.
