# Probability, cryptography, and naïveté

Probability and cryptography have this in common: really smart people can be confidently wrong about both.

I wrote years ago about how striking it was to see two senior professors arguing over an undergraduate probability exercise. As I commented in that post, “Professors might forget how to do a calculus problem, or make a mistake in a calculation, but you wouldn’t see two professors defending incompatible solutions.”

Not only do smart people often get probability wrong, they can be very confident while doing so. The same applies to cryptography.

I recently learned of a cipher J. E. Littlewood invented that he believed was unbreakable. His idea was essentially a stream cipher, simulating a one-time pad by using a pseudorandom number generator. He assumed that since a one-time pad is secure, his simulacrum of a one-time pad would be secure. But it was not, for reasons explained in this paper.

Littlewood was a brilliant mathematician, but he was naive, and even arrogant, about cryptography. Here’s the opening to the paper in which he explained his method.

The legend that every cipher is breakable is of course absurd, though still widespread among people who should know better. I give a sufficient example …

He seems to be saying “Here’s a little example off the top of my head that shows how easy it is to create an unbreakable cipher.” He was the one who should have known better.

# Breach Safe Harbor

In the context of medical data, Safe Harbor typically refers to the Safe Harbor provisions of the HIPAA Privacy Rule explained here. Breach Safe Harbor is a little different. It basically means you’re off the hook if you breach encrypted health data. (But not necessarily. More on that below.)

I’m not a lawyer, so this isn’t legal advice. Even the HHS, who coin the term “Breach Safe Harbor” in their guidance portal, weasels out of saying they’re giving legal guidance by saying “The contents of this database lack the force and effect of law, except as authorized by law …”

## Quality of encryption

You can’t just say that data were encrypted before they were breached. Weak encryption won’t cut it. You have to use acceptable algorithms and procedures.

How can you know whether you’ve encrypted data well enough to be covered Breach Safe Harbor? HHS cites four NIST publications for further guidance. (Not that I’m giving legal advice. I’m merely citing the HHS, who also is not giving legal advice.)

Here are the four publications.

## Maybe encryption isn’t enough

At one point Tennessee law said a breach of encrypted data was still a breach. According to Dempsey and Carlin [1]

In 2016, Tennessee repealed its encryption safe harbor, requiring notice of breach of even encrypted data, but then in 2017, after criticism, the state restored a safe harbor for “information that has been encrypted in accordance with the current version of the Federal Information Processing Standard (FIPS) 140-2 if the encryption key has not been acquired by an unauthorized person.”

This is interesting for a couple reasons. First, there is a precedent for requiring notification of encrypted data. Second, this underscores the point above that encryption in general is not sufficient to avoid having to give notice of a breach: standard-compliant encryption is sufficient.

## Consulting help

If you would like technical or statistical advice on how to prevent or prepare for a data breach, or how to respond after a data breach after the fact, we can help.

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition.

# MD5 hash collision example

Marc Stevens gave an example of two alphanumeric strings that differ in only one byte that have the same MD5 hash value. It may seem like beating a dead horse to demonstrate weaknesses in MD5, but it’s instructive to study the flaws of broken methods. And despite the fact that MD5 has been broken for years, lawyers still use it.

The example claims that

TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

and

TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

have the same hash value.

This raises several questions.

Are these two strings really different, and if so, how do they differ? If you stare at the strings long enough you can see that they do indeed differ by one character. But how could you compare long strings like this in a more automated way?

How could you compute the MD5 hash values of the strings to verify that they are the same?

The following Python code addresses the questions above.

```from hashlib import md5
from difflib import ndiff

def showdiff(a, b):
for i,s in enumerate(ndiff(a, b)):
if s[0]==' ': continue
elif s[0]=='-':
print(u'Delete "{}" from position {}'.format(s[-1],i))
elif s[0]=='+':

a = "TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"
b = "TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"

showdiff(a, b)

ahash = md5(a.encode('utf-8')).hexdigest()
bhash = md5(b.encode('utf-8')).hexdigest()
assert(ahash == bhash)
```

The basis of the `showdiff` function was from an answer to a question on Stack Overflow.

The output of the call to `showdiff` is as follows.

```Delete "A" from position 21
```

This means we can form string `b` from `a` by changing the ‘A’ in position 21 to an ‘E’. There was only one difference between the two strings in this example, but `showdiff` could be useful for understanding more complex differences.

The assert statement passes because both strings hash to faad49866e9498fc1719f5289e7a0269.

# Detecting the language of encrypted text

Imagine you are a code breaker living a century ago. You’ve intercepted a message, and you go through your bag of tricks, starting with the simplest techniques first. Maybe the message has been encrypted using a simple substitution cipher, so you start with that.

Simple substitution ciphers can be broken by frequency analysis: the most common letter probably corresponds to E, the next most common letter probably corresponds to T, etc. But that’s only for English prose. Maybe the message was composed in French. Or maybe it was composed in Japanese, then transliterated into the Latin alphabet so it could be transmitted via Morse code. You’d like to know what language the message was written in before you try identifying letters via their frequency.

William Friedman’s idea was to compute a statistic, what he dubbed the index of coincidence, to infer the probable language of the source. Since this statistic only depends on symbol frequencies, it gives the same value whether computed on clear text or text encrypted with simple substitution. It also gives the same value if the text has been run through a transposition cipher as well.

(Classical cryptanalysis techniques, such as computing the index of coincidence, are completely useless against modern cryptography. And yet ideas from classical cryptanalysis are still useful for other applications. Here’s an example that came up in a consulting project recently.)

As I mentioned at the top of the post, you’d try breaking the simplest encryption first. If the index of coincidence is lower than you’d expect for a natural language, you might suspect that the message has been encrypted using polyalphabetic substitution. That is, instead of using one substitution alphabet for every letter, maybe the message has been encrypted using a cycle of n different alphabets, such as the Vigenère cypher.

How would you break such a cipher? First, you’d like to know what n is. How would you do that? By trial and error. Try splitting the text into groups of letters according to their position mod n, then compute the index of coincidence again for each group. If the index statistics are much larger when n = 7, you’re probably looking a message encrypted with a key of length 7.

The source language would still leave its signature. If the message was encrypted by cycling through seven scrambled alphabets, each group of seven letters would most likely have the statistical distribution of the language used in the clear text.

Friedman’s index of coincidence, published in 1922, was one statistic that could be computed based on letter frequencies, one that worked well in practice, but you could try other statistics, and presumably people did. The index of coincidence is essentially Rényi entropy with parameter α = 2. You might try different values of α.

If the approach above doesn’t work, you might suspect that the text was not encrypted one letter at a time, even using multiple alphabets. Maybe pairs of letters were encrypted, as in the Playfair cipher. You could test this hypothesis by looking that the frequencies of pairs of letters in the encrypted text, calculating an index of coincidence (or some other statistic) based on pairs of letters.

Here again letter pair frequencies may suggest the original language. It might not distinguish Spanish from Portuguese, for example, but it would distinguish Japanese written in Roman letters from English.

# Frequency analysis

Suppose you have a list of encrypted surnames names of US citizens. If the list is long enough, the encrypted name that occurs most often probably corresponds to Smith. The second most common encrypted name probably corresponds to Johnson, and so forth. This kind of inference is analogous to solving a cryptogram puzzle by counting letter frequencies.

The probability of correctly guessing the most common names based on frequency analysis depends critically in the sample size. In a small sample, there may be no Smiths. In a larger sample, the name Smith may be common, but not the most common.

I did some simulations to estimate how well frequency analysis would work at identifying the 10 most common names as a function of the sample size N. For each N, I simulated 100 data sets using probabilities derived from the surname frequencies derived from US Census Bureau data.

When N = 1,000, there was a 53% chance that the most common name in the population, Smith, would be the most common name in the sample. The second most common name in the population, Johnson, was the second most common name in the sample only 14% of the time.

When N = 10,000, there was a 94% chance of identifying Smith, and at least a 30% chance of identifying the five most common names.

When N = 1,000,000, the three most common names were identified every time in the simulation. And each of the 10 most common names were correctly identified most of the time. In fact, the 18 most common names were correctly identified most of the time.

A consequence of this analysis is that hashing names does not protect privacy if the sample size is large. Hashing names along with other information, so that the combined data has a more uniform distribution, may protect privacy.

# Security by obscurity

Security-by-obscurity is a bad idea in general. It’s better, for example, to have a login page than to give your site an obscure URL. It’s better to encrypt a file than to hide it in some odd directory. It’s better to use a well-vetted encryption algorithm than to roll your own.

There are people whose knee-jerk reaction to any form of obscurity is to shout “That’s security-by-obscurity!” but obscurity can be subtle.

All else being equal, adding a layer of obscurity doesn’t hurt. For example, you can literally make a public encryption key public, as I’ve done here. But for extra security, why distribute your encryption key more widely than necessary? And if your message is adequately encrypted, you could in principle publish it for the world to see. But why not just give it to the intended recipient?

The public key on my site is there for strangers to contact me, but if I were really concerned about secure communication between colleagues, I’d just circulate the key among those colleagues. That may not be much more secure, but surely it’s no less secure. And I’d share messages privately, even though they are encrypted.

It’s good to look closely at any argument that beings “all else being equal” to see if all else is indeed equal. A more nuanced objection to security-by-obscurity is that it can create a false sense of security.

One could argue, for example, that making your public key available to the world forces you to be more careful about your encryption. Maybe you’ve been using an RSA key for years, and you really should use a longer key, but you don’t because you can argue that not many people have your public key anyway. But if your key’s too sort, obscuring your public key doesn’t help.

And while it’s better to deliver encrypted messages privately, it helps to not count on this, to assume that the encrypted message might be made public. That’s the basic premise behind encryption.

The principle behind no-security-by-obscurity is that you want to concentrate your security where it can be quantified. You can, for example, quantify how much more effort it would take to break a 64-bit key (like Blowfish) than a 56-bit key (like DES). Or even better, a 128-bit key (like AES). But you can’t quantify the level of protection that comes from obscurity.

Is it more secure to give someone a 56-bit DES key on a flash drive in a dark alley than to send them a 64-bit Blowfish key over SMS You can’t calculate an answer to that question.

In some sense all security is by obscurity. Cryptography literally means hidden writing. But all else being equal—there’s that phrase again—you want to minimize the surface area of what you have to obscure, e.g. limiting your secret to your key and not your algorithm, and it’s better to have quantified risks than unquantified risks. But all else is often not equal, and there are difficult trade-offs.

# Brute force cryptanalysis

A naive view of simple substitution ciphers is that they are secure because there are 26! ways to permute the English alphabet, and so an attacker would have to try 26! ≈ 4 × 1026 permutations. However, such brute force is not required. In practice, simple substitution ciphers are breakable by hand in a few minutes, and you can find software that automates the process.

However, for modern encryption, apparently brute force is required. If you encrypt a message using AES with a 128-bit key, for example, you can’t do much better than try 2128 keys. You might be able to do a little better, but as far as is openly known, you can’t do orders of magnitude better.

Even for obsolete encryption methods such as DES it still takes a lot more effort to break encryption than to apply encryption. The basic problem with DES is that it used 56-bit keys, and trying 256 keys is feasible [1]. You won’t be able to do it on your laptop, but it can be done using many processors in parallel [2]. Still, you’d need more than a passing curiosity about a DES encrypted message before you’d go to the time and expense of breaking it.

If breaking a simple substitution cipher really did require brute force, it would offer 88-bit security. That is, 26! roughly equals 288. So any cipher offering b-bit security for b > 88 is more secure in practice than breaking simple substitution ciphers would be in naive theory. This would include AES, as well as many of its competitors that weren’t chosen for the standard, such as Twofish.

For all the block ciphers mentioned here, the number of bits of security they offer is equal to the size of the key in bits. This isn’t always the case. For example, the security level of an RSA key is much less than the size of the key, and the relation between key size and security level is nonlinear.

A 1024-bit RSA modulus is believed to offer on the order of 87 bits security, which incidentally is comparable to 26! as mentioned above. NIST FIPS 184-5 recommends 2048 bits as the minimum RSA modulus size. This gives about 117 bits of security.

The security of RSA depends on the difficulty of factoring the product of large primes [3], and so you can compute the security level of a key based on the efficiency of the best known factoring algorithm, which is currently the General Number Field Sieve. More on this here.

## Related posts

[1] There are ways to do better than brute force against DES, if you have an enormous number of messages all encrypted with the same key.

[2] In 1998, the EFF built a machine called Deep Crack with 1,856 custom processors that could crack DES encoded messages in nine days, four and a half days on average.

[3] Nobody has proved that breaking RSA requires factoring. There is a variation on RSA that is provably as hard as factoring but as far as I know it has never been widely used.

## Introduction

Computers fundamentally changed cryptography, opening up new possibilities for making and breaking codes. At first it may not have been clear which side benefited most, but now it’s clear that computers gave more power to code makers than code breakers.

We now have cryptographic primitives that cannot be attacked more efficiently than by brute force, as far as we know. The weak link is how these primitives are implemented and combined, not the primitives themselves.

Before computers there was more of a cat and mouse game between encryption and cryptanalysis. Encryption schemes that were convenient to carry out by hand could usually be broken by hand eventually. But if you only needed secrecy briefly, a simple scheme might provide that secrecy for long enough. This post will look at one such scheme, the straddling checkerboard.

## Checkerboards

Perhaps the most obvious way to conveniently turn letters into numbers is to arrange the letters into a 5 × 5 grid. This has to leave out one letter, and in practice this meant combining I and J. Or if you needed digits, you could use a 6 × 6 grid and put J back in. You’d scramble the alphabet in the grid according to some key, then encrypt each letter by its coordinates.

```       12345
+-----
1|EBISP
2|XWLVN
3|AOYZQ
4|MDCKH
5|RTUFG
```

This is no better than a simple substitution cipher because someone intercepting a message encrypted this way would easily guess that pairs of digits represent letters. However, if you then permuted the digits with a transposition cipher, you’d have something more formidable. This is essentially what the ADFGV cipher did, which stumped cryptanalysts for a while.

The straddling checkerboard is a variation on the method above. Letters would be arranged in a 3 × 10 grid rather than 5 × 5. Some letters would be encrypted as a single digit and some as a pair of digits.

```       1234567890
+----------
|  EBISPXWL
1|VNAOYZQMDC
2|KHRTUFGJ./
```

In the example above, E would be encrypted as 3, N would be encrypted as 12, and so on. This is an instance of a prefix code. In order to be able to decode the digits unambiguously, no letter could be encoded as 1 or 2; these digits always signaled the beginning of a pair.

Prefix codes are often used in non-secret codes, such as country codes for telephone numbers. More examples of prefix codes in this post.

Because 1 and 2 could not be used to encode single letters, there were 28 slots to fill. These could be filled with other symbols, and in practice period and slash were added [1].

## Efficiency

The straddling checkerboard gives a more efficient encoding than does the checkerboard since typically fewer digits will be required. If efficiency were the only concern, we’d put the eight most frequent letters on the top row, something like the following [2].

```       1234567890
+----------
|  ETAOINSR
1|BCDFGHJKLM
2|PQUVWXYZ./
```

This would be more efficient but less secure since the arrangement of the letters would be more predictable.

## Security

The straddling checkerboard presents a bit of a challenge to the cryptanalyst since it’s not know a priori whether a digit is part of a pair (if the vertical coordinates are not always 1 and 2).

The straddling checkerboard didn’t offer much security even in its day. It would have been better if there had been some further processing done on the digits, such as how the ADFGV cipher permuted its coordinates.

The message, interpreted as a number N, could have been further encrypted as aN + b where a and b were randomly chosen numbers that were part of the key. As far as I know, nothing like this was ever done. This would have provided more security but would also require more effort and increase the chance of introducing errors.

## Related posts

[1] David Kahn. The Codebreakers. Chapter 18.

[2] You may have expected the last letter on the first row to be H, going by the printer’s order ETAOIN SHRDLU. Peter Norvig discovered a slightly different order of letter frequencies based on the Google corpus.

# New Twitter account for cryptography

I’ve started a new Twitter account: @CryptographyTip.

The icon for the account is the symbol for XOR, a common operation in encryption.

I intend to post about cryptography theory as well as practical matters such as software and file formats.

You can find a list of my other technical twitter accounts here.

You can also find me on Mathstodon.

# Binary to text to binary

Gnu Privacy Guard includes a way to encode binary files as plain ASCII text files, and turn these text files back into binary. This is intended as a way to transmit encrypted data, but it can be used to convert any kind of file from binary to text and back to binary.

To illustrate this, I’ll use Albrecht Dürer’s Melencolia I as an example. (More on this image and its mathematical significance here.)

This image is saved in the file `Melancholia.jpg`.

## Binary to text

If we run

`   gpg --enarmor Melencolia.jpg`

at a command line, it produces a file `Melancholia.jpg.asc`, the `asc` suffix indicating an ASCII file.

We can look inside this file if we’d like.

```-----BEGIN PGP ARMORED FILE-----
Comment: Use "gpg --dearmor" for unpacking

AAAAPgEbAAUAAAABAAAARgEoAAMAAAABAAIAAAExAAIAAAARAAAATgAAAAAAAAEs
AAAAAQAAASwAAAABUGFpbnQuTkVUIHYzLjUuOAAA/9sAQwACAQECAQECAgICAgIC
...
wJ/wkzRf8IN4LCCVVwNCtM8sDnPl5zkk565NbcH7Lnw01K0gtrn4feCriB4QfLl0
S2dV+bdgApwMknAwMk+tFFcktjqif//Z
=rpd1
-----END PGP ARMORED FILE-----
```

The cryptic text `/9j/4A…//Z` is a base 64 encoded representation of the binary file. Think of the file as one big binary number. Write that number in base 64, i.e. partition the bits into groups of six. Then represent the base 64 “digits” as alphanumeric characters, plus the symbols `+` and `/`. More on the details here.

The line `=rpd1` is a 24-bit CRC checksum. The equal sign is a separator, and `rpd1` is a base 64 encoding the checksum.

The JPG file is 91,272 bytes and the ASCII file is 123,712 bytes. The ASCII file is about 1/3 larger because every six bits in the binary file corresponds to an eight-bit ASCII character. The ASCII file is a little bit more than 1/3 larger because of the human-friendly text above and below the base 64 encoding, the newline characters, and the checksum.

## Text to binary

If we run

`    gpg --dearmor Melencolia.jpg.asc`

at a command line, it produces a file `Melancholia.jpg.asc.gpg`. This file is bit-for-bit exactly the same as the original file, which we could confirm by running

`    diff Melencolia.jpg Melencolia.jpg.asc.gpg`