ETAOIN SHRDLU and all that

Statistics can be useful, even if its idealizations fall apart on close inspection.

For example, take English letter frequencies. These frequencies are fairly well known. E is the most common letter, followed by T, then A, etc. The string of letters “ETAOIN SHRDLU” comes from the days of Linotype, when the keys were arranged in that order, in decreasing order of letter frequency. Sometimes you’d see ETAOIN SHRDLU in print, just as you might see “QWERTY” today.

Morse code is also based on English letter frequencies. The length of a letter in Morse code varies approximately inversely with its frequency, a sort of precursor to Huffman encoding. The most common letter, E, is a single dot, while the rarer letters like J and Q have a dot and three dashes. (So does Y, even though it occurs more often than some letters with shorter codes.)
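As a rough check on the Morse claims above, here is a small Python sketch. The table is the standard International Morse alphabet for the 26 letters; the names `MORSE`, `dots_and_dashes`, and `by_length` are made up for illustration.

```python
# International Morse code for the 26 letters.
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def dots_and_dashes(letter):
    """Return (number of dots, number of dashes) in a letter's code."""
    code = MORSE[letter.upper()]
    return code.count("."), code.count("-")


# Group letters by the number of symbols in their code, shortest first.
by_length = {}
for letter, code in MORSE.items():
    by_length.setdefault(len(code), []).append(letter)

for length in sorted(by_length):
    print(length, "".join(by_length[length]))

# Letters whose code consists of one dot and three dashes.
one_dot_three_dashes = [l for l in MORSE if dots_and_dashes(l) == (1, 3)]
print(one_dot_three_dashes)
```

Counting symbols is only a crude measure of length, since a dash takes longer to send than a dot, but it is enough to see that E and T get the one-symbol codes while J, Q, and Y share the dot-and-three-dashes pattern.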

One letter has worn off my keyboard

So how frequently does the letter E, for example, appear in English? That depends on what you mean by English. You can count how many times it appears, for example, in a particular edition of A Tale of Two Cities, but that isn’t the same as its frequency in English. And if you’d picked the novel Gadsby instead of A Tale of Two Cities, you’d get very different results, since that book was written without using a single letter E.
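Counting letters in a particular text is the easy part. Here is a minimal Python sketch, assuming we only care about the 26 unaccented letters and ignore case. The function name `letter_frequencies` is hypothetical. The first sample is the opening clause of A Tale of Two Cities; the second is a made-up sentence that happens to avoid E, standing in for a lipogram like Gadsby rather than quoting it.

```python
from collections import Counter


def letter_frequencies(text):
    """Map each letter A-Z that occurs in text to its share of all letters."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return {}
    counts = Counter(letters)
    return {letter: n / len(letters) for letter, n in counts.items()}


# Opening clause of A Tale of Two Cities.
ordinary = "It was the best of times, it was the worst of times"
# Made-up sentence with no letter E (an assumption for illustration only).
lipogram = "A rainy morning is a good day for a walk around town."

print(letter_frequencies(ordinary).get("E", 0.0))
print(letter_frequencies(lipogram).get("E", 0.0))
```

The two numbers answer the question for two samples, not for English. Which sample you feed in is exactly the choice the code cannot make for you.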

Peter Norvig reports that E accounted for 12.49% of English letters in his analysis of the Google corpus. That’s a better answer than just looking at Gadsby, or even A Tale of Two Cities, but it’s still not English.

What might we mean by “English” when discussing letter frequency? Written or spoken English? Since when? American, British, or worldwide? If you mean blog articles, I’ve altered the statistics from what they were a moment ago by publishing this. Introductory statistics books avoid this kind of subtlety by distinguishing between samples and populations, but in this case the population isn’t a fixed thing. When we say “English” as a whole we have in mind some idealization that strictly speaking doesn’t exist.

If we want to say, for example, what the frequency of the letter E is in English as a whole, not in some particular English corpus, we can’t give an answer that is meaningful to many decimal places. Nor can we say, for example, which letter is the 18th most frequent. Context could easily change the second decimal place in a letter’s frequency or, among the less common letters, its frequency rank.

And yet, for practical purposes we can say E is the most common letter, then T, etc. We can design better Linotype machines and telegraphy codes using our understanding of letter frequency. At the same time, we can’t expect too much of this information. Anyone who has worked a cryptogram puzzle knows that you can’t say with certainty that the most common letter in a particular sample must correspond to E, the next to T, etc.
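The cryptogram caveat is easy to see with a short sample. The sketch below, with the hypothetical helper `frequency_order`, ranks the letters of a well-known pangram by how often they occur. It assumes plain A-Z letters and breaks ties by first appearance, which is how `Counter.most_common` orders equal counts.

```python
from collections import Counter


def frequency_order(text):
    """Letters of text, most frequent first (ties keep first-seen order)."""
    counts = Counter(c for c in text.upper() if "A" <= c <= "Z")
    return "".join(letter for letter, _ in counts.most_common())


sample = "The quick brown fox jumps over the lazy dog"
order = frequency_order(sample)

# The two most frequent letters in this particular sample.
print(order[:2])
```

In this sentence O appears four times and E only three, so a solver who assumed the most common symbol must stand for E would start off wrong. Longer samples tend to behave better, but nothing guarantees the textbook order.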

By the way, Peter Norvig’s analysis suggests that ETAOIN SHRDLU should be updated to ETAOIN SRHLDCU.


3 thoughts on “ETAOIN SHRDLU and all that”

  1. I have very fond memories of learning about word frequency as a child from the book Alvin’s Secret Code (first published in 1963), which had it as ETAONRISHDLFCMUGY. No idea where that came from but I still have it memorized that way to this day.

  2. Sorry, Peter Norvig. “ETAOIN SRHLDCU” doesn’t roll off the tongue the way “ETAOIN SHRDLU” does.

    Languages are fundamentally spoken or signed. Writing is a later construct, historically. As language learners, we must first learn to understand the spoken or signed language. Only later can we learn to represent it with symbols. The writing attempts to represent what we hear or see, ideally at a basic level.

    You’ve allocated one short question to the issue of spoken English, but the rest of the article only makes sense if you focus on written English. Which phones (as in, “allophones of the same phoneme”) are the most common in English? Which phonemes, and in what order? The English letter “E” represents a number of sounds, as well as no sound at all.

    Even if we were to accept this simplification, the term “English as a whole” would still be vague and impossible to determine. You’d still have the issue of what counts as English, and what is really a separate language. How do you deal with mutually unintelligible versions of English, whose speakers both consider themselves to be speaking “English”? More often, we have versions that are partially intelligible by speakers (or readers) of other dialects, in the same way that readers today deal with Shakespeare or the KJV Bible.

    Should “English as a whole” include only published texts, and not, say, tweets? (You’d never be able to include private journals, or English written on napkins, or all sorts of written texts.) Could we limit it by only including “salient” texts that are published, or would that be impossible to determine? Should it include academic English, or other types of specialized English?

    Statistics can sort through many things, but language is inherently complicated, on many levels. I’m sure I’m forgetting some, in fact.

    One last thing . . . There’s an “it’s” where you meant “its” somewhere.
