Knuth’s series for chi squared percentage points

In the last two posts, we needed to find percentage points for a chi square random variable when the parameter ν, the number of degrees of freedom, is large.

In Volume 2 of his magnum opus TAOCP, Knuth gives a table of percentage points for the chi square distribution for small values of ν and an asymptotic approximation to be used for ν > 30.

Knuth’s series is

\nu + \sqrt{2\nu}x_p + \frac{2}{3} x_p^2 - \frac{2}{3} + {\cal O}(1/\sqrt{\nu})

where xp is the corresponding percentage point function for a normal random variable. Knuth gives a few values of xp in his table.

What Knuth calls xp, I called z in the previous post. The approximation I gave, based on Fisher’s transformation to make the chi squared more like a normal, is similar to Knuth’s series, but not the same.

So which approximation is better? Where does Knuth’s approximation come from?

Comparing approximations

We want to find the quantiles for χ²(100) for p = 0.01 through 0.99. Both Fisher’s approximation and Knuth’s approximation are good enough that it’s hard to tell the curves apart visually when the exact values are plotted along with the two approximations.

But when you subtract off the exact values and just look at the errors, it’s clear that Knuth’s approximation is much better in general.
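Here’s a rough sketch of how such a comparison can be carried out with SciPy, using Knuth’s series above and the Fisher approximation described in the previous post (included below); the variable names are my own.

    from scipy.stats import chi2, norm
    import numpy as np

    nu = 100
    p = np.linspace(0.01, 0.99, 99)
    x_p = norm.ppf(p)                    # normal percentage points

    exact  = chi2(nu).ppf(p)
    knuth  = nu + np.sqrt(2*nu)*x_p + (2/3)*x_p**2 - 2/3
    fisher = (x_p + np.sqrt(2*nu - 1))**2 / 2

    print(np.abs(knuth  - exact).max())  # worst-case error of Knuth's series
    print(np.abs(fisher - exact).max())  # worst-case error of Fisher's approximation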

Here’s another plot, taking the absolute value of the error. This makes it easier to see little regions where Fisher’s approximation happens to be better.

Here are a couple more plots comparing the approximations at more extreme probabilities. First the left tail:

And then the right tail:

Derivation

The table mentioned above says to see exercise 16 in the same section, which in turn refers to Theorem 1.2.11.3A in Volume 1 of TAOCP. There Knuth derives an asymptotic series which is not directly the one above, but does most of the work. The exercise asks the reader to finish the derivation.

Chi squared approximations

In the previous post I needed to know the tail percentile points for a chi squared distribution with a huge number of degrees of freedom. When the number of degrees of freedom ν is large, a chi squared random variable has approximately a normal distribution with the same mean and variance, namely mean ν and variance 2ν.

In that post, ν was 9999 and we needed to find the 2.5 and 97.5 percentiles. Here are the percentiles for χ²(9999):

    >>> from scipy.stats import chi2, norm
    >>> chi2(9999).ppf([0.025, 0.975])
    array([ 9723.73223701, 10278.05632026])

And here are the percentiles for N(9999, √19998):

    >>> norm(9999, (2*9999)**0.5).ppf([0.025, 0.975])
    array([ 9721.83309451, 10276.16690549])

So the results on the left end agree to three significant figures and the results on the right agree to four.

Fewer degrees of freedom

When ν is more moderate, say ν = 30, the normal approximation is not so hot. (We’re stressing the approximation by looking fairly far out in the tails. Closer to the middle the fit is better.)

Here are the results for χ²(30):

    >>> chi2(30).ppf([0.025, 0.975])
    array([16.79077227, 46.97924224])

And here are the results for N(30, √60):

    >>> norm(30, (60)**0.5).ppf([0.025, 0.975])
    array([14.81818426, 45.18181574])

The normal distribution is symmetric and the chi squared distribution is not, though it becomes more symmetric as ν → ∞. Transformations of the chi squared distribution that make it more symmetric may also improve the approximation accuracy. That wasn’t important when we had ν = 9999, but it is more important when ν = 30.

Fisher transformation

If X ~ χ²(ν), Fisher suggested the approximation √(2X) ~ N(√(2ν − 1), 1).

Let Y be a N(√(2ν − 1), 1) random variable and Z a standard normal random variable, N(0, 1). Then we can estimate χ² probabilities from normal probabilities.

\begin{align*} P(X \leq x) &= P(\sqrt{2X} \leq \sqrt{2x}) \\ &\approx P(Y \leq \sqrt{2x}) \\ &= P(Z \leq \sqrt{2x} - \sqrt{2\nu - 1}) \end{align*}

So if we want to find the percentage points for X, we can solve for corresponding percentage points for Z.

If z is the point where P(Z ≤ z) = p, then

x = \frac{(z + \sqrt{2\nu-1})^2}{2}

is the point where P(X ≤ x) = p.

If we use this to find the 2.5 and 97.5 percentiles for a χ²(30) random variable, we get 16.36 and 46.48, an order of magnitude more accurate than before.
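Here’s a short sketch of that calculation in Python (my code, using the formula above).

    from scipy.stats import norm

    nu = 30
    z = norm.ppf([0.025, 0.975])         # standard normal percentage points
    x = (z + (2*nu - 1)**0.5)**2 / 2     # Fisher's approximation
    print(x)                             # approximately the 16.36 and 46.48 above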

When ν = 9999, the Fisher transformation gives us percentiles that are two orders of magnitude more accurate than before.

Wilson–Hilferty transformation

If X ~ χ²(ν), the Wilson–Hilferty transformation says that (X/ν)^(1/3) is approximately normal with mean 1 − 2/(9ν) and variance 2/(9ν).

This transformation is a little more complicated than the Fisher transform, but also more accurate. You could go through calculations similar to those above to approximate percentage points using the Wilson–Hilferty transformation.
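Inverting the Wilson–Hilferty transformation the same way we inverted Fisher’s gives x ≈ ν(1 − 2/(9ν) + z√(2/(9ν)))³. Here’s a sketch of that calculation for ν = 30.

    from scipy.stats import norm

    nu = 30
    z = norm.ppf([0.025, 0.975])
    mean, var = 1 - 2/(9*nu), 2/(9*nu)
    x = nu * (mean + z*var**0.5)**3      # Wilson–Hilferty approximation
    print(x)                             # very close to the exact 16.79 and 46.98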

The main use for approximations like this is now for analytical calculations; software packages can give accurate numerical results. For analytical calculation, the simplicity of the Fisher transformation may outweigh the improved accuracy of the Wilson–Hilferty transformation.


Variance of variances. All is variance.

In the previous post, I did a simulation to illustrate a theorem about the number of steps needed in the Euclidean algorithm. The distribution of the number of steps is asymptotically normal, and for numbers 0 < a < b < x the mean is asymptotically

12 log(2) log(x) / π².

What about the variance? The reference cited in the previous post [1] gives an expression for the variance, but it’s complicated. The article said the variance is close to a constant times log x, where that constant involves the second derivative of “a certain pseudo zeta function associated with continued fractions.”

So let’s estimate the variance empirically. We can easily compute the sample variance to get an unbiased point estimate of the population variance, but how can we judge how accurate this point estimate is? We’d need the variance of our estimate of variance.

When I taught statistics classes, this is where students would struggle: there are multiple levels of randomness going on. We have some random process. (Algorithm run times are not really random, but we’ll let that slide.)

We have a statistic S², the sample variance, which is computed from samples from a random process, so our statistic is itself a random variable. As such, it has a mean and a variance. The mean of S² is the population variance; that’s what it means for a statistic to be an unbiased estimator.

The statistic S² has a known distribution, which we can use to create a confidence interval. Namely,

\frac{(n-1)S^2}{\sigma^2} \sim \chi^2

where the chi squared random variable has n − 1 degrees of freedom. Here n is the number of samples and σ² is the population variance. But wait, isn’t the whole purpose of this discussion to estimate σ² because we don’t know what it is?!

This isn’t quite circular reasoning. More like iterative reasoning. We have an estimate of σ² which we then turn around and use in constructing its confidence interval. It’s not perfect, but it’s useful.

A (1 − α) 100% confidence interval is given by

\left( \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}, \frac{(n-1)s^2}{\chi^2_{\alpha/2}} \right)

Notice that the right tail probability appears on the left end of the interval and the left tail probability appears on the right end of the interval. The order flips because they’re in a denominator.

I modified the simulation code from the previous post to compute the sample standard deviation for the runs, and got 4.6926, which corresponds to a sample variance of 22.0205. There were 10,000 reps, so our chi squared distribution has 9,999 degrees of freedom. We can compute the confidence interval as follows.

    >>> from scipy.stats import chi2
    >>> 9999*22.0205/chi2.ppf([0.975, 0.025], 9999)

This says a 95% confidence interval for the variance, in one particular simulation run, was [21.42, 22.64].

The result in [1] said that the variance is proportional to log x, and in our case x = 2^63 because that post simulated random numbers up to that limit. We can estimate that the proportionality constant is 22.0205/(63 log 2) = 0.5042. In other words, the variance is about half of log x.

The theorem in [1] gives us the asymptotic variance. If we used a bigger value of x, the variance could be a little different, but I expect it would be around 0.5 log x.


[1] Doug Hensley. The Number of Steps in the Euclidean Algorithm. Journal of Number Theory 49. 142–182 (1994).

Distribution of run times for Euclidean algorithm

The worst case run time for the Euclidean algorithm occurs when you give the algorithm a pair of consecutive Fibonacci numbers. The algorithm takes n steps to compute the greatest common divisor of Fn+1 and Fn+2.

The nth Fibonacci number is the nearest integer to φ^n/√5 where φ = (1 + √5)/2 is the golden ratio. You can take logs and solve for the largest n such that Fn is less than x, subtract 2, and have an upper bound on the number of steps necessary to find the gcd of two numbers less than x.

But what about the average run time, or the distribution of run times?

For large x, the distribution of the number of steps needed to calculate the gcd of two numbers 0 < a ≤ b ≤ x using the Euclidean algorithm is approximately normal with mean

12 log(2) log(x) / π².

See, for example, [1].

Let’s illustrate this with a simulation. Here’s Python code to return the number of steps used in computing gcd(a, b).

    def euclid(a, b):
        steps = 0
        while a != 0:
            a, b = b%a, a
            steps += 1
        return steps # gcd = b

I generated 10,000 pairs of random integers less than 2^63 (because the maximum signed 64-bit integer is 2^63 − 1) and plotted a histogram of the run times.

I got a nice bell curve with mean 36.3736. The theoretical mean is 0.8428 log(2^63) = 36.8021.
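Here’s a sketch of that simulation, reusing the euclid function above; the random number generation details are my guess at what the original run did.

    import random
    from math import log, pi
    from statistics import mean

    reps = 10_000
    steps = []
    for _ in range(reps):
        a, b = sorted(random.randrange(1, 2**63) for _ in range(2))
        steps.append(euclid(a, b))

    print(mean(steps))                       # sample mean, roughly 36
    print(12*log(2)*log(2**63)/pi**2)        # asymptotic mean, about 36.8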

[1] Doug Hensley. The Number of Steps in the Euclidean Algorithm. Journal of Number Theory 49. 142–182 (1994).

Adjunctions

The previous post looked at adjoints in the context of linear algebra. This post will look at adjoints in the context of category theory.

The adjoint of a linear operator T between inner product spaces V and W is the linear operator T* such that for all v in V and all w in W,

⟨ Tv, w ⟩W = ⟨ v, T*w ⟩V.

The subscripts are important because the left side is an inner product in W and the right side is an inner product in V. Think of the inner products as pairings. We pair together elements from two vector spaces. An adjunction in category theory will pair together objects in two categories.

Now suppose we have two categories, C and D. An adjunction between C and D is a pair of functors (F, G) with

F : D → C

and

G : C → D

such that

HomC(Fd, c) ≅ HomD(d, Gc)

for every object d in D and c in C. Here “Hom” means the set of morphisms from one object to another. The subscript reminds us what category the morphisms are in. The left hand side is a set of morphisms in C and the right hand side is a set of morphisms in D.

Note the typographical similarity between

⟨ Tv, w ⟩W = ⟨ v, T*w ⟩V

and

HomC(Fd, c) ≅ HomD(d, Gc).

The functors F and G are analogous to the operators T and T*. Sets of morphisms in different categories are analogous to inner products in different spaces.

Fine print

There’s a lot more going on in the definition of adjoint than is explicit above. In particular, the symbol “≅” is doing a lot of heavy lifting. It’s not saying that two sets are equal. They can’t be equal because they’re sets of different things, sets of morphisms in different categories. It means that there is a bijection between the two sets, and the bijection is “natural in d and c.” This in turn means that for every fixed d in D, there is a natural isomorphism between HomC(Fd, -) and HomD(d, G-), and for every fixed c in C there is a natural isomorphism between HomC(F-, c) and HomD(-, Gc). More on this here.

Direction

Note that there is a direction to adjunctions. In the case of linear operators, T* is the adjoint of T; it is not necessarily the case that T is the adjoint of T*. In Hilbert spaces, T** = T, but in Banach spaces the two operators might be different.

Similarly, F is the left adjoint of G, and G is the right adjoint of F. This is written F ⊣ G, which implies that direction matters. Adjunctions are from one category to another. It’s possible to have F ⊣ G ⊣ F, but that’s unusual.

Note also that in the case of an adjunction from a category C to a category D, the left adjoint goes from D to C, and the right adjoint goes from C to D. This is the opposite of what you might expect, and many online sources get it wrong.

Hand-wavy examples

Left and right adjoints are inverses in some abstract sense. Very often in an adjunction pair (F, G) the functor F adds structure and the functor G takes it away. F is “free” in some sense, and G is “forgetful” in some corresponding sense.

For example, let C be the category of topological spaces and continuous functions, and D the category of sets and functions. Let F be the functor that takes sets and gives them the discrete topology, making all functions continuous functions. And let G be the functor that takes a topological space and “forgets” its topology, regarding the space simply as a set and regarding its continuous functions simply as functions. Then F ⊣ G.

Not every adjunction pair follows the free/forgetful pattern, but most examples you’ll find in references do. As noted above, sometimes F ⊣ G ⊣ F, which could not happen if you always lose structure as you move down the chain of adjunctions.

Other metaphors for left and right adjoints are that left adjoints approximate something from “below” (i.e. with less structure) and right adjoints approximate from “above” (i.e. with more structure). Some say left adjoints are “optimistic” and right adjoints are “pessimistic.” (This reminds me of last weekend’s post about kernels and cokernels: kernels are degrees of freedom and cokernels are constraints.)


Transpose and Adjoint

The transpose of a matrix turns the matrix sideways. Suppose A is an m × n matrix with real number entries. Then the transpose Aᵀ is an n × m matrix, and the (i, j) element of Aᵀ is the (j, i) element of A. Very concrete.

The adjoint of a linear operator is a more abstract concept, though it’s closely related. The matrix Aᵀ is sometimes called the adjoint of A. That may be fine, or it may cause confusion. This post will define the adjoint in a more general context, then come back to the context of matrices.

This post and the next will be more abstract than usual. After indulging in a little pure math, I’ll return soon to more tangible topics such as Morse code and barbershop quartets.

Dual spaces

Before we can define adjoints, we need to define dual spaces.

Let V be a vector space over a field F. You can think of F as ℝ or ℂ. Then V* is the dual space of V, the space of linear functionals on V, i.e. the vector space of linear functions from V to F.

The distinction between a vector space and its dual seems artificial when the vector space is ℝn. The dual space of ℝn is isomorphic to ℝn, and so the distinction between them can seem pedantic. It’s easier to appreciate the distinction between V and V* when the two spaces are not isomorphic.

For example, let V be L^3(ℝ), the space of functions f such that |f|^3 has a finite Lebesgue integral. Then the dual space is L^{3/2}(ℝ). The difference between these spaces is not simply a matter of designation. There are functions f such that the integral of |f|^3 is finite but the integral of |f|^{3/2} is not, and vice versa.

Adjoints

The adjoint of a linear operator T : V → W is a linear operator T*: W* → V* where V* and W* are the dual spaces of V and W respectively. So T* takes a linear function from W to the field F, and returns a function from V to F. How does T* do this?

Given an element w* of W*, T*w* takes a vector v in V and maps it to F by

(T*w*)(v) = w*(Tv).

In other words, T* takes a functional w* on W and turns it into a function on V by mapping elements of V over to W and then letting w* act on them.

Note what this definition does not contain. There is no mention of inner products or bases or matrices.

The definition is valid over vector spaces that might not have an inner product or a basis. And this is not just a matter of perspective. It’s not as if our space has an inner product but we choose to ignore it; we might be working over spaces where it is not possible to define an inner product or a basis, such as ℓ^∞, the space of bounded sequences.

Since a matrix represents a linear operator with respect to some basis, you can’t speak of a matrix representation of an operator on a space with no basis.

Bracket notation

For a vector space V over a field F, denote a function ⟨ ·, · ⟩ that takes an element from V and an element from V* and returns an element of F by applying the latter to the former. That is, ⟨ v, v* ⟩ is defined to be the action of v* on v. This is not an inner product, but the notation is meant to suggest a connection to inner products.

With this notation, we have

⟨ Tv, w* ⟩W = ⟨ v, T*w* ⟩V

for all v in V and for all w* in W* by definition. This is the definition of T* in different notation. The subscripts on the brackets are meant to remind us that the left side of the equation is an element of F obtained by applying an element of W* to an element of W, while the right side is an element of F obtained by applying an element of V* to an element of V.

Inner products

The development of adjoint above emphasized that there is not necessarily an inner product in sight. But if there are inner products on V and W, then we can turn an element v of V into an element of V* by associating v with ⟨ ·, v ⟩, where now the brackets do denote an inner product.

Now we can write the definition of adjoint as

⟨ Tv, w ⟩W = ⟨ v, T*w ⟩V

for all v in V and for all w in W. This definition is legitimate, but it’s not natural in the technical sense that it depends on our choices of inner products and not just on the operator T and the spaces V and W. If we chose different inner products on V and W then the definition of T* changes as well.

Back to matrices

We have defined the adjoint of a linear operator in a very general setting where there may be no matrices in sight. But now let’s look at the case of T : V → W where V and W are finite dimensional vector spaces, either over ℝ or ℂ. (The difference between ℝ and ℂ will matter.) And let’s define inner products on V and W. This is always possible because they are finite dimensional.

How does a matrix representation of T* correspond to a matrix representation of T?

Real vector spaces

Suppose V and W are real vector spaces and A is a matrix representation of T : V → W with respect to some choice of basis on each space. Suppose also that the bases for V* and W* are given by the duals of the bases for V and W. Then the matrix representation of T* is the transpose of A. You can verify this by showing that

⟨ Av, w ⟩W = ⟨ v, Aᵀw ⟩V

for all v in V and for all w in W.

The adjoint of A is simply the transpose of A, subject to our choice of bases and inner products.

Complex vector spaces

Now consider the case where V and W are vector spaces over the complex numbers. Everything works as above, with one wrinkle. If A is the representation of T : V → W with respect to a given basis, and we choose bases for V* and W* as above, then the conjugate of Aᵀ is the matrix representation of T*. The adjoint of A is A*, the conjugate transpose of A. As before, you can verify this by showing that

⟨ Av, w ⟩W = ⟨ v, A*w ⟩V

We have to take the conjugate of A because the inner product in the complex case requires taking a conjugate of one side.
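As a quick numerical illustration (my sketch, not part of the derivation above), we can check the defining identity for a random complex matrix, where inner(x, y) conjugates its second argument.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 3, 4
    A = rng.normal(size=(m, n)) + 1j*rng.normal(size=(m, n))   # represents T : V -> W
    v = rng.normal(size=n) + 1j*rng.normal(size=n)
    w = rng.normal(size=m) + 1j*rng.normal(size=m)

    def inner(x, y):
        # complex inner product <x, y> = sum_i x_i conj(y_i)
        return np.vdot(y, x)

    lhs = inner(A @ v, w)               # <Av, w> in W
    rhs = inner(v, A.conj().T @ w)      # <v, A*w> in V, with A* the conjugate transpose
    print(np.isclose(lhs, rhs))         # True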


Overtones and Barbershop Quartets

I’ve heard that barbershop quartets often sing the 7th in a dominant 7th a little flat in order to bring the note closer in tune with the overtone series. This post will quantify that assertion.

The overtones of a frequency f are 2f, 3f, 4f, 5f, etc. The overtone series is a Fourier series.

Here’s a rendering of the C below middle C and its overtones.

[Staff image of the chord, generated from the Lilypond code \score { \new Staff { \clef treble <c c' g' c'' e'' bes''>1 } \layout {} \midi {} }]

We perceive sound on a logarithmic scale. So although the overtone frequencies are evenly spaced, they sound like they’re getting closer together.

Overtones and equal temperament

Let’s look at the notes in the chord above and compare the frequencies between the overtone series and equal temperament tuning.

Let f be the frequency of the lowest note. The top four notes in this overtone series have frequencies 4f, 5f, 6f, and 7f. They form a C7 chord [1].

In equal temperament, these four notes have frequencies 2^(24/12) f, 2^(28/12) f, 2^(31/12) f, and 2^(34/12) f. This works out to 4, 5.0397, 5.9932, and 7.1272 times the fundamental frequency f.

The highest note, the B♭, is the furthest from its overtone counterpart. The frequency is higher than that of the corresponding overtone, so you’d need to perform it a little flatter to bring it in line with the overtone series. This is consistent with the claim at the top of the post.

Differences in cents

How far apart are 7f and 7.1272f in terms of cents, 100ths of a semitone?

The difference between two frequencies, f1 and f2, measured in cents is

1200 log2(f1 / f2).

To verify this, note that this says an octave equals 1200 cents, because log2 2 = 1.

So the difference between the B♭ in equal temperament and in the 7th note of the overtone series is 31 cents.

The difference between the E and the 5th overtone is 14 cents, and the difference between the G and the 6th overtone is only 2 cents.
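Here’s a short sketch of those three calculations.

    from math import log2

    harmonics = [4, 5, 6, 7]                            # overtones forming the C7 chord
    tempered  = [2**(k/12) for k in (24, 28, 31, 34)]   # equal temperament ratios

    for h, t in zip(harmonics, tempered):
        cents = 1200*log2(t/h)
        print(f"harmonic {h}: equal temperament is {cents:+.0f} cents away")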


[1] The dominant 7th chord gets its name from two things. It’s called “dominant” because it’s often found on the dominant (i.e. V) chord of a scale, and it’s called a 7th chord because it’s made from the 1st, 3rd, 5th, and 7th notes of a scale. It’s a coincidence in the example above that the 7th of the chord is also the 7th overtone. These are two different uses of the number 7 that happen to coincide.

Equipentatonic scale

I ran across a video that played around with the equipentatonic scale [1]. Instead of dividing the octave into 12 equal parts, as is most common in Western music, you divide the octave into 5 equal parts. Each note in the equipentatonic scale has a frequency 2^(1/5) times its predecessor.

The equipentatonic scale is used in several non-Western music systems. For example, the Javanese slendro tuning system is equipentatonic, as is Babanda music from Uganda.

In the key of C, the nearest approximants of the notes in the equipentatonic scale are C, D, F, G, A#.

[Image: approximate equipentatonic scale notated on a staff]

In the image above [2], the D is denoted as half sharp, 50 cents higher than D. (A cent is 1/100 of a half step.) The actual pitch is a D plus 40 cents, so the half sharp is closer, but still not exactly right.

The F should be 20 cents lower, the G should be 20 cents higher, and the A half-sharp should be 10 cents higher.
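Here’s a sketch of where these offsets come from, measuring each equipentatonic degree against the nearest note of the chromatic scale. The last degree comes out 40 cents below A#, which is the same thing as 10 cents above A half-sharp.

    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

    for j in range(5):
        cents = 1200*j/5                 # equipentatonic degree, in cents above C
        k = round(cents/100)             # nearest chromatic (12-TET) degree
        print(f"{names[k % 12]:>2}: {cents - 100*k:+.0f} cents")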

Notation

The conventional chromatic scale is denoted 12-TET, an abbreviation for 12 tone equal temperament. In general n-TET denotes the scale that results from dividing the octave into n parts. The previous discussion was looking at how 5-TET aligns with 12-TET.

A step in 5-TET corresponds to 2.4 steps in 12-TET. This is approximately 5 steps in 24-TET, the scale we’d get by adding quarter tones between all the notes in the chromatic scale.

Math

When we talk of dividing the octave evenly into n parts, we mean evenly on a logarithmic scale because we perceive music on a log scale.

The notation will be a little cleaner if we start counting from 0. Then the kth note of the n-TET scale is proportional to 2^(k/n).

The proportionality constant is the starting pitch. So if we start on middle C, the frequencies are 262 × 2^(k/n) Hz. The nth frequency is twice the starting frequency, i.e. an octave higher.

We can measure how well an m-TET scale can be approximated by notes from an n-TET scale with the following function:

d(m, n) = \max_j \min_k \left|\frac{j}{m} - \frac{k}{n} \right|

Note that this function is asymmetric: d(m, n) does not necessarily equal d(n, m). For example, d(12, 24) = 0 because a quarter tone scale contains exact matches for every note in a semitone scale. But d(24, 12) = 1/24 because the quarter tone scale contains notes in between the notes of the semitone scale.

The equipentatonic scale lines up better with the standard chromatic scale than would a 7-note scale or an 11-note scale. That is, d(5, 12) is smaller than d(7, 12) or d(11, 12). Something similar holds if we replace 12 with 24: d(5, 24) is smaller than d(m, 24) for any m > 1 that is relatively prime to 24.
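Here’s a sketch of this function in Python, taking j and k over one octave, along with a check of the claims above.

    def d(m, n):
        # worst-case distance from an m-TET note to the nearest n-TET note,
        # measured as a fraction of an octave (on the log scale)
        return max(min(abs(j/m - k/n) for k in range(n + 1)) for j in range(m + 1))

    print(d(12, 24), d(24, 12))           # 0.0 and 1/24
    print(d(5, 12), d(7, 12), d(11, 12))  # 5-TET fits 12-TET better than 7- or 11-TET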


[1] The video first presents 10-TET and then defines 5-TET as taking every other note from this scale.

[2] The image was created with the following Lilypond code.

\score {
  \new Staff {
    \relative c' {
      c1 dih f g aih c \bar "|."
    }
  }
  \layout {
    \context {
      \Staff
      \remove "Bar_engraver"
      \remove "Time_signature_engraver"
    }
  }
}

The multiple coupon collector problem

I’ve written about the Coupon Collector Problem and variations several times, most recently here.

Brian Beckman left a comment linking to an article he wrote, which in turn links to a paper on the Double Dixie Cup Problem [1].

The idea behind the Coupon Collector Problem is to estimate how long it will take to obtain at least one of each of n coupons when drawing coupons at random with replacement from a set of n.

The Double Dixie Cup Problem replaces coupons with dixie cups, and it asks how many cups you’d expect to have to collect until you have two of each cup.

The authors in [1] answer the more general question of how long to collect m of each cup, so m = 1 reduces to the Coupon Collector Problem. The punchline is Theorem 2 from the paper: the expected number of cups (or coupons) is

n \left( \log n + (m-1) \log \log n + C_m + o(1) \right)

for fixed m as n → ∞.

Here Cm is a constant that depends on m (but not on n) and o(1) is a term that goes to zero as n → ∞.

Since log log n is much smaller than log n when n is large, this says that collecting multiples of each coupon doesn’t take much longer, on average, than collecting one of each coupon. It takes substantially less time than playing the original coupon collector game m times.
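Here’s a quick simulation sketch comparing the m = 1 and m = 2 cases for a moderate n. The constant Cm and the o(1) term are ignored, so only the leading behavior is comparable.

    import random
    from math import log
    from statistics import mean

    def draws_needed(n, m):
        # draw coupons uniformly until every one of the n types has appeared m times
        counts = [0]*n
        complete = 0                     # number of types that have reached m copies
        draws = 0
        while complete < n:
            i = random.randrange(n)
            counts[i] += 1
            if counts[i] == m:
                complete += 1
            draws += 1
        return draws

    n, reps = 100, 1000
    for m in (1, 2):
        sim = mean(draws_needed(n, m) for _ in range(reps))
        leading = n*(log(n) + (m - 1)*log(log(n)))
        print(m, sim, leading)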


[1] Donald J. Newman and Lawrence Shepp. The Double Dixie Cup Problem. The American Mathematical Monthly, Vol. 67, No. 1 (Jan., 1960), pp. 58–61.

Morse code and the limits of human perception

Musician Adam Neely made a video asking What is the fastest music humanly possible? He emphasizes that he means the fastest music possible to hear, not the fastest to perform.

[Image: screenshot of the Bolton paper]

The video cites a psychology article [1] from 1894 that found that most people can reliably distinguish an inter-onset interval (time between notes) of 100 ms [2]. It also gives examples of music faster than this, such as a performance of Vivaldi with an inter-onset interval of 83 ms [3]. The limit seems to be greater than 50 ms because a pulse train with an inter-onset interval of 50 ms starts to sound like a 20 Hz pitch.

People are able to receive Morse code faster than this implies is possible. We will explain how this works, but first we need to convert words per minute to inter-onset interval length.

Morse code timing

Morse code is made of dots and dashes, but it is also made of spaces of three different lengths: the space between the dots and dashes representing a single letter, the space between letters, and the space between words.

According to an International Telecommunication Union standard:

  • A dash is equal to three dots.
  • The space between the signals forming the same letter is equal to one dot.
  • The space between two letters is equal to three dots.
  • The space between two words is equal to seven dots.

The same timing is referred to as standard in a US Army manual from 1957.

Notice that all the numbers above are odd. Since a dot or dash is always followed by a space, the duration of a dot or dash and its trailing space is always an even multiple of the duration of a dot.

If we think of a dot as a sixteenth note, Morse code is made of notes that are either sixteenth notes or three sixteenth notes tied together. Rests are equal to one, three, or seven sixteenths, and notes and rests must alternate. All notes start on an eighth note boundary, i.e. either on a down beat or an up beat but not in between.

Words per minute

Morse code speed is measured in words per minute. But what exactly is a “word”? Words have a variable number of letters, and even words with the same number of letters can have very different durations in Morse code.

The most common standard is to use PARIS as the prototypical word. Ten words per minute, for example, means that dots and dashes are coming at you as fast as if someone were sending the word PARIS ten times per minute. Here’s a visualization of the code for PARIS with each square representing the duration of a dot.

■□■■■□■■■□■□□□■□■■■□□□■□■■■□■□□□■□■□□□■□■□■□□□□□□□

This has the duration of 50 dots.
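Here’s a sketch of how the diagram above can be generated from the timing rules, using the standard Morse code for the letters of PARIS.

    CODE = {"P": ".--.", "A": ".-", "R": ".-.", "I": "..", "S": "..."}

    def word_units(word):
        # build the on/off pattern for one word, one character per dot duration
        units = ""
        for letter in word:
            for symbol in CODE[letter]:
                units += "■" if symbol == "." else "■■■"   # dot = 1 unit, dash = 3
                units += "□"                               # space within a letter
            units = units[:-1] + "□□□"                     # space between letters
        return units[:-3] + "□□□□□□□"                      # space between words

    pattern = word_units("PARIS")
    print(pattern)
    print(len(pattern))    # 50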

How does this relate to inter-onset interval? If each duration of a dot is an interval, then n words per minute corresponds to 50n intervals per minute, or 60/(50n) = 1.2/n seconds per interval.

This would imply that 12 wpm would correspond to an inter-onset interval of around 100 ms, pushing the limit of perception. But 12 wpm is a beginner speed. It’s not uncommon for people to receive Morse code at 30 wpm. The world record, set by Theodore Roosevelt McElroy in 1939, is 75.2 wpm.
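Here’s the conversion spelled out, treating one dot as one interval.

    for wpm in (12, 20, 30, 75.2):
        ioi_ms = 1000 * 1.2/wpm          # 60 s / (50 dots per word * wpm)
        print(f"{wpm} wpm -> {ioi_ms:.0f} ms per dot")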

What’s going on?

In the preceding section I assumed a dot is an interval when calculating inter-onset interval length. In musical terms, this is saying a sixteenth note is an interval. But maybe we should count eighth notes as intervals. As noted before, every “note” (dot or dash) starts on a down beat or up beat. Still, that would say 20 wpm is pushing the limit of perception, and we know people can listen faster than that.

You don’t have to hear with the precision of the diagram above in order to recognize letters. And you have context clues. Maybe you can’t hear the difference between “E E E” and “O”, but in ordinary prose the latter is far more likely.

[Image: Morse code timing for E E E versus O]

At some skill level people quit hearing individual letters and start hearing words, much like experienced readers see words rather than letters. I’ve heard that this transition happens somewhere between 20 wpm and 30 wpm. That would be consistent with the estimate above that 20 wpm is pushing the limit of perception letter by letter. But 30 words per minute is doable. It’s impressive, but not unheard of.

What I find hard to believe is that there were intelligence officers, such as Terry Jackson, who could copy encrypted text at 50 wpm. Here a “word” is a five-letter code group. There are millions of possible code groups [4], all equally likely, and so it would be impossible to become familiar with particular code groups the way one can become familiar with common words. Maybe they learned to hear pairs or triples of letters.


[1] Thaddeus L. Bolton. Rhythm. The American Journal of Psychology. Vol. VI. No. 2. January, 1894. Available here.

[2] Interonset is not commonly hyphenated, but I’ve hyphenated it here for clarity.

[3] The movement Summer from Vivaldi’s The Four Seasons performed at 180 bpm which corresponds to 720 sixteenth notes per minute, each 83 ms long.

[4] If a code group consisted entirely of English letters, there are 26^5 = 11,881,376 possible groups. If a code group can contain digits as well, there would be 36^5 = 60,466,176 possible groups.