John D. Cook
Applied Mathematics Consulting

Sine series for a sine
Tue, 07 Apr 2020 23:31:53 +0000

The Fourier series of an odd function only has sine terms—all the cosine coefficients are zero—and so the Fourier series is a sine series.

What is the sine series for a sine function? If the frequency is an integer, then the sine series is just the function itself. For example, the sine series for sin(5x) is just sin(5x). But what if the frequency is not an integer?

For an odd function f on [-π, π] we have

f(x) = \sum_{n=1}^\infty c_n \sin(nx)

where the coefficients are given by

c_n = \frac{1}{\pi} \int_{-\pi}^\pi f(x) \sin(nx)\, dx

So if λ is not an integer, the sine series coefficients for sin(λx) are given by

c_n = 2\sin(\lambda \pi) (-1)^n \,\frac{ n}{\pi(\lambda^2 - n^2)}

The series converges slowly since the coefficients are O(1/n).

For example, here are the first 15 coefficients for the sine series for sin(1.6x).
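A short sketch of how such a table can be produced from the coefficient formula above (the function name is mine):

```python
import numpy as np

def sine_series_coeffs(lam, N):
    """Coefficients c_n, n = 1..N, of the sine series of sin(lam*x) on [-pi, pi]."""
    n = np.arange(1, N + 1)
    return 2*np.sin(lam*np.pi)*(-1)**n * n / (np.pi*(lam**2 - n**2))

coeffs = sine_series_coeffs(1.6, 15)
```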

And here is the corresponding plot for sin(2.9x).

As you might expect, the coefficient of sin(3x) is nearly 1, because 2.9 is nearly 3. What you might not expect is that the remaining coefficients are fairly large.

More posts on Fourier series

Two meanings of QR code
Tue, 07 Apr 2020 14:06:38 +0000

“QR code” can mean a couple of different things. There is a connection between these two, though that’s not at all obvious.

What almost everyone thinks of as a QR code is a quick response code, a grid of black and white squares that encode some data. For example, the QR code below contains my contact info.

There’s also something in algebraic coding theory called a QR code, a quadratic residue code. These are error-correcting codes that are related to whether numbers are squares or not modulo a prime.
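As a tiny illustration of the underlying notion of a quadratic residue (just the notion, not the error-correcting codes built from it):

```python
def quadratic_residues(p):
    """Nonzero squares mod a prime p."""
    return sorted({(x*x) % p for x in range(1, p)})

quadratic_residues(7)   # [1, 2, 4]; so 3, 5, and 6 are non-residues mod 7
```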

The connection between quick response codes and quadratic residue codes is that both involve error-correcting codes. However, quick response codes use Reed-Solomon codes for error correction, not quadratic residue codes. Reed-Solomon codes are robust to long sequences of error, which is important for quick response codes since, for example, a row of the image might be cut off. It would be cute if QR (quick response) codes used QR (quadratic residue) codes, but alas they don’t.

More on quadratic residues

More on quick response codes

More on quadratic residue codes

Center of mass and vectorization
Sun, 05 Apr 2020 19:38:34 +0000

Para Parasolian left a comment on my post about computing the area of a polygon, suggesting that I “say something similar about computing the centroid of a polygon using a similar formula.” This post will do that, and at the same time discuss vectorization.


We start by listing the vertices starting anywhere and moving counterclockwise around the polygon:

(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)

It will simplify notation below if we duplicate the last point:

(x_0, y_0) = (x_n, y_n)

The formula for the centroid depends on the formula for the area, where the area of the polygon is

A = \frac{1}{2} \sum_{i=0}^{n-1} (x_i y_{i+1} - x_{i+1}y_i)

Hadamard product and dot product

We can express the area formula more compactly using vector notation. This will simplify the centroid formulas as well. To do so we need to define two ways of multiplying vectors: the Hadamard product and the dot product.

The Hadamard product of two vectors is just their componentwise product. This is a common operation in R or Python, but less common in formal mathematics. The dot product is the sum of the components of the Hadamard product.

If x and y are vectors, the notation xy with no further explanation probably means the dot product of x and y if you’re reading a math or physics book. In R or Python, x*y is the Hadamard product of the vectors.

Here we will use a circle ∘ for Hadamard product and a dot · for inner product.

Now let x with no subscript be the vector

x = (x_0, x_1, x_2, \cdots, x_{n-1})

and let x' be the same vector but shifted

x' = (x_1, x_2, x_3, \cdots, x_n)

We define y and y' analogously. Then the area is

A = (x\cdot y' - x' \cdot y) / 2

Centroid formula

The formula for the centroid in summation form is

\begin{align*} C_x &= \frac{1}{6A} \sum_{i=0}^{n-1} (x_i + x_{i+1})(x_i y_{i+1} - x_{i+1}y_i) \\ C_y &= \frac{1}{6A} \sum_{i=0}^{n-1} (y_i + y_{i+1})(x_i y_{i+1} - x_{i+1}y_i) \end{align*}

where A is the area given above.

We can write this in vector form as

\begin{align*} C_x &= \frac{1}{6A} (x + x') \cdot (x\circ y' - x' \circ y) \\ C_y &= \frac{1}{6A} (y + y') \cdot (x\circ y' - x' \circ y) \\ \end{align*}

You could evaluate v = x ∘ y' - x' ∘ y first. Then A is half the dot product of v with a vector of all 1’s, and the centroid x and y coordinates are the inner products of v with x + x' and y + y' respectively, divided by 6A.
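Here is one way the vectorized version might look in NumPy (the function name is mine); np.roll produces the shifted vectors:

```python
import numpy as np

def polygon_centroid(xs, ys):
    """Centroid of a polygon whose vertices are listed once, counterclockwise."""
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    xp = np.roll(x, -1)            # the shifted vector x'
    yp = np.roll(y, -1)            # the shifted vector y'
    v = x*yp - xp*y                # x ∘ y' - x' ∘ y
    A = v.sum()/2                  # (x · y' - x' · y)/2
    return (x + xp) @ v / (6*A), (y + yp) @ v / (6*A)

polygon_centroid([0, 1, 1, 0], [0, 0, 1, 1])   # unit square -> (0.5, 0.5)
```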

Making an invertible function out of non-invertible parts
Sat, 04 Apr 2020 14:20:47 +0000

How can you make an invertible function out of non-invertible parts? Why would you want to?

Encryption functions must be invertible. If the intended recipient can’t decrypt the message then the encryption method is useless.

Of course you want an encryption function to be really hard to invert without the key. It’s hard to think all at once of a function that’s really hard to invert. It’s easier to think of small components that are kinda hard to invert. Ideally you can iterate steps that are kinda hard to invert and create a composite that’s really hard to invert.

So how do we come up with components that are kinda hard to invert? One way is to make small components that are non-linear, and that are in fact impossible to invert [1]. But how can you use functions that are impossible to invert to create functions that are possible to invert? It doesn’t seem like this could be done, but it can. Feistel networks, named after cryptographer Horst Feistel, provide a framework for doing just that.

Many block encryption schemes are based on a Feistel network or a modified Feistel network: DES, Lucifer, GOST, Twofish, etc.

The basic idea of Feistel networks is so simple that it may go by too fast the first time you see it.

You take a block with an even number of bits and split it into two sub-blocks, the left half L and the right half R. The nth round of a Feistel cipher creates new left and right blocks from the left and right blocks of the previous round by

\begin{align*} L_n &= R_{n-1} \\ R_n &= L_{n-1} \oplus f(R_{n-1}, K_n) \end{align*}

Here ⊕ is bitwise XOR (exclusive or) and f(Rn-1, Kn) is any function of the previous right sub-block and the key for the nth round. The function f need not be invertible. It could be a hash function. It could even be a constant, crushing all input down to a single value. It is one of the non-invertible parts that the system is made of.

Why is this invertible? Suppose you have Ln and Rn. How could you recover Ln-1 and Rn-1?

Recovering Rn-1 is trivial: it’s just Ln. How do you recover Ln-1? You know Rn-1 and the key Kn and so you can compute

R_n \oplus f(R_{n-1}, K_n) = L_{n-1} \oplus f(R_{n-1}, K_n)\oplus f(R_{n-1}, K_n) = L_{n-1}

The main idea is that XOR is its own inverse. No matter what f(Rn-1, Kn) is, if you XOR it with anything twice, you get that thing back.

At each round, only one sub-block from the previous round is encrypted. But since the roles of left and right alternate each time, the block that was left alone at one round will be encrypted the next round.

When you apply several rounds of a Feistel network, the output of the last round is the encrypted block. To decrypt the block, the receiver reverses each of the rounds in the reverse order.
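Here is a toy sketch of the idea in Python. Everything in it is made up for illustration: 8-bit halves, arbitrary round keys, and a round function f that is deliberately not invertible.

```python
def feistel_round(L, R, k, f):
    """One Feistel round: L_n = R_{n-1}, R_n = L_{n-1} XOR f(R_{n-1}, k)."""
    return R, L ^ f(R, k)

def feistel_unround(L, R, k, f):
    """Invert one round, using only the fact that XOR is its own inverse."""
    R_prev = L
    L_prev = R ^ f(R_prev, k)
    return L_prev, R_prev

def f(r, k):
    # Arbitrary, non-invertible round function on 8-bit values.
    return ((31*r + k)**2) & 0xFF

keys = [3, 141, 59, 26]
L, R = 0x12, 0x34
for k in keys:                 # encrypt: apply the rounds
    L, R = feistel_round(L, R, k, f)
for k in reversed(keys):       # decrypt: undo the rounds in reverse order
    L, R = feistel_unround(L, R, k, f)
# (L, R) is back to (0x12, 0x34)
```

Note that decryption never inverts f; it only recomputes it. That is why f is free to be non-invertible.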

A sketch of DES

The DES (Data Encryption Standard) algorithm may be the best-known application of Feistel networks. It operates on 64-bit blocks of data and carries out 16 rounds. It takes a 56-bit key [2] and derives from it a different 48-bit key for each of the 16 rounds. In the context of DES, the function f described above takes 32 bits of data and a 48-bit key and returns 32 bits. This function has four steps.

  1. Expand the 32 bits of input to 48 bits by duplicating some of the bits.
  2. XOR with the key for that round.
  3. Divide the 48 bits into eight groups of 6 bits and apply an S box to each group.
  4. Permute the result.

The S boxes are nonlinear functions that map 6 bits to 4 bits. The criteria for designing the S boxes were classified when DES became a standard, and there was speculation that the NSA had tweaked the boxes to make them less secure. In fact, the NSA tweaked the boxes to make them more secure. The S boxes were modified to make them more resistant to differential cryptanalysis, a technique that was not publicly known at the time.

More cryptography posts

[1] These functions are impossible to invert in the sense that two inputs may correspond to the same output; there’s no unique inverse. But they’re also computationally difficult to invert relative to their size: for a given output, it’s time consuming to find any or all corresponding inputs.

[2] When DES was designed in the 1970s, researchers objected that 56-bit keys were too small. That’s certainly the case now, and so DES is no longer secure. DES lives on as a component of Triple DES, which uses three 56-bit keys to produce 112-bit security. (Triple DES does not give 168 bits of security because it is vulnerable to a kind of meet-in-the-middle attack.)

Underestimating risk
Wed, 01 Apr 2020 13:52:28 +0000

When I hear that a system has a one in a trillion (1,000,000,000,000) chance of failure, I immediately translate that in my mind to “So, optimistically the system has a one in a million (1,000,000) chance of failure.”

Extremely small probabilities are suspicious because they often come from one of two errors:

  1. Wrongful assumption of independence
  2. A lack of imagination

Wrongfully assuming independence

The Sally Clark case is an infamous example of a woman’s life being ruined by a wrongful assumption of independence. She had two children die of what we would call in the US sudden infant death syndrome (SIDS) and what was called in her native UK “cot death.”

The courts reasoned that the probability of two children dying of SIDS was the square of the probability of one child dying of SIDS. The result, about one chance in 73,000,000, was deemed to be impossibly small, and Sally Clark was sent to prison for murder. She was released from prison years later, and drank herself to death.

Children born to the same parents and living in the same environment hardly have independent risks of anything. If the probability of losing one child to SIDS is 1/8500, the conditional probability of losing a sibling may be small, but surely not as small as 1/8500.

The Sally Clark case only assumed two events were independent. By naively assuming several events are independent, you can start with larger individual probabilities and end up with much smaller final probabilities.
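Two lines of arithmetic show how quickly naive independence assumptions compound (the 1/8500 figure is the one above; the five-event example is mine):

```python
p_one = 1/8500            # rough marginal probability of one SIDS death
naive_two = p_one**2      # wrongly assumes the two deaths are independent
# 1/naive_two is 72,250,000: roughly the "one in 73 million" quoted in court

p = 1/100
naive_five = p**5         # five 1-in-100 events "independently": 1 in 10 billion
```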

As a rule of thumb, if a probability uses a number name you’re not familiar with (such as septillion below) then there’s reason to be skeptical.

Lack of imagination

It is possible to legitimately calculate extremely small probabilities, but often this is by calculating the probability of the wrong thing, by defining the problem too narrowly. If a casino owner believes that the biggest risk to his business is dealing consecutive royal flushes, he’s not considering the risk of a tiger mauling a performer.

A classic example of a lack of imagination comes from cryptography. Amateurs who design encryption systems assume that an attacker must approach a system the same way they do. For example, suppose I create a simple substitution cipher by randomly permuting the letters of the English alphabet. There are over 400 septillion (4 × 10^26) permutations of 26 letters, and so the chances of an attacker guessing my particular permutation are unfathomably small. And yet simple substitution ciphers are so easy to break that they’re included in popular books of puzzles. Cryptogram solvers do not randomly permute the alphabet until something works.

Professional cryptographers are not nearly so naive, but they have to constantly be on guard for more subtle forms of the same fallacy. If you create a cryptosystem by multiplying large prime numbers together, it may appear that an attacker would have to factor that product. That’s the idea behind RSA encryption. But in practice there are many cases where this is not necessary due to poor implementation. Here are three examples.

If the calculated probability of failure is infinitesimally small, the calculation may be correct but only address one possible failure mode, and not the most likely failure mode at that.

More posts on risk management

Reasoning under uncertainty
Mon, 30 Mar 2020 20:12:54 +0000

Reasoning under uncertainty sounds intriguing. Brings up images of logic, philosophy, and artificial intelligence.

Statistics sounds boring. Brings up images of tedious, opaque calculations followed by looking up some number in a table.

But statistics is all about reasoning under uncertainty. Many people get through required courses in statistics without ever hearing that, or at least without ever appreciating that. Rote calculations are easy to teach and easy to grade, so introductory courses focus on that.

Statistics is more interesting than it may sound. And reasoning under uncertainty is less glamorous than it may sound.

Lee distance: codes and music
Sun, 29 Mar 2020 18:47:38 +0000

The Hamming distance between two sequences of symbols is the number of places in which they differ. For example, the Hamming distance between the words “hamming” and “farming” is 2, because the two words differ in their first and third letters.

Hamming distance is natural when comparing sequences of bits because bits are either the same or different. But when the sequence of symbols comes from a larger alphabet, Hamming distance may not be the most appropriate metric.

Here “alphabet” is usually used figuratively to mean the set of available symbols, but it could be a literal alphabet. As English words, “hamming” seems closer to “hanning” than to “farming” because m is closer to n, both in the alphabet and phonetically, than it is to f or r. [1]

The Lee distance between two sequences x1 x2 … xn and y1 y2 … yn of symbols from an alphabet of size q is defined as

\sum_{i=1}^n \min\{ |x_i - y_i|, q - |x_i - y_i| \}

So if we use distance in the English alphabet, the words “hamming” and “hanning” are a Lee distance of 1 + 1 = 2 apart, while “hamming” and “farming” are a Lee distance of 2 + 5 = 7 apart.

Coding theory uses both Hamming distance and Lee distance. In some contexts, it only matters whether symbols are different, and in other contexts it matters how different they are. If q = 2 or 3, Hamming distance and Lee distance coincide. If you’re working over an alphabet of size q > 3 and symbols are more likely to be corrupted into nearby symbols, Lee distance is the appropriate metric. If all corruptions are equally likely, then Hamming distance is more appropriate.
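Both distances are one-liners in Python (the helper names are mine):

```python
def hamming(xs, ys):
    """Number of positions where the sequences differ."""
    return sum(x != y for x, y in zip(xs, ys))

def lee(xs, ys, q):
    """Sum of min(|x - y|, q - |x - y|) over corresponding symbols."""
    return sum(min(abs(x - y), q - abs(x - y)) for x, y in zip(xs, ys))

def letters(word):
    """Map a word to numbers 0..25."""
    return [ord(c) - ord('a') for c in word]

hamming(letters("hamming"), letters("farming"))   # 2
lee(letters("hamming"), letters("farming"), 26)   # 7
lee(letters("hamming"), letters("hanning"), 26)   # 2
```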

Application to music

Lee distance is natural in music since notes are like integers mod 12. Hence the circle of fifths.

My wife and I were discussing recently which of two songs was in a higher key. My wife is an alto and I’m a baritone, so we prefer lower keys. But if you transpose a song up so much that it’s comfortable to sing an octave lower, that’s good too.

If you’re comfortable singing in the key of C, then the key of D is two half-steps higher. But what about the key of A? You could think of it as 9 half-steps higher, or 3 half-steps lower. In the definition of Lee distance, measured in half-steps, the distance from C to D is

min{2, 12 – 2} = 2,

i.e. you could either go up two half-steps or down 10. Similarly the distance between C and A is

min{9, 12-9} = 3.

So you could think of the left side of the minimum in the definition of Lee distance as going up from x to y and the right side as going down from x to y.

Using Lee distance, the largest interval is the tritone, the interval from C to F#. It’s called the tritone because it is three whole steps. If C is your most comfortable key, F# would be your least comfortable key: the notes are as far away from your range as possible. Any higher up and they’d be closer because you could drop down an octave.

The tritone is like the hands of a clock at 6:00. The hour and minute hands are as far apart as possible. Just before 6:00 the hands are closer together on the left side of the clock and just after they are closer on the right side of the clock.

Related posts

[1] I bring up “Hanning” because Hamming and Hanning are often confused. In signal processing there is both a Hamming window and a Hanning window. The former is named after Richard Hamming and the latter after Julius von Hann. The name “Hanning window” rather than “Hann window” probably comes from the similarity with the Hamming window.

Conditional independence notation
Fri, 27 Mar 2020 15:28:27 +0000

Ten years ago I wrote a blog post that concludes with this observation:

The ideas of being relatively prime, independent, and perpendicular are all related, and so it makes sense to use a common symbol to denote each.

This post returns to that theme, particularly looking at independence of random variables.


Graham, Knuth, and Patashnik proposed using ⊥ for relatively prime numbers in their book Concrete Mathematics, at least by the second edition (1994). Maybe it was in their first edition (1988), but I don’t have that edition.

Philip Dawid proposed a similar symbol ⫫ for (conditionally) independent random variables in 1979 [1].

We write X ⫫ Y to denote that X and Y are independent.

As explained here, independent random variables really are orthogonal in some sense, so it’s a good notation.


The symbol ⫫ (Unicode 2AEB, DOUBLE TACK UP) may or may not show up in your browser; it’s an uncommon character and your font may not have a glyph for it.

There’s no command in basic LaTeX for the symbol. You can enter the Unicode character in XeTeX, and there are several other alternatives discussed here. A simple work-around is to use

    \perp\!\!\!\perp
This says to take two perpendicular symbols, and kern them together by inserting three negative spaces between them.

The package MnSymbol has a command \upmodels to produce ⫫. Why “upmodels”? Because it is a 90° counterclockwise rotation of the \models symbol ⊧ from logic.

To put a strike through ⫫ in LaTeX to denote dependence, you can use \nupmodels from the MnSymbol package, or if you’re not using a package you could use the following.

    \not\!\perp\!\!\!\perp
Graphoid axioms

As an example of where you might see the ⫫ symbol used for conditional independence, the table below gives the graphoid axioms for conditional independence. (They’re theorems, not axioms, but they’re called axioms because you could think of them as axioms for working with conditional independence at a higher level of abstraction.)

\begin{align*} \mbox{Symmetry: } & (X \perp\!\!\!\perp Y \mid Z) \implies (Y \perp\!\!\!\perp X \mid Z) \\ \mbox{Decomposition: } & (X \perp\!\!\!\perp YW \mid Z) \implies (X \perp\!\!\!\perp Y \mid Z) \\ \mbox{Weak union: } & (X \perp\!\!\!\perp YW \mid Z) \implies (X \perp\!\!\!\perp Y \mid ZW) \\ \mbox{Contraction: } & (X \perp\!\!\!\perp Y \mid Z) \,\wedge\, (X \perp\!\!\!\perp W \mid ZY)\implies (X \perp\!\!\!\perp YW \mid Z) \\ \mbox{Intersection: } & (X \perp\!\!\!\perp W \mid ZY) \,\wedge\, (X \perp\!\!\!\perp Y \mid ZW)\implies (X \perp\!\!\!\perp YW \mid Z) \\ \end{align*}

Note that the independence symbol ⫫ has higher precedence than the conditional symbol |. That is, X ⫫ Y | Z means X is independent of Y, once you condition on Z.

The axioms above are awfully dense, but they make sense when expanded into words. For example, the symmetry axiom says that if knowledge of Z makes Y irrelevant to X, it also makes X irrelevant to Y. The decomposition axiom says that if knowing Z makes the combination of Y and W irrelevant to X, then knowing Z makes Y alone irrelevant to X.

The intersection axiom requires strictly positive probability distributions, i.e. you can’t have events with probability zero.

More on conditional probability

[1] A. P. Dawid. Conditional Independence in Statistical Theory. Journal of the Royal Statistical Society, Series B (Methodological), Vol. 41, No. 1 (1979), pp. 1–31.

Three composition theorems for differential privacy
Wed, 25 Mar 2020 19:46:58 +0000

This is a brief post, bringing together three composition theorems for differential privacy.

  1. The composition of an ε1-differentially private algorithm and an ε2-differentially private algorithm is an (ε1 + ε2)-differentially private algorithm.
  2. The composition of an (ε1, δ1)-differentially private algorithm and an (ε2, δ2)-differentially private algorithm is an (ε1 + ε2, δ1 + δ2)-differentially private algorithm.
  3. The composition of an (α, ε1)-Rényi differentially private algorithm and an (α, ε2)-Rényi differentially private algorithm is an (α, ε1 + ε2)-Rényi differentially private algorithm.

The three composition rules can be summarized briefly as follows:

ε1 ∘ ε2 → (ε1 + ε2)
(ε1, δ1) ∘ (ε2, δ2) → (ε1 + ε2, δ1 + δ2)
(α, ε1) ∘ (α, ε2) → (α, ε1 + ε2)

What is the significance of these composition theorems? In short, ε-differential privacy and Rényi differential privacy compose as one would hope, but (ε, δ)-differential privacy does not.

The first form of differential privacy proposed was ε-differential privacy. It is relatively easy to interpret and composes nicely, but it can be too rigid.

If you have Gaussian noise, for example, you are led naturally to (ε, δ)-differential privacy. The δ term is hard to interpret. Roughly speaking, you could think of it as the probability that ε-differential privacy fails to hold. Unfortunately with (ε, δ)-differential privacy the epsilons add and so do the deltas. We would prefer that δ didn’t grow with composition.

Rényi differential privacy is a generalization of ε-differential privacy that uses a family of information measures indexed by α to measure the impact of a single row being or not being in a database. The case of α = ∞ corresponds to ε-differential privacy, but finite values of α tend to be less pessimistic. The nice thing about the composition theorem for Rényi differential privacy is that the α parameter doesn’t change, unlike the δ parameter in (ε, δ)-differential privacy.

Minimizing worst case error
Tue, 24 Mar 2020 17:48:25 +0000

It’s very satisfying to know that you have a good solution even under the worst circumstances. Worst-case thinking doesn’t have to be concerned with probabilities, with what is likely to happen, only with what could happen. But whenever you speak of what could happen, you have to limit your universe of possibilities.

Suppose you ask me to write a program that will compute the sine of a number. I come up with a Chebyshev approximation for the sine function over the interval [0, 2π] so that the maximum approximation error at any point in that interval is less than 10^-8. So I proudly announce that in the worst case the error is less than 10^-8.

Then you come back and ask “What if I enter a number larger than 2π?” Since my approximation is a polynomial, you can make it take on as large a value as you please by sticking in large enough values. But sine takes on values between -1 and 1. So the worst case error is unlimited.
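Both halves of this story are easy to check numerically. Here is a sketch in Python; the degree (20) and the test grid are my own choices, since nothing above pins them down:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Chebyshev interpolant of sine on [0, 2*pi]. Degree 20 is more than
# enough to bring the maximum error on the interval below 1e-8.
approx = Chebyshev.interpolate(np.sin, 20, domain=[0, 2*np.pi])

grid = np.linspace(0, 2*np.pi, 10001)
max_err = np.max(np.abs(approx(grid) - np.sin(grid)))  # well below 1e-8

# Outside the interval the polynomial is unbounded, but |sin| <= 1.
far_off = abs(approx(100.0))   # enormous
```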

So I go back and do some range reduction and I tell you that my program will be accurate to within 10^-8. And because I got burned last time by not specifying limits on the input, I make explicit my assumption that the input is a valid, finite IEEE 754 64-bit floating point number. Take that!

So then you come back and say “What if I make a mistake entering my number?” I object that this isn’t fair, but you say “Look, I want to minimize the worst case scenario, not just the worst scenario that fits into your frame of thinking.”

So then I rewrite my program to always return 0. That result can be off by as much as 1, but never more than that.

This is an artificial example, but it illustrates a practical point: worst-case scenarios minimize the worst outcome relative to some set of scenarios under consideration. With more imagination you can always come up with a bigger set. Maybe the initial set left out some important and feasible scenarios. Or maybe it was big enough and only left out far-fetched scenarios. It may be quite reasonable to exclude far-fetched scenarios, but when you do, you’re implicitly appealing to probability, because far-fetched means low probability.

Related post: Sine of a googol

Pecunia non olet
Tue, 24 Mar 2020 15:11:37 +0000

I’ve been rereading That Hideous Strength. I’m going through it slowly this time, paying attention to details I glossed over before.

For example, early in the book we’re told that the head of a college has the nickname N.O.

N.O., which stood for Non-Olet, was the nickname of Charles Place, the warden of Bracton.

The first time I read the novel I zoomed past this tidbit. “Must be some Latin thing.” This time I looked it up.

It is indeed a Latin thing. It’s a reference to “Pecunia non olet” which translates as “Money doesn’t stink.” The idea is that money is money, and it doesn’t matter if it comes from a distasteful source.

The phrase goes back to the tax paid by those who bought the contents of public urinals as a source of ammonia. When Emperor Vespasian’s son Titus complained about the disgusting nature of the urine tax, the emperor held up a gold coin and said “Pecunia non olet.”

We’re told that the warden was “an elderly civil servant,” not an academic, and that his biggest accomplishment was that he had written “a monumental report on National Sanitation.”

So the nickname N.O. works on several levels. It implies that he’s willing to take money wherever he can get it, and it’s an allusion to the fact that he’s more qualified to be a sanitation engineer than a college president. I suppose it also implies that he’s inclined to say “no” to everything except money.

More posts on Latin phrases

Simple clinical trial of four COVID-19 treatments
Mon, 23 Mar 2020 13:50:39 +0000

A story came out in Science yesterday saying the World Health Organization is launching a trial of what it believes are the four most promising treatments for COVID-19 (a.k.a. SARS-CoV-2, novel coronavirus, etc.)

The four treatment arms will be

  • Remdesivir
  • Chloroquine and hydroxychloroquine
  • Ritonavir + lopinavir
  • Ritonavir + lopinavir + interferon beta

plus standard of care as a control arm.

I find the design of this trial interesting. Clinical trials are often complex and slow. Given a choice in a crisis between ponderously designing the perfect clinical trial and flying by the seat of their pants, health officials would rightly choose the latter. On the other hand, it would obviously be good to know which of the proposed treatments is most effective. So this trial has to be a compromise.

The WHO realizes that the last thing front-line healthcare workers want right now is the added workload of conducting a typical clinical trial. So this trial, named SOLIDARITY, will be very simple to run. According to the Science article,

When a person with a confirmed case of COVID-19 is deemed eligible, the physician can enter the patient’s data into a WHO website, including any underlying condition that could change the course of the disease, such as diabetes or HIV infection. The participant has to sign an informed consent form that is scanned and sent to WHO electronically. After the physician states which drugs are available at his or her hospital, the website will randomize the patient to one of the drugs available or to the local standard care for COVID-19.

… Physicians will record the day the patient left the hospital or died, the duration of the hospital stay, and whether the patient required oxygen or ventilation, she says. “That’s all.”

That may sound a little complicated, but by clinical trial standards the SOLIDARITY trial is shockingly simple. Normally you would have countless detailed case report forms, adverse event reporting, etc.

The statistics of the trial will be simple on the front end but complicated on the back end. There’s no sophisticated algorithm assigning treatments, just a randomization between available treatment options, including standard of care. I don’t see how you could do anything else, but this will create headaches for the analysis.

Patients are randomized to available treatments—what else could you do? [1]—which means the treatment options vary by site and over time. The control arm, standard of care, also varies by site and could change over time as well.  Also, this trial is not double-blind. This is a trial optimized for the convenience of frontline workers, not for the convenience of statisticians.
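As a sketch of the idea (the names and structure here are mine, not the WHO’s): each site randomizes only among the arms it can actually offer, plus standard of care.

```python
import random

TREATMENT_ARMS = [
    "remdesivir",
    "chloroquine/hydroxychloroquine",
    "ritonavir + lopinavir",
    "ritonavir + lopinavir + interferon beta",
]

def assign_treatment(drugs_available, rng=random):
    """Randomize a patient among the arms available at this site.
    Standard of care is always an option, serving as the control."""
    arms = [a for a in TREATMENT_ARMS if a in drugs_available]
    return rng.choice(arms + ["standard of care"])

assign_treatment(["remdesivir", "ritonavir + lopinavir"])
```

Because the candidate list varies by site and over time, assignment probabilities differ across patients, which is exactly what complicates the back-end analysis.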

The SOLIDARITY trial will be adaptive in the sense that a DSMB will look at interim results and decide whether to drop treatment arms that appear to be under-performing. Ideally there would be objective algorithms for making these decisions, carefully designed and simulated in advance, but there’s no time for that. Better to start learning immediately than to spend six months painstakingly designing a trial. Even if we could somehow go back in time and start the design process six months ago, there could very well be contingencies that the designers couldn’t anticipate.

The SOLIDARITY trial is an expedient compromise, introducing a measure of scientific rigor when there isn’t time to be as rigorous as we’d like.

More clinical trial posts

[1] You could limit the trial to sites that have all four treatment options available, cutting off most potential sources of data. The data would not be representative of the world at large and accrual would be slow. Or you could wait until all four treatments were distributed to clinics around the world, but there’s no telling how long that would take.

Product of copulas
Mon, 23 Mar 2020 00:12:06 +0000

A few days ago I wrote a post about copulas and operations on them that have a group structure. Here’s another example of group structure for copulas. As in the previous post I’m just looking at two-dimensional copulas to keep things simple.

Given two copulas C1 and C2, you can define a sort of product between them by

(C_1 * C_2)(u,v) = \int_0^1 D_2C_1(u,t)\,\, D_1C_2(t,v) \,\, dt

Here Di is the partial derivative with respect to the ith variable.

The product of two copulas is another copula. This product is associative but not commutative. There is an identity element, so copulas with this product form a semigroup.

The identity element is the copula

M(u,v) = \min\{u, v\}

that is,

M * C = C * M = C

for any copula C.

The copula M is important because it is the upper bound for the Fréchet-Hoeffding bounds: For any copula C,

\max\{u+v-1, 0\}\leq C(u,v) \leq \min\{u, v\}

There is also a sort of null element for our semigroup, and that is the independence copula

\Pi(u,v) = uv

It’s called the independence copula because it’s the copula for two independent random variables: their joint CDF is the product of their individual CDFs. It acts like a null element because

\Pi * C = C * \Pi = \Pi

This tells us we have a semigroup and not a group: the independence copula cannot have an inverse.
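These identities are easy to check numerically. Below is a sketch in Python with the partial derivatives of Π and M coded by hand; the midpoint quadrature and grid size are my choices.

```python
import numpy as np

# Partial derivatives of Pi(u,v) = u*v and M(u,v) = min(u,v).
def D2_Pi(u, t): return u                        # dPi/dv at (u, t)
def D1_Pi(t, v): return v                        # dPi/du at (t, v)
def D2_M(u, t):  return 1.0 if t < u else 0.0    # dM/dv at (u, t)
def D1_M(t, v):  return 1.0 if t < v else 0.0    # dM/du at (t, v)

def copula_product(D2C1, D1C2, u, v, n=20000):
    """(C1 * C2)(u, v) computed by the midpoint rule on [0, 1]."""
    t = (np.arange(n) + 0.5)/n
    return sum(D2C1(u, ti)*D1C2(ti, v) for ti in t)/n

u, v = 0.3, 0.7
copula_product(D2_M, D1_Pi, u, v)   # M * Pi ~ 0.21 = Pi(u, v): M is the identity
copula_product(D2_Pi, D1_M, u, v)   # Pi * M ~ 0.21 = Pi(u, v): Pi is the null element
```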

Reference: Roger B. Nelsen. An Introduction to Copulas.

How to Set Num Lock on permanently
Sat, 21 Mar 2020 13:50:28 +0000

When I use my Windows laptop, I’m always accidentally brushing against the Num Lock key. I suppose it’s because the keys are so flat; I never have this problem on a desktop.

I thought there must be some way to set it so that it’s always on, so I searched for it. First I found articles on how to turn Num Lock on at startup, but that’s not my problem. The key already is on when I start up, but it doesn’t stay on when I brush against it.

Next I found articles saying to set a certain registry key. That didn’t work for me, and apparently a lot of other people have had the same experience.

Some articles say to edit your BIOS. My understanding is that you can edit the BIOS to permanently disable the key, but I wanted to permanently enable the key.

Here’s what did work: give AutoHotKey the command

    SetNumLockState, AlwaysOn

I haven’t used AutoHotKey before. I’ve heard good things about it, but it seems like it can be quite a deep rabbit hole. I intend to look into it a little, but for now I just want my Num Lock to stay on.

After you install AutoHotKey and run it, you get its help browser, not the app per se, and it’s not immediately obvious how to run the code above. You need to save the line of code to a file whose name ends in .ahk, such as numlock.ahk. If you double-click on that file, it will run the AutoHotKey script. To make it run automatically when your computer starts up, put the script in your Startup folder. This is probably

    C:\Users\...\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup

You can bring up the Startup folder by typing Windows key + R, then shell:startup.

Related post: Remapping the Caps Lock key

New Asymptotic function in Mathematica 12.1 Sat, 21 Mar 2020 01:59:12 +0000 One of the new features in Mathematica 12.1 is the function Asymptotic. Here’s a quick example of using it.

Here’s an asymptotic series for the log of the gamma function I wrote about here.

\log \Gamma(z) \sim (z - \frac{1}{2}) \log z - z + \frac{1}{2} \log(2\pi) + \frac{1}{12z} - \frac{1}{360z^3} + \cdots

If we ask Mathematica

    Asymptotic[LogGamma[z], z -> Infinity]

we get simply the first term:

z Log[z]

But we can set the argument SeriesTermGoal to tell it we’d like more terms. For example

    Asymptotic[LogGamma[z], z -> Infinity, SeriesTermGoal -> 4]

-1/(360*z^3) + 1/(12*z) - z + Log[2*Pi]/2 - Log[z]/2 + z*Log[z]

This doesn’t contain a term 1/z^4, but it doesn’t need to: there is no such term in the asymptotic expansion. It is giving us the terms up to order 4; it’s just that the coefficient of the 1/z^4 term is zero.

If we ask for terms up to order 5

    Asymptotic[LogGamma[z], z -> Infinity, SeriesTermGoal -> 5]

we do get a term 1/z^5, but notice there is no 4th order term.

1/(1260*z^5) - 1/(360*z^3) + 1/(12*z) - z + Log[2*Pi]/2 - Log[z]/2 + z*Log[z]
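As a sanity check, the series above can be compared with the log gamma function in Python's standard library:

```python
import math

def loggamma_asymp(z, terms=2):
    # (z - 1/2) log z - z + (1/2) log(2 pi) plus the first correction terms
    s = (z - 0.5) * math.log(z) - z + 0.5 * math.log(2 * math.pi)
    corrections = [1 / (12 * z), -1 / (360 * z**3), 1 / (1260 * z**5)]
    return s + sum(corrections[:terms])

# with three correction terms the error at z = 10 is about the size
# of the first omitted term, roughly 1/(1680 * 10^7)
print(abs(loggamma_asymp(10, 3) - math.lgamma(10)))
```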

A note on output forms

The Mathematica output displayed above was created by using

Export[filename, expression]

to save images as SVG files. The alt text for the images was created using


More Mathematica posts

Extended floating point precision in R and C Wed, 18 Mar 2020 14:54:25 +0000 The GNU MPFR library is a C library for extended precision floating point calculations. The name stands for Multiple Precision Floating-point Reliable. The library has an R wrapper Rmpfr that is more convenient for interactive use. There are also wrappers for other languages.

It takes a long time to install MPFR and its prerequisite GMP, and so I expected it to take a long time to install Rmpfr. But the R library installs quickly, even on a system that doesn’t have MPFR or GMP installed. (I installed GMP and MPFR from source on Linux, but installed Rmpfr on Windows. Presumably the Windows R package included pre-compiled binaries.)

I’ll start by describing the high-level R interface, then go into the C API.


You can call the functions in Rmpfr with ordinary numbers. For example, you could calculate ζ(3), the Riemann zeta function evaluated at 3.

    > zeta(3)
    1 'mpfr' number of precision  128   bits
    [1] 1.202056903159594285399738161511449990768

The default precision is 128 bits, and a numeric argument is interpreted as a 128-bit MPFR object. R doesn’t have a built-in zeta function, so the only available zeta is the one from Rmpfr. If you ask for the cosine of 3, you’ll get ordinary precision.

    > cos(3)
    [1] -0.9899925

But if you explicitly pass cosine a 128-bit MPFR representation of the number 3 you will get cos(3) to 128-bit precision.

    > cos(mpfr(3, 128))                            
    1 'mpfr' number of precision  128   bits       
    [1] -0.9899924966004454572715727947312613023926

Of course you don’t have to only use 128-bits. For example, you could find π to 100 decimal places by multiplying the arctangent of 1 by 4.

    > 100*log(10)/log(2) # number of bits needed for 100 decimals                                               
    [1] 332.1928     
    >  4*atan(mpfr(1,333))                                                                                      
    1 'mpfr' number of precision  333   bits                                                                    
    [1] 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706807 

MPFR C library

The following C code shows how to compute cos(3) to 128-bit precision and 4 atan(1) to 333 bit precision as above.

    #include <stdio.h>
    #include <gmp.h>
    #include <mpfr.h>
    int main (void)
    {
        // All functions require a rounding mode.
        // This mode specifies round-to-nearest
        mpfr_rnd_t rnd = MPFR_RNDN;
        mpfr_t x, y;
        // allocate memory for x and y as 128-bit numbers
        mpfr_init2(x, 128);
        mpfr_init2(y, 128);
        // Set x to the C double number 3
        mpfr_set_d(x, 3, rnd);
        // Set y to the cosine of x
        mpfr_cos(y, x, rnd);
        // Print y to standard out in base 10
        printf ("y = ");
        mpfr_out_str (stdout, 10, 0, y, rnd);
        putchar ('\n');
        // Compute pi as 4*atan(1)
        // Re-allocate x and y to 333 bits
        mpfr_clear(x);
        mpfr_clear(y);
        mpfr_init2(x, 333);
        mpfr_init2(y, 333);
        mpfr_set_d(x, 1.0, rnd);
        mpfr_atan(y, x, rnd);
        // Multiply y by 4 and store the result back in y
        mpfr_mul_d(y, y, 4, rnd);
        printf ("y = ");
        mpfr_out_str (stdout, 10, 0, y, rnd);
        putchar ('\n');
        // Release memory
        mpfr_clear(x);
        mpfr_clear(y);
        return 0;
    }

If this code is saved in the file hello_mpfr.c then you can compile it with

    gcc hello_mpfr.c -lmpfr -lgmp

One line above deserves a little more explanation. The second and third arguments to mpfr_out_str are the base b and number of figures n to print.

We chose b=10 but you could specify any base value 2 ≤ b ≤ 62.

If n were set to 100 then the output would contain 100 significant figures. When n=0, MPFR will determine the number of digits to output, enough digits that the string representation could be read back in exactly. To understand how many digits that is, see Matula’s theorem in the previous post.

When is round-trip floating point radix conversion exact? Mon, 16 Mar 2020 23:20:01 +0000 Suppose you store a floating point number in memory, print it out in human-readable base 10, and read it back in. When can the original number be recovered exactly?

D. W. Matula answered this question more generally in 1968 [1].

Suppose we start with base β with p places of precision and convert to base γ with q places of precision, rounding to nearest, then convert back to the original base β. Matula’s theorem says that if there are no positive integers i and j such that

β^i = γ^j

then a necessary and sufficient condition for the round-trip to be exact (assuming no overflow or underflow) is that

γ^(q-1) > β^p.

In the case of floating point numbers (type double in C) we have β = 2 and p = 53. (See Anatomy of a floating point number.) We’re printing to base γ = 10. No positive power of 10 is also a power of 2, so Matula’s condition on the two bases holds.

If we print out q = 17 decimal places, then

10^16 > 2^53

and so round-trip conversion will be exact if both conversions round to nearest. If q is any smaller, some round-trip conversions will not be exact.

You can also verify that for a single precision floating point number (p = 24 bits precision) you need q = 9 decimal digits, and for a quad precision number (p = 113 bits precision) you need q = 36 decimal digits [2].

Looking back at Matula’s theorem, clearly we need

γ^q ≥ β^p.

Why? Because the right side is the number of base β fractions and the left side is the number of base γ fractions. You can’t have a one-to-one map from a larger space into a smaller space. So the inequality above is necessary, but not sufficient. However, it’s almost sufficient. We just need one more base γ figure, i.e. Matula tells us

γ^(q-1) > β^p

is sufficient. In terms of base 2 and base 10, we need at least 16 decimals to represent 53 bits. The surprising thing is that one more decimal is enough to guarantee that round-trip conversions are exact. It’s not obvious a priori that any finite number of extra decimals is always enough, but in fact just one more is enough; there’s no “table maker’s dilemma” here.

Here’s an example to show the extra decimal is necessary. Suppose p = 5. There are more 2-digit numbers than 5-bit numbers, but if we only use two digits then round-trip radix conversion will not always be exact. For example, the number 17/16 written in binary is 1.0001₂, and has five significant bits. The decimal equivalent is 1.0625₁₀, which rounded to two significant digits is 1.1₁₀. But the nearest binary number to 1.1₁₀ with 5 significant bits is 1.0010₂ = 1.125₁₀. In short, rounding to nearest gives

1.0001₂ -> 1.1₁₀ -> 1.0010₂

and so we don’t end up back where we started.
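Here's a quick empirical illustration of the q = 17 claim for doubles in Python: printing with 17 significant figures always round-trips, while 16 figures sometimes fails.

```python
import random

def roundtrip_ok(x, digits_after_point):
    # formatting with f digits after the point gives f + 1 significant figures
    return float(f"{x:.{digits_after_point}e}") == x

random.seed(0)
xs = [random.random() for _ in range(100_000)]
print(all(roundtrip_ok(x, 16) for x in xs))  # 17 significant figures: always exact
print(all(roundtrip_ok(x, 15) for x in xs))  # 16 significant figures: failures occur
```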

More floating point posts

[1] D. W. Matula. In-and-out conversions. Communications of the ACM, 11(1):47–50. January 1968. Cited in Handbook of Floating-point Arithmetic by Jean-Michel Muller et al.

[2] The number of bits allocated for the fractional part of a floating point number is 1 less than the precision: the leading figure is always 1, so IEEE formats save one bit by not storing the leading bit, leaving it implicit. So, for example, a C double has 53 bits precision, but 52 of the 64 bits in a double are allocated to storing the fraction.

Group symmetry of copula operations Mon, 16 Mar 2020 19:05:56 +0000 You don’t often see references to group theory in a statistics book. Not that there aren’t symmetries in statistics that could be described in terms of groups, but this isn’t often pointed out.

Here’s an example from An Introduction to Copulas by Roger Nelsen.

Show that under composition the set of operations of forming the survival copula, the dual of a copula, and the co-copula of a given copula, along with the identity (i.e., ^, ~, *, and i) yields the dihedral group.

Nelsen gives the following multiplication table for copula operations.

    o | i ^ ~ *
    i | i ^ ~ *
    ^ | ^ i * ~
    ~ | ~ * i ^
    * | * ~ ^ i

The rest of this post explains what a copula is and what the operations above are.

What is a copula?

At a high level, a copula is a mathematical device for modeling the dependence between random variables. Sklar’s theorem says you can express the joint distribution of a set of random variables in terms of their marginal distributions and a copula. If the distribution functions are continuous, the copula is unique.

The precise definition of a copula is technical. We’ll limit ourselves to copulas in two dimensions to make things a little simpler.

Let I be the unit interval [0, 1]. Then a (two-dimensional) copula is a function from I × I to I  that satisfies

\begin{align*} C(0, v) &= 0\\ C(u, 0) &= 0\\ C(u, 1) &= u\\ C(1, v) &= v \end{align*}

and is 2-increasing.

The idea of a 2-increasing function is that “gradients point northeast.” Specifically, for all points (x₁, y₁) and (x₂, y₂) with x₁ ≤ x₂ and y₁ ≤ y₂, we have

C(x_2, y_2) - C(x_2, y_1) - C(x_1, y_2) + C(x_1, y_1) \,\geq\, 0

The definition of copula makes no mention of probability, but the 2-increasing condition says that C acts like the joint CDF of two random variables.

Survival copula, dual copula, co-copula

For a given copula C, the corresponding survival copula, dual copula, and co-copula are defined by

\begin{align*} \hat{C}(u, v) &= u + v - 1 + C(1-u, 1-v) \\ \tilde{C}(u, v) &= u + v - C(u,v) \\ C^*(u,v) &= 1 - C(1-u, 1-v) \end{align*}


The reason for the name “survival” has to do with a survival function, i.e. the complementary CDF of a random variable. The survival copula is another copula, but the dual copula and co-copula aren’t actually copulas.

This post hasn’t said much about motivation or application—that would take a lot more than a short blog post—but it has included enough that you could verify that the operations do compose as advertised.
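Here's a quick numerical check of a few entries in the multiplication table, using the Clayton copula as an arbitrary test case (chosen because it is not radially symmetric, so the checks are not vacuous):

```python
import numpy as np

surv = lambda C: lambda u, v: u + v - 1 + C(1 - u, 1 - v)   # ^
dual = lambda C: lambda u, v: u + v - C(u, v)               # ~
co   = lambda C: lambda u, v: 1 - C(1 - u, 1 - v)           # *

# Clayton copula with parameter 2, an arbitrary non-symmetric example
C = lambda u, v: (u**-2 + v**-2 - 1)**-0.5

u, v = np.meshgrid(np.linspace(0.1, 0.9, 9), np.linspace(0.1, 0.9, 9))
close = lambda A, B: np.allclose(A(u, v), B(u, v))

print(close(surv(surv(C)), C))      # ^ followed by ^ gives the identity
print(close(dual(surv(C)), co(C)))  # ^ followed by ~ gives *
print(close(co(dual(C)), surv(C)))  # ~ followed by * gives ^
```

All three checks print True, matching the corresponding entries in Nelsen's table.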

Update: See this post for more algebraic structure for copulas, a sort of convolution product.

Product of Chebyshev polynomials Sun, 15 Mar 2020 19:12:50 +0000 Chebyshev polynomials satisfy a lot of identities, much like trig functions do. This post will look briefly at just one such identity.

Chebyshev polynomials Tn are defined for n = 0 and 1 by

T₀(x) = 1
T₁(x) = x

and for larger n using the recurrence relation

Tₙ₊₁(x) = 2xTₙ(x) – Tₙ₋₁(x)

This implies

T₂(x) = 2xT₁(x) – T₀(x) = 2x² – 1
T₃(x) = 2xT₂(x) – T₁(x) = 4x³ – 3x
T₄(x) = 2xT₃(x) – T₂(x) = 8x⁴ – 8x² + 1

and so forth.

Now for the identity for this post. If m ≥ n, then

2 Tₘ Tₙ = Tₘ₊ₙ + Tₘ₋ₙ.

In other words, the product of the mth and nth Chebyshev polynomials is the average of the (m + n)th and (m − n)th Chebyshev polynomials. For example,

2 T₃(x) T₁(x) = 2 (4x³ – 3x) x = T₄(x) + T₂(x)
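The identity is easy to check numerically with NumPy's Chebyshev class; here's a quick sketch for the example above:

```python
import numpy as np
from numpy.polynomial import Chebyshev

m, n = 3, 1
lhs = 2 * Chebyshev.basis(m) * Chebyshev.basis(n)
rhs = Chebyshev.basis(m + n) + Chebyshev.basis(m - n)
x = np.linspace(-1, 1, 9)
print(np.allclose(lhs(x), rhs(x)))  # True
```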

The identity above is not at all apparent from the recursive definition of Chebyshev polynomials, but it follows quickly from the fact that

Tₙ(cos θ) = cos nθ.

Proof: Let θ = arccos x. Then

2 Tₘ(x) Tₙ(x)
= 2 Tₘ(cos θ) Tₙ(cos θ)
= 2 cos mθ cos nθ
= cos (m+n)θ + cos (m−n)θ
= Tₘ₊ₙ(cos θ) + Tₘ₋ₙ(cos θ)
= Tₘ₊ₙ(x) + Tₘ₋ₙ(x)

You might object that this only shows that the first and last line are equal for values of x that are cosines of some angle, i.e. values of x in [-1, 1]. But if two polynomials agree on an interval, they agree everywhere. In fact, you don’t need an entire interval. For polynomials of degree m+n, as above, it is enough that they agree on m + n + 1 points. (Along those lines, see Binomial coefficient trick.)

The close association between Chebyshev polynomials and cosines means you can often prove Chebyshev identities via trig identities as we did above.

Along those lines, we could have taken

Tₙ(cos θ) = cos nθ

as the definition of Chebyshev polynomials and then proved the recurrence relation above as a theorem, using trig identities in the proof.

Forman Acton suggested in his book Numerical Methods that Work that you should think of Chebyshev polynomials as “cosine curves with a somewhat disturbed horizontal scale.”

The Brothers Markov Sat, 14 Mar 2020 14:46:28 +0000 The Markov brother you’re more likely to have heard of was Andrey Markov. He was the Markov of Markov chains, the Gauss-Markov theorem, and Markov’s inequality.

Andrey had a lesser known younger brother Vladimir who was also a mathematician. Together the two of them proved what is known as the Markov Brothers’ inequality to distinguish it from (Andrey) Markov’s inequality.

For any polynomial p(x) of degree n, and for any non-negative integer k, the maximum of the kth derivative of p over the interval [-1, 1] is bounded by a constant times the maximum of p itself. The constant is a function of k and n but is otherwise independent of the particular polynomial.

In detail, the Markov Brothers’ inequality says

\max_{-1\leq x \leq 1} |p^{(k)}(x)|\,\, \leq \prod_{0 \leq j < k} \frac{n^2 - j^2}{2j+1} \,\max_{-1\leq x \leq 1}|p (x)|

Andrey proved the theorem for k = 1 and his brother Vladimir generalized it for all positive k.

The constant in the Markov Brothers’ inequality is the smallest possible because the bound is exact for Chebyshev polynomials [1].

Let’s look at an example. We’ll take the second derivative of the fifth Chebyshev polynomial.

T5(x) = 16x⁵ – 20x³ + 5x.

The second derivative is

T5″(x) = 320x³ – 120x.

Here are their plots:

T5 and its second derivative

The maximum of T5(x) is 1 and the maximum of its second derivative is 200.

The product in the Markov Brothers’ inequality with n = 5 and k = 2 works out to

(25/1)(24/3) = 200

and so the bound is exact for p(x) = T5(x).
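Here's the example checked numerically in Python, a quick sketch using NumPy's Chebyshev class:

```python
import numpy as np
from numpy.polynomial import Chebyshev

T5 = Chebyshev.basis(5)
x = np.linspace(-1, 1, 100001)
print(np.max(np.abs(T5(x))))           # 1.0
print(np.max(np.abs(T5.deriv(2)(x))))  # 200.0, attained at the endpoints
```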


It took a while for westerners to standardize how to transliterate Russian names, so you might see Andrey written as Andrei or Markov written as Markoff.

There were even more ways to transliterate Chebyshev, including Tchebycheff, Tchebyshev, and Tschebyschow. These versions are the reason Chebyshev polynomials [1] are denoted with a capital T.

More posts mentioning Markov

[1] There are two families of Chebyshev polynomials. When used without qualification, as in this post, “Chebyshev polynomial” typically means Chebyshev polynomial of the first kind. These are denoted Tn. Chebyshev polynomials of the second kind are denoted Un.

Finding coffee in Pi Sat, 14 Mar 2020 06:05:10 +0000 It is widely believed that π is a “normal number,” which would mean that you can find any integer sequence you want inside the digits of π, in any base, if you look long enough.

So for Pi Day, I wanted to find c0ffee inside the hexadecimal representation of π. First I used TinyPI, a demonstration program included with the LibBF arbitrary precision floating point library, to find π to 100 million decimal places.

    ./tinypi -b 100000000 1e8.txt

This took five and a half minutes to run on my laptop. (I estimate it would take roughly an hour to find a billion decimal places [1].)

Note that the program takes its precision in terms of the number of decimal digits, even though the -b flag causes it to produce its output in hex. (Why 10^8 digits? Because I tried 10^6 and 10^7 without success.)

Next I want to search for where the string c0ffee appears.

    grep -o -b c0ffee 1e8.txt

The output of TinyPi is one long string, so I don’t want to print the matching line, just the match itself. That’s what the -o option is for. I want to know where the match occurs, and that’s what -b is for (‘b’ for byte offset).

This produced


This means the first occurrence of c0ffee starts at the 193,000,791st position after the decimal point. Why’s that? The first two characters in the output are “3.”, but counting starts from 0, so we subtract 1 from the position reported by grep.

If you’re not convinced about the position, here’s a smaller example. We look for the first occurrence of 888 in the hex representation of pi.

    echo 3.243f6a8885a3 | grep -b -o 888

This reports 8:888 meaning it found 888 starting in the 8th position of “3.243f6a8885a3”, counting from 0. And you can see that 888 starts in the 7th (hexa)decimal place.
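If you want to generate hex digits of π yourself, exact rational arithmetic makes this easy. Here's a Python sketch that starts from a 30-digit decimal approximation, which is far more accuracy than the first 10 hex digits require:

```python
from fractions import Fraction

pi = Fraction("3.14159265358979323846264338328")
frac, digits = pi - 3, []
for _ in range(10):
    frac *= 16                           # shift one hex place
    d = int(frac)                        # the next hex digit
    digits.append("0123456789abcdef"[d])
    frac -= d
print("3." + "".join(digits))  # 3.243f6a8885
```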

By the way, if you’d like to search for cafe rather than c0ffee, you can find it first in position 156,230.

More Pi posts

[1] This program uses the Chudnovsky algorithm, which takes

O(n (log n)³)

time to compute n digits of π.

Chebyshev approximation Wed, 11 Mar 2020 12:25:43 +0000 In the previous post I mentioned that the Remez algorithm computes the best polynomial approximation to a given function f as measured by the maximum norm. That is, for a given n, it finds the polynomial p of degree n that minimizes the absolute error

|| f – p ||.

The Mathematica function MiniMaxApproximation minimizes the relative error by minimizing

|| (fp) / f ||.

As was pointed out in the comments to the previous post, Chebyshev approximation produces a nearly optimal approximation, coming close to minimizing the absolute error. The Chebyshev approximation can be computed more easily and the results are easier to understand.

To form a Chebyshev approximation, we expand a function in a series of Chebyshev polynomials, analogous to expanding a function in a Fourier series, and keep the first few terms. Like sines and cosines, Chebyshev polynomials are orthogonal functions, and so Chebyshev series are analogous to Fourier series. (If you find it puzzling to hear of functions being orthogonal to each other, see this post.)

Here is Mathematica code to find and plot the Chebyshev approximation to ex over [-1, 1]. First, here are the coefficients.

    weight[x_] := 2/(Pi Sqrt[1 - x^2])
    c = Table[ 
        Integrate[ Exp[x] ChebyshevT[n, x] weight[x], {x, -1, 1}], 
        {n, 0, 5}]

The coefficients turn out to be exactly expressible in terms of Bessel functions, but typically you’d need to do a numerical integration with NIntegrate.

Now we use the Chebyshev coefficients to evaluate the Chebyshev approximation.

    p[x_] := Evaluate[c . Table[ChebyshevT[n - 1, x], {n, Length[c]}]] 
             - First[c]/2

You could see the approximating polynomial with

    p[x]
which displays

    1.00004 + 1.00002 x + 0.499197 x^2 + 0.166489 x^3 + 0.0437939 x^4 + 
 0.00868682 x^5

The code

    Plot[Exp[x] - p[x], {x, -1, 1}]

shows the error in approximating the exponential function with the polynomial above.

Note that the plot has nearly equal ripples; the optimal approximation would have exactly equal ripples. The Chebyshev approximation is not optimal, but it is close. The absolute error is smaller than that of MiniMaxApproximation by a factor of about e.
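For comparison, here's a rough Python analogue. NumPy's Chebyshev.interpolate interpolates in Chebyshev points rather than truncating the Chebyshev series, which is not the same construction as above, but the result is very close and shows the same near-equal ripples:

```python
import numpy as np
from numpy.polynomial import Chebyshev

p = Chebyshev.interpolate(np.exp, 5)   # degree-5 approximation on [-1, 1]
x = np.linspace(-1, 1, 2001)
err = np.max(np.abs(np.exp(x) - p(x)))
print(err)  # on the order of 1e-5
```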

There are bounds on how different the error can be between the best polynomial approximation and the Chebyshev series approximation. For polynomials of degree n, the Chebyshev series approximation error is never more than

4 + 4 log(n + 1)/π

times the best approximation error. See Theorem 16.1 in Approximation Theory and Approximation Practice by Lloyd N. Trefethen.

More Chebyshev posts

Remez algorithm and best polynomial approximation Tue, 10 Mar 2020 22:23:28 +0000 The best polynomial approximation, in the sense of minimizing the maximum error, can be found by the Remez algorithm. I expected Mathematica to have a function implementing this algorithm, but apparently it does not have one. (But see update below.)

It has a function named MiniMaxApproximation which sounds like the Remez algorithm, and it’s close, but it’s not it.

To use this function you first have to load the FunctionApproximations package.

    << FunctionApproximations`

Then we can use it, for example, to find a polynomial approximation to ex on the interval [-1, 1].

    MiniMaxApproximation[Exp[x], {x, {-1, 1}, 5, 0}]

This returns the polynomial

    1.00003 + 0.999837 x + 0.499342 x^2 + 0.167274 x^3 + 0.0436463 x^4 + 
 0.00804051 x^5

And if we plot the error, the difference between ex and this polynomial, we see that we get a good fit.

minimax error

But we know this isn’t optimal because there is a theorem that says the optimal approximation has equal ripple error. That is, the absolute value of the error at all its extrema should be the same. In the graph above, the error is quite a bit larger on the right end than on the left end.

Still, the error is not much larger than the smallest possible using 5th degree polynomials. And the error is about 10x smaller than using a Taylor series approximation.

    Plot[Exp[x] - (1 + x + x^2/2 + x^3/6 + x^4/24 + x^5/120), {x, -1, 1}]

minimax error
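Both error curves are easy to reproduce in Python, using the coefficients printed by MiniMaxApproximation above:

```python
import math
import numpy as np

x = np.linspace(-1, 1, 1001)
taylor = sum(x**k / math.factorial(k) for k in range(6))
taylor_err = np.max(np.abs(np.exp(x) - taylor))
# the degree-5 polynomial returned by MiniMaxApproximation above
mm = (1.00003 + 0.999837 * x + 0.499342 * x**2 + 0.167274 * x**3
      + 0.0436463 * x**4 + 0.00804051 * x**5)
mm_err = np.max(np.abs(np.exp(x) - mm))
print(taylor_err / mm_err)  # roughly 10
```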

Update: Jason Merrill pointed out in a comment what I was missing. Turns out MiniMaxApproximation finds an approximation that minimizes relative error. Since ex doesn’t change that much over [-1, 1], the absolute error and relative error aren’t radically different. There is no option to minimize absolute error.

When you look at the approximation error divided by ex you get the ripples you’d expect.

minimax error

See the next post for a way to construct near optimal polynomial approximations using Chebyshev approximation.

MDS codes Sat, 07 Mar 2020 15:56:51 +0000 A maximum distance separable code, or MDS code, is a way of encoding data so that the distance between code words is as large as possible for a given data capacity. This post will explain what that means and give examples of MDS codes.


A linear block code takes a sequence of k symbols and encodes it as a sequence of n symbols. These symbols come from an alphabet of size q. For binary codes, q = 2. But for non-trivial MDS codes, q > 2. More on that below.

The purpose of these codes is to increase the ability to detect and correct transmission errors while not adding more overhead than necessary. Clearly n must be bigger than k, but the overhead n-k has to pay for itself in terms of the error detection and correction capability it provides.

The ability of a code to detect and correct errors is measured by d, the minimum distance between code words. A code has separation distance d if every pair of code words differs in at least d positions. Such a code can detect up to d − 1 errors per block and can correct ⌊(d-1)/2⌋ errors.


The following example is not an MDS code but it illustrates the notation above.

The extended Golay code used to send back photos from the Voyager missions has q = 2 and [n, k, d] = [24, 12, 8]. That is, data is divided into segments of 12 bits and encoded as 24 bits in such a way that all code blocks differ in at least 8 positions. This allows up to 7 bit flips per block to be detected, and up to 3 bit flips per block to be corrected.

(If 4 bits were corrupted, the result could be equally distant between two valid code words, so the error could be detected but not corrected with certainty.)

Separation bound

There is a theorem that says that for any linear code

k + d ≤ n + 1.

This is known as the Singleton bound. MDS codes are optimal with respect to this bound. That is,

k + d = n + 1.

So MDS codes are optimal with respect to the Singleton bound, analogous to how perfect codes are optimal with respect to the Hamming bound. There is a classification theorem that says perfect codes are either Hamming codes or trivial, with one exception. There is something similar for MDS codes.


MDS codes are essentially either Reed-Solomon codes or trivial. This classification is not as precise as the analogous classification of perfect codes. There are variations on Reed-Solomon codes that are also MDS codes. As far as I know, this accounts for all the known MDS codes. I don’t know that any others have been found, or that anyone has proved that there are no more.

Trivial MDS codes

What are these trivial codes? They are the codes with 0 or 1 added symbols, and the duals of these codes. (The dual of an MDS code is always an MDS code.)

If you do no encoding, i.e. take k symbols and encode them as k symbols, then d = 1 because different code words may differ in only one symbol. In this case n = k and so k + d = n + 1, i.e. the Singleton bound is exact.

You could take k data symbols and add a checksum. If q = 2 this would be a parity bit. For a larger alphabet of symbols, it could be the sum of the k data symbols mod q. Then if two messages differ in 1 symbol, they also differ in the added checksum symbol, so d = 2. We have n = k + 1 and so again k + d = n + 1.
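Here's a small Python check of the checksum construction; the alphabet size 5 and block length 3 are arbitrary choices for illustration:

```python
from itertools import product

q, k = 5, 3  # alphabet Z/5, three data symbols per block
# append a checksum symbol so every codeword sums to 0 mod q
codewords = [w + ((-sum(w)) % q,) for w in product(range(q), repeat=k)]
dist = lambda a, b: sum(x != y for x, y in zip(a, b))
d = min(dist(a, b) for a in codewords for b in codewords if a != b)
n = k + 1
print(d, k + d == n + 1)  # 2 True
```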

The dual of the code that does no encoding is the code that transmits no information! It has only one code word of size n. You could say, vacuously, that d = n because any two different code words differ in all n positions. There’s only one code word so k = 1. And again k + d = n + 1.

The dual of the checksum code is the code that repeats a single data symbol n times. Then d = n because different code words differ in all n positions. We have k = 1 since there is only one information symbol per block, and so k + d = n + 1.

Reed-Solomon codes

So the stars of the MDS show are the Reed-Solomon codes. I haven’t said how to construct these codes because that deserves a post of its own. Maybe another time. For now I’ll just say a little about how they are used in applications.

As mentioned above, the Voyager probes used a Golay code to send back images. However, after leaving Saturn the onboard software was updated to use Reed-Solomon encoding. Reed-Solomon codes are used in more down-to-earth applications such as DVDs and cloud data storage.

Reed-Solomon codes are optimal block codes in terms of the Singleton bound, but block codes are not optimal in terms of Shannon’s theorem. LDPC (low density parity check) codes come closer to the Shannon limit, but some forms of LDPC encoding use Reed-Solomon codes as a component. So in addition to their direct use, Reed-Solomon codes have found use as building blocks for other encoding schemes.

Automatic data reweighting Wed, 04 Mar 2020 11:10:48 +0000 Suppose you are designing an autonomous system that will gather data and adapt its behavior to that data.

At first you face the so-called cold-start problem. You don’t have any data when you first turn the system on, and yet the system needs to do something before it has accumulated data. So you prime the pump by having the system act at first on what you believe to be true prior to seeing any data.

Now you face a problem. You initially let the system operate on assumptions rather than data out of necessity, but you’d like to go by data rather than assumptions once you have enough data. Not just some data, but enough data. Once you have a single data point, you have some data, but you can hardly expect a system to act reasonably based on one datum.

Rather than abruptly moving from prior assumptions to empirical data, you’d like the system to gradually transition from reliance on the former to reliance on the latter, weaning the system off initial assumptions as it becomes more reliant on new data.

The delicate part is how to manage this transition. How often should you adjust the relative weight of prior assumptions and empirical data? And how should you determine what weights to use? Should you set the weight given to the prior assumptions to zero at some point, or should you let the weight asymptotically approach zero?

Fortunately, there is a general theory of how to design such systems. It’s called Bayesian statistics. The design issues raised above are all handled automatically by Bayes theorem. You start with a prior distribution summarizing what you believe to be true before collecting new data. Then as new data arrive, the magic of Bayes theorem adjusts this distribution, putting more emphasis on the empirical data and less on the prior. The effect of the prior gradually fades away as more data become available.
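Here's a minimal sketch of this in Python, using a beta-binomial model with made-up numbers:

```python
# Beta(a, b) prior on a success probability, Bernoulli observations.
# The posterior mean blends the prior mean with the observed frequency,
# and the prior's influence fades automatically as data accumulate.
a, b = 3, 7                   # prior belief: success rate around 0.3
data = [1] * 60 + [0] * 40    # observed: 60 successes in 100 trials
posterior_mean = (a + sum(data)) / (a + b + len(data))
print(posterior_mean)         # about 0.57, between 0.3 (prior) and 0.6 (data)
```

With 10 observations the prior would still pull the estimate strongly toward 0.3; with 10,000 it would barely matter. No explicit weighting schedule is needed.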

Related posts

Maximum gap between binomial coefficients Mon, 02 Mar 2020 12:19:04 +0000 I recently stumbled on a formula for the largest gap between consecutive items in a row of Pascal’s triangle.

For n ≥ 2,

\max_{1 \leq k \leq n} \left| {n \choose k} - {n \choose k-1}\right| = {n \choose \lfloor \tau \rfloor} - {n \choose \lfloor \tau - 1\rfloor}

where
\tau = \frac{n + 2 - \sqrt{n+2}}{2}

For example, consider the 6th row of Pascal’s triangle, the coefficients of (x + y)^6.

1, 6, 15, 20, 15, 6, 1

The largest gap is 9, the gap between 6 and 15 on either side. In our formula n = 6 and so

τ = (8 – √8)/2 = 2.5858

and so the floor of τ is 2. The equation above says the maximum gap should be between the binomial coefficients with k = 2 and 1, i.e. between 15 and 6, as we expected.
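The formula is easy to verify by brute force for small n; here's a quick Python check:

```python
import math

def max_gap(n):
    return max(abs(math.comb(n, k) - math.comb(n, k - 1)) for k in range(1, n + 1))

def formula(n):
    tau = (n + 2 - math.sqrt(n + 2)) / 2
    k = math.floor(tau)
    return math.comb(n, k) - math.comb(n, k - 1)

print(all(max_gap(n) == formula(n) for n in range(2, 101)))  # True
```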

I’ve needed a result like this in the past, but I cannot remember now why. I’m posting it here for my future reference and for the reference of anyone else who might need this. I intend to update this post if I run across an application.

More on Pascal’s triangle

Source: Zun Shan and Edward T. H. Wang. The Gaps Between Consecutive Binomial Coefficients. Mathematics Magazine, Vol. 63, No. 2 (Apr., 1990), pp. 122–124

Formatting in comments Mon, 02 Mar 2020 12:16:04 +0000 The comments to the posts here are generally very helpful. I appreciate your contributions to the site.

I wanted to offer a tip for those who leave comments and are frustrated by the way the comments appear, especially those who write nicely formatted snippets of code only to see the formatting lost. There is a way to post formatted code.

Comments can use basic HTML markup. In particular, you can wrap a section of code with <pre> and </pre> and the levels of indentation will be retained.

Sometimes I see that someone has left a block of code wrapped between <code> and </code> tags, but this doesn’t do what they expected. That’s because <code> is an inline element and <pre> is a block element. The former can be used in the middle of a sentence, such as referring to the qsort function, but for multiple lines of code you need the latter.

The comment editor does not accept LaTeX markup, but you can write a fair amount of mathematics with plain HTML.

Thanks again for your comments.

]]> 3
Sum of squared digits Fri, 28 Feb 2020 13:13:35 +0000 Take a positive integer x, square each of its digits, and sum. Now do the same to the result, over and over. What happens?

To find out, let’s write a little Python code that sums the squares of the digits.

    def G(x):
        return sum(int(d)**2 for d in str(x))

This function turns a number into a string, and iterates over the characters in the string, turning each one back into an integer and squaring it.

Now let’s plot the trajectories of the iterations of G.

    def iter(x, n):
        for _ in range(n):
            x = G(x)
        return x
    for x in range(1, 40):
        y = [iter(x, n) for n in range(1, 12)]

This produces the following plot.

For every starting value, the sequence of iterations either gets stuck on 1 or it goes in the cycle 4, 16, 37, 58, 89, 145, 42, 20, 4, … . This is a theorem of A. Porges published in 1945 [1].

To see how the sequences eventually hit 1 or 4, let’s modify our iteration function to stop at 4 rather than cycling.

    def iter4(x, n):
        for _ in range(n):
            if x != 4:
                x = G(x)
        return x

    for x in range(1, 40):
        y = [iter4(x, n) for n in range(1, 16)]

This produces the following plot.

Update: Here’s a better, or at least complementary, way to look at the iterations. Now the horizontal axis represents the starting points x and the points stacked vertically over x are the iterates of G starting at x.

    def orbit(x):
        pts = set()
        while x not in pts:
            x = G(x)
        return pts  

    for x in range(1, 81):
        for y in orbit(x):
            plt.plot(x, y, "bo", markersize=2)
    plt.ylabel("Iterations of $x$")

[1] Porges, A set of eight numbers, Amer. Math. Monthly, 52(1945) 379-382.

]]> 1
Computing the area of a thin triangle Thu, 27 Feb 2020 17:14:14 +0000 Heron’s formula computes the area of a triangle given the length of each side.

A = \sqrt{s(s-a)(s-b)(s-c)}


s = \frac{a + b + c}{2}

If you have a very thin triangle, one where two of the sides approximately equal s and the third side is much shorter, a direct implementation Heron’s formula may not be accurate. The cardinal rule of numerical programming is to avoid subtracting nearly equal numbers, and that’s exactly what Heron’s formula does if s is approximately equal to two of the sides, say a and b.

William Kahan’s formula is algebraically equivalent to Heron’s formula, but is more accurate in floating point arithmetic. His procedure is to first sort the sides in decreasing order, then compute

A = \frac{1}{4} \sqrt{(a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))}

You can find this method, for example, in Nick Higham’s book Accuracy and Stability of Numerical Algorithms.

The algebraically redundant parentheses in the expression above are not numerically redundant. As we’ll demonstrate below, the method is less accurate without them.

Optimizing compilers respect the parentheses: the results are the same when the code below is compiled with gcc with no optimization (-O0) and with aggressive optimization (-O3). The same is true of Visual Studio in Debug and Release mode.

C code demo

First, here is a straight-forward implementation of Heron

    #include <math.h>

    float heron1(float a, float b, float c) {
        float s = 0.5 * (a + b + c);
        return sqrt(s*(s - a)*(s - b)*(s - c));

And here’s an implementation of Kahan’s version.

    void swap(float* a, float* b) {
        float t = *b;
        *b = *a;
        *a = t;

    float heron2(float a, float b, float c) {
        // Sort a, b, c into descending order
        if (a < b) swap(&a, &b);
        if (a < c) swap(&a, &c);
        if (b < c) swap(&b, &c);

        float p = (a + (b + c))*(c - (a - b))*(c + (a - b))*(a + (b - c));
        return 0.25*sqrt(p);

Finally, here’s an incorrect implementation of Kahan’s method, with “unnecessary” parentheses removed.

    float heron3(float a, float b, float c) {
        // Sort a, b, c into descending order
        if (a < b) swap(&a, &b);
        if (a < c) swap(&a, &c);
        if (b < c) swap(&b, &c);

        float p = (a + b + c)*(c - (a - b))*(c + a - b)*(a + b - c);
        return 0.25*sqrt(p);

Now we call all three methods.

    int main()
        float a = 100000.1, b = 100000.2, c = 0.3;
        printf("%0.8g\n", heron1(a, b, c));
        printf("%0.8g\n", heron2(a, b, c));
        printf("%0.8g\n", heron3(a, b, c));

And for a gold standard, here is an implementation in bc with 40 decimal place precision.

    scale = 40

    a = 10000000.01
    b = 10000000.02
    c = 0.03

    s = 0.5*(a + b + c)

Here are the outputs of the various methods, in order of increasing accuracy.

    Heron: 14363.129
    Naive: 14059.268
    Kahan: 14114.293
    bc:    14142.157

Here “naive” means the incorrect implementation of Kahan’s method, heron3 above. The bc result had many more decimals but was rounded to the same precision as the C results.

Related post: How to compute the area of a polygon

]]> 4
A tale of two iterations Thu, 27 Feb 2020 00:59:30 +0000 I recently stumbled on a paper [1] that looks at a cubic equation that comes out of a problem in orbital mechanics:

σx³ = (1 + x

Much of the paper is about the derivation of the equation, but here I’d like to focus on a small part of the paper where the author looks at two ways to go about solving this equation by looking for a fixed point.

If you wanted to isolate x on the left side, you could divide by σ and get

x = ((x + 1)² / σ)1/3.

If you work in the opposite direction, you could start by taking the square root of both sides and get

x = √(σx3) – 1.

Both suggest starting with some guess at x and iterating. There is a unique solution for any σ > 4 and so for our example we’ll fix σ = 5.

We define two functions to iterate, one for each approach above.

    sigma = 5
    x0 = 0.1

    def f1(x):
        return sigma**(-1/3)*(x+1)**(2/3)

    def f2(x):
        return (sigma*x**3)*0.5 - 1

Here’s what we get when we use the cobweb plot code from another post.

    cobweb(f1, x0, 10, "ccube1.png", 0, 1.2)

cobweb plot for iterations of f1

This shows that iterations converge quickly to the solution x = 0.89578.

Now let’s try the same thing for f2. When we run

    cobweb(f2, x0, 10, "ccube2.png", 0, 1.2)

we get an error message

    OverflowError: (34, 'Result too large')

Let’s print out a few values to see what’s going on.

    x = 0.1
    for _ in range(10):
        x = f2(x)

This produces


before aborting with an overflow error. Well, that escalated quickly.

The first iteration converges to the solution for any initial starting point in (0, 1). But the solution is a point of repulsion for the second iteration.

If we started exactly on the solution, the unstable iteration wouldn’t move. But if we start as close to the solution as a computer can represent, the iterations still diverge quickly. When I changed the starting point to 0.895781791526322, the correct root to full floating point precision, the script crashed with an overflow error after 9 iterations.

More on fixed points

[1] C. W. Groetsch. A Celestial Cubic. Mathematics Magazine, Vol. 74, No. 2 (Apr., 2001), pp. 145–152.

]]> 2