Most popular posts this year

Here are the five post popular posts so far this year.

The posts above had the most page views. But just counting page views doesn’t measure what regular readers necessarily most enjoy reading. It’s heavily influenced by what gets shared on Hacker News and elsewhere, and so naturally favors the most accessible posts.

It’s harder to say which posts have gotten the most feedback because this isn’t as quantifiable and I’m drawing on memory. But I do remember getting a lot of feedback regarding my post More Stability, Less Stress about my experience consulting. I also got a lot of feedback regarding making interesting images using a strange coordinate system.


Cyclic permutations and trace

The trace of a square matrix is the sum of the elements on its main diagonal.

The order in which you multiply matrices matters: in general, matrix multiplication is not commutative. But the trace of a product of matrices may or may not depend on the order of multiplication.

Specifically, trace doesn’t change if you take the last matrix and move it to the front. You can apply this repeatedly the get all cyclic permutations.

If you just have two matrices, A and B, then order doesn’t matter because there is only one permutation of two things, and it’s a cyclic permutation. That is trace(AB) = trace(BA).

But in general only cyclic permutations leave the trace unchanged. The following Python code illustrates this.

    import numpy as np

    N = 4

    A = np.random.rand(N, N)
    B = np.random.rand(N, N)
    C = np.random.rand(N, N)

    print( (A@B@C).trace() )
    print( (B@C@A).trace() )
    print( (C@A@B).trace() )
    print( (C@B@A).trace() )
    print( (B@A@C).trace() )
    print( (A@C@B).trace() )

This shows that ABC, BCA, and CAB all have the same trace (5.793 in this example) and CBA, BAC, and ACB all have the same trace (4.851).

In this case there were only two trace values: one for the forward sequence and its rotations, and one for the reversed sequence and its rotations. With more matrices there are more possibilities.

    D = np.random.rand(N, N)

    print( (A@B@C@D).trace() )
    print( (C@D@A@B).trace() )
    print( (C@D@B@A).trace() )
    print( (D@C@B@A).trace() )

Now we see that ABCD and CDBA have the same trace (12.632), because they are rotations of each other, but the other two permutations have difference traces (13.843 and 12.564).

If we’re working with symmetric matrices, then reversing the order doesn’t change the trace. Here’s a quick proof for the product of three symmetric matrices:

trace(ABC) = trace((ABC)T) = trace(CT BT AT) = trace(CBA)

So for three symmetric matrices, the trace is the same for all permutations.

The following code shows that the order of multiplication can change the trace, even for symmetric matrices, if you have four matrices.

    A += A.T
    B += B.T
    C += C.T
    D += D.T

    print( (C@D@A@B).trace() )
    print( (C@D@B@A).trace() )

The first statement prints 202.085 and the second 222.211.

Related posts

Trick for 2×2 eigenvalues

3Blue1Brown has a nice new video on how to calculate the eigenvalues of 2×2 matrices.

The most common way to find the eigenvalues of a 2×2 matrix A is working straight from the definition, solving det(A – λI) = 0.

This is fine when you’re learning what eigenvalues are. But if you’ve already learned all the theory and just want to calculate the eigenvalues, there’s an easier way.

As shown in the video, the eigenvalues are

m ± √(m² – p)

where m is the mean of the elements on the main diagonal and p is the determinant.

Universal confidence interval

Here’s a way to find a 95% confidence interval for any parameter θ.

  • With probability 0.95, return the real line.
  • With probability 0.05, return the empty set.

Clearly 95% of the time this procedure will return an interval that contains θ.

This example shows the difference between a confidence interval and a credible interval.

A 95% credible interval is an interval such that the probability is 95% that the parameter is in the interval. That’s what people think a 95% confidence interval is, but it’s not.

Suppose I give you a confidence interval using the procedure above. The probability that θ is in the interval is 1 if I return the real line and 0 if I return the empty set. In either case, the interval that I give you tells you absolutely nothing about θ.

But if I give you a 95% credible interval (a, b), then given the model and the data that went into it, the probability is 95% that θ is in the interval (a, b).

Confidence intervals are more useful in practice than in theory because they often approximately correspond to a credible interval under a reasonable model.

Credible intervals depend on your modeling assumptions. So do confidence intervals.

What is a polynomial?

When you leave the comfort of the real numbers, you might be mistaken about what a polynomial is.

Suppose you’re looking at polynomials over some finite field. Why would you do that? Numerous practical applications, but that’s a topic for another post.

You look in some reference that’s talking about polynomials and you see things like

x³ + x

referred to as a polynomial over the field with two elements, 0 and 1. “Yeah, I get it. It’s just like middle school, except the coefficients are zeros and ones.”

No, it’s not like middle school.

What happens when you stick in x = 0? You get 0. What happens when you stick in 1? You get 0, because 1 + 1 = 0 mod 2.

The function that takes a bit and returns the cube of the bit plus the bit itself always evaluates to 0 when working mod 2.

We have a non-zero polynomial that when interpreted as a function equals the zero function.

The resolution to this tension is that in general, a polynomial is not a function. You can associate a function with a polynomial by evaluating the polynomial for some value of x, but two different polynomials may correspond to the same function.

I have a vague memory of some professor correcting me for conflating a polynomial with a function. That seemed like pure pedantry at the time. It would have been helpful if said professor had explained why the distinction mattered. It never mattered in any example I could think of.

When you go back and look at how polynomials are defined over an arbitrary field, there are two possibilities. A reference may use powers of x and coefficients from the field, but say somewhere that the x is a “formal symbol” or something like that. The other possibility is to define polynomials as infinite sequences of field elements for which only a finite number of elements are nonzero, along with addition and multiplication rules that look just like what you’d expect for polynomials.

This may seem like unnecessary formalism. And for polynomials over real numbers it is. If two real polynomials differ as polynomials, i.e. they differ in one of their coefficients, then there is some place where they differ as functions. But that’s not true in general.

Confidence interval widths

Suppose you do N trials of something that can succeed or fail. After your experiment you want to present a point estimate and a confidence interval. Or if you’re a Bayesian, you want to present a posterior mean and a credible interval. The numerical results hardly differ, though the two interpretations differ.

If you got half successes, you will report a confidence interval centered around 0.5. The more unbalanced your results were, the smaller your confidence interval will be. That is, the confidence interval will be smallest if you had no successes and widest if you had half successes.

What can we say about how the width of your confidence varies as a function of your point estimate p? That question is the subject of this post [1].

Frequentist confidence interval width

We’ll start with an example, where N = 1,000,000. We let our number of observed successes n vary from 0 to 500,000 and plot the width of a 95% confidence interval as a function of n on a log-log plot. (By symmetry we only need to look to up to N/2. If you have more than half successes, reverse your definitions of success and failure.)

The result is almost a perfectly straight line, with the exception of a tiny curve at the end. This says the log of the confidence interval is a linear function of the log of n, or in other words, the dependence of confidence interval width on n follows a power law.

Bayesian credible interval width

The plot above measures frequentist confidence intervals, using Wilson score with continuity correction. We can repeat our experiment, putting on our Bayesian hat, and compute credible intervals using a Jeffreys prior, that is, a beta(0.5, 0.5) prior.

The results are indistinguishable. The confidence interval widths differ after the sixth decimal place.

In this example, there’s very little quantitative difference between a frequentist confidence interval and a Bayesian credible interval. The difference is primarily one of interpretation. The Bayesian interpretation makes sense and the frequentist interpretation doesn’t [2].

Power law

If the logs of two things are related linearly [3], then the things themselves are related by a power law. So if confidence/credible interval width varies as a power law with the point estimate p, what is that power law?

The plots above suggest that to a good approximation, if we let w be the credible interval length,

log w = m log p + b

for some slope m and intercept b.

We can estimate the slope by choosing p = 10-1 and p = 10-5. This gives us m = 0.4925 and b = -5.6116. There are theoretical reasons to expect that m should be 0.5, so we’ll round 0.4925 up to 0.5 both for convenience and for theory.


log w = 0.5 log p – 5.6116.

and so

w = 0.00366 √p

Let’s see how well this works by comparing it to the exact interval length. (I’m using Jeffreys’ credible interval, not that it matters.)

The power law approximation works well when the estimated proportion p is small, say less than 0.2. For larger p the normal approximation works well.

We could have guessed that the confidence interval width was proportional to √p(1-p) based on the normal approximation/central limit theorem. But the normal approximation is not uniformly good over the range of all ps.

Related posts

[1] Standard notation would put a little hat on top of the p but HTML doesn’t make this possible without inserting images into text. I hope you don’t mind if I take my hat off.

[2] This quip is a little snarky, but it’s true. When I would ask graduate students to describe what a confidence interval is, they would basically give me the definition of a credible interval. Most people who calculate frequentist confidence intervals think they’re calculating Bayesian credible intervals, though they don’t realize it. The definition of confidence interval is unnatural and convoluted. But confidence intervals work better in practice than in theory because they approximate credible intervals.

L. J. Savage once said

“The only use I know for a confidence interval is to have confidence in it.”

In other words, the value of a confidence interval is that you can often get away with interpreting it as a credible interval.

[3] Unfortunately the word “linear” is used in two inconsistent ways. Technically we should say “affine” when describing a function y = mx + b, and I wish we would. But that word isn’t commonly known. People call this relation linear because the graph is a line. Which makes sense, but it’s inconsistent with the use of “linear” as in “linear algebra.”


Multicolor reproducing cellular automata

The previous post looked at a cellular automaton introduced by Edward Fredkin. It has only two states: a cell is on or off. At each step, each cell is set to the sum of the states of the neighboring cells mod 2. So a cell is on if it had an odd number neighbors turned on, and is turned off if it had an even number of neighbors turned on.

You can look at cellular automata with more states, often represented as colors. If the number of states per cell is prime, then the extension of Fredkin’s automaton to multiple states also reproduces its initial pattern. Instead of taking the sum of the neighboring states mod 2, more generally you take the sum of the neighboring states mod p.

This theorem is due to Terry Winograd, cited in the same source as in the previous post.

Here’s an example with 11 states. The purple background represents 0. As before, I start with an ‘E’ pattern, but I make it multi-colored.

Initial state

Before turn on the automaton, we need to clarify what we mean by “neighborhood.” In this example, I mean cells to the north, south, east, and west. You could include the diagonals, but that would be different example, though still self-reproducing.

After the first step, the pattern blurs.

step 1

After 10 steps the original pattern is nowhere to be seen.

step 10

And then suddenly on the next step the pattern reappears.

step 11

I don’t know whether there are theorems on how long it takes the pattern to reappear, or how the time depends on the initial pattern, but in my experiments, the a pattern reappears after 4 steps with 2 states, 9 steps with 3 states, 5 steps with 5 states, 7 steps with 7 states, and 11 steps with 11 states.

Here’s an animation of the 11-color version, going through 22 steps.

Animated gif of CA

Self-reproducing cellular automata

Edward Fredkin is best known these days for the Fredkin gate, a universal reversible circuit.

I recently found out that Fredkin was one of the pioneers in cellular automata. His student Edwin Banks describes a cellular automaton introduced by Fredkin [1].

Edward Fredkin of MIT has described the following interesting cellular space. If the states are 0 and 1 and if the result state is obtained by taking the sum of the neighbors, modulo 2, then every initial configuration will reproduce copies of itself about the initial configuration.

In other words, start with some bit pattern in an infinite checkerboard of bits. Then at each step, replace each bit by the parity of the sum of the neighboring bits. Eventually you’ll see copies of that original pattern.

There’s some ambiguity regarding what “neighbors” means. Banks’ dissertation considers the neighbors of a bit to be the bits to the north, south, east, and west, and this implies that Banks has these neighbors in mind when he describes Fredkin’s automaton. Other sources say Fredkin also considered diagonally adjacent bits as part of the neighborhood, i.e. northwest, northeast, southwest, and southeast.

Banks goes on to say

Terry Winograd (1970) generalized this result showing that any neighborhood, not just the four nearest neighbors, and any number of dimensions still yields the same results.

I take it by “the four nearest neighbors” the neighbors to the north, south, east, and west. I’m not sure what the “any neighborhood” means in Winograd’s theorem; I imagine there must be some implied restriction.

In any case, I’ll show by example that both definitions of “neighborhood” discussed above lead to reproducing the original pattern, though they do not reproduce it in the same way.

I’ll start with an ‘E’ in the middle of a grid. The black squares represent 0s and the white squares represent 1s.

initial state

Here are the first eight steps in evolving this pattern, using the rule that “neighbor” only includes north, south, east, and west neighbors.

If we consider each point to have eight neighbors, i.e. we include diagonals, here’s what the first eight steps look like.

Update: Here’s an animated version of the first automaton

and the second

See the next post for a generalization using more than two states per cell, using colors rather than black and white cells. If the number of colors is prime, and you take the sum of the neighboring states mod p rather than mod 2, you get analogous results.

[1] Edwin R. Banks, Information Processing and Transmission in Cellular Automata. MIT dissertation. January 1971.

Humming St. Christopher

The other day I woke up with a song in my head I hadn’t heard in a long time, the hymn Beneath the Cross of Jesus. The name of the tune is St. Christopher.

When I thought about the tune, I realized it has some fairly sophisticated harmony. My memory of the hymns I grew up with was that they were harmonically simple, mostly built around three chords: I, IV, V. But this hymn has a lot going on.

I imagine a lot of things that I remember as being simple weren’t. I was simple, and my world was richer than I realized.


You can find the sheet music for the hymn here. I’ll write out the chord progressions for the first two lines.

    I     idim | I            | V7   ii7 V7  | I   III | 
    vi   iidim | vi VI7 ii vi | II   VIII7♭5 | III     |

If you’re not familiar with music theory, just appreciate that there are a lot more symbols up there than I, IV, and V.

The second line effectively modulates into a new key, the relative minor of the original key, and I’m not sure how to describe what’s going on at the end of the second line.

Related posts

Aliquot ratio distribution

The previous post looked at repeatedly applying the function s(n) which is the sum of the divisors of n less than n. It is an open question whether the sequence

s( s( s( … s(n) … ) ) )

always converges or enters a loop. In fact, it’s an open question of whether the sequence starting with n = 276 converges.

A heuristic argument for why the sequence ought to converge, at least much of the time, is that s(n) is usually less than n. This can be made precise as explained here: in the limit as N goes to infinity, the proportion of numbers less than N for which s(n) < n is at least 0.752.

Even though applying s usually leads to a smaller result, conceivably it could lead to a much larger result when it increases, as with the hailstone problem.

I made some histograms of the ratio s(n) / n to get a feel for how much s increases or decreases its argument. (I imagine this has all been done before, but I haven’t seen it.)

Here’s my first attempt, for n up to 1000.

Histogram of aliquot ratios

So apparently the ratio is often less than 1.

I thought it would be clearer to work on a log scale and center the graph at 0 so I next made the following graph.

Histogram of log aliquot ratios

This shows that the aliquot function s never increases its argument by much, and it often reduces its argument quite a bit, at least for numbers less than 1000.

Next I made a histogram with numbers less than 1,000,000 to see whether the pattern continued. Here’s what I got.

Histogram of log aliquot ratios for numbers less than a million

The most obvious feature is the huge spike. If you can ignore that, mentally cropping the image to just the bottom part, it looks basically like the previous graph.

What’s up with that spike? It’s in the bin containing log(0.5). In other words, the ratio s(n)/n is often near 0.5.

So I thought maybe there’s some number theoretic reason the ratio is often exactly 1/2. That’s not the case. It’s only exactly 1/2 once, when n = 2, but it’s often near 1/2.

In any case, applying s very often returns a smaller number than you started with, so it’s plausible that the iterated aliquot sequence would often converge. But empirically it looks like some sequences, such as the one starting at n = 276, diverge. They somehow weave through all the numbers for which the sequence would come tumbling down.