How Lewis Carroll computed determinants

Mad hatter from Alice in Wonderland by Lewis Carroll

Charles Dodgson, better known by his pen name Lewis Carroll, discovered a method of calculating determinants now known variously as the method of contractants, Dodgson condensation, or simply condensation.

The method was devised for ease of computation by hand, but it has features that make it a practical method for computation by machine.

Overview

The basic idea is to repeatedly condense a matrix, replacing it by a matrix with one less row and one less column. Each element is replaced by the determinant of the 2×2 matrix formed by that element and its neighbors to the south, east, and southeast. The bottom row and rightmost column have no such neighbors and are removed. There is one additional part of the algorithm that will be easier to describe after introducing some notation.

Details

Let A be the matrix whose determinant we want to compute and let A(k) be the matrix obtained after k steps of the condensation algorithm.

The matrix A(1) is computed as described in the overview:

a^{(1)}_{ij} = a_{ij} a_{i+1, j+1} - a_{i, j+1} a_{i+1, j}.

Starting with A(2) the terms are similar, except each 2×2 determinant is divided by an element from two steps back:

a^{(k+1)}_{ij} = \frac{1}{a^{(k-1)}_{i+1,j+1}}\left(a^{(k)}_{ij} a^{(k)}_{i+1, j+1} - a^{(k)}_{i, j+1} a^{(k)}_{i+1, j} \right)

Dodgson’s original paper from 1867 is quite readable, surprisingly so given how much math notation and terminology change over time.

One criticism I have of the paper is that it is hard to tell which element belongs in the denominator: whether the subscripts should be i and j or i + 1 and j + 1. His first example doesn’t settle the question because the two candidate elements happen to be equal there.

Example

Here’s an example using condensation to find the determinant of a 4×4 matrix.

\begin{align*} A^{(0)} &= \begin{bmatrix} 3 & 1 & 4 & 1 \\ 5 & 9 & 2 & 6 \\ 0 & 7 & 1 & 0 \\ 2 & 0 & 2 & 3 \end{bmatrix} \\ A^{(1)} &= \begin{bmatrix} 22 &-34 & 22 \\ 35 & -5 &-6 \\ -14 & 14 & 3 \end{bmatrix} \\ A^{(2)} &= \begin{bmatrix} 120 & 157 \\ 60 & 69 \end{bmatrix} \\ A^{(3)} &= \begin{bmatrix} 228 \end{bmatrix} \end{align*}

We can verify this with Mathematica:

    Det[{{3, 1, 4, 1}, {5, 9, 2, 6}, 
         {0, 7, 1, 0}, {2, 0, 2, 3}}]

which also produces 228.
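
Here’s a short Python sketch of the condensation recurrence, my own code rather than anything from Dodgson’s paper. It assumes no zeros turn up in the interior of any intermediate matrix, so the rearrangement step described below is never needed.

    def dodgson_det(M):
        # Dodgson condensation: repeatedly replace the matrix by its 2×2
        # "contractants", dividing by the interior of the matrix two steps back.
        A = [row[:] for row in M]
        prev = [[1]*len(M) for _ in M]   # divisors for the first step are all 1
        while len(A) > 1:
            n = len(A)
            B = [[(A[i][j]*A[i+1][j+1] - A[i][j+1]*A[i+1][j]) / prev[i+1][j+1]
                  for j in range(n-1)] for i in range(n-1)]
            prev, A = A, B
        return A[0][0]

    print(dodgson_det([[3, 1, 4, 1], [5, 9, 2, 6],
                       [0, 7, 1, 0], [2, 0, 2, 3]]))   # prints 228.0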

Division

The algorithm above involves division, so we have to avoid dividing by zero. Dodgson says to

Arrange the given block, if necessary, so that no ciphers [zeros] occur in its interior. This may be done either by transposing rows or columns, or by adding to certain rows the several terms of other rows multiplied by certain multipliers.

He expands on this remark and gives examples. I’m not sure whether this preparation serves any purpose beyond avoiding division by zero, but it does accomplish that.

If the original matrix has all integer entries, then the division in Dodgson’s condensation algorithm is exact. The sequence of matrices produced by the algorithm will all have integer entries.

Efficiency

Students usually learn cofactor expansion as their first method of calculating determinants. This rule is easy to explain, but inefficient since the number of steps required is O(n!).

The more efficient way to compute determinants is by Gaussian elimination with partial pivoting. As with condensation, one must avoid dividing by zero, hence the partial pivoting.

Gaussian elimination takes O(n³) operations, and so does Dodgson’s condensation algorithm. Condensation is easy to teach and easy to carry out by hand, but unlike cofactor expansion it scales well.

If a matrix has all integer entries, Gaussian elimination can produce non-integer values in intermediate steps. Condensation does not. Also, condensation is inherently parallelizable: each of the 2 × 2 determinants can be calculated simultaneously.


Gram matrix

An elegant algebraic identity says

(a^2 + b^2)(c^2 + d^2) = (ac + bd)^2 + (ad - bc)^2

If x is the vector [a b] and y is the vector [c d] then this identity can be written

\begin{vmatrix} a & b \\ c & d \end{vmatrix}^2 = \begin{vmatrix} x\cdot x & x \cdot y \\ x\cdot y & y \cdot y \end{vmatrix}

where the dot indicates the usual dot product. I posted this on Twitter the other day.
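
Here’s a quick symbolic check of the identity with SymPy: expanding the difference of the two sides gives zero.

    from sympy import symbols, expand

    a, b, c, d = symbols('a b c d')
    # The difference of the two sides of the identity expands to 0.
    print(expand((a**2 + b**2)*(c**2 + d**2) - (a*c + b*d)**2 - (a*d - b*c)**2))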

Gram matrix

Now suppose that x and y are vectors of any length n. The matrix on the right hand side above is called the Gram matrix of x and y, and its determinant is called the Gram determinant or Gramian.

Correlation

Let θ be the angle between x and y in ℝⁿ. Then

\cos\theta = \frac{\langle x , y\rangle} {||x|| \, || y ||}

where ⟨x, y⟩ is the dot product of x and y.

If x and y are data vectors with mean 0, then cos θ is their correlation. Since correlation is unaffected by scaling, we can scale y so that it has the same norm as x, i.e. ||x|| = ||y||. Then the Gram matrix G can be written in terms of cos θ:

G = \begin{pmatrix} \langle x, x\rangle & \langle x ,y\rangle \\ \langle x, y\rangle & \langle y ,y\rangle \end{pmatrix} = ||x||^2 \begin{pmatrix} 1 & \cos \theta \\ \cos \theta & 1 \end{pmatrix}
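
Here’s a small numerical sketch of the correlation claim using made-up data: after centering, cos θ computed from the dot product and norms matches the correlation coefficient reported by NumPy’s corrcoef.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(100)
    y = rng.standard_normal(100)
    x -= x.mean()   # center the data so that cosine = correlation
    y -= y.mean()

    cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    print(cos_theta, np.corrcoef(x, y)[0, 1])   # the two numbers agree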

Generalization

The idea of the Gram matrix generalizes to more than two vectors. If we have m vectors, the Gram matrix is the m × m matrix whose (i, j) entry is the dot product of the ith and jth vectors. Note that the dimension n of the vectors does not have to equal the dimension m of the Gram matrix.

Let B be the matrix whose columns are the vectors xi. Then BᵀB has the dot products of the vectors as its entries, i.e. BᵀB is the Gram matrix:

B^\intercal B = G

If the number of vectors m equals the dimension of the vectors n, then B is square and so

|B^\intercal B| = |B|^2 = |G|

If m does not equal n, then B is not square and |B| is not defined, but the Gram determinant still has geometric meaning: the square of the m-dimensional volume of the parallelepiped spanned by the x’s inside ℝⁿ equals their Gram determinant.
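
Here’s a small numerical sketch of that statement for two random vectors in ℝ⁵: the Gram determinant det(BᵀB) matches ‖x‖²‖y‖² − ⟨x, y⟩², the squared area of the parallelogram the two vectors span.

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 2))        # columns are the vectors x and y
    G = B.T @ B                            # 2 × 2 Gram matrix
    x, y = B[:, 0], B[:, 1]
    print(np.linalg.det(G))                # Gram determinant
    print((x @ x)*(y @ y) - (x @ y)**2)    # same value, up to rounding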

The Gram determinant can be defined for more general inner products than the dot product in ℝⁿ. The inner product could be, for example, the integral of the product of two functions.


Powers of a 2×2 matrix in closed form

Here’s something I found surprising: the powers of a 2×2 matrix have a fairly simple closed form. Also, the derivation is only one page [1].

Let A be a 2×2 matrix with eigenvalues α and β. (3Blue1Brown made a nice jingle for finding the eigenvalues of a 2×2 matrix.)

If α = β then the nth power of A is given by

A^n = \alpha^{n-1}\left( nA - (n-1)\alpha I\right)

If α ≠ β then the nth power of A is given by

A^n = \frac{\alpha^n}{\alpha - \beta} (A - \beta I) + \frac{\beta^n}{\beta-\alpha}(A - \alpha I)

Example

Let’s do an example with

A = \begin{bmatrix} 6 & 3 \\ 20 & 23 \end{bmatrix}

The eigenvalues are 26 and 3. I chose the matrix entries based on today’s date, not to have integer eigenvalues, and was surprised that they turned out so simple [2]. (More along those lines here.)

Here’s a little Python code to show that the formula above gives the same result as directly computing the cube of A.

    import numpy as np
    
    A = np.array([[6, 3], [20, 23]])
    
    # The eigenvalues of a 2×2 matrix are m ± sqrt(m² − p),
    # where m is half the trace and p is the determinant.
    m = (6 + 23)/2
    p = 6*23 - 3*20
    α = m + (m**2 - p)**0.5
    β = m - (m**2 - p)**0.5
    print(α, β)    # 26.0 3.0
    
    I = np.eye(2)
    direct = A @ A @ A    # compute A³ directly
    
    # Closed form for Aⁿ with distinct eigenvalues, n = 3
    formula = α**3*(A - β*I)/(α - β) + β**3*(A - α*I)/(β - α)
    print(direct)
    print(formula)
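
The example above only exercises the distinct-eigenvalue formula. Here’s a check of my own for the repeated-eigenvalue case, using a matrix chosen to have the double eigenvalue α = 2.

    import numpy as np

    A = np.array([[2, 1], [0, 2]])        # double eigenvalue α = 2
    α, n = 2, 3
    I = np.eye(2)
    print(A @ A @ A)                      # direct computation of A³
    print(α**(n-1) * (n*A - (n-1)*α*I))   # closed form gives the same matrix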

[1] Kenneth S. Williams. The nth Power of a 2×2 Matrix. Mathematics Magazine, Dec., 1992, Vol. 65, No. 5, p. 336.

[2] I wrote a script to find out how often this happens, and it’s more often than I would have guessed. There are 31 dates this year that would give integer eigenvalues if arranged as in the example.

Bibliography histogram

I recently noticed something in a book I’ve had for five years: the bibliography section ends with a histogram of publication dates for references. I’ve used the book over the last few years, but maybe I haven’t needed to look at the bibliography before.

publication date histogram

This is taken from Bernstein’s Matrix Mathematics. I wrote a review of it here.

An up-to-date bibliography isn’t always necessary. One of the references I use the most is nearly 60 years old, and another book I pull down occasionally is 100 years old. But sometimes you do need a current reference, and a histogram of publication dates is a good way for the author to demonstrate that the references are up to date.

By the way, notice the phantom diagonal lines across the histogram? That’s a moiré.[1]

***

[1] Pun on this and this.

Cofactors, determinants, and adjugates

Let A be an n × n matrix over a field F. The cofactor of an element Aij is the matrix formed by removing the ith row and jth column, denoted A[i, j].

This terminology is less than ideal. The matrix just described is called the cofactor of Aij, but it would more accurately be called the cofactor of (i, j) because it does not depend on the matrix entry Aij per se but rather its position in the matrix A. For example, if someone asked you what the cofactor of 7 is, you’d have to say “That depends. Where is the 7 located?”. But we will stick to standard terminology.

(I wrote a post about a similar phenomenon in statistical notation, but I can’t find it this morning. You can’t answer questions like “What is p(0.5)?” in statistics without first asking “If you were to write 0.5 using a variable, would you write it as x, y, or θ?”)

You can compute the determinant of a matrix by taking the alternating sum of elements of a row (or column) multiplied by the determinants of their cofactors.

\det A = \sum_{j=1}^n (-1)^{i+j} A_{i,j} \det A_{[i,j]}

This is the Laplace expansion, also known as cofactor expansion. It suggests a recursive procedure for computing determinants which is theoretically useful, but only computationally practical for very small matrices.

The adjugate of A, written adj(A), is the transpose of the matrix formed by replacing the (i, j) element of A with an alternating sign times the determinant of its cofactor [1].

\text{adj}(A)_{ij} = (-1)^{i+j} \det A_{[j,i]}

As you might guess, the Mathematica command to compute the adjugate is Adjugate.

Note that “adjugate” is not a typo for “adjoint.” The adjoint of A (now) means something different. What we now call the adjugate was sometimes in the past called the adjoint. What we now call the adjoint is the (conjugate) transpose.

Here’s an application of the adjugate. Let A be a 3 × 3 matrix. Then the characteristic polynomial of A is

x^3 - (\text{tr}\, A) x^2 + \text{tr}\,\left(\text{adj}(A)\right)x - \det A

where tr denotes the trace, the sum of the diagonal elements of a matrix.

Since the trace only depends on the diagonal, you only need to compute the diagonal of the adjugate to use the equation above for the characteristic polynomial. And the alternating signs in the definition of the adjugate are all positive on the diagonal. So the trace of the adjugate is the sum of the determinants of the cofactors of the diagonal positions: det A[1, 1] + det A[2, 2] + det A[3, 3].
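
Here’s a quick numerical check of the characteristic polynomial formula on a 3 × 3 matrix I made up, computing tr(adj(A)) as the sum of the determinants of the diagonal cofactors and comparing with the coefficients NumPy computes.

    import numpy as np

    A = np.array([[3., 1., 4.], [1., 5., 9.], [2., 6., 5.]])

    def cofactor_det(A, i, j):
        # determinant of the cofactor of position (i, j): delete row i, column j
        return np.linalg.det(np.delete(np.delete(A, i, axis=0), j, axis=1))

    tr_adj = sum(cofactor_det(A, i, i) for i in range(3))
    coeffs = [1, -np.trace(A), tr_adj, -np.linalg.det(A)]
    print(np.round(coeffs, 6))
    print(np.round(np.poly(A), 6))   # NumPy's characteristic polynomial coefficients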

Circulant matrices commute

A few days ago I wrote that circulant matrices all have the same eigenvectors. This post will show that it follows that circulant matrices commute with each other.

Recall that a circulant matrix is a square matrix in which the rows are cyclic permutations of each other. If we number the rows from 0, then the kth row takes the last k entries of the top row and moves them to the front.

Numerical example

We can generate two random circulant matrices and confirm that they commute by modifying the Python code from the earlier post.

    import numpy as np

    np.random.seed(42)
    N = 6
    row = np.random.random(N)
    A = np.array([np.roll(row, i) for i in range(N)])
    row = np.random.random(N)
    B = np.array([np.roll(row, i) for i in range(N)])
    AB = np.matmul(A, B)
    BA = np.matmul(B, A)
    print(np.absolute(AB - BA).max())

This computes two random 6 by 6 circulant matrices A and B and shows that the difference between AB and BA is on the order of the limit of floating point precision. Specifically, this prints 4.44 × 10⁻¹⁶.

Proof

Now for a proof. Let A and B be two n × n matrices that have the same set of n linearly independent eigenvectors e₁ through eₙ. In the case of circulant matrices we can take eₖ to be the kth column of the FFT matrix, but the proof here applies to any matrices with common eigenvectors.

Let αₖ be the eigenvalue for eₖ as an eigenvector of A and let βₖ be the eigenvalue for eₖ as an eigenvector of B. Then

 (AB)e_k = A(Be_k) = A(\beta_k e_k) = \beta_k(Ae_k) = \beta_k\alpha_k e_k

and

(BA)e_k = B(Ae_k) = B(\alpha_k e_k) = \alpha_k(Be_k) = \alpha_k\beta_k e_k

and so multiplying by AB or BA has the same effect on eₖ since the complex numbers αₖ and βₖ commute. Since the eigenvectors form a basis, and multiplication by AB and BA has the same effect on each basis vector, AB = BA.

In short, matrices that share eigenvectors commute because eigenvalues commute.
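
Here’s a small numerical illustration of the general statement, using matrices I constructed to share eigenvectors without being circulant.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 5
    P = rng.standard_normal((n, n))      # columns play the role of shared eigenvectors
    Pinv = np.linalg.inv(P)
    A = P @ np.diag(rng.standard_normal(n)) @ Pinv
    B = P @ np.diag(rng.standard_normal(n)) @ Pinv
    print(np.abs(A @ B - B @ A).max())   # tiny, limited only by floating point error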

Circulant matrices, eigenvectors, and the FFT

A circulant matrix is a square matrix in which each row is a rotation of the previous row. This post will illustrate a connection between circulant matrices and the FFT (Fast Fourier Transform).

Circulant matrices

Fill in the first row however you want. Then move the last element to the front to make the next row. Repeat this process until the matrix is full.

The NumPy function roll will do this rotation for us. Its first argument is the row to rotate, and the second argument is the number of rotations to do. So the following Python code generates a random circulant matrix of size N.

    import numpy as np

    np.random.seed(20230512)
    N = 5
    row = np.random.random(N)
    M = np.array([np.roll(row, i) for i in range(N)])

Here’s the matrix M with entries truncated to 3 decimal places to make it easier to read.

    [[0.556 0.440 0.083 0.460 0.909]
     [0.909 0.556 0.440 0.083 0.460]
     [0.460 0.909 0.556 0.440 0.083]
     [0.083 0.460 0.909 0.556 0.440]
     [0.440 0.083 0.460 0.909 0.556]]    

Fast Fourier Transform

The Fast Fourier Transform is a linear transformation and so it can be represented by a matrix [1]. This is the N by N matrix whose (j, k) entry is ω^(jk) where ω = exp(−2πi/N), with j and k running from 0 to N − 1. [2]

Each element of the FFT matrix corresponds to a rotation, so you could visualize the matrix using clocks in each entry or by a cycle of colors. A few years ago I created a visualization using both clock faces and colors.

Eigenvectors and Python code

Here’s a surprising property of circulant matrices: the eigenvectors of a circulant matrix depend only on the size of the matrix, not on the elements of the matrix. Furthermore, these eigenvectors are the columns of the FFT matrix. The eigenvalues depend on the matrix entries, but the eigenvectors do not.

Said another way, when you multiply a circulant matrix by a column of the FFT matrix of the same size, this column will be stretched but not rotated. The amount of stretching depends on the particular circulant matrix.

Here is Python code to illustrate this.

    for i in range(N):
        ω = np.exp(-2j*np.pi/N)
        col1 = np.array([ω**(i*j) for j in range(N)])  # ith column of the FFT matrix
        col2 = np.matmul(M, col1)                      # image under the circulant matrix M
        print(col2/col1)                               # constant vector: the eigenvalue

In this code col1 is a column of the FFT matrix, and col2 is the image of that column when multiplied by the circulant matrix M. The print statement shows that the ratio of corresponding elements is the same in every position; this common ratio is the eigenvalue associated with the eigenvector col1. If you were to generate a new random circulant matrix, the ratios would change, but the input and output vectors would still be proportional.
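
As a further check of my own, continuing the script above: with this sign convention the eigenvalue attached to the kth column of the FFT matrix is the kth entry of the discrete Fourier transform of the first row, so np.fft.fft(row) reproduces all the eigenvalues at once.

    eigvals = []
    for k in range(N):
        col = np.array([np.exp(-2j*np.pi*k*j/N) for j in range(N)])
        eigvals.append((M @ col)[0] / col[0])     # the common ratio, i.e. the eigenvalue
    print(np.allclose(eigvals, np.fft.fft(row)))  # prints True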

Notes

[1] Technically this is the discrete Fourier transform (DFT). The FFT is an algorithm for computing the DFT. Because the DFT is nearly always computed via the FFT, the transform itself is usually referred to as the FFT.

[2] Conventions vary, so you may see the FFT matrix written differently.

What does rotating a matrix do to its determinant?

This post will look at rotating a matrix 90° and what that does to the determinant. This post was motivated by the previous post. There I quoted a paper that had a determinant with 1s in the right column. I debated rotating the matrix so that the 1s would be along the top because that seems more natural, but in the end I kept the original notation. If I had rotated the matrix, I’d need to adjust the value of the determinant.

Rotating a matrix is an unusual operation. When someone speaks of turning a matrix over, they usually mean taking the transpose; taking the transpose is a reflection, not a rotation. And when you see the words “rotation” and “matrix” in close proximity, the topic is usually a matrix that represents a rotation. That’s not what this post is about: the matrix isn’t performing a rotation; a rotation is being performed on it.

Here’s an example:

\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} \rightarrow \begin{pmatrix} c & f & i \\ b & e & h \\ a & d & g \end{pmatrix}

Using subscripts makes the example above a little harder to read but much easier to formalize.

\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \rightarrow \begin{pmatrix} a_{13} & a_{23} & a_{33} \\ a_{12} & a_{22} & a_{32} \\ a_{11} & a_{21} & a_{31} \end{pmatrix}

If we denote a matrix by A and its image after rotation as A′, then the components a′ of A′ are related to the components a of A by

a'_{ij} = a_{j, n-i+1}

To form the rotation of A, we take the transpose of A, then reverse the order of the rows. In matrix notation,

A' = JA^T

where J is the matrix that has 1’s on the secondary diagonal, the diagonal running southwest to northeast, and 0’s everywhere else. The matrix J is the rotation of the identity matrix I.

The determinant of A′ is the determinant of J times the determinant of Aᵀ, and the determinant of Aᵀ is the same as the determinant of A.

\det A' = \det J \det A^T = \det J \det A

So the last thing we need to do to compute the determinant of A′ is to compute the determinant of J.

We can turn I into J, and vice versa, by reversing the order of the rows. Let’s do this one pair of rows at a time. We swap the first and last rows. Then we swap the second row with the second to last row, then swap the third row from the top with the third row from the bottom, etc.

Suppose I is an n × n matrix, and first assume n is even. Then we need to do n/2 swaps, swapping each of the top n/2 rows with one of the bottom n/2 rows. Each swap reverses the sign of the determinant, and so the determinant of J is (−1)^(n/2). Now if n is odd, the middle row doesn’t move. So we swap the top (n − 1)/2 rows with the bottom (n − 1)/2 rows, and in that case the determinant of J is (−1)^((n−1)/2). We can combine the even and the odd case using floor notation:

\det A' = (-1)^{\lfloor n/2\rfloor} \det A
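
Here’s a quick numerical check of this formula. NumPy’s rot90 rotates a matrix 90° counterclockwise, which matches the example above.

    import numpy as np

    rng = np.random.default_rng(3)
    for n in range(2, 7):
        A = rng.standard_normal((n, n))
        lhs = np.linalg.det(np.rot90(A))
        rhs = (-1)**(n//2) * np.linalg.det(A)
        print(n, np.isclose(lhs, rhs))   # prints True for each n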

Gaussian elimination

Carl Friedrich Gauss

When you solve systems of linear equations, you probably use Gaussian elimination, even if you don’t call it that. You may learn Gaussian elimination before you see it formalized in terms of matrices.

So if you’ve had a course in linear algebra, and you sign up for a course in numerical linear algebra, it’s natural to expect that Gaussian elimination would be one of the first things you talk about. That was my experience, and it was painful. I had this odd mixture of feeling like I already knew what the professor was talking about, while at the same time feeling lost.

Trefethen and Bau do not start their numerical linear algebra textbook with Gaussian elimination, and they explain why in the preface:

We have departed from the customary practice by not starting with Gaussian elimination. That algorithm is atypical of numerical linear algebra, exceptionally difficult to analyze, yet at the same time tediously familiar to every student in a course like this. Instead, we begin with the QR factorization, which is more important, less complicated, and a fresher idea to most students.

It’s reassuring to hear experts in numerical linear algebra admit that Gaussian elimination is “exceptionally difficult to analyze.” The algorithm is not exceptionally difficult to perform; children learn to manually execute the algorithm. But it is surprisingly difficult, and tedious, to analyze.

Numerical analysts like Trefethen and Bau have much more sophisticated questions in mind than how you would naively carry out the algorithm by hand. Numerical analysts are concerned, for example, with stability.

If your coefficients are known exactly and you carry out your arithmetic exactly, then you get an exact result. But can a small change to your inputs, say due to measurement uncertainty, or the inevitable loss of precision from floating point arithmetic, make a large change in your results? Absolutely, unless you use pivoting. But with pivoting, Gaussian elimination is stable [1].
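
Here’s a tiny example of my own showing what can go wrong. Eliminating with a very small pivot loses the answer entirely, while a library solver, which pivots, has no trouble.

    import numpy as np

    A = np.array([[1e-20, 1.0], [1.0, 1.0]])
    b = np.array([1.0, 2.0])

    # Naive elimination, using the tiny 1e-20 entry as the pivot
    m = A[1, 0] / A[0, 0]
    u = A[1, 1] - m * A[0, 1]
    y = b[1] - m * b[0]
    x1 = y / u
    x0 = (b[0] - A[0, 1] * x1) / A[0, 0]
    print(x0, x1)                  # x0 comes out as 0, badly wrong
    print(np.linalg.solve(A, b))   # pivoting solver: approximately [1, 1]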

After taking an introductory linear algebra course, you know how to solve any linear system of equations in principle. But when you actually want to compute the solution accurately, efficiently, at scale, in a real computer, with real data, things can get much more interesting. Many people have devoted their careers to doing in practice what high school students know how to do in principle.


[1] There are matrices for which Gaussian elimination is unstable, but they are rare. It’s been shown that they are rare in the sense of having low probability, but more importantly they never arise naturally in applied problems.

Self-orthogonal vectors and coding

One of the surprising things about linear algebra over a finite field is that a non-zero vector can be orthogonal to itself.

When you take the inner product of a real vector with itself, you get a sum of squares of real numbers. If any element in the sum is positive, the whole sum is positive.

In a finite field, things don’t work that way. A list of non-zero squares can add up to zero. For example, 1² + 1² + 1² ≡ 0 (mod 3).

In the previous post, we looked at the ternary Golay code, an error correcting code that multiplies a vector of data by a rectangular matrix mod 3. That post included the example that

[1, 0, 1, 2, 2]

was encoded as

[1 0 1 2 2 0 1 0 2 0 0].

The output is orthogonal to itself:

1² + 0² + 1² + 2² + 2² + 0² + 1² + 0² + 2² + 0² + 0² ≡ 0 (mod 3).

In fact, every output of the ternary Golay code will be self-orthogonal. To see this, let v be a 1 by 5 row vector. Then its encoding is vG where G is the 5 by 11 matrix in the previous post. The inner product of vG with itself is

vG(vG)^\intercal = vGG^\intercal v^\intercal.

We can easily show that GGᵀ is the zero matrix using Python:

    >>> (G@G.T)%3
    [[0 0 0 0 0]
     [0 0 0 0 0]
     [0 0 0 0 0]
     [0 0 0 0 0]
     [0 0 0 0 0]]

And so

vGG^\intercal v^\intercal = vOv^\intercal = 0.

If a vector is not self-orthogonal, there has been an error. For example, if one of the 0s in our output turned into a 1 or a 2, the inner product of the corrupted vector with itself would be equal to 1 (mod 3) rather than 0.

Self-orthogonality is necessary but not sufficient for an encoded vector to be error-free. Correctly encoded vectors form a 5-dimensional subspace of GF(3)¹¹, namely the row space of G. The set of self-orthogonal vectors in GF(3)¹¹ is a larger set.

Another necessary condition for a row vector w to be the encoding of some data vector v is for

Gw^\intercal

to be the zero vector (carrying out all calculations mod 3). This follows from GGᵀ = 0: if w = vG, then Gwᵀ = GGᵀvᵀ = 0.

The ternary Golay code is capable of detecting and correcting corruption in up to two spots in a vector. For example, suppose we take our vector

[1 0 1 2 2 0 1 0 2 0 0]

above and corrupt it by turning the first two 2’s into 1’s.

[1 0 1 1 1 0 1 0 2 0 0]

We’d still get a self-orthogonal vector, but the product Gwᵀ would not be zero.

    >>> v = np.array( [1, 0, 1, 2, 2] )
    >>> w = (v@G)%3
    >>> w[3] = w[4] = 1
    >>> print((G@w)%3)

This prints

    [0 0 0 2 2]

which lets us know that the modified version of w is not a valid encoding, because it is not in the row space of G.

If we know (or are willing to assume) that

[1 0 1 1 1 0 1 0 2 0 0]

is the result of no more than two changes applied to a valid code word vG, then we can recover vG by looking for the closest vector in the row space of G. Here closeness is measured in Hamming distance: the number of positions in which two vectors differ. Because distinct code words differ in at least five positions, no vector in GF(3)¹¹ is within a distance of 2 of more than one code word, and so the code can correct any two errors.
