Fractional linear and linear

A function of the form

g(z) = \frac{az + b}{cz + d}

where adbc ≠ 0 is sometimes called a fractional linear transformation or a bilinear transformation. I usually use the name Möbius transformation.

In what sense are Möbius transformations linear transformations? They’re nonlinear functions unless b = c = 0. And yet they’re analogous to linear transformations. For starters, the condition adbc ≠ 0 appears to be saying that a determinant is non-zero, i.e. that a matrix is non-singular.

The transformation g is closely associated with the matrix

\begin{pmatrix} a & b \\ c & d \end{pmatrix}

but there’s more going on than a set of analogies. The reason is that Möbius transformation are linear transformations, but not on the complex numbers ℂ.

When you’re working with Möbius transformations, you soon want to introduce ∞. Things get complicated if you don’t. Once you add ∞ theorems become much easier to state, and yet there’s a nagging feeling that you may be doing something wrong by informally introducing ∞. This feeling is justified because tossing around ∞ without being careful can lead to wrong conclusions.

So how can we rigorously deal with ∞? We could move from numbers (real or complex) to pairs of numbers, as with fractions. We replace the complex number z with the equivalence class of all pairs of complex numbers whose ratio is z. The advantage of this approach is that you get to add one special number, the equivalence class of all pairs whose second number is 0, i.e. fractions with zero in the denominator. This new number system is called P(ℂ), where “P” stands for “projective.”

Möbius transformations are projective linear transformations. They’re linear on P(ℂ), though not on ℂ.

When we multiply the matrix above by the column vector (z 1)T we get

\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} z \\ 1 \end{pmatrix} = \begin{pmatrix} az + b \\ cz + d \end{pmatrix}

and since our vectors are essentially fractions, the right hand side corresponds to g(z) if the second component of the vector, cz + d, is not zero.

If cz + d = 0, that’s OK. Everything is fine while we’re working in P(ℂ), but we get an element of P(ℂ) that does not correspond to an element of ℂ, i.e. we get ∞.

We’ve added ∞ to the domain and range of our Möbius transformations without any handwaving. We’re just doing linear algebra on finite complex numbers.

There’s a little bit of fine print. In P(ℂ) we can’t allow both components of a pair to be 0, and non-zero multiples of the same vector are equivalent, so we’re not quite doing linear algebra. Strictly speaking a Möbius transformation is a projective linear transformation, not a linear transformation.

It takes a while to warm up to the idea of moving from complex numbers to equivalence classes of pairs of complex numbers. The latter seems unnecessarily complicated. And it often is unnecessary. In practice, you can work in P(ℂ) by thinking in terms of ℂ until you need to have to think about ∞. Then you go back to thinking in terms of P(ℂ). You can think of P(ℂ) as ℂ with a safety net for working rigorously with ∞.

Textbooks usually introduce higher dimensional projective spaces before speaking later, if ever, of one-dimensional projective space. (Standard notation would write P¹(ℂ) rather than P(ℂ) everywhere above.) But one-dimensional projective space is easier to understand by analogy to fractions, i.e. fractions whose denominator is allowed to be zero, provided the numerator is not also zero.

I first saw projective coordinates as an unmotivated definition. “Good morning everyone. We define Pn(ℝ) to be the set of equivalence classes of ℝn+1 where ….” There had to be some payoff for this added complexity, but we were expected to delay the gratification of knowing what that payoff was. It would have been helpful if someone had said “The extra coordinate is there to let us handle points at infinity consistently. These points are not special at all if you present them this way.” It’s possible someone did say that, but I wasn’t ready to hear it at the time.

Related posts

Jordan normal form: 1’s above or below diagonal?

Given a square complex matrix A, the Jordan normal form of A is a matrix J such that

P^{-1}A P = J

and J has a particular form. The eigenvalues of A are along the diagonal of J, and the elements above the diagonal are 0s or 1s. There’s a particular pattern to the 1s, giving the matrix J a block structure, but that’s not the focus of this post.

Some books say a Jordan matrix J has the eigenvalues of A along the diagonal and 0s and 1s below the diagonal.

So we have two definitions. Both agree that the non-zero elements of J are confined to the main diagonal and an adjacent diagonal, but they disagree on whether the secondary diagonal is above or below the main diagonal. It’s my impression that placing the 1s below the main diagonal is an older convention. See, for example, [1]. Now I believe it’s more common to put the 1s above the main diagonal.

How are these two conventions related and how might you move back and forth between them?

It’s often harmless to think of linear transformations and matrices as being interchangeable, but for a moment we need to distinguish them. Let T be a linear transformation and let A be the matrix that represents T with respect to the basis

e_1, e_2, e_3, \ldots, e_N

Now suppose we represent T by a new basis consisting of the same vectors but in the opposite order.

\bar{e}_i = e_{N-i+1}

If we reverse the rows and columns of A then we have the matrix for the representation of T with respect to the new basis.

So if J is a matrix with the eigenvalues of A along the diagonal and 0s and 1s above the diagonal, and we reverse the order of our basis, then we get a new matrix J′ with the eigenvalues of A along the diagonal (though in the opposite order) and 0s and  1s below the diagonal. So J and J′ represent the same linear transformation with respect to different bases.

Matrix calculations

Let R be the matrix formed by starting with the identity matrix I and reversing all the rows. So while I has 1s along the NW-SE diagonal, R has 1s along the SW-NE diagonal.

Reversing the rows of A is the same as multiplying A by R on the right.

Reversing the columns of A is the same as multiplying A by R on the left.

Here’s a 3 by 3 example:

\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} = \begin{pmatrix} i & h & g \\ f & e & d \\ c & b & a \end{pmatrix}

Note that the matrix R is its own inverse. So if we have

P^{-1}A P = J

then we can multiply both sides on the left and right by R.

RP^{-1}APR = RJR

If J has 1s above the main diagonal, then RJR has 1s below the main diagonal. And if J has 1’s below the main diagonal, RJR has 1s above the main diagonal.

Since R is its own inverse, we have

RP^{-1}APR = R^{-1}P^{-1}APR = (PR)^{-1}A(PR)

This says that if the similarity transform by P puts A into Jordan form with 1’s above (below) the diagonal, then the similarity transform by PR puts A into Jordan form with 1’s below (above) the diagonal.

[1] Hirsch and Smale. Differential Equations, Dynamical Systems, and Linear Algebra. 1974.

The numerical range ellipse

Let A be an n × n complex matrix. The numerical range of A is the image of x*Ax over the unit sphere. That is, the numerical range of A is the set W(A) in defined by

W(A) = {x*Ax | x ∈ ℂn and ||x|| = 1}

where x* is the conjugate transpose of the column vector x.

The product of a row vector and a column vector is a scalar, so W(A) is a subset of the complex plane ℂ.

When A is a 2 × 2 matrix, W(A) is an ellipse, and the two foci are the eigenvalues of A.

Denote the two eigenvalues by λ1 and λ2 . Denote the major and minor axes of the ellipse by a and b respectively. Then

b² = tr(A*A) − |λ1|² − |λ2|².

where tr is the trace operator. This is the elliptical range theorem [1]. The theorem doesn’t state the major axis a explicitly, but it can be found from the two foci and the minor axis b.


We will create a visualization the numerical range of a matrix directly from the definition, generating x‘s at random and plotting x*Ax. Then we draw the ellipse described in the elliptical range theorem and see that it forms the boundary of the random points.

The following code plots 500 random points in the numerical range of a matrix formed from the numbers in today’s date. The eigenvalues are plotted in black.

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

A = np.array([[8, 16], [20, 23j]])

for _ in range(5000):
    v = norm(0, 1).rvs(size=4)
    v /= np.linalg.norm(v)
    x = np.array([v[0] + 1j*v[1], v[2] + 1j*v[3]])
    z = np.conj(x).dot(
    plt.plot(z.real, z.imag, 'o', color='0.9')

eigenvalues, eigenvectors = np.linalg.eig(A)    
plt.plot(eigenvalues[0].real, eigenvalues[0].imag, 'ko')
plt.plot(eigenvalues[1].real, eigenvalues[1].imag, 'ko')

The following code draws the ellipse guaranteed by the elliptical range theorem.

A_star_A =, A)
b = (np.trace(A_star_A) - abs(eigenvalues[0])**2 - abs(eigenvalues[1])**2)**0.5
c = 0.5*abs(eigenvalues[0] - eigenvalues[1])
a = (2/3)*(2*(2*b**2/4 + c**2)**0.5 + c)

t = np.linspace(0, np.pi*2)
z = a*np.cos(t)/2 + 1j*b*np.sin(t)/2
u = (eigenvalues[1] - eigenvalues[0])/2
u /= np.linalg.norm(u)
m = eigenvalues.mean()
z = z*u + m

plt.plot(z.real, z.imag, 'b-')

Here is the result.

Here is the result of running the same code with 10,000 random points.

Related posts

[1] Ulrich Daepp, Pamela Gorkin, Andrew Shaffer, and Karl Voss. Finding Ellipses: What Blaschke Products, Poncelet’s Theorem, and the Numerical Range Know about Each Other. MAA Press 2018.

Visualizing a determinant identity

The previous post discussed an algorithm developed by Charles Dodgson (better known as Lewis Carroll) for computing determinants.

The key identity for proving that Dodgson’s algorithm is correct involves the Desnanot-Jacobi identity from 1841. The identity is intimidating in its symbolic form and yet easy to visualize. In algebraic form the identity says

\det({M}) \det(M_{1,k}^{1,k}) = \det(M_1^1)\det(M_k^k) -\det(M_1^k) \det(M_k^1)

Here a subscript i means to delete the ith row of M, and a superscript i means to remove the ith column. When there are two subscripts (superscripts) then there are two rows (columns) to remove.

This dense notation hides the high degree of symmetry in this identity. In the image below, we gray-out the rows and columns that are to be removed. Note that the right hand side of the Desnanot-Jacobi identity is a 2×2 determinant of determinants.

The left side of the image represents the product of two determinants, a full k × k matrix and the k−2 × k−2 matrix formed by removing the edges of the matrix.

The right side of the image is a 2 × 2 determinant who entries are themselves determinants. The four k−1 × k−1 matrices on the right are formed by stripping off the outside row and column of each matrix.

Here’s a demonstration of the identity using the matrix from the example in the previous post and Mathematica.

    a = Det[{{3, 1, 4, 1}, {5, 9, 2, 6}, {0, 7, 1, 0}, {2, 0, 2, 3}}];
    b = Det[{{9, 2}, {7, 1}}];
    c = Det[{{9, 2, 6}, {7, 1, 0}, {0, 2, 3}}];
    d = Det[{{5, 9, 2}, {0, 7, 1}, {2, 0, 2}}];
    e = Det[{{1, 4, 1}, {9, 2, 6}, {7, 1, 0}}];
    f = Det[{{3, 1, 4}, {5, 9, 2}, {0, 7, 1}}];

    a b == c f - d e

How Lewis Carroll computed determinants

Mad hatter from Alice in Wonderland by Lewis Carroll

Charles Dodgson, better known by his pen name Lewis Carroll, discovered a method of calculating determinants now known variously as the method of contractants, Dodgson condensation, or simply condensation.

The method was devised for ease of computation by hand, but it has features that make it a practical method for computation by machine.


The basic idea is to repeatedly condense a matrix, replacing it by a matrix with one less row and one less column. Each element is replaced by the determinant of the 2×2 matrix formed by that element and its neighbors to the south, east, and southeast. The bottom row and rightmost column have no such neighbors and are removed. There is one additional part of the algorithm that will be easier to describe after introducing some notation.


Let A be the matrix whose determinant we want to compute and let A(k) be the matrix obtained after k steps of the condensation algorithm.

The matrix A(1) is computed as described in the overview:

a^{(1)}_{ij} = a_{ij} a_{i+1, j+1} - a_{i, j+1} a_{i+1, j}.

Starting with A(2) the terms are similar, except each 2×2 determinant is divided by an element from two steps back:

a^{(k+1)}_{ij} = \frac{1}{a^{(k-1)}_{i+1,j+1}}\left(a^{(k)}_{ij} a^{(k)}_{i+1, j+1} - a^{(k)}_{i, j+1} a^{(k)}_{i+1, j} \right)

Dodgson’s original paper from 1867 is quite readable, surprisingly so given that math notation and terminology changes over time.

One criticism I have of the paper is that it is hard to understand which element should be in the denominator, whether the subscripts should be i and j or i+1 and j+1. His first example doesn’t clarify this because these elements happen to be equal in the example.


Here’s an example using condensation to find the determinant of a 4×4 matrix.

\begin{align*} A^{(0)} &= \begin{bmatrix} 3 & 1 & 4 & 1 \\ 5 & 9 & 2 & 6 \\ 0 & 7 & 1 & 0 \\ 2 & 0 & 2 & 3 \end{bmatrix} \\ A^{(1)} &= \begin{bmatrix} 22 &-34 & 22 \\ 35 & -5 &-6 \\ -14 & 14 & 3 \end{bmatrix} \\ A^{(2)} &= \begin{bmatrix} 120 & 157 \\ 60 & 69 \end{bmatrix} \\ A^{(3)} &= \begin{bmatrix} 228 \end{bmatrix} \end{align*}

We can verify this with Mathematica:

    Det[{{3, 1, 4, 1}, {5, 9, 2, 6}, 
         {0, 7, 1, 0}, {2, 0, 2, 3}}]

which also produces 228.


The algorithm above involves a division and so we should avoid dividing by zero. Dodgson says to

Arrange the given block, if necessary, so that no ciphers [zeros] occur in its interior. This may be done either by transposing rows or columns, or by adding to certain rows the several terms of other rows multiplied by certain multipliers.

He expands on this remark and gives examples. I’m not sure whether this preparation is necessary only to avoid division by zero, but it does avoid the problem of dividing by a zero.

If the original matrix has all integer entries, then the division in Dodgson’s condensation algorithm is exact. The sequence of matrices produced by the algorithm will all have integer entries.


Students usually learn cofactor expansion as their first method of calculating determinants. This rule is easy to explain, but inefficient since the number of steps required is O(n!).

The more efficient way to compute determinants is by Gaussian elimination with partial pivoting. As with condensation, one must avoid dividing by zero, hence the partial pivoting.

Gaussian elimination takes O(n³) operations, and so does Dodgson’s condensation algorithm. Condensation is easy to teach and easy to carry out by hand, but unlike cofactor expansion it scales well.

If a matrix has all integer entries, Gaussian elimination can produce non-integer values in intermediate steps. Condensation does not. Also, condensation is inherently parallelizable: each of the 2 × 2 determinants can be calculated simultaneously.

Related posts

Gram matrix

An elegant algebraic identity says

(a^2 + b^2)*c^2 + d^2) = (ac + bd)^2 + (ad - bc)^2

If x is the vector [a b] and y is the vector [c d] then this identity can be written

\begin{vmatrix} a & b \\ c & d \end{vmatrix}^2 = \begin{vmatrix} x\cdot x & x \cdot y \\ x\cdot y & y \cdot y \end{vmatrix}

where the dot indicates the usual dot product. I posted this on Twitter the other day.

Gram matrix

Now suppose that x and y are vectors of any length n. The matrix on the right hand side above is called the Gram matrix of x and y, and its determinant is called the Gram determinant or Gramian.


Let θ be the angle between x and y in ℝn. Then

\cos\theta = \frac{\langle x , y\rangle} {||x|| \, || y ||}

where ⟨x, y⟩ is the dot product of x and y.

If x and y are data vectors with mean 0, then cos θ is their correlation. Since correlation is unaffected by scaling, we can scale y so that it has the same norm as x, i.e. ||x|| = ||y||. Then the Gram matrix G can be written in terms of cos θ:

G = \begin{pmatrix} \langle x, x\rangle & \langle x ,y\rangle \\ \langle x, y\rangle & \langle y ,y\rangle \end{pmatrix} = ||x||^2 \begin{pmatrix} 1 & \cos \theta \\ \cos \theta & 1 \end{pmatrix}


The idea of the Gram matrix generalizes to more than two vectors. If we have m vectors, the Gram matrix is m × m whose (i, j) entry is the dot product of the ith and jth vectors. Note that the dimension n of the vectors does not have to equal the dimension m of the Gram matrix.

Let B be the matrix whose columns are the vectors xi. If the number of vectors m does equal the dimension of the vectors n then BT B equals the Gram matrix.

B^\intercal B = G

and so

|B^\intercal B| = |B|^2 = |G|

If m does not equal n then BT B cannot equal G because the two matrices have different dimensions, though the determinants of the two matrices are equal. That is, the square of the m-dimensional volume of the span of the x‘s inside ℝn equals their Gram determinant.

The Gram determinant can be defined for more general inner products than the dot product in ℝn. It could be, for example, the integral of the product of two functions.

Related posts

Powers of a 2×2 matrix in closed form

Here’s something I found surprising: the powers of a 2×2 matrix have a fairly simple closed form. Also, the derivation is only one page [1].

Let A be a 2×2 matrix with eigenvalues α and β. (3Blue1Brown made a nice jingle for finding the eigenvalues of a 2×2 matrix.)

If α = β then the nth power of A is given by

A^n = \alpha^{n-1}\left( nA - (n-1)\alpha I\right)

If α ≠ β then the nth power of A is given by

A^n = \frac{\alpha^n}{\alpha - \beta} (A - \beta I) + \frac{\beta^n}{\beta-\alpha}(A - \alpha I)


Let’s do an example with

A = \begin{bmatrix} 6 & 3 \\ 20 & 23 \end{bmatrix}

The eigenvalues are 26 and 3. I chose the matrix entries based on today’s date, not to have integer eigenvalues, and was surprised that they turned out so simple [2]. (More along those lines here.)

Here’s a little Python code to show that the formula above gives the same result as directly computing the cube of A.

    import numpy as np
    A = np.matrix([[6, 3], [20, 23]])
    m = (6 + 23)/2
    p = 6*23 - 3*20
    α = m + (m**2 - p)**0.5
    β = m - (m**2 - p)**0.5
    print(α, β)
    I = np.eye(2)
    direct = A*A*A
    formula = α**3*(A - β*I)/(α - β) + β**3*(A - α*I)/(β - α)

[1] Kenneth S. Williams. The nth Power of a 2×2 Matrix. Mathematics Magazine, Dec., 1992, Vol. 65, No. 5, p. 336.

[2] I wrote a script to find out how often this happens, and it’s more often than I would have guessed. There are 31 dates this year that would give integer eigenvalues if arranged as in the example.

Bibliography histogram

I recently noticed something in a book I’ve had for five years: the bibliography section ends with a histogram of publication dates for references. I’ve used the book over the last few years, but maybe I haven’t needed to look at the bibliography before.

publication date histogram

This is taken from Bernstein’s Matrix Mathematics. I wrote a review of it here.

An up-to-date bibliography isn’t always necessary. One of the references I use the most is nearly 60 years old, and another book I pull down occasionally is 100 years old. But sometimes you do need a current reference, and a histogram of publication dates is a good way for the author to demonstrate that the references are up to date.

By the way, notice the phantom diagonal lines across the histogram? That’s a moiré.[1]


[1] Pun on this and this.

Cofactors, determinants, and adjugates

Let A be an n × n matrix over a field F. The cofactor of an element Aij is the matrix formed by removing the ith row and jth column, denoted A[i, j].

This terminology is less than ideal. The matrix just described is called the cofactor of Aij, but it would more accurately be called the cofactor of (i, j) because it does not depend on the matrix entry Aij per se but rather its position in the matrix A. For example, if someone asked you what the cofactor of 7 is, you’d have to say “That depends. Where is the 7 located?”. But we will stick to standard terminology.

(I wrote a post about a similar phenomenon in statistical notation, but I can’t find it this morning. You can’t answer questions like “What is p(0.5)?” in statistics by without first asking “If you were to write 0.5 using a variable, would you write it as x, y, or θ?”)

You can compute the determinant of a matrix by taking the alternating sum of elements of a row (or column) multiplied by the determinants of their cofactors.

\det A = \sum_{j=1}^n (-1)^{i+j} A_{i,j} \det A_{[i,j]}

This is essentially Cramer’s rule. It suggests a recursive procedure for computing determinants which is theoretically useful, but only computationally useful for very small matrices.

The adjugate of A, written adj(A), is the transpose of the matrix formed by replacing the (i, j) element of A with an alternating sign times its cofactor [1].

\text{adj}(A)_{ij} = (-1)^{i+j} \det A[j,i]

As you might guess, the Mathematica command to compute the adjugate is Adjugate.

Note that “adjugate” is not a typo for “adjoint.” The adjoint of A (now) means something different. What we now call the adjugate was sometimes in the past called the adjoint. What we now call the adjoint is the (conjugate) transpose.

Here’s an application of the adjugate. Let A be a 3 × 3 matrix. Then the characteristic polynomial of A is

x^3 - (\text{tr}\, A) x^2 + \text{tr}\,\left(\text{adj}(A)\right)x - \det A

where tr denotes the trace, the sum of the diagonal elements of a matrix.

Since the trace only depends on the diagonal, you only need to compute the diagonal of a adjugate to use the equation above for the characteristic polynomial. And the alternating terms defining the adjugate are all positive on the diagonal. So the trace of the adjugate is the sum of the determinants of the cofactors of the diagonal: det A11 + det A22 + det A33.

Circulant matrices commute

A few days ago I wrote that circulant matrices all have the same eigenvectors. This post will show that it follows that circulant matrices commute with each other.

Recall that a circulant matrix is a square matrix in which the rows are cyclic permutations of each other. If we number the rows from 0, then the kth row takes the last k rows of the top row and moves them to the front.

Numerical example

We can generate two random circulant matrices and confirm that they commute by modifying the Python code from the earlier post.

    N = 6
    row = np.random.random(N)
    A = np.array([np.roll(row, i) for i in range(N)])
    row = np.random.random(N)
    B = np.array([np.roll(row, i) for i in range(N)])
    AB = np.matmul(A, B)
    BA = np.matmul(B, A)
    print(np.absolute(AB - BA).max())

This computes two random 6 by 6 circulant matrices A and B and shows that the difference between AB and BA is on the order of the limit of floating point precision. Specifically, this prints 4.44 × 10-16.


Now for a proof. Let A and B be two matrices of size n that have the same set of independent eigenvectors e1 through en. In the case of circulant matrices we can take ek to be the kth column of the FFT matrix, but the proof here applies to any matrices with common eigenvectors.

Let αk be the eigenvalue for ek as an eigenvector of A and let βk be the eigenvalue for ek as an eigenvector of B. Then

 (AB)e_k = A(Be_k) = A(\beta_k e_k) = \beta_k(Ae_k) = \beta_k\alpha_k e_k


(BA)e_k = B(Ae_k) = B(\alpha_k e_k) = \alpha_k(Be_k) = \alpha_k\beta_k e_k

and so multiplying by AB or BA has the same effect on ek since the complex numbers αk and βk commute. Since the eigenvalues form a basis, and multiplication by AB and BA has the same effect on each basis vector, AB = BA.

In short, matrices that share eigenvectors commute because eigenvalues commute.