What is partial pivoting?

Gaussian elimination is pretty simple if all goes well, and in introductory courses all does go well.

You have an n × n matrix A, an n × 1 column vector b, and you want to find an n × 1 column vector x such that Ax = b.

You subtract multiples of the first rows of A and b to zero out all the elements below a11. Then you subtract multiples of the second rows of A and b to zero out all the elements below a22. You keep doing this until all the elements below the diagonal are zero. Then you can solve for x using back substitution.

The process above is called naive Gaussian elimination. This post is for people who are familiar with naive Gaussian elimination but who do not realize it is naive.

Creating zeros

Suppose the first two rows of our matrix A are as follows:

2 * * * …
3 * * * …

Then you subtract 3/2 times the new first row from the second:

2 * * * …
0 * * * …

You would continue to subtract multiples of the first row from the rest of the rows.

What could go wrong?

What if the first two rows looked like this?

0 * * * …
3 * * * …

In this case naive Gaussian elimination fails on the first step. As far as I recall, I never saw an example like that when I took linear algebra: naive Gaussian elimination always worked. I don’t think I ever ran into this possibility until I took a course in numerical analysis.

The simplest way around the problem would be to swap the first two rows:

3 * * * …
0 * * * …

I imagine this strategy was commonly done in hand calculations long before anyone gave the process a name. I said that I don’t remember encountering this problem in a linear algebra class. Maybe it came up in a homework problem and I swapped rows without thinking about it. But if you’re going to write software to implement Gaussian elimination, you have to think about these things systematically.

You might have a program literally swap rows, or if you’re more performance-conscious you might conceptually swap the rows, keeping track of the swaps with an index but without moving data in memory.

Pivots and pivoting

The first element in the first row is called a pivot. More generally, the akk element at the kth step of Gaussian elimination is a pivot. When you run into a zero pivot, you have a logical problem: naive Gaussian elimination cannot proceed without some help, i.e. swapping rows. Less obvious but still true is that if you run into a small pivot, you have a numerical problem. When carrying out calculations on a computer with floating point arithmetic, small but non-zero pivots can lead to loss of accuracy, possibly catastrophic loss.

Swapping rows to avoid small pivots is called partial pivoting. Partial pivoting is usually [1] adequate in practice. Complete pivoting allows the possibility of swapping columns as well. You can go down a deep rabbit hole exploring the nuances of various pivoting strategies.

Gaussian elimination is part of the high school curriculum, but there are a lot of practical issues to work out when applying this apparently simple algorithm. As numerical analysis expert Nick Trefethen put it, “The closer one looks, the more subtle and remarkable Gaussian elimination appears.”

Related posts

[1] “Despite [artificially constructed examples] Gaussian elimination with partial pivoting is utterly stable in practice. … In fifty years of computing, no matrix problems that create explosive instability are known to have arisen under natural circumstances.” — Nick Trefethen and David Bau. Numerical Linear Algebra. 2022.

Power method and centrality

A few days ago I wrote about eigenvector centrality, a way of computing which nodes in a network are most important. Rather than simply looking for the most highly connected nodes, it looks for nodes that are highly connected to nodes that are highly connected. It’s the idea behind Google’s PageRank algorithm.

Adjacency matrices

One way to capture the structure of a graph is its adjacency matrix A. Each element aij of this matrix equals 1 if there is an edge between the ith and jth node and a 0 otherwise. If you square the matrix A, the (i, j) entry of the result tells you how many paths of length 2 are between the ith and jth nodes. In general, the (i, j) entry of An tells you how many paths of length n there are between the corresponding nodes.

Power method

Calculating eigenvector centrality requires finding an eigenvector associated with the largest eigenvalue of A. One way to find such an eigenvector is the power method. You start with a random initial vector and repeatedly multiply it by A. This produces a sequence of vectors that converges to the eigenvector we’re after.

Conceptually this is the same as computing An first and multiplying it by the random initial vector. So not only does the matrix An count paths of length n, for large n it helps us compute eigenvector centrality.

Theoretical details

Now for a little fine print. The power method will converge for any starting vector that has some component in the direction of the eigenvector you’re trying to find. This is almost certainly the case if you start with a vector chosen at random. The power method also requires that the matrix has a single eigenvector of largest magnitude and that its associated eigenspace have dimension 1. The post on eigenvector centrality stated that these conditions hold, provided the network is connected.

Computational practicalities

In principle, you could use calculate eigenvector centrality by computing An for some large n. In practice, you’d never do that. For a square matrix of size N, a matrix-vector product takes O(N²) operations, whereas squaring A requires O(N³) operations. So you would repeatedly apply A to a vector rather than computing powers of A.

Also, you wouldn’t use the power method unless A is sparse, making it relatively efficient to multiply by A. If most of the entries of A are zeros, and there is an exploitable pattern to where the non-zero elements are located, you can multiply A by a vector with far less than N² operations.

Eigenvector centrality

A basic question to ask about a network is which nodes are most important. A simple way of answering this question would be to say the most well-connected nodes, i.e. the nodes with the most edges. This approach is known as degree centrality. It’s not a bad place to start. It’s easy to understand and easy to compute. It may be adequate for some purposes, but it’s often helpful to take a more sophisticated approach known as eigenvector centrality.

Examples

Let’s look at a few motivating examples before we get into the details.

Social network marketing

If you wanted to advertise something, say a book you’ve written, and you’re hoping people will promote it on Twitter. Would you rather get a shout out from someone with more followers or someone with less followers? All else being equal, more followers is better. But even better would be a shout out from someone whose followers have a lot of followers.

Deans and flight attendants

Suppose you’re at a graduation ceremony. Your mind starts to wander after the first few hundred people walk across the stage, and you start to think about how a cold might spread through the crowd. The dean handing out diplomas could be a superspreader because he’s shaking hands with everyone as they receive their diplomas. But an inconspicuous parent in the audience may also be a superspreader because she’s a flight attendant and will be in a different city every day for the next few days. And not only is she a traveler, she’s in contact with travelers.

Search engines

Ranking web pages according to the number of inbound links they have was a good idea because this takes advantage of revealed preferences: instead of asking people what pages they recommend, you can observe what pages they implicitly recommend by linking to them. An even better idea was Page Rank, weighing inbound links by how many links the linking pages have.

Eigenvector centrality

The idea of eigenvector centrality is to give each node a rank proportional to the sum of the ranks of the adjacent nodes. This may seem circular, and it kinda is.

To know the rank of a node, you have to know the ranks of the nodes linking to it. But to know their ranks, you have to know the ranks of the nodes linking to them, etc. There is no sequential solution to the problem because you’d end up in an infinite regress, even for a finite network. But there is a simultaneous solution, considering all pages at once.

You want to assign to the ith node a rank xi proportional to the sum of the ranks of all adjacent nodes. A more convenient way to express this to compute the weighted sum of the ranks of all nodes, with weight 1 for adjacent nodes and weight 0 for non-adjacent nodes. That is, you want to compute

x_i = \frac{1}{\lambda} \sum a_{ij} x_j

where the a‘s are the components of the adjacency matrix A. Here aij equals 1 if there is an edge between nodes i and j and it equals 0 otherwise.

This is equivalent to looking for solutions to the system of equations

Ax = \lambda x

where A is the adjacency matrix and x is the vector of node ranks. If there are N nodes, then A is an N × N matrix and x is a column vector of length N.

In linear algebra terminology, x is an eigenvector of the adjacency matrix A, hence the name eigenvector centrality.

Mathematical questions

There are several mathematical details to be concerned about. Does the eigenvalue problem defining x have a solution? Is the solution unique (up to scalar multiples)? What about the eigenvalue λ? Does the adjacency matrix have multiple eigenvalues, which would mean there are multiple eigenvectors?

If A is the adjacency matrix for a connected graph, then there is a unique eigenvalue λ of largest magnitude, there is a unique corresponding eigenvector x, and all the components of x are non-negative. It is often said that this is a result of the Perron–Frobenius theorem, but there’s a little more to it than that because you need the hypothesis that the graph is connected.

The matrix A is non-negative, but not necessarily positive: some entries may be zero. But if the graph is connected, there’s a path between any node i and another node j, which means for some power of A, the ij entry of A is not zero. So although A is not necessarily positive, some power of A is positive. This puts us in the positive case of the Perron–Frobenius theorem, which is nicer than the non-negative case.

This section has discussed graphs, but Page Rank applies to directed graphs because links have a direction. But similar theorems hold for directed graphs.

Related posts

Determinant of an infinite matrix

What does the infinite determinant

D = \left| \begin{array}{lllll} 1 & a_1 & 0 & 0 & \cdots \\ b_1 & 1 & a_2 & 0 & \cdots \\ 0 & b_2 & 1 & a_3 & \\ 0 & 0 & b_3 & 1 & \ddots \\ \vdots & \vdots & & \ddots & \ddots \\ \end{array} \right|

mean and when does it converge?

The determinant D above is the limit of the determinants Dn defined by

D_n = \left| \begin{array}{lllll} 1 & a_1 & & & \\ b_1 & 1 & a_2 & & \\ & b_2 & 1 & \ddots & \\ & & \ddots & \ddots & a_{n-1} \\ & & & b_{n-1} & 1 \\ \end{array} \right|

If all the a‘s are 1 and all the b‘s are −1 then this post shows that Dn = Fn, the nth Fibonacci number. The Fibonacci numbers obviously don’t converge, so in this case the determinant of the infinite matrix does not converge.

In 1895, Helge von Koch said in a letter to Poincaré that the infinite determinant is absolutely convergent if and only if the sum

\sum_{i=1}^\infty a_ib_i

is absolutely convergent. A proof is given in [1].

The proof shows that the Dn are bounded by

\prod_{i=1}^n\left(1 + |a_ib_i| \right)

and so the infinite determinant converges if the corresponding infinite product converges. And a theorem on infinite products says

\prod_{i=1}^\infty\left(1 + |a_ib_i| \right)

converges absolute if the sum in Koch’s theorem converges. In fact,

\prod_{i=1}^\infty\left(1 + |a_ib_i| \right) \leq \exp\left(\sum_{i=1}^\infty |a_ib_i| \right )

and so we have an upper bound on the infinite determinant.

Related post: Triadiagonal systems, determinants, and cubic splines

[1] A. A. Shaw. H. von Koch’s First Lemma and Its Generalization. The American Mathematical Monthly, April 1931, Vol. 38, No. 4, pp. 188–194

Area of quadrilateral as a determinant

I’ve written several posts about how determinants come up in geometry. These determinants often look similar, having columns related to coordinates and a column of ones. You can find several examples here along with an explanation for this pattern.

If you have three points z1, z2, and z3 in the complex plane, you can find the area of a triangle with these points as vertices

A(z_1, z_2, z_3) = \frac{i}{4} \, \left| \begin{matrix} z_1 & \bar{z}_1 & 1 \\ z_2 & \bar{z}_2 & 1 \\ z_3 & \bar{z}_3 & 1 \\ \end{matrix} \right|

You can read more about this here.

If you add the second column to the first, and subtract the first column from the second, you can get the equation for the area of a triangle in the real plane.

A = \frac{1}{2} \, \left| \begin{matrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \\ \end{matrix} \right|

Presumably the former equation was first derived from the latter, but the two are equivalent.

Now suppose you have a quadrilateral whose vertices are numbered in clockwise or counterclockwise order. Then the area is given by

A = \frac{1}{2} \, \left| \begin{matrix} x_1 & y_1 & 1 & 0\\ x_2 & y_2 & 1 & 1\\ x_3 & y_3 & 1 & 0\\ x_4 & y_4 & 1 & 1\\ \end{matrix} \right|

The proof is easy. If you expand the determinant by minors on the last column, you get the sum of two 3 × 3 determinants corresponding to the areas of the two triangles formed by splitting the quadrilateral along the diagonal connecting the first and third points.

The Million Dollar Matrix Multiply

The following post is by Wayne Joubert, the newest member of our consulting team. Wayne recently retired from his position as a Senior Computational Scientist at Oak Ridge National Laboratory. — John

Training large language models like GPT-4 costs many millions of dollars in server expenses. These costs are expected to trend to billions of dollars over the next few years [1]. One of the biggest computational expenses of LLM training is multiplying matrices. These are simple operations of the form C = AB. Matrix multiplies are common not only in AI model training but also many high performance computing applications from diverse science domains.

Eking out more speed from matrix multiplies could reduce AI model training costs by millions of dollars. More routinely, such improvements could reduce training runtime by hours on a single GPU-powered workstation or cut down cloud service provider expenses significantly.

What is less well-known is that matrix multiples run on graphics processing units (GPUs) that are typically used for model training have many exotic performance behaviors that can drastically reduce matrix multiply efficiency by a wide margin.

Two recent works [2], [3] examine these phenomena in considerable depth. Factors such as matrix size, alignment of data in memory, power throttling, math library versions, chip-level manufacturing variability, and even the values of the matrix entries can significantly affect performance. At the same time, much of this variability can be modeled by machine learning methods such as decision trees and random forests [2].

Use of these methods can be the first step toward implementing autotuning techniques to minimize costs. Using such methods or carefully applying rules of thumb for performance optimization can make a huge performance difference for matrix multiply-heavy GPU software.

Related posts

[1] What large models cost you—there is no free AI lunch

[2] Wayne Joubert, Eric Palmer and Verónica G. Melesse Vergara, “Matrix Multiply Performance of GPUs on Exascale-class HPE/Cray Systems,” Proceedings of the Cray User Group Meeting (CUG) 2022, https://www.osti.gov/biblio/2224210.

[3] P. Sinha, A. Guliani, R. Jain, B. Tran, M. D. Sinclair and S. Venkataraman, “Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems,” SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 2022, pp. 01-15, doi: 10.1109/SC41404.2022.00070.

Fractional linear and linear

A function of the form

g(z) = \frac{az + b}{cz + d}

where adbc ≠ 0 is sometimes called a fractional linear transformation or a bilinear transformation. I usually use the name Möbius transformation.

In what sense are Möbius transformations linear transformations? They’re nonlinear functions unless b = c = 0. And yet they’re analogous to linear transformations. For starters, the condition adbc ≠ 0 appears to be saying that a determinant is non-zero, i.e. that a matrix is non-singular.

The transformation g is closely associated with the matrix

\begin{pmatrix} a & b \\ c & d \end{pmatrix}

but there’s more going on than a set of analogies. The reason is that Möbius transformation are linear transformations, but not on the complex numbers ℂ.

When you’re working with Möbius transformations, you soon want to introduce ∞. Things get complicated if you don’t. Once you add ∞ theorems become much easier to state, and yet there’s a nagging feeling that you may be doing something wrong by informally introducing ∞. This feeling is justified because tossing around ∞ without being careful can lead to wrong conclusions.

So how can we rigorously deal with ∞? We could move from numbers (real or complex) to pairs of numbers, as with fractions. We replace the complex number z with the equivalence class of all pairs of complex numbers whose ratio is z. The advantage of this approach is that you get to add one special number, the equivalence class of all pairs whose second number is 0, i.e. fractions with zero in the denominator. This new number system is called P(ℂ), where “P” stands for “projective.”

Möbius transformations are projective linear transformations. They’re linear on P(ℂ), though not on ℂ.

When we multiply the matrix above by the column vector (z 1)T we get

\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} z \\ 1 \end{pmatrix} = \begin{pmatrix} az + b \\ cz + d \end{pmatrix}

and since our vectors are essentially fractions, the right hand side corresponds to g(z) if the second component of the vector, cz + d, is not zero.

If cz + d = 0, that’s OK. Everything is fine while we’re working in P(ℂ), but we get an element of P(ℂ) that does not correspond to an element of ℂ, i.e. we get ∞.

We’ve added ∞ to the domain and range of our Möbius transformations without any handwaving. We’re just doing linear algebra on finite complex numbers.

There’s a little bit of fine print. In P(ℂ) we can’t allow both components of a pair to be 0, and non-zero multiples of the same vector are equivalent, so we’re not quite doing linear algebra. Strictly speaking a Möbius transformation is a projective linear transformation, not a linear transformation.

It takes a while to warm up to the idea of moving from complex numbers to equivalence classes of pairs of complex numbers. The latter seems unnecessarily complicated. And it often is unnecessary. In practice, you can work in P(ℂ) by thinking in terms of ℂ until you need to have to think about ∞. Then you go back to thinking in terms of P(ℂ). You can think of P(ℂ) as ℂ with a safety net for working rigorously with ∞.

Textbooks usually introduce higher dimensional projective spaces before speaking later, if ever, of one-dimensional projective space. (Standard notation would write P¹(ℂ) rather than P(ℂ) everywhere above.) But one-dimensional projective space is easier to understand by analogy to fractions, i.e. fractions whose denominator is allowed to be zero, provided the numerator is not also zero.

I first saw projective coordinates as an unmotivated definition. “Good morning everyone. We define Pn(ℝ) to be the set of equivalence classes of ℝn+1 where ….” There had to be some payoff for this added complexity, but we were expected to delay the gratification of knowing what that payoff was. It would have been helpful if someone had said “The extra coordinate is there to let us handle points at infinity consistently. These points are not special at all if you present them this way.” It’s possible someone did say that, but I wasn’t ready to hear it at the time.

Related posts

Jordan normal form: 1’s above or below diagonal?

Given a square complex matrix A, the Jordan normal form of A is a matrix J such that

P^{-1}A P = J

and J has a particular form. The eigenvalues of A are along the diagonal of J, and the elements above the diagonal are 0s or 1s. There’s a particular pattern to the 1s, giving the matrix J a block structure, but that’s not the focus of this post.

Some books say a Jordan matrix J has the eigenvalues of A along the diagonal and 0s and 1s below the diagonal.

So we have two definitions. Both agree that the non-zero elements of J are confined to the main diagonal and an adjacent diagonal, but they disagree on whether the secondary diagonal is above or below the main diagonal. It’s my impression that placing the 1s below the main diagonal is an older convention. See, for example, [1]. Now I believe it’s more common to put the 1s above the main diagonal.

How are these two conventions related and how might you move back and forth between them?

It’s often harmless to think of linear transformations and matrices as being interchangeable, but for a moment we need to distinguish them. Let T be a linear transformation and let A be the matrix that represents T with respect to the basis

e_1, e_2, e_3, \ldots, e_N

Now suppose we represent T by a new basis consisting of the same vectors but in the opposite order.

\bar{e}_i = e_{N-i+1}

If we reverse the rows and columns of A then we have the matrix for the representation of T with respect to the new basis.

So if J is a matrix with the eigenvalues of A along the diagonal and 0s and 1s above the diagonal, and we reverse the order of our basis, then we get a new matrix J′ with the eigenvalues of A along the diagonal (though in the opposite order) and 0s and  1s below the diagonal. So J and J′ represent the same linear transformation with respect to different bases.

Matrix calculations

Let R be the matrix formed by starting with the identity matrix I and reversing all the rows. So while I has 1s along the NW-SE diagonal, R has 1s along the SW-NE diagonal.

Reversing the rows of A is the same as multiplying A by R on the right.

Reversing the columns of A is the same as multiplying A by R on the left.

Here’s a 3 by 3 example:

\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} = \begin{pmatrix} i & h & g \\ f & e & d \\ c & b & a \end{pmatrix}

Note that the matrix R is its own inverse. So if we have

P^{-1}A P = J

then we can multiply both sides on the left and right by R.

RP^{-1}APR = RJR

If J has 1s above the main diagonal, then RJR has 1s below the main diagonal. And if J has 1’s below the main diagonal, RJR has 1s above the main diagonal.

Since R is its own inverse, we have

RP^{-1}APR = R^{-1}P^{-1}APR = (PR)^{-1}A(PR)

This says that if the similarity transform by P puts A into Jordan form with 1’s above (below) the diagonal, then the similarity transform by PR puts A into Jordan form with 1’s below (above) the diagonal.

[1] Hirsch and Smale. Differential Equations, Dynamical Systems, and Linear Algebra. 1974.

The numerical range ellipse

Let A be an n × n complex matrix. The numerical range of A is the image of x*Ax over the unit sphere. That is, the numerical range of A is the set W(A) in defined by

W(A) = {x*Ax | x ∈ ℂn and ||x|| = 1}

where x* is the conjugate transpose of the column vector x.

The product of a row vector and a column vector is a scalar, so W(A) is a subset of the complex plane ℂ.

When A is a 2 × 2 matrix, W(A) is an ellipse, and the two foci are the eigenvalues of A.

Denote the two eigenvalues by λ1 and λ2 . Denote the major and minor axes of the ellipse by a and b respectively. Then

b² = tr(A*A) − |λ1|² − |λ2|².

where tr is the trace operator. This is the elliptical range theorem [1]. The theorem doesn’t state the major axis a explicitly, but it can be found from the two foci and the minor axis b.

Demonstration

We will create a visualization the numerical range of a matrix directly from the definition, generating x‘s at random and plotting x*Ax. Then we draw the ellipse described in the elliptical range theorem and see that it forms the boundary of the random points.

The following code plots 500 random points in the numerical range of a matrix formed from the numbers in today’s date. The eigenvalues are plotted in black.

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

A = np.array([[8, 16], [20, 23j]])

for _ in range(5000):
    v = norm(0, 1).rvs(size=4)
    v /= np.linalg.norm(v)
    x = np.array([v[0] + 1j*v[1], v[2] + 1j*v[3]])
    z = np.conj(x).dot(A.dot(x))
    plt.plot(z.real, z.imag, 'o', color='0.9')

eigenvalues, eigenvectors = np.linalg.eig(A)    
plt.plot(eigenvalues[0].real, eigenvalues[0].imag, 'ko')
plt.plot(eigenvalues[1].real, eigenvalues[1].imag, 'ko')

The following code draws the ellipse guaranteed by the elliptical range theorem.

A_star_A = np.dot(np.conj(A).T, A)
b = (np.trace(A_star_A) - abs(eigenvalues[0])**2 - abs(eigenvalues[1])**2)**0.5
c = 0.5*abs(eigenvalues[0] - eigenvalues[1])
a = (2/3)*(2*(2*b**2/4 + c**2)**0.5 + c)

t = np.linspace(0, np.pi*2)
z = a*np.cos(t)/2 + 1j*b*np.sin(t)/2
u = (eigenvalues[1] - eigenvalues[0])/2
u /= np.linalg.norm(u)
m = eigenvalues.mean()
z = z*u + m

plt.plot(z.real, z.imag, 'b-')

Here is the result.

Related posts

[1] Ulrich Daepp, Pamela Gorkin, Andrew Shaffer, and Karl Voss. Finding Ellipses: What Blaschke Products, Poncelet’s Theorem, and the Numerical Range Know about Each Other. MAA Press 2018.

Visualizing a determinant identity

The previous post discussed an algorithm developed by Charles Dodgson (better known as Lewis Carroll) for computing determinants.

The key identity for proving that Dodgson’s algorithm is correct involves the Desnanot-Jacobi identity from 1841. The identity is intimidating in its symbolic form and yet easy to visualize. In algebraic form the identity says

\det({M}) \det(M_{1,k}^{1,k}) = \det(M_1^1)\det(M_k^k) -\det(M_1^k) \det(M_k^1)

Here a subscript i means to delete the ith row of M, and a superscript i means to remove the ith column. When there are two subscripts (superscripts) then there are two rows (columns) to remove.

This dense notation hides the high degree of symmetry in this identity. In the image below, we gray-out the rows and columns that are to be removed. Note that the right hand side of the Desnanot-Jacobi identity is a 2×2 determinant of determinants.

The left side of the image represents the product of two determinants, a full k × k matrix and the k−2 × k−2 matrix formed by removing the edges of the matrix.

The right side of the image is a 2 × 2 determinant who entries are themselves determinants. The four k−1 × k−1 matrices on the right are formed by stripping off the outside row and column of each matrix.

Here’s a demonstration of the identity using the matrix from the example in the previous post and Mathematica.

    a = Det[{{3, 1, 4, 1}, {5, 9, 2, 6}, {0, 7, 1, 0}, {2, 0, 2, 3}}];
    b = Det[{{9, 2}, {7, 1}}];
    c = Det[{{9, 2, 6}, {7, 1, 0}, {0, 2, 3}}];
    d = Det[{{5, 9, 2}, {0, 7, 1}, {2, 0, 2}}];
    e = Det[{{1, 4, 1}, {9, 2, 6}, {7, 1, 0}}];
    f = Det[{{3, 1, 4}, {5, 9, 2}, {0, 7, 1}}];

    a b == c f - d e
    True