# Illustrating Gershgorin disks with NumPy

Gershgorin’s theorem gives bounds on the locations of eigenvalues for an arbitrary square complex matrix.

The eigenvalues are contained in disks, known as Gershgorin disks, centered on the diagonal elements of the matrix. The radius of the disk centered on the kth diagonal element is the sum of the absolute values of the elements in the kth row, excluding the diagonal element itself.

To illustrate this theorem, we create a 5 by 5 diagonal matrix and add some random noise to it. The diagonal elements are

0, 3 + i, 4 + i, 1 + 5i, 9 + 2i.

The eigenvalues of the diagonal matrix would simply be the diagonal elements. The additional random values pull the eigenvalues away from the diagonal values, but by an amount no more than Gershgorin predicts. Note that in this example, two of the eigenvalues land in the smallest disk. It’s possible that a disk may not contain any eigenvalues; what Gershgorin guarantees is that the union of all the disks contains the union of all the eigenvalues.

Here’s the Python code that created the graph above.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(202103014)

N = 5 # dimension of our square matrix

D = np.diag([0, 3 + 1j, 4 + 1j, 1 + 5j, 9 + 2j])
M = np.random.rand(N, N) + D

R = np.zeros(N) # disk radii
for i in range(N):
R[i] = sum(abs(M[i,:])) - abs(M[i,i])

eigenvalues = np.linalg.eigvals(M)

# Plotting code
fig, ax = plt.subplots()
for k in range(N):
x, y = M[k,k].real, M[k,k].imag
ax.add_artist( plt.Circle((x, y), R[k], alpha=0.5) )
plt.plot(eigenvalues[k].real, eigenvalues[k].imag, 'k+')

ax.axis([-4, 12.5, -4, 9])
ax.set_aspect(1)
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.title("Gershgorin disks and eigenvalues $x + iy$")


# Moving between vectors and diagonal matrices

This is the first of two posts on moving between vectors and diagonal matrices. The next post is Broadcasting and functors.

## Motivation

When I first saw the product of two vectors in R, I was confused. If x and y are vectors, what does x*y mean? An R programmer would say “You multiply components together, of course. What else could it mean?!”

But my first thought was “What kind of product do they mean? Inner product? Outer product? You can’t take the matrix product of two vectors.”

That was a long time ago. What brought this to mind today was a statistical paper I was reading that used the same symbol for a vector and a matrix. In context it wasn’t hard to figure out what the authors meant, but there was something implicit going on there that I will make explicit here.

## Vectors and matrices

You can map a vector to a matrix by putting its contents along the diagonal of a diagonal matrix. That is, given a vector v of dimension n, create an n by n matrix V that is filled with zeros, except that

Vii = vi

for i = 1, 2, …, n. Let’s call this function Δ. In NumPy, Δ is implemented as the function diag. For example, The statistics paper that I mentioned above used Δ implicitly, i.e. it denoted v and Δv with the same notation.

The function Δ has some nice properties.

The componentwise product two vectors is called the Hadamard product, denoted by a circle. You can first take the Hadamard product of two column vectors (of the same dimension) and then embed the result in a matrix using Δ, or you can first embed the two vectors into the world of matrices and take the product there. Either way you end up at the same place.

In other words, the following diagram commutes: Here C is the set of column vectors, and by C × C we mean pairs of column vectors of the same dimension. Similarly, M is the set of square matrices, and by M × M we mean pairs of square matrices of the same dimension. The circle on the left is Hadamard product. The dot on the right is ordinary matrix product .

The Δ on top is the Δ we’ve defined above. The Δ on bottom takes a pair of column vectors to a pair of matrices by operating on each element of the pair separately.

Although it would a little strange to do so, you could define Hadamard product this way. To take the Hadamard product of two vectors, take them over to the world of matrices, multiply the two matrices, and take that matrix back to the world of column vectors.

## Related posts

 Alternately, you could think of the product on the right as Hadamard product of matrices. For diagonal matrices, ordinary matrix product and Hadamard product coincide, though of course in general they are very different.

# The debauch of indices

This morning I was working on a linear algebra problem for a client that I first solved by doing calculations with indices. As I was writing things up I thought of the phrase “the debauch of indices” that mathematicians sometimes use to describe tensor calculations. The idea is that calculations with lots of indices are inelegant and that more abstract arguments are better.

The term “debauch of indices” pejorative, but I’ve usually heard it used tongue-in-cheek. Although some people can be purists, going to great lengths to avoid index manipulation, pragmatic folk move up and down levels of abstraction as necessary to get their work done.

I searched on the term “debauch of indices” to find out who first said it, and found an answer on Stack Exchange that traces it back to Élie Cartan. Cartan said that although “le Calcul différentiel absolu du Ricci et Levi-Civita” (tensor calculus) is useful, “les débauches d’indices” could hide things that are easier to see geometrically.

After solving my problem using indices, I went back and came up with a more abstract solution. Both approaches were useful. The former cut through a complicated problem formulation and made things more tangible. The latter revealed some implicit pieces of the puzzle that needed to be made explicit.

# Kronecker sum

I’m working on a project these days that uses four different kinds of matrix product, which made me wonder if there’s another kind of product out there that I could find some use for.

In the process of looking around for other matrix products, I ran across the Kronecker sum. I’ve seen Kronecker products many times, but I’d never heard of Kronecker sums.

The Kronecker sum is defined in terms of the Kronecker product, so if you’re not familiar with the latter, you can find a definition and examples here. Essentially, you multiply each scalar element of the first matrix by the second matrix as a block matrix.

The Kronecker product of an m × n matrix A and a p × q matrix B is a mp × nq matrix KA B. You could think of K as an m × n matrix whose entries are p × q blocks.

So, what is the Kronecker sum? It is defined for two square matrices, an n × n matrix A and an m × m matrix B. The sizes of the two matrices need not match, but the matrices do need to be square.  The Kronecker sum of A and B is

AB = AIm + InB

where Im and In are identity matrices of size m and n respectively.

Does this make sense dimensionally? The left side of the (ordinary) matrix addition is nm × nm, and so is the right side, so the addition makes sense.

However, the Kronecker sum is not commutative, and usually things called “sums” are commutative. Products are not always commutative, but it goes against convention to call a non-commutative operation a sum. Still, the Kronecker sum is kinda like a sum, so it’s not a bad name.

I don’t have any application in mind (yet) for the Kronecker sum, but presumably it was defined for a good reason, and maybe I’ll run an application, maybe even on the project alluded to at the beginning.

There are several identities involving Kronecker sums, and here’s one I found interesting:

exp( A ) ⊗ exp( B ) = exp( A B ).

If you haven’t seen the exponential of a matrix before, basically you stick your matrix into the power series for the exponential function.

## Examples

First, let’s define a couple matrices A and B. We can compute the Kronecker sums

S = AB

and

T = B ⊕ A

with Mathematica to show they are different.

    A = {{1, 2}, {3, 4}}
B = {{1, 0, 1}, {1, 2, 0}, {2, 0, 3}}
S = KroneckerProduct[A, IdentityMatrix] +
KroneckerProduct[IdentityMatrix, B]
T = KroneckerProduct[B, IdentityMatrix] +
KroneckerProduct[IdentityMatrix, A]


This shows and so the two matrices are not equal.

We can compute the matrix exponentials of A and B with the Mathematica function MatrixExp to see that (I actually used MatrixExp[N[A]] and similarly for B so Mathematica would compute the exponentials numerically rather than symbolically. The latter takes forever and it’s hard to read the result.)

Now we have and so the two matrices are equal.

Even though the identity

exp( A ) ⊗ exp( B ) = exp( A B )

may look symmetrical, it’s not. The matrices on the left do not commute in general. And not only are AB and BA different in general, their exponentials are also different. For example # Convex function of diagonals and eigenvalues

Sam Walters posted an elegant theorem on his Twitter account this morning. The theorem follows the pattern of an equality for linear functions generalizing to an inequality for convex functions. We’ll give a little background, state the theorem, and show an example application.

Let A be a real symmetric n×n matrix, or more generally a complex n×n Hermitian matrix, with entries aij. Note that the diagonal elements aii are real numbers even if some of the other entries are complex. (A Hermitian matrix equals its conjugate transpose, which means the elements on the diagonal equal their own conjugate.)

A general theorem says that A has n eigenvalues. Denote these eigenvalues λ1, λ2, …, λn.

It is well known that the sum of the diagonal elements of A equals the sum of its eigenvalues. We could trivially generalize this to say that for any linear function φ: RR, because we could pull any shifting and scaling constants out of the sum.

The theorem Sam Walters posted says that the equality above extends to an inequality if φ is convex. Here’s an application of this theorem. Assume the eigenvalues of A are all positive and let φ(x) = – log(x). Then φ is convex, and and so i.e. the product of the diagonals of A is an upper bound on the determinant of A.

This post illustrates two general principles:

1. Linear equalities often generalize to convex inequalities.
2. When you hear a new theorem about convex functions, see what it says about exp or -log.

# Change of basis and Stirling numbers

Polynomials form a vector space—the sum of two polynomials is a polynomial etc.—and the most natural basis for this vector space is powers of x:

1, x, x², x³, …

But the power basis is not the only possible basis, and often not the most useful basis in application.

## Falling powers

In some applications the falling powers of x are a more useful basis. For positive integers n, the nth falling power of x is defined to be Falling powers come up in combinatorics, in the calculus of finite differences, and in hypergeometric functions.

## Change of basis

Since we have two bases for the vector space of polynomials, we can ask about the matrices that represent the change of basis from one to the other, and here’s where we see an interesting connection.

The entries of these matrices are numbers that come up in other applications, namely the Stirling numbers. You can think of Stirling numbers as variations on binomial coefficients. More on Stirling numbers here.

In summation notation, we have where the S1 are the (signed) Stirling numbers of the 1st kind, and the S2 are the Stirling numbers of the 2nd kind.

(There are two conventions for defining Stirling numbers of the 1st kind, differing by a factor of (-1)n-k.)

## Matrix form

This means the (ij)th element of matrix representing the change of basis from the power basis to the falling power basis is S1(i, j) and the (i, j)th entry of the matrix for the opposite change of basis is S2(i, j). These are lower triangular matrices because S1(i, j) and S2(i, j) are zero for j > i.

These are infinite matrices since there’s no limit to the degree of a polynomial. But if we limit our attention to polynomials of degree less than m, we take the upper left m by m submatrix of the infinite matrix. For example, if we look at polynomials of degree 4 or less, we have to convert from powers to falling powers, and going from falling powers to powers.

Incidentally, if we filled a matrix with unsigned Stirling numbers of the 1st kind, we would have the change of basis matrix going from the power basis to rising powers defined by It may be hard to see, but there’s a bar on top of the exponent n for rising powers whereas before we had a bar under the n for falling powers.

## Related posts

The first time you see matrices, if someone asked you how you multiply two matrices together, your first idea might be to multiply every element of the first matrix by the element in the same position of the corresponding matrix, analogous to the way you add matrices.

But that’s not usually how we multiply matrices. That notion of multiplication hardly involves the matrix structure; it treats the matrix as an ordered container of numbers, but not as a way of representing a linear transformation. Once you have a little experience with linear algebra, the customary way of multiplying matrices seems natural, and the way that may have seemed natural at first glance seems kinda strange.

The componentwise product of matrices is called the Hadamard product or sometimes the Schur product. Given two m by n matrices A and B, the Hadamard product of A and B, written AB, is the m by n matrix C with elements given by

cij = aij bij.

Because the Hadamard product hardly uses the linear structure of a matrix, you wouldn’t expect it to interact nicely with operations that depend critically on the linear structure. And yet we can give a couple theorems that do show a nice interaction, at least when A and B are positive semi-definite matrices.

The first is the Schur product theorem. It says that if A and B are positive semi-definite n by n matrices, then

det(A ∘ B) ≥ det(A) det(B)

where det stands for determinant.

Also, there is the following theorem of Pólya and Szegö. Assume A and B are symmetric positive semi-definite n by n matrices. If the eigenvalues of A and B, listed in increasing order, are αi and βi respectively, then for every eigenvalue λ of A ∘ B, we have

α1 β1 ≤ λ ≤ αn βn.

## Python implementation

If you multiply two (multidimensional) arrays in NumPy, you’ll get the componentwise product. So if you multiply two matrices as arrays you’ll get the Hadamard product, but if you multiply them as matrices you’ll get the usual matrix product. We’ll illustrate that below. Note that the function eigvalsh returns the eigenvalues of a matrix. The name may look a little strange, but the “h” on the end stands for “Hermitian.” We’re telling NumPy that the matrix is Hermitian so it can run software specialized for that case .


from numpy import array, matrix, array_equal, all
from numpy.linalg import det, eigvalsh

A = array([
[3, 1],
[1, 3]
])

B = array([
[5, -1],
[-1, 5]
])

H = array([
[15, -1],
[-1, 15]
])

AB = array([
[14,  2],
[ 2, 14]
])

assert(array_equal(A*B, H))

# Ordinary matrix product
assert(array_equal(A@B, AB))

# Schur product theorem
assert(det(H) >= det(A)*det(B))

# Eigenvalues
eigA = eigvalsh(A)
eigB = eigvalsh(B)
eigH = eigvalsh(A*B)

lower = eigA*eigB
upper = eigA*eigB
assert(all(eigH >= lower))
assert(all(eigH <= upper))


The code above shows that the eigenvalues of A are [2, 4], the eigenvalues of B are [4, 6], and the eigenvalues of A ∘ B are [14, 16].

## Related posts

 For complex matrices, Hermitian means conjugate symmetric, which in the real case reduces to simply symmetric. The theorem of Pólya and Szegö is actually valid for Hermitian matrices, but I simplified the statement for the case of real-valued matrices.

# How fast can you multiply matrices?

Suppose you want to multiply two 2 × 2 matrices together. How many multiplication operations does it take? Apparently 8, and yet in 1969 Volker Strassen discovered that he could do it with 7 multiplications.

## Upper and lower bounds

The obvious way to multiply two n × n matrices takes n³ operations: each entry in the product is the inner product of a row from the first matrix and a column from the second matrix. That amounts to n² inner products, each requiring n multiplications.

You can multiply two square matrices with O(n³) operations with the method described above, and it must take at least O(n²) operations because the product depends on all of the 2n² entries of the two matrices. Strassen’s result suggests that the optimal algorithm for multiplying matrices takes O(nk) operations for some k between 2 and 3. By applying Strassen’s algorithm recursively to larger matrices you can get k = log2 7 = 2.807.

The best known value at the moment is k = 2.3728639.

## Bounds on bounds

Yesterday the blog Gödel’s Lost Letter and P = NP posted an article Limits on Matrix Multiplication where they report on recent developments for finding the smallest value of k. A new paper doesn’t report a new value of k, but a limit on what current approaches to the problem can prove. Maybe k can equal 2, but there is a lower bound, strictly bigger than 2, on how small current approaches can go.

## Is this practical?

When I first heard of Strassen’s method, I was told it’s a curious but impractical result. Strassen saved one multiplication at the expense of introducing several more addition operations.

According to the Wikipedia article on matrix multiplication, recursively applying Strassen’s method can save time for n > 100. But there’s more to consider than counting operations. Strassen’s method, and subsequent algorithms, are more complicated. They may not be more efficient in practice even if they use fewer operations because the operations may not vectorize well.

Wikipedia reports that Strassen’s algorithm is not as numerically stable as the traditional approach, but this doesn’t matter when working over finite fields where arithmetic is exact.

## Strassen’s method

Let’s look at just what Strassen’s method does. We want to find the product of two matrices: I started to spell out Strassen’s method in LaTeX equations, but I thought it would be much better to write it out in code so I can be sure that I didn’t make a mistake.

The following Python code randomly fills in the values of the a’s and b’s, computes the c’s using the conventional method, then asserts that you can find these values from the q’s computed from Strassen’s method. Note there is one multiplication in each of the seven q’s.

    from random import randint

# Fill matrices with random integer values
a11 = randint(0, 9)
a12 = randint(0, 9)
a21 = randint(0, 9)
a22 = randint(0, 9)
b11 = randint(0, 9)
b12 = randint(0, 9)
b21 = randint(0, 9)
b22 = randint(0, 9)

c11 = a11*b11 + a12*b21
c12 = a11*b12 + a12*b22
c21 = a21*b11 + a22*b21
c22 = a21*b12 + a22*b22

# Strassen's method
q1 = (a11 + a22)*(b11 + b22)
q2 = (a21 + a22)*b11
q3 = a11*(b12 - b22)
q4 = a22 * (-b11 + b21)
q5 = (a11 + a12)*b22
q6 = (-a11 + a21)*(b11 + b12)
q7 = (a12 - a22)*(b21 + b22)

assert(c11 == q1 + q4 - q5 + q7)
assert(c21 == q2 + q4)
assert(c12 == q3 + q5)
assert(c22 == q1 + q3 - q2 + q6)


Since Strassen’s method takes more operations than the traditional method for multiplying 2 × 2 matrices, how can it take fewer operations than the traditional method for multiplying large matrices?

When you apply Strassen’s method to a matrix partitioned into submatrices, its multiplications become matrix multiplications, and its additions become matrix additions. These operations are O(n2.807) and O(n2) respectively, so saving multiplications at the cost of more additions is a win.

# Distribution of eigenvalues for symmetric Gaussian matrix

## Symmetric Gaussian matrices

The previous post looked at the distribution of eigenvalues for very general random matrices. In this post we will look at the eigenvalues of matrices with more structure. Fill an n by n matrix A with values drawn from a standard normal distribution and let M be the average of A and its transpose, i.e. M = ½(A + AT).  The eigenvalues will all be real because M is symmetric.

This is called a “Gaussian Orthogonal Ensemble” or GOE. The term is standard but a little misleading because such matrices may not be orthogonal.

## Eigenvalue distribution

The joint probability distribution for the eigenvalues of M has three terms: a constant term that we will ignore, an exponential term, and a product term. (Source) The exponential term is the same as in a multivariate normal distribution. This says the probability density drops of quickly as you go away from the origin, i.e. it’s rare for eigenvalues to be too big. The product term multiplies the distances between each pair of eigenvalues. This says it’s also rare for eigenvalues to be very close together.

(The missing constant to turn the expression above from a proportionality to an equation is whatever it has to be for the right side to integrate to 1. When trying to qualitatively understand a probability density, it usually helps to ignore proportionality constants. They are determined by the rest of the density expression, and they’re often complicated.)

If eigenvalues are neither tightly clumped together, nor too far apart, we’d expect that the distance between them has a distribution with a hump away from zero, and a tail that decays quickly. We will demonstrate this with a simulation, then give an exact distribution.

## Python simulation

The following Python code simulates 2 by 2 Gaussian matrices.

    import matplotlib.pyplot as plt
import numpy as np

n = 2
reps = 1000

diffs = np.zeros(reps)
for r in range(reps):
A = np.random.normal(scale=n**-0.5, size=(n,n))
M = 0.5*(A + A.T)
w = np.linalg.eigvalsh(M)
diffs[r] = abs(w - w)

plt.hist(diffs, bins=int(reps**0.5))
plt.show()


This produced the following histogram: The exact probability distribution is p(s) = s exp(-s²/4)/2. This result is known as “Wigner’s surmise.”

# Circular law for random matrices

Suppose you create a large matrix M by filling its components with random values. If has size n by n, then we require the probability distribution for each entry to have mean 0 and variance 1/n. Then the Girko-Ginibri circular law says that the eigenvalues of M are approximately uniformly distributed in the unit disk in the complex plane. As the size n increases, the distribution converges to a uniform distribution on the unit disk.

The probability distribution need not be normal. It can be any distribution, shifted to have mean 0 and scaled to have variance 1/n, provided the tail of the distribution isn’t so thick that the variance doesn’t exist. If you don’t scale the variance to 1/n you still get a circle, just not a unit circle.

We’ll illustrate the circular law with a uniform distribution. The uniform distribution has mean 1/2 and variance 1/12, so we will subtract 1/2 and multiply each entry by √(12/n).

Here’s our Python code:

    import matplotlib.pyplot as plt
import numpy as np

n = 100
M = np.random.random((n,n)) - 0.5
M *= (12/n)**0.5
w = np.linalg.eigvals(M)
plt.scatter(np.real(w), np.imag(w))
plt.axes().set_aspect(1)
plt.show()


When n=100 we get the following plot. When n=1000 we can see the disk filling in more. Note that the points are symmetric about the real axis. All the entries of M are real, so its characteristic polynomial has all real coefficients, and so its roots come in conjugate pairs. If we randomly generated complex entries for M we would not have such symmetry.

Related post: Fat-tailed random matrices