Hadamard product

The first time you see matrices, if someone asked you how to multiply two matrices together, your first idea might be to multiply every element of the first matrix by the element in the same position of the second matrix, analogous to the way you add matrices.

But that’s not usually how we multiply matrices. That notion of multiplication hardly involves the matrix structure; it treats the matrix as an ordered container of numbers, but not as a way of representing a linear transformation. Once you have a little experience with linear algebra, the customary way of multiplying matrices seems natural, and the way that may have seemed natural at first glance seems kinda strange.

The componentwise product of matrices is called the Hadamard product or sometimes the Schur product. Given two m by n matrices A and B, the Hadamard product of A and B, written A ∘ B, is the m by n matrix C with elements given by

c_{ij} = a_{ij} b_{ij}.

Because the Hadamard product hardly uses the linear structure of a matrix, you wouldn’t expect it to interact nicely with operations that depend critically on the linear structure. And yet we can give a couple theorems that do show a nice interaction, at least when A and B are positive semi-definite matrices.

The first is the Schur product theorem. It says that if A and B are positive semi-definite n by n matrices, then

det(A ∘ B) ≥ det(A) det(B)

where det stands for determinant.

Also, there is the following theorem of Pólya and Szegö. Assume A and B are symmetric positive semi-definite n by n matrices. If the eigenvalues of A and B, listed in increasing order, are αi and βi respectively, then for every eigenvalue λ of A ∘ B, we have

α₁ β₁ ≤ λ ≤ αₙ βₙ.

Python implementation

If you multiply two (multidimensional) arrays in NumPy, you’ll get the componentwise product. So if you multiply two matrices as arrays you’ll get the Hadamard product, but if you multiply them as matrices you’ll get the usual matrix product. We’ll illustrate that below. Note that the function eigvalsh returns the eigenvalues of a matrix. The name may look a little strange, but the “h” on the end stands for “Hermitian.” We’re telling NumPy that the matrix is Hermitian so it can run software specialized for that case [1].

    
    from numpy import array, array_equal, all
    from numpy.linalg import det, eigvalsh
    
    A = array([
        [3, 1],
        [1, 3]
    ])
    
    B = array([
        [5, -1],
        [-1, 5]
    ])
    
    H = array([
        [15, -1],
        [-1, 15]
    ])
    
    AB = array([
        [14,  2],
        [ 2, 14]
    ])
    
    # Hadamard product
    assert(array_equal(A*B, H))
    
    # Ordinary matrix product
    assert(array_equal(A@B, AB))
    
    # Schur product theorem
    assert(det(H) >= det(A)*det(B))
    
    # Eigenvalues
    eigA = eigvalsh(A)
    eigB = eigvalsh(B)
    eigH = eigvalsh(A*B)
    
    lower = eigA[0]*eigB[0]
    upper = eigA[1]*eigB[1]
    assert(all(eigH >= lower))
    assert(all(eigH <= upper))

The code above shows that the eigenvalues of A are [2, 4], the eigenvalues of B are [4, 6], and the eigenvalues of A ∘ B are [14, 16].

[1] For complex matrices, Hermitian means conjugate symmetric, which in the real case reduces to simply symmetric. The theorem of Pólya and Szegö is actually valid for Hermitian matrices, but I simplified the statement for the case of real-valued matrices.

How fast can you multiply matrices?

Suppose you want to multiply two 2 × 2 matrices together. How many multiplication operations does it take? Apparently 8, and yet in 1969 Volker Strassen discovered that he could do it with 7 multiplications.

Upper and lower bounds

The obvious way to multiply two n × n matrices takes n³ operations: each entry in the product is the inner product of a row from the first matrix and a column from the second matrix. That amounts to n² inner products, each requiring n multiplications.

You can multiply two square matrices with O(n³) operations using the method described above, and it must take at least O(n²) operations because the product depends on all of the 2n² entries of the two matrices. Strassen’s result suggests that the optimal algorithm for multiplying matrices takes O(n^k) operations for some k between 2 and 3. By applying Strassen’s algorithm recursively to larger matrices you can get k = log₂ 7 ≈ 2.807.
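
To see where that exponent comes from, note that applying Strassen recursively gives a divide-and-conquer recurrence whose solution is dominated by the n^(log₂ 7) term:

T(n) = 7\,T(n/2) + O(n^2) \quad\Longrightarrow\quad T(n) = O\left(n^{\log_2 7}\right) = O\left(n^{2.807\ldots}\right)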

The best known value at the moment is k = 2.3728639.

Bounds on bounds

Yesterday the blog Gödel’s Lost Letter and P = NP posted an article Limits on Matrix Multiplication where they report on recent developments for finding the smallest value of k. A new paper doesn’t report a new value of k, but a limit on what current approaches to the problem can prove. Maybe k can equal 2, but there is a lower bound, strictly bigger than 2, on how small current approaches can go.

Is this practical?

When I first heard of Strassen’s method, I was told it’s a curious but impractical result. Strassen saved one multiplication at the expense of introducing several more addition operations.

According to the Wikipedia article on matrix multiplication, recursively applying Strassen’s method can save time for n > 100. But there’s more to consider than counting operations. Strassen’s method, and subsequent algorithms, are more complicated. They may not be more efficient in practice even if they use fewer operations because the operations may not vectorize well.

Wikipedia reports that Strassen’s algorithm is not as numerically stable as the traditional approach, but this doesn’t matter when working over finite fields where arithmetic is exact.

Strassen’s method

Let’s look at just what Strassen’s method does. We want to find the product of two matrices:

\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22}\end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22}\end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22}\end{pmatrix}

I started to spell out Strassen’s method in LaTeX equations, but I thought it would be much better to write it out in code so I can be sure that I didn’t make a mistake.

The following Python code randomly fills in the values of the a’s and b’s, computes the c’s using the conventional method, then asserts that you can find these values from the q’s computed from Strassen’s method. Note there is one multiplication in each of the seven q’s.

    from random import randint
    
    # Fill matrices with random integer values
    a11 = randint(0, 9)
    a12 = randint(0, 9)
    a21 = randint(0, 9)
    a22 = randint(0, 9)
    b11 = randint(0, 9)
    b12 = randint(0, 9)
    b21 = randint(0, 9)
    b22 = randint(0, 9)
    
    # Traditional matrix multiplication
    c11 = a11*b11 + a12*b21
    c12 = a11*b12 + a12*b22
    c21 = a21*b11 + a22*b21
    c22 = a21*b12 + a22*b22
    
    # Strassen's method
    q1 = (a11 + a22)*(b11 + b22)
    q2 = (a21 + a22)*b11
    q3 = a11*(b12 - b22)
    q4 = a22 * (-b11 + b21)
    q5 = (a11 + a12)*b22
    q6 = (-a11 + a21)*(b11 + b12)
    q7 = (a12 - a22)*(b21 + b22)
    
    assert(c11 == q1 + q4 - q5 + q7)
    assert(c21 == q2 + q4)
    assert(c12 == q3 + q5)
    assert(c22 == q1 + q3 - q2 + q6)

Since Strassen’s method takes more operations than the traditional method for multiplying 2 × 2 matrices, how can it take fewer operations than the traditional method for multiplying large matrices?

When you apply Strassen’s method to a matrix partitioned into submatrices, its multiplications become matrix multiplications, and its additions become matrix additions. These operations are O(n^2.807) and O(n²) respectively, so saving multiplications at the cost of more additions is a win.
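
To make the recursion concrete, here is a minimal sketch of Strassen’s method for n by n NumPy arrays, assuming n is a power of 2. The crossover threshold of 64 is an arbitrary choice for illustration.

    import numpy as np

    def strassen(A, B, threshold=64):
        n = A.shape[0]
        if n <= threshold:
            return A @ B    # small blocks: fall back to ordinary multiplication
        k = n // 2
        a11, a12, a21, a22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
        b11, b12, b21, b22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]

        # Seven recursive multiplications, exactly as in the 2 x 2 case above
        q1 = strassen(a11 + a22, b11 + b22, threshold)
        q2 = strassen(a21 + a22, b11, threshold)
        q3 = strassen(a11, b12 - b22, threshold)
        q4 = strassen(a22, b21 - b11, threshold)
        q5 = strassen(a11 + a12, b22, threshold)
        q6 = strassen(a21 - a11, b11 + b12, threshold)
        q7 = strassen(a12 - a22, b21 + b22, threshold)

        # Reassemble the four blocks of the product
        return np.block([
            [q1 + q4 - q5 + q7, q3 + q5],
            [q2 + q4, q1 - q2 + q3 + q6]
        ])

    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)
    assert np.allclose(strassen(A, B), A @ B)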

Distribution of eigenvalues for symmetric Gaussian matrix

Symmetric Gaussian matrices

The previous post looked at the distribution of eigenvalues for very general random matrices. In this post we will look at the eigenvalues of matrices with more structure. Fill an n by n matrix A with values drawn from a standard normal distribution and let M be the average of A and its transpose, i.e. M = ½(A + Aᵀ). The eigenvalues will all be real because M is symmetric.

This is called a “Gaussian Orthogonal Ensemble” or GOE. The term is standard but a little misleading because such matrices may not be orthogonal.

Eigenvalue distribution

The joint probability distribution for the eigenvalues of M has three terms: a constant term that we will ignore, an exponential term, and a product term. (Source)

p(x_1, x_2, \ldots, x_n) \propto \exp\left(-\frac{1}{2} \sum_{j=1}^nx_j^2 \right ) \prod_{j < k} |x_j - x_k|

The exponential term is the same as in a multivariate normal distribution. This says the probability density drops off quickly as you go away from the origin, i.e. it’s rare for eigenvalues to be too big. The product term is the product of the distances between each pair of eigenvalues. This says it’s also rare for eigenvalues to be very close together.

(The missing constant to turn the expression above from a proportionality to an equation is whatever it has to be for the right side to integrate to 1. When trying to qualitatively understand a probability density, it usually helps to ignore proportionality constants. They are determined by the rest of the density expression, and they’re often complicated.)

If eigenvalues are neither tightly clumped together, nor too far apart, we’d expect that the distance between them has a distribution with a hump away from zero, and a tail that decays quickly. We will demonstrate this with a simulation, then give an exact distribution.

Python simulation

The following Python code simulates 2 by 2 symmetric Gaussian matrices and records the gap between their two eigenvalues.

    import matplotlib.pyplot as plt
    import numpy as np
    
    n = 2
    reps = 1000
    
    diffs = np.zeros(reps)
    for r in range(reps):
        A = np.random.normal(scale=n**-0.5, size=(n,n)) 
        M = 0.5*(A + A.T)
        w = np.linalg.eigvalsh(M)
        diffs[r] = abs(w[1] - w[0])
    
    plt.hist(diffs, bins=int(reps**0.5))
    plt.show()

This produced the following histogram:

The exact probability distribution is p(s) = s exp(-s²/4)/2. This result is known as “Wigner’s surmise.”

Circular law for random matrices

Suppose you create a large matrix M by filling its components with random values. If M has size n by n, then we require the probability distribution for each entry to have mean 0 and variance 1/n. Then the Girko-Ginibre circular law says that the eigenvalues of M are approximately uniformly distributed in the unit disk in the complex plane. As the size n increases, the distribution converges to a uniform distribution on the unit disk.

The probability distribution need not be normal. It can be any distribution, shifted to have mean 0 and scaled to have variance 1/n, provided the tail of the distribution isn’t so thick that the variance doesn’t exist. If you don’t scale the variance to 1/n you still get a disk, just not the unit disk.

We’ll illustrate the circular law with a uniform distribution. The uniform distribution has mean 1/2 and variance 1/12, so we will subtract 1/2 and multiply each entry by √(12/n).

Here’s our Python code:

    import matplotlib.pyplot as plt
    import numpy as np
    
    n = 100
    M = np.random.random((n,n)) - 0.5
    M *= (12/n)**0.5
    w = np.linalg.eigvals(M)
    plt.scatter(np.real(w), np.imag(w))
    plt.gca().set_aspect(1)
    plt.show()

When n=100 we get the following plot.

eigenvalues of 100 by 100 matrix

When n=1000 we can see the disk filling in more.

eigenvalues of 1000 by 1000 matrix

Note that the points are symmetric about the real axis. All the entries of M are real, so its characteristic polynomial has all real coefficients, and so its roots come in conjugate pairs. If we randomly generated complex entries for M we would not have such symmetry.
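
For comparison, here is a sketch with complex Gaussian entries, scaled so that each entry still has mean 0 and variance 1/n. The eigenvalues still fill out a disk, but they are no longer symmetric about the real axis.

    import matplotlib.pyplot as plt
    import numpy as np

    n = 1000
    # Real and imaginary parts each have variance 1/(2n), so each entry has variance 1/n
    M = (np.random.normal(size=(n,n)) + 1j*np.random.normal(size=(n,n))) / (2*n)**0.5
    w = np.linalg.eigvals(M)
    plt.scatter(np.real(w), np.imag(w))
    plt.gca().set_aspect(1)
    plt.show()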

Related post: Fat-tailed random matrices

Low-rank matrix perturbations

Here are a couple of linear algebra identities that can be very useful, but aren’t that widely known, somewhere between common knowledge and arcane. Neither result assumes any matrix has low rank, but their most common application, at least in my experience, is in the context of something of low rank added to something of full rank.

Sylvester’s determinant theorem

The first is Sylvester’s determinant theorem. If A is an n by k matrix and B is a k by n matrix, then

\det(I_n + AB) = \det(I_k + BA)

where I is an identity matrix whose dimension is given by its subscript. The theorem is true for any k, but it’s especially useful when k is small because then the matrix on the right side is only k by k. If k = 1, the right side is the determinant of a 1 by 1 matrix, i.e. just a number!
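
Here is a quick numerical check of the theorem, a sketch with arbitrarily chosen sizes and random matrices, including the k = 1 case where the right side is just a number.

    import numpy as np

    rng = np.random.default_rng(0)   # arbitrary seed
    n, k = 10, 2
    A = rng.normal(size=(n, k))
    B = rng.normal(size=(k, n))

    # det(I_n + AB) = det(I_k + BA)
    assert np.isclose(np.linalg.det(np.eye(n) + A @ B),
                      np.linalg.det(np.eye(k) + B @ A))

    # k = 1: the right side reduces to the number 1 + BA
    u = rng.normal(size=(n, 1))
    v = rng.normal(size=(1, n))
    assert np.isclose(np.linalg.det(np.eye(n) + u @ v), 1 + (v @ u).item())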

Woodbury matrix inversion lemma

The second is known as the matrix inversion lemma or Woodbury’s matrix identity. It says

\left(A+UCV \right)^{-1} = A^{-1} - A^{-1}U \left(C^{-1}+VA^{-1}U \right)^{-1} VA^{-1}

which is a lot to take in at once. We’ll unpack it a little at a time.

First of all, the matrices have whatever properties are necessary for the equation to make sense: A and C are square, invertible matrices, though possibly of different sizes.

Suppose A is n by n. Then U necessarily has n rows. Let k be the number of columns in U. Then C is k by k and V is k by n.

To simplify things a little, let C be the identity.

\left(A+UV \right)^{-1} = A^{-1} - A^{-1}U \left(I+VA^{-1}U \right)^{-1} VA^{-1}

Think of k as small, maybe even 1. Then UV is a low-rank perturbation of A. Suppose you have computed the inverse of A, or know something about the inverse of A, and want to know how that inverse changes when you change A by adding a rank k matrix to it.

To simplify things further, assume A is the identity matrix. Now the matrix inversion lemma reduces to something similar to Sylvester’s theorem above.

\left(I+UV \right)^{-1} = I - U \left(I+VU \right)^{-1} V

To make things clearer, we’ll add subscripts on the identity matrices as above.

\left(I_n+UV \right)^{-1} = I_n - U \left(I_k+VU \right)^{-1} V

If k is small, then the matrix between U and V on the right side is small. If k = 1, it’s just a number, and its inverse is just its reciprocal.
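
Here is a small numerical check of the Woodbury identity, a sketch with arbitrarily chosen sizes. The shifts by multiples of the identity are just a cheap way to make A and C comfortably invertible.

    import numpy as np

    rng = np.random.default_rng(0)   # arbitrary seed
    n, k = 6, 2
    A = rng.normal(size=(n, n)) + n*np.eye(n)
    U = rng.normal(size=(n, k))
    C = rng.normal(size=(k, k)) + k*np.eye(k)
    V = rng.normal(size=(k, n))

    Ainv = np.linalg.inv(A)
    lhs = np.linalg.inv(A + U @ C @ V)
    rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
    assert np.allclose(lhs, rhs)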

Optimal low-rank matrix approximation

Matrix compression

Suppose you have an m by n matrix A, where m and n are very large, that you’d like to compress. That is, you’d like to come up with an approximation of A that takes less data to describe.

For example, consider a high resolution photo represented as a matrix of gray scale values. An approximation to the matrix could be a lower resolution representation that takes less space.

I heard a rumor many years ago that space probes would compress images by interpreting an image as a matrix and sending back a few eigenvalues and eigenvectors. That sounded fascinating, but what about images that aren’t square? If a matrix M is not square, then you can’t have Mx = λx for a scalar λ because the left and right sides of the equation would have different dimensions.

Truncated SVD

Approximating a rectangular matrix requires something more general than eigenvalues and eigenvectors, namely singular values and singular vectors.

Suppose our matrix A has the singular value decomposition

A = U \Sigma V^*

This can be written as

A = \sum_{i=1}^p \sigma_i u_i v_i^*

where the σi are the singular values and p is the number of singular values that are non-zero. The ui and vi are the ith columns of U and V respectively. Singular values are nonnegative, and listed in decreasing order.

We can form an approximation to A by truncating the sum above. Instead of taking all the singular values, and their corresponding left and right singular vectors, we only take the k largest singular values and their corresponding vectors.

A_k = \sum_{i=1}^k \sigma_i u_i v_i^*

This is called the truncated SVD.

We started by assuming A had dimensions m by n and that both were large. Storing A requires mn numbers. Storing Ak requires only k(m + n + 1) numbers: k singular values, k left singular vectors of length m, and k right singular vectors of length n. This is a big savings if k is much smaller than m and n.

Eckart-Young theorem

So how good an approximation is Ak? It turns out to be optimal in the least squares sense. This is the content of the Eckart-Young theorem. It says that the best least squares (2-norm) approximation of A by a rank k matrix is given by Ak.

Not only that, the theorem says the 2-norm error is given by the first singular value that we didn’t use, i.e.

|| A - A_k ||_2 = \sigma_{k+1}
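
Here is a short numerical illustration, a sketch with arbitrarily chosen dimensions, of the truncated SVD and the error formula above.

    import numpy as np

    rng = np.random.default_rng(0)   # arbitrary seed
    m, n, k = 50, 30, 5
    A = rng.normal(size=(m, n))

    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

    # Rank-k truncated SVD: keep the k largest singular values and their vectors
    Ak = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

    # The 2-norm error is the first singular value left out
    assert np.isclose(np.linalg.norm(A - Ak, 2), sigma[k])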

Least squares solutions to over- or underdetermined systems

It often happens in applications that a linear system of equations Ax = b either does not have a solution or has infinitely many solutions. Applications often use least squares to create a problem that has a unique solution.

Overdetermined systems

Suppose the matrix A has dimensions m by n and the right hand side vector b has dimension m. Then the solution x, if it exists, has to have dimension n. If m > n, i.e. we have more equations than unknowns, the system is overdetermined. The equations will be inconsistent in general. This is typical, for example, in linear regression.

In this case, you can use the least squares criterion to determine a solution. Instead of demanding that Ax = b, we look for an x that makes the difference between Ax and b as small as possible, as measured by the 2-norm (root mean square). That is, we pick x to minimize

|| Ax - b||_2^2

meaning we solve the system as best we can, best as measured by the 2-norm.

Underdetermined systems

If the number of rows in the matrix A, i.e. the number of equations, is less than the number of columns, i.e. the number of unknowns, then the system is underdetermined. In general there will be infinitely many solutions. It’s also possible that there are no solutions because the equations are inconsistent.

In this case, we can use least squares to ensure the problem has a solution, and to decide which of the many candidates to choose. Among all the values of x that minimize the residual

|| Ax - b ||_2^2

we pick the one with the smallest norm ||x||_2. If there are values of x that satisfy Ax = b exactly, this picks the exact solution with the least norm.

Least squares solutions and SVD

If the singular value decomposition (SVD) of a real matrix A is given by

A = U \Sigma V^T

and A has rank r, then the least squares solution to the system Ax = b is given explicitly by

x = \sum_{i=1}^r \frac{u_i^T b}{\sigma_i} v_i

where the u's are the columns of U and the v's are the columns of V.

Note that the denominators are not zero; the fact that A has rank r means that it has r positive singular values.

Furthermore, the least squares residual, i.e. the degree to which Ax differs from b, is given by

||Ax -b ||_2^2 = \sum_{i = r+1}^m (u_i^Tb)^2

Note that if the matrix A has rank m, then the least squares problem can be solved exactly, and the right side above is an empty sum.
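
The following sketch, using an arbitrary small overdetermined system, checks the solution formula and the residual formula above against NumPy's least squares solver.

    import numpy as np

    rng = np.random.default_rng(0)   # arbitrary seed
    m, n = 5, 3
    A = rng.normal(size=(m, n))
    b = rng.normal(size=m)

    U, sigma, Vt = np.linalg.svd(A)   # full SVD: U is m by m, Vt is n by n
    r = np.sum(sigma > 1e-12)         # numerical rank

    # x = sum of (u_i^T b / sigma_i) v_i over the r nonzero singular values
    x = sum((U[:, i] @ b) / sigma[i] * Vt[i, :] for i in range(r))
    x_np, *_ = np.linalg.lstsq(A, b, rcond=None)
    assert np.allclose(x, x_np)

    # The squared residual is the sum of (u_i^T b)^2 over the remaining columns of U
    assert np.isclose(np.linalg.norm(A @ x - b)**2,
                      sum((U[:, i] @ b)**2 for i in range(r, m)))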

See, for example, Golub and Van Loan’s classic book Matrix Computations.

Computing SVD and pseudoinverse

In a nutshell, given the singular value decomposition of a matrix A,

A = U \Sigma V^*

the Moore-Penrose pseudoinverse is given by

A^+ = V \Sigma^+ U^*.

This post will explain what the terms above mean, and how to compute them in Python and in Mathematica.

Singular Value Decomposition (SVD)

The singular value decomposition of a matrix is a sort of change of coordinates that makes the matrix simple, a generalization of diagonalization.

Matrix diagonalization

If a square matrix A is diagonalizable, then there is a matrix P such that

A = P D P^{-1}

where the matrix D is diagonal. You could think of P as a change of coordinates that makes the action of A as simple as possible. The elements on the diagonal of D are the eigenvalues of A and the columns of P are the corresponding eigenvectors.

Unfortunately not all matrices can be diagonalized. Singular value decomposition is a way to do something like diagonalization for any matrix, even non-square matrices.

Generalization to SVD

Singular value decomposition generalizes diagonalization. The matrix Σ in SVD is analogous to D in diagonalization. Σ is diagonal, though it may not be square. The matrices on either side of Σ are analogous to the matrix P in diagonalization, though now there are two different matrices, and they are not necessarily inverses of each other. The matrices U and V are square, but not necessarily of the same dimension.

The elements along the diagonal of Σ are not necessarily eigenvalues but singular values, which are a generalization of eigenvalues. Similarly the columns of U and V are not necessarily eigenvectors but left singular vectors and right singular vectors respectively.

The star superscript indicates conjugate transpose. If a matrix has all real components, then the conjugate transpose is just the transpose. But if the matrix has complex entries, you take the conjugate and transpose each entry.

The matrices U and V are unitary. A matrix M is unitary if its inverse is its conjugate transpose, i.e. M* M = MM* = I.

Pseudoinverse and SVD

The (Moore-Penrose) pseudoinverse of a matrix generalizes the notion of an inverse, somewhat like the way SVD generalized diagonalization. Not every matrix has an inverse, but every matrix has a pseudoinverse, even non-square matrices.

Computing the pseudoinverse from the SVD is simple.

If

A = U \Sigma V^*

then

A^+ = V \Sigma^+ U^*

where Σ+ is formed from Σ by taking the reciprocal of all the non-zero elements, leaving all the zeros alone, and making the matrix the right shape: if Σ is an m by n matrix, then Σ+ must be an n by m matrix.

We’ll give examples below in Mathematica and Python.

Computing SVD in Mathematica

Let’s start with the matrix A below.

A = \begin{bmatrix} 2 & -1 & 0 \\ 4 & 3 & -2 \end{bmatrix}

We can find the SVD of A with the following Mathematica commands.

    A = {{2, -1, 0}, {4, 3, -2}}
    {U, S, V} = SingularValueDecomposition[A]

From this we learn that the singular value decomposition of A is

\begin{bmatrix} \frac{1}{\sqrt{26}} & -\frac{5}{\sqrt{26}} \\ \frac{5}{\sqrt{26}} & \frac{1}{\sqrt{26}} \\ \end{bmatrix} \begin{bmatrix} \sqrt{30} & 0 & 0 \\ 0 & 2 & 0 \end{bmatrix} \begin{bmatrix} \frac{11}{\sqrt{195}} & \frac{7}{\sqrt{195}} & -\sqrt{\frac{5}{39}} \\ -\frac{3}{\sqrt{26}} & 2 \sqrt{\frac{2}{13}} & -\frac{1}{\sqrt{26}} \\ \frac{1}{\sqrt{30}} & \sqrt{\frac{2}{15}} & \sqrt{\frac{5}{6}} \end{bmatrix}

Note that the last matrix is not V but the transpose of V. Mathematica returns V itself, not its transpose.

If we multiply the matrices back together we can verify that we get A back.

    U . S . Transpose[V]

This returns

    {{2, -1, 0}, {4, 3, -2}}

as we’d expect.

Computing pseudoinverse in Mathematica

The Mathematica command for computing the pseudoinverse is simply PseudoInverse. (The best thing about Mathematica is its consistent, predictable naming.)

    PseudoInverse[A]

This returns

    {{19/60, 1/12}, {-(11/30), 1/6}, {1/12, -(1/12)}}

And we can confirm that computing the pseudoinverse via the SVD

    Sp = {{1/Sqrt[30], 0}, {0, 1/2}, {0, 0}}
    V . Sp . Transpose[U]

gives the same result.

Computing SVD in Python

Next we compute the singular value decomposition in Python (NumPy).

    >>> a = np.matrix([[2, -1, 0],[4,3,-2]])
    >>> u, s, vt = np.linalg.svd(a, full_matrices=True)

Note that np.linalg.svd returns the transpose of V, not the V in the definition of singular value decomposition.

Also, the object s is not the diagonal matrix Σ but a vector containing only the diagonal elements, i.e. just the singular values. This can save a lot of space if the matrix is large. The NumPy method svd has other efficiency-related options that I won’t go into here.

We can verify that the SVD is correct by turning s back into a matrix and multiplying the components together.

    >>> ss = np.matrix([[s[0], 0, 0], [0, s[1], 0]])
    >>> u*ss*vt

This returns the matrix A, within floating point accuracy. Since Python is doing floating point computations, not symbolic calculation like Mathematica, the zero in A turns into -3.8e-16.

Note that the singular value decompositions as computed by Mathematica and Python differ in a few signs here and there; the SVD is not unique.

Computing pseudoinverse in Python

The pseudoinverse can be computed in NumPy with np.linalg.pinv.

    >>> np.linalg.pinv(a)
    matrix([[ 0.31666667,  0.08333333],
            [-0.36666667,  0.16666667],
            [ 0.08333333, -0.08333333]])

This returns the same result as Mathematica above, up to floating point precision.
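
As in the Mathematica session, we can also build the pseudoinverse from the SVD pieces computed above. Here is a quick check, reusing u, s, and vt from the session above.

    >>> sp = np.matrix([[1/s[0], 0], [0, 1/s[1]], [0, 0]])
    >>> np.allclose(vt.T * sp * u.T, np.linalg.pinv(a))
    True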

Review of Matrix Mathematics

Bernstein’s Matrix Mathematics is impressive. It’s over 1500 pages and weighs 5.3 pounds (2.4 kg). It’s a reference book, not the kind of book you just sit down to read. (Actually, I have sat down to read parts of it.) I’d used a library copy of the first edition, and so when Princeton University Press offered me a review copy of the second edition, I jumped on it.

Matrix Mathematics has a lot of information on linear algebra. As you’d expect from the title, it’s mostly about linear algebra. And despite its enormous size, it’s also dense. Mostly definitions and theorem statements. Some examples and proofs, but mostly statements of facts.

But there are a lot of other topics covered too. For example, I was surprised to see a section on Bell polynomials, a topic I ran across in my work and blogged about not long ago.

Why even have reference books these days when you can easily find so much online? For one thing, there’s still a lot you can’t easily find online. When you go beyond commonly known material, as this book does, it gets hard to search for what you need.

For another, an author goes to tremendous effort to arrange the information coherently. When you read a book you find things you didn’t know to search for. Maybe you start by looking up what you think you need to know in the index, but then you find out from context what you really needed to know.

I’m glad to add this to the books I keep close at hand. I can find what I need quickly, and that’s more important than it may seem. If I save a couple minutes, the benefit is not just that I get a couple more minutes work done. The main benefit is that I increase my chances of acting on an inspiration before it evaporates.

Moore-Penrose pseudoinverse is not an adjoint

The Moore-Penrose pseudoinverse of a matrix is a way of coming up with something like an inverse for a matrix that doesn’t have an inverse. If a matrix does have an inverse, then the pseudoinverse is in fact the inverse. The Moore-Penrose pseudoinverse is also called a generalized inverse for this reason: it’s not just like an inverse, it actually is an inverse when that’s possible.

Given an m by n matrix A, the Moore-Penrose pseudoinverse A+ is the unique n by m matrix satisfying four conditions:

  1. A A+ A = A
  2. A+ A A+ = A+
  3. (A A+)* = A A+
  4. (A+ A)* = A+ A

The first equation says that A A+ is a left identity for A, and A+ A is a right identity for A.

The second equation says A+A is a left identity for A+, and A A+ is a right identity for A+.

The third and fourth equations say that A A+ and A+A are Hermitian.

If A is invertible, A A+ and A+A are both the identity matrix. Otherwise A A+ and A+A act an awful lot like the identity, as much as you could expect, maybe a little more than you’d expect.
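
Here is a quick numerical check, a sketch with an arbitrary random matrix, that NumPy's pinv satisfies the four conditions.

    import numpy as np

    rng = np.random.default_rng(0)   # arbitrary seed
    A = rng.normal(size=(4, 6))
    P = np.linalg.pinv(A)

    assert np.allclose(A @ P @ A, A)        # 1. A A+ A = A
    assert np.allclose(P @ A @ P, P)        # 2. A+ A A+ = A+
    assert np.allclose((A @ P).T, A @ P)    # 3. A A+ is symmetric (Hermitian)
    assert np.allclose((P @ A).T, P @ A)    # 4. A+ A is symmetric (Hermitian)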

Update: See this post for the relationship between the singular value decomposition and pseudoinverses, and how to compute both in Python and Mathematica.

Galois connections and adjoints

John Baez recently wrote that a Galois connection, a kind of categorical adjunction, is

“the best approximation to reversing a computation that can’t be reversed.”

That sounds like a pseudoinverse! And the first two equations defining a pseudoinverse look a lot like things you’ll see in the context of adjunctions, so the pseudoinverse must be an adjunction, right?

The question was raised on MathOverflow and Michal R. Przybylek answered

I do not think the concept of Moore-Penrose Inverse and the concept of categorical adjunction have much in common (except they both try to generalise the concept of inverse) …

and gives several reasons why. (Emphasis added.)

Too bad. It would have made a good connection. Applied mathematicians are likely to be familiar with Moore-Penrose pseudoinverses but not categorical adjoints. And pure mathematicians, depending on their interests, may be more familiar with adjoint functors than matrix pseudoinverses.

So what about John Baez’ comment? His comment was expository (and very helpful) but not meant to be rigorous. To make it rigorous you’d have to be rigorous about what you mean by “best approximation” etc. And when you define your terms carefully, in the language of category theory, you get adjoints. This means that the Moore-Penrose inverse, despite its many nice properties [1], doesn’t mesh well with categorical definitions. It’s not the best approximate inverse from a categorical perspective because it doesn’t compose well, and category theory values composition above all else. The Moore-Penrose pseudoinverse may be the best approximate inverse from some perspectives, but not from a categorical perspective.

Przybylek explains

… adjunctions compose … but Moore-Penrose pseudoinverses—generally—do not. … pseudoinverses are not stable under isomorphisms, thus are not categorical.

That’s the gist of his final point. Now let me fill in and expand slightly part of what I cut out.

If f: A → B is left adjoint to f+: B → A and g: B → C is left adjoint to g+: C → B, then the composition gf: A → C is left adjoint to the composition f+g+: C → A, but Moore-Penrose pseudoinverses do not compose this way in general.

This turns out to be an interesting example, but not of what I first expected. Rather than the pseudoinverse of a matrix being an example of an adjoint, it is an example of something that despite having convenient properties does not compose well from a categorical perspective.
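
Here is a tiny numerical counterexample, my own and not from the MathOverflow answer, showing that the pseudoinverse of a product is not the product of the pseudoinverses in general.

    import numpy as np

    A = np.array([[1.0, 1.0]])       # 1 by 2
    B = np.array([[1.0], [0.0]])     # 2 by 1

    lhs = np.linalg.pinv(A @ B)                   # pseudoinverse of the product: [[1.]]
    rhs = np.linalg.pinv(B) @ np.linalg.pinv(A)   # product of pseudoinverses: [[0.5]]
    assert not np.allclose(lhs, rhs)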

[1] The book Matrix Mathematics devotes about 40 pages to stating theorems about the Moore-Penrose pseudoinverse.