The first time you see matrices, if someone asked you how you multiply two matrices together, your first idea might be to multiply every element of the first matrix by the element in the same position of the second matrix, analogous to the way you add matrices.

But that’s not usually how we multiply matrices. That notion of multiplication hardly involves the matrix structure; it treats the matrix as an ordered container of numbers, but not as a way of representing a linear transformation. Once you have a little experience with linear algebra, the customary way of multiplying matrices seems natural, and the way that may have seemed natural at first glance seems kinda strange.

The componentwise product of matrices is called the Hadamard product or sometimes the Schur product. Given two m by n matrices A and B, the Hadamard product of A and B, written A ∘ B, is the m by n matrix C with elements given by

$$c_{ij} = a_{ij} b_{ij}.$$

Because the Hadamard product hardly uses the linear structure of a matrix, you wouldn’t expect it to interact nicely with operations that depend critically on the linear structure. And yet we can give a couple theorems that do show a nice interaction, at least when A and B are positive semi-definite matrices.

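The first is a determinant inequality that goes along with the Schur product theorem. The Schur product theorem itself says that the Hadamard product of two positive semi-definite matrices is again positive semi-definite; the determinant bound is usually called Oppenheim’s inequality. It says that if A and B are positive semi-definite n by n matrices, then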

det(A ∘ B) ≥ det(A) det(B)

where det stands for determinant.

Also, there is the following theorem of Pólya and Szegö. Assume A and B are symmetric positive semi-definite n by n matrices. If the eigenvalues of A and B, listed in increasing order, are $\alpha_i$ and $\beta_i$ respectively, then for every eigenvalue λ of A ∘ B, we have

$$\alpha_1 \beta_1 \le \lambda \le \alpha_n \beta_n.$$

## Python implementation

If you multiply two (multidimensional) arrays in NumPy, you’ll get the componentwise product. So if you multiply two matrices as arrays you’ll get the Hadamard product, but if you multiply them as matrices you’ll get the usual matrix product. We’ll illustrate that below. Note that the function eigvalsh returns the eigenvalues of a matrix. The name may look a little strange, but the “h” on the end stands for “Hermitian.” We’re telling NumPy that the matrix is Hermitian so it can run software specialized for that case [1].


    from numpy import array, matrix, array_equal, all
    from numpy.linalg import det, eigvalsh

    A = array([
        [3, 1],
        [1, 3]
    ])

    B = array([
        [5, -1],
        [-1, 5]
    ])

    H = array([
        [15, -1],
        [-1, 15]
    ])

    AB = array([
        [14,  2],
        [ 2, 14]
    ])

    # Hadamard product
    assert(array_equal(A*B, H))

    # Ordinary matrix product
    assert(array_equal(A@B, AB))

    # Schur product theorem
    assert(det(H) >= det(A)*det(B))

    # Eigenvalues
    eigA = eigvalsh(A)
    eigB = eigvalsh(B)
    eigH = eigvalsh(A*B)

    lower = eigA[0]*eigB[0]
    upper = eigA[1]*eigB[1]
    assert(all(eigH >= lower))
    assert(all(eigH <= upper))


The code above shows that the eigenvalues of A are [2, 4], the eigenvalues of B are [4, 6], and the eigenvalues of A ∘ B are [14, 16].


[1] For complex matrices, Hermitian means conjugate symmetric, which in the real case reduces to simply symmetric. The theorem of Pólya and Szegö is actually valid for Hermitian matrices, but I simplified the statement for the case of real-valued matrices.

# How fast can you multiply matrices?

Suppose you want to multiply two 2 × 2 matrices together. How many multiplication operations does it take? Apparently 8, and yet in 1969 Volker Strassen discovered that he could do it with 7 multiplications.

## Upper and lower bounds

The obvious way to multiply two n × n matrices takes n³ operations: each entry in the product is the inner product of a row from the first matrix and a column from the second matrix. That amounts to n² inner products, each requiring n multiplications.

You can multiply two square matrices with O(n³) operations with the method described above, and it must take at least on the order of n² operations because the product depends on all of the 2n² entries of the two matrices. Strassen’s result suggests that the optimal algorithm for multiplying matrices takes $O(n^k)$ operations for some k between 2 and 3. By applying Strassen’s algorithm recursively to larger matrices you can get k = log₂ 7 ≈ 2.807.
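To make the exponent concrete, here is a minimal sketch that counts scalar multiplications for the schoolbook method and for Strassen's seven-product scheme applied recursively down to 1 × 1 blocks. The function names are made up for illustration, and n is assumed to be a power of 2.

```python
from math import log2

def naive_mults(n):
    """Scalar multiplications in the schoolbook product of two n x n matrices."""
    return n**3

def strassen_mults(n):
    """Scalar multiplications when the 7-product scheme is applied
    recursively down to 1 x 1 blocks. Assumes n is a power of 2."""
    if n == 1:
        return 1
    return 7 * strassen_mults(n // 2)

for n in [2, 4, 64, 1024]:
    print(n, naive_mults(n), strassen_mults(n))

# The count 7**log2(n) equals n**log2(7), which is where k = log2(7) comes from.
print(log2(7))
```

The count ignores additions, which is why it only tells part of the story for small n.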

The best known value at the moment is k = 2.3728639.

## Bounds on bounds

Yesterday the blog Gödel’s Lost Letter and P = NP posted an article Limits on Matrix Multiplication where they report on recent developments for finding the smallest value of k. A new paper doesn’t report a new value of k, but a limit on what current approaches to the problem can prove. Maybe k can equal 2, but there is a lower bound, strictly bigger than 2, on how small current approaches can go.

## Is this practical?

When I first heard of Strassen’s method, I was told it’s a curious but impractical result. Strassen saved one multiplication at the expense of introducing several more addition operations.

According to the Wikipedia article on matrix multiplication, recursively applying Strassen’s method can save time for n > 100. But there’s more to consider than counting operations. Strassen’s method, and subsequent algorithms, are more complicated. They may not be more efficient in practice even if they use fewer operations because the operations may not vectorize well.

Wikipedia reports that Strassen’s algorithm is not as numerically stable as the traditional approach, but this doesn’t matter when working over finite fields where arithmetic is exact.

## Strassen’s method

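Let’s look at just what Strassen’s method does. We want to find the product of two matrices:

$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}$$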

I started to spell out Strassen’s method in LaTeX equations, but I thought it would be much better to write it out in code so I can be sure that I didn’t make a mistake.

The following Python code randomly fills in the values of the a’s and b’s, computes the c’s using the conventional method, then asserts that you can find these values from the q’s computed from Strassen’s method. Note there is one multiplication in each of the seven q’s.

    from random import randint

    # Fill matrices with random integer values
    a11 = randint(0, 9)
    a12 = randint(0, 9)
    a21 = randint(0, 9)
    a22 = randint(0, 9)
    b11 = randint(0, 9)
    b12 = randint(0, 9)
    b21 = randint(0, 9)
    b22 = randint(0, 9)

    # Conventional method: 8 multiplications
    c11 = a11*b11 + a12*b21
    c12 = a11*b12 + a12*b22
    c21 = a21*b11 + a22*b21
    c22 = a21*b12 + a22*b22

    # Strassen's method: 7 multiplications
    q1 = (a11 + a22)*(b11 + b22)
    q2 = (a21 + a22)*b11
    q3 = a11*(b12 - b22)
    q4 = a22 * (-b11 + b21)
    q5 = (a11 + a12)*b22
    q6 = (-a11 + a21)*(b11 + b12)
    q7 = (a12 - a22)*(b21 + b22)

    assert(c11 == q1 + q4 - q5 + q7)
    assert(c21 == q2 + q4)
    assert(c12 == q3 + q5)
    assert(c22 == q1 + q3 - q2 + q6)


Since Strassen’s method takes more operations than the traditional method for multiplying 2 × 2 matrices, how can it take fewer operations than the traditional method for multiplying large matrices?

When you apply Strassen’s method to a matrix partitioned into submatrices, its multiplications become matrix multiplications, and its additions become matrix additions. These operations are $O(n^{2.807})$ and $O(n^2)$ respectively, so saving multiplications at the cost of more additions is a win.
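The block version can be sketched in a few lines of NumPy. This is an illustration only, not a tuned implementation: the function name `strassen` and its `cutoff` parameter are made up here, and the matrices are assumed to be square with a size that is a power of 2. The seven products are the same q's as in the scalar code, with each entry replaced by a block.

```python
import numpy as np

def strassen(A, B, cutoff=1):
    """Multiply square matrices whose size is a power of 2.
    Falls back to the ordinary product at or below `cutoff`."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    a11, a12, a21, a22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    b11, b12, b21, b22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven recursive block multiplications
    q1 = strassen(a11 + a22, b11 + b22, cutoff)
    q2 = strassen(a21 + a22, b11, cutoff)
    q3 = strassen(a11, b12 - b22, cutoff)
    q4 = strassen(a22, b21 - b11, cutoff)
    q5 = strassen(a11 + a12, b22, cutoff)
    q6 = strassen(a21 - a11, b11 + b12, cutoff)
    q7 = strassen(a12 - a22, b21 + b22, cutoff)

    # Reassemble the product from block additions only
    C = np.empty((n, n), dtype=np.result_type(A, B))
    C[:h, :h] = q1 + q4 - q5 + q7
    C[:h, h:] = q3 + q5
    C[h:, :h] = q2 + q4
    C[h:, h:] = q1 + q3 - q2 + q6
    return C

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(8, 8))
B = rng.integers(0, 10, size=(8, 8))
assert np.array_equal(strassen(A, B), A @ B)
```

In practice one would pick a larger `cutoff` so that small blocks use the ordinary product, since the extra additions dominate for small sizes.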

# Distribution of eigenvalues for symmetric Gaussian matrix

## Symmetric Gaussian matrices

The previous post looked at the distribution of eigenvalues for very general random matrices. In this post we will look at the eigenvalues of matrices with more structure. Fill an n by n matrix A with values drawn from a standard normal distribution and let M be the average of A and its transpose, i.e. $M = \tfrac{1}{2}(A + A^T)$. The eigenvalues will all be real because M is symmetric.

This is called a “Gaussian Orthogonal Ensemble” or GOE. The term is standard but a little misleading because such matrices may not be orthogonal.

## Eigenvalue distribution

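The joint probability distribution for the eigenvalues of M has three terms: a constant term that we will ignore, an exponential term, and a product term. (Source)

$$p(\lambda_1, \ldots, \lambda_n) \propto \exp\left(-\frac{1}{2}\sum_{i=1}^n \lambda_i^2\right) \prod_{i<j} |\lambda_i - \lambda_j|$$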

The exponential term is the same as in a multivariate normal distribution. This says the probability density drops off quickly as you go away from the origin, i.e. it’s rare for eigenvalues to be too big. The product term multiplies the distances between each pair of eigenvalues. This says it’s also rare for eigenvalues to be very close together.

(The missing constant to turn the expression above from a proportionality to an equation is whatever it has to be for the right side to integrate to 1. When trying to qualitatively understand a probability density, it usually helps to ignore proportionality constants. They are determined by the rest of the density expression, and they’re often complicated.)

If eigenvalues are neither tightly clumped together, nor too far apart, we’d expect that the distance between them has a distribution with a hump away from zero, and a tail that decays quickly. We will demonstrate this with a simulation, then give an exact distribution.

## Python simulation

The following Python code simulates 2 by 2 Gaussian matrices.

    import matplotlib.pyplot as plt
    import numpy as np

    n = 2
    reps = 1000

    # Collect the gap between the two eigenvalues of each random matrix
    diffs = np.zeros(reps)
    for r in range(reps):
        A = np.random.normal(scale=n**-0.5, size=(n,n))
        M = 0.5*(A + A.T)
        w = np.linalg.eigvalsh(M)
        diffs[r] = abs(w[1] - w[0])

    plt.hist(diffs, bins=int(reps**0.5))
    plt.show()


This produced the following histogram:

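The exact probability distribution is p(s) = s exp(-s²/4)/2, a result known as “Wigner’s surmise.” That density is for A filled with standard normal values, as in the definition of M above. The simulation code scales the entries of A by n^(-1/2), which for n = 2 shrinks the spacings by a factor of √2, so the histogram it produces follows s exp(-s²/2) instead.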
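As a rough numerical check on the spacing density, the sketch below compares the average simulated spacing with the mean of the corresponding density. The density s exp(-s²/4)/2 has mean √π, and its rescaled version s exp(-s²/2) has mean √(π/2). The helper name `mean_spacing`, the seed, and the number of repetitions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
reps = 20000

def mean_spacing(scale):
    """Average eigenvalue gap of M = (A + A.T)/2 for 2 x 2 Gaussian A."""
    A = rng.normal(scale=scale, size=(reps, 2, 2))
    M = 0.5 * (A + np.transpose(A, (0, 2, 1)))
    w = np.linalg.eigvalsh(M)          # shape (reps, 2), ascending order
    return np.mean(w[:, 1] - w[:, 0])

unscaled = mean_spacing(1.0)           # standard normal entries
scaled = mean_spacing(2**-0.5)         # entries scaled by n**-0.5 with n = 2

print(unscaled, np.sqrt(np.pi))        # mean of s exp(-s^2/4)/2
print(scaled, np.sqrt(np.pi / 2))      # mean of s exp(-s^2/2)
```

Matching means do not prove the densities agree, but a mismatch would flag a scaling error quickly.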

# Circular law for random matrices

Suppose you create a large matrix M by filling its components with random values. If M has size n by n, then we require the probability distribution for each entry to have mean 0 and variance 1/n. Then the Girko-Ginibre circular law says that the eigenvalues of M are approximately uniformly distributed in the unit disk in the complex plane. As the size n increases, the distribution converges to a uniform distribution on the unit disk.

The probability distribution need not be normal. It can be any distribution, shifted to have mean 0 and scaled to have variance 1/n, provided the tail of the distribution isn’t so thick that the variance doesn’t exist. If you don’t scale the variance to 1/n you still get a circle, just not a unit circle.

We’ll illustrate the circular law with a uniform distribution. The uniform distribution on [0, 1] has mean 1/2 and variance 1/12, so we will subtract 1/2 and multiply each entry by √(12/n).

Here’s our Python code:

    import matplotlib.pyplot as plt
    import numpy as np

    n = 100
    M = np.random.random((n,n)) - 0.5
    M *= (12/n)**0.5
    w = np.linalg.eigvals(M)
    plt.scatter(np.real(w), np.imag(w))
    # Use the current axes; plt.axes() would create a new, empty set of axes
    plt.gca().set_aspect(1)
    plt.show()


When n=100 we get the following plot.

When n=1000 we can see the disk filling in more.

Note that the points are symmetric about the real axis. All the entries of M are real, so its characteristic polynomial has all real coefficients, and so its roots come in conjugate pairs. If we randomly generated complex entries for M we would not have such symmetry.
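Two of the claims above can be checked without a plot. The following sketch, with an arbitrary seed and n = 200, prints the largest eigenvalue modulus, which should be near 1, and tests that the spectrum is unchanged by complex conjugation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
M = (rng.random((n, n)) - 0.5) * (12 / n) ** 0.5
w = np.linalg.eigvals(M)

# Largest modulus should be close to 1 for large n.
radius = np.abs(w).max()
print(radius)

# The spectrum of a real matrix is closed under complex conjugation,
# so sorting the eigenvalues and their conjugates gives the same list.
symmetric = np.allclose(np.sort_complex(w), np.sort_complex(np.conj(w)))
print(symmetric)
```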

Related post: Fat-tailed random matrices

# Low-rank matrix perturbations

Here are a couple of linear algebra identities that can be very useful, but aren’t that widely known, somewhere between common knowledge and arcane. Neither result assumes any matrix has low rank, but their most common application, at least in my experience, is in the context of something of low rank added to something of full rank.

## Sylvester’s determinant theorem

The first is Sylvester’s determinant theorem. If A is an n by k matrix and B is a k by n matrix, then

$$\det(I_n + AB) = \det(I_k + BA)$$

where I is an identity matrix whose dimension is given by its subscript. The theorem is true for any k, but it’s especially useful when k is small because the matrices on the right side are of size k. If k = 1, the right side is the determinant of a 1 by 1 matrix, i.e. just a number!
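Here is a quick numerical check of the theorem in the form det(Iₙ + AB) = det(Iₖ + BA), using random matrices with arbitrarily chosen sizes n = 6 and k = 2, followed by the k = 1 case where the right side is just 1 plus a dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
A = rng.normal(size=(n, k))
B = rng.normal(size=(k, n))

left = np.linalg.det(np.eye(n) + A @ B)    # n x n determinant
right = np.linalg.det(np.eye(k) + B @ A)   # k x k determinant
print(left, right)

# With k = 1 the right side is 1 plus a dot product.
a = rng.normal(size=(n, 1))
b = rng.normal(size=(1, n))
rank_one = np.linalg.det(np.eye(n) + a @ b)
print(rank_one, 1 + (b @ a).item())
```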

## Woodbury matrix inversion lemma

The second is known as the matrix inversion lemma or Woodbury’s matrix identity. It says

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1}$$

which is a lot to take in at once. We’ll unpack it a little at a time.

First of all, the matrices have whatever properties are necessary for the equation to make sense: A and C are square, invertible matrices, though possibly of different sizes.

Suppose A is n by n. Then U necessarily has n rows. Let k be the number of columns in U. Then C is k by k and V is k by n.

To simplify things a little, let C be the identity.

Think of k as small, maybe even 1. Then UV is a low-rank perturbation of A. Suppose you have computed the inverse of A, or know something about the inverse of A, and want to know how that inverse changes when you change A by adding a rank k matrix to it.

To simplify things further, assume A is the identity matrix. Now the matrix inversion lemma reduces to something similar to Sylvester’s theorem above.

$$(I + UV)^{-1} = I - U(I + VU)^{-1}V$$

To make things clearer, we’ll add subscripts on the identity matrices as above.

$$(I_n + UV)^{-1} = I_n - U(I_k + VU)^{-1}V$$

If k is small, then the matrix between U and V on the right side is small. If k = 1, it’s just a number, and its inverse is just its reciprocal.
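The general identity, (A + UCV)⁻¹ = A⁻¹ − A⁻¹U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹, can be checked numerically. The sketch below assumes a diagonal A so that its inverse is cheap, takes C to be the identity, and uses a small perturbation so that everything stays invertible; the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = np.diag(rng.uniform(1, 2, size=n))      # easy-to-invert full-rank matrix
U = 0.1 * rng.normal(size=(n, k))
C = np.eye(k)
V = 0.1 * rng.normal(size=(k, n))

A_inv = np.linalg.inv(A)
C_inv = np.linalg.inv(C)

# Besides the known inverses of A and C, only a k x k matrix is inverted.
small = np.linalg.inv(C_inv + V @ A_inv @ U)
woodbury = A_inv - A_inv @ U @ small @ V @ A_inv

direct = np.linalg.inv(A + U @ C @ V)
print(np.allclose(woodbury, direct))
```

The payoff is the size of `small`: updating a known n by n inverse costs one k by k inversion plus matrix products.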

# Optimal low-rank matrix approximation

## Matrix compression

Suppose you have an m by n matrix A, where m and n are very large, that you’d like to compress. That is, you’d like to come up with an approximation of A that takes less data to describe.

For example, consider a high resolution photo viewed as a matrix of gray scale values. An approximation to the matrix could be a lower resolution representation that takes less space.

I heard a rumor many years ago that space probes would compress images by interpreting an image as a matrix and sending back a few eigenvalues and eigenvectors. That sounded fascinating, but what about images that aren’t square? If a matrix M is not square, then you can’t have Mx = λx for a scalar λ because the left and right sides of the equation would have different dimensions.

## Truncated SVD

Approximating a rectangular matrix requires something more general than eigenvalues and eigenvectors, namely singular values and singular vectors.

Suppose our matrix A has the singular value decomposition

$$A = U \Sigma V^*.$$

This can be written as

$$A = \sum_{i=1}^p \sigma_i u_i v_i^*$$

where the $\sigma_i$ are the singular values and p is the number of singular values that are non-zero. The $u_i$ and $v_i$ are the ith columns of U and V respectively. Singular values are nonnegative, and listed in decreasing order.

We can form an approximation to A by truncating the sum above. Instead of taking all the singular values, and their corresponding left and right singular vectors, we only take the k largest singular values and their corresponding vectors.

$$A_k = \sum_{i=1}^k \sigma_i u_i v_i^*$$

This is called the truncated SVD.

We started by assuming A had dimensions m by n and that both were large. Storing A requires mn numbers. Storing $A_k$ requires only k(1 + m + n) numbers: k singular values, k vectors of length m, and k vectors of length n. This is a big savings if k is much smaller than m and n.

## Eckart-Young theorem

So how good an approximation is $A_k$? Turns out it is optimal, in the least squares sense. This is the content of the Eckart-Young theorem. It says that the best least squares (2-norm) approximation of A by a rank k matrix is given by $A_k$.

Not only that, the theorem says the 2-norm error is given by the first singular value that we didn’t use, i.e.

$$\|A - A_k\|_2 = \sigma_{k+1}.$$
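The truncated SVD and its error can be tried out with NumPy. This sketch uses a random 50 by 30 matrix and k = 5, all arbitrary choices. It counts storage as one singular value, one vector of length m, and one vector of length n per retained term, and compares the spectral-norm error with the first discarded singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 50, 30, 5
A = rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values and their singular vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

storage_full = m * n
storage_truncated = k * (1 + m + n)
print(storage_full, storage_truncated)

# Spectral-norm error versus the first discarded singular value.
# s is zero-indexed, so s[k] is the (k+1)st singular value.
error = np.linalg.norm(A - A_k, 2)
print(error, s[k])
```

A random Gaussian matrix compresses poorly because its singular values decay slowly; images and other structured data usually do much better.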

# Least squares solutions to over- or underdetermined systems

It often happens in applications that a linear system of equations Ax = b either does not have a solution or has infinitely many solutions. Applications often use least squares to create a problem that has a unique solution.

## Overdetermined systems

Suppose the matrix A has dimensions m by n and the right hand side vector b has dimension m. Then the solution x, if it exists, has to have dimension n. If m > n, i.e. we have more equations than unknowns, the system is overdetermined. The equations will be inconsistent in general. This is typical, for example, in linear regression.

In this case, you can use the least squares criterion to determine a solution. Instead of demanding that Ax = b, we look for an x that makes the difference between Ax and b as small as possible, as measured by the 2-norm (root mean square). That is, we pick x to minimize

$$\|Ax - b\|_2$$

meaning we solve the system as best we can, best as measured by the 2-norm.

## Underdetermined systems

If the number of rows in the matrix A, i.e. the number of equations, is less than the number of columns, i.e. the number of unknowns, then the system is underdetermined. In general there will be infinitely many solutions. It’s also possible that there are no solutions because the equations are inconsistent.

In this case, we can use least squares to assure that a solution exists, and to decide which of the many possible solutions to choose. Here we want to find the x that minimizes

$$\|x\|_2 \quad \text{subject to } x \text{ minimizing } \|Ax - b\|_2.$$

If there are values of x that satisfy Ax = b then this will choose the solution with least norm.
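Both cases can be tried with NumPy's `np.linalg.lstsq`, which is documented to return the minimizer of the 2-norm of Ax − b and, when there are several minimizers, the one of smallest norm. The matrices below are random and their sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined: 8 equations, 3 unknowns.
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)
x, *_ = np.linalg.lstsq(A, b, rcond=None)
# At the minimizer the residual is orthogonal to the columns of A.
print(A.T @ (A @ x - b))

# Underdetermined: 3 equations, 8 unknowns.
A2 = rng.normal(size=(3, 8))
b2 = rng.normal(size=3)
x2, *_ = np.linalg.lstsq(A2, b2, rcond=None)
print(np.allclose(A2 @ x2, b2))

# Adding a null-space vector gives another exact solution with a larger norm.
null_vec = np.linalg.svd(A2)[2][-1]      # last right singular vector
other = x2 + null_vec
print(np.allclose(A2 @ other, b2), np.linalg.norm(x2) < np.linalg.norm(other))
```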

## Least squares solutions and SVD

If the singular value decomposition (SVD) of a real matrix A is given by

$$A = U \Sigma V^T$$

and A has rank r, then the least squares solution to the system Ax = b is given explicitly by

$$x = \sum_{i=1}^r \frac{u_i^T b}{\sigma_i} v_i$$

where the u’s are the columns of U and the v’s are the columns of V.

Note that the denominators are not zero; the fact that A has rank r means that it has r positive singular values.

Furthermore, the least squares residual, i.e. the degree to which Ax differs from b, is given by

$$\|Ax - b\|_2^2 = \sum_{i=r+1}^m (u_i^T b)^2.$$

Note that if the matrix A has rank m, then the least squares problem can be solved exactly, and the right side above is an empty sum.
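The SVD formulas can be spelled out directly. The sketch below assumes the solution is the sum over i ≤ r of (uᵢᵀb/σᵢ)vᵢ and the squared residual is the sum over i > r of (uᵢᵀb)², as in Golub and Van Loan, and compares both against `np.linalg.lstsq` on a random 8 by 3 system.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

U, s, Vt = np.linalg.svd(A)              # full U (m x m) and V^T (n x n)
r = np.linalg.matrix_rank(A)

# x = sum over the first r terms of (u_i . b / sigma_i) v_i
x_svd = sum((U[:, i] @ b) / s[i] * Vt[i] for i in range(r))

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_svd, x_ref))

# Squared residual = sum of (u_i . b)^2 over the unused left singular vectors.
residual_sq = np.linalg.norm(A @ x_svd - b) ** 2
unused = sum((U[:, i] @ b) ** 2 for i in range(r, m))
print(residual_sq, unused)
```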

See, for example, Golub and Van Loan’s classic book Matrix Computations.

# Computing SVD and pseudoinverse

In a nutshell, given the singular value decomposition of a matrix A,

$$A = U \Sigma V^*,$$

the Moore-Penrose pseudoinverse is given by

$$A^+ = V \Sigma^+ U^*.$$

This post will explain what the terms above mean, and how to compute them in Python and in Mathematica.

## Singular Value Decomposition (SVD)

The singular value decomposition of a matrix is a sort of change of coordinates that makes the matrix simple, a generalization of diagonalization.

### Matrix diagonalization

If a square matrix A is diagonalizable, then there is a matrix P such that

$$A = P D P^{-1}$$

where the matrix D is diagonal. You could think of P as a change of coordinates that makes the action of A as simple as possible. The elements on the diagonal of D are the eigenvalues of A and the columns of P are the corresponding eigenvectors.
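For a concrete instance of the factorization A = P D P⁻¹, here is a small check with `np.linalg.eig`, which returns the eigenvalues and a matrix whose columns are the corresponding eigenvectors. The 2 by 2 matrix is an arbitrary example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigenvalues, P = np.linalg.eig(A)   # columns of P are eigenvectors
D = np.diag(eigenvalues)

print(np.allclose(A, P @ D @ np.linalg.inv(P)))
# Equivalent statement, column by column: A p_i = lambda_i p_i
print(np.allclose(A @ P, P @ D))
```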

Unfortunately not all matrices can be diagonalized. Singular value decomposition is a way to do something like diagonalization for any matrix, even non-square matrices.

### Generalization to SVD

Singular value decomposition generalizes diagonalization. The matrix Σ in SVD is analogous to D in diagonalization. Σ is diagonal, though it may not be square. The matrices on either side of Σ are analogous to the matrix P in diagonalization, though now there are two different matrices, and they are not necessarily inverses of each other. The matrices U and V are square, but not necessarily of the same dimension.

The elements along the diagonal of Σ are not necessarily eigenvalues but singular values, which are a generalization of eigenvalues. Similarly the columns of U and V are not necessarily eigenvectors but left singular vectors and right singular vectors respectively.

The star superscript indicates conjugate transpose. If a matrix has all real components, then the conjugate transpose is just the transpose. But if the matrix has complex entries, you take the complex conjugate of each entry and transpose the matrix.

The matrices U and V are unitary. A matrix M is unitary if its inverse is its conjugate transpose, i.e. M* M = MM* = I.

## Pseudoinverse and SVD

The (Moore-Penrose) pseudoinverse of a matrix generalizes the notion of an inverse, somewhat like the way SVD generalized diagonalization. Not every matrix has an inverse, but every matrix has a pseudoinverse, even non-square matrices.

Computing the pseudoinverse from the SVD is simple.

If

$$A = U \Sigma V^*$$

then

$$A^+ = V \Sigma^+ U^*$$

where $\Sigma^+$ is formed from Σ by taking the reciprocal of all the non-zero elements, leaving all the zeros alone, and making the matrix the right shape: if Σ is an m by n matrix, then $\Sigma^+$ must be an n by m matrix.

We’ll give examples below in Mathematica and Python.

### Computing SVD in Mathematica

We can find the SVD of A with the following Mathematica commands.

    A = {{2, -1, 0}, {4, 3, -2}}
    {U, S, V} = SingularValueDecomposition[A]


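From this we learn that the singular value decomposition of A is, up to the sign of each pair of singular vectors,

$$\begin{pmatrix} 2 & -1 & 0 \\ 4 & 3 & -2 \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{26}} & \frac{5}{\sqrt{26}} \\ \frac{5}{\sqrt{26}} & -\frac{1}{\sqrt{26}} \end{pmatrix} \begin{pmatrix} \sqrt{30} & 0 & 0 \\ 0 & 2 & 0 \end{pmatrix} \begin{pmatrix} \frac{11}{\sqrt{195}} & \frac{7}{\sqrt{195}} & -\frac{5}{\sqrt{195}} \\ \frac{3}{\sqrt{26}} & -\frac{4}{\sqrt{26}} & \frac{1}{\sqrt{26}} \\ \frac{1}{\sqrt{30}} & \frac{2}{\sqrt{30}} & \frac{5}{\sqrt{30}} \end{pmatrix}.$$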

Note that the last matrix is not V but the transpose of V. Mathematica returns V itself, not its transpose.

If we multiply the matrices back together we can verify that we get A back.

    U . S. Transpose[V]


This returns

    {{2, -1, 0}, {4, 3, -2}}


as we’d expect.

### Computing pseudoinverse in Mathematica

The Mathematica command for computing the pseudoinverse is simply PseudoInverse. (The best thing about Mathematica is its consistent, predictable naming.)

    PseudoInverse[A]


This returns

    {{19/60, 1/12}, {-(11/30), 1/6}, {1/12, -(1/12)}}


And we can confirm that computing the pseudoinverse via the SVD

    Sp = {{1/Sqrt[30], 0}, {0, 1/2}, {0, 0}}
    V . Sp . Transpose[U]


gives the same result.

### Computing SVD in Python

Next we compute the singular value decomposition in Python (NumPy).

    >>> import numpy as np
    >>> a = np.matrix([[2, -1, 0],[4,3,-2]])
    >>> u, s, vt = np.linalg.svd(a, full_matrices=True)


Note that np.linalg.svd returns the transpose of V, not the V in the definition of singular value decomposition.

Also, the object s is not the diagonal matrix Σ but a vector containing only the diagonal elements, i.e. just the singular values. This can save a lot of space if the matrix is large. The NumPy method svd has other efficiency-related options that I won’t go into here.

We can verify that the SVD is correct by turning s back into a matrix and multiplying the components together.

    >>> ss = np.matrix([[s[0], 0, 0], [0, s[1], 0]])
    >>> u*ss*vt


This returns the matrix A, within floating point accuracy. Since Python is doing floating point computations, not symbolic calculation like Mathematica, the zero in A turns into -3.8e-16.

Note that the singular value decompositions as computed by Mathematica and Python differ in a few signs here and there; the SVD is not unique.

### Computing pseudoinverse in Python

The pseudoinverse can be computed in NumPy with np.linalg.pinv.

    >>> np.linalg.pinv(a)
    matrix([[ 0.31666667,  0.08333333],
            [-0.36666667,  0.16666667],
            [ 0.08333333, -0.08333333]])


This returns the same result as Mathematica above, up to floating point precision.
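For comparison with the Mathematica check, the pseudoinverse can also be assembled from the SVD in NumPy. This is a sketch using plain arrays rather than `np.matrix`; the variable name `sigma_plus` and the tolerance used to decide which singular values count as nonzero are assumptions made for illustration.

```python
import numpy as np

a = np.array([[2, -1, 0],
              [4, 3, -2]])

u, s, vt = np.linalg.svd(a, full_matrices=True)

# Build the n x m matrix Sigma^+ from reciprocals of the nonzero singular values.
m, n = a.shape
sigma_plus = np.zeros((n, m))
for i, sv in enumerate(s):
    if sv > 1e-12:                  # tolerance for "nonzero" is an assumption
        sigma_plus[i, i] = 1 / sv

a_plus = vt.T @ sigma_plus @ u.T    # real case: conjugate transpose = transpose
print(a_plus)
print(np.allclose(a_plus, np.linalg.pinv(a)))
```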

# Review of Matrix Mathematics

Bernstein’s Matrix Mathematics is impressive. It’s over 1500 pages and weighs 5.3 pounds (2.4 kg). It’s a reference book, not the kind of book you just sit down to read. (Actually, I have sat down to read parts of it.) I’d used a library copy of the first edition, and so when Princeton University Press offered me a review copy of the second edition, I jumped on it.

Matrix Mathematics has a lot of information on linear algebra. As you’d expect from the title, it’s mostly about linear algebra. And despite its enormous size, it’s also dense. Mostly definitions and theorem statements. Some examples and proofs, but mostly statements of facts.

But there are a lot of other topics covered too. For example, I was surprised to see a section on Bell polynomials, a topic I ran across in my work and blogged about not long ago.

Why even have reference books these days when you can easily find so much online? For one thing, there’s still a lot you can’t easily find online. When you go beyond commonly known material, as this book does, it gets hard to search for what you need.

For another, an author goes to tremendous effort to arrange the information coherently. When you read a book you find things you didn’t know to search for. Maybe you start by looking up what you think you need to know in the index, but then you find out from context what you really needed to know.

I’m glad to add this to the books I keep close at hand. I can find what I need quickly, and that’s more important than it may seem. If I save a couple minutes, the benefit is not just that I get a couple more minutes work done. The main benefit is that I increase my chances of acting on an inspiration before it evaporates.

# Moore-Penrose pseudoinverse is not an adjoint

The Moore-Penrose pseudoinverse of a matrix is a way of coming up with something like an inverse for a matrix that doesn’t have an inverse. If a matrix does have an inverse, then the pseudoinverse is in fact the inverse. The Moore-Penrose pseudoinverse is also called a generalized inverse for this reason: it’s not just like an inverse, it actually is an inverse when that’s possible.

Given an m by n matrix A, the Moore-Penrose pseudoinverse $A^+$ is the unique n by m matrix satisfying four conditions:

1. $A A^+ A = A$
2. $A^+ A A^+ = A^+$
3. $(A A^+)^* = A A^+$
4. $(A^+ A)^* = A^+ A$

The first equation says that $AA^+$ is a left identity for A, and $A^+A$ is a right identity for A.

The second equation says $A^+A$ is a left identity for $A^+$, and $AA^+$ is a right identity for $A^+$.

The third and fourth equations say that $AA^+$ and $A^+A$ are Hermitian.

If A is invertible, $AA^+$ and $A^+A$ are both the identity matrix. Otherwise $AA^+$ and $A^+A$ act an awful lot like the identity, as much as you could expect, maybe a little more than you’d expect.
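The four conditions are easy to test numerically. This sketch applies `np.linalg.pinv` to a random complex 4 by 6 matrix, with a made-up helper `H` for the conjugate transpose, and then confirms that the pseudoinverse of an invertible matrix is its ordinary inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6)) + 1j * rng.normal(size=(4, 6))
Ap = np.linalg.pinv(A)

def H(M):
    """Conjugate transpose."""
    return M.conj().T

checks = [
    np.allclose(A @ Ap @ A, A),       # condition 1
    np.allclose(Ap @ A @ Ap, Ap),     # condition 2
    np.allclose(H(A @ Ap), A @ Ap),   # condition 3
    np.allclose(H(Ap @ A), Ap @ A),   # condition 4
]
print(checks)

# For an invertible matrix the pseudoinverse is the ordinary inverse.
B = np.array([[2.0, 1.0], [1.0, 1.0]])
print(np.allclose(np.linalg.pinv(B), np.linalg.inv(B)))
```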

Update: See this post for the relationship between the singular value decomposition and pseudoinverses, and how to compute both in Python and Mathematica.

John Baez recently wrote that a Galois connection, a kind of categorical adjunction, is

“the best approximation to reversing a computation that can’t be reversed.”

That sounds like a pseudoinverse! And the first two equations defining a pseudoinverse look a lot like things you’ll see in the context of adjunctions, so the pseudoinverse must be an adjunction, right?

The question was raised on MathOverflow and Michal R. Przybylek answered

I do not think the concept of Moore-Penrose Inverse and the concept of categorical adjunction have much in common (except they both try to generalise the concept of inverse) …

and gives several reasons why. (Emphasis added.)

Too bad. It would have made a good connection. Applied mathematicians are likely to be familiar with Moore-Penrose pseudoinverses but not categorical adjoints. And pure mathematicians, depending on their interests, may be more familiar with adjoint functors than matrix pseudoinverses.

So what about John Baez’ comment? His comment was expository (and very helpful) but not meant to be rigorous. To make it rigorous you’d have to be rigorous about what you mean by “best approximation” etc. And when you define your terms carefully, in the language of category theory, you get adjoints. This means that the Moore-Penrose inverse, despite its many nice properties [1], doesn’t mesh well with categorical definitions. It’s not the best approximate inverse from a categorical perspective because it doesn’t compose well, and category theory values composition above all else. The Moore-Penrose pseudoinverse may be the best approximate inverse from some perspectives, but not from a categorical perspective.

Przybylek explains

… adjunctions compose … but Moore-Penrose pseudoinverses—generally—do not. … pseudoinverses are not stable under isomorphisms, thus are not categorical.

That’s the gist of his final point. Now let me fill in and expand slightly part of what I cut out.

If $f: A \to B$ is left adjoint to $f^+: B \to A$ and $g: B \to C$ is left adjoint to $g^+: C \to B$ then the composition $gf: A \to C$ is left adjoint to the composition $f^+g^+: C \to A$, but Moore-Penrose pseudoinverses do not compose this way in general.

This turns out to be an interesting example, but not of what I first expected. Rather than the pseudoinverse of a matrix being an example of an adjoint, it is an example of something that despite having convenient properties does not compose well from a categorical perspective.
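A tiny example shows the failure to compose. The matrices below are chosen by hand for illustration: the pseudoinverse of the product AB is 1, while the product of the pseudoinverses in reverse order is 1/2.

```python
import numpy as np

A = np.array([[1.0, 1.0]])      # 1 x 2
B = np.array([[1.0], [0.0]])    # 2 x 1

pinv_of_product = np.linalg.pinv(A @ B)                   # (AB)^+
product_of_pinvs = np.linalg.pinv(B) @ np.linalg.pinv(A)  # B^+ A^+

print(pinv_of_product)    # approximately 1
print(product_of_pinvs)   # approximately 0.5
```

For invertible matrices the two would agree, since then the pseudoinverse is the inverse and inverses do compose in reverse order.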


# Duals and double duals of Banach spaces

The canonical examples of natural and unnatural transformations come from linear algebra, namely the relation between a vector space and its first and second duals. We will look briefly at the finite dimensional case, then concentrate on the infinite dimensional case.

Two finite-dimensional vector spaces over the same field are isomorphic if and only if they have the same dimension.

For a finite dimensional space V, its dual space V* is defined to be the vector space of linear functionals on V, that is, the set of linear functions from V to the underlying field. The space V* has the same dimension as V, and so the two spaces are isomorphic. You can do the same thing again, taking the dual of the dual, to get V**. This also has the same dimension, and so V is isomorphic to V** as well as V*. However, V is naturally isomorphic to V** but not to V*. That is, the transformation from V to V* is not natural.

Some things in linear algebra are easier to see in infinite dimensions, i.e. in Banach spaces. Distinctions that seem pedantic in finite dimensions clearly matter in infinite dimensions.

The category of Banach spaces considers linear spaces and continuous linear transformations between them. In a finite dimensional Euclidean space, all linear transformations are continuous, but in infinite dimensions a linear transformation is not necessarily continuous.

The dual of a Banach space V is the space of continuous linear functions on V. Now we can see examples of where not only is V* not naturally isomorphic to V, it’s not isomorphic at all.

For any real p > 1, let q be the number such that 1/p + 1/q = 1. The Banach space $L^p$ is defined to be the set of (equivalence classes of) Lebesgue measurable functions f such that the integral of $|f|^p$ is finite. The dual space of $L^p$ is $L^q$. If p does not equal 2, then these two spaces are different. (If p does equal 2, then so does q; $L^2$ is a Hilbert space and its dual is indeed the same space.)

In the finite dimensional case, a vector space V is isomorphic to its second dual V**. In general, V can be embedded into V**, but V** might be a larger space. The embedding of V in V** is natural, both in the intuitive sense and in the formal sense of natural transformations, discussed in the previous post. We can turn an element of V into a linear functional on linear functions on V as follows.

Let v be an element of V and let f be an element of V*. The action of v on f is simply f(v). That is, v acts on linear functions by letting them act on it!

This shows that some elements of V** come from evaluation at elements of V, but there could be more. Returning to the example of Lebesgue spaces above, the dual of $L^1$ is $L^\infty$, the space of essentially bounded functions. But the dual of $L^\infty$ is larger than $L^1$. That is, one way to construct a continuous linear functional on bounded functions is to multiply them by an absolutely integrable function and integrate. But there are other ways to construct linear functionals on $L^\infty$.

A Banach space V is reflexive if the natural embedding of V in V** is an isomorphism. For 1 < p < ∞, the spaces $L^p$ are reflexive.

However, R. C. James proved the surprising result that there are Banach spaces that are isomorphic to their second duals, but not naturally. That is, there are spaces V where V is isomorphic to V**, but not via the natural embedding; the natural embedding of V into V** is not an isomorphism.

Related: Applied functional analysis

# Some ways linear algebra is different in infinite dimensions

There’s no notion of continuity in linear algebra per se. It’s not part of the definition of a vector space. But a finite dimensional vector space over the reals is isomorphic to a Euclidean space of the same dimension, and so we usually think of such spaces as Euclidean. (We’re only going to consider real vector spaces in this post.) And there we have a notion of distance, a norm, and hence a topology and a way to say whether a function is continuous.

## Continuity

In finite dimensional Euclidean space, linear functions are continuous. You can put a different norm on a Euclidean space than the one it naturally comes with, but all norms give rise to the same topology and hence the same continuous functions. (This is useful in numerical analysis where you’d like to look at a variety of norms. The norms give different analytical results, but they’re all topologically equivalent.)

In an infinite dimensional normed space, linear functions are not necessarily continuous. If the dimension of a space is only a trillion, all linear functions are continuous, but when you jump from high dimension to infinite dimension, you can have discontinuous linear functions. But if you look at this more carefully, there isn’t a really sudden change.

If a linear function is discontinuous, its finite dimensional approximations are continuous, but the degree of continuity is degrading as dimension increases. For example, suppose a linear function stretches the nth basis vector by a factor of n. The bigger n gets, the more the function stretches in the nth dimension. As long as n is bounded, this is continuous, but in a sense it is less continuous as n increases. The fact that the infinite dimensional version is discontinuous tells you that the finite dimensional versions, while technically continuous, scale poorly with dimension. (See practical continuity for more discussion along these lines.)
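Here is a small numerical sketch of that example. The helper names `stretch` and `operator_norm` are made up for illustration. The operator norm (the largest factor by which the map can stretch a unit vector) is finite in every dimension n, so each finite dimensional version is continuous, but it grows without bound as n increases.

```python
import numpy as np

def stretch(x):
    """Multiply the k-th coordinate of x by k (k counted from 1)."""
    x = np.asarray(x, dtype=float)
    return np.arange(1, x.size + 1) * x

def operator_norm(n):
    """Largest stretch factor of the map on R^n in the Euclidean norm."""
    T = np.diag(np.arange(1, n + 1, dtype=float))
    return np.linalg.norm(T, 2)  # spectral norm = largest singular value

# The bound is finite for each n but keeps growing with n.
for n in [1, 10, 100]:
    print(n, operator_norm(n))
```

The operator norm here equals n, so no single bound works for all dimensions at once, which is what discontinuity in the infinite dimensional limit amounts to.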

## Completeness

A Banach space is a complete normed linear space. Finite dimensional normed spaces are always complete (i.e. every Cauchy sequence in the space converges to a point in the space) but this might not happen in infinite dimensions.

## Duals and double duals

In basic linear algebra, the dual of a vector space V is the space of linear functionals on V, i.e. the set of linear maps from V to the reals. This space is denoted V*. If V has dimension n, V* has dimension n, and all n-dimensional spaces are isomorphic, so the distinction between a space and its dual seems pedantic. But in general a Banach space and its dual are not isomorphic and so it’s easier to tell them apart.

The second dual of a vector space, V**, is the dual of the dual space. In finite dimensional spaces, V** is naturally isomorphic to V. In Banach spaces, V is isomorphic to a subspace of V**. And even when V is isomorphic to V**, it might not be naturally isomorphic to V**. (Here “natural” means natural in the category theory sense of natural transformations.)

# It all boils down to linear algebra

When I was in college, my view of applied math was something like the following.

Applied math is mostly mathematical physics. Mathematical physics is mostly differential equations. Numerical solution of differential equations boils down to linear algebra. Therefore the heart of applied math is linear algebra.

I still think there’s a lot of truth in the summary above. Linear algebra is very important, and a great deal of applied math does ultimately depend on efficient solutions of large linear systems. The weakest link in the argument may be the first one: there’s a lot more to applied math than mathematical physics. Mathematical physics hasn’t declined, but other areas have grown. Still, areas of applied math outside of mathematical physics and outside of differential equations often depend critically on linear algebra.

I’d certainly recommend that someone interested in applied math become familiar with numerical linear algebra. If you’re going to be an expert in differential equations, or optimization, or many other fields, you need to be at least familiar with numerical linear algebra if you’re going to compute anything. As Stephen Boyd points out in his convex optimization class, many of the breakthroughs in optimization over the last 20 years have at their core breakthroughs in numerical linear algebra. Improved algorithms have sped up the solution of very large systems more than Moore’s law has.

It may seem questionable to say that linear algebra is at the heart of applied math because it’s linear. What about nonlinear applications, such as nonlinear PDEs? Nonlinear differential equations lead to nonlinear algebraic equations when discretized. But these nonlinear systems are solved via iterations of linear systems, so we’re back to linear algebra.
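To make that concrete, here is a minimal sketch of Newton’s method for a small nonlinear system. The system itself (a circle intersected with a hyperbola) and the names `F`, `J`, and `newton` are made up for illustration; it does not come from discretizing any particular PDE. The point is that every pass through the loop solves a linear system.

```python
import numpy as np

def F(x):
    """Hypothetical nonlinear system: x0^2 + x1^2 = 4 and x0 * x1 = 1."""
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] * x[1] - 1.0])

def J(x):
    """Jacobian matrix of F at x."""
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [x[1], x[0]]])

def newton(F, J, x0, tol=1e-12, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # The nonlinear problem is replaced by a linear one: J(x) step = -F(x).
        step = np.linalg.solve(J(x), -F(x))
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x

root = newton(F, J, [2.0, 0.5])
print(root, F(root))
```

For a large discretized problem the dense call to `np.linalg.solve` would be replaced by a sparse or iterative solver, but the structure is the same.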

# Iterative linear solvers as metaphor

Gaussian elimination is a systematic way to solve systems of linear equations in a finite number of steps. Iterative methods for solving linear systems require an infinite number of steps in theory, but may find solutions faster in practice.

Gaussian elimination tells you nothing about the final solution until it’s almost done. The first phase, factorization, takes O(n^3) steps, where n is the number of unknowns. This is followed by the back-substitution phase which takes O(n^2) steps. The factorization phase tells you nothing about the solution. The back-substitution phase starts filling in the components of the solution one at a time. In application n is often so large that the time required for back-substitution is negligible compared to factorization.
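The two phases show up as two separate calls in SciPy. The sketch below uses a made-up random test matrix, shifted along the diagonal so it is well conditioned. `lu_factor` does the O(n^3) factorization, and `lu_solve` does the O(n^2) triangular solves that actually produce the solution.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 200
# Hypothetical test problem: random matrix plus a diagonal shift.
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)

lu, piv = lu_factor(A)       # factorization phase, O(n^3): no solution yet
x = lu_solve((lu, piv), b)   # substitution phase, O(n^2): fills in x

print(np.linalg.norm(A @ x - b))
```

Nothing about `x` is available until the second call, which is the sense in which the method tells you nothing until it is almost done.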

Iterative methods start by taking a guess at the final solution. In some contexts, this guess may be fairly good. For example, when solving differential equations, the solution from one time step gives a good initial guess at the solution for the next time step. Similarly, in sequential Bayesian analysis the posterior distribution mode doesn’t move much as each observation arrives. Iterative methods can take advantage of a good starting guess while methods like Gaussian elimination cannot.

Iterative methods take an initial guess and refine it to a better approximation to the solution. This sequence of approximations converges to the exact solution. In theory, Gaussian elimination produces an exact answer in a finite number of steps, but iterative methods never produce an exact solution after any finite number of steps. But in actual computation with finite precision arithmetic, no method, iterative or not, ever produces an exact answer. The question is not which method is exact but which method produces an acceptably accurate answer first. Often the iterative method wins.
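As a rough illustration, here is a hand-written Jacobi iteration, one of the simplest iterative methods, run from a poor starting guess and from a good one. The function name `jacobi`, the test matrix, and the perturbed exact solution standing in for “the answer from the previous time step” are all assumptions made for the example.

```python
import numpy as np

def jacobi(A, b, x0, tol=1e-10, max_iter=10_000):
    """Jacobi iteration. Returns (approximate solution, iterations used)."""
    D = np.diag(A)               # diagonal entries of A
    R = A - np.diag(D)           # off-diagonal part of A
    x = np.asarray(x0, dtype=float)
    for k in range(1, max_iter + 1):
        x_new = (b - R @ x) / D
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

n = 50
# Strictly diagonally dominant tridiagonal matrix, so Jacobi converges.
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

x_exact = np.linalg.solve(A, b)                      # reference answer
x_cold, k_cold = jacobi(A, b, np.zeros(n))           # no prior information
x_warm, k_warm = jacobi(A, b, x_exact + 1e-6)        # good initial guess

print(k_cold, k_warm)
```

Both runs stop at an answer that is accurate to roughly the requested tolerance rather than exact, and the warm start gets there in fewer iterations.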

Successful projects often work like iterative numerical methods. They start with an approximate solution and iteratively refine it. All along the way they provide a useful approximation to the final product. Even if, in theory, there is a more direct approach to a final product, the iterative approach may work better in practice.

Algorithms iterate toward a solution because that approach may reach a sufficiently accurate result sooner. That may apply to people, but more important for people is the psychological benefit of having something to show for yourself along the way. Also, iterative methods, whether for linear systems or human projects, are robust to changes in requirements because they are able to take advantage of progress made toward a slightly different goal.


# Example of not inverting a matrix: optimization

People are invariably surprised when they hear it’s hardly ever necessary to invert a matrix. It’s very often necessary to solve linear systems of the form Ax = b, but in practice you almost never do this by inverting A. This post will give an example of avoiding matrix inversion. I will explain how the Newton-Conjugate Gradient method works, implemented in SciPy by the function fmin_ncg.

If a matrix A is large and sparse, it may be possible to solve Ax = b but impossible to even store the matrix A⁻¹ because there isn’t enough memory to hold it. Sometimes it’s sufficient to be able to form matrix-vector products Ax. Notice that this doesn’t mean you have to store the matrix A; you have to produce the product Ax as if you had stored the matrix A and multiplied it by x.

Very often there are physical reasons why the matrix A is sparse, i.e. most of its entries are zero and there is an exploitable pattern to the non-zero entries. There may be plenty of memory to store the non-zero elements of A, even though there would not be enough memory to store the entire matrix. Also, it may be possible to compute Ax much faster than it would be if you were to march along the full matrix, multiplying and adding a lot of zeros.

Iterative methods of solving Ax = b, such as the conjugate gradient method, create a sequence of approximations that converge (in theory) to the exact solution. These methods require forming products Ax and updating x as a result. These methods might be very useful for a couple reasons.

1. You only have to form products of a sparse matrix and a vector.
2. If you don’t need a very accurate solution, you may be able to stop very early.
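Here is a sketch of solving Ax = b with the conjugate gradient method without ever storing A. The matrix is a made-up example: tridiagonal with 4 on the diagonal and -1 beside it, loosely the kind of pattern a discretized one-dimensional differential equation produces. The function `tridiag_matvec` is a hypothetical helper that returns Ax directly from x; SciPy’s `LinearOperator` wraps it so that `cg` can use it in place of a matrix.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 1000

def tridiag_matvec(x):
    """Return A @ x for the tridiagonal A (4 on the diagonal, -1 beside it)
    without forming A."""
    x = np.ravel(x)
    y = 4.0 * x
    y[:-1] -= x[1:]    # contribution of the superdiagonal
    y[1:] -= x[:-1]    # contribution of the subdiagonal
    return y

A = LinearOperator((n, n), matvec=tridiag_matvec, dtype=float)
b = np.ones(n)

x, info = cg(A, b)     # info == 0 means the default tolerance was reached
print(info, np.linalg.norm(tridiag_matvec(x) - b))
```

The solver only ever asks for products Ax, so the memory cost is a handful of vectors of length n rather than an n by n array.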

In Newton’s optimization method, you have to solve a linear system in order to find a search direction. In practice this system is often large and sparse. The ultimate goal of Newton’s method is to minimize a function, not to find perfect search directions. So you can save time by finding only approximate solutions to the problem of finding search directions. Maybe an exact solution would in theory take 100,000 iterations, but you can stop after only 10 iterations! This is the idea behind the Newton-Conjugate Gradient optimization method.

The function scipy.optimize.fmin_ncg can take as an argument a function fhess that computes the Hessian matrix H of the objective function. But more importantly, it lets you provide instead a function fhess_p that computes the product of H with a vector. You don’t have to supply the actual Hessian matrix because the fmin_ncg method doesn’t need it. It only needs a way to compute matrix-vector products Hx to find approximate Newton search directions.
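As a minimal sketch, the call might look like the following. The choice of the Rosenbrock test function and the starting point are assumptions made for the example; SciPy happens to ship the function, its gradient, and its Hessian-vector product as `rosen`, `rosen_der`, and `rosen_hess_prod`, so no Hessian matrix is formed anywhere.

```python
import numpy as np
from scipy.optimize import fmin_ncg, rosen, rosen_der, rosen_hess_prod

# Arbitrary starting point for a 5-dimensional Rosenbrock problem.
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

xmin = fmin_ncg(
    rosen,                    # objective function
    x0,
    fprime=rosen_der,         # gradient
    fhess_p=rosen_hess_prod,  # computes H(x) @ p without returning H
    avextol=1e-8,
    disp=False,
)
print(xmin)
```

For your own objective you would write `fhess_p(x, p)` by hand, ideally exploiting whatever sparsity or structure the Hessian has.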

For more information, see the SciPy documentation for fmin_ncg.