## Matrix compression

Suppose you have an *m* by *n* matrix *A*, where *m* and *n* are very large, that you’d like to compress. That is, you’d like to come up with an approximation of *A* that takes less data to describe.

For example, consider a high resolution photo that as a matrix of gray scale values. An approximation to the matrix could be a lower resolution representation that takes less space.

I heard a rumor many years ago that space probes would compress images by interpreting an image as a matrix and sending back a few eigenvalues and eigenvectors. That sounded fascinating, but what about images that aren’t square? If a matrix *M* is not square, then you can’t have *Mx* = λ*x* for a scalar λ because the left and right sides of the equation would have different dimensions.

## Truncated SVD

Approximate a rectangular matrix requires using something more general than eigenvalues and eigenvectors, and that is singular values and singular vectors.

Suppose our matrix *A* has the singular value decomposition

This can be written as

where the σ_{i} are the singular values and *p* is the number of singular values that are non-zero. The *u*_{i} and *v*_{i} are the *i*th columns of *U* and *V* respectively. Singular values are nonnegative, and listed in decreasing order.

We can form an approximation to *A* by truncating the sum above. Instead of taking *all* the singular values, and their corresponding left and right singular vectors, we only take the *k* largest singular values and their corresponding vectors.

This is called the **truncated SVD**.

We started by assuming *A* had dimensions *m* by *n* and that both were large. Storing *A* requires *mn* numbers. Storing *A*_{k} requires only *k*(1 + *m* + *n*) numbers. This is a big savings if *k* is much smaller than *m* and *n*.

## Eckart-Young theorem

So how good an approximation is *A*_{k} ? Turns out it is optimal, in the least squares sense. This is the content of the Eckart-Young theorem. It says that the best least squares (2-norm) approximation of *A* by a rank *k* matrix is given by *A*_{k}.

Not only that, the theorem says the 2-norm error is given by the first singular value that we didn’t use, i.e.

A_k is optimal not only in the LS sense, but in the sense of any unitarily invariant norm; therefore including nuclear, Schatten or Ky Fan.

This important extension was proved by Mirsky in 1958 and published in 1968 (“Symmetric Gauge Functions and Unitarily Invariant Norms”).