My favorite paper: H = W

A paper came out in 1964 with the title “H = W.” The remarkably short title was not cryptic, however. The people for whom the paper was written knew exactly what it meant. There were two families of function spaces, one denoted with H and another denoted with W, that were speculated to be the same, and this paper proved that indeed they were.

This post will explain the significance of the paper. It will be more advanced and less self-contained than most of my posts. If you’re still interested, read on.

Definitions

The derivative of a function might exist in a generalized sense when it doesn’t exist in the classical sense. I give an introduction to this idea in the post How to differentiate a non-differentiable function. The generalized derivative is a linear functional on test functions [1] and may not correspond to a classical function. The delta “function” δ(x) is the classic example.

A regular distribution is a distribution whose action on a test function is equal to multiplying the test function by a locally integrable function and integrating.

Given Ω ⊂ ℝ^n, the Sobolev space W^{k,p}(Ω) consists of functions whose partial derivatives of order up to k are regular distributions that lie in the space L^p(Ω).

For example, let I be the interval (−1, 1) and let f(x) = |x|. The function f is not differentiable at 0, but the generalized derivative of f is the sign function sgn(x), which is in L^p(I) for all p. The generalized derivative of sgn(x) is 2δ(x), which is not a regular distribution [2], and so f ∈ W^{1,p}(I) but f ∉ W^{2,p}(I).
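
To see why sgn(x) is the generalized derivative of |x|, pair |x| with a test function φ and integrate by parts on each half of the interval; the boundary terms vanish because φ vanishes near ±1 and the factor x vanishes at 0. A sketch of the calculation:

\int_{-1}^{1} |x| \, \varphi'(x) \, dx = \int_0^1 x \, \varphi'(x) \, dx - \int_{-1}^0 x \, \varphi'(x) \, dx = -\int_0^1 \varphi(x) \, dx + \int_{-1}^0 \varphi(x) \, dx = -\int_{-1}^{1} \operatorname{sgn}(x) \, \varphi(x) \, dx

Since the generalized derivative f′ is defined by ⟨f′, φ⟩ = −⟨f, φ′⟩, this shows that the generalized derivative of |x| acts on test functions exactly as integration against sgn(x) does.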

The norm on W^{k,p}(Ω) is the sum of the L^p norms of the function and each of its partial derivatives up to order k.

The Sobolev space H^{k,p}(Ω) is the completion, in the norm of the space W^{k,p}(Ω), of the smooth functions on Ω whose W^{k,p} norm is finite.

Theorem

It’s not obvious a priori that these two ways of defining a Sobolev space are equivalent, but James Serrin and Norman George Meyers [3] proved in 1964 that for all domains Ω, all non-negative integers k, and all 1 ≤ p < ∞ we have

H^{k,p}(Ω) = W^{k,p}(Ω).

The proof is remarkably brief, less than a page.

Significance

Why does this theorem matter? Sobolev spaces are the machinery of the modern theory of differential equations. I spent a lot of time in my mid 20s working with Sobolev spaces.

The grand strategy of PDE research is to first search for generalized solutions to an equation, solutions belonging to a Sobolev space, then if possible prove that the generalized solution is in fact a classical solution.

This is analogous to first proving an algebraic equation has a complex solution, then proving that the complex solution is real; or proving that an equation has a real solution, then proving that the real solution is in fact an integer. It’s easier to first find a solution in a larger space, then if possible show that the thing you found belongs to a smaller space.

Related posts

[1] A test function in this context is an infinitely differentiable function of compact support. In other contexts a test function is not required to have compact support but is required to approach zero rapidly, faster than the reciprocal of any polynomial.

[2] The classical derivative of sgn(x) is equal to zero almost everywhere. But the derivative as a distribution is not zero. The pointwise derivative may not equal the generalized derivative.

[3] Meyers, Norman G.; Serrin, James (1964), “H = W”, Proceedings of the National Academy of Sciences, 51 (6): 1055–1056.

Wilkinson’s polynomial

If you change the coefficients of a polynomial a little bit, do you change the location of its zeros a little bit? In other words, do the roots of a polynomial depend continuously on its coefficients?

You would think so, and you’d be right. Sorta.

It’s easy to see that a small change to a coefficient could result in a qualitative change, such as a double root splitting into two distinct roots close together, or a real root acquiring a small imaginary part. But it’s also possible that a small change to a coefficient could cause roots to move quite a bit.

Wilkinson’s polynomial

Wilkinson’s polynomial is a famous example that makes this all clear.

Define

w(x) = (x − 1)(x − 2)(x − 3)…(x − 20) = x^{20} − 210x^{19} + …

and

W(x) = w(x) − 2^{−23} x^{19}

so that the difference between the two polynomials is that the coefficient of x^{19} in the former is −210 and in the latter the coefficient is −210.0000001192.

Surely the roots of w(x) and W(x) are close together. But they’re not!

The roots of w are clearly 1, 2, 3, …, 20. But the roots of W are substantially different. Or at least some are substantially different. About half the roots didn’t move much, but the other half moved a lot. If we order the roots by their real part, the 16th root splits into two roots, 16.7308 ± 2.81263 i.

Here’s the Mathematica code that produced the plot above.

    w[x_] := Product[(x - i), {i, 1, 20}]  (* Wilkinson's polynomial *)
    W[x_] := w[x] - 2^-23 x^19             (* perturb the coefficient of x^19 *)
    roots = NSolve[W[x] == 0, x]           (* numerical roots of W *)
    ComplexListPlot[x /. roots]            (* plot the roots in the complex plane *)

A tiny change to one coefficient creates a change in the roots that is seven orders of magnitude larger. Wilkinson described this discovery as “traumatic.”

Modulus of continuity

The zeros of a polynomial do depend continuously on the parameters, but the modulus of continuity might be enormous.

Given some desired tolerance ε on how much the zeros are allowed to move, you can specify a corresponding degree of accuracy δ on the coefficients to meet that tolerance, but you might have to have extreme accuracy on the coefficients to have moderate accuracy on the roots. The modulus of continuity is essentially the ratio ε/δ, and this ratio might be very large. In the example above it’s roughly 8,400,000.

In terms of the δ-ε definition of continuity, for every ε there is a δ. But that δ might have to be orders of magnitude smaller than ε.

Modulus of continuity is an important concept. In applications, it might not matter that a function is continuous if the modulus of continuity is enormous. In that case the function may be discontinuous for practical purposes.

Continuity is a discrete concept—a function is either continuous or it’s not—but modulus of continuity is a continuous concept, and this distinction can resolve some apparent paradoxes.

For example, consider the sequence of functions f_n(x) = x^n on the interval [0, 1]. For every n, f_n is continuous, but the limit is not a continuous function: the limit equals 0 for x < 1 and equals 1 at x = 1.

But this makes sense in terms of modulus of continuity. The modulus of continuity of f_n(x) is n. So while each f_n is continuous, the sequence is becoming “less continuous” in the sense that the modulus of continuity is increasing.

Interpolation instability

You would think that interpolating at more points would give you a more accurate approximation. There’s a famous example by Runge that proves this is not the case. If you interpolate the function 1/(1 + x²) over the interval [−5, 5], as you add more interpolation points the maximum interpolation error actually increases.

Here’s an example from a post I wrote a few years ago, using 16 interpolation points. This is the same function, rescaled to live on the interval [−1, 1].

There are two things that make Runge’s example work.

  1. The interpolation nodes are evenly spaced.
  2. The function has singularities in the complex plane too close to the interpolation interval.

If you use Chebyshev nodes rather than evenly spaced nodes, the problem goes away.

And if the function being interpolated is analytic in a certain region around the unit interval (described here) the problem also goes away, at least in theory. In practice, using floating point arithmetic, you can see rapid divergence even though in theory the interpolants converge [1]. That is the point of this post.

Example

So let’s try interpolating exp(−9x²) on the interval [−1, 1]. This function is analytic everywhere, so the interpolating polynomials should converge as the number of interpolation points n goes to infinity. Here’s a plot of what happens when n = 50.

Looks like all is well. No need to worry. But let’s press our luck and try n = 100.

Hmm. Something is wrong.

In fact things are far worse than the plot suggests. The largest interpolant value is over 700,000,000. It just doesn’t show in the plot.

Interpolating at evenly spaced points is badly conditioned. There would be no problem if we could compute with infinite precision, but with finite precision interpolation errors can grow exponentially.
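
Here’s a minimal sketch of the experiment in Python, using SciPy’s BarycentricInterpolator to interpolate exp(−9x²) at evenly spaced nodes and report the largest value the interpolant takes on a fine grid. This is an illustration rather than a reproduction of the plots above; the exact numbers depend on the interpolation routine and floating point details, but the blow-up at n = 100 should be evident.

    import numpy as np
    from scipy.interpolate import BarycentricInterpolator

    def max_interpolant_value(n):
        # interpolate exp(-9 x^2) at n+1 evenly spaced nodes on [-1, 1]
        nodes = np.linspace(-1, 1, n + 1)
        p = BarycentricInterpolator(nodes, np.exp(-9 * nodes**2))
        # evaluate the interpolant on a fine grid and return its largest magnitude
        t = np.linspace(-1, 1, 10001)
        return np.max(np.abs(p(t)))

    for n in [25, 50, 100]:
        print(n, max_interpolant_value(n))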

Personal anecdote

Years ago I learned that someone was interpolating exp(−x²) at a large number of evenly spaced points in an application. If memory serves, they wanted to integrate the interpolant in order to compute probabilities.

I warned them that this was a bad idea because of the Runge phenomenon.

I was wrong on theoretical grounds, because exp(−x²) is analytic. I didn’t know about Runge’s convergence theorem.

But I was right on numerical grounds, because I also didn’t know about the numerical instability demonstrated above.

So I made two errors that sorta canceled out.

Related posts

[1] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

Drazin pseudoinverse

The most well-known generalization of the inverse of a matrix is the Moore-Penrose pseudoinverse. But there is another notion of generalized inverse, the Drazin pseudoinverse, for square matrices.

If a matrix A has an inverse A^{−1} then it also has a Moore-Penrose pseudoinverse A^+ and a Drazin pseudoinverse A^D and

A^{−1} = A^+ = A^D.

When the matrix A is not invertible the two notions of pseudoinverse might not agree.

One way to think about the two generalized inverses is that both are based on a canonical form. The Moore-Penrose pseudoinverse is based on the singular value decomposition. The Drazin inverse is based on the Jordan canonical form.

Jordan canonical form

Every square matrix A can be put into Jordan canonical form

P^{−1} A P = J

where J is the Jordan canonical form and P is an invertible matrix whose columns are the generalized eigenvectors of A.

If the matrix A is diagonalizable, the matrix J is diagonal. Otherwise the matrix J is block diagonal. Each block has a single eigenvalue repeated along its main diagonal, 1s on the diagonal just above the main diagonal, and zeros everywhere else; the sizes of the blocks belonging to an eigenvalue add up to its algebraic multiplicity.

If all the eigenvalues of A are non-zero, the matrix is invertible and it is easy to compute the inverse from the Jordan form. To invert J, simply invert each block of J. This is clear from the way you multiply partitioned matrices.

To find the inverse of A, multiply on the left by P and on the right by P^{−1}. That is,

A^{−1} = P J^{−1} P^{−1}.

Proof:

A (P J^{−1} P^{−1}) = (P J P^{−1})(P J^{−1} P^{−1}) = I

Drazin pseudoinverse

One way to compute the Drazin inverse, at least in theory, is from the Jordan canonical form. In practice, you might want to compute the Drazin inverse a different way.

If all the eigenvalues of A are non-zero, A is invertible and its Drazin pseudoinverse is its inverse.

If zero is an eigenvalue, the block(s) in J corresponding to 0 cannot be inverted. If a block is 1 by 1 then it just has a zero in it. If a block is larger, it will have 1s above the main diagonal. Replace the 1s by 0s. This gives us the Drazin pseudoinverse of J. To obtain the Drazin pseudoinverse of A, multiply on the left by P and on the right by P^{−1} just as above.
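
Here is a minimal sketch of this recipe in Python using SymPy’s exact Jordan decomposition. The function name drazin_via_jordan is my own; it walks the Jordan blocks of J, inverts the blocks with a nonzero eigenvalue, leaves the nilpotent blocks as zero blocks, and conjugates back by P.

    from sympy import Matrix, zeros

    def drazin_via_jordan(A):
        # Jordan decomposition: A = P * J * P**-1, computed exactly
        P, J = A.jordan_form()
        n = J.rows
        JD = zeros(n, n)
        i = 0
        while i < n:
            # find the end of the Jordan block that starts at position i
            j = i
            while j + 1 < n and J[j, j + 1] == 1:
                j += 1
            if J[i, i] != 0:
                # block with nonzero eigenvalue: invert it
                JD[i:j+1, i:j+1] = J[i:j+1, i:j+1].inv()
            # block with eigenvalue 0: leave it as the zero block
            i = j + 1
        return P * JD * P.inv()

    # example: a singular, idempotent matrix; its Drazin inverse is itself
    A = Matrix([[1, 1], [0, 0]])
    print(drazin_via_jordan(A))  # Matrix([[1, 1], [0, 0]])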

Characterization

The Drazin pseudoinverse can be characterized by a few properties, just as the Moore-Penrose pseudoinverse can. Let’s compare the properties.

The defining properties of the Moore-Penrose pseudoinverse A^+ of a matrix A are

  1. A A^+ A = A
  2. A^+ A A^+ = A^+
  3. (A A^+)^* = A A^+
  4. (A^+ A)^* = A^+ A

The defining properties of the Drazin pseudoinverse are

  1. A A^D = A^D A
  2. A^D A A^D = A^D
  3. A^{k+1} A^D = A^k

where k is the smallest non-negative integer such that rank(A^{k+1}) = rank(A^k).

The second property of each pseudoinverse is analogous. The first property of the Drazin inverse is sorta like the 3rd and 4th properties of the Moore-Penrose pseudoinverse.

The analog of A A^+ A = A does not hold; in general A A^D A does not equal A.

The Drazin property A^{k+1} A^D = A^k is nothing like any property for Moore-Penrose. It suggests that Drazin inverses behave nicely when you raise them to exponents, which is true.

You can see this exponent condition in the Jordan normal form. The blocks in J corresponding to a 0 eigenvalue are nilpotent, i.e. they become zero when you raise them to a high enough power. When you raise J to some power k, the same k as above, the nilpotent blocks go away.

Note that although taking the Moore-Penrose pseudoinverse is an involution, taking the Drazin inverse is not. That is, (A^+)^+ = A but (A^D)^D ≠ A in general.

Related posts

Effective graph resistance

I’ve mentioned the Moore-Penrose pseudoinverse of a matrix a few times, most recently last week. This post will give an application of the pseudoinverse: computing effective graph resistance.

Given a graph G, imagine replacing each edge with a one Ohm resistor. The effective resistance between two nodes in G is the electrical resistance between those two nodes.

Calculating graph resistance requires inverting the graph Laplacian, but the graph Laplacian isn’t invertible. We’ll resolve this paradox shortly, but in the meantime this sounds like a job for the pseudoinverse, and indeed it is.

The graph Laplacian of a graph G is the matrix L = D − A where D is the diagonal matrix whose entries are the degrees of each node and A is the adjacency matrix. The vector of all 1’s

1 = (1, 1, 1, …, 1)

is in the null space of L but if G is connected then the null space is one dimensional. More generally, the dimension of the null space equals the number of components in G.

Calculating graph resistance requires solving the linear system

Lw = v

where v and w are orthogonal to 1. If the graph G is connected then the system has a unique solution. The graph Laplacian L is not invertible over the entire space, but it is invertible in the orthogonal complement to the null space.

The Moore-Penrose pseudoinverse L^+ gives the graph resistance matrix. The graph resistance between two nodes a and b is given by

R_{ab} = (e_a − e_b)^T L^+ (e_a − e_b)

where e_i is the vector that has all zero entries except for a 1 in the ith slot.
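
As a quick illustration, here’s a sketch in Python using NumPy’s pinv, with the 4-cycle graph as a made-up example. For the 4-cycle the resistance between adjacent nodes should be 3/4 (a 1 Ohm resistor in parallel with a 3 Ohm path) and between opposite nodes 1.

    import numpy as np

    # adjacency matrix of the 4-cycle 0-1-2-3-0
    A = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)
    D = np.diag(A.sum(axis=1))   # degree matrix
    L = D - A                    # graph Laplacian
    Lplus = np.linalg.pinv(L)    # Moore-Penrose pseudoinverse

    def resistance(a, b):
        e = np.zeros(L.shape[0])
        e[a], e[b] = 1, -1
        return e @ Lplus @ e

    print(resistance(0, 1))  # adjacent nodes: 0.75
    print(resistance(0, 2))  # opposite nodes: 1.0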

For more on graph resistance see Effective graph resistance by Wendy Ellens et al.

More spectral graph theory posts

Multiplying a matrix by its transpose

An earlier post claimed that there are practical advantages to partitioning a matrix, thinking of the matrix as a matrix of matrices. This post will give an example.

Let M be a square matrix and suppose we need to multiply M by its transpose M^T. We can compute this product faster than multiplying two arbitrary matrices of the same size by exploiting the fact that MM^T will be a symmetric matrix.

We start by partitioning M into four blocks

M = \begin{bmatrix}A & B \\ C & D \end{bmatrix}

Then

M^\intercal = \begin{bmatrix}A^\intercal & C^\intercal \\ B^\intercal & D^\intercal \end{bmatrix}

and

MM^\intercal = \begin{bmatrix} AA^\intercal  + BB^\intercal & AC^\intercal  + BD^\intercal \\CA^\intercal + DB^\intercal & CC^\intercal + DD^\intercal \end{bmatrix}

Now for the first clever part: We don’t have to compute both of the off-diagonal blocks because each is the transpose of the other. So we can reduce our calculation by 25% by not calculating one of the blocks, say the lower left block.

And now for the second clever part: apply the same procedure recursively. The diagonal blocks in MM^T involve a matrix times its transpose. That is, we can partition A and use the same idea to compute AA^T and do the same for BB^T, CC^T, and DD^T. The off-diagonal blocks require general matrix multiplications.

The net result is that we can compute MM^T in about 2/3 the time it would take to multiply two arbitrary matrices of the same size.
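
Here’s a minimal recursive sketch of the idea in Python with NumPy. The function name gram and the base-case cutoff are my own choices; the recursion falls back to ordinary multiplication for small or odd-sized matrices.

    import numpy as np

    def gram(M):
        # compute M @ M.T using the 2-by-2 block scheme described above
        n = M.shape[0]
        if n <= 64 or n % 2:  # base case: plain multiplication
            return M @ M.T
        h = n // 2
        A, B = M[:h, :h], M[:h, h:]
        C, D = M[h:, :h], M[h:, h:]
        top_left = gram(A) + gram(B)      # A A^T + B B^T, recursively
        bottom_right = gram(C) + gram(D)  # C C^T + D D^T, recursively
        top_right = A @ C.T + B @ D.T     # general matrix products
        # the lower left block is the transpose of the upper right block
        return np.block([[top_left, top_right],
                         [top_right.T, bottom_right]])

    M = np.random.rand(256, 256)
    print(np.allclose(gram(M), M @ M.T))  # True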

Recently a group of researchers found a way to take this idea even further, partitioning a matrix into a 4 by 4 matrix of 16 blocks and doing some clever tricks. The RXTX algorithm can compute MM^T in about 26/41 the time required to multiply arbitrary matrices, a savings of about 5%. A 5% improvement may be significant if it appears in the inner loop of a heavy computation. According to the authors, “The algorithm was discovered by combining Machine Learning-based search methods with Combinatorial Optimization.”

Related posts

A bit-twiddling marvel

Pop quiz: what does the following code do?

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

It determines whether the year y is a leap year in the Gregorian calendar, of course. :)

It’s very efficient, though I don’t imagine that would ever matter. But it’s very clever.

Gregorian vs Julian calendar

A year is a leap year in the Julian calendar if and only if it is a multiple of 4. But the Julian year is a bit too long on average to match the earth’s orbit around the sun. You get a much better fit if you add the complication that years divisible by 100 but not by 400 are not leap years.

Presumably no one reading this recalls 1900 not being a leap year, but some of you may need to know some day that 2100 is not a leap year either.

C code

The cryptic function above comes from the recent post A leap year check in three instructions by Falk Hüffner. The function is correct for the next 100 thousand years (more precisely, through 102499) and correct if we anachronistically extend the Gregorian calendar back to the year 0.

The following C code shows empirically that Hüffner’s code is correct, but you’ll have to read his post to understand why it is correct.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

bool is_leap_year(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}

int main() {
    for (uint32_t y = 0; y <= 102499; y++)
        if (is_leap_year_fast(y) != is_leap_year(y))
            printf("Exception: %d\n", y);
    return 0;
}

Related posts

Matrices of Matrices

It’s often useful to partition a matrix, thinking of it as a matrix whose entries are matrices. For example, you could look at the 6 × 6 matrix

M = \begin{bmatrix} 3 & 1 & 4 & 0 & 0 & 0 \\ 2 & 7 & 1 & 0 & 0 & 0 \\ 5 & 0 & 6 & 0 & 0 & 0 \\ 0 & 0 & 0 & 2 & 4 & 8 \\ 0 & 0 & 0 & 1 & 7 & 6\\ 0 & 0 & 0 & 3 & 5 & 1 \\ \end{bmatrix}

as a 2 × 2 matrix whose entries are 3 × 3 matrices.

\begin{bmatrix} \begin{bmatrix} 3 & 1 & 4 \\ 2 & 7 & 1 \\ 5 & 0 & 6 \\ \end{bmatrix} & \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \end{bmatrix} \\ & \\ \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \end{bmatrix} & \begin{bmatrix} 2 & 4 & 8 \\ 1 & 7 & 6\\ 3 & 5 & 1 \\ \end{bmatrix} \end{bmatrix}

M is not a diagonal matrix of real numbers, but viewed as a matrix of blocks it is diagonal. It might be possible in some application to take advantage of the fact that the off-diagonal blocks are zero matrices.

You can multiply two partitioned matrices by multiplying the blocks as if they were numbers. The only requirement is that the blocks be of compatible sizes for multiplication: the block widths on the left have to match the corresponding block heights on the right.

For example, if you have two partitioned matrices

A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
and
B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}

then their product is the partitioned matrix

C = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix}

The only requirement is that the matrix products above make sense. The number of columns in A11 must equal the number of rows in B11, and the number of columns in A12 must equal the number of rows in B21. (The rest of the compatibility conditions follow because the blocks in a column of a partitioned matrix have the same number of columns, and the blocks in a row of a partitioned matrix have the same number of rows.)

You can partition matrices into any number of rows and columns of blocks; the examples above have been 2 × 2 arrays of blocks only for convenience.
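
As a sanity check, here’s a small NumPy sketch with arbitrarily chosen block sizes. It builds two partitioned matrices with np.block, multiplies them blockwise using the formula above, and confirms the result matches ordinary matrix multiplication.

    import numpy as np

    rng = np.random.default_rng(0)
    # block shapes chosen so the products A11 @ B11, A12 @ B21, etc. are compatible
    A11, A12 = rng.random((2, 3)), rng.random((2, 4))
    A21, A22 = rng.random((5, 3)), rng.random((5, 4))
    B11, B12 = rng.random((3, 6)), rng.random((3, 1))
    B21, B22 = rng.random((4, 6)), rng.random((4, 1))

    A = np.block([[A11, A12], [A21, A22]])
    B = np.block([[B11, B12], [B21, B22]])

    # multiply blockwise, treating the blocks as if they were numbers
    C = np.block([[A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
                  [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22]])

    print(np.allclose(A @ B, C))  # True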

It’s natural to oscillate a bit between thinking “Of course you can do that” and “Really? That’s incredible.” If you haven’t gone through at least one cycle perhaps you haven’t thought about it long enough.

What could go wrong?

Manipulation of partitioned matrices works so smoothly that you have to wonder where things could go wrong.

For the most part, if a calculation makes logical sense then it’s likely to be correct. If you add two blocks, they have to be compatible for addition. If you “divide” by a block, i.e., multiply by its inverse, then the block must have an inverse. Etc.

One subtle pitfall is that matrix multiplication does not commute in general, and so you have to watch out for implicitly assuming that AB = BA where A and B are blocks. For example, theorems about the determinant of a partitioned matrix are not as simple as you might hope. The determinant of the partitioned matrix

\begin{vmatrix} A & B \\ C & D \end{vmatrix}

is not simply det(AD − BC) in general, though it is true if all the blocks are square matrices of the same size and the blocks in one row or one column commute, i.e. if AB = BA, CD = DC, AC = CA, or BD = DB.

More linear algebra posts

Trinomial Coefficients and Kings

The term trinomial coefficient is used a couple different ways. The most natural use of the term is a generalization of binomial coefficients to three variables, i.e. numbers of the form

\frac{n!}{i! \, j! \, k!}

where i + j + k = n. These numbers are the coefficients of x^i y^j z^k in the expansion of (x + y + z)^n.

But there’s another use of the term trinomial coefficient, the one that we are interested in for this post, and that is the coefficients of x^k in the expansion of (1 + x + 1/x)^n. These numbers can be visualized in a triangle analogous to Pascal’s triangle.

In Pascal’s triangle, each entry is the sum of the two entries above it. In the trinomial triangle, each entry is the sum of the three entries above it.

If you’ve been reading this blog regularly you know I’ve been interested in chess puzzles lately. The thing that makes me interested in trinomial coefficients is that they relate to moves of a king on a chessboard.

If you count the number of paths a king can take to a square using the minimal number of moves, you get the trinomial triangle. The reason is simple: the number of ways to get to a square equals the number of ways to get there via the square up and to the left, plus the number of ways to get there via the square above, plus the number of ways to get there via the square up and to the right.
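
Here’s a small Python sketch (the helper name trinomial_triangle is mine) that builds the triangle from this rule, padding each row with zeros so every entry is the sum of the three entries above it.

    def trinomial_triangle(rows):
        # each new entry is the sum of the three entries above it,
        # treating entries outside the previous row as zero
        triangle = [[1]]
        for _ in range(1, rows):
            prev = [0, 0] + triangle[-1] + [0, 0]
            row = [prev[i] + prev[i+1] + prev[i+2] for i in range(len(prev) - 2)]
            triangle.append(row)
        return triangle

    for row in trinomial_triangle(5):
        print(row)
    # [1]
    # [1, 1, 1]
    # [1, 2, 3, 2, 1]
    # [1, 3, 6, 7, 6, 3, 1]
    # [1, 4, 10, 16, 19, 16, 10, 4, 1]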

Related posts