Looking for keys under the lamppost

There’s an old joke about a drunk man looking for his keys under a lamppost. Someone stops and offers to help. He asks, “So, did you lose your keys here?” The drunk replies “No, I lost them over there, but here’s where the light is.”

I routinely talk to people who have strong technical skills and who want to go into consulting. They usually think that the main thing they need to do next is improve their technical skills. Maybe they know five programming languages but believe learning a sixth one would really open up opportunities. (Invariably the five languages they know are in demand and the sixth is not.) Or they have a graduate degree in math but believe there’s an area of math they need to learn more about.

They’re looking for their keys under the lamppost. And I completely understand. I would rather learn another programming language, for example, than go to a conference and hustle for work.

There’s something to be said for improving your strengths rather than your weaknesses, unless your weaknesses are the rate-limiting factor. If sales are holding you back, for example, then you need to get better at sales.

Structured frameworks for complex systems

I wrote two unrelated blog posts this morning, one about the math paper H = W and one about a joke putting numbers into the D&D alignment matrix.

I used Grok to look up the reference to the H = W paper, and to confirm that the alignment matrix originated with Dungeons & Dragons [1]. I didn’t start a new chat when I asked a new question, but Grok felt obligated to connect the two queries. Near the end of the response to my D&D query, Grok added a section “Connection to Sobolev Spaces Query” with the following content:

While the alignment system and Sobolev spaces (from your previous query about the “H = W” paper) are unrelated mathematically, both reflect efforts to create structured frameworks for complex systems—alignments for character ethics, Sobolev spaces for function analysis. The “Neutral” alignment, like the regularity of distributions, represents a balanced or intermediate state, bridging extremes (Law/Chaos or regular/singular distributions).

Emphasis in the original.

On the one hand this is nonsense. On the other hand, Sobolev spaces and the D&D alignment matrix are structured frameworks for complex systems!

Grok’s response reminds me of when I would doze off while studying and my brain would find marvelous nonsensical connections between everything that I had been studying, such as a way to integrate French verbs.

Aligning functions

Classifying functions into the D&D alignment grid is silly, but what if we were to do it anyway? Here’s my quick take.

The good and lawful functions have classical definitions. C is the space of continuous functions, C² the space of functions with two continuous derivatives, and C^∞ the space of infinitely differentiable functions.

The function W classified as chaotic good is Weierstrass’ example of a nowhere differentiable, everywhere continuous function.

The true neutral functions W^{k,p} (no relation to the Weierstrass W) are the Sobolev spaces from the earlier post. The larger the k, the more regular the functions. L^p is W^{k,p} with k = 0. These functions are not necessarily defined pointwise but only modulo a set of measure zero.

The function 1_ℚ is the indicator function of the rationals, i.e. the function that equals 1 if a number is rational and 0 if it is irrational.

Here δ is Dirac’s delta “function” and δ^(n) is its nth derivative. The delta function is a generalized function, not a function in any classical sense, and taking its derivative only makes it worse.

[1] The original 1974 version of D&D had one axis: lawful, neutral, and chaotic. In 1977 the game added the second axis: good, neutral, and evil.

 

Dungeons, Dragons, and Numbers

Dan Piponi posted a chart like the one below on Mastodon.

|         | Lawful | Neutral | Chaotic |
|---------+--------+---------+---------|
| Good    | 6      | φ       | π       |
| Neutral | 1      | 0       | δ       |
| Evil    | -1     | Ω       | Rayo    |

At the risk of making a joke not funny by explaining it, I’d like to explain Dan’s table.

The alignment matrix above comes from Dungeons & Dragons and has become a kind of meme.

The neutral good number φ is the golden ratio, (1 + √5)/2 = 1.6180339….

Presumably readers of this blog have heard of π. Perhaps it ended up in the chaotic column because it is believed to be (but hasn’t been proven to be) a “normal number,” i.e. a number whose digits are uniformly distributed when written in any base.
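If you want to poke at this empirically, here’s a quick sketch (assuming the mpmath library) that tallies the first ten thousand or so decimal digits of π. Roughly equal counts are consistent with normality, though of course no finite tally proves anything.

    from collections import Counter
    import mpmath                      # assumed available; any bignum source of pi would do

    mpmath.mp.dps = 10010              # a little more precision than we print
    digits = mpmath.nstr(+mpmath.pi, 10000)[2:]   # drop the leading "3."
    print(Counter(digits))             # each digit should show up roughly 1,000 times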

The number δ = 4.6692016… above is Feigenbaum’s constant, something I touched on three weeks ago. I suppose putting it in the chaos column is a kind of pun because the number comes up in chaos theory. It is believed that δ is transcendental, but there’s no proof that it’s even irrational.

The number Ω is Chaitin’s constant. In some loose sense, this number is the probability that a randomly generated program will halt. It is a normal number, but it is also uncomputable. The smallest program that computes the first n bits of Ω must have length on the order of n, and so no program of any fixed length can crank out the bits of Ω.

Rayo’s number is truly evil. It is said to be the largest named number. Here’s the definition:

The smallest number bigger than any finite number named by an expression in any language of first-order set theory in which the language uses only a googol symbols or less.

You could parameterize Rayo’s definition by replacing googol (= 10^100) with an integer n. The sequence Rayo(n) starts out small enough. According to OEIS, Rayo(30) = 2 because

the 30 symbol formula “(∃x_1(x_1∈x_0)∧(¬∃x_1(∃x_2((x_2∈x_1∧x_1∈x_0)))))” uniquely defines the number 1, while smaller formulae can only define the number 0, the smallest being the 10 symbol “(¬∃x_1(x_1∈x_0))”.

OEIS does not give the value of Rayo(n) for any n > 30, presumably because it would be very difficult to compute even Rayo(31).

My favorite paper: H = W

A paper came out in 1964 with the title “H = W.” The remarkably short title was not cryptic, however. The people for whom the paper was written knew exactly what it meant. There were two families of function spaces, one denoted with H and another denoted with W, that were speculated to be the same, and this paper proved that indeed they were.

This post will explain the significance of the paper. It will be more advanced and less self-contained than most of my posts. If you’re still interested, read on.

Definitions

The derivative of a function might exist in a generalized sense when it doesn’t exist in the classical sense. I give an introduction to this idea in the post How to differentiate a non-differentiable function. The generalized derivative is a linear functional on test functions [1] and may not correspond to a classical function. The delta “function” δ(x) is the classic example.

A regular distribution is a distribution whose action on a test function is equal to multiplying the test function by a locally integrable function and integrating.

Given Ω ⊂ ℝ^n, the Sobolev space W^{k,p}(Ω) consists of functions whose partial derivatives of order up to k are regular distributions that lie in the space L^p(Ω).

For example, let I be the interval (−1, 1) and let f(x) = |x|. The function f is not differentiable at 0, but the generalized derivative of f is the sign function sgn(x), which is in L^p(I) for all p. The generalized derivative of sgn(x) is 2δ(x), which is not a regular distribution [2], and so f ∈ W^{1,p}(I) but f ∉ W^{2,p}(I).
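To see where the factor of 2 comes from, pair sgn with a test function φ and integrate by parts; since φ vanishes near the endpoints of I, only the jump at 0 contributes. (The line below just spells out the claim in footnote [2].)

\langle \operatorname{sgn}', \varphi \rangle = -\int_{-1}^{1} \operatorname{sgn}(x)\,\varphi'(x)\,dx = \int_{-1}^{0}\varphi'(x)\,dx - \int_{0}^{1}\varphi'(x)\,dx = 2\varphi(0) = \langle 2\delta, \varphi \rangle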

The norm on W^{k,p}(Ω) is the sum of the L^p norms of the function and each of its partial derivatives up to order k.
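In symbols, with multi-index notation for the partial derivatives (this is just a transcription of the sentence above):

\| f \|_{W^{k,p}(\Omega)} = \sum_{|\alpha| \le k} \| D^\alpha f \|_{L^p(\Omega)}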

The Sobolev space H^{k,p}(Ω) is the closure of the smooth (C^∞) functions on Ω with finite W^{k,p} norm, the closure being taken in the norm of the space W^{k,p}(Ω).

Theorem

It’s not obvious a priori that these two ways of defining a Sobolev space are equivalent, but Norman George Meyers and James Serrin [3] proved in 1964 that for all domains Ω, for all non-negative integers k, and for all 1 ≤ p < ∞ we have

H^{k,p}(Ω) = W^{k,p}(Ω).

The proof is remarkably brief, less than a page.

Significance

Why does this theorem matter? Sobolev spaces are the machinery of the modern theory of differential equations. I spent a lot of time in my mid 20s working with Sobolev spaces.

The grand strategy of PDE research is to first search for generalized solutions to an equation, solutions belonging to a Sobolev space, then if possible prove that the generalized solution is in fact a classical solution.

This is analogous to first proving that an algebraic equation has a complex solution, then proving that the complex solution is real, or proving that an equation has a real solution, then proving that the real solution is in fact an integer. It’s easier to first find a solution in a larger space, then, if possible, show that the thing you found belongs to a smaller space.


[1] A test function in this context is an infinitely differentiable function with compact support. In other contexts a test function is not required to have compact support but is required to approach zero rapidly, faster than the reciprocal of any polynomial.

[2] The classical derivative of sgn(x) is equal to zero almost everywhere. But the derivative as a distribution is not zero. The pointwise derivative may not equal the generalized derivative.

[3] Meyers, Norman G.; Serrin, James (1964), “H = W”, Proceedings of the National Academy of Sciences, 51 (6): 1055–1056.

Wilkinson’s polynomial

If you change the coefficients of a polynomial a little bit, do you change the location of its zeros a little bit? In other words, do the roots of a polynomial depend continuously on its coefficients?

You would think so, and you’d be right. Sorta.

It’s easy to see that a small change to a coefficient could result in a qualitative change, such as a double root splitting into two distinct roots close together, or a real root acquiring a small imaginary part. But it’s also possible that a small change to a coefficient could cause roots to move quite a bit.

Wilkinson’s polynomial

Wilkinson’s polynomial is a famous example that makes this all clear.

Define

w(x) = (x − 1)(x − 2)(x − 3)…(x − 20) = x^20 − 210x^19 + …

and

W(x) = w(x) − 2^−23 x^19

so that the difference between the two polynomials is that the coefficient of x^19 in the former is −210 and in the latter the coefficient is −210.0000001192.

Surely the roots of w(x) and W(x) are close together. But they’re not!

The roots of w are clearly 1, 2, 3, …, 20. But the roots of W are substantially different. Or at least some are substantially different. About half the roots didn’t move much, but the other half moved a lot. For example, the roots that had been at 16 and 17 become the complex conjugate pair 16.7308 ± 2.81263 i.

Here’s the Mathematica code that produced the plot above.

    w[x_] := Product[(x - i), {i, 1, 20}]
    W[x_] := w[x] - 2^-23 x^19
    roots = NSolve[W[x] == 0, x]
    ComplexListPlot[x /. roots]

A tiny change to one coefficient creates a change in the roots that is seven orders of magnitude larger. Wilkinson described this discovery as “traumatic.”
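As a cross-check in another system, here is a rough NumPy version of the same experiment. It is only a sketch: the coefficients of w are too large to represent exactly in double precision, so the computed roots of both polynomials carry some rounding error, but the qualitative blow-up is clearly visible.

    import numpy as np

    w = np.poly(np.arange(1, 21))   # coefficients of (x-1)(x-2)...(x-20), highest degree first
    W = w.copy()
    W[1] -= 2.0**-23                # perturb the x^19 coefficient: -210 becomes -210.0000001192

    print(np.roots(w))              # close to 1, 2, ..., 20, up to rounding in the coefficients
    print(np.roots(W))              # several roots acquire large imaginary parts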

Modulus of continuity

The zeros of a polynomial do depend continuously on the parameters, but the modulus of continuity might be enormous.

Given some desired tolerance ε on how much the zeros are allowed to move, you can specify a corresponding degree of accuracy δ on the coefficients to meet that tolerance, but you might have to have extreme accuracy on the coefficients to have moderate accuracy on the roots. The modulus of continuity is essentially the ratio ε/δ, and this ratio might be very large. In the example above it’s roughly 8,400,000.

In terms of the δ-ε definition of continuity, for every ε there is a δ. But that δ might have to be orders of magnitude smaller than ε.

Modulus of continuity is an important concept. In applications, it might not matter that a function is continuous if the modulus of continuity is enormous. In that case the function may be discontinuous for practical purposes.

Continuity is a discrete concept—a function is either continuous or it’s not—but modulus of continuity is a continuous concept, and this distinction can resolve some apparent paradoxes.

For example, consider the sequence of functions f_n(x) = x^n on the interval [0, 1]. For every n, f_n is continuous, but the limit is not a continuous function: the limit equals 0 for x < 1 and equals 1 at x = 1.

But this makes sense in terms of modulus of continuity. The modulus of continuity of f_n(x) is n. So while each f_n is continuous, the sequence is becoming “less continuous” in the sense that the modulus of continuity is increasing.
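If you read “modulus of continuity” here as the worst-case slope, i.e. the Lipschitz constant, a quick numerical check agrees. This is only a sketch of that reading, assuming NumPy.

    import numpy as np

    x = np.linspace(0, 1, 100001)
    for n in (2, 10, 100):
        slopes = np.abs(np.diff(x**n) / np.diff(x))   # finite-difference slopes of f_n(x) = x^n
        print(n, slopes.max())                        # approaches n, the steepest slope, attained at x = 1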

Interpolation instability

You would think that interpolating at more points would give you a more accurate approximation. There’s a famous example by Runge that proves this is not the case. If you interpolate the function 1/(1 + x²) over the interval [−5, 5], as you add more interpolation points the maximum interpolation error actually increases.

Here’s an example from a post I wrote a few years ago, using 16 interpolation points. This is the same function, rescaled to live on the interval [−1, 1].

There are two things that make Runge’s example work.

  1. The interpolation nodes are evenly spaced.
  2. The function has singularities in the complex plane too close to the interval of interpolation.

If you use Chebyshev nodes rather than evenly spaced nodes, the problem goes away.

And if the function being interpolated is analytic in a certain region around the unit interval (described here) the problem also goes away, at least in theory. In practice, using floating point arithmetic, you can see rapid divergence even though in theory the interpolants converge [1]. That is the point of this post.

Example

So let’s try interpolating exp(−9x²) on the interval [−1, 1]. This function is analytic everywhere, so the interpolating polynomials should converge as the number of interpolation points n goes to infinity. Here’s a plot of what happens when n = 50.

Looks like all is well. No need to worry. But let’s press our luck and try n = 100.

Hmm. Something is wrong.

In fact things are far worse than the plot suggests. The largest interpolant value is over 700,000,000. It just doesn’t show in the plot.

Interpolating at evenly spaced points is badly conditioned. There would be no problem if we could compute with infinite precision, but with finite precision interpolation errors can grow exponentially.
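Here is a sketch of the experiment using SciPy’s barycentric interpolator, comparing evenly spaced nodes with Chebyshev nodes. The exact error sizes depend on the machine and the interpolation routine, but the pattern should match the plots: modest error at n = 50, blow-up at n = 100 on evenly spaced nodes, and near machine precision on Chebyshev nodes.

    import numpy as np
    from scipy.interpolate import BarycentricInterpolator

    f = lambda x: np.exp(-9*x**2)
    t = np.linspace(-1, 1, 2001)                 # fine grid for measuring the error

    for n in (50, 100):
        equi = np.linspace(-1, 1, n)
        cheb = np.cos((2*np.arange(1, n + 1) - 1)*np.pi/(2*n))   # Chebyshev nodes
        for name, nodes in (("evenly spaced", equi), ("Chebyshev", cheb)):
            p = BarycentricInterpolator(nodes, f(nodes))
            print(n, name, np.max(np.abs(p(t) - f(t))))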

Personal anecdote

Years ago I learned that someone was interpolating exp(−x²) at a large number of evenly spaced points in an application. If memory serves, they wanted to integrate the interpolant in order to compute probabilities.

I warned them that this was a bad idea because of Runge phenomena.

I was wrong on theoretical grounds, because exp(−x²) is analytic. I didn’t know about Runge’s convergence theorem.

But I was right on numerical grounds, because I also didn’t know about the numerical instability demonstrated above.

So I made two errors that sorta canceled out.


[1] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

Drazin pseudoinverse

The most well-known generalization of the inverse of a matrix is the Moore-Penrose pseudoinverse. But there is another notion of generalized inverse, the Drazin pseudoinverse, for square matrices.

If a matrix A has an inverse A^−1 then it also has a Moore-Penrose pseudoinverse A^+ and a Drazin pseudoinverse A^D and

A^−1 = A^+ = A^D.

When the matrix A is not invertible the two notions of pseudoinverse might not agree.

One way to think about the two generalized inverses is that both are based on a canonical form. The Moore-Penrose pseudoinverse is based on the singular value decomposition. The Drazin inverse is based on the Jordan canonical form.

Jordan canonical form

Every square matrix A can be put into Jordan canonical form

P^−1 A P = J

where J is the Jordan canonical form and P is an invertible matrix whose columns are the generalized eigenvectors of A.

If the matrix A is diagonalizable, the matrix J is diagonal. Otherwise the matrix J is block diagonal. Each block has a single eigenvalue repeated along its diagonal, 1s on the diagonal just above the main diagonal, and zeros everywhere else; the sizes of the blocks for an eigenvalue add up to that eigenvalue’s multiplicity.

If all the eigenvalues of A are non-zero, the matrix is invertible and it is easy to compute the inverse from the Jordan form. To invert J, simply invert each block of J. This is clear from the way you multiply partitioned matrices.
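For example, a 2 × 2 Jordan block with nonzero eigenvalue λ has the explicit inverse

\begin{bmatrix} \lambda & 1 \\ 0 & \lambda \end{bmatrix}^{-1} = \begin{bmatrix} 1/\lambda & -1/\lambda^2 \\ 0 & 1/\lambda \end{bmatrix}

as you can verify by multiplying the two matrices together.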

To find the inverse of A, multiply on the left by P and on the right by P^−1. That is,

A^−1 = P J^−1 P^−1.

Proof:

A (P J^−1 P^−1) = (P J P^−1)(P J^−1 P^−1) = I

Drazin pseudoinverse

One way to compute the Drazin inverse, at least in theory, is from the Jordan canonical form. In practice, you might want to compute the Drazin inverse a different way.

If all the eigenvalues of A are non-zero, A is invertible and its Drazin pseudoinverse is its inverse.

If zero is an eigenvalue, the block(s) in J corresponding to 0 cannot be inverted. If a block is 1 by 1 then it just has a zero in it. If a block is larger, it will have 1s above the main diagonal. Replace the 1s by 0s. This gives us the Drazin pseudoinverse of J. To obtain the Drazin pseudoinverse of A, multiply on the left by P and on the right by P^−1 just as above.
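Here is a minimal sketch of that recipe using SymPy’s jordan_form. The function name drazin is mine, for illustration only; as noted above, in practice you’d compute the Drazin inverse a different way.

    import sympy as sp

    def drazin(A):
        """Drazin pseudoinverse computed from the Jordan canonical form A = P J P^(-1)."""
        P, J = A.jordan_form()
        n = J.shape[0]
        JD = sp.zeros(n, n)
        i = 0
        while i < n:
            j = i
            while j + 1 < n and J[j, j + 1] == 1:   # extend to the end of the current Jordan block
                j += 1
            if J[i, i] != 0:                        # invertible block: invert it
                JD[i:j+1, i:j+1] = J[i:j+1, i:j+1].inv()
            # block with eigenvalue 0: leave it as the zero block
            i = j + 1
        return P * JD * P.inv()

    A = sp.Matrix([[2, 0, 0],
                   [0, 0, 1],
                   [0, 0, 0]])                      # one invertible block, one nilpotent block
    AD = drazin(A)
    print(AD)                                       # Matrix([[1/2, 0, 0], [0, 0, 0], [0, 0, 0]])
    print(A*AD == AD*A, AD*A*AD == AD)              # True True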

Characterization

The Drazin pseudoinverse can be characterized by a few properties, just as the Moore-Penrose pseudoinverse can. Let’s compare the properties.

The defining properties of the Moore-Penrose pseudoinverse A^+ of a matrix A are

  1. A A^+ A = A
  2. A^+ A A^+ = A^+
  3. (A A^+)* = A A^+
  4. (A^+ A)* = A^+ A

The defining properties of the Drazin pseudoinverse are

  1. A A^D = A^D A
  2. A^D A A^D = A^D
  3. A^(k+1) A^D = A^k

where k is the smallest non-negative integer such that rank(A^(k+1)) = rank(A^k).

The second properties of the two pseudoinverses are analogous. The first property of the Drazin inverse is sorta like the 3rd and 4th properties of the Moore-Penrose pseudoinverse.

The analog of A A^+ A = A does not hold; in general A A^D A does not equal A.

The Drazin property A^(k+1) A^D = A^k is nothing like any property of the Moore-Penrose pseudoinverse. It suggests that Drazin inverses behave nicely when you raise matrices to powers, which is true.

You can see this exponent condition in the Jordan normal form. The blocks in J corresponding to a 0 eigenvalue are nilpotent, i.e. they become zero when you raise them to a high enough power. When you raise J to the power k, the same k as above, the nilpotent blocks go away.

Note that although taking the Moore-Penrose pseudoinverse twice gets you back where you started, the same is not true of the Drazin pseudoinverse. That is, (A^+)^+ = A but in general (A^D)^D ≠ A.


Effective graph resistance

I’ve mentioned the Moore-Penrose pseudoinverse of a matrix a few times, most recently last week. This post will give an application of the pseudoinverse: computing effective graph resistance.

Given a graph G, imagine replacing each edge with a one Ohm resistor. The effective resistance between two nodes in G is the electrical resistance between those two nodes.

Calculating graph resistance requires inverting the graph Laplacian, but the graph Laplacian isn’t invertible. We’ll resolve this paradox shortly, but in the meantime this sounds like a job for the pseudoinverse, and indeed it is.

The graph Laplacian of a graph G is the matrix L = D − A where D is the diagonal matrix whose entries are the degrees of each node and A is the adjacency matrix. The vector of all 1’s

1 = (1, 1, 1, …, 1)

is in the null space of L, but if G is connected then the null space is one-dimensional. More generally, the dimension of the null space equals the number of connected components of G.

Calculating graph resistance requires solving the linear system

Lw = v

where v and w are orthogonal to 1. If the graph G is connected then the system has a unique solution. The graph Laplacian L is not invertible over the entire space, but it is invertible in the orthogonal complement to the null space.

The Moore-Penrose pseudoinverse L^+ gives the graph resistance matrix. The graph resistance between two nodes a and b is given by

R_ab = (e_a − e_b)^T L^+ (e_a − e_b)

where e_i is the vector that has all zero entries except for a 1 in the ith slot.
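Here is a small numerical check of this formula, assuming NumPy. For a triangle with unit resistors, the resistance between two adjacent nodes is one Ohm in parallel with two Ohms, i.e. 2/3 Ohm.

    import numpy as np

    A = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]])              # adjacency matrix of a triangle
    L = np.diag(A.sum(axis=1)) - A         # graph Laplacian L = D - A
    Lplus = np.linalg.pinv(L)              # Moore-Penrose pseudoinverse

    e = np.eye(3)
    R01 = (e[0] - e[1]) @ Lplus @ (e[0] - e[1])
    print(R01)                             # 0.666..., i.e. 2/3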

For more on graph resistance see Effective graph resistance by Wendy Ellens et al.


Multiplying a matrix by its transpose

An earlier post claimed that there are practical advantages to partitioning a matrix, thinking of the matrix as a matrix of matrices. This post will give an example.

Let M be a square matrix and suppose we need to multiply M by its transpose M^T. We can compute this product faster than multiplying two arbitrary matrices of the same size by exploiting the fact that M M^T will be a symmetric matrix.

We start by partitioning M into four blocks

M = \begin{bmatrix}A & B \\ C & D \end{bmatrix}

Then

M^\intercal = \begin{bmatrix}A^\intercal & C^\intercal \\ B^\intercal & D^\intercal \end{bmatrix}

and

MM^\intercal = \begin{bmatrix} AA^\intercal  + BB^\intercal & AC^\intercal  + BD^\intercal \\CA^\intercal + DB^\intercal & CC^\intercal + DD^\intercal \end{bmatrix}

Now for the first clever part: We don’t have to compute both of the off-diagonal blocks because each is the transpose of the other. So we can reduce our calculation by 25% by not calculating one of the blocks, say the lower left block.

And now for the second clever part: apply the same procedure recursively. The diagonal blocks in M M^T involve a matrix times its transpose. That is, we can partition A and use the same idea to compute AA^T, and do the same for BB^T, CC^T, and DD^T. The off-diagonal blocks require general matrix multiplications.

The net result is that we can compute M M^T in about 2/3 the time it would take to multiply two arbitrary matrices of the same size.
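Here is a sketch of that recursion in NumPy, assuming for simplicity that the matrix size is a power of 2. The function name mmt is mine, and the code won’t beat a tuned BLAS call in practice; it’s only meant to make the block structure concrete.

    import numpy as np

    def mmt(M):
        """Compute M @ M.T by recursing on the blocks, reusing the symmetry of the result."""
        n = M.shape[0]
        if n <= 64:                            # small case: ordinary multiplication
            return M @ M.T
        h = n // 2
        A, B = M[:h, :h], M[:h, h:]
        C, D = M[h:, :h], M[h:, h:]
        top_left  = mmt(A) + mmt(B)            # A A^T + B B^T, computed recursively
        bot_right = mmt(C) + mmt(D)            # C C^T + D D^T, computed recursively
        top_right = A @ C.T + B @ D.T          # general products; the lower left block is its transpose
        return np.block([[top_left,    top_right],
                         [top_right.T, bot_right]])

    M = np.random.rand(512, 512)
    print(np.allclose(mmt(M), M @ M.T))        # True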

Recently a group of researchers found a way to take this idea even further, partitioning a matrix into a 4 by 4 grid of 16 blocks and doing some clever tricks. The RXTX algorithm can compute M M^T in about 26/41 the time required to multiply arbitrary matrices, a savings of about 5% over the recursive approach above. A 5% improvement may be significant if it appears in the inner loop of a heavy computation. According to the authors, “The algorithm was discovered by combining Machine Learning-based search methods with Combinatorial Optimization.”


A bit-twiddling marvel

Pop quiz: what does the following code do?

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

It determines whether the year y is a leap year in the Gregorian calendar, of course. :)

It’s very efficient, though I don’t imagine that would ever matter. But it’s very clever.

Gregorian vs Julian calendar

A year is a leap year in the Julian calendar if and only if it is a multiple of 4. But the Julian year is a bit too long on average to match the earth’s orbit around the sun. You get a much better fit if you add the complication that years divisible by 100 but not by 400 are not leap years.

Presumably no one reading this recalls 1900 not being a leap year, but some of you may need to know some day that 2100 is not a leap year either.

C code

The cryptic function above comes from the recent post A leap year check in three instructions by Falk Hüffner. The function is correct for the next 100 thousand years (more precisely, through 102499) and is correct if we anachronistically extend the Gregorian calendar back to the year 0.

The following C code shows empirically that Hüffner’s code is correct, but you’ll have to read his post to understand why it is correct.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

bool is_leap_year(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}

int main() {
    for (uint32_t y = 0; y <= 102499; y++)
        if (is_leap_year_fast(y) != is_leap_year(y))
            printf("Exception: %u\n", y);
    return 0;
}
