Dijkstra extends Pythagoras

Suppose a triangle has sides a, b, and c. Label the angles opposite these three sides α, β, and γ respectively.

Edsger Dijkstra published (EWD975) a note proving the following extension of the Pythagorean theorem:

sgn(α + β – γ) = sgn(a² + b² – c²).

Here the sgn function is -1, 0, or 1 depending on whether its input is negative, zero, or positive.

To see that this really is an extension of the Pythagorean theorem, if γ is a right angle, then α + β = γ and so the sgn on the left hand side evaluates to 0. This forces the right hand side to 0, which says a² + b² = c².

As Dijkstra points out, his is a theorem about triangles, not simply a theorem about right triangles.

More Dijkstra posts

Bump functions

A bump function is a smooth (i.e. infinitely differentiable) function that is positive on some open interval (a, b) and zero outside that interval. I mentioned bump functions a few weeks ago and discussed how they could be used to prevent clicks in radio transmissions.

Today I ran into a twitter thread that gave a more general construction of bump functions that I’d seen before. The thread concludes with this:

You can create bump functions by this recipe:

    1. Take any f(x) growing faster than polynomial (e.g. exp)
    2. Define g(x) = 1 / f(1/x)
    3. Let h(x) = g(1+x) g(1−x).
    4. Zero out x∉(−1,1)
    5. Scale, shift, etc

In this post I’ll give a quick asymptotic proof that the construction above works.

Let a positive integer n be given and define g(x) to be zero for negative x. We’ll show that if f grows faster than xn. then g is n times differentiable at 0.

As x → ∞, f(x) is eventually bounded below by a function growing faster than xn. And so as x → 0, f(1/x) grows faster than xn and 1/f(1/x) goes to zero faster than xn, and so its nth derivative is zero.

It follows that g(1 + x) is n times differentiable at -1 and g(1 – x) is n times differentiable at 1. So h is n times differentiable at -1 and 1. The function h is positive on the open interval (-1, 1)  and zero outside. Our choice of n was arbitrary, so h is infinitely differentiable, and so h is a bump function. We could shift and scale h to make it a bump function on any other finite interval.


When I was a student, I would have called this kind of proof hand waving. I’d want to see every inequality made explicit: there exists some M > 0 such that for x > M …. Now I find arguments like the one above easier to follow and more convincing. I imagine if a lecturer gives a proof with all the inequalities spelled out, he or she is probably thinking about something like the proof above and expanding the asymptotic argument on the fly.

Note that a slightly more general theorem falls out of the proof. Our goal was to show if f grows faster than every polynomial, then g and h are infinitely differentiable. But along the way we proved, for example, that if f eventually grows like x7 then g and h are six-times differentiable.

In fact, let’s look at the case f(x) = x7.

    f[x_] := x^7
    g[x_] := If[x > 0, 1/ f[1/x], 0]
    h[x_] := g[x + 1] g[1 - x]
    Plot[h[x], {x, -1.5, 1.5}]

This produces the following plot.

Just looking at the plot, h looks smooth; it’s plausible that it has six derivatives. It appears h is zero outside [-1, 1] and h is positive over at least part of [-1, 1]. If you look at the Mathematica code you can convince yourself that h really is positive on the entire interval open interval (-1, 1), though it is very small as it approaches either end.

Related posts

Polynomial analog of the Collatz conjecture

The Collatz conjecture, a.k.a. the 3n + 1 problem, a.k.a. the hailstone conjecture, asks whether the following sequence always terminates.

Start with a positive integer n.

  1. If n is even, set nn /2. Otherwise n ← 3n + 1.
  2. If n = 1, stop. Otherwise go back to step 1.

The Collatz conjecture remains an unsolved problem, though there has been progress toward a proof. Some people, however, are skeptical whether the conjecture is true.

This post will look at a polynomial analog of the Collatz conjecture. Instead of starting with a positive integer, we start with a polynomial with coefficients in the integers mod m.

If the polynomial is divisible by x, then divide by x. Otherwise, multiply by (x + 1) and add 1. Here x is analogous to 2 and (x + 1) is analogous to 3 in the (integer) Collatz conjecture.

Here is Mathematica code to carry out the polynomial Collatz operation.

    c[p_, m_] := 
            If[ (p /. x -> 0) == 0, 
                 (x + 1) p + 1

If m = 2, the process always converges. In fact, it converges in at most n² + 2n steps where n is the degree of the polynomial [1].

Here’s an example starting with the polynomial x7 + x3 + 1.

    Nest[c[#, 2] &, x^7 + x^3 + 1, 15]

This returns 1, and so in this instance 15 iterations are enough.

If m = 3, however, the conjecture is false. In [1] the authors report that the sequence starting with x² + 1 is periodic with period 6.

The following code produces the sequence of values.

    NestList[c[#, 3] &, x^2 + 1, 6]

This returns the sequence

  • 1 + x2
  • 2 + x + x2 + x3
  • 2 x2 + 2 x3 + x4
  • 2 x + 2 x2 + x3
  • 2 + 2 x + x2
  • x + x3
  • 1 + x2

Related posts

[1] Kenneth Hicks, Gary L. Mullen, Joseph L. Yucas and Ryan Zavislak. A Polynomial Analogue of the 3n + 1 Problem. The American Mathematical Monthly, Vol. 115, No. 7. pp. 615-622.

Simple derivation of exponential approximation

I was watching one of Brian Douglas’ videos on control theory (Discrete Control #5) and ran into a simple derivation of an approximation I presented earlier.

Back in April I wrote several post on simple approximations for log, exp, etc. In this post I gave an approximation for the exponential function:

\exp(x) \approx \frac{2 + x}{2 - x}

The control theory video arrives at the same approximation as follows.

\begin{align*} \exp(x) &= \exp(x/2)\, \exp(x/2) \\ &= \frac{\exp(x/2)}{\exp(-x/2)} \\ &\approx \frac{1 + x/2}{1 - x/2} \\ &= \frac{2 + x}{2-x} \end{align*}

As I believe I’ve suggested before here, in a derivation like the one above, where you have mostly equalities and one or two approximations, pay special attention to the approximation steps. The approximation step above uses a first order Taylor approximation in the numerator and denominator.

The plot below shows that the approximation above (the bilinear approximation) is more accurate than doing a single Taylor approximation, approximating exp(x) by 1 + x (linear approximation).

exp, linear approx, bilinear approx

Here’s a plot focusing on the error in the bilinear and linear approximations.

error in linear and bilinear approximations to exp

The bilinear approximation is hard to tell from 0 in the plot above for x up to 0.5.

The derivation above is simple, but why is the result so good? An explanation in terms of Padé approximation is given here.

Hebrew letters spotted in applied math

Math and physics use Greek letters constantly, but seldom do they use letters from any other alphabet.

The only Cyrillic letter I recall seeing in math is sha (Ш, U+0428) for the so-called Dirc comb distribution.

One Hebrew letter is commonly used in math, and that’s aleph (א, U+05D0). Aleph is used fairly often, but other Hebrew letters are much rarer. If you see any other Hebrew letter in math, it’s very likely to be one of the next three letters: beth (ב, U+05D1), gimel (ג, U+05D2), or dalet (ד, U+05D3).

To back up this claim, basic LaTeX only has a command for aleph (unsurprisingly, it’s \aleph). AMS-LaTeX adds the commands \beth, \gimel, and \daleth, but no more. Those are the only Hebrew letters you can use in LaTeX without importing a package or using XeTeX so you can use Unicode symbols.

Not only are Hebrew letters rare in math, the only area of math that uses them at all is set theory, where they are used to represent transfinite numbers.

So in short, if you see a Hebrew letter in math, it’s overwhelmingly likely to be in set theory, and it’s very likely to be aleph, or possibly beth, gimel, or dalet.

But today I was browsing through Morse and Feschbach and was very surprised to see the following on page 324.

gimel = lambda ayin + mu yod + mu yod star

I’ve never seen a Hebrew letter in applied math, and I’ve never seen ayin (ע, U+05E2) or yod (י, U+05D9) used anywhere in math.

In context, the authors had used Roman letters, Fraktur letters, and Greek letters and so they’d run out of alphabets. The entity denoted by gimel is related to a tensor the authors denoted with g, so presumably they used the Hebrew letter that sounds like “g”. But I have no idea why they chose ayin or yod.

Related posts

Inverse Gray code

The previous post looked at Gray code, a way of encoding digits so that the encodings of consecutive integers differ in only bit. This post will look at how to compute the inverse of Gray code.

The Gray code of a non-negative integer n is given by

    def gray(n):
        return n ^ (n >> 1)

Bit-level operations

In case you’re not familiar with the notation, the >> operator shifts the bits of its argument. The code above shifts the bits of n one place to the right. In the process, the least significant bit falls off the end. We could replace n >> 1 with n // 2 if we like, i.e. integer division by 2, rounding down if n is odd. The ^ operator stands for XOR, exclusive OR. A bit of x ^ y is 0 if both corresponding bits in x and y are the same, and 1 if they are different.

Computing the inverse

The inverse of Gray code is a more complicated. If we assume n < 232, then we can compute the inverse Gray code of n by

    def inverse_gray32(n):
        assert(0 <= n < 2**32) n = n ^ (n >> 1)
        n = n ^ (n >> 2)
        n = n ^ (n >> 4)
        n = n ^ (n >> 8)
        n = n ^ (n >> 16)
        return n

For n of any size, we can compute its inverse Gray code by

    def inverse_gray(n):
        x = n
        e = 1
        while x:
            x = n >> e
            e *= 2
            n = n ^ x
        return n

If n is a 32-bit integer, inverse_gray32 is potentially faster than inverse_gray because of the loop unrolling.


Here’s a plot of the Gray code function and its inverse.

Proof via linear algebra

How do we know that what we’re calling the inverse Gray code really is the inverse? Here’s a proof for 32-bit integers n.

    def test_inverse32():
        for i in range(32):
            x = 2**i
            assert(inverse_gray32(gray(x)) == x)
            assert(gray(inverse_gray32(x)) == x)

How is that a proof? Wouldn’t you need to try all possible 32-bit integers if you wanted a proof by brute force?

If we think of 32-bit numbers as vectors in a 32-dimensional vector space over the binary field, addition is defined by XOR. So XOR is linear by definition. It’s easy to see that shifts are also linear, and the composition of linear functions is linear. This means that gray and inverse_gray32 are linear transformations. If the two linear transformations are inverses of each other on the elements of a basis, they are inverses everywhere. The unit basis vectors in our vector space are simply the powers of 2.

Matrix representation

Because Gray code and its inverse are linear transformations, they can be defined by matrix multiplication (over the binary field). So we could come up with 32 × 32 binary matrices for Gray code and its inverse. Matrix multiplication would give us a possible, but inefficient, way to implement these functions. Alternatively, you could think of the code above as clever ways to implement multiplication by special matrices very efficiently!

OK, so what are the matrices? For n-bit numbers, the matrix giving the Gray code transformation has dimension n by n. It has 1’s on the main diagonal, and on the diagonal just below the main diagonal, and 0s everywhere else. The inverse of this matrix, the matrix for the inverse Gray code transformation, has 1s on the main diagonal and everywhere below.

Here are the matrices for n = 4.

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

The matrix on the left is for Gray code, the next matrix is for inverse Gray code, and the last matrix is the identity. NB: The equation above only holds when you’re working over the binary field, i.e. addition is carried out mod 2, so 1 + 1 = 0.

To transform a number, represent it as a vector of length n, with the least significant in the first component, and multiply by the appropriate matrix.

Relation to popcount

It’s easy to see by induction that a number is odd if and only if its Gray code has an odd number of 1. The number 1 is its own Gray code, and as we move from the Gray code of n to the Gray code of n+1 we change one bit, so we change the parity of the the number of 1s.

There’s a standard C function popcount that counts the number of 1’s in a number’s binary representation, and the last bit of the popcount is the parity of the number of 1s. I blogged about this before here. If you look at the code at the bottom of that post, you’ll see that it’s the same as gray_inverse32.

The code in that post works because you can compute whether a word has an odd or even number of 1s by testing whether it is the Gray code of an odd or even number.

Permutable polynomials

Two polynomials p(x) and q(x) are said to be permutable if

p(q(x)) = q(p(x))

for all x. It’s not hard to see that Chebyshev polynomials are permutable.


Tn(x) = cos (n arccos(x))

where Tn is the nth Chebyshev polyomial. You can take this as a definition, or if you prefer another approach to defining the Chebyshev polynomials it’s a theorem.

Then it’s easy to show that

Tm(Tn(x)) = Tmn (x).


cos(m arccos(cos(n arccos(x)))) = cos(mn arccos(x)).

Then the polynomials Tm and Tn must be permutable because

Tm(Tn(x)) = Tmn (x) = Tn(Tm(x))

for all x.

There’s one more family of polynomials that are permutable, and that’s the power polynomials xk. They are trivially permutable because

(xm)n = (xn)m.

It turns out that the Chebyshev polynomials and the power polynomials are essentially [1] the only permutable sequence of polynomials.

Related posts

[1] Here’s what “essentially” means. A set of polynomials, at least one of each positive degree, that all permute with each other is called a chain. Two polynomials p and q are similar if there is an affine polynomial

λ(x) = ax + b

such that

p(x) = λ-1( q( λ(x) ) ).

Then any permutable chain is similar to either the power polynomials or the Chebyshev polynomials. For a proof, see Chebyshev Polynomials by Theodore Rivlin.

Conceptual vs Numerical

One of the things that makes numerical computation interesting is that it often reverses the usual conceptual order of things, using advanced math to compute things that are introduced earlier.

Here’s an example I recently ran across [1]. The hyperbolic functions are defined in terms of the exponential function:

\begin{align*} \sinh(x) &= \frac{\exp(x) - \exp(-x)}{2} \\ \cosh(x) &= \frac{\exp(x) + \exp(-x)}{2} \\ \tanh(x) &= \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \\ \end{align*}

But it may be more efficient to compute the exponential function in terms of hyperbolic functions. Specifically,

\exp(x) = \sinh(x) + \sqrt{\sinh(x)^2 + 1}

Why would you want to compute the simple thing on the left in terms of the more complicated thing on the right? Because hyperbolic sine is an odd function. This means that half its power series coefficients are zero: an odd function only has odd powers in its power series.

Suppose you need to compute the exponential function in an environment where you only have a limited number of registers to store constants. You can get more accuracy out of the same number of registers by using them to store coefficients in the power series for hyperbolic sine than for exp.

If we store n coefficients for sinh, we can include powers of x up to 2n – 1. And since the coefficient of x2n is zero, the error in our Taylor approximation is O(x2n+1). See this post for more on this trick.

If we stored n coefficients for exp, we could include powers of x up to n-1, and our error would be O(xn).

To make things concrete, suppose n = 3 and x = 0.01. The error in the exp function would be on the order of 10-6, but the error in the sinh function would be on the order of 10-14. That is, we could almost compute exp to single precision and sinh to almost double precision.

(I’m glossing over a detail here. Would you really need to store, for example, the 1 coefficient in front of x in either series? For simplicity in the argument above I’ve implicitly said yes. Whether you’d actually need to store it depends on the details of your implementation.)

The error estimate above uses big-oh arguments. Let’s do an actual calculation to see what we get.

    from math import exp, sinh, sqrt
    def exp_taylor(x):
        return 1 + x + 0.5*x**2
    def sinh_taylor(x):
        return x + x**3/6 + x**5/120
    def exp_via_sinh(x):
        s = sinh_taylor(x)
        return s + sqrt(s*s + 1)
    def print_error(approx, exact):
        print(f"Computed: {approx}")
        print(f"Exact:    {exact}")
        print(f"Error:    {approx - exact}")
    x = 0.01
    approx1 = exp_taylor(x)
    approx2 = exp_via_sinh(x)
    exact   = exp(x)
    print("Computing exp directly:\n")
    print_error(approx1, exact)
    print("Computing exp via sinh:\n")
    print_error(approx2, exact)

This produces

    Computing exp directly:

    Computed: 1.0100500000000001
    Exact:    1.010050167084168
    Error:    -1.6708416783473012e-07

    Computing exp via sinh:

    Computed: 1.0100501670841682
    Exact:    1.010050167084168
    Error:    2.220446049250313e-16

Our errors are roughly what we expected from our big-oh arguments.

More numerical analysis posts

[1] I saw this identity in Matters Computational: Ideas, Algorithms, Source Code by Jörg Arndt.

Banned math book

Courant & Hilbert is a classic applied math textbook, still in print nearly a century after the first edition came out. The actual title of the book is Methods of Mathematical Physics, but everyone calls it Courant & Hilbert after the authors, Richard Courant and David Hilbert. I was surprised to find out recently that this was once a banned book. How could there be anything controversial about a math book? It doesn’t get into any controversial applications of math; it applies math to physics problems, but doesn’t apply the physics to anything in particular.

The book was first published in Germany in 1924 under the title Methoden der mathematischen Physik. Courant says in the preface

… I had been forced to leave Germany and was fortunate and grateful to be given the opportunities open in the United States. During the Second World War the German book became unavailable and was even suppressed by the National Socialist rulers of Germany. The survival of the book was secured when the United States Government seized the copyright and licensed a reprint issued by Interscience Publishers.

Courant’s language is remarkably restrained under the circumstances.

I wondered why the book was banned. Was Courant Jewish? I’d never considered this before, because I couldn’t care less about the ethnicity of authors. Jew or Greek, bond or free, male or female, I just care about their content. The Nazis, however, did care. According to his Wikipedia biography, Courant fled Germany not because of his Jewish ancestry but because of his affiliation with the wrong political party.


I never had Courant & Hilbert as a textbook, but I was familiar with it as a student. I vaguely remember that the library copy was in high demand and that I considered buying a copy, though it was too expensive for my means at the time. I recently bought a copy now that the book is cheaper and my means have improved.

I covered most of the material in Courant & Hilbert in graduate school, albeit in a more abstract form. As I mentioned the other day, my education was somewhat top-down; I learned about things first in an abstract setting and got down to particulars later, moving from soft analysis to hard analysis.

One quick anecdote along these lines. I read somewhere that David Hilbert was at a conference where someone referred to a Hilbert space and he asked the speaker what such a thing was. Hilbert’s work had motivated the definition of a Hilbert space, but Mr. Hilbert thought in more concrete terms.

Sinc approximation

If a function is smooth and has thin tails, it can be well approximated by sinc functions. These approximations are frequently used in applications, such as signal processing and numerical integration. This post will illustrate sinc approximation with the function exp(-x²).

The sinc approximation for a function f(x) is given by

f(x) \approx \sum_{j=-n}^n f(jh) \, \text{sinc}\left(\frac{x - jh}{h}\right)

where sinc(x) = sin(πx)/πx.

Do you get more accuracy from sampling more densely or by sampling over a wider range? You need to do both. As the number of sample points n increases, you want h to decrease something like 1/√n and the range to increase something like √n.

According to [1], the best trade-off between smaller h and larger n depends on the norm you use to measure approximation error. If you’re measuring error with the Lebesgue p-norm [2], choose h to be

h = (\pi/2) \sqrt{q/(2n)}

where q is the conjugate exponent to p, i.e.

\frac{1}{p} + \frac{1}{q} = 1

When p = 2, q is also 2. For the sup norm, p = ∞ and q = 1.

So the range between the smallest and largest sample will be

\pi \sqrt{nq/2}

Here are a few plots to show how quickly the sinc approximations converge for exp(-x²). We’ll look at n = 1 (i.e. three sample points) , n = 2 (five sample points) and n = 4 (nine sample points). For n > 1, the sinc approximations are so good that the plots are hard to distinguish. So we’ll show the plot of exp(-x²) and its approximation for n = 1 then show the error curves.

And now the error plots.

Note that the vertical scales are different in each subplot. The error for n = 3 is two orders of magnitude smaller than the error for n = 1.

Related posts

[1] Masaaki Sugihara. Near Optimality of the Sinc Approximation. Mathematics of Computation, Vol. 72, No. 242 (Apr., 2003), pp. 767-786.

[2] The reference above doesn’t use the p-norms on the real line but on a strip in the complex plane containing the real line, and with a weight function that penalizes errors exponentially in the tails.