Polynomial analog of the Collatz conjecture

The Collatz conjecture, a.k.a. the 3n + 1 problem, a.k.a. the hailstone conjecture, asks whether the following sequence always terminates.

Start with a positive integer n.

  1. If n is even, set nn /2. Otherwise n ← 3n + 1.
  2. If n = 1, stop. Otherwise go back to step 1.

The Collatz conjecture remains an unsolved problem, though there has been progress toward a proof. Some people, however, are skeptical whether the conjecture is true.

This post will look at a polynomial analog of the Collatz conjecture. Instead of starting with a positive integer, we start with a polynomial with coefficients in the integers mod m.

If the polynomial is divisible by x, then divide by x. Otherwise, multiply by (x + 1) and add 1. Here x is analogous to 2 and (x + 1) is analogous to 3 in the (integer) Collatz conjecture.

Here is Mathematica code to carry out the polynomial Collatz operation.

    c[p_, m_] := 
            If[ (p /. x -> 0) == 0, 
                 (x + 1) p + 1

If m = 2, the process always converges. In fact, it converges in at most n² + 2n steps where n is the degree of the polynomial [1].

Here’s an example starting with the polynomial x7 + x3 + 1.

    Nest[c[#, 2] &, x^7 + x^3 + 1, 15]

This returns 1, and so in this instance 15 iterations are enough.

If m = 3, however, the conjecture is false. In [1] the authors report that the sequence starting with x² + 1 is periodic with period 6.

The following code produces the sequence of values.

    NestList[c[#, 3] &, x^2 + 1, 6]

This returns the sequence

  • 1 + x2
  • 2 + x + x2 + x3
  • 2 x2 + 2 x3 + x4
  • 2 x + 2 x2 + x3
  • 2 + 2 x + x2
  • x + x3
  • 1 + x2

Related posts

[1] Kenneth Hicks, Gary L. Mullen, Joseph L. Yucas and Ryan Zavislak. A Polynomial Analogue of the 3n + 1 Problem. The American Mathematical Monthly, Vol. 115, No. 7. pp. 615-622.

Simple derivation of exponential approximation

I was watching one of Brian Douglas’ videos on control theory (Discrete Control #5) and ran into a simple derivation of an approximation I presented earlier.

Back in April I wrote several post on simple approximations for log, exp, etc. In this post I gave an approximation for the exponential function:

\exp(x) \approx \frac{2 + x}{2 - x}

The control theory video arrives at the same approximation as follows.

\begin{align*} \exp(x) &= \exp(x/2)\, \exp(x/2) \\ &= \frac{\exp(x/2)}{\exp(-x/2)} \\ &\approx \frac{1 + x/2}{1 - x/2} \\ &= \frac{2 + x}{2-x} \end{align*}

As I believe I’ve suggested before here, in a derivation like the one above, where you have mostly equalities and one or two approximations, pay special attention to the approximation steps. The approximation step above uses a first order Taylor approximation in the numerator and denominator.

The plot below shows that the approximation above (the bilinear approximation) is more accurate than doing a single Taylor approximation, approximating exp(x) by 1 + x (linear approximation).

exp, linear approx, bilinear approx

Here’s a plot focusing on the error in the bilinear and linear approximations.

error in linear and bilinear approximations to exp

The bilinear approximation is hard to tell from 0 in the plot above for x up to 0.5.

The derivation above is simple, but why is the result so good? An explanation in terms of Padé approximation is given here.

Hebrew letters spotted in applied math

Math and physics use Greek letters constantly, but seldom do they use letters from any other alphabet.

The only Cyrillic letter I recall seeing in math is sha (Ш, U+0428) for the so-called Dirc comb distribution.

One Hebrew letter is commonly used in math, and that’s aleph (א, U+05D0). Aleph is used fairly often, but other Hebrew letters are much rarer. If you see any other Hebrew letter in math, it’s very likely to be one of the next three letters: beth (ב, U+05D1), gimel (ג, U+05D2), or dalet (ד, U+05D3).

To back up this claim, basic LaTeX only has a command for aleph (unsurprisingly, it’s \aleph). AMS-LaTeX adds the commands \beth, \gimel, and \daleth, but no more. Those are the only Hebrew letters you can use in LaTeX without importing a package or using XeTeX so you can use Unicode symbols.

Not only are Hebrew letters rare in math, the only area of math that uses them at all is set theory, where they are used to represent transfinite numbers.

So in short, if you see a Hebrew letter in math, it’s overwhelmingly likely to be in set theory, and it’s very likely to be aleph, or possibly beth, gimel, or dalet.

But today I was browsing through Morse and Feschbach and was very surprised to see the following on page 324.

gimel = lambda ayin + mu yod + mu yod star

I’ve never seen a Hebrew letter in applied math, and I’ve never seen ayin (ע, U+05E2) or yod (י, U+05D9) used anywhere in math.

In context, the authors had used Roman letters, Fraktur letters, and Greek letters and so they’d run out of alphabets. The entity denoted by gimel is related to a tensor the authors denoted with g, so presumably they used the Hebrew letter that sounds like “g”. But I have no idea why they chose ayin or yod.

Related posts

Inverse Gray code

The previous post looked at Gray code, a way of encoding digits so that the encodings of consecutive integers differ in only bit. This post will look at how to compute the inverse of Gray code.

The Gray code of a non-negative integer n is given by

    def gray(n):
        return n ^ (n >> 1)

Bit-level operations

In case you’re not familiar with the notation, the >> operator shifts the bits of its argument. The code above shifts the bits of n one place to the right. In the process, the least significant bit falls off the end. We could replace n >> 1 with n // 2 if we like, i.e. integer division by 2, rounding down if n is odd. The ^ operator stands for XOR, exclusive OR. A bit of x ^ y is 0 if both corresponding bits in x and y are the same, and 1 if they are different.

Computing the inverse

The inverse of Gray code is a more complicated. If we assume n < 232, then we can compute the inverse Gray code of n by

    def inverse_gray32(n):
        assert(0 <= n < 2**32) n = n ^ (n >> 1)
        n = n ^ (n >> 2)
        n = n ^ (n >> 4)
        n = n ^ (n >> 8)
        n = n ^ (n >> 16)
        return n

For n of any size, we can compute its inverse Gray code by

    def inverse_gray(n):
        x = n
        e = 1
        while x:
            x = n >> e
            e *= 2
            n = n ^ x
        return n

If n is a 32-bit integer, inverse_gray32 is potentially faster than inverse_gray because of the loop unrolling.


Here’s a plot of the Gray code function and its inverse.

Proof via linear algebra

How do we know that what we’re calling the inverse Gray code really is the inverse? Here’s a proof for 32-bit integers n.

    def test_inverse32():
        for i in range(32):
            x = 2**i
            assert(inverse_gray32(gray(x)) == x)
            assert(gray(inverse_gray32(x)) == x)

How is that a proof? Wouldn’t you need to try all possible 32-bit integers if you wanted a proof by brute force?

If we think of 32-bit numbers as vectors in a 32-dimensional vector space over the binary field, addition is defined by XOR. So XOR is linear by definition. It’s easy to see that shifts are also linear, and the composition of linear functions is linear. This means that gray and inverse_gray32 are linear transformations. If the two linear transformations are inverses of each other on the elements of a basis, they are inverses everywhere. The unit basis vectors in our vector space are simply the powers of 2.

Matrix representation

Because Gray code and its inverse are linear transformations, they can be defined by matrix multiplication (over the binary field). So we could come up with 32 × 32 binary matrices for Gray code and its inverse. Matrix multiplication would give us a possible, but inefficient, way to implement these functions. Alternatively, you could think of the code above as clever ways to implement multiplication by special matrices very efficiently!

OK, so what are the matrices? For n-bit numbers, the matrix giving the Gray code transformation has dimension n by n. It has 1’s on the main diagonal, and on the diagonal just below the main diagonal, and 0s everywhere else. The inverse of this matrix, the matrix for the inverse Gray code transformation, has 1s on the main diagonal and everywhere below.

Here are the matrices for n = 4.

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

The matrix on the left is for Gray code, the next matrix is for inverse Gray code, and the last matrix is the identity. NB: The equation above only holds when you’re working over the binary field, i.e. addition is carried out mod 2, so 1 + 1 = 0.

To transform a number, represent it as a vector of length n, with the least significant in the first component, and multiply by the appropriate matrix.

Relation to popcount

It’s easy to see by induction that a number is odd if and only if its Gray code has an odd number of 1. The number 1 is its own Gray code, and as we move from the Gray code of n to the Gray code of n+1 we change one bit, so we change the parity of the the number of 1s.

There’s a standard C function popcount that counts the number of 1’s in a number’s binary representation, and the last bit of the popcount is the parity of the number of 1s. I blogged about this before here. If you look at the code at the bottom of that post, you’ll see that it’s the same as gray_inverse32.

The code in that post works because you can compute whether a word has an odd or even number of 1s by testing whether it is the Gray code of an odd or even number.

Permutable polynomials

Two polynomials p(x) and q(x) are said to be permutable if

p(q(x)) = q(p(x))

for all x. It’s not hard to see that Chebyshev polynomials are permutable.


Tn(x) = cos (n arccos(x))

where Tn is the nth Chebyshev polyomial. You can take this as a definition, or if you prefer another approach to defining the Chebyshev polynomials it’s a theorem.

Then it’s easy to show that

Tm(Tn(x)) = Tmn (x).


cos(m arccos(cos(n arccos(x)))) = cos(mn arccos(x)).

Then the polynomials Tm and Tn must be permutable because

Tm(Tn(x)) = Tmn (x) = Tn(Tm(x))

for all x.

There’s one more family of polynomials that are permutable, and that’s the power polynomials xk. They are trivially permutable because

(xm)n = (xn)m.

It turns out that the Chebyshev polynomials and the power polynomials are essentially [1] the only permutable sequence of polynomials.

Related posts

[1] Here’s what “essentially” means. A set of polynomials, at least one of each positive degree, that all permute with each other is called a chain. Two polynomials p and q are similar if there is an affine polynomial

λ(x) = ax + b

such that

p(x) = λ-1( q( λ(x) ) ).

Then any permutable chain is similar to either the power polynomials or the Chebyshev polynomials. For a proof, see Chebyshev Polynomials by Theodore Rivlin.

Conceptual vs Numerical

One of the things that makes numerical computation interesting is that it often reverses the usual conceptual order of things, using advanced math to compute things that are introduced earlier.

Here’s an example I recently ran across [1]. The hyperbolic functions are defined in terms of the exponential function:

\begin{align*} \sinh(x) &= \frac{\exp(x) - \exp(-x)}{2} \\ \cosh(x) &= \frac{\exp(x) + \exp(-x)}{2} \\ \tanh(x) &= \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \\ \end{align*}

But it may be more efficient to compute the exponential function in terms of hyperbolic functions. Specifically,

\exp(x) = \sinh(x) + \sqrt{\sinh(x)^2 + 1}

Why would you want to compute the simple thing on the left in terms of the more complicated thing on the right? Because hyperbolic sine is an odd function. This means that half its power series coefficients are zero: an odd function only has odd powers in its power series.

Suppose you need to compute the exponential function in an environment where you only have a limited number of registers to store constants. You can get more accuracy out of the same number of registers by using them to store coefficients in the power series for hyperbolic sine than for exp.

If we store n coefficients for sinh, we can include powers of x up to 2n – 1. And since the coefficient of x2n is zero, the error in our Taylor approximation is O(x2n+1). See this post for more on this trick.

If we stored n coefficients for exp, we could include powers of x up to n-1, and our error would be O(xn).

To make things concrete, suppose n = 3 and x = 0.01. The error in the exp function would be on the order of 10-6, but the error in the sinh function would be on the order of 10-14. That is, we could almost compute exp to single precision and sinh to almost double precision.

(I’m glossing over a detail here. Would you really need to store, for example, the 1 coefficient in front of x in either series? For simplicity in the argument above I’ve implicitly said yes. Whether you’d actually need to store it depends on the details of your implementation.)

The error estimate above uses big-oh arguments. Let’s do an actual calculation to see what we get.

    from math import exp, sinh, sqrt
    def exp_taylor(x):
        return 1 + x + 0.5*x**2
    def sinh_taylor(x):
        return x + x**3/6 + x**5/120
    def exp_via_sinh(x):
        s = sinh_taylor(x)
        return s + sqrt(s*s + 1)
    def print_error(approx, exact):
        print(f"Computed: {approx}")
        print(f"Exact:    {exact}")
        print(f"Error:    {approx - exact}")
    x = 0.01
    approx1 = exp_taylor(x)
    approx2 = exp_via_sinh(x)
    exact   = exp(x)
    print("Computing exp directly:\n")
    print_error(approx1, exact)
    print("Computing exp via sinh:\n")
    print_error(approx2, exact)

This produces

    Computing exp directly:

    Computed: 1.0100500000000001
    Exact:    1.010050167084168
    Error:    -1.6708416783473012e-07

    Computing exp via sinh:

    Computed: 1.0100501670841682
    Exact:    1.010050167084168
    Error:    2.220446049250313e-16

Our errors are roughly what we expected from our big-oh arguments.

More numerical analysis posts

[1] I saw this identity in Matters Computational: Ideas, Algorithms, Source Code by Jörg Arndt.

Banned math book

Courant & Hilbert is a classic applied math textbook, still in print nearly a century after the first edition came out. The actual title of the book is Methods of Mathematical Physics, but everyone calls it Courant & Hilbert after the authors, Richard Courant and David Hilbert. I was surprised to find out recently that this was once a banned book. How could there be anything controversial about a math book? It doesn’t get into any controversial applications of math; it applies math to physics problems, but doesn’t apply the physics to anything in particular.

The book was first published in Germany in 1924 under the title Methoden der mathematischen Physik. Courant says in the preface

… I had been forced to leave Germany and was fortunate and grateful to be given the opportunities open in the United States. During the Second World War the German book became unavailable and was even suppressed by the National Socialist rulers of Germany. The survival of the book was secured when the United States Government seized the copyright and licensed a reprint issued by Interscience Publishers.

Courant’s language is remarkably restrained under the circumstances.

I wondered why the book was banned. Was Courant Jewish? I’d never considered this before, because I couldn’t care less about the ethnicity of authors. Jew or Greek, bond or free, male or female, I just care about their content. The Nazis, however, did care. According to his Wikipedia biography, Courant fled Germany not because of his Jewish ancestry but because of his affiliation with the wrong political party.


I never had Courant & Hilbert as a textbook, but I was familiar with it as a student. I vaguely remember that the library copy was in high demand and that I considered buying a copy, though it was too expensive for my means at the time. I recently bought a copy now that the book is cheaper and my means have improved.

I covered most of the material in Courant & Hilbert in graduate school, albeit in a more abstract form. As I mentioned the other day, my education was somewhat top-down; I learned about things first in an abstract setting and got down to particulars later, moving from soft analysis to hard analysis.

One quick anecdote along these lines. I read somewhere that David Hilbert was at a conference where someone referred to a Hilbert space and he asked the speaker what such a thing was. Hilbert’s work had motivated the definition of a Hilbert space, but Mr. Hilbert thought in more concrete terms.

Sinc approximation

If a function is smooth and has thin tails, it can be well approximated by sinc functions. These approximations are frequently used in applications, such as signal processing and numerical integration. This post will illustrate sinc approximation with the function exp(-x²).

The sinc approximation for a function f(x) is given by

f(x) \approx \sum_{j=-n}^n f(jh) \, \text{sinc}\left(\frac{x - jh}{h}\right)

where sinc(x) = sin(πx)/πx.

Do you get more accuracy from sampling more densely or by sampling over a wider range? You need to do both. As the number of sample points n increases, you want h to decrease something like 1/√n and the range to increase something like √n.

According to [1], the best trade-off between smaller h and larger n depends on the norm you use to measure approximation error. If you’re measuring error with the Lebesgue p-norm [2], choose h to be

h = (\pi/2) \sqrt{q/(2n)}

where q is the conjugate exponent to p, i.e.

\frac{1}{p} + \frac{1}{q} = 1

When p = 2, q is also 2. For the sup norm, p = ∞ and q = 1.

So the range between the smallest and largest sample will be

\pi \sqrt{nq/2}

Here are a few plots to show how quickly the sinc approximations converge for exp(-x²). We’ll look at n = 1 (i.e. three sample points) , n = 2 (five sample points) and n = 4 (nine sample points). For n > 1, the sinc approximations are so good that the plots are hard to distinguish. So we’ll show the plot of exp(-x²) and its approximation for n = 1 then show the error curves.

And now the error plots.

Note that the vertical scales are different in each subplot. The error for n = 3 is two orders of magnitude smaller than the error for n = 1.

Related posts

[1] Masaaki Sugihara. Near Optimality of the Sinc Approximation. Mathematics of Computation, Vol. 72, No. 242 (Apr., 2003), pp. 767-786.

[2] The reference above doesn’t use the p-norms on the real line but on a strip in the complex plane containing the real line, and with a weight function that penalizes errors exponentially in the tails.

Manipulating sums

This post is a list of five spinoffs of my previous post. Except for the last point it doesn’t build on the previous post per se, but I’ll use a sum from that post to illustrate five things:

  1. Putting multiple things under a summation sign in LaTeX
  2. Simplifying sums by generalizing binomial coefficients
  3. A bit of notation from Iverson
  4. Changing variables in a sum
  5. Chebyshev polynomials come up again.

Let’s get started. The equation I want to focus on is the following.

\cos n\theta + i\sin n\theta = \sum_{j=0}^n {n\choose j} i^j(\cos\theta)^{n-j} (\sin\theta)^j

Putting things under a sum in LaTeX

I said in the previous post that you could equate the real parts of the left and right side to show that cos nθ can be written as sums and products of cos θ and sin θ. To write this in LaTeX we’d say

\cos n\theta = \sum_{\substack{0 \leq j \leq n \\ j \text{ even}}} {n\choose j} i^j(\cos\theta)^{n-j} (\sin\theta)^j

The command that makes it possible to put two lines of stuff under the summation is \substack. Here’s the LaTeX code that produce the summation sign and the text below it.

    \sum_{\substack{0 \leq j \leq n \\ j \text{ even}}}

Binomial coefficients

We can simplify the sum by removing the limits on j and implicitly letting j run over all integers:

\cos n\theta = \sum_{ j \text{ even}} {n\choose j} i^j(\cos\theta)^{n-j} (\sin\theta)^j

This is because the binomial coefficient term is zero when j > n or j < 0. (See this post for an explanation of more general binomial coefficients.)

Indicator functions

It’s often helpful to turn a sum over a complicated region into a more complicated sum over a simpler region. That is, it’s often easier to deal with complications in the summand than in the domain of summation.

In our case, instead of summing over even integers, we could sum over all integers, if we multiply the summand by a function that is 0 on odd numbers and 1 on even numbers. That is, we multiply by the indicator function of the even integers. The indicator function of a set is 1 on that set and 0 everywhere else.

Kenneth Iverson’s notation uses a Boolean expression in brackets to indicate the function that is 1 if the condition is true and 0 otherwise. So [j even] means the function that is 1 when j is even and 0 when j is odd. So we could write our sum as follows.

\cos n\theta = \sum {n\choose j} i^j [j \text{ even}](\cos\theta)^{n-j} (\sin\theta)^j

Change of variables

We could get rid of the requirement that j is even by replacing j with 2k for a new variable k. Now our sum is

\cos n\theta = \sum {n\choose 2k} (-1)^k (\cos\theta)^{n-2k} (\sin\theta)^{2k}

Notice a couple things. For one thing we were table to write (-1)k rather than i2k.

More importantly, the change of variables was trivial because the sum runs over all integers. If we had explicit limits on j, we would have to change them to different explicit limits on k.

Changing limits of summation is error-prone. This happens a lot, for example, when computing power series solutions for differential equations, and there are mnemonics for reducing errors such as “limits and exponents move in opposite directions.” These complications go away when you sum over all integers.

Chebyshev strikes again

GlennF left a comment on the previous post to the effect that the sum we’ve been talking about reduces to a Chebyshev polynomial.

Since the powers of sin θ are all even, we can replace sin²θ with 1 – cos²θ and get the following.

\cos n\theta = \sum {n\choose 2k} (-1)^k (\cos\theta)^{n-2k} (1 - \cos^2\theta)^k

Now the left side is a polynomial in cos θ, call it P(cos θ). Then P = Tn, the nth Chebyshev polynomial because as explained here, one way to define the Chebyshev polynomials is by

\cos n\theta = T_n(\cos\theta)

If you don’t like that definition, you could use another definition and the equation becomes a theorem.

Analogy between Fibonacci and Chebyshev

Quick observation: I recently noticed that Chebyshev polynomials and Fibonacci numbers have analogous formulas.

The nth Chebyshev polynomial satisfies

T_n(x) = \frac{(x + \sqrt{x^2-1})^n + (x - \sqrt{x^2-1})^n }{2}

for |x| ≥ 1, and the nth Fibonacci number is given by

F_n = \frac{(1 + \sqrt{5})^n - (1 - \sqrt{5})^n }{2^n\sqrt 5}

There’s probably a way to explain the similarity in terms of the recurrence relations that both sequences satisfy.

More on Chebyshev polynomials

More on Fibonacci numbers