Runge phenomena

I’ve mentioned the Runge phenomenon in a couple posts before. Here I’m going to go into a little more detail.

First of all, the “Runge” here is Carl David Tolmé Runge, better known for the Runge-Kutta algorithm for numerically solving differential equations. His name rhymes with cowabunga, not with sponge.

Runge showed that polynomial interpolation at evenly-spaced points can fail spectacularly to converge. His example is the function f(x) = 1/(1 + x²) on the interval [-5, 5], or equivalently, and more conveniently here, the function f(x) = 1/(1 + 25x²) on the interval [-1, 1]. Here’s an example with 16 interpolation nodes.

Runge's example

Runge found that in order for interpolation at evenly spaced nodes in [-1, 1] to converge, the function being interpolated needs to be analytic inside a football-shaped [1] region of the complex plane with major axis [-1, 1] on the real axis and minor axis approximately [-0.5255, 0.5255]  on the imaginary axis. For more details, see [2].

The function in Runge’s example has a singularity at 0.2i, which is inside the football. Polynomial interpolation at evenly spaced points in [-1, 1] would converge for the function f(x) = 1/(1 + x²) since its singularity at i is outside the football.

Runge's example

For another example, consider the function f(x) = exp(- 1/x²) , defined to be 0 at 0. This function is infinitely differentiable but it is not analytic at the origin. With only 16 interpolation points as above, there’s a small indication of trouble at the ends.

Interpolating exp(-1/x^2)

With 28 interpolation points in the plot below, the lack of convergence is clear.

Interpolating exp(-1/x^2)

The problem is not polynomial interpolation per se but polynomial interpolation at evenly-spaced nodes. Interpolation at Chebyshev points converges for the examples here. The location of singularities affects the rate of convergence but not whether the interpolants converge.
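Here is a minimal sketch of that comparison for Runge's function. It assumes SciPy's BarycentricInterpolator as the interpolation routine, which is not necessarily how the plots above were made, and the helper names (runge, equispaced, chebyshev, max_error) are made up for illustration.

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

def runge(x):
    # Runge's example, rescaled to [-1, 1]
    return 1 / (1 + 25 * x**2)

def equispaced(n):
    # n evenly spaced nodes, endpoints included
    return np.linspace(-1, 1, n)

def chebyshev(n):
    # Roots of the Chebyshev polynomial T_n
    k = np.arange(n)
    return np.cos((2 * k + 1) * np.pi / (2 * n))

def max_error(f, nodes):
    # Largest gap between f and its interpolant on a fine grid
    p = BarycentricInterpolator(nodes, f(nodes))
    grid = np.linspace(-1, 1, 2001)
    return np.max(np.abs(f(grid) - p(grid)))

for n in [16, 28]:
    print(n, max_error(runge, equispaced(n)), max_error(runge, chebyshev(n)))
```

Going from 16 to 28 nodes, the first error column should grow and the second should shrink.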

Related: Help with interpolation


[1] American football, that is. The region is like an ellipse but pointy at -1 and 1.

[2] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

Yogi Berra meets Pafnuty Chebyshev

I just got an evaluation copy of The Best Writing on Mathematics 2017. My favorite chapter was Inverse Yogiisms by Lloyd N. Trefethen.

Trefethen gives several famous Yogi Berra quotes and concludes that

Yogiisms are statements that, if taken literally, are meaningless or contradictory or nonsensical or tautological—yet nevertheless convey something true.

An inverse yogiism is the opposite,

[a] statement that is literally true, yet conveys something false.

What a great way to frame a chapter! Now that I’ve heard the phrase, I’m trying to think of inverse yogiisms. Nothing particular has come to mind yet, but I feel like there must be lots of things that fit that description. Trefethen comes up with three inverse yogiisms, and my favorite is the middle one: Faber’s theorem on polynomial interpolation.

Faber’s theorem is a non-convergence result for interpolants of continuous functions. Trefethen quotes several numerical analysis textbooks that comment on Faber’s theorem in a way that implies an overly pessimistic interpretation. Faber’s theorem is true for continuous functions in general, but if the function f  being interpolated is smooth, or even just Lipschitz continuous, the theorem doesn’t hold. In particular, Chebyshev interpolation produces a sequence of polynomials converging to f.

A few years ago I wrote a blog post about a famous example due to Carl Runge: if you interpolate f(x) = 1/(1 + x²) over [-5, 5] with evenly spaced nodes, the sequence of interpolating polynomials diverges. In other words, adding more interpolation points makes the fit worse.

Here’s the result of fitting a 16th degree polynomial to f  at evenly spaced nodes.

graph of f(x) and p16(x)

The error near the ends is terrible, though the fit does improve in the middle. If instead of using evenly spaced nodes you use the roots of Chebyshev polynomials, the interpolating polynomials do in fact converge, and converge quickly. If the kth derivative of f has bounded variation, then the error in interpolating f at n points is O(n^(-k)).

Common words that have a technical meaning in math

Mathematical writing is the opposite of business writing in at least one respect. Math uses common words as technical terms, whereas business coins technical terms to refer to common ideas.

There are a few math terms I use fairly often and implicitly assume readers understand. Perhaps the most surprising is almost as in “almost everywhere.” My previous post, for example, talks about something being true for “almost all x.”

The term “almost” sounds vague but it actually has a precise technical meaning. A statement is true almost everywhere, or holds for almost all x, if the set of points where it doesn’t hold has measure zero.

For example, almost all real numbers are irrational. There are infinitely many rational numbers, and so there are a lot of exceptions to the statement “all real numbers are irrational,” but the set of exceptions has measure zero [1].

In common parlance, you might use ball and sphere interchangeably, but in math they’re different. In a normed vector space, the set of all points of norm no more than r is the ball of radius r. The set of all points with norm exactly r is the sphere of radius r. A sphere is the surface of a ball.

The word smooth typically means “infinitely differentiable,” or depending on context, differentiable as many times as you need. Often there’s no practical loss of generality in assuming something is infinitely differentiable when all you need is, say, three derivatives [2]. For example, a manifold whose charts are once differentiable can always be altered slightly to be infinitely differentiable.

The words regular and normal are used throughout mathematics as technical terms, and their meaning changes completely depending on context. For example, in topology regular and normal are two kinds of separation axioms. They tell you whether a topology has enough open sets to separate a point from a closed set or separate two closed sets from each other.

When I use normal I’m most often talking about a normal (i.e. Gaussian) probability distribution. I don’t think I use regular as a technical term that often, but when I do it probably means something like smooth, but more precise. A regularity result in differential equations, for example, tells you what sort of desirable properties a solution has: whether it’s a classical solution or only a weak solution, whether it’s continuous or differentiable, etc.

While I’m giving a sort of reader’s guide to my terminology, log always refers to natural log and trig functions are always in radians unless noted otherwise. More on that here.

* * *

The footnotes below are much more technical than the text above.

[1] Here’s a proof that any countable set of points has measure zero. Pick any ε > 0. Put an open interval of width ε/2 around the first point, an interval of width ε/4 around the second point, an interval of width ε/8 around the third point etc. This covers the countable set of points with a cover of measure ε, and since ε is arbitrary, the set of points must have measure 0.
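Written out, the total length of the intervals in the cover is a geometric series:

\sum_{k=1}^{\infty} \frac{\varepsilon}{2^k} = \varepsilon

The intervals may overlap, so the measure of their union is at most ε, which is all the argument needs.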

The irrational numbers are uncountable, but that’s not why they have positive measure. A countable set has measure zero, but a set of measure zero may be uncountable. For example, the Cantor set is uncountable but has measure zero. Or to be more precise, I should say the standard Cantor set has measure zero. There are other Cantor sets, i.e. sets homeomorphic to the standard Cantor set, that have positive measure. This shows that “measure zero” is not a topological property.

[2] I said above that often it doesn’t matter how many times you can differentiate a function, but partial differential equations are an exception to that rule. There you’ll often care exactly how many (generalized) derivatives a solution has. And you’ll obsess over exactly which powers of the function or its derivatives are integrable. The reason is that a large part of the theory revolves around embedding theorems, whether this function space embeds in that function space. The number of derivatives a function has and the precise exponents p for the Lebesgue spaces they live in matter a great deal. Existence and uniqueness of solutions can hang on such fine details.

No critical point between two peaks

If a function of one variable has two local maxima, it must have a local minimum in between.

What about a function of two variables? If it has two local maxima, does it need to have a local minimum? No, it could have a saddle point in between, a point that is a local minimum in one direction but a local maximum in another direction. But even more surprising, it need not even have a saddle point. A function of two variables could have two local maxima and no other critical points! Here’s an example:

f(x, y) = – (x² – 1)² – (x²y – x – 1)²

It’s clear that the function is zero at (-1, 0) and (1, 2), and that the function is negative otherwise. So it clearly has two local maxima. You can write out the partial derivatives with respect to x and y and see that the only place they’re both zero is at the two local maxima.

Here’s a plot of the function:

Plot3D[f[x, y], {x, -1.5, 1.5}, {y, -0.5, 2.5}]

And here’s a contour plot:

ContourPlot[f[x, y], {x, -1.5, 1.5}, {y, -0.5, 2.5}, Contours -> 50]

The two maxima are in the bright patches in the lower left and upper right.

You might be thinking that if you walk between two peaks, you’ve got to go down in between. And that’s true. If you walk in a straight line between (-1, 0) and (1, 2), you’ll run into a local minimum around (0.2316, 1.2316). But that’s only a local minimum along your path. It’s not a local minimum or saddle point of the function in a neighborhood of that point.
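The claims above can be checked numerically. The sketch below is only an illustration: the partial derivatives in grad were worked out by hand for this post's f, and the cubic comes from differentiating f along the straight path (t, t + 1) between the peaks.

```python
import numpy as np

def f(x, y):
    return -(x**2 - 1)**2 - (x**2 * y - x - 1)**2

def grad(x, y):
    # Partial derivatives of f, worked out by hand
    g = x**2 * y - x - 1
    fx = -4 * x * (x**2 - 1) - 2 * g * (2 * x * y - 1)
    fy = -2 * g * x**2
    return np.array([fx, fy])

# Both peaks are critical points where f takes its maximum value 0.
for peak in [(-1.0, 0.0), (1.0, 2.0)]:
    print(peak, f(*peak), grad(*peak))

# Along the path (t, t + 1), f reduces to -(t^2 - 1)^2 (1 + (t + 1)^2).
# Its derivative vanishes at t = -1, at t = 1, and at the one real root
# of 3t^3 + 5t^2 + 3t - 1.
roots = np.roots([3, 5, 3, -1])
t = roots[np.abs(roots.imag) < 1e-9].real[0]
print((t, t + 1), grad(t, t + 1))
```

The gradient at the last point is not zero. It is perpendicular to the path, which is why the point is a minimum along the path but not a critical point of f.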

I found this example in the book Single Digits.

Cellular automata with random initial conditions

The previous post looked at a particular cellular automaton, the so-called Rule 90. When started with a single pixel turned on, it draws a Sierpinski triangle. With random starting pixels, it draws a semi-random pattern that retains features like the Sierpinski triangle.

There are only 256 possible elementary cellular automata, so it’s practical to plot them all. I won’t list all the images here—you can find them all here—but I will give a few examples to show the variety of patterns they produce. As in the previous post, we imagine our grid rolled up into a cylinder, i.e. we’ll wrap around if necessary to find pixels diagonally up to the left and right.

rule 8 with random initial conditions
rule 18 with random initial conditions
rule 29 with random initial conditions
rule 30 with random initial conditions
rule 108 with random initial conditions
rule 129 with random initial conditions

As we discussed in the previous post, the number of a rule comes from what value it assigns to each of eight possible cellular states, turned into a binary number. So it’s plausible that binary numbers with more 1’s correspond to more black pixels. This is roughly true, though the graph below shows that the situation is more complex than that.

automata pixel density as a function of 1 bits in rule
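A rough sketch of how such a density comparison could be computed. The grid size, number of steps, seed, and the choice to leave the initial row out of the count are all assumptions made here, not details taken from the graph above, and the function names are made up for illustration.

```python
import numpy as np

def step(row, rule):
    # Three-bit state from the upper-left, upper, and upper-right cells,
    # wrapping around the edges; bit number `state` of the rule is the new cell.
    state = 4 * np.roll(row, 1) + 2 * row + np.roll(row, -1)
    return (rule >> state) & 1

def density(rule, width=200, steps=200, seed=0):
    # Fraction of cells turned on, starting from a random row
    rng = np.random.default_rng(seed)
    row = rng.integers(0, 2, size=width)
    total = 0
    for _ in range(steps):
        row = step(row, rule)
        total += row.sum()
    return total / (width * steps)

ones = [bin(rule).count("1") for rule in range(256)]
dens = [density(rule) for rule in range(256)]
print(np.corrcoef(ones, dens)[0, 1])
```

The printed correlation coefficient gives one crude summary of how closely density follows the number of 1 bits.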

Sierpinski triangle strikes again

A couple months ago I wrote about how a simple random process gives rise to the Sierpinski triangle. Draw an equilateral triangle and pick a random point in the plane. Repeatedly pick a triangle vertex at random and move half way from the current position to that vertex. The result converges to a Sierpinski triangle. This post will show another way to arrive at the same pattern using cellular automata.

Imagine an infinite grid of graph paper and fill in a few squares in one row. The squares in the subsequent rows are filled in or not depending on the state of the square’s upstairs neighbors. For elementary cellular automata, the state of a square depends on the square directly above and the two squares diagonally above on each side. In matrix notation, the state of the (i, j) element depends on elements (i-1, j-1), (i-1, j), and (i-1, j+1).

There are 256 possible elementary cellular automata, and here’s how you can number them. The states of three consecutive cells determine a three-bit binary number. An automaton is determined by what bit it assigns to each of the eight possible three-bit states, so an automaton corresponds to an 8-bit number. In this post we’re interested in Rule 90. In binary, 90 is written 01011010, and the table below spells out the rule in detail.

000 -> 0
001 -> 1
010 -> 0
011 -> 1
100 -> 1
101 -> 0
110 -> 1
111 -> 0

If we start with a single square filled in (i.e. set to 1) then this is the graph we get, i.e. the Sierpinski triangle:

Rule 90 with one initial bit set

This pattern depends critically on our initial conditions. Roughly speaking, it seems that if you start with regular initial conditions you’ll get regular results. If you start with random initial conditions, you’ll get random-looking results as shown below.


Rule 90 with random initial conditions

We see the same empty triangles as before, but they’re much smaller and appear scattered throughout.

In order to create a rectangular image, I wrapped the edges: the upper left neighbor of a point on the left edge is the right-most square on the row above, and similarly for the right edge. You could think of this as wrapping our graph paper into a cylinder.
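Here is one way the rule lookup and the wrap-around could be coded. This is a sketch, not the code that produced the images; step and run are names invented for the example.

```python
import numpy as np

def step(row, rule):
    # Neighbors diagonally up-left and up-right, wrapping around the edges
    left = np.roll(row, 1)
    right = np.roll(row, -1)
    # Three-bit state, e.g. left=1, center=0, right=1 gives 0b101 = 5
    state = 4 * left + 2 * row + right
    # Bit number `state` of the rule is the new value of the cell
    return (rule >> state) & 1

def run(first_row, rule, steps):
    rows = [np.asarray(first_row, dtype=int)]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return np.array(rows)

width = 63
start = np.zeros(width, dtype=int)
start[width // 2] = 1          # a single square filled in
grid = run(start, 90, 31)
for r in grid[:8]:
    print("".join("#" if c else "." for c in r))
```

Reading the table for Rule 90, the new cell is the exclusive or of its two diagonal neighbors, so step(row, 90) should agree with np.roll(row, 1) ^ np.roll(row, -1).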


Highly cited theorems

Some theorems are cited far more often than others. These are not the most striking theorems, not the most advanced or most elegant, but ones that are extraordinarily useful.

I first noticed this when taking complex analysis where the Cauchy integral formula comes up over and over. When I first saw the formula I thought it was surprising, but certainly didn’t think “I bet we’re going to use this all the time.” The Cauchy integral formula was discovered after many of the results that textbooks now prove using it. Mathematicians realized over time that they could organize a class in complex variables more efficiently by proving the Cauchy integral formula as early as possible, then use it to prove much of the rest of the syllabus.

In functional analysis, it’s the Hahn-Banach theorem. This initially unimpressive theorem turns out to be the workhorse of functional analysis. Reading through a book on functional analysis you’ll see “By the Hahn-Banach theorem …” so often that you start to think “Really, that again? What does it have to do here?”

In category theory, it’s the Yoneda lemma. The most common four-word phrase in category theory must be “by the Yoneda lemma.” Not only is it the most cited theorem in category theory, it may be the only highly cited theorem in category theory.

The most cited theorem in machine learning is probably Bayes’ theorem, but I’m not sure Bayes’ theorem looms as large in ML as the previous theorems do in their fields.

Every area of math has theorems that come up more often than others, such as the central limit theorem in probability and the dominated convergence theorem in real analysis, but I can’t think of any theorems that come up as frequently as Hahn-Banach and Yoneda do in their areas.

As with people, there are theorems that attract attention and theorems that get the job done. These categories may overlap, but often they don’t.


Wolfram Alpha, Finnegans Wake, and Quaternions

James Joyce

I stumbled on a Twitter account yesterday called Wolfram|Alpha Can’t. It posts bizarre queries that Wolfram Alpha can’t answer. Here’s one that caught my eye.

Suppose you did extract all the i’s, j’s, and k’s from James Joyce’s novel Finnegans Wake. How would you answer the question above?

You could initialize an accumulator to 1 and then march through the list, updating the accumulator by multiplying it by the next element. But is there a more efficient way?

Quaternion multiplication is not commutative, i.e. the order in which you multiply things matters. So it would not be enough to have a count of how many times each letter appears. Is there any sort of useful summary of the data short of carrying out the whole multiplication? In other words, could you scan the list while doing something other than quaternion multiplication, something faster to compute? Something analogous to sufficient statistics.

We’re carrying out multiplications in the quaternion group Q, a group with eight elements: ±1, ±i, ±j, ±k. But the input to the question about Finnegans Wake only involves three of these elements. Could that be exploited for some slight efficiency?

How would you best implement quaternion multiplication? Of course the answer depends on your environment and what you mean by “best.”

Note that we don’t actually need to implement quaternion multiplication in general, though that would be sufficient. All we really need is multiplication in the group Q.

You could implement multiplication by a table lookup. You could use an 8 × 3 table; the left side of our multiplication could be anything in Q, but the right side can only be i, j, or k. You could represent quaternions as a list of four numbers—coefficients of 1, i, j, and k—and write rules for multiplying these. You could also represent quaternions as real 4 × 4 matrices or as complex 2 × 2 matrices.

If you have an interesting solution, please share it in a comment below. It could be interesting by any one of several criteria: fast, short, cryptic, amusing, etc.

Update: See follow up post, Random walk on quaternions


The cross polytope

There are five regular solids in three dimensions:

  • tetrahedron
  • octahedron (pictured above)
  • hexahedron (cube)
  • dodecahedron
  • icosahedron.

I give a proof here that these are the only five.

The first three of these regular solids generalize to all dimensions, and these generalizations are the only regular solids in dimensions 5 and higher. (There are six regular solids in dimension 4.)

I’ve mentioned generalizations of the cube, the hypercube, lately. I suppose you could call the generalization of an octahedron a “hyperoctahedron” by analogy with the hypercube, though I’ve never heard anybody use that term. Instead, the most common name is cross polytope.

This post will focus on the cross polytope. In particular, we’re going to look at the relative volume of a ball inside a cross polytope.

The cross polytope in n dimensions is the convex hull of all n-dimensional vectors that are ±1 in one coordinate and 0 in all the rest. It is the “plus or minus” part that gives the cross polytope its name, i.e. the vertices are in pairs across the origin.

In analysis, the cross polytope is the unit ball in ℓ1 (“little ell one”), the set of points (x1, x2, …, xn) such that

|x1| + |x2| + … + |xn| ≤ 1.

The ℓ1 norm, and hence the ℓ1 ball, comes up frequently in compressed sensing and in sparse regression.

In recent blog posts we’ve looked at how the relative volume of a ball inscribed in a hypercube drops quickly as dimension increases. What about the cross polytope? The relative volume of a ball inscribed in a cross polytope decreases rapidly with dimension as well. But does it decrease faster or slower than the relative volume of a ball inscribed in a hypercube? To answer this, we need to compute

\left.\frac{\mbox{vol ball in cross poly}}{\mbox{vol cross poly}}\middle/\frac{\mbox{vol ball in hypercube}}{\mbox{vol hypercube}}\right.

Let’s gather what we need to evaluate this. We need the volume of a ball of radius r in n dimensions, and as mentioned before this is

V = \frac{\pi^{\frac{n}{2}} r^n}{\Gamma\left(\frac{n}{2} + 1\right)}

A ball sitting inside an n-dimensional unit cross polytope will have radius 1/√n. This is because if n positive numbers sum to 1, the sum of their squares is minimized by making them all equal, and the point (1/n, 1/n, …, 1/n) has norm 1/√n. A ball inside a unit hypercube will have radius 1/2.

The cross polytope has volume 2^n / n! and the hypercube has volume 1.

Putting this all together, the relative volume of a ball in a cross polytope divided by the relative volume of a ball inside a hypercube is

\left. \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{\sqrt{n}}\right)^n } { \frac{2^n}{n!} } \middle/ \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{2}\right)^n } { 1 } \right.

which fortunately reduces to just

\frac{n!}{n^{n/2}}

But how do we compare n! and n^(n/2)? That’s a job for Stirling’s approximation. It tells us that for large n, the ratio is approximately

\sqrt{2\pi n}\, n^{n/2}e^{-n}

and so the ratio diverges for large n, i.e. the ball in the cross polytope takes up increasingly more relative volume.

Looking back at just the relative volume of the ball inside the cross polytope, and applying Stirling’s approximation again, we see that the relative volume of the ball inside the cross polytope is approximately

\sqrt{2}\left( \frac{\pi}{2e} \right )^{n/2}

and so the relative volume decreases geometrically as n increases, decreasing much slower than the relative volume of a ball in a hypercube.
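The following sketch checks these formulas numerically. It works with logarithms of the volumes via math.lgamma to avoid overflow; the function names are invented for the example.

```python
from math import lgamma, log, pi, exp, sqrt, e

def log_ball_volume(n, r):
    # log of pi^(n/2) r^n / Gamma(n/2 + 1)
    return 0.5 * n * log(pi) + n * log(r) - lgamma(n / 2 + 1)

def rel_vol_cross(n):
    # Ball of radius 1/sqrt(n) inside a cross polytope of volume 2^n / n!
    log_cross = n * log(2) - lgamma(n + 1)
    return exp(log_ball_volume(n, 1 / sqrt(n)) - log_cross)

def rel_vol_cube(n):
    # Ball of radius 1/2 inside a unit hypercube
    return exp(log_ball_volume(n, 0.5))

def stirling_rel_vol_cross(n):
    # Large-n approximation sqrt(2) (pi / 2e)^(n/2)
    return sqrt(2) * (pi / (2 * e)) ** (n / 2)

for n in [2, 5, 10, 20, 50]:
    ratio = rel_vol_cross(n) / rel_vol_cube(n)   # should equal n! / n^(n/2)
    print(n, ratio, rel_vol_cross(n), stirling_rel_vol_cross(n))
```

In two dimensions both shapes are squares, so the ratio in the second column should be 1 for n = 2.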

Sphere packing

The previous couple blog posts touched on a special case of sphere packing.

We looked at the proportion of volume contained near the corners of a hypercube. If you take the set of points within a distance 1/2 of a corner of a hypercube, you could rearrange these points to form a full ball centered at one corner of the hypercube. Saying that not much volume is located near the corners is equivalent to saying that the sphere packing that centers spheres at points with integer coordinates is not very dense.

We also looked at centering balls inside hypercubes. This is the same sphere packing as above, just shifting every coordinate by 1/2. So saying that a ball in a box doesn’t take up much volume in high dimensions is another way of saying that the integer lattice sphere packing is not very dense.

How much better can we pack spheres? In 24 dimensions, balls centered inside hypercubes would have density equal to the volume of a ball of radius 1/2, or (π/4)^12 / 12!. The most dense packing in 24 dimensions, the Leech lattice sphere packing, has a density of π^12 / 12!, i.e. it is 4^12 = 2^24 = 16,777,216 times more efficient.

The densest sphere packings have only been proven in dimensions 1, 2, 3, 8, and 24. (The densest regular (lattice) packings are known for dimensions up to 8, but it is conceivable that there exist irregular packings that are more efficient than the most efficient lattice packing.) Dimension 24 is special in numerous ways, and it appears that 24 is a local maximum as far as optimal sphere packing density. How does sphere packing based on an integer lattice compare to the best packing in other high dimensions?

Although optimal packings are not known in high dimensions, upper and lower bounds on packing density are known. If Δ is the optimal sphere packing density in dimension n, then we have the following upper and lower bounds for large n:

-1 \leq \frac{1}{n} \log_2 \Delta \leq -0.599

The following plot shows how the integer lattice packing density (solid line) compares to the upper and lower bounds (dashed lines).
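The solid line can be reproduced along these lines. This is a sketch with an invented function name; it uses the fact that the integer lattice packing density is the volume of a ball of radius 1/2, and compares (1/n) log2 of that density with the two bounds quoted above.

```python
from math import lgamma, log, pi

def log2_integer_lattice_density(n):
    # log2 of the volume of a ball of radius 1/2 in n dimensions
    ln_vol = 0.5 * n * log(pi) - n * log(2) - lgamma(n / 2 + 1)
    return ln_vol / log(2)

for n in [8, 24, 100, 1000]:
    rate = log2_integer_lattice_density(n) / n
    # The bounds apply to the optimal packing for large n,
    # so the integer lattice is free to fall below the lower one.
    print(n, rate, -1 <= rate <= -0.599)
```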

The upper and lower bounds come from Sphere Packings, Lattices, and Groups, published in 1998. Perhaps tighter bounds have been found since then.

Is most volume in the corners or not?

I’ve written a couple blog posts that may seem to contradict each other. Given a high-dimensional cube, is most of the volume in the corners or not?

I recently wrote that the corners of a cube stick out more in high dimensions. You can quantify this by centering a ball at a corner and looking at how much of the ball comes from the cube and how much from surrounding space. That post showed that the proportion of volume near a corner goes down rapidly as dimension increases.

About a year ago I wrote a blog post about how formal methods let you explore corner cases. Along the way I said that most cases are corner cases, i.e. most of the volume is in the corners.

Both posts are correct, but they use the term “corner” differently. That is because there are two ideas of “corner” that are the same in low dimensions but diverge in higher dimensions.

Draw a circle and then draw a square just big enough to contain it. You could say that the area in the middle is the area inside the circle and the corners are everything else. Or you could say that the corners are the regions near a vertex of the square, and the middle is everything else. These two criteria aren’t that different. But in high dimensions they’re vastly different.

The post about pointy corners looked at the proportion of volume near the vertices of the cube. The post about formal methods looked at the proportion of volume not contained in a ball in the middle of the cube. As dimension increases, the former goes to zero and the latter goes to one.

In other words, in high dimensions most of the volume is neither near a vertex nor in a ball in the middle. This gives a hint at why sphere packing is interesting in high dimensions. The next post looks at how the sphere packings implicit in this post compare to the best possible packings.

Nearly all the area in a high-dimensional sphere is near the equator

Nearly all the area of a high-dimensional sphere is near the equator.  And by symmetry, it doesn’t matter which equator you take. Draw any great circle and nearly all of the area will be near that circle.  This is the canonical example of “concentration of measure.”

What exactly do we mean by “nearly all the area” and “near the equator”? You get to decide. Pick your standard of “nearly all the area,” say 99%, and your definition of “near the equator,” say within 5 degrees. Then it’s always possible to take the dimension high enough that your standards are met. The more demanding your standard, the higher the dimension will need to be, but it’s always possible to pick the dimension high enough.

This result is hard to imagine. Maybe a simulation will help make it more believable.

In the simulation below, we take as our “north pole” the point (1, 0, 0, 0, …, 0). We could pick any unit vector, but this choice is convenient. Our equator is the set of points orthogonal to the pole, i.e. that have first coordinate equal to zero. We draw points randomly from the sphere, compute their latitude (i.e. angle from the equator), and make a histogram of the results.

The area of our planet isn’t particularly concentrated near the equator.

But as we increase the dimension, we see more and more of the simulation points are near the equator.

Here’s the code that produced the graphs.

from scipy.stats import norm
from math import sqrt, pi, acos, degrees
import matplotlib.pyplot as plt

def pt_on_sphere(n):
    # Return random point on unit sphere in R^n.
    # Generate n standard normals and normalize length.
    x = norm.rvs(0, 1, n)
    length = sqrt(sum(x**2))
    return x/length

def latitude(x):
    # Latitude relative to plane with first coordinate zero.
    angle_to_pole = acos(x[0]) # in radians
    latitude_from_equator = 0.5*pi - angle_to_pole
    return degrees( latitude_from_equator )

N = 1000 # number of samples

for n in [3, 30, 300, 3000]: # dimension of R^n
    latitudes = [latitude(pt_on_sphere(n)) for _ in range(N)]
    plt.figure() # new figure so the histograms don't overlap
    plt.hist(latitudes, bins=int(sqrt(N)))
    plt.xlabel("Latitude in degrees from equator")
    plt.title("Sphere in dimension {}".format(n))
    plt.xlim((-90, 90))
    plt.show()

Not only is most of the area near the equator, the amount of area outside a band around the equator decreases very rapidly as you move away from the band. You can see that from the histograms above. They look like a normal (Gaussian) distribution, and in fact we can make that more precise.

If A is a band around the equator containing at least half the area, then the proportion of the area a distance r or greater from A is bounded by exp( -(n-1)r² ). And in fact, this holds for any set A containing at least half the area; it doesn’t have to be a band around the equator, just any set of large measure.
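To put numbers on the “pick your standard” statement, the sketch below estimates the fraction of the sphere within 5 degrees of the equator in several dimensions. It is a vectorized variation on the simulation above, using NumPy rather than the original code, and the function name, sample size, and seed are arbitrary choices.

```python
import numpy as np

def fraction_near_equator(n, max_degrees=5.0, samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Random points on the unit sphere in R^n: normalize Gaussian vectors
    x = rng.standard_normal((samples, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    # Latitude is the angle from the plane with first coordinate zero
    latitude = np.degrees(np.arcsin(x[:, 0]))
    return np.mean(np.abs(latitude) <= max_degrees)

for n in [3, 30, 300, 3000]:
    print(n, fraction_near_equator(n))
```

The estimates should climb toward 1 as the dimension increases.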

Related post: Willie Sutton and the multivariate normal distribution

The chaos game and the Sierpinski triangle

The chaos game is played as follows. Pick a starting point at random. Then at each subsequent step, pick a triangle vertex at random and move half way from the current position to that vertex.

The result looks like a fractal called the Sierpinski triangle or Sierpinski gasket.

Here’s an example:

Unbiased chaos game results

If the random number generation is biased, the resulting triangle will show it. In the image below, the lower left corner was chosen with probability 1/2, the top with probability 1/3, and the right corner with probability 1/6.

Biased chaos game results

Update: Here’s an animated version that lets you watch the process in action.

animated gif

Here’s Python code to play the chaos game yourself.

from numpy import sqrt, zeros
import matplotlib.pyplot as plt
from random import random, randint

def midpoint(p, q):
    return (0.5*(p[0] + q[0]), 0.5*(p[1] + q[1]))

# Three corners of an equilateral triangle
corner = [(0, 0), (0.5, sqrt(3)/2), (1, 0)]

N = 1000
x = zeros(N)
y = zeros(N)

x[0] = random()
y[0] = random()
for i in range(1, N):
    k = randint(0, 2) # random triangle vertex
    x[i], y[i] = midpoint( corner[k], (x[i-1], y[i-1]) )
plt.scatter(x, y)
plt.show()


Update 2: Peter Norvig posted some Python code with variations on the game presented here, generalizing a triangle to other shapes. If you try the analogous procedure with a square, you simply get a square filled with random dots.

However, you can get what you might expect, the square analog of the Sierpinski triangle, the product of a Cantor set with itself, if you make a couple modifications. First, pick a side at random, not a corner. Second, move 1/3 of the way toward the chosen side, not 1/2 way.

Here’s what I got with these changes:

Chaos game for a square

Source: Chaos and Fractals

Irrational rotations are ergodic

In a blog post yesterday, I mentioned that the golden angle is an irrational portion of a circle, and so a sequence of rotations by the golden angle will not repeat itself. We can say more: rotations by an irrational portion of a circle are ergodic. Roughly speaking, this means that not only does the sequence not repeat itself, the sequence “mixes well” in a technical sense.

Ergodic functions have the property that “the time average equals the space average.” We’ll unpack what that means and illustrate it by simulation.

Suppose we pick a starting point x on the circle then repeatedly rotate it by a golden angle. Take an integrable function f on the circle and form the average of its values at the sequence of rotations. This is the time average. The space average is the integral of f over the circle, divided by the circumference of the circle. The ergodic theorem says that the time average equals the space average, except possibly for a set of starting values of measure zero.

More generally, let X be a measure space (like the unit circle) with measure μ, and let T be an ergodic transformation (like rotating by a golden angle). Then for almost all starting values x we have the following:

\lim_{n\to \infty} \frac{1}{n} \sum_{k=0}^{n-1} f(T^k x) = \frac{1}{\mu(X)} \int_X f\, d\mu

Let’s do a simulation to see this in practice by running the following Python script.

        from numpy import pi, cos
        from scipy.constants import golden
        from scipy.integrate import quad
        golden_angle = 2*pi*golden**-2
        def T(x):
            return (x + golden_angle) % (2*pi)
        def time_average(x, f, T, n):
            s = 0
            for k in range(n):
                s += f(x)
                x = T(x)
            return s/n
        def space_average(f):
            integral = quad(f, 0, 2*pi)[0]
            return integral / (2*pi)
        f = lambda x: cos(x)**2
        N = 1000000
        print( time_average(0, f, T, N) )
        print( space_average(f) )

In this case we get 0.49999996 for the time average, and 0.5 for the space average. They’re not the same, but we only used a finite value of n; we didn’t take a limit. We should expect the two values to be close because n is large, but we shouldn’t expect them to be equal.

Update: The code and results were updated to fix a bug pointed out in the comments below.  I had written ... % 2*pi when I should have written ... % (2*pi). I assumed the modulo operator was lower precedence than multiplication, but it’s not. It was a coincidence that the buggy code was fairly accurate.

A friend of mine, a programmer with decades of experience, recently made a similar error. He’s a Clojure fan but was writing in C or some similar language. He rightfully pointed out that this kind of error simply cannot happen in Clojure. Lisps, including Clojure, don’t have operator precedence because they don’t have operators. They only have functions, and the order in which functions are called is made explicit with parentheses. The Python code x % 2*pi corresponds to (* (mod x 2) pi) in Clojure, and the Python code x % (2*pi) corresponds to (mod x (* 2 pi)).
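A tiny illustration of the precedence issue in Python, with an arbitrary value of x chosen for the example: % and * have the same precedence and group left to right, so the two expressions below are different.

```python
from math import pi

x = 7.0
buggy = x % 2*pi       # parsed as (x % 2)*pi
fixed = x % (2*pi)     # remainder after dividing by 2*pi
print(buggy, fixed)
```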

Related: Origin of the word “ergodic”