Integrating polynomials over a sphere or ball

Spheres and balls are examples of common words that take on a technical meaning in math, as I wrote about here. Recall that the unit sphere in n dimensions is the set of points with distance 1 from the origin. The unit ball is the set of points of distance less than or equal to 1 from the origin. The sphere is the surface of the ball.

Integrating a polynomial in several variables over a ball or sphere is easy. For example, take the polynomial xy² + 5x²z² in three variables. The integral of the first term, xy², is zero. If any variable in a term has an odd exponent, then the integral of that term is zero by symmetry. The integral over half of the sphere (ball) will cancel out the integral over the opposite half of the sphere (ball). So we only need to be concerned with terms like 5x²z².

Now in n dimensions, suppose the exponents of x₁, x₂, …, xₙ are a₁, a₂, …, aₙ respectively. If any of the a’s are odd, the integral over the sphere or ball will be zero, so we assume all the a’s are even. In that case the integral over the unit sphere is simply

2 B(b_1, b_2, \ldots, b_n)

where

B(b_1, b_2, \ldots, b_n) = \frac{\Gamma(b_1) \Gamma(b_2) \cdots \Gamma(b_n)}{ \Gamma(b_1 + b_2 + \cdots + b_n)}

is the multivariate beta function and for each i we define bᵢ = (aᵢ + 1)/2. When n = 2, B is the (ordinary) beta function.

Note that the formula for the integral over the unit sphere doesn’t explicitly involve the dimension n; the dimension enters only through the number of exponents.

The integral over the unit ball is

\frac{2 B(b_1, b_2, \ldots, b_n)}{ a_1 + a_2 + \cdots + a_n + n}

which is proportional to the integral over the sphere, where the proportionality constant depends on the sum of the exponents (the original exponents, the a’s, not the b’s) and the dimension n.

Note that if we integrate the constant polynomial 1 over the unit sphere, we get the surface area of the unit sphere, and if we integrate it over the unit ball, we get the volume of the unit ball.
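
Here’s a minimal Python sketch of the formulas above (the function names are mine, not from any library). As a sanity check, integrating 1 over the sphere and ball in three dimensions should give 4π and 4π/3.

    from math import gamma, prod

    def sphere_integral(a):
        # Integral of x1^a1 * x2^a2 * ... * xn^an over the unit sphere in R^n.
        # Zero if any exponent is odd; otherwise 2 B(b1, ..., bn) with bi = (ai + 1)/2.
        if any(ai % 2 for ai in a):
            return 0.0
        b = [(ai + 1)/2 for ai in a]
        return 2*prod(gamma(bi) for bi in b)/gamma(sum(b))

    def ball_integral(a):
        # Integral over the unit ball: the sphere integral divided by a1 + ... + an + n.
        if any(ai % 2 for ai in a):
            return 0.0
        return sphere_integral(a)/(sum(a) + len(a))

    print(sphere_integral([0, 0, 0]))   # 12.566..., i.e. 4 pi
    print(ball_integral([0, 0, 0]))     # 4.188..., i.e. 4 pi / 3
    print(ball_integral([0, 2, 2]))     # integral of y^2 z^2 over the unit ball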

You can find a derivation for the integral results above in [1]. The proof is basically Liouville’s trick for integrating the normal distribution density, but backward. Instead of going from rectangular to polar coordinates, you introduce a normal density and go from polar to rectangular coordinates.

[1] Gerald B. Folland, How to Integrate a Polynomial over a Sphere. The American Mathematical Monthly, Vol. 108, No. 5 (May, 2001), pp. 446-448.

Big derivatives

Suppose you have a function of n variables f. The kth derivative of f is a kth order tensor [1] with nᵏ components.

Not all those tensor components are unique. Assuming our function f is smooth, the order in which partial derivatives are taken doesn’t matter. It only matters which variables you differentiate with respect to and how many times.

There is a potentially unique component of the kth order derivative for every choice of k items (the variables you’re differentiating with respect to) from a set of n items (the variables) with replacement (because you can differentiate with respect to the same variable multiple times). Richard Stanley introduced the notation

\left( {n \choose k} \right)

for this quantity, and it turns out that

\left( {n \choose k} \right) = {n + k - 1 \choose k}

See Counting selections with replacement.

If n or k are large, this is a very large number. Suppose you have a function of n = 1,000 variables and you want to take the 5th derivative. The derivative has 10¹⁵ components, but “only” about 10¹³ distinct components. 8,416,958,750,200 to be exact. A floating point number (IEEE 754 double) takes up 8 bytes, so storing the distinct components would require 62,711 gigabytes. So you could store the distinct components if you had a thousand computers with 64 GB of RAM each.
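
Here’s a quick check of those numbers in Python.

    from math import comb

    n, k = 1000, 5
    total = n**k                     # all components of the 5th derivative
    distinct = comb(n + k - 1, k)    # n multichoose k distinct components
    print(total, distinct)           # 10**15 and 8,416,958,750,200
    print(distinct * 8 / 2**30)      # about 62,711 gigabytes at 8 bytes per component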

An application that depends on high-order derivatives of functions of many variables has to exploit some structure, such as sparsity or symmetry, and not directly compute and store the derivatives.

By the way, if n and k are both large, the approximate number of distinct components given by Stirling’s approximation is

\sqrt{\frac{m+k}{2\pi m k}} \frac{(m+k)^{m+k}}{m^m k^k}

where m = n − 1.
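
Here’s a numerical check of this approximation for the example above, computed with logarithms to avoid overflow. Even though k = 5 is not large, the approximation is within a couple percent.

    from math import comb, exp, log, pi

    n, k = 1000, 5
    m = n - 1
    exact = comb(n + k - 1, k)
    log_approx = 0.5*log((m + k)/(2*pi*m*k)) + (m + k)*log(m + k) - m*log(m) - k*log(k)
    print(exact, exp(log_approx), exp(log_approx)/exact)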

***

[1] The value of the function is a real number, a scalar, a 0th order tensor. The first derivative, the gradient, is a vector, a 1st order tensor. The second derivative, the Hessian, is a matrix, a 2nd order tensor. There aren’t more common names for tensors of order three and higher. You could think of a 3rd order tensor as a three dimensional box of numbers, a stack of matrices.

Average fraction round up

Pick a large number n. Divide n by each of the positive integers up to n and round the results up to the nearest integer. On average, how far do you round up?

Or in terms of probability, what is the expected distance between a fraction n/r, where n is large and fixed and r is chosen randomly between 1 and n, and the nearest larger integer?

In symbols, the question above is asking for the approximate value of

\frac{1}{n} \sum_{r = 1}^n \left( \left\lceil \frac{n}{r} \right\rceil - \frac{n}{r}\right )

for large n, i.e. in the limit as n goes to ∞. Here ⌈x⌉ denotes the ceiling of x, the smallest integer greater than or equal to x.

Let’s plot this as a function of n and see what it looks like. Here’s the Python code.

    import matplotlib.pyplot as plt
    from numpy import ceil, arange

    def f(n):
        # sum over r = 1 to n; the r = n term is zero, so it doesn't change the result
        return sum( [ceil(n/r) - n/r for r in range(1, n+1)] )/n

    x = arange(1, 100)
    y = [f(n) for n in x]
    plt.plot(x, y)
    plt.show()

And here’s the result.

It appears the graph may be converging to some value, and in fact it is. Charles de la Vallée Poussin proved in 1898 that the limiting value is the Euler–Mascheroni constant γ = 0.5772…. This constant is the limiting difference between the nth harmonic number and log n, i.e.

\gamma = \lim_{n\to\infty} \left(\sum_{r = 1}^n \frac{1}{r} - \log n \right)

We can add a horizontal line to our plot to see how well the graph seems to match γ. To do this we need to import the constant euler_gamma from numpy and add the line

    plt.axhline(y=euler_gamma, linestyle=":")

after the plot command. When we do, this is what we see.

It looks like the plot is converging to a value slightly less than γ. Apparently the convergence is very slow. When we go out to 10,000 the plot is closer to being centered around γ but still maybe below γ more than above.

If we evaluate our function at n = 1,000,000, we get 0.577258… while γ = 0.577215….

At n = 10,000,000 we get 0.577218…. So taking 100 times as many terms in our sum gives us one extra correct decimal place, as we’d expect of a random process since convergence usually goes like 1/√n.

Efficiency is not associative for matrix multiplication

Here’s a fact that has been rediscovered many times in many different contexts: The way you parenthesize matrix products can greatly change the time it takes to compute the product. This is important, for example, for the back propagation algorithm in deep learning.

Let A, B, and C be matrices that are compatible for multiplication. Then (AB)C = A(BC). That is, matrix multiplication is associative. But as far as efficiency is concerned, matrix multiplication is not associative: one side of the equation may be much faster to compute than the other.

For any matrix M, let rows(M) be the number of rows in M and let cols(M) be the number of columns. When we multiply two matrices M and N, we multiply every row of M by every column of N [1]. So the number of vector inner products is rows(M) cols(N), and the number of multiplications [2] in each inner product is cols(M). (We must have cols(M) = rows(N) because we implicitly assumed it’s possible to multiply M and N.)

That means the total number of scalar multiplications required to compute MN equals

rows(M) cols(M) cols(N) = rows(M) rows(N) cols(N).

Now let’s apply this to computing ABC. Suppose A is an m by n matrix and C is a p by q matrix. Then B has to be an n by p matrix in order to be compatible for multiplication.

Computing AB requires mnp multiplications. Once we have AB, an m by p matrix, the number of multiplications to compute (AB)C is mpq. So the total number of multiplications to compute ABC by first computing AB is mp(n + q).

If we compute BC first, this requires npq multiplications, and then multiplying A by BC requires mnq operations, for a total of nq(m + p).

In summary, computing (AB)C requires mp(n + q) multiplications, and computing A(BC) requires (m + p)nq multiplications. I turned the second term around to emphasize that in both expressions you do something to m and p, and something else to n and q. You either multiply and add, or add and multiply.

If you’re trying to minimize these terms, you’d rather add big numbers together than multiply them. So if m and p are large relative to n and q, i.e. both A and C are tall matrices, having more rows than columns, then multiplying A(BC) is going to be faster than multiplying (AB)C.

For example, suppose both A and C have a million rows and a hundred columns. Necessarily B would have a hundred rows and a million columns. Computing (AB)C would require 2×10¹⁴ multiplications, but computing A(BC) would take 2×10¹⁰ multiplications. That is, the latter would be 10,000 times faster.
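
Here’s a small Python sketch that checks the operation counts above and times the two orderings, using more modest sizes for the timing so that AB fits comfortably in memory.

    import time
    import numpy as np

    m, n, p, q = 10**6, 100, 10**6, 100
    print(m*p*(n + q), n*q*(m + p))   # multiplications for (AB)C vs A(BC)

    # Timing on smaller matrices with the same shape pattern: A and C tall, B wide
    rng = np.random.default_rng(0)
    m, n, p, q = 2000, 50, 2000, 50
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, p))
    C = rng.standard_normal((p, q))

    t0 = time.perf_counter()
    (A @ B) @ C
    t1 = time.perf_counter()
    A @ (B @ C)
    t2 = time.perf_counter()
    print(t1 - t0, t2 - t1)           # (AB)C takes much longer than A(BC)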

Related: Applied linear algebra

***

[1] This post assumes we compute matrix products the usual way, taking the product of every row in the left matrix with every column in the right matrix. It’s theoretically possible, though not practical, to multiply matrices in fewer operations. If the matrices have some exploitable special structure then there could be a faster way to multiply them, but not in general.

[2] It is common in numerical linear algebra to just count multiplications because this gives a good idea of how efficient an algorithm is. Counting additions and multiplications would usually lead to the same conclusions. If you really want to be precise, you’d need to count memory operations. In practice memory management makes more of a difference than whether we count just multiplications or multiplications and additions.


Higher order Taylor series in several variables

Most sources that present Taylor’s theorem for functions of several variables stop at second order terms. One reason is that one or two terms are good enough for many applications. But the bigger reason is that things get more complicated when you include higher order terms.

The kth order term in Taylor’s theorem is a rank k tensor. You can think of rank 0 tensors as numbers, rank 1 tensors as vectors, and rank 2 tensors as matrices. Then we run out of familiar objects. A rank 3 tensor requires you to start thinking in terms of tensors rather than more elementary terms.

There is a way to express Taylor’s theorem using only vectors and matrices. Maybe not the most elegant approach, depending on one’s taste, but it avoids any handwaving talk of a tensor being a “higher dimensional box of numbers” and such.

There’s a small price to pay. You have to introduce two new but simple ideas: the vec operator and the Kronecker product.

The vec operator takes an m × n matrix A and returns an mn × 1 matrix v, i.e. a column vector, by stacking the columns of A. The first m elements of v are the first column of A, the next m elements of v are the second column of A, etc.

The Kronecker product of an m × n matrix A and a p × q matrix B is an mp × nq matrix K = A ⊗ B. You can think of K as a block partitioned matrix. The (i, j) block of K is aᵢⱼ B. In other words, to form K, take each element of A and replace it with its product with the matrix B.

A couple examples will make this clear.

A = \left[ \begin{array}{cc} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \end{array} \right]

 

\mathrm{vec} \, A = \left[ \begin{array}{c} 1 \\ 3 \\ 5 \\ 7 \\ 2 \\ 4 \\ 6 \\ 8 \end{array} \right]

 

B= \left[ \begin{array}{ccc} 10 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 30 \end{array} \right]

 

A \otimes B= \left[ \begin{array}{cccccc} 10 & 0 & 0 & 20 & 0 & 0 \\ 0 & 20 & 0 & 0 & 40 & 0 \\ 0 & 0 & 30 & 0 & 0 & 60 \\ 30 & 0 & 0 & 40 & 0 & 0 \\ 0 & 60 & 0 & 0 & 80 & 0 \\ 0 & 0 & 90 & 0 & 0 & 120 \\ 50 & 0 & 0 & 60 & 0 & 0 \\ 0 & 100 & 0 & 0 & 120 & 0 \\ 0 & 0 & 150 & 0 & 0 & 180 \\ 70 & 0 & 0 & 80 & 0 & 0 \\ 0 & 140 & 0 & 0 & 160 & 0 \\ 0 & 0 & 210 & 0 & 0 & 240 \\ \end{array} \right]
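
NumPy provides both operations, so here’s a short sketch reproducing the examples above.

    import numpy as np

    A = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    B = np.diag([10, 20, 30])

    vec_A = A.T.reshape(-1, 1)   # stack the columns of A into one column vector
    K = np.kron(A, B)            # Kronecker product, a 12 x 6 matrix

    print(vec_A.ravel())         # [1 3 5 7 2 4 6 8]
    print(K.shape)               # (12, 6)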

 

Now we write down Taylor’s theorem. Let f be a real-valued function of n variables. Then

f(x + h) = f(x) + f^{(1)}(x) h + \sum_{k=2}^\infty \frac{1}{k!} \left[\stackrel{k-1}{\otimes}h^T \right] f^{(k)}(x) h

where f^(0) = f and for k > 0,

f^{(k)}(x) = \left. \frac{ \partial \mathrm{vec}\, f^{(k-1)} }{\partial x^T} \right|_x

The symbol ⊗ with a number on top means to take the Kronecker product of the argument with itself that many times.

Source: Matrix Differential Calculus by Magnus and Neudecker.

Related post: What is a tensor?

Runge phenomena

I’ve mentioned the Runge phenomenon in a couple posts before. Here I’m going to go into a little more detail.

First of all, the “Runge” here is Carl David Tolmé Runge, better known for the Runge-Kutta algorithm for numerically solving differential equations. His name rhymes with cowabunga, not with sponge.

Runge showed that polynomial interpolation at evenly-spaced points can fail spectacularly to converge. His example is the function f(x) = 1/(1 + x²) on the interval [-5, 5], or equivalently, and more convenient here, the function f(x) = 1/(1 + 25x²) on the interval [-1, 1]. Here’s an example with 16 interpolation nodes.

Runge's example

Runge found that in order for interpolation at evenly spaced nodes in [-1, 1] to converge, the function being interpolated needs to be analytic inside a football-shaped [1] region of the complex plane with major axis [-1, 1] on the real axis and minor axis approximately [-0.5255, 0.5255]  on the imaginary axis. For more details, see [2].

The function in Runge’s example has a singularity at 0.2i, which is inside the football. Interpolation at evenly spaced points would converge for the function f(x) = 1/(1 + x²) since its singularities at ±i are outside the football.

Runge's example

For another example, consider the function f(x) = exp(-1/x²), defined to be 0 at 0. This function is infinitely differentiable but it is not analytic at the origin. With only 16 interpolation points as above, there’s a small indication of trouble at the ends.

Interpolating exp(-1/x^2)

With 28 interpolation points in the plot below, the lack of convergence is clear.

Interpolating exp(-1/x^2)

The problem is not polynomial interpolation per se but polynomial interpolation at evenly-spaced nodes. Interpolation at Chebyshev points converges for the examples here. The location of singularities affects the rate of convergence but not whether the interpolants converge.
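
Here’s a short NumPy sketch (not the code behind the plots above) that compares the maximum interpolation error for Runge’s function at evenly spaced and Chebyshev nodes; the helper function is just for this example.

    import numpy as np

    def max_interp_error(f, n, chebyshev=False):
        # Interpolate f at n nodes in [-1, 1] with a polynomial of degree n - 1
        if chebyshev:
            nodes = np.cos((2*np.arange(n) + 1)*np.pi/(2*n))   # Chebyshev points
        else:
            nodes = np.linspace(-1, 1, n)                      # evenly spaced points
        p = np.polynomial.Polynomial.fit(nodes, f(nodes), n - 1)
        x = np.linspace(-1, 1, 2001)
        return np.max(np.abs(f(x) - p(x)))

    runge = lambda x: 1/(1 + 25*x**2)
    for n in [8, 16, 32]:
        print(n, max_interp_error(runge, n), max_interp_error(runge, n, chebyshev=True))

The error at evenly spaced nodes grows as n increases, while the error at Chebyshev nodes shrinks.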

Related: Help with interpolation

***

[1] American football, that is. The region is like an ellipse but pointy at -1 and 1.

[2] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

Yogi Berra meets Pafnuty Chebyshev

I just got an evaluation copy of The Best Writing on Mathematics 2017. My favorite chapter was Inverse Yogiisms by Lloyd N. Trefethen.

Trefethen gives several famous Yogi Berra quotes and concludes that

Yogiisms are statements that, if taken literally, are meaningless or contradictory or nonsensical or tautological—yet nevertheless convey something true.

An inverse yogiism is the opposite,

[a] statement that is literally true, yet conveys something false.

What a great way to frame a chapter! Now that I’ve heard the phrase, I’m trying to think of inverse yogiisms. Nothing particular has come to mind yet, but I feel like there must be lots of things that fit that description. Trefethen comes up with three inverse yogiisms, and my favorite is the middle one: Faber’s theorem on polynomial interpolation.

Faber’s theorem is a non-convergence result for interpolants of continuous functions. Trefethen quotes several numerical analysis textbooks that comment on Faber’s theorem in a way that implies an overly pessimistic interpretation. Faber’s theorem is true for continuous functions in general, but if the function f being interpolated is smooth, or even just Lipschitz continuous, the pessimistic conclusion no longer applies. In particular, Chebyshev interpolation produces a sequence of polynomials converging to f.

A few years ago I wrote a blog post that shows a famous example due to Carl Runge: if you interpolate f(x) = 1/(1 + x²) over [-5, 5] with evenly spaced nodes, the sequence of interpolating polynomials diverges. In other words, adding more interpolation points makes the fit worse.

Here’s the result of fitting a 16th degree polynomial to f  at evenly spaced nodes.

graph of f(x) and p16(x)

The error near the ends is terrible, though the fit does improve in the middle. If instead of using evenly spaced nodes you use the roots of Chebyshev polynomials, the interpolating polynomials do in fact converge, and converge quickly. If the kth derivative of f has bounded variation, then the error in interpolating f at n points is O(n⁻ᵏ).

Common words that have a technical meaning in math

Mathematical writing is the opposite of business writing in at least one respect. Math uses common words as technical terms, whereas business coins technical terms to refer to common ideas.

There are a few math terms I use fairly often and implicitly assume readers understand. Perhaps the most surprising is almost as in “almost everywhere.” My previous post, for example, talks about something being true for “almost all x.”

The term “almost” sounds vague but it actually has a precise technical meaning. A statement is true almost everywhere, or holds for almost all x, if the set of points where it doesn’t hold has measure zero.

For example, almost all real numbers are irrational. There are infinitely many rational numbers, and so there are a lot of exceptions to the statement “all real numbers are irrational,” but the set of exceptions has measure zero [1].

In common parlance, you might use ball and sphere interchangeably, but in math they’re different. In a normed vector space, the set of all points of norm no more than r is the ball of radius r. The set of all points with norm exactly r is the sphere of radius r. A sphere is the surface of a ball.

The word smooth typically means “infinitely differentiable,” or depending on context, differentiable as many times as you need. Often there’s no practical loss of generality in assuming something is infinitely differentiable when you only need it to have, say, three derivatives [2]. For example, a manifold whose charts are once differentiable can always be altered slightly to be infinitely differentiable.

The words regular and normal are used throughout mathematics as technical terms, and their meaning changes completely depending on context. For example, in topology regular and normal are two kinds of separation axioms. They tell you whether a topology has enough open sets to separate a point from a closed set or separate two closed sets from each other.

When I use normal I’m most often talking about a normal (i.e. Gaussian) probability distribution. I don’t think I use regular as a technical term that often, but when I do it probably means something like smooth, but more precise. A regularity result in differential equations, for example, tells you what sort of desirable properties a solution has: whether it’s a classical solution or only a weak solution, whether it’s continuous or differentiable, etc.

While I’m giving a sort of reader’s guide to my terminology, log always refers to natural log and trig functions are always in radians unless noted otherwise. More on that here.

* * *

The footnotes below are much more technical than the text above.

[1] Here’s a proof that any countable set of points has measure zero. Pick any ε > 0. Put an open interval of width ε/2 around the first point, an interval of width ε/4 around the second point, an interval of width ε/8 around the third point, etc. This covers the countable set of points with a cover of total measure at most ε, and since ε was arbitrary, the set of points must have measure 0.

The irrational numbers are uncountable, but that’s not why they have positive measure. A countable set has measure zero, but a set of measure zero may be uncountable. For example, the Cantor set is uncountable but has measure zero. Or to be more precise, I should say the standard Cantor set has measure zero. There are other Cantor sets, i.e. sets homeomorphic to the standard Cantor set, that have positive measure. This shows that “measure zero” is not a topological property.

[2] I said above that often it doesn’t matter how many times you can differentiate a function, but partial differential equations are an exception to that rule. There you’ll often care exactly how many (generalized) derivatives a solution has. And you’ll obsess over exactly which powers of the function or its derivatives are integrable. The reason is that a large part of the theory revolves around embedding theorems, whether this function space embeds in that function space. The number of derivatives a function has and the precise exponents p for the Lebesgue spaces they live in matter a great deal. Existence and uniqueness of solutions can hang on such fine details.

No critical point between two peaks

If a function of one variable has two local maxima, it must have a local minimum in between.

What about a function of two variables? If it has two local maxima, does it need to have a local minimum? No, it could have a saddle point in between, a point that is a local minimum in one direction but a local maximum in another direction. But even more surprising, it need not even have a saddle point. A function of two variables could have two local maxima and no other critical points! Here’s an example:

f(x, y) = −(x² − 1)² − (x²y − x − 1)²

It’s clear that the function is zero at (-1, 0) and (1, 2), and that the function is negative otherwise. So it clearly has two local maxima. You can write out the partial derivatives with respect to x and y and see that the only place they’re both zero is at the two local maxima.
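
Here’s a quick check with SymPy: solving for where the gradient vanishes should return only the two peaks.

    from sympy import symbols, diff, solve

    x, y = symbols('x y')
    f = -(x**2 - 1)**2 - (x**2*y - x - 1)**2

    fx, fy = diff(f, x), diff(f, y)
    print(solve([fx, fy], [x, y]))   # expect [(-1, 0), (1, 2)]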

Here’s a plot of the function:

f[x_, y_] := -(x^2 - 1)^2 - (x^2 y - x - 1)^2
Plot3D[f[x, y], {x, -1.5, 1.5}, {y, -0.5, 2.5}]

And here’s a contour plot:

ContourPlot[f[x, y], {x, -1.5, 1.5}, {y, -0.5, 2.5}, Contours -> 50]

The two maxima are in the bright patches in the lower left and upper right.

You might be thinking that if you walk between two peaks, you’ve got to go down in between. And that’s true. If you walk in a straight line between (-1, 0) and (1, 2), you’ll run into a local minimum around (0.2316, 1.2316). But that’s only a local minimum along your path. It’s not a local minimum or saddle point of the function in a neighborhood of that point.

I found this example in the book Single Digits.

Cellular automata with random initial conditions

The previous post looked at a particular cellular automaton, the so-called Rule 90. When started with a single pixel turned on, it draws a Sierpinski triangle. With random starting pixels, it draws a semi-random pattern that retains features like the Sierpinski triangle.

There are only 256 possible elementary cellular automata, so it’s practical to plot them all. I won’t list all the images here—you can find them all here—but I will give a few examples to show the variety of patterns they produce. As in the previous post, we imagine our grid rolled up into a cylinder, i.e. we’ll wrap around if necessary to find pixels diagonally up to the left and right.

rule 8 with random initial conditions
rule 18 with random initial conditions
rule 29 with random initial conditions
rule 30 with random initial conditions
rule 108 with random initial conditions
rule 129 with random initial conditions

As we discussed in the previous post, the number of a rule comes from what value it assigns to each of eight possible cellular states, turned into a binary number. So it’s plausible that binary numbers with more 1’s correspond to more black pixels. This is roughly true, though the graph below shows that the situation is more complex than that.

automata pixel density as a function of 1 bits in rule

Sierpinski triangle strikes again

A couple months ago I wrote about how a simple random process gives rise to the Sierpinski triangle. Draw an equilateral triangle and pick a random point in the plane. Repeatedly pick a triangle vertex at random and move halfway from the current position to that vertex. The result converges to a Sierpinski triangle. This post will show another way to arrive at the same pattern using cellular automata.

Imagine an infinite grid of graph paper and fill in a few squares in one row. The squares in the subsequent rows are filled in or not depending on the state of the square’s upstairs neighbors. For elementary cellular automata, the state of a square depends on the square directly above and the two squares diagonally above on each side. In matrix notation, the state of the (i, j) element depends on elements (i-1, j-1), (i-1, j), and (i-1, j+1).

There are 256 possible elementary cellular automata, and here’s how you can number them. The states of three consecutive cells determine a three-bit binary number. An automaton is determined by what bit it assigns to each of the eight possible three-bit states, so an automaton corresponds to an 8-bit number. In this post we’re interested in Rule 90. In binary, 90 is written 01011010, and the table below spells out the rule in detail.

000 -> 0
001 -> 1
010 -> 0
011 -> 1
100 -> 1
101 -> 0
110 -> 1
111 -> 0

If we start with a single square filled in (i.e. set to 1) then this is the graph we get, i.e. the Sierpinski triangle:

Rule 90 with one initial bit set

This pattern depends critically on our initial conditions. Roughly speaking, it seems that if you start with regular initial conditions you’ll get regular results. If you start with random initial conditions, you’ll get random-looking results as shown below.

 

Rule 90 with random initial conditions

We see the same empty triangles as before, but they’re much smaller and appear scattered throughout.

In order to create a rectangular image, I wrapped the edges: the upper left neighbor of a point on the left edge is the right-most square on the row above, and similarly for the right edge. You could think of this as wrapping our graph paper into a cylinder.
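
Here’s a minimal Python sketch of an elementary cellular automaton with this wrap-around. The function name is mine, not from any library. Starting from a single pixel reproduces the Sierpinski triangle above.

    import numpy as np
    import matplotlib.pyplot as plt

    def run_rule(rule, first_row, steps):
        # Elementary cellular automaton with wrap-around (cylindrical) edges
        outputs = [(rule >> i) & 1 for i in range(8)]   # output bit for each 3-bit state
        grid = np.zeros((steps, len(first_row)), dtype=int)
        grid[0] = first_row
        for i in range(1, steps):
            left = np.roll(grid[i-1], 1)      # upper left neighbors, wrapped
            right = np.roll(grid[i-1], -1)    # upper right neighbors, wrapped
            state = 4*left + 2*grid[i-1] + right
            grid[i] = np.take(outputs, state)
        return grid

    n = 256
    start = np.zeros(n, dtype=int)
    start[n // 2] = 1                         # a single square filled in
    plt.imshow(run_rule(90, start, n // 2), cmap="binary")
    plt.show()

For random initial conditions, use start = np.random.default_rng().integers(0, 2, n) instead.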

 

Highly cited theorems

Some theorems are cited far more often than others. These are not the most striking theorems, not the most advanced or most elegant, but ones that are extraordinarily useful.

I first noticed this when taking complex analysis where the Cauchy integral formula comes up over and over. When I first saw the formula I thought it was surprising, but certainly didn’t think “I bet we’re going to use this all the time.” The Cauchy integral formula was discovered after many of the results that textbooks now prove using it. Mathematicians realized over time that they could organize a class in complex variables more efficiently by proving the Cauchy integral formula as early as possible, then use it to prove much of the rest of the syllabus.

In functional analysis, it’s the Hahn-Banach theorem. This initially unimpressive theorem turns out to be the workhorse of functional analysis. Reading through a book on functional analysis you’ll see “By the Hahn-Banach theorem …” so often that you start to think “Really, that again? What does it have to do here?”

In category theory, it’s the Yoneda lemma. The most common four-word phrase in category theory must be “by the Yoneda lemma.” Not only is it the most cited theorem in category theory, it may be the only highly cited theorem in category theory.

The most cited theorem in machine learning is probably Bayes’ theorem, but I’m not sure Bayes’ theorem looms as large in ML as the previous theorems do in their fields.

Every area of math has theorems that come up more often than others, such as the central limit theorem in probability and the dominated convergence theorem in real analysis, but I can’t think of any theorems that come up as frequently as Hahn-Banach and Yoneda do in their areas.

As with people, there are theorems that attract attention and theorems that get the job done. These categories may overlap, but often they don’t.

 

Wolfram Alpha, Finnegans Wake, and Quaternions

James Joyce

I stumbled on a Twitter account yesterday called Wolfram|Alpha Can’t. It posts bizarre queries that Wolfram Alpha can’t answer. Here’s one that caught my eye.

The query asked what you would get if you extracted all the i’s, j’s, and k’s from James Joyce’s novel Finnegans Wake and multiplied them together as quaternions. How would you answer that question?

You could initialize an accumulator to 1 and then march through the list, updating the accumulator by multiplying it by the next element. But is there a more efficient way?

Quaternion multiplication is not commutative, i.e. the order in which you multiply things matters. So it would not be enough to have a count of how many times each letter appears. Is there any sort of useful summary of the data short of carrying out the whole multiplication? In other words, could you scan the list while doing something other than quaternion multiplication, something faster to compute? Something analogous to sufficient statistics.

We’re carrying out multiplications in the group Q of eight unit quaternions: ±1, ±i, ±j, ±k. But the input to the question about Finnegans Wake only involves three of these elements. Could that be exploited for some slight efficiency?

How would you best implement quaternion multiplication? Of course the answer depends on your environment and what you mean by “best.”

Note that we don’t actually need to implement quaternion multiplication in general, though that would be sufficient. All we really need is multiplication in the group Q.

You could implement multiplication by a table lookup. You could use an 8 × 3 table; the left side of our multiplication could be anything in Q, but the right side can only be i, j, or k. You could represent quaternions as a list of four numbers (coefficients of 1, i, j, and k) and write rules for multiplying these. You could also represent quaternions as real 4 × 4 matrices or as complex 2 × 2 matrices.
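
To make the table-lookup idea concrete, here’s a minimal Python sketch, one of many possible representations.

    # Represent elements of Q as (sign, basis) pairs with basis one of '1', 'i', 'j', 'k'.
    # The table gives the product of two basis letters.
    table = {
        ('1', '1'): (1, '1'), ('1', 'i'): (1, 'i'),  ('1', 'j'): (1, 'j'),  ('1', 'k'): (1, 'k'),
        ('i', '1'): (1, 'i'), ('i', 'i'): (-1, '1'), ('i', 'j'): (1, 'k'),  ('i', 'k'): (-1, 'j'),
        ('j', '1'): (1, 'j'), ('j', 'i'): (-1, 'k'), ('j', 'j'): (-1, '1'), ('j', 'k'): (1, 'i'),
        ('k', '1'): (1, 'k'), ('k', 'i'): (1, 'j'),  ('k', 'j'): (-1, 'i'), ('k', 'k'): (-1, '1'),
    }

    def product(letters):
        # Multiply out a string of i's, j's, and k's from left to right
        sign, basis = 1, '1'
        for c in letters:
            s, basis = table[(basis, c)]
            sign *= s
        return sign, basis

    print(product("ij"))   # (1, 'k')
    print(product("ji"))   # (-1, 'k')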

If you have an interesting solution, please share it in a comment below. It could be interesting by any one of several criteria: fast, short, cryptic, amusing, etc.

Update: See follow up post, Random walk on quaternions


The cross polytope

There are five regular solids in three dimensions:

  • tetrahedron
  • octahedron (pictured above)
  • hexahedron (cube)
  • dodecahedron
  • icosahedron.

I give a proof here that these are the only five.

The first three of these regular solids generalize to all dimensions, and these generalizations are the only regular solids in dimensions 5 and higher. (There are six regular solids in dimension 4.)

I’ve mentioned generalizations of the cube, the hypercube, lately. I suppose you could call the generalization of an octahedron a “hyperoctahedron” by analogy with the hypercube, though I’ve never heard anybody use that term. Instead, the most common name is cross polytope.

This post will focus on the cross polytope. In particular, we’re going to look at the relative volume of a ball inside a cross polytope.

The cross polytope in n dimensions is the convex hull of all n-dimensional vectors that are ±1 in one coordinate and 0 in all the rest. It is the “plus or minus” part that gives the cross polytope its name, i.e. the vertices are in pairs across the origin.

In analysis, the cross polytope is the unit ball in ℓ1 (“little ell one”), the set of points (x₁, x₂, …, xₙ) such that

|x₁| + |x₂| + … + |xₙ| ≤ 1.

The ℓ1 norm, and hence the ℓ1 ball, comes up frequently in compressed sensing and in sparse regression.

In recent blog posts we’ve looked at how the relative volume of a ball inscribed in a hypercube drops quickly as dimension increases. What about the cross polytope? The relative volume of a ball inscribed in a cross polytope decreases rapidly with dimension as well. But does it decrease faster or slower than the relative volume of a ball inscribed in a hypercube? To answer this, we need to compute

\left.\frac{\mbox{vol ball in cross poly}}{\mbox{vol cross poly}}\middle/\frac{\mbox{vol ball in hypercube}}{\mbox{vol hypercube}}\right.

Let’s gather what we need to evaluate this. We need the volume of a ball of radius r in n dimensions, and as mentioned before this is

V = \frac{\pi^{\frac{n}{2}} r^n}{\Gamma\left(\frac{n}{2} + 1\right)}

A ball sitting inside an n-dimensional unit cross polytope will have radius 1/√n. This is because if n positive numbers sum to 1, the sum of their squares is minimized by making them all equal, and the point (1/n, 1/n, …, 1/n) has norm 1/√ n. A ball inside a unit hypercube will have radius 1/2.

The cross polytope has volume 2ⁿ/n! and the hypercube has volume 1.

Putting this all together, the relative volume of a ball in a cross polytope divided by the relative volume of a ball inside a hypercube is

\left. \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{\sqrt{n}}\right)^n } { \frac{2^n}{n!} } \middle/ \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{2}\right)^n } { 1 } \right.

which fortunately reduces to just

\frac{n!}{n^{n/2}}

But how do we compare n! and n^(n/2)? That’s a job for Stirling’s approximation. It tells us that for large n, the ratio is approximately

\sqrt{2\pi n}\, n^{n/2}e^{-n}

and so the ratio diverges for large n, i.e. the ball in the cross polytope takes up increasingly more relative volume.

Looking back at just the relative volume of the ball inside the cross polytope, and applying Stirling’s approximation again, we see that the relative volume of the ball inside the cross polytope is approximately

\sqrt{2}\left( \frac{\pi}{2e} \right )^{n/2}

and so the relative volume decreases geometrically as n increases, decreasing much slower than the relative volume of a ball in a hypercube.
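
Here’s a short Python check of the exact ratio n!/n^(n/2) and of the geometric approximation above.

    from math import exp, factorial, gamma, pi, sqrt

    def ball_volume(n, r):
        # Volume of an n-dimensional ball of radius r
        return pi**(n/2) * r**n / gamma(n/2 + 1)

    for n in [2, 5, 10, 20]:
        rel_cross = ball_volume(n, 1/sqrt(n)) / (2**n / factorial(n))  # ball in cross polytope
        rel_cube = ball_volume(n, 1/2)                                 # ball in unit hypercube
        print(n, rel_cross/rel_cube, factorial(n)/n**(n/2))            # these two agree
        print(n, rel_cross, sqrt(2)*(pi/(2*exp(1)))**(n/2))            # exact vs approximation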