Equation to fit an egg

How would you fit an equation to the shape of an egg? This site suggests an equation of the form

\frac{x^2}{a^2} + \frac{y^2}{b^2}(1 + kx) = 1

Note that if k = 0 we get an ellipse. The larger the parameter k is, the more asymmetric the shape is about the y-axis.

Let’s try that out in Mathematica:

        x^2/16 + y^2 (1 + 0.1 x)/4 == 1, 
        {x, -4, 4}, {y, -3, 3}

Here’s another plot with k = 0.05. This one’s a little closer to an ellipse.

Solving for parameters

If you measured an egg, how would you find ab, and k?

Setting y = 0 shows that 2a is the length of the egg. Setting x = 0 shows that 2b is the width of the egg at the midpoint of the length. Note that it’s not the maximum height of the egg because that occurs to the left of the midpoint. (To the left if k is positive. The parameter k could be negative, which flips the egg about the y-axis so that the flatter side is on the right.)

To find k we measure the point x where the maximum height occurs.

We have the equation

y^2 = \left(1 - \frac{x^2}{a^2} \right) \frac{b^2}{1 + kx}

and implicit differentiation shows

y^2 = \left(1 - \frac{x^2}{a^2} \right) \frac{b^2}{1 + kx}

At the maximum height the derivative of y is zero, and so the right side also equals zero. This lets us solve for k.

k = \frac{2x}{a^2 - 3x^2}


As k increases, the egg gets flatter on the left side and more pointed on the right side. We can quantify this by calculating the curvature at both ends.

For a curve given implicitly by F(xy) = 0, the curvature is given by

\kappa = \frac{-F_y^2F_{xx}+2F_xF_yF_{xy}-F_x^2F_{yy}}{(F_x^2+F_y^2)^{3/2}}

The expression above simplifies greatly at the two points we’re interest in, (±a, 0).

\begin{eqnarray*} F_x &=& \frac{2x}{a^2} + \frac{ky^2}{b^2} = \pm \frac{2}{a} \\ F_y &=& \frac{2y}{b^2} (1 + kx) = 0 \\ F_{xx} &=& \frac{2}{a^2} \\ F_{yy} &=& \frac{2(1 + kx)}{b^2} = \frac{2(1 \pm ka)}{b^2} \\ F_{xy} &=& \frac{2yk}{b^2} = 0 \end{eqnarray*}

And so the curvature reduces to

\frac{a(1 \pm ka)}{b^2}

So in our first example above, with a = 4, b = 2, and k = 0.1, we have a curvature of 0.6 on the left and 1.4 on the right. In the second example with k = 0.05, we have a curvature of 0.8 on the left and 1.2 on the right.

Related posts

Comparing range and precision of IEEE and posit

The IEEE standard 754-2008 defines several sizes of floating point numbers—half precision (binary16), single precision (binary32), double precision (binary64), quadruple precision (binary128), etc.—each with its own specification. Posit numbers, on the other hand, can be defined for any number of bits. However, the IEEE specifications share common patterns so that you could consistently define theoretical IEEE numbers that haven’t actually been specified, making them easier to compare to posit numbers.

An early post goes into the specification of posit numbers in detail. To recap briefly, a posit<nes> number has n bits, a maximum of es of which are devoted to the exponent. The bits are divided into a sign bit, regime bits, exponent bits, and fraction bits. The sign bit is of course one bit, but the other components have variable lengths. We’ll come back to posits later for comparison.

IEEE floating point range and precision

We will denote a (possibly hypothetical) IEEE floating point number as ieee<nes> to denote one with n total bits and (exactly) es exponent bits. Such a number has one sign bit and n – es -1 significand bits. Actual specifications exist for ieee<16, 5>, ieee<32, 8>, ieee<64, 11>, and ieee<128, 15>.

The exponent of a posit number is simply represented as an unsigned integer. The exponent of an IEEE floating point number equals the exponent bits interpreted as an unsigned integers minus a bias.

\text{bias} = 2^{es -1} - 1.

So the biases for half, single, double, and quad precision floats are 15, 127, 1023, and 65535 respectively. We could use the formula above to define the bias for a hypothetical format not yet specified, assuming the new format is consistent with existing formats in this regard.

The largest exponent, emax is 2es-1 – 1 (also equal to the bias), and the smallest (most negative) exponent is emin = 2 – 2es-1. This accounts for 2es-1 – 2 possible exponents. The two remaining possibilities consist of all 1’s and all 0’s, and are reserved for special use. They represent, in combination with sign and signifcand bits, special values ±0, ±∞, NaN, and denomalized numbers. (More on denormalized numbers shortly.)

The largest representable finite number has the maximum exponent and a significand of all 1’s. Its value is thus

2^{e_{\text{max}}} (1 - 2^{-s})

where s is the number of significand bits. And so the largest representable finite number is just slightly less than

2^{2^{es -1}} }

We’ll use this as the largest representable value when calculating dynamic range below.

The smallest representable normalized number (normalized meaning the signifcand represents a number greater than or equal to 1) is

2^{e_{\text{min}}} = 2^{2 - 2^{es -1}}

However, it is possible to represent smaller values with denomalized numbers. Ordinarily the significand bits fff… represent a number 1.fff… But when the exponent bit pattern consists of all 0’s, the significand bits are interpreted as 0.fff… This means that the smallest denormalized number has a significand of all o’s except for a 1 at the end. This represents a value of

2^{e_{\text{min}}} \cdot 2^{-s} = 2^{2 - 2^{es-1} - s}

where again s is the number of significand bits.

The dynamic range of an ieee<nes> number is the log base 10 of the ratio of the largest to smallest representable numbers, smallest here including denormalized numbers.

\log_{10} \left(  \frac{2^{2^{es - 1} }}{2^{2 - 2^{es-1} - s}} \right) = \log_{10}2^{2^{es} - 2 + s} = (2^{es} - 2 + s) \log_{10}2

IEEE float and posit dynamic range at comparable precision

Which posit number should we compare with each IEEE number? We can’t simply compare ieee<nes> with posit<nes>. The value n means the same in both cases: the total number of bits. And although es does mean the number of exponent bits in both cases, they are not directly comparable because posits also have regime bits that are a special kind of exponent bits. In general a comparable posit number will have a smaller es value than its IEEE counterpart.

One way to compare IEEE floating point numbers and posit numbers is to chose a posit number format with comparable precision around 1. See the first post on posits their dynamic range and significance near 1.

In the following table, the numeric headings are the number of bits in a number. The “sig” rows contain the number of sigificand bits in the representation of 1, and “DR” stands for dynamic range in decades.

|           | 16 |  32 |   64 |   128 |
| IEEE  es  |  5 |   8 |   11 |    15 |
| posit es  |  1 |   3 |    5 |     8 |
| IEEE  sig | 10 |  23 |   52 |   112 |
| posit sig | 12 |  26 |   56 |   117 |
| IEEE  DR  | 12 |  83 |  632 |  9897 |
| posit DR  | 17 | 144 | 1194 | 19420 |

Note that in each case the posit number has both more precision for numbers near 1 and a wider dynamic range.

It’s common to use a different set of posit es values that have a smaller dynamic range than their IEEE counterparts (except for 16 bits) but have more precision near 1.

|           | 16 |  32 |   64 |   128 |
| IEEE  es  |  5 |   8 |   11 |    15 |
| posit es  |  1 |   2 |    3 |     4 |
| IEEE  sig | 10 |  23 |   52 |   112 |
| posit sig | 12 |  27 |   58 |   122 |
| IEEE  DR  | 12 |  83 |  632 |  9897 |
| posit DR  | 17 |  72 |  299 |  1214 |

Python code

Here’s a little Python code if you’d like to experiment with other number formats.

from math import log10

def IEEE_dynamic_range(total_bits, exponent_bits):

    # number of significand bits
    s = total_bits - exponent_bits - 1
    return (2**exponent_bits + s - 2)*log10(2)

def posit_dynamic_range(total_bits, max_exponent_bits):
    return (2*total_bits - 4) * 2**max_exponent_bits * log10(2)

Anatomy of a posit number

This post will introduce posit numbers, explain the interpretation of their bits, and discuss their dynamic range and precision.

Posit numbers are a new way to represent real numbers for computers, an alternative to the standard IEEE floating point formats. The primary advantage of posits is the ability to get more precision or dynamic range out of a given number of bits. If an application can switch from using 64-bit IEEE floats to using 32-bit posits, for example, it can fit twice as many numbers in memory at a time. That can make a big difference in the performance of applications that process large amounts of data.

Let’s back up and say what a posit number is.

Unums and posits

John Gustafson introduced unums (universal numbers) as a different way to represent real numbers using using a finite number of bits, an alternative to IEEE floating point. See, for example, his 2015 book The End of Error. Posits are a hardware-friendly version of unums.

A conventional floating point number (IEEE 754) has a sign bit, a set of bits to represent the exponent, and a set of bits called the significand (formerly called the mantissa). For details, see Anatomy of a floating point number. For a given size number, the lengths of the various parts are fixed. A 64-bit floating point number, for example, has 1 sign bit, 11 exponent bits, and 52 bits for the significand.

A posit adds an additional category of bits, known as the regime. A posit has four parts

  1. sign bit
  2. regime
  3. exponent
  4. fraction

while an IEEE floating point number has a sign bit, exponent, and significand, the latter corresponding to the fraction part of a posit. Unlike IEEE numbers, the exponent and fraction parts of a posit do not have fixed length. The sign and regime bits have first priority. Next, the remaining bits, if any, go into the exponent. If there are still bits left after the exponent, the rest go into the fraction.

The main reference for this post is [1].

Bit pattern of a posit

To understand posits in more detail, and why they have certain advantages over conventional floating point numbers, we need to unpack their bit representation. A posit number type is specified by two numbers: the total number of bits n, and the maximum number of bits devoted to the exponent, es. (Yes, it’s a little odd to use a two-letter variable name, but that’s conventional in this context.) Together we say we have a posit<nes> number.

Sign bit

As with an IEEE floating point number, the first bit of a posit is the sign bit. If the sign bit is 1, representing a negative number, take the two’s complement of the rest of the bits before unpacking the regime, exponent, and fraction bits.

Regime bits

After the sign bit come the regime bits. The number of regime bits is variable. There could be anywhere from 1 to n-1 regime bits. How do you know when the regime bits stop? When a run of identical bits ends, either because you run out of bits or because you run into an opposite bit.

If the first bit after the sign bit is a 0, then the regime bits continue until you run out of bits or encounter a 1. Similarly, if the first bit after the sign bit is a 1, the regime bits continue until you run out of bits or encounter a 0. The bit that indicates the end of a run is not included in the regime; the regime is a string of all 0’s or all 1’s.

Exponent bits

The sign bit and regime bits get first priority. If there are any bits left, the exponent bits are next in line.  There may be no exponent bits. The maximum number of exponent bits is specified by the number es. If there are at least es bits after the sign bit, regime bits, and the regime terminating bit, the next es bits belong to the exponent. If there are fewer than es bits left, what bits remain belong to the exponent.

Fraction bits

If there are any bits left after the sign bit, regime bits, regime terminating bit, and the exponent bits, they all belong to the fraction.

Interpreting the components of a posit

Next we look at how the components described above represent a real number.

Let b be the sign bit in a posit. The sign s of the number represented by the bit pattern is positive if this bit is 0 and negative otherwise.

s = (-1)^b

Let m be the number of bits in the regime, i.e. the length of the run of identical bits following the sign bit. Then let k = –m if the regime consists of all 0’s, and let km-1 otherwise.

k = \left\{ \begin{array}{ll} -m & \text{ if regime has } m \text{ 0's} \\ m-1 & \text{ if regime has } m \text{ 1's} \end{array} \right.

The useed u of the posit is determined by es, the maximum exponent size.

u = 2^{2^{\text{\small\emph{es}}} }

The exponent e is simply the exponent bits interpreted as an unsigned integer.

The fraction f is 1 + the fraction bits interpreted as following a binary point. For example, if the fraction bits are 10011, then f = 1.10011 in binary.

Putting it all together, the value of the posit number is the product of the contributions from the sign bit, regime bits, exponent bits (if any), and fraction bits (if any).

x = s\, u^k\, 2^e f = (-1)^b \, f\, 2^{e + k2^{\text{\small\emph{es}}} }

Exceptional posits

There are two exceptional posits, both with all zeros after the sign bit. A string of n 0’s represents the number zero, and a 1 followed by n-1 0’s represents ±∞.

There’s only one zero for posit numbers, unlike IEEE floats that have two kinds of zero, one positive and one negative.

There’s also only one infinite posit number. For that reason you could say that posits represent projective real numbers rather than extended real numbers. IEEE floats have two kinds of infinities, positive and negative, as well as several kinds of non-numbers. Posits have only one entity that does not correspond to a real number, and that is ±∞.

Dynamic range and precision

The dynamic range and precision of a posit number depend on the value of es. The larger es is, the larger the contribution of the regime and exponent bits will be, and so the larger range of values one can represent. So increasing es increases dynamic range. Dynamic range, measured in decades, is the log base 10 of the ratio between the largest and smallest representable positive values.

However, increasing es means decreasing the number of bits available to the fraction, and so decreases precision. One of the benefits of posit numbers is this ability to pick es to adjust the trade-off between dynamic range and precision to meet your needs.

The largest representable finite posit is labeled maxpos. This value occurs when k is as large as possible, i.e. when all the bits after the sign bit are 1’s. In this case kn-2. So maxpos equals

u^{n-2} = \left( 2^{2^{\text{\small\emph{es}}} } \right)^{n-2}

The smallest representable positive number, minpos, occurs when k is as negative as possible, i.e. when the largest possible number of bits after the sign bit are 0’s. They can’t all be zeros or else we have the representation for the number 0, so there must be a 1 on the end. In this case m = n-2 and k = 2-n.

\mbox{minpos} = u^{2-n} = \left( 2^{2^{\text{\small\emph{es}}} } \right)^{2-n} = 1/\mbox{maxpos}

The dynamic range is given by the log base 10 of the ratio between maxpos and minpos.

\log_{10}\left( 2^{2^{\text{\small\emph{es}}} } \right)^{2n-4} = (2n-4)2^{es}\log_{10}2

For example, 16-bit posit with es = 1 has a dynamic range of 17 decades, whereas a 16-bit IEEE floating point number has a dynamic range of 12 decades. The former has a fraction of 12 bits for numbers near 1, while the latter has a significand of 10 bits. So a posit<16,1> number has both a greater dynamic range and greater precision (near 1) than its IEEE counterpart.

[Update: See this post for more on the dynamic range and precision of IEEE floats of various sizes and how posits compare.]

Note that the precision of a posit number depends on its size. This is the sense in which posits have tapered precision. Numbers near 1 have more precision, while extremely big numbers and extremely small numbers have less. This is often what you want. Typically the vast majority of numbers in a computation are roughly on the order of 1, while with the largest and smallest numbers, you mostly want them to not overflow or underflow.

Related post: Anatomy of a floating point number


[1] John L. Gustafson and Isaac Yonemoto. Beating Floating Point at its Own Game: Posit Arithmetic. DOI: 10.14529/jsfi170206


Up arrow and down arrow notation

I recently ran into a tweet saying that if ** denotes exponentiation then // should denote logarithm. With this notation, for example, if we say

    3**4 == 81

we would also say

    81 // 3 == 4.

This runs counter to convention since // has come to be a comment marker or a notation for integer division. I assume the original proposal was meant to be funny, or at most semi-serious, but it would have made sense had notation developed that way.

Donald Knuth’s arrow notation for hyperexponentiation works that way. An up arrow denotes repeated exponentiation and a down arrow denotes repeated logarithm.

Up arrow notation

One up arrow denotes repeated multiplication, i.e. exponentiation.

b \uparrow n &=& \underbrace{b \cdot b \cdot b \cdots b}_{n \text{ copies of } b} = b^n

Two up arrows denote repeated exponentiation, i.e. hyperexponentiation.

b \uparrow \uparrow n &=& b \underbrace{\uparrow(b \uparrow \cdots(b\uparrow b))}_{n \text{ copies of } b} = \underbrace{b^{b^{.\,^{.\,^{.\,^b}}}}}_{n \text{ copies of } b}

Three up arrows denotes repeated applications of double arrow, etc.

Here’s how you could calculate

b \underbrace{\uparrow \uparrow \uparrow \cdots \uparrow}_{k \text{ arrows}} n

using Python:

    def hyperexp(b, k, n):

        if n == 0:
            return 1
        if k == 1:
            return b**n
        return hyperexp(b, k-1, hyperexp(b, k, n-1))

This function grows shockingly quickly. The first four values of hyperexp(2, 2, n) starting with n=1 are 2, 4, 16, and 65536. But the fifth value has over 19,000 digits.

Up arrow notation comes up sometimes, for example, in analysis of algorithms.

Down arrow notation

The downarrow notation is less common, and a little more complicated.

The logarithm of a number x base b is the number of times you’d need to multiply b by itself to get x. So if down arrow is the inverse of up arrow, down arrow is logarithm.

b \downarrow n = \log_b n

Now what should two down arrows represent? It should be the inverse of down up arrows. For example,

2 \uparrow \uparrow 4 = 2 \uparrow( 2 \uparrow (2\uparrow 2)) = 2^{2^{2^{2}}} = 65,536

and so we should have

2 \downarrow \downarrow 65536 = 4

That means the double down arrow counts how many times you have to use up arrows to get from 4 to 65536. But notice we’re counting things. We’re assuming there are an integer number of times you can apply up arrows to define the down arrows. In general you can’t do that. What if we change 65536 to 65535 above?

So here’s where there’s a slight complication, an asymmetry between up arrow and down arrow notation. Repeated up arrows count the number of times you have to apply down arrows to get something less than or equal to the base. See MathWorld, for example.

Seems like down arrow notation might be useful in number theory where iterated logs are common. (There’s a joke that asks what sound a number theorist makes when drowning. Answer: log log log ….)

Related posts

Planets and Platonic solids

Johann Kepler discovered in 1596 that the ratios of the orbits of the six planets known in his day were the same as the ratios between nested Platonic solids. Kepler was understandably quite impressed with this discovery and called it the Mysterium Cosmographicum.

Kepler's Mysterium Cosmographicum

I heard of this in a course in the history of astronomy long ago, and have had in the back of my mind that one day I’d look into this in detail. How exactly do you fit these regular solids together? How well do the planetary ratios match the regular solid ratios?

Imagine the orbit of each planet being the equator of a spherical shell centered at the sun. The five regular solids fit snugly into the spaces between the shells. Between Mercury and Venus you can insert an octahedron. Its inradius is the distance of Mercury to the sun, and its circumradius is the distance of Venus to the sun. You can fit the other regular solids in similarly, the icosahedron between Venus and Earth, the dodecahedron between Earth and Mars, the tetrahedron between Mars and Jupiter, and the hexahedron (cube) between Jupiter and Saturn.

Here’s the data on the inradius and circumradius of each regular solid taken from Mathworld.

| solid        | circumradius | inradius |
| octahedron   |      0.70722 |  0.40825 |
| icosahedron  |      0.95106 |  0.75576 |
| dodecahedron |      1.40126 |  1.11352 |
| tetrahedron  |      0.61237 |  0.20412 |
| hexahedron   |      0.86603 |  0.50000 |

Here’s the data on average orbit radii measured in astronomical units taken from WolframAlpha.

| Planet  | Distance |
| Mercury |  0.39528 |
| Venus   |  0.72335 |
| Earth   |  1.00000 |
| Mars    |  1.53031 |
| Jupiter |  5.20946 |
| Saturn  |  9.55105 |

So how well does Kepler’s pattern hold? In the table below, “Planet ratio” is the radius ratios of Venus to Mercury, Earth to Venus, etc. “Solid ratio” is the circumradius to inradius ratio of the regular solids in the order given above.

| Planet ratio | Solid ratio |
|      1.82997 |     1.73205 |
|      1.38246 |     1.25841 |
|      1.53031 |     1.25841 |
|      3.40419 |     3.00000 |
|      1.83340 |     1.73205 |

Not a bad fit, but not great either. I’ve heard that the fit was better given the data available to Kepler at the time; if Kepler had had more accurate data, he might not have come up with his Mysterium Cosmographicum.

By the way, notice some repetition in the solid ratios. The implied equalities are exact. The icosahedron and dodecahedron have the same ratio of circumradius to inradius because they are dual polyhedra. The same is true for the cube and the octahedron. Also, the ratio of 3 for the tetrahedron is exact.

Update: What if Kepler had known about more planets? The next planet ratios in our table above would be 2.01, 1.57, and 1.35. None of these is close to any of the solid ratios.

Related posts

Minimizing relative error

Suppose you know a number is between 30 and 42. You want to guess the number while minimizing how wrong you could be in the worst case. Then you’d guess the midpoint of the two ends, which gives you 36.

But suppose you want to minimize the worst case relative error? If you chose 36, as above, but the correct answer turned out to be 30, then your relative error is  1/5.

But suppose you had guessed 35. The numbers in the interval [30, 42] furthest away from 35 are of course the end points, 30 and 42. In either case, the relative error would be 1/6.

Let’s look at the general problem for an interval [ab]. It seems the thing to do is to pick x so that the relative error is the same whether the true value is either extreme, a or b. In that case

(x – a) / a = (b – x) / b

and if we solve for x we get

x = 2ab/(ab).

In other words, the worst case error is minimized when we pick x to be the harmonic mean of a and b.


Related post: Where to wait for an elevator

Integrating polynomials over a sphere or ball

Spheres and balls are examples of common words that take on a technical meaning in math, as I wrote about here. Recall the the unit sphere in n dimensions is the set of points with distance 1 from the origin. The unit ball is the set of points of distance less than or equal to 1 from the origin. The sphere is the surface of the ball.

Integrating a polynomial in several variables over a ball or sphere is easy. For example, take the polynomial xy² + 5x²z² in three variables. The integral of the first term, xy², is zero. If any variable in a term has an odd exponent, then the integral of that term is zero by symmetry. The integral over half of the sphere (ball) will cancel out the integral over the opposite half of the sphere (ball). So we only need to be concerned with terms like 5x²z².

Now in n dimensions, suppose the exponents of x1, x2, …, xn are a1, a2, …, an respectively. If any of the a‘s are odd, the integral over the sphere or ball will be zero, so we assume all the a‘s are even. In that case the integral over the unit sphere is simply

2 B(b_1, b_2, \ldots, b_n)


B(b_1, b_2, \ldots, b_n) = \frac{\Gamma(b_1) \Gamma(b_2) \cdots \Gamma(b_n)}{ \Gamma(b_1 + b_2 + \cdots + b_n)}

is the multivariate beta function and for each i we define bi = (ai + 1)/2. When n = 2 then B is the (ordinary) beta function.

Note that the integral over the unit sphere doesn’t depend on the dimension of the sphere.

The integral over the unit ball is

\frac{2 B(b_1, b_2, \ldots, b_n)}{ a_1 + a_2 + \cdots + a_n + n}

which is proportional to the integral over the sphere, where the proportionality constant depends on the sum of the exponents (the original exponents, the a‘s, not the b‘s) and the dimension n.

Note that if we integrate the constant polynomial 1 over the unit sphere, we get the surface area of the unit sphere, and if we integrate it over the unit ball, we get the volume of the unit ball.

You can find a derivation for the integral results above in [1]. The proof is basically Liouville’s trick for integrating the normal distribution density, but backward. Instead of going from rectangular to polar coordinates, you introduce a normal density and go from polar to rectangular coordinates.

[1] Gerald B. Folland, How to Integrate a Polynomial over a Sphere. The American Mathematical Monthly, Vol. 108, No. 5 (May, 2001), pp. 446-448.

Big derivatives

Suppose you have a function of n variables f. The kth derivative of f is a kth order tensor [1] with nk components.

Not all those tensor components are unique. Assuming our function f is smooth, the order in which partial derivatives are taken doesn’t matter. It only matters which variables you differentiate with respect to and how many times.

There is a potentially unique component of the kth order derivative for every choice of k items (the variables you’re differentiating with respect to) from a set of n items (the variables) with replacement (because you can differentiate with respect to the same variable multiple times). Richard Stanley introduced the notation

\left( {n \choose k} \right)

for this quantity, and it turns out that

\left( {n \choose k} \right) = {n + k - 1 \choose k}

See Counting selections with replacement.

If n or k are large, this is a very large number. Suppose you have a function of n = 1,000 variables and you want to take the 5th derivative. The derivative has 1015 components, but “only” about 1013 distinct components. 8,416,958,750,200 to be exact. A floating point number (IEEE 754 double) takes up 8 bytes, so storing the distinct components would require 62,711 gigabytes. So you could store the distinct components if you had a thousand computers with 64 GB of RAM each.

An application that depends on high-order derivatives of functions of many variables has to exploit some structure, such as sparsity or symmetry, and not directly compute and store the derivatives.

By the way, if both are large, the approximate number of components estimated via Stirling’s approximation is

\sqrt{\frac{m+k}{2\pi m k}} \frac{(m+k)^{m+k}}{m^m k^k}

where mn-1.


[1] The value of the function is a real number, a scalar, a 0th order tensor. The first derivative, the gradient, is a vector, a 1st order tensor. The second derivative, the Hessian, is a matrix, a 2nd order tensor. There aren’t more common names for tensors of order three and higher. You could think of a 3rd order tensor as a three dimensional box of numbers, a stack of matrices.

Average fraction round up

Pick a large number n. Divide n by each of the positive integers up to n and round the results up to the nearest integer. On average, how far do you round up?

Or in terms of probability, what is the expected distance between a fraction n/r, where n is large and fixed and r is chosen randomly between 1 and n, and the nearest larger integer?

In symbols, the question above is asking for the approximate value of

\frac{1}{n} \sum_{r = 1}^n \left( \left\lceil \frac{n}{r} \right\rceil - \frac{n}{r}\right )

for large n, i.e. in the limit as n goes to ∞. Here ⌈x⌉ denotes the ceiling of x, the smallest integer greater than or equal to x.

Let’s plot this as a function of n and see what it looks like. Here’s the Python code.

    import matplotlib.pyplot as plt
    from numpy import ceil, arange

    def f(n):
        return sum( [ceil(n/r) - n/r for r in range(1, n)] )/n

    x = arange(1, 100)
    y = [f(n) for n in x]
    plt.plot(x, y)

And here’s the result.

It appears the graph may be converging to some value, and in fact it is. Charles de la Vallée Poussin proved in 1898 that the limiting value is the Euler–Mascheroni constant γ = 0.5772…. This constant is the limiting difference between the nth harmonic number and log n, i.e.

\gamma = \lim_{n\to\infty} \left(\sum_{r = 1}^n \frac{1}{r} - \log n \right)

We can add a horizontal line to our plot to see how well the graph seems to match γ. To do this we need to import the constant euler_gamma from numpy and add the

    plt.axhline(y=euler_gamma, linestyle=":")

after the plot command. When we do, this is what we see.

It looks like the plot is converging to a value slightly less than γ. Apparently the convergence is very slow. When we go out to 10,000 the plot is closer to being centered around γ but still maybe below γ more than above.

If we evaluate our function at n = 1,000,000, we get 0.577258… while γ = 0.577215….

At n = 10,000,000 we get 0.577218…. So taking 100 times as many terms in our sum gives us one extra correct decimal place, as we’d expect of a random processes since convergence usually goes like 1/√n.

Efficiency is not associative for matrix multiplication

Here’s a fact that has been rediscovered many times in many different contexts: The way you parenthesize matrix products can greatly change the time it takes to compute the product. This is important, for example, for the back propagation algorithm in deep learning.

Let AB, and C be matrices that are compatible for multiplication. Then (AB)CA(BC). That is, matrix multiplication is associative. But as far as efficiency is concerned, matrix multiplication is not associative: One side of the equation may be much faster to compute than the other.

For any matrix M, let rows(M) be the number of rows in M and let cols(M) be the number of columns. When we multiply two matrices M and N, we multiply every row of M by every column of N [1]. So the number of vector inner products is rows(M) cols(N), and the number of multiplications [2] in each inner product is cols(M). (We must have cols(M) = rows(N) because we implicitly assumed it’s possible to multiple M and N.)

That means the total number of scalar multiplications required to conmpute MN equals

rows(M) cols(M) cols(N) = rows(M) rows(N) cols(N).

Now let’s apply this to computing ABC. Suppose A is an m by n matrix and C is a by q matrix. Then B has to be a n by p matrix in order to be compatible for multiplication.

Computing AB requires mnp multiplications. Once we have AB, a m by p matrix, the number of multiplications to compute (AB)C is mpq. So the total multiplications to compute ABC by first computing AB is mp(nq).

If we compute BC first, this requires npq multiplications, and then multiplying A by BC requires mnq operations, for a total of nq(m + p).

In summary, computing (AB)C requires mp(nq) multiplications, and computing A(BC) requires (m + p)nq multiplications. I turned the second term around to emphasize that in both expressions you do something to m and p, and something else to n and q. You either multiply and add, or add and multiply.

If you’re trying to minimize these terms, you’d rather add big numbers together than multiply them. So if m and p are large relative to n and q, i.e. both A and C are tall matrices, having more rows than columns, then multiplying A(BC) is going to be faster than multiplying (AB)C.

For example, suppose both A and C have a million rows and a hundred columns. Necessarily B would have a hundred rows and a million columns. Computing (AB)C would require 2×1014 multiplications, but computing A(BC) would take 2×1010 multiplications. That is, the latter would be 10,000 times faster.

Related: Applied linear algebra


[1] This post assumes we compute matrix products the usual way, taking the product of every row in the left matrix with every column in the right matrix. It’s theoretically possible, though not practical, to multiply matrices in fewer operations. If the matrices have some exploitable special structure then there could be a faster way to multiply them, but not in general.

[2] It is common in numerical linear algebra to just count multiplications because this gives a good idea of how efficient an algorithm is. Counting additions and multiplications would usually lead to the same conclusions. If you really want to be precise, you’d need to count memory operations. In practice memory management makes more of a difference than whether we count just multiplications or multiplications and additions.




Higher order Taylor series in several variables

Most sources that present Taylor’s theorem for functions of several variables stop at second order terms. One reason is that one or two terms are good enough for many applications. But the bigger reason is that things get more complicated when you include higher order terms.

The kth order term in Taylor’s theorem is a rank k tensor. You can think o rank 0 tensors as numbers, rank 1 tensors as vectors, and rank 2 tensors as matrices. Then we run out of familiar objects. A rank 3 tensor requires you start thinking in terms of tensors rather than more elementary terms.

There is a way to express Taylor’s theorem using only vectors and matrices. Maybe not the most elegant approach, depending on one’s taste, but it avoids any handwaving talk of a tensor being a “higher dimensional boxes of numbers” and such.

There’s a small price to pay. You have to introduce two new but simple ideas: the vec operator and the Kronecker product.

The vec operator takes an m × n matrix A and returns an mn × 1 matrix v, i.e. a column vector, by stacking the columns of A. The first m elements of v are the first column of A, the next m elements of v are the second column of A, etc.

The Kronecker product of an m × n matrix A and a p × q matrix B is a mp × nq matrix KA B. You can think of K as a block partitioned matrix. The ij block of K is aij B. In other words, to form K, take each element of A and replace it with its product with the matrix B.

A couple examples will make this clear.

A = \left\[ \begin{array}{cc} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \end{array} \right]


\mathrm{vec} \, A = \left\[ \begin{array}{c} 1 \\ 3 \\ 5 \\ 7 \\ 2 \\ 4 \\ 6 \\ 8 \end{array} \right]


B= \left\[ \begin{array}{ccc} 10 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 30 \end{array} \right]


A \otimes B= \left\[ \begin{array}{cccccc} 10 & 0 & 0 & 20 & 0 & 0 \\ 0 & 20 & 0 & 0 & 40 & 0 \\ 0 & 0 & 30 & 0 & 0 & 60 \\ 30 & 0 & 0 & 40 & 0 & 0 \\ 0 & 60 & 0 & 0 & 80 & 0 \\ 0 & 0 & 90 & 0 & 0 & 120 \\ 50 & 0 & 0 & 60 & 0 & 0 \\ 0 & 100 & 0 & 0 & 120 & 0 \\ 0 & 0 & 150 & 0 & 0 & 180 \\ 70 & 0 & 0 & 80 & 0 & 0 \\ 0 & 140 & 0 & 0 & 160 & 0 \\ 0 & 0 & 210 & 0 & 0 & 240 \\ \end{array} \right]


Now we write down Taylor’s theorem. Let f be a real-valued function of n variables. Then

f(x + h) = f(x) + f^{(1)}(x) h + \sum_{k=2}^\infty \frac{1}{k!} \left[\stackrel{k-1}{\otimes}h^T \right] f^{(k)}(x) h

where f(0) = f and for k > 0,

f^{(k)}(x) = left. \frac{ \partial \mathrm{vec}\, f^{(k-1)} }{\partial h^T} \right|x

The symbol ⊗ with a number on top means to take the Kronecker product of the argument with itself that many times.

Source: Matrix Differential Calculus by Magnus and Neudecker.

Related post: What is a tensor?

Runge phenomena

I’ve mentioned the Runge phenomenon in a couple posts before. Here I’m going to go into a little more detail.

First of all, the “Runge” here is Carl David Tolmé Runge, better known for the Runge-Kutta algorithm for numerically solving differential equations. His name rhymes with cowabunga, not with sponge.

Runge showed that polynomial interpolation at evenly-spaced points can fail spectacularly to converge. His example is the function f(x) = 1/(1 + x²) on the interval [-5, 5], or equivalently, and more convenient here, the function f(x) = 1/(1 + 25x²) on the interval [-1, 1]. Here’s an example with 16 interpolation nodes.

Runge's example

Runge found that in order for interpolation at evenly spaced nodes in [-1, 1] to converge, the function being interpolated needs to be analytic inside a football-shaped [1] region of the complex plane with major axis [-1, 1] on the real axis and minor axis approximately [-0.5255, 0.5255]  on the imaginary axis. For more details, see [2].

The function in Runge’s example has a singularity at 0.2i, which is inside the football. Linear interpolation at evenly spaced points would converge for the function f(x) = 1/(1 + x²) since the singularity at i is outside the football.

Runge's example

For another example, consider the function f(x) = exp(- 1/x²) , defined to be 0 at 0. This function is infinitely differentiable but it is not analytic at the origin. With only 16 interpolation points as above, there’s a small indication of trouble at the ends.

Interpolating exp(-1/x^2)

With 28 interpolation points in the plot below, the lack of convergence is clear.

Interpolating exp(-1/x^2)

The problem is not polynomial interpolation per se but polynomial interpolation at evenly-spaced nodes. Interpolation at Chebyshev points converges for the examples here. The location of singularities effects the rate of convergence but not whether the interpolants converge.

RelatedHelp with interpolation


[1] American football, that is. The region is like an ellipse but pointy at -1 and 1.

[2] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

Yogi Berra meets Pafnuty Chebyshev

I just got an evaluation copy of The Best Writing on Mathematics 2017. My favorite chapter was Inverse Yogiisms by Lloyd N. Trefethen.

Trefethen gives several famous Yogi Berra quotes and concludes that

Yogiisms are statements that, if taken literally, are meaningless or contradictory or nonsensical or tautological—yet nevertheless convey something true.

An inverse yogiism is the opposite,

[a] statement that is literally true, yet conveys something false.

What a great way way to frame a chapter! Now that I’ve heard the phrase, I’m trying to think of inverse yogiisms. Nothing particular has come to mind yet, but I feel like there must be lots of things that fit that description. Trefethen comes up with three inverse yogiisms, and my favorite is the middle one: Faber’s theorem on polynomial interpolation.

Faber’s theorem is a non-convergence result for interpolants of continuous functions. Trefethen quotes several numerical analysis textbooks that comment on Faber’s theorem in a way that implies an overly pessimistic interpretation. Faber’s theorem is true for continuous functions in general, but if the function f  being interpolated is smooth, or even just Lipschitz continuous, the theorem doesn’t hold. In particular, Chebyshev interpolation produces a sequence of polynomials converging to f.

A few years ago I wrote a blog post that shows a famous example due to Carle Runge that if you interpolate f(x) = 1/(1 + x²) over [-5, 5] with evenly spaced nodes, the sequence of interpolating polynomials diverges. In other words, adding more interpolation points makes the fit worse.

Here’s the result of fitting a 16th degree polynomial to f  at evenly spaced nodes.

graph of f(x) and p16(x)

The error near the ends is terrible, though the fit does improve in the middle. If instead of using evenly spaced nodes you use the roots of Chebyshev polynomials, the interpolating polynomials do in fact converge, and converge quickly. If the kth derivative of f has bounded variation, then the error in interpolating f at n points is O(nk).

Common words that have a technical meaning in math

Mathematical writing is the opposite of business writing in at least one respect. Math uses common words as technical terms, whereas business coins technical terms to refer to common ideas.

There are a few math terms I use fairly often and implicitly assume readers understand. Perhaps the most surprising is almost as in “almost everywhere.” My previous post, for example, talks about something being true for “almost all x.”

The term “almost” sounds vague but it actually has a precise technical meaning. A statement is true almost everywhere, or holds for almost all x, if the set of points where it doesn’t hold has measure zero.

For example, almost all real numbers are irrational. There are infinitely many rational numbers, and so there are a lot of exceptions to the statement “all real numbers are irrational,” but the set of exceptions has measure zero [1].

In common parlance, you might use ball and sphere interchangeably, but in math they’re different. In a normed vector space, the set of all points of norm no more than r is the ball of radius r. The set of all points with norm exactly r is the sphere of radius r. A sphere is the surface of a ball.

The word smooth typically means “infinitely differentiable,” or depending on context, differentiable as many times as you need. Often there’s no practical loss of generality in assuming something is infinitely differentiable when you only need to know, for example, that it only needs three derivatives [2]. For example, a manifold whose charts are once differentiable can always be altered slightly to be infinitely differentiable.

The words regular and normal are used throughout mathematics as technical terms, and their meaning changes completely depending on context. For example, in topology regular and normal are two kinds of separation axioms. They tell you whether a topology has enough open sets to separate a point from a closed set or separate two closed sets from each other.

When I use normal I’m most often talking about a normal (i.e. Gaussian) probability distribution. I don’t think I use regular as a technical term that often, but when I do it probably means something like smooth, but more precise. A regularity result in differential equations, for example, tells you what sort of desirable properties a solution has: whether it’s a classical solution or only a weak solution, whether it’s continuous or differentiable, etc.

While I’m giving a sort of reader’s guide to my terminology, log always refers to natural log and trig functions are always in radians unless noted otherwise. More on that here.

* * *

The footnotes below are much more technical than the text above.

[1] Here’s a proof that any countable set of points has measure zero. Pick any ε > 0. Put an open interval of width ε/2 around the first point, an interval of width ε/4 around the second point, an interval of width ε/8 around the third point etc. This covers the countable set of points with a cover of measure ε, and since ε as arbitrary, the set of points must have measure 0.

The irrational numbers are uncountable, but that’s not why they have positive measure. A countable set has measure zero, but a set of measure zero may be uncountable. For example, the Cantor set is uncountable but has measure zero. Or to be more precise, I should say the standard Cantor set has measure zero. There are other Cantor sets, i.e. sets homoemorphic to the standard Cantor set, that have positive measure. This shows that “measure zero” is not a topological property.

[2] I said above that often it doesn’t matter how many times you can differentiate a function, but partial differential equations are an exception to that rule. There you’ll often you’ll care exactly how many (generalized) derivatives a solution has. And you’ll obsess over exactly which powers of the function or its derivatives are integrable. The reason is that a large part of the theory revolves around embedding theorems, whether this function space embeds in that function space. The number of derivatives a function has and the precise exponents p for the Lebesgue spaces they live in matters a great deal. Existence and uniqueness of solutions can hang on such fine details.