# Equation to fit an egg

How would you fit an equation to the shape of an egg? This site suggests an equation of the form

Note that if k = 0 we get an ellipse. The larger the parameter k is, the more asymmetric the shape is about the y-axis.

Let’s try that out in Mathematica:

    ContourPlot[
x^2/16 + y^2 (1 + 0.1 x)/4 == 1,
{x, -4, 4}, {y, -3, 3}
]

Here’s another plot with k = 0.05. This one’s a little closer to an ellipse.

## Solving for parameters

If you measured an egg, how would you find ab, and k?

Setting y = 0 shows that 2a is the length of the egg. Setting x = 0 shows that 2b is the width of the egg at the midpoint of the length. Note that it’s not the maximum height of the egg because that occurs to the left of the midpoint. (To the left if k is positive. The parameter k could be negative, which flips the egg about the y-axis so that the flatter side is on the right.)

To find k we measure the point x where the maximum height occurs.

We have the equation

and implicit differentiation shows

At the maximum height the derivative of y is zero, and so the right side also equals zero. This lets us solve for k.

## Curvature

As k increases, the egg gets flatter on the left side and more pointed on the right side. We can quantify this by calculating the curvature at both ends.

For a curve given implicitly by F(xy) = 0, the curvature is given by

The expression above simplifies greatly at the two points we’re interest in, (±a, 0).

And so the curvature reduces to

So in our first example above, with a = 4, b = 2, and k = 0.1, we have a curvature of 0.6 on the left and 1.4 on the right. In the second example with k = 0.05, we have a curvature of 0.8 on the left and 1.2 on the right.

See the next post for the volume of an egg, assuming the equation for the shape in this post.

# Eight-bit floating point

Researchers have discovered that for some problems, deep neural networks (DNNs) can get by with low precision weights. Using fewer bits to represent weights means that more weights can fit in memory at once. This, as well as embedded systems, has renewed interest in low-precision floating point.

Microsoft mentioned its proprietary floating point formats ms-fp8 and ms-fp9 in connection with its Brainwave Project [1]. I haven’t been able to find any details about these formats, other than that they use two- and three-bit exponents (respectively?).

This post will look at what an 8-bit floating point number would look like if it followed the pattern of IEEE floats or posit numbers. In the notation of the previous post, we’ll look at ieee<8,2> and posit<8,0> numbers. (Update: Added a brief discussion of ieee<8,3>, ieee<8,4>, and posit<8,1> at the end.)

## Eight-bit IEEE-like float

IEEE floating point reserves exponents of all 0’s and all 1’s for special purposes. That’s not as much of a high price with large exponents, but with only four possible exponents, it seems very wasteful to devote half of them for special purposes. Maybe this is where Microsoft does something clever. But for this post, we’ll forge ahead with the analogy to larger IEEE floating point numbers.

There would be 191 representable finite numbers, counting the two representations of 0 as one number. There would be two infinities, positive and negative, and 62 ways to represent NaN.

The smallest non-zero number would be

2-5 = 1/32 = 0.03125.

The largest value would be 01011111 and have value

4(1 – 2-5) = 31/8 = 3.3875.

This makes the dynamic range just over two decades.

## Eight-bit posit

A posit<8, 0> has no significand, just a sign bit, regime, and exponent. But in this case the useed value is 2, and so the range acts like an exponent.

There are 255 representable finite numbers and one value corresponding to ±∞.

The smallest non-zero number would be 1/64 and the largest finite number would be 64. The dynamic range is 3.6 decades.

Update: Here is a list of all possible posit<8,0> numbers.

## Distribution of values

The graphs below give the distribution of 8-bit IEEE-like numbers and 8-bit posits on a log scale.

The distribution of IEEE-like numbers is asymmetric because much of the dynamic range comes from denormalized numbers.

The distributions of posits is approximately symmetrical. If a power of 2 is representable as a posit, so is its reciprocal. But you don’t have perfect symmetry because, for example, 3/2 is representable while 2/3 is not.

## Other eight-bit formats

I had originally considered a 2-bit significand because Microsoft’s ms-fp8 format has a two-bit significand. After this post was first published it was suggested in the comments that an ieee<8, 4> float might be better than ieee<8, 2>, so let’s look at that. Let’s look at ieee<8, 3> too while we’re at it. And a posit<8, 1> too.

An ieee<8, 3> floating point number would have a maximum value of 7 and a minimum value of 2-6 = 1/64, a dynamic range of  2.7 decades. It would have 223 finite values, including two zeros, as well as 2 infinities as 30 NaNs.

An ieee<8, 4> floating point number would have a maximum value of 120 and a minimum value of 2-9 = 1/512, a dynamic range of 4.7 decades. It would have 239 finite values, including two zeros, as well as 2 infinities and 14 NaNs.

A posit<8, 1> would have a maximum value of 212 = 4096 and a minimum value of 1/4096, a dynamic range of  7.2 decades. Any 8-bit posit, regardless of the maximum number of exponent bits, will have 255 finite values and one infinity.

Near 1, an ieee<8, 4> has 3 significand bits, an ieee<8, 3> has 4, and a posit<8,1> has 4.

***

[1] Chung et al. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. Available here.

# Comparing range and precision of IEEE and posit

The IEEE standard 754-2008 defines several sizes of floating point numbers—half precision (binary16), single precision (binary32), double precision (binary64), quadruple precision (binary128), etc.—each with its own specification. Posit numbers, on the other hand, can be defined for any number of bits. However, the IEEE specifications share common patterns so that you could consistently define theoretical IEEE numbers that haven’t actually been specified, making them easier to compare to posit numbers.

An early post goes into the specification of posit numbers in detail. To recap briefly, a posit<nes> number has n bits, a maximum of es of which are devoted to the exponent. The bits are divided into a sign bit, regime bits, exponent bits, and fraction bits. The sign bit is of course one bit, but the other components have variable lengths. We’ll come back to posits later for comparison.

## IEEE floating point range and precision

We will denote a (possibly hypothetical) IEEE floating point number as ieee<nes> to denote one with n total bits and (exactly) es exponent bits. Such a number has one sign bit and n – es -1 significand bits. Actual specifications exist for ieee<16, 5>, ieee<32, 8>, ieee<64, 11>, and ieee<128, 15>.

The exponent of a posit number is simply represented as an unsigned integer. The exponent of an IEEE floating point number equals the exponent bits interpreted as an unsigned integers minus a bias.

So the biases for half, single, double, and quad precision floats are 15, 127, 1023, and 65535 respectively. We could use the formula above to define the bias for a hypothetical format not yet specified, assuming the new format is consistent with existing formats in this regard.

The largest exponent, emax is 2es-1 – 1 (also equal to the bias), and the smallest (most negative) exponent is emin = 2 – 2es-1. This accounts for 2es-1 – 2 possible exponents. The two remaining possibilities consist of all 1’s and all 0’s, and are reserved for special use. They represent, in combination with sign and signifcand bits, special values ±0, ±∞, NaN, and denomalized numbers. (More on denormalized numbers shortly.)

The largest representable finite number has the maximum exponent and a significand of all 1’s. Its value is thus

where s is the number of significand bits. And so the largest representable finite number is just slightly less than

We’ll use this as the largest representable value when calculating dynamic range below.

The smallest representable normalized number (normalized meaning the signifcand represents a number greater than or equal to 1) is

However, it is possible to represent smaller values with denomalized numbers. Ordinarily the significand bits fff… represent a number 1.fff… But when the exponent bit pattern consists of all 0’s, the significand bits are interpreted as 0.fff… This means that the smallest denormalized number has a significand of all o’s except for a 1 at the end. This represents a value of

where again s is the number of significand bits.

The dynamic range of an ieee<nes> number is the log base 10 of the ratio of the largest to smallest representable numbers, smallest here including denormalized numbers.

## IEEE float and posit dynamic range at comparable precision

Which posit number should we compare with each IEEE number? We can’t simply compare ieee<nes> with posit<nes>. The value n means the same in both cases: the total number of bits. And although es does mean the number of exponent bits in both cases, they are not directly comparable because posits also have regime bits that are a special kind of exponent bits. In general a comparable posit number will have a smaller es value than its IEEE counterpart.

One way to compare IEEE floating point numbers and posit numbers is to chose a posit number format with comparable precision around 1. See the first post on posits their dynamic range and significance near 1.

In the following table, the numeric headings are the number of bits in a number. The “sig” rows contain the number of sigificand bits in the representation of 1, and “DR” stands for dynamic range in decades.

|-----------+----+-----+------+-------|
|           | 16 |  32 |   64 |   128 |
|-----------+----+-----+------+-------|
| IEEE  es  |  5 |   8 |   11 |    15 |
| posit es  |  1 |   3 |    5 |     8 |
| IEEE  sig | 10 |  23 |   52 |   112 |
| posit sig | 12 |  26 |   56 |   117 |
| IEEE  DR  | 12 |  83 |  632 |  9897 |
| posit DR  | 17 | 144 | 1194 | 19420 |
|-----------+----+-----+------+-------|


Note that in each case the posit number has both more precision for numbers near 1 and a wider dynamic range.

It’s common to use a different set of posit es values that have a smaller dynamic range than their IEEE counterparts (except for 16 bits) but have more precision near 1.

|-----------+----+-----+------+-------|
|           | 16 |  32 |   64 |   128 |
|-----------+----+-----+------+-------|
| IEEE  es  |  5 |   8 |   11 |    15 |
| posit es  |  1 |   2 |    3 |     4 |
| IEEE  sig | 10 |  23 |   52 |   112 |
| posit sig | 12 |  27 |   58 |   122 |
| IEEE  DR  | 12 |  83 |  632 |  9897 |
| posit DR  | 17 |  72 |  299 |  1214 |
|-----------+----+-----+------+-------|


## Python code

Here’s a little Python code if you’d like to experiment with other number formats.

from math import log10

def IEEE_dynamic_range(total_bits, exponent_bits):

# number of significand bits
s = total_bits - exponent_bits - 1

return (2**exponent_bits + s - 2)*log10(2)

def posit_dynamic_range(total_bits, max_exponent_bits):

return (2*total_bits - 4) * 2**max_exponent_bits * log10(2)


# Up arrow and down arrow notation

I recently ran into a tweet saying that if ** denotes exponentiation then // should denote logarithm. With this notation, for example, if we say

    3**4 == 81

we would also say

    81 // 3 == 4.

This runs counter to convention since // has come to be a comment marker or a notation for integer division. I assume the original proposal was meant to be funny, or at most semi-serious, but it would have made sense had notation developed that way.

Donald Knuth’s arrow notation for hyperexponentiation works that way. An up arrow denotes repeated exponentiation and a down arrow denotes repeated logarithm.

## Up arrow notation

One up arrow denotes repeated multiplication, i.e. exponentiation.

Two up arrows denote repeated exponentiation, i.e. hyperexponentiation.

Three up arrows denotes repeated applications of double arrow, etc.

Here’s how you could calculate

using Python:

    def hyperexp(b, k, n):

if n == 0:
return 1
if k == 1:
return b**n
return hyperexp(b, k-1, hyperexp(b, k, n-1))


This function grows shockingly quickly. The first four values of hyperexp(2, 2, n) starting with n=1 are 2, 4, 16, and 65536. But the fifth value has over 19,000 digits.

Up arrow notation comes up sometimes, for example, in analysis of algorithms.

## Down arrow notation

The downarrow notation is less common, and a little more complicated.

The logarithm of a number x base b is the number of times you’d need to multiply b by itself to get x. So if down arrow is the inverse of up arrow, down arrow is logarithm.

Now what should two down arrows represent? It should be the inverse of down up arrows. For example,

and so we should have

That means the double down arrow counts how many times you have to use up arrows to get from 4 to 65536. But notice we’re counting things. We’re assuming there are an integer number of times you can apply up arrows to define the down arrows. In general you can’t do that. What if we change 65536 to 65535 above?

So here’s where there’s a slight complication, an asymmetry between up arrow and down arrow notation. Repeated up arrows count the number of times you have to apply down arrows to get something less than or equal to the base. See MathWorld, for example.

Seems like down arrow notation might be useful in number theory where iterated logs are common. (There’s a joke that asks what sound a number theorist makes when drowning. Answer: log log log ….)

Related posts

# Asymmetric surprise

## Motivating example: planet spacing

My previous post showed that planets are roughly evenly distributed on a log scale, not just in our solar system but also in extrasolar planetary systems. I hadn’t seen this before I stumbled on it by making some plots.

I didn’t think it was an original discovery—I assume someone did this exercise immediately when systems with several planets were discovered—but I didn’t know what this observation was called. I now know it’s known as the Titius-Bode law, a generalization of an observation about our solar system by Messrs. Titius and Bode a couple centuries ago. See, for example, [1].

Several people were skeptical of the claim that planets are distributed according to a power law and pointed out that uniformly distributed points can look fairly evenly distributed on a logarithmic scale. Which is true, and gets to the topic I want to discuss in this post. Planets are not spaced like uniform random samples (see [1]) and yet it reasonable, at first glance, to ask whether they are.

## Asymmetric surprise

If you’re expecting a power law, and you’re given uniformly distributed data, it doesn’t look too surprising. On the other hand, if you’re expecting uniformly distributed data and you see data distributed according to a power law, you are surprised. I’ll formalize this below.

If you’ve ever tried to make a scaled model of our solar system, you were probably surprised that the planets are far from uniformly spaced. A scaled model of our solar system, say at a museum, is likely to position a few of the inner planets to scale, and then use text to explain where the outer planets should be. For example, there may be a footnote saying “And if everything were to scale, Pluto would be behind the Exxon station at the end of the street.” This is an example of implicitly expected a uniform distribution and receiving data distributed according to a power law.

Some people suspected that I was doing the opposite. By plotting distances on a log scale, I’m implicitly expected a power law distribution. Maybe the data were roughly uniform, but I fooled myself into seeing a power law.

## Quantifying surprise

The Kullback-Liebler divergence from Y to X, written KL(X || Y), is the average surprise of seeing Y when you expected X. That’s one of the interpretations. See this post for more interpretations.

In general, Kullback-Liebler divergence is not symmetric. The divergence from X to Y typically does not equal the divergence from Y to X. The discussion above claims that the surprise from seeing power law data when expecting a uniform distribution is greater than the surprise from seeing uniform data when expected a power law distribution. We show below that this is true.

Let X be random variable uniformly distributed on [0, 1] and let Y be a random variable with distribution proportional to xα on the same interval. (The proportionality constant necessary to make the probability integrate to 1 is α + 1.) We will show that KL(X || Y) is greater than KL(Y || X).

First we calculate the two divergences.

and

And here is a plot comparing the two results as a function of the exponent α.

## Related posts

***

[1] Timothy Bovaird, Charles H. Lineweaver; Exoplanet predictions based on the generalized Titius–Bode relation, Monthly Notices of the Royal Astronomical Society, Volume 435, Issue 2, 21 October 2013, Pages 1126–1138, https://doi.org/10.1093/mnras/stt1357

# Planets and Platonic solids

Johann Kepler discovered in 1596 that the ratios of the orbits of the six planets known in his day were the same as the ratios between nested Platonic solids. Kepler was understandably quite impressed with this discovery and called it the Mysterium Cosmographicum.

I heard of this in a course in the history of astronomy long ago, and have had in the back of my mind that one day I’d look into this in detail. How exactly do you fit these regular solids together? How well do the planetary ratios match the regular solid ratios?

Imagine the orbit of each planet being the equator of a spherical shell centered at the sun. The five regular solids fit snugly into the spaces between the shells. Between Mercury and Venus you can insert an octahedron. Its inradius is the distance of Mercury to the sun, and its circumradius is the distance of Venus to the sun. You can fit the other regular solids in similarly, the icosahedron between Venus and Earth, the dodecahedron between Earth and Mars, the tetrahedron between Mars and Jupiter, and the hexahedron (cube) between Jupiter and Saturn.

Here’s the data on the inradius and circumradius of each regular solid taken from Mathworld.

|--------------+--------------+----------|
| solid        | circumradius | inradius |
|--------------+--------------+----------|
| octahedron   |      0.70722 |  0.40825 |
| icosahedron  |      0.95106 |  0.75576 |
| dodecahedron |      1.40126 |  1.11352 |
| tetrahedron  |      0.61237 |  0.20412 |
| hexahedron   |      0.86603 |  0.50000 |
|--------------+--------------+----------|


Here’s the data on average orbit radii measured in astronomical units taken from WolframAlpha.

|---------+----------|
| Planet  | Distance |
|---------+----------|
| Mercury |  0.39528 |
| Venus   |  0.72335 |
| Earth   |  1.00000 |
| Mars    |  1.53031 |
| Jupiter |  5.20946 |
| Saturn  |  9.55105 |
|---------+----------|


So how well does Kepler’s pattern hold? In the table below, “Planet ratio” is the radius ratios of Venus to Mercury, Earth to Venus, etc. “Solid ratio” is the circumradius to inradius ratio of the regular solids in the order given above.

|--------------+-------------|
| Planet ratio | Solid ratio |
|--------------+-------------|
|      1.82997 |     1.73205 |
|      1.38246 |     1.25841 |
|      1.53031 |     1.25841 |
|      3.40419 |     3.00000 |
|      1.83340 |     1.73205 |
|--------------+-------------|


Not a bad fit, but not great either. I’ve heard that the fit was better given the data available to Kepler at the time; if Kepler had had more accurate data, he might not have come up with his Mysterium Cosmographicum.

By the way, notice some repetition in the solid ratios. The implied equalities are exact. The icosahedron and dodecahedron have the same ratio of circumradius to inradius because they are dual polyhedra. The same is true for the cube and the octahedron. Also, the ratio of 3 for the tetrahedron is exact.

Update: What if Kepler had known about more planets? The next planet ratios in our table above would be 2.01, 1.57, and 1.35. None of these is close to any of the solid ratios.

Related posts

# Curvature and automatic differentiation

Curvature is tedious to calculate by hand because it involves calculating first and second order derivatives. Of course other applications require derivatives too, but curvature is the example we’ll look at in this post.

## Computing derivatives

It would be nice to write programs that only explicitly implement the original function and let software take care of finding the derivatives.

### Numerical differentiation

Finite difference approximations for derivatives are nothing new. For example, Euler (1707–1783) used finite differences to numerically solve differential equations. But numerical differentiation can be inaccurate or unstable, especially for higher order derivatives.

### Symbolic differentiation

Symbolic differentiation is another approach, having the computer manipulate expressions much as a person would do by hand. It works well for many problems, though it scales poorly for large problems. It also requires functions to be presented in traditional mathematical form, not in the form of source code.

### Automatic differentiation

Automatic differentiation is a third way. Like numerical differentiation, it works with floating point numbers, not symbolic expressions. But unlike numerical differentiation, the result does not have approximation error.

As someone put it, automatic differentiation applies the chain rule to floating point numbers rather than to symbolic expressions.

## Python implementation

I’ll use the Python library autograd to compute curvature and illustrate automatic differentiation. autograd is not the most powerful automatic differentiation library for Python, but it is the simplest I’ve seen.

We will compute the curvature of a logistic curve.

The curvature of the graph of a function is given by

Here’s Python code using autograd to compute the curvature.

    import autograd.numpy as np
from autograd import grad

def f(x):
return 1/(1 + np.exp(-x))

f1 = grad(f)  # 1st derivative of f
f2 = grad(f1) # 2nd derivative of f

def curvature(x):
return abs(f2(x))*(1 + f1(x)**2)**-1.5


## Curvature plots

The graph is relatively flat in the middle and at the far ends. In between, the graph bends creating two regions of higher curvature.

    import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 300)
plt.plot(x, f(x))
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.title("Logisitic curve")
plt.savefig("logistic_curve.svg")


Now let’s look at the curvature.

    y = [curvature(t) for t in x]
plt.plot(x, y)
plt.xlabel("$x$")
plt.ylabel(r"$\kappa(x)$")
plt.title("Curvature")
plt.savefig("plot_logistic_curvature.svg")


As we should expect, the curvature is small at the ends and in the middle, with local maxima in between.

We can also look at the signed curvature, the same expression as curvature but without the absolute value.

We plot this with the following code.

    def signed_curvature(x):
return f2(x)*(1 + f1(x)**2)**-1.5

y = [signed_curvature(t) for t in x]
plt.plot(x, y)
plt.xlabel("$x$")
plt.ylabel(r"$k(x)$")
plt.title("Signed curvature")
plt.savefig("graph_signed_curvature.svg")


The result looks more like a sine wave.

The positive values mean the curve is bending counterclockwise, and the negative values mean the curve is bending clockwise.

Related post: Squircles and curvature

# Sums of palindromes

Every positive integer can be written as the sum of three palindromes, numbers that remain the same when their digits are reverse. For example, 389 = 11 + 55 + 323. This holds not just for base 10 but for any base b ≥ 5.

[Update: a new paper on arXiv extends this to b = 3 and 4 as well. Base 2 requires four palindromes.]

The result and algorithms for finding the palindromes was published online last August and is in the most recent print issue of Mathematics of Computation.

Javier Cilleruelo, Florian Luca and Lewis Baxter. Every positive integer is a sum of three palindromes. DOI: https://doi.org/10.1090/mcom/3221

# Squared digit sum

Take any positive integer n and sum the squares of its digits. If you repeat this operation, eventually you’ll either end at 1 or cycle between the eight values 4, 16, 37, 58, 89, 145, 42, and 20.

For example, pick n = 389. Then 3² + 8² + 9² = 9 + 64 + 81 = 154.

Next, 1² + 5² + 4² = 1 + 25 + 16 = 42, and now we’re at one of the eight special values. You can easily verify that once you land on one of these values you keep cycling.

The following code shows that the claim above is true for numbers up to 1,000.

    def square_digits(n):
return sum([int(c)**2 for c in str(n)])

attractors = [1, 4, 16, 37, 58, 89, 145, 42, 20]

for n in range(1, 1000):
m = n
while m not in attractors:
m = square_digits(m)

print("Done")


For a proof that the claim is true in general, see “A Set of Eight Numbers” by Arthur Porges, The American Mathematical Monthly, Vol. 52, No. 7, pp. 379-382.

***

By the way, the cycle image above was created with Emacs org-mode using the following code.

    #+BEGIN_SRC ditaa :file square_digit_sum_cycle.png

+-> 4 -> 16 ->  37 -> 58---+
|                          |
|                          |
+- 20 <- 42 <- 145 <- 89 <-+

#+END_SRC


It’s not the prettiest output, but it was quick to produce. More on producing ASCII art graphs here.

# Asymptotic solution to ODE

Our adventure starts with the following ordinary differential equation:

## Analytic solution

We can solve this equation in closed-form, depending on your definition of closed-form, by multiplying by an integrating factor.

The left side factors to

and so

The indefinite integral above cannot be evaluated in elementary terms, though it can be evaluated in terms of special functions.

We didn’t specify where the integration starts or the constant c. You can make the lower limit of integration whatever you want, and the value of c will be whatever is necessary to satisfy initial conditions.

The initial conditions hardly matter for large x because the constant c is multiplied by a negative exponential. As we’ll see below, the solution y decays like 1/x while the effect of the initial condition decays exponentially.

## Asymptotic series solution

I wrote a few months ago about power series solutions for differential equations, but that approach won’t work here, not over a very large range. If y has a Taylor series expansion, so does it’s derivative, and so y‘ + y has a Taylor series. But the right side of our differential equation has a singularity at 0.

What if instead of assuming y has a Taylor series expansion, we assume it has a Laurant series expansion? That is, instead of assuming y is a weighted sum of positive powers of x, we assume it’s a weighted sum of negative powers of x:

Then formally solving for the coefficients we find that

There’s one problem with this approach: The series diverges for all x because n! grows faster than xn.

And yet the series gives a useful approximation! With a convergent series, we fix x and let n go to infinity. The divergent series above is an asymptotic series, and so we fix n and let x go to infinity.

To see how well the asymptotic solution compares to the analytic solution, we’ll pick the lower limit of integration to be 1 and pick c = 0 in the analytic solution.

The plot above shows the analytic solution, and the first and second order asymptotic solutions. (The nth order solution takes the first n terms of the asymptotic series.) Note that the three curves differ substantially at first but quickly converge together.

Now let’s look at the relative error.

That’s interesting. Eventually the second order approximation is more accurate than the first order, but not at first. In both cases the relative error hits a local minimum, bounces back up, then slowly decreases.

Does this pattern continue if we move on to a third order approximation. Yes, as illustrated below.

## Superasymptotic series

The graphs above suggests that higher order approximations are more accurate, eventually, but we might do better than any particular order by letting the order vary, picking the optimal order for each x. That’s the idea behind superasymptotics. For each x, the superasymptotic series sums up terms in the asymptotic series until the terms start getting larger.

When this approach works, it can produce a series that converges exponentially fast, even though each truncated asymptotic series only converges polynomially fast.

Update: To get an idea how accuracy varies jointly with the argument x and the asymptotic approximation order n, consider the following graph. Note that the horizontal axis is now n, not x. The different curves correspond to different values of x, generally moving down as x increases.

Every curve is non-monotone because for every value of x, increasing the order only helps to a point. Then the error gets worse as the series diverges. But as you can see from the graph, you can do quite well by picking the optimal approximation order for a given x.

# Approximating gamma ratios

Ratios of gamma functions come up often in applications. If the two gamma function arguments differ by an integer, then it’s easy to calculate their ratio exactly by using (repeatedly if necessary) the fact at Γ(x + 1) = x Γ(x).

If the arguments differ by 1/2, there is no closed formula, but the there are useful approximations. I’ve needed something like this a few times lately.

The simplest approximation is

You could motivate or interpret this as saying Γ(x + 1/2) is approximately the geometric mean between Γ(x + 1) and Γ(x). As we’ll see in the plot below, this approximation is good to a couple significant figures for moderate values of x.

There is another approximation that is a little more complicated but much more accurate.

The following plot shows the relative error in both approximations.

By the way, the first approximation above is a special case of the more general approximation

Source:  J. S. Frame. An Approximation to the Quotient of Gamma Function. The American Mathematical Monthly, Vol. 56, No. 8 (Oct., 1949), pp. 529-535

# Generating Laplace random variables

Differential privacy adds Laplace-distributed random noise to data to protect individual privacy. (More on that here.) Although it’s simple to generate Laplacian random values, the Laplace distribution is not always one of the built-in options for random number generation libraries.

The Laplace distribution with scale β has density

The Laplace distribution is also called the double exponential because it looks like two mirror-image exponential distributions glued together.

Note that the scale β is not the standard deviation. The standard deviation is √2 β.

To generate samples from a Laplace distribution with scale β, generate two independent exponential samples with mean β and return their difference.

If you don’t have an API for generating exponential random values, generate uniform random values and return the negative of the log. That will produce exponential values with mean 1. To make random values with mean β, just multiply the results by β.

If you want to generate Laplace values in Python, you could simply use the laplace function in scipy.stats. But I’ll write a generator from scratch just to show what you might do in another environment where you didn’t have exponential or Laplace generators.

    from math import log
from random import random

def exp_sample(mean):
return -mean*log(random())

def laplace(scale):
e1 = exp_sample(scale)
e2 = exp_sample(scale)
return e1 - e2


Related: Stand-alone numerical code, useful when you need a few common mathematical functions but are in an environment that doesn’t provide them, or when you want to avoid adding a library to your project.

# Minimizing relative error

Suppose you know a number is between 30 and 42. You want to guess the number while minimizing how wrong you could be in the worst case. Then you’d guess the midpoint of the two ends, which gives you 36.

But suppose you want to minimize the worst case relative error? If you chose 36, as above, but the correct answer turned out to be 30, then your relative error is  1/5.

But suppose you had guessed 35. The numbers in the interval [30, 42] furthest away from 35 are of course the end points, 30 and 42. In either case, the relative error would be 1/6.

Let’s look at the general problem for an interval [ab]. It seems the thing to do is to pick x so that the relative error is the same whether the true value is either extreme, a or b. In that case

(x – a) / a = (b – x) / b

and if we solve for x we get

x = 2ab/(ab).

In other words, the worst case error is minimized when we pick x to be the harmonic mean of a and b.

Related post: Where to wait for an elevator

# Bits of information in age, birthday, and birthdate

The previous post looked at how much information is contained in zip codes. This post will look at how much information is contained in someone’s age, birthday, and birth date. Combining zip code with birthdate will demonstrate the plausibility of Latanya Sweeney’s famous result [1] that 87% of the US population can be identified based on zip code, sex, and birth date.

## Birthday

Birthday is the easiest. There is a small variation in the distribution of birthdays, but this doesn’t matter for our purposes. The amount of information in a birthday, to three significant figures, is 8.51 bits, whether you include or exclude leap days. You can assume all birthdays are equally common, or use actual demographic data. It only makes a difference in the 3rd decimal place.

## Age

I’ll be using the following age distribution data found on Wikipedia.

|-----------+------------|
| Age range | Population |
|-----------+------------|
|  0– 4     |   20201362 |
|  5– 9     |   20348657 |
| 10–14     |   20677194 |
| 15–19     |   22040343 |
| 20–24     |   21585999 |
| 25–29     |   21101849 |
| 30–34     |   19962099 |
| 35–39     |   20179642 |
| 40–44     |   20890964 |
| 45–49     |   22708591 |
| 50–54     |   22298125 |
| 55–59     |   19664805 |
| 60–64     |   16817924 |
| 65–69     |   12435263 |
| 70–74     |    9278166 |
| 75–79     |    7317795 |
| 80–84     |    5743327 |
| 85+       |    5493433 |
|-----------+------------|


To get data for each particular age, I’ll assume ages are evenly distributed in each group, and I’ll assume the 85+ group consists of people from ages 85 to 92. [2]

With these assumptions, there are 6.4 bits of information in age. This seems plausible: if all ages were uniformly distributed between 0 and 63, there would be exactly 6 bits of information since 26 = 64.

## Birth date

If we assume birth days are uniformly distributed within each age, then age and birth date are independent. The information contained in the birth date would be the sum of the information contained in birthday and age, or 8.5 + 6.4 = 14.9 bits.

## Zip code, sex, and age

The previous post showed there are 13.8 bits of information in a zip code. There are about an equal number of men and women, so sex adds 1 bit. So zip code, sex, and birth date would give a total of 29.7 bits. Since the US population is between 228 and 229, it’s plausible that we’d have enough information to identify everyone.

We’ve made a number of simplifying assumptions. We were a little fast and loose with age data, and we’ve assumed independence several times. We know that sex and age are not independent: more babies are boys, but women live longer. Still, Latanya Sweeney found empirically that you can identify 87% of Americans using the combination of zip code, sex, and birth date [1]. Her study was based on 1990 census data, and at that time the US population was a little less than 228.

## Related posts

***

[1] Latanya Sweeney. “Simple Demographics Often Identify People Uniquely”. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh 2000. Available here.

[1] Bob Wells and Mel Tormé. “The Christmas Song.” Commonly known as “Chestnuts Roasting on an Open Fire.”

# Bits of information in a US zip code

If you know someone’s US zip code, how much do you know about them? We can use entropy to measure the amount of information in bits.

To quantify the amount of information in a zip code, we need to know how many zip codes there are, and how evenly people are divided into zip codes.

There are about 43,000 zip codes in the US. The number fluctuates over time due to small adjustments.

Average information is maximized by dividing people into categories as evenly as possible. Maximum information about one person is maximized by dividing people into categories as unevenly as possible. To see this, suppose there were only two zip codes. The information we’d expect to learn from a zip code would be maximized if we divided people into two equal groups. Suppose on the other hand you were in one zip code and everyone else in the other. On average, zip code would reveal very little about someone, though it would reveal a lot about you!

If everyone were divided evenly into one of 43,000 zip codes, the amount of information revealed by knowing someone’s zip code would be about 15.4 bits, a little more information than asking 15 independent yes/no questions, each with equally likely answers.

But zip codes are not evenly populated. How much information is there in an actual five-digit zip code? To answer that question we need to know the population of each zip code. That’s a little tricky. Zip codes represent mail delivery points, not geographical areas. Technically the US Census Bureau tracks population by ZCTA (Zip Code Tabulation Area) rather than zip code per se. Population by ZCTA is freely available, but difficult to find. I gave up trying to find the data from official sources but was able to find it here.

We can go through the data and find the probability p of someone living in each ZCTA and add up –p log2p over each area. When we do, we find that a ZTCA contains 13.83 bits of information. (We knew it had to be less than 15.4 because uneven distribution reduces entropy.)

The Safe Harbor provision of US HIPAA law lists zip codes as a quasi-identifier. Five digit zip codes do not fall under Safe Harbor. Three digit zip codes (the first three digits of five digit zip codes) do fall under Safe Harbor, mostly. Some areas are so sparsely populated that even three-digit zip code areas are considered too informative. Any three-digit zip code with fewer than 20,000 people is excluded. You can find a snapshot of the list here, though the list may change over time.

If we repeat our calculation for three-digit zip codes, we find that they carry about 9.17 bits of information. It makes little difference to the result whether you include sparse regions, exclude them, or lump them all into one region.

See the next post on information contained in age, birthday, and birth date.

Related posts: