# Planets and Platonic solids

Johann Kepler discovered in 1596 that the ratios of the orbits of the six planets known in his day were the same as the ratios between nested Platonic solids. Kepler was understandably quite impressed with this discovery and called it the Mysterium Cosmographicum.

I heard of this in a course in the history of astronomy long ago, and have had in the back of my mind that one day I’d look into this in detail. How exactly do you fit these regular solids together? How well do the planetary ratios match the regular solid ratios?

Imagine the orbit of each planet being the equator of a spherical shell centered at the sun. The five regular solids fit snugly into the spaces between the shells. Between Mercury and Venus you can insert an octahedron. Its inradius is the distance of Mercury to the sun, and its circumradius is the distance of Venus to the sun. You can fit the other regular solids in similarly, the icosahedron between Venus and Earth, the dodecahedron between Earth and Mars, the tetrahedron between Mars and Jupiter, and the hexahedron (cube) between Jupiter and Saturn.

Here’s the data on the inradius and circumradius of each regular solid taken from Mathworld.

|--------------+--------------+----------|
|--------------+--------------+----------|
| octahedron   |      0.70722 |  0.40825 |
| icosahedron  |      0.95106 |  0.75576 |
| dodecahedron |      1.40126 |  1.11352 |
| tetrahedron  |      0.61237 |  0.20412 |
| hexahedron   |      0.86603 |  0.50000 |
|--------------+--------------+----------|


Here’s the data on average orbit radii measured in astronomical units taken from WolframAlpha.

|---------+----------|
| Planet  | Distance |
|---------+----------|
| Mercury |  0.39528 |
| Venus   |  0.72335 |
| Earth   |  1.00000 |
| Mars    |  1.53031 |
| Jupiter |  5.20946 |
| Saturn  |  9.55105 |
|---------+----------|


So how well does Kepler’s pattern hold? In the table below, “Planet ratio” is the radius ratios of Venus to Mercury, Earth to Venus, etc. “Solid ratio” is the circumradius to inradius ratio of the regular solids in the order given above.

|--------------+-------------|
| Planet ratio | Solid ratio |
|--------------+-------------|
|      1.82997 |     1.73205 |
|      1.38246 |     1.25841 |
|      1.53031 |     1.25841 |
|      3.40419 |     3.00000 |
|      1.83340 |     1.73205 |
|--------------+-------------|


Not a bad fit, but not great either. I’ve heard that the fit was better given the data available to Kepler at the time; if Kepler had had more accurate data, he might not have come up with his Mysterium Cosmographicum.

By the way, notice some repetition in the solid ratios. The implied equalities are exact. The icosahedron and dodecahedron have the same ratio of circumradius to inradius because they are dual polyhedra. The same is true for the cube and the octahedron. Also, the ratio of 3 for the tetrahedron is exact.

Update: What if Kepler had known about more planets? The next planet ratios in our table above would be 2.01, 1.57, and 1.35. None of these is close to any of the solid ratios.

Related posts

# Hypothesis testing vs estimation

I was looking at my daughter’s statistics homework recently, and there were a pair of questions about testing the level of lead in drinking water. One question concerned testing whether the water was safe, and the other concerned testing whether the water was unsafe.

There’s something bizarre, even embarrassing, about this. You want to do two things: estimate the amount of lead, and decide what to do in response. But instead of simply doing just that, you do this arcane dance of choosing two hypotheses, one natural and one arbitrary, and treating the two asymmetrically, depending on which one you call the null and which you call the alternative. This asymmetry is the reason you make a distinction between testing whether the water is safe and testing whether it is unsafe.

It’s a weird tangle of estimation and decision making. The decision-making rules implicit in the procedure are not at all transparent. And even though you are testing the level of lead, you’re doing so indirectly.

The Bayesian approach to the problem is much easier to understand. You estimate the probability distribution for the concentration of lead based on all available information. You can plot this distribution and show it to civil engineers, politicians, or anybody else who needs to make a decision. Non-statisticians are much more likely to understand such a plot than the nuances of null and alternative hypotheses, significance, power, and whether you’re testing for safety versus testing for non-safety. (Statisticians are more likely to understand estimation as well.)

In the homework problems, the allowable level of lead was 15 ppm. After obtaining the posterior distribution on the concentration of lead, you could simply estimate the probability that the concentration is above 15 ppm. But you could also calculate the probability that the concentration lies in any other range you’re interested in.

Classical statistics does not allow such probability calculations. Even a confidence interval, something that looks like a probability statement about the concentration of lead, is actually a probability statement about the statistical process being used and not a probability statement about lead concentration per se.

# Curvature and automatic differentiation

Curvature is tedious to calculate by hand because it involves calculating first and second order derivatives. Of course other applications require derivatives too, but curvature is the example we’ll look at in this post.

## Computing derivatives

It would be nice to write programs that only explicitly implement the original function and let software take care of finding the derivatives.

### Numerical differentiation

Finite difference approximations for derivatives are nothing new. For example, Euler (1707–1783) used finite differences to numerically solve differential equations. But numerical differentiation can be inaccurate or unstable, especially for higher order derivatives.

### Symbolic differentiation

Symbolic differentiation is another approach, having the computer manipulate expressions much as a person would do by hand. It works well for many problems, though it scales poorly for large problems. It also requires functions to be presented in traditional mathematical form, not in the form of source code.

### Automatic differentiation

Automatic differentiation is a third way. Like numerical differentiation, it works with floating point numbers, not symbolic expressions. But unlike numerical differentiation, the result does not have approximation error.

As someone put it, automatic differentiation applies the chain rule to floating point numbers rather than to symbolic expressions.

## Python implementation

I’ll use the Python library autograd to compute curvature and illustrate automatic differentiation. autograd is not the most powerful automatic differentiation library for Python, but it is the simplest I’ve seen.

We will compute the curvature of a logistic curve.

The curvature of the graph of a function is given by

Here’s Python code using autograd to compute the curvature.

    import autograd.numpy as np

def f(x):
return 1/(1 + np.exp(-x))

f1 = grad(f)  # 1st derivative of f
f2 = grad(f1) # 2nd derivative of f

def curvature(x):
return abs(f2(x))*(1 + f1(x)**2)**-1.5


## Curvature plots

The graph is relatively flat in the middle and at the far ends. In between, the graph bends creating two regions of higher curvature.

    import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 300)
plt.plot(x, f(x))
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.title("Logisitic curve")
plt.savefig("logistic_curve.svg")


Now let’s look at the curvature.

    y = [curvature(t) for t in x]
plt.plot(x, y)
plt.xlabel("$x$")
plt.ylabel(r"$\kappa(x)$")
plt.title("Curvature")
plt.savefig("plot_logistic_curvature.svg")


As we should expect, the curvature is small at the ends and in the middle, with local maxima in between.

We can also look at the signed curvature, the same expression as curvature but without the absolute value.

We plot this with the following code.

    def signed_curvature(x):
return f2(x)*(1 + f1(x)**2)**-1.5

y = [signed_curvature(t) for t in x]
plt.plot(x, y)
plt.xlabel("$x$")
plt.ylabel(r"$k(x)$")
plt.title("Signed curvature")
plt.savefig("graph_signed_curvature.svg")


The result looks more like a sine wave.

The positive values mean the curve is bending counterclockwise, and the negative values mean the curve is bending clockwise.

Related post: Squircles and curvature

# Generalized normal distribution and kurtosis

The generalized normal distribution adds an extra parameter β to the normal (Gaussian) distribution. The probability density function for the generalized normal distribution is

Here the location parameter μ is the mean, but the scaling factor σ is not the standard deviation unless β = 2.

For small values of the shape parameter β, the distribution is more sharply pointed in the middle and decays more slowly in the tails. We say the tails are “thick” or “heavy.” When β = 1 the generalized normal distribution reduces to the Laplace distribution.

Here are examples with μ = 0 and σ = 1.

The normal distribution is a special case corresponding to β = 2. Large values of β make the distribution flatter on top and thinner (lighter) in the tails. Again μ = 0 and σ = 1 in the plots below.

One way to measure the thickness of probability distribution tails is kurtosis. The normal distribution has kurtosis equal to 3. Smaller values of kurtosis correspond to thinner tails and larger values to thicker tails.

There’s a common misunderstanding that kurtosis measures how pointy the distribution is in the middle. Often that’s the case, and in fact that’s the case for the generalized normal distribution. But it’s not true in general. It’s possible for a distribution to be flat on top and have heavy tails or pointy on top and have thin tails.

Distributions with thinner tails than the normal are called “platykurtic” and distributions with thicker tails than the normal are called “leptokurtic.” The names were based on the misunderstanding mentioned above. The platy– prefix means broad, but it’s not the tails that are broader, it’s the middle. Similarly, the lepto– prefix means “thin”, referring to being pointy in the middle. But leptokurtic distributions have thicker tails!

The kurtosis of the generalized normal distribution is given by

We can use that to visualize how the kurtosis varies as a function of the shape parameter β.

The Laplace distribution (β = 1) has kurtosis 6 and the normal distribution (β = 2) has kurtosis 3.

You can use the fact that Γ(x) ~ 1/x for small x to show that in the limit as β goes to infinity, the kurtosis approaches 9/5.

Related post: Computing skewness and kurtosis in one pass

# Gravitational attraction of stars and cows

One attempt at rationalizing astrology is to say that the gravitational effects of celestial bodies impact our bodies. To get an idea how hard the stars and planets pull on us, let’s compare their gravitational attraction to that of cows at various distances.

Newton’s law of gravity says that the gravitational attraction between two objects is proportional to the product of their masses and inversely proportional to the square of the distance between them. From this we can solve for how far away a cow would have to be to have the same gravitational attraction.

For round numbers, let’s say a cow weighs 1000 kg. (Strictly speaking, that’s a more typical weight for a bull than a cow.) Then Jupiter has as much gravitational tug on a person as does a cow about two feet away.

Regulus, the brightest star in Leo, has the same pull as a cow about 5 miles away. Alpha Centauri, the closest star to Earth (other than the sun, of course) has about the same pull as a cow 18 miles away.

Calculations were based on average distance. The distance to planets changes over time, but so does the distance to a given cow.

|----------------+--------------|
| Body           | Cow distance |
|----------------+--------------|
| Jupiter        |     0.57 m   |
| Venus          |     2.45 m   |
| Mars           |    10.06 m   |
| Regulus        |     7.81 km  |
| Alpha Centauri |    28.70 km  |
|----------------+--------------|


# Sums of palindromes

Every positive integer can be written as the sum of three palindromes, numbers that remain the same when their digits are reverse. For example, 389 = 11 + 55 + 323. This holds not just for base 10 but for any base b ≥ 5.

[Update: a new paper on arXiv extends this to b = 3 and 4 as well. Base 2 requires four palindromes.]

The result and algorithms for finding the palindromes was published online last August and is in the most recent print issue of Mathematics of Computation.

Javier Cilleruelo, Florian Luca and Lewis Baxter. Every positive integer is a sum of three palindromes. DOI: https://doi.org/10.1090/mcom/3221

# Squared digit sum

Take any positive integer n and sum the squares of its digits. If you repeat this operation, eventually you’ll either end at 1 or cycle between the eight values 4, 16, 37, 58, 89, 145, 42, and 20.

For example, pick n = 389. Then 3² + 8² + 9² = 9 + 64 + 81 = 154.

Next, 1² + 5² + 4² = 1 + 25 + 16 = 42, and now we’re at one of the eight special values. You can easily verify that once you land on one of these values you keep cycling.

The following code shows that the claim above is true for numbers up to 1,000.

    def square_digits(n):
return sum([int(c)**2 for c in str(n)])

attractors = [1, 4, 16, 37, 58, 89, 145, 42, 20]

for n in range(1, 1000):
m = n
while m not in attractors:
m = square_digits(m)

print("Done")


For a proof that the claim is true in general, see “A Set of Eight Numbers” by Arthur Porges, The American Mathematical Monthly, Vol. 52, No. 7, pp. 379-382.

***

By the way, the cycle image above was created with Emacs org-mode using the following code.

    #+BEGIN_SRC ditaa :file square_digit_sum_cycle.png

+-> 4 -> 16 ->  37 -> 58---+
|                          |
|                          |
+- 20 <- 42 <- 145 <- 89 <-+

#+END_SRC


It’s not the prettiest output, but it was quick to produce. More on producing ASCII art graphs here.

# Asymptotic solution to ODE

Our adventure starts with the following ordinary differential equation:

## Analytic solution

We can solve this equation in closed-form, depending on your definition of closed-form, by multiplying by an integrating factor.

The left side factors to

and so

The indefinite integral above cannot be evaluated in elementary terms, though it can be evaluated in terms of special functions.

We didn’t specify where the integration starts or the constant c. You can make the lower limit of integration whatever you want, and the value of c will be whatever is necessary to satisfy initial conditions.

The initial conditions hardly matter for large x because the constant c is multiplied by a negative exponential. As we’ll see below, the solution y decays like 1/x while the effect of the initial condition decays exponentially.

## Asymptotic series solution

I wrote a few months ago about power series solutions for differential equations, but that approach won’t work here, not over a very large range. If y has a Taylor series expansion, so does it’s derivative, and so y‘ + y has a Taylor series. But the right side of our differential equation has a singularity at 0.

What if instead of assuming y has a Taylor series expansion, we assume it has a Laurant series expansion? That is, instead of assuming y is a weighted sum of positive powers of x, we assume it’s a weighted sum of negative powers of x:

Then formally solving for the coefficients we find that

There’s one problem with this approach: The series diverges for all x because n! grows faster than xn.

And yet the series gives a useful approximation! With a convergent series, we fix x and let n go to infinity. The divergent series above is an asymptotic series, and so we fix n and let x go to infinity.

To see how well the asymptotic solution compares to the analytic solution, we’ll pick the lower limit of integration to be 1 and pick c = 0 in the analytic solution.

The plot above shows the analytic solution, and the first and second order asymptotic solutions. (The nth order solution takes the first n terms of the asymptotic series.) Note that the three curves differ substantially at first but quickly converge together.

Now let’s look at the relative error.

That’s interesting. Eventually the second order approximation is more accurate than the first order, but not at first. In both cases the relative error hits a local minimum, bounces back up, then slowly decreases.

Does this pattern continue if we move on to a third order approximation. Yes, as illustrated below.

## Superasymptotic series

The graphs above suggests that higher order approximations are more accurate, eventually, but we might do better than any particular order by letting the order vary, picking the optimal order for each x. That’s the idea behind superasymptotics. For each x, the superasymptotic series sums up terms in the asymptotic series until the terms start getting larger.

When this approach works, it can produce a series that converges exponentially fast, even though each truncated asymptotic series only converges polynomially fast.

Update: To get an idea how accuracy varies jointly with the argument x and the asymptotic approximation order n, consider the following graph. Note that the horizontal axis is now n, not x. The different curves correspond to different values of x, generally moving down as x increases.

Every curve is non-monotone because for every value of x, increasing the order only helps to a point. Then the error gets worse as the series diverges. But as you can see from the graph, you can do quite well by picking the optimal approximation order for a given x.

# Approximating gamma ratios

Ratios of gamma functions come up often in applications. If the two gamma function arguments differ by an integer, then it’s easy to calculate their ratio exactly by using (repeatedly if necessary) the fact at Γ(x + 1) = x Γ(x).

If the arguments differ by 1/2, there is no closed formula, but the there are useful approximations. I’ve needed something like this a few times lately.

The simplest approximation is

You could motivate or interpret this as saying Γ(x + 1/2) is approximately the geometric mean between Γ(x + 1) and Γ(x). As we’ll see in the plot below, this approximation is good to a couple significant figures for moderate values of x.

There is another approximation that is a little more complicated but much more accurate.

The following plot shows the relative error in both approximations.

By the way, the first approximation above is a special case of the more general approximation

Source:  J. S. Frame. An Approximation to the Quotient of Gamma Function. The American Mathematical Monthly, Vol. 56, No. 8 (Oct., 1949), pp. 529-535

# Generating Laplace random variables

Differential privacy adds Laplace-distributed random noise to data to protect individual privacy. (More on that here.) Although it’s simple to generate Laplacian random values, the Laplace distribution is not always one of the built-in options for random number generation libraries.

The Laplace distribution with scale β has density

The Laplace distribution is also called the double exponential because it looks like two mirror-image exponential distributions glued together.

Note that the scale β is not the standard deviation. The standard deviation is √2 β.

To generate samples from a Laplace distribution with scale β, generate two independent exponential samples with mean β and return their difference.

If you don’t have an API for generating exponential random values, generate uniform random values and return the negative of the log. That will produce exponential values with mean 1. To make random values with mean β, just multiply the results by β.

If you want to generate Laplace values in Python, you could simply use the laplace function in scipy.stats. But I’ll write a generator from scratch just to show what you might do in another environment where you didn’t have exponential or Laplace generators.

    from math import log
from random import random

def exp_sample(mean):
return -mean*log(random())

def laplace(scale):
e1 = exp_sample(scale)
e2 = exp_sample(scale)
return e1 - e2


Related: Stand-alone numerical code, useful when you need a few common mathematical functions but are in an environment that doesn’t provide them, or when you want to avoid adding a library to your project.

# Could you read on Pluto?

I heard somewhere that Pluto receives more sunlight than you might think, enough to read by, and that sunlight on Pluto is much brighter than moonlight on Earth. I forget where I heard that, but I’ve done a back-of-the-envelope calculation to confirm that it’s true.

Pluto is about 40 AU from the sun, i.e. forty times as far from the sun as we are. The inverse square law implies Pluto gets 1/1600 as much light from the sun as Earth does.

Direct sun is between 30,000 and 100,000 lux (lumens per square meter). We’ll go with the high end because that’s an underestimate of how bright the sun would be if we didn’t have an atmosphere between us and the sun. So at high noon on Pluto you’d get at least 60 lux of sunlight.

Civil twilight is roughly enough light to read by, and that’s 3.4 lux. Moonlight is less than 0.3 lux.

60 lux would be comparable to indoor lighting in a hall or stairway.

# Minimizing relative error

Suppose you know a number is between 30 and 42. You want to guess the number while minimizing how wrong you could be in the worst case. Then you’d guess the midpoint of the two ends, which gives you 36.

But suppose you want to minimize the worst case relative error? If you chose 36, as above, but the correct answer turned out to be 30, then your relative error is  1/5.

But suppose you had guessed 35. The numbers in the interval [30, 42] furthest away from 35 are of course the end points, 30 and 42. In either case, the relative error would be 1/6.

Let’s look at the general problem for an interval [ab]. It seems the thing to do is to pick x so that the relative error is the same whether the true value is either extreme, a or b. In that case

(x – a) / a = (b – x) / b

and if we solve for x we get

x = 2ab/(ab).

In other words, the worst case error is minimized when we pick x to be the harmonic mean of a and b.

Related post: Where to wait for an elevator

# Bits of information in age, birthday, and birthdate

The previous post looked at how much information is contained in zip codes. This post will look at how much information is contained in someone’s age, birthday, and birth date. Combining zip code with birthdate will demonstrate the plausibility of Latanya Sweeney’s famous result [1] that 87% of the US population can be identified based on zip code, sex, and birth date.

## Birthday

Birthday is the easiest. There is a small variation in the distribution of birthdays, but this doesn’t matter for our purposes. The amount of information in a birthday, to three significant figures, is 8.51 bits, whether you include or exclude leap days. You can assume all birthdays are equally common, or use actual demographic data. It only makes a difference in the 3rd decimal place.

## Age

I’ll be using the following age distribution data found on Wikipedia.

|-----------+------------|
| Age range | Population |
|-----------+------------|
|  0– 4     |   20201362 |
|  5– 9     |   20348657 |
| 10–14     |   20677194 |
| 15–19     |   22040343 |
| 20–24     |   21585999 |
| 25–29     |   21101849 |
| 30–34     |   19962099 |
| 35–39     |   20179642 |
| 40–44     |   20890964 |
| 45–49     |   22708591 |
| 50–54     |   22298125 |
| 55–59     |   19664805 |
| 60–64     |   16817924 |
| 65–69     |   12435263 |
| 70–74     |    9278166 |
| 75–79     |    7317795 |
| 80–84     |    5743327 |
| 85+       |    5493433 |
|-----------+------------|


To get data for each particular age, I’ll assume ages are evenly distributed in each group, and I’ll assume the 85+ group consists of people from ages 85 to 92. [2]

With these assumptions, there are 6.4 bits of information in age. This seems plausible: if all ages were uniformly distributed between 0 and 63, there would be exactly 6 bits of information since 26 = 64.

## Birth date

If we assume birth days are uniformly distributed within each age, then age and birth date are independent. The information contained in the birth date would be the sum of the information contained in birthday and age, or 8.5 + 6.4 = 14.9 bits.

## Zip code, sex, and age

The previous post showed there are 13.8 bits of information in a zip code. There are about an equal number of men and women, so sex adds 1 bit. So zip code, sex, and birth date would give a total of 29.7 bits. Since the US population is between 228 and 229, it’s plausible that we’d have enough information to identify everyone.

We’ve made a number of simplifying assumptions. We were a little fast and loose with age data, and we’ve assumed independence several times. We know that sex and age are not independent: more babies are boys, but women live longer. Still, Latanya Sweeney found empirically that you can identify 87% of Americans using the combination of zip code, sex, and birth date [1]. Her study was based on 1990 census data, and at that time the US population was a little less than 228.

## Related posts

***

[1] Latanya Sweeney. “Simple Demographics Often Identify People Uniquely”. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh 2000. Available here.

[1] Bob Wells and Mel Tormé. “The Christmas Song.” Commonly known as “Chestnuts Roasting on an Open Fire.”

# Bits of information in a US zip code

If you know someone’s US zip code, how much do you know about them? We can use entropy to measure the amount of information in bits.

To quantify the amount of information in a zip code, we need to know how many zip codes there are, and how evenly people are divided into zip codes.

There are about 43,000 zip codes in the US. The number fluctuates over time due to small adjustments.

Average information is maximized by dividing people into categories as evenly as possible. Maximum information about one person is maximized by dividing people into categories as unevenly as possible. To see this, suppose there were only two zip codes. The information we’d expect to learn from a zip code would be maximized if we divided people into two equal groups. Suppose on the other hand you were in one zip code and everyone else in the other. On average, zip code would reveal very little about someone, though it would reveal a lot about you!

If everyone were divided evenly into one of 43,000 zip codes, the amount of information revealed by knowing someone’s zip code would be about 15.4 bits, a little more information than asking 15 independent yes/no questions, each with equally likely answers.

But zip codes are not evenly populated. How much information is there in an actual five-digit zip code? To answer that question we need to know the population of each zip code. That’s a little tricky. Zip codes represent mail delivery points, not geographical areas. Technically the US Census Bureau tracks population by ZCTA (Zip Code Tabulation Area) rather than zip code per se. Population by ZCTA is freely available, but difficult to find. I gave up trying to find the data from official sources but was able to find it here.

We can go through the data and find the probability p of someone living in each ZCTA and add up –p log2p over each area. When we do, we find that a ZTCA contains 13.83 bits of information. (We knew it had to be less than 15.4 because uneven distribution reduces entropy.)

The Safe Harbor provision of US HIPAA law lists zip codes as a quasi-identifier. Five digit zip codes do not fall under Safe Harbor. Three digit zip codes (the first three digits of five digit zip codes) do fall under Safe Harbor, mostly. Some areas are so sparsely populated that even three-digit zip code areas are considered too informative. Any three-digit zip code with fewer than 20,000 people is excluded. You can find a snapshot of the list here, though the list may change over time.

If we repeat our calculation for three-digit zip codes, we find that they carry about 9.17 bits of information. It makes little difference to the result whether you include sparse regions, exclude them, or lump them all into one region.

See the next post on information contained in age, birthday, and birth date.

Related posts:

# Dark debt

Dark debt is a form of technical debt that is invisible until it causes failures. The term was coined in the STELLA conference and codified in the conference report.

Dark debt is found in complex systems and the anomalies it generates are complex system failures. Dark debt is not recognizable at the time of creation. … It arises from the unforeseen interactions of hardware or software with other parts of the framework. …

The challenge of dark debt is a difficult one. Because it exists mainly in interactions between pieces of the complex system, it cannot be appreciated by examination of those pieces. …
Unlike technical debt, which can be detected and, in principle at least, corrected by refactoring, dark debt surfaces through anomalies. Spectacular failures like those listed above do not arise from technical debt.

Complexity posts: