Six sigma events

I saw on Twitter this afternoon a paraphrase of a quote from Nassim Taleb to the effect that if you see a six-sigma event, that’s evidence that it wasn’t really a six-sigma event.

What does that mean? Six sigma means six standard deviations away from the mean of a probability distribution, sigma (σ) being the common notation for a standard deviation. Moreover, the underlying distribution is implicitly a normal (Gaussian) distribution; people don’t commonly talk about “six sigma” in the context of other distributions [1]. Here’s a table to indicate the odds against a k-sigma event for various k.

        |-------+-----------------|
        | Sigma | Odds            |
        |-------+-----------------|
        |     1 | 2 : 1           |
        |     2 | 21 : 1          |
        |     3 | 370 : 1         |
        |     4 | 16,000 : 1      |
        |     5 | 1,700,000 : 1   |
        |     6 | 500,000,000 : 1 |
        |-------+-----------------|
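
Here’s a short Python sketch using scipy that reproduces the table, treating a k-sigma event as an observation more than k standard deviations from the mean in either direction. (The table rounds the results.)

    from scipy.stats import norm

    # Odds against a k-sigma event under a normal distribution
    for k in range(1, 7):
        p = 2 * norm.sf(k)       # P(|Z| > k) for standard normal Z
        print(k, (1 - p) / p)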

If you see something that according to your assumptions should happen twice in a billion tries, maybe you’ve seen something extraordinarily rare, or maybe your assumptions were wrong. Taleb’s comment suggests the latter is more likely.

Bayes rule and Bayes factors

You could formalize this with Bayes rule. For example, suppose you’re 99% sure the thing you’re looking at has a normal distribution with variance 1, but you’re willing to concede there’s a 1% chance that what you’re looking at has a fat-tailed distribution, say a Student t distribution with 10 degrees of freedom, rescaled to also have variance 1.

[Plot: normal distribution vs. t distribution with 10 degrees of freedom]

It’s hard to tell the two distributions apart, especially in the tails. Both densities are small in the tails, but the normal is much smaller relative to the t.

Now suppose you’ve seen an observation greater than 6. The Bayes factor in favor of the t distribution hypothesis, the ratio of the probabilities of seeing such an observation under each hypothesis, is about 27,000. This means that even though before seeing any data you put the odds at 99 to 1 in favor of the normal distribution, after seeing such a large observation you would put the odds at about 272 to 1 in favor of the t distribution.
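
Here’s one way to compute these numbers with scipy, taking the event to be an observation greater than 6 and comparing its probability under the two models:

    from scipy.stats import norm, t

    nu = 10
    c = ((nu - 2)/nu)**0.5        # rescales t(10) to have variance 1

    p_t = t.sf(6/c, nu)           # P(X > 6) under the rescaled t
    p_norm = norm.sf(6)           # P(X > 6) under the standard normal

    bayes_factor = p_t / p_norm   # roughly 27,000
    print(bayes_factor / 99)      # posterior odds given 99:1 prior odds, about 272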

If you allow a small possibility that your assumption of a normal distribution is wrong (see Cromwell’s rule), then seeing an extreme event will radically change your mind. You don’t have to think the fat-tailed distribution is equally likely, just a possibility. If you did think a priori that both possibilities were equally likely, the posterior odds for the t distribution would be about 27,000 to 1.

In this example we compared the normal distribution to a very specific and somewhat arbitrary alternative, but the alternative was just an example. You could have picked from a wide variety of alternatives and gotten a qualitatively similar result, reversing your a priori confidence in a normal model.

By the way, a t distribution with 10 degrees of freedom is not a very fat-tailed distribution. It has fatter tails than a normal for sure, but not nearly as fat as a Cauchy, which corresponds to a t with only one degree of freedom. If we had used a distribution with a heavier tail, the posterior odds in favor of that distribution would have been higher.

***

[1] A six-sigma event isn’t necessarily that rare unless your probability distribution is normal. By Chebyshev’s inequality (Markov’s inequality applied to the squared deviation from the mean), the probability is at most 1/36 for any distribution with finite variance. The rarity of six-sigma events comes from the assumption of a normal distribution more than from the number of sigmas per se.

Calendars and continued fractions

Calendars are based on three frequencies: the rotation of the Earth on its axis, the revolution of the moon around the Earth, and the revolution of the Earth around the sun. Calendars are complicated because none of these periods is a simple multiple of any other. The ratios are certainly not integers, but they’re not even fractions with a small denominator.

As I wrote about the other day in the context of rational approximations for π, the most economical rational approximations of an irrational number come from convergents of continued fractions. The book Calendrical Calculations applies continued fractions to various kinds of calendars.

Ratio of years to days

The continued fraction for the number of days in a year is as follows.

365.2424177 = 365 + \cfrac{1}{4 + \cfrac{1}{7+ \cfrac{1}{1 + \cfrac{1}{2 + \cfrac{1}{1 + \cfrac{1}{5 + \ldots }}}}}}

This means that the best approximations for the length of a year are 365 days plus a fraction of 1/4, or 7/29, or 8/33, or 23/95, etc.

You could have one leap day every four years. More accurate would be 7 leap days every 29 years, etc. The Gregorian calendar has 97 leap days every 400 years.
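
Here’s a short Python sketch that turns continued fraction coefficients into convergents. Feeding it the coefficients 4, 7, 1, 2 from the expansion above reproduces the fractions just listed:

    from fractions import Fraction

    def convergents(coeffs):
        # Evaluate each truncation of the continued fraction
        # [a1, a2, ...] (integer part omitted) from the inside out.
        for n in range(1, len(coeffs) + 1):
            frac = Fraction(0)
            for a in reversed(coeffs[:n]):
                frac = 1/(a + frac)
            yield frac

    for c in convergents([4, 7, 1, 2]):
        print(c)    # 1/4, 7/29, 8/33, 23/95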

Ratio of years to months

If we express the ratio of the length of the year to the length of the month, we get

\frac{365.24244}{29.53059} = 12 +\cfrac{1}{2 + \cfrac{1}{1+ \cfrac{1}{2 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{18 + \cfrac{1}{3 + \ldots }}}}}}}

By taking various convergents we get 37/3, 99/8, 136/11, etc. So if you want to design successively more accurate lunisolar calendars, you’d have 37 lunar months every 3 years, 99 lunar months every 8 years, etc.


Computing smooth max without overflow

Erik Erlandson sent me a note saying he found my post on computing the soft maximum helpful. (If you’re unfamiliar with the soft maximum, here’s a brief description of what it is and how you might use it.) Erik writes

I used your post on practical techniques for computing smooth max, which will probably be the difference between being able to actually use my result and not. I wrote up what I’m planning to do here.

His post extends the idea in my post to computing the gradient and Hessian of the soft max.
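
If you haven’t seen the trick before, here’s a minimal sketch of the overflow-safe computation: factor the larger argument out of log(exp(x) + exp(y)) so that exp never sees a large input.

    import math

    def soft_maximum(x, y):
        # log(exp(x) + exp(y)) computed without overflowing exp()
        big, small = max(x, y), min(x, y)
        return big + math.log1p(math.exp(small - big))

    print(soft_maximum(1000, 1001))   # 1001.3132...; the naive form overflows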

Proving life exists on Earth

NASA’s Galileo mission was primarily designed to explore Jupiter and its moons. In 1989, the Galileo probe started out traveling away from Jupiter in order to do a gravity assist swing around Venus. About a year later it also did a gravity assist maneuver around Earth. Carl Sagan suggested that when passing Earth, the Galileo probe should turn its sensors on Earth to look for signs of life. [1]

[Image: artist’s conception of the Galileo probe surveying Jupiter and its moons]

Now obviously we know there’s life on Earth. But if we’re going to look for life on other planets, it’s reasonable to ask that our methods return positive results when examining the one planet we know for sure does host life. So scientists looked at the data from Galileo as if it were coming from another planet to see what patterns in the data might indicate life.

I’ve started using “looking for life on Earth” as a metaphor. I’m working on a project right now where I’m looking for a needle in a haystack, or rather another needle in a haystack: I knew that one needle existed before I got started. So I want to make sure that my search procedure at least finds the one positive result I already know exists. I explained to a colleague that we need to make sure we can at least find life on Earth.

This reminds me of simulation work. You make up a scenario and treat it as the state of nature. Then pretend you don’t know that state, and see how successful your method is at discovering that state. It’s sort of a schizophrenic way of thinking, pretending that half of your brain doesn’t know what the other half is doing.

It also reminds me of software testing. The most trivial tests can be surprisingly effective at finding bugs. So you write a few tests to confirm that there’s life on Earth.


[1] I found out about Galileo’s Earth reconnaissance listening to the latest episode of the .NET Rocks! podcast.

Combinatorics, just beyond the basics

Most basic combinatorial problems can be solved in terms of multiplication, permutations, and combinations. The next step beyond the basics, in my experience, is counting selections with replacement. Often when I run into a problem that is not quite transparent, it boils down to this.

Examples of selection with replacement

Here are three problems that reduce to counting selections with replacement.

Looking ahead in an experiment

For example, suppose you’re running an experiment in which you randomize to n different treatments and you want to know how many ways the next k subjects can be assigned to treatments. So if you had treatments A, B, and C, and five subjects, you could assign all five to A, or four to A and one to B, etc., for a total of 21 possibilities. You’re choosing 5 things from a set of 3, with replacement because you can (and will) assign the same treatment more than once.

Partial derivatives

For another example, if you’re taking the kth order partial derivatives of a function of n variables, you’re choosing k things (variables to differentiate with respect to) from a set of n (the variables). Equality of mixed partials for smooth functions says that all that matters is how many times you differentiate with respect to each variable. Order doesn’t matter: differentiating with respect to x and then with respect to y gives the same result as taking the derivatives in the opposite order, as long as the function you’re differentiating has enough derivatives. I wrote about this example here.

Sums over even terms

I recently had an expression come up that required summing over n distinct variables, each taking a non-negative even value, with the values summing to 2k. How many ways can that be done? As many as there are ways of dividing all the values in half and asking that they sum to k. Here the thing being chosen is a variable, and since the halved values sum to k, I have a total of k choices to make, with replacement since I can choose a variable more than once. So again I’m choosing k things with replacement from a set of size n.

Formula

I wrote up a set of notes on sampling with replacement that I won’t duplicate here, but in a nutshell:

\left( {n \choose k} \right) = {n + k - 1 \choose k}

The symbol on the left is Richard Stanley’s suggested notation for the number of ways to select k things with replacement from a set of size n. It turns out that this equals the expression on the right side. The derivation isn’t too long, but it’s not completely obvious either. See the aforementioned notes for details.

I started by saying basic combinatorial problems can be reduced to multiplication, permutations, and combinations. Sampling with replacement can be reduced to a combination, the right side of the equation above, but with non-obvious arguments. Hence the need to introduce a new symbol, the one on the left, that maps directly to problem statements.
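
In Python (3.8 or later) the formula is a one-liner via math.comb. Using the treatment example from above:

    from math import comb

    def multiset_coefficient(n, k):
        # number of ways to choose k things from n with replacement
        return comb(n + k - 1, k)

    print(multiset_coefficient(3, 5))   # 21 ways to assign 5 subjects to 3 treatments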

Enumerating possibilities

Sometimes you just need to count how many ways one can select k things with replacement from a set of size n. But often you need to actually enumerate the possibilities, say to loop over them in software.

In conducting a clinical trial, you may want to enumerate all the ways pending data could turn out. If you’re going to act the same way, no matter how the pending data work out, there’s no need to wait for the missing data before proceeding. This is something I did many times when I worked at MD Anderson Cancer Center.

When you evaluate a multivariate Taylor series, you need to carry out a sum that has a term corresponding to each partial derivative. You could naively sum over all possible derivatives, not taking into account equality of mixed partials, but that could be very inefficient, as described here.

The example above summing over even partitions of an even number also comes out of a problem with multivariate Taylor series, one in which odd terms drop out by symmetry.

To find algorithms for enumerating selections with replacement, see Knuth’s TAOCP, volume 4, Fascicle 3.
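
The Python standard library also provides this enumeration directly. A quick sketch, using the treatment assignment example from earlier:

    from itertools import combinations_with_replacement

    # all 21 ways to assign 5 subjects to treatments A, B, C,
    # ignoring the order of the subjects
    for selection in combinations_with_replacement("ABC", 5):
        print("".join(selection))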


10 best rational approximations for pi

It’s easy to create rational approximations for π. Every time you write down π to a few decimal places, that’s a rational approximation. For example, 3.14 = 314/100. But that’s not the best approximation.

Think of the denominator of your fraction as something you have to buy. If you have enough budget to buy a three-digit denominator, then you’re better off buying 355/113 rather than 314/100 because the former is more accurate. The approximation 355/113 is accurate to 6 decimal places, whereas 314/100 is only accurate to 2 decimal places.

There’s a way to find the most economical rational approximations to π, or any other irrational number, using continued fractions. If you’re interested in details, see the links below.

Here are the 10 best rational approximations to π, found via continued fractions.

    |----------------+----------|
    | Fraction       | Decimals |
    |----------------+----------|
    | 3              |      0.8 |
    | 22/7           |      2.9 |
    | 333/106        |      4.1 |
    | 355/113        |      6.6 |
    | 103993/33102   |      9.2 |
    | 104348/33215   |      9.5 |
    | 208341/66317   |      9.9 |
    | 312689/99532   |     10.5 |
    | 833719/265381  |     11.1 |
    | 1146408/364913 |     11.8 |
    |----------------+----------|

If you only want to know the number of correct decimal places, ignore the fractional parts of the numbers in the Decimals column above. For example, 22/7 gives two correct decimal places. But it almost gives three. (Technically, the Decimals column gives the −log10 of the absolute error.)
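
Here’s a short Python sketch that reproduces the table using the standard recurrence for convergents. Double precision is enough to recover the first ten continued fraction coefficients of π, though the floating point expansion can’t be trusted much beyond that.

    from math import pi, log10

    def cf_coefficients(x, n):
        # first n continued fraction coefficients of x
        for _ in range(n):
            a = int(x)
            yield a
            x = 1/(x - a)

    # convergents p/q via p_n = a_n p_{n-1} + p_{n-2}, likewise for q
    p, q, p_prev, q_prev = 1, 0, 0, 1
    for a in cf_coefficients(pi, 10):
        p, q, p_prev, q_prev = a*p + p_prev, a*q + q_prev, p, q
        print(f"{p}/{q}  {-log10(abs(pi - p/q)):.1f}")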

In case you’re curious, here’s a plot of the absolute errors on a log scale.

[Plot: errors in rational approximations to π, log scale]


Fixed points of logistic function

Here’s an interesting problem that came out of a logistic regression application. The input variable was between 0 and 1, and someone asked when and where the logistic transformation

f(x) = 1/(1 + exp(a + bx))

has a fixed point, i.e. f(x) = x.

So given logistic regression parameters a and b, when does the logistic curve given by y = f(x) cross the line y = x? Do they always cross? Can they cross twice?

There’s always at least one solution. Because f(x) is strictly between 0 and 1, the function

g(x) = f(x) − x

is positive at 0 and negative at 1, and by the intermediate value theorem g(x) must be zero for some x between 0 and 1.

Sometimes f has only one fixed point. It may have two or three fixed points, as demonstrated in the graph below. The case of two fixed points is unstable: the logistic curve is tangent to the line y = x at one point, and a tiny change would turn this tangent point into either no crossing or two crossings.

[Graph: logistic curves crossing the identity line]

If |b| < 1, then you can show that the function f is a contraction map on [0, 1]. In that case there is a unique solution to f(x) = x, and you can find it by starting with an arbitrary value for x and repeatedly applying f to it. For example, if a = 1 and b = 0.8 and we start with x = 0, after applying f ten times we get x = f(x) = 0.233790157.
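
Here’s that iteration in Python:

    from math import exp

    def f(x, a=1, b=0.8):
        return 1/(1 + exp(a + b*x))

    x = 0
    for _ in range(10):
        x = f(x)       # fixed point iteration
    print(x)           # 0.23379015...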

There are a couple questions left to finish this up. How can you tell from a and b how many fixed points there will be? The condition |b| < 1 is sufficient for f to be a contraction map on [0, 1]. Can you find necessary and sufficient conditions?

Related post: Sensitivity of logistic regression prediction

Line art

A new video from 3Blue1Brown is about visualizing derivatives as stretching and shrinking factors. Along the way they consider the function f(x) = 1 + 1/x.

Iterations of f converge on the golden ratio, no matter where you start (with one exception). The video creates a graph where they connect values of x on one line to values of f(x) on another line. Curiously, there’s an oval that emerges where no lines cross.

Here’s a little Python I wrote to play with this:

    import matplotlib.pyplot as plt
    from numpy import linspace

    N = 70
    x = linspace(-3, 3, N)   # N even, so x never contains zero
    y = 1 + 1/x

    # connect each point (x, 0) to (f(x), 1)
    for i in range(N):
        plt.plot([x[i], y[i]], [0, 1])
    plt.xlim(-5, 5)
    plt.show()

And here’s the output:

[Plot: lines from (x, 0) to (f(x), 1), with an oval emerging where no lines cross]

In the plot above I just used matplotlib’s default color sequence for each line. In the plot below, I used fewer lines (N = 50) and made a couple of changes to the plot command: specifying the color for each line and putting the x line above the y line.

        plt.plot([x[i], y[i]], [0, 1], c="#243f6a")

If you play around with the Python code, you probably want to keep N even. This prevents the x array from containing zero.

Update: Here’s a variation that extends the lines connecting (x, 0) and (y, 1). And I make a few other changes while I’m at it.

    N = 200
    x = linspace(-10, 10, N)
    y = 1 + 1/x
    z = 2*y - x   # extend the line through (x, 0) and (y, 1) to height 2

    for i in range(N):
        plt.plot([x[i], z[i]], [0, 2], c="#243f6a")
    plt.xlim(-10, 10)
    plt.show()

Making a career out of the chain rule

When I was a teenager, my uncle gave me a calculus book and told me that mastering calculus was the most important thing I could do for starting out in math. So I learned the basics of calculus from that book. Later I read Michael Spivak’s two calculus books. I took courses that built on calculus, and studied generalizations of calculus such as calculus with complex variables, calculus in Banach spaces, etc. I taught calculus. After a while, I started to feel maybe I’d mastered calculus.

Last year I started digging into automatic differentiation and questioned whether I really had mastered calculus. At a high level, automatic differentiation is “just” an application of the chain rule in many variables. But you can make a career out of exploring automatic differentiation (AD), and many people do. The basics of AD are not that complicated, but you can go down a deep rabbit hole exploring optimal ways to implement AD in various contexts.
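
To make the connection with the chain rule concrete, here’s a toy sketch of forward-mode AD using dual numbers. This is an illustration of the idea, not how production AD systems are implemented: each number carries its derivative along with its value, and the arithmetic operators apply the sum, product, and chain rules automatically.

    class Dual:
        # a value together with its derivative
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value*other.value,
                        self.deriv*other.value + self.value*other.deriv)

    x = Dual(3.0, 1.0)         # seed dx/dx = 1
    y = x*x + x                # y = x^2 + x
    print(y.value, y.deriv)    # 12.0 and f'(3) = 2*3 + 1 = 7.0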

You can make a career out of things that seem even simpler than AD. Thousands of people have made a career out of solving the equation Ax = b where A is a large matrix and the vector b is given. In high school you solve two equations in two unknowns, then three equations in three unknowns, and in principle you could keep going. But you don’t solve a sparse system of millions of equations the same way. When you consider efficiency, accuracy, limitations of computer arithmetic, parallel computing, etc., the humble equation Ax = b is not so simple to solve.

As Richard Feynman said, nearly everything is really interesting if you go into it deeply enough.
