Small course corrections

You don’t send a probe to Pluto by pointing really well.

Pluto is about one thousand kilometers across and a few billion kilometers away. It’s not possible to calculate, much less execute, a trajectory so accurately that we could fire a projectile at Pluto and hit it. But space probes are not simply passive projectiles. They make lots of small course corrections along the way. New Horizons is on its way to Pluto and will arrive on July 14, 2015, not because the rocket that launched it was aligned precisely but because its course has changed slightly all along the way. (How do probes do this? Kalman filters and their extensions.)

The human world is far less predictable than celestial mechanics. It’s foolish to think we can plan far ahead without any course corrections.

Artist conception of New Horizons via NASA

Academic freedom

This tweet from Luis Pedro Coelho says so much in 140 characters:

“Oh, the intellectual freedom of academia” he thought while filling out a time sheet which checks that he does not work on non-grant science.

Remove noise, remove signal

Whenever you remove noise, you also remove at least some signal. Ideally you can remove a large portion of the noise and a small portion of the signal, but there’s always a trade-off between the two. Averaging things makes them more average.

Statistics has the related idea of bias-variance trade-off. An unfiltered signal has low bias but high variance. Filtering reduces the variance but introduces bias.

If you have a crackly recording, you want to remove the crackling and leave the music. If you do it well, you can remove most of the crackling effect and reveal the music, but the music signal will be slightly diminished. If you filter too aggressively, you’ll get rid of more noise, but create a dull version of the music. In the extreme, you get a single hum that’s the average of the entire recording.

This is a metaphor for life. If you only value your own opinion, you’re an idiot in the oldest sense of the word, someone in his or her own world. Your work may have a strong signal, but it also has a lot of noise. Getting even one outside opinion greatly cuts down on the noise. But it also cuts down on the signal to some extent. If you get too many opinions, the noise may be gone and the signal with it. Trying to please too many people leads to work that is offensively bland.

Related post: The cult of average

Levels of uncertainty

The other day I heard someone say something like the following:

I can’t believe how people don’t understand probability. They don’t realize that if a coin comes up heads 20 times, on the next flip there’s still a 50-50 chance of it coming up tails.

But if I saw a coin come up heads 20 times, I’d suspect it would come up heads the next time.

There are two levels of uncertainty here. IF the probability of a coin coming up heads is θ = 1/2 and the tosses are independent, then yes, the probability of a head is 1/2 each time, regardless of how many heads have shown before. The parameter θ models our uncertainty regarding which side will show after a toss of the coin. That’s the first level of uncertainty.

But what about our uncertainty in the value of θ? Twenty flips showing the same side up should cause us to question whether θ really is 1/2. Maybe it’s a biased coin and θ is greater than 1/2. Or maybe it really is a fair coin and we’ve just seen a one-in-a-million event. (Such events do happen, but only one in a million times.) Our uncertainty regarding the value of θ is a second level of uncertainty.

Frequentist statistics approaches these two kinds of uncertainty differently. That approach says that θ is a constant but unknown quantity. Probability describes the uncertainty regarding the coin toss given some θ but not the uncertainty regarding θ. The Bayesian models all uncertainty using probability. So the outcome of the coin toss given θ is random, but θ itself is also random. It’s turtles all the way down.

It’s possible to have different degrees of uncertainty at each level. You could, for example, calculate the probability of some quantum event very accurately. If that probability is near 1/2, there’s a lot of uncertainty regarding the event itself, but little uncertainty about the parameter. High uncertainty at the first level, low uncertainty at the next level. If you warp a coin, it may not be apparent what effect that will have on the probability of the outcome. Now there’s significant uncertainty at the first and second level.

We’ve implicitly assumed that a single parameter θ describes the uncertainty in a coin toss outcome. Maybe that’s not true. Maybe the person tossing the coin has the ability to influence the outcome. (Some very skilled people can. I’ve heard rumors that Persi Diaconis is good at this.) Now we have a third level of uncertainty, uncertainty regarding our model and not just its parameter.

If you’re sure that a parameter θ describes the coin toss, but you don’t know θ, then the coin toss outcome is an known unknown and θ is an unknown unknown, a second-order uncertainty. More often though people use the term “unknown unknown” to describe a third-order uncertainty, unforeseen factors that are not included in a model, not even as uncertain parameters.

Deriving distributions vs fitting distributions

Sometimes you can derive a probability distributions from a list of properties it must have. For example, there are several properties that lead inevitably to the normal distribution or the Poisson distribution.

Although such derivations are attractive, they don’t apply that often, and they’re suspect when they do apply. There’s often some effect that keeps the prerequisite conditions from being satisfied in practice, so the derivation doesn’t lead to the right result.

The Poisson may be the best example of this. It’s easy to argue that certain count data have a Poisson distribution, and yet empirically the Poisson doesn’t fit so well because, for example, you have a mixture of two populations with different rates rather than one homogeneous population. (Averages of Poisson distributions have a Poisson distribution. Mixtures of Poisson distributions don’t.)

The best scenario is when a theoretical derivation agrees with empirical analysis. Theory suggests the distribution should be X, and our analysis confirms that. Hurray! The theoretical and empirical strengthen each other’s claims.

Theoretical derivations can be useful even when they disagree with empirical analysis. The theoretical distribution forms a sort of baseline, and you can focus on how the data deviate from that baseline.

More posts on probability distributions

A prime-generating formula and SymPy

Mills’ constant is a number θ such that the integer part of θ raised to a power of 3 is always a prime. We’ll see if we can verify this computationally with SymPy.

from sympy import floor, isprime
from sympy.mpmath import mp, mpf

# set working precision to 200 decimal places
mp.dps = 200

# Mills' constant http://oeis.org/A051021
theta = mpf("1.30637788386308069046861449260260571291678458515671364436805375996643405376682659882150140370119739570729")

for i in range(1, 7):
    n = floor(theta**3**i)
    print i, n, isprime(n)

Note that the line of code defining theta wraps Mill’s constant as a string and passes it as an argument to mpf. If we just wrote theta = 1.30637788386308… then theta would be truncated to an ordinary floating point number and most of the digits would be lost. Not that I did that in the first version of the code. :)

This code says that the first five outputs are prime but that the sixth is not. That’s because we are not using Mills’ constant to enough precision. The integer part of θ^3^6 is 85 digits long, and we don’t have enough precision to compute the last digit correctly.

If we use the value of Mills’ constant given here to 640 decimal places, and increase mp.dps to 1000, we can correctly compute the sixth term, but not the seventh.

Mill’s formula is just a curiosity with no practical use that I’m aware of, except possibly as an illustration of impractical formulas.

Risk vs Change

From an interview with Dan North:

People talk about being risk averse. “I’m very risk averse.” But actually when you scratch the surface they’re change averse. They’re terrified of change. And they wrap that up in “risk” which is a much more business-like word. … I’m risk averse, and the way I manage risk is I have many, many, many small changes because I find that very non-risky.

Fermat’s proof of his last theorem

Fermat famously claimed to have a proof of his last theorem that he didn’t have room to write down. Mathematicians have speculated ever since what this proof must have been, though everyone is convinced the proof must have been wrong.

The usual argument for Fermat being wrong is that since it took over 350 years, and some very sophisticated mathematics, to prove the theorem, it’s highly unlikely that Fermat had a simple proof. That’s a reasonable argument, but somewhat unsatisfying because it’s risky business to speculate on what a proof must require. Who knows how complex the proof of FLT in The Book is?

André Weil offers what I find to be a more satisfying argument that Fermat did not have a proof, based on our knowledge of Fermat himself. Dana Mackinzie summarizes Weil’s argument as follows.

Fermat repeatedly bragged about the n = 3 and n = 4 cases and posed them as challenges to other mathematicians … But the never mentioned the general case, n = 5 and higher, in any of his letters. Why such restraint? Most likely, Weil argues, because Fermat had realized that his “truly wonderful proof” did not work in those cases.

Dana comments:

Every mathematician has had days like this. You think you have a great insight, but then you go out for a walk, or you come back to the problem the next day, and you realize that your great idea has a flaw. Sometimes you can go back and fix it. And sometimes you can’t.

The quotes above come from The Universe in Zero Words. I met Dana Mackinzie in Heidelberg a few weeks ago, and when I came home I looked for this book and his book on the formation of the moon, The Big Splat.

More on Fermat’s last theorem