Small course corrections

You don’t send a probe to explore the solar system by pointing really well.

Pluto is about 2,400 kilometers across and a few billion kilometers away. It’s not possible to calculate, much less execute, a trajectory so accurately that we could fire a projectile at Pluto and hit it. But space probes are not simply passive projectiles. They make lots of small course corrections along the way. New Horizons is on its way to Pluto and will arrive on July 14, 2015, not because the rocket that launched it was aimed precisely but because its course has been adjusted slightly all along the way.

The human world is far less predictable than celestial mechanics. It’s foolish to think we can plan far ahead without any course corrections.

Artist’s conception of New Horizons, via NASA

Remove noise, remove signal

Whenever you remove noise, you also remove at least some signal. Ideally you can remove a large portion of the noise and a small portion of the signal, but there’s always a trade-off between the two. Averaging things makes them more average.

Statistics has the related idea of bias-variance trade-off. An unfiltered signal has low bias but high variance. Filtering reduces the variance but introduces bias.

If you have a crackly recording, you want to remove the crackling and leave the music. If you do it well, you can remove most of the crackling effect and reveal the music, but the music signal will be slightly diminished. If you filter too aggressively, you’ll get rid of more noise, but create a dull version of the music. In the extreme, you get a single hum that’s the average of the entire recording.
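
Here’s a toy numerical version of that trade-off (my own sketch, with made-up numbers, using NumPy): a sine wave stands in for the music, Gaussian noise for the crackle, and a moving average for the filter. As the window widens, the leftover noise shrinks, but so does the surviving signal.

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000)
music = np.sin(2 * np.pi * 5 * t)                  # the underlying signal
recording = music + 0.5 * rng.normal(size=t.size)  # signal plus "crackle"

def moving_average(x, width):
    # Smooth x with a simple moving average of the given width
    return np.convolve(x, np.ones(width) / width, mode="same")

for width in (1, 11, 101, 501):
    noise_left = np.std(moving_average(recording, width) - moving_average(music, width))
    signal_left = np.std(moving_average(music, width))
    print(f"window {width:3d}: noise std {noise_left:.3f}, signal std {signal_left:.3f}")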

This is a metaphor for life. If you only value your own opinion, you’re an idiot in the oldest sense of the word, someone in his or her own world. Your work may have a strong signal, but it also has a lot of noise. Getting even one outside opinion greatly cuts down on the noise. But it also cuts down on the signal to some extent. If you get too many opinions, the noise may be gone and the signal with it. Trying to please too many people leads to work that is offensively bland.

Related post: The cult of average

Levels of uncertainty

The other day I heard someone say something like the following:

I can’t believe how people don’t understand probability. They don’t realize that if a coin comes up heads 20 times, on the next flip there’s still a 50-50 chance of it coming up tails.

But if I saw a coin come up heads 20 times, I’d suspect it would come up heads the next time.

There are two levels of uncertainty here. If the probability of a coin coming up heads is θ = 1/2 and the tosses are independent, then yes, the probability of a head is 1/2 each time, regardless of how many heads have shown before. The parameter θ models our uncertainty regarding which side will show after a toss of the coin. That’s the first level of uncertainty.

But what about our uncertainty in the value of θ? Twenty flips showing the same side up should cause us to question whether θ really is 1/2. Maybe it’s a biased coin and θ is greater than 1/2. Or maybe it really is a fair coin and we’ve just seen a one-in-a-million event. (Such events do happen, but only one in a million times.) Our uncertainty regarding the value of θ is a second level of uncertainty.

Frequentist statistics approaches these two kinds of uncertainty differently. That approach says that θ is a constant but unknown quantity. Probability describes the uncertainty regarding the coin toss given some θ but not the uncertainty regarding θ. The Bayesian models all uncertainty using probability. So the outcome of the coin toss given θ is random, but θ itself is also random. It’s turtles all the way down.
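
Here’s a minimal sketch of the Bayesian calculation (my own illustration, not part of the original text, and the uniform prior is an arbitrary choice): put a Beta(1, 1) prior on θ, update on 20 heads out of 20 tosses, and take the predictive probability of heads on the next toss.

from fractions import Fraction

# Beta(a, b) prior on theta; after observing h heads and t tails the
# posterior is Beta(a + h, b + t), and the predictive probability of
# heads on the next toss is the posterior mean (a + h) / (a + b + h + t).
a, b = 1, 1            # uniform prior on theta (an assumption)
heads, tails = 20, 0   # the run of twenty heads

predictive = Fraction(a + heads, a + b + heads + tails)
print(predictive, float(predictive))   # 21/22, about 0.95

So after twenty heads in a row, a Bayesian with a uniform prior would put the probability of another head at about 0.95, which matches the intuition above that the coin is probably not fair.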

It’s possible to have different degrees of uncertainty at each level. You could, for example, calculate the probability of some quantum event very accurately. If that probability is near 1/2, there’s a lot of uncertainty regarding the event itself, but little uncertainty about the parameter. High uncertainty at the first level, low uncertainty at the next level. If you warp a coin, it may not be apparent what effect that will have on the probability of the outcome. Now there’s significant uncertainty at the first and second level.

We’ve implicitly assumed that a single parameter θ describes the uncertainty in a coin toss outcome. Maybe that’s not true. Maybe the person tossing the coin has the ability to influence the outcome. (Some very skilled people can. I’ve heard rumors that Persi Diaconis is good at this.) Now we have a third level of uncertainty, uncertainty regarding our model and not just its parameter.

If you’re sure that a parameter θ describes the coin toss, but you don’t know θ, then the coin toss outcome is a known unknown and θ is an unknown unknown, a second-order uncertainty. More often, though, people use the term “unknown unknown” to describe a third-order uncertainty: unforeseen factors that are not included in a model, not even as uncertain parameters.

Deriving distributions vs fitting distributions

Sometimes you can derive a probability distribution from a list of properties it must have. For example, there are several properties that lead inevitably to the normal distribution or the Poisson distribution.

Although such derivations are attractive, they don’t apply that often, and they’re suspect when they do apply. There’s often some effect that keeps the prerequisite conditions from being satisfied in practice, so the derivation doesn’t lead to the right result.

The Poisson may be the best example of this. It’s easy to argue that certain count data have a Poisson distribution, and yet empirically the Poisson doesn’t fit so well because, for example, you have a mixture of two populations with different rates rather than one homogeneous population. (A single population whose rate is the average of the two rates would be Poisson. A mixture of Poisson distributions is not.)
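
Here’s a small simulation of that failure mode (my own sketch, with made-up rates, using NumPy): a homogeneous Poisson sample has variance close to its mean, while a 50/50 mixture of two Poisson populations with the same overall mean is overdispersed, so a Poisson model won’t fit it well.

import numpy as np

rng = np.random.default_rng(0)
N = 100_000

homogeneous = rng.poisson(6, size=N)    # one population, rate 6
rates = rng.choice([2, 10], size=N)     # half rate 2, half rate 10
mixture = rng.poisson(rates)            # mixture with the same overall mean, 6

for name, x in [("homogeneous", homogeneous), ("mixture", mixture)]:
    print(f"{name}: mean {x.mean():.2f}, variance {x.var():.2f}")
# The Poisson sample has variance close to its mean; the mixture's
# variance is much larger (about 6 + 16 = 22), so a Poisson fit fails.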

The best scenario is when a theoretical derivation agrees with empirical analysis. Theory suggests the distribution should be X, and our analysis confirms that. Hurray! The theoretical and empirical strengthen each other’s claims.

Theoretical derivations can be useful even when they disagree with empirical analysis. The theoretical distribution forms a sort of baseline, and you can focus on how the data deviate from that baseline.

 


A prime-generating formula and SymPy

Mills’ constant is a number θ such that the integer part of θ^(3^n) is prime for every positive integer n. We’ll see if we can verify this computationally with SymPy.

from sympy import isprime
from mpmath import mp, mpf, floor

# set working precision to 200 decimal places
mp.dps = 200

# Mills' constant http://oeis.org/A051021
theta = mpf("1.30637788386308069046861449260260571291678458515671364436805375996643405376682659882150140370119739570729")

for i in range(1, 7):
    # the integer part of theta^(3^i) should be prime
    n = int(floor(theta**3**i))
    print(i, n, isprime(n))

Note that the line of code defining theta wraps Mills’ constant as a string and passes it as an argument to mpf. If we just wrote theta = 1.30637788386308… then theta would be truncated to an ordinary floating point number and most of the digits would be lost. Not that I did that in the first version of the code. :)

This code says that the first five outputs are prime but that the sixth is not. That’s because we are not using Mills’ constant to enough precision. The integer part of θ^(3^6) is 85 digits long, and we don’t have enough precision to compute the last digit correctly.

If we use the value of Mills’ constant given here to 640 decimal places, and increase mp.dps to 1000, we can correctly compute the sixth term, but not the seventh.

Mills’ formula is just a curiosity with no practical use that I’m aware of, except possibly as an illustration of impractical formulas.

Risk vs Change

From an interview with Dan North:

People talk about being risk averse. “I’m very risk averse.” But actually when you scratch the surface they’re change averse. They’re terrified of change. And they wrap that up in “risk” which is a much more business-like word. … I’m risk averse, and the way I manage risk is I have many, many, many small changes because I find that very non-risky.

Fermat’s proof of his last theorem

Fermat famously claimed to have a proof of his last theorem that he didn’t have room to write down. Mathematicians have speculated ever since about what this proof might have been, though everyone is convinced the proof must have been wrong.

The usual argument for Fermat being wrong is that since it took over 350 years, and some very sophisticated mathematics, to prove the theorem, it’s highly unlikely that Fermat had a simple proof. That’s a reasonable argument, but somewhat unsatisfying because it’s risky business to speculate on what a proof must require. Who knows how complex the proof of FLT in The Book is?

André Weil offers what I find to be a more satisfying argument that Fermat did not have a proof, based on our knowledge of Fermat himself. Dana Mackenzie summarizes Weil’s argument as follows.

Fermat repeatedly bragged about the n = 3 and n = 4 cases and posed them as challenges to other mathematicians … But he never mentioned the general case, n = 5 and higher, in any of his letters. Why such restraint? Most likely, Weil argues, because Fermat had realized that his “truly wonderful proof” did not work in those cases.

Dana comments:

Every mathematician has had days like this. You think you have a great insight, but then you go out for a walk, or you come back to the problem the next day, and you realize that your great idea has a flaw. Sometimes you can go back and fix it. And sometimes you can’t.

The quotes above come from The Universe in Zero Words. I met Dana Mackenzie in Heidelberg a few weeks ago, and when I came home I looked for this book and his book on the formation of the moon, The Big Splat.


Prime-generating fractions

I posted a couple prime-generating fractions on Google+ this weekend and said that I’d post an explanation later. Here it goes.

The decimal expansion of 18966017 / 997002999 is

.019 023 029 037 047 059 073 089 107 127 149 173 199 227 257 …

where each group of three digits is a prime number.

The fraction 4099200041 / 999700029999 produces primes in groups of four digits:

.0041 0043 0047 0053 0061 0071 0083 0097 0113 0131 0151 0173 0197 0223 0251 0281 0313 0347 0383 0421 0461 0503 0547 0593 0641 0691 0743 0797 0853 0911 0971 1033 1097 1163 1231 1301 1373 1447 1523 1601…

What’s going on here? Both fractions come from infinite sums involving prime-generating polynomials. No polynomial in one variable will generate only primes, but several do produce a long sequence of primes.

Euler discovered in 1772 that n² - n + 41 is prime for values of n from -39 to 40. The second fraction above comes from Euler’s polynomial:

\frac{4099200041}{999700029999} = \sum_{n=1}^\infty \frac{n^2 - n +41}{10^{4n}}

The quartets of digits in the decimal expansion stop being prime after the 40th because Euler’s polynomial isn’t prime at n = 41.
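
Here’s a quick way to check that claim (a sketch of mine using SymPy’s isprime, not part of the original post): pull the four-digit groups out of the fraction and compare them against Euler’s polynomial.

from sympy import isprime

num, den = 4099200041, 999700029999
k, count = 4, 41   # four-digit groups; look at the first 41 of them

# First k*count digits of the decimal expansion, with leading zeros kept
digits = str(num * 10**(k * count) // den).zfill(k * count)
groups = [int(digits[k*i : k*(i+1)]) for i in range(count)]

for n, g in enumerate(groups, start=1):
    assert g == n*n - n + 41    # each group is a value of Euler's polynomial
    print(n, g, isprime(g))     # prime for n = 1..40, composite (41^2) at n = 41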

The first fraction above comes from a similar polynomial discovered by Legendre, n² + n + 17. It produces primes for n = -16 to 15.

\frac{18966017}{997002999} = \sum_{n=1}^\infty \frac{n^2 + n +17}{10^{3n}}

You can make your own prime-generating fraction by finding a prime-generating polynomial and using the process above. Euler’s polynomial produces primes of up to four digits before becoming composite, so the denominator in the sum is 10^(4n). Legendre’s polynomial produces primes of up to three digits, so the corresponding denominator in its sum is 10^(3n).


Frequently asked questions

Here, in no particular order, are a few questions people frequently ask me.

What kind of work do you do?

I’m a consultant working in math, statistics, and computing. Sometimes this means modeling, coming up with a mathematical description of a problem and seeing what I can learn from it. Sometimes this means seeing what can be learned from a set of data and determining what to do next, what decisions to make, what data to collect next, etc. Sometimes this means developing software, especially coming up with algorithms to implement math models efficiently.

I also teach. I’ve taught undergraduate and graduate courses in math and statistics. I’m not doing any classroom teaching these days, but teaching is a component of my consulting. Sometimes I serve as an interpreter, helping non-technical people digest technical material, and sometimes people ask for technical mentoring. I’ve also done some expert testimony, which is kind of a form of teaching.

Can you recommend an introductory statistics book?

No.

What programming language do you use?

In the last year or so I’ve used C++, Python, C#, R, Mathematica, sed, awk, and a little JavaScript. Over my career I’ve written more C++ than anything else, and lately I’ve mostly used Python. I’ve also worked with Perl, Java, and PowerShell, though not recently.

Why do you prefer Python to R?

I’d rather do math in a general-purpose language than do general-purpose programming in a math language.

R is designed for interactive statistical computing, and it’s good for that. But I do a lot more than interactive statistical computing. Even on a statistical project, there’s usually non-statistical work to do, such as munging text files or creating a user interface. The statistical software may just be one component in a larger system. It’s easier to do general programming in a language designed for general programming.

I like using Python for convenience or C++ for speed.

How do you charge for consulting?

My first choice is to charge by the project if a project is well-defined. But if that’s not possible, or if a project is open-ended, I’ll charge by the hour.

How do you run your Twitter accounts?

I use my personal account live and I mostly schedule my daily tip tweets in advance.

Why don’t you use XeTeX?

This may seem like an odd question, but it’s actually one I get very often. On my TeXtip Twitter account, I include tips on how to create non-English characters, such as using \AA to produce Å. Every time, someone will ask “Why not use XeTeX and just enter these characters?”

If you can “just enter” non-English characters, then you don’t need a tip. But a lot of people either don’t know how to do this or don’t have a convenient way to do so. Most English speakers only need to type foreign characters occasionally, and will find it easier, for example, to type \AA or \ss than to learn how to produce Å or ß from a keyboard. If you frequently need to enter Unicode characters, and know how to do so, then XeTeX is great.

Can I use your code?

If I post code online, you’re free to use it however you wish. This page has links to code some people have found useful. And here are some articles with code that I’ve written for Code Project.

How can you test a random number generator?

Here’s a book chapter I wrote on that.

How do you type math characters in HTML?

See this page. I prefer writing math in LaTeX, but when I’m writing HTML I like to stay in HTML if I can. I’ll use LaTeX for displayed equations, but I try to stick to HTML inline so the page doesn’t look like a ransom note.

How can I contact you?

See this page.

By the way, if you send a message to one of my daily tip Twitter accounts, I might not see it. I get more responses there than I can read. It’s better to send me email.

Contrasting Hilbert and DARPA

This morning I ran across a list of 23 math problems compiled by DARPA. I assume their choice of exactly 23 problems was meant to be an allusion to Hilbert’s famous list of 23 math problems from 1900.

Some of Hilbert’s problems were a little broad, but most had crisp mathematical statements. DARPA’s list is the opposite: a few of the questions have a crisp statement, but most are very broad. Hilbert’s problems were pure. DARPA’s problems are applied.

One of the questions that stood out to me concerns computational duality.

Duality in mathematics has been a profound tool for theoretical understanding. Can it be extended to develop principled computational techniques where duality and geometry are the basis for novel algorithms?

I think it’s safe to say the answer is “yes” because it already has been done to some degree, though the question remains as to how much more can be done.

Fibonomial coefficients

Binomial coefficients can be defined by

{n \choose k} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k(k-1)(k-2)\cdots 1}

Edouard Lucas defined the Fibonomial coefficients analogously by

{n \choose k}_{\cal F} = \frac{F_nF_{n-1}F_{n-2}\cdots F_{n-k+1}}{F_kF_{k-1}F_{k-2}\cdots F_1}

where F_j is the jth Fibonacci number.

Binomial coefficients satisfy

{n \choose k} = {n-1 \choose k} + {n-1 \choose k-1}

and Fibonomial coefficients satisfy an analogous identity

{n \choose k}_{\cal F} = F_{k-1}{n-1 \choose k}_{\cal F} + F_{n-k+1}{n-1 \choose k-1}_{\cal F}

Incidentally, this identity can be used to show that Fibonomial coefficients are integers, which isn’t obvious from the definition.
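
Here’s a short script (my own sketch, not from the original post) that computes Fibonomial coefficients straight from the definition, confirms that they come out as integers, and checks the identity above for small n and k.

from functools import lru_cache
from math import prod

@lru_cache(maxsize=None)
def fib(n):
    # F_0 = 0, F_1 = F_2 = 1
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def fibonomial(n, k):
    # Fibonomial coefficient straight from the definition
    num = prod(fib(n - i) for i in range(k))
    den = prod(fib(k - i) for i in range(k))
    q, r = divmod(num, den)
    assert r == 0    # the quotient is always an integer
    return q

for n in range(2, 12):
    for k in range(1, n):
        lhs = fibonomial(n, k)
        rhs = fib(k - 1) * fibonomial(n - 1, k) + fib(n - k + 1) * fibonomial(n - 1, k - 1)
        assert lhs == rhs
print("identity holds for n up to 11")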


Visualizing suffix primes

A few months ago I wrote about suffix primes, prime numbers such that all their “suffixes” are prime. For example, 653 is prime, and it’s a suffix prime because 53 and 3 are primes.

James Griffin pointed out that the primes have a natural tree structure, the parent of each number being that number with its most significant digit removed. There are two separate trees: one for numbers ending in 3 and another for numbers ending in 7. The numbers 2 and 5 are isolated nodes with no descendants.

There are 4260 suffix primes, about half ending in 3 and half ending in 7. A full list is available here. The graph of either tree is a mess, but here we show the first three levels of the graph for numbers ending in 3.

And here’s the corresponding graph for numbers ending in 7.

How bushy are these trees? The number of nodes at each level increases up to a maximum of 545 nodes at level 9, then decreases. The distribution of the suffix primes across the levels of their respective trees is given in the histogram below.
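
For anyone who wants to reproduce those counts, here’s a minimal sketch (mine, using SymPy’s isprime): grow the trees level by level by prepending a nonzero digit to each suffix prime and keeping whatever is still prime.

from sympy import isprime

level = [2, 3, 5, 7]      # the roots: single-digit primes
suffix_primes = list(level)
digits = 1                # number of digits at the current level

while level:
    next_level = []
    for p in level:
        for d in range(1, 10):          # prepend a nonzero digit
            q = d * 10**digits + p
            if isprime(q):
                next_level.append(q)
    suffix_primes.extend(next_level)
    level = next_level
    digits += 1

print(len(suffix_primes))                           # 4260
print(sum(p % 10 == 3 for p in suffix_primes),
      sum(p % 10 == 7 for p in suffix_primes))      # roughly half and half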