From David Jacobs:
Code is like poetry; most of it shouldn’t have been written.
Matt Brigg’s comment on outliers in his post Tyranny of the mean:
Coontz used the word “outliers”. There are no such things. There can be mismeasured data, i.e. incorrect data, say when you tried to measure air temperature but your thermometer fell into boiling water. Or there can be errors in recording the data; transposition and such forth. But excluding mistakes, and the numbers you meant to measure are the numbers you meant to measure, there are no outliers. There are only measurements which do not accord with your theory about the thing of interest.
I have a slight quibble with this description of outliers. Some people use the term to mean legitimate extreme values, and some use the term to mean values that “didn’t really happen” in some sense. I assume Matt is criticizing the latter. For example, Michael Jordan’s athletic ability is an outlier in the former sense. He’s only an outlier in the latter sense if someone decides he “doesn’t count” in some context.
A few weeks ago I said this about outliers:
When you reject a data point as an outlier, you’re saying that the point is unlikely to occur again, despite the fact that you’ve already seen it. This puts you in the curious position of believing that some values you have not seen are more likely than one of the values you have in fact seen.
Sometimes you have to exclude a data point because you believe it is far more likely to be a mistake than an accurate measurement. You may also decide that an extreme value is legitimate, but that you wish to exclude it from your model. The latter should be done with fear and trembling, or at least with an explicit disclaimer.
Related post: The cult of average
The term overfitting usually describes fitting too complex a model to available data. But it is possible to overfit a model before there are any data.
An experimental design, such as a clinical trial, proposes some model to describe the data that will be collected. For simple, well-known models the behavior of the design may be known analytically. For more complex or novel methods, the behavior is evaluated via simulation.
If an experimental design makes strong assumptions about data, and is then simulated with scenarios that follow those assumptions, the design should work well. So designs must be evaluated using scenarios that do not exactly follow the model assumptions. Here lies a dilemma: how far should scenarios deviate from model assumptions? If they do not deviate at all, you don’t have a fair evaluation. But deviating too far is unreasonable as well: no method can be expected to work well when its assumptions are flagrantly violated.
With complex designs, it may not be clear to what extent scenarios deviate from modeling assumptions. The method may be robust to some kinds of deviations but not to others. Simulation scenarios for complex designs are samples from a high dimensional space, and it is impossible to adequately explore a high dimensional space with a small number of points. Even if these scenarios were chosen at random — which would be an improvement over manually selecting scenarios that present a method in the best light — how do you specify a probability distribution on the scenarios? You’re back to a variation on the previous problem.
Once you have the data in hand, you can try a complex model and see how well it fits. But with experimental design, the model is determined before there are any data, and thus there is no possibility of rejecting the model for being a poor fit. You might decide after it’s too late, after the data have been collected, that the model was a poor fit. However, retrospective model criticism is complicated for adaptive experimental designs because the model influenced which data were collected.
This is especially a problem for one-of-a-kind experimental designs. When evaluating experimental designs — not the data in the experiment but the experimental design itself — each experiment is one data point. With only one data point, it’s hard to criticize a design. This means we must rely on simulation, where it is possible to obtain many data points. However, this brings us back to the arbitrary choice of simulation scenarios. In this case there are no empirical data to test the model assumptions.
Suppose a = 2485144657 and b = 7751389993. Here are two questions: What is the last digit of a*b? What is the median digit of a*b?
In both questions it is conceptually necessary to compute a*b, but not logically necessary. Both are questions about a*b, so computing the product is conceptually necessary. But there is no logical necessity to actually compute a*b in order to answer a question about it.
In the first question, there’s an obvious short cut: multiply the last digits of a and b and see that the last digit of the product must be 1.
In the second question, it is conceivable that there is some way to find the median digit that is less work than computing a*b first, though I don’t see how. Conceptually, you need to find the digits that make up a*b, sort them, and select the one in the middle. But it is conceivable, for example, that there is some way to find the digits of a*b that is less work than finding them in the right order, i.e. computing a*b.
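Both questions can be answered with a few lines of Python, a sketch that confirms the last-digit shortcut (the variable names are my own):

```python
a = 2485144657
b = 7751389993

# Last digit: only the last digits of a and b matter.
# 7 * 3 = 21, so the last digit of a*b must be 1.
last = (a % 10) * (b % 10) % 10

# Median digit: no apparent shortcut, so compute the full product,
# sort its digits, and take the one in the middle.
digits = sorted(str(a * b))
median = digits[len(digits) // 2]

print(last, median)
```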
I brought up the example above to use it as a metaphor.
In your work, how can you tell whether a problem is more like the first question or the second? Are you presuming you have to do something that you don’t? Are you assuming something is logically necessary when it is only conceptually necessary?
When I’m stuck on a problem, I often ask myself whether I really need to do what I assume I need to do. Sometimes that helps me get unstuck.
Related post: Maybe you only need it because you have it
A few days ago I wrote about the rise and fall of binomial coefficients. There I gave a proof that binomial coefficients are log-concave, and so a local maximum has to be a global maximum.
Here I’ll give a one-line proof of the same result, taking advantage of the following useful theorem.
Let p(x) = c_0 + c_1 x + c_2 x^2 + … + c_n x^n be a polynomial all of whose zeros are real and negative. Then the coefficient sequence c_k is strictly log concave.
This is Theorem 4.5.2 from Generatingfunctionology, available for download here.
Now for the promised one-line proof. Binomial coefficients are the coefficients of (x + 1)^n, which is clearly a polynomial with only real negative roots.
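For a particular row of Pascal’s triangle the conclusion is easy to verify numerically (a quick sketch using Python’s math.comb; the choice n = 10 is arbitrary):

```python
from math import comb

# Strict log-concavity of a row of binomial coefficients:
# C(n, k)^2 > C(n, k-1) * C(n, k+1) for 1 <= k <= n-1.
n = 10
row = [comb(n, k) for k in range(n + 1)]
assert all(row[k]**2 > row[k-1] * row[k+1] for k in range(1, n))
```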
The same theorem shows that Stirling numbers of the first kind, s(n, k), are log concave for fixed n and k ≥ 1. This is because these numbers are the coefficients of x^k in
(x + 1)(x + 2) … (x + n – 1).
The theorem can also show that Stirling numbers of the second kind are log-concave, but in that case the generating polynomial is not so easy to write out.
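The first-kind case can be checked by expanding the product directly (a sketch; the helper name is my own):

```python
def times_x_plus_m(p, m):
    # Multiply a polynomial (coefficients listed lowest degree first)
    # by the factor (x + m).
    q = [0] * (len(p) + 1)
    for k, c in enumerate(p):
        q[k] += m * c      # the m * p(x) part
        q[k + 1] += c      # the x * p(x) part
    return q

# Coefficients of (x + 1)(x + 2)(x + 3)(x + 4)(x + 5), whose
# coefficients are (unsigned) Stirling numbers of the first kind.
coeffs = [1]
for m in range(1, 6):
    coeffs = times_x_plus_m(coeffs, m)

# Log-concavity: c_k^2 >= c_{k-1} * c_{k+1}.
assert all(coeffs[k]**2 >= coeffs[k-1] * coeffs[k+1]
           for k in range(1, len(coeffs) - 1))
```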
I’m looking for small consulting projects to fill the gaps between larger projects. I’m available for projects that would take up to a few days.
I can’t take on another large project right now. However, if your company takes several weeks to initiate a project, we could start the process now and I may be available by the time the paperwork is done.
If you have a project you’d like to discuss, please let me know.
John Conway explained in an interview that he’s never applied for an academic job.
I am rather proud of the fact that, in some sense, I never applied for an academic position in my life. What happened: I was walking down King’s Parade, the main street in Cambridge, after I received my Ph.D. The chairman of the mathematics department said, “Oh, Conway, what have you done about applying for jobs?” And I said, “Nothing.” “I thought that might be the answer,” he said, “Well, we have a position in our department, and I think you should apply.” And I said, “How do I apply?” And he said, “You write a letter to me.” And I said, “What should I put in that letter?” Then he lost his patience, pulled out of his pocket a letter — which had been written on one side — turned it over, and scribbled “Dear Professor Cassels” — that was his name — “I wish to apply for dot-dot-dot.” He handed it to me, and I signed it. It would have been nice if I’d gotten the job. I didn’t that year, but I got the same job the next year with the same letter.
When you expand (x + y)^n, the coefficients increase then decrease. The largest coefficient is in the middle if n is even; it’s the two in the middle if n is odd. For example, the coefficients for (1 + x)^4 are 1, 4, 6, 4, 1 and the coefficients for (1 + x)^5 are 1, 5, 10, 10, 5, 1.
More generally, if a > 0 and b > 0, the coefficients of (ax + by)^n can only have one maximum. They may increase, or decrease, or change direction once, but they cannot wiggle more than that. They can’t, for example, increase, decrease, then increase again.
Here’s a proof. The coefficients are
C(n, k) a^k b^(n-k)
where C(n, k) = n!/(k!(n-k)!) is the binomial coefficient.
To show that the coefficients are unimodal as a function of k, we’ll show that their logarithms are unimodal. And we’ll do that by showing that they are samples from a concave function.
The log of the kth coefficient is
log Γ(n+1) – log Γ(k+1) – log Γ(n-k+1) + k log a + (n-k) log b.
As a function of k, the terms
log Γ(n+1) + k log a + (n-k) log b
form an affine function. The function log Γ is convex, so -log Γ is concave. The composition of a concave function with an affine function is concave, so – log Γ(k+1) and – log Γ(n-k+1) are concave functions of k. The sum of concave functions is concave. And the sum of a concave function with an affine function is concave. So binomial coefficients are log-concave and they can only have one maximum.
(The fact that log Γ(z) is convex is the easy direction of the Bohr-Mollerup theorem. The harder direction is that Γ(z) is the only log-convex function on the positive reals that extends the factorial.)
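The steps above are easy to check numerically by sampling the log coefficients (a sketch; the choices n = 8, a = 2, b = 3 are arbitrary):

```python
from math import lgamma, log

# Log of the k-th coefficient of (a x + b y)^n:
# log Γ(n+1) - log Γ(k+1) - log Γ(n-k+1) + k log a + (n-k) log b
n, a, b = 8, 2.0, 3.0
logc = [lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(a) + (n - k) * log(b) for k in range(n + 1)]

# Concavity of the samples: every second difference is negative.
assert all(logc[k-1] - 2*logc[k] + logc[k+1] < 0 for k in range(1, n))
```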
You may have seen the joke “Enter any 12-digit prime number to continue.” I’ve seen it floating around as the punchline in several contexts.
So what do you do if you need a 12-digit prime? Here’s how to find the smallest one using Python.
>>> from sympy import nextprime
>>> nextprime(10**11)
100000000003
nextprime gives the smallest prime larger than its argument. (Note that the smallest 12-digit number is 10^11, not 10^12. Great opportunity for an off-by-one mistake.)
Optionally you can provide an additional argument to nextprime to get primes further down the list. For example, this gives the second prime larger than 10^11.
>>> nextprime(10**11, 2)
100000000019
What if you wanted the largest 12-digit prime rather than the smallest?
>>> from sympy import prevprime
>>> prevprime(10**12)
999999999989
Finally, suppose you want to know how many 12-digit primes there are. SymPy has a function primepi that returns the number of primes less than its argument. Unfortunately, it fails for large arguments: it works for arguments as big as 2**27 but throws a memory error beyond that.
The number of primes less than n is approximately n / log n, so there are about 32 billion primes between 10^11 and 10^12. According to Wolfram Alpha, the exact number of 12-digit primes is 33,489,857,205. So if you try 12-digit numbers at random, your chances are about 1 in 30 of getting a prime. If you’re clever enough to just pick odd numbers, your chances go up to 1 in 15.
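The n / log n estimate is a one-liner (a sketch, not from the original post):

```python
from math import log

# Estimate the number of 12-digit primes as pi(10^12) - pi(10^11),
# using the approximation pi(n) ≈ n / log n.
est = 10**12 / log(10**12) - 10**11 / log(10**11)
print(f"{est:.3g}")   # about 3.2e10, i.e. roughly 32 billion
```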
From Concrete Mathematics:
Incidentally, when we’re faced with a “prove or disprove,” we’re usually better off trying first to disprove with a counterexample, for two reasons: A disproof is potentially easier (we just need one counterexample); and nit-picking arouses our creative juices. Even if the given assertion is true, our search for a counterexample often leads us to a proof, as soon as we see why a counterexample is impossible. Besides, it’s healthy to be skeptical.
In his book Let Over Lambda, Doug Hoyte says
Lisp is the result of taking syntax away, Perl is the result of taking syntax all the way.
Lisp practically has no syntax. It simply has parenthesized expressions. This makes it very easy to start using the language. And above all, it makes it easy to treat code as data. Lisp macros are very powerful, and these macros are made possible by the fact that the language is simple to parse.
Perl has complex syntax. Some people say it looks like line noise because of its liberal use of non-alphanumeric characters as operators. Perl is not easy to parse — there’s a saying that only Perl can parse Perl — nor is it easy to start using. But the language was designed for regular users, not beginners, because you spend more time using a language than learning it.
There are reasons I no longer use Perl, but I don’t object to the rich syntax. Saying Perl is hard to use because of its symbols is like saying Greek is hard to learn because it has a different alphabet. It takes years to master Greek, but you can learn the alphabet in a day. The alphabet is not the hard part.
Symbols can make text more expressive. If you’ve ever tried to read mathematics from the 18th or 19th century, you’ll see what I mean. Before the 20th century, math publications were very verbose. It might take a paragraph to say what would now be said in a single equation. In part this is because notation has developed and standardized over time. Also, it is now much easier to typeset the symbols someone would use in handwriting. Perl’s repertoire of symbols is parsimonious compared to mathematics.
I imagine that programming languages will gradually expand their range of symbols.
People joke about how unreadable Perl code is, but I think a page of well-written Perl is easier to read than a page of well-written Lisp. At least the Perl is easier to scan: Lisp’s typographical monotony makes it hard to skim for landmarks. One might argue that a page of Lisp can accomplish more than a page of Perl, and that may be true, but that’s another topic.
Any discussion of symbols and programming languages must mention APL. This language introduced a large number of new symbols and never gained wide acceptance. I don’t know that much about APL, but I’ll give my impression of why I don’t think APL’s failure is proof that programmers won’t use more symbols.
APL required a special keyboard for input. That would no longer be necessary. APL also introduced a new programming model; the language would have been hard to adopt even without the special symbols. Finally, APL’s symbols were completely unfamiliar and introduced all at once, unlike math notation that developed world-wide over centuries.
What if programming notation were more like music notation? Music notation is predominantly non-verbal, but people learn to read it fluently with a little training. And it expresses concurrency very easily. Or maybe programs could look more like choral music, a mixture of symbols and prose.
Suppose you want to know when your great-grandmother was born. You can’t find the year recorded anywhere. But you did discover an undated letter from her father that mentions her birth and one curious detail: the 13-year and 17-year cicadas were swarming.
You do a little research and find that the 13-year cicadas are supposed to come out next year, and that the 17-year cicadas came out last year. When was your great-grandmother born?
Since 13 and 17 are relatively prime, the 13-year and 17-year cicadas will synchronize their schedules every 13 × 17 = 221 years. Suppose your great-grandmother was born n years ago. The remainder when n is divided by 13 must be 12, and the remainder when n is divided by 17 must be 1. We have to solve the pair of congruences n = 12 mod 13 and n = 1 mod 17. The Chinese Remainder Theorem says that this pair of congruences has a solution, and the proof of the theorem suggests an algorithm for computing the solution.
The Python library SymPy has a function for solving linear congruences.
>>> from sympy.ntheory.modular import solve_congruence
>>> solve_congruence( (12, 13), (1, 17) )
(103, 221)
This says we’re 103 years into the joint cycle of the 13-year and 17-year cicadas. So your great-grandmother was born 103 years ago. (The solution 324 = 103 + 221 is also mathematically possible, but not biologically possible.)
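The algorithm suggested by the proof can be sketched directly (crt_pair is my own name, not a SymPy function; pow(m1, -1, m2) for modular inverses requires Python 3.8+):

```python
def crt_pair(r1, m1, r2, m2):
    # Solve x ≡ r1 (mod m1) and x ≡ r2 (mod m2),
    # assuming m1 and m2 are relatively prime.
    inv = pow(m1, -1, m2)                  # modular inverse of m1 mod m2
    x = r1 + m1 * ((r2 - r1) * inv % m2)   # adjust r1 by a multiple of m1
    return x % (m1 * m2)

print(crt_pair(12, 13, 1, 17))   # 103, matching solve_congruence
```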
You can use the same SymPy function to solve systems of more congruences. For example, when is the next year in which there will be summer Olympic games and the 13-year and 17-year cicadas will swarm? Here are a couple ways to approach this. First, you could find the last time this happened, then find when it will happen next. You’d need to solve n = 1 mod 4 (since we had summer Olympics last year) and n = 12 mod 13 and n = 1 mod 17.
>>> solve_congruence( (1, 4), (12, 13), (1, 17) )
(545, 884)
So the cicadas and the summer Olympics were in sync 545 years ago. (Well, they would have been if the modern Olympics had existed in the middle ages.) This says they’ll be in sync again in 884 – 545 = 339 years.
Here’s a more direct approach. We want to know when the summer Olympics will be 3 years ahead of where they are now in the cycle, when the 13-year cicadas will be 1 year ahead, and the 17-year cicadas will be 16 years ahead.
>>> solve_congruence( (3, 4), (1, 13), (16, 17) )
(339, 884)
By the way, you can use negative integers with the congruences, so you could have used (-1, 17) to say the 17-year cicadas will be 1 year back instead of 16 years ahead in their cycle.
A week ago I wrote about Perrin numbers, numbers P_n defined by a recurrence relation similar to Fibonacci numbers. If n is prime, P_n mod n = 0, and the converse is nearly always true. That is, if P_n mod n = 0, n is usually prime. The exceptions are called Perrin pseudoprimes.
Matt McIrvin wrote an excellent post explaining how to compute Perrin pseudoprimes. Here I’m just going to elaborate on a couple points in his post.
Matt’s first point is that if you want to search for Perrin pseudoprimes, the most direct approach won’t get you very far. The obvious thing to do is compute P_n and then see whether it has remainder 0 when you divide by n. The problem is that P_n grows exponentially with n. In fact, P_n is approximately ρ^n where ρ = 1.3247… is the plastic number. This means that P_n has about n log_10 ρ digits. So searching for pseudoprimes less than one billion would require working with numbers with over 100,000,000 digits. This can be done, but it’s slow and unnecessary.
Since the goal is to compute P_n mod n rather than P_n per se, we can carry out all calculations mod n and avoid extended precision arithmetic as long as n itself can fit in an ordinary precision integer. If we want to find pseudoprimes less than one billion, we calculate P_n mod n for each n up to N = 10^9. This only requires ordinary arithmetic.
However, this approach takes O(N^2) time unless we’re clever. We have to compute P_n mod n separately for each n, and the most direct approach takes n steps. This leads to Matt’s second point: use matrix multiplication (mod n) to calculate P_n mod n. This requires calculating the nth power of a 3×3 matrix, which can be done in O(log n) time using fast exponentiation. This makes the search for pseudoprimes less than N require O(N log N) rather than O(N^2) time. This is enough to make the search for pseudoprimes less than a billion practical.
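The matrix approach can be sketched as follows (perrin_mod is my own name, not from Matt’s post; the matrix is the companion matrix of the recurrence P(k+3) = P(k+1) + P(k) with P(0) = 3, P(1) = 0, P(2) = 2):

```python
def perrin_mod(n):
    # Compute P(n) mod n via fast exponentiation of the 3x3
    # companion matrix, with all arithmetic done mod n.
    def mult(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(3)) % n
                 for j in range(3)] for i in range(3)]

    M = [[0, 1, 0],   # advances the state (P(k), P(k+1), P(k+2))
         [0, 0, 1],
         [1, 1, 0]]   # P(k+3) = P(k) + P(k+1)
    R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # identity
    e = n
    while e:          # fast exponentiation: O(log n) multiplications
        if e & 1:
            R = mult(R, M)
        M = mult(M, M)
        e >>= 1
    # Apply M^n to the initial state (P(0), P(1), P(2)) = (3, 0, 2);
    # the first entry of the result is P(n) mod n.
    return (3 * R[0][0] + 2 * R[0][2]) % n

print([perrin_mod(p) for p in (7, 11, 13)])   # [0, 0, 0] for primes
```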
From David Mumford’s May 2013 interview in SIAM News:
The applied mathematician has the difficult job of looking at a problem in context with no explicit mathematics and trying to see what kinds of mathematical ideas are under the surface that could clarify the situation. I think the most successful applied mathematicians are those who look in both directions, at the science and the math.
You can’t become too attached to one way of looking at things. Applied math has always rejuvenated pure, and theorems in pure math can unexpectedly lead to new tools with vast applications.
Emphasis added. I wish Mumford had said “at the problem and at the math” because not all applications of math are scientific.