From David Jacobs:
Code is like poetry; most of it shouldn’t have been written.
Matt Brigg’s comment on outliers in his post Tyranny of the mean:
Coontz used the word “outliers”. There are no such things. There can be mismeasured data, i.e. incorrect data, say when you tried to measure air temperature but your thermometer fell into boiling water. Or there can be errors in recording the data; transposition and such forth. But excluding mistakes, and the numbers you meant to measure are the numbers you meant to measure, there are no outliers. There are only measurements which do not accord with your theory about the thing of interest.
I have a slight quibble with this description of outliers. Some people use the term to mean legitimate extreme values, and some use the term to mean values that “didn’t really happen” in some sense. I assume Matt is criticizing the latter. For example, Michael Jordan’s athletic ability is an outlier in the former sense. He’s only an outlier in the latter sense if someone decides he “doesn’t count” in some context.
A few weeks ago I said this about outliers:
When you reject a data point as an outlier, you’re saying that the point is unlikely to occur again, despite the fact that you’ve already seen it. This puts you in the curious position of believing that some values you have not seen are more likely than one of the values you have in fact seen.
Sometimes you have to exclude a data point because you believe it is far more likely to be a mistake than an accurate measurement. You may also decide that an extreme value is legitimate, but that you wish to exclude it from your model. The latter should be done with fear and trembling, or at least with an explicit disclaimer.
Related post: The cult of average
The term overfitting usually describes fitting too complex a model to available data. But it is possible to overfit a model before there are any data.
An experimental design, such as a clinical trial, proposes some model to describe the data that will be collected. For simple, well-known models the behavior of the design may be known analytically. For more complex or novel methods, the behavior is evaluated via simulation.
If an experimental design makes strong assumptions about data, and is then simulated with scenarios that follow those assumptions, the design should work well. So designs must be evaluated using scenarios that do not exactly follow the model assumptions. Here lies a dilemma: how far should scenarios deviate from model assumptions? If they do not deviate at all, you don’t have a fair evaluation. But deviating too far is unreasonable as well: no method can be expected to work well when its assumptions are flagrantly violated.
With complex designs, it may not be clear to what extent scenarios deviate from modeling assumptions. The method may be robust to some kinds of deviations but not to others. Simulation scenarios for complex designs are samples from a high dimensional space, and it is impossible to adequately explore a high dimensional space with a small number of points. Even if these scenarios were chosen at random — which would be an improvement over manually selecting scenarios that present a method in the best light — how do you specify a probability distribution on the scenarios? You’re back to a variation on the previous problem.
Once you have the data in hand, you can try a complex model and see how well it fits. But with experimental design, the model is determined before there are any data, and thus there is no possibility of rejecting the model for being a poor fit. You might decide after it’s too late, after the data have been collected, that the model was a poor fit. However, retrospective model criticism is complicated for adaptive experimental designs because the model influenced which data were collected.
This is especially a problem for one-of-a-kind experimental designs. When evaluating experimental designs — not the data in the experiment but the experimental design itself — each experiment is one data point. With only one data point, it’s hard to criticize a design. This means we must rely on simulation, where it is possible to obtain many data points. However, this brings us back to the arbitrary choice of simulation scenarios. In this case there are no empirical data to test the model assumptions.
Suppose a = 2485144657 and b = 7751389993. Here are two questions: What is the last digit of a*b? What is the median digit of a*b?
In both questions it is conceptually necessary to compute a*b, but not logically necessary. Both are questions about a*b, so computing the product is conceptually necessary. But there is no logical necessity to actually compute a*b in order to answer a question about it.
In the first question, there’s an obvious short cut: multiply the last digits of a and b and see that the last digit of the product must be 1.
In the second question, it is conceivable that there is some way to find the median digit that is less work than computing a*b first, though I don’t see how. Conceptually, you need to find the digits that make up a*b, sort them, and select the one in the middle. But it is conceivable, for example, that there is some way to find the digits of a*b that is less work than finding them in the right order, i.e. computing a*b.
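Both questions can be answered with a few lines of Python, a sketch that confirms the last-digit shortcut (the variable names are my own):

```python
a = 2485144657
b = 7751389993

# Last digit: only the last digits of a and b matter.
# 7 * 3 = 21, so the last digit of a*b must be 1.
last = (a % 10) * (b % 10) % 10

# Median digit: no apparent shortcut, so compute the full product,
# sort its digits, and take the one in the middle.
digits = sorted(str(a * b))
median = digits[len(digits) // 2]

print(last, median)
```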
I brought up the example above to use it as a metaphor.
In your work, how can you tell whether a problem is more like the first question or the second? Are you presuming you have to do something that you don’t? Are you assuming something is logically necessary when it is only conceptually necessary?
When I’m stuck on a problem, I often ask myself whether I really need to do what I assume I need to do. Sometimes that helps me get unstuck.
Related post: Maybe you only need it because you have it
A few days ago I wrote about the rise and fall of binomial coefficients. There I gave a proof that binomial coefficients are log-concave, and so a local maximum has to be a global maximum.
Here I’ll give a one-line proof of the same result, taking advantage of the following useful theorem.
Let p(x) = c_0 + c_1 x + c_2 x^2 + … + c_n x^n be a polynomial all of whose zeros are real and negative. Then the coefficient sequence c_k is strictly log concave.
This is Theorem 4.5.2 from Generatingfunctionology, available for download here.
Now for the promised one-line proof. Binomial coefficients are the coefficients of (x + 1)^n, which is clearly a polynomial with only real negative roots.
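For a particular row of Pascal’s triangle the conclusion is easy to verify numerically (a quick sketch using Python’s math.comb; the choice n = 10 is arbitrary):

```python
from math import comb

# Strict log-concavity of a row of binomial coefficients:
# C(n, k)^2 > C(n, k-1) * C(n, k+1) for 1 <= k <= n-1.
n = 10
row = [comb(n, k) for k in range(n + 1)]
assert all(row[k]**2 > row[k-1] * row[k+1] for k in range(1, n))
```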
The same theorem shows that Stirling numbers of the first kind, s(n, k), are log concave for fixed n and k ≥ 1. This is because these numbers are the coefficients of x^k in
(x + 1)(x + 2) … (x + n – 1).
The theorem can also show that Stirling numbers of the second kind are log-concave, but in that case the generating polynomial is not so easy to write out.
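The first-kind case can be checked by expanding the product directly (a sketch; the helper name is my own):

```python
def times_x_plus_m(p, m):
    # Multiply a polynomial (coefficients listed lowest degree first)
    # by the factor (x + m).
    q = [0] * (len(p) + 1)
    for k, c in enumerate(p):
        q[k] += m * c      # the m * p(x) part
        q[k + 1] += c      # the x * p(x) part
    return q

# Coefficients of (x + 1)(x + 2)(x + 3)(x + 4)(x + 5), whose
# coefficients are (unsigned) Stirling numbers of the first kind.
coeffs = [1]
for m in range(1, 6):
    coeffs = times_x_plus_m(coeffs, m)

# Log-concavity: c_k^2 >= c_{k-1} * c_{k+1}.
assert all(coeffs[k]**2 >= coeffs[k-1] * coeffs[k+1]
           for k in range(1, len(coeffs) - 1))
```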
I’m looking for small consulting projects to fill the gaps between larger projects. I’m available for projects that would take up to a few days.
I can’t take on another large project right now. However, if your company takes several weeks to initiate a project, we could start the process now and I may be available by the time the paperwork is done.
If you have a project you’d like to discuss, please let me know.
John Conway explained in an interview that he’s never applied for an academic job.
I am rather proud of the fact that, in some sense, I never applied for an academic position in my life. What happened: I was walking down King’s Parade, the main street in Cambridge, after I received my Ph.D. The chairman of the mathematics department said, “Oh, Conway, what have you done about applying for jobs?” And I said, “Nothing.” “I thought that might be the answer,” he said, “Well, we have a position in our department, and I think you should apply.” And I said, “How do I apply?” And he said, “You write a letter to me.” And I said, “What should I put in that letter?” Then he lost his patience, pulled out of his pocket a letter — which had been written on one side — turned it over, and scribbled “Dear Professor Cassels” — that was his name — “I wish to apply for dot-dot-dot.” He handed it to me, and I signed it. It would have been nice if I’d gotten the job. I didn’t that year, but I got the same job the next year with the same letter.
When you expand (x + y)^n, the coefficients increase then decrease. The largest coefficient is in the middle if n is even; it’s the two in the middle if n is odd. For example, the coefficients for (1 + x)^4 are 1, 4, 6, 4, 1 and the coefficients for (1 + x)^5 are 1, 5, 10, 10, 5, 1.
More generally, if a > 0 and b > 0, the coefficients of (ax + by)^n can only have one maximum. They may increase, or decrease, or change direction once, but they cannot wiggle more than that. They can’t, for example, increase, decrease, then increase again.
Here’s a proof. The coefficients are
C(n, k) a^k b^(n-k)
where C(n, k) = n!/(k!(n-k)!) is the binomial coefficient.
To show that the coefficients are unimodal as a function of k, we’ll show that their logarithms are unimodal. And we’ll do that by showing that they are samples from a concave function.
The log of the kth coefficient is
log Γ(n+1) – log Γ(k+1) – log Γ(n-k+1) + k log a + (n-k) log b.
As a function of k, the terms
log Γ(n+1) + k log a + (n-k) log b
form an affine function. The function log Γ is convex, so -log Γ is concave. The composition of a concave function with an affine function is concave, so – log Γ(k+1) and – log Γ(n-k+1) are concave functions of k. The sum of concave functions is concave. And the sum of a concave function with an affine function is concave. So binomial coefficients are log-concave and they can only have one maximum.
(The fact that log Γ(z) is convex is the easy direction of the Bohr-Mollerup theorem. The harder direction is that Γ(z) is the only log-convex function on the positive reals that extends the factorial.)
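The steps above are easy to check numerically by sampling the log coefficients (a sketch; the choices n = 8, a = 2, b = 3 are arbitrary):

```python
from math import lgamma, log

# Log of the k-th coefficient of (a x + b y)^n:
# log Γ(n+1) - log Γ(k+1) - log Γ(n-k+1) + k log a + (n-k) log b
n, a, b = 8, 2.0, 3.0
logc = [lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(a) + (n - k) * log(b) for k in range(n + 1)]

# Concavity of the samples: every second difference is negative.
assert all(logc[k-1] - 2*logc[k] + logc[k+1] < 0 for k in range(1, n))
```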
You may have seen the joke “Enter any 12-digit prime number to continue.” I’ve seen it floating around as the punchline in several contexts.
So what do you do if you need a 12-digit prime? Here’s how to find the smallest one using Python.
>>> from sympy import nextprime
>>> nextprime(10**11)
100000000003
nextprime gives the smallest prime larger than its argument. (Note that the smallest 12-digit number is 10^11, not 10^12. Great opportunity for an off-by-one mistake.)
Optionally you can provide an additional argument to nextprime to get primes further down the list. For example, this gives the second prime larger than 10^11.
>>> nextprime(10**11, 2)
100000000019
What if you wanted the largest 12-digit prime rather than the smallest?
>>> from sympy import prevprime
>>> prevprime(10**12)
999999999989
Finally, suppose you want to know how many 12-digit primes there are. SymPy has a function primepi that returns the number of primes less than its argument. Unfortunately, it fails for large arguments: it works for arguments as big as 2**27 but throws a memory error beyond that.
The number of primes less than n is approximately n / log n, so there are about 32 billion primes between 10^11 and 10^12. According to Wolfram Alpha, the exact number of 12-digit primes is 33,489,857,205. So if you try 12-digit numbers at random, your chances are about 1 in 30 of getting a prime. If you’re clever enough to just pick odd numbers, your chances go up to 1 in 15.
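The n / log n estimate is a one-liner (a sketch, not from the original post):

```python
from math import log

# Estimate the number of 12-digit primes as pi(10^12) - pi(10^11),
# using the approximation pi(n) ≈ n / log n.
est = 10**12 / log(10**12) - 10**11 / log(10**11)
print(f"{est:.3g}")   # about 3.2e10, i.e. roughly 32 billion
```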
From Concrete Mathematics:
Incidentally, when we’re faced with a “prove or disprove,” we’re usually better off trying first to disprove with a counterexample, for two reasons: A disproof is potentially easier (we just need one counterexample); and nit-picking arouses our creative juices. Even if the given assertion is true, our search for a counterexample often leads us to a proof, as soon as we see why a counterexample is impossible. Besides, it’s healthy to be skeptical.
In his book Let Over Lambda, Doug Hoyte says
Lisp is the result of taking syntax away, Perl is the result of taking syntax all the way.
Lisp practically has no syntax. It simply has parenthesized expressions. This makes it very easy to start using the language. And above all, it makes it easy to treat code as data. Lisp macros are very powerful, and these macros are made possible by the fact that the language is simple to parse.
Perl has complex syntax. Some people say it looks like line noise because of its liberal use of non-alphanumeric characters as operators. Perl is not easy to parse — there’s a saying that only Perl can parse Perl — nor is it easy to start using. But the language was designed for regular users, not beginners, because you spend more time using a language than learning it.
There are reasons I no longer use Perl, but I don’t object to the rich syntax. Saying Perl is hard to use because of its symbols is like saying Greek is hard to learn because it has a different alphabet. It takes years to master Greek, but you can learn the alphabet in a day. The alphabet is not the hard part.
Symbols can make text more expressive. If you’ve ever tried to read mathematics from the 18th or 19th century, you’ll see what I mean. Before the 20th century, math publications were very verbose. It might take a paragraph to say what would now be said in a single equation. In part this is because notation has developed and standardized over time. Also, it is now much easier to typeset the symbols someone would use in handwriting. Perl’s repertoire of symbols is parsimonious compared to mathematics.
I imagine that programming languages will gradually expand their range of symbols.
People joke about how unreadable Perl code is, but I think a page of well-written Perl is easier to read than a page of well-written Lisp. At least the Perl is easier to scan: Lisp’s typographical monotony makes it hard to skim for landmarks. One might argue that a page of Lisp can accomplish more than a page of Perl, and that may be true, but that’s another topic.
Any discussion of symbols and programming languages must mention APL. This language introduced a large number of new symbols and never gained wide acceptance. I don’t know that much about APL, but I’ll give my impression of why I don’t think APL’s failure is proof that programmers won’t use more symbols.
APL required a special keyboard for input. That would no longer be necessary. APL also introduced a new programming model; the language would have been hard to adopt even without the special symbols. Finally, APL’s symbols were completely unfamiliar and introduced all at once, unlike math notation that developed world-wide over centuries.
What if programming notation were more like music notation? Music notation is predominantly non-verbal, but people learn to read it fluently with a little training. And it expresses concurrency very easily. Or maybe programs could look more like choral music, a mixture of symbols and prose.
Suppose you want to know when your great-grandmother was born. You can’t find the year recorded anywhere. But you did discover an undated letter from her father that mentions her birth and one curious detail: the 13-year and 17-year cicadas were swarming.
You do a little research and find that the 13-year cicadas are supposed to come out next year, and that the 17-year cicadas came out last year. When was your great-grandmother born?
Since 13 and 17 are relatively prime, the 13-year and 17-year cicadas will synchronize their schedules every 13 × 17 = 221 years. Suppose your great-grandmother was born n years ago. The remainder when n is divided by 13 must be 12, and the remainder when n is divided by 17 must be 1. We have to solve the pair of congruences n = 12 mod 13 and n = 1 mod 17. The Chinese Remainder Theorem says that this pair of congruences has a solution, and the proof of the theorem suggests an algorithm for computing the solution.
The Python library SymPy has a function for solving linear congruences.
>>> from sympy.ntheory.modular import solve_congruence
>>> solve_congruence( (12, 13), (1, 17) )
(103, 221)
This says we’re 103 years into the joint cycle of the 13-year and 17-year cicadas. So your great-grandmother was born 103 years ago. (The solution 324 = 103 + 221 is also mathematically possible, but not biologically possible.)
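The algorithm suggested by the proof can be sketched directly (crt_pair is my own name, not a SymPy function; pow(m1, -1, m2) for modular inverses requires Python 3.8+):

```python
def crt_pair(r1, m1, r2, m2):
    # Solve x ≡ r1 (mod m1) and x ≡ r2 (mod m2),
    # assuming m1 and m2 are relatively prime.
    inv = pow(m1, -1, m2)                  # modular inverse of m1 mod m2
    x = r1 + m1 * ((r2 - r1) * inv % m2)   # adjust r1 by a multiple of m1
    return x % (m1 * m2)

print(crt_pair(12, 13, 1, 17))   # 103, matching solve_congruence
```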
You can use the same SymPy function to solve systems of more congruences. For example, when is the next year in which there will be summer Olympic games and the 13-year and 17-year cicadas will swarm? Here are a couple ways to approach this. First, you could find the last time this happened, then find when it will happen next. You’d need to solve n = 1 mod 4 (since we had summer Olympics last year) and n = 12 mod 13 and n = 1 mod 17.
>>> solve_congruence( (1, 4), (12, 13), (1, 17) )
(545, 884)
So the cicadas and the summer Olympics were in sync 545 years ago. (Well, they would have been if the modern Olympics had existed in the middle ages.) This says they’ll be in sync again in 884 – 545 = 339 years.
Here’s a more direct approach. We want to know when the summer Olympics will be 3 years ahead of where they are now in the cycle, when the 13-year cicadas will be 1 year ahead, and the 17-year cicadas will be 16 years ahead.
>>> solve_congruence( (3, 4), (1, 13), (16, 17) )
(339, 884)
By the way, you can use negative integers with the congruences, so you could have used (-1, 17) to say the 17-year cicadas will be 1 year back instead of 16 years ahead in their cycle.
A week ago I wrote about Perrin numbers, numbers P_n defined by a recurrence relation similar to Fibonacci numbers. If n is prime, P_n mod n = 0, and the converse is nearly always true. That is, if P_n mod n = 0, n is usually prime. The exceptions are called Perrin pseudoprimes.
Matt McIrvin wrote an excellent post explaining how to compute Perrin pseudoprimes. Here I’m just going to elaborate on a couple points in his post.
Matt’s first point is that if you want to search for Perrin pseudoprimes, the most direct approach won’t get you very far. The obvious thing to do is compute P_n and then see whether it has remainder 0 when you divide by n. The problem is that P_n grows exponentially with n. In fact, P_n is approximately ρ^n where ρ = 1.3247… is the plastic number. This means that P_n has about n log_10 ρ digits. So searching for pseudoprimes less than one billion would require working with numbers with over 100,000,000 digits. This can be done, but it’s slow and unnecessary.
Since the goal is to compute P_n mod n rather than P_n per se, we can carry out all calculations mod n and avoid extended precision arithmetic as long as n itself can fit in an ordinary precision integer. If we want to find pseudoprimes less than one billion, we calculate P_n mod n for each n up to N = 10^9. This only requires ordinary arithmetic.
However, this approach takes O(N^2) time unless we’re clever. We have to compute P_n mod n separately for each n, and the most direct approach takes n steps. This leads to Matt’s second point: use matrix multiplication (mod n) to calculate P_n mod n. This requires calculating the nth power of a 3×3 matrix, which can be done in O(log n) time using fast exponentiation. This makes the search for pseudoprimes less than N require O(N log N) rather than O(N^2) time. This is enough to make the search for pseudoprimes less than a billion practical.
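The matrix approach can be sketched as follows (perrin_mod is my own name, not from Matt’s post; the matrix is the companion matrix of the recurrence P(k+3) = P(k+1) + P(k) with P(0) = 3, P(1) = 0, P(2) = 2):

```python
def perrin_mod(n):
    # Compute P(n) mod n via fast exponentiation of the 3x3
    # companion matrix, with all arithmetic done mod n.
    def mult(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(3)) % n
                 for j in range(3)] for i in range(3)]

    M = [[0, 1, 0],   # advances the state (P(k), P(k+1), P(k+2))
         [0, 0, 1],
         [1, 1, 0]]   # P(k+3) = P(k) + P(k+1)
    R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # identity
    e = n
    while e:          # fast exponentiation: O(log n) multiplications
        if e & 1:
            R = mult(R, M)
        M = mult(M, M)
        e >>= 1
    # Apply M^n to the initial state (P(0), P(1), P(2)) = (3, 0, 2);
    # the first entry of the result is P(n) mod n.
    return (3 * R[0][0] + 2 * R[0][2]) % n

print([perrin_mod(p) for p in (7, 11, 13)])   # [0, 0, 0] for primes
```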
From David Mumford’s May 2013 interview in SIAM News:
The applied mathematician has the difficult job of looking at a problem in context with no explicit mathematics and trying to see what kinds of mathematical ideas are under the surface that could clarify the situation. I think the most successful applied mathematicians are those who look in both directions, at the science and the math.
You can’t become too attached to one way of looking at things. Applied math has always rejuvenated pure, and theorems in pure math can unexpectedly lead to new tools with vast applications.
Emphasis added. I wish Mumford had said “at the problem and at the math” because not all applications of math are scientific.