Mathematical arbitrage

I suspect there’s a huge opportunity in moving mathematics from the pure column to the applied column. There may be a lot of useful math that never sees application because the experts are unconcerned with or unaware of applications.

In particular I wonder what applications there may be of number theory, especially analytic number theory. I’m not thinking of the results of number theory but rather the elegant machinery developed to attack problems in number theory. I expect more of this machinery could be useful to problems outside of number theory.

I also wonder about category theory. The theory certainly finds uses within pure mathematics, but I’m not sure how useful it is in direct application to problems outside of mathematics. Many of the reported applications don’t seem like applications at all, but window dressing applied after-the-fact. On the other hand, there are also instances where categorical thinking led the way to a solution, but did its work behind the scenes; once a solution was in hand, it could be presented more directly without reference to categories. So it’s hard to say whether applications of category theory are over-reported or under-reported.

The mathematical literature can be misleading. When researchers say their work has various applications, they may be blowing smoke. At the same time, there may be real applications that are never mentioned in journals, either because the work is proprietary or because it is not deemed original in the academic sense of the word.

Extremely small probabilities

One objection to modeling adult heights with a normal distribution is that the former is obviously positive but the latter can be negative. However, by this model negative heights are astronomically unlikely. I’ll explain below how one can take “astronomically” literally in this context.

A common model says that men’s and women’s heights are normally distributed with means of 70 and 64 inches respectively, both with a standard deviation of 3 inches. A woman with negative height would be 21.33 standard deviations below the mean, and a man with negative height would be 23.33 standard deviations below the mean. These events have probability 3 × 10-101 and 10-120 respectively. Or to write them out in full




As I mentioned on Twitter yesterday, if you’re worried about probabilities that require scientific notation to write down, you’ve probably exceeded the resolution of your model. I imagine most probability models are good to two or three decimal places at most. When model probabilities are extremely small, factors outside the model become more important than ones inside.

According to Wolfram Alpha, there are around 1080 atoms in the universe. So picking one particular atom at random from all atoms in the universe would be on the order of a billion trillion times more likely than running into a woman with negative height. Of course negative heights are not just unlikely, they’re impossible. As you travel from the mean out into the tails, the first problem you encounter with the normal approximation is not that the probability of negative heights is over-estimated, but that the probability of extremely short and extremely tall people is under-estimated. There exist people whose heights would be impossibly unlikely according to this normal approximation. See examples here.

Probabilities such as those above have no practical value, but it’s interesting to see how you’d compute them anyway. You could find the probability of a man having negative height by typing pnorm(-23.33) into R or scipy.stats.norm.cdf(-23.33) into Python. Without relying on such software, you could use the bounds

\frac{x}{\sqrt{2\pi}(x^2 + 1)} \exp(-x^2/2) < \Phi^c(x) < \frac{1}{\sqrt{2\pi}\,x} \exp(-x^2/2)

with x equal to -21.33 and -23.33. For a proof of these bounds and tighter bounds see these notes.

Random walks and the arcsine law

Suppose you stand at 0 and flip a fair coin. If the coin comes up heads, you take a step to the right. Otherwise you take a step to the left. How much of the time will you spend to the right of where you started?

As the number of steps N goes to infinity, the probability that the proportion of your time in positive territory is less than x approaches 2 arcsin(√x)/π. The arcsine term gives this rule its name, the arcsine law.

Here’s a little Python script to illustrate the arcsine law.

import random
from numpy import arcsin, pi, sqrt

def step():
    u = random.random()
    return 1 if u < 0.5 else -1

M = 1000 # outer loop    
N = 1000 # inner loop

x = 0.3 # Use any 0 < x < 1 you'd like.

outer_count = 0
for _ in range(M):
    n = 0
    position= 0 
    inner_count = 0
    for __ in range(N):
        position += step()
        if position > 0:
            inner_count += 1
    if inner_count/N < x:
        outer_count += 1

print (outer_count/M)
print (2*arcsin(sqrt(x))/pi)

Playing with continued fractions and Khinchin’s constant

Take a real number x and expand it as a continued fraction. Compute the geometric mean of the first n coefficients.

Aleksandr Khinchin proved that for almost all real numbers x, as n → ∞ the geometric means converge. Not only that, they converge to the same constant, known as Khinchin’s constant, 2.685452001…. (“Almost all” here mean in the sense of measure theory: the set of real numbers that are exceptions to Khinchin’s theorem have measure zero.)

To get an idea how fast this convergence is, let’s start by looking at the continued fraction expansion of π. In Sage, we can type


to get the continued fraction coefficient

[3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, 2, 2, 2, 1, 84, 2, 1, 1, 15, 3]

for π to 100 decimal places. The geometric mean of these coefficients is 2.84777288486, which only matches Khinchin’s constant to 1 significant figure.

Let’s try choosing random numbers and working with more decimal places.

There may be a more direct way to find geometric means in Sage, but here’s a function I wrote. It leaves off any leading zeros that would cause the geometric mean to be zero.

from numpy import exp, mean, log
def geometric_mean(x):
    return exp( mean([log(k) for k in x if k > 0]) )

Now let’s find 10 random numbers to 1,000 decimal places.

for _ in range(10):
    r = RealField(1000).random_element(0,1)

This produced


Three of these agree with Khinchin’s constant to two significant figures but the rest agree only to one. Apparently the convergence is very slow.

If we go back to π, this time looking out 10,000 decimal places, we get a little closer:


produces 2.67104567579, which differs from Khinchin’s constant by about 0.5%.

Grand unification of mathematics

Greg Egan’s short story Glory features a “xenomathematician” who discovers that an ancient civilization had produced a sort of grand unification of their various branches of mathematics.

It was not a matter of everything in mathematics collapsing in on itself, with one branch turning out to have been merely a recapitulation of another under a different guise. Rather, the principle was that every sufficiently beautiful mathematical system was rich enough to mirror in part — and sometimes in a complex and distorted fashion — every other sufficiently beautiful system. Nothing became sterile and redundant, nothing proved to have been a waste of time, but everything was shown to be magnificently intertwined.

Natural optima occur in the middle

Akin’s eighth law of spacecraft design says

In nature, the optimum is almost always in the middle somewhere. Distrust assertions that the optimum is at an extreme point.

When I first read this I immediately thought of several examples where theory said that an optima was at an extreme, but experience said otherwise.

Linear programming (LP) says the opposite of Akin’s law. The optimal point for a linear objective function subject to linear constraints is always at an extreme point. The constraints form a many-sided shape—you could think of it something like a diamond—and the optimal point will always be at one of the corners.

Nature is not linear, though it is often approximately linear over some useful range. One way to read Akin’s law is to say that even when something is approximately linear in the middle, there’s enough non-linearity at the extremes to pull the optimal points back from the edges. Or said another way, when you have an optimal value at an extreme point, there may be some unrealistic aspect of your model that pushed the optimal point out to the boundary.

Related postData calls the model’s bluff

Another reason natural logarithms are natural

In mathematics, log means natural logarithm by default; the burden of explanation is on anyone taking logarithms to a different base. I elaborate on this a little here.

Looking through Andrew Gelman and Jennifer Hill’s regression book, I noticed a justification for natural logarithms I hadn’t thought about before.

We prefer natural logs (that is, logarithms base e) because, as described above, coefficients on the natural-log scale are directly interpretable as approximate proportional differences: with a coefficient of 0.06, a difference of 1 in x corresponds to an approximate 6% difference in y, and so forth.

This is because

exp(x) ≈ 1 + x

for small values of x based on a Taylor series expansion. So in Gelman and Hill’s example, a difference of 0.06 on a natural log scale corresponds to roughly multiplying by 1.06 on the original scale, i.e. a 6% increase.

The Taylor series expansion for exponents of 10 is not so tidy:

10x ≈ 1 + 2.302585 x

where 2.302585 is the numerical value of the natural log of 10. This means that a change of 0.01 on a log10 scale corresponds to an increase of about 2.3% on the original scale.

Related post: Approximation relating lg, ln, and log10

Miscellaneous math resources

Every Wednesday I’ve been pointing out various resources on my web site. So far they’ve all been web pages, but the following are all PDF files.

Probability and statistics:

Other math:

See also journal articles and technical reports.

Last week: Probability approximations

Next week: Code Project articles

Disappearing data projections

Suppose you have data in an N-dimensional space where N is large and consider the cube [-1, 1]N. The coordinate basis vectors start in the center of the cube and poke out through the middle of the faces. The diagonals of the cube run from the center to one of the corners.

If your points cluster along one of the coordinate axes, then projecting them to that axis will show the full width of the data. But if your points cluster along one of the diagonal directions, the projection along every coordinate axis will be a tiny smudge near the origin. There are a lot more diagonal directions than coordinate directions, 2N versus N, and so there are a lot of orientations of your points that could be missed by every coordinate projection.

Here’s the math behind the loose statements above. The diagonal directions of the form (±1, ±1, …, ±1). A unit vector in one of these directions will have the form (1/√N)(±1, ±1, …, ±1) and so its inner product with any of the coordinate basis vectors is 1/√N, which goes to zero as N gets large. Said another way, taking a set of points along a diagonal and projecting it to a coordinate axis divides its width by √N.


Zig Ziglar said that if you increase your confidence, you increase your competence. I think that’s generally true. Of course you could be an idiot and become a more confident idiot. In that case confidence just makes things worse [1]. But otherwise when you have more confidence, you explore more options, and in effect become more competent.

There are some things you may need to learn not for the content itself but for the confidence boost. Maybe you need to learn them so you can confidently say you didn’t need to. Also, some things you need to learn before you can see uses for them. (More on that theme here.)

I’ve learned several things backward in the sense of learning the advanced material before the elementary. For example, I studied PDEs in graduate school before having mastered the typical undergraduate differential equation curriculum. That nagged at me. I kept thinking I might find some use for the undergrad tricks. When I had a chance to teach the undergrad course a couple times, I increased my confidence. I also convinced myself that I didn’t need that material after all.

My experience with statistics was similar. I was writing research articles in statistics before I learned some of the introductory material. Once again the opportunity to teach the introductory material increased my confidence. The material wasn’t particularly useful, but the experience of having taught it was.

Related post: Psychological encapsulation

[1] See Yeats’ poem The Second Coming:

The best lack all conviction, while the worst
Are full of passionate intensity.


Probability approximations

This week’s resource post lists notes on probability approximations.

Do we even need probability approximations anymore? They’re not as necessary for numerical computation as they once were, but they remain vital for understanding the behavior of probability distributions and for theoretical calculations.

Textbooks often leave out details such as quantifying the error when discussion approximations. The following pages are notes I wrote to fill in some of these details when I was teaching.

See also blog posts tagged Probability and statistics and the Twitter account ProbFact.

Last week: Numerical computing resources

Next week: Miscellaneous math notes

Striving for simplicity, arriving at complexity

This post is a riff on a line from Mathematics without Apologies, the book I quoted yesterday.

In an all too familiar trade-off, the result of striving for ultimate simplicity is intolerable complexity; to eliminate too-long proofs we find ourselves “hopelessly lost” among the too-long definitions. [emphasis added]

It’s as if there’s some sort of conservation of complexity, but not quite in the sense of a physical conservation law. Conservation of momentum, for example, means that if one part of a system loses 5 units of momentum, other parts of the system have to absorb exactly 5 units of momentum. But perceived complexity is psychological, not physical, and the accounting is not the same. By moving complexity around we might increase or decrease the overall complexity.

The opening quote suggests that complexity is an optimization problem, not an accounting problem. It also suggests that driving the complexity of one part of a system to its minimum may disproportionately increase the complexity of another part. Striving for the simplest possible proofs, for example, could make the definitions much harder to digest. There’s a similar dynamic in programming languages and programs.

Larry Wall said that Scheme is a beautiful programming language, but every Scheme program is ugly. Perl, on the other hand, is ugly, but it lets you write beautiful programs. Scheme can be simple because it requires libraries and applications to implement functionality that is part of more complex languages. I had similar thoughts about COM. It was an elegant object system that lead to hideous programs.

Scheme is a minimalist programming language, and COM is a minimalist object framework. By and large the software development community prefers complex languages and frameworks in hopes of writing smaller programs. Additional complexity in languages and frameworks isn’t felt as strongly as additional complexity in application code. (Until something breaks. Then you might have to explore parts of the language or framework that you had blissfully ignored before.)

The opening quote deals specifically with the complexity of theorems and proofs. In context, the author was saying that the price of Grothendieck’s elegant proofs was a daunting edifice of definitions. (More on that here.) Grothendieck may have taken this to extremes, but many mathematicians agree with the general approach of pushing complexity out of theorems and into definitions. Michael Spivak defends this approach in the preface to his book Calculus on Manifolds.

… the proof of [Stokes’] theorem is, in the mathematician’s sense, an utter triviality — a straight-forward calculation. On the other hand, even the statement of this triviality cannot be understood without a horde of definitions … There are good reasons why the theorems should all be easy and the definitions hard. As the evolution of Stokes’ theorem revealed, a single simple principle can masquerade as several difficult results; the proofs of many theorems involve merely stripping away the disguise. The definitions, on the other hand, serve a twofold purpose: they are rigorous replacements for vague notions, and machinery for elegant proofs. [emphasis added]

Mathematicians like to push complexity into definitions like software developers like to push complexity into languages and frameworks. Both strategies can make life easier on professionals while making it harder on beginners.

Related post: A little simplicity goes a long way

Problem solvers and theory builders

From Mathematics without Apologies:

It’s conventional to classify mathematicians as “problem solvers” or “theory builders,” depending on temperament. My experiences and the sources I consulted in writing this book convince me that curiosity about problems guides the growth of theories, rather than the other way around. Alexander Grothendieck and Robert Langlands … count among the most ambitious of all builders of mathematical theories, but everything they built was addressed to specific problems with ancient roots.

Related post: Examples bring a subject to life

Special function resources

This week’s resource post: some pages of notes on special functions:

See also blog posts tagged special functions.

I tweet about special functions occasionally on AnalysisFact

Last week: HTML, LaTeX, and Unicode

Next week: Python resources

Counting primitive bit strings

A string of bits is called primitive if it is not the repetition of several copies of a smaller string of bits. For example, the 101101 is not primitive because it can be broken down into two copies of the string 101. In Python notation, you could produce 101101 by "101"*2. The string 11001101, on the other hand, is primitive. (It contains substrings that are not primitive, but the string as a whole cannot be factored into multiple copies of a single string.)

For a given n, let’s count how many primitive bit strings there are of length n. Call this f(n). There are 2n bit strings of length n, and f(n) of these are primitive. For example, there are f(12) primitive bit strings of length 12. The strings that are not primitive are made of copies of primitive strings: two copies of a primitive string of length 6, three copies of a primitive string of length 4, etc. This says

 2^{12} = f(12) + f(6) + f(4) + f(3) + f(2) + f(1)

and in general

2^n = \sum_{d \mid n} f(d)

Here the sum is over all positive integers d that divide n.

Unfortunately this formula is backward. It gives is a formula for something well known, 2n, as a sum of things we’re trying to calculate. The Möbius inversion formula is just what we need to turn this formula around so that the new thing is on the left and sums of old things are on the right. It tells us that

f(n) = \sum_{d \mid n} \mu\left(\frac{n}{d}\right) 2^d

where μ is the Möbius function.

We could compute f(n) with Python as follows:

from sympy.ntheory import mobius, divisors

def num_primitive(n):
    return sum( [mobius(n/d)*2**d for d in divisors(n)] )

The latest version of SymPy, version 0.7.6, comes with a function mobius for computing the Möbius function. If you’re using an earlier version of SymPy, you can roll your own mobius function:

from sympy.ntheory import factorint

def mobius(n):
    exponents = factorint(n).values()
    lenexp = len(exponents)
    m = 0 if lenexp == 0 else max(exponents)
    return 0 if m > 1 else (-1)**lenexp

The version of mobius that comes with SymPy 0.7.6 may be more efficient. It could, for example, stop the factorization process early if it discovers a square factor.