Last digits of Fibonacci numbers

If you write out a sequence of Fibonacci numbers, you can see that the last digits repeat every 60 numbers.

The 61st Fibonacci number is 2504730781961. The 62nd is 4052739537881. Since these end in 1 and 1, the 63rd Fibonacci number must end in 2, etc. and so the pattern starts over.

It’s not obvious that the cycle should have length 60, but it is fairly easy to see that there must be a cycle. There are only 10*10 possibilities for two consecutive digits. Since the Fibonacci numbers are determined by a two-term recurrence, and since the last digit of a sum is determined by the sum of the last digits, the sequence of last digits must repeat eventually. Here “eventually” means after at most 10*10 terms.

Replace “10” by any other base in the paragraph above to show that the sequence of last digits must be cyclic in any base. In base 16, for example, the period is 24. In hexadecimal notation the 25th Fibonacci number is 12511 and the 26th is 1DA31, so the 27th must end in 2, etc.

Here’s a little Python code to find the period of the last digits of Fibonacci numbers working in any base b.

from sympy import fibonacci as f

def period(b):
    for i in range(1, b*b+1):
        if f(i)%b == 0 and f(i+1)%b == 1:

This shows that in base 100 the period is 300. So in base 10 the last two digits repeat every 300 terms.

The period seems to vary erratically with base as shown in the graph below.

Life lessons from differential equations

Ten life lessons from differential equations:

  1. Some problems simply have no solution.
  2. Some problems have no simple solution.
  3. Some problems have many solutions.
  4. Determining that a solution exists may be half the work of finding it.
  5. Solutions that work well locally may blow up when extended too far.

High-dimensional integration

Numerically integrating functions of many variables almost always requires Monte Carlo integration or some variation. Numerical analysis textbooks often emphasize one-dimensional integration and, almost as a footnote, say that you can use a product scheme to bootstrap one-dimensional methods to higher dimensions. True, but impractical.

Traditional numerical integration routines work well in low dimensions. (“Low” depends on context, but let’s say less than 10 dimensions.) When you have the choice of a well-understood, deterministic method, and it works well, by all means use it. I’ve seen people who are so used to problems requiring Monte Carlo methods that they use these methods on low-dimensional problems where they’re not the most efficient (or robust) alternative.

Monte Carlo

Except for very special integrands, high dimensional integration requires either Monte Carlo integration or some variation such as quasi-Monte Carlo methods.

As I wrote about here, naive Monte Carlo integration isn’t practical for high dimensions. It’s unlikely that any of your integration points will fall in the region of interest without some sort of importance sampler. Constructing a good importance sampler requires some understanding of your particular problem. (Sorry. No one-size-fits-all black-box methods are available.)

Quasi-Monte Carlo (QMC)

Quasi-Monte Carlo methods (QMC) are interesting because they rely on something like a random sequence, but “more random” in a sense, i.e. more effective at exploring a high-dimensional space. These methods work well when an integrand has low effective dimension. The effective dimension is low when nearly all the action takes place on a lower dimensional manifold. For example, a function may ostensibly depend on 10,000 variables, while for practical purposes 12 of these are by far the most important. There are some papers by Art Owen that say, roughly, that Quasi-Monte Carlo methods work well if and only if the effective dimension is much smaller than the nominal dimension. (QMC methods are especially popular in financial applications.)

While QMC methods can be more efficient than Monte Carlo methods, it’s hard to get bounds on the integration error with these methods. Hybrid approaches, such as randomized QMC can perform remarkably well and give theoretical error bounds.

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) is a common variation on Monte Carlo integration that uses dependent random samples. In many applications, such as Bayesian statistics, you’d like to be able to create independent random samples from some probability distribution, but this is not practical. MCMC is a compromise. It produces a Markov chain of dependent samples, and asymptotically these samples have the desired distribution. The dependent nature of the samples makes it difficult to estimate error and to determine how well the integration estimates using the Markov chain have converged.

(Some people speak of the Markov chain itself converging. It doesn’t. The estimates using the Markov chain may converge. Speaking about the chain converging may just be using informal language, or it may betray a lack of understanding.)

There is little theory regarding the convergence of estimates from MCMC, aside from toy problems. An estimate can appear to have converged, while an important region of the integration problem remains unexplored. Despite its drawbacks, MCMC is the only game in town for many applications.


If you’d like help with high dimensional integration, or any other kind of numerical integration, please contact me to discuss how we might work together.

When the last digits of powers don’t change

If you raise any integer to the fifth power, its last digit doesn’t change. For example, 25 = 32.

It’s easy to prove this assertion by brute force. Since the last digit of bn only depends on the last digit of b, it’s enough to verify that the statement above holds for 0, 1, 2, …, 9.

Although that’s a valid proof, it’s not very satisfying. Why fifth powers? We’re implicitly working in base 10, as humans usually do, so what is the relation between 5 and 10 in this context? How might we generalize it to other bases?

The key is that 5 = φ(10) + 1 where φ(n) is the number of positive integers less than n and relatively prime to n.

Euler discovered the φ function and proved that if a and m are relatively prime, then

aφ(m) = 1 (mod m)

This means that aφ(m) – 1 is divisible by m. (Technically I should use the symbol ≡ (U+2261) rather than = above since the statement is a congruence, not an equation. But since Unicode font support on various devices is disappointing, not everyone could see the proper symbol.)

Euler’s theorem tells us that if a is relatively prime to 10 then a4 ends in 1, so a5 ends in the same digit as a. That proves our original statement for numbers ending in 1, 3, 7, and 9. But what about the rest, numbers that are divisible by 2 or 5? We’ll answer that question and a little more general one at the same time. Let m = αβ where α and β are distinct primes. In particular, we could choose α = 2 and β = 5; We will show that

aφ(m)+1 = a (mod m)

for all a, whether relatively prime to m or not. This would show in addition, for example, that in base 15, every number keeps the same last “digit” when raised to the 9th power since φ(15) = 8.

We only need to be concerned with the case of a being a multiple of α or β since Euler’s theorem proves our theorem for the rest of the cases. If a = αβ our theorem obviously holds, so assume a is some multiple of α, a = kα with k less than β (and hence relatively prime to β).

We need to show that αβ divides (kα)φ(αβ)+1kα. This expression is clearly divisible by α, so the remaining task is to show that it is divisible by β. We’ll show that (kα)φ(αβ) – 1 is divisible by β.

Since α and k are relatively prime to β, Euler’s theorem tells us that αφ(β) and kφ(β) are congruent to 1 (mod β). This implies that kαφ(β) is congruent to 1, and so kαφ(α)φ(β) is also congruent to 1 (mod β). One of the basic properties of φ is that for relatively prime arguments α and β, φ(αβ) = φ(α)φ(β) and so we’re done.

Exercise: How much more can you generalize this? Does the base have to be the product of two distinct primes?

Pedantic arithmetic rules

Generations of math teachers have drilled into their students that they must reduce fractions. That serves some purpose in the early years, but somewhere along the way students need to learn reducing fractions is not only unnecessary, but can be bad for communication. For example, if the fraction 45/365 comes up in the discussion of something that happened 45 days in a year, the fraction 45/365 is clearer than 9/73. The fraction 45/365 is not simpler in a number theoretic sense, but it is psychologically simpler since it’s obvious where the denominator came from. In this context, writing 9/73 is not a simplification but an obfuscation.

Simplifying fractions sometimes makes things clearer, but not always. It depends on context, and context is something students don’t understand at first. So it makes sense to be pedantic at some stage, but then students need to learn that clear communication trumps pedantic conventions.

Along these lines, there is a old taboo against having radicals in the denominator of a fraction. For example, 3/√5 is not allowed and should be rewritten as 3√5/5. This is an arbitrary convention now, though there once was a practical reason for it, namely that in hand calculations it’s easier to multiply by a long fraction than to divide by it. So, for example, if you had to reduce 3/√5 to a decimal in the old days, you’d look up √5 in a table to find it equals 2.2360679775. It would be easier to compute 0.6*2.2360679775 by hand than to compute 3/2.2360679775.

As with unreduced fractions, radicals in the denominator might be not only mathematically equivalent but psychologically preferable. If there’s a 3 in some context, and a √5, then it may be clear that 3/√5 is their ratio. In that same context someone may look at 3√5/5 and ask “Where did that factor of 5 in the denominator come from?”

A possible justification for rules above is that they provide standard forms that make grading easier. But this is only true for the simplest exercises. With moderately complicated exercises, following a student’s work is harder than determining whether two expressions represent the same number.

One final note on pedantic arithmetic rules: If the order of operations isn’t clear, make it clear. Add a pair of parentheses if you need to. Or write division operations as one thing above a horizontal bar and another below, not using the division symbol. Then you (and your reader) don’t have to worry whether, for example, multiplication has higher precedence than division or whether both have equal precedence and are carried out left to right.

Why is an empty sum 0 and an empty product 1?

In response to my earlier post on why 0! should be 1, several people replied that 0! = 1 because an empty product is 1. You can define the factorial of an integer n as the product of all positive numbers less than or equal to n. There are no positive integers less than or equal to 0, so 0! is an empty product. But this raises the question of why an empty product should be 1.

You could say that an empty sum is 0 because 0 is the additive identity and an empty product is 1 because 1 is the multiplicative identity. If you’d like a simple answer, maybe you should stop reading here.

The problem with the answer above is that it doesn’t say why an operation on an empty set should be defined to be the identity for that operation. The identity is certainly a plausible candidate, but why should it make sense to even define an operation on an empty set, and why should the identity turn out so often to be the definition that makes things proceed smoothly?

The convention that the sum over an empty set should be defined as 0, and that a product over an empty set should be defined to be 1 works well in very general settings where “sum”, “product”, “0”, and “1” take on abstract meanings.

The ultimate generalization of products is the notion of products in category theory. Similarly, the ultimate generalization of sums is categorical co-products. (Co-products are sometimes called sums, but they’re usually called co-products due to a symmetry with products.) Category theory simultaneously addresses a wide variety of operations that could be called products or sums (co-products).

The particular advantage of bringing category theory into this discussion is that it has definitions of product and co-product that are the same for any number of objects, including zero objects; there is no special definition for empty products. Empty products and co-products are a consequence of a more general definition, not special cases defined by convention.

In the category of sets, products are Cartesian products. The product of a set with n elements and one with m elements is one with nm elements. Also in the category of sets, co-products are disjoint unions. The co-product of a set with n elements and one with m elements is one with n+m elements. These examples show a connection between products and sums in arithmetic and products and co-products in category theory.

You can find the full definition of a categorical product here. Below I give the definition leaving out details that go away when we look at empty products.

The product of a set of objects is an object P such that given any other object X … there exists a unique morphism from X to P such that ….

If you’ve never seen this before, you might rightfully wonder what in the world this has to do with products. You’ll have to trust me on this one. [1]

When the set of objects is empty, the missing parts of the definition above don’t matter, so we’re left with requiring that there is a unique morphism [2] from each object X to the product P. In other words, P is a terminal object, often denoted 1. So in category theory, you can say empty products are 1.

But that seems like a leap, since “1” now takes on a new meaning that isn’t obviously connected to the idea of 1 we learned as a child. How is an object such that every object has a unique arrow to it at all like, say, the number of noses on a human face?

We drew a connection between arithmetic and categories before by looking at the cardinality of sets. We could define the product of the numbers n and m as the number of elements in the product of a set with n elements and one with m elements. Similarly we could define 1 as the cardinality of the terminal element, also denoted 1. This is because there is a unique map from any set to the set with 1 element. Pick your favorite one-element set and call it 1. Any other choice is isomorphic to your choice.

Now for empty sums. The following is the definition of co-product (sum), leaving out details that go away when we look at empty co-products.

The co-product of a set of objects is an object S such that given any other object X … there exists a unique morphism from S to X such that ….

As before, when the set of objects is empty, the missing parts don’t matter. Notice that the direction of the arrow in the definition is reversed: there is a unique morphism from the co-product S to any object X. In other words, S is an initial object, denoted for good reasons as 0.  [3]

In set theory, the initial object is the empty set. (If that hurts your head, you’re not alone. But if you think of functions in terms of sets of ordered pairs, it makes a little more sense. The function that sends the empty set to another set is an empty set of ordered pairs!) The cardinality of the initial object 0 is the integer 0, just as the cardinality of the initial object 1 is the integer 1.

* * *

[1] Category theory has to define operations entirely in terms of objects and morphisms. It can’t look inside an object and describe things in terms of elements the way you’d usually do to define the product of two numbers or two sets, so the definition of product has to look very different. The benefit of this extra work is a definition that applies much more generally.

To understand the general definition of products, start by understanding the product of two objects. Then learn about categorical limits and how products relate to limits. (As with products, the categorical definition of limits will look entirely different from familiar limits, but they’re related.)

[2] Morphisms are a generalization of functions. In the category of sets, morphisms are functions.

[3] Sometimes initial objects are denoted by ∅, the symbol for the empty set, and sometimes by 0. To make things more confusing, a “zero,” spelled out as a word rather than a symbol, has a different but related meaning in category theory: an object that is both initial and terminal.

Defining zero factorial

Things are defined the way they are for good reasons. This seems blatantly obvious now, but it was eye-opening when I learned this my first year in college. Our professor, Mike Starbird, asked us to go home and think about how convergence of a series should be defined. Not how it is defined, but how it should be defined. We were not to look up the definition but to think about what it should be. The next day we proposed our definitions. In good Socratic fashion Starbird showed us the flaws of each and lead us to arrive at the standard definition.

This exercise gave me confidence that mathematical definitions were created by ordinary mortals like myself. It also began my habit of examining definitions carefully to understand what motivated them.

One question that comes up frequently is why zero factorial equals 1. The pedantic answer is “Because it is defined that way.” This answer alone is not very helpful, but it does lead to the more refined question: Why is 0! defined to be 1?

The answer to the revised question is that many formulas are simpler if we define 0! to be 1. If we defined 0! to be 0, for example, countless formulas would have to add disqualifiers such as “except when n is zero.”

For example, the binomial coefficients are defined by

C(n, k) = n! / k!(nk)!.

The binomial coefficient C(n, k) tells us how many ways one can draw take a set of n things and select k of them. For example, the number of ways to deal a hand of five cards from a deck of 52 is C(52, 5) = 52! / 5! 47! = 2,598,960.

How many ways are there to deal a hand of 52 cards from a deck of 52 cards? Obviously one: the deck is the hand. But our formula says the answer is

C(52, 52) = 52! / 52! 0!,

and the formula is only correct if 0! = 1. If 0! were defined to be anything else, we’d have to say “The number of ways to deal a hand of k cards from a deck of n cards is C(n, k), except when k = 0 or k = n, in which case the answer is 1.” (See [1] below for picky details.)

The example above is certainly not the only one where it is convenient to define 0! to be 1. Countless theorems would be more awkward to state if 0! were defined any other way.

Sometimes people appeal to the gamma function for justification that 0! should be defined to be 1. The gamma function extends factorial to real numbers, and the gamma function value associated with 0! is 1. (In detail, n! = Γ(n+1) for positive integers n and Γ(1) = 1.) This is reassuring, but it raises another question: Why should the gamma function be authoritative?

Indeed, there are many ways to extend factorial to non-integer values, and historically many ways were proposed. However, the gamma function won and its competitors have faded into obscurity. So why did it win? Analogous to the discussion above, we could say that the gamma function won because more formulas work out simply with this definition than with others. That is, you can very often replace n! with Γ(n + 1) in a formula true for positive integer values of n and get a new formula valid for real or even complex values of n.

There is another reason why gamma won, and that’s the Bohr–Mollerup theorem. It says that if you’re looking for a function f(x) defined for x > 0 that satisfies f(1) = 1 and f(x+1) = x f(x), then the gamma function is the only log-convex solution. Why should we look for log-convex functions? Because factorial is log-convex, and so this is a natural property to require of its extension.

Update: Occasionally I hear someone say that the gamma function (shifting its argument by 1) is the only analytic function that extends factorial to the complex plane, but this isn’t true. For example, if you add sin(πx) to the gamma function, you get another analytic function that takes on the same values as gamma for positive integer arguments.

Related posts:

* * *

[1] Theorems about binomial coefficients have to make some restrictions on the arguments. See these notes for full details. But in the case of dealing cards, the only necessary constraints are the natural ones: we assume the number of cards in the deck and the number we want in a hand are non-negative integers, and that we’re not trying to draw more cards for a hand than there are in a deck. Defining 0! as 1 keeps us from having to make any unnatural qualifications such as “unless you’re dealing the entire deck.”

Integration by Darts

Monte Carlo integration has been called “Integration by Darts,” a clever pun on “integration by parts.” I ran across the phrase looking at some slides by Brian Hayes, but apparently it’s been around a while. The explanation that Monte Carlo is “integration by darts” is fine as a 0th order explanation, but it can be misleading.

Introductory courses explain Monte Carlo integration as follows.

  1. Plot the function you want to integrate.
  2. Draw a box that contains the graph.
  3. Throw darts (random points) at the box.
  4. Count the proportion of darts that land between the graph and the horizontal axis.
  5. Estimate the area under the graph by multiplying the area of the box by the proportion above.

In principle this is correct, but this is far from how Monte Carlo integration is usually done in practice.

For one thing, Monte Carlo integration is seldom used to integrate functions of one variable. Instead, it is mostly used on functions of many variables, maybe hundreds or thousands of variables. This is because more efficient methods exist for low-dimensional integrals, but very high dimensional integrals can usually only be computed using Monte Carlo or some variation like quasi-Monte Carlo.

If you draw a box around your integrand, especially in high dimensions, it may be that nearly all your darts fall outside the region you’re interested in. For example, suppose you throw a billion darts and none land inside the volume determined by your integration problem. Then the point estimate for your integral is 0. Assuming the true value of the integral is positive, the relative error in your estimate is 100%. You’ll need a lot more than a billion darts to get an accurate estimate. But is this example realistic? Absolutely. Nearly all the volume of a high-dimensional cube is in the “corners” and so putting a box around your integrand is naive. (I’ll elaborate on this below. [1])

So how do you implement Monte Carlo integration in practice? The next step up in sophistication is to use “importance sampling.” [2] Conceptually you’re still throwing darts at a box, but not with a uniform distribution. You find a probability distribution that approximately matches your integrand, and throw darts according to that distribution. The better the fit, the more efficient the importance sampler. You could think of naive importance sampling as using a uniform distribution as the importance sampler. It’s usually not hard to find an importance sampler much better than that. The importance sampler is so named because it concentrates more samples in the important regions.

Importance sampling isn’t the last word in Monte Carlo integration, but it’s a huge improvement over naive Monte Carlo.

* * *

[1] So what does it mean to say most of the volume of a high-dimensional cube is in the corners? Suppose you have an n-dimensional cube that runs from -1 to 1 in each dimension and you have a ball of radius 1 inside the cube. To make the example a little simpler, assume n is even, n = 2k. Then the volume of the cube is 4k and the volume of the sphere is πk / k!. If k = 1 (n = 2) then the sphere (circle in this case) takes up π/4 of the volume (area), about 79%. But when k = 100 (n = 200), the ball takes up 3.46×10-169 of the volume of the cube. You could never generate enough random samples from the cube to ever hope to land a single point inside the ball.

[2] In a nutshell, importance sampling replaces the problem of integrating f(x) with that of integrating (f(x) / g(x)) g(x) where g(x) is the importance sampler, a probability density. Then the integral of (f(x) / g(x)) g(x) is the expected value of (f(X) / g(X)) where X is a random variable with density given by the importance sampler. It’s often a good idea to use an importance sampler with slightly heavier tails than the original integrand.

If you’d like some help with numerical integration, let me know.

Mathematical arbitrage

I suspect there’s a huge opportunity in moving mathematics from the pure column to the applied column. There may be a lot of useful math that never sees application because the experts are unconcerned with or unaware of applications.

In particular I wonder what applications there may be of number theory, especially analytic number theory. I’m not thinking of the results of number theory but rather the elegant machinery developed to attack problems in number theory. I expect more of this machinery could be useful to problems outside of number theory.

I also wonder about category theory. The theory certainly finds uses within pure mathematics, but I’m not sure how useful it is in direct application to problems outside of mathematics. Many of the reported applications don’t seem like applications at all, but window dressing applied after-the-fact. On the other hand, there are also instances where categorical thinking led the way to a solution, but did its work behind the scenes; once a solution was in hand, it could be presented more directly without reference to categories. So it’s hard to say whether applications of category theory are over-reported or under-reported.

The mathematical literature can be misleading. When researchers say their work has various applications, they may be blowing smoke. At the same time, there may be real applications that are never mentioned in journals, either because the work is proprietary or because it is not deemed original in the academic sense of the word.

Extremely small probabilities

One objection to modeling adult heights with a normal distribution is that the former is obviously positive but the latter can be negative. However, by this model negative heights are astronomically unlikely. I’ll explain below how one can take “astronomically” literally in this context.

A common model says that men’s and women’s heights are normally distributed with means of 70 and 64 inches respectively, both with a standard deviation of 3 inches. A woman with negative height would be 21.33 standard deviations below the mean, and a man with negative height would be 23.33 standard deviations below the mean. These events have probability 3 × 10-101 and 10-120 respectively. Or to write them out in full




As I mentioned on Twitter yesterday, if you’re worried about probabilities that require scientific notation to write down, you’ve probably exceeded the resolution of your model. I imagine most probability models are good to two or three decimal places at most. When model probabilities are extremely small, factors outside the model become more important than ones inside.

According to Wolfram Alpha, there are around 1080 atoms in the universe. So picking one particular atom at random from all atoms in the universe would be on the order of a billion trillion times more likely than running into a woman with negative height. Of course negative heights are not just unlikely, they’re impossible. As you travel from the mean out into the tails, the first problem you encounter with the normal approximation is not that the probability of negative heights is over-estimated, but that the probability of extremely short and extremely tall people is under-estimated. There exist people whose heights would be impossibly unlikely according to this normal approximation. See examples here.

Probabilities such as those above have no practical value, but it’s interesting to see how you’d compute them anyway. You could find the probability of a man having negative height by typing pnorm(-23.33) into R or scipy.stats.norm.cdf(-23.33) into Python. Without relying on such software, you could use the bounds

\frac{x}{\sqrt{2\pi}(x^2 + 1)} \exp(-x^2/2) < \Phi^c(x) < \frac{1}{\sqrt{2\pi}\,x} \exp(-x^2/2)

with x equal to -21.33 and -23.33. For a proof of these bounds and tighter bounds see these notes.

Random walks and the arcsine law

Suppose you stand at 0 and flip a fair coin. If the coin comes up heads, you take a step to the right. Otherwise you take a step to the left. How much of the time will you spend to the right of where you started?

As the number of steps N goes to infinity, the probability that the proportion of your time in positive territory is less than x approaches 2 arcsin(√x)/π. The arcsine term gives this rule its name, the arcsine law.

Here’s a little Python script to illustrate the arcsine law.

import random
from numpy import arcsin, pi, sqrt

def step():
u = random.random()
return 1 if u < 0.5 else -1 M = 1000 # outer loop N = 1000 # inner loop x = 0.3 # Use any 0 < x < 1 you'd like. outer_count = 0 for _ in range(M): n = 0 position= 0 inner_count = 0 for __ in range(N): position += step() if position > 0:
inner_count += 1
if inner_count/N < x: outer_count += 1 print (outer_count/M) print (2*arcsin(sqrt(x))/pi) [/code]

Playing with continued fractions and Khinchin’s constant

Take a real number x and expand it as a continued fraction. Compute the geometric mean of the first n coefficients.

Aleksandr Khinchin proved that for almost all real numbers x, as n → ∞ the geometric means converge. Not only that, they converge to the same constant, known as Khinchin’s constant, 2.685452001…. (“Almost all” here mean in the sense of measure theory: the set of real numbers that are exceptions to Khinchin’s theorem have measure zero.)

To get an idea how fast this convergence is, let’s start by looking at the continued fraction expansion of π. In Sage, we can type


to get the continued fraction coefficient

[3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, 2, 2, 2, 1, 84, 2, 1, 1, 15, 3]

for π to 100 decimal places. The geometric mean of these coefficients is 2.84777288486, which only matches Khinchin’s constant to 1 significant figure.

Let’s try choosing random numbers and working with more decimal places.

There may be a more direct way to find geometric means in Sage, but here’s a function I wrote. It leaves off any leading zeros that would cause the geometric mean to be zero.

from numpy import exp, mean, log
def geometric_mean(x):
    return exp( mean([log(k) for k in x if k > 0]) )

Now let’s find 10 random numbers to 1,000 decimal places.

for _ in range(10):
    r = RealField(1000).random_element(0,1)

This produced


Three of these agree with Khinchin’s constant to two significant figures but the rest agree only to one. Apparently the convergence is very slow.

If we go back to π, this time looking out 10,000 decimal places, we get a little closer:


produces 2.67104567579, which differs from Khinchin’s constant by about 0.5%.

Grand unification of mathematics

Greg Egan’s short story Glory features a “xenomathematician” who discovers that an ancient civilization had produced a sort of grand unification of their various branches of mathematics.

It was not a matter of everything in mathematics collapsing in on itself, with one branch turning out to have been merely a recapitulation of another under a different guise. Rather, the principle was that every sufficiently beautiful mathematical system was rich enough to mirror in part — and sometimes in a complex and distorted fashion — every other sufficiently beautiful system. Nothing became sterile and redundant, nothing proved to have been a waste of time, but everything was shown to be magnificently intertwined.

Natural optima occur in the middle

Akin’s eighth law of spacecraft design says

In nature, the optimum is almost always in the middle somewhere. Distrust assertions that the optimum is at an extreme point.

When I first read this I immediately thought of several examples where theory said that an optima was at an extreme, but experience said otherwise.

Linear programming (LP) says the opposite of Akin’s law. The optimal point for a linear objective function subject to linear constraints is always at an extreme point. The constraints form a many-sided shape—you could think of it something like a diamond—and the optimal point will always be at one of the corners.

Nature is not linear, though it is often approximately linear over some useful range. One way to read Akin’s law is to say that even when something is approximately linear in the middle, there’s enough non-linearity at the extremes to pull the optimal points back from the edges. Or said another way, when you have an optimal value at an extreme point, there may be some unrealistic aspect of your model that pushed the optimal point out to the boundary.

Related postData calls the model’s bluff

Another reason natural logarithms are natural

In mathematics, log means natural logarithm by default; the burden of explanation is on anyone taking logarithms to a different base. I elaborate on this a little here.

Looking through Andrew Gelman and Jennifer Hill’s regression book, I noticed a justification for natural logarithms I hadn’t thought about before.

We prefer natural logs (that is, logarithms base e) because, as described above, coefficients on the natural-log scale are directly interpretable as approximate proportional differences: with a coefficient of 0.06, a difference of 1 in x corresponds to an approximate 6% difference in y, and so forth.

This is because

exp(x) ≈ 1 + x

for small values of x based on a Taylor series expansion. So in Gelman and Hill’s example, a difference of 0.06 on a natural log scale corresponds to roughly multiplying by 1.06 on the original scale, i.e. a 6% increase.

The Taylor series expansion for exponents of 10 is not so tidy:

10x ≈ 1 + 2.302585 x

where 2.302585 is the numerical value of the natural log of 10. This means that a change of 0.01 on a log10 scale corresponds to an increase of about 2.3% on the original scale.

Related post: Approximation relating lg, ln, and log10