Casting out sevens

A while back I wrote about a method to test whether a number is divisible by seven. I recently ran across another method for testing divisibility by 7 in Martin Gardner’s book The Unexpected Hanging and Other Mathematical Diversions. The method doesn’t save too much effort compared to simply dividing by 7, but it’s interesting. It looks a little mysterious at first, though the explanation of why it works is very simple.

Suppose you want to find whether a number n is divisible by 7. Start with the last digit of n and write a 1 under the last digit, and a 3 under the next to last digit. The digits under the digits of n cycle through 1, 3, 2, 6, 4, 5, repeatedly until there is something under each digit in n. Now multiply each digit of n by the digit under it and add the results.

For example, suppose n = 1394. Write the digits 1, 3, 2, and 6 underneath, from right to left:

1394
6231

The sum we need to compute is 1*6 + 3*2 + 9*3 + 4*1 = 6 + 6 + 27 + 4 = 43.

This sum, 43 in our example, has the same remainder when divided by 7 as the original number, 1394 in our example. Since 43 is not divisible by 7, neither is 1394. In fact, 43 leaves a remainder of 1 when divided by 7, so 1394 also leaves a remainder of 1.

You could apply this method repeatedly, though in this case 43 is small enough that it’s easy to see that it leaves a remainder of 1.

Suppose you started with a 1000-digit number n. Each digit is no more than 9 and is multiplied by a number no more than 6, so the sum would be less than 54,000. In one step you’ve gone from a 1000-digit number to a number with at most five digits. One or two more steps should be enough for the remainder to be obvious.

Why does this method work? The key is that the multipliers 1, 3, 2, 6, 4, 5 are the remainders when the powers of 10 are divided by 7. Since 10^6 has a remainder of 1 when divided by 7, the numbers 10^(6a+b) and 10^b have the same remainder by 7, and that’s why the multipliers have period 6.

All the trick does is expand the base 10 representation of a number and add up the remainders when each term is divided by seven. In our example, 1394 = 1*1000 + 3*100 + 9*10 + 4*1, and mod 7 this reduces to 1*6 + 3*2 + 9*3 + 4*1, exactly the calculation above.

The trick presented here is analogous to casting out nines. But since every power of 10 leaves a remainder of 1 when divided by 9, all the multipliers in casting out nines are 1.

You could follow the pattern of this method to create a divisibility rule for any other divisor, say 13 for example, by letting the multipliers be the remainders when powers of 10 are divided by 13.
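Here’s how you might implement this generalization in Python. This is just a sketch; the function name is mine, not part of the original method.

    def weighted_digit_sum(n, divisor=7):
        # Multiply each digit of n by the remainder of the corresponding
        # power of 10 mod divisor and add the results.
        total, multiplier = 0, 1
        for digit in reversed(str(n)):
            total += int(digit) * multiplier
            multiplier = (multiplier * 10) % divisor
        return total

    print(weighted_digit_sum(1394))           # 43
    print(weighted_digit_sum(1394) % 7)       # 1, same as 1394 % 7
    print(weighted_digit_sum(1394, 13) % 13)  # 3, same as 1394 % 13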

Related post: Divisibility rules in hex

Computing square triangular numbers

The previous post stated a formula for f(n), the nth square triangular number (i.e. the nth triangular number that is also a square number):

((17 + 12√2)^n + (17 − 12√2)^n − 2)/32

Now 17 − 12√2 is 0.029… and so the term (17 − 12√2)^n approaches zero very quickly as n increases. So f(n) is very nearly

((17 + 12√2)^n − 2)/32

The error in the approximation is less than 0.001 for all n, so the approximation is exact when rounded to the nearest integer. So the nth square triangular number is

⌊((17 + 12√2)^n + 14)/32⌋

where ⌊x⌋ is the greatest integer less than or equal to x.

Here’s how you might implement this in Python:

    from math import sqrt, floor

    def f(n):
        x = 17 + 12*sqrt(2)
        return floor((x**n + 14)/32.0)

Unfortunately this formula isn’t that useful in ordinary floating point computation. When n is 11 or larger, the result needs more bits than are available in the significand of a floating point number. The computed value is accurate to around 15 significant digits, but the exact result is longer than 15 digits.

The result can also be computed with a recurrence relation:

f(n) = 34 f(n-1) – f(n-2) + 2

for n > 2. (Or n > 1 if you define f(0) to be 0, a reasonable thing to do).

This only requires integer arithmetic, so its only limitation is the size of the integers you can compute with. Since Python has arbitrarily large integers, the following works for very large integers.

    def f(n):
        if n < 2:
            return n
        return f(n-1)*34 - f(n-2) + 2

This is a direct implementation, though it’s inefficient because of the redundant function evaluations. It could be made more efficient by, for example, using memoization.
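For example, here’s how you might make the recurrence iterative, avoiding the redundant evaluations. This is a sketch of the idea rather than anything from the original post.

    def f(n):
        # Iterate the recurrence f(n) = 34 f(n-1) - f(n-2) + 2
        # starting from f(0) = 0 and f(1) = 1.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, 34*b - a + 2
        return a

    print([f(n) for n in range(5)])   # [0, 1, 36, 1225, 41616]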

When is a triangle a square?

Of course a triangle cannot be a square, but a triangular number can be a square number.

A triangular number is the sum of the first so many positive integers. For example, 10 is a triangular number because it equals 1+2+3+4. These numbers are called triangular numbers because you can form a triangle by stacking a row of one coin, a row of two coins, a row of three coins, and so on.

The smallest number that is both triangular and square is 1. The next smallest is 36. There are infinitely many numbers that are both triangular and square, and there’s even a formula for the nth number that is both a triangle and a square:

((17 + 12√2)^n + (17 − 12√2)^n − 2)/32

Source: American Mathematical Monthly, February 1962, page 169.
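Here’s a quick brute force check in Python that turns up the same initial values:

    from math import isqrt

    def is_square(n):
        return isqrt(n)**2 == n

    # Triangular numbers are k(k+1)/2; keep the ones that are also squares.
    print([k*(k+1)//2 for k in range(1, 300) if is_square(k*(k+1)//2)])
    # [1, 36, 1225, 41616]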

For more on triangle numbers and their generalizations, see Twelve Days of Christmas and Tetrahedral Numbers.

There is also a way to compute square triangular numbers recursively, which is discussed in the next post.

Attributing changes to numerator and denominator

This afternoon I helped someone debug a financial spreadsheet. One of the reasons spreadsheets can be so frustrating to work with is that assumptions are hard to see. You have to click on cells one at a time to find formulas, then decode cell coordinates into their meanings.

The root problem turned out to be an assumption that sounds reasonable. You’re making two changes, one to a numerator and one to a denominator. The total change equals the sum of the results of each change separately. Except that’s not so.

At this point, a mathematician would say “Of course you can’t split the effects like that. It’s nonlinear.” But it’s worth pursuing a little further. For one thing, it doesn’t help a general audience to just say “it’s nonlinear.” For another, it’s worth seeing when it is appropriate, at least approximately, to attribute the effects this way.

You start with a numerator and denominator, N/D, then change N to N+n and change D to D+d. The total change is then (N+n)/(D+d) – N/D.

The result from only the change in the numerator is n/D. The result from only the change in denominator is N/(D+d) – N/D.

The difference between the total change and the sum of the two partial changes is

−nd / (D(D + d)).

The assumption that you can take the total change and attribute it to each change separately is wrong in general. But it is correct if n or d is zero, and it is approximately correct when nd is small. This can make the bug harder to find. It could also be useful when nd is indeed small and you don’t need to be exact.

Also, if all the terms are positive, the discrepancy is negative, i.e. the total change is less than the sum of the partial changes. Said another way, allocating the change to each cause separately over-estimates the total change.
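Here’s a quick numerical illustration with made-up numbers:

    # Toy example: the gap between the total change and the sum of the
    # partial changes should equal -n*d / (D*(D + d)).
    N, D, n, d = 100.0, 400.0, 10.0, 20.0

    total    = (N + n)/(D + d) - N/D
    num_only = n/D
    den_only = N/(D + d) - N/D

    print(total - (num_only + den_only))   # -0.00119...
    print(-n*d / (D*(D + d)))              # same value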


Last digit of largest known prime

In my previous post, we looked at the largest known prime, P = 2^57885161 − 1, and how many digits it has in various bases. This post looks at how to find the last digit of P in base b. We assume b < P. (If b = P then the last digit is 0, and if b > P the last digit is P.)

If b is a power of 2, we showed in the previous post that the last digit of P is b-1.

If b is odd, we can find the last digit using Euler’s totient theorem. Let φ(b) be the number of positive integers less than b and relatively prime to b. Then Euler’s theorem tells us that 2^φ(b) ≡ 1 mod b since b is odd. So if r is the remainder when 57885161 is divided by φ(b), then the last digit of Q = P+1 in base b is the same as the last digit of 2^r in base b.

For example, suppose we wanted to know the last digit of P in base 15. Since φ(15) = 8, and the remainder when 57885161 is divided by 8 is 1, the last digit of Q base 15 is 2. So the last digit of P is 1.

So we know how to compute the last digit in base b if b is a power of 2 or odd. Factor out all the powers of 2 from b so that b = 2^k d with d odd. We can find the last digit base 2^k, and we can find the last digit base d, so can we combine these to find the last digit base b? Yes, that’s exactly what the Chinese Remainder Theorem is for.

To illustrate, suppose we want to find the last digit of P in base 12. P ≡ 3 mod 4 and P ≡ 1 mod 3, so P ≡ 7 mod 12. (The numbers are small enough here to guess. For a systematic approach, see the post mentioned above.) So the last digit of P is 7 in base 12.

If you’d like to write a program to play around with this, you need to be able to compute φ(n). You may have an implementation of this function in your favorite environment. For example, it’s sympy.ntheory.totient in Python. If not, it’s easy to compute φ(n) if you can factor n. You just need to know two things about φ. First, it’s a multiplicative function, meaning that if a and b are relatively prime, then φ(ab) = φ(a) φ(b). Second, if p is a prime, then φ(p^k) = p^k − p^(k−1).
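Here’s how you might check these results in Python, using SymPy for φ and the built-in pow for modular exponentiation (which computes the last digit directly):

    from sympy import totient

    E = 57885161   # P = 2**E - 1

    def last_digit(b):
        # Last digit of P in base b, i.e. P mod b, for 1 < b < P.
        return (pow(2, E, b) - 1) % b

    r = E % int(totient(15))   # remainder of the exponent mod phi(15) = 8
    print(pow(2, r, 15) - 1)   # 1, the last digit of P in base 15
    print(last_digit(15), last_digit(12))   # 1 7, matching the examples above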

Playing with the largest known prime

The largest known prime at the moment is P = 2^57885161 − 1. This means that in binary, the number is a string of 57,885,161 ones.

You can convert binary numbers to other bases that are powers of two, say 2^k, by starting at the right end and converting to the new base in blocks of size k. So in base 4, we’d start at the right end and convert pairs of bits to base 4. Until we get to the left end, all the pairs are 11, and so they all become 3’s in base four. So in base 4, P would be written as a 1 followed by 28,942,580 threes.

Similarly, we could write P in base 8 by starting at the right end and converting triples of bits to base 8. Since 57885161 = 3*19295053 + 2, we’ll have two bits left over when we get to the left end, so in base 8, P would be written as 3 followed by 19,295,053 sevens. And since 57885161 = 4*14471290 + 1, the hexadecimal (base 16) representation of P is a 1 followed by 14,471,290 F’s.

What about base 10, i.e. how many digits does P have? The number of digits in a positive integer n is ⌊log_10(n)⌋ + 1 where ⌊x⌋ is the greatest integer less than or equal to x. For positive numbers, this just means chop off the decimals and keep the integer part.

We’ll make things slightly easier for a moment and find how many digits P+1 has.

log_10(P+1) = log_10(2^57885161) = 57885161 log_10(2) = 17425169.76…

and so P+1 has 17425170 digits, and so does P. How do we know P and P+1 have the same number of digits? There are several ways you could argue this. One is that P+1 is not a power of 10, so the number of digits couldn’t change between P and P+1.

Update: How many digits does P have in other bases?

We can take the formula above and simply substitute b for 10: the base b representation of a positive integer n has ⌊log_b(n)⌋ + 1 digits.

As above, we’d rather work with Q = P+1 than P. The numbers P and Q will have the same number of digits unless Q is a power of the base b. But since Q is a power of 2, this could only happen if b is also a power of two, say b = 2^k, with k a divisor of 57885161. Since 57885161 is prime, that means b = 2 or b = Q. We’ve already handled base 2, and if b > P, then P has one digit base b. So we can concentrate on the case 2 < b < Q, in which case P and Q have the same number of digits. The number of digits in Q is then ⌊57885161 log_b(2)⌋ + 1.

If b = 2^k, we can generalize the discussion above about bases 4, 8, and 16. Let q and r be the quotient and remainder when 57885161 is divided by k. If r > 0, the leading digit is 2^r − 1, i.e. the r leftover bits, all ones, and the remaining q digits are b−1. (If r = 0, there are q digits, all equal to b−1.)
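Here’s how you might compute the digit counts in Python. Note that this sketch uses floating point, so a result very close to an integer boundary would need to be checked with exact arithmetic.

    from math import log, floor

    E = 57885161   # P = 2**E - 1

    def digits(b):
        # Number of digits of P in base b, for 2 < b < P.
        return floor(E * log(2, b)) + 1

    print(digits(10))   # 17425170
    print(digits(16))   # 14471291, a 1 followed by 14,471,290 F's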

Next post: Last digit of largest known prime


Last digits of Fibonacci numbers

If you write out a sequence of Fibonacci numbers, you can see that the last digits repeat every 60 numbers.

The 61st Fibonacci number is 2504730781961. The 62nd is 4052739537881. Since these end in 1 and 1, the 63rd Fibonacci number must end in 2, etc. and so the pattern starts over.

It’s not obvious that the cycle should have length 60, but it is fairly easy to see that there must be a cycle. There are only 10*10 possibilities for two consecutive digits. Since the Fibonacci numbers are determined by a two-term recurrence, and since the last digit of a sum is determined by the sum of the last digits, the sequence of last digits must repeat eventually. Here “eventually” means after at most 10*10 terms.

Replace “10” by any other base in the paragraph above to show that the sequence of last digits must be cyclic in any base. In base 16, for example, the period is 24. In hexadecimal notation the 25th Fibonacci number is 12511 and the 26th is 1DA31, so the 27th must end in 2, etc.

Here’s a little Python code to find the period of the last digits of Fibonacci numbers working in any base b.

    from sympy import fibonacci as f

    def period(b):
        # Return the smallest i > 0 with f(i) % b == 0 and f(i+1) % b == 1,
        # i.e. the period of the last digits of Fibonacci numbers in base b.
        for i in range(1, b*b + 1):
            if f(i) % b == 0 and f(i+1) % b == 1:
                return i
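For example, the periods mentioned above:

    print(period(10), period(16), period(100))   # 60 24 300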

This shows that in base 100 the period is 300. So in base 10 the last two digits repeat every 300 terms.

The period seems to vary erratically with base as shown in the graph below.

Life lessons from differential equations

Ten life lessons from differential equations:

  1. Some problems simply have no solution.
  2. Some problems have no simple solution.
  3. Some problems have many solutions.
  4. Determining that a solution exists may be half the work of finding it.
  5. Solutions that work well locally may blow up when extended too far.


High-dimensional integration

Numerically integrating functions of many variables almost always requires Monte Carlo integration or some variation. Numerical analysis textbooks often emphasize one-dimensional integration and, almost as a footnote, say that you can use a product scheme to bootstrap one-dimensional methods to higher dimensions. True, but impractical.

Traditional numerical integration routines work well in low dimensions. (“Low” depends on context, but let’s say less than 10 dimensions.) When you have the choice of a well-understood, deterministic method, and it works well, by all means use it. I’ve seen people who are so used to problems requiring Monte Carlo methods that they use these methods on low-dimensional problems where they’re not the most efficient (or robust) alternative.

Monte Carlo

Except for very special integrands, high dimensional integration requires either Monte Carlo integration or some variation such as quasi-Monte Carlo methods.

As I wrote about here, naive Monte Carlo integration isn’t practical for high dimensions. It’s unlikely that any of your integration points will fall in the region of interest without some sort of importance sampler. Constructing a good importance sampler requires some understanding of your particular problem. (Sorry. No one-size-fits-all black-box methods are available.)

Quasi-Monte Carlo (QMC)

Quasi-Monte Carlo methods (QMC) are interesting because they rely on something like a random sequence, but one that is “more uniformly distributed” in a sense, i.e. more effective at exploring a high-dimensional space. These methods work well when an integrand has low effective dimension. The effective dimension is low when nearly all the action takes place on a lower dimensional manifold. For example, a function may ostensibly depend on 10,000 variables, while for practical purposes 12 of these are by far the most important. There are some papers by Art Owen that say, roughly, that Quasi-Monte Carlo methods work well if and only if the effective dimension is much smaller than the nominal dimension. (QMC methods are especially popular in financial applications.)

While QMC methods can be more efficient than Monte Carlo methods, it’s hard to get bounds on the integration error with these methods. Hybrid approaches, such as randomized QMC, can perform remarkably well and give theoretical error bounds.
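Here’s a little sketch of the comparison, assuming SciPy’s scipy.stats.qmc module is available (SciPy 1.7 or later); the test integrand, dimension, and sample size are arbitrary choices for illustration.

    import numpy as np
    from scipy.stats import qmc   # requires a recent SciPy

    d, n = 10, 2**12

    def f(x):
        # Test integrand whose exact integral over the unit cube [0,1]^d is 1.
        return np.prod(1 + 0.3*(x - 0.5), axis=1)

    mc = f(np.random.default_rng(0).random((n, d))).mean()    # plain Monte Carlo
    sobol = qmc.Sobol(d=d, scramble=True, seed=0).random(n)   # scrambled Sobol points
    q = f(sobol).mean()

    print(abs(mc - 1), abs(q - 1))   # the QMC error is typically much smaller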

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) is a common variation on Monte Carlo integration that uses dependent random samples. In many applications, such as Bayesian statistics, you’d like to be able to create independent random samples from some probability distribution, but this is not practical. MCMC is a compromise. It produces a Markov chain of dependent samples, and asymptotically these samples have the desired distribution. The dependent nature of the samples makes it difficult to estimate error and to determine how well the integration estimates using the Markov chain have converged.

(Some people speak of the Markov chain itself converging. It doesn’t. The estimates using the Markov chain may converge. Speaking about the chain converging may just be using informal language, or it may betray a lack of understanding.)

There is little theory regarding the convergence of estimates from MCMC, aside from toy problems. An estimate can appear to have converged, while an important region of the integration problem remains unexplored. Despite its drawbacks, MCMC is the only game in town for many applications.

Consulting

If you’d like help with high dimensional integration, or any other kind of numerical integration, please contact me to discuss how we might work together.

When the last digits of powers don’t change

If you raise any integer to the fifth power, its last digit doesn’t change. For example, 2^5 = 32.

It’s easy to prove this assertion by brute force. Since the last digit of b^n only depends on the last digit of b, it’s enough to verify that the statement above holds for 0, 1, 2, …, 9.

Although that’s a valid proof, it’s not very satisfying. Why fifth powers? We’re implicitly working in base 10, as humans usually do, so what is the relation between 5 and 10 in this context? How might we generalize it to other bases?

The key is that 5 = φ(10) + 1 where φ(n) is the number of positive integers less than n and relatively prime to n.

Euler discovered the φ function and proved that if a and m are relatively prime, then

a^φ(m) = 1 (mod m)

This means that a^φ(m) − 1 is divisible by m. (Technically I should use the symbol ≡ (U+2261) rather than = above since the statement is a congruence, not an equation. But since Unicode font support on various devices is disappointing, not everyone could see the proper symbol.)

Euler’s theorem tells us that if a is relatively prime to 10 then a^4 ends in 1, so a^5 ends in the same digit as a. That proves our original statement for numbers ending in 1, 3, 7, and 9. But what about the rest, numbers that are divisible by 2 or 5? We’ll answer that question and a little more general one at the same time. Let m = αβ where α and β are distinct primes. In particular, we could choose α = 2 and β = 5. We will show that

a^(φ(m)+1) = a (mod m)

for all a, whether relatively prime to m or not. This would show in addition, for example, that in base 15, every number keeps the same last “digit” when raised to the 9th power since φ(15) = 8.

We only need to be concerned with the case of a being a multiple of α or β since Euler’s theorem proves our theorem for the rest of the cases. If a is divisible by both α and β, then a is congruent to 0 mod m and our theorem obviously holds, so assume a is a multiple of α alone, a = kα with k less than β (and hence relatively prime to β). The case of a multiple of β alone is completely analogous.

We need to show that αβ divides (kα)^(φ(αβ)+1) − kα. This expression is clearly divisible by α, so the remaining task is to show that it is divisible by β. We’ll show that (kα)^φ(αβ) − 1 is divisible by β.

Since α and k are relatively prime to β, Euler’s theorem tells us that α^φ(β) and k^φ(β) are congruent to 1 (mod β). This implies that (kα)^φ(β) is congruent to 1, and so (kα)^(φ(α)φ(β)) is also congruent to 1 (mod β). One of the basic properties of φ is that for relatively prime arguments α and β, φ(αβ) = φ(α)φ(β), and so we’re done.
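Here’s a quick numerical check of the theorem, using SymPy for φ:

    from sympy import totient

    def keeps_last_digit(m):
        # Check that a**(phi(m) + 1) ends in the same base-m digit as a,
        # for every a from 0 to m-1.
        e = int(totient(m)) + 1
        return all(pow(a, e, m) == a for a in range(m))

    print(keeps_last_digit(10), keeps_last_digit(15))   # True True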

Exercise: How much more can you generalize this? Does the base have to be the product of two distinct primes?

Pedantic arithmetic rules

Generations of math teachers have drilled into their students that they must reduce fractions. That serves some purpose in the early years, but somewhere along the way students need to learn that reducing fractions is not only unnecessary at times, but can be bad for communication. For example, if the fraction 45/365 comes up in the discussion of something that happened 45 days in a year, the fraction 45/365 is clearer than 9/73. The fraction 45/365 is not simpler in a number theoretic sense, but it is psychologically simpler since it’s obvious where the denominator came from. In this context, writing 9/73 is not a simplification but an obfuscation.

Simplifying fractions sometimes makes things clearer, but not always. It depends on context, and context is something students don’t understand at first. So it makes sense to be pedantic at some stage, but then students need to learn that clear communication trumps pedantic conventions.

Along these lines, there is an old taboo against having radicals in the denominator of a fraction. For example, 3/√5 is not allowed and should be rewritten as 3√5/5. This is an arbitrary convention now, though there once was a practical reason for it, namely that in hand calculations it’s easier to multiply by a long decimal than to divide by it. So, for example, if you had to reduce 3/√5 to a decimal in the old days, you’d look up √5 in a table to find it equals 2.2360679775. It would be easier to compute 0.6*2.2360679775 by hand than to compute 3/2.2360679775.

As with unreduced fractions, radicals in the denominator might be not only mathematically equivalent but psychologically preferable. If there’s a 3 in some context, and a √5, then it may be clear that 3/√5 is their ratio. In that same context someone may look at 3√5/5 and ask “Where did that factor of 5 in the denominator come from?”

A possible justification for the rules above is that they provide standard forms that make grading easier. But this is only true for the simplest exercises. With moderately complicated exercises, following a student’s work is harder than determining whether two expressions represent the same number.

One final note on pedantic arithmetic rules: If the order of operations isn’t clear, make it clear. Add a pair of parentheses if you need to. Or write division operations as one thing above a horizontal bar and another below, not using the division symbol. Then you (and your reader) don’t have to worry whether, for example, multiplication has higher precedence than division or whether both have equal precedence and are carried out left to right.

Why is an empty sum 0 and an empty product 1?

In response to my earlier post on why 0! should be 1, several people replied that 0! = 1 because an empty product is 1. You can define the factorial of an integer n as the product of all positive integers less than or equal to n. There are no positive integers less than or equal to 0, so 0! is an empty product. But this raises the question of why an empty product should be 1.

You could say that an empty sum is 0 because 0 is the additive identity and an empty product is 1 because 1 is the multiplicative identity. If you’d like a simple answer, maybe you should stop reading here.

The problem with the answer above is that it doesn’t say why an operation on an empty set should be defined to be the identity for that operation. The identity is certainly a plausible candidate, but why should it make sense to even define an operation on an empty set, and why should the identity turn out so often to be the definition that makes things proceed smoothly?

The convention that the sum over an empty set should be defined as 0, and that a product over an empty set should be defined to be 1, works well in very general settings where “sum”, “product”, “0”, and “1” take on abstract meanings.
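Incidentally, Python follows these conventions for empty sums and products:

    from math import prod

    print(sum([]))    # 0, the empty sum
    print(prod([]))   # 1, the empty product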

The ultimate generalization of products is the notion of products in category theory. Similarly, the ultimate generalization of sums is categorical co-products. (Co-products are sometimes called sums, but they’re usually called co-products due to a symmetry with products.) Category theory simultaneously addresses a wide variety of operations that could be called products or sums (co-products).

The particular advantage of bringing category theory into this discussion is that it has definitions of product and co-product that are the same for any number of objects, including zero objects; there is no special definition for empty products. Empty products and co-products are a consequence of a more general definition, not special cases defined by convention.

In the category of sets, products are Cartesian products. The product of a set with n elements and one with m elements is one with nm elements. Also in the category of sets, co-products are disjoint unions. The co-product of a set with n elements and one with m elements is one with n+m elements. These examples show a connection between products and sums in arithmetic and products and co-products in category theory.

You can find the full definition of a categorical product here. Below I give the definition leaving out details that go away when we look at empty products.

The product of a set of objects is an object P such that given any other object X … there exists a unique morphism from X to P such that ….

If you’ve never seen this before, you might rightfully wonder what in the world this has to do with products. You’ll have to trust me on this one. [1]

When the set of objects is empty, the missing parts of the definition above don’t matter, so we’re left with requiring that there is a unique morphism [2] from each object X to the product P. In other words, P is a terminal object, often denoted 1. So in category theory, you can say empty products are 1.

But that seems like a leap, since “1” now takes on a new meaning that isn’t obviously connected to the idea of 1 we learned as a child. How is an object such that every object has a unique arrow to it at all like, say, the number of noses on a human face?

We drew a connection between arithmetic and categories before by looking at the cardinality of sets. We could define the product of the numbers n and m as the number of elements in the product of a set with n elements and one with m elements. Similarly we could define 1 as the cardinality of the terminal object, also denoted 1. This is because there is a unique map from any set to a set with 1 element. Pick your favorite one-element set and call it 1. Any other choice is isomorphic to your choice.

Now for empty sums. The following is the definition of co-product (sum), leaving out details that go away when we look at empty co-products.

The co-product of a set of objects is an object S such that given any other object X … there exists a unique morphism from S to X such that ….

As before, when the set of objects is empty, the missing parts don’t matter. Notice that the direction of the arrow in the definition is reversed: there is a unique morphism from the co-product S to any object X. In other words, S is an initial object, denoted for good reasons as 0.  [3]

In set theory, the initial object is the empty set. (If that hurts your head, you’re not alone. But if you think of functions in terms of sets of ordered pairs, it makes a little more sense. A function from the empty set to another set is an empty set of ordered pairs!) The cardinality of the initial object 0 is the integer 0, just as the cardinality of the terminal object 1 is the integer 1.

* * *

[1] Category theory has to define operations entirely in terms of objects and morphisms. It can’t look inside an object and describe things in terms of elements the way you’d usually do to define the product of two numbers or two sets, so the definition of product has to look very different. The benefit of this extra work is a definition that applies much more generally.

To understand the general definition of products, start by understanding the product of two objects. Then learn about categorical limits and how products relate to limits. (As with products, the categorical definition of limits will look entirely different from familiar limits, but they’re related.)

[2] Morphisms are a generalization of functions. In the category of sets, morphisms are functions.

[3] Sometimes initial objects are denoted by ∅, the symbol for the empty set, and sometimes by 0. To make things more confusing, a “zero,” spelled out as a word rather than a symbol, has a different but related meaning in category theory: an object that is both initial and terminal.

Defining zero factorial

Things are defined the way they are for good reasons. This seems blatantly obvious now, but it was eye-opening when I learned this my first year in college. Our professor, Mike Starbird, asked us to go home and think about how convergence of a series should be defined. Not how it is defined, but how it should be defined. We were not to look up the definition but to think about what it should be. The next day we proposed our definitions. In good Socratic fashion Starbird showed us the flaws of each and led us to arrive at the standard definition.

This exercise gave me confidence that mathematical definitions were created by ordinary mortals like myself. It also began my habit of examining definitions carefully to understand what motivated them.

One question that comes up frequently is why zero factorial equals 1. The pedantic answer is “Because it is defined that way.” This answer alone is not very helpful, but it does lead to the more refined question: Why is 0! defined to be 1?

The answer to the revised question is that many formulas are simpler if we define 0! to be 1. If we defined 0! to be 0, for example, countless formulas would have to add disqualifiers such as “except when n is zero.”

For example, the binomial coefficients are defined by

C(n, k) = n! / (k! (n − k)!).

The binomial coefficient C(n, k) tells us how many ways one can take a set of n things and select k of them. For example, the number of ways to deal a hand of five cards from a deck of 52 is C(52, 5) = 52! / (5! 47!) = 2,598,960.

How many ways are there to deal a hand of 52 cards from a deck of 52 cards? Obviously one: the deck is the hand. But our formula says the answer is

C(52, 52) = 52! / (52! 0!),

and the formula is only correct if 0! = 1. If 0! were defined to be anything else, we’d have to say “The number of ways to deal a hand of k cards from a deck of n cards is C(n, k), except when k = 0 or k = n, in which case the answer is 1.” (See [1] below for picky details.)
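You can see the convention at work in Python, where math.factorial(0) returns 1 and a direct translation of the formula gives the right answers:

    from math import factorial

    def C(n, k):
        return factorial(n) // (factorial(k) * factorial(n - k))

    print(C(52, 5))    # 2598960
    print(C(52, 52))   # 1, which relies on factorial(0) == 1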

The example above is certainly not the only one where it is convenient to define 0! to be 1. Countless theorems would be more awkward to state if 0! were defined any other way.

Sometimes people appeal to the gamma function for justification that 0! should be defined to be 1. The gamma function extends factorial to real numbers, and the gamma function value associated with 0! is 1. (In detail, n! = Γ(n+1) for positive integers n and Γ(1) = 1.) This is reassuring, but it raises another question: Why should the gamma function be authoritative?

Indeed, there are many ways to extend factorial to non-integer values, and historically many ways were proposed. However, the gamma function won and its competitors have faded into obscurity. So why did it win? Analogous to the discussion above, we could say that the gamma function won because more formulas work out simply with this definition than with others. That is, you can very often replace n! with Γ(n + 1) in a formula true for positive integer values of n and get a new formula valid for real or even complex values of n.

There is another reason why gamma won, and that’s the Bohr–Mollerup theorem. It says that if you’re looking for a function f(x) defined for x > 0 that satisfies f(1) = 1 and f(x+1) = x f(x), then the gamma function is the only log-convex solution. Why should we look for log-convex functions? Because factorial is log-convex, and so this is a natural property to require of its extension.

Update: Occasionally I hear someone say that the gamma function (shifting its argument by 1) is the only analytic function that extends factorial to the complex plane, but this isn’t true. For example, if you add sin(πx) to the gamma function, you get another analytic function that takes on the same values as gamma for positive integer arguments.


* * *

[1] Theorems about binomial coefficients have to make some restrictions on the arguments. See these notes for full details. But in the case of dealing cards, the only necessary constraints are the natural ones: we assume the number of cards in the deck and the number we want in a hand are non-negative integers, and that we’re not trying to draw more cards for a hand than there are in a deck. Defining 0! as 1 keeps us from having to make any unnatural qualifications such as “unless you’re dealing the entire deck.”

Integration by Darts

Monte Carlo integration has been called “Integration by Darts,” a clever pun on “integration by parts.” I ran across the phrase looking at some slides by Brian Hayes, but apparently it’s been around a while. The explanation that Monte Carlo is “integration by darts” is fine as a 0th order explanation, but it can be misleading.

Introductory courses explain Monte Carlo integration as follows.

  1. Plot the function you want to integrate.
  2. Draw a box that contains the graph.
  3. Throw darts (random points) at the box.
  4. Count the proportion of darts that land between the graph and the horizontal axis.
  5. Estimate the area under the graph by multiplying the area of the box by the proportion above.

In principle this is correct, but this is far from how Monte Carlo integration is usually done in practice.

For one thing, Monte Carlo integration is seldom used to integrate functions of one variable. Instead, it is mostly used on functions of many variables, maybe hundreds or thousands of variables. This is because more efficient methods exist for low-dimensional integrals, but very high dimensional integrals can usually only be computed using Monte Carlo or some variation like quasi-Monte Carlo.

If you draw a box around your integrand, especially in high dimensions, it may be that nearly all your darts fall outside the region you’re interested in. For example, suppose you throw a billion darts and none land inside the volume determined by your integration problem. Then the point estimate for your integral is 0. Assuming the true value of the integral is positive, the relative error in your estimate is 100%. You’ll need a lot more than a billion darts to get an accurate estimate. But is this example realistic? Absolutely. Nearly all the volume of a high-dimensional cube is in the “corners” and so putting a box around your integrand is naive. (I’ll elaborate on this below. [1])

So how do you implement Monte Carlo integration in practice? The next step up in sophistication is to use “importance sampling.” [2] Conceptually you’re still throwing darts at a box, but not with a uniform distribution. You find a probability distribution that approximately matches your integrand, and throw darts according to that distribution. The better the fit, the more efficient the importance sampler. You could think of naive Monte Carlo as importance sampling with a uniform distribution as the importance sampler. It’s usually not hard to find an importance sampler much better than that. The importance sampler is so named because it concentrates more samples in the important regions.

Importance sampling isn’t the last word in Monte Carlo integration, but it’s a huge improvement over naive Monte Carlo.

* * *

[1] So what does it mean to say most of the volume of a high-dimensional cube is in the corners? Suppose you have an n-dimensional cube that runs from -1 to 1 in each dimension and you have a ball of radius 1 inside the cube. To make the example a little simpler, assume n is even, n = 2k. Then the volume of the cube is 4^k and the volume of the sphere is π^k / k!. If k = 1 (n = 2) then the sphere (circle in this case) takes up π/4 of the volume (area), about 79%. But when k = 100 (n = 200), the ball takes up 3.46×10^−169 of the volume of the cube. You could never generate enough random samples from the cube to ever hope to land a single point inside the ball.
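Here’s how you might compute that ratio, using logarithms to avoid overflow:

    from math import pi, lgamma, log, exp

    def ball_to_cube_ratio(k):
        # Ratio of the volume of the unit ball to that of the cube [-1,1]^(2k):
        # (pi**k / k!) / 4**k, computed on the log scale.
        return exp(k*log(pi) - lgamma(k + 1) - k*log(4))

    print(ball_to_cube_ratio(1))     # 0.785..., i.e. pi/4
    print(ball_to_cube_ratio(100))   # about 3.5e-169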

[2] In a nutshell, importance sampling replaces the problem of integrating f(x) with that of integrating (f(x) / g(x)) g(x) where g(x) is the importance sampler, a probability density. Then the integral of (f(x) / g(x)) g(x) is the expected value of (f(X) / g(X)) where X is a random variable with density given by the importance sampler. It’s often a good idea to use an importance sampler with slightly heavier tails than the original integrand.
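Here’s a little sketch of the idea in Python; the integrand, the choice of a Cauchy importance sampler, and the sample size are just for illustration.

    import numpy as np

    # Estimate the integral of f(x) = exp(-x**2/2) over the real line,
    # whose exact value is sqrt(2*pi), by importance sampling.
    # The sampler g is a standard Cauchy, which has heavier tails than f.

    rng = np.random.default_rng(42)

    def f(x):
        return np.exp(-x**2 / 2)

    def g(x):
        return 1.0 / (np.pi * (1 + x**2))   # Cauchy density

    x = rng.standard_cauchy(100_000)        # draws from g
    print(np.mean(f(x) / g(x)))             # estimate of the integral
    print(np.sqrt(2*np.pi))                 # exact value, about 2.5066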


Mathematical arbitrage

I suspect there’s a huge opportunity in moving mathematics from the pure column to the applied column. There may be a lot of useful math that never sees application because the experts are unconcerned with or unaware of applications.

In particular I wonder what applications there may be of number theory, especially analytic number theory. I’m not thinking of the results of number theory but rather the elegant machinery developed to attack problems in number theory. I expect more of this machinery could be useful to problems outside of number theory.

I also wonder about category theory. The theory certainly finds uses within pure mathematics, but I’m not sure how useful it is in direct application to problems outside of mathematics. Many of the reported applications don’t seem like applications at all, but window dressing applied after-the-fact. On the other hand, there are also instances where categorical thinking led the way to a solution, but did its work behind the scenes; once a solution was in hand, it could be presented more directly without reference to categories. So it’s hard to say whether applications of category theory are over-reported or under-reported.

The mathematical literature can be misleading. When researchers say their work has various applications, they may be blowing smoke. At the same time, there may be real applications that are never mentioned in journals, either because the work is proprietary or because it is not deemed original in the academic sense of the word.