An integral with a couple lessons

If you present calculus students with a definite integral, their Pavlovian response is “Take the anti-derivative, evaluate it at the limits, and subtract.” They think that’s what it means. But it’s not what a definite integral means. It’s how you (usually) calculate its value. This is not a pedantic fine point but a practically important distinction. It pays to distinguish what something means from how you usually calculate it. Without this distinction, things that are possible may seem impossible. [1]

For example, suppose you want to compute the following integral that comes up frequently in probability.

\int_{-\infty}^\infty e^{-x^2}\, dx

There is no (elementary) function whose derivative is exp(-x²). It’s not just hard to find or ugly. It simply doesn’t exist, not within the universe of elementary functions. There are functions whose derivative is exp(-x²), but these functions are not finite algebraic combinations of the kinds of functions you’d see in high school.

If you think of the definite integral above as meaning “the result you get when you find an antiderivative, let its arguments go off to ∞ and -∞, and subtract the two limits” then you’ll never calculate it. And when you hear that the antiderivative doesn’t exist (in the world of functions you’re familiar with) then you might think that not only can you not calculate the integral, no one can.

In fact the integral is easy to calculate. It requires an ingenious trick [2], but once you see that trick it’s not hard.

Let I be the value of the integral. Changing the integration variable makes no difference, i.e.

I = \int_{-\infty}^\infty e^{-x^2}\, dx = \int_{-\infty}^\infty e^{-y^2}\, dy

and so

I^2 = \left(\int_{-\infty}^\infty e^{-x^2}\, dx\right) \left( \int_{-\infty}^\infty e^{-y^2}\, dy\right) = \int_{-\infty}^\infty\!\int_{-\infty}^\infty e^{-x^2 - y^2} \, dx\, dy

This integral can be converted to polar coordinates. Instead of describing the plane as an infinite square with x and y each going off to infinity in both directions, we can think of it as an infinite disk, with radius going off to infinity. The advantage of this approach is that the Jacobian of the change of variables gives us an extra factor of r that makes the exponential integral tractable.

\int_0^{2\pi} \! \int_0^\infty e^{-r^2} r \, dr\, d\theta = \frac{1}{2} \int_0^{2\pi} 1\, d\theta = \pi

From this we get I² = π and so I = √π.
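
As a sanity check, here’s a quick numerical confirmation (a small sketch using SciPy’s quad routine):

      from math import exp, pi, sqrt
      from scipy.integrate import quad

      # Integrate exp(-x^2) over the whole real line and compare to sqrt(pi)
      val, err = quad(lambda x: exp(-x**2), -float("inf"), float("inf"))
      print(val, sqrt(pi))  # both are approximately 1.7724539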

This specific trick comes up occasionally. But more generally, it is often the case that definite integrals are easier to compute than indefinite integrals. One of the most common applications of complex analysis is computing such integrals through the magic of contour integration. This leads to a lesson closely related to the one above, namely that you may not have to do what it looks like you need to do. In this case, you don’t always need to compute indefinite integrals (anti-derivatives) as an intermediate step to compute definite integrals. [3]

Mathematics is filled with theorems that effectively say that you don’t actually have to compute what you conceptually need to compute. Sometimes you can get by with calculating much less.

* * *

[1] One frustration I’ve had working with statisticians is that many have forgotten the distinction between what they want to calculate and how they calculate it. This makes it difficult to suggest better ways of computing things.

[2] Lord Kelvin said of this trick “A mathematician is one to whom that is as obvious as that twice two makes four is to you. Liouville was a mathematician.”

[3] If you look back carefully, we had to compute the integral of exp(-r²) r, which you would do by first computing its anti-derivative. But we didn’t have to compute the anti-derivative of the original integrand. We traded a hard (in some sense impossible) anti-derivative problem for an easy one.

Truncated exponential series inequality

Define Tₙ(x) to be the Taylor series for exp(x) truncated after the xⁿ/n! term:

T_n(x) = 1 + x+ \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^n}{n!}

How does this function compare to its limit, exp(x)? We might want to know because it’s often useful to have polynomial upper or lower bounds on exp(x).

For x > 0 it’s clear that exp(x) is larger than Tₙ(x) since the discarded terms in the power series for exp(x) are all positive.

The case of x < 0 is more interesting. There exp(x) > Tₙ(x) if n is odd and exp(x) < Tₙ(x) if n is even.

Define fₙ(x) = exp(x) – Tₙ(x). If x > 0 then fₙ(x) > 0.

We want to show that if x < 0 then fₙ(x) > 0 for odd n and fₙ(x) < 0 for even n.

For n = 1, note that f₁ and its derivative are both zero at 0. Now suppose f₁ is zero at some point a < 0. Then by Rolle’s theorem, there is some point b with a < b < 0 where the derivative of f₁ is 0. Since the derivative of f₁ is also zero at 0, there must be some point c with b < c < 0 where the second derivative of f₁ is 0, again by Rolle’s theorem. But the second derivative of f₁ is exp(x), which is never 0. So our assumption f₁(a) = 0 leads to a contradiction.

Now f₁(0) = 0 and f₁(x) ≠ 0 for x < 0. So f₁(x) must be always positive or always negative. Which is it? For negative x, exp(x) is bounded and so

f₁(x) = exp(x) – 1 – x

is eventually dominated by the –x term, which is positive since x is negative.

The proof for n = 2 is similar. If f₂(x) is zero at some point a < 0, then we can use Rolle’s theorem to find a point b < 0 where the derivative of f₂ is zero, and a point c < 0 where the second derivative is zero, and a point d < 0 where the third derivative is zero. But the third derivative of f₂ is exp(x), which is never zero.

As before, the contradiction shows f₂(x) ≠ 0 for x < 0. So is f₂(x) always positive or always negative? This time we have

f₂(x) = exp(x) – 1 – x – x²/2

which is eventually dominated by the –x²/2 term, and that term is negative.

For general n, we assume fₙ is zero at some point x < 0 and apply Rolle’s theorem n+1 times to reach the contradiction that exp(x) is zero somewhere. This tells us that fₙ(x) is never zero for negative x. We then look at the dominant term –xⁿ/n! to argue that fₙ is positive or negative depending on whether n is odd or even.

Another way to determine the sign of fₙ(x) for negative x would be to apply the alternating series theorem at x = -1.
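
Here’s a quick numerical spot check of the sign pattern (a small sketch; the value x = -2.5 is an arbitrary choice):

      from math import exp, factorial

      def T(n, x):
          """Taylor polynomial for exp(x), truncated after the x**n term."""
          return sum(x**k / factorial(k) for k in range(n + 1))

      x = -2.5
      for n in range(1, 6):
          # exp(x) - T_n(x) should be positive for odd n, negative for even n
          print(n, exp(x) - T(n, x) > 0)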

Random squares

In geometry, you’d say that if a square has side x, then it has area x².

In calculus, you’d say more. First you’d say that if a square has side near x, then it has area near x². That is, area is a continuous function of the length of a side. As the length of the side changes, there’s never an abrupt jump in area. Next you could be more specific and say that a small change Δx to a side of length x corresponds to approximately a change of 2x Δx in the area.

In probability, you ask what the area of a square is like if you pick the length of its side at random. If you pick the length of the side from a distribution with mean μ, does the distribution of the area have mean μ²? No, but if the probability distribution on side length is tightly concentrated around μ, then the distribution on area will be concentrated near μ². And you can approximate just how near the area is to μ² using the delta method, analogous to the calculus discussion above.

If the distribution on side lengths is not particularly concentrated, finding the distribution on the area is more interesting. It will depend on the specific distribution on side length, and the mean area might not be particularly close to the square of the mean side length. The function to compute area is trivial, and yet the question of what happens when you stick a random variable into that function is not trivial. Random variables behave as you might expect when you stick them into linear functions, but offer surprises when you stick them into nonlinear functions.

Suppose you pick the length of the side of a square uniformly from the interval [0, 1]. Then the average side is 1/2, and so you might expect the average area to be 1/4. But the expected area is actually 1/3. You could see this a couple ways, analytically and empirically.

First an analytical derivation. If X has a uniform [0, 1] distribution and Z = X², then the CDF of Z is

Prob(Z ≤ z) = Prob(X ≤ √z) = √z.

and so the PDF for Z, the derivative of the CDF, is 1/(2√z). From there you can compute the expected value by integrating z times the PDF.
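
Carrying out that integration gives the expected area:

\int_0^1 z \, \frac{1}{2\sqrt{z}}\, dz = \int_0^1 \frac{\sqrt{z}}{2}\, dz = \frac{1}{3}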

You could check your calculations by seeing whether simulation gives you similar results. Here’s a little Python code to do that.

      from random import random
      N = 1000000
      print( sum([random()**2 for _ in range(N)] )/N )

When I run this, I get 0.33386, close to 1/3.

Now let’s look at an exponential distribution on side length with mean 1. Then a calculation similar to the one above shows that the expected value of the area is 2. You can also check this with simulation. This time we’ll be a little fancier and let SciPy generate our random values for us.

      from scipy.stats import expon  # simulate exponential side lengths
      print( sum(expon.rvs(size=N)**2)/N )  # N as defined above

When I ran this, I got 1.99934, close to the expected value of 2.
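
One way to carry out that calculation: if X is exponential with mean 1, the expected value of X² is

\int_0^\infty x^2 e^{-x}\, dx = 2

which you can get by integrating by parts twice, or by recognizing the integral as Γ(3) = 2! = 2.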

You’ll notice that in both examples, the expected value of the area is more than the square of the expected value of the side. This is not a coincidence but a consequence of Jensen’s inequality. Squaring is a convex function, so the expected value of the square is at least as large as the square of the expected value for any random variable, with equality only when the variable is constant.

Normal hazard continued fraction

The hazard function of a probability distribution is the instantaneous probability density of an event given that it hasn’t happened yet. This works out to be the ratio of the PDF (probability density function) to the CCDF (complementary cumulative distribution function).

For the standard normal distribution, the hazard function is

h(x) = \frac{\exp(-x^2/2)}{\int_x^\infty \exp(-t^2/2)\,dt}

and has a surprisingly simple continued fraction representation:

h(x) = x + \cfrac{1}{x+\cfrac{2}{x+\cfrac{3}{x+\cfrac{4}{x+\cdots}}}}

Aside from being an elegant curiosity, this gives an efficient way to compute the hazard function for large x. (It’s valid for any positive x, but most efficient for large x.)
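
Here’s a little Python sketch that evaluates the continued fraction from the bottom up with a fixed depth and compares it to computing the PDF/CCDF ratio directly with SciPy. (The depth of 40 is an arbitrary choice, more than enough for moderately large x.)

      from scipy.stats import norm

      def hazard_cf(x, depth=40):
          # Evaluate x + 1/(x + 2/(x + 3/(... depth/x))) from the bottom up
          t = 0.0
          for k in range(depth, 0, -1):
              t = k / (x + t)
          return x + t

      x = 3.0
      print(hazard_cf(x))              # continued fraction
      print(norm.pdf(x) / norm.sf(x))  # PDF / CCDF directly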

Source: A&S equation 26.2.14


A short, unusual proof that there are infinitely many primes

Sam Northshield [1] came up with the following clever proof that there are infinitely many primes.

Suppose there are only finitely many primes and let P be their product. Then

0 < \prod_p \sin\left( \frac{\pi}{p} \right) = \prod_p \sin\left(\frac{\pi(1+2P)}{p} \right) = 0

The original publication gives the calculation above with no explanation. Here’s a little commentary to explain the calculation.

Since prime numbers are greater than 1, sin(π/p) is positive for every prime. And a finite product of positive terms is positive. (An infinite product of positive terms could converge to zero.)

Since p is a factor of P, the arguments of sine in the second product differ from those in the first product by an integer multiple of 2π, so the corresponding terms in the two products are the same.

There must be some p that divides 1 + 2P, and that value of p contributes the sine of an integer multiple of π to the product, i.e. a zero. Since one of the terms in the product is zero, the product is zero. And since zero is not greater than zero, we have a contradiction.

* * *

[1] A One-Line Proof of the Infinitude of Primes, The American Mathematical Monthly, Vol. 122, No. 5 (May 2015), p. 466

A curious property of catenaries

Suppose you have a flat line f(x) = k and an interval [a, b]. Then the area under the line and over the interval is k times the length of the segment of the line.

Surprisingly, the same is true for a catenary with scale k.

With the flat line, the length of the segment of the graph is the same as the length of the interval [a, b] on the x-axis, but in general a curve over that interval will be longer. The catenary is convex, and it bends in just the right way that the area under it grows in exact proportion to its arc length, keeping the ratio equal to k.

The area under a curve f(x) and over the interval [a, b] is simply the integral of f from a to b:

\int_a^b f(x)\, dx.

The length of the curve from a to b is also given by an integral:

\int_a^b \sqrt{1 + (f'(x))^2} \, dx.

You can prove the claim above by showing that the first integral is k times the second integral when f(x) = k cosh((x – c)/k), the catenary centered at c with scale k.
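
The verification is short: with f(x) = k cosh((x – c)/k) we have f′(x) = sinh((x – c)/k), and so

\sqrt{1 + (f'(x))^2} = \sqrt{1 + \sinh^2\left(\frac{x-c}{k}\right)} = \cosh\left(\frac{x-c}{k}\right) = \frac{f(x)}{k}

That is, the integrand in the arc length integral is exactly 1/k times the integrand in the area integral, so the area is k times the arc length.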

By the way, this result was discovered independently by Johann Bernoulli and Gottfried Leibniz three centuries ago.

When does a function equal its Taylor series?

Taylor’s theorem says

f(x) = \sum_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n

When does the thing on the left equal the thing on the right?

A few things could go wrong:

  • Maybe not all the terms on the right side exist, i.e. the function f might not be infinitely differentiable.
  • Maybe f is infinitely differentiable but the series diverges.
  • Maybe f is infinitely differentiable but the series converges to something other than f.

The canonical example of the last problem is the function g(x) defined to be exp(-1/x²) for positive x and 0 otherwise. This function is so flat as it approaches the origin that all of its derivatives are zero there. So all the coefficients in Taylor’s formula are zero. But the function g is positive for positive x.
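
You can check this claim symbolically. Here’s a small sketch using SymPy that computes one-sided limits at 0 of the first few derivatives of exp(-1/x²); each limit comes out to 0.

      from sympy import symbols, exp, diff, limit

      x = symbols('x')
      g = exp(-1/x**2)
      for k in range(4):
          # One-sided limit of the k-th derivative as x -> 0+
          print(limit(diff(g, x, k), x, 0, '+'))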

Everything above can be found in a standard calculus text, but the following results are hardly known.

Being infinitely differentiable is necessary but not sufficient for a function to be real analytic. What additional requirements would be sufficient?

We start by examining what went wrong with the function g(x) above. It has a Taylor series at every point, but the radii of convergence shrink to zero as the center of the series approaches the origin from the right: for x > 0 the function g agrees near x with exp(-1/z²), which has an essential singularity at z = 0, so the series centered at x has radius of convergence x. At the origin itself the Taylor series converges everywhere, but to the zero function rather than to g.

Alfred Pringsheim claimed that it is enough to require that the radii of convergence be bounded below on an interval. If we let ρ(x) be the radius of convergence for a series centered at x, then a function f is analytic in an open interval J if ρ(x) has some lower bound δ > 0 in J. Pringsheim’s theorem was correct, but his proof was flawed. The proof was accepted for 40 years until Ralph Boas discovered the flaw.

Here is a generalization of Pringsheim’s theorem. (It’s not clear to me from my source [1] who first proved this theorem. Perhaps Boas, but in the same context Boas mentions the work of others on the problem.)

Let f be infinitely differentiable in an open interval J. Suppose the radius of convergence ρ(x) for a Taylor series centered at x is positive for each x in the interval J. Suppose also that for every point p in J we have

\liminf_{x\to p} \frac{\rho(x)}{|x-p|} > 1.

Then f is analytic on the interval J.

Note that if ρ(x) is bounded below by δ > 0 then the limit above is infinite and so Pringsheim’s theorem follows as a special case.

The function g(x) above does not satisfy the hypothesis of the theorem on any open interval around 0, because if we set p = 0 then the liminf above is 1 (approaching 0 from the right, ρ(x)/|x| = 1), not greater than 1.

* * *

[1] R. P. Boas. When Is a C∞ Function Analytic? Mathematical Intelligencer, Vol. 11, No. 4, pp. 34–37.

 

Computing discrete logarithms with baby-step giant-step algorithm

At first “discrete logarithm” sounds like a contradiction in terms. Logarithms aren’t discrete, not as we usually think of them. But you can define and compute logarithms in modular arithmetic.

What is a logarithm? It’s the solution to an exponential equation. For example, the logarithm base 10 of 2 is the solution to the equation 10^x = 2, and so x ≈ 0.30103. Similarly, you could look for the logarithm base 10 of 2 modulo 19. This is an integer value of x such that 10^x = 2 mod 19, and you can show that 17 is a solution.

If we work modulo an integer n, the discrete logarithm base a of a number y is an integer x such that a^x = y. For some values of a there may not be a solution, but there will be a solution if a is a generator of the integers mod n.

Brute force is the simplest algorithm for finding discrete logarithms: try x = 1, 2, 3, …, n until you find a value of x satisfying a^x = y. The problem with brute force is that the expected run time is on the order of n, and n is often very large in applications.

Discrete logarithms are expensive to compute, but we can do better than brute force. (Cryptographic applications take advantage of the fact that discrete logarithms are hard to compute.) There’s a simple algorithm by Daniel Shanks, known as the baby-step giant-step algorithm, that reduces the run time from order n to order roughly √n. (Actually O(√n log n) for reasons we’ll see soon.)

Let s be the ceiling of the square root of n. The idea is to search for x by taking giant steps forward (multiples of s) and baby (single) steps backward.

Let G be the set of pairs (a^(gs) mod n, gs) where g ranges from 1 to s. These are the giant steps.

Let B be the set of pairs (ya^b mod n, b) where b ranges from 0 to s-1. These are the baby steps.

Now look for a pair in G and a pair in B with matching first components, i.e. ya^b = a^(gs). Then x = gs – b is the required solution, since a^(gs – b) = y.

Constructing the sets G and B requires O(s) operations, and matching the two sets takes O(s log s) operations.

Here’s an example, going back to the problem above of finding the discrete log base 10 of 2 mod 19, using a little Python code to help with the calculations.

The square root of 19 is between 4 and 5 so s = 5.

     >>> [(2*10**b % 19, b) for b in range(5)]
     [(2, 0), (1, 1), (10, 2), (5, 3), (12, 4)]
     >>> [(10**(5*g) % 19, 5*g) for g in range(1,6)]
     [(3, 5), (9, 10), (8, 15), (5, 20), (15, 25)]

The number 5 appears as the first element of a pair in both B and G. We have (5, 3) in B and (5, 20) in G so x = 20 – 3 = 17.
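
Putting the pieces together, here’s a sketch of the algorithm as a reusable function. (This is an illustration, not optimized; it assumes a solution exists and that a is invertible mod n.)

      from math import isqrt

      def discrete_log(a, y, n):
          # Baby-step giant-step: find x with a**x = y (mod n)
          s = isqrt(n - 1) + 1                                 # ceiling of sqrt(n)
          baby = {y * pow(a, b, n) % n: b for b in range(s)}   # baby steps
          for g in range(1, s + 1):                            # giant steps
              t = pow(a, g * s, n)
              if t in baby:
                  return g * s - baby[t]
          return None

      print(discrete_log(10, 2, 19))   # 17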

Related: Applied number theory

Periods of fractions

Suppose you have a fraction a/b where 0 < a < b, and a and b are relatively prime integers. The decimal expansion of a/b either terminates or it has an initial non-repeating part followed by a repeating part.

How long is the non-repeating part? How long is the period of the repeating part?

The answer depends on the prime factorization of the denominator b. If b has the form

b = 2^α 5^β

then the decimal expansion terminates after r digits, where r = max(α, β).

Otherwise b has the factorization

b = 2^α 5^β d

where d is some integer greater than 1 and relatively prime to 10. Then the decimal expansion of a/b has r non-repeating digits and a repeating part with period s where r is as above, and s is the smallest positive integer such that

d | 10^s – 1,

i.e. the smallest s such that 10^s – 1 is divisible by d. How do we know there exists any such integer s? This isn’t obvious.

Euler’s theorem tells us that

d | 10^φ(d) – 1

where φ is Euler’s totient function, so we could take s = φ(d), though this may not be the smallest such s. (When d is prime, this is Fermat’s little theorem, with φ(d) = d – 1.) So not only does s exist, we know that it is at most d – 1, since φ(d) ≤ d – 1. This means that the period of the repeating part of a/b is no more than d – 1 where d is what’s left after factoring out as many 2’s and 5’s from b as possible.

By the way, this means that you can take any integer d, not divisible by 2 or 5, and find some integer k such that dk consists only of 9’s.
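
Here’s a short Python function that computes r and s directly from the description above. (It only needs the denominator b, since r and s don’t depend on the numerator when the fraction is in lowest terms.)

      def decimal_period(b):
          # Return (r, s): r non-repeating digits, then a repeating part of
          # period s, with s = 0 if the expansion terminates.
          alpha = beta = 0
          d = b
          while d % 2 == 0:              # factor out 2's
              d //= 2
              alpha += 1
          while d % 5 == 0:              # factor out 5's
              d //= 5
              beta += 1
          r = max(alpha, beta)
          if d == 1:
              return r, 0
          s, power = 1, 10 % d           # smallest s with 10**s = 1 (mod d)
          while power != 1:
              power = power * 10 % d
              s += 1
          return r, s

      print(decimal_period(6))   # (1, 1): 1/6 = 0.1666...
      print(decimal_period(7))   # (0, 6): 1/7 = 0.142857 142857 ...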

Related post: Cyclic fractions

One of my favorite proofs: Lagrange multipliers

One of my lightbulb moments in college was when my professor, Jim Vick, explained the Lagrange multiplier theorem. The way I’d seen it stated in a calculus text gave me no feel for why it should be true, but his explanation made sense immediately.

Suppose f(x) is a function of several variables, i.e. x is a vector, and g(x) = c is a constraint. Then the Lagrange multiplier theorem says that at the maximum of f subject to the constraint g we have ∇f = λ ∇g.

Where does this mysterious λ come from? And why should the gradient of your objective function be related to the gradient of a constraint? These seem like two different things that shouldn’t even be comparable.

Here’s the geometric explanation. The set of points satisfying g(x) = c is a surface. And for any k, the set of points satisfying f(x) = k is also a surface. Imagine k very large, larger than the maximum of f on the surface defined by g(x) = c. You could think of the surface g(x) = c as a balloon inside the larger balloon f(x) = k.

Now gradually decrease k, like letting the air out of the outer balloon, until the surfaces g(x) = c and f(x) = k first touch. At that point, the two surfaces will be tangent, and so their normal vectors, given by their gradients, point in the same direction. That is, ∇f and ∇g are parallel, and so ∇f is some multiple of ∇g. Call that multiple λ.
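
For example, to maximize f(x, y) = xy subject to the constraint g(x, y) = x + y = 2, the condition ∇f = λ ∇g says (y, x) = λ(1, 1), so x = y = λ. The constraint then forces x = y = 1 and λ = 1. Geometrically, the hyperbolas xy = k shrink as k decreases until one of them just touches the line x + y = 2, and the point of tangency is (1, 1).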

I don’t know how well that explanation works when written down. But when I heard Jim Vick explain it, moving his hands in the air, it was an eye-opener.

This is not a rigorous proof, and it does not give the most general result possible, but it explains what’s going on. It’s something to keep in mind when reading proofs that are more rigorous or more general. As I comment on here,

Proofs serve two main purposes: to establish that a proposition is true, and to show why it is true.

The literally hand-wavy proof scores low on the former criterion and high on the latter.

* * *

Jim Vick was a great teacher. Some of us affectionately called him The Grinning Demon because he was always smiling, even while he gave devilishly hard homework. He was Dean of Natural Sciences when I took a couple classes from him. He later became Vice President for Student Affairs and kept teaching periodically. He has since retired but still teaches.

After taking his topology course, several of us asked him to teach a differential geometry course. He hesitated because it’s a challenge to put together an undergraduate differential geometry course. The usual development of differential geometry uses so much machinery that it’s typically a graduate-level course.

Vick found a book that starts by looking only at manifolds given by level sets of smooth functions, like the surfaces discussed above. Because these surfaces sit inside a Euclidean space, you can quickly get to some interesting geometric results using elementary methods. We eventually got to the more advanced methods, but by then we had experience in a more tangible setting. As Michael Atiyah said, abstraction should follow experience, not precede it.

Münchausen numbers


The number 3435 has the following curious property:

3435 = 3³ + 4⁴ + 3³ + 5⁵.

It is called a Münchausen number, an allusion to the fictional Baron Münchausen. When each digit is raised to its own power and summed, you get the original number back. The only other Münchausen number is 1.

At least in base 10. You could look at Münchausen numbers in other bases. If you write out a number n in base b, raise each of its “digits” to its own power, take the sum, and get n back, you have a Münchausen number in base b. For example 28 is a Münchausen number in base 9 because

28₁₀ = 31₉ = 3³ + 1¹

Daan van Berkel proved that there are only finitely many Münchausen numbers in any given base. In fact, he showed that a Münchausen number in base b cannot be greater than 2b^b, and so you could do a brute-force search to find all the Münchausen numbers in any base.

The upper bound 2b^b grows very quickly with b and so brute force becomes impractical for large b. If you wanted to find all the hexadecimal Münchausen numbers you’d have to search 2·16^16 = 36,893,488,147,419,103,232 numbers. How could you do this more efficiently?
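
Whatever search strategy you use, you need a fast test for a single candidate. Here’s a small sketch, using the usual convention that a zero digit contributes 0 rather than 0⁰ = 1:

      def is_munchausen(n, base=10):
          # Does n equal the sum of its base-b digits raised to their own powers?
          total, m = 0, n
          while m:
              m, d = divmod(m, base)
              total += d**d if d else 0   # convention: a 0 digit contributes 0
          return total == n

      print(is_munchausen(3435))          # True
      print(is_munchausen(28, base=9))    # True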

Less likely to get half, more likely to get near half

I was catching up on Engines of our Ingenuity episodes this evening when the following line jumped out at me:

If I flip a coin a million times, I’m virtually certain to get 50 percent heads and 50 percent tails.

Depending on how you understand that line, it’s either imprecise or false. The more times you flip the coin, the more likely you are to get nearly half heads and half tails, but the less likely you are to get exactly half of each. I assume Dr. Lienhard knows this and that by “50 percent” he meant “nearly half.”

Let’s make the fuzzy statements above more quantitative. Suppose we flip a coin 2n times for some large number n. Then a calculation using Stirling’s approximation shows that the probability of n heads and n tails is approximately

1/√(πn)

which goes to zero as n goes to infinity. If you flip a coin a million times, there’s less than one chance in a thousand that you’d get exactly half heads.
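
You can check both the exact probability and the approximation numerically. Here’s a small sketch using SciPy’s binomial distribution:

      from math import pi, sqrt
      from scipy.stats import binom

      n = 500_000   # half of a million flips
      print(binom.pmf(n, 2*n, 0.5))   # exact probability of exactly n heads
      print(1/sqrt(pi*n))             # Stirling approximation, about 0.0008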

Next, let’s quantify the statement that nearly half the tosses are likely to be heads. The normal approximation to the binomial tells us that for large n, the number of heads out of 2n tosses is approximately distributed like a normal distribution with the same mean and variance, i.e. mean n and variance n/2. The proportion of heads is thus approximately normal with mean 1/2 and variance 1/8n. This means the standard deviation is 1/√(8n). So, for example, about 95% of the time the proportion of heads will be 1/2 plus or minus 2/√(8n). As n goes to infinity, the width of this interval goes to 0. Alternatively, we could pick some fixed interval around 1/2 and show that the probability of the proportion of heads being outside that interval goes to 0.
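
For example, with a million flips we have n = 500,000 and 2/√(8n) = 2/2000 = 0.001, so about 95% of the time the proportion of heads will be between 0.499 and 0.501: very likely to be near one half, but, as computed above, very unlikely to be exactly one half.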

What is calculus?

When people ask me what calculus is, my usual answer is “the mathematics of change,” studying things that change continually. Algebra is essentially static, studying things frozen in time and space. Calculus studies things that move, shapes that bend, etc. Algebra deals with things that are exact and consequently can be fragile. Calculus deals with approximation and is consequently more robust.

I’m happier with the paragraph above if you replace “calculus” with “analysis.” Analysis certainly seeks to understand and model things that change continually, but calculus per se is the mechanism of analysis.

I used to think it oddly formal for people to say “differential and integral calculus.” Is there any other kind? Well yes, yes there is, though I didn’t know that at the time. A calculus is a system of rules for computing things. Differential and integral calculus is a system of rules for calculating derivatives and integrals. Lately I’ve thought about other calculi more than differential calculus: propositional calculus, lambda calculus, calculus of inductive constructions, etc.

In my first career I taught (differential and integral) calculus and was frustrated with students who would learn how to calculate derivatives but never understood what a derivative was or what it was good for. In some sense though, they got to the core of what a calculus is. It would be better if they knew what they were calculating and how to apply it, but they still learn something valuable by knowing how to carry out the manipulations. A computer science major, for example, who gets through (differential) calculus knowing how to calculate derivatives without knowing what they are is in a good position to understand lambda calculus later.

Duality in spherical trigonometry

This evening I ran across an unexpected reference to spherical trigonometry: Thomas Hales’ lecture on lessons learned from the formal proof of the Kepler conjecture. He mentions at one point a lemma that was awkward to prove in its original form, but that became trivial when he looked at its spherical dual.

The sides of a spherical triangle are formed by great circular arcs through the vertices. Since the sides are portions of a circle, they can be measured as angles. So in spherical trig you have this interesting interplay of two kinds of angles: the angles formed at the intersections of the sides, and the angles describing the sides themselves.

Here’s how you form the dual of a spherical triangle. Suppose the vertices of the triangle are A, B, and C. Think of the arc connecting A and B as an equator, and let C′ be the corresponding pole that lies on the same side of the arc as the original triangle ABC. Do the analogous process to find the points A′ and B′. The triangle A′B′C′ is the dual of the triangle ABC. (This idea goes back to the Persian mathematician Abu Nasr Mansur circa 1000 AD.)

The sides in A′B′C′ are the supplements of the corresponding intersection angles in ABC, and the intersection angles in A′B′C′ are the supplements of the corresponding sides in ABC.

In his paper “Duality in the formulas of spherical trigonometry,” published in American Mathematical Monthly in 1909, W. A. Granville gives the following duality principle:

If the sides of a spherical triangle be denoted by the Roman letters a, b, c and the supplements of the corresponding opposite angles by the Greek letters α, β, γ, then from any given formula involving any of these six parts, we may write down a dual formula simply by interchanging the corresponding Greek and Roman letters.
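
As an illustration of the principle (not an example from Granville’s paper): the spherical law of cosines, written in terms of the six parts above, is cos a = cos b cos c − sin b sin c cos α, since cos A = −cos α. Interchanging Greek and Roman letters gives cos α = cos β cos γ − sin β sin γ cos a, and substituting back α = π − A, etc., turns this into the dual law of cosines cos A = −cos B cos C + sin B sin C cos a.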

Related: Notes on Spherical Trigonometry