Posts Tagged ‘Math’

Diagramming modes of convergence

Sunday, May 25th, 2008

Sometimes a good diagram is a godsend, reducing the entropy in your head at a glace.

When I was studying integration theory, I ran across a diagram something like the following in the out-of-print book Elements of Integration by Robert G. Bartle.

convergence for a general measure space

Each arrow summarizes a theorem. The four corners stand for modes of convergence: almost everywhere, almost uniform, Lp, and convergence in measure. A solid line from one mode to another means that if a sequence of function converges in the first mode then it also converges in the other. A dashed line means that a subsequence converges in the second mode. For example, if a sequence of functions fnconverge to a function f, then the sequence also converges in measure to f, and a subsequence converges to f almost everywhere.

Note that this isn’t the typical mathematical diagram where the corners represent spaces and the arrows represent functions. I don’t recall seeing anything like this outside of Bartle’s book.

For a finite measure space, additional relationships hold as summarized in the diagram below.

convergence for a finite measure space

If the sequence fn is uniformly dominated by an Lp function g, then even more relationships hold.

dominated convergence

For more details, see modes of convergence. These notes also contain counterexamples that show no other relationships exist. Three simple counterexamples suffice to show no arrows are missing.

Jenga mathematics

Sunday, May 11th, 2008

Jenga is a game where you start with a tower of wooden pegs and take turns removing pegs until someone makes the tower collapse. A style of mathematics analogous to Jenga reached the height of its popularity about 40 years ago and then fell out of fashion. I use the phrase “Jenga mathematics” to refer to generalizing a well-known theorem by weakening its hypotheses, seeing how many pegs you can pull out before it falls.

Jenga game photo

Many 20th century mathematicians spent their careers going over the work of 19th century mathematicians, removing every hypothesis they could. Sometimes a 20th century mathematician would get his name tacked on to a 19th century theorem due to his Jenga accomplishments.

Taken to extremes, Jenga mathematics turns theorems inside-out and proofs become hypotheses. Natural hypotheses are replaced with a laundry list of properties necessary to make the proof work. Start with some theorem of the form “Let X be a widget. Then X has a foozle.” Go back over the proof and see just what features of a widget are needed for the proof. Then restate the theorem as “Let X have the following apparently arbitrary list of properties necessary for my proof to work. Then X has a foozle.” Never mind whether anybody can think of anything other that a widget that satisfies the hypotheses of the new theorem.

Jenga mathematics is no longer fashionable. Mathematicians still value removing unneeded hypotheses, but they’re not as willing to go to extremes to do so. They are more interested in building new towers than in removing every piece possible from old towers.

How to calculate binomial probabilities

Thursday, April 24th, 2008

Suppose you’re doing something that has probability of success p and probability of failure q = 1-p. If you repeat what you’re doing m+n times, the probability of m successes and n failures is given by  

\frac{(m+n)!}{m!\, n!} p^m q^n

Now suppose m and n are moderately large. The terms (m+n)! and m! n! will be huge, but the terms pm and qn will be tiny. The huge terms might overflow, and the tiny terms might underflow, even though the final result may be a moderate-sized number. The numbers m and n don’t have to be that large for this to happen since factorials grow very quickly. On a typical system, overflow happens if m+n > 170. How do you get to the end result without having to deal with impossibly large or small values along the way?

The trick is to work with logs. You want to first calculate log( (m+n)! ) - log( m! ) - log( n! ) + m log( p ) + n log( q ) , then exponentiate the result. This pattern comes up constantly in statistical computing.

Libraries don’t always include functions for evaluating factorial or log factorial. They’re more likely to include functions for Γ(x) and its log. For positive integers n, Γ(n+1) = n!. Now suppose you have a function lgamma that computes log Γ(x). You might write something like this.

    double probability(double p, double q, int m, int n)
    { 
        double temp = lgamma(m + n + 1.0);
        temp -=  lgamma(n + 1.0) + lgamma(m + 1.0);
        temp += m*log(p) + n*log(q);
        return exp(temp);
    }

The function lgamma is not part of the ANSI standard library for C or C++, but it is part of the POSIX standard. On Unix-like systems, lgamma is included in the standard library. However, Microsoft does not include lgamma as part of the Visual C++ standard library. On Windows you have to either implement your own lgamma function or grab an implementation from somewhere like Boost.

Here’s something to watch out for with POSIX math libraries. I believe the POSIX standard does not require a function called gamma for evaluating the gamma function Γ(x). Instead, the standard requires functions lgamma for the log of the gamma function and tgamma for the gamma function itself. (The mnemonic is “t” for “true,” meaning that you really want the gamma function.) I wrote a little code that called gamma and tested it on OS X and Red Hat Linux this evening. In both cases gcc compiled the code without warning, even with the -Wall and -pedantic warning flags. But on the Mac, gamma was interpreted as tgamma and on the Linux box it was interpreted as lgamma. This means that gamma(10.0) would equal 362880 on the former and 12.8018 on the latter.

Fibonacci numbers at work

Wednesday, April 23rd, 2008

Today I needed to use Fibonacci numbers to solve a problem at work. Fibonacci numbers are great fun, but I don’t recall needing them in an applied problem before.

I needed to compute a series of integrals of the form  

f(x, y) = xa (1-x)b yc (1-y)d p(x, y)

over the unit square for a statistical application. The function p(x, y) is a little complicated but its specific form is not important here. If the constants a, b, c, and d are all positive, as they usually are in my application, the integrand can be extended to a continuous periodic function in the plane. Lattice rules are efficient for such integration problems, and the optimal lattice rules for two-variable integration are given by Fibonacci lattice rules.

If Fk is the kth Fibonacci number and {x} is the remainder when subtracting off the greatest integer less than x, the kth Fibonacci lattice rule approximates the integral of f by

Fibonacci lattice integration rule

This rule worked well. I compared the results to those using the two-variable trapezoid rule and the lattice rule always gave more accuracy for the same number of integration points. (Normally the trapezoid rule typically not very efficient. It’s more efficient for periodic integrands, but the lattice rule was more efficient still.)

To implement this integration rule, I first had to write a function to compute Fibonacci numbers. This is such a common academic exercise that it is almost a cliché. I was surprised to have a practical need for such a function. My implementation was

    int Fibonacci(int k)
    {
        double phi = 0.5*(1.0 + sqrt(5.0));
        return int( pow(phi, k)/sqrt(5.0) + 0.5 );
    }

The reason this works is that

equation for Fibonacci numbers

where

definition of phi, golden ratio

is the golden ratio. The second term in the formula for Fk is smaller than 1, and Fk is an integer, so by rounding the first term to the nearest integer we get the exact result. Adding 0.5 before casting the result to an integer causes the result to be rounded to the nearest integer.

If you’re interested in technical details of the numerical integration method, see Lattice Methods for Multiple Integration by I. H. Sloan and S. Joe.

Overflow and loss of precision

Wednesday, April 16th, 2008

Suppose you need to evaluate the function f(x) = log(1 + ex). The most obvious code to write something like this in C:

    double f(double x) { return log(1 + exp(x)); }

This is so simple you don’t even test it. Then someone calls you to say they’re getting strange output such as 1.#INF depending on your environment. What’s going on?

For large values of x, ex is much bigger than 1, so f(x) is approximately log(ex) which equals x. So, for example, f(1000) should return something very close to 1000. But instead, you get gibberish. In most environments a floating point number can be as large as about 10308, but exp(1000) is about 2×10434 and so the result overflows. Then the code takes the log of “infinity.”

But f(1000) shouldn’t overflow; the result is about 1000. Our function result can be contained in a floating point number, but our intermediate result exp(1000) cannot be. We need some way of telling the computer “Hold on. I know this is going to be a huge number, but trust me for a minute. After I add 1 I’m going to take a log and bring the result back down to a medium sized number.” Unfortunately computers don’t work that way.

One solution would be to see whether x is so large that exp(x) will overflow, and in that case f(x) could just return x. So our second attempt might look like this:

    double f(double x) { return (x > X_MAX) ? x : log(1 + exp(x)); }

I’ll come back later to what X_MAX should be. Suppose you’ve found a good value for this constant and release your code again. Then you get another call. They say your code is returning zero when it should not.

Someone is trying to compute f(-50). They know the answer is small, but it shouldn’t be zero. How can that be? Since you just learned about overflow, you suspect the problem might have to do with the opposite problem, underflow. But that’s not quite it. The result of exp(-50) does not underflow; it’s about 2×10-22. But it is so small that machine precision cannot distinguish 1+exp(-50) from 1. So the code returns log(1), which equals zero. This may or may not be a problem. The correct answer is very small, so the absolute error is small but the relative error is 100%. Whether that is a problem depends on what you’re going to do with the result. If you’re going to add it to a moderate size number, no big deal. If you’re going to divide by it, it’s a very big deal.

Now what do you do? You think back to your calculus class and remember that for small x, log(1 + x) is approximately x with error on the order of x2. (To see this, expand log(1 + x) in a power series centered at 1.) If x is so small that 1 + x equals 1 inside the computer, x2 must be very small indeed. So f(x) could return exp(x) if exp(x) is sufficiently small. This gives our third attempt.

double f(double x)
{ 
    if (x > X_MAX) return x; 
    if (x < X_MIN) return exp(x);
    return log(1.0 + exp(x));
}

This is a big improvement. It can still underflow for large negative x, but in that case the function itself is underflowing, not just an intermediate result.

Now what do we make X_MAX and X_MIN? For X_MAX we don’t really have to worry about exp(x) overflowing. We can stop when x is so large that exp(x) + 1 equals exp(x) to machine precision. In C there is a header file float.h that contains a constant DBL_EPSILON which is the smallest number ε such that 1 + ε does not equal 1 in machine precision. So it turns out we can set X_MAX equal to -log(DBL_EPSILON). Similarly we could set X_MIN equal to log(DBL_EPSILON).

There’s still a small problem. When x is large and negative, but not so negative that 1 + exp(x) equals 1 in the computer, we could lose precision in computing log(1 + exp(x)).

I have just posted an article on CodeProject entitled Avoiding Overflow, Underflow, and Loss of Precision that has more examples where the most direct method for evaluating a function may not work. Example 2 in that paper explains why directly computing log(1 + y) can be a problem even if y is not so small that 1 + y equals 1 to machine precision. The article explains how to compute log(1 + y) in that case. Setting y = exp(x) explains how to finish the example here when x is moderately large and negative.

Random number generator controversy

Saturday, April 12th, 2008

I submitted an article to Code Project yesterday, Simple Random Number Generation, describing a small C# class called SimpleRNG that uses George Marsaglia’s WMC algorithm. The article was posted around 5 PM (central US time) and comments started pouring in right away. I didn’t expect any feedback on a Friday afternoon or Saturday morning. But as I write this post, there have been 580 page views and 11 comments.

There have been three basic questions raised in the comments.

  1. Why not just use the random number generator that comes with .NET?
  2. Is this code suitable for cryptography?
  3. Is this code suitable for Monte Carlo applications?

Why not use the built-in generator? For many applications, the simplest thing would be to use the .NET random number generator. But there are instances where this might not be best. There are questions about the statistical quality of the .NET generator; I’ll get to that in a minute. The primary advantages I see to the SimpleRNG class are transparency and portability.

By transparency I mean that the internal state of the generator is simple and easy to access. When you’re trying to reproduce a result, say while debugging, it’s convenient to have full access to the internal state of the random generator. If you’re using your own generator, you can see everything. You can even temporarily change it: for debugging, it may be convenient to temporarily have the “random” generator return a very regular, predictable sequence.

By portability I do not necessarily mean moving the code between operating systems. The primary application I have in mind is moving the algorithm between languages. For example, in my work we often have prototype code written in R that needs to be rewritten in C++ for efficiency. If the code involves random number generation, the output of the prototype and the rewrite cannot be directly compared, only compared on average. Then you have to judge whether the differences are to be expected or whether they indicate a bug. But if both the R and the C++ code use the same RNG algorithm and the same seed, the results may be directly comparable. (They still may not be directly comparable due to other factors, but at least this way the results are often comparable.)

As for cryptography, no, SimpleRNG is not appropriate for cryptography.

As for Monte Carlo applications, not all Monte Carlo applications are created equal. Some applications do not require high quality random number generators. Or more accurately, different applications require different kinds of quality. Some random number generators break down when used for high-dimensional integration. I suspect SimpleRNG is appropriate for moderate dimensions. I use the Mersenne Twister generator for numerical integration. However, SimpleRNG is faster and much simpler; the MT generator has a very large internal state.

Someone commented on the CodeProject article that the random number generator in .NET is not appropriate for Monte Carlo simulation because it does not pass Marsaglia’s DIEHARD tests while SimpleRNG does. I don’t know what algorithm the .NET generator uses, so I can’t comment on its quality. Before I’d use it in statistical applications, I’d want to find out.

Linear interpolator

Wednesday, March 26th, 2008

I added a form to my web site yesterday that does linear interpolation. If you enter (x1, y1) and (x2, y2), it will predict x3 given y3or vice versa by fitting a straight line to the first two points. It’s a simple calculation, but it comes up just often enough that it would be handy to have a page to do it.

Plausible reasoning

Wednesday, March 19th, 2008

If Socrates is probably a man, he’s probably mortal.

How do you extend classical logic to reason with uncertain propositions, such as the statement above? Suppose we agree to represent degrees of plausibility with real numbers, larger numbers indicating greater plausibility. If we also agree to a few axioms to quantify what we mean by consistency and common sense, there is a unique system that satisfies the axioms. The derivation is tedious and not well suited to a blog posting, so I’ll cut to the chase: given certain axioms, the inevitable system for plausible reasoning is probability theory.

There are two important implications of this result. First, it is possible to develop probability theory with no reference to sets. This renders much of the controversy about the interpretation of probability moot. Instead of arguing about what a probability can and cannot represent, one could concede the point. “We won’t use probabilities to represent uncertain information. We’ll use ‘plausibilities’ instead, derived from rules of common sense reasoning. And by the way, the resulting theory is identical to probability theory.”

The other important implication is that all other systems of plausible reasoning — fuzzy logic, neural networks, artificial intelligence, etc. — must either lead to the same conclusions as probability theory, or violate one of the axioms used to derive probability theory.

See the first two chapters of Probability Theory by E. T. Jaynes for a full development. It’s interesting to note that the seminal paper in this area came out over 60 years ago. (Richard Cox, “Probability, frequency, and reasonable expectation”, 1946.)

Error function and the normal distribution

Saturday, March 15th, 2008

The error function erf(x) and the normal distribution Φ(x) are essentially the same function. The former is more common in math, the latter in statistics. I often have to convert between the two.

It’s a simple exercise to move between erf(x) and Φ(x), but it’s tedious and error-prone, especially when you throw in variations on these two functions such as their complements and inverses. Some time ago I got sufficiently frustrated to write up the various relationships in a LaTeX file for future reference. I was using this file yesterday and thought I should post it as a PDF file in case it could save someone else time and errors.

What is the cosine of a matrix?

Friday, March 14th, 2008

How would you define the cosine of a matrix? If you’re trying to think of a triangle whose sides are matrices, you’re not going to get there. Think of power series. If a matrix A is square, you can stick it into the power series for cosine and call the sum the cosine of A.

cosine series

For example,

cosine of a 2x2 matrix 

This only works for square matrices. Otherwise the powers of A are not defined.

The power series converges and has many of the properties you’d expect. However, the usual trig identities may or may not apply. For example,

cos(A+B)

only if the matrices A and B commute, i.e. AB = BA. To see why this is necessary, imagine trying to prove the sum identity above. You’d stick A+B into the power series and do some algebra to re-arrange terms to get the terms on the right side of the equation. Along the way you’ll encounter terms like A2 + AB + BA + B2 and you’d like to factor that into (A+B)2, but you can’t justify that unless A and B commute.

Is cosine still periodic in this context? Yes, in the sense that cos(A + 2πI) = cos(A). This is because the diagonal matrix 2πI commutes with every matrix A and so the sum identity above holds.

Why would you want to define the cosine of a matrix? One application of analytic functions of a matrix is solving systems of differential equations. Any linear system of ODEs, of any order, can be rewritten in the form x‘ = Ax where x is a vector of functions and A is a square matrix. Then the solution is x(t) = etA x(0). And cos(At) is a solution to x‘ ‘+ A2x = 0, just as in calculus.

Four characterizations of the normal distribution

Thursday, March 13th, 2008

The normal distribution pops up everywhere in statistics. Contrary to popular belief, the name does not come from “normal” as in “conventional.” Instead the term comes from a detail in a proof by Gauss discussed below where he showed that two things were perpendicular in a sense.

(The word “normal” originally meant “at a right angle,” going back to the Latin word normalis for a carpenter’s square. Later the word took on the metaphorical meaning of something in line with custom. Mathematicians sometimes use “normal” in the original sense of being orthogonal.) 

The mistaken etymology persists because the normal distribution is conventional. Statisticians often assume anything random has a normal distribution by default. While this assumption is not always justified, it often works remarkably well. This post gives four lines of reasoning that lead naturally to the normal distribution.

Abraham de Moivre

1) The earliest characterization of the normal distribution is the central limit theorem, going back to Abraham de Moivre. Roughly speaking, this theorem says that if you average enough distributions together, even if they’re not normal, in the limit their average is normal. But this justification for assuming normal distributions everywhere has a couple problems. First, the convergence in the central limit theorem may be slow, depending on what is being averaged. Second, if you relax the hypotheses on the central limit theorem, other stable distributions with thicker tails also satisfy a sort of central limit theorem. The characterizations given below are more satisfying because they do not rely on limit theorems.

William Herschel

2) The astronomer William Herschel discovered the simplest characterization of the normal. He wanted to characterize the errors in astronomical measurements. He assumed (1) the distribution of errors in the x and y directions must be independent, and (2) the distribution of errors must be independent of angle when expressed in polar coordinates. These are very natural assumptions for an astronomer, and the only solution is a product of the same normal distribution in x and y. James Clerk Maxwell came up with an analogous derivation in three dimensions when modelling gas dynamics.

Carl Friedrich Gauss

3) Carl Friedrich Gauss came up with the characterization of the normal distribution that caused it to be called the “Gaussian” distribution. There are two strategies for estimating the mean of a random variable from a sample: the arithmetic mean of the samples, and the maximum likelihood value. Only for the normal distribution do these coincide.

C. R. Rao

4) The final characterization listed here is in terms of entropy. For a specified mean and variance, the probability density with the greatest entropy (least information) is the normal distribution. I don’t know who discovered this result, but I read it in C. R. Rao’s book. Perhaps it’s his result. If anyone knows, please let me know and I’ll update this post. For advocates of maximum entropy this is the most important characterization of the normal distribution.

In praise of tedious proofs

Tuesday, March 11th, 2008

The book Out of Their Minds quotes Leslie Lamport on proofs:

The proofs have been carried out to an excruciating level of detail … The reader may feel that we have given long, tedious proofs of obvious assertions. However, what he has not seen are the many equally obvious assertions that we discovered to be wrong only by trying to write similarly long, tedious proofs.                   

See Lamport’s paper How to Write a Proof. See also Complementary validation.

Math and stat posts classified

Sunday, March 9th, 2008

I’ve added a page to my web site where I classified my blog posts and informal articles on math and statistics into six categories:

  • Elementary
  • Preventing and detecting errors
  • Interpreting and misinterpreting probabilities
  • Mathematical statistics
  • Practicalities
  • Pure math

I’ve put a link to this page on the side of my blog and intend to keep it up as I add posts related to math and statistics.

Coloring numbers

Thursday, March 6th, 2008

Attaching units to numbers reduces the chance of mistakes. For example, if you’re about to add 3 pounds to 7 feet, you’ve probably done something wrong. (The result would be 10 what?) In engineering, this is called “dimensional analysis.”

In computer science, the analogous discipline is strong typing. In strongly typed programming languages, a function that expects a floating point number will complain bitterly if you pass in a JPEG image instead.

When I’m doing math, I’ll sometimes start out with back-of-the-envelope scribbling until I start getting confused. Often the way out of my confusion is to be explicit about function domains and ranges, very much like strong typing in programming.

But sometimes natural typing isn’t enough, and it helps to create artificial distinctions. For example, torque and work both have units of force times distance, but it would be a mistake to add torque and work.

Sometimes it helps me to imagine numbers having different colors. Say I’ve got a function f(x) and I imagine that it takes in red numbers and produces blue numbers, while another function g(x) takes in blue numbers and produces green numbers. That helps me keep straight that expressions like g(f(x)) are OK, but expressions like f(x) + g(x) are probably not.

Stable distributions

Saturday, March 1st, 2008

In a beginning class in probability or statistics, you learn that a linear combination of normal random variables is another normal random variable. For example, if X and Y are normally distributed, so is 3X + 5.2Y.  Since the normal distribution gets an inordinate amount of attention, there may be an implicit message that this is a common property. Is it? Are there many distributions closed under linear combinations?

Distributions families that are closed under linear combinations are called “stable”. Only three families of stable distributions are known that can be written down in analytic form: normal, Lévy,  and Cauchy. Of these, the only symmetric ones are normal and Cauchy. The Lévy and Cauchy distributions have much heavier tails than the normal, so heavy that they do not have a mean.

Stable distributions are the only non-trivial limiting distributions of generalized central limit theorems. 

The Lévy distribution was in the news recently. See the February 28 Nature podcast for a story about marine predator food-finding behavior and random walks with a Lévy distribution.