From the category archives:

Math

Riemann hypothesis proof?

by John on July 2, 2008

Xian-Jin Li claims to have proven the Riemann hypothesis, one of the most famous open problems in math. His paper is posted arXiv.

The Riemann hypothesis is no obscure conjecture. It’s a natural question and central to number theory. For years mathematicians have been proving theorems of the form “If the Riemann hypothesis is true, then …” and so if Li’s result is correct, many other new results follow. Also, if the proof holds up, Li wins a $1,000,000 prize from the Clay Institute for solving one of the Millennium Problems.

Hat tip: Isabel Lugo

Update: Looks like there’s a flaw in the proof.

{ 0 comments }

Two definitions of expectation

by John on June 27, 2008

In an introductory probability class, the expected value of a random variable X is defined as

E(X) = \int_{-\infty}^\infty x\, f_X(x) \,dx

where fXis the probability density function of X. I’ll call this the analytical definition.

In a more advanced class the expected value of X is defined as

E(X) = \int_\Omega X \,dP

where (Ω, P) is a probability space. I’ll call this the measure-theoretic definition. It’s not obvious that these two definitions are equivalent. They may even seem contradictory unless you look closely: they’re integrating different functions over different spaces.

If for some odd reason you learned the measure-theoretic definition first, you could see the analytical definition as a theorem. But if, like most people, you learn the analytical definition first, the measure-theoretic version is quite mysterious. When you take an advanced course and look at the details previously swept under the rug, probability looks like an entirely different subject, unrelated to your introductory course. The definition of expectation is just one concept among many that takes some work to resolve.

I’ve written a couple pages of notes that bridge the gap between the two definitions of expectation and show that they are equivalent.

{ 0 comments }

Probability approximations

by John on June 22, 2008

When I took my first probability course, it seemed like there were an infinite number of approximation theorems to learn, all mysterious. Looking back, there were probably only two or three, and they don’t need to be mysterious.

For example, under the right circumstances you can approximate a Binomial(n, p) well with a Normal(np, np(1-p)). While the relationship between the parameters in these two distributions is obvious to the initiated, it’s not at all obvious to a beginner. It seems much clearer to say that a Binomial can be approximated by a Normal with the same mean and variance. After all, a distribution that doesn’t get the mean and variance correct doesn’t sound like a very promising approximation.

Taking it a step further, a good teacher could guide a class to discover this approximation themselves. This would take more time than simply stating the result and working an example or two, but the difference in understanding would be immense. And if you’re not going to take the time to aim for understanding, what’s the point in covering approximation theorems at all? They’re not used that often for computation anymore. In my opinion, the only reason to go over them is the insight they provide.

{ 0 comments }

Diagramming modes of convergence

by John on May 25, 2008

Sometimes a good diagram is a godsend, reducing the entropy in your head at a glace.

When I was studying integration theory, I ran across a diagram something like the following in the out-of-print book Elements of Integration by Robert G. Bartle.

convergence for a general measure space

Each arrow summarizes a theorem. The four corners stand for modes of convergence: almost everywhere, almost uniform, Lp, and convergence in measure. A solid line from one mode to another means that if a sequence of function converges in the first mode then it also converges in the other. A dashed line means that a subsequence converges in the second mode. For example, if a sequence of functions fnconverge to a function f, then the sequence also converges in measure to f, and a subsequence converges to f almost everywhere.

Note that this isn’t the typical mathematical diagram where the corners represent spaces and the arrows represent functions. I don’t recall seeing anything like this outside of Bartle’s book.

For a finite measure space, additional relationships hold as summarized in the diagram below.

convergence for a finite measure space

If the sequence fn is uniformly dominated by an Lp function g, then even more relationships hold.

dominated convergence

For more details, see modes of convergence. These notes also contain counterexamples that show no other relationships exist. Three simple counterexamples suffice to show no arrows are missing.

{ 0 comments }

Jenga mathematics

by John on May 11, 2008

Jenga is a game where you start with a tower of wooden pegs and take turns removing pegs until someone makes the tower collapse. A style of mathematics analogous to Jenga reached the height of its popularity about 40 years ago and then fell out of fashion. I use the phrase “Jenga mathematics” to refer to generalizing a well-known theorem by weakening its hypotheses, seeing how many pegs you can pull out before it falls.

Jenga game photo

Many 20th century mathematicians spent their careers going over the work of 19th century mathematicians, removing every hypothesis they could. Sometimes a 20th century mathematician would get his name tacked on to a 19th century theorem due to his Jenga accomplishments.

Taken to extremes, Jenga mathematics turns theorems inside-out and proofs become hypotheses. Natural hypotheses are replaced with a laundry list of properties necessary to make the proof work. Start with some theorem of the form “Let X be a widget. Then X has a foozle.” Go back over the proof and see just what features of a widget are needed for the proof. Then restate the theorem as “Let X have the following apparently arbitrary list of properties necessary for my proof to work. Then X has a foozle.” Never mind whether anybody can think of anything other that a widget that satisfies the hypotheses of the new theorem.

Jenga mathematics is no longer fashionable. Mathematicians still value removing unneeded hypotheses, but they’re not as willing to go to extremes to do so. They are more interested in building new towers than in removing every piece possible from old towers.

{ 6 comments }

How to calculate binomial probabilities

by John on April 24, 2008

Suppose you’re doing something that has probability of success p and probability of failure q = 1-p. If you repeat what you’re doing m+n times, the probability of m successes and n failures is given by  

\frac{(m+n)!}{m!\, n!} p^m q^n

Now suppose m and n are moderately large. The terms (m+n)! and m! n! will be huge, but the terms pm and qn will be tiny. The huge terms might overflow, and the tiny terms might underflow, even though the final result may be a moderate-sized number. The numbers m and n don’t have to be that large for this to happen since factorials grow very quickly. On a typical system, overflow happens if m+n > 170. How do you get to the end result without having to deal with impossibly large or small values along the way?

The trick is to work with logs. You want to first calculate log( (m+n)! ) - log( m! ) - log( n! ) + m log( p ) + n log( q ) , then exponentiate the result. This pattern comes up constantly in statistical computing.

Libraries don’t always include functions for evaluating factorial or log factorial. They’re more likely to include functions for Γ(x) and its log. For positive integers n, Γ(n+1) = n!. Now suppose you have a function lgamma that computes log Γ(x). You might write something like this.

    double probability(double p, double q, int m, int n)
    { 
        double temp = lgamma(m + n + 1.0);
        temp -=  lgamma(n + 1.0) + lgamma(m + 1.0);
        temp += m*log(p) + n*log(q);
        return exp(temp);
    }

The function lgamma is not part of the ANSI standard library for C or C++, but it is part of the POSIX standard. On Unix-like systems, lgamma is included in the standard library. However, Microsoft does not include lgamma as part of the Visual C++ standard library. On Windows you have to either implement your own lgamma function or grab an implementation from somewhere like Boost.

Here’s something to watch out for with POSIX math libraries. I believe the POSIX standard does not require a function called gamma for evaluating the gamma function Γ(x). Instead, the standard requires functions lgamma for the log of the gamma function and tgamma for the gamma function itself. (The mnemonic is “t” for “true,” meaning that you really want the gamma function.) I wrote a little code that called gamma and tested it on OS X and Red Hat Linux this evening. In both cases gcc compiled the code without warning, even with the -Wall and -pedantic warning flags. But on the Mac, gamma was interpreted as tgamma and on the Linux box it was interpreted as lgamma. This means that gamma(10.0) would equal 362880 on the former and 12.8018 on the latter.

{ 2 comments }

Fibonacci numbers at work

by John on April 23, 2008

Today I needed to use Fibonacci numbers to solve a problem at work. Fibonacci numbers are great fun, but I don’t recall needing them in an applied problem before.

I needed to compute a series of integrals of the form  

f(x, y) = xa (1-x)b yc (1-y)d p(x, y)

over the unit square for a statistical application. The function p(x, y) is a little complicated but its specific form is not important here. If the constants a, b, c, and d are all positive, as they usually are in my application, the integrand can be extended to a continuous periodic function in the plane. Lattice rules are efficient for such integration problems, and the optimal lattice rules for two-variable integration are given by Fibonacci lattice rules.

If Fk is the kth Fibonacci number and {x} is the remainder when subtracting off the greatest integer less than x, the kth Fibonacci lattice rule approximates the integral of f by

Fibonacci lattice integration rule

This rule worked well. I compared the results to those using the two-variable trapezoid rule and the lattice rule always gave more accuracy for the same number of integration points. (Normally the trapezoid rule typically not very efficient. It’s more efficient for periodic integrands, but the lattice rule was more efficient still.)

To implement this integration rule, I first had to write a function to compute Fibonacci numbers. This is such a common academic exercise that it is almost a cliché. I was surprised to have a practical need for such a function. My implementation was

    int Fibonacci(int k)
    {
        double phi = 0.5*(1.0 + sqrt(5.0));
        return int( pow(phi, k)/sqrt(5.0) + 0.5 );
    }

The reason this works is that

equation for Fibonacci numbers

where

definition of phi, golden ratio

is the golden ratio. The second term in the formula for Fk is smaller than 1, and Fk is an integer, so by rounding the first term to the nearest integer we get the exact result. Adding 0.5 before casting the result to an integer causes the result to be rounded to the nearest integer.

If you’re interested in technical details of the numerical integration method, see Lattice Methods for Multiple Integration by I. H. Sloan and S. Joe.

{ 3 comments }

Overflow and loss of precision

by John on April 16, 2008

Suppose you need to evaluate the function f(x) = log(1 + ex). The most obvious code to write something like this in C:

    double f(double x) { return log(1 + exp(x)); }

This is so simple you don’t even test it. Then someone calls you to say they’re getting strange output such as 1.#INF depending on your environment. What’s going on?

For large values of x, ex is much bigger than 1, so f(x) is approximately log(ex) which equals x. So, for example, f(1000) should return something very close to 1000. But instead, you get gibberish. In most environments a floating point number can be as large as about 10308, but exp(1000) is about 2×10434 and so the result overflows. Then the code takes the log of “infinity.”

But f(1000) shouldn’t overflow; the result is about 1000. Our function result can be contained in a floating point number, but our intermediate result exp(1000) cannot be. We need some way of telling the computer “Hold on. I know this is going to be a huge number, but trust me for a minute. After I add 1 I’m going to take a log and bring the result back down to a medium sized number.” Unfortunately computers don’t work that way.

One solution would be to see whether x is so large that exp(x) will overflow, and in that case f(x) could just return x. So our second attempt might look like this:

    double f(double x) { return (x > X_MAX) ? x : log(1 + exp(x)); }

I’ll come back later to what X_MAX should be. Suppose you’ve found a good value for this constant and release your code again. Then you get another call. They say your code is returning zero when it should not.

Someone is trying to compute f(-50). They know the answer is small, but it shouldn’t be zero. How can that be? Since you just learned about overflow, you suspect the problem might have to do with the opposite problem, underflow. But that’s not quite it. The result of exp(-50) does not underflow; it’s about 2×10-22. But it is so small that machine precision cannot distinguish 1+exp(-50) from 1. So the code returns log(1), which equals zero. This may or may not be a problem. The correct answer is very small, so the absolute error is small but the relative error is 100%. Whether that is a problem depends on what you’re going to do with the result. If you’re going to add it to a moderate size number, no big deal. If you’re going to divide by it, it’s a very big deal.

Now what do you do? You think back to your calculus class and remember that for small x, log(1 + x) is approximately x with error on the order of x2. (To see this, expand log(1 + x) in a power series centered at 1.) If x is so small that 1 + x equals 1 inside the computer, x2 must be very small indeed. So f(x) could return exp(x) if exp(x) is sufficiently small. This gives our third attempt.

double f(double x)
{ 
    if (x > X_MAX) return x; 
    if (x < X_MIN) return exp(x);
    return log(1.0 + exp(x));
}

This is a big improvement. It can still underflow for large negative x, but in that case the function itself is underflowing, not just an intermediate result.

Now what do we make X_MAX and X_MIN? For X_MAX we don’t really have to worry about exp(x) overflowing. We can stop when x is so large that exp(x) + 1 equals exp(x) to machine precision. In C there is a header file float.h that contains a constant DBL_EPSILON which is the smallest number ε such that 1 + ε does not equal 1 in machine precision. So it turns out we can set X_MAX equal to -log(DBL_EPSILON). Similarly we could set X_MIN equal to log(DBL_EPSILON).

There’s still a small problem. When x is large and negative, but not so negative that 1 + exp(x) equals 1 in the computer, we could lose precision in computing log(1 + exp(x)).

I have just posted an article on CodeProject entitled Avoiding Overflow, Underflow, and Loss of Precision that has more examples where the most direct method for evaluating a function may not work. Example 2 in that paper explains why directly computing log(1 + y) can be a problem even if y is not so small that 1 + y equals 1 to machine precision. The article explains how to compute log(1 + y) in that case. Setting y = exp(x) explains how to finish the example here when x is moderately large and negative.

{ 0 comments }

Random number generator controversy

by John on April 12, 2008

I submitted an article to Code Project yesterday, Simple Random Number Generation, describing a small C# class called SimpleRNG that uses George Marsaglia’s WMC algorithm. The article was posted around 5 PM (central US time) and comments started pouring in right away. I didn’t expect any feedback on a Friday afternoon or Saturday morning. But as I write this post, there have been 580 page views and 11 comments.

There have been three basic questions raised in the comments.

  1. Why not just use the random number generator that comes with .NET?
  2. Is this code suitable for cryptography?
  3. Is this code suitable for Monte Carlo applications?

Why not use the built-in generator? For many applications, the simplest thing would be to use the .NET random number generator. But there are instances where this might not be best. There are questions about the statistical quality of the .NET generator; I’ll get to that in a minute. The primary advantages I see to the SimpleRNG class are transparency and portability.

By transparency I mean that the internal state of the generator is simple and easy to access. When you’re trying to reproduce a result, say while debugging, it’s convenient to have full access to the internal state of the random generator. If you’re using your own generator, you can see everything. You can even temporarily change it: for debugging, it may be convenient to temporarily have the “random” generator return a very regular, predictable sequence.

By portability I do not necessarily mean moving the code between operating systems. The primary application I have in mind is moving the algorithm between languages. For example, in my work we often have prototype code written in R that needs to be rewritten in C++ for efficiency. If the code involves random number generation, the output of the prototype and the rewrite cannot be directly compared, only compared on average. Then you have to judge whether the differences are to be expected or whether they indicate a bug. But if both the R and the C++ code use the same RNG algorithm and the same seed, the results may be directly comparable. (They still may not be directly comparable due to other factors, but at least this way the results are often comparable.)

As for cryptography, no, SimpleRNG is not appropriate for cryptography.

As for Monte Carlo applications, not all Monte Carlo applications are created equal. Some applications do not require high quality random number generators. Or more accurately, different applications require different kinds of quality. Some random number generators break down when used for high-dimensional integration. I suspect SimpleRNG is appropriate for moderate dimensions. I use the Mersenne Twister generator for numerical integration. However, SimpleRNG is faster and much simpler; the MT generator has a very large internal state.

Someone commented on the CodeProject article that the random number generator in .NET is not appropriate for Monte Carlo simulation because it does not pass Marsaglia’s DIEHARD tests while SimpleRNG does. I don’t know what algorithm the .NET generator uses, so I can’t comment on its quality. Before I’d use it in statistical applications, I’d want to find out.

{ 0 comments }

Learning is not the same as just gaining information. Sometimes learning means letting go of previously held beliefs. While this is true in life in general, my point here is to show how this holds true when using the mathematical definition of information.

The information content of a probability density function p(x) is given by

integral of p(x) log p(x)

Suppose we have a Beta(2, 6) prior on the probability of success for a binary outcome.

plot of beta(2,6) density

The prior density has information content 0.597. Then suppose we observe a success. The posterior density is distributed as Beta(3, 6). The posterior density has information 0.516, less information than the prior density.

plot of beta(3,6) density

Observing a success pulled the posterior density toward the right. The posterior density is a little more diffuse than the prior and so has lower information content. In that sense, we know less than before we observed the data! Actually, we’re less certain than we were before observing the data. But if the true probability of response is larger than our prior would indicate, we’re closer to the truth by becoming less confident of our prior belief, and we’ve learned something.

{ 0 comments }

Linear interpolator

by John on March 26, 2008

I added a form to my web site yesterday that does linear interpolation. If you enter (x1, y1) and (x2, y2), it will predict x3 given y3or vice versa by fitting a straight line to the first two points. It’s a simple calculation, but it comes up just often enough that it would be handy to have a page to do it.

{ 0 comments }

Plausible reasoning

by John on March 19, 2008

If Socrates is probably a man, he’s probably mortal.

How do you extend classical logic to reason with uncertain propositions, such as the statement above? Suppose we agree to represent degrees of plausibility with real numbers, larger numbers indicating greater plausibility. If we also agree to a few axioms to quantify what we mean by consistency and common sense, there is a unique system that satisfies the axioms. The derivation is tedious and not well suited to a blog posting, so I’ll cut to the chase: given certain axioms, the inevitable system for plausible reasoning is probability theory.

There are two important implications of this result. First, it is possible to develop probability theory with no reference to sets. This renders much of the controversy about the interpretation of probability moot. Instead of arguing about what a probability can and cannot represent, one could concede the point. “We won’t use probabilities to represent uncertain information. We’ll use ‘plausibilities’ instead, derived from rules of common sense reasoning. And by the way, the resulting theory is identical to probability theory.”

The other important implication is that all other systems of plausible reasoning — fuzzy logic, neural networks, artificial intelligence, etc. — must either lead to the same conclusions as probability theory, or violate one of the axioms used to derive probability theory.

See the first two chapters of Probability Theory by E. T. Jaynes for a full development. It’s interesting to note that the seminal paper in this area came out over 60 years ago. (Richard Cox, “Probability, frequency, and reasonable expectation”, 1946.)

{ 0 comments }

Error function and the normal distribution

by John on March 15, 2008

The error function erf(x) and the normal distribution Φ(x) are essentially the same function. The former is more common in math, the latter in statistics. I often have to convert between the two.

It’s a simple exercise to move between erf(x) and Φ(x), but it’s tedious and error-prone, especially when you throw in variations on these two functions such as their complements and inverses. Some time ago I got sufficiently frustrated to write up the various relationships in a LaTeX file for future reference. I was using this file yesterday and thought I should post it as a PDF file in case it could save someone else time and errors.

{ 0 comments }

What is the cosine of a matrix?

by John on March 14, 2008

How would you define the cosine of a matrix? If you’re trying to think of a triangle whose sides are matrices, you’re not going to get there. Think of power series. If a matrix A is square, you can stick it into the power series for cosine and call the sum the cosine of A.

cosine series

For example,

cosine of a 2x2 matrix 

This only works for square matrices. Otherwise the powers of A are not defined.

The power series converges and has many of the properties you’d expect. However, the usual trig identities may or may not apply. For example,

cos(A+B)

only if the matrices A and B commute, i.e. AB = BA. To see why this is necessary, imagine trying to prove the sum identity above. You’d stick A+B into the power series and do some algebra to re-arrange terms to get the terms on the right side of the equation. Along the way you’ll encounter terms like A2 + AB + BA + B2 and you’d like to factor that into (A+B)2, but you can’t justify that unless A and B commute.

Is cosine still periodic in this context? Yes, in the sense that cos(A + 2πI) = cos(A). This is because the diagonal matrix 2πI commutes with every matrix A and so the sum identity above holds.

Why would you want to define the cosine of a matrix? One application of analytic functions of a matrix is solving systems of differential equations. Any linear system of ODEs, of any order, can be rewritten in the form x‘ = Ax where x is a vector of functions and A is a square matrix. Then the solution is x(t) = etA x(0). And cos(At) is a solution to x‘ ‘+ A2x = 0, just as in calculus.

{ 0 comments }

The normal distribution pops up everywhere in statistics. Contrary to popular belief, the name does not come from “normal” as in “conventional.” Instead the term comes from a detail in a proof by Gauss discussed below where he showed that two things were perpendicular in a sense.

(The word “normal” originally meant “at a right angle,” going back to the Latin word normalis for a carpenter’s square. Later the word took on the metaphorical meaning of something in line with custom. Mathematicians sometimes use “normal” in the original sense of being orthogonal.) 

The mistaken etymology persists because the normal distribution is conventional. Statisticians often assume anything random has a normal distribution by default. While this assumption is not always justified, it often works remarkably well. This post gives four lines of reasoning that lead naturally to the normal distribution.

Abraham de Moivre

1) The earliest characterization of the normal distribution is the central limit theorem, going back to Abraham de Moivre. Roughly speaking, this theorem says that if you average enough distributions together, even if they’re not normal, in the limit their average is normal. But this justification for assuming normal distributions everywhere has a couple problems. First, the convergence in the central limit theorem may be slow, depending on what is being averaged. Second, if you relax the hypotheses on the central limit theorem, other stable distributions with thicker tails also satisfy a sort of central limit theorem. The characterizations given below are more satisfying because they do not rely on limit theorems.

William Herschel

2) The astronomer William Herschel discovered the simplest characterization of the normal. He wanted to characterize the errors in astronomical measurements. He assumed (1) the distribution of errors in the x and y directions must be independent, and (2) the distribution of errors must be independent of angle when expressed in polar coordinates. These are very natural assumptions for an astronomer, and the only solution is a product of the same normal distribution in x and y. James Clerk Maxwell came up with an analogous derivation in three dimensions when modelling gas dynamics.

Carl Friedrich Gauss

3) Carl Friedrich Gauss came up with the characterization of the normal distribution that caused it to be called the “Gaussian” distribution. There are two strategies for estimating the mean of a random variable from a sample: the arithmetic mean of the samples, and the maximum likelihood value. Only for the normal distribution do these coincide.

C. R. Rao

4) The final characterization listed here is in terms of entropy. For a specified mean and variance, the probability density with the greatest entropy (least information) is the normal distribution. I don’t know who discovered this result, but I read it in C. R. Rao’s book. Perhaps it’s his result. If anyone knows, please let me know and I’ll update this post. For advocates of maximum entropy this is the most important characterization of the normal distribution.

{ 2 comments }