I published an article on CodeProject this week about some of the traps you can fall into when using random number generators.
From the category archives:
Math
Two weeks ago I wrote a series of posts on random inequalities: part I, part II, part III. In the process of writing these, I found an error in a tech report I wrote five years ago. I’ve posted a corrected version and describe the changes here.
Suppose X1 is a Cauchy random variable with median m1 and scale s1 and similarly for X2. Then X1 - X2 is a Cauchy random variable with median m1 - m2 and scale s1 + s2. Then P(X1 > X2) equals
P(X1 - X2 > 0) = P(m1 - m2 + (s1 + s2) C > 0)
where C is a Cauchy random variable with median 0 and scale 1. This reduces to
P(C < (m1 - m2)/(s1 + s2)) = 1/2 + atan( (m1 - m2)/(s1 + s2) )/π.
The original version was missing the factor of 1/2. This is obviously wrong because it would say that P(X1 > X2) is negative when m1 < m2.
By the way, I was told in college that the Cauchy distribution is an impractical curiosity, something more useful for developing counterexamples than modeling real phenomena. That was an overstatement. Thick-tailed distributions like the Cauchy often arise in applications, sometimes directly (see Noise, The Black Swan) or indirectly (for example, robust or default prior distributions).
Update: See part V on beta distributions.
{ 0 comments }
Scientific American has an article suggesting that our natural sense of numbers may operate on a logarithmic scale rather than a linear scale.
It has long been known that our senses often work on a logarithmic scale. For example, sound intensity is measured in decibels, a logarithmic scale. Pitch is perceived on a logarithmic scale, as the Pythagoreans discovered. When moving up a chromatic scale, it’s not the differences in frequencies but the ratios of frequencies that are constant. An octave is a ratio of 2 to 1, so a half step is a ratio of 21/12 to one since there are 12 half step in an octave.
The Statistical Modeling, Causal Inference, and Social Science blog gives an interesting example combining linear and logarithmic perceptions. They quote a study suggesting that when deciding whether to walk to a new well based on information regarding arsenic levels, Bangladeshis perceived “distance to nearest safe well” linearly but perceived “arsenic level” logarithmically.
{ 0 comments }
Textbooks say the normal approximation to the binomial distribution is good when “n is large.” How large is large enough? Some books say n ≥ 10. How good is it in that case? Hmm, it doesn’t say. The same holds for the normal approximation to the Poisson. The only advice you’re likely to see is that the approximation is better when the λ parameter is “large.”
Approximations are much more useful when you have an error bound or an error estimate, and yet I’ve never seen a textbook that gave an error bound when introducing these approximations. You’ll probably get an example or two, but nothing general. One justification for this omission is that the error bounds rely on more advanced results than could be proven in an introductory class. But the approximation result is already a rabbit out of a hat. Why not pull one more rabbit out of the hat and state an error bound?
I’m supposed to teach a math stat class in the Fall. I’ve prepared some notes looking into the details of the various approximations so I could do better than just wave my hands when presenting this material. Here are my notes for the Poisson and binomial distributions.
Here’s one of the graphs from the notes, an example showing the error in normal approximation to a Poisson(20) density function.

{ 0 comments }
E. Kowalski has a post this morning called Finding life beyond the Central Limit Theorem. The post mentions a theorem I’ve never heard of that connects basic concepts in probability and number theory.
Surely the Central Limit Theorem and the normal distribution are important in probability. You might even call them “central.” And counting prime divisors of integers is a natural thing to do in number theory. It turns out these two concepts are closely related.
Pick a big number N. Now select a random integer n between 1 and N. Count the number of distinct prime divisors of n and call that ω(n). Roughly speaking, ω(n) acts like a normal random variable with mean and variance log log N.
There are plenty of other connections between probability and number theory. I took probability from Jeffrey Vaaler, a number theorist. I remember three informal remarks he made to suggest connections between number theory and probability.
- Being relatively prime is analogous to being independent. Suppose p and q are two positive integers with no factors in common. Now pick numbers at random between 1 and pq. Being divisible by p is independent of being divisible by q.
- There’s a heuristic in number theory that prime numbers behave randomly except when there’s a good reason that they don’t.
- Some things are so hard to find that the best way to look for them is to search at random. Often an effective way to prove that something exists is to show that the probability of selecting the thing you’re looking for from some larger set is positive.
{ 0 comments }
Classroom exercises always have nice, tidy solutions. So students implicitly assume that all problems have nice, tidy solutions. If the solution isn’t working out simply, you must have made a mistake.
Outside the classroom, applications seldom have simple solutions. So after a while you get jaded and quit trying to find a simple solution. But sometimes real problems do have simple solutions, or at least simpler solutions than seemed possible.
The Extreme Programming folks have a saying “Try the simplest thing that could possibly work.” If that doesn’t work, then try the next simplest thing that could possibly work. That line of thinking has paid off a few times lately.
I’ve had a couple math problems that I first assumed had to be approximated numerically that were more easily computed exactly. And I’ve had a couple programs where I was able to debug a section of code by simply deleting it. Things don’t always work out that well, but it’s fun when they do.
{ 0 comments }
The first post in this series introduced random inequalities. The second post discussed random inequalities can could be computed in closed form. This post considers random inequalities that must be evaluated numerically.
The simplest and most obvious approach to computing P(X > Y) is to generate random samples from X and Y and count how often the sample from X is bigger than the sample from Y. However, this approach is only accurate on average. In any particular sample, the error may be large. Even if you are willing to tolerate, for example, a 5% chance of being outside your error tolerance, the required number of random draws may be large. The more accuracy you need, the worse it gets since your accuracy only improves as n-1/2 where n is the number of samples. For numerical integration methods, the error decreases as a much larger negative power of n depending on the method. In the timing study near the end of this technical report, numerical integration was 2,875 times faster than simulation for determining beta inequality probabilities to just two decimal places. For greater desired accuracy, the advantage of numerical integration would increase.
Why is efficiency important? For some applications it may not be, but often these random inequalities are evaluated in the inner loop of a simulation. A simulation study that takes hours to run may be spending most of its time evaluating random inequalities.
As derived in the previous post, the integral to evaluate is given by
![\begin{eqnarray*} P(X \geq Y) &=& \int \!\int _{[x > y]} f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty \! \int_{-\infty}^x f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty f_X(x) F_Y(x) \, dx \end{eqnarray*}](http://www.johndcook.com/ineq_integral.gif)
The technical report Numerical computation of stochastic inequality probabilities gives some general considerations for numerically evaluating random inequalities, but each specific distribution family must be considered individually since each may present unique numerical challenges. The report gives some specific techniques for beta and Weibull distributions. The report Numerical evaluation of gamma inequalities gives some techniques for gamma distributions.
Numerical integration can be complicated by singularities in integrand for extreme values of distribution parameters. For example, if X ~ beta(a, b) the integral for computing P(X > Y) has a singularity if either a or b are less than 1. For general advice on numerically integrating singular integrands, see “The care and treatment of singularities” in Numerical Methods that Work by Forman S. Acton.
Once you come up with a clever numerical algorithm for evaluating a random inequality, you need to have a way to test the results.
One way to test numerical algorithms for random inequalities is to compare simulation results to integration results. This will find blatant errors, but it may not be as effective in uncovering accuracy problems. It helps to use random parameters for testing.
Another way to test numerical algorithms in this context is to compute both P(X > Y) and P(Y > X). For continuous random variables, these two values should add to 1. Of course such a condition is not sufficient to guarantee accuracy, but it is a simple and effective test. The Inequality Calculator software reports both P(X > Y) and P(Y > X) partly for the convenience of the user but also as a form of quality control: if the two probabilities do not add up to approximately 1, you know something has gone wrong. Along these lines, it’s useful to verify that P(X > Y) = 1/2 if X and Y are identically distributed.
Finally, integrands may have closed-form solutions for special parameters. For example, Exact calculation of beta inequalities gives closed-form results for many special cases of the parameters. As long as the algorithm being tested does not depend on these special values, these special values provide valuable test cases.
{ 0 comments }
My previous post introduced random inequalities and their application to Bayesian clinical trials. This post will discuss how to evaluate random inequalities analytically. The next post in the series will discuss numerical evaluation when analytical evaluation is not possible.
For independent random variables X and Y, how would you compute P(X>Y), the probability that a sample from X will be larger than a sample from Y? Let fX be the probability density function (PDF) of X and let FX be the cumulative distribution function (CDF) of X. Define fY and FY similarly. Then the probability P(X > Y) is the integral of fX(x) fY(y) over the part of the x-y plane below the diagonal line x = y.
![\begin{eqnarray*} P(X \geq Y) &=& \int \!\int _{[x > y]} f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty \! \int_{-\infty}^x f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty f_X(x) F_Y(x) \, dx \end{eqnarray*}](http://www.johndcook.com/ineq_integral.gif)
This result makes intuitive sense: fX(x) is the density for x and FY(x) is the probability that Y is less than x. Sometimes this integral can be evaluated analytically, though in general it must be evaluated numerically. The technical report Numerical computation of stochastic inequality probabilities explains how P(X > Y) can be computed in closed form for several common distribution families as well as how to evaluate inequalities involving other distributions numerically.
Exponential: If X and Y are exponential random variables with mean μX and μY respectively, then P(X > Y) = μX/(μX + μY).
Normal: If X and Y are normal random variables with mean and standard deviation (μX, σX) and (μY, σY) respectively, then P(X > Y) = Φ((μX - μY)/√(σX2 + σY2)) where Φ is the CDF of a standard normal distribution.
Gamma: If X and Y are gamma random variables with shape and scale (αX, βX) and (αY, βY) respectively, then P(X > Y) = Ix(βX/(βX + βY)) where Ix is the incomplete beta function with parameters αY and αX, i.e. the CDF of a beta distribution with parameters αY and αX.
The inequality P(X > Y) where X and Y are beta random variables comes up very often in applications. This inequality cannot be computed in closed form in general, though there are closed-form solutions for special values of the beta parameters. If X ~ beta(a, b) and Y ~ beta(c, d), the probability P(X > Y) can be evaluated in closed form if (1) one of the parameters a, b, c, or d is an integer, (2) a + b + c + d = 1, or (3) a + b = c + d = 1. These last two cases can be combined with a recurrence relation to compute other probabilities. See Exact calculation of beta inequalities for more details.
Sometimes you need to calculate P(X > max(Y, Z)) for three independent random variables. This comes up, for example, when computing adaptive randomization probabilities for a three-arm clinical trial. For a time-to-event trial as implemented here, the random variables have a gamma distribution. See Numerical evaluation of gamma inequalities for analytical as well as numerical results for computing P(X > max(Y, Z)) in that case.
The next post in this series will discuss how to evaluate random inequalities numerically when closed-form integration is not possible.
Update: See Part IV of this series for results with the Cauchy distribution.
{ 0 comments }
Many Bayesian clinical trial methods have at their core a random inequality. Some examples from M. D. Anderson: adaptive randomization, binary safety monitoring, time-to-event safety monitoring. These method depends critically on evaluating P(X > Y) where X and Y are independent random variables. Roughly speaking, P(X > Y) is the probability that the treatment represented by X is better than the treatment represented by Y. In a trial with binary outcomes, X and Y may be the posterior probabilities of response on each treatment. In a trial with time-to-event outcomes, X and Y may be posterior probabilities of median survival time on two treatments.
People often have a little difficulty understanding what P(X > Y) means. What does it mean? If we take a sample from X and a random sample from Y, P(X >Y) is the probability that the former is larger than the latter. Most confusion around random inequalities comes from thinking of random variables as constants, not random quantities. Here are a couple examples.
First, suppose X and Y have normal distributions with standard deviation 1. If X has mean 4 and Y has mean 3, what is P(X > Y)? Some would say 1, because X is bigger than Y. But that’s not true. X has a larger mean than Y, but fairly often a sample from Y will be larger than a sample from X. P(X > Y) = 0.76 in this case.
Next, suppose X and Y are identically distributed. Now what is P(X > Y)? I’ve heard people say zero because the two random variables are equal. But they’re not equal. Their distribution functions are equal but the two random variables are independent. P(X > y) = 1/2 by symmetry.
I believe there’s a psychological tendency to underestimate large inequality probabilities. (I’ve had several discussions with people who would not believe a reported inequality probability until they computed it themselves. These discussions are important when the decision whether to continue a clinical trial hinges on the result.) For example, suppose X and Y represent the probability of success in a trial in which there were 17 successes out of 30 on X and 12 successes out of 30 on Y. Using a beta distribution model, the density functions of X and Y are given below.

The density function for X is essentially the same as Y but shifted to the right. Clearly P(X > Y) is greater than 1/2. But how much greater than a half? You might think not too much since there’s a lot of mass in the overlap of the two densities. But P(X > Y) is a little more than 0.9.
The image above and the numerical results mentioned in this post were produced by the Inequality Calculator software.
Part II will discuss analytically evaluating random inequalities. Part III will discuss numerically evaluating random inequalities.
{ 2 comments }
Many well-known probability distributions converge to the normal distribution as some parameter or other increases. In a sense this is not very interesting: All roads lead to Rome. But though destinations are the same, the paths to the destination are varied and more interesting.
I’ve posted notes on how the error in the normal approximation varies for the beta, gamma, and Student t distributions.
The animation below shows the error in the normal approximation to the gamma distribution as the shape parameter grows from 3 to 23. See the gamma notes for more details.

(If the image above is not animated in your browser, visit the gamma notes page where the image should display correctly in all browsers.)
{ 0 comments }
It appears that someone has found a flaw in Xian-Jin Li’s proposed proof of the Riemann hypothesis according to the Not Even Wrong blog. (Hat tip: Ars Mathematica)
This doesn’t mean that all is lost. Andrew Wiles’ first attempt at proving Fermat’s Last Theorem was flawed, but he fixed it. Perhaps Li can patch his proof. If not, he may be able to salvage a proof of something valuable short of the Riemann hypothesis.
The news of Li’s proof and its refutation underscores a point I made in an earlier post about proofs of false statements. Namely, “… in mainstream areas of math, blunders are usually uncovered very quickly.” The Riemann hypothesis is very much in the mainstream, and it looks like a blunder was uncovered within 24 hours.
{ 1 comment }
Xian-Jin Li claims to have proven the Riemann hypothesis, one of the most famous open problems in math. His paper is posted arXiv.
The Riemann hypothesis is no obscure conjecture. It’s a natural question and central to number theory. For years mathematicians have been proving theorems of the form “If the Riemann hypothesis is true, then …” and so if Li’s result is correct, many other new results follow. Also, if the proof holds up, Li wins a $1,000,000 prize from the Clay Institute for solving one of the Millennium Problems.
Hat tip: Isabel Lugo
Update: Looks like there’s a flaw in the proof.
{ 0 comments }
In an introductory probability class, the expected value of a random variable X is defined as
![]()
where fXis the probability density function of X. I’ll call this the analytical definition.
In a more advanced class the expected value of X is defined as
![]()
where (Ω, P) is a probability space. I’ll call this the measure-theoretic definition. It’s not obvious that these two definitions are equivalent. They may even seem contradictory unless you look closely: they’re integrating different functions over different spaces.
If for some odd reason you learned the measure-theoretic definition first, you could see the analytical definition as a theorem. But if, like most people, you learn the analytical definition first, the measure-theoretic version is quite mysterious. When you take an advanced course and look at the details previously swept under the rug, probability looks like an entirely different subject, unrelated to your introductory course. The definition of expectation is just one concept among many that takes some work to resolve.
I’ve written a couple pages of notes that bridge the gap between the two definitions of expectation and show that they are equivalent.
{ 1 comment }
When I took my first probability course, it seemed like there were an infinite number of approximation theorems to learn, all mysterious. Looking back, there were probably only two or three, and they don’t need to be mysterious.
For example, under the right circumstances you can approximate a Binomial(n, p) well with a Normal(np, np(1-p)). While the relationship between the parameters in these two distributions is obvious to the initiated, it’s not at all obvious to a beginner. It seems much clearer to say that a Binomial can be approximated by a Normal with the same mean and variance. After all, a distribution that doesn’t get the mean and variance correct doesn’t sound like a very promising approximation.
Taking it a step further, a good teacher could guide a class to discover this approximation themselves. This would take more time than simply stating the result and working an example or two, but the difference in understanding would be immense. And if you’re not going to take the time to aim for understanding, what’s the point in covering approximation theorems at all? They’re not used that often for computation anymore. In my opinion, the only reason to go over them is the insight they provide.
{ 0 comments }
Sometimes a good diagram is a godsend, reducing the entropy in your head at a glace.
When I was studying integration theory, I ran across a diagram something like the following in the out-of-print book Elements of Integration by Robert G. Bartle.

Each arrow summarizes a theorem. The four corners stand for modes of convergence: almost everywhere, almost uniform, Lp, and convergence in measure. A solid line from one mode to another means that if a sequence of function converges in the first mode then it also converges in the other. A dashed line means that a subsequence converges in the second mode. For example, if a sequence of functions fnconverge to a function f, then the sequence also converges in measure to f, and a subsequence converges to f almost everywhere.
Note that this isn’t the typical mathematical diagram where the corners represent spaces and the arrows represent functions. I don’t recall seeing anything like this outside of Bartle’s book.
For a finite measure space, additional relationships hold as summarized in the diagram below.

If the sequence fn is uniformly dominated by an Lp function g, then even more relationships hold.

For more details, see modes of convergence. These notes also contain counterexamples that show no other relationships exist. Three simple counterexamples suffice to show no arrows are missing.
{ 0 comments }
