Bounding the 3rd moment by the 4th moment

For a random variable X, the kth moment of X is the expected value of X^k.

For any random variable X with zero or negative mean, there’s an inequality that bounds the 3rd moment, m3, in terms of the 4th moment, m4:

m_3 \leq \left(\frac{4}{27} \right )^{1/4} m_4^{3/4}

The following example shows that this bound is the best possible.

Define

u = (1 – √3)/√2
v = (1 + √3)/√2
p = (3 + √3)/6

and suppose X = u with probability p and X = v with probability q = 1 – p. Then X has mean 0, and you can show that exact equality holds in the inequality above.

Here’s some Mathematica code to verify this claim.

    u = (1 - Sqrt[3])/Sqrt[2]
    v = (1 + Sqrt[3])/Sqrt[2]
    p = (Sqrt[3] + 1)/(2 Sqrt[3])
    q = 1 - p
    m3 = p u^3 + q v^3
    m4 = p u^4 + q v^4
    Simplify[ (4/27)^(1/4) m4^(3/4) / m3 ]

As hoped, the code returns 1.
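
As a quick check of the inequality itself (not part of the original argument), here’s a bit of Python that draws a large zero-mean sample from a skewed distribution and compares the sample third moment to the bound. The centered exponential is an arbitrary choice; its third central moment is 2, while the bound evaluates to about 3.2, so the inequality holds but is far from tight for this distribution.

    import numpy as np

    rng = np.random.default_rng(12345)          # arbitrary seed
    x = rng.exponential(1.0, 1_000_000) - 1.0   # exponential sample shifted to have mean 0

    m3 = np.mean(x**3)                          # sample 3rd moment
    m4 = np.mean(x**4)                          # sample 4th moment
    print(m3, (4/27)**0.25 * m4**0.75)          # the first number should be smaller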

Reference: Iosif Pinelis, Relations Between the First Four Moments, American Mathematical Monthly, Vol. 122, No. 5, pp. 479–481.

Gaussian correlation inequality

The Gaussian correlation inequality was proven in 2014, but the proof only became widely known this year. You can find Thomas Royen’s remarkably short proof here.

Let X be a multivariate Gaussian random variable with mean zero and let E and F be two symmetric convex sets, both centered at the origin. The Gaussian correlation inequality says that

Prob(X in E and F) ≥ Prob(X in E) Prob(X in F).

Here’s a bit of Python code for illustrating the inequality. For symmetric convex sets we take p-norm balls of radius r, where p ≥ 1 and r > 0. We could, for example, set one of the values of p to 1 to get a diamond-shaped ball (an octahedron in three dimensions) and set the other p to 2 to get a Euclidean ball.

    from scipy.stats import norm as gaussian

    def pnorm(v, p):
        # p-norm of the vector v
        return sum( abs(x)**p for x in v )**(1./p)

    def simulate(dim, r1, p1, r2, p2, numReps):
        # Estimate P(X in both balls) and compare with the product of the individual probabilities
        count_1, count_2, count_both = (0, 0, 0)
        for _ in range(numReps):
            x = gaussian.rvs(0, 1, dim)   # standard Gaussian sample in dim dimensions
            in_1 = (pnorm(x, p1) < r1)    # inside the first ball?
            in_2 = (pnorm(x, p2) < r2)    # inside the second ball?
            if in_1:
                count_1 += 1
            if in_2:
                count_2 += 1
            if in_1 and in_2:
                count_both += 1
        print("Prob in both:", count_both / numReps)
        print("Lower bound: ", count_1*count_2 * numReps**-2)

    simulate(3, 1, 2, 1, 1, 1000)

When numReps is large, we expect the simulated probability of the intersection to be greater than the simulated lower bound. In the example above, the former was 0.075 and the latter 0.015075, ordered as we’d expect.

If we didn’t know that the theorem had been proven, we could use code like this to try to find counterexamples. Of course a simulation cannot prove or disprove a theorem, but if we found what appeared to be a counterexample, we could see whether it persists with different random number generation seeds and with a large value of numReps. If so, then we could try to establish the inequality analytically. Now that the theorem has been proven, we know we’re not going to find real counterexamples, so the code is useful only as an illustration.


Truncated exponential series inequality

Define Tn to be the Taylor series for exp(x) truncated after the x^n term:

T_n(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^n}{n!}

How does this function compare to its limit, exp(x)? We might want to know because it’s often useful to have polynomial upper or lower bounds on exp(x).

For x > 0 it’s clear that exp(x) is larger than Tn(x) since the discarded terms in the power series for exp(x) are all positive.

The case of x < 0 is more interesting. There exp(x) > Tn(x) if n is odd and exp(x) < Tn(x) if n is even.

Define fn(x) = exp(x) – Tn(x). If x > 0 then fn(x) > 0.

We want to show that if x < 0 then fn(x) > 0 for odd n and fn(x) < 0 for even n.

For n = 1, note that f1 and its derivative are both zero at 0. Now suppose f1 is zero at some point a < 0. Then by Rolle’s theorem, there is some point b with a < b < 0 where the derivative of f1 is 0. Since the derivative of f1 is also zero at 0, there must be some point c with b < c < 0 where the second derivative of f1 is 0, again by Rolle’s theorem. But the second derivative of f1 is exp(x), which is never 0. So our assumption f1(a) = 0 leads to a contradiction.

Now  f1(0) = 0 and f1(x) ≠ 0 for x < 0. So f1(x) must be always positive or always negative. Which is it? For negative x, exp(x) is bounded and so

f1(x)  = exp(x) – 1 – x

is eventually dominated by the –x term, which is positive since x is negative.

The proof for n = 2 is similar. If f2(x) is zero at some point a < 0, then we can use Rolle’s theorem to find a point b < 0 where the derivative of f2 is zero, and a point c < 0 where the second derivative is zero, and a point d < 0 where the third derivative is zero. But the third derivative of f2 is exp(x) which is never zero.

As before the contradiction shows  f2(x) ≠ 0 for x < 0. So is  f2(x) always positive or always negative? This time we have

f2(x) = exp(x) – 1 – x – x²/2

which is eventually dominated by the –x²/2 term, which is negative.

For general n, we assume fn is zero at some point x < 0 and apply Rolle’s theorem n+1 times to reach the contradiction that exp(x) is zero somewhere. This tells us that fn(x) is never zero for negative x. We then look at the dominant term –x^n/n! to argue that fn is positive or negative depending on whether n is odd or even.

Another way to show the sign of  fn(x) for negative x would be to apply the alternating series theorem to x = -1.
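
As a quick numerical sanity check (not part of the proof), here’s a bit of Python that evaluates fn(x) = exp(x) – Tn(x) at a few negative values of x. The sign should be positive for odd n and negative for even n.

    from math import exp, factorial

    def f(n, x):
        # f_n(x) = exp(x) minus the Taylor polynomial of exp truncated at the x^n term
        T_n = sum(x**k / factorial(k) for k in range(n + 1))
        return exp(x) - T_n

    for n in range(1, 5):
        print(n, [f(n, x) > 0 for x in (-0.5, -2.0, -10.0)])
    # expect all True for odd n, all False for even n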

Improving on Chebyshev’s inequality

Chebyshev’s inequality says that the probability of a random variable being more than k standard deviations away from its mean is at most 1/k². In symbols,

P(|X - E(X)| \geq k\sigma) \leq \frac{1}{k^2}

This inequality is very general, but also very weak. It assumes very little about the random variable X but it also gives a loose bound. If we assume slightly more, namely that X has a unimodal distribution, then we have a tighter bound, the Vysochanskiĭ-Petunin inequality.

P(|X - E(X)| \geq k\sigma) \leq \frac{4}{9k^2}

However, the Vysochanskiĭ-Petunin inequality does require that k be larger than √(8/3). In exchange for the assumption of unimodality and the restriction on k we get to reduce our upper bound by more than half.

While tighter than Chebyshev’s inequality, the stronger inequality is still very general. We can usually do much better if we can say more about the distribution family. For example, suppose X has a uniform distribution. What is the probability that X is more than two standard deviations from its mean? Zero, because two standard deviations puts you outside the interval the uniform is defined on!

Among familiar distributions, when is the Vysochanskiĭ-Petunin inequality most accurate? That depends, of course, on what distributions you consider familiar, and what value of k you use. Let’s look at normal, exponential, and Pareto. These were chosen because they have thin, medium, and thick tails. We’ll also throw in the double exponential, because it has the same tail thickness as exponential but is symmetric. We’ll let k be 2 and 3.

    Distribution family    P(|X – E(X)| > 2σ)    V-P bound    P(|X – E(X)| > 3σ)    V-P bound
    Uniform                0.0000                0.1111       0.0000                0.0494
    Normal                 0.0455                0.1111       0.0027                0.0494
    Exponential            0.0498                0.1111       0.0183                0.0494
    Pareto                 0.0277                0.1111       0.0156                0.0494
    Double exponential     0.0591                0.1111       0.0144                0.0494

A normal random variable is more than 2 standard deviations away from its mean with probability 0.0455, compared to the Vysochanskiĭ-Petunin bound of 1/9 = 0.1111. A normal random variable is more than 3 standard deviations away from its mean with probability 0.0027, compared to the bound of 4/81 = 0.0494.

An exponential random variable with mean μ also has standard deviation μ, so the only way it could be more than 2μ from its mean is to be 3μ from 0. So an exponential is more than 2 standard deviations from its mean with probability exp(-3) = 0.0498, and more than 3 standard deviations with probability exp(-4) = 0.0183.
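
Here’s a short Python check (not from the original analysis) of a few of these values. It computes the exact two-sided tail probability P(|X – E(X)| > kσ) with SciPy and compares it to the Vysochanskiĭ-Petunin bound 4/(9k²).

    from scipy.stats import norm, expon, laplace

    def tail_prob(dist, k):
        # P(|X - E(X)| > k*sigma) for a scipy.stats distribution
        mu, sigma = dist.mean(), dist.std()
        return dist.sf(mu + k*sigma) + dist.cdf(mu - k*sigma)

    for k in (2, 3):
        print("k =", k, "  V-P bound:", round(4/(9*k**2), 4))
        for name, dist in [("normal", norm()),
                           ("exponential", expon()),
                           ("double exponential", laplace())]:
            print("   ", name, round(tail_prob(dist, k), 4))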

We’ll set the minimum value of our Pareto random variable to be 1. As with the exponential, the Pareto cannot be 2 standard deviations less than its mean, so we look at the probability of it being more than 2 standard deviations greater than its mean. The shape parameter α must be bigger than 2 for the variance to exist. The probability of our random variable being more than k standard deviations away from its mean works out to ((α – 1)/((k + 1)α))^α and is largest as α converges down toward 2. The limiting values for k equal to 2 and 3 are 1/36 = 0.0277 and 1/64 = 0.0156 respectively. Even in this limit the Pareto doesn’t come very close to the Vysochanskiĭ-Petunin bounds.

The double exponential, also known as Laplace, has the highest probability of any of our examples of being two standard deviations from its mean, though this probability is still only a little more than half of the Vysochanskiĭ-Petunin bound. The exponential has the highest probability of being three standard deviations from its mean, still less than two-fifths of the Vysochanskiĭ-Petunin bound.

Generic bounds are useful, especially in theoretical calculations, but it’s usually possible to do much better with specific distributions.


Mortgages, banks, and Jensen’s inequality

Sam Savage’s new book The Flaw of Averages has a brilliantly simple explanation of why volatility in the housing market caused such problems for banks recently. When housing prices drop, more people default on their mortgages and obviously that hurts banks. But banks are also in trouble when the housing market is volatile, even if on average house prices are good.

Suppose there’s no change in the housing market on average. Prices go up in some areas and down in other areas. As long as the ups and downs average out, there should be no change in bank profits, right? Wrong!

When the housing market goes up a little bit, mortgage defaults drop a little bit and bank profits go up a little bit. But when the market goes down a little bit, defaults go up more than just a little bit, and bank profits go down more than a little bit. There is much more down-side potential than up-side potential. Say 95% of homeowners pay their mortgages. Then a good housing market can only improve repayments by 5%. But a bad housing market could decrease repayments by much more.

In mathematical terminology, the bank profits are a concave function of house prices. Jensen’s inequality says that if f() is a concave function (say bank profits) and X is a random variable (say, house prices) then the average of f(X) is less than f(average of X). Average profit is less than the profit from the average.
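
Here’s a toy numerical illustration in Python. The repayment function below is made up for the sake of the example (it caps repayment at 95%, so there’s little upside and plenty of downside); the point is only that for a concave f, averaging over equally likely up and down scenarios gives less than f evaluated at the average.

    def repayment_rate(price_change):
        # hypothetical concave model: repayment can't exceed 95%
        return min(0.95, 0.95 + 0.5 * price_change)

    up, down = 0.10, -0.10   # prices rise or fall 10%, equally likely
    avg_of_f = 0.5 * repayment_rate(up) + 0.5 * repayment_rate(down)
    f_of_avg = repayment_rate(0.5 * up + 0.5 * down)
    print(avg_of_f, f_of_avg)   # 0.925 < 0.95: average of f(X) < f(average of X)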


Three books on inequalities

The classic book on inequalities is Inequalities by Hardy, Littlewood, and Pólya, first published in 1934 (ISBN 0521358809). I’ve mentioned it a few times before. The book is quite thorough. I imagine more people use it as a reference than read it cover to cover.

The Cauchy-Schwarz Master Class by J. Michael Steele (ISBN 052154677X) is a more recent book on inequalities, published in 2004. The preface explains what the title means by a “master class.”

In the fine arts, a master class is a small class where students and coaches work together to support a high level of technical and creative excellence. This book tries to capture the spirit of a master class while providing coaching for readers who want to refine their skills as solvers of problems, especially those problems dealing with mathematical inequalities.

The Cauchy-Schwarz Master Class book lives up to its stated aim. Reading the book and working through the exercises reminds me of taking music lessons. The emphasis is on learning techniques rather than cataloging specific results.

The final book I’ll mention is When Less is More: Visualizing Basic Inequalities by Claudi Alsina and Roger B. Nelsen (ISBN 0883853426). Like The Cauchy-Schwarz Master Class, this book emphasizes problem solving technique. More specifically, it emphasizes geometric techniques for understanding and proving inequalities. (I’ve written a review of When Less is More, but unfortunately you have to log in as an MAA member to read it.) When Less is More is as concerned with beauty as with truth.

You can tell from the cover art and the architectural allusion in the title that this book cares about aesthetics. Each proof is a polished gem.


Means and inequalities for functions

A post on Monday looked at means and inequalities for lists of non-negative numbers. This post looks at analogous means and inequalities for non-negative functions. We go from means defined in terms of sums to means defined in terms of integrals.

Let p(x) be a non-negative function that integrates to 1. (That makes p(x) a probability density, but you don’t have to think of this in terms of probability.) Think of p(x) as being a weighting function. The dependence on p(x) will be implicit. For a non-negative function f(x) and real number r ≠ 0, define Mr( f ) by

M_r(f) = \left(\int p(x) \left( f(x) \right)^r \, dx \right)^{1/r}

If r > 0 and the integral defining Mr( f ) diverges, we say Mr( f ) = ∞. If r < 0 and the integral diverges, we say Mr( f ) = 0.

Define the geometric mean of the function f by

G(f) = \exp\left( \int p(x) \log f(x)\, dx \right)

There are a few special cases to consider when defining G(f). See Inequalities for details.

First we give several limiting theorems.

  • As r → –∞, Mr( f ) → min( f )
  • As r → +∞, Mr( f ) → max( f )
  • As r → 0+, Mr( f ) → G( f )

And now for the big theorem: If r ≤ s, then Mr( f ) ≤ Ms( f ). The conditions under which equality holds are a little complicated. Again, see Inequalities for details.

We could derive analogous results for infinite sums since sums are just a special case of integrals.

The assumption that the weight function p(x) has a finite integral is critical. We could change the definition of Mr( f ) slightly to accommodate the case that the integral of p(x) is finite but not equal to 1, and all the conclusions above would remain true. But if we allowed p(x) to have a divergent integral, the theorems would not hold. For example, suppose p(x) is constantly 1 and our region of integration is (0, ∞). Then Mr( f ) might be more or less than Ms( f ) depending on f. Let f(x) = b exp( –bx ) for some b > 0. Then M1( f ) = 1 but M∞( f ) = b, so M1( f ) is greater than M∞( f ) when b < 1 and less than M∞( f ) when b > 1.
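
Here’s a small numerical sketch in Python (not from the original post). It computes Mr( f ) by numerical integration over (0, ∞), using the standard exponential density as the weight p(x) and an arbitrary positive function f; the computed means should increase with r.

    from math import exp
    from scipy.integrate import quad

    def M(r, f, p):
        # M_r(f) = (integral of p(x) f(x)^r dx)^(1/r) over (0, infinity), for r nonzero
        integral, _ = quad(lambda x: p(x) * f(x)**r, 0, float("inf"))
        return integral**(1/r)

    p = lambda x: exp(-x)    # weight: a probability density on (0, infinity)
    f = lambda x: 1 + x      # an arbitrary positive function

    for r in (-1, 1, 2, 3):
        print(r, M(r, f, p))   # the values should increase with r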

Means and inequalities

The arithmetic mean of two numbers a and b is (a + b)/2.

The geometric mean of a and b is √(ab).

The harmonic mean of a and b is 2/(1/a + 1/b).

This post will generalize these definitions of means and state a general inequality relating the generalized means.

Let x be a vector of non-negative real numbers, x = (x1, x2, x3…, xn). Define Mr( x ) to be

M_r(x) = \left( \frac{1}{n} \sum_{i = 1}^n x_i^r \right)^{1/r}

unless r = 0 or r is negative and one of the xi is zero. If r = 0, define Mr( x ) to be the limit of Mr( x ) as r decreases to 0. And if r is negative and one of the xi is zero, define Mr( x ) to be zero. The arithmetic, geometric, and harmonic means correspond to M1, M0, and M-1 respectively.

Define M∞( x ) to be the limit of Mr( x ) as r goes to ∞. Similarly, define M-∞( x ) to be the limit of Mr( x ) as r goes to –∞. Then M∞( x ) equals max(x1, x2, x3…, xn) and M-∞( x ) equals min(x1, x2, x3…, xn).

In summary, the minimum, harmonic mean, geometric mean, arithmetic mean and maximum are all special cases of Mr( x ) corresponding to r = –∞, –1, 0, 1, and ∞ respectively. Of course other values of r are possible; these five are just the most familiar. Another common example is the root-mean-square (RMS) corresponding to r = 2.

A famous theorem says that the geometric mean is never greater than the arithmetic mean. This is a very special case of the following theorem.

If r ≤ s then Mr( x ) ≤ Ms( x ).

In fact we can say a little more. If r < s then Mr( x ) < Ms( x ) unless x1 = x2 = x3 = … = xn or s ≤ 0 and one of the xi is zero.

We could generalize the means Mr a bit more by introducing positive weights pi such that p1 + p2 + p3 + … + pn = 1. We could then define Mr( x ) as

M_r(x) = \left( \sum_{i = 1}^n p_i x_i^r \right)^{1/r}

with the same fine print as in the previous definition. The new definition reduces to the earlier one when each pi = 1/n. The statements above about the means Mr( x ) continue to hold under this more general definition.
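
And a sketch of the weighted version (again not from the original post), assuming r ≠ 0 and positive weights that sum to 1:

    def weighted_power_mean(x, w, r):
        # weighted M_r with positive weights w summing to 1, r nonzero
        return sum(wi * xi**r for wi, xi in zip(w, x))**(1/r)

    x = [1, 2, 4, 8]
    w = [0.4, 0.3, 0.2, 0.1]
    print(weighted_power_mean(x, w, 1), weighted_power_mean(x, w, 2))   # M_1 <= M_2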

For more on means and inequalities, see Inequalities by Hardy, Littlewood, and Pólya.
