Strengthen Markov’s inequality with conditional probability

Markov’s inequality is very general and hence very weak. Assume that X is a non-negative random variable, a > 0, and X has a finite expected value. Then Markov’s inequality says that

\text{P}(X > a) \leq \frac{\text{E}(X)}{a}

In [1] the author gives two refinements of Markov’s inequality which he calls Hansel and Gretel.

Hansel says

\text{P}(X > a) \leq \frac{\text{E}(X)}{a + \text{E}(X - a \mid X > a)}

and Gretel says

\text{P}(X > a) \leq \frac{\text{E}(X) - \text{E}(X \mid X \leq a)}{a - \text{E}(X \mid X \leq a)}
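
To get a feel for how much the two refinements gain over Markov’s bound, here is a minimal Python sketch that evaluates all three on the empirical distribution of a simulated sample. The exponential distribution, the threshold a = 2, and the function name `tail_bounds` are my own choices for illustration and are not taken from [1].

```python
import random

def tail_bounds(xs, a):
    """Return (empirical P(X > a), Markov, Hansel, Gretel) for sample xs.

    Assumes some sample values lie above a and some lie at or below a,
    so that both conditional means are defined.
    """
    n = len(xs)
    mean = sum(xs) / n
    above = [x for x in xs if x > a]
    below = [x for x in xs if x <= a]
    p = len(above) / n                                # P(X > a)
    excess = sum(x - a for x in above) / len(above)   # E(X - a | X > a)
    low_mean = sum(below) / len(below)                # E(X | X <= a)
    markov = mean / a
    hansel = mean / (a + excess)
    gretel = (mean - low_mean) / (a - low_mean)
    return p, markov, hansel, gretel

random.seed(0)  # arbitrary seed, only for reproducibility
sample = [random.expovariate(1.0) for _ in range(10_000)]
p, markov, hansel, gretel = tail_bounds(sample, a=2.0)
print(p, markov, hansel, gretel)
```

Because the sample mean here is below a, both refinements should land between the empirical tail probability and Markov’s bound.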


[1] Joel E. Cohen. Markov’s Inequality and Chebyshev’s Inequality for Tail Probabilities: A Sharper Image. The American Statistician, Vol. 69, No. 1 (Feb 2015), pp. 5-7

Inequalities for inequality: Gini coefficient lower bounds

The Gini coefficient, a.k.a. Gini index, of a set of numbers is the average of all differences divided by twice the mean. Specifically, let

{\cal X} = \{x_1, x_2, x_3, \ldots, x_n\}

Then the Gini coefficient of the set is defined to be

G({\cal X}) = \frac{1}{2\mu n^2} \sum_{i=1}^n \sum_{j=1}^n |x_i - x_j|

where μ is the mean of the set. The Gini coefficient is often used in economics to measure inequalities in wealth.

Now suppose the data is divided into r disjoint groups:

{\cal X} = \bigcup_{i = 1}^r {\cal X}_i

We would like to estimate the Gini coefficient of the entire group from Gini coefficients of each subgroup. The individual Gini coefficients alone are not enough data for the task, but if we also know the size and sum of each subgroup, we can compute lower bounds on G. The paper [1] gives five such lower bounds.

We will present the five lower bounds and see how well each does in a simulation.

Zagier’s lower bounds

Here are Zagier’s five lower bounds, listed in Theorem 1 of [1].

\begin{align*} G({\cal X}) &\geq \sum_{i=1}^r \frac{n_i}{n} G({\cal X}_i) \\ G({\cal X}) &\geq \sum_{i=1}^r \frac{X_i}{X} G({\cal X}_i) \\ G({\cal X}) &\geq \left(\sum_{i=1}^r \sqrt{\frac{n_i}{n} \frac{X_i}{X} G({\cal X}_i)} \right)^2 \\ G({\cal X}) &\geq 1 - \left(\sum_{i=1}^r \sqrt{\frac{n_i}{n} \frac{X_i}{X} (1- G({\cal X}_i))} \right)^2 \\ G({\cal X}) &\geq G_0 + \sum_{i=1}^r \frac{n_i}{n} \frac{X_i}{X} G({\cal X}_i) \end{align*}

Here n_i is the size of the ith subgroup and X_i is the sum of the elements in the ith subgroup. Also, n is the sum of the n_i and X is the sum of the X_i.

G_0 is the Gini coefficient we would get if we replaced each subgroup with its mean, eliminating all variance within subgroups.
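
Here is a rough Python sketch of how the five bounds could be computed from a partition of the data. The function name `zagier_bounds`, the seed, and the particular way of splitting the sample are my own assumptions. The fifth bound is implemented as G_0 plus the sum of the doubly weighted subgroup coefficients, with G_0 computed as described above.

```python
import random
from math import sqrt

def gini(x):
    n = len(x)
    mu = sum(x) / n
    s = sum(abs(a - b) for a in x for b in x)
    return s / (2 * mu * n**2)

def zagier_bounds(groups):
    """Five lower bounds on the Gini coefficient of the pooled data."""
    n = sum(len(g) for g in groups)
    X = sum(sum(g) for g in groups)
    p = [len(g) / n for g in groups]      # n_i / n
    q = [sum(g) / X for g in groups]      # X_i / X
    G = [gini(g) for g in groups]
    # G_0: pooled Gini after replacing every element by its group mean
    flattened = [sum(g) / len(g) for g in groups for _ in g]
    G0 = gini(flattened)
    return [
        sum(pi * Gi for pi, Gi in zip(p, G)),
        sum(qi * Gi for qi, Gi in zip(q, G)),
        sum(sqrt(pi * qi * Gi) for pi, qi, Gi in zip(p, q, G)) ** 2,
        1 - sum(sqrt(pi * qi * (1 - Gi)) for pi, qi, Gi in zip(p, q, G)) ** 2,
        G0 + sum(pi * qi * Gi for pi, qi, Gi in zip(p, q, G)),
    ]

random.seed(1)  # arbitrary seed, only for reproducibility
data = [random.random() for _ in range(102)]
unsorted_groups = [data[0:34], data[34:68], data[68:102]]
s = sorted(data)
sorted_groups = [s[0:34], s[34:68], s[68:102]]
total = gini(data)
print(total)
print(zagier_bounds(unsorted_groups))
print(zagier_bounds(sorted_groups))
```

When the groups do not overlap, as in the sorted partition, every between-group difference has the same sign, so the fifth bound should reproduce the overall Gini coefficient up to rounding.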


I drew 102 samples from a uniform random variable and computed the Gini coefficient with

    def gini(x):
        n = len(x)
        mu = sum(x)/n    
        s = sum(abs(a-b) for a in x for b in x)
        return s/(2*mu*n**2)

I split the sample evenly into three subgroups. I then sorted the list of samples and divided into three even groups again.

The Gini coefficient of the entire data set was 0.3207. The Gini coefficients of the three subgroups were 0.3013, 0.2798, and 0.36033. When I divided the sorted data into three groups, the Gini coefficients were 0.3060, 0.0937, and 0.0502. The variation in each group is about the same, but the group containing the smallest values has a smaller mean and thus a larger Gini coefficient.

When I tested Zagier’s lower bounds on the three unsorted partitions, I got estimates of

[0.3138, 0.3105, 0.3102, 0.3149, 0.1639]

for the five estimators.

When I repeated this exercise with the sorted groups, I got

[0.1499, 0.0935, 0.0933, 0.1937, 0.3207]

The bounds for the first four estimates were much better for the unsorted partition, but the last estimate was better for the sorted partition.


[1] Don Zagier. Inequalities for the Gini coefficient of composite populations. Journal of Mathematical Economics 12 (1983) 102–118.

Mahler’s inequality

I ran across a reference to Mahler the other day, not the composer Gustav Mahler but the mathematician Kurt Mahler, and looked into his work a little. A number of things have been named after Kurt Mahler, including Mahler’s inequality.

Mahler’s inequality says the geometric mean of a sum bounds the sum of the geometric means. In detail, the geometric mean of a list of n non-negative real numbers is the nth root of their product. If x and y are two vectors of length n containing non-negative components, Mahler’s inequality says

G(x + y) ≥ G(x) + G(y)

where G is the geometric mean. The left side is strictly larger than the right unless x and y are proportional, or x and y both have a zero component in the same position.
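
A quick numerical check makes the statement concrete. The vectors below are arbitrary values I picked for illustration, and `geometric_mean` is just a helper name.

```python
from math import prod

def geometric_mean(v):
    """n-th root of the product of the n non-negative entries of v."""
    return prod(v) ** (1 / len(v))

x = [1.0, 4.0, 9.0]
y = [2.0, 3.0, 5.0]
lhs = geometric_mean([a + b for a, b in zip(x, y)])
rhs = geometric_mean(x) + geometric_mean(y)
print(lhs, rhs, lhs >= rhs)

# Proportional vectors: the two sides agree up to rounding
z = [3 * a for a in x]
lhs_prop = geometric_mean([a + b for a, b in zip(x, z)])
rhs_prop = geometric_mean(x) + geometric_mean(z)
print(lhs_prop, rhs_prop)
```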

I’m curious why this inequality is named after Mahler. The classic book Inequalities by Hardy, Littlewood, and Pólya lists the inequality but calls it Hölder’s inequality. In a footnote the authors note that the inequality above appears in a paper by Minkowski in 1896 (seven years before Kurt Mahler was born). Presumably they file the inequality under Hölder’s name because it follows easily from Hölder’s inequality.

I imagine Mahler made good use of his eponymous inequality, i.e. that the inequality became associated with him because he applied it well rather than because he discovered it.


Reversed Cauchy-Schwarz inequality

This post will state a couple forms of the Cauchy-Schwarz inequality and then present the lesser-known reverse of the Cauchy-Schwarz inequality due to Pólya and Szegö.

Cauchy-Schwarz inequality

The summation form of the Cauchy-Schwarz inequality says that

\left( \sum_{n=1}^N x_n y_n \right)^2 \leq \left(\sum_{n=1}^N x_n^2\right) \left(\sum_{n=1}^N y_n^2\right)

for sequences of real numbers x_n and y_n.

The integral form of the Cauchy-Schwarz inequality says that

\left( \int_E f g\,d\mu \right)^2 \leq \left(\int_E f^2\,d\mu\right) \left(\int_E g^2 \,d\mu\right)

for any two real-valued functions f and g over a measure space (E, μ) provided the integrals above are defined.

You can derive the sum form from the integral form by letting your measure space be the integers with counting measure. You can derive the integral form by applying the sum form to the integrals of simple functions and taking limits.

Flipping Cauchy-Schwarz

The Cauchy-Schwarz inequality is well known [1]. There are reversed versions of the Cauchy-Schwarz inequality that are not as well known. The most basic such reversed inequality was proved by Pólya and Szegö in 1925, and many variations on the theme have been proved ever since.

Pólya and Szegö’s inequality says

\left(\int_E f^2\,d\mu\right) \left(\int_E g^2 \,d\mu\right) \leq C \left( \int_E f g\,d\mu \right)^2

for some constant C provided f and g are bounded above and below. The constant C does not depend on the functions per se but on their upper and lower bounds. Specifically, assume

\begin{align*} 0 < m_f \leq f \leq M_f < \infty \\ 0 < m_g \leq g \leq M_g < \infty \end{align*}

Then

C = \frac{1}{4} \frac{(m+M)^2}{mM}

where

\begin{align*} m &= m_f\, m_g \\ M &= M_f\, M_g \\ \end{align*}

Sometimes you’ll see C written in the equivalent form

C = \frac{1}{4} \left( \sqrt{\frac{M}{m}} + \sqrt{\frac{m}{M}} \right)^2

This way of writing C makes it clear that the constant only depends on m and M via their ratio.

Note that if f and g are constant, then the inequality is exact. So the constant C is best possible without further assumptions.

The corresponding sum form follows immediately by using counting measure on the integers. Or in more elementary terms, by integrating step functions that have width 1.

Sum example

Let x = (2, 3, 5) and y = (9, 8, 7).

The sum of the squares in x is 38 and the sum of the squares in y is 194. The inner product of x and y is 18+24+35 = 77.

The product of the lower bounds on x and y is m = 2×7 = 14. The product of the upper bounds is M = 5×9 = 45. The constant C = 59²/(4×14×45) ≈ 1.38.

The left side of the Pólya and Szegö inequality is 38×194 = 7372. The right side is 1.38×77²= 8182.02, and so the inequality holds.
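
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the sum example; the variable names are mine.

```python
x = [2, 3, 5]
y = [9, 8, 7]

lhs = sum(a * a for a in x) * sum(b * b for b in y)   # 38 * 194
inner = sum(a * b for a, b in zip(x, y))              # 77
m = min(x) * min(y)                                   # product of lower bounds
M = max(x) * max(y)                                   # product of upper bounds
C = (m + M) ** 2 / (4 * m * M)
rhs = C * inner ** 2

# Cauchy-Schwarz gives the lower estimate, Polya-Szego the upper one
print(inner ** 2 <= lhs <= rhs)
print(lhs, round(C, 4), rhs)
```

Using the unrounded value of C gives a right side of about 8190, slightly above the 8182.02 quoted above, since that figure uses C rounded to 1.38.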

Integral example

Let f(x) = 3 + cos(x) and let g(x) = 2 + sin(x). Let E be the interval [0, 2π].

The following Mathematica code shows that the left side of the Pólya and Szegö inequality is 171π² and the right side is 294 π².

The function f is bounded below by 2 and above by 4. The function g is bounded below by 1 and above by 3. So m = 2 and M = 12.

    In[1]:= f[x_] := 3 + Cos[x]
    In[2]:= g[x_] := 2 + Sin[x]
    In[3]:= Integrate[f[x]^2, {x, 0, 2 Pi}] Integrate[g[x]^2, {x, 0, 2 Pi}]
    Out[3]= 171 π²
    In[4]:= {m, M} = {2, 12};
    In[5]:= c = (m + M)^2/(4 m M);
    In[6]:= c Integrate[f[x] g[x], {x, 0, 2 Pi}]^2
    Out[6]= 294 π²


[1] The classic book on inequalities by Hardy, Littlewood, and Pólya mentions the Pólya-Szegö inequality on page 62, under “Miscellaneous theorems and examples.” Maybe Pólya was being inappropriately humble, but it’s odd that his inequality isn’t more prominent in his book.

Expected value of X and 1/X

Yesterday I blogged about an exercise in the book The Cauchy-Schwarz Master Class. This post is about another exercise from that book, exercise 5.8, which is to prove Kantorovich’s inequality.

Assume that

0 < m \leq x_1 \leq x_2 \leq \cdots \leq x_n \leq M < \infty

and

p_1 + p_2 + \cdots + p_n = 1

for non-negative numbers p_i. Then

\left(\sum_{i=1}^n p_i x_i \right) \left(\sum_{i=1}^n p_i \frac{1}{x_i} \right) \leq \frac{\mu^2}{\gamma^2}

where

\mu = \frac{m+M}{2}

is the arithmetic mean of m and M and

\gamma = \sqrt{mM}

is the geometric mean of m and M.

In words, the weighted average of the x’s times the weighted average of their reciprocals is bounded by the square of the ratio of the arithmetic and geometric means of m and M, the lower and upper bounds on the x’s.

Probability interpretation

I did a quick search on Kantorovich’s inequality, and apparently it first came up in linear programming, Kantorovich’s area of focus. But when I see it, I immediately think expectations of random variables. Maybe Kantorovich was also thinking about random variables, in the context of linear programming.

The left side of Kantorovich’s inequality is the product of the expected value of a discrete random variable X and the expected value of 1/X.

To put it another way, it’s a relationship between E[1/X] and 1/E[X],

\text{E}\left(\frac{1}{X} \right ) \leq \frac{\mu^2}{\gamma^2} \frac{1}{\text{E}(X)}

which I imagine is how it is used in practice.

I don’t recall seeing this inequality used, but it could have gone by in a blur and I didn’t pay attention. But now that I’ve thought about it, I’m more likely to notice if I see it again.

Python example

Here’s a little Python code to play with Kantorovich’s inequality, assuming the random values are uniformly distributed on [0, 1].

    from numpy import random

    x = random.random(6)
    m = min(x)
    M = max(x)
    am = 0.5*(m+M)
    gm = (m*M)**0.5
    prod = x.mean() * (1/x).mean()
    bound = (am/gm)**2
    print(prod, bound)

This returned 1.2021 for the product and 1.3717 for the bound.

If we put the code above inside a loop we can plot the product and its bound to get an idea how tight the bound is typically. (The bound is perfectly tight if all the x’s are equal.) Here’s what we get.

All the dots are above the dotted line, so we haven’t found an exception to our inequality.
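
For readers who want to repeat the experiment without plotting, here is one way the loop might look using only the standard library. The number of trials, the seed, and the helper name `kantorovich_pair` are arbitrary choices of mine, and equal weights p_i = 1/n are assumed, matching the use of the mean in the code above.

```python
import random

def kantorovich_pair(x):
    """Return (product, bound) for equal weights p_i = 1/n."""
    n = len(x)
    m, M = min(x), max(x)
    prod = (sum(x) / n) * (sum(1 / v for v in x) / n)
    bound = (0.5 * (m + M)) ** 2 / (m * M)
    return prod, bound

random.seed(42)  # arbitrary seed, only for reproducibility
pairs = []
for _ in range(1000):
    # 1 - random() lies in (0, 1], which avoids dividing by zero
    x = [1 - random.random() for _ in range(6)]
    pairs.append(kantorovich_pair(x))

# Count cases where the product exceeds the bound beyond rounding error
violations = sum(1 for prod, bound in pairs if prod > bound * (1 + 1e-12))
print(violations)
```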

(I didn’t think that Kantorovich had made a mistake. If he had, someone would have noticed by now. But it’s worth testing a theorem you know to be true, in order to test that your understanding of the theorem is correct.)


The baseball inequality


There’s a theorem that’s often used and assumed to be true but rarely stated explicitly. I’m going to call it “the baseball inequality” for reasons I’ll get to shortly.

Suppose you have two lists of k positive numbers each:

n_1, n_2, n_3, \ldots, n_k

and

d_1, d_2, d_3, \ldots, d_k

Then

\min_{1 \leq i \leq k} \frac{n_i}{d_i} \leq \frac{n_1 + n_2 + n_3 + \cdots + n_k}{d_1 + d_2 + d_3 + \cdots + d_k} \leq \max_{1 \leq i \leq k} \frac{n_i}{d_i}

This says, for example, that the batting average of a baseball team is somewhere between the best individual batting average and the worst individual batting average.

The only place I can recall seeing this inequality stated is in The Cauchy-Schwarz Master Class by Michael Steele. He states the inequality in exercise 5.1 and gives it the batting average interpretation. (Update: This is known as the “mediant inequality.” Thanks to Tom in the comments for letting me know. So the thing in the middle is called the “mediant” of the fractions.)

Note that this is not the same as saying the average of a list of numbers is between the smallest and largest numbers in the list, though that’s true. The batting average of a team as a whole is not the same as the average of the individual batting averages on that team. It might happen to be, but in general it is not.

I’ll give a quick proof of the baseball inequality. I’ll only prove the first of the two inequalities. That is, I’ll prove that the minimum fraction is no greater than the ratio of the sums of numerators and denominators. Proving that the latter is no greater than the maximum fraction is completely analogous.

Also, I’ll only prove the theorem for two numerators and two denominators. Once you have proved the inequality for two numerators and denominators, you can bootstrap that to prove the inequality for three numerators and three denominators, and continue this process for any number of numbers on top and bottom.

So we start by assuming

\frac{a}{b} \leq \frac{c}{d}

Then we have

\begin{align*} \frac{a}{b} &= \frac{a\left(1 + \dfrac{d}{b} \right )}{b\left(1 + \dfrac{d}{b} \right )} \\ &= \frac{a + \dfrac{a}{b}d}{b + d} \\ &\leq \frac{a + \dfrac{c}{d}d}{b+d} \\ &= \frac{a + c}{b+d} \end{align*}
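
The inequality, and the difference between a team average and the mean of individual averages, can be illustrated with exact rational arithmetic. The hits and at-bats below are hypothetical numbers made up for this sketch.

```python
from fractions import Fraction

# Hypothetical (hits, at-bats) pairs, made up for illustration
players = [(30, 100), (45, 120), (10, 50), (62, 200)]

averages = [Fraction(h, ab) for h, ab in players]
team = Fraction(sum(h for h, _ in players), sum(ab for _, ab in players))
mean_of_averages = sum(averages) / len(averages)

# The team average lies between the worst and best individual averages
print(min(averages) <= team <= max(averages))
# The team average is generally not the mean of the individual averages
print(team, mean_of_averages)
```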


Hadamard’s upper bound on determinant

For an n by n real matrix A, Hadamard’s upper bound on determinant is

 |A|^2 \leq \prod_{i=1}^n \sum_{j=1}^n a_{ij}^2

where a_ij is the element in row i and column j. See, for example, [1].

How tight is this upper bound? To find out, let’s write a little Python code to generate random matrices and compare their determinants to Hadamard’s bounds. We’ll take the square root of both sides of Hadamard’s inequality to get an upper bound on the absolute value of the determinant.

Hadamard’s inequality is homogeneous: multiplying the matrix A by λ multiplies both sides by λ²ⁿ, or by |λ|ⁿ after taking square roots. We’ll look at the ratio of Hadamard’s bound to the exact determinant. This has the same effect as generating matrices to have a fixed determinant value, such as 1.

    from scipy.stats import norm
    from scipy.linalg import det
    import matplotlib.pyplot as plt
    import numpy as np
    # Hadamard's upper bound on determinant squared
    def hadamard(A):
        return np.prod(np.sum(A**2, axis=1))
    N = 1000
    ratios = np.empty(N)
    dim = 3
    for i in range(N):
        A = norm.rvs(size=(dim, dim))
        ratios[i] = hadamard(A)**0.5/abs(det(A))
    plt.hist(ratios, bins=int(N**0.5))

In this simulation the ratio is very often around 25 or less, but occasionally much larger, 730 in this example.


It makes sense that the ratio could be large; in theory the ratio could be infinite because the determinant could be zero. The error is frequently much smaller than the histogram might imply since a lot of small values are binned together.

I modified the code above to print quantiles and ran it again.

    print(min(ratios), max(ratios))
    qs = [0.05, 0.25, 0.5, 0.75, 0.95]
    print( [np.quantile(ratios, q) for q in qs] )

This printed

    1.0022 1624.9836
    [1.1558, 1.6450, 2.6048, 5.7189, 32.49279]

So while the maximum ratio was 1624, the ratio was less than 2.6048 half the time, and less than 5.7189 three quarters of the time.

Hadamard’s upper bound can be very inaccurate; there’s no limit on the relative error, though you could bound the absolute error in terms of the norm of the matrix. However, very often the relative error is moderately small.


[1] Courant and Hilbert, Methods of Mathematical Physics, Volume 1.

Convex function of diagonals and eigenvalues

Sam Walters posted an elegant theorem on his Twitter account this morning. The theorem follows the pattern of an equality for linear functions generalizing to an inequality for convex functions. We’ll give a little background, state the theorem, and show an example application.

Let A be a real symmetric n×n matrix, or more generally a complex n×n Hermitian matrix, with entries a_ij. Note that the diagonal elements a_ii are real numbers even if some of the other entries are complex. (A Hermitian matrix equals its conjugate transpose, which means the elements on the diagonal equal their own conjugate.)

A general theorem says that A has n eigenvalues. Denote these eigenvalues λ_1, λ_2, …, λ_n.

It is well known that the sum of the diagonal elements of A equals the sum of its eigenvalues.

\sum_{i=1}^n a_{ii} = \sum_{i=1}^n \lambda_i

We could trivially generalize this to say that for any linear function φ: R → R,

\sum_{i=1}^n \varphi(a_{ii}) = \sum_{i=1}^n \varphi({\lambda_i})

because we could pull any shifting and scaling constants out of the sum.

The theorem Sam Walters posted says that the equality above extends to an inequality if φ is convex.

\sum_{i=1}^n \varphi(a_{ii}) \leq \sum_{i=1}^n \varphi({\lambda_i})

Here’s an application of this theorem. Assume the eigenvalues of A are all positive and let φ(x) = – log(x). Then φ is convex, and

-\sum_{i=1}^n \log(a_{ii}) \leq -\sum_{i=1}^n \log({\lambda_i})

and so

\prod_{i=1}^n a_{ii} \geq \prod_{i=1}^n \lambda_i = \det(A)

i.e. the product of the diagonals of A is an upper bound on the determinant of A.
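
As a sanity check, here is a small Python sketch restricted to the 2×2 real symmetric case, where the eigenvalues have a closed form. The matrix entries are arbitrary values I chose so that the matrix is positive definite, and `eigenvalues_2x2` is a helper written for this example.

```python
from math import exp, log, sqrt

def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the real symmetric matrix [[a, b], [b, c]]."""
    mid = (a + c) / 2
    rad = sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mid - rad, mid + rad

a, b, c = 2.0, 1.0, 3.0   # arbitrary positive definite example
lam1, lam2 = eigenvalues_2x2(a, b, c)

# Compare sums of phi over the diagonal and over the eigenvalues
for name, phi in [("exp", exp), ("-log", lambda t: -log(t))]:
    diag_sum = phi(a) + phi(c)
    eig_sum = phi(lam1) + phi(lam2)
    print(name, diag_sum <= eig_sum)

# Consequence of the -log case: product of diagonal >= determinant
print(a * c, lam1 * lam2)
```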

This post illustrates two general principles:

  1. Linear equalities often generalize to convex inequalities.
  2. When you hear a new theorem about convex functions, see what it says about exp or -log.


Sum and mean inequalities move in opposite directions

It would seem that sums and means are trivially related; the mean is just the sum divided by the number of items. But when you generalize things a bit, means and sums act differently.

Let x be a list of n non-negative numbers,

x = (x_1, x_2, \ldots, x_n)

and let r > 0 [*]. Then the r-mean is defined to be

M_r(x) = \left( \frac{1}{n} \sum_{i=1}^n x_i^r \right)^{1/r}

and the r-sum is defined to be

 S_r(x) = \left( \sum_{i=1}^n x_i^r \right)^{1/r}

These definitions come from the classic book Inequalities by Hardy, Littlewood, and Pólya, except the authors use the Fraktur forms of M and S. If r = 1 we have the elementary mean and sum.

Here’s the theorem alluded to in the title of this post:

As r increases, M_r(x) increases and S_r(x) decreases.

If the components of x are not all equal, then M_r(x) is a strictly increasing function of r; if they are all equal, M_r(x) is constant. If x has at least two non-zero components, then S_r(x) is a strictly decreasing function of r; otherwise S_r(x) is constant.

The theorem holds under more general definitions of M and S, such as letting the sums be infinite and inserting weights. And indeed much of Hardy, Littlewood, and Pólya is devoted to studying variations on M and S in fine detail.

Here are log-log plots of M_r(x) and S_r(x) for x = (1, 2).

Plot of M_r and S_r

Note that both curves asymptotically approach max(x), M from below and S from above.
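
The same behavior can be seen numerically. The sketch below tabulates the r-mean and r-sum of x = (1, 2) for a handful of values of r that I picked arbitrarily; the function names are mine.

```python
def r_mean(x, r):
    """((1/n) * sum of x_i^r)^(1/r) for r > 0."""
    return (sum(v ** r for v in x) / len(x)) ** (1 / r)

def r_sum(x, r):
    """(sum of x_i^r)^(1/r) for r > 0."""
    return sum(v ** r for v in x) ** (1 / r)

x = (1, 2)
rs = [0.25, 0.5, 1, 2, 4, 8, 16]
means = [r_mean(x, r) for r in rs]
sums = [r_sum(x, r) for r in rs]

# The mean column should climb toward max(x) and the sum column fall toward it
for r, m, s in zip(rs, means, sums):
    print(r, round(m, 4), round(s, 4))
```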


[*] Note that r is only required to be greater than 0; analysis books typically focus on r ≥ 1.

The Brothers Markov

The Markov brother you’re more likely to have heard of was Andrey Markov. He was the Markov of Markov chains, the Gauss-Markov theorem, and Markov’s inequality.

Andrey had a lesser known younger brother Vladimir who was also a mathematician. Together the two of them proved what is known as the Markov Brothers’ inequality to distinguish it from (Andrey) Markov’s inequality.

For any polynomial p(x) of degree n, and for any non-negative integer k, the maximum of the kth derivative of p over the interval [-1, 1] is bounded by a constant times the maximum of p itself. The constant is a function of k and n but is otherwise independent of the particular polynomial.

In detail, the Markov Brothers’ inequality says

\max_{-1\leq x \leq 1} |p^{(k)}(x)|\,\, \leq \prod_{0 \leq j < k} \frac{n^2 - j^2}{2j+1} \,\max_{-1\leq x \leq 1}|p (x)|

Andrey proved the theorem for k = 1 and his brother Vladimir generalized it for all positive k.

The constant in the Markov Brothers’ inequality is the smallest possible because the bound is exact for Chebyshev polynomials [1].

Let’s look at an example. We’ll take the second derivative of the fifth Chebyshev polynomial.

T_5(x) = 16x⁵ – 20x³ + 5x.

The second derivative is

T_5''(x) = 320x³ – 120x.

Here are their plots:

T5 and its second derivative

The maximum of |T_5(x)| over [-1, 1] is 1 and the maximum of the absolute value of its second derivative is 200.

The product in the Markov Brothers’ inequality with n = 5 and k = 2 works out to

(25/1)(24/3) = 200

and so the bound is exact for p(x) = T_5(x).
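
The example can be checked with a short Python sketch. The constant is computed exactly with rational arithmetic, and the two maxima are estimated on a grid over [-1, 1]; the grid spacing and the function names are my own choices.

```python
from fractions import Fraction

def markov_constant(n, k):
    """Product of (n^2 - j^2)/(2j + 1) over 0 <= j < k."""
    c = Fraction(1)
    for j in range(k):
        c *= Fraction(n * n - j * j, 2 * j + 1)
    return c

def T5(x):
    return 16 * x**5 - 20 * x**3 + 5 * x

def T5_second(x):
    return 320 * x**3 - 120 * x

grid = [i / 1000 for i in range(-1000, 1001)]   # points in [-1, 1]
max_p = max(abs(T5(x)) for x in grid)
max_d2 = max(abs(T5_second(x)) for x in grid)

print(markov_constant(5, 2), max_p, max_d2)
```

Both maxima are attained at the endpoints x = ±1, which the grid includes.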


It took a while for westerners to standardize how to transliterate Russian names, so you might see Andrey written as Andrei or Markov written as Markoff.

There were even more ways to transliterate Chebyshev, including Tchebycheff, Tchebyshev, and Tschebyschow. These versions are the reason Chebyshev polynomials [1] are denoted with a capital T.


[1] There are two families of Chebyshev polynomials. When used without qualification, as in this post, “Chebyshev polynomial” typically means Chebyshev polynomial of the first kind. These are denoted T_n. Chebyshev polynomials of the second kind are denoted U_n.