f(g(x)) versus g(f(x))

I stumbled upon a theorem today that I feel like I’ve needed in the past, though I can’t remember any particular applications. I’m writing it up here as a note to my future self should the need reappear.

The theorem gives sufficient conditions to conclude

f(g(x)) ≤ g(f(x))

and uses this to prove, for example, that

arcsin( sinh(x/2) ) ≤ sinh( arcsin(x)/2 )

on the interval [0, 1].

If you think of any applications, please leave a comment.

Here’s the theorem, found in [1].

Let f be continuous with domain 0 ≤ x < 1 or 0 ≤ x ≤ 1, f(0) = 0, f(1) > 1 (including the possibility that f(1) = +∞); let g be continuous with domain the range of f, and g(1) ≤ 1. Let f(x)/x and g(x)/x be strictly increasing on their domains. Finally let f(x) ≠ x for 0 < x < 1. Then f(g(x)) ≤ g(f(x)) for 0 < x < 1.
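
As a quick sanity check of the example above (presumably with f = arcsin and g(x) = sinh(x/2)), here is a bit of Python I added that evaluates both sides on a grid of points in (0, 1):

    import numpy as np

    # Check arcsin(sinh(x/2)) <= sinh(arcsin(x)/2) on a grid in (0, 1).
    x = np.linspace(0.01, 0.99, 99)
    lhs = np.arcsin(np.sinh(x/2))
    rhs = np.sinh(np.arcsin(x)/2)
    print(np.all(lhs <= rhs))   # expect True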

[1] Ralph P. Boas. Inequalities for a Collection. Mathematics Magazine, January 1979, Vol. 52, No. 1, pp. 28–31

Maclaurin’s inequality

This afternoon I wrote a brief post about Terence Tao’s new paper A Maclaurin type inequality. That paper builds on two classical inequalities: Newton’s inequality and Maclaurin’s inequality. The previous post expanded a bit on Newton’s inequality. This post will do the same for Maclaurin’s inequality.

As before, let x be a list of real numbers and define Sn(x) to be the average over all products of n elements from x. Maclaurin’s inequality says that Sn(x)^(1/n) is a decreasing function of n.

S_1(x) \geq S_2(x)^{1/2} \geq S_3(x)^{1/3} \geq \cdots

We can illustrate this using the Python code used in the previous post with a couple minor changes. We change the definition of ys to

   ys = [S(xs, n)**(1/n) for n in ns]

and change the label on the vertical axis accordingly.
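
For completeness, here is what the full script might look like after those changes, reusing the S function and seed from the Newton’s inequality post below; this is a sketch, since the original script isn’t reproduced in this post.

    from itertools import combinations as C
    from math import comb, prod
    from numpy import random
    import matplotlib.pyplot as plt

    def S(x, n):
        # average of all products of n elements of x
        return sum(prod(c) for c in C(x, n)) / comb(len(x), n)

    random.seed(20231010)
    N = 10
    xs = random.random(N)
    ns = range(1, N+1)
    ys = [S(xs, n)**(1/n) for n in ns]

    plt.plot(ns, ys, 'o')
    plt.xlabel(r"$n$")
    plt.ylabel(r"$S_n^{1/n}$")
    plt.show()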

Looks like a decreasing function to me.

Newton’s inequality and log concave sequences

The previous post mentioned Newton’s inequality. This post explores that inequality a little further.

Let x be a list of real numbers and define Sn(x) to be the average over all products of n elements from x. Newton’s inequality says that

S_{n-1}\, S_{n+1} \leq S_n^2

In terminology more recent than Newton’s, we say that the sequence Sn is log-concave.

The name comes from the fact that if the elements of x are positive, and hence the S’s are positive, we can take the logarithm of both sides of the inequality and have

(\log S_{n-1} + \log S_{n+1})/2 \leq \log S_n

which is the discrete analog of saying log Sk is concave as a function of k.

Let’s illustrate this with some Python code.

    from itertools import combinations as C
    from math import comb, log, prod
    from numpy import random
    import matplotlib.pyplot as plt

    def S(x, n):
        # average of all products of n elements of x
        return sum(prod(c) for c in C(x, n)) / comb(len(x), n)

    random.seed(20231010)
    N = 10
    xs = random.random(N)
    ns = range(1, N+1)
    ys = [log(S(xs, n)) for n in ns]

    plt.plot(ns, ys, 'o')
    plt.xlabel(r"$n$")
    plt.ylabel(r"$\log S_n$")
    plt.show()

This produces the following plot.

This plot looks nearly linear. It’s plausible that it’s concave.

Cosine similarity does not satisfy the triangle inequality

The previous post looked at cosine similarity for embeddings of words in vector spaces. Word embeddings like word2vec map words into high-dimensional vector spaces in such a way that related words correspond to vectors that are roughly parallel. Ideally the more similar the words, the smaller the angle between their corresponding vectors.

The cosine similarity of vectors x and y is given by

\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{ ||\mathbf{x} || \,||\mathbf{y} || }

If the vectors are normalized to have length 1 (which word embeddings typically are not) then cosine similarity is just the dot product. Vectors are similar when their cosine similarity is near 1 and dissimilar when their cosine similarity is near 0.

If x is similar to y and y is similar to z, is x similar to z? We could quantify this as follows.

Let cossim(x, y) be the cosine similarity between x and y.

If cossim(x, y) = 1 − ε1 and cossim(y, z) = 1 − ε2, is cossim(x, z) at least 1 − ε1 − ε2? In other words, does the complement of cosine similarity satisfy the triangle inequality?

Counterexample

The answer is no. I wrote a script to search for a counterexample by generating random points on a sphere. Here’s one of the examples it came up with.

    x = [−0.9289  −0.0013   0.3704]
    y = [−0.8257   0.3963   0.4015]
    z = [−0.6977   0.7163  −0.0091]

Let d1 = 1 − cossim(x, y), d2 = 1 − cossim(y, z), and d3 = 1 − cossim(x, z).

Then d1 = 0.0849, d2 = 0.1437, and d3 = 0.3562, so d3 > d1 + d2.
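
Here is a quick way to verify this particular example; this is my own check, not the original search script.

    import numpy as np

    def cossim(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    x = np.array([-0.9289, -0.0013,  0.3704])
    y = np.array([-0.8257,  0.3963,  0.4015])
    z = np.array([-0.6977,  0.7163, -0.0091])

    d1, d2, d3 = 1 - cossim(x, y), 1 - cossim(y, z), 1 - cossim(x, z)
    print(d1, d2, d3, d3 > d1 + d2)   # expect True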

Triangle inequality

The triangle inequality does hold if we work with the angles θ rather than their cosines. The angle θ between two vectors is the distance between those vectors viewed as points on a sphere, and the triangle inequality does hold on the sphere.

Approximate triangle inequality

If the cosine similarity between x and y is close to 1, and the cosine similarity between y and z is close to 1, then the cosine similarity between x and z is close to one, though the triangle inequality may not hold. I wrote about this before in the post Nearly parallel is nearly transitive.

I wrote in that post that the law of cosines for spherical trigonometry says

cos c = cos a cos b + sin a sin b cos γ

where γ is the angle between the arcs a and b. If cos a = 1 − ε1, and cos b = 1 − ε2, then

cos a cos b = (1 − ε1)(1 − ε2) = 1 − ε1 − ε2 + ε1 ε2

If ε1 and ε2 are small, then their product is an order of magnitude smaller. Also, the term

sin a sin b cos γ

is of the same order of magnitude. So if 1 − cossim(x, y) = ε1 and 1 − cossim(y, z) = ε2, then

1 − cossim(x, z) = ε1 + ε2 + O(ε1 ε2)

Is the triangle inequality desirable?

Cosine similarity does not satisfy the triangle inequality, but do we want a measure of similarity between words to satisfy the triangle inequality? You could argue that semantic similarity isn’t transitive. For example, lion is similar to king in that a lion is a symbol for a king. And lion is similar to house cat in that both are cats. But king and house cat are not similar. So the fact that cosine similarity does not satisfy the triangle inequality might be a feature rather than a bug.


Lehman’s inequality, circuits, and LaTeX

Let A, B, C, and D be positive numbers. Then Lehman’s inequality says

\frac{(A+B)(C+D)}{A+B+C+D} \geq \frac{AC}{A+C} + \frac{BD}{B+D}

Proof by circuit

This inequality can be proved analytically, but Lehman’s proof is interesting because he uses electrical circuits [1].

Let A, B, C, and D be the resistances of resistors arranged as in the circuit on the left.

Resistors R1 and R2 in series have a combined resistance of

R = R_1 + R_2

and the same resistors in parallel have a combined resistance of

R = \frac{R_1 R_2}{R_1 + R_2}

This means the circuit on the left has combined resistance

\frac{(A+B)(C+D)}{A+B+C+D}

The resistance of the circuit on the right is

\frac{AC}{A+C} + \frac{BD}{B+D}

Adding a short cannot increase resistance, so the resistance of the circuit on the right must be the same as or lower than that of the circuit on the left. Therefore

\frac{(A+B)(C+D)}{A+B+C+D} \geq \frac{AC}{A+C} + \frac{BD}{B+D}
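
The inequality is also easy to check numerically. Here is a small sketch I added, separate from Lehman’s argument, that tests it on random positive values:

    import numpy as np

    # Check Lehman's inequality on many random positive resistances.
    rng = np.random.default_rng(1962)   # arbitrary seed
    A, B, C, D = rng.uniform(0.1, 10, size=(4, 100_000))

    left = (A + B)*(C + D)/(A + B + C + D)    # series arms in parallel
    right = A*C/(A + C) + B*D/(B + D)         # parallel pairs in series
    print(np.all(left >= right))              # expect True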

Drawing circuits in LaTeX

I drew the circuits above using the circuitikz package in LaTeX. Here’s the code to draw the circuit on the left.

    \documentclass{article}
    
    \usepackage{tikz}
    \usepackage{circuitikz}
    
    \begin{document}
    
    \begin{figure}[h!]
      \begin{center}
        \begin{circuitikz}
          \draw (0,0)
          to[R=$B$] (0,2) 
          to[R=$A$] (0,4)
          to[short] (2,4) 
          to[R=$C$] (2,2)
          to[R=$D$] (2,0)
          to[short] (0,0);
          \draw (1,4)
          to[short] (1, 5);
          \draw (1, 0)
          to[short] (1, -1);
          %\draw (0, 2)
          %to[short] (2, 2);
        \end{circuitikz}
      \end{center}
    \end{figure}
    
    \end{document}

The code for the second circuit is the same except that the %’s are removed, uncommenting the lines that draw the short between the two parallel arms of the circuit.

[1] Alfred Lehman proves a more general result using circuits in SIAM Review, Vol. 4, No. 2 (April 1962) pp. 150–151. Fazlollah Reza gives a purely mathematical proof on pages 151 and 152 of the same journal.

Strengthen Markov’s inequality with conditional probability

Markov’s inequality is very general and hence very weak. Assume that X is a non-negative random variable, a > 0, and X has a finite expected value. Then Markov’s inequality says that

\text{P}(X > a) \leq \frac{\text{E}(X)}{a}

In [1] the author gives two refinements of Markov’s inequality which he calls Hansel and Gretel.

Hansel says

\text{P}(X > a) \leq \frac{\text{E}(X)}{a + \text{E}(X - a \mid X > a)}

and Gretel says

\text{P}(X > a) \leq \frac{\text{E}(X) - \text{E}(X \mid X \leq a)}{a - \text{E}(X \mid X \leq a)}
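
Here is a small simulation I added to compare the three bounds; the exponential distribution, the seed, and the cutoff a are my own choices, not from [1].

    import numpy as np

    rng = np.random.default_rng(2015)   # arbitrary seed
    x = rng.exponential(scale=1.0, size=1_000_000)
    a = 2.0

    actual = (x > a).mean()
    markov = x.mean() / a
    hansel = x.mean() / (a + (x[x > a] - a).mean())
    gretel = (x.mean() - x[x <= a].mean()) / (a - x[x <= a].mean())

    print(actual, markov, hansel, gretel)   # each bound should exceed actual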


[1] Joel E. Cohen. Markov’s Inequality and Chebyshev’s Inequality for Tail Probabilities: A Sharper Image. The American Statistician, Vol. 69, No. 1 (Feb 2015), pp. 5–7

Inequalities for inequality: Gini coefficient lower bounds

The Gini coefficient, a.k.a. Gini index, of a set of numbers is the average of all pairwise absolute differences divided by twice the mean. Specifically, let

{\cal X} = \{x_1, x_2, x_3, \ldots, x_n\}

Then the Gini coefficient of x is defined to be

G({\cal X}) = \frac{1}{2\mu n^2} \sum_{i=1}^n \sum_{j=1}^n |x_i - x_j|

where μ is the mean of the set. The Gini coefficient is often used in economics to measure inequalities in wealth.

Now suppose the data is divided into r disjoint groups:

{\cal X} = \bigcup_{i = 1}^r {\cal X}_i

We would like to estimate the Gini coefficient of the entire group from the Gini coefficients of each subgroup. The individual Gini coefficients alone are not enough data for the task, but if we also know the size and sum of each subgroup, we can compute lower bounds on G. The paper [1] gives five such lower bounds.

We will present the five lower bounds and see how well each does in a simulation.

Zagier’s lower bounds

Here are Zagier’s five lower bounds, listed in Theorem 1 of [1].

\begin{align*}
G({\cal X}) &\geq \sum_{i=1}^r \frac{n_i}{n} G({\cal X}_i) \\
G({\cal X}) &\geq \sum_{i=1}^r \frac{X_i}{X} G({\cal X}_i) \\
G({\cal X}) &\geq \left(\sum_{i=1}^r \sqrt{\frac{n_i}{n} \frac{X_i}{X} G({\cal X}_i)} \right)^2 \\
G({\cal X}) &\geq 1 - \left(\sum_{i=1}^r \sqrt{\frac{n_i}{n} \frac{X_i}{X} (1- G({\cal X}_i))} \right)^2 \\
G({\cal X}) &\geq G_0 + \sum_{i=1}^r \frac{n_i}{n} \frac{X_i}{X} G({\cal X}_i)
\end{align*}

Here ni is the size of the ith subgroup and Xi is the sum of the elements in the ith subgroup. Also, n is the sum of the ni and X is the sum of the Xi.

G0 is the Gini coefficient we would get if we replaced each subgroup with its mean, eliminating all variance within subgroups.

Simulation

I drew 102 samples from a uniform random variable and computed the Gini coefficient with

    def gini(x):
        n = len(x)
        mu = sum(x)/n    
        s = sum(abs(a-b) for a in x for b in x)
        return s/(2*mu*n**2)

I split the sample evenly into three subgroups. I then sorted the list of samples and divided into three even groups again.

The Gini coefficient of the entire data set was 0.3207. The Gini coefficients of the three subgroups were 0.3013, 0.2798, and 0.3603. When I divided the sorted data into three groups, the Gini coefficients were 0.3060, 0.0937, and 0.0502. The spread within each sorted group is about the same, but the group with the smallest values has a smaller mean and thus a larger Gini coefficient.
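
Here is a sketch of how the five bounds can be computed from the subgroup data, using the gini function above. This is my reconstruction, not the original script, and the last bound follows the form of the fifth inequality as stated above.

    import numpy as np

    def zagier_bounds(groups):
        # groups: a partition of the data into lists of positive numbers
        n_i = np.array([len(g) for g in groups])
        X_i = np.array([sum(g) for g in groups])
        G_i = np.array([gini(g) for g in groups])
        p, q = n_i/sum(n_i), X_i/sum(X_i)

        # G0: Gini coefficient after replacing each subgroup by its mean
        G0 = gini(np.repeat(X_i/n_i, n_i))

        return [sum(p*G_i),
                sum(q*G_i),
                sum(np.sqrt(p*q*G_i))**2,
                1 - sum(np.sqrt(p*q*(1 - G_i)))**2,
                G0 + sum(p*q*G_i)]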

When I tested Zagier’s lower bounds on the three unsorted partitions, I got estimates of

[0.3138, 0.3105, 0.3102, 0.3149, 0.1639]

for the five estimators.

When I repeated this exercise with the sorted groups, I got

[0.1499, 0.0935, 0.0933, 0.1937, 0.3207]

The bounds for the first four estimates were much better for the unsorted partition, but the last estimate was better for the sorted partition.


[1] Don Zagier. Inequalities for the Gini coefficient of composite populations. Journal of Mathematical Economics 12 (1983) 102–118.

Mahler’s inequality

I ran across a reference to Mahler the other day, not the composer Gustav Mahler but the mathematician Kurt Mahler, and looked into his work a little. A number of things have been named after Kurt Mahler, including Mahler’s inequality.

Mahler’s inequality says the geometric mean of a sum bounds the sum of the geometric means. In detail, the geometric mean of a list of n non-negative real numbers is the nth root of their product. If x and y are two vectors of length n containing non-negative components, Mahler’s inequality says

G(x + y) ≥ G(x) + G(y)

where G is the geometric mean. The left side is strictly larger than the right unless x and y are proportional, or x and y both have a zero component in the same position.
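
Here is a quick numerical illustration I added; the vectors are arbitrary.

    import numpy as np

    def G(v):
        # geometric mean: nth root of the product of the n components
        return np.prod(v)**(1/len(v))

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.5, 2.0])
    print(G(x + y), G(x) + G(y))   # the first number should be the larger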

I’m curious why this inequality is named after Mahler. The classic book Inequalities by Hardy, Littlewood, and Pólya lists the inequality but calls it Hölder’s inequality. In a footnote they note that the inequality above appears in a paper by Minkowski in 1896 (seven years before Kurt Mahler was born). Presumably the authors file the inequality under Hölder’s name because it follows easily from Hölder’s inequality.

I imagine Mahler made good use of his eponymous inequality, i.e. that the inequality became associated with him because he applied it well rather than because he discovered it.


Reversed Cauchy-Schwarz inequality

This post will state a couple forms of the Cauchy-Schwarz inequality and then present the lesser-known reverse of the Cauchy-Schwarz inequality due to Pólya and Szegö.

Cauchy-Schwarz inequality

The summation form of the Cauchy-Schwarz inequality says that

\left( \sum_{n=1}^N x_n y_n \right)^2 \leq \left(\sum_{n=1}^N x_n^2\right) \left(\sum_{n=1}^N y_n^2\right)

for sequences of real numbers xn and yn.

The integral form of the Cauchy-Schwarz inequality says that

\left( \int_E f g\,d\mu \right)^2 \leq \left(\int_E f^2\,d\mu\right) \left(\int_E g^2 \,d\mu\right)

for any two real-valued functions f and g over a measure space (E, μ) provided the integrals above are defined.

You can derive the sum form from the integral form by letting your measure space be the integers with counting measure. You can derive the integral form by applying the sum form to the integrals of simple functions and taking limits.

Flipping Cauchy-Schwarz

The Cauchy-Schwarz inequality is well known [1]. There are reversed versions of the Cauchy-Schwarz inequality that are not as well known. The most basic such reversed inequality was proved by Pólya and Szegö in 1925, and many variations on the theme have been proved ever since.

Pólya and Szegö’s inequality says

\left(\int_E f^2\,d\mu\right) \left(\int_E g^2 \,d\mu\right) \leq C \left( \int_E f g\,d\mu \right)^2

for some constant C provided f and g are bounded above and below. The constant C does not depend on the functions per se but on their upper and lower bounds. Specifically, assume

\begin{align*} 0 < m_f \leq f \leq M_f < \infty \\ 0 < m_g \leq g \leq M_g < \infty \end{align*}

Then

C = \frac{1}{4} \frac{(m+M)^2}{mM}

where

\begin{align*} m &= m_f\, m_g \\ M &= M_f\, M_g \\ \end{align*}

Sometimes you’ll see C written in the equivalent form

C = \frac{1}{4} \left( \sqrt{\frac{M}{m}} + \sqrt{\frac{m}{M}} \right)^2

This way of writing C makes it clear that the constant only depends on m and M via their ratio.

Note that if f and g are constant, then the inequality is exact. So the constant C is best possible without further assumptions.

The corresponding sum form follows immediately by using counting measure on the integers. Or in more elementary terms, by integrating step functions that have width 1.

Sum example

Let x = (2, 3, 5) and y = (9, 8, 7).

The sum of the squares in x is 38 and the sum of the squares in y is 194. The inner product of x and y is 18+24+35 = 77.

The product of the lower bounds of x and y is m = 2 × 7 = 14. The product of the upper bounds is M = 5 × 9 = 45. The constant C = 59²/(4×14×45) ≈ 1.38.

The left side of the Pólya and Szegö inequality is 38 × 194 = 7372. The right side is C × 77² ≈ 8190, and so the inequality holds.
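
Here is the same arithmetic in a few lines of Python, as a sanity check I added:

    x = [2, 3, 5]
    y = [9, 8, 7]

    lhs = sum(t*t for t in x) * sum(t*t for t in y)    # 38 * 194 = 7372
    dot = sum(a*b for a, b in zip(x, y))               # 77
    m, M = min(x)*min(y), max(x)*max(y)                # 14, 45
    C = (m + M)**2 / (4*m*M)                           # about 1.38
    print(lhs, C*dot**2)                               # 7372 <= about 8190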

Integral example

Let f(x) = 3 + cos(x) and let g(x) = 2 + sin(x). Let E be the interval [0, 2π].

The following Mathematica code shows that the left side of the Pólya and Szegö inequality is 171π² and the right side is 294 π².

The function f is bounded below by 2 and above by 4. The function g is bounded below by 1 and above by 3. So m = 2 and M = 12.

    In[1]:= f[x_] := 3 + Cos[x]
    In[2]:= g[x_] := 2 + Sin[x]
    In[3]:= Integrate[f[x]^2, {x, 0, 2 Pi}] Integrate[g[x]^2, {x, 0, 2 Pi}]
    Out[3]= 171 π²
    In[4]:= {m, M} = {2, 12};
    In[5]:= c = (m + M)^2/(4 m M);
    In[6]:= c Integrate[f[x] g[x], {x, 0, 2 Pi}]^2
    Out[6]= 294 π²


[1] The classic book on inequalities by Hardy, Littlewood, and Pólya mentions the Pólya-Szegö inequality on page 62, under “Miscellaneous theorems and examples.” Maybe Pólya was being inappropriately humble, but it’s odd that his inequality isn’t more prominent in his book.

Expected value of X and 1/X

Yesterday I blogged about an exercise in the book The Cauchy-Schwarz Master Class. This post is about another exercise from that book, exercise 5.8, which is to prove Kantorovich’s inequality.

Assume

0 < m \leq x_1 \leq x_2 \leq \cdots \leq x_n \leq M < \infty

and

p_1 + p_2 + \cdots + p_n = 1

for non-negative numbers pi.

Then

\left(\sum_{i=1}^n p_i x_i \right) \left(\sum_{i=1}^n p_i \frac{1}{x_i} \right) \leq \frac{\mu^2}{\gamma^2}

where

\mu = \frac{m+M}{2}

is the arithmetic mean of m and M and

\gamma = \sqrt{mM}

is the geometric mean of m and M.

In words, the weighted average of the x’s times the weighted average of their reciprocals is bounded by the square of the ratio of the arithmetic and geometric means of m and M.

Probability interpretation

I did a quick search on Kantorovich’s inequality, and apparently it first came up in linear programming, Kantorovich’s area of focus. But when I see it, I immediately think expectations of random variables. Maybe Kantorovich was also thinking about random variables, in the context of linear programming.

The left side of Kantorovich’s inequality is the product of the expected value of a discrete random variable X and the expected value of 1/X.

To put it another way, it’s a relationship between E[1/X] and 1/E[X],

\text{E}\left(\frac{1}{X} \right ) \leq \frac{\mu^2}{\gamma^2} \frac{1}{\text{E}(X)}

which I imagine is how it is used in practice.

I don’t recall seeing this inequality used, but it could have gone by in a blur and I didn’t pay attention. But now that I’ve thought about it, I’m more likely to notice if I see it again.

Python example

Here’s a little Python code to play with Kantorovich’s inequality, assuming the random values are uniformly distributed on [0, 1].

    from numpy import random

    # Sample uniformly from [0, 1]; the weights p_i are implicitly all 1/n.
    x = random.random(6)
    m = min(x)
    M = max(x)
    am = 0.5*(m+M)    # arithmetic mean of m and M
    gm = (m*M)**0.5   # geometric mean of m and M
    prod = x.mean() * (1/x).mean()
    bound = (am/gm)**2
    print(prod, bound)

This returned 1.2021 for the product and 1.3717 for the bound.

If we put the code above inside a loop we can plot the product and its bound to get an idea how tight the bound is typically. (The bound is perfectly tight if all the x’s are equal.) Here’s what we get.
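
Here is one way to write that loop; the number of trials and the plotting choices are mine, so this is a sketch of the sort of plot described. The dotted line is the diagonal where the bound equals the product.

    import matplotlib.pyplot as plt
    from numpy import random

    prods, bounds = [], []
    for _ in range(200):
        x = random.random(6)
        m, M = min(x), max(x)
        prods.append(x.mean() * (1/x).mean())
        bounds.append((0.5*(m + M)/(m*M)**0.5)**2)

    plt.plot(prods, bounds, 'o')
    lim = [1, max(bounds)]
    plt.plot(lim, lim, ':')   # dotted diagonal where bound == product
    plt.xlabel("product")
    plt.ylabel("bound")
    plt.show()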

All the dots are above the dotted line, so we haven’t found an exception to our inequality.

(I didn’t think that Kantorovich had made a mistake. If he had, someone would have noticed by now. But it’s worth testing a theorem you know to be true, in order to test that your understanding of the theorem is correct.)
