Hebrew letters spotted in applied math

Math and physics use Greek letters constantly, but seldom do they use letters from any other alphabet.

The only Cyrillic letter I recall seeing in math is sha (Ш, U+0428) for the so-called Dirac comb distribution.

One Hebrew letter is commonly used in math, and that’s aleph (א, U+05D0). Aleph is used fairly often, but other Hebrew letters are much rarer. If you see any other Hebrew letter in math, it’s very likely to be one of the next three letters: beth (ב, U+05D1), gimel (ג, U+05D2), or dalet (ד, U+05D3).

To back up this claim, basic LaTeX only has a command for aleph (unsurprisingly, it’s \aleph). AMS-LaTeX adds the commands \beth, \gimel, and \daleth, but no more. Those are the only Hebrew letters you can use in LaTeX without importing a package or using XeTeX so you can use Unicode symbols.

Not only are Hebrew letters rare in math, the only area of math that uses them at all is set theory, where they are used to represent transfinite numbers.

So in short, if you see a Hebrew letter in math, it’s overwhelmingly likely to be in set theory, and it’s very likely to be aleph, or possibly beth, gimel, or dalet.

But today I was browsing through Morse and Feshbach and was very surprised to see the following on page 324.

gimel = lambda ayin + mu yod + mu yod star

I’ve never seen a Hebrew letter in applied math, and I’ve never seen ayin (ע, U+05E2) or yod (י, U+05D9) used anywhere in math.

In context, the authors had used Roman letters, Fraktur letters, and Greek letters and so they’d run out of alphabets. The entity denoted by gimel is related to a tensor the authors denoted with g, so presumably they used the Hebrew letter that sounds like “g”. But I have no idea why they chose ayin or yod.


Sum of independent but differently distributed variables

It’s well known that a binomial random variable can be approximated by a Poisson random variable, and under what circumstances the approximation is particularly good. See, for example, this post.

A binomial random variable is the sum of iid (independent, identically distributed) Bernoulli random variables. But what if the Bernoulli random variables don’t have the same distribution? That is, suppose you’re counting the number of heads seen in flipping n coins, where each coin has a potentially different probability of coming up heads. Will a Poisson approximation still work?

This post will cite three theorems on the error in approximating a sum of n independent Bernoulli random variables, each with a different probability of success pᵢ. I’ll state each theorem and very briefly discuss its advantages. The theorems can be found in [1].


For i = 1, 2, 3, …, n let Xᵢ be Bernoulli random variables with

Prob(Xᵢ = 1) = pᵢ

and let X with no subscript be their sum:

X = X₁ + X₂ + X₃ + … + Xₙ

We want to approximate the distribution of X with a Poisson distribution with parameter λ. We will measure the error in the Poisson approximation by the maximum difference between the probability mass function for X and the probability mass function for a Poisson(λ) random variable.

Sum of p’s

We consider two ways to choose λ. The first is

λ = p₁ + p₂ + p₃ + … + pₙ.

For this choice we have two different theorems that give upper bounds on the approximation error. One says that the error is bounded by the sum of the squares of the p’s

p₁² + p₂² + p₃² + … + pₙ²

and the other says it is bounded by 9 times the maximum of the p’s

9 max(p₁, p₂, p₃, …, pₙ).

The sum of squares bound will be smaller when n is small and the maximum bound will be smaller when n is large.

Sum of transformed p’s

The second way to choose λ is

λ = λ₁ + λ₂ + λ₃ + … + λₙ

where

λᵢ = -log(1 – pᵢ).

In this case the bound on the error is one half the sum of the squared λ’s:

(λ₁² + λ₂² + λ₃² + … + λₙ²)/2.

When pᵢ is small, λᵢ ≈ pᵢ. In this case the error bound for the transformed Poisson approximation will be about half the sum of squares bound given earlier.
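
Here is a small Python sketch that compares the bounds with the actual error for one set of probabilities. The list of p’s is made up for illustration, and the function names are my own. The exact distribution of X is built up one coin at a time, and the error is the largest pointwise gap between the two probability mass functions, checked over a finite range of k that I assume is long enough to contain the maximum.

```python
from math import exp, factorial, log

def bernoulli_sum_pmf(ps):
    """Exact pmf of X = X_1 + ... + X_n for independent Bernoulli(p_i)."""
    pmf = [1.0]  # empty sum: P(X = 0) = 1
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)   # this trial fails
            nxt[k + 1] += mass * p     # this trial succeeds
        pmf = nxt
    return pmf

def poisson_pmf(lam, k):
    return exp(-lam) * lam**k / factorial(k)

def max_pmf_error(ps, lam, extra=50):
    """Largest gap between the pmf of X and the Poisson(lam) pmf.

    Checks k = 0, ..., n + extra - 1; `extra` is an arbitrary cushion.
    """
    exact = bernoulli_sum_pmf(ps)
    worst = 0.0
    for k in range(len(ps) + extra):
        px = exact[k] if k < len(exact) else 0.0
        worst = max(worst, abs(px - poisson_pmf(lam, k)))
    return worst

ps = [0.01, 0.02, 0.05, 0.10, 0.03]  # hypothetical success probabilities

lam_sum = sum(ps)                         # first choice of lambda
lam_log = sum(-log(1 - p) for p in ps)    # second choice of lambda

err_sum = max_pmf_error(ps, lam_sum)
err_log = max_pmf_error(ps, lam_log)

bound_squares = sum(p * p for p in ps)
bound_max = 9 * max(ps)
bound_log = sum(log(1 - p) ** 2 for p in ps) / 2

print("lambda = sum of p's:  error", err_sum, "bounds", bound_squares, bound_max)
print("lambda = sum of logs: error", err_log, "bound ", bound_log)
```

With only five coins the sum of squares bound is much tighter than the 9 max bound, in line with the remark above about small n.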


[1] R. J. Serfling. Some Elementary Results on Poisson Approximation in a Sequence of Bernoulli Trials. SIAM Review, Vol. 20, No. 3 (July, 1978), pp. 567-579.


HAKMEM is a collection of tricks and trivia from the MIT AI lab written in 1972. I’ve mentioned HAKMEM here before. The image below came from HAKMEM item 123, explained here.

I ran across HAKMEM item 143 while writing my previous two blog posts about the arithmetic-geometric mean (AGM). This entry, due to Gene Salamin, says that

\pi \approx 2n \text{AGM}(1, 4e^{-n})

for large n. In fact, n doesn’t have to be very large. The expression on the right converges exponentially as n grows.

Here’s the Mathematica code that produced the plot above.

    Plot[{2 n ArithmeticGeometricMean[1, 4 Exp[-n]], Pi}, 
         {n, 1, 5}, PlotRange -> All]

If you have e to a large number of decimals, you could compute π efficiently by letting n be a power of 2, computing e⁻ⁿ by repeatedly squaring 1/e, and using the AGM. Computing the AGM is simple. In Python notation, iterate

    a, b = (a+b)/2, sqrt(a*b)

until a and b are equal modulo your tolerance.
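
Here is that iteration wrapped up as a runnable sketch in ordinary double precision, along with the approximation from item 143. The function names, the relative tolerance, and the iteration cap are my own choices, not anything from HAKMEM.

```python
from math import exp, pi, sqrt

def agm(a, b, rel_tol=1e-15, max_iter=100):
    """Arithmetic-geometric mean of a and b in double precision."""
    for _ in range(max_iter):
        if abs(a - b) <= rel_tol * max(abs(a), abs(b)):
            break
        a, b = (a + b) / 2, sqrt(a * b)
    return a

def hakmem_pi(n):
    """The approximation 2 n AGM(1, 4 exp(-n)) to pi."""
    return 2 * n * agm(1.0, 4 * exp(-n))

for n in (2, 5, 10, 20):
    approx = hakmem_pi(n)
    print(n, approx, abs(approx - pi))
```

Double precision runs out of digits long before n gets large, so this only illustrates how fast the approximation improves. Computing many digits of π this way would need arbitrary precision arithmetic.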

Computing logs with the AGM

In the previous post I said that Jonathan and Peter Borwein figured out how to use the rapid convergence of the AGM to compute various functions, including logarithms. This post will show how to compute logarithms using the AGM.

First I need to define the Jacobi theta functions

\begin{align*} \theta_2(q) &= \sum_{n\in\mathbb{Z}} q^{(n + 1/2)^2} \\ \theta_3(q) &= \sum_{n\in\mathbb{Z}} q^{n^2} \end{align*}

Note that if q is very small, the series above converge very quickly. We will pick q so small that θ₂(q) and θ₃(q) are trivial to compute with sufficient accuracy.

Suppose we want to compute the natural log of 10 to a thousand digits. We can use the following equation from [1].

\log(1/q) = \frac{\pi/4}{\text{AGM}\Big(\,\theta^2_2(q^4),\, \theta^2_3(q^4)\,\Big)}

We want q to be small, so we want 1/q to be large. We set q = 10⁻ⁿ to compute log 10ⁿ and divide the result by n.

We can show from the definition of the theta functions that

\begin{align*} \theta_2^2(q^4) &= 4q^2 + {\cal O}(q^{10}) \\ \theta_3^2(q^4) &= 1 + 4q^4 + {\cal O}(q^8) \\ \end{align*}

Since we want 1000 digit accuracy, we can set q = 10⁻²⁵⁰. Then θ₂²(q⁴) = 4×10⁻⁵⁰⁰ to about 2000 significant digits, while θ₃²(q⁴) = 1 with an error of about 4q⁴ = 4×10⁻¹⁰⁰⁰.

We can calculate the result we want with just 20 iterations of the AGM as the following bc code shows.

    scale = 2000  # carry 2000 decimal places

    # n iterations of the arithmetic-geometric mean of a and b
    define agm(a, b, n) {
        auto i, c, d
        for (i = 1; i <= n; i++) {
            c = (a + b)/2
            d = sqrt(a*b)
            a = c
            b = d
        }
        return a
    }

    # a(1) = arctan(1) = pi/4
    x = a(1)/(agm(4*10^-500, 1, 20)*250) - l(10) # error
    l(x)/l(10) # log_10(error)

This returns -999.03…, so we were basically able to get 1000 digits, though the last digit or two might be dodgy. Let’s try again with a smaller value of q, setting q = 10⁻³⁰⁰.

    x = a(1)/(agm(4*10^-600,1, 20)*300) - l(10); l(x)/l(10)

This returns -1199.03… and so we have about 1200 digits. We were using 2000 digit precision in our calculations, so the limiting factor was not the arithmetic but the approximation θ₃²(q⁴) = 1, whose error is about 4q⁴. Using a smaller value of q shrank that error.
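
For readers without bc handy, here is a rough Python equivalent using the standard decimal module. Everything about the packaging is my own assumption: the function names, the choice n = digits/4 + 1 so that 4q⁴ falls below the target accuracy, the 25 AGM iterations (enough for up to about 1000 digits), and computing π with the Gauss-Legendre iteration because decimal has no built-in π. The comparison value comes from decimal’s own ln method.

```python
from decimal import Decimal, localcontext

def agm(a, b, iterations):
    """Fixed number of AGM steps, at the current decimal precision."""
    for _ in range(iterations):
        a, b = (a + b) / 2, (a * b).sqrt()
    return a

def pi_gauss_legendre(iterations=12):
    """pi via the Gauss-Legendre iteration, at the current decimal precision."""
    a, b = Decimal(1), 1 / Decimal(2).sqrt()
    t, p = Decimal(1) / 4, Decimal(1)
    for _ in range(iterations):
        a_next = (a + b) / 2
        b = (a * b).sqrt()
        t -= p * (a - a_next) ** 2
        a, p = a_next, 2 * p
    return (a + b) ** 2 / (4 * t)

def ln10_via_agm(digits, iterations=25):
    """log(10) from log(1/q) = (pi/4) / AGM(theta_2^2(q^4), theta_3^2(q^4))."""
    n = digits // 4 + 1              # q = 10^(-n), so 4 q^4 < 10^(-digits)
    with localcontext() as ctx:
        ctx.prec = 2 * digits + 20   # generous working precision
        theta2_sq = 4 * Decimal(10) ** (-2 * n)   # 4 q^2, dropping O(q^10)
        theta3_sq = Decimal(1)                    # dropping the 4 q^4 term
        log_inv_q = (pi_gauss_legendre() / 4) / agm(theta2_sq, theta3_sq, iterations)
        return log_inv_q / n

def ln10_reference(digits):
    """log(10) from decimal's built-in ln, for comparison."""
    with localcontext() as ctx:
        ctx.prec = digits + 10
        return Decimal(10).ln()

error = abs(ln10_via_agm(100) - ln10_reference(100))
print(error)
```

As in the bc runs, the error is governed by the neglected 4q⁴ term in θ₃², so making q smaller is what buys more digits.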


[1] The Borwein brothers, Pi and the AGM. arXiv

The magic AGM box

Suppose you are visited by aliens from halfway across the galaxy. After asking you a lot of questions, they give you a parting gift: little black boxes that can compute

x – x²/2 + x³/3 – …

with unbelievable speed and accuracy. You say thank you and your visitors vanish.

You get back home and wonder what you can do with your black boxes. The series the boxes compute looks vaguely familiar, but you can’t remember what it is. You call up a friend and he tells you that it’s the Taylor series for log(1 + x).

OK, so now what?

Your friend tells you he can predict what the boxes will output. He tells you, for example, that if you enter 0.5 it will output 0.4054651081081644. You try it, and your friend is right. At least partly. If you ask your box to give you only 16 digits, it will give you exactly what your friend said it would. But you could ask it for a thousand digits or a million digits or a billion digits, something your friend cannot do, at least not quickly.

Then you realize your friend has things backward. The way to exploit these boxes is not to compute logs on your laptop to predict their output, but to use the boxes instead of your laptop.

So you have a way to compute logs. You can bootstrap that to compute inverse logs, i.e. exp(x). And you can bootstrap that to compute sines and cosines. You try to compute anything you can starting from logs.

Enter the AGM

The preceding story was an introduction to the AGM, the arithmetic-geometric mean. It is the limit of alternatingly taking ordinary and geometric means. More on that here. What I want to focus on here is that the AGM can be computed extremely quickly.

Each iteration in the process of computing the AGM doubles the number of correct figures in the answer. Suppose you want to compute its output to a billion decimal places, and you’ve calculated a million decimal places. You need to compute 999,000,000 more decimal places, but you’re nearly there! Ten more steps and you’ll have all billion decimal places.
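
You can watch the doubling happen with a few lines of Python. This is only a sketch: the starting values 1 and 2, the 300 digit working precision, and the function name are arbitrary choices of mine. After each step it records roughly how many decimal places the arithmetic and geometric means agree to, read off from the exponent of their difference.

```python
from decimal import Decimal, localcontext

def agm_agreement(x, y, steps, prec=300):
    """For each AGM step, roughly how many decimal places a and b agree to."""
    digits = []
    with localcontext() as ctx:
        ctx.prec = prec
        a, b = Decimal(x), Decimal(y)
        for _ in range(steps):
            a, b = (a + b) / 2, (a * b).sqrt()
            gap = abs(a - b)
            # adjusted() is the exponent of the leading digit of the gap
            digits.append(-gap.adjusted() if gap else prec)
    return digits

agreement = agm_agreement(1, 2, steps=7)
print(agreement)
```

Each entry in the printed list should be about twice the one before it.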

If you want to compute something to millions of digits, it would make sense to try to compute it in terms of the AGM. This was the research program of the brothers Jonathan and Peter Borwein. Much of this research was codified in their book Pi and the AGM. They used the AGM to compute π to crazy precision, but that wasn’t their goal per se.

Computing π was a demonstration project for a deeper agenda. While describing the work of the Borwein brothers, Richard Brent said

… theorems about π are often just the tips of “mathematical icebergs”—much of interest lies hidden beneath the surface.

The AGM of x and y equals

\frac{\pi(x+y)}{4\,K\left( \dfrac{x-y}{x+y}\right)}

where K is the “complete elliptic integral of the first kind.” You might reasonably think “Great. I’ll keep that in mind if I ever need to compute the complete elliptic integral of the first kind, whatever that is.” But K is like the log function in the story above, something that can be bootstrapped to compute other things.
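
Here is a quick numerical sanity check of that identity in Python. It assumes K takes the elliptic modulus k as its argument, so that K(k) is the integral of 1/√(1 – k² sin² t) over [0, π/2]. Some software, Mathematica included, uses the parameter m = k² instead. The midpoint rule and the number of panels are my own choices for the sketch.

```python
from math import pi, sin, sqrt

def agm(a, b, rel_tol=1e-15, max_iter=100):
    """Arithmetic-geometric mean of a and b in double precision."""
    for _ in range(max_iter):
        if abs(a - b) <= rel_tol * max(abs(a), abs(b)):
            break
        a, b = (a + b) / 2, sqrt(a * b)
    return a

def elliptic_k(k, panels=2000):
    """K(k) for modulus k, by the midpoint rule on [0, pi/2]."""
    h = (pi / 2) / panels
    return h * sum(1 / sqrt(1 - (k * sin((j + 0.5) * h)) ** 2)
                   for j in range(panels))

def agm_via_k(x, y):
    """Right side of the identity: pi (x + y) / (4 K((x - y)/(x + y)))."""
    return pi * (x + y) / (4 * elliptic_k((x - y) / (x + y)))

print(agm(3.0, 1.0), agm_via_k(3.0, 1.0))
```

The two printed numbers should agree to many digits. In practice the identity is used the other way around: the AGM is cheap, so it is the way to compute K.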

The aliens Gauss, Lagrange, and Legendre gave the Borweins the AGM black box, and the Borweins figured out how to use it to compute π, but also log, exp, cos, etc. The Borwein algorithms may not be the most efficient if you only want, say, 16 digits of precision. But as you need more precision, eventually they become the algorithms to use.

See the next post for an example using the AGM to compute logarithms to 1000 digits.

And see the one after that for a way to compute π with the AGM.


Reversed Cauchy-Schwarz inequality

This post will state a couple forms of the Cauchy-Schwarz inequality and then present the lesser-known reverse of the Cauchy-Schwarz inequality due to Pólya and Szegö.

Cauchy-Schwarz inequality

The summation form of the Cauchy-Schwarz inequality says that

\left( \sum_{n=1}^N x_n y_n \right)^2 \leq \left(\sum_{n=1}^N x_n^2\right) \left(\sum_{n=1}^N y_n^2\right)

for sequences of real numbers xₙ and yₙ.

The integral form of the Cauchy-Schwarz inequality says that

\left( \int_E f g\,d\mu \right)^2 \leq \left(\int_E f^2\,d\mu\right) \left(\int_E g^2 \,d\mu\right)

for any two real-valued functions f and g over a measure space (E, μ) provided the integrals above are defined.

You can derive the sum form from the integral form by letting your measure space be the integers with counting measure. You can derive the integral form by applying the sum form to the integrals of simple functions and taking limits.

Flipping Cauchy-Schwarz

The Cauchy-Schwarz inequality is well known [1]. There are reversed versions of the Cauchy-Schwarz inequality that are not as well known. The most basic such reversed inequality was proved by Pólya and Szegö in 1925 and many variations on the theme have been proved ever since.

Pólya and Szegö’s inequality says

\left(\int_E f^2\,d\mu\right) \left(\int_E g^2 \,d\mu\right) \leq C \left( \int_E f g\,d\mu \right)^2

for some constant C provided f and g are bounded above and below. The constant C does not depend on the functions per se but on their upper and lower bounds. Specifically, assume

\begin{align*} 0 < m_f \leq f \leq M_f < \infty \\ 0 < m_g \leq g \leq M_g < \infty \end{align*}

Then

C = \frac{1}{4} \frac{(m+M)^2}{mM}

where

\begin{align*} m &= m_f\, m_g \\ M &= M_f\, M_g \\ \end{align*}

Sometimes you’ll see C written in the equivalent form

C = \frac{1}{4} \left( \sqrt{\frac{M}{m}} + \sqrt{\frac{m}{M}} \right)^2

This way of writing C makes it clear that the constant only depends on m and M via their ratio.

Note that if f and g are constant, then the inequality is exact. So the constant C is best possible without further assumptions.

The corresponding sum form follows immediately by using counting measure on the integers. Or in more elementary terms, by integrating step functions that have width 1.

Sum example

Let x = (2, 3, 5) and y = (9, 8, 7).

The sum of the squares in x is 38 and the sum of the squares in y is 194. The inner product of x and y is 18+24+35 = 77.

The product of the lower bounds on x and y is m = 14. The product of the upper bounds is  M = 45. The constant C = 59²/(4×14×45) = 1.38.

The left side of the Pólya and Szegö inequality is 38×194 = 7372. The right side is 1.38×77²= 8182.02, and so the inequality holds.
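
The arithmetic above is easy to check in Python. This is just a sketch of the same example, with variable names of my own choosing. It uses the unrounded value of C, so the right side comes out slightly larger than the figure computed with C rounded to 1.38.

```python
def polya_szego_constant(m, M):
    """C = (m + M)^2 / (4 m M)."""
    return (m + M) ** 2 / (4 * m * M)

x = [2, 3, 5]
y = [9, 8, 7]

sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
inner = sum(a * b for a, b in zip(x, y))

m = min(x) * min(y)   # product of the lower bounds
M = max(x) * max(y)   # product of the upper bounds
C = polya_szego_constant(m, M)

lhs = sum_x2 * sum_y2
rhs = C * inner ** 2

print("Cauchy-Schwarz:", inner ** 2, "<=", lhs)
print("Polya-Szego:   ", lhs, "<=", rhs)
```

Together the two lines say that 38×194 is sandwiched between 77² and C×77².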

Integral example

Let f(x) = 3 + cos(x) and let g(x) = 2 + sin(x). Let E be the interval [0, 2π].

The function f is bounded below by 2 and above by 4. The function g is bounded below by 1 and above by 3. So m = 2 and M = 12.

The following Mathematica code shows that the left side of the Pólya and Szegö inequality is 171π² and the right side is 294π².

    In[1]:= f[x_] := 3 + Cos[x]
    In[2]:= g[x_] := 2 + Sin[x]
    In[3]:= Integrate[f[x]^2, {x, 0, 2 Pi}] Integrate[g[x]^2, {x, 0, 2 Pi}]
    Out[3]= 171 π²
    In[4]:= {m, M} = {2, 12};
    In[5]:= c = (m + M)^2/(4 m M);
    In[6]:= c Integrate[f[x] g[x], {x, 0, 2 Pi}]^2
    Out[6]= 294 π²
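
If you don’t have Mathematica, here is a numerical version of the same check in Python. It’s a sketch that swaps symbolic integration for the midpoint rule, which is essentially exact here because the integrands are low-degree trigonometric polynomials integrated over a full period. The helper name and the number of panels are my own choices.

```python
from math import cos, pi, sin

def integrate_over_period(h, panels=1000):
    """Midpoint rule for the integral of h over [0, 2 pi]."""
    step = 2 * pi / panels
    return step * sum(h((j + 0.5) * step) for j in range(panels))

def f(x):
    return 3 + cos(x)

def g(x):
    return 2 + sin(x)

int_f2 = integrate_over_period(lambda x: f(x) ** 2)
int_g2 = integrate_over_period(lambda x: g(x) ** 2)
int_fg = integrate_over_period(lambda x: f(x) * g(x))

m, M = 2 * 1, 4 * 3              # products of the lower and upper bounds
C = (m + M) ** 2 / (4 * m * M)

lhs = int_f2 * int_g2
rhs = C * int_fg ** 2

print(lhs / pi ** 2, rhs / pi ** 2)
```

The two printed values should be close to 171 and 294, matching the Mathematica output.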


[1] The classic book on inequalities by Hardy, Littlewood, and Pólya mentions the Pólya-Szegö inequality on page 62, under “Miscellaneous theorems and examples.” Maybe Pólya was being inappropriately humble, but it’s odd that his inequality isn’t more prominent in his book.

My densest books

I recently got a copy of Methods of Theoretical Physics by Morse and Feshbach. It’s a dense book, literally and metaphorically. I wondered whether it might be the densest book I own, so I weighed some of my weightier books.

I like big books, I cannot lie.

Morse and Feshbach has density 1.005 g/cm³, denser than water.

Gravitation by Misner, Thorne, and Wheeler is, appropriately, a massive book. It’s my weightiest paperback book, literally and perhaps metaphorically. But it’s not that dense, about 0.66 g/cm³. It would easily float.

The Mathematica Book by Wolfram (4th edition) is about the same weight as Gravitation, but denser, about 0.80 g/cm³. Still, it would float.

Physically Based Rendering by Pharr and Humphreys weighs in at 1.05 g/cm³. Like Morse and Feshbach, it would sink.

But the densest of my books is An Atlas of Functions by Oldham, Myland, and Spanier, coming in at 1.12 g/cm³.

The books that are denser than water were all printed on glossy paper. Apparently matte paper floats and glossy paper sinks.

Quaternion product as a matrix product

Pick a quaternion

p = p₀ + p₁i + p₂j + p₃k

and consider the function that acts on a quaternion q by multiplying it on the left by p.

If we think of q as a vector in R⁴ then this is a linear function of q, and so it can be represented by multiplication by a 4 × 4 matrix Mₚ.

It turns out

M_p = \begin{bmatrix} p_0 & {-}p_1 & {-}p_2 & {-}p_3 \\ p_1 & \phantom{-}p_0 & {-}p_3 & \phantom{-}p_2 \\ p_2 & \phantom{-}p_3 & \phantom{-}p_0 & {-}p_1 \\ p_3 & {-}p_2 & \phantom{-}p_1 & \phantom{-}p_0 \\ \end{bmatrix}

How might you remember or derive this matrix? Consider the matrix on the left below. It’s easier to see the pattern here than in Mₚ.

\begin{pmatrix} 1/1 & 1/i & 1/j & 1/k \\ i/1 & i/i & i/j & i/k \\ j/1 & j/i & j/j & j/k \\ k/1 & k/i & k/j & k/k \\ \end{pmatrix} = \begin{bmatrix} 1 & {-}i & {-}j & {-}k \\ i & \phantom{-}1 & {-}k & \phantom{-}j \\ j & \phantom{-}k & \phantom{-}1 & {-}i \\ k & {-}j & \phantom{-}i & \phantom{-}1 \\ \end{bmatrix}

You can derive Mₚ from this matrix.

Let’s look at the second row, for example. The second row of Mₚ, when multiplied by q as a column vector, produces the i component of the product.

How do you get an i term in the product? By multiplying the i component of p by the real component of q, or by multiplying the real component of p by the i component of q, or by multiplying the i/j component of p by the j component of q, or by multiplying the i/k component of p by the k component of q.

The other rows follow the same pattern. To get the x component of the product, you add up the products of the x/y term of p and the y term of q. Here x and y range over

{1, i, j, k}.

To get Mₚ from the matrix on the right, replace 1 with the real component of p, replace i with the i component of p, etc.

As a final note, notice that the off-diagonal elements of Mₚ are anti-symmetric:

mᵢⱼ = –mⱼᵢ

unless i = j.
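
Here is a short Python sketch that checks the matrix against the Hamilton product computed directly from the rules for i, j, and k. Quaternions are stored as tuples (real, i, j, k). The function names and the test values are mine.

```python
def quat_mul(p, q):
    """Hamilton product pq, with quaternions stored as (real, i, j, k)."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    return (p0*q0 - p1*q1 - p2*q2 - p3*q3,
            p0*q1 + p1*q0 + p2*q3 - p3*q2,
            p0*q2 - p1*q3 + p2*q0 + p3*q1,
            p0*q3 + p1*q2 - p2*q1 + p3*q0)

def left_mult_matrix(p):
    """The 4 x 4 matrix that represents q -> pq."""
    p0, p1, p2, p3 = p
    return [[p0, -p1, -p2, -p3],
            [p1,  p0, -p3,  p2],
            [p2,  p3,  p0, -p1],
            [p3, -p2,  p1,  p0]]

def mat_vec(m, v):
    """Matrix times column vector."""
    return tuple(sum(row[c] * v[c] for c in range(4)) for row in m)

p = (1, 2, 3, 4)   # arbitrary test quaternions
q = (5, 6, 7, 8)
Mp = left_mult_matrix(p)

print(quat_mul(p, q))
print(mat_vec(Mp, q))
```

Both lines should print the same four numbers.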

Inner product from norm

If a vector space has an inner product, it has a norm: you can define the norm of a vector to be the square root of the inner product of the vector with itself.

||v|| \equiv \langle v, v \rangle^{1/2}

You can use the defining properties of an inner product to show that

\langle v, w \rangle = \frac{1}{2}\left( || v + w ||^2 - ||v||^2 - ||w||^2 \right )

This is a form of the so-called polarization identity. It implies that you can calculate inner products if you can compute norms.

So does this mean you can define an inner product on any space that has a norm?

No, it doesn’t work that way. The polarization identity says that if you have a norm that came from an inner product then you can recover that inner product from norms.

What would go wrong if you tried to use the equation above to define an inner product on a space that doesn’t have one?

Take the plane R² with the max norm, i.e.

|| (x, y) || \equiv \max(|x|, |y|)

and define a function that takes two vectors and returns the right side of the polarization identity.

f(v, w) = \frac{1}{2}\left( || v + w ||^2 - ||v||^2 - ||w||^2 \right )

This is a well-defined function, but it’s not an inner product. An inner product is bilinear. In particular, if you multiply one of the arguments by a constant, you multiply the inner product by the same constant.

To see that f is not an inner product, let v = (1, 0) and w = (0, 1). Then f(v, w) = -1/2, but f(2v, w) is also -1/2. Multiplying the first argument by 2 did not multiply the result by 2.
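
The calculation is easy to reproduce in Python. This sketch applies the polarization formula with a norm passed in as an argument, so the same function can be tried with the max norm and, for contrast, with the ordinary Euclidean norm. The function names are my own.

```python
from math import isclose, sqrt

def max_norm(v):
    return max(abs(v[0]), abs(v[1]))

def euclidean_norm(v):
    return sqrt(v[0] ** 2 + v[1] ** 2)

def polarization(norm, v, w):
    """Right side of the polarization identity for the given norm."""
    s = (v[0] + w[0], v[1] + w[1])
    return 0.5 * (norm(s) ** 2 - norm(v) ** 2 - norm(w) ** 2)

v, w = (1, 0), (0, 1)
two_v = (2, 0)

# Max norm: doubling v leaves the value unchanged, so f is not bilinear.
print(polarization(max_norm, v, w), polarization(max_norm, two_v, w))

# Euclidean norm: the formula recovers the usual dot product.
print(polarization(euclidean_norm, (1, 2), (3, 4)))
```

With the Euclidean norm the formula returns the dot product of (1, 2) and (3, 4), up to floating point rounding, because that norm does come from an inner product.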

When we say that R² with the max norm doesn’t have an inner product, it’s not simply that we forgot to define one. We cannot define an inner product that is consistent with the norm structure.

Email subscription switchover

I’ve used Feedburner to allow people to subscribe to this blog via email. That service is going away, and so I just moved everyone over to MailerLite. I turned off Feedburner email, so nobody should get duplicate email.

Feedburner’s RSS service is still going, for now, but most RSS subscribers use my RSS feed without going through Feedburner.

If you’d like to subscribe to my monthly newsletter or to blog post notifications by email, you can do so here.