Famous constants and the Gumbel distribution

The Gumbel distribution, named after Emil Julius Gumbel (1891–1966), is important in statistics, particularly in studying the maximum of random variables. It comes up in machine learning in the so-called Gumbel-max trick. It also comes up in other applications such as in number theory.

For this post, I wanted to point out how a couple famous constants are related to the Gumbel distribution.

Gumbel distribution

The standard Gumbel distribution is most easily described by its cumulative distribution function

F(x) = exp( −exp(−x) ).

You can introduce a location parameter μ and scale parameter β the usual way, replacing x with (x − μ)/β in the CDF. The density, being the derivative of the CDF, also picks up a factor of 1/β.

Here’s a plot of the density.

Euler-Mascheroni constant γ

The Euler-Mascheroni constant γ comes up frequently in applications. Here are five posts where γ has come up.

The constant γ comes up in the context of the Gumbel distribution two ways. First, the mean of the standard Gumbel distribution is γ. Second, the entropy of a standard Gumbel distribution is γ + 1.
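Both claims are easy to check numerically. The following is a minimal Python sketch, not anything from the original post. It assumes the density exp(−x − exp(−x)) obtained by differentiating the CDF above, and it truncates the integrals to [−8, 50], outside of which the density is negligible. The helper names are made up for illustration.

```python
import math

def simpson(g, a, b, n=20000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def gumbel_pdf(x):
    # Derivative of the standard Gumbel CDF F(x) = exp(-exp(-x))
    return math.exp(-x - math.exp(-x))

# Assumed truncation: the density is negligible outside this interval.
LO, HI = -8.0, 50.0

mean = simpson(lambda x: x * gumbel_pdf(x), LO, HI)

# Since -log f(x) = x + exp(-x), the entropy is E[X + exp(-X)].
entropy = simpson(lambda x: (x + math.exp(-x)) * gumbel_pdf(x), LO, HI)

EULER_GAMMA = 0.5772156649015329
print(mean, EULER_GAMMA)
print(entropy, EULER_GAMMA + 1)
```

Writing the entropy integrand with the analytic form of −log f avoids taking the log of a density that underflows to zero in the left tail.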

Apéry’s constant ζ(3)

The values of the Riemann zeta function ζ(z) at positive even integers have closed-form expressions given here, but the values at odd integers do not. The value of ζ(3) is known as Apéry’s constant because Roger Apéry proved in 1978 that ζ(3) is irrational.

Like the Euler-Mascheroni constant, Apéry’s constant has come up here multiple times. Some examples:

The connection of the Gumbel distribution to Apéry’s constant is that the skewness of the distribution is

12√6 ζ(3)/π³.
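Here is a rough numerical check of the skewness formula in Python. It is only a sketch: the integration limits are an assumed truncation, the helper names are made up, and ζ(3) is approximated by a partial sum plus the leading tail correction 1/(2N²).

```python
import math

def simpson(g, a, b, n=20000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def gumbel_pdf(x):
    return math.exp(-x - math.exp(-x))

def moment(p, center=0.0):
    # p-th moment about `center`, integrated over an assumed range [-8, 60]
    return simpson(lambda x: (x - center) ** p * gumbel_pdf(x), -8.0, 60.0)

mean = moment(1)
variance = moment(2, mean)
skewness = moment(3, mean) / variance ** 1.5

# zeta(3): partial sum of 1/n^3 plus the leading term of the tail
N = 10000
zeta3 = sum(1.0 / n ** 3 for n in range(1, N + 1)) + 1.0 / (2 * N ** 2)
closed_form = 12 * math.sqrt(6) * zeta3 / math.pi ** 3

print(skewness, closed_form)
```

The closed form follows from the variance π²/6 and the third central moment 2ζ(3) of the standard Gumbel distribution.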

Strengthen Markov’s inequality with conditional probability

Markov’s inequality is very general and hence very weak. Assume that X is a non-negative random variable with a finite expected value and that a > 0. Then Markov’s inequality says that

\text{P}(X > a) \leq \frac{\text{E}(X)}{a}

In [1] the author gives two refinements of Markov’s inequality which he calls Hansel and Gretel.

Hansel says

\text{P}(X > a) \leq \frac{\text{E}(X)}{a + \text{E}(X - a \mid X > a)}

and Gretel says

\text{P}(X > a) \leq \frac{\text{E}(X) - \text{E}(X \mid X \leq a)}{a - \text{E}(X \mid X \leq a)}
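To get a feel for how much the refinements help, here is a small Python sketch comparing all three bounds with the exact tail probability. The choice of an exponential random variable with mean 1 is my own, made because every conditional expectation has a closed form. The example is not taken from [1], and I only look at thresholds a larger than the mean.

```python
import math

def bounds_exponential(a):
    """Exact tail and three upper bounds on P(X > a) for X ~ Exponential(1)."""
    mean = 1.0
    tail = math.exp(-a)            # exact P(X > a)
    markov = mean / a
    # Memorylessness gives E(X - a | X > a) = 1.
    hansel = mean / (a + 1.0)
    # E(X | X <= a): integrate x exp(-x) over [0, a], divide by P(X <= a).
    m = (1.0 - (1.0 + a) * math.exp(-a)) / (1.0 - math.exp(-a))
    gretel = (mean - m) / (a - m)
    return tail, markov, hansel, gretel

for a in (1.5, 2.0, 5.0):
    tail, markov, hansel, gretel = bounds_exponential(a)
    print(f"a={a}: exact={tail:.4f} Markov={markov:.4f} "
          f"Hansel={hansel:.4f} Gretel={gretel:.4f}")
```

Both refinements need conditional expectations that you often would not know in practice, which is the price of the sharper bounds.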

[1] Joel E. Cohen. Markov’s Inequality and Chebyshev’s Inequality for Tail Probabilities: A Sharper Image. The American Statistician, Vol. 69, No. 1 (Feb 2015), pp. 5-7

Hyperbolic secant distribution

I hadn’t run into the hyperbolic secant distribution until I saw a paper by Peng Ding [1] recently. If C is a standard Cauchy random variable, then (2/π) log |C| has a hyperbolic secant distribution. Three applications of this distribution are given in [1].

Ding’s paper contains a plot comparing the density functions for the hyperbolic secant distribution, the standard normal distribution, and the logistic distribution with scale √3/π. The scale for the logistic was chosen so that all three distributions would have variance 1.

There’s something interesting about comparing the logistic and hyperbolic secant densities: the former is the square of the latter, aside from some scaling, and yet the two functions are similar. You don’t often approximate a function by its square.

Here’s a plot of the two densities.

The hyperbolic secant density, the blue curve, crosses the logistic density around ± 0.56 and around ± 2.33.

The hyperbolic secant distribution has density

f_H(x) = \frac{1}{2} \text{sech} \left(\frac{\pi x}{2} \right)

and the logistic distribution, as scaled above, has density

f_L(x) = \frac{\pi}{4\sqrt 3} \,\text{sech}^2 \left(\frac{\pi x}{2\sqrt 3} \right)

and so

\frac{\pi}{\sqrt 3} \,f_H\left(\frac{x}{\sqrt 3}\right)^2 = f_L(x)
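The crossing points quoted above can be reproduced with a few lines of Python. This is a sketch using plain bisection. The bracketing intervals [0, 1] and [1, 3] are my own choice, read off from the shape of the two curves. The last few lines also check the square relationship, with the argument of f_H rescaled by √3 so that the two sech terms line up.

```python
import math

SQRT3 = math.sqrt(3.0)

def f_H(x):
    # Hyperbolic secant density
    return 0.5 / math.cosh(math.pi * x / 2)

def f_L(x):
    # Logistic density with scale sqrt(3)/pi, i.e. variance 1
    return math.pi / (4 * SQRT3) / math.cosh(math.pi * x / (2 * SQRT3)) ** 2

def diff(x):
    return f_H(x) - f_L(x)

def bisect(g, lo, hi, tol=1e-10):
    """Root of g in [lo, hi], assuming g(lo) and g(hi) have opposite signs."""
    g_lo = g(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        g_mid = g(mid)
        if (g_mid > 0) == (g_lo > 0):
            lo, g_lo = mid, g_mid
        else:
            hi = mid
    return (lo + hi) / 2

inner = bisect(diff, 0.0, 1.0)
outer = bisect(diff, 1.0, 3.0)
print(inner, outer)

# Largest deviation from (pi/sqrt 3) f_H(x/sqrt 3)^2 = f_L(x) on a grid
grid = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(math.pi / SQRT3 * f_H(x / SQRT3) ** 2 - f_L(x)) for x in grid)
print(max_gap)
```

By symmetry the negative crossings are the mirror images of the two roots found here.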

[1] Peng Ding. Three Occurrences of the Hyperbolic-Secant Distribution. The American Statistician, Vol. 68, No. 1 (Feb 2014), pp. 32-35

The Pearson distributions

The previous post was about 12 probability distributions named after Irving Burr. This post is about 12 probability distributions named after Karl Pearson. The Pearson distributions are better known, and include some very well known distributions.

Burr’s distributions are defined by their CDFs; Pearson’s distributions are defined by their PDFs.

Pearson’s differential equation

The densities of Pearson’s distributions all satisfy the same differential equation:

f'(x) = \frac{(x-a) f(x)}{c_0 + c_1x + c_2x^2}

This is a linear differential equation, and so multiples of a solution are also a solution. However, a probability density must integrate to 1, so there is a unique probability density solution given a, c0, c1, and c2.

Well known distributions

Note that f(x) = exp(-x²/2) has derivative f′(x) = −x f(x), and so it satisfies the differential equation above if we set a = 0, c0 = −1, and c1 = c2 = 0. This says the normal distribution is a Pearson distribution.

If f(x) = x^m exp(-x) then f′(x) = (m − x) f(x)/x, and so the differential equation is satisfied for a = m, c1 = −1, and c0 = c2 = 0. This says that the exponential distribution and more generally the gamma distribution are Pearson distributions.

You can also show that the Cauchy distribution and more generally the Student t distribution are also Pearson distributions. So are the beta distributions (with a transformed range).
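Here is a sketch of the Student t case in Python. The parameter values come from my own hand calculation rather than from the post. For the unnormalized density (1 + x²/ν)^(−(ν+1)/2) we have f′/f = −(ν + 1)x/(ν + x²), which matches the form of Pearson’s equation above with a = 0, c0 = −ν/(ν + 1), c1 = 0, and c2 = −1/(ν + 1). The code checks this with a finite-difference derivative.

```python
def pearson_residual(f, x, a, c0, c1, c2, h=1e-5):
    """f'(x) - (x - a) f(x) / (c0 + c1 x + c2 x^2), f' by central difference."""
    fprime = (f(x + h) - f(x - h)) / (2 * h)
    return fprime - (x - a) * f(x) / (c0 + c1 * x + c2 * x * x)

def t_kernel(nu):
    # Unnormalized Student t density with nu degrees of freedom
    return lambda x: (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_pearson_params(nu):
    # Hand-derived from f'/f = -(nu + 1) x / (nu + x^2)
    return dict(a=0.0, c0=-nu / (nu + 1), c1=0.0, c2=-1 / (nu + 1))

def max_residual(nu, points=(-3.0, -1.0, 0.5, 2.0)):
    f = t_kernel(nu)
    params = t_pearson_params(nu)
    return max(abs(pearson_residual(f, x, **params)) for x in points)

for nu in (1, 4, 10):  # nu = 1 is the Cauchy distribution
    print(nu, max_residual(nu))
```

The residuals are not exactly zero because of the finite-difference approximation, but they should be tiny compared to the size of f′.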

Table of Pearson distributions

The table below lists all Pearson distributions with their traditional names. The order of the list is a little strange for historical reasons.

The table uses Iverson’s bracket notation: a Boolean expression in brackets represents the function that is 1 when the condition holds and 0 otherwise. This way all densities are defined over the entire real line, though some of them are only positive over an interval.

The densities are presented without normalization constants; the normalization constants are whatever they have to be for each function to integrate to 1. They can be complicated functions of the parameters and so they are left out for simplicity.

\begin{align*} \text{I} \hspace{1cm} & (1+x)^{m_1} (1-x)^{m_2} \,\,[-1 \leq x \leq 1] \\ \text{II} \hspace{1cm} & (1 - x^2)^m \,\, [ -1 \leq x \leq 1] \\ \text{III} \hspace{1cm} & x^m \exp(-x) \,\, [0 \leq x] \\ \text{IV} \hspace{1cm} & (1 + x^2)^{-m} \exp(-v \arctan x) \\ \text{V} \hspace{1cm} & x^{-m} \exp(-1/x) \,\, [0 \leq x] \\ \text{VI} \hspace{1cm} & x^{m_2}(1 + x)^{-m_1} \,\,[0 \leq x] \\ \text{VII} \hspace{1cm} & (1 + x^2)^{-m}\\ \text{VIII} \hspace{1cm} & (1 + x)^{-m} \,\, [0 \leq x \leq 1] \\ \text{IX} \hspace{1cm} & (1 + x)^{m} \,\, [0 \leq x \leq 1] \\ \text{X} \hspace{1cm} & \exp(-x) \,\, [0 \leq x] \\ \text{XI} \hspace{1cm} & x^{-m} \,\,[1 \leq x] \\ \text{XII} \hspace{1cm} & \left( (g+x)(g-x)\right)^h \,\, [-g \leq x \leq g] \end{align*}

There is a lot of redundancy in the list. All the distributions are either special cases of or limiting cases of distributions I, IV, and VI.

Note that VII is the Student t distribution after you introduce a scaling factor.


The Pearson distributions are determined by their first few moments, provided these exist, and these moments can be derived from the parameters in Pearson’s differential equation.

This suggests moment matching as a way to fit Pearson distributions to data: solve for the distribution parameters that make the exact moments match the empirical moments. Sometimes this works very well, though sometimes other approaches are better, depending on your criteria for what constitutes a good match.
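As a minimal illustration of moment matching, here is a Python sketch for the simplest interesting case, a gamma distribution (Pearson type III with a scale parameter added). It uses the facts that a gamma distribution with shape α and scale θ has mean αθ and variance αθ². The function name, the seed, and the simulated data are all made up for the example. Fitting a general Pearson family would require matching skewness and kurtosis as well.

```python
import random

def fit_gamma_by_moments(data):
    """Match the first two moments: mean = shape*scale, var = shape*scale^2."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    shape = mean * mean / var
    scale = var / mean
    return shape, scale

# Simulated data with known parameters so we can see how well the fit does.
rng = random.Random(42)
true_shape, true_scale = 3.0, 2.0
sample = [rng.gammavariate(true_shape, true_scale) for _ in range(100_000)]

shape_hat, scale_hat = fit_gamma_by_moments(sample)
print(shape_hat, scale_hat)
```

With heavier-tailed families the higher empirical moments are noisy, which is one reason moment matching is sometimes outperformed by other methods.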

The other Burr distributions

As I mentioned in the previous post, there are 12 distributions named for Irving Burr, known as Burr Type I, Burr Type II, Burr Type III, …, Burr Type XII. [1]

The last of these is by far the most common, and the rest are hard to find online. I did manage to find them, so I’ll list them here for my future reference and for the benefit of anyone else interested in these distributions in the future.

Each distribution has a closed-form CDF because each is defined by its CDF. In all but one case, Burr Type XI, the CDF can also be inverted in closed form. This means that except for Burr Type XI, one can easily generate random samples from each of the Burr distributions by applying the inverse CDF to a uniform random variable.
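Here is what inverse CDF sampling looks like for the most common case, sketched in Python. It assumes the usual parameterization of the Burr Type XII CDF, F(x) = 1 − (1 + x^c)^(−k) for x ≥ 0 with c, k > 0. The function names and the parameter values c = 2, k = 3 are arbitrary choices for illustration.

```python
import random

def burr12_cdf(x, c, k):
    # Assumed standard form of the Burr XII CDF
    return 1.0 - (1.0 + x ** c) ** (-k)

def burr12_inverse_cdf(u, c, k):
    # Solve u = 1 - (1 + x^c)^(-k) for x, with 0 <= u < 1
    return ((1.0 - u) ** (-1.0 / k) - 1.0) ** (1.0 / c)

def burr12_sample(n, c, k, rng):
    """Apply the inverse CDF to n uniform random values."""
    return [burr12_inverse_cdf(rng.random(), c, k) for _ in range(n)]

rng = random.Random(0)
c, k = 2.0, 3.0
xs = burr12_sample(50_000, c, k, rng)

# Compare the empirical CDF of the sample with the exact CDF.
for q in (0.3, 0.6, 1.0):
    empirical = sum(x <= q for x in xs) / len(xs)
    print(q, empirical, burr12_cdf(q, c, k))
```

The same pattern works for any of the other types once you solve F(x) = u for x. Type XI is the exception because x − sin(2πx)/(2π) has no closed-form inverse.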

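For each distribution I’ll give the CDF. Here c, k, and r denote positive parameters.

Burr Type I distribution

F(x) = x

for 0 < x < 1, which is the uniform distribution.

Burr Type II distribution

F(x) = \left(1 + e^{-x}\right)^{-k}

for -∞ < x < ∞.

Burr Type III distribution

F(x) = \left(1 + x^{-c}\right)^{-k}

for 0 < x < ∞.

The Burr Type III distribution is also known as the Dagum distribution, and is probably the most well known of the Burr distributions after Type XII.

Burr Type IV distribution

F(x) = \left(1 + \left(\frac{c-x}{x}\right)^{1/c}\right)^{-k}

for 0 < x < c.

Burr Type V distribution

F(x) = \left(1 + c\,e^{-\tan x}\right)^{-k}

for -π/2 < x < π/2.

Burr Type VI distribution

F(x) = \left(1 + c\,e^{-r \sinh x}\right)^{-k}

for -∞ < x < ∞.

Burr Type VII distribution

F(x) = 2^{-k} \left(1 + \tanh x\right)^{k}

for -∞ < x < ∞.

Burr Type VIII distribution

F(x) = \left(\frac{2}{\pi} \arctan e^x\right)^{k}

for -∞ < x < ∞.

Burr Type IX distribution

F(x) = 1 - \frac{2}{2 + c\left((1 + e^x)^k - 1\right)}

for -∞ < x < ∞.

Burr Type X distribution

F(x) = \left(1 - e^{-x^2}\right)^{k}

for 0 ≤ x < ∞.

Burr Type XI distribution

F(x) = \left(x - \frac{1}{2\pi} \sin(2\pi x)\right)^{k}

for 0 < x < 1.

Burr Type XII distribution

F(x) = 1 - \left(1 + x^c\right)^{-k}

for 0 ≤ x < ∞.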

The Burr Type XII distribution is also known as the Singh-Maddala distribution in economics.

[1] It’s possible to represent the Roman numerals I through XII as single Unicode characters as described here. So if you want to get fancy, we have Burr Ⅰ, Burr Ⅱ, Burr Ⅲ, …, Burr Ⅻ. Here Ⅻ, for example, is a single character, U+216B.


Burr distribution

Irving Burr came up with a set of twelve probability distributions known as Burr I, Burr II, …, Burr XII. The last of these is by far the best known, and so the Burr XII distribution is often referred to simply as the Burr distribution [1]. See the next post for the rest of the Burr distributions.

Cumulative distribution functions (CDFs) of probability distributions don’t always have a closed form, but it’s convenient when they do. And the CDF of the Burr XII distribution has a particularly nice closed form:

F(x) = 1 - \left(1 + x^c\right)^{-k}

for x > 0, where c and k are positive shape parameters.

You can also add a scale parameter the usual way.

The flexibility of the Burr distribution isn’t apparent from looking at the CDF. It’s obviously a two-parameter family of distributions, but not all two-parameter families are equally flexible. Flexibility is a fuzzy concept, and there are different notions of flexibility for different applications. But one way of measuring flexibility is the range of (skewness, kurtosis) values the distribution family can have.

The diagram below [2] shows that the Burr distribution can take on a wide variety of skewness and kurtosis combinations, and that for a given value of skewness, the Burr distribution has the largest kurtosis of common probability distribution families. (Note that the skewness axis increases from top to bottom.)

The green area at the bottom of the rainbow belongs to the Burr distribution. The yellow area is a subset of the green area.

[1] In economics the Burr Type XII distribution is also known as the Singh-Maddala distribution.

[2] Vargo, Pasupathy, and Leemis. Moment-Ratio Diagrams for Univariate Distributions. Journal of Quality Technology. Vol. 42, No. 3, July 2010. The article contains a grayscale version of the image, and the color version is available via supplementary material online.

Heat equation and the normal distribution

The density function of a normal distribution with mean 0 and standard deviation √(2kt) satisfies the heat equation. That is, the function

u(x, t) = \frac{1}{2\sqrt{\pi kt}} \exp\left(-\frac{x^2}{4kt}\right)

satisfies the partial differential equation

u_t = k\,u_{xx}

You could verify this by hand, or if you’d like, here’s Mathematica code to do it.

      u[x_, t_] := PDF[NormalDistribution[0, Sqrt[2 k t]], x]
      Simplify[ D[u[x, t], {t, 1}] - k D[u[x, t], {x, 2}] ]

This returns 0 as expected.
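If you don’t have Mathematica, you can get a rougher numerical confirmation with finite differences. The Python sketch below is an analogue of the symbolic check rather than a translation of it. The step size h and the test points are arbitrary choices.

```python
import math

def u(x, t, k):
    """Normal density with mean 0 and standard deviation sqrt(2 k t)."""
    return math.exp(-x * x / (4 * k * t)) / (2 * math.sqrt(math.pi * k * t))

def heat_residual(x, t, k, h=1e-3):
    """u_t - k u_xx, with both derivatives from central differences."""
    u_t = (u(x, t + h, k) - u(x, t - h, k)) / (2 * h)
    u_xx = (u(x + h, t, k) - 2 * u(x, t, k) + u(x - h, t, k)) / (h * h)
    return u_t - k * u_xx

for x, t, k in [(1.0, 2.0, 0.5), (0.0, 1.0, 1.0), (3.0, 4.0, 2.0)]:
    print(x, t, k, heat_residual(x, t, k))
```

The residuals are small but not exactly zero, since the finite differences are only accurate to order h².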

Solutions to the heat equation are unique if we specify initial conditions. So if we start with a very concentrated heat source at time t = ε, say

u(x, \varepsilon) = \frac{1}{2\sqrt{\pi k\varepsilon}} \exp\left(-\frac{x^2}{4k\varepsilon}\right)

where ε is a tiny but positive number, then the solution at the top of the post holds for all t > 0 and is unique.

Alternatively, we could say that if our initial condition is u(x, 0) = δ(x), where the right hand side is the Dirac delta function, then the solution at the top of the post holds for all t > 0 and is unique. You could justify this rigorously using distribution theory (in the analytic sense, not the probability sense, though they’re related). Or you could justify it heuristically by letting ε go to zero and thinking of δ as the limit.

It is also the case that a constant satisfies the heat equation. So if we add a constant to our solution, the constant representing ambient temperature, and add our initial condition to that constant, the solution is the constant plus the solution above.

The temperature at the center will decay in proportion to 1/√t. A point x away from the center will initially get warmer, reach a maximum temperature at time t = x²/(2k), and then cool off roughly in proportion to 1/√t.
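The time of maximum temperature comes from setting the t derivative of log u to zero: −1/(2t) + x²/(4kt²) = 0, which gives t = x²/(2k). Here is a small Python sketch that compares this with a brute-force search over a time grid, using x = 10 and k = 5. The grid spacing of 0.01 is an arbitrary choice.

```python
import math

def u(x, t, k):
    return math.exp(-x * x / (4 * k * t)) / (2 * math.sqrt(math.pi * k * t))

def peak_time(x, k):
    # From d/dt log u = -1/(2t) + x^2/(4 k t^2) = 0
    return x * x / (2 * k)

x, k = 10.0, 5.0

# Brute-force check on a grid of times from 0.01 to 50.
times = [0.01 * i for i in range(1, 5001)]
t_grid = max(times, key=lambda t: u(x, t, k))

print(peak_time(x, k), t_grid)
```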

Here’s a plot of the temperature at x = 10 with k = 5.

Maxwell-Boltzmann and Gamma

When I shared an image from the previous post on Twitter, someone who goes by the handle Nonetheless made the astute observation that the image looked like the Maxwell-Boltzmann distribution. That made me wonder what 1/Γ(x) would be like turned into a probability distribution, and whether it would be approximately like the Maxwell-Boltzmann distribution.

(Here I’m looking at something related to what Nonetheless said, but different. He was looking at my plot of the error in approximating 1/Γ(x) by partial products, but I’m looking at 1/Γ(x) itself.)

Making probability distributions

You can make any non-negative function with a finite integral into a probability density function by dividing it by its integral. So we can define a probability density function

f(x) = \frac{c}{\Gamma(x)}

for x > 0, where

1/c = \int_0^\infty \frac{dx}{\Gamma(x)}

and so c = 0.356154. (The integral cannot be computed in closed form and so has to be evaluated numerically.)
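Here is one way to do the numerical integration, sketched in Python with Simpson’s rule. The upper limit of 40 is an assumed cutoff. The integrand decays faster than exponentially, so the part beyond the cutoff is negligible. The value at x = 0 is set to 0 because Γ has a pole there.

```python
import math

def recip_gamma(x):
    # 1/Gamma(x) tends to 0 as x -> 0+, where Gamma has a pole.
    return 0.0 if x == 0 else 1.0 / math.gamma(x)

def simpson(g, a, b, n):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

integral = simpson(recip_gamma, 0.0, 40.0, 40000)
c = 1.0 / integral
print(integral, c)
```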

Here’s a plot of f(x).

Note that we’re doing something kind of odd here. It’s common to see the gamma function in the definition of probability distributions, but the function is always evaluated at distribution parameters, not at the free variable x. We’ll give an example of this shortly.

Maxwell-Boltzmann distribution

The Maxwell-Boltzmann distribution, sometimes called just the Maxwell distribution, is used in statistical mechanics to give a density function for particle speeds. In statistical terminology, the Maxwell-Boltzmann distribution is a chi distribution [1] with three degrees of freedom and a scale parameter σ that depends on physical parameters.

The density function for a chi distribution with k degrees of freedom and scale parameter σ is given by the following equation.

\chi(x; k, \sigma) = \frac{1}{2^{(k/2) -1}\sigma \Gamma(k/2)} \left(\frac{x}{\sigma}\right)^{k-1} \exp(-x^2/2\sigma^2)

Think of this as x^(k−1) exp(−x²/(2σ²)) multiplied by whatever you have to multiply it by to make it integrate to 1; most of the complexity in the definition is in the proportionality constant. And as mentioned above, the gamma function appears in the proportionality constant.

For the Maxwell-Boltzmann distribution, k = 3.

Lining up distributions

Now suppose we want to compare the distribution we’ve created out of 1/Γ(x) with the Maxwell-Boltzmann distribution. One way to do this would be to align the two distributions to have their peaks at the same place. The mode of a chi random variable with k degrees of freedom and scale σ is

x = \sqrt{k-1} \, \sigma

and so for a given k we can solve for σ to put the mode where we’d like.

For positive values, the minimum of the gamma function, and hence the maximum of its reciprocal, occurs at 1.46163. As with the integral above, this has to be computed numerically.
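The two numerical steps here, finding the minimum of the gamma function and then solving for σ, can be sketched in Python as follows. The ternary search relies on log Γ being convex on the positive reals. The search interval [1, 2] and the function names are my own choices.

```python
import math

def argmin_gamma(lo=1.0, hi=2.0, tol=1e-9):
    """Ternary search for the minimizer of Gamma on [lo, hi].

    Works because log Gamma is convex on the positive reals.
    """
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if math.lgamma(m1) < math.lgamma(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def chi_pdf(x, k, sigma):
    """Density of a chi distribution with k degrees of freedom and scale sigma."""
    z = x / sigma
    norm = 2 ** (k / 2 - 1) * sigma * math.gamma(k / 2)
    return z ** (k - 1) * math.exp(-z * z / 2) / norm

def sigma_for_mode(mode, k):
    # Invert mode = sqrt(k - 1) * sigma; requires k > 1
    return mode / math.sqrt(k - 1)

peak = argmin_gamma()
print(peak)

for k in (3, 2, 1.86):
    sigma = sigma_for_mode(peak, k)
    print(k, sigma, chi_pdf(peak, k, sigma))
```

Nothing requires k to be an integer in the chi density, which is why a value like 1.86 is allowed.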

For the Maxwell-Boltzmann distribution, i.e. when k = 3, we don’t get a very good fit.

The tails on the chi distribution are too thin. If we try again with k = 2 we get a much better fit.

You get an even better fit with k = 1.86. The optimal value of k would depend on your criteria for optimality, but k = 1.86 looks like a good value just eyeballing it.

[1] This is a chi distribution, not the far more common chi-square distribution. The first time you see this it looks like a typo. If X is a chi random variable with k degrees of freedom then X² has a chi-square distribution with k degrees of freedom.

Surprisingly not that surprising

Eliud Kipchoge, marathon world record holder

World record marathon times have been falling in increments of roughly 30 seconds, each new record shaving roughly 30 seconds off the previous record. If someone were to set a new record, taking 20 seconds off the previous record, this would be exciting, but not suspicious. If someone were to take 5 minutes off the previous record, that would be suspicious.

One way to quantify how surprising a new record is would be to divide its margin of improvement by the previous margin of improvement. That is, given a new record y, and previous records y1 and y2, we can calculate an index of surprise by

r = (y − y1) / (y1 − y2)

In [1] the authors analyze this statistic and extensions that take into account more than just the latest two records under technical assumptions I won’t get into here.

A p-value for the statistic R is given by

Prob(R > r) = 2/(r + 2).

You could think of this as a scale of surprise, with 0 being impossibly surprising and 1 being completely unremarkable.

There are multiple reasons to take this statistic with a grain of salt. It is an idealization based on assumptions that may not even approximately hold in a particular setting. And yet it is at least a useful rule of thumb.

The current marathon record beat the previous record by 30 seconds. The previous margin of improvement was 78 seconds. This gives a value of r equal to 30/78 = 0.385 and a corresponding p-value of 0.84. This says the current record is impressive but statistically unremarkable. A new record that improved on the current one by 5 minutes, i.e. 300 seconds, would have r = 300/30 = 10 and a p-value of 0.17, which is notable but not hard evidence of cheating. [2]
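The arithmetic is simple enough to do by hand, but here it is as a Python sketch. The function names are made up. For the hypothetical 5-minute improvement I am assuming the comparison is against the most recent margin of 30 seconds, which is the reading that reproduces the 0.17 figure.

```python
def surprise_index(new_margin, previous_margin):
    """r = (margin of the new record) / (margin of the previous record)."""
    return new_margin / previous_margin

def p_value(r):
    # Prob(R > r) = 2 / (r + 2)
    return 2.0 / (r + 2.0)

# Current record: 30 second improvement, following a 78 second improvement.
r_current = surprise_index(30, 78)
print(round(r_current, 3), round(p_value(r_current), 2))

# Hypothetical next record: 300 second improvement, following the 30 second one.
r_hypothetical = surprise_index(300, 30)
print(r_hypothetical, round(p_value(r_hypothetical), 2))
```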

The assumptions in [1] do not apply to marathon times, and may not apply to many situations where the statistic above nevertheless is a useful rule of thumb. The ideas in the paper could form the basis of a more appropriate analysis customized for a particular application.

Reports of a new record in any context are usually over-hyped. The rule of thumb above gives a way to gauge for yourself whether you should share the report’s excitement. You shouldn’t read too much into it, like any rule of thumb, but it at least gives a basis for deciding whether something deserves closer attention.

[1] Andrew R. Solow and Woollcott Smith. How Surprising Is a New Record? The American Statistician, May, 2005, Vol. 59, No. 2, pp. 153-155

[2] I haven’t done the calculations, but I suspect that if you used the version of the index of surprise in [1] that takes into account more previous records, say the last 10, then you’d get a much smaller p-value.

The image at the top of the post is of Eliud Kipchoge, current marathon world record holder. Image under Creative Commons license. source.

Poisson distribution tail bounds

Yesterday Terence Tao published a blog post on bounds for the Poisson probability distribution. Specifically, he wrote about Bennett’s inequalities and a refinement that he developed or at least made explicit. Tao writes

This observation is not difficult and is implicitly in the literature … I was not able to find a clean version of this statement in the literature, so I am placing it here on my blog.

Bennett’s inequalities say that for a Poisson random variable X with mean λ,

P(X \geq \lambda(1+u)) \leq \exp(-\lambda h(u))

for u ≥ 0 and

P(X \leq \lambda(1+u)) \leq \exp(-\lambda h(u))

for -1 < u ≤ 0 where

h(u) = (1+u) \log(1+u) - u
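To make the inequalities concrete, here is a Python sketch that computes both sides for a few values of λ and u. The exact tail probabilities are obtained by summing the Poisson probability mass function directly. The function names and the particular values of λ and u are arbitrary choices, not anything taken from Tao’s post.

```python
import math

def h(u):
    # h(u) = (1 + u) log(1 + u) - u, defined for u > -1
    return (1 + u) * math.log1p(u) - u

def bennett_bound(lam, u):
    return math.exp(-lam * h(u))

def poisson_pmf(n, lam):
    # Computed on the log scale to avoid overflow in n!
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def right_tail(lam, u):
    """P(X >= lam (1 + u)) for X ~ Poisson(lam), with u >= 0."""
    start = math.ceil(lam * (1 + u))
    return 1.0 - sum(poisson_pmf(n, lam) for n in range(start))

def left_tail(lam, u):
    """P(X <= lam (1 + u)) for X ~ Poisson(lam), with -1 < u <= 0."""
    stop = math.floor(lam * (1 + u))
    return sum(poisson_pmf(n, lam) for n in range(stop + 1))

for lam in (5.0, 10.0):
    for u in (0.5, 1.0, 2.0):
        print("right", lam, u, right_tail(lam, u), bennett_bound(lam, u))
    for u in (-0.9, -0.5, -0.1):
        print("left ", lam, u, left_tail(lam, u), bennett_bound(lam, u))
```

Because X only takes integer values, the exact tail probabilities are step functions of u while the bound is smooth, which is the source of the jagged appearance in plots of the gap.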

I wanted to visualize how tight Bennett’s bounds are and got some interesting images due to the discrete nature of the Poisson distribution.

Here’s a plot of how tight the right-tail estimate is, the gap between the two sides of the first inequality above.

Here u ranges from 0 to 2 along the front edge of the graphic and λ ranges from 5 to 10.

And here’s a plot of how tight the left-tail estimate is, the gap between the two sides of the second inequality above.

Mathematica tweaking

The latter image was easy to make, but the first image required a couple adjustments: the image had holes in it, and the view point was awkward.

I fixed the holes in the plot by adding the option PlotPoints->50 to make the sampling finer. I fixed the view point by grabbing the image and rotating it until I thought it looked better. I could have saved the rotated image directly, but I was curious how to do this with the Export command. To do this I needed to specify the ViewPoint explicitly in the plotting command, but I didn’t know how to get the value of ViewPoint that I’d implicitly found via my mouse. A comment on the Mathematica Stack Exchange site told me what I needed to know.

Simply edit the output cell and wrap Options[…, ViewPoint] around the already rotated output. The graphics should be in the place of …. ViewVertical may also change during rotating, as well as some other parameters.
