Beta inequalities with integer parameters

Suppose X is a beta(a, b) random variable and Y is a beta(c, d) random variable. Define the function

g(a, b, c, d) = Prob(X > Y).

At one point I spent a lot of time developing accurate and efficient ways to compute g under various circumstances. I did this because when I worked at MD Anderson Cancer Center we spent thousands of CPU-hours computing this function inside clinical trial simulations.

For this post I want to outline how to compute g in a minimal environment, i.e. a programming language with no math libraries, when the parameters a, b, c, and d are all integers. The emphasis is on portability, not efficiency. (If we had access to math libraries, we could solve the problem for general parameters using numerical integration.)

In this technical report I develop several symmetries and recurrence relationships for the function g. The symmetries can be summarized as

g(a, b, c, d) = g(d, c, b, a) = g(d, b, c, a) = 1 − g(c, d, a, b).

Without loss of generality we assume a is the smallest argument. If not, apply the symmetries above to make the first argument the smallest.

We will develop a recursive algorithm for computing the function g by recurring on a. The algorithm will terminate when a = 1, and so we want a to be the smallest parameter.

In the technical report cited above, I develop the recurrence relation

g(a + 1, b, c, d) = g(a, b, c, d) + h(a, b, c, d)/a

where

h(a, b, c, d) = B(a + c, b + d) / B(a, b) B(c, d)

and B(x, y) is the beta function.

The equation above explains how to compute g(a + 1, b, c, d) after you’ve already computed g(a, b, c, d), which is what I often needed to do in applications. But we want to do the opposite to create a recursive algorithm:

g(a, b, c, d) = g(a – 1, b, c, d) + h(a – 1, b, c, d)/(a – 1)

for a > 1.

For the base case, when a = 1, we have

g(1, b, c, d) = B(c, b + d) / B(c, d).

All that’s left to develop our code is to evaluate the Beta function.

Now

B(x, y) = Γ(x + y) / Γ(x) Γ(y)

and

Γ(x) = (x – 1)! for positive integer x.

So all we need is a way to compute the factorial, and that’s such a common task that it has become a cliché.
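
Here's a minimal sketch of the whole algorithm in Python, assuming the parameters are positive integers and that the symmetries above have already been applied so that a is the smallest. The function names are just for illustration.

    def factorial(n):
        result = 1
        for k in range(2, n + 1):
            result *= k
        return result

    def beta(x, y):
        # B(x, y) = (x-1)! (y-1)! / (x+y-1)! for positive integers x and y
        return factorial(x - 1) * factorial(y - 1) / factorial(x + y - 1)

    def h(a, b, c, d):
        return beta(a + c, b + d) / (beta(a, b) * beta(c, d))

    def g(a, b, c, d):
        # base case and recurrence from the relations above
        if a == 1:
            return beta(c, b + d) / beta(c, d)
        return g(a - 1, b, c, d) + h(a - 1, b, c, d) / (a - 1)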

The symmetries and recurrence relation above hold for non-integer parameters, but you need a math library to compute the gamma function if the parameters are not integers. If you do have a math library, the algorithm here won’t be the most efficient one unless you need to evaluate g for a sequence of parameters that differ by small integers.

***

For more on random inequalities, here’s the first of a nine-part sequence of blog posts.

Initial letter frequency

I needed to know the frequencies of letters at the beginning of words for a project. The overall frequency of letters, wherever they appear in a word, is well known. Initial frequencies are not so common, so I did a little experiment.

I downloaded the Canterbury Corpus and looked at the frequency of initial letters in a couple of the files in the corpus. I first tried a different approach, then realized a shell one-liner [1] would be simpler and less error-prone.

cat alice29.txt | lc | grep -o '\b[a-z]' | sort | uniq -c | sort -rn

This shows that the letters in descending order of frequency at the beginning of a word are t, a, s, …, j, x, z.

The file alice29.txt is the text of Alice’s Adventures in Wonderland. Then for comparison I ran the same script on another file, lcet10.txt, a lengthy report from a workshop on electronic texts.

This technical report’s initial letter frequencies order the alphabet t, a, o, …, y, z, x. So starting with the third letter, the two files have different initial letter frequencies.

I made the following plot to visualize how the frequencies differ. The horizontal axis is sorted by overall letter frequency (based on the Google corpus summarized here).

I expected the initial letter frequencies to differ from overall letter frequencies, but I did not expect the two corpora to differ.

Apparently initial letter frequencies vary more across corpora than overall letter frequencies. The following plot shows the overall letter frequencies for both corpora, with the horizontal axis again sorted by the frequency in the Google corpus.

Here the two corpora essentially agree with each other and with the Google corpus. The tech report ranks letters in essentially the same order as the Google corpus because the orange dashed line is mostly decreasing, though there is a curious kink in the graph at c.

Related posts

[1] The lc function converts its input to lower case. See this post for how to install and use the function.

Lake Wobegon Dice

Garrison Keillor’s fictional Lake Wobegon is a place “where all the children are above average.” Donald Knuth alluded to this in his exercise regarding “Lake Wobegon Dice,” a set of dice where the roll of each die is (probably) above average.

Let A be a six-sided die with a 5 on one side and 3’s on the other sides.

Let B and C be six-sided dice with 1’s on two sides and 4’s on four sides.

Then the probabilities

Pr( A > (A + B + C)/3 )
Pr( B > (A + B + C)/3 )
Pr( C > (A + B + C)/3 )

are all greater than 1/2.

To see this, list out all the possible rolls where each die is above average. The triples show the value of A, B, and C in that order. We have A above average if the rolls are

(3, 1, 1)
(3, 1, 4)
(3, 4, 1)
(5, *, *)

These outcomes have probability 5/54, 10/54, 10/54, and 1/6 = 9/54. The total is 34/54 = 17/27 > 1/2.

We have B above average when the rolls are

(3, 4, *)
(5, 4, 1)

which have probability 5/9 = 15/27 and 1/27, for a total of 16/27 > 1/2. And C is the same as B.
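
Here's a short brute-force check of these probabilities, enumerating all 216 equally likely rolls with exact fractions. It's just a sketch to confirm the arithmetic.

    from fractions import Fraction
    from itertools import product

    A = [5, 3, 3, 3, 3, 3]    # one 5, five 3s
    B = [1, 1, 4, 4, 4, 4]    # two 1s, four 4s
    C = B

    above = [Fraction(0)] * 3
    for a, b, c in product(A, B, C):      # 6^3 equally likely outcomes
        for i, die in enumerate((a, b, c)):
            if 3 * die > a + b + c:       # die > (a + b + c)/3
                above[i] += Fraction(1, 6**3)

    print(above)    # Fraction(17, 27), Fraction(16, 27), Fraction(16, 27)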

It seems paradoxical for three different dice to each have a probability better than 1/2 of being above average. The resolution is that the three events

A > (A + B + C)/3
B > (A + B + C)/3
C > (A + B + C)/3

are not exclusive. For example, in the roll (3, 4, 1) both A and B are above the average of 8/3.

I ran across this in The Art of Computer Programming, Volume 4, Pre-fascicle 5A, Exercise 3. Fascicle 5 has since been published, but I don’t have a copy. I don’t know whether this exercise made it into the published version. If it did, the page number may have changed.

More dice posts

Random Fourier series

A theorem by Paley and Wiener says that a Fourier series with random coefficients produces Brownian motion on [0, 2π]. Specifically,

W(t) = \frac{Z_0}{\sqrt{2\pi}}\,t + \frac{2}{\sqrt{\pi}} \sum_{n=1}^\infty \frac{Z_n}{n} \sin\left(\frac{nt}{2}\right)

produces Brownian motion on [0, 2π]. Here the Zs are independent standard normal (Gaussian) random variables.

Here is a plot of 10 instances of this process.

Plot of 10 random Fourier series

Here’s the Python code that produced the plots.

    import matplotlib.pyplot as plt
    import numpy as np
    
    def W(t, N):
        z = np.random.normal(0, 1, N)
        s = z[0]*t/(2*np.pi)**0.5
        s += sum(np.sin(0.5*n*t)*z[n]/n for n in range(1, N))*2/np.pi**0.5
        return s
    
    N = 1000
    t = np.linspace(0, 2*np.pi, 100)
    
    for _ in range(10):
        plt.plot(t, W(t, N))
    
    plt.xlabel("$t$")
    plt.ylabel("$W(t)$")
    plt.savefig("random_fourier.png")

Note that you must call the function W(t, N) with a vector t of all the time points you want to see. Each call creates a random Fourier series and samples it at the points given in t. If you were to call W with one time point in each call, each call would be sampling a different Fourier series.

The plots look like Brownian motion. Let’s demonstrate that the axioms for Brownian motion are satisfied. In this post I give three axioms as

  1. W(0) = 0.
  2. For s > t ≥ 0, W(s) – W(t) has distribution N(0, s – t).
  3. For v > u ≥ s > t ≥ 0, W(s) – W(t) and W(v) – W(u) are independent.

The first axiom is obviously satisfied.

To demonstrate the second axiom, let t = 0.3 and s = 0.5, just to pick two arbitrary points. We expect that if we sample W(s) – W(t) a large number of times, the mean of the samples will be near 0 and the sample variance will be around 0.2. Also, we expect the samples should have a normal distribution, and so a quantile-quantile plot would be nearly a straight 45° line.

To demonstrate the third axiom, let u = 1 and v = √7, two arbitrary points in [0, 2π] larger than the first two points we picked. The correlation between our samples from W(s) – W(t) and our samples from W(v) – W(u) should be small.

First we generate our samples.

    N = 1000
    diff1 = np.zeros(N)
    diff2 = np.zeros(N)
    x = [0.3, 0.5, 1, 7**0.5]
    for n in range(N):
        w = W(np.array(x), N)
        diff1[n] = w[1] - w[0]
        diff2[n] = w[3] - w[2]

Now we test axiom 2.

    from statsmodels.api import qqplot
    from scipy.stats import norm

    print(f"diff1 mean = {diff1.mean()}, var = {diff1.var()}.")
    qqplot(diff1, norm(0, 0.2**0.5), line='45')
    plt.savefig("qqplot.png")

When I ran this the mean was 0.0094 and the variance was 0.2017.

Here’s the  Q-Q plot:

quantile-quantile plot to test normal distribution

And we finish by calculating the correlation between diff1 and diff2. This was 0.0226.
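
The correlation is a one-line calculation:

    print(np.corrcoef(diff1, diff2)[0, 1])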

Related posts

The fractal nature of Brownian motion

Yesterday I wrote about how to interpolate a Brownian path. If you’ve created a discrete Brownian path with a certain step size, you can go back and fill in at smaller steps as if you’d generated the values from the beginning.

This means that instead of generating samples in order as a random walk, you could generate the samples recursively. You could generate the first and last points, then fill in a point in the middle, then fill in the points in the middle of each subinterval etc. This is very much like the process of creating a Cantor set or a Koch curve.

In Brownian motion, the variance of the difference in the values at two points in time is proportional to the distance in time. To step forward from your current position by a time t, add a sample from a normal random variable with mean 0 and variance t.

Suppose we want to generate a discrete Brownian path over [0, 128]. We could start by setting W(0) = 0, then setting W(1) to a N(0, 1) sample, then setting W(2) to be W(1) plus a N(0, 1) sample etc.

The following code carries this process out in Python.

    import numpy as np

    N = 128
    W = np.zeros(N+1)
    for i in range(N):
        W[i+1] = W[i] + np.random.normal()

We could make the code more concise, more efficient, but less clear [1], by writing

    W = np.insert(np.random.normal(0, 1, N).cumsum(), 0, 0)

In either case, here’s a plot of W.

Brownian motion

But if we want to follow the fractal plan mentioned above, we’d set W(0) = 0 and then generate W(128) by adding a N(0, 128) sample. (I’m using the second argument in the normal distribution to be variance. Conventions vary, and in fact NumPy’s random.normal takes standard deviation rather than variance as its second argument.)

    W[0] = 0
    W[128] = np.random.normal(0, 128**0.5)

Then as described in the earlier post, we would generate W(64) by taking the average of W(0) and W(128) and adding a N(0, 32) sample. Note that the variance in the sample is 32, not 64. See the earlier post for an explanation.

    W[64] = 0.5*(W[0] + W[128]) + np.random.normal(0, 32**0.5)

Next we would fill in W(32) and W(96). To fill in W(32) we’d start by averaging W(0) and W(64), then add a N(0, 16) sample. And to fill in W(96) we’d average W(64) and W(128) and add a N(0, 16) sample.

    
    W[32] = 0.5*(W[0] + W[64]) + np.random.normal(0, 16**0.5)
    W[96] = 0.5*(W[64] + W[128]) + np.random.normal(0, 16**0.5)

We would continue this process until we have samples at 0, 1, 2, …, 128. We won’t get the same Brownian path that we would have gotten if we’d sampled the paths in order—this is all random after all—but we’d get an instance of something created by the same stochastic process.
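
Here's a sketch of one way to carry out that recursive refinement; the helper function fill_in is just for illustration.

    import numpy as np

    def fill_in(W, lo, hi):
        # Fill in the midpoint of [lo, hi], then recurse on each half.
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        variance = (hi - lo) / 4
        W[mid] = 0.5*(W[lo] + W[hi]) + np.random.normal(0, variance**0.5)
        fill_in(W, lo, mid)
        fill_in(W, mid, hi)

    N = 128
    W = np.zeros(N + 1)
    W[N] = np.random.normal(0, N**0.5)    # W(128) is a N(0, 128) sample
    fill_in(W, 0, N)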

To state the fractal property explicitly, if W(t) is a Brownian path over [0, T], then

W(c²t)/c

is a Brownian path over [0, T/c²]. Brownian motion looks the same at all scales.

***

[1] The “less clear” part depends on who the reader is. The more concise version is more clear to someone accustomed to reading such code. But the loop more explicitly reflects the sequential nature of constructing the path.

Interpolating Brownian motion

Let W(t) be a standard Wiener process, a.k.a. a one-dimensional Brownian motion.

We can produce a discrete realization of W by first setting W(0) = 0. Then let W(1) be a sample from a N(0, 1) random variable. Then let W(2) be W(1) plus another N(0, 1) sample. At each integer n > 0, W(n) equals W(n-1) plus a N(0, 1) sample. Say we do this for n up to 1000.

Then for whatever reason we wish we’d produced a discretized realization with a sample at 800.5. We could start over, generating samples at steps of size 1/2. This time we’d add a N(0, 1/2) sample at each step [1].

We’d have to throw away a lot of work. How might we reuse the values we’ve already sampled?

(Update: Thanks to the comments I learned that interpolating Brownian motion is called a “Brownian bridge” in finance.)

The first thing you might think of is to set W(800.5) to the average of W(800) and W(801). That would be appropriate if we were sampling a deterministic function, but not when we’re sampling a stochastic process.

To decide the right thing to do we have to look back at the definition of a standard Wiener process. The three axioms are

  1. W(0) = 0.
  2. For s > t ≥ 0, W(s) – W(t) has distribution N(0, s – t).
  3. For v > u ≥ s > t ≥ 0, W(s) – W(t) and W(v) – W(u) are independent.

Using linear interpolation violates (3) because if we set W(800.5) to be the average of W(800) and W(801), then W(801) – W(800.5) and W(800.5) – W(800) are not independent: they’re equal. That’s as dependent as it gets.

Another thing you might think of would be to add a N(0, 0.5) sample to W(800) because that’s exactly what you’d do if you had generated the Brownian path from scratch taking step sizes 0.5. And if you threw away the samples from 801 onward and started over again, that would work. But if you want to keep W(801) and its successors, you have to do something else.

Why wouldn’t this work? Because

W(801) – W(800.5)

would have too much variance, 1.5 when it should be 0.5.

The right thing to do is a sort of merger of the two ideas above: do the linear interpolation and add a N(0, 0.25) sample.

That is, we take

W(800.5) = 0.5 W(800) + 0.5 W(801) + N(0, 0.25).

Does this work? Let’s check that W(801) – W(800.5) has the right distribution.

W(801) – W(800.5) = 0.5W(801) – 0.5 W(800) – N(0, 0.25)

The distribution of

X = W(801) – W(800)

is N(0, 1), so 0.5 X has distribution N(0, 0.25). Thus

W(801) – W(800.5)

has the distribution of the difference of two independent N(0, 0.25) random variables, which is N(0, 0.5), as it should be.

A similar argument shows that W(800.5) – W(800) also has the right distribution.
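
Here's a quick simulation sketch that checks both variances, working with the increments relative to W(800).

    import numpy as np

    n = 100_000
    rng = np.random.default_rng()
    step = rng.normal(0, 1, n)                       # W(801) - W(800), N(0, 1) samples
    mid  = 0.5*step + rng.normal(0, 0.25**0.5, n)    # W(800.5) - W(800)

    print(np.var(step - mid))    # W(801) - W(800.5): should be near 0.5
    print(np.var(mid))           # W(800.5) - W(800): should be near 0.5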

Related posts

[1] There’s always some ambiguity when you see a normal distribution with numeric parameters. Is the second parameter the standard deviation or the variance? If you see N(μ, σ) then by convention the second parameter σ is the standard deviation. If you see N(μ, σ²) then by convention the second parameter σ² is the variance. If you see N(0, 1/2) as above, is 1/2 a standard deviation or a variance? I intend it to be a variance.

This is an example of why some have called conventional statistical notation a “nomenclatural abomination.” You often can’t tell what a literal number means. You have to ask “If you were to write that number as a variable, what notation would you use?” In this case, you’d want to know whether 1/2 is a σ or a σ².

Sum of independent but differently distributed variables

It’s well known that a binomial random variable can be approximated by a Poisson random variable, and under what circumstances the approximation is particularly good. See, for example, this post.

A binomial random variable is the sum of iid (independent, identically distributed) Bernoulli random variables. But what if the Bernoulli random variables don’t have the same distribution? That is, suppose you’re counting the number of heads seen in flipping n coins, where each coin has a potentially different probability of coming up heads. Will a Poisson approximation still work?

This post will cite three theorems on the error in approximating a sum of n independent Bernoulli random variables, each with a different probability of success pᵢ. I’ll state each theorem and very briefly discuss its advantages. The theorems can be found in [1].

Setup

For i = 1, 2, 3, …, n let Xᵢ be Bernoulli random variables with

Prob(Xᵢ = 1) = pᵢ

and let X with no subscript be their sum:

X = X₁ + X₂ + X₃ + … + Xₙ

We want to approximate the distribution of X with a Poisson distribution with parameter λ. We will measure the error in the Poisson approximation by the maximum difference between the probability mass function of X and the probability mass function of a Poisson(λ) random variable.

Sum of p’s

We consider two ways to choose λ. The first is

λ = p₁ + p₂ + p₃ + … + pₙ.

For this choice we have two different theorems that give upper bounds on the approximation error. One says that the error is bounded by the sum of the squares of the p’s

p₁² + p₂² + p₃² + … + pₙ²

and the other says it is bounded by 9 times the maximum of the p’s

9 max(p₁, p₂, p₃, …, pₙ).

The sum of squares bound will be smaller when n is small and the maximum bound will be smaller when n is large.

Sum of transformed p’s

The second way to choose λ is

λ = λ₁ + λ₂ + λ₃ + … + λₙ

where

λᵢ = -log(1 – pᵢ).

In this case the bound on the error is one half the sum of the squared λ’s:

(λ₁² + λ₂² + λ₃² + … + λₙ²)/2.

When pᵢ is small, λᵢ ≈ pᵢ. In this case the error bound for the transformed Poisson approximation will be about half that of the one above.
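
As a quick illustration, here's a sketch with arbitrarily chosen pᵢ (not an example from [1]). It computes the exact distribution of X by convolving the Bernoulli mass functions and compares the approximation error to the bounds above.

    import numpy as np
    from scipy.stats import poisson

    p = np.array([0.1, 0.05, 0.2, 0.15, 0.02])

    # exact distribution of X by convolution
    pmf = np.array([1.0])
    for p_i in p:
        pmf = np.convolve(pmf, [1 - p_i, p_i])
    k = np.arange(len(pmf))

    lam1 = p.sum()                  # sum of the p's
    lam2 = -np.log1p(-p).sum()      # sum of the transformed p's

    # maximum difference over the support of X (the Poisson mass beyond n is negligible here)
    err1 = np.abs(pmf - poisson.pmf(k, lam1)).max()
    err2 = np.abs(pmf - poisson.pmf(k, lam2)).max()

    print(err1, (p**2).sum(), 9*p.max())      # error, sum-of-squares bound, max bound
    print(err2, (np.log1p(-p)**2).sum()/2)    # error and its bound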

Related posts

[1] R. J. Serfling. Some Elementary Results on Poisson Approximation in a Sequence of Bernoulli Trials. SIAM Review, Vol. 20, No. 3 (July, 1978), pp. 567-579.

Beta distribution with given mean and variance

It occurred to me recently that a problem I solved numerically years ago could be solved analytically, the problem of determining beta distribution parameters so that the distribution has a specified mean and variance. The calculation turns out to be fairly simple. Maybe someone has done it before.

Problem statement

The beta distribution has two positive parameters, a and b, and has probability density proportional to [1]

x^{a-1} (1-x)^{b-1}

for x between 0 and 1.

The mean of a beta(a, b) distribution is

\mu = \frac{a}{a+b}

and the variance is

\sigma^2 = \frac{ab}{(a+b)^2(a+b+1)}

Given μ and σ² we want to solve for a and b. In order for the problem to be meaningful μ must be between 0 and 1, and σ² must be less than μ(1-μ). [2]

As we will see shortly, these two necessary conditions for a solution are also sufficient.

Visualization

Graphically, we want to find the intersection of a line of constant mean with a line of constant variance.

Note that the scales in the two plots differ.

Solution

If we set

k = \frac{1-\mu}{\mu}

then b = ka and so we can eliminate b from the equation for variance to get

ka^2 = \sigma^2 (1+k)^2 a^2\left((1+k)a + 1 \right)

Now since a > 0, we can divide by a²

k = \sigma^2 (1+k)^2 \left((1+k)a + 1 \right)

and from there solve for a:

a = \frac{k - \sigma^2(1+k)^2}{\sigma^2(1+k)^3} = \mu\left( \frac{\mu(1-\mu)}{\sigma^2} - 1 \right)

and b = ka.

We require σ² to be less than μ(1-μ), or equivalently we require the ratio of μ(1-μ) to σ² to be greater than 1. It works out that the solution a is the product of the mean and the amount by which the ratio of μ(1-μ) to σ² exceeds 1.
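
For example, if μ = 0.25 and σ² = 0.01, then k = 3 and μ(1-μ)/σ² = 18.75, so a = 0.25 × 17.75 = 4.4375 and b = 3a = 13.3125.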

Verification

Here is a little code to check for errors in the derivation above. It generates μ and σ² values at random, solves for a and b, then checks that the beta(a, b) distribution has the specified mean and variance.

    from scipy.stats import uniform, beta
    
    for _ in range(100):
        mu = uniform.rvs()
        sigma2 = uniform.rvs(0, mu*(1-mu))
    
        a = mu*(mu*(1-mu)/sigma2 - 1)
        b = a*(1-mu)/mu
    
        x = beta(a, b)
        assert( abs(x.mean() - mu) < 1e-10)
        assert( abs(x.var() - sigma2) < 1e-10)
    
    print("Done")

Related posts

[1] It’s often easiest to think of probability densities ignoring proportionality constants. Densities integrate to 1, so the proportionality constants are determined by the rest of the expression for the density. In the case of the beta distribution, the proportionality constant works out to Γ(a + b) / Γ(a) Γ(b).

[2] The variance of a beta distribution factors into μ(1-μ)/(a + b + 1), so it is less than μ(1-μ).

Computing normal probabilities with a simple calculator

If you need to calculate Φ(x), the CDF of a standard normal random variable, but don’t have Φ on your calculator, you can use the approximation [1]

Φ(x) ≈ 0.5 + 0.5*tanh(0.8 x).

If you don’t have a tanh function on your calculator, you can use

tanh(0.8x) = (exp(1.6x) – 1) / (exp(1.6x) + 1).

This saves a few keystrokes over computing tanh directly from its definition [2].

Example

For example, suppose we want to compute Φ(0.42) using bc, the basic calculator on Unix systems. bc only comes with a few math functions, but it has a function e for exponential. Our calculation would look something like this.

    > bc -lq
    t = e(1.6*0.42); (1 + (t-1)/(t+1))/2
    .66195084790657784948

The exact value of Φ(0.42) is 0.66275….

Plots

It’s hard to see the difference between Φ(x) and the approximation above.

Comparing Phi and its approximation

A plot of the approximation error shows that the error is greatest on [-2, -1] and [1, 2], and that it is especially small for x near 0 and for x far from 0.

Error in approximating Phi

Related posts

[1] Anthony C. Robin. A Quick Approximation to the Normal Integral. The Mathematical Gazette, Vol. 81, No. 490 (Mar., 1997), pp. 95-96

[2] Proof: tanh(x) is defined to be (eˣ – e⁻ˣ) / (eˣ + e⁻ˣ). Multiply top and bottom by eˣ to get (e²ˣ – 1) / (e²ˣ + 1).

Scaled Beta2 distribution

I recently ran across a new probability distribution called the “Scaled Beta2” or “SBeta2” for short in [1].

For positive argument x and for positive parameters p, q, and b, its density is

\frac{\Gamma(p+q)}{b\,\Gamma(p) \, \Gamma(q)}\, \frac{\left(\frac{x}{b}\right)^{p-1}}{\left(\left(\frac{x}{b}\right) + 1 \right)^{p+q}}

This is a heavy-tailed distribution. For large x, the probability density is O(x^(-q-1)), the same as a Student-t distribution with q degrees of freedom.
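
Here's the density transcribed directly into Python as a quick sketch, along with a numerical check that it integrates to 1.

    from math import gamma
    from scipy.integrate import quad

    def sbeta2_pdf(x, p, q, b):
        c = gamma(p + q) / (b * gamma(p) * gamma(q))
        return c * (x/b)**(p - 1) / ((x/b) + 1)**(p + q)

    # should print a number very close to 1
    print(quad(sbeta2_pdf, 0, float("inf"), args=(2, 3, 1.5))[0])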

The authors in [1] point out several ways this distribution can be useful in application. It can approximate a number of commonly-used prior distributions while offering advantages in terms of robustness and analytical tractability.

Related posts

[1] The Scaled Beta2 Distribution as a Robust Prior for Scales. María-Eglée Pérez, Luis Raúl Pericchi, Isabel Cristina Ramírez. Available here.