Cosine power approximation to Gaussian density


Years ago I wrote about how you can make a decent approximation to the Gaussian density by shifting and scaling the cosine function. This observation dates back at least as far as 1961.

Turns out you can do even better by raising the cosine term to a power. Not only can you get a good approximation, you can get an upper and lower bound of the same form.

In [1] the authors prove the following. For −π/2 < x < π/2,

\left( \frac{1 + \cos x}{2}\right)^a \leq \exp(-x^2) \leq \left( \frac{1 + \cos x}{2}\right)^b

and

\left( \frac{2 + \cos x}{3}\right)^c \leq \exp(-x^2) \leq \left( \frac{2 + \cos x}{3}\right)^d

with the best possible constants

\begin{align*} a &= 4 \\ b &= (\pi/2)^2/\log(2) \approx 3.559707\\ c &= (\pi/2)^2 / \log(3/2) \approx 6.08536\\ d &= 6 \end{align*}

When the authors say “best” they mean best with respect to the L2 norm, i.e. best in the least-squares sense.

The graph at the top of the page plots exp(−x²) as a solid blue line, the first upper bound as a dotted green line, and the first lower bound as a dotted orange line. The blue and green lines blend into a single bluish-green line, and you can barely see the orange line separate from the others. The curves are difficult to distinguish visually, which is the point.

The second pair of upper and lower bounds are even better. Here’s a plot of the errors for the second pair of bounds.


Note that the theorem is stated for the interval [−π/2, π/2], but the plot covers a wider range in order to show that the errors are still small outside the interval. Outside the specified interval, the lower bound error becomes positive, i.e. the lower bound is no longer a lower bound, though it’s still a good approximation.
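
Here is a minimal NumPy sketch, not from [1], that checks the second pair of bounds on a grid over [−π/2, π/2]. The small tolerance allows for the points where a bound touches exp(−x²) exactly.

    import numpy as np

    # Exponents from the theorem: c for the lower bound, d for the upper bound
    c = (np.pi / 2)**2 / np.log(1.5)   # about 6.08536
    d = 6

    x = np.linspace(-np.pi/2, np.pi/2, 1001)
    f = np.exp(-x**2)
    lower = ((2 + np.cos(x)) / 3)**c
    upper = ((2 + np.cos(x)) / 3)**d

    eps = 1e-12  # slack where a bound equals exp(-x**2) exactly
    print(np.all(lower <= f + eps), np.all(f <= upper + eps))  # True True
    print(np.max(f - lower), np.max(upper - f))                # largest gaps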

[1] Yogesh J. Bagul and Christophe Chesneau. Some sharp circular and hyperbolic bounds of exp(−x²) with applications. Applicable Analysis and Discrete Mathematics, Vol. 14, No. 1 (April 2020), pp. 239–254

Three advantages of non-AI models

It’s hard to imagine doing anything like Midjourney’s image generation without neural networks. The same is true of ChatGPT’s text generation. But a lot of business tasks do not require AI, and in fact would be better off not using AI. Three reasons why:

  1. Statistical models are easier to interpret. When a model has a dozen parameters, you can inspect each one. When a model has a billion parameters, not so much.
  2. Statistical models are more robust. Neural networks perform well on average, but fail in bizarre ways, and there is no way to explore all the possible edge cases. The edge cases of a statistical model can be enumerated, examined, and mitigated.
  3. Statistical models are not subject to legislation hastily written in response to recent improvements in AI. The chances that such legislation will have unintended consequences are roughly 100%.

Eccentricity of bivariate normal level sets

Suppose you have a bivariate normal distribution with correlation ρ where

f(x,y) = \frac{1}{2\pi \sqrt{1-\rho^2}} \exp\left(-\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)} \right)

Then the level sets of the density function are ellipses, and the eccentricity e of the ellipses is related to the correlation ρ by

\begin{align*} \rho &= \pm \frac{e^2}{2 - e^2} \\ e &= \sqrt{\frac{2|\rho|}{1 + |\rho|} } \end{align*}

Plots

For example, suppose ρ = 0.8.

Here’s a plot of the density function f(x, y).

And here is a plot of the contours, the level sets obtained by slicing the plot of the density horizontally.

Eccentricity and aspect ratio

The equation above says that since ρ = 0.8 the eccentricity e should be √(1.6/1.8) = 0.9428.

As I’ve written about several times, eccentricity is a little hard to interpret. It’s a number between 0 and 1, with 0 being a circle and 1 being a degenerate ellipse collapsed to a line segment. Since 0.9428 is closer to 1 than to 0, you might think that an ellipse with such a large value of e is very far from circular. But this is not the case.

Eccentric literally means off-center. Eccentricity measures how far off-center the foci of an ellipse are. This is related to how round or thin an ellipse is, but only indirectly. It is true that such a large value of e means that the foci are close to the ends of the ellipse. In fact, the foci are located at 94.28% of the distance from the center to the end of the semimajor axis. But the aspect ratio is more moderate.

It’s easier to imagine the shape of an ellipse from the aspect ratio than from the eccentricity. Aspect ratio r is related to eccentricity e by

r = \sqrt{1-e^2}

Using the expression above for eccentricity e as a function of correlation ρ we have

r = \sqrt{\frac{1 - |\rho|}{1 + |\rho|}}

This says our aspect ratio should be 1/3. So although our ellipse is highly eccentric—the foci are near the ends—it isn’t extraordinarily thin. It’s three times longer than it is wide, which matches the contour plot above.
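
Here is a short Python check, a sketch of my own rather than anything from the post. The level sets are determined by the correlation matrix, whose eigenvalues are 1 ± ρ, so the semi-axes of the ellipses are proportional to √(1 ± ρ).

    import numpy as np

    # Compare the closed-form eccentricity and aspect ratio for rho = 0.8
    # with the values implied by the eigenvalues of the correlation matrix.
    rho = 0.8
    e = np.sqrt(2*abs(rho) / (1 + abs(rho)))      # eccentricity, ~0.9428
    r = np.sqrt((1 - abs(rho)) / (1 + abs(rho)))  # aspect ratio, 1/3

    evals = np.linalg.eigvalsh([[1, rho], [rho, 1]])  # ascending: 1 - rho, 1 + rho
    semi_minor, semi_major = np.sqrt(evals)
    print(r, semi_minor / semi_major)                    # both ~0.3333
    print(e, np.sqrt(1 - (semi_minor / semi_major)**2))  # both ~0.9428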


Bounds on the incomplete beta function (beta CDF)

The incomplete beta function is defined by

B(x, y, z) = \int_0^z t^{x-1} (1-t)^{y-1} \, dt

It is called incomplete because the limit of integration doesn’t necessarily run all the way up to 1. When z = 1 the incomplete beta function is simply the beta function

B(x, y) = B(x, y, 1)

discussed in the previous post [1].

The incomplete beta function is proportional to the CDF of a beta random variable, and the proportionality constant is 1/B(x, y). That is, if X is a beta(a, b) random variable, then

\text{Prob}(X \leq x) = \frac{B(a, b, x)}{B(a, b)}

The right-hand side of the equation above is often called the regularized incomplete beta function. This is what an analyst would call it. A statistician would call it the beta distribution function.

The previous post gave simple bounds on the (complete) beta function. This post gives bounds on the incomplete beta function. These bounds are more complicated but also more useful. The incomplete beta function is more difficult to work with than the beta function, and so upper and lower bounds are more appreciated.

One application of these bounds would be to speed up software that evaluates beta probabilities. If you only need to know whether a probability is above or below some threshold, you could first compute an upper or lower bound. Maybe computing the exact probability is unnecessary because you can decide from your bounds whether the probability is above or below your threshold. If not, then you compute the exact probability. This is less work if your bounds are usually sufficient, and more work if they’re usually not. In the former case this could result in a significant speed up because the incomplete beta function is relatively expensive to evaluate.

Cerone, cited in the previous post, gives two lower bounds, L1 and L2, and two upper bounds, U1 and U2, for B(x, y, z). Therefore the maximum of L1 and L2 is a lower bound and the minimum of U1 and U2 is an upper bound. The bounds are complicated, but easier to work with than the incomplete beta function.

\begin{align*} L_1 &= \frac{z^{x-1}}{y} \left[ \left(1 - z + \frac{z}{x} \right)^y - (1-z)^y \right] \\ U_1 &= \frac{z^{x-1}}{y} \left[ 1 - \left(1 - \frac{z}{x}\right)^y\right] \\ L_2 &= \frac{\lambda(z)^x}{x} + (1-z)^{y-1}\, \frac{z^x - \lambda(z)^x}{x} \\ U_2 &= (1-z)^{y-1} \frac{\left( x - \lambda(z) \right)^x}{x} + \frac{z^x - \left(z - \lambda(z)\right)^x}{x} \\ \end{align*}

where

\lambda(z) = \frac{1 - (1-z)[1 - z(1-y)]}{y\left[1-(1-z)^{y-1}\right]}
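
Here is a hedged SciPy sketch that checks the simpler pair L1 and U1, implemented directly from the formulas as transcribed above; the function names are mine. SciPy's betainc is the regularized incomplete beta function, so multiplying by the complete beta function recovers B(x, y, z).

    from scipy.special import beta, betainc

    def incomplete_beta(x, y, z):
        # B(x, y, z) = regularized incomplete beta times the complete beta
        return betainc(x, y, z) * beta(x, y)

    def L1(x, y, z):
        return z**(x-1) / y * ((1 - z + z/x)**y - (1 - z)**y)

    def U1(x, y, z):
        return z**(x-1) / y * (1 - (1 - z/x)**y)

    x, y, z = 3, 5, 0.4
    print(L1(x, y, z), incomplete_beta(x, y, z), U1(x, y, z))
    # roughly 0.0043 <= 0.0055 <= 0.0164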


[1] Incidentally, Mathematica overloads the Beta function just as we overload B. Mathematica’s Beta[x, y] corresponds to our B(x, y), but the three-argument version of Beta, the incomplete beta function, puts the limit of integration z first. That is, our B(x, y, z) corresponds to Beta[z, x, y] in Mathematica.

Upper and lower bounds on the beta function

The beta function B(x, y) is defined by

B(x, y) = \int_0^1 t^{x-1} (1-t)^{y-1}\, dt

and is the normalizing constant for the beta probability distribution. It is related to the gamma function via

B(x, y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)}

The beta function comes up often in applications. It can be challenging to work with, however, and so estimates for the function are welcome.

The function 1/xy gives a simple approximation and upper bound for B(x, y). Alzer [1] proved that when x > 1 and y > 1

0 \leq \frac{1}{xy} - B(x,y) \leq b

where the constant b is defined by

b = \max_{x\geq1}\left( \frac{1}{x^2} - \frac{\Gamma(x)^2}{\Gamma(2x)} \right) = 0.08731\ldots

Cerone [2] gives a different bound which varies with x and y and is usually better than Alzer’s bound. For x and y greater than 1, Cerone shows

0 \leq \frac{1}{xy} - B(x,y) \leq C(x) C(y) \leq b'

where

C(x) = \frac{x-1}{x\sqrt{2x-1}}

and

b' = \max_{x > 1} C(x)^2 = C\left(\frac{3 + \sqrt{5}}{2}\right)^2 \approx 0.0902

Cerone’s bound is slightly larger in the worst case, near x = y = (3 + √5)/2, but is smaller in general.
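
Here is a quick numerical comparison, my own sketch rather than anything from either paper, over a grid with x and y greater than 1.

    import numpy as np
    from scipy.special import beta

    def C(x):
        return (x - 1) / (x * np.sqrt(2*x - 1))

    b_alzer = 0.08731
    xs = np.linspace(1.001, 10, 300)
    X, Y = np.meshgrid(xs, xs)
    err = 1/(X*Y) - beta(X, Y)               # the quantity being bounded

    print(np.all(err >= -1e-12))             # the error is nonnegative
    print(np.all(err <= C(X)*C(Y) + 1e-12))  # Cerone's bound holds
    print(err.max(), (C(X)*C(Y)).max(), b_alzer)

On such a grid the maximum error should come out near Alzer's constant, while the maximum of C(x) C(y) is slightly larger, consistent with the comparison above.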

The difference between B(x, y) and 1/xy is largest when x or y is small. We can visualize this with the Mathematica command

    Plot3D[Beta[x, y] - 1/(x y), {x, 0.5, 2.5}, {y, 0.5, 2.5}]

which produces the following plot.

The plot dips down in the corner where x and y are near 0.5 and curls upward on the edges where one of the variables is near 0.5 and the other is not.

Let’s look at B(x, y) and 1/xy along a diagonal slice (3t, 4t).

This suggests that approximating B(x, y) with 1/xy works best when the arguments are either small or large, with the maximum difference being when the arguments are moderate-sized. In the plot we see B(3, 4) is not particularly close to 1/12.

Next let’s look at 1/xy − B(x, y) along the same diagonal slice.

This shows that the error bound C(x) C(y) is not too tight, but better than the constant bound except near the maximum of 1/xy − B(x, y).


[1] H. Alzer. Monotonicity properties of the Hurwitz zeta function. Canadian Mathematical Bulletin 48 (2005), 333–339.

[2] P. Cerone. Special functions: approximations and bounds. Applicable Analysis and Discrete Mathematics, Vol. 1, No. 1 (2007), pp. 72–91.

Beta inequalities and cross ratios

When I worked at MD Anderson Cancer Center, we spent a lot of compute cycles evaluating the function g(a, b, c, d), defined as the probability that a sample from a beta(a, b) random variable is larger than a sample from a beta(c, d) random variable. This function was often in the inner loop of simulations that ran for hours or even days.

I developed ways to evaluate this function more efficiently because it was a bottleneck. Along the way I found a new symmetry. W. R. Thompson had studied what I call the function g back in 1933 and reported two symmetries:

g(a, b, c, d) = 1 − g(c, d, a, b)

and

g(a, b, c, d) = g(d, c, b, a).

I found that

g(a, b, c, d) = g(d, b, c, a)

as well. See a proof here.
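
Here is a brute-force numerical check of these symmetries, a sketch of my own and not the efficient evaluation mentioned above. It computes g by integrating the survival function of the first beta variable against the density of the second.

    from scipy import stats
    from scipy.integrate import quad

    def g(a, b, c, d):
        # Prob(X > Y) with X ~ beta(a, b) and Y ~ beta(c, d)
        integrand = lambda y: stats.beta.sf(y, a, b) * stats.beta.pdf(y, c, d)
        return quad(integrand, 0, 1)[0]

    a, b, c, d = 2.0, 3.5, 1.7, 4.2   # arbitrary test values
    print(g(a, b, c, d))
    print(g(d, c, b, a))       # Thompson's reversal symmetry
    print(g(d, b, c, a))       # the new symmetry
    print(1 - g(c, d, a, b))   # complement symmetry

All four printed values should agree.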

You can conclude from these rules that

  1. g(a, b, c, d) = g(d, c, b, a) = g(d, b, c, a) = g(a, c, b, d)
  2. g(a, b, c, d) = 1 − g(c, d, a, b)

I was just looking at a book that mentioned the symmetries of the cross ratio which I will denote

r(a, b, c, d) = (a − c)(b − d) / (b − c)(a − d).

Here is Theorem 4.2 from [1] written in my notation.

Let a, b, c, d be four points on a projective line with cross ratio r(a, b, c, d) = λ. Then we have

    1. r(a, b, c, d) = r(b, a, d, c) = r(c, d, a, b) = r(d, c, b, a).
    2. r(a, b, d, c) = 1/λ
    3. r(a, c, b, d) = 1 − λ
    4. the values for the remaining permutations are consequences of these three basic rules.

This looks awfully familiar. Rules 1 and 3 for cross ratios correspond to rules 1 and 2 for beta inequalities, though not in the same order. Both g and r are invariant under reversing their arguments, but are otherwise invariant under different permutations of the arguments.

Both g and r take on 6 distinct values as their arguments are permuted, with each value occurring 4 times. I feel like there is some deeper connection here but I can’t see it. Maybe I’ll come back to this later when I have the time to explore it. If you see something, please leave a comment.

There is no rule for beta inequalities analogous to rule 2 for cross ratios, at least not that I know of. I don’t know of any connection between g(a, b, c, d) and g(a, b, d, c).

Update: There cannot be a function h such that g(a, b, d, c) = h(g(a, b, c, d)), because I found parameters that lead to the same value of g(a, b, c, d) but different values of g(a, b, d, c). If there is a relation between g(a, b, c, d) and g(a, b, d, c), it must involve the parameters and not just the value of g.

[1] Jürgen Richter-Gebert. Perspectives on Projective Geometry: A Guided Tour Through Real and Complex Geometry. Springer 2011.


Collecting a large number of coupons

This post is an addendum to the recent post Reviewing a thousand things. We’re going to look again at the coupon collector problem, randomly sampling a set of N things with replacement until we’ve collected one of everything.

As noted before, for large N the expected number of draws before you’ve seen everything at least once is approximately

N( log(N) + γ )

where γ is Euler’s constant.

A stronger result using a Chernoff bound says that for any constant c, the probability that the number of draws exceeds

N( log(N) + c )

approaches

1 − exp( − exp(−c))

as N increases. See [1].

I said in the earlier post that if you were reviewing a thousand things by sampling flash cards with replacement, there’s a 96% chance that after drawing 10,000 cards you would have seen everything. We could arrive at this result using the theorem above.

If N = 1,000 and we want N( log(N) + c ) to equal 10,000, we set c = 3.092. Then we find

1 − exp( − exp(−c) ) = 0.0444

which is consistent with saying there’s a 96% chance of having seen everything, 95.56% if you want to be more precise.
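
Here is the arithmetic as a short Python sketch.

    from math import log, exp

    N, draws = 1000, 10_000
    c = draws/N - log(N)        # solve N(log N + c) = draws for c
    print(c)                    # about 3.092
    print(1 - exp(-exp(-c)))    # about 0.0444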


[1] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.


Beta-binomial with given mean and variance

The previous post looked at an application of the beta-binomial distribution. The probability mass function for a beta-binomial with parameters n, a, and b is given by

\text{Prob}(X = x) = \binom{n}{x} \frac{B(x + a, n - x + b)}{B(a, b)}

The mean μ and the variance σ² are given by

\begin{align*} \mu &= \frac{na}{a+b} \\ \sigma^2 &= \frac{nab(a + b + n)}{(a+b)^2(a + b + 1)} \end{align*}

Solving for a and b to meet a specified mean and variance appears at first to require solving a cubic equation, but it’s easier than that.

If we define p = a/(a+b) then the system becomes

\begin{align*} \mu &= np \\ \sigma^2 &= n p (1-p)\frac{a + np}{a + p} \end{align*}

We assume n is known and so p is known, and we are left with a linear equation for a. With a little work we find

\begin{align*} a &= -\frac{\mu(\mu^2 - \mu n + \sigma^2)}{\mu^2 - \mu n + \sigma^2 n} \\ b &= \frac{a(n - \mu)}{\mu} \end{align*}

I verified the calculations above with Mathematica.
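
Here is a small Python sketch of the same calculation; the function names are mine. It solves for a and b and then plugs them back into the moment formulas as a check.

    def beta_binomial_ab(n, mu, sigma2):
        # invert the mean and variance formulas above
        a = -mu * (mu**2 - mu*n + sigma2) / (mu**2 - mu*n + sigma2*n)
        b = a * (n - mu) / mu
        return a, b

    def beta_binomial_moments(n, a, b):
        mu = n*a / (a + b)
        sigma2 = n*a*b*(a + b + n) / ((a + b)**2 * (a + b + 1))
        return mu, sigma2

    n, mu, sigma2 = 10, 4.0, 3.0   # sigma2 must exceed the binomial variance 2.4
    a, b = beta_binomial_ab(n, mu, sigma2)
    print(a, b)                            # 14.0, 21.0
    print(beta_binomial_moments(n, a, b))  # recovers (4.0, 3.0)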

One use for solving for a and b would be to fit a beta-binomial distribution to data using moment matching. I don’t know how robust this would be, but at least it’s something.

Another application would be to find a beta-binomial prior distribution with elicited mean and variance.

Babies and the beta-binomial distribution

About half of children are boys and half are girls, but that doesn’t mean that every couple is equally likely to have a boy or a girl each time they conceive a child. And evidence suggests that indeed the probability of conceiving a girl varies per couple.

I will simplify things for this post and look at a hypothetical situation abstracting away the complications of biology. This post fills in the technical details of a thread I posted on Twitter this morning.

Suppose the probability p that a couple will have a baby girl has some distribution centered at 0.5 and symmetric about that point. Then half of all births on the planet will be girls, but that doesn’t mean that a particular couple is equally likely to have a boy or a girl.

How could you tell the difference empirically? You couldn’t if every family had one child. But suppose you studied all families with four children, for example. You’d expect 1 in 16 such families to have all boys, and 1 in 16 to have all girls. If the proportions are higher than that, and they are, then that suggests that p, the probability of a couple having a girl, is not the same for every couple.

Suppose the probability of a couple having girls has a beta(a, b) distribution. We would expect a and b to be approximately equal, since about half of babies are girls, and we’d expect a and b to be large, i.e. for the distribution to be fairly concentrated around 1/2. For example, here’s a plot with a = b = 100.

Then the probability distribution for the number of girls in a family of n children is given by a beta-binomial distribution with parameters n, a, and b. That is, the probability of x girls in a family of size n is given by

\text{Prob}(X = x) = \binom{n}{x} \frac{B(x + a, n - x + b)}{B(a, b)}

The mean of this distribution is na/(a+b). And so if a = b then the mean is n/2, half girls and half boys.

But the variance is more interesting. The variance is

\frac{nab(a + b + n)}{(a + b)^2(a + b +1)} = n \,\,\frac{a}{a + b} \,\,\frac{b}{a + b} \,\,\frac{a + b + n}{a + b + 1}

The variance of a binomial, corresponding to a constant p, is np(1-p). In the equation above, p corresponds to a/(a+b), and (1-p) corresponds to b/(a+b). And there’s an extra term,

\frac{a + b + n}{a + b + 1}

which is larger than 1 when n > 1. This says a beta binomial random variable always has more variance than the corresponding binomial distribution with the same mean.

Now suppose a family has had n children, with g girls and n − g boys. Then the posterior predictive probability of a girl on the next birth is

\frac{a + g}{a + b + n}

If g = n/2 then this probability is 1/2. But if g > n/2 then the probability is greater than 1/2. And the smaller a and b are, the more the probability exceeds 1/2.

The binomial model is the limit of the beta-binomial model as a and b go to infinity (proportionately). In the limit, the probability above equals a/(a+b), independent of g and n.
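
Here is a hedged sketch of the numbers in this post, using a = b = 100 as in the plot above.

    from math import comb, exp
    from scipy.special import betaln

    def beta_binomial_pmf(x, n, a, b):
        # Prob(X = x) = C(n, x) B(x + a, n - x + b) / B(a, b)
        return comb(n, x) * exp(betaln(x + a, n - x + b) - betaln(a, b))

    a = b = 100
    print(beta_binomial_pmf(4, 4, a, b))   # P(all girls in 4 births), ~0.0644
    print(1/16)                            # 0.0625 under a constant p = 1/2

    # posterior predictive probability of a girl after 4 girls in 4 births
    n, g = 4, 4
    print((a + g) / (a + b + n))           # ~0.5098, slightly above 1/2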


Reviewing a thousand things

Suppose you’ve learned a thousand of something, maybe a thousand kanji or a thousand chemicals or a thousand species of beetles. Now you want to review them to retain what you’ve learned.

Now suppose you have a program to quiz you, drawing items from your list at random with replacement. Say you draw 100 items a day. How long until you’ve seen everything at least once?

A naive estimate would be 10 days, but that assumes you never draw the same item twice in 1000 draws. According to the birthday problem principle, there’s a 50-50 chance you’ll see a duplicate by the time you’ve made √1000 ≈ 32 draws. Drawing 100 cards a day, you’ll likely see duplicates the first day.

Coupon collector problem

Random sampling with replacement is not an efficient way to exhaustively sample a large population. The coupon collector problem says that for a set of N items, the expected number of draws before you’ve seen everything once is approximately

N(log N + γ)

for large N. Here γ is Euler’s constant. When N = 1000 this works out to 7485. So if you draw 100 items a day, you’d expect to have seen everything after 75 days. Of course it might take more or less time than that, but that’s the average.

To put it another way, random sampling with replacement is about 7.5 times less efficient than systematic sampling if you’re concerned with expected time to exhaustive sampling.

In general, the relative inefficiency is log N + γ. So for a set of N = 100 things, for example, the relative inefficiency is about 5.2.
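
Here is a quick check of those numbers.

    from math import log
    from numpy import euler_gamma   # Euler's constant, about 0.5772

    N = 1000
    print(N * (log(N) + euler_gamma))  # about 7485 expected draws
    print(log(N) + euler_gamma)        # relative inefficiency, about 7.5
    print(log(100) + euler_gamma)      # about 5.2 for N = 100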

Occupancy problem

We can arrive at approximately the same result using methods from a couple other posts. Last week I wrote about the occupancy problem distribution. That post defined X(N, r) and gives its mean and variance.

When we draw r random samples with replacement from a set of N items, let X(N, r) be the random variable that represents the number of distinct items in the sample. …

The mean of X(N, r) is

N\left( 1 - \left(\frac{N-1}{N}\right)^r \right)

So if we have set of N = 1000 things and sample with replacement 7000 times, we expect to see 999 distinct items on average. Of course this is an average, so we might see less. And we might see more, but not much more!

Probabilities

The approaches above have worked with expected values, not individual probabilities. The coupon collector problem looks at the expected number of draws needed before you see everything. The occupancy problem looks at the expected number of things you’ve seen after a certain number of draws.

We could calculate the probability of having seen everything after r draws using the expression here.

After 7,000 draws, there’s a 40% chance we’ve seen everything. After 7,274 draws the chances of having seen everything are even, i.e. the median number of draws is 7,274, not that different from the mean number of draws estimated above.

After 10,000 draws there is a 96% chance we’ve seen everything.

For more on the distribution of the number of draws necessary, see the follow-on post Collecting a large number of coupons.

A note on computation

The expression used in the previous section relies on Stirling numbers, and these are a bit difficult to compute. The calculation involves very large integers, but we cannot use logarithms because we are adding large numbers, not multiplying them.

This post includes Python code for calculating Stirling numbers using Python’s large integer feature. But the probability we’re after is the ratio of two integers too large to convert to floating point numbers. The following code gets around this problem using the Decimal class.

    from math import comb, log10
    from decimal import Decimal, getcontext

    # k! times the Stirling number of the second kind S(n, k), i.e. the
    # number of ways to map n draws onto k items so that every item is hit
    def k_factorial_stirling(n, k):
        return sum((-1)**i * comb(k, i)*(k-i)**n for i in range(k+1))

    # Probability that r draws with replacement from N items include all N
    def prob(N, r):
        # enough precision to take the ratio of two very large integers
        getcontext().prec = int(r*log10(N)) + 100
        return Decimal( k_factorial_stirling(r, N) ) / Decimal( N**r )

    print( round(prob(1000,  7000), 4) )
    print( round(prob(1000,  7274), 4) )
    print( round(prob(1000, 10000), 4) )

How do we know how many decimal places we’ll need? We’re computing a probability, so the denominator is larger than the numerator. We estimate the number of digits in the denominator, then throw in 100 more just for good measure.