Big correlations and big interactions

An outcome cannot be highly correlated with a large number of independent predictors.

This observation has been called the piranha problem. Predictors are compared to piranha fish. If you have a lot of big piranhas in a small pond, they start eating each other. If you have a lot of strong predictors, they predict each other.

In [1] the authors quantify the piranha effect several ways. I’ll just quote the first one here. See the paper for several other theorems and commentary on their implications.

If X1, …, Xp, y are real-valued random variables with finite non-zero variance, then

\sum_{i=1}^p |\text{corr}(X_i, y)| \leq \sqrt{p + \sum_{i\ne j}|\text{corr}(X_i, X_j)|}

So if the left side is large, the right side must be large as well. The p inside the square root only contributes √p, so if you have many predictors, each with a substantial correlation with y, then the sum of the correlations between the predictors has to be large: strong predictors must be correlated with each other.
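Here’s a quick numerical illustration of the inequality (my own sketch, not from the paper), using independent standard normal predictors and an independent outcome:

    import numpy as np

    rng = np.random.default_rng()
    n, p = 1000, 10

    X = rng.standard_normal((n, p))   # predictors
    y = rng.standard_normal(n)        # outcome

    R = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    lhs = np.abs(R[:p, p]).sum()              # sum of |corr(X_i, y)|
    off_diag = np.abs(R[:p, :p]).sum() - p    # sum of |corr(X_i, X_j)| over i != j
    rhs = np.sqrt(p + off_diag)

    print(lhs, rhs, lhs <= rhs)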


[1] The piranha problem: large effects swimming in a small pond. Available on arXiv.

Design of experiments and design theory

Design of experiments is a branch of statistics, and design theory is a branch of combinatorics, and yet they overlap quite a bit.

It’s hard to say precisely what design theory is, but it’s concerned with whether objects can be arranged in certain ways, and if so, how many ways this can be done. Design theory is pure mathematics, but it is of interest to people working in areas of applied mathematics such as coding theory and statistics.

Here’s a recap of posts I’ve written recently related to design of experiments and design theory.

Design of Experiments

A few weeks ago I wrote about fractional factorial design. Then later I wrote about response surface models. Then a diagram from central composite design, a popular design in response surface methodology, was one of the diagrams in a post I wrote about visually similar diagrams from separate areas of application.

I wrote two posts about pitfalls with A/B testing. One shows how play-the-winner sequential testing has the same problems as Condorcet’s voter paradox, with the order of the tests potentially determining the final winner. More seriously, A/B testing cannot detect interaction effects, which may be critical.

ANSI and Military Standards

There are several civilian and military standards related to design of experiments. The first of these was MIL-STD-105. The US military has retired this standard in favor of the civilian standard ASQ/ANSI Z1.4 which is virtually identical.

Similarly, the US military standard MIL-STD-414 was replaced by the very similar civilian standard ASQ/ANSI Z1.9. This post looks at the mean-range method for estimating variation which these two standards reference.

Design Theory

I wrote a couple posts on Room squares, one on Room squares in general and one on Thomas Room’s original design now known as a Room square. Room squares are used in tournament designs.

I wrote a couple posts about Costas arrays, an introduction and a post on creating Costas arrays in Mathematica.

Latin Squares

Latin squares and Greco-Latin squares are part of design theory and part of design of experiments. Here are several posts on Latin and Greco-Latin squares.

Entropy of a Student t distribution

I was looking up the entropy of a Student t distribution and something didn’t seem right, so I wanted to look at familiar special cases.

The Student t distribution with ν degrees of freedom has two important special cases: ν = 1 and ν = ∞. When ν = 1 we get the Cauchy distribution, and in the limit as ν → ∞ we get the normal distribution. The expression for entropy is simple in these two special cases, but it’s not at all obvious that the general expression at ν = 1 and ν = ∞ gives the entropy for the Cauchy and normal distributions.

The entropy of a Cauchy random variable (with scale 1) is

\log(4 \pi)

and the entropy of a normal random variable (with scale 1) is

(\log(2\pi) + 1)/2

The entropy of a Student t random variable with ν degrees of freedom is

\frac{\nu+1}{2} \left( \psi\left(\frac{\nu + 1}{2}\right) - \psi\left(\frac{\nu}{2}\right) \right) + \log\left( \sqrt{\nu} B\left(\frac{\nu}{2}, \frac{1}{2} \right) \right)

Here ψ is the digamma function, the derivative of the log of the gamma function, and B is the beta function. These two functions are implemented as psi and beta in SciPy (scipy.special), and as PolyGamma and Beta in Mathematica. The equation for the entropy can be found on Wikipedia.

This post will show numerically and analytically that the general expression does have the right special cases. As a bonus, we’ll prove an asymptotic formula for the entropy along the way.

Numerical evaluation

Numerical evaluation shows that the entropy expression with ν = 1 does give the entropy for a Cauchy random variable.

    from numpy import pi, log, sqrt
    from scipy.special import psi, beta

    def t_entropy(nu):
        S = 0.5*(nu + 1)*(psi(0.5*(nu+1)) - psi(0.5*nu))
        S += log(sqrt(nu)*beta(0.5*nu, 0.5))
        return S

    cauchy_entropy = log(4*pi)
    print(t_entropy(1) - cauchy_entropy)

This prints 0.

Experiments with large values of ν show that the entropy for large ν is approaching the entropy for a normal distribution. In fact, it seems the difference between the entropy for a t distribution with ν degrees of freedom and the entropy of a standard normal distribution is asymptotic to 1/ν.

    normal_entropy = 0.5*(log(2*pi) + 1)
    for i in range(5):
        print(t_entropy(10**i)- normal_entropy)

This prints

    1.112085713764618
    0.10232395977100861
    0.010024832113557203
    0.0010002498337291499
    0.00010000250146458001

Analytical evaluation

There are tidy expressions for the ψ function at a few special arguments, including 1 and 1/2. And the beta function has a special value at (1/2, 1/2).

We have ψ(1) = -γ and ψ(1/2) = -2 log 2 – γ, where γ is the Euler–Mascheroni constant. When ν = 1 the leading factor (ν + 1)/2 equals 1, so the first half of the expression for the entropy reduces to ψ(1) − ψ(1/2) = 2 log 2. Also, B(1/2, 1/2) = π, so the second half reduces to log(√1 · π) = log π. Adding these together we get 2 log 2 + log π, which is the same as log 4π.

For large z, we have the asymptotic series

\psi(z) \sim \log(z) - \frac{1}{2z}

See, for example, A&S 6.3.18. We’ll also need the well-known fact that log(1 + z) ∼ z for small z. Then

\begin{align*} \frac{\nu+1}{2} \left( \psi\left(\frac{\nu + 1}{2}\right) - \psi\left(\frac{\nu}{2}\right) \right) &\sim \frac{\nu + 1}{2}\left( \log\left(\frac{\nu+1}{\nu}\right) - \frac{1}{\nu+1} + \frac{1}{\nu}\right ) \\ &\sim \frac{\nu+1}{2}\left(\frac{1}{\nu} - \frac{1}{\nu+1} + \frac{1}{\nu} \right) \\ &= \frac{1}{2} + \frac{1}{\nu} \end{align*}

Next we use the definition of the beta function as a ratio of gamma functions, the fact that Γ(1/2) = √π, and the asymptotic formula Γ(z + a)/Γ(z) ∼ z^a for large z to find that

B\left(\frac{\nu}{2}, \frac{1}{2} \right ) = \frac{\Gamma(\nu/2) \,\Gamma(1/2)}{\Gamma((\nu +1)/2)} \sim \sqrt{\frac{2\pi}{\nu}}

This shows that the entropy of a Student t random variable with ν degrees of freedom is asymptotically

\frac{1}{2} + \frac{\log 2\pi}{2} + \frac{1}{\nu}

for large ν. This shows that we do indeed get the entropy of a normal random variable in the limit, and that the difference between the Student t and normal entropies is asymptotically 1/ν, proving the conjecture inspired by the numerical experiment above.
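As a quick numerical check of the 1/ν claim, we can reuse the t_entropy function and normal_entropy from the code above: if the difference is asymptotically 1/ν, then ν times the difference should approach 1.

    for nu in [10, 100, 1000, 10000]:
        print(nu*(t_entropy(nu) - normal_entropy))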

Robustness of mean range

Let’s suppose we have data that comes from a distribution that is approximately normal but has a heavier right tail, specifically a gamma distribution with shape 6.

[Figure: PDF of a gamma(6, 1) distribution]

We’d like to estimate the standard deviation of the data source. If the data were normally distributed, the sample standard deviation would be the most efficient unbiased estimator. But what about this example, where the data are only roughly normally distributed? How well does the sample standard deviation compare to the mean range? Following the convention of ASQ/ANSI Z1.9 we’ll average the ranges of samples of 5.

The variance of a gamma random variable with scale 1 is equal to its shape parameter, so our data has variance 6, and so standard deviation √6. We will pretend that we don’t know this and see how well we can infer this value using random samples. We’ll compare the statistical efficiency of the mean of the ranges of k samples of 5 each and the standard deviation based on 5k samples for k = 2 and 4.

First we’ll use the following code to estimate the population standard deviation with 10 samples two ways: with sample standard deviation and with the average of the range of two samples of 5. We’ll take the absolute value of the difference between each estimate and the true value, and repeat this 1,000 times to simulate the mean absolute error.

    from scipy.stats import gamma, t
    
    def meanrange(y, k):
        scale = 0.430 # 1/d2, where d2 = 2.326 is the expected range of 5 standard normal samples
        s = 0
        for i in range(k):
            sample = y[5*i:5*(i+1)]
            s += max(sample) - min(sample)
        return scale*s/k
    
    n = 5
    k = 2
    shape = 6
    sigma = shape**0.5
    
    stderr = 0
    mrerr = 0
    N = 1000
    for _ in range(N):
        y = gamma(shape).rvs(size=n*k)    
        std = y.std(ddof=1)
        mr = meanrange(y, k)
        stderr += abs(std-sigma)
        mrerr += abs(mr-sigma)
    print(stderr/N, mrerr/N)    

The mean absolute error was 0.532 using sample standard deviation and 0.550 using mean range.

When I changed k to 4, the mean absolute error using the sample standard deviation was 0.377. The mean absolute error using the range of 4 samples of 5 each time was 0.402.

The error in both methods was comparable, though the sample standard deviation did slightly better.

Next I repeated the experiment drawing samples from a Student t distribution with 3 degrees of freedom, a distribution with heavy tails.

[Figure: PDF of a Student t distribution with ν = 3]

When k = 2, the mean absolute errors were 0.603 and 0.574 for sample standard deviation and mean range respectively. When k = 4, the errors were 0.484 and 0.431. Again the errors are comparable, but this time the mean range was more accurate.
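The code for this second experiment isn’t shown above, but here’s a sketch of how it might go, reusing the meanrange function and the values of n, k, and N from the previous block. The standard deviation of a t distribution with ν degrees of freedom is √(ν/(ν − 2)) for ν > 2, which is √3 here.

    nu = 3
    sigma = (nu/(nu - 2))**0.5  # standard deviation of a t(3) random variable

    stderr = 0
    mrerr = 0
    for _ in range(N):
        y = t(nu).rvs(size=n*k)
        stderr += abs(y.std(ddof=1) - sigma)
        mrerr += abs(meanrange(y, k) - sigma)
    print(stderr/N, mrerr/N)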


Using mean range method to measure variability

The most common way to measure variability, at least for data coming from a normal distribution, is standard deviation. Another less common approach is to use mean range. Standard deviation is mathematically simple but operationally a little complicated. Mean range, on the other hand, is complicated to analyze mathematically but operationally very simple.

ASQ/ANSI Z1.9

The ASQ/ANSI Z1.9 standard, Sampling Procedures and Tables for Inspection by Variables for Percent Nonconforming, gives several options for measuring variability, and one of these is the mean range method. Specifically, several samples of five items each are drawn, and the average of the ranges is the variability estimate. The ANSI Z1.9 standard grew out of, and is still very similar to, the US military standard MIL-STD-414 from 1957. The ANSI standard, last updated in 2018, is not that different from the military standard from six decades earlier.

The mean range method is obviously simple to carry out: take five measurements, subtract the smallest value from the largest, and write that down. Repeat this a few times and average the numbers you wrote down. No squares, no square roots, easy to carry out manually. This was obviously a benefit in 1957, but not as much now that computers are ubiquitous. The more important advantage today is that the mean range can be more robust for heavy-tailed data, as the robustness experiment above shows.
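Here’s a minimal sketch of the calculation in Python, drawing the samples from a normal distribution just for illustration:

    import numpy as np

    rng = np.random.default_rng()

    # four samples of five observations each
    samples = rng.normal(loc=10, scale=2, size=(4, 5))
    ranges = samples.max(axis=1) - samples.min(axis=1)
    print(ranges.mean())  # the mean range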

Probability distribution

The distribution of the range of a sample is not simple to write down, even when the samples come from a normal distribution. There are nice asymptotic formulas for the range as the sample size goes to infinity, but five is a bit far from infinity [1].

This is a problem that was thoroughly studied decades ago. The random variable obtained by averaging k ranges from samples of n elements each is denoted

\overline{W}_{n,k}

or sometimes without the subscripts.

Approximate distribution

There are several useful approximations for the distribution of this statistic. Pearson [2] evaluated several proposed approximations and found that the best one for n < 10 (as in our case, with n = 5) is

\frac{\overline{W}}{\sigma} \approx \frac{c \chi_\nu}{\sqrt{\nu}}

Here σ is the standard deviation of the population being sampled, and the values of c and ν vary with n and k. For example, when n = 5 and k = 4, the value of ν is 14.7 and the value of c is 2.37. The value of c doesn’t change much as k gets larger, though the value of ν does [3].

Note that the approximating distribution is chi, not the more familiar chi square. (I won’t forget a bug I had to chase down years ago that was the result of seeing χ in a paper and reading it as χ². Double, triple, quadruple check, everything looks right. Oh wait …)

For particular values of n and k you could use the approximate distribution to find how to scale mean range to a comparable standard deviation, and to assess the relative efficiency of the two methods of measuring variation. The ANSI/ASQ Z1.9 standard gives tables for acceptance based on mean range, but doesn’t go into statistical details.
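For example, here’s a sketch (mine, not from the standard) of how the approximation could be used with SciPy, plugging in the n = 5, k = 4 constants quoted above:

    from scipy.stats import chi

    c, nu = 2.37, 14.7   # Pearson's constants for n = 5, k = 4
    approx = chi(nu)     # chi distribution with nu degrees of freedom

    # E[mean range]/sigma: dividing an observed mean range by this number
    # gives an estimate of sigma. It should come out near 2.33.
    scale_factor = c*approx.mean()/nu**0.5

    # Standard deviation of (mean range)/sigma, a measure of the estimator's spread
    spread = c*approx.std()/nu**0.5

    print(scale_factor, spread)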


[1] Of course every finite number is far from infinity. But the behavior at some numbers is quite close to the behavior at infinity. Asymptotic estimates are not just an academic exercise. They can give very useful approximations for finite inputs—that’s why people study them—sometimes even for inputs as small as five.

[2] E. S. Pearson (1952) Comparison of two approximations to the distribution of the range in small samples from normal populations. Biometrika 39, 130–6.

[3] H. A. David (1970). Order Statistics. John Wiley and Sons.

Using Python as a statistical calculator

This post is for someone unfamiliar with SciPy who wants to use Python to do basic statistical calculations. More specifically, this post will look at working with common families of random variables—normal, exponential, gamma, etc.—and how to calculate probabilities with these.

All the statistical functions you need will be in the stats subpackage of SciPy. You could import everything with

    >>> from scipy.stats import *

This will make software developers cringe because it’s good software development practice to only import what you need [1]. But we’re not writing software here; we’re using Python as a calculator.

Distributions

Every distribution in SciPy has a location parameter loc, which defaults to 0, and a scale parameter scale, which defaults to 1.

The table below lists additional parameters that some common distributions have.

    Distribution     SciPy name   Parameters
    normal           norm         (none)
    exponential      expon        (none)
    beta             beta         a, b
    binomial         binom        n, p
    chi-squared      chi2         df
    F                f            dfn, dfd
    gamma            gamma        a (shape)
    hypergeometric   hypergeom    M, n, N
    Poisson          poisson      mu
    Student t        t            df

When using any statistical software it’s good to verify that the software is using the parameterization you expect. For example, the exponential distribution can be parameterized by rate or by scale. SciPy does everything by scale. One way to test the parameterization is to compute the mean. For example, if the mean of an exp(100) random variable is 100, your software is using the scale parameterization. If the mean is 1/100, it’s using the rate parameterization.
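For example, here’s a quick check that SciPy uses the scale parameterization for the exponential distribution:

    >>> from scipy.stats import expon
    >>> expon(scale=100).mean()
    100.0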

Functions

For a random variable X from one of the families above, we’ll show how to compute Prob(X ≤ x) and Prob(X > x), and how to solve for x given one of these probabilities. And for discrete distributions, we’ll show how to find Prob(X = x).

CDF: Cumulative distribution function

The probability Prob(X ≤ x) is the CDF of X at x. For example, the probability that a standard normal random variable is less than 2 is

    norm.cdf(2)

We didn’t have to specify the location or scale because the standard normal uses default parameters. If X had mean 3 and standard deviation 10 we’d call

    norm.cdf(2, loc=3, scale=10)

For another example, suppose we’re working with a gamma distribution with shape 3 and scale 5 and want to know the probability of such a random variable taking on a value less than 2. Then we’d call

    gamma.cdf(2, 3, scale=5)

The second argument, 3, is the shape parameter.

CCDF: Complementary cumulative distribution function

The probability Prob(X > x) is the complementary CDF of X at x, which is 1 minus the CDF at x. SciPy uses the name sf for the CCDF. Here “sf” stands for survival function.

Why have both a CDF and CCDF function? One reason is convenience, but a more important reason is numerical accuracy. If a probability p is very near 1, then 1 – p will be partially or completely inaccurate.

For example, let X be a standard normal random variable. The probability that X is greater than 9 is

    norm.sf(9)

which is on the order of 10⁻¹⁹. But if we compute

    1 - norm.cdf(9)

we get exactly 0. Floating point numbers have 16 figures of precision, and we need 19 to represent the difference between our probability and 1. More on why mathematical libraries have functions that seem unnecessary at first here.

Inverse CDF

SciPy calls the inverse CDF function ppf for percent point function.

For example, suppose X is a gamma random variable with shape 3 and scale 5, and we want to solve for the value of x such that Prob(X ≤ x) = 0.7. We could do this with

    gamma.ppf(0.7, 3, scale=5)

Inverse CCDF

SciPy calls the inverse CCDF function isf for inverse survival function. So, for example,

    gamma.isf(0.3, 3, scale=5)

should return the same value as the example above because the point where the right tail has probability 0.3 is the same as the point where the left tail has probability 0.7.

Probability mass function

For a discrete random variable X we can compute the probability mass function, the probability that X takes on a value x. Unsurprisingly SciPy calls this function pmf.

For example, the probability of seeing 6 heads out of 10 flips of a fair coin is computed by

    binom.pmf(6, 10, 0.5)

Getting help

To find out more about a probability distribution, call help with that distribution as an argument. For example, you could call

    help(hypergeom)

to read more about the hypergeometric distribution.

For much more information, including topics not addressed here, see the documentation for scipy.stats.

***

[1] The reason not to import more than you need is to reduce namespace confusion. If you see a function foo, is that the foo function from package A or from package B? The most common way this comes up in statistics is confusion over the gamma function and the gamma distribution. In SciPy, the gamma function is in scipy.special while the gamma distribution is in scipy.stats. If you’ve imported everything from scipy.stats then gamma means scipy.stats.gamma. You could bring the gamma function into your environment by importing it under another name. For example

    from scipy.special import gamma as gammaf

would import the gamma function giving it the name gammaf.

Runs of letters and digits in passwords

If a password of length n is made up of lower case letters and digits, with each symbol appearing with equal probability, how many times would you expect to shift between letters and digits when typing the password? If you’re typing the password on a mobile device, how many times would you change between alphabetic to numeric entry modes?

This page shows how to compute the expected number of runs in a sequence of binary outcomes when the two outcomes have probabilities p and 1 − p. The number we’re after is one less than the number of runs, because the first run does not require changing input modes. The expected number of input mode changes is

μ = 2(n – 1)p(1 – p).

In our case we can take p = 26/36. It doesn’t matter whether p represents the probability of a letter or a digit because μ is symmetric in p and 1-p.

For moderately large n, the number of mode changes is approximately proportional to n. So for a password of n characters we should expect approximately

2 × 26/36 × 10/36 n ≈ 0.4 n

mode changes.

The variance in the number of mode changes is

σ² = 2pq(2n – 3 – 2pq (3n – 5))

which for large n is approximately

4npq(1 – 3pq).

In our case this reduces to

4 × 26/36 × 10/36 × (1 – 3 × 26/36 × 10/36) n ≈ 0.32 n.

The standard deviation is the square root of variance, which is approximately

σ ≈ 0.57 √n

in our example.

So typing a password of n characters will often require around

0.4 n ± 0.57 √n

mode changes.

For example, typing a 20-character password would often require between 5 and 11 input mode changes.
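Here’s a quick check of the 20-character example, using the exact formulas above rather than the large-n approximations:

    n = 20
    p = 26/36
    q = 1 - p
    mu = 2*(n - 1)*p*q
    sigma = (2*p*q*(2*n - 3 - 2*p*q*(3*n - 5)))**0.5
    print(mu, sigma)  # roughly 7.6 and 2.4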


Postscript

Variations on this problem come up fairly often. For example, yesterday I needed to create a spreadsheet with a permutation of the numbers 1 through 26. I had a spreadsheet with a column containing a permutation of the numbers 1 through 100, so I copied it and deleted the rows with numbers greater than 26. The math above shows that on average you’d expect to switch between runs of rows to keep and runs of rows to delete about 38 times.
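(Here n = 100 and p = 26/100, so the expected number of changes is 2 × 99 × 0.26 × 0.74 ≈ 38.)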

Naming probability functions

Given a random variable X, you often want to compute the probability that X will take on a value less than x or greater than x. Define the functions

F_X(x) = Prob(X ≤ x)

and

G_X(x) = Prob(X > x)

What do you call F and G? I tend to call them the CDF (cumulative distribution function) and CCDF (complementary cumulative distribution function) but conventions vary.

The names of software functions to compute these two functions can be confusing. For example, Python (SciPy) uses the names cdf and sf (the latter for “survival function”) while the R functions to compute the CDF take an optional argument to return the CCDF instead [1].

In the Emacs calculator, the function ltpn computes the CDF. At first glance I thought this was horribly cryptic. It’s actually a very nice naming convention; it just wasn’t what I was expecting.

The “ltp” stands for lower tail probability and “n” stands for normal. The complementary probability function is utpn where “utp” stands for upper tail probability. Unlike other software libraries, Emacs gives symmetric names to these two symmetrically related functions.

“Lower tail” probability is clearer than “cumulative” probability because it leaves no doubt whether you’re accumulating from the left or the right.

You can replace the “n” at the end of ltpn and utpn with the first letters of binomial, chi-square, t, F, and Poisson to get the corresponding functions for these distributions. For example, utpt gives the upper tail probability for the Student t distribution [2].

The Emacs calculator can be quirky, but props to the developers for choosing good names for the probability functions.


[1] Incidentally, the CCDF cannot always be computed by simply computing the CDF first and subtracting the value from 1. In theory this is possible, but not in floating point practice. See the discussion of erf and erfc in this post for an explanation.

[2] These names are very short and only a handful of distribution families are supported. But that’s fine in context. The reason to use the Emacs calculator is to do a quick calculation without having to leave Emacs, not to develop production quality statistical software.

Sampling with replacement until you’ve seen everything

Suppose you have a standard deck of 52 cards. You pull out a card, put it back in the deck, shuffle, and pull out another card. How long would you expect to do this until you’ve seen every card?

Here’s a variation on the same problem. Suppose you’re a park ranger keeping data on tagged wolves in a national park. Each day you catch a wolf at random, collect some data, and release it. If there are 100 wolves in the park, how long until you’ve seen every wolf at least once?

We’ll solve this problem via simulation, then exactly.

Simulation

The Python code to solve the problem via simulation is straightforward. Here we’ll simulate our ranger catching and releasing one of 100 wolves. We’ll repeat our simulation five times to get an idea how much the results vary.

    import numpy as np
    from numpy.random import default_rng

    rng = default_rng()
    num_runs = 5
    N = 100 # number of unique items

    for run in range(num_runs):
        tally = np.zeros(N,dtype=int)
        trials = 0
        while min(tally) == 0:
            tally[rng.integers(N)] += 1
            trials += 1
        print(trials)

When I ran this, I got 846, 842, 676, 398, and 420.

Exact solution

Suppose we have a deck of N cards, and we sample the deck with replacement M times. We can select the first card N ways. Ditto for the second, third, and all the rest of the cards. So there are N^M possible sequences of card draws.

How many of these sequences include each card at least once? To answer the question, we look at Richard Stanley’s twelvefold way. We’re looking for the number of ways to allocate M balls into N urns such that each urn contains at least one ball. The answer is

N! S(M, N)

where S(M, N) is the Stirling number of the second kind with parameters M and N [1].

So the probability of having seen each card at least once after selecting M cards randomly with replacement is

N! S(M, N) / N^M.

Computing Stirling numbers

We’ve reduced our problem to the problem of computing Stirling numbers (of the second kind), so how do we do that?

We can compute Stirling numbers in terms of binomial coefficients as follows.

S(n,k) = \frac{1}{k!}\sum_{i = 0}^k (-1)^i {k \choose i} (k-i)^n

Now we have a practical problem if we want to use this formula for larger values of n and k. If we work with floating point numbers, we’re likely to run into overflow. This is a perennial problem with probability calculations. You may need to compute a moderate-sized number as the ratio of two enormous numbers. Even though the final result is between 0 and 1, the intermediate results may be too large to store in a floating point number. And even if we don’t overflow, we may lose precision because we’re working with an alternating sum.

The usual way to deal with overflow is to work on a log scale. But that won’t work for Stirling numbers because we’re adding terms together, not multiplying them.

One way to solve our problem—not the most efficient way but a way that will work—is to work with integers. This is easy in Python because integers by default can be arbitrarily large.

There are two ways to compute binomial coefficients in Python: scipy.special.binom and math.comb. The former uses floating point operations and the latter uses integers, so we need to use the latter.

We can compute the numerator of our probability with

    from math import comb
    def k_factorial_stirling(n, k):
        # k! * S(n, k), computed exactly with integer arithmetic via inclusion-exclusion
        return sum((-1)**i * comb(k, i)*(k-i)**n for i in range(k+1))

Then our probability is

    k_factorial_stirling(M, N) / N**M

If we compute the probability that our ranger has seen all 100 wolves after catching and releasing 500 times, we get 0.5116. So the median number of catch and release cycles is somewhere around 500, which is consistent with our simulation above.
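Concretely, plugging in M = 500 and N = 100 with the function above:

    M, N = 500, 100
    print(k_factorial_stirling(M, N) / N**M)  # about 0.5116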

Note that the numerator and denominator in our calculation are both on the order of 10¹⁰⁰⁰ while the largest floating point number is on the order of 10³⁰⁸, so we would have overflowed if we’d used binom instead of comb.


[1] It’s unfortunate that these were called the “second kind” because they’re the kind that come up most often in application.

Image: “Fastening a Collar” by USFWS Mountain Prairie is licensed under CC BY 2.0.

How flat is a normal mixture on top?

Male and female heights both have a standard deviation of about 3 inches, with means of 70 inches and 64 inches. That’s a good first-pass model using round numbers.

If you ask what the height of an average adult is, not specifying male or female, you get a mixture of two normal distributions. If we assume an equal probability of selecting a male or female, then the probability density function for the mixture is the average of the density functions for men and women separately.

This mixture distribution is remarkably flat on top.

This happens whenever you have an equal mixture of two normal distributions, both having the same standard deviation σ, and with means 2σ apart. If the means were any closer together, the distribution would be rounder on top. If the means were any further apart, there would be a dip in the middle.

This makes sense intuitively if you think about what happens if you make things more extreme. If the means were as close together as possible, i.e. if they were the same, we’d simply have a normal distribution with its familiar curve on top. If the means were much further apart, we’d have two bumps that hardly appear to overlap.

See more on the application to heights here.

Mathematical details

How flat is the mixture density on top? We can quantify the flatness by looking at a power series expansion centered at the peak.

To simplify matters, we can assume that σ = 1 and that the means of the two distributions are -μ and μ, with the primary interest being the case μ = 1. The probability density function for the mixture is

\frac{1}{2\sqrt{2\pi}}\left( \exp\left(\frac{-(x-\mu)^2}{2} \right ) + \exp\left(\frac{-(x+\mu)^2}{2} \right )\right )

If we expand this in a Taylor series around 0 we get

\exp\left( -\frac{\mu^2}{2} \right) \left( \frac{1}{\sqrt{2 \pi}}+\frac{\left(\mu^2-1\right) x^2}{2 \sqrt{2 \pi}}+\frac{ \left(\mu^4-6 \mu^2+3\right) x^4}{24 \sqrt{2 \pi}} + \cdots \right)

This is a little complicated, but it explains a lot. Notice that the coefficient of x² has a term (μ² – 1).

We said above that if the means were any closer together than two standard deviations, the distribution would be rounder on top. That’s because if μ < 1 then (μ² – 1) is negative, so the curve is concave at the top, with a rounded peak.

We also said that if the means were any further apart, there would be a dip in the middle. That’s because if μ > 1 then (μ² – 1) is positive, so the curve is convex in the middle, i.e. there is a dip.

Now if μ = 1 then the x² term disappears. And because our function is even, its Taylor series only has terms with even powers. So if the x² term goes away, the next term is not x³ but x⁴.

So one way to say how flat our curve is on top is that it is flat to within O(x⁴).

Let’s get more specific by evaluating the coefficients numerically. For small x, our mixture density function is

0.2420 – 0.0202 x⁴ + O(x⁶).

Now if x is small, x⁴ is extremely small, and the coefficient of 0.02 makes it even smaller.
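If you want to double-check the expansion, SymPy can compute the series directly. Here’s a quick sketch for μ = 1:

    from sympy import exp, pi, sqrt, symbols

    x = symbols('x')
    mu = 1
    f = (exp(-(x - mu)**2/2) + exp(-(x + mu)**2/2)) / (2*sqrt(2*pi))
    # The x**2 term drops out and the leading correction is of order x**4.
    print(f.series(x, 0, 6))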