Knuth’s series for chi squared percentage points

In the last two posts, we have needed to find the percentage points of a chi squared random variable when the parameter ν, the number of degrees of freedom, is large.

In Volume 2 of his magnum opus TAOCP, Knuth gives a table of percentage points for the chi squared distribution for small values of ν, along with an asymptotic approximation to be used for ν > 30.

Knuth’s series is

\nu + \sqrt{2\nu}x_p + \frac{2}{3} x_p^2 - \frac{2}{3} + {\cal O}(1/\sqrt{\nu})

where x_p is the corresponding percentage point for a standard normal random variable. Knuth gives a few values of x_p in his table.

What Knuth calls x_p, I called p in the previous post. The approximation I gave, based on Fisher’s transformation to make the chi squared distribution more like a normal, is similar to Knuth’s series, but not the same.

So which approximation is better? Where does Knuth’s approximation come from?

Comparing approximations

We want to find the quantiles for χ²(100) for p = 0.01 through 0.99. Both Fisher’s approximation and Knuth’s approximation are good enough that it’s hard to tell the curves apart visually when the exact values are plotted along with the two approximations.

But when you subtract off the exact values and just look at the errors, it’s clear that Knuth’s approximation is much better in general.
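
Here is a minimal sketch, assuming SciPy and NumPy, of how such a comparison can be computed. It evaluates Knuth’s series and the Fisher approximation x = (p + √(2ν − 1))²/2 from the previous post against the exact quantiles from chi2.ppf.

    import numpy as np
    from scipy.stats import chi2, norm

    nu = 100
    probs = np.linspace(0.01, 0.99, 99)
    xp = norm.ppf(probs)                  # normal percentage points

    exact  = chi2.ppf(probs, nu)
    knuth  = nu + np.sqrt(2*nu)*xp + (2/3)*xp**2 - 2/3
    fisher = (xp + np.sqrt(2*nu - 1))**2 / 2

    print(np.max(np.abs(knuth  - exact)))   # worst-case error of Knuth's series
    print(np.max(np.abs(fisher - exact)))   # worst-case error of Fisher's approximation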

Here’s another plot, taking the absolute value of the error. This makes it easier to see little regions where Fisher’s approximation happens to be better.

Here are a couple more plots comparing the approximations at more extreme probabilities. First the left tail:

And then the right tail:

Derivation

The table mentioned above says to see exercise 16 in the same section, which in turn refers to Theorem 1.2.11.3A in Volume 1 of TAOCP. There Knuth derives an asymptotic series which is not directly the one above, but does most of the work. The exercise asks the reader to finish the derivation.

Chi squared approximations

In the previous post I needed to know the tail percentile points for a chi squared distribution with a huge number of degrees of freedom. When the number of degrees of freedom ν is large, a chi squared random variable has approximately a normal distribution with the same mean and variance, namely mean ν and variance 2ν.

In that post, ν was 9999 and we needed to find the 2.5 and 97.5 percentiles. Here are the percentiles for χ²(9999):

    >>> from scipy.stats import chi2, norm
    >>> chi2(9999).ppf([0.025, 0.975])
    array([ 9723.73223701, 10278.05632026])

And here are the percentiles for N(9999, √19998):

    >>> norm(9999, (2*9999)**0.5).ppf([0.025, 0.975])
    array([ 9721.83309451, 10276.16690549])

So the results on the left end agree to three significant figures and the results on the right agree to four.

Fewer degrees of freedom

When ν is more moderate, say ν = 30, the normal approximation is not so hot. (We’re stressing the approximation by looking fairly far out in the tails. Closer to the middle the fit is better.)

Here are the results for χ²(30):

    >>> chi2(30).ppf([0.025, 0.975])
    array([16.79077227, 46.97924224])

And here are the results for N(30, √60):

    >>> norm(30, (60)**0.5).ppf([0.025, 0.975])
    array([14.81818426, 45.18181574])

The normal distribution is symmetric and the chi squared distribution is not, though it becomes more symmetric as ν → ∞. Transformations of the chi squared distribution that make it more symmetric may also improve the approximation accuracy. That wasn’t important when we had ν = 9999, but it is more important when ν = 30.

Fisher transformation

If X ~ χ²(ν), Fisher suggested the approximation √(2X) ~ N(√(2ν − 1), 1).

Let Y be a N(√(2ν − 1), 1) random variable and Z a standard normal random variable, N(0, 1). Then we can estimate χ² probabilities from normal probabilities.

\begin{align*} P(X \leq x) &= P(\sqrt{2X} \leq \sqrt{2x}) \\ &\approx P(Y \leq \sqrt{2x}) \\ &= P(Z \leq \sqrt{2x} - \sqrt{2\nu - 1}) \end{align*}

So if we want to find the percentage points for X, we can solve for corresponding percentage points for Z.

If p is the point where P(Z ≤ p) = q, then

x = \frac{(p + \sqrt{2\nu-1})^2}{2}

is the point where P(X ≤ x) = q.

If we use this to find the 2.5 and 97.5 percentiles for a χ²(30) random variable, we get 16.36 and 46.48, an order of magnitude more accurate than before.
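
Here is a small sketch of that calculation, assuming SciPy:

    import numpy as np
    from scipy.stats import norm

    nu = 30
    p = norm.ppf([0.025, 0.975])          # normal percentage points for q = 0.025, 0.975
    x = (p + np.sqrt(2*nu - 1))**2 / 2    # Fisher's approximation
    print(x)                              # the approximate percentiles quoted above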

When ν = 9999, the Fisher transformation gives us percentiles that are two orders of magnitude more accurate than before.

Wilson–Hilferty transformation

If X ~ χ²(ν), the Wilson–Hilferty transformation says that (X/ν)^(1/3) is approximately normal with mean 1 − 2/(9ν) and variance 2/(9ν).

This transformation is a little more complicated than the Fisher transform, but also more accurate. You could go through calculations similar to those above to approximate percentage points using the Wilson–Hilferty transformation.
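
For example, here is a minimal sketch, assuming SciPy, of using the Wilson–Hilferty transformation to approximate the same χ²(30) percentiles as above:

    import numpy as np
    from scipy.stats import norm

    nu = 30
    z = norm.ppf([0.025, 0.975])
    # invert (X/nu)^(1/3) ~ N(1 - 2/(9 nu), 2/(9 nu))
    x = nu * (1 - 2/(9*nu) + z*np.sqrt(2/(9*nu)))**3
    print(x)   # close to the exact values 16.79 and 46.98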

These days the main use for approximations like this is in analytical calculations; software packages can give accurate numerical results directly. For analytical work, the simplicity of the Fisher transformation may outweigh the improved accuracy of the Wilson–Hilferty transformation.

Variance of variances. All is variance.

In the previous post, I did a simulation to illustrate a theorem about the number of steps needed in the Euclidean algorithm. The distribution of the number of steps is asymptotically normal, and for numbers 0 < a < b < x the mean is asymptotically

12 log(2) log(x) / π².

What about the variance? The reference cited in the previous post [1] gives an expression for the variance, but it’s complicated. The article said the variance is close to a constant times log x, where that constant involves the second derivative of “a certain pseudo zeta function associated with continued fractions.”

So let’s estimate the variance empirically. We can easily compute the sample variance to get an unbiased point estimate of the population variance, but how can we judge how accurate this point estimate is? We’d need the variance of our estimate of variance.

When I taught statistics classes, this is where students would struggle. There are multiple levels of randomness going on. We have some random process. (Algorithm run times are not really random, but we’ll let that slide.)

We have a statistic S², the sample variance, which is computed from samples from a random process, so our statistic is itself a random variable. As such, it has a mean and a variance. The mean of S² is the population variance; that’s what it means for a statistic to be an unbiased estimator.

The statistic S² has a known distribution, which we can use to create a confidence interval. Namely,

\frac{(n-1)S^2}{\sigma^2} \sim \chi^2

where the chi squared random variable has n − 1 degrees of freedom. Here n is the number of samples and σ² is the population variance. But wait, isn’t the whole purpose of this discussion to estimate σ² because we don’t know what it is?!

This isn’t quite circular reasoning. More like iterative reasoning. We have an estimate of σ² which we then turn around and use in constructing its confidence interval. It’s not perfect, but it’s useful.

A (1 − α) 100% confidence interval is given by

\left( \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}, \frac{(n-1)s^2}{\chi^2_{\alpha/2}} \right)

Notice that the right tail probability appears on the left end of the interval and the left tail probability appears on the right end of the interval. The order flips because they’re in a denominator.

I modified the simulation code from the previous post to compute the sample standard deviation for the runs, and got 4.6926, which corresponds to a sample variance of 22.0205. There were 10,000 reps, so our chi squared distribution has 9,999 degrees of freedom. We can compute the confidence interval as follows.

    >>> from scipy.stats import chi2
    >>> 9999*22.0205/chi2.ppf([0.975, 0.025], 9999)

This says a 95% confidence interval for the variance, in one particular simulation run, was [21.42, 22.64].

The result in [1] said that the variance is proportional to log x, and in our case x = 2^63 because that post simulated random numbers up to that limit. We can estimate the proportionality constant as 22.0205/(63 log 2) ≈ 0.504. In other words, the variance is about half of log x.
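
The arithmetic, as a one-line sketch:

    from math import log
    print(22.0205 / (63 * log(2)))   # approximately 0.504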

The theorem in [1] gives us the asymptotic variance. If we used a bigger value of x, the estimated constant could be a little different, but I expect the variance would still be around 0.5 log x.

[1] Doug Hensley. The Number of Steps in the Euclidean Algorithm. Journal of Number Theory 49. 142–182 (1994).

Normal probability approximation

The previous post presented an approximation

\int_0^x \exp(-t^2)\, dt \approx \sin(\sin(x))

for −1 ≤ x ≤ 1 and said that it was related to a probability function. This post will make the connection explicit.

Let X be a normally distributed random variable with mean μ and variance σ². Then the CDF of X is

P(X \leq x) = \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty}^x \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right)\, dt

So if μ = 0 and σ² = 1/2 then we have the following.

P(0 \leq X \leq x) = \frac{1}{\sqrt{\pi}} \int_0^x \exp(-t^2)\, dt \approx \frac{1}{\sqrt{\pi}} \sin(\sin(x))

Here’s Python code to show how good the approximation is.

    from numpy import sin, linspace, sqrt, pi
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    x = linspace(-1, 1, 200)
    X = norm(0, sqrt(0.5))                # normal with mean 0 and variance 1/2
    plt.plot(x, X.cdf(x) - X.cdf(0))      # exact P(0 <= X <= x)
    plt.plot(x, sin(sin(x))/sqrt(pi))     # the sin(sin(x))/sqrt(pi) approximation
    plt.legend(["exact", "approx"])
    plt.xlabel("$x$")
    plt.ylabel(r"$P(X \leq x)$")

Here’s the resulting plot:

The orange curve for the plot of the approximation completely covers the blue curve of the exact value. The two curves are the same to within the resolution of the plot. See the previous post for a detailed error analysis.

The Cauchy distribution’s counter-intuitive behavior

Someone with no exposure to probability or statistics likely has an intuitive sense that averaging random variables reduces variance, though they wouldn’t state it in those terms. They might, for example, agree that the average of several test grades gives a better assessment of a student than a single test grade. But data from a Cauchy distribution doesn’t behave this way.

Averages and scaling

If you have four independent random variables, each normally distributed with the same scale parameter σ, then their average is also normally distributed but with scale parameter σ/2.

If you have four independent random variables, each Cauchy distributed with the same scale parameter σ, then their average is also Cauchy distributed, but with the exact same scale parameter σ.

So the normal distribution matches common intuition, but the Cauchy distribution does not.
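
A quick simulation sketch illustrates the difference, assuming NumPy; the seed and sample sizes are arbitrary. Since Cauchy samples have no variance, the interquartile range is used as the measure of spread.

    import numpy as np

    rng = np.random.default_rng(12345)   # arbitrary seed
    reps, n = 100_000, 4

    def iqr(a):
        # interquartile range, a robust measure of spread
        return np.percentile(a, 75) - np.percentile(a, 25)

    normal_means = rng.standard_normal((reps, n)).mean(axis=1)
    cauchy_means = rng.standard_cauchy((reps, n)).mean(axis=1)

    print(iqr(rng.standard_normal(reps)), iqr(normal_means))   # spread shrinks by about half
    print(iqr(rng.standard_cauchy(reps)), iqr(cauchy_means))   # spread stays about the same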

In the case of random variables with a normal distribution, the scale parameter σ is also the standard deviation. In the case of random variables with a Cauchy distribution, the scale parameter σ is not the standard deviation because Cauchy random variables don’t have a variance, so they don’t have a standard deviation.

Modeling

Some people object that nothing really follows a Cauchy distribution because the Cauchy distribution has no mean or variance. But nothing really follows a normal distribution either. All probability distributions are idealizations. The question of any probability distribution is whether it adequately captures the aspect of reality it is being used to model.

Mean

Suppose some phenomenon appears to behave like it has a Cauchy distribution, with no mean. Alternately, suppose the phenomenon has a mean, but this mean is so variable that it is impossible to estimate. There’s no practical difference between the two.

Variance

And in the alternate case, suppose there is a finite variance, but the variance is so large that it is impossible to estimate. If you take the average of four observations, the result is still so variable that the variance is impossible to estimate. You’ve cut the theoretical variance by a factor of four, but that makes no difference. Again this is practically indistinguishable from a Cauchy distribution.

Truncating

Now suppose you want to tame the Cauchy distribution by throwing out samples with absolute value larger than M. Now you have a truncated Cauchy distribution, and it has finite mean and variance.

But how do you choose M? If you don’t have an objective reason to choose a particular value of M, you would hope that your choice doesn’t matter too much. And that would be the case for a thin-tailed probability distribution like the normal, but it’s not true of the Cauchy distribution.

The variance of the truncated distribution will be roughly proportional to M, so by choosing M you choose the variance. If you double your cutoff for outliers that are to be discarded, you approximately double the variance of what’s left. Your choice of M matters a great deal.
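
Here is a small numerical sketch, assuming SciPy, of how the variance of a standard Cauchy distribution truncated to [−M, M] grows with M:

    from math import pi
    from scipy.integrate import quad

    def truncated_cauchy_variance(M):
        density = lambda x: 1 / (pi * (1 + x**2))    # standard Cauchy density
        mass, _ = quad(density, -M, M)               # probability retained after truncation
        second_moment, _ = quad(lambda x: x**2 * density(x), -M, M)
        return second_moment / mass                  # the mean is zero by symmetry

    for M in (10, 20, 40):
        print(M, truncated_cauchy_variance(M))       # doubling M roughly doubles the variance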

Why do medical tests always have error rates?

Most people implicitly assume medical tests are infallible. If they test positive for X, they assume they have X. Or if they test negative for X, they’re confident they don’t have X. Neither is necessarily true.

Someone recently asked me why medical tests always have an error rate. It’s a good question.

A test is necessarily a proxy, a substitute for something else. You don’t run a test to determine whether someone has a gunshot wound: you just look.

A test for a disease is based on a pattern someone discovered. People who have a certain condition usually have a certain chemical in their blood, and people who do not have the condition usually do not have that chemical in their blood. But it’s not a direct observation of the disease.

“Usually” is as good as you can do in biology. It’s difficult to make universal statements. When I first started working with genetic data, I said something to a colleague about genetic sequences being made of A, C, G, and T. I was shocked when he replied “Usually. This is biology. Nothing always holds.” It turns out some viruses have Zs (aminoadenine) rather than As (adenine).

Error rates may be very low and still be misleading. A positive test for a rare disease is usually a false positive, even if the test is highly accurate. This is a canonical example of the need to use Bayes’ theorem. The details are written up here.
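
As a quick sketch with made-up numbers: suppose a disease has a prevalence of 1 in 1,000 and the test has 99% sensitivity and 99% specificity. Bayes’ theorem says most positives are still false positives.

    # Hypothetical numbers, for illustration only
    prevalence  = 0.001
    sensitivity = 0.99   # P(positive | disease)
    specificity = 0.99   # P(negative | no disease)

    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    p_disease_given_positive = sensitivity * prevalence / p_positive
    print(p_disease_given_positive)   # about 0.09: most positive results are false positives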

The more often you test for something, the more often you’ll find it, correctly or incorrectly. The more conditions you test, the more conditions you find, correctly or incorrectly.

Wouldn’t it be great if your toilet had a built in lab that tested you for hundreds of diseases several times a day? No, it would not! The result would be frequent false positives, leading to unnecessary anxiety and expense.

Up to this point we’ve discussed medical tests, but the same applies to any kind of test. Surveillance is a kind of test, one that also has false positives and false negatives. The more often Big Brother reads your email, the more likely it is that he will find something justifying a knock on your door.

Do incremental improvements add, multiply, or something else?

Suppose you make an x% improvement followed by a y% improvement. Together do they make an (x + y)% improvement? Maybe.

The business principle of kaizen, from the Japanese 改善 for improvement, is based on the assumption that incremental improvements accumulate. But quantifying how improvements accumulate takes some care.

Add or multiply?

Two successive 1% improvements amount to a 2% improvement. But two successive 50% improvements amount to a 125% improvement. So sometimes you can add, and sometimes you cannot. What’s going on?

An x% improvement multiplies something by 1 + x/100. For example, if you earn 5% interest on a principal of P dollars, you now have 1.05 P dollars.

So an x% improvement followed by a y% improvement multiplies by

(1 + x/100)(1 + y/100) = 1 + (x + y)/100 + xy/10000.

If x and y are small, then xy/10000 is negligible. But if x and y are large, the product term may not be negligible, depending on context. I go into this further in this post: Small probabilities add, big ones don’t.
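
Here is a tiny sketch of the difference:

    def combined_improvement(x, y):
        """Percent improvement from an x% improvement followed by a y% improvement."""
        return ((1 + x/100) * (1 + y/100) - 1) * 100

    print(combined_improvement(1, 1))     # 2.01, essentially additive
    print(combined_improvement(50, 50))   # 125.0, far from 100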

Interactions

Now let’s look at a variation. Suppose doing one thing by itself brings an x% improvement and doing another thing by itself makes a y% improvement. How much improvement could you expect from doing both?

For example, suppose you find through A/B testing that changing the font on a page increases conversions by 10%. And you find in a separate A/B test that changing an image on the page increases conversions by 15%. If you change the font and the image, would you expect a 25% increase in conversions?

The issue here is not so much whether it is appropriate to add percentages. Since

1.1 × 1.15 = 1.265

you don’t get a much different answer whether you multiply or add. But maybe you could change the font and the image and conversions increase 12%. Maybe either change alone creates a better impression, but together they don’t make a better impression than doing one of the changes. Or maybe the new font and the new image clash somehow and doing both changes together lowers conversions.

The statistical term for what’s going on is interaction effects. A sequence of small improvements creates an additive effect if the improvements are independent. But the effects could be dependent, in which case the whole is less than the sum of the parts. This is typical. Assuming that improvements are independent is often overly optimistic. But sometimes you run into a synergistic effect and the whole is greater than the sum of the parts.

Sequential testing

In the example above, we imagine testing the effect of a font change and an image change separately. What if we first changed the font, then with the new font tested the image? That’s better. If there were a clash between the new font and the new image we’d know it.

But we’re missing something here. If we had tested the image first and then tested the new font with the new image, we might have gotten different results. In general, the order of sequential testing matters.

Factorial testing

If you have a small number of things to test, you can discover interaction effects by doing a factorial design, either a full factorial design or a fractional factorial design.

If you have a large number of things to test, you’ll have to do some sort of sequential testing. Maybe you do some combination of sequential and factorial testing, guided by which effects you have reason to believe will be approximately independent.

In practice, a testing plan needs to balance simplicity and statistical power. Sequentially testing one option at a time is simple, and may be fine if interaction effects are small. But if interaction effects are large, sequential testing may be leaving money on the table.

Help with testing

If you’d like some help with testing, or with web analytics more generally, we can help.

Fitting a line without an intercept term

The other day I was looking at how many lumens LED lights put out per watt. I found some data on Wikipedia, and as you might expect the relation between watts and lumens is basically linear, though not quite.

If you were to do a linear regression on the data you’d get a relation

lumens = a × watts + b

where the intercept term b is not zero. But this doesn’t make sense: a light bulb that is turned off doesn’t produce light, and it certainly doesn’t produce negative light. [1]

You may be able to fit the regression and ignore b; it’s probably small. But what if you wanted to require that b = 0? Some regression software will allow you to specify a zero intercept, and some will not. But it’s easy enough to compute the slope a without using any regression software.

Let x be the vector of input data, the wattage of the LED bulbs. And let y be the corresponding light output in lumens. The regression line uses the slope a that minimizes

(a x − y)² = a² x · x − 2a x · y + y · y.

Setting the derivative with respect to a to zero shows

a = x · y / x · x
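
In code this is a one-liner. The wattage and lumen values below are made up for illustration, assuming NumPy:

    import numpy as np

    # Hypothetical LED data: wattage and light output in lumens
    watts  = np.array([  6,   9,   12,   15,   20])
    lumens = np.array([600, 880, 1200, 1500, 2100])

    a = np.dot(watts, lumens) / np.dot(watts, watts)   # slope of the no-intercept fit y = a x
    print(a)   # roughly 100 lumens per watt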

Now there’s more to regression than just line fitting. A proper regression analysis would look at residuals, confidence intervals, etc. But the calculation above was good enough to conclude that LED lights put out about 100 lumens per watt.

It’s interesting that making the model more realistic, i.e. requiring b = 0, is either a complication or a simplification, depending on your perspective. It complicates using software, but it simplifies the math.

[1] The orange line in the image above is the least squares fit for the model y = ax, but it’s not quite the same line you’d get if you fit the model y = ax + b.

Condition on your data

Suppose you design an experiment, an A/B test of two page designs, randomizing visitors to Design A or Design B. You planned to run the test for 800 visitors and you chose a significance level α for your experiment.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Common sense says it matters what you saw. If 148 people were randomized to Design A, and every single one of them bought your product, while 10 out of the 152 people randomized to Design B bought your product, common sense says you should call Design A the winner and push it into production ASAP.

But what if you saw somewhat better results for Design A? You can have some confidence that Design A is better, though not as much as the nominal confidence level of the full experiment. This is what your (frequentist) statistician was trying to protect you from.

When your statistician designed your experiment, he obviously didn’t know what data you’d see, so he designed a process that would be reliable in a certain sense. When you looked at the data early, you violated the process, and so now your actual practice no longer has the probability of success initially calculated.

But you don’t care about the process; you want to know whether to deploy Design A or Design B. And you saw the data that you saw. Particularly in the case where the results were lopsidedly in favor of Design A, your gut tells you that you know what to do next. You might reasonably say “I get what you’re saying about repeated experiments and all that. (OK, not really, but let’s say I do.) But look what happened? Design A is a runaway success!”

In formal terms, your common sense is telling you to condition on the observed data. If you’ve never studied Bayesian statistics you may not know exactly what that means and how to calculate it, but it’s intuitively what you’ve done. You’re making a decision based on what you actually saw, not on the basis of a hypothetical sequence of experiments you didn’t run and won’t run.
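
Here is a minimal sketch of what conditioning on the data could look like for the lopsided example above, assuming a uniform Beta(1, 1) prior on each design’s conversion rate. The priors are an assumption for illustration, not part of the original example.

    import numpy as np
    from scipy.stats import beta

    # Posteriors after the interim data: 148 sales out of 148 for A, 10 out of 152 for B
    post_a = beta(1 + 148, 1 + 0)
    post_b = beta(1 + 10, 1 + 142)

    rng = np.random.default_rng(0)
    samples = 100_000
    prob_a_better = np.mean(post_a.rvs(samples, random_state=rng) >
                            post_b.rvs(samples, random_state=rng))
    print(prob_a_better)   # essentially 1: Design A is almost certainly better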

Bayesian statistics does formally what your intuition does informally. This is important because even though your intuition is a good guide in extreme cases, it can go wrong when things are less obvious. As I wrote about recently, smart people can get probability very wrong, even when their intuition is correct in some instances.

If you’d like help designing experiments or understanding their results, we can help.

Can you look at experimental results along the way or not?

Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?

Of course there’s nothing morally or legally wrong with looking at interim data, but is there anything statistically wrong? That depends on what you do with your observation.

There are basically two statistical camps, Frequentist and Bayesian. (There are others, but these two account for the lion’s share.) Frequentists say no, you should not look at interim data, unless you apply something called alpha spending. Bayesians, on the other hand, say go ahead. Why shouldn’t you look at your data? I remember one Bayesian colleague mocking alpha spending as being embarrassing.

So who is right, the Frequentists or the Bayesians? Both are right, given their respective criteria.

If you want to achieve success, as defined by Frequentists, you have to play by Frequentist rules, alpha spending and all. Suppose you design a hypothesis test to have significance level α, and you look at the data midway through the experiment. If the results are conclusive at the midpoint, you stop early. This procedure does not have significance level α. You would have to require stronger evidence for early stopping, as specified by alpha spending.

The Bayesian interprets the data differently. This approach says to quantify what is believed before conducting the experiment in the form of a prior probability distribution. Then after each data point comes in, you update your prior distribution using Bayes’ theorem to produce a posterior distribution, which becomes your new prior distribution. That’s it. At every point along the way, this distribution captures all that is known about what you’re testing. Your planned sample size is irrelevant, and the fact that you’ve looked at your data is irrelevant.
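
Here is a minimal sketch of that updating loop, using a conjugate Beta(1, 1) prior for a conversion rate. The prior and the data are made up for illustration.

    from scipy.stats import beta

    a, b = 1, 1                      # Beta(1, 1) prior pseudo-counts
    visits = [1, 0, 0, 1, 1, 0, 1]   # hypothetical outcomes: 1 = sale, 0 = no sale

    for outcome in visits:
        a, b = a + outcome, b + (1 - outcome)    # the posterior becomes the new prior
        print(f"posterior mean so far: {beta(a, b).mean():.3f}")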

Now regarding your A/B test, why are you looking at the data early? If it’s simply out of curiosity, and cannot affect your actions, then it doesn’t matter. But if you act on your observation, you change the Frequentist operating characteristics of your experiment.

Stepping back a little, why are you conducting your test in the first place? If you want to make the correct decision with probability 1 − α in an infinite sequence of experiments, then you need to take into account alpha spending, or else you will lower that probability.

But if you don’t care about a hypothetical infinite sequence of experiments, you may find the Bayesian approach more intuitive. What do you know right at this point in the experiment? It’s all encapsulated in the posterior distribution.

Suppose your experiment size is a substantial portion of your potential visitors. You want to experiment with graphics, yes, but you also want to make money along the way. You’re ultimately concerned with profit, not publishing a scientific paper on your results. Then you could use a Bayesian approach to maximize your expected profit. This leads to things like “bandits,” so called by analogy to slot machines (“one-armed bandits”).

If you want to keep things simple, and if the experiment size is negligible relative to your expected number of visitors, just design your experiment according to Frequentist principles and don’t look at your data early.

But if you have good business reasons to want to look at the data early, not simply to satisfy curiosity, then you should probably interpret the interim data from a Bayesian perspective. As the next post explains, the Bayesian approach aligns well with common sense.

I’d recommend taking either a Frequentist approach or a Bayesian approach, but not complicating things by hybrid approaches such as alpha spending or designing Bayesian experiments to have desired Frequentist operating characteristics. The middle ground is more complicated and prone to subtle mistakes, though we can help you navigate this middle ground if you need to.

If you need help designing, conducting, or interpreting experiments, we can help. If you want/need to look at interim results, we can show you how to do it the right way.
