Distribution of correlation

Posted on 20 October 2025 by John

One of the more subtle ideas to convey in an introductory statistics class is that statistics have distributions.

Students implicitly think that when you calculate a statistic on a data set, say the mean, that then you have THE mean. But if your data are (modeled as) samples from a random variable, then anything you compute from those samples, such as the mean, is also a random variable. When you compute a useful statistic, it’s not as random as the data, i.e. it has smaller variance, but it’s still random.

A couple days ago I wrote about Fisher’s transform to make the distribution sample correlations closer to normal. This post will make that more concrete.

Preliminaries

We’ll need to bring in a few Python libraries. While we’re at it, let’s set the random number generator seed so the results will be reproducible.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew

np.random.seed(20251020)

Correlated RNG

Next, we’ll need a way to generate correlated random samples, specifying the correlation ρ and the sample size N.

def gen_correlated_samples(rho, N):
    mean = [0, 0]
    cov = [
        [1, rho],
        [rho, 1]
    ]
    return np.random.multivariate_normal(mean, cov, size=N)

Calculating correlation

Once we generate correlated pairs, we need to calculate their correlation. To be more precise, their linear (Pearson) correlation. To do this we’ll find the empirical covariance matrix, the sample counterpart to the covariance matrix specified in the generator code above. The correlation coefficient is then the off-diagonal element of the covariance matrix.

def pearsonr(X):
    correlation_matrix = np.corrcoef(X[:,0], X[:,1])
    return correlation_matrix[0, 1]

Simulation

Now we’re ready to do our simulation.

M = 10000
rs = np.zeros(M)
for i in range(M):
    X = gen_correlated_samples(0.9, 100)
    rs[i] = pearsonr(X)

Notice that there are two levels of sampling. We’re generating random samples of size 100 and computing their correlation; that’s sampling our underlying data. And we’re repeating the process of computing the correlation 10,000 times; that’s sampling the correlation.

Untransformed distribution

Next we view the distribution of the correlation values.

plt.hist(rs, bins=int(np.sqrt(M)))
plt.show()
plt.close()

This gives the following plot.

It’s strongly skewed to the left, which we can quantify by calculating the skewness.

print(skew(rs))

This tells us the skewness is −0.616. A normal distribution has skewness 0. The negative sign tells us the direction of the skew.

Transformed distribution

Now let’s apply the Fisher transformation and see how it makes the distribution much closer to normal.

xformed = np.arctanh(rs)
plt.hist(xformed, bins=int(np.sqrt(M)))
plt.show()
plt.close()
print(skew(xformed))

This produces the plot below and prints a skewness value of −0.0415.

Small correlation example

We said before that when the correlation ρ is near zero, the Fisher transformation is less necessary. Here’s an example where ρ = 0.1. It’s not visibly different from a normal distribution, and the skewness is −0.1044.

Observation and conjecture

In our two examples, the skewness was approximately −ρ. Was that a coincidence, or does that hold more generally? We can test this with the following code.

def skewness(rho):
    rs = np.zeros(M)
    for i in range(M):
        X = gen_correlated_samples(rho, 100)
        rs[i] = pearsonr(X)
    return skew(rs)
    
rhos = np.linspace(-1, 1, 100)
ks = [skewness(rho) for rho in rhos]
plt.plot(rhos, ks)
plt.plot(rhos, -rhos, "--", color="gray")
plt.show()

Here’s the resulting plot.

It looks like the skewness is not exactly −ρ, but −cρ for some c < 1. Maybe c depends on the inner sample size, in our case 100. But it sure looks like skewness is at least approximately proportional to ρ. Maybe this is a well-known result, but I haven’t seen it before.

Distribution of coordinates on a sphere

Posted on 18 October 2025 by John

Almost Sure posted an interesting fact on X:

If a point (x, y, z) is chosen at random uniformly on the unit sphere, then x, y, and z each have the uniform distribution on [−1, 1] and zero correlations (but not independent!) This follows from Archimedes’ “On the Sphere and Cylinder” published in 225BC.

Archimedes effectively showed that projecting a sphere to a cylinder wrapping round its equator preserves area, so projection of (x,y,z) is uniform on the cylinder. Used in Lambert cylindrical equal area projection.

In this post I want to empirically demonstrate the distribution of the coordinates, their lack of linear correlation, and their dependence.

The usual way to generate random points on a sphere is to generate three samples from a standard normal distribution and normalize by dividing by the Euclidean norm of the three points. So to generate a list of N points on a sphere, I generate an N × 3 matrix of normal samples and divide each row by its norm.

import numpy as np

N = 10000
# Generate N points on a sphere
M = norm.rvs(size=(N, 3))
M /= np.linalg.norm(M, axis=1)[:, np.newaxis]

Next, I find the correlation of each column with the others.

x = M[:, 0]
y = M[:, 1]
z = M[:, 2]
cxy, _ = pearsonr(x, y)
cyz, _ = pearsonr(y, z)
cxz, _ = pearsonr(x, z)
print(cxy, cyz, cxz)

When I ran this I got correlations of -0.0023, 0.0047, and -0.0177. For reasons explained in the previous post, I’d expect values in the interval [−0.02, 0.02] if the columns were independent, so the correlations are consistent with that hypothesis.

To demonstrate that each column is uniformly distributed on [−1, 1] you could look at histograms.

import matplotlib.pyplot as plt

plt.hist(x, bins=int(N**0.5))
plt.show()
plt.hist(y, bins=int(N**0.5))
plt.show()
plt.hist(z, bins=int(N**0.5))
plt.show()

I’ll spare you the plots, but they look like what you’d expect. If you wanted to do something more analytical than visual, you could compute something like the Kolmogorov-Smirnov statistic.

The coordinates are not independent, though they have zero linear correlation. Maybe a test for nonlinear dependence will work. I tried Spearman’s rank correlation and Kendall’s tau, and neither detected the dependence. But distance correlation did.

from dcor import distance_correlation

cxy = distance_correlation(x, y)
cyz = distance_correlation(y, z)
cxz = distance_correlation(x, z)
print(cxy, cyz, cxz)

Each of the distance correlations was approximately 0.11, substantially greater than zero, implying dependence between the coordinates.

ODE to Fisher’s transform

Posted on 18 October 2025 by John

I was calculating a correlation coefficient this afternoon and ran into something interesting.

Suppose you have two uncorrelated random variables X and Y. If you draw, say, a thousand samples each from X and Y and compute Pearson’s correlation coefficient, you almost certainly will not get 0, though you very likely will get a small number.

How do you find a confidence interval around a correlation coefficient?

Sample correlation coefficient values do not follow a normal distribution, though the distribution is approximately normal if the population correlation is small. The distribution gets further from normal as the correlation gets close to 1 or −1.

Enter Fisher’s transformation. If you run the sample correlation coefficient r through the function

½ log((1 + r)/(1 − r)) = arctanh(r)

you get something that has a distribution closer to the normal distribution. You find a confidence interval for the transformed variable, then undo the transformation.

Now where did the Fisher transform come from?

I don’t know whether this was Fisher’s derivation, but Hotelling came up the following derivation. Assume you apply a transform G(r) to the correlation coefficient. Write an asymptotic expansion for the kurtosis κ₃ and set the first term equal to zero. This leads to the ordinary differential equation

3(1 − r³) G″(r) − 6r G′(r) = 0

which has the solution G(r) = arctanh(r).

I found this interesting because I’ve worked with differential equations and with statistics, but I’ve rarely seen them overlap.

Analyzing the Federalist Papers

Posted on 7 August 2025 by John

The Federalist Papers, a collection of 85 essays published anonymously between 1787 and 1788, were one of the first subjects for natural language processing aided by a computer. Because the papers were anonymous, people were naturally curious who wrote each of the essays. Early on it was determined that the authors were Alexander Hamilton, James Madison, and John Jay, but the authorship of individual essays wasn’t known.

In 1944, Douglass Adair conjectured the authorship of each essay, and twenty years later Frederick Mosteller and David Wallace confirmed Adair’s conclusions by Bayesian analysis. Mosteller and Wallace used a computer to carry out their statistical calculations, but they did not have an electronic version of the text.

They physically chopped a printed copy of the text into individual words and counted them. Mosteller recounted in his autobiography that until working on The Federalist Papers, he had underestimated how hard it was to count a large number things, especially little pieces of paper that could be scattered by a draft.

I’m not familiar with how Mosteller and Wallace did their analysis, but I presume they formed a prior distribution on the frequency of various words in writings known to be by Hamilton, Madison, and Jay, then computed the posterior probability of authorship by each author for each essay.

The authorship of the papers was summarized in the song “Non-Stop” from the musical Hamilton:

The plan was to write a total of twenty-five essays, the work divided evenly among the three men. In the end, they wrote eighty-five essays in the span of six months. John Jay got sick after writing five. James Madison wrote twenty-nine. Hamilton wrote the other fifty-one!

Yesterday I wrote about the TF-IDF statistic for the importance of words in a corpus of documents. In that post I used the books of the Bible as my corpus. Today I wanted to reuse the code I wrote for that post by applying it to The Federalist Papers.

Federalist No. 10 is the best known essay in the collection. Here are the words with the highest TF-IDF scores from that essay.

faction: 0.0084
majority: 0.0047
democracy: 0.0044
controlling: 0.0044
parties: 0.0039
republic: 0.0036
cure: 0.0035
factious: 0.0035
property: 0.0033
faculties: 0.0033

I skimmed a list of the most important words in the essays by Madison and Hamilton and noticed that Madison’s list had several words from classic literature: Achaeans, Athens, Draco, Lycurgus, Sparta, etc. There were a only couple classical references in Hamilton’s top words: Lysander and Pericles. I noticed “debt” was important to Hamilton.

You can find the list of top 10 words in each essay here.

Probability of rolling a Yahtzee

Posted on 19 May 2025 by John

IJK (@iconjack) has calculated the probability of rolling a Yahtzee in no more than n rolls.

$\begin{align*} p(n) &= 1 + \frac{53}{13}\left(\frac{5}{6}\right)^{2n+1} + \frac{11\cdot 5^n}{13\cdot 2^{n+5}\cdot3^{3n+1}} \\ &\phantom{=} - \frac{5^n}{8\cdot 3^{2n-2}} - \frac{7\cdot 5^{n+1}}{2^{n+5}\cdot 3^{n-1}} \end{align*}$

The first few numerical values are:

p(1) = 0.0007716
p(2) = 0.0126316
p(3) = 0.0460286

The probability is 0.95 after 23 rolls, and 0.99 after 32 rolls.

Here’s a plot.

Sorting Roman numerals

Posted on 9 May 2025 by John

This morning I wrote about the frequencies of names for popes and kings. This involved sorting strings with Roman numerals since it’s common for popes and kings to have Roman numerals after their names.

Something that surprised me was that sorting Roman numerals alphabetically roughly sorts them in numerical order, especially for small numbers. It’s not perfect. For example, IX comes before V in alphabetical order.

Everyone who has done much work with data will have run into the problem of a column of numbers being sorted alphabetically rather than numerically. For example, “10” comes between “1” and “2” even though 10 comes after 1 and 2.

So you can’t sort numerals, Roman or Arabic, as strings and expect them to appear in numerical order. But Roman numbers come close when you’re sorting small numbers, such as I through XXIII for popes named John or I through VIII for kings of England named Henry.

To illustrate this, I plotted how well string sort order correlates with numeric order for Roman and Arabic numbers, for the sequence 1 … n for increasing values of n. I measured correlation using Spearman’s rank-order correlation. I tried Kendall’s tau and as well and got similar results.

Alphabetical order and numerical order for Roman numerals agree pretty well up to XXXVIII, with just a few numbers out of place, namely IX, XIX, and XXIX. But alphabetical order and numerical order diverge quite a bit for Arabic numerals when all the numbers between 10 and 19 come before 2.

As you go further out, alphabetical order and numerical order diverge for both writing systems, but especially for Roman numerals.

Knuth’s series for chi squared percentage points

Posted on 29 April 2025 by John

In the latest two posts, we have needed to find the percentage points for a chi square random variable when the parameter ν, the number of degrees of freedom, is large.

In Volume 2 of Knuth’s magnum opus TAOCP, he gives a table of percentage points for the chi square distribution for small values of ν and he gives an asymptotic approximation to be used for values of ν > 30.

Knuth’s series is

$\nu + \sqrt{2\nu}x_p + \frac{2}{3} x_p^2 - \frac{2}{3} + {\cal O}(1/\sqrt{\nu})$

where x_p is the corresponding percentage point function for a normal random variable. Knuth gives a few values of x_p in his table.

What Knuth calls x_p, I called p in the previous post. The approximation I gave, based on Fisher’s transformation to make the chi squared more like a normal, is similar to Knuth’s series, but not the same.

So which approximation is better? Where does Knuth’s approximation come from?

Comparing approximations

We want to find the quantiles for χ²(100) for p = 0.01 through 0.99. Both Fisher’s approximation and Knuth’s approximation are good enough that it’s hard to tell the curves apart visually when the exact values are plotted along with the two approximations.

But when you subtract off the exact values and just look at the errors, it’s clear that Knuth’s approximation is much better in general.

Here’s another plot, taking the absolute value of the error. This makes it easier to see little regions where Fisher’s approximation happens to be better.

Here are a couple more plots comparing the approximations at more extreme probabilities. First the left tail:

And then the right tail:

Derivation

The table mentioned above says to see exercise 16 in the same section, which in turn refers to Theorem 1.2.11.3A in Volume 1 of TAOCP. There Knuth derives an asymptotic series which is not directly the one above, but does most of the work. The exercise asks the reader to finish the derivation.

Chi squared approximations

Posted on 28 April 2025 by John

In the previous post I needed to know the tail percentile points for a chi squared distribution with a huge number of degrees of freedom. When the number of degrees of freedom ν is large, a chi squared random variable has approximately a normal distribution with the same mean and variance, namely mean ν and variance 2ν.

In that post, ν was 9999 and we needed to find the 2.5 and 97.5 percentiles. Here are the percentiles for χ²(9999):

    >>> chi2(9999).ppf([0.025, 0.975])
    array([ 9723.73223701, 10278.05632026])

And here are the percentiles for N(9999, √19998):

    >>> norm(9999, (2*9999)**0.5).ppf([0.025, 0.975])
    array([ 9721.83309451, 10276.16690549])

So the results on the left end agree to three significant figures and the results on the right agree to four.

Fewer degrees of freedom

When ν is more moderate, say ν = 30, the normal approximation is not so hot. (We’re stressing the approximation by looking fairly far out in the tails. Closer to the middle the fit is better.)

Here are the results for χ²(30):

    >>> chi2(30).ppf([0.025, 0.975])
    array([16.79077227, 46.97924224])

And here are the results for N(30, √60):

    >>> norm(30, (60)**0.5).ppf([0.025, 0.975])
    array([14.81818426, 45.18181574])

The normal distribution is symmetric and the chi squared distribution is not, though it becomes more symmetric as ν → ∞. Transformations of the chi squared distribution that make it more symmetric may also improve the approximation accuracy. That wasn’t important when we had ν = 9999, but it is more important when ν = 30.

Fisher transformation

If X ~ χ²(ν), Fisher suggested the approximation √(2X) ~ N(√(2ν − 1), 1).

Let Y be a N(√(2ν − 1), 1) random variable and Z a standard normal random variable, N(0, 1). Then we can estimate χ² probabilities from normal probabilities.

$\begin{align*} P(X \leq x) &= P(\sqrt{2X} \leq \sqrt{2x}) \\ &\approx P(Y \leq \sqrt{2x}) \\ &= P(Z \leq \sqrt{2x} - \sqrt{2\nu - 1}) \end{align*}$

So if we want to find the percentage points for X, we can solve for corresponding percentage points for Z.

If z is the point where P(Z ≤ z) = p, then

$x = \frac{(p + \sqrt{2\nu-1})^2}{2}$

is the point where P(X ≤ x) = p.

If we use this to find the 2.5 and 97.5 percentiles for a χ²(30) random variable, we get 16.36 and 46.48, an order of magnitude more accurate than before.

When ν = 9999, the Fisher transformation gives us percentiles that are two orders of magnitude more accurate than before.

Wilson–Hilferty transformation

If X ~ χ²(ν), the Wilson–Hilferty transformation is (X/ν)^1/3 is approximately normal with mean 1 − 2/9ν and variance 2/9ν.

This transformation is a little more complicated than the Fisher transform, but also more accurate. You could go through calculations similar to those above to approximate percentage points using the Wilson–Hilferty transformation.

The main use for approximations like this is now for analytical calculations; software packages can give accurate numerical results. For analytical calculation, the simplicity of the Fisher transformation may outweigh the improve accuracy of the Wilson-Hilferty transformation.

Variance of variances. All is variance.

Posted on 28 April 2025 by John

In the previous post, I did a simulation to illustrate a theorem about the number of steps needed in the Euclidean algorithm. The distribution of the number of steps is asymptotically normal, and for numbers 0 < a < b < x the mean is asymptotically

12 log(2) log(x) / π².

What about the variance? The reference cited in the previous post [1] gives an expression for the variance, but it’s complicated. The article said the variance is close to a constant times log x, where that constant involves the second derivative of “a certain pseudo zeta function associated with continued fractions.”

So let’s estimate the variance empirically. We can easily compute the sample variance to get an unbiased point estimate of the population variance, but how can we judge how accurate this point estimate is? We’d need the variance of our estimate of variance.

When I taught statistics classes, this is where students would struggle. There are levels of randomness going on. We have some random process. (Algorithm run times are not really random, but we’ll let that slide.)

We have a statistic S², the sample variance, which is computed from samples from a random process, so our statistic is itself a random variable. As such, it has a mean and a variance. The mean of S² is the population variance; that’s what it means for a statistic to be an unbiased estimator.

The statistic S² has a known distribution, which we can use to create a confidence interval. Namely,

$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2$

where the chi squared random variable has n − 1 degrees of freedom. Here n is the number of samples and σ² is the population variance. But wait, isn’t the whole purpose of this discussion to estimate σ² because we don’t know what it is?!

This isn’t quite circular reasoning. More like iterative reasoning. We have an estimate of σ² which we then turn around and use in constructing its confidence interval. It’s not perfect, but it’s useful.

A (1 − α) 100% confidence interval is given by

$\left( \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}, \frac{(n-1)s^2}{\chi^2_{\alpha/2}} \right)$

Notice that the right tail probability appears on the left end of the interval and the left tail probability appears on the right end of the interval. The order flips because they’re in a denominator.

I modified the simulation code from the previous post to compute the sample standard deviation for the runs, and got 4.6926, which corresponds to a sample variance of 22.0205. There were 10,000 reps, so our chi squared distribution has 9,999 degrees of freedom. We can compute the confidence interval as follows.

>>> from scipy.stats import chi2
>>> 9999*22.0205/chi2.ppf([0.975, 0.025], 9999)

This says a 95% confidence interval for the variance, in one particular simulation run, was [21.42, 22.64].

The result in [1] said that the variance is proportional to log x, and in our case n = 2⁶³ because that post simulated random numbers up to that limit. We can estimate that the proportionality constant is 63 log(2)/22.0205 = 0.5042. In other words, the variance is about half of log x.

The theorem in [1] gives us the asymptotic variance. If we used a bigger value of x, the variance could be a little different, but I expect it would be around 0.5 log x.

[1] Doug Hensley. The Number of Steps in the Euclidean Algorithm. Journal of Number Theory 49. 142–182 (1994).

The multiple coupon collector problem

Posted on 23 April 2025 by John

I’ve written about the Coupon Collector Problem and variations several times, most recently here.

Brian Beckman left a comment linking to an article he wrote, which in turn links to a paper on the Double Dixie Cup Problem [1].

The idea behind the Coupon Collector Problem is to estimate how long it will take to obtain at least one of each of n coupons when drawing coupons at random with replacement from a set of n.

The Double Dixie Cup Problem replaces coupons with dixie cups, and it asks how many couples you’d expect to need to collect until you have two of each cup.

The authors in [1] answer the more general question of how long to collect m of each cup, so m = 1 reduces to the Coupon Collector Problem. The punchline is Theorem 2 from the paper: the expected number of cups (or coupons) is

$n \left( \log n + (m-1) \log \log n + C_m + o(1) \right)$

for fixed m as n → ∞.

Here C_m is a constant that depends on m (but not on n) and o(1) is a term that goes to zero as n → ∞.

Since log log n is much smaller than log n when n is large, this says that collecting multiples of each coupon doesn’t take much longer, on average, than collecting one of each coupon. It takes substantially less time than playing the original coupon collector game m times.

[1] Donald J. Newman and Lawrence Shepp. The Double Dixie Cup Problem. The American Mathematical Monthly, Vol. 67, No. 1 (Jan., 1960), pp. 58–61.

Probability and Statistics

Distribution of correlation

Preliminaries

Correlated RNG

Calculating correlation

Simulation

Untransformed distribution

Transformed distribution

Small correlation example

Observation and conjecture

Distribution of coordinates on a sphere

Related posts

ODE to Fisher’s transform

Analyzing the Federalist Papers

Probability of rolling a Yahtzee

Sorting Roman numerals

Knuth’s series for chi squared percentage points

Comparing approximations

Derivation

Chi squared approximations

Fewer degrees of freedom

Fisher transformation

Wilson–Hilferty transformation

Related posts

Variance of variances. All is variance.

Related posts

The multiple coupon collector problem

Related posts