Sigmas and Student

I saw something yesterday saying that the Japanese bond market had experienced a six standard deviation move. This brought to mind a post I’d written eight years ago.

All probability statements depend on a model. And if your probability model says an event was six standard deviations from the mean, it’s more likely that your model is wrong than that you’ve actually seen something that rare. I expand on this idea here.

How likely is it that a sample from a random variable will be six standard deviations from its mean? If you have in mind a normal (Gaussian) distribution, as most people do, then the probability is on the order of 1 chance in a billion. Six sigma events are not common for any distribution, but they’re not unheard of for distributions with heavy tails.

Let X be a random variable with a Student t distribution and ν degrees of freedom. When ν is small, i.e. no more than 2, the tails of X are so fat that the standard deviation doesn’t exist. As ν → ∞ the Student t distribution approaches the normal distribution. So in some sense this distribution interpolates between fat tails and thin tails.

What is the probability that X takes on a value more than six standard deviations above its mean of 0? That is, what does the function

f(ν) = Prob(X > 6σ)

look like as a function of ν where σ² = ν/(ν − 2) is the variance of X?

As you’d expect, the limit of f(ν) as ν → ∞ is the probability of a six-sigma event for a normal distribution, around 10⁻⁹ as mentioned above. Here’s a plot of f(ν) for ν > 3. Notice that the vertical axis is on a log scale: the probability decreases roughly exponentially.
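
Here’s a minimal sketch of how one might compute f(ν) with SciPy; this is not necessarily the code behind the plot.

import numpy as np
from scipy.stats import t, norm

def f(nu):
    # P(X > 6 sigma) for a Student t variable with nu > 2 degrees of freedom,
    # where sigma^2 = nu/(nu - 2) is its variance
    sigma = np.sqrt(nu / (nu - 2))
    return t.sf(6 * sigma, df=nu)

print(f(2.5), f(10), f(100))
print(norm.sf(6))  # normal limit, about 1e-9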

What you might not expect is that f(ν) isn’t monotone. It rises to a maximum value before it decays exponentially. In hindsight this makes sense. As ν → 2+ the variance becomes infinite, and the probability of being infinitely far from the mean is 0. Here’s a plot of f(ν) between 2 and 3.

So six sigma probabilities for a Student t distribution rise from 0 up to a maximum of around 10⁻³, then decrease exponentially, asymptotically approaching a value around 10⁻⁹.

Stylometry

I was reading an article this morning that mentioned a stylometric analysis of a controversial paragraph written by the Roman historian Flavius Josephus. I’ve written several posts that could be called stylometry, or at least adjacent to it, but I haven’t used that word. Here are some posts that touch on the statistical analysis of a text or of an author.

Trying to fit exponential data

The first difficulty in trying to fit an exponential curve to data is that the data may not follow an exponential curve. Nothing grows exponentially forever. Eventually growth slows down. The simplest way growth can slow down is to follow a logistic curve, but fitting a logistic curve has its own problems, as detailed in the previous post.

Suppose you are convinced that whatever you want to model follows an exponential curve, at least over the time scale you’re interested in. This is easier to fit than a logistic curve: if you take the logarithm of the data, you have a linear regression problem. Linear regression is numerically well-behaved and has been thoroughly explored.
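
For example, here is a minimal sketch of the log-transform approach using NumPy. The data are made up for illustration.

import numpy as np

# made-up data, roughly y = 2 exp(t)
t = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 5.9, 16.2, 44.8, 120.3, 331.0])

# fit a line to log(y): the slope estimates the growth rate b,
# the intercept estimates log(a) in y = a exp(b t)
b, log_a = np.polyfit(t, np.log(y), 1)
print(np.exp(log_a), b)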

There is a catch, however. When you extrapolate a linear regression, your uncertainty region flares out as you go from observed data to predictions based on that data: your uncertainty grows linearly. But remember that we’re not working with the data per se; we’re working with the logarithm of the data. So on the original scale, the uncertainty flares out exponentially.

Weighting an average to minimize variance

Suppose you have $100 to invest in two independent assets, A and B, and you want to minimize volatility. Suppose A is more volatile than B. Then putting all your money on A would be the worst thing to do, but putting all your money on B would not be the best thing to do.

The optimal allocation would be some mix of A and B, with more (but not all) going to B. We will formalize this problem and determine the optimal allocation, then generalize the problem to more assets.

Two variables

Let X and Y be two independent random variables with finite variance and assume at least one of X and Y is not constant. We want to find t that minimizes

\text{Var}[tX + (1-t)Y]

subject to the constraint 0 ≤ t ≤ 1. Because X and Y are independent,

\text{Var}[tX + (1-t)Y] = t^2 \text{Var}[X] + (1-t)^2 \text{Var}[Y]

Taking the derivative with respect to t gives 2t Var[X] − 2(1 − t) Var[Y], and setting this to zero shows that

t = \frac{\text{Var}[Y]}{\text{Var}[X] + \text{Var}[Y]}

So the smaller the variance on Y, the less we allocate to X. If Y is constant, we allocate nothing to X and go all in on Y.  If X and Y have equal variance, we allocate an equal amount to each. If X has twice the variance of Y, we allocate 1/3 to X and 2/3 to Y.

Multiple variables

Now suppose we have n independent random variables X_i for i running from 1 to n, and at least one of the variables is not constant. Then we want to minimize

\text{Var}\left[ \sum_{i=1}^n t_i X_i \right] = \sum_{i=1}^n t_i^2 \text{Var}[X_i]

subject to the constraint

\sum_{i=1}^n t_i = 1

and all t_i non-negative. We can solve this optimization problem with Lagrange multipliers and find that

t_i \text{Var}[X_i] = t_j \text{Var}[X_j]

for all 1 ≤ i, j ≤ n. These n − 1 equations, along with the constraint that the t_i sum to 1, give us a system of equations whose solution is

t_i = \frac{\prod_{j \ne i} \text{Var}[X_j]}{\sum_{i = 1}^n \prod_{j \ne i} \text{Var}[X_j]}
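
Here’s a short sketch of the solution in Python, assuming none of the variables is constant. In that case all variances are positive and the weights are proportional to the reciprocals of the variances, which is equivalent to the product formula above.

import numpy as np

def min_variance_weights(variances):
    # t_i proportional to 1/Var[X_i], assuming all variances are positive
    w = 1.0 / np.asarray(variances, dtype=float)
    return w / w.sum()

print(min_variance_weights([2, 1]))  # allocate 1/3 to X and 2/3 to Y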

Incidentally, the denominator has a name: the (n − 1)st elementary symmetric polynomial in n variables. More on this in the next post.

Distribution of correlation

One of the more subtle ideas to convey in an introductory statistics class is that statistics have distributions.

Students implicitly think that when you calculate a statistic on a data set, say the mean, you then have THE mean. But if your data are (modeled as) samples from a random variable, then anything you compute from those samples, such as the mean, is also a random variable. A useful statistic is not as random as the data, i.e. it has smaller variance, but it’s still random.

A couple of days ago I wrote about Fisher’s transform for making the distribution of sample correlations closer to normal. This post will make that more concrete.

Preliminaries

We’ll need to bring in a few Python libraries. While we’re at it, let’s set the random number generator seed so the results will be reproducible.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew

np.random.seed(20251020)

Correlated RNG

Next, we’ll need a way to generate correlated random samples, specifying the correlation ρ and the sample size N.

def gen_correlated_samples(rho, N):
    mean = [0, 0]
    cov = [
        [1, rho],
        [rho, 1]
    ]
    return np.random.multivariate_normal(mean, cov, size=N)

Calculating correlation

Once we generate correlated pairs, we need to calculate their correlation, or more precisely their linear (Pearson) correlation. To do this we’ll compute the sample correlation matrix, the empirical counterpart to the covariance matrix specified in the generator code above (which, since both variances are 1, is also the correlation matrix). The correlation coefficient is the off-diagonal element of this matrix.

def pearsonr(X):
    correlation_matrix = np.corrcoef(X[:,0], X[:,1])
    return correlation_matrix[0, 1]

Simulation

Now we’re ready to do our simulation.

M = 10000
rs = np.zeros(M)
for i in range(M):
    X = gen_correlated_samples(0.9, 100)
    rs[i] = pearsonr(X)

Notice that there are two levels of sampling. We’re generating random samples of size 100 and computing their correlation; that’s sampling our underlying data. And we’re repeating the process of computing the correlation 10,000 times; that’s sampling the correlation.

Untransformed distribution

Next we view the distribution of the correlation values.

plt.hist(rs, bins=int(np.sqrt(M)))
plt.show()
plt.close()

This gives the following plot.

It’s strongly skewed to the left, which we can quantify by calculating the skewness.

print(skew(rs))

This tells us the skewness is −0.616. A normal distribution has skewness 0. The negative sign tells us the direction of the skew.

Transformed distribution

Now let’s apply the Fisher transformation and see how it makes the distribution much closer to normal.

xformed = np.arctanh(rs)
plt.hist(xformed, bins=int(np.sqrt(M)))
plt.show()
plt.close()
print(skew(xformed))

This produces the plot below and prints a skewness value of −0.0415.

Small correlation example

We said before that when the correlation ρ is near zero, the Fisher transformation is less necessary. Here’s an example where ρ = 0.1. It’s not visibly different from a normal distribution, and the skewness is −0.1044.
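
Here’s one way to generate that example, reusing the functions defined above. (The exact skewness value depends on the state of the random number generator.)

rs_small = np.zeros(M)
for i in range(M):
    X = gen_correlated_samples(0.1, 100)
    rs_small[i] = pearsonr(X)
plt.hist(rs_small, bins=int(np.sqrt(M)))
plt.show()
plt.close()
print(skew(rs_small))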

Observation and conjecture

In our two examples, the skewness was approximately −ρ. Was that a coincidence, or does that hold more generally? We can test this with the following code.

def skewness(rho):
    rs = np.zeros(M)
    for i in range(M):
        X = gen_correlated_samples(rho, 100)
        rs[i] = pearsonr(X)
    return skew(rs)
    
rhos = np.linspace(-1, 1, 100)
ks = [skewness(rho) for rho in rhos]
plt.plot(rhos, ks)
plt.plot(rhos, -rhos, "--", color="gray")
plt.show()

Here’s the resulting plot.

It looks like the skewness is not exactly −ρ, but −cρ for some c < 1. Maybe c depends on the inner sample size, in our case 100. But it sure looks like skewness is at least approximately proportional to ρ. Maybe this is a well-known result, but I haven’t seen it before.

Distribution of coordinates on a sphere

Almost Sure posted an interesting fact on X:

If a point (x, y, z) is chosen at random uniformly on the unit sphere, then x, y, and z each have the uniform distribution on [−1, 1] and zero correlations (but not independent!) This follows from Archimedes’ “On the Sphere and Cylinder” published in 225BC.

Archimedes effectively showed that projecting a sphere to a cylinder wrapping round its equator preserves area, so projection of (x,y,z) is uniform on the cylinder. Used in Lambert cylindrical equal area projection.

In this post I want to empirically demonstrate the distribution of the coordinates, their lack of linear correlation, and their dependence.

The usual way to generate random points on a sphere is to generate three samples from a standard normal distribution and normalize by dividing by their Euclidean norm. So to generate a list of N points on a sphere, I generate an N × 3 matrix of normal samples and divide each row by its norm.

import numpy as np
from scipy.stats import norm

N = 10000
# Generate N points on a sphere
M = norm.rvs(size=(N, 3))
M /= np.linalg.norm(M, axis=1)[:, np.newaxis]

Next, I find the correlation of each column with the others.

from scipy.stats import pearsonr

x = M[:, 0]
y = M[:, 1]
z = M[:, 2]
cxy, _ = pearsonr(x, y)
cyz, _ = pearsonr(y, z)
cxz, _ = pearsonr(x, z)
print(cxy, cyz, cxz)

When I ran this I got correlations of -0.0023, 0.0047, and -0.0177. For reasons explained in the previous post, I’d expect values in the interval [−0.02, 0.02] if the columns were independent, so the correlations are consistent with that hypothesis.

To demonstrate that each column is uniformly distributed on [−1, 1] you could look at histograms.

import matplotlib.pyplot as plt

plt.hist(x, bins=int(N**0.5))
plt.show()
plt.hist(y, bins=int(N**0.5))
plt.show()
plt.hist(z, bins=int(N**0.5))
plt.show()

I’ll spare you the plots, but they look like what you’d expect. If you wanted to do something more analytical than visual, you could compute something like the Kolmogorov-Smirnov statistic.
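
For example, here’s a sketch of a Kolmogorov–Smirnov test of the x coordinates against the uniform distribution on [−1, 1].

from scipy.stats import kstest, uniform

# A large p-value is consistent with x being uniform on [-1, 1]
print(kstest(x, uniform(loc=-1, scale=2).cdf))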

The coordinates are not independent, though they have zero linear correlation. Maybe a test for nonlinear dependence will work. I tried Spearman’s rank correlation and Kendall’s tau, and neither detected the dependence. But distance correlation did.

from dcor import distance_correlation

cxy = distance_correlation(x, y)
cyz = distance_correlation(y, z)
cxz = distance_correlation(x, z)
print(cxy, cyz, cxz)

Each of the distance correlations was approximately 0.11, substantially greater than zero, implying dependence between the coordinates.

Inferring sample size from confidence interval

The previous post reported that a study found a 95% confidence interval for the area of the Mandelbrot set to be 1.506484 ± 0.000004. What was the sample size used to come to that conclusion?

A 95% confidence interval for a proportion is given by

\hat{p} \pm 1.96 \sqrt{\frac{\hat{p} \,(1 - \hat{p})}{n}}

and so if a confidence interval of width w is centered at the proportion estimate p hat, then we can solve

\frac{w}{2} = 1.96 \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}

to find

n = 15.37 \frac{\hat{p}(1 - \hat{p})}{w^2}

Now in the example at the top of the post, we’re randomly sampling points from the square [−2, 2] × [−2, 2] and counting how many land inside the Mandelbrot set. The square has area 16, so p hat equals 1.506484 / 16. The width of the confidence interval for the area is 8 × 10⁻⁶. This means the width of the confidence interval for the proportion is 5 × 10⁻⁷, and so n = 5.244 × 10¹².
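
Here is that arithmetic as a quick check in Python.

p_hat = 1.506484 / 16   # estimated proportion of the square inside the set
w = 8e-6 / 16           # width of the confidence interval for the proportion
n = (2 * 1.96 / w)**2 * p_hat * (1 - p_hat)
print(f"{n:.3e}")       # about 5.24e12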

Doubts regarding Mandelbrot area

The reasoning above is correct for inferring sample size from a confidence interval, but the specific application to the Mandelbrot set is suspect. Did someone really do over a trillion iterations? Apparently not.

As far as I can tell, this was the source of the estimate above. It’s not based on random samples but rather on regular grids of samples, treated as if they were random samples, and then extrapolated to get the estimate above. The result may be correct, but it’s not a simple Monte Carlo estimate. I haven’t looked at the page carefully, but I suspect the reported confidence interval is too small; error bars tend to flare out under extrapolation.

You can’t have everything you want: beta edition

The beta distribution is a conjugate prior for a binomial likelihood function, so it makes posterior probability calculations trivial: you simply add your data to the distribution parameters. If you start with a beta(α, β) prior distribution on a proportion θ, then observe s successes and f failures, the posterior distribution on θ is

beta(α + s, β + f).

There’s an integration going on in the background, but you don’t have to think about it. If you used any other prior, you’d have to calculate an integral, and it wouldn’t have a simple closed form.
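
As a small illustration, here’s the update in Python. The prior parameters and the data are made up.

from scipy.stats import beta

a, b = 2, 3      # hypothetical beta(2, 3) prior
s, f = 7, 3      # hypothetical data: 7 successes, 3 failures
posterior = beta(a + s, b + f)
print(posterior.mean())  # posterior mean of theta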

The sum α + β of the prior distribution parameters is interpreted as the number of observations’ worth of information contained in the prior. For example, a beta(0.5, 0.5) prior is non-informative, having only as much information as one observation.

I was thinking about all this yesterday because I was working on a project where the client believes that a proportion θ is around 0.9. A reasonable choice of prior would be beta(0.9, 0.1). Such a distribution has mean 0.9 and is non-informative, and it would be fine for the use I had in mind.

Non-informative beta priors work well in practice, but they’re a little unsettling if you plot them. You’ll notice that the beta(0.9, 0.1) density has singularities at both ends of the interval [0, 1]. The singularity on the right end isn’t so bad, but it’s curious that a distribution designed to have mean 0.9 has infinite density at 0.

If you went further and used a completely non-informative, improper beta(0, 0) prior, you’d have strong singularities at both ends of the interval. And yet this isn’t a problem in practice. Statistical operations that are not expressed in Bayesian terms are often equivalent to a Bayesian analysis with an improper prior like this.

This raises two questions. Why must a non-informative prior put infinite density somewhere, and why is this not a problem?

Beta densities with small parameters have singularities because that’s just the limitation of the beta family. If you were using a strictly subjective Bayesian framework, you would have to say that such densities don’t match anyone’s subjective priors. But the beta distribution is very convenient to use, as explained at the top of this post, and so everyone makes a little ideological compromise.

As for why singular priors are not a problem, it boils down to the difference between density and mass. A beta(0.9, 0.1) distribution has infinite density at 0, but it doesn’t put much mass at 0. If you integrate the density over a small interval, you get a small probability. The singularity on the left is hard to see in the plot above, almost a vertical line that overlaps the vertical axis. So despite the curve going vertical, there isn’t much area under that part of the curve.
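
To make that concrete, here’s a quick check of how little mass the beta(0.9, 0.1) distribution puts near its singularity at 0. The cutoff 0.01 is arbitrary.

from scipy.stats import beta

# probability mass in [0, 0.01], despite the infinite density at 0
print(beta(0.9, 0.1).cdf(0.01))  # well under 1%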

You can’t always get what you want. But sometimes you get what you need.

The Rise and Fall of Bayesian Statistics

At one time Bayesian statistics was not just a minority approach, it was considered controversial or fringe. When I was in grad school, someone confided in me that he was a closet Bayesian. He thought the Bayesian approach to statistics made sense, but didn’t want to jeopardize his career by saying so publicly.

Then somewhere along the way, maybe 20 years ago or so, Bayesian analysis not only became acceptable, it became hot. People would throw around the term Bayesian much like they throw around AI now.

During the Bayesian heyday, someone said that you’d know Bayes won when people would quit putting the word “Bayesian” in the title of their papers. That happened. I’m not sure when, but maybe around 2013? That was the year I went out on my own as a consultant. I thought maybe I could cash in on some of the hype over Bayesian statistics, but the hype had already subsided by then.

It’s strange that Bayes was ever scandalous, or that it was ever sexy. It’s just math [1]. You’d never look askance at someone for studying Banach algebras, nor would you treat them like a celebrity.

Bayesian statistics hasn’t fallen, but the hype around it has. The utility of Bayesian statistics has improved as the theory and its software tools have matured. In fact, it has matured to the point that people don’t feel the need to emphasize that their work is Bayesian.

[1] Statistics isn’t exactly math, but that’s a topic for another day.

Analyzing the Federalist Papers

The Federalist Papers, a collection of 85 essays published anonymously between 1787 and 1788, were one of the first subjects for natural language processing aided by a computer. Because the papers were anonymous, people were naturally curious who wrote each of the essays. Early on it was determined that the authors were Alexander Hamilton, James Madison, and John Jay, but the authorship of individual essays wasn’t known.

In 1944, Douglass Adair conjectured the authorship of each essay, and twenty years later Frederick Mosteller and David Wallace confirmed Adair’s conclusions by Bayesian analysis. Mosteller and Wallace used a computer to carry out their statistical calculations, but they did not have an electronic version of the text.

They physically chopped a printed copy of the text into individual words and counted them. Mosteller recounted in his autobiography that until working on The Federalist Papers, he had underestimated how hard it is to count a large number of things, especially little pieces of paper that could be scattered by a draft.

I’m not familiar with how Mosteller and Wallace did their analysis, but I presume they formed a prior distribution on the frequency of various words in writings known to be by Hamilton, Madison, and Jay, then computed the posterior probability of authorship by each author for each essay.

The authorship of the papers was summarized in the song “Non-Stop” from the musical Hamilton:

The plan was to write a total of twenty-five essays, the work divided evenly among the three men. In the end, they wrote eighty-five essays in the span of six months. John Jay got sick after writing five. James Madison wrote twenty-nine. Hamilton wrote the other fifty-one!

Yesterday I wrote about the TF-IDF statistic for the importance of words in a corpus of documents. In that post I used the books of the Bible as my corpus. Today I wanted to reuse the code I wrote for that post by applying it to The Federalist Papers.
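
The code from that earlier post isn’t shown here, but a minimal sketch of the idea using scikit-learn’s TfidfVectorizer on a placeholder corpus might look like the following. The scores it produces won’t match the numbers below, which come from the original code.

from sklearn.feature_extraction.text import TfidfVectorizer

# placeholder corpus; in practice each string would be the full text of an essay
essays = [
    "the union of the states guards against domestic faction and insurrection",
    "a well constructed union breaks and controls the violence of faction",
    "the judiciary and the executive derive their powers from the people",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(essays)
words = vectorizer.get_feature_names_out()

# top-scoring words for the second "essay"
row = tfidf[1].toarray().ravel()
for i in row.argsort()[::-1][:5]:
    print(f"{words[i]}: {row[i]:.4f}")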

Federalist No. 10 is the best known essay in the collection. Here are the words with the highest TF-IDF scores from that essay.

faction: 0.0084
majority: 0.0047
democracy: 0.0044
controlling: 0.0044
parties: 0.0039
republic: 0.0036
cure: 0.0035
factious: 0.0035
property: 0.0033
faculties: 0.0033

I skimmed a list of the most important words in the essays by Madison and Hamilton and noticed that Madison’s list had several words from classical literature: Achaeans, Athens, Draco, Lycurgus, Sparta, etc. There were only a couple of classical references in Hamilton’s top words: Lysander and Pericles. I noticed that “debt” was important to Hamilton.

You can find the list of top 10 words in each essay here.