First time seeing a rare event

Suppose you’ve been monitoring a rare event for a long time, then you see your first occurrence on the Nth observation. Now what would you say about the event’s probability?

For example, suppose you’re wondering whether dogs ever have two tails. You observe thousands of dogs and never see two tails. But then one day you see a dog with two tails. Now what can you say about the probability of dogs having two tails? It’s certainly not zero.

We’ll first look at the case of 0 successes out of N trials, then at the case of 1 success out of N trials.

Zero successes

If you’re observing a binary event and you’ve seen no successes out of N trials, your point estimate of the probability of your event is 0. You can’t have any confidence in the relative accuracy of your estimate: if the true probability is positive, no matter how small, then the relative error in your estimate is infinite.

But you can have a great deal of confidence in its absolute accuracy. When you’re looking for a binary event and you have not seen any instances in N trials for large N, then a 95% confidence interval for the event’s probability is approximately [0, 3/N]. This is the statistical rule of three. This is a robust estimate, one you could derive from either a frequentist or Bayesian perspective.

Note that the confidence interval [0, 3/N] is exceptionally narrow. When observing a moderate mix of successes and failures the width of the confidence interval is on the order of 1/√N, not 1/N.
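
As a quick sanity check, here is a minimal sketch (with an arbitrary choice of N) comparing 3/N to the exact one-sided 95% upper bound, which solves (1 − p)^N = 0.05.

    # Rule of three vs. the exact upper bound after zero successes in N trials
    N = 100_000
    exact = 1 - 0.05**(1/N)   # solves (1 - p)**N = 0.05
    print(exact, 3/N)         # about 2.9957e-05 vs 3e-05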

First success

After seeing your first success, your point estimate jumps from 0 to 1/N, an infinite relative increase. What happens to your confidence interval?

If we use Jeffreys’ beta(1/2, 1/2) prior, then the posterior distribution after seeing 1 success and N − 1 failures is a beta(3/2, N − 1/2). Now an approximate 95% confidence interval is

[0.1/N, 4.7/N]

So compared to the case of seeing zero successes, seeing one success makes our confidence interval about 50% wider and shifts it to the right: the lower endpoint moves from 0 to 0.1/N.

So if you’ve seen 100,000 dogs and only 1 had two tails, you could estimate that a 95% confidence interval for the probability of a dog having two tails is

[10⁻⁶, 4.7 × 10⁻⁵].

If we run the exact numbers we get

[1.07 × 10⁻⁶, 4.67 × 10⁻⁵].
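
Here is a minimal sketch of that exact computation, assuming an equal-tailed interval from the beta(3/2, N − 1/2) posterior.

    # Equal-tailed 95% interval from the beta(3/2, N - 1/2) posterior
    from scipy.stats import beta

    N = 100_000
    lo, hi = beta.ppf([0.025, 0.975], 1.5, N - 0.5)
    print(lo, hi)   # roughly 1e-6 and 4.7e-5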


Can the chi squared test detect fake primes?


This morning I wrote about Dan Piponi’s fake prime function. This evening I thought about it again and wondered whether the chi-squared test could tell the difference between the distribution of digits in real primes and fake primes.

If the distributions were obviously different, this would be apparent from looking at histograms. When distributions are more similar, visual inspection is not as reliable as a test like chi-squared. For small samples, we can be overly critical of plausible variations. For large samples, statistical tests can detect differences our eyes cannot.

When data fall into a number of buckets, with a moderate number of items expected to fall in each bucket, the chi squared test attempts to measure how the actual counts in each bucket compare to the expected counts.

This is a two-sided test. For example, suppose you expect 12% of items to fall in bucket A and 88% to fall in bucket B. Now suppose you test 1,000 items. It would be suspicious if only 50 items landed in bucket A since we’d expect around 120. On the other hand, getting exactly 120 items would be suspicious too. Getting exactly the expected number is unexpected!

Let’s look at the primes, genuine and fake, less than N = 200. We’ll take the distribution of digits in the list of primes as the expected values and compare to the distribution of the digits in the fake primes.
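
Here is a rough sketch of how such an experiment could be set up. The fake primes below are only a stand-in, generated from a Cramér-style random model (keeping each n ≥ 3 with probability 1/log n) rather than Dan Piponi's actual fake prime function, so the numbers it produces won't match those reported below.

    import numpy as np
    from scipy.stats import chisquare
    from sympy import primerange

    def digit_counts(ns):
        counts = np.zeros(10)
        for n in ns:
            for d in str(n):
                counts[int(d)] += 1
        return counts

    rng = np.random.default_rng(20230810)
    N = 200
    fake = [n for n in range(3, N) if rng.random() < 1/np.log(n)]

    observed = digit_counts(fake)
    expected = digit_counts(primerange(2, N))
    expected = expected / expected.sum() * observed.sum()  # match totals

    print(chisquare(observed, expected))  # chi-squared statistic and p-value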

When I ran this experiment, I got a chi-squared value of 7.77. This is an unremarkable value for a sample from a chi-squared distribution with 9 degrees of freedom. (There are ten digits, but only nine degrees of freedom because if you rule out nine possibilities then the digit is determined with certainty.)

The p-value in this case, the probability of seeing a value as large as the one we saw or larger, is 0.557.

Next I increased N to 1,000 and ran the experiment again. Now I got a chi-squared value of 19.08, with a corresponding p-value of 0.024. When I set N to 10,000 I got a chi-squared value of 18.19, with a corresponding p-value of 0.033.

When I used N = 100,000 I got a chi-squared value of 130.26, corresponding to a p-value of 10⁻²³. Finally, when I used N = 1,000,000 I got a chi-squared value of 984.7 and a p-value of 3.4 × 10⁻²⁰⁶.

In short, the chi-squared test needs a fair amount of data to tell that fake primes are fake. The distribution of digits for samples of fake primes less than a thousand or so is plausibly the same as that of actual primes, as far as the test can distinguish. But the chi-squared values get implausibly large for fake primes up to 100,000.

Random points in a high-dimensional orthant

In high dimensions, randomly chosen vectors are very likely nearly orthogonal. I’ll unpack this a little bit then demonstrate it by simulation. Then I’ll look at what happens when we restrict our attention to points with positive coordinates.

***

The lengths of vectors don’t contribute to the angles between them, so we may as well focus our attention on spheres.

Suppose we pick two points on a familiar sphere in 3 dimensions, our home planet. By symmetry, we can assume the first vector runs from the center of the earth to the north pole. Otherwise we can imagine picking the first point, then rotating the earth so that the chosen point becomes the north pole.

What vectors are orthogonal (perpendicular) to the north pole? The vectors from the center to a point on the equator. The vectors that are nearly orthogonal to the north pole are the vectors that are near the equator.

On a high-dimensional sphere, nearly all the area is near the equator (more on that here). This means that if we choose two vectors, they are very likely to be nearly orthogonal.

Simulating points on sphere

I’ll illustrate this by choosing pairs of points at random on a unit sphere in 100 dimensions and computing their dot product. This gives the cosine similarity of the points, the cosine of the angle between them. I’ll repeat this process 1,000 times and look at the average cosine similarity.

    from scipy.stats import norm
    import numpy as np

    np.random.seed(20230809)

    numruns = 1000
    avg = 0
    dim = 100
    for _ in range(numruns):
        x = norm(0,1).rvs(size = dim)
        y = norm(0,1).rvs(size = dim)
        avg += np.dot(x, y)/(np.dot(x,x)*np.dot(y,y))**0.5
    print(avg/numruns)

Here we use the standard method of generating random points on a sphere: generate each component as a normal (Gaussian) random variable, then divide by the length. The code above combines the normalization with finding the dot product.

The code above prints 0.00043. The average dot product between vectors is very small, i.e. the vectors are nearly orthogonal.

Simulating points in orthant

Next let’s look at one orthant of the sphere. (“Orthant” is the generalization to higher dimensions of quadrant in two dimensions and octant in three dimensions.) We’ll generate random samples from the first orthant by first generating random samples from the whole sphere, then taking the absolute value to fold all the points from the other orthants into the first orthant.

The only necessary change to the code is to put abs() around the code generating the random points.

    x = abs(norm(0,1).rvs(size = dim))
    y = abs(norm(0,1).rvs(size = dim))

When we run the code again it prints 0.639. This is the average value of the cosine of the angle between points, so the average angle is about 0.877 radians or about 50°.

This isn’t quite the right thing to do. We should convert to angles first, then take the average. But that leads to essentially the same result.

If my pencil-and-paper calculations are correct, the exact expected dot product value approaches 2/π = 0.6366 as dimension goes to infinity, and the value of 0.639 above would be consistent with this.
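
One heuristic for where the 2/π comes from: each coordinate of a folded standard normal vector has mean

\text{E}\,|Z| = \sqrt{2/\pi}

so the expected dot product of two such vectors in n dimensions is about 2n/π, while each vector’s length is close to √n, which makes the cosine approach 2/π as n grows.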


Variance of binned data

Suppose you have data that for some reason has been summarized into bins of width h. You don’t have the original data, only the number of counts in each bin.

You can’t exactly find the sample mean or sample variance of the data because you don’t actually have the data. But what’s the best you can do? A sensible approach would be to imagine that the contents of each bin represent samples that fell in the middle of the bin. For example, suppose your data were rounded down to the nearest integer. If you have 3 observations in the bin [5, 6] you could think of that as the value 5.5 repeated three times.

When we do this, we don’t recover the data set, but we create a pseudo data set that we can then find the sample mean and sample variance of.

This works well for estimating the mean, but it overestimates the variance. William Sheppard (1863–1936) recommended what is now known as Sheppard’s correction, subtracting h²/12 from the sample variance of the pseudo data. Richardson’s paper mentioned in the previous post fits Sheppard’s correction into the general framework of “the deferred approach to the limit.”

A more probabilistic justification of Sheppard’s correction is that h²/12 is the variance of a uniform random variable over an interval of width h. You could think of Sheppard’s correction as jittering the observations uniformly within each bin.
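
As a quick numerical check (a sketch with an arbitrary seed), the variance of a uniform jitter over a bin of width 1 is indeed about 1/12.

    import numpy as np

    u = np.random.default_rng(0).uniform(0, 1, size=1_000_000)
    print(u.var(), 1/12)  # both approximately 0.0833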

Let’s do a little simulation to demonstrate the accuracy of estimating the sample variance with and without Sheppard’s correction.

    from numpy import random, floor
    from scipy.stats import norm

    random.seed(20230731)

    numruns = 1000
    rawbias = 0
    sheppardbias = 0
    for run in range(numruns):
        y = norm(0, 3).rvs(size=1000)
        ypseudo = floor(y) + 0.5
        v = ypseudo.var(ddof=1)
        rawbias += abs(v - y.var(ddof=1))
        sheppardbias += abs(v - y.var(ddof=1) - 1/12)

    print(rawbias/numruns, sheppardbias/numruns)

This Python script computes the variance of the pseudodata, with and without the Sheppard correction, and compares this value to the sample variance of the actual data. This process is repeated one thousand times.

On average, the uncorrected variance of the pseudodata was off by 0.0875 compared to the variance of the actual data. With Sheppard’s correction this drops to an average of 0.0442, i.e. the calculation with Sheppard’s correction was about twice as accurate.

Every Japanese prefecture shrinking

It’s well known that the population of Japan has been decreasing for years, and so I was a little puzzled by a recent headline saying that Japan’s population has dropped in every one of its 47 prefectures. Although the national population is in decline, until now the nation’s 47 prefectures had never all dropped in population at the same time. Is this remarkable? Actually, yes.

This post will not be about Japan’s demographics in detail, but was motivated by the headline mentioned above.

Demographics is not random; it is the result of individual human choices. But for this post we’ll look at demographics as if it were random.

Suppose each prefecture had the same probability p of declining. Then we could infer that p is probably quite large for all prefectures to decline at the same time. If you flip a coin 47 times and it comes up heads each time, you don’t imagine it’s equally likely to come up heads or tails the next time. (More on that here.)

For a Bernoulli random variable, how large would p have to be in order for 47 successes out of 47 trials to have probability at least 1/2?

If p^47 = 1/2, then p = (1/2)^(1/47) ≈ 0.985.
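
A quick check of that arithmetic:

    p = 0.5 ** (1/47)
    print(p, p**47)  # about 0.985, and 0.5 by construction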

If the value of p is different in each prefecture, and the product of all the p’s is 1/2, then each p must be at least 1/2, and nearly all must be much larger. If one of the p’s were equal to 1/2, the rest would have to be 1.

Applying this understanding to the headline at the top of the post, we could say that the fact that every prefecture has gone down in population is strong evidence that population will continue to drop, as long as current conditions continue.


A note on Zipf’s law

Very often when a number is large, and we don’t know or care exactly how large it is, we can model it as infinite. This may make no practical difference and can make calculations much easier. I give several examples of this in the article Infinite is easier than big.

When you run across a statement that something is infinite, you can often mentally replace “infinite” with “big, but so big that it doesn’t matter exactly how big it is.” Conversely, the term “finite” in applied mathematics means “limited in a way that matters.”

But when we say something does or does not matter, there’s some implicit context. It does or does not matter for some purpose. We can get into trouble when we’re unclear about that context or when we shift contexts without realizing it.

Zipf’s law

Zipf’s law came up a couple days ago in a post looking at rare Chinese characters. That post resolves a sort of paradox, that rare characters come up fairly often. Each of these rare characters is very unlikely to occur in a document, and yet collectively it’s likely that some of them will be necessary.

In some contexts we can consider the number of Chinese characters or English words to be infinite. If you want to learn English by first learning the meaning of every English word, you will never get around to using English. The number of words is infinite in the sense that you will never get to the end.

Zipf’s law says that the frequency of the nth word in a language is proportional to 1/n^s. This is of course an approximation, but a surprisingly accurate approximation for such a simple statement.

Infinite N?

The Zipf probability distribution depends on two parameters: the exponent s and the number of words N. Since the number of words in any spoken language is large, and it’s impossible to say exactly how large it is, can we assume N = ∞? This seems like exactly what we were talking about at the top of this article.

Whether it is possible to model N as infinite depends on s. The value of s that models word frequency in many languages is close to 1. Log-log plots of word frequency for the text of Wikipedia in many different languages have a slope of approximately 1 in each case, i.e. s ≈ 1.

When s = 1, we don’t have a probability distribution because the sum of 1/n from 1 to ∞ diverges. And so no, we cannot assume N = ∞.

Now you may object that all we have to do is set s to be slightly larger than 1. If s = 1.0000001, then the sum of 1/n^s converges. Problem solved. But not really.

When s = 1 the series diverges, but when s is slightly larger than 1 the sum is very large. Practically speaking this assigns too much probability to rare words.
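
Here is a rough illustration, using N = 50,000 as an example vocabulary size. When s is barely above 1 the infinite sum converges, but to an enormous value, which is what pushes far too much probability onto the long tail of rare words.

    import numpy as np
    from scipy.special import zeta

    N = 50_000
    n = np.arange(1, N + 1, dtype=float)

    print(np.sum(1/n))              # s = 1, finite N: about 11.4
    print(zeta(1.0000001))          # s slightly above 1, N infinite: about 10^7
    print(zeta(2), np.sum(1/n**2))  # s = 2: infinite and finite sums nearly equal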

If s = 2, for example, one could set N = ∞. The Zipf distribution with s = 2 may be useful for modeling some things, but not for modeling word frequency.

Zipf’s law applied to word frequency is an interesting example of a model that contains a large N, and although it doesn’t matter exactly how large N is, it matters that N is finite. In the earlier article, I used the estimate that Chinese has 50,000 characters. If I had estimated 60,000 characters the conclusion of the article would not have been too different. But assuming an infinite number of characters would result in substantially overestimating the frequency of rare words.



V-statistics

A few days ago I wrote about U-statistics, statistics which can be expressed as the average of a symmetric function over all combinations of elements of a set. V-statistics can be written as the average of a symmetric function over all tuples of elements of a set, i.e. over the Cartesian product of the set with itself.

Let S be a statistical sample of size n and let h be a symmetric function of r elements. The average of h over all subsets of S with r elements is a U-statistic. The average of h over the Cartesian product of S with itself r times

\underbrace{S \times S \times \cdots \times S}_{r \text{ times}}

is a V-statistic.

As in the previous post, let h(x, y) = (x − y)²/2. We can illustrate the V-statistic associated with h with Python code as before.

    import numpy as np
    from itertools import product

    def var(xs):
        n = len(xs)
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in product(xs, repeat=2)) / n**2

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs))
    print(var(xs))

This time, however, we iterate over product rather than over combinations. Note also that at the bottom of the code we print

   np.var(xs)

rather than

   np.var(xs, ddof=1)

This means our code here is computing the population variance, not the sample variance. We could make this more explicit by supplying the default value of ddof.

   np.var(xs, ddof=0)

The point of V-statistics is not to calculate them as above, but that they could be calculated as above. Knowing that a statistic is an average of a symmetric function is theoretically advantageous, but computing a statistic this way would be inefficient.

U-statistics are averages of a function h over all subsamples of S of size r without replacement. V-statistics are averages of h over all subsamples of size r with replacement. The difference between sampling with or without replacement goes away as n increases, and so V-statistics have the same asymptotic properties as U-statistics.


Moments of Tukey’s g-and-h distribution

John Tukey developed his so-called g-and-h distribution to be very flexible, having a wide variety of possible values of skewness and kurtosis. Although the reason for the distribution’s existence is its range of possible skewness and kurtosis values, calculating the skewness and kurtosis of the distribution is not simple.

Definition

Let φ be the function of one variable and four parameters defined by

\varphi(x; a, b, g, h) = a + b\left( \frac{\exp(gx) - 1}{g} \right) \exp(hx^2/2)

A random variable Y has a g-and-h distribution if it has the same distribution as φ(Z; a, b, g, h) where Z is a standard normal random variable. Said another way, if Y has a g-and-h distribution then the transformation φ⁻¹ makes the data normal.

The a and b parameters are for location and scale. The name of the distribution comes from the parameters g and h that control skewness and kurtosis respectively.

The transformation φ is invertible, but φ⁻¹ does not have a closed form; φ⁻¹ must be computed numerically. It follows that the density function for Y does not have a closed form either.

Special cases

The g distribution is the g-and-h distribution with h = 0. It generalizes the log normal distribution.

The limit of the g-and-h distribution as g goes to 0 is the h distribution.

If g and h are both zero we get the normal distribution.

Calculating skewness and kurtosis

The following method of computing the moments of Y comes from [1].

Define f by

f(g, h, i) = \frac{1}{g^i\sqrt{1 - ih}} \sum_{r=0}^i (-1)^r \binom{i}{r} \exp\left(\frac{((i-r)g)^2}{2(1-ih)}\right)

Then the raw moments of Y are given by

\text{E} \, Y^m = \sum_{i=0}^m \binom{m}{i} a^{m-i}b^i f(g,h,i)

Skewness is the 3rd standardized moment and kurtosis is the 4th standardized moment. Equations for finding central moments from raw moments are given here.
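
Here is a minimal sketch of these formulas with a Monte Carlo check. The parameter values are arbitrary, the (−1)^r sign matches the formula above, and this is an illustration rather than code from [1].

    import numpy as np
    from scipy.special import comb
    from scipy.stats import norm

    def f(g, h, i):
        # valid when i*h < 1
        return sum((-1)**r * comb(i, r) * np.exp(((i - r)*g)**2 / (2*(1 - i*h)))
                   for r in range(i + 1)) / (g**i * np.sqrt(1 - i*h))

    def raw_moment(m, a, b, g, h):
        return sum(comb(m, i) * a**(m - i) * b**i * f(g, h, i)
                   for i in range(m + 1))

    a, b, g, h = 0, 1, 0.2, 0.1
    z = norm.rvs(size=1_000_000, random_state=20230808)
    y = a + b*(np.exp(g*z) - 1)/g * np.exp(h*z**2/2)

    for m in range(1, 5):
        # formula vs. Monte Carlo estimate (higher moments converge slowly)
        print(m, raw_moment(m, a, b, g, h), np.mean(y**m))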


[1] James B. McDonald and Patrick Turley. Distributional Characteristics: Just a Few More Moments. The American Statistician, Vol. 65, No. 2 (May 2011), pp. 96–103

Symmetric functions and U-statistics

A symmetric function is a function whose value is unchanged under every permutation of its arguments. The previous post showed how three symmetric functions of the sides of a triangle

  • a + b + c
  • ab + bc + ac
  • abc

are related to the perimeter, inner radius, and outer radius. It also mentioned that the coefficients of a cubic equation are symmetric functions of its roots.

This post looks briefly at symmetric functions in the context of statistics.

Let h be a symmetric function of r variables and suppose we have a set S of n numbers where n ≥ r. If we average h over all subsets of size r drawn from S then the result is another symmetric function, called a U-statistic. The “U” stands for unbiased.

If h(x) = x then the corresponding U-statistic is the sample mean.

If h(x, y) = (x − y)²/2 then the corresponding U-statistic is the sample variance. Note that this is the sample variance, not the population variance. You could see this as a justification for why sample variance has an n − 1 in the denominator while the corresponding term for population variance has an n.

Here is some Python code that demonstrates that the average of (x − y)²/2 over all pairs in a sample is indeed the sample variance.

    import numpy as np
    from itertools import combinations

    def var(xs):
        n = len(xs)
        bin = n*(n-1)/2    
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in combinations(xs, 2)) / bin

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs, ddof=1))
    print(var(xs))

Note the ddof term that causes NumPy to compute the sample variance rather than the population variance.

Many statistics can be formulated as U-statistics, and so numerous properties of such statistics are corollaries of general results about U-statistics. For example, U-statistics are asymptotically normal, and so sample variance is asymptotically normal.

Best line to fit three points

Suppose you want to fit a line to three data points. If no line passes exactly through your points, what’s the best compromise you could make?

minmax line for three points

Chebyshev suggested the best thing to do is find the minmax line, the line that minimizes the maximum error. That is, for each candidate line, find the vertical distance from each point to the line, and take the largest of these three distances as the measure of how well the line fits. The best line is the one that minimizes this measure.

Note that this is not the same line you would get if you did a linear regression. Customary linear regression minimizes the average squared vertical distance from each point to the line. There are reasons this is the standard approach when you have at least a moderate amount of data, but when you only have three data points, it makes sense to use the minmax approach.

We can say several things about the minmax line. For one thing, there exists such a line and it is unique. Also, the vertical distance from each data point to the line will be the same. Either two points will be over the line and one under, or two will be under and one over.

Suppose your three points are (x1, y1), (x2, y2), and (x3, y3), with x1 < x2 < x3. Then the slope will be

m = \frac{y_3 - y_1}{x_3 - x_1}

and the intercept will be

b = \frac{y_1 (x_2+x_3)+y_2 (x_3-x_1) -y_3 (x_1+x_2)}{2 (x_3-x_1)}
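
Here is a small sketch of these formulas (a hypothetical helper, not the online calculator mentioned below), checking that the three vertical errors have equal magnitude, with the middle point on the opposite side of the line.

    def minmax_line(p1, p2, p3):
        (x1, y1), (x2, y2), (x3, y3) = sorted([p1, p2, p3])  # sort by x
        m = (y3 - y1) / (x3 - x1)
        b = (y1*(x2 + x3) + y2*(x3 - x1) - y3*(x1 + x2)) / (2*(x3 - x1))
        return m, b

    points = [(0, 0), (1, 2), (3, 1)]
    m, b = minmax_line(*points)
    print([y - (m*x + b) for x, y in points])  # [-5/6, 5/6, -5/6]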

I made an online calculator to find the best line for three points.