Hyperellipsoid surface area

Dimension 2

The equation for the perimeter of an ellipse is

p = 4aE(e^2)

where a is the semimajor axis, e is the eccentricity, and E is a special function. The equation is simple, in the sense that it has few terms, but it is not elementary, because it depends on an advanced function, the complete elliptic integral of the second kind.

However, there is an approximation for the perimeter that is both simple and elementary:

p \approx 2 \pi \left(\frac{a^{3/2} + b^{3/2}}{2} \right )^{2/3}
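As a sanity check, here is a quick comparison of the exact perimeter and the approximation using scipy. Note that scipy's ellipe takes the parameter m = e², matching the convention above. The semiaxes 3 and 1 are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import ellipe

a, b = 3.0, 1.0                  # arbitrary semiaxes with a >= b
m = 1 - (b/a)**2                 # parameter m = e^2, the square of the eccentricity
exact = 4*a*ellipe(m)            # p = 4a E(e^2)
approx = 2*np.pi*((a**1.5 + b**1.5)/2)**(2/3)
print(exact, approx)
```

The two values agree to about three significant figures, even for this fairly eccentric ellipse.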

Dimension 3

The generalization of an ellipse to three dimensions is an ellipsoid. The surface area of an ellipsoid is neither simple nor elementary. The surface area S is given by

\begin{align*} \varphi &= \arccos\left(\frac{c}{a} \right) \\ k^2 &= \frac{a^2\left(b^2 - c^2\right)}{b^2\left(a^2 - c^2\right)} \\ S &= 2\pi c^2 + \frac{2\pi ab}{\sin(\varphi)}\left(E(\varphi, k)\,\sin^2(\varphi) + F(\varphi, k)\,\cos^2(\varphi)\right) \end{align*}

where E is the incomplete elliptic integral of the second kind and F is the incomplete elliptic integral of the first kind.

However, once again there is an approximation that is simple and elementary. The surface area is approximately

S \approx 4\pi \left( \frac{(ab)^p + (bc)^p + (ac)^p}{3} \right )^{1/p}
where p = 1.6075.

Notice the similarities between the approximation for the perimeter of an ellipse and the approximation for the area of an ellipsoid. The former is the perimeter of a unit circle times a kind of mean of the axes. The latter is the area of a unit sphere times a kind of mean of the products of pairs of axes. The former uses a p-mean with p = 1.5 and the latter uses a p-mean with p = 1.6075. More on such means here.
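To see how good the approximation is, here is a sketch computing both the exact area, via scipy's incomplete elliptic integrals (which take the parameter m = k²), and the p-mean approximation. The semiaxes 3 ≥ 2 ≥ 1 are an arbitrary test case.

```python
import numpy as np
from scipy.special import ellipkinc, ellipeinc

def ellipsoid_area_exact(a, b, c):
    # requires a >= b >= c > 0 with a > c; the formula degenerates for a sphere
    phi = np.arccos(c/a)
    k2 = a**2*(b**2 - c**2)/(b**2*(a**2 - c**2))   # k^2 as defined above
    # scipy's incomplete elliptic integrals take the parameter m = k^2
    E = ellipeinc(phi, k2)
    F = ellipkinc(phi, k2)
    return 2*np.pi*c**2 + 2*np.pi*a*b/np.sin(phi)*(E*np.sin(phi)**2 + F*np.cos(phi)**2)

def ellipsoid_area_approx(a, b, c, p=1.6075):
    return 4*np.pi*(((a*b)**p + (b*c)**p + (a*c)**p)/3)**(1/p)

exact = ellipsoid_area_exact(3, 2, 1)
approx = ellipsoid_area_approx(3, 2, 1)
print(exact, approx)
```

For this test case the two results differ by a fraction of a percent.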

Dimensions 4 and higher

The complexity of expressions for the surface area of an ellipsoid apparently increases with dimension. The expressions get worse for hyperellipsoids, i.e. n-ellipsoids for n > 3. You can find such expressions in [1]. More on that in just a minute.


It is natural to conjecture, based on the approximations above, that the surface area of an n-ellipsoid is the area of a unit sphere in dimension n times the p-mean of all products of n − 1 semiaxes for some value of p.

For example, the surface area of an ellipsoid in 4 dimensions might be approximately

S \approx 2\pi^2 \left( \frac{(abc)^p + (abd)^p + (acd)^p + (bcd)^p}{4} \right )^{1/p}

for some value of p.

Why this form? Permutations of the axes do not change the surface area, so we’d expect permutations not to affect the approximation either. (More here.) Also, from dimensional analysis we’d expect the formula to involve products of n − 1 terms, since the result gives (n − 1)-dimensional volume.

Surely I’m not the first to suggest this. However, I don’t know what work has been done along these lines.

Exact results

In [1] the author gives some very complicated but general expressions for the surface area of a hyperellipsoid. The simplest of his expressions involves probability:

S = n \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)} \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n+1}{2}\right)} \text{E}\left(\sqrt{q_1^2 X_1^2 + q_2^2 X_2^2 + \cdots + q_n^2 X_n^2 }\right)

where the Xs are independent normal random variables with mean 0 and variance 1/2.

At first it may look like this can be simplified. The sum of normal random variables is a normal random variable. But the squares of normal random variables are not normal, they’re gamma random variables. The sum of gamma random variables is a gamma random variable, but that’s only if the variables have the same scale parameter, and these do not unless all the semiaxes, the qs, are the same.

You could use the formula above as a way to approximate S via Monte Carlo simulation. You could also use asymptotic results from probability to get approximate formulas valid for large n.
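Here is a sketch of such a Monte Carlo estimate. I've written the normalizing constant in the simplified form 2π^{n/2}/Γ((n+1)/2), chosen so that the estimate reduces to the surface area of the unit sphere when all the semiaxes equal 1; check this constant against [1] before relying on it.

```python
import numpy as np
from scipy.special import gamma

def surface_area_mc(q, samples=1_000_000, seed=0):
    """Monte Carlo estimate of the surface area of an ellipsoid with semiaxes q.

    The X_i are independent normal random variables with mean 0 and variance 1/2.
    """
    q = np.asarray(q, dtype=float)
    n = len(q)
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, np.sqrt(0.5), size=(samples, n))
    m = np.mean(np.sqrt(np.sum((q*X)**2, axis=1)))
    return 2*np.pi**(n/2)/gamma((n + 1)/2) * m

# sanity check: with n = 2 this estimates the perimeter of an ellipse,
# which for semiaxes 3 and 1 is about 13.36
perimeter = surface_area_mc([3, 1])
print(perimeter)
```

With a million samples the estimate is typically within a few parts per thousand of the true value.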


[1] Igor Rivin. Surface area and other measures of ellipsoids. Advances in Applied Mathematics 39 (2007) 409–427

Justifiable sample size

One of the most common things a statistician is asked to do is compute a sample size. There are well known formulas for this, so why isn’t calculating a sample size trivial?

As with most things in statistics, plugging numbers into a formula is not the hard part. The hard part is deciding what numbers to plug in, which in turn depends on understanding the context of the study. What are you trying to learn? What are the constraints? What do you know a priori?

In my experience, sample size calculation is very different in a scientific setting versus a legal setting.

In a scientific setting, sample size is often determined by budget. This isn’t done explicitly. There is a negotiation between the statistician and the researcher that starts out with talk of power and error rates, but the assumptions are adjusted until the sample size works out to something the researcher can afford.

In a legal setting, you can’t get away with statistical casuistry as easily because you have to defend your choices. (In theory researchers have to defend themselves too, but that’s a topic for another time.)

Opposing counsel or a judge may ask how you came up with the sample size you did. The difficulty here may be more expository than mathematical, i.e. the difficulty lies in explaining subtle concepts, not in carrying out calculations. A statistically defensible study design is no good unless there is someone there who can defend it.

One reason statistics is challenging to explain to laymen is that there are multiple levels of uncertainty. Suppose you want to determine the defect rate of some manufacturing process. You want to quantify the uncertainty in the quality of the output. But you also want to quantify your uncertainty about your estimate of uncertainty. For example, you may estimate the defect rate at 5%, but how sure are you that the defect rate is indeed 5%? How likely is it that the defect rate could be 10% or greater?

When there are multiple contexts of uncertainty, these contexts get confused. For example, variations on the following dialog come up repeatedly.

“Are you saying the quality rate is 95%?”

“No, I’m saying that I’m 95% confident of my estimate of the quality rate.”

Probability is subtle and there’s no getting around it.

Related posts

First time seeing a rare event

Suppose you’ve been monitoring a rare event for a long time, then you see your first occurrence on the Nth observation. Now what would you say about the event’s probability?

For example, suppose you’re wondering whether dogs ever have two tails. You observe thousands of dogs and never see two tails. But then you see a dog with two tails. Now what can you say about the probability of dogs having two tails? It’s certainly not zero.

We’ll first look at the case of 0 successes out of N trials then look at the case of 1 success out of N trials.

Zero successes

If you’re observing a binary event and you’ve seen no successes out of N trials your point estimate of the probability of your event is 0. You can’t have any confidence in the relative accuracy of your estimate: if the true probability is positive, no matter how small, then the relative error in your estimate is infinite.

But you can have a great deal of confidence in its absolute accuracy. When you’re looking for a binary event and you have not seen any instances in N trials for large N, then a 95% confidence interval for the event’s probability is approximately [0, 3/N]. This is the statistical rule of three. This is a robust estimate, one you could derive from either a frequentist or Bayesian perspective.

Note that the confidence interval [0, 3/N] is exceptionally narrow. When observing a moderate mix of successes and failures the width of the confidence interval is on the order of 1/√N, not 1/N.
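As a quick check of the rule of three: the exact one-sided 95% upper bound is the value of p satisfying (1 − p)^N = 0.05, and since ln 0.05 ≈ −3, this is very close to 3/N.

```python
N = 1000
# largest p consistent with zero occurrences in N trials at 95% confidence:
# solve (1 - p)**N = 0.05 for p
exact = 1 - 0.05**(1/N)
print(exact, 3/N)   # 0.00299... versus 0.003
```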

First success

After seeing your first success, your point estimate jumps from 0 to 1/N, an infinite relative increase. What happens to your confidence interval?

If we use Jeffreys’ beta(1/2, 1/2) prior, then the posterior distribution after seeing 1 success and N − 1 failures is a beta(3/2, N − 1/2). Now an approximate 95% confidence interval is

[0.1/N, 4.7/N]

So compared to the case of seeing zero successes, seeing one success makes our confidence interval about 50% wider and moves its left endpoint up from 0 to 0.1/N.

So if you’ve seen 100,000 dogs and only 1 had two tails, you could estimate that a 95% confidence interval for the probability of a dog having two tails is

[10^−6, 4.7 × 10^−5].

If we run the exact numbers we get

[1.07 × 10^−6, 4.67 × 10^−5].
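The exact numbers can be computed from the beta(3/2, N − 1/2) posterior, for example with scipy:

```python
from scipy.stats import beta

N = 100_000
# posterior after 1 success in N trials, starting from Jeffreys' beta(1/2, 1/2) prior
posterior = beta(1.5, N - 0.5)
lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)
print(lo, hi)   # approximately 1.07e-6 and 4.67e-5
```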

Related posts

Can the chi squared test detect fake primes?


This morning I wrote about Dan Piponi’s fake prime function. This evening I thought about it again and wondered whether the chi-squared test could tell the difference between the distribution of digits in real primes and fake primes.

If the distributions were obviously different, this would be apparent from looking at histograms. When distributions are more similar, visual inspection is not as reliable as a test like chi-squared. For small samples, we can be overly critical of plausible variations. For large samples, statistical tests can detect differences our eyes cannot.

When data fall into a number of buckets, with a moderate number of items expected to fall in each bucket, the chi squared test attempts to measure how the actual counts in each bucket compare to the expected counts.

This is a two-sided test. For example, suppose you expect 12% of items to fall in bucket A and 88% to fall in bucket B. Now suppose you test 1,000 items. It would be suspicious if only 50 items landed in bucket A since we’d expect around 120. On the other hand, getting exactly 120 items would be suspicious too. Getting exactly the expected number is unexpected!
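In Python this test is a single call to scipy.stats.chisquare. Using the hypothetical numbers above, here is what happens when only 50 of 1,000 items land in bucket A:

```python
from scipy.stats import chisquare

# 1,000 items, expected to fall 12% in bucket A and 88% in bucket B,
# but only 50 observed in bucket A
observed = [50, 950]
expected = [120, 880]
stat, p = chisquare(observed, expected)
print(stat, p)   # the tiny p-value flags the deficit in bucket A
```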

Let’s look at the primes, genuine and fake, less than N = 200. We’ll take the distribution of digits in the list of primes as the expected values and compare to the distribution of the digits in the fake primes.

When I ran this experiment, I got a chi-squared value of 7.77. This is an unremarkable value for a sample from a chi-squared distribution with 9 degrees of freedom. (There are ten digits, but only nine degrees of freedom because if you rule out nine possibilities then the digit is determined with certainty.)

The p-value in this case, the probability of seeing a value as large as the one we saw or larger, is 0.557.

Next I increased N to 1,000 and ran the experiment again. Now I got a chi-squared value of 19.08, with a corresponding p-value of 0.024. When I set N to 10,000 I got a chi-squared value of 18.19, with a corresponding p-value of 0.033.

When I used N = 100,000 I got a chi-squared value of 130.26, corresponding to a p-value of 10^−23. Finally, when I used N = 1,000,000 I got a chi-squared value of 984.7 and a p-value of 3.4 × 10^−206.

In short, the chi-squared test needs a fair amount of data to tell that fake primes are fake. The distribution of digits for samples of fake primes less than a thousand or so is plausibly the same as that of actual primes, as far as the test can distinguish. But the chi-squared values get implausibly large for fake primes up to 100,000.

Random points in a high-dimensional orthant

In high dimensions, randomly chosen vectors are very likely nearly orthogonal. I’ll unpack this a little bit then demonstrate it by simulation. Then I’ll look at what happens when we restrict our attention to points with positive coordinates.


The lengths of vectors don’t contribute to the angles between them, so we may as well focus our attention on spheres.

Suppose we pick two points on a familiar sphere in 3 dimensions, our home planet. By symmetry, we can assume the first vector runs from the center of the earth to the north pole. Otherwise we can imagine picking the first point, then rotating the earth so that the chosen point becomes the north pole.

What vectors are orthogonal (perpendicular) to the north pole? The vectors from the center to a point on the equator. The vectors that are nearly orthogonal to the north pole are the vectors that are near the equator.

On a high-dimensional sphere, nearly all the area is near the equator (more on that here). This means that if we choose two vectors, they are very likely to be nearly orthogonal.

Simulating points on a sphere

I’ll illustrate this by choosing pairs of points at random on a unit sphere in 100 dimensions and computing their dot product. This gives the cosine similarity of the points, the cosine of the angle between them. I’ll repeat this process 1,000 times and look at the average cosine similarity.

    from scipy.stats import norm
    import numpy as np

    numruns = 1000
    avg = 0
    dim = 100
    for _ in range(numruns):
        x = norm(0, 1).rvs(size=dim)
        y = norm(0, 1).rvs(size=dim)
        avg += np.dot(x, y)/(np.dot(x, x)*np.dot(y, y))**0.5

    print(avg/numruns)

Here we use the standard method of generating random points on a sphere: generate each component as a normal (Gaussian) random variable, then divide by the length. The code above combines the normalization with finding the dot product.

The code above prints 0.00043. The average dot product between vectors is very small, i.e. the vectors are nearly orthogonal.

Simulating points in an orthant

Next let’s look at one orthant of the sphere. (“Orthant” is the generalization to higher dimensions of quadrant in two dimensions and octant in three dimensions.) We’ll generate random samples from the first orthant by first generating random samples from the whole sphere, then taking the absolute value to fold all the points from the other orthants into the first orthant.

The only necessary change to the code is to put abs() around the code generating the random points.

    x = abs(norm(0,1).rvs(size = dim))
    y = abs(norm(0,1).rvs(size = dim))

When we run the code again it prints 0.639. This is the average value of the cosine of the angle between points, so the average angle is about 0.877 radians or about 50°.

This isn’t quite the right thing to do. We should convert to angles first, then take the average. But that leads to essentially the same result.

If my pencil-and-paper calculations are correct, the exact expected dot product value approaches 2/π = 0.6366 as dimension goes to infinity: each term in the numerator has expected value (E|Z|)² = 2/π for standard normal Z, while the squared norms, divided by the dimension, concentrate around E[Z²] = 1. The value of 0.639 above is consistent with this.

More high-dimensional geometry posts

Variance of binned data

Suppose you have data that for some reason has been summarized into bins of width h. You don’t have the original data, only the number of counts in each bin.

You can’t exactly find the sample mean or sample variance of the data because you don’t actually have the data. But what’s the best you can do? A sensible approach would be to imagine that the contents of each bin represents samples that fell in the middle of the bin. For example, suppose your data were rounded down to the nearest integer. If you have 3 observations in the bin [5, 6] you could think of that as the value 5.5 repeated three times.

When we do this, we don’t recover the data set, but we create a pseudo data set that we can then find the sample mean and sample variance of.

This works well for estimating the mean, but it overestimates the variance. William Sheppard (1863–1936) recommended what is now known as Sheppard’s correction, subtracting h²/12 from the sample variance of the pseudo data. Richardson’s paper mentioned in the previous post fits Sheppard’s correction into the general framework of “the deferred approach to the limit.”

A more probabilistic justification of Sheppard’s correction is that h²/12 is the variance of a uniform random variable over an interval of width h. You could think of Sheppard’s correction as jittering the observations uniformly within each bin.

Let’s do a little simulation to demonstrate the accuracy of estimating the sample variance with and without Sheppard’s correction.

    from numpy import floor
    from scipy.stats import norm


    numruns = 1000
    rawbias = 0
    sheppardbias = 0
    for run in range(numruns):
        y = norm(0, 3).rvs(size=1000)
        ypseudo = floor(y) + 0.5
        v = ypseudo.var(ddof=1)
        rawbias += abs(v - y.var(ddof=1))
        sheppardbias += abs(v - y.var(ddof=1) - 1/12)

    print(rawbias/numruns, sheppardbias/numruns)

This Python script computes the variance of the pseudodata, with and without the Sheppard correction, and compares this value to the sample variance of the actual data. This process is repeated one thousand times.

On average, the uncorrected variance of the pseudodata was off by 0.0875 compared to the variance of the actual data. With Sheppard’s correction this drops to an average of 0.0442, i.e. the calculation with Sheppard’s correction was about 2 times more accurate.

Every Japanese prefecture shrinking

It’s well known that the population of Japan has been decreasing for years, and so I was a little puzzled by a recent headline saying that Japan’s population has dropped in every one of its 47 prefectures. Although the national population has been in decline, until now the nation’s 47 prefectures had never all lost population at the same time. Is this remarkable? Actually, yes.

This post will not be about Japan’s demographics in detail, but was motivated by the headline mentioned above.

Demographics is not random; it is the result of individual human choices. But for this post we’ll look at demographics as if it were random.

Suppose each prefecture had the same probability p of declining. Then we could infer that p is probably quite large for all prefectures to decline at the same time. If you flip a coin 47 times and it comes up heads each time, you don’t imagine it’s equally likely to come up heads or tails the next time. (More on that here.)

For a Bernoulli random variable, how large would p have to be in order for the probability of 47 successes out of 47 trials to have probability at least 1/2?

If p^47 = 1/2, then p = (1/2)^(1/47) ≈ 0.985.

If the value of p is different in each prefecture, and the product of all the p‘s is 1/2, then all the p‘s must be at least 1/2, and nearly all must be much larger. If one of the p‘s were equal to 1/2, the rest would have to be 1.

Applying this understanding to the headline at the top of the post, we could say that the fact that every prefecture has gone down in population is strong evidence that population will continue to drop, as long as current conditions continue.

Related posts

A note on Zipf’s law

Very often when a number is large, and we don’t know or care exactly how large it is, we can model it as infinite. This may make no practical difference and can make calculations much easier. I give several examples of this in the article Infinite is easier than big.

When you run across a statement that something is infinite, you can often mentally replace “infinite” with “big, but so big that it doesn’t matter exactly how big it is.” Conversely, the term “finite” in applied mathematics means “limited in a way that matters.”

But when we say something does or does not matter, there’s some implicit context. It does or does not matter for some purpose. We can get into trouble when we’re unclear of that context or when we shift contexts without realizing it.

Zipf’s law

Zipf’s law came up a couple days ago in a post looking at rare Chinese characters. That post resolves a sort of paradox, that rare characters come up fairly often. Each of these rare characters is very unlikely to occur in a document, and yet collectively it’s likely that some of them will be necessary.

In some contexts we can consider the number of Chinese characters or English words to be infinite. If you want to learn English by first learning the meaning of every English word, you will never get around to using English. The number of words is infinite in the sense that you will never get to the end.

Zipf’s law says that the frequency of the nth word in a language is proportional to n^−s. This is of course an approximation, but a surprisingly accurate approximation for such a simple statement.

Infinite N?

The Zipf probability distribution depends on two parameters: the exponent s and the number of words N. Since the number of words in any spoken language is large, and it’s impossible to say exactly how large it is, can we assume N = ∞? This seems like exactly what we were talking about at the top of this article.

Whether it is possible to model N as infinite depends on s. The value of s that models word frequency in many languages is close to 1. The plot below illustrates Zipf’s law for the text of Wikipedia in many different languages. In each case, the slope on the log-log plot is approximately −1, i.e. s ≈ 1.

When s = 1, we don’t have a probability distribution because the sum of 1/n from 1 to ∞ diverges. And so no, we cannot assume N = ∞.

Now you may object that all we have to do is set s to be slightly larger than 1. If s = 1.0000001, then the sum of n^−s converges. Problem solved. But not really.

When s = 1 the series diverges, but when s is slightly larger than 1 the sum is very large. Practically speaking this assigns too much probability to rare words.
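One way to see the problem numerically: with s = 1 the normalizing constant is the harmonic number H_N = 1 + 1/2 + ⋯ + 1/N, which grows without bound, though only logarithmically.

```python
import numpy as np

# normalizing constant for Zipf's law with s = 1: the harmonic number H_N
for N in [1_000, 50_000, 1_000_000]:
    H = np.sum(1/np.arange(1, N + 1))
    print(N, H)
```

Doubling N only adds about ln 2 ≈ 0.69 to the sum, but the sum never converges, so the truncation point N genuinely matters.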

If s = 2, for example, one could set N = ∞. The Zipf distribution with s = 2 may be useful for modeling some things, but not for modeling word frequency.

Zipf’s law applied to word frequency is an interesting example of a model that contains a large N, and although it doesn’t matter exactly how large N is, it matters that N is finite. In the earlier article, I used the estimate that Chinese has 50,000 characters. If I had estimated 60,000 characters the conclusion of the article would not have been too different. But assuming an infinite number of characters would result in substantially overestimating the frequency of rare words.

Related posts

Image above by Sergio Jimenez, CC BY-SA 4.0, via Wikimedia Commons


V-statistics

A few days ago I wrote about U-statistics, statistics which can be expressed as the average of a symmetric function over all combinations of elements of a set. V-statistics can be written as an average of a symmetric function over the Cartesian product of a set with itself.

Let S be a statistical sample of size n and let h be a symmetric function of r elements. The average of h over all subsets of S with r elements is a U-statistic. The average of h over the Cartesian product of S with itself r times

\underbrace{S \times S \times \cdots \times S}_{r \text{ times}}

is a V-statistic.

As in the previous post, let h(x, y) = (x − y)²/2. We can illustrate the V-statistic associated with h with Python code as before.

    import numpy as np
    from itertools import product

    def var(xs):
        n = len(xs)
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in product(xs, repeat=2)) / n**2

    xs = np.array([2, 3, 5, 7, 11])
    print(var(xs), np.var(xs))

This time, however, we iterate over product rather than over combinations. Note also that at the bottom of the code we print

   np.var(xs)
rather than

   np.var(xs, ddof=1)

This means our code here is computing the population variance, not the sample variance. We could make this more explicit by supplying the default value of ddof.

   np.var(xs, ddof=0)

The point of V-statistics is not to calculate them as above, but that they could be calculated as above. Knowing that a statistic is an average of a symmetric function is theoretically advantageous, but computing a statistic this way would be inefficient.

U-statistics are averages of a function h over all subsamples of S of size r without replacement. V-statistics are averages of h over all subsamples of size r with replacement. The difference between sampling with or without replacement goes away as n increases, and so V-statistics have the same asymptotic properties as U-statistics.
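For the variance kernel above the connection between the two statistics is exact, not just asymptotic: the diagonal terms h(x, x) vanish, so the V-statistic is (n − 1)/n times the U-statistic, i.e. the population variance versus the sample variance. A quick check:

```python
import numpy as np
from itertools import combinations, product

def h(x, y):
    return (x - y)**2 / 2

xs = np.array([2, 3, 5, 7, 11])
n = len(xs)

U = sum(h(*c) for c in combinations(xs, 2)) / (n*(n - 1)/2)   # sample variance
V = sum(h(*c) for c in product(xs, repeat=2)) / n**2          # population variance
print(U, V)   # 12.8 and 10.24; note V = (n - 1)/n * U
```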

Related posts

Moments of Tukey’s g-and-h distribution

John Tukey developed his so-called g-and-h distribution to be very flexible, having a wide variety of possible values of skewness and kurtosis. Although the reason for the distribution’s existence is its range of possible skewness and kurtosis values, calculating the skewness and kurtosis of the distribution is not simple.


Let φ be the function of one variable and four parameters defined by

\varphi(x; a, b, g, h) = a + b\left( \frac{\exp(gx) - 1}{g} \right) \exp(hx^2/2)

A random variable Y has a g-and-h distribution if it has the same distribution as φ(Z; a, b, g, h) where Z is a standard normal random variable. Said another way, if Y has a g-and-h distribution then the transformation φ^−1 makes the data normal.

The a and b parameters are for location and scale. The name of the distribution comes from the parameters g and h that control skewness and kurtosis respectively.

The transformation φ is invertible but φ^−1 does not have a closed form; φ^−1 must be computed numerically. It follows that the density function for Y does not have a closed form either.

Special cases

The g distribution is the g-and-h distribution with h = 0. It generalizes the log normal distribution.

The limit of the g-and-h distribution as g goes to 0 is the h distribution.

If g and h are both zero we get the normal distribution.

Calculating skewness and kurtosis

The following method of computing the moments of Y comes from [1].

Define f by

f(g, h, i) = \frac{1}{g^i\sqrt{1 - ih}} \sum_{r=0}^i (-1)^r \binom{i}{r} \exp\left(\frac{((i-r)g)^2}{2(1-ih)}\right)

Then the raw moments of Y are given by

\text{E} \, Y^m = \sum_{i=0}^m \binom{m}{i} a^{m-i}b^i f(g,h,i)

Skewness is the 3rd standardized moment and kurtosis is the 4th standardized moment. Equations for finding central moments from raw moments are given here.
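Here is a sketch of these formulas in code, checked against direct numerical integration of φ(Z)^m. The parameter values are arbitrary; note that the formula requires g ≠ 0 and mh < 1.

```python
import numpy as np
from math import comb, exp, sqrt
from scipy.integrate import quad
from scipy.stats import norm

def f(g, h, i):
    # valid for i*h < 1 and g != 0
    s = sum((-1)**r * comb(i, r) * exp(((i - r)*g)**2 / (2*(1 - i*h)))
            for r in range(i + 1))
    return s / (g**i * sqrt(1 - i*h))

def raw_moment(m, a, b, g, h):
    # E[Y^m] for Y = phi(Z; a, b, g, h) with Z standard normal
    return sum(comb(m, i) * a**(m - i) * b**i * f(g, h, i) for i in range(m + 1))

# check against direct integration for arbitrary parameter values
a, b, g, h = 0.0, 1.0, 0.5, 0.1
phi = lambda x: a + b*(np.exp(g*x) - 1)/g * np.exp(h*x**2/2)
for m in (1, 2):
    numeric = quad(lambda x: phi(x)**m * norm.pdf(x), -12, 12)[0]
    print(m, raw_moment(m, a, b, g, h), numeric)
```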

Related posts

[1] James B. McDonald and Patrick Turley. Distributional Characteristics: Just a Few More Moments. The American Statistician, Vol. 65, No. 2 (May 2011), pp. 96–103