Relative error in the central limit theorem

If you average a large number independent versions of the same random variable, the central limit theorem says the average will be approximately normal. That is the absolute error in approximating the density of the average by the density of a normal random variable will be small. (Terms and conditions apply. See notes here.)

But the central limit theorem says nothing about relative error. Relative error can diverge to infinity while absolute error converges to zero. We’ll illustrate this with an example.

The average of N independent exponential(1) random variables has a gamma distribution with shape N and scale 1/N.

As N increases, the average becomes more like a normal in distribution. That is, the absolute error in approximating the distribution function of gamma random variable with that of a normal random variable decreases. (Note that we’re talking about distribution functions (CDFs) and not densities (PDFs). The previous post discussed a surprise with density functions in this example.)

The following plot shows that the difference between the distributions functions get smaller as N increases.

But when we look at the ratio of the tail probabilities, that is Pr(X > t) / Pr(Y  > t) where X is the average of N exponential r.v.s and Y is the corresponding normal approximation from the central limit theorem, we see that the ratios diverge, and they diverge faster as N increases.

To make it clear what’s being plotted, here is the Python code used to draw the graphs above.

import matplotlib.pyplot as plt
from scipy.stats import gamma, norm
from scipy import linspace, sqrt

def tail_ratio(ns):

    x = linspace(0, 4, 400)
    for n in ns:
        gcdf = gamma.sf(x, n, scale = 1/n)
        ncdf = norm.sf(x, loc=1, scale=sqrt(1/n))
        plt.plot(x, gcdf/ncdf

    plt.legend(["n = {}".format(n) for n in ns])

def cdf_error(ns):

    x = linspace(0, 6, 400)
    for n in ns:
        gtail = gamma.cdf(x, n, scale = 1/n)
        ntail = norm.cdf(x, loc=1, scale=sqrt(1/n))
        plt.plot(x, gtail-ntail)

    plt.legend(["n = {}".format(n) for n in ns])
ns = [1, 4, 16]

Central limit theorem and Runge phenomena

I was playing around with something this afternoon and stumbled on something like Gibbs phenomena or Runge phenomena for the Central Limit Theorem.

The first place most people encounter Gibbs phenomena is in Fourier series for a step function. The Fourier series develops “bat ears” near the discontinuity. Here’s an example I blogged about before not with Fourier series but with analogous Chebyshev series.

Gibbs phenomena for Chebyshev interpolation

The series converges rapidly in the middle of the flat parts, but under-shoots and over-shoots near the jumps in the step function.

Runge phenomena is similar, where interpolating functions under- and over-shoot the function they’re approximating.

Runge example

Both plots above come from this post.

Here’s the example I ran across with the central limit theorem. The distribution of the average of a set of exponential random variables converges to the distribution of a normal random variable. The nice thing about the exponential distribution is that the averages have a familiar distribution: a gamma distribution. If each exponential has mean 1, the average has a gamma distribution with shape N and scale 1/N. The central limit theorem says this is converging in distribution to a normal distribution with the same mean and variance.

The plot below shows the difference between the density function of the average of N exponential random variables and the density function for its normal approximation, for N = 10 and for N = 400.

Notice that the orange line, corresponding to N = 400, is very flat, most of the time. That is, the normal approximation fits very well. But there’s this spike in the middle, something reminiscent of Gibbs phenomena or Runge phenomena. Going from 10 to 400 samples the average error decreases quite a bit, but the maximum error doesn’t go down much at all.

If you go back and look at the Central Limit Theorem, or its more quantitative counterpart the Berry-Esseen theorem, you’ll notice that it applies to the distribution function, not the density, or in other words, the CDF, not the PDF. I think the density functions do converge in this case because the exponential function has a smooth density, but the rate of convergence depends on the norm. It looks like the convergence is fast in square (L²) norm, but slow in sup norm. A little experiment shows that this is indeed the case.

Maybe the max norm doesn’t converge at all, i.e. the densities don’t converge pointwise. It looks like the max norm may be headed toward a horizontal asymptote, just like Gibbs phenomena.

Update: It seems we do not have uniform convergence. If we let N = 1,000,000, the sup norm of the error 0.1836. It appears the sup norm of the error is approaching a lower bound of approximately this value.

Related posts

Computing extreme normal tail probabilities

Let me say up front that relying on the normal distribution as an accurate model of extreme events is foolish under most circumstances. The main reason to calculate the probability of, say, a 40 sigma event is to show how absurd it is to talk about 40 sigma events. See my previous post on six-sigma events for an explanation.

For this post I’ll be looking at two-tailed events, the probability that a normal random variable is either less than –kσ or greater than kσ. If you’re only interested in one of these two probabilities, divide by 2. Also, since the results are independent of σ, let’s assume σ = 1 for convenience.

The following Python code will print a table of k-sigma event probabilities.

    from scipy.stats import norm

    for k in range(1, 40):
        print(k, 2*norm.cdf(-k))

This shows, for example, that a “25 sigma event,” something I’ve heard people talk about with a straight face, has a probability of 6 × 10-138.

The code above reports that a 38 or 39 sigma event has probability 0, exactly 0. That’s because the actual value, while not zero, is so close to zero that floating point precision can’t tell the difference. This happens around 10-308.

What if, despite all the signs warning hic sunt dracones you want to compute even smaller probabilities? Then for one thing you’ll need to switch over to log scale in order for the results to be representable as floating point numbers.

Exactly computing these extreme probabilities is challenging, but there are convenient upper and lower bounds on the probabilities that can be derived from Abramowitz and Stegun, equation 7.1.13 with a change of variables. See notes here.

\frac{1}{t + \sqrt{t^2 + 2\sqrt{2}}} \leq 2\sqrt{\frac{\pi}{2}}\, e^{t^2/2}\, \mbox{Pr}(|Z| > t) \leq \frac{1}{t + \sqrt{t^2 + 4\sqrt{2}/\pi}}

We can use these bounds to compute upper and lower bounds on the base 10 logs of the tail probabilities.

    from scipy import log, sqrt, pi

    def core(t, c):
        x = 2*sqrt(2/pi)/(t + sqrt(t**2 + c))
        ln_p = -0.5*t**2 + log(x)
        return ln_p/log(10)

    def log10_upper(t):
        return core(t, 2*sqrt(2))

    def log10_lower(t):
        return core(t, 4*sqrt(2)/pi)

This tells us that the log base 10 of the probability of a normal random variable being more than 38 standard deviations away from its mean is between -315.23867 and -315.23859. The upper and lower bounds agree to seven significant figures, and the accuracy only improves as k gets larger. So for large arguments, we can use either the upper or lower bound as an accurate approximation.

The code above was used to compute this table of tail probabilities for k = 1 to 100 standard deviations.

Six sigma events

I saw on Twitter this afternoon a paraphrase of a quote from Nassim Taleb to the effect that if you see a six-sigma event, that’s evidence that it wasn’t really a six-sigma event.

What does that mean? Six sigma means six standard deviations away from the mean of a probability distribution, sigma (σ) being the common notation for a standard deviation. Moreover, the underlying distribution is implicitly a normal (Gaussian) distribution; people don’t commonly talk about “six sigma” in the context of other distributions [1]. Here’s a table to indicate the odds against a k-sigma event for various k.

        | Sigma | Odds            |
        |     1 | 2 : 1           |
        |     2 | 21 : 1          |
        |     3 | 370 : 1         |
        |     4 | 16,000 : 1      |
        |     5 | 1,700,000 : 1   |
        |     6 | 500,000,000 : 1 |

If you see something that according to your assumptions should happen twice in a billion tries, maybe you’ve seen something extraordinarily rare, or maybe your assumptions were wrong. Taleb’s comment suggests the latter is more likely.

Bayes rule and Bayes factors

You could formalize this with Bayes rule. For example, suppose you’re 99% sure the thing you’re looking at has a normal distribution with variance 1, but you’re willing to concede there’s a 1% chance that what you’re looking at has a heavier-tailed distribution, say a Student t distribution with 10 degrees of freedom, rescaled to also have variance 1.

normal distribution vs t with 10 dof

It’s hard to tell the two distributions apart, especially in the tails. But although both are small in the tails, the normal is relatively much smaller.

Now suppose you’ve seen an observation greater than 6. The Bayes factor in favor of the t distribution hypothesis is 272. This means that even though before seeing any data you thought the odds were 99 to 1 in favor of the data coming from a normal distribution, after seeing such a large observation you would put the odds at 272 to 1 in favor of the t distribution.

If you allow a small possibility that your assumption of a normal distribution is wrong (see Cromwell’s rule) then seeing an extreme event will radically change your mind. You don’t have to think the heavier-tailed distribution is equally likely, just a possibility. If you did think a priori that both possibilities were equally likely, the posterior odds for the t distribution would be 27,000 to 1.

In this example we’re comparing the normal distribution to a very specific and somewhat arbitrary alternative. Our alternative was just an example. You could have picked a wide variety of alternatives that would have given a qualitatively similar result, reversing your a priori confidence in a normal model.

By the way, a t distribution with 10 degrees of freedom is not a very heavy-tailed distribution. It has heavier tails than a normal for sure, but not nearly as heavy as a Cauchy, which corresponds to a t with only one degree of freedom. If we had used a distribution with a heavier tail, the posterior odds in favor of that distribution would have been higher.


[1] A six-sigma event isn’t that rare unless your probability distribution is normal. By Markov’s inequality, the probability is less than 1/36 for any distribution. The rarity of six-sigma events comes from the assumption of a normal distribution more than from the number of sigmas per se.

Calendars and continued fractions

Calendars are based on three frequencies: the rotation of the Earth on its axis, the rotation of the moon around the Earth, and the rotation of the Earth around the sun. Calendars are complicated because none of these periods is a simple multiple of any other. The ratios are certainly not integers, but they’re not even fractions with a small denominator.

As I wrote about the other day in the context of rational approximations for π, the most economical rational approximations of an irrational number come from convergents of continued fractions. The book Calendrical Calculations applies continued fractions to various kinds of calendars.

Ratio of years to days

The continued fraction for the number of days in a year is as follows.

365.2424177 = 365 + \cfrac{1}{4 + \cfrac{1}{7+ \cfrac{1}{1 + \cfrac{1}{2 + \cfrac{1}{1 + \cfrac{1}{5 + \ldots }}}}}}

This means that the best approximations for the length of a year are 365 days plus a fraction of 1/4, or 7/29, or 8/33, or 23/95, etc.

You could have one leap day every four years. More accurate would be 7 leap days every 29 years, etc. The Gregorian calendar has 97 leap days every 400 years.

Ratio of years to months

If we express the ratio of the length of the year to the length of the month, we get

\frac{365.24244}{29.53059} = 12 +\cfrac{1}{2 + \cfrac{1}{1+ \cfrac{1}{2 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{18 + \cfrac{1}{3 + \ldots }}}}}}}

By taking various convergents we get 37/3, 99/8, 136/11, etc. So if you want to design successively more accurate lunisolar calendars, you’d have 37 lunar months every 3 years, 99 lunar months every 8 years, etc.

Related posts

Computing smooth max without overflow

Erik Erlandson sent me a note saying he found my post on computing the soft maximum helpful. (If you’re unfamiliar with the soft maximum, here’s a brief description of what it is and how you might use it.) Erik writes

I used your post on practical techniques for computing smooth max, which will probably be the difference between being able to actually use my result and not. I wrote up what I’m planning to do here.

His post extends the idea in my post to computing the gradient and Hessian of the soft max.

Proving life exists on Earth

NASA’s Galileo mission was primarily designed to explore Jupiter and its moons. In 1989, the Galileo probe started out traveling away from Jupiter in order to do a gravity assist swing around Venus. About a year later it also did a gravity assist maneuver around Earth. Carl Sagan suggested that when passing Earth, the Galileo probe should turn its sensors on Earth to look for signs of life. [1]

Artist conception of Galileo probe surveying Jupiter and its moons

Now obviously we know there’s life on Earth. But if we’re going look for life on other planets, it’s reasonable to ask that our methods return positive results when examining the one planet we know for sure does host life. So scientists looked at the data from Galileo as if it were coming from another planet to see what patterns in the data might indicate life.

I’ve started using looking for life on Earth as a metaphor. I’m working on a project right now where I’m looking for a needle in a haystack, or rather another needle in a haystack: I knew that one needle existed before I got started. So I want to make sure that my search procedure at least finds the one positive result I already know exists. I explained to a colleague that we need to make sure we can at least find life on Earth.

This reminds me of simulation work. You make up a scenario and treat it as the state of nature. Then pretend you don’t know that state, and see how successful your method is at discovering that state. It’s sort of a schizophrenic way of thinking, pretending that half of your brain doesn’t know what the other half is doing.

It also reminds me of software testing. The most trivial tests can be surprisingly effective at finding bugs. So you write a few tests to confirm that there’s life on Earth.

Related posts

[1] I found out about Galileo’s Earth reconnaissance listening to the latest episode of the .NET Rocks! podcast.

Combinatorics, just beyond the basics

Most basic combinatorial problems can be solved in terms of multiplication, permutations, and combinations. The next step beyond the basics, in my experience, is counting selections with replacement. Often when I run into a problem that is not quite transparent, it boils down to this.

Examples of selection with replacement

Here are three problems that reduce to counting selections with replacement.

Looking ahead in an experiment

For example, suppose you’re running an experiment in which you randomize to n different treatments and you want to know how many ways the next k subjects can be assigned to treatments. So if you had treatments A, B, and C, and five subjects, you could assign all five to A, or four to A and one to B, etc. for a total of 21 possibilities. Your choosing 5 things from a set of 3, with replacement because you can (and will) assign the same treatment more than once.

Partial derivatives

For another example, if you’re taking the kth order partial derivatives of a function of n variables, you’re choosing k things (variables to differentiate with respect to) from a set of n (the variables). Equality of mixed partials for smooth functions says that all that matters is how many times you differentiate with respect to each variable. Order doesn’t matter: differentiating with respect to x and then with respect to y gives the same result as taking the derivatives in the opposite order, as long as the function you’re differentiating has enough derivatives. I wrote about this example here.

Sums over even terms

I recently had an expression come up that required summing over n distinct variables, each with a non-negative even value, and summing to 2k. How many ways can that be done? As many as dividing all the variable values in half and asking that they sum to k. Here the thing being chosen the variable, and since the indexes sum to k, I have a total of k choices to make, with replacement since I can chose a variable more than once. So again I’m choosing k things with replacement from a set of size n.


I wrote up a set of notes on sampling with replacement that I won’t duplicate here, but in a nutshell:

\left( {n \choose k} \right) = {n + k - 1 \choose k}

The symbol on the left is Richard Stanley’s suggested notation for the number of ways to select k things with replacement from a set of size n. It turns out that this equals the expression on the right side. The derivation isn’t too long, but it’s not completely obvious either. See the aforementioned notes for details.

I started by saying basic combinatorial problems can be reduced to multiplication, permutations, and combinations. Sampling with replacement can be reduced to a combination, the right side of the equation above, but with non-obvious arguments. Hence the need to introduce a new symbol, the one on the right, that maps directly to problem statements.

Enumerating possibilities

Sometimes you just need to count how many ways one can select k things with replacement from a set of size n. But often you need to actually enumerate the possibilities, say to loop over them in software.

In conducting a clinical trial, you may want to enumerate all the ways pending data could turn out. If you’re going to act the same way, no matter how the pending data work out, there’s no need to wait for the missing data before proceeding. This is something I did many times when I worked at MD Anderson Cancer Center.

When you evaluate a multivariate Taylor series, you need to carry out a sum that has a term for corresponding to each partial derivative. You could naively sum over all possible derivatives, not taking into account equality of mixed partials, but that could be very inefficient, as described here.

The example above summing over even partitions of an even number also comes out of a problem with multivariate Taylor series, one in which odd terms drop out by symmetry.

To find algorithms for enumerating selections with replacement, see Knuth’s TAOCP, volume 4, Fascicle 3.

Related posts

10 best rational approximations for pi

It’s easy to create rational approximations for π. Every time you write down π to a few decimal places, that’s a rational approximation. For example, 3.14 = 314/100. But that’s not the best approximation.

Think of the denominator of your fraction as something you have to buy. If you have enough budget to buy a three-digit denominator, then you’re better off buying 355/113 rather than 314/100 because the former is more accurate. The approximation 355/113 is accurate to 6 decimal places, whereas 314/100 is only accurate to 2 decimal places.

There’s a way to find the most economical rational approximations to π, or any other irrational number, using continued fractions. If you’re interested in details, see the links below.

Here are the 10 best rational approximations to π, found via continued fractions.

    | Fraction       | Decimals |
    | 3              |      0.8 |
    | 22/7           |      2.9 |
    | 333/106        |      4.1 |
    | 355/113        |      6.6 |
    | 103993/33102   |      9.2 |
    | 104348/33215   |      9.5 |
    | 208341/66317   |      9.9 |
    | 312689/99532   |     10.5 |
    | 833719/265381  |     11.1 |
    | 1146408/364913 |     11.8 |

If you only want to know the number of correct decimal places, ignore the fractional parts of the numbers in the Decimals column above. For example, 22/7 gives two correct decimal places. But it almost gives three. (Technically, the Decimals column gives the -log10 of the absolute error.)

In case you’re curious, here’s a plot of the absolute errors on a log scale.

Errors in rational approximations to pi

Related posts

Fixed points of logistic function

Here’s an interesting problem that came out of a logistic regression application. The input variable was between 0 and 1, and someone asked when and where the logistic transformation

f(x) = 1/(1 + exp(abx))

has a fixed point, i.e. f(x) = x.

So given logistic regression parameters a and b, when does the logistic curve given by yf(x) cross the line yx? Do they always cross? Can they cross twice?

There’s always at least one solution. Because f(x) is strictly between 0 and 1, the function

g(x) = f(x) – x

is positive at 0 and negative at 1, and by the intermediate value theorem g(x) must be zero for some x between 0 and 1.

Sometimes f has only one fixed point. It may have two or three fixed points, as demonstrated in the graph below. The case of two fixed points is unstable: the logistic curve is tangent to the line yx at one point, and a tiny change would turn this tangent point into either no crossing or two crossings.

Logistic curves crossing identity line

If |b| < 1, then you can show that the function f is a contraction map on [0, 1]. In that case there is a unique solution to f(x) = x, and you can find it by starting with an arbitrary value for x and repeatedly applying f to it. For example, if a= 1 and b = 0.8 and we start with x= 0, after applying f ten times we get xf(x) = 0.233790157.

There are a couple questions left to finish this up. How can you tell from a and b how many fixed points there will be? The condition |b| < 1 is sufficient for f to be a contraction map on [0, 1]. Can you find necessary and sufficient conditions?

Related post: Sensitivity of logistic regression prediction

Line art

A new video from 3Blue1Brown is about visualizing derivatives as stretching and shrinking factors. Along the way they consider the function f(x) = 1 + 1/x.

Iterations of f converge on the golden ratio, no matter where you start (with one exception). The video creates a graph where they connect values of x on one line to values of f(x) on another line. Curiously, there’s an oval that emerges where no lines cross.

Here’s a little Python I wrote to play with this:

    import matplotlib.pyplot as plt
    from numpy import linspace

    N = 70
    x = linspace(-3, 3, N)
    y = 1 + 1/x

    for i in range(N):
        plt.plot([x[i], y[i]], [0, 1])
    plt.xlim(-5, 5)

And here’s the output:

In the plot above I just used matplotlib’s default color sequence for each line. In the plot below, I used fewer lines (N = 50) and specified the color for each line. Also, I made a couple changes to the plot command, specifying the color for each line and putting the x line above the y line.

        plt.plot([x[i], y[i]], [0, 1], c="#243f6a")

If you play around with the Python code, you probably want to keep N even. This prevents the x array from containing zero.

Update: Here’s a variation that extends the lines connecting (x, 0) and (y, 1). And I make a few other changes while I’m at it.

    N = 200
    x = linspace(-10, 10, N)
    y = 1 + 1/x
    z = 2*y-x

    for i in range(N):
        plt.plot([x[i], z[i]], [0, 2], c="#243f6a")
    plt.xlim(-10, 10)

Making a career out of the chain rule

When I was a teenager, my uncle gave me a calculus book and told me that mastering calculus was the most important thing I could do for starting out in math. So I learned the basics of calculus from that book. Later I read Michael Spivak’s two calculus books. I took courses that built on calculus, and studied generalizations of calculus such as calculus with complex variables, calculus in Banach spaces, etc. I taught calculus. After a while, I started to feel maybe I’d mastered calculus.

Last year I started digging into automatic differentiation and questioned whether I really had mastered calculus. At a high level, automatic differentiation is “just” an application of the chain rule in many variables. But you can make career out of exploring automatic differentiation (AD), and many people do. The basics of AD are not that complicated, but you can go down a deep rabbit hole exploring optimal ways to implement AD in various contexts.

You can make a career out of things that seem even simpler than AD. Thousands of people have made a career out of solving the equation Ax = b where A is a large matrix and the vector b is given. In high school you solve two equations in two unknowns, then three equations in three unknowns, and in principle you could keep going. But you don’t solve a sparse system of millions of equations the same way. When you consider efficiency, accuracy, limitations of computer arithmetic, parallel computing, etc. the humble equation Ax = b is not so simple to solve.

As Richard Feynman said, nearly everything is really interesting if you go into it deeply enough.

Related posts:

Robustness and tests for equal variance

The two-sample t-test is a way to test whether two data sets come from distributions with the same mean. I wrote a few days ago about how the test performs under ideal circumstances, as well as less than ideal circumstances.

This is an analogous post for testing whether two data sets come from distributions with the same variance. Statistics texts books often present the F-test for this task, then warn in a footnote that the test is highly dependent on the assumption that both data sets come from normal distributions.

Sensitivity and robustness

Statistics texts give too little attention to robustness in my opinion. Modeling assumptions never hold exactly, so it’s important to know how procedures perform when assumptions don’t hold exactly. Since the F-test is one of the rare instances where textbooks warn about a lack of robustness, I expected the F-test to perform terribly under simulation, relative to its recommended alternatives Bartlett’s test and Levene’s test. That’s not exactly what I found.

Simulation design

For my simulations I selected 35 samples from each of two distributions. I selected significance levels for the F-test, Bartlett’s test, and Levene’s test so that each would have roughly a 5% error rate under a null scenario, both sets of data coming from the same distribution, and a 20% error rate under an alternative scenario.

I chose my initial null and alternative scenarios to use normal (Gaussian) distributions, i.e. to satisfy the assumptions of the F-test. Then I used the same designs for data coming from a heavy-tailed distribution to see how well each of the tests performed.

For the normal null scenario, both data sets were drawn from a normal distribution with mean 0 and standard deviation 15. For the normal alternative scenario I used normal distributions with standard deviations 15 and 25.

Normal distribution calibration

Here are the results from the normal distribution simulations.

| Test     | Alpha | Type I | Type II |
| F        |  0.13 | 0.0390 |  0.1863 |
| Bartlett |  0.04 | 0.0396 |  0.1906 |
| Levene   |  0.06 | 0.0439 |  0.2607 |

Here the Type I column is the proportion of times the test incorrectly concluded that identical distributions had unequal variances. The Type II column reports the proportion of times the test failed to conclude that distributions with different variances indeed had unequal variances. Results were based on simulating 10,000 experiments.

The three tests had roughly equal operating characteristics. The only difference that stands out above simulation noise is that the Levene test had larger Type II error than the other tests when calibrated to have the same Type I error.

To calibrate the operating characteristics, I used alpha levels 0.15, 0.04, and 0.05 respectively for the F, Bartlett, and Levene tests.

Heavy-tail simulation results

Next I used the design parameters above, i.e. the alpha levels for each test, but drew data from distributions with a heavier tail. For the null scenario, both data sets were drawn from a Student t distribution with 4 degrees of freedom and scale 15. For the alternative scenario, the scale of one of the distributions was increased to 25. Here are the results, again based on 10,000 simulations.

| Test     | Alpha | Type I | Type II |
| F        |  0.13 | 0.2417 |  0.2852 |
| Bartlett |  0.04 | 0.2165 |  0.2859 |
| Levene   |  0.06 | 0.0448 |  0.4537 |

The operating characteristics degraded when drawing samples from a heavy-tailed distribution, t with 4 degrees of freedom, but they didn’t degrade uniformly.

Compared to the F-test, the Bartlett test had slightly better Type I error and the same Type II error.

The Levene test, had a much lower Type I error than the other tests, hardly higher than it was when drawing from a normal distribution, but had a higher Type II error.

Conclusion: The F-test is indeed sensitive to departures from the Gaussian assumption, but Bartlett’s test doesn’t seem much better in these particular scenarios. Levene’s test, however, does perform better than the F-test, depending on the relative importance you place on Type I and Type II error.

Related posts

Ellipsoid geometry and Haumea

To first approximation, Earth is a sphere. A more accurate description is that the earth is an oblate spheroid, the polar axis being a little shorter than the equatorial diameter. See details here. Other planets are also oblate spheroids as well. Jupiter is further from spherical than the earth is more oblate.

The general equation for the surface of an ellipsoid is

frac{x^2}{a^2} + frac{y^2}{b^2} + frac{z^2}{c^2} = 1

An ellipsoid has three semi-axes: ab, and c. For a sphere, all three are equal to the radius. For a spheroid, two are equal. In an oblate spheroid, like Earth and Jupiter, abc. For a prolate spheroid like Saturn’s moon Enceladus, abc.

Artists conception of Haumea

The dwarf planet Haumea is far from spherical. It’s not even a spheroid. It’s a general (tri-axial) ellipsoid, with a, b, and c being quite distinct. It’s three semi-axes are approximately 1000, 750, and 500 kilometers. The dimensions are not known very accurately. The image above is an artists conception of Haumea and its ring, via Astronomy Picture of the Day.

This post explains how to compute the volume and surface area of an ellipsoid. See that post for details. Here we’ll just copy the Python code from that post.

    from math import sin, cos, acos, sqrt, pi
    from scipy.special import ellipkinc, ellipeinc
    def volume(a, b, c):
        return 4*pi*a*b*c/3
    def area(a, b, c):
        phi = acos(c/a)
        k = a*sqrt(b**2 - c**2)/(b*sqrt(a**2 - c**2))
        E = ellipeinc(phi, k)
        F = ellipkinc(phi, k)
        elliptic = E*sin(phi)**2 + F*cos(phi)**2
        return 2.0*pi*c**2 + 2*pi*a*b*elliptic/sin(phi)
    a, b, c = 1000, 750, 500
    print(volume(a, b, c))
    print(area(a, b, c))

Related post: Planets evenly spaced on a log scale