I get suspicious when I hear people ask about third and fourth moments (skewness and kurtosis). I’ve heard these terms far more often from people who don’t understand statistics than from people who do.

There are two common errors people have in mind when they bring up skewness and kurtosis.

First, they implicitly believe that distributions can be boiled down to three or four numbers. Maybe they had an elementary statistics course in which everything boiled down to two moments — mean and variance — and they suspect that’s not enough, that advanced statistics extends elementary statistics by looking at third or fourth moments. “There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.” The path forward is not considering higher and higher moments.

This leads to a second and closely related problem. Interest in third and fourth moments sounds like hearkening back to the moment-matching approach to statistics. Moment matching was a simple idea for estimating distribution parameters:

- Set population means equal to sample means.
- Set population variances equal to sample variances.
- Solve the resulting equations for distribution parameters.

There’s more to moment matching than that, but that’s enough for this discussion. It’s a very natural approach, which is probably why it still persists. But it’s also a statistical dead end.
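As a concrete illustration of the three steps listed above, here is a minimal sketch for a gamma distribution with shape k and scale θ, where the mean is kθ and the variance is kθ². The choice of a gamma distribution, the function name `gamma_moment_match`, and the simulated data are assumptions made for this example; none of them come from the discussion itself.

```python
import random
import statistics


def gamma_moment_match(data):
    """Method-of-moments estimates (shape, scale) for a gamma distribution.

    Solves mean = k * theta and variance = k * theta**2 for k and theta.
    """
    m = statistics.fmean(data)
    v = statistics.variance(data)  # sample variance (n - 1 denominator)
    shape = m * m / v
    scale = v / m
    return shape, scale


# Simulated data with known parameters (illustrative values only).
rng = random.Random(42)
true_shape, true_scale = 3.0, 2.0
sample = [rng.gammavariate(true_shape, true_scale) for _ in range(20_000)]

shape_hat, scale_hat = gamma_moment_match(sample)
print(f"shape ~ {shape_hat:.3f}, scale ~ {scale_hat:.3f}")
```

Here the two equations happen to have a closed-form solution. With more complicated parametric families that is often not the case.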

Moment matching is the most convenient approach to finding estimators in some cases. However, there is another approach to statistics that has largely replaced moment matching, and that’s maximum likelihood estimation: find the parameters that make the data most likely.

Both moment matching and maximum likelihood are intuitively appealing ideas. Sometimes they lead to the same conclusions but often they do not. They competed decades ago and maximum likelihood won. One reason is that maximum likelihood estimators have better theoretical properties. Another reason is that maximum likelihood estimation provides a unified approach that isn’t thwarted by difficulties in solving algebraic equations.
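A standard textbook case where the two approaches disagree is the uniform distribution on [0, θ]. The sketch below is an illustration with simulated data, not something taken from the discussion above; the variable names and the value of θ are arbitrary. Matching the mean θ/2 to the sample mean gives twice the sample mean, while the likelihood is maximized at the largest observation.

```python
import random
import statistics

rng = random.Random(0)
theta = 10.0  # true upper endpoint (illustrative)
sample = [rng.uniform(0.0, theta) for _ in range(1_000)]

# Moment matching: E[X] = theta / 2, so set theta_hat = 2 * sample mean.
theta_mom = 2.0 * statistics.fmean(sample)

# Maximum likelihood: the likelihood is theta**(-n) for theta >= max(sample)
# and zero otherwise, so it is maximized at the sample maximum.
theta_mle = max(sample)

print(f"moment matching: {theta_mom:.4f}")
print(f"max likelihood:  {theta_mle:.4f}")
```

The moment estimate can come out smaller than the largest observation, which is impossible under the model. The MLE never has that problem, though it can never exceed the true θ either.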

There are good reasons to be concerned about higher moments (including fractional moments) though these are primarily theoretical. For example, higher moments are useful in quantifying the error in the central limit theorem. But there are not a lot of elementary applications of higher moments in contemporary statistics.

John, you always do a nice job of choosing a good level of abstraction when you talk about hard topics. This is no exception. But it does leave me with a couple of questions.

1) Do both MLE and “moment matching” require choosing a continuous distribution before fitting? I assume so, but just checking my assumptions.

2) What are the implications for something like the Johnson distribution that seems to fit a spline that matches the moments of the sample data? I think the Johnson Fit methodology can match moments or quantiles depending on how it’s used. Similar dead ends?

Thanks, JD. Moment matching requires a parametric distribution with as many moments as you wish to match (usually 2, but occasionally more). In principle the density function doesn’t have to be continuous, but in practice it nearly always is. It also requires that you can solve the system of equations you get, and this is likely to be too hard if the parametrized form is complicated (as would be the case with discontinuous densities).

MLE requires being able to maximize the likelihood. If you want to do this analytically then you may run into trouble. But if you’re going to compute the maximum numerically then a huge class of problems is tractable. Some form of convexity would be helpful but not absolutely necessary.

Estimates based on fitting the Johnson distribution may be convenient, and may be efficient enough for applications. But there’s a theory that asymptotically MLEs are as efficient as possible. You may be able to match MLEs for efficiency, but you can’t beat them. (At least using the definition of “best” implicit in the theory.)
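To make the numerical route to the maximum concrete, here is a sketch for the gamma distribution, whose shape parameter has no closed-form MLE. The assumptions are mine, for illustration only: for a fixed shape k the likelihood is maximized by scale = mean / k, so the problem reduces to a one-dimensional search, and that profile log-likelihood is concave in k, so a simple golden-section search is enough. The function names and search bounds are hypothetical choices.

```python
import math
import random
import statistics


def gamma_profile_loglik(k, mean_x, mean_log_x):
    """Per-observation gamma log-likelihood at shape k, scale = mean_x / k."""
    scale = mean_x / k
    return (k - 1.0) * mean_log_x - k - k * math.log(scale) - math.lgamma(k)


def gamma_mle(data, lo=1e-3, hi=1e3, tol=1e-8):
    """Numerically maximize the gamma likelihood by golden-section search."""
    mean_x = statistics.fmean(data)
    mean_log_x = statistics.fmean(math.log(x) for x in data)

    def f(k):
        return gamma_profile_loglik(k, mean_x, mean_log_x)

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):  # maximum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:            # maximum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    shape = (a + b) / 2.0
    return shape, mean_x / shape


# Simulated data with known parameters (illustrative values only).
rng = random.Random(1)
sample = [rng.gammavariate(3.0, 2.0) for _ in range(20_000)]
shape_hat, scale_hat = gamma_mle(sample)
print(f"shape ~ {shape_hat:.3f}, scale ~ {scale_hat:.3f}")
```

No equation is solved by hand here; the only model-specific input is the log-likelihood itself.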

Hi John,

I like your blog a lot. It’s my understanding that fields like econometrics still primarily use moment estimators. For example, Hayashi’s econometrics textbook is a staple of economics graduate programs. Could you explain a bit more about this late 80’s battle of estimators? What do you mean by: “There are not a lot of elementary applications of higher moments in contemporary statistics”? How are you counting?

Thanks,

Alex

Thanks, Alex. Moment matching and MLE lead to the same place if you assume a normal distribution. In that case, some people may prefer the terminology of moment matching even if they’re using the same estimates as MLE.
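For reference, here is the standard calculation behind the normal case, in notation introduced for this note (observations x_1, ..., x_n assumed i.i.d. normal with mean μ and variance σ²):

```latex
\begin{aligned}
\ell(\mu, \sigma^2) &= -\frac{n}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \\
\frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
  \quad\Longrightarrow\quad \hat{\mu} = \bar{x} \\
\frac{\partial \ell}{\partial \sigma^2} &= -\frac{n}{2\sigma^2}
  + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 = 0
  \quad\Longrightarrow\quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
\end{aligned}
```

These are the sample mean and the sample variance with denominator n, which is what moment matching gives when the sample variance is defined that way.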

I believe the battle between Pearson’s moment matching approach and Fisher’s MLE approach was over by 1930. Maybe you’re referring to something else in the 1980’s.

Moment estimators may be more convenient than MLEs in specific cases, especially for hand calculation, but MLEs are asymptotically unbiased and maximally efficient.

Alex,

Aren’t you talking about Generalized Method of Moments estimators?

If memory serves, the GMM estimators include the traditional (original) Method of Moments and MLE as special cases.

MLE’s easy, but can get you into trouble with variance in several ways:

1. MLEs are often biased, as in the MLE for univariate normal variance. (Though as John points out, they’re unbiased asymptotically and are efficient, so this matters more for low-count data.)

2. MLEs often don’t exist, as in MLEs for normal mixture models.

3. Reasoning with MLEs systematically underestimates estimation uncertainty in the face of finite data (and thus underestimates dispersion in unseen data). (This also matters less with lots of data, because the MLE converges to the true value in the limit.)
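Point 1 is easy to see in a small simulation. The following is a sketch with made-up settings (sample size 5, standard normal data, the seed), not a result from the discussion: the MLE of the variance divides by n and so averages about (n - 1)/n times the true variance.

```python
import random
import statistics

rng = random.Random(7)
n, trials = 5, 20_000  # small samples make the bias visible

mle_vars, unbiased_vars = [], []
for _ in range(trials):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # true variance is 1
    mle_vars.append(statistics.pvariance(x))      # divides by n (the MLE)
    unbiased_vars.append(statistics.variance(x))  # divides by n - 1

mean_mle = statistics.fmean(mle_vars)
mean_unbiased = statistics.fmean(unbiased_vars)
print(f"average MLE variance:      {mean_mle:.3f}  (theory: {(n - 1) / n:.3f})")
print(f"average unbiased variance: {mean_unbiased:.3f}  (theory: 1.000)")
```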

Bayesian point estimates based on L2 loss tackle 1 if you really care about point estimates. Full Bayesian inference (integrating inference over the posterior) tackles 3 if you’re not so concerned about point estimates. Adding priors can help with 2 for small data sets (while adding bias and sometimes reducing variance).

Of course, none of this matters if the model’s wrong to begin with, such as using a unimodal distribution (like a normal) to model an underlying multimodal distribution (like people’s heights, which are a mixture of men, women and children).

JD:

In most cases, folks are estimating a “continuous” parameter, even if it parameterizes a discrete distribution, so maximum likelihood is straightforward, either analytically (first derivative equals zero) or numerically (some variation of hill-climbing). Occasionally we might want to estimate a discrete parameter–like the number of trials in a binomial model–and then we need to get back to basics, but the problem is still tractable, although error estimation is a bit tricky.

The big advantages of maximum likelihood estimation are

(1) If you have a mathematical model, you can (almost) always write a likelihood function and devise a way to maximize it;

(2) If you can maximize the likelihood by taking the derivative of the log likelihood, you can also estimate the variance-covariance matrix for the estimates of the model parameters; and

(3) For large* samples, the likelihood function can be evaluated at null and alternate hypothesis values in parameter space to perform likelihood ratio tests, whose statistics are asymptotically chi-square distributed. (* Your largeness will vary with the model.) MLE geeks often pimp all this up by adding in score tests and Wald tests (the latter are z-tests and thus require one of those “large” samples).
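As a small illustration of point (3), here is a sketch of a likelihood ratio test for the rate of an exponential distribution. The model, the function names, and the simulated data are assumptions chosen to keep the example short. With one parameter under test, the statistic is compared to a chi-square distribution with one degree of freedom, whose tail probability can be written with the complementary error function.

```python
import math
import random
import statistics


def exp_loglik(rate, data):
    """Log-likelihood of i.i.d. exponential data with the given rate."""
    return len(data) * math.log(rate) - rate * sum(data)


def likelihood_ratio_test(data, null_rate):
    """LR test of H0: rate == null_rate against an unrestricted rate."""
    rate_hat = 1.0 / statistics.fmean(data)  # closed-form MLE
    stat = 2.0 * (exp_loglik(rate_hat, data) - exp_loglik(null_rate, data))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return rate_hat, stat, p_value


rng = random.Random(3)
sample = [rng.expovariate(2.0) for _ in range(500)]  # true rate is 2

for null_rate in (2.0, 3.0):
    rate_hat, stat, p = likelihood_ratio_test(sample, null_rate)
    print(f"H0 rate={null_rate}: MLE={rate_hat:.3f}, LR={stat:.2f}, p={p:.4f}")
```

The chi-square approximation is asymptotic, so the p-values are only as good as the sample is "large" for this model.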

As I tell my biology majors, there’s good news and there’s bad news: GOOD: If you’ve got a model, you can find an estimator. BAD: There will be calculus involved. Lots of calculus.

MOM estimators for the third order (skewness) and higher can be very unstable, requiring sample sizes in the tens of thousands before they settle down to asymptotic normality. MLEs will generally outperform this, as will other methods such as L-moments (http://www.research.ibm.com/people/h/hosking/lmoments.html).

Bob: Excellent points.

I had frequentist statistics in mind when I wrote this post, but it’s interesting to see what Bayesian statistics brings to the table.

As far as I know, the drawbacks to MLEs apply even more so to moment matching estimators. Do you know of any exceptions?

Bravo, Bob! Thanks for the addendum about bias, non-existence (unless you do EM–don’t ask), and the trap of inappropriate model selection. Bayes estimators–guaranteed to be biased–are often more tractable, depending on which prior distributions you use. The current fashion of noninformative or invariant priors seems to increase the credibility of the inference at the expense of analytic complexity–replacing the old can of worms with a fresh new one! Of course, if estimation were easy, we statisticians would be out of a job. It ain’t for sissies.

John;

I only recently learned about ‘L-moments’, the only obvious application of which is essentially the matching of (L-) moments approach to density estimation. Would you also consider this a ‘dead end’ approach?

In my experience, L-moments have value in fitting extreme value distributions (that was part of my MS thesis), and have become popular in water resource research. There is a paper suggesting that “L-skewness” is better than the (MOM) skewness statistic, but I hardly ever look at skewness in the first place.