Statistical dead end

Posted on 20 September 2010 by John

I get suspicious when I hear people ask about third and fourth moments (skewness and kurtosis). I’ve heard these terms far more often from people who don’t understand statistics than from people who do.

There are two common errors people often have in mind when they bring up skewness and kurtosis.

First, they implicitly believe that distributions can be boiled down to three or four numbers. Maybe they had an elementary statistics course in which everything boiled down to two moments — mean and variance — and they suspect that’s not enough, that advanced statistics extends elementary statistics by looking at third or fourth moments. “There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.” The path forward is not considering higher and higher moments.

This leads to a second and closely related problem. Interest in third and fourth moments sounds like hearkening back to the moment-matching approach to statistics. Moment matching was a simple idea for estimating distribution parameters:

1. Set population means equal to sample means.
2. Set population variances equal to sample variances.
3. Solve the resulting equations for distribution parameters.

There’s more to moment matching than that, but that’s enough for this discussion. It’s a very natural approach, which is probably why it still persists. But it’s also a statistical dead end.
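To make the recipe concrete, here is a minimal sketch of moment matching for a gamma distribution. The choice of the gamma family, the function name `gamma_moment_match`, and the simulated data are assumptions made for illustration, not something from the post. For a gamma with shape k and scale θ, the mean is kθ and the variance is kθ², so the two equations can be solved by hand.

```python
import random
import statistics


def gamma_moment_match(data):
    """Estimate gamma (shape, scale) by matching the mean and variance.

    Population mean = shape * scale and variance = shape * scale**2,
    so shape = mean**2 / variance and scale = variance / mean.
    """
    m = statistics.fmean(data)
    v = statistics.pvariance(data, mu=m)  # sample variance with 1/n
    return m * m / v, v / m


# Illustrative data: 10,000 draws from a gamma with shape 3 and scale 2.
rng = random.Random(0)
sample = [rng.gammavariate(3.0, 2.0) for _ in range(10_000)]
shape_hat, scale_hat = gamma_moment_match(sample)
print(shape_hat, scale_hat)
```

The equations happen to have a closed-form solution here. For many families they do not, which is one of the practical difficulties with the approach.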

Moment matching is the most convenient approach to finding estimators in some cases. However, there is another approach to statistics that has largely replaced moment matching, and that’s maximum likelihood estimation: find the parameters that make the data most likely.
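As a rough illustration of "find the parameters that make the data most likely," the sketch below fits a gamma distribution by maximizing the likelihood numerically. The gamma family, the function name `gamma_mle`, the search bounds, and the use of golden-section search are all assumptions chosen to keep the example short. For a fixed shape k the best scale is mean/k, so only a one-dimensional search over k is needed.

```python
import math
import random


def gamma_mle(data, lo=1e-3, hi=1e3, iters=200):
    """Maximum likelihood estimate of gamma (shape, scale).

    Substituting scale = mean / shape into the log-likelihood leaves a
    function of the shape alone, which is maximized here by
    golden-section search on log(shape) between log(lo) and log(hi).
    """
    n = len(data)
    mean = sum(data) / n
    mean_log = sum(math.log(x) for x in data) / n

    def profile_loglik(log_k):
        # Per-observation log-likelihood with the scale profiled out.
        k = math.exp(log_k)
        return (k - 1) * mean_log - k - k * math.log(mean / k) - math.lgamma(k)

    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if profile_loglik(c) < profile_loglik(d):
            a = c  # the maximum lies to the right of c
        else:
            b = d  # the maximum lies to the left of d
    shape = math.exp((a + b) / 2)
    return shape, mean / shape


# Illustrative data: 5,000 draws from a gamma with shape 3 and scale 2.
rng = random.Random(1)
sample = [rng.gammavariate(3.0, 2.0) for _ in range(5_000)]
shape_hat, scale_hat = gamma_mle(sample)
print(shape_hat, scale_hat)
```

No equation had to be solved algebraically. The only requirement was being able to evaluate the likelihood, which is why the numerical route covers so many models.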

Both moment matching and maximum likelihood are intuitively appealing ideas. Sometimes they lead to the same conclusions but often they do not. They competed decades ago and maximum likelihood won. One reason is that maximum likelihood estimators have better theoretical properties. Another reason is that maximum likelihood estimation provides a unified approach that isn’t thwarted by difficulties in solving algebraic equations.

There are good reasons to be concerned about higher moments (including fractional moments) though these are primarily theoretical. For example, higher moments are useful in quantifying the error in the central limit theorem. But there are not a lot of elementary applications of higher moments in contemporary statistics.
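One standard result of this kind is the Berry–Esseen theorem, stated here as an example of my choosing rather than something the post names. For independent, identically distributed X_1, ..., X_n with mean μ, variance σ², and finite third absolute central moment ρ, let F_n be the distribution function of the standardized sample mean and Φ the standard normal distribution function. Then, for an absolute constant C,

```latex
\sup_{x} \left| F_n(x) - \Phi(x) \right|
  \le \frac{C \, \rho}{\sigma^{3} \sqrt{n}},
\qquad
\rho = \mathbb{E}\,\lvert X_1 - \mu \rvert^{3},
\qquad
F_n(x) = \Pr\!\left( \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \le x \right).
```

The third moment controls how quickly the normal approximation becomes accurate, which is a theoretical use of a higher moment rather than a way of summarizing data.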

12 thoughts on “Statistical dead end”

JD Long

20 September 2010 at 11:40

John, you always do a nice job of choosing a good level of abstraction when you talk about hard topics. This is no exception. But it does leave me with a couple of questions.

1) Do both MLE and “moment matching” require choosing a continuous distribution before fitting? I assume so, but just checking my assumptions.

2) What are the implications for something like the Johnson distribution that seems to fit a spline that matches the moments of the sample data? I think the Johnson Fit methodology can match moments or quantiles depending on how it’s used. Similar dead ends?
John

20 September 2010 at 11:54

Thanks, JD. Moment matching requires a parametric distribution with as many moments as you wish to match (usually 2, but occasionally more). In principle the density function doesn’t have to be continuous, but in practice it nearly always is. It also requires that you can solve the system of equations you get, and this is likely to be too hard if the parametrized form is complicated (as would be the case with discontinuous densities).

MLE requires being able to maximize the likelihood. If you want to do this analytically then you may run into trouble. But if you’re going to compute the maximum numerically then a huge class of problems is tractable. Some form of convexity would be helpful but not absolutely necessary.

Estimates based on fitting the Johnson distribution may be convenient, and maybe efficient enough for applications. But there’s a theory that asymptotically MLEs are as efficient as possible. You may be able to match MLEs for efficiency, but you can’t beat them. (At least using the definition of “best” implicit in the theory.)
Alex

20 September 2010 at 11:55

Hi John,

I like your blog a lot. It’s my understanding that fields like econometrics still primarily use moment estimators. For example, Hayashi’s econometrics textbook is a staple of economics graduate programs. Could you explain a bit more about this late 80’s battle of estimators? What do you mean by: “There are not a lot of elementary applications of higher moments in contemporary statistics”? How are you counting?

Thanks,
Alex
John

20 September 2010 at 12:07

Thanks, Alex. Moment matching and MLE lead to the same place if you assume a normal distribution. In that case, some people may prefer the terminology of moment matching even if they’re using the same estimates as MLE.
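A short derivation shows why the two coincide in the normal case. The notation is mine. For observations x_1, ..., x_n from a normal distribution with mean μ and variance σ², set the derivatives of the log-likelihood to zero:

```latex
\ell(\mu, \sigma^2)
  = -\frac{n}{2} \log\!\left( 2 \pi \sigma^2 \right)
    - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2

\frac{\partial \ell}{\partial \mu}
  = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0
  \;\Longrightarrow\;
  \hat{\mu} = \bar{x}

\frac{\partial \ell}{\partial \sigma^2}
  = -\frac{n}{2 \sigma^2} + \frac{1}{2 \sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0
  \;\Longrightarrow\;
  \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
```

These are the sample mean and the sample variance with divisor n, which is exactly what matching the first two moments gives.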

I believe the battle between Pearson’s moment matching approach and Fisher’s MLE approach was over by 1930. Maybe you’re referring to something else in the 1980’s.

Moment estimators may be more convenient than MLEs in specific cases, especially for hand calculation, but MLEs are asymptotically unbiased and maximally efficient.
GMM Estimator

20 September 2010 at 12:39

Alex,

You’re talking about the Generalized Method of Moments estimators, aren’t you?
If my memory permits, the GMM estimators include the traditional (original) Method of Moments and MLE as special cases.
Bob Carpenter

20 September 2010 at 15:41

MLE’s easy, but can get you into trouble with variance in several ways:

1. MLEs are often biased, as in the MLE for univariate normal variance. (Though as John points out, they’re unbiased asymptotically and are efficient, so this matters more for low-count data.)

2. MLEs often don’t exist, as in MLEs for normal mixture models.

3. Reasoning with MLEs systematically underestimates estimation uncertainty in the face of finite data (and thus underestimates dispersion in unseen data). (This also matters less with lots of data, because the MLE converges to the true value in the limit.)
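Point 1 is easy to check by simulation. The sketch below is illustrative only, and the function name, sample size, and seed are arbitrary choices. It averages the MLE of a normal variance (the divisor-n estimator) over many small samples. In theory the average is (n - 1)/n times the true variance.

```python
import random
import statistics


def mean_mle_variance(n, trials, rng, sigma=1.0):
    """Average the divisor-n variance estimate over many normal samples."""
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total += statistics.pvariance(sample)  # MLE of the variance
    return total / trials


rng = random.Random(42)
n = 5
avg = mean_mle_variance(n, 20_000, rng)
# Compare the simulated average with the theoretical value (n - 1) / n.
print(avg, (n - 1) / n)
```

With n = 5 the estimator is low by about 20% on average, and the gap shrinks as n grows, consistent with the bias mattering mainly for small samples.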

Bayesian point estimates based on L2 loss tackle 1 if you really care about point estimates. Full Bayesian inference (integrating inference over the posterior) tackles 3 if you’re not so concerned about point estimates. Adding priors can help with 2 for small data sets (while adding bias and sometimes reducing variance).

Of course, none of this matters if the model’s wrong to begin with, such as using a unimodal distribution (like a normal) to model an underlyingly multimodal distribution (like people’s heights, which are a mixture of men, women and children).
Mike Anderson

20 September 2010 at 14:54

JD:
In most cases, folks are estimating a “continuous” parameter, even if it parameterizes a discrete distribution, so maximum likelihood is straightforward, either analytically (first derivative equals zero) or numerically (some variation of hill-climbing). Occasionally we might want to estimate a discrete parameter–like the number of trials in a binomial model–and then we need to get back to basics, but the problem is still tractable, although error estimation is a bit tricky.

The big advantages of maximum likelihood estimation are

(1) If you have a mathematical model, you can (almost) always write a likelihood function and devise a way to maximize it;

(2) If you can maximize the likelihood by taking the derivative of the log likelihood, you can also estimate the variance-covariance matrix for the estimates of the model parameters; and

(3) For large* samples, the likelihood function can be evaluated at null and alternate hypothesis values in parameter space to perform likelihood ratio tests, whose statistics are asymptotically chi-square distributed. (* Your largeness will vary with the model.) MLE geeks often pimp all this up by adding in score tests and Wald tests (the latter are z-tests and thus require one of those “large” samples).
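Point (3) can be sketched in a few lines for a one-parameter model. The exponential model, the function names, and the simulated data below are assumptions picked for brevity. With one parameter fixed under the null hypothesis, the statistic is compared with a chi-square distribution with 1 degree of freedom, whose upper tail probability can be written with the complementary error function.

```python
import math
import random


def exp_loglik(rate, data):
    """Log-likelihood of an exponential model with the given rate."""
    return len(data) * math.log(rate) - rate * sum(data)


def likelihood_ratio_test(data, rate0):
    """Test H0: rate == rate0 against an unrestricted rate."""
    rate_hat = len(data) / sum(data)  # MLE of the rate
    stat = 2.0 * (exp_loglik(rate_hat, data) - exp_loglik(rate0, data))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Upper tail of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return rate_hat, stat, p_value


# Illustrative data: 500 draws from an exponential with rate 2.
rng = random.Random(7)
data = [rng.expovariate(2.0) for _ in range(500)]
print(likelihood_ratio_test(data, rate0=2.0))  # null hypothesis is true
print(likelihood_ratio_test(data, rate0=3.0))  # null hypothesis is false
```

The p-value rests on the asymptotic chi-square approximation, so it is only trustworthy for "large" samples in the sense described above.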

As I tell my biology majors, there’s good news and there’s bad news: GOOD: If you’ve got a model, you can find an estimator. BAD: There will be calculus involved. Lots of calculus.
EastwoodDC

20 September 2010 at 16:02

MOM estimators for the third order (skewness) and higher can be very unstable, requiring sample sizes in the tens of thousands before they settle down to asymptotic normality. MLEs will generally outperform this, as will other methods such as L-moments (http://www.research.ibm.com/people/h/hosking/lmoments.html).
John

20 September 2010 at 16:08

Bob: Excellent points.

I had frequentist statistics in mind when I wrote this post, but it’s interesting to see what Bayesian statistics brings to the table.

As far as I know, the drawbacks to MLEs apply even more so to moment matching estimators. Do you know of any exceptions?
Mike Anderson

21 September 2010 at 08:59

Bravo, Bob! Thanks for the addendum about bias, non-existence (unless you do EM–don’t ask), and the trap of inappropriate model selection. Bayes estimators–guaranteed to be biased–are often more tractable, depending on which prior distributions you use. The current fashion of noninformative or invariant priors seems to increase the credibility of the inference at the expense of analytic complexity–replacing the old can of worms with a fresh new one! Of course, if estimation were easy, we statisticians would be out of a job. It ain’t for sissies.
shabbychef

22 September 2010 at 10:24

John;

I only recently learned about ‘L-moments’, the only obvious application of which is essentially the matching of (L-) moments approach to density estimation. Would you also consider this a ‘dead end’ approach?
EastwoodDC

22 September 2010 at 16:25

In my experience, L-moments have value in fitting extreme value distributions (that was part of my MS thesis), and have become popular in water resource research. There is a paper suggesting that “L-skewness” is better than the (MOM) skewness statistic, but I hardly ever look at skewness in the first place.
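For readers who have not met them, here is a minimal sketch of how the first three sample L-moments and the L-skewness can be computed. It uses the usual estimators built from probability-weighted moments of the sorted data, as I understand Hosking's definitions. The function name is my own, and this is not code from the page linked above.

```python
def sample_l_moments(data):
    """Return (l1, l2, L-skewness) from the sorted sample.

    b0, b1, b2 are the sample probability-weighted moments; the
    L-moments are fixed linear combinations of them.
    """
    x = sorted(data)
    n = len(x)
    if n < 3:
        raise ValueError("need at least three observations")
    b0 = sum(x) / n
    b1 = sum(i * x[i] for i in range(n)) / (n * (n - 1))
    b2 = sum(i * (i - 1) * x[i] for i in range(n)) / (n * (n - 1) * (n - 2))
    l1 = b0                      # location (the sample mean)
    l2 = 2 * b1 - b0             # scale
    l3 = 6 * b2 - 6 * b1 + b0    # third L-moment
    return l1, l2, l3 / l2


print(sample_l_moments([1, 2, 3, 4, 5]))    # symmetric sample
print(sample_l_moments([1, 2, 3, 4, 100]))  # one large outlier
```

Because every term is linear in the ordered observations, a single extreme value moves these statistics far less than it moves the cubed deviations in the conventional skewness.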
