John Tukey’s median of medians

by John on June 23, 2009

Yesterday I got an email from Jestin Abraham asking a question about Tukey’s “median of medians” paper from 1978. (The full title is “The Ninther, a Technique for Low-Effort Robust (Resistant) Location in Large Samples.”) Jestin thought I might be familiar with the paper since I’ve written about Tukey several times, but I’d never heard of it.

Tukey’s “ninther” or “median of medians” procedure is quite simple. Understanding the problem he was trying to solve is a little more difficult.

Suppose you are given nine data points: y1, y2, …, y9. Let yA be the median of the first three samples, yB the median of the next three samples, and yC the median of the last three samples. The “ninther” of the data set is the median of yA, yB, and yC, hence the “median of medians.” If the data were sorted, the ninther would simply be the median, but in general it will not be.

For example, suppose your data are 3, 1, 4, 4, 5, 9, 9, 8, 2.  Then

yA = median( 3, 1, 4 ) = 3
yB = median( 4, 5, 9 ) = 5
yC = median( 9, 8, 2 ) = 8

and so the ninther is median( 3, 5, 8 ) = 5. The median is 4.

That’s Tukey’s solution, so what was his problem? First of all, he’s trying to find an estimate for the central value of a large data set. Assume the data come from a symmetric distribution so that the mean equals the median. He’s looking for a robust estimator of the mean, an estimator resistant to the influence of outliers. That’s why he’s using an estimator that is more like the median than the mean.

Why not just use the median? Computing the sample median requires storing all data points and then sorting them to pick the middle value. Tukey wants to do his computation in one pass without storing the data. Also, he wants to do as few comparisons and as few arithmetic operations as possible. His ninther procedure uses no arithmetic operations and only order comparisons. He shows that it uses only about 1.1 comparisons per data point on average and 1.33 comparisons per data point in the worst case.

How well does Tukey’s ninther perform? He shows that if the data come from a normal distribution, the ninther has about 55% efficiency relative to the sample mean. That is, the variances of his estimates are a little less than twice the variances of estimates using the sample mean. But the purpose of robust statistics is efficient estimation in case the data do not come from a normal distribution but from a distribution with thicker tails. The relative efficiency of the ninther improves when data do come from distributions with thicker tails.

Where do large data sets come in? So far we’ve only talked about analyzing data sets with nine points. Tukey’s idea was to use the ninther in conjunction with the median. For some large number M, you could estimate the central value of 9M data points by applying the ninther to groups of 9 points and take the median of the M ninthers. This still requires computing the median of M points, but the memory requirement has been reduced by a factor of 9. Also, the sorting time has been reduced by more than a factor of 9 since sorting n points takes time proportional to n log n.

For even larger data sets, Tukey recommended breaking the data in to sets of 81 points and computing the ninther of the ninthers. Then 81M data points could be processed by storing and sorting M values.

Tukey gave M = 1,000,000 as an example of what he called an “impractically large data set.” I suppose finding the median of 81 million data points was impractical in 1978, though it’s a trivial problem today. Perhaps Tukey’s ninther is sill useful for an embedded device with extremely limited resources that must process enormous amounts of data.

Other posts on robust statistics:

Canonical examples from robust statistics
Efficiency of median versus mean

Other posts on John Tukey:

Approximate problems and approximate solutions
John Tukey and Aristotle
Tukey tallying
When discoveries stay discovered

{ 5 comments… read them below or add one }

1

Daniel Lemire 06.23.09 at 14:56

I wrote a somewhat related paper (with max-min instead of the median):

Daniel Lemire, Streaming Maximum-Minimum Filter Using No More than Three Comparisons per Element. Nordic Journal of Computing, 13 (4), pages 328-339, 2006.
http://arxiv.org/abs/cs.DS/0610046

I wonder how these approaches compare.

2

Karl Ove Hufthammer 06.24.09 at 00:47

It’s also worth noting that you actually don’t have to sort the entire data set to calculate the median; you only have to do a partial sort ensuring that the number at position (n+1)/2 is correct.

This is the way the median calculation is implemented in R. But for larger data sets, the partial sort algorithm (in R) actually isn’t any faster than a full sort, and a full sort is used.

3

Phil H 06.24.09 at 02:06

There are 2 useful things in the ninther concept: approximations to bulk statistics and the calculation of a median without any sorting.

Generalising from 3 groups of 3 to n groups of m, we could still calculate a median from a series of chunks of the dataset, but we would need to sort.

This would suggest problems when working on Very Large Datasets, but consider the case of the most annoying dataset – randomised data. If we take a chunk of data from this dataset, we can approximate the statistics of the bulk with the statistics of the chunk, or a series of chunks.

The answer may be, then, to randomly pick out a series of 9-value chunks, and calculate a series of ninthers. That way the number of comparisons per total values can be less than 1.

O(1) to O(N), depending on how accurate you require your statistics to be.

4

Tim Walker 06.24.09 at 21:16

A tangent, since I’m by no means a mathematician . . . You may already be familiar with this, but Richard Hamming talked about Tukey’s personal characteristics and work methods in his outstanding lecture, “You and Your Research,” which I’ve cited many times on my own blog.

5

CogitoErgoCogitoSum 04.16.10 at 11:55

You didnt discuss the effectiveness of the ninther at estimating the actual mean. How accurate is it? You seem only concerned with computer efficiency and not at all with the entire point of doing it in the first place. The point was to estimate the mean, right? So how close does a ninther get?

Leave a Comment

You can use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>