John Tukey’s median of medians

Yesterday I got an email from Jestin Abraham asking a question about Tukey’s “median of medians” paper from 1978. (The full title is “The Ninther, a Technique for Low-Effort Robust (Resistant) Location in Large Samples.”) Jestin thought I might be familiar with the paper since I’ve written about Tukey several times, but I’d never heard of it.

Tukey’s “ninther” or “median of medians” procedure is quite simple. Understanding the problem he was trying to solve is a little more difficult.

Suppose you are given nine data points: y1, y2, …, y9. Let yA be the median of the first three samples, yB the median of the next three samples, and yC the median of the last three samples. The “ninther” of the data set is the median of yA, yB, and yC, hence the “median of medians.” If the data were sorted, the ninther would simply be the median, but in general it will not be.

For example, suppose your data are 3, 1, 4, 4, 5, 9, 9, 8, 2.  Then

yA = median( 3, 1, 4 ) = 3
yB = median( 4, 5, 9 ) = 5
yC = median( 9, 8, 2 ) = 8

and so the ninther is median( 3, 5, 8 ) = 5. The median is 4.
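
Here’s a minimal sketch of the procedure in Python (the function names are mine, not Tukey’s):

    def median3(a, b, c):
        # Median of three values using only order comparisons:
        # two comparisons in the best case, three in the worst.
        if a < b:
            if b < c:
                return b   # a < b < c
            elif a < c:
                return c   # a < c <= b
            else:
                return a   # c <= a < b
        else:
            if a < c:
                return a   # b <= a < c
            elif b < c:
                return c   # b <= c <= a
            else:
                return b   # c <= b <= a

    def ninther(y):
        # y is a sequence of nine values.
        yA = median3(y[0], y[1], y[2])
        yB = median3(y[3], y[4], y[5])
        yC = median3(y[6], y[7], y[8])
        return median3(yA, yB, yC)

    print(ninther([3, 1, 4, 4, 5, 9, 9, 8, 2]))   # prints 5, matching the example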

That’s Tukey’s solution, so what was his problem? First of all, he’s trying to find an estimate for the central value of a large data set. Assume the data come from a symmetric distribution so that the mean equals the median. He’s looking for a robust estimator of the mean, an estimator resistant to the influence of outliers. That’s why he’s using an estimator that is more like the median than the mean.

Why not just use the median? Computing the sample median requires storing all the data points and then sorting them to pick the middle value. Tukey wants to do his computation in one pass without storing the data. He also wants to use as few comparisons and as few arithmetic operations as possible. His ninther procedure uses no arithmetic operations, only order comparisons, and he shows that it needs only about 1.19 comparisons per data point on average and 1.33 comparisons per data point in the worst case.
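
Those numbers are easy to check empirically. Here’s a small experiment (assuming the median3 and ninther sketches above) that wraps each value in a class that tallies every comparison:

    import random

    class Counted:
        # Wraps a value and counts every order comparison made on it.
        count = 0
        def __init__(self, x):
            self.x = x
        def __lt__(self, other):
            Counted.count += 1
            return self.x < other.x

    trials = 10000
    for _ in range(trials):
        ninther([Counted(random.random()) for _ in range(9)])
    print(Counted.count / (9 * trials))   # about 1.19 (= 32/27) on random data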

How well does Tukey’s ninther perform? He shows that if the data come from a normal distribution, the ninther has about 55% efficiency relative to the sample mean. That is, the variances of his estimates are a little less than twice the variances of estimates using the sample mean. But the purpose of robust statistics is efficient estimation in case the data do not come from a normal distribution but from a distribution with thicker tails. The relative efficiency of the ninther improves when data do come from distributions with thicker tails.

Where do large data sets come in? So far we’ve only talked about analyzing data sets with nine points. Tukey’s idea was to use the ninther in conjunction with the median. For some large number M, you could estimate the central value of 9M data points by applying the ninther to groups of 9 points and taking the median of the M ninthers. This still requires computing the median of M points, but the memory requirement has been reduced by a factor of 9. The sorting time is also reduced by more than a factor of 9, since sorting n points takes time proportional to n log n.
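
In code, the scheme might look like this (a sketch reusing the ninther function above; median_of_ninthers is my name for it):

    def median_of_ninthers(data):
        # Reduce each group of nine points to its ninther, then return the
        # median of the summaries. Only M = len(data)/9 values are stored.
        summaries = [ninther(data[i:i+9]) for i in range(0, len(data) - 8, 9)]
        summaries.sort()
        return summaries[len(summaries) // 2]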

For even larger data sets, Tukey recommended breaking the data into sets of 81 points and computing the ninther of the ninthers. Then 81M data points could be processed by storing and sorting only M values.
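
The two-level reduction is one more line on top of the sketches above (ninther81 is again my name, not Tukey’s):

    def ninther81(block):
        # 81 points -> one value: the ninther of the nine ninthers.
        return ninther([ninther(block[i:i+9]) for i in range(0, 81, 9)])

Applying ninther81 to each group of 81 points and taking the median of the results, just as in median_of_ninthers above, processes 81M points while storing only M values.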

Tukey gave M = 1,000,000 as an example of what he called an “impractically large data set.” I suppose finding the median of 81 million data points was impractical in 1978, though it’s a trivial problem today. Perhaps Tukey’s ninther is still useful for an embedded device with extremely limited resources that must process enormous amounts of data.


7 thoughts on “John Tukey’s median of medians”

  1. I wrote a somewhat related paper (with max-min instead of the median):

    Daniel Lemire, Streaming Maximum-Minimum Filter Using No More than Three Comparisons per Element. Nordic Journal of Computing, 13 (4), pages 328-339, 2006.
    http://arxiv.org/abs/cs.DS/0610046

    I wonder how these approaches compare.

  2. It’s also worth noting that you actually don’t have to sort the entire data set to calculate the median; you only have to do a partial sort ensuring that the number at position (n+1)/2 is correct.

    This is the way the median calculation is implemented in R. But for larger data sets, the partial sort algorithm (in R) actually isn’t any faster than a full sort, and a full sort is used.
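
    In Python, numpy.partition gives the same partial-sort behavior (a sketch along those lines, not R’s actual implementation):

        import numpy as np

        def median_partial(a):
            # Partial sort: only the middle element(s) are guaranteed to be
            # in their sorted positions, as with C++'s nth_element.
            a = np.asarray(a)
            n = len(a)
            k = n // 2
            if n % 2:
                return np.partition(a, k)[k]
            lo, hi = np.partition(a, [k - 1, k])[[k - 1, k]]
            return (lo + hi) / 2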

  3. There are 2 useful things in the ninther concept: approximations to bulk statistics and the calculation of a median without any sorting.

    Generalising from 3 groups of 3 to n groups of m, we could still calculate a median from a series of chunks of the dataset, but we would need to sort.

    This would suggest problems when working on Very Large Datasets, but consider the case of the most annoying dataset – randomised data. If we take a chunk of data from this dataset, we can approximate the statistics of the bulk with the statistics of the chunk, or a series of chunks.

    The answer may be, then, to randomly pick out a series of 9-value chunks and calculate a series of ninthers, as in the sketch below. That way the number of comparisons per data point can be less than one.

    The cost ranges from O(1) to O(N), depending on how accurate you require your statistics to be.
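
    For instance (a sketch reusing the ninther function from the post; the chunk count is arbitrary):

        import random

        def sampled_estimate(data, n_chunks=100):
            # Estimate the central value from a random sample of 9-point chunks.
            summaries = []
            for _ in range(n_chunks):
                i = random.randrange(len(data) - 8)
                summaries.append(ninther(data[i:i+9]))
            summaries.sort()
            return summaries[len(summaries) // 2]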

  4. A tangent, since I’m by no means a mathematician… You may already be familiar with this, but Richard Hamming talked about Tukey’s personal characteristics and work methods in his outstanding lecture, “You and Your Research,” which I’ve cited many times on my own blog.

  5. You didn’t discuss the effectiveness of the ninther at estimating the actual mean. How accurate is it? You seem only concerned with computational efficiency and not at all with the entire point of doing it in the first place. The point was to estimate the mean, right? So how close does a ninther get?

  6. Turns out the ninther still has practical use today. It’s used to select pivots in the quicksort implementation of Google’s new programming language, Go.

  7. “Computing the sample median requires storing all data points and then sorting them to pick the middle value.”

    Sorting is O(N log N), which is much more work than necessary to find the median; that can be done in O(N). Even finding multiple order statistics is less work than a full sort: see Mahmoud & Lent, “Average-case analysis of multiple Quickselect: An algorithm for finding order statistics”.
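
    For example, a randomized Quickselect finds the median in expected O(N) time (a minimal sketch, not the multiple-Quickselect algorithm of that paper):

        import random

        def quickselect(a, k):
            # Return the k-th smallest element (0-based) in expected linear time.
            a = list(a)
            while True:
                pivot = random.choice(a)
                lo = [x for x in a if x < pivot]
                eq_count = sum(1 for x in a if x == pivot)
                if k < len(lo):
                    a = lo
                elif k < len(lo) + eq_count:
                    return pivot
                else:
                    k -= len(lo) + eq_count
                    a = [x for x in a if x > pivot]

        def median_linear(a):
            return quickselect(a, len(a) // 2)   # middle order statistic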
