Imagine a bank of three elevators along a wall. The elevators are in a straight line but they are not evenly spaced. Where do you stand in order to minimize the average distance you’ll need to walk to catch the first elevator that arrives?

This scenario comes from the opening paragraph of a paper by James Handley *et al*.

If asked where they would stand and wait for the next of three elevators, unequally spaced along a wall, many students would choose to stand at the mean position. They think that by doing so they are minimizing the average distance to the elevator. They do not recognize that standing at the mean minimizes the average

squareddistance and that the minimal average distance to the elevator is in fact achieved by standing at the median.

Suppose you start out standing in front of the second elevator. If you move one foot to the left, you decrease the distance to first elevator by a foot, but you increase the distance to the other two elevators by one foot as each, so your average distance increases. Similarly, if you move one foot to the right, you decrease your distance to the third elevator but increase your distance to the other two so that you increase the average distance. Since you can’t move without increasing the average distance, you must have started at the best spot.

So standing in front of the second elevator minimizes the expected distance to the next elevator, assuming all three elevators are equally likely to arrive next.

What if you want to minimize the worst case instead of the average case? Stand half way between the first and third elevators. As before, you can see that if you were to move from that position, you’d increase your distance to at least one elevator and thus increase the maximum distance.

This problem illustrates three optimization theorems:

- The sample mean minimizes the total squared distance to a set of points.
- The sample median minimizes the mean absolute distance to a set of points.
- The mid-range minimizes the maximum distance to a set of points.

These theorems have numerous applications. For example, they are foundational in the study of robust estimators.

Better use the stairs!

Interestingly, if you view this as best-case/worst-case log-loss minimization and allow arbitrary log-linear models, 1 and 2 remain easy, while 3 becomes hard. There a large body of literature on finding “Maximizers of the Information Divergence”, and as far as I can see, people still haven’t a good way to deal with non-convexity of the this objective

Doesn’t this example only show that the spot is a

localminima, and not the global one? Nevertheless, it’s a great intuitive example of the median!Travis: True, it’s not a rigorous proof. You can find a rigorous proof here:

“A Simple Noncalculus Proof That the Median Minimizes the Sum of the Absolute Deviations.” Neil C. Schwertman, A. J. Gilks, and J. Cameron. The American Statistician, Vol. 44, No. 1 (Feb., 1990), pp. 38-39.

Neat explanation of the difference among mean, median, and midrange and a good foot in the door for introducing students (all of whom are allergic to real analysis) to the L1, L2, and L∞ norms. Throw in the mode and some loss functions, and you’ve got a good math stats lecture that isn’t straight out of the undergrad texts! Oh, and thanks for the American Statistician reference.

So how much of statistical theory can be done in the framework of a general p-norm? The three averages you describe correspond to p=1, p=2 and p=infinity, no? What statistical theorems, familiar for the p=2 case, generalize for other p? I seem to recall a central limit theorem for the p=1 case.

Mike: Foot in the elevator?

Steve: Yes, these cases correspond to the 1, 2, and infinity norms. I’m not familiar with statistical applications for other values of p. You might expect, for example, that p = 1.5 would make a good compromise between the mean and median. Instead, I believe it’s more common to compromise between p=1 and p=2 in other ways, such as a function that looks like a quadratic near the origin and looks like absolute value for larger values. (That’s what Huber did.)

There are applications for p<1 in sparse reconstruction, ie you want to minimize the number of non-zero components. p=1 does that somewhat, but p=0.5 is more aggressive and results in better theoretical guarantees http://metaoptimize.com/qa/questions/1103/what-is-l05-regularization#2398

There are some examples outside of statistics where other values of p are important in this MathOverflow question:

Why do we care about L^p spaces besides p = 1, p = 2, and p = infinity?

Also, here is a statistical application of sorts. It’s a statistical paper, but it really just uses the Lp norms for curve-fitting.

This is very nice. But then there’s the further thought of where do you stand given that you are actually standing h from the wall of elevators and you will walk a straight line to which ever elevator opens first. I’m assuming you don’t have control over h for some reason. The optimal x is a pain to calculate for a given h (well, it is for me) but, unless my quick chicken scratch is wrong, as h->infty then x->(mean of the 3 elevators).

However, I don’t believe this addendum illustrates anything of value.

An interesting twist to this problem is to assume that the elevator is a destination elevator. Then, getting into the first elevator that arrives may not be the right thing to do!

If you approach the elevators from a distance it’s better to stop walking immediately after you come within a safe distance of all three. In this manner you will not make one unnecessary step

@Yaroslav Bulatov While P<1 is more aggressive it could also be less reliable. It shift estimator from median to *meridian* and in the case of sharp local maximum(mode) and especially in multimodal distribution it could be unstable to noise.

This is problem 9.3-9, page 223, from chapter 9, “Medians and Orders statistics” from Cormen et al’s book, “Introduction to algorithms”, 3rd edition. It is shown there that the median (which can be aquired in linear time) is the answer.