When you sort data and ask which sample falls in a particular position, you are working with order statistics. For example, you might want to know the smallest, largest, or middle value.

Order statistics are robust in a sense. The median of a sample, for example, is a very robust measure of central tendency. If Bill Gates walks into a room with a large number of people, the mean wealth jumps tremendously but the median hardly budges.

But order statistics are not robust in this sense: the **identity** of the sample in any given position can be very sensitive to perturbation. Suppose a room has an odd number of people so that someone has the median wealth. When Bill Gates and Warren Buffett walk into the room later, the **value** of the median wealth may not change much, but the **person** corresponding to that wealth will change.
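Here is a small sketch of that scenario. The room, the names, and the wealth figures are all made up for illustration; the only point is that the middle **value** barely moves while the middle **person** changes.

```python
def median_person(wealth):
    """Return (name, wealth) for the middle position; assumes an odd head count."""
    ranked = sorted(wealth.items(), key=lambda item: item[1])
    return ranked[len(ranked) // 2]


# Hypothetical room: 101 people with evenly spaced wealth.
room = {f"person_{i}": 50_000 + 1_000 * i for i in range(101)}
before = median_person(room)

# Two very wealthy newcomers (illustrative figures) keep the head count odd.
room["Bill Gates"] = 100_000_000_000
room["Warren Buffett"] = 100_000_000_000
after = median_person(room)

print("before:", before)
print("after: ", after)
```

The median value moves from 100,000 to 101,000, a 1% change, but the person holding the median position is a different person.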

One way to evaluate machine learning algorithms is by how often they pick the right winner in some sense. For example, dose-finding algorithms are often evaluated on how often they pick the best dose from a set of doses being tested. This can be a terrible criterion, causing researchers to be misled by a particular set of simulation scenarios. It’s more important how often an algorithm makes a **good** choice than how often it makes the **best** choice.

Suppose five drugs are being tested. Two are nearly equally effective, and three are much less effective. A good experimental design will lead to picking one of the two good drugs most of the time. But if the best drug is only slightly better than the next best, it’s too much to expect any design to pick the best drug with high probability. In this case it’s better to measure the expected utility of a decision rather than how often a design makes the best decision.
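A rough simulation makes the difference between the two criteria concrete. Everything here is an assumption made for illustration: the response rates, the 50 patients per arm, the "pick the arm with the most responses" rule, and the 0.05 margin that defines a "good" choice. It is a sketch, not a real trial design.

```python
import random


def simulate(true_rates, n_per_arm=50, n_trials=2000, good_margin=0.05, seed=0):
    """Estimate how often a naive design picks the best arm, a good arm,
    and the average true response rate (utility) of the arm it picks."""
    rng = random.Random(seed)
    best_rate = max(true_rates)
    picked_best = picked_good = 0
    total_utility = 0.0
    for _ in range(n_trials):
        # Observed number of responders in each arm.
        successes = [
            sum(rng.random() < p for _ in range(n_per_arm)) for p in true_rates
        ]
        top = max(successes)
        # Break ties at random so no arm is favored by its position.
        choice = rng.choice([i for i, s in enumerate(successes) if s == top])
        rate = true_rates[choice]
        picked_best += rate == best_rate
        picked_good += rate >= best_rate - good_margin
        total_utility += rate
    return picked_best / n_trials, picked_good / n_trials, total_utility / n_trials


# Two nearly equal good drugs and three much weaker ones (hypothetical rates).
p_best, p_good, utility = simulate([0.50, 0.49, 0.20, 0.20, 0.20])
print(f"P(pick best) = {p_best:.2f}")
print(f"P(pick good) = {p_good:.2f}")
print(f"mean utility = {utility:.3f}")
```

Under these assumptions the design picks the single best drug only a little more than half the time, yet it almost always picks one of the two good drugs, and the expected utility of its choice is very close to the best available.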

This is reminiscent of the notion of regret as commonly used in multi-armed bandits; the idea being that you want to minimise your cumulative regret, as defined by the difference between the value of the selected option and the (unknown) optimal one.

I suppose the setting you described is subtly different though, because after the trial has concluded you still want to make a single decision, and you would not necessarily be interested in optimising the online regret during the trial itself.
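To pin down the two notions being contrasted, here is one common way to write them. The symbols are my own choice, not from the post: $\mu_a$ is the true value of option $a$, $\mu^{*} = \max_a \mu_a$, $a_t$ is the option tried at step $t$ of the trial, and $\hat{a}$ is the single option recommended at the end.

```latex
R_T = \sum_{t=1}^{T} \left( \mu^{*} - \mu_{a_t} \right),
\qquad
r = \mu^{*} - \mu_{\hat{a}}
```

The first quantity is the cumulative (online) regret accrued during the trial; the second, often called simple regret, only looks at the final decision, which seems closer to the expected-utility criterion in the post.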

Also note that as the sample sitting at the median position moves up or down (without crossing its neighbors), the median moves with it one for one, so minor changes in that one sample change the answer for the whole set.

One way to reduce this sensitivity to ‘choice of person’ is to use the Harrell-Davis quantile estimator rather than the traditional quantile estimator.

e.g.

http://hackage.haskell.org/package/multipass-0.1.0.2/docs/src/Data-Pass-L-Estimator.html#Estimator

Now you’ll borrow a bit of information from all of your order statistics to calculate an estimate of the quantile in question, reducing the dependence on any one order statistic, at the cost of greatly reduced robustness.

@kmett I’m not sure how Harrell-Davis does it, but the mean of a middle quantile would be a simple way to minimize this sensitivity.

I don’t know why you’d want to reduce this kind of sensitivity. It seems to me that calculations that are hurt by the sensitivity were poorly posed to begin with.

You can view the Harrell-Davis quantile estimator as the limit of the bootstrapped quantile estimate as the number of resamples goes to infinity. It gives much better answers when you have few samples, but converges to the same answer as the more direct method in the limit as you get more and more samples to work with.
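To make the "limit of the bootstrap" idea concrete, here is a sketch for the median of a tiny sample with an odd number of points. The function names and the data are made up. For a sample this small you can enumerate every possible resample, which plays the role of infinitely many resamples, and compare that with an ordinary Monte Carlo bootstrap.

```python
import itertools
import random
import statistics


def exact_bootstrap_median(sample):
    """Average the median over all n**n equally likely resamples (tiny n only)."""
    n = len(sample)
    medians = [statistics.median(r) for r in itertools.product(sample, repeat=n)]
    return sum(medians) / len(medians)


def monte_carlo_bootstrap_median(sample, n_resamples=20_000, seed=0):
    """Average the median over randomly drawn resamples with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    total = 0.0
    for _ in range(n_resamples):
        total += statistics.median(rng.choices(sample, k=n))
    return total / n_resamples


data = [1, 2, 10]  # hypothetical sample
print("plain median:         ", statistics.median(data))
print("exact bootstrap mean: ", exact_bootstrap_median(data))
print("Monte Carlo bootstrap:", monte_carlo_bootstrap_median(data))
```

For this sample the exhaustive average is 103/27, about 3.81, and the Monte Carlo figure settles toward it as the number of resamples grows. Note how far the value sits from the plain median of 2: the extreme sample at 10 gets real weight.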

When you want a good quantile estimate and have few data points, it exploits some nice properties of the beta distribution, and the relationships between order statistics that follow from them, to read information off the other order statistics rather than myopically focusing on the one or two that happened to bracket the quantile in this particular set of samples. It effectively asks how likely each sample would be to bracket the quantile in question after bootstrapping, which lets you borrow information from the other samples.

It has the benefit of having a closed form, however, and being distribution-free.

As with any bootstrapping technique, you need to know how it works and what its limitations are, but it is a pretty powerful tool to add to your toolbox, and you can read off the coefficients for the resulting L-estimator from a fairly simple calculation involving the beta distribution.
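For the curious, here is a minimal sketch of that calculation, using the usual statement of the estimator: the weight on the i-th order statistic is the mass a Beta((n+1)q, (n+1)(1-q)) distribution puts on the interval from (i-1)/n to i/n. To stay within the standard library I integrate the beta density with a simple midpoint rule instead of calling an incomplete beta function, and I assume both shape parameters are at least 1 so the density stays bounded. The function names are mine, and this is meant for small samples, not as a production implementation.

```python
def harrell_davis_weights(n, q, steps=400):
    """Approximate Harrell-Davis weights for the n order statistics at quantile q."""
    a = (n + 1) * q
    b = (n + 1) * (1 - q)
    if a < 1 or b < 1:
        raise ValueError("this sketch assumes 1/(n+1) <= q <= n/(n+1)")

    def density(x):
        # Unnormalized beta density; the constant cancels when we normalize.
        return x ** (a - 1) * (1 - x) ** (b - 1)

    h = 1.0 / (n * steps)  # width of one midpoint-rule slice
    masses = []
    for i in range(n):
        left = i / n
        masses.append(h * sum(density(left + (k + 0.5) * h) for k in range(steps)))
    total = sum(masses)
    return [m / total for m in masses]


def harrell_davis(sample, q):
    """Weighted sum of the sorted sample using the weights above."""
    xs = sorted(sample)
    weights = harrell_davis_weights(len(xs), q)
    return sum(w * x for w, x in zip(weights, xs))


print(harrell_davis_weights(3, 0.5))
print(harrell_davis([10, 1, 2], 0.5))
```

With three points and q = 0.5 the weights come out close to 7/27, 13/27, 7/27, so every sample contributes. That is the source of both the borrowed information and the reduced robustness: for the sample 1, 2, 10 the estimate is about 3.81, where the ordinary median is 2.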