Suppose you’re drawing random samples uniformly from some interval. How likely are you to see a new value outside the range of values you’ve already seen?
The problem is more interesting when the interval is unknown. You may be trying to estimate the endpoints of the interval by taking the max and min of the samples you’ve drawn. But in fact we might as well assume the interval is [0, 1], because the probability of a new sample falling within the previous sample range does not depend on the interval: the location and scale of the interval cancel out when calculating the probability.
Suppose we’ve taken n samples so far. The range of these samples is the difference between the nth and the 1st order statistics, and for a uniform distribution this difference has a beta(n − 1, 2) distribution. Since a beta(a, b) distribution has mean a/(a + b), the expected value of the sample range from n samples is (n − 1)/(n + 1). Because a uniform sample lands in an interval with probability equal to that interval’s length, this expected range is also the probability that the next sample, or any particular future sample, will lie within the range of the samples seen so far.
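As a quick sanity check, not part of the original argument, here’s a Python simulation comparing the empirical distribution of the sample range to beta(n − 1, 2); the sample size, trial count, and seed are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, trials = 10, 100_000

# Draw `trials` independent sets of n uniform samples and compute each range.
samples = rng.random((trials, n))
ranges = samples.max(axis=1) - samples.min(axis=1)

print("empirical mean:  ", ranges.mean())      # should be near (n-1)/(n+1)
print("theoretical mean:", (n - 1) / (n + 1))  # 9/11 = 0.8182 for n = 10

# Kolmogorov-Smirnov test against beta(n - 1, 2); a large p-value is
# consistent with the claimed distribution.
print(stats.kstest(ranges, stats.beta(n - 1, 2).cdf))
```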
If you’re trying to estimate the size of the total interval, this says that after n samples, the probability that the next sample will give you any new information is 2/(n + 1). This is because we only learn something when a sample is less than the minimum so far or greater than the maximum so far.
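The complementary probability is easy to estimate directly. A minimal sketch, again with arbitrary parameters, counting how often an (n + 1)st sample falls outside the range of the first n:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 100_000

samples = rng.random((trials, n + 1))
prev, nxt = samples[:, :n], samples[:, n]

# "New information" means the next sample is a new minimum or a new maximum.
new_info = (nxt < prev.min(axis=1)) | (nxt > prev.max(axis=1))

print("empirical:  ", new_info.mean())  # should be near 2/(n+1)
print("theoretical:", 2 / (n + 1))      # 2/11 = 0.1818 for n = 10
```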
Isn’t this the purpose of extreme value theory/analysis?
The following papers may be of interest:
Tippett, L. H. C. (1925). On the Extreme Individuals and the Range of Samples Taken from a Normal Population. Biometrika, 17, 364–387.
Newman, D. (1939). The Distribution of Range in Samples from a Normal Population, Expressed in Terms of an Independent Estimate of Standard Deviation. Biometrika, 31, 20–30.
This is probably related to the German tank problem as well.
Avraham beat me to it — I was also going to mention the German tank problem.
There’s an analogous problem associated with the probability that the next observation of a point drawn uniformly over a convex spatial region will lie outside the convex hull of the previous points…
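That spatial analog can be explored numerically. A sketch, assuming points drawn uniformly over the unit square (the region, dimension, and sample size here are arbitrary choices), estimating the probability that a new point lies outside the convex hull of the previous n:

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
n, trials = 10, 20_000

outside = 0
for _ in range(trials):
    pts = rng.random((n, 2))        # n points uniform in the unit square
    hull = Delaunay(pts)            # triangulates the convex hull of pts
    new = rng.random(2)
    if hull.find_simplex(new) < 0:  # negative index means outside the hull
        outside += 1

print("estimated probability outside hull:", outside / trials)
```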
That probability is not conditioned on what the first n samples actually were, right? I think that why (or even whether) you should continue believing it once you’ve seen the first n samples is a little more subtle.
Right. If you condition on the values seen so far, the probability is simply the sample range.
Oh, yeah. I was thinking of the case where you don’t know the interval.
I can’t make it explicit at the moment, but I think in that case there’s an argument that you should continue believing your (n-1)/(n+1) even after you’ve seen the first n draws.
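To make the conditional statement above concrete: fix one observed sample set, and the fraction of future draws landing inside its range is just the sample range itself. A minimal sketch, with arbitrary sample size and seed:

```python
import numpy as np

rng = np.random.default_rng(2)
observed = rng.random(10)          # one fixed set of n = 10 samples
lo, hi = observed.min(), observed.max()

future = rng.random(100_000)       # many hypothetical next samples
inside = ((future > lo) & (future < hi)).mean()

print("fraction inside:", inside)  # should be near the sample range
print("sample range:   ", hi - lo)
```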
No need for fancy things like the beta distribution, and no need to restrict to uniformly distributed numbers.
The first n+1 values, whatever they are, can arrive in any order. (With probability 1 they’re all distinct.) In 1/(n+1) of the orders the last one is the smallest, and in 1/(n+1) of the orders it’s the largest. For n ≥ 1 it can’t be both, so the probability is 2/(n+1).
This all works whatever the probability distribution of your numbers, provided only that the probability of getting any precise value is zero (so that the probability of a tie is zero). If you break ties “at random” then it works completely unconditionally.
g: Very slick.
The beta formula gives the entire distribution, not just the mean, but your argument arrives at the mean more simply and under more general assumptions.
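The symmetry argument above is easy to check empirically for a non-uniform distribution. A sketch using standard normal draws (the distribution and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 100_000

# Any continuous distribution works; ties have probability zero.
x = rng.standard_normal((trials, n + 1))
last = x[:, -1]

# The last draw is a new extreme iff it is the minimum or the maximum.
extreme = (last == x.min(axis=1)) | (last == x.max(axis=1))

print("empirical:  ", extreme.mean())  # should be near 2/(n+1)
print("theoretical:", 2 / (n + 1))     # 2/11 = 0.1818 for n = 10
```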