Suppose you have a small number of samples, say between 2 and 10, and you’d like to estimate the standard deviation σ of the population these samples came from. Of course you could compute the sample standard deviation, but there is a simple and robust alternative.
Let W be the range of our samples, the difference between the largest and smallest value. Think “w” for “width.” Then
W / d_n

is an unbiased estimator of σ, where the constants d_n can be looked up in a table [1].
| n  | 1/d_n |
|----|-------|
| 2  | 0.886 |
| 3  | 0.591 |
| 4  | 0.486 |
| 5  | 0.430 |
| 6  | 0.395 |
| 7  | 0.370 |
| 8  | 0.351 |
| 9  | 0.337 |
| 10 | 0.325 |
The values d_n in the table were calculated from the expected value of W/σ for normal random variables, but the method may be used on data that do not come from a normal distribution.
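To make the lookup concrete, here's a minimal sketch of the estimator in Python. The dictionary INV_D and the function range_sd_estimate are hypothetical names introduced here for illustration, not part of the original post.

# Hypothetical helper: estimate sigma from the sample range using the 1/d_n table above.
INV_D = {2: 0.886, 3: 0.591, 4: 0.486, 5: 0.430,
         6: 0.395, 7: 0.370, 8: 0.351, 9: 0.337, 10: 0.325}

def range_sd_estimate(x):
    # Estimate sigma as W/d_n: the sample range times the tabulated value of 1/d_n.
    n = len(x)
    w = max(x) - min(x)
    return w * INV_D[n]

print(range_sd_estimate([4.9, 5.3, 5.1, 4.7, 5.6]))  # five made-up samples, so uses 1/d_5 = 0.430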
Let’s try this out with a little Python code. First we’ll take samples from a standard normal distribution, so the population standard deviation is 1. We’ll draw ten samples at a time, repeat the experiment five times, and each time estimate the standard deviation two ways: by the method above and by the sample standard deviation.
from scipy.stats import norm, gamma

# Draw 10 samples from a standard normal distribution, five times.
# Print the range estimate W/d_10 = 0.325 W alongside the sample standard deviation,
# in the same order as the columns of the table below.
for _ in range(5):
    x = norm.rvs(size=10)
    w = x.max() - x.min()
    print(w*0.325, x.std(ddof=1))
Here’s the output:
| w/d_n | std   |
|-------|-------|
| 1.174 | 1.434 |
| 1.205 | 1.480 |
| 1.173 | 0.987 |
| 1.154 | 1.277 |
| 0.921 | 1.083 |
Just from this example it seems the range method does about as well as the sample standard deviation.
For a non-normal example, let’s repeat our exercise using a gamma distribution with shape 4, which has standard deviation 2.
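The gamma version of the code isn't shown, but presumably it's the same loop with the sampling line changed. Here's a sketch under that assumption, reusing the gamma import from the code above; gamma.rvs(4, size=10) draws from a gamma distribution with shape 4 and default scale 1, whose standard deviation is sqrt(4) = 2.

# Same experiment, sampling from a gamma distribution with shape 4 (standard deviation 2).
for _ in range(5):
    x = gamma.rvs(4, size=10)
    w = x.max() - x.min()
    print(w*0.325, x.std(ddof=1))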
| w/d_n | std   |
|-------|-------|
| 2.009 | 1.827 |
| 1.474 | 1.416 |
| 1.898 | 2.032 |
| 2.346 | 2.252 |
| 2.566 | 2.213 |
Once again, it seems both methods do about equally well. In both examples the uncertainty due to the small sample size is more important than the difference between the two methods.
Update: To calculate d_n for other values of n, see this post.
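For reference, d_n is the expected value of the range W of n standard normal samples, which equals the integral of 1 − Φ(x)^n − (1 − Φ(x))^n over the real line, where Φ is the standard normal CDF. Here is a small numerical sketch (not necessarily the approach taken in the linked post) that reproduces the 1/d_n column above:

from scipy.stats import norm
from scipy.integrate import quad

def d(n):
    # Expected range of n standard normal samples:
    # E[W] = integral of 1 - Phi(x)**n - (1 - Phi(x))**n over the real line.
    integrand = lambda x: 1 - norm.cdf(x)**n - (1 - norm.cdf(x))**n
    return quad(integrand, -10, 10)[0]

for n in range(2, 11):
    print(n, round(1/d(n), 3))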
[1] H. A. David. Order Statistics. John Wiley and Sons, 1970.
Hi, John. You should also look at
https://archives.collections.ed.ac.uk/repositories/2/archival_objects/59045 for a different view of the same underlying idea. We published an article in 2015 using Newman’s method to identify outliers in dose-response experiments. And now we are working on an application to find differential expression in paired omics data without needing replicates.
Now it would be useful to have a simple rule to remember the 1/d_n table. d_n = sqrt(n) or sqrt(n-0.5) might suffice.
@Andreas: It would be nice if these numbers were memorable.
I imagine in some contexts people always take samples of the same size; then they’d only have one number to remember.
Even using d_n = 3, as crude as it is, might be useful for back-of-the-envelope estimates.
d_n=3*logn^0.75 (where the log is base 10) seems to perform quite well even for larger n
Andreas’ approximation of d_n ~ sqrt(n) seems to work pretty well for n > 2, but Ashley’s (empirical?) fit seems at least an order of magnitude worse. Is there a typo in the formula?
Booklet 3 accompanying Stuart Hunter’s “Statistics for Problem Solving and Decision Making” (Westinghouse, 1971) uses sigma_hat = Range / d_2 ≈ Range / sqrt(n) on p. 10. His Table 2 lists d_2 for each value of n from 2 to 10, matching what you list as d_n.
If I’m doing approximations, I figure that the formula using sqrt(n) is by definition good enough, because I’ll not remember d_2 reliably.
How would the unbiased estimator change if we also knew the mean of the sample alongside the min and max?