Range trick for larger sample sizes

The previous post showed that the standard deviation of a sample of size n can be well estimated by multiplying the sample range by a constant dn that depends on n.

The method works well for relatively small n. This should sound strange: statistical methods typically work better for large samples, not small ones. And indeed the range method does become more accurate for larger samples. However, we're not so much interested in the accuracy of the method per se as in its accuracy relative to the standard way of estimating standard deviation. For small samples, neither method is very accurate, and the two appear to work about equally well.

If we want to use the range method for larger n, a couple of questions arise: how well does the method work, and how do we calculate dn?

Simple extension

Ashley Kanter left a comment on the previous post saying

dn = 3 log n^0.75 (where the log is base 10) seems to perform quite well even for larger n.

Ashley doesn’t say where this came from. Maybe it’s an empirical fit.

The constant dn is the expected value of the range of n samples from a standard normal random variable. You could find a (complicated) expression for this and then find a simpler expression using an asymptotic approximation. Maybe I’ll try that later, but for now I need to wrap up this post and move on to client work.

Note that

3 log10 n^0.75 = 0.977 loge n

and so we could use

dn = log n

where log is natural log. This seems like something that might fall out of an asymptotic approximation. Maybe Ashley empirically discovered the first term of a series approximation.
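One way to check approximations like these is to compute dn directly: the expected range of n samples is the integral of 1 − Φ(x)^n − (1 − Φ(x))^n over the real line, where Φ is the standard normal CDF. Here's a sketch in Python; the side-by-side comparison with the approximations is my addition, not part of the original discussion.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def d(n):
    # Expected range of n standard normal samples:
    # d_n = integral over the real line of 1 - Phi(x)^n - (1 - Phi(x))^n
    integrand = lambda x: 1 - norm.cdf(x)**n - norm.cdf(-x)**n
    return quad(integrand, -np.inf, np.inf)[0]

for n in (5, 10, 20, 50):
    print(n, d(n), np.log(n), 3*np.log10(n)**0.75)
```

For n = 20 this gives dn ≈ 3.73, while log 20 ≈ 3.00, so the natural log consistently underestimates dn.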

Update: See this post for a more detailed exploration of how well log n, square root of n, and another method approximate dn.

Update 2: Ashley Kanter’s approximation was supposed to be 3 (log10 n)^0.75 rather than 3 log10 (n^0.75), and it is a very good approximation. This is also addressed in the link in the first update.

Simulation

Here’s some Python code to try things out.

    import numpy as np
    from scipy.stats import norm

    np.random.seed(20220309)

    n = 20
    for _ in range(5):
        x = norm.rvs(size=n)                # sample of size n from a standard normal
        w = x.max() - x.min()               # sample range
        print(x.std(ddof=1), w/np.log(n))   # standard estimate vs. range estimate

And here are the results.

    |   std | w/d_n |
    |-------+-------|
    | 0.930 | 1.340 |
    | 0.919 | 1.104 |
    | 0.999 | 1.270 |
    | 0.735 | 0.963 |
    | 0.956 | 1.175 |

It seems the range method has more variance, though notice in the fourth row that the standard estimate can also occasionally wander fairly far from the theoretical value of 1.

We get similar results when we increase n to 50.

    |   std | w/d_n |
    |-------+-------|
    | 0.926 | 1.077 |
    | 0.889 | 1.001 |
    | 0.982 | 1.276 |
    | 1.038 | 1.340 |
    | 1.025 | 1.209 |
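To quantify the difference in variance, one could repeat the experiment many times and compare the spread of the two estimators. A quick sketch along those lines (not part of the original experiment; the scaling by log n matches the code above):

```python
import numpy as np
from scipy.stats import norm

np.random.seed(20220309)
n, reps = 20, 10_000
x = norm.rvs(size=(reps, n))
s = x.std(ddof=1, axis=1)                        # standard estimates
w = (x.max(axis=1) - x.min(axis=1)) / np.log(n)  # range estimates
print(s.std(), w.std())                          # spread of each estimator
```

In runs like this the range estimates show a noticeably larger spread, consistent with the tables above.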

Not-so-simple extension

There are ways to improve the range method, if by “improve” you mean make it more accurate. One way is to divide the sample into random partitions, apply the method to each partition, and average the results. If you’re going to do this, partitions of size 8 are optimal [1]. However, the main benefit of the range method [2] is its simplicity.
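Here is a sketch of that partitioning scheme. The group size of 8 comes from [1]; the constant d_8 ≈ 2.847 (the expected range of 8 standard normal samples) and the choice to drop leftover points are my assumptions, not from the paper.

```python
import numpy as np

D8 = 2.847  # expected range of 8 standard normal samples (assumed constant)

def range_sd(x, group_size=8, seed=None):
    # Randomly partition the sample into groups, average the group ranges,
    # and rescale by the expected range for a standard normal sample.
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(x))
    m = (len(x) // group_size) * group_size   # drop leftover points
    groups = x[:m].reshape(-1, group_size)
    w = (groups.max(axis=1) - groups.min(axis=1)).mean()
    return w / D8
```

Of course at this point the method is no longer something you can do in your head, which undercuts its main appeal.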


[1] F. E. Grubbs and C. L. Weaver (1947). The best unbiased estimate of a population standard deviation based on group ranges. Journal of the American Statistical Association, 42, pp. 224–241.

[2] The main advantage now is its simplicity. When the method was discovered, it reduced manual calculation, and so it could have been worthwhile to make the method a little more complicated as long as the calculation effort was still less than that of the standard method.

3 thoughts on “Range trick for larger sample sizes”

  1. Robert Matthews

    The conversion of Ashley’s (empirical?) formula to natural logs is neat – but I’m not sure the original formula is a good approximation (sqrt(n) seems far better).

    Anyhow, thanks for linking this to your 2009 post on Bancroft’s rule for estimating the slope of a line fitting data points; I hadn’t come across this before. After reading your latest post, it occurred to me that there’s a link between the two, via the theory of regression lines.

    The correlation coefficient R is related to the slope of that line via R = (Sx/Sy)*b, where Sx and Sy are the standard deviation estimates for x and y, and b is the slope of the regression line. As the current post and the one on Bancroft’s rule of thumb show, all three can be estimated from the range of values of x and y (the “width”). When the rules are plugged into the formula for the correlation coefficient, we find that it’s possible to derive one from the other simply on the assumption that R is approximately unity, implying x and y are pretty well correlated.

    Thanks for a thought-provoking post!

  2. Apologies for the confusion – I intended this to be understood as 3 (log n)^0.75 using base 10, which is empirical but approximates d_n quite well between 2 and 10 (and some way beyond), and is easily memorised.

  3. Ashley: 3 (log_10 n)^0.75 is a very good approximation, even for moderately large n.

    I just wrote a new post on approximations to d_n, and I need to add this to the list. Your approximation is much better than log n or square root n.
