Sometimes code is easier to understand than abstract math. A few days ago I was having a hard time conveying bias, consistency, and efficiency in a statistics class. Writing some pseudo-code on the board seemed to help clear things up. Loops and calls to random number generation routines are more tangible than discussions about random samples.
Later I turned the pseudo-code into Python code (after all, Python is supposed to be “executable pseudo-code”) and fancied it up a bit. The following page gives some explanation, some plots of the output, and the source code.
The difference between an unbiased estimator and a consistent estimator
An unbiased estimator, very roughly speaking, is a statistic that gives the correct result on average. For a precise definition, see Wikipedia. Unbiasedness is an intuitively desirable property. In fact, it seems indispensable at first.
In the colloquial sense, “bias” is practically synonymous with self-serving dishonesty. Who wants a self-serving, dishonest statistical estimate? But it’s important to remember that “bias” in statistical sense has a technical meaning that may not correspond to the colloquial meaning.
Here’s the big problem with statistical bias: if U is an unbiased estimator of θ, f(U) is NOT an unbiased estimator of f(θ) in general. For example, standard deviation is the square root of variance, but the square root of an unbiased estimator for variance is not an unbiased estimator for standard deviation. This shows bias has nothing to do with accuracy, since the square root of an accurate estimation of variance is an accurate estimate of standard deviation. In fact, unbiased estimators can be terrible.
The fact that unbiasedness is not preserved under transformations calls into question its usefulness. People seldom care directly about abstract statistical parameters directly. Instead they care about some calculation based on those parameters. An unbiased estimate of the parameters does not generally lead to an unbiased estimate of what people really want to estimate.
An estimator in statistics is a way of guessing a parameter based on data. An estimator is unbiased if over the long run, your guesses converge to the thing you’re estimating. Sounds eminently reasonable. But it might not be.
Suppose you’re estimating something like the number of car accidents per week in Texas and you counted 308 the first week. What would you estimate is the probability of seeing no accidents the next week?
If you use a Poisson model for the number of car accidents, a very common assumption for such data, there is a unique unbiased estimator. And this estimator would estimate the probability of no accidents during a week as 1. Worse, had you counted 307 accidents, the estimated probability would be -1! The estimator alternates between two ridiculous values, but in the long run these values average out to the true value. Exact in the limit, useless on the way there. A slightly biased estimator would be much more practical.
See Michael Hardy’s article for more details: An_Illuminating_Counterexample.pdf
For daily tips on data science, follow @DataSciFact on Twitter.