Matt Brigg’s comment on outliers in his post Tyranny of the mean:
Coontz used the word “outliers”. There are no such things. There can be mismeasured data, i.e. incorrect data, say when you tried to measure air temperature but your thermometer fell into boiling water. Or there can be errors in recording the data; transposition and such forth. But excluding mistakes, and the numbers you meant to measure are the numbers you meant to measure, there are no outliers. There are only measurements which do not accord with your theory about the thing of interest.
I have a slight quibble with this description of outliers. Some people use the term to mean legitimate extreme values, and some use the term to mean values that “didn’t really happen” in some sense. I assume Matt is criticizing the latter. For example, Michael Jordan’s athletic ability is an outlier in the former sense. He’s only an outlier in the latter sense if someone decides he “doesn’t count” in some context.
A few weeks ago I said this about outliers:
When you reject a data point as an outlier, you’re saying that the point is unlikely to occur again, despite the fact that you’ve already seen it. This puts you in the curious position of believing that some values you have not seen are more likely than one of the values you have in fact seen.
Sometimes you have to exclude a data point because you believe it is far more likely to be a mistake than an accurate measurement. You may also decide that an extreme value is legitimate, but that you wish to exclude from your model. The latter should be done with fear and trembling, or at least with an explicit disclaimer.
Related post: The cult of average
For daily tips on data science, follow @DataSciFact on Twitter.