When you reject a data point as an outlier, you’re saying that the point is unlikely to occur again, despite the fact that you’ve already seen it. This puts you in the curious position of believing that some values you have not seen are more likely than one of the values you have in fact seen.
Maybe you believe that you did not actually see the outlier. If you’re looking at a set of human heights, and one of the values is 61 feet, it is more plausible that you’ve seen a transcription error than that you’ve encountered a person an order of magnitude taller than average.
However, if you believe that a data point is real, but unlikely to reoccur, you are placing more weight on subjective belief than on data, which may or may not be appropriate.
Here’s a personal example. This weekend I bought a kettlebell. As I was waiting in line to check out, I struck up a conversation with the man in line behind me. His right leg was in a cast and resting on a scooter. He told me that he broke his foot in two places by dropping a kettlebell on it! My immediate thought was that this was a fluke, an outlier. My second thought was that according to the only data I have, kettlebells are quite dangerous.
Perhaps the rational decision would have been to leave the store immediately, but I bought the kettlebell anyway. Still, the fellow behind me made an impression. I will think of him every time I work out with the kettlebell and be more careful than I would have been otherwise. Kettlebells are probably more dangerous than I’d like to believe, but so is a sedentary life.
There are certain set-ups in machine learning where the classifier is allowed to say “I don’t know” and incur a smaller penalty. Usually the number of examples on which it does so is required to be sub-linear in the total number of examples presented. This is an explicit acknowledgment that it may be worthwhile to trade off performance on outliers for better performance on the bulk of the data.
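One common way to set this up is Chow’s rule: with zero-one loss for mistakes and a smaller cost d for abstaining, it is optimal to abstain whenever the top class probability falls below 1 − d. A minimal sketch (the labels and probabilities are invented for illustration):

```python
# Sketch of a classifier with a reject option, assuming 0/1 loss for
# errors and a smaller cost d for answering "I don't know".
# Chow's rule: abstain whenever the top class probability is below 1 - d.

def classify_with_reject(probs, reject_cost=0.2):
    """probs: dict mapping class label -> estimated probability."""
    label = max(probs, key=probs.get)
    if probs[label] < 1 - reject_cost:
        return None  # abstain: pay reject_cost rather than risk a full error
    return label

print(classify_with_reject({"cat": 0.9, "dog": 0.1}))    # cat  (0.9 >= 0.8)
print(classify_with_reject({"cat": 0.55, "dog": 0.45}))  # None (0.55 < 0.8)
```

The sub-linearity requirement mentioned above would then be enforced separately, by tightening the threshold as more examples arrive.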
Speaking of which, this is what companies have to do when they design a product specifically to target “the mass market”. Usually this means competing on price, which matters to nearly everyone.
Tomas: The value of such a strategy depends on the consequence of the event being predicted. The return on a sale is fixed, so forgoing a sale to a small market has a small price. But a small probability of breaking my foot grabs my attention!
What is the probability that he noticed that you were buying a kettlebell and was just having some fun messing with you?
I have two rules when I coach classes with kettlebells in them:
1) for your safety: don’t fight for space with the kettlebell, it will always win.
2) for the safety of your classmates: don’t drop the kettlebell or you go home immediately.
And to bring this back to the nerdy side since that’s why I read this blog: kettlebells have gotten popular lately, but they’re usually taught wrong. People use them like glorified dumbbells. If you’re going to do static movements with them, save money and buy dumbbells. Kettlebells get their benefit because the center of mass is not in the handle, so you have an extra degree of freedom: the location of your hand in space relative to the weight. You can really only take advantage of that with dynamic exercises.
Good luck in becoming a girevik!
This is a nice point about how removing outliers can lead to results that are as misleading as including them blindly. However, a major objective of outlier resistant techniques is also to identify the outliers so that you can develop an informed opinion of them. I realize that the post never criticizes the value of outlier identification, but it seems appropriate to point out that concerning ourselves with outliers need not lead inevitably to ignoring them unjustly, and does not inexorably cast the data analyst in the foul role of a ‘data-peeper’.
It sounds to me like you just have a strong prior, not that you are totally excluding this gentleman as an outlier.
Yet another case where a statistical technique is better described by the degree to which it is applied, rather than whether it is applied completely or not at all.
A consistent frequentist would run out of the store. The probability of seeing someone behind you in line with a broken foot, under the null hypothesis of safety, is very small, certainly less than the magical 0.05.
Wouldn’t it be neat if we could collect enough good data to determine if this post resulted in a net overall effect (either positive or negative) on kettlebell sales?
A few months ago, we had a friend helping us prepare dinner on the grill. As she was moving the grill basket containing peppers and onions from the grill to the serving bowl, the retractable handle retracted, dumping our food on the ground. We wrote it off as a fluke unlikely to recur. Last night, while (ironically) recollecting her mishap, I had the same thing happen to me. A new grill basket is now on our shopping list.
What I find interesting in the anecdote is you stating you would think of the guy’s foot, presumably being more careful. This by itself should reduce the chances of injury, because you respect the kettlebell more now.
I find this aspect of analysis fascinating: how behavior is modified by seeing data points. You can use a number of techniques to estimate the rate of injury NOW, but what techniques would you use to incorporate the new knowledge about injury that you have gained (and now we have gained) and its effect on the rate of injury using kettlebells tomorrow?
I too read the article due to the word Kettlebell in the title.
But my story concerns my response to an outlier of a different nature.
I was cooking hamburgers on a gas grill. My 7 year old nephew came up to watch and seemed very interested. Once the meat was sizzling, I used a spatula to press down on each patty to push the grease out. To my horror, one patty erupted and squirted some very hot grease in my nephew’s direction. Thank God it missed him, but in my mind’s eye, it looked like it hit his face. Can you imagine if it had?
Now I’ve cooked burgers for years and this has never happened before, or since.
But rather than rule out the data point, I adapted to it.
Do I ban him from the grill? Of course not. But now he stands on a short, 2-step ladder, where he can see but is still several feet away from danger.
So the point I’m making is that we should welcome outliers and utilize the knowledge they give us, especially in safety-related situations.
Thank you for the stimulating article!
“When you reject a data point as an outlier, you’re saying that the point is unlikely to occur again, despite the fact that you’ve already seen it.”
I had almost the exact opposite thought. The acid test of a model is that it should be sufficient to generate a stochastic replicate of your data. If you fit your model ignoring the outlier, it is very unlikely that you could use your model to replicate your data. The only reliable procedure is to identify the outlier.
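As a toy illustration of that acid test (all numbers invented): fit a normal model to the data minus the outlier, then simulate replicate datasets and see how often one reproduces anything as extreme as the value you dropped.

```python
# Minimal sketch of the stochastic-replicate check: a normal model fit
# with the outlier excluded essentially never regenerates the outlier.
import random, statistics

random.seed(1)
data = [5.4, 5.8, 6.1, 5.6, 5.9, 61.0]   # heights in feet; 61.0 is the outlier
clean = data[:-1]                         # model fit that ignores the outlier

mu, sigma = statistics.mean(clean), statistics.stdev(clean)
hits = sum(
    max(random.gauss(mu, sigma) for _ in range(len(data))) >= 61.0
    for _ in range(10_000)
)
print(hits)  # 0 -- no replicate dataset ever contains a value near 61
```

The trimmed model fails the test on the original data, which is exactly the signal that the outlier needs to be identified and handled explicitly rather than silently dropped.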
The proper approach to outliers, as to everything else in life, is Bayesian. Assign a prior probability of a false reading, and then, given an estimate of population parameters, calculate a posterior probability.
In fact, to estimate population parameters, first estimate them using all the data, then calculate the posterior probability for each reading that it is genuine, then use these as weights to re-estimate population parameters, and iterate to convergence.
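The iteration described above can be sketched in a few lines. Here the genuine readings are assumed normal and a false reading is modeled as uniform over a wide range; the prior probability of a false reading (0.05) and all the data values are illustrative assumptions:

```python
# Sketch of the iterative reweighting scheme described above, assuming
# genuine readings ~ Normal(mu, sigma) and false readings uniform over
# a range of width `spread`. The 0.05 prior on a false reading is made up.
import math

def robust_fit(data, p_false=0.05, spread=100.0, iters=20):
    mu = sum(data) / len(data)                      # start from all the data
    sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5
    for _ in range(iters):
        # Posterior probability that each reading is genuine (Bayes' rule).
        weights = []
        for x in data:
            norm = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
            genuine = (1 - p_false) * norm
            false = p_false / spread                # flat density for bad readings
            weights.append(genuine / (genuine + false))
        # Re-estimate population parameters using the weights, then repeat.
        total = sum(weights)
        mu = sum(w * x for w, x in zip(weights, data)) / total
        sigma = max((sum(w * (x - mu) ** 2 for w, x in zip(weights, data)) / total) ** 0.5, 1e-6)
    return mu, sigma, weights

mu, sigma, w = robust_fit([5.4, 5.8, 6.1, 5.6, 5.9, 61.0])
print(round(mu, 1))  # close to 5.8, the mean of the genuine readings
print(w[-1] < 0.01)  # True: the 61.0 reading is judged almost surely false
```

Note the outlier is never deleted; it simply ends up with a posterior weight near zero, while its identity is preserved for inspection.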