Matt Briggs’ comment on outliers in his post Tyranny of the mean:

Coontz used the word “outliers”.

There are no such things. There can be mismeasured data, i.e. incorrect data, say when you tried to measure air temperature but your thermometer fell into boiling water. Or there can be errors in recording the data; transposition and such forth. But excluding mistakes, and the numbers you meant to measure are the numbers you meant to measure, there are no outliers. There are only measurements which do not accord with your theory about the thing of interest.

Emphasis added.

I have a slight quibble with this description of outliers. Some people use the term to mean legitimate extreme values, and some use the term to mean values that “didn’t really happen” in some sense. I assume Matt is criticizing the latter. For example, Michael Jordan’s athletic ability is an outlier in the former sense. He’s only an outlier in the latter sense if someone decides he “doesn’t count” in some context.

A few weeks ago I said this about outliers:

When you reject a data point as an outlier, you’re saying that the point is unlikely to occur again, despite the fact that you’ve already seen it. This puts you in the curious position of believing that some values you have not seen are more likely than one of the values you have in fact seen.

Sometimes you have to exclude a data point because you believe it is far more likely to be a mistake than an accurate measurement. You may also decide that an extreme value is legitimate, but one that you wish to exclude from your model. The latter should be done with fear and trembling, or at least with an explicit disclaimer.

**Related post**: The cult of average

In my experience, outliers are usually caused by a simplification in your model. The point is “wrong”, and “lies out” of what your model accepts. But in reality your model is only considering part of the universe. When you discard an outlier, you are just saying “this was probably caused by a more complicated phenomenon that my model does not take into account”. Dropping the outlier is acceptable only to the extent that the simplification in your model is acceptable. If there are more interactions between what you are able to model and the things you are trying to ignore, then you will have more outliers, along with biases, non-linearities, etc.

Now look at what I said: “probably caused by…”. The problem is that you are not modeling that thing outside of your model, so this is a very informal and incipient probabilistic model of the larger, more complete model that might in fact be able to model that outlier. So we have an intuition of what the larger model might look like; we are just not modeling it. But there are situations where you can actually take the outliers into consideration to some degree, where initially you might just discard them… When you use techniques such as the EM algorithm and robust statistics, it is possible to have your model and also have a more rigorous model for the outliers. But then they stop being outliers: they are special data points that your model says come from somewhere else… They are not simply ignored, but classified according to some criterion that has a theoretical basis.
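As a rough sketch of what "a more rigorous model for the outliers" could look like, here is a small EM fit of a two-component mixture: a normal component for the ordinary points plus a wide normal component standing in for "somewhere else". Everything specific here is an assumption made for illustration: the function names, the fixed width `outlier_sigma` of the wide component, and the simulated data.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_with_outlier_component(data, outlier_sigma, n_iter=100):
    """EM for: weight * N(mu, sigma) + (1 - weight) * N(center, outlier_sigma).

    The wide component is held fixed (assumption); only mu, sigma and
    weight are estimated. Returns them plus, for each point, the
    probability that it belongs to the ordinary component.
    """
    center = statistics.median(data)
    mad = statistics.median([abs(x - center) for x in data])
    mu, sigma, weight = center, max(1.4826 * mad, 1e-9), 0.9
    resp = [1.0] * len(data)
    for _ in range(n_iter):
        # E-step: how strongly does each point belong to the ordinary component?
        resp = []
        for x in data:
            main = weight * normal_pdf(x, mu, sigma)
            other = (1.0 - weight) * normal_pdf(x, center, outlier_sigma)
            resp.append(main / (main + other))
        # M-step: re-estimate the ordinary component from the weighted points.
        total = sum(resp)
        mu = sum(r * x for r, x in zip(resp, data)) / total
        var = sum(r * (x - mu) ** 2 for r, x in zip(resp, data)) / total
        sigma = max(math.sqrt(var), 1e-9)
        weight = total / len(data)
    return mu, sigma, weight, resp


def make_demo_data(seed=0):
    """Hypothetical data: 200 ordinary points near 10, then 10 strange ones."""
    rng = random.Random(seed)
    ordinary = [rng.gauss(10.0, 1.0) for _ in range(200)]
    strange = [rng.uniform(30.0, 60.0) for _ in range(10)]
    return ordinary + strange
```

Calling `fit_with_outlier_component(make_demo_data(), outlier_sigma=50.0)` returns a membership probability per point, so the strange points are not deleted; they are assigned to the other component.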

Hope all that makes some sense… Or else it is just a literary outlier! :)

nic: I like how you framed outlier detection as “a very informal and incipient probabilistic model of the larger, more complete model.”

The problem comes when your “larger, more complete model” isn’t that much larger than your formal model, because then you have circular reasoning. For example:

This is in fact an acceptance-rejection method for generating normally distributed random values!
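To make the acceptance-rejection remark concrete, here is a minimal sketch (my own illustration, not the commenter's example). The raw values come from a heavy-tailed standard Cauchy distribution, which is an assumption chosen for the demo, and each value is kept with probability proportional to how plausible it looks under a standard normal model. The function names are hypothetical.

```python
import math
import random
import statistics


def accept_probability(x):
    """Standard normal density divided by standard Cauchy density,
    rescaled so that its maximum (at x = +/-1) is exactly 1."""
    return 0.5 * (1.0 + x * x) * math.exp(0.5 - 0.5 * x * x)


def keep_what_looks_normal(n_draws, seed=0):
    """Draw Cauchy values and keep each one according to the normal model."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        x = math.tan(math.pi * (rng.random() - 0.5))  # standard Cauchy draw
        if rng.random() < accept_probability(x):
            kept.append(x)
    return kept


if __name__ == "__main__":
    kept = keep_what_looks_normal(20000)
    print(len(kept), statistics.mean(kept), statistics.stdev(kept))
```

The surviving values are standard normal by construction, even though nothing in the raw data was normal. That is the circle: the model decided which points to keep, so the kept points cannot be used to confirm the model.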

The way out of the circle is like you said, using an incipient larger model. For example, “My model assumes human heights are normally distributed. And I’m throwing out the measurement that purports to be a person 6 meters tall, not because of my model, but because I know that people don’t grow that tall. Maybe they meant 6 feet.”

Or my model assumes something about the usual process and variation for measuring air temperature, but I’m aware that the guy I hired to measure air temperatures has a propensity to occasionally drop thermometers in boiling water.

(My point: Nic’s notion of the “larger, more complete model” really unifies the two senses of ‘outlier’ that John talked about in the post: both errors and non-errors ‘that you wish to exclude from your [formal] model’ are examples of the same thing. And the boundary between them is fuzzy.)

I think in every problem we have a mixture distribution. There are things your model adequately explains, and things that are not adequately explained.

There are two main classes of inadequately explained items: data points caused by measurement and transcription error (duplication, transposed digits, measuring the wrong thing, broken instruments, etc.), and data points that are accurately measured but caused by real effects that are not included in your model.

In most cases, measurement/transcription error is a nuisance. There are a huge number of ways it can happen, and we have relatively little information about the types that are likely in any given problem, so we spend our time trying to make our data clean and to detect plausible cases of measurement and transcription error. It’s entirely reasonable to exclude a small number of data points if the likelihood of the data under some simple ad-hoc error model is pretty high (for example transposed digits, if most of your measurements are about 10 ± 5 and you have a few that are 51, 41, 31, etc.).
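The transposed-digit case can be checked mechanically. The sketch below is one hypothetical ad-hoc error model, assuming integer measurements and a plausible range of 5 to 15: any value outside the range whose digits, when swapped, land inside the range is flagged as a likely transposition. The function names and the sample data are made up for illustration.

```python
def swap_digits(value):
    """Return a two-digit integer with its digits reversed, else None."""
    if isinstance(value, int) and 10 <= value <= 99:
        return (value % 10) * 10 + value // 10
    return None


def flag_transpositions(data, low=5, high=15):
    """Map each implausible value to its digit-swapped reading, but only
    when that reading falls inside the plausible range [low, high]."""
    suspects = {}
    for value in data:
        if low <= value <= high:
            continue  # already plausible, nothing to explain
        swapped = swap_digits(value)
        if swapped is not None and low <= swapped <= high:
            suspects[value] = swapped
    return suspects


if __name__ == "__main__":
    readings = [9, 12, 51, 14, 41, 7, 31, 11, 47]
    print(flag_transpositions(readings))
```

Note that 47 is left unflagged: swapping its digits gives 74, which the error model cannot explain, so it remains a point that needs some other account.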

The case that is concerning is when you get outliers caused by an unexplained phenomenon which is relevant to the real process but not included in the model. The presence of such things can really screw up the estimation of your model, even though your model might be a good model for the sub-portion of the process that you do model. Should you throw those observations out in order to get a good estimate of the portion of the problem you have modeled? The best way is probably to make your error model robust to such things, for example by using a t distribution instead of normal errors, or by using a finite mixture model for errors and classifying the data points according to which mixture component they most likely came from. There are lots of opportunities to incorporate flexibility into the probabilistic error model to deal with things you haven’t modeled.
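Here is a minimal sketch of the t-distribution suggestion, assuming a simple location-and-scale model with the degrees of freedom fixed at 4 (an arbitrary choice for illustration). It uses the standard EM updates for Student-t errors, which amount to a weighted mean in which points far from the center get small weights. The function names and simulated data are hypothetical.

```python
import math
import random
import statistics


def fit_t_location(data, df=4.0, n_iter=50):
    """Estimate location and scale under Student-t errors with fixed df."""
    mu = statistics.median(data)
    sigma = max(statistics.pstdev(data), 1e-12)
    for _ in range(n_iter):
        # Points far from mu (in units of sigma) receive small weights.
        weights = [(df + 1.0) / (df + ((x - mu) / sigma) ** 2) for x in data]
        mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        var = sum(w * (x - mu) ** 2 for w, x in zip(weights, data)) / len(data)
        sigma = max(math.sqrt(var), 1e-12)
    return mu, sigma


def make_contaminated_sample(seed=0):
    """Hypothetical data: 95 points near 10 and 5 unexplained points at 40."""
    rng = random.Random(seed)
    return [rng.gauss(10.0, 1.0) for _ in range(95)] + [40.0] * 5


if __name__ == "__main__":
    sample = make_contaminated_sample()
    print("plain mean:", statistics.mean(sample))
    print("t-based fit:", fit_t_location(sample))
```

With normal errors the five points at 40 drag the plain mean well above 10. Under the t model they are still in the data, but they carry little weight, so nothing has to be thrown out by hand.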

In fact, all “random errors” are just stuff we didn’t model if we’re talking about classical non-quantum approximations of how the universe works. It’s not surprising if there are a small number of ways in which large deviations might consistently occur.

Hi, my first post here, but been enjoying this blog for quite a while.

The larger incipient model idea is great.

One motivation for leaving things “outside” that I was surprised not to see is budget/class imbalance. If I have a sample budget of 50 to estimate the heights in a school with 500 six-year-olds and 10 teachers, there isn’t really a choice between a mixture of Gaussians and using a single Gaussian and tossing out the 1 or 2 “outliers”. I am very unlikely to get a good estimate of teacher heights no matter what I do, so the task is to get the kids right, and throwing out the outliers just makes sense.
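The school example is easy to simulate. In the sketch below the height distributions (kids around 115 cm, teachers around 170 cm) and the median-based cutoff rule are assumptions made up for illustration; only the counts of 500 children, 10 teachers, and a budget of 50 come from the example above.

```python
import random
import statistics


def sample_school(budget=50, seed=0):
    """Draw `budget` people without replacement from a hypothetical school."""
    rng = random.Random(seed)
    kids = [("kid", rng.gauss(115.0, 5.0)) for _ in range(500)]
    teachers = [("teacher", rng.gauss(170.0, 8.0)) for _ in range(10)]
    return rng.sample(kids + teachers, budget)


def estimate_without_outliers(heights, cutoff=4.0):
    """Mean height after dropping points far from the median.

    Distance is measured in units of a robust scale (1.4826 * MAD).
    Returns the estimate and the number of points dropped.
    """
    center = statistics.median(heights)
    mad = statistics.median([abs(h - center) for h in heights])
    scale = max(1.4826 * mad, 1e-12)
    kept = [h for h in heights if abs(h - center) <= cutoff * scale]
    return statistics.mean(kept), len(heights) - len(kept)


if __name__ == "__main__":
    people = sample_school()
    heights = [h for _, h in people]
    n_teachers = sum(1 for label, _ in people if label == "teacher")
    print("teachers in sample:", n_teachers)
    print("estimate, dropped:", estimate_without_outliers(heights))
```

On average only about one teacher ends up in a sample of 50, which is too few to estimate a second mixture component, while the remaining points estimate the children's mean height well.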

This isn’t a matter of modeler “laziness”, oversimplified model etc. The budget just isn’t there to get to everything, so we make a choice.

BTW, the classes can be imbalanced in the distribution source, or they can be imbalanced in estimation complexity. Given 100 points, half of which come from a tight cluster and half of which come from a 1000-dimensional Gaussian in Hilbert space… you really want to model the cluster and “outlie” the rest. Now if you had 10,000 points, that might change things, so it’s a matter of budget.

David Chudzicki reveals a subtlety here that is often controversial in my field. I study human responses to stimuli. Frequently the humans respond but the response is not to the stimuli. They may have missed a brief one, or may have dozed off and just come awake, or any number of things may have happened. I tend to argue those aren’t responses to stimuli and therefore not part of my model. They’re often very difficult to differentiate from actual responses to the stimulus.

Here’s a classic example where we get data that makes no sense to model and makes perfect sense to exclude. If I show someone a picture of a hammer and ask them to identify it, they are very accurate if they begin that identification about 450ms after the picture comes on. At some point before that, and with a very sharp cutoff, they will be absolutely miserable at it. Early responses just aren’t responses to the picture but to something else… perhaps their anticipation of the picture. In addition, they stay accurate for about a second. But, and this is true even if the picture is left on the whole time, after about 1.3s they become much worse at the task, falling off to maybe 70% correct from 99.9%. One has to wonder if those responses are to the hammer or to something else. It seems unlikely that after a second or so the hammer morphs into some other object. What seems much more likely is that the subject wasn’t paying attention and the response is to something else, perhaps another internal state.

Don’t I have outliers in this case? It’s like the thermometer is occasionally dropped in boiling water or a cold drink and I have to figure out afterwards which. This is slightly different because I’m not using the distributions of time (or temperature) to argue they wander too far from the mean. I’ve got additional information that tells me they’re of a different kind. However, I don’t always have that information. And when I don’t, am I not then justified in using the distribution of time only?