Four types of errors

by John on April 21, 2008

There are two kinds of errors discussed in classical statistics, unimaginatively named Type I and Type II. Aside from having completely non-mnemonic names, they represent odd concepts.

Technically, a Type I error consists of rejecting the “null hypothesis” (roughly speaking, the assumption of no effect, the hypothesis you typically set out to disprove) in favor of the “alternative hypothesis” when in fact the null hypothesis is true. A Type II error consists of accepting the null hypothesis (technically, failing to reject the null hypothesis) when in fact the null hypothesis is false.
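To make the two definitions concrete, here is a minimal simulation sketch. It is not from the original post: the pooled two-proportion z-test, the sample size of 200 per arm, and the helper names `two_proportion_p_value` and `rejection_rate` are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value for H0: p1 == p2, using the pooled z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    if se == 0:
        return 1.0  # no variation observed, so no evidence against H0
    z = (p1 - p2) / se
    return 2 * NormalDist().cdf(-abs(z))


def rejection_rate(theta_j, theta_k, n=200, trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated trials in which H0 is rejected at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Number of successes in each arm (hypothetical trial with n patients per arm)
        xj = sum(rng.random() < theta_j for _ in range(n))
        xk = sum(rng.random() < theta_k for _ in range(n))
        if two_proportion_p_value(xj, n, xk, n) < alpha:
            rejections += 1
    return rejections / trials


# Null hypothesis true: every rejection is a Type I error.
type_1_rate = rejection_rate(0.5, 0.5)
# Null hypothesis false: every failure to reject is a Type II error.
type_2_rate = 1 - rejection_rate(0.5, 0.6)

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

The Type I rate should land near the chosen alpha, while the Type II rate depends on the true difference and the sample size, neither of which you know in practice.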

Suppose you’re comparing two treatments with probabilities of efficacy θj and θk. The typical null hypothesis is that θj = θk, i.e. that the treatments are identically effective. The corresponding alternative hypothesis is that θj ≠ θk, that the treatments are not identical. Andrew Gelman says in this presentation (page 87):

I’ve never made a Type 1 error in my life. Type 1 error is θj = θk, but I claim they’re different. I’ve never studied anything where θj = θk.

I’ve never made a Type 2 error in my life. Type 2 error is θj ≠ θk, but I claim they’re the same. I’ve never claimed that θj = θk.

But I make errors all the time!

(Emphasis added.) The point is that no two treatments are ever identical. People don’t usually mean to test whether two treatments are exactly equal but rather whether they’re “nearly” equal, though they are often fuzzy about what “nearly” means.

Instead of Type I and Type II errors, Gelman proposes we concentrate on “Type S” errors (sign errors, concluding θj > θk when in fact θj < θk) and “Type M” errors (magnitude errors, concluding that an effect is larger than it truly is).
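A rough way to see both kinds of error is to simulate many studies of a small true effect and look only at the ones that reach statistical significance. The sketch below is an illustration, not something from Gelman's slides: it assumes the estimate of θj − θk is normally distributed with a known standard error, and the function name `type_s_and_m` and the numbers 0.02 and 0.05 are made up for the example.

```python
import random
from statistics import NormalDist


def type_s_and_m(true_diff, se, alpha=0.05, sims=100_000, seed=0):
    """Simulate estimates ~ Normal(true_diff, se) and examine the significant ones.

    Assumes true_diff > 0. Returns (power, type_s_rate, exaggeration_ratio).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(sims):
        est = rng.gauss(true_diff, se)
        if abs(est) > z_crit * se:  # two-sided test of "no difference"
            significant.append(est)
    power = len(significant) / sims
    # Type S: significant, but the estimated sign is wrong.
    type_s_rate = sum(est < 0 for est in significant) / len(significant)
    # Type M: how much significant estimates overstate the true size, on average.
    exaggeration = sum(abs(est) for est in significant) / len(significant) / true_diff
    return power, type_s_rate, exaggeration


# Hypothetical study: true difference 0.02, standard error 0.05.
power, type_s, type_m = type_s_and_m(true_diff=0.02, se=0.05)
print(f"power: {power:.3f}, Type S rate: {type_s:.3f}, exaggeration: {type_m:.1f}x")
```

With a true effect this small relative to the standard error, any estimate that clears the significance threshold must be several times larger than the truth in absolute value, and some of those estimates point in the wrong direction.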

{ 13 comments }

1

Mathew Woodyard 04.15.11 at 12:29

Great post, as always, but I noticed that your link to Gelman’s presentation is broken. It looks like it has moved here: http://www.stat.columbia.edu/~gelman/presentations/multiple_minitalk2.pdf

2

John 04.15.11 at 12:39

Thanks. I updated the link.

3

Roman Cheplyaka 04.15.11 at 14:44

Often Type 1/Type 2 errors (or “false positive”/“false negative”) make more sense. E.g. how would you formulate spam detection errors in terms of “Type M” and “Type S”?

4

John 04.15.11 at 15:13

A spam filter does not have a point null hypothesis. Type S error is relevant if you think of a spamminess scale with 0 being neutral and increasing values corresponding to more offensive spam.

5

Roman Cheplyaka 04.15.11 at 15:20

“A spam filter does not have a point null hypothesis.”
Exactly. I was under the (probably wrong) impression that you or Andrew Gelman were arguing for somehow replacing all Type 1/Type 2 errors with Type S/Type M errors.

6

John 04.15.11 at 15:28

Andrew does seem to assume all null hypotheses are point hypotheses. I suppose he does this because so often that is the case in practice, even though in theory a null hypothesis could be any arbitrary subset of the parameter space.

7

Andy 04.17.11 at 15:40

What spam filters do could be viewed in terms of, say, a logistic regression model predicting the probability that an email is spam as a function of a bunch of features, e.g., the presence of a particular word. Then you’ve got a slope for each feature, estimated from the emails seen so far (the sample). Null hypothesis: slope = 0, for each slope.

Type S error would be inferring that a feature is indicative of spam when it’s indicative of a safe email or vice versa. Type M error would be, e.g., inferring that the presence of a particular feature is more likely to indicate spam than it really is in the broader population of emails.

Back to the oldies: a type 1 error would be inferring, based on sample emails, that a feature is predictive of spam (i.e., the slope for the feature is statistically significantly different to zero) when it’s not in the population of emails. Type 2: you think based on sample that a feature is not predictive when in the population it is.

I guess the notion of population is quite tricky for emails, though! Spam detection is a really nice example.

Actually, has anyone looked at what features tend to predict whether an email is spam, or is it all kept secret and/or hidden in the CPTs of a Bayesian classifier somewhere?

8

Roman Cheplyaka 04.17.11 at 21:54

Andy: for example, here are the features that SpamAssassin tests for: http://wiki.apache.org/spamassassin/RulesList
(They are separate from the Bayes classifier and are more about meta-information than the contents.)

9

Andy 04.18.11 at 02:38

Roman: thanks!

10

jean-louis 10.25.11 at 13:01

Hi John,
“The point is that no two treatments are ever identical.”
While I would agree with this, I believe that it still does not contradict the hypothesis θj = θk, in which the θs do not stand for the “treatments” (which, for the experiments to be useful, have to be different!), but rather, usually, the effects of the treatments.

I am not an expert on the topic, nor an expert on statistics, so you might need to correct me if I’m wrong. How I understand the null hypothesis is: “are the effects of treatment A statistically different from those of treatment B?” and usually, you don’t really get a clear answer, but rather a measure telling you how likely it is that they are different (and hopefully medical publications are taking great care to keep this in mind).

As for Type S or M errors, I am not sure how that would work. However, it made me think of the difference between one-tailed and two-tailed hypothesis tests. As I understand it, Type S or M errors would still be Type I or II errors, but for a different kind of hypothesis (inequality or equality, demonstration is left as homework ;-) ).

Or am I completely off-topic, even more than all the above spam about spam?! (No offense, I just thought the connection was funny…)

11

John 10.25.11 at 13:17

jean-louis: In statistics, “significantly different” is related to the strength of evidence of a difference, not the size of the difference. The null hypothesis is typically that two treatments have exactly the same effect. If there is statistically significant evidence that the null is false, that doesn’t mean that there is a large difference in the effects, only a large amount of evidence that there is a non-zero difference. You could have strong evidence of a small effect.
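A quick numerical sketch of that last point, with made-up numbers: the same half-percentage-point difference in success rates is nowhere near significant with 1,000 patients per arm but overwhelmingly significant with a million per arm. The pooled z-test and the helper name `z_test_p_value` are assumptions for illustration only.

```python
from statistics import NormalDist


def z_test_p_value(x1, n1, x2, n2):
    """Two-sided p-value for H0: equal success probabilities (pooled z-test)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (x1 / n1 - x2 / n2) / se
    return 2 * NormalDist().cdf(-abs(z))


# Observed rates are 50.5% vs 50.0% in both cases; only the sample size changes.
p_small = z_test_p_value(505, 1_000, 500, 1_000)
p_large = z_test_p_value(505_000, 1_000_000, 500_000, 1_000_000)

print(f"n = 1,000 per arm:     p = {p_small:.3f}")
print(f"n = 1,000,000 per arm: p = {p_large:.2e}")
```

The estimated effect is identical in both runs. What changes is how much evidence there is that it differs from zero.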

12

jean-louis 10.25.11 at 13:36

Hi John,
““significantly different” is related to the strength of evidence of a difference, not the size of the difference”
I agree, but I did not say that, did I? I guess you can measure the size of the difference with a hypothesis like θj = θk + epsilon. With an inequality hypothesis like θj > θk you do not solve this problem, either.

The type M error would do, but I would need to investigate to know more what it is all about :-)

13

Jerzy 12.09.11 at 11:00

I really like the idea of Type S error but am having trouble thinking through how to apply it in practice.
Say you expect θj and θk to be pretty close, but certainly not identical. So you’d like to be able to say one of these three things: “θj is bigger than θk,” “θk is bigger than θj,” or “We don’t have enough evidence to decide which is bigger.”
Then, let’s say you do what many people do in practice. Perform a standard test of H0: θj = θk at significance level 0.05. Then either you fail to reject H0, saying “I don’t have enough evidence to decide which one is the bigger one”… or you do reject H0, then say “They are statistically significantly different and θj has the higher point estimate so I am confident that θj is bigger than θk” (or vice versa, depending).
Can we say anything precise about our probability of Type S error under this procedure?
