Four types of statistical errors

There are two kinds of errors discussed in classical statistics, unimaginatively named Type I and Type II. Aside from having completely non-mnemonic names, they represent odd concepts.

Technically, a Type I error consists of rejecting the “null hypothesis” (roughly speaking, the assumption of no effect, the hypothesis you typically set out to disprove) in favor of the “alternative hypothesis” when in fact the null hypothesis is true. A Type II error consists of accepting the null hypothesis (technically, failing to reject the null hypothesis) when in fact the null hypothesis is false.
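
To make the two definitions concrete, here is a minimal simulation sketch in Python. It is an illustration, not something from the original discussion: the pooled two-proportion z-test, the arm size of 200, the efficacy values 0.30 and 0.40, and all function names are assumptions chosen for the example. A simulation lets us force the null hypothesis to be exactly true, which is what makes it possible to count Type I errors.

```python
import math
import random
from statistics import NormalDist


def two_proportion_rejects(x_j, x_k, n, alpha=0.05):
    """Two-sided pooled z-test of H0: theta_j == theta_k, equal arm size n.

    Hypothetical helper for this sketch; returns True when H0 is rejected.
    """
    pooled = (x_j + x_k) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        # All successes or all failures in both arms: no evidence of a difference.
        return False
    z = (x_j / n - x_k / n) / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def rejection_rate(theta_j, theta_k, n=200, trials=2000, seed=0):
    """Fraction of simulated two-arm trials in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Number of successes in each arm, drawn as n Bernoulli outcomes.
        x_j = sum(rng.random() < theta_j for _ in range(n))
        x_k = sum(rng.random() < theta_k for _ in range(n))
        rejections += two_proportion_rejects(x_j, x_k, n)
    return rejections / trials


# Null hypothesis true (assumed values): every rejection is a Type I error.
type_1_rate = rejection_rate(0.30, 0.30)
# Null hypothesis false (assumed values): every failure to reject is a Type II error.
type_2_rate = 1 - rejection_rate(0.30, 0.40)

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

With a nominal level of 0.05, the simulated Type I rate should land near 0.05, while the Type II rate depends entirely on the assumed gap between the two efficacy values and on the sample size.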

Suppose you’re comparing two treatments with probabilities of efficacy θj and θk. The typical null hypothesis is that θj = θk, i.e. that the treatments are identically effective. The corresponding alternative hypothesis is that θj ≠ θk, that the treatments are not identical. Andrew Gelman says in this presentation (page 87):

I’ve never made a Type 1 error in my life. Type 1 error is θj = θk, but I claim they’re different. I’ve never studied anything where θj = θk.

I’ve never made a Type 2 error in my life. Type 2 error is θj ≠ θk, but I claim they’re the same. I’ve never claimed that θj = θk.

But I make errors all the time!

(Emphasis added.) The point is that no two treatments are ever identical. People don’t usually mean to test whether two treatments are exactly equal but rather that they’re “nearly” equal, though they are often fuzzy about what “nearly” means.

Instead of Type I and Type II errors, Gelman proposes we concentrate on “Type S” errors (sign errors, concluding θj > θk when in fact θj < θk) and “Type M” errors (magnitude errors, concluding that an effect is larger than it truly is).
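
As a rough illustration of how these two error types could be quantified, the following Python sketch assumes that the estimated difference θj − θk is normally distributed around the true difference with a known standard error, and that a difference is claimed only when a two-sided test at level 0.05 rejects, with the sign read off the point estimate. The function name, the example numbers, and the "exaggeration ratio" used to summarize Type M error are assumptions made for this sketch, not definitions taken from the presentation.

```python
import random
from statistics import NormalDist


def type_s_and_m(true_diff, std_err, alpha=0.05, trials=100_000, seed=0):
    """Return (power, Type S rate, exaggeration ratio) for a nonzero true_diff.

    Assumes estimate ~ Normal(true_diff, std_err) and that a difference is
    claimed only when |estimate| exceeds the two-sided critical value.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = true_diff / std_err

    # Probability of a significant result with a positive / negative estimate.
    p_sig_pos = 1 - std_normal.cdf(z_crit - shift)
    p_sig_neg = std_normal.cdf(-z_crit - shift)
    power = p_sig_pos + p_sig_neg

    # Type S rate: among significant results, the share with the wrong sign.
    p_wrong_sign = p_sig_neg if true_diff > 0 else p_sig_pos
    type_s = p_wrong_sign / power

    # Type M summary: average |estimate| among significant results, relative
    # to the true magnitude, estimated by simulation.
    rng = random.Random(seed)
    significant = []
    for _ in range(trials):
        estimate = rng.gauss(true_diff, std_err)
        if abs(estimate) > z_crit * std_err:
            significant.append(abs(estimate))
    if not significant:
        return power, type_s, float("nan")
    exaggeration = (sum(significant) / len(significant)) / abs(true_diff)
    return power, type_s, exaggeration


# Assumed example: a small true difference measured with a lot of noise.
power, type_s, exaggeration = type_s_and_m(true_diff=0.02, std_err=0.05)
print(f"power: {power:.3f}")
print(f"Type S rate given significance: {type_s:.3f}")
print(f"exaggeration ratio given significance: {exaggeration:.2f}")
```

Under these assumptions, a small true difference measured noisily rarely reaches significance, and when it does, the estimate has a real chance of pointing the wrong way and necessarily overstates the magnitude, since only estimates far from zero clear the threshold.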

15 thoughts on “Four types of statistical errors”

  1. Often Type 1/Type 2 errors (or “false positive”/“false negative”) make more sense. E.g. how would you formulate spam detection errors in terms of “Type M” and “Type S”?

  2. A spam filter does not have a point null hypothesis. Type-S error is relevant if you think of a spaminess scale with 0 being neutral and increasing values corresponding to more offensive spam.

  3. “A spam filter does not have a point null hypothesis.”
    Exactly. I was under a (probably wrong) impression that you or Andrew Gelman argue for somehow replacing all type 1/type 2 errors with type s/type m errors.

  4. Andrew does seem to assume all null hypotheses are point hypotheses. I suppose he does this because so often that is the case in practice, even though in theory a null hypothesis could be any arbitrary subset of the parameter space.

  5. What spam filters do could be viewed in terms of a, say, logistic regression model predicting probability an email is spam as a function of a bunch of features, e.g., presence of a particular word. Then, you’ve got a slope, estimated using seen emails (sample) for each feature. Null hypothesis: slope = 0, for each slope.

    Type S error would be inferring that a feature is indicative of spam when it’s indicative of a safe email or vice versa. Type M error would be, e.g., inferring that the presence of a particular feature is more likely to indicate spam than it really is in the broader population of emails.

    Back to the oldies: a type 1 error would be inferring, based on sample emails, that a feature is predictive of spam (i.e., the slope for the feature is statistically significantly different to zero) when it’s not in the population of emails. Type 2: you think based on sample that a feature is not predictive when in the population it is.

    I guess the notion of population is quite tricky for emails, though! Spam detection is a really nice example.

    Actually, has anyone looked at what features tend to predict whether an email is spam, or is it all kept secret and/or hidden in the CPTs of a Bayesian classifier somewhere?

  6. Hi John,
    “The point is that no two treatments are ever identical.”
    While I would agree with this, I believe that it still does not contradict the hypothesis θj = θk, in which the θs do not stand for the “treatments” (which, for the experiments to be useful, have to be different!), but rather, usually, for the effects of the treatments.

    I am not an expert on the topic, nor an expert on statistics, so you might need to correct me if I’m wrong. How I understand the null hypothesis is: “are the effects of treatment A statistically different from those of treatment B?” and usually, you don’t really get a clear answer, but rather a measure telling you how likely it is that they are different (and hopefully medical publications are taking great care to keep this in mind).

    As for type S or M errors, I am not sure how that would work. However, it made me think of the difference between one-tailed and two-tailed hypothesis tests. As I understand it, type S or M errors would still be type I or II errors, but for different kinds of hypotheses (inequality or equality; the demonstration is left as homework ;-)).

    Or am I completely off-topic? Even more than all the above spam about spam?!! (No offense, I just thought about the funny connection…)

  7. jean-louis: In statistics, “significantly different” is related to the strength of evidence of a difference, not the size of the difference. The null hypothesis is typically that two treatments have exactly the same effect. If there is statistically significant evidence that the null is false, that doesn’t mean that there is a large difference in the effects, only a large amount of evidence that there is a non-zero difference. You could have strong evidence of a small effect.

  8. Hi John,
    “significantly different” is related to the strength of evidence of a difference, not the size of the difference
    I agree, but I did not say that, did I? I guess you can measure the size of the difference with a hypothesis like θj = θk+epsilon. With an inequality hypothesis like θj > θk you do not solve this problem, either.

    The type M error would do, but I would need to investigate to learn more about it :-)

  9. I really like the idea of Type S error but am having trouble thinking through how to apply it in practice.
    Say you expect θj and θk to be pretty close, but certainly not identical. So you’d like to be able to say one of these three things: “θj is bigger than θk,” “θk is bigger than θj,” or “We don’t have enough evidence to decide which is bigger.”
    Then, let’s say you do what many people do in practice. Perform a standard test of H0: θj=θk at significance level 0.05. Then either you fail to reject H0, saying “I don’t have enough evidence to decide which one is the bigger one”… or you do reject H0, then say “They are statistically significantly different and θj has the higher point estimate so I am confident that θj is bigger than θk” (or vice versa, depending).
    Can we say anything precise about our probability of Type S error under this procedure?

  10. I think students (of all experience levels and ages) of the problem need to emphasize to themselves two things: First, posing these tests is not meaningful unless one comes to them with a domain-derived effect size in mind. Second, evaluating a Type I or Type II error only makes sense with respect to having specific models of the alternative hypotheses.

    On the first point, without such an effect size in hand, the experiment or test cannot really be properly designed to begin with. Sure, you may need to do run-up studies and experiments to find out what might be an appropriate or suitable effect size, and ideally this is done with some kind of controls. Alternatively, perhaps a posterior gets calculated from a large body of data, and one concludes that the effect size is its median. The point is, without such a size in mind, the temptation for data dredging seems difficult to resist. More on that in a moment.

    On the second point, sure, we can pretend there’s a Gaussian at the domain-derived mean in question with (hopefully) a domain-derived variance, but suppose there isn’t one? Suppose these are empirical densities. Then to do the required calculations, we need to do some kind of integration. I don’t think we say this to ourselves enough. If we do say it enough, I think some of the barriers between a “frequentist interpretation” and a “Bayesian interpretation” start to dissolve, if only in procedure.

    Dredging or not can be tested by posing the suggested result as a bet on a possible future, and seeing how it turns out in the sequel. That’s not always available in all contexts.

    I noticed the Gelman article also talked about multiple comparisons, de-emphasizing them. It may be that the kinds of problems Professor Gelman deals with don’t need these. I have not looked at his case studies from the paper in detail. But there are plenty I can think of where the multiple comparisons need to be controlled. If online retailer XYZ is repeatedly trying to classify customers as belonging to one of M categories, using some kind of discriminant for each, some portion of the time, those will misfire. The Bonferroni framework is not helpful. Hence things like False Discovery Rate control.
