Four types of statistical errors

There are two kinds of errors discussed in classical statistics, unimaginatively named Type I and Type II. Aside from having completely non-mnemonic names, they represent odd concepts.

Technically, a Type I error consists of rejecting the “null hypothesis” (roughly speaking, the assumption of no effect, the hypothesis you typically set out to disprove) in favor of the “alternative hypothesis” when in fact the null hypothesis is true. A Type II error consists of accepting the null hypothesis (technically, failing to reject the null hypothesis) when in fact the null hypothesis is false.

Suppose you’re comparing two treatments with probabilities of efficacy θj and θk. The typical null hypothesis is that θj = θk, i.e. that the treatments are identically effective. The corresponding alternative hypothesis is that θj ≠ θk, that the treatments are not identical. Andrew Gelman says in this presentation (page 87):

I’ve never made a Type 1 error in my life. Type 1 error is θj = θk, but I claim they’re different. I’ve never studied anything where θj = θk.

I’ve never made a Type 2 error in my life. Type 2 error is θj ≠ θk, but I claim they’re the same. I’ve never claimed that θj = θk.

But I make errors all the time!

(Emphasis added.) The point is that no two treatments are ever identical. People don’t usually mean to test whether two treatments are exactly equal, but rather whether they’re “nearly” equal, though they are often fuzzy about what “nearly” means.

Instead of Type I and Type II errors, Gelman proposes we concentrate on “Type S” errors (sign errors, concluding θj > θk when in fact θj < θk) and “Type M” errors (magnitude errors, concluding that an effect is larger than it truly is).
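To make the distinction concrete (with purely illustrative numbers): suppose the true difference θj − θk is 0.02. Concluding that θj < θk would be a Type S error; concluding that the difference is 0.2, ten times its true size, would be a Type M error.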

Automated software builds

My first assignment as a professional programmer was to build another person’s program. I learned right away not to assume a project will build just because the author says it will. I’ve seen the same pattern repeated everywhere I’ve worked. Despite version control systems and procedures, there’s usually some detail that lives only in the developer’s head and never gets codified, so only the original developer can build the project easily.

The first step in making software builds reproducible is documentation. There’s got to be a document explaining how to extract the project from version control and build it. Requiring screen shots helps since developers have to rehearse their own instructions in order to produce the shots.

The second step is verification. Documentation needs to be tested, just like software. Someone who hasn’t worked on the project needs to extract the code onto a clean machine and build the project using only written instructions — no conversation with the developer allowed. Everyone thinks their code is easy to build; experience says most people are wrong.

The verifiers need to rotate. If one person serves as build master very long, they develop the same implicit knowledge that the original programmers failed to codify.

The third step is automation. Automated instructions are explicit and testable. If automation also saves time, so much the better, but automation is worthwhile even if it does not save time. Clift Norris and I just wrote an article on CodeProject entitled Automated Extract and Build from Team System using PowerShell that helps with this third step if you’re using Visual Studio and VSTS.

Database-first or object model-first software development

The most recent episode of the .NET Rocks podcast interviewed Jeremy Miller and David Laribee. They made the observation that Microsoft’s development tools are implicitly designed to support database-first development rather than object model-first development. Because they are committed to agile and object model-first development, they’re feeling resistance from some of Microsoft’s products.

Miller and Laribee are part of an interesting group of developers who call themselves “ALT.NET” because they’re developing on Microsoft’s .NET framework but are looking for alternative tools. They embrace the bottom of the Microsoft technology stack (operating systems, languages, compilers) but not the top (tools for modeling, code generation, testing, etc.).

Overflow and loss of precision

Suppose you need to evaluate the function f(x) = log(1 + e^x). The most obvious code would be something like this in C:

    double f(double x) { return log(1 + exp(x)); }

This is so simple you don’t even test it. Then someone calls you to say they’re getting strange output, such as 1.#INF (the exact text depends on the environment). What’s going on?

For large values of x, e^x is much bigger than 1, so f(x) is approximately log(e^x), which equals x. So, for example, f(1000) should return something very close to 1000. But instead, you get gibberish. In most environments a floating point number can be as large as about 10^308, but exp(1000) is about 2×10^434, and so the result overflows. Then the code takes the log of “infinity.”

But f(1000) shouldn’t overflow; the result is about 1000. Our function result can be contained in a floating point number, but our intermediate result exp(1000) cannot be. We need some way of telling the computer “Hold on. I know this is going to be a huge number, but trust me for a minute. After I add 1 I’m going to take a log and bring the result back down to a medium-sized number.” Unfortunately computers don’t work that way.

One solution would be to see whether x is so large that exp(x) will overflow, and in that case f(x) could just return x. So our second attempt might look like this:

    double f(double x) { return (x > X_MAX) ? x : log(1 + exp(x)); }

I’ll come back later to what X_MAX should be. Suppose you’ve found a good value for this constant and release your code again. Then you get another call. They say your code is returning zero when it should not.

Someone is trying to compute f(-50). They know the answer is small, but it shouldn’t be zero. How can that be? Since you just learned about overflow, you suspect the problem might have to do with the opposite problem, underflow. But that’s not quite it. The result of exp(-50) does not underflow; it’s about 2×10^-22. But it is so small that machine precision cannot distinguish 1 + exp(-50) from 1. So the code returns log(1), which equals zero. This may or may not be a problem. The correct answer is very small, so the absolute error is small but the relative error is 100%. Whether that is a problem depends on what you’re going to do with the result. If you’re going to add it to a moderate size number, no big deal. If you’re going to divide by it, it’s a very big deal.
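Here is a minimal demonstration of that rounding, a sketch assuming IEEE double precision:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        printf("%g\n", exp(-50.0));            /* about 2e-22 */
        printf("%g\n", log(1.0 + exp(-50.0))); /* prints 0, because 1 + exp(-50) rounds to 1 */
        return 0;
    }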

Now what do you do? You think back to your calculus class and remember that for small x, log(1 + x) is approximately x, with error on the order of x^2. (To see this, expand log(1 + x) in a power series around x = 0.) If x is so small that 1 + x equals 1 inside the computer, x^2 must be very small indeed. So f(x) could return exp(x) if exp(x) is sufficiently small. This gives our third attempt.

    double f(double x)
    {
        if (x > X_MAX) return x;
        if (x < X_MIN) return exp(x);
        return log(1.0 + exp(x));
    }

This is a big improvement. It can still underflow for large negative x, but in that case the function itself is underflowing, not just an intermediate result.

Now what do we make X_MAX and X_MIN? For X_MAX we don’t really have to worry about exp(x) overflowing. We can stop when x is so large that exp(x) + 1 equals exp(x) to machine precision. In C there is a header file float.h that contains a constant DBL_EPSILON which is the smallest number ε such that 1 + ε does not equal 1 in machine precision. So it turns out we can set X_MAX equal to -log(DBL_EPSILON). Similarly we could set X_MIN equal to log(DBL_EPSILON).
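Putting those cutoffs together, a minimal sketch of the final function might look like this (assuming <float.h> and <math.h>; the constant names match the ones above):

    #include <float.h>  /* DBL_EPSILON */
    #include <math.h>   /* exp, log */

    double f(double x)
    {
        const double X_MAX = -log(DBL_EPSILON); /* above this, log(1 + exp(x)) equals x to machine precision */
        const double X_MIN =  log(DBL_EPSILON); /* below this, log(1 + exp(x)) equals exp(x) to machine precision */

        if (x > X_MAX) return x;
        if (x < X_MIN) return exp(x);
        return log(1.0 + exp(x));
    }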

There’s still a small problem. When x is large and negative, but not so negative that 1 + exp(x) equals 1 in the computer, we could lose precision in computing log(1 + exp(x)).

I have just posted an article on CodeProject entitled Avoiding Overflow, Underflow, and Loss of Precision that has more examples where the most direct method for evaluating a function may not work. Example 2 in that paper explains why directly computing log(1 + y) can be a problem even if y is not so small that 1 + y equals 1 to machine precision. The article explains how to compute log(1 + y) in that case. Setting y = exp(x) explains how to finish the example here when x is moderately large and negative.
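For the record, one standard way to compute log(1 + y) accurately for small y is the correction trick sketched below; the function name is mine, not the article’s, and C99’s log1p() in <math.h> does the same job when it’s available. Setting y = exp(x) covers the moderately large negative x case discussed above.

    #include <math.h>

    /* Sketch of a standard trick for computing log(1 + y) accurately when y is small.
       The factor y / (w - 1) compensates for the rounding error made in forming 1 + y. */
    double log_one_plus(double y)
    {
        double w = 1.0 + y;
        if (w == 1.0) return y;        /* 1 + y rounded to 1, and log(1 + y) is approximately y */
        return log(w) * y / (w - 1.0);
    }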

Galen and clinical trials

Here’s a quote from the Greek physician Galen (c. 130-210 A.D.):

All who drink of this remedy recover in a short time, except those whom it does not help, who all die. Therefore, it is obvious that it fails only in incurable cases.

Imagine a dialog between Galen and a modern statistician.

Stat: You say your new treatment is better than the previous one?

Galen: Yes.

Stat: But more people died on the new treatment.

Galen: Those patients don’t count because they were incurable. They would have died anyway.

The problem with Galen’s line of reasoning is that it is not falsifiable: no experiment could disprove it. He could call any treatment superior by claiming that evidence against it doesn’t count. Still, Galen might have been right.

Now suppose our statistician has a long talk with Galen and tells him about modern statistical technique.

Galen: Can’t you look back at my notes and see whether there was something different about the patients who didn’t respond to the new treatment? There’s got to be some explanation. Maybe my new treatment isn’t better for everyone, but there must be a group for whom it’s better.

Stat: Well, that’s tricky business. Advocates call that “subset analysis.” Critics call it “data dredging.” The problem is that the more clever you are with generating after-the-fact explanations, the more likely you’ll come up with one that seems true but isn’t.

Galen: I’ll have to think about that one. What do you propose we do?

Stat: We’ll have to do a randomized experiment. When each patient arrives, we’ll flip a coin to decide whether to give them the old or the new treatment. That way we expect about the same number of incurable patients to receive each treatment.

Galen: But the new treatment is better. Why should I give half my patients the worse treatment?

Stat: We don’t really know that the new treatment is better. Maybe it’s not. A randomized experiment will give us more confidence one way or another.

Galen: But couldn’t we be unlucky and assign more incurable patients to the better treatment?

Stat: Yes, that’s possible. But it’s not likely we will assign too many more incurable patients to either treatment. That’s just a chance we’ll have to take.

The issues in these imaginary dialogs come up all the time. There are people who believe their treatment is superior despite evidence to the contrary. But sometimes they’re right. New treatments are often tested on patients with poor prognosis, so complaints about receiving more incurable patients are justified. And yet until there’s some evidence that a new treatment may be at least as good as standard, it’s unethical to give that treatment to patients with better prognosis. Sometimes post-hoc analysis finds a smoking gun, and sometimes it’s data dredging. Sometimes randomized trials fail to balance on important patient characteristics. There are no simple answers. Context is critical, and dilemmas remain despite our best efforts. That’s what makes biostatistics interesting.

Related: Adaptive clinical trial design

If you’re going to do XHTML, you’d better do it right

XHTML is essentially a stricter form of HTML, but not quite. For the most part, you can satisfy the requirements of both standards at the same time. However, when it comes to closing tags, the two standards are incompatible. For example, the line break tag in HTML is <br> but in XHTML it is <br/>. Most browsers will tolerate the extra slash before the closing bracket in HTML, especially if you put a space before it. But it’s not strictly correct.

So is this just a pedantic point of markup language grammar? Chris Maunder says an error with closing tags caused Google to stop indexing his website. He had XHTML-style end tags but had set his DOCTYPE to HTML.

I’ve also heard of browsers refusing to render a page at all because it had DOCTYPE set to XHTML but contained an HTML entity not supported in XHTML. I believe the person reporting this said that he had run the XHTML page through a validator that failed to find the error. Unfortunately I’ve forgotten where I saw this. Does anyone know about this?

Houston Deco

This weekend I stumbled across the book Houston Deco at the library. The book is filled with photos of Art Deco and Art Moderne architecture in Houston and the surrounding area. I had no idea how much Art Deco architecture there was in Houston until I read the book. Some of the photos were of buildings I’ve seen or even been inside without paying much attention to the architecture. More photos are available at the Houston Deco website.

Random number generator controversy

I submitted an article to Code Project yesterday, Simple Random Number Generation, describing a small C# class called SimpleRNG that uses George Marsaglia’s MWC (multiply-with-carry) algorithm. The article was posted around 5 PM (central US time) and comments started pouring in right away. I didn’t expect any feedback on a Friday afternoon or Saturday morning. But as I write this post, there have been 580 page views and 11 comments.

There have been three basic questions raised in the comments.

  1. Why not just use the random number generator that comes with .NET?
  2. Is this code suitable for cryptography?
  3. Is this code suitable for Monte Carlo applications?

Why not use the built-in generator? For many applications, the simplest thing would be to use the .NET random number generator. But there are instances where this might not be best. There are questions about the statistical quality of the .NET generator; I’ll get to that in a minute. The primary advantages I see to the SimpleRNG class are transparency and portability.

By transparency I mean that the internal state of the generator is simple and easy to access. When you’re trying to reproduce a result, say while debugging, it’s convenient to have full access to the internal state of the random generator. If you’re using your own generator, you can see everything. You can even temporarily change it: for debugging, it may be convenient to temporarily have the “random” generator return a very regular, predictable sequence.

By portability I do not necessarily mean moving the code between operating systems. The primary application I have in mind is moving the algorithm between languages. For example, in my work we often have prototype code written in R that needs to be rewritten in C++ for efficiency. If the code involves random number generation, the output of the prototype and the rewrite cannot be directly compared, only compared on average. Then you have to judge whether the differences are to be expected or whether they indicate a bug. But if both the R and the C++ code use the same RNG algorithm and the same seed, the results may be directly comparable. (They still may not be directly comparable due to other factors, but at least this way the results are often comparable.)
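To give a sense of how small and portable the generator is, here is a C sketch of Marsaglia’s multiply-with-carry recurrence. The multipliers are Marsaglia’s, the default seeds are commonly used values, and this is a sketch of the same kind of recurrence SimpleRNG uses rather than a copy of its code.

    #include <stdint.h>

    /* Marsaglia's multiply-with-carry (MWC) step: the entire state is two 32-bit words. */
    static uint32_t m_w = 521288629;  /* replace these default seeds to get a different stream */
    static uint32_t m_z = 362436069;

    uint32_t get_uint(void)
    {
        m_z = 36969 * (m_z & 65535) + (m_z >> 16);
        m_w = 18000 * (m_w & 65535) + (m_w >> 16);
        return (m_z << 16) + m_w;
    }

Because the whole state is just m_w and m_z, it’s easy to implement the same recurrence in R, C#, or C++ and check that both versions produce the same stream from the same seeds.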

As for cryptography, no, SimpleRNG is not appropriate for cryptography.

As for Monte Carlo applications, not all Monte Carlo applications are created equal. Some applications do not require high quality random number generators. Or more accurately, different applications require different kinds of quality. Some random number generators break down when used for high-dimensional integration. I suspect SimpleRNG is appropriate for moderate dimensions. I use the Mersenne Twister generator for numerical integration. However, SimpleRNG is faster and much simpler; the MT generator has a very large internal state.

Someone commented on the CodeProject article that the random number generator in .NET is not appropriate for Monte Carlo simulation because it does not pass Marsaglia’s DIEHARD tests while SimpleRNG does. I don’t know what algorithm the .NET generator uses, so I can’t comment on its quality. Before I’d use it in statistical applications, I’d want to find out.

Text reviews for software

When users find spelling and grammar errors in your software, your credibility takes a hit. But apparently very few software projects review the text their software displays. I imagine the ones that do review their text use a combination of two leaky methods: asking execution testers to take note of prose errors, and requiring that all text displayed to users be stored in a string table.

There are a couple problems with asking execution testers to be copy editors. First, they’re not copy editors. They may not recognize a grammatical error when they see it. Second, they only see the text that their path through the software exposes. Messages displayed to the user under unusual circumstances slip through testing.

String tables are a good idea. They can be reviewed by a professional editor. (Or translator, if your application is internationalized.) But it’s difficult to make sure that every string the user might see is in the string table. When you need to add a few quick lines of error-handling code, it’s so easy to just include the text right there in the code rather than adding an entry to the string table. After all, you say to yourself, the code’s probably not going to run anyway.

My solution was to write a script that extracts all the quoted text from a source tree so it can be reviewed separately. The script tries to only pick out strings that a user could see, filtering out, for example, code quoted inside code. Doing this perfectly would be very hard, but by tolerating a small error rate, the problem can be solved quickly in a few lines of code. I’ve used this script for years. Nearly every time I run it I discover potentially embarrassing errors.
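The script itself is in PowerShell (linked below), but the idea is simple enough to sketch in C. This toy version, which is mine and not the article’s script, just prints every double-quoted literal it finds on standard input and, as described above, tolerates some noise by ignoring escape sequences and comments.

    #include <stdio.h>
    #include <string.h>

    /* Toy sketch: print each double-quoted string literal found on standard input. */
    int main(void)
    {
        char line[4096];
        while (fgets(line, sizeof line, stdin)) {
            char *p = line;
            while ((p = strchr(p, '"')) != NULL) {
                char *q = strchr(p + 1, '"');
                if (q == NULL) break;
                printf("%.*s\n", (int)(q - p - 1), p + 1);
                p = q + 1;
            }
        }
        return 0;
    }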

In addition to helping with copy editing, an extract of all the string literals in a project gives an interesting perspective on the source code. For example, it could help uncover security risks such as SQL injection vulnerabilities.

I’ve posted an article on CodeProject, PowerShell Script for Reviewing Text Shown to Users, along with the script I wrote.

The script on CodeProject is written for Microsoft’s PowerShell. If anyone would like a Perl version of the script, just let me know. I first wrote the script in Perl, but then moved it to PowerShell as my team was moving to PowerShell for all administrative scripting.