Pearson’s correlation coefficient r is used to measure the linear correlation of one set of data with another. It also provides an example of how you can get in trouble if you just take a formula from a statistics book and naively turn it into a program. I will take two algebraically equivalent equations for the correlation coefficient commonly found in textbooks and give an example where one leads to a correct result and the other leads to an absurd result.
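Start with the following definitions of the sample means and sample standard deviations.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

with $\bar{y}$ and $s_y$ defined in the same way for the second data set.

Let's take a look at two expressions for the correlation coefficient, both commonly given in textbooks.

$$r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{(n-1)\, s_x s_y}$$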
The two expressions for r are algebraically equivalent. However, they can give very different results when implemented in software.
To demonstrate the problem, I first generated two vectors each filled with 1000 standard normal random samples. Both expressions gave correlation 0.0626881. Next I shifted my original samples by adding 100,000,000 to each element. This does not change the correlation, and the program based on the first expression returned 0.0626881 exactly as before. However, the program based on the second expression returned -8.12857.
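The experiment can be sketched in a few lines of Python. This is a minimal reconstruction, not the original program: it assumes the first expression is the sum of products of standardized deviations, divided by n - 1, and the second is the shortcut form (sum of x_i*y_i minus n times the product of the means) divided by (n - 1) times the product of the standard deviations. The function names and the random seed are hypothetical, so the printed values will not match the figures quoted above.

```python
import math
import random


def mean(v):
    return math.fsum(v) / len(v)


def stdev(v):
    # Two-pass sample standard deviation (divides by n - 1).
    m = mean(v)
    return math.sqrt(math.fsum((a - m) ** 2 for a in v) / (len(v) - 1))


def corr_centered(x, y):
    # First expression: subtract the means before multiplying.
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    total = math.fsum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y))
    return total / (n - 1)


def corr_shortcut(x, y):
    # Second expression: difference of two potentially huge terms.
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    sum_xy = 0.0
    for a, b in zip(x, y):
        sum_xy += a * b
    return (sum_xy - n * mx * my) / ((n - 1) * sx * sy)


random.seed(0)  # arbitrary seed, chosen only for repeatability
x = [random.gauss(0.0, 1.0) for _ in range(1000)]
y = [random.gauss(0.0, 1.0) for _ in range(1000)]

shift = 1e8  # the 100,000,000 offset from the text
xs = [a + shift for a in x]
ys = [b + shift for b in y]

print("original:", corr_centered(x, y), corr_shortcut(x, y))
print("shifted: ", corr_centered(xs, ys), corr_shortcut(xs, ys))
```

On the original samples the two functions should agree to many digits. On the shifted samples only the centered version should stay close to its earlier value.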
Not only is a correlation of -8.12857 inaccurate, it’s nonsensical because correlation is always between -1 and 1.
What went wrong? The second expression for r computes a small number as the difference of two very large numbers. The two terms in the numerator are each around 10^20 and yet their difference is around 0.06. That means that if calculated to infinite precision, the two terms would agree to 21 significant figures. But a floating point number (technically a double in C) has only 15 or 16 significant figures. That means the subtraction cannot be carried out with any precision on a typical computer.
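The loss of precision is easy to see in isolation. The short Python sketch below uses 1e20 and 0.06 as stand-ins for the magnitudes mentioned above; they are illustrative values, not the actual terms from the experiment. Python's float is an IEEE double, the same type as a C double on typical platforms, and math.ulp requires Python 3.9 or later.

```python
import math
import sys

big = 1e20    # stand-in for one of the large terms in the numerator
small = 0.06  # stand-in for the true difference between the terms

# Number of decimal digits a double can represent reliably.
print(sys.float_info.dig)

# Spacing between 1e20 and the next representable double.
print(math.ulp(big))

# Adding 0.06 to 1e20 changes nothing, so the subtraction returns zero.
print((big + small) - big)
```

Because adjacent doubles near 1e20 are thousands of units apart, any information at the scale of 0.06 is gone before the subtraction takes place.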
Don’t draw the conclusion that the second expression is accurate unless it completely fails. The same phenomenon that caused a complete loss of accuracy in this example could cause a partial loss of accuracy in another example. The latter could be worse. For example, we might not have suspected a problem if the software had returned 0.10 when the correct value was 0.06.
The same problem comes up over and over again in statistics, such as when computing sample variance or simple regression coefficients. In each case, there are two commonly used formulas, and the formula easier to apply manually is potentially inaccurate. To make matters worse, books sometimes imply that the more accurate formula is only for theoretical use and that the less accurate formula is preferable for computation.
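Sample variance shows the same pattern on a tiny data set. The sketch below compares the shortcut formula (sum of squares minus n times the squared mean) with the two-pass formula that subtracts the mean first. The four data values and the function names are illustrative assumptions, chosen so that the exact variance is 30.

```python
import math


def var_shortcut(v):
    # Easier by hand: sum of squares minus n * mean^2, divided by n - 1.
    n = len(v)
    m = sum(v) / n
    sum_sq = 0.0
    for a in v:
        sum_sq += a * a
    return (sum_sq - n * m * m) / (n - 1)


def var_two_pass(v):
    # More accurate: compute the mean first, then sum squared deviations.
    n = len(v)
    m = math.fsum(v) / n
    return math.fsum((a - m) ** 2 for a in v) / (n - 1)


# Deviations from the mean are -6, -3, 3, 6, so the exact variance is 90/3 = 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print(var_two_pass(data))
print(var_shortcut(data))
```

The two-pass version works with small deviations, which doubles represent exactly here. The shortcut version subtracts two numbers near 4e18, where adjacent doubles are 512 apart, so it cannot produce 30.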
For a more detailed explanation of why the two expressions for correlation coefficient gave such different results when implemented in software, see Theoretical explanation of numerical results.