Negative correlation introduced by success

Suppose you measure people on two independent attributes, X and Y, and take those for whom X+Y is above some threshold. Then even though X and Y are uncorrelated in the full population, they will be negatively correlated in your sample.

This article gives the following example. Suppose beauty and acting ability were uncorrelated. Knowing how attractive someone is would give you no advantage in guessing their acting ability, and vice versa. Suppose further that successful actors have a combination of beauty and acting ability. Then among successful actors, the beautiful would tend to be poor actors, and the unattractive would tend to be good actors.

Here’s a little Python code to illustrate this. We take two independent attributes, distributed like IQs, i.e. normal with mean 100 and standard deviation 15. We then keep only the samples whose sum exceeds a threshold. As the threshold on the sum increases, the correlation between the two attributes in the selected group becomes more negative.

from numpy import arange
from scipy.stats import norm, pearsonr
import matplotlib.pyplot as plt

# Correlation.
# The function pearsonr returns correlation and a p-value.
def corr(x, y):
    return pearsonr(x, y)[0]

x = norm.rvs(100, 15, 10000)
y = norm.rvs(100, 15, 10000)
z = x + y

span = arange(80, 260, 10)
c = [ corr( x[z > low], y[z > low] ) for low in span ]

plt.plot( span, c )
plt.xlabel( "minimum sum" )
plt.ylabel( "correlation coefficient" )
plt.show()
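
The simulated curve can be checked against an exact formula. What follows is a rough sketch, assuming X and Y are independent normals with the same mean and standard deviation. Under that assumption S = X + Y and D = X - Y are independent, so selecting on S shrinks the variance of S and leaves the variance of D alone. The correlation in the selected group is then (v - 1)/(v + 1), where v is the variance of S after truncation divided by its variance before. The name theoretical_corr and its parameters are made up for this illustration.

```python
from math import pi, sqrt
from scipy.stats import norm

def theoretical_corr(threshold, mean=100.0, sd=15.0):
    """Correlation of X and Y given X + Y > threshold (X, Y iid normal)."""
    # Standardize the threshold against S = X + Y ~ N(2*mean, 2*sd**2).
    alpha = (threshold - 2 * mean) / (sd * sqrt(2))
    # Inverse Mills ratio of a standard normal truncated below at alpha.
    lam = norm.pdf(alpha) / norm.sf(alpha)
    # Var(S | S > threshold) / Var(S) for a lower-truncated normal.
    v = 1 + alpha * lam - lam**2
    # Var(D) is unchanged by selection, so corr = (v - 1) / (v + 1).
    return (v - 1) / (v + 1)

for t in (80, 150, 200, 250):
    print(t, round(theoretical_corr(t), 3))

# Closed form when the threshold equals the mean of the sum.
print(-1 / (pi - 1))
```

When the threshold equals the mean of the sum (200 here), v = 1 - 2/π and the correlation is -1/(π - 1), about -0.467. The formula is numerically fragile for thresholds far above the mean, where norm.sf(alpha) gets very small.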

9 thoughts on “Negative correlation introduced by success”

Daniel Lemire

10 September 2017 at 18:26

Beautiful.

Aaron Meurer

10 September 2017 at 18:43

This is a continuous version of Berkson’s paradox (https://en.m.wikipedia.org/wiki/Berkson%27s_paradox), which can be understood via the principle of restricted choice (see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2709837 for a good summary).

Andrew

10 September 2017 at 21:37

The graph appears to pass through (200, -0.5) — i.e. for the top 50% of sums, the two variables are 50% anticorrelated. Coincidence, or more-or-less inevitable?

Jim Simons

11 September 2017 at 06:15

Here is a way to make this intuitive. If you look at X and Y conditional on X+Y=T, then of course X and Y are negatively correlated (correlation coefficient = -1). If instead you look conditional on X+Y ≥ T for T>0, most of the distribution is quite near X+Y=T.

Of course it is also true that in T<0, you get positive correlation! So amongst aspiring actors who never make it, there is a positive correlation between beauty and acting ability. They tend to be not terribly high on either scale.

Jeremy L

11 September 2017 at 07:36

In causal inference we call this collider stratification bias (or some call it M bias). More generally, if you condition on a variable, Z, that is caused by two others, X and Y, you will induce a correlation between X and Y or change the correlation between X and Y if they are already correlated.

Steve

11 September 2017 at 12:47

Nassim Nicholas Taleb gives an example about choosing a surgeon that makes a similar point:

“the one who doesn’t look the part, conditional of having made a (sort of) successful career in his profession, had to have much to overcome in terms of perception. And if we are lucky enough to have people who do not look the part, it is thanks to the presence of some skin in the game, the contact with reality that filters out incompetence, as reality is blind to looks.”

https://medium.com/incerto/surgeons-should-notlook-like-surgeons-23b0e2cf6d52#.o6cxtzo1a

BobC

11 September 2017 at 15:47

You had me scared for a moment!

I routinely use threshold filtering on data from “novel physics” environmental sensors. Such sensors are sensitive to multiple environmental effects, so I use a combination of reference sensors and PCA to reveal just what a new device is actually sensing.

Since the reference sensors also have their own valid ranges, I routinely use a sum to ensure both values are usable.

The correlations can be weak even under ideal conditions, and I was concerned my threshold filtering could have made the situation worse.

Fortunately not: I’m way to the left of your diagram, since I use a sum threshold that’s about 5-10% of the full range.

mpledger

11 September 2017 at 20:46

If you look at the distribution of x (beauty) for those with Y>130 and Z>242 i.e. successful actors and high ability actors are 2 s.d.s above the mean; then (in R – I can’t write Python because it’s too close to R but not enough)

x <- rnorm(1000000, 100, 15)
y <- rnorm(1000000, 100, 15)
z <- x + y

# more than 1 sd below average
sum(x[z>242&y>130&x<85])/sum(x[z>242&y>130])*100
# me 0.02802497 %

# 1 sd below average to average
sum(x[z>242&y>130&x<100&x>85])/sum(x[z>242&y>130])*100
# me 2.964104 %

# average to 1 sd above average
sum(x[z>242&y>130&x>100&x<115])/sum(x[z>242&y>130])*100
# me 48.24954 %

# 1 sd above average to 2 sd above average
sum(x[z>242&y>130&x>115&x<130])/sum(x[z>242&y>130])*100
# me 40.98447 %

# above 2 sds
sum(x[z>242&y>130&x>130])/sum(x[z>242&y>130])*100
# me 7.773863 %

Then most of the highly successful actors/high ability actors are average to above average in looks, followed by above average to almost beautiful. Way more are beautiful than below average.

If unattractive is 2 sd below the mean then no one in the highly successful actors/high ability actors group was unattractive in *my* (very large) sample.

Gregor Gorjanc

12 September 2017 at 09:40

In quantitative genetics this is called the Bulmer effect, i.e., selection on an individual’s value y, which is well approximated by a sum of effects of multiple loci in the genome (x1 + x2 + … + xn), induces correlation between the loci in new generations, even if the loci do not reside on the same chromosome.
