Individual differences and high dimensions

In 1945, a Cleveland newspaper held a contest to find the woman whose measurements were closest to average. This average was based on a study of 15,000 women by Dr. Robert Dickinson and embodied in a statue called Norma by Abram Belskie. Out of 3,864 contestants, no one was average on all nine factors, and fewer than 40 were close to average on five factors. The story of Norma and the Cleveland contest is told in Todd Rose’s book The End of Average.

People are not completely described by a handful of numbers. We’re much more complicated than that. But even in systems that are well described by a few numbers, the region around the average can be nearly empty. I’ll explain why that’s true in general, then look back at the Norma example.

General theory

Suppose you have N points, each described by n independent, standard normal random variables. That is, each point has the form (x₁, x₂, x₃, …, x_n) where each x_i is independent and normally distributed with mean 0 and variance 1. The expected value of each coordinate is 0, so you might expect most points to be piled up near the origin (0, 0, 0, …, 0). In fact most points lie in a spherical shell around the origin. Specifically, as n becomes larger, most of the points will be in a thin shell at a distance of about √n from the origin, because the squared distance x₁² + … + x_n² has expected value n. (More details here.)
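To see the shell numerically, here is a minimal sketch that samples standard normal points in several dimensions and summarizes their distances from the origin. The function name `shell_summary`, the sample size, and the seed are illustrative choices, not part of the original post.

```python
import numpy as np

def shell_summary(n, N=10_000, seed=0):
    """Mean and standard deviation of the distance from the origin
    for N standard normal points in n dimensions."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal(size=(N, n))
    dist = np.linalg.norm(points, axis=1)  # one distance per point
    return dist.mean(), dist.std()

for n in (1, 9, 100, 1000):
    mean, std = shell_summary(n)
    print(f"n={n:5d}  sqrt(n)={np.sqrt(n):7.3f}  mean dist={mean:7.3f}  std={std:.3f}")
```

The mean distance should track √n increasingly well as n grows, while the spread of the distances stays roughly constant, so the shell becomes thin relative to its radius.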

Simulated contest

In the contest above, n = 9, so we expect most contestants to be at a distance of about √9 = 3 from average once each factor is normalized, i.e. we subtract the mean so that each factor has mean 0, and we divide by the standard deviation so that each factor has standard deviation 1.
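The normalization step can be sketched as follows. The raw measurements here are synthetic stand-ins with made-up means and spreads (the original contest data is not used), and `standardize` is a hypothetical helper name.

```python
import numpy as np

def standardize(raw):
    """Subtract each column's mean and divide by its standard deviation."""
    return (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Synthetic stand-in data: 3864 rows, 9 factors, arbitrary means and spreads
rng = np.random.default_rng(0)
means = rng.uniform(20, 60, size=9)
sds = rng.uniform(1, 5, size=9)
raw = rng.normal(loc=means, scale=sds, size=(3864, 9))

Z = standardize(raw)
dist = np.linalg.norm(Z, axis=1)  # distance of each row from the average

print(Z.mean(axis=0).round(6))  # each factor now has mean 0
print(Z.std(axis=0).round(6))   # and standard deviation 1
print(dist.mean())              # typical distance, to compare with sqrt(9) = 3
```

After standardizing, the average row sits at the origin, so distance from the origin is distance from average measured in standard deviations.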

We’ve made several simplifying assumptions. For example, we’ve assumed independence, though presumably some of the factors measured in the contest were correlated. There’s also a selection bias: presumably women who knew they were far from average would not have entered the contest. But we’ll run with our simplified model just to see how it behaves in a simulation.

import numpy as np

# Winning criterion: minimum Euclidean distance
def euclidean_norm(x):
    return np.linalg.norm(x)

# Winning criterion: min-max (smallest largest deviation)
def max_norm(x):
    return np.max(np.abs(x))

n = 9
N = 3864

# Simulated normalized measurements of contestants 
M = np.random.normal(size=(N, n))

# Score every contestant (row of M) under both criteria
euclid = np.linalg.norm(M, axis=1)
maxdev = np.abs(M).max(axis=1)

w1 = euclid.argmin()
w2 = maxdev.argmin()

print( M[w1,:] )
print( euclidean_norm(M[w1,:]) )
print( M[w2,:] )
print( max_norm(M[w2,:]) )

There are two different winners, depending on which criterion we use. Using the Euclidean distance to the origin, the winner in this simulation was contestant 3306. Her normalized measurements were

[ 0.1807, 0.6128, -0.0532, 0.2491, -0.2634, 0.2196, 0.0068, -0.1164, -0.0740]

corresponding to a Euclidean distance of 0.7808.

If we judge the winner to be the one whose largest deviation from average is the smallest, the winner is contestant 1916. Her normalized measurements were

[-0.3757, 0.4301, -0.4510, 0.2139, 0.0130, -0.2504, -0.1190, -0.3065, -0.4593]

with the largest deviation being the last, 0.4593.

By either measure, the contestant closest to the average still deviated from it by roughly half a standard deviation in at least one dimension.
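As a rough companion to the simulation, the sketch below estimates how many contestants we would expect to be close to average on every factor under the same simplified model. The tolerance of half a standard deviation is an assumption made for illustration, as are the helper names; the original contest's definition of "close to average" is not given here, and this counts a fixed set of n factors rather than any n of the nine.

```python
import math
import numpy as np

def prob_within(tol):
    """P(|Z| < tol) for a standard normal Z."""
    return math.erf(tol / math.sqrt(2))

def expected_count(N, n, tol):
    """Expected number of N independent points within tol on all n coordinates."""
    return N * prob_within(tol) ** n

def simulated_count(N, n, tol, seed=0):
    """Count simulated points whose largest absolute coordinate is below tol."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal(size=(N, n))
    return int((np.abs(M).max(axis=1) < tol).sum())

N, tol = 3864, 0.5  # tol is an assumed threshold, not from the contest
for n in (5, 9):
    print(n, expected_count(N, n, tol), simulated_count(N, n, tol))
```

Because the per-factor probability is raised to the power n, the expected count drops quickly as factors are added: with this tolerance it is in the dozens for five factors and below one for nine.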

7 thoughts on “The empty middle: why no one is average”

Andrew McRae

20 February 2016 at 13:25

Better to say that “very few” are average – just because most of the mass can be found in a shell away from the origin in R^n, the density is still highest at the origin! The number of average people can of course be quantified: if you define ‘average’ as being within 0.5 standard deviations of the mean, with n criteria and N people, the expected number of average people is some N*0.383^n. If you want to tighten up the definition of ‘average’, change the 38.3% accordingly (but the normal distribution curve is already pretty flat).

Jon Peltier

20 February 2016 at 17:14

Also from “The End of Average”

When U.S. air force discovered the flaw of averages

Mike Anderson

20 February 2016 at 19:23

Great simulation example, but is the original data set collected by the Cleveland Plain Dealer out there somewhere?

JF Puget

22 February 2016 at 05:09

Interesting (as always). Shouldn’t the seed setting appear before the call to np.random.normal()?

John

22 February 2016 at 07:04

Thanks. Are you familiar with the idiom “closing the barn door after the horses are out”? :)

I think I rearranged the code before copying it into the blog and overlooked the line with the seed. I took it out since it didn’t add anything.

BMGM

3 March 2016 at 21:56

You just illustrated the point that Shams and I discussed in
http://badmomgoodmom.blogspot.com/2011/12/meeting-shams.html

Robert Matthews

6 March 2016 at 06:02

Seems to me that this amounts to mathematical proof that Quetelet’s “L’homme moyen” (“Average Man”) is a fiction. I wonder if the proof of it, based on the multivariate distribution, might have been available in his own time?

Anyhow, thanks very much for a thought-provoking post.
