Stephen Stigler [1] compares least-squares methods to the iPhone:

In the United States many consumers are entranced by the magic of the new iPhone, even though they can only use it with the AT&T system, a system noted for spotty coverage — even no receivable signal at all under some conditions. But the magic available when it does work overwhelms the very real shortcomings. Just so, least-squares will remain the tool of choice unless someone concocts a robust methodology that can perform the same magic, a step that would require the suspension of the laws of mathematics.

In other words, least-squares, like the iPhone, **works so well when it does work that it’s OK that it fails miserably now and then**. Maybe so, but that depends on context.

In his quote, Stigler argues that Americans feel that missing a phone call occasionally is an acceptable trade-off for the features of the iPhone. Many people would agree. But if you’re on a transplant waiting list, you might prefer more reliable coverage to a nicer phone.

It’s not enough to talk about *probabilities* of failure without also talking about *consequences* of failure. For example, the consequences of missing a phone call are greater for some people than for others.

Least-squares is a mathematically convenient way to place a cost on errors: the cost is proportional to the square of the size of the error. That’s often reasonable in application, but not always. In some applications, the cost is simply proportional to the size of the error. In other applications, it doesn’t matter how large an error is once it is above some threshold. Sometimes the cost of errors is asymmetric: over-estimating has a different cost than under-estimating by the same amount. Sometimes you’re more worried about the worst case than the average case. One size does not fit all.
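To make the alternatives concrete, here is a minimal Python sketch of the cost functions just described. The function names, the threshold of 1.0, and the 3-to-1 penalty on under-estimates are all assumptions chosen for illustration, not values from Stigler or from any particular application.

```python
# Illustrative cost functions; every name and constant here is hypothetical.
# Convention: error = estimate - truth, so a negative error is an under-estimate.

def squared_loss(error):
    """Least-squares cost: proportional to the square of the error."""
    return error ** 2

def absolute_loss(error):
    """Cost proportional to the size of the error."""
    return abs(error)

def threshold_loss(error, threshold=1.0):
    """Cost that ignores how large the error is once it exceeds the threshold."""
    return 1.0 if abs(error) > threshold else 0.0

def asymmetric_loss(error, over_weight=1.0, under_weight=3.0):
    """Under-estimates cost more than over-estimates of the same size."""
    return over_weight * error if error >= 0 else -under_weight * error

def average_cost(errors, loss):
    """Average-case view: mean cost over all errors."""
    return sum(loss(e) for e in errors) / len(errors)

def worst_case_cost(errors, loss):
    """Worst-case view: the single largest cost."""
    return max(loss(e) for e in errors)

errors = [-2.0, -0.5, 0.5, 4.0]
for loss in (squared_loss, absolute_loss, threshold_loss, asymmetric_loss):
    print(loss.__name__, average_cost(errors, loss), worst_case_cost(errors, loss))
```

The same four errors rank very differently depending on which cost you apply, which is the point: the choice of cost function is part of the problem statement, not a detail.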

[1] Stephen M. Stigler, The Changing History of Robustness, American Statistician, Vol. 64, No. 4. November 2010. (Written before Verizon announced it would be supporting the iPhone)

I don’t have an iPhone, but I do have cell phone service from AT&T. I’ve had no problems with their coverage, though I understand that some people do. My point here isn’t to criticize AT&T. I just grant Stigler’s assessment for the sake of illustration.

I think one reasonable approach is to fit models using least squares or something else that’s mathematically convenient but evaluate models using something else.

It seems like complicated methods are much more tractable in the evaluation process.

Under the right conditions, least squares is the maximum likelihood estimator. Another problem with that property (or least squares more generally) is that it is model-dependent. It’s nice to have model-independent evaluations (like a loss function based on how different errors affect you, or a gains chart, or whatever).
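A rough sketch of that idea in Python: fit a line with ordinary least squares because it has a closed form, then score the fitted line with whatever cost reflects how errors actually affect you. The data, the helper names, and the 3-to-1 penalty on under-prediction are assumptions made up for this example.

```python
# Fit with least squares (convenient), evaluate with a different loss.
# All names, data, and weights below are hypothetical.

def fit_line_least_squares(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def evaluate(xs, ys, intercept, slope, loss):
    """Mean cost of the prediction errors (prediction minus observation)."""
    errors = [(intercept + slope * x) - y for x, y in zip(xs, ys)]
    return sum(loss(e) for e in errors) / len(errors)

def under_prediction_hurts(error, under_weight=3.0):
    """Asymmetric cost: predicting too low is penalized more heavily."""
    return error if error >= 0 else -under_weight * error

xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 5.0]
intercept, slope = fit_line_least_squares(xs, ys)

print("fit:", intercept, slope)
print("mean squared error:", evaluate(xs, ys, intercept, slope, lambda e: e ** 2))
print("mean absolute error:", evaluate(xs, ys, intercept, slope, abs))
print("asymmetric cost:", evaluate(xs, ys, intercept, slope, under_prediction_hurts))
```

The fitting step never sees the asymmetric cost, so nothing guarantees the least-squares line is the best line under it. The evaluation only tells you how much the convenient fit costs you in the terms you care about.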

I just found your blog and am enjoying it a lot. Thank you!

In a class of computer vision, we were comparing RANSAC to least squares method. I implemented both of them and learned that RANSAC is stochastic and can take much longer to run. But it’s invaluable for eliminating outliers or searching for multiple models.
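For readers who haven't seen the comparison, here is a small Python sketch of RANSAC next to plain least squares on a line-fitting problem. It is not the implementation from that class. The iteration count, inlier tolerance, seed, and data are assumptions picked for illustration, and real uses would tune them.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def ransac_line(points, n_iterations=200, inlier_tolerance=0.5, seed=0):
    """Minimal RANSAC sketch: repeatedly fit a line through two random
    points, keep the largest set of points within the tolerance of that
    line, then refit by least squares on that set only."""
    rng = random.Random(seed)  # fixed seed so this stochastic method is repeatable
    best_inliers = []
    for _ in range(n_iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical candidate line, cannot be written as y = a + b x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (intercept + slope * x)) <= inlier_tolerance]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return fit_line(best_inliers), best_inliers

# Ten points exactly on y = 1 + 2x, plus two gross outliers (made-up data).
points = [(float(x), 1.0 + 2.0 * x) for x in range(10)] + [(2.0, 30.0), (7.0, -20.0)]

print("least squares on everything:", fit_line(points))
(ransac_intercept, ransac_slope), inliers = ransac_line(points)
print("RANSAC:", ransac_intercept, ransac_slope, "inliers:", len(inliers))
```

The loop over random pairs is where the extra running time goes, and the result can change with the seed, which matches the experience described above. In exchange, the two outliers drag the plain least-squares slope far from 2 while the RANSAC fit ignores them.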

Thanks for your insightful posts.

Well, given the abundance of phenomena where the CLT allows us to deem noise a Gaussian variable, the passion with which some statisticians defend the least squares method is not inexplicable: it’s just a maximum likelihood estimation. Even a “quasi-Bayesian” maximum a posteriori estimation can sometimes be cast in a least squares fashion.

But when you consider the equally abundant phenomena where the CLT is NOT applicable (when noise sources are correlated, or noise is a non-stationary process,…) or when a Gaussian generative model is not justifiable, least squares estimators can suck big time.

My objection to least-squares is not so much the question of whether the CLT holds but whether x^2 is a reasonable loss function. In many cases it clearly is not.

Loss functions are often asymmetric. For example, I don’t want to minimize volatility in my retirement investments. I want to minimize *downward* volatility. I’m perfectly happy with large erratic gains!

Stigler link: https://files.nyu.edu/ts43/public/research/.svn/text-base/Stigler.pdf.svn-base

(Weird suffix, eh?)

Under regression is the mean. Under the mean is a normal distribution. We assume the data is normal by not knowing we are supposed to check.