I was listening to a business book in my car this afternoon. A couple times it said

> Numerous studies have confirmed …

and I couldn’t help but hear

> Several of my peers, who share my prejudices, were also able to do a multivariate regression and select a few variables out of hundreds to confirm the prevailing wisdom.

Maybe the prevailing wisdom is right. It often is. However, I’m not very impressed by attempts to shore up prevailing wisdom with linear regression, especially in business studies.

Sounds like a case of “experimenter degrees of freedom.”

I think it is also bad writing. If you have references, then provide them. Otherwise it is best to omit such a statement.

Could you elaborate on this post for the sake of people who might not see anything wrong with linear regression?

The first weakness of linear regression is that it assumes a linear relationship exists between the variables. Looking at the data could expose nonlinear relationships if there are only two or three variables, but when there are dozens of variables, highly nonlinear relationships could go undetected. And detected nonlinear relationships are ironed away by arbitrary transformations.
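To make the nonlinearity point concrete, here is a minimal sketch in Python. The data are made up for illustration: `y` is completely determined by `x`, yet the linear correlation between them is essentially zero, so a straight-line fit would report no relationship at all. The helper name `pearson` is my own, not taken from any library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data: y depends on x exactly, but through a parabola.
xs = [float(i) for i in range(-10, 11)]
ys = [x * x for x in xs]

r = pearson(xs, ys)
print(f"correlation between x and x^2: {r:.3f}")
```

With two variables a scatter plot would reveal the parabola immediately. With dozens of variables there is no such plot to look at.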

Next, variable selection is extremely ad hoc in practice. The objective appearance of the final analysis conceals the highly subjective process used to arrive there.

Then there are the usual statistical evils: confusing correlation with causation, multiple testing, misinterpretation of p-values, confusing statistical significance with scientific significance, etc.
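Here is a small simulation of the variable selection and multiple testing problems, as a sketch under stated assumptions rather than a model of any particular study. Everything in it is pure noise: 200 candidate predictors and one response, 50 observations each, with no real relationship anywhere. The sizes, the seed, and the cutoff `R_CRIT` (approximately the two-sided 5% critical value of the correlation coefficient for 50 observations) are my choices for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)        # arbitrary seed, only so the run is repeatable
N_OBS = 50            # observations per variable (assumed for illustration)
N_PREDICTORS = 200    # candidate predictors, all unrelated to y
R_CRIT = 0.279        # approx. two-sided 5% cutoff for |r| when n = 50

# The response and every predictor are independent standard normal noise.
y = [random.gauss(0, 1) for _ in range(N_OBS)]
predictors = [[random.gauss(0, 1) for _ in range(N_OBS)]
              for _ in range(N_PREDICTORS)]

correlations = [pearson(x, y) for x in predictors]
significant = [r for r in correlations if abs(r) > R_CRIT]
best = max(correlations, key=abs)

print(f"{len(significant)} of {N_PREDICTORS} noise predictors look 'significant'")
print(f"strongest correlation found: {best:.3f}")
```

At a 5% level you would expect about 10 of the 200 noise predictors to pass the test by chance alone. Report only those and the analysis looks objective, even though the data contain nothing.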

I’m not saying linear regression is a bad technique. But in practice, it is horribly abused. Linear regression is often the only tool in the toolbox of those who don’t understand statistics but use statistics regularly. The attitude of hacks is “There must be a linear regression model that tells me what I want to know. I just have to explore until I find it.” It never occurs to them that the data may not contain the answer, or may not contain the answer in linear form.

You’re taking a phrase you heard, re-reading it as something it didn’t say, then slamming it with your criticisms of linear regression. Your comments on the abuses of linear regression are interesting in their own right. Maybe it would be better just to write a short essay on them.

As for the business book, it’s unusual in popular business writing (especially an audiobook!) to list numerous references. The better ones have chapter notes and bibliographies, but even that’s uncommon.

Granted it’s only a text-bite, but there is enough there to set off my suspicions that something might be wrong. A large number of variables and a bit of prejudice can show lots of things, and ~~most~~ all of them might be wrong.

Additional variables can be added to linear regression, for example products of all pairs and triples of variables, all powers of variables, logarithms, etc. Just get enough data 🙂

A regularization penalty (an L1 penalty, as in the lasso, for instance) can be added to select variables automatically and avoid over-fitting.

And cross-validation can help check whether the data contain an answer at all.
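A rough sketch of the cross-validation idea, in plain Python. To keep it short it uses a single predictor and a ridge (L2) penalty on the slope rather than a variable-selecting penalty, so it illustrates the check, not a full modeling workflow. The function names `fit_ridge_1d` and `cv_mse` and all the data are made up for this example. The test is simple: if the fitted model predicts held-out points no better than the plain mean of the training data, the data probably do not contain the answer.

```python
import random

def fit_ridge_1d(xs, ys, lam=0.0):
    """Fit y = a + b*x with a ridge penalty lam on the slope b only."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / (sxx + lam)   # lam > 0 shrinks the slope toward zero
    a = my - b * mx
    return a, b

def cv_mse(xs, ys, k=5, lam=0.0):
    """k-fold cross-validated mean squared error.

    Returns (model_mse, baseline_mse), where the baseline predicts the
    mean of the training responses for every held-out point.
    """
    n = len(xs)
    model_sq = 0.0
    base_sq = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))   # every k-th point is held out
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        a, b = fit_ridge_1d(train_x, train_y, lam)
        mean_y = sum(train_y) / len(train_y)
        for i in held_out:
            model_sq += (ys[i] - (a + b * xs[i])) ** 2
            base_sq += (ys[i] - mean_y) ** 2
    return model_sq / n, base_sq / n

random.seed(1)   # arbitrary seed, only so the run is repeatable
x = [random.gauss(0, 1) for _ in range(100)]
noise_y = [random.gauss(0, 1) for _ in range(100)]          # no relationship
signal_y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]    # real linear signal

for label, y in [("noise", noise_y), ("signal", signal_y)]:
    model, baseline = cv_mse(x, y, k=5, lam=1.0)
    print(f"{label}: model MSE {model:.2f}, baseline MSE {baseline:.2f}")
```

The point of the baseline is that a cross-validated error means little on its own. It only tells you something when compared with what you would get by ignoring the predictors entirely.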

Perhaps it would be better for those who currently use, “Numerous studies have confirmed…” to replace it with, “There appears to be evidence which suggests…”, but that isn’t as persuasive to the ordinary hearer.

Another thing that cracks me up is when people tell me that whatever they believe in “has been scientifically proven”. It always reminds me of this strip: http://xkcd.com/882/

Thanks for the interesting info on statistics. I enjoy reading all your content and this helps me to understand the finer points.