David Hogg on linear regression:
… in almost all cases in which scientists fit a straight line to their data, they are doing something that is simultaneously wrong and unnecessary. It is wrong because … linear relationship is exceedingly rare.
Even if the investigator doesn’t care that the fit is wrong, it is likely to be unnecessary. Why? Because it is rare that … the important result … is the slope and intercept of a best-fit line! Usually the full distribution of data is much more rich, informative, and important than any simple metrics made by fitting an overly simple model.
That said, it must be admitted that one of the most effective ways to communicate scientific results is with catchy punchlines and compact, approximate representations, even when those are unjustified and unnecessary.
Related post: Responsible data analysis
As George Box said, “All models are wrong, but some are useful”. The linear regression model is always wrong, but sometimes it’s a useful approximation to reality. Of course, you still need to check to see how far wrong the model is – something that is all too infrequently done. But if it’s close, then it may be more valuable to use something simple (like linear regression) than something complex (like, say, spline regression).
On the other hand it is not enough just to say the relationship is “nonlinear”: http://www.johndcook.com/blog/2012/01/04/nonlinear-is-not-a-hypothesis/
Couldn’t agree more with Peter Flom (above). Linear oversimplifications are incredibly useful in many fields, but you have to be aware of what assumptions you’re making and how far off you’re likely to be. The real malpractice comes when standard behavior in an entire discipline is to ignore those assumptions and draw policy conclusions from models that don’t support them. Epidemiology, I’m looking at you, with papers defending malpractice
I seem to have botched the HTML for the links I was going to include in that last comment. Here’s the most egregious one:
Rothman, Kenneth J.
Epidemiology volume 1, #1 (1990)
“No Adjustments Are Needed for Multiple Comparisons”
Hi Dave
Actually, there is a good case to be made for ignoring the multiple comparisons issue. Andrew Gelman basically agrees with Rothman; and Cohen made the point that once you start adjusting for multiple comparisons, you don’t know when to stop – all the results in one table? One article? One line of research? As Cohen said (in his book on multiple regression) “this is a topic on which reasonable people can differ”.
Remember that, when you adjust for multiple comparisons, you lower power. Power is (for some reason) usually set at 0.8, while alpha is usually set at 0.05. Sometimes this is sensible, sometimes it isn’t.
One could even make the point that no one ever makes a type I error, because the null is never true.
Peter,
While the null is never quite true, it is also seldom false in the way the data seem to indicate. If you don’t adjust for multiple comparisons, you are effectively pretending that the apparent association that you chose to look at (because it was the extreme association in the sample) was actually the only association you looked at, chosen at random from among the many possible associations you could have looked at. That’s just plain wrong.
It’s especially wrong when you start with a massive longitudinal data set, which you then mine for apparent associations without adjusting for multiple comparisons. The actual number of comparisons being made may be thousands, or tens of thousands, or more. The odds that an association with a naive p-value between .01 and .05 reflects a real causal relationship are minuscule — but they’ll get reported in the epidemiology literature anyway, and get turned into diets or medication regimens or social policies that harm real people without benefit to any. A couple of authors have gone so far as to show that (for example) using the standard methods you can conclude that being a Leo has a positive association with heart disease. That seems like a reductio ad absurdum to me, but apparently not to Dr. Rothman and his followers.
The honest approach, of course, is to always independently test the hypotheses generated by the study-without-adjustment. When this is done, the ostensible effects invariably vanish — but the criminal part is that this is rarely done. The field of epidemiology has decided that speculative effects are worthy of publication, but confirmation (or failure to confirm) is not.
To me, the worst bit is in the last phrase (and it’s true for fields besides epidemiology) – confirmatory studies are rarely published.
As to the rest, it is of course true that if you fling enough **** at a wall, some will stick. Indeed, that’s what a p-value really is: A measure of the stickiness. That is, if you did 100 statistical tests of absolutely nonsensical things at the usual 0.05 level, you would expect about 5 of them to come out significant.
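To put rough numbers on the wall-flinging, here is a minimal sketch. It assumes every null hypothesis is true and that the tests are independent, which real studies rarely satisfy; the function names are made up for illustration.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' results when every null is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


n = 100
print("expected false positives:", expected_false_positives(n))
print("chance of at least one:  ", round(familywise_error_rate(n), 4))
print("Bonferroni per-test cut: ", bonferroni_threshold(n))
```

Under these assumptions, at least one false positive among 100 tests is close to certain. The Bonferroni cut is the bluntest of the available adjustments, and it is also where the loss of power mentioned earlier in the thread comes from.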
In my experience, though, people don’t do nonsensical things. They gather data based on some sort of idea (however vague) and then go on from there.
The right way to judge statistical results, it seems to me, is not through statistical significance but through what Abelson called the “MAGIC criteria” – magnitude, articulation, generality, interestingness and credibility. I wrote more about this on my blog, in a review of Abelson’s book. That’s here: http://www.statisticalanalysisconsulting.com/book-review-statistics-as-principled-argument-by-robert-abelson/#more-67
I think we agree on what makes good practice, but disagree on how widespread the problem is.
I think “they gather data based on some sort of idea” is how things used to be done, and because data collection was tedious and expensive it also tended to be targeted, with a hypothesis in mind. These days, though, a big chunk of economics, epidemiology, sociology, etc. consists of getting your hands on one or more of the Big Standard Data Sets — the American Community Survey, the Longitudinal Study of American Youth, the Nurses’ Health Study, the Stanford Large Network Dataset collection, etc. etc. etc. — and fishing for effects related to various outcome measures of interest.
I saw a journal article not that long ago wherein the authors, in an attempt to call attention to the madness, applied the same analytical techniques used in other recent epidemiology articles to find a significant effect of astrological sign on heart disease risk. The general reaction of the readers of that journal was not to say “wow, we need to stop doing that kind of analysis”.
Fishing for effects randomly is a problem, but, to continue the analogy, casting a line where you strongly suspect there are fish is not a problem.
I have no problem with people doing secondary analysis using already collected data, but that analysis (like any primary analysis) should be based on something more than “We have data! Let’s do something!”
It’s again a question of cart and horse; statistics should be used as part of a rational argument about some sort of hypothesis (using the last term loosely).
But I also have a problem with using statistical significance to evaluate effects; the important things are those MAGIC criteria I mentioned before. As an example, here’s a large effect size with N of 1 that would (I bet) be pretty convincing:
Your son eats a peach. Two minutes later he starts sweating and shaking, goes into convulsions. You rush him to the hospital where they save his life. Now, do you feed him another peach to increase sample size (and get to statistical significance)?
A Bayesian approach helps resolve these issues. Another way to say you strongly suspect an area has fish is to say that your prior probability on where the fish are has a lot of mass in that region.
If a result has no apparent justification, such as an association between zodiac sign and heart disease, the result has a small prior probability. Multiplying by a prior automatically enforces the dictum that extraordinary claims require extraordinary evidence.
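A small sketch of that multiplication, written in odds form (posterior odds = prior odds × likelihood ratio). The priors and the likelihood ratio below are invented purely for illustration; they are not estimates for any real study.

```python
def posterior_probability(prior, likelihood_ratio):
    """Update a prior probability given a likelihood ratio (Bayes factor).

    likelihood_ratio = P(data | effect is real) / P(data | no effect)
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical numbers: the same evidence applied to two different priors.
evidence = 20.0  # data 20 times more likely if the effect is real

plausible = posterior_probability(prior=0.5, likelihood_ratio=evidence)
zodiac = posterior_probability(prior=0.001, likelihood_ratio=evidence)

print("plausible mechanism:", round(plausible, 3))
print("zodiac sign claim:  ", round(zodiac, 3))
```

With these made-up inputs, the same evidence leaves the plausible hypothesis very likely and the zodiac claim still very unlikely, which is the "extraordinary evidence" dictum in arithmetic form.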
John: I agree completely that a Bayesian approach helps resolve these issues (and many others). Unfortunately, the people I’m complaining about are even less likely to become Bayesians than they are to do a Bonferroni (or similar) adjustment.
Peter: I’m not sure how I came to be seen as the defender of “using statistical significance to evaluate effects” in this discussion. I certainly don’t. Indeed, in some work I did a couple of years ago regarding veterans’ disability benefit awards, statistical significance was a useless concept — the sample was large enough (and there were enough confounders) that every effect was ‘significant’, but few were important. All I’m saying here is that a researcher who chooses to use p-values to determine which effects are going to get labeled ‘real’ needs to do it as correctly as possible — which includes adjusting the significance threshold when testing many things simultaneously.
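The "everything is significant, little is important" effect is easy to reproduce. The sketch below is an assumption-laden toy: a one-sample z-test with known unit variance, where the observed mean is taken to be exactly a tiny effect of 0.01 standard deviations. Only the sample size changes.

```python
import math


def two_sided_p(effect_in_sd, n):
    """Two-sided p-value of a z-test when the observed mean equals
    effect_in_sd (in standard-deviation units) with n observations."""
    z = effect_in_sd * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))


tiny_effect = 0.01  # hypothetical: one hundredth of a standard deviation

for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(tiny_effect, n))
```

The effect size never changes, yet the p-value goes from unremarkable to astronomically small as n grows. That is why magnitude has to be judged separately from significance.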
Dave, sorry if I mis-assigned you a role. I think there are (at least) two areas of problems with multiple comparisons and adjustments:
1) The general problems with p-values
2) The choice of how many comparisons to correct for.
Let’s ignore 1) as being something we all know about.
The choice of 2), in my view, depends on a few things: 1) Degree of “fish” – are you doing tests based on a priori questions? Are you just playing with data? Or is it somewhere in between? 2) State of knowledge in the field. 3) Relative cost of type I and type II errors (this really should be taken into account when setting alpha and power, but try telling a journal editor that!)
(And what an interesting and civil discussion we’re having!)
Again, I agree with everything you’re saying — but the editors of the epidemiology and econometrics journals do not, as you yourself note. They also tend to treat false positives as if they had zero cost: “Better to mistakenly identify a few effects than to miss even one real one!”. And thus a generation goes without butter and eggs, or gets subjected to bizarre and ineffective elementary education, or has to cope with counterproductive school busing policies. Because, as we agreed before — the confirmation study never gets done, and if it gets done it doesn’t get published, and if it gets published it doesn’t supplant the original claim in the mind of the public or the policy-maker.
(And I agree that this is a pleasant discussion, but if we’re the only people interested we should perhaps take it to email…)
I think we may be the only ones interested, but I also think the conversation is winding down. We both agree: It’s not the statisticians’ fault! :-)
Can’t agree with this one. For one thing, we know from Taylor that any C¹ function can be approximated as linear. So some convex subset of the data can be used to estimate a local
Second, the exact slope of a line can be very, very important. In most economic arguments the elasticity is the crucial fact in making or breaking an argument.
(I could give the example of the Laffer curve. We want to know the local response [slope] of various subgroups of taxpayers upon raising/lowering their marginal rates. It’s very important to know exactly how they’ll respond, in aggregate!)
Smooth functions can be approximated as locally linear. So a linear regression has to be justified by showing that the data are approximately linear in the range under study. And they may be.
But then a perfectly valid regression can be, and often is, extrapolated into regions where linearity may not hold. I gave a couple examples in a previous post:
John: I totally agree with you. As long as the lm is seen as a local approximation of the tangent, regression seems like an unquestionably great tool. Extrapolation and assuming no curves or interactions outside a “small” region is a different question, IMO.
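A quick sketch of the local-versus-extrapolated distinction. The curve y = x² and the fitting window around x = 1 are arbitrary choices made for illustration, and the least-squares fit is hand-rolled rather than a call to R's lm.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def curve(x):
    """Hypothetical 'true' relationship: smooth but not linear."""
    return x ** 2


# Fit only on a narrow window around x = 1 (0.90, 0.91, ..., 1.10).
xs = [0.9 + 0.01 * k for k in range(21)]
ys = [curve(x) for x in xs]
slope, intercept = fit_line(xs, ys)


def line_error(x):
    """Absolute gap between the fitted line and the true curve at x."""
    return abs((slope * x + intercept) - curve(x))


print("fitted slope (tangent slope at x = 1 is 2):", round(slope, 4))
print("error inside the window, x = 1.05:", round(line_error(1.05), 4))
print("error when extrapolating, x = 5:  ", round(line_error(5.0), 4))
```

Inside the window the line tracks the curve closely and its slope matches the tangent. Far outside the window the same line is badly off, even though nothing was wrong with the regression itself.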
In your exercise example, the relationship is not only non-linear, but non-monotonic.
This also makes me think of the question of “Whether effort or skill makes more difference to success”.
Obviously that’s very broad–multivariate output (per discipline/field), interaction terms, nonlinearities, measurement issues–but at some level we want a “global” answer like, let’s just say for programming to measure [1] IQ or scores on a programming-aptitude test at early age, versus [2] hours of study / practice (in any language? in one? still problematic).
But in a sense the question we’re asking is so global that maybe a lm would actually do the trick, even though it’s very much not a local approximation (unlike my Laffer example). I would basically want to know if the β’s are on the same order of magnitude. What I would expect is that we’d see much greater variation in hours put into training and practice, so even if the model is success ~ IQ • hours, the |range(IQ)| < 2 whereas |range(hours)| > 100, so even if IQ is favoured, hours has to be the main explanator.