Skin in the game for observational studies

The article “Deming, data and observational studies” by S. Stanley Young and Alan Karr opens with

Any claim coming from an observational study is most likely to be wrong.

They back up this assertion with data on observational studies whose findings were later contradicted by prospective studies.

Much has been said lately about the assertion that most published results are false, particularly observational studies in medicine, and I won’t rehash that discussion here. Instead I want to cut to the process Young and Karr propose for improving the quality of observational studies. They summarize their proposal as follows.

The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set.

They then describe an eight-step process in detail. One step is that cleaning the data and dividing it into a modelling set and a holdout set would be done by different people than those who do the modelling and analysis. They go on to explain why this would lead to more truthful publications.

The holdout set is the key. Both the author and the journal know there is a sword of Damocles over their heads. Both stand to be embarrassed if the holdout set does not support the original claims of the author.
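
To make the mechanics concrete, here is a minimal sketch in Python of how such a blinded split might work. The function name, the 50/50 split, and the CSV hand-off are illustrative choices of mine, not details from the paper.

```python
# A minimal sketch of a blinded modelling/holdout split, assuming a data
# manager who is not involved in the analysis runs it. Names and the 50/50
# split fraction are illustrative, not taken from Young and Karr.
import numpy as np
import pandas as pd

def split_for_analysis(df, holdout_frac=0.5, seed=20110901):
    """Randomly partition cleaned data into a modelling set and a holdout set."""
    rng = np.random.default_rng(seed)
    holdout_idx = rng.choice(df.index, size=int(len(df) * holdout_frac), replace=False)
    holdout = df.loc[holdout_idx]
    modelling = df.drop(index=holdout_idx)
    return modelling, holdout

# Toy data: 1000 subjects with one exposure and one outcome.
rng = np.random.default_rng(0)
data = pd.DataFrame({"exposure": rng.normal(size=1000),
                     "outcome": rng.normal(size=1000)})

modelling, holdout = split_for_analysis(data)
modelling.to_csv("modelling.csv", index=False)  # released to the analysts
holdout.to_csv("holdout.csv", index=False)      # sealed until the paper is accepted
```

The point of the fixed seed is that the data manager can document and reproduce the split while the analysts never see the holdout file until after acceptance.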

* * *

The full title of the article is “Deming, data and observational studies: A process out of control and needing fixing.” It appeared in the September 2011 issue of Significance.

Update: The article can be found here.

13 thoughts on “Skin in the game for observational studies”

  1. That’s an interesting idea, but not useful unless also accompanied by a doubling of sample sizes! I’d hate to see observational studies all suddenly being sliced in half. The approach is begging for someone to run some simulations to see what happens when effects are in the “barely detectable” range. (A rough simulation along those lines appears after the comments.)

  2. Your comment implicitly assumes that the new method has no benefit at all, that it’s not worth reducing the original sample size in exchange for more accountability. If the only problem with observational studies is sample size, then you’re right. But if there are other problems — overfitting, multiple testing, publication bias, lack of consequences for publishing false results, etc. — then it’s worth reducing the initial data set size.

  3. Data splitting is fine, but the above concerns all hold for experiments, not just observational studies. Indeed, arguably the concerns are largest for randomized experiments, because it is there that researchers get a false feeling of security (led on by statistics textbooks and by statements such as “Any claim coming from an observational study is most likely to be wrong”) and are led to believe that claims coming from randomized experiments are most likely to be correct.

  4. P.S. I followed the link and saw this quote: “Science works by experiments that can be repeated; when they are repeated, they must give the same answer.”

    If “experiment” is taken to have the same meaning that it has in statistics textbooks, the above sentence is wrong and misleading.

    That article is well intentioned but it has so many problems! This is worth its own blog post.

  5. @John – It’s great to see efforts to improve the reliability of results, particularly in biomedicine, but if you implemented this you would have to increase the total sample size. This isn’t cheap, and I’m just being pragmatic here; I suspect there’s no cheap way to fix the problem…

    Let me put it this way. If you consider the results generated on the proposed modelling set in isolation, I can’t see any benefit to a smaller sample size by any metric (except maybe avoiding being “overpowered” and detecting irrelevant effects). Am I wrong in thinking that a smaller sample size would worsen overfitting and reduce both P(p<0.05 | true effect) and P(true effect | p<0.05), and therefore aggravate some of the other problems you mention? A hold-out data set to hold over people’s heads isn’t much use if the modelling set just generates nonsense – in fact the strategy risks worsening the problem in order to highlight the problem.

  6. @Ricardo

    If your traditional-style study is likely to be wrong and you can’t afford to do a better job, then perhaps you should just walk away from the whole thing? Are likely-incorrect results better or worse than no results? I guess it depends on context. Also, I’m not sure of your second paragraph’s reasoning. If you overfit to the data in your hand, the model should fail on the hold-out set. You can’t overfit to the hold-out set unless you have access to it (which you shouldn’t), which means it can’t worsen the problem.

  7. This may address overfitting but it doesn’t address missing confounders and other biases in observational designs.

  8. I agree with Ricardo’s point. Often, observational study sample sizes are determined by a power analysis. Funders will only approve a sample size just barely big enough to detect what you seek.

    If you cut this sample size in half and repeat your analysis on both halves, everything will come out insignificant/imprecise on both the modeling and holdout sets. Nobody learns anything this way. You can’t benefit from the extra accountability if the holdout set estimates/predictions are too imprecise to be useful.

    Remember that a non-significant holdout result doesn’t mean we “learned” that the null is true! It only means we didn’t have enough data to measure things precisely… i.e., we learned nothing.*

    This approach can be useful, but only if you’re funded to collect enough data for high power in both the modeling and holdout sets.

    (*Unless you treat it as a pilot study to help plan a future power analysis.)

  9. This seems to be assuming that observational studies have a sample size exactly large enough to detect some effect with a significant p-value, and if we use a smaller sample we learn nothing.

    There are several problems with this assumption. For one, sample sizes are based on wild guesses at the effect size that will be seen. The sample size may be barely adequate for a guessed effect size and desired significance level. But if there’s a larger effect than guessed, a smaller sample will do. Or perhaps the study finds the guessed effect at a different significance level.

    I also object to the assumption that studies have a binary conclusion: we either prove what we’re after or we learn absolutely nothing. Learning is more continuous than that. This is obvious with a Bayesian approach, but it’s also true of a frequentist approach based on estimation rather than hypothesis testing. If you have a smaller sample, you’ll generally get a wider confidence/credibility interval. It’s not all or nothing.

    If the hypothesized effect shows up in both the modeling and hold-out set, you could analyze the combined data as one set and see what p-values or interval estimates that gives. This result would have more credibility than if it were initially analyzed as one data set, since half the data were not used to pick the model.

  10. @Andrew

    I agree, they should walk away. They don’t listen to me, though… Re my second paragraph, I think you misunderstood me. If they overfit on the modelling set, of course it will likely fail on the hold-out. But a smaller sample means it’s more likely to overfit on the modelling set, which worsens the problem. Not unresolvable – you just need bigger samples, but that’s a big ask of funders/sponsors.

    @John
    “If the hypothesized effect shows up in both the modeling and hold-out set, you could analyze the combined data as one set and see what p-values or interval estimates that gives. This result would have more credibility than if it were initially analyzed as one data set, since half the data were not used to pick the model.”

    Are you sure it will have more credibility? I think that’s a hard one to call.

  11. @John:
    Fair enough. I agree that estimation is more informative than testing. But it’s not always enough.

    My point is just this: I can easily imagine cases where halving the sample size will make your study *less* informative/credible than using the full sample at once. (But then, I’m imagining simple polls/surveys with a predetermined question of interest, where power analysis is easy-ish, and where data-fishing and overfitting aren’t problems anyway.)

    My comment that “we learned nothing” was a (too-strong) response to what you quoted from the article:
    “Both [author and journal] stand to be embarrassed if the holdout set does not support the original claims of the author.”
    To me, this sounds like, “If something significant/precise in the modeling set is insignificant/imprecise in the holdout set, the author should feel embarrassed.” I disagree, especially when you’ve hamstrung yourself by halving the sample size.
    But perhaps you read their quote differently.

  12. Doubling sample size seems a small price to pay to address this very serious problem. This method should also reduce the need for multiple replications, so it would pay for itself in a way, and funders should take that into consideration if they have the big picture in mind.

  13. Which is worse — asserting a result that is bogus, or failing to find a result that is true? Most epidemiologists seem to think that missing a real effect is much worse than asserting a bogus effect, but I don’t know why they think this. Asserting bogus effects ruins people’s lives. Missing real effects doesn’t change the status quo at all.

    If sample size were the problem, the arguments above would make more sense. The problem is more often failure to correct for (mega-)multiple hypotheses, p-hacking, or other predictable but pernicious behaviors driven by standard publication criteria. Doubling the sample size is nothing compared to failing to mention that you tested 3 million hypotheses, or trying 97 variants of the model until one of them produced a p-value of 0.047…

Comments are closed.
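
* * *

Several comments above (1, 5, and 8) ask what a 50/50 split does to statistical power when an effect is barely detectable, and comment 1 asks for a simulation. Below is a rough sketch of one. The sample size, effect size, and significance level are guesses of mine, chosen so the full sample has roughly 80% power; this is a back-of-the-envelope illustration, not an analysis of any real study.

```python
# A rough simulation of the power question raised in the comments: a two-arm
# comparison powered at about 80% for the full sample, re-examined after the
# sample is split in half. All numbers here are assumptions for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_full, effect, alpha, n_sims = 200, 0.28, 0.05, 10_000  # n per arm, Cohen's d

def significant(n):
    """Simulate one two-arm study with n subjects per arm; True if p < alpha."""
    treated = rng.normal(effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    return stats.ttest_ind(treated, control).pvalue < alpha

power_full = np.mean([significant(n_full) for _ in range(n_sims)])
power_half = np.mean([significant(n_full // 2) for _ in range(n_sims)])
both_halves = np.mean([significant(n_full // 2) and significant(n_full // 2)
                       for _ in range(n_sims)])

print(f"power with the full sample:  {power_full:.2f}")   # about 0.80
print(f"power with half the sample:  {power_half:.2f}")   # about 0.50
print(f"significant in both halves:  {both_halves:.2f}")  # about 0.25
```

Under these assumptions, a study designed for roughly 80% power drops to roughly a coin flip in each half, and the chance of a significant result in both the modelling and holdout halves is only about one in four. That is the cost side of the trade-off; whether the added accountability is worth it is the question the article and the commenters disagree on.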