Responsible data analysis

David Hogg on responsible data analysis:

The key idea is that the result of responsible data analysis is not an answer but a distribution over answers. Data are inherently noisy and incomplete; they never answer your question precisely. So no single number … will adequately represent the result of a data analysis. Results are always pdfs (or full likelihood functions); we must embrace that.
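To make the contrast concrete, here is a minimal sketch of reporting a distribution instead of a single number. It uses a bootstrap of the sample mean, which is a simple stand-in and not the full pdf or likelihood function Hogg has in mind. The data values and the helper names (`bootstrap_means`, `percentile_interval`) are invented for illustration.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=2000, seed=0):
    """Resample the data with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(samples, lo=0.025, hi=0.975):
    """Return rough lower and upper percentiles of a list of samples."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(lo * last)], ordered[int(hi * last)]

# Hypothetical measurements, made up for this example.
data = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.8]

point_estimate = statistics.fmean(data)   # the "single number"
means = bootstrap_means(data)             # an approximate distribution over answers
low, high = percentile_interval(means)

print(f"point estimate: {point_estimate:.2f}")
print(f"approximate 95% interval for the mean: [{low:.2f}, {high:.2f}]")
```

The list `means` is the result worth reporting or plotting; the interval is only a summary of it.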

Posted in Statistics
9 comments on “Responsible data analysis”
  1. Speed says:

    The link to David Hogg needs work.

  2. John says:

    Thanks. I fixed the URL.

  3. Pierre says:

    So much for decision theory… it’s good that people embrace uncertainty quantification, but that quote is just simplistic, isn’t it?

  4. John says:

    The quote is less simplistic than reporting a single number with no indication of uncertainty, and that’s what it’s arguing against.

    As for decision theory, most people agree it’s the right thing to do, and yet hardly anybody does it. I believe the reason is that people get into (often petty) arguments over utilities. And when people cannot agree on a utility, an implicit utility corresponding to mathematical convenience wins by default. It’s really depressing. Here’s a blog post I wrote along these lines.

  5. Rick Wicklin says:

    I was once told that “an answer without a confidence interval is meaningless.” Today we might add “and a graph of the approximate distribution of the answer.”

  6. James Piper says:

    OK. pdfs isn’t referring to the file format. Is it like snafu? So, pretty damn f’g sure. Just wondering.

  7. Dave Tate says:

    In project management, they use the term “S-curve” to refer to the cdf of total project cost (or duration). In theory, every cost or schedule estimate should generate an S-curve, which could then be used to answer questions like “how much should I budget if I want an 80% chance of having enough funds to finish the project?”, or “given my budget, which of these two alternatives is more likely to succeed?”.

    In practice, even when an S-curve is generated it is usually hopelessly optimistic, and bears no resemblance to the actual historical distribution of costs for similar projects. If point estimates are misleading, incorrect distributional estimates are even more misleading.

  8. Tom says:

    @JamesPiper In statistics, a pdf is a “probability density function”.

  9. koala says:

    To be clear, Hogg’s words are completely in line with (and stem from) the Bayesian view of statistics.
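The S-curve described in Dave Tate's comment can be sketched with a small Monte Carlo simulation. This is only an illustration under loud assumptions: the task list, the triangular cost distributions, and the function names are all hypothetical, and, as that comment warns, the resulting curve is no better than the input distributions.

```python
import bisect
import math
import random

def simulate_total_costs(tasks, n_sims=10_000, seed=0):
    """Return sorted simulated total costs; each task is (low, mode, high)."""
    rng = random.Random(seed)
    totals = [
        sum(rng.triangular(low, high, mode) for low, mode, high in tasks)
        for _ in range(n_sims)
    ]
    return sorted(totals)

def budget_for_confidence(sorted_costs, p):
    """Smallest simulated cost with at least a fraction p of runs at or below it."""
    idx = math.ceil(p * len(sorted_costs)) - 1
    idx = max(0, min(idx, len(sorted_costs) - 1))
    return sorted_costs[idx]

def chance_within_budget(sorted_costs, budget):
    """Empirical cdf (the S-curve) evaluated at the given budget."""
    return bisect.bisect_right(sorted_costs, budget) / len(sorted_costs)

# Hypothetical tasks: (optimistic, most likely, pessimistic) cost in arbitrary units.
tasks = [(10, 12, 20), (5, 8, 15), (20, 25, 45)]

costs = simulate_total_costs(tasks)
budget_80 = budget_for_confidence(costs, 0.80)

print(f"budget for an 80% chance of finishing: {budget_80:.1f}")
print(f"chance of finishing within 50 units: {chance_within_budget(costs, 50):.0%}")
```

Running `chance_within_budget` on the simulated costs of two alternatives with the same budget gives the comparison of which alternative is more likely to succeed.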
