Responsible data analysis

David Hogg on responsible data analysis:

The key idea is that the result of responsible data analysis is not an answer but a distribution over answers. Data are inherently noisy and incomplete; they never answer your question precisely. So no single number … will adequately represent the result of a data analysis. Results are always pdfs (or full likelihood functions); we must embrace that.
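As a rough illustration of reporting a distribution rather than a single number, here is a minimal sketch that uses a bootstrap resample of the mean. The bootstrap is a stand-in chosen for brevity, not the posterior or likelihood function the quote has in mind, and the measurements and function names are invented for the example.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=2000, seed=0):
    """Return the sorted means of resamples drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    means = [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return sorted(means)

def percentile_interval(sorted_values, level=0.95):
    """Read a central interval off the sorted bootstrap values."""
    tail = (1.0 - level) / 2.0
    last = len(sorted_values) - 1
    return sorted_values[int(tail * last)], sorted_values[int((1.0 - tail) * last)]

# Hypothetical measurements, invented for illustration only.
measurements = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.7, 10.6]

means = bootstrap_means(measurements)
low, high = percentile_interval(means)
point_estimate = statistics.fmean(measurements)

# Report the spread alongside the point estimate, not the point estimate alone.
print(f"mean = {point_estimate:.2f}, 95% bootstrap interval = [{low:.2f}, {high:.2f}]")
```

The list `means` is the object of interest here: it can be plotted as a histogram, and the interval is only one summary of it.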

7 thoughts on “Responsible data analysis”

Pierre

5 July 2012 at 08:19

So much for decision theory… it’s good that people embrace uncertainty quantification, but that quote is just simplistic, isn’t it?

John

5 July 2012 at 08:27

The quote is less simplistic than reporting a single number with no indication of uncertainty, and that’s what it’s arguing against.

As for decision theory, most people agree it’s the right thing to do, and yet hardly anybody does it. I believe the reason is that people get into (often petty) arguments over utilities. And when people cannot agree on a utility, an implicit utility corresponding to mathematical convenience wins by default. It’s really depressing. Here’s a blog post I wrote along these lines.

Rick Wicklin

5 July 2012 at 08:35

I was once told that “an answer without a confidence interval is meaningless.” Today we might add “and a graph of the approximate distribution of the answer.”

James Piper

6 July 2012 at 04:48

OK. pdfs isn’t referring to the file format. Is it like snafu? So, pretty damn f’g sure. Just wondering.

Dave Tate

6 July 2012 at 08:22

In project management, they use the term “S-curve” to refer to the cdf of total project cost (or duration). In theory, every cost or schedule estimate should generate an S-curve, which could then be used to answer questions like “how much should I budget if I want an 80% chance of having enough funds to finish the project?”, or “given my budget, which of these two alternatives is more likely to succeed?”.

In practice, even when an S-curve is generated it is usually hopelessly optimistic, and bears no resemblance to the actual historical distribution of costs for similar projects. If point estimates are misleading, incorrect distributional estimates are even more misleading.
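The two S-curve questions above can be sketched with a small Monte Carlo simulation. This is only an illustration under stated assumptions: the task list, the triangular cost model, and the function names are all hypothetical, and the resulting curve is only as trustworthy as those inputs, which is exactly the caveat raised here.

```python
import bisect
import math
import random

def simulate_total_costs(tasks, n_draws=10000, seed=0):
    """Draw total project costs; each task is (low, mode, high), an assumed triangular model."""
    rng = random.Random(seed)
    totals = [
        sum(rng.triangular(low, high, mode) for (low, mode, high) in tasks)
        for _ in range(n_draws)
    ]
    return sorted(totals)  # the sorted draws are the empirical S-curve (cdf)

def budget_for_confidence(sorted_totals, confidence):
    """Smallest simulated total that covers at least `confidence` of the draws."""
    index = math.ceil(confidence * len(sorted_totals)) - 1
    return sorted_totals[max(index, 0)]

def chance_of_success(sorted_totals, budget):
    """Fraction of simulated totals that fit within the given budget."""
    return bisect.bisect_right(sorted_totals, budget) / len(sorted_totals)

# Hypothetical task cost ranges (low, most likely, high), invented for illustration.
tasks = [(10, 12, 20), (5, 8, 15), (20, 25, 45)]

totals = simulate_total_costs(tasks)
budget_80 = budget_for_confidence(totals, 0.80)
print(f"budget for an 80% chance of finishing: {budget_80:.1f}")
print(f"chance of finishing within 50: {chance_of_success(totals, 50):.2%}")
```

Comparing two alternatives under a fixed budget amounts to calling `chance_of_success` on each alternative's simulated totals. Replacing the assumed triangular ranges with historical costs from similar projects would address the optimism problem more directly than any change to the code.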

Tom

7 July 2012 at 10:39

@JamesPiper In statistics, a pdf is a “probability density function”.

koala

11 July 2012 at 01:42

To be clear, Hogg’s words are completely in line with (and stem from) the Bayesian view of statistics.
