Andrew Gelman just posted an interesting article on the philosophy of Bayesian statistics. Here’s my favorite passage.

This reminds me of a standard question that Don Rubin … asks in virtually any situation: “**What would you do if you had all the data?**” For me, that “what would you do” question is one of the universal solvents of statistics.

Emphasis added.

I had not heard Don Rubin’s question before, but I think I’ll be asking it often. It reminds me of Alice’s famous dialog with the Cheshire Cat:

“Would you tell me, please, which way I ought to go from here?”

“That depends a good deal on where you want to get to,” said the Cat.

“I don’t much care where–” said Alice.

“Then it doesn’t matter which way you go,” said the Cat.

**Related post**: Irrelevant uncertainty

> “What would you do if you had all the data?”

Well for starters, I think I would be looking for a new job.

Answer 1: Quickly go broke due to the data storage costs.

Answer 2: BACK IT UP BEFORE I DO SOMETHING STUPID

Answer 3: ….Probably uncover what the true rate of Asperger’s in the population is, and if it is indeed rising, or if that is just due to increased diagnosis.

Answer 4: Hire someone who is good with statistics

Last semester I was working with a professor on a business analytics project where we had all of the data that the business could give us. She was so happy throughout the entire thing. The p-values we had were roughly infinitesimal in most cases.

Really, the only thing we had to worry about was what the values meant, since sampling uncertainty was no longer the issue. If you get a .64 R^2 value for a regression and you don’t have all the data, then you can wonder if you have a good sample. But if you have all the data, then you start thinking about why it is .64.

Nevertheless, it was way fun.

If I knew what I would do if I had all of the data, then having the data is unnecessary to my decision making. The only rational response to that question is “I don’t know”. Why pose a question whose answer is already known? I’m either missing something fundamental about that quote and its context, or it’s a nonsensical absurdity.

Some banks “have all the data” on credit card transactions, but they still have trouble determining whether a new transaction is fraudulent because the card was recently stolen. Walmart “has all the data” on what has been sold in every store, but they still have problems anticipating demand and predicting how many of each item to order and ship.

Having all the data implies that you don’t have a sample, you have the entire population. Sample statistics like mean and variance become exact parameters. But it is still a challenge to USE the data to manage a business.
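To make the sample-versus-population point concrete, here is a minimal Python sketch. The five sales figures are made up for illustration; the only point is that the population formula (divide by N) gives the parameter itself, while the sample formula (divide by N - 1) exists to estimate a parameter you cannot see.

```python
import statistics

# Hypothetical "entire population": unit sales of one item in each of five stores.
population = [12.0, 15.0, 11.0, 18.0, 14.0]

# With every store in hand, these are parameters, not estimates.
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)  # divides by N

# The sample formula divides by N - 1; it is the wrong tool once nothing is unobserved.
s2 = statistics.variance(population)

print(mu, sigma2, s2)
```

Neither number says how many units to ship next month, which is the commenter's point: the parameters are exact, but the decision still needs a model.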

For Rubin I would guess “all of the data” often means the potential outcomes. In the simplest case, these are Y_1 and Y_0, the outcomes a unit would have under treatment and under control. The fundamental problem of causal inference is that we only ever observe one of these for each unit, so estimating a causal effect requires further assumptions such as ignorability.
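A small simulation may help show what “all the data” buys you in the potential-outcomes reading. Everything here is an assumption for illustration: the 1,000 units, the normal distribution for Y_0, the constant effect of 2.0, and the coin-flip assignment. It is a sketch of the idea, not Rubin's own procedure.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical potential outcomes: y0 under control, y1 under treatment.
# The constant effect of 2.0 is invented for this example.
units = []
for _ in range(1000):
    y0 = random.gauss(10.0, 1.0)
    units.append((y0, y0 + 2.0))

# "All the data": both outcomes are visible, so the average effect is plain arithmetic.
true_ate = sum(y1 - y0 for y0, y1 in units) / len(units)

# In practice each unit reveals only the outcome for the arm it was assigned to.
treated, control = [], []
for y0, y1 in units:
    if random.random() < 0.5:  # randomized assignment
        treated.append(y1)
    else:
        control.append(y0)

# Difference in observed means: an estimate that leans on the randomization.
estimated_ate = sum(treated) / len(treated) - sum(control) / len(control)

print(true_ate, estimated_ate)
```

With both columns visible there is nothing to infer. With one column hidden per unit, the difference in means is only trustworthy because assignment was random, which is the kind of assumption the comment above refers to.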

What I love about science is that every answer produces more questions. Even if you have all of the data, calculations of R^2, mean, and variance make assumptions about the nature of the data. So, question your models, e.g.: What functional family are you regressing against? Does the data really have Gaussian noise? Did you pick these models because they are mathematically convenient, “standard practice”, or because they really reflect the data? You can no longer hide behind the excuse that if your model doesn’t fit the data perfectly it’s because it is only a sample.
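As a rough illustration of the functional-family question, the sketch below fits a straight line to a complete, noise-free population in which y is exactly x squared. The data and the helper `r_squared` are invented for this example. Nothing is sampled and nothing is noisy, yet the R^2 is poor until the model matches the shape of the data.

```python
# Hypothetical complete population: y is exactly x squared, with no noise.
xs = [float(x) for x in range(-10, 11)]
ys = [x * x for x in xs]


def r_squared(xs, ys):
    """R^2 of the least-squares line y = a + b*x (squared correlation)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


r2_line = r_squared(xs, ys)                     # regress y on x
r2_quad = r_squared([x * x for x in xs], ys)    # regress y on x squared

print(r2_line, r2_quad)
```

The first value comes out at essentially 0 and the second at essentially 1, even though both fits use every data point there is. A low R^2 on the full population is a statement about the model, not about the sample.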

Conform it so that I (no we!) could see across date, geography and any other measured grain and slice and dice the heck out of it. Aggregate it; dis-aggregate it. Calculate ratios of sums. Derive efficiency ratios, and ultimately create an idea from it.

Then as my stats prof used to shout at the end of every lecture, ask myself “…. IS THIS CLEAR?”

I just wanted to say, I loved the quote, a lot.