This afternoon Hadley Wickham gave a great talk on data analysis. Here’s a paraphrase of something profound he said.
Visualization can surprise you, but it doesn’t scale well.
Modeling scales well, but it can’t surprise you.
Visualization can show you something in your data that you didn’t expect. But some things are hard to see, and visualization is a slow, human process.
Modeling might tell you something slightly unexpected, but your choice of model restricts what you’re going to find once you’ve fit it.
So you iterate. Visualization suggests a model, and then you use your model to factor out some feature of the data. Then you visualize again.
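One pass through that loop can be sketched in a few lines of Python. This is only an illustration, not anything from the talk: the data is synthetic, the "model" is a plain least-squares line, and `fit_line`, `residuals`, and `text_plot` are hypothetical helper names invented here. A crude text plot stands in for a real graphics library.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residuals(xs, ys, a, b):
    """What is left of the data after the fitted line is factored out."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def text_plot(values, width=40):
    """Crude plot: one row per point, '*' placed by the point's value."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    rows = []
    for v in values:
        pos = round((v - lo) / span * (width - 1))
        rows.append(" " * pos + "*")
    return "\n".join(rows)

# Synthetic data (an assumption for illustration): a strong linear trend
# hiding a much smaller wiggle.
xs = [i / 2 for i in range(40)]
ys = [1.0 + 2.0 * x + 0.5 * math.sin(x) for x in xs]

# Model: fit the trend that a first plot of ys against xs would suggest.
a, b = fit_line(xs, ys)

# Factor the trend out, then visualize again: the residuals show the wiggle.
res = residuals(xs, ys, a, b)
print(text_plot(res))
```

In a plot of the raw data the trend dominates and the wiggle is easy to miss. Once the line is removed, the wiggle is the main thing left to see, and it suggests the next model to try.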
Have you ever looked at Structure Learning for extending models? Very cool stuff. Daphne Koller used it to find new cell protein interactions.
John:
Yes, this is what we are trying to get at in Bayesian Data Analysis. You iterate the following 3 steps: (1) model building, (2) inference conditional on the model, (3) model checking. The better you do (1) and (2), the more informative step (3) will be.
The paradox, if there is one, is that people tend to think of steps 2 and 3 as competing: in step 2 you (temporarily) commit to a belief, whereas in step 3 you look for problems. I think these go together (really, that’s what the scientific method is all about), but I’ve found that, in many cases, people who spend a lot of time with a model don’t want to check it, while people who spend a lot of time on exploratory data analysis don’t like models at all.
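For concreteness, here is a minimal sketch of the three steps on a toy problem. Everything in it is an assumption made for illustration: the data is made up, the model treats the observations as independent coin flips with a Beta prior, and the check statistic (the number of switches between 0 and 1) and the function names are choices made here, not a prescribed method.

```python
import random

def count_switches(seq):
    """Test statistic: number of times consecutive values differ."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def posterior_predictive_check(data, prior_a=1.0, prior_b=1.0,
                               draws=2000, seed=0):
    """Return (observed switches, share of replicates with as few or fewer)."""
    rng = random.Random(seed)
    n, k = len(data), sum(data)
    observed = count_switches(data)

    # Step 1 (model): independent Bernoulli(theta) draws, Beta prior on theta.
    # Step 2 (inference): the conjugate update gives a Beta posterior.
    post_a, post_b = prior_a + k, prior_b + n - k

    # Step 3 (checking): simulate replicated data from the fitted model and
    # see how often it looks as clustered as the real data.
    as_few = 0
    for _ in range(draws):
        theta = rng.betavariate(post_a, post_b)
        replicate = [1 if rng.random() < theta else 0 for _ in range(n)]
        if count_switches(replicate) <= observed:
            as_few += 1
    return observed, as_few / draws

# Made-up binary sequence with visibly long runs.
data = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
observed, p = posterior_predictive_check(data)
print(observed, p)
```

A small share here says the independence assumption rarely reproduces runs this long, which is a reason to go back to step 1 and build a model with dependence between neighboring observations.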
Andrew: I agree, and too often step 3 is missing.
Statistics without model checking makes me uneasy, to put it mildly. Some argue that model checking is less important in Bayesian statistics, but I don’t buy that. If anything, because Bayesian analysis makes it easier to construct complex models, there may be more need for model checking.
John:
Yup. As Bayes once said, with great power comes great responsibility.