One of the challenges with big data is to properly estimate your uncertainty. Often “big data” means a huge amount of data that isn’t exactly what you want.
As an example, suppose you have data on how a drug acts in monkeys and you want to infer how the drug acts in humans. There are two sources of uncertainty:
- How well do we really know the effects in monkeys?
- How well do these results translate to humans?
The former can be quantified, and so we focus on that, but the latter may be more important. There’s a strong temptation to believe that big data regarding one situation tells us more than it does about an analogous situation.
I’ve seen people reason as follows. We don’t really know how results translate from monkeys to humans (or from one chemical to a related chemical, from one market to an analogous market, etc.). We have a moderate amount of data on monkeys and we’ll decimate it and use that as if it were human data, say in order to come up with a prior distribution.
Down-weighting by a fixed ratio, such as 10 to 1, is misleading. If you had 10x as much data on monkeys, would you know as much about effects in humans as if the original smaller data set had been collected on people? What if you suddenly had “big data” involving every monkey on the planet? More data on monkeys drives down your uncertainty about monkeys, but does nothing to lower your uncertainty regarding how monkey results translate to humans.
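To make that concrete, here is a toy calculation (a minimal sketch with made-up numbers for the sampling and translation standard deviations): the sampling error shrinks like 1/√n, but the translation term puts a floor under your total uncertainty about humans.

```python
import numpy as np

# Hypothetical numbers, for illustration only.
sigma = 1.0   # within-species (monkey) sampling standard deviation
tau   = 0.5   # standard deviation of the monkey-to-human translation error

for n in [10, 100, 10_000, 1_000_000]:
    sampling_sd = sigma / np.sqrt(n)              # shrinks as n grows
    total_sd = np.sqrt(sampling_sd**2 + tau**2)   # never drops below tau
    print(f"n = {n:>9,}: monkey SE = {sampling_sd:.4f}, human uncertainty = {total_sd:.4f}")
```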
At some point, more data about analogous cases reaches diminishing returns and you can’t go further without data about what you really want to know. Collecting more and more data about how a drug works in adults won’t help you learn how it works in children. At some point, you need to treat children. Terabytes of analogous data may not be as valuable as kilobytes of highly relevant data.
What’s needed here is the ability to transfer knowledge gleaned from one task (predicting in monkeys) to another (predicting in humans). This is called “transfer learning.” One particularly nice (theoretical) approach is logistic regression with covariate shift; see “Discriminative Learning for Differing Training and Test Distributions” by Bickel et al.
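For what it’s worth, here is a rough two-stage sketch of the covariate-shift idea (not the exact joint estimation in the Bickel et al. paper, and all data and variable names below are made up): train a classifier to tell source from target inputs, turn its output into importance weights, and fit a weighted logistic regression on the labeled source data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: X_source, y_source are labeled "monkey" observations,
# X_target is unlabeled "human" data with a shifted covariate distribution.
rng = np.random.default_rng(0)
X_source = rng.normal(0.0, 1.0, size=(500, 3))
y_source = (X_source[:, 0] + 0.5 * X_source[:, 1] + rng.normal(size=500) > 0).astype(int)
X_target = rng.normal(0.5, 1.0, size=(300, 3))

# Step 1: discriminate source vs. target inputs to estimate density ratios.
X_both = np.vstack([X_source, X_target])
domain = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
domain_clf = LogisticRegression().fit(X_both, domain)
p_target = domain_clf.predict_proba(X_source)[:, 1]
weights = p_target / (1.0 - p_target)            # proportional to p_target(x) / p_source(x)

# Step 2: fit the task model on source data, reweighted toward the target distribution.
task_clf = LogisticRegression().fit(X_source, y_source, sample_weight=weights)
```

Note that reweighting like this only corrects for a shift in the input distribution; it can’t tell you whether the relationship between inputs and outcome itself differs between the two populations, which is the harder part of the monkey-to-human problem.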
Regarding choosing a prior by decimating possibly related data: sure, it’s messy and tricky to justify, but isn’t that what the fig leaf of “expert opinion” is for? Don’t construct a prior for human response by decimating monkey response data yourself; elicit the prior from an expert in the field. If it later comes up that the expert actually did decimate monkey data for the prior on human data, pat him or her on the back, since they didn’t just throw out some WAG.
Hmmm … preferring relevant data to analogous data … you goin’ Frequentist on us?
Weight and height — I like to use those handy BMI charts (which litter wellness centers like the leaves of Autumn) to find my ideal height, but I haven’t found the right exercise regimen for getting taller. Hey, at least I’m trying!
Expert opinion can be a fig leaf, but an expert who has monkey data may also have more knowledge about how results are likely to transfer across species.
Sounds like problems with an (asymptotically) biased estimator.
The analog in insurance modeling is often called primary and secondary uncertainty. Primary uncertainty is how well the model matches the thing you really care about (i.e. how well monkey results fit humans), and secondary uncertainty is the uncertainty due to the model’s sample size, etc.
I find the use of “primary” and “secondary” interesting. I work with small-sample statistics, where the two kinds of uncertainty may be comparable in magnitude.
This is fundamentally the same point Sander Greenland makes when he talks about model uncertainty, isn’t it? You can naively estimate a linear regression coefficient to very high precision, yet there is no a priori reason to suppose that the underlying model is linear. Hence the narrow confidence intervals are illusory. At least there you have some residuals to look at, although getting applied scientists to even plot them is challenging…
When I analyze small and medium-sized data, I make a lot of graphs: exploratory graphs to help formulate a model, and diagnostic plots to help evaluate the correctness of the model. (Then iterate.) When someone fits a model to big data, these graphical techniques are often not possible (see Unwin/Theus/Hofmann, Graphics of Large Datasets: Visualizing a Million), or are at least greatly hindered. The result? It is hard to assess model fit. Do you include this “model uncertainty” as part of Source #1?
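Here is a minimal sketch of a diagnostic that still works at scale (hypothetical data, and a deliberately misspecified straight-line fit): hex-binned residuals reveal the curvature that the tiny standard errors from a two-million-point fit would otherwise hide.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical "big data" fit: a straight line through mildly nonlinear data.
rng = np.random.default_rng(1)
n = 2_000_000
x = rng.uniform(-3, 3, size=n)
y = x + 0.3 * x**2 + rng.normal(0, 1, size=n)   # the true relationship is not linear

slope, intercept = np.polyfit(x, y, 1)          # naive linear fit
residuals = y - (intercept + slope * x)

# A scatter plot of two million points is a solid blob; hex-binning on a log
# color scale makes the leftover curvature (the lack of fit) visible.
plt.hexbin(x, residuals, gridsize=60, bins="log")
plt.axhline(0, color="red")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```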
John,
You are of course correct. I wasn’t very serious in characterizing expert opinion as a fig leaf, although it could be used as such.
What I do wonder about is whether I would rather use monkey data or expert opinion without those monkey data, and which I’d be happier with when reading someone’s publication. At least monkey data can be published along with whatever methods were used to obtain the estimates, for others to examine and critique; with expert opinion you can really only supply the name, perhaps along with a statement from the expert. But often, in my experience, expert opinion is just a “best guess”, with no way to meaningfully describe how it was obtained beyond the assertion that the expert is an expert and that this is their guess.
I suppose it gets back to the Bayesian / Frequentist question, but perhaps upside down, in the question of where uncertainty lies and how it is dealt with.
Model uncertainty is not measurable, unlike uncertainty due to sampling. A model is always hypothetical; we can only reject it, never accept it. Nevertheless, a model may be validated, which means it was subjected to a test whose purpose was to reject it. A model is strong if it has passed a lot of tests over time. This is the case with e.g. general relativity, which has resisted tests for more than a hundred years (though a recent discovery may prove it false).
Furthermore, the “measurable” uncertainty due to sampling is dependent on the assumption that the model is correct.
Excellent example. I think this applies to many studies and not just “big data”.
There are going to be errors and lack of coverage in any study with finite funding. Some of these count as “numerical ε” errors and some are qualitative. What’s the difference between nitpicking and legitimate criticism of a field study? The answer can be quantified, sort of (like a t-score), but sometimes the difference comes down to a (subjective) judgment call.