Big data is easy; big models are hard.
If you just wanted to use simple models with tons of data, that would be easy. You could resample the data, throwing some of it away until you had a quantity of data you could comfortably manage.
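As a rough illustration of "throwing some of it away", here is a minimal sketch of uniform subsampling from a stream too large to hold in memory. It uses reservoir sampling, which is one common way to do this and not necessarily what any particular analysis would use; the function name `reservoir_sample` and its parameters are made up for this example.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Illustrative helper only: the name and signature are assumptions for this sketch.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Usage: shrink a million records to a manageable thousand.
subset = reservoir_sample(range(1_000_000), k=1000, seed=0)
```

The point of the sketch is that shrinking the data is mechanical; nothing about it helps with a model that needs all of the data.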
But when you have tons of data, you want to take advantage of it and ask questions that simple models cannot answer. (“Big” data is often indirect data.) So the problem isn’t that you have a lot of data, it’s that you’re using models that require a lot of data. And that can be very hard.
I am not saying people should just use simple models. No, people are right to want to take advantage of their data, and often that does require complex models. (See Brad Efron’s explanation why.) But the primary challenge is not the volume of data.
Related post: Big data and humility
I would also add that big context is hard, like big models. In my domain, there are lots of observations of temperature, pressure, etc. What doesn’t exist much, but should, is context. Without context, some questions cannot be answered, and any model or analysis is suspect. For example, the following questions are difficult or impossible to answer with regard to surface station observations:
1. What equipment/hardware took the measurement?
2. What maintenance was performed and when?
3. What was the siting of the equipment at the time of the measurement?
4. What was the local environment at the time of the measurement? Were changes to the local environment documented?
Without context, any models created and conclusions drawn may or may not be valid.
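To make the four questions above concrete, here is a minimal sketch of a context record attached to an observation, with a check for which questions are still unanswered. The class `ObservationContext`, its field names, and `missing_context` are all hypothetical and chosen for this example; real station metadata formats will differ.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ObservationContext:
    """Hypothetical context for one surface station observation."""
    equipment: Optional[str] = None          # 1. what hardware took the measurement
    last_maintenance: Optional[str] = None   # 2. what maintenance was done, and when
    siting: Optional[str] = None             # 3. siting at the time of measurement
    local_environment: Optional[str] = None  # 4. surroundings and documented changes

def missing_context(ctx):
    """List the context fields that were never recorded."""
    return [f.name for f in fields(ctx) if getattr(ctx, f.name) is None]

# Usage: only the equipment is known, so three questions remain open.
ctx = ObservationContext(equipment="example thermometer")
print(missing_context(ctx))
```

A check like this only flags gaps; it cannot recover context that was never written down.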
I think a corollary to this is that it is easy to say obvious things with big data; it is much harder to get big data to tell you non-obvious things and to trust the answer.
Eric: I could see how that would be especially important in studying weather. You don’t really have the result of one experiment, but thousands of experiments conducted under different settings.
I don’t know whether this is still the case, but at least as of a few years ago it was absolutely impossible to combine two microarray experiments. (People still did it, but the results were meaningless.) Lab conditions had a much larger effect than the difference in expression of this or that gene in two populations.
In my case the context does often exist, but only in individual people’s field notes or other hand-written, un-indexed logs… and the original note-taker is the only person who can understand the notes. In that case the main challenge is getting people to understand which things will count to you as data/context later, or convincing them that going through their own logs is worth it.
To tackle the problem of context, data provenance recording is very often applied. This forms a complementary layer to the raw data (the payload) and can be used as semantic, linked metadata.
We’ve been using provenance recording to answer such questions in scenarios such as simulation, process, conformance to regulations (e.g. laboratory best practices), equipment-related logs (e.g. type, maintenance, calibration), etc.
However, having this additional (“linkage layer”) information available obviously doesn’t make mining the big data any easier, even if it *does* contain the missing bits of information.
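A minimal sketch of such a linkage layer, assuming provenance is keyed by a hash of the raw payload so the two can be stored separately and joined later. This is only one possible design, not a description of any particular provenance system; `ProvenanceRecord`, `attach_provenance`, `lookup_provenance`, and the field names are invented for illustration.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance entry linked to a payload by its SHA-256 digest."""
    payload_sha256: str
    equipment_type: str
    calibration_date: str
    notes: str = ""

def payload_key(payload: bytes) -> str:
    # A content hash links metadata to the raw bytes without modifying them.
    return hashlib.sha256(payload).hexdigest()

def attach_provenance(store, payload, equipment_type, calibration_date, notes=""):
    """Record provenance for a payload in a dict-like store and return its key."""
    key = payload_key(payload)
    store[key] = ProvenanceRecord(key, equipment_type, calibration_date, notes)
    return key

def lookup_provenance(store, payload):
    """Return the provenance for a payload, or None if none was recorded."""
    return store.get(payload_key(payload))

# Usage: the raw reading stays untouched; its context lives in a separate layer.
store = {}
reading = b"temperature=21.5;pressure=1013.2"
attach_provenance(store, reading, "example sensor", "2020-01-01")
print(lookup_provenance(store, reading))
```

Keeping the records in a separate store reflects the "complementary layer" idea: the payload is unchanged, and the context can be queried on its own or joined back by key.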