This is what the book Social Media Mining calls the Big Data Paradox:
Social media data is undoubtedly big. However, when we zoom into individuals for whom, for example, we would like to make relevant recommendations, we often have little data for each specific individual. We have to exploit the characteristics of social media and use its multidimensional, multisource, and multisite data to aggregate information with sufficient statistics for effective mining.
Brad Efron said something similar:
… enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.
Big data doesn’t always tell us directly what we’d like to know. It may give us a gargantuan amount of slightly related data, from which we may be able to tease out what we want.
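To make the paradox concrete, here is a minimal simulation sketch. It is not from the book or from Efron. Every number in it (user count, ratings per user, noise levels, the population mean) is an assumption chosen only for illustration. It shows a data set that is large in total yet thin for any one individual: the pooled estimate is tight while the per-user estimates are not.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# All constants below are hypothetical, picked only for illustration.
N_USERS = 10_000        # many individuals ...
RATINGS_PER_USER = 3    # ... but very little data on each one
POPULATION_MEAN = 3.5   # shared average preference
PREFERENCE_SD = 0.5     # how much individuals differ from the average
NOISE_SD = 1.0          # noise in each single observation

# Each user's true preference scatters around the population mean.
true_pref = [random.gauss(POPULATION_MEAN, PREFERENCE_SD) for _ in range(N_USERS)]

# Each observed rating is the user's true preference plus noise.
ratings = [
    [random.gauss(p, NOISE_SD) for _ in range(RATINGS_PER_USER)]
    for p in true_pref
]

total_observations = sum(len(r) for r in ratings)

# Estimate each user's preference from that user's own data only.
user_means = [statistics.fmean(r) for r in ratings]
per_user_error = statistics.fmean(
    abs(m - p) for m, p in zip(user_means, true_pref)
)

# Estimate the population mean from all observations pooled together.
pooled_mean = statistics.fmean(x for r in ratings for x in r)
pooled_error = abs(pooled_mean - POPULATION_MEAN)

print(f"total observations:           {total_observations}")
print(f"mean error per individual:    {per_user_error:.3f}")
print(f"error of the pooled estimate: {pooled_error:.3f}")
```

Under these assumptions the aggregate question is answered far more precisely than the individual one, even though both use the same 30,000 observations. Borrowing strength across individuals, as the quotes above suggest, is one way to narrow that gap.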
Related post: New data, not just bigger data
3 thoughts on “Big data paradox”
Another antipattern that surfaces in the “enterprise” space is big data that is only big because all life stages of an object have been overloaded into it. This drives a lot of the perceived need for “analytics”, much of which would fall away if the business-process context of the data had not been flattened out.
I cannot help but draw a parallel between social-media-generated big data and the psychohistory of Isaac Asimov’s Foundation, where vast amounts of historical data allowed one to draw conclusions and predict the behavior of the masses, something that has been proven time and again by the effectiveness of social media marketing.
Nevertheless, as this blog post suggests, and as with psychohistory, all this data is fairly useless for tracing patterns in, or predicting, a single individual’s behavior.
Analyzing big data makes a couple of assumptions that aren’t always apparent. I am addressing here big data that consists simply of passive observations, with no interventions.
First, there is an assumption that the number of distinct mechanisms producing “the signal” is much, much smaller than the number of units observed. The hope is that the units “cluster together” in some way, one mechanism per cluster.
Second, there is an assumption, quite key, that the noise or perturbation model for all the units is similar. If it is not, then the signal cannot be separated from the noise or perturbation, and there is an identifiability question.
Now, if there are interventions, this can help, as long as their effects apply only to the mechanisms and not to the perturbations.