Big data is not enough

Given enough data, correct answers jump out at you, right?

In some ways I think that scientists have misled themselves into thinking that if you collect enormous amounts of data you are bound to get the right answer. You are not bound to get the right answer unless you are enormously smart. You can narrow down your questions; but enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.

Bradley Efron, quoted in Significance. Emphasis added.


12 thoughts on “Big data is not enough”

  1. I never thought that the drive behind “big data” was that you’d get the right answers easily.

    Businesses have been using data warehouses for well over a decade now. We have lots of experience with data-driven business intelligence. Data mining had its days too.

    And what we have learned is that it is extremely difficult to get even the simple things right. For example, you want to mine association rules? Forget it! In most instances, it is a pure waste of time to try to find meaningful associations.

    And I’m not even taking into account the experiences of statisticians over something like a century, using much smaller and cleaner data sets… They have lots of experience regarding the difficulties… and those difficulties are now several orders of magnitude more acute.

    No. I think that the drive behind big data is that storage is cheap, sensors (including non-physical sensors such as Twitter) are cheap, and the software to store and index the data is cheap (Hadoop, MySQL, whatever you like). So, even if you are not doing it, your competitors probably are.

    For example, anyone has access to the metadata of the WikiLeaks diplomatic cables. Look it up. It is a gigantic data set by 1990 standards. It is out there. It is easy to manipulate. Easy to visualize (within reason).

    Ok. Now what?

    Well. I think we are precisely where we are. Now what?

    Google appears to be doing well at exploiting some of this data. Amazon is doing cool stuff… But, overwhelmingly, we are in the dark, and we will probably remain in the dark for years to come. The difference from before is that we have these gigantic data sets at our disposal.

    I think that many companies are worried that competitors are going to exploit this data before they do. And the potential is huge to change the game. For example, marketing could still get better, by orders of magnitude. But except maybe at places like Google and Amazon, I don’t think we have yet realized much in terms of exploiting big data.

    In science, certainly, we are good at collecting and processing large data sets, but it is unclear whether we are good at exploiting them for real results. It certainly is not true in general.

  2. If you have a biased sample, then a big sample will amplify your bias. The best example is the Literary Digest poll of the 1936 presidential election (Roosevelt vs. Landon).
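    A hypothetical simulation makes the point concrete (all numbers here are invented for illustration, not taken from the 1936 poll): when the sampling frame over-represents one group, the estimate converges to the wrong answer no matter how large the sample grows, while a small random sample lands near the truth.

    ```python
    import random

    random.seed(0)

    # Hypothetical electorate (invented numbers, for illustration only):
    # 30% of voters are reachable through the sampling frame (think of the
    # Digest's telephone and automobile owner lists); reachable voters
    # favor Landon, unreachable voters favor Roosevelt.
    P_REACHABLE = 0.30
    P_LANDON_IF_REACHABLE = 0.65
    P_LANDON_IF_NOT = 0.25

    TRUE_LANDON_SHARE = (P_REACHABLE * P_LANDON_IF_REACHABLE
                         + (1 - P_REACHABLE) * P_LANDON_IF_NOT)  # 0.37

    def poll(n, frame="random"):
        """Estimate Landon's vote share from n respondents.
        frame='reachable' samples only voters in the biased frame."""
        votes = 0
        for _ in range(n):
            reachable = frame == "reachable" or random.random() < P_REACHABLE
            p = P_LANDON_IF_REACHABLE if reachable else P_LANDON_IF_NOT
            votes += random.random() < p
        return votes / n

    small_random = poll(2_000)                  # lands near the truth, 0.37
    huge_biased = poll(1_000_000, "reachable")  # converges to 0.65
    ```

    Even at a million respondents, the biased poll confidently reports the wrong winner; the error lives in the frame, not in the sample size.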

  3. Raza: I think Brad Efron and Peter Norvig are talking about different things.

    A crude statistical method will usually work well given enough relevant data. But it may be hard to make use of data you didn’t ask for.

    Consider genetic data. Suppose you’d really like to have data on whether 500 patients responded to a medical treatment. But instead you have the entire genetic sequence on 20 patients. You’d asked for 500 bytes of clinical data, but instead you got gigabytes of genetic data. More data is better, right? Unless you know how to take advantage of that genetic data, you’d be better off with the smaller data set. I believe Efron would say that the genetic data is actually three billion data sets, each of size 20.
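    A hypothetical sketch of that framing (the marker count and threshold are mine, purely for illustration): with only 20 patients, screening a few thousand pure-noise markers against a pure-noise outcome still yields plenty of strong-looking correlations by chance alone.

    ```python
    import random
    import statistics

    random.seed(1)

    N_PATIENTS = 20
    N_MARKERS = 5_000   # stand-in for millions of genomic positions

    def pearson(x, y):
        """Sample Pearson correlation of two equal-length sequences."""
        mx, my = statistics.fmean(x), statistics.fmean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        dx = sum((a - mx) ** 2 for a in x) ** 0.5
        dy = sum((b - my) ** 2 for b in y) ** 0.5
        return sxy / (dx * dy)

    # Outcome and markers are all pure noise: no real signal anywhere.
    outcome = [random.gauss(0, 1) for _ in range(N_PATIENTS)]
    markers = [[random.gauss(0, 1) for _ in range(N_PATIENTS)]
               for _ in range(N_MARKERS)]

    # With n = 20, chance alone produces many |r| > 0.5 "hits".
    strong = sum(abs(pearson(m, outcome)) > 0.5 for m in markers)
    ```

    Each marker is a tiny data set of size 20; examined one at a time, the multiple-comparisons problem swamps any real signal you might have hoped to find.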

  4. Really good point, John. It’s the observational data set issue, magnified millions of times by big data. For example: most web data, until recently, was collected by engineering systems for engineering purposes (things like the OS, monitor, bandwidth, PC make), and no matter how much of this data one gets, it will only be weakly correlated with consumer behavior. Search data will tell you some things very well, namely our search behavior, but will not tell us much about other things, say what factors we consider when choosing between brand X and brand Y.

  5. Would really like to read this article if anyone has the PDF. Really disappointing to find yet another exciting resource that has a big paywall. Anyone want to comment on the value of this magazine? Love the concept, and have always wondered why such a thing didn’t exist, but can’t see joining ASA or RSS to get it (I’m not a professional statistician anyway).

    Anyone know if public libraries ever carry it? When are micropayments finally going to arrive??!!

    Thanks,
    Scott Edwards
    Pasadena, CA

  6. Speaking from a little amateur experience in signal processing: people (with average, meaning low, background in modeling or statistics) keep sampling, collecting, and gathering data at the highest affordable rate (see Daniel Lemire’s comment), just in case they can figure out, at some point in the future, what they were EXACTLY looking for. But this hamster-like hoarding is largely unconscious. The unconsciousness shows up notably as: 1) lack of practical background in sampling, derivation, and correlation with (more) data; 2) lack of memory of what was already acquired, stored, and explained in the past; 3) lack of ideology about what one could prove or disprove with more data. Big data is like teenage sex: everyone is interested in it, but they don’t know what to do.

  7. Not all problems are enormously difficult. Lots of data + simple algorithms definitely helps solve some important problems for my team at Google.

  8. Over four years later, and this still resonates with me. I think the common assumption with big data is that it replaces the need for domain knowledge. I have also seen it used to separate analysis from the scientific method, the consequences of which (wildly erroneous conclusions) are often amusing but potentially scary, depending on the subject matter.

    The comments on this post are very good, written by some very bright people! Daniel Lemire mentions data warehouses. They were difficult to implement, in my experience. Google and Amazon do remarkable things with data for retail sales and marketing. It isn’t so obvious whether that will be possible for health care or in enterprise settings. Big data is useful in the context of fraud detection, log file analysis and probably sensor data (so-called Internet of Things, though security issues will be a high hurdle there). In general, big data is not enough. Statisticians need programmers, and vice-versa, to be effective.

    I like Significance magazine a lot. Good choice!
