Big data is getting a lot of buzz lately, but small data is interesting too. In some ways it’s more interesting. Because of limit theorems, a lot of things become dull in the large that are more interesting in the small.

When working with small data sets you have to accept that you will very often draw the wrong conclusion. You just can’t have high confidence in inference drawn from a small amount of data, unless you can do magic. But you do the best you can with what you have. You have to be content with the accuracy of your method relative to the amount of data available.

For example, a clinical trial may try to find the optimal dose of some new drug by giving the drug to only 30 patients. When you have five doses to test and only 30 patients, you’re just not going to find the right dose very often. You might want to assign 6 patients to each dose, but you can’t count on that. For safety reasons, you have to start at the lowest dose and work your way up cautiously, and that usually results in uneven allocation to doses, and thus less statistical power. And you might not treat all 30 patients. You might decide — possibly incorrectly — to stop the trial early because it appears that all doses are too toxic or ineffective. (This gives a glimpse of why testing drugs on people is a harder statistical problem than testing fertilizers on crops.)
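To get a feel for how seldom 30 patients point to the right dose, here is a minimal simulation sketch. It is deliberately simpler than a real dose-finding design: it assumes the even 6-per-dose allocation that, as noted above, you usually cannot count on; it ignores toxicity and early stopping; and it simply picks the dose with the most observed responses. The response probabilities in `rates` and the function names are hypothetical, chosen only for illustration.

```python
import random

def pick_best_dose(true_rates, patients_per_dose, rng):
    """Simulate one trial and return the index of the dose with the most responses."""
    responses = [
        sum(rng.random() < p for _ in range(patients_per_dose))
        for p in true_rates
    ]
    top = max(responses)
    # Break ties uniformly at random among doses sharing the top count.
    return rng.choice([i for i, r in enumerate(responses) if r == top])

def selection_accuracy(true_rates, patients_per_dose, n_trials=2000, seed=0):
    """Fraction of simulated trials that select the truly best dose."""
    rng = random.Random(seed)
    best = max(range(len(true_rates)), key=lambda i: true_rates[i])
    hits = sum(
        pick_best_dose(true_rates, patients_per_dose, rng) == best
        for _ in range(n_trials)
    )
    return hits / n_trials

# Hypothetical response probabilities for five doses (assumed, not real data).
rates = [0.20, 0.30, 0.40, 0.50, 0.45]

acc_small = selection_accuracy(rates, patients_per_dose=6)    # 30 patients total
acc_large = selection_accuracy(rates, patients_per_dose=600)  # 3000 patients total

print(f"30 patients:   best dose chosen in {acc_small:.0%} of simulated trials")
print(f"3000 patients: best dose chosen in {acc_large:.0%} of simulated trials")
```

Under these assumed rates the small trial should beat blind guessing among five doses (20%) but still miss the best dose in most runs, while the large trial gets it right most of the time. Uneven allocation and early stopping would only make the small-trial figure worse.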

Maybe your method finds the right answer 60% of the time, hardly a satisfying performance. But if alternative methods find the right answer 50% of the time under the same circumstances, your 60% looks great by comparison.

**Related post**: The law of medium numbers

Heh, you don’t report it as “gets the right answer 60% of the time”. You report it as “the method results in a 20% increase in accuracy over the old method”.

And when you conduct an experiment with the method, you announce the conclusion as THE CONCLUSION without saying “Of course, this result is not much better than a finger-in-the-wind estimate, but we did the best we could.”

By the way, when I say one method makes the right decision 60% of the time and another method 50%, that means that on a particular set of scenarios, the methods performed that well. Method comparisons are relative to the scenarios chosen, and these choices should be carefully scrutinized to make sure they’re suitable for a fair comparison. They should be realistic scenarios for the problem domain, and not chosen to make one method look better.

Great points here, particularly in light of the hype that “big data” has gotten recently. Some people likely believe that large datasets are much easier to work with, in light of the limit theorems you mention. Just keep averaging, things will converge to normal distributions and off you go doing the really “interesting” work. Meanwhile, smaller data sets force you to think about the assumptions of your data, checking for things like normality.

However, checking these “basic” assumptions is good for data of any size, particularly if you start making reasonable-looking assumptions that turn out to be very, very wrong. One example is assuming the Central Limit Theorem holds, which it does not, in its usual form, for power-law distributed data with infinite variance. In such a case, instead of averages settling toward a predictable normal distribution, they stay erratic, and subsequent analysis built on normality is pretty much useless.
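A rough way to see this for yourself is to watch how fast the spread of sample means shrinks as the sample grows. The sketch below is an illustration under assumptions of my own: it uses Pareto samples from Python's `random.paretovariate`, compares a tail index of 5 (finite variance) with 1.5 (finite mean, infinite variance), and measures spread by the interquartile range of repeated sample means. The function names and the particular sample sizes are arbitrary.

```python
import random
import statistics

def iqr_of_sample_means(alpha, n, reps, rng):
    """Interquartile range of `reps` sample means, each from n Pareto(alpha) draws."""
    means = [
        statistics.fmean(rng.paretovariate(alpha) for _ in range(n))
        for _ in range(reps)
    ]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

def shrink_factor(alpha, rng, n_small=100, n_large=10_000, reps=100):
    """How many times narrower the spread of the mean gets from n_small to n_large."""
    return (iqr_of_sample_means(alpha, n_small, reps, rng)
            / iqr_of_sample_means(alpha, n_large, reps, rng))

rng = random.Random(1)
light_ratio = shrink_factor(5.0, rng)   # finite variance: CLT-style 1/sqrt(n) scaling
heavy_ratio = shrink_factor(1.5, rng)   # infinite variance: much slower narrowing

print(f"tail index 5.0: spread shrinks by a factor of about {light_ratio:.1f}")
print(f"tail index 1.5: spread shrinks by a factor of about {heavy_ratio:.1f}")
```

With 100 times more data, the square-root rule suggests roughly a tenfold narrowing in the finite-variance case. The heavy-tailed case should narrow noticeably less, so "just keep averaging" buys far less than the normal approximation promises.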

Lest you think this is all theoretical, many financial models, some employed by real financial professionals, make this mistake, losing much of their own and others’ money.

Big data is interesting for different reasons. It’s not the size per se, but the kind of data.