The normal distribution can approximate many other distributions, though the details such as quantitative error estimates and what factors improve or degrade the approximation are harder to find. Here are some notes on normal approximations to several common probability distributions.
When you sort data and look at which sample falls in a particular position, that’s called order statistics. For example, you might want to know the smallest, largest, or middle value.
Order statistics are robust in a sense. The median of a sample, for example, is a very robust measure of central tendency. If Bill Gates walks into a room with a large number of people, the mean wealth jumps tremendously but the median hardly budges.
But order statistics are not robust in this sense: the identity of the sample in any given position can be very sensitive to perturbation. Suppose a room has an odd number of people so that someone has the median wealth. When Bill Gates and Warren Buffett walk into the room later, the value of the median wealth may not change much, but the person corresponding to that wealth will change.
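Here's a small Python sketch of both points. The names and dollar figures are made up for illustration, and the two newcomers are given a round, hypothetical 100 billion each.

```python
from statistics import mean, median

# Hypothetical net worths in dollars; names and figures are invented.
room = {"Ann": 40_000, "Bob": 55_000, "Cat": 70_000, "Dan": 90_000, "Eve": 120_000}

def median_person(wealth):
    """Name of the person holding the median wealth (assumes an odd head count)."""
    ranked = sorted(wealth, key=wealth.get)
    return ranked[len(ranked) // 2]

def summarize(wealth):
    """Return (mean wealth, median wealth, name of the median person)."""
    values = list(wealth.values())
    return mean(values), median(values), median_person(wealth)

before = summarize(room)

# Two billionaires walk in (round, made-up figures).
after = summarize({**room, "Billionaire 1": 100e9, "Billionaire 2": 100e9})

print(before)
print(after)
```

The mean goes from tens of thousands to tens of billions. The median only moves up one position in the sorted list, but the person sitting at the median is no longer the same person.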
One way to evaluate machine learning algorithms is by how often they pick the right winner in some sense. For example, dose-finding algorithms are often evaluated on how often they pick the best dose from a set of doses being tested. This can be a terrible criterion, causing researchers to be misled by a particular set of simulation scenarios. It’s more important how often an algorithm makes a good choice than how often it makes the best choice.
Suppose five drugs are being tested. Two are nearly equally effective, and three are much less effective. A good experimental design will lead to picking one of the two good drugs most of the time. But if the best drug is only slightly better than the next best, it’s too much to expect any design to pick the best drug with high probability. In this case it’s better to measure the expected utility of a decision rather than how often a design makes the best decision.
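A quick simulation makes the difference between the two criteria concrete. This is only a sketch under assumptions of my own choosing: the response rates, the 50 patients per arm, the "pick the arm with the most responders" rule, and the use of the true response rate as the utility are all hypothetical, not taken from any real design.

```python
import random

def evaluate_design(rates, n_per_arm=50, n_trials=2000, tol=0.02, seed=1):
    """Estimate P(pick best), P(pick a good arm), and expected utility.

    A "good" arm is one whose true rate is within tol of the best rate.
    Utility is taken to be the true response rate of the chosen arm.
    """
    rng = random.Random(seed)
    best = max(rates)
    picked_best = picked_good = 0
    total_utility = 0.0
    for _ in range(n_trials):
        # Observed number of responders on each arm.
        responders = [sum(rng.random() < r for _ in range(n_per_arm)) for r in rates]
        # Choose the arm with the most responders, breaking ties at random.
        top = max(responders)
        choice = rng.choice([i for i, x in enumerate(responders) if x == top])
        picked_best += rates[choice] == best
        picked_good += rates[choice] >= best - tol
        total_utility += rates[choice]
    return picked_best / n_trials, picked_good / n_trials, total_utility / n_trials

# Two nearly equal good drugs and three much worse ones (made-up rates).
p_best, p_good, utility = evaluate_design([0.50, 0.49, 0.20, 0.20, 0.20])
print(p_best, p_good, utility)
```

Judged by how often it picks the single best drug, this design looks like little better than a coin flip. Judged by how often it picks a good drug, or by expected utility, it looks nearly as good as it could be.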
Suppose you’re drawing random samples uniformly from some interval. How likely are you to see a new value outside the range of values you’ve already seen?
The problem is more interesting when the interval is unknown. You may be trying to estimate the end points of the interval by taking the max and min of the samples you’ve drawn. But in fact we might as well assume the interval is [0, 1] because the probability of a new sample falling within the previous sample range does not depend on the interval. The location and scale of the interval cancel out when calculating the probability.
Suppose we’ve taken n samples so far. The range of these samples is the difference between the 1st and the nth order statistics, and for a uniform distribution this difference has a beta(n-1, 2) distribution. Since a beta(a, b) distribution has mean a/(a+b), the expected value of the sample range from n samples is (n-1)/(n+1). This is also the probability that the next sample, or any particular future sample, will lie within the range of the samples seen so far.
If you’re trying to estimate the size of the total interval, this says that after n samples, the probability that the next sample will give you any new information is 2/(n+1). This is because we only learn something when a sample is less than the minimum so far or greater than the maximum so far.
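Here's a simulation sketch to check the (n-1)/(n+1) result, including the claim that the interval doesn't matter. The function name and the parameters `lo` and `hi` are my own choices for illustration.

```python
import random

def prob_next_in_range(n, lo=0.0, hi=1.0, n_trials=20_000, seed=0):
    """Estimate the chance that sample n+1 lands inside the range of the first n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        samples = [rng.uniform(lo, hi) for _ in range(n)]
        hits += min(samples) <= rng.uniform(lo, hi) <= max(samples)
    return hits / n_trials

for n in (2, 5, 10, 50):
    simulated = prob_next_in_range(n, lo=-3.0, hi=7.0)
    print(n, round(simulated, 3), round((n - 1) / (n + 1), 3))
```

The two columns should agree to within simulation error, whatever values you choose for `lo` and `hi`.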
Cancer research is sometimes criticized for being timid. Drug companies run enormous trials looking for small improvements. Critics say they should run smaller trials and more of them.
Which side is correct depends on what’s out there waiting to be discovered, which of course we don’t know. We can only guess. Timid research is rational if you believe there are only marginal improvements that are likely to be discovered.
Sample size increases quickly as the size of the effect you’re trying to find decreases. To establish small differences in effect, you need very large trials.
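To put rough numbers on this, here's a sketch using the textbook normal-approximation formula for comparing two means with a two-sided test. The formula, the 5% significance level, and the 80% power are assumptions I'm bringing in for illustration; a real trial design would involve much more.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(effect, sd=1.0, alpha=0.05, power=0.8):
    """Approximate patients per arm needed to detect a difference in means.

    Uses n = 2 * ((z_{1 - alpha/2} + z_{power}) * sd / effect)^2.
    """
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sd / effect) ** 2)

for effect in (1.0, 0.5, 0.2, 0.1, 0.05):
    print(effect, n_per_arm(effect))
```

The required sample size is proportional to the inverse square of the effect size, so cutting the effect in half quadruples the trial.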
If you think there are only small improvements on the status quo available to explore, you’ll explore each of the possibilities very carefully. On the other hand, if you think there’s a miracle drug in the pipeline waiting to be discovered, you’ll be willing to risk falsely rejecting small improvements along the way in order to get to the big improvement.
Suppose there are 500 drugs waiting to be tested. All of these are only 10% effective except for one that is 100% effective. You could quickly find the winner by giving each candidate to one patient. For every drug whose patient responded, repeat the process until only one drug is left. One strike and you’re out. You’re likely to find the winner in three rounds, treating fewer than 600 patients. But if all the drugs are 10% effective except one that’s 11% effective, you’d need hundreds of trials with thousands of patients each.
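Here's a sketch that simulates the one-strike-and-you're-out screen described above. The function name and the convention that drug 0 is the winner are my own.

```python
import random

def one_strike_screen(n_drugs=500, p_dud=0.10, p_winner=1.0, seed=None):
    """Give each surviving drug to one patient per round; drop every non-responder.

    Returns (rounds, patients treated, list of surviving drugs).
    """
    rng = random.Random(seed)
    # Drug 0 is the winner; every other drug is a dud.
    rates = [p_winner] + [p_dud] * (n_drugs - 1)
    alive = list(range(n_drugs))
    rounds = patients = 0
    while len(alive) > 1:
        rounds += 1
        patients += len(alive)
        alive = [d for d in alive if rng.random() < rates[d]]
    return rounds, patients, alive

results = [one_strike_screen(seed=s) for s in range(1000)]
print("average rounds:  ", sum(r for r, _, _ in results) / len(results))
print("average patients:", sum(n for _, n, _ in results) / len(results))
```

Try setting `p_winner` to 0.11 to see why this strategy is hopeless when the winner is only slightly better than the rest: the winner is usually eliminated in the first round.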
The best research strategy depends on what you believe is out there to be found. People who know nothing about cancer often believe we could find a cure soon if we just spend a little more money on research. Experts are less sanguine, except when they’re asking for money.
There’s a theorem in statistics that says
E( X̄ ) = μ
You could read this aloud as “the mean of the mean is the mean.” More explicitly, it says that the expected value of the average of some number of samples from some distribution is equal to the expected value of the distribution itself. The shorter reading is confusing since “mean” refers to three different things in the same sentence. In reverse order, these are:
- The mean of the distribution, defined by an integral.
- The sample mean, calculated by averaging samples from the distribution.
- The mean of the sample mean as a random variable.
The hypothesis of this theorem is that the underlying distribution has a mean. Let’s see where things break down if the distribution does not have a mean.
It’s tempting to say that the Cauchy distribution has mean 0. Or some might want to say that the mean is infinite. But if we take any value to be the mean of a Cauchy distribution — 0, ∞, 42, etc. — then the theorem above would be false. The mean of n samples from a Cauchy has the same distribution as the original Cauchy! The variability does not decrease with n, as it would with samples from a normal, for example. The sample mean doesn’t converge to any value as n increases. It just keeps wandering around with the same distribution, no matter how large the sample. That’s because the mean of the Cauchy distribution simply doesn’t exist.
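You can watch this happen numerically. The sketch below measures the spread of the sample mean by its interquartile range, since the variance of a Cauchy doesn't exist either. The helper names are mine; Cauchy samples are generated by the inverse CDF method.

```python
import random
from math import pi, tan
from statistics import quantiles

def iqr_of_sample_means(draw, n, reps=2000, seed=0):
    """Interquartile range of the mean of n samples, estimated from reps repetitions."""
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(reps)]
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1

def cauchy(rng):
    # Inverse-CDF sampling for a standard Cauchy.
    return tan(pi * (rng.random() - 0.5))

def normal(rng):
    return rng.gauss(0, 1)

for n in (1, 10, 100):
    print(n, iqr_of_sample_means(normal, n), iqr_of_sample_means(cauchy, n))
```

The normal column shrinks like 1/sqrt(n). The Cauchy column stays near 2, the interquartile range of a single standard Cauchy sample, no matter how many samples go into the average.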
Suppose a large number of people each have a slightly better than 50% chance of correctly answering a yes/no question. If they answered independently, the majority would very likely be correct.
For example, suppose there are 10,000 people, each with a 51% chance of answering a question correctly. The probability that more than 5,000 people will be right is about 98%. 
The key assumption here is independence, which is not realistic in most cases. But as people move in the direction of independence, the quality of the majority vote improves. Another assumption is that people are what machine learning calls “weak learners,” i.e. that they perform slightly better than chance. This holds more often than independence, but on some subjects people tend to do worse than chance, particularly experts.
You could call this the wisdom of crowds, but it’s closer to the wisdom of markets. As James Surowiecki points out in his book The Wisdom of Crowds, crowds (as in mobs) aren’t wise; large groups of independent decision makers are wise. Markets are wiser than crowds because they aggregate more independent opinions. Markets are subject to group-think as well, but not to the same extent as mobs.
Suppose there are N people, each with independent probability p of being correct. Suppose N is large and p is near 1/2. Then the probability of a majority answering correctly is approximately
Prob( Z > (1 - 2p) sqrt(N) )
where Z is a standard normal random variable. You could calculate this in Python by
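    from math import sqrt
    from scipy.stats import norm

    N, p = 10000, 0.51  # the example above: 10,000 people, each right 51% of the time
    print(norm.sf((1 - 2*p) * sqrt(N)))  # Prob(Z > (1 - 2p) sqrt(N))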
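To see how good the approximation is, you can compare it with the exact binomial sum. This sketch uses only the standard library; the function names are mine, and the binomial probabilities are computed on the log scale to avoid overflow for large N.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def majority_exact(N, p):
    """P(more than N/2 of N independent voters are correct), by direct summation."""
    def log_pmf(k):
        return (lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
                + k * log(p) + (N - k) * log(1 - p))
    return sum(exp(log_pmf(k)) for k in range(N // 2 + 1, N + 1))

def majority_approx(N, p):
    """The normal approximation Prob(Z > (1 - 2p) sqrt(N))."""
    return 1 - NormalDist().cdf((1 - 2 * p) * sqrt(N))

print(majority_exact(10000, 0.51))
print(majority_approx(10000, 0.51))
```

For 10,000 people who are each right 51% of the time, both numbers come out near 98%.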
This post is an elaboration of something I first posted on Google+.
Russ Roberts had this to say about the proposal to replace the calculus requirement with statistics for students.
Statistics is in many ways much more useful for most students than calculus. The problem is, to teach it well is extraordinarily difficult. It’s very easy to teach a horrible statistics class where you spit back the definitions of mean and median. But you become dangerous because you think you know something about data when in fact it’s kind of subtle.
A little knowledge is a dangerous thing, more so for statistics than calculus.
This reminds me of a quote by Stephen Senn:
Statistics: A subject which most statisticians find difficult but in which nearly all physicians are expert.
David Hogg calls conventional statistical notation a “nomenclatural abomination”:
The terminology used throughout this document enormously overloads the symbol p(). That is, we are using, in each line of this discussion, the function p() to mean something different; its meaning is set by the letters used in its arguments. That is a nomenclatural abomination. I apologize, and encourage my readers to do things that aren’t so ambiguous (like maybe add informative subscripts), but it is so standard in our business that I won’t change (for now).
I found this terribly confusing when I started doing statistics. The meaning is not explicit in the notation but implicit in the conventions surrounding its use, conventions that were foreign to me since I was trained in mathematics and came to statistics later. When I would use letters like f and g for functions, collaborators would say “I don’t know what you’re talking about.” Neither did I understand what they were talking about since they used one letter for everything.
When I was in college, I overheard two senior faculty arguing over an undergraduate probability homework assignment. This seemed very strange. It occurred to me that I’d never seen faculty argue over something elementary before, and I couldn’t imagine an argument over, say, a calculus homework problem. Professors might forget how to do a calculus problem, or make a mistake in a calculation, but you wouldn’t see two professors defending incompatible solutions.
Intuitive discussions of probability are very likely to be wrong. Experts know this. They’ll say things like “I imagine the answer is around this, but I’d have to go through the calculations to be sure.” Probability is not like physics where you can usually get within an order of magnitude of a correct answer without formal calculation. Probabilistic intuition doesn’t take you as far as physical intuition.
I was culling out books, mostly obsolete technical books, and I remembered that I have an extra copy of Feller’s classic probability text. It’s volume 1, second edition. If you’re a student and would like the book, please send me an email with your mailing address.
Update: The book was claimed 11 minutes after this post was published.
John Ioannidis stirred up a healthy debate when he published Why Most Published Research Findings Are False. Unfortunately, most of the discussion has been over whether the word “most” is correct, i.e. whether the proportion of false results is more or less than 50 percent. At least there is more awareness that some published results are false and that it would be good to have some estimate of the proportion.
However, a more fundamental point has been lost. At the core of Ioannidis’ paper is the assertion that the proportion of true hypotheses under investigation matters. In terms of Bayes’ theorem, the posterior probability of a result being correct depends on the prior probability of the result being correct. This prior probability is vitally important, and it varies from field to field.
In a field where it is hard to come up with good hypotheses to investigate, most researchers will be testing false hypotheses, and most of their positive results will be coincidences. In another field where people have a good idea what ought to be true before doing an experiment, most researchers will be testing true hypotheses and most positive results will be correct.
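A few lines of Python show how strongly the prior matters. This is a bare-bones application of Bayes' theorem, not the fuller model in the paper; the 80% power and 5% false positive rate are conventional values I'm assuming for illustration.

```python
def prob_true_given_positive(prior, power=0.8, alpha=0.05):
    """Posterior probability that a hypothesis is true given a positive result.

    prior: proportion of hypotheses under investigation that are true
    power: probability of a positive result when the hypothesis is true
    alpha: probability of a positive result when the hypothesis is false
    """
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)

for prior in (0.01, 0.1, 0.5, 0.9):
    print(prior, round(prob_true_given_positive(prior), 3))
```

With the same power and the same significance level, most positive results are false when only 1 in 100 hypotheses is true, and nearly all are correct when 9 in 10 are true.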
For example, it’s very difficult to come up with a better cancer treatment. Drugs that kill cancer in a petri dish or in animal models usually don’t work in humans. One reason is that these drugs may cause too much collateral damage to healthy tissue. Another reason is that treating human tumors is more complex than treating artificially induced tumors in lab animals. Of all cancer treatments that appear to be an improvement in early trials, very few end up receiving regulatory approval and changing clinical practice.
A greater proportion of physics hypotheses are correct because physics has powerful theories to guide the selection of experiments. Experimental physics often succeeds because it has good support from theoretical physics. Cancer research is more empirical because there is little reliable predictive theory. This means that a published result in physics is more likely to be true than a published result in oncology.
Whether “most” published results are false depends on context. The proportion of false results varies across fields. It is high in some areas and low in others.
From Controversies in the Foundations of Statistics by Bradley Efron:
Statistics seems to be a difficult subject for mathematicians, perhaps because its elusive and wide-ranging character mitigates against the traditional theorem-proof method of presentation. It may come as some comfort then that statistics is also a difficult subject for statisticians.
The other day I heard someone say something like the following:
I can’t believe how people don’t understand probability. They don’t realize that if a coin comes up heads 20 times, on the next flip there’s still a 50-50 chance of it coming up tails.
But if I saw a coin come up heads 20 times, I’d suspect it would come up heads the next time.
There are two levels of uncertainty here. If the probability of a coin coming up heads is θ = 1/2 and the tosses are independent, then yes, the probability of a head is 1/2 each time, regardless of how many heads have shown before. The parameter θ models our uncertainty regarding which side will show after a toss of the coin. That’s the first level of uncertainty.
But what about our uncertainty in the value of θ? Twenty flips showing the same side up should cause us to question whether θ really is 1/2. Maybe it’s a biased coin and θ is greater than 1/2. Or maybe it really is a fair coin and we’ve just seen a one-in-a-million event. (Such events do happen, but only one in a million times.) Our uncertainty regarding the value of θ is a second level of uncertainty.
Frequentist statistics approaches these two kinds of uncertainty differently. That approach says that θ is a constant but unknown quantity. Probability describes the uncertainty regarding the coin toss given some θ but not the uncertainty regarding θ. The Bayesian models all uncertainty using probability. So the outcome of the coin toss given θ is random, but θ itself is also random. It’s turtles all the way down.
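Here's a sketch of the Bayesian calculation for the coin example, assuming a beta(a, b) prior on θ with integer parameters. That choice of prior is mine, made because the posterior then has a simple closed form: after seeing some heads and tails, the probability of heads on the next flip is (a + heads)/(a + b + heads + tails).

```python
from fractions import Fraction

def prob_next_head(heads, tails, a=1, b=1):
    """Posterior predictive probability of heads under a beta(a, b) prior on theta."""
    return Fraction(a + heads, a + b + heads + tails)

# Chance that a fair coin shows 20 heads in a row: roughly one in a million.
print(Fraction(1, 2) ** 20)

# Uniform prior on theta: 20 straight heads is strong evidence of bias.
print(prob_next_head(20, 0))

# Strong prior belief that the coin is fair: 20 heads barely moves it.
print(prob_next_head(20, 0, a=1000, b=1000))
```

How much 20 heads should change your mind depends on how much second-level uncertainty you had about θ to begin with.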
It’s possible to have different degrees of uncertainty at each level. You could, for example, calculate the probability of some quantum event very accurately. If that probability is near 1/2, there’s a lot of uncertainty regarding the event itself, but little uncertainty about the parameter. High uncertainty at the first level, low uncertainty at the next level. If you warp a coin, it may not be apparent what effect that will have on the probability of the outcome. Now there’s significant uncertainty at the first and second level.
We’ve implicitly assumed that a single parameter θ describes the uncertainty in a coin toss outcome. Maybe that’s not true. Maybe the person tossing the coin has the ability to influence the outcome. (Some very skilled people can. I’ve heard rumors that Persi Diaconis is good at this.) Now we have a third level of uncertainty, uncertainty regarding our model and not just its parameter.
If you’re sure that a parameter θ describes the coin toss, but you don’t know θ, then the coin toss outcome is a known unknown and θ is an unknown unknown, a second-order uncertainty. More often though people use the term “unknown unknown” to describe a third-order uncertainty, unforeseen factors that are not included in a model, not even as uncertain parameters.
Sometimes you can derive a probability distribution from a list of properties it must have. For example, there are several properties that lead inevitably to the normal distribution or the Poisson distribution.
Although such derivations are attractive, they don’t apply that often, and they’re suspect when they do apply. There’s often some effect that keeps the prerequisite conditions from being satisfied in practice, so the derivation doesn’t lead to the right result.
The Poisson may be the best example of this. It’s easy to argue that certain count data have a Poisson distribution, and yet empirically the Poisson doesn’t fit so well because, for example, you have a mixture of two populations with different rates rather than one homogeneous population. (Sums of independent Poisson random variables have a Poisson distribution. Mixtures of Poisson distributions don’t.)
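One way to see that a mixture isn't Poisson: a Poisson distribution has variance equal to its mean, but a mixture of Poissons with different rates has variance greater than its mean. The sketch below computes both using the law of total variance. The rates and weights are made up.

```python
def mixture_mean_var(rates, weights):
    """Mean and variance of a mixture of Poisson distributions.

    rates: the Poisson rate of each subpopulation
    weights: the proportion of each subpopulation (should sum to 1)
    """
    mean = sum(w * r for r, w in zip(rates, weights))
    # Law of total variance: the average within-group variance (which equals
    # the mean, since each Poisson has variance equal to its rate) plus the
    # variance of the group rates around the overall mean.
    var = mean + sum(w * (r - mean) ** 2 for r, w in zip(rates, weights))
    return mean, var

print(mixture_mean_var([6], [1.0]))           # one homogeneous population
print(mixture_mean_var([2, 10], [0.5, 0.5]))  # two populations, same overall mean
```

Both populations have mean 6, but the mixture has variance 22 rather than 6. This kind of overdispersion is a common reason count data fail to fit a Poisson.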
The best scenario is when a theoretical derivation agrees with empirical analysis. Theory suggests the distribution should be X, and our analysis confirms that. Hurray! The theoretical and empirical strengthen each other’s claims.
Theoretical derivations can be useful even when they disagree with empirical analysis. The theoretical distribution forms a sort of baseline, and you can focus on how the data deviate from that baseline.
Stand on a large clock, say on the 1. Now flip a coin and move ahead one hour if the coin turns up heads, and back one hour otherwise. Keep repeating the process until you’ve stood on all 12 numbers. How long, on average, will this random walk take? If you generalize to clocks with p positions, how does the expected time vary with p?
Here’s a little Python code to experiment with the problem.
    from random import random

    p = 12  # number of positions on the clock

    def time_to_cover_circle():
        circle = [0 for i in range(p)]
        count, pos = 0, 0
        circle[pos] = 1  # the starting position counts as visited
        while True:
            count += 1
            pos += 1 if random() > 0.5 else -1
            pos %= p
            circle[pos] = 1
            if min(circle) == 1:
                return count
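To explore how the expected time depends on p, it helps to have a version that takes p as an argument and averages over many walks. Here's one possible sketch; the function names, the number of repetitions, and the seed are my own choices, and the starting position is counted as already visited.

```python
from random import Random

def cover_time(p, rng):
    """Steps for a fair +/-1 walk on a p-position clock to stand on every position."""
    visited = [False] * p
    visited[0] = True  # we start out standing on one of the positions
    pos, count, seen = 0, 0, 1
    while seen < p:
        pos = (pos + (1 if rng.random() > 0.5 else -1)) % p
        count += 1
        if not visited[pos]:
            visited[pos] = True
            seen += 1
    return count

def average_cover_time(p, reps=5000, seed=0):
    """Average cover time over reps independent walks."""
    rng = Random(seed)
    return sum(cover_time(p, rng) for _ in range(reps)) / reps

for p in (4, 8, 12, 16):
    print(p, average_cover_time(p))
```

Printing the averages for several values of p side by side is a good way to guess the pattern before trying to prove it.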