This week’s resource post lists notes on probability approximations.
Do we even need probability approximations anymore? They’re not as necessary for numerical computation as they once were, but they remain vital for understanding the behavior of probability distributions and for theoretical calculations.
Textbooks often leave out details such as quantifying the error when discussing approximations. The following pages are notes I wrote to fill in some of these details when I was teaching.
See also blog posts tagged Probability and statistics and the Twitter account ProbFact.
Statistical methods should do better with more data. That’s essentially what the technical term “consistency” means. But with improper numerical techniques, the numerical error can increase with more data, overshadowing the decreasing statistical error.
There are three ways Bayesian posterior probability calculations can degrade with more data:
- Polynomial approximation
- Missing the spike
- Underflow
Elementary numerical integration algorithms, such as Gaussian quadrature, are based on polynomial approximations. The method aims to exactly integrate a polynomial that approximates the integrand. But likelihood functions are not approximately polynomial, and they become less like polynomials when they contain more data. They become more like a normal density, asymptotically flat in the tails, something no polynomial can do. With better integration techniques, the integration accuracy will improve with more data rather than degrade.
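Here is a minimal sketch of that degradation. The setup is an assumption chosen for illustration, not something from the discussion above: a binomial likelihood with a uniform prior, so the normalized posterior is a Beta(k+1, n-k+1) density whose exact integral over [0, 1] is 1, integrated with a fixed 5-point Gauss-Legendre rule. The function names are hypothetical.

```python
import math

# 5-point Gauss-Legendre nodes and weights on [-1, 1].
NODES = [-0.9061798459386640, -0.5384693101056831, 0.0,
         0.5384693101056831, 0.9061798459386640]
WEIGHTS = [0.2369268850561891, 0.4786286704993665, 0.5688888888888889,
           0.4786286704993665, 0.2369268850561891]

def gauss_legendre_01(f):
    """Approximate the integral of f over [0, 1] with the 5-point rule."""
    return sum(0.5 * w * f(0.5 * (x + 1.0)) for x, w in zip(NODES, WEIGHTS))

def posterior_density(k, n):
    """Beta(k+1, n-k+1) density: binomial likelihood times a uniform prior."""
    # Log of the normalizing constant, computed with lgamma to avoid overflow.
    log_norm = math.lgamma(n + 2) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

    def density(theta):
        return math.exp(log_norm + k * math.log(theta)
                        + (n - k) * math.log(1.0 - theta))

    return density

def quadrature_error(k, n):
    """Absolute error of the 5-point rule; the exact integral is 1."""
    return abs(gauss_legendre_01(posterior_density(k, n)) - 1.0)

for n in (10, 100, 1000):
    k = 3 * n // 10  # keep the observed success rate at 30%
    print(n, quadrature_error(k, n))
```

In this setup the rule is nearly exact at n = 10, where the integrand is a polynomial of degree 10. At n = 1000 it returns almost nothing, because the density has become a narrow spike that falls between the nodes.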
With more data, the posterior distribution becomes more concentrated. This means that a naive approach to integration might entirely miss the part of the integrand where nearly all the mass is concentrated. You need to make sure your integration method is putting its effort where the action is. Fortunately, it’s easy to estimate where the mode should be.
The third problem is that software calculating the likelihood function can underflow with even a moderate amount of data. The usual solution is to work with the logarithm of the likelihood function, but with numerical integration the solution isn’t quite that simple. You need to integrate the likelihood function itself, not its logarithm. I describe how to deal with this situation in Avoiding underflow in Bayesian computations.
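One common workaround, sketched here under my own assumptions and not necessarily the approach taken in the post referenced above, is to subtract the maximum of the log likelihood before exponentiating. The shift multiplies the likelihood by a constant, which cancels in any ratio of integrals such as a posterior mean. The example uses a binomial likelihood with a uniform prior and a simple midpoint rule; the function names are hypothetical.

```python
import math

def log_likelihood(theta, k, n):
    """Log of the binomial likelihood, dropping the constant binomial coefficient."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def posterior_mean(k, n, num_points=20000, shift=True):
    """Posterior mean of theta under a uniform prior, using the midpoint rule.

    With shift=True the largest log likelihood on the grid is subtracted
    before exponentiating, so the biggest term is exp(0) = 1.
    """
    h = 1.0 / num_points
    thetas = [(i + 0.5) * h for i in range(num_points)]
    logs = [log_likelihood(t, k, n) for t in thetas]
    offset = max(logs) if shift else 0.0
    weights = [math.exp(v - offset) for v in logs]
    total = sum(weights)
    if total == 0.0:
        # Every likelihood value underflowed to zero.
        return float("nan")
    # The grid spacing h appears in both integrals and cancels.
    return sum(t * w for t, w in zip(thetas, weights)) / total

k, n = 1500, 5000
print("unshifted:", posterior_mean(k, n, shift=False))
print("shifted:  ", posterior_mean(k, n, shift=True))
print("exact:    ", (k + 1) / (n + 2))
```

With 5000 observations the unshifted likelihood underflows to zero at every grid point, so the naive calculation has nothing to integrate. The shifted version integrates the likelihood itself, just rescaled.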
If you’d like help with statistical computation, let’s talk.
Each Wednesday I post a list of notes on some topic. This week it’s probability.
For the next few weeks, I’ve scheduled @ProbFact tweets to come out at random times.
The times will follow a Poisson process with an average of two tweets per day. (Times are truncated to multiples of 5 minutes because my scheduling software requires that.)
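A schedule like this can be sketched by drawing exponentially distributed gaps between tweets and rounding each time down to a multiple of 5 minutes. This is an illustrative sketch, not the actual scheduling code; the function name and parameters are hypothetical.

```python
import random

def tweet_times(days, rate_per_day=2.0, step_minutes=5, seed=None):
    """Return tweet times in minutes from the start, truncated to step_minutes."""
    rng = random.Random(seed)
    horizon = days * 24 * 60                 # length of the schedule in minutes
    rate_per_minute = rate_per_day / (24 * 60)
    t = 0.0
    times = []
    while True:
        # Gaps between events of a Poisson process are exponential.
        t += rng.expovariate(rate_per_minute)
        if t >= horizon:
            break
        times.append(int(t // step_minutes) * step_minutes)
    return times

schedule = tweet_times(days=7, seed=2014)
for minutes in schedule:
    day, rem = divmod(minutes, 24 * 60)
    print(f"day {day + 1}, {rem // 60:02d}:{rem % 60:02d}")
```

The number of tweets on any given day then has a Poisson distribution with mean two, so some days will have none and others several.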
Suppose you’ve seen a coin come up heads 10 times in a row. What do you believe is likely to happen next? Three common responses:
- Heads are more likely.
- Tails are more likely.
- Equal probability of heads or tails.
Each is reasonable in its own context. The last answer is correct assuming the flips are independent and heads and tails are equally likely.
But as I argued here, if you see nothing but heads, you have reason to question the assumption that the coin is fair. So there’s some justification for the first answer.
The reasoning behind the second answer is that tails are “due.” This isn’t true if you’re looking at independent flips of a fair coin, but it could be reasonable in other settings, such as sampling without replacement.
Say there are a number of coins on a table, covered by a cloth. A fixed number are on the table heads up, and a fixed number tails up. You reach under the cloth and slide a coin out. Every head you pull out increases the chances that the next coin will be tails. If there were an equal number of heads and tails under the cloth to begin with, then after pulling out 10 heads tails are indeed more likely next time.
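The arithmetic for the coins under the cloth is short enough to write out. The counts below (20 heads and 20 tails) are made up for illustration, and the function name is hypothetical.

```python
from fractions import Fraction

def prob_next_tails(heads, tails, heads_drawn):
    """Probability the next coin drawn is tails after removing heads_drawn heads."""
    remaining_heads = heads - heads_drawn
    if remaining_heads < 0:
        raise ValueError("cannot draw more heads than were on the table")
    return Fraction(tails, remaining_heads + tails)

print(prob_next_tails(20, 20, 0))   # before any draws
print(prob_next_tails(20, 20, 10))  # after pulling out 10 heads
```

With 20 of each to start, the chance of tails rises from 1/2 to 2/3 after ten heads have been removed.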
Related post: Long runs
When I was a postdoc I asked a statistician a few questions and he gave me an overview of his subject. (My area was PDEs; I knew nothing about statistics.) I remember two things that he said.
- A big part of being a statistician is knowing what to do when your assumptions aren’t met, because they’re never exactly met.
- A lot of statisticians think time series analysis is voodoo, and he was inclined to agree with them.
Blue Bonnet™ used to run commercials with the jingle “Everything’s better with Blue Bonnet on it.” Maybe they still do.
Perhaps in reaction to knee-jerk antipathy toward Bayesian methods, some statisticians have adopted knee-jerk enthusiasm for Bayesian methods. Everything’s better with Bayesian analysis on it. Bayes makes it better, like a little dab of margarine on a dry piece of bread.
There’s much that I prefer about the Bayesian approach to statistics. Sometimes it’s the only way to go. But Bayes-for-the-sake-of-Bayes can expend a great deal of effort, by human and computer, to arrive at a conclusion that could have been reached far more easily by other means.
Related: Bayes isn’t magic
College courses often begin by trying to weaken your confidence in common sense. For example, a psychology course might start by presenting optical illusions to show that there are limits to your ability to perceive the world accurately. I’ve seen at least one physics textbook that also starts with optical illusions to emphasize the need for measurement. Optical illusions, however, take considerable skill to create. The fact that they are so contrived illustrates that your perception of the world is actually pretty good in ordinary circumstances.
For several years I’ve thought about the interplay of statistics and common sense. Probability is more abstract than physical properties like length or color, and so common sense is more often misguided in the context of probability than in visual perception. In probability and statistics, the analogs of optical illusions are usually called paradoxes: St. Petersburg paradox, Simpson’s paradox, Lindley’s paradox, etc. These paradoxes show that common sense can be seriously wrong, without having to consider contrived examples. Instances of Simpson’s paradox, for example, pop up regularly in applications.
Some physicists say that you should always have an order-of-magnitude idea of what a result will be before you calculate it. This implies a belief that such estimates are usually possible, and that they provide a sanity check for calculations. And that’s true in physics, at least in mechanics. In probability, however, it is quite common for even an expert’s intuition to be way off. Calculations are more likely to find errors in common sense than the other way around.
Nevertheless, common sense is vitally important in statistics. Attempts to minimize the need for common sense can lead to nonsense. You need common sense to formulate a statistical model and to interpret inferences from that model. Statistics is a layer of exact calculation sandwiched between necessarily subjective formulation and interpretation. Even though common sense can go badly wrong with probability, it can also do quite well in some contexts. Common sense is necessary to map probability theory to applications and to evaluate how well that map works.
Watching the news gives you an inverted sense of risk.
We fear bad things that we’ve seen on the news because they make a powerful emotional impression. But the things rare enough to be newsworthy are precisely the things we should not fear. Conversely, the risks we should be concerned about are the ones that happen too frequently to make the news.
I will be giving a talk “Bayesian statistics as a way to integrate intuition and data” at KeenCon, September 11, 2014 in San Francisco.
Update: Use promo code KeenCon-JohnCook to get 75% off registration.
Last year I worked with Hitachi Data Systems to evaluate the trade-offs of replication and erasure coding as ways to increase data storage reliability while minimizing costs. This led to a white paper that has just been published:
Compare Cost and Performance of Replication and Erasure Coding
Hitachi Review Vol. 63 (July 2014)
John D. Cook
Ab de Kwant
The normal distribution can approximate many other distributions, though the details such as quantitative error estimates and what factors improve or degrade the approximation are harder to find. Here are some notes on normal approximations to several common probability distributions.
When you sort data and look at which sample falls in a particular position, that’s called order statistics. For example, you might want to know the smallest, largest, or middle value.
Order statistics are robust in a sense. The median of a sample, for example, is a very robust measure of central tendency. If Bill Gates walks into a room with a large number of people, the mean wealth jumps tremendously but the median hardly budges.
But order statistics are not robust in this sense: the identity of the sample in any given position can be very sensitive to perturbation. Suppose a room has an odd number of people so that someone has the median wealth. When Bill Gates and Warren Buffett walk into the room later, the value of the median wealth may not change much, but the person corresponding to that median will change.
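Both kinds of robustness, and the lack of it, show up in a toy example. The net worth figures below are invented for illustration, and the helper names are hypothetical.

```python
import statistics

# Hypothetical net worths for 11 people, invented for illustration.
room = {f"person{i}": 50_000 + 1_000 * i for i in range(11)}

def median_person(wealth):
    """Name of the person holding the median wealth (odd-sized rooms only)."""
    ranked = sorted(wealth, key=wealth.get)
    return ranked[len(ranked) // 2]

def summarize(wealth):
    """Return (mean, median, name of the median person)."""
    values = list(wealth.values())
    return statistics.mean(values), statistics.median(values), median_person(wealth)

before = summarize(room)
# Two very wealthy newcomers; the amounts are placeholders, not real figures.
after = summarize({**room, "Gates": 100e9, "Buffett": 80e9})

print("before:", before)
print("after: ", after)
```

The mean jumps by orders of magnitude and the median moves by one step in the ranking, but the person sitting at the median is no longer the same.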
One way to evaluate machine learning algorithms is by how often they pick the right winner in some sense. For example, dose-finding algorithms are often evaluated on how often they pick the best dose from a set of doses being tested. This can be a terrible criterion, causing researchers to be misled by a particular set of simulation scenarios. It’s more important how often an algorithm makes a good choice than how often it makes the best choice.
Suppose five drugs are being tested. Two are nearly equally effective, and three are much less effective. A good experimental design will lead to picking one of the two good drugs most of the time. But if the best drug is only slightly better than the next best, it’s too much to expect any design to pick the best drug with high probability. In this case it’s better to measure the expected utility of a decision rather than how often a design makes the best decision.
Suppose you’re drawing random samples uniformly from some interval. How likely are you to see a new value outside the range of values you’ve already seen?
The problem is more interesting when the interval is unknown. You may be trying to estimate the end points of the interval by taking the max and min of the samples you’ve drawn. But in fact we might as well assume the interval is [0, 1] because the probability of a new sample falling within the previous sample range does not depend on the interval. The location and scale of the interval cancel out when calculating the probability.
Suppose we’ve taken n samples so far. The range of these samples is the difference between the 1st and the nth order statistics, and for a uniform distribution this difference has a beta(n-1, 2) distribution. Since a beta(a, b) distribution has mean a/(a+b), the expected value of the sample range from n samples is (n-1)/(n+1). This is also the probability that the next sample, or any particular future sample, will lie within the range of the samples seen so far.
If you’re trying to estimate the size of the total interval, this says that after n samples, the probability that the next sample will give you any new information is 2/(n+1). This is because we only learn something when a sample is less than the minimum so far or greater than the maximum so far.
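A quick simulation can check the (n-1)/(n+1) result. This is a sketch that takes the interval to be [0, 1], which loses nothing since the probability does not depend on the interval; the function name is hypothetical.

```python
import random

def prob_inside_range(n, trials=100_000, seed=0):
    """Estimate the probability that a new uniform sample lands inside
    the range of n earlier samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        samples = [rng.random() for _ in range(n)]
        x = rng.random()
        if min(samples) < x < max(samples):
            hits += 1
    return hits / trials

for n in (2, 5, 10):
    exact = (n - 1) / (n + 1)
    print(n, prob_inside_range(n), exact)
```

The complement, 2/(n+1), is the chance that the next sample extends the observed range and so tells you something new about the end points.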
Cancer research is sometimes criticized for being timid. Drug companies run enormous trials looking for small improvements. Critics say they should run smaller trials and more of them.
Which side is correct depends on what’s out there waiting to be discovered, which of course we don’t know. We can only guess. Timid research is rational if you believe that only marginal improvements are likely to be discovered.
Sample size increases quickly as the size of the effect you’re trying to find decreases. To establish small differences in effect, you need very large trials.
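To get a feel for how fast, here is a sketch based on the textbook normal-approximation formula for comparing two means with a known common standard deviation. That formula is my choice of illustration, not something taken from the discussion above, and real trial designs are more involved. The function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(effect, sigma=1.0, alpha=0.05, power=0.80):
    """Patients per arm needed to detect a difference in means of size `effect`
    with a two-sided test at level alpha, assuming known standard deviation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 * sigma ** 2 / effect ** 2)

for effect in (1.0, 0.5, 0.1, 0.05):
    print(effect, sample_size_per_arm(effect))
```

Under these assumptions an effect of one standard deviation needs 16 patients per arm, while an effect of a tenth of a standard deviation needs 1570. The required size grows like the inverse square of the effect, so halving the effect quadruples the trial.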
If you think there are only small improvements on the status quo available to explore, you’ll explore each of the possibilities very carefully. On the other hand, if you think there’s a miracle drug in the pipeline waiting to be discovered, you’ll be willing to risk falsely rejecting small improvements along the way in order to get to the big improvement.
Suppose there are 500 drugs waiting to be tested. All of these are only 10% effective except for one that is 100% effective. You could quickly find the winner by giving each candidate to one patient. For every drug whose patient responded, repeat the process until only one drug is left. One strike and you’re out. You’re likely to find the winner in three rounds, treating fewer than 600 patients. But if all the drugs are 10% effective except one that’s 11% effective, you’d need hundreds of trials with thousands of patients each.
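The one-strike process is easy to simulate. This sketch assumes each patient responds independently, and it labels the fully effective drug as index 0; the function name and parameters are hypothetical.

```python
import random

def one_strike_screen(num_drugs=500, dud_rate=0.10, rng=None):
    """Run the one-strike screen once.

    Drug 0 is the hypothetical 100%-effective drug; every other drug works
    with probability dud_rate. Returns (rounds, patients treated).
    """
    rng = rng or random.Random()
    candidates = list(range(num_drugs))
    rounds = 0
    patients = 0
    while len(candidates) > 1:
        rounds += 1
        patients += len(candidates)  # one patient per remaining drug
        # Keep only the drugs whose patient responded.
        candidates = [d for d in candidates if d == 0 or rng.random() < dud_rate]
    return rounds, patients

rng = random.Random(500)
results = [one_strike_screen(rng=rng) for _ in range(1000)]
mean_rounds = sum(r for r, _ in results) / len(results)
mean_patients = sum(p for _, p in results) / len(results)
print("average rounds:  ", mean_rounds)
print("average patients:", mean_patients)
```

Nearly all of the patients are treated in the first round, which is why the total stays so close to the number of drugs.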
The best research strategy depends on what you believe is out there to be found. People who know nothing about cancer often believe we could find a cure soon if we just spend a little more money on research. Experts are less sanguine, except when they’re asking for money.