Posts tagged as:

Probability and Statistics

The universal solvent of statistics

by John on February 1, 2012

Andrew Gelman just posted an interesting article on the philosophy of Bayesian statistics. Here’s my favorite passage.

This reminds me of a standard question that Don Rubin … asks in virtually any situation: “What would you do if you had all the data?” For me, that “what would you do” question is one of the universal solvents of statistics.

Emphasis added.

I had not heard Don Rubin’s question before, but I think I’ll be asking it often. It reminds me of Alice’s famous dialog with the Cheshire Cat:

“Would you tell me, please, which way I ought to go from here?”

“That depends a good deal on where you want to get to,” said the Cat.

“I don’t much care where–” said Alice.

“Then it doesn’t matter which way you go,” said the Cat.

Cheshire Cat

Related post: Irrelevant uncertainty

{ 6 comments }

Six analysis and probability diagrams

by John on January 17, 2012

Here are a few diagrams I’ve created that summarize relationships in analysis and probability. Click on a thumbnail image to go to a page with the full image and explanatory text.

Special functions

Gamma and related functions

Probability distributions

Conjugate priors

Convergence theorems

Bessel functions

{ 3 comments }

Interpreting statistics

by John on January 13, 2012

From Matt Briggs:

I challenge you to find me in any published statistical analysis, outside of an introductory textbook, a confidence interval given the correct interpretation. If you can find even one instance where the [frequentist] confidence interval is not interpreted as a [Bayesian] credible interval, then I will eat your hat.

Most statistical analysis is carried out by people who do not interpret their results correctly. They carry out frequentist procedures and then give the results a Bayesian interpretation. This is not simply a violation of an academic taboo. It means that people generally underestimate the uncertainty in their conclusions.

Related posts:

Most published research results are false
Classical statistics in a nutshell

{ 16 comments }

Bugs, features, and risk

by John on January 12, 2012

All software has bugs. Someone has estimated that production code has about one bug per 100 lines. Of course there’s some variation in this number. Some software is a lot worse, and some is a little better.

But bugs-per-line-of-code is not very useful for assessing risk. The risk of a bug is the probability of running into it multiplied by its impact. Some lines of code are far more likely to execute than others, and some bugs are far more consequential than others.

Devoting equal effort to testing all lines of code would be wasteful. You’re not going to find all the bugs anyway, so you should concentrate on the parts of the code that are most likely to run and that would produce the greatest harm if they were wrong.

However, here’s a complication. The probability of running into a bug can change over time as people use the software in new ways. For whatever reason people to want to use features that had not been exercised before. When they do so, they’re likely to uncover new bugs.

(This helps explain why everyone thinks his preferred software is more reliable than others. When you’re a typical user, you tread the well-tested paths. You also learn, often subconsciously, to avoid buggy paths. When you bring your expectations from an old piece of software to a new one, you’re more likely to uncover bugs.)

Even though usage patterns change, they don’t change arbitrarily. It’s still the case that some code is far more likely than other code to execute.

Good software developers think ahead. They solve more than they’re asked to solve. They think “I’m going to go ahead and include this other case while I’m at it in case they need it later.” They’re heroes when it turns out their guesses about future needs were correct.

But there’s a downside to this initiative. You pay for what you don’t use. Every speculative feature either has to be tested, incurring more expense up front, or delivered untested, incurring more risk. This suggests its better to disable unused features.

You cannot avoid speculation entirely. Writing maintainable software requires speculating well, anticipating and preparing for change. Good software developers place good bets, and these tend to be small bets, going to a little extra effort to make software much more flexible. As with bugs, you have to consider probabilities and consequences: how likely is this part of the software to change, and how much effort will it take to prepare for that change?

Developers learn from experience what aspects of software are likely to change and they prepare for that change. But then they get angry at a rookie who wastes a lot of time developing some unnecessary feature. They may not realize that the rookie is doing the same thing they are, but with a less informed idea of what’s likely to be needed in the future.

Disputes between developers often involve hidden assumptions about probabilities. Whether some aspect of the software is responsible preparation for maintenance or wasteful gold plating depends on your idea of what’s likely to happen in the future.

Related post: Why programmers write unneeded code

{ 10 comments }

R in Action

by John on January 2, 2012

No Starch Press sent me a copy of The Art of R Programming last Fall and I wrote a review of it here. Then a couple weeks ago, Manning sent me a copy of R in Action. Here I’ll give a quick comparison of the two books, then focus specifically on R in Action.

[click to continue...]

{ 3 comments }

Amputating reality

by John on December 1, 2011

“If you just rely on one model, you tend to amputate reality to make it fit your model.” — David Brooks

Related post: Advantages of crude models

{ 6 comments }

A Bayesian view of Amazon Resellers

by John on September 27, 2011

I was buying a used book through Amazon this evening. Three resellers offered the book at essentially the same price. Here were their ratings:

  • 94% positive out of 85,193 reviews
  • 98% positive out of 20,785 reviews
  • 99% positive out of 840 reviews

Which reseller is likely to give the best service? Before you assume it’s the seller with the highest percentage of positive reviews, consider the following simpler scenario.

Suppose one reseller has 90 positive reviews out of 100. The other reseller has two reviews, both positive. You could say one has 90% approval and the other has 100% approval, so the one with 100% approval is better. But this doesn’t take into consideration that there’s much more data on one than the other. You can have some confidence that 90% of the first reseller’s customers are satisfied. You don’t really know about the other because you have only two data points.

A Bayesian view of the problem naturally incorporates the amount of data as well as its average. Let θA be the probability of a customer being satisfied with company A’s service. Let θB be the corresponding probability for company B. Suppose before we see any reviews we think all ratings are equally likely. That is, we start with a uniform prior distribution θA and θB. A uniform distribution is the same as a beta(1, 1) distribution.

After observing 90 positive reviews and 10 negative reviews, our posterior estimate on θA has a beta(91, 11) distribution. After observing 2 positive reviews, our posterior estimate on θB has a beta(3, 1) distribution. The probability that a sample from θA is bigger than a sample from θB is 0.713. That is, there’s a good chance you’d get better service from the reseller with the lower average approval rating.

beta(91,11) versus beta(3,1)

Now back to our original question. Which of the three resellers is most likely to satisfy a customer?

Assume a uniform prior on θX, θY, and θZ, the probabilities of good service for each reseller. The posterior distributions on these variables have distributions beta(80082, 5113), beta(20370, 417), and beta(833, 9).

These beta distributions have such large parameters that we can approximate them by normal distributions with the same mean and variance. (A beta(a, b) random variable has mean a/(a+b) and variance ab/((a+b)2(a+b+1)).) The variable with the most variance, θZ, has standard deviation 0.003. The other variables have even smaller standard deviation. So the three distributions are highly concentrated at their mean values with practically non-overlapping support. And so a sample from θX or θY is unlikely to be higher than a sample from θZ.

In general, going by averages alone works when you have a lot of customer reviews. But when you have a small number of reviews, going by averages alone could be misleading.

Thanks to Charles McCreary for suggesting the xkcd comic.

Related links:

Inequality Calculator
Calculating random inequalities
Exact calculation of beta inequalities

{ 28 comments }

New tech reports

by John on September 26, 2011

Soft maximum

I had a request to turn my blog posts on the soft maximum into a tech report, so here it is:

Basic properties of the soft maximum

There’s no new content here, just a little editing and more formal language. But now it can be referenced in a scholarly publication.

More random inequalities

I recently had a project that needed to compute random inequalities comparing common survival distributions (gamma, inverse gamma, Weibull, log normal) to uniform distributions. Here’s a report of the results.

Random inequalities between survival and uniform distributions

This tech report develops analytical solutions for computing Prob(X > Y) where X and Y are independent, X has one of the distributions mentioned above, and Y is uniform over some interval. The report includes R code to carry out the analytic expressions. It also includes R code to estimate the same inequalities by sampling for complementary validation.

Here are some other tech reports and blog posts on random inequalities.

{ 5 comments }

When asked why he robbed banks, Willie Sutton famously replied “Because that’s where the money is.”

If you read about data analysis in high dimensions, you might hear someone say they’re focused on a thin shell because that’s where the probability is. For a multivariate normal distribution in high dimensions, nearly all the probability mass is concentrated in a thin shell some distance away from the origin.

What does that mean? Why is it true? How thin is the shell and what is its radius?

It seems absurd to say the probability is concentrated in a shell. The multivariate normal density looks has its greatest value at the origin and quickly decays as you move out in any direction. So most of the probability must be near the origin, right? No, because mass equals density times volume. The density decays quickly as you move away from the origin, but volume increase quickly. The product of the two is greatest at some radius away from the origin. That’s the shell.

The volume of a sphere in d dimensions is proportional to rd, so volume increases very quickly if d is large. For example, if d = 100, how much of the volume of a unit sphere is between a distance of 0.99 and 1 from the origin? Since 1100 – 0.99100 = 0.634, this says 63.4% of the volume is in the outer shell of thickness 0.01.

Since volume of a sphere is proportional to rd, the volume of a shell of radius r and thickness Δr is roughly proportional to d rd-1 Δr. When you multiply that volume by the probability density exp( -r2 / 2 ) you get that the probability mass in the shell is proportional to

rd-1 exp( -r2 / 2 ) Δr.

This leads to a χ distribution with d degrees of freedom. (Not the better known χ2 distribution.) This distribution has mode √(d-1) and variance 1. For large d, the distribution is approximately normal. So a multivariate normal in d dimensions with d large has roughly 95% of its probability mass in a shell of radius √d with thickness 4, two standard deviations either side of √d. (I’m approximating anyway, so I approximated √(d-1) as √d to make the conclusion a little simpler.)

The graph below gives the probability density of shells as a function of radius in dimensions 10 and 100.

Related post:

Volumes of Lp unit balls

{ 2 comments }

Markov chains don’t converge

by John on August 10, 2011

I often hear people often say they’re using a burn-in period in MCMC to run a Markov chain until it converges. But Markov chains don’t converge, at least not the Markov chains that are useful in MCMC. These Markov chains wander around forever exploring the domain they’re sampling from. Any point that makes a “bad” starting point for MCMC is a point you might reach by burn-in.

Not only that, Markov chains can’t remember how they got where they are. That’s their defining property. So if your burn-in period ends at a point x, the chain will perform exactly as if you had simply started at x.

When someone says a Markov chain has converged, they mean that the chain has entered a high-probability region. I’ll explain in a moment why that’s desirable. But the belief/hope is that a burn-in period will put a Markov chain in a high-probability region. And it probably will, but there are a couple reasons why this isn’t necessarily the best thing to do.

  1. Burn-in may be ineffective. You could use use optimization to be certain that you’re starting in such a region. Burn-in offers no such assurance. See Burn-in and other MCMC folklore.
  2. Burn-in may be inefficient. Casting aside worries that burn-in may not do what you want, it can be an inefficient way to find a high-probability region. MCMC isn’t designed to optimize a density function but rather to sample from it.

Why use burn-in? MCMC practitioners often don’t know how to do optimization, and in any case the corresponding optimization problem may be difficult. Also, if you’ve got the MCMC code in hand, it’s convenient to use it to find a starting point as well as for sampling.

So why does it matter whether you start your Markov chain in a high-probability region? In the limit, it doesn’t matter. But since you’re averaging some function of some finite number of samples, your average will be a better approximation if you start at a typical point in the density you’re sampling. If you start at a low probability location, your average may be more biased.

Samples from Markov chains don’t converge, but averages of functions applied to these samples may converge. When someone says a Markov chain has converged, they mean they’re at a point where the average of a finite number of function applications will be a better approximation of the thing they want to compute than if they’d started at a low probability point.

It’s not just a matter of imprecise language when people say a Markov chain has converged. It sometimes betrays a misunderstanding of how Markov chains work.

{ 13 comments }

The single big jump principle

by John on August 9, 2011

Suppose you’re playing a game where you take 10 steps of a random size. Here are two variations on the game. Which will give you a better chance of ending up far from where you started?

  1. You take your steps one at a time, starting each new step from where the last one took you.
  2. You return to the starting point after each step. After 10 turns, you go back to where your largest step took you.

Assume all steps are in the same direction. Then you’re always better off under the first rule. The sum of all your steps will be bigger than the maximum since the maximum is one of the terms in the sum.

However, depending on the probability distribution on your random steps, you may do almost as well under the second rule, taking the maximum of your steps.

If the distribution on your step size is heavy-tailed (technically, subexponential) then the maximum and the sum have the same asymptotic distribution. That is, as x goes to infinity,

Pr( X1 + X2 + … + Xn > x ) ~ Pr( max(X1, X2, … , Xn) > x )

As x gets larger, the relative advantage of taking the sum rather than the maximum goes to zero. This is known as “the principle of the single big jump.” If you’re looking to make a lot of progress, adding up typical jumps isn’t likely to get you there. You need one big jump. Said another way, your total progress is about as good as the progress on your best shot.

Before you draw a life lesson from this, note that this is only true for heavy-tailed distributions. It’s not true at all if your jumps have a thin-tailed distribution. But sometimes the payoffs in life’s games are heavy-tailed.

What distributions are subexponential? Any heavy-tailed distribution you’re likely to have heard of: Cauchy, Lévy, Weibull (with shape < 1), log-normal, etc.

To illustrate the single jump principle, let X1 and X2 be standard Cauchy random variables. The difference between Pr( X1 + X2 > x ) and Pr( max(X1 , X2) > x ) becomes impossible to see for x much bigger than 3.

Here’s a plot of the ratio of the sum to the maximum.

The ratio tends to 1 in the limit as predicted by the single big jump principle.

I chose to use a Cauchy distribution to simplify the calculations. (The sum of two Cauchy random variables is another Cauchy. See distribution relationships.) In this case the maximum is actually larger than the sum because the Cauchy can be positive or negative. But it is still the case that the two tails converge as x increases.

Here’s the Sage notebook that I used to create the graphs.

More on the single big jump here.

{ 10 comments }

Works well versus well understood

by John on May 10, 2011

While I was looking up the Tukey quote in my earlier post, I ran another of his quotes:

The test of a good procedure is how well it works, not how well it is understood.

At some level, it’s hard to argue against this. Statistical procedures operate on empirical data, so it makes sense that the procedures themselves be evaluated empirically.

But I question whether we really know that a statistical procedure works well if it isn’t well understood. Specifically, I’m skeptical of complex statistical methods whose only credentials are a handful of simulations. “We don’t have any theoretical results, buy hey, it works well in practice. Just look at the simulations.”

Every method works well on the scenarios its author publishes, almost by definition. If the method didn’t handle a scenario well, the author would publish a different scenario. Even if the author didn’t select the most flattering scenarios, he or she may simply not have considered unflattering scenarios. The latter is particularly understandable, almost inevitable.

Simulation results would have more credibility if an adversary rather than an advocate chose the scenarios. Even so, an adversary and an advocate may share the same blind spots and not explore certain situations. Unless there’s a way to argue that a set of scenarios adequately samples the space of possible inputs, it’s hard to have a great deal of confidence in a method based on simulation results alone.

Related posts:

Buggy code is biased code
Software sins of omission
Occam’s razor and Bayes’ theorem

{ 10 comments }

Move on to the next question

by John on May 9, 2011

Here’s a recent discussion from Math Overflow.

Q: I have some data points and, when I plot them on R, it looks like a normal distribution. I want to know how well my data fits the normal distribution. What kind of test should I do?

A: There’s actually a much broader question that you should be asking yourself here: does it matter whether your data really is normally distributed, or will the procedures that you’re going to perform on the data be reasonably robust in the presence of a distribution that is only approximately normal? …

The person asking the question was already satisfied that his data were approximately normal. So it was time to move on to the next question: Does what I want to do next work well for approximately normal data? (There’s no point asking whether your data is normal; it’s not. Normality is an idealization.)

We’re often tempted to add decimal places to the answer to one question instead of moving on to the next question. Maybe we don’t even realize what the next question should be. Or maybe we do know but we want stay with the familiar. In either case, this quote from John Tukey comes to mind.

An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.

Related post:

What distribution does my data have?

{ 4 comments }

Significance testing and Congress

by John on April 14, 2011

The US Supreme Court’s criticism of significance testing has been in the news lately. Here’s a criticism of significance testing involving the US Congress. Consider the following syllogism.

  1. If a person is an American, he is not a member of Congress.
  2. This person is a member of Congress.
  3. Therefore he is not American.

The initial premise is false, but the reasoning is correct if we assume the initial premise is true.

The premise that Americans are never members of Congress is clearly false. But it’s almost true! The probability of an American being a member of Congress is quite small, about 535/309,000,000. So what happens if we try to salvage the syllogism above by inserting “probably” in the initial premise and conclusion?

  1. If a person is an American, he is probably not a member of Congress.
  2. This person is a member of Congress.
  3. Therefore he is probably not American.

What went wrong? The probability is backward. We want to know the probability that someone is American given he is a member of Congress, not the probability he is a member of Congress given he is American.

Science continually uses flawed reasoning analogous to the example above. We start with a “null hypothesis,” a hypothesis we seek to disprove. If our data are highly unlikely assuming this hypothesis, we reject that hypothesis.

  1. If the null hypothesis is correct, then these data are highly unlikely.
  2. These data have occurred.
  3. Therefore, the null hypothesis is highly unlikely.

Again the probability is backward. We want to know the probability of the hypothesis given the data, not the probability of the data given the hypothesis.

We can’t reject a null hypothesis just because we’ve seen data that are rare under this hypothesis. Maybe our data are even more rare under the alternative. It is rare for an American to be in Congress, but it is even more rare for someone who is not American to be in the US Congress!

I found this illustration in The Earth is Round (p < 0.05) by Jacob Cohen (1994). Cohen in turn credits Pollard and Richardson (1987) in his references.

Related posts:

How insignificant is significance testing?
Five criticisms of significance testing
Most published research results are false
Classical statistics in a nutshell

{ 8 comments }

Luis Pericchi sent me a brief note commenting on the recent US Supreme Court decision involving statistical significance and medical reporting. Here is his paper, about a page and a half.

How insignificant is statistical significance? (PDF)

Related post: Significance testing and Congress

{ 1 comment }