I often hear people say they’re using a burn-in period in MCMC to run a Markov chain until it converges. But Markov chains don’t converge, at least not the Markov chains that are useful in MCMC. These Markov chains wander around forever, exploring the domain they’re sampling from. Any point that makes a “bad” starting point for MCMC is a point you might reach by burn-in.
Not only that, Markov chains can’t remember how they got where they are. That’s their defining property. So if your burn-in period ends at a point x, the chain will perform exactly as if you had simply started at x.
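To make that memorylessness concrete, here is a minimal sketch (a toy example for illustration, not anything from the post) using a random-walk Metropolis sampler for a standard normal target. The step function only ever sees the current state, so "continue the chain after burn-in from x" and "start a fresh chain at x" are literally the same function calls.

```python
import numpy as np

def metropolis_step(x, rng, scale=1.0):
    # One random-walk Metropolis step targeting a standard normal N(0, 1).
    # The signature is the point: the next state depends only on the current
    # state x and fresh randomness, never on how the chain arrived at x.
    proposal = x + scale * rng.normal()
    log_accept = 0.5 * (x**2 - proposal**2)   # log pi(proposal) - log pi(x)
    return proposal if np.log(rng.uniform()) < log_accept else x

rng = np.random.default_rng(0)

# "Burn in" from a deliberately bad starting value...
x = -25.0
for _ in range(500):
    x = metropolis_step(x, rng)

# ...then keep sampling. Continuing the burned-in chain from x and starting a
# brand-new chain at x are the same calls, so they give identically
# distributed samples.
samples = []
for _ in range(1000):
    x = metropolis_step(x, rng)
    samples.append(x)
```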
When someone says a Markov chain has converged, they may mean that the chain has entered a high-probability region. I’ll explain in a moment why that’s desirable. But the belief/hope is that a burn-in period will put a Markov chain in a high-probability region. And it probably will, but there are a couple reasons why this isn’t necessarily the best thing to do.
- Burn-in may be ineffective. You could use optimization to be certain that you’re starting in such a region (see the sketch after this list); burn-in offers no such assurance. See Burn-in and other MCMC folklore.
- Burn-in may be inefficient. Casting aside worries that burn-in may not do what you want, it can be an inefficient way to find a high-probability region. MCMC isn’t designed to optimize a density function but rather to sample from it.
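Here is a minimal sketch of what “use optimization instead” could look like, assuming SciPy is available; `neg_log_density` is a hypothetical stand-in for your model’s negative log posterior.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_density(theta):
    # Stand-in for your model's negative log posterior density;
    # the quadratic here is just a placeholder.
    return 0.5 * np.sum((theta - 3.0) ** 2)

# Find a high-probability point directly rather than hoping burn-in lands there.
result = minimize(neg_log_density, x0=np.zeros(5), method="L-BFGS-B")
start = result.x   # use this (approximate) mode as the chain's starting point
```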
Why use burn-in? MCMC practitioners often don’t know how to do optimization, and in any case the corresponding optimization problem may be difficult. Also, if you’ve got the MCMC code in hand, it’s convenient to use it to find a starting point as well as for sampling.
So why does it matter whether you start your Markov chain in a high-probability region? In the limit, it doesn’t matter. But since you’re averaging some function over a finite number of samples, your average will be a better approximation if you start at a typical point of the density you’re sampling from. If you start at a low-probability location, your average may be more biased.
Samples from Markov chains don’t converge, but averages of functions applied to these samples may converge. When someone says a Markov chain has converged, they may mean they’re at a point where the average of a finite number of function applications will be a better approximation of the thing they want to compute than if they’d started at a low probability point.
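As a toy illustration of that point (a sketch with made-up numbers, reusing the random-walk Metropolis sampler for a standard normal): estimate E[X²] = 1 from 200 samples, starting either at the mode or deep in the tail.

```python
import numpy as np

def metropolis_step(x, rng, scale=1.0):
    proposal = x + scale * rng.normal()
    log_accept = 0.5 * (x**2 - proposal**2)   # target is N(0, 1)
    return proposal if np.log(rng.uniform()) < log_accept else x

def estimate_second_moment(x0, n, seed):
    # Average of f(x) = x^2 over n samples; the true value is E[X^2] = 1.
    rng = np.random.default_rng(seed)
    x, total = x0, 0.0
    for _ in range(n):
        x = metropolis_step(x, rng)
        total += x**2
    return total / n

# Average over many replications so the comparison isn't one lucky run.
good = np.mean([estimate_second_moment(0.0,  200, s) for s in range(500)])
bad  = np.mean([estimate_second_moment(20.0, 200, s) for s in range(500)])
print(good, bad)   # the tail start badly overestimates E[X^2] at n = 200
```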
It’s not just a matter of imprecise language when people say a Markov chain has converged. It sometimes betrays a misunderstanding of how Markov chains work.
By “converge”, we mean to say “converge in distribution”, with respect to the target distribution that is implied by the detailed balance equation.
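For reference, writing \(\pi\) for the target density and \(p\) for the transition kernel (notation supplied here, not in the original comment), detailed balance is

\[ \pi(x)\, p(x, y) = \pi(y)\, p(y, x) \quad \text{for all } x, y, \]

and integrating both sides over \(x\) shows that \(\pi\) is a stationary distribution of the chain.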
I recently read “AI, A Modern Approach,” 3rd edition – ch. 15 (probabilistic reasoning over time) described using Markov chain Monte Carlo (specifically Gibbs sampling) to approximate the joint probability distribution & prune unlikely states for approximate inference with a Bayesian network.
WinBUGS (http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml) does Bayesian analysis using MCMC.
MCMC isn’t about points, it’s about distributions – it is they that converge. MCMC is usually performed in Bayesian stats to get approximate posteriors / evidences etc. The key is that in Bayesian stats you want the whole distribution, not just a point (either from the chain itself or as a function of the chain). So no, it doesn’t make sense to talk about convergence to, or of, a point, but as Chris says, it’s about convergence in distribution – that the empirical distribution of samples is a reasonable approximation of the target distribution. In this sense it does make complete sense to say the Markov chain (rather than its points) converges.
Chris: Yes, the samples converge in distribution, and in that sense a Markov chain converges, but that’s not a property of any particular point in a chain.
When someone looks at a printout and says “The chain starts to converge after 400 samples,” I think they mean that the chain appears to have entered a high-probability region.
Oh, I just mean that as the number of points in the chain approaches infinity, the empirical distribution of those points becomes a better and better approximation of the target distribution: it converges to that distribution.
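One way to see that numerically (a sketch for illustration, again using the toy standard-normal target): track the Kolmogorov-Smirnov distance between the chain’s empirical distribution and the target as the chain grows.

```python
import numpy as np
from scipy.stats import kstest

def metropolis_step(x, rng, scale=1.0):
    proposal = x + scale * rng.normal()
    log_accept = 0.5 * (x**2 - proposal**2)   # target is N(0, 1)
    return proposal if np.log(rng.uniform()) < log_accept else x

rng = np.random.default_rng(0)
x, samples = 0.0, []
for n in (100, 1_000, 10_000, 100_000):
    while len(samples) < n:
        x = metropolis_step(x, rng)
        samples.append(x)
    # Kolmogorov-Smirnov distance between the chain's empirical distribution
    # and the N(0, 1) target. (The samples are autocorrelated, so ignore the
    # p-value; the distance itself shrinks as the chain grows.)
    print(n, kstest(samples, "norm").statistic)
```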
I agree completely that saying ‘after this point it has converged’ is a bit loose. As you say, burn-in is just a heuristic to step around the requirement for an infinitude of points, by removing from consideration those that are from the tail of the target distribution but overrepresented in the chain on account of an arbitrary starting point.
I’ve hardly ever heard anyone say that a Markov chain converges, speaking of the properties of a chain as a whole. Instead, they always say it has converged, past tense, referring to an event.
If you run several parallel MCMC chains at once from different starting points, there is also the loose usage of “my chains have converged”: after a certain point, they all appear to be sampling from the *same* high probability region. This seems slightly different from the meanings you mention above.
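That loose usage is usually checked with something like the Gelman-Rubin potential scale reduction factor. A minimal sketch of the classic version (code added for illustration, not from the comment):

```python
import numpy as np

def gelman_rubin(chains):
    # Classic potential scale reduction factor (R-hat).
    # `chains` is an (m, n) array: m parallel chains of n samples each,
    # started from different (ideally overdispersed) points.
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)               # close to 1 when chains agree

# Usage: stack your chains into an (m, n) array and declare "converged" in
# the loose sense only when R-hat is near 1, say below 1.01 or 1.1.
```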
Anyhow, if saying “the chain has converged” is wrong, there should be another short, simple name for what the chain *has* done (since “the chain appears to have entered a high-probability region and stayed there” doesn’t roll off the tongue easily). This is a useful concept to have a name for! If the chain has not done this yet, your inferences are likely to be bad since low-probability points are overrepresented; and if your chain never does this even after a long run, you’re likely to have a bug in your code or a mistake in your math.
Naming suggestions to replace “my chain has converged”?
For that matter, in practice, computation is usually cheap enough that I don’t see what’s wrong with running your chain(s) for a long while to check that your samples are in a given high-probability region. In fact, if I’ve just made changes to my code, it may be good to start with a point I know is “bad” — if the chain doesn’t wander over to the “correct” high-probability region, there might be a bug in my code. And if I’m going to do this anyway, why bother with all the optimization required to find a “good” starting point? Even if I did use a better method than burn-in to choose a starting point, I’d be hesitant to trust it unless I ran it for a while… so I might as well just stick with burn-in: The risk of adding more bugs in the starting-point-optimization code might outweigh the benefit of shaving a few minutes off my MCMC by not doing any burn-in.
Although starting at a “good” point removes the need for burn-in from a *math* perspective (!assuming your code has no bugs!), from a *practical* perspective I don’t see why allowing for burn-in is a bad idea. Of course, if it turns out you did start from a “good” point, then you can always keep those early iterations instead of chucking them out.
This is confused. Convergence is not about “properties of the chain as a whole,” but about how close the marginal distribution at a point in time is to the limiting (stationary) distribution. If we draw X_0 from the stationary distribution, then *all* draws are from the same stationary distribution. However, in most cases we draw the starting value from some other distribution (often an arbitrary starting value), so that the marginal distributions of X_1, X_2, …, say F_1, F_2, …, are different from the stationary distribution F. The Markov convergence theorem says that F_n converges weakly (i.e., pointwise except at points of discontinuity of F) to the stationary distribution.
The usual rationale for picking a “high probability” point is that F_n tends to be close to F even for small n. However, it’s hard to make this rigorous.
I think an extreme example might help the discussion. Consider a (highly stylized) situation where you run the sampler a bunch of times, each time forgetting the results of the last run, and for every chain you are given initial values from the true stationary distribution through some black box (that is, you don’t know anything about the stationary distribution other than you have one sample from it). The initial values “probably” come from a high probability region, but some proportion of the time they don’t. Regardless, each chain will “converge” to the stationary distribution in one step. Now imagine you construct your MC estimate based on just the first 4 or 5 samples. When you start in a region of high density your estimate will probably be better than when you started someplace in the tails, because the latter includes at least one point which is over-represented in your MCMC sample. Contrast that with the case where you have a black box routine that gives you the global mode, or a point nearby.
The first black box is *better* than anything that initializing a chain through burn-in can ever give you (since the burn-in period is finite). A silly example, but maybe instructive.
I don’t think I agree with this statement however: “I’ve hardly ever heard anyone say that a Markov chain converges, speaking of the properties of a chain as a whole. Instead, they always say it has converged, past tense, referring to an event.” In my experience people say this when they’re looking at a screenful of traceplots/acf plots/running quantiles/etc, so they are actually talking about (at least part of) the distribution of the samples in the chain. Of course what they mean to say – hopefully – is that the chain has supposedly “converged enough,” in that the distribution of the samples is close to the target distribution. But I agree that this does sometimes get conflated with the effective size of the sample, and hence the precision of the MC estimate.
It really is unfortunate that all these heuristics (and really, all we’ve got are heuristics in MCMC) are oversold and/or misunderstood by practitioners.
Doug, you’re exactly correct, thanks.
@doug Your statement of the theory is correct, but I’d just put a finer point on this: “The usual rationale for picking a “high probability” point is that F_n tends to be close to F even for small n. However, it’s hard to make this rigorous.” Obviously in a continuous state space no point has positive probability, so there’s that (hence your quotes I assume).
But the usual rationale given is actually that if you start in a region with low probability, a slowly mixing chain might not spend enough time in high-probability regions to ensure that the early points aren’t disproportionately represented in whatever (finite) average you’re computing. If you start in a region of high probability then this problem is less likely to be severe, although in principle it could be even worse for some functionals.
Imagine a bimodal distribution where the two modes are well separated, and one is much smaller than the other. Without some extra work it will probably take a really long time for the chain to visit both modes. But if you start near the big mode, the extra mass you’re missing is maybe negligible in computing your estimate; if you start near the small one then you’re probably hosed, since it might take billions of iterations to cross the “valley” (which you probably don’t even know exists). Now even if you draw x_0 from the stationary distribution, with positive probability you could pull a point from around the small mode and your estimate will suffer, even though the marginal distribution of x_t is in fact the stationary distribution. This is because even though the convergence of the chain is a property of the marginal distribution of the current point, we’re computing the average using (some part of) the entire history of the chain. If you had a routine that spat out the global mode (if only :) ) then you’d probably be wise to use it here.
Now obviously the larger point is that to be *sure* your estimate is good you need to visit the area around both modes (or more generally, areas of low probability) “often enough”. But without some knowledge of the target distribution which you probably don’t have, there’s no way to ensure it and you might never know that you missed a region with much larger density than the one your chain explored. So in practice starting near a mode is a little hedge against stopping your chain too early, as we usually do.
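To put rough numbers on the bimodal story above, here is a toy sketch (the mixture weights and separation are arbitrary choices for illustration):

```python
import numpy as np

def log_density(x):
    # Mixture target: 95% of the mass in a unit normal at 0 and 5% in a
    # unit normal at 40, so the two modes are far apart and very unequal.
    return np.logaddexp(np.log(0.95) - 0.5 * x**2,
                        np.log(0.05) - 0.5 * (x - 40.0) ** 2)

def run_chain(x0, n, seed, scale=1.0):
    # Plain random-walk Metropolis.
    rng = np.random.default_rng(seed)
    x, out = x0, np.empty(n)
    for i in range(n):
        prop = x + scale * rng.normal()
        if np.log(rng.uniform()) < log_density(prop) - log_density(x):
            x = prop
        out[i] = x
    return out

# True E[X] is 0.95 * 0 + 0.05 * 40 = 2. With unit-scale proposals the chain
# essentially never crosses the valley, so a chain started at the big mode
# estimates about 0 (off by 2), while one started at the small mode
# estimates about 40 (hopeless).
print(run_chain(0.0, 50_000, seed=0).mean())
print(run_chain(40.0, 50_000, seed=1).mean())
```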
A quick health-warning about using an optimizer to initialize:
The point with (locally) highest density may not be typical of the distribution. When sampling from a 1000-dimensional Gaussian, one never goes anywhere near the mean. Many MCMC methods will take a long time to escape pathological high-density peaks, so initializing with an optimizer can hurt. Where applicable, one could consider terminating the optimizer with an early-stopping trick before running MCMC.
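A quick numerical illustration of the 1000-dimensional remark (a sketch using exact Gaussian draws rather than MCMC):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
draws = rng.standard_normal((10_000, d))     # exact samples from N(0, I_d)
radii = np.linalg.norm(draws, axis=1)

# Typical draws lie on a thin shell of radius about sqrt(d), roughly 31.6,
# nowhere near the mode at the origin, even though that is where the
# density is highest.
print(radii.min(), radii.mean(), radii.max())
```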
Can we say that if the MCMC run is long enough, burn-in is not necessary?
Burn-in is designed to compensate for the limitations of a finite sample.