Last February I interviewed Rick Richter, CIO of Food for the Hungry, a Christian relief and development organization. This morning I spoke with Rick and was reminded of some of the challenges involved in supporting computers in poor parts of the world.
Traveling to areas most in need of relief is arduous. You may, for example, have two or three days of travel by land or water after your airplane lands. So naturally you want to do as much system administration remotely as you can.
In general you can’t expect broadband in the poorest areas, though there are some surprises. You often have to rely on satellite connections, though these are slow, unreliable, and expensive. Dial-up is preferable, if you can get it. Bandwidth may be adequate for command line access, though even then the latency is annoying. Video conferencing (e.g. for showing someone how to do something) is out of the question. So remote locations need to be mostly self-sufficient.
The most recent episode of the Plus Maths podcast describes how the London Velodrome was designed. Being a math podcast, it focuses on the optimization problems involved in the design and the finite element modeling of the structure.
The beautiful shape of the building was not an initial goal but rather a consequence of design decisions that began with the track and worked outward.
It’s perhaps surprising that, given the pragmatic design concerns of optimizing the experience of people using the velodrome and maximizing the efficiency of the building, all within the constraints of the construction methods, the design process has led to a stunningly beautiful roof that almost echoes the shape of the track.
… it’s a happy by-product of the design. There was nothing intentional in the design [of the roof] that we wanted it to look like the track. … it’s the opposite of the track …
I often hear people say they’re using a burn-in period in MCMC to run a Markov chain until it converges. But Markov chains don’t converge, at least not the Markov chains that are useful in MCMC. These Markov chains wander around forever exploring the domain they’re sampling from. Any point that makes a “bad” starting point for MCMC is a point you might reach by burn-in.
Not only that, Markov chains can’t remember how they got where they are. That’s their defining property. So if your burn-in period ends at a point x, the chain will perform exactly as if you had simply started at x.
When someone says a Markov chain has converged, they may mean that the chain has entered a high-probability region. I’ll explain in a moment why that’s desirable. But the belief/hope is that a burn-in period will put a Markov chain in a high-probability region. And it probably will, but there are a couple reasons why this isn’t necessarily the best thing to do.
Burn-in may be ineffective. You could use optimization to be certain that you’re starting in such a region. Burn-in offers no such assurance. See Burn-in and other MCMC folklore.
Burn-in may be inefficient. Casting aside worries that burn-in may not do what you want, it can be an inefficient way to find a high-probability region. MCMC isn’t designed to optimize a density function but rather to sample from it.
Why use burn-in? MCMC practitioners often don’t know how to do optimization, and in any case the corresponding optimization problem may be difficult. Also, if you’ve got the MCMC code in hand, it’s convenient to use it to find a starting point as well as for sampling.
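To make the optimization alternative concrete, here’s a minimal sketch in Python. The target density is a hypothetical stand-in (an unnormalized normal centered at 3), and the crude grid search stands in for whatever optimizer you have available: anything that lands in a high-probability region replaces the role of burn-in.

```python
# Instead of burning in, find a high-probability starting point directly,
# then run the chain from there with no burn-in period.
import math
import random

def log_density(x):
    """Unnormalized log density of the (hypothetical) target, N(3, 1)."""
    return -0.5 * (x - 3.0) ** 2

def find_start(lo=-10.0, hi=10.0, n=1000):
    """Crude optimization: grid search for a high-probability point.
    In practice you'd use a real optimizer; the point is only that
    optimization, not sampling, is what locates the region."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return max(xs, key=log_density)

def metropolis(x0, n_samples, step=1.0):
    """Random-walk Metropolis sampler starting at x0, no burn-in."""
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + random.gauss(0, step)
        log_accept = min(0.0, log_density(proposal) - log_density(x))
        if random.random() < math.exp(log_accept):
            x = proposal
        samples.append(x)
    return samples

random.seed(42)
start = find_start()              # lands at the mode, x = 3
samples = metropolis(start, 5000)
mean = sum(samples) / len(samples)
```

Because the chain is memoryless, a chain started at `start` behaves exactly as if a burn-in period had ended there, but without the wasted samples.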
So why does it matter whether you start your Markov chain in a high-probability region? In the limit, it doesn’t matter. But since you’re averaging some function of some finite number of samples, your average will be a better approximation if you start at a typical point in the density you’re sampling. If you start at a low probability location, your average may be more biased.
Samples from Markov chains don’t converge, but averages of functions applied to these samples may converge. When someone says a Markov chain has converged, they may mean they’re at a point where the average of a finite number of function applications will be a better approximation of the thing they want to compute than if they’d started at a low probability point.
It’s not just a matter of imprecise language when people say a Markov chain has converged. It sometimes betrays a misunderstanding of how Markov chains work.
Suppose you’re playing a game where you take 10 steps of a random size. Here are two variations on the game. Which will give you a better chance of ending up far from where you started?
You take your steps one at a time, starting each new step from where the last one took you.
You return to the starting point after each step. After 10 turns, you go back to where your largest step took you.
Assume all steps are in the same direction. Then you’re always better off under the first rule. The sum of all your steps will be bigger than the maximum since the maximum is one of the terms in the sum.
However, depending on the probability distribution on your random steps, you may do almost as well under the second rule, taking the maximum of your steps.
If the distribution on your step size is heavy-tailed (technically, subexponential) then the maximum and the sum have the same asymptotic distribution. That is, as x goes to infinity, the ratio Pr(X1 + X2 + … + Xn > x) / Pr(max(X1, X2, …, Xn) > x) converges to 1.
As x gets larger, the relative advantage of taking the sum rather than the maximum goes to zero. This is known as “the principle of the single big jump.” If you’re looking to make a lot of progress, adding up typical jumps isn’t likely to get you there. You need one big jump. Said another way, your total progress is about as good as the progress on your best shot.
Before you draw a life lesson from this, note that this is only true for heavy-tailed distributions. It’s not true at all if your jumps have a thin-tailed distribution. But sometimes the payoffs in life’s games are heavy-tailed.
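To see the contrast, here’s a quick computation assuming two independent standard normal jumps, a thin-tailed case. The sum is N(0, 2), so both tail probabilities have closed forms, and the ratio grows without bound rather than tending to 1.

```python
# Thin-tailed counterexample: for standard normal jumps, the tail of the
# sum is far heavier than the tail of the maximum, so their ratio blows
# up instead of tending to 1 as it does in the subexponential case.
import math

def normal_sf(x):
    """Survival function Pr(Z > x) for a standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def ratio(x):
    """Pr(X1 + X2 > x) / Pr(max(X1, X2) > x) for independent standard normals."""
    tail_sum = normal_sf(x / math.sqrt(2))        # X1 + X2 ~ N(0, 2)
    tail_max = 1.0 - (1.0 - normal_sf(x)) ** 2
    return tail_sum / tail_max

print(ratio(4))   # roughly 37
print(ratio(6))   # in the thousands
```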
What distributions are subexponential? Any heavy-tailed distribution you’re likely to have heard of: Cauchy, Lévy, Weibull (with shape < 1), log-normal, etc.
To illustrate the single jump principle, let X1 and X2 be standard Cauchy random variables. The difference between Pr(X1 + X2 > x) and Pr(max(X1, X2) > x) becomes impossible to see for x much bigger than 3.
Here’s a plot of the ratio of the two tail probabilities.
The ratio tends to 1 in the limit as predicted by the single big jump principle.
I chose to use a Cauchy distribution to simplify the calculations. (The sum of two Cauchy random variables is another Cauchy. See distribution relationships.) In this case the maximum is actually larger than the sum because the Cauchy can be positive or negative. But it is still the case that the two tails converge as x increases.
Here’s the Sage notebook that I used to create the graphs.
I also don’t do this [work from the command line] out of some perverse hipster desire for retro-computing. I have work to do. If my system didn’t work, I’d abandon it tomorrow.
That’s refreshing. Some of the more ardent command line advocates give the impression that they use the command line out of pride rather than out of a desire to get work done. Ramsay is recommending his way of working, not bragging about what he’s able to do. He has some interesting ideas, especially in his follow-up article The Mythical Man-Finger.
By the way, I’m no command line wizard; I’m a fairly typical computer user. On the other hand, my use of the command line and Emacs has been increasing.
When someone suggested to Derek Sivers that he cover the CD Baby web site with advertising, he replied
No way. Out of the question. That would be like putting a coke machine in a monastery.
I love that simile for something highly inappropriate. In context, I believe Sivers was saying that advertisement would be an inappropriate way to make money given his vision for CD Baby. But I think more of the disregard for aesthetics and historical context. I could just imagine how crass a coke machine would look in a medieval stone building, something like putting a glass and steel pyramid in the courtyard of a French Renaissance palace.
Here’s an elegant little theorem I just learned about. Informally,
A polynomial with few non-zero coefficients has few real roots.
If a polynomial has k non-zero coefficients, it can have at most 2k – 1 distinct real roots.
This says, for example, that a polynomial like x^100 – 12x^37 + 1 can have at most 5 distinct real roots, even though it has a total of 100 roots when you count complex roots and count with multiplicity.
I won’t repeat the proof here, but it’s not long or advanced.
The upper bound 2k – 1 in the theorem is sharp. To see this, note that
x(x^2 – 1)(x^2 – 4) … (x^2 – (k-1)^2)
has k non-zero coefficients and 2k – 1 distinct real roots.
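Here’s a quick sanity check of the sharp example in Python for k = 4, multiplying out the polynomial and confirming it has 4 non-zero coefficients and 2·4 − 1 = 7 distinct real roots.

```python
# Verify the sharp example x(x^2 - 1)(x^2 - 4)...(x^2 - (k-1)^2) for k = 4:
# it should have k non-zero coefficients and 2k - 1 distinct real roots.
def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def evaluate(poly, x):
    """Evaluate a coefficient list at x."""
    return sum(c * x**i for i, c in enumerate(poly))

def sharp_example(k):
    """Coefficients of x * (x^2 - 1)(x^2 - 4) ... (x^2 - (k-1)^2)."""
    poly = [0, 1]                              # the factor x
    for j in range(1, k):
        poly = poly_mul(poly, [-j * j, 0, 1])  # the factor x^2 - j^2
    return poly

k = 4
coeffs = sharp_example(k)                      # x^7 - 14x^5 + 49x^3 - 36x
nonzero = sum(1 for c in coeffs if c != 0)     # k = 4 non-zero coefficients
roots = [r for r in range(-(k - 1), k) if evaluate(coeffs, r) == 0]
# roots are 0, ±1, ±2, ±3: exactly 2k - 1 = 7 distinct real roots
```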
I’ve decided to hand my Twitter account RLangTip over to the folks at Revolution Analytics starting next week. I thought it would be better to give the account to someone who is more enthusiastic about R than I am, and so I offered it to David Smith. If you’ve enjoyed RLangTip so far, I expect you’ll like it even better under new ownership.
If you’d like to continue to hear from me on Twitter, you can follow one of my 10 other daily tip accounts or my personal account.
Descriptions of these accounts are available here.