# Functional folds and conjugate models

Bayesian calculations are intrinsically recursive: The posterior distribution from one step becomes the prior for the next step.

With conjugate models, the prior and posterior belong to the same family of distributions. If a distribution from this family is determined by a fixed set of parameters, we only need to update these parameters at each step. This updating process is defined by an integration problem, i.e. applying Bayes’ theorem, but it’s common with conjugate models for the integration to reduce to a simple algebraic operation on the parameters.

For example, suppose some binary event that happens with probability θ where θ has a beta(α, β) distribution. When we take our next observation, the posterior distribution becomes beta(α+1, β) if the event occurred and beta(α, β+1) if it didn’t. There is an integration problem in the background that reduces to simply adding a 1 to the α parameter for every success and to the β parameter for every failure. (Standard terminology is to call an observation of the thing you’re interested in a “success” regardless of whether the thing you’re observing is good or bad. “Failure” is just an observation of the thing not happening, even if it’s good that it didn’t happen.)

Conjugate models have the structure of a fold in functional programming, also known as a reduce or accumulate operation. We’ll show below how to compute the posterior distribution in a beta-binomial and normal-normal model using folds in Haskell. (Technically a left associative fold, foldl. Haskell also has a right associative fold foldr that we won’t be concerned about here.)

What does foldl require? Let’s ask GHCi, the Haskell REPL. Prior to GHC version 7.10 we would get the following.

     ghci> :type foldl
foldl :: (a -> b -> a) -> a -> [b] -> a


Unfortunately the type variables a and b just mean “one type” and “a possibly different type” and tell us nothing about how the types are used, so let’s unravel this a little at a time.

Here “a” is the type of the accumulator state. In our case, it will be the prior and posterior distribution parameters because that’s what’s being updated for each data point. Conjugacy makes this work since the prior and posterior distributions belong to the same family. The “b” type is our data. So you could read the type of foldl as follows.

foldl takes three things: a function to fold over the data, an initial accumulator state, and an array of data. It returns the final accumulator state. The accumulator function takes an accumulator state and a data value, and returns a new accumulator state.

Starting with GHC 7.10 the type of foldl is a little more general.

     ghci> :type foldl
foldl :: Foldable t => (a -> b -> a) -> a -> t b -> a


Instead of foldl operating only on lists, it can operate on any “foldable” container. To recover the type from earlier versions of the compiler, replace t with []. (Haskell uses [] b and [b] as equivalent ways of indicating a list of elements of type b.) There is one other difference I’ve edited out: The latest version of GHCi reverses the roles of ‘a’ and ‘b’. This is confusing, but has no effect since the type variable names have no meaning. I swapped the a’s and b’s back like they were to make the two definitions easier to compare.

## Beta-binomial example

Returning to the beta-binomial model discussed above, suppose we have a sequence of observations consisting of a 1 when the event in question happened and a 0 when it did not. And suppose our prior distribution on the probability of the event has a beta(α, β) distribution.

Now what function will we fold over our data to update our prior into a posterior?

     ghci> let f params y = (fst params + y, snd params + 1 - y)


The function f takes a pair of numbers, the two beta distribution parameters, and a data point y, and returns an updated set of beta parameters. If the data point is a 1 (success) then the α> parameter is incremented, and if the data point is a 0 (failure) then the β parameter is incremented. If we start with a beta(0.2, 0.8) prior and observe the data [1, 1, 0, 0, 1] then we apply f to our data as follows.

     ghci> let d = [1, 1, 0, 0, 1]
ghci> foldl f (0.2, 0.8) d


The result will be (3.2, 2.8). The three successes incremented the first beta parameter and the two failures incremented the second beta parameter.

To see how the foldl operates, we can run scanl instead. It works like foldl but returns a list of intermediate accumulator values rather than just the final accumulator value.

     ghci> scanl f (0.2, 0.8) d


This returns

     [ (0.2, 0.8), (1.2, 0.8), (2.2, 0.8), (2.2, 1.8), (2.2, 2.8), (3.2, 2.8) ].

he initial accumulator state is (0.2, 0.8) because we started with a beta(0.2, 0.8) prior. Then we increment the accumulator state to (1.2, 0.8) because the first data point was a 1. Then we increment the state to (2.2, 0.8) because the second data point is also a 1. Next we see two failures in a row and so the second part of the accumulator is incremented two times. After observing the last data point, a success, our final state is (3.2, 2.8), just as when we applied foldl.

## Normal-normal example

Next we consider another Bayesian model with a conjugate prior. We assume our data come from a normal distribution with mean θ and known variance 1. We assume a conjugate prior distribution on θ, normal with mean μ0 and variance τ02.

This time the posterior distribution calculation is more complicated and so our accumulator function is more complicated. But the form of the calculation is the same as above: we fold an accumulator function over the data.

Here is the function that encapsulates our posterior distribution calculation.

     f params y = ((m/v  + y)*newv, newv)
where
m = fst params -- mean
v = snd params -- variance
newv = v/(v + 1)


Now suppose our prior distribution on θ has mean 0 and variance 4, and we observe two values, [3, 5].

     ghci> foldl f (0, 4) [3, 5]


This returns (3.5555, 0.4444). To see the intermediate accumulation state, i.e. after just seeing y = 3, we run scanl instead and see

     [ (0, 4), (2,4, 0,8), (3.5555, 0.4444) ]

## Inspiration

This post was inspired by a paper by Brian Beckman (in progress) that shows how a Kalman filter can be implemented as a fold. From a Bayesian perspective, the thing that makes the Kalman filter work is that a certain multivariate normal model has a conjugate prior. This post shows that conjugate models more generally can be implemented as folds over the data. That’s interesting, but what does it buy you? Brian’s paper discusses this in detail, but one advantage is that it completely separates the accumulator function from the data, so the former can be tested in isolation.

## Next up: ODEs

The next post shows how to implement an ODE solver (4th order Runge-Kutta) as a fold over the list of time steps where the solution is needed.

# Kalman filters and bottom-up learning

Kalman filtering is a mixture of differential equations and statistics. Kalman filters are commonly used in tracking applications, such as tracking the location of a space probe or tracking the amount of charge left in a cell phone battery. Kalman filters provide a way to synthesize theoretical predictions and actual measurements, accounting for error in both.

Engineers naturally emphasize the differential equations and statisticians naturally emphasize the statistics. Both perspectives are valuable, but in my opinion/experience, the engineering perspective must come first.

From an engineering perspective, a Kalman filtering problem starts as a differential equation. In an ideal world, one would simply solve the differential equation and be done. But the experienced engineer realizes his or her differential equations don’t capture everything. (Unlike the engineer in this post.) Along the road to the equations at hand, there were approximations, terms left out, and various unknown unknowns.

The Kalman filter accounts for some level of uncertainty in the process dynamics and in the measurements taken. This uncertainty is modeled as randomness, but this doesn’t mean that there’s necessarily anything “random” going on. It simply acknowledges that random variables are an effective way of modeling miscellaneous effects that are unknown or too complicated to account for directly. (See Random is as random does.)

The statistical approach to Kalman filtering is to say that it is simply another estimation problem. You start from a probability model and apply Bayes’ theorem. That probability model has a term inside that happens to come from a differential equation in practice, but this is irrelevant to the statistics. The basic Kalman filter is a linear model with normal probability distributions, and this makes a closed-form solution for the posterior possible.

You’d be hard pressed to start from a statistical description of Kalman filtering, such as that given here, and have much appreciation for the motivating dynamics. Vital details have simply been abstracted away. As a client told me once when I tried to understand his problem starting from the top-down, “You’ll never get here from there.”

The statistical perspective is complementary. Some things are clear from the beginning with the statistical formulation that would take a long time to see from the engineering perspective. But while both perspectives are valuable, I believe it’s easier to start on the engineering end and work toward the statistics end rather than the other way around.

History supports this claim. The Kalman filter from the engineering perspective came first and its formulation in terms of Bayesian statistics came later. Except that’s not entirely true.

Rudolf Kálmán published his seminal paper in 1960 and four years later papers started to come out making the connection to Bayesian statistics. But while Kálmán and others were working in the US starting from the engineering end, Ruslan Stratonovich was working in Russia starting from the statistical end. Still, I believe it’s fair to say that most of the development and application of Kalman filters has proceeded from the engineering to the statistics rather than the other way around.

More on Kalman filters

# Bayesian adaptive clinical trials: promise and pitfalls

This afternoon I’m giving a talk at the Houston INFORMS chapter entitled “Bayesian adaptive clinical trials: promise and pitfalls.”

When I started working in adaptive clinical trials, I was very excited about the potential of such methods. The clinical trial methods most commonly used are very crude, and there’s plenty of room for improvement.

Over time I became concerned about overly complex methods, methods which were good for academic publication but may not be best for patients. Such methods are extremely time-consuming to develop and may not perform as well in practice as simpler methods.

There’s a great deal of opportunity between the extremes, methods that are more sophisticated than the status quo without being unnecessarily complex.

# Frequentist properties of Bayesian methods

Bayesian methods for designing clinical trials have become more common, and yet these Bayesian designs are almost always evaluated by frequentist criteria. For example, a trial may be designed to stop early 95% of the time under some bad scenario and stop no more than 20% of the time under some good scenario.

These criteria are arbitrary, since the “good” and “bad” scenarios are arbitrary, and because the stopping probability requirements of 95% and 20% are arbitrary. Still, there’s an idea in lurking in the background that in every design there must be something that is shown to happen no more than 5% of the time.

It takes a great deal of effort to design Bayesian methods with desired frequentist properties. It’s an inverse problem, searching for the parameters in a high-dimensional design space, usually via lengthy simulation, that cause the method to satisfy some criteria. Of course frequentist methods satisfy frequentist criteria by design and so meet these criteria with far less effort. It’s rare to see the tables turned, evaluating frequentist methods by Bayesian criteria.

Sometimes the effort to beat frequentist designs at their own game is futile because the frequentist designs are optimal by their own criteria. More often, however, the Bayesian and frequentist methods being compared are not direct competitors but only analogs. The aim in this case is to match the frequentist method’s operating characteristics by one criterion while doing better by a new criterion.

Sometimes a Bayesian method can be shown to have better frequentist operating characteristics than its frequentist counterpart. This puts dogmatic frequentists in the awkward position of admitting that what they see as an unjustified approach to statistics has nevertheless produced a superior product. Some anti-Bayesians are fine with this, happy to have a procedure with better frequentist properties, even though it happened to be discovered via a process they view as illegitimate.

Related postBayesian clinical trials in one zip code

# Bayesian and nonlinear

Someone said years ago that you’ll know Bayesian statistics has become mainstream when people no longer put “Bayesian” in the titles of their papers. That day has come. While the Bayesian approach is still the preferred approach of a minority of statisticians, it’s no longer a novelty. If you want people to find your paper interesting, the substance needs to be interesting. A Bayesian approach isn’t remarkable alone.

You could say the same about nonlinear differential equations. Differential equations are so often nonlinear that the “nonlinear” qualifier isn’t always necessary to say explicitly. Just as a Bayesian analysis isn’t interesting just because it’s Bayesian, a differential equation isn’t necessarily interesting just because it’s nonlinear.

The analogy between Bayesian statistics and nonlinear differential equations breaks down though. Nonlinear equations are intrinsically more interesting than linear ones. But it’s no longer remarkable to solve a nonlinear differential equation numerically.

When an adjective becomes the default, it drops off and the previous default now requires an adjective. Terms like “electronic” and “digital” are fading from use. If you say you’re going to mail someone something, the default assumption is usually that you are going to email it. What used to be simply “mail” is now “snail mail.” Digital signal processing is starting to sound quaint. The abbreviation DSP is still in common use, but digital signal processing is simply signal processing. Now non-digital signal processing requires a qualifier, i.e. analog.

There was no term for Frequentist statistics when it was utterly dominant. Now of course there is. (Some people use the term “classical,” but that’s an odd term given that Bayesian analysis is older.) The term linear has been around a long time. Even when nearly all analysis was linear, people were aware that linearity was a necessary simplification.

Related posts:

# Introduction to MCMC

Markov Chain Monte Carlo (MCMC) is a technique for getting your work done when Monte Carlo won’t work.

The problem is finding the expected value of f(X) where X is some random variable. If you can draw independent samples xi from X, the solution is simple:

When it’s possible to draw these independent samples, the sum above is well understood. It’s easy to estimate the error after n samples, or to turn it the other way around, estimate what size n you need so that the error is probably below a desired threshold.

MCMC is a way of making the approximation above work even though it’s not practical to draw independent random samples from X. In Bayesian statistics, X is the posterior distribution of your parameters and you want to find the expected value of some function of these parameters. The problem is that this posterior distribution is typically complicated, high-dimensional, and unique to your problem (unless you have a simple conjugate model). You don’t know how to draw independent samples from X, but there are standard ways to construct a Markov chain whose samples make the approximation above work. The samples are not independent but in the limit the set of samples has the same distribution as X.

MCMC is either simple or mysterious, depending on your perspective. It’s simple in that writing code that should work, eventually. Writing efficient code is another matter. And above all, knowing when your answer is good enough is tricky. If the samples were independent, the Central Limit Theorem would tell you how the sum should behave. But since the samples are dependent, all bets are off. Almost all we know is that the average on the right side converges to the expectation on the left side. Aside from toy problems there’s very little theory to tell you when your average is sufficiently close to what you want to compute.

Because there’s not much theory, there’s a lot of superstition and folklore. As with other areas, there’s truth in MCMC superstition and folklore, but also some error and nonsense.

In some ways MCMC is truly marvelous. It’s disappointing that there isn’t more solid theory around convergence, but it lets you take a stab at problems that would be utterly intractable otherwise. In theory you can’t know when it’s time to stop, and in practice you can fool yourself into thinking you’ve seen convergence when you haven’t, such as when you have a bimodal distribution and you’ve only been sampling from one mode. But it’s also common in practice to have some confidence a calculation has converged because the results are qualitatively correct: this value should be approximately this other value, this should be less than that, etc.

Bayesian statistics is older than Frequentist statistics, but it didn’t take off until statisticians discovered MCMC in the 1980s, after physicists discovered it in the 1950s. Before that time, people might dismiss Bayesian statistics as being interesting but impossible to compute. But thanks to Markov Chain Monte Carlo, and Moore’s law, many problems are numerically tractable that were not before.

Since MCMC gives a universal approach for solving certain kinds of problems, some people equate the algorithm with the problem. That is, they see MCMC as the solution rather than an algorithm to compute the solution. They forget what they were ultimately after. It’s sometimes possible to compute posterior probabilities more quickly and more accurately by other means, such as numerical integration.

# Bayes factors vs p-values

Bayesian analysis and Frequentist analysis often lead to the same conclusions by different routes. But sometimes the two forms of analysis lead to starkly different conclusions.

The following illustration of this difference comes from a talk by Luis Pericci last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.

In an experiment to test the existence of extra sensory perception (ESP), researchers wanted to see whether a person could influence some process that emitted binary data. (I’m going from memory on the details here, and I have not found Bernardo’s original paper. However, you could ignore the experimental setup and treat the following as hypothetical. The point here is not to investigate ESP but to show how Bayesian and Frequentist approaches could lead to opposite conclusions.)

The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is p = 0.5. The alternative hypothesis was that p is not 0.5. There were N = 104,490,000 bits emitted during the experiment, and s = 52,263,471 were 1’s. The p-value, the probability of an imbalance this large or larger under the assumption that p = 0.5, is 0.0003. Such a tiny p-value would be regarded as extremely strong evidence in favor of ESP given the way p-values are commonly interpreted.

The Bayes factor, however, is 18.7, meaning that the null hypothesis appears to be about 19 times more likely than the alternative. The alternative in this example uses Jeffreys’ prior, Beta(0.5, 0.5).

So given the data and assumptions in this example, the Frequentist concludes there is very strong evidence for ESP while the Bayesian concludes there is strong evidence against ESP.

The following Python code shows how one might calculate the p-value and Bayes factor.

from scipy.stats import binom
from scipy import log, exp
from scipy.special import betaln

N = 104490000
s = 52263471

# sf is the survival function, i.e. complementary cdf
# ccdf multiplied by 2 because we're doing a two-sided test
print("p-value: ", 2*binom.sf(s, N, 0.5))

# Compute the log of the Bayes factor to avoid underflow.
logbf = N*log(0.5) - betaln(s+0.5, N-s+0.5) + betaln(0.5, 0.5)
print("Bayes factor: ", exp(logbf))


# More data, less accuracy

Statistical methods should do better with more data. That’s essentially what the technical term “consistency” means. But with improper numerical techniques, the the numerical error can increase with more data, overshadowing the decreasing statistical error.

There are three ways Bayesian posterior probability calculations can degrade with more data:

1. Polynomial approximation
2. Missing the spike
3. Underflow

Elementary numerical integration algorithms, such as Gaussian quadrature, are based on polynomial approximations. The method aims to exactly integrate a polynomial that approximates the integrand. But likelihood functions are not approximately polynomial, and they become less like polynomials when they contain more data. They become more like a normal density, asymptotically flat in the tails, something no polynomial can do. With better integration techniques, the integration accuracy will improve with more data rather than degrade.

With more data, the posterior distribution becomes more concentrated. This means that a naive approach to integration might entirely miss the part of the integrand where nearly all the mass is concentrated. You need to make sure your integration method is putting its effort where the action is. Fortunately, it’s easy to estimate where the mode should be.

The third problem is that software calculating the likelihood function can underflow with even a moderate amount of data. The usual solution is to work with the logarithm of the likelihood function, but with numerical integration the solution isn’t quite that simple. You need to integrate the likelihood function itself, not its logarithm. I describe how to deal with this situation in Avoiding underflow in Bayesian computations.

# Convenient and innocuous priors

Andrew Gelman has some interesting comments on non-informative priors this morning. Rather than thinking of the prior as a static thing, think of it as a way to prime the pump.

… a non-informative prior is a placeholder: you can use the non-informative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information. …

At first this may sound like tweaking your analysis until you get the conclusion you want. It’s like the old joke about consultants: the client asks what 2+2 equals and the consultant counters by asking the client what he wants it to equal. But that’s not what Andrew is recommending.

A prior distribution cannot strictly be non-informative, but there are common intuitive notions of what it means to be non-informative. It may be helpful to substitute “convenient” or “innocuous” for “non-informative.” My take on Andrew’s advice is something like this.

Start with a prior distribution that’s easy to use and that nobody is going to give you grief for using. Maybe the prior doesn’t make much difference. But if your convenient/innocuous prior leads to too vague a conclusion, go back and use a more realistic prior, one that requires more effort or risks more criticism.

It’s odd that realistic priors can be more controversial than unrealistic priors, but that’s been my experience. It’s OK to be unrealistic as long as you’re conventional.

# Levels of uncertainty

The other day I heard someone say something like the following:

I can’t believe how people don’t understand probability. They don’t realize that if a coin comes up heads 20 times, on the next flip there’s still a 50-50 chance of it coming up tails.

But if I saw a coin come up heads 20 times, I’d suspect it would come up heads the next time.

There are two levels of uncertainty here. If the probability of a coin coming up heads is θ = 1/2 and the tosses are independent, then yes, the probability of a head is 1/2 each time, regardless of how many heads have shown before. The parameter θ models our uncertainty regarding which side will show after a toss of the coin. That’s the first level of uncertainty.

But what about our uncertainty in the value of θ? Twenty flips showing the same side up should cause us to question whether θ really is 1/2. Maybe it’s a biased coin and θ is greater than 1/2. Or maybe it really is a fair coin and we’ve just seen a one-in-a-million event. (Such events do happen, but only one in a million times.) Our uncertainty regarding the value of θ is a second level of uncertainty.

Frequentist statistics approaches these two kinds of uncertainty differently. That approach says that θ is a constant but unknown quantity. Probability describes the uncertainty regarding the coin toss given some θ but not the uncertainty regarding θ. The Bayesian models all uncertainty using probability. So the outcome of the coin toss given θ is random, but θ itself is also random. It’s turtles all the way down.

It’s possible to have different degrees of uncertainty at each level. You could, for example, calculate the probability of some quantum event very accurately. If that probability is near 1/2, there’s a lot of uncertainty regarding the event itself, but little uncertainty about the parameter. High uncertainty at the first level, low uncertainty at the next level. If you warp a coin, it may not be apparent what effect that will have on the probability of the outcome. Now there’s significant uncertainty at the first and second level.

We’ve implicitly assumed that a single parameter θ describes the uncertainty in a coin toss outcome. Maybe that’s not true. Maybe the person tossing the coin has the ability to influence the outcome. (Some very skilled people can. I’ve heard rumors that Persi Diaconis is good at this.) Now we have a third level of uncertainty, uncertainty regarding our model and not just its parameter.

If you’re sure that a parameter θ describes the coin toss, but you don’t know θ, then the coin toss outcome is an known unknown and θ is an unknown unknown, a second-order uncertainty. More often though people use the term “unknown unknown” to describe a third-order uncertainty, unforeseen factors that are not included in a model, not even as uncertain parameters.

# Bayes : Python :: Frequentist : Perl

Bayesian statistics is to Python as frequentist statistics is to Perl.

Perl has the slogan “There’s more than one way to do it,” abbreviated TMTOWTDI and pronounced “tim toady.” Perl prides itself on variety.

Python takes the opposite approach. The Zen of Python says “There should be one — and preferably only one — obvious way to do it.” Python prides itself on consistency.

Frequentist statistics has a variety of approaches and criteria for various problems. Bayesian critics call this “adhockery.”

Bayesian statistics has one way to do everything: write down a likelihood function and prior distribution, then add data and compute a posterior distribution. This is sometimes called “turning the Bayesian crank.”

# A statistical problem with “nothing to hide”

One problem with the nothing-to-hide argument is that it assumes innocent people will be exonerated certainly and effortlessly. That is, it assumes that there are no errors, or if there are, they are resolved quickly and easily.

Suppose the probability of a correctly analyzing an email or phone call is not 100% but 99.99%. In other words, there’s one chance in 10,000 of an innocent message being incriminating. Imagine authorities analyzing one message each from 300,000,000 people, roughly the population of the United States. Then around 30,000 innocent people will have some ‘splaining to do. They will have to interrupt their dinner to answer questions from an agent knocking on their door, or maybe they’ll spend a few weeks in custody. If the legal system is 99.99% reliable, then three of them will go to prison.

Now suppose false positives are really rare, one in a million. If you analyze 100 messages from each person rather than just one, you’re approximately back to the scenario above.

Scientists call indiscriminately looking through large amounts of data “a fishing expedition” or “data dredging.” One way to mitigate the problem of massive false positives from data dredging is to demand a hypothesis: before you look through the data, say what you’re hoping to prove and why you think it’s plausible.

The legal analog of a plausible hypothesis is a search warrant. In statistical terms, “probable cause” is a judge’s estimation that the prior probability of a hypothesis is moderately high. Requiring scientists to have a hypothesis and requiring law enforcement to have a search warrant both dramatically reduce the number of false positives.

Related:

# A priori overfitting

The term overfitting usually describes fitting too complex a model to available data. But it is possible to overfit a model before there are any data.

An experimental design, such as a clinical trial, proposes some model to describe the data that will be collected. For simple, well-known models the behavior of the design may be known analytically. For more complex or novel methods, the behavior is evaluated via simulation.

If an experimental design makes strong assumptions about data, and is then simulated with scenarios that follow those assumptions, the design should work well. So designs must be evaluated using scenarios that do not exactly follow the model assumptions. Here lies a dilemma: how far should scenarios deviate from model assumptions? If they do not deviate at all, you don’t have a fair evaluation. But deviating too far is unreasonable as well: no method can be expected to work well when it’s assumptions are flagrantly violated.

With complex designs, it may not be clear to what extent scenarios deviate from modeling assumptions. The method may be robust to some kinds of deviations but not to others. Simulation scenarios for complex designs are samples from a high dimensional space, and it is impossible to adequately explore a high dimensional space with a small number of points. Even if these scenarios were chosen at random—which would be an improvement over manually selecting scenarios that present a method in the best light—how do you specify a probability distribution on the scenarios? You’re back to a variation on the previous problem.

Once you have the data in hand, you can try a complex model and see how well it fits. But with experimental design, the model is determined before there are any data, and thus there is no possibility of rejecting the model for being a poor fit. You might decide after its too late, after the data have been collected, that the model was a poor fit. However, retrospective model criticism is complicated for adaptive experimental designs because the model influenced which data were collected.

This is especially a problem for one-of-a-kind experimental designs. When evaluating experimental designs — not the data in the experiment but the experimental design itself—each experiment is one data point. With only one data point, it’s hard to criticize a design. This means we must rely on simulation, where it is possible to obtain many data points. However, this brings us back to the arbitrary choice of simulation scenarios. In this case there are no empirical data to test the model assumptions.

Related posts:

# Offended by conditional probability

It’s a simple rule of probability that if A makes B more likely, B makes A more likely. That is, if the conditional probability of A given B is larger than the probability of A alone, the the conditional probability of B given A is larger than the probability of B alone. In symbols,

Prob( A | B ) > Prob( A ) ⇒ Prob( B | A ) > Prob( B ).

The proof is trivial: Apply the definition of conditional probability and observe that if Prob( AB ) / Prob( B ) > Prob( A ), then Prob( AB ) / Prob( A ) > Prob( B ).

Let A be the event that someone was born in Arkansas and let B be the event that this person has been president of the United States. There are five living current and former US presidents, and one of them, Bill Clinton, was born in Arkansas, a state with about 1% of the US population. Knowing that someone has been president increases your estimation of the probability that this person is from Arkansas. Similarly, knowing that someone is from Arkansas should increase your estimation of the chances that this person has been president.

The chances that an American selected at random has been president are very small, but as small as this probability is, it goes up if you know the person is from Arkansas. In fact, it goes up by the same proportion as the opposite probability. Knowing that someone has been president increases their probability of being from Arkansas by a factor of 20, so knowing that someone is from Arkansas increases the probability that they have been president by a factor of 20 as well. This is because

Prob( A | B ) / Prob( A ) = Prob( B | A ) / Prob( B ).

This isn’t controversial when we’re talking about presidents and where they were born. But it becomes more controversial when we apply the same reasoning, for example, to deciding who should be screened at airports.

When I jokingly said that being an Emacs user makes you a better programmer, it appears a few Vim users got upset. Whether they were serious or not, it does seem that they thought “Hey, what does that say about me? I use Vim. Does that mean I’m a bad programmer?”

Assume for the sake of argument that Emacs users are better programmers, i.e.

Prob( good programmer | Emacs user )  >  Prob( good programmer ).

We’re not assuming that Emacs users are necessarily better programmers, only that a larger proportion of Emacs users are good programmers. And we’re not saying anything about causality, only probability.

Does this imply that being a Vim user lowers your chance of being a good programmer? i.e.

Prob( good programmer | Vim user )  <  Prob( good programmer )?

No, because being a Vim user is a specific alternative to being an Emacs user, and there are programmers who use neither Emacs nor Vim. What the above statement about Emacs would imply is that

Prob( good programmer | not a Emacs user )  <  Prob( good programmer ).

That is, if knowing that someone uses Emacs increases the chances that they are a good programmer, then knowing that they are not an Emacs user does indeed lower the chances that they are a good programmer, if we have no other information. In general

Prob( A | B ) > Prob( A ) ⇒ Prob( A | not B ) < Prob( A ).

To take a more plausible example, suppose that spending four years at MIT obtaining a computer science degree makes you a better programmer. Then knowing that someone has a CS degree from MIT increases the probability that this person is a good programmer. But if that’s true, it must also be true that absent any other information, knowing that someone does not have a CS degree from MIT decreases the probability that this person is a good programmer. If a larger proportion of good programmers come from MIT, then a smaller proportion must not come from MIT.

* * *

This post uses the ideas of information and conditional probability interchangeably. If you’d like to read more on that perspective, I recommend Probability Theory: The Logic of Science by E. T. Jaynes.

# Closet Bayesian

When I was a grad student, a statistics postdoc confided to me that he was a “closet Bayesian.” This sounded absolutely bizarre. Why would someone be secretive about his preferred approach to statistics? I could not imagine someone whispering that although she’s doing her thesis in algebra, she’s secretively interested in differential equations.

I knew nothing about statistics at the time and was surprised to find that there was a bitter rivalry between two schools of statistics. The rivalry is still there, though it’s not as bitter as it once was.

I find it grating when someone asks “Are you a Bayesian?” It implies an inappropriate degree of commitment and exclusivity. Bayesian statistics is just a tool. Statistics itself is just tool, one way of understanding the world.

My car has a manual transmission. I prefer manual transmissions. But if someone asked whether I was a manual transmissionist, I’d look at them like they’re crazy. I don’t have any moral objections to automatic transmissions.

I evaluate a car by how well it works. And for most purposes, I prefer the way a manual transmission works. But when I’m teaching one of my kids to drive, we go out in my wife’s car with an automatic transmission. Similarly, I evaluate a mathematical model (statistical or otherwise) by how it works for a given purpose. Sometimes a Bayesian and a frequentist approach lead to the same conclusions, but the latter is easier to understand or implement. Sometimes a Bayesian method leads to a better result because it can use more information or is easier to interpret. Sometimes it’s a toss up and I use a Bayesian approach because its more familiar, just like my old car.