Bayesian calculations are intrinsically recursive: The posterior distribution from one step becomes the prior for the next step.
With conjugate models, the prior and posterior belong to the same family of distributions. If a distribution from this family is determined by a fixed set of parameters, we only need to update these parameters at each step. This updating process is defined by an integration problem, i.e. applying Bayes’ theorem, but it’s common with conjugate models for the integration to reduce to a simple algebraic operation on the parameters.
For example, suppose some binary event that happens with probability θ where θ has a beta(α, β) distribution. When we take our next observation, the posterior distribution becomes beta(α+1, β) if the event occurred and beta(α, β+1) if it didn’t. There is an integration problem in the background that reduces to simply adding a 1 to the α parameter for every success and to the β parameter for every failure. (Standard terminology is to call an observation of the thing you’re interested in a “success” regardless of whether the thing you’re observing is good or bad. “Failure” is just an observation of the thing not happening, even if it’s good that it didn’t happen.)
Conjugate models have the structure of a fold in functional programming, also known as a reduce or accumulate operation. We’ll show below how to compute the posterior distribution in a beta-binomial and normal-normal model using folds in Haskell. (Technically a left associative fold,
foldl. Haskell also has a right associative fold
foldr that we won’t be concerned about here.)
Folds in Haskell
foldl require? Let’s ask GHCi, the Haskell REPL. Prior to GHC version 7.10 we would get the following.
ghci> :type foldl foldl :: (a -> b -> a) -> a -> [b] -> a
Unfortunately the type variables
b just mean “one type” and “a possibly different type” and tell us nothing about how the types are used, so let’s unravel this a little at a time.
Here “a” is the type of the accumulator state. In our case, it will be the prior and posterior distribution parameters because that’s what’s being updated for each data point. Conjugacy makes this work since the prior and posterior distributions belong to the same family. The “b” type is our data. So you could read the type of
foldl as follows.
foldltakes three things: a function to fold over the data, an initial accumulator state, and an array of data. It returns the final accumulator state. The accumulator function takes an accumulator state and a data value, and returns a new accumulator state.
Starting with GHC 7.10 the type of
foldl is a little more general.
ghci> :type foldl foldl :: Foldable t => (a -> b -> a) -> a -> t b -> a
foldl operating only on lists, it can operate on any “foldable” container. To recover the type from earlier versions of the compiler, replace
. (Haskell uses
 b and
[b] as equivalent ways of indicating a list of elements of type
b.) There is one other difference I’ve edited out: The latest version of GHCi reverses the roles of ‘a’ and ‘b’. This is confusing, but has no effect since the type variable names have no meaning. I swapped the a’s and b’s back like they were to make the two definitions easier to compare.
Returning to the beta-binomial model discussed above, suppose we have a sequence of observations consisting of a 1 when the event in question happened and a 0 when it did not. And suppose our prior distribution on the probability of the event has a beta(α, β) distribution.
Now what function will we fold over our data to update our prior into a posterior?
ghci> let f params y = (fst params + y, snd params + 1 - y)
f takes a pair of numbers, the two beta distribution parameters, and a data point
y, and returns an updated set of beta parameters. If the data point is a 1 (success) then the α> parameter is incremented, and if the data point is a 0 (failure) then the β parameter is incremented. If we start with a beta(0.2, 0.8) prior and observe the data [1, 1, 0, 0, 1] then we apply f to our data as follows.
ghci> let d = [1, 1, 0, 0, 1] ghci> foldl f (0.2, 0.8) d
The result will be (3.2, 2.8). The three successes incremented the first beta parameter and the two failures incremented the second beta parameter.
To see how the
foldl operates, we can run
scanl instead. It works like
foldl but returns a list of intermediate accumulator values rather than just the final accumulator value.
ghci> scanl f (0.2, 0.8) d
[ (0.2, 0.8), (1.2, 0.8), (2.2, 0.8), (2.2, 1.8), (2.2, 2.8), (3.2, 2.8) ].
he initial accumulator state is (0.2, 0.8) because we started with a beta(0.2, 0.8) prior. Then we increment the accumulator state to (1.2, 0.8) because the first data point was a 1. Then we increment the state to (2.2, 0.8) because the second data point is also a 1. Next we see two failures in a row and so the second part of the accumulator is incremented two times. After observing the last data point, a success, our final state is (3.2, 2.8), just as when we applied
Next we consider another Bayesian model with a conjugate prior. We assume our data come from a normal distribution with mean θ and known variance 1. We assume a conjugate prior distribution on θ, normal with mean μ0 and variance τ02.
This time the posterior distribution calculation is more complicated and so our accumulator function is more complicated. But the form of the calculation is the same as above: we fold an accumulator function over the data.
Here is the function that encapsulates our posterior distribution calculation.
f params y = ((m/v + y)*newv, newv) where m = fst params -- mean v = snd params -- variance newv = v/(v + 1)
Now suppose our prior distribution on θ has mean 0 and variance 4, and we observe two values, [3, 5].
ghci> foldl f (0, 4) [3, 5]
This returns (3.5555, 0.4444). To see the intermediate accumulation state, i.e. after just seeing y = 3, we run
scanl instead and see
[ (0, 4), (2,4, 0,8), (3.5555, 0.4444) ]
This post was inspired by a paper by Brian Beckman (in progress) that shows how a Kalman filter can be implemented as a fold. From a Bayesian perspective, the thing that makes the Kalman filter work is that a certain multivariate normal model has a conjugate prior. This post shows that conjugate models more generally can be implemented as folds over the data. That’s interesting, but what does it buy you? Brian’s paper discusses this in detail, but one advantage is that it completely separates the accumulator function from the data, so the former can be tested in isolation.
Next up: ODEs
The next post shows how to implement an ODE solver (4th order Runge-Kutta) as a fold over the list of time steps where the solution is needed.