# Six sigma events

I saw on Twitter this afternoon a paraphrase of a quote from Nassim Taleb to the effect that if you see a six-sigma event, that’s evidence that it wasn’t really a six-sigma event.

What does that mean? Six sigma means six standard deviations away from the mean of a probability distribution, sigma (σ) being the common notation for a standard deviation. Moreover, the underlying distribution is implicitly a normal (Gaussian) distribution; people don’t commonly talk about “six sigma” in the context of other distributions [1]. Here’s a table to indicate the odds against a k-sigma event for various k.

        |-------+-----------------|
| Sigma | Odds            |
|-------+-----------------|
|     1 | 2 : 1           |
|     2 | 21 : 1          |
|     3 | 370 : 1         |
|     4 | 16,000 : 1      |
|     5 | 1,700,000 : 1   |
|     6 | 500,000,000 : 1 |
|-------+-----------------|


If you see something that according to your assumptions should happen twice in a billion tries, maybe you’ve seen something extraordinarily rare, or maybe your assumptions were wrong. Taleb’s comment suggests the latter is more likely.

## Bayes rule and Bayes factors

You could formalize this with Bayes rule. For example, suppose you’re 99% sure the thing you’re looking at has a normal distribution with variance 1, but you’re willing to concede there’s a 1% chance that what you’re looking at has a heavier-tailed distribution, say a Student t distribution with 10 degrees of freedom, rescaled to also have variance 1.

It’s hard to tell the two distributions apart, especially in the tails. But although both are small in the tails, the normal is relatively much smaller.

Now suppose you’ve seen an observation greater than 6. The Bayes factor in favor of the t distribution hypothesis is 272. This means that even though before seeing any data you thought the odds were 99 to 1 in favor of the data coming from a normal distribution, after seeing such a large observation you would put the odds at 272 to 1 in favor of the t distribution.

If you allow a small possibility that your assumption of a normal distribution is wrong (see Cromwell’s rule) then seeing an extreme event will radically change your mind. You don’t have to think the heavier-tailed distribution is equally likely, just a possibility. If you did think a priori that both possibilities were equally likely, the posterior odds for the t distribution would be 27,000 to 1.

In this example we’re comparing the normal distribution to a very specific and somewhat arbitrary alternative. Our alternative was just an example. You could have picked a wide variety of alternatives that would have given a qualitatively similar result, reversing your a priori confidence in a normal model.

By the way, a t distribution with 10 degrees of freedom is not a very heavy-tailed distribution. It has heavier tails than a normal for sure, but not nearly as heavy as a Cauchy, which corresponds to a t with only one degree of freedom. If we had used a distribution with a heavier tail, the posterior odds in favor of that distribution would have been higher.

***

[1] A six-sigma event isn’t that rare unless your probability distribution is normal. By Markov’s inequality, the probability is less than 1/36 for any distribution. The rarity of six-sigma events comes from the assumption of a normal distribution more than from the number of sigmas per se.

# Hypothesis testing vs estimation

I was looking at my daughter’s statistics homework recently, and there were a pair of questions about testing the level of lead in drinking water. One question concerned testing whether the water was safe, and the other concerned testing whether the water was unsafe.

There’s something bizarre, even embarrassing, about this. You want to do two things: estimate the amount of lead, and decide what to do in response. But instead of simply doing just that, you do this arcane dance of choosing two hypotheses, one natural and one arbitrary, and treating the two asymmetrically, depending on which one you call the null and which you call the alternative. This asymmetry is the reason you make a distinction between testing whether the water is safe and testing whether it is unsafe.

It’s a weird tangle of estimation and decision making. The decision-making rules implicit in the procedure are not at all transparent. And even though you are testing the level of lead, you’re doing so indirectly.

The Bayesian approach to the problem is much easier to understand. You estimate the probability distribution for the concentration of lead based on all available information. You can plot this distribution and show it to civil engineers, politicians, or anybody else who needs to make a decision. Non-statisticians are much more likely to understand such a plot than the nuances of null and alternative hypotheses, significance, power, and whether you’re testing for safety versus testing for non-safety. (Statisticians are more likely to understand estimation as well.)

In the homework problems, the allowable level of lead was 15 ppm. After obtaining the posterior distribution on the concentration of lead, you could simply estimate the probability that the concentration is above 15 ppm. But you could also calculate the probability that the concentration lies in any other range you’re interested in.

Classical statistics does not allow such probability calculations. Even a confidence interval, something that looks like a probability statement about the concentration of lead, is actually a probability statement about the statistical process being used and not a probability statement about lead concentration per se.

# Laplace approximation of an integral from Bayesian logistic regression

Define

and

This integral comes up in Bayesian logisitic regression with a uniform (improper) prior. We will use this integral to illustrate a simple case of Laplace approximation.

The idea of Laplace approximation is to approximate the integral of a Gaussian-like function by the integral of a (scaled) Gaussian with the same mode and same curvature at the mode. That’s the most common, second order Laplace approximation. More generally, Laplace’s approximation depends on truncating a Taylor series, and truncating after the 2nd order term gives the approximation described above.

Let x0 be the place where f takes on its maximum and let g = log f. Note that g also takes on its maximum at x0 and so its first derivative is zero there. Then the (second order) Laplace approximation has the following derivation.

We can show that

and

It follows that x0 = log (m/n) and the Laplace approximation for I(m, n) reduces to

.

Now let’s compare the Laplace approximation to the exact value of the integral for a few values of m and n. We would expect the Laplace approximation to be more accurate when the integrand is shaped more like the Gaussian density. This happens when m and n are large and when they are more nearly equal. (The function f is a likelihood, and m+n is the number of data points in the likelihood. More data means the likelihood is more nearly normal. Also, m = n means f is symmetric, which improves the normal approximation.)

We’ll look at (m, n) = (2, 1), (20, 10), (200, 100), and (15, 15). Here’s the plot of f(x, 2, 1) with its normal approximation, scaled vertically so that the heights match.

Even for small arguments, the Gaussian approximation fits well on the left side, but not so much on the right. For large arguments it would be hard to see the difference between the two curves.

Here are the results of the Laplace approximation and exact integration.

|-----+-----+---------------+---------------+--------------|
|   m |   n |        approx |         exact |    rel error |
|-----+-----+---------------+---------------+--------------|
|   2 |   1 |    0.45481187 |           0.5 | -0.090376260 |
|  20 |  10 |  4.9442206e-9 | 4.99250870e-9 | -0.009672111 |
|  15 |  15 | 8.5243139e-10 | 8.5956334e-10 | -0.008297178 |
| 200 | 100 | 3.6037801e-84 | 3.6072854e-84 | -0.000971728 |
|-----+-----+---------------+---------------+--------------|


The integrals were computed exactly using Mathematica; the results are rational numbers. [1]

Note that when m and n increase by a factor of 10, the relative error goes down by about a factor of 10. This is in keeping with theory that say the error should be O(1/(m+n)). Also, note that the error for (15, 15) is a little lower than that for (20, 10). The errors are qualitatively just what we expected.

***

[1] I initially used Mathematica to compute the integrals, but then realized that’s not necessary. The change of variables u = exp(x) / (1 + exp(x)) shows that

I(ab) = B(ab+1) + B(a+1, b)

where B is the beta function. Incidentally, this shows that if a and b are integers then  I(ab) is rational.

# Quantifying information gain in beta-binomial Bayesian model

The beta-binomial model is the “hello world” example of Bayesian statistics. I would call it a toy model, except it is actually useful. It’s not nearly as complicated as most models used in application, but it illustrates the basics of Bayesian inference. Because it’s a conjugate model, the calculations work out trivially.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the a and b parameters of the prior and the posterior to compute how much information was gained.

    from scipy.integrate import quad
from scipy.stats import beta as beta
from scipy import log2

def infogain(post_a, post_b, prior_a, prior_b):

p = beta(post_a, post_b).pdf
q = beta(prior_a, prior_b).pdf

(info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
return info


This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine quad needs some help when the beta distributions become more concentrated.

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so your expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

    print( infogain(4, 7, 3, 7) )
print( infogain(3, 8, 3, 7) )


The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.

# Randomized response, privacy, and Bayes theorem

Suppose you want to gather data on an incriminating question. For example, maybe a statistics professor would like to know how many students cheated on a test. Being a statistician, the professor has a clever way to find out what he wants to know while giving each student deniability.

## Randomized response

Each student is asked to flip two coins. If the first coin comes up heads, the student answers the question truthfully, yes or no. Otherwise the student reports “yes” if the second coin came up heads and “no” it came up tails. Every student has deniability because each “yes” answer may have come from an innocent student who flipped tails on the first coin and heads on the second.

How can the professor estimate p, the proportion of students who cheated? Around half the students will get a head on the first coin and answer truthfully; the rest will look at the second coin and answer yes or no with equal probability. So the expected proportion of yes answers is Y = 0.5p + 0.25, and we can estimate p as 2Y – 0.5.

## Database anonymization

The calculations above assume that everyone complied with the protocol, which may not be reasonable. If everyone were honest, there’d be no reason for this exercise in the first place. But we could imagine another scenario. Someone holds a database with identifiers and answers to a yes/no question. The owner of the database could follow the procedure above to introduce randomness in the data before giving the data over to someone else.

## Information contained in a randomized response

What can we infer from someone’s randomized response to the cheating question? There’s nothing you can infer with certainty; that’s the point of introducing randomness. But that doesn’t mean that the answers contain no information. If we completely randomized the responses, dispensing with the first coin flip, then the responses would contain no information. The responses do contain information, but not enough to be incriminating.

Let C be a random variable representing whether someone cheated, and let R be their response, following the randomization procedure above. Given a response R = 1, what is the probability p that C = 1, i.e. that someone cheated? This is a classic application of Bayes’ theorem.

If we didn’t know someone’s response, we would estimate their probability of having cheated as p, the group average. But knowing that their response was “yes” we update our estimate to 3p / (2p + 1). At the extremes of p = 0 and p = 1 these coincide. But for any value of p strictly between 0 and 1, our estimate goes up. That is, the probability that someone cheated, conditional on knowing they responded “yes”, is higher than the unconditional probability. In symbols, we have

when 0 < < 1. The difference between the left and right sides above is maximized when p = (√3 – 1)/2 = 0.366. That is, a “yes” response tells us the most when about 1/3 of the students cheated. When p = 0.366, P(= 1 | R= 1) = 0.634, i.e. the posterior probability is almost twice the prior probability.

You could go through a similar exercise with Bayes theorem to show that P(C = 1 | R = 0) = p/(3 – 2p), which is less than p provided 0 < p < 1. So if someone answers “yes” to cheating, that does make it more likely that the actually cheated, but not so much more that you can justly accuse them of cheating. (Unless p = 1, in which case you’re in the realm of logic rather than probability: if everyone cheated, then you can conclude that any individual cheated.)

Update: See the next post for a more general randomization scheme and more about the trade-off between privacy and utility. The post after that gives an overview of randomization for more general kinds of data.

If you would like help with database de-identification, please let me know.

# Bayesian methods at Bletchley Park

From Nick Patterson’s interview on Talking Machines:

GCHQ in the ’70s, we thought of ourselves as completely Bayesian statisticians. All our data analysis was completely Bayesian, and that was a direct inheritance from Alan Turing. I’m not sure this has ever really been published, but Turing, almost as a sideline during his cryptoanalytic work, reinvented Bayesian statistics for himself. The work against Enigma and other German ciphers was fully Bayesian. …

Bayesian statistics was an extreme minority discipline in the ’70s. In academia, I only really know of two people who were working majorly in the field, Jimmy Savage … in the States and Dennis Lindley in Britain. And they were regarded as fringe figures in the statistics community. It’s extremely different now. The reason is that Bayesian statistics works. So eventually truth will out. There are many, many problems where Bayesian methods are obviously the right thing to do. But in the ’70s we understood that already in Britain in the classified environment.

# Effective sample size for MCMC

In applications we’d like to draw independent random samples from complicated probability distributions, often the posterior distribution on parameters in a Bayesian analysis. Most of the time this is impractical.

MCMC (Markov Chain Monte Carlo) gives us a way around this impasse. It lets us draw samples from practically any probability distribution. But there’s a catch: the samples are not independent. This lack of independence means that all the familiar theory on convergence of sums of random variables goes out the window.

There’s not much theory to guide assessing the convergence of sums of MCMC samples, but there are heuristics. One of these is effective sample size (ESS). The idea is to have a sort of “exchange rate” between dependent and independent samples. You might want to say, for example, that 1,000 samples from a certain Markov chain are worth about as much as 80 independent samples because the MCMC samples are highly correlated. Or you might want to say that 1,000 samples from a different Markov chain are worth about as much as 300 independent samples because although the MCMC samples are dependent, they’re weakly correlated.

Here’s the definition of ESS:

where n is the number of samples and ρ(k) is the correlation at lag k.

This behaves well in the extremes. If your samples are independent, your effective samples size equals the actual sample size. If the correlation at lag k decreases extremely slowly, so slowly that the sum in the denominator diverges, your effective sample size is zero.

Any reasonable Markov chain is between the extremes. Zero lag correlation is too much to hope for, but ideally the correlations die off fast enough that the sum in the denominator not only converges but also isn’t a terribly large value.

I’m not sure who first proposed this definition of ESS. There’s a reference to it in Handbook of Markov Chain Monte Carlo where the authors cite a paper [1] in which Radford Neal mentions it. Neal cites B. D. Ripley [2].

[1] Markov Chain Monte Carlo in Practice: A Roundtable Discussion. Robert E. Kass, Bradley P. Carlin, Andrew Gelman and Radford M. Neal. The American Statistician. Vol. 52, No. 2 (May, 1998), pp. 93-100

[2] Stochlastic Simulation, B. D. Ripley, 1987.

# Freudian hypothesis testing

In his paper Mindless statistics, Gerd Gigerenzer uses a Freudian analogy to describe the mental conflict researchers experience over statistical hypothesis testing. He says that the “statistical ritual” of NHST (null hypothesis significance testing) “is a form of conflict resolution, like compulsive hand washing.”

In Gigerenzer’s analogy, the id represents Bayesian analysis. Deep down, a researcher wants to know the probabilities of hypotheses being true. This is something that Bayesian statistics makes possible, but more conventional frequentist statistics does not.

The ego represents R. A. Fisher’s significance testing: specify a null hypothesis only, not an alternative, and report a p-value. Significance is calculated after collecting the data. This makes it easy to publish papers. The researcher never clearly states his hypothesis, and yet takes credit for having established it after rejecting the null. This leads to feelings of guilt and shame.

The superego represents the Neyman-Pearson version of hypothesis testing: pre-specified alternative hypotheses, power and sample size calculations, etc. Neyman and Pearson insist that hypothesis testing is about what to do, not what to believe. [1]

I assume Gigerenzer doesn’t take this analogy too seriously. In context, it’s a humorous interlude in his polemic against rote statistical ritual.

But there really is a conflict in hypothesis testing. Researchers naturally think in Bayesian terms, and interpret frequentist results as if they were Bayesian. They really do want probabilities associated with hypotheses, and will imagine they have them even though frequentist theory explicitly forbids this. The rest of the analogy, comparing the ego and superego to Fisher and Neyman-Pearson respectively, seems weaker to me. But I suppose you could imagine Neyman and Pearson playing the role of your conscience, making you feel guilty about the pragmatic but unprincipled use of p-values.

* * *

[1] “No test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern behaviour in regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong.”

Neyman J, Pearson E. On the problem of the most efficient tests of statistical hypotheses. Philos Trans Roy Soc A, 1933;231:289, 337.

# Subjectivity in statistics

Andrew Gelman on subjectivity in statistics:

Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.

People worry about prior distributions, not because they’re subjective, but because they’re explicitly subjective. There are many other subjective factors, common to Bayesian and Frequentist statistics, but these are implicitly subjective.

In practice, prior distributions often don’t make much difference. For example, you might show that an optimistic prior and a pessimistic prior lead to the same conclusion.

If you have so little data that the choice of prior does make a substantial difference, being able to specify the prior is a benefit. Suppose you only have a little data but have to make a decision anyway. A frequentist might say there’s too little data for a statistical analysis to be meaningful. So what do you do? Make a decision entirely subjectively! But with a Bayesian approach, you capture what is known outside of the data at hand in the form of a prior distribution, and update the prior with the little data you have. In this scenario, a Bayesian analysis is less subjective and more informed by the data than a frequentist approach.

# Functional folds and conjugate models

Bayesian calculations are intrinsically recursive: The posterior distribution from one step becomes the prior for the next step.

With conjugate models, the prior and posterior belong to the same family of distributions. If a distribution from this family is determined by a fixed set of parameters, we only need to update these parameters at each step. This updating process is defined by an integration problem, i.e. applying Bayes’ theorem, but it’s common with conjugate models for the integration to reduce to a simple algebraic operation on the parameters.

For example, suppose some binary event that happens with probability θ where θ has a beta(α, β) distribution. When we take our next observation, the posterior distribution becomes beta(α+1, β) if the event occurred and beta(α, β+1) if it didn’t. There is an integration problem in the background that reduces to simply adding a 1 to the α parameter for every success and to the β parameter for every failure. (Standard terminology is to call an observation of the thing you’re interested in a “success” regardless of whether the thing you’re observing is good or bad. “Failure” is just an observation of the thing not happening, even if it’s good that it didn’t happen.)

Conjugate models have the structure of a fold in functional programming, also known as a reduce or accumulate operation. We’ll show below how to compute the posterior distribution in a beta-binomial and normal-normal model using folds in Haskell. (Technically a left associative fold, foldl. Haskell also has a right associative fold foldr that we won’t be concerned about here.)

What does foldl require? Let’s ask GHCi, the Haskell REPL. Prior to GHC version 7.10 we would get the following.

     ghci> :type foldl
foldl :: (a -> b -> a) -> a -> [b] -> a


Unfortunately the type variables a and b just mean “one type” and “a possibly different type” and tell us nothing about how the types are used, so let’s unravel this a little at a time.

Here “a” is the type of the accumulator state. In our case, it will be the prior and posterior distribution parameters because that’s what’s being updated for each data point. Conjugacy makes this work since the prior and posterior distributions belong to the same family. The “b” type is our data. So you could read the type of foldl as follows.

foldl takes three things: a function to fold over the data, an initial accumulator state, and an array of data. It returns the final accumulator state. The accumulator function takes an accumulator state and a data value, and returns a new accumulator state.

Starting with GHC 7.10 the type of foldl is a little more general.

     ghci> :type foldl
foldl :: Foldable t => (a -> b -> a) -> a -> t b -> a


Instead of foldl operating only on lists, it can operate on any “foldable” container. To recover the type from earlier versions of the compiler, replace t with []. (Haskell uses [] b and [b] as equivalent ways of indicating a list of elements of type b.) There is one other difference I’ve edited out: The latest version of GHCi reverses the roles of ‘a’ and ‘b’. This is confusing, but has no effect since the type variable names have no meaning. I swapped the a’s and b’s back like they were to make the two definitions easier to compare.

## Beta-binomial example

Returning to the beta-binomial model discussed above, suppose we have a sequence of observations consisting of a 1 when the event in question happened and a 0 when it did not. And suppose our prior distribution on the probability of the event has a beta(α, β) distribution.

Now what function will we fold over our data to update our prior into a posterior?

     ghci> let f params y = (fst params + y, snd params + 1 - y)


The function f takes a pair of numbers, the two beta distribution parameters, and a data point y, and returns an updated set of beta parameters. If the data point is a 1 (success) then the α> parameter is incremented, and if the data point is a 0 (failure) then the β parameter is incremented. If we start with a beta(0.2, 0.8) prior and observe the data [1, 1, 0, 0, 1] then we apply f to our data as follows.

     ghci> let d = [1, 1, 0, 0, 1]
ghci> foldl f (0.2, 0.8) d


The result will be (3.2, 2.8). The three successes incremented the first beta parameter and the two failures incremented the second beta parameter.

To see how the foldl operates, we can run scanl instead. It works like foldl but returns a list of intermediate accumulator values rather than just the final accumulator value.

     ghci> scanl f (0.2, 0.8) d


This returns

     [ (0.2, 0.8), (1.2, 0.8), (2.2, 0.8), (2.2, 1.8), (2.2, 2.8), (3.2, 2.8) ].

he initial accumulator state is (0.2, 0.8) because we started with a beta(0.2, 0.8) prior. Then we increment the accumulator state to (1.2, 0.8) because the first data point was a 1. Then we increment the state to (2.2, 0.8) because the second data point is also a 1. Next we see two failures in a row and so the second part of the accumulator is incremented two times. After observing the last data point, a success, our final state is (3.2, 2.8), just as when we applied foldl.

## Normal-normal example

Next we consider another Bayesian model with a conjugate prior. We assume our data come from a normal distribution with mean θ and known variance 1. We assume a conjugate prior distribution on θ, normal with mean μ0 and variance τ02.

This time the posterior distribution calculation is more complicated and so our accumulator function is more complicated. But the form of the calculation is the same as above: we fold an accumulator function over the data.

Here is the function that encapsulates our posterior distribution calculation.

     f params y = ((m/v  + y)*newv, newv)
where
m = fst params -- mean
v = snd params -- variance
newv = v/(v + 1)


Now suppose our prior distribution on θ has mean 0 and variance 4, and we observe two values, [3, 5].

     ghci> foldl f (0, 4) [3, 5]


This returns (3.5555, 0.4444). To see the intermediate accumulation state, i.e. after just seeing y = 3, we run scanl instead and see

     [ (0, 4), (2,4, 0,8), (3.5555, 0.4444) ]

## Inspiration

This post was inspired by a paper by Brian Beckman (in progress) that shows how a Kalman filter can be implemented as a fold. From a Bayesian perspective, the thing that makes the Kalman filter work is that a certain multivariate normal model has a conjugate prior. This post shows that conjugate models more generally can be implemented as folds over the data. That’s interesting, but what does it buy you? Brian’s paper discusses this in detail, but one advantage is that it completely separates the accumulator function from the data, so the former can be tested in isolation.

## Next up: ODEs

The next post shows how to implement an ODE solver (4th order Runge-Kutta) as a fold over the list of time steps where the solution is needed.

# Kalman filters and bottom-up learning

Kalman filtering is a mixture of differential equations and statistics. Kalman filters are commonly used in tracking applications, such as tracking the location of a space probe or tracking the amount of charge left in a cell phone battery. Kalman filters provide a way to synthesize theoretical predictions and actual measurements, accounting for error in both.

Engineers naturally emphasize the differential equations and statisticians naturally emphasize the statistics. Both perspectives are valuable, but in my opinion/experience, the engineering perspective should come first.

From an engineering perspective, a Kalman filtering problem starts as a differential equation. In an ideal world, one would simply solve the differential equation and be done. But the experienced engineer realizes his or her differential equations don’t capture everything. (Unlike the engineer in this post.) Along the road to the equations at hand, there were approximations, terms left out, and various unknown unknowns.

The Kalman filter accounts for some level of uncertainty in the process dynamics and in the measurements taken. This uncertainty is modeled as randomness, but this doesn’t mean that there’s necessarily anything “random” going on. It simply acknowledges that random variables are an effective way of modeling miscellaneous effects that are unknown or too complicated to account for directly. (See Random is as random does.)

The statistical approach to Kalman filtering is to say that it is simply another estimation problem. You start from a probability model and apply Bayes’ theorem. That probability model has a term inside that happens to come from a differential equation in practice, but this is irrelevant to the statistics. The basic Kalman filter is a linear model with normal probability distributions, and this makes a closed-form solution for the posterior possible.

You’d be hard pressed to start from a statistical description of Kalman filtering, such as that given here, and have much appreciation for the motivating dynamics. Vital details have simply been abstracted away. As a client told me once when I tried to understand his problem starting from the top-down, “You’ll never get here from there.”

The statistical perspective is complementary. Some things are clear from the beginning with the statistical formulation that would take a long time to see from the engineering perspective. But while both perspectives are valuable, I believe it’s easier to start on the engineering end and work toward the statistics end rather than the other way around.

History supports this claim. The Kalman filter from the engineering perspective came first and its formulation in terms of Bayesian statistics came later. Except that’s not entirely true.

Rudolf Kálmán published his seminal paper in 1960 and four years later papers started to come out making the connection to Bayesian statistics. But while Kálmán and others were working in the US starting from the engineering end, Ruslan Stratonovich was working in Russia starting from the statistical end. Still, I believe it’s fair to say that most of the development and application of Kalman filters has proceeded from the engineering to the statistics rather than the other way around.

RelatedMore on Kalman filters

# Bayesian adaptive clinical trials: promise and pitfalls

This afternoon I’m giving a talk at the Houston INFORMS chapter entitled “Bayesian adaptive clinical trials: promise and pitfalls.”

When I started working in adaptive clinical trials, I was very excited about the potential of such methods. The clinical trial methods most commonly used are very crude, and there’s plenty of room for improvement.

Over time I became concerned about overly complex methods, methods which were good for academic publication but may not be best for patients. Such methods are extremely time-consuming to develop and may not perform as well in practice as simpler methods.

There’s a great deal of opportunity between the extremes, methods that are more sophisticated than the status quo without being unnecessarily complex.

# Frequentist properties of Bayesian methods

Bayesian methods for designing clinical trials have become more common, and yet these Bayesian designs are almost always evaluated by frequentist criteria. For example, a trial may be designed to stop early 95% of the time under some bad scenario and stop no more than 20% of the time under some good scenario.

These criteria are arbitrary, since the “good” and “bad” scenarios are arbitrary, and because the stopping probability requirements of 95% and 20% are arbitrary. Still, there’s an idea in lurking in the background that in every design there must be something that is shown to happen no more than 5% of the time.

It takes a great deal of effort to design Bayesian methods with desired frequentist properties. It’s an inverse problem, searching for the parameters in a high-dimensional design space, usually via lengthy simulation, that cause the method to satisfy some criteria. Of course frequentist methods satisfy frequentist criteria by design and so meet these criteria with far less effort. It’s rare to see the tables turned, evaluating frequentist methods by Bayesian criteria.

Sometimes the effort to beat frequentist designs at their own game is futile because the frequentist designs are optimal by their own criteria. More often, however, the Bayesian and frequentist methods being compared are not direct competitors but only analogs. The aim in this case is to match the frequentist method’s operating characteristics by one criterion while doing better by a new criterion.

Sometimes a Bayesian method can be shown to have better frequentist operating characteristics than its frequentist counterpart. This puts dogmatic frequentists in the awkward position of admitting that what they see as an unjustified approach to statistics has nevertheless produced a superior product. Some anti-Bayesians are fine with this, happy to have a procedure with better frequentist properties, even though it happened to be discovered via a process they view as illegitimate.

Related postBayesian clinical trials in one zip code

# Bayesian and nonlinear

Someone said years ago that you’ll know Bayesian statistics has become mainstream when people no longer put “Bayesian” in the titles of their papers. That day has come. While the Bayesian approach is still the preferred approach of a minority of statisticians, it’s no longer a novelty. If you want people to find your paper interesting, the substance needs to be interesting. A Bayesian approach isn’t remarkable alone.

You could say the same about nonlinear differential equations. Differential equations are so often nonlinear that the “nonlinear” qualifier isn’t always necessary to say explicitly. Just as a Bayesian analysis isn’t interesting just because it’s Bayesian, a differential equation isn’t necessarily interesting just because it’s nonlinear.

The analogy between Bayesian statistics and nonlinear differential equations breaks down though. Nonlinear equations are intrinsically more interesting than linear ones. But it’s no longer remarkable to solve a nonlinear differential equation numerically.

When an adjective becomes the default, it drops off and the previous default now requires an adjective. Terms like “electronic” and “digital” are fading from use. If you say you’re going to mail someone something, the default assumption is usually that you are going to email it. What used to be simply “mail” is now “snail mail.” Digital signal processing is starting to sound quaint. The abbreviation DSP is still in common use, but digital signal processing is simply signal processing. Now non-digital signal processing requires a qualifier, i.e. analog.

There was no term for Frequentist statistics when it was utterly dominant. Now of course there is. (Some people use the term “classical,” but that’s an odd term given that Bayesian analysis is older.) The term linear has been around a long time. Even when nearly all analysis was linear, people were aware that linearity was a necessary simplification.

Related posts:

# Introduction to MCMC

Markov Chain Monte Carlo (MCMC) is a technique for getting your work done when Monte Carlo won’t work.

The problem is finding the expected value of f(X) where X is some random variable. If you can draw independent samples xi from X, the solution is simple:

When it’s possible to draw these independent samples, the sum above is well understood. It’s easy to estimate the error after n samples, or to turn it the other way around, estimate what size n you need so that the error is probably below a desired threshold.

MCMC is a way of making the approximation above work even though it’s not practical to draw independent random samples from X. In Bayesian statistics, X is the posterior distribution of your parameters and you want to find the expected value of some function of these parameters. The problem is that this posterior distribution is typically complicated, high-dimensional, and unique to your problem (unless you have a simple conjugate model). You don’t know how to draw independent samples from X, but there are standard ways to construct a Markov chain whose samples make the approximation above work. The samples are not independent but in the limit the set of samples has the same distribution as X.

MCMC is either simple or mysterious, depending on your perspective. It’s simple to write code that should work, eventually, in theory. Writing efficient code is another matter. And above all, knowing when your answer is good enough is tricky. If the samples were independent, the Central Limit Theorem would tell you how the sum should behave. But since the samples are dependent, all bets are off. Almost all we know is that the average on the right side converges to the expectation on the left side. Aside from toy problems there’s very little theory to tell you when your average is sufficiently close to what you want to compute.

Because there’s not much theory, there’s a lot of superstition and folklore. As with other areas, there’s some truth in MCMC superstition and folklore, but also some error and nonsense.

In some ways MCMC is truly marvelous. It’s disappointing that there isn’t more solid theory around convergence, but it lets you take a stab at problems that would be utterly intractable otherwise. In theory you can’t know when it’s time to stop, and in practice you can fool yourself into thinking you’ve seen convergence when you haven’t, such as when you have a bimodal distribution and you’ve only been sampling from one mode. But it’s also common in practice to have some confidence that a calculation has converged because the results are qualitatively correct: this value should be approximately this other value, this should be less than that, etc.

Bayesian statistics is older than Frequentist statistics, but it didn’t take off until statisticians discovered MCMC in the 1980s, after physicists discovered it in the 1950s. Before that time, people might dismiss Bayesian statistics as being interesting but impossible to compute. But thanks to Markov Chain Monte Carlo, and Moore’s law, many problems are numerically tractable that were not before.

Since MCMC gives a universal approach for solving certain kinds of problems, some people equate the algorithm with the problem. That is, they see MCMC as the solution rather than an algorithm to compute the solution. They forget what they were ultimately after. It’s sometimes possible to compute posterior probabilities more quickly and more accurately by other means, such as numerical integration.