Data analysis has to start from some set of assumptions. Bayesian prior distributions drive some people crazy because they make assumptions explicit that people prefer to leave implicit. But there’s no escaping the need to make some sort of prior assumptions, whether you’re doing Bayesian statistics or not.
One attempt to avoid specifying a prior distribution is to start with a “non-informative” prior. David Hogg gives a good explanation of why this doesn’t accomplish what some think it does.
In practice, investigators often want to “assume nothing” and put a very or infinitely broad prior on the parameters; of course putting a broad prior is not equivalent to assuming nothing, it is just as severe an assumption as any other prior. For example, even if you go with a very broad prior on the parameter a, that is a different assumption than the same form of very broad prior on a² or on arctan(a). The prior doesn’t just set the ranges of parameters, it places a measure on parameter space. That’s why it is so important.
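To make the point about measure concrete, here is a minimal sketch comparing two "broad" priors: one flat in a and one flat in arctan(a). The half-width of 1000 and the function names are assumptions chosen purely for illustration; they do not come from Hogg's text.

```python
import math

def prob_abs_a_below(t, half_width):
    """P(|a| < t) when a is uniform on [-half_width, half_width]."""
    return min(t / half_width, 1.0)

def prob_abs_a_below_flat_arctan(t):
    """P(|a| < t) when arctan(a) is uniform on (-pi/2, pi/2)."""
    return 2.0 * math.atan(t) / math.pi

HALF_WIDTH = 1000.0  # hypothetical "very broad" range for a

for t in (1.0, 10.0, 100.0):
    flat_a = prob_abs_a_below(t, HALF_WIDTH)
    flat_arctan = prob_abs_a_below_flat_arctan(t)
    print(f"P(|a| < {t:g}): flat in a = {flat_a:.4f}, "
          f"flat in arctan(a) = {flat_arctan:.4f}")
```

Under these assumptions, the prior that is flat in a puts 0.1% of its mass on |a| < 1, while the prior that is flat in arctan(a) puts 50% there. Both look "vague", yet they encode very different beliefs about a.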
7 thoughts on “Vague priors are informative”
Yes it looks very difficult to ignore the existence of the prior. A Bayesian network could even implicate another variable as the causal agent of the vague prior.
A few of those “uninformative priors”, like the Jeffreys prior, are parameterization-invariant. However, they still start with some assumption. We could say that a truly uninformative prior is the prior under which the inference is most sensitive to perturbations in the data, but that reduces Bayesian inference to maximum likelihood in at least one case. There’s a nice discussion by Bernardo on the issues involved: “Noninformative Priors Do Not Exist”
Let’s define the prior on ‘a’ like this:
Every infinite bit string defines some Turing machine (only a finite prefix has meaning; the infinite suffix is ignored).
Set the prior probability of ‘a’ to the probability that a random bit string encodes a Turing machine that halts and outputs ‘a’.
‘a’ can range over the family of all finite bit strings, which gives us a dense covering of a real-valued interval. Or ‘a’ can range over the family of all machine floating-point values.
I am fortunate to be young enough to have read the Jargon File as a boy, and this koan stuck in my mind. I think it’s the esoteric equivalent to this blog post: .
I don’t agree that using a broad prior is “just as severe an assumption as any other prior.” That’s simply not true, if your goal is to make inferences about ‘a’. Under a delta function prior, you will ignore the data entirely and the posterior over ‘a’ will be the same as the prior. (That seems pretty severe, given that there’s no point in collecting any data).
This is not to say that there *is* such a thing as a purely “uninformative prior”, but if you care about inferring ‘a’ as opposed to ‘arctan(a)’, this can and should affect your choice of prior. The difference between prior and posterior credible intervals can provide insight into the degree to which the data (as opposed to the prior) determine your beliefs about the parameter.
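As a rough illustration of comparing prior and posterior credible intervals, here is a sketch using a conjugate normal model with known noise standard deviation. The model, the numbers, and the helper names are all assumptions made for the example, not something taken from the discussion above.

```python
from statistics import NormalDist

def posterior(mu0, tau0, ybar, sigma, n):
    """Posterior mean and sd for a normal mean with a N(mu0, tau0^2) prior,
    given n observations with known sd sigma and sample mean ybar."""
    prior_prec = 1.0 / tau0**2
    data_prec = n / sigma**2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * mu0 + data_prec * ybar) / post_prec
    post_sd = post_prec ** -0.5
    return post_mean, post_sd

def interval(mean, sd, level=0.95):
    """Central credible interval of a normal distribution."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * sd, mean + z * sd

# Hypothetical data: 10 observations, sample mean 2.0, noise sd 1.0.
YBAR, SIGMA, N = 2.0, 1.0, 10

for label, tau0 in (("broad prior", 100.0), ("near-delta prior", 1e-6)):
    prior_lo, prior_hi = interval(0.0, tau0)
    mean, sd = posterior(0.0, tau0, YBAR, SIGMA, N)
    post_lo, post_hi = interval(mean, sd)
    print(f"{label}: prior 95% = ({prior_lo:.3g}, {prior_hi:.3g}), "
          f"posterior 95% = ({post_lo:.3g}, {post_hi:.3g})")
```

With the broad prior, the posterior interval is far narrower than the prior interval and sits near the sample mean, so the data dominate. With the near-delta prior, the two intervals are essentially identical, which is the "ignore the data entirely" case described above.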
In the current issue of the American Statistician, a friend of mine investigates this issue in depth. Here is a link: http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2012.695938
Unfortunately, the article is not open access.
@John Ramey I just love how academics write to each other by paying a fee to a third party, using PDFs as if computers did not exist. I don’t know which is more hilarious.