Why is Kullback-Leibler divergence not a distance?

The Kullback-Leibler divergence between two probability distributions is a measure of how different the two distributions are. It is sometimes called a distance, but it’s not a distance in the usual sense because it’s not symmetric. At first this asymmetry may seem like a bug, but it’s a feature. We’ll explain why it’s useful to measure the difference between two probability distributions in an asymmetric way.

The Kullback-Leibler divergence between two random variables X and Y is defined as

KL(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

This is pronounced/interpreted several ways:

  • The divergence from Y to X
  • The relative entropy of X with respect to Y
  • How well Y approximates X
  • The information gain going from the prior Y to the posterior X
  • The average surprise in seeing Y when you expected X

A theorem of Gibbs proves that K-L divergence is non-negative. It’s clearly zero if X and Y have the same distribution.

The K-L divergence of two random variables is an expected value, and so it matters which distribution you’re taking the expectation with respect to. That’s why it’s asymmetric.

As an example, consider the probability densities below, one exponential and one gamma with a shape parameter of 2.

exponential and gamma PDFs

The two densities differ mostly on the left end. The exponential distribution believes this region is likely while the gamma does not. This means that an expectation with respect to the exponential distribution will weigh things in this region more heavily. In an information-theoretic sense, an exponential is a better approximation to a gamma than the other way around.

Here’s some Python code to compute the divergences.

    from scipy.integrate import quad
    from scipy.stats import expon, gamma
    from scipy import inf

    def KL(X, Y):
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        return quad(f, 0, inf)

    e = expon
    g = gamma(a = 2)

    print( KL(e, g) )
    print( KL(g, e) )

This returns

    (0.5772156649008394, 1.3799968612282498e-08)
    (0.4227843350984687, 2.7366807708872898e-09)

The first element of each pair is the integral and the second is the error estimate. So apparently both integrals have been computed accurately, and the first is clearly larger. This backs up our expectation that it’s more surprising to see a gamma when expecting an exponential than vice versa.

Although K-L divergence is asymmetric in general, it can be symmetric. For example, suppose X and Y are normal random variables with the same variance but different means. Then it would be equally surprising to see either one when expecting the other. You can verify this in the code above by changing the KL function to integrate over the whole real line

    def KL(X, Y):
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        return quad(f, -inf, inf)

and trying an example.

n1 = norm(1, 1)
n2 = norm(2, 1)

print( KL(n1, n2) )
print( KL(n2, n1) )

This returns

(0.4999999999999981, 1.2012834963423225e-08)
(0.5000000000000001, 8.106890774205374e-09)

and so both integrals are equal to within the error in the numerical integration.

How much do you expect to learn from that experiment?

Bayesian statisticians often talk about models “learning” as data accumulate. Here’s an example that applies information theory to quantify how much you can learn from an experiment using the same likelihood function but two different priors: a conjugate prior and a robust prior.

Here’s an example from a paper Luis Pericchi and I wrote recently. Suppose X ~ Normal(θ, 1) where the prior on θ is either a standard Cauchy distribution or a normal distribution with mean 0 and variance 2.19. (The variance on the normal was chosen following an example by Jim Berger so that both priors put half their mass on the interval [-1, 1].)

The expected information gain from a single observation using the normal (conjugate) prior was 0.58. The corresponding gain for the Cauchy (robust) prior was 1.20. Because robust priors are more responsive to data, the expected gain in information is larger (in this case twice as large) when using these priors.

Related: Quantifying information content

Learning is not the same as gaining information

Learning is not the same as just gaining information. Sometimes learning means letting go of previously held beliefs. While this is true in life in general, my point here is to show how this holds true when using the mathematical definition of information.

The information content of a probability density function p(x) is given by

integral of p(x) log p(x)

Suppose we have a Beta(2, 6) prior on the probability of success for a binary outcome.

plot of beta(2,6) density

The prior density has information content 0.597. Then suppose we observe a success. The posterior density is distributed as Beta(3, 6). The posterior density has information 0.516, less information than the prior density.

plot of beta(3,6) density

Observing a success pulled the posterior density toward the right. The posterior density is a little more diffuse than the prior and so has lower information content. In that sense, we know less than before we observed the data! Actually, we’re less certain than we were before observing the data. But if the true probability of response is larger than our prior would indicate, we’re closer to the truth by becoming less confident of our prior belief, and we’ve learned something.

Related: Use information theory to clarify and quantify goals

Probability and information

E. T. Jaynes gave a speech entitled A Backward Look to the Future in which he looked back on his long career as a physicist and statistician. The speech contains several quotes related to my recent post on what a probability means.

Jaynes advocated the view of probability theory as logic extended to include reasoning on incomplete information. Probability need not have anything to do with randomness. Jaynes believed that frequency interpretations of probability are unnecessary and misleading.

… think of probability theory as extended logic, because then probability
distributions are justified in terms of their demonstrable information content, rather than their imagined—and as it now turns out, irrelevant—frequency connections.

He concludes with this summary of his approach to probability.

As soon as we recognize that probabilities do not describe reality—only our information about reality—the gates are wide open to the optimal solution of problems of reasoning from that information.

Related: Using information theory to clarify and quantify