For a set of positive probabilities *p*_{i} summing to 1, their entropy is defined as

(For this post, log will mean log base 2, not natural log.)

This post looks at a couple questions about computing entropy. First, are there any numerical problems computing entropy directly from the equation above?

Second, imagine you don’t have the *p*_{i} values directly but rather counts *n*_{i} that sum to *N*. Then *p*_{i} = *n*_{i}/* _{N}*. To apply the equation directly, you’d first need to compute

*N*, then go back through the data again to compute entropy. If you have a large amount of data, could you compute the entropy in one pass?

To address the second question, note that

so you can sum *n*_{i} and *n*_{i} log *n*_{i} in the same pass.

One of the things you learn in numerical analysis is to look carefully at subtractions. Subtracting two nearly equal numbers can result in a loss of precision. Could the numbers above be nearly equal? Maybe if the *n*_{i} are ridiculously large. Not just astronomically large — astronomically large numbers like the number of particles in the universe are fine — but ridiculously large, numbers whose logarithms approach the limits of machine-representable numbers. (If we’re only talking about numbers as big as the number of particles in the universe, their logs will be at most three-digit numbers).

Now to the problem of computing the sum of *n*_{i} log *n*_{i}. Could the order of the terms matter? This also applies to the first question of the post if we look at summing the *p*_{i} log *p*_{i}. In general, you’ll get better accuracy summing a lot positive numbers by sorting them and adding from smallest to largest and worse accuracy by summing largest to smallest. If adding a sorted list gives essentially the same result when summed in either direction, summing the list in any other order should too.

To test the methods discussed here, I used two sets of count data, one on the order of a million counts and the other on the order of a billion counts. Both data sets had approximately a power law distribution with counts varying over seven or eight orders of magnitude. For each data set I computed the entropy four ways: two equations times two orders. I convert the counts to probabilities and use the counts directly, and I sum smallest to largest and largest to smallest.

For the smaller data set, all four methods produced the same answer to nine significant figures. For the larger data set, all four methods produced the same answer to seven significant figures. So at least for the kind of data I’m looking at, it doesn’t matter how you calculate entropy, and you might as well use the one-pass algorithm to get the result faster.

**Related**: Applied information theory

“This post looks at a couple questions about computing entropy.” There is also the question if you meant to compute entropy in the first place, or GuessWork.

A very interesting paper on the topic, would be interested about your opinion on it as well: http://arxiv.org/abs/1301.6356

Why is this defined as entropy – how is it useful?

If you know your data is approximately power-law tailed, and you don’t have million data points but much less, the plugin estimator you are using can be severely biased downwards. We have recently developed an entropy estimator using Bayesian nonparametrics to put prior on power-law tailed distributions which alleviates this problem. You might be interested:

Evan Archer*, Il Memming Park*, Jonathan Pillow. Bayesian Entropy Estimation for Countable Discrete Distributions. arXiv:1302.0328 (2013)

http://arxiv.org/abs/1302.0328

MATLAB implementation is freely available:

https://github.com/pillowlab/PYMentropy

One minor consideration that sometimes comes up that you have to program: nln(n) = 0 if n =0.

The only big issue I’ve run across with computing entropy is when dealing with maxent distributions which are exponential familty type and include most of the common stat distros.

For example if p is something like p_i = exp(x_i)/(sum_i exp(x_i)) then when computing the entropy you have to evaluate terms like:

ln(sum_i exp(x_i))

a common trick to avoid overloading is to replace this with:

m+ln(sum_i exp(x_i-m))

where m is suitably chosen so that exp(x_i-m) is a reasonable size.

Hi John,

This might be a bit off-topic, but I was wondering what you usually do once you’ve computed the entropy measure of some data. I would guess that the data would show some signal if the entropy is low enough, but how do you decide what is “low” enough? Do people just use arbitrarily chosen thresholds? Or is there some standard threshold?

You’re better off using Kahan’s algorithm to sum lots of numbers instead of trying to sort them and sum from smallest to largest.

This is great. I’ve seen some interesting articles on how entroy relates to machine learning, and maybe intelligence in general.

I have this little naive-Bayesian machine learning engine I’ve been playing with. So just out of curiosity, for each item, I used this entropy formula to calculate the entropy (roughly) for the two possible answers, and by picking the answer with the least entropy, it actually did a pretty good job of predicting the right answer.

I didn’t push it past that, but it was interesting that a rough entropy comparison came out better than random.

Hi John,

if p == 0 how does that count for the entropy?

log2(p) is equal to -Inf and 0*-Inf results in “not a number”

what’s the right thing to do?

What do I need to add in the summation for the entropy in case p == 0?

Thank you.

Luca