Empirically testing the Chowla conjecture

Terry Tao’s most recent blog post looks at the Chowla conjecture theoretically. This post looks at the same conjecture empirically using Python. (Which is much easier!)

The Liouville function λ(n) is (-1)^Ω(n) where Ω(n) is the number of prime factors of n counted with multiplicity. So, for example, Ω(9) = 2 because even though 9 has only one distinct prime factor, that factor counts twice.

Given some set of k strides h1, h2, …, hk, define

f(n) = λ(n + h1) λ(n + h2) … λ(n + hk).

The Chowla conjecture says that the average of the first N values of f(n) tends to zero as N goes to infinity. This has been proven for k = 1 but not for larger k.

If f(n) acts like a Bernoulli random variable, our experiments might increase our confidence in the conjecture, but they wouldn’t prove anything. Unexpected results wouldn’t prove anything either, but a departure from random behavior might help find a counterexample if the conjecture is false.

We’re going to be evaluating the Liouville function repeatedly at the same arguments, so it will save some compute time if we tabulate its values. This will also let us use some compact Python notation to average f. We’ll need to tabulate the function up to N + hk.

In the code below, maxstride is an upper bound on the strides hk we may look at. SymPy has a function primeomega that calculates Ω(n) so we might as well use it. If you wanted to use a very large value of N, you might want to fill the array of Liouville function values using a more direct approach that avoids all the factoring implicit in calls to primeomega.

    from sympy import primeomega
    from numpy import empty

    N = 10000
    maxstride = 100

    liouville = empty(N + maxstride)
    liouville[0] = 1
    for n in range(1, len(liouville)):
        liouville[n] = (-1)**primeomega(n)
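The more direct approach mentioned above might look something like the following sketch (my own, not from the original post): a sieve that counts prime factors for every number at once instead of factoring each one separately.

```python
import numpy as np

def liouville_sieve(m):
    # Count prime factors with multiplicity for 0..m-1 with a sieve,
    # avoiding the per-number factorization inside primeomega
    bigomega = np.zeros(m, dtype=np.int64)
    for p in range(2, m):
        if bigomega[p] == 0:          # no smaller prime divides p, so p is prime
            pk = p
            while pk < m:
                bigomega[pk::pk] += 1  # every multiple of p**k gets one more factor of p
                pk *= p
    lam = np.where(bigomega % 2 == 0, 1, -1)
    lam[0] = 1                         # sentinel; lambda(0) is not defined
    return lam

lam = liouville_sieve(10100)
print(lam[1], lam[2], lam[12])         # 1 -1 -1  (12 = 2*2*3 has three prime factors)
```

Each number n = p1^e1 … pr^er is hit exactly e1 + … + er times by the inner loop, so the sieve recovers Ω(n) without any factoring.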

The following code looks at the Chowla conjecture for h1 = 0 and h2 ranging over 1, 2, …, 99.

    average = empty(maxstride-1)
    for s in range(1, maxstride):
        average[s-1] = (liouville[0:N] * liouville[s:N+s]).sum()/N

If the Liouville function is distributed like a random variable taking on -1 and 1 with equal probability, we’d expect the averages to vary like samples from a normal distribution with mean 0 and variance 1/(2N).

    print( (2*N)**-0.5 )

This returns 0.00707, and so the means are indeed near zero, and the standard deviation of the samples is about what we’d expect.

What if we replace Ω(n), the number of prime factors with multiplicity, with ω(n), the number of distinct prime factors? The averages above are still small, but the sample variance is about twice as big as before. I’ve seen this pattern with several different large values of N, so I suspect there’s something going on here.

(I didn’t see a function in SymPy corresponding to ω(n), but you can create your own with len(factorint(n)).)
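Here’s a sketch of the ω(n) experiment using len(factorint(n)) as suggested; the variable names are mine.

```python
from sympy import factorint
from numpy import empty

N = 10000
maxstride = 100

# (-1)**omega(n), where omega(n) counts DISTINCT prime factors
omega_sign = empty(N + maxstride)
omega_sign[0] = 1
for n in range(1, len(omega_sign)):
    omega_sign[n] = (-1)**len(factorint(n))

average = empty(maxstride - 1)
for s in range(1, maxstride):
    average[s-1] = (omega_sign[0:N] * omega_sign[s:N+s]).sum()/N

print(average.mean(), average.std())
```

Comparing average.std() here with the corresponding value from the Ω(n) version shows the larger sample variance described above.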

Time series analysis vs DSP terminology

Time series analysis and digital signal processing are closely related. Unfortunately, the two fields use different terms to refer to the same things.

Suppose you have a sequence of inputs x[n] and a sequence of outputs y[n] for integers n.

Moving average / FIR

If each output depends on a linear combination of a finite number of previous inputs

y[n] = b0 x[n] + b1 x[n-1] + … + bq x[n-q]

then time series analysis would call this a moving average (MA) model of order q, provided b0 = 1. Note that this might not really be an average, i.e. the b‘s are not necessarily positive and don’t necessarily sum to 1.

Digital signal processing would call this a finite impulse response (FIR) filter of order q.

Autoregressive / IIR

If each output depends on a linear combination of a finite number of previous outputs

y[n] = a1 y[n-1] + … + ap y[n-p]

then time series analysis would call this an autoregressive (AR) model of order p.

Digital signal processing would call this an infinite impulse response (IIR) filter of order p. 

Sometimes you’ll see the opposite sign convention on the a‘s.

ARMA / IIR
If each output depends on a linear combination of a finite number of previous inputs and outputs

y[n] = b0 x[n] + b1 x[n-1] + … + bq x[n-q] + a1 y[n-1] + … + ap y[n-p]

then time series analysis would call this an autoregressive moving average (ARMA) model of order (p, q), i.e. p AR terms and q MA terms.

Digital signal processing would call this an infinite impulse response (IIR) filter with q feedforward coefficients and p feedback coefficients. Also, as above, you may see the opposite sign convention on the a‘s.
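As a concrete illustration (my own, not from the original post), SciPy’s lfilter implements exactly this difference equation, with the DSP sign convention on the a’s:

```python
import numpy as np
from scipy.signal import lfilter

# lfilter uses a[0] y[n] = b[0] x[n] + ... + b[q] x[n-q]
#                          - a[1] y[n-1] - ... - a[p] y[n-p]
b = [1.0, 0.5]     # feedforward (MA / FIR) coefficients
a = [1.0, -0.3]    # feedback (AR) coefficients; -0.3 here means +0.3 in the AR form

x = np.zeros(8)
x[0] = 1.0         # unit impulse

y = lfilter(b, a, x)

# The same recursion written out directly: y[n] = x[n] + 0.5 x[n-1] + 0.3 y[n-1]
y_direct = np.zeros(8)
for n in range(8):
    y_direct[n] = x[n]
    if n >= 1:
        y_direct[n] += 0.5 * x[n-1] + 0.3 * y_direct[n-1]

print(np.allclose(y, y_direct))  # True
```

Flipping the sign of a[1] is exactly the sign-convention difference mentioned above.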

ARMA notation

Box and Jenkins use a‘s for input and z‘s for output. We’ll stick with x‘s and y’s to make the comparison to DSP easier.

Using the backward shift operator B that takes a sample at n to the sample at n-1, the ARMA system can be written

φ(B) y[n] = θ(B) x[n]

where φ and θ are polynomials

φ(B) = 1 – φ1B – φ2B² – … – φpB^p

θ(B) = 1 – θ1B – θ2B² – … – θqB^q.

System function notation

In DSP, filters are described by their system function, the z-transform of the impulse response. In this notation (as in Oppenheim and Schafer, for example) we have

H(z) = \frac{\sum_{k=0}^q b_k z^{-k}}{1 - \sum_{k=1}^p a_k z^{-k}}

The φk in Box and Jenkins correspond to the ak in Oppenheim and Schafer. The θk correspond to the (negative) bk.

The system function H(z) corresponds to θ(1/z) / φ(1/z).




How to eliminate the first order term from a second order ODE

Authors will often say that “without loss of generality” they will assume that a differential equation has no first order derivative term. They’ll explain that there’s no need to consider

y'' + p(x) y' + q(x) y = 0

because a change of variables can turn the above equation into one of the form

y'' + q(x) y = 0

While this is true, the change of variables is seldom spelled out. I’ll give the change of variables explicitly here in case this is helpful to someone in the future. Define u(x) and r(x) by

u(x) = \exp\left( \frac{1}{2} \int^x p(t)\,dt\right ) y(x)


r(x) = q(x) - \frac{1}{2}p'(x) - \frac{1}{4}p(x)^2

With this change of variables

u'' + r(x) u = 0

Proof: Calculate u'' + ru and use the fact that y satisfies the original differential equation. The calculation is tedious but routine.

Example: Suppose we start with

xy'' + y' + x^3 y = 0

Then dividing by x we get

y'' + \frac{1}{x}y' + x^2 y = 0

Now applying the change of variables above gives

u'' + \left(x^2 + \frac{1}{4x^2}\right)u = 0
and our original y is u/√x.
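The “tedious but routine” calculation can be delegated to SymPy. Here’s a sketch checking the example above, where p(x) = 1/x makes the integrating factor exp(½ ∫ dt/t) = √x, so u = √x · y:

```python
from sympy import symbols, Function, sqrt, diff, simplify

x = symbols('x', positive=True)
y = Function('y')

u = sqrt(x) * y(x)   # u = exp((1/2) * integral of 1/x) * y = sqrt(x) * y

# Form u'' + (x**2 + 1/(4 x**2)) u, then eliminate y'' using the
# original equation y'' = -y'/x - x**2 y
expr = diff(u, x, 2) + (x**2 + 1/(4*x**2)) * u
expr = expr.subs(diff(y(x), x, 2), -diff(y(x), x)/x - x**2 * y(x))

print(simplify(expr))  # 0
```

The expression simplifies to zero, confirming that u satisfies u'' + r(x) u = 0 with r(x) = x² + 1/(4x²).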

Common words that have a technical meaning in math

Mathematical writing is the opposite of business writing in at least one respect. Math uses common words as technical terms, whereas business coins technical terms to refer to common ideas.

There are a few math terms I use fairly often and implicitly assume readers understand. Perhaps the most surprising is almost as in “almost everywhere.” My previous post, for example, talks about something being true for “almost all x.”

The term “almost” sounds vague but it actually has a precise technical meaning. A statement is true almost everywhere, or holds for almost all x, if the set of points where it doesn’t hold has measure zero.

For example, almost all real numbers are irrational. There are infinitely many rational numbers, and so there are a lot of exceptions to the statement “all real numbers are irrational,” but the set of exceptions has measure zero [1].

In common parlance, you might use ball and sphere interchangeably, but in math they’re different. In a normed vector space, the set of all points of norm no more than r is the ball of radius r. The set of all points with norm exactly r is the sphere of radius r. A sphere is the surface of a ball.

The word smooth typically means “infinitely differentiable,” or depending on context, differentiable as many times as you need. Often there’s no practical loss of generality in assuming something is infinitely differentiable when you only need, say, three derivatives [2]. For example, a manifold whose charts are once differentiable can always be altered slightly to be infinitely differentiable.

The words regular and normal are used throughout mathematics as technical terms, and their meaning changes completely depending on context. For example, in topology regular and normal are two kinds of separation axioms. They tell you whether a topology has enough open sets to separate a point from a closed set or separate two closed sets from each other.

When I use normal I’m most often talking about a normal (i.e. Gaussian) probability distribution. I don’t think I use regular as a technical term that often, but when I do it probably means something like smooth, but more precise. A regularity result in differential equations, for example, tells you what sort of desirable properties a solution has: whether it’s a classical solution or only a weak solution, whether it’s continuous or differentiable, etc.

While I’m giving a sort of reader’s guide to my terminology, log always refers to natural log and trig functions are always in radians unless noted otherwise. More on that here.

* * *

The footnotes below are much more technical than the text above.

[1] Here’s a proof that any countable set of points has measure zero. Pick any ε > 0. Put an open interval of width ε/2 around the first point, an interval of width ε/4 around the second point, an interval of width ε/8 around the third point etc. This covers the countable set of points with a cover of measure at most ε, and since ε was arbitrary, the set of points must have measure 0.

The irrational numbers are uncountable, but that’s not why they have positive measure. A countable set has measure zero, but a set of measure zero may be uncountable. For example, the Cantor set is uncountable but has measure zero. Or to be more precise, I should say the standard Cantor set has measure zero. There are other Cantor sets, i.e. sets homeomorphic to the standard Cantor set, that have positive measure. This shows that “measure zero” is not a topological property.

[2] I said above that often it doesn’t matter how many times you can differentiate a function, but partial differential equations are an exception to that rule. There you’ll often care exactly how many (generalized) derivatives a solution has. And you’ll obsess over exactly which powers of the function or its derivatives are integrable. The reason is that a large part of the theory revolves around embedding theorems, whether this function space embeds in that function space. The number of derivatives a function has and the precise exponents p for the Lebesgue spaces they live in matters a great deal. Existence and uniqueness of solutions can hang on such fine details.

A uniformly distributed sequence

If you take the fractional parts of the set of numbers {n cos nx :  integer n > 0} the result is uniformly distributed for almost all x. That is, in the limit, the number of times the sequence visits a subinterval of [0, 1] is proportional to the length of the interval. (Clearly it’s not true for all x: take x = 0, for instance. Or any rational number times π.)

The proof requires some delicate work with Fourier analysis that I’ll not repeat here. If you’re interested in the proof, see Theorem 4.4 of Uniform Distribution of Sequences.

This is a surprising result: why should such a strange sequence be uniformly distributed? Let’s look at a histogram to see whether the theorem is plausible.


Histogram for an = n

OK. Looks plausible.
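Here’s a sketch of the kind of computation behind such a histogram; x = √2 is my own arbitrary choice of a “generic” x.

```python
import numpy as np

x = np.sqrt(2)                   # presumably outside the measure-zero exceptional set
n = np.arange(1, 100001)
frac = (n * np.cos(n * x)) % 1   # fractional parts of n cos(nx)

# Fraction of visits to each tenth of the unit interval
counts, _ = np.histogram(frac, bins=10, range=(0.0, 1.0))
print(counts / len(frac))        # each entry should be roughly 0.1
```

If the theorem holds for this x, all ten bins should receive about a tenth of the points, and that is what the computation shows.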

But there’s a generalization that’s even more surprising. Let an be any increasing sequence of integers. Then the fractional parts of an cos anx are uniformly distributed for almost all x.

Any increasing sequence of integers? Like the prime numbers, for example? Apparently so. The result produces a very similar histogram.

Histogram for an = nth prime

But this can’t hold for just any sequence. Surely you could pick an integer sequence to thwart the theorem. Pick an x, then let an be the subset of the integers for which the fractional part of n cos nx is less than 0.5. Then the fractional parts of an cos anx aren’t uniformly distributed because they never visit the right half of the unit interval!

Where’s the contradiction? The theorem doesn’t start by saying “For almost all x ….” It starts by saying “For any increasing sequence an ….” That is, you don’t get to pick x first. You pick the sequence first, then you get a statement that is true for almost all x. The theorem is true for every increasing sequence of integers, but the exceptional set of x‘s may be different for each integer sequence.

Applying probability to non-random things

Probability has surprising uses, including applications to things that absolutely are not random. I’ve touched on this a few times. For example, I’ve commented on how arguments about whether something is really random are often moot: Random is as random does.

This post will take non-random uses for probability in a different direction. We’ll start with discussion of probability and end up proving a classical analysis result.

Let p be the probability of success on some experiment and let Xn count the number of successes in n independent trials. The expected value of Xn is np, and so the expected value of Xn/n is p.

Now let f be a continuous function on [0, 1] and think about f(Xn/n). As n increases, the distribution of Xn/n clusters more and more tightly around p, and so f(Xn/n) clusters more and more around f(p).

Let’s illustrate this with a simulation. Let p = 0.6, n = 100, and let f(x) = cos(x).

Here’s a histogram of random samples from Xn/n.

Histogram of samples from Bin(100, 0.6)

And here’s a histogram of random samples from f(Xn/n). Note that it’s centered near cos(p) = 0.825.

Histogram of samples from cos( Bin(100, 0.6) )


As n gets large, the expected value of f(Xn/n) converges to f(p). (You could make all this rigorous, but I’ll leave out the details to focus on the intuitive idea.)

Now write out the expected value of f(Xn/n).

E\left[ f\left(\frac{X_n}{n} \right ) \right ] = \sum_{k=0}^n f\left(\frac{k}{n} \right ) {n \choose k} p^k(1-p)^{n-k}

Up to this point we’ve thought of p as a fixed probability, and we’ve arrived at an approximation result guided by probabilistic thinking. But the expression above is approximately f(p) regardless of how we think of it. So now let’s think of p varying over the interval [0, 1].

E[ f(Xn/n) ] is an nth degree polynomial in p that is close to f(p). In fact, as n goes to infinity, E[ f(Xn/n) ] converges uniformly to f(p).

So for an arbitrary continuous function f, we’ve constructed a sequence of polynomials that converge uniformly to f. In other words, we’ve proved the Weierstrass approximation theorem! The Weierstrass approximation theorem says that there exists a sequence of polynomials converging to any continuous function f on a bounded interval, and we’ve explicitly constructed such a sequence. The polynomials E[ f(Xn/n) ] are known as the Bernstein polynomials associated with f.
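The expected value above is easy to evaluate numerically via the binomial distribution. Here’s a sketch (the function name bernstein is my own):

```python
import numpy as np
from scipy.stats import binom

def bernstein(f, n, p):
    # E[f(X_n / n)] with X_n ~ Binomial(n, p): the degree-n Bernstein polynomial of f
    k = np.arange(n + 1)
    return np.sum(f(k / n) * binom.pmf(k, n, p))

# Maximum error approximating cos on [0, 1] for a few degrees n
ps = np.linspace(0, 1, 101)
errs = {n: max(abs(bernstein(np.cos, n, p) - np.cos(p)) for p in ps)
        for n in (10, 100, 1000)}
print(errs)   # the maximum error shrinks roughly like 1/n
```

The shrinking maximum error over the whole interval is the uniform convergence the Weierstrass theorem promises.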

This is one of my favorite proofs. When I saw this in college, it felt like the professor was pulling a rabbit out of a hat. We’re talking about probability when all of a sudden, out pops a result that has nothing to do with randomness.

Now let’s see an example of the Bernstein polynomials approximating a continuous function. Again we’ll let f(x) = cos(x). If we plot the Bernstein polynomials and cosine on the same plot, the approximation is good enough that it’s hard to see what’s what. So instead we’ll plot the differences, cosine minus the Bernstein approximations.

Error in approximating cosine by Bernstein polynomials

As the degree of the Bernstein polynomials increases, the error decreases. Also, note the vertical scale. Cosine of zero is 1, and so the errors here are small relative to the size of the function being approximated.

Misplacing a continent

There are many conventions for describing points on a sphere. For example, does latitude zero refer to the North Pole or the equator? Mathematicians tend to prefer the former and geoscientists the latter. There are also varying conventions for longitude.

Volker Michel describes this clash of conventions colorfully in his book on constructive approximation.

Many mathematicians have faced weird jigsaw puzzles with misplaced continents after using a data set from a geoscientist. If you ever get such figures, too, or if you are, for example, desperately searching South America in a data set but cannot find it, remember the remark you have just read to solve your problem.


Quaint supercomputers

The latest episode of Star Trek Discovery (S1E4) uses the word “supercomputer” a few times. This sounds jarring. The word has become less common in contemporary usage, and seems even more out of place in a work of fiction set more than two centuries in the future.

According to Google’s Ngram Viewer, the term “supercomputer” peaked in 1990.

Google Ngram Viewer results for supercomputer

(The term “cluster” is far more common, but it is mostly a non-technical term. It’s used in connection with grapes, for example, much more often than with computers.)

Years ago you’d hear someone say a problem would take a “big computer,” but these days you’re more likely to hear someone say a problem is going to take “a lot of computing power.” Hearing that a problem is going to require a “big computer” sounds as dated as saying something would take “a big generator” rather than saying it would take a lot of electricity.

Like electricity, computing power has been commoditized. We tend to think in terms of the total amount of computing resources needed, measured in, say, CPU-hours or number of low-level operations. We don’t think first about what configuration of machines would deliver these resources any more than we’d first think about what machines it would take to deliver a certain quantity of electrical power.

There are still supercomputers and problems that require them, though an increasing number of computationally intense projects do not require such specialized hardware.

Something that bothers me about deep neural nets

Overfitting happens when a model does too good a job of matching a particular data set and so does a poor job on new data. The way traditional statistical models address the danger of overfitting is to limit the number of parameters. For example, you might fit a straight line (two parameters) to 100 data points, rather than using a 99-degree polynomial that could match the input data exactly and probably do a terrible job on new data. You find the best fit you can to a model that doesn’t have enough flexibility to match the data too closely.

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to deliberately do a mediocre job of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

In a nutshell, the disturbing thing is that you have a negative criterion for what constitutes a good solution: one that’s not too close to optimal. But there are a lot of ways to not be too close to optimal. In practice, you experiment until you find an optimally suboptimal solution, i.e. the intentionally suboptimal fit that performs the best in validation.


Exponential sums make pretty pictures

Exponential sums are a specialized area of math that studies series with terms that are complex exponentials. Estimating such sums is delicate work. General estimation techniques are ham-fisted compared to what is possible with techniques specialized for these particular sums. Exponential sums are closely related to Fourier analysis and number theory.

Exponential sums also make pretty pictures. If you make a scatter plot of the sequence of partial sums you can get surprising shapes. This is related to the trickiness of estimating such sums: the partial sums don’t simply monotonically converge to a limit.

The exponential sum page at UNSW suggests playing around with polynomials with dates in the denominator. If we take that suggestion with today’s date, we get the curve below:

f(n) = n/10 + n**2/7 + n**3/17

These are the partial sums of exp(2πi f(n)) where f(n) = n/10 + n²/7 + n³/17.

[Update: You can get an image each day for the current day’s date here.]

Here’s the code that produced the image.

    import matplotlib.pyplot as plt
    from numpy import array, pi, exp, log

    N = 12000
    def f(n):
        return n/10 + n**2/7 + n**3/17 

    z = array( [exp( 2*pi*1j*f(n) ) for n in range(3, N+3)] )
    z = z.cumsum()

    plt.plot(z.real, z.imag, color='#333399')
    plt.show()

If we use logarithms, we get interesting spirals. Here f(n) = log(n)^4.1.

f(n) = log(n)**4.1

And we can mix polynomials with logarithms. Here f(n) = log(n) + n²/100.

f(n) = log(n) + n**2/100

In this last image, I reduced the number of points from 12,000 to 1200. With a large number of points the spiral nature dominates and you don’t see the swirls along the spiral as clearly.


No critical point between two peaks

If a function of one variable has two local maxima, it must have a local minimum in between.

What about a function of two variables? If it has two local maxima, does it need to have a local minimum? No, it could have a saddle point in between, a point that is a local minimum in one direction but a local maximum in another direction. But even more surprising, it need not even have a saddle point. A function of two variables could have two local maxima and no other critical points! Here’s an example:

f(x, y) = – (x² – 1)² – (x²y – x – 1)²

It’s clear that the function is zero at (-1, 0) and (1, 2), and that the function is negative otherwise. So it clearly has two local maxima. You can write out the partial derivatives with respect to x and y and see that the only place they’re both zero is at the two local maxima.
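Rather than writing out the partial derivatives by hand, here’s a quick symbolic check (my own) using the function as defined above:

```python
from sympy import symbols, diff, solve

x, y = symbols('x y', real=True)
f = -(x**2 - 1)**2 - (x**2*y - x - 1)**2

# Solve grad f = 0 and keep the real solutions
critical = [s for s in solve([diff(f, x), diff(f, y)], [x, y], dict=True)
            if s[x].is_real and s[y].is_real]
print(critical)   # should list just (-1, 0) and (1, 2)
```

The only real critical points are the two maxima, with no minimum or saddle anywhere.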

Here’s a plot of the function:

Plot3D[f[x, y], {x, -1.5, 1.5}, {y, -0.5, 2.5}]

And here’s a contour plot:

ContourPlot[f[x, y], {x, -1.5, 1.5}, {y, -0.5, 2.5}, Contours -> 50]

The two maxima are in the bright patches in the lower left and upper right.

You might be thinking that if you walk between two peaks, you’ve got to go down in between. And that’s true. If you walk in a straight line between (-1, 0) and (1, 2), you’ll run into a local minimum around (0.2316, 1.2316). But that’s only a local minimum along your path. It’s not a local minimum or saddle point of the function in a neighborhood of that point.

I found this example in the book Single Digits.

Clean obfuscated code

One way to obfuscate code is clever use of arcane programming language syntax. Hackers are able to write completely unrecognizable code by exploiting dark corners of programming techniques and languages. Some of these attempts are quite impressive.

But it’s also possible to write clean source code that is nevertheless obfuscated. For example, it’s not at all obvious what the following code computes.

    def f(x):
        return 4*x*(1-x)

    def g(x):
        return x**3.5 * (1-x)**2.5

    sum = 0
    x = 1/7.0
    N = 1000000

    for _ in range(N):
        x = f(x)
        sum += g(x)

    print(sum/N)

The key to unraveling this mystery is a post I wrote a few days ago that shows the x‘s in this code cover the unit interval like random samples from a beta(0.5, 0.5) distribution. That means the code is finding the expected value of g(X) where X has a beta(0.5, 0.5) distribution. The following computes this expected value.

\int_0^1 g(x) \, dF_X = \frac{1}{\pi} \int_0^1 x^{3.5} (1-x)^{2.5} x^{-0.5} (1-x)^{-0.5} \, dx = \frac{1}{\pi} \int_0^1 x^3 (1-x)^2\, dx = \frac{1}{\pi} \cdot \frac{\Gamma(4) \Gamma(3)}{\Gamma(7)} = \frac{1}{60\pi}

The output of the program matches 1/60π to six decimal places.

Related post: What does this code do?

How many musical scales are there?

How many musical scales are there? That’s not a simple question. It depends on how you define “scale.”

For this post, I’ll only consider scales starting on C. That is, I’ll only consider changing the intervals between notes, not changing the starting note. Also, I’ll only consider subsets of the common chromatic scale; this post won’t get into dividing the octave into more or less than 12 intervals.

First of all we have the major scale — C D E F G A B C — and the “natural” minor scale: A B C D E F G A. The word “natural” suggests there are other minor scales. More on that later.

Then we have the classical modes: Dorian, Phrygian, Lydian, Mixolydian, Aeolian, and Locrian. These have the same intervals as taking the notes of the C major scale and starting on D, E, F, G, A, and B respectively. For example, Dorian has the same intervals as D E F G A B C D. Since we said we’d start everything on C, the Dorian mode would be C D E♭ F G A B♭ C. The Aeolian mode is the same as the natural minor scale.

The harmonic minor scale adds a new wrinkle: C D E♭ F G A♭ B C. Notice that A♭ and B are three half steps apart. In all the scales above, notes were either a half step or a whole step apart. Do we want to consider scales that have such large intervals? It seems desirable to include the harmonic minor scale. But what about this: C E♭ G♭ A C. Is that a scale? Most musicians would think of that as a chord or arpeggio rather than a scale. (It’s a diminished seventh chord. And it would be more common to write the A as a B♭♭.)

We might try to put some restriction on the definition of a scale so that the harmonic minor scale is included and the diminished seventh arpeggio is excluded. Here’s what I settled on. For the purposes of this post, I’ll say that a scale is an ascending sequence of eight notes with two restrictions: the first and last are an octave apart, and no two consecutive notes are more than three half steps apart. This will include modes mentioned above, and the harmonic minor scale, but will exclude the diminished seventh arpeggio. (It also excludes pentatonic scales, which we may or may not want to include.)

One way to enumerate all possible scales would be to start with the chromatic scale and decide which notes to keep. Write out the notes C, C♯, D, … , B, C and write a ‘1’ underneath a note if you want to keep it and a ‘0’ otherwise. We have to start and end on C, so we only need to specify which of the 11 notes in the middle we are keeping. That means we can describe any potential scale as an 11-bit binary number. That’s what I did to carry out an exhaustive search for scales with a little program.

There are 266 scales that meet the criteria listed here. I’ve listed all of them on another page. Some of these scales have names and some don’t. I’ve noted some names as shown below. I imagine there are others that have names that I have not labeled. I’d appreciate your help filling these in.

| Scale number | Notes                 | Name              |
|--------------|-----------------------|-------------------|
|          693 | C D  E  F# G  A  B  C | Lydian mode       |
|          725 | C D  E  F  G  A  B  C | Major             |
|          726 | C D  E  F  G  A  Bb C | Mixolydian mode   |
|          825 | C D  Eb F# G  Ab B  C | Hungarian minor   |
|          826 | C D  Eb F# G  Ab Bb C | Ukrainian Dorian  |
|          854 | C D  Eb F  G  A  Bb C | Dorian mode       |
|          858 | C D  Eb F  G  Ab Bb C | Natural minor     |
|         1235 | C Db E  F  G  Ab B  C | Double harmonic   |
|         1242 | C Db E  F  G  Ab Bb C | Phrygian dominant |
|         1257 | C Db E  F  Gb Ab B  C | Persian           |
|         1370 | C Db Eb F  G  Ab Bb C | Phrygian mode     |
|         1386 | C Db Eb F  Gb Ab Bb C | Locrian mode      |
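The post doesn’t show the little program, so here is my own reconstruction of the exhaustive search over 11-bit numbers described above:

```python
count = 0
for bits in range(2**11):
    # The 11 bits say which of the 11 chromatic notes between the two C's to keep;
    # note 0 (C) and note 12 (the C an octave up) are always kept
    keep = [0] + [i + 1 for i in range(11) if (bits >> i) & 1] + [12]
    steps = [b - a for a, b in zip(keep, keep[1:])]
    # A scale: eight notes, consecutive notes at most three half steps apart
    if len(keep) == 8 and max(steps) <= 3:
        count += 1
print(count)  # 266
```

The bit pattern itself, read as a binary number, is the “scale number” in the table above.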


Toxic pairs, re-identification, and information theory

Database fields can combine in subtle ways. For example, nationality is not usually enough to identify anyone. Neither is religion. But the combination of nationality and religion can be surprisingly informative.

Information content of nationality

How much information is contained in nationality? That depends on exactly how you define nations versus territories etc., but for this blog post I’ll take this Wikipedia table for my raw data. You can calculate that nationality has entropy of 5.26 bits. That is, on average, nationality is slightly more informative than asking five independent yes/no questions. (See this post for how to calculate information content.)

Entropy measures expected information content. Knowing that someone is from India (population 1.3 billion) carries only 2.50 bits of information. Knowing that someone is from Vatican City (population 800) carries 23.16 bits of information.
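The calculations above are short enough to sketch directly. The population figures below are rough, illustrative assumptions, not exact data.

```python
import numpy as np

def entropy_bits(counts):
    # Shannon entropy in bits of the distribution proportional to counts
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def info_bits(p):
    # Information content, in bits, of an outcome with probability p
    return -np.log2(p)

world = 7.3e9                      # rough world population; an assumption
print(info_bits(1.3e9 / world))    # India: about 2.5 bits
print(info_bits(800 / world))      # Vatican City: about 23 bits
```

Feeding entropy_bits the full table of national populations gives the 5.26-bit figure quoted above.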

One way to reduce the re-identification risk of PII (personally identifiable information) such as nationality is to combine small categories. Suppose we lump all countries with a population under one million into “other.” Then we go from 240 categories down to 160. This hardly makes any difference to the entropy: it drops from 5.26 bits to 5.25 bits. But the information content for the smallest country on the list is now 8.80 bits rather than 23.16.

Information content of religion

What about religion? This is also subtle to define, but again I’ll use Wikipedia for my data. Using these numbers, we get an entropy of 2.65 bits. The largest religion, Christianity, has an information content 1.67 bits. The smallest religion on the list, Rastafari, has an information content of 13.29 bits.

Joint information content

So if nationality carries 5.25 bits of information and religion 2.65 bits, how much information does the combination of nationality and religion carry? At least 5.25 bits, but no more than 5.25 + 2.65 = 7.9 bits on average. For two random variables X and Y, the joint entropy H(X, Y) satisfies

max( H(X), H(Y) ) ≤ H(X, Y) ≤ H(X) + H(Y)

where H(X) and H(Y) are the entropy of X and Y respectively.
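A toy example (the joint distribution here is made up) confirms these bounds:

```python
import numpy as np

def H(p):
    # Shannon entropy in bits of an array of probabilities
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Hypothetical joint distribution of two binary variables (rows: X, columns: Y)
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])

HX, HY, HXY = H(joint.sum(axis=1)), H(joint.sum(axis=0)), H(joint)
print(max(HX, HY) <= HXY <= HX + HY)  # True
```

The upper bound is attained exactly when X and Y are independent; the dependence between the two variables here is what pulls H(X, Y) below H(X) + H(Y).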

Computing the joint entropy exactly would require getting into the joint distribution of nationality and religion. I’d rather not get into this calculation in detail, except to discuss possible toxic pairs. On average, the information content of the combination of nationality and religion is no more than the sum of the information content of each separately. But particular combinations can be highly informative.

For example, there are not a lot of Jews living in predominantly Muslim countries. According to one source, there are at least five Jews living in Iraq. Other sources put the estimate as “less than 10.” (There are zero Jews living in Libya.)

Knowing that someone is a Christian living in Mexico, for example, would not be highly informative. But knowing someone is a Jew living in Iraq would be extremely informative.
