Reverse engineering the seed of a linear congruential generator

The previous post gave an example of manipulating the seed of a random number generator to produce a desired result. This post will do something similar for a different generator.

A couple times I’ve used the following LCG (linear congruential random number generator) in examples. An LCG starts with an initial value of z and updates z at each step by multiplying by a constant a and taking the remainder by m. The particular LCG I’ve used has a = 742938285 and m = 2^31 - 1 = 2147483647.

(I used these parameters because they have maximum range, i.e. every positive integer less than m appears eventually, and because it was one of the LCGs recommended in an article years ago. That is, it’s very good as far as 32-bit LCGs go, which isn’t very far. An earlier post shows how it quickly fails the PractRand test suite.)

Let’s pick the seed so that the 100th output of the generator is today’s date in ISO form: 20170816.

We need to solve

a^100 z = 20170816 mod 2147483647.

By reducing a^100 mod 2147483647 we find we need to solve

160159497 z = 20170816 mod 2147483647

and the solution is z = 1898888478. (See How to solve linear congruences.)
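
If you’d like to reproduce that calculation, here’s a minimal sketch in Python. It reduces a^100 mod m with modular exponentiation and computes the modular inverse via Fermat’s little theorem, since m is prime.

    a = 742938285
    m = 2**31 - 1
    target = 20170816

    c = pow(a, 100, m)        # a^100 mod m; should give 160159497
    c_inv = pow(c, m - 2, m)  # modular inverse of c, by Fermat's little theorem
    z = c_inv * target % m
    print(z)                  # should give 1898888478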

The following Python code verifies that the solution works.

    a = 742938285
    z = 1898888478
    m = 2**31 - 1

    for _ in range(100):
        z = a*z % m
    print(z)

Manipulating a random number generator

With some random number generators, it’s possible to select the seed carefully to manipulate the output. Sometimes this is easy to do. Sometimes it’s hard but doable. Sometimes it’s theoretically possible but practically impossible.

In my recent correspondence with Melissa O’Neill, she gave me an example that seeds a random number generator so that the 9th and 10th outputs produce the ASCII code for my name.

Here’s the code. The function next is the xoroshiro128+ (XOR/rotate/shift/rotate) random number generator. The global array s contains the state of the random number generator. Its initial values are the seeds of the generator.

#include <cstdio>
#include <cstdint>

// xoroshiro generator taken from
// http://vigna.di.unimi.it/xorshift/xoroshiro128plus.c

uint64_t s[2];

static inline uint64_t rotl(const uint64_t x, int k) {
	return (x << k) | (x >> (64 - k));
}

uint64_t next(void) {
	const uint64_t s0 = s[0];
	uint64_t s1 = s[1];
	const uint64_t result = s0 + s1;

	s1 ^= s0;
	s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14); // a, b
	s[1] = rotl(s1, 36); // c

	return result;
}

int main() {
    freopen(NULL, "wb", stdout); 

    s[0] = 0x084f31240ed2ec3f;
    s[1] = 0x0aa0d69470975eb8;

    while (1) {
        uint64_t value = next();
        fwrite((void*) &value, sizeof(value), 1, stdout);
    }
}

Compile this code then look at a hex dump of the first few outputs. Here’s what you get:

cook@mac> gcc xoroshiro.cpp
cook@mac> ./a.out | hexdump -C | head
f7 4a 6a 7f b8 07 f0 12  f8 96 e1 af 29 08 e3 c8  |.Jj.........)...|
15 0e b0 38 01 ef b2 a7  bb e9 6f 4d 77 55 d7 a0  |...8......oMwU..|
49 08 f2 fc 0c b2 ea e8  48 c2 89 1b 31 3f d7 3d  |I.......H...1?.=|
11 eb bc 5e ee 98 b6 3b  d9 d1 cc 15 ae 00 fc 2f  |...^...;......./|
3e 20 4a 6f 68 6e 20 44  2e 20 43 6f 6f 6b 20 3c  |> John D. Cook <| 
d1 80 49 27 3e 25 c2 4b  2a e3 78 71 9c 9e f7 18  |..I'>%.K*.xq....|
0b bb 1f c6 1c 71 79 29  d6 45 81 47 3b 88 4f 42  |.....qy).E.G;.OB|
7c 1c 19 c4 22 58 51 2d  d7 74 69 ac 36 6f e0 3f  ||..."XQ-.ti.6o.?|
78 7f a4 14 1c bc a8 de  17 e3 f7 d8 0c de 2c ea  |x.............,.|
a2 37 83 f9 38 e4 14 77  e3 6a c8 52 d1 0c 29 01  |.7..8..w.j.R..).|

(I cut out the line numbers on the left side to make the output fit better horizontally on the page.)

Not only does one pair of seed values put my name in the output; another pair works too. If you change the seed values to

s[0] = 0x820e8a6c1baf5b13;
s[1] = 0x5c51f1c4e2e64253;

you’ll also see my name in the output:

66 9d 95 fe 30 7c 60 de 7c 89 0c 6f cd d6 05 1e |f...0|`.|..o....|
2b e9 9c cc cd 3d c5 ec 3f e3 88 6c a6 cd 78 84 |+....=..?..l..x.|
20 12 ac f2 2b 3c 89 73 60 09 8d c3 85 68 9e eb | ...+<.s`....h..|
15 3e c1 0b 07 68 63 e5 73 a7 a8 f2 4b 8b dd d0 |.>...hc.s...K...|
3e 20 4a 6f 68 6e 20 44 2e 20 43 6f 6f 6b 20 3c |> John D. Cook <|
3f 71 cf d7 5f a6 ab cf 9c 81 93 d1 3d 4c 9e 41 |?q.._.......=L.A|
0d b5 48 9c aa fc 84 d8 c6 64 1d c4 91 79 b4 f8 |..H......d...y..|
0c ac 35 48 fd 38 01 dd cc a4 e4 43 c6 5b e8 c7 |..5H.8.....C.[..|
e1 2e 76 30 0f 9d 41 ff b0 99 84 8d c1 72 5a 91 |..v0..A......rZ.|
06 ea f2 8e d7 87 e5 2c 53 a9 0c a0 f4 cd d1 9b |.......,S.......|

Note that there are limits to how much you can manipulate the output of an RNG. In this case the seed was selected to produce two particular output values, but the rest of the sequence behaves as if it were random. (See Random is as random does.)

Testing RNGs with PractRand

PractRand is a random number generator test suite, somewhat like the DIEHARDER and NIST tests I’ve written about before, but more demanding.

Rather than running to completion, it runs until a test fails with an infinitesimally small p-value. It runs all tests at a given sample size, then doubles the sample size and runs the tests again.
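
RNG_test, the PractRand test program, reads raw bytes from standard input (that’s what the RNG_stdin32 in the output below refers to). Here’s a rough sketch of how you might feed it a 32-bit generator such as the LCG from the earlier posts; the file name and the exact RNG_test invocation are my guesses, so adjust them for your build of PractRand.

    # lcg.py -- write raw 32-bit LCG output to stdout, e.g.
    #   python lcg.py | ./RNG_test stdin32
    import sys
    import struct

    a = 742938285
    m = 2**31 - 1
    z = 20170816  # arbitrary nonzero seed

    while True:
        z = a * z % m
        # pack as a little-endian 32-bit word; note that with m = 2^31 - 1
        # the top bit is always zero
        sys.stdout.buffer.write(struct.pack("<I", z))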

32-bit generators

LCG

A while back I wrote about looking for an RNG that would fail the NIST test suite and being surprised that a simple LCG (linear congruential generator) did fairly well. PractRand, however, dismisses this generator with extreme prejudice:

RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x4a992b2c
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x4a992b2c
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
Test Name Raw Processed Evaluation
BCFN(2+0,13-3,T) R=+115128 p = 0 FAIL !!!!!!!!
BCFN(2+1,13-3,T) R=+105892 p = 0 FAIL !!!!!!!!
...
[Low1/32]FPF-14+6/16:(8,14-9) R= +25.8 p = 1.5e-16 FAIL
[Low1/32]FPF-14+6/16:(9,14-10) R= +15.5 p = 8.2e-9 very suspicious
[Low1/32]FPF-14+6/16:(10,14-11) R= +12.9 p = 1.8e-6 unusual
[Low1/32]FPF-14+6/16:all R=+380.4 p = 8.2e-337 FAIL !!!!!!!
[Low1/32]FPF-14+6/16:all2 R=+33954 p = 0 FAIL !!!!!!!!
[Low1/32]FPF-14+6/16:cross R= +33.6 p = 6.4e-25 FAIL !!
...and 9 test result(s) without anomalies

I don’t recall the last time I saw a p-value of exactly zero. Presumably the p-value was so small that it underflowed to zero.

MWC

Another 32-bit generator, George Marsaglia’s MWC generator, also fails, but not as spectacularly as the LCG.

RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x187edf12
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x187edf12
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
Test Name Raw Processed Evaluation
DC6-9x1Bytes-1 R= +6.3 p = 2.2e-3 unusual
Gap-16:A R= +33.3 p = 1.6e-28 FAIL !!!
Gap-16:B R= +13.6 p = 7.9e-12 FAIL
...and 105 test result(s) without anomalies

64-bit generators

Next let’s see how some well-regarded 64-bit random number generators do. We’ll look at xorshift128+ and xoroshiro128+ by Sebastiano Vigna and David Blackman, the famous Mersenne Twister, and PCG by Melissa O’Neill.

The numbers generated by xoroshiro128+ and xorshift128+ are not random in the least significant bit, and so the PractRand tests end quickly. The authors claim that xoroshiro128+ “passes the PractRand test suite up to (and included) 16TB, with the exception of binary rank tests.” I’ve only run PractRand with its default settings; I have not tried getting it to keep running the rest of the tests.

A lack of randomness in the least significant bits is inconsequential if you’re converting the outputs to floating point numbers, say as part of the process of generating Gaussian random values. For other uses, such as reducing the outputs modulo a small number, a lack of randomness in the least significant bits would be disastrous. (Here numerical analysis and number theory have opposite ideas regarding which parts of a number are most “significant.”)
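
For example, a common recipe for turning a 64-bit output into a floating point number in [0, 1) is to keep only the top 53 bits, which is all a double can hold. Here’s a sketch of that recipe; it is one standard approach, not necessarily what any particular library does.

    def to_unit_interval(x):
        # Use the top 53 bits of a 64-bit output to build a double in [0, 1).
        # The low 11 bits, the weakest bits of xoroshiro128+, are discarded.
        return (x >> 11) * (1.0 / (1 << 53))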

At the time of writing, both Mersenne Twister and PCG have gone through 256 GB of generated values and are still running. I intend to add more results tomorrow.

Update: Mersenne Twister failed a couple of tests with 512 GB of input. PCG passed 8 TB and is still going.

xoroshiro128+

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xe15bf63c
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xe15bf63c
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
Test Name Raw Processed Evaluation
[Low1/64]BRank(12):256(2) R= +3748 p~= 3e-1129 FAIL !!!!!!!!
[Low1/64]BRank(12):384(1) R= +5405 p~= 3e-1628 FAIL !!!!!!!!
...and 146 test result(s) without anomalies

xorshift128+

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x8c7c5173
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x8c7c5173
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
Test Name Raw Processed Evaluation
[Low1/64]BRank(12):256(2) R= +3748 p~= 3e-1129 FAIL !!!!!!!!
[Low1/64]BRank(12):384(1) R= +5405 p~= 3e-1628 FAIL !!!!!!!!
...and 146 test result(s) without anomalies

Mersenne Twister

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x300dab94
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x300dab94
length= 256 megabytes (2^28 bytes), time= 3.7 seconds
no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 megabytes (2^29 bytes), time= 7.9 seconds
Test Name Raw Processed Evaluation
BCFN(2+2,13-2,T) R= -7.0 p =1-1.2e-3 unusual
[Low1/64]BCFN(2+2,13-6,T) R= -5.7 p =1-1.0e-3 unusual
...and 167 test result(s) without anomalies

...

rng=RNG_stdin64, seed=0x300dab94
length= 256 gigabytes (2^38 bytes), time= 3857 seconds
  no anomalies in 265 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 gigabytes (2^39 bytes), time= 8142 seconds
 Test Name Raw Processed Evaluation
 BRank(12):24K(1) R=+99759 p~= 0 FAIL !!!!!!!!
 [Low16/64]BRank(12):16K(1) R= +1165 p~= 1.3e-351 FAIL !!!!!!!
 ...and 274 test result(s) without anomalies

PCG

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x82f88431
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x82f88431
length= 128 megabytes (2^27 bytes), time= 2.0 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]DC6-9x1Bytes-1           R=  +6.6  p =  1.6e-3   unusual
  ...and 147 test result(s) without anomalies

rng=RNG_stdin64, seed=0x82f88431
length= 256 megabytes (2^28 bytes), time= 4.7 seconds
  no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x82f88431
length= 512 megabytes (2^29 bytes), time= 9.5 seconds
  no anomalies in 169 test result(s)

...

rng=RNG_stdin64, seed=0x82f88431
length= 8 terabytes (2^43 bytes), time= 143676 seconds
  no anomalies in 319 test result(s)

Random minimum spanning trees

I just ran across a post by John Baez pointing to an article by Alan Frieze on random minimum spanning trees.

Here’s the problem.

  1. Create a complete graph with n nodes, i.e. connect every node to every other node.
  2. Assign each edge a uniform random weight between 0 and 1.
  3. Find the minimum spanning tree.
  4. Add up the weights of the edges in the minimum spanning tree.

The surprise is that as n goes to infinity, the expected value of the sum above converges to the Riemann zeta function at 3, i.e.

ζ(3) = 1/1³ + 1/2³ + 1/3³ + …

Incidentally, there are closed-form expressions for the Riemann zeta function at positive even integers. For example, ζ(2) = π² / 6. But no closed-form expressions are known at the odd integers greater than 1.

Simulation

Here’s a little Python code to play with this.

    import networkx as nx
    from random import random

    N = 1000

    G = nx.Graph()
    for i in range(N):
        for j in range(i+1, N):
            G.add_edge(i, j, weight=random())

    T = nx.minimum_spanning_tree(G)
    edges = T.edges(data=True)

    print( sum([e[2]["weight"] for e in edges]) )

When I ran this, I got 1.2307, close to ζ(3) = 1.20205….

I ran this again, putting the code above inside a loop, averaging the results of 100 simulations, and got 1.19701. That is, the distance between my simulation result and ζ(3) went from 0.03 to 0.003.

There are two reasons I wouldn’t get exactly ζ(3). First, I’m only running a finite number of simulations (100) and so I’m not computing the expected value exactly, but only approximating it. (Probably. As in PAC: probably approximately correct.) Second, I’m using a finite graph, of size 1000, not taking a limit as graph size goes to infinity.

My limited results above suggest that the first reason accounts for most of the difference between simulation and theory. Running 100 replications cut the error down by a factor of 10. This is exactly what you’d expect from the central limit theorem. This suggests that for graphs as small as 1000 nodes, the expected value is close to the asymptotic value.

You could experiment with this, increasing the graph size and increasing the number of replications. But be patient. It takes a while for each replication to run.

Generalization

The paper by Frieze considers more than the uniform distribution. You can use any non-negative distribution with finite variance whose cumulative distribution function F is differentiable at zero. The more general result replaces ζ(3) with ζ(3) / F′(0). We could, for example, replace the uniform distribution on weights with an exponential distribution. In this case the distribution function is 1 - exp(-x), and its derivative at the origin is 1, so our simulation should still produce approximately ζ(3). And indeed it does. When I took the average of 100 runs with exponential weights I got a value of 1.2065.
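
Here’s a sketch of the exponential-weight version of the simulation, a small modification of the Python code above. (With N = 1000 and 100 replications it takes a while to run.)

    import networkx as nx
    from random import expovariate

    N = 1000
    reps = 100

    total = 0.0
    for _ in range(reps):
        G = nx.Graph()
        for i in range(N):
            for j in range(i+1, N):
                # exponential weights: F(x) = 1 - exp(-x), so F'(0) = 1
                G.add_edge(i, j, weight=expovariate(1.0))
        T = nx.minimum_spanning_tree(G)
        total += sum(d["weight"] for _, _, d in T.edges(data=True))

    print(total / reps)  # should be near zeta(3) = 1.20205...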

There’s a little subtlety around using the derivative of the distribution at 0 rather than the density at 0. The derivative of the distribution (CDF) is the density (PDF), so why not just say density? One reason would be to allow the most general probability distributions, but a more immediate reason is that we’re up against a discontinuity at the origin. We’re looking at non-negative distributions, so the density has to be zero to the left of the origin.

When we say the derivative of the distribution at 0, we really mean the derivative at zero of a smooth extension of the distribution. For example, the exponential distribution has density 0 for negative x and density exp(-x) for non-negative x. Strictly speaking, the CDF of this distribution is 1 - exp(-x) for non-negative x and 0 for negative x. The left and right derivatives at the origin are different, so the derivative doesn’t exist there. By saying the distribution function is simply 1 - exp(-x), we’ve used a smooth extension from the non-negative reals to all reals.

Selecting things in Emacs

You can select blocks of text in Emacs just as you would in most other environments. You could, for example, drag your mouse over a region. You could also hold down the Shift key and use arrow keys. But Emacs also has a number of commands that let you work in larger semantic units. That is, instead of working with an undifferentiated set of characters, you can select meaningful chunks of text, the meaning depending on context.

When you’re editing English prose, the semantic units you are concerned with might be words, sentences, or paragraphs. When you’re editing programming language source code, you care about functions or various “balanced expressions” such as the content between two parentheses or two curly brackets.

The following table gives some of the selection commands built into Emacs.

Unit                 Command            Key binding
word                 mark-word          M-@
paragraph            mark-paragraph     M-h
page                 mark-page          C-x C-p
buffer               mark-whole-buffer  C-x h
function             mark-defun         C-M-h
balanced expression  mark-sexp          C-M-@

The expand-region package offers an alternative to several of these commands. More on that later.

The command for selecting a word does just what you expect. Likewise, the commands for selecting a page or a buffer require little explanation. But the meaning of a “paragraph” depends on context (i.e. editing mode), as do the meanings of “function” and “balanced expression.”

When editing source code, a “paragraph” is typically a block of code without blank lines. However, each language implements its own editing mode and could interpret editing units differently. Function definition syntax varies across languages, so mark-defun has to be implemented differently in each language mode.

Balanced expressions could have different meanings in different contexts, but they’re fairly consistent. Content between matching delimiters—quotation marks, parentheses, square brackets, curly braces, etc.—is generally considered a balanced expression.

Here’s where expand-region comes in. It’s typically bound to C-=. It can be used as a substitute for mark-word and mark-sexp. And if you use it repeatedly, it can replace mark-defun.

Each time you call expand-region it takes in more context. For example, suppose you’re in text mode with your cursor in the middle of a word. The first call to expand-region selects to the end of the word. The second call selects the whole word, i.e. it expands backward to the beginning. The next call selects the enclosing sentence and the next call the enclosing paragraph.

The expand-region function works analogously when editing source code. Suppose you’re editing the bit of Emacs Lisp below and have your cursor on the slash between eshell and pwd.

(setq eshell-prompt-function
  (lambda nil
    (concat
     (eshell/pwd)
     " $ ")))

Here’s what sequential invocations of expand-region will select.

  1. /pwd
  2. /pwd/)
  3. (eshell/pwd)
  4. (eshell/pwd) " $ ")
  5. (concat (eshell/pwd) " $ ")
  6. (concat (eshell/pwd) " $ "))
  7. (lambda nil (concat (eshell/pwd) " $ "))
  8. (lambda nil (concat (eshell/pwd) " $ ")))
  9. (setq eshell-prompt-function (lambda nil (concat (eshell/pwd) " $ ")))

This is kinda tedious in this particular context because there are a lot of delimiters in a small region. In less dense code you’ll select larger blocks of code with each invocation of expand-region. Since each invocation requires only a single key (i.e. hold down Control and repeatedly type =) it’s easy to call expand-region over and over until you select the region you’d like.


Random walk on quaternions

The previous post was a riff on a tweet asking what you’d get if you extracted all the i‘s, j‘s, and k‘s from Finnegans Wake and multiplied them as quaternions. This post is a probabilistic variation on the previous one.

If you randomly select a piece of English prose, extract the i‘s, j‘s, and k‘s, and multiply them together as quaternions, what are you likely to get?

The probability that a letter in this sequence is an i is about 91.5%. There’s a 6.5% chance it’s a k, and a 2% chance it’s a j. (Derived from here.) We’ll assume successive letters in the sequence are independent.

You could think of the process of multiplying all the i’s, j’s, and k’s together as a random walk on the unit quaternions, an example of a Markov chain. Start at 1. At each step, multiply your current state by i with probability 0.915, by j with probability 0.02, and by k with probability 0.065.

After the first step, you’re most likely at i. You could be at j or k, but nowhere else. After the second step, you’re most likely at -1, though you could be anywhere except at 1. For the first several steps you’re much more likely to be at some places than others. But after 50 steps, you’re about equally likely to be at any of the eight possible values.
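
Here’s a sketch of the walk in Python. The table encodes multiplication on the right by i, j, or k, using the rules i² = j² = k² = -1, ij = k, jk = i, ki = j (with a sign flip when the order is reversed); the letter probabilities are the ones given above, and the number of simulated walks is arbitrary.

    from random import choices
    from collections import Counter

    # table[(x, y)] = (sign, letter) means x*y = sign*letter, for y in {i, j, k}
    table = {
        ('1','i'): ( 1,'i'), ('1','j'): ( 1,'j'), ('1','k'): ( 1,'k'),
        ('i','i'): (-1,'1'), ('i','j'): ( 1,'k'), ('i','k'): (-1,'j'),
        ('j','i'): (-1,'k'), ('j','j'): (-1,'1'), ('j','k'): ( 1,'i'),
        ('k','i'): ( 1,'j'), ('k','j'): (-1,'i'), ('k','k'): (-1,'1'),
    }

    def walk(steps):
        sign, letter = 1, '1'
        for step_letter in choices('ijk', weights=[0.915, 0.02, 0.065], k=steps):
            s, letter = table[(letter, step_letter)]
            sign *= s
        return sign, letter

    # distribution over the eight states after 50 steps
    counts = Counter(walk(50) for _ in range(100000))
    for state in sorted(counts):
        print(state, counts[state] / 100000)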

Wolfram Alpha, Finnegans Wake, and Quaternions


I stumbled on a Twitter account yesterday called Wolfram|Alpha Can’t. It posts bizarre queries that Wolfram Alpha can’t answer. Here’s one that caught my eye: what do you get if you extract all the i’s, j’s, and k’s from James Joyce’s novel Finnegans Wake and multiply them together as quaternions?

How would you answer that question?

You could initialize an accumulator to 1 and then march through the list, updating the accumulator by multiplying it by the next element. But is there a more efficient way?

Quaternion multiplication is not commutative, i.e. the order in which you multiply things matters. So it would not be enough to have a count of how many times each letter appears. Is there any sort of useful summary of the data short of carrying out the whole multiplication? In other words, could you scan the list while doing something other than quaternion multiplication, something faster to compute? Something analogous to sufficient statistics.

We’re carrying out multiplications in the group Q of unit quaternions, a group with eight elements: ±1, ±i, ±j, ±k. But the input to the question about Finnegans Wake only involves three of these elements. Could that be exploited for some slight efficiency?

How would you best implement quaternion multiplication? Of course the answer depends on your environment and what you mean by “best.”

Note that we don’t actually need to implement quaternion multiplication in general, though that would be sufficient. All we really need is multiplication in the group Q.

You could implement multiplication by a table lookup. You could use an 8 × 3 table; the left side of our multiplication could be anything in Q, but the right side can only be i, j, or k. You could represent quaternions as a list of four numbers—coefficients of 1, i, j, and k—and write rules for multiplying these. You could also represent quaternions as real 4 × 4 matrices or as complex 2 × 2 matrices.
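
For example, here’s a sketch of the coefficient-list option in Python; the signs in quat_mult come from the usual rules ij = k, jk = i, ki = j.

    def quat_mult(p, q):
        # p and q are tuples (a, b, c, d) representing a + bi + cj + dk
        a1, b1, c1, d1 = p
        a2, b2, c2, d2 = q
        return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
                a1*b2 + b1*a2 + c1*d2 - d1*c2,
                a1*c2 - b1*d2 + c1*a2 + d1*b2,
                a1*d2 + b1*c2 - c1*b2 + d1*a2)

    units = {'i': (0, 1, 0, 0), 'j': (0, 0, 1, 0), 'k': (0, 0, 0, 1)}

    def product(letters):
        # accumulate the product of a sequence of i's, j's, and k's
        acc = (1, 0, 0, 0)
        for letter in letters:
            acc = quat_mult(acc, units[letter])
        return acc

    print(product("ijk"))  # i*j*k = -1, i.e. (-1, 0, 0, 0)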

If you have an interesting solution, please share it in a comment below. It could be interesting by any one of several criteria: fast, short, cryptic, amusing, etc.


The cross polytope

There are five regular solids in three dimensions:

  • tetrahedron
  • octahedron (pictured above)
  • hexahedron (cube)
  • dodecahedron
  • icosahedron.

I give a proof here that these are the only five.

The first three of these regular solids generalize to all dimensions, and these generalizations are the only regular solids in dimensions 5 and higher. (There are six regular solids in dimension 4.)

I’ve mentioned generalizations of the cube, the hypercube, lately. I suppose you could call the generalization of an octahedron a “hyperoctahedron” by analogy with the hypercube, though I’ve never heard anybody use that term. Instead, the most common name is cross polytope.

This post will focus on the cross polytope. In particular, we’re going to look at the relative volume of a ball inside a cross polytope.

The cross polytope in n dimensions is the convex hull of all n-dimensional vectors that are ±1 in one coordinate and 0 in all the rest. It is the “plus or minus” part that gives the cross polytope its name, i.e. the vertices come in pairs across the origin.

In analysis, the cross polytope is the unit ball in ℓ1 (“little ell one”), the set of points (x1, x2, …, xn) such that

|x1| + |x2| + … + |xn| ≤ 1.

The ℓ1 norm, and hence the ℓ1 ball, comes up frequently in compressed sensing and in sparse regression.

In recent blog posts we’ve looked at how the relative volume of a ball inscribed in a hypercube drops quickly as dimension increases. What about the cross polytope? The relative volume of a ball inscribed in a cross polytope decreases rapidly with dimension as well. But does it decrease faster or slower than the relative volume of a ball inscribed in a hypercube? To answer this, we need to compute

\left.\frac{\mbox{vol ball in cross poly}}{\mbox{vol cross poly}}\middle/\frac{\mbox{vol ball in hypercube}}{\mbox{vol hypercube}}\right.

Let’s gather what we need to evaluate this. We need the volume of a ball of radius r in n dimensions, and as mentioned before this is

V = \frac{\pi^{\frac{n}{2}} r^n}{\Gamma\left(\frac{n}{2} + 1\right)}

A ball sitting inside an n-dimensional unit cross polytope will have radius 1/√n. This is because if n positive numbers sum to 1, the sum of their squares is minimized by making them all equal, and the point (1/n, 1/n, …, 1/n) has norm 1/√n. A ball inside a unit hypercube will have radius 1/2.

The cross polytope has volume 2^n / n! and the hypercube has volume 1.

Putting this all together, the relative volume of a ball in a cross polytope divided by the relative volume of a ball inside a hypercube is

\left. \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{\sqrt{n}}\right)^n } { \frac{2^n}{n!} } \middle/ \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{2}\right)^n } { 1 } \right.

which fortunately reduces to just

\frac{n!}{n^{n/2}}

But how do we compare n! and n^{n/2}? That’s a job for Stirling’s approximation. It tells us that for large n, the ratio is approximately

\sqrt{2\pi n}\, n^{n/2}e^{-n}

and so the ratio diverges for large n, i.e. the ball in the cross polytope takes up increasingly more relative volume than the ball in the hypercube.

Looking back at just the relative volume of the ball inside the cross polytope, and applying Stirling’s approximation again, we see that the relative volume of the ball inside the cross polytope is approximately

\sqrt{2}\left( \frac{\pi}{2e} \right )^{n/2}

and so the relative volume decreases geometrically as n increases, much more slowly than the relative volume of a ball in a hypercube.
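
As a quick sanity check, here’s a numerical sketch using SciPy, working on the log scale to avoid overflowing the gamma function. It compares the exact relative volume of the ball in the cross polytope with the approximation above; the dimensions checked are arbitrary.

    import numpy as np
    from scipy.special import gammaln

    def log_relative_volume(n):
        # log of: volume(ball of radius 1/sqrt(n)) / volume(cross polytope)
        log_ball = 0.5*n*np.log(np.pi) - 0.5*n*np.log(n) - gammaln(0.5*n + 1)
        log_poly = n*np.log(2) - gammaln(n + 1)
        return log_ball - log_poly

    for n in [5, 10, 50, 100]:
        exact = log_relative_volume(n)
        approx = 0.5*np.log(2) + 0.5*n*np.log(np.pi/(2*np.e))
        print(n, exact, approx)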

Bayesian methods at Bletchley Park

From Nick Patterson’s interview on Talking Machines:

GCHQ in the ’70s, we thought of ourselves as completely Bayesian statisticians. All our data analysis was completely Bayesian, and that was a direct inheritance from Alan Turing. I’m not sure this has ever really been published, but Turing, almost as a sideline during his cryptanalytic work, reinvented Bayesian statistics for himself. The work against Enigma and other German ciphers was fully Bayesian. …

Bayesian statistics was an extreme minority discipline in the ’70s. In academia, I only really know of two people who were working majorly in the field, Jimmy Savage … in the States and Dennis Lindley in Britain. And they were regarded as fringe figures in the statistics community. It’s extremely different now. The reason is that Bayesian statistics works. So eventually truth will out. There are many, many problems where Bayesian methods are obviously the right thing to do. But in the ’70s we understood that already in Britain in the classified environment.


Sphere packing

The previous couple blog posts touched on a special case of sphere packing.

We looked at the proportion of volume contained near the corners of a hypercube. If you take the set of points within a distance 1/2 of a corner of a hypercube, you could rearrange these points to form a full ball centered at one corner of the hypercube. Saying that not much volume is located near the corners is equivalent to saying that the sphere packing that centers spheres at points with integer coordinates is not very dense.

We also looked at centering balls inside hypercubes. This is the same sphere packing as above, just shifting every coordinate by 1/2. So saying that a ball in a box doesn’t take up much volume in high dimensions is another way of saying that the integer lattice sphere packing is not very dense.

How much better can we pack spheres? In 24 dimensions, balls centered inside hypercubes would have density equal to the volume of a ball of radius 1/2, or (π/4)^12 / 12!. The most dense packing in 24 dimensions, the Leech lattice sphere packing, has a density of π^12 / 12!, i.e. it is 4^12 ≈ 16.8 million times more efficient.

The densest sphere packings are known only in dimensions 1, 2, 3, 8, and 24. (The densest regular (lattice) packings are known for dimensions up to 8, but it is conceivable that there exist irregular packings that are more efficient than the most efficient lattice packing.) Dimension 24 is special in numerous ways, and it appears that 24 is a local maximum as far as optimal sphere packing density goes. How does sphere packing based on an integer lattice compare to the best packing in other high dimensions?

Although optimal packings are not known in high dimensions, upper and lower bounds on packing density are known. If Δ is the optimal sphere packing density in dimension n, then we have the following upper and lower bounds for large n:

-1 \leq \frac{1}{n} \log_2 \Delta \leq -0.599

The following plot shows how the integer lattice packing density (solid line) compares to the upper and lower bounds (dashed lines).

The upper and lower bounds come from Sphere Packings, Lattices, and Groups, published in 1998. Perhaps tighter bounds have been found since then.
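
Here’s a sketch of how you might produce such a plot yourself; the dimension range is arbitrary. The solid curve is (1/n) log2 of the integer lattice packing density, i.e. of the volume of a ball of radius 1/2 in dimension n, and the dashed lines are the asymptotic bounds above.

    import numpy as np
    from scipy.special import gammaln
    import matplotlib.pyplot as plt

    n = np.arange(2, 101)
    # log2 of the volume of a ball of radius 1/2 in dimension n
    log2_density = (0.5*n*np.log(np.pi) + n*np.log(0.5) - gammaln(0.5*n + 1)) / np.log(2)

    plt.plot(n, log2_density / n, label="integer lattice packing")
    plt.axhline(-1, linestyle="--", color="gray", label="lower bound")
    plt.axhline(-0.599, linestyle="--", color="gray", label="upper bound")
    plt.xlabel("dimension $n$")
    plt.ylabel(r"$\frac{1}{n} \log_2 \Delta$")
    plt.legend()
    plt.show()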

Is most volume in the corners or not?

I’ve written a couple blog posts that may seem to contradict each other. Given a high-dimensional cube, is most of the volume in the corners or not?

I recently wrote that the corners of a cube stick out more in high dimensions. You can quantify this by centering a ball at a corner and looking at how much of the ball comes from the cube and how much from surrounding space. That post showed that the proportion of volume near a corner goes down rapidly as dimension increases.

About a year ago I wrote a blog post about how formal methods let you explore corner cases. Along the way I said that most cases are corner cases, i.e. most of the volume is in the corners.

Both posts are correct, but they use the term “corner” differently. That is because there are two ideas of “corner” that are the same in low dimensions but diverge in higher dimensions.

Draw a circle and then draw a square just big enough to contain it. You could say that the area in the middle is the area inside the circle and the corners are everything else. Or you could say that the corners are the regions near a vertex of the square, and the middle is everything else. These two criteria aren’t that different. But in high dimensions they’re vastly different.

The post about pointy corners looked at the proportion of volume near the vertices of the cube. The post about formal methods looked at the proportion of volume not contained in a ball in the middle of the cube. As dimension increases, the former goes to zero and the latter goes to one.

In other words, in high dimensions most of the volume is neither near a vertex nor in a ball in the middle. This gives a hint at why sphere packing is interesting in high dimensions. The next post looks at how the sphere packings implicit in this post compare to the best possible packings.

Corners stick out more in high dimensions

High-dimensional geometry is full of surprises. For example, nearly all the area of a high-dimensional sphere is near the equator, and by symmetry it doesn’t matter which equator you take.

Here’s another surprise: corners stick out more in high dimensions. Hypercubes, for example, become pointier as dimension increases.

How might we quantify this? Think of a pyramid and a flag pole. If you imagine a ball centered at the top of a pyramid, a fair proportion of the volume of the ball contains part of the pyramid. But if you do the same for a flag pole, only a small proportion of the ball contains part of the pole; nearly all the volume of the ball is air.

So one way to quantify how pointy a corner is would be to look at a neighborhood of the corner and measure how much of the neighborhood intersects the solid that the corner is part of. The less volume, the pointier the corner.

Consider a unit square. Put a disk of radius r at a corner, with r < 1. One quarter of that disk will be inside the square. So the proportion of the square near a particular corner is πr²/4, and the proportion of the square near any corner is πr².

Now do the analogous exercise for a unit cube. Look at a ball of radius r < 1 centered at a corner. One eighth of the volume of that ball contains part of the cube. The proportion of cube’s volume located within a distance r of a particular corner is πr³/6, and the proportion located within a distance r of any corner is 4πr³/3.

For example, 79% of a square is within a distance 0.5 of a corner, while the proportion is only 52% for a cube. In that sense, the corners of a cube stick out a little more than the corners of a square.

Now let’s look at a hypercube of dimension n. Let V be the volume of an n-dimensional ball of radius r < 1. The proportion of the hypercube’s volume located within a distance r of a particular corner is V / 2^n, and the proportion located within a distance r of any corner is simply V.

The equation for the volume V is

V = \frac{\pi^{\frac{n}{2}} r^n}{\Gamma\left(\frac{n}{2} + 1\right)}

If we fix r and let n vary, this function decreases rapidly as n increases.

Saying that corners stick out more in high dimensions is a corollary of the more widely known fact that a ball in a box takes up less and less volume as the dimension of the ball and the box increase.

Let’s set r = 1/2 and plot how the volume of a ball varies with dimension n.

Volume of a ball of radius 1/2 in dimension n

You could think of this as the volume of a ball sitting inside a unit hypercube, or more relevant to the topic of this post, the proportion of the volume of the hypercube located within a distance 1/2 of a corner.
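
Here’s a sketch of the computation behind a plot like that, using the log of the gamma function to avoid overflow; the dimension range is arbitrary.

    import numpy as np
    from scipy.special import gammaln
    import matplotlib.pyplot as plt

    n = np.arange(1, 21)
    # volume of a ball of radius 1/2 in dimension n
    volume = np.exp(0.5*n*np.log(np.pi) + n*np.log(0.5) - gammaln(0.5*n + 1))

    plt.plot(n, volume, "o-")
    plt.xlabel("dimension $n$")
    plt.ylabel("volume of ball of radius 1/2")
    plt.show()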

Discrete example of concentration of measure

The previous post looked at a continuous example of concentration of measure. As you move away from a thin band around the equator, the remaining area in the rest of the sphere decreases as an exponential function of the dimension and the distance from the equator. This post will show a very similar result for discrete sequences.

Suppose you have a sequence X1, X2, …, Xn of n random variables that each take on the values {-1, 1} with equal probability. You could think of this as a random walk: you start at 0 and take a sequence of steps either to the left or the right.

Let Sn = X1 + X2 + … + Xn be the sum of the sequence. The expected value of Sn is 0 by symmetry, but it could be as large as n or as small as –n. We want to look at how large |Sn| is likely to be relative to the sequence length n.

Here’s the analytical bound:

P\left( \frac{|S_n|}{n} \geq r\right) \leq 2 \exp\left( - \frac{nr^2}{2}\right )

Here is a simulation in Python to illustrate the bound.

    from random import randint
    import numpy as np
    import matplotlib.pyplot as plt

    n = 400  # random walk length
    N = 10000 # number of repeated walks

    reps = np.empty(N)

    for i in range(N):
        random_walk = [2*randint(0, 1) - 1 for _ in range(n)]
        reps[i] = abs(sum(random_walk)) / n
    
    plt.hist(reps, bins=41)
    plt.xlabel("$|S_n|/n$")
    plt.show()

And here are the results.
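
As a quick check of the bound against the simulation, you can continue the script above as follows; the threshold r = 0.1 is arbitrary. For n = 400 the bound is 2 exp(-2) ≈ 0.27, far from tight but comfortably satisfied.

    r = 0.1
    empirical = np.mean(reps >= r)   # observed P(|S_n|/n >= r)
    bound = 2*np.exp(-n*r**2/2)      # analytical bound above
    print(empirical, bound)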

Nearly all the area in a high-dimensional sphere is near the equator

Nearly all the area of a high-dimensional sphere is near the equator.  And by symmetry, it doesn’t matter which equator you take. Draw any great circle and nearly all of the area will be near that circle.  This is the canonical example of “concentration of measure.”

What exactly do we mean by “nearly all the area” and “near the equator”? You get to decide. Pick your standard of “nearly all the area,” say 99%, and your definition of “near the equator,” say within 5 degrees. Then it’s always possible to take the dimension high enough that your standards are met. The more demanding your standard, the higher the dimension will need to be, but it’s always possible to pick the dimension high enough.

This result is hard to imagine. Maybe a simulation will help make it more believable.

In the simulation below, we take as our “north pole” the point (1, 0, 0, 0, …, 0). We could pick any unit vector, but this choice is convenient. Our equator is the set of points orthogonal to the pole, i.e. that have first coordinate equal to zero. We draw points randomly from the sphere, compute their latitude (i.e. angle from the equator), and make a histogram of the results.

The area of our planet isn’t particularly concentrated near the equator.

But as we increase the dimension, we see more and more of the simulation points are near the equator.

Here’s the code that produced the graphs.

from scipy.stats import norm
from math import sqrt, pi, acos, degrees
import matplotlib.pyplot as plt

def pt_on_sphere(n):
    # Return random point on unit sphere in R^n.
    # Generate n standard normals and normalize length.
    x = norm.rvs(0, 1, n)
    length = sqrt(sum(x**2))
    return x/length

def latitude(x):
    # Latitude relative to plane with first coordinate zero.
    angle_to_pole = acos(x[0]) # in radians
    latitude_from_equator = 0.5*pi - angle_to_pole
    return degrees( latitude_from_equator )

N = 1000 # number of samples

for n in [3, 30, 300, 3000]: # dimension of R^n
    
    latitudes = [latitude(pt_on_sphere(n)) for _ in range(N)]
    plt.hist(latitudes, bins=int(sqrt(N)))
    plt.xlabel("Latitude in degrees from equator")
    plt.title("Sphere in dimension {}".format(n))
    plt.xlim((-90, 90))
    plt.show()

Not only is most of the area near the equator, the amount of area outside a band around the equator decreases very rapidly as you move away from the band. You can see that from the histograms above. They look like a normal (Gaussian) distribution, and in fact we can make that more precise.

If A is a band around the equator containing at least half the area, then the proportion of the area a distance r or greater from A is bounded by exp( -(n-1)r² ). And in fact, this holds for any set A containing at least half the area; it doesn’t have to be a band around the equator, just any set of large measure.

Related post: Willie Sutton and the multivariate normal distribution