Defining the Fourier transform on LCA groups

My previous post said that all the familiar variations on Fourier transforms—Fourier series analysis and synthesis, Fourier transforms on the real line, discrete Fourier transforms, etc.—can be unified into a single theory. They’re all instances of a Fourier transform on a locally compact Abelian (LCA) group. The difference between them is the underlying group.

Given an LCA group G, the Fourier transform takes a function on G and returns a function on the dual group of G. We said this much last time, but we didn’t define the dual group; we just stated examples. We also didn’t say just how you define a Fourier transform in this general setting.

Characters and dual groups

Before we can define a dual group, we have to define group homomorphisms. A homomorphism between two groups G and H is a function h between the groups that preserves the group structure. If the group operation is denoted by addition on G and by multiplication on H (as it will be in our application), then saying that h preserves the group structure means

h(x + y) = h(x) h(y)

for all x and y in G.

Next, let T be the unit circle, i.e. complex numbers with absolute value 1. T is a group with respect to multiplication. (Why T for circle? This is a common notation, anticipating generalization to toruses in all dimensions. A circle is a one-dimensional torus.)

Now a character on G is a continuous homomorphism from G to T. The set of all characters on G is the dual group of G. Call this group Γ. If G is an LCA group, then so is Γ.

Integration

The classical Fourier transform is defined by an integral. To define the Fourier transform on a group we have to have a way to do integration on that group. And there’s a theorem that says we can always do that. For every LCA group, there exists a Haar measure μ, and this measure is nice enough to develop our theory. This measure is essentially unique: Any two Haar measures on the same LCA group must be proportional to each other. In other words, the measure is unique up to multiplying by a constant.

On a discrete group—for our purposes, think of the integers and the integers mod m—Haar measure is just counting; the measure of a set is the number of things in the set. And integration with respect to this measure is summation.

Fourier transform defined

Let f be a function in L¹(G), i.e. an absolutely integrable function on G. Then the Fourier transform of f is a function on Γ defined by

\hat{f}(\gamma) = \int_G f(x)\, \gamma(-x) \, d\mu

What does this have to do with the classical Fourier transform? The classical Fourier transform takes a function of time and returns a function of frequency. The correspondence between the classical Fourier transform and the abstract Fourier transform is to associate the frequency ω with the character that takes x to the value exp(iωx).

There are multiple slightly different conventions for the classical Fourier transform cataloged here. These correspond to different constant multiples in the choice of measure on G and Γ, i.e. whether to divide by or multiply by √(2π), and in the correspondence between frequencies and characters, whether ω corresponds to exp(±iωx) or exp(±2πiωx).
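
To see the definition in action in the simplest setting, here is a small Python sketch for the group of integers mod m. The characters of that group are the maps n ↦ exp(2πikn/m) for k = 0, 1, …, m − 1, and Haar measure is counting measure, so the integral above becomes a finite sum. This is my own illustration, not code from the post; it should agree with the usual DFT, computed here with NumPy's FFT, up to floating point error.

import numpy as np

def character(m, k):
    # the character of the integers mod m indexed by k
    return lambda n: np.exp(2j * np.pi * k * n / m)

def fourier_transform(f, m):
    # \hat{f}(gamma_k) = sum_n f(n) gamma_k(-n), integrating against counting measure
    return np.array([sum(f[n] * character(m, k)(-n) for n in range(m))
                     for k in range(m)])

m = 8
f = np.random.rand(m)
print(np.allclose(fourier_transform(f, m), np.fft.fft(f)))  # prints True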

Unified theory of Fourier transforms

You can take a periodic function and analyze it into its Fourier coefficients, or use the Fourier coefficients in a sum to synthesize a periodic function. You can take the Fourier transform of a function defined on the whole real line and get another such function. And you can compute the discrete Fourier transform via the FFT algorithm.

Is there a general theory that unifies all these related but different things? Why yes, yes there is.

Groups

Everything in the opening paragraph is simply a Fourier transform, each in a different context. And the contexts correspond to groups. Specifically, locally compact Abelian groups.

Some of these groups are easier to see than others. Clearly the real numbers with addition form a group: the sum of two real numbers is a real number, etc. But where are the groups in the other contexts?

You can think of a periodic function as a function on a circle; the function values have to agree at both ends of an interval, so you might as well think of those two points as the same point, i.e. join them to make a circle. Shifting along an interval, wrapping around if necessary, corresponds to a rotation of the circle, and rotations form a group. So analyzing a periodic function into a set of Fourier coefficients is a Fourier transform on the circle.

You can think of a set of Fourier coefficients as a function on the integers, mapping n to the nth coefficient. Synthesizing a set of Fourier coefficients into a periodic function is a Fourier transform on the group of integers.

What about a discrete Fourier transform (DFT)? If you have a function sampled at m points, you could think of those points as the group of integers mod m. Your sampled points constitute a function on the integers mod m, and the DFT is a Fourier transform on that group.

Note that the DFT is a Fourier transform in its own right. It’s not an approximation per se, though it’s nearly always used as part of an approximation process. You can start with a continuous function, approximate it by a finite set of samples, compute the DFT of these samples, and the result will give you an approximation to the Fourier transform of the original continuous function.

What about functions of several variables? These are functions on groups too. A function of two real variables, for example, is a function on R², which is a group with (vector) addition.

Dual groups

A Fourier transform takes a function defined on a group and returns a function defined on the dual of that group. I go into exactly what a dual group is in my next post, but for now, just note that the Fourier transform takes a function defined on one group and returns a function defined on another group.

The dual of the circle is the integers, and vice versa. That’s why the Fourier transform of a function on the circle is an infinite set of Fourier coefficients, which we think of as a function on the integers. The Fourier transform of the function on the integers, i.e. a set of Fourier coefficients, is a function on the circle, i.e. a periodic function.

The dual group of the real numbers is the real numbers again. That’s why the Fourier transform of a function on the real line is another function on the real line.

The integers mod m is also its own dual group. So the DFT takes a set of m numbers and returns a set of m numbers.

Locally compact Abelian (LCA) groups

What do locally compact and Abelian mean? And why do we make these assumptions?

Let’s start with Abelian. This just means that the group operation is commutative. When we’re adding real numbers, or composing rotations of a circle, these operations are commutative.

Local compactness is a more technical requirement. The circle is compact, and so are the integers mod m. But if we restricted our attention to compact groups, that would leave out the integers and the real numbers. These spaces are not compact, but they’re locally compact, and that’s enough for the theory to go through.

It turns out that LCA groups are a sort of theoretical sweet spot. Assuming groups are LCA is general enough to include the examples we care about the most, but it’s not so general that the theory becomes harder and the results less powerful.

More connections

This post relates Fourier series (analysis and synthesis) to Fourier transforms (on the real line) by saying they’re both special cases of Fourier analysis on LCA groups. There are a couple other ways to connect Fourier series and Fourier transforms.

You can take the Fourier transform (not Fourier series) of a periodic function two ways: by restricting it to one period and defining it to be zero everywhere else, or by letting it repeat forever across the real line and taking the Fourier transform in the sense of generalized functions. You can read more about these two approaches in this post.

Solving problems we wish we had

There’s a great line from Heather McGaw toward the end of the latest episode of 99 Percent Invisible:

Sometimes … we can start to solve problems that we wish were problems because they’re easy to solve.

Reminds me of an excerpt from Richard Weaver’s book Ideas Have Consequences:

Obsession, according to the canons of psychology, occurs when an innocuous idea is substituted for a painful one. The victim simply avoids recognizing the thing which will hurt. We have seen that the most painful confession for the modern egoist to make is that there is a center or responsibility. He has escaped it by taking his direction with reference to the smallest points. … The obsession, however, is a source of great comfort to the obsessed.

Predicting when an RNG will output a given value

A few days ago I wrote about how to pick the seed of a simple random number generator so that a desired output came n values later. The number n was fixed and we varied the seed. In this post, the seed will be fixed and we’ll solve for n. In other words, we ask when a pseudorandom sequence will produce a given value.

In principle you could just run the RNG until you get the output you’re looking for, but we’ll assume such a brute force approach is not feasible or at least not fast enough.

If an LCG (linear congruential generator) has seed z, multiplier a, and modulus m, then the nth output is aⁿz reduced mod m. So our task is to solve

x = aⁿz mod m

for n. If we forget for a moment that we’re working with integers mod m, we see that the solution is

n = logₐ(x / z)

We can actually make this work if we interpret division by z to mean multiplication by the inverse of z mod m and if we interpret the logarithm to be a discrete logarithm. For more on discrete logarithms and one algorithm for computing them, see this post.

In an earlier post I used a = 742938285 and m = 2³¹ – 1 = 2147483647. We set n = 100 and solved for z to make the 100th output equal to 20170816, the date of that post. It turned out that z = 1898888478.

Now let’s set the seed z = 1898888478 and ask when the LCG sequence will take on the value x = 20170816. Of course we know that n will turn out to be 100, but let’s pretend we don’t know that. We’ll write a little Python script to find n.

I expect there’s a simple way to compute modular inverses using SymPy, but I haven’t found it, so I used some code from StackOverflow.

The following code produces n = 100, as expected.

from sympy.ntheory import discrete_log

def egcd(a, b):
    # extended Euclidean algorithm: returns (g, x, y) with a*x + b*y = g = gcd(a, b)
    if a == 0:
        return (b, 0, 1)
    else:
        g, y, x = egcd(b % a, a)
        return (g, x - (b // a) * y, y)

def modinv(a, m):
    # modular inverse of a mod m, via the extended Euclidean algorithm
    g, x, y = egcd(a, m)
    if g != 1:
        raise Exception('modular inverse does not exist')
    else:
        return x % m

a = 742938285
z = 1898888478
m = 2**31 - 1
x = 20170816
zinv = modinv(z, m)

# discrete_log(m, b, a) returns n such that a**n = b mod m
n = discrete_log(m, x*zinv, a)
print(n)

Programming language life expectancy

The Lindy effect says that what’s been around the longest is likely to remain around the longest. It applies to creative artifacts, not living things. A puppy is likely to live longer than an elderly dog, but a book that has been in press for a century is likely to be in press for another century.

In a previous post I go into the mathematical detail of the Lindy effect: power law distributions etc. The key fact we need for this blog post is that if something has the kind of survival distribution described by the Lindy effect, then the expected future lifetime equals the current age. For example, the 100 year old book in the opening paragraph is expected to be around for another 100 years.

Note that this is all framed in terms of probability distributions. It’s not saying, for example, that everything new will go away soon. Everything was once new. Someone watching Hamlet on opening night would be right to speculate that nobody would care about Hamlet in a few years. But now that we know Hamlet has been around for four centuries and is going strong, the Lindy effect would predict that people will be performing Hamlet in the 25th century.

Note that Lindy takes nothing into account except survival to date. Someone might have been more bullish on Hamlet after opening night based on other information such as how well the play was received, but that’s beyond the scope of the Lindy effect.

If we apply the Lindy effect to programming languages, we only consider how long they’ve been around and whether they’re thriving today. You might think that Go, for example, will be around for a long time based on Google’s enormous influence, but the Lindy effect does not take such information into account.

So, if we assume the Lindy effect holds, here are the expected years when programming languages would die. (How exactly would you define the time of death for a programming language? Doesn’t really matter. This isn’t that precise or that serious.)

Language    Born    Expected death
Go          2009    2025
C#          2000    2034
Java        1995    2039
Python      1991    2043
Haskell     1990    2044
C           1972    2062
Lisp        1959    2075
Fortran     1957    2077
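
The table is just arithmetic: under the Lindy effect the expected remaining lifetime equals the current age, so the expected death year is the birth year plus twice the age as of 2017, when this was written. Here’s a quick Python sketch that reproduces it.

birth_year = {"Go": 2009, "C#": 2000, "Java": 1995, "Python": 1991,
              "Haskell": 1990, "C": 1972, "Lisp": 1959, "Fortran": 1957}

this_year = 2017
for language, born in birth_year.items():
    age = this_year - born
    # Lindy: expected future lifetime equals current age
    print(language, born, born + 2*age)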

You can debate what it even means for a language to survive. For example, I’d consider Lisp to be alive and well if in the future people are programming Clojure but not Common Lisp, but others might disagree.

“We don’t know what language engineers will be coding in in the year 2100. However, we do know that it will be called FORTRAN.” — C.A.R. Hoare

Reverse engineering the seed of a linear congruential generator

The previous post gave an example of manipulating the seed of a random number generator to produce a desired result. This post will do something similar for a different generator.

A couple times I’ve used the following LCG (linear congruential random number generator) in examples. An LCG starts with an initial value of z and updates z at each step by multiplying by a constant a and taking the remainder by m. The particular LCG I’ve used has a = 742938285 and m = 2³¹ – 1 = 2147483647.

(I used these parameters because they have maximum range, i.e. every positive integer less than m appears eventually, and because it was one of the LCGs recommended in an article years ago. That is, it’s very good as far as 32-bit LCGs go, which isn’t very far. An earlier post shows how it quickly fails the PractRand test suite.)

Let’s pick the seed so that the 100th output of the generator is today’s date in ISO form: 20170816.

We need to solve

a¹⁰⁰ z = 20170816 mod 2147483647.

By reducing a¹⁰⁰ mod 2147483647 we find we need to solve

160159497 z = 20170816 mod 2147483647

and the solution is z = 1898888478. (See How to solve linear congruences.)
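
If you want to carry out that calculation yourself, here’s a sketch in Python. (The three-argument pow with a −1 exponent requires Python 3.8 or later.)

    a = 742938285
    m = 2**31 - 1
    x = 20170816

    a100 = pow(a, 100, m)            # 160159497
    z = x * pow(a100, -1, m) % m     # multiply by the inverse of a^100 mod m
    print(z)                         # 1898888478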

The following Python code verifies that the solution works.

    a = 742938285
    z = 1898888478
    m = 2**31 - 1

    for _ in range(100):
        z = a*z % m
    print(z)

Update: In this post, I kept n = 100 fixed and solved for the seed to give a specified output n steps later. In a follow up post I do the opposite, fixing the seed and solving for n.

Manipulating a random number generator

With some random number generators, it’s possible to select the seed carefully to manipulate the output. Sometimes this is easy to do. Sometimes it’s hard but doable. Sometimes it’s theoretically possible but practically impossible.

In my recent correspondence with Melissa O’Neill, she gave me an example that seeds a random number generator so that the 9th and 10th outputs produce the ASCII code for my name.

Here’s the code. The function next is the xoroshiro128+ (XOR/rotate/shift/rotate) random number generator. The global array s contains the state of the random number generator. Its initial values are the seeds of the generator.

#include <cstdio>
#include <cstdint>

// xoroshiro generator taken from
// http://vigna.di.unimi.it/xorshift/xoroshiro128plus.c

uint64_t s[2];

static inline uint64_t rotl(const uint64_t x, int k) {
	return (x << k) | (x >> (64 - k));
}

uint64_t next(void) {
	const uint64_t s0 = s[0];
	uint64_t s1 = s[1];
	const uint64_t result = s0 + s1;

	s1 ^= s0;
	s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14); // a, b
	s[1] = rotl(s1, 36); // c

	return result;
}

int main() {
    freopen(NULL, "wb", stdout); 

    s[0] = 0x084f31240ed2ec3f;
    s[1] = 0x0aa0d69470975eb8;

    while (1) {
        uint64_t value = next();
        fwrite((void*) &value, sizeof(value), 1, stdout);
    }
}

Compile this code then look at a hex dump of the first few outputs. Here’s what you get:

cook@mac> gcc xoroshiro.cpp
cook@mac> ./a.out | hexdump -C | head
f7 4a 6a 7f b8 07 f0 12  f8 96 e1 af 29 08 e3 c8  |.Jj.........)...|
15 0e b0 38 01 ef b2 a7  bb e9 6f 4d 77 55 d7 a0  |...8......oMwU..|
49 08 f2 fc 0c b2 ea e8  48 c2 89 1b 31 3f d7 3d  |I.......H...1?.=|
11 eb bc 5e ee 98 b6 3b  d9 d1 cc 15 ae 00 fc 2f  |...^...;......./|
3e 20 4a 6f 68 6e 20 44  2e 20 43 6f 6f 6b 20 3c  |> John D. Cook <| 
d1 80 49 27 3e 25 c2 4b  2a e3 78 71 9c 9e f7 18  |..I'>%.K*.xq....|
0b bb 1f c6 1c 71 79 29  d6 45 81 47 3b 88 4f 42  |.....qy).E.G;.OB|
7c 1c 19 c4 22 58 51 2d  d7 74 69 ac 36 6f e0 3f  ||..."XQ-.ti.6o.?|
78 7f a4 14 1c bc a8 de  17 e3 f7 d8 0c de 2c ea  |x.............,.|
a2 37 83 f9 38 e4 14 77  e3 6a c8 52 d1 0c 29 01  |.7..8..w.j.R..).|

(I cut out the line numbers on the left side to make the output fit better horizontally on the page.)

Not only did one pair of seed values put my name in the output, another pair would work too. If you change the seed values to

s[0] = 0x820e8a6c1baf5b13;
s[1] = 0x5c51f1c4e2e64253;

you’ll also see my name in the output:

66 9d 95 fe 30 7c 60 de 7c 89 0c 6f cd d6 05 1e |f...0|`.|..o....|
2b e9 9c cc cd 3d c5 ec 3f e3 88 6c a6 cd 78 84 |+....=..?..l..x.|
20 12 ac f2 2b 3c 89 73 60 09 8d c3 85 68 9e eb | ...+<.s`....h..|
15 3e c1 0b 07 68 63 e5 73 a7 a8 f2 4b 8b dd d0 |.>...hc.s...K...|
3e 20 4a 6f 68 6e 20 44 2e 20 43 6f 6f 6b 20 3c |> John D. Cook <|
3f 71 cf d7 5f a6 ab cf 9c 81 93 d1 3d 4c 9e 41 |?q.._.......=L.A|
0d b5 48 9c aa fc 84 d8 c6 64 1d c4 91 79 b4 f8 |..H......d...y..|
0c ac 35 48 fd 38 01 dd cc a4 e4 43 c6 5b e8 c7 |..5H.8.....C.[..|
e1 2e 76 30 0f 9d 41 ff b0 99 84 8d c1 72 5a 91 |..v0..A......rZ.|
06 ea f2 8e d7 87 e5 2c 53 a9 0c a0 f4 cd d1 9b |.......,S.......|

Note that there are limits to how much you can manipulate the output of an RNG. In this case the seed was selected to produce two particular output values, but the rest of the sequence behaves as if it were random. (See Random is as random does.)

Testing RNGs with PractRand

PractRand is a random number generator test suite, somewhat like the DIEHARDER and NIST tests I’ve written about before, but more demanding.

Rather than running to completion, it runs until a test fails with an infinitesimally small p-value. It runs all tests at a given sample size, then doubles the sample size and runs the tests again.

32-bit generators

LCG

A while back I wrote about looking for an RNG that would fail the NIST test suite and being surprised that a simple LCG (linear congruential generator) did fairly well. PractRand, however, dismisses this generator with extreme prejudice:

RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x4a992b2c
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x4a992b2c
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
Test Name Raw Processed Evaluation
BCFN(2+0,13-3,T) R=+115128 p = 0 FAIL !!!!!!!!
BCFN(2+1,13-3,T) R=+105892 p = 0 FAIL !!!!!!!!
...
[Low1/32]FPF-14+6/16:(8,14-9) R= +25.8 p = 1.5e-16 FAIL
[Low1/32]FPF-14+6/16:(9,14-10) R= +15.5 p = 8.2e-9 very suspicious
[Low1/32]FPF-14+6/16:(10,14-11) R= +12.9 p = 1.8e-6 unusual
[Low1/32]FPF-14+6/16:all R=+380.4 p = 8.2e-337 FAIL !!!!!!!
[Low1/32]FPF-14+6/16:all2 R=+33954 p = 0 FAIL !!!!!!!!
[Low1/32]FPF-14+6/16:cross R= +33.6 p = 6.4e-25 FAIL !!
...and 9 test result(s) without anomalies

I don’t recall the last time I saw a p-value of exactly zero. Presumably the p-value was so small that it underflowed to zero.
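
For reference, here’s roughly how you can feed a generator like this to PractRand: write its outputs to stdout as raw bytes and pipe them into the RNG_test program’s stdin interface (the RNG_stdin32 line in the report above). This is my own sketch, using the LCG parameters from the earlier posts and an arbitrary seed; check your PractRand build for the exact name of the stdin pseudo-RNG.

import sys
import struct

a = 742938285
m = 2**31 - 1
z = 123456789  # arbitrary seed

# write each 32-bit output as raw little-endian bytes,
# e.g. python lcg_stream.py | RNG_test stdin32
while True:
    z = a * z % m
    sys.stdout.buffer.write(struct.pack("<I", z))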

MWC

Another 32 bit generator, George Marsaglia’s MWC generator, also fails, but not as spectacularly as LCG.

RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x187edf12
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x187edf12
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
Test Name Raw Processed Evaluation
DC6-9x1Bytes-1 R= +6.3 p = 2.2e-3 unusual
Gap-16:A R= +33.3 p = 1.6e-28 FAIL !!!
Gap-16:B R= +13.6 p = 7.9e-12 FAIL
...and 105 test result(s) without anomalies

64-bit generators

Next let’s see how some well-regarded 64-bit random number generators do. We’ll look at xorshift128+ and xoroshiro128+ by Sebastiano Vigna and David Blackman, the famous Mersenne Twister, and PCG by Melissa O’Neill.

The numbers generated by xoroshiro128+ and xorshift128+ are not random in the least significant bit and so the PractRand tests end quickly. The authors claim that xoroshiro128+ “passes the PractRand test suite up to (and included) 16TB, with the exception of binary rank tests.” I’ve only run PractRand with its default settings; I have not tried getting it to keep running the rest of the tests.

A lack of randomness in the least significant bits is inconsequential if you’re converting the outputs to floating point numbers, say as part of the process of generating Gaussian random values. For other uses, such as reducing the outputs modulo a small number, a lack of randomness in the least significant bits would be disastrous. (Here numerical analysis and number theory have opposite ideas regarding which parts of a number are most “significant.”)

At the time of writing, both Mersenne Twister and PCG have gone through 256 GB of generated values and are still running. I intend to add more results tomorrow.

Update: Mersenne Twister failed a couple of tests with 512 GB of input. I stopped the tests after PCG passed 16 TB.

xoroshiro128+

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xe15bf63c
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xe15bf63c
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
Test Name Raw Processed Evaluation
[Low1/64]BRank(12):256(2) R= +3748 p~= 3e-1129 FAIL !!!!!!!!
[Low1/64]BRank(12):384(1) R= +5405 p~= 3e-1628 FAIL !!!!!!!!
...and 146 test result(s) without anomalies

xorshift128+

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x8c7c5173
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x8c7c5173
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
Test Name Raw Processed Evaluation
[Low1/64]BRank(12):256(2) R= +3748 p~= 3e-1129 FAIL !!!!!!!!
[Low1/64]BRank(12):384(1) R= +5405 p~= 3e-1628 FAIL !!!!!!!!
...and 146 test result(s) without anomalies

Mersenne Twister

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x300dab94
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x300dab94
length= 256 megabytes (2^28 bytes), time= 3.7 seconds
no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 megabytes (2^29 bytes), time= 7.9 seconds
Test Name Raw Processed Evaluation
BCFN(2+2,13-2,T) R= -7.0 p =1-1.2e-3 unusual
[Low1/64]BCFN(2+2,13-6,T) R= -5.7 p =1-1.0e-3 unusual
...and 167 test result(s) without anomalies

...

rng=RNG_stdin64, seed=0x300dab94
length= 256 gigabytes (2^38 bytes), time= 3857 seconds
  no anomalies in 265 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 gigabytes (2^39 bytes), time= 8142 seconds
 Test Name Raw Processed Evaluation
 BRank(12):24K(1) R=+99759 p~= 0 FAIL !!!!!!!!
 [Low16/64]BRank(12):16K(1) R= +1165 p~= 1.3e-351 FAIL !!!!!!!
 ...and 274 test result(s) without anomalies

PCG

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x82f88431
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x82f88431
length= 128 megabytes (2^27 bytes), time= 2.0 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]DC6-9x1Bytes-1           R=  +6.6  p =  1.6e-3   unusual
  ...and 147 test result(s) without anomalies

rng=RNG_stdin64, seed=0x82f88431
length= 256 megabytes (2^28 bytes), time= 4.7 seconds
  no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x82f88431
length= 512 megabytes (2^29 bytes), time= 9.5 seconds
  no anomalies in 169 test result(s)

...

rng=RNG_stdin64, seed=0x82f88431
length= 16 terabytes (2^44 bytes), time= 254943 seconds
  no anomalies in 329 test result(s)

Random minimum spanning trees

I just ran across a post by John Baez pointing to an article by Alan Frieze on random minimum spanning trees.

Here’s the problem.

  1. Create a complete graph with n nodes, i.e. connect every node to every other node.
  2. Assign each edge a uniform random weight between 0 and 1.
  3. Find the minimum spanning tree.
  4. Add up the weights of the edges in the minimum spanning tree.

The surprise is that as n goes to infinity, the expected value of the total weight above converges to the Riemann zeta function at 3, i.e.

ζ(3) = 1/1³ + 1/2³ + 1/3³ + …

Incidentally, there are closed-form expressions for the Riemann zeta function at positive even integers. For example, ζ(2) = π² / 6. But no closed-form expressions have been found for the odd integers greater than 1.

Simulation

Here’s a little Python code to play with this.

    import networkx as nx
    from random import random

    N = 1000

    G = nx.Graph()
    for i in range(N):
        for j in range(i+1, N):
            G.add_edge(i, j, weight=random())

    T = nx.minimum_spanning_tree(G)
    edges = T.edges(data=True)

    print( sum([e[2]["weight"] for e in edges]) )

When I ran this, I got 1.2307, close to ζ(3) = 1.20205….

I ran this again, putting the code above inside a loop, averaging the results of 100 simulations, and got 1.19701. That is, the distance between my simulation result and ζ(3) went from 0.03 to 0.003.
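
Here’s a sketch of that loop, with the graph construction above wrapped in a function so it can be averaged over many replications. (Reducing N makes it run faster, at the cost of more finite-size error.)

    import networkx as nx
    from random import random

    def mst_weight(N):
        G = nx.Graph()
        for i in range(N):
            for j in range(i+1, N):
                G.add_edge(i, j, weight=random())
        T = nx.minimum_spanning_tree(G)
        return sum(d["weight"] for _, _, d in T.edges(data=True))

    reps = 100
    print( sum(mst_weight(1000) for _ in range(reps)) / reps )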

There are two reasons I wouldn’t get exactly ζ(3). First, I’m only running a finite number of simulations (100) and so I’m not computing the expected value exactly, but only approximating it. (Probably. As in PAC: probably approximately correct.) Second, I’m using a finite graph, of size 1000, not taking a limit as graph size goes to infinity.

My limited results above suggest that the first reason accounts for most of the difference between simulation and theory. Running 100 replications cut the error down by a factor of 10. This is exactly what you’d expect from the central limit theorem. This suggests that for graphs as small as 1000 nodes, the expected value is close to the asymptotic value.

You could experiment with this, increasing the graph size and increasing the number of replications. But be patient. It takes a while for each replication to run.

Generalization

The paper by Frieze considers more than the uniform distribution. You can use any non-negative distribution with finite variance whose cumulative distribution function F is differentiable at zero. The more general result replaces ζ(3) with ζ(3) / F′(0). We could, for example, replace the uniform distribution on weights with an exponential distribution. In this case the distribution function is 1 – exp(-x), and its derivative at the origin is 1, so our simulation should still produce approximately ζ(3). And indeed it does. When I took the average of 100 runs with exponential weights I got a value of 1.2065.

There’s a little subtlety around using the derivative of the distribution at 0 rather than the density at 0. The derivative of the distribution (CDF) is the density (PDF), so why not just say density? One reason would be to allow the most general probability distributions, but a more immediate reason is that we’re up against a discontinuity at the origin. We’re looking at non-negative distributions, so the density has to be zero to the left of the origin.

When we say the derivative of the distribution at 0, we really mean the derivative at zero of a smooth extension of the distribution. For example, the exponential distribution has density 0 for negative x and density exp(-x) for non-negative x. Strictly speaking, the CDF of this distribution is 1 – exp(-x) for non-negative x and 0 for negative x. The left and right derivatives are different, so the derivative doesn’t exist. By saying the distribution function is simply exp(-x), we’ve used a smooth extension from the non-negative reals to all reals.

Selecting things in Emacs

You can select blocks of text in Emacs just as you would in most other environments. You could, for example, drag your mouse over a region. You could also hold down the Shift key and use arrow keys. But Emacs also has a number of commands that let you work in larger semantic units. That is, instead of working with an undifferentiated set of characters, you can select meaningful chunks of text, the meaning depending on context.

When you’re editing English prose, the semantic units you are concerned with might be words, sentences, or paragraphs. When you’re editing programming language source code, you care about functions or various “balanced expressions” such as the content between two parentheses or two curly brackets.

The following table gives some of the selection commands built into Emacs.

Unit                   Command              Key binding
word                   mark-word            M-@
paragraph              mark-paragraph       M-h
page                   mark-page            C-x C-p
buffer                 mark-whole-buffer    C-x h
function               mark-defun           C-M-h
balanced expression    mark-sexp            C-M-@

The expand-region package offers an alternative to several of these commands. More on that later.

The command for selecting a word does just what you expect. Likewise, the commands for selecting a page or a buffer require little explanation. But the meaning of a “paragraph” depends on context (i.e. editing mode), as do the meanings of “function” and “balanced expression.”

When editing source code, a “paragraph” is typically a block of code without blank lines. However, each language implements its own editing mode and could interpret editing units differently. Function definition syntax varies across languages, so mark-defun has to be implemented differently in each language mode.

Balanced expressions could have different meanings in different contexts, but they’re fairly consistent. Content between matching delimiters—quotation marks, parentheses, square brackets, curly braces, etc.—is generally considered a balanced expression.

Here’s where expand-region comes in. It’s typically bound to C-=. It can be used as a substitute for mark-word and mark-sexp. And if you use it repeatedly, it can replace mark-defun.

Each time you call expand-region it takes in more context. For example, suppose you’re in text mode with your cursor in the middle of a word. The first call to expand-region selects to the end of the word. The second call selects the whole word, i.e. expanding backward to the beginning. The next call selects the enclosing sentence and the next call the enclosing paragraph.

The expand-region function works analogously when editing source code. Suppose you’re editing the bit of Emacs Lisp below and have your cursor on the slash between eshell and pwd.

(setq eshell-prompt-function
  (lambda nil
    (concat
     (eshell/pwd)
     " $ ")))

Here’s what sequential invocations of expand-region will select.

  1. /pwd
  2. /pwd/)
  3. (eshell/pwd)
  4. (eshell/pwd) " $ ")
  5. (concat (eshell/pwd) " $ ")
  6. (concat (eshell/pwd) " $ "))
  7. (lambda nil (concat (eshell/pwd) " $ "))
  8. (lambda nil (concat (eshell/pwd) " $ ")))
  9. (setq eshell-prompt-function (lambda nil (concat (eshell/pwd) " $ ")))

This is kinda tedious in this particular context because there are a lot of delimiters in a small region. In less dense code you’ll select larger blocks of code with each invocation of expand-region. Since each invocation requires only a single key (i.e. hold down Control and repeatedly type =) it’s easy to call expand-region over and over until you select the region you’d like.


Random walk on quaternions

The previous post was a riff on a tweet asking what you’d get if you extracted all the i‘s, j‘s, and k‘s from Finnegans Wake and multiplied them as quaternions. This post is a probabilistic variation on the previous one.

If you randomly select a piece of English prose, extract the i‘s, j‘s, and k‘s, and multiply them together as quaternions, what are you likely to get?

The probability that a letter in this sequence is an i is about 91.5%. There’s a 6.5% chance it’s a k, and a 2% chance it’s a j. (Derived from here.) We’ll assume the probabilities of each letter appearing next are independent.

You could think of the process of multiplying all the i‘s, j‘s, and k‘s together as a random walk on the unit quaternions, an example of a Markov chain. Start at 1. At each step, multiply your current state by i with probability 0.915, by j with probability 0.02, and by k with probability 0.065.

After the first step, you’re most likely at i. You could be at j or k, but nowhere else. After the second step, you’re most likely at -1, though you could be anywhere except at 1. For the first several steps you’re much more likely to be at some places than others. But after 50 steps, you’re about equally likely to be at any of the eight possible values.

Wolfram Alpha, Finnegans Wake, and Quaternions


I stumbled on a Twitter account yesterday called Wolfram|Alpha Can’t. It posts bizarre queries that Wolfram Alpha can’t answer. Here’s one that caught my eye: what do you get if you extract all the i‘s, j‘s, and k‘s from Finnegans Wake and multiply them together as quaternions?

Suppose you did extract all the i‘s, j‘s, and k‘s from James Joyce’s novel Finnegans Wake. How would you answer the question above?

You could initialize an accumulator to 1 and then march through the list, updating the accumulator by multiplying it by the next element. But is there a more efficient way?

Quaternion multiplication is not commutative, i.e. the order in which you multiply things matters. So it would not be enough to have a count of how many times each letter appears. Is there any sort of useful summary of the data short of carrying out the whole multiplication? In other words, could you scan the list while doing something other than quaternion multiplication, something faster to compute? Something analogous to sufficient statistics.

We’re carrying out multiplications in the group Q of unit quaternions, a group with eight elements: ±1, ±i, ±j, ±k. But the input to the question about Finnegans Wake only involves three of these elements. Could that be exploited for some slight efficiency?

How would you best implement quaternion multiplication? Of course the answer depends on your environment and what you mean by “best.”

Note that we don’t actually need to implement quaternion multiplication in general, though that would be sufficient. All we really need is multiplication in the group Q.

You could implement multiplication by a table lookup. You could use an 8 × 3 table; the left side of our multiplication could be anything in Q, but the right side can only be i, j, or k. You could represent quaternions as a list of four numbers—coefficients of 1, i, j, and k—and write rules for multiplying these. You could also represent quaternions as real 4 × 4 matrices or as complex 2 × 2 matrices.
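
As a baseline, here’s one straightforward Python sketch, certainly not the most clever option: represent an element of Q as a pair (sign, letter), store the products of the basis elements 1, i, j, k in a lookup table, and fold the multiplication over the i‘s, j‘s, and k‘s of the text.

# products of the basis elements: table[x][y] = (sign, letter) with x*y = sign*letter
table = {
    '1': {'1': (1, '1'), 'i': (1, 'i'),  'j': (1, 'j'),  'k': (1, 'k')},
    'i': {'1': (1, 'i'), 'i': (-1, '1'), 'j': (1, 'k'),  'k': (-1, 'j')},
    'j': {'1': (1, 'j'), 'i': (-1, 'k'), 'j': (-1, '1'), 'k': (1, 'i')},
    'k': {'1': (1, 'k'), 'i': (1, 'j'),  'j': (-1, 'i'), 'k': (-1, '1')},
}

def mult(q, r):
    # q and r are elements of Q represented as (sign, letter) pairs
    sign, letter = table[q[1]][r[1]]
    return (q[0] * r[0] * sign, letter)

def product_of_ijk(text):
    # multiply together, left to right, every i, j, and k in the text
    prod = (1, '1')
    for c in text.lower():
        if c in 'ijk':
            prod = mult(prod, (1, c))
    return prod

print(product_of_ijk("riverrun, past Eve and Adam's"))  # toy input, not the whole novel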

If you have an interesting solution, please share it in a comment below. It could be interesting by any one of several criteria: fast, short, cryptic, amusing, etc.

Update: See follow up post, Random walk on quaternions


The cross polytope

There are five regular solids in three dimensions:

  • tetrahedron
  • octahedron (pictured above)
  • hexahedron (cube)
  • dodecahedron
  • icosahedron.

I give a proof here that these are the only five.

The first three of these regular solids generalize to all dimensions, and these generalizations are the only regular solids in dimensions 5 and higher. (There are six regular solids in dimension 4.)

I’ve mentioned generalizations of the cube, the hypercube, lately. I suppose you could call the generalization of an octahedron a “hyperoctahedron” by analogy with the hypercube, though I’ve never heard anybody use that term. Instead, the most common name is cross polytope.

This post will focus on the cross polytope. In particular, we’re going to look at the relative volume of a ball inside a cross polytope.

The cross polytope in n dimensions is the convex hull of all n-dimensional vectors that are ±1 in one coordinate and 0 in all the rest. It is the “plus or minus” part that gives the cross polytope its name, i.e. the vertices come in pairs across the origin.

In analysis, the cross polytope is the unit ball in ℓ1 (“little ell one”), the set of points (x1, x2, …, xn) such that

|x1| + |x2| + … + |xn| ≤ 1.

The ℓ1 norm, and hence the ℓ1 ball, comes up frequently in compressed sensing and in sparse regression.

In recent blog posts we’ve looked at how the relative volume of a ball inscribed in a hypercube drops quickly as dimension increases. What about the cross polytope? The relative volume of a ball inscribed in a cross polytope decreases rapidly with dimension as well. But does it decrease faster or slower than the relative volume of a ball inscribed in a hypercube? To answer this, we need to compute

\left.\frac{\mbox{vol ball in cross poly}}{\mbox{vol cross poly}}\middle/\frac{\mbox{vol ball in hypercube}}{\mbox{vol hypercube}}\right.

Let’s gather what we need to evaluate this. We need the volume of a ball of radius r in n dimensions, and as mentioned before this is

V = \frac{\pi^{\frac{n}{2}} r^n}{\Gamma\left(\frac{n}{2} + 1\right)}

A ball sitting inside an n-dimensional unit cross polytope will have radius 1/√n. This is because if n positive numbers sum to 1, the sum of their squares is minimized by making them all equal, and the point (1/n, 1/n, …, 1/n) has norm 1/√n. A ball inside a unit hypercube will have radius 1/2.

The cross polytope has volume 2ⁿ / n! and the hypercube has volume 1.

Putting this all together, the relative volume of a ball in a cross polytope divided by the relative volume of a ball inside a hypercube is

\left. \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{\sqrt{n}}\right)^n } { \frac{2^n}{n!} } \middle/ \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{2}\right)^n } { 1 } \right.

which fortunately reduces to just

\frac{n!}{n^{n/2}}

But how do we compare n! and n^{n/2}? That’s a job for Stirling’s approximation. It tells us that for large n, the ratio is approximately

\sqrt{2\pi n}\, n^{n/2}e^{-n}

and so the ratio diverges for large n, i.e. the ball in the cross polytope takes up increasingly more relative volume than the ball in the hypercube does.

Looking back at just the relative volume of the ball inside the cross polytope, and applying Stirling’s approximation again, we see that the relative volume of the ball inside the cross polytope is approximately

\sqrt{2}\left( \frac{\pi}{2e} \right )^{n/2}

and so the relative volume decreases geometrically as n increases, decreasing much more slowly than the relative volume of a ball in a hypercube does.
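
As a sanity check on that last approximation, here’s a short Python sketch that computes the exact relative volume of the ball inscribed in the unit cross polytope from the formulas above and compares it with √2 (π/2e)^{n/2}.

from math import pi, e, sqrt, gamma, factorial

def exact(n):
    # (volume of ball of radius 1/sqrt(n)) / (volume of cross polytope)
    ball = pi**(n/2) * (1/sqrt(n))**n / gamma(n/2 + 1)
    cross = 2**n / factorial(n)
    return ball / cross

def approx(n):
    return sqrt(2) * (pi / (2*e))**(n/2)

for n in [5, 10, 20, 50]:
    print(n, exact(n), approx(n))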

Bayesian methods at Bletchley Park

From Nick Patterson’s interview on Talking Machines:

GCHQ in the ’70s, we thought of ourselves as completely Bayesian statisticians. All our data analysis was completely Bayesian, and that was a direct inheritance from Alan Turing. I’m not sure this has ever really been published, but Turing, almost as a sideline during his cryptoanalytic work, reinvented Bayesian statistics for himself. The work against Enigma and other German ciphers was fully Bayesian. …

Bayesian statistics was an extreme minority discipline in the ’70s. In academia, I only really know of two people who were working majorly in the field, Jimmy Savage … in the States and Dennis Lindley in Britain. And they were regarded as fringe figures in the statistics community. It’s extremely different now. The reason is that Bayesian statistics works. So eventually truth will out. There are many, many problems where Bayesian methods are obviously the right thing to do. But in the ’70s we understood that already in Britain in the classified environment.

Alan Turing

Sphere packing

The previous couple blog posts touched on a special case of sphere packing.

We looked at the proportion of volume contained near the corners of a hypercube. If you take the set of points within a distance 1/2 of a corner of a hypercube, you could rearrange these points to form a full ball centered at one corner of the hypercube. Saying that not much volume is located near the corners is equivalent to saying that the sphere packing that centers spheres at points with integer coordinates is not very dense.

We also looked at centering balls inside hypercubes. This is the same sphere packing as above, just shifting every coordinate by 1/2. So saying that a ball in a box doesn’t take up much volume in high dimensions is another way of saying that the integer lattice sphere packing is not very dense.

How much better can we pack spheres? In 24 dimensions, balls centered inside hypercubes would have density equal to the volume of a ball of radius 1/2, or (π/4)¹² / 12!. The most dense packing in 24 dimensions, the Leech lattice sphere packing, has a density of π¹² / 12!, i.e. it is 2²⁴ ≈ 16.8 million times more efficient.

The densest sphere packings are known, with proof, only in dimensions 1, 2, 3, 8, and 24. (The densest regular (lattice) packings are known for dimensions up to 8, but it is conceivable that there exist irregular packings that are more efficient than the most efficient lattice packing.) Dimension 24 is special in numerous ways, and it appears that 24 is a local maximum as far as optimal sphere packing density goes. How does sphere packing based on an integer lattice compare to the best packing in other high dimensions?

Although optimal packings are not known in high dimensions, upper and lower bounds on packing density are known. If Δ is the optimal sphere packing density in dimension n, then we have the following upper and lower bounds for large n:

-1 \leq \frac{1}{n} \log_2 \Delta \leq -0.599

The following plot shows how the integer lattice packing density (solid line) compares to the upper and lower bounds (dashed lines).

The upper and lower bounds come from Sphere Packings, Lattices, and Groups, published in 1998. Perhaps tighter bounds have been found since then.
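
Here’s a sketch of the computation behind the solid line: the density of the integer lattice packing in dimension n is the volume of a ball of radius 1/2, and we want (1/n) log₂ of that density to compare against the bounds −1 and −0.599 for the optimal packing.

from math import pi, log, lgamma

def log2_density_integer_lattice(n):
    # (1/n) * log2 of the Z^n packing density, i.e. of the volume of
    # an n-dimensional ball of radius 1/2 (via log-gamma to avoid overflow)
    log_vol = (n/2)*log(pi) + n*log(0.5) - lgamma(n/2 + 1)
    return log_vol / (n * log(2))

for n in range(8, 129, 8):
    print(n, log2_density_integer_lattice(n))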