Summing random powers up to a threshold

Pick random numbers uniformly between 0 and 1, adding them as you go, and stop when you get a result bigger than 1. How many numbers would you expect to add together on average?

You need at least two samples, and often two are enough, but you might get any number, and those larger numbers will pull the expected value up.

Here’s a simulation program in Python.

    from random import random
    from collections import Counter

    N = 1000000
    c = Counter()
    for _ in range(N):
        x = 0
        steps = 0
        while x < 1:
            x += random()
            steps += 1
        c[steps] += 1

    print( sum( k*c[k] for k in c.keys() )/N )

When I ran it I got 2.718921. There’s theoretical result first proved by W. Weissblum that the expected value is e = 2.71828… Our error was on the order of 1/√N, which is what we’d expect from the central limit theorem.

Now we can explore further in a couple directions. We could take a look at the distribution of the number steps, not just its expected value. Printing c shows us the raw data.

    Counter({
        2: 499786, 
        3: 333175, 
        4: 125300, 
        5:  33466, 
        6:   6856, 
        7:   1213, 
        8:    172, 
        9:     29, 
        10:     3
    })

And here’s a plot.

We could also generalize the problem by taking powers of the random numbers. Here’s what we get when we use exponents 1 through 20.

expected steps as exponents increase

There’s a theoretical result that the expected number of steps is asymptotically equal to cn where

c = \exp(\exp(-\gamma)) - \int_{-\infty}^\infty \frac{\exp(-\exp(u)) }{ (u+ \gamma)^2 + \pi^2}\, du

I computed c = 1.2494. The plot above shows that the dependence on the exponent n does look linear. The simulation results appear to be higher than the asymptotic prediction by a constant, but that’s consistent with the asymptotic prediction since relative to n, a constant goes away as n increases.

Reference for theoretical results: D. J. Newman and M. S. Klamkin. Expectations for Sums of Powers. The American Mathematical Monthly, Vol. 66, No. 1 (Jan., 1959), pp. 50-51

Average fraction round up

Pick a large number n. Divide n by each of the positive integers up to n and round the results up to the nearest integer. On average, how far do you round up?

Or in terms of probability, what is the expected distance between a fraction n/r, where n is large and fixed and r is chosen randomly between 1 and n, and the nearest larger integer?

In symbols, the question above is asking for the approximate value of

\frac{1}{n} \sum_{r = 1}^n \left( \left\lceil \frac{n}{r} \right\rceil - \frac{n}{r}\right )

for large n, i.e. in the limit as n goes to ∞. Here ⌈x⌉ denotes the ceiling of x, the smallest integer greater than or equal to x.

Let’s plot this as a function of n and see what it looks like. Here’s the Python code.

    import matplotlib.pyplot as plt
    from numpy import ceil, arange

    def f(n):
        return sum( ceil(n/r) - n/r for r in range(1, n) )/n

    x = arange(1, 100)
    y = [f(n) for n in x]
    plt.plot(x, y)
    plt.show()

And here’s the result.

It appears the graph may be converging to some value, and in fact it is. Charles de la Vallée Poussin proved in 1898 that the limiting value is the Euler–Mascheroni constant γ = 0.5772…. This constant is the limiting difference between the nth harmonic number and log n, i.e.

\gamma = \lim_{n\to\infty} \left(\sum_{r = 1}^n \frac{1}{r} - \log n \right)

We can add a horizontal line to our plot to see how well the graph seems to match γ. To do this we need to import the constant euler_gamma from numpy and add the

    plt.axhline(y=euler_gamma, linestyle=":")

after the plot command. When we do, this is what we see.

It looks like the plot is converging to a value slightly less than γ. Apparently the convergence is very slow. When we go out to 10,000 the plot is closer to being centered around γ but still maybe below γ more than above.

If we evaluate our function at n = 1,000,000, we get 0.577258… while γ = 0.577215….

At n = 10,000,000 we get 0.577218…. So taking 100 times as many terms in our sum gives us one extra correct decimal place, as we’d expect of a random processes since convergence usually goes like 1/√n.

Equation for the Eiffel Tower

Robert Banks’s book Towing Icebergs, Falling Dominoes, and Other Adventures in Applied Mathematics describes the Eiffel Tower’s shape as approximately the logarithmic curve

y = -y_* \log\left(\frac{x}{x_0} \right )

where y* and x0 are chosen to match the tower’s dimensions.

Here’s a plot of the curve:

Eiffel tower curve plot

And here’s the code that produced the plot:

from numpy import log, exp, linspace, vectorize
import matplotlib.pyplot as plt

# Taken from "Towing Icebergs, Falling Dominoes,
# and Other Adventures in Applied Mathematics"
# by Robert B. Banks

# Constants given in Banks in feet. Convert to meters.
feet_to_meter = 0.0254*12
ystar  = 201*feet_to_meter
x0     = 207*feet_to_meter
height = 984*feet_to_meter

# Solve for where to cut off curve to match height of the tower.
# - ystar log xmin/x0 = height
xmin = x0 * exp(-height/ystar)

def f(x):
    if -xmin < x < xmin:
        return height
    else:
        return -ystar*log(abs(x/x0))
curve = vectorize(f)
    
x = linspace(-x0, x0, 400)

plt.plot(x, curve(x))
plt.xlim(-2*x0, 2*x0)
plt.xlabel("Meters")
plt.ylabel("Meters")
plt.title("Eiffel Tower")

plt.axes().set_aspect(1)
plt.savefig("eiffel_tower.svg")

Related post: When length equals area
The St. Louis arch is approximately a catenary, i.e. a hyperbolic cosine.

Why is Kullback-Leibler divergence not a distance?

The Kullback-Leibler divergence between two probability distributions is a measure of how different the two distributions are. It is sometimes called a distance, but it’s not a distance in the usual sense because it’s not symmetric. At first this asymmetry may seem like a bug, but it’s a feature. We’ll explain why it’s useful to measure the difference between two probability distributions in an asymmetric way.

The Kullback-Leibler divergence between two random variables X and Y is defined as

KL(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

This is pronounced/interpreted several ways:

  • The divergence from Y to X
  • The relative entropy of X with respect to Y
  • How well Y approximates X
  • The information gain going from the prior Y to the posterior X
  • The average surprise in seeing Y when you expected X

A theorem of Gibbs proves that K-L divergence is non-negative. It’s clearly zero if X and Y have the same distribution.

The K-L divergence of two random variables is an expected value, and so it matters which distribution you’re taking the expectation with respect to. That’s why it’s asymmetric.

As an example, consider the probability densities below, one exponential and one gamma with a shape parameter of 2.

exponential and gamma PDFs

The two densities differ mostly on the left end. The exponential distribution believes this region is likely while the gamma does not. This means that an expectation with respect to the exponential distribution will weigh things in this region more heavily. In an information-theoretic sense, an exponential is a better approximation to a gamma than the other way around.

Here’s some Python code to compute the divergences.

    from scipy.integrate import quad
    from scipy.stats import expon, gamma
    from scipy import inf

    def KL(X, Y):
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        return quad(f, 0, inf)

    e = expon
    g = gamma(a = 2)

    print( KL(e, g) )
    print( KL(g, e) )

This returns

    (0.5772156649008394, 1.3799968612282498e-08)
    (0.4227843350984687, 2.7366807708872898e-09)

The first element of each pair is the integral and the second is the error estimate. So apparently both integrals have been computed accurately, and the first is clearly larger. This backs up our expectation that it’s more surprising to see a gamma when expecting an exponential than vice versa.

Although K-L divergence is asymmetric in general, it can be symmetric. For example, suppose X and Y are normal random variables with the same variance but different means. Then it would be equally surprising to see either one when expecting the other. You can verify this in the code above by changing the KL function to integrate over the whole real line

    def KL(X, Y):
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        return quad(f, -inf, inf)

and trying an example.

n1 = norm(1, 1)
n2 = norm(2, 1)

print( KL(n1, n2) )
print( KL(n2, n1) )

This returns

(0.4999999999999981, 1.2012834963423225e-08)
(0.5000000000000001, 8.106890774205374e-09)

and so both integrals are equal to within the error in the numerical integration.

Negative correlation introduced by success

Suppose you measure people on two independent attributes, X and Y, and take those for whom X+Y is above some threshold. Then even though X and Y are uncorrelated in the full population, they will be negatively correlated in your sample.

This article gives the following example. Suppose beauty and acting ability were uncorrelated. Knowing how attractive someone is would give you no advantage in guessing their acting ability, and vice versa. Suppose further that successful actors have a combination of beauty and acting ability. Then among successful actors, the beautiful would tend to be poor actors, and the unattractive would tend to be good actors.

Here’s a little Python code to illustrate this. We take two independent attributes, distributed like IQs, i.e. normal with mean 100 and standard deviation 15. As the sum of the two attributes increases, the correlation between the two attributes becomes more negative.

from numpy import arange
from scipy.stats import norm, pearsonr
import matplotlib.pyplot as plt

# Correlation.
# The function pearsonr returns correlation and a p-value.
def corr(x, y):
    return pearsonr(x, y)[0]

x = norm.rvs(100, 15, 10000)
y = norm.rvs(100, 15, 10000)
z = x + y

span = arange(80, 260, 10)
c = [ corr( x[z > low], y[z > low] ) for low in span ]

plt.plot( span, c )
plt.xlabel( "minimum sum" )
plt.ylabel( "correlation coefficient" )
plt.show()

Random minimum spanning trees

I just ran across a post by John Baez pointing to an article [the link has gone away] by Alan Frieze on random minimum spanning trees.

Here’s the problem.

  1. Create a complete graph with n nodes, i.e. connect every node to every other node.
  2. Assign each edge a uniform random weight between 0 and 1.
  3. Find the minimum spanning tree.
  4. Add up the edges of the weights in the minimum spanning tree.

The surprise is that as n goes to infinity, the expected value of the process above converges to the Riemann zeta function at 3, i.e.

ζ(3) = 1/1³ + 1/2³ + 1/3³ + …

Incidentally, there are closed-form expressions for the Riemann zeta function at positive even integers. For example, ζ(2) = π² / 6. But no closed-form expressions have been found for odd integers.

Simulation

Here’s a little Python code to play with this.

    import networkx as nx
    from random import random

    N = 1000

    G = nx.Graph()
    for i in range(N):
        for j in range(i+1, N):
            G.add_edge(i, j, weight=random())

    T = nx.minimum_spanning_tree(G)
    edges = T.edges(data=True)

    print( sum( e[2]["weight"] for e in edges ) )

When I ran this, I got 1.2307, close to ζ(3) = 1.20205….

I ran this again, putting the code above inside a loop, averaging the results of 100 simulations, and got 1.19701. That is, the distance between my simulation result and ζ(3) went from 0.03 to 0.003.

There are two reasons I wouldn’t get exactly ζ(3). First, I’m only running a finite number of simulations (100) and so I’m not computing the expected value exactly, but only approximating it. (Probably. As in PAC: probably approximately correct.) Second, I’m using a finite graph, of size 1000, not taking a limit as graph size goes to infinity.

My limited results above suggest that the first reason accounts for most of the difference between simulation and theory. Running 100 replications cut the error down by a factor of 10. This is exactly what you’d expect from the central limit theorem. This suggests that for graphs as small as 1000 nodes, the expected value is close to the asymptotic value.

You could experiment with this, increasing the graph size and increasing the number of replications. But be patient. It takes a while for each replication to run.

Generalization

The paper by Frieze considers more than the uniform distribution. You can use any non-negative distribution with finite variance and whose cumulative distribution function F is differentiable at zero. The more general result replaces ζ(3) with ζ(3) / F ‘(0). We could, for example, replace the uniform distribution on weights with an exponential distribution. In this case the distribution function is 1 − exp(−x), at its derivative at the origin is 1, so our simulation should still produce approximately ζ(3). And indeed it does. When I took the average of 100 runs with exponential weights I got a value of 1.2065.

There’s a little subtlety around using the derivative of the distribution at 0 rather than the density at 0. The derivative of the distribution (CDF) is the density (PDF), so why not just say density? One reason would be to allow the most general probability distributions, but a more immediate reason is that we’re up against a discontinuity at the origin. We’re looking at non-negative distributions, so the density has to be zero to the left of the origin.

When we say the derivative of the distribution at 0, we really mean the derivative at zero of a smooth extension of the distribution. For example, the exponential distribution has density 0 for negative x and density exp(−x) for non-negative x. Strictly speaking, the CDF of this distribution is 1 − exp(−x) for non-negative x and 0 for negative x. The left and right derivatives are different, so the derivative doesn’t exist. By saying the distribution function is simply exp(−x), we’ve used a smooth extension from the non-negative reals to all reals.

Nearly all the area in a high-dimensional sphere is near the equator

Nearly all the area of a high-dimensional sphere is near the equator.  And by symmetry, it doesn’t matter which equator you take. Draw any great circle and nearly all of the area will be near that circle.  This is the canonical example of “concentration of measure.”

What exactly do we mean by “nearly all the area” and “near the equator”? You get to decide. Pick your standard of “nearly all the area,” say 99%, and your definition of “near the equator,” say within 5 degrees. Then it’s always possible to take the dimension high enough that your standards are met. The more demanding your standard, the higher the dimension will need to be, but it’s always possible to pick the dimension high enough.

This result is hard to imagine. Maybe a simulation will help make it more believable.

In the simulation below, we take as our “north pole” the point (1, 0, 0, 0, …, 0). We could pick any unit vector, but this choice is convenient. Our equator is the set of points orthogonal to the pole, i.e. that have first coordinate equal to zero. We draw points randomly from the sphere, compute their latitude (i.e. angle from the equator), and make a histogram of the results.

The area of our planet isn’t particularly concentrated near the equator.

But as we increase the dimension, we see more and more of the simulation points are near the equator.

Here’s the code that produced the graphs.

from scipy.stats import norm
from math import sqrt, pi, acos, degrees
import matplotlib.pyplot as plt

def pt_on_sphere(n):
    # Return random point on unit sphere in R^n.
    # Generate n standard normals and normalize length.
    x = norm.rvs(0, 1, n)
    length = sqrt(sum(x**2))
    return x/length

def latitude(x):
    # Latitude relative to plane with first coordinate zero.
    angle_to_pole = acos(x[0]) # in radians
    latitude_from_equator = 0.5*pi - angle_to_pole
    return degrees( latitude_from_equator )

N = 1000 # number of samples

for n in [3, 30, 300, 3000]: # dimension of R^n
    
    latitudes = [latitude(pt_on_sphere(n)) for _ in range(N)]
    plt.hist(latitudes, bins=int(sqrt(N)))
    plt.xlabel("Latitude in degrees from equator")
    plt.title("Sphere in dimension {}".format(n))
    plt.xlim((-90, 90))
    plt.show()

Not only is most of the area near the equator, the amount of area outside a band around the equator decreases very rapidly as you move away from the band. You can see that from the histograms above. They look like a normal (Gaussian) distribution, and in fact we can make that more precise.

If A is a band around the equator containing at least half the area, then the proportion of the area a distance r or greater from A is bound by exp( −(n − 1)r² ). And in fact, this holds for any set A containing at least half the area; it doesn’t have to be a band around the equator, just any set of large measure.

Related post: Willie Sutton and the multivariate normal distribution

Simple random number generator does surprisingly well

I was running the NIST statistical test suite recently. I wanted an example of a random number generator where the tests failed, and so I used a simple generator, a linear congruence generator. But to my surprise, the generator passed nearly all the tests, even though some more sophisticated generators failed some of the same tests. This says more about the NIST test suite than it says about the LCG being tested.

This post will implement a couple of the simplest tests in Python and show that the generator does well.

The linear congruential generator used here starts with an arbitrary seed, then at each step produces a new number by multiplying the previous number by a constant and taking the remainder by 231 – 1. The multiplier constant was chosen to be one of the multipliers recommended in [1].

We’ll need a couple math functions:

from math import sqrt, log

and we need to define the constants for our generator.

# Linear congruence generator (LCG) constants
z = 20170705   # seed
a = 742938285  # multiplier
e = 31         # will need this later
m = 2**e -1    # modulus

Next we form a long string of 0’s and 1’s using our generator

# Number of random numbers to generate
N = 100000     

# Format to print bits, padding with 0's on the left if needed
formatstr = "0" + str(e) + "b"

bit_string = ""
for _ in range(N):
    z = a*z % m # LCG
    bit_string += format(z, formatstr)

Next we run a couple tests. First, we count the number of 1’s in our string of bits. We expect about half the bits to be 1’s. We can quantify “about” as within two standard deviations.

def count_ones(string):
    ones = 0
    for i in range(len(string)):
        if string[i] == '1':
            ones += 1
    return ones

ones = count_ones(bit_string)
expected = e*N/2
sd = sqrt(0.25*N)
print( "Number of 1's: {}".format(ones) )
print( "Expected: {} to {}".format(expected - 2*sd, expected + 2*sd) )

The results are nothing unusual:

Number of 1's: 1550199
Expected: 1549683.8 to 1550316.2

Next we look at the length of the longest runs on 1’s. I’ve written before about the probability of long runs and the code below uses a couple results from that post.

def runs(string):
    max_run = 0
    current_run = 0
    for i in range(len(string)):
        if string[i] == '1':
            current_run += 1
        else:
            current_run = 0
        max_run = max(max_run, current_run)
    return max_run

runlength = runs(bit_string)
expected = -log(0.5*e*N)/log(0.5)
sd = 1/log(2)
print( "Run length: {}".format(runlength) )
print( "Expected: {} to {}".format(expected - 2*sd, expected + 2*sd) )

Again the results are nothing unusual:

Run length: 19
Expected: 17.7 to 23.4

Simple random number generators are adequate for many uses. Some applications, such as high dimensional integration and cryptography, require more sophisticated generators, but sometimes its convenient and sufficient to use something simple. For example, code using the LCG generator above would be easier to debug than code using the Mersenne Twister. The entire state of the LCG is a single number, whereas the Mersenne Twister maintains an internal state of 312 numbers.

One obvious limitation of the LCG used here is that it couldn’t possibly produce more than  231 – 1 values before repeating itself. Since the state only depends on the last value, every time it comes to a given output, the next output will be whatever the next output was the previous time. In fact, [1] shows that it does produce 231 – 1 values before cycling. If the multiplier were not chosen carefully it could have a shorter period.

So our LCG has a period of about two billion values. That’s a lot if you’re writing a little game, for example. But it’s not enough for many scientific applications.

* * *

[1] George S. Fishman and Louis R. Moore III, An exhaustive analysis of multiplicative congruential random number generators with modulus 231 – 1, SIAM Journal of Scientific and Statistical Computing, Vol. 7, no. 1, January 1986.

Polynomials evaluated at integers

Let p(x) = a0 + a1x + a2x2 + … + anxn and suppose at least one of the coefficients ai is irrational for some i ≥ 1. Then a theorem by Weyl says that the fractional parts of p(n) are equidistributed as n varies over the integers. That is, the proportion of values that land in some interval is equal to the length of that interval.

Clearly it’s necessary that one of the coefficients be irrational. What may be surprising is that it is sufficient.

If the coefficients are all rational with common denominator N, then the sequence would only contain multiples of 1/N. The interval [1/3N, 2/3N], for example, would never get a sample. If a0 were irrational but the rest of the coefficients were rational, we’d have the same situation, simply shifted by a0.

This is a theorem about what happens in the limit, but we can look at what happens for some large but finite set of terms. And we can use a χ2 test to see how evenly our sequence is compared to what one would expect from a random sequence.

In the Python code below, we use the polynomial p(x) = √2 x² + πx + 1 and evaluate p at 0, 1, 2, …, 99,999. We then count how many fall in the bins [0, 0.01), [0.01, 0.02), … [0.99, 1] and compute a chi-square statistic on the counts.

from math import pi, sqrt, floor

def p(x):
    return 1 + pi*x + sqrt(2)*x*x

def chisq_stat(O, E):
    return sum( (o - e)**2/e for (o, e) in zip(O, E) )

def frac(x):
    return x - floor(x)

N = 100000
data = [frac(p(n)) for n in range(N)]

count = [0]*100
for d in data:
    count[ int(floor(100*d)) ] += 1

expected = [N/100]*100

print(chisq_stat(count, expected))

We get a chi-square statistic of 95.4. Since we have 100 bins, there are 99 degrees of freedom, and we should compare our statistic to a chi-square distribution with 99 degrees of freedom. Such a distribution has mean 99 and standard deviation √(99*2) = 14.07, so a value of 95.4 is completely unremarkable.

If we had gotten a very large chi-square statistic, say 200, we’d have reason to suspect something was wrong. Maybe a misunderstanding on our part or a bug in our software. Or maybe the sequence was not as uniformly distributed as we expected.

If we had gotten a very small chi-square statistic, say 10, then again maybe we misunderstood something, or maybe our sequence is remarkably evenly distributed, more evenly than one would expect from a random sequence.

Related math posts

Fractional parts, invariant measures, and simulation

A function f: XX is measure-preserving if for each iteration of f sends the same amount of stuff into a given set. To be more precise, given a measure μ and any μ-measurable set E with μ(E) > 0, we have

\mu( E ) = \mu( f^{-1}(E) )

You can read the right side of the equation above as “the measure of the set of points that f maps into E.” You can apply this condition repeatedly to see that the measure of the set of points mapped into E after n iterations is still just the measure of E.

If X is a probability space, i.e. μ( ) = 1, then you could interpret the definition of measure-preserving to mean that the probability that a point ends up in E after n iterations is independent of n. We’ll illustrate this below with a simulation.

Let X be the half-open unit interval (0, 1] and let f be the Gauss map, i.e.

f(x) = \frac{1}{x} - \left\lfloor \frac{1}{x}\right\rfloor

where [z] is the integer part of z. The function f is measure-preserving, though not for the usual Lebesgue measure. Instead it preserves the following measure:

\mu(E) = \frac{1}{\log 2} \int_E \frac{1}{1+x} \, dx

Let’s take as our set E an interval [a, b] and test via simulation whether the probability of landing in E after n iterations really is just the measure of E, independent of n.

We can’t just generate points uniformly in the interval (0, 1]. We have to generate the points so that the probability of a point coming from a set E is μ(E). That means the PDF of the distribution must be p(x) = 1 / (log(2) (1 + x)). We use the inverse-CDF method to generate points with this density in the Python code below.

from math import log, floor
from random import random

def gauss_map(x):
    y = 1.0/x
    return y - floor(y)

# iterate gauss map n times
def iterated_gauss_map(x, n):
    while n > 0:
        x = gauss_map(x)
        n = n - 1
    return x      

# measure mu( [a,b] )
def mu(a, b):
    return (log(1.0+b) - log(1.0+a))/log(2.0)

# Generate samples with Prob(x in E) = mu( E )
def sample():
    u = random()
    return 2.0**u - 1.0

def simulate(num_points, num_iterations, a, b):
    count = 0
    for _ in range(num_points):
        x = sample()
        y = iterated_gauss_map(x, num_iterations)
        if a < y < b:
            count += 1
    return count / num_points

# Change these parameters however you like
a, b = 0.1, 0.25
N = 1000000

exact = mu(a, b)
print("Exact probability:", exact)
print("Simulated probability after n iterations")
for n in range(1, 10):
    simulated = simulate(N, n, a, b)
    print("n =", n, ":", simulated)

Here’s the output I got:

Exact probability: 0.18442457113742736
Simulated probability after n iterations
n = 1 : 0.184329
n = 2 : 0.183969
n = 3 : 0.184233
n = 4 : 0.184322
n = 5 : 0.184439
n = 6 : 0.184059
n = 7 : 0.184602
n = 8 : 0.183877
n = 9 : 0.184834

With 1,000,000 samples, we expect the results to be the same to about 3 decimal places, which is what we see above.

Related post: Irrational rotations are ergodic.

(A transformation f  is ergodic if it is measure preserving and the only sets E with  f –1(E)  = E are those with measure 0 or full measure. Rational rotations are measure-preserving but not ergodic. The Gauss map above is ergodic.)