Landau kernel

The previous post was about the trick Lebesgue used to construct a sequence of polynomials converging to |x| on the interval [-1, 1]. This was the main step in his proof of the Weierstrass approximation theorem.

Before that, I wrote a post on Bernstein’s proof that used his eponymous polynomials to prove Weierstrass’ theorem. This is my favorite proof because it’s an example of using results from probability to prove a statement that has nothing to do with randomness.

This morning I’ll present one more way to prove the approximation theorem, this one due to Landau.

The Landau kernel is defined as

K_n(u) = (1 - u^2)^n

Denote its integral by

k_n = \int_{-1}^1 K_n(u)\, du

Let f(x) be any continuous function on [-1, 1]. Then the convolution of the normalized Landau kernel with f gives a sequence of polynomial approximations that converge uniformly to f. By “normalized” I mean dividing the kernel by its integral so that it integrates to 1.

For each n,

\frac{1}{k_n}\int_{-1}^1 K_n(t-x)\, f(t)\, dt

is a polynomial in x of degree 2n, and as n goes to infinity this converges uniformly to f(x).
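
Here’s a quick numerical sketch of the convergence (my own illustration, not part of Landau’s proof). To keep things simple I use a test function that vanishes outside [-1/2, 1/2] and evaluate the approximation only on that subinterval, so the behavior of the kernel for |u| > 1 never matters.

    import numpy as np

    def landau_approx(f, x, n, m=4001):
        # Numerically evaluate (1/k_n) * integral over [-1,1] of (1 - (t-x)^2)^n f(t) dt
        t = np.linspace(-1, 1, m)
        dt = t[1] - t[0]
        k_n = np.sum((1 - t**2)**n) * dt               # normalizing constant
        return np.sum((1 - (t - x)**2)**n * f(t)) * dt / k_n

    f = lambda t: np.maximum(0, 0.5 - np.abs(t))       # continuous, zero outside [-1/2, 1/2]
    xs = np.linspace(-0.5, 0.5, 101)
    for n in [10, 100, 1000]:
        err = max(abs(landau_approx(f, x, n) - f(x)) for x in xs)
        print(n, err)

The maximum error should shrink as n grows, though slowly near the kink at 0.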

Connections

There are a few connections I’d like to mention. First, the normalized Landau kernel is essentially a beta distribution density, just scaled to live on [-1, 1] rather than [0, 1].

And as with Bernstein’s proof of the Weierstrass approximation theorem, you could use probability to prove Landau’s result. Namely, you could use the fact that for two independent random variables X and Y, the PDF of their sum is the convolution of their PDFs.

The normalizing constants k_n have a simple closed form in terms of double factorials:

\frac{k_n}{2} = \frac{(2n)!!}{(2n+1)!!}
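
Here’s a quick symbolic check of that formula for the first few values of n, nothing more than a sanity check:

    from sympy import Symbol, integrate, factorial2

    u = Symbol('u')
    for n in range(1, 6):
        k_n = integrate((1 - u**2)**n, (u, -1, 1))    # exact value of the integral
        print(n, k_n, 2*factorial2(2*n)/factorial2(2*n + 1))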

I don’t know which Landau is responsible for the Landau kernel. I’ve written before about Edmund Landau and his Big O notation, and I wrote about Lev Landau and his license plate game. Edmund was a mathematician, so it makes sense that he might be the one to come up with another proof of Weierstrass’ theorem. Lev was a physicist, and I could imagine he would be interested in the Landau kernel as an approximation to the delta function.

If you know which of these Landaus, or maybe another, is behind the Landau kernel, please let me know.

Update: Someone sent me this paper which implies Edmund Landau is the one we’re looking for.

Lebesgue’s proof of Weierstrass’ theorem

A couple weeks ago I wrote about the Weierstrass approximation theorem, the theorem that says every continuous function on a closed finite interval can be approximated as closely as you like by a polynomial.

The post mentioned above uses a proof by Bernstein. And in that post I used the absolute value function as an example. Not only is |x| an example, you could go the other way around and use it as a step in the proof. That is, there is a proof of the Weierstrass approximation theorem that starts by proving the special case of |x| and then uses that result to build a proof for the general case.

There have been many proofs of Weierstrass’ theorem, and recently I ran across a proof due to Lebesgue. Here I’ll show how Lebesgue constructed a sequence of polynomials approximating |x|. It’s like pulling a rabbit out of a hat.

The starting point is the binomial theorem. If x and y are real numbers with |x| > |y| and r is any real number, then

(x+y)^r = \sum_{k=0}^\infty {r \choose k} x^{r-k} y^k.

Now apply the theorem substituting 1 for x and x² – 1 for y above and you get

|x| = (1 + (x^2 - 1))^{1/2} =\sum_{k=0}^\infty {1/2 \choose k} (x^2-1)^k

The partial sums of the right hand side are a sequence of polynomials converging to |x| on the interval [-1, 1].
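
Here’s a short numerical illustration, a sketch of mine rather than anything in Lebesgue’s paper. It builds the binomial coefficients recursively and prints the maximum deviation of the partial sums from |x| on a grid:

    import numpy as np

    def lebesgue_partial_sum(x, N):
        # Partial sum of sum_{k=0}^N binom(1/2, k) (x^2 - 1)^k
        y = x*x - 1.0                  # lies in [-1, 0] for x in [-1, 1]
        c = 1.0                        # binom(1/2, 0)
        total = np.full_like(x, c)
        power = np.ones_like(x)
        for k in range(N):
            c *= (0.5 - k)/(k + 1)     # binom(1/2, k+1) from binom(1/2, k)
            power *= y
            total += c*power
        return total

    x = np.linspace(-1, 1, 401)
    for N in [10, 100, 1000]:
        print(N, np.max(np.abs(lebesgue_partial_sum(x, N) - np.abs(x))))

Convergence is slowest at x = 0, where |x| has its kink.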

***

If you’re puzzled by the binomial coefficient with a top number that isn’t a positive integer, see the general definition of binomial coefficients. The top number can even be complex, and indeed the binomial theorem holds for complex r.

You might also be puzzled by the binomial theorem being an infinite sum. Surely if r is a positive integer we should get the more familiar binomial theorem, which is a finite sum. And indeed we do. The general definition of binomial coefficients ensures that if r is a positive integer, all the binomial coefficients with k > r are zero.

Proving that a choice was made in good faith

How can you prove that a choice was made in good faith? For example, if your company selects a cohort of people for random drug testing, how can you convince those who were chosen that they weren’t chosen deliberately? Would a judge find your explanation persuasive? This is something I’ve helped companies with.

It may be impossible to prove that a choice was not deliberate, but you can show good faith by providing evidence that the choice was deliberate by a different criterion than the one in question.

I’ll give four examples, three positive and one negative.

Cliff RNG

My previous three blog posts looked at different aspects of the Cliff random number generator. The generator needs a seed between 0 and 1 to start. Suppose I chose 0.8121086949937715 as my seed. On the one hand, that’s a number with no apparent special features. But you might ask “Hey, why that number?” and you’d be right. I show in the first post in the series how that number was chosen to make the generator start off producing duplicate output.

In the next two posts in the series, I chose π – 3 as my seed. That’s a recognizable number and obviously a deliberate choice. But it has no apparent connection to the random number generator, and so it’s reasonable to assume that the seed wasn’t chosen to make the generator look good or bad.

SHA-2

The SHA-2 cryptographic hash function uses seventy-two 32-bit numbers for its initial state, numbers that needed to be “random” in some sense. But if the values were actually chosen at random, critics would suspect that the values were chosen to provide a back door. And maybe there is a clever way to pick the initial state that provides a non-obvious exploitable weakness.

The designers of SHA-2 chose the square roots of the first consecutive primes to fill one set of constants, and the cube roots of the first consecutive primes to fill another. See code here.

The initial state is definitely not random. Someone looking at the state would eventually discover where it came from. So the choice was obviously deliberate, but apparently not designed by any cryptographic criteria.
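
For example, here’s a sketch (mine, not the code linked above) of how the eight initial hash values of SHA-256 can be derived: take the first 32 bits of the fractional parts of the square roots of the first eight primes. The round constants come from cube roots of the first 64 primes in the same way. If I’m remembering the spec correctly, the first value printed should be 6a09e667.

    from mpmath import mp, sqrt

    mp.dps = 50    # plenty of precision for 32 fractional bits

    for p in [2, 3, 5, 7, 11, 13, 17, 19]:     # the first eight primes
        frac = sqrt(p) - int(sqrt(p))          # fractional part of sqrt(p)
        print(f"{int(frac * 2**32):08x}")      # first 32 bits, in hex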

Curve 25519

Daniel Bernstein’s elliptic curve Curve25519 is widely trusted in part because Bernstein made his design choices transparent. The curve is

y² = x³ + 486662x² + x

over the finite field with 2²⁵⁵ − 19 elements, hence the name.

2²⁵⁵ − 19 is the largest prime less than 2²⁵⁵, and being close to 2²⁵⁵ makes the method efficient to implement. The coefficient 486662 is less obvious. But Bernstein explains in his paper that he took the three smallest possible values of this parameter that met the explicit design criteria, and then rejected two of them on objective grounds described at the bottom of the paper.
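
You can verify the claim about the prime with a couple lines of sympy. This is a quick check of my own, not something from Bernstein’s paper.

    from sympy import isprime, prevprime

    p = 2**255 - 19
    print(isprime(p))                 # True
    print(prevprime(2**255) == p)     # p is the largest prime below 2^255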

NIST P-384

The design of elliptic curve NIST P-384 is not as transparent as that of Curve25519, which has led to speculation that NIST may have designed the curve to have a back door.

The curve has Weierstrass form

y² = x³ – 3x + b

over the finite field with p elements where

p = 2³⁸⁴ − 2¹²⁸ − 2⁹⁶ + 2³² − 1.
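
That prime is easy to double-check, again with a quick sympy check of my own:

    from sympy import isprime

    p = 2**384 - 2**128 - 2**96 + 2**32 - 1
    print(p.bit_length())    # 384
    print(isprime(p))        # True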

As with Curve25519, the choice of field size was motivated by efficiency; the pattern of powers of 2 enables some tricks for efficient implementation. Also, there are objective reasons why the linear coefficient is -3. But the last coefficient b is the 383-bit number

27580193559959705877849011840389048093056905856361568521428707301988689241309860865136260764883745107765439761230575

which has raised some eyebrows. NIST says the value was chosen at random, though not directly. More specifically, NIST says it first generated a certain 160-bit random number, then applied a common key stretching algorithm to obtain b above. Researchers are divided over whether they believe this. See more in this post.

Conclusion

Sometimes you can’t prove that a choice wasn’t deliberate. In that case the best you can do is show that the choice was deliberate, but by an innocent criterion, one unrelated to the matter at hand.

I tried to do this in the Cliff RNG blog posts by using π as my seed. I couldn’t actually use π because the seed had to be between 0 and 1, but there’s an obvious way to take π and produce a number between 0 and 1.

The designers of SHA-2 took a similar route. Just as π is a natural choice for a real number, primes are natural choices for integers. They couldn’t use integers directly, but they made complicated patterns from simple integers in a natural way by taking roots.

Daniel Bernstein gained the cryptography community’s trust by making his design criteria transparent. Given his criteria, the design was almost inevitable.

NIST was not as persuasive as Daniel Bernstein. They claim to have chosen a 160-bit number at random, and they may very well have. But there’s no way to know whether they generated a lot of 160-bit seeds until they found one that resulted in a constant term that has some special property. They may have chosen their seed in good faith, but they have not been completely persuasive.

Sometimes it’s not enough to act in good faith; you have to make a persuasive case that you acted in good faith.

Detecting a short period in an RNG

The last couple posts have been looking at the Cliff random number generator. I introduce the generator here and look at its fixed points. These turn out to be less of a problem in practice than in theory.

Yesterday I posted about testing the generator with the DIEHARDER test suite, the successor to George Marsaglia’s DIEHARD battery of RNG tests.

This morning I discovered something about the Cliff RNG which led to discovering something about DIEHARDER. The latter is more important: I don’t think anyone is using the Cliff RNG for serious work, but people are definitely using DIEHARDER.

The Cliff RNG has a short period, and yet many of the DIEHARDER tests passed. However, several of the tests failed as well, and perhaps these tests failed due to the short period, in particular rgb_lagged_sum. But at least some tests can pass despite a short period.

Finding the period

Since the Cliff RNG maintains internal state as a floating point number and outputs integers, the period is a bit subtle.

The state of the Cliff RNG is a floating point number between 0 and 1, and so it has 2⁵³ possible values. (See Anatomy of a floating point number.) That means the maximum possible period is 2⁵³. We use the internal state x to create 32-bit integers n by multiplying x by 2³² and truncating to an integer value. The maximum period could conceivably be larger than 2³² because an output value n could repeat but correspond to two different x’s. The actual period, at least in my experiment, was between 2²² and 2²³.

I seeded the Cliff RNG with π – 3 (why?) and found that starting with index i = 3,075,302, the output values repeat with period p = 6,705,743. So there was a burn-in period before the state entered a cycle, but it would repeat that cycle forever. Not only are the output values repeating, the state values x repeat. (Otherwise it would be possible for the integer outputs to cycle for a while then break out.)

It’s possible that starting with other seeds, the generator has other cycle lengths, longer or shorter. But I stumbled across one cycle of period 6,705,743.
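
Here’s one way to look for such a cycle, using Brent’s cycle-detection algorithm. It runs in constant memory and compares floating point states for exact equality, which is fine here because the state itself eventually repeats. This is a sketch rather than a record of exactly what I ran, and it may take a few minutes in pure Python since it iterates the generator tens of millions of times.

    from math import log, pi

    def step(x):
        # One iteration of the Cliff RNG
        return (-100.0 * log(x)) % 1.0

    def brent(f, x0):
        # Brent's algorithm: returns (period, index where the cycle starts)
        power = lam = 1
        tortoise, hare = x0, f(x0)
        while tortoise != hare:
            if power == lam:           # start a new power of two
                tortoise = hare
                power *= 2
                lam = 0
            hare = f(hare)
            lam += 1
        # find the position of the first repetition
        tortoise = hare = x0
        for _ in range(lam):
            hare = f(hare)
        mu = 0
        while tortoise != hare:
            tortoise, hare = f(tortoise), f(hare)
            mu += 1
        return lam, mu

    print(brent(step, pi - 3))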

Testing RNGs

I wrote a chapter for O’Reilly’s book Beautiful Testing in which I discuss How to test a random number generator. More specifically, how to test a non-uniform random number generator. That is, starting with a trusted uniform random number generator, transform the values to have some other probability distribution. This is the most common scenario. Few developers write their own core RNG, but it’s fairly common to have to use a core RNG to create a custom distribution.

If you do want to test a uniform random number generator, as I do in this post, there are test suites like DIEHARDER. This is one of the oldest and best known test suites. There are newer and more rigorous suites, like TestU01 that I blog about here. There’s also the NIST statistical test suite that I write about in the same post.

These tests are a little challenging to build and use. I’ve had clients ask me to run these tests for them and help them interpret the results. If you’d like for me to do that for you, let’s talk.

Testing Cliff RNG with DIEHARDER

My previous post introduced the Cliff random number generator. The post showed how to find starting seeds where the generator will start out by producing approximately equal numbers. Despite this flaw, the generator works well by some criteria.

I produced a file of a billion 32-bit integers by multiplying the output values, which were floating point numbers between 0 and 1, by 2³² and truncating to integer. Then I ran the DIEHARDER random number generator test suite.
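
Here’s roughly how such a file can be produced and handed to DIEHARDER. The sample count below is much smaller than what I actually used, just to keep the example quick, and the dieharder invocation in the comment is from memory, so check the generator list in your build.

    import numpy as np
    from math import log, pi

    def cliff_uint32(seed, n):
        # n outputs of the Cliff RNG, mapped to 32-bit integers
        x = seed
        out = np.empty(n, dtype=np.uint32)
        for i in range(n):
            x = (-100.0 * log(x)) % 1.0
            out[i] = int(x * 2**32)
        return out

    cliff_uint32(pi - 3, 10**7).tofile("cliff.bin")
    # then something like: dieharder -a -g 201 -f cliff.bin
    # where 201 is the raw file input generator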

The results were interesting. Before running the tests, I thought the tests would nearly all pass or nearly all fail, more likely the latter. But what happened was that many tests passed and some failed hard [1].

Here’s a list of the tests that passed:

  • diehard_birthdays
  • diehard_rank_32x32
  • diehard_rank_6x8
  • diehard_bitstream
  • diehard_oqso
  • diehard_dna
  • diehard_count_1s_str
  • diehard_count_1s_byt
  • diehard_runs
  • sts_monobit
  • sts_serial
  • rgb_bitdist
  • rgb_kstest_test
  • dab_dct
  • dab_filltree2
  • dab_monobit2

The tests that failed were:

  • diehard_parking_lot
  • diehard_2sphere
  • diehard_3sphere
  • diehard_squeeze
  • diehard_craps
  • marsaglia_tsang_gcd
  • rgb_lagged_sum
  • dab_bytedistrib

I’ve left out a few test results that were ambiguous as well as tests that were described as “Suspect” and “Do not use” on the DIEHARDER web site.

The site I mentioned in the previous post where I ran across this generator said that it passed a spherical generation test. I assume the implementation of that test was less demanding than the version included in DIEHARD. But the generator does well by other tests.

The lagged sum test tests for autocorrelation. Maybe the failure of this test has something to do with the fixed points discussed earlier.

Update: After writing this post I discovered that the generator has a short period, as I discuss here. That explains why the lagged sum test fails: the output has perfect autocorrelation at a lag equal to the period.

[1] By “failed hard” I mean the test returned a p-value of zero. The p-value couldn’t actually be zero, but it was close enough that the displayed value was exactly zero.

Fixed points of the Cliff random number generator

I ran across the Cliff random number generator yesterday. Given a starting value x0 in the open interval (0, 1), the generator proceeds by

xₙ₊₁ = |100 log(xₙ) mod 1|

for n ≥ 0. The article linked to above says that this generator passes a test of randomness based on generating points on a sphere.

Real numbers

While the long term distribution of the generator may be good, it has a problem with its sequential behavior, namely that it has an infinite number of fixed points. If the generator ever reaches one of these points, it gets stuck forever.

Here’s a proof. Since x is between 0 and 1, log(x) is always negative. So we can replace the absolute value above with a negative. A fixed point of the generator is a solution to

x = -100 log(x) – k

for some integer k. Define

f(x, k) = −100 log(x) − x − k.

For each non-negative k, the limit of f(x, k) as x goes to 0 is ∞ and the limit as x goes to 1 is negative, so somewhere in between it is zero.

Floating point numbers

So in exact operations over the reals, there is a fixed point for each non-negative integer k. However, when implemented in finite precision arithmetic, it manages to get itself unstuck as the following Python code shows with k arbitrarily chosen to be 20.

    from numpy import log
    from scipy.optimize import bisect

    # Find the fixed point corresponding to k = 20
    r = bisect(lambda x: -100*log(x) - x - 20, 0.4, 0.999)
    print(r)

    # Iterate the generator starting from that fixed point
    for i in range(10):
        r = abs(-100*log(r)) % 1
        print(r)

This produces

    0.8121086949937715
    0.8121086949252678
    0.8121087033605505
    0.8121076646716787
    0.8122355649800639
    0.7964876237426814
    0.7543688054472710
    0.1873898667258977
    0.4563983607272064
    0.4389252746489802
    0.3426097596058071

In infinite precision, r above would be a fixed point. In floating point precision, r is not. But it does start out nearly fixed. The first iteration only changes r in the 11th decimal place, and it doesn’t move far for the next few iterations.

Update: See the next post for how the random number generator does on the DIEHARDER test suite.

Plotting fixed points

The kth fixed point is the solution to f(x, k) = 0. The following code plots the fixed points as a function of k.

    from numpy import arange, log
    from scipy.optimize import bisect
    import matplotlib.pyplot as plt

    # Solve f(x, k) = 0 for k = 0, 1, 2, ..., 299
    t = arange(300)
    y = [bisect(
            lambda x: -100*log(x)-x-k,
            0.001,
            0.999)
        for k in t]

    plt.plot(t, y)
    plt.xlabel("k")
    plt.ylabel("fixed point")

fixed points of Cliff random number generator

The fixed points cluster toward zero, or they would in infinite precision arithmetic. As we showed above, the Cliff random number generator performs better in practice than in theory. Maybe the generator does well after wandering close to zero, but I wouldn’t be surprised if it has a bias toward the low end of the interval.

Ease of learning vs relearning

Much more is written about how easy or hard some technology is to learn than about how hard it is to relearn. Maybe this is because people are more eager to write about something while the excitement or frustration of their first encounter is fresh.

Advocates of difficult-to-learn technologies say that tools should be optimized for experienced users, that ease of learning is over-rated because you only learn a tool once and use it for much longer. That makes sense if you use a tool continuously. If you use a tool occasionally, however, you might learn it once and relearn it many times.

The ease of relearning a technology should be emphasized more. As you’re learning a programming language, for example, it may be difficult to imagine forgetting it and needing to relearn it down the road. But you might ask yourself

If I put this down for a couple years and then have to come back to it, what language would I wish I’d written it in?

A while back I debated relearning Perl for the kind of text munging projects that Perl was designed for. But not only would I have to relearn Perl once, I’d have to relearn it every time I revisit the code. Perl does not stick in my head without constant use. Awk, on the other hand, is small and simple, and has a lot of the benefits of Perl. You can learn the basics of Awk in a day, and so if you have to, you can relearn it in a day.

Something easy to learn is also easy to relearn.

However, the converse isn’t necessarily true. Some things may be hard to learn but easy to pick back up. For example, I found LaTeX hard to learn but easy to relearn after not using it for several years. A lot of other tools seem almost as hard to relearn as they were to learn the first time, every time I pick them up. I think part of what made LaTeX easy to pick back up was its internal consistency. It’s a little quirky, but it has conceptual integrity.

Conceptual integrity

I’ve used Mathematica off and on ever since it came out. Sometimes I’d go for years without using it, but it has always been easy to pick back up. Mathematica is easy to return to because its syntax is consistent and predictable. Mathematica has conceptual integrity. I find R much harder to use because the inconsistent syntax fades from my memory between uses.

Conceptual integrity comes from strong leadership, even a “benevolent dictator.” Donald Knuth shaped TeX and Stephen Wolfram shaped Mathematica. R has been more of an egalitarian effort, and it shows.

The “Tidyverse” of libraries on top of R is more consistent than the base language, due to Hadley Wickham doing so much of the work himself. In fact, the Tidyverse was initially called the “Hadleyverse,” though Hadley didn’t like that name.

Uniform approximation paradox

What I’m going to present here is not exactly a paradox, but I couldn’t think of a better way to describe it in the space of a title. I’ll discuss two theorems about uniform convergence that seem to contradict each other, then show by an example why there’s no contradiction.

Weierstrass approximation theorem

One of my favorite theorems is the Weierstrass approximation theorem. It says that every continuous function on a closed, bounded interval can be approximated as closely as you like by polynomials. This is surprising because polynomials are very special, well behaved functions, and a merely continuous function can be much worse behaved.

For example, a polynomial cannot have any kinks in it, unlike the absolute value function. But even though an individual polynomial cannot have a kink, a sequence of polynomials can approach a kink.

Morera’s theorem

Morera’s theorem [1] says that the uniform limit of analytic functions is analytic. But Weierstrass’s theorem says that the uniform limit of analytic functions (namely polynomials) can have a kink in it, which an analytic function cannot have. What gives?

Weierstrass’s theorem is about uniform convergence over an interval of the real line.

Morera’s theorem is about uniform convergence on compact subsets of an open set in the complex plane.

We’ll show an example below where a sequence of polynomials converges to |x| on an interval of the real line but not in a disk containing the interval.

Bernstein polynomials

The Weierstrass approximation theorem tells us that there exists a sequence of polynomials converging uniformly to any continuous function on a compact interval. But we can go a step further and actually construct a sequence of such polynomials. The polynomials fall out of a proof of the Weierstrass theorem using probability! There’s nothing random going on here, and yet we can take a set of inequalities that fall out of calculations motivated by probability and construct our approximations.

Here is the nth Bernstein polynomial approximation to a function g in Mathematica code.

B[x_, n_, g_] := 
    Sum[Binomial[n, k] x^k (1 - x)^(n - k) g[k/n], 
        {k, 0, n}]

We can then use the following code to show the convergence of Bernstein polynomials.

f[x_] := Abs[x - 1/2]
Plot[{B[x, 4, f], B[x, 10, f], B[x, 30, f], f[x]}, 
    {x, 0, 1} , PlotLegends -> "Expressions"]

Plot of Bernstein polynomials converging to absolute value

Next we’ll take a particular Bernstein polynomial for f in the sequence and plot the difference between it and f over the complex plane.

ComplexPlot3D[B[z, 6, f] - f[z], {z, 0, 1 + I}]

Bernstein approximation error in complex plane

The polynomial is close to f along the interval [0, 1] on the real line, but the further we move away from the real axis the further it gets from f. Furthermore, the distance increases as we increase the degree of the polynomial. The following code looks at the distance between Bₙ(i) and f(i) for n = 1, 2, 3, …, 10.

Table[N[Abs[B[I , n, f] - f[I]]], {n, 1, 10}]

It returns

 0.6180
 1.9021
 1.9021
 3.4085
 3.4085
 7.3941
 7.3941
23.4511
23.4511
92.4377

So there’s no contradiction between the theorems of Weierstrass and Morera. The Bernstein polynomials do indeed converge uniformly to f over [0, 1] but they do not converge in the complex plane.

[1] I don’t know whether this theorem is exactly due to Morera, but it follows directly from Morera’s theorem.

Nearly parallel is nearly transitive

We begin with a bit of geometry, then show its relevance to statistics.

Geometry

Let X, Y, and Z be three unit vectors. If X is nearly parallel to Y, and Y is nearly parallel to Z, then X is nearly parallel to Z.

Here’s a proof. Think of X, Y, and Z as points on a unit sphere. Then saying that X and Y are nearly parallel means that the two points are close together on the sphere. The statement above follows from the triangle inequality on the sphere:

dist(X, Z) ≤ dist(X, Y) + dist(Y, Z).

So if the two terms on the right are small, the term on the left is small, though maybe not quite as small: it is no more than twice the larger of the other two angles.

We can be a little more quantitative. Let a be the angle between X and Y, b the angle between Y and Z, and c the angle between X and Z.  Then the law of cosines for spherical trigonometry says

cos c = cos a cos b + sin a sin b cos γ

where γ is the angle between the arcs a and b. If a and b are small, then sin a and sin b are also small (see here), and so we have the approximation

cos c ≈ cos a cos b.

The error in the approximation is sin a sin b cos γ, the product of two small numbers and a number with absolute value no more than 1.
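
Here’s a quick numerical check of that approximation, with small angles chosen arbitrarily:

    import numpy as np

    a, b, gamma = 0.05, 0.08, 1.2                 # two small angles, arbitrary angle between the arcs
    cos_c = np.cos(a)*np.cos(b) + np.sin(a)*np.sin(b)*np.cos(gamma)
    print(cos_c)                                  # exact value of cos c
    print(np.cos(a)*np.cos(b))                    # the approximation
    print(np.sin(a)*np.sin(b)*np.cos(gamma))      # the error term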

The geometric exercise above was inspired by a discussion of correlation.

Correlation

Correlation of random variables is not transitive. Correlation corresponds to directions not being  perpendicular. If X is not perpendicular to Y, and Y is not perpendicular to Z, it might be the case that X is perpendicular to Z.

But if we replace “not perpendicular” with “nearly parallel” we see that we do have something like transitivity. That is, correlation of random variables is not transitive, but high correlation is.

If the angles a, b, and c above are correlation angles, then we have the approximation

corr(X, Z) ≈ corr(X, Y) corr(Y, Z)

if all the correlations are near 1.
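
Here’s a small simulation illustrating the approximation. The construction of X, Y, and Z is mine, chosen only to make all the correlations close to 1.

    import numpy as np

    rng = np.random.default_rng(314159)     # arbitrary seed

    n = 100_000
    Y = rng.standard_normal(n)
    X = Y + 0.1*rng.standard_normal(n)      # noisy copy of Y
    Z = Y + 0.1*rng.standard_normal(n)      # another noisy copy of Y

    corr = lambda u, v: np.corrcoef(u, v)[0, 1]
    print(corr(X, Z))                       # measured correlation
    print(corr(X, Y)*corr(Y, Z))            # the approximation above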

Exercise for the reader: interpret the error term in the geometric problem in statistical terms.

Angles in the spiral of Theodorus

The previous post looked at how to plot the spiral of Theodorus shown below.

Spiral of Theodorus

We stopped the construction where we did because the next triangle to be added would overlap the first triangle, which would clutter the image. But we could certainly have kept going.

If we do keep going, then the set of hypotenuse angles will be dense in the circle, with no repeats.

The nth triangle has sides of length 1 and √n, and so the tangent of the nth triangle’s acute angle is 1/√n. The angle formed by the nth hypotenuse is thus

arctan(1) + arctan(1/√2) + arctan(1/√3) + … + arctan(1/√n).

Here’s a plot of the first 99 hypotenuse angles.

Angles formed by hypotenuses in spiral of Theodorus

Here’s the code that produced the plot.

    from numpy import *
    import matplotlib.pyplot as plt

    plt.axes().set_aspect(1)

    N = 100
    # cumulative sums of the angles arctan(1/sqrt(n))
    theta = cumsum(arctan(arange(1,N)**-0.5))
    plt.scatter(cos(theta), sin(theta))
    plt.savefig("theodorus_angles.svg")

If we change N to 500 we get a solid ring because the angles are closer together than the default thickness of dots in a scatterplot.