Positive polynomials and squares

If a real polynomial in one variable is a sum of squares, it obviously cannot be negative. For example, the polynomial

p(x) = (x² − 3)² + (x + 7)²

is obviously never negative for real values of x. What about the converse: If a real polynomial is never negative, is it a sum of squares? Yes, indeed it is.

What about polynomials in two variables? There the answer is no. David Hilbert (1862–1943) knew that there must be positive polynomials that are not a sum of squares, but no one produced a specific example until Theodore Motzkin in 1967. His polynomial

p(x, y) = 1 − 3x²y² + x²y⁴ + x⁴y²

is never negative but cannot be written as a sum of any number of squares. Here’s a plot:

[Plot of the Motzkin polynomial]

Source: Single Digits
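
Here's a quick way to sanity-check the claim numerically, a short Python sketch. (Nonnegativity also follows from the AM–GM inequality applied to the three terms 1, x²y⁴, and x⁴y².)

    import numpy as np

    def motzkin(x, y):
        return 1 - 3*x**2*y**2 + x**2*y**4 + x**4*y**2

    # Evaluate on a grid; the true minimum is 0, attained at |x| = |y| = 1.
    xs = np.linspace(-3, 3, 601)
    X, Y = np.meshgrid(xs, xs)
    print(motzkin(X, Y).min())  # within floating point error of 0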


Category theory and Koine Greek

[Image: fragment of the Gospel of John in Greek]

When I was in college, I sat in on a communication workshop for Latin American preachers. This was unusual since I’m neither Latin American nor a preacher, but I’m glad I was there.

I learned several things in that workshop that I’ve used ever since. For example, when you’re gesturing about something moving forward in time, move your hand from left to right from the audience’s perspective. Since English speakers (and for the audience of this workshop, Spanish speakers) read from left to right, we think of time progressing from left to right. If you see someone talking about time moving forward, but you see motion from right to left, you feel a subtle cognitive dissonance. (Presumably you should reverse this when speaking to an audience whose primary language is Hebrew or Arabic.)

Another lesson from that workshop, the one I want to focus on here, is that you don’t always need to convey how you arrived at an idea. Specifically, the leader of the workshop said that if you discover something interesting from reading the New Testament in Greek, you can usually present your point persuasively using the text in your audience’s language without appealing to Greek. This isn’t always possible—you may need to explore the meaning of a Greek word or two—but you can use Greek for your personal study without necessarily sharing it publicly. The point isn’t to hide anything, only to consider your audience. In a room full of Greek scholars, bring out the Greek.

This story came up in a recent conversation with Brent Yorgey about category theory. You might discover something via category theory but then share it without discussing category theory. If your audience is well versed in category theory, then go ahead and bring out your categories. But otherwise your audience might be bored or intimidated, just as many people would be if they had to listen to an argument based on the finer points of Koine Greek grammar. Microsoft’s LINQ software, for example, was inspired by category theory principles, but you’d be hard pressed to find any reference to this because most programmers don’t want or need to know where it came from. They just want to know how to use it.

Some things sound profound when expressed in esoteric language, such as category theory or Koine Greek, but don’t seem so profound in more down-to-earth language. Expressing yourself in a different language helps filter out pedantry from useful ideas. (On the other hand, some things that looked like pure pedantry have turned out to be very useful. Some hairs are worth splitting.)

Sometimes you have to introduce a new term because there isn’t a colloquial counterpart. Monads are a good example, a concept from category theory that has entered software development. A monad is what it is, and analogies to burritos and other foods don’t really help. Better to introduce the term and say plainly what it is.

* * *

More on applied category theory

New Twitter account for functional programming and categories

I’m starting a new Twitter account @FunctorFact for functional programming and category theory.

These two subjects have a lot of overlap, so some tweets will be strictly about programming, some strictly about math, and some will combine ideas from both.


Prime factors, phone numbers, and the normal distribution

Telephone numbers typically have around three distinct prime factors.

The length of a phone number varies by country, but a US phone number is a 10-digit number, and 10-digit numbers typically have around three distinct prime factors, give or take a couple. Assuming phone numbers are scattered among the possible 10-digit numbers in a way that doesn’t bias their number of prime factors, these numbers will often have between 1 and 5 prime factors. If a country has 9- or 11-digit phone numbers, the result is essentially the same.

Let ω(n) be the number of distinct prime factors of n. Then the Erdős–Kac theorem says roughly that ω(n) is distributed like a normal random variable with mean and variance log log n. More precisely, fix two numbers a and b.  For a given value of x, count the proportion of positive integers less than x where

(ω(n) − log log n) / √(log log n)

is between a and b. Then in the limit as x goes to infinity, this proportion approaches the probability that a standard normal random variable is between a and b.

So by that heuristic, the number of distinct prime factors of a 10-digit number is approximately normally distributed with mean and variance log log 10^10 ≈ 3.14, and such a distribution is between 1 and 5 about 74% of the time.
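
Here’s a quick sketch of that calculation in Python, assuming SciPy is available:

    from math import log, sqrt
    from scipy.stats import norm

    mu = log(log(1e10))   # log log 10^10, the heuristic mean and variance
    sigma = sqrt(mu)
    print(mu)                                               # about 3.14
    print(norm.cdf(5, mu, sigma) - norm.cdf(1, mu, sigma))  # about 0.74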

My business phone number, for example, is 8324228646. Obviously this is divisible by 2. In fact it equals 2 × 3² × 462457147, and so it has exactly three distinct prime factors: 2, 3, and 462457147.

Here’s how you could play with this using Python.

    from sympy.ntheory import factorint

    def omega(n):
        return len(factorint(n))

I looked in SymPy and didn’t see an implementation of ω(n) directly, but it does have a function factorint that returns the prime factors of a number, along with their multiplicities, in a dictionary. So ω(n) is just the size of that dictionary.
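
For example, continuing the code above with the phone number from earlier:

    print(factorint(8324228646))  # {2: 1, 3: 2, 462457147: 1}
    print(omega(8324228646))      # 3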

I took the first 20 phone numbers in my contacts and ran them through omega and got results consistent with what you’d expect from the theory above. One was prime, and none had more than five factors.

[Bar chart of the number of distinct prime factors in a sample of 20 phone numbers, with bar heights 1, 3, 5, 8, 3]

Benford’s law, chi-square, and factorials

A while back I wrote about how the leading digits of factorials follow Benford’s law. That is, if you look at just the first digit of a sequence of factorials, they are not evenly distributed. Instead, 1’s are most popular, then 2’s, etc. Specifically, the proportion of factorials starting with n is roughly log10(1 + 1/n).

Someone has proved that the limiting distribution of leading digits of factorials exactly satisfies Benford’s law. But if we didn’t know this, we might use a chi-square statistic to measure how well the empirical results match expectations. As I argued in the previous post, statistical tests don’t apply here, but they can be useful anyway in giving us a way to measure the size of the deviations from theory.

Benford’s law makes a better illustration of the chi-square test than the example of prime remainders because the bins are unevenly sized, which they’re allowed to be in general. In the prime remainder post, they were all the same size.

The original post on leading digits of factorial explains why we compute the leading digits the way we do. Only one detail has changed: the original post used Python 2 and this one uses Python 3. Integer division was the default in Python 2, but now in Python 3 we have to use // to explicitly ask for integer division, floating point division being the new default.

Here’s a plot of the distribution of the leading digits for the first 500 factorials.

And here’s code to compute the chi-square statistic:

    from math import factorial, log10

    def leading_digit_int(n):
        while n > 9:
            n = n//10
        return n

    def chisq_stat(O, E):
        return sum( [(o - e)**2/e for (o, e) in zip(O, E)] )

    # Waste the 0th slot to make the code simpler.
    digits = [0]*10

    N = 500
    for i in range(N):
        digits[ leading_digit_int( factorial(i) ) ] += 1

    expected = [ N*log10(1 + 1/n) for n in range(1, 10) ]

    print( chisq_stat(digits[1:], expected) )

This gives a chi-square statistic of 7.693, very near the mean value of 8 for a chi-square distribution with eight degrees of freedom. (There are eight degrees of freedom, not nine, because if we know how many factorials start with the digits 1 through 8, we know how many start with 9.)
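
If you wanted a p-value to go with that statistic, a one-liner with SciPy would do; this is just for illustration:

    from scipy.stats import chi2
    print(chi2.sf(7.693, 8))  # about 0.46, i.e. nothing unusual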

So the chi-square statistic suggests that the deviation from Benford’s law is just what we’d expect from random data following Benford’s law. And as we said before, this suggestion turns out to be correct.


Hypothesis testing and number theory

This post uses a hypothesis test for proportions to look at a couple conjectures in number theory. It is similar to my earlier post on the chi-square test and prime remainders. You could read this as a post on statistics or a post on number theory, depending on which you’re less familiar with.

Using statistical tests on number theory problems is kind of odd. There’s nothing random going on, so in that sense the whole enterprise is unjustified. Nevertheless, statistical tests can be suggestive. They certainly don’t prove theorems, but they can give reason to suspect a theorem is true or false. In that sense, applying statistical tests to number theory isn’t all that different from applying them to more traditional settings.

First we’ll look at the remainders of primes modulo 4. Except for 2, all primes are odd, and so they have remainder either 1 or 3 when divided by 4. Brian Hayes wrote recently that Chebyshev noticed in the 1850s that there seem to be more primes with remainder 3 than with remainder 1. Is the imbalance larger than one would expect to see from fair coin tosses?

Here’s some Python code to find the proportion of the first million primes (after 2) that have remainder 3 when divided by 4.

    from sympy import prime
    from math import sqrt

    n = 1000000
    rem3 = 0
    for i in range(2, n+2):
        if prime(i) % 4 == 3:
            rem3 += 1
    p_hat = rem3/n

This shows that of the first million odd primes, 500,202 are congruent to 3 mod 4. Would it be unusual for a coin to come up heads this many times in a million flips? To find out we’d compute the z-statistic:

z = \frac{\hat{p} - p}{\sqrt{pq/n}}

Here p is the proportion under the null hypothesis, q = 1 – p, and n is the sample size. In our case, the null hypothesis is p = 0.5 and n = 1,000,000. [1]

The code

    p = 0.5
    q = 1 - p
    z = (p_hat - p)/sqrt(p*q/n)

shows that z = 0.404, hardly a suspiciously large value. If we were looking at random values we’d see a z-score this large or larger 34% of the time. So in this case the test doesn’t suggest much.
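
The 34% figure is the upper tail of the standard normal distribution; here’s a quick check with SciPy:

    from scipy.stats import norm
    print(norm.sf(0.404))  # about 0.34, the one-sided p-value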

* * *

[1] The derivation of the z statistic is fairly quick. If the probability of success is p, then the number of successes out of n trials is binomial(n, p). For large n, this has approximately the same distribution as a normal random variable with the same mean and variance, namely mean np and variance npq. The proportion of successes then has approximately mean p and standard deviation √(pq/n). Subtracting the mean and dividing by the standard deviation normalizes the distribution to have mean 0 and variance 1. So under the null hypothesis, the z statistic has a standard normal distribution.

Prime remainders too evenly distributed

First Brian Hayes wrote an excellent post about the remainders when primes are divided by other primes. Then I wrote a follow-on focusing on just the first part of his post. He mostly looked at pairs of primes, but I wanted to look in more detail at simulating dice rolls by keeping the remainder when consecutive primes are divided by a fixed prime. For example, take a sequence of primes larger than 7 and keep their remainders mod 7 to create values from 1 to 6.

The results are evenly distributed with some variation, just like dice rolls. In fact, given these results and results from a set of real dice rolls, most people would probably think the former are real because they’re more evenly distributed. A chi-squared goodness of fit test shows that the results are too evenly distributed compared to real dice rolls.

At the end of my previous post, I very briefly discuss what happens when you look at “dice” with more than six sides. Here I’ll go into a little more detail and look at a large number of examples.

In short, you either get a suspiciously good fit or a terrible fit. If you look at the remainder when dividing primes by m, you get values between 1 and m-1. You can’t get a remainder of 0 because primes larger than m aren’t divisible by m (or by anything else except 1 and themselves). If m itself is prime, then you get all the numbers between 1 and m-1, and as we’ll show below you get them in very even proportions. But if m isn’t prime, there are some remainders you can’t get.

The sequence of remainders looks random in the sense of being unpredictable. (Of course it is predictable by the algorithm that generates them, but it’s not predictable in the sense that you could look at the sequence out of context and guess what’s coming next.) The sequence is biased, and that’s the big news. Pairs of consecutive primes have correlated remainders. But I’m just interested in showing a different departure from a uniform distribution, namely that the results are too evenly distributed compared to random sequences.

The table below gives the chi-square statistic and p-value for each of several primes. For each prime p, we take remainders mod p of the next million primes after p and compute the chi-square goodness of fit statistic with p-2 degrees of freedom. (Why p-2? There are p-1 different remainders, and the chi-square test for k possibilities has k-1 degrees of freedom.)

The p-value column gives the probability of seeing a fit this good or better from uniform random data. (The p in p-value is unrelated to our use of p to denote a prime. It’s an unfortunate convention of statistics that everything is denoted p.) After the first few primes, the p-values are extremely small, indicating that such an even distribution of values would be astonishing from random data.

| Prime | Chi-square | p-value    |
|     3 |     0.0585 |   2.88e-02 |
|     5 |     0.0660 |   5.32e-04 |
|     7 |     0.0186 |   1.32e-07 |
|    11 |     0.2468 |   2.15e-07 |
|    13 |     0.3934 |   6.79e-08 |
|    17 |     0.5633 |   7.64e-10 |
|    19 |     1.3127 |   3.45e-08 |
|    23 |     1.1351 |   2.93e-11 |
|    29 |     1.9740 |   3.80e-12 |
|    31 |     2.0052 |   3.11e-13 |
|    37 |     2.5586 |   3.92e-15 |
|    41 |     3.1821 |   9.78e-16 |
|    43 |     4.4765 |   5.17e-14 |
|    47 |     3.7142 |   9.97e-18 |
|    53 |     3.7043 |   3.80e-21 |
|    59 |     7.0134 |   2.43e-17 |
|    61 |     5.1461 |   6.45e-22 |
|    67 |     7.1037 |   5.38e-21 |
|    71 |     7.6626 |   6.13e-22 |
|    73 |     7.5545 |   4.11e-23 |
|    79 |     8.0275 |   3.40e-25 |
|    83 |    12.1233 |   9.92e-21 |
|    89 |    11.4111 |   2.71e-24 |
|    97 |    12.4057 |   2.06e-26 |
|   101 |    11.8201 |   3.82e-29 |
|   103 |    14.4733 |   3.69e-26 |
|   107 |    13.8520 |   9.24e-29 |
|   109 |    16.7674 |   8.56e-26 |
|   113 |    15.0897 |   1.20e-29 |
|   127 |    16.4376 |   6.69e-34 |
|   131 |    19.2023 |   6.80e-32 |
|   137 |    19.1728 |   1.81e-34 |
|   139 |    22.2992 |   1.82e-31 |
|   149 |    22.8107 |   6.67e-35 |
|   151 |    22.8993 |   1.29e-35 |
|   157 |    30.1726 |   2.60e-30 |
|   163 |    26.5702 |   3.43e-36 |
|   167 |    28.9628 |   3.49e-35 |
|   173 |    31.5647 |   7.78e-35 |
|   179 |    33.3494 |   2.46e-35 |
|   181 |    36.3610 |   2.47e-33 |
|   191 |    29.1131 |   1.68e-44 |
|   193 |    29.9492 |   2.55e-44 |
|   197 |    34.2279 |   3.49e-41 |
|   199 |    36.7055 |   1.79e-39 |
|   211 |    41.0392 |   8.42e-40 |
|   223 |    39.6699 |   1.73e-45 |
|   227 |    42.3420 |   2.26e-44 |
|   229 |    37.1896 |   2.02e-50 |
|   233 |    45.0111 |   4.50e-44 |
|   239 |    43.8145 |   2.27e-47 |
|   241 |    51.3011 |   1.69e-41 |
|   251 |    47.8670 |   6.28e-48 |
|   257 |    44.4022 |   1.54e-53 |
|   263 |    51.5905 |   7.50e-49 |
|   269 |    59.8398 |   3.92e-44 |
|   271 |    59.6326 |   6.02e-45 |
|   277 |    52.2383 |   2.80e-53 |
|   281 |    52.4748 |   1.63e-54 |
|   283 |    64.4001 |   2.86e-45 |
|   293 |    59.7095 |   2.59e-52 |
|   307 |    65.2644 |   1.64e-52 |
|   311 |    63.1488 |   1.26e-55 |
|   313 |    68.6085 |   7.07e-52 |
|   317 |    63.4099 |   1.72e-57 |
|   331 |    66.3142 |   7.20e-60 |
|   337 |    70.2918 |   1.38e-58 |
|   347 |    71.3334 |   3.83e-61 |
|   349 |    75.8101 |   3.38e-58 |
|   353 |    74.7747 |   2.33e-60 |
|   359 |    80.8957 |   1.35e-57 |
|   367 |    88.7827 |   1.63e-54 |
|   373 |    92.5027 |   7.32e-54 |
|   379 |    86.4056 |   5.67e-60 |
|   383 |    74.2349 |   3.13e-71 |
|   389 |   101.7328 |   9.20e-53 |
|   397 |    86.9403 |   1.96e-65 |
|   401 |    90.3736 |   3.90e-64 |
|   409 |    92.3426 |   2.93e-65 |
|   419 |    95.9756 |   8.42e-66 |
|   421 |    91.1197 |   3.95e-70 |
|   431 |   100.3389 |   1.79e-66 |
|   433 |    95.7909 |   1.77e-70 |
|   439 |    96.2274 |   4.09e-72 |
|   443 |   103.6848 |   6.96e-68 |
|   449 |   105.2126 |   1.07e-68 |
|   457 |   111.9310 |   1.49e-66 |
|   461 |   106.1544 |   7.96e-72 |
|   463 |   116.3193 |   1.74e-65 |
|   467 |   116.2824 |   1.02e-66 |
|   479 |   104.2246 |   3.92e-79 |
|   487 |   116.4034 |   9.12e-73 |
|   491 |   127.2121 |   6.69e-67 |
|   499 |   130.9234 |   5.90e-67 |
|   503 |   118.4955 |   2.60e-76 |
|   509 |   130.9212 |   6.91e-70 |
|   521 |   118.6699 |   6.61e-82 |
|   523 |   135.4400 |   3.43e-71 |
|   541 |   135.9210 |   3.13e-76 |
|   547 |   120.0327 |   2.41e-89 |
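
Here’s a sketch of how you might reproduce a row of the table, assuming SymPy and SciPy. Looping over a million primes with nextprime is slow but simple.

    from sympy import nextprime
    from scipy.stats import chi2

    def too_even(m, num_primes=1000000):
        # Remainders mod m of the next num_primes primes after m
        counts = [0] * (m - 1)
        p = m
        for _ in range(num_primes):
            p = nextprime(p)
            counts[(p % m) - 1] += 1
        expected = num_primes / (m - 1)
        stat = sum((c - expected)**2 / expected for c in counts)
        # Left tail: probability of a fit this good or better under uniformity
        return stat, chi2.cdf(stat, m - 2)

    print(too_even(7))  # the statistic should come out near the 0.0186 in the table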

Computing higher moments with a fold

Folds in functional programming are often introduced as a way to find the sum or product of items in a list. In this case the fold state has the same type as the list items. But more generally the fold state could have a different type, and this allows more interesting applications of folds. Previous posts look at using folds to update conjugate Bayesian models and numerically solve differential equations.

This post uses a fold to compute mean, variance, skewness, and kurtosis. See this earlier post for an object-oriented approach. The code below seems cryptic out of context. The object-oriented post gives references for where these algorithms are developed. The important point for this post is that we can compute mean, variance, skewness, and kurtosis all in one pass through the data even though textbook definitions appear to require at least two passes. It’s also worth noting that the functional version is less than half as much code as the object-oriented version.

(Algorithms that work in one pass through a stream of data, updating for each new input, are sometimes called “online” algorithms. This is unfortunate now that “online” has come to mean something else.)

The Haskell function moments below returns the number of samples and the mean, but does not directly return variance, skewness and kurtosis. Instead it returns moments from which these statistics can easily be calculated using the mvsk function.

    moments (n, m1, m2, m3, m4) x = (n', m1', m2', m3', m4')
        where
            n' = n + 1
            delta = x - m1
            delta_n = delta / n'
            delta_n2 = delta_n**2
            t = delta*delta_n*n
            m1' = m1 + delta_n
            m4' = m4 + t*delta_n2*(n'*n' - 3*n' + 3) + 6*delta_n2*m2 - 4*delta_n*m3
            m3' = m3 + t*delta_n*(n' - 2) - 3*delta_n*m2
            m2' = m2 + t

    mvsk (n, m1, m2, m3, m4) = (m1, m2/(n-1.0), (sqrt n)*m3/m2**1.5, n*m4/m2**2 - 3.0)

Here’s an example of how you would use this Haskell code to compute statistics for the list [2, 30, 51, 72]:

    ghci>  mvsk $ foldl moments (0,0,0,0,0) [2, 30, 51, 72]
    (38.75, 894.25, -0.1685, -1.2912)

The foldl applies moments first to its initial value, the 5-tuple of zeros. Then it iterates over the data, taking data points one at a time and visiting each point only once, returning a new state from moments each time. Another way to say this is that after processing each data point, moments returns the 5-tuple that it would have returned if that data only consisted of the values up to that point.
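
For comparison, here’s roughly what the same fold looks like in Python using functools.reduce; this is just a sketch of the idea, a port of the Haskell above.

    from functools import reduce

    def moments(state, x):
        # state = (n, m1, m2, m3, m4); update all five in one pass per observation
        n, m1, m2, m3, m4 = state
        n1 = n + 1
        delta = x - m1
        delta_n = delta / n1
        delta_n2 = delta_n**2
        t = delta * delta_n * n
        m1 += delta_n
        m4 += t*delta_n2*(n1*n1 - 3*n1 + 3) + 6*delta_n2*m2 - 4*delta_n*m3
        m3 += t*delta_n*(n1 - 2) - 3*delta_n*m2
        m2 += t
        return (n1, m1, m2, m3, m4)

    def mvsk(state):
        n, m1, m2, m3, m4 = state
        return (m1, m2/(n - 1), n**0.5 * m3 / m2**1.5, n*m4/m2**2 - 3)

    print(mvsk(reduce(moments, [2, 30, 51, 72], (0, 0.0, 0.0, 0.0, 0.0))))
    # approximately (38.75, 894.25, -0.1685, -1.2912)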

For a non-numerical example of folds, see my post on sorting.

Ten spectral graph theory posts

[Image: an Erdős–Rényi random graph]

Here are 10 blog posts I wrote earlier this year about spectral graph theory, studying graphs via the eigenvalues of matrices associated with the graphs.

ODE solver as a functional fold

One of the most widely used numerical algorithms for solving differential equations is the 4th order Runge-Kutta method. This post shows how the Runge-Kutta method can be written naturally as a fold over the set of points where the solution is needed. These points do not need to be evenly spaced.

Given a differential equation of the form

y' = f(t, y)

with initial condition y(t₀) = y₀, the 4th order Runge-Kutta method advances the solution by an amount of time h by

y_{n+1} = y_n + \frac{h}{6}\left( k_{n1} + 2k_{n2} + 2k_{n3} + k_{n4}\right)


\begin{aligned}
k_{n1} &= f(t_n, y_n) \\
k_{n2} &= f(t_n + 0.5h, y_n + 0.5hk_{n1}) \\
k_{n3} &= f(t_n + 0.5h, y_n + 0.5hk_{n2}) \\
k_{n4} &= f(t_n + h, y_n + hk_{n3})
\end{aligned}

The Haskell code for implementing the accumulator function for foldl looks very much like the mathematical description above. This is a nice feature of the where syntax.

    rk (t, y) t' = (t', y + h*(k1 + 2.0*k2 + 2.0*k3 + k4)/6.0)
        where
            h  = t' - t
            k1 = f t y
            k2 = f (t + 0.5*h) (y + 0.5*h*k1)
            k3 = f (t + 0.5*h) (y + 0.5*h*k2)
            k4 = f (t + 1.0*h) (y + 1.0*h*k3)

Suppose we want to solve the differential equation y' = (t² − y²) sin(y) with initial condition y(0) = -1, and we want to approximate the solution at [0.01, 0.03, 0.04, 0.06]. We would implement the right side of the equation as

     f t y = (t**2 - y**2)*sin(y)

and fold the function rk over our time steps with

    foldl rk (0, -1) [0.01, 0.03, 0.04, 0.06]

This returns (0.06, -0.9527). The first part, 0.06, is no surprise since obviously we asked for the solution up to 0.06. The second part, -0.9527, is the part we’re more interested in.

If you want to see the solution at all the specified points and not just the last one, replace foldl with scanl.

    scanl rk (0, -1) [0.01, 0.03, 0.04, 0.06]

This returns

    [(0.0, -1.0), (0.01, -0.9917), (0.03, -0.9756), (0.04, -0.9678), (0.06, -0.9527)]

As pointed out in the previous post, writing algorithms as folds separates the core of the algorithm from data access. This would allow us, for example, to change rk independently, such as using a different order Runge-Kutta method. (Which hardly anyone does. Fourth order is a kind of sweet spot.) Or we could swap out Runge-Kutta for a different ODE solver entirely just by passing a different function into the fold.

Next, see how you can use a fold to compute mean, variance, skewness, and kurtosis in one pass through a stream of data.

Functional folds and conjugate models


Bayesian calculations are intrinsically recursive: The posterior distribution from one step becomes the prior for the next step.

With conjugate models, the prior and posterior belong to the same family of distributions. If a distribution from this family is determined by a fixed set of parameters, we only need to update these parameters at each step. This updating process is defined by an integration problem, i.e. applying Bayes’ theorem, but it’s common with conjugate models for the integration to reduce to a simple algebraic operation on the parameters.

For example, suppose some binary event happens with probability θ, where θ has a beta(α, β) distribution. When we take our next observation, the posterior distribution becomes beta(α+1, β) if the event occurred and beta(α, β+1) if it didn’t. There is an integration problem in the background that reduces to simply adding 1 to the α parameter for every success and 1 to the β parameter for every failure. (Standard terminology is to call an observation of the thing you’re interested in a “success” regardless of whether the thing you’re observing is good or bad. “Failure” is just an observation of the thing not happening, even if it’s good that it didn’t happen.)

Conjugate models have the structure of a fold in functional programming, also known as a reduce or accumulate operation. We’ll show below how to compute the posterior distribution in a beta-binomial and normal-normal model using folds in Haskell. (Technically a left associative fold, foldl. Haskell also has a right associative fold foldr that we won’t be concerned about here.)

Folds in Haskell

What does foldl require? Let’s ask GHCi, the Haskell REPL. Prior to GHC version 7.10 we would get the following.

     ghci> :type foldl
     foldl :: (a -> b -> a) -> a -> [b] -> a

Unfortunately the type variables a and b just mean “one type” and “a possibly different type” and tell us nothing about how the types are used, so let’s unravel this a little at a time.

Here “a” is the type of the accumulator state. In our case, it will be the prior and posterior distribution parameters because that’s what’s being updated for each data point. Conjugacy makes this work since the prior and posterior distributions belong to the same family. The “b” type is our data. So you could read the type of foldl as follows.

foldl takes three things: a function to fold over the data, an initial accumulator state, and an array of data. It returns the final accumulator state. The accumulator function takes an accumulator state and a data value, and returns a new accumulator state.

Starting with GHC 7.10 the type of foldl is a little more general.

     ghci> :type foldl
     foldl :: Foldable t => (a -> b -> a) -> a -> t b -> a

Instead of foldl operating only on lists, it can operate on any “foldable” container. To recover the type from earlier versions of the compiler, replace t with []. (Haskell uses [] b and [b] as equivalent ways of indicating a list of elements of type b.) There is one other difference I’ve edited out: The latest version of GHCi reverses the roles of ‘a’ and ‘b’. This is confusing, but has no effect since the type variable names have no meaning. I swapped the a’s and b’s back like they were to make the two definitions easier to compare.

Beta-binomial example

Returning to the beta-binomial model discussed above, suppose we have a sequence of observations consisting of a 1 when the event in question happened and a 0 when it did not. And suppose our prior distribution on the probability of the event has a beta(α, β) distribution.

Now what function will we fold over our data to update our prior into a posterior?

     ghci> let f params y = (fst params + y, snd params + 1 - y)

The function f takes a pair of numbers, the two beta distribution parameters, and a data point y, and returns an updated set of beta parameters. If the data point is a 1 (success) then the α parameter is incremented, and if the data point is a 0 (failure) then the β parameter is incremented. If we start with a beta(0.2, 0.8) prior and observe the data [1, 1, 0, 0, 1] then we apply f to our data as follows.

     ghci> let d = [1, 1, 0, 0, 1]
     ghci> foldl f (0.2, 0.8) d

The result will be (3.2, 2.8). The three successes incremented the first beta parameter and the two failures incremented the second beta parameter.

To see how the foldl operates, we can run scanl instead. It works like foldl but returns a list of intermediate accumulator values rather than just the final accumulator value.

     ghci> scanl f (0.2, 0.8) d

This returns

     [ (0.2, 0.8), (1.2, 0.8), (2.2, 0.8), (2.2, 1.8), (2.2, 2.8), (3.2, 2.8) ].

The initial accumulator state is (0.2, 0.8) because we started with a beta(0.2, 0.8) prior. Then we increment the accumulator state to (1.2, 0.8) because the first data point was a 1. Then we increment the state to (2.2, 0.8) because the second data point is also a 1. Next we see two failures in a row and so the second part of the accumulator is incremented two times. After observing the last data point, a success, our final state is (3.2, 2.8), just as when we applied foldl.

Normal-normal example

Next we consider another Bayesian model with a conjugate prior. We assume our data come from a normal distribution with mean θ and known variance 1. We assume a conjugate prior distribution on θ, normal with mean μ₀ and variance τ₀².

This time the posterior distribution calculation is more complicated and so our accumulator function is more complicated. But the form of the calculation is the same as above: we fold an accumulator function over the data.

Here is the function that encapsulates our posterior distribution calculation.

     f params y = ((m/v + y)*newv, newv)
         where
             m = fst params -- mean
             v = snd params -- variance
             newv = v/(v + 1)

Now suppose our prior distribution on θ has mean 0 and variance 4, and we observe two values, [3, 5].

     ghci> foldl f (0, 4) [3, 5]

This returns (3.5555, 0.4444). To see the intermediate accumulation state, i.e. after just seeing y = 3, we run scanl instead and see

     [ (0, 4), (2.4, 0.8), (3.5555, 0.4444) ]


This post was inspired by a paper by Brian Beckman (in progress) that shows how a Kalman filter can be implemented as a fold. From a Bayesian perspective, the thing that makes the Kalman filter work is that a certain multivariate normal model has a conjugate prior. This post shows that conjugate models more generally can be implemented as folds over the data. That’s interesting, but what does it buy you? Brian’s paper discusses this in detail, but one advantage is that it completely separates the accumulator function from the data, so the former can be tested in isolation.

Next up: ODEs

The next post shows how to implement an ODE solver (4th order Runge-Kutta) as a fold over the list of time steps where the solution is needed.

Cepstral alanysis vocabulary

An earlier post defined cepstrum and quefrency. This post explains some of the other quirky terms introduced in the same paper by Bogert, Healy, and Tukey. (Given Tukey’s delight in coining words, we can assume he was the member of the trio responsible for the new terms.)

The paper [1] explains the reason for the new twists on familiar words:

In general, we find ourselves operating on the frequency side in ways customary on the time side and vice versa. Experience has made it clear that “words that sound like other words,” although strange at first sight, considerably reduce confusion on balance. These parallel or “paraphrased” words are made by the interchange of consonants or consonant groups, as in “alanysis” from “analysis,” and are introduced as needed.

The magnitude and phase of a cepstrum are called gamnitude and saphe. (The latter explains the pun “saphe cracking” in the title.)

Filtering in the cepstral domain is called liftering. A high-pass filter corresponds to a long-pass lifter and a low-pass filter corresponds to a short-pass lifter.

Harmonics in spectra correspond to rahmonics in cepstra.

Some of these terms are helpful. As explained in the previous post, the independent variable in cepstral analysis, quefrency, differs enough from frequency that it helps to have a separate term for it. Using the terms long and short rather than high and low is helpful for the same reason. Using repiod for the analog of period seems gratuitous, but maybe it’s necessary for consistency. Once you introduce some new terminology, you have to keep going.



[1] Bruce P. Bogert, M. J. R. Healy, and John W. Tukey. “The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking.” Collected Works of John Tukey, Volume 1.

Cepstrum, quefrency, and pitch

[Photo of John Tukey]

John Tukey coined many terms that have passed into common use, such as bit (a shortening of binary digit) and software. Other terms he coined are well known within their niche: boxplot, ANOVA, rootogram, etc. Some of his terms, such as jackknife and vacuum cleaner, were not new words per se but common words he gave a technical meaning to.

Cepstrum is an anagram of spectrum. It involves an unusual use of power spectra, and is roughly analogous to making anagrams of a word. A related term, one we will get to shortly, is quefrency, an anagram of frequency. Some people pronounce the ‘c’ in cepstrum hard (like ‘k’) and some pronounce it soft (like ‘s’).

Let’s go back to an example from my post on guitar distortion. Here’s a note played with a fairly large amount of distortion:


And here is its power spectrum:

[Power spectrum of a single note played with distortion]

There’s a lot going on in the spectrum, but the peaks are very regularly spaced. As I mentioned in the post on the sound of a leaf blower, this is the fingerprint of a sound with a definite pitch. Spikes in the spectrum alone don’t indicate a definite pitch if they are irregularly spaced.

The peaks are fairly periodic. How do you find periodic patterns in a signal? Fourier transform! But if you simply take the Fourier transform of a Fourier transform, you essentially get the original signal back. The key to the cepstrum is to do something else between the two Fourier transforms.

The cepstrum starts by taking the Fourier transform, then the magnitude, then the logarithm, and then the inverse Fourier transform.

When we take the magnitude, we throw away phase information, which we don’t need in this context. Taking the log of the magnitude is essentially what you do when you compute sound pressure level. Some define the cepstrum using the magnitude of the Fourier transform and some the magnitude squared. Squaring only introduces a factor of 2 once we take logs, so it doesn’t affect the location of the peaks, only their amplitude.

Taking the logarithm compresses the peaks, bringing them all into roughly the same range, making the sequence of peaks roughly periodic.

When we take the inverse Fourier transform, we now have something like a frequency, but inverted. This is what Tukey called quefrency.
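
Here’s a sketch of the whole pipeline in Python with NumPy, using a synthetic 440 Hz tone rather than the guitar recording, just to show the steps:

    import numpy as np

    def cepstrum(signal):
        # FFT -> magnitude -> log -> inverse FFT
        spectrum = np.fft.rfft(signal)
        log_mag = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
        return np.fft.irfft(log_mag)

    fs = 48000                                 # samples per second
    t = np.arange(fs) / fs                     # one second of audio
    x = sum(np.sin(2*np.pi*440*k*t)/k for k in range(1, 20))  # 440 Hz tone with overtones
    c = cepstrum(x)
    q = np.argmax(c[50:fs//2]) + 50            # ignore the very low quefrencies
    print(q, fs/q)                             # quefrency of the peak and implied pitch, near 440 Hz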

Looking at the guitar power spectrum above, we see a sequence of peaks spaced 440 Hz apart. When we take the inverse Fourier transform of this, we’re looking at a sort of frequency of a frequency, what Tukey calls quefrency. The quefrency scale is inverted: sounds with a high frequency fundamental have overtones that are far apart in the frequency domain, so the sequence of overtone peaks has low frequency.

Here’s the plot of the cepstrum for the guitar sample.

[Cepstrum of the electric guitar sample]

There’s a big peak at 109 on the quefrency scale. The audio clip was recorded at 48000 samples per second, so the 109 on the quefrency scale corresponds to a frequency of 48000/109 = 440 Hz. The second peak is at quefrency 215, which corresponds to 48000/215 = 223 Hz. The second peak corresponds to the perceived pitch of the note, A3, and the first peak corresponds to its first harmonic, A4. (Remember the quefrency scale is inverted relative to the frequency scale.)

I cheated a little bit in the plot above. The very highest peaks are at 0. They are so large that they make it hard to see the peaks we’re most interested in. These low quefrency peaks correspond to very high frequency noise, near the edge of the audible spectrum or beyond.


Bring out your equations!

Nice discussion from Fundamentals of Kalman Filtering: A Practical Approach by Paul Zarchan and Howard Musoff:

Often the hardest part in Kalman filtering is the subject that no one talks about—setting up the problem. This is analogous to the quote from the recent engineering graduate who, upon arriving in industry, enthusiastically says, “Here I am, present me with your differential equations!” As the naive engineering graduate soon found out, problems in the real world are frequently not clear and are subject to many interpretations. Real problems are seldom presented in the form of differential equations, and they usually do not have unique solutions.

Whether it’s Kalman filters, differential equations, or anything else, setting up the problem is the hard part, or at least a hard part.

On the other hand, it’s about as impractical to only be able to set up problems as it is to only be able to solve them. You have to know what kinds of problems can be solved, and how accurately, so you can formulate a problem in a tractable way. There’s a feedback loop: provisional problem formulation, attempted solution, revised formulation, etc. It’s ideal when one person can set up and solve a problem, but it’s enough for the formulators and solvers to communicate well and have some common ground.


Family tree numbering

When you draw a tree of your ancestors, things quickly get out of hand. There are twice as many nodes each time you go back a generation, and so the size of paper you need grows exponentially. Things also get messy because typically you know much more about some lines than others. If you know much about your ancestry, one big tree isn’t going to work.

[Ahnentafel numbering chart from 1590]

There’s a simple solution to this problem, one commonly used in genealogy: assign everyone in the tree a number, starting with yourself as 1. Then follow two simple rules:

  1. The father of person n has number 2n.
  2. The mother of person n has number 2n + 1.

You can tell where someone fits into the tree easily from their number. Men have even numbers, women odd numbers. The number of someone’s child is half their number (rounding down if you get a fraction). For example, person 75 on your tree must be a woman. Her husband would be 74, her child 37, her father 150, etc.

Taking the logarithm base 2 tells you how many generations back someone is. That is, person n is ⌊log₂ n⌋ generations back. Going back to our example of 75, this person would be 6 generations back because log₂ 75 = 6.2288. (Here ⌊x⌋ is the “floor” of x, the largest integer less than or equal to x. Using the same notation, the child of n is ⌊n/2⌋.)

Said another way, the people m generations back have numbers 2^m through 2^(m+1) - 1. Your paternal line has numbers equal to powers of 2, and your maternal line has numbers one less than powers of 2.

If you write out a person’s number in binary, you stick a 0 on the end to find their father and a 1 on the end to find their mother. So your paternal grandmother, for example, would have number 101 in binary. Start with your number: 1. Then tack on a zero for your father: 10. Then tack on a 1 for his mother: 101.

In our example of 75 above, this number is 1001011 in binary. Leave off the one on the left, then read from left to right saying “father” every time you see a 0 and “mother” every time you see a 1. So person 75 is your father’s father’s mother’s father’s mother’s mother.
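
Here’s a short Python sketch of that procedure:

    def ancestor_path(n):
        # bin(n) looks like '0b1xxxx'; drop the '0b' prefix and the leading 1,
        # then read left to right: 0 means father, 1 means mother.
        return ["father" if b == "0" else "mother" for b in bin(n)[3:]]

    def generations_back(n):
        return n.bit_length() - 1   # floor(log2 n)

    print(ancestor_path(75))     # father, father, mother, father, mother, mother
    print(generations_back(75))  # 6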

This numbering system goes back to at least 1590. In that year Michaël Eytzinger published the chart in the image above, giving the genealogy of Henry III of France.
