Particle filter and unscented Kalman filter in a nutshell

Suppose you have a linear dynamic system. That is, the function that predicts the next state from the current state is linear. Suppose also that the states in your system are not known precisely but have some uncertainty modeled by a (multivariate) normal distribution. Then the uncertainty in the state at the next step also has a normal distribution, because a linear transformation of a normal distribution remains normal. This is a very high-level description of the classic Kalman filter.

When the transition from one state to another is nonlinear, the probability distribution around future states is not normal. There are many variations on the Kalman filter that amount to various approximations to get around this core difficulty: extended Kalman filters, unscented Kalman filters, particle filters, etc. Here I’ll just discuss unscented Kalman filters and particle filters. This will be a very hand-wavy discussion, but it will give the basic ideas.

It’s easy to push discrete random points through a nonlinear transformation. Calculating the effect of a nonlinear transformation on continuous random variables is more work. The idea of an unscented Kalman filter is to create a normal distribution that approximates the result of a nonlinear transformation by seeing what happens to a few carefully chosen points. The idea of particle filtering is to randomly generate a cloud of points and push these points through the nonlinear transformation.
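To make the particle idea concrete, here is a minimal sketch in Python. The transition function f below is made up purely for illustration, and this is only the propagation step; a real particle filter would also weight and resample the points using measurements.

      import numpy as np

      rng = np.random.default_rng(0)

      def f(x):
          """A hypothetical nonlinear state transition (illustration only)."""
          return np.sin(x) + 0.1 * x**2

      # Current belief about the state: normal with mean 1 and standard deviation 0.2
      particles = rng.normal(1.0, 0.2, size=10_000)

      # Push the cloud of points through the nonlinear transformation
      propagated = f(particles)

      # Summarize the propagated cloud; it need not be normal any more
      print(propagated.mean(), propagated.std())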

This may make it sound like (unscented) Kalman filters and particle filters are trivial to implement. Far from it. The basic framework is easy to understand, but getting good performance on a particular problem can take a lot of work.

Heavy-tailed random matrices

Suppose you fill the components of a matrix with random values. How are the eigenvalues distributed?

We limit our attention to large, symmetric matrices. We fill the entries of the matrix on the diagonal and above the diagonal with random elements, then fill in the elements below the diagonal by symmetry.

If we choose our random values from a thin-tailed distribution, then Wigner’s semicircle law tells us what to expect of the eigenvalue distribution: if our matrices are large enough, we expect the probability density of the eigenvalues to be semicircular. To illustrate this, we’ll fill a matrix with samples from a standard normal distribution and compute its eigenvalues with the following Python code.

      import numpy as np
      import matplotlib.pyplot as plt

      N = 5000
      A = np.random.normal(0, 1, (N, N))     # N x N matrix of standard normal samples
      B = (A + A.T)/np.sqrt(2)               # symmetrize, rescaling so entries keep variance 1
      eigenvalues = np.linalg.eigvalsh(B)    # eigenvalues of the symmetric matrix B
      print(max(eigenvalues), min(eigenvalues))

      plt.hist(eigenvalues, bins=70)
      plt.show()

We first create an N by N non-symmetric matrix, then make it symmetric by adding it to its transpose. (That’s easier than only creating the upper-triangular elements.) We divide by the square root of 2 to return the variance of each component to its original value, in this case 1. The eigenvalues in this particular experiment ran from -141.095 to 141.257 and their histogram shows the expected semi-circular shape.

[Figure: eigenvalue distribution with normally distributed matrix entries]

Wigner’s semicircle law does not require the samples to come from a normal distribution. Any distribution with finite variance will do. We illustrate this by replacing the normal distribution with a Laplace distribution with the same variance and rerunning the code. That is, we change the definition of the matrix A to

      A = np.random.laplace(0, np.sqrt(0.5), (N, N))

and get very similar results. The eigenvalues ran from -140.886 to 141.514 and again we see a semicircular distribution.

[Figure: eigenvalue distribution for matrix with entries drawn from Laplace distribution]

But what happens when we draw samples from a heavy-tailed distribution, i.e. one without finite variance? We’ll use a Student-t distribution with ν = 1.8 degrees of freedom. When ν > 2 the t-distribution has variance ν/(ν – 2), but for smaller values of ν it has no finite variance. We change the definition of the matrix A to the following:

      A = np.random.standard_t(1.8, (N, N))

and now the distribution is quite different.

[Figure: eigenvalue distribution for matrix with entries drawn from Student t distribution with 1.8 degrees of freedom]

In this case the minimum eigenvalue was -9631.558 and the largest was 9633.853. When the matrix components are selected from a heavy-tailed distribution, the eigenvalues also have a heavier-tailed distribution. In this case, nearly all the eigenvalues are between -1000 and 1000, but the largest and smallest were 10 times larger. The eigenvalues are more variable, and their distribution looks more like a normal distribution and less like a semicircle.

The distributions for all thin-tailed symmetric matrices are the same. They have a universal property. But each heavy-tailed distribution gives rise to a different distribution on eigenvalues. With apologies to Tolstoy, thin-tailed matrices are all alike; every thick-tailed matrix is thick-tailed in its own way.

Update: As the first comment below rightfully points out, the diagonal entries should be divided by 2, not sqrt(2). Each of the off-diagonal elements of A + A^T is the sum of two independent random variables, but the diagonal elements are twice what they were in A. To put it another way, the diagonal elements are the sum of perfectly correlated random variables, not independent random variables.
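Here is one way to apply the correction, a sketch rather than the code actually used for the rerun: symmetrize as before, then set the diagonal of B to the diagonal of A, which is the same as dividing the diagonal of A + A^T by 2.

      A = np.random.normal(0, 1, (N, N))
      B = (A + A.T)/np.sqrt(2)           # off-diagonal entries: scaled sums of independent samples
      np.fill_diagonal(B, np.diag(A))    # diagonal entries: (A + A.T)/2, i.e. A's own diagonal
      eigenvalues = np.linalg.eigvalsh(B)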

I reran the simulations with the corrected code and it made no noticeable difference, either to the plots or to the range of the eigenvalues.

Simplify integration with complex variables

Last night I was helping my daughter with her calculus homework. One of the problems was the following integral:

\int \exp(-x) \sin(4x) \, dx

This is an interesting problem for two reasons. First, it illustrates a clever variation on integration by parts; that’s why it was assigned. But it can also be computed using complex variables. As is often the case, the “complex” approach is simpler. Below I’ll show the solution the students were expected to find, then one that they wouldn’t be expected to find.

Integration by parts

The traditional approach to this integral is to integrate by parts. Letting u = sin(4x), the integral becomes

- \exp(-x) \sin(4x) + 4 \int \exp(-x) \cos(4x) \, dx

Next we integrate by parts one more time, this time letting u = cos(4x). This gives us

- \exp(-x) \sin(4x) - 4 \exp(-x) \cos(4x) -16\int \exp(-x) \sin(4x) \, dx

At this point it looks like we’re getting nowhere. We could keep on integrating by parts forever. Not only that, we’re going in circles: we have an integral that’s just like the one we started with. But then the clever step is to realize that this is a good thing. Let I be our original integral. Then

I = -\exp(-x) \left(\sin(4x) + 4 \cos(4x)\right) - 16 I

Now we can solve for I:

I = - \frac{1}{17} \exp(-x) (\sin(4x) + 4 \cos(4x))

Complex variables

Here’s another approach. Recognize that sin(4x) is the imaginary part of exp(4ix) and so our integral is the imaginary part of

\int \exp(-x) \exp(4ix) \,dx = \int \exp\left((-1+4i)x\right) \,dx

which we can integrate immediately:

\frac{1}{-1+4i} \exp\left((-1+4i)x\right).

There’s still algebra to do, but the calculus is over. And while the algebra will take a few steps, it’s routine.

First, let’s take care of the fraction.

\frac{1}{-1+4i} = \frac{1}{-1+4i} \frac{-1 - 4i}{-1 - 4i} = - \frac{1+4i}{17}.

Next,

\exp\left((-1+4i)x\right) = \exp(-x)\left( \cos(4x) + i \sin(4x) \right)

and so our integral is the imaginary part of

-\frac{1}{17} \exp(-x) (1 + 4i)\left( \cos(4x) + i \sin(4x) \right)

which gives us the same result as before.
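As a sanity check, not part of the original derivation, a computer algebra system agrees with this antiderivative. Here is a quick SymPy verification:

      from sympy import symbols, exp, sin, cos, integrate, simplify

      x = symbols('x')
      antideriv = integrate(exp(-x)*sin(4*x), x)
      target = -exp(-x)*(sin(4*x) + 4*cos(4*x))/17
      print(simplify(antideriv - target))   # 0, so the two antiderivatives agree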

The complex variable approach requires one insight: recognizing a sine as the imaginary part of an exponential. The traditional approach requires several insights: two integrations by parts and solving for the original integral.

Sticky cards

Suppose you shuffle a deck of cards. How likely is it that some two cards that were next to each other before the shuffle are still next to each other after the shuffle?

It depends on how well you shuffle the cards. If you do what’s called a “faro shuffle” then the probability of a pair of cards remaining neighbors is 0. In a faro shuffle, if the cards were originally numbered 1 through 52, the new arrangement will be 1, 27, 2, 28, 3, 29, …, 26, 52. All pairs are split up.

But if you shuffle the cards so well that the new arrangement is a random permutation of the original, there’s about an 86% chance that at least one pair of neighbors remains neighbors. To be more precise, consider permutations of n items. As n increases, the probability that no originally adjacent pair remains adjacent converges to exp(-2), and so the probability that at least one pair remains adjacent converges to 1 – exp(-2).

By the way, this considers pairs staying next to each other in either order. For example, if the cards were numbered consecutively, then either a permutation with 7 followed by 8 or 8 followed by 7 would count.

More generally, for large n, the probability of k pairs remaining consecutive after a shuffle is 2^k e^{-2} / k!.
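Here is a minimal Monte Carlo check of the 1 – exp(-2) limit. This sketch is my own illustration, not from the original post, and the helper name has_sticky_pair is made up.

      import numpy as np

      def has_sticky_pair(perm):
          """True if some originally adjacent pair of cards is still adjacent, in either order."""
          pos = np.argsort(perm)   # pos[c] = position of card c after the shuffle
          return any(abs(pos[i] - pos[i + 1]) == 1 for i in range(len(perm) - 1))

      n, runs = 52, 10_000
      rng = np.random.default_rng(42)
      count = sum(has_sticky_pair(rng.permutation(n)) for _ in range(runs))
      print(count / runs, 1 - np.exp(-2))   # both should be near 0.86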

One application of this result would be testing. If you randomly permute a list of things to break up consecutive pairs, it probably won’t work. Instead, you might want to split your list in half, randomize each half separately, then interleave the two half lists as in the faro shuffle.

Another application would be fraud detection. If you’re suspicious that a supposedly random process isn’t splitting up neighbors, you could use the result presented here to calibrate your expectations. Maybe what you’re seeing is consistent with good randomization. Or maybe not. Note that the result here looks at any pair remaining together. If a particular pair remains together consistently, that’s more suspicious.

Reference: David P. Robbins, The Probability that Neighbors Remain Neighbors After Random Rearrangements. The American Mathematical Monthly, Vol. 87, No. 2 (1980), pp. 122–124.

Category Theory and Facebook

From Drew Armstrong’s notes on adjoint functors:

Once upon a time, my opinion of category theory was the same as my opinion of Facebook: if I ignore it for long enough, hopefully it will go away. It is now my educated opinion that category theory will not go away, and in fact the language of category theory will continue to spread until it becomes the default foundation of mathematics.

Infinite primes via Fibonacci numbers

A well-known result about Fibonacci numbers says

gcd(F_m, F_n) = F_{gcd(m, n)}

In words, the greatest common divisor of the mth and nth Fibonacci numbers is the gth Fibonacci number where g is the greatest common divisor of m and n. You can find a proof here.
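Here’s a quick numerical check of the identity, my own sketch rather than anything from the source:

      from math import gcd
      from functools import lru_cache

      @lru_cache(maxsize=None)
      def fib(n):
          """Fibonacci numbers with F_1 = F_2 = 1."""
          return n if n < 2 else fib(n - 1) + fib(n - 2)

      assert all(gcd(fib(m), fib(n)) == fib(gcd(m, n))
                 for m in range(1, 30) for n in range(1, 30))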

M. Wunderlich used this identity to create a short, clever proof that there are infinitely many primes.

Suppose on the contrary that there are only k prime numbers, and consider the set of Fibonacci numbers with prime indices: F_{p_1}, F_{p_2}, …, F_{p_k}. The Fibonacci theorem above tells us that these Fibonacci numbers are pairwise relatively prime: each pair has greatest common divisor F_1 = 1.

Each of the Fibonacci numbers in our list must have only one prime factor. Why? Because we have assumed there are only k prime numbers, and no two of our Fibonacci numbers share a prime factor. But F_19 = 4181 = 37 × 113. We’ve reached a contradiction, and so there must be infinitely many primes.

Source: M. Wunderlich, Another proof of the infinite primes theorem. American Mathematical Monthly, Vol. 72, No. 3 (March 1965), p. 305.

Here are a couple more unusual proofs that there are infinitely many primes. The first uses a product of sines. The second is from Paul Erdős. It also gives a lower bound on π(N), the number of primes less than N.

Safe primes, Sylow theorems, and Cryptography

A logarithm is the inverse of an exponential, and so we can generalize the idea of a logarithm wherever we can generalize the idea of an exponential. In particular, we can raise elements to exponents in a discrete group, and that leads to the definition of a discrete logarithm.

Diffie-Hellman public key cryptography is based on the assumption that discrete logarithms are hard to compute. There are algorithms to compute discrete logarithms that are much faster than brute force. For example, baby-step giant-step is a fairly simple algorithm. There are more efficient algorithms as well, but the best known algorithms are still much slower than raising numbers to powers. Whenever you find something that is much harder to undo than to do, it might be useful in cryptography, and that is the case with discrete logs.
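For the curious, here is a sketch of baby-step giant-step with toy parameters. The modulus and base below are arbitrary small examples of my own choosing; real discrete log problems use far larger numbers, which is exactly why the gap between exponentiation and this roughly √p search matters.

      from math import isqrt

      def discrete_log(g, h, p):
          """Find x with pow(g, x, p) == h by baby-step giant-step."""
          m = isqrt(p) + 1
          baby = {pow(g, j, p): j for j in range(m)}   # baby steps: g^j for j < m
          step = pow(g, -m, p)                         # g^(-m) mod p (Python 3.8+)
          gamma = h % p
          for i in range(m):
              if gamma in baby:
                  return i*m + baby[gamma]             # x = i*m + j
              gamma = (gamma * step) % p               # giant step: multiply by g^(-m)
          return None

      p, g, x = 1000003, 2, 123456                     # toy sizes chosen for illustration
      h = pow(g, x, p)
      assert pow(g, discrete_log(g, h, p), p) == h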

Diffie-Hellman encryption requires users to compute exponentials and presumably requires attackers to compute discrete logs. I say “presumably” because it’s a fatal error in cryptography to assume an attacker has to solve the problem you think he’d have to solve. For example, you can create a simple encryption scheme by permuting the alphabet and encrypting each letter to its counterpart in the permutation. Someone might naively think “No one can break this because they’d have to try 26! permutations of the alphabet, over 400 million million million million possibilities!” Except that’s not how anyone approaches a substitution cipher. If it were, you wouldn’t see cryptograms in puzzle books.

As far as we know, discrete logarithms are hard to compute when working over integers mod p where p is a large prime, except for primes that have certain properties. We’ll look at what those properties are below and how to avoid them.

For a prime p, the integers mod p form a finite field. They are a group under addition, and the non-zero elements form a group under multiplication. It’s the multiplicative group we care about here. This group has order p-1, i.e. it has p-1 elements.

A group of prime order has no nontrivial proper subgroups. But a group of composite order does. And our multiplicative group has order p-1, which is composite. (Except for p = 3, and cryptography depends on primes far, far bigger than 3.)

Sylow’s theorems tell us something about what kinds of subgroups a group must have. If s is prime and s^k is a factor of the order of our group, then the group has a subgroup of order s^k. We don’t want our multiplicative group to have any small-order subgroups because these would make it easier to compute discrete logarithms.

A safe prime p has the form 2q + 1 where q is also prime. Diffie-Hellman chooses safe primes for moduli because this means the multiplicative group of order p-1 = 2q has no small subgroups. (It has two small subgroups, {1} and {1, -1}, but these can easily be avoided. The algorithm requires picking a generator g, and as long as you don’t pick g to be 1 or -1 mod p, then g generates a subgroup of order q or 2q, and if p is gigantic, so is q.) Because q is prime, the subgroup of order q does not have any further subgroups.
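A toy example, with a safe prime far too small for real use, shows the subgroup structure. This sketch is my own illustration; the claim it checks is that apart from 1 and -1, every element has order q or 2q.

      from sympy import isprime

      q = 11
      p = 2*q + 1                      # 23 is a safe prime: both 23 and 11 are prime
      assert isprime(p) and isprime(q)

      def order(g, p):
          """Multiplicative order of g mod p."""
          k, x = 1, g % p
          while x != 1:
              x = (x * g) % p
              k += 1
          return k

      print({order(g, p) for g in range(2, p - 1)})   # {11, 22}: only q and 2q appear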

Related post: Probability that a number is prime

Automated theorem proving

When I first heard of automated theorem proving, I imagined computers being programmed to search for mathematical theorems interesting to a wide audience. Maybe that’s what a few of the pioneers in the area had in mind too, but that’s not how things developed.

The biggest uses for automated theorem proving have been highly specialized applications, not mathematically interesting theorems. Computer chip manufacturers use formal methods to verify that given certain inputs their chips produce certain outputs. Compiler writers use formal methods to verify that their software does the right thing. A theorem saying your product behaves correctly is very valuable to you and your customers, but nobody else. These aren’t the kinds of theorems that anyone would cite the way they might cite the Pythagorean theorem. Nobody would ever say “And therefore, by the theorem showing that this particular pacemaker will not fall into certain error modes, I now prove this result unrelated to pacemakers.”

Automated theorem provers are important in these highly specialized applications in part because the results are of such limited interest. For every theorem of wide mathematical interest, there are a large number of mathematicians who are searching for a proof or who are willing to scrutinize a proposed proof. A theorem saying that a piece of electronics performs correctly appeals to only the tiniest audience, and yet is probably much easier (for a computer) to prove.

The term “automated theorem proving” is overloaded to mean a couple things. It’s used broadly to include any use of computing in proving theorems, and it’s used more narrowly to mean software that searches for proofs or even new theorems. Most theorem provers in the broad sense are not automated theorem provers in the more narrow sense but rather proof assistants. They verify proofs rather than discover them. (There’s some gray zone. They may search on a small scale, looking for a way to prove a minor narrow result, but not search for the entire proof to a big theorem.) There have been computer-verified proofs of important mathematical theorems, such as the Feit-Thompson theorem from group theory, but I’m not aware of any generally interesting discoveries that have come out of a theorem prover.

Related post: Formal methods let you explore the corners

Probability of secure hash collisions

A hash function maps arbitrarily long input strings to fixed-length outputs. For example, SHA-256 maps its input to a string of 256 bits. A cryptographically secure hash function h is a one-way function, i.e. given a message m it’s easy to compute h(m) but it’s not practical to go the other way, to learn anything about m from h(m). Secure hash functions are useful for message authentication codes because it is practically impossible to modify m without changing h(m).

Ideally, a secure hash is “indistinguishable from a random mapping.” [1] So if a hash function has a range of size N, how many items can we send through the hash function before we can expect two items to have the same hash value? By the pigeonhole principle, we know that if we hash N+1 items, two of them are certain to have the same hash value. But it’s likely that a much smaller number of inputs will lead to a collision, two items with the same hash value.

The famous birthday problem illustrates this. You could think of birthdays as a random mapping of people into 366 possible values [2]. In a room of fewer than 366 people, it’s possible that everyone has a different birthday. But in a group of 23 people, there are even odds that two people have the same birthday.

Variations on the birthday problem come up frequently. For example, in seeding random number generators. And importantly for this post, the birthday problem comes up in attacking hash functions.

When N is large, hashing on the order of √N values is enough to produce a collision with substantial probability (about 40%, as we show below).

Proof

The proof below is a little informal. It could be made more formal by replacing the approximate equalities with equalities and adding the necessary little-o terms.

Suppose we’re hashing n items to a range of size N = n². The exact probability that all n items have unique hash values is given here. Taking the log of both sides gives us the first line of the proof below.

\begin{aligned}
\log( \mbox{Prob(unique)}) &= \log (\Gamma(n^2)) - \log (\Gamma(n^2 - n)) - n \log( n^2 )\\
&\approx (n^2 - n) \log(n^2) - (n^2 - n) \log(n^2 - n) - n \\
&= (n^2 - n) \log\left( \frac{n^2}{n^2 -n} \right) - n \\
&= (n^2 -n) \log\left( 1 + \frac{1}{n-1} \right) - n \\
&\approx (n^2 - n) \left( \frac{1}{n-1} - \frac{1}{2}\frac{1}{(n-1)^2} \right) - n \\
&= -\frac{n}{2(n-1)} \\
&\approx -\frac{1}{2}
\end{aligned}

The first approximation is based on the first three terms in the asymptotic expansion for log Γ given here, applied to both log gamma expressions. (The third terms from the two asymptotic series are the same so they cancel out.) The second line isn’t exactly what you’d get by applying the asymptotic expansion. It’s been simplified a little. The neglected terms are not a mistake but terms that can be left out because they go to zero.

The second approximation comes from using the first two terms in the power series for log(1 + x). One term isn’t enough since that would reduce to zero. The final approximation is simply taking the limit as n goes to infinity. Concretely, we’re willing to say that a billion and one divided by a billion is essentially 1.

Conclusions

So the probability of no collisions is exp(-1/2) or about 60%, which means there’s a 40% chance of at least one collision. As a rule of thumb, a hash function with range of size N can hash on the order of √N values before running into collisions.

This means that with a 64-bit hash function, there’s about a 40% chance of collisions when hashing 2^32 or about 4 billion items. If the output of the hash function is discernibly different from random, the probability of collisions may be higher. A 64-bit hash function cannot be secure since an attacker could easily hash 4 billion items. A 256-bit or 512-bit hash could in principle be secure since one could expect to hash far more items before collisions are likely. Whether a particular algorithm like SHA3-512 is actually secure is a matter for cryptologists, but it is at least feasible that a hash with a 512-bit range could be secure, based on the size of its range, while a 64-bit hash cannot be.

Numerical calculation

We used an asymptotic argument above rather than numerically evaluating the probabilities because this way we get a more general result. But even if we were only interested in a fixed but large n, we’d run into numerical problems. This is one of those not uncommon cases where a pencil-and-paper approximation is actually more accurate than direct calculation with no (explicit) approximations.

There are numerous numerical problems with direct calculation of the collision probability. First, without taking logarithms we’d run into overflow and underflow. Second, for large enough n, n² − n = n² in floating point representation. IEEE 754 doubles have 53 bits of precision. This isn’t enough to distinguish values that differ, say, in the 128th bit. Finally, the two log gamma terms are large, nearly equal numbers. The cardinal rule of numerical analysis is to avoid subtracting nearly equal numbers. If two numbers agree to k bits, you could lose k bits of precision in carrying out their difference. See Avoiding overflow, underflow, and loss of precision for more along these lines.
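For moderate n the log-gamma route does work; here is a sketch using SciPy. It computes the exact probability, using the Γ(N + 1)/Γ(N − n + 1) form rather than the post’s asymptotic approximation, and for much larger n it too would eventually fall victim to the cancellation just described.

      import numpy as np
      from scipy.special import gammaln

      def log_prob_unique(n, N):
          """Log probability that n items hashed uniformly into N values are all distinct."""
          return gammaln(N + 1) - gammaln(N - n + 1) - n*np.log(N)

      n = 10**4
      print(log_prob_unique(n, n**2))   # close to -1/2, as the asymptotic argument predicts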

Notes

[1] Cryptography Engineering by Ferguson, Schneier, and Kohno

[2] Leap year of course complicates things since February 29 birthdays are less common than other birthdays. Another complication is that birthdays are not entirely evenly distributed for the other days of the year. But these complications don’t ruin the party trick: In a room of 30 people, two people usually share a birthday.

Three proofs that 2017 is prime

Aaron Meurer asked on Twitter whether there’s a proof that 2017 is prime that would fit into 140 characters.

My first reply was this:

sqrt(2017) < 45.
2017 not divisible by 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, or 43.
Ergo prime.

I’m not sure that’s what he had in mind. There’s some implied calculation (which I didn’t actually do), so it’s kinda cheating. It would be interesting if there were something special about 2017 that would allow a more transparent proof.

(Implicit in the proof above is the theorem that if a number is composite, it has a prime factor no greater than its square root. If you multiply together numbers bigger than the square root of n, you get a product bigger than n.)
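The implied calculation is easy to carry out; here is one way, a sketch of my own rather than anything from the tweet:

      from math import isqrt

      n = 2017
      small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43]
      assert isqrt(n) < 45                          # so any prime factor would be in the list above
      assert all(n % p != 0 for p in small_primes)
      print(n, "is prime")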

Then for fun I gave two more proofs that are more sophisticated but would require far too much work to actually carry out by hand.

The first uses Fermat’s little theorem:

For 0 < a < 2017, a^2016 – 1 is divisible by 2017.
2017 is not one of the three Carmichael numbers < 2465.
Ergo prime.

Fermat’s little theorem says that if p is prime, then for any 0 < a < p, a^(p-1) – 1 is divisible by p. This is actually an efficient way to prove that a number is not prime. Find a number a such that the result doesn’t hold, and you’ve proved that p isn’t prime. For small numbers, the easiest way to show a number is not prime is to show its factors. But for very large numbers, such as those used in cryptography, it’s efficient to have a way to prove that a number has factors without having to actually produce those factors.

Unfortunately, Fermat’s little theorem gives a necessary condition for a number to be prime, but not a sufficient condition. It can appear to be prime for every witness (the bases a are called witnesses) and still not be a prime. The Carmichael numbers pass the Fermat primality test without being prime. The first four are 561, 1105, 1729, and 2465.
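Carrying out the Fermat check is also easy. Here is a sketch, my own addition, that tests every witness as the tweet-length proof requires:

      p = 2017
      # Fermat check for every witness 0 < a < p
      assert all(pow(a, p - 1, p) == 1 for a in range(1, p))
      # 2017 is not 561, 1105, or 1729, the three Carmichael numbers below 2465
      print(p, "passes the Fermat test for every witness and is not a small Carmichael number")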

For more on using Fermat’s little theorem to test for primality, see Probability that a number is prime.

The final proof was this:

2016! + 1 is divisible by 2017, and so by Wilson’s theorem 2017 is prime.

Unlike Fermat’s little theorem, Wilson’s theorem gives necessary and sufficient conditions for a number to be prime. A number n is prime if and only if (n-1)! + 1 is divisible by n. In theory you could use Wilson’s theorem to test whether a number is prime, but this would be horrendously inefficient. 2016! has 5,789 digits. (You can find out how many digits n! has without computing it using a trick described here.)
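As an aside, here is a sketch of counting the digits of 2016! without computing it, using the log-gamma function. This is my own illustration of the kind of trick alluded to above.

      from math import lgamma, log, floor

      digits = floor(lgamma(2017)/log(10)) + 1   # log10(2016!) = lgamma(2017)/ln(10)
      print(digits)                              # 5789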

Despite its inefficiency, you can actually use Wilson’s theorem and SymPy to prove that 2017 is prime.

      >>> from sympy import factorial
      >>> x = factorial(2016) + 1
      >>> x % 2017
      0

Monthly highlights

If you enjoy reading the articles here, you might like a monthly review of the most popular posts.

I send out a newsletter at the end of each month. I’ve sent out around 20 so far. They all have two parts:

  1. a review of the most popular posts of the month, and
  2. a few words about what I’ve been up to.

That’s it. Short and sweet. I might send out more mail than this someday, but in the nearly two years I’ve been doing this I’ve never sent more than one email a month.

If you’d like to subscribe, just enter your email address in the box on the side of the page labeled “Subscribe to my newsletter.” If you’re not reading this directly on the site, say you’re reading it in an RSS reader, then you can follow this link.

Cover time of a graph: cliques, chains, and lollipops

Cover time

The cover time of a graph is the expected time it takes a simple random walk on the graph to visit every node. A simple random walk starts at some node, then at each step chooses with equal probability one of the adjacent nodes. The cover time is defined to be the maximum over all starting points of expected time to visit all nodes.

It seems plausible that adding edges to a graph would decrease its cover time. Sometimes this is the case and sometimes it isn’t, as we’ll demonstrate below.

Chains, complete graphs, and lollipops

This post will look at three kinds of graphs:

  1. a chain of n vertices
  2. a complete graph on n vertices
  3. a “lollipop” with n vertices

A chain simply connects vertices in a straight line. A complete graph connects every node to every other node.

[Figure: a lollipop graph of order (5, 5)]

A lollipop with n vertices takes a chain with n/2 vertices and complete graph on n/2 vertices and joins them together. The image above is a lollipop graph of order 10. The complete graph becomes a clique in the new graph.

The image looks more like a kite than a lollipop because that’s the way my software (networkx) drew it by default. If you’d like, imagine the complete graph as more round and the chain as straighter, so that the graph more closely resembles a lollipop.

Random walks on graphs

Chains have a long cover time. Complete graphs have a short cover time. What about lollipops?

A plausible answer would be that a lollipop graph would have a moderate cover time, somewhere between that of a complete graph and a chain. But that’s not the case. In fact, the lollipop has a longer cover time than either the complete graph or the chain.

You could think of a lollipop graph with n vertices as a chain of length n with extra edges added. This shows that adding edges does not always decrease the cover time.

The cover times for the three graphs we consider here are of different orders: O(n log n) for the complete graph, O(n²) for the chain, and O(n³) for the lollipop.

Now these are only asymptotic results. Big-O notation leaves out proportionality constants. We know that for sufficiently large n the cover times are ordered so that the complete graph is the smallest, then the chain, then the lollipop. But does n have to be astronomically large before this happens? What happens with a modest size n?

There’s a theoretical result that says the cover time is bounded by 2m(n-1) where m is the number of edges and n the number of vertices. This tells us that when we say the cover time for the chain is O(n²) we can say more: since the chain has m = n-1 edges, its cover time is less than 2n², so at least in this case the proportionality constant isn’t large.

We’ll look at the cover times with some simulation results below, using n = 10 and n = 30 and 10,000 runs each.

With 10 nodes each, the cover times for the complete graph, chain, and lollipop were 23.30, 82.25, and 131.08 respectively.

With 30 nodes the corresponding cover times were 114.37, 845.16, and 3480.95.

So the cover times were ordered as the asymptotic theory would suggest.
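The post doesn’t show its simulation code, but here is a minimal sketch of how such an experiment might be set up with networkx. It averages the time from one fixed starting node rather than taking the maximum over starting points, and it uses fewer runs, so its numbers won’t exactly match those above.

      import random
      import networkx as nx

      def mean_cover_time(G, start, runs=1000):
          """Average number of steps for a simple random walk from `start` to visit every node."""
          n, total = G.number_of_nodes(), 0
          for _ in range(runs):
              current, visited, steps = start, {start}, 0
              while len(visited) < n:
                  current = random.choice(list(G.neighbors(current)))
                  visited.add(current)
                  steps += 1
              total += steps
          return total/runs

      n = 10
      print(mean_cover_time(nx.complete_graph(n), 0))
      print(mean_cover_time(nx.path_graph(n), 0))
      print(mean_cover_time(nx.lollipop_graph(n // 2, n // 2), 0))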

Changing names

I’ve just started reading Laurus, an English translation of a contemporary Russian novel. The book opens with this paragraph.

He had four names at various times. A person’s life is heterogeneous, so this could be seen as an advantage. Life’s parts sometimes have little in common, so little that it might appear that various people lived them. When this happens, it is difficult not to feel surprised that all these people carry the same name.

This reminded me of the section of James Scott’s Seeing Like a State that explains how names used to be more variable.

Among some peoples, it is not uncommon for individuals to have different names during different stages of life (infancy, childhood, adulthood) and in some cases after death; added to these are names used for joking, rituals, and mourning and names used for interactions with same-sex friends or with in-laws. Each name is specific to a certain phase of life, social setting, or interlocutor.

If someone’s name had more than one component, the final component might come from their profession (which could change) rather than their ancestry. Scott goes on to say

The invention of permanent, inherited patronyms was … the last step in establishing the necessary preconditions of modern statecraft. In almost every case it was a state project, designed to allow officials to identify, unambiguously, the majority of its citizens.

In short, governments insisted people adopt fixed names to make them easier to tax and to conscript. Before fixed names, governments would ask towns to provide so much tax money or so many soldiers because it could not tax or conscript citizens directly. For a famous example, see Luke’s account of the birth of Jesus: all went to be registered, each to his own town.

It’s hard to imagine people not needing fixed names. But when people lived on a smaller scale, interacting with a number of people closer to Dunbar’s number, there was no danger of ambiguity because there was more context.

Bernoulli numbers, Riemann zeta, and strange sums

In the previous post, we looked at sums of the first n consecutive powers, i.e. sums of the form

S_p(n) = \sum_{k=1}^n k^p

where p was a positive integer. Here we look at what happens when we let p be a negative integer and we let n go to infinity. We’ll learn more about Bernoulli numbers and we’ll see what is meant by apparently absurd statements such as 1 + 2 + 3 + … = -1/12.

If p < -1, then the limit as n goes to infinity of S_p(n) is ζ(-p). That is, for s > 1, the Riemann zeta function ζ(s) is defined by

\zeta(s) = \sum_{n=1}^\infty \frac{1}{n^s}

We don’t have to limit ourselves to real numbers s > 1; the definition holds for complex numbers s with real part greater than 1. That’ll be important below.

When s is a positive even number, there’s a formula for ζ(s) in terms of the Bernoulli numbers:

\zeta(2n) = (-1)^{n-1} 2^{2n-1} \pi^{2n} \frac{B_{2n}}{(2n)!}

The best-known special case of this formula is that

1 + 1/4 + 1/9 + 1/16 + … = π²/6.

It’s a famous open problem to find a closed-form expression for ζ(3) or any other odd argument.

The formula relating the zeta function and Bernoulli numbers tells us a couple things about the Bernoulli numbers. First, for n ≥ 1 the Bernoulli numbers with index 2n alternate sign. Second, by looking at the sum defining ζ(2n) we can see that it is approximately 1 for large n. This tells us that for large n, |B_{2n}| is approximately (2n)! / (2^{2n-1} π^{2n}).
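A quick SymPy check of the formula for the first few even arguments, my own sketch rather than anything from the post:

      from sympy import bernoulli, zeta, pi, factorial, simplify

      for n in range(1, 6):
          formula = (-1)**(n - 1) * 2**(2*n - 1) * pi**(2*n) * bernoulli(2*n) / factorial(2*n)
          assert simplify(formula - zeta(2*n)) == 0   # matches zeta(2), zeta(4), ..., zeta(10)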

We said above that the sum defining the Riemann zeta function is valid for complex numbers s with real part greater than 1. There is a unique analytic extension of the zeta function to the rest of the complex plane, except at s = 1. The zeta function is defined, for example, at negative integers, but the sum defining zeta in the half-plane Re(s) > 1 is NOT valid.

You may have seen the equation

1 + 2 + 3 + … = -1/12.

This is an abuse of notation. The sum on the left clearly diverges to infinity. But if the sum defining ζ(s) for Re(s) > 1 were valid for s = -1 (which it is not) then the left side would equal ζ(-1). The analytic continuation of ζ is valid at -1, and in fact ζ(-1) = -1/12. So the equation above is true if you interpret the left side, not as an ordinary sum, but as a way of writing ζ(-1). The same approach could be used to make sense of similar equations such as

1² + 2² + 3² + … = 0

and

1³ + 2³ + 3³ + … = 1/120.

Sums of consecutive powers

There’s a well-known formula for the sum of the first n positive integers:

1 + 2 + 3 + … + n = n(n + 1) / 2

There’s also a formula for the sum of the first n squares

1² + 2² + 3² + … + n² = n(n + 1)(2n + 1) / 6

and for the sum of the first n cubes:

1³ + 2³ + 3³ + … + n³ = n²(n + 1)² / 4

It’s natural to ask whether there’s a general formula for all exponents. There is, but it’s not entirely satisfying. There’s a single formula for the sum of the pth powers of the first n positive integers, but it involves mysterious coefficients known as Bernoulli numbers. So there’s a fairly simple formula for sums of powers in terms of Bernoulli numbers, but there’s no simple formula for Bernoulli numbers.

S_p(n) = \sum_{k=1}^n k^p = \frac{1}{p+1} \sum_{j=0}^p (-1)^j {p+1\choose j} B_j n^{p-j+1}

At first glance this might not seem that useful because we have a formula for a sum in terms of another sum, but note that the sums run over different ranges. The original sum runs up to n, and in application n might be very large. The sum on the right runs up to the exponent p, and in application p is usually quite small.

The first few Bernoulli numbers B_j, starting with j = 0, are 1, -1/2, 1/6, 0, -1/30, 0, …. This short list might give you some hope of finding a nice formula for the Bernoulli numbers, but they’re full of surprises. The 12th Bernoulli number, for example, is -691/2730.
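Here is a quick sanity check of the formula against direct summation, using the Bernoulli numbers listed above. This sketch is my own addition.

      from fractions import Fraction
      from math import comb

      # B_0 ... B_6 in the convention used here
      B = [Fraction(1), Fraction(-1, 2), Fraction(1, 6), Fraction(0),
           Fraction(-1, 30), Fraction(0), Fraction(1, 42)]

      def S(p, n):
          """Sum of k^p for k = 1..n via the Bernoulli-number formula."""
          return sum((-1)**j * comb(p + 1, j) * B[j] * n**(p - j + 1)
                     for j in range(p + 1)) / (p + 1)

      for p in range(1, 6):
          assert all(S(p, n) == sum(k**p for k in range(1, n + 1)) for n in range(1, 20))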

So how do you define the Bernoulli numbers? These numbers come up in many contexts, and so there are many ways to define them. One way is to say that the exponential generating function of the Bernoulli numbers is z / (exp(z) - 1). In other words, B_j is j! times the coefficient of z^j in the power series for z / (exp(z) - 1) centered at z = 0.

There are a couple ways to look at this definition. The first reaction is probably disappointment. “You won’t just tell me what the Bernoulli numbers are. You tell me that I have to calculate another term of this ugly power series every time I want a new Bernoulli number?!” But from a more advanced perspective, you might be grateful that the unwieldy Bernoulli numbers have such a simple exponential generating function. Often the most useful way to study special numbers is via their generating functions.
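For example, here is a sketch of reading the Bernoulli numbers off the power series with SymPy:

      from sympy import symbols, exp, series, factorial

      z = symbols('z')
      s = series(z/(exp(z) - 1), z, 0, 8).removeO()
      print([s.coeff(z, j) * factorial(j) for j in range(8)])
      # [1, -1/2, 1/6, 0, -1/30, 0, 1/42, 0]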

The next post will look at what happens when we let p be negative, and when we let n go to infinity. In other words, we’ll look at the Riemann zeta function.