Comparing three discrete power laws

Yesterday I wrote about the zeta distribution and the Yule-Simon distribution and said that they, along with the Zipf distribution, are all discrete power laws. This post fills in the details behind that statement.

The probability mass functions for the zeta, Zipf, and Yule-Simon distributions are as follows.

\begin{align*} f_\zeta(k; s) &= k^{-s} / \zeta(s) \\ f_y(k; \rho) &= \rho B(k, \rho+1) \\ f_z(k; N, s) &= k^{-s} / H_{N,s} \end{align*}

Here the subscripts ζ, y, and z stand for zeta, Yule, and Zipf respectively. The distribution parameters follow after the semicolon.

Comparing Zipf and zeta

The Zipf distribution is only defined on the first N positive integers. The normalizing constant for the Zipf distribution is the Nth generalized harmonic number with exponent s.

H_{N,s} = \sum_{k=1}^N k^{-s}

As N goes to infinity, H_{N,s} converges to ζ(s); this is the definition of the ζ function. So the Zipf and zeta distributions are asymptotically equal.
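Here's a quick numerical check of this convergence, a sketch of my own (not from the original post) that assumes SciPy is installed; scipy.special.zeta computes the Riemann zeta function.

    import numpy as np
    from scipy.special import zeta

    s = 1.5
    for N in [10, 100, 1000, 10000]:
        # Nth generalized harmonic number with exponent s
        H = np.sum(np.arange(1, N + 1, dtype=float) ** -s)
        print(N, H, zeta(s) - H)  # the gap shrinks as N grows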

Comparing zeta and Yule

This post showed that for large k,

f_y(k; \rho) \approx \rho \Gamma(\rho + 1) \frac{1}{(k+\rho)^{\rho + 1}}

That is, f_y(k; ρ) is proportional to a power of k, except that k is shifted by an amount ρ.

To compare the zeta and Yule distributions we’ll need to compare zeta with s + 1 to Yule with ρ = s in order to make the exponents agree, and we’ll need to shift the zeta distribution’s argument by ρ.

When we plot the ratio of the pmfs, we see that the distributions agree in the limit as k gets large.

In this plot s = ρ = 1.5.

The ratio is approaching a constant, and in fact the limit is

\frac{1}{ \rho \,\Gamma(\rho + 1)\,\zeta(s+1)}

based on the ratio of the proportionality constants.
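As a sanity check, here's a small script of mine (not from the post) that computes the ratio of the zeta pmf with parameter s + 1 to the Yule pmf with parameter ρ for increasing k and compares it with the constant above. The shift by ρ doesn't affect the limit, so the script omits it.

    from math import gamma
    from scipy.special import beta, zeta

    s = rho = 1.5

    def f_zeta(k):
        return k ** -(s + 1) / zeta(s + 1)

    def f_yule(k):
        return rho * beta(k, rho + 1)

    limit = 1 / (rho * gamma(rho + 1) * zeta(s + 1))
    for k in [10, 100, 1000, 10000]:
        print(k, f_zeta(k) / f_yule(k), limit)  # ratio approaches the limit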

Comment

Note that the three distributions are asymptotically proportional, given the necessary shift in k, but in different ways. The Zipf distribution converges to the zeta distribution, for every k, as N goes to infinity. The zeta and Yule distributions are asymptotically proportional as k goes to infinity. So one proportion is asymptotic in a parameter N and one is asymptotic in an argument k.

The zeta distribution

The previous post on the Yule-Simon distribution mentioned the zeta distribution at the end. This is a power law distribution on the positive integers with normalizing constant given by the Riemann zeta function. That is, the zeta distribution has probability mass function

f(k; s) = k^{−s} / ζ(s)

where k is a positive integer and s > 1 is a fixed parameter.
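As an aside, SciPy implements this distribution under the name zipf rather than zeta, which can be confusing. Here is a quick check of the pmf above, as a sketch of my own:

    from scipy.stats import zipf
    from scipy.special import zeta

    s = 2.5
    print(zipf.pmf(3, s), 3 ** -s / zeta(s))  # the two values should agree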

For s > 1, the zeta function is defined as the sum of the positive integers to the power negative s, so ζ(s) is essentially defined as the normalizing constant of the zeta distribution.

I wanted to make a couple comments about this. First, it shows that the zeta function appears in applications outside of number theory. Second, when working with the zeta distribution, it would be useful to have an estimate for ζ(s).

Zeta function bounds

The integral test in calculus is typically presented as a way to test whether an infinite sum converges. This is a shame. In analysis as in statistics, estimation is better than testing. Testing throws away continuous information and replaces it with a single bit.

Let f(x) be a decreasing function; we have in mind f(x) = 1/x^s. Then for any n,

\int_{n+1}^\infty f(x)\, dx \leq \sum_{k = n+1}^\infty f(k) \leq \int_{n}^\infty f(x)\, dx

To estimate a sum over k, we sum the first n terms directly and apply the inequalities above for the rest. Typically n will be small, maybe 0 or 1. A larger value of n will give more accurate bounds but at the expense of a more complicated expression.

When f(x) = 1/x^s and n is at least 1, we have

\frac{(n+1)^{1-s}}{s-1} \leq \sum_{k = n+1}^\infty f(k) \leq \frac{n^{1-s}}{s-1}

This says

\sum_{k=1}^n k^{-s} + \frac{(n+1)^{1-s}}{s-1} \leq \zeta(s) \leq \sum_{k=1}^n k^{-s} + \frac{n^{1-s}}{s-1}

This gives a good lower bound but a bad upper bound when we choose n = 1.

But it gives a much better upper bound when we choose n = 2.
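Here's a sketch of mine (not from the post) comparing the bounds with SciPy's value of ζ(s) for n = 1 and n = 2:

    from scipy.special import zeta

    s = 1.5
    for n in [1, 2]:
        partial = sum(k ** -s for k in range(1, n + 1))
        lower = partial + (n + 1) ** (1 - s) / (s - 1)
        upper = partial + n ** (1 - s) / (s - 1)
        print(n, lower, zeta(s), upper)  # lower <= zeta(s) <= upper

For s = 1.5, this gives roughly 2.414 ≤ ζ(s) ≤ 3 with n = 1 and 2.508 ≤ ζ(s) ≤ 2.768 with n = 2, compared with ζ(1.5) ≈ 2.612.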


Yule-Simon distribution

The Yule-Simon distribution, named after Udny Yule and Herbert Simon, is a discrete probability distribution with pmf

f(k; \rho) = \rho B(k, \rho + 1)

The semicolon in f(k; ρ) suggests that we think of f as a function of k, with a fixed parameter ρ. Writing the distribution this way shows the connection to the beta function, but for our purposes it will be helpful to expand the beta function using

B(x, y) = \frac{\Gamma(x) \, \Gamma(y)}{\Gamma(x+y)}

and so

\begin{align*} f(k; \rho) &= \rho B(k, \rho + 1) \\ &= \rho \Gamma(\rho + 1) \,\,\frac{\Gamma(k)}{\Gamma(k + \rho + 1)} \\ &= \rho \Gamma(\rho + 1) \,\, \frac{1}{(k + \rho)^{\underline{\rho + 1}}} \end{align*}

Ignore the first part of the last line, ρ Γ(ρ + 1), because it doesn’t involve k. It helps to ignore proportionality constants in probability densities when they’re not necessary. What’s left is the reciprocal of the (ρ + 1) falling power of k + ρ.

For large values of k, the falling power term is asymptotically equal to k^{ρ+1}. To see this, let k = 1000 and ρ = 3. Then we’re saying that the ratio of

1003 × 1002 × 1001 × 1000

to

1000 × 1000 × 1000 × 1000

is approximately 1, and the ratio converges to 1 as k increases.
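The same convergence can be checked for a range of k. Here's a short sketch of mine that computes the falling power via the log-gamma function:

    import numpy as np
    from scipy.special import gammaln

    rho = 3
    for k in [10, 100, 1000, 10000]:
        # the (rho + 1) falling power of k + rho, i.e. Gamma(k+rho+1)/Gamma(k)
        falling = np.exp(gammaln(k + rho + 1) - gammaln(k))
        print(k, falling / k ** (rho + 1))  # ratio tends to 1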

This says that the Yule-Simon distribution is a power law in the tails, just like the Zipf distribution and the zeta distribution. Details of the comparison between these three distributions are given here.


Mahalanobis distance and Henry V

I was reading a stats book that mentioned Mahalanobis distance and that made me think of Non Nobis from Henry V, a great scene in a great movie. As far as I know, there’s no connection between Mahalanobis and Non Nobis except that both end in “nobis.”

Since Mahalanobis is an Indian surname and Non Nobis is Latin, there’s probably no etymological connection between the two except maybe way back in Indo-European.

Mahalanobis distance is Euclidean distance adapted to a multivariate normal distribution. Specifically, the squared distance between two column vectors x and y is

(x − y)^T S^{−1} (x − y)

where S is the covariance matrix for a multivariate Gaussian distribution.

You might look at this and ask why the matrix in the middle has to be the inverse of a covariance matrix. Couldn’t it be any invertible matrix? Isn’t this just Euclidean distance in transformed coordinates? Yes and yes. But Mahalanobis thought of how to use it in statistics.
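Here's a minimal sketch of the formula above (my code, with a made-up covariance matrix). Note that scipy.spatial.distance.mahalanobis takes the inverse of the covariance matrix and returns the distance, not the squared distance.

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    S = np.array([[2.0, 0.3],
                  [0.3, 1.0]])       # example covariance matrix (made up)
    x = np.array([1.0, 2.0])
    y = np.array([3.0, 1.0])

    S_inv = np.linalg.inv(S)
    d2 = (x - y) @ S_inv @ (x - y)   # squared Mahalanobis distance
    print(np.sqrt(d2), mahalanobis(x, y, S_inv))  # these should agree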

Mahalanobis’ birthday, June 29, is National Statistics Day in India in his honor.

Probability of a magical permutation

Take a permutation of the numbers 1 through n² and lay out the elements of the permutation in a square. We will call a permutation a magic permutation if the corresponding square is a magic square. What is the probability that a permutation is a magic permutation? That is, if you fill a grid randomly with the numbers 1 through n², how likely are you to get a magic square?

For example, we could generate a random permutation in Python and see whether it forms a magic square.

>>> import numpy as np
>>> np.random.permutation(9).reshape(3,3)
array([[4, 7, 3],
       [2, 6, 5],
       [8, 0, 1]])

The first row adds up to 14 and the second row adds up to 13, so this isn’t a magic square. (Note that the code above permutes the numbers 0 through 8 rather than 1 through 9, but adding 1 to every entry doesn’t change whether a square is magic.) We could write a script to do this over and over and check whether the result is a magic square; a sketch of such a script is below.
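This sketch is my own code, not from the original post; it checks the row, column, and diagonal sums and estimates the probability by simulation.

    import numpy as np

    def is_magic(square):
        """Check whether all rows, columns, and both diagonals share one sum."""
        target = square[0].sum()
        sums = list(square.sum(axis=0)) + list(square.sum(axis=1))
        sums += [np.trace(square), np.trace(np.fliplr(square))]
        return all(s == target for s in sums)

    rng = np.random.default_rng()
    trials = 1_000_000
    hits = sum(is_magic(rng.permutation(9).reshape(3, 3)) for _ in range(trials))
    print(hits / trials)  # compare with 8/9! ≈ 2.2e-5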

The exact number of magic squares of size n is known for n up to 5, and we have Monte Carlo estimates for larger values of n.

The number of unique magic squares, modulo rotations and reflections, of size 1 through 5 is

1, 0, 1, 880, 275305224.

For n > 2 the total number of magic squares, counting rotations and reflections as different squares, is 8 times larger than the numbers above. This is because the group of rotations and reflections of a square, D4, has 8 elements.

The probability that a permutation of the numbers 1 through 9, arranged in a square, gives a magic square is 8/9!. The corresponding probability for the numbers 1 through 16 is 8 × 880/16!, and for the numbers 1 through 25 we have 8 × 275305224/25!.

    |---+-------------|
    | n | Prob(magic) |
    |---+-------------|
    | 3 | 2.20459e-05 |
    | 4 | 3.36475e-10 |
    | 5 | 1.41990e-16 |
    |---+-------------|

It looks like the exponents are in roughly a linear progression, so maybe you could fit a line fairly well to the points on a logarithmic scale. In fact, empirical studies suggest that the probability that a permutation of the first n² positive integers is magic decreases a little faster than exponentially.

We could fit a linear regression to the logs of the numbers above to come up with an estimate of the probability for n = 6. We expect the estimate to be pretty good, and likely an upper bound on the correct answer. Let’s see what actually happens with a little R code.

    > x <- c(3,4,5)
    > y <- 8*c(1/factorial(9), 880/factorial(16), 275305224/factorial(25))
    > m <- lm(log(y) ~ x)
    > predict(m, data.frame(x=c(6)))
    1
    -48.77694

This says we’d estimate that the natural log of the probability that a permutation of the first 6² positive integers is magic is -48.77694. So the estimate of the probability itself is

exp(−48.77694) = 6.55306 × 10^−22

We don’t know the number of magic squares of size n = 6, but the number of distinct squares has been estimated [1] at

(0.17745 ± 0.00016) × 10^20

and so the total number including rotations and reflections would be 8 times higher. This says we’d expect our probability to be around

8 × 0.17745 × 10^20 / 36! ≈ 3.8162 × 10^−22

So our estimate is off by about a factor of 2, and as predicted it does give an upper bound.

Looking back at our regression model, the slope is −12.88 and the intercept is 28.53. This suggests that an upper bound on the probability of a permutation of size n² being magical is

exp(28.53 − 12.88 n).

In closing I’d like to point out that the estimate for n = 6 that we’re referencing above did not come from simply permuting 36 integers over and over and counting how many of the permutations correspond to magic squares. That would take far too long. It’s quite likely that after the first billion billion tries you would not have seen a magic permutation. To estimate such a small number to four significant figures requires a more sophisticated Monte Carlo procedure.


[1] See OEIS sequence A006052

Degree of magic

A square grid of distinct integers is a magic square if all its rows, columns, and full diagonals have the same sum. Otherwise it is not a magic square.

Now suppose we fill a square grid with samples from a continuous random variable. The probability that the entries are distinct is 1, but the probability that the square is magic is 0.

We could make this more interesting by asking how close to magic the square is, using a continuous measure of degree of magic, rather than simply asking whether the square is magic or not.

If we have an n by n square of real numbers, we could take the n row sums, n column sums, and 2 diagonal sums as data and look at their range or variance. These statistics would be zero for a magic square and small for a nearly magic square.

If each element of the square is a sample from a standard normal random variable, what is the distribution of these statistics? How much does the distribution on the numbers matter?
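Here's a Monte Carlo sketch of my own that explores the first question for one such statistic, the range of the row, column, and diagonal sums:

    import numpy as np

    def magic_spread(square):
        """Range of the row sums, column sums, and two diagonal sums."""
        sums = np.concatenate([
            square.sum(axis=0),                               # column sums
            square.sum(axis=1),                               # row sums
            [np.trace(square), np.trace(np.fliplr(square))],  # diagonal sums
        ])
        return sums.max() - sums.min()

    rng = np.random.default_rng()
    n = 4
    samples = [magic_spread(rng.standard_normal((n, n))) for _ in range(10_000)]
    print(np.mean(samples), np.std(samples))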

The ring of entire functions

Rings made a bad first impression on me. I couldn’t remember the definitions of all the different kinds of rings, much less have an intuition for what was important about each one. As I recall, all the examples of rings in our course were variations on the integers, often artificial variations.

Entire functions

I’m more interested in analysis than algebra, so my curiosity was piqued when I ran across an appendix on entire functions in the back of an algebra book [1]. This appendix opens with the statement

The ring E of entire functions is a most peculiar ring.

It’s interesting that something so natural from the perspective of analysis is considered peculiar from the perspective of algebra.

An entire function is a function of a complex variable that is analytic in the entire complex plane. Entire functions are functions that have a Taylor series with infinite radius of convergence.

Rings

A ring is a set of things with addition and multiplication operations, and these operations interact as you’d expect via distributive rules. You can add, subtract, and multiply, but not divide: addition is invertible but multiplication is not in general. Clearly the sum or product of entire functions is an entire function. But the reciprocal of an entire function is not an entire function in general, because the reciprocal has poles wherever the original function has zeros.

So why is the ring of analytic functions peculiar to an algebraist? Osborne speaks of “the Jekyll-Hyde nature of E,” meaning that E is easy to work with in some ways but not in others. If Santa were an algebraist, he would say E is both naughty and nice.

Nice properties

On the nice side, E is an integral domain. That is, if f and g are entire functions and fg = 0, then either f = 0 or g = 0.

If we were looking at functions that were merely continuous, it would be possible for f to be zero in some places and g to be zero in the rest, so that the product fg is zero everywhere.

But analytic functions are much less flexible than continuous functions. If an analytic function is zero on a set of points with a limit point, it’s zero everywhere. If every point in the complex plane is either a zero of f or a zero of g, one of these sets of zeros must have a limit point.

Another nice property of E is that it is a Bézout domain. This means that if f and g are entire functions with no shared zeros, there exist entire functions λ and μ such that

λf + μg = 1.

This definition is analogous to (and motivated by) Bézout’s identity in number theory, which says that if a and b are relatively prime integers, then there are integers m and n such that

ma + nb = 1.
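For the integer version, the coefficients m and n can be found with the extended Euclidean algorithm. Here's a short sketch of mine:

    def bezout(a, b):
        """Return (g, m, n) with m*a + n*b == g == gcd(a, b)."""
        old_r, r = a, b
        old_m, m = 1, 0
        old_n, n = 0, 1
        while r != 0:
            q = old_r // r
            old_r, r = r, old_r - q * r
            old_m, m = m, old_m - q * m
            old_n, n = n, old_n - q * n
        return old_r, old_m, old_n

    print(bezout(15, 28))  # 15 and 28 are relatively prime, so gcd is 1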

Naughty properties

The naughty properties of E take longer to describe and involve dimension. “Nice” rings have small Krull dimension. For example Artinian rings have Krull dimension 0, and the ring of polynomials in n complex variables has dimension n. But the Krull dimension of the ring of entire functions is infinite. In fact it’s “very infinite” in the sense that it is at least

2^{\aleph_1}

and so the Krull dimension of E is at least as large as the cardinality of the complex numbers.

Wittgenstein’s ruler

Nassim Taleb described Wittgenstein’s ruler this way:

Unless you have confidence in the ruler’s reliability, if you use a ruler to measure a table you may also be using the table to measure the ruler. The less you trust the ruler’s reliability, the more information you are getting about the ruler and the less about the table.

An algebraist would say that entire functions are weird, but an analyst could say that on the contrary, ring theory, or at least Krull dimension, is weird.


[1] Basic Homological Algebra by M. Scott Osborne.

The awkward middle child of algebra

Abstract algebra is basically the study of groups, rings, and fields. There are more concepts, but these are the big three.

Groups have the least structure and fields have the most structure. Rings are somewhere in the middle.

Groups have just one operation, which is thought of as multiplication by default. If the group operation is commutative, it’s thought of as addition. But in either case, there’s only one operation.

Fields have two operations, addition and multiplication, and the operations have all the basic properties you expect.

Rings also have two operations, addition and multiplication, but they lack some of the familiar properties of fields.

The most awkward thing about ring theory is that there is no universal agreement on the definition of a ring. The disagreement is over whether the definition should require the existence of a multiplicative identity, the analog of the number 1. Most authors include this as part of the definition, but a substantial portion do not.

There are two ways of dealing with this mess. The most common is to require rings to have an identity, but continually remind readers of this convention. And so you’ll see things like “Let R be a ring (with identity).”

The other less common solution is to give custody of the term “ring” to those who require an identity and use the term “rng” for the thing that’s just like a ring except it doesn’t necessarily have an identity.

Another awkward, or at least curious, feature of ring theory is that you almost never speak of subrings. Subgroups are important in group theory, and subfields are important in field theory [1]. But you almost never hear the word “subring.”

The ring theory counterpart to subgroups in group theory is the ideal. The name comes from Ernst Kummer thinking of these objects as ideal numbers, numbers added to an algebraic system to fill in some deficiency.

Given a ring R, a subset I of R is an ideal if it is closed under addition and under multiplication by any element of R. Note the asymmetry between addition and multiplication. You can add two elements of I and stay in I. And you can multiply an element in I, not just by another element of I, but by any element in R, and stay in I.

A quick example will help. Let R be the integers and let I be all multiples of 7. The sum of two multiples of 7 is another multiple of 7, and if you multiply a multiple of 7 by any integer you get another multiple of 7. So multiples of 7 form an ideal in the integers.
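A throwaway numerical check of those two closure properties (my code):

    import random

    for _ in range(1000):
        a = 7 * random.randint(-100, 100)   # an element of I
        b = 7 * random.randint(-100, 100)   # another element of I
        r = random.randint(-100, 100)       # an arbitrary element of R
        assert (a + b) % 7 == 0             # I is closed under addition
        assert (r * a) % 7 == 0             # I absorbs multiplication by R
    print("ok")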

The disagreement over whether to require identities and the lack of interest in subrings are related. If you don’t require a ring to have an identity element, then you don’t require subrings to have an identity element, and so every ideal is a subring. But if you do require rings to have an identity, the only ideal of R that is a subring is R itself.

A final awkward thing about ring theory is that there are so many kinds of rings: integral domains, principal ideal domains, Noetherian rings, Artinian rings, etc. It seems that “ring” itself is too generic to be useful except as a pedagogical device to organize what more specific kinds of rings have in common.

In this sense ring theory is analogous to “absolute geometry,” the study of the common features of Euclidean and Non-Euclidean geometry. In practice one cares about more specific geometries, but there’s some value in cataloging theorems that various geometries share.


[1] The concept of a subfield is more common than the term “subfield.” In field theory, it’s more common to say that K is an extension field of F than to say that F is a subfield of K, though the two statements are equivalent.

Partial functions and total functions

I was thinking about writing a post about entire functions and it occurred to me that I should say something about how entire functions are unrelated to total functions. But then I realized they’re not unrelated. I intend to say something about entire functions in a future post.

Update: See The ring of entire functions.

Partial functions

A total function is a function defined for every element of its domain. This terminology is more common in computer science than in math. A lot of what we call functions in software are actually partial functions because they’re not defined everywhere. For example, the square root function sqrt in C takes an argument of type double and returns a value of type double, but it’s not defined for all inputs of type double, only those inputs that are not negative.

Mathematically speaking it makes no sense to refer to a function that is not defined on every point of its domain. A function’s domain is by definition the set of points where it is defined, so a total function is just a function, and a partial function is not even a function.

And yet it is convenient to refer informally to functions that are undefined for some inputs. This is common as dirt in software development, but it occurs in math too.
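Here's a small illustration in Python (my code, not from the post): math.sqrt behaves like a partial function on floats, and one way to make it total is to enlarge the codomain.

    import math
    from typing import Optional

    def sqrt_partial(x: float) -> float:
        # Raises ValueError for negative x, so it's only a partial function.
        return math.sqrt(x)

    def sqrt_total(x: float) -> Optional[float]:
        # Defined for every float by returning None where sqrt is undefined.
        return math.sqrt(x) if x >= 0 else None

    print(sqrt_total(2.0), sqrt_total(-1.0))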

Type checking

The distinction between total functions and partial functions is more subtle than it may seem.

For example, it would be more precise to say that the sqrt function is defined on non-negative numbers of type double. There’s no non-negative floating point type in C, but in principle we could introduce one. Sounds like a good idea. Why hasn’t C done this?

It’s obvious when functions are defined on (some) arguments of type double and return (some) values of type double. It’s not so obvious whether functions are defined on more restricted types or return more restricted types. For example, suppose you want to apply sqrt to the output of a function that you wrote. If your function takes in a double and squares it, then this is fine: it may ostensibly return a double, but in fact it returns a non-negative double. But if your function cubes its input, then the output may not be compatible with the input to sqrt.

The domain of the square root function is simple, but what about functions that are undefined on more complicated sets? For example, the gamma function is undefined at 0 and at negative integers. The gamma function is undefined at −4, for instance, but it’s defined at −4 + ε for any tiny ε. Imagine how hard it might be to verify that a function returns −4 + ε but not −4.

Or suppose you have a function that computes the real part of complex numbers for which the Riemann zeta function is zero. If you could prove that the program always returns either 1/2 or a negative even integer, you would earn fame and fortune. This is an extreme example, but it is true that determining the precise domain or range of a function can require proving theorems that are difficult if not impossible.

Dependent type theory

We’re at a fork in the road. We either give up on more restrictive types and live with functions that may not be defined for every element in their domain, or we implement far more sophisticated types and type checking. In other words, we get into dependent type theory.

If we choose the fork leading to dependent types, there are a lot of new costs and new benefits. The benefits are more obvious than the costs. The benefits include, for example, having a compiler reject code that could pass a negative value into sqrt. But this comes at the expense of making it far more difficult to write compilers. It also restricts the range of programs that can be written or increases the skill requirement of those who can write the needed programs.

Mathematical formalism

Mathematicians informally speak of functions not being defined everywhere in their domain even though it would be more accurate to say that the domain excludes certain points. This kind of talk is fine among friends where there’s an implicit understanding of what detail is being left out and the trust that the details will be made more explicit when necessary. In a more hostile environment, like Twitter, pedants will gleefully pounce on such lapses in rigor.

You can make partial functions rigorous by defining them to be relations rather than functions. A relation between sets A and B is a subset of their Cartesian product A × B. A function is a relation such that for every a in A there exists a unique pair (a, b) in the relation. A partial function is a relation in which for every a in A there is at most one pair (a, b) in the relation.

A “multi-valued function” is strictly speaking an oxymoron, but more formally it is also a relation, not a function. As with partial functions, the terminology “multi-valued function” is informal. Typically there is a way to formalize multi-valued functions so that they are actual (single-valued) functions with a different codomain, but one may wish to avoid this extra formality when it is not necessary.