Magic squares as matrices

If you view a 3 × 3 magic square as a matrix and raise it to the third power, the result is also a magic square.

More generally, if you multiply an odd number of 3 × 3 magic squares together, the result is a magic square.

For example, here are three magic squares that appeared in my blog post on Spanish magic squares, and you can verify that their product is another magic square.

\left[ \begin{array}{ccc} 93 & 155 & 121 \\ 151 & 123 & 95 \\ 125 & 91 & 153 \\ \end{array} \right] \left[ \begin{array}{ccc} 12 & 21 & 15 \\ 19 & 16 & 13 \\ 17 & 11 & 20 \\ \end{array} \right] \left[ \begin{array}{ccc} 13 & 20 & 18 \\ 22 & 17 & 12 \\ 16 & 14 & 21 \\ \end{array} \right] = \left[ \begin{array}{ccc} 299622 & 301968 & 301722 \\ 303204 & 301104 & 299004 \\ 300486 & 300240 & 302586 \\ \end{array} \right]

Source: Martin Gardner, Some New Discoveries About 3 × 3 Magic Squares, Math Horizons, February 1998.

The same article conjectures that the results above are true for magic squares of any odd order. That was 20 years ago. Maybe the conjecture has been resolved by now. If you know, please leave a comment.

Related posts:

GDPR and the right to be forgotten

George Bailey on the Bedford Falls bridge

General Data Protection Regulation

The European GDPR (General Data Protection Regulation) was adopted in 2016 and becomes enforceable in May of this year. Article 17 mandates a right to erasure, more commonly called the right to be forgotten.

A right to be forgotten is tricky. It’s not immediately clear what this means or to what extent it’s even possible. That’s for lawyers and courts to work out. I can’t comment on the legal interpretation of the regulation. There are laws that require companies to delete information and laws that require them to retain information, and no doubt there are debates on how to reconcile these.

Here I want to look at some of the logical and statistical issues that come out of privacy laws, not necessarily the GDPR in particular, that mandate that information be forgotten.

Challenges with a right to be forgotten

Suppose John Smith had participated in some database and a report showed that there were 5,000 diabetics in Smith’s geographic region. Smith asks to be removed from the database, and now a revised report shows that there are 4,999 diabetics in the region. What does that tell you about Smith? (For an elaboration on this example, see Big aggregate queries can still violate privacy.)

You might object that removing Smith from the database doesn’t really cause him to be forgotten if we retain the fact that he was removed from the database. We have to forget that he was forgotten, maybe like George Bailey in It’s a Wonderful Life. While the fictional George Bailey was able to see what life would have been like if he had never been born, our real-world John Smith doesn’t have that option. Truly forgetting anything in a world of ubiquitous databases may not be possible. It may require a cascade of changes that are illegal, impractical, or impossible.

Differential privacy

One way to address the challenges of erasure is though differential privacy. A report constructed via a differentially private system would not have released the exact number of diabetics in Smith’s region but rather some randomized version of the exact result. Ideally the amount of noise added to the true result is large enough to protect Smith’s privacy, but small enough that the result is useful to the person who wants to know the approximate number of diabetics in Smith’s region.

Differential privacy is both powerful and subtle. It gives a theoretically grounded way to quantify the privacy implications of someone’s participation or lack of participation in a database. But it comes with restrictions that may be hard to live with. For example, we cannot let someone ask the same question many times, or if we do, we must give the same answer each time. If the answers were generated afresh each time, one could average the results to remove the effect of the random noise.

But what if you want to let someone rerun reports, not because they’re trying to evade privacy protections, but because the world is changing and they periodically want to get an updated view of the world? That can be done, but it requires forethought. Before you release the first report, you must have a system in place that is intended to allow multiple queries over time.

Differential privacy applies to a process, not to a result. You can’t say “This is a differentially private report” but rather “This report is the result of a differentially private mechanism.” That mechanism has to be designed up front. You can design a mechanism to support a wide variety of queries, but this comes at a cost. You have to anticipate what these queries will be (maybe not specifically but at least some class that they fall in) and add sufficient randomness to support them all. In general, the more flexible you want your query system to be, the more randomness you will need to add, and so you face a privacy-utility trade-off.

Help with privacy compliance

If you’d like help with privacy law compliance, let’s talk. I help companies with statistical matters related to privacy, such as expert determination of de-identification (or “pseudonymisation” as it’s known in Europe). I am not an expert on the regulatory side of things, but I work in coordination with lawyers who are.

Most useful math class

A few years ago someone asked me what was my most useful undergraduate math class. My first thought was topology.

I have never directly applied topology for a client. Nobody has ever approached me wanting to know, for example, whether two objects were in the same homotopy class. But I believe topology was one of the most important classes I took for three reasons.

First, I learned how to prove things in that course. It was a small, interactive class with an excellent teacher (Jim Vick). I might have learned the same techniques in a different class, but for me I learned them in topology.

Second, the course built my confidence. I was apprehensive about taking the course because I knew nothing about it. The little I’d heard about topology—stretching coffee cups into donuts etc.—made me wonder what a class could possibly be like. I proved to myself that I could jump into something unfamiliar and do well.

Finally, the course gave me a solid foundation for analysis, and analysis I have applied more directly. I got a thorough understanding of foundational ideas like continuity and compactness, and a foretaste of measure theory. The course also provided my first brief exposure to category theory. To this day, my Pavlovian response to a mention of functors is to think of the fundamental group of a topological space.

I look back on topology the way many look back on a classical education, something not directly useful but indirectly very useful.

Average fraction round up

Pick a large number n. Divide n by each of the positive integers up to n and round the results up to the nearest integer. On average, how far do you round up?

Or in terms of probability, what is the expected distance between a fraction n/r, where n is large and fixed and r is chosen randomly between 1 and n, and the nearest larger integer?

In symbols, the question above is asking for the approximate value of

\frac{1}{n} \sum_{r = 1}^n \left( \left\lceil \frac{n}{r} \right\rceil - \frac{n}{r}\right )

for large n, i.e. in the limit as n goes to ∞. Here ⌈x⌉ denotes the ceiling of x, the smallest integer greater than or equal to x.

Let’s plot this as a function of n and see what it looks like. Here’s the Python code.

    import matplotlib.pyplot as plt
    from numpy import ceil, arange

    def f(n):
        return sum( [ceil(n/r) - n/r for r in range(1, n)] )/n

    x = arange(1, 100)
    y = [f(n) for n in x]
    plt.plot(x, y)

And here’s the result.

It appears the graph may be converging to some value, and in fact it is. Charles de la Vallée Poussin proved in 1898 that the limiting value is the Euler–Mascheroni constant γ = 0.5772…. This constant is the limiting difference between the nth harmonic number and log n, i.e.

\gamma = \lim_{n\to\infty} \left(\sum_{r = 1}^n \frac{1}{r} - \log n \right)

We can add a horizontal line to our plot to see how well the graph seems to match γ. To do this we need to import the constant euler_gamma from numpy and add the

    plt.axhline(y=euler_gamma, linestyle=":")

after the plot command. When we do, this is what we see.

It looks like the plot is converging to a value slightly less than γ. Apparently the convergence is very slow. When we go out to 10,000 the plot is closer to being centered around γ but still maybe below γ more than above.

If we evaluate our function at n = 1,000,000, we get 0.577258… while γ = 0.577215….

At n = 10,000,000 we get 0.577218…. So taking 100 times as many terms in our sum gives us one extra correct decimal place, as we’d expect of a random processes since convergence usually goes like 1/√n.

New prime number record: 50th Mersenne prime

A new record for the largest known prime was announced yesterday:

2^{77,232,917} - 1

This number has 23,249,425 digits when written in base 10.

In base 2, 2p – 1 is a sequence of p ones. For example, 31 = 25 -1  which is 11111 in binary. So in binary, the new record prime is a string of 77,232,917 ones.

77,232,917 ones

You can convert a binary number to hexadecimal (base 16) by starting at the right end and converting blocks of 4 bits into hexadecimal. For example, to convert 101101111 to hex, we break it into three blocks: 1, 0110, and 1111. These blocks convert to 1, 6, and F, and so 101101111 in binary corresponds to 16F in hex.

Now 77,232,917  = 19,308,229 * 4 + 1, so if we break our string of 77,232,917 ones into groups of four, we have a single bit followed by 19,308,229 groups of 4. This means that in hexadecimal, the new prime record is 1FFF…FFF, a 1 followed by 19,308,229 F’s.

19,308,229 Fs

The new record is the 50th Mersenne prime. A Mersenne prime is a prime number that is one less than a power of 2, i.e. of the form 2p – 1. It turns out p must be prime before 2p – 1 can be prime. In the case of the new record, 77,232,917 is prime.

It is unknown whether there are infinitely many Mersenne primes. But now we know there are at least 50 of them.

All the recent prime number records have been Mersenne primes because there is an algorithm for testing whether a number of the special form 2p – 1 is prime, the Lucas-Lehmer test.

Mersenne primes are closely related to perfect numbers. A number n is perfect if the sum of the numbers less than n that divide n equals n. For example, 28 is divisible by 1, 2, 4, 7, and 14, and 1 + 2 + 4 + 7 + 14 = 28.

Finding a new Mersenne prime means finding a new perfect number. If m is a Mersenne prime, then nm(m + 1)/2 is a perfect number. Conversely, if n is an even perfect number, n has the form m(m + 1)/2, Nobody knows whether odd perfect numbers exist. Since we don’t know whether there are infinitely many Mersenne primes, we don’t know whether there are infinitely many perfect numbers. But there are at least 50 of them.

Related posts:

The Engineer’s Nyquist frequency and the sampling theorem

The Nyquist sampling theorem says that a band-limited signal can be recovered from evenly-spaced samples. If the highest frequency component of the signal is fc then the function needs to be sampled at a frequency of at least the Nyquist frequency 2fc. Or to put it another way, the spacing between samples needs to be no more than than Δ = 1/2fc.

If the signal is given by a function h(t), then the Nyquist-Shannon sampling theorem says we can recover h(t) by

h(t) = \sum_{n=-\infty}^\infty h(n\Delta)\, \mathrm{sinc}(2f_c t- n)

where sinc(x) = sin(πx) / πx.

In practice, signals may not entirely band-limited, but beyond some frequency fc higher frequencies can be ignored. This means that the cutoff frequency fc is somewhat fuzzy. As we demonstrate below, it’s much better to err on the side of making the cutoff frequency higher than necessary. Sampling at a little less than the necessary frequency can cause the reconstructed signal to be a poor approximation of the original. That is, the sampling theorem is robust to over-sampling but not to under-sampling. There’s no harm from sampling more frequently than necessary. (No harm as far as the accuracy of the equation above. There may be economic costs, for example, that come from using an unnecessarily high sampling rate.)

Let’s look at the function h(t) = cos(18πt) + cos(20πt). The bandwidth of this function is 10 Hz, and so the sampling theorem requires that we sample our function at 20 Hz. If we sample at 20.4 Hz, 2% higher than necessary, the reconstruction lines up with the original function so well that the plots of the two functions agree to the thickness of the plotting line.

Function and reconstruction sampling at 20.4 Hz

But if we sample at 19.6 Hz, 2% less than necessary, the reconstruction is not at all accurate due to problems with aliasing.

Function and reconstruction sampling at 19.6 Hz

One rule of thumb is to use the Engineer’s Nyquist frequency of 2.5 fc which is 25% more than the exact Nyquist frequency. An engineer’s Nyquist frequency is sorta like a baker’s dozen, a conventional safety margin added to a well-known quantity.

Update: Here’s a plot of the error, the RMS difference between the signal and its reconstruction, as a function of sampling frequency.

RMS error in reconstruction as a function of sampling frequency

By the way, the function in the example demonstrates beats. The sum of a 9 Hz signal and a 10 Hz signal is a 9.5 Hz signal modulated at 0.5 Hz. More details on beats in this post on AM radio and musical instruments.

Exponential sum for the new year

Exponential sums can make intricate patterns. Last year I made a page that displays a different page each day, using the month, day, and year as parameters in the expression below. The images plot the partial sums of this sum.

\sum_{n=0}^N \exp\left( 2\pi i \left( \frac{n}{m} + \frac{n^2}{d} + \frac{n^3}{y} \right ) \right )

This was yesterday’s image.

Image of the day, New Year's Eve.

Today’s image is surprisingly plain if we use y = 18.

This is in part because the least common multiple of 1, 1, and 18 is 18. The image could have no more than 18 vertices. In fact, it has only 6 vertices because the summand above has period 6.

But if we use y = 2018 we get something much more complex.

The Exponential Sum of the Day page will use y = 18 this year. There will be a few simple images this year but there will also be lots of surprises.

Hermite polynomials, expected values, and integration

In the previous post, I alluded to using Hermite polynomials in conjunction with higher-order Laplace approximation. In this post I’ll expand on what that means.

Hermite polynomials are orthogonal polynomials over the real line with respect to the weight given by the standard normal distribution. (There are two conventions for defining Hermite polynomials, what Wikipedia calls the physicist convention and the probabilist convention. We’re using the latter here.)

The first few Hermite polynomials are 1, x, x2 – 1, x3 – 3x, … . You can find the rest of the Hermite polynomials via the recurrence relation

Hn+1(x) = x Hnn Hn-1(x).

The odd order Hermite polynomials are odd functions, and the standard normal density is an even function, so the integral of their product is zero. For an even number m, the integral of the mth order Hermite polynomial times the standard normal density is (m-1)!!, i.e. m double factorial. In probability terms, if X is a standard normal random variable, the expected value of Hk(X) is 0 if k is odd and (k – 1)!! otherwise.

You could think of the Hermite polynomials as the right basis to use when working with normal probability distributions. Writing a polynomial as a linear combination of Hermite polyn0mials is a change of basis that makes integration easy:

\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty \left(\sum_{k=0}^N a_k\,H_k(x)\right) \exp(-x^2/2) \,dx = \sum_{k=0}^N a_k \, [k\, \mathrm{ even} ] \,k!!

Here [k even] is the function that returns 1 if k is even and 0 otherwise. This notation was introduced by Kenneth Iverson in the APL programming language and has become moderately common in mathematics.


Higher-order Laplace approximation

Yesterday’s post presented the most common form of Laplace approximation, the second order version, and mentioned in passing that there are higher order versions. The extension to higher order is not trivial, so post gives a high level overview of how you’d do it.

The previous post looked at integrating exp( log( g(x) ) ) by expanding log g(x) in a power series centered at its maximum. Without loss of generality we can assume the maximum occurs at 0; otherwise do a change of variables first to make this happen.

The first order term of the series is missing because the derivative of g is zero at its maximum. Taking the terms of the series up to second order gives us something of the form a – bx² where b > 0. Then

exp(abx²) = exp(a) exp(-bx²).

The first term on the right is a constant that cam be pulled out of the integral, and the second term is a normal distribution density, modulo vertical scaling.

Now suppose we want to use more terms from the power series for log g. Let’s write out the series as

g(x) = abx² + R(x)

where R(x) is the remainder of the series after the quadratic terms. When we exponentiate the series for log g we get

exp(log g(x)) = exp(a) exp(-bx²) exp(R(x)).

We think of this as the expected value of exp(R(X)) where X is a normal random variable with variance 1/b, up to some constant we pull out of the integral.

We could approximate R(x) with the polynomial formed by the first few terms, but that’s not enough by itself. We need to approximate the expected value of exp(R(X)), not of R(X). The trick is to create another polynomial and find the expectation of that polynomial applied to X.

We need to find a power series for exp(R(x)) and truncate it after a few terms. We don’t need to analytically find the power series for exp(R(x)) directly before truncating it. We can apply the first few terms of the power series for exp to the first few terms of the power series for R and keep all the terms up to the order we want, say up to order k, giving us a kth order polynomial P(x).

Once we have the polynomial P(x), we can find the expectation P(X) in terms of normal moments (see this post) or alternatively by writing the polynomial as a sum of Hermite polynomials.


Laplace approximation of an integral from Bayesian logistic regression


f(x; m, n) = \left( \frac{e^x}{e^x + 1} \right)^m \left( \frac{1}{e^x + 1} \right)^n


I(m, n) = \int_{-\infty}^\infty f(x; m, n) \, dx.
This integral comes up in Bayesian logisitic regression with a uniform (improper) prior. We will use this integral to illustrate a simple case of Laplace approximation.

The idea of Laplace approximation is to approximate the integral of a Gaussian-like function by the integral of a (scaled) Gaussian with the same mode and same curvature at the mode. That’s the most common, second order Laplace approximation. More generally, Laplace’s approximation depends on truncating a Taylor series, and truncating after the 2nd order term gives the approximation described above. One can construct higher-order Laplace approximations by keeping more terms of the power series.

Let x0 be the place where f takes on its maximum and let g = log f. Note that g also takes on its maximum at x0 and so its first derivative is zero there. Then the (second order) Laplace approximation has the following derivation.

\begin{eqnarray*} I(m,n) &=& \int_{-\infty}^\infty f(x; m, n) \, dx \\ &=& \int_{-\infty}^\infty \exp(g(x; m, n))\, dx \\ &=& \int_{-\infty}^\infty \exp\left( \sum_{k=0}^\infty \frac{ g^{(k)}(0) }{k!} (x - x_0)^k\right) \, dx \\ &\approx& \int_{-\infty}^\infty \exp\left( g(x_0) + \frac{g''(x_0)}{2} (x - x_0)^2 \right) \, dx \\ &=& \exp(g(x_0)) \sqrt{\frac{2\pi}{| g''(x_0)|}} . \end{eqnarray*}


We can show that

\frac{\partial}{\partial x} g(x; m, n) = \frac{m - ne^x}{1 + e^x}


\frac{\partial^2}{\partial x^2} g(x; m, n) = -\frac{(m + n)e^x}{(1 + e^x)^2}.

It follows that x0 = log (m/n) and the Laplace approximation for I(m, n) reduces to

 \frac{m^m n^n}{(m+n)^{m+n}} \sqrt{ \frac{2\pi(m+n)}{mn} }.

Now let’s compare the Laplace approximation to the exact value of the integral for a few values of m and n. We would expect the Laplace approximation to be more accurate when the integrand is shaped more like the Gaussian density. This happens when m and n are large and when they are more nearly equal. (The function f is a likelihood, and m+n is the number of data points in the likelihood. More data means the likelihood is more nearly normal. Also, m = n means f is symmetric, which improves the normal approximation.)

We’ll look at (m, n) = (2, 1), (20, 10), (200, 100), and (15, 15). Here’s the plot of f(x, 2, 1) with its normal approximation, scaled vertically so that the heights match.

Laplace approximation plot

Even for small arguments, the Gaussian approximation fits well on the left side, but not so much on the right. For large arguments it would be hard to see the difference between the two curves.

Here are the results of the Laplace approximation and exact integration.

|   m |   n |        approx |         exact |    rel error |
|   2 |   1 |    0.45481187 |           0.5 | -0.090376260 |
|  20 |  10 |  4.9442206e-9 | 4.99250870e-9 | -0.009672111 |
|  15 |  15 | 8.5243139e-10 | 8.5956334e-10 | -0.008297178 |
| 200 | 100 | 3.6037801e-84 | 3.6072854e-84 | -0.000971728 |

The integrals were computed exactly using Mathematica; the results are rational numbers. [1]

Note that when m and n increase by a factor of 10, the relative error goes down by about a factor of 10. This is in keeping with theory that say the error should be O(1/(m+n)). Also, note that the error for (15, 15) is a little lower than that for (20, 10). The errors are qualitatively just what we expected.


[1] I initially used Mathematica to compute the integrals, but then realized that’s not necessary. The change of variables u = exp(x) / (1 + exp(x)) shows that

I(ab) = B(ab+1) + B(a+1, b)

where B is the beta function. Incidentally, this shows that if a and b are integers then  I(ab) is rational.

Efficiency is not associative for matrix multiplication

Here’s a fact that has been rediscovered many times in many different contexts: The way you parenthesize matrix products can greatly change the time it takes to compute the product. This is important, for example, for the back propagation algorithm in deep learning.

Let AB, and C be matrices that are compatible for multiplication. Then (AB)CA(BC). That is, matrix multiplication is associative. But as far as efficiency is concerned, matrix multiplication is not associative: One side of the equation may be much faster to compute than the other.

For any matrix M, let rows(M) be the number of rows in M and let cols(M) be the number of columns. When we multiply two matrices M and N, we multiply every row of M by every column of N [1]. So the number of vector inner products is rows(M) cols(N), and the number of multiplications [2] in each inner product is cols(M). (We must have cols(M) = rows(N) because we implicitly assumed it’s possible to multiple M and N.)

That means the total number of scalar multiplications required to conmpute MN equals

rows(M) cols(M) cols(N) = rows(M) rows(N) cols(N).

Now let’s apply this to computing ABC. Suppose A is an m by n matrix and C is a by q matrix. Then B has to be a n by p matrix in order to be compatible for multiplication.

Computing AB requires mnp multiplications. Once we have AB, a m by p matrix, the number of multiplications to compute (AB)C is mpq. So the total multiplications to compute ABC by first computing AB is mp(nq).

If we compute BC first, this requires npq multiplications, and then multiplying A by BC requires mnq operations, for a total of nq(m + p).

In summary, computing (AB)C requires mp(nq) multiplications, and computing A(BC) requires (m + p)nq multiplications. I turned the second term around to emphasize that in both expressions you do something to m and p, and something else to n and q. You either multiply and add, or add and multiply.

If you’re trying to minimize these terms, you’d rather add big numbers together than multiply them. So if m and p are large relative to n and q, i.e. both A and C are tall matrices, having more rows than columns, then multiplying A(BC) is going to be faster than multiplying (AB)C.

For example, suppose both A and C have a million rows and a hundred columns. Necessarily B would have a hundred rows and a million columns. Computing (AB)C would require 2×1014 multiplications, but computing A(BC) would take 2×1010 multiplications. That is, the latter would be 10,000 times faster.

Related: Applied linear algebra


[1] This post assumes we compute matrix products the usual way, taking the product of every row in the left matrix with every column in the right matrix. It’s theoretically possible, though not practical, to multiply matrices in fewer operations. If the matrices have some exploitable special structure then there could be a faster way to multiply them, but not in general.

[2] It is common in numerical linear algebra to just count multiplications because this gives a good idea of how efficient an algorithm is. Counting additions and multiplications would usually lead to the same conclusions. If you really want to be precise, you’d need to count memory operations. In practice memory management makes more of a difference than whether we count just multiplications or multiplications and additions.




Higher order Taylor series in several variables

Most sources that present Taylor’s theorem for functions of several variables stop at second order terms. One reason is that one or two terms are good enough for many applications. But the bigger reason is that things get more complicated when you include higher order terms.

The kth order term in Taylor’s theorem is a rank k tensor. You can think o rank 0 tensors as numbers, rank 1 tensors as vectors, and rank 2 tensors as matrices. Then we run out of familiar objects. A rank 3 tensor requires you start thinking in terms of tensors rather than more elementary terms.

There is a way to express Taylor’s theorem using only vectors and matrices. Maybe not the most elegant approach, depending on one’s taste, but it avoids any handwaving talk of a tensor being a “higher dimensional boxes of numbers” and such.

There’s a small price to pay. You have to introduce two new but simple ideas: the vec operator and the Kronecker product.

The vec operator takes an m × n matrix A and returns an mn × 1 matrix v, i.e. a column vector, by stacking the columns of A. The first m elements of v are the first column of A, the next m elements of v are the second column of A, etc.

The Kronecker product of an m × n matrix A and a p × q matrix B is a mp × nq matrix KA B. You can think of K as a block partitioned matrix. The ij block of K is aij B. In other words, to form K, take each element of A and replace it with its product with the matrix B.

A couple examples will make this clear.

A = \left\[ \begin{array}{cc} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \end{array} \right]


\mathrm{vec} \, A = \left\[ \begin{array}{c} 1 \\ 3 \\ 5 \\ 7 \\ 2 \\ 4 \\ 6 \\ 8 \end{array} \right]


B= \left\[ \begin{array}{ccc} 10 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 30 \end{array} \right]


A \otimes B= \left\[ \begin{array}{cccccc} 10 & 0 & 0 & 20 & 0 & 0 \\ 0 & 20 & 0 & 0 & 40 & 0 \\ 0 & 0 & 30 & 0 & 0 & 60 \\ 30 & 0 & 0 & 40 & 0 & 0 \\ 0 & 60 & 0 & 0 & 80 & 0 \\ 0 & 0 & 90 & 0 & 0 & 120 \\ 50 & 0 & 0 & 60 & 0 & 0 \\ 0 & 100 & 0 & 0 & 120 & 0 \\ 0 & 0 & 150 & 0 & 0 & 180 \\ 70 & 0 & 0 & 80 & 0 & 0 \\ 0 & 140 & 0 & 0 & 160 & 0 \\ 0 & 0 & 210 & 0 & 0 & 240 \\ \end{array} \right]


Now we write down Taylor’s theorem. Let f be a real-valued function of n variables. Then

f(x + h) = f(x) + f^{(1)}(x) h + \sum_{k=2}^\infty \frac{1}{k!} \left[\stackrel{k-1}{\otimes}h^T \right] f^{(k)}(x) h

where f(0) = f and for k > 0,

f^{(k)}(x) = left. \frac{ \partial \mathrm{vec}\, f^{(k-1)} }{\partial h^T} \right|x

The symbol ⊗ with a number on top means to take the Kronecker product of the argument with itself that many times.

Source: Matrix Differential Calculus by Magnus and Neudecker.

Related post: What is a tensor?

Moment generating functions and connections to other things

This post relates moment generating functions to the Laplace transform and to exponential generating functions. It also brings in connections to the z-transform and the Fourier transform.

Thanks to Brian Borchers who suggested the subject of this post in a comment on a previous post on transforms and convolutions.

Moment generating functions

The moment generating function (MGF) of a random variable X is defined as the expected value of exp(tX). By the so-called rule of the unconscious statistician we have

M_X(t) \equiv \mathrm{E}[e^{tX}] = \int_{-\infty}^\infty e^{tx} f_X(x)\, dx

where fX is the probability density function of the random variable X. The function MX is called the moment generating function of X because it’s nth derivative, evaluated at 0, gives the nth moment of X, i.e. the expected value of Xn.

Laplace transforms

If we flip the sign on t in the integral above, we have the two-sided Laplace transform of fX. That is, the moment generating function of X at t is the two-sided Laplace transform of fX at –t. If the density function is zero for negative values, then the two-sided Laplace transform reduces to the more common (one-sided) Laplace transform.

Exponential generating functions

Since the derivatives of MX at zero are the moments of X, the power series for MX is the exponential generating function for the moments. We have

M_X(t) = m_0 + m_1t + \frac{m_2}{2}t^2 + \frac{m_3}{3!} t^3 + \cdots

where mn is the nth moment of X.

Other generating functions

This terminology needs a little explanation since we’re using “generating function” two or three different ways. The “moment generating function” is the function defined above and only appears in probability. In combinatorics, the (ordinary) generating function of a sequence is the power series whose coefficient of xn is the nth term of the sequence. The exponential generating function is similar, except that each term is divided by n!. This is called the exponential generating series because it looks like the power series for the exponential function. Indeed, the exponential function is the exponential generating function for the sequence of all 1’s.

The equation above shows that MX is the exponential generating function for mn and the ordinary generating function for mn/n!.

If a random variable Y is defined on the integers, then the (ordinary) generating function for the sequence Prob(Yn) is called, naturally enough, the probability generating function for Y.

The z-transform of a sequence, common in electrical engineering, is the (ordinary) generating function of the sequence, but with x replaced with 1/z.

Characteristic functions

The characteristic function of a random variable is a variation on the moment generating function. Rather than use the expected value of tX, it uses the expected value of itX. This means the characteristic function of a random variable is the Fourier transform of its density function.

Characteristic functions are easier to work with than moment generating functions. We haven’t talked about when moment generating functions exist, but it’s clear from the integral above that the right tail of the density has to go to zero faster than ex, which isn’t the case for fat-tailed distributions. That’s not a problem for the characteristic function because the Fourier transform exists for any density function. This is another example of how complex variables simplify problems.