New prime number record: 50th Mersenne prime

A new record for the largest known prime was announced yesterday:

2^{77,232,917} - 1

This number has 23,249,425 digits when written in base 10.

In base 2, 2^p - 1 is a sequence of p ones. For example, 31 = 2^5 - 1, which is 11111 in binary. So in binary, the new record prime is a string of 77,232,917 ones.

77,232,917 ones

You can convert a binary number to hexadecimal (base 16) by starting at the right end and converting blocks of 4 bits into hexadecimal. For example, to convert 101101111 to hex, we break it into three blocks: 1, 0110, and 1111. These blocks convert to 1, 6, and F, and so 101101111 in binary corresponds to 16F in hex.

Now 77,232,917  = 19,308,229 * 4 + 1, so if we break our string of 77,232,917 ones into groups of four, we have a single bit followed by 19,308,229 groups of 4. This means that in hexadecimal, the new prime record is 1FFF…FFF, a 1 followed by 19,308,229 F’s.

19,308,229 Fs
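Here's a quick check of that pattern on a smaller example, assuming nothing more than a Python prompt:

# 13 = 3*4 + 1, so 2^13 - 1 should be a 1 followed by three F's in hex
print(hex(2**13 - 1))   # prints 0x1fff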

The new record is the 50th Mersenne prime. A Mersenne prime is a prime number that is one less than a power of 2, i.e. of the form 2^p - 1. It turns out p must be prime before 2^p - 1 can be prime. In the case of the new record, 77,232,917 is prime.

It is unknown whether there are infinitely many Mersenne primes. But now we know there are at least 50 of them.

All the recent prime number records have been Mersenne primes because there is an algorithm for testing whether a number of the special form 2p – 1 is prime, the Lucas-Lehmer test.
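Here's a minimal sketch of the Lucas-Lehmer test in Python. It is nowhere near efficient enough for record-sized exponents, but it shows the idea:

def lucas_lehmer(p):
    # Return True if the Mersenne number 2^p - 1 is prime. Assumes p is an odd prime.
    M = 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s*s - 2) % M
    return s == 0

# Exponents of the first few Mersenne primes; note 2^11 - 1 = 23 * 89 is composite.
print([p for p in (3, 5, 7, 11, 13, 17, 19, 31) if lucas_lehmer(p)])
# [3, 5, 7, 13, 17, 19, 31]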

Mersenne primes are closely related to perfect numbers. A number n is perfect if the sum of the numbers less than n that divide n equals n. For example, 28 is divisible by 1, 2, 4, 7, and 14, and 1 + 2 + 4 + 7 + 14 = 28.

Finding a new Mersenne prime means finding a new perfect number. If m is a Mersenne prime, then m(m + 1)/2 is a perfect number. Conversely, if n is an even perfect number, n has the form m(m + 1)/2 where m is a Mersenne prime. Nobody knows whether odd perfect numbers exist. Since we don’t know whether there are infinitely many Mersenne primes, we don’t know whether there are infinitely many perfect numbers. But there are at least 50 of them.
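Here's a small script that checks the correspondence for the first few Mersenne primes, a sketch assuming only a Python interpreter:

def divisor_sum(n):
    # Sum of the divisors of n that are strictly less than n
    return sum(d for d in range(1, n) if n % d == 0)

# For each Mersenne prime m, the number m(m + 1)/2 is perfect.
for m in (3, 7, 31, 127):
    n = m*(m + 1)//2
    print(n, divisor_sum(n) == n)   # 6, 28, 496, 8128 -- all perfect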


The Engineer’s Nyquist frequency and the sampling theorem

The Nyquist sampling theorem says that a band-limited signal can be recovered from evenly-spaced samples. If the highest frequency component of the signal is fc then the function needs to be sampled at a frequency of at least the Nyquist frequency 2fc. Or to put it another way, the spacing between samples needs to be no more than Δ = 1/(2fc).

If the signal is given by a function h(t), then the Nyquist-Shannon sampling theorem says we can recover h(t) by

h(t) = \sum_{n=-\infty}^\infty h(n\Delta)\, \mathrm{sinc}(2f_c t- n)

where sinc(x) = sin(πx) / πx.

In practice, signals may not be entirely band-limited, but beyond some frequency fc higher frequencies can be ignored. This means that the cutoff frequency fc is somewhat fuzzy. As we demonstrate below, it’s much better to err on the side of making the cutoff frequency higher than necessary. Sampling at a little less than the necessary frequency can cause the reconstructed signal to be a poor approximation of the original. That is, the sampling theorem is robust to over-sampling but not to under-sampling. There’s no harm from sampling more frequently than necessary. (No harm as far as the accuracy of the equation above. There may be economic costs, for example, that come from using an unnecessarily high sampling rate.)

Let’s look at the function h(t) = cos(18πt) + cos(20πt). The bandwidth of this function is 10 Hz, and so the sampling theorem requires that we sample our function at 20 Hz. If we sample at 20.4 Hz, 2% higher than necessary, the reconstruction lines up with the original function so well that the plots of the two functions agree to the thickness of the plotting line.

Function and reconstruction sampling at 20.4 Hz

But if we sample at 19.6 Hz, 2% less than necessary, the reconstruction is not at all accurate due to problems with aliasing.

Function and reconstruction sampling at 19.6 Hz
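Here's a minimal sketch of the reconstruction, assuming NumPy (note that numpy.sinc already includes the factor of π, matching the definition above). The truncation point of the infinite sum is an arbitrary choice for illustration.

import numpy as np

def h(t):
    return np.cos(18*np.pi*t) + np.cos(20*np.pi*t)

def reconstruct(fs, t, nterms=4000):
    # Truncate the sampling-theorem sum to n = -nterms..nterms
    n = np.arange(-nterms, nterms + 1)
    samples = h(n/fs)
    return np.array([np.sum(samples * np.sinc(fs*ti - n)) for ti in t])

t = np.linspace(0, 2, 400)
for fs in (20.4, 19.6):
    rms = np.sqrt(np.mean((h(t) - reconstruct(fs, t))**2))
    print(f"sampling at {fs} Hz, RMS reconstruction error {rms:.4f}")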

One rule of thumb is to use the Engineer’s Nyquist frequency of 2.5 fc which is 25% more than the exact Nyquist frequency. An engineer’s Nyquist frequency is sorta like a baker’s dozen, a conventional safety margin added to a well-known quantity.

Update: Here’s a plot of the error, the RMS difference between the signal and its reconstruction, as a function of sampling frequency.

RMS error in reconstruction as a function of sampling frequency

By the way, the function in the example demonstrates beats. The sum of a 9 Hz signal and a 10 Hz signal is a 9.5 Hz signal modulated at 0.5 Hz. More details on beats in this post on AM radio and musical instruments.

Making sense of a probability problem in the WSJ

Wall Street Journal clipping

Someone wrote to me the other day asking if I could explain a probability example from the Wall Street Journal. (“Proving Investment Success Takes Time,” Spencer Jakab, November 25, 2017.)

Victor Haghani … and two colleagues told several hundred acquaintances who worked in finance that they would flip two coins, one that was normal and the other that was weighted so it came up heads 60% of the time. They asked the people how many flips it would take them to figure out, with a 95% confidence level, which one was the 60% coin. Told to give a “quick guess,” nearly a third said fewer than 10 flips, while the median response was 40. The correct answer is 143.

The anecdote is correct in spirit: it takes longer to discover the better of two options than most people suppose. But it’s jarring to read the answer is precisely 143 when the question hasn’t been stated clearly.

How many flips would it take to figure out which coin is better with a 95% confidence level? For starters, the answer would have to be a distribution, not a single number. You might quickly come to the right conclusion. You might quickly come to the wrong conclusion. You might flip coins for a long time and never come to a conclusion. Maybe there is a way of formulating the problem so that the expected value of the distribution is 143.

How are you to go about flipping the coins? Do you flip both of them, or just flip one coin? For example, you might flip both coins until you are confident that one is better, and conclude that the better one is the one that was designed to come up heads 60% of the time. Or you could just flip one of them and test the hypothesis Prob(heads) = 0.5 versus the alternative Prob(heads) = 0.6. Or maybe you flip one coin two times for every one time you flip the other. Etc.

What do you mean by “95% confidence level”? Is this a frequentist confidence interval? And do you compute the (Bayesian) predictive probability of arriving at such a confidence level? Are you computing the (Bayesian) posterior model probabilities of two models, one in which the first coin has probability of heads 0.5 and the second has probability 0.6 versus the opposite model?

Do you assume that you know a priori that one coin has probability of heads 0.5 and the other 0.6, or do you not assume this and just want to find the coin with higher probability of heads, and evaluate such a model when in fact the probabilities of heads are as stated?

Are you conducting an experiment with a predetermined sample size of 143? Or are you continuously monitoring the data, stopping when you reach your conclusion?

I leave it as an exercise to the reader to implement the various alternatives suggested above and see whether one of them produces 143 as a result. (I did a back-of-the-envelope calculation that suggests there is one.) So the first question is to reverse engineer which problem statement the article was based on. The second question is to decide which problem formulation you believe would be most appropriate in the context of the article.
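For example, here's a sketch of one such formulation, and everything about it is an assumption on my part: flip both coins once per round, compute Bayesian posterior probabilities for the two models "coin A is the 60% coin" and "coin B is the 60% coin" with equal prior weight, and stop when either posterior reaches 95%.

import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed

def rounds_to_decide(p_biased=0.6, threshold=0.95):
    # By construction coin A is the biased coin; the two models don't know that.
    log_bf = 0.0   # log Bayes factor for "A is biased" over "B is biased"
    rounds = 0
    while True:
        rounds += 1
        a = rng.random() < p_biased   # flip coin A
        b = rng.random() < 0.5        # flip coin B
        log_bf += np.log((p_biased if a else 1 - p_biased) / 0.5)
        log_bf += np.log(0.5 / (p_biased if b else 1 - p_biased))
        post = 1/(1 + np.exp(-log_bf))   # posterior probability that A is biased
        if post >= threshold:
            return rounds, True
        if post <= 1 - threshold:
            return rounds, False

results = [rounds_to_decide() for _ in range(10_000)]
rounds = np.array([r for r, _ in results])
print("mean rounds:", rounds.mean(), "median rounds:", np.median(rounds))
print("proportion correct:", np.mean([c for _, c in results]))

Each round uses two flips, so multiply the number of rounds by two before comparing with flip counts like 143. Whether any of these formulations lands on 143 exactly I leave, as above, to the reader.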

Exponential sum for the new year

Exponential sums can make intricate patterns. Last year I made a page that displays a different image each day, using the month, day, and year as parameters in the expression below. The images plot the partial sums of this sum.

\sum_{n=0}^N \exp\left( 2\pi i \left( \frac{n}{m} + \frac{n^2}{d} + \frac{n^3}{y} \right ) \right )

This was yesterday’s image.

Image of the day, New Year's Eve.

Today’s image is surprisingly plain if we use y = 18.

This is in part because the least common multiple of 1, 1, and 18 is 18. The image could have no more than 18 vertices. In fact, it has only 6 vertices because the summand above has period 6.

But if we use y = 2018 we get something much more complex.
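Here's a short sketch for plotting these partial sums yourself, assuming NumPy and Matplotlib; the number of terms N is an arbitrary choice.

import numpy as np
import matplotlib.pyplot as plt

def partial_sums(m, d, y, N=2000):
    n = np.arange(N + 1)
    terms = np.exp(2j*np.pi*(n/m + n**2/d + n**3/y))
    return np.cumsum(terms)

# January 1 with a two-digit and a four-digit year
for y in (18, 2018):
    z = partial_sums(1, 1, y)
    plt.figure()
    plt.plot(z.real, z.imag)
    plt.gca().set_aspect(1)
    plt.title(f"m = 1, d = 1, y = {y}")
    plt.show()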

The Exponential Sum of the Day page will use y = 18 this year. There will be a few simple images this year but there will also be lots of surprises.

Free technical books, mostly chemical engineering

Retiring professor Leonard Fabiano contacted me looking to give away a set of technical books, mostly chemical engineering books. If you’re interested please email him at lenfab@live.com.

Here are the books:

Chemical engineering books

Click on the image to see a larger version.

Two titles are not possible to read in the photo. These are

  • Conduction of heat in solids by Carslaw and Jaeger
  • Molecular Thermodynamics of Fluid Phase Equilibria by Prausnitz

Equation for the Eiffel Tower

Robert Banks’s book Towing Icebergs, Falling Dominoes, and Other Adventures in Applied Mathematics describes the Eiffel Tower’s shape as approximately the logarithmic curve

y = -y_* \log\left(\frac{x}{x_0} \right )

where y* and x0 are chosen to match the tower’s dimensions.

Here’s a plot of the curve:

Eiffel tower curve plot

And here’s the code that produced the plot:

from numpy import log, exp, linspace, vectorize
import matplotlib.pyplot as plt

# Taken from "Towing Icebergs, Falling Dominoes,
# and Other Adventures in Applied Mathematics"
# by Robert B. Banks

# Constants given in Banks in feet. Convert to meters.
feet_to_meter = 0.0254*12
ystar  = 201*feet_to_meter
x0     = 207*feet_to_meter
height = 984*feet_to_meter

# Solve for where to cut off curve to match height of the tower.
# - ystar log xmin/x0 = height
xmin = x0 * exp(-height/ystar)

def f(x):
    if -xmin < x < xmin:
        return height
    else:
        return -ystar*log(abs(x/x0))
curve = vectorize(f)
    
x = linspace(-x0, x0, 400)

plt.plot(x, curve(x))
plt.xlim(-2*x0, 2*x0)
plt.xlabel("Meters")
plt.ylabel("Meters")
plt.title("Eiffel Tower")

plt.gca().set_aspect(1)
plt.savefig("eiffel_tower.svg")

Related post: When length equals area
The St. Louis arch is approximately a catenary, i.e. a hyperbolic cosine.

Hermite polynomials, expected values, and integration

In the previous post, I alluded to using Hermite polynomials in conjunction with higher-order Laplace approximation. In this post I’ll expand on what that means.

Hermite polynomials are orthogonal polynomials over the real line with respect to the weight given by the standard normal distribution. (There are two conventions for defining Hermite polynomials, what Wikipedia calls the physicist convention and the probabilist convention. We’re using the latter here.)

The first few Hermite polynomials are 1, x, x² − 1, x³ − 3x, … . You can find the rest of the Hermite polynomials via the recurrence relation

Hn+1(x) = x Hn(x) − n Hn-1(x).

The odd order Hermite polynomials are odd functions, and the standard normal density is an even function, so the integral of their product is zero. The even order Hermite polynomials of order m ≥ 2 also integrate to zero against the standard normal density, because they are orthogonal to H0 = 1. In probability terms, if X is a standard normal random variable, the expected value of Hk(X) is 1 if k = 0 and 0 otherwise. Contrast this with the raw moments: the expected value of X^m is (m − 1)!!, i.e. (m − 1) double factorial, if m is even and 0 if m is odd.

You could think of the Hermite polynomials as the right basis to use when working with normal probability distributions. In the monomial basis, integrating a polynomial against the standard normal density amounts to applying the moment formula term by term:

\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty \left(\sum_{k=0}^N a_k\,x^k\right) \exp(-x^2/2) \,dx = \sum_{k=0}^N a_k \, [k\, \mathrm{even}] \,(k-1)!!

Here [k even] is the function that returns 1 if k is even and 0 otherwise. This notation was introduced by Kenneth Iverson in the APL programming language and has become moderately common in mathematics. Writing the polynomial as a linear combination of Hermite polynomials is a change of basis that makes the integration even simpler: every Hermite polynomial except H0 integrates to zero, so the integral is just the coefficient of H0.
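Here's a quick numerical check of the statements above, a minimal sketch assuming NumPy and SciPy are available. It builds the probabilist Hermite polynomials from the recurrence and integrates them against the standard normal density.

import numpy as np
from numpy.polynomial import Polynomial as P
from scipy.integrate import quad

def hermite(k):
    # Probabilist Hermite polynomial H_k from the recurrence
    # H_{n+1}(x) = x H_n(x) - n H_{n-1}(x)
    h_prev, h = P([1]), P([0, 1])   # H_0 = 1, H_1 = x
    if k == 0:
        return h_prev
    for n in range(1, k):
        h_prev, h = h, P([0, 1])*h - n*h_prev
    return h

def normal_expectation(p):
    # E[p(X)] for X ~ N(0, 1), computed by numerical integration
    density = lambda x: np.exp(-x**2/2) / np.sqrt(2*np.pi)
    return quad(lambda x: p(x)*density(x), -np.inf, np.inf)[0]

for k in range(6):
    # Expect 1 for k = 0 and 0 for k >= 1
    print(k, hermite(k).coef, round(normal_expectation(hermite(k)), 10))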

 

Higher-order Laplace approximation

Yesterday’s post presented the most common form of Laplace approximation, the second order version, and mentioned in passing that there are higher order versions. The extension to higher order is not trivial, so this post gives a high level overview of how you’d do it.

The previous post looked at integrating f(x) = exp(g(x)), where g = log f, by expanding g(x) in a power series centered at its maximum. Without loss of generality we can assume the maximum occurs at 0; otherwise do a change of variables first to make this happen.

The first order term of the series is missing because the derivative of g is zero at its maximum. Taking the terms of the series up to second order gives us something of the form a – bx² where b > 0. Then

exp(a - bx²) = exp(a) exp(-bx²).

The first term on the right is a constant that can be pulled out of the integral, and the second term is a normal distribution density, modulo vertical scaling.

Now suppose we want to use more terms from the power series for g. Let’s write out the series as

g(x) = a - bx² + R(x)

where R(x) is the remainder of the series after the quadratic terms. When we exponentiate the series for g we get

exp(g(x)) = exp(a) exp(-bx²) exp(R(x)).

We think of this as the expected value of exp(R(X)) where X is a normal random variable with variance 1/(2b), up to some constant we pull out of the integral.

We could approximate R(x) with the polynomial formed by the first few terms, but that’s not enough by itself. We need to approximate the expected value of exp(R(X)), not of R(X). The trick is to create another polynomial and find the expectation of that polynomial applied to X.

We need to find a power series for exp(R(x)) and truncate it after a few terms. We don’t need to analytically find the power series for exp(R(x)) directly before truncating it. We can apply the first few terms of the power series for exp to the first few terms of the power series for R and keep all the terms up to the order we want, say up to order k, giving us a kth order polynomial P(x).

Once we have the polynomial P(x), we can find the expectation of P(X) in terms of normal moments (see this post) or alternatively by writing the polynomial as a sum of Hermite polynomials.
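For example, here's a minimal sketch of the normal-moments route; the helper names and the example polynomial are hypothetical, just for illustration.

import numpy as np

def normal_moment(k, sigma):
    # E[X^k] for X ~ N(0, sigma^2): zero for odd k, (k-1)!! sigma^k for even k
    if k % 2 == 1:
        return 0.0
    return float(np.prod(np.arange(k - 1, 0, -2))) * sigma**k

def expected_poly(coeffs, sigma):
    # E[P(X)] where P has the given coefficients, lowest degree first,
    # and X ~ N(0, sigma^2)
    return sum(c * normal_moment(k, sigma) for k, c in enumerate(coeffs))

# Hypothetical example: P(x) = 1 + x^3/6 + x^4/24 and X ~ N(0, 1)
print(expected_poly([1, 0, 0, 1/6, 1/24], 1.0))   # 1 + 0 + 3/24 = 1.125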

 

Laplace approximation of an integral from Bayesian logistic regression

Define

f(x; m, n) = \left( \frac{e^x}{e^x + 1} \right)^m \left( \frac{1}{e^x + 1} \right)^n

and

I(m, n) = \int_{-\infty}^\infty f(x; m, n) \, dx.

This integral comes up in Bayesian logistic regression with a uniform (improper) prior. We will use this integral to illustrate a simple case of Laplace approximation.

The idea of Laplace approximation is to approximate the integral of a Gaussian-like function by the integral of a (scaled) Gaussian with the same mode and same curvature at the mode. That’s the most common, second order Laplace approximation. More generally, Laplace’s approximation depends on truncating a Taylor series, and truncating after the 2nd order term gives the approximation described above. One can construct higher-order Laplace approximations by keeping more terms of the power series.

Let x0 be the place where f takes on its maximum and let g = log f. Note that g also takes on its maximum at x0 and so its first derivative is zero there. Then the (second order) Laplace approximation has the following derivation.

\begin{eqnarray*} I(m,n) &=& \int_{-\infty}^\infty f(x; m, n) \, dx \\ &=& \int_{-\infty}^\infty \exp(g(x; m, n))\, dx \\ &=& \int_{-\infty}^\infty \exp\left( \sum_{k=0}^\infty \frac{ g^{(k)}(x_0) }{k!} (x - x_0)^k\right) \, dx \\ &\approx& \int_{-\infty}^\infty \exp\left( g(x_0) + \frac{g''(x_0)}{2} (x - x_0)^2 \right) \, dx \\ &=& \exp(g(x_0)) \sqrt{\frac{2\pi}{| g''(x_0)|}} . \end{eqnarray*}

 

We can show that

\frac{\partial}{\partial x} g(x; m, n) = \frac{m - ne^x}{1 + e^x}

and

\frac{\partial^2}{\partial x^2} g(x; m, n) = -\frac{(m + n)e^x}{(1 + e^x)^2}.

It follows that x0 = log (m/n) and the Laplace approximation for I(m, n) reduces to

 \frac{m^m n^n}{(m+n)^{m+n}} \sqrt{ \frac{2\pi(m+n)}{mn} }.

Now let’s compare the Laplace approximation to the exact value of the integral for a few values of m and n. We would expect the Laplace approximation to be more accurate when the integrand is shaped more like the Gaussian density. This happens when m and n are large and when they are more nearly equal. (The function f is a likelihood, and m+n is the number of data points in the likelihood. More data means the likelihood is more nearly normal. Also, m = n means f is symmetric, which improves the normal approximation.)

We’ll look at (m, n) = (2, 1), (20, 10), (200, 100), and (15, 15). Here’s the plot of f(x, 2, 1) with its normal approximation, scaled vertically so that the heights match.

Laplace approximation plot

Even for small arguments, the Gaussian approximation fits well on the left side, but not so much on the right. For large arguments it would be hard to see the difference between the two curves.

Here are the results of the Laplace approximation and exact integration.

|-----+-----+---------------+---------------+--------------|
|   m |   n |        approx |         exact |    rel error |
|-----+-----+---------------+---------------+--------------|
|   2 |   1 |    0.45481187 |           0.5 | -0.090376260 |
|  20 |  10 |  4.9442206e-9 | 4.99250870e-9 | -0.009672111 |
|  15 |  15 | 8.5243139e-10 | 8.5956334e-10 | -0.008297178 |
| 200 | 100 | 3.6037801e-84 | 3.6072854e-84 | -0.000971728 |
|-----+-----+---------------+---------------+--------------|

The integrals were computed exactly using Mathematica; the results are rational numbers. [1]

Note that when m and n increase by a factor of 10, the relative error goes down by about a factor of 10. This is in keeping with theory that says the error should be O(1/(m+n)). Also, note that the error for (15, 15) is a little lower than that for (20, 10). The errors are qualitatively just what we expected.

***

[1] I initially used Mathematica to compute the integrals, but then realized that’s not necessary. The change of variables u = exp(x) / (1 + exp(x)) shows that

I(a, b) = B(a, b+1) + B(a+1, b)

where B is the beta function. Incidentally, this shows that if a and b are integers then I(a, b) is rational.
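Here's a short script, a sketch assuming NumPy and SciPy, that reproduces the table above using the closed form approximation and the beta function identity from footnote [1].

import numpy as np
from scipy.special import betaln

def laplace_approx(m, n):
    # Second order Laplace approximation derived above
    return m**m * n**n / (m + n)**(m + n) * np.sqrt(2*np.pi*(m + n)/(m*n))

def exact(m, n):
    # I(m, n) = B(m, n+1) + B(m+1, n), computed via logs for numerical stability
    return np.exp(betaln(m, n + 1)) + np.exp(betaln(m + 1, n))

for m, n in [(2, 1), (20, 10), (15, 15), (200, 100)]:
    a, e = laplace_approx(m, n), exact(m, n)
    print(m, n, a, e, (a - e)/e)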

Scholarship versus research

One of the things about academia that most surprised and disappointed me was the low regard for scholarship. Exploration is tolerated as long as it results in a profusion of journal articles, and of course grant money, but is otherwise frowned upon. For example, I know someone who ruined his academic career by writing a massive scholarly book rather than cranking out papers.

I recently ran across an essay [1] in which C. S. Lewis expressed similar concerns sixty years ago, referring to “the incubus of Research.” In the essay he describes a young academic in the humanities who

… far from being able or anxious … to add to the sum of human knowledge, wants to acquire a good deal more of the knowledge we already have. He has lately begun to discover how many things he needs to know in order to follow up his budding interests … To head him off from these studies, to pinfold him in some small inquiry whose chief claim is that no one has made it before is cruel and frustrating. It wastes such years as he will never have again …

My favorite part of the quote is describing research as “some small inquiry whose chief claim is that no one has made it before.” As Lewis said elsewhere, striving for originality can thwart originality.

[1] “Interim Report.” First published in The Cambridge Review in 1956 and reprinted as chapter 17 of Present Concerns.

Intellectual onramps

Tyler Cowen’s latest blog post gives advice for learning about modern China. He says that “books about sequences of dynasties are mind-numbing and not readily absorbed” and recommends finding other entry points before reading about dynasties.

Find an “entry point” into China of independent intrinsic interest to you, be it basketball, artificial intelligence, Chinese opera, whatever.

In a podcast interview—sorry, I no longer remember which one—Cowen talked more generally about finding entry points or onramps for learning big topics. The blog post mentioned above applies this specifically to China, but he gave other examples of coming to a subject through a side door rather than the front entrance. If I remember correctly, he mentioned learning the politics or economics of a region by first studying its architecture or food.

I’ve stumbled upon a number of intellectual onramps through my career, but I haven’t been as deliberate as Cowen in seeking them out. I had no interest in medicine before I ended up working for the world’s largest cancer center. I learned a bit about cancer and genetics from working at MD Anderson and I’ve since learned a little about other areas of medicine working with various clients. Right now I’m working on projects in nephrology and neurology.

Applied math is my onramp to lots of things I might not pursue otherwise. As John Tukey said, you get to play in everyone else’s backyard.

There are many things I’ve tried and failed to learn via a frontal assault. For example, I’ve tried several times to learn algebraic geometry by simply reading a book on the subject. But I find all the abstract machinery mind-numbing and difficult to absorb, just as Cowen described his first exposure to Chinese history. If I’m ever to learn much algebraic geometry, it will start with an indirect entry point, such as a concrete problem I need to solve.

Efficiency is not associative for matrix multiplication

Here’s a fact that has been rediscovered many times in many different contexts: The way you parenthesize matrix products can greatly change the time it takes to compute the product. This is important, for example, for the back propagation algorithm in deep learning.

Let A, B, and C be matrices that are compatible for multiplication. Then (AB)C = A(BC). That is, matrix multiplication is associative. But as far as efficiency is concerned, matrix multiplication is not associative: One side of the equation may be much faster to compute than the other.

For any matrix M, let rows(M) be the number of rows in M and let cols(M) be the number of columns. When we multiply two matrices M and N, we multiply every row of M by every column of N [1]. So the number of vector inner products is rows(M) cols(N), and the number of multiplications [2] in each inner product is cols(M). (We must have cols(M) = rows(N) because we implicitly assumed it’s possible to multiply M and N.)

That means the total number of scalar multiplications required to compute MN equals

rows(M) cols(M) cols(N) = rows(M) rows(N) cols(N).

Now let’s apply this to computing ABC. Suppose A is an m by n matrix and C is a p by q matrix. Then B has to be an n by p matrix in order to be compatible for multiplication.

Computing AB requires mnp multiplications. Once we have AB, an m by p matrix, the number of multiplications to compute (AB)C is mpq. So the total number of multiplications to compute ABC by first computing AB is mnp + mpq = mp(n + q).

If we compute BC first, this requires npq multiplications, and then multiplying A by BC requires mnq multiplications, for a total of npq + mnq = nq(m + p).

In summary, computing (AB)C requires mp(n + q) multiplications, and computing A(BC) requires (m + p)nq multiplications. I turned the second term around to emphasize that in both expressions you do something to m and p, and something else to n and q. You either multiply and add, or add and multiply.

If you’re trying to minimize these terms, you’d rather add big numbers together than multiply them. So if m and p are large relative to n and q, i.e. both A and C are tall matrices, having more rows than columns, then multiplying A(BC) is going to be faster than multiplying (AB)C.

For example, suppose both A and C have a million rows and a hundred columns. Necessarily B would have a hundred rows and a million columns. Computing (AB)C would require 2 × 10^14 multiplications, but computing A(BC) would take 2 × 10^10 multiplications. That is, the latter would be 10,000 times faster.
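Here's a quick experiment, a sketch using NumPy with smaller dimensions than the example above (chosen so the intermediate product fits comfortably in memory), that shows the effect.

import time
import numpy as np

rng = np.random.default_rng(0)
m, n, p, q = 3000, 50, 3000, 50          # A and C are tall: many rows, few columns
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, q))

t0 = time.perf_counter()
left = (A @ B) @ C                        # mp(n + q) scalar multiplications
t1 = time.perf_counter()
right = A @ (B @ C)                       # nq(m + p) scalar multiplications
t2 = time.perf_counter()

print("(AB)C:", t1 - t0, "seconds")
print("A(BC):", t2 - t1, "seconds")
print("Same result?", np.allclose(left, right))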

Related: Applied linear algebra

***

[1] This post assumes we compute matrix products the usual way, taking the product of every row in the left matrix with every column in the right matrix. It’s theoretically possible, though not practical, to multiply matrices in fewer operations. If the matrices have some exploitable special structure then there could be a faster way to multiply them, but not in general.

[2] It is common in numerical linear algebra to just count multiplications because this gives a good idea of how efficient an algorithm is. Counting additions and multiplications would usually lead to the same conclusions. If you really want to be precise, you’d need to count memory operations. In practice memory management makes more of a difference than whether we count just multiplications or multiplications and additions.


Higher order Taylor series in several variables

Most sources that present Taylor’s theorem for functions of several variables stop at second order terms. One reason is that one or two terms are good enough for many applications. But the bigger reason is that things get more complicated when you include higher order terms.

The kth order term in Taylor’s theorem is a rank k tensor. You can think of rank 0 tensors as numbers, rank 1 tensors as vectors, and rank 2 tensors as matrices. Then we run out of familiar objects. A rank 3 tensor requires you to start thinking in terms of tensors rather than more elementary terms.

There is a way to express Taylor’s theorem using only vectors and matrices. Maybe not the most elegant approach, depending on one’s taste, but it avoids any handwaving talk of a tensor being a “higher dimensional box of numbers” and such.

There’s a small price to pay. You have to introduce two new but simple ideas: the vec operator and the Kronecker product.

The vec operator takes an m × n matrix A and returns an mn × 1 matrix v, i.e. a column vector, by stacking the columns of A. The first m elements of v are the first column of A, the next m elements of v are the second column of A, etc.

The Kronecker product of an m × n matrix A and a p × q matrix B is the mp × nq matrix K = A ⊗ B. You can think of K as a block partitioned matrix. The ij block of K is aij B. In other words, to form K, take each element of A and replace it with its product with the matrix B.

A couple examples will make this clear.

A = \left[ \begin{array}{cc} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \end{array} \right]

 

\mathrm{vec} \, A = \left[ \begin{array}{c} 1 \\ 3 \\ 5 \\ 7 \\ 2 \\ 4 \\ 6 \\ 8 \end{array} \right]

 

B= \left[ \begin{array}{ccc} 10 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 30 \end{array} \right]

 

A \otimes B= \left[ \begin{array}{cccccc} 10 & 0 & 0 & 20 & 0 & 0 \\ 0 & 20 & 0 & 0 & 40 & 0 \\ 0 & 0 & 30 & 0 & 0 & 60 \\ 30 & 0 & 0 & 40 & 0 & 0 \\ 0 & 60 & 0 & 0 & 80 & 0 \\ 0 & 0 & 90 & 0 & 0 & 120 \\ 50 & 0 & 0 & 60 & 0 & 0 \\ 0 & 100 & 0 & 0 & 120 & 0 \\ 0 & 0 & 150 & 0 & 0 & 180 \\ 70 & 0 & 0 & 80 & 0 & 0 \\ 0 & 140 & 0 & 0 & 160 & 0 \\ 0 & 0 & 210 & 0 & 0 & 240 \\ \end{array} \right]
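These operations are easy to check numerically. Here's a minimal sketch using NumPy's flatten and kron functions on the matrices above.

import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
B = np.diag([10, 20, 30])

vec_A = A.flatten(order="F")   # stack the columns: Fortran (column-major) order
print(vec_A)                   # [1 3 5 7 2 4 6 8]

K = np.kron(A, B)              # block (i, j) of K is A[i, j] * B
print(K.shape)                 # (12, 6)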

 

Now we write down Taylor’s theorem. Let f be a real-valued function of n variables. Then

f(x + h) = f(x) + f^{(1)}(x) h + \sum_{k=2}^\infty \frac{1}{k!} \left[\stackrel{k-1}{\otimes}h^T \right] f^{(k)}(x) h

where f^{(0)} = f and for k > 0,

f^{(k)}(x) = \left. \frac{ \partial\, \mathrm{vec}\, f^{(k-1)} }{\partial x^T} \right|_x

The symbol ⊗ with a number on top means to take the Kronecker product of the argument with itself that many times.

Source: Matrix Differential Calculus by Magnus and Neudecker.

Related post: What is a tensor?