Test functions

Test functions are how you can make sense of functions that aren’t really functions.

The canonical example is the Dirac delta “function” that is infinite at the origin, zero everywhere else, and integrates to 1. That description is contradictory: a function that is 0 almost everywhere integrates to 0, even if you work in extended real numbers where a function can take on the value ∞.

You can make things like the delta function rigorous by saying they’re not functions of real numbers, but functions that operate on other functions, i.e. test functions. More on that here. These functions acting on test functions are called generalized functions or distributions. (This this post for how this kind of distribution differs from a probability distribution, but is analogous.)

Analogy with test charges

To say rigorously how these generalized functions behave you show how they act on test functions φ. Test functions are analogous to test charges: you can describe an electric field by saying what force it would exert on a test charge.

A test charge is somewhat idealized. It has to be so small that it tests a field without effecting it. This isn’t really possible, but you can think of it as a limit. You’re looking at the limit of the force per unit charge as the charge goes to zero.

Similarly, a test function is ideal in that it very well behaved so the generalized functions that act on it can be badly behaved. Test functions are infinitely differentiable, and they either have compact support or have extremely thin tails, depending on context.

Analogy with category theory

While writing the previous post I thought about an analogy between distribution theory and category theory. I worked with distribution theory a lot in grad school, and I found it natural that the definition of a distribution depended on how it acted in relation to something else, i.e. how it acted on all test functions.

But I found category definitions that involved extraneous objects puzzling. For example, the product of two objects is a third object such that for any fourth object (!) a certain diagram commutes. Why is this superfluous object doing injecting itself into the definition? If I’d thought of it as a test object then I would have found the definition more palatable.

As with distribution theory, you’re defining something by how it relates to all elements of some collection. But in distribution theory, your distributions and your test functions are very distinct things. In category theory, your test objects are peers of the thing you’re testing.

Software and the Allee effect

The Allee effect is named after Warder Clyde Allee who added a term to the famous logistic equation. His added term is highlighted in blue.

\frac{dN}{dt} = r N {\color{red}\left( \frac{N}{A} - 1 \right)} \left( 1 - \frac{N}{K} \right)

Here N is the population of a species over time, r is the intrinsic rate of increase, K is the carrying capacity, and A is the critical point.

If you remove Allee’s term, you get an equation saying that the rate of growth of a population is proportional to the current population size, and so growth starts out exponential, and a term (1 – N/K), which says growth slows down as the population approaches its carrying capacity.

Allee’s term (N/A – 1) says that the rate of growth becomes negative when the population falls below some threshold A. When there are too few individuals, survival becomes more difficult.

Software metaphor

I thought of the Allee effect as a metaphor for software technology after writing my previous post. In general, problems become easier to solve over time. Software development may become harder because the problems developers solve are changing, but solving old problems typically gets easier. Algorithms improve and get wrapped up for convenience. There’s something like logistic growth where tasks get easier to solve, but improvement slows down over time.

If a problem is specialized, it can run into something like the Allee effect. It becomes harder over time because fewer people are interested in it. Software isn’t maintained as fast as it degrades. Fewer people have experience with it. It’s harder to be a COBOL programmer, for example, than it used to be. But this can also apply to much more current problems. A problem that was hot five years ago can be harder to solve now than it was then, for reasons the previous post discusses.

Complex differential equations

Differential equations in the complex plane are different from differential equations on the real line.

Suppose you have an nth order linear differential equation as follows.

y^{(n)} + a_{n-1}y^{(n-1)} + a_{n-2}y^{(n-2)} + \cdots a_0y = 0

The real case

If the a‘s are continuous, real-valued functions on an open interval of the real line, then there are n linearly independent solutions over that interval. The coefficients do not need to be differentiable for the solutions to be differentiable.

The complex case

If the a‘s are continuous, complex-valued functions on an connected open of the real line, then there may not be n linearly independent solutions over that interval.

If the a‘s are all analytic, then there are indeed n independent solutions. But if any one of the a‘s is merely continuous and not analytic, there will be fewer than n independent solutions. There may be as many as n − 1 solutions, or there may not be any solutions at all.

Source: A. K. Bose. Linear Differential Equations on the Complex Plane. The American Mathematical Monthly, Vol. 89, No. 4 (Apr., 1982), pp 244–246.

What good is a DE once you have the solution?

In an undergraduate course on differential equations, you see an equation, you find a solution. Wax on, wax off. It would be reasonable to assume that the differential equation is simply an obstacle to overcome and that once you have the solution, the differential equation has done its job and can be discarded.

It would seem backward to ask whether a function satisfies a differential equation. It is backward, but that doesn’t mean it’s a bad idea. It can be very useful to know that a function you’re interested in satisfies a nice differential equation. (It’s trivial to make up differential equations for any function; the key is to discover a nice differential equation, a simple relationship between a function and its derivatives.)

The previous post gave an example of this. There we said that although Halley’s root-finding method converges faster than Newton’s method, it also requires one more function evaluation, namely the second derivative of the function whose root you’re finding. But when that function satisfies a second order differential equation, you can get the second derivative for free once you’ve evaluated the function and its derivative. Well, at a deep discount at least if not free.

In that post we gave the example of a Bessel function. Once you have evaluated a Bessel function and its first derivative at a point. you can calculate the second derivative in much less time that it took to evaluate the Bessel function itself.

Bessel functions are not unique in this regard. Hypergeometric functions are a large and useful class of functions, and these functions all satisfy Riemann’s differential equation.

All the classic orthogonal polynomials—Chebyshev polynomials, Jacobi polynomials, Laguerre polynomials, Hermite polynomials, etc.—satisfy a linear second order differential equation. For example, the Legendre polynomials Pn satisfy

(1 - x^2) y''(x) - 2 x y'(x) + \nu (\nu + 1) y = 0

And its not just Bessel functions, hypergeometric functions, and orthogonal polynomials. Most special functions satisfy a useful differential equation. More specifically, a linear second order differential equation.

So far the only reason given for wanting to know what differential equation a function satisfies is to speed up calculations: evaluate two functions and get a third one for cheap. But there are many other reasons to care what differential equation a function satisfies. For example, a common way to show that two functions are equal is to show that they satisfy the same differential equation with the same initial conditions. This is applying a uniqueness theorem for differential equations, but many other equations about differential equations could be useful to apply.

Related posts

Halley’s variation on Newton’s method

Newton’s method for solving f(x) = 0 converges quickly once it starts to converge. Specifically, it converges quadratically: the error is roughly squared at each step. This means that the number of correct decimal places doubles at each iteration of Newton’s method.

Doubling the number of decimal places is nice, but what if we could triple the number of correct decimals at each step? Edmond Halley, best known for the comet that bears his name, came up with a way to do just that. Newton’s method converges quadratically, but Halley’s method converges cubically.

Newton’s method updates iterations using the rule

x_{n+1} = x_n + \frac{f(x_n)}{f'(x_n)}

Halley’s variation uses the rule

x_{n+1} = x_n - \frac{1}{\dfrac{f'(x_n)}{f(x_n)} - \dfrac{f''(x)}{2f'(x)}}

Halley’s method makes more progress per iteration, but it also takes more work per iteration, and so in practice it’s not usually an improvement over Newton’s method. But in some cases it could be.

Evaluating efficiency

In root-finding problems, the function f is usually relatively time-consuming to evaluate. We can assume nearly all the work in root-finding goes into evaluating f and its derivatives, and we can ignore the rest of the arithmetic in the method.

Newton’s method requires evaluating the function f and its derivative f ′. Halley’s method requires these function evaluates as well and also requires evaluating the second derivative f ′′.

Three iterations of Newton’s method make as much progress toward finding a root as two iterations of Halley’s method. But since Newton’s method requires two function evaluations and Halley’s method requires three, both require six function evaluations to make the same amount of progress. So in general Halley’s method is not an improvement over Newton’s method.

When Halley might be better

Suppose we had a way to quickly compute the second derivative of f at a point from the values of f and its first derivative at that point, rather than by having to evaluate the second derivative explicitly. In that case two iterations of Halley’s method might make as much progress as three iterations of Newton’s method but with less effort, four function evaluations rather than six.

Many of the functions that are used most in applied mathematics satisfy 2nd order differential equations, and so it is common to be able to compute the second derivative from the function and its derivative.

For example, suppose we want to find the zeros of Bessel functions Jν. Bessel functions satisfy Bessel’s differential equation; that’s where Bessel functions get their name.

x^2 y'' + x y' + \left(x^2 - \nu^2 \right) y = 0

This means that once we have evaluated Jν and its first derivative, we can compute the second derivative from these values as follows.

J_{\nu}''(x_n) = \frac{(\nu^2 - x_n^2)J_\nu(x_n) - x_nJ_\nu'(x_n)}{x_n^2}

Maybe we’re not working with a well-studied function like a Bessel function but rather we are numerically solving a differential equation of our own creation and we want to know where the solution is zero. Halley’s method would be a natural fit because our numerical method will give us the values of y and y ′ at each step.

Related posts

Drag equation exponent variation

The motion of a falling body of mass m is given by

m \frac{dv}{dt} = mg - kv^r

where the term −kvr accounts for drag due to air resistance. One can derive r = 2 under simple physical assumptions, but if I remember correctly other values of r may be more realistic in certain circumstances. I don’t know much about the physics here; if you know about the use of other values of r, please let me know by leaving a comment.

Terminal velocity

When r = 1 or r = 2 the differential equation above can be solved in terms of elementary functions, but otherwise it cannot. Nevertheless one can show that for all values of r the object reaches a terminal velocity, and calculate that velocity without explicitly solving the differential equation. William Waterhouse demonstrated this in a one-page article [1]. He rewrites the equation to look at time as a function of velocity rather than velocity as a function of time

\frac{dt}{dv} = \frac{1}{g} \frac{1}{1 - (k/mg)v^r}

and concludes

t = \frac{1}{g} \int_0^v \frac{dv}{1 - (k/mg)v^r} + t_0

He notes that the integral diverges as v approaches

 \left(\frac{mg}{k}\right)^{1/r}

and so that is the terminal velocity, i.e. it takes an infinite amount of time to achieve this velocity. Waterhouse recommends this derivation as “a good example of deriving information about a problem without knowing an explicit solution.”

I would add that such an approach is the norm, not an exception. A closed-form solution to a differential equation in nice when you can get it, but usually not possible. And even when you can find a closed-form solution, you may be able to achieve your goal more directly by not using it.

Hypergeometric solution

I suspected the differential equation could be solved for general values of r using special functions, and that is the case. Mathematica was able to evaluate the integral for t as a function of v in terms of a hypergeometric function.

v \, _2F_1\left(1,\frac{1}{r};1+\frac{1}{r};\frac{g k v^r}{m}\right)

When I asked Mathematica to solve the differential equation directly, it said that the solution is the inverse function of the function above. Apparently Waterhouse and Mathematica agree that it is easier to think of t as a function of v rather than the original formulation.

The notation 2F1 indicates there are two upper parameters and one lower parameter. In our application, the upper parameters are 1 and 1/r, the lower parameter is 1 + 1/r, and the function is evaluated at gkvr/m. You can find a brief introduction to hypergeometric functions here. A hypergeometric function 2F1 has a singularity at 1, and so we could derive the terminal velocity from the explicit solution.

Mathematica implementation

Let c = gk/m. Then we can express velocity as a function of time in Mathematica by

    f[r_, c_, v_] := InverseFunction
    [
        #1 Hypergeometric2F1[1, 1/r, 1 + 1/r, c #1^r] &
    ][t]

and use this to make plots of the velocity for various values of c and r.

The following sets c = 2 and varies r over 1, 1.1, 1.2, … 2.

    Plot[Table[f[2, d/10, t], {d, 10, 20}], {t, 0, 4}, PlotRange -> All]

Here’s the output.

The terminal velocity decreases as r increases. The opposite is true for c < 1.

[1] William C. Waterhouse. A Fact about Falling Bodies. Mathematics Magazine, Vol. 44, No. 1 (Jan., 1971), pp. 33–34. The article straddles two pages, but takes up less than half of each page.

Moving between differential and integral equations

My years in graduate school instilled a Pavlovian response to PDEs: multiply by a test function and integrate by parts. This turns a differential equation into an integral equation [1].

I’ve been reading a book [2] on integral equations right now, and it includes several well-known techniques for turning certain kinds of integral equations into differential equations. Then this afternoon I talked to someone who was excited to have discovered a way to turn a more difficult integral equation into a differential equation.

For theoretical purposes, you often want to turn differential equations into integral equations. But for computational purposes, you often want to do the reverse.

Differential and integral equations are huge, overlapping fields, and sweeping generalities have exceptions. The opposite of the statement above may also be true. You may want to turn a differential equation into an integral equation for computational purposes (as in the finite element method) or turn an integral equation into a differential equation for theoretical convenience (as was the case for the person I was taking to).

***

[1] Sorta. More precisely this moves from the strong form of the PDE to a weak form. This involves integration, at least formally.

[2] Vladimir Ryzhov et al. Modern Methods in Mathematical Physics: Integral Equations in Wolfram Mathematica.

The essence of chaos

logistic bifurcation diagram

Linear systems can show sensitive dependence on initial conditions, but they cannot be chaotic. Only nonlinear systems can be chaotic.

George Datseris and Ulrich Parlitz explain this well in their book Nonlinear Dynamics:

… Sensitive dependence is not sufficient for a definition of chaos. … the state space is first stretched and then folded within itself. This is the defining characteristic of chaotic systems. The stretching part is where sensitive dependence comes from … as state space volumes increase in size … nonlinear dynamics takes over and folds the state space back in on itself. … This is why only nonlinear systems can be chaotic: linear systems with positive Lyapunov exponents always escape to infinity.

Related posts

The Pearson distributions

The previous post was about 12 probability distributions named after Irving Burr. This post is about 12 probability distributions named after Karl Pearson. The Pearson distributions are better known, and include some very well known distributions.

Burr’s distributions are defined by their CDFs; Pearson’s distributions are defined by their PDFs.

Pearson’s differential equation

The densities of Pearson’s distributions all satisfy the same differential equation:

f'(x) = \frac{(x-a) f(x)}{c_0 + c_1x + c_2x^2}

This is a linear differential equation, and so multiples of a solution are also a solution. However, a probability density must integrate to 1, so there is a unique probability density solution given a, c0, c1, and c2.

Well known distributions

Note that f(x) = exp(-x²/2) satisfies the differential equation above if we set a = 0, c0 = 1, and c1 = c2 = 0. This says the normal distribution is a Pearson distribution.

If f(x) = xm exp(-x) then the differential equation is satisfied for am, c0 = −1, and c0 = c2 = 0. This says that the exponential distribution and more generally the gamma distribution are Pearson distributions.

You can also show that the Cauchy distribution and more generally the Student t distribution are also Pearson distributions. So are the beta distributions (with a transformed range).

Table of Pearson distributions

The table below lists all Pearson distributions with their traditional names. The order of the list is a little strange for historical reasons.

The table uses Iverson’s bracket notation: a Boolean expression in brackets represents the function that is 1 when the condition holds and 0 otherwise. This way all densities are defined over the entire real line, though some of them are only positive over an interval.

The densities are presented without normalization constant; the normalization constant are whatever they have to be for the function to integrate to 1. The normalization constants can be complicated functions of the parameters and so they are left out for simplicity.

\begin{align*} \text{I} \hspace{1cm} & (1+x)^{m_1} (1-x)^{m_2} \,\,[-1 \leq x \leq 1] \\ \text{II} \hspace{1cm} & (1 - x^2)^m \,\, [ -1 \leq x \leq 1] \\ \text{III} \hspace{1cm} & x^m \exp(-x) \,\, [0 \leq x] \\ \text{IV} \hspace{1cm} & (1 + x^2)^{-m} \exp(-v \arctan x) \\ \text{V} \hspace{1cm} & x^{-m} \exp(-1/x) \,\, [0 \leq x] \\ \text{VI} \hspace{1cm} & x^{m_2}(1 + x)^{-m_1} \,\,[0 \leq x] \\ \text{VII} \hspace{1cm} & (1 + x^2)^{-m}\\ \text{VIII} \hspace{1cm} & (1 + x)^{-m} \,\, [0 \leq x \leq 1] \\ \text{IX} \hspace{1cm} & (1 + x)^{m} \,\, [0 \leq x \leq 1] \\ \text{X} \hspace{1cm} & \exp(-x) \,\, [0 \leq x] \\ \text{XI} \hspace{1cm} & x^{-m} \,\,[1 \leq x] \\ \text{XII} \hspace{1cm} & \left( (g+x)(g-x)\right)^h \,\, [-g \leq x \leq g] \end{align*}

There is a lot of redundancy in the list. All the distributions are either special cases of or limiting cases of distributions I, IV, and VI.

Note that VII is the Student t distribution after you introduce a scaling factor.

Moments

The Pearson distributions are determined by their first few moments, provided these exist, and these moments can be derived from the parameters in Pearson’s differential equation.

This suggests moment matching as a way to fit Pearson distributions to data: solve for the distribution parameters that make the the exact moments match the empirical moments. Sometimes this works very well, though sometimes other approaches are better, depending on your criteria for what constitutes a good match.

Heat equation and the normal distribution

The density function of a normal distribution with mean 0 and standard deviation √(2kt) satisfies the heat equation. That is, the function

u(x, t) = \frac{1}{2\sqrt{\pi kt}} \exp\left(-\frac{x^2}{4kt}\right)

satisfies the partial differential equation

u_t = k\,u_{xx}

You could verify this by hand, or if you’d like, here’s Mathematica code to do it.

      u[x_, t_] := PDF[NormalDistribution[0, Sqrt[2 k t]], x]
      Simplify[ D[u[x, t], {t, 1}] - k D[u[x, t], {x, 2}] ]

This returns 0 as expected.

Solutions to the heat equation are unique if we specify initial conditions. So if we start with a very concentrated heat source at time t = ε, say

u(x, \varepsilon) = \frac{1}{2\sqrt{\pi k\varepsilon}} \exp\left(-\frac{x^2}{4k\varepsilon}\right)

where ε is a tiny but positive number, then the solution at the top of the post hold for all t > 0 and is unique.

Alternatively, we could say that if our initial condition is u(x, 0) = δ where the right hand side is the Dirac delta function, then the solution at the top of the post holds for all t > 0 and is unique. You could justify this rigorously using distribution theory (in the analytic sense, not the probability sense, though they’re related). Or you could justify it heuristically by letting ε to go zero and thinking of δ as the limit.

It is also the case that a constant satisfies the heat equation. So if we add a constant to our solution, the constant representing ambient temperature, and add our initial condition to that constant, the solution is the constant plus the solution above.

The temperature at the center will decay in proportion to 1/√t. A point x away from the center will initially get warmer, reach a maximum temperature at time t = x²/2k, and then cool off roughly in proportion to 1/√t.

Here’s a plot of the temperature at x = 10 with k = 5.

Related posts