Closed-form solutions to nonlinear PDEs

The traditional approach to teaching differential equations is to present a collection of techniques for finding closed-form solutions to ordinary differential equations (ODEs). These techniques seem completely unrelated [1] and have arcane names such as integrating factors, exact equations, variation of parameters, etc.

Students may reasonably come away from an introductory course with the false impression that ODEs commonly have closed-form solutions, because such solutions are all they see in class.

My education reacted against this. We were told from the beginning that differential equations rarely have closed-form solutions and that therefore we wouldn’t waste time learning how to find such solutions. I didn’t learn the classical solution techniques until years later when I taught an ODE class as a postdoc.

I also came away with a false impression, the idea that differential equations almost never have closed-form solutions in practice, especially nonlinear equations, and above all partial differential equations (PDEs). This isn’t far from the truth, but it is an exaggeration.

I specialized in nonlinear PDEs in grad school, and I don’t recall ever seeing a closed-form solution. I heard rumors of a nonlinear PDE with a closed form solution, the KdV equation, but I saw this as the exception that proves the rule. It was the only nonlinear PDE of practical importance with a closed-form solution, or so I thought.

It is unusual for a nonlinear PDE to have a closed-form solution, but it is not unheard of. There are numerous examples of nonlinear PDEs, equations with important physical applications, that have closed-form solutions.

Yesterday I received a review copy of Analytical Methods for Solving Nonlinear Partial Differential Equations by Daniel Arrigo. If I had run across a book by that title as a grad student, it would have sounded as eldritch as a book on the biology of Bigfoot or the geography of Atlantis.

A few pages into the book there are nine exercises asking the reader to verify closed-form solutions to nonlinear PDEs:

  1. a nonlinear diffusion equation
  2. Fisher’s equation
  3. FitzHugh–Nagumo equation
  4. Burgers’ equation
  5. Liouville’s equation
  6. Sine-Gordon equation
  7. Korteweg–De Vries (KdV) equation
  8. modified Korteweg–De Vries (mKdV) equation
  9. Boussinesq’s equation

These are not artificial examples crafted to have closed-form solutions. These are differential equations that were formulated to model physical phenomena such as groundwater flow, nerve impulse transmission, and acoustics.
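For instance, here is a minimal sketch (my own, not code from the book) of how one might carry out such a verification with a computer algebra system. It checks that the soliton u(x, t) = (c/2) sech²(√c (x − ct)/2) satisfies one common form of the KdV equation, u_t + 6uu_x + u_xxx = 0.

```python
# Sketch: verify the KdV soliton symbolically with SymPy (not from the book).
from sympy import symbols, sech, sqrt, diff, exp, simplify, Rational

x, t, c = symbols("x t c", positive=True)

# Soliton solution of u_t + 6 u u_x + u_xxx = 0 with wave speed c
u = Rational(1, 2) * c * sech(sqrt(c) * (x - c*t) / 2)**2

residual = diff(u, t) + 6*u*diff(u, x) + diff(u, x, 3)
print(simplify(residual.rewrite(exp)))   # should print 0
```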

It remains true that differential equations, and especially nonlinear PDEs, typically must be solved numerically in applications. But the number of nonlinear PDEs with closed-form solutions is not insignificant.

Related posts

[1] These techniques are not as haphazard as they seem. At a deeper level, they’re all about exploiting various forms of symmetry.

 

Length of a general Archimedean spiral

This post ties together the previous three posts.

In this post, I said that an Archimedean spiral has the polar equation

r = b θ^{1/n}

and applied this here to rolls of carpet.

When n = 1, the length of the spiral for θ running from 0 to T is approximately

½ bT²

with the approximation becoming more accurate [1] as T increases.

In this post we want to look at the general case where n might not be 1. In that case the arc length is given by a hypergeometric function, and finding the asymptotic behavior for large T requires evaluating a hypergeometric function at a large argument.

Here’s an example with n = 3.

Now the arms are not equally spaced but instead grow closer together.

Arc length

According to MathWorld, the length of the spiral with n > 1 and θ running from 0 to T is given by

L = b T^{1/n} \,_2F_1\left(-\frac{1}{2}, \frac{1}{2n}; 1 + \frac{1}{2n}; -n^2 T^2 \right)

As I discussed here, the power series defining a hypergeometric function 2F1 diverges for arguments outside the unit disk, but the function can be extended by analytic continuation using the identity

\begin{align*} F(a, b; c; z) &= \frac{\Gamma(c) \Gamma(b-a)}{\Gamma(b) \Gamma(c-a)} \,(-z)^{-a\phantom{b}} F\left(a, 1-c+a; 1-b+a; \frac{1}{z}\right) \\ &+ \frac{\Gamma(c) \Gamma(a-b)}{\Gamma(a) \Gamma(c-b)} \,(-z)^{-b\phantom{a}} F\left(\,b, 1-c+b; 1-a+b; \,\frac{1}{z}\right) \\ \end{align*}

Asymptotics

Most of the terms above drop out in the limit as z = −n²T² goes off to −∞. The 1/z arguments go to zero, and a hypergeometric function equals 1 when its argument is 0, so the F factors on the right-hand side above go to 1 as T increases.

In our application the hypergeometric parameters are a = −1/2, b = 1/(2n), and c = 1 + 1/(2n). The term

(−z)^{−a} = (−z)^{1/2}

matters, but the term

(−z)^{−b} = (−z)^{−1/(2n)}

goes to zero and can be ignored.

We find that

L ≈ k T^{1 + 1/n}

where the constant k is given by

k = \frac{b\, n\, \Gamma\left(1 + \frac{1}{2n}\right) \Gamma\left(\frac{1}{2} + \frac{1}{2n}\right)}{\Gamma\left(\frac{1}{2n}\right) \Gamma\left(\frac{3}{2} + \frac{1}{2n}\right)}

When n = 1, this reduces to L ≈ ½ bT² as before.
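Here’s a quick numerical check (my own sketch, not from the original post) using mpmath, whose hyp2f1 handles large negative arguments by analytic continuation. It compares the exact arc length against the asymptotic form k T^{1 + 1/n} for n = 3.

```python
# Sketch: exact hypergeometric arc length vs. asymptotic k*T^(1 + 1/n), n = 3.
from mpmath import mp, hyp2f1, gamma, mpf

mp.dps = 30
b, n = 1, 3

def arc_length(T):
    """Exact length of r = b θ^(1/n) for θ in [0, T]."""
    return b * T**(mpf(1)/n) * hyp2f1(-0.5, 1/(2*n), 1 + 1/(2*n), -n**2 * T**2)

def asymptotic(T):
    """Asymptotic approximation k T^(1 + 1/n) for large T."""
    k = b*n*gamma(1 + 1/(2*n))*gamma(0.5 + 1/(2*n)) / \
        (gamma(1/(2*n))*gamma(1.5 + 1/(2*n)))
    return k * T**(1 + mpf(1)/n)

for T in [10, 100, 1000]:
    exact, approx = arc_length(T), asymptotic(T)
    print(T, float((exact - approx)/exact))   # relative error shrinks as T grows
```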

 

[1] The relative approximation error decreases approximately quadratically in T. But the absolute error grows logarithmically.

Approximating a spiral by rings

An Archimedean spiral has the polar equation

r = b θ^{1/n}

This post will look at the case n = 1. I may look at more general values of n in a future post. (Update: See here.) The case n = 1 is the simplest case, and it’s the case I needed for the client project that motivated this post.

In this case the spacing between points where the spiral crosses an axis is constant. Call this constant h. Then

h = 2πb.

For example, when rolling up a carpet, h corresponds to the thickness of the carpet.

Suppose θ runs from 0 to 2πm, wrapping around the origin m times. We could approximate the spiral by m concentric circles of radius h, 2h, 3h, …, mh. To visualize this, we’re approximating the length of the red spiral on the left with that of the blue circles on the right.

Comparing Archimedes spiral and concentric circles

We could approximate this further by saying we have m circles whose average radius is approximately πmb. This suggests the length of the spiral should be approximately

2π²m²b

How good is this approximation? What happens to the relative error as θ increases? Intuitively, each wrap around the origin is more like a circle as θ increases, so we’d expect the approximation to improve for large θ.

According to MathWorld, the exact length of the spiral is

πbm √(1 + (2πm)²) + ½ b arcsinh(2πm)

When m is so large that we can ignore the 1 in √(1 + (2πm)²), the first term is the same as the circle approximation, and all that’s left is the arcsinh term, which is on the order of log m because

arcsinh(x) = log(x + √(1 + x²)).

So for large m, the arc length is on the order of m² while the error is on the order of log m. This means the relative error is O( log(m) / m² ). [1]

We’ve assumed m was an integer because that makes it easier to visualize approximating the spiral by circles, but that assumption is not necessary. We could restate the problem in terms of the final value of θ. Say θ runs from 0 to T. Then we could solve

T = 2πm

for m and say that the approximate arc length is

½ bT²

and the exact length is

½ bT √(1 + T²) + ½ b arcsinh(T).

The relative approximation error is O( log(T) / T² ).

Here’s a plot of the error as a function of T assuming b = 1.
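Here’s a short sketch (mine, assuming matplotlib) that produces such a plot by comparing the exact and approximate lengths.

```python
# Sketch: relative error of the ½ b T² approximation as a function of T, b = 1.
import numpy as np
import matplotlib.pyplot as plt

b = 1.0
T = np.linspace(1, 100, 500)

exact  = 0.5*b*T*np.sqrt(1 + T**2) + 0.5*b*np.arcsinh(T)
approx = 0.5*b*T**2
plt.plot(T, (exact - approx)/exact)
plt.xlabel("T")
plt.ylabel("relative error")
plt.show()
```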

Related posts

[1] The error in approximating √(1 + (2πm)²) with 2πm is on the order of 1/(4πm) and so is smaller than the logarithmic term.

Hypergeometric function of a large negative argument

It’s occasionally necessary to evaluate a hypergeometric function at a large negative argument. I was working on a project today that involved evaluating F(a, b; c; z) where z is a large negative number.

The hypergeometric function F(a, b; c; z) is defined by a power series in z whose coefficients are functions of a, b, and c. However, this power series has radius of convergence 1. This means you can’t use the series to evaluate F(a, b; c; z) for z < −1.

It’s important to keep in mind the difference between a function and its power series representation. The former may exist where the latter does not. A simple example is the function f(z) = 1/(1 − z). The power series for this function has radius 1, but the function is defined everywhere except at z = 1.

Although the series defining F(a, b; c; z) is confined to the unit disk, the function itself is not. It can be extended analytically beyond the unit disk, usually with a branch cut along the real axis for z ≥ 1.

It’s good to know that our function can be evaluated for large negative z, but how do we evaluate it?

Linear transformation formulas

Hypergeometric functions satisfy a huge number of identities, the simplest of which are known as the linear transformation formulas even though they are not linear transformations of z. They involve bilinear transformations of z, a.k.a. fractional linear transformations, a.k.a. Möbius transformations. [1]

One such transformation is the following, found in A&S 15.3.4 [2].

F(a, b; c; z) = (1-z)^{-a} F\left(a, c-b; c; \frac{z}{z-1} \right)

If z < 0, then 0 < z/(z − 1) < 1, which is inside the radius of convergence. However, as z goes off to −∞, z/(z − 1) approaches 1, and the convergence of the power series will be slow.

A more complicated, but more efficient, formula is A&S 15.3.7, a linear transformation formula that relates F at z to two other hypergeometric functions evaluated at 1/z. Now when z is large, 1/z is small, and those series will converge quickly.

\begin{align*} F(a, b; c; z) &= \frac{\Gamma(c) \Gamma(b-a)}{\Gamma(b) \Gamma(c-a)} \,(-z)^{-a\phantom{b}} F\left(a, 1-c+a; 1-b+a; \frac{1}{z}\right) \\ &+ \frac{\Gamma(c) \Gamma(a-b)}{\Gamma(a) \Gamma(c-b)} \,(-z)^{-b\phantom{a}} F\left(\,b, 1-c+b; 1-a+b; \,\frac{1}{z}\right) \\ \end{align*}
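Here’s a sketch of using this formula (my own code; it assumes SciPy’s hyp2f1 and the standard library’s gamma, and that a − b is not an integer, which would put a gamma factor at a pole). The result is checked against mpmath’s built-in analytic continuation.

```python
# Sketch: evaluate 2F1 at a large negative argument via A&S 15.3.7.
from math import gamma
from scipy.special import hyp2f1
import mpmath

def hyp2f1_large_neg(a, b, c, z):
    """2F1(a, b; c; z) for z << -1, assuming a - b is not an integer."""
    t1 = gamma(c)*gamma(b - a)/(gamma(b)*gamma(c - a)) * (-z)**(-a) \
         * hyp2f1(a, 1 - c + a, 1 - b + a, 1/z)
    t2 = gamma(c)*gamma(a - b)/(gamma(a)*gamma(c - b)) * (-z)**(-b) \
         * hyp2f1(b, 1 - c + b, 1 - a + b, 1/z)
    return t1 + t2

a, b, c, z = -0.5, 1/6, 7/6, -9.0e6
print(hyp2f1_large_neg(a, b, c, z))      # via the transformation formula
print(float(mpmath.hyp2f1(a, b, c, z)))  # mpmath's analytic continuation
```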

Related posts

[1] It turns out these transformations are linear, but not as functions of a complex argument. They’re linear as transformations on a projective space. More on that here.

[2] A&S refers to the venerable Handbook of Mathematical Functions by Abramowitz and Stegun.

More Laguerre images

A week or two ago I wrote about Laguerre’s root-finding method and made some associated images. This post gives a couple more examples.

Laguerre’s method is very robust in the sense that it is likely to converge to a root, regardless of the starting point. However, it may be difficult to predict which root the method will end up at. To visualize this, we color points according to which root they converge to.

First, let’s look at the polynomial

(x − 2)(x − 4)(x − 24)

which clearly has roots at 2, 4, and 24. We’ll generate random starting points and color them blue, orange, or green depending on whether they converge to 2, 4, or 24. Here’s the result.

To make this easier to see, let’s split it into each color: blue, orange, and green.

Now let’s change our polynomial by moving the root at 4 to 4i.

(x − 2)(x − 4i)(x − 24)

Here’s the combined result.

And here is each color separately.

As we explained last time, the area taken up by the separate colors seems to exceed the total area. That is because the colors are so intermingled that many of the dots in the images cover some territory that belongs to another color, even though the dots are quite small.

Antenna length: Another rule of 72

The famous Rule of 72 says that to find out how many years it takes an investment to double in value, divide 72 by the annual percentage rate. I’ll come back to that in a little bit.

This morning I read a really good article, Fifty Things you can do with a Software Defined Radio. The article includes a rule of thumb for how long an antenna needs to be.

My rule of thumb was to divide 72 by the frequency in MHz, and take that as the length of each side of the dipole in meters [1]. That’d make the whole antenna a bit shorter than half of the wavelength.

Ideally an antenna should be as long as half a wavelength of the signal you want to receive. Light travels 3 × 10⁸ meters per second, so one wavelength of a 1 MHz signal is 300 m. A quarter wavelength, the length of one side of a dipole antenna, would be 75 m. Call it 72 m because 72 has lots of small factors, i.e. it’s usually mentally easier to divide things into 72 than 75. Rounding 75 down to 72 results in the antenna being a little shorter than ideal. But antennas are forgiving, especially for receiving.

Update: There’s more to replacing 75 with 72 than simplifying mental arithmetic. See Markus Meier’s comment below.

Just as the Rule of 72 for antennas rounds 75 down to 72, the Rule of 72 for interest rounds 69.3 up to 72, both for ease of mental calculation.

\begin{align*} \left(1 + \frac{n}{100}\right)^{72/n} &= \exp\left(\frac{72}{n} \log\left(1 + \frac{n}{100}\right)\right) \\ &\approx \exp\left(\frac{72}{n} \, \frac{n}{100} \right) \\ &= \exp(72/100) \\ &= 2.05 \end{align*}

The approximation step comes from the approximation log(1 + x) ≈ x for small x, a first order Taylor approximation.

The last line would be 2 rather than 2.05 if we replaced 72 with 100 log(2) = 69.3. That’s where the factor of 69.3 mentioned above comes from.
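As a quick sanity check (my own sketch, not from the post), here is the rule of 72 next to the exact doubling time for a few rates.

```python
# Sketch: rule of 72 vs. exact doubling time log(2)/log(1 + r/100).
from math import log

for r in [2, 4, 6, 8, 12]:
    exact = log(2) / log(1 + r/100)
    print(f"{r}%: exact {exact:.2f} years, rule of 72 gives {72/r:.2f}")
```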

Related posts

[1] The post actually says centimeters, but the author meant to say meters.

Hallucinations of AI Science Models

AlphaFold 2, FourCastNet and CorrDiff are exciting. AI-driven autonomous labs are going to be a big deal [1]. Science codes now use AI and machine learning to make scientific discoveries on the world’s most powerful computers [2].

It’s common practice for scientists to ask questions about the validity, reliability and accuracy of the mathematical and computational methods they use. And many have voiced concerns about the lack of explainability and potential pitfalls of AI models, in particular deep neural networks (DNNs) [3].

The impact of this uncertainty varies greatly from project to project. Science projects that can easily check AI-generated results against ground truth may not be that concerned. High-stakes projects, such as the design of a multimillion dollar spacecraft, may ask about AI model accuracy with more urgency.

Neural network accuracy

Understanding of the properties of DNNs is still in its infancy, with many as-yet unanswered questions. However, in the last few years some significant results have started to come forth.

A fruitful approach to analyzing DNNs is to see them as function approximators (which, of course, they are). One can study how accurately DNNs approximate a function representing some physical phenomenon in a domain (for example, fluid density or temperature).

The approximation error can be measured in various ways. A particularly strong measure is “sup-norm” or “max-norm” error, which requires that the DNN approximation be accurate at every point of the target function’s domain (“uniform approximation”). Some science problems may have a weaker requirement than this, such as low RMS or 2-norm error. However, it’s not unreasonable to ask about max-norm approximation behaviors of numerical methods [4,5].

An illuminating paper by Ben Adcock and Nick Dexter looks at this problem [6]. They show that standard DNN methods applied even to a simple 1-dimensional problem can result in “glitches”: the DNN as a whole matches the function well but at some points totally misapproximates the target function. For a picture that shows this, see [7].

Other mathematical papers have subsequently shed light on these behaviors. I’ll try to summarize the findings below, though the actual findings are very nuanced, and many details are left out here. The reader is advised to refer to the respective papers for exact details.

The findings address three questions: 1) how many DNN parameters are required to approximate a function well? 2) how much data is required to train to a desired accuracy? and 3) what algorithms are capable of training to the desired accuracy?

How many neural network weights are needed?

How large does the neural network need to be for accurate uniform approximation of functions? If tight max-norm approximation requires an excessively large number of weights, then use of DNNs is not computationally practical.

Some answers to this question have been found—in particular, a result 1 is given in [8, Theorem 4.3; cf. 9, 10]. This result shows that the number of neural network weights required to approximate an arbitrary function to high max-norm accuracy grows exponentially in the dimension of the input to the function.

This dependency on dimension is no small limitation, insofar as this is not the dimension of physical space (e.g., 3-D) but the dimension of the input vector (such as the number of gridcells), which for practical problems can be in the tens of thousands [11] or even millions or more.

Sadly, this rules out the practical use of DNN for some purposes. Nonetheless, for many practical applications of deep learning, the approximation behaviors are not nearly so pessimistic as this would indicate (cp. [12]). For example, results are more optimistic:

  • if the target function has a strong smoothness property;
  • if the function is not arbitrary but is a composition of simpler functions;
  • if the training and test data are restricted to a (possibly unknown) lower dimensional manifold in the high dimensional space (this is certainly the case for common image and language modeling tasks);
  • if the average case behavior for the desired problem domain is much better than the worst case behavior addressed in the theorem;
  • if the architecture is something other than the multilayer perceptron with ReLU activation assumed by the theorem (though the analysis is based on the multidimensional Taylor’s theorem, which one might conjecture applies to other architectures as well);
  • if very high accuracy is not a requirement, as is the case for many practical applications;
  • if low 2-norm error is sufficient rather than low max-norm error, as in some applications;
  • if the network is a physics-informed neural network (PINN), a special case for which stronger results hold.

Thus, not all hope is lost from the standpoint of theory. However, certain problems for which high accuracy is required may not be suitable for DNN approximation.

How much training data is needed?

Assuming your space of neural network candidates is expressive enough to closely represent the target function—how much training data is required to actually find a good approximation?

A result 2 is given in [13, Theorem 2.2] showing that the number of training samples required to train to high max-norm accuracy grows, again, exponentially in the dimension of the input to the function.

The authors concede however that “if additional problem information about [the target functions] can be incorporated into the learning problem it may be possible to overcome the barriers shown in this work.” One suspects that some of the caveats given above might also be relevant here. Additionally, if one considers 2-norm error instead of max-norm error, the data requirement grows polynomially rather than exponentially, making the training data requirement much more tractable. Nonetheless, for some problems the amount of data required is so large that attempts to “pin down” the DNN to sufficient accuracy become intractable.

What methods can train to high accuracy?

The amount of training data may be sufficient to specify a suitable neural network. But, will standard methods for finding the weights of such a DNN be effective for solving this difficult nonconvex optimization problem?

A recent paper [14] from Max Tegmark’s group empirically studies DNN training to high accuracy. They find that as the input dimension grows, training to very high accuracy with standard stochastic gradient descent methods becomes difficult or impossible.

They also find second order methods perform much better, though these are more computationally expensive and have some difficulty also when the dimension is higher. Notably, second order methods have been used effectively for DNN training for some science applications [15]. Also, various alternative training approaches have been tried to attempt to stabilize training; see, e.g., [16].

Prospects and conclusions

Application of AI methods to scientific discovery continues to deliver amazing results, in spite of lagging theory. Ilya Sutskever has commented, “Progress in AI is a game of faith. The more faith you have, the more progress you can make” [17].

Theory of deep learning methods is in its infancy. The current findings show some cases for which use of DNN methods may not be fruitful. Continued discoveries in deep learning theory can help better guide how to use the methods effectively and inform where new algorithmic advances are needed.

Footnotes

1 Suppose the function to be approximated takes d inputs and has the smoothness property that all nth partial derivatives are continuous (i.e., is in C^n(Ω) for compact Ω). Also suppose a multilayer perceptron with ReLU activation functions must be able to approximate any such function to max-norm no worse than ε. Then the number of weights required is at least a fixed constant times (1/ε)^{d/(2n)}.

2 Let F be the space of all functions of d inputs that can be approximated exactly by a broad class of ReLU neural networks. Suppose there is a training method that can recover all these functions up to max-norm accuracy bounded by ε. Then the number of training samples required is at least a fixed constant times (1/ε)^d.

References

[1] “Integrated Research Infrastructure Architecture Blueprint Activity (Final Report 2023),” https://www.osti.gov/biblio/1984466.

[2] Joubert, Wayne, Bronson Messer, Philip C. Roth, Antigoni Georgiadou, Justin Lietz, Markus Eisenbach, and Junqi Yin. “Learning to Scale the Summit: AI for Science on a Leadership Supercomputer.” In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1246-1255. IEEE, 2022, https://www.osti.gov/biblio/2076211.

[3] “Reproducibility Workshop: The Reproducibility Crisis in ML‑based Science,” Princeton University, July 28, 2022, https://sites.google.com/princeton.edu/rep-workshop.

[4] Wahlbin, L. B. (1978). Maximum norm error estimates in the finite element method with isoparametric quadratic elements and numerical integration. RAIRO. Analyse numérique, 12(2), 173-202, https://www.esaim-m2an.org/articles/m2an/pdf/1978/02/m2an1978120201731.pdf

[5] Kashiwabara, T., & Kemmochi, T. (2018). Maximum norm error estimates for the finite element approximation of parabolic problems on smooth domains. https://arxiv.org/abs/1805.01336.

[6] Adcock, Ben, and Nick Dexter. “The gap between theory and practice in function approximation with deep neural networks.” SIAM Journal on Mathematics of Data Science 3, no. 2 (2021): 624-655, https://epubs.siam.org/doi/10.1137/20M131309X.

[7] “Figure 5 from The gap between theory and practice in function approximation with deep neural networks | Semantic Scholar,” https://www.semanticscholar.org/paper/The-gap-between-theory-and-practice-in-function-Adcock-Dexter/156bbfc996985f6c65a51bc2f9522da2a1de1f5f/figure/4

[8] Gühring, I., Raslan, M., & Kutyniok, G. (2022). Expressivity of Deep Neural Networks. In P. Grohs & G. Kutyniok (Eds.), Mathematical Aspects of Deep Learning (pp. 149-199). Cambridge: Cambridge University Press. doi:10.1017/9781009025096.004, https://arxiv.org/abs/2007.04759.

[9] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017, https://arxiv.org/abs/1610.01145

[10] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep relu neural networks in Ws,p norms. Anal. Appl. (Singap.), pages 1–57, 2019, https://arxiv.org/abs/1902.07896

[11] Matt R. Norman, “The MiniWeather Mini App,” https://github.com/mrnorman/miniWeather

[12] Lin, H.W., Tegmark, M. & Rolnick, D. Why Does Deep and Cheap Learning Work So Well?. J Stat Phys 168, 1223–1247 (2017). https://doi.org/10.1007/s10955-017-1836-5

[13] Berner, J., Grohs, P., & Voigtlaender, F. (2022). Training ReLU networks to high uniform accuracy is intractable. ICLR 2023, https://openreview.net/forum?id=nchvKfvNeX0.

[14] Michaud, E. J., Liu, Z., & Tegmark, M. (2023). Precision machine learning. Entropy, 25(1), 175, https://www.mdpi.com/1099-4300/25/1/175.

[15] Markidis, S. (2021). The old and the new: Can physics-informed deep-learning replace traditional linear solvers?. Frontiers in big Data, 4, 669097, https://www.frontiersin.org/articles/10.3389/fdata.2021.669097/full

[16] Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19,  https://papers.nips.cc/paper_files/paper/2006/hash/5da713a690c067105aeb2fae32403405-Abstract.html

[17] “Chat with OpenAI CEO and Co-founder Sam Altman, and Chief Scientist Ilya Sutskever,” https://www.youtube.com/watch?v=mC-0XqTAeMQ&t=250s

Double super factorial

I saw someone point out recently that

10! = 7! × 5! × 3! × 1!

Are there more examples like this?

What would you call the pattern on the right? I don’t think there’s a standard name, but here’s why I think it should be called double super factorial or super double factorial.

Super factorial

The factorial of a positive integer n is the product of the positive integers up to and including n. The super factorial of n is the product of the factorials of the positive integers up to and including n. So, for example, 7 super factorial would be

7! × 6! × 5! × 4! × 3! × 2! × 1!

Double factorial

The double factorial of a positive integer n is the product of all the positive integers up to n with the same parity as n. So, for example, the double factorial of 7 would be

7!! = 7 × 5 × 3 × 1.

Double superfactorial

The pattern at the top of the post is like super factorial, but it only includes odd terms, so it’s like a cross between super factorial and double factorial, hence double super factorial.

Denote the double super factorial of n as dsf(n), the product of the factorials of all numbers up to n with the same parity as n. That is,

dsf(n) = n! × (n − 2)! × (n − 4)! × … × 1

where the 1 at the end is 1! if n is odd and 0! if n is even. In this notation, the observation at the top of the post is

10! = dsf(7).

Super double factorial

We can see by re-arranging terms that a double super factorial is also a super double factorial. For example, look at

dsf(7) = 7! × 5! × 3! × 1!

If we separate out the first term in each factorial we have

(7 × 5 × 3 × 1)(6! × 4! × 2!) = 7!! dsf(6)

We can keep going and show in general that

dsf(n) = n!! × (n − 1)!! × (n − 2)!! × … × 1!!

We could call the right hand side the super double factorial, sdf(n). Just as a super factorial is a product of factorials, a super double factorial is a product of double factorials. Therefore

dsf(n) = sdf(n).
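Here’s a small sketch (my code, just a sanity check) verifying dsf(n) = sdf(n) for small n, along with the observation at the top of the post.

```python
# Sketch: check dsf(n) == sdf(n) for small n, and that 10! == dsf(7).
from math import factorial

def dsf(n):
    """Double super factorial: n! (n-2)! (n-4)! ... down to 1! or 0!."""
    prod = 1
    while n >= 0:
        prod *= factorial(n)
        n -= 2
    return prod

def double_factorial(n):
    """n!! = n (n-2) (n-4) ..."""
    prod = 1
    while n > 0:
        prod *= n
        n -= 2
    return prod

def sdf(n):
    """Super double factorial: n!! (n-1)!! (n-2)!! ... 1!!."""
    prod = 1
    for k in range(1, n + 1):
        prod *= double_factorial(k)
    return prod

assert all(dsf(n) == sdf(n) for n in range(13))
assert factorial(10) == dsf(7)
```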

Factorials that equal double super factorials

Are there more solutions to

n! = dsf(m)

besides n = 10 and m = 7? Yes, here are some.

0! = dsf(0)
1! = dsf(1)
2! = dsf(2)
3! = dsf(3)
6! = dsf(5)

There are no solutions to

n! = dsf(m)

if n > 10. Here’s a sketch of a proof.

Bertrand’s postulate says that for n > 1 there is always a prime p between n and 2n. Now p divides (2n)! but p cannot divide dsf(n) because dsf(n) only has factors less than or equal to n.

If we can show that for some N, n > N implies (2n)! < dsf(n) then there are no solutions to

n! = dsf(m)

for n > 2N because there is a prime p between N and 2N that divides the left side but not the right. In fact N = 12. We can show empirically there are no solutions for n = 11 up to 24, and the proof shows there are no solutions for n > 24.
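Here’s a sketch of that empirical check (my own code): search for all solutions to n! = dsf(m) with n up to 24.

```python
# Sketch: find all (n, m) with n! = dsf(m) for n <= 24.
from math import factorial

def dsf(n):
    """Double super factorial: n! (n-2)! (n-4)! ... down to 1! or 0!."""
    prod = 1
    while n >= 0:
        prod *= factorial(n)
        n -= 2
    return prod

solutions = []
for n in range(25):
    target = factorial(n)
    m = 0
    while dsf(m) <= target:        # dsf is nondecreasing in m
        if dsf(m) == target:
            solutions.append((n, m))
        m += 1

print(solutions)
# Besides the trivial pairs where both sides equal 1, this finds only
# (2, 2), (3, 3), (6, 5), and (10, 7).
```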

Laguerre’s root finding method

Edmond Laguerre (1834–1886) came up with a method for finding zeros of polynomials. Unlike Newton’s method for finding zeros of general functions, Laguerre’s method is specialized for polynomials. Laguerre’s method converges one order faster than Newton’s method, i.e. the error is roughly cubed on each step rather than squared.

The most interesting thing about Laguerre’s method is that it nearly always converges to a root, no matter where you start. Newton’s method, on the other hand, converges quickly if you start sufficiently close to a root. If you don’t start close enough to a root, the method might cycle or go off to infinity.

The first time I taught numerical analysis, the textbook we used began the section on Newton’s method with the following nursery rhyme:

There was a little girl,
Who had a little curl,
Right in the middle of her forehead.
When she was good,
She was very, very good,
But when she was bad, she was horrid.

When Newton’s method is good, it is very, very good, but when it is bad it is horrid.

Laguerre’s method is not well understood. Experiments show that it nearly always converges, but there’s not much theory to explain what’s going on.

The method is robust in that it is very likely to converge to some root, but it may not converge to the root you expected unless you start sufficiently close. The rest of the post illustrates this.

Let’s look at the polynomial

p(x) = 3 + 22x + 20x² + 24x³.

This polynomial has roots at -0.15392 and at -0.3397 ± 0.83468i.

We’ll generate 2,000 random starting points for Laguerre’s method and color each point according to which root it converges to. Points converging to the real root are colored blue, points converging to the root with positive imaginary part are colored orange, and points converging to the root with negative imaginary part are colored green.
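Here’s a sketch of the experiment (my own code, not the original; the tolerance and the square region for starting points are arbitrary choices).

```python
# Sketch: classify random starting points by which root Laguerre's method finds.
import numpy as np

p = np.poly1d([24, 20, 22, 3])       # 24x^3 + 20x^2 + 22x + 3
dp, ddp = p.deriv(), p.deriv(2)
deg = p.order
roots = np.roots(p)                  # reference roots, used only for labeling

def laguerre(x, tol=1e-12, maxiter=100):
    """Run Laguerre's method from starting point x."""
    for _ in range(maxiter):
        px = p(x)
        if abs(px) < tol:
            break
        G = dp(x) / px
        H = G*G - ddp(x) / px
        root = np.sqrt((deg - 1)*(deg*H - G*G) + 0j)
        denom = G + root if abs(G + root) > abs(G - root) else G - root
        x = x - deg/denom
    return x

rng = np.random.default_rng(0)
starts = rng.uniform(-2, 2, 2000) + 1j*rng.uniform(-2, 2, 2000)
labels = [np.argmin(np.abs(roots - laguerre(z))) for z in starts]
# labels[i] says which of the three roots the i-th starting point converged to;
# coloring the points by label reproduces plots like the ones described below.
```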

Here’s what we get:

This is hard to see, but we can tell that there aren’t many blue dots, about an equal number of orange and green dots, and the latter are thoroughly mixed together. This means the method is unlikely to converge to the real root, and about equally likely to converge to either of the complex roots.

Let’s look at just the starting points that converge to the real root, its basin of attraction. To get more resolution, we’ll generate 100,000 starting points and make the dots smaller.

The convergence region is pinched near the root; you can start fairly near the root along the real axis but converge to one of the complex roots. Notice also that there are scattered points far from the real root that converge to that point.

Next let’s look at the points that converge to the complex root in the upper half plane.

Note that the basin of attraction appears to take up over half the area. But the corresponding basin of attraction for the root in the lower half plane also appears to take up over half the area.

They can’t both take up over half the area. In fact, both take up about 48%. But the two regions are very intertwined. Due to the width of the dots used in plotting, each green dot covers a tiny bit of area that belongs to orange, and vice versa. That is, the fact that both appear to take up over half the area shows how commingled they are.

Related posts

Distance from a point to a line

Eric Lengyel’s new book Projective Geometric Algebra Illuminated arrived yesterday and I’m enjoying reading it. Imagine if someone started with ideas like dot products, cross products, and determinants that you might see in your first year of college, then thought deeply about those things for years. That’s kinda what the book is.

Early in the book is the example of finding the distance from a point q to a line of the form p + tv.

If you define u = q − p then a straightforward derivation shows that the distance d from q to the line is given by

d = \sqrt{{\bold u}^2 - \frac{({\bold u}\cdot {\bold v})^2}{{\bold v}^2}}

But as the author explains, it is better to calculate d by

d = \sqrt{\frac{({\bold u}\times {\bold v})^2}{{\bold v}^2}}

Why is that? The two expressions are algebraically equal, but the latter is better suited for numerical calculation.

The cardinal rule of numerical calculation is to avoid subtracting nearly equal floating point numbers. If two numbers agree to b bits, you may lose up to b bits of significance when computing their difference.

If u and v are vectors with large magnitude, but q is close to the line, then the first equation subtracts two large, nearly equal numbers under the square root.
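Here’s a small sketch (my own, using single precision to exaggerate the effect) of the two formulas side by side.

```python
# Sketch: the two distance formulas in float32, for a point near a long line.
import numpy as np

def dist_subtract(q, p, v):
    """d = sqrt(u·u - (u·v)²/v·v): subtracts nearly equal numbers."""
    u = q - p
    return np.sqrt(np.dot(u, u) - np.dot(u, v)**2 / np.dot(v, v))

def dist_cross(q, p, v):
    """d = sqrt((u×v)·(u×v)/v·v): the better-conditioned form."""
    u = q - p
    w = np.cross(u, v)
    return np.sqrt(np.dot(w, w) / np.dot(v, v))

p = np.array([0, 0, 0], dtype=np.float32)
v = np.array([1, 0, 0], dtype=np.float32)
q = np.array([1.0e4, 0.01, 0], dtype=np.float32)  # far along the line, distance 0.01

print(dist_subtract(q, p, v))  # catastrophic cancellation: the answer is lost
print(dist_cross(q, p, v))     # close to the true distance 0.01
```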

The second equation involves subtraction too, but it’s less obvious. This is a common theme in numerical computing. Imagine this dialog.

[Student produces first equation.]

Mentor: Avoid subtracting nearly equal numbers.

[Student produces second equation.]

Student: OK, did it.

Mentor: That’s much better, though it could still have problems.

Where is there a subtraction in the second equation? We started with a subtraction in defining u. More subtly, the definition of cross product involves subtractions. But these subtractions involve smaller numbers than the first formula, because the first formula subtracts squared values. Eric Lengyel points this out in his book.

None of this may matter in practice, until it does matter, which is a common pattern in numerical computing. You implement something like the first formula, something that can be derived directly. You implicitly have in mind vectors whose magnitude is comparable to d and this guides your choice of unit tests, which all pass.

Some time goes by and a colleague tells you your code is failing. Impossible! You checked your derivation by hand and in Mathematica. Your unit tests all pass. Must be your colleague’s fault. But it’s not. Your code would be correct in infinite precision, but in an actual computer it fails on inputs that violate your implicit assumptions.

This can be frustrating, but it can also be fun. Implementing equations from a freshman textbook accurately, efficiently, and robustly is not a freshman-level exercise.

Related posts