John D. Cook

Closed-form solutions to nonlinear PDEs

John — Thu, 25 Apr 2024 11:21:38 +0000

The traditional approach to teaching differential equations is to present a collection of techniques for finding closed-form solutions to ordinary differential equations (ODEs). These techniques seem completely unrelated [1] and have arcane names such as integrating factors, exact equations, variation of parameters, etc.

Students may reasonably come away from an introductory course with the false impression that it is common for ODEs to have closed-form solutions because it is common in the class.

My education reacted against this. We were told from the beginning that differential equations rarely have closed-form solutions and that therefore we wouldn’t waste time learning how to find such solutions. I didn’t learn the classical solution techniques until years later when I taught an ODE class as a postdoc.

I also came away with a false impression, the idea that differential equations almost never have closed-form solutions in practice, especially nonlinear equations, and above all partial differential equations (PDEs). This isn’t far from the truth, but it is an exaggeration.

I specialized in nonlinear PDEs in grad school, and I don’t recall ever seeing a closed-form solution. I heard rumors of a nonlinear PDE with a closed form solution, the KdV equation, but I saw this as the exception that proves the rule. It was the only nonlinear PDE of practical importance with a closed-form solution, or so I thought.

It is unusual for a nonlinear PDE to have a closed-form solution, but it is not unheard of. There are numerous examples of nonlinear PDEs, equations with important physical applications, that have closed-form solutions.

Yesterday I received a review copy of Analytical Methods for Solving Nonlinear Partial Differential Equations by Daniel Arrigo. If I had run across with a book by that title as a grad student, it would have sounded as eldritch as a book on the biology of Bigfoot or the geography of Atlantis.

A few pages into the book there are nine exercises asking the reader to verify closed-form solutions to nonlinear PDEs:

a nonlinear diffusion equation
Fisher’s equation
Fitzhugh-Nagumo equation
Berger’s equation
Liouville’s equation
Sine-Gordon equation
Korteweg–De Vries (KdV) equation
modified Korteweg–De Vries (mKdV) equation
Boussinesq’s equation

These are not artificial examples crafted to have closed-form solutions. These are differential equations that were formulated to model physical phenomena such as groundwater flow, nerve impulse transmission, and acoustics.

It remains true that differential equations, and especially nonlinear PDEs, typically must be solved numerically in applications. But the number of nonlinear PDEs with closed-form solutions is not insignificant.

[1] These techniques are not as haphazard as they seem. At a deeper level, they’re all about exploiting various forms of symmetry.

The post Closed-form solutions to nonlinear PDEs first appeared on John D. Cook.

Choosing a Computer Language for a Project

Wayne Joubert — Wed, 24 Apr 2024 00:19:22 +0000

Julia. Scala. Lua. TypeScript. Haskell. Go. Dart. Various computer languages new and old are sometimes proposed as better alternatives to mainstream languages. But compared to mainstream choices like Python, C, C++ and Java (cf. Tiobe Index)—are they worth using?

Certainly it depends a lot on the planned use: is it a one-off small project, or a large industrial-scale software application?

Yet even a one-off project can quickly grow to production-scale, with accompanying growing pains. Startups sometimes face a growth crisis when the nascent code base becomes unwieldy and must be refactored or fully rewritten (or you could do what Facebook/Meta did and just write a new compiler to make your existing code base run better).

The scope of different types of software projects and their requirements is so incredibly diverse that any single viewpoint from experience runs a risk of being myopic and thus inaccurate for other kinds of projects. With this caveat, I’ll share some of my own experience from observing projects for many dozens of production-scale software applications written for leadership-scale high performance computing systems. These are generally on a scale of 20,000 to 500,000 lines of code and often require support of mathematical and scientific libraries and middleware for build support, parallelism, visualization, I/O, data management and machine learning.

Here are some of the main issues regarding choice of programming languages and compilers for these codes:

1. Language and compiler sustainability. While the lifetime of computing systems is measured in years, the lifetime of an application code base can sometimes be measured in decades. Is the language it is written in likely to survive and be well-supported long into the future? For example, Fortran, though still used and frequently supported, is is a less common language thus requiring special effort from vendors, with fewer developer resources than more popular languages. Is there a diversity of multiple compilers from different providers to mitigate risk? A single provider means a single point of failure, a high risk; what happens if the supplier loses funding? Are the language and compilers likely to be adaptable for future computer hardware trends (though sometimes this is hard to predict)? Is there a large customer base to help ensure future support? Similarly, is there an adequate pool of available programmers deeply skilled in the language? Does the language have a well-featured standard library ecosystem and good support for third-party libraries and frameworks? Is there good tool support (debuggers, profilers, build tools)?

2. Related to this is the question of language governance. How are decisions about future updates to the language made? Is there broad participation from the user community and responsiveness to their needs? I’ve known members of the C++ language committee; from my limited experience they seem very reasonable and thoughtful about future directions of the language. On the other hand, some standards have introduced features that scarcely anyone ever uses—a waste of time and more clutter for the standard.

3. Productivity. It is said that programmer productivity is limited by the ability of a few lines of code to express high level abstractions that can do a lot with minimal syntax. Does the language permit this? Does the language standard make sense (coherent, cohesive) and follow the principle of least surprise? At the same time, the language should not engulf what might better be handled modularly by a library. For example, a matrix-matrix product that is bound up with the language might be highly productive for simple cases but have difficulty supporting the many variants of matrix-matrix product provided for example by the NVIDIA CUTLASS library. Also, in-language support for CUDA GPU operations, for example, would make it hard for the language not to lag behind in support of the frequent new releases of CUDA.

4. Strategic advantage. The 10X improvement rule states that an innovation is only worth adopting if it offers 10X improvement compared to existing practice according to some metric . If switching to a given new language doesn’t bring significant improvement, it may not be worth doing. This is particularly true if there is an existing code base of some size. A related question is whether the new language offers an incremental transition path for an existing code to the new language (in many cases this is difficult or impossible).

5. Performance (execution speed). Does the language allow one to get down to bare-metal performance rather than going through costly abstractions or an interpreter layer? Are the features of the underlying hardware exposed for the user to access? Is performance predictable? Can one get a sense of the performance of each line of code just by inspection, or is this occluded by abstractions or a complex compilation process? Is the use of just-in-time compilation or garbage collection unpredictable, which could be a problem for parallel computing wherein unexpected “hangs” can be caused by one process unexpectedly performing one of these operations? Do the compiler developers provide good support for effective and accurate code optimization options? Have results from standardized non-cherry-picked benchmarks been published (kernel benchmarks, proxy apps, full applications)?

Early adopters provide a vibrant “early alert” system for new language and tool developments that are useful for small projects and may be broadly impactful. Python was recognized early in the scientific computing community for its potential complementary use with standard languages for large scientific computations. When it comes to planning large-scale software projects, however, a range of factors based on project requirements must be considered to ensure highest likelihood of success.

The post Choosing a Computer Language for a Project first appeared on John D. Cook.

On greedy algorithms and rodeo clowns

John — Mon, 22 Apr 2024 13:16:23 +0000

This weekend I ran across a blog post The Rodeo Clown Theory of Personal Development. The title comes from a hypothetical example of a goal you don’t know how to achieve: becoming a rodeo clown.

Let’s say you decide you want to be a rodeo clown. And let’s say you’re me and you have no idea how to be a rodeo clown. … Can you look at the possibilities currently available to you and imagine which of them might lead to an overall increase of rodeo clowndom in your life, even infinitesimally?

Each day you ask yourself what you can do that might lead you closer to your goal. This is a greedy algorithm: at each step you make as much progress toward your goal as you can. Greedy algorithms are not always optimal, but the premise above is that you have no idea what to do. In that case, doing whatever you can imagine that might might bring you a little closer to your goal is a smart thing to do.

Crossing the continent

If you’re in Los Angeles and your goal is to get to New York, you could start by walking east. You might run into a brick wall and decide that you should not take your approach too literally. Maybe you decide to buy a ticket for a bus that’s traveling east, even if you have to walk west to the bus station. Maybe the bus takes you to Houston. This isn’t the most direct route, but it’s progress.

When you don’t know what else to do, a greedy approach, especially one tempered with common sense, could be a very sensible way to start. Maybe it’ll put you in a position to learn of a more efficient approach. In the scenario above, maybe some stranger sees you walk into a wall, asks you what you’re trying to do, and tells you where the bus station is.

Sorting algorithms

The way most people sort a hand of playing cards is a sort of greedy algorithm. They scan for the highest card and put it first. Then they scan the remainder of the hand for the next highest card and so on. This algorithm is known as insertion sort. It’s not the most efficient algorithm, but it may be the most natural, and it works. And for a hand of five cards it’s about as good as any.

Anyone who has taken an algorithms class will know you can beat insertion sort with a more sophisticated algorithm such as heap sort. But sorting is a well-understood, one-dimensional problem. That’s why people long ago came up with solutions that are better than a greedy approach. But when a problem is not well understood and is high-dimensional, greedy algorithms are more likely to succeed, even if they’re not optimal.

Training neural nets

Training a neural network is a very high dimensional optimization problem, maybe on the order of a billion dimensions. And yet it is solved using stochastic gradient descent, which is essentially a greedy algorithm. At each iteration, you move in the direction that gives you the most improvement toward your goal. There’s a little more nuance to it than that, but that’s the basic idea.

How can gradient descent work so well in practice when in theory it could get stuck in a local minimum or a saddle point? John Carmack posted a heuristic explanation on Twitter a few weeks ago. A local critical point requires the “cooperation” of every variable in the function you’re optimizing. The more variables, the less likely this is to happen.

Carmack didn’t give a full explanation—he posted a tweet, not a journal article—but he does give an intuitive idea of what’s going on. At the global minimum, the variables cooperate for a good reason. He assumes the variables are unlikely to cooperate anywhere else, which is apparently often true when training neural nets.

Usually things get harder in higher dimensions, hence the “curse of dimensionality.” But there’s a sort of blessing of dimensionality here: you’re less likely to get stuck in a local minimum in higher dimensions.

Back to the rodeo

Training a neural network can be a huge undertaking, using millions of dollars worth of electricity, but it’s a mathematically defined problem. Life goals such as becoming a rodeo clown are not cleanly quantified problems. They are very high dimensional in the sense that there are a vast number of things you could do at any given time. You might argue by analogy with neural nets that the fact that there are so many dimensions means that you’re unlikely to get stuck in a local minimum. This is a loose analogy, one that easily breaks down, but still one worth thinking about.

Seth Godin once said “Remarkable work often comes from making choices when everyone else feels as though there is no choice.” What looks like a local minimum may be revealed to not be once you consider more options.

Image by Clarence Alford from Pixabay

The post On greedy algorithms and rodeo clowns first appeared on John D. Cook.

Finding strings in binary files

John — Sat, 20 Apr 2024 15:54:28 +0000

There’s a little program called strings that searches for what appear to be strings inside binary file. I’ll refer to it as strings(1) to distinguish the program name from the common English word strings. [1]

What does strings(1) consider to be a string? By default it is a sequence of four or more bytes that correspond to printable ASCII strings. There are command options to change the sequence length and the character encoding.

There are 98 printable ASCII characters [2] and 256 possible values for an 8-bit byte, so the probability of a byte being a printable character is

p = 98/256 = 0.3828125.

This implies that the probability of strings(1) flagging a sequence of four random bytes as a string is p⁴ or about 2%.

How long a string might you find inside a binary file?

I ran strings(1) on a photograph and found a 46-character string:

   .IEC 61966-2.1 Default RGB colour space - sRGB

Of course this isn’t random. This string is part of the metadata stored inside JPEG images.

Next I encrypted the file so that no strings would be metadata and all strings would be false positives. The longest line was 12 characters:

    Z
How does this compare to what we might expect from a random file? I wrote about the probability of long runs a dozen years ago. In a file of n bytes, the expected length of the longest run of printable characters is approximately

In my case, the file had n = 203,308 bytes. The expected length of the longest run of printable characters would then be 12.2 characters, and so the actual length of the longest run is in line with what theory would have predicted.
 
[1] Unix documentation is separated into sections, and parentheses after a name specify the documentation section. Section 1 is for programs, Section 2 is for system calls, etc. So, for example, chmod(1) is the command line utility named chmod, and chmod(2) is the system call by the same name. Since command line utilities often have names that are common words, tacking (1) on the end helps distinguish program names from English words.
[2] p could be a little larger if you consider some of the control characters to be printable.
The post Finding strings in binary files first appeared on John D. Cook.

Extract text from a PDF

John — Sat, 20 Apr 2024 12:31:59 +0000

Arshad Khan left a comment on my post on the less and more utilities saying “on ubuntu if I do less on a pdf file, it shows me the text contents of the pdf.”

Apparently this is an undocumented feature of GNU less. It works, but I don’t see anything about it in the man page documentation [1].

Not all versions of less do this. On my Mac, less applied to a PDF gives a warning saying “… may be a binary file. See it anyway?” If you insist, it will dump gibberish to the command line.

A more portable way to extract text from a PDF would be to use something like the pypdf Python module:

    from pypdf import PdfReader

    reader = PdfReader("myfile.pdf")
    for page in reader.pages:
        print(page.extract_text())

The pypdf documentation gives several options for how to extract text. The documentation also gives a helpful discussion of why it’s not always clear what extracting text from a PDF should mean. Should captions and page numbers be extracted? What about tables? In what order should text elements be extracted?

PDF files are notoriously bad as a data exchange format. When you extract text from a PDF, you’re likely not using the file in a way its author intended, maybe even in a way the author tried to discourage.

Related post: Your PDF may reveal more than you intend

[1] Update: Several people have responded saying that that less isn’t extracting the text from a PDF, but lesspipe is. That would explain why it’s not a documented feature of less. But it’s not clear how lesspipe is implicitly inserting itself.

Further update: Thanks to Konrad Hinsen for pointing me to this explanation. less reads an environment variable LESSOPEN for a preprocessor to run on its arguments, and that variable is, on some systems, set to lesspipe.

The post Extract text from a PDF first appeared on John D. Cook.

Length of a general Archimedean spiral

John — Fri, 19 Apr 2024 23:58:15 +0000

This post ties together the previous three posts.

In this post, I said that an Archimedean spiral has the polar equation

r = b θ^1/n

and applied this here to rolls of carpet.

When n = 1, the length of the spiral for θ running from 0 to T is approximately

½ bT²

with the approximation becoming more accurate [1] as T increases.

In this post we want to look at the general case where n might not be 1. In that case the arc length is given by a hypergeometric function, and finding the asymptotic behavior for large T requires evaluating a hypergeometric function at a large argument.

Here’s an example with n = 3.

Now the arms are not equally spaced but instead grow closer together.

Arc length

According to MathWorld, the length of the spiral with n > 1 and θ running from 0 to T is given by

As I discussed here, the power series defining a hypergeometric function ₂F₁ diverges for arguments outside the unit disk, but the function can be extended by analytic continuation using the identity

Asymptotics

Most of the terms above will drop out in the limit as z = −n²T² gets large. The 1/z terms go to zero, and hypergeometric functions equal 1 when z = 0, and so the F terms on the right hand side above go to 1 as T increases.

In our application the hypergeometric parameters are a = −1/2, b = 1/2n, and c = 1 + 1/2n. The term

(−z)^−a = (−z)^1/2

matters, but the term

(−z)^−b = (−z)^−1/2n

goes to zero and can be ignored.

We find that

L ≈ k T^{1 + 1/n}

where the constant k is given by

k = bn Γ(1 + 1/2n) Γ(1/2 + 1/2n) / Γ(1/2n) Γ(3/2 + 1/2n)

When n = 1, this reduces to L ≈ ½ bT² as before.

[1] The relative approximation error decreases approximately quadratically in T. But the absolute error grows algorithmically.

The post Length of a general Archimedean spiral first appeared on John D. Cook.

How big will a carpet be when you roll or unroll it?

John — Thu, 18 Apr 2024 13:52:57 +0000

If you know the dimensions of a carpet, what will the dimensions be when you roll it up into a cylinder?

If you know the dimensions of a rolled-up carpet, what will the dimensions be when you unroll it?

This post answers both questions.

Flexible carpet: solid cylinder

The edge of a rolled-up carpet can be described as an Archimedian spiral. In polar coordinates, this spiral has the equation

r = hθ / 2π

where h is the thickness of the carpet. The previous post gave an exact formula for the length L of the spiral, given the maximum value of θ which we denoted T. It also gave a simple approximation that is quite accurate when T is large, namely

L = hT² / 4π

If r₁ is the radius of the carpet as a rolled up cylinder, r₁ = hT / 2π and so T = 2π r₁ / h. So when we unroll the carpet

L = hT² / 4π = πr₁² / h.

Now suppose we know the length L and want to find the radius r when we roll up the carpet.

T = √(hL/π).

Stiff carpet: hollow cylinder

The discussion so far has assumed that the spiral starts from the origin, i.e. the carpet is rolled up so tightly that there’s no hole in the middle. This may be a reasonable assumption for a very flexible carpet. But if the carpet is stiff, the rolled up carpet will not be a solid cylinder but a cylinder with a cylindrical hole in the middle.

In the case of a hollow cylinder, there is an inner radius r₀ and an outer radius r₁. This means θ runs from T₀ = 2π r₀/h to T₁ = 2πr₁/h.

To find the length of the spiral running from T₀ to T₁ we find the length of a spiral that runs from 0 to T₁ and subtract the length of a spiral from 0 to T₀

L = πr₁² / h − πr₀² / h = π(r₁² − r₀²)/h.

This approximation is even better because the approximation is least accurate for small T, and we’ve subtracted off that part.

Now let’s go the other way around and find the outer radius r₁ when we know the length L. We also need to know the inner radius r₀. So suppose we are wrapping the carpet around a cardboard tube of radius r₀. Then

r₁ = √(r₀² + hL/π).

The post How big will a carpet be when you roll or unroll it? first appeared on John D. Cook.

Approximating a spiral by rings

John — Wed, 17 Apr 2024 14:19:31 +0000

An Archimedean spiral has the polar equation

r = b θ^1/n

This post will look at the case n = 1. I may look at more general values of n in a future post. (Update: See here.) The case n = 1 is the simplest case, and it’s the case I needed for the client project that motivated this post.

In this case the spacing between points where the spiral crosses an axis is constant. Call this constant h. Then

h = 2πb.

For example, when rolling up a carpet, h corresponds to the thickness of the carpet.

Suppose θ runs from 0 to 2πm, wrapping around the origin m times. We could approximate the spiral by m concentric circles of radius h, 2h, 3h, …, mh. To visualize this, we’re approximating the length of the red spiral on the left with that of the blue circles on the right.

We could approximate this further by saying we have m/2 circles whose average radius is πmb. This suggests the length of the spiral should be approximately

2π²m²b

How good is this approximation? What happens to the relative error as θ increases? Intuitively, each wrap around the origin is more like a circle as θ increases, so we’d expect the approximation to improve for large θ.

According to Mathworld, the exact length of the spiral is

πbm √(1 + (2πm)²) + b arcsinh(2πm) /2

When m is so large that we can ignore the 1 in √(1 + (2πm)²) then the first term is the same as the circle approximation, and all that’s left is the arcsinh term, which is on the order of log m because

arcsinh(x) = log(x + (1 + x²)^1/2).

So for large m, the arc length is on the order of m² while the error is on the order of log m. This means the relative error is O( log(m) / m² ). [1]

We’ve assumed m was an integer because that makes it easier to visual approximating the spiral by circles, but that assumption is not necessary. We could restate the problem in terms of the final value of θ. Say θ runs from 0 to T. Then we could solve

T = 2πm

for m and say that the approximate arc length is

½ bT²

and the exact length is

½ bT(1 + T²)^1/2 + ½ b arcsinh(T).

The relative approximation error is O( log(T) / T² ).

Here’s a plot of the error as a function of T assuming b = 1.

[1] The error in approximating √(1 + (2πm)²) with 2πm is on the order of 1/(4πm) and so is smaller than the logarithmic term.

The post Approximating a spiral by rings first appeared on John D. Cook.

Hypergeometric function of a large negative argument

John — Wed, 17 Apr 2024 01:00:07 +0000

It’s occasionally necessary to evaluate a hypergeometric function at a large negative argument. I was working on a project today that involved evaluating F(a, b; c; z) where z is a large negative number.

The hypergeometric function F(a, b; c; z) is defined by a power series in z whose coefficients are functions of a, b, and c. However, this power series has radius of convergence 1. This means you can’t use the series to evaluate F(a, b; c; z) for z < −1.

It’s important to keep in mind the difference between a function and its power series representation. The former may exist where the latter does not. A simple example is the function f(z) = 1/(1 − z). The power series for this function has radius 1, but the function is defined everywhere except at z = 1.

Although the series defining F(a, b; c; z) is confined to the unit disk, the function itself is not. It can be extended analytically beyond the unit disk, usually with a branch cut along the real axis for z ≥ 1.

It’s good to know that our function can be evaluated for large negative x, but how do we evaluate it?

Linear transformation formulas

Hypergeometric functions satisfy a huge number of identities, the simplest of which are known as the linear transformation formulas even though they are not linear transformations of z. They involve bilinear transformations z, a.k.a. fractional linear transformations, a.k.a. Möbius transformations. [1]

One such transformation is the following, found in A&S 15.3.4 [2].

If z < 1, then 0 < z/(z − 1) < 1, which is inside the radius of convergence. However, as z goes off to −∞, z/(z − 1) approaches 1, and the convergence of the power series will be slow.

A more complicated, but more efficient, formula is A&S 15.3.7, a linear transformation formula relates F at z to two other hypergeometric functions evaluated at 1/z. Now when z is large, 1/z is small, and these series will converge quickly.

[1] It turns out these transformations are linear, but not as functions of a complex argument. They’re linear as transformations on a projective space. More on that here.

[2] A&S refers to the venerable Handbook of Mathematical Functions by Abramowitz and Stegun.

The post Hypergeometric function of a large negative argument first appeared on John D. Cook.

More is less

John — Tue, 16 Apr 2024 11:40:35 +0000

When I first started using Unix, I used a program called “more” to read files. The name makes sense because each time you press the space bar, more will show you more of your file, one screen at a time.

Now everyone uses less, and more is all but forgotten.

Daniel Halbert wrote more in 1978. Mark Nudelman a similar program with more functionality in 1984 which he named less. The name was a pun on the aphorism “less is more” [1]. Soon less completely replaced more.

I’m curious why I ever used more, since less had taken over before I touched Unix. One possibility is that someone who was accustomed to more showed me that command. Another possibility is that I learned it from reading The Unix Programming Environment which came out in November 1983. It includes more but not less.

My laptop contains executables for more and less in /usr/bin. The command

    diff less more

returns nothing, indicating that the binaries are identical: less literally is more.

My desktop has distinct binaries for less and more. The more binary is much smaller, and so I assume it is limited to the original functionality of more, more or less.

[1] I don’t know who coined the phrase “less is more,” but it is associated with architect Ludwig Mies van der Rohe (1886–1969) who often quoted it. He did not apply the principle to is own name, however. He was born Ludwig Mies and later appended van der Rohe.

The post More is less first appeared on John D. Cook.

Precise answers to useless questions

John — Fri, 12 Apr 2024 12:25:05 +0000

I recently ran across a tweet from Allen Downey saying

So much of 20th century statistics was just a waste of time, computing precise answers to useless questions.

He’s right. I taught mathematical statistics at GSBS [1, 2] several times, and each time I taught it I became more aware of how pointless some of the material was.

I do believe mathematical statistics is useful, even some parts whose usefulness isn’t immediately obvious, but there were other parts of the curriculum I couldn’t justify spending time on [3].

Fun and profit

I’ll say this in partial defense of computing precise answers to useless questions: it can be fun and good for your career.

Mathematics problems coming out of statistics can be more interesting, and even more useful, than the statistical applications that motivate them. Several times in my former career a statistical problem of dubious utility lead to an interesting mathematical puzzle.

Solving practical research problems in statistics is hard, and can be hard to publish. If research addresses a practical problem that a reviewer isn’t aware of, a journal may reject it. The solution to a problem in mathematical statistics, regardless of its utility, is easier to publish.

Private sector

Outside of academia there is less demand for precise answers to useless questions. A consultant can be sure that a client finds a specific project useful because they’re willing to directly spend money on it. Maybe the client is mistaken, but they’re betting that they’re not.

Academics get grants for various lines of research, but this isn’t the same kind of feedback because the people who approve grants are not spending their own money. Imagine a grant reviewer saying “I think this research is so important, I’m not only going to approve this proposal, I’m going to write the applicant a $5,000 personal check.”

Consulting clients may be giving away someone else’s money too, but they have a closer connection to the source of the funds than a bureaucrat has when dispensing tax money.

Notes

[1] When I taught there, GSBS was The University of Texas Graduate School of Biomedical Sciences. I visited their website this morning, and apparently GSBS is now part of, or at least co-branded with, MD Anderson Cancer Center.

There was a connection between GSBS and MDACC at the time. Some of the GSBS instructors, like myself, were MDACC employees who volunteered to teach a class.

[2] Incidentally, there was a connection between GSBS and Allen Downey: one of my students was his former student, and he told me what a good teacher Allen was.

[3] I talk about utility a lot in this post, but I’m not a utilitarian. There are good reasons to learn things that are not obviously useful. But the appeal of statistics is precisely its utility, and so statistics that isn’t useful is particularly distasteful.

Pure math is beautiful (and occasionally useful) and applied math is useful (and occasionally beautiful). But there’s no reason to study fake applied math that is neither beautiful or useful.

The post Precise answers to useless questions first appeared on John D. Cook.

Pairs in poker

John — Thu, 11 Apr 2024 02:38:24 +0000

An article by Y. L. Cheung [1] gives reasons why poker is usually played with five cards. The author gives several reasons, but here I’ll just look at one reason: pairs don’t act like you might expect if you have more than five cards.

In five-card poker, the more pairs the better. Better here means less likely. One pair is better than no pair, and two pairs is better than one pair. But in six-card or seven-card poker, a hand with no pair is less likely than a hand with one pair.

For a five-card hand, the probabilities of 0, 1, or 2 pair are 0.5012, 0.4226, and 0.0475 respectively.

For a six-card hand, the same probabilities are 0.3431, 0.4855, and 0.1214.

For a seven-card hand, the probabilities are 0.2091, 0.4728, and 0.2216.

[1] Y. L. Cheung. Why Poker Is Played with Five Cards. The Mathematical Gazette, Dec., 1989, Vol. 73, No. 466 (Dec., 1989), pp. 313–315

The post Pairs in poker first appeared on John D. Cook.

Solar system means

John — Wed, 10 Apr 2024 14:40:27 +0000

Yesterday I stumbled on the fact that the size of Jupiter is roughly the geometric mean between the sizes of Earth and the Sun. That’s not too surprising: in some sense (i.e. on a logarithmic scale) Jupiter is the middle sized object in our solar system.

What I find more surprising is that a systematic search finds mean relationships that are far more accurate. The radius of Jupiter is within 5% of the geometric mean of the radii of the Earth and Sun. But all the mean relations below have an error less than 1%.

The radius of Mercury equals the geometric mean of the radii of the Moon and Mars, within 0.052%.

The radius of Mars equals the harmonic mean of the radii of the Moon and Jupiter, within 0.08%.

The radius of Uranus equals the arithmetic-geometric mean of the radii of Earth and Saturn, within 0.0018%.

See the links below for more on AM, GM, and AGM.

Now let’s look at masses.

The mass of Earth is the geometric mean of the masses of Mercury and Neptune, within 2.75%. This is the least accurate approximation in this post.

The mass of Pluto is the harmonic mean of the masses of the Moon and Mars, within 0.7%.

The mass of Uranus is the arithmetic-geometric mean of the masses of of the Moon and Saturn, within 0.54%.

The post Solar system means first appeared on John D. Cook.

Earth : Jupiter :: Jupiter : Sun

John — Tue, 09 Apr 2024 17:50:06 +0000

The size of Jupiter is approximately the geometric mean of the sizes of Sun and Earth.

In terms of radii,

The ratio on the left equals 9.95 and the ratio on the left equals 10.98.

The subscripts are the astronomical symbols for the Sun (☉, U+2609), Jupiter (♃, U+2643), and Earth (, U+1F728). I produced them in LaTeX using the mathabx package and the commands \Sun, Jupiter, and Earth.

The the mathabx symbol for Jupiter is a little unusual. It looks italicized, but that’s not because the symbol is being used in math mode. Notice that the vertical bar in the symbol for Earth is vertical, i.e. not italicized.

The post Earth : Jupiter :: Jupiter : Sun first appeared on John D. Cook.

Gravity on Jupiter

John — Tue, 09 Apr 2024 12:57:14 +0000

I was listening to the latest episode of the Space Rocket History podcast. The show includes some audio from a documentary on Pioneer 11 that mentioned that a man would weigh 500 pounds on Jupiter.

My immediate thought was “Is that all?! Is this ‘man’ a 100 pound boy?”

The documentary was correct and my intuition was wrong. And the implied mass of the man in the documentary is 190 pounds.

Jupiter has more than 300 times more mass than the earth. Why is its surface gravity only 2.6 times that of the earth?

Although Jupiter is very massive, it is also very large. Gravitational attraction is proportional to mass, but inversely proportional to the square of distance.

A satellite in orbit 100,000 km from the center of Jupiter would feel 300 times as much gravity as one in orbit the same distance from the center of Earth. But the surface of Jupiter is further from its center of mass than the surface of Earth is from its center of mass.

The mass of Jupiter is 318 times that of Earth, and the its mean radius is 11 times that of Earth. So the ratio of gravity on the surface of Jupiter to gravity on the Earth’s surface is

318 / 11² = 2.63

Now suppose a planet had the same density as Earth but a radius of r Earth radii. Then its mass would be r³ times greater, but its surface gravity would only be r times greater since gravity follows an inverse square law. So if Jupiter were made of the same stuff as Earth, its surface gravity would be 11 times greater. But Jupiter is a gas giant, so its surface gravity is only 2.6 times greater.

The post Gravity on Jupiter first appeared on John D. Cook.

Are guidance documents laws?

John — Fri, 05 Apr 2024 23:48:05 +0000

Are guidance documents laws? No, but they can have legal significance.

The people who generate regulatory guidance documents are not legislators. Legislators delegate to agencies to make rules, and agencies delegate to other organizations to make guidelines. For example [1],

Even HHS, which has express cybersecurity rulemaking authority under the Health Insurance Portability and Accountability Act (HIPAA), has put a lot of the details of what it considers adequate cybersecurity into non-binding guidelines.

I’m not a lawyer, so nothing I can should be considered legal advice. However, the authors of [1] are lawyers.

The legal status of guidance documents is contested. According to [2], Executive Order 13892 said that agencies

may not treat noncompliance with a standard of conduct announced solely in a guidance document as itself a violation of applicable statutes or regulations.

Makes sense to me, but EO 13992 revoked EO 13892.

Again according to [3],

Under the common law, it used to be that government advisories, guidelines, and other non-binding statements were non-binding hearsay [in private litigation]. However, in 1975, the Fifth Circuit held that advisory materials … are an exception to the hearsay rule … It’s not clear if this is now the majority rule.

In short, it’s fuzzy.

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition, page 245.

[2] Ibid., page 199.

[3] Ibid., page 200.

The post Are guidance documents laws? first appeared on John D. Cook.

More Laguerre images

John — Tue, 02 Apr 2024 03:10:27 +0000

A week or two ago I wrote about Laguerre’s root-finding method and made some associated images. This post gives a couple more examples.

Laguerre’s method is very robust in the sense that it is likely to converge to a root, regardless of the starting point. However, it may be difficult to predict which root the method will end up at. To visualize this, we color points according to which root they converge to.

First, let’s look at the polynomial

(x − 2)(x − 4)(x − 24)

which clearly has roots at 2, 4, and 24. We’ll generate random starting points and color them blue, orange, or green depending on whether they converge to 2, 4, or 24. Here’s the result.

To make this easier to see, let’s split it into each color: blue, orange, and green.

Now let’s change our polynomial by moving the root at 4 to 4i.

(x − 2)(x − 4i)(x − 24)

Here’s the combined result.

And here is each color separately.

As we explained last time, the area taken up by the separate colors seems to exceed the total area. That is because the colors are so intermingled that many of the dots in the images cover some territory that belongs to another color, even though the dots are quite small.

The post More Laguerre images first appeared on John D. Cook.

A Bayesian approach to proving you’re human

John — Thu, 28 Mar 2024 14:47:00 +0000

I set up a GitHub account for a new employee this morning and spent a ridiculous amount of time proving that I’m human.

The captcha was to listen to three audio clips at a time and say which one contains bird sounds. This is a really clever test, because humans can tell the difference between real bird sounds and synthesized bird-like sounds. And we’re generally good at recognizing bird sounds even against a background of competing sounds. But some of these were ambiguous, and I had real birds chirping outside my window while I was doing the captcha.

You have to do 20 of these tests, and apparently you have to get all 20 right. I didn’t. So I tried again. On the last test I accidentally clicked the start-over button rather than the submit button. I wasn’t willing to listen to another 20 triples of audio clips, so I switched over to the visual captcha tests.

These kinds of tests could be made less annoying and more secure by using a Bayesian approach.

Suppose someone solves 19 out of 20 puzzles correctly. You require 20 out of 20, so you have them start over. When you do, you’re throwing away information. You require 20 more puzzles, despite the fact that they only missed one. And if a bot had solved say 8 out of 20 puzzles, you’d let it pass if it passes the next 20.

If you wipe your memory after every round of 20 puzzles, and allow unlimited do-overs, then any sufficiently persistent entity will almost certainly pass eventually.

Bayesian statistics reflects common sense. After someone (or something) has correctly solved 19 out of 20 puzzles designed to be hard for machines to solve, your conviction that this entity is human is higher than if they/it had solved 8 out of 20 correctly. You don’t need as much additional evidence in the first case as in the latter to be sufficiently convinced.

Here’s how a Bayesian captcha could work. You start out with some distribution on your probability θ that an entity is human, say a uniform distribution. You present a sequence of puzzles, recalculating your posterior distribution after each puzzle, until the posterior probability that this entity is human crosses some upper threshold, say 0.95, or some lower threshold, say 0.50. If the upper threshold is crossed, you decide the entity is likely human. If the lower threshold is crossed, you decide the entity is likely not human.

If solving 20 out of 20 puzzles correctly crosses your threshold of human detection, then after solving 19 out 20 correctly your posterior probability of humanity is close to the upper threshold and would only require a few more puzzles. And if an entity solved 8 out of 20 puzzles correctly, that may cross your lower threshold. If not, maybe only a few more puzzles would be necessary to reject the entity as non-human.

When I worked at MD Anderson Cancer Center we applied this approach to adaptive clinical trials. A clinical trial might stop early because of particularly good results or particularly bad results. Clinical trials are expensive, both in terms of human costs and financial costs. Rejecting poor treatments quickly, and sending promising treatments on to the next stage quickly, is both humane and economical.

The post A Bayesian approach to proving you’re human first appeared on John D. Cook.

Antenna length: Another rule of 72

John — Wed, 27 Mar 2024 14:33:15 +0000

The famous Rule of 72 says that to find out how many years it takes an investment to double in value, divide 72 by the annual percentage rate. I’ll come back to that in a little bit.

This morning I read a really good article, Fifty Things you can do with a Software Defined Radio. The article includes a rule of thumb for how long an antenna needs to be.

My rule of thumb was to divide 72 by the frequency in MHz, and take that as the length of each side of the dipole in meters [1]. That’d make the whole antenna a bit shorter than half of the wavelength.

Ideally an antenna should be as long as half a wavelength of the signal you want to receive. Light travels 3 × 10⁸ meters per second, so one wavelength of a 1 MHz signal is 300 m. A quarter wavelength, the length of one side of a dipole antenna, would be 75 m. Call it 72 m because 72 has lots of small factors, i.e. it’s usually mentally easier to divide things into 72 than 75. Rounding 75 down to 72 results in the antenna being a little shorter than ideal. But antennas are forgiving, especially for receiving.

Update: There’s more to replacing 75 with 72 than simplifying mental arithmetic. See Markus Meier’s comment below.

Just as the Rule of 72 for antennas rounds 75 down to 72, the Rule of 72 for interest rounds 69.3 up to 72, both for ease of mental calculation.

The approximation step comes from the approximation log(1 + x) ≈ x for small x, a first order Taylor approximation.

The last line would be 2 rather than 2.05 if we replaced 72 with 100 log(2) = 69.3. That’s where the factor of 69.3 mentioned above comes from.

[1] The post actually says centimeters, but the author meant to say meters.

The post Antenna length: Another rule of 72 first appeared on John D. Cook.

Hallucinations of AI Science Models

Wayne Joubert — Tue, 26 Mar 2024 15:35:48 +0000

AlphaFold 2, FourCastNet and CorrDiff are exciting. AI-driven autonomous labs are going to be a big deal [1]. Science codes now use AI and machine learning to make scientific discoveries on the world’s most powerful computers [2].

It’s common practice for scientists to ask questions about the validity, reliability and accuracy of the mathematical and computational methods they use. And many have voiced concerns about the lack of explainability and potential pitfalls of AI models, in particular deep neural networks (DNNs) [3].

The impact of this uncertainty varies highly according to project. Science projects that are able to easily check AI-generated results against ground truth may not be that concerned. High-stakes projects like design of a multimillion dollar spacecraft with high project risks may ask about AI model accuracy with more urgency.

Neural network accuracy

Understanding of the properties of DNNs is still in its infancy, with many as-yet unanswered questions. However, in the last few years some significant results have started to come forth.

A fruitful approach to analyzing DNNs is to see them as function approximators (which, of course, they are). One can study how accurately DNNs approximate a function representing some physical phenomenon in a domain (for example, fluid density or temperature).

The approximation error can be measured in various ways. A particularly strong measure is “sup-norm” or “max-norm” error, which requires that the DNN approximation be accurate at every point of the target function’s domain (“uniform approximation”). Some science problems may have a weaker requirement than this, such as low RMS or 2-norm error. However, it’s not unreasonable to ask about max-norm approximation behaviors of numerical methods [4,5].

An illuminating paper by Ben Adcock and Nick Dexter looks at this problem [6]. They show that standard DNN methods applied even to a simple 1-dimensional problem can result in “glitches”: the DNN as a whole matches the function well but at some points totally misapproximates the target function. For a picture that shows this, see [7].

Other mathematical papers have subsequently shed light on these behaviors. I’ll try to summarize the findings below, though the actual findings are very nuanced, and many details are left out here. The reader is advised to refer to the respective papers for exact details.

The findings address three questions: 1) how many DNN parameters are required to approximate a function well? 2) how much data is required to train to a desired accuracy? and 3) what algorithms are capable of training to the desired accuracy?

How many neural network weights are needed?

How large does the neural network need to be for accurate uniform approximation of functions? If tight max-norm approximation requires an excessively large number of weights, then use of DNNs is not computationally practical.

Some answers to this question have been found—in particular, a result ¹ is given in [8, Theorem 4.3; cf. 9, 10]. This result shows that the number of neural network weights required to approximate an arbitrary function to high max-norm accuracy grows exponentially in the dimension of the input to the function.

This dependency on dimension is no small limitation, insofar as this is not the dimension of physical space (e.g., 3-D) but the dimension of the input vector (such as the number of gridcells), which for practical problems can be in the tens [11] or even millions or more.

Sadly, this rules out the practical use of DNN for some purposes. Nonetheless, for many practical applications of deep learning, the approximation behaviors are not nearly so pessimistic as this would indicate (cp. [12]). For example, results are more optimistic:

if the target function has a strong smoothness property;
if the function is not arbitrary but is a composition of simpler functions;
if the training and test data are restricted to a (possibly unknown) lower dimensional manifold in the high dimensional space (this is certainly the case for common image and language modeling tasks);
if the average case behavior for the desired problem domain is much better than the worst case behavior addressed in the theorem;
The theorem assumes multilayer perceptron and ReLU activation; other DNN architectures may perform better (though the analysis is based on multidimensional Taylor’s theorem, which one might conjecture applies also to other architectures).
For many practical applications, very high accuracy is not a requirement.
For some applications, only low 2-norm error is sufficient, (not low max-norm).
For the special case of physics informed neural networks (PINNs), stronger results hold.

Thus, not all hope is lost from the standpoint of theory. However, certain problems for which high accuracy is required may not be suitable for DNN approximation.

How much training data is needed?

Assuming your space of neural network candidates is expressive enough to closely represent the target function—how much training data is required to actually find a good approximation?

A result ² is given in [13, Theorem 2.2] showing that the number of training samples required to train to high max-norm accuracy grows, again, exponentially in the dimension of the input to the function.

The authors concede however that “if additional problem information about [the target functions] can be incorporated into the learning problem it may be possible to overcome the barriers shown in this work.” One suspects that some of the caveats given above might also be relevant here. Additionally, if one considers 2-norm error instead of max-norm error, the data requirement grows polynomially rather than exponentially, making the training data requirement much more tractable. Nonetheless, for some problems the amount of data required is so large that attempts to “pin down” the DNN to sufficient accuracy become intractable.

What methods can train to high accuracy?

The amount of training data may be sufficient to specify a suitable neural network. But, will standard methods for finding the weights of such a DNN be effective for solving this difficult nonconvex optimization problem?

A recent paper [14] from Max Tegmark’s group empirically studies DNN training to high accuracy. They find that as the input dimension grows, training to very high accuracy with standard stochastic gradient descent methods becomes difficult or impossible.

They also find second order methods perform much better, though these are more computationally expensive and have some difficulty also when the dimension is higher. Notably, second order methods have been used effectively for DNN training for some science applications [15]. Also, various alternative training approaches have been tried to attempt to stabilize training; see, e.g., [16].

Prospects and conclusions

Application of AI methods to scientific discovery continues to deliver amazing results, in spite of lagging theory. Ilya Sutskever has commented, “Progress in AI is a game of faith. The more faith you have, the more progress you can make” [17].

Theory of deep learning methods is in its infancy. The current findings show some cases for which use of DNN methods may not be fruitful, Continued discoveries in deep learning theory can help better guide how to use the methods effectively and inform where new algorithmic advances are needed.

Footnotes

¹ Suppose the function to be approximated takes d inputs and has the smoothness property that all n^th partial derivatives are continuous (i.e., is in Cⁿ(Ω) for compact Ω). Also suppose a multilayer perceptron with ReLU activation functions must be able to approximate any such function to max-norm no worse than ε. Then the number of weights required is at least a fixed constant times (1/ε)^d/(2n).

² Let F be the space of all functions that can be approximated exactly by a broad class of ReLU neural networks. Suppose there is a training method that can recover all these functions up to max-norm accuracy bounded by ε. Then the number of training samples required is at least a fixed constant times (1/ε)^d.

References

[1] “Integrated Research Infrastructure Architecture Blueprint Activity (Final Report 2023),” https://www.osti.gov/biblio/1984466.

[2] Joubert, Wayne, Bronson Messer, Philip C. Roth, Antigoni Georgiadou, Justin Lietz, Markus Eisenbach, and Junqi Yin. “Learning to Scale the Summit: AI for Science on a Leadership Supercomputer.” In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1246-1255. IEEE, 2022, https://www.osti.gov/biblio/2076211.

[3] “Reproducibility Workshop: The Reproducibility Crisis in ML‑based Science,” Princeton University, July 28, 2022, https://sites.google.com/princeton.edu/rep-workshop.

[4] Wahlbin, L. B. (1978). Maximum norm error estimates in the finite element method with isoparametric quadratic elements and numerical integration. RAIRO. Analyse numérique, 12(2), 173-202, https://www.esaim-m2an.org/articles/m2an/pdf/1978/02/m2an1978120201731.pdf

[5] Kashiwabara, T., & Kemmochi, T. (2018). Maximum norm error estimates for the finite element approximation of parabolic problems on smooth domains. https://arxiv.org/abs/1805.01336.

[6] Adcock, Ben, and Nick Dexter. “The gap between theory and practice in function approximation with deep neural networks.” SIAM Journal on Mathematics of Data Science 3, no. 2 (2021): 624-655, https://epubs.siam.org/doi/10.1137/20M131309X.

[7] “Figure 5 from The gap between theory and practice in function approximation with deep neural networks | Semantic Scholar,” https://www.semanticscholar.org/paper/The-gap-between-theory-and-practice-in-function-Adcock-Dexter/156bbfc996985f6c65a51bc2f9522da2a1de1f5f/figure/4

[8] Gühring, I., Raslan, M., & Kutyniok, G. (2022). Expressivity of Deep Neural Networks. In P. Grohs & G. Kutyniok (Eds.), Mathematical Aspects of Deep Learning (pp. 149-199). Cambridge: Cambridge University Press. doi:10.1017/9781009025096.004, https://arxiv.org/abs/2007.04759.

[9] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017, https://arxiv.org/abs/1610.01145

[10] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep relu neural networks in W^s,p norms. Anal. Appl. (Singap.), pages 1–57, 2019, https://arxiv.org/abs/1902.07896

[11] Matt R. Norman, “The MiniWeather Mini App,” https://github.com/mrnorman/miniWeather

[12] Lin, H.W., Tegmark, M. & Rolnick, D. Why Does Deep and Cheap Learning Work So Well?. J Stat Phys 168, 1223–1247 (2017). https://doi.org/10.1007/s10955-017-1836-5

[13] Berner, J., Grohs, P., & Voigtlaender, F. (2022). Training ReLU networks to high uniform accuracy is intractable. ICLR 2023, https://openreview.net/forum?id=nchvKfvNeX0.

[14] Michaud, E. J., Liu, Z., & Tegmark, M. (2023). Precision machine learning. Entropy, 25(1), 175, https://www.mdpi.com/1099-4300/25/1/175.

[15] Markidis, S. (2021). The old and the new: Can physics-informed deep-learning replace traditional linear solvers?. Frontiers in big Data, 4, 669097, https://www.frontiersin.org/articles/10.3389/fdata.2021.669097/full

[16] Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19, https://papers.nips.cc/paper_files/paper/2006/hash/5da713a690c067105aeb2fae32403405-Abstract.html

[17] “Chat with OpenAI CEO and and Co-founder Sam Altman, and Chief Scientist Ilya Sutskever,” https://www.youtube.com/watch?v=mC-0XqTAeMQ&t=250s

The post Hallucinations of AI Science Models first appeared on John D. Cook.

Double super factorial

John — Mon, 25 Mar 2024 11:30:27 +0000

I saw someone point out recently that

10! = 7! × 5! × 3! × 1!

Are there more examples like this?

What would you call the pattern on the right? I don’t think there’s a standard name, but here’s why I think it should be called double super factorial or super double factorial.

Super factorial

The factorial of a positive number n is the product of the positive numbers up to and including n. The super factorial of n is the product of the factorials of the positive numbers up to and including n. So, for example, 7 super factorial would be

7! × 6! × 5! × 4! × 3! × 2! × 1!

Double factorial

The double factorial of a positive number n is the product of all the positive numbers up to n with the same parity of n. So, for example, the double factorial of 7 would be

7!! = 7 × 5 × 3 × 1.

Double superfactorial

The pattern at the top of the post is like super factorial, but it only includes odd terms, so it’s like a cross between super factorial and double factorial, hence double super factorial.

Denote the double super factorial of n as dsf(n), the product of the factorials of all numbers up to n with the same parity as n. That is,

dsf(n) = n! × (n − 2)! × (n − 4)! × … × 1

where the 1 at the end is 1! if n is odd and 0! if n is even. In this notation, the observation at the top of the post is

10! = dsf(7).

Super double factorial

We can see by re-arranging terms that a double super factorial is also a super double factorial. For example, look at

dsf(7) = 7! × 5! × 3! × 1!

If we separate out the first term in each factorial we have

(7 × 5 × 3 × 1)(6! × 4! × 2!) = 7!! dsf(6)

We can keep going and show in general that

dsf(n) = n!! × (n − 1)!! × (n − 2)!! … × 1

We could call the right hand side super double factorial, sdf(n). Just as a super factorial is a product of factorials, a super double factorials is a product of double factorials. Therefore

dsf(n) = sdf(n).

Factorials that equal double super factorials

Are there more solutions to

n! = dsf(m).

besides n = 10 and m = 7? Yes, here are some.

0! = dsf(0)
1! = dsf(1)
2! = dsf(2)
3! = dsf(3)
6! = dsf(5)

There are no solutions to

n! = dsf(m)

if n > 10. Here’s a sketch of a proof.

Bertrand’s postulate says that for n > 1 there is always a prime p between n and 2n. Now p divides (2n)! but p cannot divide dsf(n) because dsf(n) only has factors less than or equal to n.

If we can show that for some N, n > N implies (2n)! < dsf(n) then there are no solutions to

n! = dsf(m)

for n > 2N because there is a prime p between N and 2N that divides the left side but not the right. In fact N = 12. We can show empirically there are no solutions for n = 11 up to 24, and the proof shows there are no solutions for n > 24.

The post Double super factorial first appeared on John D. Cook.

Laguerre’s root finding method

John — Fri, 22 Mar 2024 15:52:06 +0000

Edmond Laguerre (1834–1886) came up with a method for finding zeros of polynomials. Unlike Newton’s method for finding zeros of general functions, Laguerre’s method is specialized for polynomials. Laguerre’s method converges an order of magnitude faster than Newton’s method, i.e. the error is cubed on each step rather than squared.

The most interesting thing about Laguerre’s method is that it nearly always converges to a root, no matter where you start. Newton’s method, on the other hand, converges quickly if you start sufficiently close to a root. If you don’t start close enough to a root, the method might cycle or go off to infinity.

The first time I taught numerical analysis, the textbook we used began the section on Newton’s method with the following nursery rhyme:

There was a little girl,
Who had a little curl,
Right in the middle of her forehead.
When she was good,
She was very, very good,
But when she was bad, she was horrid.

When Newton’s method is good, it is very, very good, but when it is bad it is horrid.

Laguerre’s method is not well understood. Experiments show that it nearly always converges, but there’s not much theory to explain what’s going on.

The method is robust in that it is very likely to converge to some root, but it may not converge to the root you expected unless you start sufficiently close. The rest of the post illustrates this.

Let’s look at the polynomial

p(x) = 3 + 22x + 20x² + 24x³.

This polynomial has roots at 0.15392, and at -0.3397 ± 0.83468i.

We’ll generate 2,000 random starting points for Laguerre’s method and color its location according to which root it converges to. Points converging to the real root are colored blue, points converging to the root with positive imaginary part are colored orange, and points converging to the root with negative imaginary part are colored green.

Here’s what we get:

This is hard to see, but we can tell that there aren’t many blue dots, about an equal number of orange and green dots, and the latter are thoroughly mixed together. This means the method is unlikely to converge to the real root, and about equally likely to converge to either of the complex roots.

Let’s look at just the starting points that converge to the real root, its basin of attraction. To get more resolution, we’ll generate 100,000 starting points and make the dots smaller.

The convergence region is pinched near the root; you can start fairly near the root along the real axis but converge to one of the complex roots. Notice also that there are scattered points far from the real root that converge to that point.

Next let’s look at the points that converge to the complex root in the upper half plane.

Note that the basin of attraction appears to take up over half the area. But the corresponding basin of attraction for the root in the lower half plane also appears to take up over half the area.

They can’t both take up over half the area. In fact. both take up about 48%. But the two regions are very intertwined. Due to the width of the dots used in plotting, each green dot covers a tiny bit of area that belongs to orange, and vice versa. That is, the fact that both appear to take over half the area shows how commingled they are.

The post Laguerre’s root finding method first appeared on John D. Cook.

Breach Safe Harbor

John — Thu, 21 Mar 2024 12:48:44 +0000

In the context of medical data, Safe Harbor typically refers to the Safe Harbor provisions of the HIPAA Privacy Rule explained here. Breach Safe Harbor is a little different. It basically means you’re off the hook if you breach encrypted health data. (But not necessarily. More on that below.)

I’m not a lawyer, so this isn’t legal advice. Even the HHS, who coin the term “Breach Safe Harbor” in their guidance portal, weasels out of saying they’re giving legal guidance by saying “The contents of this database lack the force and effect of law, except as authorized by law …”

Quality of encryption

You can’t just say that data were encrypted before they were breached. Weak encryption won’t cut it. You have to use acceptable algorithms and procedures.

How can you know whether you’ve encrypted data well enough to be covered Breach Safe Harbor? HHS cites four NIST publications for further guidance. (Not that I’m giving legal advice. I’m merely citing the HHS, who also is not giving legal advice.)

Here are the four publications.

Maybe encryption isn’t enough

At one point Tennessee law said a breach of encrypted data was still a breach. According to Dempsey and Carlin [1]

In 2016, Tennessee repealed its encryption safe harbor, requiring notice of breach of even encrypted data, but then in 2017, after criticism, the state restored a safe harbor for “information that has been encrypted in accordance with the current version of the Federal Information Processing Standard (FIPS) 140-2 if the encryption key has not been acquired by an unauthorized person.”

This is interesting for a couple reasons. First, there is a precedent for requiring notification of encrypted data. Second, this underscores the point above that encryption in general is not sufficient to avoid having to give notice of a breach: standard-compliant encryption is sufficient.

Consulting help

If you would like technical or statistical advice on how to prevent or prepare for a data breach, or how to respond after a data breach after the fact, we can help.

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition.

The post Breach Safe Harbor first appeared on John D. Cook.

MD5 hash collision example

John — Thu, 21 Mar 2024 01:07:54 +0000

Marc Stevens gave an example of two alphanumeric strings that differ in only one byte that have the same MD5 hash value. It may seem like beating a dead horse to demonstrate weaknesses in MD5, but it’s instructive to study the flaws of broken methods. And despite the fact that MD5 has been broken for years, lawyers still use it.

The example claims that

TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

and

TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

have the same hash value.

This raises several questions.

Are these two strings really different, and if so, how do they differ? If you stare at the strings long enough you can see that they do indeed differ by one character. But how could you compare long strings like this in a more automated way?

How could you compute the MD5 hash values of the strings to verify that they are the same?

The following Python code addresses the questions above.

from hashlib import md5
from difflib import ndiff

def showdiff(a, b):
    for i,s in enumerate(ndiff(a, b)):
        if s[0]==' ': continue
        elif s[0]=='-':
            print(u'Delete "{}" from position {}'.format(s[-1],i))
        elif s[0]=='+':
            print(u'Add "{}" to position {}'.format(s[-1],i))    

a = "TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"
b = "TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"

showdiff(a, b)

ahash = md5(a.encode('utf-8')).hexdigest()
bhash = md5(b.encode('utf-8')).hexdigest()
assert(ahash == bhash)

The basis of the showdiff function was from an answer to a question on Stack Overflow.

The output of the call to showdiff is as follows.

Delete "A" from position 21
Add "E" to position 22

This means we can form string b from a by changing the ‘A’ in position 21 to an ‘E’. There was only one difference between the two strings in this example, but showdiff could be useful for understanding more complex differences.

The assert statement passes because both strings hash to faad49866e9498fc1719f5289e7a0269.

The post MD5 hash collision example first appeared on John D. Cook.

Distance from a point to a line

John — Sat, 16 Mar 2024 16:17:46 +0000

Eric Lengyel’s new book Projective Geometric Algebra Illuminated arrived yesterday and I’m enjoying reading it. Imagine if someone started with ideas like dot products, cross products, and determinants that you might see in your first year of college, then thought deeply about those things for years. That’s kinda what the book is.

Early in the book is the example of finding the distance from a point q to a line of the form p + tv.

If you define u = q − p then a straightforward derivation shows that the distance d from q to the line is given by

But as the author explains, it is better to calculate d by

Why is that? The two expressions are algebraically equal, but the latter is better suited for numerical calculation.

The cardinal rule of numerical calculation is to avoid subtracting nearly equal floating point numbers. If two numbers agree to b bits, you may lose up to b bits of significance when computing their difference.

If u and v are vectors with large magnitude, but q is close to the line, then the first equation subtracts two large, nearly equal numbers under the square root.

The second equation involves subtraction too, but it’s less obvious. This is a common theme in numerical computing. Imagine this dialog.

[Student produces first equation.]

Mentor: Avoid subtracting nearly equal numbers.

[Student produces section equation.]

Student: OK, did it.

Mentor: That’s much better, though it could still have problems.

Where is there a subtraction in the second equation? We started with a subtraction in defining u. More subtly, the definition of cross product involves subtractions. But these subtractions involve smaller numbers than the first formula, because the first formula subtracts squared values. Eric Lengyel points this out in his book.

None of this may matter in practice, until it does matter, which is a common pattern in numerical computing. You implement something like the first formula, something that can be derived directly. You implicitly have in mind vectors whose magnitude is comparable to d and this guides your choice of unit tests, which all pass.

Some time goes by and a colleague tells you your code is failing. Impossible! You checked your derivation by hand and in Mathematica. Your unit tests all pass. Must be your colleague’s fault. But it’s not. Your code would be correct in infinite precision, but in an actual computer it fails on inputs that violate your implicit assumptions.

This can be frustrating, but it can also be fun. Implementing equations from a freshman textbook accurately, efficiently, and robustly is not a freshman-level exercise.

The post Distance from a point to a line first appeared on John D. Cook.

Experiences with Thread Programming in Microsoft Windows

Wayne Joubert — Fri, 15 Mar 2024 10:35:29 +0000

Lately I’ve been helping a colleague to add worker threads to his GUI-based Windows application.

Thread programming can be tricky. Here are a few things I’ve learned along the way.

Performance. This app does compute-intensive work. It is helpful to offload this very compute-heavy work to a worker thread. Doing this frees the main thread to service GUI requests better.

Thread libraries. Windows has multiple thread libraries, for example Microsoft Foundation Class library threads and C++ standard library threads. It is hazardous to use different thread libraries in the same app. In the extreme case, different thread libraries, such as GOMP vs. LOMP, used in resp. the GCC and LLVM compiler families, have different threading runtimes which keep track of threads in different ways. Mixing them in the same code can cause hazardous silent errors.

Memory fences are a thing. Different threads can run on different processor cores and hold variables in different respective L1 caches that are not flushed (this to maintain high performance). An update to a variable by one thread is not guaranteed to be visible to other threads without special measures. For example, one could safely transfer information using ::PostMessage coupled with a handler function on the receiver thread. Or one could send a signal using an MFC CEvent on one thread and read its Lock on the other. Also, a thread launch implicitly does a memory fence, so that, at least then, the new thread is guaranteed to correctly see the state of all memory locations.

GUI access should be done from the master thread only, not a worker thread. Doing so can result in deadlock. A worker thread can instead ::PostMessage to ask the master thread to do a GUI action.

Thread launch. By default AfxBeginThread returns a thread handle which MFC takes care of deleting when no longer needed. If you want to manage the life cycle of the handle yourself, you can do something like:

myWorkerThreadHandle = AfxBeginThread(myFunc, myParams,
  THREAD_PRIORITY_NORMAL, 0, CREATE_SUSPENDED);
myWorkerThreadHandle->m_bAutoDelete = false;
myWorkerThreadHandle->ResumeThread();

Joint use of a shared library like the DAO database library has hazards. One should beware of using the library to allocate something in one thread and deallocating in another, as this will likely allocate in a thread-local heap or stack instead of a shared thread-safe heap, this resulting in a crash.

Initialization. One should call CoInitializeEx(NULL, COINIT_APARTMENTTHREADED) and AfxDaoInit() (if using DAO) at thread initialization on both master and worker threads, and correspondingly CoUninitialize() and AfxDaoTerm() at completion of the thread.

Monitoring of thread state can be done with
WaitForSingleObject(myWorkerThreadHandle->m_hThread, 0) to determine if the thread has completed or WaitForSingleObject(myWorkerThreadHandle->m_hThread, INFINITE) for a blocking wait until completion.

Race conditions are always a risk but can be avoided by careful reasoning about execution. Someone once said up to 90% of code errors can be found by desk checking [1]. Race conditions are notoriously hard to debug, partly because they can occur nondeterministically. There are tools for trying to find race condition errors, though I’ve never tried them.

So far I find no rigorous specification of the MFC threading model online that touches on all these concerns. Hopefully this post is useful to someone else working through these issues.

References

[1] Dasso, Aristides., Funes, Ana. Verification, Validation and Testing in Software Engineering. United Kingdom: Idea Group Pub., 2007, p. 163.

The post Experiences with Thread Programming in Microsoft Windows first appeared on John D. Cook.

Accelerating Archimedes

John — Thu, 14 Mar 2024 09:30:47 +0000

One way to approximate π is to find the areas of polygons inscribed inside a circle and polygons circumscribed outside the circle. The approximation improves as the number of sides in the polygons increases. This idea goes back at least as far as Archimedes (287–212 BC). Maybe you’ve tried this. It’s a lot of work.

In 1667 James Gregory came up with a more efficient way to carry out Archimedes’ approach. Tom Edgar and David Richeson cite Gregory’s theorem in [1] as follows.

Let I_k and C_k denote the areas of regular k-gons inscribed in and circumscribed around a given circle. Then I_2n is the geometric mean of I_n and C_k, and C_2n is the harmonic mean of I_2n and C_n; that is,

We can start with n = 4. Obviously the square circumscribed around the unit circle has vertices at (±1, ±1) and area 4. A little thought finds that the inscribed square has vertices at (±√2/2, ±√2/2) and area 2.

The following Python script finds π to six decimal places.

n = 4
I, C = 2, 4 # inscribed area, circumscribed area
delta = 2

while delta > 1e-6:
    I = (I*C)**0.5
    C = 2*C*I / (C + I)
    delta = C - I
    n *= 2
    print(n, I, C, delta)

The script stops when n = 8192, meaning that the difference between the areas of an inscribed and circumscribed 8192-gon is less than 10⁻⁶. The final output of the script is.

8192 3.1415923 3.1415928 4.620295e-07

If we average the upper and lower bound we get one more decimal place of accuracy.

Gregory’s algorithm isn’t fast, though it’s much faster than carrying out Archimedes’ approach by computing each area from scratch.

Gregory gives an interesting recursion. It has an asymmetry that surprised me: you update I, then C. You don’t update them simultaneously as you do when you’re computing the arithmetic-geometric mean.

[1] Tom Edgar and David Richeson. A Visual Proof of Gregory’s Theorem. Mathematics Magazine, December 2019, Vol. 92, No. 5 (December 2019), pp. 384–386

The post Accelerating Archimedes first appeared on John D. Cook.

How much will a cable sag? A simple approximation

John — Sat, 09 Mar 2024 23:15:14 +0000

Suppose you have a cable of length 2s suspended from two poles of equal height a distance 2x apart. Assuming the cable hangs in the shape of a catenary, how much does it sag in the middle?

If the cable were pulled perfectly taut, we would have s = x and there would be no sag. But in general, s > x and the cable is somewhat lower in the middle than on the two ends. The sag is the difference between the height at the end points and the height in the middle.

The equation of a catenary is

y = a cosh(t/a)

and the length of the catenary between t = −x and t = x is

2s = 2a sinh(x/a).

You could solve this equation for a and the sag will be

g = a cosh(x/a) − a.

Solving for a given s is not an insurmountable problem, but there is a simple approximation for g that has an error of less than 1%. This approximation, given in [1], is

g² = (s − x)(s + x/2).

To visualize the error, I set x = 1 and let a vary to produce the plot below.

The relative error is always less than 1/117. It peaks somewhere around x = 1/3 and decreases from there, the error getting smaller as a increases (i.e. the cable is pulled tighter).

[1] J. S. Frame. An approximation for the dip of a catenary. Pi Mu Epsilon Journal, Vol. 1, No. 2 (April 1950), pp. 44–46

The post How much will a cable sag? A simple approximation first appeared on John D. Cook.

Unique letter patterns in words

John — Sat, 09 Mar 2024 16:01:57 +0000

The word Mississippi has a unique pattern of letters. If you were solving a cryptogram puzzle and saw ZVFFVFFVCCV you might guess that the word is Mississippi.

Is the pattern of letters in Mississippi literally unique or just uncommon? What is the shortest word with a unique letter pattern? The longest word?

We can answer these questions by looking at normalized cryptograms, a sort of word signature. These are formed by replacing the first letter in a word with ‘A’, the next unique letter with ‘B’, etc. The normalized cryptogram of Mississippi is ABCCBCCBDDB.

The set of English words is fuzzy, but for my purposes I will take “words” to mean the entries in the dictionary file american-english on my Linux box, removing words that contain apostrophes or triple letters. I computed the cryptogram of each word, then looked for those that only appear once.

Relative to the list of words I used, yes, Mississippi is unique.

The shortest word with a unique cryptogram is eerie. [1]

The longest word with a unique cryptogram is ambidextrously. Every letter in this 14-letter word appears only once.

[1] Update: eerie is a five-letter example, but there are more. Jack Kennedy pointed out amass, llama, and mamma in the comments. I noticed eerie because its cryptogram comes first in alphabetical order.

The post Unique letter patterns in words first appeared on John D. Cook.

How to Organize Technical Research?

Wayne Joubert — Sat, 09 Mar 2024 13:02:48 +0000

64 million scientific papers have been published since 1996 [1].

Assuming you can actually find the information you want in the first place—how can you organize your findings to be able to recall and use them later?

It’s not a trifling question. Discoveries often come from uniting different obscure pieces of information in a new way, possibly from very disparate sources.

Many software tools are used today for notetaking and organizing information, including simple text files and folders, Evernote, GitHub, wikis, Miro, mymind, Synthical and Notion—to name a diverse few.

AI tools can help, though they can’t always recall correctly and get it right, and their ability to find connections between ideas is elementary. But they are getting better [2,3].

One perspective was presented by Jared O’Neal of Argonne National Laboratory, from the standpoint of laboratory notebooks used by teams of experimental scientists [4]. His experience was that as problems become more complex and larger, researchers must invent new tools and processes to cope with the complexity—thus “reinventing the lab notebook.”

While acknowledging the value of paper notebooks, he found electronic methods essential because of distributed teammates. In his view many streams of notes are probably necessary, using tools such as GitLab and Jupyter notebooks. Crucial is the actual discipline and methodology of notetaking, for example a hierarchical organization of notes (separating high-level overview and low-level details) that are carefully written to be understandable to others.

A totally different case is the research methodology of 19th century scientist Michael Faraday. He is not to be taken lightly, being called by some “the best experimentalist in the history of science” (and so, perhaps, even compared to today) [5].

A fascinating paper [6] documents Faraday’s development of “a highly structured set of retrieval strategies as dynamic aids during his scientific research.” He recorded a staggering 30,000 experiments over his lifetime. He used 12 different kinds of record-keeping media, including lab notebooks proper, idea books, loose slips, retrieval sheets and work sheets. Often he would combine ideas from different slips of paper to organize his discoveries. Notably, his process to some degree varied over his lifetime.

Certain motifs emerge from these examples: the value of well-organized notes as memory aids; the need to thoughtfully innovate one’s notetaking methods to find what works best; the freedom to use multiple media, not restricted to a single notetaking tool or format.

Do you have a favorite method for organizing your research? If so, please share in the comments below.

References

[1] How Many Journal Articles Have Been Published? https://publishingstate.com/how-many-journal-articles-have-been-published/2023/

[2] “Multimodal prompting with a 44-minute movie | Gemini 1.5 Pro Demo,” https://www.youtube.com/watch?v=wa0MT8OwHuk

[3] Geoffrey Hinton, “CBMM10 Panel: Research on Intelligence in the Age of AI,” https://www.youtube.com/watch?v=Gg-w_n9NJIE&t=4706s

[4] Jared O’Neal, “Lab Notebooks For Computational Mathematics, Sciences, Engineering: One Ex-experimentalist’s Perspective,” Dec. 14, 2022, https://www.exascaleproject.org/event/labnotebooks/

[5] “Michael Faraday,” https://dlab.epfl.ch/wikispeedia/wpcd/wp/m/Michael_Faraday.htm

[6] Tweney, R.D. and Ayala, C.D., 2015. Memory and the construction of scientific meaning: Michael Faraday’s use of notebooks and records. Memory Studies, 8(4), pp.422-439. https://www.researchgate.net/profile/Ryan-Tweney/publication/279216243_Memory_and_the_construction_of_scientific_meaning_Michael_Faraday’s_use_of_notebooks_and_records/links/5783aac708ae3f355b4a1ca5/Memory-and-the-construction-of-scientific-meaning-Michael-Faradays-use-of-notebooks-and-records.pdf

The post How to Organize Technical Research? first appeared on John D. Cook.

John D. Cook

Closed-form solutions to nonlinear PDEs

Related posts

Choosing a Computer Language for a Project

On greedy algorithms and rodeo clowns

Crossing the continent

Sorting algorithms

Training neural nets

Back to the rodeo

Finding strings in binary files

Extract text from a PDF

Length of a general Archimedean spiral

Arc length

Asymptotics

How big will a carpet be when you roll or unroll it?

Flexible carpet: solid cylinder

Stiff carpet: hollow cylinder

Approximating a spiral by rings

Related posts

Hypergeometric function of a large negative argument

Linear transformation formulas

Related posts

More is less

Related posts

Precise answers to useless questions

Fun and profit

Private sector

Notes

Pairs in poker

Related posts

Solar system means

Related posts

Earth : Jupiter :: Jupiter : Sun

Gravity on Jupiter

Related posts

Are guidance documents laws?

More Laguerre images

A Bayesian approach to proving you’re human

Related posts

Antenna length: Another rule of 72

Related posts

Hallucinations of AI Science Models

Neural network accuracy

How many neural network weights are needed?

How much training data is needed?

What methods can train to high accuracy?

Prospects and conclusions

Footnotes

References

Double super factorial

Super factorial

Double factorial

Double superfactorial

Super double factorial

Factorials that equal double super factorials

Laguerre’s root finding method

Related posts

Breach Safe Harbor

Quality of encryption

Maybe encryption isn’t enough

Consulting help

MD5 hash collision example

Related posts

Distance from a point to a line

Related posts

Experiences with Thread Programming in Microsoft Windows

References

Accelerating Archimedes

Related posts

How much will a cable sag? A simple approximation

Related posts

Unique letter patterns in words

How to Organize Technical Research?

References