Precise answers to useless questions

I recently ran across a tweet from Allen Downey saying

So much of 20th century statistics was just a waste of time, computing precise answers to useless questions.

He’s right. I taught mathematical statistics at GSBS [1, 2] several times, and each time I taught it I became more aware of how pointless some of the material was.

I do believe mathematical statistics is useful, even some parts whose usefulness isn’t immediately obvious, but there were other parts of the curriculum I couldn’t justify spending time on [3].

Fun and profit

I’ll say this in partial defense of computing precise answers to useless questions: it can be fun and good for your career.

Mathematics problems coming out of statistics can be more interesting, and even more useful, than the statistical applications that motivate them. Several times in my former career a statistical problem of dubious utility led to an interesting mathematical puzzle.

Solving practical research problems in statistics is hard, and can be hard to publish. If research addresses a practical problem that a reviewer isn’t aware of, a journal may reject it. The solution to a problem in mathematical statistics, regardless of its utility, is easier to publish.

Private sector

Outside of academia there is less demand for precise answers to useless questions. A consultant can be sure that a client finds a specific project useful because they’re willing to directly spend money on it. Maybe the client is mistaken, but they’re betting that they’re not.

Academics get grants for various lines of research, but this isn’t the same kind of feedback because the people who approve grants are not spending their own money. Imagine a grant reviewer saying “I think this research is so important, I’m not only going to approve this proposal, I’m going to write the applicant a $5,000 personal check.”

Consulting clients may be giving away someone else’s money too, but they have a closer connection to the source of the funds than a bureaucrat has when dispensing tax money.

Notes

[1] When I taught there, GSBS was The University of Texas Graduate School of Biomedical Sciences. I visited their website this morning, and apparently GSBS is now part of, or at least co-branded with, MD Anderson Cancer Center.

There was a connection between GSBS and MDACC at the time. Some of the GSBS instructors, like myself, were MDACC employees who volunteered to teach a class.

[2] Incidentally, there was a connection between GSBS and Allen Downey: one of my students was his former student, and he told me what a good teacher Allen was.

[3] I talk about utility a lot in this post, but I’m not a utilitarian. There are good reasons to learn things that are not obviously useful. But the appeal of statistics is precisely its utility, and so statistics that isn’t useful is particularly distasteful.

Pure math is beautiful (and occasionally useful) and applied math is useful (and occasionally beautiful). But there’s no reason to study fake applied math that is neither beautiful nor useful.

Pairs in poker

An article by Y. L. Cheung [1] gives several reasons why poker is usually played with five cards. Here I’ll look at just one of them: pairs don’t behave the way you might expect when you have more than five cards.

In five-card poker, the more pairs the better. Better here means less likely. One pair is better than no pair, and two pairs is better than one pair. But in six-card or seven-card poker, a hand with no pair is less likely than a hand with one pair.

For a five-card hand, the probabilities of 0, 1, or 2 pair are 0.5012, 0.4226, and 0.0475 respectively.

For a six-card hand, the same probabilities are 0.3431, 0.4855, and 0.1214.

For a seven-card hand, the probabilities are 0.2091, 0.4728, and 0.2216.
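These figures are easy to check by counting. Here’s a short Python sketch that classifies an n-card hand by the number of ranks appearing exactly twice (with no rank appearing three or more times). It reproduces the one-pair and two-pair figures above; the no-pair figures come out slightly higher (0.5071, 0.3452, 0.2102), presumably because the published figures exclude straights and flushes from the no-pair category.

```python
from math import comb

def pair_probs(n):
    """Probability that an n-card hand has exactly k ranks appearing
    exactly twice, and no rank appearing three or more times.
    Hands are classified by rank counts only; straights and flushes
    are not split out."""
    total = comb(52, n)
    probs = {}
    for k in range(n // 2 + 1):
        singles = n - 2 * k                  # ranks that appear exactly once
        if singles > 13 - k:
            continue
        count = (comb(13, k) * comb(4, 2) ** k            # paired ranks and their suits
                 * comb(13 - k, singles) * 4 ** singles)  # unpaired ranks and their suits
        probs[k] = count / total
    return probs

for n in (5, 6, 7):
    p = pair_probs(n)
    print(n, [round(p[k], 4) for k in (0, 1, 2)])
```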


[1] Y. L. Cheung. Why Poker Is Played with Five Cards. The Mathematical Gazette, Vol. 73, No. 466 (Dec. 1989), pp. 313–315.

Solar system means

Yesterday I stumbled on the fact that the size of Jupiter is roughly the geometric mean between the sizes of Earth and the Sun. That’s not too surprising: in some sense (i.e. on a logarithmic scale) Jupiter is the middle sized object in our solar system.

What I find more surprising is that a systematic search turns up mean relationships that are far more accurate. The radius of Jupiter is within 5% of the geometric mean of the radii of the Earth and Sun, but most of the mean relations below have an error of less than 1%.

\begin{eqnarray*}
R_\Mercury &=& \mbox{GM}\left(R_\Moon, R_\Mars\right) \\
R_\Mars &=& \mbox{HM}\left(R_\Moon, R_\Jupiter\right) \\
R_\Uranus &=& \mbox{AGM}\left(R_\Earth, R_\Saturn\right)
\end{eqnarray*}

The radius of Mercury equals the geometric mean of the radii of the Moon and Mars, within 0.052%.

The radius of Mars equals the harmonic mean of the radii of the Moon and Jupiter, within 0.08%.

The radius of Uranus equals the arithmetic-geometric mean of the radii of Earth and Saturn, within 0.0018%.

See the links below for more on AM, GM, and AGM.
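If you want to check relations like these yourself, here’s a minimal Python sketch using rounded radius values from standard references. The exact error percentages you get will depend on which published radius values you use.

```python
from math import sqrt, isclose

# Approximate mean radii in km (rounded values from standard references)
R = dict(Moon=1737.4, Mercury=2439.7, Mars=3389.5, Earth=6371.0,
         Uranus=25362.0, Saturn=58232.0, Jupiter=69911.0, Sun=695700.0)

def GM(a, b):   # geometric mean
    return sqrt(a * b)

def HM(a, b):   # harmonic mean
    return 2 * a * b / (a + b)

def AGM(a, b):  # arithmetic-geometric mean
    while not isclose(a, b, rel_tol=1e-12):
        a, b = (a + b) / 2, sqrt(a * b)
    return a

def check(name, target, approx):
    err = abs(approx - target) / target
    print(f"{name}: target {target:.1f} km, mean {approx:.1f} km, error {100*err:.3f}%")

check("Mercury ~ GM(Moon, Mars)",     R["Mercury"], GM(R["Moon"], R["Mars"]))
check("Mars ~ HM(Moon, Jupiter)",     R["Mars"],    HM(R["Moon"], R["Jupiter"]))
check("Uranus ~ AGM(Earth, Saturn)",  R["Uranus"],  AGM(R["Earth"], R["Saturn"]))
```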

Now let’s look at masses.

\begin{eqnarray*}
M_\Earth &=& \mbox{GM}\left(M_\Mercury, M_\Neptune\right) \\
M_\Pluto &=& \mbox{HM}\left(M_\Moon, M_\Mars\right) \\
M_\Uranus &=& \mbox{AGM}\left(M_\Moon, M_\Saturn\right)
\end{eqnarray*}

The mass of Earth is the geometric mean of the masses of Mercury and Neptune, within 2.75%. This is the least accurate approximation in this post.

The mass of Pluto is the harmonic mean of the masses of the Moon and Mars, within 0.7%.

The mass of Uranus is the arithmetic-geometric mean of the masses of the Moon and Saturn, within 0.54%.


Earth : Jupiter :: Jupiter : Sun

The size of Jupiter is approximately the geometric mean of the sizes of Sun and Earth.

In terms of radii,

\frac{R_\Sun}{R_\Jupiter} \approx \frac{R_\Jupiter}{R_\Earth}

The ratio on the left equals 9.95 and the ratio on the right equals 10.98.

The subscripts are the astronomical symbols for the Sun (☉, U+2609), Jupiter (♃, U+2643), and Earth (🜨, U+1F728). I produced them in LaTeX using the mathabx package and the commands \Sun, \Jupiter, and \Earth.

The mathabx symbol for Jupiter is a little unusual. It looks italicized, but that’s not because the symbol is being used in math mode. Notice that the vertical bar in the symbol for Earth is upright, i.e. not italicized.
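For reference, here’s a minimal LaTeX sketch, assuming the mathabx package is available, that typesets the ratio above with these symbols.

```latex
\documentclass{article}
\usepackage{mathabx} % provides astronomical symbols: \Sun, \Jupiter, \Earth, ...
\begin{document}
\[
  \frac{R_\Sun}{R_\Jupiter} \approx \frac{R_\Jupiter}{R_\Earth}
\]
\end{document}
```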

 

Gravity on Jupiter

NASA image of Jupiter

I was listening to the latest episode of the Space Rocket History podcast. The show includes some audio from a documentary on Pioneer 11 that mentioned that a man would weigh 500 pounds on Jupiter.

My immediate thought was “Is that all?! Is this ‘man’ a 100 pound boy?”

The documentary was correct and my intuition was wrong. The implied weight of the man on Earth is about 190 pounds.

Jupiter has more than 300 times the mass of the earth. Why is its surface gravity only 2.6 times that of the earth?

Although Jupiter is very massive, it is also very large. Gravitational attraction is proportional to mass, but inversely proportional to the square of distance.

A satellite in orbit 100,000 km from the center of Jupiter would feel 300 times as much gravity as one in orbit the same distance from the center of Earth. But the surface of Jupiter is further from its center of mass than the surface of Earth is from its center of mass.

The mass of Jupiter is 318 times that of Earth, and its mean radius is 11 times that of Earth. So the ratio of gravity on the surface of Jupiter to gravity on the Earth’s surface is

318 / 11² = 2.63

Now suppose a planet had the same density as Earth but a radius of r Earth radii. Then its mass would be r³ times greater, but its surface gravity would only be r times greater since gravity follows an inverse square law. So if Jupiter were made of the same stuff as Earth, its surface gravity would be 11 times greater. But Jupiter is a gas giant, so its surface gravity is only 2.6 times greater.
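Here’s a quick numerical check of the ratio, using rough published values for the masses and radii (so the last digit may differ slightly from the 2.63 above).

```python
G = 6.674e-11                        # gravitational constant, m^3 kg^-1 s^-2

# Approximate masses (kg) and mean radii (m)
M_earth, R_earth = 5.972e24, 6.371e6
M_jup,   R_jup   = 1.898e27, 6.9911e7

g_earth = G * M_earth / R_earth**2   # about 9.8 m/s^2
g_jup   = G * M_jup / R_jup**2       # about 26 m/s^2 at the mean radius

print(g_earth, g_jup, g_jup / g_earth)            # ratio about 2.6
print((M_jup / M_earth) / (R_jup / R_earth)**2)   # same ratio, roughly 318/11^2
```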


Are guidance documents laws?

Are guidance documents laws? No, but they can have legal significance.

The people who generate regulatory guidance documents are not legislators. Legislators delegate to agencies to make rules, and agencies delegate to other organizations to make guidelines. For example [1],

Even HHS, which has express cybersecurity rulemaking authority under the Health Insurance Portability and Accountability Act (HIPAA), has put a lot of the details of what it considers adequate cybersecurity into non-binding guidelines.

I’m not a lawyer, so nothing I say should be considered legal advice. However, the authors of [1] are lawyers.

The legal status of guidance documents is contested. According to [2], Executive Order 13892 said that agencies

may not treat noncompliance with a standard of conduct announced solely in a guidance document as itself a violation of applicable statutes or regulations.

Makes sense to me, but EO 13992 revoked EO 13892.

Again according to [3],

Under the common law, it used to be that government advisories, guidelines, and other non-binding statements were non-binding hearsay [in private litigation]. However, in 1975, the Fifth Circuit held that advisory materials … are an exception to the hearsay rule … It’s not clear if this is now the majority rule.

In short, it’s fuzzy.

 

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition, page 245.

[2] Ibid., page 199.

[3] Ibid., page 200.

 

More Laguerre images

A week or two ago I wrote about Laguerre’s root-finding method and made some associated images. This post gives a couple more examples.

Laguerre’s method is very robust in the sense that it is likely to converge to a root, regardless of the starting point. However, it may be difficult to predict which root the method will end up at. To visualize this, we color points according to which root they converge to.

First, let’s look at the polynomial

(x − 2)(x − 4)(x − 24)

which clearly has roots at 2, 4, and 24. We’ll generate random starting points and color them blue, orange, or green depending on whether they converge to 2, 4, or 24. Here’s the result.

To make this easier to see, let’s split it into each color: blue, orange, and green.

Now let’s change our polynomial by moving the root at 4 to 4i.

(x − 2)(x − 4i)(x − 24)

Here’s the combined result.

And here is each color separately.

As we explained last time, the area taken up by the separate colors seems to exceed the total area. That is because the colors are so intermingled that many of the dots in the images cover some territory that belongs to another color, even though the dots are quite small.
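For anyone who wants to experiment, here’s a minimal Python sketch of Laguerre’s method and of classifying random starting points by the root they converge to. It’s only a sketch, not the code used to produce the images above.

```python
import numpy as np

def laguerre(p, x, tol=1e-12, maxiter=100):
    """Laguerre's method for a polynomial p given as a coefficient
    array, highest degree first (the numpy.polyval convention)."""
    n = len(p) - 1                      # degree of the polynomial
    dp, d2p = np.polyder(p), np.polyder(np.polyder(p))
    for _ in range(maxiter):
        px = np.polyval(p, x)
        if abs(px) < tol:
            return x
        G = np.polyval(dp, x) / px
        H = G*G - np.polyval(d2p, x) / px
        root = np.sqrt((n - 1) * (n*H - G*G) + 0j)   # force complex sqrt
        d1, d2 = G + root, G - root
        x = x - n / (d1 if abs(d1) > abs(d2) else d2)  # larger denominator
    return x

# polynomial (x - 2)(x - 4)(x - 24)
p = np.poly([2, 4, 24])
roots = np.array([2, 4, 24])

rng = np.random.default_rng()
starts = rng.uniform(-30, 30, 5) + 1j * rng.uniform(-30, 30, 5)
for z0 in starts:
    z = laguerre(p, z0)
    print(z0, "->", roots[np.argmin(np.abs(roots - z))])
```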

A Bayesian approach to proving you’re human

I set up a GitHub account for a new employee this morning and spent a ridiculous amount of time proving that I’m human.

The captcha was to listen to three audio clips at a time and say which one contains bird sounds. This is a really clever test, because humans can tell the difference between real bird sounds and synthesized bird-like sounds. And we’re generally good at recognizing bird sounds even against a background of competing sounds. But some of these were ambiguous, and I had real birds chirping outside my window while I was doing the captcha.

You have to do 20 of these tests, and apparently you have to get all 20 right. I didn’t. So I tried again. On the last test I accidentally clicked the start-over button rather than the submit button. I wasn’t willing to listen to another 20 triples of audio clips, so I switched over to the visual captcha tests.

These kinds of tests could be made less annoying and more secure by using a Bayesian approach.

Suppose someone solves 19 out of 20 puzzles correctly. You require 20 out of 20, so you have them start over. When you do, you’re throwing away information. You require 20 more puzzles, despite the fact that they only missed one. And if a bot had solved say 8 out of 20 puzzles, you’d let it pass if it passes the next 20.

If you wipe your memory after every round of 20 puzzles, and allow unlimited do-overs, then any sufficiently persistent entity will almost certainly pass eventually.

Bayesian statistics reflects common sense. After someone (or something) has correctly solved 19 out of 20 puzzles designed to be hard for machines to solve, your conviction that this entity is human is higher than if they/it had solved 8 out of 20 correctly. You don’t need as much additional evidence in the first case as in the latter to be sufficiently convinced.

Here’s how a Bayesian captcha could work. You start out with some distribution on your probability θ that an entity is human, say a uniform distribution. You present a sequence of puzzles, recalculating your posterior distribution after each puzzle, until the posterior probability that this entity is human crosses some upper threshold, say 0.95, or some lower threshold, say 0.50. If the upper threshold is crossed, you decide the entity is likely human. If the lower threshold is crossed, you decide the entity is likely not human.

If solving 20 out of 20 puzzles correctly crosses your threshold of human detection, then after solving 19 out of 20 correctly your posterior probability of humanity is close to the upper threshold, and only a few more puzzles would be required. And if an entity solved 8 out of 20 puzzles correctly, that may cross your lower threshold. If not, maybe only a few more puzzles would be necessary to reject the entity as non-human.
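Here’s one way to make this concrete: a small Python sketch that treats the entity as either human or bot, assumes a per-puzzle success rate for each, and updates the posterior probability of humanity after every answer. The success rates, prior, and thresholds below are all made-up numbers, not anything prescribed by a real captcha system.

```python
def captcha_posterior(answers, p_human=0.9, p_bot=0.5,
                      prior=0.5, accept=0.95, reject=0.05):
    """Sequentially update P(entity is human) after each puzzle.

    answers: sequence of booleans, True = puzzle solved correctly.
    p_human, p_bot: assumed per-puzzle success rates for a human and
    a bot. All the numerical parameters here are illustrative only.
    Stops as soon as the posterior crosses either threshold.
    """
    p = prior
    for i, correct in enumerate(answers, start=1):
        like_human = p_human if correct else 1 - p_human
        like_bot = p_bot if correct else 1 - p_bot
        p = p * like_human / (p * like_human + (1 - p) * like_bot)
        if p >= accept:
            return "human", i, p
        if p <= reject:
            return "bot", i, p
    return "undecided", len(answers), p

print(captcha_posterior([True] * 20))         # accepted after a handful of puzzles
print(captcha_posterior([True, False] * 10))  # drifts down and is rejected early
```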

When I worked at MD Anderson Cancer Center we applied this approach to adaptive clinical trials. A clinical trial might stop early because of particularly good results or particularly bad results. Clinical trials are expensive, both in terms of human costs and financial costs. Rejecting poor treatments quickly, and sending promising treatments on to the next stage quickly, is both humane and economical.


Antenna length: Another rule of 72

The famous Rule of 72 says that to find out how many years it takes an investment to double in value, divide 72 by the annual percentage rate. I’ll come back to that in a little bit.

This morning I read a really good article, Fifty Things you can do with a Software Defined Radio. The article includes a rule of thumb for how long an antenna needs to be.

My rule of thumb was to divide 72 by the frequency in MHz, and take that as the length of each side of the dipole in meters [1]. That’d make the whole antenna a bit shorter than half of the wavelength.

Ideally an antenna should be as long as half a wavelength of the signal you want to receive. Light travels 3 × 10⁸ meters per second, so one wavelength of a 1 MHz signal is 300 m. A quarter wavelength, the length of one side of a dipole antenna, would be 75 m. Call it 72 m because 72 has lots of small factors, i.e. it’s usually mentally easier to divide things into 72 than 75. Rounding 75 down to 72 results in the antenna being a little shorter than ideal. But antennas are forgiving, especially for receiving.

Update: There’s more to replacing 75 with 72 than simplifying mental arithmetic. See Markus Meier’s comment below.

Just as the Rule of 72 for antennas rounds 75 down to 72, the Rule of 72 for interest rounds 69.3 up to 72, both for ease of mental calculation.

\begin{align*}
\left(1 + \frac{n}{100}\right)^{72/n} &= \exp\left(\frac{72}{n} \log\left(1 + \frac{n}{100}\right)\right) \\
&\approx \exp\left(\frac{72}{n} \, \frac{n}{100} \right) \\
&= \exp(72/100) \\
&= 2.05
\end{align*}

The approximation step comes from the approximation log(1 + x) ≈ x for small x, a first order Taylor approximation.

The last line would be 2 rather than 2.05 if we replaced 72 with 100 log(2) = 69.3. That’s where the factor of 69.3 mentioned above comes from.
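Here’s a quick Python sketch comparing both rules of thumb with the exact values; the frequencies and interest rates are chosen arbitrarily.

```python
from math import log

# Antenna rule: each leg of a dipole is about 72/f meters for frequency f in MHz.
# The exact quarter wavelength is (299.792458 / f) / 4 meters, since the
# wavelength of an f MHz signal is 299.792458/f meters.
for f in (1, 10, 100):
    print(f"{f:4d} MHz: rule {72/f:8.3f} m, quarter wave {299.792458/(4*f):8.3f} m")

# Interest rule: money doubles in about 72/n years at n percent annual interest.
# The exact doubling time is log(2) / log(1 + n/100) years.
for n in (2, 4, 6, 8, 12):
    print(f"{n:3d}%: rule {72/n:6.2f} yr, exact {log(2)/log(1 + n/100):6.2f} yr")
```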


[1] The post actually says centimeters, but the author meant to say meters.

Hallucinations of AI Science Models

AlphaFold 2, FourCastNet and CorrDiff are exciting. AI-driven autonomous labs are going to be a big deal [1]. Science codes now use AI and machine learning to make scientific discoveries on the world’s most powerful computers [2].

It’s common practice for scientists to ask questions about the validity, reliability and accuracy of the mathematical and computational methods they use. And many have voiced concerns about the lack of explainability and potential pitfalls of AI models, in particular deep neural networks (DNNs) [3].

The impact of this uncertainty varies greatly from project to project. Science projects that can easily check AI-generated results against ground truth may not be very concerned. High-stakes projects, like the design of a multimillion-dollar spacecraft, may ask about AI model accuracy with more urgency.

Neural network accuracy

Understanding of the properties of DNNs is still in its infancy, with many as-yet unanswered questions. However, in the last few years some significant results have started to come forth.

A fruitful approach to analyzing DNNs is to see them as function approximators (which, of course, they are). One can study how accurately DNNs approximate a function representing some physical phenomenon in a domain (for example, fluid density or temperature).

The approximation error can be measured in various ways. A particularly strong measure is “sup-norm” or “max-norm” error, which requires that the DNN approximation be accurate at every point of the target function’s domain (“uniform approximation”). Some science problems may have a weaker requirement than this, such as low RMS or 2-norm error. However, it’s not unreasonable to ask about max-norm approximation behaviors of numerical methods [4,5].

An illuminating paper by Ben Adcock and Nick Dexter looks at this problem [6]. They show that standard DNN methods applied even to a simple 1-dimensional problem can result in “glitches”: the DNN as a whole matches the function well but at some points totally misapproximates the target function. For a picture that shows this, see [7].

Other mathematical papers have subsequently shed light on these behaviors. I’ll try to summarize the findings below, though the actual findings are very nuanced, and many details are left out here. The reader is advised to refer to the respective papers for exact details.

The findings address three questions: 1) how many DNN parameters are required to approximate a function well? 2) how much data is required to train to a desired accuracy? and 3) what algorithms are capable of training to the desired accuracy?

How many neural network weights are needed?

How large does the neural network need to be for accurate uniform approximation of functions? If tight max-norm approximation requires an excessively large number of weights, then use of DNNs is not computationally practical.

Some answers to this question have been found—in particular, a result summarized in footnote 1 below is given in [8, Theorem 4.3; cf. 9, 10]. This result shows that the number of neural network weights required to approximate an arbitrary function to high max-norm accuracy grows exponentially in the dimension of the input to the function.

This dependency on dimension is no small limitation, insofar as this is not the dimension of physical space (e.g., 3-D) but the dimension of the input vector (such as the number of gridcells), which for practical problems can be in the tens [11] or even millions or more.

Sadly, this rules out the practical use of DNNs for some purposes. Nonetheless, for many practical applications of deep learning, the approximation behaviors are not nearly so pessimistic as this would indicate (cf. [12]). For example, results are more optimistic:

  • if the target function has a strong smoothness property;
  • if the function is not arbitrary but is a composition of simpler functions;
  • if the training and test data are restricted to a (possibly unknown) lower dimensional manifold in the high dimensional space (this is certainly the case for common image and language modeling tasks);
  • if the average case behavior for the desired problem domain is much better than the worst case behavior addressed in the theorem;
  • if the architecture is not the multilayer perceptron with ReLU activation assumed by the theorem; other DNN architectures may perform better (though the analysis is based on the multidimensional Taylor’s theorem, which one might conjecture also applies to other architectures);
  • if very high accuracy is not a requirement, as is the case for many practical applications;
  • if only low 2-norm error, not low max-norm error, is required;
  • if the problem is the special case of physics-informed neural networks (PINNs), for which stronger results hold.

Thus, not all hope is lost from the standpoint of theory. However, certain problems for which high accuracy is required may not be suitable for DNN approximation.

How much training data is needed?

Assuming your space of neural network candidates is expressive enough to closely represent the target function—how much training data is required to actually find a good approximation?

A result summarized in footnote 2 below is given in [13, Theorem 2.2], showing that the number of training samples required to train to high max-norm accuracy grows, again, exponentially in the dimension of the input to the function.

The authors concede however that “if additional problem information about [the target functions] can be incorporated into the learning problem it may be possible to overcome the barriers shown in this work.” One suspects that some of the caveats given above might also be relevant here. Additionally, if one considers 2-norm error instead of max-norm error, the data requirement grows polynomially rather than exponentially, making the training data requirement much more tractable. Nonetheless, for some problems the amount of data required is so large that attempts to “pin down” the DNN to sufficient accuracy become intractable.

What methods can train to high accuracy?

The amount of training data may be sufficient to specify a suitable neural network. But, will standard methods for finding the weights of such a DNN be effective for solving this difficult nonconvex optimization problem?

A recent paper [14] from Max Tegmark’s group empirically studies DNN training to high accuracy. They find that as the input dimension grows, training to very high accuracy with standard stochastic gradient descent methods becomes difficult or impossible.

They also find that second order methods perform much better, though these are more computationally expensive and also have some difficulty when the dimension is high. Notably, second order methods have been used effectively for DNN training for some science applications [15]. Also, various alternative training approaches have been tried to attempt to stabilize training; see, e.g., [16].

Prospects and conclusions

Application of AI methods to scientific discovery continues to deliver amazing results, in spite of lagging theory. Ilya Sutskever has commented, “Progress in AI is a game of faith. The more faith you have, the more progress you can make” [17].

Theory of deep learning methods is in its infancy. The current findings show some cases for which use of DNN methods may not be fruitful. Continued discoveries in deep learning theory can help better guide how to use the methods effectively and inform where new algorithmic advances are needed.

Footnotes

1 Suppose the function to be approximated takes d inputs and has the smoothness property that all nth partial derivatives are continuous (i.e., is in C^n(Ω) for compact Ω). Also suppose a multilayer perceptron with ReLU activation functions must be able to approximate any such function to max-norm error no worse than ε. Then the number of weights required is at least a fixed constant times (1/ε)^(d/(2n)).

2 Let F be the space of all functions that can be approximated exactly by a broad class of ReLU neural networks. Suppose there is a training method that can recover all these functions up to max-norm accuracy bounded by ε. Then the number of training samples required is at least a fixed constant times (1/ε)^d.
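To get a feel for how fast these lower bounds grow with dimension, here is a toy Python calculation. The constants in the theorems are unspecified, so the numbers only indicate the rate of growth, not actual weight or sample counts.

```python
# Toy illustration of the lower bounds in footnotes 1 and 2.
eps = 1e-3   # target max-norm accuracy
n = 2        # smoothness order in footnote 1

for d in (1, 2, 5, 10, 20):
    weights = (1 / eps) ** (d / (2 * n))   # footnote 1 bound, up to a constant
    samples = (1 / eps) ** d               # footnote 2 bound, up to a constant
    print(f"d = {d:2d}: weights >= const * {weights:.2e}, samples >= const * {samples:.2e}")
```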

References

[1] “Integrated Research Infrastructure Architecture Blueprint Activity (Final Report 2023),” https://www.osti.gov/biblio/1984466.

[2] Joubert, Wayne, Bronson Messer, Philip C. Roth, Antigoni Georgiadou, Justin Lietz, Markus Eisenbach, and Junqi Yin. “Learning to Scale the Summit: AI for Science on a Leadership Supercomputer.” In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1246-1255. IEEE, 2022, https://www.osti.gov/biblio/2076211.

[3] “Reproducibility Workshop: The Reproducibility Crisis in ML‑based Science,” Princeton University, July 28, 2022, https://sites.google.com/princeton.edu/rep-workshop.

[4] Wahlbin, L. B. (1978). Maximum norm error estimates in the finite element method with isoparametric quadratic elements and numerical integration. RAIRO. Analyse numérique, 12(2), 173-202, https://www.esaim-m2an.org/articles/m2an/pdf/1978/02/m2an1978120201731.pdf

[5] Kashiwabara, T., & Kemmochi, T. (2018). Maximum norm error estimates for the finite element approximation of parabolic problems on smooth domains. https://arxiv.org/abs/1805.01336.

[6] Adcock, Ben, and Nick Dexter. “The gap between theory and practice in function approximation with deep neural networks.” SIAM Journal on Mathematics of Data Science 3, no. 2 (2021): 624-655, https://epubs.siam.org/doi/10.1137/20M131309X.

[7] “Figure 5 from The gap between theory and practice in function approximation with deep neural networks | Semantic Scholar,” https://www.semanticscholar.org/paper/The-gap-between-theory-and-practice-in-function-Adcock-Dexter/156bbfc996985f6c65a51bc2f9522da2a1de1f5f/figure/4

[8] Gühring, I., Raslan, M., & Kutyniok, G. (2022). Expressivity of Deep Neural Networks. In P. Grohs & G. Kutyniok (Eds.), Mathematical Aspects of Deep Learning (pp. 149-199). Cambridge: Cambridge University Press. doi:10.1017/9781009025096.004, https://arxiv.org/abs/2007.04759.

[9] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017, https://arxiv.org/abs/1610.01145

[10] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep relu neural networks in Ws,p norms. Anal. Appl. (Singap.), pages 1–57, 2019, https://arxiv.org/abs/1902.07896

[11] Matt R. Norman, “The MiniWeather Mini App,” https://github.com/mrnorman/miniWeather

[12] Lin, H.W., Tegmark, M. & Rolnick, D. Why Does Deep and Cheap Learning Work So Well?. J Stat Phys 168, 1223–1247 (2017). https://doi.org/10.1007/s10955-017-1836-5

[13] Berner, J., Grohs, P., & Voigtlaender, F. (2022). Training ReLU networks to high uniform accuracy is intractable. ICLR 2023, https://openreview.net/forum?id=nchvKfvNeX0.

[14] Michaud, E. J., Liu, Z., & Tegmark, M. (2023). Precision machine learning. Entropy, 25(1), 175, https://www.mdpi.com/1099-4300/25/1/175.

[15] Markidis, S. (2021). The old and the new: Can physics-informed deep-learning replace traditional linear solvers?. Frontiers in big Data, 4, 669097, https://www.frontiersin.org/articles/10.3389/fdata.2021.669097/full

[16] Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19,  https://papers.nips.cc/paper_files/paper/2006/hash/5da713a690c067105aeb2fae32403405-Abstract.html

[17] “Chat with OpenAI CEO and Co-founder Sam Altman, and Chief Scientist Ilya Sutskever,” https://www.youtube.com/watch?v=mC-0XqTAeMQ&t=250s