The search for the perfect prompt


Anyone with more than casual experience with ChatGPT knows that prompt engineering is a thing. Minor or even trivial changes in a chatbot prompt can have significant effects, sometimes even dramatic ones, on the output [1]. For simple requests it may not make much difference, but for detailed requests it could matter a lot.

Industry leaders said they thought this would be a temporary limitation. But we are now a year and a half into the GPT-4 era, and it’s still a problem. And since the number of possible prompts has scaling that is exponential in the prompt length, it can sometimes be hard to find a good prompt given the task.

One proposed solution is to use search procedures to automate the prompt optimization / prompt refinement process. Given a base large language model (LLM) and an input (a prompt specification, commonly with a set of prompt/answer pair samples for training), a search algorithm seeks the best form of a prompt to use to elicit the desired answer from the LLM.

This approach is sometimes touted [2] as a possible solution to the problem. However, it is not without  limitations.

A main one is cost. With this approach, one search for a good prompt can take many, many trial-and-error invocations of the LLM, with cost measured in dollars compared to the fraction of a cent cost of a single token of a single prompt. I know of one report of someone who does LLM prompting with such a tool full time for his job, at cost of about $1,000/month (though, for certain kinds of task, one might alternatively seek a good prompt “template” and reuse that across many near-identical queries, to save costs).

This being said, it would seem that for now (depending on budget) our best option for difficult prompting problems is to use search-based prompt refinement methods. Various new tools have come come out recently (for example, [3-6]). The following is a report on some of my (very preliminary) experiences with a couple of these tools.

PromptAgent

The first is PromptAgent [5]. It’s a research code available on GitHub. The method is based on Monte Carlo tree search (MCTS), which tries out multiple chains of modification of a seed prompt and pursues the most promising. MCTS can be a powerful method, being part of the AlphaGo breakthrough result in 2016.

I ran one of the PromptAgent test problems using GPT-4/GPT-3.5 and interrupted it after it rang up a couple of dollars in charges. Looking at the logs, I was somewhat amazed that it generated long detailed prompts that included instructions to the model for what to pay close attention to, what to look out for, and what mistakes to avoid—presumably based on inspecting previous trial prompts generated by the code.

Unfortunately, PromptAgent is a research code and not fully productized, so it would take some work to adapt to a specific user problem.

DSPy

DSPy [6] on the other hand is a finished product available for general users. DSPy is getting some attention lately not only as a prompt optimizer but also more generally as a tool for orchestrating multiple LLMs as agents. There is not much by way of simple examples for how to use the code. The web site does have an AI chatbot that can generate sample code, but the code it generated for me required significant work to get it to behave properly.

I ran with the MIPRO optimizer which is most well-suited to prompt optimization. My experience with running the code was that it generated many random prompt variations but did not do in-depth prompt modifications like PromptAgent. PromptAgent does one thing, prompt refinement, and must do it well, unlike DSPy which has multiple uses. DSPy would be well-served to have implemented more powerful prompt refinement algorithms.

Conclusion

I would wholeheartedly agree that it doesn’t seem right for an LLM would be so dependent on the wording of a prompt. Hopefully, future LLMs, with training on more data and other improvements, will do a better job without need for such lengthly trial-and-error processes.

References

[1]  “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting,” https://openreview.net/forum?id=RIu5lyNXjT

[2] “AI Prompt Engineering Is Dead” (https://spectrum.ieee.org/prompt-engineering-is-dead, March 6, 2024

[3]  “Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing,” https://openreview.net/forum?id=OXv0zQ1umU

[4] “Large Language Models as Optimizers,” https://openreview.net/forum?id=Bb4VGOWELI

[5] “PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization,” https://openreview.net/forum?id=22pyNMuIoa

[6] “DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines,” https://openreview.net/forum?id=sY5N0zY5Od

 

Hallucinations of AI Science Models

AlphaFold 2, FourCastNet and CorrDiff are exciting. AI-driven autonomous labs are going to be a big deal [1]. Science codes now use AI and machine learning to make scientific discoveries on the world’s most powerful computers [2].

It’s common practice for scientists to ask questions about the validity, reliability and accuracy of the mathematical and computational methods they use. And many have voiced concerns about the lack of explainability and potential pitfalls of AI models, in particular deep neural networks (DNNs) [3].

The impact of this uncertainty varies highly according to project. Science projects that are able to easily check AI-generated results against ground truth may not be that concerned. High-stakes projects like design of a multimillion dollar spacecraft with high project risks may ask about AI model accuracy with more urgency.

Neural network accuracy

Understanding of the properties of DNNs is still in its infancy, with many as-yet unanswered questions. However, in the last few years some significant results have started to come forth.

A fruitful approach to analyzing DNNs is to see them as function approximators (which, of course, they are). One can study how accurately DNNs approximate a function representing some physical phenomenon in a domain (for example, fluid density or temperature).

The approximation error can be measured in various ways. A particularly strong measure is “sup-norm” or “max-norm” error, which requires that the DNN approximation be accurate at every point of the target function’s domain (“uniform approximation”). Some science problems may have a weaker requirement than this, such as low RMS or 2-norm error. However, it’s not unreasonable to ask about max-norm approximation behaviors of numerical methods [4,5].

An illuminating paper by Ben Adcock and Nick Dexter looks at this problem [6]. They show that standard DNN methods applied even to a simple 1-dimensional problem can result in “glitches”: the DNN as a whole matches the function well but at some points totally misapproximates the target function. For a picture that shows this, see [7].

Other mathematical papers have subsequently shed light on these behaviors. I’ll try to summarize the findings below, though the actual findings are very nuanced, and many details are left out here. The reader is advised to refer to the respective papers for exact details.

The findings address three questions: 1) how many DNN parameters are required to approximate a function well? 2) how much data is required to train to a desired accuracy? and 3) what algorithms are capable of training to the desired accuracy?

How many neural network weights are needed?

How large does the neural network need to be for accurate uniform approximation of functions? If tight max-norm approximation requires an excessively large number of weights, then use of DNNs is not computationally practical.

Some answers to this question have been found—in particular, a result 1 is given in [8, Theorem 4.3; cf. 9, 10]. This result shows that the number of neural network weights required to approximate an arbitrary function to high max-norm accuracy grows exponentially in the dimension of the input to the function.

This dependency on dimension is no small limitation, insofar as this is not the dimension of physical space (e.g., 3-D) but the dimension of the input vector (such as the number of gridcells), which for practical problems can be in the tens [11] or even millions or more.

Sadly, this rules out the practical use of DNN for some purposes. Nonetheless, for many practical applications of deep learning, the approximation behaviors are not nearly so pessimistic as this would indicate (cp. [12]). For example, results are more optimistic:

  • if the target function has a strong smoothness property;
  • if the function is not arbitrary but is a composition of simpler functions;
  • if the training and test data are restricted to a (possibly unknown) lower dimensional manifold in the high dimensional space (this is certainly the case for common image and language modeling tasks);
  • if the average case behavior for the desired problem domain is much better than the worst case behavior addressed in the theorem;
  • The theorem assumes multilayer perceptron and ReLU activation; other DNN architectures may perform better (though the analysis is based on multidimensional Taylor’s theorem, which one might conjecture applies also to other architectures).
  • For many practical applications, very high accuracy is not a requirement.
  • For some applications, only low 2-norm error is sufficient, (not low max-norm).
  • For the special case of physics informed neural networks (PINNs), stronger results hold.

Thus, not all hope is lost from the standpoint of theory. However, certain problems for which high accuracy is required may not be suitable for DNN approximation.

How much training data is needed?

Assuming your space of neural network candidates is expressive enough to closely represent the target function—how much training data is required to actually find a good approximation?

A result 2 is given in [13, Theorem 2.2] showing that the number of training samples required to train to high max-norm accuracy grows, again, exponentially in the dimension of the input to the function.

The authors concede however that “if additional problem information about [the target functions] can be incorporated into the learning problem it may be possible to overcome the barriers shown in this work.” One suspects that some of the caveats given above might also be relevant here. Additionally, if one considers 2-norm error instead of max-norm error, the data requirement grows polynomially rather than exponentially, making the training data requirement much more tractable. Nonetheless, for some problems the amount of data required is so large that attempts to “pin down” the DNN to sufficient accuracy become intractable.

What methods can train to high accuracy?

The amount of training data may be sufficient to specify a suitable neural network. But, will standard methods for finding the weights of such a DNN be effective for solving this difficult nonconvex optimization problem?

A recent paper [14] from Max Tegmark’s group empirically studies DNN training to high accuracy. They find that as the input dimension grows, training to very high accuracy with standard stochastic gradient descent methods becomes difficult or impossible.

They also find second order methods perform much better, though these are more computationally expensive and have some difficulty also when the dimension is higher. Notably, second order methods have been used effectively for DNN training for some science applications [15]. Also, various alternative training approaches have been tried to attempt to stabilize training; see, e.g., [16].

Prospects and conclusions

Application of AI methods to scientific discovery continues to deliver amazing results, in spite of lagging theory. Ilya Sutskever has commented, “Progress in AI is a game of faith. The more faith you have, the more progress you can make” [17].

Theory of deep learning methods is in its infancy. The current findings show some cases for which use of DNN methods may not be fruitful, Continued discoveries in deep learning theory can help better guide how to use the methods effectively and inform where new algorithmic advances are needed.

Footnotes

1 Suppose the function to be approximated takes d inputs and has the smoothness property that all nth partial derivatives are continuous (i.e., is in Cn(Ω) for compact Ω). Also suppose a multilayer perceptron with ReLU activation functions must be able to approximate any such function to max-norm no worse than ε. Then the number of weights required is at least a fixed constant times (1/ε)d/(2n).

2 Let F be the space of all functions that can be approximated exactly by a broad class of ReLU neural networks. Suppose there is a training method that can recover all these functions up to max-norm accuracy bounded by ε. Then the number of training samples required is at least a fixed constant times (1/ε)d.

References

[1] “Integrated Research Infrastructure Architecture Blueprint Activity (Final Report 2023),” https://www.osti.gov/biblio/1984466.

[2] Joubert, Wayne, Bronson Messer, Philip C. Roth, Antigoni Georgiadou, Justin Lietz, Markus Eisenbach, and Junqi Yin. “Learning to Scale the Summit: AI for Science on a Leadership Supercomputer.” In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1246-1255. IEEE, 2022, https://www.osti.gov/biblio/2076211.

[3] “Reproducibility Workshop: The Reproducibility Crisis in ML‑based Science,” Princeton University, July 28, 2022, https://sites.google.com/princeton.edu/rep-workshop.

[4] Wahlbin, L. B. (1978). Maximum norm error estimates in the finite element method with isoparametric quadratic elements and numerical integration. RAIRO. Analyse numérique, 12(2), 173-202, https://www.esaim-m2an.org/articles/m2an/pdf/1978/02/m2an1978120201731.pdf

[5] Kashiwabara, T., & Kemmochi, T. (2018). Maximum norm error estimates for the finite element approximation of parabolic problems on smooth domains. https://arxiv.org/abs/1805.01336.

[6] Adcock, Ben, and Nick Dexter. “The gap between theory and practice in function approximation with deep neural networks.” SIAM Journal on Mathematics of Data Science 3, no. 2 (2021): 624-655, https://epubs.siam.org/doi/10.1137/20M131309X.

[7] “Figure 5 from The gap between theory and practice in function approximation with deep neural networks | Semantic Scholar,” https://www.semanticscholar.org/paper/The-gap-between-theory-and-practice-in-function-Adcock-Dexter/156bbfc996985f6c65a51bc2f9522da2a1de1f5f/figure/4

[8] Gühring, I., Raslan, M., & Kutyniok, G. (2022). Expressivity of Deep Neural Networks. In P. Grohs & G. Kutyniok (Eds.), Mathematical Aspects of Deep Learning (pp. 149-199). Cambridge: Cambridge University Press. doi:10.1017/9781009025096.004, https://arxiv.org/abs/2007.04759.

[9] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017, https://arxiv.org/abs/1610.01145

[10] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep relu neural networks in Ws,p norms. Anal. Appl. (Singap.), pages 1–57, 2019, https://arxiv.org/abs/1902.07896

[11] Matt R. Norman, “The MiniWeather Mini App,” https://github.com/mrnorman/miniWeather

[12] Lin, H.W., Tegmark, M. & Rolnick, D. Why Does Deep and Cheap Learning Work So Well?. J Stat Phys 168, 1223–1247 (2017). https://doi.org/10.1007/s10955-017-1836-5

[13] Berner, J., Grohs, P., & Voigtlaender, F. (2022). Training ReLU networks to high uniform accuracy is intractable. ICLR 2023, https://openreview.net/forum?id=nchvKfvNeX0.

[14] Michaud, E. J., Liu, Z., & Tegmark, M. (2023). Precision machine learning. Entropy, 25(1), 175, https://www.mdpi.com/1099-4300/25/1/175.

[15] Markidis, S. (2021). The old and the new: Can physics-informed deep-learning replace traditional linear solvers?. Frontiers in big Data, 4, 669097, https://www.frontiersin.org/articles/10.3389/fdata.2021.669097/full

[16] Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19,  https://papers.nips.cc/paper_files/paper/2006/hash/5da713a690c067105aeb2fae32403405-Abstract.html

[17] “Chat with OpenAI CEO and Co-founder Sam Altman, and Chief Scientist Ilya Sutskever,” https://www.youtube.com/watch?v=mC-0XqTAeMQ&t=250s

Is Low Precision Arithmetic Safe?

The popularity of low precision arithmetic for computing has exploded since the 2017 release of the Nvidia Volta GPU. The half precision tensor cores of Volta offered a massive 16X performance gain over double precision for key operations. The “race to the bottom” for lower precision computations continues: some have even solved significant problems using 1-bit precision arithmetic hardware ([1], [2]). And hardware performance is getting even better: the Nvidia H100 tensor core-enabled FP16 is a full 58X faster than standard FP64, and 1-bit precision is yet another 16X faster than this, for total speedup of over 900X for algorithms that can use it [3].

This eye-popping speedup certainly draws attention. However, in scientific computing, low precision arithmetic has typically been seen as unsafe for modeling and simulation codes. Indeed, lower precision can sometimes be used to advantage [4], commonly in a “mixed precision” setting in which only parts of the calculation are done in low precision. However, in general anything less than double precision is considered inadequate to model complex physical phenomena with fidelity (see, e.g., [5]).

In response, developers have created tools to measure the safety of reduced precision arithmetic in application codes [6]. Some tools can even identify which variables or arrays can be safely demoted to lower precision without loss of accuracy in the final result. However, use of these tools in a blind fashion, not backed by some kind of reasoning process, can be hazardous.

An example will illustrate this. The conjugate gradient method for linear system solving and optimization [7] and the closely related Lanczos method for eigenvalue problem solving [8] showed great promise following their invention in the early 1950s. However, they were considered unsafe due to catastrophic roundoff errors under floating point arithmetic—even more pronounced as floating point precision is reduced. Nonetheless, Chris Paige showed in his pioneering work in the 1970s [9] that the roundoff error, though substantial, did not preclude the usefulness of the methods when properly used. The conjugate gradient method has gone on to become a mainstay in scientific computing.

Notice that no tool could possibly arrive at this finding, without a careful mathematical analysis of the methods. A tool would detect inaccuracy in the calculation but could not certify that these errors could cause no harm to the final result.

Some might propose instead a purely data-driven approach: just try low precision on some test cases, if it works then use low precision in production. This approach is fraught with peril, however: the test cases may not capture all situations that could be encountered in production.

For example, one might test an aerodynamics code only on smooth flow regimes, but production runs may encounter complex flows with steep gradients—that low precision arithmetic cannot correctly model. Academic papers that test low precision methods and tools must rigorously evaluate in challenging real-world scenarios like this.

Sadly, computational science teams frequently don’t have the time to evaluate their codes for potential use of lower precision arithmetic. Tools could certainly help. Also, libraries that encapsulate mixed precision methods can provide benefits to many users. A great success story here is mixed precision dense linear solvers, founded on the solid theoretical work of Nick Highnam and colleagues [10], which has found its way into libraries such as [11].

So the final answer is, “it depends.” Each new case must be looked at carefully, and a determination made based on some combination of analysis and testing.

References

[1] Zhang, Y., Garg, A., Cao, Y., Lew, Ł., Ghorbani, B., Zhang, Z. and Firat, O., 2023. Binarized Neural Machine Translation. arXiv preprint arXiv:2302.04907, https://openreview.net/forum?id=XAyPlfmWpu

[2] Lagergren, J., Cashman, M., Vergara, V.G.M., Eller, P.R., Gazolla, J.G.F.M., Chhetri, H.B., Streich, J., Climer, S., Thornton, P., Joubert, W. and Jacobson, D., 2023. Climatic clustering and longitudinal analysis with impacts on food, bioenergy, and pandemics. Phytobiomes Journal, 7(1), pp.65-77, https://apsjournals.apsnet.org/doi/10.1094/PBIOMES-02-22-0007-R.

[3] “NVIDIA H100 Tensor Core GPU Datasheet,” https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.

[4] G. Alvarez et al., “New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors,” SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Austin, TX, USA, 2008, pp. 1-10, doi: 10.1109/SC.2008.5218119.

[5] Spafford, K., Meredith, J., Vetter, J., Chen, J., Grout, R., Sankaran, R. (2010). Accelerating S3D: A GPGPU Case Study. In: Lin, HX., et al. Euro-Par 2009 – Parallel Processing Workshops. Euro-Par 2009. Lecture Notes in Computer Science, vol 6043. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14122-5_16.

[6] “Mixed precision analysis tools,” https://scholar.google.com/scholar?q=mixed+precision+analysis+tools

[7] Hestenes, M.R. and Stiefel, E., 1952. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards49(6), pp.409-436, https://nvlpubs.nist.gov/nistpubs/jres/049/jresv49n6p409_a1b.pdf.

[8] Cornelius Lanczos, An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, Journal of Research of the National Bureau of Standards Vol. 45, No. 4, October 1950, https://nvlpubs.nist.gov/nistpubs/jres/045/4/V45.N04.A01.pdf.

[9] Paige, Christopher C.. “The computation of eigenvalues and eigenvectors of very large sparse matrices.” (1971), https://www.cs.mcgill.ca/~chris/pubClassic/PaigeThesis.pdf.

[10] Higham, N.J., Pranesh, S. and Zounon, M., 2019. Squeezing a matrix into half precision, with an application to solving linear systems. SIAM Journal on Scientific Computing41(4), pp.A2536-A2551, https://epubs.siam.org/doi/abs/10.1137/18M1229511.

[11] Lu, Hao; Matheson, Michael; Wang, Feiyi; Joubert, Wayne; Ellis, Austin; Oles, Vladyslav. “OpenMxP-Opensource Mixed Precision Computing,” https://www.osti.gov/biblio/1961398.