Spreading out words in space

A common technique for memorizing numbers is to associate numbers with words. The Major mnemonic system does this by associating consonant sounds with each digit. You form words by inserting vowels as you please.

There are many possible encodings of numbers, but sometimes you want to pick a canonical word for each number, what’s commonly called a peg. Choosing pegs for the numbers 1 through 10, or even 1 through 100, is not difficult. Choosing pegs for a larger set of numbers becomes difficult for a couple reasons. First, it’s hard to think of words to fit some three-digit numbers. Second, you want your pegs to be dissimilar in order to avoid confusion.

Say for example you’ve chosen “syrup” for 049 and you need a peg for 350. You could use “molasses,” but that’s conceptually similar to “syrup.” If you use “Miles Davis” for 350 then there’s no confusion [1].

You could quantify how similar words are using cosine similarity between word embeddings.  A vector embedding associates a high-dimensional vector with each word in such a way that the geometry corresponds roughly with meaning. The famous example is that you might have, at least approximately,

queen = king − man + woman.

This gives you a way to define angles between words that ideally corresponds to conceptual similarity. Similar words would have a small angle between their vectors, while dissimilar words would have larger angles.

If you wanted to write a program to discover pegs for you, say using some corpus like ARPABet, you could have it choose alternatives that spread the words out conceptually. It’s debatable how practical this is, but it’s interesting nonetheless.

The angles you get would depend on the embedding you use. Here I’ll use the gensim code I used earlier in this post.

The angle between “syrup” and “molasses” is 69° but the angle between “syrup” and “miles” is 84°. The former is larger than I would have expected, but still significantly smaller than the latter. If you were using cosine similarity to suggest mnemonic pegs, hopefully the results would be directionally useful, choosing alternatives that minimize conceptual overlap.

As I said earlier, it’s debatable how useful this is. Mnemonics are very personal. A musician might be fine with using “trumpet” for 143 and “flugelhorn” for 857 because in his mind they’re completely different instruments, but someone else might think they’re too similar. And you might not want to use “Miles Davis” and “trumpet” as separate pegs, even though software will tell you that “miles” and “trumpet” are nearly orthogonal.

Related posts

[1] Here we’re following the convention that only the first three consonants in a word count. This makes it easier to think of pegs.

Mnemonic images with Grok 3

The Major mnemonic system makes numbers easier to memorize by encoding them as words. Each digit corresponds to one or more consonant sounds, and you can fill in vowels as you wish.

In August 2022 I tried creating a few images using DALL-E 2. The results were disappointing and sometimes disturbing.

To illustrate the use of the Major system, I gave the example of memorizing a list of the US presidents by creating mental images associating each president with the number of their term as president. For example, Franklin Delano Roosevelt was the 32nd POTUS. You can encode 32 as “moon”, so you might imagine FDR looking up at the moon.

At the time, Grover Cleveland was the only US President to serve two non-consecutive terms, being both the 22nd and 24th president. I asked DALL-E to create an image of Grover the Muppet holding an onion (22) and a wiener dog (24). This was too much for DALL-E at the time. The image below was as close as I could get.

Blue dog holding cucumber dog?

When I asked Grok 3 for a similar image it did a much better job. To be fair, it initially put a different dog breed in Grover’s hand, but I asked it to change the dog to a dachshund and it performed admirably.

Grover holding an onion and a wiener dog

Related posts

Rich Sutton’s bitter lesson of AI

This morning I read Alberto Romeo’s article Grok 3: Another Win For The Bitter Lesson. I wasn’t clear on what exactly the bitter lesson was, so I followed the link to Romeo’s previous article GPT-4: The Bitterer Lesson, which lead to Rich Sutton’s original article The Bitter Lesson.

Sutton opens his article by saying

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

Clear enough, except you have to read further to get the “compared to what.” I asked Grok 3 to read the article and tell me exactly what the bitter lesson is. It begins with this explanation:

The “Bitter Lesson” is the principle that, in the field of artificial intelligence, the most effective and lasting progress has historically come from methods that rely on leveraging ever-increasing computational power rather than depending on human ingenuity to craft domain-specific knowledge or shortcuts. …

and concludes with the following pithy summary of its summary:

In essence, the Bitter Lesson is: computation trumps human-crafted specialization in AI, and embracing this, though humbling, is key to future success.

Sutton supports his thesis with examples from chess, go, speech recognition, and computer vision.

In some sense this is the triumph of statistics over logic. All else being equal, collecting enormous amounts of data and doing statistical analysis (i.e. training an AI model) will beat out rule-based systems. But of course all else is often not equal. Logic is often the way to go, even on big problems.

When you do know whats going on and can write down a (potentially large) set of logical requirements, exploiting that knowledge via techniques like SAT solvers and SMT is the way to go. See, for example, Wayne’s article on constraint programming. But if you want to build the Everything Machine, building software that wants to be all things to all people, statistics is the way to go.

Sometimes you have a modest amount of relevant data, and you don’t have 200,000 Nvidia H100 GPUs. In that case you’re much better off with classical statistics than AI. Classical models are more robust and easier to interpret. They are also not subject to hastily written AI regulation.

Putting a face on a faceless account

I’ve been playing around with Grok today, logging into some of my X accounts and trying out the prompt “Draw an image of me based on my posts.” [1] In most cases Grok returned a graphic, but sometimes it would respond with a text description. In the latter case asking for a photorealistic image made it produce a graphic.

Here’s what I get for @AlgebraFact:

The icons for all my accounts are cerulean blue dots with a symbol in the middle. Usually Grok picks up on the color, as above. With @AnalysisFact, it dropped a big blue piece of a circle on the image.

For @UnixToolTip it kept the & from the &> in the icon. Generative AI typically does weird things with text in images, but it picked up “awk” correctly.

Here’s @ProbFact. Grok seems to think it’s a baseball statistics account.

Last but not least, here’s @DataSciFact.

I wrote a popular post about how to put Santa hats on top of symbols in LaTeX, and that post must have had an outsided influence on the image Grok created.

[1] Apparently if you’re logging into account A and ask it to draw B, the image will be heavily influence by A‘s posts, not B‘s. You have to log into B and ask in the first person.

Golden hospital gowns

Here’s something I posted on X a couple days ago:

There’s no direct connection between AI and cryptocurrency, but they have a similar vibe.

They both leave you wondering whether the emperor is sumptuously clothed, naked, or a mix of both.

Maybe he’s wearing a hospital gown with gold threads.

In case you’re unfamiliar with the story, this is an allusion to The Emperor’s New Clothes, one of the most important stories in literature.

I propose golden hospital gown as a metaphor for things that are a fascinating mixture of good and bad, things that have large numbers of haters and fanboys, both with valid points. There’s continual improvement and a lot work to be done sorting out what works well and what does not.

I tried to get Grok to create an image of what I had in mind by a golden hospital gown. The results were not what I wanted, but passable, which is kinda the point of the comment above. It’s amazing that AI can produce anything remotely resembling a desired image starting from a text description. But there is a very strong temptation to settle for mediocre and vaguely creepy images that aren’t what we really want.

Man in a yellow hospital gown with hairy back and legs exposed

Related posts

LLMs and regular expressions

Yesterday I needed to write a regular expression as part of a client report. Later I was curious whether an LLM could have generated an equivalent expression.

When I started writing the prompt, I realized it wasn’t trivial to tell the LLM what I wanted. I needed some way to describe the pattern that the expression should match.

“Hmm, what’s the easiest way to describe a text pattern? I know: use a regular expression! Oh wait, …”

Prompt engineering and results

I described the pattern in words, which was more difficult than writing the regular expression, and the LLM came up with a valid regular expression, and sample code for demonstrating the use of the expression, but the expression wasn’t quite right. After a couple more nudges I managed to get it to produce a correct regex.

I had asked for a Perl regular expression, and the LLM did generate syntactically correct Perl [1], both for the regex and the sample code. When I asked it to convert the regex to POSIX form it did so, and when I asked it to convert the regex to Python it did that as well, replete with valid test code.

I repeated my experiment using three different LLMs and got similar results. In all cases, the hardest part was specifying what I wanted. Sometimes the output was correct given what I asked for but not what I intended, a common experience since the dawn of computers. It was easier to translate a regex from one syntax flavor to another than to generate a correct regex, easier for both me and the computer: it was easier for me to generate a prompt and the LLM did a better job.

Quality assurance for LLMs

Regular expressions and LLMs are complementary. The downside of regular expressions is that you have to specify exactly what you want. The upside is that you can specify exactly what you want. We’ve had several projects lately in which we tested the output of a client’s model using regular expressions and found problems. Sometimes it takes a low-tech tool to find problems with a high-tech tool.

We’ve also tested LLMs using a different LLM. That has been useful because there’s some degree of independence. But we’ve gotten better results using regular expressions since there is a greater degree of independence.

Related posts

[1] Admittedly that’s a low bar. There’s an old joke that Perl was created by banging on a keyboard then hacking on the compiler until the input compiled.

The search for the perfect prompt


Anyone with more than casual experience with ChatGPT knows that prompt engineering is a thing. Minor or even trivial changes in a chatbot prompt can have significant effects, sometimes even dramatic ones, on the output [1]. For simple requests it may not make much difference, but for detailed requests it could matter a lot.

Industry leaders said they thought this would be a temporary limitation. But we are now a year and a half into the GPT-4 era, and it’s still a problem. And since the number of possible prompts has scaling that is exponential in the prompt length, it can sometimes be hard to find a good prompt given the task.

One proposed solution is to use search procedures to automate the prompt optimization / prompt refinement process. Given a base large language model (LLM) and an input (a prompt specification, commonly with a set of prompt/answer pair samples for training), a search algorithm seeks the best form of a prompt to use to elicit the desired answer from the LLM.

This approach is sometimes touted [2] as a possible solution to the problem. However, it is not without  limitations.

A main one is cost. With this approach, one search for a good prompt can take many, many trial-and-error invocations of the LLM, with cost measured in dollars compared to the fraction of a cent cost of a single token of a single prompt. I know of one report of someone who does LLM prompting with such a tool full time for his job, at cost of about $1,000/month (though, for certain kinds of task, one might alternatively seek a good prompt “template” and reuse that across many near-identical queries, to save costs).

This being said, it would seem that for now (depending on budget) our best option for difficult prompting problems is to use search-based prompt refinement methods. Various new tools have come out recently (for example, [3-6]). The following is a report on some of my (very preliminary) experiences with a couple of these tools.

PromptAgent

The first is PromptAgent [5]. It’s a research code available on GitHub. The method is based on Monte Carlo tree search (MCTS), which tries out multiple chains of modification of a seed prompt and pursues the most promising. MCTS can be a powerful method, being part of the AlphaGo breakthrough result in 2016.

I ran one of the PromptAgent test problems using GPT-4/GPT-3.5 and interrupted it after it rang up a couple of dollars in charges. Looking at the logs, I was somewhat amazed that it generated long detailed prompts that included instructions to the model for what to pay close attention to, what to look out for, and what mistakes to avoid—presumably based on inspecting previous trial prompts generated by the code.

Unfortunately, PromptAgent is a research code and not fully productized, so it would take some work to adapt to a specific user problem.

DSPy

DSPy [6] on the other hand is a finished product available for general users. DSPy is getting some attention lately not only as a prompt optimizer but also more generally as a tool for orchestrating multiple LLMs as agents. There is not much by way of simple examples for how to use the code. The website does have an AI chatbot that can generate sample code, but the code it generated for me required significant work to get it to behave properly.

I ran with the MIPRO optimizer which is most well-suited to prompt optimization. My experience with running the code was that it generated many random prompt variations but did not do in-depth prompt modifications like PromptAgent. PromptAgent does one thing, prompt refinement, and must do it well, unlike DSPy which has multiple uses. DSPy would be well-served to have implemented more powerful prompt refinement algorithms.

Conclusion

I would wholeheartedly agree that it doesn’t seem right for an LLM would be so dependent on the wording of a prompt. Hopefully, future LLMs, with training on more data and other improvements, will do a better job without need for such lengthy trial-and-error processes.

References

[1]  “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting,” https://openreview.net/forum?id=RIu5lyNXjT

[2] “AI Prompt Engineering Is Dead” (https://spectrum.ieee.org/prompt-engineering-is-dead, March 6, 2024

[3]  “Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing,” https://openreview.net/forum?id=OXv0zQ1umU

[4] “Large Language Models as Optimizers,” https://openreview.net/forum?id=Bb4VGOWELI

[5] “PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization,” https://openreview.net/forum?id=22pyNMuIoa

[6] “DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines,” https://openreview.net/forum?id=sY5N0zY5Od

 

Hallucinations of AI Science Models

AlphaFold 2, FourCastNet and CorrDiff are exciting. AI-driven autonomous labs are going to be a big deal [1]. Science codes now use AI and machine learning to make scientific discoveries on the world’s most powerful computers [2].

It’s common practice for scientists to ask questions about the validity, reliability and accuracy of the mathematical and computational methods they use. And many have voiced concerns about the lack of explainability and potential pitfalls of AI models, in particular deep neural networks (DNNs) [3].

The impact of this uncertainty varies highly according to project. Science projects that are able to easily check AI-generated results against ground truth may not be that concerned. High-stakes projects like design of a multimillion dollar spacecraft with high project risks may ask about AI model accuracy with more urgency.

Neural network accuracy

Understanding of the properties of DNNs is still in its infancy, with many as-yet unanswered questions. However, in the last few years some significant results have started to come forth.

A fruitful approach to analyzing DNNs is to see them as function approximators (which, of course, they are). One can study how accurately DNNs approximate a function representing some physical phenomenon in a domain (for example, fluid density or temperature).

The approximation error can be measured in various ways. A particularly strong measure is “sup-norm” or “max-norm” error, which requires that the DNN approximation be accurate at every point of the target function’s domain (“uniform approximation”). Some science problems may have a weaker requirement than this, such as low RMS or 2-norm error. However, it’s not unreasonable to ask about max-norm approximation behaviors of numerical methods [4,5].

An illuminating paper by Ben Adcock and Nick Dexter looks at this problem [6]. They show that standard DNN methods applied even to a simple 1-dimensional problem can result in “glitches”: the DNN as a whole matches the function well but at some points totally misapproximates the target function. For a picture that shows this, see [7].

Other mathematical papers have subsequently shed light on these behaviors. I’ll try to summarize the findings below, though the actual findings are very nuanced, and many details are left out here. The reader is advised to refer to the respective papers for exact details.

The findings address three questions: 1) how many DNN parameters are required to approximate a function well? 2) how much data is required to train to a desired accuracy? and 3) what algorithms are capable of training to the desired accuracy?

How many neural network weights are needed?

How large does the neural network need to be for accurate uniform approximation of functions? If tight max-norm approximation requires an excessively large number of weights, then use of DNNs is not computationally practical.

Some answers to this question have been found—in particular, a result 1 is given in [8, Theorem 4.3; cf. 9, 10]. This result shows that the number of neural network weights required to approximate an arbitrary function to high max-norm accuracy grows exponentially in the dimension of the input to the function.

This dependency on dimension is no small limitation, insofar as this is not the dimension of physical space (e.g., 3-D) but the dimension of the input vector (such as the number of grid cells), which for practical problems can be in the tens [11] or even millions or more.

Sadly, this rules out the practical use of DNN for some purposes. Nonetheless, for many practical applications of deep learning, the approximation behaviors are not nearly so pessimistic as this would indicate (cp. [12]). For example, results are more optimistic:

  • if the target function has a strong smoothness property;
  • if the function is not arbitrary but is a composition of simpler functions;
  • if the training and test data are restricted to a (possibly unknown) lower dimensional manifold in the high dimensional space (this is certainly the case for common image and language modeling tasks);
  • if the average case behavior for the desired problem domain is much better than the worst case behavior addressed in the theorem;
  • The theorem assumes multilayer perceptron and ReLU activation; other DNN architectures may perform better (though the analysis is based on multidimensional Taylor’s theorem, which one might conjecture applies also to other architectures).
  • For many practical applications, very high accuracy is not a requirement.
  • For some applications, only low 2-norm error is sufficient, (not low max-norm).
  • For the special case of physics informed neural networks (PINNs), stronger results hold.

Thus, not all hope is lost from the standpoint of theory. However, certain problems for which high accuracy is required may not be suitable for DNN approximation.

How much training data is needed?

Assuming your space of neural network candidates is expressive enough to closely represent the target function—how much training data is required to actually find a good approximation?

A result 2 is given in [13, Theorem 2.2] showing that the number of training samples required to train to high max-norm accuracy grows, again, exponentially in the dimension of the input to the function.

The authors concede however that “if additional problem information about [the target functions] can be incorporated into the learning problem it may be possible to overcome the barriers shown in this work.” One suspects that some of the caveats given above might also be relevant here. Additionally, if one considers 2-norm error instead of max-norm error, the data requirement grows polynomially rather than exponentially, making the training data requirement much more tractable. Nonetheless, for some problems the amount of data required is so large that attempts to “pin down” the DNN to sufficient accuracy become intractable.

What methods can train to high accuracy?

The amount of training data may be sufficient to specify a suitable neural network. But, will standard methods for finding the weights of such a DNN be effective for solving this difficult nonconvex optimization problem?

A recent paper [14] from Max Tegmark’s group empirically studies DNN training to high accuracy. They find that as the input dimension grows, training to very high accuracy with standard stochastic gradient descent methods becomes difficult or impossible.

They also find second order methods perform much better, though these are more computationally expensive and have some difficulty also when the dimension is higher. Notably, second order methods have been used effectively for DNN training for some science applications [15]. Also, various alternative training approaches have been tried to attempt to stabilize training; see, e.g., [16].

Prospects and conclusions

Application of AI methods to scientific discovery continues to deliver amazing results, in spite of lagging theory. Ilya Sutskever has commented, “Progress in AI is a game of faith. The more faith you have, the more progress you can make” [17].

Theory of deep learning methods is in its infancy. The current findings show some cases for which use of DNN methods may not be fruitful, Continued discoveries in deep learning theory can help better guide how to use the methods effectively and inform where new algorithmic advances are needed.

Footnotes

1 Suppose the function to be approximated takes d inputs and has the smoothness property that all nth partial derivatives are continuous (i.e., is in Cn(Ω) for compact Ω). Also suppose a multilayer perceptron with ReLU activation functions must be able to approximate any such function to max-norm no worse than ε. Then the number of weights required is at least a fixed constant times (1/ε)d/(2n).

2 Let F be the space of all functions that can be approximated exactly by a broad class of ReLU neural networks. Suppose there is a training method that can recover all these functions up to max-norm accuracy bounded by ε. Then the number of training samples required is at least a fixed constant times (1/ε)d.

References

[1] “Integrated Research Infrastructure Architecture Blueprint Activity (Final Report 2023),” https://www.osti.gov/biblio/1984466.

[2] Joubert, Wayne, Bronson Messer, Philip C. Roth, Antigoni Georgiadou, Justin Lietz, Markus Eisenbach, and Junqi Yin. “Learning to Scale the Summit: AI for Science on a Leadership Supercomputer.” In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1246-1255. IEEE, 2022, https://www.osti.gov/biblio/2076211.

[3] “Reproducibility Workshop: The Reproducibility Crisis in ML‑based Science,” Princeton University, July 28, 2022, https://sites.google.com/princeton.edu/rep-workshop.

[4] Wahlbin, L. B. (1978). Maximum norm error estimates in the finite element method with isoparametric quadratic elements and numerical integration. RAIRO. Analyse numérique, 12(2), 173-202, https://www.esaim-m2an.org/articles/m2an/pdf/1978/02/m2an1978120201731.pdf

[5] Kashiwabara, T., & Kemmochi, T. (2018). Maximum norm error estimates for the finite element approximation of parabolic problems on smooth domains. https://arxiv.org/abs/1805.01336.

[6] Adcock, Ben, and Nick Dexter. “The gap between theory and practice in function approximation with deep neural networks.” SIAM Journal on Mathematics of Data Science 3, no. 2 (2021): 624-655, https://epubs.siam.org/doi/10.1137/20M131309X.

[7] “Figure 5 from The gap between theory and practice in function approximation with deep neural networks | Semantic Scholar,” https://www.semanticscholar.org/paper/The-gap-between-theory-and-practice-in-function-Adcock-Dexter/156bbfc996985f6c65a51bc2f9522da2a1de1f5f/figure/4

[8] Gühring, I., Raslan, M., & Kutyniok, G. (2022). Expressivity of Deep Neural Networks. In P. Grohs & G. Kutyniok (Eds.), Mathematical Aspects of Deep Learning (pp. 149-199). Cambridge: Cambridge University Press. doi:10.1017/9781009025096.004, https://arxiv.org/abs/2007.04759.

[9] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017, https://arxiv.org/abs/1610.01145

[10] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep relu neural networks in Ws,p norms. Anal. Appl. (Singap.), pages 1–57, 2019, https://arxiv.org/abs/1902.07896

[11] Matt R. Norman, “The MiniWeather Mini App,” https://github.com/mrnorman/miniWeather

[12] Lin, H.W., Tegmark, M. & Rolnick, D. Why Does Deep and Cheap Learning Work So Well?. J Stat Phys 168, 1223–1247 (2017). https://doi.org/10.1007/s10955-017-1836-5

[13] Berner, J., Grohs, P., & Voigtlaender, F. (2022). Training ReLU networks to high uniform accuracy is intractable. ICLR 2023, https://openreview.net/forum?id=nchvKfvNeX0.

[14] Michaud, E. J., Liu, Z., & Tegmark, M. (2023). Precision machine learning. Entropy, 25(1), 175, https://www.mdpi.com/1099-4300/25/1/175.

[15] Markidis, S. (2021). The old and the new: Can physics-informed deep-learning replace traditional linear solvers?. Frontiers in big Data, 4, 669097, https://www.frontiersin.org/articles/10.3389/fdata.2021.669097/full

[16] Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19,  https://papers.nips.cc/paper_files/paper/2006/hash/5da713a690c067105aeb2fae32403405-Abstract.html

[17] “Chat with OpenAI CEO and Co-founder Sam Altman, and Chief Scientist Ilya Sutskever,” https://www.youtube.com/watch?v=mC-0XqTAeMQ&t=250s

The IQ Test That AI Can’t Pass

Large language models have recently achieved remarkable test scores on well-known academic and professional exams (see, e.g., [1], p. 6). On such tests, these models are at times said to reach human-level performance. However, there is one test that humans can pass but every AI method known to have been tried has abysmally failed.

The Abstraction and Reasoning Corpus (ARC) benchmark [2] was developed by François Chollet to measure intelligence in performing tasks never or rarely seen before. We all do tasks something like this every day, like making a complicated phone call to correct a mailing address. The test is composed of image completion problems similar to Raven’s Progressive Matrices but more complex. Given images A, B and C, one must identify the image D such that the relationship “A is to B as C is to D” holds. Sometimes several examples of the A:B relationship are given.

The problem is hard because the relationship patterns between A and B that humans could easily identify (for example, image shrinking, rotating, folding, recoloring, etc.) might be many, many different things—more than can easily be trained for. By construction, every problem is qualitatively, unpredictably different, so the common approach of training on the training set doesn’t work. Instead, bonafide reasoning on a new kind of problem for each case is required.

Several competitions with prize money have encouraged progress on the ARC benchmark [3]. In these, each entrant’s algorithm must be tested against an unseen ARC holdout set. The leaderboard for the ARCathon 2023 challenge completed last month shows top score of 30 percent [4]; this is excellent progress on a very hard problem, but far from a perfect score or anything else resembling passing.

Ilya Sutskever has famously warned we shouldn’t bet against deep learning, and perhaps a future LLM will do much better on this benchmark. Others feel a new approach is needed, for example, from the burgeoning field of neurosymbolic methods. In any case, these results show at the present moment in this rapidly progressing field, we don’t seem to be anywhere close to strong forms of AGI, artificial general intelligence.

[1] OpenAI, “GPT-4 Technical Report,” https://cdn.openai.com/papers/gpt-4.pdf

[2] François Chollet, “On the Measure of Intelligence,” https://arxiv.org/abs/1911.01547

[3] “Abstraction and Reasoning Challenge,” https://www.kaggle.com/c/abstraction-and-reasoning-challenge

[4] “Winners – Lab42, “https://lab42.global/winners/

The Million Dollar Matrix Multiply

The following post is by Wayne Joubert, the newest member of our consulting team. Wayne recently retired from his position as a Senior Computational Scientist at Oak Ridge National Laboratory. — John

Training large language models like GPT-4 costs many millions of dollars in server expenses. These costs are expected to trend to billions of dollars over the next few years [1]. One of the biggest computational expenses of LLM training is multiplying matrices. These are simple operations of the form C = AB. Matrix multiplies are common not only in AI model training but also many high performance computing applications from diverse science domains.

Eking out more speed from matrix multiplies could reduce AI model training costs by millions of dollars. More routinely, such improvements could reduce training runtime by hours on a single GPU-powered workstation or cut down cloud service provider expenses significantly.

What is less well-known is that matrix multiples run on graphics processing units (GPUs) that are typically used for model training have many exotic performance behaviors that can drastically reduce matrix multiply efficiency by a wide margin.

Two recent works [2], [3] examine these phenomena in considerable depth. Factors such as matrix size, alignment of data in memory, power throttling, math library versions, chip-level manufacturing variability, and even the values of the matrix entries can significantly affect performance. At the same time, much of this variability can be modeled by machine learning methods such as decision trees and random forests [2].

Use of these methods can be the first step toward implementing autotuning techniques to minimize costs. Using such methods or carefully applying rules of thumb for performance optimization can make a huge performance difference for matrix multiply-heavy GPU software.

Related posts

[1] What large models cost you—there is no free AI lunch

[2] Wayne Joubert, Eric Palmer and Verónica G. Melesse Vergara, “Matrix Multiply Performance of GPUs on Exascale-class HPE/Cray Systems,” Proceedings of the Cray User Group Meeting (CUG) 2022, https://www.osti.gov/biblio/2224210.

[3] P. Sinha, A. Guliani, R. Jain, B. Tran, M. D. Sinclair and S. Venkataraman, “Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems,” SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 2022, pp. 01-15, doi: 10.1109/SC41404.2022.00070.