The Million Dollar Matrix Multiply

The following post is by Wayne Joubert, the newest member of our consulting team. Wayne recently retired from his position as a Senior Computational Scientist at Oak Ridge National Laboratory. — John

Training large language models like GPT-4 costs many millions of dollars in server expenses. These costs are expected to trend toward billions of dollars over the next few years [1]. One of the biggest computational expenses of LLM training is multiplying matrices. These are simple operations of the form C = AB. Matrix multiplies are common not only in AI model training but also in many high performance computing applications from diverse science domains.
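To make the operation concrete, here is a minimal pure-Python sketch of C = AB, along with the conventional operation count of 2mkn for an m × k matrix times a k × n matrix. The function names `matmul` and `matmul_flops` are illustrative, not taken from any library, and production code would call an optimized math library rather than loop like this.

```python
def matmul(A, B):
    """Naive triple-loop product C = AB for matrices stored as lists of rows."""
    m, k, n = len(A), len(B), len(B[0])
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions must match")
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]
    return C


def matmul_flops(m, k, n):
    """Conventional count: about k multiplies and k adds per output entry."""
    return 2 * m * k * n
```

The cubic growth of this count is why matrix multiplies dominate the budget: doubling all three dimensions multiplies the work by eight.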

Eking out more speed from matrix multiplies could reduce AI model training costs by millions of dollars. More routinely, such improvements could reduce training runtime by hours on a single GPU-powered workstation or cut down cloud service provider expenses significantly.

What is less well known is that matrix multiplies run on the graphics processing units (GPUs) typically used for model training have many exotic performance behaviors that can drastically reduce matrix multiply efficiency.

Two recent works [2], [3] examine these phenomena in considerable depth. Factors such as matrix size, alignment of data in memory, power throttling, math library versions, chip-level manufacturing variability, and even the values of the matrix entries can significantly affect performance. At the same time, much of this variability can be modeled by machine learning methods such as decision trees and random forests [2].
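A first step toward seeing this kind of variability is a size sweep that records the achieved rate at each matrix size. The sketch below is only a stand-in: it times a pure-Python multiply on the CPU, and the names `py_matmul`, `random_square`, and `sweep` are hypothetical. On a GPU one would pass a library call as `multiply` and make sure the device has finished its work before stopping the timer, since GPU kernels are typically launched asynchronously.

```python
import random
import time


def py_matmul(A, B):
    """Pure-Python C = AB; a slow CPU stand-in for a GPU library call."""
    Bt = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def random_square(n, seed=0):
    """n x n matrix of uniform random entries, reproducible via the seed."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for _ in range(n)]


def sweep(sizes, multiply=py_matmul, repeats=3):
    """Time an n x n multiply for each n; keep the best of several repeats."""
    results = []
    for n in sizes:
        A, B = random_square(n, seed=1), random_square(n, seed=2)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            multiply(A, B)
            best = min(best, time.perf_counter() - start)
        flops = 2 * n ** 3  # conventional operation count for square matrices
        gflops = flops / max(best, 1e-12) / 1e9  # guard against a zero timing
        results.append({"n": n, "seconds": best, "gflops": gflops})
    return results


if __name__ == "__main__":
    # Neighboring sizes are included on purpose: size and alignment effects
    # tend to show up as jumps between adjacent values of n.
    for row in sweep([32, 33, 64, 65]):
        print(row)
```

Rows like these, collected over many sizes and settings, are the sort of table one could hand to a decision tree or random forest as training data.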

Use of these methods can be the first step toward implementing autotuning techniques to minimize costs. Using such methods or carefully applying rules of thumb for performance optimization can make a huge performance difference for matrix multiply-heavy GPU software.
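In its simplest form, autotuning means timing a handful of candidate configurations and keeping the fastest. The sketch below assumes the tunable parameter is a padding granularity for the matrix dimensions, which is an illustrative choice rather than anything prescribed by [2] or [3]. The names `round_up` and `autotune` are hypothetical, and the caller supplies a `benchmark` function that returns a time in seconds for a given candidate.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def autotune(candidates, benchmark, repeats=3):
    """Time each candidate and return (fastest_candidate, timings).

    `benchmark(candidate)` must return elapsed seconds for one run.
    The best of `repeats` runs is kept to damp timing noise.
    """
    timings = {}
    for c in candidates:
        timings[c] = min(benchmark(c) for _ in range(repeats))
    best = min(timings, key=timings.get)
    return best, timings
```

In use, `benchmark` might pad a dimension n to `round_up(n, c)`, run the multiply, and return the elapsed time. A fitted performance model could replace the exhaustive loop by predicting which candidates are worth timing at all.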

Related posts

[1] What large models cost you—there is no free AI lunch

[2] Wayne Joubert, Eric Palmer and Verónica G. Melesse Vergara, “Matrix Multiply Performance of GPUs on Exascale-class HPE/Cray Systems,” Proceedings of the Cray User Group Meeting (CUG) 2022.

[3] P. Sinha, A. Guliani, R. Jain, B. Tran, M. D. Sinclair and S. Venkataraman, “Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems,” SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 2022, pp. 01-15, doi: 10.1109/SC41404.2022.00070.

2 thoughts on “The Million Dollar Matrix Multiply”

  1. Wayne must feel like he passed into heaven — JDC Consulting. That’s it. It doesn’t get any better.

    This post topic is fascinating, it’s like the “Quake Square Root algorithm” (or was it Doom?) which depended for its wicked optimization on C, data types “mis” casting and register tomfoolery — but this matrix version is ‘on steroids’. (Maybe even stronger drugs, appropriately enough for machine learning: hallucinogens?)

    It was trendy ~10 years ago to talk about “data exhaust” and Big Data and being able to get new insights from processing all the data we left ‘on the table’.

    I’ve wondered for years if there’s not a *computational* exhaust version. Waste heat. Can we use mis-predicted branches, or just-in-case register ops for other purposes?

    My favorite idea (fair warning, I did mention hallucinogens) was that the evolving information about locks and latches in large concurrent systems (e.g. relational DBs) must be usable ‘somehow’/’someway’.

    If that vague and wooly-headed idea ever comes to anything, it will be from someone like Wayne, at a place like John D. Cook Consulting.

    Happy New Year.

  2. This takes me back! I graduated from UC San Diego with my BS in Computer Engineering in 1986, just before the “BackProp Revolution” of the early 1990s headlined by UCSD’s Robert Hecht-Nielsen and Bart Kosko. At the time, I was employed by SAIC, which had put significant resources into hardware for Artificial Neural Network (ANN) processing, resulting in a PC expansion card stuffed with TI DSPs.

    Unfortunately, the hardware was not performing up to its potential, the PhDs were getting frustrated, and management was wary of the delays and increasing costs with too few results. I was brought in to evaluate the software-hardware interface for readily available optimizations, something I had done on prior real-time video and signal processing and analysis projects.

    Unfortunately, I was brought in too late to have any impact, as I was still conquering my learning curve when the software side of the project was moved outside the company: we kept producing the hardware, and pivoted toward using it for DoD signal processing applications.

    Still, some insights were gathered, though not fully implemented. In particular, keeping the DSPs fed was one obvious problem, as was “memory thrashing” on some rare computations. I felt the solution to both would be found in better queue and cache management, but didn’t get to do a detailed analysis or simulation.
