New Ways To Make Code Run Faster


The news from Meta last week is a vivid reminder of the importance of making code run faster and more power-efficiently. Meta intends to purchase 350,000 Nvidia H100 GPUs this year [1]. Assuming 350W TDP [2] and $0.1621 per kW-h [3] average US energy cost, one expects a figure of $174 million per year in electricity expenses just to power the GPUs (note this is only a rough estimate of the actual). For this and many other datacenter-scale and real-time critical applications, every bit of increased performance can be impactful.

Many approaches can be taken to making code run faster. The excellent book Hacker’s Delight contains many tricks for speeding up low-level code kernels [4]. Also, superoptimization techniques find the very fastest performing implementations of short, loop-free code kernels by exhaustive search [5].

Since exhaustive search scales exponentially with code size, it’s no surprise that other methods have been tried. Recent work with reinforcement learning reduces the number of scalar multiples needed for matrix products, discovering new Strassen-like algorithms [6]. Other work focuses more on hardware design; for example, PrefixRL optimizes chip design for targeted problems using reinforcement learning [7].

Finding the absolute fastest code for a given task is in general NP-complete [8]. This problem is out of reach for tractable solution by such methods. However, phenomenal progress in SAT/SMT solver efficiency for NP-complete problems has been made over the last 20 years (someone has even quipped, “NP is the new P” [9]). And indeed, these methods have been applied to superoptimization problems [10, 11]. Perhaps methodologies based on efficient SAT and SMT solvers will afford further opportunities for advancement.

The need for speed and power efficiency has become so acute that radically different non-von Neumann processors are being built [12], in some cases orders of magnitude more power efficient but restrictive in the problems they can solve. In the coming years we can expect to see a huge amount activity and developments on this important problem.

[1] “Meta will have 350,000 of Nvidia’s fastest AI GPUs by end of year, buying AMD’s MI300, too,” Tom’s Hardware, Jan. 19, 2024,

[2] “NVIDIA H100 Tensor Core GPU Datasheet,”

[3] “Electricity Rates for Every State in The U.S.,”

[4] Henry Warren, “Hacker’s Delight,” 2nd Edition, 2012,

[5] “Superoptimization”,

[6] Fawzi, A., Balog, M., Huang, A. et al. “Discovering faster matrix multiplication algorithms with reinforcement learning,” Nature 610, 47–53 (2022).

[7] Rajarshi Roy et al., “PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning,”

[8] Hennessy, John L., and Thomas Gross. “Postpass code optimization of pipeline constraints.” ACM Transactions on Programming Languages and Systems (TOPLAS) 5, no. 3 (1983): 422-448.

[9] Arie Gurfinkel, “Algorithms for SAT,”

[10] Abhinav Jangda and Greta Yorsh. 2017. “Unbounded superoptimization,” in Proceedings of the 2017 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! 2017). Association for Computing Machinery, New York, NY, USA, 78–88.

[11] Phitchaya Mangpo Phothilimthana, Aditya Thakur, Rastislav Bodik, and Dinakar Dhurjati. 2016. “GreenThumb: superoptimizer construction framework,” in Proceedings of the 25th International Conference on Compiler Construction (CC 2016). Association for Computing Machinery, New York, NY, USA, 261–262.

[12] “OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman,” Wired, Dec. 3, 2023,