New Ways To Make Code Run Faster

Racecar

The news from Meta last week is a vivid reminder of the importance of making code run faster and more power-efficiently. Meta intends to purchase 350,000 Nvidia H100 GPUs this year [1]. Assuming 350W TDP [2] and $0.1621 per kW-h [3] average US energy cost, one expects a figure of $174 million per year in electricity expenses just to power the GPUs (note this is only a rough estimate of the actual). For this and many other datacenter-scale and real-time critical applications, every bit of increased performance can be impactful.

Many approaches can be taken to making code run faster. The excellent book Hacker’s Delight contains many tricks for speeding up low-level code kernels [4]. Also, superoptimization techniques find the very fastest performing implementations of short, loop-free code kernels by exhaustive search [5].

Since exhaustive search scales exponentially with code size, it’s no surprise that other methods have been tried. Recent work with reinforcement learning reduces the number of scalar multiples needed for matrix products, discovering new Strassen-like algorithms [6]. Other work focuses more on hardware design; for example, PrefixRL optimizes chip design for targeted problems using reinforcement learning [7].

Finding the absolute fastest code for a given task is in general NP-complete [8]. This problem is out of reach for tractable solution by such methods. However, phenomenal progress in SAT/SMT solver efficiency for NP-complete problems has been made over the last 20 years (someone has even quipped, “NP is the new P” [9]). And indeed, these methods have been applied to superoptimization problems [10, 11]. Perhaps methodologies based on efficient SAT and SMT solvers will afford further opportunities for advancement.

The need for speed and power efficiency has become so acute that radically different non-von Neumann processors are being built [12], in some cases orders of magnitude more power efficient but restrictive in the problems they can solve. In the coming years we can expect to see a huge amount activity and developments on this important problem.

[1] “Meta will have 350,000 of Nvidia’s fastest AI GPUs by end of year, buying AMD’s MI300, too,” Tom’s Hardware, Jan. 19, 2024, https://www.tomshardware.com/tech-industry/meta-will-have-350000-of-nvidias-fastest-ai-gpus-by-end-of-year-buying-amds-mi300-too

[2] “NVIDIA H100 Tensor Core GPU Datasheet,” https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet

[3] “Electricity Rates for Every State in The U.S.,” https://www.energybot.com/electricity-rates/

[4] Henry Warren, “Hacker’s Delight,” 2nd Edition, 2012, https://web.archive.org/web/20190915025154/http://www.hackersdelight.org/

[5] “Superoptimization”, https://en.wikipedia.org/wiki/Superoptimization

[6] Fawzi, A., Balog, M., Huang, A. et al. “Discovering faster matrix multiplication algorithms with reinforcement learning,” Nature 610, 47–53 (2022). https://doi.org/10.1038/s41586-022-05172-4

[7] Rajarshi Roy et al., “PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning,” https://arxiv.org/abs/2205.07000

[8] Hennessy, John L., and Thomas Gross. “Postpass code optimization of pipeline constraints.” ACM Transactions on Programming Languages and Systems (TOPLAS) 5, no. 3 (1983): 422-448.

[9] Arie Gurfinkel, “Algorithms for SAT,” https://ece.uwaterloo.ca/~agurfink/ece750t29/assets/pdf/02_SAT.pdf

[10] Abhinav Jangda and Greta Yorsh. 2017. “Unbounded superoptimization,” in Proceedings of the 2017 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! 2017). Association for Computing Machinery, New York, NY, USA, 78–88. https://doi.org/10.1145/3133850.3133856

[11] Phitchaya Mangpo Phothilimthana, Aditya Thakur, Rastislav Bodik, and Dinakar Dhurjati. 2016. “GreenThumb: superoptimizer construction framework,” in Proceedings of the 25th International Conference on Compiler Construction (CC 2016). Association for Computing Machinery, New York, NY, USA, 261–262. https://doi.org/10.1145/2892208.2892233

[12] “OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman,” Wired, Dec. 3, 2023, https://www.wired.com/story/openai-buy-ai-chips-startup-sam-altman/

 

The IQ Test That AI Can’t Pass

Large language models have recently achieved remarkable test scores on well-known academic and professional exams (see, e.g., [1], p. 6). On such tests, these models are at times said to reach human-level performance. However, there is one test that humans can pass but every AI method known to have been tried has abysmally failed.

The Abstraction and Reasoning Corpus (ARC) benchmark [2] was developed by François Chollet to measure intelligence in performing tasks never or rarely seen before. We all do tasks something like this every day, like making a complicated phone call to correct a mailing address. The test is composed of image completion problems similar to Raven’s Progressive Matrices but more complex. Given images A, B and C, one must identify the image D such that the relationship “A is to B as C is to D” holds. Sometimes several examples of the A:B relationship are given.

The problem is hard because the relationship patterns between A and B that humans could easily identify (for example, image shrinking, rotating, folding, recoloring, etc.) might be many, many different things—more than can easily be trained for. By construction, every problem is qualitatively, unpredictably different, so the common approach of training on the training set doesn’t work. Instead, bonafide reasoning on a new kind of problem for each case is required.

Several competitions with prize money have encouraged progress on the ARC benchmark [3]. In these, each entrant’s algorithm must be tested against an unseen ARC holdout set. The leaderboard for the ARCathon 2023 challenge completed last month shows top score of 30 percent [4]; this is excellent progress on a very hard problem, but far from a perfect score or anything else resembling passing.

Ilya Sutskever has famously warned we shouldn’t bet against deep learning, and perhaps a future LLM will do much better on this benchmark. Others feel a new approach is needed, for example, from the burgeoning field of neurosymbolic methods. In any case, these results show at the present moment in this rapidly progressing field, we don’t seem to be anywhere close to strong forms of AGI, artificial general intelligence.

[1] OpenAI, “GPT-4 Technical Report,” https://cdn.openai.com/papers/gpt-4.pdf

[2] François Chollet, “On the Measure of Intelligence,” https://arxiv.org/abs/1911.01547

[3] “Abstraction and Reasoning Challenge,” https://www.kaggle.com/c/abstraction-and-reasoning-challenge

[4] “Winners – Lab42, “https://lab42.global/winners/

The cold start problem

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We uses Bayesian methods to design adaptive clinical trial designs, such as clinical trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Thanks to the magic of Bayes theorem, you don’t have to have an ad hoc rule that says “Treat the first patient this way, then turn on the Bayesian machine to determine how to treat the next patient.” No, you use Bayes theorem from beginning to end. There’s no need to handle the first patient differently because expert opinion is already there, captured in the form of prior distributions (and the structure of the probability model).

Each patient is treated according to all information available at the time. At first all available information is prior information. After you have data on one patient, most of the information you have is still prior information, but Bayes’ theorem updates this prior information with your lone observation. As more data becomes available, the Bayesian machine incorporates it all, automatically shifting weight away from the prior and toward the data.

The cold start problem for business applications is easier than the cold start problem for clinical trials. First of all, most business applications don’t have the potential to cost people their lives. Second, business applications typically have fewer competing criteria to balance.

What if you’re not sure where to draw your prior information? Bayes can handle that too. You can use Bayesian model selection or Bayesian model averaging to determine which source (or weighting of sources) best fits the new data as it comes in.

Once you’ve decided to use a Bayesian approach, there’s still plenty of work to do, but the Bayesian approach provides scaffolding for that work, a framework for moving forward.

Something that bothers me about deep neural nets

Overfitting happens when a model does too good a job of matching a particular data set and so does a poor job on new data. The way traditional statistical models address the danger of overfitting is to limit the number of parameters. For example, you might fit a straight line (two parameters) to 100 data points, rather than using a 99-degree polynomial that could match the input data exactly and probably do a terrible job on new data. You find the best fit you can to a model that doesn’t have enough flexibility to match the data too closely.

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to deliberately do a mediocre job of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

It a nutshell, the disturbing thing is that you have a negative criteria for what constitutes a good solution: one that’s not too close to optimal. But there are a lot of ways to not be too close to optimal. In practice, you experiment until you find an optimally suboptimal solution, i.e. the intentionally suboptimal fit that performs the best in validation.

 

The big deal about neural networks

In their book Computer Age Statistical Inference, Brad Efron and Trevor Hastie give a nice description of neutral networks and deep learning.

The knee-jerk response [to neural networks] from statisticians was “What’s the big deal? A neural network is just a nonlinear model, not too different from many other generalizations of linear models.”

While this may be true, neural networks brought a new energy to the field. They could be scaled up and generalized in a variety of ways … And most importantly, they were able to solve problems on a scale far exceeding what the statistics community was used to. This was part computing scale expertise, part liberated thinking and creativity on the part of this computer science community.

After enjoying considerable popularity for a number of years, neural networks were somewhat sidelined by new inventions in the mid 1990s. … Neural networks were passé. But then they re-emerged with a vengeance after 2010 … the reincarnation now being called deep learning.

Quantifying uncertainty

The primary way to quantify uncertainty is to use probability. Subject to certain axioms that aim to capture common-sense rules for quantifying uncertainty, probability theory is essentially the only way. (This is Cox’s theorem.)

Other methods, such as fuzzy logic, may be useful, though they must violate common sense (at least as defined by Cox’s theorem) under some circumstances. They may be still useful when they provide approximately the results that probability would have provided and at less effort and stay away from edge cases that deviate too far from common sense.

There are various kinds of uncertainty, principally epistemic uncertainty (lack of knowledge) and aleatory uncertainty (randomness), and various philosophies for how to apply probability. One advantage to the Bayesian approach is that it handles epistemic and aleatory uncertainty in a unified way.

Blog posts related to quantifying uncertainty:

Pros and cons of the term “data science”

I’ve resisted using the term “data science,” and enjoy poking fun at it now and then, but I’ve decided it’s not such a bad label after all.

Here are some of the pros and cons of the term. (Listing “cons” first seems backward, but I’m currently leaning toward the pro side, so I thought I should conclude with it.)

Cons

The term “data scientist” is sometimes used to imply more novelty than is there. There’s not a great deal of difference between data science and statistics, though the new term is more fashionable. (Someone quipped that data science is statistics on a Mac.)

Similarly, the term data scientist is sometimes used as an excuse for ignorance, as in “I don’t understand probability and all that stuff, but I don’t need to because I’m a data scientist, not a statistician.”

The big deal about data science isn’t data but the science of drawing inferences from the data. Inference science would be a better term, in my opinion, but that term hasn’t taken off.

Pros

Data science could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title data scientist is rightfully associated with people who have better computational skills than statisticians typically have.

While the term data science isn’t perfect, there’s little to recommend the term statistics other than that it is well established. The root of statistics is state, as in a government. This is because statistics was first applied to the concerns of bureaucracies. The term statistics would be equivalent to governmentistics, a historically accurate but otherwise useless term.

Adaptive clinical trials and machine learning

Arguments over the difference between statistics and machine learning are often pointless. There is a huge overlap between the two approaches to analyzing data, sometimes obscured by differences in vocabulary. However, there is one distinction that is helpful. Statistics aims to build accurate models of phenomena, implicitly leaving the exploitation of these models to others. Machine learning aims to solve problems more directly, and sees its models as intermediate artifacts; if an unrealistic model leads to good solutions, it’s good enough.

This distinction is valid in broad strokes, though things are fuzzier than it admits. Some statisticians are content with constructing models, while others look further down the road to how the models are used. And machine learning experts vary in their interest in creating accurate models.

Clinical trial design usually comes under the heading of statistics, though in spirit it’s more like machine learning. The goal of a clinical trial is to answer some question, such as whether a treatment is safe or effective, while also having safeguards in place to stop the trial early if necessary. There is an underlying model—implicit in some methods, more often explicit in newer methods—that guides the conduct of the trial, but the accuracy of this model per se is not the primary goal. Some designs have been shown to be fairly robust, leading to good decisions even when the underlying probability model does not fit well. For example, I’ve done some work with clinical trial methods that model survival times with an exponential distribution. No one believes that an exponential distribution, i.e. one with constant hazard, accurately models survival times in cancer trials, and yet methods using these models do a good job of stopping trials early that should stop early and letting trials continue that should be allowed to continue.

Experts in machine learning are more accustomed to the idea of inaccurate models sometimes producing good results. The best example may be naive Bayes classifiers. The word “naive” in the name is a frank admission that these classifiers model as independent events known not to be independent. These methods can do well at their ultimate goal, such as distinguishing spam from legitimate email, even though they make a drastic simplifying assumption.

There have been papers that look at why naive Bayes works surprisingly well. Naive Bayes classifiers work well when the errors due to wrongly assuming independence effect positive and negative examples roughly equally. The inaccuracies of the model sort of wash out when the model is reduced to a binary decision, classifying as positive or negative. Something similar happens with the clinical trial methods mentioned above. The ultimate goal is to make correct go/no-go decisions, not to accurately model survival times. The naive exponential assumption effects both trials that should and should not stop, and the model predictions are reduced to a binary decision.

Related: Adaptive clinical trial design

 

A subtle way to over-fit

If you train a model on a set of data, it should fit that data well. The hope, however, is that it will fit a new set of data well. So in machine learning and statistics, people split their data into two parts. They train the model on one half, and see how well it fits on the other half. This is called cross validation, and it helps prevent over-fitting, fitting a model too closely to the peculiarities of a data set.

For example, suppose you have measured the value of a function at 100 points. Unbeknownst to you, the data come from a cubic polynomial plus some noise. You can fit these 100 points exactly with a 99th degree polynomial, but this gives you the illusion that you’ve learned more than you really have. But if you divide your data into test and training sets of 50 points each, overfitting on the training set will result in a terrible fit on the test set. If you fit a cubic polynomial to the training data, you should do well on the test set. If you fit a 49th degree polynomial to the training data, you’ll fit it perfectly, but do a horrible job with the test data.

Now suppose we have two kinds of models to fit. We train each on the training set, and pick the one that does better on the test set. We’re not over-fitting because we haven’t used the test data to fit our model. Except we really are: we used the test set to select a model, though we didn’t use the test set to fit the parameters in the two models. Think of a larger model as a tree. The top of the tree tells you which model to pick, and under that are the parameters for each model. When we think of this new hierarchical model as “the model,” then we’ve used our test data to fit part of the model, namely to fit the bit at the top.

With only two models under consideration, this isn’t much of a problem. But if you have a machine learning package that tries millions of models, you can be over-fitting in a subtle way, and this can give you more confidence in your final result than is warranted.

The distinction between parameters and models is fuzzy. That’s why “Bayesian model averaging” is ultimately just Bayesian modeling. You could think of the model selection as simply another parameter. Or you could go the other way around and think of each parameter value as an index for a family of models. So if you say you’re only using the test data to select models, not parameters, you could be fooling yourself.

For example, suppose you want to fit a linear regression to a data set. That is, you want to pick m and b so that y = mx + b is a good fit to the data. But now I tell you that you are only allowed to fit models with one degree of freedom. You’re allowed to do cross validation, but you’re only allowed to use the test data for model selection, not model fitting.

So here’s what you could do. Pick a constant value of b, call it b0. Now fit the one-parameter model y = mx + b0 on your training data, selecting the value of m only to minimize the error in fitting the training set. Now pick another value of b, call it b1, and see how well it does on the test set. Repeat until you’ve found the best value of b. You’ve essentially used the training and test data to fit a two-parameter model, albeit awkwardly.

Related: Probability modeling

Machine learning and magic

When I first heard about a lie detector as a child, I was puzzled. How could a machine detect lies? If it could, why couldn’t you use it to predict the future? For example, you could say “IBM stock will go up tomorrow” and let the machine tell you whether you’re lying.

Of course lie detectors can’t tell whether someone is lying. They can only tell whether someone is exhibiting physiological behavior believed to be associated with lying. How well the latter predicts the former is a matter of debate.

I saw a presentation of a machine learning package the other day. Some of the questions implied that the audience had a magical understanding of machine learning, as if an algorithm could extract answers from data that do not contain the answer. The software simply searches for patterns in data by seeing how well various possible patterns fit, but there may be no pattern to be found. Machine learning algorithms cannot generate information that isn’t there any more than a polygraph machine can predict the future.