Machine learning by satisfiability solving

Posted on 31 July 2025 by Wayne Joubert

Define B = {0, 1} and a Boolean function f_p: B^N → B where p is a Boolean parameter vector in Bⁿ. Consider that f_p(x) can be represented as a Boolean expression whose variables are the entries of vectors p and x. Assume that c is the cost of computing f_p(x) measured in some way, for example, the number of operator evaluations based on some complete set of Boolean operators. Then, given y in B and x in B^N, the cost to solve the equation f_p(x) = y for satisfying y has cost c2ⁿ, using the most naïve brute force search.

For 3SAT solving, a much better worst case bound is known, namely O(1.307ⁿ), see here, for a related discussion see here. For the inexactly-defined class of “industrial problems,” performance in practice is often much better; for a discussion see here.

Now consider a set of Boolean feature vectors x_i and labels y_i. for i in {1, …, d}. One can now solve f_p(x_i) = y_i for all i. Since the number of unknown variables to be solved for is unchanged, the naïve bound on computational cost is cd2ⁿ, scaling linearly in the number of data values d. Note this is tractable if the phenomenon in question can be modeled by a small number of logical parameters n.

In practice, one generally does not solve a machine learning problem exactly but approximately, minimizing a loss function. One can use the AtLeast operator in a SAT solver like Z3 to require that f_p(x_i) = y_i be satisfied for at least K values of i, for some K. One can then find the maximal such K by performing bisection on K, requiring log₂(d) such SAT solves. On exact solvers for this problem see also here.

The AtLeast operator can be implemented by either binary adder / totalizer encoding (Bailleux and Boufkhad 2003), sequential counter encoding (Sinz 2005), or Batcher sorting network approach (Abío et al. 2013). Unfortunately, all these methods require adding significant numbers of auxiliary variables, adversely affecting the naïve complexity bound. However, one can hope that performance is much better than this bound for industrial problems, as is often the case in practice. Furthermore, randomized approximation algorithms known to run in polynomial time can provably find assignments that satisfy a guaranteed fraction (e.g., 3/4 or more) of the maximum number of satisfiable Boolean constraints (see for example here, here, here and here). This might serve as a proxy for exactly solving for the optimizer.

If the problem instead has Boolean expressions with embedded linear inequality predicates on integer variables of bounded range, one could apply SMT solver methods directly using the ideas described above, or convert the problem to a SAT problem and apply the above methods directly.

The idea of using SAT solvers for machine learning in the way described here goes back to (Kamath et al. 1992). The method is described with an example in Donald Knuth’s TAOCP fascicle on Satisfiability, section on Learning a Boolean function.

Evaluating and lowering AI hallucination cost

Posted on 16 July 2025 by John

AI models hallucinate, and always will [1]. In some sense this is nothing new: everything has an error rate. But what is new is the nature of AI errors. The output of an AI is plausible by construction [2], and so errors can look reasonable even when they’re wrong. For example, if you ask for the capital city of Illinois, an AI might err by saying Chicago, but it wouldn’t say Seattle.

You need to decide two things when evaluating whether to use an AI for a task: the cost of errors and the cost of validation.

Cost of error on horizontal axis, cost of validation on vertical axis

Image generation is an ideal task for AIs because the consequences of an error are usually low. And validation is almost effortless: if an image looks OK, it’s OK. So imagine generation is in the bottom left corner of the image above.

Text-to-speech is a little more interesting to evaluate. The consequences of an error depend on user expectations. Someone may be mildly annoyed when a computer makes an error reading an email out loud, but the same person may angry if he thought he was talking to a human and discovers he’s been talking to a machine.

Validating text-to-speech requires a human to listen to the audio. This doesn’t require much effort, though the cost might be too much in some contexts. (I’m in the middle of an AI-narrated book and wondering why the producer didn’t go to the effort of fixing the pronunciation errors. The print version of the book has high production quality, but not as much effort went into the audio.) So I’d say text-to-speech is somewhere in the bottom half of the graph, but the horizontal location depends on expectations.

Code generation could be anywhere on the graph, depending on who is generating the code and what the code is being used for. If an experienced programmer is asking AI to generate code that in principle he could have written, then it should be fairly easy to find and fix errors. But it’s not a good idea for a non-programmer to generate code for a safety-critical system.

Mitigating risks and costs

The place you don’t want to be is up and to the right: errors are consequential and validation is expensive. If that’s where your problem lies, then you want to either mitigate the consequences of errors or reduce the cost of validation.

This is something my consultancy helps clients with. We find ways to identify errors and to mitigate the impact of inevitable errors that slip through.

If you’d like help moving down and to the left, lowering the cost of errors and the cost of validation, let’s set up a meeting to discuss how we can help.

LET’S TALK

[1] LLMs have billions of parameters, pushing a trillion. But as large as the parameter space is, the space of potential prompts is far larger, and so the parameters do not contain enough information to respond completely accurately to every possible prompt.

[2] LLMs predict the word that is likely to come next given the words produced so far. In that sense, the output is always reasonable. If an output does not appear reasonable to a human, it is because the human has more context.

Scientific papers: innovation … or imitation?

Posted on 5 June 2025 by Wayne Joubert

Sometimes a paper comes out that has the seeds of a great idea that could lead to a whole new line of pioneering research. But, instead, nothing much happens, except imitative works that do not push the core idea forward at all.

For example the McCulloch Pitts paper from 1943 showed how neural networks could represent arbitrary logical or Boolean expressions of a certain class. The paper was well-received at the time, brilliantly executed by co-authors with diverse expertise in neuroscience, logic and computing. Had its significance been fully grasped, this paper might have, at least notionally, formed a unifying conceptual bridge between the two nascent schools of connectionism and symbolic AI (one can at least hope). But instead, the heated conflict in viewpoints in the field has persisted, even to this day.

Another example is George Miller’s 7 +/- 2 paper. This famous result showed humans are able to hold only a small number of pieces of information in mind at the same time while reasoning. This paper was important not just for the specific result, but for the breakthrough in methodology using rigorous experimental noninvasive methods to discover how human thinking works—a topic we know so little about, even today. However, the followup papers by others, for the most part, only extended or expanded on the specific finding in very minor ways. [1] Thankfully, Miller’s approach did eventually gain influence in more subtle ways.

Of course it’s natural from the incentive structures of publishing that many papers would be primarily derivative rather than original. It’s not a bad thing that, when a pioneering paper comes out, others very quickly write rejoinder papers containing evaluations or minor tweaks of the original result. Not bad, but sometimes we miss the larger implications of the original result and get lost in the details.

Another challenge is stovepiping—we get stuck in our narrow swim lanes for our specific fields and camps of research. [2] We don’t see the broader implications, such as connections and commonalities across fields that could lead to fruitful new directions.

Thankfully, at least to some extent current research in AI shows some mix of both innovation and imitation. Inspired in part by the accelerationist mindset, many new papers appear every day, some with significant new findings and others that are more modest riffs on previous papers.

Notes

[1] Following this line of research on human thought processes could be worthwhile for various reasons. For example, some papers in linguistics state that Chomsky‘s vision for a universal grammar is misguided because the common patterns in human language are entirely explainable by the processing limitations of the human mind. But this claim is made with no justification or methodological rigor of any kind. If I claimed a CPU performs vector addition or atomic operations efficiently because of “the capabilities of the processor,” I would need to provide some supporting evidence, for example, documenting that the CPU has vector processing units or specialized hardware for atomics. The assertions about language structure being shaped by the human mental processing faculty is just an empty truism, unless supported by some amount of scientific rigor and free of the common fallacies of statistical reasoning.

[2] I recently read a paper in linguistics with apparent promise, but the paper totally misconstrued the relationship between Shannon entropy and Kolmogorov complexity. Sadly this paper passed review in a linguistic journal, but if it had had a mathematically inclined reviewer, the problem would have been caught and fixed.

Why do LLMs have emergent properties?

Posted on 8 May 2025 by Wayne Joubert

Large language models display emergence behaviors: when the parameter count is scaled to a certain value, suddenly the LLM is capable of performing a new task not possible at a smaller size. Some say the abruptness of this change is merely a spurious artifact of how it is measured. Even so, many would like to understand, predict, and even facilitate the emergence of these capabilities.

The following is not a mathematical proof , but a plausibility argument as to why such behavior should not be surprising, and a possible mechanism. I’ll start with simple cases and work up to more complex ones.

In nature

An obvious point. Emergence is ubiquitous in nature. Ice near the freezing point that is slightly heated suddenly becomes drinkable (phase change). An undrivable car with three wheels gets a fourth wheel and is suddenly drivable. Nonlinearity exists in nature.

In machine learning

A simple example: consider fitting N arbitrary points in one dimension with linear regression using monomials. For a basis up to degree less than N-1, for most possible sets of data points (excluding “special” cases like collinear), the regression error will be non-zero, and reciprocally, the accuracy will be some finite value. Increase the number of monomials (parameter count) to N-1, and suddenly the error drops to zero, and accuracy jumps to infinity.

When using k-means clustering, if one has n clusters and runs k-means clustering with K<N cluster centers, the error will be significant, but when K=N, suddenly the cluster centers can model all clusters well, and the error drops dramatically.

In algorithms

Consider all Boolean circuits composed from some fixed logically complete set of gate types. Now consider the goal of constructing a Boolean circuit that takes a single byte representing the integer N and increments it to N+1, modulo 256 (8 bits input, 8 bits output). Clearly, such a circuit exists, for example, the standard chain of 1-bit add-and-carry circuits. Note one can in principle enumerate all possible circuits of finite gate count. It is manifest that an integer K>0 exists for which no circuit with less than K gates solves the problem but there exists a circuit with K gates that does. The standard chain of 8 1-bit adders might be such a minimizer, or maybe the optimal circuit is more exotic (for example see here, though this method is not guaranteed to compute a minimizer).

One would thus see this capability potentially emerge as soon as one reaches a gate budget of K gates. Now, one could argue that for a smaller gate budget, a partial result might be possible, for example, incrementing any 7-bit number—so the increase in capability is continuous, not emergent or wholly new. However, if all you care about is correctly incrementing any byte (for example, for manipulating ASCII text), then it’s all or nothing; there’s no partial credit. Even so, the gate budget required for incrementing 8 bits compared to only 7-bit integers is only slightly higher, but this minor increase in gate count actually doubles the quantity of integers that can be incremented, which might be perceived as a surprising, unexpected (emergent) jump.

In LLMs

The parameter count of an LLM defines a certain bit budget. This bit budget must be spread across many, many tasks the final LLM will be capable of, as defined by the architecture and the training process (in particular, the specific mix of training data). These tasks are implemented as “algorithms” (circuits) within the LLM. The algorithms are mixed together and (to some extent) overlap in a complex way that is difficult to analyze.

Suppose one of these desired capabilities is some task X. Suppose all possible input/output pairs for this operation are represented in the training data (or, maybe not—maybe some parts of the algorithm can be interpolated from the training data). The LLM is trained with SGD, typically with 2-norm minimization. The unit ball in the 2-norm is a sphere in high dimensional space. Thus “all directions” of the loss are pressed down equally by the minimization process—which is to say, the LLM is optimized on all the inputs for many, many tasks, not just task X. The limited parameter bit budget must be spread across many, many other tasks the LLM must be trained to do. As LLMs of increasing size are trained, at some point enough parameter bits in the budget will be allocatable to represent a fully accurate algorithm for task X, and at this point the substantially accurate capability to do “task X” will be perceivable—“suddenly.”

Task X could be the 8-bit incrementer, which from an optimal circuit standpoint would manifest emergence, as described above. However, due to the weakness of the SGD training methodology and possibly the architecture, there is evidence that LLM training does not learn optimal arithmetic circuits at all but instead does arithmetic by a “bag of heuristics” (which incidentally really is, itself, an algorithm, albeit a piecemeal one). In this case, gradually adding more and more heuristics might be perceived to increase the number of correct answers in a somewhat more incremental way, to be sure. However, this approach is not scalable—to perform accurate arithmetic for any number of digits, if one does not use an exact arithmetic algorithm or circuit, one must use increasingly more heuristics to increase coverage to try to capture all possible inputs accurately. And still, transitioning from an approximate to an exact 8-bit incrementer might in practice be perceived as an abrupt new capability, albeit a small one for this example.

One could alternatively consider tool use (for example, a calculator function that is external to the LLM proper), but then a new tool must be written for every new task, and the LLM needs to understand how to use the tool. (Maybe at some point LLMs will know how to write and use their own algorithmic tools?)

Predicting emergence

The real question is how can we predict when a new LLM will achieve some new capability X. For example, X = “Write a short story that resonates with the social mood of the present time and is a runaway hit” (and do the same thing again once a year based on new data, indefinitely into the future without failure). We don’t know an “algorithm” for this, and we can’t even begin to guess the required parameter budget or the training data needed. That’s the point of using an LLM—its training internally “discovers” new, never seen before algorithms from data that would be difficult for humans to formulate or express from first principles. Perhaps there is some indirect way of predicting the emergence of such X, but it doesn’t seem obvious on the face of it how to predict this directly.

Conclusion

Based on these examples, it would seem not at all surprising for LLMs to exhibit emergent behaviors, though in our experience our encounter with them may be startling. Predicting them may be possible to a limited extent but for the general case seems really hard.

Do you have any thoughts? If so, please leave them in the comments.

Looking at Your Data

Posted on 29 March 2025 by Wayne Joubert

What to do first after scoping out and starting a data science project?

I’ve started an unsupervised learning project based on textual data. The first thing I like to do is actually look at the data. Is it noisy? What are the features—complex feature engineering needed? How heterogeneous? What generalization and overfitting challenges?

Analysis can take many forms: actually looking at the numbers, using visualization tools, Excel spreadsheet, Jupyter notebooks with Matplotlib, computing various statistics on the whole dataset or portions of it.

Some may believe this is not important. Just throw a barrage of classification or regression methods at the data, treat the data as a black box. Of course testing on a suite of ML methods is not a bad thing. But I can’t imagine not using every avenue available, including looking at the data. I’m certainly not alone in this view (see for example here, here and here).

I spent a few hours developing a simple custom data viewer for my problem that colored different parts of the textual data to give insight as to what was going on. I used ChatGPT to develop parts of this tool; some of it was incorrect and needed fixing, but having at least a draft of the code definitely saved time. Seeing the actual data in person was insightful and generated ideas for solving the problem.

While inspecting the data can help identify issues, it also risks biasing the modeling process by imposing assumptions that a flexible model might otherwise uncover on its own. One must also beware of data leakage. That being said—in general I think understanding as much as you can about the data is not a bad thing.

Lessons Learned With the Z3 SAT/SMT Solver

Posted on 17 March 2025 by Wayne Joubert

Community best practices are useful for helping use a software product more effectively. I’ve just completed a small project using the Z3 solver. Here are some things I’ve learned:

My project involves an optimization problem: for a subset of Boolean variables, maximize the count of how many are true. My specific problem is solved much faster with Z3 by converting to a decision problem: set up a base problem to solve for the count being at least a certain fixed number, and iterate using bisection search to find the highest number satisfied. Bisection has been used for this problem before. Also, certain methods may possibly reduce the number of bisection steps.
Using Z3 “tactics” can greatly speed up the solve process. I found a good combination of tactics by trial and error, guided in part by the descriptions of the tactics. ChatGPT was of some help in finding good selections to try. An interesting paper discusses use of Monte Carlo tree search to define a good chain of tactics. The branching factor here is high, perhaps around 1000, though there are some redundancies in this number. Training multi-step MCTS might be expensive, but doing this once to get a good static chain of tactics might be worthwhile.
The strength of Z3 is in its extremely broad functionality, more so than its raw compute performance. It would be a daunting task for the Z3 team to fully optimize every possible solve option. I examined some of the SMT solver competitions to find faster codes. CVC5 on one case I tried was about twice as fast as Z3; I’ve seen similar reports in the literature. Presently I don’t find it worth the switching costs to use CVC5. One approach might be to use the very capable tactics engine of Z3 and pass the resulting modified problem to CVC5.
The specific formulation of the problem can make a big difference in solver performance. I’ve already seen this in the area of iterative linear solvers, for example diagonal matrix scaling can dramatically help (conjugate gradients) or hurt (multigrid) solver performance. Same thing here. Hence the huge importance in good “preprocessing“ for SAT/SMT solvers. One could wish the solver could handle all this automatically without user intervention. However, these powerful tools must be brandished very carefully for maximum effect.
Generally, one should move as much of the problem outside of the solver as possible, since the solver is the long pole in the tent in terms of scalability. For example if there is a Z3 integer that must be limited to a certain range and additionally some values in the interval must be blacklisted, it’s better, if possible, to compress all of the valid values into a single interval, to make testing for validity simpler in the Z3 code.
Along these lines: the Z3 tactics for propagating constants are not perfect; thus it can help to manually propagate constants (though this unfortunately makes the code more messy). This propagation can also sometimes allow for removal of unneeded constraints, further speeding up performance. Relatedly, some intriguing work by Benjamin Mikek shows how one can use the LLVM code optimizer to optimize the SMT problem in a way that is complementary to Z3 tactics, achieving significant speedup (for more info see here, here and here). I haven’t tried this but it seems promising.
Because of the scalability challenge of SMT solvers, various simplifying heuristics to modify the problem can be helpful. For example: solving a subproblem of the main problem and holding the resulting variables fixed in order to solve the rest of the problem. Or solving a simpler, smaller problem first to determine variable presets for the full problem. With these heuristics, one does not in general find the true global optimum; but the result may be adequate.
CPU threading does not work for my case (Z3 Python, macOS). Perfect parallelization of SAT and SMP is an unsolved (and perhaps in some sense not fully solvable) problem. One can naïvely parallelize bisection search by converting to trisection, etc., but this does not give perfect speedup (specif., log(P) speedup on P threads). Improvements to parallel bisection in some cases may be possible. Recent work by Armin Biere and colleagues looks promising; as I read it, near perfect speedup up to eight threads (at least for some problems).
Some of the main developers of Z3 are on Stack Overflow and have been active in the past answering questions. This seems very useful.

Resources like Handbook of Satisfiability and the proceedings of various SAT/SMT conferences seem helpful. More information on best practices for non-expert practitioners would be a great help to the community. If you know of any good resources, please share in the comments.

Grokking the James Bond chord

Posted on 16 March 2025 by John

The last chord of the James Bond theme is known as the James Bond chord. If you’ve ever seen a Bond movie, you’ll recognize it. Here’s a quick video showing how to play it on guitar.

I tried asking Grok 3 to generate some LilyPond code for sheet music a few days ago. LilyPond is pretty obscure as programming languages go—it’s not like asking for Python code—so I was skeptical, but Grok did well.

This afternoon I asked Grok to create sheet music for the James Bond chord. Specifically, I asked

What is the “James Bond chord”?
Create sheet music for the James Bond chord using Lilypond.
Move the chord up an octave.

Here’s the output of each step.

What is the chord?

According to Grok,

The “James Bond chord” is a distinctive musical chord famously associated with the James Bond film franchise. It’s not just a single chord but rather a specific voicing and progression that evokes a sense of mystery, danger, and sophistication—perfectly matching the suave spy’s persona. Officially, it’s often identified as an E minor major 9 chord (EmMaj9), though its exact interpretation can vary slightly depending on the context.

The chord was first introduced by composer Monty Norman in the “James Bond Theme” for Dr. No (1962), arranged by John Barry, whose orchestration gave it its iconic swagger. …

First output

Grok produced the following LilyPond code

\version "2.22.0"

\score {
  \new Staff {
    \clef treble
    \key c \major % Default key, no sharps/flats; we'll specify accidentals manually
    \time 4/4
     <e g b dis' fis'>1 % EmMaj9 chord: E G B D# F#
  }
  \layout {}
  \midi {}
}

When compiled this gave the following music.

Grok also produce a MIDI file.

Second output

I wanted to move the chord up an octave for aesthetic reasons, putting the notes inside the staff. Grok complied, changing one line in the code, essentially adding an extra prime mark after each note.

<e' g' b' dis'' fis''>1 % EmMaj9 chord moved up an octave: E' G' B' D#'' F#''

This compiled to the following music.

Problems and prospects

Grok’s not perfect. In another experiment it produced code that wouldn’t compile. But when I told Grok that the code didn’t compile and asked it to try again, it worked.

I tried to remove the time signature, the C symbol. I asked Grok to remove it, and it did not. I asked Grok “How do you get LilyPond to produce music without a time signature?” and it told me two ways, neither of which worked.

I’ve used LilyPond occasionally for years, not to produce full sheets of music but to produce little fragments for blog posts. I’ve always found it a bit mysterious, in part because I jumped in and used it as needed without studying it systematically. There have been times when I thought about including some music notation in a blog post and didn’t want to go to the effort of using LilyPond (or rather the effort of debugging LilyPond if what I tried didn’t work). I may go to the effort more often now that I have a fairly reliable code generator.

Posts using LilyPond

DeepSeek-R1: Do we need less compute now?

Posted on 3 February 2025 by Wayne Joubert

The reactions to the new DeepSeek-R1 AI model in recent days seem limitless. Some say it runs so much faster than existing models that we will no longer need the billions of dollars in compute hardware that big tech is preparing to buy.

Is that plausible?

To get an answer, we need only look back at the experience of the recently-completed Exascale Computing Project. This large scale multi-lab project was tasked with developing technology (primarily software) to prepare for exascale computing, which has recently been achieved by Frontier, Aurora and El Capitan.

During the course of the project, various algorithm and implementation improvements were discovered by the the science teams, these leading to as much as 60X speedup or more, over and above speedups possible from hardware alone [1]. In response, are the teams just running the very same problems faster on older hardware? No — instead, they are now able to run much, much larger problems than previously possible, exploiting both hardware and software improvements.

Or suppose today there were no such thing as the fast Fourier transform (FFT) and scientists were computing Fourier transforms using (essentially) large dense matrix-vector products. If someone then discovered the FFT, I’d guarantee you that scientists would not only say, (1) “Wow, now I can run my existing problems much, much faster,” but also, (2) “Wow, now I can run problems much larger than I ever dreamed and solve problems larger than I could have ever imagined!”

Paradoxically, faster algorithms might even increase the demand for newer, faster hardware. For example, a new faster algorithm for designing medications to cure cancer might be judged so important that it’s worth building the largest machine possible to run it effectively.

All this is not to say whether you should buy or sell Nvidia stock right now. However, it does mean that there is no simplistic argument that faster algorithms and implementations necessarily lead to lower spend on computing hardware. History shows that sometimes this is not true at all. The smart money, on the other hand, is on research teams that are able to exploit any and every new discovery to improve what is possible with their codes, whether by hardware, data, code optimizations or algorithms.

Notes

[1] See slide 9 from Doug Kothe’s talk, “Exascale and Artificial Intelligence: A Great Marriage“. The “Figure of Merit” (FOM) number represents speedup of science output from an application compared to an earlier baseline system. Specifically, a FOM speedup of 50X is the anticipated speedup from baseline due to efficient use of hardware only, for example, on Frontier compared to the earlier OLCF Titan system.

Can AI Models Reason: Is Data All You Need?

Posted on 16 January 2025 by Wayne Joubert

Many are voicing concern that the world is running out of data and that this will be a blocker to progress toward smarter AI models. One paper in fact projects timelines for when we will run out.

AI researchers are looking for ways to adapt. Nvidia has trained a specific model to generate synthetic data for training other models. Some use this approach, though using AI-generated data to train AI is not without risk.

Others have asked a bigger question, namely, is something fundamentally missing in our approach that relies so heavily on data. Certainly the bitter lesson thesis and the position long advocated by Geoffrey Hinton argue for a data-first approach with “as few” prior assumptions as possible (though every model has a bias).

But it’s currently simply unknown whether just adding more data and compute will do the trick for achieving general intelligence or whether something else is needed. Neurosymbolic approaches are being experimented with, in various forms. But it’s unclear whether these can scale up to the level needed. And the frontier labs, laser-focused on the current paradigm, may not have adequate time or resources to investigate high-risk/high-reward alternatives.

From a theoretical standpoint, sometimes more data is simply not enough. As discussed in a previous post, some problems in mathematics and engineering require exponentially large amount of data to train neural network models. Exponentials can work in your favor, but also can work against you (think of the Tower of Hanoi problem or the Wheat and Chessboard problem). Some problems on certain models cannot be solved by any amount of data available in the entire universe.

The requirements for solving these problems can grow much more quickly than expected. The strength of neural networks, their flexibility, their universal approximation property, can also be a weakness. It can take so much data to nail down all the parameters so that the model is completely error free. Thankfully, many other problems that people want to solve (such as human language modeling) are fundamentally lower dimensional and thus less vulnerable to this problem.

We just don’t know whether the current data-hungry approach will be enough—or whether we’ll need to learn another bitter lesson.

Can AI models reason like a human?

Posted on 7 January 2025 by Wayne Joubert

We’re awaiting the release of OpenAI’s o3 model later this month. Its performance is impressive on very hard benchmarks like SWE-bench Verified, Frontier Math and the ARC AGI benchmark (discussed previously in this blog).

And yet at the same time some behaviors of the frontier AI models are very concerning.

Their performance on assorted math exams is outstanding, but they make mistakes doing simple arithmetic, like wrongly multiplying numbers that are a few digits long. Performance of the o1 preview model on the difficult Putnam math exam is excellent but drops precipitously under simple changes like renaming constants and variables in the problem statement.

Similarly, when o1 is applied to a planning benchmark expressed in standardized language, it performs well, but accuracy falls apart when applied to a mathematically equivalent planning problem in a different domain. And also, a given AI model applied to the simple ROT13 cipher can have wildly different performance based on the value of the cipher key, suggesting the models don’t really understand the algorithm.

It was the best of times, it was the worst of times, . . .

What is going on here?

For years now, some have made claims of “human-level performance” for various deep learning algorithms. And as soon as one party starts making claims like this, it’s hard for the others to resist doing the same.

The confusion is that, from a certain point of view, the claim of “human-level” is true—but the definition of “human-level” is fraught.

Here, “human-level” is taken to mean achievement of some high score on a benchmark set, ostensibly exceeding some human performance measure on the same benchmark. However, a single AI model can vary wildly in capability across behaviors—“smart” compared to humans in some ways, “dumb” in others.

For humans, test-taking is a proxy for measuring a range of skills and abilities. And even for humans it is not always an accurate proxy. A person can perform very well on academic tests and very poorly on the job, or vice versa.

And the capability ratios for AI models are very different still, in ways we don’t fully understand. So, outscoring humans on a software engineering benchmark doesn’t mean the AI has the whole panoply of coding skills, decision-making abilities, software architecture design savvy, etc., needed to be a competent software engineer.

It’s no surprise that recent articles (below) show a growing perception of the limitations of AI benchmarks as currently conceived.

Ways forward

Perhaps we should consider developing requirements like the following before claiming human-level reasoning performance of an AI model:

It should be able to “explain its work” at any level of detail to another human (just like a human can), in a way that that human can understand.
It should be able to give answers without “hallucinating” or “confabulating” (yes, humans can hallucinate too, but most occupations would not be well-served by an employee who hallucinates on the job).
It should be able to reliably and consistently (100% of the time) do things that we routinely expect a human or computer to do accurately (like add or multiply two numbers accurately, for things like filling out tax returns or doing engineering calculations to build an airplane).
It should be frank and honest in assessing its level of certainty about an answer it gives (no gaslighting).
It should be able to solve a trivial perturbation of a given problem with the same ease as the original problem (to the same extent that a human can).
As someone has said, it should be able to do, without specific training, what a 5 year old can do without specific training.
This one sounds good, from Emmett Shear: “AGI [artificial general intelligence] is the ability to generalize [without special training by a human] to an adversarially chosen new benchmark.”

AI models are fantastic and amazing tools—and best used when one has eyes wide open about their limitations.

Have you had problems with AI model performance? If so, please share in the comments.

References

Rethinking AI benchmarks: A new paper challenges the status quo of evaluating artificial intelligence, https://venturebeat.com/ai/rethinking-ai-benchmarks-a-new-paper-challenges-the-status-quo-of-evaluating-artificial-intelligence/

Rethink reporting of evaluation results in AI, https://www.science.org/doi/10.1126/science.adf6369, https://eprints.whiterose.ac.uk/198211/

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence, https://arxiv.org/abs/2402.09880

Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless, https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless

Why we must rethink AI benchmarks, https://bdtechtalks.com/2021/12/06/ai-benchmarks-limitations/

AI and the Everything in the Whole Wide World Benchmark, https://arxiv.org/abs/2111.15366

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices, https://arxiv.org/abs/2411.12990

Goodhart’s Law states that when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.