We’re awaiting the release of OpenAI’s o3 model later this month. Its performance is impressive on very hard benchmarks like SWE-bench Verified, Frontier Math and the ARC AGI benchmark (discussed previously in this blog).
Yet at the same time, some behaviors of these frontier AI models are very concerning.
Their performance on assorted math exams is outstanding, but they make mistakes doing simple arithmetic, like wrongly multiplying numbers that are a few digits long. Performance of the o1 preview model on the difficult Putnam math exam is excellent but drops precipitously under simple changes like renaming constants and variables in the problem statement.
Similarly, when o1 is applied to a planning benchmark expressed in a standardized language, it performs well, but its accuracy falls apart on mathematically equivalent planning problems recast in an unfamiliar domain. Likewise, a given AI model applied to the simple ROT (rotation) cipher can perform wildly differently depending on the value of the cipher key, handling the familiar ROT13 far better than other shift values, which suggests the models don’t really understand the algorithm.
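To see why key-dependence is telling, here is a minimal ROT-n (Caesar shift) sketch in Python; the function name rot_n and the sample strings are my own illustrative choices, not taken from any of the studies alluded to above. The algorithm is the same handful of lines for every key, so a solver that handles ROT13 but stumbles on, say, ROT3 has most likely memorized a common special case rather than learned the procedure.

```python
# A minimal ROT-n (Caesar shift) cipher, for illustration only.
# The code is identical for every key value: nothing about key = 13 is
# easier than key = 3, so a solver that succeeds only on ROT13 has not
# internalized the algorithm.

def rot_n(text: str, key: int) -> str:
    """Shift each letter forward by `key` positions, wrapping within the alphabet."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            out.append(ch)  # leave spaces, digits, and punctuation untouched
    return "".join(out)

print(rot_n("Attack at dawn", 13))  # Nggnpx ng qnja
print(rot_n("Attack at dawn", 3))   # Dwwdfn dw gdzq
```

Decoding is just rot_n(ciphertext, 26 - key), which is every bit as key-independent as encoding.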
It was the best of times, it was the worst of times, . . .
What is going on here?
For years now, some have made claims of “human-level performance” for various deep learning algorithms. And as soon as one party starts making claims like this, it’s hard for the others to resist doing the same.
The confusion is that, from a certain point of view, the claim of “human-level” is true—but the definition of “human-level” is fraught.
Here, “human-level” is taken to mean achievement of some high score on a benchmark set, ostensibly exceeding some human performance measure on the same benchmark. However, a single AI model can vary wildly in capability across behaviors—“smart” compared to humans in some ways, “dumb” in others.
For humans, test-taking is a proxy for measuring a range of skills and abilities. And even for humans it is not always an accurate proxy. A person can perform very well on academic tests and very poorly on the job, or vice versa.
And an AI model’s capability profile is different still from a human’s, in ways we don’t fully understand. So, outscoring humans on a software engineering benchmark doesn’t mean the AI has the whole panoply of coding skills, decision-making abilities, software architecture design savvy, etc., needed to be a competent software engineer.
It’s no surprise that recent articles (see the references below) reflect a growing recognition of the limitations of AI benchmarks as currently conceived.
Ways forward
Perhaps we should consider adopting requirements like the following before claiming human-level reasoning performance for an AI model:
- It should be able to “explain its work” at any level of detail to another human (just as a human can), in a way that person can understand.
- It should be able to give answers without “hallucinating” or “confabulating” (yes, humans can hallucinate too, but most occupations would not be well-served by an employee who hallucinates on the job).
- It should be able to reliably and consistently (100% of the time) do things that we routinely expect a human or a computer to do accurately, such as adding or multiplying two numbers, whether for filling out a tax return or for the engineering calculations behind an airplane (a trivial checking sketch follows this list).
- It should be frank and honest in assessing its level of certainty about an answer it gives (no gaslighting).
- It should be able to solve a trivial perturbation of a given problem with the same ease as the original problem (to the same extent that a human can).
- As someone has said, it should be able to do, without specific training, what a five-year-old can do without specific training.
- This one sounds good, from Emmett Shear: “AGI [artificial general intelligence] is the ability to generalize [without special training by a human] to an adversarially chosen new benchmark.”
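As a concrete (and deliberately trivial) illustration of the arithmetic requirement in the list above, here is a sketch of the kind of exact check an answer should always survive; the function name check_product and the example numbers are hypothetical, chosen only for illustration. Python integers are arbitrary-precision, so the reference value is exact.

```python
# Illustrative only: verify a claimed product against exact integer arithmetic.
# Python ints are arbitrary precision, so the reference computation is exact;
# an assistant doing tax or engineering arithmetic should pass checks like
# this 100% of the time.

def check_product(a: int, b: int, claimed: int) -> bool:
    """Return True only if `claimed` really equals a * b."""
    return a * b == claimed

# A few-digit multiplication of the sort frontier models still sometimes fumble.
print(check_product(4731, 9268, 43_846_908))  # True
print(check_product(4731, 9268, 43_846_808))  # False: off by 100
```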
AI models are fantastic and amazing tools—and best used when one has eyes wide open about their limitations.
Have you had problems with AI model performance? If so, please share in the comments.
References
Rethinking AI benchmarks: A new paper challenges the status quo of evaluating artificial intelligence, https://venturebeat.com/ai/rethinking-ai-benchmarks-a-new-paper-challenges-the-status-quo-of-evaluating-artificial-intelligence/
Rethink reporting of evaluation results in AI, https://www.science.org/doi/10.1126/science.adf6369, https://eprints.whiterose.ac.uk/198211/
Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence, https://arxiv.org/abs/2402.09880
Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless, https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless
Why we must rethink AI benchmarks, https://bdtechtalks.com/2021/12/06/ai-benchmarks-limitations/
AI and the Everything in the Whole Wide World Benchmark, https://arxiv.org/abs/2111.15366
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices, https://arxiv.org/abs/2411.12990
Comments

Goodhart’s Law states that when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.
When MS first came out with their LLM front end for Bing, I was quickly able to sucker it into making math mistakes and telling me that my calculator program was broken and that I should download the latest version (it was hilariously insistent that its wrong answers were correct). But MS quickly gussied up the front end to do arithmetic correctly.
Warning: I’m on the extreme end of the “the current round of AI is pure hype” crowd. Here’s why.
For me (MPhil in AI from Roger Schank, 1984), the point/job of AI is to understand how we humans are able to relate the content of the things we say to our (largely shared, very subtle, but not perfect) models of the real world. If we hear a story about kids hitting a baseball towards a house, we think broken glass. Not because of statistics, but because that’s how the world works.
In my view, 1970s/1980s AI failed at figuring out how to do that. In my opinion, that’s because human intelligence is seriously wonderful and amazing. But we did try. (Note that the current round of AI is characterized by a complete disrespect for human intelligence.)
With the LLM idea, AI has completely punted on that problem. The LLM guys discovered that the rather strange trick of random text generation based on an enormous sample base generates amazing-looking text. But the LLM doesn’t have any idea that there’s a real world out there and that text talks about it. It just crunches statistics. So when an LLM outputs a piece of text, that text has not been, in any way, related to some idea of reality _by the machine_. It’s just random text to the machine. But because it’s so plausible looking, a human looking at the text will do the work of figuring out that relationship and give the program credit for having actually done that work. Sometimes we notice that the text is crazy, and say, oops, the LLM hallucinated.
This is why I am of the opinion that this whole line of research is completely misdirected. I just don’t see it as reasonable to think that the zillion stories in the text base about kids and baseball bats mean that the machine “understands” kids and baseball bats.
As an undergrad, Joe Weizenbaum was a regular at my lunch group, and he was a lovely, gentle bloke. But when he saw his secretary typing her deepest secrets into ELIZA, he freaked out and became rather upset by/opposed to AI. (I found this irritating because he didn’t give us credit for trying to figure reasoning out, and thought it was cheap tricks all the way down.) I guess you could say that he saw ChatGPT coming.
Very well said, David J. Littleboy!