An AI Odyssey, Part 3: Lost Needle in the Haystack

While shopping on a major e-commerce site, I wanted to get an answer to an obscure question about a certain product.

Not finding the answer immediately on the product page, I thought I’d try clicking the AI shopping assistant helper tool to ask this question.

I waited with anticipation for an answer I would expect to be more informative and useful than a standard search result. But it was not to be. The AI tool had nothing worthwhile.

Then I decided on an old-fashioned keyword search across all the product reviews. And, lo and behold, I immediately found several credible reviews addressing my question.

It is not good usability when multiple search mechanisms exist but only one of them is reliable. And it is surprising that a retrieval-based approach (e.g., RAG) could not at least match the effectiveness of a simple keyword search over reviews.
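For a sense of how little machinery the keyword search needs, here is a minimal sketch in Python. The reviews are made up for illustration, and `keyword_search` is a hypothetical helper, not anything the site actually runs.

```python
import re

def keyword_search(reviews, query):
    """Return the reviews that contain every word of the query (case-insensitive)."""
    terms = re.findall(r"\w+", query.lower())
    hits = []
    for review in reviews:
        words = set(re.findall(r"\w+", review.lower()))
        if all(term in words for term in terms):
            hits.append(review)
    return hits

# Made-up reviews, for illustration only.
reviews = [
    "Works fine, but the cable is only three feet long.",
    "Great value. Arrived quickly.",
    "The cable length was too short for my desk.",
]

matches = keyword_search(reviews, "cable")
for review in matches:
    print(review)
```

A retrieval-based assistant with access to the same reviews would presumably have this text available as well, which is what makes the miss surprising.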

Models are capable, but effective integration can be lacking. Without improvements for cases like this, customers will not be satisfied users of these new AI tools.


Typesetting sheet music with AI

Lilypond is a TeX-like typesetting language for sheet music. I’ve had good results asking AI to generate Lilypond code, which is surprising given the obscurity of the language. There can’t be that much publicly available Lilypond code to train on.

I’ve mostly generated Lilypond code for posts related to music theory, such as the post on the James Bond chord. I was curious how well AI would work if I uploaded an image of sheet music and asked it to produce corresponding Lilypond code.

In a nutshell, the results were hilariously bad as far as the sheet music produced. But Grok did a good job of recognizing the source of the clips.

Test images

Here are the two images I used, one of classical music

and one of jazz.

I used the same prompt for both images with Grok and ChatGPT: Write Lilypond code corresponding to the attached sheet music image.

Classical results

Grok

Here’s what I got when I compiled the code Grok generated for the first image.

This bears no resemblance to the original, turning one measure into eight. However, Grok correctly inferred that the excerpt was by Bach, and the music it composed (!) is in the style of Bach, though it is not at all what I asked for.

ChatGPT

Here’s the corresponding output from ChatGPT.

Not only did ChatGPT hallucinate, it hallucinated in two-part harmony!

Jazz results

One reason I wanted to try a jazz example was to see what would happen with the chord symbols.

Grok

Here’s what Grok did with the second sheet music image.

The notes are almost unrelated to the original, though the chords are correct. The only difference is that Grok uses the notation Δ for a major 7th chord; both notations are common. And Grok correctly inferred the title of the song.

I edited the image above. I didn’t change any notes, but I moved the title to center it over the music. I also cut out the music and lyrics credits to make the image fit on the page more easily. Grok correctly credited Johnny Burke and Jimmy Van Heusen for the lyrics and music.

ChatGPT

Here’s what I got when I compiled the Lilypond code that ChatGPT produced. The chords are correct, as above. The notes bear some similarity to the original, though ChatGPT took the liberty of changing the key and the time signature, and the last measure has seven and a half beats.

ChatGPT did not speculate on the origin of the clip, but when I asked “What song is this music from?” it responded with “The fragment appears to be from the jazz standard ‘Misty.’”

From logistic regression to AI

It is sometimes said that neural networks are “just” logistic regression. (Remember neural networks? LLMs are neural networks, but nobody talks about neural networks anymore.) In some sense a neural network is logistic regression with more parameters, a lot more parameters, but more is different. New phenomena emerge at scale that could not have been anticipated at a smaller scale.

Logistic regression can work surprisingly well on small data sets. One of my clients filed a patent on a simple logistic regression model I created for them. You can’t patent logistic regression—the idea goes back to the 1840s—but you can patent its application to a particular problem. Or at least you can try; I don’t know whether the patent was ever granted.

Some of the clinical trial models that we developed at MD Anderson Cancer Center were built on Bayesian logistic regression. These methods were used to run early phase clinical trials, with dozens of patients. Far from “big data.” Because we had modest amounts of data, our models could not be very complicated, though we tried. The idea was that informative priors would let you fit more parameters than would otherwise be possible. That idea was partially correct, though it leads to a sensitive dependence on priors.

When you don’t have enough data, additional parameters do more harm than good, at least in the classical setting. Over-parameterization is bad in classical models, though over-parameterization can be good for neural networks. So for a small data set you commonly have only two parameters. With a larger data set you might have three or four.

There is a rule of thumb that you need at least 10 events per parameter (EVP) [1]. For example, if you’re looking at an outcome that happens, say, 20% of the time, you need about 50 data points per parameter. If you’re analyzing a clinical trial with 200 patients, you could fit a four-parameter model. But those four parameters had better pull their weight, and so you typically compute some sort of information criterion (AIC, BIC, DIC, etc.) to judge whether the data justify a particular set of parameters. Statisticians agonize over each parameter because it really matters.
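The arithmetic behind the rule of thumb can be spelled out in a few lines of Python. This is only a sketch: the function names are made up here, and it assumes the event rate given is the rate of the outcome being counted as an "event."

```python
def data_points_per_parameter(event_rate, events_per_parameter=10):
    """Samples needed per parameter so that each parameter sees enough events."""
    return events_per_parameter / event_rate

def max_parameters(n_samples, event_rate, events_per_parameter=10):
    """Largest number of parameters the rule of thumb allows for a data set."""
    expected_events = round(n_samples * event_rate)
    return expected_events // events_per_parameter

per_parameter = data_points_per_parameter(0.20)  # about 50 data points
trial_budget = max_parameters(200, 0.20)         # 40 events, so 4 parameters

print(per_parameter, trial_budget)
```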

Imagine working in the world of modest-sized data sets, carefully considering one parameter at a time for inclusion in a model, and hearing about people fitting models with millions, and later billions, of parameters. It just sounds insane. And sometimes it is insane [2]. And yet it can work. Not automatically; developing large models is still a bit of a black art. But large models can do amazing things.

How do LLMs compare to logistic regression as far as the ratio of data points to parameters? Various scaling laws have been suggested. These laws have some basis in theory, but they’re largely empirical, not derived from first principles. “Open” AI no longer shares stats on the size of their training data or the number of parameters they use, but other models do, and as a very rough rule of thumb, models are trained using around 100 tokens per parameter, which is not very different from the EVP rule of thumb for logistic regression.

Simply counting tokens and parameters doesn’t tell the full story. In a logistic regression model, data are typically binary variables, or maybe categorical variables coming from a small number of possibilities. Parameters are floating point values, typically 64 bits, but maybe the parameter values are important to three decimal places or 10 bits. In the example above, 200 samples of 4 binary variables determine 4 ten-bit parameters, so 20 bits of data for every bit of parameter. If the inputs were 10-bit numbers, there would be 200 bits of data for every bit of parameter.

When training an LLM, a token is typically a 32-bit number, not a binary variable. And a parameter might be a 32-bit number, but quantized to 8 bits for inference [3]. If a model uses 100 tokens per parameter, that corresponds to 400 bits of training data per inference parameter bit.
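Here is the bit-counting from the last two paragraphs as a short Python check. The helper name is invented for this sketch, and the inputs are just the round numbers used in the text.

```python
def data_bits_per_parameter_bit(n_samples, values_per_sample, bits_per_value,
                                n_parameters, bits_per_parameter):
    """Ratio of total bits of training data to total bits of parameters."""
    data_bits = n_samples * values_per_sample * bits_per_value
    parameter_bits = n_parameters * bits_per_parameter
    return data_bits / parameter_bits

# Logistic regression: 200 samples of 4 binary inputs, 4 ten-bit parameters.
logistic_binary = data_bits_per_parameter_bit(200, 4, 1, 4, 10)

# Same model, but with 10-bit inputs.
logistic_10bit = data_bits_per_parameter_bit(200, 4, 10, 4, 10)

# LLM: 100 tokens of 32 bits each for every parameter, 8-bit parameters at inference.
llm = data_bits_per_parameter_bit(100, 1, 32, 1, 8)

print(logistic_binary, logistic_10bit, llm)
```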

In short, the ratio of data bits to parameter bits is roughly similar between logistic regression and LLMs. I find that surprising, especially because there’s a sort of no man’s land between [2] a handful of parameters and billions of parameters.


[1] P. Peduzzi, J. Concato, E. Kemper, T. R. Holford, A. R. Feinstein. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology 1996 Dec; 49(12):1373-9. doi: 10.1016/s0895-4356(96)00236-3.

[2] A lot of times neural networks don’t scale down to the small data regime well at all. It took a lot of audacity to believe that models would perform disproportionately better with more training data. Classical statistics gives you good reason to expect diminishing returns, not increasing returns.

[3] There has been a lot of work lately to find low precision parameters directly. So you might find 16-bit parameters directly rather than finding 32-bit parameters and then quantizing to 16 bits.

An AI Odyssey, Part 2: Prompting Peril

I was working with a colleague recently on a project involving the use of the OpenAI API.

I brought up the idea that perhaps it is possible to improve the accuracy of API responses by modifying the API call to increase the amount of reasoning performed.

My colleague quickly asked ChatGPT if this was possible, and the answer came back, “No, it’s not possible to do that.” Then I asked essentially the same question to my own instance of ChatGPT, and the answer was, “Yes, you can do it, but you need to use the OpenAI Responses API.”

How did we get such different answers? Was it the wording of the prompt? Was it the custom instructions given in the account personalization, where you describe who you are and how you want ChatGPT to respond? Was it possibly a different conversation history? Many factors could have contributed to the different responses. Unfortunately, many of these factors are either not easily controllable at the user level or not convenient to vary in a protracted trial-and-error search.

There have been other times when I first get a highly standardized, generic answer from ChatGPT, even in Thinking mode, that I know is not quite right or that just seems off. Then when I push back, I may get a profoundly different answer.

It’s simply a fact that large language models are conditional probabilistic systems that do not guarantee reproducibility in practice, even given the same inputs, even at temperature=0 [1]. Their outputs depend sensitively on prompt wording, context window contents, system instructions, and model configuration. Small differences in these inputs can yield substantially different outputs.

How well an AI chatbot responds can obviously have a massive impact on how effective the tool will be for your use case. Differences in responses could materially affect the outcome of your project. I take this as a wake-up call to be persistent, vigilant and flexible in attempts to obtain reliable answers from these new AI tools.

Notes

[1] (some) sources of nondeterminism: floating point / GPU nondeterminism, differing order of operations from distributed collectives, ties or near-ties in token probabilities, backend/infrastructure changes, model routing, hidden system prompt differences or tool availability.
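The order-of-operations item in the note above is easy to demonstrate on a small scale. The sketch below is plain Python, not GPU code, and `sum_in_order` is a name made up for the illustration: the same three numbers are added in two different orders and the results differ in the last bit.

```python
def sum_in_order(values, order):
    """Add the values one at a time, visiting them in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

values = [0.1, 0.2, 0.3]
forward = sum_in_order(values, [0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = sum_in_order(values, [2, 1, 0])  # (0.3 + 0.2) + 0.1

print(forward, backward, forward == backward)
```

A difference this small is harmless on its own, but when two candidate tokens have nearly equal probabilities, a discrepancy in the last bit can be enough to change which one is chosen, and everything generated afterward can then diverge.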

An AI Odyssey, Part 1: Correctness Conundrum

I recently talked with a contact who repeated what he’d heard regarding agentic AI systems—namely, that they can greatly increase productivity in professional financial management tasks. However, I pointed out that though this is true, these tools do not guarantee correctness, so one has to be very careful letting them manage critical assets such as financial data.

It is widely recognized that AI models, even reasoning models and agentic systems, can make mistakes. One example is a case showing that one of the most recent and capable AI models made multiple factual mistakes in drawing together information for a single slide of a presentation. Sure, people can give examples where agentic systems can perform amazing tasks. But it’s another question as to whether they can always do them reliably. Unfortunately, we do not yet have procedural frameworks that provide reliability guarantees comparable to those required in other high-stakes engineering domains.

Many leading researchers have acknowledged that current AI systems have a degree of technical unpredictability that has not been resolved. For example, one has recently said, “Anyone who has worked with AI models understands that there is a basic unpredictability to them, that in a purely technical way we have not solved.”

What industrial-strength reliability looks like

Manufacturing has the notion of Six Sigma quality, to reduce the level of manufacturing defects to an extremely low level. In computing, the correctness requirements are even higher, sometimes necessitating provable correctness. The Pentium FDIV bug in the 1990s caused actual errors in calculations to occur in the wild, even though the chance of error was supposedly “rare.” These were silent errors that might have occurred undetected in mission critical applications, leading to failure. This being said, the Pentium FDIV error modes were precisely definable, whereas AI models are probabilistic, making it much harder to bound the errors.

Exact correctness is at times considered so important that there is an entire discipline, known as formal verification, to prove specified correctness properties for critical hardware and software systems (for example, the manufacture of computing devices). These methods play a key role in multi-billion dollar industries.

When provable correctness is not available, having at least a rigorous certification process (see here for one effort) is a step in the right direction.

It has long been recognized that reliability is a fundamental challenge in modern AI systems. Despite dramatic advances in capability, strong correctness guarantees remain an open technical problem. The central question is how to build AI systems whose behavior can be bounded, verified, or certified at domain-appropriate levels. Until this is satisfactorily resolved, we should use these incredibly useful tools in smart ways that do not create unnecessary risks.

AGI, ASI, A*I – Do we have all we need to get there?

Demis: “[to get to AGI] maybe there’s one or two big innovations needed”

Sam: “everything based off what we see today is that it will happen.”

Ilya: “But is the belief really that if you just 100x the scale, everything would be transformed? I don’t think that’s true.”

Dario: “If you just kind of like eyeball the rate at which these capabilities are increasing, it does make you think that we’ll get there by 2026 or 2027.”

Jerry: “is [the transformer architecture] the last thing? I’m pretty sure it isn’t.”

For years leading researchers have been speculating one way or the other as to whether better algorithms are needed to get to AGI, artificial general intelligence (however that might be defined).

Around the time of the release of GPT-4, some were saying they felt something more was needed. Since then, we have had several major new advances, like reasoning models and tool use. If we’d said, “we don’t need anything else” three years ago, where would we be now?

For frankness, I like this from John Schulman: “it’s hard to know what we need.” And for strategy, Demis: “you can think of as 50% of our effort is on scaling, 50% of it is on innovation. My betting is you’re going to need both to get to AGI.”

Experiences with GPT-5-Codex

OpenAI Codex is now generally available (see here, here). I’m using the Codex extension in the Cursor code editor with my OpenAI account.

Codex is very helpful for some tasks, such as complex code refactoring, implementing skeleton code for an operation, or writing a single small self-contained piece of code. Models have come a long way from a year ago when bugs in a 30-line piece of generated code were not uncommon.

Some have reported 40-60% overall productivity increase from coding agents. I used Codex recently for a complicated code refactoring that I estimate would have taken over 10x more time to plan and execute without an assistant.

The coding agent is less effective for some other tasks. For example, I asked it to ensure that all Python function signatures in my code had type hints, and it missed many cases.
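A check like this is easy to automate, which makes it a useful way to verify the agent’s work. Below is a minimal sketch using Python’s standard `ast` module. The function `functions_missing_hints` is a hypothetical helper written for this post, and it makes one simplifying assumption: parameters named `self` or `cls` are exempt from needing annotations.

```python
import ast

def functions_missing_hints(source):
    """Return sorted (line, name) pairs for functions lacking full type hints."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        params = args.posonlyargs + args.args + args.kwonlyargs
        if args.vararg is not None:
            params.append(args.vararg)
        if args.kwarg is not None:
            params.append(args.kwarg)
        # Assumption: self and cls do not need annotations.
        unannotated = [p.arg for p in params
                       if p.annotation is None and p.arg not in ("self", "cls")]
        if unannotated or node.returns is None:
            missing.append((node.lineno, node.name))
    return sorted(missing)

example = '''
def typed(x: int) -> int:
    return x

def untyped(x, y=2):
    return x + y
'''

report = functions_missing_hints(example)
print(report)
```

Running something like this over each file after the agent claims to be done gives a concrete list of the cases it missed.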

Also, some have reported that the new Claude Sonnet 4.5 runs much faster, though Codex is being continually improved.

Obviously, to be effective, these models need adequate test case coverage to debug against. Without this, the coding agent can get really lost.

My approach to using the agent is very hands-on. Before letting it make any changes, I discuss the change plan in detail and make corrections as needed. (I sometimes need to remind the agent repeatedly to wait until I say “start” before commencing changes.) Also when appropriate, I ask the model to make changes one step at a time instead of all in one go. This not only makes for a result that is more understandable and maintainable by humans, but also is more likely to give a good quality result.

Some have cautioned of the hazards of using coding agents. One concern is that a rogue coding agent could do something drastic like delete your code base or data files (both theoretically, and for some models, actually). One remedy is to set up your own sandbox to run the agent in, for example, a virtual machine, that has very locked-down access and no access to sensitive data. This may be cumbersome for some workflows, but for others may be a good security measure.

Also, some have warned that an agent can introduce dangerous security bugs in code. A remedy for this is to manually review every piece of code that the agent produces. This introduces some added developer overhead, though still in my experience it is much faster than writing the same code without the agent. And it is much better than just pushing a button to generate a big blob of incomprehensible code.

Coding agents have greatly improved over the last several months. Software development practices are presently passing through a point of no return, permanently changed by the new AI-enabled coding assistants. Even very bright people, who are already extremely skilled, are benefitting from using these tools.

GPT-5 for AI-assisted discovery

Many hope that AI will be “smart enough” to make breakthrough scientific discoveries in the near future, such as finding a cure for cancer. Some research efforts have sought to create an “AI Scientist” that can make discoveries in an automated or semi-automated way; for a recent example, see [1]. Others [2] have called out the methodological pitfalls of some of these attempts. Still others question whether a truly original discovery is even possible at all for an AI.

OpenAI released GPT-5 on August 7. Some thought it lackluster and falling behind compared to expectations. Others however found performance in some areas to be much advanced compared to its predecessor.

Two recent reports show the new model’s utility. Scott Aaronson published a paper last month [3], [4] in which “a key technical step in the proof of the main result came from AI.” Also, Terence Tao reported earlier this month [5] his use of ChatGPT to find a first counterexample to an unsolved mathematics problem.

I’m sure this resonates with the experience of other researchers using the tool. Recently, in the course of a discussion I had with ChatGPT, it came up with a new algorithmic recipe for something I was working on, based on ideas combined from several papers but in itself apparently original. That was a very simple case—but on a grander scale, connecting two ideas together in a novel way can lead to a real breakthrough. For example, Faltings’ proof of the Mordell Conjecture in 1983 was based on recognizing a subtle internal connection among some already existing theorems.

There is always the specter of concern that an idea “maybe was already in the training data.” It can be difficult to prove otherwise. But deep domain experts like Scott Aaronson and Terence Tao are very likely to know whether the idea is truly an original, never-before-published result.

If past is prologue, we can hope for more powerful models in the future that can solve increasingly hard problems.

Notes

[1] Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, Yue Zhang, “DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively.” https://arxiv.org/abs/2509.26603.

[2] Ziming Luo, Atoosa Kasirzadeh, Nihar B. Shah, “The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems.” https://arxiv.org/abs/2509.08713.

[3] “The QMA Singularity.” https://scottaaronson.blog/?p=9183

[4] Scott Aaronson, Freek Witteveen, “Limits to black-box amplification in QMA.” https://arxiv.org/abs/2509.21131.

[5] https://mathstodon.xyz/@tao/115306424727150237

 

The AI fork in the road

If AI can do part of your job, you’re likely to either be fired or promoted.

AI will always have some error rate, and if it can do your job at an acceptable error rate, you’ll need to find another job.

But if the error rate is not acceptable, the ability to identify and fix errors becomes more valuable.

People with the experience to recognize errors and fix them quickly are more productive when delegating work to AI. Higher productivity should result in higher wages, though you may have to become self-employed to be paid more for getting more done.

Contractors and consultants are more often paid in proportion to the value of their work than salaried employees are, especially since 1971.

Recognizing errors takes experience. This is especially true of LLM-generated errors because LLMs always return plausible results (according to some model) though they do not always return correct results.

Why are CUDA kernels hard to optimize?

Explosive datacenter demand has caused developers to leave no stone unturned in search of higher efficiencies. The DeepSeek team, not satisfied with Nvidia’s CUDA libraries, used a virtualized form of assembly language (PTX) to write kernel codes to accelerate their AI computations. Others have attempted to generate optimized kernels using AI, though some results have been questioned (for various attempts, see also here, here, here, here and here).

Why is it hard to write peak-speed GPU code? Writing really fast code has always been arduous, but it seems especially so for modern GPUs.

To understand the issues, my colleagues and I performed a detailed study of GPU kernel performance, across eight different GPU models from three GPU vendors [1]. The test case we considered was low precision matrix multiply, a resource-intensive operation for LLM training. We ran many, many experiments to understand what causes performance variability and why kernels sometimes run slower than you’d think they should.

For the cases we studied, we found about half a dozen different factors, but the upshot is this: modern processors like GPUs have become so complex—notably their multi-layered hierarchical memory subsystems—that it is difficult to get consistently high performance across all problem sizes a user might want to run in practice. As a result, the performance for the target problem might be surprisingly and mysteriously less than the advertised peak performance for the operation in question. The reasons might be obvious—like cache line misalignment—or more opaque. For the matrix multiply case, various issues like the need for prefetching, caching, tiling and block size selection, make it difficult for the kernel developer to optimize for every input size a user might specify.
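To illustrate how tiling alone can make performance depend sharply on problem size, here is a toy model in Python. It is not taken from our paper and is not a measurement. It assumes a kernel that computes the output matrix in fixed 128 by 128 tiles (a hypothetical tile size) and does full work for partially filled tiles, so some of the work is wasted whenever a dimension is not a multiple of the tile size.

```python
import math

def tile_utilization(m, n, tile_m=128, tile_n=128):
    """Fraction of tile work that is useful for an m-by-n output matrix.

    Toy model: the kernel launches whole tiles, so a partially filled
    tile costs as much as a full one. Tile sizes are hypothetical.
    """
    tiles_m = math.ceil(m / tile_m)
    tiles_n = math.ceil(n / tile_n)
    return (m * n) / (tiles_m * tile_m * tiles_n * tile_n)

aligned = tile_utilization(1024, 1024)     # dimensions are multiples of the tile size
misaligned = tile_utilization(1025, 1025)  # one extra row and column of tiles

print(f"{aligned:.3f} {misaligned:.3f}")
```

In this model, growing each dimension by one element past a tile boundary drops the useful fraction of work from 100% to under 80%. Real kernels have many more interacting factors, but this shows how adjacent problem sizes can behave very differently.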

Below is an example graphic from our paper. The color indicates floating point operation rate (FLOPS) for a reduced precision matrix multiply on a representative GPU using a library call. The horizontal and vertical axes refer to the matrix dimensions for the problem (see paper for details). Though some regions show performance near the theoretical peak (red), other immediately adjacent regions show problem sizes that run dramatically slower, in fact at only about half of peak performance or less. Presumably this is because either individual kernel performance or the selection of kernels used by the library is suboptimal. The net outcome is that if your problem lands in a “bad” region, you’re in for a big surprise: your performance will be much less than expected, and you may not understand why. All high-performing GPUs we tested showed irregular behaviors such as this [2] [3].

In the past this was not always a problem. Older architectures like Sun Sparc or Cray vector processors, complex as they were, were simple enough that a reasonably well-tuned computational kernel might run well across most if not all inputs [4]. Today, performance is much harder to predict and can vary substantially based on the requested problem sizes.

This is a tough challenge for library developers. Whenever a new GPU model family comes out, new kernel optimization and tuning are required to give (hopefully) more consistently high performance, and some cases get more developer attention than others due to customer needs and limited developer resources. As a result, infrequently used operations do not get as much attention, but they may be the exact ones you need for your particular case [5].

Tools are available to help optimize for specific cases. The excellent Nvidia CUTLASS library exposes access to many more fine-grained options compared to the standard cuBLAS library. The not faint of heart can try programming Nvidia GPUs at the level of PTX, or (shudder) SASS. Superoptimization might help, but only for very small code fragments and even then there may be too many external factors influencing performance to make it effective.

Autotuning is a promising approach, though it doesn’t seem to have reached its full potential in production. AI might really help here [6]; in our own paper we had some success using machine learning methods like decision trees and random forests to model performance as a function of problem size, though our work was exploratory and not production-ready. A well-crafted general solution would seem to require a lot of effort to do right. Code sustainability and maintenance are also critical; a sustainable workflow would be needed to retrain on new GPUs, new CUDA releases and even site-specific and system-specific settings like GPU power and frequency cap policies.

Most recent AI-driven work focuses on optimizing performance for one or a few problem sizes only. A truly production-quality general purpose tool would give both 100% accurate results and also top achievable performance for any input problem size (even for corner cases) or data type. This would require both optimized GPU kernels and an optimal kernel dispatcher for kernel selection. And the method would need to be robust to issues like power and frequency variabilities in production runs. This currently seems to be an unsolved problem. Solving it would be of huge benefit to the hyperscaler community.

Notes

[1] For related work from a slightly different angle, see this excellent work from Matt Sinclair’s lab.

[2] It turned out this study was helpful to us for production runs, to help us to triage an odd performance conundrum we encountered when attempting an exascale run (see here, here).

[3] Incidentally this example shows the hazards of simplistic benchmark suites to measure GPU code performance. Unless the benchmark captures a truly large and varied set of input cases, any new optimization method proposed can artificially “overfit” performance on the tests and still underperform miserably on many user cases of interest.

[4] I once wrote a 1-D wavelet convolution kernel for a Sparc processor, using a circular register buffer and loop unrolling to minimize loads and stores, thus achieving near-peak performance. The code was correctly compiled from C to assembly, and performance for a given problem was almost precisely predictable. That was before the days of complex memory hierarchies.

[5] One vendor I know of used to take customer requests for hand tuning expensive library calls and made them run fast at the specific customer problem sizes.

[6] LLM kernel generation seems like a natural fit, particularly since LLM-generated code quality has much improved in recent months. Kernel selection and parameter selection for block size, tiling etc. might be better solved by direct training of machine learning models, or methods like this. Comparative studies on this would be informative.