An AI Odyssey, Part 3: Lost Needle in the Haystack

While shopping on a major e-commerce site, I wanted to get an answer to an obscure question about a certain product.

Not finding the answer immediately on the product page, I thought I’d try clicking the AI shopping assistant helper tool to ask this question.

I waited with anticipation for an answer I would expect be more informative and useful than a standard search result. But it was not to be. The AI tool had nothing worthwhile.

Then I decided on an old-fashioned keyword search across all the product reviews. And, lo and behold, I immediately found several credible reviews addressing my question.

It is not good usability when multiple search mechanisms exist but only one of them is reliable. And it is surprising that a retrieval-based approach (e.g., RAG) could not at least match the effectiveness of a simple keyword search over reviews.

Models are capable, but effective integration can be lacking. Without improvements for cases like this, customers will not be satisfied users of these new AI tools.

Related posts

An AI Odyssey, Part 2: Prompting Peril

I was working with a colleague recently on a project involving the use of the OpenAI API.

I brought up the idea that, perhaps it is possible to improve the accuracy of API response by modifying the API call to increase the amount of reasoning performed.

My colleague quickly asked ChatGPT if this was possible, and the answer came back “No, it’s not possible to do that.” then I asked essentially the same question to my own instance of ChatGPT, and the answer was, “Yes, you can do it, but you need to use the OpenAI Responses API.”

How did we get such different answers? Was it the wording of the prompt? Was it the custom instructions given in the account personalization, where you describe who you are and how you want ChatGPT to respond? Is it possibly different conversation history? Many factors could have contributed to the different response. Unfortunately, many of these factors are either not easily controllable at the user level or not convenient to change to alternatives in a protracted trial and error search.

I’ve had other times when I will first get a highly standardized, generic answer from ChatGPT, even in Thinking mode, that I know is not quite right or just seems off. Then when I push back, I may get a profoundly different answer.

It’s simply a fact that large language models are conditional probabilistic systems that do not guarantee reproducibility in practice, even given the same inputs, even at temperature=0 [1]. Their outputs depend sensitively on prompt wording, context window contents, system instructions, and model configuration. Small differences in these inputs can yield substantially different outputs.

How well an AI chatbot responds can obviously have a massive impact on how effective the tool will be for your use case. Differences in responses could materially affect the outcome of your project. I take this as a wake-up call to be persistent, vigilant and flexible in attempts to obtain reliable answers from these new AI tools.

Notes

[1] (some) sources of nondeterminism: floating point / GPU nondeterminism, differing order of operations from distributed collectives, ties or near-ties in token probabilities, backend/infrastructure changes, model routing, hidden system prompt differences or tool availability.

An AI Odyssey, Part 1: Correctness Conundrum

I recently talked with a contact who repeated what he’d heard regarding agentic AI systems—namely, that they can greatly increase productivity in professional financial management tasks. However, I pointed out that though this is true, these tools do not guarantee correctness, so one has to be very careful letting them manage critical assets such as financial data.

It is widely recognized that AI models, even reasoning models and agentic systems, can make mistakes. One example is a case showing that one of the most recent and capable AI models made multiple factual mistakes in drawing together information for a single slide of a presentation.  Sure, people can give examples where agentic systems can perform amazing tasks. But it’s another question as to whether they can always do them reliably. Unfortunately, we do not yet have procedural frameworks that provides reliability guarantees that are comparable to those required in other high-stakes engineering domains.

Many leading researchers have acknowledged that current AI systems have a degree of technical unpredictability that has not been resolved. For example, one has recently said, “Anyone who has worked with AI models understands that there is a basic unpredictability to them, that in a purely technical way we have not solved.”

What industrial-strength reliability looks like

Manufacturing has the notion of Six Sigma quality, to reduce the level of manufacturing defects to an extremely low level. In computing, the correctness requirements are even higher, sometimes necessitating provable correctness. The Pentium FDIV bug in the 1990s caused actual errors in calculations to occur in the wild, even though the chance of error was supposedly “rare.” These were silent errors that might have occurred undetected in mission critical applications, leading to failure. This being said, the Pentium FDIV error modes were precisely definable, whereas AI models are probabilistic, making it much harder to bound the errors.

Exact correctness is at times considered so important that there is an entire discipline, known as formal verification, to prove specified correctness properties for critical hardware and software systems (for example, the manufacture of computing devices). These methods play a key role in multi-billion dollar industries.

When provable correctness is not available, having at least a rigorous certification process (see here for one effort) is a step in the right direction.

It has long been recognized that reliability is a fundamental challenge in modern AI systems. Despite dramatic advances in capability, strong correctness guarantees remain an open technical problem. The central question is how to build AI systems whose behavior can be bounded, verified, or certified at domain-appropriate levels. Until this is satisfactorily resolved, we should use these incredibly useful tools in smart ways that do not create unnecessary risks.

AGI, ASI, A*I – Do we have all we need to get there?

Demis: “[to get to AGI] maybe there’s one or two big innovations needed

Sam: “everything based off what we see today is that it will happen.”

Ilya: “But is the belief really that if you just 100x the scale, everything would be transformed? I don’t think that’s true.

Dario: “If you just kind of like eyeball the rate at which these capabilities are increasing, it does make you think that we’ll get there by 2026 or 2027.

Jerry: “is [the transformer architecture] the last thing? I’m pretty sure it isn’t.

For years leading researchers have been speculating one way or the other as to whether better algorithms are needed to get to AGI, artificial general intelligence (however that might be defined).

Around the time of the release of GPT-4, some were saying they felt something more was needed. Since then, we have had several major new advances, like reasoning models and tool use. If we’d said, “we don’t need anything else” three years ago, where would we be now?

For frankness, I like this from John Schulman: “it’s hard to know what we need.” And for strategy, Demis: “you can think of as 50% of our effort is on scaling, 50% of it is on innovation. My betting is you’re going to need both to get to AGI.”

Experiences with GPT-5-Codex

OpenAI Codex is now generally available (see here, here). I’m using the Codex extension in the Cursor code editor with my OpenAI account.

Codex is very helpful for some tasks, such as complex code refactoring, implementing skeleton code for an operation, or writing a single small self-contained piece of code. Models have come a long way from a year ago when bugs in a 30-line piece of generated code were not uncommon.

Some have reported 40-60% overall productivity increase from coding agents. I used Codex recently for a complicated code refactoring that I estimate would have taken over 10x more time to plan and execute without an assistant.

The coding agent is less effective for some other tasks. For example, I asked it to ensure that all Python function signatures in my code had type hints, and it missed many cases.

Also, some have reported that the new Claude Sonnet 4.5 runs much faster, though Codex is being continually improved.

Obviously to be effective, these models must have access to adequate test case coverage, to enable the models to debug against. Without this, the coding agent can get really lost.

My approach to using the agent is very hands-on. Before letting it make any changes, I discuss the change plan in detail and make corrections as needed. (I sometimes need to remind the agent repeatedly to wait until I say “start” before commencing changes). Also when appropriate, I ask the model to make changes one step at a time instead of all in one go. This not only makes for a result that is more understandable and maintainable by humans, but also is more likely to give a good quality result.

Some have cautioned of the hazards of using coding agents. One concern is that a rogue coding agent could do something drastic like delete your code base or data files (both theoretically, and for some models, actually). One remedy is to set up your own sandbox to run the agent in, for example, a virtual machine, that has very locked-down access and no access to sensitive data. This may be cumbersome for some workflows, but for others may be a good security measure.

Also, some have warned that an agent can introduce dangerous security bugs in code. A remedy for this is to manually review every piece of code that the agent produces. This introduces some added developer overhead, though still in my experience it is much faster than writing the same code without the agent. And it is much better than just pushing a button to generate a big blob of incomprehensible code.

Coding agents have greatly improved over the last several months. Software development practices are presently passing through a point of no return, permanently changed by the new AI-enabled coding assistants. Even very bright people, who are already extremely skilled, are benefitting from using these tools.

GPT-5 for AI-assisted discovery

Many hope that AI will be “smart enough” to make breakthrough scientific discoveries in the near future, such as find a cure for cancer. Some research efforts have sought to create an “AI Scientist” that can make discoveries in an automated or semi-automated way; for a recent example, see [1]. Others [2] have called out the methodological pitfalls of some of these attempts. Still others question whether  a truly original discovery is even possible at all for an AI.

OpenAI released GPT-5 on August 7. Some thought it lackluster and falling behind compared to expectations. Others however found performance in some areas to be much advanced compared to its predecessor.

Two recent reports show the new model’s utility. Scott Aaronson published a paper last month [3], [4] in which “a key technical step in the proof of the main result came from AI.” Also, Terence Tao reported earlier this month [5] his use of ChatGPT to find a first counterexample to an unsolved mathematics problem.

I’m sure this resonates with the experience of other researchers using the tool. Recently, in the course of a discussion I had with ChatGPT, it came up with a new algorithmic recipe for something I was working on, based on ideas combined from several papers but in itself apparently original. That was a very simple case—but on a grander scale, connecting two ideas together in a novel way can lead to a real breakthrough. For example, Faltings’ proof of the Mordell Conjecture in 1983 was based on recognizing a subtle internal connection among some already existing theorems.

There is always the specter of concern that an idea “maybe was already in the training data.” It can be difficult to prove otherwise. But deep domain experts like Scott Aaronson and Terence Tao are likely to know with high likelihood whether the idea is truly an original never-before-published result or not.

If past is prologue, we can hope for more powerful models in the future that can solve increasingly hard problems.

Notes

[1] Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, Yue Zhang, “DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively.” https://arxiv.org/abs/2509.26603.

[2] Ziming Luo, Atoosa Kasirzadeh, Nihar B. Shah, “The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems.” https://arxiv.org/abs/2509.08713.

[3] “The QMA Singularity.” https://scottaaronson.blog/?p=9183

[4] Scott Aaronson, Freek Witteveen, “Limits to black-box amplification in QMA.” https://arxiv.org/abs/2509.21131.

[5] https://mathstodon.xyz/@tao/115306424727150237

 

An inverse problem for the quadratic equation

In elementary algebra we learn how to solve ax2 + bx + c = 0 for roots r = r1, r2. But how about the inverse problem: given a real-valued r, find integer coefficients a, b and c such that r is a root of the corresponding quadratic equation.

A simple approximation can be found by setting b=0, so r2 = – c/a. By appropriately choosing large values of a and c, one can approximate the solution to this inverse problem to arbitrary accuracy.

Are there any other solutions (possibly taking advantage of b for a better approximation)?

Two families of algorithms are available: PSLQ [1-3] and LLL [4-5]. The underlying problem formulation behind these algorithms can be understood geometrically. For fixed x, L(a, b, c) = ax2 + bx + c is a linear function of (a, b, c), here considering a, b and c as real-valued. Thus, for such x, the equation L(a, b, c) = 0 defines a known plane in 3-D space. However, we seek a, b, c that are integers. The problem then is to find a lattice point (a, b, c) with a, b, c integers that is “near” to this plane (for example, close to the plane within tolerance τ, which means a 2-D thin “slab” of thickness 2τ surrounding this plane hits a lattice point). We would also prefer if the magnitude of a, b and c is “small” (e.g., bounded by a constant H).

Given this formulation, the PSLQ algorithm attempts to find an exact solution, and gives up if it can’t find one. The LLL algorithm on the other hand finds approximate solutions within a tolerance.

Python has the mpmath package, which contains mpmath.pslq for the PSLQ algorithm. The Python SciPy package has Matrix.LLL() for the LLL reduction.

A colleague of mine was previously submitting overnight runs to compute solutions using  brute force search. More recently, using a Python code for the LLL algorithm, he is able to solve a large problem in seconds, with answers that exactly reproduce the brute force solution.

References

[1] Ferguson, Helaman R. P.; Bailey, David H. (1992). A Polynomial Time, Numerically Stable Integer Relation Algorithm. RNR Technical Report RNR-91-032, NASA Ames Research Center. URL: https://www.davidhbailey.com/dhbpapers/pslq.pdf

[2] Ferguson, Helaman R. P.; Bailey, David H.; Arno, Steve (1999). Analysis of PSLQ, an integer relation finding algorithm. Mathematics of Computation 68(225): 351–369. URL (AMS PDF): https://www.ams.org/mcom/1999-68-225/S0025-5718-99-00995-3/S0025-5718-99-00995-3.pdf

[3] Bailey, David H. (2020). PSLQ: An Algorithm to Discover Integer Relations. (Expository overview.) URL: https://www.davidhbailey.com/dhbpapers/pslq-comp-alg.pdf

[4] Lenstra, Arjen K.; Lenstra, Hendrik W., Jr.; Lovász, László (1982). Factoring Polynomials with Rational Coefficients. Mathematische Annalen 261: 515–534. DOI: https://doi.org/10.1007/BF01457454 (Springer page: https://link.springer.com/article/10.1007/BF01457454 (cf. https://www.math.ucdavis.edu/~deloera/MISC/LA-BIBLIO/trunk/Lovasz/LovaszLenstraLenstrafactor.pdf)

[5] Nguyen, Phong Q.; Vallée, Brigitte (eds.) (2010). The LLL Algorithm: Survey and Applications. Springer. URL: https://link.springer.com/book/10.1007/978-3-642-02295-1

Learning languages with the help of algorithms

Suppose you’re learning a new language and want to boost your vocabulary in a very time-efficient way.

People have many ways to learn a language, different for each person. Suppose you wanted to improve your vocabulary by reading books in that language. To get the most impact, you’d like to pick books that cover as many common words in the language as possible.

Here is a formalization. Suppose for a large set of m books of average length n words, you want to pick the one book that has the highest vocabulary impact from the set of books. This vocabulary impact of a book is measured by a weighted sum across all vocabulary words in the book, each word weighted by how common the word is in the language, as measured by word frequency across all books; this essentially gives a probability weight of each word in the language.

That’s an easy problem to solve. First, optionally filter out stop words like (in the case of English) “the“ and “and”  considered in some sense to have “not much meaning.“ Second, across all books build a unique word list along with counts of number of occurrences of each word. Finally, for each book evaluate the coverage of those words, computing a score as described above.

Computing the unique word list and scores can be done by a hashing process that runs in linear order mn time typically, worst case mn log(mn). To compute the score also costs average linear time. So the entire process can be done in linear time.

What if you want to find the best two books to read, with the best joint vocabulary coverage? To find the optimal solution, the best known time is not linear but is quadratic for the general case.

How about the best k books for arbitrary k > 0?

This is an NP-hard problem, meaning that the compute time to solve this problem exactly for the best known algorithms for the general case grows exponentially in k as the size k of your desired set of reading books increases.

So you cannot expect to solve this problem exactly for large k. All hope is not lost, however. This is a maximal weighted cover problem, residing in a subset of the NP hard problems known as submodular problems.  Because of this, approximate algorithms are known that guarantee approximation accuracy within a certain known factor of the true best solution (see [1-5]).

These algorithms are described in [6], [7], [8]. The basic idea of the algorithms is to add a single high-impact book at a time to the running set of high-impact books—a greedy algorithm. It is not guaranteed to be the best book list, but it is reasonable.

The helpful Python submodlib package runs very fast on this problem.

One can improve the quality of the result by spending more on computation.  First, you could use a blocking strategy to compute precisely the best two books to add at each step, or three books, and so forth (computational complexity is quadratic, cubic and so forth). Similarly one could use a look-ahead strategy: add two books to the list that are together the best by an exact computation, then leave one off and find the best second and third books, and so forth. These in general do not improve on the submodularity bound, however in practice the result is more accurate.

One can also use various heuristics to improve performance in some cases. For example, if a book has little or no vocabulary that is not present in the other books, it can sometimes be safely discarded. However, in general the exact case remains hard (unless P=NP).

References

[1] Abhimanyu Das and David Kempe. Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection. Proceedings of ICML, 2011.

[2] Abhimanyu Das and David Kempe. Approximate Submodularity and its Applications: Subset Selection, Sparse Approximation and Dictionary Selection.
Journal of Machine Learning Research (JMLR), 19(3):1–35, 2018.

[3] M. Conforti and G. Cornuéjols. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds. Discrete Applied Mathematics, 7(3):251–274, 1984.

[4] Maxim Sviridenko, Jan Vondrák, and Justin Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. Proceedings of SODA, 2013.

[5] Rishabh Iyer, Stefanie Jegelka, and Jeff Bilmes. Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions. 2013.
https://arxiv.org/abs/1311.2110

[6] Nemhauser, George L., Laurence A. Wolsey, and Marshall L. Fisher. “An analysis of approximations for maximizing submodular set functions—I.” Mathematical programming 14, no. 1 (1978): 265-294.

[7] “Maximum coverage problem,” https://en.wikipedia.org/wiki/Maximum_coverage_problem.

[8] Wei, Kai, Rishabh Iyer, and Jeff Bilmes. “Fast multi-stage submodular maximization.” In International conference on machine learning, pp. 1494-1502. PMLR, 2014.

 

Why are CUDA kernels hard to optimize?

Explosive datacenter demand has caused developers to leave no stone unturned in search of higher efficiencies. The DeepSeek team, not satisfied with Nvidia’s CUDA libraries, used a virtualized form of assembly language (PTX) to write kernel codes to accelerate their AI computations. Others have attempted to generate optimized kernels using AI, though some results have been questioned (for various attempts, see also here, here, here, here and here).

Why is it hard to write peak-speed GPU code? Writing really fast code has always been arduous, but it seems especially so for modern GPUs.

To understand the issues, my colleagues and I performed a detailed study of GPU kernel performance, across eight different GPU models from three GPU vendors [1]. The test case we considered was low precision matrix multiply, a resource-intensive operation for LLM training. We ran many, many experiments to understand what causes performance variability and why kernels sometimes run slower than you’d think they should.

For the cases we studied, we found about half a dozen different factors, but the upshot is this: modern processors like GPUs have become so complex—notably their multi-layered hierarchical memory subsystems—that it is difficult to get consistently high performance across all problem sizes a user might want to run in practice. As a result, the performance for the target problem might be surprisingly and mysteriously less than the advertised peak performance for the operation in question. The reasons might be obvious—like cache line misalignment—or more opaque. For the matrix multiply case, various issues like the need for prefetching, caching, tiling and block size selection, make it difficult for the kernel developer to optimize for every input size a user might specify.

Below is an example graphic from our paper. The color indicates floating point operation rate (FLOPs) for a reduced precision matrix multiply on a representative GPU using a library call. The horizontal and vertical axes refer to the matrix dimensions for the problem (see paper for details). Though some regions show performance near the theoretical peak (red), other immediately adjacent regions show problem sizes that run dramatically less—in fact, only about half of peak performance, or less. Presumably this is because either individual kernel performance or the selection of kernels used by the library is suboptimal. The net outcome is, if your problem lands a “bad” region, you’re in for a big surprise, your performance will be much less than expected, and you may not understand why. All high-performing GPUs we tested showed irregular behaviors such as this [2] [3].

In the past this was not always a problem.  Older architectures like Sun Sparc or Cray vector processor, complex as they were, were simple enough that a reasonably well-tuned computational kernel might run well across most if not all inputs [4]. Today, performance is much harder to predict and can vary substantially based on the requested problem sizes.

This is a tough challenge for library developers. Whenever a new GPU model family comes out, new kernel optimization and tuning are required to give (hopefully) more consistently high performance, and some cases get more developer attention than others due to customer needs and limited developer resources. As a result, infrequently used operations do not get as much attention, but they may be the exact ones you need for your particular case [5].

Tools are available to help optimize for specific cases. The excellent Nvidia CUTLASS library exposes access to many more fine-grained options compared to the standard cuBLAS library. The not faint of heart can try programming Nvidia GPUs at the level of PTX, or (shudder) SASS. Superoptimization might help, but only for very small code fragments and even then there may be too many external factors influencing performance to make it effective.

Autotuning is a promising approach though it doesn’t seem to have reached its full potential in production. AI might really help here [6]; in our own paper we had some success using machine learning methods like decision trees and random forests to model performance as a function of problem size, though our work was exploratory and not production-ready. To make a well-crafted general solution it would seem would require a lot of effort to do right. Code sustainability and maintenance are also critical; a sustainable workflow would be needed to retrain on new GPUs, new CUDA releases and even site-specific and system-specific settings like GPU power and frequency cap policies.

Most recent AI-driven work focuses on optimizing performance for one or a few problem sizes only. A truly production-quality general purpose tool would give both 100% accurate results and also top achievable performance for any input problem size (even for corner cases) or data type. This would require both optimized GPU kernels and optimal kernel dispatcher for kernel selection. And the method would need to be robust to issues like power and frequency variabilities in production runs. This would seem to currently be an unsolved problem. Solving it would be of huge benefit to the hyperscaler community.

Notes

[1] For related work from a slightly different angle, see this excellent work from Matt Sinclair’s lab.

[2] It turned out this study was helpful to us for production runs, to help us to triage an odd performance conundrum we encountered when attempting an exascale run (see here, here).

[3] Incidentally this example shows the hazards of simplistic benchmark suites to measure GPU code performance. Unless the benchmark captures a truly large and varied set of input cases, any new optimization method proposed can artificially “overfit” performance on the tests and still underperform miserably on many user cases of interest.

[4] I once wrote a 1-D wavelet convolution kernel for a Sparc processor, using a circular register buffer and loop unrolling to minimize loads and stores, this achieving near-peak performance. The code was correctly compiled from C to assembly, and performance for a given problem was almost precisely predictable. That was before the days of complex memory hierarchies.

[5] One vendor I know of used to take customer requests for hand tuning expensive library calls and made them run fast at the specific customer problem sizes.

[6] LLM kernel generation seems like a natural fit, particularly since LLM-generated code quality has much improved in recent months. Kernel selection and parameter selection for block size, tiling etc. might be better solved by direct training of machine learning models, or methods like this. Comparative studies on this would be informative.

 

Machine learning by satisfiability solving

Define B = {0, 1} and a Boolean function fp: BN → B where p is a Boolean parameter vector in Bn. Consider that fp(x) can be represented as a Boolean expression whose variables are the entries of vectors p and x. Assume that c is the cost of computing fp(x) measured in some way, for example, the number of operator evaluations based on some complete set of Boolean operators. Then, given y in B and x in BN, the cost to solve the equation fp(x) = y for satisfying y has cost c2n, using the most naïve brute force search.

For 3SAT solving, a much better worst case bound is known, namely O(1.307n), see here, for a related discussion see here. For the inexactly-defined class of “industrial problems,” performance in practice is often much better; for a discussion see here.

Now consider a set of Boolean feature vectors xi and labels yi. for i in {1, …, d}. One can now solve fp(xi) = yi for all i. Since the number of unknown variables to be solved for is unchanged, the naïve bound on computational cost is cd2n, scaling linearly in the number of data values d. Note this is tractable if the phenomenon in question can be modeled by a small number of logical parameters n.

In practice, one generally does not solve a machine learning problem exactly but approximately, minimizing a loss function. One can use the AtLeast operator in a SAT solver like Z3 to require that fp(xi) = yi be satisfied for at least K values of i, for some K. One can then find the maximal such K by performing bisection on K, requiring log2(d) such SAT solves. On exact solvers for this problem see also here.

The AtLeast operator can be implemented by either binary adder / totalizer encoding (Bailleux and Boufkhad 2003), sequential counter encoding (Sinz 2005), or Batcher sorting network approach (Abío et al. 2013). Unfortunately, all these methods require adding significant numbers of auxiliary variables, adversely affecting the naïve complexity bound. However, one can hope that performance is much better than this bound for industrial problems, as is often the case in practice. Furthermore, randomized approximation algorithms known to run in polynomial time can provably find assignments that satisfy a guaranteed fraction (e.g., 3/4 or more) of the maximum number of satisfiable Boolean constraints (see for example here, here, here and here). This might serve as a proxy for exactly solving for the optimizer.

If the problem instead has Boolean expressions with embedded linear inequality predicates on integer variables of bounded range, one could apply SMT solver methods directly using the ideas described above, or convert the problem to a SAT problem and apply the above methods directly.

The idea of using SAT solvers for machine learning in the way described here goes back to (Kamath et al. 1992). The method is described with an example in Donald Knuth’s TAOCP fascicle on Satisfiability, section on Learning a Boolean function.