An AI Odyssey, Part 2: Prompting Peril

I was working with a colleague recently on a project involving the use of the OpenAI API.

I brought up the idea that it might be possible to improve the accuracy of API responses by modifying the API call to increase the amount of reasoning performed.

My colleague quickly asked ChatGPT if this was possible, and the answer came back: “No, it’s not possible to do that.” Then I asked essentially the same question of my own instance of ChatGPT, and the answer was, “Yes, you can, but you need to use the OpenAI Responses API.”
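For reference, the setting my instance of ChatGPT seemed to be pointing at is the reasoning-effort parameter accepted by the Responses API for reasoning-capable models. A minimal sketch of what that call might look like, assuming the `openai` Python package; the model name and prompt here are illustrative:

```python
def build_request(prompt: str, effort: str = "high") -> dict:
    """Build Responses API parameters with a reasoning-effort hint.

    The `reasoning` field is only honored by reasoning-capable models;
    "o4-mini" below is an illustrative choice, not a recommendation.
    """
    return {
        "model": "o4-mini",
        "input": prompt,
        "reasoning": {"effort": effort},  # "low" | "medium" | "high"
    }

params = build_request("Explain the migration plan in three steps.")
print(params["reasoning"])  # {'effort': 'high'}

# The actual call requires an API key in the environment:
# from openai import OpenAI
# client = OpenAI()
# response = client.responses.create(**params)
# print(response.output_text)
```

The live call is left commented out so the sketch stands on its own without credentials.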

How did we get such different answers? Was it the wording of the prompt? Was it the custom instructions given in the account personalization, where you describe who you are and how you want ChatGPT to respond? Could it have been different conversation histories? Many factors could have contributed to the different responses. Unfortunately, many of these factors are either not easily controllable at the user level or not convenient to vary in a protracted trial-and-error search.

I’ve had other occasions when I first get a highly standardized, generic answer from ChatGPT, even in Thinking mode, that I know is not quite right or just seems off. Then, when I push back, I may get a profoundly different answer.

It’s simply a fact that large language models are conditional probabilistic systems that do not guarantee reproducibility in practice, even given the same inputs, even at temperature=0 [1]. Their outputs depend sensitively on prompt wording, context window contents, system instructions, and model configuration. Small differences in these inputs can yield substantially different outputs.
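The near-tie case in particular is easy to illustrate. Here is a toy sketch (not a real model): when two tokens have almost identical scores, a perturbation smaller than typical floating-point noise from nondeterministic GPU reductions is enough to flip a greedy (temperature=0) choice.

```python
# Toy scores for the next token; "Yes" and "No" are in a near-tie,
# separated by less than typical float32 reduction noise.
logits = {"Yes": 10.000000, "No": 9.9999995, "Maybe": 3.2}

def greedy_pick(scores: dict) -> str:
    """Greedy (temperature=0) decoding: take the highest-scoring token."""
    return max(scores, key=scores.get)

print(greedy_pick(logits))  # -> Yes

# Simulate a perturbation on the order of floating-point rounding error:
perturbed = dict(logits)
perturbed["No"] += 1e-6
print(greedy_pick(perturbed))  # -> No
```

The same mechanism scales up: one flipped token early in a generation conditions everything that follows, so two runs can diverge into entirely different answers.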

How well an AI chatbot responds can obviously have a massive impact on how effective the tool will be for your use case, and differences in responses could materially affect the outcome of your project. I take this as a wake-up call to be persistent, vigilant, and flexible when trying to obtain reliable answers from these new AI tools.

Notes

[1] Some sources of nondeterminism: floating-point/GPU nondeterminism, differing orders of operations from distributed collectives, ties or near-ties in token probabilities, backend/infrastructure changes, model routing, and hidden system-prompt or tool-availability differences.

4 thoughts on “An AI Odyssey, Part 2: Prompting Peril”

  1. I have the same experience, especially with having the tools write code to query their own APIs. I have had good experiences, though, with the newer tools and web search: have the tool review the current codebase first, then ask a follow-up question.

  2. Transformer attention dynamics are inherently metastable.

    https://arxiv.org/abs/2512.01868

    Tokens organize into multi-cluster configurations and the metastability means small perturbations push the system into different basins, producing different outputs. The contradictory answers are saddle-basin crossings, not model failures. The “generic answer followed by a better answer after pushback” is the attractor pulling the system toward collapsed consensus, then perturbation forcing it back into a richer metastable state.

    An important social consequence is that the asymmetric cost of eliminating the KL divergence between the model and its users falls entirely on the users. Unlike other expert systems with stable descriptions, an LLM’s effective distribution shifts between sessions, requiring repeated payments of this asymmetric cost. Users have had to learn prompt engineering to adapt to the model.

    Formal verification cannot transfer to this setting because verification assumes a stable transition function. An LLM’s output is an interference sum over an astronomically large path space, sensitive to the full context window and infrastructure routing. No finite description exists to verify against.

    Worth pointing out, however, that the metastability is both a reliability liability and a source of the LLM’s expressivity. Eliminating metastability to achieve formal reliability would drive the system to its attractor — a generic, collapsed, useless response. Selective stabilization of high-stakes outputs, while preserving metastable expressivity elsewhere, is the unsolved engineering problem you’ve framed very well in these posts!

    Programming is evolving from a process of understanding user needs and building a complete map from every possible input to every possible output into a role more like managing people and problem selection: identifying the relevant context, architecture, and heuristics for exploration of the sparse, high-dimensional landscape in which the highest-impact solutions remain hidden.

    Or just supervising the LLMs in building out the known solutions to known problems. And maintaining those solutions over time. Both are important!

  3. I had a similar experience. I started with ChatGPT and another AI when I tried to update my website to the latest Bootstrap. I spent a week getting contradictory answers, broken instructions, and incompatible steps.
    I don’t write prompts; I write contracts: full-page requirements, interfaces, invariants. When I switched to my alternate AI and applied that contract-driven workflow, we upgraded the entire system in a couple of hours and I deployed to Winhost the same day.
    The pattern matches what you describe: when the model drifts, it’s usually because the interface isn’t explicit enough. I design clear boundaries, use repeatable patterns, and treat the AI as a stateful collaborator with incremental refinement.
    The biggest lesson for me has been to let the AI surface contradictions early so I can resolve them before proceeding. Once the inconsistencies are exposed and the contract is stable, then I move to the APIs.
