Can AI Models Reason: Is Data All You Need?

Many are voicing concern that the world is running out of data and that this will be a blocker to progress toward smarter AI models. One paper in fact projects timelines for when we will run out.

AI researchers are looking for ways to adapt. Nvidia, for example, has trained a model specifically to generate synthetic data for training other models. Others are taking this approach as well, though using AI-generated data to train AI is not without risk.

Others have asked a bigger question: is something fundamentally missing in an approach that relies so heavily on data? Certainly the bitter lesson thesis, and the position long advocated by Geoffrey Hinton, argue for a data-first approach with as few prior assumptions as possible (though every model has some built-in bias).

But it’s currently simply unknown whether just adding more data and compute will do the trick for achieving general intelligence or whether something else is needed. Neurosymbolic approaches are being experimented with, in various forms. But it’s unclear whether these can scale up to the level needed. And the frontier labs, laser-focused on the current paradigm, may not have adequate time or resources to investigate high-risk/high-reward alternatives.

From a theoretical standpoint, sometimes more data is simply not enough. As discussed in a previous post, some problems in mathematics and engineering require exponentially large amounts of data to train neural network models. Exponentials can work in your favor, but they can also work against you (think of the Tower of Hanoi problem or the Wheat and Chessboard problem). Some problems, for certain models, cannot be solved with all the data available in the entire universe.
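To see how quickly exponentials blow up, here is the Wheat and Chessboard calculation in Python (a toy illustration, not tied to any particular model or data set):

total = sum(2**k for k in range(64))   # one grain on the first square, doubling across 64 squares
print(total)                           # 18446744073709551615, about 1.8 × 10^19 grains
print(total == 2**64 - 1)              # True: the classic closed form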

The requirements for solving these problems can grow much more quickly than expected. The very strength of neural networks, their flexibility and universal approximation property, can also be a weakness: it can take an enormous amount of data to pin down all the parameters well enough to make the model error-free. Thankfully, many other problems that people want to solve (such as human language modeling) are fundamentally lower dimensional and thus less vulnerable to this problem.

We just don’t know whether the current data-hungry approach will be enough—or whether we’ll need to learn another bitter lesson.

Dimensional analysis for gamma function values

Sometimes it’s useful to apply dimensional analysis where it doesn’t belong, to imagine things having physical dimension when they don’t. This post will look at artificially injecting dimensions into equations involving factorials and related functions.

Factorials

The factorial of n is defined as the product of n terms. If each of these terms had units of length, the factorial would have units of n-dimensional volume. It occurred to me this morning that a lot of identities involving factorials make sense dimensionally if you pretend the terms in the identities have units. The expressions on both sides of an equation have the same dimension, and only terms of the same dimension are added together. This isn’t always the case, so caveat emptor.
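For example, if we give n! dimension n, then a binomial coefficient comes out dimensionless, which is consistent with it being a pure count:

\binom{n}{k} = \frac{n!}{k!\,(n-k)!} \qquad \text{dimension: } n - k - (n - k) = 0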

We could also think of an nth rising power or nth falling power as an n-dimensional volume. If we do, then the dimensions in a Newton series cancel out, for example.
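To spell out the Newton series example: each term of the series (see the discrete Taylor series post below) has the form

\frac{\Delta^k f(a)}{k!} \, (x-a)^{\underline{k}}

and if we give the falling power dimension k and k! dimension k, while treating the finite difference Δ^k f(a) as dimensionless, each term has dimension k − k = 0.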

Dimensional analysis for factorials and related functions could make it easier to remember identities and easier to spot errors. And it could suggest the correct form of a result before the details of the result are filled in. In other words, artificial dimensional analysis can provide the same benefits as physically meaningful dimensional analysis.

Gamma function

For positive integers n, Γ(n) = (n − 1)!, and so we could assign dimension n − 1 to Γ(n), and more generally assign dimension z − 1 to Γ(z) for any complex z.

Now for some examples to show this isn’t as crazy as it sounds. For starters, take the identity Γ(z + 1) = z Γ(z). We imagine the left hand side is a z-dimensional volume, and the right hand side is the product of a factor of dimension 1 (a length) and a factor representing a (z − 1)-dimensional volume, so the dimensions check out.

Next let’s take something more complicated, the Legendre duplication formula:

\Gamma (z)\Gamma \left(z+\frac{1}{2}\right)=2^{1-2z}\sqrt{\pi} \; \Gamma (2z)
The left hand side has dimension (z − 1) + (z − ½) = 2z − 3/2. The right hand side has dimension 2z − 1, apparently contradicting our dimensional analysis scheme. But √π = Γ(½), and if we rewrite √π as Γ(½) in the equation above, both sides have dimension 2z − 3/2. The dimensions in other identities, like the reflection formula, also balance when you replace π with Γ(½)².
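Take the reflection formula, for instance:

\Gamma(z)\,\Gamma(1-z) = \frac{\pi}{\sin \pi z}

The left hand side has dimension (z − 1) + (−z) = −1, and replacing π with Γ(½)² gives the right hand side dimension 2(−½) = −1 as well, treating sin πz as dimensionless.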

Hypergeometric functions

The hypergeometric function F(a, b; c; z) is defined by its power series representation. If we assign dimensions to the rising powers in that series as we’ve done here, then the numerator and denominator of each term have the same dimension, and so the hypergeometric function should be dimensionless in our sense. This implies we should expect identities for hypergeometric functions to be dimensionless as well. And indeed they are. I’ll give two examples.
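For reference, the defining series, written with rising powers, is

F(a, b; c; z) = \sum_{n=0}^\infty \frac{a^{\overline{n}} \, b^{\overline{n}}}{c^{\overline{n}} \, n!} \, z^n

Each rising power and the factorial get dimension n in our scheme, so every term has dimension 2n − 2n = 0.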

First, let’s look at Gauss’ summation identity

F (a,b;c;1)= \frac{\Gamma(c)\Gamma(c-a-b)}{\Gamma(c-a)\Gamma(c-b)}

provided the real part of c is greater than the real part of a + b. Notice that the numerator and denominator both have dimension 2c − a − b − 2.

Next, let’s look at Kummer’s identity

F (a,b;1+a-b;-1)= \frac{\Gamma(1+a-b)\Gamma(1+\tfrac12a)}{\Gamma(1+a)\Gamma(1+\tfrac12a-b)}

Both the numerator and denominator have dimension 3a/2 − b.

Finally, let’s look at a more complicated formula.

\begin{align*} F(a, b; c; z) &= \frac{\Gamma(c) \Gamma(b-a)}{\Gamma(b) \Gamma(c-a)} \,(-z)^{-a\phantom{b}} F\left(a, 1-c+a; 1-b+a; \frac{1}{z}\right) \\ &+ \frac{\Gamma(c) \Gamma(a-b)}{\Gamma(a) \Gamma(c-b)} \,(-z)^{-b\phantom{a}} F\left(\,b, 1-c+b; 1-a+b; \,\frac{1}{z}\right) \\ \end{align*}

In both terms, the dimensions of the gamma functions in the numerator and denominator cancel out.
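Checking the first coefficient explicitly (the second works the same way with a and b swapped):

\text{numerator: } (c-1) + (b-a-1) = b + c - a - 2, \qquad \text{denominator: } (b-1) + (c-a-1) = b + c - a - 2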

Falling power analog of binomial theorem

Yesterday I wrote about how the right notation could make Newton’s interpolation theorem much easier to remember, revealing it as an analog of Taylor series. This post will do something similar for the binomial theorem.

Let’s start with the following identity.

(x + y)(x + y - 1)(x + y - 2) =

x(x - 1)(x - 2) + 3x(x - 1)y + 3xy(y - 1) + y(y - 1)(y - 2)

It’s not clear that this is true, or how one might generalize it. But if we rewrite the equation using falling power notation we have

(x + y)^{\underline{3}} = x^{\underline{3}} + 3x^{\underline{2}} y^{\underline{1}} + 3x^{\underline{1}}y^{\underline{2}} + y^{\underline{3}}

which looks a lot like the binomial theorem. In fact it is the case n = 3 of the Chu-Vandermonde theorem, which says

(x + y)^{\underline{n}} = \sum_{k=0}^n \binom{n}{k} x^{\underline{n-k}} y^{\underline{k}}

Viewed purely visually, this is the binomial theorem with little lines under each exponent.

Incidentally, the analogous theorem holds for rising powers. Just change all the lines under the exponents to lines on top of the exponents.
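If you want to check the n = 3 case and its rising power analog by machine, here’s a quick sketch using SymPy, whose ff and rf functions are the falling and rising factorials:

from sympy import binomial, ff, rf, simplify, symbols

x, y = symbols('x y')
n = 3

# Chu-Vandermonde for falling powers, and its rising-power analog
falling = sum(binomial(n, k) * ff(x, n - k) * ff(y, k) for k in range(n + 1))
rising  = sum(binomial(n, k) * rf(x, n - k) * rf(y, k) for k in range(n + 1))

print(simplify(ff(x + y, n) - falling))  # 0
print(simplify(rf(x + y, n) - rising))   # 0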

Why eliminate trusted third parties?

Suppose one company would like to buy another company’s client list, but only if the lists don’t overlap too much. Neither company wants to hand over their list to the other before a sale takes place. What can they do?

A low-tech solution would be for both parties to provide their client lists to a trusted third party who will report back how much the lists overlap. That may be the best thing to do.

But it is possible to solve this problem without a trusted third party. With homomorphic encryption, the companies can exchange encrypted versions of their client lists that will allow both to calculate the amount of overlap without revealing any further information.
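Homomorphic encryption is one way to do this; private set intersection protocols are another. As a toy illustration of the general flavor, here is a sketch of a Diffie–Hellman-style private set intersection in plain Python. To be clear, this is not homomorphic encryption and not production cryptography; the modulus, the hashing, and the client lists below are all made up for the example.

import hashlib
import secrets

p = 2**127 - 1  # a Mersenne prime, chosen here purely for illustration

def h(client):
    # hash a client identifier into an integer mod p
    return int.from_bytes(hashlib.sha256(client.encode()).digest(), "big") % p

def blind(values, key):
    return {pow(v, key, p) for v in values}

company_a = {"alice@example.com", "bob@example.com"}    # hypothetical lists
company_b = {"bob@example.com", "carol@example.com"}

key_a = secrets.randbelow(p - 2) + 1   # each side keeps its exponent secret
key_b = secrets.randbelow(p - 2) + 1

# Each side blinds its own hashed list, exchanges it, and the other side applies
# its own exponent on top. Exponentiation mod p commutes, so common clients end
# up with identical doubly-blinded values.
a_twice = blind(blind({h(c) for c in company_a}, key_a), key_b)
b_twice = blind(blind({h(c) for c in company_b}, key_b), key_a)

print(len(a_twice & b_twice))  # overlap size: 1, without either raw list changing hands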

But why go to the effort? Many peer-to-peer technologies raise this question. So you’ve eliminated a third party; what’s so great about that? If you can send someone cryptocurrency, for example, without going through an intermediary like a bank or credit card company, what good is that if the transaction fees are no lower?

It’s often not worth using sophisticated technology to eliminate a trusted third party, but sometimes it is. Here are some reasons the technology might be worth using.

Transaction speed

The two companies hiring a third party to compare their lists have to wait on the third party, and the amount of time they need to wait is beyond their control. Maybe that’s acceptable for a one-time transaction, but not for repeated transactions. With homomorphic encryption, transactions could be automated and the only delay would be computation time.

Reproducibility

Sharing limited information via encryption reduces legal risk. If either party later accuses the other of providing incorrect information, the accused party can demonstrate that the software applied to the data gives the reported result.

Trust

To paraphrase Bob Dylan, you gotta trust somebody. Some technologies are labeled “zero-trust” or “trust no one,” but these terms need to be understood in context. When a company asks you to run a particular piece of software on your proprietary data and share the results, you have to trust that the software is not malicious or buggy.

Instead of trusting that a third party holding your data is honest and competent, you trust that a third party developing software is honest and competent. You have to decide that the software product is trustworthy. You might test the software on some sample data. Maybe you inspect the source code if it’s available. But at some point you have to trust the software and the context it runs in.


Discrete Taylor series

Newton’s interpolation formula looks awfully complicated until you introduce the right notation. With the right notation, it looks like a Taylor series. Not only is this notation simpler and more memorable, it also suggests extensions.

The notation we need comes in two parts. First, we need the forward difference operator Δ defined by

\Delta f(x) = f(x+1) - f(x)

and its extension Δ^k, defined by applying Δ to a function k times.

The other piece of notation we need is falling powers, denoted with a little bar under the exponent:

 x^{\underline{k}} = x(x-1)(x-2) \cdots (x - k + 1)

The Δ operator is meant to resemble the differential operator D and falling powers are meant to resemble ordinary powers. With this notation, Newton’s interpolation formula based at a

f(x) = \sum_{k=0}^\infty \frac{\Delta^k f(a)}{k!} (x - a)^{\underline{k}}

looks analogous to Taylor’s formula for a power series based at a

f(x) = \sum_{k=0}^\infty \frac{D^k f(a)}{k!} (x - a)^k.

Newton’s formula applies to polynomials, and the infinite sum is actually a finite sum because Δ^k f(a) is 0 for all k beyond some point.

Newton’s formula is a discrete analog of Taylor’s formula because it only uses the values of f at discrete points, i.e. at the integers (shifted by a), and because it only involves finite operations: finite differences do not involve limits.
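Here’s a small numerical check of Newton’s formula, a sketch that assumes a = 0 and a cubic f, so that the series terminates after four terms:

from math import comb, factorial

def f(x):
    return 2*x**3 - 5*x + 7              # any cubic will do

def delta(k, a):
    # kth forward difference of f at a
    return sum((-1)**(k - j) * comb(k, j) * f(a + j) for j in range(k + 1))

def falling(x, k):
    prod = 1
    for i in range(k):
        prod *= x - i
    return prod

def newton(x, a=0, terms=6):
    return sum(delta(k, a) * falling(x - a, k) / factorial(k) for k in range(terms))

for x in (0.5, 2.25, -3.0):
    print(f(x), newton(x))               # the two columns agree (up to floating point)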

Convergence

As I mentioned on @AlgebraFact this morning,

It’s often useful to write a finite series as an infinite series with a finite number of non-zero coefficients.

This eliminates the need to explicitly track the number of terms, and may suggest what to do next.

Writing Newton’s formula as an infinite series keeps us from having to write down one version for linear interpolation, another version for quadratic interpolation, another for cubic interpolation, etc. (It’s a good exercise to write out these special cases when you’re new to the topic, but then remember the infinite series version going forward.)

As for suggesting what to do next, it’s natural to explore what happens if the infinite series really is infinite, i.e. if f is not a polynomial. Under what circumstances does the series converge? If it does converge to something, does it necessarily converge to f(x) at each x?

The example f(x) = sin(πx) shows that Newton’s theorem can’t always hold: for this function, with a = 0, the series on the right hand side of Newton’s theorem is identically zero, because f vanishes at every integer and so all the Δ^k f(0) terms are zero. But Carlson’s theorem [1] essentially says that for an entire function that grows more slowly than sin(πx) along the imaginary axis, the series in Newton’s theorem converges to f.

Saying that a function is “entire” means that it is analytic in the entire complex plane. In particular, the Taylor series above has to converge everywhere in order for Carlson’s theorem to apply [2].


[1] Carlson with no e. Not to be confused with Carleson’s theorem on the convergence of Fourier series.

[2] Carlson’s original theorem requires f to be entire. Later refinements show that it’s sufficient for f to be analytic in the open right half plane and continuous on the closed right half plane.

Podcast feed

The previous post was an AI-generated podcast that a friend made by crawling my web site. I decided to create an actual podcast for posting occasional audio files. I expect to post very sporadically. I’ve posted two audio files, and I have one more in mind to post some day. Maybe that’ll be the end of it, or maybe I’ll post more.

The first file I posted was the one from the previous post. The second was an interview I did with the late Sir Michael Atiyah.

Here’s the RSS feed for the podcast. You can also find the podcast via my Substack newsletter.

RSA security in light of news

A recent article reported on the successful factoring of a 512-bit RSA key. The process took $8 worth of rented computing power. What does that say about the security of RSA encryption?

The following Python function estimates the number of computational steps required to factor a b-bit number using the best known factoring algorithm, the general number field sieve. We’re going to take a ratio of two such estimates, so proportionality constants will cancel.

from math import exp, log

def f(b):
    logn = b*log(2)  # natural log of a b-bit number n
    return exp((8/3)**(2/3) * logn**(1/3) * (log(logn))**(2/3))

The minimum recommended RSA key size is 2048 bits. The cost ratio for factoring a 2048-bit key versus a 512-bit key is f(2048) / f(512), which is on the order of 10^16. So factoring a 2048-bit key by the same means would cost something like 80 quadrillion dollars.
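For the record, here’s the arithmetic behind that figure; only the order of magnitude matters:

ratio = f(2048) / f(512)   # roughly 1e16
print(8 * ratio)           # dollars: tens of quadrillions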

This is sort of a back-of-the-envelope estimate. There are things it doesn’t take into account. If a sophisticated and highly determined entity wanted to factor a 2048-bit number, they could probably do so for less than 80 quadrillion dollars. But it’s safe to say that the people who factored the 512-bit key are unlikely to factor a 2048-bit key by the same approach.

Can AI models reason like a human?

We’re awaiting the release of OpenAI’s o3 model later this month. Its performance is impressive on very hard benchmarks like SWE-bench Verified, Frontier Math and the ARC AGI benchmark (discussed previously in this blog).

And yet at the same time some behaviors of the frontier AI models are very concerning.

Their performance on assorted math exams is outstanding, but they make mistakes doing simple arithmetic, like wrongly multiplying numbers that are a few digits long. Performance of the o1 preview model on the difficult Putnam math exam is excellent but drops precipitously under simple changes like renaming constants and variables in the problem statement.

Similarly, when o1 is applied to a planning benchmark expressed in standardized language, it performs well, but accuracy falls apart when it is applied to a mathematically equivalent planning problem in a different domain. Likewise, a given AI model applied to simple rotation ciphers like ROT13 can have wildly different performance depending on the shift value, suggesting the models don’t really understand the algorithm.

It was the best of times, it was the worst of times, . . .

What is going on here?

For years now, some have made claims of “human-level performance” for various deep learning algorithms. And as soon as one party starts making claims like this, it’s hard for the others to resist doing the same.

The confusion is that, from a certain point of view, the claim of “human-level” is true—but the definition of “human-level” is fraught.

Here, “human-level” is taken to mean achievement of some high score on a benchmark set, ostensibly exceeding some human performance measure on the same benchmark. However, a single AI model can vary wildly in capability across behaviors—“smart” compared to humans in some ways, “dumb” in others.

For humans, test-taking is a proxy for measuring a range of skills and abilities. And even for humans it is not always an accurate proxy. A person can perform very well on academic tests and very poorly on the job, or vice versa.

And the capability ratios for AI models are very different still, in ways we don’t fully understand. So, outscoring humans on a software engineering benchmark doesn’t mean the AI has the whole panoply of coding skills, decision-making abilities, software architecture design savvy, etc., needed to be a competent software engineer.

It’s no surprise that recent articles (below) show a growing perception of the limitations of AI benchmarks as currently conceived.

Ways forward

Perhaps we should consider developing requirements like the following before claiming human-level reasoning performance of an AI model:

  • It should be able to “explain its work” at any level of detail to another human (just like a human can), in a way that that human can understand.
  • It should be able to give answers without “hallucinating” or “confabulating” (yes, humans can hallucinate too, but most occupations would not be well-served by an employee who hallucinates on the job).
  • It should be able to reliably and consistently (100% of the time) do things that we routinely expect a human or computer to do accurately (like add or multiply two numbers accurately, for things like filling out tax returns or doing engineering calculations to build an airplane).
  • It should be frank and honest in assessing its level of certainty about an answer it gives (no gaslighting).
  • It should be able to solve a trivial perturbation of a given problem with the same ease as the original problem (to the same extent that a human can).
  • As someone has said, it should be able to do, without specific training, what a 5-year-old can do without specific training.
  • This one sounds good, from Emmett Shear: “AGI [artificial general intelligence] is the ability to generalize [without special training by a human] to an adversarially chosen new benchmark.”

AI models are fantastic and amazing tools—and best used when one has eyes wide open about their limitations.

Have you had problems with AI model performance? If so, please share in the comments.

References

Rethinking AI benchmarks: A new paper challenges the status quo of evaluating artificial intelligence, https://venturebeat.com/ai/rethinking-ai-benchmarks-a-new-paper-challenges-the-status-quo-of-evaluating-artificial-intelligence/

Rethink reporting of evaluation results in AI, https://www.science.org/doi/10.1126/science.adf6369, https://eprints.whiterose.ac.uk/198211/

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence, https://arxiv.org/abs/2402.09880

Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless, https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless

Why we must rethink AI benchmarks, https://bdtechtalks.com/2021/12/06/ai-benchmarks-limitations/

AI and the Everything in the Whole Wide World Benchmark, https://arxiv.org/abs/2111.15366

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices, https://arxiv.org/abs/2411.12990

Goodhart’s Law states that when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.

Quick change directory

One difficulty when working at a command line is navigating between directories, particularly between locations with long paths. There are several ways to mitigate this. One of the simplest is using cd - to return to the previous directory. Another is to use pushd and popd. Still another is to set the CDPATH variable.

qcd function

This post presents another approach that could be used instead of or in addition to the ones mentioned above. The book Efficient Linux at the Command Line by Daniel Barrett contains a function called qcd (quick change directory) that will cd to any of a list of commonly used directories. The function is essentially a big case statement taking a key and going to a corresponding directory.

qcd () {
  case "$1" in
    work)
      cd "$HOME/Work/Projects/Web/src/include"
      ;;
    recipes)
      cd "$HOME/Family/Cooking/Recipes"
      ;;
    …
  esac
  pwd
}

So, for example, qcd work will take you to the directory ~/…/include.

Barrett adds one more line after the definition of the qcd function:

complete -W "work recipes …" qcd

This turns on tab completion for the qcd function using the bash builtin complete. You use this builtin implicitly when you use tab completion with shell utilities. You can call it as above to add the same command completion to your own functions. So, for example, with the code above, a user could type

qcd w TAB

instead of typing out qcd work (or a cd command with the long path).

Improvements

The book says “Store the function in a shell configuration file such as $HOME/.bashrc … source it, and it’s ready to run.” I’d like to make two comments about this.

First, it’s important that qcd is a function and not a script. Scripts run in a subshell, so running a cd command in a script changes your working directory while the script is running. But when the script finishes, the subshell exits, and the working directory is just as it was before running the script.

Second, if you use this function you’ll edit it frequently as you think of new directories to add. I’d rather not put it in my .bashrc file for that reason. Also, maybe I’d like to use it from a bash shell on a Linux box and from zshell on a Mac. So instead of putting the definition of qcd in my .bashrc file, I put it in a file qcd.sh and source that file from my .bashrc file.

When you add a new key and directory to the qcd script, you need to also add the key to the call to complete, otherwise you’ll be in the awkward situation of tab completion working for some directories but not others. It would be possible to write a fancier shell script that would fix this problem.

Generating qcd

My knowledge of shell scripting is minimal, and I’d like to keep it that way. If I need to do something complicated, I don’t do it in a shell script. So I wrote a Python script to generate the qcd.sh file from a dictionary of keys and directories. Someone fluent in shell scripting would find this unnecessarily complicated. To each his own.

By the way, if you’re going to write a Python script, why not just write a Python script and be done with it rather than write a Python script to generate a shell script? For the same reason qcd is a function: cd inside a Python script will only change the working directory while the script is running. There is probably some way around this, but I didn’t want to take the time to figure it out.
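Here’s a minimal sketch of what such a generator might look like; the dictionary contents below just echo the book’s example, and the file name qcd.sh matches the setup above.

# Dictionary of keys and directories; edit this, rerun, and re-source qcd.sh.
dirs = {
    "work": "$HOME/Work/Projects/Web/src/include",
    "recipes": "$HOME/Family/Cooking/Recipes",
}

lines = ["qcd () {", '  case "$1" in']
for key, path in dirs.items():
    lines += [f"    {key})", f'      cd "{path}"', "      ;;"]
lines += ["  esac", "  pwd", "}", ""]

keys = " ".join(dirs)                      # keep the complete call in sync automatically
lines.append(f'complete -W "{keys}" qcd')

with open("qcd.sh", "w") as out:
    out.write("\n".join(lines) + "\n")

One nice side effect is that the call to complete is generated from the same dictionary as the case statement, so the two can’t get out of sync.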
