On Making Databases Run Faster

Database technology is a mature field, and techniques for optimizing databases are well understood. However, surprises can still happen.

Certain performance optimizations you might expect to be automatic are not. I’m working with a legacy code developed some time ago, before modern notions of separation of concerns between business logic and data storage. The code runs slower than you’d expect, and some have wondered why.

Profiling the code revealed that the slowdown was not in the core computation but in the reading and writing of the backend database, which occur frequently when this code executes.

My first thought was to run with the database on a RAM disk, which would give higher bandwidth and lower latency than a spinning disk or SSD. This helped a little, but not much.

As a short-term fix I ended up writing in-memory hash tables as an interposer between the code and the database. This caches commonly accessed values and thus reduces database accesses.
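To illustrate the idea, here is a minimal sketch of such an interposer, written in Python for brevity. The names db_read and db_write are hypothetical stand-ins for the legacy code’s actual database access routines.

# Minimal sketch of an in-memory cache interposed between code and database.
# db_read and db_write are hypothetical stand-ins for the real access routines.
class CachedStore:
    def __init__(self, db_read, db_write):
        self._db_read = db_read
        self._db_write = db_write
        self._cache = {}                    # in-memory hash table

    def read(self, key):
        if key not in self._cache:          # go to the database only on a miss
            self._cache[key] = self._db_read(key)
        return self._cache[key]

    def write(self, key, value):
        self._cache[key] = value            # keep the cache current
        self._db_write(key, value)          # write through to the database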

I would’ve thought high-speed RAM caching of values would be default behavior for a database manager. A principle of interface design is to make the defaults as useful as possible. But apparently not in this case.

Thankfully my fix gave over 10X speedup in application launch time and 6X speedup in the core operation of the code.

The project team is moving toward SQLite for database management in the near future. SQLite has perhaps a dozen or so settings available for optimizations like this. However, early experiments with SQLite for this case show that more fundamental structural changes to the code will also be needed to improve its database access patterns.
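For reference, several of SQLite’s tuning knobs are exposed as PRAGMA statements. The settings below are illustrative examples people commonly reach for, not a tuned recommendation for this particular code, and the database file name is hypothetical.

import sqlite3

conn = sqlite3.connect("mydata.db")              # hypothetical database file
cur = conn.cursor()
cur.execute("PRAGMA cache_size = -65536")        # ~64 MB page cache (negative means KiB)
cur.execute("PRAGMA journal_mode = WAL")         # write-ahead logging; readers don't block writers
cur.execute("PRAGMA synchronous = NORMAL")       # fewer fsyncs, at some durability cost
cur.execute("PRAGMA temp_store = MEMORY")        # keep temporary tables and indices in RAM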

As with code optimization in general, you might be inclined to think the system (compiler, database manager, etc.) will “do it all for you.” But often it doesn’t.

Code Profiling Without a Profiler

Making your code run faster starts with understanding where the runtime is actually spent. But suppose, for whatever reason, the profiling tools won’t work?

I recently used MS Visual Studio on a legacy C++ code. When I attempted to profile, the code crashed shortly after startup, though otherwise it ran fine for both the release and debug build targets. The cause of the problem was not immediately apparent.

If all else fails, using manual timers can help. The idea is to find a high-accuracy system wallclock timer function and use this to read the time before and after some part of the code you want to time. One can essentially apply “bisection search” to the code base to look for the code hot spots. See below for an example.

This can be useful in various situations. Codes in complex languages (or even mixed languages in the code base) can have unusual constructs that break debuggers or profilers. Also, exotic hardware like embedded systems, GPUs or FPGAs may lack full profiler support. Additionally, brand new hardware releases often lack mature tool support, at least initially.

Furthermore, profiling tools themselves, though helpful for getting a quick snapshot of the performance breakdown of each function in the code, have their own limitations. Profilers work either by instrumenting the executable code or by sampling. Instrumenting adds overhead, since the system timer is called on entry to and exit from every function, and that overhead can skew the timings. Instrumenting also breaks function inlining, often reducing performance.

Sampling, on the other hand, can be inaccurate if the sample rate is too low, or can distort the runtime if the sampling frequency is too high. Manual timers sidestep these problems because they can be applied surgically to specific parts of the code (though some profilers do let you turn profiling on and off at different points in the code).

Resorting to manual timing of code sections is a messy business. But sometimes it’s the only thing that will work.

Visual Studio C++ Code Example

// mycode.h

#include "cstdio"
#include "cstdarg"

// Get time of day - elapsed seconds
static double gtod() {   
    LARGE_INTEGER ctr, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&ctr);
    return static_cast<double>(ctr.QuadPart) / static_cast<double>(freq.QuadPart);
}   
    
// Convenience function for printing to file
static void FilePrintf(const char* format, ...) {   
    char buffer[1024];
    va_list args;
    va_start(args, format);
    vsnprintf(buffer, sizeof(buffer), format, args);
    va_end(args);
    FILE* myoutfile = fopen("mytimingsfile.txt", "a");
    if (myoutfile == NULL)   // silently skip output if the file can't be opened
        return;
    fprintf(myoutfile, "%s", buffer);
    fclose(myoutfile);
}
    
// Storage for timer
extern double g_timer;

// mycode.cpp

#include "mycode.h"

// Initialization for timer
double g_timer = 0;

int main() {

    // ...
    g_timer = 0;

    for (int i=0; i<n; ++i) {
        // ...
        const double t1 = gtod();
        my_expensive_function();
        g_timer += gtod() - t1;
        // ...
    }

    FilePrintf("my_expensive_function runtime: %.6f seconds.\n", g_timer);
    g_timer = 0;

    // ...

    return 0;
}

Rich Sutton’s bitter lesson of AI

This morning I read Alberto Romero’s article Grok 3: Another Win For The Bitter Lesson. I wasn’t clear on what exactly the bitter lesson was, so I followed the link to Romero’s previous article GPT-4: The Bitterer Lesson, which led to Rich Sutton’s original article The Bitter Lesson.

Sutton opens his article by saying

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

Clear enough, except you have to read further to get the “compared to what.” I asked Grok 3 to read the article and tell me exactly what the bitter lesson is. It begins with this explanation:

The “Bitter Lesson” is the principle that, in the field of artificial intelligence, the most effective and lasting progress has historically come from methods that rely on leveraging ever-increasing computational power rather than depending on human ingenuity to craft domain-specific knowledge or shortcuts. …

and concludes with the following pithy summary of its summary:

In essence, the Bitter Lesson is: computation trumps human-crafted specialization in AI, and embracing this, though humbling, is key to future success.

Sutton supports his thesis with examples from chess, go, speech recognition, and computer vision.

In some sense this is the triumph of statistics over logic. All else being equal, collecting enormous amounts of data and doing statistical analysis (i.e. training an AI model) will beat out rule-based systems. But of course all else is often not equal. Logic is often the way to go, even on big problems.

When you do know what’s going on and can write down a (potentially large) set of logical requirements, exploiting that knowledge via techniques like SAT solvers and SMT solvers is the way to go. See, for example, Wayne’s article on constraint programming. But if you want to build the Everything Machine, software that aims to be all things to all people, statistics is the way to go.

Sometimes you have a modest amount of relevant data, and you don’t have 200,000 Nvidia H100 GPUs. In that case you’re much better off with classical statistics than AI. Classical models are more robust and easier to interpret. They are also not subject to hastily written AI regulation.

DeepSeek-R1: Do we need less compute now?


The reactions to the new DeepSeek-R1 AI model in recent days seem limitless. Some say it runs so much faster than existing models that we will no longer need the billions of dollars in compute hardware that big tech is preparing to buy.

Is that plausible?

To get an answer, we need only look back at the experience of the recently completed Exascale Computing Project. This large-scale multi-lab project was tasked with developing technology (primarily software) to prepare for exascale computing, which has recently been achieved by Frontier, Aurora and El Capitan.

During the course of the project, the science teams discovered various algorithm and implementation improvements, leading to as much as a 60X speedup or more, over and above the speedups possible from hardware alone [1]. In response, are the teams just running the very same problems faster on older hardware? No; instead, they are now able to run much, much larger problems than previously possible, exploiting both hardware and software improvements.

Or suppose today there were no such thing as the fast Fourier transform (FFT) and scientists were computing Fourier transforms using (essentially) large dense matrix-vector products. If someone then discovered the FFT, I’d guarantee you that scientists would not only say, (1) “Wow, now I can run my existing problems much, much faster,” but also, (2) “Wow, now I can run problems much larger than I ever dreamed and solve problems larger than I could have ever imagined!”

Paradoxically, faster algorithms might even increase the demand for newer, faster hardware. For example, a new faster algorithm for designing medications to cure cancer might be judged so important that it’s worth building the largest machine possible to run it effectively.

All this is not to say whether you should buy or sell Nvidia stock right now. However, it does mean that there is no simplistic argument that faster algorithms and implementations necessarily lead to lower spend on computing hardware. History shows that sometimes this is not true at all. The smart money, on the other hand, is on research teams that are able to exploit any and every new discovery to improve what is possible with their codes, whether by hardware, data, code optimizations or algorithms.

Notes

[1] See slide 9 from Doug Kothe’s talk, “Exascale and Artificial Intelligence: A Great Marriage”. The “Figure of Merit” (FOM) number represents speedup of science output from an application compared to an earlier baseline system. Specifically, a FOM speedup of 50X is the anticipated speedup from baseline due to efficient use of hardware only, for example, on Frontier compared to the earlier OLCF Titan system.

Tricks for radix conversion by hand

The simplest trick for converting from one base to another is grouping. To convert between base b and base b^k, group digits in sets of k and convert one group at a time. To convert from binary to octal, for instance, group bits in sets of three, starting from the right end, and convert each group to octal.

11110010₂ → (11)(110)(010) → 362₈

For an example going the other direction, let’s convert 476 in base nine to base three.

476₉ → (11)(21)(20) → 112120₃
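Here is the grouping trick in code, a quick Python sketch rather than an optimized converter: merge_base goes from base b to base b^k (binary to octal, say) and split_base goes the other way, reproducing the two examples above.

def merge_base(digits, b, k):
    # Base b -> base b**k: group k digits at a time, starting from the right.
    out = []
    digits = list(digits)
    while digits:
        group, digits = digits[-k:], digits[:-k]
        value = 0
        for d in group:
            value = b * value + d
        out.append(value)
    return out[::-1]

def split_base(digits, b, k):
    # Base b**k -> base b: expand each digit into a group of k base-b digits.
    out = []
    for d in digits:
        group = []
        for _ in range(k):
            group.append(d % b)
            d //= b
        out.extend(reversed(group))
    while len(out) > 1 and out[0] == 0:          # drop leading zeros
        out.pop(0)
    return out

print(merge_base([1, 1, 1, 1, 0, 0, 1, 0], 2, 3))   # [3, 6, 2], i.e. 362 in octal
print(split_base([4, 7, 6], 3, 2))                  # [1, 1, 2, 1, 2, 0], i.e. 112120 in base 3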

In general, conversion between bases is too tedious to do by hand, but one important case where it’s a little easier than it could be is converting between decimal and octal. In combination with the grouping trick above, this means you could, for example, convert between decimal and binary by first converting decimal to octal. Then the conversion from octal to binary is trivial.

The key to converting between decimal and octal is to exploit the fact that 10 = 8 + 2, so powers of 10 become powers of (8 + 2), or powers of 8 become powers of (10 − 2). These tricks are easier to carry out than to explain. You can find descriptions and examples in Knuth’s TAOCP, volume 2, section 4.4.

Knuth cites a note by Walter Soden from 1953 as the first description of the trick for converting octal to decimal.
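One way to see the octal-to-decimal direction in code: since 8x + d = 10x + d − 2x, you can append each octal digit as if working in base 10 and then subtract twice the previous total. This is only a minimal sketch of the underlying identity, not the full pencil-and-paper procedure Knuth describes.

def octal_to_decimal(digits):
    # Append the next digit as if in base 10, then subtract twice the old total,
    # using 8x + d = 10x + d - 2x.
    total = 0
    for d in digits:
        total = (10 * total + d) - 2 * total
    return total

print(octal_to_decimal([3, 6, 2]))   # 242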

The trick for moving between base 9 and base 10 (or by grouping, between base 3 and base 10) is simpler and left as an exercise by Knuth. (Problem 12 in section 4.4, with a solution given at the back of the volume.)


Double rounding

The previous post started by saying that rounding has a surprising amount of detail. An example of this is double rounding: if you round a number twice, you might not get the same result as if you rounded directly to the final precision.

For example, let’s say we’ll round numbers ending in 0, 1, 2, 3, or 4 down, and numbers ending in 5, 6, 7, 8, or 9 up. Then if we have a number like 123.45 and round it to one decimal place we have 123.5, and if we round that to an integer we have 124. But if we had rounded 123.45 directly to an integer we would have gotten 123. This is not a mere curiosity; it comes up fairly often and has been an issue in lawsuits.
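Here is a quick demonstration using Python’s decimal module with the round-half-up rule described above:

from decimal import Decimal, ROUND_HALF_UP

x = Decimal("123.45")

# Round twice: to one decimal place, then to an integer.
twice = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
twice = twice.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# Round once, directly to an integer.
once = x.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

print(twice, once)   # 124 123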

The double rounding problem cannot happen in odd bases. So, for example, if you have some fraction represented in base 7 and you round it first from three figures past the radix point to two, then from two to one, you’ll get the same result as if you directly rounded from three figures to one. Say we start with 4231.243₇. If we round it to two places we get 4231.24₇, and if we round again to one place we get 4231.3₇, the same result we would get by rounding directly from three places to one.
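To check the base 7 example with exact rational arithmetic, here is a small helper written just for this illustration. It rounds a list of fractional digits in a given base to a specified number of places, rounding halves up; with finitely many digits in an odd base a tie cannot occur.

from fractions import Fraction

def round_digits(digits, base, places):
    # Round base-`base` fractional digits [d1, d2, ...] to `places` digits.
    x = sum(Fraction(d, base**(k + 1)) for k, d in enumerate(digits))
    scaled = x * base**places
    n = int(scaled) + (1 if scaled - int(scaled) >= Fraction(1, 2) else 0)
    out = []
    for _ in range(places):
        n, r = divmod(n, base)
        out.append(r)
    return out[::-1]

frac = [2, 4, 3]                                     # .243 in base 7
print(round_digits(frac, 7, 2))                      # [2, 4] -> .24
print(round_digits(round_digits(frac, 7, 2), 7, 1))  # [3]    -> .3
print(round_digits(frac, 7, 1))                      # [3]    -> .3, same as rounding twice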

The reason this works is that you cannot represent ½ by a finite expression in an odd base.

A magical land where rounding equals truncation

Rounding numbers has a surprising amount of detail. It may seem trivial but, as with most things, there is a lot more to consider than is immediately obvious. I expect there have been hundreds if not thousands of pages devoted to rounding in IEEE journals.

An example of the complexity of rounding is what William Kahan called The Tablemaker’s Dilemma: there is no way in general to know in advance how accurately you’ll need to compute a number in order to round it correctly.

Rounding can be subtle in any number system, but there is an alternative number system in which it is a little simpler than in base 10. It’s base 3, but with a twist. Instead of using 0, 1, and 2 as “digits”, we use −1, 0, and 1. This is known as the balanced ternary system: ternary because of base 3, and balanced because the digits are symmetrical about 0.

We need a symbol for −1. A common and convenient choice is to use T. Think of moving the minus sign from in front of a 1 to on top of it. Now we could denote the number of hours in a day as 10T0 because

1 \times 3^3 + 0 \times 3^2 + (-1)\times 3 + 0 = 24

A more formal way of describing the balanced ternary representation of a number x is as a set of coefficients t_k such that

x = \sum_{k=-\infty}^\infty t_k 3^k

with the restriction that each t_k is in the set {−1, 0, 1}.

Balanced ternary representation has many interesting properties. For example, positive and negative numbers can all be represented without a minus sign. See, for example, Brian Hayes’ excellent article Third Base. The property we’re interested in here is that to round a balanced ternary number to the nearest integer, you simply lop off the fractional part. Rounding is the same as truncation. To see this, note that the largest possible fractional part is a sequence of all 1s, which represents ½:

\frac{1}{3} + \frac{1}{3^2} + \frac{1}{3^3} + \cdots = \frac{1}{2}

Similarly, the most negative possible fractional part is a sequence of all Ts, which represents −½. So unless the fractional part is exactly equal to ½, truncating the fractional part rounds to the nearest integer. If the fractional part is exactly ½ then there is no nearest integer but two integers that are equally near.
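To make this concrete, here is a small Python check, written for this post rather than taken from any library. It converts integers to balanced ternary, attaches random finite fractional digits (which always represent a value strictly between −½ and ½), and verifies that dropping those digits, i.e. truncating, leaves the nearest integer.

from fractions import Fraction
import random

def bt_int(n):
    # Balanced ternary digits of an integer, most significant first.
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        r = n % 3
        if r == 2:
            digits.append(-1)
            n = (n + 1) // 3
        else:
            digits.append(r)
            n //= 3
    return digits[::-1]

def bt_value(int_digits, frac_digits):
    # Exact value represented by integer digits followed by fractional digits.
    x = Fraction(0)
    for d in int_digits:
        x = 3 * x + d
    for k, d in enumerate(frac_digits, start=1):
        x += Fraction(d, 3**k)
    return x

assert bt_int(24) == [1, 0, -1, 0]               # 10T0, as above

random.seed(1)
for _ in range(1000):
    n = random.randint(-100, 100)
    frac = [random.choice([-1, 0, 1]) for _ in range(8)]
    x = bt_value(bt_int(n), frac)
    assert abs(x - n) < Fraction(1, 2)           # a finite fraction lies strictly in (-1/2, 1/2)
    assert abs(x - n) < abs(x - (n - 1))         # so n, the truncation, is the nearest integer
    assert abs(x - n) < abs(x - (n + 1))
print("Truncating the fractional digits always gives the nearest integer.")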


Entering Russian characters in Vim with digraphs

The purpose of this post is to expand on the following sentence [1]:

Russian letters are created by entering [Ctrl-k followed by] a corresponding Latin letter followed by an equals sign =, or, in a few places, a percent sign %.

The Russian alphabet has 33 letters, so there couldn’t be a Latin letter for every Russian letter. Also, there are Latin letters that don’t have a Russian counterpart and vice versa. So the mapping can’t be simple. But still, the above summary is nice: try control-k followed by the English analog and the equal sign. If that doesn’t work, try a percent sign instead.

Which Latin letters does Vim choose as corresponding to Russian letters? Does it go by sound or appearance? For example, the Russian letter Н looks like a Latin H but it sounds like a Latin N. Vim goes by sound. You would enter the Russian letter Н by typing Ctrl-k N =.

For full details, see the Vim documentation :h digraph-table. I give a simplified excerpt from the documentation below. I just look at capital letters because the lower case letters are analogous. All the official Unicode names begin with CYRILLIC CAPITAL LETTER and so I cut that part out.

char    digraph hex     official name 
А       A=      0410    A
Б       B=      0411    BE
В       V=      0412    VE
Г       G=      0413    GHE
Д       D=      0414    DE
Е       E=      0415    IE
Ё       IO      0401    IO
Ж       Z%      0416    ZHE
З       Z=      0417    ZE
И       I=      0418    I
Й       J=      0419    SHORT I
К       K=      041A    KA
Л       L=      041B    EL
М       M=      041C    EM
Н       N=      041D    EN
О       O=      041E    O
П       P=      041F    PE
Р       R=      0420    ER
С       S=      0421    ES
Т       T=      0422    TE
У       U=      0423    U
Ф       F=      0424    EF
Х       H=      0425    HA
Ц       C=      0426    TSE
Ч       C%      0427    CHE
Ш       S%      0428    SHA
Щ       Sc      0429    SHCHA
Ъ       ="      042A    HARD SIGN
Ы       Y=      042B    YERU
Ь       %"      042C    SOFT SIGN
Э       JE      042D    E
Ю       JU      042E    YU
Я       JA      042F    YA

Note that the end of the alphabet is more complicated than simply using a Latin letter and either an equals sign or a percent sign. Also, the table is in alphabetical order, which doesn’t quite correspond to Unicode numerical order because of a quirk with the letter Ё (U+0401) explained here.

[1] Arnold Robbins and Elbert Hannah. Learning the vi & Vim Editors, 8th edition

Why eliminate trusted third parties?

Suppose one company would like to buy another company’s client list, but only if the lists don’t overlap too much. Neither company wants to hand over their list to the other before a sale takes place. What can they do?

A low-tech solution would be for both parties to provide their client lists to a trusted third party who will report back how much the lists overlap. That may be the best thing to do.

But it is possible to solve this problem without a trusted third party. With homomorphic encryption, the companies can exchange encrypted versions of their client lists that will allow both to calculate the amount of overlap without revealing any further information.
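To make the idea concrete, here is a toy sketch of computing the overlap size without either side revealing its list. It uses commutative exponentiation (a Diffie-Hellman style private set intersection) rather than full homomorphic encryption, and it is only a sketch: the prime, the hashing, and the example lists are chosen for brevity, not for security.

import hashlib, math, secrets

P = 2**127 - 1                                # a Mersenne prime; toy-sized, not for real use

def h(item):
    # Hash an item into the multiplicative group mod P.
    d = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(d, "big") % (P - 2) + 2

def keygen():
    # Random exponent coprime to P - 1, so x -> x**k mod P is a bijection.
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

list_a = ["alice@example.com", "bob@example.com", "carol@example.com"]
list_b = ["bob@example.com", "dave@example.com", "carol@example.com"]

ka, kb = keygen(), keygen()

# Each company blinds its own hashed list and sends it over; the other side
# applies its own exponent. Only the doubly blinded values are ever compared.
a_blinded = [pow(h(x), ka, P) for x in list_a]
b_blinded = [pow(h(y), kb, P) for y in list_b]
a_double = {pow(v, kb, P) for v in a_blinded}
b_double = {pow(v, ka, P) for v in b_blinded}

print("Overlap size:", len(a_double & b_double))   # 2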

But why go to the effort? Many peer-to-peer technologies raise this question. So you’ve eliminated a third party; what’s so great about that? If you can send someone cryptocurrency, for example, without going through an intermediary like a bank or credit card company, what good is that if the transaction fees are no lower?

It’s often not worth using sophisticated technology to eliminate a trusted third party, but sometimes it is. Here are some reasons the technology might be worth using.

Transaction speed

The two companies hiring a third party to compare their lists have to wait on the third party, and the amount of time they need to wait is beyond their control. Maybe that’s acceptable for a one-time transaction, but not for repeated transactions. With homomorphic encryption, transactions could be automated and the only delay would be computation time.

Reproducibility

Sharing limited information via encryption reduces legal risk. If either party later accuses the other of providing incorrect information, the accused party can demonstrate that the software applied to the data gives the reported result.

Trust

To paraphrase Bob Dylan, you gotta trust somebody. Some technologies are labeled “zero-trust” or “trust no one,” but these terms need to be understood in context. When a company asks you to run a particular piece of software on your proprietary data and share the results, you have to trust that the software is not malicious or buggy.

Instead of trusting that a third party holding your data is honest and competent, you trust that a third party developing software is honest and competent. You have to decide that the software product is trustworthy. You might test the software on some sample data. Maybe you inspect the source code if it’s available. But at some point you have to trust the software and the context it runs in.


RSA security in light of news

A recent article reported on the successful factoring of a 512-bit RSA key. The process took $8 worth of rented computing power. What does that say about the security of RSA encryption?

The following Python function estimates the number of computational steps required to factor a number b bits long using the best known algorithm. We’re going to take a ratio of two such estimates, so proportionality constants will cancel.

from math import exp, log

def f(b):
    # Best known algorithm is the general number field sieve; (8/3)**(2/3) equals (64/9)**(1/3).
    logn = b*log(2)
    return exp((8/3)**(2/3) * logn**(1/3) * (log(logn))**(2/3))

The minimum recommended RSA key size is 2048 bits. The cost ratio for factoring a 2048-bit key to a 512-bit key is f(2048) / f(512), which is on the order of 10^16. So factoring a 2048-bit key would take 80 quadrillion dollars.
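For instance, checking that ratio with the function above:

ratio = f(2048) / f(512)
print(ratio)        # on the order of 1e16
print(8 * ratio)    # scaling the $8 cost: on the order of the 80 quadrillion dollar figure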

This is sort of a back-of-the-envelope estimate. There are things it doesn’t take into account. If a sophisticated and highly determined entity wanted to factor a 2048-bit number, they could probably do so for less than 80 quadrillion dollars. But it’s safe to say that the people who factored the 512-bit key are unlikely to factor a 2048-bit key by the same approach.