On Making Databases Run Faster

Database technology is a mature field, and techniques for optimizing databases are well understood. However, surprises can still happen.

Certain performance optimizations you might expect to be automatic are not. I’m working with a legacy code developed some time ago, before modern notions of separation of concerns between business logic and data storage. The code runs slower than you’d expect, and some have wondered why.

Profiling the code revealed that the slowdown was not in the core computation but in the reading and writing of the backend database, which occur frequently when this code executes.

My first thought was to run with the database on a RAM disk, which would give higher bandwidth and lower latency than a spinning disk or SSD. This helped a little, but not much.

As a short-term fix I ended up writing in-memory hash tables as an interposer between the code and the database. This caches commonly accessed values and thus reduces database accesses.
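To illustrate the idea, here is a minimal sketch of such an interposer, written in Python for brevity. The names db_read and db_write are hypothetical stand-ins for the legacy code’s actual database access routines.

# Minimal sketch of an in-memory cache interposed between code and database.
# db_read and db_write are hypothetical stand-ins for the real access routines.
class CachedStore:
    def __init__(self, db_read, db_write):
        self._db_read = db_read
        self._db_write = db_write
        self._cache = {}                    # in-memory hash table

    def read(self, key):
        if key not in self._cache:          # go to the database only on a miss
            self._cache[key] = self._db_read(key)
        return self._cache[key]

    def write(self, key, value):
        self._cache[key] = value            # keep the cache current
        self._db_write(key, value)          # write through to the database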

I would’ve thought high-speed RAM caching of values would be default behavior for a database manager. A principle of interface design is to make the defaults as useful as possible. But apparently not in this case.

Thankfully my fix gave over 10X speedup in application launch time and 6X speedup in the core operation of the code.

The project team is moving toward SQLite for database management in the near future. SQLite has perhaps a dozen or so settings available for optimizations like this. However, early experiments with SQLite for this case show that more fundamental structural changes to the code will also be needed to improve its database access patterns.
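For reference, several of SQLite’s tuning knobs are exposed as PRAGMA statements. The settings below are illustrative examples people commonly reach for, not a tuned recommendation for this particular code, and the database file name is hypothetical.

import sqlite3

conn = sqlite3.connect("mydata.db")              # hypothetical database file
cur = conn.cursor()
cur.execute("PRAGMA cache_size = -65536")        # ~64 MB page cache (negative means KiB)
cur.execute("PRAGMA journal_mode = WAL")         # write-ahead logging; readers don't block writers
cur.execute("PRAGMA synchronous = NORMAL")       # fewer fsyncs, at some durability cost
cur.execute("PRAGMA temp_store = MEMORY")        # keep temporary tables and indices in RAM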

As with code optimization in general, you might be inclined to think the system (compiler, database manager, etc.) will “do it all for you.” But often it doesn’t.

Code Profiling Without a Profiler

Making your code run faster starts with understanding where the runtime is actually spent. But suppose, for whatever reason, the profiling tools won’t work?

I recently used MS Visual Studio on a legacy C++ code. When I attempted to profile, the code crashed shortly after startup, though otherwise it ran fine for both the release and debug build targets. The cause of the problem was not immediately apparent.

If all else fails, using manual timers can help. The idea is to find a high-accuracy system wallclock timer function and use this to read the time before and after some part of the code you want to time. One can essentially apply “bisection search” to the code base to look for the code hot spots. See below for an example.

This can be useful in various situations. Codes in complex languages (or even mixed languages in the code base) can have unusual constructs that break debuggers or profilers. Also, exotic hardware like embedded systems, GPUs or FPGAs may lack full profiler support. Additionally, brand new hardware releases often lack mature tool support, at least initially.

Furthermore, profiling tools themselves, though helpful for getting a quick snapshot of the performance breakdown of each function in the code, have their own limitations. Profilers work either by instrumenting the executable code or by sampling. Instrumenting adds overhead, since the system timer is called on entry to and exit from every function, and that overhead can skew the timings. Instrumenting also breaks function inlining, often reducing performance.

Sampling, on the other hand, can be inaccurate if the sample rate is too low, or can distort the runtime if the sampling frequency is too high. Manual timers sidestep these problems because they can be applied surgically to specific parts of the code (though some profilers do let you turn profiling on and off at different points in the code).

Resorting to manual timing of code sections is a messy business. But sometimes it’s the only thing that will work.

Visual Studio C++ Code Example

// mycode.h

#include "cstdio"
#include "cstdarg"

// Get time of day - elapsed seconds
static double gtod() {   
    LARGE_INTEGER ctr, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&ctr);
    return static_cast<double>(ctr.QuadPart) / static_cast<double>(freq.QuadPart);
}   
    
// Convenience function for printing to file
static void FilePrintf(const char* format, ...) {   
    char buffer[1024];
    va_list args;
    va_start(args, format);
    vsnprintf(buffer, sizeof(buffer), format, args);
    va_end(args);
    FILE* myoutfile = fopen("mytimingsfile.txt", "a");
    if (myoutfile == NULL)   // silently skip output if the file can't be opened
        return;
    fprintf(myoutfile, "%s", buffer);
    fclose(myoutfile);
}
    
// Storage for timer
extern double g_timer;

// mycode.cpp

#include "mycode.h"

// Initialization for timer
double g_timer = 0;

int main() {

    // ...
    g_timer = 0;

    for (int i=0; i<n; ++i) {
        // ...
        const double t1 = gtod();
        my_expensive_function();
        g_timer += gtod() - t1;
        // ...
    }

    FilePrintf("my_expensive_function runtime: %.6f seconds.\n", g_timer);
    g_timer = 0;

    // ...

    return 0;
}

Rich Sutton’s bitter lesson of AI

This morning I read Alberto Romero’s article Grok 3: Another Win For The Bitter Lesson. I wasn’t clear on what exactly the bitter lesson was, so I followed the link to Romero’s previous article GPT-4: The Bitterer Lesson, which led to Rich Sutton’s original article The Bitter Lesson.

Sutton opens his article by saying

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

Clear enough, except you have to read further to get the “compared to what.” I asked Grok 3 to read the article and tell me exactly what the bitter lesson is. It begins with this explanation:

The “Bitter Lesson” is the principle that, in the field of artificial intelligence, the most effective and lasting progress has historically come from methods that rely on leveraging ever-increasing computational power rather than depending on human ingenuity to craft domain-specific knowledge or shortcuts. …

and concludes with the following pithy summary of its summary:

In essence, the Bitter Lesson is: computation trumps human-crafted specialization in AI, and embracing this, though humbling, is key to future success.

Sutton supports his thesis with examples from chess, go, speech recognition, and computer vision.

In some sense this is the triumph of statistics over logic. All else being equal, collecting enormous amounts of data and doing statistical analysis (i.e. training an AI model) will beat out rule-based systems. But of course all else is often not equal. Logic is often the way to go, even on big problems.

When you do know what’s going on and can write down a (potentially large) set of logical requirements, exploiting that knowledge via techniques like SAT solvers and SMT solvers is the way to go. See, for example, Wayne’s article on constraint programming. But if you want to build the Everything Machine, software that aims to be all things to all people, statistics is the way to go.

Sometimes you have a modest amount of relevant data, and you don’t have 200,000 Nvidia H100 GPUs. In that case you’re much better off with classical statistics than AI. Classical models are more robust and easier to interpret. They are also not subject to hastily written AI regulation.

DeepSeek-R1: Do we need less compute now?


The reactions to the new DeepSeek-R1 AI model in recent days seem limitless. Some say it runs so much faster than existing models that we will no longer need the billions of dollars in compute hardware that big tech is preparing to buy.

Is that plausible?

To get an answer, we need only look back at the experience of the recently completed Exascale Computing Project. This large-scale multi-lab project was tasked with developing technology (primarily software) to prepare for exascale computing, which has recently been achieved by Frontier, Aurora and El Capitan.

During the course of the project, the science teams discovered various algorithm and implementation improvements, leading to as much as a 60X speedup or more, over and above the speedups possible from hardware alone [1]. In response, are the teams just running the very same problems faster on older hardware? No; instead, they are now able to run much, much larger problems than previously possible, exploiting both hardware and software improvements.

Or suppose today there were no such thing as the fast Fourier transform (FFT) and scientists were computing Fourier transforms using (essentially) large dense matrix-vector products. If someone then discovered the FFT, I’d guarantee you that scientists would not only say, (1) “Wow, now I can run my existing problems much, much faster,” but also, (2) “Wow, now I can run problems much larger than I ever dreamed and solve problems larger than I could have ever imagined!”

Paradoxically, faster algorithms might even increase the demand for newer, faster hardware. For example, a new faster algorithm for designing medications to cure cancer might be judged so important that it’s worth building the largest machine possible to run it effectively.

All this is not to say whether you should buy or sell Nvidia stock right now. However, it does mean that there is no simplistic argument that faster algorithms and implementations necessarily lead to lower spend on computing hardware. History shows that sometimes this is not true at all. The smart money, on the other hand, is on research teams that are able to exploit any and every new discovery to improve what is possible with their codes, whether by hardware, data, code optimizations or algorithms.

Notes

[1] See slide 9 from Doug Kothe’s talk, “Exascale and Artificial Intelligence: A Great Marriage”. The “Figure of Merit” (FOM) number represents speedup of science output from an application compared to an earlier baseline system. Specifically, a FOM speedup of 50X is the anticipated speedup from baseline due to efficient use of hardware only, for example, on Frontier compared to the earlier OLCF Titan system.

Tricks for radix conversion by hand

The simplest trick for converting from one base to another is grouping. To convert between base b and base b^k, group digits in sets of k and convert one group at a time. To convert from binary to octal, for instance, group bits in sets of three, starting from the right end, and convert each group to octal.

11110010₂ → (11)(110)(010) → 362₈

For an example going the other direction, let’s convert 476 in base nine to base three.

476₉ → (11)(21)(20) → 112120₃
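Here is the grouping trick in code, a quick Python sketch rather than an optimized converter: merge_base goes from base b to base b^k (binary to octal, say) and split_base goes the other way, reproducing the two examples above.

def merge_base(digits, b, k):
    # Base b -> base b**k: group k digits at a time, starting from the right.
    out = []
    digits = list(digits)
    while digits:
        group, digits = digits[-k:], digits[:-k]
        value = 0
        for d in group:
            value = b * value + d
        out.append(value)
    return out[::-1]

def split_base(digits, b, k):
    # Base b**k -> base b: expand each digit into a group of k base-b digits.
    out = []
    for d in digits:
        group = []
        for _ in range(k):
            group.append(d % b)
            d //= b
        out.extend(reversed(group))
    while len(out) > 1 and out[0] == 0:          # drop leading zeros
        out.pop(0)
    return out

print(merge_base([1, 1, 1, 1, 0, 0, 1, 0], 2, 3))   # [3, 6, 2], i.e. 362 in octal
print(split_base([4, 7, 6], 3, 2))                  # [1, 1, 2, 1, 2, 0], i.e. 112120 in base 3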

In general, conversion between bases is too tedious to do by hand, but one important case where it’s a little easier than it could be is converting between decimal and octal. In combination with the grouping trick above, this means you could, for example, convert between decimal and binary by first converting decimal to octal. Then the conversion from octal to binary is trivial.

The key to converting between decimal and octal is to exploit the fact that 10 = 8 + 2, so powers of 10 become powers of (8 + 2), or powers of 8 become powers of (10 − 2). These tricks are easier to carry out than to explain. You can find descriptions and examples in Knuth’s TAOCP, volume 2, section 4.4.

Knuth cites a note by Walter Soden from 1953 as the first description of the trick for converting octal to decimal.
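One way to see the octal-to-decimal direction in code: since 8x + d = 10x + d − 2x, you can append each octal digit as if working in base 10 and then subtract twice the previous total. This is only a minimal sketch of the underlying identity, not the full pencil-and-paper procedure Knuth describes.

def octal_to_decimal(digits):
    # Append the next digit as if in base 10, then subtract twice the old total,
    # using 8x + d = 10x + d - 2x.
    total = 0
    for d in digits:
        total = (10 * total + d) - 2 * total
    return total

print(octal_to_decimal([3, 6, 2]))   # 242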

The trick for moving between base 9 and base 10 (or by grouping, between base 3 and base 10) is simpler and left as an exercise by Knuth. (Problem 12 in section 4.4, with a solution given at the back of the volume.)


Double rounding

The previous post started by saying that rounding has a surprising amount of detail. An example of this is double rounding: if you round a number twice, you might not get the same result as if you rounded directly to the final precision.

For example, let’s say we’ll round numbers ending in 0, 1, 2, 3, or 4 down, and numbers ending in 5, 6, 7, 8, or 9 up. Then if we have a number like 123.45 and round it to one decimal place we have 123.5, and if we round that to an integer we have 124. But if we had rounded 123.45 directly to an integer we would have gotten 123. This is not a mere curiosity; it comes up fairly often and has been an issue in lawsuits.
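Here is a quick demonstration using Python’s decimal module with the round-half-up rule described above:

from decimal import Decimal, ROUND_HALF_UP

x = Decimal("123.45")

# Round twice: to one decimal place, then to an integer.
twice = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
twice = twice.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# Round once, directly to an integer.
once = x.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

print(twice, once)   # 124 123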

The double rounding problem cannot happen in odd bases. So, for example, if you have some fraction represented in base 7 and you round it first from three figures past the radix point to two, then from two to one, you’ll get the same result as if you directly rounded from three figures to one. Say we start with 4231.243₇. If we round it to two places we get 4231.24₇, and if we round again to one place we get 4231.3₇, the same result we would get by rounding directly from three places to one.
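To check the base 7 example with exact rational arithmetic, here is a small helper written just for this illustration. It rounds a list of fractional digits in a given base to a specified number of places, rounding halves up; with finitely many digits in an odd base a tie cannot occur.

from fractions import Fraction

def round_digits(digits, base, places):
    # Round base-`base` fractional digits [d1, d2, ...] to `places` digits.
    x = sum(Fraction(d, base**(k + 1)) for k, d in enumerate(digits))
    scaled = x * base**places
    n = int(scaled) + (1 if scaled - int(scaled) >= Fraction(1, 2) else 0)
    out = []
    for _ in range(places):
        n, r = divmod(n, base)
        out.append(r)
    return out[::-1]

frac = [2, 4, 3]                                     # .243 in base 7
print(round_digits(frac, 7, 2))                      # [2, 4] -> .24
print(round_digits(round_digits(frac, 7, 2), 7, 1))  # [3]    -> .3
print(round_digits(frac, 7, 1))                      # [3]    -> .3, same as rounding twice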

The reason this works is that you cannot represent ½ by a finite expression in an odd base.

A magical land where rounding equals truncation

Rounding numbers has a surprising amount of detail. It may seem trivial but, as with most things, there is a lot more to consider than is immediately obvious. I expect there have been hundreds if not thousands of pages devoted to rounding in IEEE journals.

An example of the complexity of rounding is what William Kahan called The Tablemaker’s Dilemma: there is no way in general to know in advance how accurately you’ll need to compute a number in order to round it correctly.

Rounding can be subtle in any number system, but there is an alternative number system in which it is a little simpler than in base 10. It’s base 3, but with a twist. Instead of using 0, 1, and 2 as “digits”, we use −1, 0, and 1. This is known as the balanced ternary system: ternary because of base 3, and balanced because the digits are symmetrical about 0.

We need a symbol for −1. A common and convenient choice is to use T. Think of moving the minus sign from in front of a 1 to on top of it. Now we could denote the number of hours in a day as 10T0 because

1 \times 3^3 + 0 \times 3^2 + (-1)\times 3 + 0 = 24

A more formal way of describing the balanced ternary representation of a number x is as a set of coefficients t_k such that

x = \sum_{k=-\infty}^\infty t_k 3^k

with the restriction that each t_k is in the set {−1, 0, 1}.

Balanced ternary representation has many interesting properties. For example, positive and negative numbers can all be represented without a minus sign. See, for example, Brian Hayes’ excellent article Third Base. The property we’re interested in here is that to round a balanced ternary number to the nearest integer, you simply lop off the fractional part. Rounding is the same as truncation. To see this, note that the largest possible fractional part is a sequence of all 1s, which represents ½:

\frac{1}{3} + \frac{1}{3^2} + \frac{1}{3^3} + \cdots = \frac{1}{2}

Similarly, the most negative possible fractional part is a sequence of all Ts, which represents −½. So unless the fractional part is exactly equal to ½, truncating the fractional part rounds to the nearest integer. If the fractional part is exactly ½ then there is no nearest integer but two integers that are equally near.
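To make this concrete, here is a small Python check, written for this post rather than taken from any library. It converts integers to balanced ternary, attaches random finite fractional digits (which always represent a value strictly between −½ and ½), and verifies that dropping those digits, i.e. truncating, leaves the nearest integer.

from fractions import Fraction
import random

def bt_int(n):
    # Balanced ternary digits of an integer, most significant first.
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        r = n % 3
        if r == 2:
            digits.append(-1)
            n = (n + 1) // 3
        else:
            digits.append(r)
            n //= 3
    return digits[::-1]

def bt_value(int_digits, frac_digits):
    # Exact value represented by integer digits followed by fractional digits.
    x = Fraction(0)
    for d in int_digits:
        x = 3 * x + d
    for k, d in enumerate(frac_digits, start=1):
        x += Fraction(d, 3**k)
    return x

assert bt_int(24) == [1, 0, -1, 0]               # 10T0, as above

random.seed(1)
for _ in range(1000):
    n = random.randint(-100, 100)
    frac = [random.choice([-1, 0, 1]) for _ in range(8)]
    x = bt_value(bt_int(n), frac)
    assert abs(x - n) < Fraction(1, 2)           # a finite fraction lies strictly in (-1/2, 1/2)
    assert abs(x - n) < abs(x - (n - 1))         # so n, the truncation, is the nearest integer
    assert abs(x - n) < abs(x - (n + 1))
print("Truncating the fractional digits always gives the nearest integer.")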


Entering Russian characters in Vim with digraphs

The purpose of this post is to expand on the following sentence [1]:

Russian letters are created by entering [Ctrl-k followed by] a corresponding Latin letter followed by an equals sign =, or, in a few places, a percent sign %.

The Russian alphabet has 33 letters, so there couldn’t be a Latin letter for every Russian letter. Also, there are Latin letters that don’t have a Russian counterpart and vice versa. So the mapping can’t be simple. But still, the above summary is nice: try control-k followed by the English analog and the equal sign. If that doesn’t work, try a percent sign instead.

Which Latin letters does Vim choose as corresponding to Russian letters? Does it go by sound or appearance? For example, the Russian letter Н looks like a Latin H but it sounds like a Latin N. Vim goes by sound. You would enter the Russian letter Н by typing Ctrl-k N =.

For full details, see the Vim documentation :h digraph-table. I give a simplified excerpt from the documentation below. I just look at capital letters because the lower case letters are analogous. All the official Unicode names begin with CYRILLIC CAPITAL LETTER and so I cut that part out.

char    digraph hex     official name 
А       A=      0410    A
Б       B=      0411    BE
В       V=      0412    VE
Г       G=      0413    GHE
Д       D=      0414    DE
Е       E=      0415    IE
Ё       IO      0401    IO
Ж       Z%      0416    ZHE
З       Z=      0417    ZE
И       I=      0418    I
Й       J=      0419    SHORT I
К       K=      041A    KA
Л       L=      041B    EL
М       M=      041C    EM
Н       N=      041D    EN
О       O=      041E    O
П       P=      041F    PE
Р       R=      0420    ER
С       S=      0421    ES
Т       T=      0422    TE
У       U=      0423    U
Ф       F=      0424    EF
Х       H=      0425    HA
Ц       C=      0426    TSE
Ч       C%      0427    CHE
Ш       S%      0428    SHA
Щ       Sc      0429    SHCHA
Ъ       ="      042A    HARD SIGN
Ы       Y=      042B    YERU
Ь       %"      042C    SOFT SIGN
Э       JE      042D    E
Ю       JU      042E    YU
Я       JA      042F    YA

Note that the end of the alphabet is more complicated than simply using a Latin letter and either an equals sign or a percent sign. Also, the table is in alphabetical order, which doesn’t quite correspond to Unicode numerical order because of a quirk with the letter Ё (U+0401) explained here.

[1] Arnold Robbins and Elbert Hannah. Learning the vi & Vim Editors, 8th edition

Why eliminate trusted third parties?

Suppose one company would like to buy another company’s client list, but only if the lists don’t overlap too much. Neither company wants to hand over their list to the other before a sale takes place. What can they do?

A low-tech solution would be for both parties to provide their client lists to a trusted third party who will report back how much the lists overlap. That may be the best thing to do.

But it is possible to solve this problem without a trusted third party. With homomorphic encryption, the companies can exchange encrypted versions of their client lists that will allow both to calculate the amount of overlap without revealing any further information.
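To make the idea concrete, here is a toy sketch of computing the overlap size without either side revealing its list. It uses commutative exponentiation (a Diffie-Hellman style private set intersection) rather than full homomorphic encryption, and it is only a sketch: the prime, the hashing, and the example lists are chosen for brevity, not for security.

import hashlib, math, secrets

P = 2**127 - 1                                # a Mersenne prime; toy-sized, not for real use

def h(item):
    # Hash an item into the multiplicative group mod P.
    d = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(d, "big") % (P - 2) + 2

def keygen():
    # Random exponent coprime to P - 1, so x -> x**k mod P is a bijection.
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

list_a = ["alice@example.com", "bob@example.com", "carol@example.com"]
list_b = ["bob@example.com", "dave@example.com", "carol@example.com"]

ka, kb = keygen(), keygen()

# Each company blinds its own hashed list and sends it over; the other side
# applies its own exponent. Only the doubly blinded values are ever compared.
a_blinded = [pow(h(x), ka, P) for x in list_a]
b_blinded = [pow(h(y), kb, P) for y in list_b]
a_double = {pow(v, kb, P) for v in a_blinded}
b_double = {pow(v, ka, P) for v in b_blinded}

print("Overlap size:", len(a_double & b_double))   # 2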

But why go to the effort? Many peer-to-peer technologies raise this question. So you’ve eliminated a third party; what’s so great about that? If you can send someone cryptocurrency, for example, without going through an intermediary like a bank or credit card company, what good is that if the transaction fees are no lower?

It’s often not worth using sophisticated technology to eliminate a trusted third party, but sometimes it is. Here are some reasons the technology might be worth using.

Transaction speed

The two companies hiring a third party to compare their lists have to wait on the third party, and the amount of time they need to wait is beyond their control. Maybe that’s acceptable for a one-time transaction, but not for repeated transactions. With homomorphic encryption, transactions could be automated and the only delay would be computation time.

Reproducibility

Sharing limited information via encryption reduces legal risk. If either party later accuses the other of providing incorrect information, the accused party can demonstrate that the software applied to the data gives the reported result.

Trust

To paraphrase Bob Dylan, you gotta trust somebody. Some technologies are labeled “zero-trust” or “trust no one,” but these terms need to be understood in context. When a company asks you to run a particular piece of software on your proprietary data and share the results, you have to trust that the software is not malicious or buggy.

Instead of trusting that a third party holding your data is honest and competent, you trust that a third party developing software is honest and competent. You have to decide that the software product is trustworthy. You might test the software on some sample data. Maybe you inspect the source code if it’s available. But at some point you have to trust the software and the context it runs in.


RSA security in light of news

A recent article reported on the successful factoring of a 512-bit RSA key. The process took $8 worth of rented computing power. What does that say about the security of RSA encryption?

The following Python function estimates the number of computational steps required to factor a number b bits long using the best known algorithm. We’re going to take a ratio of two such estimates, so proportionality constants will cancel.

from math import exp, log

def f(b):
    # Best known algorithm is the general number field sieve; (8/3)**(2/3) equals (64/9)**(1/3).
    logn = b*log(2)
    return exp((8/3)**(2/3) * logn**(1/3) * (log(logn))**(2/3))

The minimum recommended RSA key size is 2048 bits. The cost ratio for factoring a 2048-bit key to a 512-bit key is f(2048) / f(512), which is on the order of 10^16. So factoring a 2048-bit key would take 80 quadrillion dollars.
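For instance, checking that ratio with the function above:

ratio = f(2048) / f(512)
print(ratio)        # on the order of 1e16
print(8 * ratio)    # scaling the $8 cost: on the order of the 80 quadrillion dollar figure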

This is sort of a back-of-the-envelope estimate. There are things it doesn’t take into account. If a sophisticated and highly determined entity wanted to factor a 2048-bit number, they could probably do so for less than 80 quadrillion dollars. But it’s safe to say that the people who factored the 512-bit key are unlikely to factor a 2048-bit key by the same approach.