Experiences with Nvidia

Our team started working with Nvidia in early 2009 at the beginning of the ORNL Titan project. Our Nvidia contacts dealt with applications, libraries, the programming environment and performance optimization. First impressions were that their technical stance on issues was very reasonable. One obscure example: were you allowed to use “enums” in a C++ CUDA kernel? The answer would be, of course, yes, we would allow that. This was unlike some other companies that might have odd and cumbersome programming restrictions in their parallel programming models (though by now this has become a harder problem for Nvidia since there are so many software products a user might want to interoperate with).

Another example: to a colleague at Nvidia on the C++ standards committee, I mentioned that it might be too early to lock a certain feature design into the standard since hardware designs were still rapidly changing. His response was, oh, yes, we think exactly the same thing. So in short, their software judgments and decisions generally seem to be well thought out, reasonable and well informed. It sounds simple, but it is amazing how many companies have gotten this wrong.

Nvidia has made good strategic decisions. In the 2013 time frame, Intel was becoming a competitive threat with the Xeon Phi processor. Intel was several times larger than Nvidia with huge market dominance. In response, Nvidia formed a partnership with IBM–itself several times larger than Intel at the time. This came to fruition in the ORNL Summit system in 2018. In the meantime, the Xeon Phi’s OpenMP programming model, though standards-based, turned out to be difficult to write optimized code for, and Nvidia CUDA captured market share dominance of accelerated user software. Intel eventually dropped the Xeon Phi product line.

In the mid-2000s, Nvidia went all-in on CUDA. I’ve heard some project teams say they would never use CUDA, because it is nonstandard and too low-level. Many have since walked back that decision. Of course, it is often possible to write an abstraction layer on top of CUDA to make it easier to use and maintain. Newer programming models like Kokkos can also be helpful.

Nvidia also made a prescient decision early on to bet big on AI. A little later they decided to go all in on developing a huge number of software libraries to enable access to many new markets. A huge moat. AMD is trying hard to improve their software processes and catch up.

On the downside, Nvidia’s high prices are upsetting to many, from gamers to buyers of the world’s largest HPC systems. Competition from AMD and others is a good thing.

And Nvidia marketing speak is sometimes confusing. A comparison was once made claiming that a four-GPU system was more powerful than one of the world’s top CPU-only supercomputers on a very specific science problem; I’d like to see the details of that comparison. Different figures have also been given for how long it took to stand up xAI’s Colossus supercomputer, from 19 days to 122 days; one has to dig a little to find out what these figures mean. And it was widely reported last year that the GB200 NVL72 system was “30 times faster” than the H100, but this only refers to certain operations, not key performance measures like flops per watt.

Those are my takes. For more perspectives, see Tae Kim’s excellent book, The Nvidia Way, or this interview.

Thoughts on Nvidia? Please leave in the comments.

 

Most ints are not floats

All integers are real numbers, but most computer representations of integers do not equal computer representations of real numbers.

To make the statement above precise, we have to be more specific about what we mean by computer integers and floating point numbers. I’ll use int32 and int64 to refer to 32-bit and 64-bit signed integers. I’ll use float32 and float64 to refer to IEEE 754 single precision and double precision numbers, what C calls float and double.

Most int32 numbers cannot be represented exactly by a float32. All int32 numbers can be represented approximately as float32 numbers, but usually not exactly. The previous statements remain true if you replace “32” everywhere by “64.”

32 bit details

The int32 data type represents integers −2³¹ through 2³¹ − 1. The float32 data type represents numbers of the form

± 1.f × 2ᵉ

where 1 bit represents the sign, 23 bits represent f, and 8 bits represent e.

The number n = 2²⁴ can be represented by setting the fractional part f to 0 and setting the exponent e to 24. But the number n + 1 cannot be represented as a float32 because its last bit would fall off the end of f:

2²⁴ + 1 = (1 + 2⁻²⁴) 2²⁴ = 1.000000000000000000000001₂ × 2²⁴

The bits in f fill up representing the 23 zeros after the binary point. There’s no 24th bit to store the final 1.

For each value of e, there are 2²³ possible values of f. So for each of e = 24, 25, 26, …, 30 there are 2²³ representable integers, for a total of 7 × 2²³. Add to these the 2²⁴ = 2 × 2²³ non-negative integers less than 2²⁴, all of which are exactly representable.

So of the 2³¹ non-negative integers that can be represented by an int32 data type, only 9 × 2²³ can also be represented exactly as a float32 data type. This means about 3.5% of 32-bit integers can be represented exactly by a 32-bit float.
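
Here’s a quick empirical check of that boundary, using NumPy’s float32 type (a small sketch; the library choice is incidental):

    import numpy as np

    n = 2**24
    print(np.float32(n) == n)          # True: 2²⁴ is exactly representable
    print(np.float32(n + 1) == n + 1)  # False: 2²⁴ + 1 rounds back to 2²⁴
    print(9 * 2**23 / 2**31)           # 0.03515625, i.e. about 3.5%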

64 bit details

The calculations for 64-bit integers and 64-bit floating point numbers are analogous. Numbers represented by float64 data types have the form

± 1.f × 2ᵉ

where now f has 52 bits and e has 11.

Of the 2⁶³ non-negative integers that can be represented by an int64 data type, only 12 × 2⁵² can also be represented exactly as a float64 data type. This means about 0.6% of 64-bit integers can be represented exactly by a 64-bit float.

A note on Python

Python’s integers have unlimited range, while its floating point numbers correspond to float64. So there are two reasons an integer might not be representable as a float: it may be larger than the largest float, or it may require more than 53 bits of precision.
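
Both failure modes are easy to demonstrate (a small sketch; the particular numbers are only illustrative):

    n = 10**400            # too large: exceeds the largest float64, about 1.8 × 10³⁰⁸
    try:
        float(n)
    except OverflowError as e:
        print(e)           # int too large to convert to float

    m = 2**60 + 1          # within range, but needs 61 bits of precision
    print(float(m) == m)   # False: rounds to 2⁶⁰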


Test whether a large integer is a square

Years ago I wrote about a fast way to test whether an integer n is a square. The algorithm rules out a few numbers that cannot be squares based on their last (hexadecimal) digit. If the integer passes through this initial screening, the algorithm takes the square root of n as a floating point number, rounds to an integer, and tests whether the square of the result equals n.

What’s wrong with the old algorithm

The algorithm described above works fine if n is not too large. However, it implicitly assumes that the integer can be represented exactly as a floating point number. This is two assumptions: (1) it’s not too large to represent, and (2) the representation is exact.

A standard 64-bit floating point number has 1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand. (More on that here.) Numbers will overflow if they run out of exponent bits, and they’ll lose precision if they run out of significand bits. So, for example, 2¹⁰²⁴ will overflow [1]. And although 2⁵³ + 1 will not overflow, it cannot be represented exactly.

These are illustrated in Python below.

    >>> float(2**1024)
    OverflowError: int too large to convert to float
    >>> n = 2**53
    >>> float(n) == float(n + 1)
    True

A better algorithm

The following algorithm [2] uses only integer operations, and so it will work exactly for arbitrarily large integers in Python. It’s a discrete analog of Newton’s square root algorithm.

    def intsqrt(N):
        # Integer Newton's method for the floor of the square root of N.
        # Start at n = N, which is at least the root, and descend.
        # All arithmetic is exact integer arithmetic.
        n = N
        while n**2 > N:
            n = (n + N//n) // 2
        return n

    def issquare(N):
        n = intsqrt(N)
        return n**2 == N

The function issquare will test whether N is a square. The function intsqrt is broken out as a separate function because it is independently useful: it returns ⌊√N⌋.

The code above correctly handles examples that the original algorithm could not.

    >>> issquare(2**1024)
    True
    >>> issquare(2**54)
    True
    >>> issquare(2**54 + 1)
    False

You could speed up issquare by quickly returning False for numbers that cannot be squares on account of their final digits (in some base—maybe you’d like to use a base larger than 16) as in the original post.
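
Here is a sketch of what that screening could look like in base 16, using the functions above. The only possible values of a square mod 16 are 0, 1, 4, and 9, so three quarters of random inputs are rejected without computing a square root.

    SQUARE_HEX_DIGITS = {0, 1, 4, 9}   # possible last hex digits of a perfect square

    def issquare_fast(N):
        if N % 16 not in SQUARE_HEX_DIGITS:
            return False
        return intsqrt(N)**2 == N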

[1] The largest allowed exponent is 1023 for reasons explained here.

[2] Bob Johnson. A Route to Roots. The Mathematical Gazette, Vol. 74, No. 469 (Oct., 1990), pp. 284–285

Why use hash puzzles for proof-of-work?

A couple days ago I wrote about the problem that Bitcoin requires to be solved as proof-of-work. In a nutshell, you need to tweak a block of transactions until the SHA256 double hash of its header is below a target value [1]. Not all cryptocurrencies use proof of work, but those that do mostly use hash-based puzzles.

Other cryptocurrencies use different hashing problems, but they still use hashes. Litecoin and Dogecoin share a proof-of-work problem similar to the one Bitcoin uses, but based on the scrypt (pronounced S-crypt) hash function. Several cryptocurrencies use a hashing problem based on Equihash. Monero uses its RandomX algorithm for proof-of-work, and although this algorithm has multiple components, it ultimately solves a hashing problem. [2]

Why hash puzzles?

Why do cryptocurrencies use hashing problems for proof of work? In principle they could use any computational problem that is hard to solve but easy to verify, such as numerous problems in computational number theory.

One reason is that computer scientists are confident that quantum computing would not reduce the difficulty of solving hash puzzles, even though it would reduce the difficulty of factoring-based puzzles. Also, there is general agreement that it’s unlikely a mathematical breakthrough will find a weakness in hashing functions.

Ian Cassels said “Cryptography is a mixture of mathematics and muddle, and without the muddle the mathematics can be used against you.” Hashing is much more muddle than mathematics.

Why not do something useful?

Hash puzzles work well for demonstrating work done, but they’re otherwise useless. They keep the wheels of cryptocurrencies turning, but the solutions themselves are intrinsically worthless.

Wouldn’t it be nice if crypto miners were solving useful problems like protein folding? You could do that. In fact there is a cryptocurrency FoldingCoin that does just that. But FoldingCoin has a market cap seven orders of magnitude smaller than Bitcoin, on the order of $200,000 compared to Bitcoin’s market cap of $2T.

Cryptocurrencies that use proof of useful work have not taken off, and that may necessarily be the case: requiring practical work creates divergent incentives. If you base a currency on the difficulty of protein folding computations, for example, it would cause major disruption if a pharmaceutical company decided to get into mining protein folding cryptocurrency at a loss because it values the results.

Going back to Cassels’ remark about mathematics and muddle, practical real-world problems often have a mathematical structure. Which is a huge blessing, except when you’re designing problems to be hard. Hash-based problems have gradually become easier to solve over time, and cryptocurrencies have adjusted. But a mathematical breakthrough for solving a practical problem would have the potential to disrupt a currency faster than the market could adapt.


[1] You don’t change the transaction amounts, but you may change the order in which the transactions are arranged into a Merkle tree so that you get different hash values. You can also change a 32-bit nonce, and a timestamp, but most of the degrees of freedom you need in order to find an acceptable hash come from rearranging the tree.

[2] Both scrypt and Equihash were designed to be memory intensive and to thwart the advantage of custom ASIC mining hardware. However, people have found ways to use ASIC hardware to solve scrypt and Equihash problems. RandomX requires running a randomly generated program before hashing the output in an attempt to frustrate efforts to develop specialized mining hardware.

What is the Bitcoin proof-of-work problem?

In order to prevent fraud, anyone wanting to add a block to the Bitcoin blockchain must prove that they’ve put in a certain amount of computational work. This post will focus on what problem must be solved in order to produce proof of work.

You’ll see the proof of work function described as finding strings whose SHA256 hash value begins with a specified number of 0s. That’s sort of a zeroth-level approximation of the problem.

The string s you’re trying to find has the form data + nonce where the data comes from the block you’re wanting to add and the nonce is a value you concatenate on the end. You try different values until you get an acceptable hash value.

You’re not computing SHA256(s) but rather the double hash:

SHA256²(s) = SHA256(SHA256(s))

The only way to find such a string s is by brute force [1], and applying the hash function twice doubles the amount of brute force work needed.

And you’re not exactly trying to produce leading zeros; you’re trying to produce a value less than a target T. This is roughly the same thing, but not quite.

To illustrate this, suppose you have a 2FA fob that generates six-digit random numbers. You’ve been asked to wait until your fob generates a number less than 2025. Waiting until you have three leading zeros would be sufficient, but that would be making the task harder than it needs to be. You’d be waiting for a number less than 1000 when you’re only asked to wait for a number less than 2025.

A SHA256 hash value is a 256-bit number. If your target T is a power of 2

T = 2²⁵⁶⁻ⁿ

then finding a value of s such that

SHA256²(s) < T

really is finding an s whose (double) hash begins with n zeros, though T is not required to be a power of 2.

Finding a value of s with

SHA256²(s) < 2²⁵⁶⁻ⁿ

will require, on average, testing 2ⁿ values of s.
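
Here is a toy illustration in Python using hashlib. It is not the real Bitcoin header format (the data string and the 4-byte nonce below are stand-ins), but it shows the shape of the search: keep changing the nonce until the double hash falls below the target.

    import hashlib

    def double_sha256(b: bytes) -> int:
        digest = hashlib.sha256(hashlib.sha256(b).digest()).digest()
        return int.from_bytes(digest, "big")

    target = 2**(256 - 16)       # toy target: demands roughly 16 leading zero bits
    data = b"toy block header"

    nonce = 0
    while double_sha256(data + nonce.to_bytes(4, "little")) >= target:
        nonce += 1

    print(nonce)                 # found after roughly 2¹⁶ attempts on average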

The value of T is adjusted over time in order to keep the amount of necessary work roughly constant. As miners have become more efficient, the value of T has gone down and the amount of work has gone up. But the value of T can go up as well. It is currently fluctuating around 2¹⁷⁶, i.e. hashes must have around 80 leading zero bits.

Now here’s where things get a little more complicated. I said at the beginning that the string s has the form

s = data + nonce

where the data comes from the block you’re trying to add and the nonce is a number you twiddle in order to get the desired hash value. But the nonce is a 32-bit integer. If you need to hash on the order of 2⁸⁰ strings in order to find one with 80 leading zeros, you can’t do that just by adjusting a 32-bit nonce.

In practice you’re going to have to twiddle the contents of what I’ve called data. The data contains a Merkle tree of transactions, and you can change the hash values by adjusting the order in which transactions are arranged in the tree, in addition to adjusting the nonce.


[1] Unless someone finds a flaw in SHA256, which cryptographers have been trying to do for years and have not been able to do. And if a significant weakness is found in SHA256, it may not translate into a significant flaw in SHA256².

Naively computing sine

Suppose you need to write software to compute the sine function. You were told in a calculus class that this is done using Taylor series—it’s not, but that’s another story—and so you start writing code to implement Taylor series.

How many terms do you need? Uh, …, 20? Let’s go with that.

from math import sin, factorial

def sine(x):
    s = 0
    N = 20
    for n in range(N//2):
        m = 2*n + 1
        s += (-1)**n * x**m / factorial(m)
    return s

You test your code against the built-in sin function and things look good.

    >>> sin(1)
    0.8414709848078965
    >>> sine(1)
    0.8414709848078965

Larger arguments

Next, just for grins, you try x = 100.

    >>> sin(100)
    -0.5063656411097588
    >>> sine(100)
    -7.94697857233433e+20

Hmm. Off by 20 orders of magnitude.

So you do a little pencil and paper calculation and determine that the code should work if you set N = 300 in the code above. You try again, and now you’re off by 25 orders of magnitude. And when you try entering 100.0 rather than 100, the program crashes.

What’s going on? In theory, setting N = 300 should work, according to the remainder term in the Taylor series for sine. But the largest representable floating point number is on the order of 10³⁰⁸ and so x**m terms in the code overflow. The reason the program didn’t crash when you typed in 100 with no decimal point is that Python executed x**m in integer arithmetic.

Even if you were patient and willing to use a huge number of terms in your Taylor series, you couldn’t get accurate results for even moderately large arguments: overflow would kill you before you could drive the series truncation error down. Unless you used extended precision floating point. But that would be using heroic measures to rescue a doomed approach.

Taylor series are good local approximations. They’re only valid within a radius of convergence; outside of that radius they’re worthless even in theory. Inside the radius they may be valid in theory but not in practice, as is the case above.

Range reduction

Taylor series isn’t the best approach for evaluating sine numerically, but you could make it work if you limited yourself to a small range of angles, say 0 to π/4, and used trig identities to reduce the general problem to computing sine in the range [0, π/4]. So how do you do that?

Let’s look at reducing arguments to live in the interval [0, 2π]. In practice you’d want to reduce the range further, but that requires keeping up with what quadrant you’re in etc. We’ll just consider [0, 2π] for simplicity.

The natural thing to do is subtract off the largest multiple of 2π you can.

    >>> from math import pi
    >>> k = int(100/(2*pi))
    >>> x = 100 - k*2*pi
    >>> sine(x)
    -0.5065317636696883
    >>> sin(100)
    -0.5063656411097588

That’s better. Now we don’t overflow, and we get three correct decimal places. We could increase N and get more. However, there’s a limit to how well we could do, as the calculation below shows.

    >>> sin(x)
    -0.5063656411097495
    >>> sin(100)
    -0.5063656411097588

When we reduce 100 mod 2π and call sin(x), we get a result that differs from sin(100) in the last three decimal places. That’s not terrible, depending on the application, but the loss of precision increases with the size of x and would be a complete loss for large enough values of x.

Let’s look back at where we divided 100 by 2π:

    >>> 100/(2*pi)
    15.915494309189533

When we throw away the integer part (15) and keep the rest (.915494309189533) we lose two figures of precision. The larger the integer part, the more precision we lose.

Our naive attempt at range reduction doesn’t work so well, but there are clever ways to make range reduction work. It is possible, for example, to compute the sine of a googol (i.e. sin 10¹⁰⁰).
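
For example, mpmath (assuming it is installed) will handle the range reduction carefully, provided you give it enough working precision to represent the argument exactly:

    from mpmath import mp, mpf, sin

    mp.dps = 120        # a googol has 101 digits; allow some headroom
    x = mpf(10) ** 100
    print(sin(x))       # the sine of a googol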

At first it seems simple to reduce the range. Then when you become aware of the precision problems and think about it a while, it seems precise range reduction is impossible. But if you keep thinking about it long enough you might find a way to make it work. A lot of effort went into range reduction algorithms years ago, and now they’re part of the computational infrastructure that we take for granted.

Computing the Euler-Mascheroni Constant

The Euler-Mascheroni constant is defined as the limit

\gamma = \lim_{n\to\infty} \left(\sum_{k=1}^n \frac{1}{k} - \log n \right)

So an obvious way to try to calculate γ would be to evaluate the right-hand side above for large n. This turns out to not be a very good approach. Convergence is slow and rounding error accumulates.

A much better approach is to compute

\gamma = \lim_{n\to\infty} \left(\sum_{k=1}^n \frac{1}{k} - \log\left( n + \tfrac{1}{2}\right) \right)

It’s not obvious that you can add the extra ½ inside the logarithm, but you can: the right-hand sides of both equations above converge to γ, but the latter converges faster.

Here’s a comparison of the two methods.

The error in the first method with n = 4000 is comparable to the error in the second method with n = 16.
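
Here is a minimal sketch of that comparison (the constant below is γ to 16 digits):

    from math import log

    GAMMA = 0.5772156649015329

    def gamma_naive(n):
        return sum(1/k for k in range(1, n+1)) - log(n)

    def gamma_better(n):
        return sum(1/k for k in range(1, n+1)) - log(n + 0.5)

    print(abs(gamma_naive(4000) - GAMMA))   # on the order of 10⁻⁴
    print(abs(gamma_better(16) - GAMMA))    # also on the order of 10⁻⁴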

Even though the second equation above is better for numerical evaluation, there are much more sophisticated approaches. And this brings us to y-cruncher, software that has been used to set numerous world records for computing various constants. A new record for computing the digits of π was set a few weeks ago using y-cruncher. From the y-cruncher documentation:

Of all the things that y-cruncher supports, the Euler-Mascheroni Constant is the slowest to compute and by far the most difficult to implement. …

The Euler-Mascheroni Constant has a special place in y-cruncher. It is one of the first constants to be supported (even before Pi) and it is among the first for which a world record was set using y-cruncher. As such, y-cruncher derives its name from the gamma symbol for the Euler-Mascheroni Constant.


The bad version of a good test

Ever since 1952, the largest known primes have all had the form 2ⁿ − 1, with one exception in 1989. The reason the largest known primes have this form is that it is easier to test whether these numbers are prime than other numbers.

A number of the form 2ⁿ − 1 is called a Mersenne number, denoted Mₙ. A Mersenne prime is a Mersenne number that is also prime. There is a theorem that says that if Mₙ is prime then n must be prime.

Lehmer’s theorem

Derrick Henry Lehmer proved in 1930 that Mₚ is prime if p is an odd prime and Mₚ divides sₚ₋₂, where the s terms are defined recursively as follows. Define s₀ = 4 and sₙ = sₙ₋₁² − 2 for n > 0. This is the Lucas-Lehmer primality test for Mersenne numbers, the test alluded to above.

Let’s try this out on M₇ = 2⁷ − 1 = 127. We need to test whether 127 divides s₅. The first six values of sₙ are

4, 14, 194, 37634, 1416317954, 2005956546822746114

and indeed 127 does divide 2005956546822746114. As you may have noticed, s₅ is a big number, and s₆ will be a lot bigger. It doesn’t look like Lehmer’s theorem is going to be very useful.

The missing piece

Here’s the idea that makes the Lucas-Lehmer [1] test practical. We don’t need to know sₚ₋₂ per se; we only need to know whether it is divisible by Mₚ. So we can carry out all our calculations mod Mₚ. In our example, we only need to calculate the values of sₙ mod 127:

4, 14, 67, 42, 111, 0

Much better.

The Lucas-Lehmer test takes the remainder mod Mₚ at every step along the way when computing the recursion for the s terms. The naive version does not; it computes the s terms directly.

To make the two versions of the test explicit, here are Python implementations.

Naive code

def lucas_lehmer_naive(p):
    M = 2**p - 1
    s = 4
    for _ in range(p-2):
        s = (s*s - 2) 
    return s % M  == 0

Good code

def lucas_lehmer(p):
    M = 2**p - 1
    s = 4
    for _ in range(p-2):
        s = (s*s - 2) % M
    return s == 0

The bad version

How bad is the naive version of the Lucas-Lehmer test?

The OEIS sequence A003010 consists of the sₙ. OEIS gives the following formula:

s_n = \left\lceil (2 + \sqrt{3})^{2^n}\right\rceil

This shows that the sₙ grow extremely rapidly, which suggests the naive algorithm will hit a wall fairly quickly.

When I used the two versions of the test above to verify that the first few Mersenne primes Mₚ really are prime, the naive version got stuck after p = 19. The better version worked through p = 9689 almost immediately before slowing down noticeably.
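
For example, here is a quick sanity check of the good version. (The exponent 2 is left out because the test assumes an odd prime p.)

    mersenne_prime_exponents = [3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127]
    print(all(lucas_lehmer(p) for p in mersenne_prime_exponents))   # True
    print(lucas_lehmer(11))   # False: 2¹¹ − 1 = 2047 = 23 × 89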

Historical example

Let’s look at M₅₂₁, the first Mersenne number verified by a computer to be prime. This was done on the vacuum tube-based SWAC computer in 1952 using the smart version of the Lucas-Lehmer test.

Carrying out the Lucas-Lehmer test to show that this number is prime requires doing modular arithmetic with 521-bit numbers, which was well within the SWAC’s 9472 bits of memory.

But it would not be possible now, or ever, to verify that M₅₂₁ is prime using the naive version of the Lucas-Lehmer test because we could not store, much less compute, s₅₁₉.

Since

\log_{2} s_{519} = 2^{519} \log_{2} (2 + \sqrt{3}) \approx 2^{520}

we’d need around 2⁵²⁰ bits to store s₅₁₉. The number of bits we’d need is itself a 157-digit number. We’d need more bits than there are particles in the universe, which has been estimated to be on the order of 10⁸⁰.

The largest Mersenne prime that the SWAC could verify using the naive Lucas-Lehmer test would have been M₁₃ = 8191, which was found to be prime in the Middle Ages.
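
A back-of-the-envelope check of that claim, using the formula above for the size of the s terms (these are approximate bit counts, not a simulation of the SWAC):

    from math import log2, sqrt

    def naive_bits(p):
        # approximate number of bits needed to store s_{p-2}
        return 2**(p - 2) * log2(2 + sqrt(3))

    print(naive_bits(13))   # about 3,900 bits: fits in the SWAC's 9472 bits
    print(naive_bits(17))   # about 62,000 bits: does not fit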


[1] Édouard Lucas conjectured in 1878 the theorem that Lehmer proved in 1930.

Multiplying a matrix by its transpose

An earlier post claimed that there are practical advantages to partitioning a matrix, thinking of the matrix as a matrix of matrices. This post will give an example.

Let M be a square matrix and suppose we need to multiply M by its transpose Mᵀ. We can compute this product faster than multiplying two arbitrary matrices of the same size by exploiting the fact that MMᵀ will be a symmetric matrix.

We start by partitioning M into four blocks

M = \begin{bmatrix}A & B \\ C & D \end{bmatrix}

Then

M^\intercal = \begin{bmatrix}A^\intercal & C^\intercal \\ B^\intercal & D^\intercal \end{bmatrix}

and

MM^\intercal = \begin{bmatrix} AA^\intercal  + BB^\intercal & AC^\intercal  + BD^\intercal \\CA^\intercal + DB^\intercal & CC^\intercal + DD^\intercal \end{bmatrix}

Now for the first clever part: We don’t have to compute both of the off-diagonal blocks because each is the transpose of the other. So we can reduce our calculation by 25% by not calculating one of the blocks, say the lower left block.

And now for the second clever part: apply the same procedure recursively. The diagonal blocks in MMᵀ involve a matrix times its transpose. That is, we can partition A and use the same idea to compute AAᵀ, and do the same for BBᵀ, CCᵀ, and DDᵀ. The off-diagonal blocks require general matrix multiplications.

The net result is that we can compute MMᵀ in about 2/3 the time it would take to multiply two arbitrary matrices of the same size.
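
Here is a minimal NumPy sketch of the recursion (the function name and cutoff are my own choices). It illustrates the block structure; it will not outperform an optimized BLAS routine, which would use a dedicated symmetric rank-k update.

    import numpy as np

    def gram(M, cutoff=64):
        # Compute M @ M.T for a square matrix M by recursive 2-by-2 blocking.
        n = M.shape[0]
        if n <= cutoff or n % 2:                  # small or odd sizes: multiply directly
            return M @ M.T
        h = n // 2
        A, B = M[:h, :h], M[:h, h:]
        C, D = M[h:, :h], M[h:, h:]
        top_left = gram(A, cutoff) + gram(B, cutoff)        # A Aᵀ + B Bᵀ, recursively
        bottom_right = gram(C, cutoff) + gram(D, cutoff)    # C Cᵀ + D Dᵀ, recursively
        top_right = A @ C.T + B @ D.T                       # general matrix products
        return np.block([[top_left, top_right],
                         [top_right.T, bottom_right]])

    M = np.random.rand(512, 512)
    print(np.allclose(gram(M), M @ M.T))          # True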

Recently a group of researchers found a way to take this idea even further, partitioning a matrix into a 4 by 4 matrix of 16 blocks and doing some clever tricks. The RXTX algorithm can compute MMᵀ in about 26/41 the time required to multiply arbitrary matrices, a savings of about 5% over the recursive approach above. A 5% improvement may be significant if it appears in the inner loop of a heavy computation. According to the authors, “The algorithm was discovered by combining Machine Learning-based search methods with Combinatorial Optimization.”


A bit-twiddling marvel

Pop quiz: what does the following code do?

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

It determines whether the year y is a leap year in the Gregorian calendar, of course. :)

It’s very efficient, though I don’t imagine that would ever matter. But it’s very clever.

Gregorian vs Julian calendar

A year is a leap year in the Julian calendar if and only if it is a multiple of 4. But the Julian year is a bit too long on average to match the earth’s orbit around the sun. You get a much better fit if you add the complication that years divisible by 100 but not by 400 are not leap years.

Presumably no one reading this recalls 1900 not being a leap year, but some of you may need to know some day that 2100 is not a leap year either.

C code

The cryptic function above comes from the recent post A leap year check in three instructions by Falk Hüffner. The function is correct for the next 100 thousand years (more precisely, through 102499) and correct if we anachronistically extend the Gregorian calendar back to the year 0.

The following C code shows empirically that Hüffner’s code is correct, but you’ll have to read his post to understand why it is correct.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

bool is_leap_year(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}

int main() {
    for (uint32_t y = 0; y <= 102499; y++)
        if (is_leap_year_fast(y) != is_leap_year(y))
            printf("Exception: %d\n", y);
    return 0;
}
