Why are CUDA kernels hard to optimize?

Explosive datacenter demand has caused developers to leave no stone unturned in search of higher efficiencies. The DeepSeek team, not satisfied with Nvidia’s CUDA libraries, used a virtualized form of assembly language (PTX) to write kernel codes to accelerate their AI computations. Others have attempted to generate optimized kernels using AI, though some results have been questioned (for various attempts, see also here, here, here, here and here).

Why is it hard to write peak-speed GPU code? Writing really fast code has always been arduous, but it seems especially so for modern GPUs.

To understand the issues, my colleagues and I performed a detailed study of GPU kernel performance, across eight different GPU models from three GPU vendors [1]. The test case we considered was low precision matrix multiply, a resource-intensive operation for LLM training. We ran many, many experiments to understand what causes performance variability and why kernels sometimes run slower than you’d think they should.

For the cases we studied, we found about half a dozen different factors, but the upshot is this: modern processors like GPUs have become so complex—notably their multi-layered hierarchical memory subsystems—that it is difficult to get consistently high performance across all problem sizes a user might want to run in practice. As a result, the performance for the target problem might be surprisingly and mysteriously less than the advertised peak performance for the operation in question. The reasons might be obvious—like cache line misalignment—or more opaque. For the matrix multiply case, various issues like the need for prefetching, caching, tiling, and block size selection make it difficult for the kernel developer to optimize for every input size a user might specify.

Below is an example graphic from our paper. The color indicates the floating point operation rate for a reduced precision matrix multiply on a representative GPU using a library call. The horizontal and vertical axes refer to the matrix dimensions for the problem (see paper for details). Though some regions show performance near the theoretical peak (red), other immediately adjacent regions show problem sizes that run dramatically slower—in fact, at only about half of peak performance, or less. Presumably this is because either individual kernel performance or the selection of kernels used by the library is suboptimal. The net outcome is, if your problem lands in a “bad” region, you’re in for a big surprise: your performance will be much less than expected, and you may not understand why. All high-performing GPUs we tested showed irregular behaviors such as this [2] [3].

In the past this was not always a problem. Older architectures like the Sun SPARC or Cray vector processors, complex as they were, were simple enough that a reasonably well-tuned computational kernel might run well across most if not all inputs [4]. Today, performance is much harder to predict and can vary substantially based on the requested problem sizes.

This is a tough challenge for library developers. Whenever a new GPU model family comes out, new kernel optimization and tuning are required to give (hopefully) more consistently high performance, and some cases get more developer attention than others due to customer needs and limited developer resources. As a result, infrequently used operations do not get as much attention, but they may be the exact ones you need for your particular case [5].

Tools are available to help optimize for specific cases. The excellent Nvidia CUTLASS library exposes access to many more fine-grained options compared to the standard cuBLAS library. Those not faint of heart can try programming Nvidia GPUs at the level of PTX, or (shudder) SASS. Superoptimization might help, but only for very small code fragments, and even then there may be too many external factors influencing performance to make it effective.

Autotuning is a promising approach, though it doesn’t seem to have reached its full potential in production. AI might really help here [6]; in our own paper we had some success using machine learning methods like decision trees and random forests to model performance as a function of problem size, though our work was exploratory and not production-ready. A well-crafted general solution, it would seem, would require a lot of effort to do right. Code sustainability and maintenance are also critical; a sustainable workflow would be needed to retrain on new GPUs, new CUDA releases, and even site-specific and system-specific settings like GPU power and frequency cap policies.
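As a rough illustration of the kind of model I mean—a hypothetical sketch, not code from our paper—one could fit a random forest to benchmark measurements of matrix dimensions versus achieved throughput and then query it for sizes that were never benchmarked. The sizes and TFLOP/s numbers below are made up for illustration.

    # Hypothetical sketch: model GEMM throughput as a function of problem size.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Made-up benchmark measurements: (M, N, K) -> achieved TFLOP/s
    sizes = np.array([[1024, 1024, 1024],
                      [2048, 2048, 2048],
                      [4096, 4096, 4096],
                      [4096, 4096, 4104],   # a slightly "off" size
                      [8192, 8192, 8192]])
    tflops = np.array([150.0, 310.0, 340.0, 180.0, 350.0])

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(sizes, tflops)

    # Predict throughput for a problem size that was never benchmarked
    print(model.predict([[3072, 3072, 3080]]))

Such a model is only as good as the coverage of its training measurements, which is exactly the benchmark-coverage problem mentioned in note [3].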

Most recent AI-driven work focuses on optimizing performance for one or a few problem sizes only. A truly production-quality general purpose tool would give both 100% accurate results and top achievable performance for any input problem size (even for corner cases) or data type. This would require both optimized GPU kernels and an optimal kernel dispatcher for kernel selection. And the method would need to be robust to issues like power and frequency variabilities in production runs. This seems to be an unsolved problem at present. Solving it would be of huge benefit to the hyperscaler community.

Notes

[1] For related work from a slightly different angle, see this excellent work from Matt Sinclair’s lab.

[2] It turned out this study was helpful to us for production runs, to help us to triage an odd performance conundrum we encountered when attempting an exascale run (see here, here).

[3] Incidentally this example shows the hazards of using simplistic benchmark suites to measure GPU code performance. Unless the benchmark captures a truly large and varied set of input cases, any new optimization method proposed can artificially “overfit” performance on the tests and still underperform miserably on many user cases of interest.

[4] I once wrote a 1-D wavelet convolution kernel for a Sparc processor, using a circular register buffer and loop unrolling to minimize loads and stores, thus achieving near-peak performance. The code was correctly compiled from C to assembly, and performance for a given problem was almost precisely predictable. That was before the days of complex memory hierarchies.

[5] One vendor I know of used to take customer requests for hand tuning expensive library calls and made them run fast at the specific customer problem sizes.

[6] LLM kernel generation seems like a natural fit, particularly since LLM-generated code quality has much improved in recent months. Kernel selection and parameter selection for block size, tiling etc. might be better solved by direct training of machine learning models, or methods like this. Comparative studies on this would be informative.

 

What’s hierarchical about a hierarchical wallet?

A few days ago I wrote about what’s in a crypto wallet. In that post I said that most crypto wallets now are hierarchical deterministic (HD) wallets.  And I said that HD wallets are deterministic in the sense that they derive all their keys from a seed phrase. But in what sense are HD wallets hierarchical? That’s the topic of this post.

A warm-up story

In the game of 20 questions, one person thinks of something and another tries to guess what it is by asking up to 20 yes-no questions. I once heard the physicist John Wheeler tell of a variation of this game in which the first person did not have a definite object in mind, but decided after each question what the answer should be. For example, if someone asks “Is this person a man?” the answerer would commit to the person being a man or a woman, but would not yet decide on a particular individual.

Wheeler’s point was that quantum mechanics is like this variation on 20 questions in that the answers to questions don’t exist until the question is asked. What does this have to do with hierarchical deterministic wallets? Your private keys do not exist until you ask for them. But once you have created and used a key, a wallet will behave consistently with that creation.

The hierarchy

The hierarchy referred to in a hierarchical deterministic wallet is a set of five variables, as described in BIP-44:

  1. Purpose
  2. Coin type
  3. Account
  4. Change
  5. Address index

The meaning of the variables is explained in BIP-44. The lowest level, address index, is a sequential counter. So you can have separate sequential counters for each value of the four-tuple (purpose, coin type, account, change).

Your master key and the five variables above are inputs to a key derivation function used to create new keys as needed. Once you use a private key, a hash of its corresponding public key is memorialized on the blockchain. If it’s a Bitcoin transaction, it’s on the Bitcoin blockchain. If it’s an Ethereum transaction, it’s on the Ethereum blockchain, etc. (You can find a list of supported coin types here.)
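To make the hierarchy concrete, here is a small sketch of how the five levels are conventionally written as a derivation path, which the wallet feeds, along with the master key, into the key derivation function. The first three levels are hardened, marked with an apostrophe; coin type 0 is Bitcoin and 60 is Ethereum.

    # Sketch: format the five BIP-44 levels as a derivation path string.
    def bip44_path(purpose=44, coin_type=0, account=0, change=0, address_index=0):
        return f"m/{purpose}'/{coin_type}'/{account}'/{change}/{address_index}"

    print(bip44_path(coin_type=0, address_index=0))   # m/44'/0'/0'/0/0, first Bitcoin receiving address
    print(bip44_path(coin_type=60, address_index=2))  # m/44'/60'/0'/0/2, third Ethereum receiving address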

Your wallet does not (or at least logically need not) store all your keys. It can reason as follows. “If the master key and these hierarchical values were used, this would be the private key. And given this private key, this would be the public key, and this would be the corresponding address. Let me consult the blockchain to see whether in fact it was used.”

How would a wallet know how many transactions you’ve made under a particular branch of the hierarchy? It searches the corresponding blockchain. It first looks for a ledger entry corresponding to address index 0. Then address index 1, etc. The algorithm allows for the possibility of gaps. If it cannot find a ledger entry corresponding to index 2, it looks for index 3, etc., up to a gap of 20. After looking ahead 20 index values and finding nothing, it concludes there is nothing else to be found.
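Here is a sketch of that discovery loop. The two callables are hypothetical stand-ins supplied by the wallet: one does the key derivation for a given index, the other queries the blockchain.

    # Sketch of address discovery with a gap limit of 20.
    GAP_LIMIT = 20

    def discover_used_indices(derive_address, address_has_history):
        # derive_address(i): hypothetical BIP-32 derivation for index i
        # address_has_history(addr): hypothetical blockchain lookup
        used, gap, index = [], 0, 0
        while gap < GAP_LIMIT:
            address = derive_address(index)
            if address_has_history(address):
                used.append(index)
                gap = 0            # a hit resets the gap counter
            else:
                gap += 1
            index += 1
        return used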

Because everything is derived deterministically from the seed phrase and the hierarchical variables, you can back up a wallet by simply backing up the seed phrase.

In theory, you could carry out transactions using one brand of wallet, back it up by writing down the seed phrase, then restore the information to a different brand of wallet. In practice you may run into difficulty doing this.


What’s in your wallet?

What’s in your Bitcoin wallet? Very little. I don’t mean very little value, but very little data. If you’re a Bitcoin billionaire, your wallet still doesn’t contain very many bits.

You might reasonably expect that a crypto wallet is a container, given the analogy with an ordinary wallet, but it’s not much of a container. It’s more like a bank teller than a physical wallet. It’s primarily a piece of software that facilitates transactions.

It’s misleading to speak of a wallet as holding cryptocurrency. It holds private keys that allow you to access cryptocurrency, so in that sense it’s a password manager. Not only that, these days it’s effectively a password manager containing only one password.

When you back up a wallet, say the Exodus wallet, it asks you to store a seed phrase somewhere safe. Fine, but how do you back up your wallet, not just your seed phrase? How do you back up the data in your wallet? You don’t, because there isn’t any data in your wallet other than the seed phrase [1]. Everything is derived from your seed phrase, at least for the current style of wallet, technically known as a hierarchical deterministic (HD) wallet. The seed phrase format, and how it is turned into a binary seed, is documented in BIP-39; the derivation of keys from that seed is documented in BIP-32.
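The step from seed phrase to seed in BIP-39 is nothing exotic: PBKDF2 with HMAC-SHA512, 2048 iterations, and the salt “mnemonic” plus an optional passphrase. Here is a minimal sketch; the phrase is just an example.

    # Sketch of the BIP-39 mnemonic-to-seed step: PBKDF2-HMAC-SHA512,
    # 2048 iterations, salt "mnemonic" + optional passphrase, 64-byte output.
    import hashlib

    def bip39_seed(mnemonic, passphrase=""):
        return hashlib.pbkdf2_hmac("sha512",
                                   mnemonic.encode("utf-8"),
                                   ("mnemonic" + passphrase).encode("utf-8"),
                                   2048)

    seed = bip39_seed("legal winner thank year wave sausage worth useful legal winner thank yellow")
    print(seed.hex())  # the 64-byte seed from which all keys are derived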

The first crypto wallets held a set of private keys. But now wallets derive private keys from your seed phrase. So even though you may have used multiple private keys, the wallet doesn’t store multiple keys. It could store them, but it doesn’t need to. That’s why you can back up your wallet by only backing up your seed phrase.


[1] A wallet may contain metadata, such as notes on transactions or software configuration preferences. This data isn’t recovered if you restore your wallet from just your seed phrase. But all your private keys can be regenerated. Think of the private keys as traditional metal house keys and the metadata as a sticky note on top of the keys. If you lose your keys and the note, you’d be happy to get the keys back.

Genesis Block Easter Egg

The White House put out a position paper Strengthening American Leadership in Digital Financial Technology a few days ago. The last page of the paper contains a hex dump.

Kinda surprising to see something like that coming out of the White House, but it makes sense in the context of cryptocurrency. Presumably Donald Trump has no idea what a hex dump is, but someone around him does.

My first thought was that something was wrong because the hex codes don’t correspond to the text on the side as they would if you were opening a text file in a hex editor. But it’s not a mistake; it’s an Easter Egg.

Extracting text from image

I tried to convert the image to text using tesseract but it fell down. I’ve had good experience with tesseract in the past, but this time was disappointing.

I was skeptical that an LLM would do a better job, because LLMs use tesseract internally, or at least OpenAI did at one time. Grok 4 initially did a poor job, but it worked after I gave it more help using the following prompt.

Convert the attached image to text. It is a hex dump: all characters are hexadecimal symbols: digits and the capital letters A, B, C, D, E, or F.

Here’s the result.

01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 3B A3 ED FD 7A 7B 12 B2 7A C7 2C 3E
67 76 8F 61 7F C8 1B C3 88 8A 51 32 3A 9F B8 AA
4B 1E 5E 4A 29 AB 5F 49 FF FF 00 1D 1D AC 2B 7C
01 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01 04 45 54 68 65 20 54 69 6D 65 73 20 30 33 2F
4A 61 6E 2F 32 30 30 39 20 43 68 61 6E 63 65 6C
6C 6F 72 20 6F 6E 20 62 72 69 6E 6B 20 6F 66 20
73 65 63 6F 6E 64 20 62 61 69 6C 6F 75 74 20 66
6F 72 20 62 61 6E 6B 73 FF FF FF FF 01 00 F2 05
2A 01 00 00 00 43 41 04 67 8A FD B0 FE 55 48 27
19 67 F1 A6 71 30 B7 10 5C D6 A8 28 E0 39 09 A6
79 62 E0 EA 1F 61 DE B6 49 F6 BC 3F 4C EF 38 C4
F3 55 04 E5 1E C1 12 DE 5C 38 4D F7 BA 0B 8D 57
8A 4C 70 2B 6B F1 1D 5F AC 00 00 00 00

The Genesis Block

The hex content is the header of the Bitcoin “Genesis Block,” the first block in the Bitcoin blockchain. You can find a full breakdown of the bytes here.

The defining characteristic of a blockchain is that it is a chain of blocks. The blocks are connected by each block containing the cryptographic hash of the previous block’s header. For Bitcoin, the hash starts in the 5th byte and runs for the next 32 bytes. You see a lot of zeros at the top of the hex dump above because the Genesis Block had no predecessor on the chain.

Easter Egg Within an Easter Egg

Quoting the hex dump of the Genesis Block in the position paper was an Easter Egg for Bitcoin enthusiasts. The Genesis Block contains a sort of Easter Egg of its own.

The section of the header

    54 68 65 ... 6E 6B 73

is the ASCII text

The Times 03/Jan/2009 Chancellor on brink of second bailout for banks

Satoshi Nakamoto quoted the headline from The Times of January 3, 2009 to prove that the Genesis Block was created on or after that date. The headline seems to also be a sort of Easter Egg, an implicit commentary on the instability of fractional-reserve banking.
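You can check the decoding by pulling that stretch of bytes out of the dump:

    # Decode the headline bytes quoted above back to ASCII.
    headline_hex = (
        "54 68 65 20 54 69 6D 65 73 20 30 33 2F 4A 61 6E 2F 32 30 30 39 20"
        " 43 68 61 6E 63 65 6C 6C 6F 72 20 6F 6E 20 62 72 69 6E 6B 20 6F 66"
        " 20 73 65 63 6F 6E 64 20 62 61 69 6C 6F 75 74 20 66 6F 72 20 62 61"
        " 6E 6B 73"
    )
    print(bytes.fromhex(headline_hex).decode("ascii"))
    # The Times 03/Jan/2009 Chancellor on brink of second bailout for banks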


Looking back at Martin Gardner’s RSA article

Public key cryptography came to the world’s attention via Martin Gardner’s Scientific American article from August 1977 on RSA encryption.

The article’s opening paragraph illustrates what a different world 1977 was in regard to computation and communication.

… in a few decades … the transfer of information will probably be much faster and much cheaper by “electronic mail” than by conventional postal systems.

Gardner quotes Ron Rivest [1] saying that breaking RSA encryption by factoring the product of two 63-digit primes would take about 40 quadrillion years. The article included a challenge, a message encrypted using a 129-digit key, the product of a 64-digit prime and a 65-digit prime. Rivest offered a $100 prize for decrypting the message.

Note the tension between Rivest’s estimate and his bet. It’s as if he were saying “Based on the factoring algorithms and computational hardware now available, it would take forever to decrypt this message. But I’m only willing to bet $100 that the estimate holds up.”

The message was decrypted 17 years later. Unbeknownst to Gardner’s readers in 1977, the challenge message was

THE MAGIC WORDS ARE SQUEAMISH OSSIFRAGE

encoded using 00 for space, 01 for A, 02 for B, etc. It was decrypted in 1994 by a group of around 600 people using around 1600 computers. Here is a paper describing the effort. In 2015 Nat McHugh factored the key in 47 minutes using 8 CPUs on Google Cloud.
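Here is a short script illustrating that encoding scheme:

    # Encode/decode the RSA-129 plaintext scheme: 00 = space, 01 = A, ..., 26 = Z.
    def encode(msg):
        return "".join("00" if c == " " else f"{ord(c) - ord('A') + 1:02d}" for c in msg)

    def decode(digits):
        pairs = (digits[i:i+2] for i in range(0, len(digits), 2))
        return "".join(" " if p == "00" else chr(int(p) + ord("A") - 1) for p in pairs)

    n = encode("THE MAGIC WORDS ARE SQUEAMISH OSSIFRAGE")
    print(n)          # the decimal digits of the plaintext number
    print(decode(n))  # round-trips back to the magic words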

The RSA algorithm presented in Gardner’s article is much simpler than its current implementation, though the core idea remains unchanged. Now we use much larger public keys, the product of two 1024-bit (308-digit) primes or larger. Also, RSA isn’t used to encrypt messages per se; RSA is used to exchange symmetric encryption keys, such as AES keys, which are then used to encrypt messages.

RSA is still widely used, though elliptic curve cryptography (ECC) is taking its place, and eventually both RSA and ECC will presumably be replaced with post-quantum methods.


[1] I met Ron Rivest at the Heidelberg Laureate Forum in 2013. When he introduced himself I said something like “So you’re the ‘R’ in RSA?” He’s probably tired of hearing that, but if so he was too gracious to show it.

Factoring RSA100

Earlier today I wrote about factoring four 255-bit numbers that I needed for a post. Just out of curiosity, I wanted to see how long it would take to factor RSA 100, the smallest of the factoring challenges posed by RSA Laboratories in 1991. This is a 100-digit (330-bit) number that is the product of two primes.

I used the CADO-NFS software. The software was developed in France, and CADO is a French acronym for Crible Algébrique: Distribution, Optimisation. NFS stands for number field sieve, the fastest known algorithm for factoring numbers with more than about 100 digits.

RSA 100 was first factored in 1991 using a few days of compute time on a MasPar MP-1 computer, a machine that cost $500,000 at the time, equivalent to around $1,250,000 today.

My effort took about 23 minutes (1376 seconds) on a System 76 Meerkat mini that I paid $600 for in 2022.

The MP1 was about the size of a refrigerator. The Meerkat is about 3″ × 3″ × 1.5″.

Most legible font for WIF

Bitcoin’s Wallet Import Format (WIF) is essentially Base58 encoding with a checksum. (See the next post for details.) It is meant to be a human-friendly way to display cryptographic private keys. It’s not that friendly, but it could be worse.

The checksum (the first four bytes of SHA256 applied twice) is appended before the conversion to Base58, so the final result consists of only Base58 characters.
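Here is a minimal sketch of that construction for a mainnet private key, assuming the 0x80 version byte and omitting the flag byte used for compressed public keys. It uses the Base58 alphabet shown just below.

    # Sketch of WIF encoding: version byte + key, then the first four bytes of a
    # double SHA-256 as a checksum, all converted to Base58.
    import hashlib

    B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

    def base58(data):
        n = int.from_bytes(data, "big")
        s = ""
        while n:
            n, r = divmod(n, 58)
            s = B58[r] + s
        return "1" * (len(data) - len(data.lstrip(b"\x00"))) + s  # leading zero bytes become '1'

    def to_wif(private_key_bytes):
        payload = b"\x80" + private_key_bytes
        checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
        return base58(payload + checksum)

    print(to_wif(bytes(range(32))))  # a made-up 32-byte "key", for illustration only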

The Base58 alphabet is

    123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz

and so some easily-confused characters have been removed. For example, the lower case letter o is included but the upper case O and the numeral 0 are not included. The lower case letter l has been removed so that it won’t be confused with the numeral 1.

But there are still a few letters that could be confused:

    1ij 2Zz Cc Kk 5Ss Uu Vv Ww Xx Yy Zz

I was curious what font might make these letters the most distinct, and the best I found was IBM Plex Mono Italic.

Similar letters in IBM Plex Mono italic

The pairs Cc and Ss are still similar, but the rest of the upper and lower case pairs are distinct. (Note the serif on the lower case u, for example.)

Without the italic, lower case v, x, and z are simply smaller versions of their upper case counterparts.

Here’s the whole Base58 alphabet in IBM Plex Mono italic. Note the “holes” in the alphabet where some letters were removed.


Retrofitting error detection

Bitcoin addresses include a checksum. The Base58Check algorithm uses the first four bytes of a (double SHA-256) hash as a checksum before applying Base58 encoding.

Ethereum addresses did not include a checksum, but it became apparent later that a checksum was needed. How can you retroactively fit a checksum into an address without breaking backward compatibility? Here’s what Ethereum did in adopting the EIP-55 proposal.

Ethereum addresses are the last 20 bytes (40 hexadecimal characters) of the Keccak-256 hash of the public key. The protocol allowed the letters in hexadecimal addresses to be either upper or lower case. This option provided the wiggle room to retrofit a checksum. You could mix upper and lower case letters deliberately to encode an extra message (i.e. a checksum) on top of the key. This is sorta like steganography, except that it’s out in the open rather than hidden.

Hexadecimal notation uses 10 digits and 6 letters, and so the probability that a hexadecimal character is a letter is 6/16. On average a string of 40 hexadecimal characters will contain 15 letters. This means you could add 15 bits of information to an Ethereum address, on average, by choosing the case of each letter.

It’s conceivable that an address won’t contain any letters, or will consist entirely of letters. The number of letters in a random string of 40 hexadecimal characters is a binomial random variable with parameters n = 40 and p = 3/8, and this is approximately a normal random variable with mean 15 and standard deviation 3. This says the number of letters in an Ethereum address will be fairly tightly clustered around 15, rarely more than 21 or less than 9.
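A quick calculation bears this out:

    # Letter count in a random 40-character hex string ~ Binomial(40, 3/8).
    from math import comb, sqrt

    n, p = 40, 3 / 8
    mean = n * p                 # 15
    sd = sqrt(n * p * (1 - p))   # about 3.06

    def pmf(k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # Probability of fewer than 9 or more than 21 letters
    tail = sum(pmf(k) for k in range(9)) + sum(pmf(k) for k in range(22, n + 1))
    print(mean, sd, tail)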

OK, but how do you take advantage of an average of 15 bits of freedom? You can’t encode a 15-bit number because you might not have 15 bits at your disposal.

Here’s what EIP-55 did. Left-align the address and the Keccak-256 hash of the address (which was itself a hash: there are two hashes going on) and capitalize all letters in the address that align with a character in the hash greater than or equal to 8.

As an example, let’s suppose our address is

    7341e5e972fc677286384f802f8ef42a5ec5f03b

This address contains 13 letters, which would not be unusual. Now let’s compute its hash.

    >>> from Crypto.Hash import keccak
    >>> kh = keccak.new(digest_bits=256)
    >>> kh.update(b"7341e5e972fc677286384f802f8ef42a5ec5f03b").hexdigest()
    'd8e8fcb225fb835fdb89a5918e736820ec75b3efaf3cbb229f03cdc41bf9f254'

Now we line up the address and its hash.

    341e5e972fc677286384f802f8ef42a5ec5f03b
    d8e8fcb225fb835fdb89a5918e736820ec75b3e...

The letters in the address that line up with a hash character of 0x8 through 0xf are turned to upper case; the digits, and the letters that line up with 0x0 through 0x7, are left alone.

    341E5E972fc677286384F802F8ef42a5EC5f03B
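The rule fits in a few lines of Python. Here is a sketch, using the same keccak module as the snippet above:

    # Sketch of the EIP-55 rule: capitalize a letter when the corresponding
    # hash character is 8 through f.
    from Crypto.Hash import keccak

    def eip55(address):
        address = address.lower()
        digest = keccak.new(digest_bits=256, data=address.encode()).hexdigest()
        return "".join(c.upper() if c.isalpha() and int(h, 16) >= 8 else c
                       for c, h in zip(address, digest))

    print(eip55("7341e5e972fc677286384f802f8ef42a5ec5f03b"))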


Base58 versus Base85 encoding

Base58 encoding and Base85 encoding are used to represent binary data in a human-friendly way. Base58 uses a smaller character set and so is more conservative. Base85 uses a larger character set and so is more efficient.

There is a gotcha in that “base” means something different in Base58 compared to Base85. More on that below.

Base58

Base58 encoding is primarily used as part of the Bitcoin system. It is part of the Base58Check protocol used for encoding addresses and keys.

Base58 encoding is essentially the same as mathematical base 58 encoding, with a specific character set. The symbols for the “digits” 0 through 57 are chosen to avoid typographically similar letters. We’ll give that character set in the examples below.

There is only one version of Base58 in common use as far as I know, unlike Base85.

Base85

Base85 is a more compact alternative to Base64 encoding. Base85 encodes 4 bytes in 5 characters, while Base64 requires 6 characters. Base85 is used inside the PDF format. It is also used in the patch encoding for git.

Base85 encoding is analogous to binary-coded decimal (BCD). In some early computer systems, integers would not be expressed in binary per se. Instead, each digit would be represented by four bits. So to represent a number like 427, you’d express 4, 2, and 7 in binary: 0100 0010 0111. If you were to express 427 in binary you’d get 110101011.

Base85 breaks data into 32-bit words, then expresses each word in base 85. So you might say it’s base 85-encoded 32-bit words, by analogy to binary-coded decimal.

There are variations on Base85 encoding that use different alphabets, and so two software packages that say they do Base85 encoding might produce different results.

Base85 is more efficient than Base58 in the sense that it represents data using fewer symbols. It is also more computationally efficient because each 32-bit word is encoded independently.

Examples

We give four examples below: Base58 and Base85 applied to four bytes of data and eight bytes of data. The data length matters for Base85.

Base58, four bytes

Let n = CAFEBABE₁₆ = 3405691582₁₀. This is the “magic number” at the beginning of Java class files, a pun on “java” as slang for coffee.

In base 58 this number would be

5:10:55:3:26:22

We can verify this as follows:

    >>> 5*58**5 + 10*58**4 + 55*58**3 + 3*58**2 + 26*58 + 22
    3405691582
    >>> hex(_)
    '0xcafebabe'

The Base58 alphabet is

    123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz

and so the Base58 encoding of 0xCAFEBABE would be the 5th, 10th, 55th, … elements of this alphabet (with zero-based index) which results in 6Bx4TP.
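Here is a short sketch of the conversion, for anyone who wants to check:

    # Convert a nonnegative integer to Base58 using the alphabet above.
    B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

    def int_to_base58(n):
        digits = ""
        while n:
            n, r = divmod(n, 58)
            digits = B58[r] + digits
        return digits or B58[0]

    print(int_to_base58(0xCAFEBABE))  # 6Bx4TP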

Note that the Base58 alphabet contains the digit 1 but not the letter l. It contains the lower case letter o but not the capital letter O or the digit 0. Some of the remaining characters are visually similar, depending on your font. This post shows how one font makes the Base58 characters more distinct.

Base85, four bytes

Now suppose we want to encode n using Base85. Now we would get

65:20:50:84:67

If we use the alphabet

    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu

then the “digits” above become b5Sud.

Note that the Base85 alphabet contains characters that could be confused, such as 0 (zero), O (capital letter), o (lower case letter). The characters were chosen to be printable ASCII characters, not necessarily visually distinct.

Base58, eight bytes

Now suppose n = CAFEBABECAFEBABE₁₆ = 14627333968358193854₁₀.

We convert n to base 58 to get

33:55:17:43:49:44:3:47:49:44:26

which becomes axJkrm4prmT in the Base58 alphabet.

Base85, eight bytes

To encode CAFEBABECAFEBABE₁₆ in Base85 we do not convert the number to base 85. Instead, we convert each 4-byte word to base 85. So we get two copies of CAFEBABE₁₆, and so the encoding is b5Sudb5Sud.
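Here is a sketch of the word-by-word conversion, ignoring padding and the special cases some variants add:

    # Encode each 32-bit word as five base-85 digits, using '!' through 'u'.
    B85 = "".join(chr(33 + i) for i in range(85))

    def base85_words(data):
        out = ""
        for i in range(0, len(data), 4):
            word = int.from_bytes(data[i:i + 4], "big")
            chunk = ""
            for _ in range(5):
                word, r = divmod(word, 85)
                chunk = B85[r] + chunk
            out += chunk
        return out

    print(base85_words(bytes.fromhex("CAFEBABECAFEBABE")))  # b5Sudb5Sud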

If we were to wrongly convert n to base 85, we’d get

63:13:1:27:77:35:57:62:38:49

which becomes `."<nDZ_GR which is not the correct encoding.


Experiences with Nvidia

Our team started working with Nvidia in early 2009, at the beginning of the ORNL Titan project. Our Nvidia contacts dealt with applications, libraries, the programming environment, and performance optimization. First impressions were that their technical stance on issues was very reasonable. One obscure example: were you allowed to use enums in a C++ CUDA kernel? The answer was, of course, yes, we allow that. This was unlike some other companies whose parallel programming models had odd and cumbersome restrictions (though by now this has become a harder problem for Nvidia, since there are so many software products a user might want to interoperate with).

Another example: I mentioned to a colleague at Nvidia on the C++ standards committee that it might be too early to lock a certain feature design into the standard, since hardware designs were still changing rapidly. His response was, “Oh, yes, we think exactly the same thing.” In short, their software judgments and decisions generally seem to be well thought out, reasonable, and well informed. It sounds simple, but it is amazing how many companies have gotten this wrong.

Nvidia has made good strategic decisions. In the 2013 time frame, Intel was becoming a competitive threat with the Xeon Phi processor. Intel was several times larger than Nvidia, with huge market dominance. In response, Nvidia formed a partnership with IBM, itself much larger than Intel at the time. This came to fruition in the ORNL Summit system in 2018. In the meantime, the Xeon Phi’s OpenMP programming model, though standards-based, turned out to be difficult to write optimized code for, and Nvidia’s CUDA captured a dominant share of accelerated user software. Intel eventually dropped the Xeon Phi product line.

In the mid-2000s, Nvidia went all-in on CUDA. I’ve heard some project teams say they would never use CUDA, because it is nonstandard and too low-level. Many have since reversed that decision. Of course, it is often possible to write an abstraction layer on top of CUDA to make it easier to use and maintain. Also, newer programming models like Kokkos can be helpful.

Nvidia also made a prescient decision early on to bet big on AI. A little later they decided to go all in on developing a huge number of software libraries to enable access to many new markets. That is a huge moat. AMD is trying hard to improve its software processes and catch up.

On the downside, Nvidia’s high prices are upsetting to many, from gamers to buyers of the world’s largest HPC systems. Competition from AMD and others is a good thing.

And Nvidia’s marketing speak is sometimes confusing. A comparison was once made claiming that a four-GPU system was more powerful than one of the world’s top CPU-only supercomputers on a very specific science problem; I’d like to see the details of that comparison. Also, different figures have been given for how long it took to stand up xAI’s Colossus supercomputer, from 19 days to 122 days; one has to dig a little to find out what these figures mean. And it was widely reported last year that the GB200 NVL72 was “30 times faster” than the H100, but this only refers to certain operations, not key performance measures like flops per watt.

Those are my takes. For more perspectives, see Tae Kim’s excellent book, The Nvidia Way, or this interview.

Thoughts on Nvidia? Please leave in the comments.