When High Performance Computing (HPC) Isn't High Performance

Everybody cares about codes running fast on their computers. Hardware improvements over recent decades have made this possible. But how well are we taking advantage of hardware speedups?

Consider these two C++ code examples. Assume here n = 10000000.

void sub(int* a, int* b) {
    for (int i=0; i<n; ++i)
        a[i] = i + 1;
    for (int i=0; i<n; ++i)
        b[i] = a[i];
}

void sub(int* a, int* b) {
    for (int i=0; i<n; ++i) {
        const int j = i + 1;
        a[i] = j;
        b[i] = j;
    }
}

Which runs faster? Both are simple and give identical results (assuming no aliasing). However on modern architectures, depending on the compilation setup, one will generally run significantly faster than the other.

In particular, Snippet 2 would be expected to run faster than Snippet 1. In Snippet 1, elements of the array “a”, which is too large to be cached, must be retrieved from memory after being written, but this is not required for Snippet 2. The trend for over two decades has been for compute speed of newly delivered systems to grow much faster than memory speed, and the disparity is extreme today. The performance of these kernels is bound almost entirely by memory bandwidth speed. Thus Snippet 2, a fused loop version of Snippet 1, improves speed by reducing main memory access.

Libraries like C++ STL are unlikely to help, since this operation is too specialized to expect a library to support it (especially the fused loop version). Also, the compiler cannot safely fuse the loops automatically without specific instructions that the pointers are unaliased, and even then is not guaranteed to do so.

Thankfully, high level computer languages since the 1950s have raised the programming abstraction level for all of us. Naturally, many of us would like to just implement the required business logic in our codes and let the compiler and the hardware do the rest. But sadly, one can’t always just throw the code on a computer and expect it to run fast. Increasingly, as hardware becomes more complex, giving attention to the underlying architecture is critical to getting high performance.

7 thoughts on “When High Performance Computing Is Not High Performance”

Ross

10 January 2024 at 11:18

Fascinating post in its simplicity —and surprise value: I would’ve thought almost any compiler would be able to fuse these loops . (boy, I’m so going to enjoy these articles…!)

Are there any mainstream compilers that are ‘chip aware’ not simply in terms of a hw architecture-informed rule base and series of static trade-offs, but in the sense of getting real time instrumented feedback from what the chip is doing in the computation results in order to feedback to the compiler for an optimal tuned decision?

Wayne Joubert

10 January 2024 at 11:45

If I’m not mistaken, if there is aliasing of the form &(a[1]) == &(b[0]) then the two codes will give different answers. Actually I noticed a slight error in the original blog post which I just corrected, thanks for your comment that sparked me looking at this.
I remember the old Intel compilers used to have an option to run a code and have it dump performance data that could be read by the compiler on next compile, to optimize. I’m not aware of anyone doing this kind of thing routinely now, but there a lot of different developer communities doing different things. Seems like something someone might do in the LLVM code base.

BobC

10 January 2024 at 12:46

I made a good living for decades delivering embedded systems using processors that were “too small for the workload”. Some of it was low-level software knowledge like the above, but more of it was simply paring bloat from libraries and the real-time OS. I can’t count the times I’ve had to rewrite thread management libraries, OS schedulers and memory managers.

However, the greatest improvements came from the high-level software architecture, employing “clean” designs that avoided excessive communication and computation, and tightly controlling “featuritis” every step of the way.

This often made software development more difficult, as “the easy way”, while fast, often incurred increased hardware costs to achieve performance goals.

Now, I started my career on 8-bit embedded processors (Intel 8085) and early 16-bit DSPs. As 32-bit architectures and feature-packed SoCs swept through the embedded space, my skills were still useful, though now to target power management, to minimize the compute needed for a problem both to support reduced BOM costs as well as to extend battery operation (and/or reduce battery size).

Similarly, my application domain shifted from embedded and distributed industrial systems and instruments to avionics for aircraft and spacecraft and to novel sensor technologies (getting a marginal piece of special matter to generate useful measurements).

The same low-level skills kept evolving as the technology landscape proceeded through multiple generations. But even those early register-level optimization skills still come in handy when profiling code only to find compiler, library or OS issues.

Max

11 January 2024 at 00:07

Definitely STL can help, for example from x17 ExecutionPilicy was added to many standard algorithm (example https://en.cppreference.com/w/cpp/algorithm/generate_n). 10000000 is long chunk of memory that easily without any interference can be split across several executors. Absence of interference mean no mutex, that are main delay point in modern multithread or multiprocess programming.

Globules

11 January 2024 at 01:45

@Ross “profile guided optimization” (a.k.a. PGO) sounds like the closest thing to what you described. GCC, LLVM and various other compilers claim to support it in one form or another. E.g. see:

Wikipedia: https://en.wikipedia.org/wiki/Profile-guided_optimization#Implementations
GCC (-fauto-profile, -fprofile-use): https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
LLVM: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Wayne Joubert

11 January 2024 at 06:26

@Max Yes, something like that could be done. One would have to be sure the shorter chunks that are used to keep data in cache (e.g., a 64K L1 cache) would not reduce efficiency because of the shorter vector lengths. Benchmarks like the STREAM benchmark show that (up to a certain point) long vector lengths are more efficient.

ross

12 January 2024 at 13:48

@Globules. *thanks!*

Comments are closed.