When High Performance Computing Is Not High Performance

Everybody cares about codes running fast on their computers. Hardware improvements over recent decades have made this possible. But how well are we taking advantage of hardware speedups?

Consider these two C++ code examples. Assume here n = 10000000.

void sub(int* a, int* b) {
    for (int i=0; i<n; ++i)
        a[i] = i + 1;
    for (int i=0; i<n; ++i)
        b[i] = a[i];
void sub(int* a, int* b) {
    for (int i=0; i<n; ++i) {
        const int j = i + 1;
        a[i] = j;
        b[i] = j;

Which runs faster? Both are simple and give identical results (assuming no aliasing). However on modern architectures, depending on the compilation setup, one will generally run significantly faster than the other.

In particular, Snippet 2 would be expected to run faster than Snippet 1. In Snippet 1, elements of the array “a”, which is too large to be cached, must be retrieved from memory after being written, but this is not required for Snippet 2. The trend for over two decades has been for compute speed of newly delivered systems to grow much faster than memory speed, and the disparity is extreme today. The performance of these kernels is bound almost entirely by memory bandwidth speed. Thus Snippet 2, a fused loop version of Snippet 1, improves speed by reducing main memory access.

Libraries like C++ STL are unlikely to help, since this operation is too specialized to expect a library to support it (especially the fused loop version). Also, the compiler cannot safely fuse the loops automatically without specific instructions that the pointers are unaliased, and even then is not guaranteed to do so.

Thankfully, high level computer languages since the 1950s have raised the programming abstraction level for all of us. Naturally, many of us would like to just implement the required business logic in our codes and let the compiler and the hardware do the rest. But sadly, one can’t always just throw the code on a computer and expect it to run fast. Increasingly, as hardware becomes more complex, giving attention to the underlying architecture is critical to getting high performance.

7 thoughts on “When High Performance Computing Is Not High Performance

  1. Fascinating post in its simplicity —and surprise value: I would’ve thought almost any compiler would be able to fuse these loops . (boy, I’m so going to enjoy these articles…!)

    Are there any mainstream compilers that are ‘chip aware’ not simply in terms of a hw architecture-informed rule base and series of static trade-offs, but in the sense of getting real time instrumented feedback from what the chip is doing in the computation results in order to feedback to the compiler for an optimal tuned decision?

  2. If I’m not mistaken, if there is aliasing of the form &(a[1]) == &(b[0]) then the two codes will give different answers. Actually I noticed a slight error in the original blog post which I just corrected, thanks for your comment that sparked me looking at this.
    I remember the old Intel compilers used to have an option to run a code and have it dump performance data that could be read by the compiler on next compile, to optimize. I’m not aware of anyone doing this kind of thing routinely now, but there a lot of different developer communities doing different things. Seems like something someone might do in the LLVM code base.

  3. I made a good living for decades delivering embedded systems using processors that were “too small for the workload”. Some of it was low-level software knowledge like the above, but more of it was simply paring bloat from libraries and the real-time OS. I can’t count the times I’ve had to rewrite thread management libraries, OS schedulers and memory managers.

    However, the greatest improvements came from the high-level software architecture, employing “clean” designs that avoided excessive communication and computation, and tightly controlling “featuritis” every step of the way.

    This often made software development more difficult, as “the easy way”, while fast, often incurred increased hardware costs to achieve performance goals.

    Now, I started my career on 8-bit embedded processors (Intel 8085) and early 16-bit DSPs. As 32-bit architectures and feature-packed SoCs swept through the embedded space, my skills were still useful, though now to target power management, to minimize the compute needed for a problem both to support reduced BOM costs as well as to extend battery operation (and/or reduce battery size).

    Similarly, my application domain shifted from embedded and distributed industrial systems and instruments to avionics for aircraft and spacecraft and to novel sensor technologies (getting a marginal piece of special matter to generate useful measurements).

    The same low-level skills kept evolving as the technology landscape proceeded through multiple generations. But even those early register-level optimization skills still come in handy when profiling code only to find compiler, library or OS issues.

  4. Definitely STL can help, for example from x17 ExecutionPilicy was added to many standard algorithm (example https://en.cppreference.com/w/cpp/algorithm/generate_n). 10000000 is long chunk of memory that easily without any interference can be split across several executors. Absence of interference mean no mutex, that are main delay point in modern multithread or multiprocess programming.

  5. @Max Yes, something like that could be done. One would have to be sure the shorter chunks that are used to keep data in cache (e.g., a 64K L1 cache) would not reduce efficiency because of the shorter vector lengths. Benchmarks like the STREAM benchmark show that (up to a certain point) long vector lengths are more efficient.

Leave a Reply

Your email address will not be published. Required fields are marked *