A little optimization and a challenge

Waldir Pimenta asked me whether it is possible to test the condition

max(a, b) / min(a, b)  < r

without computing max(a, b) or min(a, b). Here a> 0, b> 0, and r > 1. (If r ≤ 1, the condition is always false.) The inequality has to be evaluated in an inner loop of a real-time image processing application and so the time saved by avoiding computing the max and min might matter.

We can multiply the original inequality by min(a, b) since a and b are positive. The condition becomes

max(a, b) < r min(a, b) = min(ra, rb).

We then expand this inequality into four simple inequalities:

a < ra
b < ra
a < rb
b < rb

Since r > 1, we know that a < ra and b < rb and so we only need to check a < rb and b < ra. In C notation we would evaluate

 ( a < r*b && b < r*a )

rather than

 ( max(a,b) / min(a,b) < r )

Whether this saves any time depends on context, though it’s plausible that the former might be more efficient. Some portion of the time the condition a < r*b will evaluate to false and the second half of the condition will not be evaluated.

(C evaluates an expression like (p && q) from left to right. If p is false, q is not evaluated since it is known at that point that the expression (p && q) will evaluate to false regardless of the value of q.)

Challenge: How often will the condition a < r*b evaluate to false? What are your assumptions? Leave your answers below.

Update: Combining the ideas of Carlos Santos and Christian Oudard in the comments below, you could implement the inequality as

a < b ? (b < r*a) : (a < r*b)

There are too many other good ideas in the comments for me to keep editing the blog post so I’ll just ask that you scroll down. I’m surprised at all the ideas that came out of evaluating a simple expression.

32 thoughts on “A little optimization and a challenge

  1. Are you only trying to avoid the function calls or there are other reasons for this? I thought decent compilers would inline the calls to max/min. Anyways, if you’re using C you could write something like:
    a > b ? a/b : b/a

  2. iCarlos: I think the idea here is to avoid a division call, which can be slower than multiplication on some architectures.

    John: Waldir Pimenta’s method involves two comparisons and one division. Your method involves up to two comparisons and two multiplications. I’m not sure if this is correct, but it may be possible to do two comparisons and one multiplication every time. I’ll use python-like pseudocode:

    def f(a, b):
        if a < b:
            return b < r*a
        else:
            return a < r*b
    
  3. Carlos: here are the differences, assuming you inline the max and min functions.

    • The direct approach requires three comparisons and a division.
    • Your approach requires two comparisons and one division.
    • The approach developed in the post sometimes requires one comparison and one multiplication and sometimes requires two comparisons and two multiplications.

    Your approach is an improvement over the most direct implementation since you eliminate a comparison.

    This is probably straining at gnats. But it could matter, for example, on an embedded chip without a division instruction or one with slow division. On some chips, division is an order of magnitude slower than multiplication.

    Christian: The original method requires three comparisons: one in max, one in min, and one comparing the ratio to r. Your approach is clever. I agree that your method uses two comparisons and one multiplication every time.

  4. If you are really talking about the “inner loop of a real-time image processing application”, then every if/then will generate a huge performance hit. Your “updated” example will likely be painfully slow compared to your first.

    Any modern, smart C compiler will NOT implement “&&” in the short-circuiting way you describe on this simple example. It will recognize that the right-side of the && (i.e., “b < r*a") has no side effects, and it will almost certainly generate machine code that does something like this:

    ((a < r*b) & (b < r*a)) // note the single "&"

    If your compiler is not this smart, you will get better performance by making such a transformation yourself. Comparisons and Boolean AND cost only a single cycle, while a mis-predicted conditional branch costs tens of cycles because it stalls the instruction fetch pipeline. (Although the multiplication complicates the analysis somewhat. If your floating point unit has enough spare time in the rest of your loop, they can effectively be "free", since they can happen in parallel with other computations. And integer multiplies are pretty fast to start with, especially compared to mis-predicted branches.)

    Nothing is faster than linear machine code (i.e. no conditional branches) operating on registers. This statement becomes more true every year as cores grow in speed faster than memory. As the old saying goes, the three most important things are "locality, locality, locality". Or something like that.

    Some compilers have a syntax to give the compiler "hints" about which way a branch is likely to go, and it will pass those hints on to the CPU to help it avoid mispredicting the branch. (GCC has "__builtin_expect", for instance.) If you happen to know (a < r*b) more often than not, you can use such a syntax to tell your compiler about it and get faster code.

    However, I would go in a different direction here. A REALLY smart compiler would implement this as a vector operation: [a, b] < r*[b, a]. (Modern CPUs have vector units. If you are writing a "real-time image processing application", you want to learn about these.)

    So if I were working on this, I would examine the optimized assembly output ("cc -S") and keep adjusting my code and compiler options until I got this vectorized form. Note: I am not suggesting writing in assembly. I am suggesting writing in standard C, but arranging the code so that your compiler will vectorize it for you. Smart modern compilers will do this, although they may require a switch (like "-sse3" or something) to unlock their full potential, since the available vector operations keep expanding with each generation of CPUs.

  5. Depending on the compiler and architecture, this may be faster to compute than using bitwise &.

    (a - rb) * (b - ra)

  6. @Thouis —

    Unlikely. Bitwise operations are inherently parallel and will take one cycle each. Multiplications will take (far) more. That third multiply you have introduced is deadly.

    But your observation leads to a different formulation:

    #include
    double a, b, r;
    return signbit(a-r*b) == signbit(b-r*a);

    (This works because it is impossible for a>rb AND b>ra)

    You can do the same with integer variables and bit-shifts to test the signs. Should be pretty fast.

  7. Scrutinizing compiler output is a waste of time. The best approach to targeted optimization is to take a bunch of implementations and measure the processing time for each option on sufficiently large input data, and weighing the ancillary cost (like additional memory usage) against the speedup.

    That said, and having had some personal (and recent) experience with vectorizing, the best option for this may well be vectorization, using the multiplication trick. The thing to remember, though, is that vectorization requires that you compute all paths of any decision you have to make. That can get more expensive than if/else trees if there are enough decisions. So it can work, but it requires some careful planning.

    One other thing to note: I assume that a, b, and r are floats? If they’re ints, the multiplication trick won’t give you the same results in general.

  8. It may also be valuable to think about maybe doing some GPU programming. This is a cakewalk for GPUs.

    (I screwed up my URL in my last comment; this one is correct.)

  9. Sure, it’s a “waste of time” if you are ignorant of how your CPU actually works and if your compiler is smarter than you are. Me, I like to stay smarter than my tools, so I like to see what stupid things the compiler does with my code. To each his own.

    Of course, profiling is absolutely necessary. But if you care about maximizing performance, profiling is how you validate your results, not how you find them.

    You are not the only one with “personal (and recent) experience with vectorizing” ;-).

  10. Well, it matters what’s actually faster, not what you think is faster.

    You’re suggesting making tiny alterations to code and then watching what the compiler does until you get what you want. Why not just do what you want in assembly then, and skip the step of micromanaging your compiler? Then you could spend the time instead just profiling the assembly code and the original code, and see what works.

    You are not the only one with “personal (and recent) experience with vectorizing”

    When did I ever say that I was?

  11. Pardon me, but that reminds me of a story.

    Some folks were tryin’ to put a MIN or MAX call in a compile-time constant in FORTRAN (19)77. Not directly possible, ‘course, but they wanted to do it anyway — maybe to save punch cards.

    One clever fella put down his slide rule and wrote out a version that leveraged integer division — a legal operation in a compile-time constant.

    It was the first time I ever encountered the British expression, “That’s using the old Sultana!”

    Shucks. Good times. Now where did I put those flowcharts from my salad days at Rand?

  12. Well, it matters what’s actually faster, not what you think is faster.

    Obviously. In fact, I made the same point when I wrote: “Of course profiling is absolutely necessary”. Did you miss that part?

    Look, all I am saying is that if you have identified a single small section of code as a bottleneck, it pays to optimize it. And despite what you may have read, some of us humans are still better at that than compilers. I routinely squeeze an extra factor of 1.5 or so out of inner loops when it is necessary.

    I do not write in assembly because I care about my code’s longevity. That is, I want it to run as fast as necessary today, but also to compile and run decades from now. So I code rigidly to the relevant standards, but I also use my brain when it is useful. But maybe that’s just me.

  13. Depending upon the processor architecture, it may be faster to get rid of the comparison.

    We can do that like this:
    ( (a<b) && (b<r*a) ) || ( (b<a) && (a<r*b) )

    With comparison:
    1 compares
    5 operators

    Without comparison:
    0 compares
    9 operators (assuming no short-circuiting, with short circuiting it may be more like 7 operators)

    So if a comparison is a single cycle, a compare needs to take 4cycles to break even.

    It's been awhile since I did any significant optimizations, so I'm curious to hear everyones response to this.

  14. So I code rigidly to the relevant standards, but I also use my brain when it is useful. But maybe that’s just me.

    Wow. This is the second rather transparent attempt of yours at insulting me.

    I’m not even sure that this is worth continuing. I’ll let my original comments stand for those that might find them useful.

  15. Have you looked at SSE instructions? There’s certainly a combination which can be applied here. As a bonus you can probably do two operations at the sime time 🙂

    For example:
    MAXSS – MAXIMUM scalar single-precision floating-point values
    MINSS – MINIMUM scalar single-precision floating-point values
    CMPSS – COMPARE scalar single-precision floating-point values

  16. @Gigi: Those would definitely be a great way to go — you could do MAX, MIN, MULT (the result of MIN by r), and CMP. At least I think there’s a way to multiply a vector by one value in SSE — I don’t recall off the top of my head. So four vector instructions. Nice.

  17. My understanding of SSE is that setup/teardown is expensive so you need to do it a lot at once. And that comparison is gonna get in the way.

    Is my recollection incorrect?

  18. It can be, but if you’re working with a large buffer of data, it gets amortized. So yeah, what you said is true.

    As for the CMPSS, it’s vectorized too — that is, it compares a bunch at a time and returns the result of each compare in a vector. Also, MAX and MIN are one instruction each. You don’t have to implement them as a compare, so that’s a win.

  19. As to the challenge question, I’ve assumed that a and b are each uniformly distributed on some interval [0,L]. Given some r, the probability that a < r*b is (2r-1)/(2r) (regardless of L).
    A reasonable pdf for r might be p(r) = 1/r2. If so, an easy integration yields a probability of 1/4 that a < r*b is false.

    P.S. Can we use LaTeX in comments?

  20. Sorry, but I don’t think my blog software supports LaTeX. You could make an image or PDF file and point to it. I could put the file on my server if you’d like.

  21. Ha, that last message was supposed to read “So we can use <img>> tags?”. The fact that the software ate my img tag hopefully means the answer is yes!

  22. You can use some HTML tags; I don’t know what the limits are. You can always try. And I’d be glad to edit or delete anything you’d like if it doesn’t work out.

  23. @sandy_l SSE setup is not that expensive. Besides, this question is about an inner function which is called a lot, so this is certainly an ideal case for SSE. I once used it to compute a lot of square roots (vector length computation), and the result was 1.4 times faster than 4 threads each computing standard sqrts on a quad core machine. And the gain was not because extra overhead of the standard sqrt function, since I’ve also tried it’s assembly language version (FSQRT).

    I’m not sure whats the target platform. Visual Studio provides intrinsinc functions, so you don’t even need to write assembly language.

  24. One more formulation you might try, if you lack built-in min/max instructions and/or the vectorized version doesn’t do it for you…

    max(a,b) = (a + b + |a-b|)/2
    min(a,b) = (a + b – |a-b|)/2

    In this example, the 2’s cancel (not that dividing by 2 is slow, but still), so you can write:

    #include <math.h>
    double a, double b, r;
    double tmp = fabs(a-b);
    return a + b + tmp < r*(a + b - tmp);
    

    This gets you down to one multiply, one comparison (well, two counting fabs()), and zero conditional branches. fabs() should be one machine instruction, so this should be pretty fast.

  25. @Gigi: True, though it is possible to incur some extra cost in building supporting data structures and such to support trickier vectorizations.

    Re: VS: In fact, VS disallows the use of inline assembly for 64-bit, so intrinsics are necessary if you don’t to write a whole assembly file (and pay for the assembler).

  26. If your compiler is decent, you do not need to use the non-portable MMX/SSE intrinsics. (You might even call them “a waste of time”. Why not just write in assembly? etc)

    I do not know about Visual Studia because I never use Windows. But I do know that both the Intel C compiler and gcc will vectorize perfectly standard C/C++ code automatically, provided you write your code carefully and tell the compiler what CPU you are targeting.

    That includes automatically generating the SSE “min” and “max” opcodes.

  27. @Nemo: The compilers are not that smart yet. I have a SSE version of a 4×4 matrix multiplication algorithm written by an Intel engineer which is much faster than anything else.

    VS is not bad either. When targeting modern CPUs it will not generate old-school math-coprocessor instructions, but SSE ones, even when computing a single value, or doing basic stuff like a/b. I was quite surprised when I discovered this behavior. In general you do not gain any speed, since you do not use the full width of the units, but it probably makes it easier to mix in SSE instructions which do.

    BTW, the next Intel architecture coming in 2011 will replace SSE with AVX, which has 256 bit wide SSE registers, doubling the size of the current ones. This is clearly the future for high performance algorithms (a move towards the GPU model).

  28. @Gigi —

    AVX is exactly the reason I advocate writing in standard C when possible. For example:

    for (int i =0 ; i < 1024 ; ++i) {
    // do stuff with a[i], b[i], c[i], etc.
    }

    Modern compilers will vectorize this loop automatically if they can figure out how. Of course, you have to be careful about what you do inside the loop, and you have to examine the compiler’s output to find out whether it worked. (Unless your compiler has an option to tell you as it vectorizes each loop; some do.)

    When targeting an SSE CPU, the compiler will use 128-bit vector instructions. Next year, when targeting an AVX CPU, you can simply recompile the code and it will use 256-bit vector instructions.

    The alternative is to re-write your code using new “256-bit intrinsics”. And then do it again in a couple of years. And again, and again, and again …

    I find recompiling easier.

    It is almost always possible to write your code in a style to allow these sorts of optimizations by the compiler, but it is a bit of an art, and it does take some extra effort. To me, the effort is worth it, because in my experience I always wind up maintaining code for (much) longer than I expect. And besides, these sorts of psycho optimizations are usually only needed on a tiny portion of the code base.

Leave a Reply

Your email address will not be published. Required fields are marked *