The most obvious way to compute the soft maximum can easily fail due to overflow or underflow.

The soft maximum of x and y is defined by

*g*(*x*, *y*) = log( exp(*x*) + exp(*y*) ).

The most obvious way to turn the definition above into C code would be

```c
double SoftMaximum(double x, double y)
{
    return log( exp(x) + exp(y) );
}
```

This works for some values of *x* and *y*, but fails if *x* or *y* is large. For example, if we use this to compute the soft maximum of 1000 and 200, the result is numerical infinity. The value of `exp(1000)` is too big to represent in a floating point number, so it is computed as infinity. `exp(200)` is finite, but the sum of an infinity and a finite number is infinity. Then the `log` function applied to infinity returns infinity.

We have the opposite problem if we try to compute the soft maximum of -1000 and -1200. In this computation `exp(-1000)` and `exp(-1200)` both underflow to zero, and the `log` function returns negative infinity for the logarithm of zero.

Fortunately it’s not hard to fix the function `SoftMaximum` to avoid overflow and underflow. Look what happens when we shift both arguments by a constant.

log( exp(*x* − *k*) + exp(*y* − *k*) ) = log( exp(*x*) + exp(*y*) ) − *k*.

This says

log( exp(*x*) + exp(*y*) ) = log( exp(*x* − *k*) + exp(*y* − *k*) ) + *k*.

If we pick *k* to be the maximum of *x* and *y*, then one of the calls to `exp` has argument 0 (and so it returns 1) and the other has a negative argument. This means the following code cannot overflow.

```c
double SoftMaximum(double x, double y)
{
    double maximum = fmax(x, y);
    double minimum = fmin(x, y);
    return maximum + log( 1.0 + exp(minimum - maximum) );
}
```

The call to `exp(minimum - maximum)` could possibly underflow to zero, but in that case the code returns `maximum`. And in that case the return value is very accurate: if `maximum` is much larger than `minimum`, then the soft maximum is essentially equal to `maximum`.

The equation for the soft maximum implemented above has a few advantages in addition to avoiding overflow. It makes it clear that the soft maximum is always greater than the maximum. Also, it shows that the difference between the hard maximum and the soft maximum is controlled by the spread of the arguments. The soft maximum is nearest the hard maximum when the two arguments are very different and furthest from the hard maximum when the two arguments are equal.

Thanks to Andrew Dalke for suggesting the topic of this post by his comment.

Not bad, but you can do a tiny bit better using the log1p function.

log1p(exp(minimum-maximum)) + maximum

This avoids unnecessary precision loss when “maximum” is close to zero and minimum-maximum is less than -40 or so.

On another note, I am not sure I like how the “soft max” is always larger than the max. In particular, when x=y, shouldn’t any “max” function always return x?

log((exp(x) + exp(y))/2) is always between x and y, and reduces to x when x=y. (This is reminiscent of the “n-th root of the average of the n-th powers”, which approaches the max as n approaches infinity.)

In short, I might subtract log(2) from your formula.

Nemo: Andrew Dalke raised the same objection in a comment to my first post on soft maximum.

The choice of definition depends on what properties you want the soft maximum to have. If you want softmax(x, x) = x to hold, the factor of 2 you suggest works. But if you do that, then softmax(x, y) - max(x, y) no longer goes to zero as one of the arguments grows large.

The active ingredient in the “safe” version of soft maximum is log(1+exp(x)), which could be called the “soft half-wave rectifier” function. It is the softness in the soft maximum!

Be aware that there is already a different function called “softmax”.

I named it (although others may also have done so) so I must take the blame for a neat but misleading name.

It should have been called softargmax: it provides a soft version of indicating which of several values is greatest in value, and it has several uses in so-called neural network algorithms. It is important to subtract off the maximum value from all the inputs before applying exp, for the reasons you explain.

See this summary: http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html

It is the multiple-input generalization of the logistic: (1/(1+exp(-x))).

So to avoid confusion I suggest sticking to your name “soft maximum” for the function you describe, and avoiding the name “softmax”.

Thanks for that! I have a slightly different problem to solve: using isosurface mesh generation, plotting two equations on the same graph and combining them, the mesh at the intersection is irregular and has faulty shadows. Because it is an isosurface, it makes sense to keep the maximum the same at == 0, and to change how the formulas are multiplied together close to level == 0…

For plane equations x = 0 and y = 0, intersecting along z for example, I’m puzzling over how to edit the actual equations with an extra equation so that they meet in a curve of value == 0 close to the intersection. I’m really bad at maths; it’s a bit confusing.

There is still an accuracy problem using the above, for which I propose a solution.

Let’s consider the test of positive value, say max(0,x).

The code works fine and is smooth for values close to zero. We also obtain x for larger x values with good accuracy. But if x is small (let’s say 10) then you obtain 10.0000454 (1.31 for 1…), etc.

Precisely, your code is faithful for x larger than 35 (or (maximum − minimum) larger than 35).

I suggest you take the desired numerical precision of your optimisation problem, say 1e-15 for the example. Then you compute the minimum acceptable difference (maximum − minimum) for this sensitivity, which is:

log( 1 + exp(−dx) ) = 1e-15, giving approximately dx = 34.43.

Dx is the point at which the exponential should be computed, so change the code to (and I assure you that you won’t have overflow for 35 more units in the exponential):

```c
double SoftMaximum(double x, double y)
{
    double dx = 34.43; /* shift derived from the target precision above */
    double maximum = fmax(x, y);
    double minimum = fmin(x, y);
    return maximum + log( exp(dx) + exp(minimum - maximum + dx) ) - dx;
}
```

Best!

I can’t see the reason right now behind the soft maximum, since the rewritten formula has already discarded all the other elements in the computation. It was very useful in the earlier case because we could get the gradient with respect to all the elements, but now only the maximum and minimum appear in the gradient, not the rest.

Isn’t there any other way to avoid overflow and underflow without losing the gradient of the rest of the points?