Soft maximum

In applications you often want to take the maximum of two numbers. But the simple function

f(x, y) = max(x, y)

can be difficult to work with because it has sharp corners. Sometimes you want an alternative that sands down the sharp edges of the maximum function. One such alternative is the soft maximum. Here are some questions this post will answer.

  • What exactly is a soft maximum and why is it called that?
  • How is it related to the maximum function?
  • In what sense does the maximum have “sharp edges”?
  • How does the soft maximum round these edges?
  • What are the advantages to the soft maximum?
  • Can you control the “softness”?

I’ll call the original maximum the “hard” maximum to make it easier to compare to the soft maximum.

The soft maximum of two variables is the function

g(x, y) = log( exp(x) + exp(y) ).

This can be extended to more than two variables by taking

g(x1, x2, …,  xn) = log( exp(x1) + exp(x2) + … + exp(xn) ).

The soft maximum approximates the hard maximum. Why? If x is a little bigger than y, exp(x) will be a lot bigger than exp(y). That is, exponentiation exaggerates the differences between x and y. If x is significantly bigger than y, exp(x) will be so much bigger than exp(y) that exp(x) + exp(y) will essentially equal exp(x) and the soft maximum will be approximately log( exp(x) ) = x, the hard maximum. Here’s an example. Suppose you have three numbers: -2, 3, 8. Obviously the maximum is 8. The soft maximum is 8.007.

The soft maximum approximates the hard maximum but it also rounds off the corners. Let’s look at some graphs that show what these corners are and how the soft maximum softens them.

Here are 3-D plots of the hard maximum f(x, y) and the soft maximum g(x, y). First the hard maximum:

Now the soft maximum:

Next we look at a particular slice through the graph. Here’s the view as we walk along the line y = 1. First the hard maximum:

And now the soft maximum:

Finally, here are the contour plots. First the hard maximum:

And now the soft maximum:

The soft maximum approximates the hard maximum and is a convex function just like the hard maximum. But the soft maximum is smooth. It has no sudden changes in direction and can be differentiated as many times as you like. These properties make it easy for convex optimization algorithms to work with the soft maximum. In fact, the function may have been invented for optimization; that’s where I first heard of it.

Notice that the accuracy of the soft maximum approximation depends on scale. If you multiply x and y by a large constant, the soft maximum will be closer to the hard maximum. For example, g(1, 2) = 2.31, but g(10, 20) = 20.00004. This suggests you could control the “hardness” of the soft maximum by generalizing the soft maximum to depend on a parameter k.

g(x, y; k) = log( exp(kx) + exp(ky) ) / k

You can make the soft maximum as close to the hard maximum as you like by making k large enough. For every value of k the soft maximum is differentiable, but the size of the derivatives increase as k increases. In the limit the derivative becomes infinite as the soft maximum converges to the hard maximum.

Update: See How to compute the soft maximum.

Solver Foundation optimization library

Microsoft’s Solver Foundation is a numerical optimization library capable of solving problems involving millions of variables and millions of constraints. When I listened Scott Hanselman interview Nathan Brixius from Microsoft’s Solver Foundation team, I expected Brixius to say that Solver Foundation was written in C++ at its core and had a thin C# veneer to make it callable from .NET applications. Instead, he said that Solver Foundation is entirely written in managed code.

Even in heavy-duty numerical code the bottlenecks may not be numerical. The inner loops of the software would execute faster if they were written in C++, but Solver Foundation solves optimization problems about as quickly as other packages written in lower-level languages.

Four reasons we don’t apply the 80/20 rule

Why can’t we make more use of the 80/20 rule? I’ll review what the 80/20 rule is, explain how it can be powerful, then give four reasons why we don’t take advantage of it.

What is the 80/20 rule?

The 80/20 rule is amazing when you first learn about it. It says that efforts and results are often very unevenly distributed. You’ll get 80% of your results from the first 20% of your efforts. For example, maybe your top 20% of customers will provide 80% of your profit. Or when you’re debugging software, often 80% of the bugs will be in 20% of the code. Once you become aware of it, you’ll see 80/20 examples everywhere.

There’s nothing magical about the numbers 80 and 20. The general principle applies if 93% of your results come from 22% of your efforts. The numbers don’t have to add to 100. The principle is just that outcomes are unevenly distributed, more unevenly distributed than you may think.

Applying the 80/20 rule

Applications of the 80/20 rule are everywhere. For example, if you want to learn a foreign language, you don’t buy a dictionary and start learning words from page 1 and work your way to the end. Some words are used far more often than others. You’ll be able to use a language much sooner if you learn the vocabulary roughly in descending order of frequency.

Software optimizations can be extreme examples of an 80/20 rule. Sometimes 98% percent of a program’s time is being spent executing just five lines of code.  Finding those five lines and tuning them is far more effective than randomly tweaking things here and there in hopes that the changes improve performance.

Why don’t we apply the 80/20 rule?

If the 80/20 rule is so powerful, why don’t we use it more often? Why don’t we concentrate our efforts where we’re likely to see the best results? Here are four reasons.

  1. We don’t look for 80/20 payoffs. We don’t see 80/20 rules because we don’t think to look for them.
  2. We’re not clear about criteria for success. You can’t concentrate your efforts on the 20% with the biggest returns until you’re clear on how you measure returns.
  3. We’re unclear how inputs relate to outputs. It may be hard to predict what the most productive activities will be.
  4. We enjoy less productive activities more than more productive ones. We concentrate on what’s fun rather than what’s effective.

If you address these issues in order, you might get stuck on the third one. It can be hard to know what is most productive. Our intuition in this area is usually wrong. For example, maybe the most effective thing to do is very simple, but we overlook it because we think the answer must be more complicated. Or maybe we confuse what we need to do with what we want to do. Collecting data is the best way to find out what really works. The results are often surprising.

Sometimes the world changes and we’re stuck doing what used to be most effective. For example, some of the most persistent ideas about the “right” way to develop software come from studies of done forty years ago. It’s not enough to collect data one time.

More on Pareto and power laws

Convex optimization

I’ve enjoyed following Stephen Boyd’s lectures on convex optimization. I stumbled across a draft version of his textbook a few years ago but didn’t realize at first that the author and the lecturer were the same person. I recommend the book, but I especially recommend the lectures. My favorite parts of the lectures are the off-hand remarks about which methods are well known and which are not, what different fields call the same methods, which concerns are merely academic minutia and which are practically important. This kind of information is almost impossible to pick up just from reading books and articles.

A surprisingly large set of problems can be formulated as minimizing a convex function subject to convex constraints, and algorithms for solving these problems are very well developed. It’s roughly true that an optimization problem is tractable if and only if it is convex. Of course that’s not exactly true. Many special non-convex problems can be solved easily, though if you write down an arbitrary non-convex optimization problem, it’s probably hard to solve. On the other hand, enormous convex optimization problems can often be solved quite efficiently.

I became interested in optimization a few years ago in order to solve some problems that came up at work. After going through Boyd’s lectures, I’d like to go back and do a better job on some of these problems. For example, the EffTox dose-finding software solves an optimization problem to determine prior parameters based on input from doctors. That optimization is slow and not very robust. If I can find the time to do it, I think I could improve that software.