Elementary symmetric polynomials and optimization

Posted on 12 November 2025 by John

The mth elementary symmetric polynomial of degree n

$e_m(x_1, x_2, \ldots, x_n)$

is the sum of all terms containing a product of m variables. So, for example,

$\begin{align*} e_1(w, x, y, z) &= w + x + y + z \\ e_2(w, x, y, z) &= wx + wy + wz + xy + xz + yz \\ e_3(w, x, y, z) &= xyz + wyz + wxz + wxy \\ e_4(w, x, y, z) &= wxyz \end{align*}$
These polynomials came up in the previous post. The problem was choosing weights to minimize the variance of a weighted sum of random variables can be solved using elementary symmetric polynomials.

To state the optimization problem more generally, suppose you want to minimize

$t_1^2 x_1 + t_2^2x_2 + \cdots + t_n^2 x_n$

where the t_i and x_i are positive and the t_i sum to 1. You can use Lagrange multipliers to show that the solution is

$t_i = \frac{e_n(x_1, x_2, \cdots, x_n)}{x_i \,e_{n-1}(x_1, x_2, \cdots, x_n)}$

Minimize squared relative error

Posted on 21 June 2025 by John

Suppose you have a list of positive data points y₁, y₂, …, y_n and you wanted to find a value α that minimizes the squared distances to each of the y‘s.

$\sum_{i=1}^n (y_i - \alpha)^2$

Then the solution is to take α to be the mean of the y‘s:

$\alpha = \frac{1}{n} \sum_{i=1}^n y_i$

This result is well known [1]. The following variation is not well known.

Suppose now that you want to choose α to minimize the squared relative distances to each of the y‘s. That is, you want to minimize the following.

$\sum_{i=1}^n \left( \frac{y_i - \alpha}{\alpha} \right)^2$

The value of alpha this expression is the contraharmonic mean of the y‘s [2].

$\alpha = \frac{\sum_{i=1}^n y_i^2}{\sum_{i=1}^n y_i}$

[1] Aristotle says in the Nichomachean Ethics “The mean is in a sense an extreme.” This is literally true: the mean minimizes the sum of the squared errors.

[2] E. F. Beckenbach. A Class of Mean Value Functions. The American Mathematical Monthly. Vol. 57, No. 1 (Jan., 1950), pp. 1–6

Optimal rational approximation

Posted on 3 September 2024 by John

A few days ago I wrote about the approximation

$\cos x \approx \frac{\pi^2 - 4x^2}{\pi^2 + x^2}$

for cosine due to the Indian astronomer Aryabhata (476–550) and gave this plot of the error.

I said that Aryabhata’s approximation is “not quite optimal since the ripples in the error function are not of equal height.” This was an allusion to the equioscillation theorem.

Chebyshev proved that an optimal polynomial approximation has an error function that has equally large positive and negative oscillations. Later this theorem was generalized to rational approximations through a sequence of results by de la Vallée Poussin, Walsh, and Achieser. Here’s the formal statement of the theorem from [1] in the context of real-valued rational approximations with numerators of degree m and denominators of degree n

Theorem 24.1. Equioscillation characterization of best approximants. A real function f in C[−1, 1] has a unique best approximation r*, and a function r is equal to r* if and only if f − r equioscillates between at least m + n + 2 − d extremes where d is the defect of r.

When the theorem says the error equioscillates, it means the error alternately takes on ± its maximum absolute value.

The defect is non-zero when both numerator and denominator have less than maximal degree, which doesn’t concern us here.

We want to find the optimal rational approximation for cosine over the interval [−π/2, π/2]. It doesn’t matter that the theorem is stated for continuous functions over [−1, 1] because we could just rescale cosine. We’re looking for approximations with (m, n) = (2, 2), i.e. ratios of quadratic polynomials, to see if we can improve on the approximation at the top of the post.

The equioscillation theorem says our error should oscillate at least 6 times, and so if we find an approximation whose error oscillates as required by the theorem, we know we’ve found the optimal approximation.

I first tried finding the optimal approximation using Mathematica’s MiniMaxApproximation function. But this function tries to optimize relative error and I’m trying to minimize absolute error. Minimizing relative error creates problems because cosine evaluates to zero at the ends of interval ±π/2. I tried several alternatives and eventually decided to take another approach.

Because the cosine function is even, the optimal approximation is even. Which means the optimal approximation has the form

(a + bx²) / (c + dx²)

and we can assume without loss of generality that a = 1. I then wrote some Python code to minimize the error as a function of the three remaining variables. The results were b = −4.00487004, c = 9.86024544, and d = 1.00198695, very close to Aryabhata’s approximation that corresponds to b = −4, c = π², and d = 1.

Here’s a plot of the error, the difference between cosine and the rational approximation.

The absolute error takes on its maximum value seven times, alternating between positive and negative values, and so we know the approximation is optimal. However sketchy my approach to finding the optimal approximation may have been, the plot shows that the result is correct.

Aryabhata’s approximation had maximum error 0.00163176 and the optimal approximation has maximum error 0.00097466. We were able to shave about 1/3 off the maximum error, but at a cost of using coefficients that would be harder to use by hand. This wouldn’t matter to a modern computer, but it would matter a great deal to an ancient astronomer.

[1] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

Differentially private stochastic gradient descent

Posted on 8 November 2023 by John

Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent.

Gradient descent

We’ll start with gradient descent. Suppose you have a function of several variables f(x) where x is a vector. Then the gradient ∇f(x) points in the direction of greatest increase in f, and its negative −∇f(x) points in the direction of greatest decrease. So you can decrease the function f by moving in the direction −∇f(x). You can turn this observation into an algorithm for minimizing f. Find the gradient, take a step in the opposite direction. Rinse, lather, and repeat. The gradient descent method is also called steepest descent because locally you’re always moving in the direction of steepest descent.

Locally is the key word here. In some finite neighborhood of x, −∇f(x) points in a direction that will decrease the value of f. How large is this neighborhood? Hard to say. How long a step should you take? Also hard to say.

If your function f is strictly convex, then there is a global minimum, and the gradient descent algorithm will converge to it.

Stochastic gradient descent

What could go wrong? If your objective function f is strictly convex, convergence is guaranteed, but convergence may be slow. The method is always locally optimal, but optimal may mean inside a tiny little neighborhood of where you start each iteration.

Gradient descent can meander through a valley, a nearly flat spot of the objective function, taking tiny steps back and forth and not making much progress toward finding the low point.

Another problem is that gradient descent can get stuck in a local minimum. If f is strictly convex then there are no local minima, just one global minimum, but stochastic gradient descent is used to minimize functions that are not convex and that may have many local minima.

So to get unstuck, either from being stuck at a local minimum or from a flat spot, stochastic gradient descent adds randomness.

The objective functions used in training neural networks have many variables, maybe millions or billions, and these functions are far from convex.

Differentially private stochastic gradient descent

Now suppose you’re doing machine learning on sensitive data, such as medical records. You want to train a model on the data, but you don’t want the model to memorize and leak personally identifiable information (PII) or protected health information (PHI).

If you were simply querying the data, rather than training a network on it, you could apply differential privacy to queries, adding an amount of noise to each query result calibrated to be just enough randomness to preserve privacy. But training a neural network is far more complicated that running a SQL query.

The core idea of differential privacy is that any individual’s presence or absence from a database must not make much difference, in a rigorously quantified sense. Any outcome that happened with someone in the database could plausibly have happened without that person being in the database. Stochastic gradient descent is already a randomized algorithm, and variations on the algorithm can also provide differential privacy guarantees. See [1] for a seminal paper and [2] for a later refinement.

[1] Shuang Song, Kamalika Chaudhuri, Anand D. Sarwate. Stochastic gradient descent with differentially private updates. 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

[2] Abadi et al. Deep Learning with Differential Privacy. arXiv:1607.00133 [stat.ML]

Normal approximation to normal

Posted on 4 November 2023 by John

In my previous post on approximating a logistic distribution with a normal distribution I accidentally said something about approximating a normal with a normal.

Obviously the best approximation to a probability distribution is itself. As Norbert Wiener said “The best material model of a cat is another, or preferably the same, cat.”

But this made me think of the following problem. Let f be the density function of a standard normal random variable, i.e. one with mean zero and standard deviation 1. Let g be the density function of a normal random variable with mean μ > 0 and standard deviation σ.

For what value of σ does g best approximate f? Is it simply σ = 1? Does it depend on μ?

I looked at the 1, 2, and ∞ norms, and in each case the optimal value of σ is not 1, and the optimal value does depend on μ. When μ is close to 0, σ is close to 1, as you’d probably expect. But for larger μ the results are surprising.

For the 1-norm and 2-norm, the optimal value of σ increases with μ and reaches a maximum of 2, then remains constant.

For the ∞ norm, the optimal value of σ increases briefly then decreases.

Logistic / Normal approximation

Posted on 4 November 2023 by John

In a recent post I pointed out that a soliton, a solution to the KdV equation, looks a lot like a normal density for fixed x. As someone pointed out in the comments, one way to look at this is that the soliton is exactly proportional to the density of a logistic distribution, and it’s well known that the logistic distribution is approximately normal.

Why?

Why might this approximation be useful?

You might want to approximate a normal distribution by a logistic distribution because the cumulative density function of the latter is an elementary function whereas the CDF of the former is not.

You might want to approximate a logistic distribution by a normal distribution because the normal distribution has nice, well-understood theoretical properties.

How?

If you wanted to approximate a logistic distribution by a normal, or vice versa, how would you do so? How large is the error in the approximation?

This post will answer these questions for four matching methods:

Moment matching
1-norm
2-norm
Sup-norm

We will find the value of the logistic scale parameter s that minimizes the distance between the logistic PDF

f(x, s) = sech²(x / 2s) / 4s

and that of the standard normal

g(x) = (2π)^−1/2exp( − x²/2 )

by each of the criteria above. We set the scale parameter of the normal to 1 because the ratio of the optimal logistic scale to the normal scale is constant.

Moment matching

Let s be the scale parameter for the logistic distribution and let σ be the scale parameter for the normal distribution. We will assume both distributions have mean 0.

The variance of the logistic is π² s²/3 and the variance of the normal is σ². So moment matching requires

σ = π s / √3

s = √3 / π

since we’re setting σ = 1.

plot

How good is this approximation? That depends on how you measure the error, which we will explore below. We will see how it compares to the optimal solution under each criterion.

1-norm

It would be nice to calculate the 1-norm of the difference f(x, s) − g(x) then minimize this as a function of s. But that difference cannot be computed in closed form. At least Mathematica can’t compute it in closed form. So I found the minimum numerically.

plot

Moment matching sets s = 0.5513 and leads to a 1-norm error of 0.1087.

The optimal value of s for the 1-norm is s = 0.6137 which yields an error of 0.0665.

2-norm

plot
With moment matching s = 0.5513 and the 2-norm error is 0.05566.

The optimal value for the 2-norm is s = 0.61476 which yields a 2-norm error of 0.0006973.

sup-norm

The sup norm, a.k.a. min-max norm or ∞ norm, minimizes the maximum distance between the two functions.

plot

When s = 0.5513 the sup norm is 0.0545.

The optimal value of s for the sup norm is 0.6237 and yields a sup norm error of 0.01845.

Conclusion

We can improve on moment matching, for all three norms simultaneously, by using a larger value of s, such as 0.61.

If you have a normal(μ, σ) distribution and you want to approximate it by a logistic distribution, set the mean of the latter to μ and the scale to 0.61σ. If you care about a particular error measure, use the corresponding multiplier rather than 0.61.

If you want to approximate a logistic with mean μ and scale s by a normal, set the mean of the normal to μ and set σ = s/0.61.

Lagrange multiplier setup: Now what?

Posted on 2 February 2023 by John

Suppose you need to optimize, i.e. maximize or minimize, a function f(x). If this is a practical problem and not a textbook exercise, you probably need to optimize f(x) subject to some constraint on x, say g(x) = 0.

Hmm. Optimize one function subject to a constraint given by another function. Oh yeah, Lagrange multipliers! You just need to solve

$\nabla f = \nabla g$

Or maybe you have two constraints: g(x) = 0 and h(x) = 0. No big deal. Just introduce another Lagrange multiplier:

$\nabla f = \lambda_1 \nabla g + \lambda_2 \nabla h$

OK. Now what?

Suppose your x is an n-dimensional vector, and you have k constraints. Now you have a system of n + k equations in n + k unknowns, one equation for each component of the Lagrange multiplier equation and one equation for each constraint.

If your system of equations is linear, hurrah! Maybe it won’t be hard to find where your function takes on its maximum and minimum values [1].

But in general you have a system of nonlinear equations, potentially a lot of nonlinear equations in a lot of unknowns. Maybe you’re new problem is harder than your original problem. That is, it might be easier to optimize your original function, subject to its constraints, using a different method than to solve the nonlinear system of equations that applying Lagrange multipliers gives you.

How do you solve a system of nonlinear equations? That’s a hard question because “nonlinear” is not a hypothesis; it’s the negation of a hypothesis. Suppose I tell you I’m thinking of an animal, and it’s not an elephant. That’s not much to go on. Non-linear is sorta like non-elephantine: you’ve said what a thing is not, but you haven’t said what it is.

One way for functions to be nonlinear is to be convex; convexity is often a tractable generalization of linearity. If your original optimization problem is convex, i.e. both the objective function and the constraints are convex functions, then directly solving your optimization problem may be easier than introducing Lagrange multipliers.

Another class of nonlinear functions is polynomials. If Lagrange multipliers give you a system of several polynomial equations in several unknowns, you may be able to solve your system of equations using techniques from algebraic geometry.

If your equations do not involve polynomial functions, but do involve functions that can be adequately approximated by polynomial functions, maybe the algebraic geometry route will work. But “adequate” here means that the solution to the approximate problem, the polynomial problem, gives a sufficiently accurate solution to the original problem. That may be hard to assess.

Your initial exposure to Lagrange multipliers in a calculus class may have left the wrong impression. Calculus textbooks are filled carefully contrived exercises that can be solved in a moderate amount of time, and for good reason. The problems you solve for exercises are not intrinsically important; you’re solving them to learn a technique. But in a real application, where the problem is intrinsically important (and the solution technique is not) things might be a lot harder.

[1] If then system of equations is linear, then your optimization problem must have a very special form: optimizing a quadratic objective function subject to linear constraints, i.e. a quadratic programming problem.

Best approximation of a catenary by a parabola

Posted on 29 June 2020 by John

A parabola and a catenary can look very similar but are not the same. The graph of

y = x²

is a parabola and the graph of

y = cosh(x) = (e^x + e^−x)/2

is a catenary. You’ve probably seen parabolas in a math class; you’ve seen a catenary if you’ve seen the St. Louis arch.

Depending on the range and scale, parabolas and catenaries can be too similar to distinguish visually, though over a wide range enough range the exponential growth of the catenary becomes apparent.

For example, for x between −1 and 1, it’s possible to scale a parabola to match a catenary so well that the graphs practically overlap. The blue curve is a catenary and the orange curve is a parabola.

The graph above looks orange because the latter essentially overwrites the former. The relative error in approximating the catenary by the parabola is about 0.6%.

But when x ranges over −10 to 10, the best parabola fit is not good at all. The catenary is much flatter in the middle and much steeper in the sides. On this wider scale the hyperbolic cosine function is essentially e^|x|.

Here’s an intermediate case, −3 < x < 3, where the parabola fits the catenary pretty well, though one can easily see that the curves are not the same.

Now for some details. How are we defining “best” when we say best fit, and how do we calculate the parameters for this fit?

I’m using a least-squares fit, minimizing the L² norm of the error, over the interval [−M, M]. That is, I’m approximating

cosh(x)

with

c + kx²

and finding c and k that minimize the integral

$\int_{-M}^M (\cosh(x) - c - kx^2)^2\, dx$

The optimal values of c and k vary with M. As M increases, c decreases and k increases.

It works out that the optimal value of c is

$-\frac{3 \left(M^2 \sinh (M)+5 \sinh (M)-5 M \cosh (M)\right)}{2 M^3}$

and the optimal value of k is

$\frac{15 \left(M^2 \sinh (M)+3 \sinh (M)-3 M \cosh (M)\right)}{2 M^5}$

Here’s a log-scale plot of the L² norm of the error, the square root of the integral above, for the optimal parameters as a function of M.

Inverse optimization

Posted on 18 May 2020 by John

This morning Patrick Honner posted the image below on Twitter.

The image was created by Robert Bosch by solving a “traveling salesman” problem, finding a nearly optimal route for passing through 12,000 points.

I find this interesting for a couple reasons. For one thing, I remember when the traveling salesman problem was considered intractable. And in theory it still is! But in practice it is not difficult now to find nearly optimal solutions for very large traveling salesman problems.

Another reason I find this interesting is that there is a higher-level problem in the background, the problem of constructing a problem whose solution gives you a desired image.

Robert Bosch is showing us a solution to two problems. The image above is the solution to an optimization problem, and the problem it solves is itself the solution to another problem in the background. He’s doing a sort of inverse optimization, searching for an optimization problem. He goes into some of his techniques in his recent book Opt Art which I wrote about here.

This gets back to an argument I’ve had with statisticians who are resistant to innovative methods. They’ll say “But our method is optimal. You can’t do any better by definition. End of discussion!” But every method is optimal by some criteria. The question is whether the method is optimal by criteria appropriate to the problem at had or whether it is only optimal according to criteria that were mathematically convenient at the time it was formulated.

Robert Bosch has shown beautifully that any image can be viewed as the solution to an optimization problem. Of course the problem has to be crafted for the image. If he were solving a naturally occurring problem, such as planning a tour for an actual sales rep, and the solution had an uncanny resemblance to a piece of art by Michelangelo, that would be astounding.

ADFGVX cipher and Morse code separation

Posted on 22 February 2020 by John

A century ago the German army used a field cipher that transmitted messages using only six letters: A, D, F, G, V, and X. These letters were chosen because their Morse code representations were distinct, thus reducing transmission error.

The ADFGVX cipher was an extension of an earlier ADFGV cipher. The ADFGV cipher was based on a 5 by 5 grid of letters. The ADFGVX extended the method to a 6 by 6 grid of letters and digits. A message was first encoded using the grid coordinates of the letters, then a transposition cipher was applied to the sequence of coordinates.

This post revisits the design of the ADFGVX cipher. Not the encryption method itself, but the choice of letters used for transmission. How would you quantify the difference between two Morse code characters? Given that method of quantification, how good was the choice of ADFGV or its extension ADFGVX? Could the Germans have done better?

Quantifying separation

There are several possible ways to quantify how distinct two Morse code signals are.

Time signal difference

My first thought was to compare the signals as a function of time.

There are differing conventions for how long a dot or dash should be, and how long the space between dots and dashes should be. For this post, I will assume a dot is one unit of time, a dash is three units of time, and the space between dots or dashes is one unit of time.

The letter A is represented by a dot followed by a dash. I’ll represent this as 10111: on for one unit of time for the dot, off for one unit of time for the space between the dot and the dash, and on for three units of time for the dash. D is dash dot dot, so that would be 1110101.

We could quantify the difference between two letters in Morse code as the Hamming distance between their representations as 0s and 1s, i.e. the number of positions in which the two letters differ. To compare A and D, for example, I’ll pad the A with a couple zeros on the end to make it the same length as D.

    A: 1011100
    D: 1110101
        x x  x

The distance is 3 because the two sequences differ in three positions. (Looking back at the previous post on popcount, you could compute the distance as the popcount of the XOR of the two bit patterns.)

A problem with this approach is that it seems to underestimate the perceived difference between F and G.

    F: ..-. 1010111010
    G: --.  1110111010
             x

These only differ in the second bit, but they sound fairly different.

Symbolic difference

The example above suggests maybe we should compare the sequence of dots and dashes themselves rather than compare their corresponding time signals. By this measure F and G are distance 4 apart since they differ in every position.

Other possibilities

Comparing the symbol difference may over-estimate the difference between U (..-) and V (...-). We should look at some combination of time signal difference and symbolic difference.

Or maybe the thing to do would be to look at something like the edit distance between letters. We could say that U and V are close because it only takes inserting a dot to turn a U into a V.

Was ADFGV optimal?

There are several choices of letters that would have been better than ADFGV by either way of measuring distance. For example, CELNU has better separation and takes about 14% less time to transmit than ADFGV.

Here are a couple tables that give the time distance (dT) and the symbol distance (dS) for ADFGV and for CELNU.

    |------+----+----|
    | Pair | dT | dS |
    |------+----+----|
    | AD   |  3 |  3 |
    | AF   |  4 |  3 |
    | AG   |  5 |  2 |
    | AV   |  4 |  3 |
    | DF   |  3 |  3 |
    | DG   |  2 |  1 |
    | DV   |  3 |  2 |
    | FG   |  1 |  4 |
    | FV   |  2 |  2 |
    | GV   |  3 |  3 |
    |------+----+----|
    
    |------+----+----|
    | Pair | dT | dS |
    |------+----+----|
    | CE   |  7 |  4 |
    | CL   |  4 |  3 |
    | CN   |  4 |  2 |
    | CU   |  5 |  2 |
    | EL   |  5 |  3 |
    | EN   |  3 |  2 |
    | EU   |  4 |  2 |
    | LN   |  4 |  4 |
    | LU   |  3 |  3 |
    | NU   |  3 |  2 |
    |------+----+----|

For six letters, CELNOU is faster to transmit than ADFGVX. It has a minimum time distance separation of 3 and a minimum symbol distance 2.

Time spread

This is an update in response to a comment that suggested instead of minimizing transmission time of a set of letters, you might want to pick letters that are most similar in transmission time. It takes much longer to transmit C (-.-.) than E (.), and this could make CELNU harder to transcribe than ADFGV.

So I went back to the script I’m using and added time spread, the maximum transmission time minus the minimum transmission time, as a criterion. The ADFGV set has a spread of 4 because V takes 4 time units longer to transmit than A. CELNU has a spread of 10.

There are 210 choices of 5 letters that have time distance greater than 1, symbol distance greater than 1, and spread equal to 4. That is, these candidates are more distinct than ADFGV and have the same spread.

It takes 44 time units to transmit ADFGV. Twelve of the 210 candidates identified above require 42 or 40 time units. There are five that take 40 time units:

ABLNU
ABNUV
AGLNU
AGNUV
ALNUV

Looking at sets of six letters, there are 464 candidates that have better separation than ADFGVX and equal time spread. One of these, AGLNUX, is an extension of one of the 5-letter candidates above.

The best 6-letter are ABLNUV and AGLNUV. They are better than ADFGVX by all the criteria discussed above. They both have time distance separation 2 (compared to 1), symbol distance separation 2 (compared to 1), time spread 4, (compared to 6) and transmission time 50 (compared to 56).

Optimization

Elementary symmetric polynomials and optimization

Related posts

Minimize squared relative error

Related posts

Optimal rational approximation

Related posts

Differentially private stochastic gradient descent

Gradient descent

Stochastic gradient descent

Differentially private stochastic gradient descent

Related posts

Normal approximation to normal

Logistic / Normal approximation

Why?

How?

Moment matching

1-norm

2-norm

sup-norm

Conclusion

Lagrange multiplier setup: Now what?

Related posts

Best approximation of a catenary by a parabola

More on catenaries

Inverse optimization

More optimization posts

ADFGVX cipher and Morse code separation

Quantifying separation

Time signal difference

Symbolic difference

Other possibilities

Was ADFGV optimal?

Time spread

More Morse code posts