A symmetric function is a function whose value is unchanged under every permutation of its arguments. The previous post showed how three symmetric functions of the sides of a triangle

*a* + *b* + *c*

*ab* + *bc* + *ac*

*abc*

are related to the perimeter, inner radius, and outer radius. It also mentioned that the coefficients of a cubic equation are symmetric functions of its roots.

This post looks briefly at symmetric functions in the context of statistics.

Let *h* be a symmetric function of *r* variables and suppose we have a set *S* of *n* numbers where *n* ≥ *r*. If we average *h* over all subsets of size *r* drawn from *S* then the result is another symmetric function, called a *U*-statistic. The “U” stands for unbiased.
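The definition above translates almost directly into code. The following is a minimal sketch, not a canonical implementation: the function name `u_statistic`, its argument order, and the product kernel used in the usage line are all illustrative choices that do not come from the post.

```python
from itertools import combinations
from math import comb

def u_statistic(h, xs, r):
    """Average the symmetric kernel h over all size-r subsets of xs.

    The name and signature are hypothetical, chosen for illustration.
    """
    xs = list(xs)
    n = len(xs)
    if n < r:
        raise ValueError("need at least r data points")
    # Sum h over every subset of size r, then divide by the
    # number of such subsets, n choose r.
    total = sum(h(*subset) for subset in combinations(xs, r))
    return total / comb(n, r)

# Illustrative kernel only: h(x, y) = x*y is symmetric in x and y.
print(u_statistic(lambda x, y: x * y, [1, 2, 3], 2))
```

Enumerating all subsets takes on the order of *n* choose *r* evaluations of *h*, so a brute-force approach like this is only practical for small *n* and *r*.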

If *h*(*x*) = *x* then the corresponding *U*-statistic is the sample mean.

If *h*(*x*, *y*) = (*x* − *y*)²/2 then the corresponding *U*-statistic is the sample variance. Note that this is the **sample** variance, not the **population** variance. You could see this as a justification for why sample variance has an *n* − 1 in the denominator while the corresponding term for population variance has an *n*.
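One way to see where the *n* − 1 comes from is to expand the average directly. The sketch below is written in LaTeX notation and uses $\bar{x}$ for the sample mean, a symbol not used elsewhere in this post.

$$
\begin{aligned}
\frac{1}{\binom{n}{2}} \sum_{i<j} \frac{(x_i - x_j)^2}{2}
&= \frac{1}{2n(n-1)} \sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2 \\
&= \frac{1}{2n(n-1)} \sum_{i=1}^n \sum_{j=1}^n \bigl( (x_i - \bar{x}) - (x_j - \bar{x}) \bigr)^2 \\
&= \frac{1}{2n(n-1)} \cdot 2n \sum_{i=1}^n (x_i - \bar{x})^2 \\
&= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2
\end{aligned}
$$

The first step holds because each unordered pair appears twice in the double sum and the diagonal terms are zero. The third step holds because the cross terms vanish: the deviations from the mean sum to zero.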

Here is some Python code that demonstrates that the average of (*x* − *y*)²/2 over all pairs in a sample is indeed the sample variance.

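```python
import numpy as np
from itertools import combinations

def var(xs):
    n = len(xs)
    num_pairs = n*(n-1)/2  # number of pairs, n choose 2
    h = lambda x, y: (x - y)**2/2
    # average the kernel h over all pairs in the sample
    return sum(h(*c) for c in combinations(xs, 2)) / num_pairs

xs = np.array([2, 3, 5, 7, 11])
print(np.var(xs, ddof=1))  # NumPy's sample variance
print(var(xs))             # U-statistic; should agree
```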

Note the `ddof=1` argument, which causes NumPy to compute the sample variance rather than the population variance.
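To make the effect of `ddof` concrete, here is a small sketch comparing the two settings on the same data. The variable names `pop_var` and `samp_var` are just labels chosen for this example.

```python
import numpy as np

xs = np.array([2, 3, 5, 7, 11])
n = len(xs)

pop_var = np.var(xs)           # default ddof=0: divides by n
samp_var = np.var(xs, ddof=1)  # ddof=1: divides by n - 1

# The two estimates differ by a factor of n/(n - 1).
print(pop_var, samp_var, pop_var * n / (n - 1))
```

NumPy divides the sum of squared deviations by `n - ddof`, so the default gives the population formula and `ddof=1` gives the sample formula.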

Many statistics can be formulated as *U*-statistics, and so numerous properties of such statistics are corollaries of general results about *U*-statistics. For example, *U*-statistics are asymptotically normal under mild moment conditions, and so sample variance is asymptotically normal.
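A rough simulation can illustrate the asymptotic normality of the sample variance. Everything specific here is an assumption made for the sketch: the exponential distribution, the sample sizes, the number of repetitions, the seed, and the use of skewness as a crude measure of departure from normality (a normal distribution has skewness 0).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

def standardized_variances(n, reps=5000):
    """Sample variances of `reps` exponential samples of size n,
    centered and scaled to have mean 0 and standard deviation 1."""
    samples = rng.exponential(size=(reps, n))
    v = samples.var(axis=1, ddof=1)
    return (v - v.mean()) / v.std()

def skewness(z):
    """Third moment of already-standardized values."""
    return float(np.mean(z**3))

# Skewness should shrink toward 0 as the sample size grows.
for n in (10, 100, 1000):
    print(n, skewness(standardized_variances(n)))
```

The printed skewness should be large for small samples and move toward zero as *n* increases, consistent with the distribution of the sample variance approaching a normal distribution. This is only a sanity check, not a proof or a formal test of normality.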