Symmetric functions and U-statistics

A symmetric function is a function whose value is unchanged under every permutation of its arguments. The previous post showed how three symmetric functions of the sides of a triangle

a + b + c
ab + bc + ac
abc

are related to the perimeter, inner radius, and outer radius. It also mentioned that the coefficients of a cubic equation are symmetric functions of its roots.

This post looks briefly at symmetric functions in the context of statistics.

Let h be a symmetric function of r variables and suppose we have a set S of n numbers where n ≥ r. If we average h over all subsets of size r drawn from S, then the result is another symmetric function, this time of the n numbers in S, called a U-statistic. The “U” stands for unbiased.

If h(x) = x then the corresponding U-statistic is the sample mean.
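To make the definition concrete, here is a minimal sketch of a generic U-statistic. The function name `u_statistic` and its argument order are my own choices for illustration, not anything standard. It averages a kernel h over all subsets of size r and checks that h(x) = x with r = 1 gives back the sample mean.

```python
from itertools import combinations
from math import comb

def u_statistic(h, xs, r):
    """Average the symmetric kernel h over all size-r subsets of xs."""
    total = sum(h(*c) for c in combinations(xs, r))
    # comb(n, r) is the number of subsets of size r
    return total / comb(len(xs), r)

xs = [2, 3, 5, 7, 11]
mean_u = u_statistic(lambda x: x, xs, 1)
# Both numbers printed should agree: the U-statistic and the ordinary mean.
print(mean_u, sum(xs) / len(xs))
```

Enumerating all subsets takes on the order of n^r kernel evaluations, so this is only practical for small samples or small r.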

If h(x, y) = (x − y)²/2 then the corresponding U-statistic is the sample variance. Note that this is the sample variance, not the population variance. You could see this as a justification for why sample variance has an n − 1 in the denominator while the corresponding term for population variance has an n.
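A short derivation shows where the n − 1 comes from. The notation x_1, …, x_n for the sample and x̄ for its mean is my own, introduced here for the sketch. Expanding the square over all ordered pairs and then halving to count each unordered pair once gives the first line; dividing by the number of pairs gives the second.

```latex
\begin{align*}
\sum_{i<j} (x_i - x_j)^2
  &= \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2
   = n \sum_{i=1}^n x_i^2 - \Big(\sum_{i=1}^n x_i\Big)^2
   = n \sum_{i=1}^n (x_i - \bar{x})^2, \\
\binom{n}{2}^{-1} \sum_{i<j} \frac{(x_i - x_j)^2}{2}
  &= \frac{2}{n(n-1)} \cdot \frac{n}{2} \sum_{i=1}^n (x_i - \bar{x})^2
   = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 .
\end{align*}
```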

Here is some Python code that demonstrates that the average of (x − y)²/2 over all pairs in a sample is indeed the sample variance.

    import numpy as np
    from itertools import combinations

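    def var(xs):
        n = len(xs)
        # number of unordered pairs; avoid the name bin, which shadows a built-in
        num_pairs = n*(n-1)/2
        h = lambda x, y: (x - y)**2/2
        # average the kernel h over all unordered pairs
        return sum(h(*c) for c in combinations(xs, 2)) / num_pairs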

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs, ddof=1))
    print(var(xs))

Note the ddof=1 argument, which makes NumPy divide by n − 1 and so compute the sample variance rather than the population variance.

Many statistics can be formulated as U-statistics, and so numerous properties of such statistics are corollaries of general results about U-statistics. For example, U-statistics are asymptotically normal, and so sample variance is asymptotically normal.
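As a rough illustration of asymptotic normality, and not a proof, the following sketch simulates many sample variances and checks how often the standardized value falls within ±1.96. The setup is my own assumption: standard normal data, for which the sample variance has mean 1 and variance 2/(n − 1), with a sample size, replication count, and seed chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
n, reps = 200, 10_000            # sample size and number of simulated samples

samples = rng.standard_normal((reps, n))
s2 = samples.var(axis=1, ddof=1)  # one sample variance per row

# For standard normal data, E[s2] = 1 and Var[s2] = 2/(n - 1).
z = (s2 - 1.0) / np.sqrt(2.0 / (n - 1))

# If z is approximately standard normal, about 95% of values lie in (-1.96, 1.96).
coverage = np.mean(np.abs(z) < 1.96)
print(s2.mean(), coverage)
```

The average of the simulated variances should be close to 1, reflecting unbiasedness, and the coverage should be close to 0.95.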