Sorting Roman numerals

This morning I wrote about the frequencies of names for popes and kings. This involved sorting strings with Roman numerals since it’s common for popes and kings to have Roman numerals after their names.

Something that surprised me was that sorting Roman numerals alphabetically roughly sorts them in numerical order, especially for small numbers. It’s not perfect. For example, IX comes before V in alphabetical order.

Everyone who has done much work with data will have run into the problem of a column of numbers being sorted alphabetically rather than numerically. For example, “10” comes between “1” and “2” even though 10 comes after 1 and 2.

So you can’t sort numerals, Roman or Arabic, as strings and expect them to appear in numerical order. But Roman numbers come close when you’re sorting small numbers, such as I through XXIII for popes named John or I through VIII for kings of England named Henry.

To illustrate this, I plotted how well string sort order correlates with numeric order for Roman and Arabic numerals, for the sequence 1 … n for increasing values of n. I measured correlation using Spearman’s rank-order correlation. I tried Kendall’s tau as well and got similar results.
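Here’s a minimal sketch of the kind of calculation I have in mind, using SciPy’s spearmanr. The helper functions are mine, not code from the original analysis.

    from scipy.stats import spearmanr

    def roman(n):
        """Convert a positive integer to a Roman numeral (standard greedy conversion)."""
        vals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
                (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
                (5, "V"), (4, "IV"), (1, "I")]
        out = ""
        for v, s in vals:
            while n >= v:
                out += s
                n -= v
        return out

    def rank_corr(strings):
        """Spearman correlation between numeric order and alphabetical (string) order."""
        alphabetical = sorted(strings)
        ranks = [alphabetical.index(s) + 1 for s in strings]
        rho, _ = spearmanr(range(1, len(strings) + 1), ranks)
        return rho

    n = 38
    print(rank_corr([roman(k) for k in range(1, n + 1)]))  # Roman numerals
    print(rank_corr([str(k) for k in range(1, n + 1)]))    # Arabic numerals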

Alphabetical order and numerical order for Roman numerals agree pretty well up to XXXVIII, with just a few numbers out of place, namely IX, XIX, and XXIX. But alphabetical order and numerical order diverge quite a bit for Arabic numerals because all the numbers from 10 through 19 come before 2.

As you go further out, alphabetical order and numerical order diverge for both writing systems, but especially for Roman numerals.

 

Knuth’s series for chi squared percentage points

In the latest two posts, we have needed to find the percentage points for a chi square random variable when the parameter ν, the number of degrees of freedom, is large.

In Volume 2 of Knuth’s magnum opus TAOCP, he gives a table of percentage points for the chi square distribution for small values of ν and he gives an asymptotic approximation to be used for values of ν > 30.

Knuth’s series is

\nu + \sqrt{2\nu}x_p + \frac{2}{3} x_p^2 - \frac{2}{3} + {\cal O}(1/\sqrt{\nu})

where xp is the corresponding percentage point function for a normal random variable. Knuth gives a few values of xp in his table.

What Knuth calls xp, I called z in the previous post. The approximation I gave, based on Fisher’s transformation to make the chi squared more like a normal, is similar to Knuth’s series, but not the same.

So which approximation is better? Where does Knuth’s approximation come from?

Comparing approximations

We want to find the quantiles for χ²(100) for p = 0.01 through 0.99. Both Fisher’s approximation and Knuth’s approximation are good enough that it’s hard to tell the curves apart visually when the exact values are plotted along with the two approximations.
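Here’s a sketch of the comparison in Python, using SciPy for the exact quantiles. Knuth’s series is the formula above; Fisher’s approximation is the one from the previous post.

    import numpy as np
    from scipy.stats import chi2, norm

    nu = 100
    p = np.linspace(0.01, 0.99, 99)
    xp = norm.ppf(p)                        # normal percentage points

    exact  = chi2.ppf(p, nu)
    fisher = (xp + np.sqrt(2*nu - 1))**2 / 2
    knuth  = nu + np.sqrt(2*nu)*xp + (2/3)*xp**2 - 2/3

    print(np.max(np.abs(fisher - exact)))   # maximum error, Fisher
    print(np.max(np.abs(knuth - exact)))    # maximum error, Knuth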

But when you subtract off the exact values and just look at the errors, it’s clear that Knuth’s approximation is much better in general.

Here’s another plot, taking the absolute value of the error. This makes it easier to see little regions where Fisher’s approximation happens to be better.

Here are a couple more plots comparing the approximations at more extreme probabilities. First the left tail:

And then the right tail:

Derivation

The table mentioned above says to see exercise 16 in the same section, which in turn refers to Theorem 1.2.11.3A in Volume 1 of TAOCP. There Knuth derives an asymptotic series which is not directly the one above, but does most of the work. The exercise asks the reader to finish the derivation.

Chi squared approximations

In the previous post I needed to know the tail percentile points for a chi squared distribution with a huge number of degrees of freedom. When the number of degrees of freedom ν is large, a chi squared random variable has approximately a normal distribution with the same mean and variance, namely mean ν and variance 2ν.

In that post, ν was 9999 and we needed to find the 2.5 and 97.5 percentiles. Here are the percentiles for χ²(9999):

    >>> from scipy.stats import chi2, norm
    >>> chi2(9999).ppf([0.025, 0.975])
    array([ 9723.73223701, 10278.05632026])

And here are the percentiles for N(9999, √19998):

    >>> norm(9999, (2*9999)**0.5).ppf([0.025, 0.975])
    array([ 9721.83309451, 10276.16690549])

So the results on the left end agree to three significant figures and the results on the right agree to four.

Fewer degrees of freedom

When ν is more moderate, say ν = 30, the normal approximation is not so hot. (We’re stressing the approximation by looking fairly far out in the tails. Closer to the middle the fit is better.)

Here are the results for χ²(30):

    >>> chi2(30).ppf([0.025, 0.975])
    array([16.79077227, 46.97924224])

And here are the results for N(30, √60):

    >>> norm(30, (60)**0.5).ppf([0.025, 0.975])
    array([14.81818426, 45.18181574])

The normal distribution is symmetric and the chi squared distribution is not, though it becomes more symmetric as ν → ∞. Transformations of the chi squared distribution that make it more symmetric may also improve the approximation accuracy. That wasn’t important when we had ν = 9999, but it is more important when ν = 30.

Fisher transformation

If X ~ χ²(ν), Fisher suggested the approximation √(2X) ~ N(√(2ν − 1), 1).

Let Y be a N(√(2ν − 1), 1) random variable and Z a standard normal random variable, N(0, 1). Then we can estimate χ² probabilities from normal probabilities.

\begin{align*} P(X \leq x) &= P(\sqrt{2X} \leq \sqrt{2x}) \\ &\approx P(Y \leq \sqrt{2x}) \\ &= P(Z \leq \sqrt{2x} - \sqrt{2\nu - 1}) \end{align*}

So if we want to find the percentage points for X, we can solve for corresponding percentage points for Z.

If z is the point where P(Z ≤ z) = p, then

x = \frac{(z + \sqrt{2\nu-1})^2}{2}

is the point where P(X ≤ x) = p.

If we use this to find the 2.5 and 97.5 percentiles for a χ²(30) random variable, we get 16.36 and 46.48, an order of magnitude more accurate than before.
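In Python, that calculation looks something like this (a sketch using SciPy):

    from scipy.stats import chi2, norm

    nu = 30
    z = norm.ppf([0.025, 0.975])             # standard normal percentage points
    fisher = (z + (2*nu - 1)**0.5)**2 / 2    # Fisher's approximation
    print(fisher)                            # roughly [16.36, 46.48]
    print(chi2(nu).ppf([0.025, 0.975]))      # exact: [16.79, 46.98]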

When ν = 9999, the Fisher transformation gives us percentiles that are two orders of magnitude more accurate than before.

Wilson–Hilferty transformation

If X ~ χ²(ν), the Wilson–Hilferty transformation says that (X/ν)^(1/3) is approximately normal with mean 1 − 2/(9ν) and variance 2/(9ν).

This transformation is a little more complicated than the Fisher transform, but also more accurate. You could go through calculations similar to those above to approximate percentage points using the Wilson–Hilferty transformation.
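Here’s a sketch of what that might look like, solving the transformation above for the chi squared percentage point:

    from scipy.stats import chi2, norm

    nu = 30
    z = norm.ppf([0.025, 0.975])
    # Wilson–Hilferty: x ≈ ν (1 − 2/(9ν) + z √(2/(9ν)))³
    wh = nu * (1 - 2/(9*nu) + z*(2/(9*nu))**0.5)**3
    print(wh)                            # approximately [16.78, 46.98]
    print(chi2(nu).ppf([0.025, 0.975]))  # exact: [16.79, 46.98]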

The main use for approximations like these is now in analytical calculations, since software packages can give accurate numerical results. For analytical calculation, the simplicity of the Fisher transformation may outweigh the improved accuracy of the Wilson–Hilferty transformation.


Variance of variances. All is variance.

In the previous post, I did a simulation to illustrate a theorem about the number of steps needed in the Euclidean algorithm. The distribution of the number of steps is asymptotically normal, and for numbers 0 < a < b < x the mean is asymptotically

12 log(2) log(x) / π².

What about the variance? The reference cited in the previous post [1] gives an expression for the variance, but it’s complicated. The article said the variance is close to a constant times log x, where that constant involves the second derivative of “a certain pseudo zeta function associated with continued fractions.”

So let’s estimate the variance empirically. We can easily compute the sample variance to get an unbiased point estimate of the population variance, but how can we judge how accurate this point estimate is? We’d need the variance of our estimate of variance.

When I taught statistics classes, this is where students would struggle. There are multiple levels of randomness going on. We have some random process. (Algorithm run times are not really random, but we’ll let that slide.)

We have a statistic S², the sample variance, which is computed from samples from a random process, so our statistic is itself a random variable. As such, it has a mean and a variance. The mean of S² is the population variance; that’s what it means for a statistic to be an unbiased estimator.

The statistic S² has a known distribution, which we can use to create a confidence interval. Namely,

\frac{(n-1)S^2}{\sigma^2} \sim \chi^2

where the chi squared random variable has n − 1 degrees of freedom. Here n is the number of samples and σ² is the population variance. But wait, isn’t the whole purpose of this discussion to estimate σ² because we don’t know what it is?!

This isn’t quite circular reasoning. More like iterative reasoning. We have an estimate of σ² which we then turn around and use in constructing its confidence interval. It’s not perfect, but it’s useful.

A (1 − α) 100% confidence interval is given by

\left( \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}, \frac{(n-1)s^2}{\chi^2_{\alpha/2}} \right)

Notice that the right tail probability appears on the left end of the interval and the left tail probability appears on the right end of the interval. The order flips because they’re in a denominator.

I modified the simulation code from the previous post to compute the sample standard deviation for the runs, and got 4.6926, which corresponds to a sample variance of 22.0205. There were 10,000 reps, so our chi squared distribution has 9,999 degrees of freedom. We can compute the confidence interval as follows.

    >>> from scipy.stats import chi2
    >>> 9999*22.0205/chi2.ppf([0.975, 0.025], 9999)

This says a 95% confidence interval for the variance, in one particular simulation run, was [21.42, 22.64].
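For reference, here’s a sketch of the kind of simulation and interval calculation involved. The function and variable names are mine, not code from the previous post.

    import random
    import numpy as np
    from scipy.stats import chi2

    def euclid_steps(a, b):
        """Count the number of division steps in the Euclidean algorithm."""
        steps = 0
        while b:
            a, b = b, a % b
            steps += 1
        return steps

    x = 2**63      # upper bound on the random inputs, as in the previous post
    reps = 10_000  # number of simulation runs
    samples = [euclid_steps(random.randrange(1, x), random.randrange(1, x))
               for _ in range(reps)]

    s2 = np.var(samples, ddof=1)  # unbiased sample variance
    ci = (reps - 1)*s2 / chi2.ppf([0.975, 0.025], reps - 1)
    print(s2, ci)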

The result in [1] said that the variance is proportional to log x, and in our case x = 2⁶³ because that post simulated random numbers up to that limit. We can estimate the proportionality constant as 22.0205/(63 log 2) = 0.5042. In other words, the variance is about half of log x.

The theorem in [1] gives us the asymptotic variance. If we used a bigger value of x, the variance could be a little different, but I expect it would be around 0.5 log x.


[1] Doug Hensley. The Number of Steps in the Euclidean Algorithm. Journal of Number Theory 49. 142–182 (1994).

The multiple coupon collector problem

I’ve written about the Coupon Collector Problem and variations several times, most recently here.

Brian Beckman left a comment linking to an article he wrote, which in turn links to a paper on the Double Dixie Cup Problem [1].

The idea behind the Coupon Collector Problem is to estimate how long it will take to obtain at least one of each of n coupons when drawing coupons at random with replacement from a set of n.

The Double Dixie Cup Problem replaces coupons with dixie cups, and it asks how many cups you’d expect to collect before you have two of each.

The authors in [1] answer the more general question of how long to collect m of each cup, so m = 1 reduces to the Coupon Collector Problem. The punchline is Theorem 2 from the paper: the expected number of cups (or coupons) is

n \left( \log n + (m-1) \log \log n + C_m + o(1) \right)

for fixed m as n → ∞.

Here Cm is a constant that depends on m (but not on n) and o(1) is a term that goes to zero as n → ∞.

Since log log n is much smaller than log n when n is large, this says that collecting multiples of each coupon doesn’t take much longer, on average, than collecting one of each coupon. It takes substantially less time than playing the original coupon collector game m times.
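A quick simulation makes the punchline concrete. This is a sketch of my own, not code from the paper; the function name is mine. Running it with n = 100 lets you compare the averages for m = 1 and m = 2 directly.

    import random

    def draws_until_m_of_each(n, m):
        """Draw coupons uniformly at random until every one of the
        n coupon types has appeared at least m times."""
        counts = [0] * n
        draws = 0
        remaining = n  # types still short of m copies
        while remaining:
            c = random.randrange(n)
            counts[c] += 1
            draws += 1
            if counts[c] == m:
                remaining -= 1
        return draws

    n, reps = 100, 1000
    for m in (1, 2):
        avg = sum(draws_until_m_of_each(n, m) for _ in range(reps)) / reps
        print(m, avg)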


[1] Donald J. Newman and Lawrence Shepp. The Double Dixie Cup Problem. The American Mathematical Monthly, Vol. 67, No. 1 (Jan., 1960), pp. 58–61.

Top 100 radio

Suppose you want to hear the top 100 songs in some music genre, so you listen to a radio station that specializes in that kind of music. How long would it take to hear each of the songs at least once?

This is a variation on the coupon collector problem. If a radio station plays the top N songs in some genre, selected at random, the expected number of songs you would need to listen to before hearing each one at least once is

N HN

where HN is the Nth harmonic number. For large N this is approximately equal to

N(log N + γ)

where γ is the Euler-Mascheroni constant. When N = 100, this equals 518.
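As a quick numerical check (np.euler_gamma is NumPy’s constant for γ):

    import numpy as np

    N = 100
    H = sum(1/k for k in range(1, N + 1))     # harmonic number H_N
    print(N * H)                              # about 518.7
    print(N * (np.log(N) + np.euler_gamma))   # about 518.2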

But radio stations do not choose songs at random. Even a station that only plays the 100 most popular songs will presumably play the most popular songs more often than the less popular ones. They probably also have some rules about waiting a while before playing the same song again, but we’ll ignore that for this post.

Billboard for radio station ZIPF

Suppose we’re listening to ZIPF, a radio station that plays the top 100 songs according to Zipf’s law: the probability of playing the nth most popular song is proportional to 1/n. How long before you’ve heard all 100 songs?

We know it’ll take more than 518 songs on average. The unequal frequency of the songs will cause us to wait longer to hear the least popular song. The 100th most popular song, for example, will play with probability only 1/518 rather than probability 1/100.

Does the number 518 look familiar? It’s exactly the same number as above:

100 H100

That’s because the probability of hearing the 100th most popular song is proportional to 1/100, and the proportionality constant is the reciprocal of

1 + 1/2 + 1/3 + … + 1/100 = H100 = 5.1873…

So in general, the expected number of draws from a set of N items, selected uniformly, until you’ve seen everything at least once equals the reciprocal of the probability of selecting the least likely item when the items have draw probabilities according to Zipf’s law.

I don’t know how hard it would be to work out the exact expected number of songs for our problem of hearing all 100 songs on ZIPF, but we could estimate it by simulation.
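Here’s a sketch of the kind of simulation I have in mind (the function name is mine):

    import random
    from statistics import mean, stdev

    def zipf_radio(n=100):
        """Draw songs with Zipf weights (1/k) until all n have been heard."""
        weights = [1/k for k in range(1, n + 1)]
        heard = set()
        draws = 0
        while len(heard) < n:
            heard.add(random.choices(range(n), weights=weights)[0])
            draws += 1
        return draws

    results = [zipf_radio() for _ in range(1000)]
    print(mean(results), stdev(results))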

When I ran it, the average of 1000 simulation reps was 1900 with a standard deviation of 595.

So it would take roughly 20x longer to hear every song when songs are played according to Zipf’s law than it would take to simply listen through the list of 100 songs.


Normal probability approximation

The previous post presented an approximation

\int_0^x \exp(-t^2)\, dt \approx \sin(\sin(x))

for −1 ≤ x ≤ 1 and said that it was related to a probability function. This post will make the connection explicit.

Let X be a normally distributed random variable with mean μ and variance σ². Then the CDF of X is

P(X \leq x) = \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty}^x \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right)\, dt

So if μ = 0 and σ² = 1/2 then we have the following.

P(0 \leq X \leq x) = \frac{1}{\sqrt{\pi}} \int_0^x \exp(-t^2)\, dt \approx \frac{1}{\sqrt{\pi}} \sin(\sin(x))

Here’s Python code to show how good the approximation is.

    from numpy import sin, linspace, sqrt, pi
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    x = linspace(-1, 1, 200)
    X = norm(0, sqrt(0.5))
    plt.plot(x, X.cdf(x) - X.cdf(0))
    plt.plot(x, sin(sin(x))/sqrt(pi))
    plt.legend(["exact", "approx"])
    plt.xlabel("$x$")
    plt.ylabel(r"$P(0 \leq X \leq x)$")

Here’s the resulting plot:

The orange curve for the plot of the approximation completely covers the blue curve of the exact value. The two curves are the same to within the resolution of the plot. See the previous post for a detailed error analysis.

When do moments determine a function?


The use of the word “moment” in mathematics is related to its use in physics, as in moment arm or moment of inertia. For a non-negative integer n, the nth moment of a function f is the integral of xⁿ f(x) over the function’s domain.
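As a small illustration of the definition, here’s a sketch using SciPy’s quad for the integration. The nth moment of e^(−x) over [0, ∞) is n!.

    from math import exp, factorial, inf
    from scipy.integrate import quad

    def moment(f, n, a, b):
        """nth moment of f over [a, b]: the integral of x**n * f(x)."""
        return quad(lambda x: x**n * f(x), a, b)[0]

    # The nth moment of exp(-x) over [0, inf) is n!.
    for n in range(5):
        print(n, moment(lambda x: exp(-x), n, 0, inf), factorial(n))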

Uniqueness

If two continuous functions f and g have all the same moments, are they the same function? The answer is yes for functions over a finite interval, but no for functions over an unbounded interval.

Existence

Now let’s consider starting with a set of moments rather than starting with a function. Given a set of moments m0, m1, m2, … is there a function that has these moments? Typically no.

A better question is what conditions on the moments are necessary for a function with those moments to exist. This question breaks into three problems:

  1. The Hausdorff moment problem
  2. The Stieltjes moment problem
  3. The Hamburger moment problem

depending on whether the function domain is a finite interval, a half-bounded interval, or the real line. For each problem there are known conditions that are necessary and sufficient, but the conditions are different for each problem.

Interestingly, each of the three names Hausdorff, Stieltjes, and Hamburger is well known. Felix Hausdorff is best known for his work in topology: Hausdorff spaces, etc. Thomas Stieltjes is best known for the Riemann–Stieltjes integral and for his work on continued fractions. Hans Ludwig Hamburger is not as well known, though his last name is certainly familiar.

Finite moments

A practical question in probability is how well a finite number of moments determines a probability distribution. A finite set of moments cannot uniquely determine the distribution, but it does establish bounds on how different two distributions with the same moments can be. See this post.


RNG, PRNG, CSPRNG

Most random number generators are pseudorandom number generators (PRNGs). The distinction may be pedantic or crucial, depending on context. In the context of cryptography, it’s critical.

For this post, RNG will mean a physical, true random number generator.

A PRNG may be suitable for many uses—Monte Carlo simulation, numerical integration, game development, etc.—but not be suitable for cryptography. A PRNG which is suitable for cryptography is called a CSPRNG (cryptographically secure pseudorandom number generator).

A PRNG may have excellent statistical properties, and pass standard test suites for random number generators, and yet be insecure. The output of an insecure generator may have no regularities that statistical tests can detect, and yet have structure that a sneaky cryptanalyst can exploit. For example, the popular Mersenne Twister is fine for simulations, but its output can be predicted after observing a relatively short run. The prediction depends on a clever set of calculations that would be unnatural from a statistical perspective, which is why its statistical performance is better than its cryptographic performance.

CSPRNGs tend to be much slower than PRNGs, so you pay for security. And for a non-cryptographic application this cost isn’t worth it.

In general, statistical tests give necessary but not sufficient conditions for a PRNG to be a CSPRNG. If a PRNG fails statistical tests, it has some sort of regularity that potentially could be exploited by cryptanalysis. I had to tell a client once that although his PRNG passed the standard statistical tests, I was pretty sure I could break the cryptographic system he wanted to use it in. This news was not well received.


I suspect that a physical RNG with good statistical properties will have good cryptographic properties as well, unlike the typical situation with PRNGs. Cloudflare famously uses lava lamps to generate random bits for TLS keys. Cryptanalysts have exploited minor flaws in PRNGs, and so the lava lamps give Cloudflare one less thing to worry about. (I’m sure they still have plenty else to worry about.)

A physical RNG might fail statistical tests. For example, maybe the physical process is truly random but biased. Or maybe the process of turning physical phenomena into numbers introduces some bias. But it’s hard to imagine that an RNG could have a clean bill of statistical health and yet have a cryptographically exploitable weakness. It’s conceivable that a statistically impeccable physical RNG might have some unforeseen exploitable regularity, but this seems highly doubtful.
