Random polynomials revisited

A few days ago I wrote about the expected number of roots in a random polynomial where each coefficient is drawn from a standard normal, i.e. a Gaussian distribution with mean 0 and variance 1.

Another class of random polynomials, one that comes up in applications to physics, draws each coefficient from a different distribution. Specifically, the nth coefficient in an Nth degree polynomial is drawn from a normal distribution with mean zero and variance N choose n. As before, our reference is [1].

It turns out that for these random polynomials, the expected number of real zeros is √N. This is not an asymptotic approximation.

As an example, I created a random 44th degree polynomial.

    from scipy.stats import norm
    from scipy.special import binom

    N = 44
    # scale is the standard deviation, i.e. the square root of the variance
    coeff = [norm.rvs(scale=binom(N,n)**0.5) for n in range(N+1)]

I plotted (the tanh of) this polynomial as described here.

This polynomial clearly has 8 real roots. The expected number of roots for a 44th degree random polynomial of this type is √44 = 6.63.
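If you'd rather count the real roots numerically than read them off a plot, here's a rough sketch using NumPy; the tolerance for deciding that an imaginary part is numerically zero is an arbitrary choice of mine.

    import numpy as np

    # coeff[n] is the coefficient of x**n, matching the list built above
    p = np.polynomial.Polynomial(coeff)
    roots = p.roots()
    # count the roots whose imaginary part is numerically zero
    num_real = np.sum(np.abs(roots.imag) < 1e-8)
    print(num_real)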

[1] Alan Edelman and Eric Kostlan. How many zeros of a random polynomial are real? Bulletin of the AMS. Volume 32, Number 1, January 1995

Large matrices rarely have saddlepoints

A matrix is said to have a saddlepoint if an element is the smallest element in its row and the largest element in its column. For example, 0.546 is a saddlepoint in the matrix below because it is the smallest element in the third row and the largest element in the first column.

\begin{bmatrix} -0.850 & 1.529 & \phantom{-}0.896 \\ -1.878 & 1.233 & -0.200 \\ \phantom{-}0.546 & 0.762 & \phantom{-}1.763 \\ -0.449 & 3.136 & \phantom{-}1.345 \end{bmatrix}

In [1], Goldman considers the probability of an m by n matrix having a saddlepoint if its elements are chosen independently at random from the same continuous distribution. He reports this probability as

\frac{m!\,\,n!}{(m+n-1)!}

but it may be easier to interpret as

\left. (m+n) \middle/ {m+n \choose n} \right.

When m and n are large, the binomial coefficient in the denominator is very large.
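As a quick numerical check, not part of Goldman's note, here's a sketch verifying that the two expressions agree and showing how quickly the probability shrinks.

    from math import comb, factorial

    def p_saddlepoint(m, n):
        # Goldman's probability, computed two ways
        p1 = factorial(m) * factorial(n) / factorial(m + n - 1)
        p2 = (m + n) / comb(m + n, n)
        assert abs(p1 - p2) < 1e-12
        return p2

    print(p_saddlepoint(4, 3))    # 0.2
    print(p_saddlepoint(20, 20))  # about 2.9e-10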

For square matrices, m and n are equal, and the probability above is 2n/(n + 1) divided by the nth Catalan number Cn.

A couple years ago I wrote about bounds on the middle binomial coefficient

\frac{2^{2n-1}}{\sqrt{n}} < \binom{2n}{n} < \frac{2^{2n-1/2}}{\sqrt{n}}

by E. T. H. Wang. These bounds let us conclude that P(n, n), the probability of an n by n matrix having a saddlepoint, is bounded as follows.

\frac{n^{3/2}}{2^{2n - 3/2}} < P(n,n) < \frac{n^{3/2}}{2^{2n-2}}

Here’s a plot of P(n, n) and its bounds.

plot of saddlepoint probability and its bounds

This shows that the probability goes to zero exponentially as n increases; it is already visually indistinguishable from 0 in the plot above by n = 8.

The upper bound on P(n, n) based on Wang’s inequalities is tighter than the corresponding lower bound. This is because Wang’s lower bound on the middle binomial coefficient is tighter (see plot here) and we’ve taken reciprocals.
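Here's a minimal sketch, assuming matplotlib, of how such a plot could be reproduced; the range of n is an arbitrary choice on my part.

    import numpy as np
    import matplotlib.pyplot as plt
    from math import comb

    ns = np.arange(2, 11)
    prob  = [2*n / comb(2*n, n) for n in ns]
    lower = [n**1.5 / 2**(2*n - 1.5) for n in ns]
    upper = [n**1.5 / 2**(2*n - 2) for n in ns]

    plt.plot(ns, prob, "o-", label="P(n, n)")
    plt.plot(ns, lower, "--", label="lower bound")
    plt.plot(ns, upper, ":", label="upper bound")
    plt.xlabel("n")
    plt.legend()
    plt.show()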

Simulation

The theory above says that if we generate 4 by 3 matrices with independent normal random values, each matrix will have a saddlepoint with probability 1/5. The following code generates 10,000 random 4 by 3 matrices and counts what proportion have a saddlepoint.

    
    import numpy as np
    from scipy.stats import norm

    def has_saddlepoint(M):
        m, n = M.shape
        # column index of the smallest element in each row
        row_min_index = [np.argmin(M[i,:]) for i in range(m)]
        # row index of the largest element in each column
        col_max_index = [np.argmax(M[:,j]) for j in range(n)]
        for i in range(m):
            # the minimum of row i is a saddlepoint if it is also
            # the maximum of its column
            if col_max_index[row_min_index[i]] == i:
                return True
        return False

    m, n = 4, 3
    N = 10000
    saddlepoints = 0

    for _ in range(N):
        M = norm.rvs(size=(m,n))
        if has_saddlepoint(M):
            saddlepoints += 1
    print(saddlepoints / N)

When I ran this, I got 0.1997, very close to the expected value.

More on random matrices

[1] A. J. Goldman. The Probability of a Saddlepoint. The American Mathematical Monthly, Vol. 64, No. 10, pp. 729-730

Informative stopping

When the rule for stopping an experiment depends on the data in the experiment, the results could be biased if the stopping rule isn’t taken into account in the analysis [1].

For example, suppose Alice wants to convince Bob that π has a greater proportion of even digits than odd digits.

Alice: I’ll show you that π has more even digits than odd digits by looking at the first N digits. How big would you like N to be?

Bob: At least 1,000. Of course more data is always better.

Alice: Right. And how many more even than odd digits would you find convincing?

Bob: If there are at least 10 more evens than odds, I’ll believe you.

Alice: OK. If you look at the first 2589 digits, there are 13 more even digits than odd digits.

Now if Alice wanted to convince Bob that there are more odd digits, she could do that too. If you look at the first 2077 digits, 13 more are odd than even.

No matter what two numbers Bob gives, Alice can find a sample size that will give the result she wants. Here’s Alice’s Python code.

    from mpmath import mp
    import numpy as np

    N = 3000 
    mp.dps = N+2
    digits = str(mp.pi)[2:]

    parity = np.ones(N, dtype=int)
    for i in range(N):
        if digits[i] in ['1', '3', '5', '7', '9']:
            parity[i] = -1
    excess = parity.cumsum()
    print(excess[-1])
    print(np.where(excess == 13))
    print(np.where(excess == -13))

The number N is a guess at how far out she might have to look. If it doesn’t work, she increases it and runs the code again.

The array parity contains a 1 in positions where the digits of π (after the decimal point) are even and a -1 where they are odd. The cumulative sum shows how many more even than odd digits there have been up to a given point, a negative number meaning there have been more odd digits.

Alice thought that stopping when there are exactly 10 more of the parity she wants would look suspicious, so she looked for places where the difference was 13.

Here are the results:

    [ 126,  128,  134,  …,  536, 2588, … 2726]
    [ 772,  778,  780,  …,  886, 2076, … 2994]

There’s one minor gotcha. The array excess is indexed from zero, so Alice reports 2589 rather than 2588 because the 2589th digit has index 2588.

Bob’s mistake was that he specified a minimum sample size. By saying “at least 1,000” he gave Alice the freedom to pick the sample size to get the result she wanted. If he had specified an exact sample size, there probably would be either more even digits or more odd digits, but there couldn’t be both. And if he were sophisticated enough, he could pick an excess value that would be unlikely given that sample size.
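For example, with the arrays computed above, checking an exact sample size of 1,000 digits is a one-liner, and whatever its sign, Alice doesn't get to choose it.

    # net excess of even over odd digits among exactly the first 1000 digits
    print(excess[999])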

Related posts

[1] This does not contradict the likelihood principle; it says that informative stopping rules should be incorporated into the likelihood function.

Random Fibonacci numbers

The Fibonacci numbers are defined by F1 = F2 = 1, and for n > 2,

F_n = F_{n-1} + F_{n-2}

A random Fibonacci sequence f is defined similarly, except the addition above is replaced with a subtraction with probability 1/2. That is, f1 = f2 = 1, and for n > 2,

f_n = f_{n-1} + b\, f_{n-2}

where b is +1 or -1, each with equal probability.

Here’s a graph of three random Fibonacci sequences.

Graph of random Fibonacci sequences

Here’s the Python code that was used to produce the sequences above.

    import numpy as np

    def rand_fib(length):
        # f[0] = f[1] = 1; after that, add or subtract with equal probability
        f = np.ones(length)
        for i in range(2, length):
            b = np.random.choice((-1,1))
            f[i] = f[i-1] + b*f[i-2]
        return f
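Here's a sketch, assuming matplotlib, of how a graph like the one above might be produced; the sequence length of 80 is an arbitrary choice.

    import matplotlib.pyplot as plt

    for _ in range(3):
        plt.plot(rand_fib(80))
    plt.xlabel("n")
    plt.show()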

It’s easy to see that the nth random Fibonacci number can be as large as the nth ordinary Fibonacci number if all the signs happen to be positive. But the numbers are typically much smaller.

The nth (ordinary) Fibonacci number is asymptotically φⁿ/√5, where φ is the golden ratio, φ = (1 + √5)/2 = 1.618…

Another way to say this is that

\lim_{n\to\infty}(F_n)^{1/n} = \varphi

The nth random Fibonacci number does not have an asymptotic value—it wanders randomly between positive and negative values—but with probability 1, the absolute values satisfy

\lim_{n\to\infty}|f_n|^{1/n} = 1.1319882\ldots

The existence of this limit was proved in 1960 [1]; the numerical value of the limit was computed much later by Divakar Viswanath.

Here’s a little Python code to check by simulation that we get results consistent with this limit.

    N = 500
    x = [abs(rand_fib(N)[-1])**(1/N) for _ in range(10)]
    print(f"{np.mean(x)} ± {np.std(x)}")

This produced

1.1323 ± 0.0192

which includes the theoretical value 1.1320.

Update: The next post looks at whether every integer appears in a random Fibonacci sequence. Empirical evidence suggests the answer is yes.

Related posts

[1] Furstenberg and Kesten. Products of random matrices. Ann. Math. Stat. 31, 457-469.

Real-time analytics

There’s an ancient saying “Whom the gods would destroy they first make mad.” (Mad as in crazy, not mad as in angry.) I wrote a variation of this on Twitter:

Whom the gods would destroy, they first give real-time analytics.

Having more up-to-date information is only valuable up to a point. Past that point, you’re more likely to be distracted by noise. The closer you look at anything, the more irregularities you see, and the more likely you are to over-steer [1].

I don’t mean to imply that the noise isn’t real. (More on that here.) But there’s a temptation to pay more attention to the small variations you don’t understand than the larger trends you believe you do understand.

I became aware of this effect when simulating Bayesian clinical trial designs. The more often you check your stopping rule, the more often you will stop [2]. You want to monitor a trial often enough to shut it down, or at least pause it, if things change for the worse. But monitoring too often can cause you to stop when you don’t want to.

Flatter than glass

A long time ago I wrote about the graph below.

The graph looks awfully jagged, until you look at the vertical scale. The curve represents the numerical difference between two functions that are exactly equal in theory. As I explain in that post, the curve is literally smoother than glass, and certainly flatter than a pancake.

Notes

[1] See The Logic of Failure for a discussion of how over-steering is a common factor in disasters such as the Chernobyl nuclear failure.

[2] Bayesians are loath to talk about things like α-spending, but when you’re looking at stopping frequencies, frequentist phenomena pop up.

Sample size calculation

If you’re going to run a test on rabbits, you have to decide how many rabbits you’ll use. This is your sample size. A lot of what statisticians do in practice is calculate sample sizes.

A researcher comes to talk to a statistician. The statistician asks what effect size the researcher wants to detect. Do you think the new thing will be 10% better than the old thing? If so, you’ll need to design an experiment with enough subjects to stand a good chance of detecting a 10% improvement. Roughly speaking, sample size is inversely proportional to the square of effect size. So if you want to detect a 5% improvement, you’ll need 4 times as many subjects as if you want to detect a 10% improvement.
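Here's a minimal sketch of that relationship, not part of the original discussion, using the standard normal approximation for comparing two means. The two-sided 0.05 significance level and 80% power are illustrative assumptions, and the "improvements" are expressed as differences in units of standard deviations.

    from scipy.stats import norm

    def n_per_group(delta, sigma=1.0, alpha=0.05, power=0.80):
        # Approximate sample size per group for a two-sample z-test
        # detecting a difference in means of size delta.
        z_alpha = norm.ppf(1 - alpha/2)
        z_beta  = norm.ppf(power)
        return 2 * (z_alpha + z_beta)**2 * sigma**2 / delta**2

    print(n_per_group(0.10))  # about 1570 per group
    print(n_per_group(0.05))  # about 6280, roughly 4 times as many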

You’re never guaranteed to detect an improvement. The race is not always to the swift, nor the battle to the strong. So it’s not enough to think about what kind of effect size you want to detect, you also have to think about how likely you want to be to detect it.

Here’s what often happens in practice. The researcher makes an arbitrary guess at what effect size she expects to see. Then initial optimism may waver and she decides it would be better to design the experiment to detect a more modest effect size. When asked how high she’d like her chances to be of detecting the effect, she thinks 100% but says 95% since it’s necessary to tolerate some chance of failure.

The statistician comes back and says the researcher will need a gargantuan sample size. The researcher says this is far outside her budget. The statistician asks what the budget is, and what the cost per subject is, and then the real work begins.

The sample size the negotiation will converge on is the budget divided by the cost per sample. The statistician will fiddle with the effect size and probability of detecting it until the inevitable sample size is reached. This sample size, calculated to 10 decimal places and rounded up to the next integer, is solemnly reported with a post hoc justification containing no mention of budgets.

Sample size is always implicitly an economic decision. If you’re willing to make it explicitly an economic decision, you can compute the expected value of an experiment by placing a value on the possible outcomes. You make some assumptions—you always have to make assumptions—and calculate the probability under various scenarios of reaching each conclusion for various sample sizes, and select the sample size that leads to the best expected value.

More on experimental design

[1] There are three ways an A/B test can turn out: A wins, B wins, or there isn’t a clear winner. There’s a tendency to not think enough about the third possibility. Interim analysis often shuts down an experiment not because there’s a clear winner, but because it’s becoming clear there is unlikely to be a winner.

Truncated distributions vs clipped distributions

In the previous post, I looked at truncated probability distributions. A truncated normal distribution, for example, lives on some interval and has a density proportional to a normal distribution; the proportionality constant is whatever it has to be to make the density integrate to 1.

Truncated distributions

Suppose you wanted to generate random samples from a normal distribution truncated to [a, b]. How would you go about it? You could generate a sample x from the (full) normal distribution, and if a ≤ x ≤ b, then return x. Otherwise, try again. Keep trying until you get a sample inside [a, b], and when you get one, return it [1].

This may be the most efficient way to generate samples from a truncated distribution. It’s called the accept-reject method. It works well if the probability of a normal sample being in [a, b] is high. But if the interval is small, a lot of samples may be rejected before one is accepted, and there are more efficient algorithms. You might also use a different algorithm if you want consistent run time more than minimal average run time.
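Here's a minimal sketch of the accept-reject approach just described, using NumPy; for a narrow interval something like scipy.stats.truncnorm would likely be a more efficient choice.

    import numpy as np

    rng = np.random.default_rng()

    def truncated_normal_sample(a, b):
        # draw from the full standard normal until a draw lands in [a, b]
        while True:
            x = rng.standard_normal()
            if a <= x <= b:
                return x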

Clipped distributions

Now consider a different kind of sampling. As before, we generate a random sample x from the full normal distribution and if a ≤ x ≤ b we return it. But if x > b we return b. And if x < a we return a. That is, we clip x to be in [a, b]. I’ll call this a clipped normal distribution. (I don’t know whether this name is standard, or even if there is a standard name for this.)

Clipped distributions are not truncated distributions. They’re really a mixture of three distributions. Start with a continuous random variable X on the real line, and let Y be X clipped to [a, b]. Then Y is a mixture of a continuous distribution on [a, b] and two point masses: Y takes on the value a with probability P(X < a) and Y takes on the value b with probability P(X > b).
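A short illustration of those point masses, my own sketch rather than anything from the post:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng()
    a, b = -1, 1

    x = rng.standard_normal(100_000)
    y = np.clip(x, a, b)   # clipped sample

    # proportion clipped at the left endpoint vs. P(X < a)
    print(np.mean(y == a), norm.cdf(a))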

Clipped distributions are mostly a nuisance. Maybe the values are clipped due to the resolution of some sensor: nobody wants to clip the values, that’s just a necessary artifact. Or maybe values are clipped because some form only allows values within a given range; maybe the person designing the form didn’t realize the true range of possible values.

It’s common to clip extreme data values while deidentifying data to protect privacy. This means that not only is some data deliberately inaccurate, it also means that the distribution changes slightly since it’s now a clipped distribution. If that’s unacceptable, there are other ways to balance privacy and statistical utility.

Related posts

[1] Couldn’t this take forever? Theoretically, but with probability zero. The number of attempts follows a geometric distribution, and you could use that to find the expected number of attempts.

The truncated Cauchy paradox

The Cauchy distribution is the probability distribution on the real line with density proportional to 1/(1 + x²). It comes up often as an example of a fat-tailed distribution, one that can wreak havoc on intuition and on applications. It has no mean and no variance.

The truncated Cauchy distribution has the same density, with a different proportionality constant, over a finite interval [a, b]. It’s a well-behaved distribution, with a mean, variance, and any other moment you might care about.

Here’s a sort of paradox. The Cauchy distribution is pathological, but the truncated Cauchy distribution is perfectly well behaved. But if a is a big negative number and b is a big positive number, surely the Cauchy distribution on [a, b] is sorta like the full Cauchy distribution.

I will argue here that in some sense the truncated Cauchy distribution isn’t so well behaved after all, and that a Cauchy distribution truncated to a large interval is indeed like a Cauchy distribution.

You can show that the variance of the truncated Cauchy distribution over a large interval is approximately proportional to the length of the interval. So if you want to get around the problems of a Cauchy distribution by truncating it to live on a large interval, you’ve got a problem: it matters very much how you truncate it. For example, you might think “It doesn’t matter, make it [-100, 100]. Or why not [-1000, 1000] just to be sure it’s big enough.” But the variances of the two truncations differ by a factor of 10.

Textbooks usually say that the Cauchy distribution has no variance. It’s more instructive to say that it has infinite variance. And as the length of the interval you truncate the Cauchy to approaches infinity, so does its variance.

The mean of the Cauchy distribution does not exist. It fails to exist in a different way than the variance. The integral defining the variance of a Cauchy distribution diverges to +∞. But the integral defining the mean of the Cauchy simply does not exist. You can get different values of the integral depending on how you let the end points go off to infinity. In fact, you could get any value you want by specifying a particular way for the truncation interval to grow to the real line.

So once again you can’t simply say “Just truncate it to a big interval” because the mean depends on how you choose your interval.

If we were working with a thin-tailed distribution, like a normal, and truncating it to a big interval, the choice of interval would make little difference. Whether you truncate a normal distribution to [-10, 5], [-33, 42], or [-2000, 1000] makes almost no difference to the mean or the variance. But for the Cauchy distribution, it makes a substantial difference.

***

I’ve experimented with something new in this post: it’s deliberately short on details. Lots of words, no equations. Everything in this post can be made precise, and you may consider it an exercise to make everything precise if you’re so inclined.

By leaving out the details, I hope to focus attention on the philosophical points of the post. I expect this will go over well with people who don’t want to see the details, and with people who can easily supply the details, but maybe not with people in between.

Expected length of longest common DNA substrings

If we have two unrelated sequences of DNA, how long would we expect the longest common substring of the two sequences to be? Since DNA sequences come from a very small alphabet, just four letters, any two sequences of moderate length will have some common substring, but how long should we expect it to be?

Motivating problem

There is an ongoing controversy over whether the SARS-CoV-2 virus contains segments of the HIV virus. The virologist who won the Nobel Prize for discovering the HIV virus, Luc Montagnier, claims it does, and other experts claim it does not.

I’m not qualified to have an opinion on this dispute, and do not wish to discuss the dispute per se. But it brought to mind a problem I’d like to say something about.

This post will look at a simplified version of the problem posed by the controversy over the viruses that cause COVID-19 and AIDS. Namely, given two random DNA sequences, how long would you expect their longest common substring to be?

I’m not taking any biology into account here other than using an alphabet of four letters. And I’m only looking at exact substring matches, not allowing for any insertions or deletions.

Substrings and subsequences

There are two similar problems in computer science that could be confused: the longest common substring problem and the longest common subsequence problem. The former looks for uninterrupted matches and the latter allows for interruptions in matches in either input.

For example, “banana” is a subsequence of “bandana” but not a substring. The longest common substring of “banana” and “anastasia” is “ana” but the longest common subsequence is “anaa.” Note that the latter is not a substring of either word, but a subsequence of both.

Here we are concerned with substrings, not subsequences. Regarding the biological problem of comparing DNA sequences, it would seem that a substring is a strong form of similarity, but maybe too demanding since it allows no insertions. Also, the computer science idea of a subsequence might be too general for biology since it allows arbitrary insertions. A biologically meaningful comparison would be somewhere in between.

Simulation results

I will state theoretical results in the next section, but I’d like to begin with a simulation.

First, some code to generate a DNA sequence of length N.

    from numpy.random import choice

    def dna_sequence(N):
        return "".join(choice(list("ACGT"), N))

Next, we need a way to find the longest common substring. We’ll illustrate how SequenceMatcher works before using it in a simulation.

    from difflib import SequenceMatcher

    N = 12
    seq1 = dna_sequence(N)
    seq2 = dna_sequence(N)   
 
    # "None" tells it to not consider any characters junk
    s = SequenceMatcher(None, seq1, seq2)

    print(seq1)
    print(seq2)
    print(s.find_longest_match(0, N, 0, N))

This produced

    TTGGTATCACGG
    GATCACAACTAC
    Match(a=5, b=1, size=5)

meaning that the first match of maximum length, ATCAC in this case, starts in position 5 of the first string and position 1 of the second string (indexing from 0).

Now let’s run a simulation to estimate the expected length of the longest common substring in two sequences of DNA of length 300. We’ll do 1000 replications of the simulation to compute our average.

    def max_common_substr_len(a, b):
        # isjunk=None and autojunk=False turn off the junk heuristics
        s = SequenceMatcher(None, a, b, autojunk=False)
        m = s.find_longest_match(0, len(a), 0, len(b))
        return m.size

    seqlen = 300
    numreps = 1000
    avg = 0
    for _ in range(numreps):
        a = dna_sequence(seqlen)
        b = dna_sequence(seqlen)
        avg += max_common_substr_len(a, b)
    avg /= numreps
    print(avg)

When I ran this code I got 7.951 as the average. Here’s a histogram of the results.

Expected substring length

Arratia and Waterman [1] published a paper in 1985 looking at the longest common substring problem for DNA sequences, assuming each of the four letters in the DNA alphabet are uniformly and independently distributed.

They show that for strings of length N coming from an alphabet of size b, the expected length of the longest match is asymptotically

2 \log_b N

This says that in the simulation above we should expect results around 2 log4(300) = 8.229.
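As a quick check of that arithmetic:

    from math import log

    N, b = 300, 4
    print(2 * log(N, b))   # 8.2288...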

There could be a couple of reasons our simulation results were smaller than the theoretical value. It could be a simulation artifact, but that’s unlikely: I repeated the simulation several times and consistently got results below the theoretical expected value. More likely, Arratia and Waterman’s result, although exact in the limit, is biased upward for finite N.

Update: The graph below plots simulation results for varying sequence sizes compared to the asymptotic value, the former always below the latter.

Simulation vs asymptotic estimate

Arratia and Waterman first assume i.i.d. probabilities to derive the result above, then prove a similar theorem under the more general assumption that the sequences are Markov chains with possibly unequal transition probabilities. The fact that actual DNA sequences are not exactly uniformly distributed and not exactly independent doesn’t make much difference to their result.

Arratia and Waterman also show that allowing for a finite number of deletions, or a number of deletions that grows sufficiently slowly as a function of n, does not change their asymptotic result. Clearly allowing deletions makes a difference for finite n, but the difference goes away in the limit.

Related posts

[1] Richard Arratia and Michael S. Waterman. An Erdős-Rényi Law with Shifts. Advances in Mathematics 55, 13-23 (1985).

 

Underestimating risk


When I hear that a system has a one in a trillion (1,000,000,000,000) chance of failure, I immediately translate that in my mind to “So, optimistically the system has a one in a million (1,000,000) chance of failure.”

Extremely small probabilities are suspicious because they often come from one of two errors:

  1. Wrongful assumption of independence
  2. A lack of imagination

Wrongfully assuming independence

The Sally Clark case is an infamous example of a woman’s life being ruined by a wrongful assumption of independence. She had two children die of what we would call in the US sudden infant death syndrome (SIDS) and what was called in her native UK “cot death.”

The courts reasoned that the probability of two children dying of SIDS was the square of the probability of one child dying of SIDS. The result, about one chance in 73,000,000, was deemed to be impossibly small, and Sally Clark was sent to prison for murder. She was released from prison years later, and drank herself to death.

Children born to the same parents and living in the same environment hardly have independent risks of anything. If the probability of losing one child to SIDS is 1/8500, the conditional probability of losing a sibling may be small, but surely not as small as 1/8500.

The Sally Clark case only assumed two events were independent. By naively assuming several events are independent, you can start with larger individual probabilities and end up with much smaller final probabilities.

As a rule of thumb, if a probability uses a number name you’re not familiar with (such as septillion below) then there’s reason to be skeptical.

Lack of imagination

It is possible to legitimately calculate extremely small probabilities, but often this is by calculating the probability of the wrong thing, by defining the problem too narrowly. If a casino owner believes that the biggest risk to his business is dealing consecutive royal flushes, he’s not considering the risk of a tiger mauling a performer.

A classic example of a lack of imagination comes from cryptography. Amateurs who design encryption systems assume that an attacker must approach a system the same way they do. For example, suppose I create a simple substitution cipher by randomly permuting the letters of the English alphabet. There are over 400 septillion (4 × 10²⁶) permutations of 26 letters, and so the chances of an attacker guessing my particular permutation are unfathomably small. And yet simple substitution ciphers are so easy to break that they’re included in popular books of puzzles. Cryptogram solvers do not randomly permute the alphabet until something works.

Professional cryptographers are not nearly so naive, but they have to constantly be on guard for more subtle forms of the same fallacy. If you create a cryptosystem by multiplying large prime numbers together, it may appear that an attacker would have to factor that product. That’s the idea behind RSA encryption. But in practice there are many cases where this is not necessary due to poor implementation. Here are three examples.

If the calculated probability of failure is infinitesimally small, the calculation may be correct but only address one possible failure mode, and not the most likely failure mode at that.

More posts on risk management