One-liner to troubleshoot LaTeX references

In LaTeX, sections are labeled with commands like \label{foo} and referenced like \ref{foo}. Referring to sections by labels rather than hard-coded numbers allows references to automatically update when sections are inserted, deleted, or rearranged.

For every reference there ought to be a label. A label without a corresponding reference is fine, though it might be a mistake. If you have a reference with no corresponding label, and one label without a reference, there’s a good chance the reference is a typo variation on the unreferenced label.

We’ll build up a one-liner for comparing labels and references. We’ll use grep to find patterns that look like labels by searching for label{ followed by any string of characters up to but not including a closing brace. We don’t want the label{ part, just what follows it, so we’ll use look-behind syntax to exclude it from the match.

Here’s our regular expression: (?<=label{)[^}]+


We’re using Perl-style look-behind syntax, so we’ll need to give grep the -P option. Also, we only want the match itself, not matching lines, so we’ll also use the -o option. This will print all the labels:

    grep -oP '(?<=label{)[^}]+' foo.tex

The regex for finding references is the same with label replaced with ref.

To compare the list of labels and the list of references, we’ll use the comm command. For more on comm, see Set theory at the command line.

We could save the labels to a file, save the references to a file, and run comm on the two files. But we’re more interested in the differences between the two lists than the two lists, so we could pass both as streams to comm using the <(...) syntax. Finally, comm assumes its inputs are sorted so we pipe the output of both grep commands to sort.

Here’s our one-liner:

    comm <(grep -oP '(?<=label{)[^}]+' foo.tex | sort) \
         <(grep -oP '(?<=ref{)[^}]+' foo.tex | sort)

This will produce three sections of output: labels which are not references, references which are not labels, and labels that are also references.

If you just want to see references that don’t refer to a label, give comm the option -13. This suppresses the first and third sections of output, leaving only the second section, references that are not labels.

You can also add a -u option (u for unique) to the calls to sort to suppress multiple instances of the same label or same reference.
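The same comparison can be sketched in Python, which may help on systems where grep -P or process substitution isn’t available. This is only a minimal sketch: the function name compare_labels_refs and the sample text are made up for illustration, and the regular expressions are the same look-behind patterns used with grep above.

```python
import re

# Same look-behind patterns as in the grep commands.
LABEL_RE = re.compile(r"(?<=label\{)[^}]+")
REF_RE = re.compile(r"(?<=ref\{)[^}]+")

def compare_labels_refs(tex):
    """Return (labels only, refs only, both) as sorted lists,
    mirroring the three sections of comm output."""
    labels = set(LABEL_RE.findall(tex))
    refs = set(REF_RE.findall(tex))
    return sorted(labels - refs), sorted(refs - labels), sorted(labels & refs)

# Hypothetical sample document containing one misspelled reference.
sample = r"""
\section{Introduction}\label{sec:intro}
See Section \ref{sec:intro} and Section \ref{sec:metods}.
\section{Methods}\label{sec:methods}
"""

only_labels, only_refs, both = compare_labels_refs(sample)
print("labels never referenced:", only_labels)
print("references without a label:", only_refs)
print("labels that are referenced:", both)
```

Using sets has the same effect as passing -u to sort: repeated labels or references are reported once.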

A “well-known” series

I was reading an article [1] that refers to “a well-known trigonometric series” that I’d never seen before. This paper cites [2] which gives the series as

\begin{align*} \frac{\sin m\phi}{\cos \phi} &= m\sin\phi - \frac{m(m^2-2^2)}{3!}\sin^3\phi \\ &\phantom{=} \;+ \frac{m(m^2-2^2)(m^2 - 4^2)}{5!}\sin^5\phi - \cdots \end{align*}

Note that the right hand side is not a series in φ but rather in sin φ.


Why might you know sin φ and want to calculate sin mφ / cos φ? This doesn’t seem like a sufficiently common task for the series to be well-known. The references are over a century old, and maybe the series were useful in hand calculations in a way that isn’t necessary anymore.

However, [1] was using the series for a theoretical derivation, not for calculation; the author was doing some hand-wavy derivation, sticking the difference operator E into a series as if it were a number, a technique known as “umbral calculus.” The name comes from the Latin word umbra for shadow. The name referred to the “shadowy” nature of the technique which wasn’t made rigorous until much later.


The series above terminates if m is an even integer. But there are no restrictions on m, and in general the series is infinite.

The series obviously has trouble if cos φ = 0, i.e. when φ = ±π/2, but it converges for all m if −π/2 < φ < π/2.


If m = 1, sin mφ / cos φ is simply tan φ. The function tan φ has a complicated power series in φ involving Bernoulli numbers, but it has a simpler power series in sin φ.
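A quick numerical check of the series is easy to write. The sketch below sums the series term by term, using the ratio between consecutive terms read off from the formula above; the function name series and the test values of m and φ are arbitrary choices for illustration.

```python
from math import sin, cos, tan

def series(m, phi, terms=60):
    """Partial sum of the series for sin(m*phi)/cos(phi) in powers of sin(phi)."""
    s = sin(phi)
    term = m * s              # first term: m sin(phi)
    total = term
    for k in range(1, terms):
        # Ratio of consecutive terms: -(m^2 - (2k)^2) s^2 / ((2k)(2k+1))
        term *= -(m**2 - (2 * k)**2) * s**2 / ((2 * k) * (2 * k + 1))
        total += term
    return total

# m = 1 should reproduce tan(phi)
print(series(1, 0.5), tan(0.5))

# Non-integer m: compare with sin(m*phi)/cos(phi) directly
print(series(2.5, 0.3), sin(2.5 * 0.3) / cos(0.3))

# Even integer m: the series terminates, here after two terms
print(series(4, 1.0), sin(4 * 1.0) / cos(1.0))
```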


[1] G. J. Lidstone. Notes on Everett’s Interpolation Formula. 1922

[2] E. W. Hobson. A Treatise on Plane Trigonometry. Fourth Edition, 1918. Page 276.

Probability, cryptography, and naïveté

Probability and cryptography have this in common: really smart people can be confidently wrong about both.

I wrote years ago about how striking it was to see two senior professors arguing over an undergraduate probability exercise. As I commented in that post, “Professors might forget how to do a calculus problem, or make a mistake in a calculation, but you wouldn’t see two professors defending incompatible solutions.”

Not only do smart people often get probability wrong, they can be very confident while doing so. The same applies to cryptography.

I recently learned of a cipher J. E. Littlewood invented that he believed was unbreakable. His idea was essentially a stream cipher, simulating a one-time pad by using a pseudorandom number generator. He assumed that since a one-time pad is secure, his simulacrum of a one-time pad would be secure. But it was not, for reasons explained in this paper.

Littlewood was a brilliant mathematician, but he was naive, and even arrogant, about cryptography. Here’s the opening to the paper in which he explained his method.

The legend that every cipher is breakable is of course absurd, though still widespread among people who should know better. I give a sufficient example …

He seems to be saying “Here’s a little example off the top of my head that shows how easy it is to create an unbreakable cipher.” He was the one who should have known better.


Thinking by playing around

Richard Feynman’s Nobel Prize winning discoveries in quantum electrodynamics were partly inspired by his randomly observing a spinning dinner plate one day in the cafeteria. Paul Feyerabend said regarding scientific discovery, “The only principle that does not inhibit progress is: anything goes” (within relevant ethical constraints, of course).

Ideas can come from anywhere, including physical play. Various books can improve creative discovery skills, like George Pólya’s How to Solve It, Isaac Watts’ Improvement of the Mind, W. J. J. Gordon’s Synectics, and methodologies like mind mapping and C-K theory, to name a few. Many software products present themselves as shiny new tools promising help. However, we are not just disembodied minds interacting with a computer, but integrated beings whose reasoning is bound up with memories, life history, emotions, and multisensory input and interaction with the world. The tactile is certainly a key avenue of learning, discovering, understanding.

Fidget toys are popular. Different kinds of toys have different semiotics with respect to how they interplay with our imaginations. Like Legos. Structured, like Le Corbusier-style architecture, or multidimensional arrays or tensors, or the snapping together of many software components with well-defined interfaces, with regular scaling from the one to many. Or to take a much older example, Tinkertoys—the analogy of the graph, interconnectedness, semi-structured but composable, like DNA or protein chains, or complex interrelated biological processes, or neuronal connections, or the wild variety within order of human language.

As creative workers, we seek ideas from any and every place to help us in what we do. The tactile, the physical, is a vital place to look.

Approximation by prime powers

The well-known Weierstrass approximation theorem says that polynomials are dense in C [0, 1]. That is, given any continuous function f on the unit interval, and any ε > 0, you can find a polynomial P such that f and P are never more than ε apart.

This means that linear combinations of the polynomials

1, x, x², x³, …

are dense in C [0, 1].

Do you need all these powers of x? Could you approximate any continuous function arbitrarily well if you left out one of these powers, say x⁷? Yes, you could.

You cannot omit the constant polynomial 1, but you can leave out any other power of x. In fact, you can leave out a lot of powers of x, as long as the sequence of exponents doesn’t thin out too quickly.

Müntz approximation theorem

Herman Müntz proved in 1914 that a necessary and sufficient pair of conditions on the exponents of x is that the first exponent is 0 and that the sum of the reciprocals of the exponents diverges.

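In other words, the sequence of powers of x

x^{\lambda_0}, x^{\lambda_1}, x^{\lambda_2}, \ldots

with

\lambda_0 < \lambda_1 < \lambda_2 < \cdots

is dense in C [0, 1] if and only if λ₀ = 0 and

\frac{1}{\lambda_1} + \frac{1}{\lambda_2} + \frac{1}{\lambda_3} + \cdots = \infty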

Prime power example

Euler proved in 1737 that the sum of the reciprocals of the primes diverges, so the sequence

1, x², x³, x⁵, x⁷, x¹¹, …

is dense in C [0, 1]. We can find a polynomial as close as we like to any particular continuous function if we combine enough prime powers.

Let’s see how well we can approximate |x − ½| using prime exponents up to 11.

The approximating polynomial is

0.4605 − 5.233 x² + 7.211 x³ + 0.9295 x⁵ − 4.4646 x⁷ + 1.614 x¹¹.

This polynomial is not the best possible uniform approximation: it’s the least squares approximation. That is, it minimizes the 2-norm and not the ∞-norm. That’s because it’s convenient to do a least squares fit in Python using scipy.optimize.curve_fit.

Incidentally, the Müntz approximation theorem holds for the 2-norm as well.
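Here is a minimal sketch of such a least squares fit. It is not the code behind the polynomial above: it swaps in numpy.linalg.lstsq on a uniform grid of sample points rather than scipy.optimize.curve_fit, and the grid size is an arbitrary assumption, so the coefficients it prints need not match the ones quoted to every digit.

```python
import numpy as np

# Exponent 0 (the constant term) plus the primes up to 11
exponents = [0, 2, 3, 5, 7, 11]

# Sample points on [0, 1]; the number of points is an arbitrary choice
x = np.linspace(0, 1, 1001)
target = np.abs(x - 0.5)

# Design matrix whose columns are x**k for each allowed exponent
A = np.column_stack([x**k for k in exponents])

# Least squares solution: minimizes the 2-norm of A @ coeffs - target
coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)

fit = A @ coeffs
max_err = np.max(np.abs(fit - target))
rms_err = np.sqrt(np.mean((fit - target)**2))

for k, c in zip(exponents, coeffs):
    print(f"coefficient of x^{k}: {c:.4f}")
print("max error on grid:", max_err)
print("RMS error on grid:", rms_err)
```

Adding more primes to the exponents list gives a closer fit, though the monomial basis becomes increasingly ill-conditioned.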


Logarithm approximation curiosity

I’ve written before about three simple approximations for logarithms, for base 10

log₁₀(x) ≈ (x − 1)/(x + 1)

base e,

logₑ(x) ≈ 2(x − 1)/(x + 1)

and base 2

log₂(x) ≈ 3(x − 1)/(x + 1).

These can be used to mentally approximate logarithms to moderate accuracy, accurate enough for quick estimates.

Here’s what’s curious about the approximations: the proportionality constants are apparently wrong, and yet the approximations are each fairly accurate.

It is not the case that

logₑ(x) = 2 log₁₀(x).

In fact,

logₑ(x) = logₑ(10) log₁₀(x) ≈ 2.3 log₁₀(x)

and so it seems that the approximation for natural logarithms should be off by 15%. But it’s not. The error is less than 2.5%.

Similarly,

log₂(x) = log₂(10) log₁₀(x) ≈ 3.32 log₁₀(x)

and so the approximation for logarithms base 2 should be off by about 10%. But it’s not. The error here is also less than 2.5%.

What’s going on?

First of all, the approximation errors are nonlinear functions of x and the three approximation errors are not proportional. Second, the approximation for log_b(x) is only good for 1/√b ≤ x ≤ √b. You can always reduce the problem of calculating log_b(x) to the problem of calculating the log in the range 1/√b ≤ x ≤ √b and so this isn’t a problem.

Here’s a plot of the three error functions.

This plot makes it appear that the approximation error is much worse for natural logs and logs base 2 than for logs base 10. And it would be if we ignored the range of each approximation. Here’s another plot of the approximation errors, plotting each over only its valid range.

When restricted to their valid ranges, the approximations for logarithms base e and base 2 are more accurate than the approximation for logarithms base 10. Both errors are small, but in opposite directions.

Here’s a look at the relative approximation errors.

We can see that the relative errors for the base 2 and base e approximations are small, no more than about 4% and 2% respectively, while the relative error for base 10 can be up to 15%.
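Since the plots are not reproduced here, the following sketch computes the maximum absolute and relative errors of each approximation over its valid range 1/√b ≤ x ≤ √b. The function name approx_errors and the grid size are illustrative assumptions, not taken from the original plotting code.

```python
import numpy as np

def approx_errors(base, coeff, n=2001):
    """Max absolute and relative error of coeff*(x-1)/(x+1) as an
    approximation to log base `base` on [1/sqrt(base), sqrt(base)]."""
    x = np.linspace(1 / np.sqrt(base), np.sqrt(base), n)
    x = x[x != 1.0]          # relative error is undefined where log(x) = 0
    exact = np.log(x) / np.log(base)
    approx = coeff * (x - 1) / (x + 1)
    abs_err = np.max(np.abs(approx - exact))
    rel_err = np.max(np.abs(approx / exact - 1))
    return abs_err, rel_err

# (base, proportionality constant) for each approximation
cases = {"base 10": (10, 1), "base e": (np.e, 2), "base 2": (2, 3)}

results = {name: approx_errors(b, c) for name, (b, c) in cases.items()}
for name, (abs_err, rel_err) in results.items():
    print(f"{name}: max abs error {abs_err:.4f}, max rel error {rel_err:.4f}")
```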


Iterated Mersenne primes

A Mersenne number is a number of the form 2^k − 1. A Mersenne prime is a Mersenne number which is also a prime.

It turns out that if 2^k − 1 is prime then k must be prime, so Mersenne primes have the form 2^p − 1 where p is prime. What about the converse? If p is prime, is 2^p − 1 also prime? No, because, for example, 2^11 − 1 = 2047 = 23 × 89.

If p is not just a prime but a Mersenne prime, then is 2^p − 1 a prime? Sometimes, but not always. The first counterexample is p = 8191.

There is an interesting chain of iterated Mersenne primes:

\begin{align*} M_1 &= 2^2 - 1 = 3 \\ M_2 &= 2^{M_1} - 1 = 7 \\ M_4 &= 2^{M_2} - 1 = 127 \\ M_{12} &= 2^{M_4} - 1 = 2^{127} - 1 \end{align*}

This raises the question of whether m = 2^M₁₂ − 1 is prime. Direct testing using available methods is completely out of the question. The only way we’ll ever know is if there is some theoretical result that settles the question.
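The smaller links in the chain are easy to check by computer. The sketch below uses the Lucas-Lehmer test, a standard primality test for Mersenne numbers that this post does not otherwise discuss; the function name lucas_lehmer is just an illustrative choice. It confirms the chain 3, 7, 127, 2^127 − 1 and the two composite examples mentioned above.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime. Assumes p itself is prime."""
    if p == 2:
        return True          # 2**2 - 1 = 3 is prime
    m = 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# Exponents along the chain: each 2**p - 1 is the next exponent
for p in (2, 3, 7, 127):
    print(p, lucas_lehmer(p))

# Prime exponents that do not give Mersenne primes
print(11, lucas_lehmer(11))
print(8191, lucas_lehmer(8191))
```

The same test applied to the exponent 2^127 − 1 would need about 2^127 squarings of numbers with roughly 2^127 bits each, which is why direct testing is hopeless.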

Here’s an easier question. Suppose m is prime. Where would it fall on the list of Mersenne primes if conjectures about the distribution of Mersenne primes are true?

This post reports

It has been conjectured that as x increases, the number of primes p ≤ x such that 2^p − 1 is also prime is asymptotically

e^γ log x / log 2

where γ is the Euler-Mascheroni constant.

If that conjecture is true, the number of primes less than M₁₂ that are the exponents of Mersenne primes would be approximately

e^γ log M₁₂ / log 2 = 226.2.

So if m is a Mersenne prime, it may be the 226th Mersenne prime, or Mₙ for some n around 226, if the conjectured distribution of Mersenne primes is correct.
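The arithmetic behind this estimate takes only a few lines of Python. The value of the Euler-Mascheroni constant is hard-coded below because the standard math module does not provide it.

```python
from math import exp, log

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant

M12 = 2**127 - 1

# Conjectured count of Mersenne prime exponents up to x = M12
estimate = exp(EULER_GAMMA) * log(M12) / log(2)
print(round(estimate, 1))
```

Because log M₁₂ / log 2 is within rounding error of 127, the estimate is essentially 127 e^γ.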

We’ve discovered a dozen Mersenne primes since the turn of the century and we’re up to 51 discovered so far. We’re probably not going to get up to the 226th Mersenne prime, if there even is a 226th Mersenne prime, any time soon.


Small probabilities add, big ones don’t

A video has been making the rounds in which a well-known professor [1] says that if something has a 20% probability of happening in one attempt, then it has a 40% chance of happening in two attempts, a 60% chance of happening in three attempts, etc.

This is wrong, but it’s a common mistake. And one reason it’s common is that a variation on the mistake is approximately correct, which we will explain shortly.

It’s obvious the reasoning in the opening paragraph is wrong when you extend it to five, or especially six, attempts. Are you certain to succeed after five attempts? What does it even mean that you have a 120% chance of success after six attempts?!

But let’s reduce the probabilities in the opening paragraph. If there’s a 2% chance of success on your first attempt, is there a 4% chance of success in two attempts and a 6% chance of success in three attempts? Yes, approximately.

Two attempts

Here is the correct formula for the probability of an event happening in two tries.

P(A \cup B) = P(A) + P(B) - P(A\cap B)

In words, the probability of A or B happening equals the probability of A happening, plus the probability of B happening, minus the probability of A and B both happening. The last term is a correction term. Without it, you’re counting some possibilities twice.

So if the probability of success on each attempt is 0.02, the probability of success on two attempts is

0.02 + 0.02 − 0.0004 = 0.0396 ≈ 0.04.

When the probabilities of A and B are each small, the probability of A and B both happening is an order of magnitude smaller, assuming independence [2]. The smaller the probabilities of A and B, the less the correction term matters.

If the probability of success on each attempt is 0.2, now the probability of success after two attempts is 0.36. Simply adding probabilities and neglecting the correction term is incorrect, but not terribly far from correct in this case.

Three attempts

When you consider more attempts, things get more complicated. The probability of success after three attempts is given by

\begin{align*} P(A \cup B \cup C) &= P(A) + P(B) + P(C) \\ &- P(A\cap B) - P(B \cap C) - P(A \cap C) \\ &+ P(A \cap B \cap C) \end{align*}

as I discuss here. Adding the probabilities of success separately over-estimates the correct probability. So you correct by subtracting the probabilities of pairs of successes. But then this over-corrects, and so you need to add back in the probability of three successes.

If A, B, and C all have a 20% probability, the probability of A or B or C happening is 48.8%, not 60%, again assuming independence.

The error from naively adding probabilities increases as the number of attempts increases.

n attempts

Now let’s look at the general case. Suppose your probability of success on each attempt is p. Then your probability of failure on each independent attempt is 1 − p. The probability of at least one success out of n attempts is the complement of the probability of all failures, i.e.

1 - (1 - p)^n

When p is small, and when n is small, we can approximate this by np. That’s why naively adding probabilities works when the probabilities are small and there aren’t many of them. Here’s a way to say this precisely using the binomial theorem.

\begin{align*} 1 - (1 - p)^n &= 1 - \left(1 - np + {n \choose 2}p^2 - {n \choose 3}p^3 + \cdots\right ) \\ &= np + {\cal O}(p^2) \end{align*}

The exact probability is np plus (n − 1) terms that involve higher powers of p. When p and n are sufficiently small, these terms can be ignored.
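To see how quickly the approximation degrades, here is a short sketch comparing the naive sum np with the exact probability 1 − (1 − p)^n, assuming independent attempts. The function name at_least_one is an illustrative choice.

```python
def at_least_one(p, n):
    """Probability of at least one success in n independent attempts,
    each with success probability p."""
    return 1 - (1 - p)**n

for p in (0.02, 0.2):
    for n in (1, 2, 3, 5, 6):
        naive = n * p
        exact = at_least_one(p, n)
        print(f"p = {p}, n = {n}: naive {naive:.4f}, exact {exact:.4f}")
```

With p = 0.02 the two columns stay close, while with p = 0.2 the naive sum reaches 1 at five attempts and exceeds it at six, even though the exact probability is always below 1.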


[1] I’m deliberately not saying who. My point here is not to rub his nose in his mistake. This post will be online long after the particular video has been forgotten.

[2] Assuming A and B are independent. This is not always the case, and wrongly assuming independence can have disastrous consequences as I discuss here, but that’s a topic for another day.

Logistic regression quick takes

This post is a series of quick thoughts related to logistic regression. It starts with this article on moving between logit and probability scales.


Logistic regression models the probability of a yes/no event occurring. It gives you more information than a model that simply tries to classify yeses and nos. I advised a client to move from an uninterpretable classification method to logistic regression and they were so excited about the result that they filed a patent on it.

It’s too late to patent logistic regression, but they filed a patent on the application of logistic regression to their domain. I don’t know whether the patent was ever granted.


The article cited above is entitled “Rough approximations to move between logit and probability scales.” Here is a paragraph from the article giving its motivation.

When working in this space, it’s helpful to have some approximations to move between the logit and probability scales. (As an analogy, it is helpful to know that for a normal distribution, the interval ± 2 standard deviations around the mean covers about 95% of the distribution, while ± 3 standard deviations covers about 99%.)

Here are half the results from the post; the other half follow by symmetry.

    |  prob | logit |
    | 0.500 |     0 |
    | 0.750 |     1 |
    | 0.900 |     2 |
    | 0.950 |     3 |
    | 0.999 |     7 |

Zero on the logit scale corresponds exactly to a probability of 0.5. The other values are approximate.

When I say the rest of the table follows by symmetry, I’m alluding to the fact that

logit(1 − p) = − logit(p).

So, for example, because logit(0.999) ≈ 7, logit(0.001) ≈ −7.
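The approximations are easy to check directly. This sketch defines the logit function and its inverse from their standard definitions and prints the probability corresponding to each integer logit value in the table; the function names are illustrative.

```python
from math import exp, log

def logit(p):
    """Log odds of probability p."""
    return log(p / (1 - p))

def inv_logit(x):
    """Probability corresponding to log odds x."""
    return 1 / (1 + exp(-x))

# Probabilities corresponding to the integer logit values in the table
for x in (0, 1, 2, 3, 7):
    print(x, round(inv_logit(x), 4))

# Symmetry: logit(1 - p) = -logit(p)
print(logit(0.999), logit(0.001))
```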


The post reminded me of the decibel scale. As I wrote in this post, “It’s a curious and convenient fact that many decibel values are close to integers.”

  • 3 dB ≈ 2
  • 6 dB ≈ 4
  • 7 dB ≈ 5
  • 9 dB ≈ 8

I was curious whether the logit / probability approximations were as accurate as these decibel approximations. Alas, they are not. They are rough approximations, as advertised in the title, but still useful.


The post also reminded me of a comment by Andrew Gelman and Jennifer Hill on why natural logs are natural for regression.


Numerical application of mean value theorem

Suppose you’d like to evaluate the function

u(z) = \frac{e^z - 1 - z}{z^2}

for small values of z, say z = 10⁻⁸. This example comes from [1].

The Python code

    from numpy import exp
    def f(z): return (exp(z) - 1 - z)/z**2
    print(f(1e-8))

prints -0.607747099184471.

Now suppose you suspect numerical difficulties and compute your result to 50 decimal places using bc -l.

    scale = 50
    z = 10^-8
    (e(z) - 1 - z)/z^2

Now you get u(z) = .50000000166666667….

This suggests the original calculation was completely wrong. What’s going on?

For small z,

e^z ≈ 1 + z

and so we lose precision when directly evaluating the numerator in the definition of u. In our example, we lost all precision.

The mean value theorem from complex analysis says that the value of an analytic function at a point equals the continuous average of the values over a circle centered at that point. If we approximate this average by taking the average of 16 values in a circle of radius 1 around our point, we get full accuracy. The Python code

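    from numpy import exp, pi, arange

    def g(z):
        N = 16
        # N equally spaced points on the unit circle centered at z
        ws = z + exp(2j*pi*arange(N)/N)
        return sum(f(ws))/N

returns, for z = 10⁻⁸,

    0.5000000016666668 + 8.673617379884035e-19j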

which departs from the result calculated with bc in the 16th decimal place.

At a high level, we’re avoiding numerical difficulties by averaging over points far from the difficult region.
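As a cross-check that does not depend on bc, the sketch below compares the direct evaluation and the circle average against the Taylor series u(z) = 1/2 + z/6 + z²/24 + ⋯, whose kth coefficient is 1/(k + 2)!. The function names u_naive, u_circle, and u_taylor are illustrative and not from [1].

```python
import numpy as np

def u_naive(z):
    """Direct evaluation; suffers cancellation for small z."""
    return (np.exp(z) - 1 - z) / z**2

def u_circle(z, n=16):
    """Average of u over n equally spaced points on the unit circle around z."""
    w = z + np.exp(2j * np.pi * np.arange(n) / n)
    return np.mean(u_naive(w))

def u_taylor(z, terms=10):
    """Partial sum of u(z) = sum over k >= 0 of z**k / (k+2)!"""
    total, term = 0.0, 0.5        # first term is 1/2!
    for k in range(terms):
        total += term
        term *= z / (k + 3)       # z**(k+1)/(k+3)! from z**k/(k+2)!
    return total

z = 1e-8
print("naive: ", u_naive(z))
print("circle:", u_circle(z))
print("taylor:", u_taylor(z))
```

For small z the Taylor series is the simplest fix, but the circle average has the advantage of needing only function evaluations, not a series expansion.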


[1] Lloyd N. Trefethen and J. A. C. Weideman. The Exponentially Convergent Trapezoid Rule. SIAM Review. Vol. 56, No. 3. pp. 385–458.