Editing a file without an editor

I don’t use sed very often, but it’s very handy when I do, particularly when I need to make a small change to a large file.

Fixing a JSON file

Lately I’ve been trying to fix a 30 MB JSON file that has been corrupted somehow. The file is one very long line.

Emacs was unable to open the file. (It might have eventually opened the file, but I killed the process when it took longer than I was willing to wait.)

Emacs can open large files, but it has trouble with long lines. Somewhere its data structures assume lines are not millions of characters long.

I used sed to add line breaks after closing brackets

    sed -i 's/]/]\n/g' myfile.json

and then I was able to open the file in Emacs.

If the problem with my JSON file were simply a missing closing brace—it’s not—then I could add a closing brace with

    sed -i 's/$/}/' myfile.json

Using sed to find a job

I had a friend in college who got a job because of a sed script he wrote as an intern.

A finite element program would crash when it attempted to take too large a time step, but the program would not finish by the time the results were needed if it always took tiny time steps. So they’d let the program crash occasionally, then edit the configuration file with a smaller time step and restart the program.

They were asking engineers to work around the clock so someone could edit the configuration file and restart the finite element program if it crashed in the middle of the night. My friend wrote a shell script to automate this process, using sed to do the file editing. He eliminated the need for a night shift and got a job offer.

Related posts

Interpolating the gamma function

Suppose you wanted to approximate Γ(10.3). You know it’s somewhere between Γ(10) = 9! and Γ(11) = 10!, and linear interpolation would give you

Γ(10.3) ≈ 0.7 × 9! + 0.3 × 10! = 1342656.

But the exact value is closer to 716430.69, and so our estimate is about 87% too high. Not a very good approximation.

Now let’s try again, applying linear interpolation to the log of the gamma function. Our approximation is

log Γ(10.3) ≈ 0.7 × log 9! + 0.3 × log 10! = 13.4926

while the actual value is 13.4820, an error of about 0.08%. If we take exponentials to get an approximation of Γ(10.3), not log Γ(10.3), the error is larger, about 1%, but still much better than an 87% error.
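
The numbers above are easy to check with the standard library. A quick sketch in Python, using math.lgamma for the log gamma function:

```python
from math import lgamma, exp, factorial

# Linear interpolation of Gamma directly: 0.7 * Gamma(10) + 0.3 * Gamma(11)
direct = 0.7 * factorial(9) + 0.3 * factorial(10)

# Linear interpolation of log Gamma instead
log_interp = 0.7 * lgamma(10) + 0.3 * lgamma(11)
log_exact = lgamma(10.3)

print(direct)                           # about 1342656
print(log_interp, log_exact)            # about 13.4926 vs 13.4820
print(direct / exp(log_exact) - 1)      # relative error interpolating Gamma
print(exp(log_interp - log_exact) - 1)  # relative error interpolating log Gamma
```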

The gamma function grows very quickly, and so the log gamma function is usually easier to work with.

As a bonus, the Bohr–Mollerup theorem says that log gamma is a convex function. This tells us that not only does linear interpolation give an approximation, it gives us an upper bound.

The Bohr–Mollerup theorem essentially says that the gamma function is the only function that extends factorial from a function on the integers to a log-convex function on the positive real numbers. This isn’t quite true since it’s actually Γ(x + 1) that extends factorial. Showing the gamma function is unique is the hard part. In the preceding paragraph we used the easy direction of the theorem, saying that gamma is log-convex.

Too clever Monte Carlo

One way to find the volume of a sphere would be to imagine the sphere in a box, randomly select points in the box, and count how many of these points fall inside the sphere. In principle this would work in any dimension.

The problem with naive Monte Carlo

We could write a program to estimate the volume of a high-dimensional sphere this way. But there’s a problem: very few random samples will fall in the sphere. The ratio of the volume of a sphere to the volume of a box it fits in goes to zero as the dimension increases. We might take a large number of samples and none of them fall inside the sphere. In this case we’d estimate the volume as zero. This estimate would have small absolute error, but 100% relative error.

A more clever approach

So instead of actually writing a program to randomly sample a high dimensional cube, let’s imagine that we did. Instead of doing a big Monte Carlo study, we could be clever and use theory.

Let n be our dimension. We want to draw uniform random samples from [−1, 1]ⁿ and see whether they land inside the unit sphere. So we’d draw n random samples from [−1, 1] and see whether the sum of their squares is less than or equal to 1.

Let Xᵢ be a uniform random variable on [−1, 1]. We want to know the probability that

X₁² + X₂² + X₃² + … + Xₙ² ≤ 1.

This would be an ugly calculation, but since we’re primarily interested in the case of large n, we can approximate the sum using the central limit theorem (CLT). We can show, using the transformation theorem, that each Xᵢ² has mean 1/3 and variance 4/45. The CLT says that the sum has approximately the distribution of a normal random variable with mean n/3 and variance 4n/45.

Too clever by half

The approach above turns out to be a bad idea, though it’s not obvious why.

The CLT does provide a good approximation of the sum above, near the mean. But we have a sum with mean n/3, with n large, and we’re asking for the probability that the sum is less than 1. In other words, we’re asking for the probability in the tail where the CLT approximation error is a bad (relative) fit. More on this here.
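
To see just how bad the tail approximation is, we can compare the CLT estimate against the exact probability, which is the volume of the unit n-ball divided by the volume of [−1, 1]ⁿ. A sketch in Python (the function names are mine):

```python
from math import pi, gamma, sqrt, erf

def exact_prob(n):
    """P(uniform point of [-1,1]^n lies in the unit ball) = V_n / 2^n."""
    return pi**(n / 2) / gamma(n / 2 + 1) / 2**n

def clt_prob(n):
    """CLT approximation: sum of squares ~ Normal(n/3, 4n/45)."""
    z = (1 - n / 3) / sqrt(4 * n / 45)
    return (1 + erf(z / sqrt(2))) / 2  # standard normal CDF at z

for n in (5, 10, 20):
    print(n, exact_prob(n), clt_prob(n))
```

The relative error of the CLT estimate grows rapidly with n: at n = 10 it is off by a factor of about 2.7, and at n = 20 by a factor of several hundred.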

This post turned out not to be about what I thought it would be about. I thought it would lead to an asymptotic approximation for the volume of an n-dimensional sphere; I would compare the approximation to the exact value and see how well it did. Except it did terribly. So instead, this post is a cautionary tale about remembering how convergence works in the CLT.

Evaluating a class of infinite sums in closed form

The other day I ran across the surprising identity

\sum_{n=1}^\infty \frac{n^3}{2^n} = 26

and wondered how many sums of this form can be evaluated in closed form like this. Quite a few it turns out.

Sums of the form

\sum_{n=1}^\infty \frac{n^k}{c^n}

evaluate to a rational number when k is a non-negative integer and c is a rational number with |c| > 1. Furthermore, there is an algorithm for finding the value of the sum.

The sums can be evaluated using the polylogarithm function Lis(z) defined as

\text{Li}_s(z) = \sum_{n=1}^\infty \frac{z^n}{n^s}

using the identity

\sum_{n=1}^\infty \frac{n^k}{c^n} = \text{Li}_{-k}\left(\frac{1}{c}\right)

We then need a way to evaluate Lis(z). This cannot be done in closed form in general, but it can be done when s is a negative integer, as above. To evaluate Li−k(z) we need to know two things. First,

Li_1(z) = -\log(1-z)

and second,

\text{Li}_{s-1}(z) = z \frac{d}{dz} \text{Li}_s(z)

Now Li0(z) is a rational function of z, namely z/(1 − z). The derivative of a rational function is a rational function, and multiplying a rational function of z by z produces another rational function, so Lis(z) is a rational function of z whenever s is a non-positive integer.

Assuming the results cited above, we can prove the identity

\sum_{n=1}^\infty \frac{n^3}{2^n} = 26

stated at the top of the post. The sum equals Li−3(1/2), and

\text{Li}_{-3}(z) = \left(z \frac{d}{dz}\right)^3 \frac{z}{1-z} = \frac{z(1 + 4z + z^2)}{(1-z)^4}

The result comes from plugging in z = 1/2 and getting out 26.
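
We can check that last step with exact rational arithmetic. A sketch in Python, hard-coding the rational function derived above:

```python
from fractions import Fraction

def li_minus_3(z):
    """Li_{-3}(z) = z(1 + 4z + z^2) / (1 - z)^4, valid for |z| < 1."""
    return z * (1 + 4 * z + z**2) / (1 - z)**4

print(li_minus_3(Fraction(1, 2)))  # 26
```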

When k and c are positive integers, the sum

\sum_{n=1}^\infty \frac{n^k}{c^n}

is not necessarily an integer, as it is when k = 3 and c = 2, but it is always rational. It looks like the sum is always an integer when c = 2; I verified that the sum is an integer for c = 2 and k = 1 through 10 using the PolyLog function in Mathematica.

Update: Here is a proof that the sum is an integer when c = 2, from a comment by Theophylline on Substack.

The sum is occasionally an integer for larger values of c. For example,

\sum_{n=1}^\infty \frac{n^4}{3^n} = 15

and

\sum_{n=1}^\infty \frac{n^8}{3^n} = 17295
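
Since these series converge geometrically, partial sums confirm the values quickly. A numeric sketch in Python:

```python
def power_sum(k, c, terms=400):
    """Partial sum of n^k / c^n; the tail is negligible for a few hundred terms."""
    return sum(n**k / c**n for n in range(1, terms + 1))

print(power_sum(3, 2))  # about 26
print(power_sum(4, 3))  # about 15
print(power_sum(8, 3))  # about 17295
```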

Sphere spilling out

Center a small blue sphere on every corner of an n-dimensional unit hypercube. These are the points in ℝⁿ for which every coordinate is either a 0 or a 1. Now inflate each of these small spheres at the same time until they touch. Each sphere will have radius 1/2. For example, the spheres centered at

(0, 0, 0, …, 0)

and

(0, 0, 0, …, 1)

can’t have a radius larger than 1/2 because they’ll intersect at (0, 0, 0, …, 1/2).

Now center a little red sphere at the center of the hypercube, the point where every coordinate equals 1/2, and blow it up until it touches the blue spheres. Which is larger, the blue spheres or the red one?

The answer depends on n. In low dimensions, the blue spheres are larger. But in higher dimensions, the red sphere is larger.

The distance from the center of the hypercube to any of the vertices is √n / 2, and so the radius of the red sphere is (√n − 1) / 2. When n = 4, the radius of the red sphere equals the radius of the blue spheres.

The hypercube and blue spheres will fit inside a hypercube whose side is length 2. But if n > 9 the red sphere will spill out of this larger hypercube.
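
The dimension thresholds are easy to verify. A sketch in Python (the function name is mine):

```python
from math import sqrt

def red_radius(n):
    """Radius of the central red sphere: distance to a vertex minus the blue radius 1/2."""
    return (sqrt(n) - 1) / 2

print(red_radius(4))   # 0.5: equal to the blue spheres' radius
print(red_radius(9))   # 1.0: red sphere touches the side-2 hypercube
print(red_radius(10))  # greater than 1: the red sphere spills out
```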

This post was motivated by this post by Aryeh Kontorovich on Twitter.

More on high-dimensional spheres

A variation on Rock, Paper, Scissors

Imagine in a game of Rock, Paper, Scissors one player is free to play as usual but the other is required to choose each option the same number of times. That is, in 3n rounds of the game, the disadvantaged player must choose Rock n times, Paper n times, and Scissors n times.

Obviously the unrestricted player would be expected to win more games, but how many more? At least one, because the unrestricted player knows what the restricted player will do on the last round.

If n is large, then in the early rounds of the game it makes little difference that one of the players is restricted. The restriction doesn’t give the unrestricted player much exploitable information. But in the later rounds of the game, the limitations on the restricted player’s moves increase the other player’s chances of winning.

It turns out that if the unrestricted player uses an optimal strategy, he can expect to win O(√n) more rounds than he loses. More precisely, the expected advantage approaches 1.4658√n as n grows. You can find a thorough analysis of the problem in this paper.

q-analog of rising powers

The previous post looked at the probability that a random n by n matrix over a finite field of order q is invertible. This works out to be

\prod_{i=1}^n \left(1 - \frac{1}{q^i}\right)

This function of q and n comes up in other contexts as well and has a name that we will get to shortly.

Pochhammer symbols

Leo August Pochhammer (1841–1920) defined the kth rising power of x by

(x)_k = x(x + 1)(x + 2)\cdots(x + k - 1)

Rising and falling powers come up naturally in combinatorics, finite difference analogs of calculus, and in the definition of hypergeometric functions.

q-Pochhammer symbols

The q-analog of the Pochhammer symbol is defined as

(a;q)_n = \prod_{k=0}^{n-1} (1-aq^k)=(1-a)(1-aq)(1-aq^2)\cdots(1-aq^{n-1})

Like the Pochhammer symbol, the q-Pochhammer symbol also comes up in combinatorics.

In general, q-analogs of various concepts are generalizations that reduce to the original concept in some sort of limit as q goes to 1. The relationship between the q-Pochhammer symbol and the Pochhammer symbol is

\lim_{q\to 1} \frac{(q^x;q)_n}{(1-q)^n} = (x)_n

For a simpler introduction to q-analogs, see this post on q-factorial and q-binomial coefficients.
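
We can check this limit numerically, say for x = 3 and n = 4, where (x)ₙ = 3 · 4 · 5 · 6 = 360. A sketch in Python (the helper names are mine):

```python
def qpochhammer(a, q, n):
    """(a; q)_n = product of (1 - a q^k) for k = 0, ..., n-1."""
    p = 1.0
    for k in range(n):
        p *= 1 - a * q**k
    return p

def rising_power(x, n):
    """(x)_n = x(x+1)...(x+n-1)."""
    p = 1
    for i in range(n):
        p *= x + i
    return p

q = 0.9999  # close to 1
x, n = 3, 4
approx = qpochhammer(q**x, q, n) / (1 - q)**n
print(approx, rising_power(x, n))  # the first value approaches 360 as q -> 1
```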

Back to our probability problem

The motivation for this post was to give a name to the function giving the probability that a random n by n matrix over a finite field of order q is invertible. In the notation above, this function is (1/q; 1/q)ₙ.

There’s a confusing notational coincidence here. The number of elements in a finite field is usually denoted by q. The q in q-analogs such as the q-Pochhammer symbol has absolute value less than 1. It’s a coincidence that they both use the letter q, and the q in our application of q-Pochhammer symbols is the reciprocal of the q representing the order of a finite field.

I mentioned in the previous post that the probability of the matrix being invertible is a decreasing function of n. This probability decreases to a positive limit that varies with the value of q. This limit is (1/q; 1/q)∞, where the subscript ∞ denotes the limit of (1/q; 1/q)ₙ as n goes to infinity. There’s no problem here because the infinite product converges.

Mathematica and plotting

The q-Pochhammer symbol (a; q)ₙ is implemented in Mathematica as QPochhammer[a, q, n] and the special case (q; q)∞ is implemented as QPochhammer[q]. We can use the latter to make the following plot.

Plot[QPochhammer[q], {q, -1, 1}]

Recall that the q in our motivating application is the reciprocal of the q in the q-Pochhammer symbol. This says for large fields, the limiting probability that an n by n matrix is invertible as n increases is near 1, but that for smaller fields the limiting probability is also smaller. For q = 2, the probability is 0.288788.

Plot[QPochhammer[1/q], {q, 2, 100}, PlotRange -> All]
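
If you don’t have Mathematica at hand, the same numbers drop out of a truncated product. A sketch in Python (the function name is mine):

```python
def limiting_prob(q, terms=100):
    """(1/q; 1/q)_infinity, truncated: the limiting probability that a large
    random matrix over a field of order q is invertible."""
    p = 1.0
    for i in range(1, terms + 1):
        p *= 1 - q**(-i)
    return p

print(limiting_prob(2))    # about 0.288788
print(limiting_prob(100))  # close to 1 for a large field
```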

Solvability of linear systems over finite fields

If you have n equations in n unknowns over a finite field with q elements, how likely is it that the system of equations has a unique solution?

The number of possible n × n matrices with entries from a field of size q is q^(n²), since each of the n² entries can be chosen in q ways. The set of invertible n × n matrices over a field with q elements is GLₙ(q), and the number of elements in this set is [1]

|GL_n(q)| = q^{n(n-1)/2} \prod_{i=1}^n(q^i - 1)

The probability that an n × n matrix is invertible is then

\prod_{i=1}^n \left(1 - \frac{1}{q^i}\right)

which is an increasing function of q and a decreasing function of n. More on this function in the next post.
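
For q = 2 we can sanity-check the formula against a direct simulation, testing invertibility of random 0–1 matrices by Gaussian elimination mod 2. A sketch in Python (helper names are mine; rows are stored as integers used as bit vectors):

```python
import random

def invertible_mod2(rows, n):
    """Gaussian elimination over GF(2); returns True if the matrix has full rank."""
    rows = rows[:]
    for col in range(n):
        # Find a row at or below position col with a 1 in this column
        pivot = next((i for i in range(col, n) if rows[i] >> col & 1), None)
        if pivot is None:
            return False
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Clear this column in every other row
        for i in range(n):
            if i != col and rows[i] >> col & 1:
                rows[i] ^= rows[col]
    return True

def predicted(q, n):
    """Product formula for the probability an n x n matrix over GF(q) is invertible."""
    p = 1.0
    for i in range(1, n + 1):
        p *= 1 - q**(-i)
    return p

random.seed(0)
n, trials = 4, 20000
hits = sum(invertible_mod2([random.getrandbits(n) for _ in range(n)], n)
           for _ in range(trials))
print(hits / trials, predicted(2, n))  # the two values should be close
```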

[1] Robert A. Wilson. The Finite Simple Groups. Springer, 2009.

Why do medical tests always have error rates?

Most people implicitly assume medical tests are infallible. If they test positive for X, they assume they have X. Or if they test negative for X, they’re confident they don’t have X. Neither is necessarily true.

Someone recently asked me why medical tests always have an error rate. It’s a good question.

A test is necessarily a proxy, a substitute for something else. You don’t run a test to determine whether someone has a gunshot wound: you just look.

A test for a disease is based on a pattern someone discovered. People who have a certain condition usually have a certain chemical in their blood, and people who do not have the condition usually do not have that chemical in their blood. But it’s not a direct observation of the disease.

“Usually” is as good as you can do in biology. It’s difficult to make universal statements. When I first started working with genetic data, I said something to a colleague about genetic sequences being made of A, C, G, and T. I was shocked when he replied “Usually. This is biology. Nothing always holds.” It turns out some viruses have Zs (aminoadenine) rather than As (adenine).

Error rates may be very low and still be misleading. A positive test for a rare disease is usually a false positive, even if the test is highly accurate. This is a canonical example of the need to use Bayes’ theorem. The details are written up here.
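
A quick illustration with made-up numbers (the prevalence, sensitivity, and specificity below are hypothetical, chosen only to show the effect):

```python
# Hypothetical test: 1 in 1000 prevalence, 99% sensitivity, 99% specificity
prevalence = 0.001
sensitivity = 0.99   # P(positive | disease)
specificity = 0.99   # P(negative | no disease)

# Bayes' theorem: P(disease | positive test)
p_positive = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
ppv = prevalence * sensitivity / p_positive
print(ppv)  # about 0.09: most positives are false positives
```

Even with a 99% accurate test, a positive result is wrong about 90% of the time when the condition is this rare.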

The more often you test for something, the more often you’ll find it, correctly or incorrectly. The more conditions you test, the more conditions you find, correctly or incorrectly.

Wouldn’t it be great if your toilet had a built in lab that tested you for hundreds of diseases several times a day? No, it would not! The result would be frequent false positives, leading to unnecessary anxiety and expense.

Up to this point we’ve discussed medical tests, but the same applies to any kind of test. Surveillance is a kind of test, one that also has false positives and false negatives. The more often Big Brother reads your email, the more likely it is that he will find something justifying a knock on your door.

Rényi’s parking constant

Parallel parking photo by IFCAR, Public Domain

Imagine parallel parking is available along the shoulder of a road, but no parking spaces are marked.

The first person to park picks a spot on the shoulder at random. Then another car also chooses a spot along the shoulder at random, with the constraint that the second car can’t overlap the first. This process continues until no more cars can park. How many people can park this way?

Assume all cars have the same length, and we choose our units so that cars have length 1. The expected number of cars that can park is a function M(x) of the length of the parking strip x. Clearly if x < 1 then M(x) = 0. Alfréd Rényi [1] found that for x ≥ 1, M(x) satisfies the equation

M(x) = 1 + \frac{2}{x-1}\int_0^{x-1} M(t)\, dt

This is an unusual equation, difficult to work with because it defines M only implicitly. It’s not even clear that the equation has a solution. But it does, and the ratio of M(x) to x approaches a constant as x increases.

m = \lim_{x\to\infty} \frac{M(x)}{x} = 0.74759\ldots

The number m is known as Rényi’s parking constant.

This says for a long parking strip, parking sequentially at random will allow about 3/4 as many cars to park as if you were to pack the cars end-to-end.
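
A simulation agrees with this: place the first car uniformly at random, then recurse on the gaps to its left and right. A sketch in Python:

```python
import random
import sys

sys.setrecursionlimit(10000)  # generous headroom for the recursion

def park(x):
    """Number of unit-length cars parked by random sequential parking on [0, x]."""
    if x < 1:
        return 0
    t = random.uniform(0, x - 1)  # left end of the first car
    return 1 + park(t) + park(x - t - 1)

random.seed(42)
x, trials = 200, 100
m_hat = sum(park(x) for _ in range(trials)) / trials / x
print(m_hat)  # roughly 0.75, close to Rényi's constant
```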

More posts on Rényi’s work

[1] Steven R. Finch. Mathematical Constants. Cambridge University Press, 2003.