String interpolation in Python and R

One of the things I liked about Perl was string interpolation. If you use a variable name in a string, the variable will expand to its value. For example, if you have a variable $x which equals 42, then the string

    "The answer is $x."

will expand to “The answer is 42.” Perl requires variables to start with sigils, like the $ in front of scalar variables. Sigils are widely considered to be ugly, but they have their benefits. Here, for example, $x is clearly a variable name, whereas x would not be.

You can do something similar to Perl’s string interpolation in Python with so-called f-strings. If you put an f in front of an opening quotation mark, an expression in braces will be replaced with its value.

    >>> x = 42
    >>> f"The answer is {x}."
    'The answer is 42.'

You could also say

    >>> f"The answer is {6*7}."

for example. The f-string is just a string; it’s only printed because we’re working from the Python REPL.

The glue package for R lets you do something very similar to Python’s f-strings.

    > library(glue)
    > x <- 42
    > glue("The answer is {x}.")
    The answer is 42.
    > glue("The answer is {6*7}.")
    The answer is 42.

As with f-strings, glue returns a string. It doesn’t print the string; the string is displayed because we’re working from a REPL, the R REPL in this case.

Regular expressions and special characters

Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular expressions, or vice versa.

This post goes through an example in detail that shows how to manage special characters in several different contexts.

Escaping special TeX characters

I recently needed to write a regular expression [1] to escape TeX special characters. I’m reading in text like ICD9_CODE and need to make that ICD9\_CODE so that TeX will understand the underscore to be a literal underscore, not a subscript instruction.

Underscore isn’t the only special character in TeX. It has ten special characters:

    \ { } $ & # ^ _ % ~

The two that people most commonly stumble over are probably $ and % because these are fairly common in ordinary prose. Since % begins a comment in TeX, importing a percent sign without escaping it will fail silently. The result is syntactically valid. It just effectively cuts off the remainder of the line.

So whenever my script sees a TeX special character that isn’t already escaped, I’d like it to escape it.

Raw strings

First I need to tell Python what the special characters are for TeX:

    special = r"\\{}$&#^_%~"

There’s something interesting going on here. Most of the characters that are special to TeX are not special to Python. But backslash is special to both. Backslash is also special to regular expressions. The r prefix in front of the quotes tells Python this is a “raw” string and that it should not interpret backslashes as special. It’s saying “I literally want a string that begins with two backslashes.”

Why two backslashes? Wouldn’t one do? We’re about to use this string inside a regular expression, and backslashes are special there too. More on that shortly.
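To see the difference the r prefix makes, compare the lengths of a raw and an ordinary string literal:

    >>> len(r"\\")
    2
    >>> len("\\")
    1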

Lookbehind

Here’s my regular expression:

    re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)

I want special characters that have not already been escaped, so I’m using a negative lookbehind pattern. Negative lookbehind expressions begin with (?<! and end with ). So if, for example, I wanted to look for the string “ball” but only if it’s not preceded by “charity” I could use the regular expression

    (?<!charity )ball

This expression would match “foot ball” or “foosball” but not “charity ball”.

Our lookbehind expression is complicated by the fact that the thing we’re looking back for is a special character. We’re looking for a backslash, which is a special character for regular expressions [2].

After looking behind for a backslash and making sure there isn’t one, we look for our special characters. The reason we used two backslashes in defining the variable special is so the regular expression engine would see two backslashes and interpret that as one literal backslash.

Captures

The second argument to re.sub tells it what to replace its match with. We put parentheses around the character class listing TeX special characters because we want to capture it to refer to later. Captures are referred to by position, so the first capture is \1, the second is \2, etc.

We want to tell re.sub to put a backslash in front of the first capture. Since backslashes are special to the regular expression engine, we send it \\ to represent a literal backslash. When we follow this with \1 for the first capture, the result is \\\1 as above.

Testing

We can test the code above with the following line

    line = r"a_b $200 {x} %5 x\y"

and get

    a\_b \$200 \{x\} \%5 x\\y

which would cause TeX to produce output that looks like

a_b $200 {x} %5 x\y.

Note that we used a raw string for our test case. That was only necessary for the backslash near the end of the string. Without that we could have dropped the r in front of the opening quote.
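Putting the pieces together, here’s a minimal, self-contained version of the script; the function name escape_tex is mine.

    import re

    special = r"\\{}$&#^_%~"

    def escape_tex(line):
        # escape TeX special characters not already preceded by a backslash
        return re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)

    line = r"a_b $200 {x} %5 x\y"
    print(escape_tex(line))  # prints a\_b \$200 \{x\} \%5 x\\y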

P.S. on raw strings

Note that you don’t have to use raw strings. You could just escape your special characters with backslashes. But we’ve already got a lot of backslashes here; without raw strings we’d need even more. We’d have to say

    special = "\\\\{}$&#^_%~"

starting with four backslashes to send Python two to send the regular expression engine one.

[1] Whenever I write about using regular expressions someone will complain that my solution isn’t completely general and that they can create input that will break my code. I understand that, but it works for me in my circumstances. I’m just writing scripts to get my work done, not claiming to have written hardened production software for anyone else to use.

[2] Keep context in mind. We have three languages in play: TeX, Python, and regular expressions. One of the keys to understanding regular expressions is to see them as a small language embedded inside other languages like Python. So whenever you hear a character is special, ask yourself “Special to whom?”. It’s especially confusing here because backslash is special to all three languages.

Angles in the spiral of Theodorus

The previous post looked at how to plot the spiral of Theodorus shown below.

Spiral of Theodorus

We stopped the construction where we did because the next triangle to be added would overlap the first triangle, which would clutter the image. But we could certainly have kept going.

If we do keep going, then the set of hypotenuse angles will be dense in the circle, with no repeats: the increments arctan(1/√n) shrink to zero while their sum diverges, so the total angle wraps around the circle infinitely many times in ever smaller steps.

The nth triangle has legs of length 1 and √n, and so the tangent of the nth triangle’s acute angle is 1/√n. The angle formed by the nth hypotenuse is thus

arctan(1) + arctan(1/√2) + arctan(1/√3) + … + arctan(1/√n).

Here’s a plot of the first 99 hypotenuse angles.

Angles formed by hypotenuses in spiral of Theodorus

Here’s the code that produced the plot.

    from numpy import *
    import matplotlib.pyplot as plt

    plt.axes().set_aspect(1)

    N = 100
    # cumulative sums of the acute angles arctan(1/√n) for n = 1, 2, ..., N-1
    theta = cumsum(arctan(arange(1, N)**-0.5))
    plt.scatter(cos(theta), sin(theta))
    plt.savefig("theodorus_angles.svg")

If we change N to 500 we get a solid ring because the angles are closer together than the default thickness of dots in a scatterplot.

Distribution of quadratic residues

Let p be an odd prime number. If the equation

x² = n mod p

has a solution then n is a square mod p, or in classical terminology, n is a quadratic residue mod p. Half of the numbers between 0 and p are quadratic residues and half are not. The residues are distributed somewhat randomly. For large p, quadratic residues are distributed enough like random numbers to be useful in cryptographic applications. For example, see this post. But the distribution isn’t entirely random as we’ll demonstrate here.

First let’s look at a couple illustrations, using p = 101 and p = 103. We’ll use the following Python code to plot the residues.

    import matplotlib.pyplot as plt

    for p in [101, 103]:
        s = {x**2 % p for x in range(1, p)}
        for q in s:
            plt.vlines(q, 0, 1)
        plt.title("Quadratic residues mod {}".format(p))
        plt.savefig("residues_mod_{}.svg".format(p))
        plt.close()

Here are the plots.

quadratic residues mod 101

quadratic residues mod 103

Symmetry and anti-symmetry

If you look closely at the first image, you’ll notice it appears to be symmetric about the middle. And if you look closely at the second image, you’ll notice it appears to be anti-symmetric, i.e. the left side has bars where the right side has gaps and vice versa.

The following code shows that these observations are correct.

    p = 101
    s = {x**2 % p for x in range(1, p)}
    for q in s:
        assert(p - q in s)
    
    p = 103
    s = {x**2 % p for x in range(1, p)}
    for q in s:
        assert(p - q not in s)

Both of these observations hold more generally. If p = 1 mod 4, the residues are symmetric: n is a quadratic residue if and only if p − n is a quadratic residue. And if p = 3 mod 4, the residues are anti-symmetric: n is a quadratic residue if and only if p − n is not a quadratic residue. Both of these statements are consequences of the quadratic reciprocity theorem.

Quadratic excess

If you look again at the plot of residues for p = 103, you’ll notice that not only is it anti-symmetric, it is also darker on the left side. We can verify with the following code

    print(len([q for q in s if q < 51]))
    print(len([q for q in s if q > 51]))

that there are 28 residues on the left and 23 on the right. This is a general pattern called quadratic excess. The quadratic excess of an odd prime p, denoted E(p), is the number of quadratic residues to the left of the middle minus the number to the right of the middle. If p = 1 mod 4, E(p) is zero by symmetry, but if p = 3 mod 4, E(p) is always positive.

Here is a plot of quadratic excess for primes less than 3000 and congruent to 3 mod 4.

plot of quadratic excess
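Here’s a minimal sketch that could reproduce such a plot; the helper function quadratic_excess and the plotting details are mine, not from the original.

    from sympy import primerange
    import matplotlib.pyplot as plt

    def quadratic_excess(p):
        # residues left of the middle minus residues right of the middle
        s = {x**2 % p for x in range(1, p)}
        left  = sum(q < p/2 for q in s)
        right = sum(q > p/2 for q in s)
        return left - right

    ps = [p for p in primerange(3, 3000) if p % 4 == 3]
    plt.scatter(ps, [quadratic_excess(p) for p in ps])
    plt.savefig("quadratic_excess.svg")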

Statistical properties

In this post I’ve said a couple things that are in tension. At the top I said that quadratic residues act sufficiently like random bits to be useful in cryptography. Then later on I showed that they have major departures from random behavior. That would be a problem if you looked at the entire sequence of residues, but if p has thousands of digits and you’re looking at a small sample of the residues, that’s a different situation.

Suppose you pick two large primes, p and q, and keep them secret but tell me their product N = pq. If you then pick a random value n and ask me whether it’s a quadratic residue mod N, the best I can do is say “Maybe, with probability 1/2.” There are crypto systems that depend on this being a computationally hard problem to solve, unless the factors p and q are known.

However, you need to be a little careful. If n is smaller than √N, I could test whether n is a square, working in the integers, not the integers mod N. If it is a square integer, then it’s a square mod N. If it’s not a square integer, it might be a square mod N, but I wouldn’t know.
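The square test on integers can be done exactly, with no floating point, using math.isqrt; the helper name is mine.

    from math import isqrt

    def is_square(n):
        # exact integer square test, no rounding issues for large n
        r = isqrt(n)
        return r*r == n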

There’s a related issue I may explore in the future. Suppose instead of just asking whether a particular number is a quadratic residue, you take a range of numbers and return 0 or 1 depending on whether each number is a residue. Can I distinguish the sequence from a random sequence using statistical tests? If I took a sequence of length greater than N/2 then I could tell by looking at symmetry or anti-symmetry. But what if the range is small relative to N? Maybe N is a thousand-digit number but you only give me a sequence of a million bits. Would the distribution be distinguishable from uniform random bits? Is there any sort of autocorrelation that might betray the sequence as not random?

Bessel function crossings

The previous post looked at the angles that graphs make when they cross. For example, sin(x) and cos(x) always cross with the same angle. The same holds for sin(kx) and cos(kx) since the k simply rescales the x-axis.

The post ended with wondering about functions analogous to sine and cosine, such as Bessel functions. This post will look at that question in more detail. Specifically we’ll look at the functions Jν and Yν.

Because these two Bessel functions satisfy the same second order linear homogeneous differential equation, the Sturm separation theorem says that their zeros are interlaced: between each pair of consecutive zeros of Jν is exactly one zero of Yν, and between each pair of consecutive zeros of Yν there is exactly one zero of Jν.

Plotting Bessel functions J_3 and Y_3

In the following Python code, we find zeros of Jν, then look in between for places where Jν and Yν cross. Next we find the angle the two curves make at each intersection and plot the angles.

    from scipy.special import jn_zeros, jv, yv
    from scipy.optimize import bisect
    from numpy import empty, linspace, arccos
    import matplotlib.pyplot as plt
    
    n = 3 # bessel function order
    N = 100 # number of zeros
    
    z = jn_zeros(n, N) # Zeros of J_n
    crossings = empty(N-1)
    
    f = lambda x: jv(n,x) - yv(n,x)    
    for i in range(N-1):
        crossings[i] = bisect(f, z[i], z[i+1])
    
    def angle(n, x):
        # Derivatives of J_nu and Y_nu
        dj = 0.5*(jv(n-1,x) - jv(n+1,x))
        dy = 0.5*(yv(n-1,x) - yv(n+1,x))
        
        top = 1 + dj*dy
        bottom = ((1 + dj**2)*(1 + dy**2))**0.5
        return arccos(top/bottom)
        
    y = angle(n, crossings)
    plt.plot(y)
    plt.xlabel("Crossing number")
    plt.ylabel("Angle in radians")
    plt.show()

This shows that the angles steadily decrease, apparently quadratically.

Angles of crossing of J_3 and Y_3

This quadratic behavior is what we should expect from the asymptotics of Jν and Yν: for large arguments they act like shifted and rescaled versions of sin(x)/√x. So if we looked at √x Jν and √x Yν rather than Jν and Yν we’d expect the angles to reach some positive asymptote, and they do, as shown below.

Angles of crossing of √x J_3 and √xY_3
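Here’s one way the angle computation could be adapted to the scaled functions, using the product rule to differentiate √x Jν(x); this is my adaptation, not code from the original post.

    from scipy.special import jv, yv
    from numpy import arccos

    def angle_scaled(n, x):
        # derivatives of J_n and Y_n via the recurrence f' = (f_{n-1} - f_{n+1})/2
        dj = 0.5*(jv(n-1,x) - jv(n+1,x))
        dy = 0.5*(yv(n-1,x) - yv(n+1,x))
        # product rule: d/dx sqrt(x) f(x) = f(x)/(2 sqrt(x)) + sqrt(x) f'(x)
        u = jv(n,x)/(2*x**0.5) + x**0.5*dj
        v = yv(n,x)/(2*x**0.5) + x**0.5*dy
        return arccos((1 + u*v) / ((1 + u**2)*(1 + v**2))**0.5)

Since √x Jν(x) = √x Yν(x) exactly where Jν(x) = Yν(x), the crossings array found above can be passed directly to angle_scaled.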

Squircle perimeter and the isoperimetric problem

If you have a fixed length of rope and you want to enclose the most area inside the rope, make it into a circle. This is the solution to the so-called isoperimetric problem.

Dido’s problem is similar. If one side of your bounded area is given by a straight line, make your rope into a semi-circle.

This post looks at a similar problem for a squircle. Peter Panholzer mentioned the problem of finding the squircle exponent that leads to the largest area in proportion to its arclength. This sounds like the isoperimetric problem, but it’s not.

The isoperimetric problem holds perimeter constant and lets the shape enclosed vary, maximizing the area. The question here is to hold the axes constant and maximize the ratio of the area to the perimeter. Panholzer reports the maximum occurs at p = 4.39365.

Computing the perimeter

The area of a squircle can be found in closed form, and I’ve mentioned the equation a few times, for example here. The perimeter, however, cannot be found in closed form, as far as I know, except for special exponents.

We can solve for y as a function of x and find the arclength using the formula taught in calculus courses. However, the derivative of y has a singularity at x = 1. By switching to polar coordinates, we can find arclength in terms of an integrand with no singularities. We can also simplify things a little by computing the total arclength as 4 times the arclength in the first quadrant; this avoids having to take absolute values.

The following Python code computes the perimeter and the ratio of the area to the perimeter.

    from numpy import sin, cos, pi
    from scipy.integrate import quad
    from scipy.special import gamma
    
    def r(t, p):
        return (cos(t)**p + sin(t)**p)**-(1/p)
    
    def drdt(t, p):
        deriv = (cos(t)**p + sin(t)**p)**(-1-1/p)
        deriv *= cos(t)**(p-1)*sin(t) - sin(t)**(p-1)*cos(t)
        return deriv
    
    def integrand(t, p):
        return (drdt(t,p)**2 + r(t,p)**2)**0.5
    
    def arclength(p):
        integral = quad(lambda t: integrand(t,p), 0, pi/2)[0]
        return 4*integral
    
    def area(p):
        return 4*gamma(1 + 1/p)**2 / gamma(1 + 2/p)
    
    def ratio(p):
        return area(p) / arclength(p)

Basic geometry tells us the ratio is 1/2 when p = 2 and we have a circle. The ratio is also 1/2 when p = ∞, i.e. when we have a square. We can use the code above to find that the ratio when p = 4.39365679 is 0.528627, so there is at least one local maximum for values of p between 2 and ∞.

Here’s a plot of the ratio of area to perimeter as a function of p.

ratio of area to perimeter for squircle

The plot is quite flat for large values of p, but if we zoom in we can see that there is a local maximum near 4.4.

close up of previous graph near the maximum

When I find the maximum of the ratio function using Brent’s method (scipy.optimize.brent) I find p = 4.39365679, which agrees with the value Panholzer stated.
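Here’s a sketch of what that computation could look like, reusing the ratio function above; the bracketing points are my choice.

    from scipy.optimize import brent

    # maximize ratio(p) by minimizing its negative; (4, 5) straddles the bump near 4.4
    p_best = brent(lambda p: -ratio(p), brack=(4, 5))
    print(p_best)   # approximately 4.39365679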

Here’s a plot of the squircle corresponding to this value of p.

squircle with largest area to perimeter ratio

Back to the isoperimetric problem

Now why doesn’t this contradict the isoperimetric problem? Area scales quadratically but perimeter scales linearly. If you don’t hold perimeter constant, you can find a larger ratio of area to perimeter by making the perimeter larger. And that’s what we’ve done. When p = 2, we have a unit circle, with area π and perimeter 2π. When p = 4.39365679 we have area 3.750961567 and perimeter 7.09566295. If we were to take a circle with the same perimeter, it would have area 4.00660097, larger than the squircle we started with.

An infinite product challenge

Gil Kalai wrote a blog post yesterday entitled “Test Your Intuition (or knowledge, or programming skills) 36.” The challenge is to evaluate the infinite product

\prod_{p\,\, \mathrm{prime}} \frac{p^2+1}{p^2 - 1}

I imagine there’s an elegant analytical solution, but since the title suggested that programming might suffice, I decided to try a little Python. I used primerange from SymPy to generate the list of primes up to 200, and cumprod from NumPy to generate the list of partial products.

    from sympy import primerange
    from numpy import cumprod

    cumprod(
        [(p*p+1)/(p*p-1) for p in primerange(1, 200)]
    )

Apparently the product converges to 5/2, and a plot suggests that it converges very quickly.

Plot of partial products

Here’s another plot to look more closely at the rate of convergence. Here we look at the difference between 5/2 and the partial products, on a log scale, for primes less than 2000.

Plot of 2.5 minus partial products, log scale
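Here’s a sketch of how such a plot could be produced; the axis label and file name are mine.

    from sympy import primerange
    from numpy import cumprod
    import matplotlib.pyplot as plt

    partial = cumprod(
        [(p*p+1)/(p*p-1) for p in primerange(1, 2000)]
    )
    plt.semilogy(2.5 - partial)
    plt.xlabel("Number of factors")
    plt.savefig("product_convergence.svg")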

Implementing the ChaCha RNG in Python

My previous post talked about the ChaCha random number generator and how Google is using it in a stream cipher for encryption on low-end devices. This post talks about how to implement ChaCha in pure Python.

First of all, the only reason to implement ChaCha in pure Python is to play with it. It would be more natural and more efficient to implement ChaCha in C.

RFC 8439 gives detailed, language-neutral directions for how to implement ChaCha, including test cases for intermediate results. At its core is the function that does a “quarter round” operation on four unsigned integers. This function depends on three operations:

  • addition mod 2³², denoted +
  • bitwise XOR, denoted ^, and
  • bit rotation, denoted <<<= n.

In C, the += operator on unsigned integers would do what the RFC denotes by +=, but Python integers are not fixed-width, so we need to explicitly take remainders mod 2³². The Python bitwise XOR operator ^ can be used directly. We’ll write a function roll that corresponds to <<<=.

So the following line of pseudocode from the RFC

    a += b; d ^= a; d <<<= 16;

becomes

    a = (a+b) % 2**32; d = roll(d^a, 16)

in Python. One way to implement roll would be to use the bitstring library:

    from bitstring import Bits

    def roll(x, n):
        bits = Bits(uint=x, length=32)
        return (bits[n:] + bits[:n]).uint

Another approach, a little harder to understand but not needing an external library, would be

    def roll2(x, n):
        # 2 << 31 equals 2**32; keep the low 32 bits and wrap the top n bits around
        return (x << n) % (2 << 31) + (x >> (32-n))
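A quick sanity check on either implementation: rotating left by n and then by 32 − n should return the original value. Using roll2 to avoid the external dependency:

    for x in [0, 1, 0xdeadbeef, 2**32 - 1]:
        for n in range(1, 32):
            assert roll2(roll2(x, n), 32 - n) == x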

So here’s an implementation of the ChaCha quarter round:

    def quarter_round(a, b, c, d):
        a = (a+b) % 2**32; d = roll(d^a, 16)
        c = (c+d) % 2**32; b = roll(b^c, 12)
        a = (a+b) % 2**32; d = roll(d^a,  8)
        c = (c+d) % 2**32; b = roll(b^c,  7)
        return a, b, c, d

ChaCha has a state consisting of 16 unsigned integers. A “round” of ChaCha consists of four quarter rounds, operating on four of these integers at a time. All the details are in the RFC.
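Here’s a sketch of what two consecutive rounds look like, a “column round” followed by a “diagonal round”, with the state kept in a list s of 16 integers and the indexing given in the RFC:

    def double_round(s):
        # column round: quarter rounds down the four columns
        s[0], s[4], s[ 8], s[12] = quarter_round(s[0], s[4], s[ 8], s[12])
        s[1], s[5], s[ 9], s[13] = quarter_round(s[1], s[5], s[ 9], s[13])
        s[2], s[6], s[10], s[14] = quarter_round(s[2], s[6], s[10], s[14])
        s[3], s[7], s[11], s[15] = quarter_round(s[3], s[7], s[11], s[15])
        # diagonal round: quarter rounds along the four diagonals
        s[0], s[5], s[10], s[15] = quarter_round(s[0], s[5], s[10], s[15])
        s[1], s[6], s[11], s[12] = quarter_round(s[1], s[6], s[11], s[12])
        s[2], s[7], s[ 8], s[13] = quarter_round(s[2], s[7], s[ 8], s[13])
        s[3], s[4], s[ 9], s[14] = quarter_round(s[3], s[4], s[ 9], s[14])
        return s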

Incidentally, the inner workings of the BLAKE2 secure hash function are similar to those of ChaCha.

Computing Legendre and Jacobi symbols

In an earlier post I introduced the Legendre symbol

\left(\frac{a}{p}\right)

where a is a positive integer and p is prime. It is defined to be 0 if a is a multiple of p, 1 if a has a square root mod p, and -1 otherwise.

The Jacobi symbol is a generalization of the Legendre symbol and uses the same notation. It relaxes the requirement that p be prime and only requires that p is odd.

If m has prime factors p_i with exponents e_i, then the Jacobi symbol is defined by

\left(\frac{n}{m}\right) = \prod \left(\frac{n}{p_i}\right)^{e_i}

Note that the symbol on the left is a Jacobi symbol while the symbols on the right are Legendre symbols.
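For example, 15 = 3 · 5, so (2/15) = (2/3)(2/5) = (−1)(−1) = 1, since 2 is a square mod neither 3 nor 5.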

The Legendre and Jacobi symbols are not fractions, but they act in some ways like fractions, and so the notation is suggestive. They come up in applications of number theory, so it’s useful to be able to compute them.

Algorithm for computing Jacobi symbols

Since the Legendre symbol is a special case of the Jacobi symbol, we only need an algorithm for computing the latter.

In the earlier post mentioned above, I outline an algorithm for computing Legendre symbols. The code below is more explicit, and more general. It’s Python code, but it doesn’t depend on any libraries or special features of Python, so it could easily be translated to another language. The algorithm is taken from Algorithmic Number Theory by Bach and Shallit. Its execution time is O( (log a)(log n) ).

    def jacobi(a, n):
        assert(n > a > 0 and n%2 == 1)
        t = 1
        while a != 0:
            while a % 2 == 0:
                a //= 2  # integer division; / would produce a float in Python 3
                r = n % 8
                if r == 3 or r == 5:
                    t = -t
            a, n = n, a
            if a % 4 == n % 4 == 3:
                t = -t
            a %= n
        if n == 1:
            return t
        else:
            return 0

Testing the Python code

To test the code we randomly generate positive integers a and odd integers n greater than a. We compare our self-contained Jacobi symbol function to the one in SymPy.

    from random import randrange
    from sympy import jacobi_symbol

    N = 1000
    for _ in range(100):
        a = randrange(1, N)
        n = randrange(a+1, 2*N)
        if n%2 == 0:
            n += 1

        j1 = jacobi_symbol(a, n)
        j2 = jacobi(a, n)
        if j1 != j2:
            print(a, n, j1, j2)

This prints nothing, suggesting that we coded the algorithm correctly.

Exploring the sum-product conjecture

Quanta Magazine posted an article yesterday about the sum-product problem of Paul Erdős and Endre Szemerédi. This problem starts with a finite set of real numbers A, then considers the size of the sets A+A and A*A. That is, if we add every element of A to every other element of A, how many distinct sums are there? If we take products instead, how many distinct products are there?

Proven results

Erdős and Szemerédi proved that there are constants c and ε > 0 such that

max{|A+A|, |A*A|} ≥ c|A|^(1+ε)

In other words, either A+A or A*A is substantially bigger than A. Erdős and Szemerédi only proved that some positive ε exists, but they suspected ε could be chosen close to 1, i.e. that either |A+A| or |A*A| is bounded below by a fixed multiple of |A|² or nearly so. George Shakan later showed that one can take ε to be any value less than

1/3 + 5/5277 = 0.3342808…

but the conjecture remains that the upper limit on ε is 1.

Python code

The following Python code will let you explore the sum-product conjecture empirically. It randomly selects sets of size N from the non-negative integers less than R, then computes the sum and product sets using set comprehensions.

    from numpy.random import choice

    def trial(R, N):
        # R = integer range, N = sample size
        x = choice(R, N, replace=False)
        s = {a+b for a in x for b in x}
        p = {a*b for a in x for b in x}
        return (len(s), len(p))

When I first tried this code I thought it had a bug. I called trial 10 times and got the same values for |A+A| and |A*A| every time. That was because I chose R large relative to N. In that case, it is likely that every sum and every product will be unique, aside from the redundancy from commutativity. That is, if R >> N, it is likely that |A+A| and |A*A| will both equal N(N+1)/2. Things get more interesting when N is closer to R.
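For example, with illustrative parameter choices:

    print(trial(10**6, 100))  # likely (5050, 5050): all sums and products distinct
    print(trial(200, 100))    # many collisions, so both counts are much smaller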

Probability vs combinatorics

The Erdős-Szemerédi problem is a problem in combinatorics, looking for deterministic lower bounds. But it seems natural to consider a probabilistic extension. Instead of asking about lower bounds on |A+A| and |A*A| you could ask for the distribution on |A+A| and |A*A| when the sets A are drawn from some probability distribution.

If the set A is drawn from a continuous distribution, then |A+A| and |A*A| both equal N(N+1)/2 with probability 1. Only careful choices, ones that would happen randomly with probability zero, could prevent the sums and products from being unique, modulo commutativity, as in the case R >> N above.

If the set A is an arithmetic sequence then |A+A| is small and |A*A| is large, and the opposite holds if A is a geometric sequence. So it might be interesting to look at the correlation of |A+A| and |A*A| when A comes from a discrete distribution, such as choosing N integers uniformly from [1, R] when N/R is not too small.
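Here’s a sketch of such an experiment, with arbitrary parameter choices, using the trial function above:

    from numpy import corrcoef

    # draw many random sets and record |A+A| and |A*A| for each draw
    results = [trial(1000, 100) for _ in range(1000)]
    sums, products = zip(*results)
    print(corrcoef(sums, products)[0, 1])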