Perfectly nonlinear functions

The other day I heard someone suggest that a good cocktail party definition of cryptography is puzzle making. Cryptographers create puzzles that are easy to solve given the key, but ideally impossible without the key.

Linearity is very useful in solving puzzles, and so a puzzle maker would like to create functions that are as far from linear as possible. That is why cryptographers are interested in perfectly nonlinear functions (PNs) and almost perfect nonlinear functions (APNs), which we will explore here.

The functions we are most interested in map a set of bits to a set of bits. To formalize things, we will look at maps between two finite Abelian groups A and B. If we look at lists of bits with addition defined as component-wise XOR [1], we have an Abelian group, but the ideas below apply to other kinds of Abelian groups as well.

We will measure the degree of nonlinearity of a function F from A to B by the size of the sets

\delta(a, b) = \{x \mid F(a + x) - F(x) = b\}

where a and x are in A and b is in B [2].

If F is a linear function, F(a + x) = F(a) + F(x), and so δ(a, b) simplifies to

\delta(a, b) = \{x \mid F(a) = b\}

which is all of the domain A if F(a) = b, and empty otherwise. So for linear functions, the set δ(a, b) is potentially as large as possible, i.e. all of A. Perfectly nonlinear functions will be functions where the sets δ(a, b) are as small as possible, and almost perfectly nonlinear functions will be functions where δ(a, b) is as small as possible in a given context.

The uniformity of a function F is defined by

\max_{\substack{a \in A,\, a \ne 0 \\ b \in B}} |\delta(a,b)|

The restriction a ≠ 0 is necessary since otherwise the uniformity of every function would be the same: δ(0, 0) is all of A for every function F.

For linear functions, the uniformity is as large as possible, i.e. |A|. As we will show below, a nonlinear function can have maximum uniformity as well, but some nonlinear functions have smaller uniformity. So some nonlinear functions are more nonlinear than others.

We will compute the uniformity of a couple functions, mapping 4 bits to 1 bit, with the following Python code.

    A = range(16)            # domain: all 4-bit values
    B = range(2)             # codomain: single bits
    nonZeroA = range(1, 16)  # nonzero elements of A

    def delta(f, a, b):
        # the set of x with f(x + a) - f(x) == b, where + and - are XOR
        return {x for x in A if f(x^a) ^ f(x) == b}

    def uniformity(f):
        # largest |delta(a, b)| over nonzero a and all b
        return max(len(delta(f, a, b)) for a in nonZeroA for b in B)

We could look at binary [3] functions with domains and ranges of different sizes by changing the definitions of A, B, and nonZeroA.

Note that in our groups, addition is component-wise XOR, denoted by ^ in the code above, and that subtraction is the same as addition since XOR is its own inverse. Of course in general addition would be defined differently and subtraction would not be the same as addition. When working over a binary field, things are simpler, sometimes confusingly simpler.
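For example, in Python:

    a, x = 0b1010, 0b1100
    assert (x ^ a) ^ a == x   # "adding" a twice gets back to x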

We want to compute the uniformity of two functions. Given bits x0, x1, x2, and x3, we define

F(x0, x1, x2, x3) = x0 x1

and

G(x0, x1, x2, x3) = x0 x1 + x2 x3.

We can implement these in Python as follows.

    def F(x):
        # unpack the integer x into a list of four bits x0..x3
        x = [c == '1' for c in format(x, "04b")]
        return x[0] and x[1]

    def G(x):
        x = [c == '1' for c in format(x, "04b")]
        return (x[0] and x[1]) ^ (x[2] and x[3])

Next we compute the uniformity of the two functions.

    print( uniformity(F) )
    print( uniformity(G) )

This shows that the uniformity of F is 16 and the uniformity of G is 8. Remember that functions with smaller uniformity are considered more nonlinear. Both functions F and G are nonlinear, but G is more nonlinear in the sense defined here.

To see that F is nonlinear, let x = (1, 0, 0, 0) and y = (0, 1, 0, 0).

1 = F(x + y) ≠ F(x) + F(y) = 0.

It turns out that 8 is the smallest possible uniformity for a function from 4 bits to 1 bit, and the function G is said to be perfectly nonlinear.
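We can confirm that 8 is the minimum by brute force, identifying each function from 4 bits to 1 bit with its 16-bit truth table. The following sketch is mine, not from the original post, and takes a minute or so to run:

    # f(x) is bit x of the truth table t; try all 2**16 truth tables
    best = min(uniformity(lambda x, t=t: (t >> x) & 1) for t in range(2**16))
    print(best)   # prints 8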

I haven’t defined perfect nonlinearity or almost perfect nonlinearity precisely, only saying that they have something to do with maximum nonlinearity (minimum uniformity) in some context. I may say more about that context in a future post.


[1] XOR stands for exclusive or. For Boolean variables p and q, p XOR q is true if either p is true or q is true, but not both. The component-wise XOR of two binary numbers sets each bit of the result to the XOR of the corresponding input bits. For example, 1001 XOR 1010 = 0011.

[2] This definition comes from Perfect nonlinear functions and cryptography by Céline Blondeau and Kaisa Nyberg.

[3] The next post replaces binary numbers with base 5 and other bases.

Python one-liner to print Roman numerals

Here’s an amusing little Python program to convert integers to Roman numerals:

    def roman(n): print(chr(0x215F + n))

It only works if n is between 1 and 12. That’s because Unicode contains characters for the Roman numerals I through XII.
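As a quick check, here's a sketch (mine, not part of the original one-liner) that prints all twelve:

    for n in range(1, 13):
        print(n, chr(0x215F + n))   # 1 Ⅰ, 2 Ⅱ, ..., 12 Ⅻ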

Here are the characters it produces:

Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ

You may not be able to see these, depending on what fonts you have installed. I can see them in my browser, but when I ran the code above in a terminal window I only saw a missing glyph placeholder.

If you can be sure your reader can see the characters, say in print rather than on the web, the single-character Roman numerals look nice, and they make it clear that they’re to be interpreted as Roman numerals.

Here’s a screenshot of the symbols.

I II III IV V VI VII VIII IX X XI XII

 

Bidirectional text

This post will take a look at simple bidirectional text, such as a bit of English inside an Arabic document, or a few words of Hebrew inside a French document. If you want to explore the subject in all its complexity, see Unicode Standard Annex 9.

You may not need to do anything special to display bidirectional text. For example, when I typed the following sentence, I just typed the letters in logical order.

The first letter of the Hebrew alphabet is אלף.

For the last word, I typed א, then ל, then ף. When I entered the ל, the editor put it on the left side of the א, and when I entered ף the editor put it to the left of the ל. The characters are stored in memory in the same sequence that I typed them, though they are displayed in the order appropriate for each language.

You can change the default display ordering of characters by inserting control characters. For example, I typed

The [U+202E]quick brown fox[U+202C] jumped.

and the text displays [1] as

The ‮quick brown fox‬ jumped.

The Unicode character U+202E, known as RLO for “right-to-left override,” tells the browser to display the following letters from right-to-left. Then the character U+202C, known as PDF for “pop directional formatting,” exits that mode, returning to left-to-right [2]. If we copy the first sentence into a text file and open it with a hex editor we can see the control characters, circled in red.

hex editor screen shot

I saved the file in UTF-16 encoding to make the characters easy to see: each quartet of hex characters represents one Unicode character. UTF-8 encoding is more common and more compact.
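If you'd rather inspect the bytes in code than in a hex editor, here's a minimal sketch of my own that builds the sentence and dumps its UTF-16 code units (big-endian, so U+202E shows up literally as 202e):

    RLO = "\u202E"   # right-to-left override
    PDF = "\u202C"   # pop directional formatting
    s = f"The {RLO}quick brown fox{PDF} jumped."
    print(s)                                   # display depends on the terminal's bidi support
    print(s.encode("utf-16-be").hex(" ", 2))   # each group of four hex digits is one character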

If for some reason you wanted to force Hebrew to display from left-to-right, you could insert U+202D, known as LRO for “left-to-right override.” The character to exit this mode is PDF, U+202C, as before.

Here’s a bit of Hebrew written left-to-right:

Written left-to-right: ‭אלף.

And here’s what it looks like in an hex editor:

another hex editor screen shot


[1] This should look the same as

The xof nworb kciuq jumped.

though in this footnote I typed the letters in the order they appear: xof …

If for some reason the text in the body of the post displays in normal order, not as in this note, then something has gone wrong with your browser’s rendering.

[2] So in addition to Portable Document Format and Probability Density Function, PDF can stand for Pop Directional Formatting. Here “pop” is being used in the computer science sense of popping a stack.

Understanding statistical error

A simple linear regression model has the form

y = μ + βx + ε.

This means that the output variable y is a linear function of the input variable x, plus some error term ε that is randomly distributed.

There’s a common misunderstanding over whose error the error term is. A naive view is that the world really is linear, that

y = μ + βx

is some underlying Platonic reality, and that the only reason we don’t measure exactly that linear part is that we as observers have made some sort of error, that the fault is in the real world rather than in the model.

No, reality is what it is, and it’s our model that is in error. Some aspect of reality may indeed have a structure that is approximately linear (over some range, under some conditions), but when we truncate reality to only that linear approximation, we introduce some error. This error may be tolerable—and thankfully it often is—but the error is ours, not the world’s.
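To make that concrete, here's a small sketch of my own: the data below contain no measurement error at all, yet fitting a line still leaves a nonzero residual, and that error belongs entirely to the model.

    import numpy as np

    # the data contain no measurement error at all: y is exactly x^2
    x = np.linspace(0, 1, 50)
    y = x**2

    # fit the best straight line y ≈ mu + beta*x
    beta, mu = np.polyfit(x, y, 1)
    residuals = y - (mu + beta*x)

    print(beta, mu)               # roughly 1.0 and -0.17
    print(abs(residuals).max())   # nonzero: this error is the model's, not the world's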

Why a little knowledge is a dangerous thing

Alexander Pope famously said

A little learning is a dangerous thing;
Drink deep, or taste not the Pierian spring:
There shallow draughts intoxicate the brain,
And drinking largely sobers us again.

I’ve been thinking lately about why a little knowledge is often a dangerous thing, and here’s what I’ve come to.

Any complex system has many causes acting on it. Some of these are going to be more legible than others. Here I’m using “legible” in a way similar to how James Scott uses the term. As Venkatesh Rao summarizes it,

A system is legible if it is comprehensible to a calculative-rational observer looking to optimize the system from the point of view of narrow utilitarian concerns and eliminate other phenomenology. It is illegible if it serves many functions and purposes in complex ways, such that no single participant can easily comprehend the whole. The terms were coined by James Scott in Seeing Like a State.

People who have a little knowledge of a subject are only aware of some of the major causes that are acting, and probably they are aware of the most legible causes. They have an unbalanced view because they are aware of the forces pushing in one direction but not aware of other forces pushing in other directions.

A naive view may be unaware of a pair of causes in tension, and may thus have a somewhat balanced perspective. And an expert may be aware of both causes. But someone who knows about one cause but not yet about the other is unbalanced.

Examples

When I first started working at MD Anderson Cancer Center, I read a book on cancer called One Renegade Cell. After reading the first few chapters, I wondered why we’re not all dead. It’s easy to see how cancer can develop from one bad cell division and kill you a few weeks later. It’s not as easy to understand why that doesn’t usually happen. The spreading of cancer is more legible than natural defenses against cancer.

I was recently on the phone with a client who had learned enough about data deidentification to become worried. I explained that there were also reasons to not be as worried, but that they’re more complicated, less legible.

What to do

Theories are naturally biased toward causes that are amenable to theory, toward legible causes. Practical experience and empirical data tend to balance out theory by providing some insight into less legible causes.

A little knowledge is dangerous not so much because it is partial but because it is biased; it’s often partial in a particular way, such as theory lacking experience. If you spiral in on knowledge in a more balanced manner, with a combination of theory and experience, you might not be as dangerous along the way.

When theory and reality differ, the fault lies in the theory. More on that in my next post. Theory necessarily leaves out complications, and that’s what makes it useful. The art is knowing which complications can be safely ignored under which circumstances.


Expected value of X and 1/X

Yesterday I blogged about an exercise in the book The Cauchy-Schwarz Master Class. This post is about another exercise from that book, exercise 5.8, which is to prove Kantorovich’s inequality.

Assume

0 < m \leq x_1 \leq x_2 \leq \cdots \leq x_n \leq M < \infty

and

p_1 + p_2 + \cdots + p_n = 1

for non-negative numbers p_i.

Then

\left(\sum_{i=1}^n p_i x_i \right) \left(\sum_{i=1}^n p_i \frac{1}{x_i} \right) \leq \frac{\mu^2}{\gamma^2}

where

\mu = \frac{m+M}{2}

is the arithmetic mean of m and M and

\gamma = \sqrt{mM}

is the geometric mean of m and M.

In words, the weighted average of the x's times the weighted average of their reciprocals is bounded by the square of the ratio of the arithmetic and geometric means of the bounds m and M.

Probability interpretation

I did a quick search on Kantorovich’s inequality, and apparently it first came up in linear programming, Kantorovich’s area of focus. But when I see it, I immediately think expectations of random variables. Maybe Kantorovich was also thinking about random variables, in the context of linear programming.

The left side of Kantorovich’s inequality is the product of the expected value of a discrete random variable X and the expected value of 1/X.

To put it another way, it’s a relationship between E[1/X] and 1/E[X],

\text{E}\left(\frac{1}{X} \right ) \leq \frac{\mu^2}{\gamma^2} \frac{1}{\text{E}(X)}

which I imagine is how it is used in practice.

I don’t recall seeing this inequality used, but it could have gone by in a blur and I didn’t pay attention. But now that I’ve thought about it, I’m more likely to notice if I see it again.

Python example

Here’s a little Python code to play with Kantorovich’s inequality, assuming the random values are uniformly distributed on [0, 1].

    from numpy import random

    x = random.random(6)            # six values, uniform on [0, 1]
    m = min(x)
    M = max(x)
    am = 0.5*(m + M)                # arithmetic mean of m and M
    gm = (m*M)**0.5                 # geometric mean of m and M
    prod = x.mean() * (1/x).mean()  # left side, with equal weights p_i = 1/6
    bound = (am/gm)**2              # right side
    print(prod, bound)

This returned 1.2021 for the product and 1.3717 for the bound.

If we put the code above inside a loop we can plot the product and its bound to get an idea how tight the bound is typically. (The bound is perfectly tight if all the x’s are equal.) Here’s what we get.

All the dots are above the dotted line, so we haven’t found an exception to our inequality.
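For reference, here's a minimal sketch of what that loop might look like; the number of simulations and the plotting details are my own choices, not the original code.

    import numpy as np
    import matplotlib.pyplot as plt

    prods, bounds = [], []
    for _ in range(200):
        x = np.random.random(6)
        m, M = x.min(), x.max()
        prods.append(x.mean() * (1/x).mean())
        bounds.append(((m + M) / (2*np.sqrt(m*M)))**2)

    plt.scatter(prods, bounds, s=10)   # one dot per simulation
    lim = [1, max(bounds)]
    plt.plot(lim, lim, "k:")           # dotted line where bound equals product
    plt.xlabel("product")
    plt.ylabel("bound")
    plt.show()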

(I didn’t think that Kantorovich had made a mistake. If he had, someone would have noticed by now. But it’s worth testing a theorem you know to be true, in order to test that your understanding of the theorem is correct.)


The baseball inequality

baseball game

There’s a theorem that’s often used and assumed to be true but rarely stated explicitly. I’m going to call it “the baseball inequality” for reasons I’ll get to shortly.

Suppose you have two lists of k positive numbers each:

n_1, n_2, n_3, \ldots, n_k

and

d_1, d_2, d_3, \ldots, d_k

Then

\min_{1 \leq i \leq k} \frac{n_i}{d_i} \leq \frac{n_1 + n_2 + n_3 + \cdots + n_k}{d_1 + d_2 + d_3 + \cdots + d_k} \leq \max_{1 \leq i \leq k} \frac{n_i}{d_i}

This says, for example, that the batting average of a baseball team is somewhere between the best individual batting average and the worst individual batting average.

The only place I can recall seeing this inequality stated is in The Cauchy-Schwarz Master Class by Michael Steele. He states the inequality in exercise 5.1 and gives it the batting average interpretation. (Update: This is known as the “mediant inequality.” Thanks to Tom in the comments for letting me know. So the thing in the middle is called the “mediant” of the fractions.)

Note that this is not the same as saying the average of a list of numbers is between the smallest and largest numbers in the list, though that’s true. The batting average of a team as a whole is not the same as the average of the individual batting averages on that team. It might happen to be, but in general it is not.
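As a quick numerical sanity check before the proof (a sketch of my own, not part of the original post):

    import random

    for _ in range(1000):
        k = random.randint(2, 10)
        n = [random.uniform(0.1, 10) for _ in range(k)]
        d = [random.uniform(0.1, 10) for _ in range(k)]
        ratios = [ni/di for ni, di in zip(n, d)]
        mediant = sum(n) / sum(d)
        assert min(ratios) <= mediant <= max(ratios)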

I’ll give a quick proof of the baseball inequality. I’ll only prove the first of the two inequalities. That is, I’ll prove that the minimum fraction is no greater than the ratio of the sums of numerators and denominators. Proving that the latter is no greater than the maximum fraction is completely analogous.

Also, I’ll only prove the theorem for two numerators and two denominators. Once you have proved the inequality for two numerators and denominators, you can bootstrap that to prove the inequality for three numerators and three denominators, and continue this process for any number of numbers on top and bottom. To go from two fractions to three, for example, treat (n1 + n2)/(d1 + d2) as a single fraction and apply the two-fraction case to it and n3/d3.

So we start by assuming

\frac{a}{b} \leq \frac{c}{d}

Then we have

\begin{align*} \frac{a}{b} &= \frac{a\left(1 + \dfrac{d}{b} \right )}{b\left(1 + \dfrac{d}{b} \right )} \\ &= \frac{a + \dfrac{a}{b}d}{b + d} \\ &\leq \frac{a + \dfrac{c}{d}d}{b+d} \\ &= \frac{a + c}{b+d} \end{align*}


Solving for the catenary scale parameter

A catenary with scale a is the graph of the function

f(x; a) = a cosh(x/a) – a.

The x and the a are separated by a semicolon rather than a comma to imply that we think of x as the variable and a as a parameter.

This graph passes through the origin, i.e. for any a, f(0; a) = 0. To find the scaling parameter a we need to specify the value of f at one more point and solve f(x; a) = y. This is the equation I alluded to recently as not having a closed-form solution.

Without loss of generality, we can assume x = 1. Why is that?

Define

g(a) = f(1; a).

Then

f(x′; a′) = y′

if and only if

g(a′/x′) = y′/x′.

So we will assume x = 1 for now and focus on solving for a value of a such that g(a) = y. But we will include Python code shortly that goes back to f, i.e. does not assume x = 1.

The Taylor series for g looks like

g(a) = 1/(2a) + 1/(24a³) + …

and so for large a,

g(a) ≈ 1/(2a).

Here are a couple plots showing how good this approximation is, even for a not that large. First, a plot of g and its approximation.

And here’s a plot of the relative error in approximating g(a) by 1/(2a).

This means that for small y,

g(1/(2y)) ≈ y.

It’s fortunate that we have a convenient approximation when y is small, because in practice y is usually small: catenaries are usually wider than deep, or at least not much deeper than wide.

Since the terms in the Taylor series that we discarded are all positive, we also have a bound

g(1/(2y)) > y.

If we want to solve for a numerically, 1/(2y) makes a good starting guess, and it also makes a left bracket for root-finding methods that require a bracket around the root.

Here’s Python code to solve f(x, a) = y for a, given x and y.

    from numpy import cosh
    from scipy.optimize import root

    def g(a):
        # g(a) = f(1; a)
        return a*cosh(1/a) - a

    def solve_g(y):
        "Solve for a such that g(a) == y."
        assert y > 0
        lower = 0.5/y   # 1/(2y) is a lower bound on the solution
        return root(lambda a: g(a) - y, lower).x[0]

    def solve_catenary(x, y):
        "Solve for a such that a cosh(x/a) - a == y."
        return abs(x)*solve_g(y/abs(x))

Now that we can solve g(a) = y numerically, we can go back and see how far 1/(2y) is from the actual solution for varying values of y.

Recall that the second plot above showed that the relative error in approximating g(a) by 1/(2a) is large when a is small (and thus y is big). But the plot immediately above shows that nevertheless, 1/(2y) is a good guess at a solution to g(a) = y, never off by more than 0.175.

Recall also that we said 1/(2y) is a lower bound on the solution, and could be used as a left bracket in a root-finding method that requires a bracket. The plot above suggests that 0.18 + 1/(2y) would work as a right bracket.
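Here's a sketch of the check behind that last plot, using solve_g from above; the range of y values and the plotting details are my own choices.

    import numpy as np
    import matplotlib.pyplot as plt

    ys = np.linspace(0.05, 10, 500)
    gap = np.array([solve_g(y) - 0.5/y for y in ys])

    plt.plot(ys, gap)
    plt.xlabel("y")
    plt.ylabel("a - 1/(2y)")
    plt.show()
    print(gap.max())   # the post reports a maximum near 0.175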


Alphabets and Unicode

ASCII codes may seem arbitrary when you’re looking at decimal values, but they make more sense in hex [1]. For example, the ASCII value for 0 is 48. Why isn’t it zero, or at least a number that ends in zero? Well it is, in hex: 0x30. And the codes are in consecutive order, so the ASCII value of a digit d is d + 0x30.

There are also patterns in ASCII codes for letters, and this post focuses on these patterns and their analogies in the Unicode values assigned to other alphabets.
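For example, here's a quick check of the digit offset and the base codes for letters (a sketch of my own, not from the post):

    # digits: '0' through '9' are 0x30 through 0x39
    for d in range(10):
        assert ord(str(d)) == d + 0x30

    # letters follow similar patterns: 'A' starts at 0x41, 'a' at 0x61
    print(hex(ord("0")), hex(ord("A")), hex(ord("a")))   # 0x30 0x41 0x61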


How much does it matter if the measuring tape sags?

couple measuring a wall

There are a couple ways in which a measurement might not be straight. Yesterday I wrote a blog post about not measuring straight toward your target. You’d like to measure from (0, 0) to (x, 0), but something is in the way, and so you measure from (0, 0) to (x, y), where y is small relative to x. We assumed the tape measure was straight, but didn’t aim straight at the target.

In this post we will consider measuring from (0, y) to (x, y), but with the tape measure sagging to (x/2, 0) in the middle. We end exactly at our target, but the tape bends. As before, we’ll assume y is small relative to x. About how large will our error be as a function of y?

My first thought was to assume our tape measure takes the shape of a catenary, because that’s the shape of a hanging cable. But the resulting calculations are too complicated. The calculations depend on solving for a scaling factor, and that calculation cannot be done in closed form.

Since we are assuming the amount of sag y is small relative to the horizontal distance x, a parabola will do just as well as a catenary. (More on how well a parabola approximates a catenary here.)

The equation of the parabola passing through our three specified points, as a function of t, is

4y(t − x/2)² / x².

Even with our simplifying assumption that the tape bends like a parabola, the arclength calculation is still a little tedious, but if I’ve done things correctly the result turns out to be

x + 8y²/(3x) + higher order terms

So the error is on the order of (8/3) y²/x, whereas in the earlier post the error was on the order of (1/2) y²/x. If y is small relative to x, the error is still small, but about 5x larger than before.

Going back to our example of measuring a 10 ft (120 inch) room, we want to measure from (0, 0) to (120, 0). If we measure with a straight line from (0, 0) to (120, 4) instead, there will be an error of about 1/15 of an inch. If instead we measure from (0, 4) to (120, 4) with a tape that sags like a parabola touching (60, 0), then the error will be closer to a third of an inch. But if we could pull tight enough to limit the sag to 1 inch, the measurement error would be more like 1/45 of an inch.
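As a sanity check on the arc length expansion, here's a small numerical sketch of my own, comparing the exact parabolic arc length with the approximation:

    import numpy as np
    from scipy.integrate import quad

    x, y = 120.0, 4.0                  # 10 ft span, 4 inch sag

    def p_prime(t):
        # derivative of the parabola 4y(t - x/2)^2 / x^2
        return 8*y*(t - x/2)/x**2

    arc, _ = quad(lambda t: np.sqrt(1 + p_prime(t)**2), 0, x)
    print(arc - x, 8*y**2/(3*x))       # both are about 0.35 inches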

Related posts

I inadvertently ended up writing three blog posts in a row related to measuring tapes. Here are the other two: