How long does it take to find large primes?

Descriptions of cryptography algorithms often start with something like “Let p be a large prime” or maybe more specifically “Let p be a 256 bit prime.” How hard is it to find large prime numbers?

Let’s look at the question in base b, so we can address binary and base 10 at the same time. The prime number theorem says that for large n, the number of primes less than b^n is approximately

b^n / (n log b)

and so the probability that an n-digit number base b is prime is roughly 1/(n log b). If you select an odd n-digit number, you double your chances.

If you generate odd n-digit numbers at random, how long would it be until you find a prime? You might get lucky and find one on your first try. Or you might have extraordinarily bad luck and find a composite every time until you give up.

The number of tries until you find a prime is a geometric random variable with

p = 2/(n log b).

The expected value is 1 / p.

So, for example, if you wanted to find a 256-bit prime, you would need to test on average 256 log(2) / 2 ≈ 89 candidates. And if you wanted to find a 100-digit prime, you’d need to test on average 100 log(10) / 2 ≈ 115 candidates.

You could get more accurate bounds by using a refined variation of the prime number theorem. Also, we implicitly looked at the problem of finding a number with at most n digits rather than exactly n digits. Neither of these would make much difference to our final result. To back up this claim, let’s compare our estimate to a simulation that actually looks for primes.

histogram

The histogram above was based on finding 10,000 primes, each with 100 digits, and counting how many candidates were evaluated each time. The plot matches a geometric distribution well. The mean from the simulation was 113.5, close to the 115 estimated above.
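
Here’s a sketch of the kind of simulation involved. The details (drawing candidates with Python’s random module and testing them with sympy’s isprime) are my own choices, not necessarily what produced the histogram above.

    from random import randrange
    from sympy import isprime

    def candidates_until_prime(digits=100):
        # count how many random odd candidates we test before hitting a prime
        count = 0
        while True:
            count += 1
            n = randrange(10**(digits-1), 10**digits) | 1  # random odd n-digit number
            if isprime(n):
                return count

    # this takes a while: on the order of a million primality tests in all
    counts = [candidates_until_prime() for _ in range(10_000)]
    print(sum(counts) / len(counts))  # should land near 100 log(10) / 2, about 115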

For cryptographic applications, you often don’t want just any prime. You might want a safe prime or a prime satisfying some other property, in which case you’ll need to test more primes. But your probability of success will likely still be geometrically distributed.

It’s plausible that for large p, p being prime and 2p + 1 being prime are independent events, at least approximately, and a numerical experiment suggests that’s true. I generated 10,000 primes, each 100 digits long, and 48 were safe primes, about what you’d expect from theory.
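
Here’s a sketch of that experiment. Using sympy’s randprime is an assumption on my part; any way of generating random 100-digit primes would do.

    from sympy import isprime, randprime

    # generate 100-digit primes p and count how often 2p + 1 is also prime (slow)
    reps = 10_000
    safe = sum(isprime(2*randprime(10**99, 10**100) + 1) for _ in range(reps))
    print(safe)  # the run described above found 48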

Schnorr groups

Claus Schnorr

Schnorr groups, named after Claus Schnorr, are multiplicative subgroups of a finite field. They are designed for cryptographic application, namely as a setting for hard discrete logarithm problems.

The application of Schnorr groups to Schnorr signatures was patented, but the patent ran out in 2008.

There has been a proposal to include Schnorr signatures in Bitcoin, but it has not been accepted at this point. (The proposal would use Schnorr signatures but over an elliptic curve rather than a Schnorr group.)

Group construction

Pick a large prime p. Then the integers mod p form a finite field. The non-zero elements form an Abelian group of order p – 1 under multiplication. Now p – 1 is composite (it is even), and so it can be factored as qr where q is prime. We want q to be large as well because it will be the order of our Schnorr group.

Pick any h such that h^r is not congruent to 1 mod p. Then g = h^r is a generator of our Schnorr group. Why does g have order q? A simple calculation shows

g^q \equiv h^{qr} \equiv h^{p-1} \equiv 1 \pmod p

where the last step follows from Fermat’s little theorem. This shows that the order of g divides q. Since q is prime, the order is either 1 or q, and by construction g is not congruent to 1 (mod p), so the order must be q.

Python code

Here’s a toy example using small numbers. Let p = 23, q = 11 and r = 2. Then we can pick h to be any number whose square is not congruent to 1 (mod 23), i.e. any number except 1 or 22. Say we pick h = 7. Then g = 49 = 3 (mod 23).

Here’s a Python script to verify our claims and print out the elements of our Schnorr group.

    from sympy import isprime
    
    p, q, r, h = 23, 11, 2, 7
    
    assert(isprime(p))
    assert(isprime(q))
    assert(p-1 == q*r)
    
    g = h**r % p
    assert(g != 1)
    assert(g**q % p == 1)
    
    for i in range(q):
        print( g**i % p )

This shows that our group elements are {1, 3, 9, 4, 12, 13, 16, 2, 6, 18, 8}.

In theory we could use the same script with much larger numbers, but in that case we wouldn’t want to print out the elements. Also, the code would take forever: we naively computed a^b (mod m) by computing a^b first, then taking the remainder modulo m. It is far more efficient to reduce modulo m along the way. That’s what Python’s three-argument pow function does.

Here’s a revised version that runs quickly with large numbers.

    from sympy import isprime
    
    p = 115792089237316195423570985008687907852837564279074904382605163141518161494337
    q = 341948486974166000522343609283189
    r = 338624364920977752681389262317185522840540224
    h = 3141592653589793238462643383279502884197
    
    assert(isprime(p))
    assert(isprime(q))
    assert(p-1 == q*r)
    
    g = pow(h, r, p)
    assert(g != 1)
    assert(pow(g, q, p) == 1)

Photo of Claus Schnorr by Konrad Jacobs CC BY-SA 2.0 de

Bitcoin key mechanism and elliptic curves over finite fields

Bitcoin uses the Elliptic Curve Digital Signature Algorithm (ECDSA) based on elliptic curve cryptography. The particular elliptic curve is known as secp256k1, which is the curve

y² = x³ + 7

over a finite field to be described shortly.

graph of elliptic curve y^2 = x^3 + 7

Addition on elliptic curves in the plane is defined geometrically in terms of where lines intersect the curve. We won’t go into the geometry here, except to say that it boils down to a set of equations involving real numbers. But we’re not working over real numbers; we’re working over a finite field.

Finite field modulus

The idea is to take the equations motivated by the geometry in the plane then use those equations to define addition when you’re not working over real numbers but over a different field. In the case of secp256k1, the field is the finite field of integers mod p where

p = 2^256 – 2^32 – 977

Here p was chosen to be relatively close to 2^256. It’s not the largest prime less than 2^256; there are a lot of primes between p and 2^256. Other factors also went into the choice of p. Note that we’re not working in the integers mod p per se; we’re working in an Abelian group whose addition law is defined by an elliptic curve over the integers mod p.

Base point

Next, we pick a base point g on the elliptic curve. The standard defining secp256k1 says that g is

0279BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798

in “compressed form” or

0479BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

in “uncompressed form”.

The base point is a specially chosen point on the elliptic curve, and so it is a pair of numbers mod p, not a single number. How do you extract x and y from these compressed or uncompressed forms?

Compressed form

The compressed form only gives x and you’re supposed to solve for y. The uncompressed form gives you x and y. However, the numbers are slightly encoded. In compressed form, the string starts with either “02” or “03” and the rest of the string is the hexadecimal representation of x. There will be two values of y satisfying

y² = x³ + 7 mod p

and the “02” or “03” prefix tells you which one to pick. If the compressed form starts with 02, pick the root that is even, and if it starts with 03, pick the root that is odd. (The two roots will add to p, and p is odd, so one of the roots will be even and one will be odd.)

Uncompressed form

The uncompressed form will always start with 04. After this follow the hex representations of x and y concatenated together.

In either case we have

x = 79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798

and

y = 483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

We can verify this with a little Python code:

    x = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
    y = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
    p = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F
    assert((y*y - x*x*x - 7) % p == 0)
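
Here’s a sketch of how you could recover y from the compressed form. It uses the fact that p ≡ 3 (mod 4), so a modular square root can be computed with a single exponentiation; that’s a standard trick, not something spelled out in the encoding itself.

    p = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F
    compressed = "0279BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798"
    prefix, x = compressed[:2], int(compressed[2:], 16)

    y = pow(x**3 + 7, (p + 1) // 4, p)  # a square root of x^3 + 7 mod p, since p = 3 mod 4
    if (y % 2 == 0) != (prefix == "02"):
        y = p - y                       # switch to the root with the right parity
    print(hex(y))                       # matches the y from the uncompressed form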

Exponentiation over elliptic curve

Starting with our base point g, define kg to be g added to itself k times. Note again that the sense of “addition” here is addition in the elliptic curve, not addition in the field of integers mod p. The key to elliptic curve cryptography is that kg can be computed efficiently, but solving for k starting from the product kg cannot. You can compute kg using the fast exponentiation algorithm, but solving for k requires computing discrete logarithms. (This is the ECDLP: Elliptic Curve Discrete Logarithm Problem.)
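
Here’s a sketch of what computing kg looks like, using the textbook addition formulas for y² = x³ + 7 over the integers mod p. This is a toy illustration, not how a production Bitcoin library does it; real implementations worry about constant-time execution and many edge cases.

    p = 2**256 - 2**32 - 977

    def ec_add(P, Q):
        # add two points on y^2 = x^3 + 7 mod p; None is the point at infinity
        if P is None: return Q
        if Q is None: return P
        x1, y1 = P
        x2, y2 = Q
        if x1 == x2 and (y1 + y2) % p == 0:
            return None                                  # P + (-P) is the identity
        if P == Q:
            lam = 3 * x1 * x1 * pow(2 * y1, -1, p) % p   # slope of the tangent line
        else:
            lam = (y2 - y1) * pow(x2 - x1, -1, p) % p    # slope of the chord
        x3 = (lam*lam - x1 - x2) % p
        y3 = (lam*(x1 - x3) - y1) % p
        return (x3, y3)

    def ec_mult(k, P):
        # double-and-add, the elliptic curve analog of fast exponentiation
        result = None
        while k:
            if k & 1:
                result = ec_add(result, P)
            P = ec_add(P, P)
            k >>= 1
        return result

    g = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
         0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
    kg = ec_mult(12345, g)
    assert (kg[1]**2 - kg[0]**3 - 7) % p == 0  # the result is still on the curve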

Why is this called “exponentiation” and not “multiplication”? Arithmetic on the elliptic curve is commutative, and in commutative (i.e. Abelian) groups the group operation is usually denoted as addition. And repeated addition is called multiplication.

But in general group theory, the group operation is denoted as multiplication, and repeated application of the group operation is called  exponentiation. It’s conventional to use the general term “exponentiation” even though over an Abelian group it makes more sense to call it multiplication.

You undo exponentiation by taking logarithms, so the process of solving for k is called the discrete logarithm problem. The security of elliptic curve cryptography depends on the difficulty of computing discrete logarithms.

Counting bits of security

The best algorithms for solving the discrete logarithm problem in a group of size n currently require O(√n) operations. How big is n in our case?

The base point g was chosen to have a large order, and in fact its order is approximately 2^256. Specifically, the order of g written in hexadecimal is

n = FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141.

This means that we get approximately 256/2 = 128 bits of security because √(2^256) = 2^128.

Currying in calculus, PDEs, programming, and categories

logician Haskell Brooks Curry

Currying is a simple but useful idea. (It’s named after logician Haskell Curry [1] and has nothing to do with spicy cuisine.) If you have a function of two variables, you can think of it as a function of one variable that returns a function of one variable. So starting with a function f(x, y), we can think of this as a function that takes a number x and returns a function f(x, -) of y. The dash is a placeholder as in this recent post.

Calculus: Fubini’s theorem

If you’ve taken calculus then you saw this in the context of Fubini’s theorem:

\int_a^b\!\int_c^d f(x, y) \, dx \, dy= \int_a^b\left(\int_c^d f(x, y)\, dx\right )\,dy

To integrate the function of two variables f(x, y), you can temporarily fix y and integrate the remaining function of x. This gives you a number, the value of an integral, for each y, so it’s a function of y. Integrate that function, and you have the value of the original function of two variables.

The first time you see this you may think it’s a definition, but it’s not. You can define the integral on the left directly, and it will equal the result of the two nested integrations on the right. Or at least the two sides will often be equal. The conditions on Fubini’s theorem tell you exactly when the two sides are equal.

PDEs: Evolution equations

A more sophisticated version of the same trick occurs in partial differential equations. If you have an evolution equation, a PDE for a function of one time variable and several space variables, you can think of it as an ODE via currying. For each time value t, you get a function of the spatial variables. So you can think of your solution as a path in a space of functions. The spatial derivatives specify an operator on that space of functions.

(I’m glossing over details here because spelling everything out would take a lot of writing, and might obscure the big idea, which is relevant for this post. If you’d like the full story, you can see, for example, my graduate advisor’s book. It was out of print when I studied it, but now it’s a cheap Dover paperback.)

Haskell programming

In the Haskell programming language (also named after Haskell Curry) you get currying for free. In fact, there’s no other way to express a function of two variables. For example, suppose you want to implement the function f(x, y) = x² + y.

    Prelude> f x y = x**2 + y

Then Haskell thinks of this as a function of one variable (i.e. x), that returns a function of one variable (i.e. f(x, -)) which itself returns a number (i.e. f(x, y)). You can see this by asking the REPL for the type of f:

    Prelude> :info f
    f :: Floating a => a -> a -> a

Technically, Haskell, just like lambda calculus, only has functions of one variable. You could create a product datatype consisting of a pair of variables and have your function take that as an argument, but it’s still a function of one variable, though that variable is not atomic.
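
For comparison, here’s a rough Python analogue. Python doesn’t curry automatically the way Haskell does, but you can get the same effect with a closure or with functools.partial.

    from functools import partial

    def f(x):
        # returns the one-variable function f(x, -)
        return lambda y: x**2 + y

    print(f(3)(4))  # 13

    def g(x, y):
        return x**2 + y

    g3 = partial(g, 3)  # g(3, -)
    print(g3(4))        # 13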

Category theory

The way you’d formalize currying in category theory is to say that the following is a natural isomorphism:

\mathrm{Hom}(A \times B, C) \cong \mathrm{Hom}(A, \mathrm{Hom}(B, C))

For more on what Hom means, see this post.

[1] In concordance with Stigler’s law of eponymy, currying was not introduced by Curry but by Gottlob Frege. It was then developed by Moses Schönfinkel and developed further by Haskell Curry.

Equal Earth map projection

There’s no perfectly satisfying way to map the globe on to a flat surface. Every projection has its advantages and disadvantages. The Mercator projection, for example, is much maligned for the way it distorts area, but it has the property that lines of constant bearing correspond to straight lines on the map. Obviously this is convenient if you’re sailing without GPS. But for contemporary use, say in a classroom, minimizing area distortion is often a higher priority than keeping bearing lines straight.

Bojan Šavrič, Tom Patterson, and Bernhard Jenny have developed a new map projection called Equal Earth that nicely balances several competing criteria, including aesthetics.

Equal earth projection

The Equal Earth projection satisfies the following mathematical criteria:

  1. Equal area: The area of a region on the globe is proportional to the area of its projection.
  2. Straight parallels: Lines of equal latitude project on to horizontal lines.
  3. Regularly distributed meridians: At a given latitude, the spacing between lines of longitude is equal.
  4. Bilateral symmetry: The projection is symmetric with respect to the x-axis (equator) and y-axis (prime meridian).

Here are the equations for the Equal Earth projection.

\begin{eqnarray*} \sin\theta &=& \frac{\sqrt{3}}{2} \sin \phi \\ x &=& \frac{2\sqrt{3}}{3} \frac{\lambda \cos \theta}{a_1+ 3a_2 \theta^2 + \theta^6(7a_3 + 9a_4\theta^2)} \\ y &=& \theta (a_1 + a_2\theta^2 + \theta^6(a_3 + a_4\theta^2)) \end{eqnarray*}

Here λ and φ are longitude and latitude respectively. The parametric latitude θ is introduced for convenience and is simply a rescaling of latitude φ.

The parameters ai are given below.

    a_1 =  1.340264 
    a_2 = -0.081106 
    a_3 =  0.000893 
    a_4 =  0.003796 
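
Here’s a small Python sketch of these equations, mapping longitude and latitude (in radians) to projected coordinates. It’s a direct transcription of the formulas above, not the authors’ reference implementation.

    import numpy as np

    A1, A2, A3, A4 = 1.340264, -0.081106, 0.000893, 0.003796

    def equal_earth(lam, phi):
        # lam = longitude, phi = latitude, both in radians
        theta = np.arcsin(np.sqrt(3)/2 * np.sin(phi))  # parametric latitude
        t2 = theta**2
        t6 = t2**3
        x = 2*np.sqrt(3)/3 * lam * np.cos(theta) / (A1 + 3*A2*t2 + t6*(7*A3 + 9*A4*t2))
        y = theta*(A1 + A2*t2 + t6*(A3 + A4*t2))
        return x, y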

You can find more in their paper: Bojan Šavrič, Tom Patterson, and Bernhard Jenny. The Equal Earth map projection. International Journal of Geographical Information Science. https://doi.org/10.1080/13658816.2018.1504949.

The other butterfly effect

monarch butterfly

The original butterfly effect

The butterfly effect is the semi-serious claim that a butterfly flapping its wings can cause a tornado half way around the world. It’s a poetic way of saying that some systems show sensitive dependence on initial conditions, that the slightest change now can make an enormous difference later. Often this comes up in the context of nonlinear, chaotic systems but it isn’t limited to that. I give an example here of a linear differential equation whose solutions start out essentially the same but eventually diverge completely.

Once you think about these things for a while, you start to see nonlinearity and potential butterfly effects everywhere. There are tipping points everywhere waiting to be tipped!

The other butterfly effect

But a butterfly flapping its wings usually has no effect, even in sensitive or chaotic systems. You might even say especially in sensitive or chaotic systems.

Sensitive systems are not always and everywhere sensitive to everything. They are sensitive in particular ways under particular circumstances, and can otherwise be quite resistant to influence. Not only may a system be insensitive to butterflies, it may even be relatively insensitive to raging bulls. The raging bull may have little more long-term effect than a butterfly. This is what I’m calling the other butterfly effect.

Steering complex systems

In some ways chaotic systems are less sensitive to change than other systems. And this can be a good thing. Resistance to control also means resistance to unwanted tampering. Chaotic or stochastic systems can have a sort of self-healing property. They are more stable than more orderly systems, though in a different way.

The lesson that many people draw from their first exposure to complex systems is that there are high leverage points, if only you can find them and manipulate them. They want to insert a butterfly at just the right time and place to bring about a desired outcome. Instead, we should humbly evaluate to what extent it is possible to steer complex systems at all. We should evaluate what aspects can be steered and how well they can be steered. The most effective intervention may not come from tweaking the inputs but from changing the structure of the system.

How many ways to make rock, paper, scissors, lizard, Spock?

In The Big Bang Theory, Sheldon Cooper explains an extension of the game Rock, Paper, Scissors by introducing two more possibilities, Lizard and Spock. Sam Kass and Karen Bryla invented the game before it became widely known via the television show.

The diagram below summarizes the rules:

Rules of rock, paper, scissors, lizard, spock

Alternative rules

Imagine yourself in the position of Sam Kass and Karen Bryla designing the game. You first try adding one extra move, but it turns out that’s not possible.

The main property of Rock, Paper, Scissors is that no player has a winning strategy. That implies you can only add an even number of extra moves, keeping the total number of moves odd. That way, for every move one player makes, the other player has an equal number of winning and losing moves. Otherwise some moves would be better and others worse. So you can’t add just one move, say Lizard. You have to add two (or four, or six, …).

How many ways could you assign rules to an extension of Rock, Paper, Scissors adding two more moves?

Number the possible moves 1 through 5. We will make a table of which moves beat which, with +1 in the (i, j) position if move i beats move j and -1 if j beats i. There will be zeros on the diagonal since it’s a tie if both players make the same move.

Let’s start with the original Rock, Paper, Scissors. In order for this game to not have a winning strategy, the table must be filled in as below, with the only option being to set a = 1 or a = -1.

|---+----+----+----|
|   |  1 |  2 |  3 |
|---+----+----+----|
| 1 |  0 |  a | -a |
|---+----+----+----|
| 2 | -a |  0 |  a |
|---+----+----+----|
| 3 |  a | -a |  0 |
|---+----+----+----|

If 1, 2, and 3 correspond to Rock, Paper, and Scissors, then a = 1 according to the usual rules, but we’ll allow the possibility that the usual rules are reversed. (If you don’t like that, just set a = 1).

Next we fill in the rest of the moves. The table must be skew-symmetric, i.e. the (i, j) element must be the negative of the (j, i) element, because if (i, j) is a winning move then (j, i) is a losing move and vice versa. Also, the rows and columns must sum to zero. Together these requirements greatly reduce the number of possibilities.

|---+----+----+----+--------+--------|
|   |  1 |  2 |  3 |      4 |      5 |
|---+----+----+----+--------+--------|
| 1 |  0 |  a | -a |      b |     -b |
|---+----+----+----+--------+--------|
| 2 | -a |  0 |  a |      c |     -c |
|---+----+----+----+--------+--------|
| 3 |  a | -a |  0 |      d |     -d |
|---+----+----+----+--------+--------|
| 4 | -b | -c | -d |      0 |  b+c+d |
|---+----+----+----+--------+--------|
| 5 |  b |  c |  d | -b-c-d |      0 |
|---+----+----+----+--------+--------|

The values of a, b, and c may each be chosen independently to be 1 or -1. The only remaining requirement is that the (4, 5) entry, b + c + d, also be 1 or -1. If b + c = 0, then d can be chosen freely. Otherwise b and c have the same sign, and d must have the opposite sign. So all together there are 12 possibilities (6 if you insist a = 1). These are listed below.

|---+---+---+---|
| a | b | c | d |
|---+---+---+---|
| + | + | + | - |
| + | + | - | + |
| + | + | - | - |
| + | - | + | + |
| + | - | + | - |
| + | - | - | + |
| - | + | + | - |
| - | + | - | + |
| - | + | - | - |
| - | - | + | + |
| - | - | + | - |
| - | - | - | + |
|---+---+---+---|

The version of the rules used on The Big Bang Theory corresponds to the second row of the table above: a = b = d = 1 and c = -1.
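
Here’s a short brute-force check of the count (my own verification, not part of the original argument): fill in the seven undetermined upper-triangle entries of the 5 by 5 table every possible way, impose skew symmetry, and count the tables whose rows all sum to zero.

    from itertools import product
    import numpy as np

    count = 0
    for a in (1, -1):
        for vals in product((1, -1), repeat=7):
            M = np.zeros((5, 5), dtype=int)
            M[0, 1], M[0, 2], M[1, 2] = a, -a, a    # the original three-move game
            (M[0, 3], M[0, 4], M[1, 3], M[1, 4],
             M[2, 3], M[2, 4], M[3, 4]) = vals      # the entries for the two new moves
            M = M - M.T                             # skew symmetry fills in the rest
            if (M.sum(axis=1) == 0).all():
                count += 1
    print(count)  # 12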

Simpler solution

Here’s another way to count the number of possible designs. Suppose we start with traditional Rock, Paper, Scissors, corresponding to a = 1 in the notation above. Now let’s add the rules for Lizard. We can pick any two things from {Rock, Paper, Scissors, Spock} and say that Lizard beats them, and then the other two things must beat Lizard. There are 6 ways to choose 2 things from a set of 4.

Once we’ve decided the rules for Lizard, we have no choice regarding Spock. Spock’s rules must be the opposite of Lizard’s rules in order to balance everything out. If we decide Lizard beats Rock, for example, then Rock must beat Spock, so that two things beat Rock and Rock beats two things.

If we’re willing to consider reversing the usual rules of Rock, Paper, Scissors, i.e. setting a = -1, then there are 6 more possibilities, for a total of 12.

Adding two more moves

By the way, we can see from the approach above how to add more moves. If we wanted to add Stephen Hawking and Velociraptor to our game, then we have 20 choices: we choose 3 things out of 6 for Hawking to beat, and the rest of the rules are determined by these choices. Velociraptor has to be the anti-Hawking. If we decide that Hawking beats Paper, Lizard, and Spock, then we’d get the rules in the diagram below.

Extending rock, paper, scissors, lizard, Spock with two more moves

Fair subsets

You might want to design the game so that for any subset of three moves you have a game with no winning strategy. Here’s an example why. If the subset (1, 2, 4) is a fair game, then a = c. But if the subset (2, 3, 4) is a fair game, then a = -c. So one of the two games must have a winning strategy.

Graphing the rules

The first graph above was made with Graphviz using the code below.

digraph rock {
    node [fontname = "helvetica"]

    "Rock"     -> "Scissors" 
    "Rock"     -> "Lizard"
    "Paper"    -> "Rock"
    "Paper"    -> "Spock"
    "Scissors" -> "Paper"
    "Scissors" -> "Lizard"
    "Lizard"   -> "Spock"
    "Lizard"   -> "Paper"
    "Spock"    -> "Rock"
    "Spock"    -> "Scissors"

    layout = circo
}

Save the code to a file, say rock.gv, then run the command

dot -Tpng rock.gv > rock.png

to produce a PNG file.

Hom functors and a glimpse of Yoneda

Given two objects A and B, Hom(A, B) is simply the set of functions between A and B. From this humble start, things get more interesting quickly.

Hom sets

To make the above definition precise, we need to say what kinds of objects and what kinds of functions we’re talking about. That is, we specify a category C that the objects belong to, and the functions are the morphisms of that category [1]. For example, in the context of groups, Hom(A, B) would be the set of group homomorphisms [2] between A and B, but in the context of continuous groups (Lie groups), we would restrict Hom(A, B) to be continuous group homomorphisms.

To emphasize that Hom refers to a set of morphisms in a particular category, sometimes you’ll see the name of the category as a subscript, as in Hom_C(A, B). Sometimes you’ll see the name of the category as a replacement for Hom, as in C(A, B). You may also see the name of the category in square braces as in [C](A, B).

Hom functors

So far Hom has been a set, but you’ll also see Hom as a functor. This is where the notation takes a little more interpretation. You may see a capital H with objects as superscripts or subscripts:

\begin{eqnarray*} H^A &=& \mathrm{Hom}(A, -) \\ H_A &=& \mathrm{Hom}(-, A) \end{eqnarray*}

You may also see a lower case h instead. And I’ll use the name of the category instead of “Hom” just to throw that variation in too.

\begin{eqnarray*} h^A &=& {\cal C}(A, -) \\ h_A &=& {\cal C}(-, A) \end{eqnarray*}

I’ll explain what these mean and give a mnemonic for remembering which is which.

Action on objects

The dash in Hom(A, -) and Hom(-, A) means “insert your object here.” That is, Hom(A, -) takes an object B and maps it to the set of morphisms Hom(A, B), and Hom(-, A) takes an object B to Hom(B, A).

So Hom(A, -) and Hom(-, A) each take an object in the category C to a set of morphisms, i.e. an object in the category Set. But that’s only half of what it takes to be a functor. A functor not only maps objects in one category to objects in another category, it also maps morphisms in one category to morphisms in the other. (And it does so in a way that the map of objects and the map of morphisms interact coherently.)

Action on morphisms

Where do the morphisms come in? We’ve established what H^A and H_A do to objects, but what do they do to morphisms?

Suppose we have a morphism f: X → Y in the category. How does H^A turn f into a function from Hom(A, X) to Hom(A, Y)? For each g in Hom(A, X), the composition fg is an element of Hom(A, Y).

Next suppose f: Y → X (note the reversed order) and take a function g in Hom(X, A). Now the composition gf is an element of Hom(Y, A). Note that before we applied f after g, but here we pre-compose with f, i.e. apply f before g.

Aside from the notation, what’s going on is very simple. If you have a function from A to X and a function from X to Y, then the composition of the two is a function from A to Y. Similarly, if you have a function from Y to X and a function from X to A, the composition is a function from Y to A.
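
Here’s a toy Python sketch of the two actions, with ordinary functions standing in for morphisms. It’s only meant to make the direction of composition concrete, not to model category theory faithfully.

    def compose(f, g):
        return lambda x: f(g(x))

    def hom_from_A(f):      # Hom(A, -) applied to f: X -> Y
        return lambda g: compose(f, g)   # sends g: A -> X to fg: A -> Y

    def hom_into_A(f):      # Hom(-, A) applied to f: Y -> X
        return lambda g: compose(g, f)   # sends g: X -> A to gf: Y -> A

    length = lambda s: len(s)   # a morphism from strings to ints
    double = lambda n: 2*n      # a morphism from ints to ints
    print(hom_from_A(double)(length)("hello"))  # 10, i.e. double(length("hello"))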

Note that H^A is a covariant functor, but H_A is a contravariant functor. More on covariant vs contravariant below.

Notation mnemonic

How can you keep H^A and H_A straight? It’s common to use the superscript notation Y^X to indicate the set of functions from the superscript object X to the base object Y. You may have seen this before even if you don’t think you have.

The notation Y² denotes the product of Y with itself, such as R² for the plane, i.e. pairs of real numbers. A pair of things in Y is a function from a two-element set to Y. You could think of (y1, y2) as the result of mapping the set {1, 2} into Y.

You may also have seen the notation 2^X for the power set of X, i.e. the set of all subsets of X. You could think of the power set of X as the set of maps from X to the Boolean values {true, false}, where an element x is mapped to true if and only if x is in that particular subset.

The notation using H or h with a superscript A stands for Hom(A, -), i.e. morphisms out of A, which is consistent with the usage described above. And so the other is the other, i.e. a subscript A stands for Hom(-, A), i.e. morphisms into A.

(Unfortunately, some authors use the opposite of the convention used here, which blows the whole mnemonic away. But the convention used here is most common.)

Yoneda lemma

We’re close to being able to state one of the most important theorems in category theory, the Yoneda lemma. (Lemmas often turn out to be more useful and better known than the theorem they were first used to prove.) So while we’re in the neighborhood, we’ll take a look.

A corollary of the Yoneda lemma says

\begin{eqnarray*} \mathrm{Hom}(H^A, H^B) &\cong& \mathrm{Hom}(B, A) \\ \mathrm{Hom}(H_A, H_B) &\cong& \mathrm{Hom}(A, B) \\ \end{eqnarray*}

The meaning of “Hom” is different on the left and right because we’re looking at morphisms between different kinds of objects. On the right we have sets of morphisms in our category C as above. The left side takes more explanation.

What kind of things are H^A and H^B? They are functors from C to Set. The class of functors between two categories forms a category itself. The functors are the objects in this new category, and natural transformations are the morphisms. So Hom in this context is the set of natural transformations between the two functors.

What kind of things are H_A and H_B? They are contravariant functors from C to Set, and contravariant functors also form a category. However, contemporary category theory doesn’t like to speak of contravariant functors, preferring to only work with covariant functors, and leaving the term “covariant” implicit. So rather than saying H_A and H_B are contravariant functors on C, most contemporary writers would say they are (covariant) functors on a new category C^op, where “op” stands for opposite. That is, C^op is a category with all the same objects as C, but with all the arrows turned around. Every morphism from A to B in C corresponds to a morphism from B to A in C^op.

[1] Morphisms are a generalization of functions. Often morphisms are functions, but they might not be. But still, they have to be things that compose like functions.

[2] The name Hom is a shortened form of “homomorphism.” Different areas of math have different names for structure-preserving functions, and category theory wanted to have one name for them all. It used “Hom” as an allusion to what structure-preserving functions are called in group theory. Similarly, “morphism” is also a shortened form of “homomorphism.” I suppose the goal was to use names reminiscent of group theory, but different enough to remind the reader that the category theory counterparts are distinct.

Incidentally, “homomorphism” comes from the Greek roots meaning “similar” and “shape.” A homomorphism is a function between similar objects (objects in the same category) that preserves structure (shape).

The cold start problem

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to create adaptive clinical trial designs, such as clinical trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Thanks to the magic of Bayes theorem, you don’t have to have an ad hoc rule that says “Treat the first patient this way, then turn on the Bayesian machine to determine how to treat the next patient.” No, you use Bayes theorem from beginning to end. There’s no need to handle the first patient differently because expert opinion is already there, captured in the form of prior distributions (and the structure of the probability model).

Each patient is treated according to all information available at the time. At first all available information is prior information. After you have data on one patient, most of the information you have is still prior information, but Bayes’ theorem updates this prior information with your lone observation. As more data becomes available, the Bayesian machine incorporates it all, automatically shifting weight away from the prior and toward the data.
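
Here’s a minimal sketch of the idea with a beta-binomial model and made-up numbers: the prior speaks for the first patient, and the data gradually take over as observations accumulate.

    # prior: expert opinion encoded as a Beta(3, 7), i.e. roughly a 30% response rate
    a, b = 3, 7
    outcomes = [1, 0, 0, 1, 1]  # hypothetical responses, one per patient

    print("prior mean:", a / (a + b))
    for i, y in enumerate(outcomes, start=1):
        a, b = a + y, b + (1 - y)   # Bayes' theorem for the beta-binomial model
        print(f"after patient {i}: posterior mean = {a / (a + b):.3f}")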

The cold start problem for business applications is easier than the cold start problem for clinical trials. First of all, most business applications don’t have the potential to cost people their lives. Second, business applications typically have fewer competing criteria to balance.

What if you’re not sure where to draw your prior information from? Bayes can handle that too. You can use Bayesian model selection or Bayesian model averaging to determine which source (or weighting of sources) best fits the new data as it comes in.

Once you’ve decided to use a Bayesian approach, there’s still plenty of work to do, but the Bayesian approach provides scaffolding for that work, a framework for moving forward.

Deconstructing a parametric plot

Today’s exponential sum looks like three octagons, slightly rotated with respect to each other.

exponential sum 2018-08-03

I thought that the graph was tracing out one octagon, then another, then another. But that’s not what it’s doing at all. You can see for yourself by going to the exponential sum page and clicking the animate link.

The curve is being traced out in back-and-forth strokes. It’s drawing thin wedges, not octagons. You can get a better picture of this by looking at the real and imaginary parts (x and y coordinates) separately.

real and imaginary parts separately

It would be hard to look at the first plot and imagine the second, or to look at the second plot and imagine the first.
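
Here’s a sketch of how you might reproduce plots like these. The date above is August 3, 2018, so m = 8, d = 3, and y = 18 in the sum defined on the exponential sum page; the number of terms plotted is my own choice.

    import numpy as np
    import matplotlib.pyplot as plt

    m, d, y = 8, 3, 18
    n = np.arange(1000)
    z = np.cumsum(np.exp(2j*np.pi*(n/m + n**2/d + n**3/y)))  # partial sums

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].plot(z.real, z.imag)   # the three-octagon figure
    axes[1].plot(n, z.real)        # real part alone
    axes[2].plot(n, z.imag)        # imaginary part alone
    plt.show()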

Sine of five degrees

Today’s the first day of a new month, which means the exponential sum of the day will be simpler than usual. The exponential sum of the day plots the partial sums of

\sum_{n=0}^N \exp\left( 2\pi i \left( \frac{n}{m} + \frac{n^2}{d} + \frac{n^3}{y} \right ) \right )

where m, d, and y are the month, day, and (two-digit) year. The n²/d term is simply n², an integer, when d = 1, and so it has no effect because exp(2πi n²) = 1. Here’s today’s sum, the plot formed by the partial sums above.

exponential sum for August 1, 2018

It’s possible in principle to work out exactly where each of the points on the curve is. In order to do so, you’d need to compute

\exp\left(\frac{2\pi i}{72}\right) = \cos\left(\frac{2\pi}{72}\right) + i \sin\left(\frac{2\pi}{72}\right) = \cos(5^\circ) + i \sin(5^\circ)

Since we can use the Pythagorean theorem to compute sine from cosine and vice versa, we only need to know how to compute the sine of five degrees.

OK, so how do we do that? The sine and cosine of 45 degrees and of 30 degrees are well known, and we can use the trig identity for sin(A – B) to find that the sine of 15 degrees is (√3 – 1)/ 2√2. So how do we get from the sine of 15 degrees to the sine of 5 degrees? We can borrow a trick from medieval astronomers. They did something similar to our work above, then solved the trig identity

\sin 3\theta = 3 \sin\theta - 4\sin^3\theta

as a cubic equation in sin θ to find the sine of 1/3 of an angle with known sine.

If we ask Mathematica to solve this equation for us

    Solve[3 x -4 x^3 == Sin[15 Degree], x]

we get three roots, and it’s the middle one that we’re after. (We can evaluate the roots numerically and tell that the middle one has the right magnitude.) So Mathematica says that the sine of 5 degrees is

\frac{\left(1+i \sqrt{3}\right) \sqrt[3]{-\sqrt{2}+\sqrt{6}+i \sqrt{8+4 \sqrt{3}}}}{4\ 2^{2/3}}+\frac{1-i \sqrt{3}}{2 \sqrt[3]{2 \left(-\sqrt{2}+\sqrt{6}+i \sqrt{8+4 \sqrt{3}}\right)}}

Unfortunately, this is the end of the line for Mathematica. We know that the expression above must be real, but Mathematica is unable to find its real part. I tried running FullSimplify on this expression, and aborted evaluation after it ran for 20 minutes or so. If anybody wants to pick up where Mathematica left off, let me know your solution. It may be possible to continue with Mathematica by giving the software some help, or it may be easier to continue with pencil and paper.
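
For what it’s worth, a quick numerical check (mine, not from the original analysis) confirms that the middle root is the one we want:

    import numpy as np

    target = np.sin(np.radians(15))
    roots = np.roots([-4, 0, 3, -target])  # roots of 3x - 4x^3 = sin 15 degrees
    print(np.sort(roots.real))             # middle root is about 0.0871557
    print(np.sin(np.radians(5)))           # 0.0871557...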

Update: Based on the comments below, it seems this is about as far as we can go. This was a surprise to me. We’re solving a cubic equation, less than 5th degree, so the roots are expressible in terms of radicals. But that doesn’t mean we can eliminate the appearance of the i‘s, even though we know the imaginary part is zero.

John Conway’s subprime fib sequence

Here’s how to construct what John Horton Conway calls his “subprime fib” sequence. Seed the sequence with any two positive integers. Then at each step, sum the last two elements of the sequence. If the result is prime, append it to the sequence. Otherwise divide it by its smallest prime factor and append that.

If we start the sequence with (1, 1) we get the sequence

1, 1, 2, 3, 5, 4, 3, 7, 5, 6, 11, 17, 14, 31, 15, 23, 19, 21, 20, 41, 61, 51, 56, 107, 163, 135, 149, 142, 97, 239, 168, 37, 41, 39, 40, 79, 17, 48, 13, 61, 37, 49, 43, 46, 89, 45, 67, 56, 41, 97, 69, 83, 76, 53, 43, 48, 13,

We know it will keep repeating because 48 followed by 13 has occurred before, and the “memory” of the sequence only stretches back two steps.

If we start with (6, 3) we get

6, 3, 3, 3, …

It’s easy to see that if the sequence ever produces the same value n twice in a row, then it will produce only that value from then on, because for n > 1, n + n = 2n is obviously not prime, and dividing by its smallest prime factor, 2, gives back n.

Conway believes that this sequence will always get stuck in a cycle but this has not been proven.

Here’s Python code to try out Conway’s conjecture:

from sympy import isprime, primefactors

def subprime(a, b):

    seq = [a, b]
    repeated = False
    while not repeated:
        s = seq[-1] + seq[-2]
        if not isprime(s):
            s //= primefactors(s)[0]
        repeated = check_repeat(seq, s)
        seq.append(s)
    return seq

def check_repeat(seq, s):
    
    if s not in seq:
        return False
    
    if seq[-1] == s:
        return True
    
    # all previous occurrences of the last element in seq
    indices = [i for i, x in enumerate(seq) if x == seq[-1]]
    
    for i in indices[:-1]:
        if seq[i+1] == s:
            return True
        
    return False

I’ve verified by brute force that the sequence repeats for all starting values less than 1000.
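
The brute force check is just a double loop over starting values. Here’s a sketch, using the subprime function above (it takes quite a while to run):

# verify that subprime(a, b) terminates, i.e. the sequence cycles,
# for every pair of starting values below 1000
for a in range(1, 1000):
    for b in range(1, 1000):
        subprime(a, b)
print("all starting pairs below 1000 eventually cycle")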

Distribution of eigenvalues for symmetric Gaussian matrix

Symmetric Gaussian matrices

The previous post looked at the distribution of eigenvalues for very general random matrices. In this post we will look at the eigenvalues of matrices with more structure. Fill an n by n matrix A with values drawn from a standard normal distribution and let M be the average of A and its transpose, i.e. M = ½(A + A^T). The eigenvalues will all be real because M is symmetric.

This is called a “Gaussian Orthogonal Ensemble” or GOE. The term is standard but a little misleading: the matrices themselves are not orthogonal; the “orthogonal” refers to the fact that the distribution of such matrices is invariant under orthogonal transformations.

Eigenvalue distribution

The joint probability distribution for the eigenvalues of M has three terms: a constant term that we will ignore, an exponential term, and a product term. (Source)

p(x_1, x_2, \ldots, x_n) \propto \exp\left(-\frac{1}{2} \sum_{j=1}^nx_j^2 \right ) \prod_{j < k} |x_j - x_k|

The exponential term is the same as in a multivariate normal distribution. This says the probability density drops off quickly as you go away from the origin, i.e. it’s rare for eigenvalues to be too big. The product term multiplies the distances between each pair of eigenvalues. This says it’s also rare for eigenvalues to be very close together.

(The missing constant to turn the expression above from a proportionality to an equation is whatever it has to be for the right side to integrate to 1. When trying to qualitatively understand a probability density, it usually helps to ignore proportionality constants. They are determined by the rest of the density expression, and they’re often complicated.)

If eigenvalues are neither tightly clumped together, nor too far apart, we’d expect that the distance between them has a distribution with a hump away from zero, and a tail that decays quickly. We will demonstrate this with a simulation, then give an exact distribution.

Python simulation

The following Python code simulates 2 by 2 Gaussian matrices.

    import matplotlib.pyplot as plt
    import numpy as np
    
    n = 2
    reps = 1000
    
    diffs = np.zeros(reps)
    for r in range(reps):
        A = np.random.normal(scale=n**-0.5, size=(n,n)) 
        M = 0.5*(A + A.T)
        w = np.linalg.eigvalsh(M)
        diffs[r] = abs(w[1] - w[0])
    
    plt.hist(diffs, bins=int(reps**0.5))
    plt.show()

This produced the following histogram:

The exact probability distribution is p(s) = s exp(-s²/4)/2. This result is known as “Wigner’s surmise.”

Circular law for random matrices

Suppose you create a large matrix M by filling its components with random values. If it has size n by n, then we require the probability distribution for each entry to have mean 0 and variance 1/n. Then the Girko-Ginibre circular law says that the eigenvalues of M are approximately uniformly distributed in the unit disk in the complex plane. As the size n increases, the distribution converges to a uniform distribution on the unit disk.

The probability distribution need not be normal. It can be any distribution, shifted to have mean 0 and scaled to have variance 1/n, provided the tail of the distribution isn’t so thick that the variance doesn’t exist. If you don’t scale the variance to 1/n you still get a circle, just not a unit circle.

We’ll illustrate the circular law with a uniform distribution. The uniform distribution has mean 1/2 and variance 1/12, so we will subtract 1/2 and multiply each entry by √(12/n).

Here’s our Python code:

    import matplotlib.pyplot as plt
    import numpy as np
    
    n = 100
    M = np.random.random((n,n)) - 0.5
    M *= (12/n)**0.5
    w = np.linalg.eigvals(M)
    plt.scatter(np.real(w), np.imag(w))
    plt.gca().set_aspect(1)
    plt.show()

When n=100 we get the following plot.

eigenvalues of 100 by 100 matrix

When n=1000 we can see the disk filling in more.

eigenvalues of 1000 by 1000 matrix

Note that the points are symmetric about the real axis. All the entries of M are real, so its characteristic polynomial has all real coefficients, and so its roots come in conjugate pairs. If we randomly generated complex entries for M we would not have such symmetry.

Related post: Fat-tailed random matrices