Möbius transformations over a finite field

A Möbius transformation is a function of the form

g(z) = (az + b) / (cz + d)

where ad − bc = 1.

We usually think of z as a complex number, but it doesn’t have to be. We could define Möbius transformations in any context where we can multiply, add, and divide, i.e. over any field. In particular, we could work over a finite field such as the integers modulo a prime. The plot above represents a Möbius transformation over a finite field which we will describe below.

There is a subtle point, however. In the context of the complex numbers, the transformation above doesn’t quite map the complex plane onto the complex plane. It maps the complex plane minus one point to the complex plane minus one point. The domain is missing the point z = −d/c because that value makes the denominator zero. It’s also missing a point in the range, namely a/c.

The holes in the domain and range are a minor irritant, analogous to the pea in The Princess and the Pea. You can work around the holes, though the formalism is a little complicated. But over a finite field, the holes are a big deal. If you’re working over the integers mod 7, for example, then 1/7th of your domain is missing.

In the case of the complex numbers, the usual fix is to replace the complex numbers ℂ with the extended complex numbers ℂ ∪ ∞ and say that g(−d/c) = ∞ and g(∞) = a/c. There are a couple ways to make this more rigorous/elegant. The topological approach is to think of ℂ ∪ ∞ as the Riemann sphere. The algebraic approach is to think of it as a projective space.

Now let’s turn to finite fields, say the integers mod 17, which we will write as ℤ17. For a concrete example, let’s set a = 3, b = 8, c = 6, and d = 5. Then ad − bc = 1 mod 17. The multiplicative inverse of 6 mod 17 is 3, so we have a hole in the domain when

z = −d/c = −5/6 = −5 × 3 = −15 = 2 mod 17.

Following the patch used with complex numbers, we define g(2) to be ∞, and we define

g(∞) = a/c = 3/6 = 3 × 3 = 9 mod 17.

That’s all fine, except now we’re not actually working over ℤ17 but rather ℤ17 ∪ ∞. We could formalize this by saying we’re working in a projective space over ℤ17. For this post let’s just say we’re working over a set G with 18 elements that mostly follows the rules of ℤ17 but has a couple additional rules.

Now our function g maps G onto G. No holes.

Here’s how we might implement g in Python.

    def g(n):
        # 17 stands in for the point at infinity
        if n == 2:
            return 17  # g(−d/c) = g(2) = ∞
        if n == 17:
            return 9   # g(∞) = a/c = 9
        a, b, c, d = 3, 8, 6, 5
        denom = c*n + d
        denom_inverse = pow(denom, -1, 17)
        return (a*n + b)*denom_inverse % 17
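
Since Möbius transformations are invertible, g should be a permutation of G, with no holes. A quick check, with 17 standing in for ∞ as in the code above:

    assert sorted(g(n) for n in range(18)) == list(range(18))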

The plot at the top of the post arranges 18 points uniformly around a circle and connects n to g(n).

    from numpy import pi, linspace, sin, cos
    import matplotlib.pyplot as plt

    θ = 2*pi/18
    t = linspace(0, 2*pi)
    plt.plot(cos(t), sin(t), 'b-')

    for n in range(18):
        plt.plot(cos(n*θ), sin(n*θ), 'bo')
        plt.plot([cos(n*θ), cos(g(n)*θ)],
                 [sin(n*θ), sin(g(n)*θ)], 'g-')
    plt.gca().set_aspect("equal")
    plt.show()

Application to cryptography

What use is this? Möbius transformations over finite fields [1] are “higgledy-piggledy” in the words of George Marsaglia [2], and so they can be used to create random-like permutations. In particular, Möbius transformations over finite fields are used to design S-boxes for use in symmetric encryption algorithms.

[1] Technically, finite fields plus an element at infinity.

[2] “If the [pseudorandom] numbers are not random, they are at least higgledy-piggledy.” — RNG researcher George Marsaglia

Generating and inspecting an RSA private key

In principle you generate an RSA key by finding two large prime numbers, p and q, and computing n = pq. You could, for example, generate random numbers by rolling dice, then type the numbers into Mathematica to test each for primality until you find a couple prime numbers of the right size.

In practice you’d use a specialized program to find the primes and to wrap everything up in a format that software using the keys can understand. There are a lot of layers between the numbers p and q and the file that key generating software produces, and this post aims to peel back these layers a bit.

Here’s an example of generating a private key taken from The OpenSSL Cookbook.

    openssl genpkey -out fd.key -algorithm RSA \
      -pkeyopt rsa_keygen_bits:2048 -aes-128-cbc

The genpkey function can be used for generating several kinds of public keys. The option -algorithm RSA tells it that we want an RSA key, but we could have asked for an elliptic curve key. As noted in the previous post, in practice public key encryption is used to transfer symmetric encryption keys, not messages per se. The flag -aes-128-cbc tells the software that we’d like to use AES encryption with a 128-bit key in CBC (cipher block chaining) mode.

When you press enter you’ll see a flurry of dots and plus signs that show the progress of the software in generating and testing candidates for the primes p and q. Then you’ll be prompted for a password to encrypt the private key you’ve just created.

If you open the fd.key file you won’t see much:

    % cat fd.key
    -----BEGIN ENCRYPTED PRIVATE KEY-----
    MIIFLTBXBgkqhkiG9w0BBQ0wSjApBgkqhkiG9w0BBQwwHAQIdCZSKfkqh6kCAggA
    MAwGCCqGSIb3DQIJBQAwHQYJYIZIAWUDBAECBBAqbtHXkZ+uqa3rvj6qKqbRBIIE
    ...
    U6QCPcWukFyUAghHdTfjKgoAEXfOEunALoaTF6LMPsd6
    -----END ENCRYPTED PRIVATE KEY-----

This is just base 64-encoded data.

The data is encoded in two senses. It is encoded in a non-secret way, expressed in a standardized data structure, then encoded in the sense of being encrypted. The openssl command pkey will undo both levels of encoding to let us see the contents of the file.
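
A command like the following will do it; you’ll be prompted for the password chosen when the key was created.

    openssl pkey -in fd.key -text -noout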

Here’s what this produces.

    Private-Key: (2048 bit, 2 primes)
    modulus:
        00:a7:b8:39:80:0b:18:d9:db:c1:a3:c1:3a:92:89:
        ...
        7a:c5
    publicExponent: 65537 (0x10001)
    ...
    prime1:
        00:dc:8c:27:e6:7f:1c:11:d4:9c:8c:33:bf:07:57:
        ...
        97:5f:8c:4c:44:23:d2:85:f9
    prime2:
        00:c2:ae:20:80:87:da:d0:a1:66:8f:2e:90:7c:ae:
        ...
        9c:e9:8a:8b:bc:c7:71:de:2d
    ...

The exponent is the default value 65537. (More on that here.)

The large numbers are displayed in hexadecimal with colons separating pairs of hex digits. If you remove the colons and concatenate everything together, you can verify that the number called modulus is indeed the product of the numbers called prime1 and prime2. I verified this for the output above using a little Python code:

    modulus = 0xa7b839...c5
    prime1  = 0xdc8c27...f9
    prime2  = 0xc2ae20...2d
    assert(prime1*prime2 == modulus)

The file also contains four numbers that require more explanation: privateExponent, exponent1, exponent2, and coefficient. The privateExponent is described here. The remaining numbers are not strictly necessary for RSA but are used in Garner’s algorithm for more efficient decryption.
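
Here’s a sketch of how those numbers speed up decryption, using toy primes made up for illustration. The variables exponent1, exponent2, and coefficient correspond to the fields of the same name in the key file.

    # Garner's algorithm (CRT decryption) with toy primes
    p, q, e = 61, 53, 17
    n = p * q
    d = pow(e, -1, (p - 1)*(q - 1))  # privateExponent
    exponent1 = d % (p - 1)
    exponent2 = d % (q - 1)
    coefficient = pow(q, -1, p)      # q inverse mod p

    x = pow(42, e, n)                # encrypt the message m = 42
    m1 = pow(x, exponent1, p)
    m2 = pow(x, exponent2, q)
    m = m2 + q*(coefficient*(m1 - m2) % p)
    assert m == 42                   # recovered without computing pow(x, d, n)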

RSA encryption in practice

At its core, RSA encryption is modular exponentiation. That is, given a message m, the encrypted form of m is

x = m^e mod n

where e is a publicly known exponent and n is a product of two large primes. The number n is made public but only the holder of the private key knows the factors of n, and without knowing the factors of n you can’t recover m from x, or so we assume.

You can implement RSA encryption in just a few lines of code as long as you have a way to work with very large integers.
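
For example, here’s a complete round trip in Python, which has arbitrary-precision integers built in. The primes below are tiny stand-ins for the enormous primes used in practice.

    # Toy RSA with small primes
    p, q = 1009, 1013
    n = p*q
    e = 65537                        # the usual public exponent
    m = 1234                         # message, must be less than n
    x = pow(m, e, n)                 # encrypt
    d = pow(e, -1, (p - 1)*(q - 1))  # private exponent
    assert pow(x, d, n) == m         # decrypt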

In principle you could divide your message into segments each less than n and encrypt each segment. In practice, that would be inefficient. Instead, asymmetric (public key) cryptography is only used to exchange symmetric cryptography keys. So, for example, someone wishing to send you a long message would use RSA to share the AES key used to encrypt the rest of the transmission.

So RSA is used to transfer keys, but that’s not the whole story. As is often the case, the real world implementation of cryptography is more complicated than the mathematical core.

In 1993 RSA published its PKCS#1 standard specifying that messages should be padded a certain way. That was an improvement, but then in 1998, Daniel Bleichenbacher published what has become known as the “million message attack” against the PKCS#1 standard. There were multiple proposed fixes, but these were complicated and often implemented incorrectly.

Now the standard is RSA-OAEP (Optimal Asymmetric Encryption Padding) which combines the message with random bits before applying the RSA algorithm per se. So there’s a bit of symmetric encryption, before using asymmetric encryption to share symmetric encryption keys!

My point here is not to get into the details of the OAEP protocol, but only to point out that it’s not trivial. It is, however, secure in the sense that you can prove that if someone can break RSA-OAEP then they can break the core RSA algorithm too.

“Cryptography is a mixture of mathematics and muddle, and without the muddle the mathematics can be used against you.” — Ian Cassels

Image above CC BY-SA 4.0 by Jm-lemmi

Density of safe primes

Sean Connolly asked in a comment yesterday about the density of safe primes. Safe primes are so named because Diffie-Hellman encryption systems based on such primes are safe from a particular kind of attack. More on that here.

If q and p = 2q + 1 are both prime, q is called a Sophie Germain prime and p is a safe prime. We could phrase Sean’s question in terms of Sophie Germain primes because every safe prime corresponds to a Sophie Germain prime.

It is unknown whether there are infinitely many Sophie Germain primes, so conceivably there are only a finite number of safe primes. But the number of Sophie Germain primes less than N is conjectured to be approximately

1.32 N / (log N)².

See details here.

Sean asks specifically about the density of safe primes with 19,000 digits. The density of Sophie Germain primes with 19,000 digits or less is conjectured to be about

1.32/(log 10^19000)² = 1.32/(19000 log 10)² ≈ 6.9 × 10^−10.

So the chances that a 19,000-digit number is a safe prime are on the order of one in a billion.
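
A quick check of the arithmetic above:

    from math import log
    print(1.32/(19000*log(10))**2)  # about 6.9e-10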

How to turn an unkeyed hash into a keyed hash

Secure hash functions often do not take a key per se, but they can be used with a key. Adding a key to a hash is useful, for example, to prevent a rainbow table attack.

There are a couple obvious ways to incorporate a key K when hashing a message M. One is to prepend the key to M before hashing. The other is to append K to M before hashing. That is, we either stick K onto the front or the end of M, then apply the hash function.

Both of these approaches could be vulnerable to attack under certain circumstances for reasons that are more complicated than I’d like to go into. Instead, a better approach is to prepend and append the key. This is called the envelope method or more descriptively the sandwich method because the message is sandwiched between two copies of the key.
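
Here’s a minimal sketch of the sandwich method using SHA-256. The construction in the paper below is more careful, padding the key out to a full hash block, but this conveys the idea.

    import hashlib

    def sandwich_mac(key: bytes, message: bytes) -> bytes:
        # Sandwich the message between two copies of the key, then hash
        return hashlib.sha256(key + message + key).digest()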

For details, see Kan Yasuda’s paper “‘Sandwich’ Is Indeed Secure: How to Authenticate a Message with Just One Hashing”, Australasian Conference on Information Security and Privacy, ACISP 2007: Information Security and Privacy pp 355–369.

Privacy implications of hashing data

Cryptographic hash functions are also known as one-way functions because given an input x, one can easily compute its hashed value f(x), but it is impractical to recover x from knowing f(x).

However, if we know that x comes from a small universe of possible values, our one-way function can effectively become a two-way function, i.e. it may be possible to start with f(x) and recover x using a rainbow table attack.

It’s possible to defend against a rainbow attack yet still leak information about the probability distribution of values of x given f(x).

For privacy protection, hashing is better than not hashing. Keyed hashing is better than unkeyed hashing. But even keyed hashing can leak information.

Rainbow tables

Suppose a data set contains SHA-256 hashed values of US Social Security Numbers (SSNs). Since SSNs have nine digits, there are a billion possible SSNs. It would be possible to hash all possible SSNs and create a lookup table, known as a rainbow table. There are three things that make the rainbow table attack possible in this example:

  1. The range of possible inputs is known and relatively small.
  2. The hashing algorithm is known.
  3. The hashing algorithm can be computed quickly.

There are a couple ways to thwart a rainbow table attack. Assuming we have no control of (1) above, we can alter (2) or (3).

Keyed hashing

A way to alter (2) is to use a keyed hash algorithm. For example, if we XOR the SSNs with a key before applying SHA-256, an attacker cannot construct a rainbow table without knowing the key. The attacker may know the core hash algorithm we are using, but they do not know the entire algorithm because the key is part of the algorithm.
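
Here’s a sketch of the idea. The key and byte width are arbitrary choices for illustration; in practice you’d use a standard construction such as HMAC.

    import hashlib

    def keyed_hash(ssn: int, key: int) -> str:
        # XOR the SSN with a secret key before applying SHA-256
        return hashlib.sha256((ssn ^ key).to_bytes(8, "big")).hexdigest()

    print(keyed_hash(123456789, key=0x5DEECE66D))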

Expensive hashing

A way to alter (3) is to use hashing algorithms that are expensive to compute, such as Argon2. The idea is that such hash functions can be computed quickly enough for legitimate use cases, but not for brute-force attacks.

The time required to create a rainbow table is the product of the time to compute a single hash and the size of the set of possible inputs. If it took an hour to compute a hash value, but there are only 10 possible values, then an attacker could compute a rainbow table in 10 hours.

Leaking attribute probabilities

Now suppose we’re using a keyed hashing algorithm. Maybe we’re paranoid and use an expensive hashing algorithm on top of that. What can go wrong now?

If we’re hashing values that each appear one time then we’re OK. If we apply a keyed hash to primary keys in a database, such as patient ID, then the hashed ID does not reveal the patient ID. But if we hash attributes associated with that patient ID, things are different.

Frequency analysis

Suppose US state is an attribute in your database, and you hash this value. No matter how secure and how expensive your hash algorithm is, each state will have a unique hash value. If the database contains a geographically representative sample of the US population, then the hashed state value that appears most often is probably the most populous state, i.e. California. The second most common hashed state value probably corresponds to Texas.

Things are fuzzier on the low end. The hashed state value appearing least often may not correspond to Wyoming, but it very likely does not correspond to Florida, for example.

In short, you can infer the state values using the same kind of frequency analysis you’d use to solve a simple substitution cipher in a cryptogram puzzle. The larger the data set, the more closely the empirical order will likely align with the population order.
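
Here’s a toy demonstration. The states, weights, and key are all made up; the point is that the ranking of hash counts mirrors the ranking of frequencies in the data.

    import hashlib, random
    from collections import Counter

    random.seed(20220101)
    # Sample four states with population-like weights, then apply a keyed hash
    states = random.choices(["CA", "TX", "FL", "WY"], weights=[39, 30, 22, 1], k=10_000)
    key = b"secret key"
    hashes = [hashlib.sha256(key + s.encode()).hexdigest()[:8] for s in states]
    for h, count in Counter(hashes).most_common():
        print(h, count)  # the most frequent hash almost surely corresponds to CA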

Maybe this is OK?

Frequency analysis makes it possible to infer (with some uncertainty) the most common values. But the most common values are also the least informative. Knowing someone is from California narrows down their identity less than knowing that they are from Wyoming.

Frequency analysis is less specific for less common values. It might not tell you with much confidence that someone is from Wyoming, but it might tell you with some confidence that they come from a low-population state. However, since there are several low-population states, knowing someone is from such a state, without knowing which particular state, isn’t so informative.

Data privacy depends on context. Knowing what state someone is likely from may or may not be a problem depending on what other information is available.

Playfair cipher

The Playfair cipher was the first encryption technique to encrypt text two letters at a time. Instead of substituting one letter for another, it substitutes one pair of letters for another pair. This makes the method more secure than a simple substitution cipher, but hardly secure by modern standards.

The Playfair cipher was used (and broken) during the first world war. I vaguely remember reading somewhere that the cipher took about an hour to break using pencil and paper. It was secure in the sense that it could be used for messages that only needed to be secure for less time than it took to break the method. It was more secure than simple substitution, and easy to encrypt and decrypt manually.

True to Stigler’s law of eponymy, the Playfair cipher was not named after its inventor, Charles Wheatstone of Wheatstone bridge fame, but after Lyon Playfair who popularized the method. Playfair acknowledged Wheatstone, but his name stuck to the method nevertheless.

Message preparation

The Playfair cipher uses a 5 × 5 grid of letters, so some letter of the Roman alphabet has to go. A common choice was to use the same letter for I and J. (A variation on the method using a 6 × 6 grid of letters and digits would not have to leave out any letters.)

For reasons that will soon be apparent, double letters had to be broken up, say with an X. So “FOOTBALL” would become “FOXOTBALXL.” Amusingly, “MISSISSIPPI” would become “MISXSISXSIPXPI.”

After eliminating Js and splitting double letters, the message is divided into pairs. So FOXOTBALXL becomes FO XO TB AL XL.

Encryption algorithm

The key for the encryption method is the arrangement of the letters in a square. In practice, the key would be some word or phrase that was used to permute the alphabet, and then that permutation was filled into the grid.

Here’s a grid I constructed by asking Python for a random permutation of the alphabet.

IFTVX
PCGDY
RNHQK
ESBLA
OUMWZ

Given a pair of letters, the two letters either lie on the same row, the same column, or are in different rows and columns. (This is why you break up double letters.)

If the two letters lie in the same row, advance each letter one position, wrapping around if necessary. For example, IT would be encrypted as FV, and TX would be encrypted as VI.

If two letters lie in the same column, proceed analogously, moving each letter down. So TH would be encrypted as GB and OI would be encrypted as IP.

Finally, if the two letters are in different rows and columns, they form the diagonal corners of a rectangle. Replace the two letters with the letters on the remaining corners. For example, IH becomes TR, HE becomes RB, GW becomes DM, etc.
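
Here’s a short implementation sketch of the encryption rules, using the grid above. It reproduces the examples in this section.

    def playfair_pair(grid, pair):
        # grid: list of five 5-letter strings; pair: two distinct letters
        pos = {ch: (i, j) for i, row in enumerate(grid) for j, ch in enumerate(row)}
        (r1, c1), (r2, c2) = pos[pair[0]], pos[pair[1]]
        if r1 == r2:                        # same row: shift each letter right
            return grid[r1][(c1 + 1) % 5] + grid[r2][(c2 + 1) % 5]
        if c1 == c2:                        # same column: shift each letter down
            return grid[(r1 + 1) % 5][c1] + grid[(r2 + 1) % 5][c2]
        return grid[r1][c2] + grid[r2][c1]  # rectangle: take the opposite corners

    grid = ["IFTVX", "PCGDY", "RNHQK", "ESBLA", "OUMWZ"]
    assert playfair_pair(grid, "IT") == "FV"
    assert playfair_pair(grid, "TH") == "GB"
    assert playfair_pair(grid, "HE") == "RB"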

Cryptanalysis

Just as you can attack a simple substitution cipher by looking at letter frequencies, you can attack a Playfair cipher by looking at bigram frequencies. You can find these frequencies for English text on Peter Norvig’s site. TH sticks out in bigram frequencies similarly to how E sticks out in letter frequencies. However, bigram frequencies are more evenly distributed than letter frequencies.

As I pointed out in the previous post, mapping the 676 pairs of letters to a randomly generated list of 676 other pairs of letters will not create a secure cipher. But Playfair is much weaker than such a random assignment. There is a lot of structure to the Playfair cipher. This makes it more convenient to use, and easier to break.

Suppose pairs of letters were mapped to random pairs of letters and you learn that GB is the encrypted form of TH. What have you learned about decrypting any other pair? Nothing, except that you’ve eliminated 1 out of 676 possibilities.

But if you learn that a Playfair cipher sends TH to GB, you learn that either (1) T, H, G, and B all lie in the same row or column, or (2) that T and B are in the same column, G and H are in the same column, T and G are in the same row, and H and B are in the same row.

Symmetry

If we rotate the rows or columns in our encryption matrix, nothing changes. This is easy to see in the case when two letters are in the same row or in the same column. It’s a little harder to see but still true when the letters are in different rows and columns.

For example, consider the following encryption matrix, formed by rotating the columns two positions and the rows one position.

GDYPC
HQKRN
BLAES
MWZOU
TVXIF

If you work through all the examples above, you’ll see that they remain the same. IT still goes to FV etc.

The reason rotating columns or rows doesn’t make a difference is that in matrix notation, the encryption algorithm does not depend on the subscripts per se but the difference in subscripts mod 5.

It almost doesn’t matter if you transpose the encryption matrix. If you transpose a matrix, elements that were in the same row are now in the same column and vice versa. When two letters are not in the same row or column, transposing the encryption matrix transposes the encrypted pair. In the example above HE goes to RB. If we transpose the encryption matrix, HE goes to BR.

We said above that the key to a Playfair cipher is a permutation of the alphabet. But many keys correspond to the same encryption mapping. The analyst doesn’t need to recover the original encryption matrix but only some rearrangement of it.

Simple substitution ciphers over a gargantuan alphabet

Simple substitution ciphers replace one letter with another. Maybe A goes to W, B goes to G, C goes to A, etc.

These ciphers are famously easy to break, so easy that they’re common in puzzle books. Here’s one I made [1] for this post in case you’d like to try it.

X RF SXIIXKW XK IYZ UXINYZK HT IYZ CXIICZ YHJSZ RI FZGTXZCG, HJQ SZNHKG TRQF BYXNY XS NJI HTT EV IYZ QXGWZ RKG R MJRQIZQ-FXCZ RNQHSS IYZ TXZCGS TQHF HJQ YHFZ LCRNZ, BYZQZ VHJ RQZ. X RF BQXIXKW R EHHU. XK XI X RF SLZRUXKW IH VHJ. EJI X RF RCSH SLZRUXKW IH IYZ BHQCG. IH EHIY X HBZ RK RNNHJKIXKW.

As is common in puzzle books, I kept the spaces and punctuation.

When you learn that simple substitution is breakable, you might reasonably think that the problem is the small alphabet size. What if you replaced pairs of letters with pairs of letters, effectively working over an alphabet of size 26² = 676. That’s an improvement, but it’s still not secure. It could be broken manually in a few hours, depending on the length of the text, and of course could be broken quickly using a computer.

If we want a cipher to be secure against computer-aided cryptanalysis, we’re going to need a much bigger alphabet.

The Roman alphabet has 26 letters, which can be expressed in 5 bits. Pairs of Roman letters would require 10 bits. What if we used a 32-bit alphabet, substituting 32-bit sequences with other 32-bit sequences? This is working over an alphabet of over 4 billion symbols. Surely that’s secure? Nope.

What if we use blocks of 128 bits? This is working over an alphabet of size

2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456.

Nope. Still not good enough. Because you can see the penguin.

Original encrypted Tux image

The image above is a famous example of a downfall of simple substitution, albeit over a gargantuan alphabet. The image was created by taking a graphic of the Linux mascot and encrypting the bits using 128-bit encryption. Each block of 128 bits goes to a unique, essentially random replacement. Each block is well encrypted. But there are repetitive blocks in the original that become repetitive blocks in the encrypted version.

The AES (Rijndael) encryption algorithm is a good algorithm, but in the example above it was used poorly. It was used in electronic code book mode (ECB), something that nobody would do in practice.

In practice, you might do something like cipher block chaining (CBC), where you XOR each plaintext block with the previous ciphertext block before encrypting it. You could think of this as a clever way of using a simple substitution over an enormous alphabet: you XOR each block with the previously encrypted block, then look up the substitution of the result. Now repetitive input does not produce repetitive output. You cannot see the penguin. The penguin image becomes random-looking static.
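
Here’s a toy demonstration of the difference, using a random permutation of single bytes as the simple substitution and a zero initialization vector. Real CBC works on 128-bit blocks with a random initialization vector; this sketch just shows the repetition disappearing.

    import random

    random.seed(42)
    perm = list(range(256))
    random.shuffle(perm)               # a simple substitution over one-byte blocks

    msg = bytes(16)                    # sixteen identical (zero) bytes

    ecb = bytes(perm[b] for b in msg)  # substitute each block independently
    prev, cbc = 0, []
    for b in msg:
        prev = perm[b ^ prev]          # XOR with previous output, then substitute
        cbc.append(prev)

    print(ecb.hex())                   # one byte repeated 16 times: pattern visible
    print(bytes(cbc).hex())            # random-looking: pattern hidden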

[1] I produced the cryptogram using

    cat myfile | tr [a-z] [A-Z] | tr [A-Z] ...

where “…” is a permutation of the 26 upper case letters.

Cryptography, hydrodynamics, and celestial mechanics

V. I. Arnold

Last night I was reading a paper by the late Russian mathematician V. I. Arnold “Polymathematics: is mathematics a single science or a set of arts?” and posted a lightly edited extract of it on Twitter. It begins

All mathematics is divided into three parts: cryptography, hydrodynamics, and celestial mechanics.

Arnold is alluding to the opening line to Julius Caesar’s Gallic Wars [1] which suggests he’s being a little playful. The unedited version leaves no doubt that he’s being playful or cynical.

All mathematics is divided into three parts: cryptography (paid for by CIA, KGB and the like), hydrodynamics (supported by manufacturers of atomic submarines), and celestial mechanics (financed by military and other institutions dealing with missiles, such as NASA).

I edited out the parenthetical remarks, in part to edit the sentence down to a tweet, but also because when you take out the humor/cynicism he makes a valid if hyperbolic point. He goes on to back this up.

Cryptography has generated number theory, algebraic geometry over finite fields, algebra, combinatorics and computers.

Hydrodynamics has procreated complex analysis, partial differential equations, Lie groups and algebra theory, cohomology theory and scientific computing.

Celestial mechanics is the origin of dynamical systems, linear algebra, topology, variational calculus and symplectic geometry.

Arnold adds a footnote to the comment about cryptography:

The creator of modern algebra, Viète, was the cryptographer of King Henry IV of France.

Of course not all mathematics was motivated by cryptography, hydrodynamics, and celestial mechanics, but an awful lot of it was. And his implicit argument that applied math gives birth to pure math is historically correct. Sometimes pure math gives rise to applied math, but much more often it’s the other way around.

His statements roughly match my own experience. Much of the algebra and number theory that I’ve learned has been motivated by cryptography. I dove into Knuth’s magnum opus, volume 2, because I wanted to implement the RSA algorithm in C++.

I got started in scientific computing in a computational fluid dynamics (CFD) lab. I didn’t work in the lab, but my roommate did, and I went there to use the computers. That’s where I would try my hand at numerical analysis and where I wrote simulation code for my dissertation. My dissertation in partial differential equations was related (very abstractly) to fluid flow in porous media.

I didn’t know anything about celestial mechanics until I sat in on Richard Arenstorf’s class as a postdoc. So celestial mechanics was not my personal introduction to dynamical systems etc., but Arnold is correct that these fields came out of celestial mechanics.

[1] “Gallia est omnis divisa in partes tres,” which translates to “Gaul is a whole divided into three parts.”

The quality of an RNG depends on the application

A random number generator can be good for some purposes and not for others. This isn’t surprising given the fundamentally impossible task such generators are supposed to perform. Technically a random number generator is a pseudo random number generator because it cannot produce random numbers. But random is as random does, and for many purposes the output of a pseudorandom number generator is random for practical purposes. But this brings us back to purposes.

Let β be an irrational number and consider the sequence

x_n = nβ mod 1

for n = 0, 1, 2, … That is, we start at 0 and repeatedly add β, taking the fractional part each time. This gives us a sequence of points in the unit interval. If you think of bending the interval into a circle by joining the 1 end to 0, then our sequence goes around the circle, each time moving a fraction β of the way around.

Is this a good random number generator? For some purposes yes. For others no. We’ll give an example of each.

Integration

If your purpose is Monte Carlo integration, then yes it is. Our sequence has low discrepancy. You can approximate the integral of a function f over [0, 1] by taking the average of f(xn) over the first N elements of the sequence. Doing Monte Carlo integration with this particular RNG amounts to quasi Monte Carlo (QMC) integration, which is often more efficient than Monte Carlo integration.

Here’s an example using β = e.

    import numpy as np
    
    # Integrate f with N steps of (quasi) Monte Carlo 
    def f(x): return 1 + np.sin(2*np.pi*x)
    N = 1000
    
    # Quasi Monte Carlo
    sum = 0
    x = 0
    e = np.exp(1) 
    for n in range(N):
        sum += f(x)
        x = (x + e) % 1
    print(sum/N)
    
    # Monte Carlo
    sum = 0
    np.random.seed(20220623)
    for _ in range(N):
        sum += f(np.random.random())
    print(sum/N)

This code prints

    0.99901...
    0.99568...

The exact value of the integral is 1, and so the error using QMC is between 4 and 5 times smaller than the error using MC. To put it another way, integration using our simple RNG is much more accurate than using the generally better RNG that NumPy uses.

Simulation

Now suppose we’re doing some sort of simulation that requires computing the gaps between consecutive random numbers. Let’s look at the set of gaps we get using our simple RNG.

    gapset = set()
    x = 0
    for _ in range(N):
        newx = (x + e) % 1
        gap = np.abs(newx - x)
        x = newx
        gapset.add(np.round(gap, 10))
    print( gapset )

Here we rounded the gaps to 10 decimal places so we don’t have minuscule gaps caused by floating point error.

And here’s our output:

    {0.7182818285, 0.2817181715}

There are only two gap sizes! This is a consequence of the three-gap theorem that Greg Egan mentioned on Twitter this morning. Our situation is slightly different in that we’re looking at gaps between consecutive terms, not the gaps that the interval is divided into. That’s why we have two gaps rather than three.

If we use NumPy’s random number generator, we get 1000 different gap sizes.

Histogram of gap sizes

Cryptography

Random number generators with excellent statistical properties may be completely unsuitable for use in cryptography. A lot of people don’t know this or don’t believe it. I’ve seen examples of insecure encryption systems that use random number generators with good statistical properties but bad cryptographic properties.

These systems violate Kerckhoffs’s principle that the security of an encryption system should reside in the key, not in the algorithm. Kerckhoffs assumed that encryption algorithms will eventually be known, and difficult to change, and so the strength of the system should rely on the keys it uses. At best these algorithms provide security by obscurity: they’re easily breakable knowing the algorithm, but the algorithm is obscure. But these systems may not even provide security by obscurity because it may be possible to infer the algorithm from the output.

Fit for purpose

The random number generator in this post would be terrible for encryption because the sequence is trivially predictable. It would also fail some statistical tests, though it would pass others. It passes at least one statistical test, namely using the sequence for accurate Monte Carlo integration.

Even so, the sequence would pass a one-sided test but not a two-sided test. If you tested whether the sequence, when used in Monte Carlo integration, produced results with error below some threshold, it would pass. But if you looked at the distribution of the integration errors, you’d see that they’re smaller than would be expected from a random sequence. The sequence provides suspiciously good integration results, failing a test of randomness but suggesting the sequence might be useful in numerical integration.
