Recovering the state of xorshift128

I’ve written a couple posts lately about reverse engineering the internal state of a random number generator, first Mersenne Twister then lehmer64. This post will look at xorshift128, implemented below.

import random

# Seed the generator state
a: int = random.getrandbits(32)
b: int = random.getrandbits(32)
c: int = random.getrandbits(32)
d: int = random.getrandbits(32)

MASK = 0xFFFFFFFF

def xorshift128() -> int:
    global a, b, c, d

    t = d
    s = a

    t ^= (t << 11) & MASK t ^= (t >>  8) & MASK
    s ^= (s >> 19) & MASK

    a, b, c, d = (t ^ s) & MASK, a, b, c

    return a

Recovering state

Recovering the internal state of the generator is simple: it’s the four latest outputs in reverse order. This is illustrated by the following chart.

 a b c d output 5081e5ca 79259a41 63e12955 651e537d c793d11c c793d11c 5081e5ca 79259a41 63e12955 ad52e33a ad52e33a c793d11c 5081e5ca 79259a41 f8f09343 f8f09343 ad52e33a c793d11c 5081e5ca a7009622 a7009622 f8f09343 ad52e33a c793d11c fe42a8ef

This means that once we’ve seen four outputs, we can predict the rest of the outputs. The following code demonstrates this.

Let’s generate five random values.

out = [xorshift128() for _ in range(5)]

Running

print(hex(out[4]))

shows the output 0xc3f4795d.

If we reset the state of the generator using the first four outputs

d, c, b, a, _ = out
print(hex(xorshift128()))

we get the same result.

Good stats, bad security

Mersenne Twister and lehmer64 have good statistical properties, despite being predictable. The xorshift128 generator is even easier to predict, but it also has good statistical properties. These generators would be fine for many applications, such as Monte Carlo simulation, but disastrous for use in cryptography.

The post on lehmer64 mentioned at the end that the internal state of PCG64 can also be recovered from its output. However, doing so requires far more sophisticated math and thousands of hours of compute time. Still, it’s not adequate for cryptography. For that you’d need a random number generator designed to be secure, such as ChaCha.

So why not just use a cryptographically secure random number generator (CSPRNG) for everything? You could, but the other generators mentioned in this post use less memory and are much faster. PCG64 occupies an interesting middle ground: simple and fast, but not easily reversible.

Initialize and print 128-bit integers in C

If you look very closely at my previous post, you’ll notice that I initialize a 128-bit integer with a 64-bit value. The 128-bit unsigned integer represents the internal state of a random number generator. Why not initialize it to a 128-bit value? I was trying to keep the code simple.

A surprising feature of C compilers, at least of GCC and Clang, is that you cannot initialize a 128-bit integer to a 128-bit integer literal. You can’t directly print a 128-bit integer either, which is why the previous post introduces a function print_u128.

The code

__uint128_t x = 0x00112233445566778899aabbccddeeff;

Produces the following error message.

error: integer literal is too large to be represented in any integer type

The problem isn’t initializing a 128-bit number to a 128-bit value; the problem is that the compiler cannot parse the literal expression

0x00112233445566778899aabbccddeeff

One solution to the problem is to introduce the macro

#define U128(hi, lo) (((__uint128_t)(hi) << 64) | (lo))

and use it to initialize the variable.

__uint128_t x = U128(0x0011223344556677, 0x8899aabbccddeeff);

You can verify that x has the intended state by calling print_u128 from the previous post.

void print_u128(__uint128_t n)
{
    printf("0x%016lx%016lx\n",
           (uint64_t)(n >> 64),      // upper 64 bits
           (uint64_t)n);             // lower 64 bits
}

Then

print_u128(x);

prints

0x00112233445566778899aabbccddeeff

Update. The code for print_u128 above compiles cleanly with gcc but clang gives the following warning.

warning: format specifies type 'unsigned long' but the argument has type 'uint64_t' (aka 'unsigned long long') [-Wformat]

You can suppress the warning by including the inttypes header and modifying the print_u128 function.

Here’s the final code. It compiles cleanly under gcc and clang.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#define U128(hi, lo) (((__uint128_t)(hi) << 64) | (lo))

void print_u128(__uint128_t n)
{
    printf("0x%016" PRIx64 "%016" PRIx64 "\n",
           (uint64_t)(n >> 64),
           (uint64_t)n);
}

int main(void)
{
    __uint128_t x = U128(0x0011223344556677, 0x8899aabbccddeeff);
    print_u128(x);
    return 0;
}

Hacking the lehmer64 RNG

A couple days ago I wrote about hacking the Mersenne Twister. I explained how to recover the random number generator’s internal state from a stream of 640 outputs.

This post will do something similar with the lehmer64 random number generator. This generator is very simple to implement. Daniel Lemire found it to be “the fastest conventional random number generator that can pass Big Crush,” a well-respected test for pseudorandom number generators.

Implementing lehmer64

The lehmer64 generator can be implemented in C by

__uint128_t g_lehmer64_state;

uint64_t lehmer64() {
    g_lehmer64_state *= 0xda942042e4dd58b5ULL;  
  return g_lehmer64_state >> 64;
}

The analogous code in Python would have to simulate the overflow behavior of a 128-bit integer by reducing the state mod 2128 after the multiplication.

Reverse engineering lehmer64 is easier than reverse engineering the Mersenne Twister because only three outputs are needed. However, the theory behind the exploit is more sophisticated. See [1].

The following code sets the state to an arbitrary initial seed value and generates three values.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    g_lehmer64_state = 0x4789499d78770934; // seed
    for (int i = 0; i < 3; i++) {
        printf("0x%016lx\n", lehmer64());
    }

    return 0;
}

The code prints the following.

0x3d144d12822bcc2e
0x85a67226191a568d
0x53e803dffc88e8f8

Exploiting lehmer64

Here is Python code for recovering the state of the lehmer64 generator given in [1].

def reconstruct(X):
    a = 0xda942042e4dd58b5
    r = round(2.64929081169728e-7 * X[0] + 3.51729342107376e-7 * X[1] + 3.89110109147656e-8 * X[2])
    s = round(3.12752538137199e-7 * X[0] - 1.00664345453760e-7 * X[1] - 2.16685184476959e-7 * X[2])
    t = round(3.54263598631140e-8 * X[0] - 2.05535734808162e-7 * X[1] + 2.73269247090513e-7 * X[2])
    u = r * 1556524 + s * 2249380 + t * 1561981
    v = r * 8429177212358078682 + s * 4111469003616164778 + t * 3562247178301810180
    state = (a*u + v) % (2**128)
    return state

Let’s call reconstruct with the output of our C code.

X = [0x3d144d12822bcc2e, 0x85a67226191a568d, 0x53e803dffc88e8f8]
print( hex( reconstruct(X) ) )

This prints

0x3d144d12822bcc2e1b81101c593761c4

Now for the confusing part: at what point is the number above the state of the generator? Is it the state before or after generating the three values? Neither! It is the state after generating the first value.

We can verify this by modifying the C code as follows and rerunning it.

void print_u128(__uint128_t n)
{
    printf("0x%016lx%016lx\n",
           (uint64_t)(n >> 64),      // upper 64 bits
           (uint64_t)n);             // lower 64 bits
}

int main(void)
{
    g_lehmer64_state = 0x4789499d78770934; // seed
    for (int i = 0; i < 3; i++) {
        printf("0x%016lx\n", lehmer64());
        printf("state: ");
        print_u128(g_lehmer64_state);
    }
 
    return 0;
}

The main goal of [1] is to recover the state of the PCG generator, not the lehmer64 generator. The latter was a side quest. Recovering the state of PCG64 is much harder; the authors estimate it takes about 20,000 CPU-hours. The paper shows that a technique used as part of pursuing their main goal can quickly recover the lehmer64 state.

Related posts

[1] Charles Bouillaguet, Florette Martinez, and Julia Sauvage. Practical seed-recovery for the PCG Pseudo-Random Number Generator. IACR Transactions on Symmetric Cryptology. ISSN 2519-173X, Vol. 2020, No. 3, pp. 175–196.

Inverse shift

What is the inverse of shifting a sequence to the right? Shifting it to the left, obviously.

But wait a minute. Suppose you have a sequence of eight bits

abcdefgh

and you shift it to the right. You get

0abcdefg.

If you shift this sequence to the left you get

abcdefg0

You can’t recover the last element h because the right-shift destroyed information about h.

A left-shift doesn’t fully recover a right-shift, and yet surely left shift and right shift are in some sense inverses.

Yesterday I wrote a post about representing bit manipulations, including shifts, as matrix operators. The matrix corresponding to shifting right by k bits has 1s on the kth diagonal above the main diagonal and 0s everywhere else. For example, here is the matrix for shifting an 8-bit number right two bits. A black square represents a 1 and a white square represents a 0.

This matrix isn’t invertible. When you’d like to take the inverse of a non-invertible matrix, your kneejerk response should be to compute the pseudoinverse. (Technically the Moore-Penrose pseudoinverse. There are other pseudoinverses, but Moore-Penrose is the most common.)

As you might hope/expect, the pseudoinverse of a right-shift matrix is a left-shift matrix. In this case the pseudoinverse is simply the transpose, though of course that isn’t always the case.

If you’d like to prove that the pseudoinverse of a matrix that shifts right by k places is a matrix that shifts left by k places, you don’t have to compute the pseudo inverse per se: you can verify your guess. This post gives four requirements for a pseudoinverse. You can prove that left shift is the inverse of right shift by showing that it satisfies the four equations.

The linear algebra of bit twiddling

The previous post looked at the tempering step of the Mersenne Twister, formulating a sequence of bit operations as multiplication by a matrix mod 2. This post will look at the components more closely.

The theorems of linear algebra generally hold independent of the field of scalars. Typically the field is ℝ or ℂ, but most of basic linear algebra works the same over every field [1]. In particular, we can do linear algebra over a finite field, and we’re interested in the most finite of finite fields GF(2), the field with just two elements, 0 and 1.

In GF(2), addition corresponds to XOR. We will denote this by ⊕ to remind us that although it’s addition, it’s not the usual addition, i.e. 1 ⊕ 1 = 0. Similarly, multiplication corresponds to AND. We’ll work with 8-bit numbers to make the visuals easier to see.

Shifting a number left one bit corresponds to multiplication by a matrix with 1’s below the diagonal main. Shifting left by k bits is the same as shifting left by 1 bit k times, so the the matrix representation for x << k is the kth power of the matrix representation of shifting left once. This matrix has 1s on the kth diagonal below the main diagonal. Below is the matrix for shifting left two bits, x << k.

Right shifts are the mirror image of left shifts. Here’s the matrix for shifting right two bits, x >> k.

Shifts are not fully invertible because bits either fall off the left or the right end. The steps in the Mersenne Twister are invertible because shifts are always XOR’d with the original argument. For example, although the function that takes x to x >> 2 is not invertible, the function that takes x to x ⊕ (x >> 2) is invertible. This operation corresponds to the matrix below.

This is an upper triangular matrix, so its determinant is the product of the diagonal elements. These are all 1s, so the determinant is 1, and the matrix is invertible.

Bitwise AND multiplies each bit of the input by the corresponding bit in another number known as the mask. The bits aligned with a 1 are kept and the bits aligned with a 0 are cleared. This corresponds to multiplying by a diagonal matrix whose diagonal elements correspond to the bits in the mask. For example, here is the matrix that corresponds to taking the bitwise AND with 10100100.

Each of the steps in the Mersenne Twister tempering process are invertible because they all correspond to triangular matrices with all 1’s on the diagonal. For example, the line

y ^= (y <<  7) & 0x9d2c5680 

says to shift the bits of y left 7 places, then zero out the elements corresponding to 0s in the mask, then XOR the result with y. In matrix terms, we multiply by a lower triangular matrix with zeros on the main diagonal, then multiply by a diagonal matrix that zeros out some of the terms, then add the identity matrix. So the matrix corresponding to the line of code above is lower triangular, with all 1s on the diagonal, so it is invertible.

[1] Until you get to eigenvalues. Then it matters whether the field is algebraically complete, which no finite field is.

Unified config files

I try to maintain a consistent work environment across computers that I use. The computers differ for important reasons, but I’d rather they not differ for unimportant reasons.

Unified keys

One thing I do is remap keys so that the same key does the same thing everywhere, to the extent that’s practical. This requires remapping keys. In particular, I want the key functionality, not the key name, to be the same across operating systems. For example, the Command key on a Mac does what the Control key does on Windows and Linux. I have my machines set up so the key in the lower left corner, whatever you call it, does things like copy, paste, and cut.

Unified config files

I use bash everywhere even though zshell is the default on Mac OS. But Linux and MacOS are sufficiently different that I have two .bashrc files, one for each OS. However, both .bashrc files source a common file .bash_aliases for aliases. You can set that up by putting the following code in your .bashrc file.

if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi

Sometimes you need some kind of branching logic even for two machines running the same OS. For example, if you have two computers, named Castor and Pollux, you might have code like the following in your .bashrc.

HOSTNAME_SHORT=$(hostname -s)
if [ "$HOSTNAME_SHORT" = "Castor" ]; then
...
elif [ "$HOSTNAME_SHORT" = "Pollux" ]; then
...

One problem with using hostname is that the OS can change the name on you. At least MacOS will do this; I don’t know whether other operating systems will. If you give your machine an uncommon name then this is unlikely to happen.

I have one init.el file for Emacs. It contains some branching logic for OS:

(cond
    ((string-equal system-type "windows-nt") ; Microsoft Windows
     ...)
    ((string-equal system-type "gnu/linux") ; linux
     ...)
    ((string-equal system-type "darwin") ; Mac
     ...)
)

You can also branch on hostname.

(if (string-equal (system-name) "Castor")
...)
(if (string-equal (system-name) "Pollux")
...)

This isn’t difficult, but in the short run it’s easier to just make one ad hoc edit to a config file at a time, letting the files drift apart. Along those lines, see bicycle skills.

Toffoli gates are all you need

Landauer’s principle gives a lower bound on the amount of energy it takes to erase one bit of information:

E ≥ log(2) kB T

where kB is the Boltzmann constant and T is the ambient temperature in Kelvin. The lower bound applies no matter how the bit is physically stored. There is no theoretical lower limit on the energy required to carry out a reversible calculation.

In practice the energy required to erase a bit is around a billion times greater than Landauer’s lower bound. You might reasonably conclude that reversible computing isn’t practical since we’re nowhere near the Landauer limit. And yet in practice reversible circuits have been demonstrated to use less energy than conventional circuits. We’re far from the ultimate physical limit, but reversibility still provides practical efficiency gains today.

A Toffoli gate is a building block of reversible circuits. A Toffoli gate takes three bits as input and returns three bits as output:

T(abc) = (abc XOR (a AND b)).

In words, a Toffoli gate flips its third bit if and only if the first two bits are ones.

A Toffoli gate is its own inverse, and so it is reversible. This is easy to prove. If ab = 1, then the third bit is flipped. Apply the Toffoli gate again flips the bit back to what it was. If ab = 0, i.e. at least one of the first two bits is zero, then the Toffoli gate doesn’t change anything.

There is a theorem that any Boolean function can be computed by a circuit made of only NAND gates. We’ll show that you can construct a NAND gate out of Toffoli gates, which shows any Boolean function can be computed by a circuit made of Toffoli gates, which shows any Boolean function can be computed reversibly.

To compute NAND, i.e. ¬ (ab), send (ab, 1) to the Toffoli gate. The third bit of the output will contain the NAND of a and b.

T(a, b, 1) = (ab, ¬ (ab))

A drawback of reversible computing is that you may have to send in more input than you’d like and get back more output than you’d like, as we can already see from the example above. NAND takes two input bits and returns one output bit. But the Toffoli gate simulating NAND takes three input bits and returns three output bits.

Related posts

Quantum Y2K

I’m skeptical that quantum computing will become practical. However, if it does become practical before we’re prepared, the world’s financial system could collapse. Everyone agrees we should prepare for quantum computing, even those of us who doubt it will be practical any time soon.

Quantum computers exist now, but the question is when and if a cryptographically relevant quantum computer (CRQC) is coming. At the moment, a quantum computer cannot factor 21 without cheating (i.e. not implementing circuits that you know a priori won’t be needed). But that could change suddenly. And some believe quantum computers could quickly go from being able to factor numbers with two digits to being able to factor numbers with thousands of digits (i.e. breaking RSA encryption) without much incremental transition between.

The move to post-quantum encryption may be a lot like Y2K, fixing vast amounts of 20th century software that represented years with two digits. Y2K turned out to be a big nothingburger, but only because the world spent half a trillion dollars on preparation to make sure it would be a big nothingburger.

Programmers in the 1970s obviously knew that the year 2000 was coming, but they also knew that they needed to conserve bytes at the time. And they assumed, reasonably but incorrectly, that their software would all be replaced before two-digit dates became a problem.

Programmers still need to conserve bytes, though this is less obvious today. Quantum-resistant signatures and encryption keys are one or two orders of magnitude bigger. This takes up bandwidth and storage space, which may or may not be a significant problem, depending on context. Programmers may conclude that it’s not (yet) worth the extra overhead to use post-quantum encryption. Like their counterparts 50 years ago, they may assume, rightly or wrongly, that their software will be replaced by the time it needs to be.

Moving to post-quantum cryptography ASAP is not a great idea if you can afford to be more strategic. It takes many years to gain confidence that new encryption algorithms are secure. The SIKE algorithm, for example, was a semi-finalist the NIST post-quantum encryption competition, but someone found a way to break it using an hour of computing on a laptop.

Another reason to not be in a hurry is that it may be possible to be more clever than simply replacing pre-quantum algorithms with post-quantum analogs. For example, some blockchains are exploring zero-knowledge proofs as a way to aggregate signatures. Simply moving to post-quantum signatures could make every transaction block 100 times bigger. But replacing a set of signatures by a (post-quantum) zero-knowledge proof of the existence of the signatures, transaction blocks could be smaller than now.

As with Y2K, the move to post-quantum cryptography will be gradual. Some things have already moved, and some are in transition now. You may have seen the following warning when connecting to a remote server.

** WARNING: connection is not using a post-quantum key exchange algorithm.
** This session may be vulnerable to "store now, decrypt later" attacks.
** The server may need to be upgraded. See https://openssh.com/pq.html

Key sizes don’t matter as much to sftp connections as they do to blockchains. And the maturity of post-quantum algorithms is mitigated by OpenSSH using hybrid encryption: well-established encryption (like ECDH) wrapped by newer quantum-resistant encryption (like MK-KEM). If the newer algorithm isn’t as secure as expected, you’re no worse off than if you had only used the older algorithm.

When clocks rolled over from 1999 to 2000 without incident, many people felt the concern about Y2K had been overblown. Maybe something similar will happen with quantum computing. Let’s hope so.

Related posts

Set intersection and difference at the command line

A few years ago I wrote about comm, a utility that lets you do set theory at the command line. It’s a really useful little program, but it has two drawbacks: the syntax is hard to remember, and the input files must be sorted.

If A and B are two sorted lists,

    comm A B

prints A − B, B − A, and A ∩ B. You usually don’t want all three, and so comm lets you filter the output. It’s a little quirky in that you specify what you don’t want instead of what you do. And you have to remember that 1, 2, and 3 correspond to A − B, B − A, and A ∩ B respectively.

Venn diagram of comm parameters

A couple little scripts can hide the quirks. I have a script intersect

    comm -12 <(sort "$1") <(sort "$2")

and another script setminus

    comm -23 <(sort "$1") <(sort "$2")

that sort the input files on the fly and eliminate the need to remember comm‘s filtering syntax.

The setminus script computes A − B. To find B − A call the script with the arguments reversed.

Embedded regex flags

The hardest part of using regular expressions is not crafting regular expressions per se. In my opinion, the two hardest parts are minor syntax variations between implementations, and all the environmental stuff outside of regular expressions per se.

Embedded regular expression modifiers address one of the environmental complications by putting the modifier in the regular expression itself.

For example, if you want to make a grep search case-insensitive, you pass it the -i flag. But if you want to make a regex case-insensitive inside a Python program, you pass a function the argument re.IGNORECASE. But if you put (?i) at the beginning of your regular expression, then the intention to make the match case-insensitive is embedded directly into the regex. You could use the regex in any environment that supports (?i) without having to know how to specify modifiers in that environment.

I was debugging a Python script this morning that worked under one version of Python and not under another. The root of the problem was that it was using re.findall() with several huge regular expression that had embedded modifiers. That was OK up to Python 3.5, then it was a warning between versions 3.6 and 3.10, and it’s an error in versions 3.11 and later.

The problem isn’t with all embedded modifiers, only global modifiers that don’t appear at the beginning of the regex. Older versions of Python, following Perl’s lead, would let you put a modifier like (?i) in the middle of a regex, and apply the modifier from that point to the end of the expression. In the latest versions of Python, you can either place the modifier at the beginning of the regex, or use a scoped modifier like (?:…) in the middle of the expression.

I didn’t want to edit the regular expressions in my code—some had over a thousand characters—so I changed re.findall() to regex.findall(). The third-party regex module is generally more Perl-compatible than Python’s standard re module.