# Multicolor reproducing cellular automata

The previous post looked at a cellular automaton introduced by Edward Fredkin. It has only two states: a cell is on or off. At each step, each cell is set to the sum of the states of the neighboring cells mod 2. So a cell is turned on if it had an odd number of neighbors on, and turned off if it had an even number of neighbors on.

You can look at cellular automata with more states, often represented as colors. If the number of states per cell is prime, then the extension of Fredkin’s automaton to multiple states also reproduces its initial pattern. Instead of taking the sum of the neighboring states mod 2, more generally you take the sum of the neighboring states mod p.
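As a sketch, here's what one generalized update step looks like in Python. I'm using NumPy with periodic (wrap-around) boundaries for simplicity; the grid handling in the examples below may differ.

```
import numpy as np

def step(grid, p):
    # Each cell becomes the sum of its north, south, east, and west
    # neighbors mod p. The cell's own state is not included.
    return (np.roll(grid,  1, axis=0) + np.roll(grid, -1, axis=0) +
            np.roll(grid,  1, axis=1) + np.roll(grid, -1, axis=1)) % p
```

With p = 2 this reduces to Fredkin's original rule: a single on cell becomes its four neighbors after one step, and after two steps the pattern is four shifted copies of the original cell.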

This theorem is due to Terry Winograd, cited in the same source as in the previous post.

Here’s an example with 11 states. The purple background represents 0. As before, I start with an ‘E’ pattern, but I make it multi-colored. Before turning on the automaton, we need to clarify what we mean by “neighborhood.” In this example, I mean cells to the north, south, east, and west. You could include the diagonals, but that would be a different example, though still self-reproducing.

After the first step, the pattern blurs. After 10 steps the original pattern is nowhere to be seen. And then suddenly on the next step the pattern reappears. I don’t know whether there are theorems on how long it takes the pattern to reappear, or how the time depends on the initial pattern, but in my experiments, the pattern reappears after 4 steps with 2 states, 9 steps with 3 states, 5 steps with 5 states, 7 steps with 7 states, and 11 steps with 11 states.

Here’s an animation of the 11-color version, going through 22 steps.

# Evolution of random number generators

The previous post showed that the bits of prime powers look kinda chaotic. When you plot them, they form a triangular shape because the size of the numbers is growing. The numbers are growing geometrically, so their binary representations are growing linearly. Here’s the plot for powers of 5. If you crop the triangle so that you just see a rectangular portion of it, things look less regular. If we look at powers of p modulo a power of 2, say 2^32, then we’re chopping off the left side of the triangle.
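For reference, a plot like that can be generated with a few lines of Python, printing a block character for each 1 bit and a space for each 0 (the same trick used in the RANDU code later in this post). The field width of 75 is just enough to hold 5^32:

```
def out(s):
    s = s.replace('1', chr(0x2588))
    s = s.replace('0', ' ')
    print(s)

x = 1
for i in range(32):
    x *= 5
    out(format(x, '75b'))  # right-justified, so the rows form a triangle
```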

Suppose we don’t start with 1, but start with some other number, call it a seed, and multiply by 5 every time. Would we get a chaotic pattern? Yes, and we have just invented the congruential random number generator. These RNGs have the form

x_{n+1} = a x_n mod m

The RNG we implicitly described above would have a = 5 and m = 2^32. This is not a good choice of a and m, but it’s on to something. We can use different values of multiplier and modulus to make a decent RNG.
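In code, the whole scheme is just a few lines. A minimal sketch (the function name and defaults are for illustration only):

```
from itertools import islice

def lcg(seed, a=5, m=2**32):
    # Generate the sequence x_{n+1} = a * x_n mod m.
    x = seed
    while True:
        x = a * x % m
        yield x

print(list(islice(lcg(1), 5)))  # [5, 25, 125, 625, 3125]
```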

The RANDU random number generator, commonly used in the 1960’s, used a = 65539 and m = 2^31. This turned out to have notoriously bad statistical properties. We can see this with a small modification of the code used to produce the plot above.

```
def out(s):
    s = s.replace('1', chr(0x2588))
    s = s.replace('0', ' ')
    print(s)

a, m = 65539, 2**31
x = 7
for i in range(32):
    x = a * x % m
    s = format(x, '32b')
    out(s)
```

Here’s the plot. The pattern on the right side looks like a roll of film. This shows that the lowest order bits are very regular. You can also see that we need to start with a large seed: starting with 7 created the large triangular holes at the top of the image. There are other patterns in the image that are hard to describe, but you get the sense something is not right, especially when you compare this image to a better alternative.

A much better choice of parameters is multiplier a = 48271 and modulus m = 2^31 - 1. These are the parameters in the MINSTD random number generator. Here’s a plot of its bits, also starting with seed 7. The MINSTD generator is much better than RANDU, and adequate for some applications, but far from state of the art. You could do better by using a much larger multiplier and a modulus on the order of 2^64 or 2^128.
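The plot comes from the same loop as the RANDU code above, with the MINSTD parameters dropped in. A sketch:

```
def out(s):
    s = s.replace('1', chr(0x2588))
    s = s.replace('0', ' ')
    print(s)

a, m = 48271, 2**31 - 1
x = 7
for i in range(32):
    x = a * x % m
    out(format(x, '31b'))  # 31 columns, since m is just under 2^31
```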

You could do even better by adding a permutation step to your congruential generator. That’s the idea behind the PCG (permuted congruential generator) family of random number generators. These generators pass all the standard statistical tests with flying colors.
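As a sketch of the idea, here is the PCG32 “XSH-RR” variant in Python. The multiplier is the one PCG uses by default; the increment can be any odd constant (Knuth’s MMIX constant is used here for illustration).

```
MASK64 = 2**64 - 1

def pcg32_step(state, mult=6364136223846793005, inc=1442695040888963407):
    # Congruential step on the 64-bit state.
    new_state = (state * mult + inc) & MASK64
    # Permute the *old* state into a 32-bit output:
    # xorshift the high bits down, then rotate by the top 5 bits.
    xorshifted = (((state >> 18) ^ state) >> 27) & 0xFFFFFFFF
    rot = state >> 59
    out = ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & 0xFFFFFFFF
    return new_state, out
```

Each call advances the state and yields one 32-bit output. The data-dependent rotation is the “permutation” that scrambles the regular low-order bits a plain congruential generator would produce.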

There are many other approaches to random number generation, but this post shows one thread of historical development: how someone might, hypothetically, have gone from noticing that prime powers have chaotic bit patterns to inventing PCG.

# Is fast grep faster?

The `grep` utility searches text files for regular expressions, but it can search for ordinary strings since these strings are a special case of regular expressions. However, if your regular expressions are in fact simply text strings, `fgrep` may be much faster than `grep`. Or so I’ve heard. I did some benchmarks to see.

Strictly speaking I used `grep -F` rather than `fgrep`. On Linux, if you ask for the man (manual) page for `fgrep` you’ll be taken to the man page for `grep` which says

In addition, the variant programs `egrep`, `fgrep` and `rgrep` are the same as `grep -E`, `grep -F`, and `grep -r`, respectively. These variants are deprecated, but are provided for backward compatibility.

I was working on a project for a client where I had to search for a long list of words in a long list of files. This is the kind of task where `fgrep` (“fast grep”) is supposed to be much faster than `grep`. It was a tiny bit faster, but not enough to notice. When I timed it the difference was on the order of 1%.

I ran an analogous search on my own computer with different data and got similar results. There may be instances where `fgrep` is much faster than `grep`, but I haven’t seen one first hand.

I suspect that the performance difference between `fgrep` and `grep` used to be larger, but the latter has gotten more efficient. Now `grep` is smart enough to search for strings quickly without having to be told explicitly via `-F` that the regular expressions are in fact strings. Maybe it scans the regular expression(s) before searching and effectively sets the `-F` flag itself if appropriate.

***

I used the `-f` flag to tell `grep` the name of a file containing the terms to search for, not to be confused with the capital `-F` flag that tells `grep` the search terms are simply strings.

I got similar results when I was using Linux (WSL) on my computer. When I used `grep` from GOW the `-F` flag made the search 24 times faster. Because the GOW project provides light-weight ports of Gnu tools to Windows, it’s understandable that it would not include some of the optimizations in Gnu’s implementation of `grep`.

# Comparing files with very long lines

Suppose you need to compare two files with very long lines. Maybe each file is one line.

The `diff` utility will show you which lines differ. But if the files are each one line, this tells you almost nothing. It confirms that there is a difference, but it doesn’t show you where the difference is. It just dumps both files to the command line.

For example, I created two files, `temp1.txt` and `temp2.txt`, each containing the opening paragraph of Moby Dick. The first contains the paragraph verbatim, and the second contains a typo. I ran

`diff temp1.txt temp2.txt`

and the two one-line files were dumped out in their entirety. The `1c1` in the output tells us that the difference occurs on line 1, but that’s no help because the whole file is line 1.

There are no doubt many ways to fix this problem, but the solution I thought of was to use `fold` to break the files into more lines before running `diff`. Without any optional arguments, `fold` will split a file into segments of 80 characters.

Here’s my first attempt.

```
$ diff <(fold temp1.txt) <(fold temp2.txt)
7c7
< and bringing up the rear of every funeral I meet; and especially whenever my hy
---
> and bringing up the rear of every funeral I meet; and especially whenever my ty
```

That’s better, but we lost context when we chopped the file into 80-character segments. We can’t see that the last words should be “hypos” and “typos”. This is a minor inconvenience, but there’s a bigger problem.

In the example above I simply changed one character, turning an ‘h’ into a ‘t’. But suppose I had changed “hypos” into “typoes”, not only changing a letter, but also inserting a letter. Then the rest of the files would split differently into 80-character segments. Even though the rest of the text is identical, we would be seeing differences created by `fold`.

This problem can be mitigated by giving `fold` the option `-s`, telling it to split lines at word boundaries.

```
$ diff <(fold -s temp1.txt) <(fold -s temp2.txt)
8,9c8,9
< whenever my hypos get such an upper hand of me, that it requires a strong moral
< principle to prevent me from deliberately stepping into the street, and
---
> whenever my typoes get such an upper hand of me, that it requires a strong
> moral principle to prevent me from deliberately stepping into the street, and
```

The extra letter that I added changed the way the first line was broken; putting “moral” on the first line of the second output would have made the line 81 characters long. But other than pushing “moral” to the next line, the rest of the chopped lines are the same.

You can use the `-w` option to have `fold` split lines at lengths other than 80 characters. If I split the files into lines 20 characters long, the differences between the two files are more obvious.

```
$ diff <(fold -s -w 20 temp1.txt) <(fold -s -w 20 temp2.txt)
34c34
< my hypos get such
---
> my typoes get such
```

Ah, much better.

Not only do shorter lines make it easier to see where the differences are, in this case changing the segment length fixed the problem of a word being pushed to the next line. That won’t always be the case. It would not have been the case, for example, if I had used 15-character lines. But you can always change the number of characters slightly to eliminate the problem if you’d like.

# Recursive grep

The regular expression search utility `grep` has a recursive switch `-R`, but it may not work the way you’d expect.

Suppose you want to find the names of all .org files in your current directory and below that contain the text “cheese.”

You have four files, two in the working directory and two below, that all contain the same string: “I like cheese.”

```
$ ls -R
.:
rootfile.org  rootfile.txt  sub

./sub:
subfile.org  subfile.txt
```

It seems that `grep -R` can either search all files of the form *.org in the current directory, ignoring the -R switch, or search all files recursively if you don’t give it a file glob, but it can’t do both.

```
$ grep -R -l cheese *.org
rootfile.org

$ grep -R -l cheese .
./rootfile.org
./rootfile.txt
./sub/subfile.org
./sub/subfile.txt
```

One way to solve this is with `find` and `xargs`:

```
$ find . -name '*.org' | xargs grep -l cheese
./rootfile.org
./sub/subfile.org
```

I was discussing this with Chris Toomey and he suggested an alternative using a subshell that seems more natural:

`grep -l cheese $(find . -name '*.org')`

Now the code reads more like an ordinary call to `grep`. From left to right, it essentially says “Search for ‘cheese’ in files ending in .org,” whereas the version with `find` reads like “Find files whose names end in .org and search them for ‘cheese.’” It’s good to understand how both approaches work.

# tcgrep: grep rewritten in Perl

In The Perl Cookbook, Tom Christiansen gives his rewrite of the Unix utility `grep` that he calls `tcgrep`. You don’t have to know Perl to use `tcgrep`, but you can send it Perl regular expressions.

## Why not grep with PCRE?

You can get basically the same functionality as `tcgrep` by using `grep` with its PCRE option `-P`. Since `tcgrep` searches directories recursively, a more direct comparison would be

`grep -R -P`

However, your version of `grep` might not support `-P`. And if it does, its Perl-compatible regular expressions might not be completely Perl-compatible. The `man` page for `grep` on my machine says

```
-P, --perl-regexp
       Interpret the pattern as a Perl-compatible regular
       expression (PCRE). This is experimental and grep -P
       may warn of unimplemented features.
```

The one implementation of regular expressions guaranteed to be fully Perl-compatible is Perl.

If the version of `grep` on your system supports the `-P` option and is adequately Perl-compatible, it will run faster than `tcgrep`. But if you find yourself on a computer that has `Perl` but not a recent version of `grep`, you may find `tcgrep` handy.

## Installation

`tcgrep` is included as part of the `Unicode::Tussle` Perl module; since `tcgrep` is a wrapper around Perl, it is as Unicode-compliant as Perl is. So you could install `tcgrep` (and several more utilities) with

`cpan Unicode::Tussle`

This worked for me on Linux without any issues, but the install failed on Windows.

I installed `tcgrep` on Windows by simply copying the source code. (I don’t recall now where I found the source code. I didn’t see it this morning when I searched for it, but I imagine I could have found it if I’d been more persistent.) I commented out the definition of `%Compress` to disable searching inside compressed files since this feature required Unix utilities not available on Windows.

## Consistency

Another reason to use `tcgrep` is consistency. Perl is criticized for being inconsistent. The Camel book itself says

In general, Perl functions do exactly what you want—unless you want consistency.

But Perl’s inconsistencies are different, and in my opinion less annoying, than the inconsistencies of Unix tools.

Perl is inconsistent in the sense that functions behave differently in different contexts, such as a scalar context or a list context.

Unix utilities are inconsistent across platforms and across tools. For example, a tool like `sed` will have different features on different platforms, and it will not support the same regular expressions as another tool such as `awk`.

Perl was written to be a “portable distillation of Unix culture.” As inconsistent as Perl is, it’s more consistent than Unix.

# Low-tech transparency

I recently received two large data files from a client, with names like `foo.xlsx` and `foo.csv`. Presumably these are redundant; the latter is probably an export of the former. I did a spot check and that seems to be the case.

Then I had a bright idea: use `pandas` to make sure the two files are the same. It’s an elegant solution: import both files as data frames, then use the `compare()` function to verify that they’re the same.

Except it didn’t work. I got a series of mysterious and/or misleading messages as I tried to track down the source of the problem, playing whack-a-mole with the data. There could be any number of reasons why `compare()` might not work on imported data: character encodings, inferred data types, etc.

So I used brute force. I exported the Excel file as CSV and compared the text files. This is low-tech, but transparent. It’s easier to compare text files than to plumb the depths of `pandas`.

One of the problems was that the data contained heights, such as `5'9"`. This causes problems with quoting, whether you enclose strings in single or double quotes. A couple quick `sed` one-liners resolved most of the mismatches. (Though not all. Of course it couldn’t be that simple …)

It’s easier to work with data in a high-level environment like `pandas`. But it’s also handy to be able to use low-level tools like `diff` and `sed` for troubleshooting.

I suppose someone could write a book on how to import CSV files. If all goes well, it’s one line of code. Then there are a handful of patterns that handle the majority of remaining cases. Then there’s the long tail of unique pathologies. As Tolstoy would say, happy data sets are all alike, but unhappy data sets are each unhappy in their own way.

# Shells, quoting, and one-liners

The goal of this post is to show how to call Perl and Python one-liners from a shell. We want to be able to use bash on Linux, and cmd or PowerShell on Windows, ideally with the same code in all three shells.

## Bash

Bash interprets text inside double quotes but not in single quotes. So if you want to send a string to a program unmolested, you surround it with single quotes.

For example, suppose I want to write a Perl one-line program to print `hello`. This works:

`perl -e 'print "hello\n"'`

The single quotes tell the shell to pass what’s inside to `perl` without doing anything to it.

The corresponding Python example would be

`python -c 'print("hello\n")'`

and this also works.

## Cmd and PowerShell

For reasons I don’t understand, the Perl example above works from cmd, but the Python example does not.

And neither example works in PowerShell. This is particularly puzzling since PowerShell follows basically the same quoting rules as bash: double quotes interpolate but single quotes do not.

## Workarounds

There are ways to fix the problems with cmd and PowerShell without breaking bash.

### Perl

Perl has a handy feature for avoiding confusion over quotes: you can use `q{}` for single quotes and `qq{}` for double quotes. If I write the Perl example as

`perl -e 'print qq{hello\n}'`

then it works everywhere: bash, cmd, and PowerShell.

### Python

Python makes no distinction between single and double quotes, so you can reverse the role of single and double quotes to avoid quoting confusion. The code

`python -c "print('hello\n')"`

works everywhere.

## Printing quote marks

In the previous section we said we can get around problems in Perl by using `q{}` and `qq{}`, and get around problems in Python by switching to double quotes on the outside and single quotes on the inside. But what if we wanted to print a single or double quote mark, such as printing `'hello'` or `"hello"`?

You can escape single quotes with `\'` and double quotes with `\"`, but you run into things that work on cmd but not on bash, or work on bash but not in PowerShell, etc.

One sure-fire way to make quotes work, in Perl and in Python, and on every shell we’re considering, is to denote quotes by their Unicode values, i.e. to use `\x27` for single quotes and `\x22` for double quotes. The one-liners

`perl -e 'print qq{\x27hello\x27\n}'`

and

`python -c "print('\x27hello\x27\n')"`

print `'hello'` in bash, cmd, and PowerShell.

And similarly

`perl -e 'print qq{\x22hello\x22\n}'`

and

`python -c "print('\x22hello\x22\n')"`

print `"hello"` in bash, cmd, and PowerShell.

## Guidelines for portable one-liners

In summary, you can avoid problems with Perl one-liners by using `q{}` and `qq{}` inside code rather than single or double quotes. For Python one-liners, you can swap single quotes and double quotes.

On Linux (bash) surround code with single quotes. On Windows (cmd and PowerShell) surround code with double quotes.

Use character codes for literal quote characters. This isn’t necessary, but if you do this you won’t have to remember how quotes and escapes work on different platforms.

# Redundant Residue Number Systems

You can uniquely represent a large number by its remainders when divided by smaller numbers, provided the smaller numbers have no factors in common, and carry out arithmetic in this representation. Such a representation is called a Residue Number System (RNS).

In the 1960’s people realized RNSs could be useful in computer arithmetic. The original papers are now hard to find, but you can find a summary of some of their ideas here. We will give examples working in a particular RNS below.

When you work with remainders by more numbers than necessary, you have a Redundant Residue Number System (RRNS) that provides error detection and correction. We will also demonstrate this with an example.

## RNS example

To be concrete, we’ll use the remainders by 199, 233, 194, and 239 to represent numbers, following the work of C. W. Hastings and R. W. Watson referenced in the link cited above.

Let M = 199*233*194*239. Any non-negative integer less than M can be specified by its remainders mod 199, 233, 194, and 239.

The following Python code generates a random number less than M, represents it by its remainders, and then recovers the original number from the remainders using the Chinese Remainder Theorem.

```
from random import randrange
from sympy import prod
from sympy.ntheory.modular import crt

moduli = [199, 233, 194, 239]
M = prod(moduli)

x = randrange(M)
remainders = [x % m for m in moduli]
# See the footnote at the bottom of the post
y = crt(moduli, remainders, symmetric=False)[0]

print(x)
print(remainders)
print(y)
```

This printed

```
1119075671
[166, 204, 57, 235]
1119075671
```

You can add, subtract, and multiply numbers by carrying out the corresponding operations on their remainders. There are three advantages to this.

First, you can work with smaller numbers. In our example, all the moduli are 8-bit numbers; we can carry out arithmetic on 32-bit numbers by only working directly with 8-bit numbers. (Not quite all 32-bit numbers; see the footnote at the bottom of the post.) We could use the same idea to represent extremely large numbers by their remainders via 64-bit integers.

Second, we can do our calculations in parallel, working with each of our remainders at the same time.

Third, there are no carries. There’s no need to keep track of whether component calculations overflow.

The following code shows how we can add two numbers via their remainders.

```
a = randrange(M//2)
b = randrange(M//2)

arem = [a % m for m in moduli]
brem = [b % m for m in moduli]
crem = [sum(z) for z in zip(arem, brem)]
c = crt(moduli, crem, symmetric=False)[0]

print(a + b)
print(c)
```

When I ran this code, it printed 832537447 twice.
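Multiplication works the same way: multiply the residues componentwise, reduce by each modulus, and reconstruct. Here’s a stdlib-only sketch (the `from_residues` helper is mine, not part of the code above; it implements the usual Chinese Remainder Theorem formula):

```
from math import prod
from random import randrange

moduli = [199, 233, 194, 239]
M = prod(moduli)

def from_residues(rems, moduli):
    # Chinese Remainder Theorem: reconstruct x mod M from x mod each m.
    M = prod(moduli)
    total = 0
    for r, m in zip(rems, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(Mi, -1, m) is the inverse of Mi mod m
    return total % M

# Keep the factors under sqrt(M) so the product stays under M.
a = randrange(int(M**0.5))
b = randrange(int(M**0.5))
prem = [(a % m) * (b % m) % m for m in moduli]
print(from_residues(prem, moduli) == a * b)  # True
```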

## RRNS

Now we get to the redundant part. Suppose we add more numbers to our list of moduli, larger than the previous numbers and relatively prime to all of them and to each other. Now working modulo this larger list of numbers, we have more information than we need. If the results we get from using various subsets of the list of moduli are inconsistent, we’ve detected an error. And with enough extra information, we can correct the error.

Watson and Hastings added 251 and 509 in their example, and so we’ll do the same.

As before, we will generate a couple random numbers and represent them via their remainders, though now by a longer list of remainders. We will deliberately corrupt one of the component sums and then compute their sum using different choices of four moduli from the set of six.

```
xmoduli = [199, 233, 194, 239, 251, 509]
a = randrange(M//2)
b = randrange(M//2)
aremx = [a % m for m in xmoduli]
bremx = [b % m for m in xmoduli]
cremx = [sum(z) for z in zip(aremx, bremx)]
cremx[1] += 13  # deliberately corrupt the second component

c0 = crt(xmoduli[:4], cremx[:4], symmetric=False)[0]
c1 = crt(xmoduli[2:], cremx[2:], symmetric=False)[0]
c2 = crt(xmoduli[:2] + xmoduli[4:], cremx[:2] + cremx[4:], symmetric=False)[0]
print(c0, c1, c2)
```

This will return three different answers, so we know that there has been an error. When I ran it I got 70315884, 819846513, and 890162397. If you run the code yourself you’ll get different answers since you’ll generate different random numbers.

Now how do we correct the error? Suppose we know there has only been one error (or more realistically, we are willing to assume that the probability of two errors is tolerably small). Then one of the results above must be correct: the first sum excludes the last two moduli, the second excludes the first two, and the last excludes the middle two. One of them must exclude the error.

The first calculation based on a different subset of moduli that gives one of the results above is correct. The code

```
c3 = crt(xmoduli[:1] + xmoduli[2:5], cremx[:1] + cremx[2:5], symmetric=False)[0]
print(c3)
```

produced 890162397, matching the third sum above, so we know that eliminating the second modulus gives the correct answer. We’ve found the correct answer, and we’ve discovered which component was corrupted.

***

 A couple things about the call to `crt` require explanation. We set `symmetric` to `False` in order to get a positive return value. Otherwise we might get a value that is correct mod M but negative. Also, `crt` returns not just the solution we’re after but a pair consisting of the solution and the product of the moduli. We take the first element of the pair with `[0]` to get the part we’re interested in.

 Not quite all 32-bit numbers: only numbers less than M, and M is between 2^31 and 2^32.

# How much can Boolean expressions be simplified?

In the previous post we looked at how to minimize Boolean expressions using a Python module `qm`. In this post we’d like to look at how much the minimization process shortens expressions.

With n Boolean variables, you can create 2^n terms that are a product of distinct variables. You can specify a Boolean function by specifying the subset of such terms on which it takes the value 1, and so there are 2^(2^n) Boolean functions on n variables. For very small values of n we can minimize every possible Boolean function.

To do this, we need a way to iterate through the power set (set of all subsets) of the integers up to 2^n. Here’s a function to do that, borrowed from itertools recipes.

```
from itertools import chain, combinations

def powerset(iterable):
    xs = list(iterable)
    return chain.from_iterable(
        combinations(xs, n) for n in range(len(xs) + 1))
```

Next, we use this code to run all Boolean functions on 3 variables through the minimizer. We use a matrix to keep track of how long the input expressions are and how long the minimized expressions are.

```
from numpy import zeros
from qm import qm

n = 3
N = 2**n
tally = zeros((N, N), dtype=int)
for p in powerset(range(N)):
    if not p:
        continue  # qm can't take an empty set
    i = len(p)
    j = len(qm(ones=p, dc={}))
    tally[i-1, j-1] += 1
```

Here’s a table summarizing the results. The first column gives the number of product terms in the input expression and the subsequent columns give the number of product terms in the output expressions.

For example, of the expressions of length 2, there were 12 that could be reduced to expressions of length 1 but the remaining 16 could not be reduced. (There are 28 possible input expressions of length 2 because there are 28 ways to choose 2 items from a set of 8 things.)

There are no nonzero values above the main diagonal, i.e. no expression got longer in the process of minimization. Of course that’s to be expected, but it’s reassuring that nothing went obviously wrong.

We can repeat this exercise for expressions in 4 variables by setting n = 4 in the code above. This gives the following results. We quickly run into a wall as n increases. Not only does the Quine-McCluskey algorithm take about twice as long every time we add a new variable, the number of possible Boolean functions grows even faster. There were 2^(2^3) = 256 possibilities to explore when n = 3, and 2^(2^4) = 65,536 when n = 4.

If we want to explore all Boolean functions on five variables, we need to look at 2^(2^5) = 4,294,967,296 possibilities. I estimate this would take over a year on my laptop. The `qm` module could be made more efficient, and in fact someone has done that. But even if you made the code a billion times faster, six variables would still be out of the question.

To explore functions of more variables, we need to switch from exhaustive enumeration to random sampling. I may do that in a future post. (Update: I did.)

***

 The raw data for the tables presented as images is available here.