Multicolor reproducing cellular automata

The previous post looked at a cellular automaton introduced by Edward Fredkin. It has only two states: a cell is on or off. At each step, each cell is set to the sum of the states of the neighboring cells mod 2. So a cell turns on if it had an odd number of neighbors on, and turns off if it had an even number of neighbors on.

You can look at cellular automata with more states, often represented as colors. If the number of states per cell is prime, then the extension of Fredkin’s automaton to multiple states also reproduces its initial pattern. Instead of taking the sum of the neighboring states mod 2, more generally you take the sum of the neighboring states mod p.

This theorem is due to Terry Winograd, cited in the same source as in the previous post.

Here’s an example with 11 states. The purple background represents 0. As before, I start with an ‘E’ pattern, but I make it multi-colored.

Initial state

Before turning on the automaton, we need to clarify what we mean by “neighborhood.” In this example, I mean cells to the north, south, east, and west. You could include the diagonals, but that would be a different example, though still self-reproducing.
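Here’s a minimal sketch of one update step, assuming the grid is a NumPy array of integers between 0 and p − 1 and using wrap-around (periodic) boundaries, one common convention.

    import numpy as np

    def step(grid, p):
        # Each cell becomes the sum of its north, south, east, and west
        # neighbors mod p; np.roll shifts the grid with wrap-around edges.
        return (np.roll(grid, 1, axis=0) + np.roll(grid, -1, axis=0) +
                np.roll(grid, 1, axis=1) + np.roll(grid, -1, axis=1)) % p

Iterating this function on the initial array produces the sequence of states shown below.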

After the first step, the pattern blurs.

step 1

After 10 steps the original pattern is nowhere to be seen.

step 10

And then suddenly on the next step the pattern reappears.

step 11

I don’t know whether there are theorems on how long it takes the pattern to reappear, or how the time depends on the initial pattern, but in my experiments the pattern reappears after 4 steps with 2 states, 9 steps with 3 states, 5 steps with 5 states, 7 steps with 7 states, and 11 steps with 11 states.

Here’s an animation of the 11-color version, going through 22 steps.

Animated gif of CA

Evolution of random number generators

The previous post showed that the bits of prime powers look kinda chaotic. When you plot them, they form a triangular shape because the size of the numbers is growing. The numbers are growing geometrically, so their binary representations are growing linearly. Here’s the plot for powers of 5:

bits of powers of 5 plotted

If you crop the triangle so that you just see a rectangular portion of it, things look less regular. If we look at powers of p modulo a power of 2, say 2^32, then we’re chopping off the left side of the triangle.

Suppose we don’t start with 1, but start with some other number, call it a seed, and multiply by 5 every time. Would we get a chaotic pattern? Yes, and we have just invented the congruential random number generator. These RNGs have the form

x_{n+1} = a x_n mod m

The RNG we implicitly described above would have a = 5 and m = 2^32. This is not a good choice of a and m, but it’s on to something. We can use different values of multiplier and modulus to make a decent RNG.

The RANDU random number generator, commonly used in the 1960’s, used a = 65539 and m = 2^31. This turned out to have notoriously bad statistical properties. We can see this with a small modification of the code used to produce the plot above.

    def out(s):
        s = s.replace('1', chr(0x2588))
        s = s.replace('0', ' ')
        print(s)    

    a, m = 65539, 2**31
    x = 7
    for i in range(32):
        x = a * x % m
        s = format(x, '32b')
        out(s)

Here’s the plot.

plotting bits of RANDU random number generator

The pattern on the right side looks like a roll of film. This shows that the lowest order bits are very regular. You can also see that we need to start with a large seed: starting with 7 created the large triangular holes at the top of the image. There are other patterns in the image that are hard to describe, but you get the sense something is not right, especially when you compare this image to a better alternative.

A much better choice of parameters is multiplier a = 48271 and modulus m = 2^31 − 1. These are the parameters in the MINSTD random number generator. Here’s a plot of its bits, also starting with seed 7.

plot of MINSTD bits
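To reproduce a plot like this, change only the multiplier and modulus in the code above (this assumes the out function defined earlier):

    a, m = 48271, 2**31 - 1
    x = 7
    for i in range(32):
        x = a * x % m
        s = format(x, '32b')
        out(s)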

The MINSTD generator is much better than RANDU, and adequate for some applications, but far from state of the art. You could do better by using a much larger multiplier and a modulus on the order of 2^64 or 2^128.

You could do even better by adding a permutation step to your congruential generator. That’s the idea behind the PCG (permuted congruential generator) family of random number generators. These generators pass all the standard statistical tests with flying colors.
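Here’s a toy sketch of that idea, not the actual PCG algorithm: advance an ordinary 64-bit congruential state, then apply a simple permutation to the state before outputting some of its bits. The constants and the xorshift permutation below are stand-ins for PCG’s carefully chosen parameters and output functions.

    def toy_permuted_lcg(seed):
        # Illustration only -- not the real PCG. Advance a 64-bit LCG,
        # then permute the state with a xorshift before outputting bits.
        a, c, m = 6364136223846793005, 1442695040888963407, 2**64
        x = seed
        while True:
            x = (a * x + c) % m           # congruential step
            yield (x ^ (x >> 33)) >> 32   # permuted output: high 32 bits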

There are many other approaches to random number generation, but this post shows one thread of historical development, how hypothetically someone could go from noticing that prime powers have chaotic bit patterns to inventing PCG.


Is fast grep faster?

The grep utility searches text files for regular expressions, but it can search for ordinary strings since these strings are a special case of regular expressions. However, if your regular expressions are in fact simply text strings, fgrep may be much faster than grep. Or so I’ve heard. I did some benchmarks to see.

Strictly speaking I used grep -F rather than fgrep. On Linux, if you ask for the man (manual) page for fgrep you’ll be taken to the man page for grep which says

In addition, the variant programs egrep, fgrep and rgrep are the same as grep -E, grep -F, and grep -r, respectively. These variants are deprecated, but are provided for backward compatibility.

I was working on a project for a client where I had to search for a long list of words in a long list of files [1]. This is the kind of task where fgrep (“fast grep”) is supposed to be much faster than grep. It was a tiny bit faster, not enough to notice. When I timed it the difference was on the order of 1%.

I ran an analogous search on my own computer with different data and got similar results [2]. There may be instances where fgrep is much faster than grep, but I haven’t seen one first hand.

I suspect that the performance difference between fgrep and grep used to be larger, but the latter has gotten more efficient. Now grep is smart enough to search for strings quickly without having to be told explicitly via -F that the regular expressions are in fact strings. Maybe it scans the regular expression(s) before searching and effectively sets the -F flag itself if appropriate.


[1] I used the -f flag to tell grep the name of a file containing the terms to search for, not to be confused with the additional flag -F to tell grep that the search terms are simply strings.

[2] I got similar results when I was using Linux (WSL) on my computer. When I used grep from GOW the -F flag made the search 24 times faster. Because the GOW project provides light-weight ports of Gnu tools to Windows, it’s understandable that it would not include some of the optimizations in Gnu’s implementation of grep.

Comparing files with very long lines

Suppose you need to compare two files with very long lines. Maybe each file is one line.

The diff utility will show you which lines differ. But if the files are each one line, this tells you almost nothing. It confirms that there is a difference, but it doesn’t show you where the difference is. It just dumps both files to the command line.

For example, I created two files each containing the opening paragraph of Moby Dick. The file temp1.txt contains the paragraph verbatim, and temp2.txt contains a typo. I ran

    diff temp1.txt temp2.txt

and this is what I saw:

screen shot of command line

The 1c1 output tells us that the difference occurs on line 1, but that’s no help because the whole file is line 1.

There are no doubt many ways to fix this problem, but the solution I thought of was to use fold to break the files into more lines before running diff. Without any optional arguments, fold will split a file into segments of 80 characters.

Here’s my first attempt.

    $ diff <(fold temp1.txt) <(fold temp2.txt)
    7c7
    < and bringing up the rear of every funeral I meet; and especially whenever my hy 
    --- 
    > and bringing up the rear of every funeral I meet; and especially whenever my ty

That’s better, but we lost context when we chopped the file into 80-character segments. We can’t see that the last words should be “hypos” and “typos”. This is a minor inconvenience, but there’s a bigger problem.

In the example above I simply changed one character, turning an ‘h’ into a ‘t’. But suppose I had changed “hypos” into “typoes”, not only changing a letter, but also inserting a letter. Then the rest of the files would split differently into 80-character segments. Even though the rest of the text is identical, we would be seeing differences created by fold.

This problem can be mitigated by giving fold the option -s, telling it to split lines at word boundaries.

    $ diff <(fold -s temp1.txt) <(fold -s temp2.txt) 
    8,9c8,9
    < whenever my hypos get such an upper hand of me, that it requires a strong moral
    < principle to prevent me from deliberately stepping into the street, and 
    --- 
    > whenever my typoes get such an upper hand of me, that it requires a strong
    > moral principle to prevent me from deliberately stepping into the street, and

The extra letter that I added changed the way the first line was broken; putting “moral” on the first line of the second output would have made the line 81 characters long. But other than pushing “moral” to the next line, the rest of chopped lines are the same.

You can use the -w option to have fold split lines at lengths other than 80 characters. If I split the files into lines 20 characters long, the differences between the two files are more obvious.

    $ diff <(fold -s -w 20 temp1.txt) <(fold -s -w 20 temp2.txt)
    34c34
    < my hypos get such 
    --- 
    > my typoes get such

Ah, much better.

Not only do shorter lines make it easier to see where the differences are; in this case changing the segment length also fixed the problem of a word being pushed to the next line. That won’t always be the case. It would not have been the case, for example, if I had used 15-character lines. But you can always change the number of characters slightly to eliminate the problem if you’d like.


Recursive grep

The regular expression search utility grep has a recursive switch -R, but it may not work like you’d expect.

Suppose you want to find the names of all .org files in your current directory and below that contain the text “cheese.”

You have four files, two in the working directory and two below, that all contain the same string: “I like cheese.”

    $ ls -R
    .:
    rootfile.org  rootfile.txt  sub
 
    ./sub:
    subfile.org  subfile.txt

It seems that grep -R can either search all files of the form *.org in the current directory, ignoring the -R switch, or search all files recursively if you don’t give it a file glob, but it can’t do both.

    $ grep -R -l cheese *.org
    rootfile.org
 
    $ grep -R -l cheese .
    ./rootfile.org
    ./rootfile.txt
    ./sub/subfile.org
    ./sub/subfile.txt

One way to solve this is with find and xargs:

    $ find . -name '*.org' | xargs grep -l cheese 
    ./rootfile.org                                                           
    ./sub/subfile.org                                                        

I was discussing this with Chris Toomey and he suggested an alternative using a subshell that seems more natural:

    grep -l cheese $(find . -name '*.org')

Now the code reads more like an ordinary call to grep. From left to right, it essentially says “Search for ‘cheese’ in files ending in .org” whereas the version with find reads like “Find files whose names end in .org and search them for ‘cheese.'” It’s good to understand how both approaches work.


tcgrep: grep rewritten in Perl

In The Perl Cookbook, Tom Christiansen gives his rewrite of the Unix utility grep that he calls tcgrep. You don’t have to know Perl to use tcgrep, but you can send it Perl regular expressions.

Why not grep with PCRE?

You can get basically the same functionality as tcgrep by using grep with its PCRE option -P. Since tcgrep searches directories recursively, a more direct comparison would be

    grep -R -P

However, your version of grep might not support -P. And if it does, its Perl-compatible regular expressions might not be completely Perl-compatible. The man page for grep on my machine says

    -P, --perl-regexp
        Interpret the pattern as a Perl-compatible regular 
        expression (PCRE). This is experimental and grep -P 
        may warn of unimplemented features.

The one implementation of regular expressions guaranteed to be fully Perl-compatible is Perl.

If the version of grep on your system supports the -P option and is adequately Perl-compatible, it will run faster than tcgrep. But if you find yourself on a computer that has Perl but not a recent version of grep, you may find tcgrep handy.

Installation

tcgrep is included as part of the Unicode::Tussle Perl module; since tcgrep is a wrapper around Perl, it is as Unicode-compliant as Perl is. So you could install tcgrep (and several more utilities) with

    cpan Unicode::Tussle

This worked for me on Linux without any issues but the install failed on Windows.

I installed tcgrep on Windows by simply copying the source code. (I don’t recall now where I found the source code. I didn’t see it this morning when I searched for it, but I imagine I could have found it if I’d been more persistent.) I commented out the definition of %Compress to disable searching inside compressed files since this feature required Unix utilities not available on Windows.

Consistency

Another reason to use tcgrep is consistency. Perl is criticized for being inconsistent. The Camel book itself says

In general, Perl functions do exactly what you want—unless you want consistency.

But Perl’s inconsistencies are different, and in my opinion less annoying, than the inconsistencies of Unix tools.

Perl is inconsistent in the sense that functions behave differently in different contexts, such as a scalar context or a list context.

Unix utilities are inconsistent across platforms and across tools. For example, a tool like sed will have different features on different platforms, and it will not support the same regular expressions as another tool such as awk.

Perl was written to be a “portable distillation of Unix culture.” As inconsistent as Perl is, it’s more consistent than Unix.


Low-tech transparency

I recently received two large data files from a client, with names like foo.xlsx and foo.csv. Presumably these are redundant; the latter is probably an export of the former. I did a spot check and that seems to be the case.

Then I had a bright idea: use pandas to make sure the two files are the same. It’s an elegant solution: import both files as data frames, then use the compare() function to verify that they’re the same.
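Here’s a sketch of the idea; in practice the column names, types, and encodings may need attention before compare() will run.

    import pandas as pd

    df1 = pd.read_excel("foo.xlsx")  # needs an Excel reader such as openpyxl
    df2 = pd.read_csv("foo.csv")

    # compare() returns an empty DataFrame when the two frames match
    print(df1.compare(df2))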

Except it didn’t work. I got a series of mysterious and/or misleading messages as I tried to track down the source of the problem, playing whack-a-mole with the data. There could be any number of reasons why compare() might not work on imported data: character encodings, inferred data types, etc.

So I used brute force. I exported the Excel file as CSV and compared the text files. This is low-tech, but transparent. It’s easier to compare text files than to plumb the depths of pandas.

One of the problems was that the data contained heights, such as 5'9". This causes problems with quoting, whether you enclose strings in single or double quotes. A couple quick sed one-liners resolved most of the mismatches. (Though not all. Of course it couldn’t be that simple …)

It’s easier to work with data in a high-level environment like pandas. But it’s also handy to be able to use low-level tools like diff and sed for troubleshooting.

I suppose someone could write a book on how to import CSV files. If all goes well, it’s one line of code. Then there are a handful of patterns that handle the majority of remaining cases. Then there’s the long tail of unique pathologies. As Tolstoy would say, happy data sets are all alike, but unhappy data sets are each unhappy in their own way.

Shells, quoting, and one-liners

The goal of this post is to show how to call Perl and Python one-liners from a shell. We want to be able to use bash on Linux, and cmd or PowerShell on Windows, ideally with the same code in all three shells.

Bash

Bash interprets text inside double quotes but not in single quotes. So if you want to send a string to a program unmolested, you surround it with single quotes.

For example, suppose I want to write a Perl one-line program to print hello. This works:

    perl -e 'print "hello\n"'

The single quotes tell the shell to pass what’s inside to perl without doing anything to it.

The corresponding Python example would be

    python -c 'print("hello\n")'

and this also works.

Cmd and PowerShell

For reasons I don’t understand, the Perl example above works from cmd, but the Python example does not.

And neither example works in PowerShell. This is particularly puzzling since PowerShell follows basically the same quoting rules as bash: double quotes interpolate but single quotes do not.

Workarounds

There are ways to fix the problems with cmd and PowerShell without breaking bash.

Perl

Perl has a handy feature for avoiding confusion over quotes: you can use q{} for single quotes and qq{} for double quotes. If I write the Perl example as

    perl -e 'print qq{hello\n}'

then it works everywhere: bash, cmd, and PowerShell.

Python

Python makes no distinction between single and double quotes, so you can reverse the role of single and double quotes to avoid quoting confusion. The code

    python -c "print('hello\n')"

works everywhere.

Printing quote marks

In the previous section we said we can get around problems in Perl by using q{} and qq{}, and get around problems in Python by switching to double quotes on the outside and single quotes on the inside. But what if we wanted to print a single or double quote mark, such as printing 'hello' or "hello"?

You can escape single quotes with \' and double quotes with \", but you run into things that work on cmd but not on bash, or work on bash but not in PowerShell, etc.

One sure-fire way to make quotes work, in Perl and in Python, and on every shell we’re considering, is to denote quotes by their Unicode values, i.e. to use \x27 for single quotes and \x22 for double quotes. The one-liners

    perl -e 'print qq{\x27hello\x27\n}'

and

    python -c "print('\x27hello\x27\n')"

print 'hello' in bash, cmd, and PowerShell.

And similarly

    perl -e 'print qq{\x22hello\x22\n}'

and

    python -c "print('\x22hello\x22\n')"

print "hello" in bash, cmd, and PowerShell.

Guidelines for portable one-liners

In summary, you can avoid problems with Perl one-liners by using q{} and qq{} inside code rather than single or double quotes. For Python one-liners, you can swap single quotes and double quotes.

On Linux (bash) surround code with single quotes. On Windows (cmd and PowerShell) surround code with double quotes.

Use character codes for literal quote characters. This isn’t necessary, but if you do this you won’t have to remember how quotes and escapes work on different platforms.


Redundant Residue Number Systems

You can uniquely represent a large number by its remainders when divided by smaller numbers, provided the smaller numbers have no factors in common, and you can carry out arithmetic in this representation. Such a representation is called a Residue Number System (RNS).

In the 1960’s people realized RNSs could be useful in computer arithmetic. The original papers are now hard to find, but you can find a summary of some of their ideas here. We will give examples working in a particular RNS below.

When you work with remainders by more numbers than necessary, you have a Redundant Residue Number System (RRNS) that provides error detection and correction. We will also demonstrate this with an example.

RNS example

To be concrete, we’ll use the remainders by 199, 233, 194, and 239 to represent numbers, following the work of C. W. Hastings and R. W. Watson referenced in the link cited above.

Let M = 199*233*194*239. Any non-negative integer less than M can be specified by its remainders mod 199, 233, 194, and 239.

The following Python code generates a random number less than M, represents it by its remainders, and then recovers the original number from the remainders using the Chinese Remainder Theorem.

    from random import randrange
    from sympy import prod
    from sympy.ntheory.modular import crt

    moduli = [199, 233, 194, 239]
    M = prod(moduli)

    x = randrange(M)
    remainders = [x % m for m in moduli]
    # See footnote [1]
    y = crt(moduli, remainders, symmetric=False)[0]

    print(x)
    print(remainders)
    print(y)

This printed

    1119075671
    [166, 204, 57, 235]
    1119075671

You can add, subtract, and multiply numbers by carrying out the corresponding operations on their remainders. There are three advantages to this.

First, you can work with smaller numbers. In our example, all the moduli are 8-bit numbers; we can carry out arithmetic on 32-bit numbers [2] by only working directly with 8-bit numbers. We could use the same idea to represent extremely large numbers by their remainders via 64-bit integers.

Second, we can do our calculations in parallel, working with each of our remainders at the same time.

Third, there are no carries. There’s no need to keep track of whether component calculations overflow.

The following code shows how we can add two numbers via their remainders.

    a = randrange(M//2)
    b = randrange(M//2)

    arem = [a % m for m in moduli]
    brem = [b % m for m in moduli]
    crem = [z[0] + z[1] for z in zip(arem, brem)]
    c = crt(moduli, crem, symmetric=False)[0]

    print(a + b)
    print(c)

When I ran this code, it printed 832537447 twice.

RRNS

Now we get to the redundant part. Suppose we add more numbers to our list of moduli, larger than the previous numbers and relatively prime to all of them and to each other. Now working modulo this larger list of numbers, we have more information than we need. If the results we get from using various subsets of the list of moduli are inconsistent, we’ve detected an error. And with enough extra information, we can correct the error.

Watson and Hastings added 251 and 509 in their example, and so we’ll do the same.

As before, we will generate a couple random numbers and represent them via their remainders, though now by a longer list of remainders. We will deliberately corrupt one of the component sums and then compute their sum using different choices of four moduli from the set of six.

    xmoduli = [199, 233, 194, 239, 251, 509]
    a = randrange(M//2)
    b = randrange(M//2)
    aremx = [a % m for m in xmoduli]
    bremx = [b % m for m in xmoduli]
    cremx = [z[0] + z[1] for z in zip(aremx, bremx)]
    cremx[1] += 13  # deliberately corrupt the second component

    c0 = crt(xmoduli[:4], cremx[:4], symmetric=False)[0]  # omits the last two moduli
    c1 = crt(xmoduli[2:], cremx[2:], symmetric=False)[0]  # omits the first two
    c2 = crt(xmoduli[:2] + xmoduli[4:], cremx[:2] + cremx[4:], symmetric=False)[0]  # omits the middle two
    print(c0, c1, c2)

This will return three different answers, so we know that there has been an error. When I ran it I got 70315884, 819846513, and 890162397. If you run the code yourself you’ll get different answers since you’ll generate different random numbers.

Now how do we correct the error? Suppose we know there has only been one error (or more realistically, we are willing to assume that the probability of two errors is tolerably small). Then one of the results above must be correct: the first sum excludes the last two moduli, the second excludes the first two, and the last excludes the middle two. One of them must exclude the error.

So we compute the sum again using a different subset of four moduli; the first such calculation that reproduces one of the results above identifies the correct answer. The code

    c3 = crt(xmoduli[:1] + xmoduli[2:5], cremx[:1] + cremx[2:5], symmetric=False)[0]  # omits the second and last moduli
    print(c3)

produced 819846513, matching the second sum above, so we know that eliminating the second modulus gives the correct answer. We’ve found the correct answer, and we’ve discovered which component was corrupted: the second, the only one omitted from both calculations that agree.

***

[1] A couple things about the call to crt require explanation. We set symmetric to False in order to get a positive return value. Otherwise we might get a value that is correct mod M but negative. Also, crt returns not just the solution we’re after but a pair consisting of the solution and the product of the moduli. We take the first element of the pair with [0] to get the part we’re interested in.

[2] Not all 32-bit numbers, only numbers less than M, and M is between 2^31 and 2^32.

How much can Boolean expressions be simplified?

In the previous post we looked at how to minimize Boolean expressions using a Python module qm. In this post we’d like to look at how much the minimization process shortens expressions.

With n Boolean variables, you can create 2^n minterms, products in which each variable appears exactly once, either plain or negated. You can specify a Boolean function by specifying the subset of such terms on which it takes the value 1, and so there are 2^(2^n) Boolean functions on n variables. For very small values of n we can minimize every possible Boolean function.

To do this, we need a way to iterate through the power set (set of all subsets) of the integers up to 2^n. Here’s a function to do that, borrowed from itertools recipes.

    from itertools import chain, combinations
    def powerset(iterable):
        xs = list(iterable)
        return chain.from_iterable(
            combinations(xs, n) for n in range(len(xs) + 1))
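
For example, here is what it yields on a small set:

    >>> list(powerset([0, 1, 2]))
    [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]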

Next, we use this code to run all Boolean functions on 3 variables through the minimizer. We use a matrix to keep track of how long the input expressions are and how long the minimized expressions are.

    from numpy import zeros
    from qm import qm

    n = 3
    N = 2**n
    tally = zeros((N,N), dtype=int)
    for p in powerset(range(N)):
        if not p:
            continue # qm can't take an empty set
        i = len(p)
        j = len(qm(ones=p, dc={}))
        tally[i-1, j-1] += 1 

Here’s a table summarizing the results [1].

The first column gives the number of product terms in the input expression and the subsequent columns give the number of product terms in the output expressions.

For example, of the expressions of length 2, there were 12 that could be reduced to expressions of length 1 but the remaining 16 could not be reduced. (There are 28 possible input expressions of length 2 because there are 28 ways to choose 2 items from a set of 8 things.)

There are no nonzero values above the main diagonal, i.e. no expression got longer in the process of minimization. Of course that’s to be expected, but it’s reassuring that nothing went obviously wrong.

We can repeat this exercise for expressions in 4 variables by setting n = 4 in the code above. This gives the following results.

We quickly run into a wall as n increases. Not only does the Quine-McCluskey algorithm take about twice as long every time we add a new variable, the number of possible Boolean functions grows even faster. There were 2^(2^3) = 256 possibilities to explore when n = 3, and 2^(2^4) = 65,536 when n = 4.

If we want to explore all Boolean functions on five variables, we need to look at 2^(2^5) = 4,294,967,296 possibilities. I estimate this would take over a year on my laptop. The qm module could be made more efficient, and in fact someone has done that. But even if you made the code a billion times faster, six variables would still be out of the question.

To explore functions of more variables, we need to switch from exhaustive enumeration to random sampling. I may do that in a future post. (Update: I did.)

***

[1] The raw data for the tables presented as images is available here.