Upper case, lower case, title case

Converting text to all upper case or all lower case is a fairly common task.

One way to convert text to upper case would be to use the tr utility to replace the letters a through z with the letters A through Z. For example,

    $ echo Now is the time | tr '[a-z]' '[A-Z]'
    NOW IS THE TIME

You could convert to lower case by reversing the arguments to tr.

The approach above works if your text consists of only unadorned Roman letters. But it wouldn’t work, for example, if you gave it a jalapeño or π:

    $ echo jalapeño π | tr '[a-z]' '[A-Z]'
    JALAPEñO π

Using the character classes [:lower:] and [:upper:] won’t help either.

Tussling with Unicode

One alternative would be to use the uc command from the Unicode::Tussle package [1] I mentioned a few days ago. There’s also a lc counterpart, and a tc for title case. These utilities handle far more than Roman letters.

    $ echo jalapeño π | uc
    JALAPEÑO Π

Unicode capitalization rules are a black hole, but we’ll just look at one example and turn around quickly before we cross the event horizon.

Suppose you want to send all the letters in the Greek word σόφος to upper case.

    $ echo σόφος | uc
    ΣΌΦΟΣ

Greek has two lower case forms of sigma: ς at the end of a word and σ everywhere else. But there’s only one upper case sigma, so both get mapped to Σ. This means that if we convert the text to upper case and then to lower case, we won’t end up exactly where we started.

    $ echo σόφος | uc | lc
    σόφοσ

Note that the lc program chose σ as the lower case of Σ and didn’t take into account that it was at the end of a word.
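
If you'd rather experiment in Python than install the Perl utilities, the built-in str.upper and str.lower methods show the same behavior. This is a quick check of my own, not part of the Unicode::Tussle package:

    s = "σόφος"
    up = s.upper()
    print(up)            # ΣΌΦΟΣ — both forms of sigma map to Σ
    print(up.lower())    # σόφοσ — the final sigma is not restored
    print("jalapeño π".upper())   # JALAPEÑO Π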

[1] “Tussle” is an acronym for Tom [Christiansen]’s Unicode Scripts So Life is Easier.

Technological boundary layer

The top sides of your ceiling fan blades are dusty because of a boundary layer effect. When the blades spin, a thin layer of air above the blades moves with the blades. That’s why the fan doesn’t throw off the dust.

A mathematical model may have very different behavior in two large regions, with a thin region of rapid transition between the two, such as the transition between the ceiling fan blade and the circulating air in the room. That transition region is called a boundary layer. In this post I want to use the term boundary layer metaphorically.

Some information you use so frequently that you memorize it without trying. And some information you use so infrequently that it’s not worth memorizing it.

There’s not much need to deliberately memorize how software tools work because that information mostly falls into one of the two categories above. Either you use the tool so often that you remember how to use it, or you use the tool so rarely that it’s best to look up how to use it just-in-time.

But there’s a thin region between these two categories: things you use often enough that it’s annoying to keep looking up how to use them, but not so often that you remember the details. In this technological boundary layer, it might be worthwhile to deliberately memorize and periodically review how things work. Maybe this takes the form of flashcards [1] or exercises.

This boundary layer must be kept very small. If you spend more than a few minutes a day on it, you’re probably taking on too much. YMMV.

Things may move in and out of your technological boundary layer over time. Maybe you use or review something so much that it moves into long-term memory. Or maybe you use it less often and decide to let it slip into things you look up as needed.

[1] Paper or electronic? In my experience, making paper flashcards helps me memorize things, even if I don’t review them. But I’ve also used the Anki software before and found it helpful.

The magic AGM box

Suppose you are visited by aliens from halfway across the galaxy. After asking you a lot of questions, they give you a parting gift: little black boxes that can compute

x − x²/2 + x³/3 − …

with unbelievable speed and accuracy. You say thank you and your visitors vanish.

You get back home and wonder what you can do with your black boxes. The series the boxes compute looks vaguely familiar, but you can’t remember what it is. You call up a friend and he tells you that it’s the Taylor series for log(1 + x).

OK, so now what?

Your friend tells you he can predict what the boxes will output. He tells you, for example, that if you enter 0.5 it will output 0.4054651081081644. You try it, and your friend is right. At least partly. If you ask your box to give you only 16 digits, it will give you exactly what your friend said it would. But you could ask it for a thousand digits or a million digits or a billion digits, something your friend cannot do, at least not quickly.

Then you realize your friend has things backward. The way to exploit these boxes is not to compute logs on your laptop to predict their output, but to use the boxes instead of your laptop.

So you have a way to compute logs. You can bootstrap that to compute inverse logs, i.e. exp(x). And you can bootstrap that to compute sines and cosines. You try to compute anything you can starting from logs.

Enter the AGM

The preceding story was an introduction to the AGM, the arithmetic-geometric mean. It is the common limit you reach by repeatedly replacing a pair of numbers with their arithmetic and geometric means. More on that here. What I want to focus on here is that the AGM can be computed extremely quickly.

Each iteration in the process of computing the AGM doubles the number of correct figures in the answer. Suppose you want to compute its output to a billion decimal places, and you’ve calculated a million decimal places. You need to compute 999,000,000 more decimal places, but you’re nearly there! Ten more steps and you’ll have all billion decimal places.
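
Here is a rough sketch of that convergence, using the mpmath library (my code, not the Borweins'). The number of correct decimal places roughly doubles with each iteration:

    from mpmath import mp, mpf, sqrt, log10, agm

    mp.dps = 1000                     # work with 1000 decimal digits
    a, b = mpf(1), mpf(2)
    reference = agm(1, 2)             # mpmath's built-in AGM for comparison

    for step in range(1, 13):
        a, b = (a + b) / 2, sqrt(a * b)   # arithmetic and geometric means
        err = abs(a - reference)
        digits = mp.dps if err == 0 else int(-log10(err))
        print(step, digits)           # correct digits roughly double each step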

If you want to compute something to millions of digits, it would make sense to try to compute it in terms of the AGM. This was the research program of the brothers Jonathan and Peter Borwein. Much of this research was codified in their book Pi and the AGM. They used the AGM to compute π to crazy precision, but that wasn’t their goal per se.

Computing π was a demonstration project for a deeper agenda. While describing the work of the Borwein brothers, Richard Brent said

… theorems about π are often just the tips of “mathematical icebergs”—much of interest lies hidden beneath the surface.

The AGM of x and y equals

π(x + y) / (4 K((x − y)/(x + y)))

where K is the “complete elliptic integral of the first kind.” You might reasonably think “Great. I’ll keep that in mind if I ever need to compute the complete elliptic integral of the first kind, whatever that is.” But K is like the log function in the story above, something that can be bootstrapped to compute other things.
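
Here is a quick numerical sanity check of the formula, again using mpmath and my own choice of x and y. Note that mpmath’s ellipk takes the parameter m = k², not the modulus k:

    from mpmath import mp, mpf, pi, ellipk, agm

    mp.dps = 50
    x, y = mpf(3), mpf(1)
    k = (x - y) / (x + y)
    print(agm(x, y))
    print(pi * (x + y) / (4 * ellipk(k**2)))   # agrees with the line above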

The aliens Gauss, Lagrange, and Legendre gave the Borweins the AGM black box, and the Borweins figured out how to use it to compute π, but also log, exp, cos, etc. The Borwein algorithms may not be the most efficient if you only want, say, 16 digits of precision. But as you need more precision, eventually they become the algorithms to use.

See the next post for an example using the AGM to compute logarithms to 1000 digits.

And see the one after that for a way to compute π with the AGM.

More readable lambda calculus

Lambda calculus is simple. The definitions and rules of lambda calculus would easily fit on an index card.

But you can’t really “read” lambda calculus so much as you can mentally execute it. Not many people can look at more than a few characters of lambda calculus and have any idea what it represents, though it’s not hard to mechanically manipulate it. I suppose in hindsight that’s what you’d expect from a theoretical model for computing.

The post The Programmer’s Ring by Loup Vaillant gives a notation for lambda calculus that I hadn’t seen before, notation that in my opinion is easier to understand. The author combines two ideas: de Bruijn indices (which I’d seen before), and replacing lambdas with brackets (which I had not seen).

The idea of de Bruijn indices is to replace variables with numbers. Vaillant describes de Bruijn indices, saying

… variables are not referenced by name, but by stack offset: the last variable is accessed with the number 0, the second last with the number 1, and so on …

This would be a terrible idea in an ordinary programming language, but in lambda calculus it actually helps. Lambda calculus is so low-level that the variables aren’t meaningful anyway. It’s not like you can look at a few lines of lambda calculus and say “Oh, this is calculating net present value, so this variable must be the interest rate. It would be easier to read if they just called it interest_rate instead of 42 because it’s the 42nd variable.”

Wikipedia describes de Bruijn indices differently than Vaillant:

Each De Bruijn index is a natural number that represents an occurrence of a variable in a λ-term, and denotes the number of binders that are in scope between that occurrence and its corresponding binder.

It’s not immediately obvious that these two definitions of a de Bruijn index are the same, and in fact they’re not. But the only difference is that the first definition numbers from 0 and the second numbers from 1. This post will count from 0 with Vaillant.

After rewriting lambda calculus expressions to use de Bruijn indices, Vaillant fully parenthesizes the expressions (using square brackets, saving parentheses for grouping, as Mathematica does) and then deletes the λs: every bracketed expression starts with a λ, so the λ itself is redundant. Also, you can delete the name of the variable that the λ takes since the variable numbering takes care of that.

OK, so what does all this buy us? As Vaillant points out, this notation is more concise, and it makes some patterns easier to recognize. For example, the famous Y combinator

    Y = λf. (λx. f (x x)) λx. f (x x)

becomes

    Y = [ [1 (0 0)] [1 (0 0)] ]

The latter makes the repetition inside more apparent.
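
Here is a small sketch (mine, not Vaillant’s) that mechanizes the translation. Lambda terms are represented as nested tuples, and the converter replaces each variable with its de Bruijn index and each λ with a pair of brackets:

    # terms: ('var', name), ('lam', name, body), ('app', fun, arg)
    def to_bracket(term, env=()):
        kind = term[0]
        if kind == 'var':
            return str(env.index(term[1]))        # stack offset of the binder
        if kind == 'lam':
            # the λ and its variable name disappear; only brackets remain
            return '[' + to_bracket(term[2], (term[1],) + env) + ']'
        if kind == 'app':
            fun, arg = to_bracket(term[1], env), to_bracket(term[2], env)
            if term[2][0] == 'app':               # group nested applications
                arg = '(' + arg + ')'
            return fun + ' ' + arg
        raise ValueError(term)

    # Y = λf. (λx. f (x x)) (λx. f (x x))
    inner = ('lam', 'x', ('app', ('var', 'f'), ('app', ('var', 'x'), ('var', 'x'))))
    Y = ('lam', 'f', ('app', inner, inner))
    print(to_bracket(Y))   # [[1 (0 0)] [1 (0 0)]]  (same as above, modulo spacing)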

As another example, let’s look at Church encoding for numerals:

    
    0 → λf. λx. x
    1 → λf. λx. f x
    2 → λf. λx. f (f x)
    3 → λf. λx. f (f (f x))
    …

Here’s what Church numerals would look like in the notation described above.

    
    0 → [[ 0 ]]
    1 → [[ 1 0 ]]
    2 → [[ 1 (1 0) ]]
    3 → [[ 1 (1 (1 0)) ]]
    …

You can convert a Church numeral to its corresponding integer by adding all the numbers in the lambda calculus expression.
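
To make the encoding concrete, here is a sketch (not from the post) of Church numerals as Python lambdas, decoded by applying each numeral to an ordinary successor function:

    zero  = lambda f: lambda x: x            # [[ 0 ]]
    one   = lambda f: lambda x: f(x)         # [[ 1 0 ]]
    two   = lambda f: lambda x: f(f(x))      # [[ 1 (1 0) ]]
    three = lambda f: lambda x: f(f(f(x)))   # [[ 1 (1 (1 0)) ]]

    succ = lambda n: lambda f: lambda x: f(n(f)(x))   # Church successor

    def to_int(n):
        # count how many times f is applied
        return n(lambda k: k + 1)(0)

    print([to_int(c) for c in (zero, one, two, three)])   # [0, 1, 2, 3]
    print(to_int(succ(three)))                            # 4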

Vaillant’s notation is more formal than traditional lambda calculus notation, but lambda calculus is formal anyway, so you might as well carry the formality a step further if it makes things easier.

What does RIPEMD stand for?

The RIPEMD-160 secure hash function may be best known these days for its role as part of the implementation of Bitcoin. I’ve wondered what “RIPEMD” stands for, and today I stumbled on an explanation [1]:

“RIPEMD” stands for “RIPE Message Digest,” where “RIPE” stands for “RACE Integrity Primitives Evaluation” and where “RACE” stands for “Research and Development in Advanced Communications Technologies in Europe”—a nice example of a recursive abbreviation.

This deserves a diagram:

[Diagram: RIPEMD → RIPE Message Digest; RIPE → RACE Integrity Primitives Evaluation; RACE → Research and Development in Advanced Communications Technologies in Europe]

I created the diagram above with DITAA.

[1] Alasdair McAndrew. Introduction to Cryptography with Open-Source Software. CRC Press, 2011.

Multicolor reproducing cellular automata

The previous post looked at a cellular automaton introduced by Edward Fredkin. It has only two states: a cell is on or off. At each step, each cell is set to the sum of the states of the neighboring cells mod 2. So a cell is turned on if it had an odd number of neighbors turned on, and is turned off if it had an even number of neighbors turned on.

You can look at cellular automata with more states, often represented as colors. If the number of states per cell is prime, then the extension of Fredkin’s automaton to multiple states also reproduces its initial pattern. Instead of taking the sum of the neighboring states mod 2, more generally you take the sum of the neighboring states mod p.

This theorem is due to Terry Winograd, cited in the same source as in the previous post.

Here’s an example with 11 states. The purple background represents 0. As before, I start with an ‘E’ pattern, but I make it multi-colored.

[Figure: initial state]

Before turning on the automaton, we need to clarify what we mean by “neighborhood.” In this example, I mean the cells to the north, south, east, and west. You could include the diagonals, but that would be a different example, though still self-reproducing.

After the first step, the pattern blurs.

[Figure: step 1]

After 10 steps the original pattern is nowhere to be seen.

[Figure: step 10]

And then suddenly on the next step the pattern reappears.

[Figure: step 11]

I don’t know whether there are theorems on how long it takes the pattern to reappear, or how the time depends on the initial pattern, but in my experiments the pattern reappears after 4 steps with 2 states, 9 steps with 3 states, 5 steps with 5 states, 7 steps with 7 states, and 11 steps with 11 states.
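
If you want to experiment yourself, here is a short NumPy sketch of the mod-p update (my code, with a made-up seed pattern). It also checks numerically that after p steps the seed reappears as four copies, shifted p cells north, south, east, and west, which is what the mod-p arithmetic predicts:

    import numpy as np

    def step(grid, p):
        # each cell becomes the sum of its N, S, E, W neighbors mod p
        # (periodic boundary conditions via np.roll)
        return (np.roll(grid,  1, 0) + np.roll(grid, -1, 0) +
                np.roll(grid,  1, 1) + np.roll(grid, -1, 1)) % p

    p = 11
    grid = np.zeros((64, 64), dtype=int)
    grid[20:29, 20] = np.arange(1, 10)    # a small multicolored seed
    grid[20, 21:26] = 5                   # (roughly E-shaped, but any
    grid[24, 21:26] = 7                   #  pattern works)
    grid[28, 21:26] = 9
    seed = grid.copy()

    for _ in range(p):
        grid = step(grid, p)

    # after p steps: four copies of the seed, shifted by p in each direction
    copies = (np.roll(seed,  p, 0) + np.roll(seed, -p, 0) +
              np.roll(seed,  p, 1) + np.roll(seed, -p, 1)) % p
    print(np.array_equal(grid, copies))   # should print True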

Here’s an animation of the 11-color version, going through 22 steps.

[Figure: animation of the 11-state automaton through 22 steps]

Evolution of random number generators

The previous post showed that the bits of prime powers look kinda chaotic. When you plot them, they form a triangular shape because the size of the numbers is growing. The numbers are growing geometrically, so their binary representations are growing linearly. Here’s the plot for powers of 5:

[Figure: bits of powers of 5]

If you crop the triangle so that you see just a rectangular portion of it, things look less regular. If we look at powers of p modulo a power of 2, say 2³², then we’re chopping off the left side of the triangle.

Suppose we don’t start with 1, but start with some other number, call it a seed, and multiply by 5 every time. Would we get a chaotic pattern? Yes, and we have just invented the congruential random number generator. These RNGs have the form

xₙ₊₁ = a xₙ mod m

The RNG we implicitly described above would have a = 5 and m = 2³². This is not a good choice of a and m, but it’s on to something. We can use different values of the multiplier and modulus to make a decent RNG.
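
Here is the generator we just stumbled into, written out as a few lines of Python. This is a sketch with a made-up seed, not a recommendation:

    def lcg(seed, a=5, m=2**32):
        # linear congruential generator: x_{n+1} = a * x_n mod m
        x = seed
        while True:
            x = a * x % m
            yield x

    gen = lcg(seed=12345)
    print([next(gen) for _ in range(5)])   # [61725, 308625, 1543125, 7715625, 38578125]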

The RANDU random number generator, commonly used in the 1960s, used a = 65539 and m = 2³¹. It turned out to have notoriously bad statistical properties. We can see this with a small modification of the code used to produce the plot above.

    def out(s):
        # draw 1 bits as solid blocks and 0 bits as blanks
        s = s.replace('1', chr(0x2588))
        s = s.replace('0', ' ')
        print(s)

    a, m = 65539, 2**31       # RANDU's multiplier and modulus
    x = 7                     # seed
    for i in range(32):
        x = a * x % m
        s = format(x, '32b')  # binary digits, right-aligned in a field of width 32
        out(s)

Here’s the plot.

[Figure: bits of the RANDU generator]

The pattern on the right side looks like a roll of film. This shows that the lowest order bits are very regular. You can also see that we need to start with a large seed: starting with 7 created the large triangular holes at the top of the image. There are other patterns in the image that are hard to describe, but you get the sense something is not right, especially when you compare this image to a better alternative.

A much better choice of parameters is multiplier a = 48271 and modulus m = 2³¹ − 1. These are the parameters of the MINSTD random number generator. Here’s a plot of its bits, also starting with seed 7.

[Figure: bits of the MINSTD generator]

The MINSTD generator is much better than RANDU, and adequate for some applications, but far from state of the art. You could do better by using a much larger multiplier and a modulus on the order of 2⁶⁴ or 2¹²⁸.

You could do even better by adding a permutation step to your congruential generator. That’s the idea behind the PCG (permuted congruential generator) family of random number generators. These generators pass all the standard statistical tests with flying colors.
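
Here is a toy permuted congruential generator in the spirit of PCG, though not the official reference implementation: a 64-bit LCG updates the state, and the 32-bit output is formed by an xorshift and a data-dependent rotation of the old state’s high bits.

    MASK64 = 2**64 - 1

    def pcg_like(seed, inc=1442695040888963407):
        # sketch of a permuted congruential generator: 64-bit LCG state,
        # 32-bit output permuted by xorshift + rotate (multiplier is
        # Knuth's MMIX constant; the increment must be odd)
        state = seed & MASK64
        mult = 6364136223846793005
        while True:
            old = state
            state = (old * mult + (inc | 1)) & MASK64
            xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
            rot = old >> 59
            yield ((xorshifted >> rot) | (xorshifted << ((32 - rot) & 31))) & 0xFFFFFFFF

    gen = pcg_like(seed=42)
    print([next(gen) for _ in range(4)])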

There are many other approaches to random number generation, but this post shows one thread of historical development, how hypothetically someone could go from noticing that prime powers have chaotic bit patterns to inventing PCG.

Is fast grep faster?

The grep utility searches text files for regular expressions, but it can search for ordinary strings since these strings are a special case of regular expressions. However, if your regular expressions are in fact simply text strings, fgrep may be much faster than grep. Or so I’ve heard. I did some benchmarks to see.

Strictly speaking I used grep -F rather than fgrep. On Linux, if you ask for the man (manual) page for fgrep you’ll be taken to the man page for grep which says

In addition, the variant programs egrep, fgrep and rgrep are the same as grep -E, grep -F, and grep -r, respectively. These variants are deprecated, but are provided for backward compatibility.

I was working on a project for a client where I had to search for a long list of words in a long list of files [1]. This is the kind of task where fgrep (“fast grep”) is supposed to be much faster than grep. It was a tiny bit faster, not enough to notice. When I timed it the difference was on the order of 1%.

I ran an analogous search on my own computer with different data and got similar results [2]. There may be instances where fgrep is much faster than grep, but I haven’t seen one first hand.

I suspect that the performance difference between fgrep and grep used to be larger, but the latter has gotten more efficient. Now grep is smart enough to search for strings quickly without having to be told explicitly via -F that the regular expressions are in fact strings. Maybe it scans the regular expression(s) before searching and effectively sets the -F flag itself if appropriate.

[1] I used the -f flag to tell grep the name of a file containing the terms to search for, not to be confused with the flag -F that tells grep the search terms are simply strings.

[2] I got similar results when I was using Linux (WSL) on my computer. When I used grep from GOW, the -F flag made the search 24 times faster. Because the GOW project provides lightweight ports of GNU tools to Windows, it’s understandable that it would not include some of the optimizations in GNU’s implementation of grep.

Comparing files with very long lines

Suppose you need to compare two files with very long lines. Maybe each file is one line.

The diff utility will show you which lines differ. But if the files are each one line, this tells you almost nothing. It confirms that there is a difference, but it doesn’t show you where the difference is. It just dumps both files to the command line.

For example, I created two files, temp1.txt and temp2.txt, each containing the opening paragraph of Moby Dick. The first contains the paragraph verbatim, and the second contains a typo. I ran

    diff temp1.txt temp2.txt

and this is what I saw:

[Figure: screenshot of the diff output]

The 1c1 output tells us that the difference occurs on line 1, but that’s no help because the whole file is line 1.

There are no doubt many ways to fix this problem, but the solution I thought of was to use fold to break the files into more lines before running diff. Without any optional arguments, fold will split a file into segments of 80 characters.

Here’s my first attempt.

    $ diff <(fold temp1.txt) <(fold temp2.txt)
    7c7
    < and bringing up the rear of every funeral I meet; and especially whenever my hy 
    --- 
    > and bringing up the rear of every funeral I meet; and especially whenever my ty

That’s better, but we lost context when we chopped the file into 80-character segments. We can’t see that the last words should be “hypos” and “typos”. This is a minor inconvenience, but there’s a bigger problem.

In the example above I simply changed one character, turning an ‘h’ into a ‘t’. But suppose I had changed “hypos” into “typoes”, not only changing a letter, but also inserting a letter. Then the rest of the files would split differently into 80-character segments. Even though the rest of the text is identical, we would be seeing differences created by fold.

This problem can be mitigated by giving fold the option -s, telling it to split lines at word boundaries.

    $ diff <(fold -s temp1.txt) <(fold -s temp2.txt) 
    8,9c8,9
    < whenever my hypos get such an upper hand of me, that it requires a strong moral
    < principle to prevent me from deliberately stepping into the street, and 
    --- 
    > whenever my typoes get such an upper hand of me, that it requires a strong
    > moral principle to prevent me from deliberately stepping into the street, and

The extra letter that I added changed the way the first line was broken; putting “moral” on the first line of the second output would have made the line 81 characters long. But other than pushing “moral” to the next line, the rest of chopped lines are the same.

You can use the -w option to have fold split lines at lengths other than 80 characters. If I split the files into lines 20 characters long, the differences between the two files are more obvious.

    $ diff <(fold -s -w 20 temp1.txt) <(fold -s -w 20 temp2.txt)
    34c34
    < my hypos get such 
    --- 
    > my typoes get such

Ah, much better.

Not only do shorter lines make it easier to see where the differences are, in this case changing the segment length fixed the problem of a word being pushed to the next line. That won’t always be the case. It would not have been the case, for example, if I had used 15-character lines. But you can always change the number of characters slightly to eliminate the problem if you’d like.
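
If you’d rather avoid the folding issue altogether, a few lines of Python’s difflib will report word-level differences regardless of line breaks. This is a sketch of my own, reusing the file names from the example above:

    import difflib

    with open('temp1.txt') as f1, open('temp2.txt') as f2:
        words1, words2 = f1.read().split(), f2.read().split()

    matcher = difflib.SequenceMatcher(None, words1, words2)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':
            print(tag, words1[i1:i2], '->', words2[j1:j2])
    # e.g.  replace ['hypos'] -> ['typoes']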

Recursive grep

The regular expression search utility grep has a recursive switch -R, but it may not work like you’d expect.

Suppose you want to find the names of all .org files in your current directory and below that contain the text “cheese.”

You have four files, two in the working directory and two below, that all contain the same string: “I like cheese.”

    $ ls -R
    .:
    rootfile.org  rootfile.txt  sub
 
    ./sub:
    subfile.org  subfile.txt

It seems that grep -R can either search all files of the form *.org in the current directory, ignoring the -R switch, or search all files recursively if you don’t give it a file glob, but it can’t do both.

    $ grep -R -l cheese *.org
    rootfile.org
 
    $ grep -R -l cheese .
    ./rootfile.org
    ./rootfile.txt
    ./sub/subfile.org
    ./sub/subfile.txt

One way to solve this is with find and xargs:

    $ find . -name '*.org' | xargs grep -l cheese 
    ./rootfile.org                                                           
    ./sub/subfile.org                                                        

I was discussing this with Chris Toomey and he suggested an alternative using a subshell that seems more natural:

    grep -l cheese $(find . -name '*.org')

Now the code reads more like an ordinary call to grep. From left to right, it essentially says “Search for ‘cheese’ in files ending in .org” whereas the version with find reads like “Find files whose names end in .org and search them for ‘cheese.'” It’s good to understand how both approaches work.
