New R book

Five years ago I recommended the book Learning Base R. Here’s the last paragraph of my review:

Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why anyone is interested in R—but it doesn’t start there. The first 40% or so of the book is devoted to basic language features, things you’re supposed to pick up by osmosis from a book focused more on statistics than on R per se. This is the book I wish I could have handed my programmers who had to pick up R.

Now there’s a second edition. The author has gone through the book and made countless changes, many of them small updates that might go unnoticed. Here are some of the more noticeable changes.

  1. There are 265 new exercises.
  2. Chapter 26 (statistics) and Chapter 28 (packages) got a complete overhaul.
  3. Dozens of new functions are introduced (either in the body of the text or through exercises).
  4. New sections include the switch function (in Chapter 13 on relational operators), algorithm development (in Chapter 23 on iteration), analysis of variance (in Chapter 26 on statistics), and time series analysis (in Chapter 26 on statistics).

I was impressed with the first edition, and the new edition promises to be even better.

The book is available at Barnes & Noble and Amazon.

Probability of a magical permutation

Take a permutation of the numbers 1 through n² and lay out the elements of the permutation in a square. We will call a permutation a magic permutation if the corresponding square is a magic square. What is the probability that a permutation is a magic permutation? That is, if you fill a grid randomly with the numbers 1 through n², how likely are you to get a magic square?

For example, we could generate a random permutation in Python and see whether it forms a magic square.

>>> import numpy as np
>>> np.random.permutation(9).reshape(3,3)
array([[4, 7, 3],
       [2, 6, 5],
       [8, 0, 1]])

The first row adds up to 14 and the second row adds up to 13, so this isn’t a magic square. (NumPy permutes the numbers 0 through 8 rather than 1 through 9; adding 1 to every entry shifts all the sums equally and doesn’t change whether the square is magic.) We could write a script to do this over and over and check whether the result is a magic square, as in the sketch below.
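Here is a minimal sketch in R of that brute-force check for n = 3 (my own illustration, not code from the original post). The helper is_magic tests whether all row, column, and diagonal sums equal the magic constant n(n² + 1)/2.

    is_magic <- function(m) {
        n <- nrow(m)
        s <- n * (n^2 + 1) / 2  # magic constant
        all(rowSums(m) == s) &&
            all(colSums(m) == s) &&
            sum(diag(m)) == s &&
            sum(diag(m[, n:1])) == s
    }

    set.seed(42)
    trials <- 1e6
    hits <- sum(replicate(trials, is_magic(matrix(sample(9), nrow = 3))))
    hits / trials  # crude Monte Carlo estimate of the probability for n = 3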

The exact number of magic squares of size n is known for n up to 5, and we have Monte Carlo estimates for larger values of n.

The number of unique magic squares, modulo rotations and reflections, of size 1 through 5 is

1, 0, 1, 880, 275305224.

For n > 2 the total number of magic squares, counting rotations and reflections as different squares, is 8 times larger than the numbers above. This is because the group of rotations and reflections of a square, D4, has 8 elements.

The probability that a permutation of the numbers 1 through 9, arranged in a square, gives a magic square is 8/9!. The corresponding probability for the numbers 1 through 16 is 8 × 880/16!, and for the numbers 1 through 25 we have 8 × 275305224/25!.

    |---+-------------|
    | n | Prob(magic) |
    |---+-------------|
    | 3 | 2.20459e-05 |
    | 4 | 3.36475e-10 |
    | 5 | 1.41990e-16 |
    |---+-------------|

It looks like the exponents are in roughly a linear progression, so maybe you could fit a line fairly well to the points on a logarithmic scale. In fact, empirical studies suggest that the probability that a permutation of the first n² positive integers is magic decreases a little faster than exponentially.

We could fit a linear regression to the logs of the numbers above to come up with an estimate for the result for n = 6. We expect the estimate could be pretty good, and likely an upper bound on the correct answer: since the log probabilities fall a little faster than linearly, a linear extrapolation should overshoot, i.e. overestimate the probability. Let’s see what actually happens with a little R code.

    > x <- c(3,4,5)
    > y <- 8*c(1/factorial(9), 880/factorial(16), 275305224/factorial(25))
    > m <- lm(log(y) ~ x)
    > predict(m, data.frame(x=c(6)))
    1
    -48.77694

This says we’d estimate that the natural log of the probability that a permutation of the first 6² positive integers is magic is -48.77694. So the estimate of the probability itself is

exp(−48.77694) = 6.55306 × 10⁻²²

We don’t know the number of magic squares of size n = 6, but the number of distinct squares, modulo rotations and reflections, has been estimated [1] at

(0.17745 ± 0.00016) × 10²⁰

and so the total number including rotations and reflections would be 8 times higher. This says we’d expect our probability to be around

8 × 0.17745 × 10²⁰ / 36! ≈ 3.8162 × 10⁻²²

So our estimate is off by about a factor of 2, and as predicted it does give an upper bound.

Looking back at our regression model, the slope is −12.88 and the intercept is 28.53. This suggests that an upper bound on the probability that a permutation of the first n² positive integers is magical is

exp(28.53 − 12.88 n).
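The coefficients can be pulled directly from the fitted model m in the R session above; a small follow-up sketch of my own:

    b <- coef(m)                        # intercept ≈ 28.53, slope ≈ -12.88
    upper_bound <- function(n) exp(b[[1]] + b[[2]] * n)
    upper_bound(6)                      # reproduces the n = 6 estimate above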

In closing I’d like to point out that the estimate for n = 6 that we’re referencing above did not come from simply permuting 36 integers over and over and counting how many of the permutations correspond to magic squares. That would take far too long. It’s quite likely that after the first billion billion tries you would not have seen a magic permutation. To estimate such a small number to four significant figures requires a more sophisticated Monte Carlo procedure.


[1] See OEIS sequence A006052

Extended floating point precision in R and C

The GNU MPFR library is a C library for extended precision floating point calculations. The name stands for Multiple Precision Floating-point Reliable. The library has an R wrapper Rmpfr that is more convenient for interactive use. There are also wrappers for other languages.

It takes a long time to install MPFR and its prerequisite GMP, and so I expected it to take a long time to install Rmpfr. But the R library installs quickly, even on a system that doesn’t have MPFR or GMP installed. (I installed GMP and MPFR from source on Linux, but installed Rmpfr on Windows. Presumably the Windows R package included pre-compiled binaries.)

I’ll start by describing the high-level R interface, then go into the C API.

Rmpfr

You can call the functions in Rmpfr with ordinary numbers. For example, you could calculate ζ(3), the Riemann zeta function evaluated at 3.

    > zeta(3)
    1 'mpfr' number of precision  128   bits
    [1] 1.202056903159594285399738161511449990768

The default precision is 128 bits, and a numeric argument is interpreted as a 128-bit MPFR object. R doesn’t have a built-in zeta function, so the only available zeta is the one from Rmpfr. If you ask for the cosine of 3, you’ll get ordinary precision.

    > cos(3)
    [1] -0.9899925

But if you explicitly pass cosine a 128-bit MPFR representation of the number 3 you will get cos(3) to 128-bit precision.

    > cos(mpfr(3, 128))                            
    1 'mpfr' number of precision  128   bits       
    [1] -0.9899924966004454572715727947312613023926

Of course you don’t have to use only 128 bits. For example, you could find π to 100 decimal places by multiplying the arctangent of 1 by 4.

    > 100*log(10)/log(2) # number of bits needed for 100 decimals                                               
    [1] 332.1928     
                                                                                           
    >  4*atan(mpfr(1,333))                                                                                      
    1 'mpfr' number of precision  333   bits                                                                    
    [1] 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706807 

MPFR C library

The following C code shows how to compute cos(3) to 128-bit precision and 4 atan(1) to 333-bit precision, as above.

    #include <stdio.h>
    #include <gmp.h>
    #include <mpfr.h>
    
    int main (void)
    {
        // All functions require a rounding mode.
        // This mode specifies round-to-nearest
        mpfr_rnd_t rnd = MPFR_RNDN;
    
        mpfr_t x, y;
    
        // initialize x and y as 128-bit numbers (their values start out as NaN)
        mpfr_init2(x, 128);
        mpfr_init2(y, 128);
    
        // Set x to the C double number 3
        mpfr_set_d(x, 3, rnd);
    
        // Set y to the cosine of x
        mpfr_cos(y, x, rnd);
    
        // Print y to standard out in base 10
        printf ("y = ");
        mpfr_out_str (stdout, 10, 0, y, rnd);
        putchar ('\n');
    
        // Compute pi as 4*atan(1)
    
        // Change the precision of x and y to 333 bits.
        // (mpfr_set_prec resets the values, so x is set again below;
        // calling mpfr_init2 on an already initialized variable would leak memory.)
        mpfr_set_prec(x, 333);
        mpfr_set_prec(y, 333);
        mpfr_set_d(x, 1.0, rnd);        
        mpfr_atan(y, x, rnd);
        // Multiply y by 4 and store the result back in y
        mpfr_mul_d(y, y, 4, rnd);
    
        printf ("y = ");
        mpfr_out_str (stdout, 10, 0, y, rnd);
        putchar ('\n');
    
        // Release memory
        mpfr_clear(x);
        mpfr_clear(y);     
    
        return 0;
    }

If this code is saved in the file hello_mpfr.c then you can compile it with

    gcc hello_mpfr.c -lmpfr -lgmp

One line above deserves a little more explanation. The second and third arguments to mpfr_out_str are the base b and number of figures n to print.

We chose b=10 but you could specify any base value 2 ≤ b ≤ 62.

If n were set to 100 then the output would contain 100 significant figures. When n=0, MPFR will determine the number of digits to output, enough digits that the string representation could be read back in exactly. To understand how many digits that is, see Matula’s theorem in the previous post.

Excel, R, and Unicode

I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened.

Say the CSV file looked like this:

    foo,bar
    1,2
    3,4

I read the file into R with

    df <- read.csv("foobar.csv", header=TRUE)

and could access the second column as df$bar but could not access the first column as df$foo. What’s going on?

When I ran names(df) it showed me that the first column was named not foo but ï..foo. I opened the CSV file in a hex editor and saw this:

    efbb bf66 6f6f 2c62 6172 0d0a 312c 320d

The ASCII code for f is 0x66, o is 0x6f, etc. and so the file makes sense, starting with the fourth byte.
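If you don’t have a hex editor handy, you can also inspect the leading bytes from R itself; a small aside of my own, using the example file name from above:

    readBin("foobar.csv", what = "raw", n = 16)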

If you saw my post about Unicode the other day, you may have seen Daniel Lemire’s comment:

There are various byte-order masks like EF BB BF for UTF-8 (unused).

Aha! The first three bytes of my data file are exactly the byte order mark that Daniel mentioned. These bytes are intended to announce that the file should be read as UTF-8, a way of encoding Unicode that coincides with ASCII as long as all the characters in the file are in the ASCII range.

Now we can see where the funny characters in front of “foo” came from. Instead of interpreting EF BB BF as a byte order mark, R interpreted the first byte 0xEF as U+00EF, “Latin Small Letter I with Diaeresis.” I don’t know how BB and BF became periods (U+002E). But if I dump the file to a Windows command prompt, I see the first line as

    ï»¿foo,bar

with the first three characters being the Unicode characters U+00EF, U+00BB, and U+00BF.

How to fix the encoding problem with R? The read.csv function has an optional encoding parameter. I tried setting this parameter to “utf-8” and “utf8”. Neither made any difference. I looked at the R documentation, and it seems I need to set it to “UTF-8”. When I did that, the name of the first column became X.U.FEFF.foo [1]. I don’t know what’s up with that, except FEFF is the byte order mark (BOM) I mentioned in my Unicode post.

Apparently my troubles started when I exported my Excel file as CSV UTF-8. I converted the UTF-8 file to ASCII using Notepad and everything worked. I also could have saved the file directly to ASCII. If you look at the list of Excel export options, you’ll first see CSV UTF-8 (that’s why I picked it), but if you go further down you’ll see an option that’s simply CSV, implicitly in ASCII.

Unicode is great when it works. This blog is Unicode encoded as UTF-8, as are most pages on the web. But then you run into weird things like the problem described in this post. Does the fault lie with Excel? With R? With me? I don’t know, but I do know that the problem goes away when I stick to ASCII.

***

[1] A couple people pointed out in the comments that you could use fileEncoding="UTF-8-BOM" to fix the problem. This works, though I didn’t see it in the documentation the first time. The read.csv function takes an encoding parameter that appears to be for this purpose, but is a decoy. You need the fileEncoding parameter. With enough persistence you’ll eventually find that "UTF-8-BOM" is a possible value for fileEncoding.
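For reference, here is a minimal sketch of the call that works, using the example file name from above:

    df <- read.csv("foobar.csv", header = TRUE, fileEncoding = "UTF-8-BOM")
    names(df)  # the first column now comes back as plain "foo"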

R with Conda

I’ve been unable to get some R libraries to install on my Linux laptop. Two libraries in particular were tseries and tidyverse. The same libraries installed just fine on Windows. (Maybe you need to install Rtools first before installing these on Windows; I don’t remember.)

I use conda all the time with Python, but I hadn’t tried it with R until this evening. Apparently it just works. The libraries I was trying to install have a lot of dependencies, and conda is very good at managing dependencies.

I removed my installation of R and reinstalled from conda:

    conda install r-base

Then I installed tseries with

    conda install r-tseries

and installed tidyverse analogously:

    conda install r-tidyverse

Just prepend r- to the name of the R library you want to install.

I haven’t used it in anger yet, but it seems that everything works just fine.

Speeding up R code

People often come to me with R code that’s running slower than they’d like. It’s not unusual to make the code 10 or even 100 times faster by rewriting it in C++.

Not all that speed improvement comes from changing languages. Some of it comes from better algorithms, eliminating redundancy, etc.

Why bother optimizing?

If code is running 100 times slower than you’d like, why not just run it on 100 processors? Sometimes that’s the way to go. But maybe the code doesn’t split up easily into pieces that can run in parallel. Or maybe you’d rather run the code on your laptop than send it off to the cloud. Or maybe you’d like to give your code to someone else and you want them to be able to run the code conveniently.

Optimizing vs rewriting R

It’s sometimes possible to tweak R code to make it faster without rewriting it, especially if it is naively using loops for things that could easily be vectorized. And it’s possible to use better algorithms without changing languages.
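As a toy illustration of the kind of change I mean (my example, not a client’s code), the loop and the vectorized line below compute the same sum of squares, and the vectorized version is typically far faster:

    x <- rnorm(1e6)

    # Naive loop
    total <- 0
    for (xi in x) total <- total + xi^2

    # Vectorized equivalent
    total <- sum(x^2)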

Beyond these high-level changes, there are a number of low-level changes that may give you a small speed-up. This way madness lies. I’ve seen blog posts to the effect “I rewrote this part of my code in the following non-obvious way, and for reasons I don’t understand, it ran 30% faster.” Rather than spending hours or days experimenting with such changes and hoping for a small speed up, I use a technique fairly sure to give a 10x speed up, and that is rewriting (part of) the code in C++.

If the R script is fairly small, and if I have C++ libraries to replace all the necessary R libraries, I’ll rewrite the whole thing in C++. But if the script is long, or has dependencies I can’t replace, or only has a small section where nearly all the time is spent, I may just rewrite that portion in C++ and call it from R using Rcpp.
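As a sketch of what calling C++ from R looks like (a toy example, not code from an actual project), Rcpp can compile a small C++ function on the fly and expose it to R:

    library(Rcpp)

    # Compile a tiny C++ function and make it callable from R
    cppFunction('
    double sum_sq(NumericVector x) {
        double total = 0;
        for (int i = 0; i < x.size(); ++i)
            total += x[i] * x[i];
        return total;
    }')

    sum_sq(rnorm(1e6))  # called from R like any other function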

Simulation vs analysis

The R programs I’ve worked on often compute something approximately by simulation that could be calculated exactly much faster. This isn’t because the R language encourages simulation, but because the language is used by statisticians who are more inclined to use simulation than analysis.

Sometimes a simulation amounts to computing an integral. It might be possible to compute the integral in closed form with some pencil-and-paper work. Or it might be possible to recognize the integral as a special function for which you have efficient evaluation code. Or maybe you have to approximate the integral, but you can do it more efficiently by numerical analysis than by simulation.
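For a toy example of the last case (my illustration, not a client problem), the same one-dimensional integral can be estimated by simulation or computed far more efficiently by quadrature:

    # Integral of exp(-x^2) over [0, 1]

    # Monte Carlo: average the integrand at many uniform random points
    set.seed(1)
    mc <- mean(exp(-runif(1e6)^2))

    # Numerical quadrature: far fewer function evaluations, much more accurate
    quad <- integrate(function(x) exp(-x^2), 0, 1)$value

    c(mc, quad)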

Redundancy vs memoization

Sometimes it’s possible to speed up code, written in any language, simply by not calculating the same thing unnecessarily. This could be something simple like moving code out of inner loops that doesn’t need to be there, or it could be something more sophisticated like memoization.

The first time a function is called with a new set of arguments, memoization saves the result, associating the arguments with the result in some sort of look-up table, such as a hash. The next time the function is called with the same arguments, the result is retrieved from memory rather than recomputed.

Memoization works well when the set of unique arguments is fairly small and the calculation is expensive relative to the cost of looking up results. Sometimes the set of potential arguments is very large, and it looks like memoization won’t be worthwhile, but the set of actual arguments is small because some arguments are used over and over.
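Here is a minimal sketch of memoization in R (a generic illustration of the idea, using an environment as the look-up table):

    memoize <- function(f) {
        cache <- new.env(hash = TRUE)
        function(x) {
            key <- as.character(x)
            if (exists(key, envir = cache, inherits = FALSE))
                return(get(key, envir = cache))
            result <- f(x)
            assign(key, result, envir = cache)
            result
        }
    }

    slow_sqrt <- function(x) { Sys.sleep(1); sqrt(x) }  # stand-in for an expensive function
    fast_sqrt <- memoize(slow_sqrt)
    fast_sqrt(2)  # slow the first time
    fast_sqrt(2)  # retrieved from the cache the second time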

 Related post: 20 weeks down to 20 minutes

Gentle introduction to R

The R language is closely tied to statistics. Its ancestor was named S, because it was a language for Statistics. The open source descendant could have been named ‘T’, but its creators chose to call it ‘R.’

Most people learn R as they learn statistics: Here’s a statistical concept, and here’s how you can compute it in R. Statisticians aren’t that interested in the R language itself but see it as connective tissue between commands that are their primary interest.

This works for statisticians, but it makes the language hard for non-statisticians to approach. Years ago I managed a group of programmers who supported statisticians. At the time, there were no books for learning R without concurrently learning statistics. This created quite a barrier to entry for programmers whose immediate concern was not the statistical content of an R program.

Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why anyone is interested in R—but it doesn’t start there. The first 40% or so of the book is devoted to basic language features, things you’re supposed to pick up by osmosis from a book focused more on statistics than on R per se. This is the book I wish I could have handed my programmers who had to pick up R.


Mixing Haskell and R

It would be hard to think of two programming languages more dissimilar than Haskell and R.

Haskell was designed to enforce programming discipline; R was designed for interactive use. Haskell emphasizes correctness; R emphasizes convenience.  Haskell emphasizes computer efficiency; R emphasizes interactive user efficiency. Haskell was written to be a proving ground for programming language theorists. R was written to be a workbench for statisticians. Very different goals lead to very different languages.

When I first heard of a project to mix Haskell and R, I was a little shocked. Could it even be done? Aside from the external differences listed above, the differences in language internals are vast. I’m very impressed that the folks at Tweag I/O were able to pull this off. Their HaskellR project lets you call R from Haskell and vice versa. (It’s primarily for Haskell calling R, though you can call Haskell functions from your R code: Haskell calling R calling Haskell. It kinda hurts your brain at first.) Because the languages are so different, some things that are hard in one are easy in the other.

I used HaskellR while it was under initial development. Our project was written in Haskell, but we wanted to access R libraries. There were a few difficulties along the way, as with any project, but these were resolved and eventually it just worked.

* * *

Related: Help integrating R into your environment