R with Conda

I’ve been unable to get some R libraries to install on my Linux laptop. Two libraries in particular were tseries and tidyverse. The same libraries installed just fine on Windows. (Maybe you need to install Rtools first before installing these on Windows; I don’t remember.)

I use conda all the time with Python, but I hadn’t tried it with R until this evening. Apparently it just works. The libraries I was trying to install have a lot of dependencies, and conda is very good at managing dependencies.

I removed my installation of R and reinstalled from conda:

    conda install r-base

Then I installed tseries with

    conda install r-tseries

and installed tidyverse analogously:

    conda install r-tidyverse

Just prepend r- to the name of the R library you want to install.

I haven’t used it in anger yet, but it seems that everything works just fine.

Speeding up R code

People often come to me with R code that’s running slower than they’d like. It’s not unusual to make the code 10 or even 100 times faster by rewriting it in C++.

Not all that speed improvement comes from changing languages. Some of it comes from better algorithms, eliminating redundancy, etc.

Why bother optimizing?

If code is running 100 times slower than you’d like, why not just run it on 100 processors? Sometimes that’s the way to go. But maybe the code doesn’t split up easily into pieces that can run in parallel. Or maybe you’d rather run the code on your laptop than send it off to the cloud. Or maybe you’d like to give your code to someone else and you want them to be able to run the code conveniently.

Optimizing vs rewriting R

It’s sometimes possible to tweak R code to make it faster without rewriting it, especially if it is naively using loops for things that could easily be vectorized. And it’s possible to use better algorithms without changing languages.

Beyond these high-level changes, there are a number of low-level changes that may give you a small speed-up. This way madness lies. I’ve seen blog posts to the effect “I rewrote this part of my code in the following non-obvious way, and for reasons I don’t understand, it ran 30% faster.” Rather than spending hours or days experimenting with such changes and hoping for a small speed up, I use a technique fairly sure to give a 10x speed up, and that is rewriting (part of) the code in C++.

If the R script is fairly small, and if I have C++ libraries to replace all the necessary R libraries, I’ll rewrite the whole thing in C++. But if the script is long, or has dependencies I can’t replace, or only has a small section where nearly all the time is spent, I may just rewrite that portion in C++ and call it from R using Rcpp.

Simulation vs analysis

The R programs I’ve worked on often compute something approximately by simulation that could be calculated exactly much faster. This isn’t because the R language encourages simulation, but because the language is used by statisticians who are more inclined to use simulation than analysis.

Sometimes a simulation amounts to computing an integral. It might be possible to compute the integral in closed form with some pencil-and-paper work. Or it might be possible to recognize the integral as a special function for which you have efficient evaluation code. Or maybe you have to approximate the integral, but you can do it more efficiently by numerical analysis than by simulation.

Redundancy vs memoization

Sometimes it’s possible to speed up code, written in any language, simply by not calculating the same thing unnecessarily. This could be something simple like moving code out of inner loops that doesn’t need to be there, or it could be something more sophisticated like memoization.

The first time it sees a function called with a new set of arguments, memoization saves the result and creates a way to associate the arguments with the result in some sort of look-up table, such as a hash. The next time the function is called with the same argument, the result is retrieved from memory rather than recomputed.

Memoization works well when the set of unique arguments is fairly small and the calculation is expensive relative to the cost of looking up results. Sometimes the set of potential arguments is very large, and it looks like memoization won’t be worthwhile, but the set of actual arguments is small because some arguments are used over and over.

 Related post:

Gentle introduction to R

The R language is closely tied to statistics. It’s ancestor was named S, because it was a language for Statistics. The open source descendant could have been named ‘T’, but its creators chose to call it’R.’

Most people learn R as they learn statistics: Here’s a statistical concept, and here’s how you can compute it in R. Statisticians aren’t that interested in the R language itself but see it as connective tissue between commands that are their primary interest.

This works for statisticians, but it makes the language hard for non-statisticians to approach. Years ago I managed a group of programmers who supported statisticians. At the time, there were no books for learning R without concurrently learning statistics. This created quite a barrier to entry for programmers whose immediate concern was not the statistical content of an R program.

Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why anyone is interested in R—but it doesn’t start there. The first 40% or so of the book is devoted to basic language features, things you’re supposed to pick up by osmosis from a book focused more on statistics than on R per se. This is the book I wish I could have handed my programmers who had to pick up R.

Mixing Haskell and R

It would be hard to think of two programming languages more dissimilar than Haskell and R.

Haskell was designed to enforce programming discipline; R was designed for interactive use. Haskell emphasizes correctness; R emphasizes convenience.  Haskell emphasizes computer efficiency; R emphasizes interactive user efficiency. Haskell was written to be a proving ground for programming language theorists. R was written to be a workbench for statisticians. Very different goals lead to very different languages.

When I first heard of a project to mix Haskell and R, I was a little shocked. Could it even be done? Aside from the external differences listed above, the differences in language internals are vast. I’m very impressed that the folks at Tweag I/O were able to pull this off. Their HaskellR project lets you call R from Haskell and vice versa. (It’s primarily for Haskell calling R, though you can call Haskell functions from your R code: Haskell calling R calling Haskell. It kinda hurts your brain at first.) Because the languages are so different, some things that are hard in one are easy in the other.

I used HaskellR while it was under initial development. Our project was written in Haskell, but we wanted to access R libraries. There were a few difficulties along the way, as with any project, but these were resolved and eventually it just worked.

* * *

Related: Help integrating R into your environment

R resources

This is the third in my weekly series of posts pointing out resources on this site. This week’s topic is R.

See also posts tagged Rstats.

I started the Twitter account RLangTip and handed it over the folks at Revolution Analytics.

Last week: Emacs resources

Next week: C++ resources

Let me know if you’d like help making R part of your environment.

Example of unit testing R code with testthat

Here’s a little example of using Hadley Wickham’s testthat package for unit testing R code. You can read more about testthat here.

The function below computes the real roots of a quadratic. All that really matters for our purposes is that the function can return 0, 1, or 2 numbers and it could raise an error.

real.roots <- function(a, b, c)
{
    if (a == 0.)
        stop("Leading term cannot be zero")

    d = b*b - 4*a*c # discriminant

    if (d < 0)
       rr = c()
    else if (d == 0)
       rr = c( -b/(2*a) )
    else
        rr = c( (-b - sqrt(d))/(2*a), 
                (-b + sqrt(d))/(2*a)  )

    return(rr)
}

To test this code with testthat we create another file for tests. The name of the file should begin with test so that testthat can recognize it as a file of test code. So let name the file containing the code above real_roots.R and the file containing its tests test_real_roots.R.

The test file needs to read in the file being tested.

source("real_roots.R")

Now let’s write some tests for the case of a quadratic with two real roots.

test_that("Distinct roots", {

    roots <- real.roots(1, 7, 12)

    expect_that( roots, is_a("numeric") )
    expect_that( length(roots), equals(2) )
    expect_that( roots[1] < roots[2], is_true() )
})

This tests that we get back two numbers and that they are sorted in increasing order.

Next we find the roots of (x + 3000)2 = x2 + 6000x + 9000000. We’ll test whether we get back -3000 as the only root. In general you can’t expect to get an exact answer, though in this case we do since the root is an integer. But we’ll show in the next example how to test for equality with a given tolerance.

test_that("Repeated root", {

    roots <- real.roots(1, 6000, 9000000)

    expect_that( length(roots), equals(1) )

    expect_that( roots, equals(-3000) )

    # Test whether ABSOLUTE error is within 0.1 
    expect_that( roots, equals(-3000.01, tolerance  = 0.1) )

    # Test whether RELATIVE error is within 0.1
    # To test relative error, set 'scale' equal to expected value.
    # See base R function all.equal for optional argument documentation.
    expect_equal( roots, -3001, tolerance  = 0.1, scale=-3001) 
})

To show how to test code that should raise an error, we’ll find the roots of 2x + 3, which isn’t a quadratic. Notice that you can test whether any error is raised or you can test whether the error message matches a given regular expression.

test_that("Polynomial must be quadratic", {

    # Test for ANY error                     
    expect_that( real.roots(0, 2, 3), throws_error() )

    # Test specifically for an error string containing "zero"
    expect_that( real.roots(0, 2, 3), throws_error("zero") )

    # Test specifically for an error string containing "zero" or "Zero" using regular expression
    expect_that( real.roots(0, 2, 3), throws_error("[zZ]ero") )
})

Finally, here are a couple tests that shouldn’t pass.

test_that("Bogus tests", {

    x <- c(1, 2, 3)

    expect_that( length(x), equals(2.7) )
    expect_that( x, is_a("data.frame") )
})

To run the tests, you can run test_dir or test_file. If you are at the R command line and your working directory is the directory containing the two files above, you could run the tests with test_dir("."). In this case we have only one file of test code, but if we had more test files test_dir would find them, provided the file names begin with test.

* * *

Related: Help integrating R into your environment

Basics of Sweave and Pweave

Sweave is a tool for embedding R code in a LaTeX file. Pweave is an analogous tool for Python. By putting your code in your document rather than the results of running your code somewhere else, results are automatically recomputed when inputs change. This is especially useful with graphs: rather than including an image into your document, you include the code to create the image.

To use either Sweave or Pweave, you create a LaTeX file and include source code inside. A code block begins with <<>>= and ends with @ on a line by itself. By default, code blocks appear in the LaTeX output. You can start a code block with <<echo=FALSE>>= to execute code without echoing its source. In Pweave you can also use <% and %> to mark a code block that executes but does not echo. You might want to do this at the top of a file, for example, for import statements.

Sweave echos code like the R command line, with > for the command prompt. Pweave does not display the Python >>> command line prompt by default, though it will if you use the option term=TRUE in the start of your code block.

In Sweave, you can use Sexpr to inline a little bit of R code. For example, $x = Sexpr{sqrt(2)}$ will produce x = 1.414…. You can also use Sexpr to reference variables defined in previous code blocks. The Pweave analog uses <%= and %>. The previous example would be $x = <%= sqrt(2) %>$.

You can include a figure in Sweave or Pweave by beginning a code block with <<fig=TRUE, echo=FALSE>>= or with echo=TRUE if you want to display the code that produces the figure. With Sweave you don’t need to do anything else with your file. With Pweave you need to add usepackage{graphicx} at the top.

To process an Sweave file foo.Rnw, run Sweave("foo.Rnw") from the R command prompt. To process a Pweave file foo.Pnw, run Pweave -f tex foo.Pnw from the shell. Either way you get a LaTeX file that you can then compile to a PDF.

Here are sample Sweave and Pweave files. First Sweave:

\documentclass{article}
\begin{document}

Invisible code that sets the value of the variable $a$.

<<<echo=FALSE>>=
a <- 3.14
@

Visible code that sets $b$ and squares it.

<<bear, echo=TRUE>>=
b <- 3.15
b*b
@

Calling R inline: $\sqrt{2} = Sexpr{sqrt(2)}$

Recalling the variable $a$ set above: $a = Sexpr{a}$.

Here's a figure:

<<fig=TRUE, echo=FALSE>>=
x <- seq(0, 6*pi, length=200)
plot(x, sin(x))
@

\end{document}

And now Pweave:

\documentclass{article}
\usepackage{graphicx}
\begin{document}

<%
import matplotlib.pyplot as plt
from numpy import pi, linspace, sqrt, sin
%>

Invisible code that sets the value of the variable $a$.

<<echo=FALSE>>=
a = 3.14
@

Visible code that sets $b$ and squares it.

<<term=True>>=
b = 3.15
print b*b
@

Calling Python inline: $\sqrt{2} = <%= sqrt(2) %>$

Recalling the variable $a$ set above: $a = <%= a %>$.

Here's a figure:

<<fig=True, echo=False>>=
x = linspace(0, 6*pi, 200)
plt.plot(x, sin(x))
plt.show()
@

\end{document}

Related links: