Hadley Wickham posted a photo on Twitter back in September illustrating R list indices with pepper:

Then a few days ago, Jenny Bryan posted on Twitter her follow up, an analogous photo for XML:

**Related posts**:

(832) 422-8646

It would be hard to think of two programming languages more dissimilar than Haskell and R.

Haskell was designed to enforce programming discipline; R was designed for interactive use. Haskell emphasizes correctness; R emphasizes convenience. Haskell emphasizes computer efficiency; R emphasizes interactive user efficiency. Haskell was written to be a proving ground for programming language theorists. R was written to be a workbench for statisticians. Very different goals lead to very different languages.

When I first heard of a project to mix Haskell and R, I was a little shocked. Could it even be done? Aside from the external differences listed above, the differences in language internals are vast. I’m very impressed that the folks at Tweag I/O were able to pull this off. Their HaskellR project lets you call R from Haskell and vice versa. (It’s primarily for Haskell calling R, though you can call Haskell functions from your R code: Haskell calling R calling Haskell. It kinda hurts your brain at first.) Because the languages are so different, some things that are hard in one are easy in the other.

I used HaskellR while it was under initial development. Our project was written in Haskell, but we wanted to access R libraries. There were a few difficulties along the way, as with any project, but these were resolved and eventually it just worked.

* * *

This is the third in my weekly series of posts pointing out resources on this site. This week’s topic is R.

- R language for programmers
- Default arguments and lazy evaluation in R
- Distributions in R
- Moving data between R and Excel via the clipboard
- Sweave: First steps toward reproducible analyses
- Troubleshooting Sweave
- Regular expressions in R

See also posts tagged Rstats.

I started the Twitter account RLangTip and handed it over the folks at Revolution Analytics.

**Last week**: Emacs resources

**Next week**: C++ resources

Let me know if you’d like help making R part of your environment.

Here’s a little example of using Hadley Wickham’s `testthat`

package for unit testing R code. You can read more about `testthat`

here.

The function below computes the real roots of a quadratic. All that really matters for our purposes is that the function can return 0, 1, or 2 numbers and it could raise an error.

real.roots <- function(a, b, c) { if (a == 0.) stop("Leading term cannot be zero") d = b*b - 4*a*c # discriminant if (d < 0) rr = c() else if (d == 0) rr = c( -b/(2*a) ) else rr = c( (-b - sqrt(d))/(2*a), (-b + sqrt(d))/(2*a) ) return(rr) }

To test this code with `testthat`

we create another file for tests. The name of the file should begin with `test`

so that `testthat`

can recognize it as a file of test code. So let name the file containing the code above `real_roots.R`

and the file containing its tests `test_real_roots.R`

.

The test file needs to read in the file being tested.

source("real_roots.R")

Now let’s write some tests for the case of a quadratic with two real roots.

test_that("Distinct roots", { roots <- real.roots(1, 7, 12) expect_that( roots, is_a("numeric") ) expect_that( length(roots), equals(2) ) expect_that( roots[1] < roots[2], is_true() ) })

This tests that we get back two numbers and that they are sorted in increasing order.

Next we find the roots of (*x* + 3000)^{2} = *x*^{2} + 6000*x* + 9000000. We’ll test whether we get back -3000 as the only root. In general you can’t expect to get an exact answer, though in this case we do since the root is an integer. But we’ll show in the next example how to test for equality with a given tolerance.

test_that("Repeated root", { roots <- real.roots(1, 6000, 9000000) expect_that( length(roots), equals(1) ) expect_that( roots, equals(-3000) ) # Test whether ABSOLUTE error is within 0.1 expect_that( roots, equals(-3000.01, tolerance = 0.1) ) # Test whether RELATIVE error is within 0.1 # To test relative error, set 'scale' equal to expected value. # See base R function all.equal for optional argument documentation. expect_equal( roots, -3001, tolerance = 0.1, scale=-3001) })

To show how to test code that should raise an error, we’ll find the roots of 2*x* + 3, which isn’t a quadratic. Notice that you can test whether any error is raised or you can test whether the error message matches a given regular expression.

test_that("Polynomial must be quadratic", { # Test for ANY error expect_that( real.roots(0, 2, 3), throws_error() ) # Test specifically for an error string containing "zero" expect_that( real.roots(0, 2, 3), throws_error("zero") ) # Test specifically for an error string containing "zero" or "Zero" using regular expression expect_that( real.roots(0, 2, 3), throws_error("[zZ]ero") ) })

Finally, here are a couple tests that shouldn’t pass.

test_that("Bogus tests", { x <- c(1, 2, 3) expect_that( length(x), equals(2.7) ) expect_that( x, is_a("data.frame") ) })

To run the tests, you can run `test_dir`

or `test_file`

. If you are at the R command line and your working directory is the directory containing the two files above, you could run the tests with `test_dir(".")`

. In this case we have only one file of test code, but if we had more test files `test_dir`

would find them, provided the file names begin with `test`

.

* * *

Here’s an interview I did with Microsoft Channel 9 right after my talk in Brisbane.

You can find the interview in multiple audio and video formats on Channel 9.

Sweave is a tool for embedding R code in a LaTeX file. Pweave is an analogous tool for Python. By putting your *code* in your document rather than the *results* of running your code somewhere else, results are automatically recomputed when inputs change. This is especially useful with graphs: rather than including an *image* into your document, you include the code to *create* the image.

To use either Sweave or Pweave, you create a LaTeX file and include source code inside. A code block begins with `<<>>=`

and ends with `@`

on a line by itself. By default, code blocks appear in the LaTeX output. You can start a code block with `<<echo=FALSE>>=`

to execute code without echoing its source. In Pweave you can also use `<%`

and `%>`

to mark a code block that executes but does not echo. You might want to do this at the top of a file, for example, for `import`

statements.

Sweave echos code like the R command line, with `>`

for the command prompt. Pweave does not display the Python `>>>`

command line prompt by default, though it will if you use the option `term=TRUE`

in the start of your code block.

In Sweave, you can use `Sexpr`

to inline a little bit of R code. For example, `$x = Sexpr{sqrt(2)}$`

will produce *x* = 1.414…. You can also use `Sexpr`

to reference variables defined in previous code blocks. The Pweave analog uses `<%=`

and `%>`

. The previous example would be `$x = <%= sqrt(2) %>$`

.

You can include a figure in Sweave or Pweave by beginning a code block with `<<fig=TRUE, echo=FALSE>>=`

or with `echo=TRUE`

if you want to display the code that produces the figure. With Sweave you don’t need to do anything else with your file. With Pweave you need to add `usepackage{graphicx}`

at the top.

To process an Sweave file `foo.Rnw`

, run `Sweave("foo.Rnw")`

from the R command prompt. To process a Pweave file `foo.Pnw`

, run `Pweave -f tex foo.Pnw`

from the shell. Either way you get a LaTeX file that you can then compile to a PDF.

Here are sample Sweave and Pweave files. First Sweave:

\documentclass{article} \begin{document} Invisible code that sets the value of the variable $a$. <<<echo=FALSE>>= a <- 3.14 @ Visible code that sets $b$ and squares it. <<bear, echo=TRUE>>= b <- 3.15 b*b @ Calling R inline: $\sqrt{2} = Sexpr{sqrt(2)}$ Recalling the variable $a$ set above: $a = Sexpr{a}$. Here's a figure: <<fig=TRUE, echo=FALSE>>= x <- seq(0, 6*pi, length=200) plot(x, sin(x)) @ \end{document}

And now Pweave:

\documentclass{article} \usepackage{graphicx} \begin{document} <% import matplotlib.pyplot as plt from numpy import pi, linspace, sqrt, sin %> Invisible code that sets the value of the variable $a$. <<echo=FALSE>>= a = 3.14 @ Visible code that sets $b$ and squares it. <<term=TRUE>>= b = 3.15 print b*b @ Calling Python inline: $\sqrt{2} = <%= sqrt(2) %>$ Recalling the variable $a$ set above: $a = <%= a %>$. Here's a figure: <<fig=TRUE, echo=FALSE>>= x = linspace(0, 6*pi, 200) plt.plot(x, sin(x)) @ \end{document}

**Related links**:

- Troubleshooting Sweave
- Running Python and R inside Emacs
- Literate programming in statistics
- Help integrating R into your environment

Tim Hopper asked on Twitter today:

#rstats programming without @hadleywickham’s libraries is like ________ without _________.

Some of the replies were:

- (skydiving, a parachute)
- (gouging your own eyes out, NULL)
- (dentistry, anesthesia)
- (shaving, a razor)
- (Internet shopping, credit card)

Clearly there’s a lot of love out there for Hadley Wickham’s R packages. I’m familiar with his ggplot2 graphics package, and it’s quite impressive. I need to look at his other packages as well.

* * *

Someone asked me yesterday for R code to compute the probability P(*X* > *Y* + δ) where *X* and *Y* are independent beta random variables. I’m posting the solution here in case it benefits anyone else.

For an example of why you might want to compute this probability, see A Bayesian view of Amazon resellers.

Let *X* be a Beta(*a*, *b*) random variable and *Y* be a Beta(*c*, *d*) random variable. Denote PDFs by *f* and CDFs by *F*. Then the probability we need is

If you just need to compute this probability a few times, here is a desktop application to compute random inequalities.

But if you need to do this computation repeated inside R code, you could use the following.

beta.ineq <- function(a, b, c, d, delta) { integrand <- function(x) { dbeta(x, a, b)*pbeta(x-delta, c, d) } integrate(integrand, delta, 1, rel.tol=1e-4)$value }

The code is as good or as bad as R’s `integrate`

function. It’s probably accurate enough as long as none of the parameters *a*, *b*, *c*, or *d* are near zero. When one or more of these parameters is small, the integral is harder to compute numerically.

There is no error checking in the code above. A more robust version would verify that all parameters are positive and that `delta`

is less than 1.

Here’s the solution to the corresponding problem for gamma random variables, provided `delta`

is zero: A support one-liner.

And here is a series of blog posts on random inequalities.

- Introduction
- Analytical results
- Numerical results
- Cauchy distributions
- Beta distributions
- Gamma distributions
- Three or more random variables
- Folded normals

Almost exactly a year ago, I wrote about my frustration calling C++ from R. Maybe this will become an annual event because I’m back at it.

This time my experience was more pleasant. I was able to install Rcpp on an Ubuntu virtual machine and run example 2.2 from the Rcpp FAQ once I upgraded R to the latest version. I wrote up some notes on the process of installing the latest version of R and Rcpp on Ubuntu.

I have not yet been able to run Rcpp on Windows.

**Update**: Thanks to Dirk Eddelbuettel for pointing out in the comments that you can install Rcpp from the shell rather than from inside R by running `sudo apt-get install r-cran-rcpp`

. With this approach, I was able to install Rcpp without having to upgrade R first.

Drew Conway and John Myles White have a new book out, Machine Learning for Hackers (ISBN 1449303714). As the name implies, the emphasis is on exploration rather than mathematical theory. Lots of code, no equations.

If you’re looking for a hands-on introduction to machine learning, maybe as a prelude to or complement to a more theoretical text, you’ll enjoy this book. Even if you’re not all that interested in machine learning, you might enjoy the examples, such as how a computer could find patterns in senatorial voting records and twitter networks. And R users will find examples of using advanced language features to solve practical problems.

For daily tips on data science, follow @DataSciFact on Twitter.

Francois Pinard comparing the R programming language to smoking:

Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.

I’ve never gotten to the point that I would call using R pleasurable.

Quote via Stats in the Wild

**Related posts**:

Emacs org-mode lets you manage blocks of source code inside a text file. You can execute these blocks and have the output display in your text file. Or you could export the file, say to HTML or PDF, and show the code and/or the results of executing the code.

Here I’ll show some of the most basic possibilities. For much more information, see orgmode.org. And for the use of org-mode in research, see A Multi-Language Computing Environment for Literate Programming and Reproducible Research.

Source code blocks go between lines of the form

#+begin_src #+end_src

On the `#+begin_src`

line, specify the programming language. Here I’ll demonstrate Python and R, but org-mode currently supports C++, Java, Perl, etc. for a total of 35 languages.

Suppose we want to compute √42 using R.

#+begin_src R sqrt(42) #+end_src

If we put the cursor somewhere in the code block and type C-c C-c, org-mode will add these lines:

#+results: : 6.48074069840786

Now suppose we do the same with Python:

#+begin_src python from math import sqrt sqrt(42) #+end_src

This time we get disappointing results:

#+results: : None

What happened? The org-mode manual explains:

… code should be written as if it were the body of such a function. In particular, note that Python does not automatically return a value from a function unless a

`return`

statement is present, and so a ‘`return`

’ statement will usually be required in Python.

If we change `sqrt(42)`

to `return sqrt(42)`

then we get the same result that we got when using R.

By default, evaluating a block of code returns a single result. If you want to see the output as if you were interactively using Python from the REPL, you can add `:results output :session`

following the language name.

#+begin_src python :results output :session print "There are %d hours in a week." % (7*24) 2**10 #+end_src

This produces the lines

#+results: : There are 168 hours in a week. : 1024

Without the `:session`

tag, the second line would not appear because there was no `print`

statement.

I had to do a couple things before I could get the examples above to work. First, I had to upgrade org-mode. The version of org-mode that shipped with Emacs 23.3 was quite out of date. Second, the only language you can run by default is Emacs Lisp. You have to turn on support for other languages in your `.emacs`

file. Here’s the code to turn on support for Python and R.

(org-babel-do-load-languages 'org-babel-load-languages '((python . t) (R . t)))

**Update**: My next post shows how to call code in written in one language from code written in another language.

**Related posts**:

No Starch Press sent me a copy of The Art of R Programming last Fall and I wrote a review of it here. Then a couple weeks ago, Manning sent me a copy of R in Action (ISBN 1935182390). Here I’ll give a quick comparison of the two books, then focus specifically on *R in Action*.

**Comparing R books**

Norman Matloff, author of *The Art of R Programming*, is a statistician-turned-computer scientist. As the title may imply, Matloff’s book has more of a programmer’s perspective on R as a language.

Robert Kabacoff, author of *R in Action*, is a psychology professor-turned-statistical consultant. And as its title may imply, Kabacoff’s book is more about using R to analyze data. That is, the book is organized by analytical task rather than by language feature.

Many R books are organized like a statistical text. In fact, many *are* statistics texts, organized according to the progression of statistical theory with R code sprinkled in. *R in Action* is organized roughly in the order of steps one would take to analyze data, starting with importing data and ending with producing reports.

In short, *The Art of R Programming* is for programmers, *R in Action* is for data analysts, and most other R books I’ve seen are for statisticians. Of course a typical R user is to some extent a programmer, an analyst, and a statistician. But this comparison gives you some idea which book you might want to reach for depending on which hat you’re wearing at the moment. For example, I’d pick up *The Art of R Programming* if I had a question about interfacing R and C, but I’d pick up R in Action if I wanted to read about importing SAS data or using the `ggplot2`

graphics package.

**R in Action**

Kabacoff begins his book off with two appropriate quotes.

What is the use of a book, without pictures or conversations? — Alice,

Alice in WonderlandIt’s wonderous, with treasures to satiate desires both subtle and gross; but it’s not for the timid. — Q, “Q Who?”

Star Trek: The Next Generation

*R in Action* is filled with pictures and conversations. It is also a treasure chest of practical information.

The first third of the book concerns basic data management and graphics. This much of the book would be accessible to someone with no background in statistics. The middle third of the book is devoted to basic statistics: correlation, linear regression, etc. The final third of the book contains more advanced statistics and graphics. (I was pleased to see the book has an appendix on using `Sweave`

and `odfWeave`

to produce reports.)

*R in Action* includes practical details that I have not seen in other books on R. Perhaps this is because the book is focused on analyzing and graphing data rather than exploring the dark corners of R or rounding out statistical theory.

Kabacoff says that he wrote the book that he wishes he’d had years ago. I also wish I’d had his book years ago.

**Related links**:

- R programming for those coming from other languages (referenced in
*R in Action*) - Calling C++ from R
- Better R console fonts

For daily tips on data science, follow @DataSciFact on Twitter.

Here are my first impressions of The Art of R Programming (ISBN 1593273843). I haven’t had time to read it thoroughly, and I doubt I will any time soon. Rather than sitting on it, I wanted to get something out quickly. I may say more about the book later.

The book’s author, Norman Matloff, began his career as a statistics professor and later moved into computer science. That may explain why his book seems to be more programmer-friendly than other books I’ve seen on R.

My impression is that few people actually sit down and learn R the way they’d learn, say, Java. Most learn R in the context of learning statistics. Here’s a statistical chore, and here’s a snippet of R to carry it out. Books on R tend to follow that pattern, organized more by statistical task than by language feature. That serves statisticians well, but it’s daunting to outsiders.

Matloff’s book is organized more like a typical programming book and may be more accessible to a programmer needing to learn R. He explains some things that might require no explanation if you were learning R in the context of a statistics class.

The last four chapters would be interesting even for an experienced R programmer:

- Debugging
- Performance enhancement: memory and speed
- Interfacing R to other languages
- Parallel R

No one would be surprised to see the same chapters in a Java textbook if you replaced “R” with “Java” in the titles. But these topics are not typical in a book on R. They wouldn’t come up in a statistics class because they don’t provide any statistical functionality *per se*. As long as you don’t make mistakes, don’t care how long your code takes to run, and don’t need to interact with anything else, these chapters are unnecessary. But of course these chapters are quite necessary in practice.

As I mentioned up front, I haven’t read the book carefully. So I’m going out on a limb a little here, but I think this may be the book I’d recommend for someone wanting to learn R, especially for someone with more experience in programming than statistics.

**Related post**: R: The Good Parts

I’ve decided to hand my Twitter account RLangTip over to the folks at Revolution Analytics starting next week. I thought it would be better to give the account to someone who is more enthusiastic about R than I am, and so I offered it to David Smith. If you’ve enjoyed RLangTip so far, I expect you’ll like it even better under new ownership.

If you’d like to continue to hear from me on Twitter, you can follow one of my 10 other daily tip accounts or my personal account.

Descriptions of these accounts are available here.