R without Hadley Wickham

Tim Hopper asked on Twitter today:

#rstats programming without @hadleywickham’s libraries is like ________ without _________.

Some of the replies were:

  • (skydiving, a parachute)
  • (gouging your own eyes out, NULL)
  • (dentistry, anesthesia)
  • (shaving, a razor)
  • (Internet shopping, credit card)

Clearly there’s a lot of love out there for Hadley Wickham’s R packages. I’m familiar with his ggplot2 graphics package, and it’s quite impressive. I need to look at his other packages as well.

Related: Help integrating R into your environment

Beta inequalities in R

Someone asked me yesterday for R code to compute the probability P(X > Y + δ) where X and Y are independent beta random variables. I’m posting the solution here in case it benefits anyone else.

For an example of why you might want to compute this probability, see A Bayesian view of Amazon resellers.

Let X be a Beta(a, b) random variable and Y be a Beta(c, d) random variable. Denote PDFs by f and CDFs by F. Then the probability we need is

P(X > Y + \delta) &=& \int_\delta^1 \int_0^{x-\delta} f_X(x) \, f_Y(y)\, dy,dx \ &=& \int_\delta^1 f_X(x)\, F_Y(x-\delta) \, dx

If you just need to compute this probability a few times, here is a desktop application to compute random inequalities.

But if you need to do this computation repeated inside R code, you could use the following.

beta.ineq <- function(a, b, c, d, delta)
{
    integrand <- function(x) { dbeta(x, a, b)*pbeta(x-delta, c, d) }
    integrate(integrand, delta, 1, rel.tol=1e-4)$value
}

The code is as good or as bad as R’s integrate function. It’s probably accurate enough as long as none of the parameters a, b, c, or d are near zero. When one or more of these parameters is small, the integral is harder to compute numerically.

There is no error checking in the code above. A more robust version would verify that all parameters are positive and that delta is less than 1.

Here’s the solution to the corresponding problem for gamma random variables, provided delta is zero: A support one-liner.

And here is a series of blog posts on random inequalities.

Click to learn more about Bayesian statistics consulting

 

Installing R and Rcpp

Almost exactly a year ago, I wrote about my frustration calling C++ from R. Maybe this will become an annual event because I’m back at it.

This time my experience was more pleasant. I was able to install Rcpp on an Ubuntu virtual machine and run example 2.2 from the Rcpp FAQ once I upgraded R to the latest version. I wrote up some notes on the process of installing the latest version of R and Rcpp on Ubuntu.

I have not yet been able to run Rcpp on Windows.

Update: Thanks to Dirk Eddelbuettel for pointing out in the comments that you can install Rcpp from the shell rather than from inside R by running sudo apt-get install r-cran-rcpp. With this approach, I was able to install Rcpp without having to upgrade R first.

Machine Learning for Hackers

Drew Conway and John Myles White have a new book out, Machine Learning for Hackers (ISBN 1449303714). As the name implies, the emphasis is on exploration rather than mathematical theory. Lots of code, no equations.

If you’re looking for a hands-on introduction to machine learning, maybe as a prelude to or complement to a more theoretical text, you’ll enjoy this book. Even if you’re not all that interested in machine learning, you might enjoy the examples, such as how a computer could find patterns in senatorial voting records and twitter networks. And R users will find examples of using advanced language features to solve practical problems.

For daily tips on data science, follow @DataSciFact on Twitter.

DataSciFact twitter icon

Comparing R to smoking

Francois Pinard comparing the R programming language to smoking:

Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.

I’ve never gotten to the point that I would call using R pleasurable.

Quote via Stats in the Wild

Related posts:

Running Python and R inside Emacs

Emacs org-mode lets you manage blocks of source code inside a text file. You can execute these blocks and have the output display in your text file. Or you could export the file, say to HTML or PDF, and show the code and/or the results of executing the code.

Here I’ll show some of the most basic possibilities. For much more information, see orgmode.org. And for the use of org-mode in research, see A Multi-Language Computing Environment for Literate Programming and Reproducible Research.

Source code blocks go between lines of the form

#+begin_src
#+end_src

On the #+begin_src line, specify the programming language. Here I’ll demonstrate Python and R, but org-mode currently supports C++, Java, Perl, etc. for a total of 35 languages.

Suppose we want to compute √42 using R.

#+begin_src R
sqrt(42)
#+end_src

If we put the cursor somewhere in the code block and type C-c C-c, org-mode will add these lines:

#+results:
: 6.48074069840786

Now suppose we do the same with Python:

#+begin_src python
from math import sqrt
sqrt(42)
#+end_src

This time we get disappointing results:

#+results:
: None

What happened? The org-mode manual explains:

… code should be written as if it were the body of such a function. In particular, note that Python does not automatically return a value from a function unless a return statement is present, and so a ‘return’ statement will usually be required in Python.

If we change sqrt(42) to return sqrt(42) then we get the same result that we got when using R.

By default, evaluating a block of code returns a single result. If you want to see the output as if you were interactively using Python from the REPL, you can add :results output :session following the language name.

#+begin_src python :results output :session
print "There are %d hours in a week." % (7*24)
2**10
#+end_src

This produces the lines

#+results:
: There are 168 hours in a week.
: 1024

Without the :session tag, the second line would not appear because there was no print statement.

I had to do a couple things before I could get the examples above to work. First, I had to upgrade org-mode. The version of org-mode that shipped with Emacs 23.3 was quite out of date. Second, the only language you can run by default is Emacs Lisp. You have to turn on support for other languages in your .emacs file. Here’s the code to turn on support for Python and R.

(org-babel-do-load-languages
    'org-babel-load-languages '((python . t) (R . t)))

Update: My next post shows how to call code in written in one language from code written in another language.

Related posts:

R in Action

No Starch Press sent me a copy of The Art of R Programming last Fall and I wrote a review of it here. Then a couple weeks ago, Manning sent me a copy of R in Action (ISBN 1935182390). Here I’ll give a quick comparison of the two books, then focus specifically on R in Action.

Comparing R books

Norman Matloff, author of The Art of R Programming, is a statistician-turned-computer scientist. As the title may imply, Matloff’s book has more of a programmer’s perspective on R as a language.

Robert Kabacoff, author of R in Action, is a psychology professor-turned-statistical consultant. And as its title may imply, Kabacoff’s book is more about using R to analyze data. That is, the book is organized by analytical task rather than by language feature.

Many R books are organized like a statistical text. In fact, many are statistics texts, organized according to the progression of statistical theory with R code sprinkled in. R in Action is organized roughly in the order of steps one would take to analyze data, starting with importing data and ending with producing reports.

In short, The Art of R Programming is for programmers, R in Action is for data analysts, and most other R books I’ve seen are for statisticians. Of course a typical R user is to some extent a programmer, an analyst, and a statistician. But this comparison gives you some idea which book you might want to reach for depending on which hat you’re wearing at the moment. For example, I’d pick up The Art of R Programming if I had a question about interfacing R and C, but I’d pick up R in Action if I wanted to read about importing SAS data or using the ggplot2 graphics package.

R in Action

Kabacoff begins his book off with two appropriate quotes.

What is the use of a book, without pictures or conversations? — Alice, Alice in Wonderland

It’s wonderous, with treasures to satiate desires both subtle and gross; but it’s not for the timid. — Q, “Q Who?” Star Trek: The Next Generation

R in Action is filled with pictures and conversations. It is also a treasure chest of practical information.

The first third of the book concerns basic data management and graphics. This much of the book would be accessible to someone with no background in statistics. The middle third of the book is devoted to basic statistics: correlation, linear regression, etc. The final third of the book contains more advanced statistics and graphics. (I was pleased to see the book has an appendix on using Sweave and odfWeave to produce reports.)

R in Action includes practical details that I have not seen in other books on R. Perhaps this is because the book is focused on analyzing and graphing data rather than exploring the dark corners of R or rounding out statistical theory.

Kabacoff says that he wrote the book that he wishes he’d had years ago. I also wish I’d had his book years ago.

Related links:

For daily tips on data science, follow @DataSciFact on Twitter.

DataSciFact twitter icon

The Art of R Programming

Here are my first impressions of The Art of R Programming (ISBN 1593273843). I haven’t had time to read it thoroughly, and I doubt I will any time soon. Rather than sitting on it, I wanted to get something out quickly. I may say more about the book later.

The book’s author, Norman Matloff, began his career as a statistics professor and later moved into computer science. That may explain why his book seems to be more programmer-friendly than other books I’ve seen on R.

My impression is that few people actually sit down and learn R the way they’d learn, say, Java. Most learn R in the context of learning statistics. Here’s a statistical chore, and here’s a snippet of R to carry it out. Books on R tend to follow that pattern, organized more by statistical task than by language feature. That serves statisticians well, but it’s daunting to outsiders.

Matloff’s book is organized more like a typical programming book and may be more accessible to a programmer needing to learn R. He explains some things that might require no explanation if you were learning R in the context of a statistics class.

The last four chapters would be interesting even for an experienced R programmer:

  • Debugging
  • Performance enhancement: memory and speed
  • Interfacing R to other languages
  • Parallel R

No one would be surprised to see the same chapters in a Java textbook if you replaced “R” with “Java” in the titles. But these topics are not typical in a book on R. They wouldn’t come up in a statistics class because they don’t provide any statistical functionality per se. As long as you don’t make mistakes, don’t care how long your code takes to run, and don’t need to interact with anything else, these chapters are unnecessary. But of course these chapters are quite necessary in practice.

As I mentioned up front, I haven’t read the book carefully. So I’m going out on a limb a little here, but I think this may be the book I’d recommend for someone wanting to learn R, especially for someone with more experience in programming than statistics.

Related post: R: The Good Parts

RLangTip changing hands

I’ve decided to hand my Twitter account RLangTip over to the folks at Revolution Analytics starting next week. I thought it would be better to give the account to someone who is more enthusiastic about R than I am, and so I offered it to David Smith. If you’ve enjoyed RLangTip so far, I expect you’ll like it even better under new ownership.

If you’d like to continue to hear from me on Twitter, you can follow one of my 10 other daily tip accounts or my personal account.

Descriptions of these accounts are available here.

Calling C++ from R

This post relates my experience with calling C++ from R by writing an R module from scratch and by the inline module.

The most reliable way to speed up R code is to take it out of R. I’ve looked at various tricks for speeding up R code and none have improved the performance of code more than a few percent. Rewriting R code in C++, however, can speed it up by a couple orders of magnitude.

Until recently, my preferred approach to speeding up an R program has been to rewrite it entirely in C++. Now I’m looking at ways to rewrite only key parts of the code because I’m working with collaborators who want to work in R as much as they can. I’ve tried two approaches. The first I’ll call the bare-knuckle approach, writing an R package starting from C++ sample code from a colleague. The second approach is using the R module inline to embed C++ code in R.

I’ve tested everything on six computers: two running Windows 7, and one each running Windows XP, Ubuntu 10.04, Ubuntu 11.04, and Mac OS X 10.6.

Working with R and C++ on Windows requires installing Rtools and adding it to your path. I have not been able to get Rtools to work on my Windows XP machine. It apparently installs successfully, but everything that requires it fails. On my Windows 7 machines I’ve been able to install R packages containing C++ source but I’ve not been able to build them.

(In R terminology, “building” a package means gathering the source into a format that R wants. This does not “build” in the sense of compiling the C++ source; the C++ compilation happens on install. So, in theory, one could build an R package on one operating system and then install it on another.)

I’ve been able to build R packages on Ubuntu and Mac OS X and install them on Windows 7, albeit with warnings. The install process complains bitterly, but apparently everything works. At least once I got an error message saying that R couldn’t find the compiler, but apparently it did since the code did in fact compile. (Rtools on Windows installs a Unix-like build system, including the gcc compiler.)

Building and installing R packages on Ubuntu has worked smoothly. I’ve been able to build R packages on OS X and install them on Ubuntu, but not always the other way around. Sometimes R packages built on Ubuntu will install on OS X and sometimes not. My Ubuntu 10.04 box runs a newer version of gcc but an older version of R compared to the Mac, so the portability problems may come from tool version differences rather than operating system differences.

The bare-knuckle approach to writing R packages involves a great deal of tedious work. The R package Rcpp eliminates much of this tedium. The R package inline makes the process even simpler. Using inline you can include C++ source code in your R file and have the code compile automatically when the R code runs. This is the simplest way to call C++ from R, when it works. You can read far more about Rcpp and inline on Dirk Eddelbuettel’s web site.

I find the inline approach promising. It could be a convenient way for me to give C++ code to collaborators who may not be able to write C++ but who could read it. It would also eliminate the portability problems that could arise from building a package on one platform and installing it on another.

The inline module worked smoothly on my Ubuntu 11.04 machine, but not at all on Ubuntu 10.04. I was unable to install Rcpp or inline on Ubuntu 10.04 from R itself, though I was able to install these packages through the Synaptic Package Manager. The packages load from R, but fail when I try to use them.

I was able to get inline to work from Windows 7 and from OS X, though in both cases there are numerous warnings.

It was not immediately clear to me how to use inline to call a C++ function that depended on another C++ function. If you want to call a single C++ function f() from R, you could put its source code in an R string and pass that string as an argument to cxxfunction. But what if the function f() uses a function g()?

My first thought was to define g() inside f(). C++ doesn’t let you define a function inside a function, but it does let you define a struct inside a function, so you could make g() a method on a struct to get around this language limitation.

It turns out there’s a more direct method. In addition to the body argument for source code, the function cxxfunction also has an argument includes for header files. You could put the source of g() in a header file and pass a string with a #include statement for that header as the value of includes. However, I don’t know where inline looks for headers. Apparently it does not look in the working directory of the R process that calls it. However, you could pass the content of the header file as the value of includes rather than a #include statement. After all, the first thing the C++ compiler will do with a header file reference is replace it with the contents of the file. So you could put the source for g() or any other dependencies in a string passed as the includes argument.

Calling C++ from R has been painful, but I expect will be easier in the future now that I’ve learned a few things the hard way.