Calling C++ from R

This post relates my experience with calling C++ from R, both by writing an R package from scratch and by using the inline package.

The most reliable way to speed up R code is to take it out of R. I’ve looked at various tricks for speeding up R code, and none improved performance by more than a few percent. Rewriting R code in C++, however, can speed it up by a couple orders of magnitude.

Until recently, my preferred approach to speeding up an R program has been to rewrite it entirely in C++. Now I’m looking at ways to rewrite only key parts of the code because I’m working with collaborators who want to work in R as much as they can. I’ve tried two approaches. The first I’ll call the bare-knuckle approach, writing an R package starting from C++ sample code from a colleague. The second approach is using the R module inline to embed C++ code in R.

I’ve tested everything on six computers: two running Windows 7, and one each running Windows XP, Ubuntu 10.04, Ubuntu 11.04, and Mac OS X 10.6.

Working with R and C++ on Windows requires installing Rtools and adding it to your path. I have not been able to get Rtools to work on my Windows XP machine. It apparently installs successfully, but everything that requires it fails. On my Windows 7 machines I’ve been able to install R packages containing C++ source but I’ve not been able to build them.

(In R terminology, “building” a package means gathering the source into a format that R wants. This does not “build” in the sense of compiling the C++ source; the C++ compilation happens on install. So, in theory, one could build an R package on one operating system and then install it on another.)

I’ve been able to build R packages on Ubuntu and Mac OS X and install them on Windows 7, albeit with warnings. The install process complains bitterly, but apparently everything works. At least once I got an error message saying that R couldn’t find the compiler, but apparently it did since the code did in fact compile. (Rtools on Windows installs a Unix-like build system, including the gcc compiler.)

Building and installing R packages on Ubuntu has worked smoothly. I’ve been able to build R packages on OS X and install them on Ubuntu, but not always the other way around. Sometimes R packages built on Ubuntu will install on OS X and sometimes not. My Ubuntu 10.04 box runs a newer version of gcc but an older version of R compared to the Mac, so the portability problems may come from tool version differences rather than operating system differences.

The bare-knuckle approach to writing R packages involves a great deal of tedious work. The R package Rcpp eliminates much of this tedium. The R package inline makes the process even simpler. Using inline you can include C++ source code in your R file and have the code compile automatically when the R code runs. This is the simplest way to call C++ from R, when it works. You can read far more about Rcpp and inline on Dirk Eddelbuettel’s web site.

I find the inline approach promising. It could be a convenient way for me to give C++ code to collaborators who may not be able to write C++ but who could read it. It would also eliminate the portability problems that could arise from building a package on one platform and installing it on another.

The inline package worked smoothly on my Ubuntu 11.04 machine, but not at all on Ubuntu 10.04. I was unable to install Rcpp or inline on Ubuntu 10.04 from R itself, though I was able to install these packages through the Synaptic Package Manager. The packages load from R, but they fail when I try to use them.

I was able to get inline to work from Windows 7 and from OS X, though in both cases there are numerous warnings.

It was not immediately clear to me how to use inline to call a C++ function that depends on another C++ function. If you want to call a single C++ function f() from R, you can put its source code in an R string and pass that string as an argument to cxxfunction. But what if the function f() uses another function g()?

My first thought was to define g() inside f(). C++ doesn’t let you define a function inside a function, but it does let you define a struct inside a function, so you can make g() a static method on a local struct to get around this language limitation.

It turns out there’s a more direct method. In addition to the body argument for source code, the function cxxfunction also has an argument includes for header files. You could put the source of g() in a header file and pass a string with a #include statement for that header as the value of includes. However, I don’t know where inline looks for headers; apparently it does not look in the working directory of the R process that calls it. Instead, you can pass the content of the header file as the value of includes rather than a #include statement. After all, the first thing the C++ compiler does with a header file reference is replace it with the contents of the file. So you can put the source for g(), or any other dependencies, in a string passed as the includes argument.

Calling C++ from R has been painful, but I expect will be easier in the future now that I’ve learned a few things the hard way.

Doubly periodic functions

Functions like sine and cosine are periodic. For example, sin(x + 2πn) = sin(x) for all x and any integer n, and so the period of sine is 2π. But what happens if you look at sine or cosine as functions of a complex variable? They’re still periodic if you shift left or right, but not if you shift up or down. If you move up or down, i.e. in a pure imaginary direction, sines and cosines become unbounded.

Doubly periodic functions are periodic in two directions. Formally, a function f(z) of a complex variable is doubly periodic if there exist two constants ω1 and ω2 such that

f(z) = f(z + ω1) = f(z + ω2)

for all z. The ratio ω1 / ω2 cannot be real; otherwise the two periods would point in the same direction. For the rest of this post, I’ll assume ω1 = 1 and ω2 = i. Such functions are periodic in the horizontal (real-axis) and vertical (imaginary-axis) directions. Their behavior everywhere is a repetition of their behavior on the unit square.
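Applying the two relations repeatedly shows that such a function is unchanged by any shift with integer coordinates:

```latex
f(z + m + n i) = f(z) \quad \text{for all integers } m, n.
```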

What do doubly periodic functions look like? It depends on what restrictions we place on the functions. When we’re working with complex functions, we’re typically interested in functions that are analytic, i.e. differentiable as complex functions.

Only constant functions can be doubly periodic and analytic everywhere. Why? A doubly periodic function takes on, over and over, the values it takes on over the unit square. If a function is analytic over the (closed) unit square, then it’s bounded over that square, and since it’s doubly periodic, it’s bounded everywhere. By Liouville’s theorem, the only functions that are analytic and bounded over the entire complex plane are constants.

This says that to find interesting doubly periodic functions, we’ll have to relax our requirements. Instead of requiring functions to be analytic everywhere, we will require them to be analytic except at isolated singularities. That is, the functions are allowed to blow up at a finite number of points. There’s a rich set of such functions, known as elliptic functions.

There are two well-known families of elliptic functions. One is the Weierstrass ℘ function (TeX symbol wp, Unicode U+2118) and its derivatives. The other is the Jacobi functions sn, cn, and dn. These functions have names resembling familiar trig functions because the Jacobi functions are in some ways analogous to trig functions.

It turns out that every elliptic function can be written either as a combination of the Weierstrass function and its derivative or as a combination of the Jacobi functions. Roughly speaking, the Weierstrass functions are easier to work with theoretically and the Jacobi functions are easier to work with numerically.

It takes more than a better mouse trap

Emerson was wrong. The world will not beat a path to your door just because you build a better mouse trap.

No busy, overstressed, fire-putting-out, content-with-the-product-they-have-now person really wants to hear from you. Even when you do build a better mousetrap, the world thinks you’re a giant pain in the ass. Nobody has the time, nobody has the patience, nobody wants to take the risk that your claims are exaggerated … We have to be invited in or we never get to tell our tale.

From Why Johnny Can’t Brand.

Not only does this apply to consumer and business products, it applies to science as well.

Related post: Getting women to smoke

High-brow limericks

Philosophy

Said Plato: “These things that we feel
Are not ontologically real,
But just the excrescence
Of numinous essence
Our senses can never reveal.”

via Futility Closet

Calculus

The integral z-squared dz
From one to the cube root of 3
Times the cosine
Of three pi over nine
Is the log of the cube root of e.

via Scott Franklin

Biology

Planktonic cells are all alone;
More typically in biofilms grown.
Bacterial masses
Whose abundance surpasses
The weight of all elephants known.

via Brendan Niemira

A prime number

Two quintillion, seventy-seven
Quadrillion, three hundred eleven
Trillion, one billion,
Twenty-four million,
One thousand two hundred and seven.

via Andrew

Topology

The topological part of my brain
Finds Möbius strips quite a strain.
But I make you this pledge:
I’ll glue one at its edge
And build a real projective plane.

By Richard Elwes

Feel free to contribute your own limericks in the comments, but please follow these guidelines:

  1. Keep it geeky.
  2. Keep it clean.

Fifteen interviews

Nine people I have interviewed:

Four people who have interviewed me:

Six interviews I have blogged about:

Square root interview question

Imagine some of the answers you might get to “What is the square root of 101?” First, three answers that suggest an interviewee is not strong with math.

  • What’s a square root?
  • You gotta calculator?
  • 101 doesn’t have a square root.

And here are some other answers that might give an idea where an interviewee is coming from.

  • An irrational number. (Pure mathematician)
  • Approximately 10. (Engineer)
  • Somewhere between 10 and 11, closer to 10. (Better engineer)
  • Approximately 10.05, based on a linear Taylor approximation centered at 100. (Applied mathematician)
  • Approximately 10.05, based on one step of Newton’s method. (Computer scientist)

(This is just for amusement. I don’t think quiz show-like interviews are a good way to find people you want to work with for years.)

Platform lock-in

From Baron Schwartz speaking at the O’Reilly Media MySQL Conference:

We talk about proprietary vendor lock-in, but in many cases the reality is that anyone who uses any platform, even an open source one, ends up being locked-in to some extent. Switching is something you just can’t contemplate after a while.

How to fit an elephant

John von Neumann famously said

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

By this he meant that one should not be impressed when a complex model fits a data set well. With enough parameters, you can fit any data set.

It turns out you can literally fit an elephant with four parameters if you allow the parameters to be complex numbers.

I mentioned von Neumann’s quote on StatFact last week and Piotr Zolnierczuk replied with a reference to a paper explaining how to fit an elephant:

“Drawing an elephant with four complex parameters” by Jurgen Mayer, Khaled Khairy, and Jonathon Howard, Am. J. Phys. 78, 648 (2010), DOI: 10.1119/1.3254017.

Piotr also sent me the following Python code he’d written to implement the method in the paper. This code produced the image above.

"""
Author: Piotr A. Zolnierczuk (zolnierczukp at ornl dot gov)

Based on a paper by:
Drawing an elephant with four complex parameters
Jurgen Mayer, Khaled Khairy, and Jonathon Howard,
Am. J. Phys. 78, 648 (2010), DOI:10.1119/1.3254017
"""
import numpy as np
import pylab

# elephant parameters
p1, p2, p3, p4 = (50 - 30j, 18 +  8j, 12 - 10j, -14 - 60j )
p5 = 40 + 20j # eyepiece

def fourier(t, C):
    # Evaluate a truncated Fourier series at the points t.
    # The real parts of C are the cosine coefficients and
    # the imaginary parts are the sine coefficients.
    f = np.zeros(t.shape)
    A, B = C.real, C.imag
    for k in range(len(C)):
        f = f + A[k]*np.cos(k*t) + B[k]*np.sin(k*t)
    return f

def elephant(t, p1, p2, p3, p4, p5):
    npar = 6

    # Pack the real and imaginary parts of the four parameters
    # into Fourier coefficients for the x and y coordinates.
    Cx = np.zeros((npar,), dtype='complex')
    Cy = np.zeros((npar,), dtype='complex')

    Cx[1] = p1.real*1j
    Cx[2] = p2.real*1j
    Cx[3] = p3.real
    Cx[5] = p4.real

    Cy[1] = p4.imag + p1.imag*1j
    Cy[2] = p2.imag*1j
    Cy[3] = p3.imag*1j

    # The fifth parameter places the eye as one extra point.
    x = np.append(fourier(t,Cx), [-p5.imag])
    y = np.append(fourier(t,Cy), [p5.imag])

    return x,y

x, y = elephant(np.linspace(0,2*np.pi,1000), p1, p2, p3, p4, p5)
pylab.plot(y,-x,'.')
pylab.show()

An unexpected Father's Day card

As an undergraduate I was part of an honors program called Dean’s Scholars. I have a T-shirt from a Dean’s Scholars reunion several years ago that my wife doesn’t appreciate as much as I do. She thinks I shouldn’t wear it just because it has started to deteriorate.

She surprised me today with a new Dean’s Scholars T-shirt as a Father’s Day present. She told me that when she asked to buy a T-shirt, the honors program insisted on giving it to her. They also included a handwritten card wishing me a happy Father’s Day.

I was especially impressed with the card. That wasn’t scripted; I don’t imagine they get many calls like my wife’s. It was just someone being thoughtful. That made a bigger impression on me than all the glossy publications I’ve gotten since I graduated.

A surprise with Emacs and Office 2007

I had a little surprise when I tried to open an Excel file from Emacs. I was using dired, a sort of file explorer inside Emacs. I expected one of two things to happen. Maybe Emacs would know to launch the file using Excel. Or maybe it would open the binary file as a bunch of gibberish. Instead I got something like this:

screenshot

This was disorienting at first. Then I thought about how Office 2007 documents are zipped XML files. But how does dired know that my .xlsx file is a zip file? I suppose since Emacs is a Unix application at heart, it’s acting like a Unix application even though I was using it on Windows. It’s determining the file type by inspecting the file itself and ignoring the file extension.

(Office 2007 documents are not entirely XML. The data and styling directives are XML, but embedded files are just files. The spreadsheet in this example contains a large photo. That photo is a JPEG file that gets zipped up with all the XML files that make up the spreadsheet.)

So I learned that Emacs knows how to navigate inside zip files, and that a convenient way to poke around inside an Office 2007 file is to browse into it from Emacs.

Here’s another post that discusses Emacs and Office 2007, contrasting their interface designs: Clutter-discoverability trade-off

Impure math

When Samuel Hansen said in his interview “You’re not a pure mathematician” I agreed without thinking, but later the statement bothered me a little. I know what he meant: considering the two categories of pure math and applied math, you’d put yourself in the latter category. Which is true.

But the term “pure” math can be misleading, as if everyone else does impure math. Applied math is not an alternative to theoretical math. Applied mathematicians prove theorems etc. We work on applications in addition to doing what is expected of pure mathematicians. The difference between pure and applied math is motivation, not content. Applied math is motivated by direct application to non-mathematical problems. Pure math seeks to advance math for its own sake. Both are important.

Statistics uses the terms “theoretical” and “applied” rather than “pure” and “applied.” Math doesn’t use “theoretical” as an antithesis to “applied” because applied math is theoretical. But unlike math, being “applied” in statistics does mean you’re often (too often?) excused from proving theorems. The first time I was a coauthor on a statistics paper I was surprised to find out you could publish with just simulation results and no theorems. This happens in applied math as well, but not nearly as often as it does in applied statistics.

On the other hand, when I hear the term “applied statistics” I want to ask “Is there any other kind?” Statistics is applied (and theoretical!) though some statisticians work more directly on applications than others. As Andrew Gelman quips, the difference between theoretical and applied statisticians is that

The theoretical statistician uses x, the applied statistician uses y (because we reserve x for predictors).

I assume that statement wasn’t meant to be taken literally, but I agree with the sentiment that the distinction between theoretical and applied statistics can be exaggerated. I’d say the same applies to pure and applied math.