Integrating the clipboard and the command line

Two of my favorite cmdlets from the PowerShell Community Extensions are get-clipboard and out-clipboard. These cmdlets let you read from and write to the Windows clipboard from PowerShell. For example, the following code will grab the contents of the clipboard, replace every block of white-space with a comma, and paste the result back to the clipboard.

(get-clipboard) -replace '\s+(?!$)', ',' | out-clipboard

I saved this to a file comma.ps1 in my path, and I run it whenever I get a list of numbers from one program delimited by newlines or tabs and need to make it the input to another program that expects comma-delimited values, such as turning a column of numbers into an array for R. I copy one format, run comma.ps1, and paste in the new format.

In case you’re curious about the mysterious characters in the script, \s+(?!$) is a regular expression describing where I want to substitute a comma. The \s refers to white-space characters (tabs, spaces, newlines) and the + says the preceding pattern is repeated one or more times, so \s+ matches one or more consecutive white-space characters. That would be enough by itself, but it would replace trailing white-space with a comma too, so I might get an unwanted comma at the end. The sequence (?!$) fixes that. The $ matches the end of the input, and the (?! before and the ) after form a negative lookahead, meaning “except when the thing inside matches.” So taken all together, the regular expression matches chunks of white-space except at the end of the input.

Update: See Manipulating the clipboard with PowerShell

Innovation V

About a month ago I wrote a series of four blog posts on innovation (Innovation I, II, III, IV). Seth Godin just posted an article on his blog entitled The fibula and the safety pin that fits in with that series. From Godin’s post:

Just about everything has a strike against it. It’s either already been done or it’s never been done. Ignore both conditions. Pushing an idea through the dip of acceptance is far more valuable than inventing something that’s never existed… and then walking away from it.

His phrase “pushing an idea through the dip of acceptance” is a good definition of innovation. (It also contains a passing reference to his excellent little book The Dip.) It goes right along with Michael Schrage’s statement that “innovation is not what innovators do but what customers adopt.” Too often we romanticize the inventor and fail to appreciate the toil of the innovator.

Preventing an unpleasant Sweave surprise

Sweave is a tool for making statistical analyses more reproducible by applying literate programming to statistics. Sweave embeds R code inside LaTeX and replaces the code with the results of running it, much as web development languages such as PHP embed code inside HTML.

Sweave is often launched from an interactive R session, but this can defeat the whole purpose of the tool. When you run Sweave this way, the Sweave document inherits the session’s state. Here’s why that’s a bad idea.

Say you’re interactively tinkering with some plots to make them look the way you want. As you go, you’re copying R code into an Sweave file. When you’re done, you run Sweave on your file, compile the resulting LaTeX document, and get beautiful output. You congratulate yourself on having gone to the effort of putting all your R code in an Sweave file so that it will be self-contained and reproducible. You forget about your project, then revisit it six months later. You run Sweave and to your chagrin it doesn’t work. What happened? What might have happened is that your Sweave file depended on a variable that wasn’t defined in the file itself but happened to be defined in your R session. When you open up R months later and run Sweave, that variable may be missing. Or worse, you may happen to have a variable in your session with the right name that now has some unrelated value.

I recommend always running Sweave from a batch file. On Windows you can save the following two lines to a file, say sw.bat, and process a file foo.Rnw with the command sw foo.

  rem Run Sweave in a fresh R session, then compile the resulting LaTeX file
  R.exe -e "Sweave('%1.Rnw')"
  pdflatex.exe %1.tex

This assumes R.exe and pdflatex.exe are in your path. If they are not, you could either add them to your path or put their full paths in the batch file.

Running Sweave from a clean session does not ensure that your file is self-contained; there could still be other implicit dependencies. But running from a clean session improves the chances that someone else will be able to reproduce your results.

See Troubleshooting Sweave for some suggestions for how to prevent or recover from other possible problems with Sweave.

Update: See the links provided by Gregor Gorjanc in the first comment below for related batch files and bash scripts.

* * *

Related: Help integrating R into your environment

How to calculate percentiles in memory-bound applications

I just published a new article on CodeProject: Calculating percentiles in memory-bound applications.

Finding the percentiles of a list of samples is trivial if you can read all the samples into memory and sort the list. If the list is too big to fit into memory, it’s still not terribly difficult, but there are a couple subtleties if you haven’t thought about it before.

The application that motivated the code presented in the article generates a large amount of data. You could think of the software as generating an enormous matrix one row at a time, then computing percentiles of each column. For example, in one problem the matrix had 800,000 columns and 2,000 rows of floating point numbers.
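The article’s approach for the out-of-memory case isn’t reproduced here, but for comparison, the in-memory case really is trivial: sort the samples and pick the appropriate order statistic. Here’s a minimal sketch; the index convention below is one common choice, not necessarily the one used in the article.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Return the p-th percentile (0 <= p <= 1) of a list of samples by
    // sorting and picking the corresponding order statistic.
    double percentile(std::vector<double> samples, double p)
    {
        std::sort(samples.begin(), samples.end());
        double rank = p * samples.size();
        size_t index = (rank <= 1.0) ? 0 : static_cast<size_t>(std::ceil(rank)) - 1;
        if (index >= samples.size()) index = samples.size() - 1;
        return samples[index];
    }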

One program to rule them all

Do you have a single program that you “live in” when you’re at a computer? Emacs users are known for “living” inside Emacs. This means more than just using the program for a large part of the day. It means using the program as the integration point for other programs, a sort of backplane for tying other things together.

Steve Yegge’s most recent blog post described his switch from Windows to Mac. He said the main reason for the switch was that he prefers the appearance of the fonts on a Mac. Changing operating systems was not a big deal for Yegge because he didn’t really live in Windows before, nor does he live in OS X now. He lives in Emacs. He concludes his essay by saying

So I’ll keep using my Macs. They’re all just plumbing for Emacs, anyway. And now my plumbing has nicer fonts.

Graphic artists may spend the majority of their work day using Photoshop, but they don’t send email from Photoshop, and they don’t keep their calendar in Photoshop. So I wouldn’t say they “live” in Photoshop. Microsoft developers spend a great deal of their time inside Visual Studio, though they don’t live inside Visual Studio to the same extent that Emacs users live inside Emacs. The Visual Studio experience is somewhere between Photoshop and Emacs on the “live in” scale. Unlike Emacs, Visual Studio has no ambition to become an operating system, probably because the company that makes Visual Studio already has an operating system.

I once knew someone who lived in Mathematica, doing his word processing etc. inside this mathematical package. Mathematica is a nice place to visit, but I wouldn’t want to live there.

A growing number of people now live inside their web browser, particularly if that browser is Firefox. There are Firefox plug-ins available to mow your lawn and take your children to the orthodontist. Maybe Firefox is becoming the Emacs of a new generation.

The choice of a program to live in is really a choice of how you want to tie applications together. To live in Emacs, you have to write Emacs Lisp, and that’s a deal-breaker for many. Interestingly, Microsoft has a project to create a highly configurable editor some have nicknamed Emacs.NET. You can bet that the extension language will not be Emacs Lisp.

Some people live in their command shell and use shell scripts to tie everything together. While many Unix folks live that way, that hasn’t been practical on Windows until recently when PowerShell came out.

How to calculate binomial probabilities

Suppose you’re doing something that has probability of success p and probability of failure q = 1-p. If you repeat what you’re doing m+n times, the probability of m successes and n failures is given by

\frac{(m+n)!}{m!\, n!} p^m q^n

Now suppose m and n are moderately large. The terms (m+n)! and m! n! will be huge, but the terms p^m and q^n will be tiny. The huge terms might overflow, and the tiny terms might underflow, even though the final result may be a moderate-sized number. The numbers m and n don’t have to be that large for this to happen since factorials grow very quickly. On a typical system, overflow happens if m+n > 170. How do you get to the end result without having to deal with impossibly large or small values along the way?

The trick is to work with logs. You want to first calculate log( (m+n)! ) - log( m! ) - log( n! ) + m log( p ) + n log( q ), then exponentiate the result. This pattern comes up constantly in statistical computing.

Libraries don’t always include functions for evaluating factorial or log factorial. They’re more likely to include functions for Γ(x) and its log. For positive integers n, Γ(n+1) = n!. Now suppose you have a function lgamma that computes log Γ(x). You might write something like this.

    #include <cmath>  // for lgamma, log, and exp

    // Probability of m successes and n failures when the probability of
    // success is p and the probability of failure is q = 1 - p.
    // Work in log space to avoid overflow and underflow.
    double probability(double p, double q, int m, int n)
    {
        double temp = lgamma(m + n + 1.0);
        temp -= lgamma(n + 1.0) + lgamma(m + 1.0);
        temp += m*log(p) + n*log(q);
        return exp(temp);
    }
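For example, assuming the probability function above is in scope, the following computes the probability of exactly 150 successes and 150 failures in 300 trials with p = q = 0.5. The factorials involved are far too large to compute directly, but the final answer is a moderate number, roughly 0.046.

    #include <cstdio>

    int main()
    {
        // (m+n)! = 300! would overflow a double, but the log-space
        // computation returns a moderate value, roughly 0.046.
        printf("%g\n", probability(0.5, 0.5, 150, 150));
        return 0;
    }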

The function lgamma is not part of the ANSI standard library for C or C++, but it is part of the POSIX standard. On Unix-like systems, lgamma is included in the standard library. However, Microsoft does not include lgamma as part of the Visual C++ standard library. On Windows you have to either implement your own lgamma function or grab an implementation from somewhere like Boost.

Here’s something to watch out for with POSIX math libraries. I believe the POSIX standard does not require a function called gamma for evaluating the gamma function Γ(x). Instead, the standard requires functions lgamma for the log of the gamma function and tgamma for the gamma function itself. (The mnemonic is “t” for “true,” meaning that you really want the gamma function.) I wrote a little code that called gamma and tested it on OS X and Red Hat Linux this evening. In both cases gcc compiled the code without warning, even with the -Wall and -pedantic warning flags. But on the Mac, gamma was interpreted as tgamma and on the Linux box it was interpreted as lgamma. This means that gamma(10.0) would equal 362880 on the former and 12.8018 on the latter.
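One way to avoid the ambiguity is to never call gamma at all and to call tgamma and lgamma explicitly. A quick sanity check, using the values from the paragraph above:

    #include <cmath>
    #include <cstdio>

    int main()
    {
        // Gamma(10) = 9! = 362880, and log Gamma(10) is approximately 12.8018.
        printf("tgamma(10.0) = %g\n", tgamma(10.0));  // the gamma function itself
        printf("lgamma(10.0) = %g\n", lgamma(10.0));  // the log of the gamma function
        return 0;
    }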

If you don’t have access to an implementation of log gamma, see How to compute log factorial.

Fibonacci numbers at work

Today I needed to use Fibonacci numbers to solve a problem at work. Fibonacci numbers are great fun, but I don’t recall needing them in an applied problem before.

I needed to compute a series of integrals of the form

f(x, y) = x^a (1-x)^b y^c (1-y)^d \,p(x, y)

over the unit square for a statistical application. The function p(x, y) is a little complicated but its specific form is not important here. If the constants a, b, c, and d are all positive, as they usually are in my application, the integrand can be extended to a continuous periodic function in the plane. Lattice rules are efficient for such integration problems, and the optimal lattice rules for two-variable integration are given by Fibonacci lattice rules.

If F_k is the kth Fibonacci number and {x} denotes the fractional part of x, that is, x minus the largest integer not exceeding x, the kth Fibonacci lattice rule approximates the integral of f by

\frac{1}{F_k} \sum_{j=0}^{F_k-1} f\left( \left\{ \frac{j}{F_k}\right\}, \left\{\frac{jF_{k-1}}{F_k}\right\} \right)

This rule worked well. I compared the results to those using the two-variable trapezoid rule and the lattice rule always gave more accuracy for the same number of integration points. (Normally the trapezoid rule is not very efficient. However, it can be surprisingly efficient for periodic integrands. See point #2 here. But the Fibonacci lattice rule was even more efficient.)

To implement this integration rule, I first had to write a function to compute Fibonacci numbers. This is such a common academic exercise that it is almost a cliché. I was surprised to have a practical need for such a function. My implementation was

    #include <cmath>  // for pow and sqrt

    // Return the kth Fibonacci number, computed from the closed-form
    // (Binet) formula and rounded to the nearest integer.
    int Fibonacci(int k)
    {
        double phi = 0.5*(1.0 + sqrt(5.0));  // the golden ratio
        return int( pow(phi, k)/sqrt(5.0) + 0.5 );
    }

The reason this works is that

F_k = \frac{\varphi^k - (1-\varphi)^k}{\sqrt{5}}

where

\varphi = \frac{1 + \sqrt{5}}{2}

is the golden ratio. The second term in the formula for F_k is smaller than 1/2 in absolute value, and F_k is an integer, so by rounding the first term to the nearest integer we get the exact result. Adding 0.5 before casting the result to an integer causes the result to be rounded to the nearest integer.
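Putting the pieces together, here is a sketch of how the lattice rule above might be implemented. The integrand below is only a placeholder: the exponents are made up, and the factor p(x, y) from my application is omitted since its specific form is not important here.

    #include <cmath>

    // Placeholder integrand with made-up exponents; the real application
    // also multiplies by a function p(x, y) not shown here.
    double f(double x, double y)
    {
        const double a = 2.0, b = 3.0, c = 2.0, d = 3.0;
        return pow(x, a)*pow(1.0 - x, b)*pow(y, c)*pow(1.0 - y, d);
    }

    // Approximate the integral of f over the unit square with the
    // kth Fibonacci lattice rule.
    double FibonacciLatticeRule(int k)
    {
        int Fk  = Fibonacci(k);      // F_k, using the function above
        int Fk1 = Fibonacci(k - 1);  // F_{k-1}
        double sum = 0.0;
        for (int j = 0; j < Fk; ++j)
        {
            double x = double(j)/Fk;                  // {j / F_k}
            double y = fmod(double(j)*Fk1/Fk, 1.0);   // {j F_{k-1} / F_k}
            sum += f(x, y);
        }
        return sum/Fk;
    }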

If you’re interested in technical details of the numerical integration method, see Lattice Methods for Multiple Integration by I. H. Sloan and S. Joe.

Duct tape on the moon

Yesterday’s Science at NASA podcast had a story about duct tape and Apollo 17. (Transcript, audio)

[Image: lunar rover]

The lunar rover lost a fender and they taped it back on with duct tape. That worked for a while, then they had to make a new fender with laminated maps and duct tape.

Why is a fender such a big deal? Without a fender, the astronauts would get dirty. So why is that a big deal?

  1. Dirt is dark, and dark absorbs sunlight. A dirty astronaut may become a fried astronaut.
  2. Dirt scratches visors, making it hard to see.
  3. Dirt gets in parts like hinges and breaks them.

So not only did duct tape save the Apollo 13 astronauts, as in the eponymous movie, it may have saved the Apollo 17 astronauts as well.

* * *

The image above is actually from Apollo 16. Same lunar rover design. I happened to have a set of photos from Apollo 16 that a friend gave me.

Problems versus Dilemmas

In a recent interview on the PowerScripting Podcast, Jeffrey Snover said that software versioning isn’t a problem, it’s a dilemma. The difference is that problems can be solved, but dilemmas can only be managed. No versioning system can do everything that everyone would like.

The same phenomenon exists in biostatistics. As I mentioned in Galen and clinical trials, biostatistics is filled with dilemmas. There are problems along the way that can be solved, but fundamentally biostatistics manages dilemmas.