Integrating the clipboard and the command line

Two of my favorite cmdlets from the PowerShell Community Extensions are get-clipboard and out-clipboard. These cmdlets let you read from and write to the Windows clipboard from PowerShell. For example, the following code will grab the contents of the clipboard, replace every block of white-space with a comma, and paste the result back to the clipboard.

(get-clipboard) -replace 's+(?!$)', ',' | out-clipboard

I saved this to a file comma.ps1 in my path and run it when I get a list of numbers from one program delimited by newlines or tabs and need to make it the input to another program expecting comma-delimited values. For example, turning a column of numbers into an array for R. I copy one format, run comma.ps1, and paste in the new format.

In case you’re curious about the mysterious characters in the script, s+(?!$) is a regular expression describing where I want to substitute a comma. The s refers to white-space characters (tabs, spaces, newlines) and the +says this is repeated one or more times. So match one or more consecutive white-space characters. That would be enough by itself, but it would replace trailing white-space with a comma too, so I might get an unwanted comma at the end. The sequence (?!$) fixes that. The $ matches the end of line. The (?! before and the ) after form a negative look ahead, meaning “except when the thing inside matches.” So taken all together, the regular expression matches chunks of white-space except at the end of the input.

Update: See Manipulating the clipboard with PowerShell

Read More

Innovation V

About a month ago I wrote a series of four blog posts on innovation. The most important theme from these posts is the statement by Seth Godin just posted an article on his blog entitled The fibula and the safety pin that fits in with a series of innovation post I did about a month ago (Innovation I, II, III, IV). From Godin’s post:

Just about everything has a strike against it. It’s either already been done or it’s never been done. Ignore both conditions. Pushing an idea through the dip of acceptance is far more valuable than inventing something that’s never existed… and then walking away from it.

His phrase “pushing an idea through the dip of acceptance” is a good definition of innovation. (It also contains a passing reference to his excellent little book The Dip.) It goes right along with Michael Schrage’s statement that “innovation is not what innovators do but what customers adopt.” Too often we romanticize the inventor and fail to appreciate the toil of the innovator.

Read More

Preventing an unpleasant Sweave surprise

Sweave is a tool for making statistical analyses more reproducible by using literate programming in statistics. Sweave embeds R code inside LaTeX and replaces the code with the result of running the code, much like web development languages such as PHP embed code inside HTML.

Sweave is often launched from an interactive R session, but this can defeat the whole purpose of the tool. When you run Sweave this way, the Sweave document inherits the session’s state. Here’s why that’s a bad idea.

Say you’re interactively tinkering with some plots to make them look like you want. As you go, you’re copying R code into an Sweave file. When you’re done, you run Sweave on your file, compile the resulting LaTeX document, and get beautiful output. You congratulate yourself on having gone to the effort to put all your R code in an Sweave file so that it will be self-contained and reproducible. You forget about your project then revisit it six months later. You run Sweave and to your chagrin it doesn’t work. What happened? What might have happened is that your Sweave file depended on a variable that wasn’t defined in the file itself but happened to be defined in your R session. When you open up R months later and run Sweave, that variable may be missing. Or worse, you happen to have a variable in your session with the right name that now has some unrelated value.

I recommend always running Sweave from a batch file. On Windows you can save the following two lines to a file, say sw.bat, and process a file foo.Rnw with the command sw foo.

  R.exe -e "Sweave('%1.Rnw')"
  pdflatex.exe %1.tex

This assumes R.exe and pdflatex.exe are in your path. If they are not, you could either add them to your path or put their full paths in the batch file.

Running Sweave from a clean session does not insure that your file is self-contained. There could still be other implicit dependencies. But running from a clean session improves the chances that someone else will be able to reproduce the results.

See Troubleshooting Sweave for some suggestions for how to prevent or recover from other possible problems with Sweave.

Update: See the links provided by Gregor Gorjanc in the first comment below for related batch files and bash scripts.

Read More

How to calculate percentiles in memory-bound applications

I just published a new article on CodeProject: Calculating percentiles in memory-bound applications.

Finding the percentiles of a list of samples is trivial if you can read all the samples into memory and sort the list. If the list is too big to fit into memory, it’s still not terribly difficult, but there are a couple subtleties if you haven’t thought about it before.

The application that motivated the code presented in the article generates a large amount of data. You could think of the software as generating an enormous matrix one row at a time, then computing percentiles of each column. For example, in one problem the matrix had 800,000 columns and 2,000 rows of floating point numbers.

Read More

One program to rule them all

Do you have a single program that you “live in” when you’re at a computer? Emacs users are known for “living” inside Emacs. This means more than just using the program for a large part of the day. It means using the program as the integration point for other programs, a sort of backplane for tying other things together.

Steve Yegge’s most recent blog post described his switch from Windows to Mac. He said the main reason for the switch was that he prefers the appearance of the fonts on a Mac. Changing operating systems was not a big deal for Yegge because he didn’t really live in Windows before, nor does he live in OS X now. He lives in Emacs. He concludes his essay by saying

So I’ll keep using my Macs. They’re all just plumbing for Emacs, anyway. And now my plumbing has nicer fonts.

Graphic artists may spend the majority of their work day using Photoshop, but they don’t send email from Photoshop, and they don’t keep their calendar in Photoshop. So I wouldn’t say they “live” in Photoshop. Microsoft developers spend a great deal of their time inside Visual Studio, though they don’t live inside Visual Studio to the same extent that Emacs users live inside Emacs. The Visual Studio experience is somewhere between Photoshop and Emacs on the “live in” scale. Unlike Emacs, Visual Studio has no ambition to become an operating system, probably because the company that makes Visual Studio already has an operating system.

I once knew someone who lived in Mathematica, doing his word processing etc. inside this mathematical package. Mathematica is a nice place to visit, but I wouldn’t want to live there.

A growing number of people now live inside their web browser, particularly if that browser is Firefox. There are Firefox plug-ins available to mow your lawn and take your children to the orthodontist. Maybe Firefox is becoming the Emacs of a new generation.

The choice of a program to live in is really a choice of how you want to tie applications together. To live in Emacs, you have to write Emacs Lisp, and that’s a deal-breaker for many. Interestingly, Microsoft has a project to create a highly configurable editor some have nick-named Emacs.NET. You can bet that the extension language will not be Emacs Lisp.

Some people live in their command shell and use shell scripts to tie everything together. While many Unix folks live that way, that hasn’t been practical on Windows until recently when PowerShell came out.

By the way, you can run PowerShell and Emacs at the same time. See Jeffrey Snover’s blog post PowerShell Running Inside of Emacs.

Read More

How to calculate binomial probabilities

Suppose you’re doing something that has probability of success p and probability of failure q = 1-p. If you repeat what you’re doing m+n times, the probability of m successes and n failures is given by

frac{(m+n)!}{m!, n!} p^m q^n

Now suppose m and n are moderately large. The terms (m+n)! and m! n! will be huge, but the terms pm and qn will be tiny. The huge terms might overflow, and the tiny terms might underflow, even though the final result may be a moderate-sized number. The numbers m and n don’t have to be that large for this to happen since factorials grow very quickly. On a typical system, overflow happens if m+n > 170. How do you get to the end result without having to deal with impossibly large or small values along the way?

The trick is to work with logs. You want to first calculate log( (m+n)! ) – log( m! ) – log( n! ) + m log( p ) + n log( q ) , then exponentiate the result. This pattern comes up constantly in statistical computing.

Libraries don’t always include functions for evaluating factorial or log factorial. They’re more likely to include functions for Γ(x) and its log. For positive integers n, Γ(n+1) = n!. Now suppose you have a function lgamma that computes log Γ(x). You might write something like this.

    double probability(double p, double q, int m, int n)
    { 
        double temp = lgamma(m + n + 1.0);
        temp -=  lgamma(n + 1.0) + lgamma(m + 1.0);
        temp += m*log(p) + n*log(q);
        return exp(temp);
    }

The function lgamma is not part of the ANSI standard library for C or C++, but it is part of the POSIX standard. On Unix-like systems, lgamma is included in the standard library. However, Microsoft does not include lgamma as part of the Visual C++ standard library. On Windows you have to either implement your own lgamma function or grab an implementation from somewhere like Boost.

Here’s something to watch out for with POSIX math libraries. I believe the POSIX standard does not require a function called gamma for evaluating the gamma function Γ(x). Instead, the standard requires functions lgamma for the log of the gamma function and tgamma for the gamma function itself. (The mnemonic is “t” for “true,” meaning that you really want the gamma function.) I wrote a little code that called gamma and tested it on OS X and Red Hat Linux this evening. In both cases gcc compiled the code without warning, even with the -Wall and -pedantic warning flags. But on the Mac, gamma was interpreted as tgamma and on the Linux box it was interpreted as lgamma. This means that gamma(10.0) would equal 362880 on the former and 12.8018 on the latter.

If you don’t have access to an implementation of log gamma, see How to compute log factorial.

Read More

Fibonacci numbers at work

Today I needed to use Fibonacci numbers to solve a problem at work. Fibonacci numbers are great fun, but I don’t recall needing them in an applied problem before.

I needed to compute a series of integrals of the form

f(x, y) = xa (1-x)b yc (1-y)d p(x, y)

over the unit square for a statistical application. The function p(x, y) is a little complicated but its specific form is not important here. If the constants a, b, c, and d are all positive, as they usually are in my application, the integrand can be extended to a continuous periodic function in the plane. Lattice rules are efficient for such integration problems, and the optimal lattice rules for two-variable integration are given by Fibonacci lattice rules.

If Fk is the kth Fibonacci number and {x} is the remainder when subtracting off the greatest integer less than x, the kth Fibonacci lattice rule approximates the integral of f by

Fibonacci lattice integration rule

This rule worked well. I compared the results to those using the two-variable trapezoid rule and the lattice rule always gave more accuracy for the same number of integration points. (Normally the trapezoid rule is not very efficient. However, it can be surprisingly efficient for periodic integrands. See point #2 here. But the Fibonacci lattice rule was even more efficient.)

To implement this integration rule, I first had to write a function to compute Fibonacci numbers. This is such a common academic exercise that it is almost a cliché. I was surprised to have a practical need for such a function. My implementation was

    int Fibonacci(int k)
    {
        double phi = 0.5*(1.0 + sqrt(5.0));
        return int( pow(phi, k)/sqrt(5.0) + 0.5 );
    }

The reason this works is that

equation for Fibonacci numbers

where

definition of phi, golden ratio

is the golden ratio. The second term in the formula for Fk is smaller than 1, and Fk is an integer, so by rounding the first term to the nearest integer we get the exact result. Adding 0.5 before casting the result to an integer causes the result to be rounded to the nearest integer.

If you’re interested in technical details of the numerical integration method, see Lattice Methods for Multiple Integration by I. H. Sloan and S. Joe.

Read More

Duct tape on the moon

Yesterday’s Science at NASA podcast had an entertaining story about duct tape and Apollo 17. (Transcript, audio)

The lunar rover lost a fender and they taped it back on with duct tape. That worked for a while, then they had to make a new fender with laminated maps and duct tape.

Why is a fender such a big deal? Without a fender, the astronauts would get dirty. So why is that a big deal?

  1. Dirt is dark, and dark absorbs sun light. A dirty astronaut may become a fried astronaut.
  2. Dirt scratches visors, making it hard to see.
  3. Dirt gets in parts like hinges and breaks them.

moonbuggy photo

Read More

Problems versus dilemmas

In a recent interview on the PowerScripting Podcast, Jeffrey Snover said that software versioning isn’t a problem, it’s a dilemma. The difference is that problems can be solved, but dilemmas can only be managed. No versioning system can do everything that everyone would like.

The same phenomena exists in biostatistics. As I’d mentioned in Galen and clinical trials, biostatistics is filled with dilemmas. There are problems along the way that can be solved, but fundamentally biostatistics manages dilemmas.

Read More

Four types of errors

There are two kinds of errors discussed in classical statistics, unimaginatively named Type I and Type II. Aside from having completely non-mnemonic names, they represent odd concepts.

Technically, a Type I error consists of rejecting the “null hypothesis” (roughly speaking, the assumption of no effect, the hypothesis you typically set out to disprove) in favor of the “alternative hypothesis” when in fact the null hypothesis is true. A Type II error consists of accepting the null hypothesis (technically, failing to reject the null hypothesis) when in fact the null hypothesis is false.

Suppose you’re comparing two treatments with probabilities of efficacy θj and θk. The typical null hypothesis is that θj = θk, i.e. that the treatments are identically effective. The corresponding alternative hypothesis is that θj ≠ θk, that the treatments are not identical. Andrew Gelman says in this presentation (page 87)

I’ve never made a Type 1 error in my life. Type 1 error is θj = θk, but I claim they’re different. I’ve never studied anything where θj = θk.

I’ve never made a Type 2 error in my life. Type 2 error is θj ≠ θk, but I claim they’re the same. I’ve never claimed that θj = θk.

But I make errors all the time!

(Emphasis added.) The point is that no two treatments are ever identical. People don’t usually mean to test whether two treatments are exactly equal but rather that they’re “nearly” equal, though they are often fuzzy about what “nearly” means.

Instead of Type I and Type II errors, Gelman proposes we concentrate on “Type S” errors (sign errors, concluding θj > θk when in fact θj < θk) and “Type M” errors (magnitude errors, concluding that an effect is larger than it truly is.)

Read More

Automated software builds

My first assignment as a professional programmer was to build another person’s program. I learned right away not to assume a project will build just because the author says it will. I’ve seen the same pattern repeated everywhere I’ve worked. Despite version control systems and procedures, there’s usually some detail in the developer’s head that doesn’t get codified and only the original developer can build the project easily.

The first step in making software builds reproducible is documentation. There’s got to be a document explaining how to extract the project from version control and build it. Requiring screen shots helps since developers have to rehearse their own instructions in order to produce the shots.

The second step is verification. Documentation needs to be tested, just like software. Someone who hasn’t worked on the project needs to extract the code onto a clean machine and build the project using only written instructions — no conversation with the developer allowed. Everyone thinks their code is easy to build; experience says most people are wrong.

The verifiers need to rotate. If one person serves as build master very long, they develop the same implicit knowledge that the original programmers failed to codify.

The third step is automation. Automated instructions are explicit and testable. If automation also saves time, so much the better, but automation is worthwhile even if it does not save time. Clift Norris and I just wrote an article on CodeProject entitled Automated Extract and Build from Team System using PowerShell that helps with this third step if you’re using Visual Studio and VSTS.

Read More

Database-first or object model-first software development

The most recent edition of the .NET Rocks podcast interviewed Jeremy Miller and David Laribee. They made the observation that Microsoft’s development tools are implicitly designed to support database-first development rather than object model-first development. Because they are committed to agile and object model-first development, they’re feeling resistance from some of Microsoft’s products.

Miller and Laribee are part of an interesting group of developers who call themselves “ALT.NET” because they’re developing on Microsoft’s .NET framework but are looking for alternative tools. They embrace the bottom of the Microsoft technology stack — operating systems, languages, compilers — but not the top — tools for modelling, code generation, testing, etc.

Read More

IPv6

The most recent RunAs Radio podcast interviews Sean Siler on IPv6, the eventual replacement for IPv4, the current Internet protocol address scheme. At the current rate, we will run out of IPv4 addresses in May 2010. Previous estimated dates for running out of IP addresses have come and gone, and we keep coming up with ways to postpone the date. Still, no one disputes that we’ll run out of IPv4 addresses some day.

IPv4 uses 32-bit addresses and IPv6 uses 128-bit addresses. The latter uses four times as many bits, but that represents 296 times as many addresses. Siler illustrates this by saying that if the IPv4 address space were as wide as an atomic nucleus, the IPv6 address space would be a light-month wide.

I checked Siler’s calculation. Thirty light-days is 9.8 x 296 femtometers, so the analogy is correct for a nucleus of diameter 9.8 fm. According to Wikipedia, atomic nuclei range from about 1.6 fm (hydrogen) to 15 fm (uranium), so there’s some element in between for which he’s right.

Read More