Barriers to good statistical software

May 16th, 2008

I attended a National Cancer Institute workshop yesterday entitled “Barriers to producing well-tested, user-friendly software for cutting-edge statistical methodology.” I was pleased that everyone there realized there is a huge difference between code created for personal use and reliable software that others would willingly use. Not all statisticians appreciate the magnitude of the difference.

I was also pleased that several people at the workshop were aware of the problem of irreproducible statistical analyses. Not everyone was aware how serious or how common the problem is, but those who were aware were adamant that something needs to be done about it, such as journals requiring authors to publish the code used to analyze their data.

Customizing the PowerShell command prompt II

May 13th, 2008

I just picked up a copy of Windows PowerShell Cookbook by Lee Holmes. One of the first examples in the book is customizing the PowerShell command prompt. His example sets the command window title as part of the prompt function. For example, adding

$host.UI.RawUI.WindowTitle = "$env:computername $($pwd.Path)"

to the function given in my previous post would display the computer name and full path to the working directory in the title bar. The full code would be

function prompt
{
    $m = 30 # maximum prompt length
    $str = $pwd.Path
    if ($str.length -ge $m)
    {
        # The prompt will begin with "...",
        # end with ">", and in between contain
        # as many of the path characters as will fit,
        # reading from the end of the path.
        $str = "..." + $str.substring($str.length - $m + 4)
    }
    $host.UI.RawUI.WindowTitle = "$env:computername $($pwd.Path)"
    "$str> "
}

Customizing the PowerShell command prompt

May 12th, 2008

By default, the PowerShell command prompt does not echo the current working directory. To customize the command prompt, simply create a function named prompt. If you want this customization to persist, add it to your profile.
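If you have not created a profile yet, here is a minimal sketch of one way to set it up. It relies on the built-in $profile variable, which holds the path of the profile script for the current user and host. Using notepad as the editor is only an assumption; any editor will do.

```powershell
# Create the profile file (and any missing parent folder) only if it is absent.
if (-not (Test-Path $profile))
{
    New-Item -ItemType File -Path $profile -Force | Out-Null
}

# Open the profile in an editor and add the prompt function to it.
notepad $profile
```

Because the profile is itself a script, PowerShell will only load it once script execution is allowed on the machine.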

For example, adding the following line to your profile will cause the working directory to be displayed much like it is in cmd.exe.

function prompt { "$pwd>" }

However, the prompt function can contain any code at all. Here’s a prompt function that will display the right-most part of the working directory. This keeps long working directory names from taking up most of the space at the command line.

function prompt
{
    $m = 30 # maximum prompt length
    $str = $pwd.Path
    if ($str.length -ge $m)
    {
        # The prompt will begin with "...",
        # end with ">", and in between contain
        # as many of the path characters as will fit,
        # reading from the end of the path.
        $str = "..." + $str.substring($str.length - $m + 4)
    }
    "$str> "
}

For example, if

C:\Documents and Settings\Administrator\My Documents\My Music

is the current directory, the prompt would be

...ator\My Documents\My Music>
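To experiment with the truncation without changing your prompt, the same arithmetic can be pulled into a standalone function. This is only a sketch: the name Get-ShortPath and its parameters are made up for illustration and are not part of PowerShell.

```powershell
# Hypothetical helper: the same truncation as in the prompt function,
# but taking the path as an argument so it can be tried on any string.
function Get-ShortPath([string]$path, [int]$m = 30)
{
    if ($path.Length -ge $m)
    {
        # "..." (3 characters) + ($m - 4) path characters + ">" (1 character)
        # adds up to exactly $m characters, not counting the trailing space.
        "..." + $path.Substring($path.Length - $m + 4)
    }
    else
    {
        $path
    }
}

Get-ShortPath 'C:\Documents and Settings\Administrator\My Documents\My Music'
# ...ator\My Documents\My Music
```

Paths shorter than the limit come back unchanged, so Get-ShortPath 'C:\bin' simply returns C:\bin.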

Jenga mathematics

May 11th, 2008

Jenga is a game where you start with a tower of wooden pegs and take turns removing pegs until someone makes the tower collapse. A style of mathematics analogous to Jenga reached the height of its popularity about 40 years ago and then fell out of fashion. I use the phrase “Jenga mathematics” to refer to generalizing a well-known theorem by weakening its hypotheses, seeing how many pegs you can pull out before it falls.

Many 20th century mathematicians spent their careers going over the work of 19th century mathematicians, removing every hypothesis they could. Sometimes a 20th century mathematician would get his name tacked on to a 19th century theorem due to his Jenga accomplishments.

Taken to extremes, Jenga mathematics turns theorems inside-out and proofs become hypotheses. Natural hypotheses are replaced with a laundry list of properties necessary to make the proof work. Start with some theorem of the form “Let X be a widget. Then X has a foozle.” Go back over the proof and see just what features of a widget are needed for the proof. Then restate the theorem as “Let X have the following apparently arbitrary list of properties necessary for my proof to work. Then X has a foozle.” Never mind whether anybody can think of anything other than a widget that satisfies the hypotheses of the new theorem.

Jenga mathematics is no longer fashionable. Mathematicians still value removing unneeded hypotheses, but they’re not as willing to go to extremes to do so. They are more interested in building new towers than in removing every piece possible from old towers.

It’s a bird, it’s a snake, it’s … a duck-billed platypus

May 10th, 2008

The duck-billed platypus is the most recent species to have its genome sequenced. These odd animals are even more strange at the DNA level. Some features of their DNA are avian, some are reptilian, and of course some are mammalian. See the Science Daily article for more details.

Publishing correct sample code

May 9th, 2008

It’s infuriating to read published sample code that’s wrong. Sometimes code given in books is not even syntactically correct. I’ve wondered why publishers didn’t have a way to verify that the code at least compiles, and maybe even check that it gives the stated output.

Dave Thomas said in a recent interview that his publishing company, The Pragmatic Programmers, does just that. Authors write in a logical mark-up language and software turns that into a publishable form, compiling code samples and inserting the output. Sample code from one of their books is more likely to work the first time you type it in than code from other publishers.

Wikipedia in 10 GB

May 9th, 2008

The Stack Overflow podcast, episode 4, mentioned in passing that the Wikipedia database is about 10 GB. I was surprised it isn’t bigger. If that size is correct, you could download a snapshot of Wikipedia to your local hard drive.

Regular expressions in C++ TR1

May 7th, 2008

Regular expressions are not a part of the C++ Standard Library quite yet, but there is a document (Technical Report 1, or TR1) that includes among other things a specification for regular expression support that will probably be added to the C++ standard eventually.

The Boost library has supported TR1 for a while. Microsoft just released a feature pack for Visual Studio 2008 a month ago that includes support for most of TR1. (They’ve left out support for mathematical special functions.) And Dinkumware sells a complete TR1 implementation.

I’ve added some notes to my web site for getting started with C++ TR1 regular expressions. I took my PowerShell regex notes as a starting point and implemented some of the same examples in C++. I changed the organization though, because the C++ implementation is fairly different from PowerShell.

Working with regular expressions is harder in C++ than in scripting languages such as Perl or Python, but not unnecessarily so. C++ is optimized for fine-grained control and efficiency rather than ease of use; that’s what C++ is for. The TR1 implementation is internally consistent and elegant in its own way.

It’s easy to find API-level documentation but harder to find examples for getting started. (I’ve heard good things about Pete Becker’s book The C++ Standard Library Extensions but I haven’t read it.) So I decided to keep some notes as I played with the Visual Studio implementation. I imagine most of the content applies to other implementations, but I’ve only tested the examples using Visual Studio.

Update: GCC just added support for C++ TR1 two days ago with their version 4.3 release.

Bias

May 7th, 2008

An unbiased estimator, very roughly speaking, is a statistic that gives the correct result on average. For a precise definition, see Wikipedia. Unbiasedness is an intuitively desirable property. In fact, it seems indispensable at first.

In the colloquial sense, “bias” is practically synonymous with self-serving dishonesty. Who wants a self-serving, dishonest statistical estimate? But it’s important to remember that “bias” in the statistical sense has a technical meaning that may not correspond to the colloquial meaning.

Here’s the big problem with statistical bias: if U is an unbiased estimator of θ, f(U) is NOT an unbiased estimator of f(θ) in general. For example, standard deviation is the square root of variance, but the square root of an unbiased estimator for variance is not an unbiased estimator for standard deviation. This shows bias has nothing to do with accuracy, since the square root of an accurate estimate of variance is an accurate estimate of standard deviation. In fact, unbiased estimators can be terrible.
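One way to see the standard deviation example is through Jensen’s inequality. The notation here is introduced only for this sketch: S² stands for an unbiased estimator of the variance σ², and S for its square root. Because the square root is strictly concave, and assuming S² is not constant, taking the square root pulls the average down:

```latex
\mathbb{E}\left[S^2\right] = \sigma^2
\quad\Longrightarrow\quad
\mathbb{E}[S] = \mathbb{E}\left[\sqrt{S^2}\right]
  < \sqrt{\mathbb{E}\left[S^2\right]} = \sigma
```

So S underestimates σ on average, even though S² is exactly right on average.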

The fact that unbiasedness is not preserved under transformations calls into question its usefulness. People seldom care directly about abstract statistical parameters. Instead they care about some calculation based on those parameters. An unbiased estimate of the parameters does not generally lead to an unbiased estimate of what people really want to estimate.

LINQ to Regex

May 6th, 2008

Roy Osherove just posted an article introducing his LINQ to Regex project.

LINQ stands for Language INtegrated Query, a way of baking query support into .NET programming languages. Microsoft has been promising a unified way to query all kinds of data for years now. Along the way they came out with a score of new libraries that were going to be the solution. They’d work for all kinds of data that happened to look very much like a relational database. But now with LINQ they’ve finally delivered something that works well not only with relational data but also with hierarchical data such as XML. With LINQ to Regex, you can query unstructured text with LINQ as well.

There are two big advantages to LINQ. First, you can query different kinds of data sources with similar code. Second, “language integrated” means that your programming language knows about your query language, making strong typing and better tool support possible. (By contrast, if you have a SQL statement inside VB, for example, VB knows nothing about SQL. The SQL command is just a string as far as VB is concerned. If the SQL is malformed, you won’t know until runtime. But with LINQ, malformed queries generate compile errors.)

Update: See Scott Hanselman’s discussion of LINQ to Regex.

Blog and website changes

May 3rd, 2008

I’ve made a few changes to my blog and my personal web site and would welcome your feedback.

I added a widget on my blog sidebar to make it easy to subscribe. It seems to work well. Let me know if you have problems.

I added tags to my blog posts. The tag links should help people find more closely related articles if they’re interested. I’m still figuring out how I want to use tags and categories. For now, categories are high-level groupings and tags are more detailed. Also, posts generally fall into one category, maybe two, but often have multiple tags. I appreciate what Thomas Guest said on his blog about eliminating categories and just having tags, but I haven’t decided what I want to do. I’ve thought about adding a tag cloud, but I don’t want the sidebar to be too cluttered. Maybe I’ll add a cloud and cut out the category list. I would appreciate your suggestions.

My personal website now has a sitemap for humans. I’ve had a sitemap for search engines but realized I needed to make it easier for humans to find things on the site as the number of pages has increased.

Update: I just looked at this site with Internet Explorer 6 for the first time. All the content that is supposed to be at the top of the right sidebar is at the bottom, and the main content is pushed off to the left. Has the site always looked bad under IE 6 or did a recent change cause this? Any suggestions how to fix it?

Reusable code vs. re-editable code

May 3rd, 2008

In a recent interview, Donald Knuth made this comment about reusable code.

I also must confess to a strong bias against the fashion for reusable code. To me, “re-editable code” is much, much better than an untouchable black box or toolkit. I could go on and on about this. If you’re totally convinced that reusable code is wonderful, I probably won’t be able to sway you anyway, but you’ll never convince me that reusable code isn’t mostly a menace.

Knuth didn’t elaborate on what he means by “re-editable” code, but I assume he means code that is easy to maintain. The best chance most code has at reuse is remaining useful in its original project over multiple versions, so maybe we’d get more reuse if we focused more on maintainability.

I think whether code should be editable or in “an untouchable black box” depends on the number of developers involved, as well as their talent and motivation. Knuth is a highly motivated genius working in isolation. Most software is developed by large teams of programmers with varying degrees of motivation and talent. I think the further you move away from Knuth along these three axes the more important black boxes become.

Top five gotchas when learning PowerShell

May 2nd, 2008

Here is my list of the top five gotchas when learning Windows PowerShell.

5. PowerShell will not run scripts by default.

4. PowerShell requires .\ to run a script in the current directory.

3. PowerShell uses -eq, -gt, etc. for comparison operators.

2. PowerShell uses backquote as the escape character.

1. PowerShell separates function arguments with spaces, not commas.

See PowerShell gotchas for more details and an explanation for why PowerShell made the design decisions it did. As surprising as these features are, there are good reasons for each.
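The short session below sketches what each gotcha looks like in practice. The script name myscript.ps1 and the function Show-Args are made up for illustration, and RemoteSigned is just one common policy choice, not a recommendation from the list above.

```powershell
# Gotcha 5: scripts are blocked until the execution policy is relaxed.
# Run this once from an elevated prompt if you decide to allow scripts:
# Set-ExecutionPolicy RemoteSigned

# Gotcha 4: a script in the current directory needs an explicit path.
# .\myscript.ps1        # hypothetical script name

# Gotcha 3: comparisons use -eq, -gt, etc.; ">" is output redirection.
$isBigger = 5 -gt 3     # True
# Writing 5 > 3 instead would send "5" to a file named 3.

# Gotcha 2: backquote is the escape character, so `n is a newline.
$twoLines = "first`nsecond"

# Gotcha 1: arguments are separated by spaces, not commas.
function Show-Args($a, $b) { "a=$a b=$b" }   # hypothetical helper
Show-Args 2 3           # a=2 b=3
Show-Args 2, 3          # a=2 3 b=   (the array 2, 3 is bound to $a)
```

The last line is the one that bites most often: the comma builds an array, so the call succeeds but the second parameter is silently left empty.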

Readable path listings

May 1st, 2008

Windows has never made it easy to read long environment variables. If I display the path on one machine I get something like this, both from cmd and from PowerShell.

C:\bin;C:\bin\Python25\;C:\bin\TeX\miktex\bin;C:\bin\TeX\MiKTeX\miktex\bin;C:\bin\Perl\bin\;C:\Program Files\Compaq\Compaq Management Agents\Dmi\Win32\Bin; ...

The System Properties window is worse since you can only see a tiny slice of your path at a time.

Here’s a PowerShell one-liner to produce a readable path listing:

$env:path -replace ";", "`n"

This produces

C:\bin
C:\bin\Python25\
C:\bin\TeX\miktex\bin
C:\bin\TeX\MiKTeX\miktex\bin
C:\bin\Perl\bin\
C:\Program Files\Compaq\Compaq Management Agents\Dmi\Win32\Bin
...

(If you’re not familiar with PowerShell, note the backquote before the n to indicate the newline character to replace semicolons. This is one of the most unconventional features of PowerShell since backslash is the escape character in most contexts. Because Windows uses either forward or backward slashes as path separators, PowerShell could not use backslash as an escape character. Think of the backquote as a little backslash. Once you get over the initial shock, you get used to the backquote quickly.)

Update: It occurred to me after the original post that there’s an even simpler way to display the path.

$env:path.split(';')

Integrating the clipboard and the command line

April 30th, 2008

Two of my favorite cmdlets from the PowerShell Community Extensions are get-clipboard and out-clipboard. These cmdlets let you read from and write to the Windows clipboard from PowerShell. For example, the following code will grab the contents of the clipboard, replace every block of white-space with a comma, and paste the result back to the clipboard.

(get-clipboard) -replace '\s+(?!$)', ',' | out-clipboard 

I saved this to a file comma.ps1 in my path and run it when I get a list of numbers from one program delimited by newlines or tabs and need to make it the input to another program expecting comma-delimited values. For example, turning a column of numbers into an array for R. I copy one format, run comma.ps1, and paste in the new format.

In case you’re curious about the mysterious characters in the script, \s+(?!$) is a regular expression describing where I want to substitute a comma. The \s refers to white-space characters (tabs, spaces, newlines) and the + says this is repeated one or more times. So match one or more consecutive white-space characters. That would be enough by itself, but it would replace trailing white-space with a comma too, so I might get an unwanted comma at the end. The sequence (?!$) fixes that. The $ matches the end of line. The (?! before and the ) after form a negative lookahead, meaning “except when the thing inside matches.” So taken all together, the regular expression matches chunks of white-space except at the end of the input.
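The expression can be tried on an ordinary string, without the clipboard cmdlets. The sample text below is invented for illustration: four numbers separated by a newline, a tab, and two spaces, followed by one trailing newline.

```powershell
# Sample text standing in for clipboard contents.
$text = "1`n2`t3  4`n"

# Each interior block of white-space becomes a single comma.
$result = $text -replace '\s+(?!$)', ','

$result -eq "1,2,3,4`n"    # True: the trailing newline is left alone
```

One caveat, based on how regex backtracking works: if the text ends in a run of several blanks, the expression can still match all but the last of them. For example, "1 2  " with two trailing spaces becomes "1,2, ", so it may be worth trimming such input first.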