ReproducibleResearch.org

I started a new website this week, http://www.reproducibleresearch.org, to promote reproducible research.

I’d like to see this become a community site. Depending on how much interest the site stirs up, I may add a blog, a Wiki, etc. For now, if you’d like to contribute, send me articles or links and I’ll add them to the site. You can send email to “contribute” at the domain name.

Porting Visual C++ code to Linux/gcc

Here are a few lessons learned from porting a numerical library recently from Windows/Visual C++ to Linux/gcc.

Some of our code only runs on Windows, and only needs to run on Windows. Our first thought was to put #ifdef WIN32 directives around that code. Clift Norris came up with the clever idea of using #ifndef EXCLUDE_WINDOWS_ONLY_CODE instead. That way we could do a preliminary test of the portable subset of the code while still working on Windows where we’re more comfortable. Also, by not referring specifically to 32-bit Windows, we’re OK moving the code to 64-bit Windows.

Visual C++ does not require source and header files to end with a newline character, but gcc does. We got hundreds of warnings of the form warning: no newline at end of file when we first attempted to compile our code on Linux. Apparently there's no gcc switch to turn this warning off, and it might not be prudent to turn it off even if you could. As I understand it, Visual Studio effectively inserts a line break after the contents of an included header file, but gcc does not, so gcc needs to issue a warning in this case while Visual Studio does not. We copy our source tree to a Linux box and then run the following Python script on that box to append a newline character to each file that needs one.

import os

# List of directories of files to add newlines to.
# Script must be in the same location as these directories.
directories = ["Banana", "Apple", "Peach"]

for directory in directories:
    for filename in os.listdir(directory):
        if filename.endswith(".h") or filename.endswith(".cpp"):
            path = directory + "/" + filename
            handle = open(path, "r")
            slurp = handle.read()
            handle.close()

            # Append a newline only if the file doesn't already end with one.
            if not slurp.endswith("\n"):
                retcode = os.system("chmod +w " + path)
                if retcode != 0:
                    print "chmod returned " + str(retcode) + " on " + path
                else:
                    handle = open(path, "a")
                    handle.write("\n")
                    handle.close()

There were several places in our code where a variable was deliberately unused but retained in a function signature. Suppose a function has signature void foo(int a, int b) but b is unused. We had usually handled that by making the bare statement b; the first line of the implementation. That suppressed unused variable warnings in Visual C++, but not in gcc. Changing the function signature to void foo(int a, int /* b */) made both compilers happy.

We started out using autoconf, but that was overkill for our project. Our build process became two orders of magnitude simpler when we switched over to a crude, old-fashioned makefile. After porting the library to Linux, we built it on OS X without any issues.

This wasn’t an issue for us, but a potential problem when moving numerical code between Unix-like systems is that the function gamma computes different things on different systems. On Linux it computes the logarithm of the mathematical gamma function but on OS X it computes the gamma function itself. See the last two paragraphs of how to calculate binomial probabilities and Thomas Guest’s comment on that post for a full explanation.
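Python's math module sidesteps the ambiguity by giving the two functions separate names, which makes for a quick illustration of just how different the two return values are (this snippet is only an illustration, not part of the library port):

```python
import math

# The gamma function itself: Gamma(5) = 4! = 24
g = math.gamma(5)

# The logarithm of the gamma function: log(24), about 3.178
lg = math.lgamma(5)

print(g, lg)

# Confusing one for the other is an exponential-sized error.
assert abs(math.exp(lg) - g) < 1e-9
```

A library that silently returns one when the caller expects the other will produce results that are wrong by an exponential factor, which is why this portability difference matters.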

This library had been ported to Linux years ago, but nobody used it on Linux and so development continued only on Windows. When we first ported the code, gcc and Visual C++ seemed to have incompatible requirements, especially with templates. The more recent port described here was much easier now that both compilers are more compliant with the C++ standard.

PowerShell script to make an XML sitemap

A while back I wrote a post on how to create a sitemap in the standard sitemap.org format using Python. This post does the same task using PowerShell. The solution presented here is an idiomatic PowerShell solution using pipes, not a direct translation of the Python code. I’ll introduce the script in pieces, then present the entire script at the end.

The final line of the script is

dir | wrap | out-file -encoding ASCII sitemap.xml

The heart of the script is the function wrap that wraps each file’s properties in the necessary XML tags. This function uses the pipeline, and so it has begin, process, and end blocks. The begin block prints out the XML header and the opening <urlset> tag. The end block prints out the closing </urlset> tag. In between is the process block that does most of the work.

Since all unassigned expressions are returned from PowerShell functions, the code is very clean. There's no need for print statements; just state the strings that make up the output. Variable interpolation helps keep the code succinct as well: simply use the name of a variable where you want to insert that variable's value in a string. (Be sure to use double quotes if you want interpolation.)

The wrap function uses the implicit variable $_ which means “the next thing in the pipeline.” Since we’re piping in the output of dir (alias for Get-ChildItem), $_ represents a FileSystemInfo object. We look at the extension property on this object to see whether the file is one of the types we want to include in the sitemap. In this case, .html, .htm, or .pdf. Obviously you can edit the value of the variable $extensions if you want to include different file types in your sitemap.

Getting the file timestamp in the necessary format is particularly easy. The format specifier {0:s} causes the date and time to be written in the ISO 8601 format that the sitemap standard requires. The Z tacked on at the end says that time is UTC rather than some other time zone.
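For comparison, the same ISO 8601 format with a trailing Z can be produced in Python; the date below is made up purely for illustration:

```python
from datetime import datetime, timezone

# A sample UTC timestamp (arbitrary date chosen for the example)
ts = datetime(2008, 1, 15, 9, 30, 0, tzinfo=timezone.utc)

# Format as ISO 8601 with a trailing Z, as the sitemap standard requires
stamp = ts.strftime("%Y-%m-%dT%H:%M:%S") + "Z"
print(stamp)  # 2008-01-15T09:30:00Z
```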

This script will produce a file sitemap.xml in the standard format. Once you upload the sitemap to your server, you’ve got to let the search engines know how to find it. The simplest way to do this is to create a file called robots.txt at the top of your site containing one line, Sitemap: followed by the URL of your sitemap.

Sitemap: http://www.yourdomain.com/sitemap.xml

Now here’s the full script.

# Change this to your URL
$domain = "http://www.yourdomain.com"

# file extensions to include in sitemap
$extensions = ".htm", ".html", ".pdf"

# wrap file information in XML tags
function wrap
{
    begin
    {
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    }

    process
    {
        if ($extensions -contains $_.extension)
        {
            "`t<url>"
            "`t`t<loc>$domain/$_</loc>"
            "`t`t<lastmod>{0:s}Z</lastmod>" -f $_.LastWriteTimeUtc
            "`t</url>"
        }
    }

    end
    {
        "</urlset>"
    }
}

dir | wrap | out-file -encoding ASCII sitemap.xml

Reproducible scientific computing

Greg Wilson gave a great interview on the IT Conversations podcast recently. He says the emphasis on high-performance computing (HPC) draws time and energy away from quality concerns, and may not even help scientists get their results faster. While some problems definitely require HPC, most could be solved faster by developing software in the simpler environment of a single PC and waiting longer for it to run.

I’ve written here about reproducibility problems in statistics and in general software development. Apparently there are similar problems in every area of scientific computing. For example, Wilson quotes a survey of computational economics articles that found that 70% of the results could not be reproduced a year after publication. I doubt that computational economics is worse than other fields.

Wilson says he wants to raise reproducibility expectations in computational research closer to those common in physical research. I admire his efforts, but it’s a sad commentary that reproducibility standards are lower in computational science than in physical science.

Related: Scientific and statistical computing

Diagramming modes of convergence

Sometimes a good diagram is a godsend, reducing the entropy in your head at a glance.

When I was studying integration theory, I ran across a diagram something like the following in the out-of-print book Elements of Integration by Robert G. Bartle.

[Diagram: convergence for a general measure space]

Each arrow summarizes a theorem. The four corners stand for modes of convergence: almost everywhere, almost uniform, Lp, and convergence in measure. A solid line from one mode to another means that if a sequence of functions converges in the first mode, then it also converges in the second. A dashed line means that a subsequence converges in the second mode. For example, if a sequence of functions fn converges to a function f in Lp, then the sequence also converges in measure to f, and a subsequence converges to f almost everywhere.
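As a sketch of the reasoning behind one of these arrows: the implication from Lp convergence to convergence in measure follows in one line from Chebyshev's inequality. For any ε > 0,

\[
\mu\{x : |f_n(x) - f(x)| \ge \varepsilon\}
\le \frac{1}{\varepsilon^p} \int |f_n - f|^p \, d\mu \to 0,
\]

so if the integral on the right goes to zero, the measure on the left does too.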

Note that this isn’t the typical mathematical diagram where the corners represent spaces and the arrows represent functions. I don’t recall seeing anything like this outside of Bartle’s book.

For a finite measure space, additional relationships hold as summarized in the diagram below.

[Diagram: convergence for a finite measure space]

If the sequence fn is uniformly dominated by an Lp function g, then even more relationships hold.

[Diagram: dominated convergence]

For more details, see modes of convergence. These notes also contain counterexamples that show no other relationships exist. Three simple counterexamples suffice to show no arrows are missing.

Uninitialized variables in PowerShell

I just got a bug report about an uninitialized variable in a PowerShell script I’d written. I’d gone through and renamed most instances of a variable, but not all. If I’d put Set-PsDebug -strict in my profile, the instance I missed would have been caught as an uninitialized variable error. I’d always used the analogous warning feature in other languages, such as use strict in Perl or Option Explicit in VB, but I haven’t yet gotten into the habit of using Set-PsDebug -strict in PowerShell.

Jeffrey Snover published an article a few days ago about the new Set-StrictMode cmdlet that will be part of version 2 of PowerShell and will replace Set-PsDebug -strict. The new feature will be more strict and will be more finely configurable.

Plane crashes, software crashes, and business crashes

I’ve run into the same theme in very different contexts lately: people ignore data from crashes.

FlowingData has an article today claiming that, contrary to popular belief, some parts of an airplane are safer than others. According to the article, pundits routinely claim that all seats are equally safe even though data show that the probability of surviving a plane crash varies from 49% in the front of the aircraft up to 69% in the rear.

Also today, Coding Horror published its second article on software crashes. See Crashing Responsibly and Twitter: How Not To Crash Responsibly. Many applications don’t collect data from crashes, and those that do don’t always make good use of it.

Finally, Scott Shane’s book The Illusions of Entrepreneurship (ISBN 0300113315) examines small business crashes. Entrepreneurs, investors, and policy makers often make decisions based on myths that are soundly refuted by data.

Rounding and integer division in PowerShell

The way PowerShell converts floating point numbers to integers makes perfect sense, unless you’ve been exposed to another way first. PowerShell rounds floating point numbers to the nearest integer when casting to int. For example, in PowerShell [int] 1.25 evaluates to 1 but [int] 1.75 evaluates to 2.

When there isn’t a unique nearest integer, i.e. when the decimal part of a number is exactly 0.5, PowerShell rounds to the nearest even integer. This is known as banker’s rounding or round-to-even. So, for example, [int] 1.5 would round to 2 but so would [int] 2.5. The motivation for banker’s rounding is that it is unbiased in the sense that numbers of the form n + 0.5 round up as often as down on average.

Apart from the detail of handling numbers ending in exactly one half, PowerShell does what most people would expect. However, people who program in C and related languages have different expectations. These languages truncate when converting floating point numbers to integers. For example, in C++ both int(1.25) and int(1.75) evaluate to 1. When I learned C, I found its behavior surprising. But now that PowerShell does what I once expected C to do, I find PowerShell surprising.
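Python 3 happens to expose both conventions side by side, which makes a convenient illustration: its round function uses banker's rounding, like PowerShell's [int] cast, while its int function truncates, like a C cast.

```python
# Banker's rounding: ties go to the nearest even integer,
# like PowerShell's [int] cast.
print(round(1.5))  # 2
print(round(2.5))  # 2

# Conversion with int() truncates toward zero, like a C cast.
print(int(1.75))   # 1
print(int(-1.75))  # -1
```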

The PowerShell folks made the right decision for a couple of reasons. For one, they are being consistent with their decision to break with tradition when necessary to do what they believe is right. Also, their primary audience is system administrators, not programmers steeped in C++ or C#.

Another way PowerShell breaks from C tradition is integer division. For example, 5/4 evaluates to 1 in C, but to 1.25 in PowerShell. Both language designs make sense in context. C is explicitly typed, and so the ratio of two integers is an integer. PowerShell is implicitly typed, so integers are converted to doubles when necessary.

(As an aside, Python initially followed the C tradition regarding integer division, but future versions of the language will act more like PowerShell. In the future, the / operator will perform floating point division and the new // operator will perform integer division.)
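That future Python behavior, available today via a __future__ import, looks like this:

```python
from __future__ import division  # default behavior in Python 3

# / performs floating point division, like PowerShell
print(5 / 4)   # 1.25

# // performs integer (floor) division, like C's /
print(5 // 4)  # 1
```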

Barriers to good statistical software

I attended a National Cancer Institute workshop yesterday entitled “Barriers to producing well-tested, user-friendly software for cutting-edge statistical methodology.” I was pleased that everyone there realized there is a huge difference between code created for personal use and reliable software that others would willingly use. Not all statisticians appreciate the magnitude of the difference.

I was also pleased that several people at the workshop were aware of the problem of irreproducible statistical analyses. Not everyone was aware how serious or how common the problem is, but those who were aware were adamant that something needs to be done about it, such as journals requiring authors to publish the code used to analyze their data.

Customizing the PowerShell command prompt II

I just picked up a copy of Windows PowerShell Cookbook by Lee Holmes. One of the first examples in the book is customizing the PowerShell command prompt. His example sets the command window title as part of the prompt function. For example, adding

$host.UI.RawUI.WindowTitle = "$env:computername $($pwd.Path)"

to the function given in my previous post would display the computer name and full path to the working directory in the title bar. The full code would be

function prompt
{
    $m = 30 # maximum prompt length
    $str = $pwd.Path
    if ($str.length -ge $m)
    {
        # The prompt will begin with "...",
        # end with ">", and in between contain
        # as many of the path characters as will fit,
        # reading from the end of the path.
        $str = "..." + $str.substring($str.length - $m + 4)
    }
    $host.UI.RawUI.WindowTitle = "$env:computername $($pwd.Path)"
    "$str> "
}