Archive for the ‘Software development’ Category

Languages that are easy to pick back up

Tuesday, July 1st, 2008

Some programming languages are much easier to come back to than others. In my previous post I mentioned that Mathematica is easy to come back to, put Perl is not.

I found it easy to come back LaTeX after not using it for a while. It has a few quirks, but it’s basically consistent. The LaTeX commands for Greek letters are their names, lower case names for lower case letters, upper case names for upper case letters. The command mathematical symbols is usually the name a mathematician would give the symbol. Modes always begin with \begin and end with \end.

Python also has a consistent syntax that make it easier to come back to the language after a break. Someone has said that Python is similar to Perl, except that the word “except” does not appear nearly so often in the Python documentation.

It’s more important that a language be internally consistent than conventional. Each of the languages I mentioned have their peculiarities. Mathematica uses square brackets for function argument arguments. LaTeX uses percent signs for comments. Python uses indention to denote blocks. Each of these take a little getting used to, but each makes sense in its own context.

A special case of consistency is using full names for keywords. Mathematica always spells out words in full. For example, the gamma distribution object is named GammaDistribution. I don’t mind a little extra typing. I’d rather optimize for recall and readability than minimize keystrokes since I spend more time recalling and reading than typing. (One flaw in LaTeX is that it occasionally uses unnecessary abbreviations. For example, \infty for infinity. The corresponding Mathematica keyword is Infinity.)

Tips for using regular expressions

Friday, June 27th, 2008

Jeff Atwood just posted a good article on regular expressions. Not the syntax of regular expressions but rather the strategy of when and how to use them.

Why computer scientists count from zero

Thursday, June 26th, 2008

In my previous post, cohort assignments in clinical trials, I mentioned in passing how you could calculate cohort numbers from accrual numbers if the world were simpler than it really is.

Suppose you want to treat patients in groups of 3. If you count patients and cohorts starting from 1, then patients 1, 2, and 3 are in cohort 1. Patients 4, 5, and 6 are in cohort 2. Patients 7, 8, and 9 are in cohort 3, etc. In general patient n is in cohort 1 + ⌊(n-1)/3⌋.

If you start counting patients and cohorts from 0, then patients 0, 1, and 2 are in cohort 0. Patients 3, 4, and 5 are in cohort 1. Patients 6, 7, and 8 are in cohort 2, etc. In general patient n is in cohort ⌊n/3⌋.

These kinds of calculations, common in computer science, are often simpler when you start counting from 0. If you want to divide things (patients, memory locations, etc.) into groups of size k, the nth item is in group ⌊n/k⌋. In C notation, integer division truncates to an integer and so the expression is even simpler: n/k.

Counting centuries is confusing because we count from 1. That’s why the 1900’s were the 20th century etc. If we called the century immediately following the birth of Christ the 0th century, then the 1900’s would be the 19th century.

Because computer scientists usually count from 0, most programming languages also count from zero. Fortran and Visual Basic are notable exceptions.

The vast majority of humanity finds counting from 0 unnatural and so there is a conflict between how software producers and consumers count. Demanding that average users learn to count from zero is absurd. So the programmer must either use one-based counting internally, and risk confusing his peers, or use zero-based counting internally, and risk forgetting to do a conversion for input or output. I prefer the latter. The worst option is to vascillate between the two approaches. 

Why functional programming hasn’t taken off

Sunday, June 22nd, 2008

Bjarne Stroustrup made a comment in an interview about functional programming. He said advocates of functional programming have been in charge of computer science departments for 30 years now, and yet functional programming has hardly been used outside academia. Maybe it’s because it’s not practical, at least in its purest form.

I’ve heard other people say that functional programming is the way to go, but most programmers aren’t smart enough to work that way and its too hard for the ones who are smart enough to go against the herd. But there are too many brilliant maverick programmers out there to make such a condescending explanation plausible. Stroustrup’s explanation makes more sense.

Let me quickly address some objections.

  • Yes, there have been very successful functional programming projects.
  • Yes, procedural programming languages are adding support for functional programming.
  • Yes, the rise of multi-core processors is driving the search for ways to make concurrent programming easier, and functional programming has a lot to offer.

I fully expect there will be more functional programming in the future, but it will be part of a multi-paradigm approach. On the continuum between pure imperative programming and pure functional programming, development will move toward the functional end, but not all the way. A multi-paradigm approach could be a jumbled mess, but it doesn’t have to be. One could clearly delineate which parts of a code base are purely functional (say, because they need to run concurrently) and which are not (say, for efficiency). The problem of how to mix functional and procedural programming styles well seems interesting and tractable.

[Stroustrup’s remark came from an OnSoftware podcast. I’ve listed to several of his podcasts with OnSoftware lately but I don’t remember which one contained his comment about functional programming.]

Bugs in food and software

Thursday, June 19th, 2008

What is an acceptable probability of finding bug parts in a box of cereal? You can’t say zero. As the acceptable probability goes to zero, the price of a box of cereal goes to infinity. In practice, the FDA sets very small but non-zero limits on the probability of finding bug parts in food. This is unsettling at first, but there’s no rational way around it.

What is an acceptable probability of finding bugs in your software? Again, you can’t say zero. The cost increases without bound as the quality requirements increase. In my previous post, I wrote about the extraordinary quality procedures for writing software for space probes. And yet even these projects have to tolerate some non-zero probability of error. It’s not worthwhile to spend 10 billion dollars to prevent a bug in a billion dollar mission.

Bugs are a fact of life. We can insist that they are unacceptable or we can pretend they don’t exist, but neither approach is constructive. It’s better to focus on the probability of running into bugs and consequences of running into bugs.

Not all bugs have the same consequences. It’s distasteful to find a piece of a roach leg in your can of green beans, but it’s not the end of the world. Toxic microscopic bugs are more serious. Along the same lines, a software bug that causes incorrect hyphenation is hardly the same as a bug that causes a plane crash. To get an idea of the potential economic cost of  running into a bug, and therefore the resources worthwhile to detect and fix it, multiply the probability by the consequences.

How do you estimate the probabilities of software bugs? The same way you estimate the probability of bugs in food: by conducting experiments and analyzing data. Some people find this very hard to accept. They understand that testing is necessary in the physical world, but they think software is entirely different and must be proven correct in some mathematical sense. They object that computer programs are complex systems, too complex to test. Computer programs are complex, but human bodies are far more complex, and yet we conduct tests on human subjects all the time to estimate different probabilities, such as the probabilities of drug toxicity.

Another objection to software testing is that it can only test paths through the software that are actually taken, not all potential paths. That’s true, but the most important data when estimating the probability of running into a bug is data from people using the software under normal conditions. A bug that you never run into has no consequences.

But what about people using software in unanticipated ways? I certainly find it frustrating when I uncover bugs when I use a program in an atypical way. But this is not terribly different from physical systems. Bridges may fail when they’re subject to loads they weren’t designed for. There is a difference, however. Most software is designed to permit far more uses than can be tested, whereas there’s less of a gap in physical systems between what is permissible and what is testable. Unit testing helps. If every component of a software system works correctly in isolation, it more likely, though not certain, that the components will work correctly together in a new situation. Still, there’s no getting around the fact that the best tested uses are the most likely to succeed.

Software in Space

Tuesday, June 17th, 2008

The latest episode of Software Engineering Radio has an interview with Hans-Joachim Popp of the German aerospace company DLR. A bug in the software embedded in a space probe could cost years of lost time and billions of dollars. These folks have to write solid code.

The interview gives some details of the unusual practices DLR uses to produce such high quality code. For one, Popp said that his company writes an average of 12 lines of test code for every line of production code. They also pair junior and senior developers. The junior developer writes all the code, and the senior developer picks it apart.

Such extreme attention to quality doesn’t come cheap. Popp said that they produce about 0.6 lines of (production?) code per hour.

C++ templates may reduce memory footprint

Monday, June 16th, 2008

One of the complaints about C++ templates is that they can cause code bloat. But Scott Meyers pointed out in an interview that some people are using templates in embedded systems applications because templates result in smaller code.

C++ compilers only generate code for template methods that are actually used in an application, so it’s possible that code using templates may result in a smaller executable than code that a more traditional object oriented approach.

Solving the problem with Visual Source Safe and time zones

Friday, June 13th, 2008

If you use Microsoft Visual SourceSafe (VSS) with developers in more than one time zone, you may be in for an unpleasant surprise.

VSS uses the local time on each developer’s box as the time of a check in/out. If every developer’s time is set by the same reference, say an NNTP server, then no problems will result. But what if one developer is in a different time zone? Say one developer is in Boston and one in Houston. The Boston developer checks in a file at 2:00 PM Eastern time, then 10 minutes later the Houston developer checks out the file, quickly makes a change, and checks the file back in at 2:20 Eastern time, 1:20 Central local time. VSS now says the latest version of the file is the one made at 2:00 Eastern time. When the Houston developer looks at VSS, the Boston developer’s changes were make 40 minutes into the future!

This has been a problem with VSS for all versions prior to VSS 2005, and is still a problem in the most recent version by default. Starting with the 2005 version, you can configure VSS 2005 to use “server local time.” This means all transactions will use the time on the server where the VSS repository is located. The time is stored internally as UTC (GMT) but displayed to each user according to his own time zone. In the example above, the server would record the Boston check-in as 7:00 PM UTC and the Houston check-in as 7:20 PM UTC. The Boston user would see the check-ins as happening at 2:00 and 2:20 Eastern time, and the Houston user would see the check-ins as happening at 1:00 and 1:20 Central time. Importantly, everyone agrees which check-in occurred first.

A more subtle version of the problem can occur even if all users are in the same time zone but have not synchronized their clocks. This is a good reason to use server local time even if everyone works in the same city.

Although it is possible to set server local time in VSS 2005, it still uses client local time by default, presumably for backward compatibility. You have to turn on server local time by opening the VSS Administrator tool and clicking on Tools/Options and going to the Time Zone tab.

SourceSafe Options dialog for time zones

Microsoft has written about this problem at http://support.microsoft.com/kb/931804. Note that the solution only applies to Visual SourceSafe 2005 and later.

When you set the time zone, you will get dire warnings discouraging you from doing so.

To avoid unintended data loss, do not change the time zone for a Visual SourceSafe database after it has been established and is being used.

You may have to let VSS set idle for as many hours as your developers span time zones to let everything synchronize.

Programming language subsets

Wednesday, June 11th, 2008

I just found out that Douglas Crockford has written a book JavaScript: The Good Parts. I haven’t read the book, but I imagine it’s quite good based on having seen the author’s JavaScript videos.

Crockford says JavaScript is an elegant and powerful language at its core, but it suffers from numerous egregious flaws that have been impossible to correct due to its rapid adoption and standardization.

I like the idea of carving out a subset of a language, the good parts, but several difficulties come to mind.

  1. Although you may limit yourself to a certain language subset, your colleagues may choose a different subset. This is particularly a problem with an enormous language such as Perl. Coworkers may carve out nearly disjoint subsets for their own use.
  2. Features outside your intended subset may be just a typo away. You have to have at least some familiarity with the whole language because every feature is a feature you might accidentally use.
  3. The parts of the language you don’t want to use still take up space in your reference material and make it harder to find what you’re looking for.

One of the design principles of C++ is “you only pay for what you use.” I believe the primary intention was that you shouldn’t pay a performance penalty for language features you don’t use, and C++ delivers on that promise. But there’s a mental price to pay for language features you don’t use. As I’d commented about Perl before, you have to use the language several hours a week just to keep it loaded in your memory.

There’s an old saying that when you marry a girl you marry her family. A similar principle applies to programming languages. You may have a subset you love, but you’re going to have to live with the rest of the language.

Experienced programmers and lines of code

Tuesday, June 3rd, 2008

I heard of a study recently that concluded inexperienced and experienced programmers write about the same number of lines of code per day. The difference is that experienced programmers keep more of those lines of code, making steady progress toward a goal. Less experienced programmers write large chunks of code only to rip them out and rewrite the same chunk many times until the code appears to work. Or instead of ripping out the code, they debug for days on end, changing one or two lines at a time, almost at random, until the code appears to work.

As Greg Wilson pointed out in his interview, focusing on quality in software development often results in increased productivity as well. More effort goes into forward progress and less goes into re-work.

Not only do experienced programmers produce more lines of code worth keeping each day, they also accomplish more per line of code, sometimes dramatically more. But that’s not news. It’s well known that the best programmers aren’t just a little more productive than average, they’re one or two orders of magintude more productive. (See, for example, Joel Spolsky’s book Smart and Gets Things Done.) More interesting is that the best programmers don’t seem to have a much larger capacity for producing and understanding lines of code.

There have also been studies that show programmers produce about the same number of lines of code per day independent of the language they use. You might think that someone working in assembly language could produce more lines of per day than someone writing in a higher level language such as VB or Java, but that’s not the case. It seems that while counting lines of code is a terrible way to measure productivity, it is a good way to measure what you can expect someone to be able to hold in their head.

ReproducibleResearch.org

Friday, May 30th, 2008

I started a new web site this week, http://www.reproducibleresearch.org, to promote reproducible research.

I’d like to see this become a community site. Depending on how much interest the site stirs up, I may add a blog, a Wiki, etc. For now, if you’d like to contribute, send me articles or links and I’ll add them to the site. You can send email to “contribute” at the domain name.

Porting Visual C++ code to Linux/gcc

Thursday, May 29th, 2008

Here are a few lessons learned from porting a numerical library recently from Windows/Visual C++ to Linux/gcc.

Some of our code only runs on Windows, and only needs to run on Windows. Our first thought was to put #ifdef WIN32 directives around that code. Clift Norris came up with the clever idea of using #ifndef EXCLUDE_WINDOWS_ONLY_CODE instead. That way we could do a preliminary test of the portable subset of the code while still working on Windows where we’re more comfortable. Also, by not referring specifically to 32-bit Windows, we’re OK moving the code to 64-bit Windows.  

Visual C++ does not require source and header files to end with newline characters, but gcc does. We got hundreds of warnings of the form warning: no newline at end of file when we first attempted to compile our code on Linux. Apparently there’s no gcc switch to turn this off, and it may not be prudent to turn it off if you could. As I understand it, Visual Studio inserts a linebreak after including header files, but gcc may not and so gcc needs to issue a warning in this case while Visual Studio does not. We copy our source tree to a Linux box then run the following Python code on that box to insert the extra newline characters when needed.

import os  

# List of directories of files to add newlines to.
# Script must be in the same location as these directories.
directories = ["Banana", "Apple", "Peach"] 

for dir in directories:
    for file in os.listdir(dir):
        if file.endswith(".h") or file.endswith(".cpp"):
            path = dir + "/" + file
            handle = open(path, "r")
            slurp = handle.read()
            handle.close() 

            if not slurp.endswith("\n"):
                retcode = os.system("chmod +w " + path)
                if retcode != 0:
                    print "chmod returned " + retcode + " on " + path
                else:
                    handle = open(path, "a")
                    handle.write("\n")
                    handle.close()

There were several places in our code where a variable was deliberately unused but retained in a function signature. Suppose a function has signature void foo(int a, int b) but b is unused. We had usually handled that by making b; the first line of the implementation. That would suppress unused variable warnings in Visual C++, but not in gcc. When we changed the function signature to void foo(int a, int /* b */), that made both compilers happy.

We started out using autoconf, but that was overkill for our project. Our build process became two orders of magnitude simpler when we switched over to a crude, old-fashioned make file. After porting the library to Linux, we built it on OS X without any issue.

This wasn’t an issue for us, but a potential problem when moving numerical code between Unix-like systems is that the function gamma computes different things on different systems. On Linux it computes the logarithm of the mathematical gamma function but on OS X it computes the gamma function itself. See the last two paragraphs of how to calculate binomial probabilities and Thomas Guest’s comment on that post for a full explanation.

This library had been ported to Linux years ago, but nobody used it on Linux and so development continued only on Windows. When we first ported the code, gcc and Visual C++ seemed to have incompatible requirements, especially with templates. The more recent port described here was much easier now that both compilers are more compliant with the C++ standard.

PowerShell script to make an XML sitemap

Tuesday, May 27th, 2008

A while back I wrote a post on how to create a sitemap in the standard sitemap.org format using Python. This post does the same task using PowerShell. The solution presented here is an ideomatic PowerShell solution using pipes, not a direct translation of the Python code. I’ll introduce the script in pieces, then present the entire script at the end.

The final line of the script is

dir | wrap | out-file -encoding ASCII sitemap.xml

The heart of the script is the function wrap that wraps each file’s properties in the necessary XML tags. This function uses the pipeline, and so it has begin, process, and end blocks. The begin block prints out the XML header and the opening <urlset> tag. The end block prints out the closing </urlset> tag. In between is the process block that does most of the work.

Since all unassigned expressions are returned from PowerShell functions, the code is very clean. No need for print statements, just state the strings that make up the output. Variable interpolation helps keep the code succinct as well: simply use the name of a variable where you want to insert that variable’s value in a string. (Be sure to use double quotes if you want interpolation.)

The wrap function uses the implicit variable $_ which means “the next thing in the pipeline.” Since we’re piping in the output of dir (alias for Get-ChildItem), $_ represents a FileSystemInfo object. We look at the extension property on this object to see whether the file is one of the types we want to include in the sitemap. In this case, .html, .htm, or .pdf. Obviously you can edit the value of the variable $extensions if you want to include different file types in your sitemap.

Getting the file timestamp in the necessary format is particularly easy. The format specifier {0:s} causes the date and time to be written in the ISO 8601 format that the sitemap standard requires. The Z tacked on at the end says that time is UTC rather than some other time zone.

This script will produce a file sitemap.xml in the standard format. Once you upload the sitemap to your server, you’ve got to let the search engines know how to find it. The simplest way to do this is to create a file called robots.txt at the top of your site containing one line, Sitemap: followed by the URL of your sitemap.

Sitemap: http://www.yourdomain.com/sitemap.xml

Now here’s the full script.

# Change this to your URL
$domain = "http://www.yourdomain.com"   

# file extensions to include in sitemap
$extensions = ".htm", ".html", ".pdf"   

# wrap file information in XML tags
function wrap
{
    begin
    {
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    }   

    process
    {
        if ($extensions -contains $_.extension)
        {
        "`t<url>"
        "`t`t<loc>$domain/$_</loc>"
        "`t`t<lastmod>{0:s}Z</lastmod>" -f $_.LastWriteTimeUTC
        "`t</url>"
        }
    }   

    end
    {
        "</urlset>"
    }
}   

dir | wrap | out-file -encoding ASCII sitemap.xml

Reproducible scientific computing

Monday, May 26th, 2008

Greg Wilson gave a great interview on the IT Conversations podcast recently. He says the emphasis on HPC draws time and energy away from quality concerns, and may not even help scientists get their results faster. While some problems definitely require HPC, most could be solved faster by developing software in the simpler environment of a single PC and waiting longer for it to run.

I’ve written here about reproducibility problems in statistics and in general software development. Apparently there are similar problems in every area of scientific computing. For example, Wilson quotes a survey of computational economics articles that found that 70% of the results could not be reproduced a year after publication. I doubt that computational economics is worse than other fields.

Wilson says he wants to make raise the reproducibility expectations in computational research closer to those common in physical research. I admire his efforts, but it’s a sad commentary that reproducibility standards are lower in computational science than in physical science.

Uninitialized variables in PowerShell

Saturday, May 24th, 2008

I just got a bug report about an uninitialized variable in a PowerShell script I’d written. I’d gone through and renamed most instances of a variable, but not all. If I’d put Set-PsDebug -strict in my profile, the instance I missed would have been caught as an unitialized variable error. I always used the analogous warning feature in other languages, such as use strict in Perl or option explict in VB, but I haven’t gotten into the habit yet of using Set-PsDebug -strict in PowerShell.

Jeffrey Snover published an article a few days ago about the new Set-StrictMode cmdlet that will be part of version 2 of PowerShell and will replace Set-PsDebug -strict. The new feature will be more strict and will be more finely configurable.