Programming language subsets

I just found out that Douglas Crockford has written a book, JavaScript: The Good Parts. I haven’t read the book, but I imagine it’s quite good based on having seen the author’s JavaScript videos.

Crockford says JavaScript is an elegant and powerful language at its core, but it suffers from numerous egregious flaws that have been impossible to correct due to its rapid adoption and standardization.

I like the idea of carving out a subset of a language, the good parts, but several difficulties come to mind.

  1. Although you may limit yourself to a certain language subset, your colleagues may choose a different subset. This is particularly a problem with an enormous language such as Perl. Coworkers may carve out nearly disjoint subsets for their own use.
  2. Features outside your intended subset may be just a typo away. You have to have at least some familiarity with the whole language because every feature is a feature you might accidentally use.
  3. The parts of the language you don’t want to use still take up space in your reference material and make it harder to find what you’re looking for.

One of the design principles of C++ is “you only pay for what you use.” I believe the primary intention was that you shouldn’t pay a performance penalty for language features you don’t use, and C++ delivers on that promise. But there’s a mental price to pay for language features you don’t use. As I’ve commented about Perl before, you have to use the language several hours a week just to keep it loaded in your memory.

There’s an old saying that when you marry a girl you marry her family. A similar principle applies to programming languages. You may have a subset you love, but you’re going to have to live with the rest of the language.

Porting Visual C++ code to Linux/gcc

Here are a few lessons learned from porting a numerical library recently from Windows/Visual C++ to Linux/gcc.

Some of our code only runs on Windows, and only needs to run on Windows. Our first thought was to put #ifdef WIN32 directives around that code. Clift Norris came up with the clever idea of using #ifndef EXCLUDE_WINDOWS_ONLY_CODE instead. That way we could do a preliminary test of the portable subset of the code while still working on Windows where we’re more comfortable. Also, by not referring specifically to 32-bit Windows, we’re OK moving the code to 64-bit Windows.

Visual C++ does not require source and header files to end with newline characters, but gcc does. We got hundreds of warnings of the form warning: no newline at end of file when we first attempted to compile our code on Linux. Apparently there’s no gcc switch to turn this off, and it may not be prudent to turn it off if you could. As I understand it, Visual Studio inserts a linebreak after including header files, but gcc may not and so gcc needs to issue a warning in this case while Visual Studio does not. We copy our source tree to a Linux box then run the following Python code on that box to insert the extra newline characters when needed.

import os

# List of directories of files to add newlines to.
# Script must be in the same location as these directories.
directories = ["Banana", "Apple", "Peach"]

for directory in directories:
    for name in os.listdir(directory):
        # Only C++ headers and source files need the trailing newline.
        if name.endswith((".h", ".cpp")):
            path = os.path.join(directory, name)
            with open(path, "r") as handle:
                slurp = handle.read()

            # Append a newline only if the file does not already end with one.
            if not slurp.endswith("\n"):
                # Files copied from Windows may be read-only.
                retcode = os.system("chmod +w " + path)
                if retcode != 0:
                    print("chmod returned " + str(retcode) + " on " + path)
                else:
                    with open(path, "a") as handle:
                        handle.write("\n")

There were several places in our code where a variable was deliberately unused but retained in a function signature. Suppose a function has signature void foo(int a, int b) but b is unused. We had usually handled that by making the expression statement b; the first line of the implementation. That suppressed unused-variable warnings in Visual C++, but not in gcc. When we changed the function signature to void foo(int a, int /* b */), both compilers were happy.

We started out using autoconf, but that was overkill for our project. Our build process became two orders of magnitude simpler when we switched over to a crude, old-fashioned make file. After porting the library to Linux, we built it on OS X without any issue.

This wasn’t an issue for us, but a potential problem when moving numerical code between Unix-like systems is that the function gamma computes different things on different systems. On Linux it computes the logarithm of the mathematical gamma function but on OS X it computes the gamma function itself. See the last two paragraphs of how to calculate binomial probabilities and Thomas Guest’s comment on that post for a full explanation.
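To make the distinction concrete, here is a small sketch in Python rather than C. Python's math module gives the two functions unambiguous names, math.gamma and math.lgamma, much as C99's tgamma and lgamma do. The helper name gamma_and_log_gamma is made up for this illustration.

```python
import math

def gamma_and_log_gamma(x):
    """Return (Gamma(x), log|Gamma(x)|) using the unambiguous library names."""
    return math.gamma(x), math.lgamma(x)

g, lg = gamma_and_log_gamma(5.0)

# Gamma(5) = 4! = 24, while log Gamma(5) = log(24), roughly 3.178.
# Code that expects one and silently gets the other will be badly wrong.
print("gamma(5)     =", g)
print("log gamma(5) =", lg)

# The two values are related by exponentiation.
print("consistent:", math.isclose(math.exp(lg), g))
```

Calling the explicitly named functions instead of a bare gamma sidesteps the ambiguity entirely.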

This library had been ported to Linux years ago, but nobody used it on Linux and so development continued only on Windows. When we first ported the code, gcc and Visual C++ seemed to have incompatible requirements, especially with templates. The more recent port described here was much easier now that both compilers are more compliant with the C++ standard.

Regular expressions in C++ TR1

Regular expressions are not a part of the C++ Standard Library quite yet, but there is a document (Technical Report 1, or TR1) that includes among other things a specification for regular expression support that will probably be added to the C++ standard eventually.

The Boost library has supported TR1 for a while. Microsoft just released a feature pack for Visual Studio 2008 a month ago that includes support for most of TR1. (They’ve left out support for mathematical special functions.) And Dinkumware sells a complete TR1 implementation.

I’ve added some notes to my website for getting started with C++ TR1 regular expressions. I took my PowerShell regex notes as a starting point and implemented some of the same examples in C++. I changed the organization though, because the C++ implementation is fairly different from PowerShell.

Working with regular expressions is harder in C++ than in scripting languages such as Perl or Python, but not unnecessarily so. C++ is optimized for fine-grained control and efficiency rather than ease of use; that’s what C++ is for. The TR1 implementation is internally consistent and elegant in its own way.

It’s easy to find API-level documentation but harder to find examples for getting started. (I’ve heard good things about Pete Becker’s book The C++ Standard Library Extensions but I haven’t read it.) So I decided to keep some notes as I played with the Visual Studio implementation. I imagine most of the content applies to other implementations, but I’ve only tested the examples using Visual Studio.

Update: GCC just added support for C++ TR1 two days ago with their version 4.3 release. However, it appears support for regular expressions is not included.

Random number generator controversy

I submitted an article to Code Project yesterday, Simple Random Number Generation, describing a small C# class called SimpleRNG that uses George Marsaglia’s MWC (multiply-with-carry) algorithm. The article was posted around 5 PM (central US time) and comments started pouring in right away. I didn’t expect any feedback on a Friday afternoon or Saturday morning. But as I write this post, there have been 580 page views and 11 comments.

There have been three basic questions raised in the comments.

  1. Why not just use the random number generator that comes with .NET?
  2. Is this code suitable for cryptography?
  3. Is this code suitable for Monte Carlo applications?

Why not use the built-in generator? For many applications, the simplest thing would be to use the .NET random number generator. But there are instances where this might not be best. There are questions about the statistical quality of the .NET generator; I’ll get to that in a minute. The primary advantages I see to the SimpleRNG class are transparency and portability.

By transparency I mean that the internal state of the generator is simple and easy to access. When you’re trying to reproduce a result, say while debugging, it’s convenient to have full access to the internal state of the random generator. If you’re using your own generator, you can see everything. You can even temporarily change it: for debugging, it may be convenient to temporarily have the “random” generator return a very regular, predictable sequence.
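As a rough illustration of that last point, here is a sketch in Python of swapping in a predictable generator while debugging. The names PredictableRNG and noisy_mean are hypothetical, invented for this example; the only assumption is that the code under test asks its generator for numbers through a random() method.

```python
import itertools
import random

class PredictableRNG:
    """Hypothetical stand-in that cycles through fixed values instead of random ones."""

    def __init__(self, values=(0.1, 0.5, 0.9)):
        self._cycle = itertools.cycle(values)

    def random(self):
        # Same method name as random.Random.random, so the two are interchangeable.
        return next(self._cycle)

def noisy_mean(rng, n):
    """Example consumer: average n draws from any object with a random() method."""
    return sum(rng.random() for _ in range(n)) / n

# While debugging, the output is fully determined by the fixed sequence.
print(noisy_mean(PredictableRNG(), 3))

# In normal use, pass a real generator instead.
print(noisy_mean(random.Random(42), 3))
```

Because the consumer only depends on the random() method, switching between the regular sequence and a real generator is a one-line change.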

By portability I do not necessarily mean moving the code between operating systems. The primary application I have in mind is moving the algorithm between languages. For example, in my work we often have prototype code written in R that needs to be rewritten in C++ for efficiency. If the code involves random number generation, the output of the prototype and the rewrite cannot be directly compared, only compared on average. Then you have to judge whether the differences are to be expected or whether they indicate a bug. But if both the R and the C++ code use the same RNG algorithm and the same seed, the results may be directly comparable. (They still may not be directly comparable due to other factors, but at least this way the results are often comparable.)
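Here is a minimal sketch of what such a portable generator might look like, written in Python. It is not the SimpleRNG source. The multipliers 36969 and 18000 come from a multiply-with-carry generator that Marsaglia posted, and I am assuming, not asserting, that SimpleRNG uses the same variant. The class name SimpleMWC and the default seeds are placeholders. The explicit 32-bit masks mimic the wraparound of unsigned integers in languages such as C++ and C#, which is what lets the same seeds give the same sequence in each language.

```python
MASK32 = 0xFFFFFFFF  # keep results in 32 bits, as unsigned arithmetic would

class SimpleMWC:
    """Sketch of a two-word multiply-with-carry generator (assumed variant)."""

    def __init__(self, seed_w=521288629, seed_z=362436069):
        # The entire internal state is these two integers.
        self.w = seed_w & MASK32
        self.z = seed_z & MASK32

    def next_uint(self):
        """Advance the state and return a 32-bit unsigned integer."""
        self.z = (36969 * (self.z & 0xFFFF) + (self.z >> 16)) & MASK32
        self.w = (18000 * (self.w & 0xFFFF) + (self.w >> 16)) & MASK32
        return ((self.z << 16) + self.w) & MASK32

    def next_uniform(self):
        """Return a float strictly between 0 and 1."""
        return (self.next_uint() + 1.0) / (2**32 + 2)

# Two generators with the same seeds produce identical output,
# which is what makes a prototype and its rewrite directly comparable.
a = SimpleMWC(seed_w=1, seed_z=2)
b = SimpleMWC(seed_w=1, seed_z=2)
print([a.next_uint() for _ in range(3)] == [b.next_uint() for _ in range(3)])
```

The state is just the pair (w, z), so saving and restoring it to replay part of a run takes two assignments.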

As for cryptography, no, SimpleRNG is not appropriate for cryptography.

As for Monte Carlo applications, not all Monte Carlo applications are created equal. Some applications do not require high quality random number generators. Or more accurately, different applications require different kinds of quality. Some random number generators break down when used for high-dimensional integration. I suspect SimpleRNG is appropriate for moderate dimensions. I use the Mersenne Twister generator for numerical integration. However, SimpleRNG is faster and much simpler; the MT generator has a very large internal state.
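For readers who have not seen Monte Carlo integration, here is a small Python sketch. Python's built-in random module uses the Mersenne Twister, so it stands in for the kind of generator mentioned above. The function name estimate_ball_volume, the seed, and the sample sizes are arbitrary choices for illustration. The sketch shows how the dimension enters the computation; it does not test the quality of any generator.

```python
import random

def estimate_ball_volume(dim, n_samples, rng):
    """Estimate the volume of the unit ball in `dim` dimensions.

    Sample points uniformly from the cube [-1, 1]^dim and count the
    fraction that land inside the ball; scale by the cube's volume.
    """
    hits = 0
    for _ in range(n_samples):
        squared_norm = sum((2.0 * rng.random() - 1.0) ** 2 for _ in range(dim))
        if squared_norm <= 1.0:
            hits += 1
    return (2.0 ** dim) * hits / n_samples

rng = random.Random(2008)  # seeded so the run is reproducible

# Exact values: pi in 2 dimensions, pi**3 / 6 in 6 dimensions.
print(estimate_ball_volume(2, 20000, rng))
print(estimate_ball_volume(6, 20000, rng))
```

Each sample consumes dim consecutive outputs of the generator, so correlations between successive outputs matter more as the dimension grows.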

Someone commented on the CodeProject article that the random number generator in .NET is not appropriate for Monte Carlo simulation because it does not pass Marsaglia’s DIEHARD tests while SimpleRNG does. I don’t know what algorithm the .NET generator uses, so I can’t comment on its quality. Before I’d use it in statistical applications, I’d want to find out.

Innovation III

In his book Diffusion of Innovations, Everett Rogers lists five factors that determine the rate of adoption of an innovation.

First is the relative advantage of the innovation. This is not limited to objective improvements but also includes factors such as social prestige.

The second is compatibility with existing systems and values.

Third is complexity, especially perceived complexity.

The fourth is trialability, how easily someone can try out the innovation without making a commitment.

The fifth is observability, whether the advantages of the innovation are visible.

Innovators are often criticized for compatibility, for not making a larger break from the past. After Bjarne Stroustrup invented the C++ programming language, many people said he should have sacrificed compatibility with C in order to make C++ a better language. However, had he done so, C++ would not have become popular enough to gain the critics’ attention. As Stroustrup said in an interview, “There are just two kinds of languages: the ones everybody complains about and the ones nobody uses.”