Posts Tagged ‘Programming’

Languages that are easy to pick back up

Tuesday, July 1st, 2008

Some programming languages are much easier to come back to than others. In my previous post I mentioned that Mathematica is easy to come back to, put Perl is not.

I found it easy to come back LaTeX after not using it for a while. It has a few quirks, but it’s basically consistent. The LaTeX commands for Greek letters are their names, lower case names for lower case letters, upper case names for upper case letters. The command mathematical symbols is usually the name a mathematician would give the symbol. Modes always begin with \begin and end with \end.

Python also has a consistent syntax that make it easier to come back to the language after a break. Someone has said that Python is similar to Perl, except that the word “except” does not appear nearly so often in the Python documentation.

It’s more important that a language be internally consistent than conventional. Each of the languages I mentioned have their peculiarities. Mathematica uses square brackets for function argument arguments. LaTeX uses percent signs for comments. Python uses indention to denote blocks. Each of these take a little getting used to, but each makes sense in its own context.

A special case of consistency is using full names for keywords. Mathematica always spells out words in full. For example, the gamma distribution object is named GammaDistribution. I don’t mind a little extra typing. I’d rather optimize for recall and readability than minimize keystrokes since I spend more time recalling and reading than typing. (One flaw in LaTeX is that it occasionally uses unnecessary abbreviations. For example, \infty for infinity. The corresponding Mathematica keyword is Infinity.)

Why computer scientists count from zero

Thursday, June 26th, 2008

In my previous post, cohort assignments in clinical trials, I mentioned in passing how you could calculate cohort numbers from accrual numbers if the world were simpler than it really is.

Suppose you want to treat patients in groups of 3. If you count patients and cohorts starting from 1, then patients 1, 2, and 3 are in cohort 1. Patients 4, 5, and 6 are in cohort 2. Patients 7, 8, and 9 are in cohort 3, etc. In general patient n is in cohort 1 + ⌊(n-1)/3⌋.

If you start counting patients and cohorts from 0, then patients 0, 1, and 2 are in cohort 0. Patients 3, 4, and 5 are in cohort 1. Patients 6, 7, and 8 are in cohort 2, etc. In general patient n is in cohort ⌊n/3⌋.

These kinds of calculations, common in computer science, are often simpler when you start counting from 0. If you want to divide things (patients, memory locations, etc.) into groups of size k, the nth item is in group ⌊n/k⌋. In C notation, integer division truncates to an integer and so the expression is even simpler: n/k.

Counting centuries is confusing because we count from 1. That’s why the 1900’s were the 20th century etc. If we called the century immediately following the birth of Christ the 0th century, then the 1900’s would be the 19th century.

Because computer scientists usually count from 0, most programming languages also count from zero. Fortran and Visual Basic are notable exceptions.

The vast majority of humanity finds counting from 0 unnatural and so there is a conflict between how software producers and consumers count. Demanding that average users learn to count from zero is absurd. So the programmer must either use one-based counting internally, and risk confusing his peers, or use zero-based counting internally, and risk forgetting to do a conversion for input or output. I prefer the latter. The worst option is to vascillate between the two approaches. 

Why functional programming hasn’t taken off

Sunday, June 22nd, 2008

Bjarne Stroustrup made a comment in an interview about functional programming. He said advocates of functional programming have been in charge of computer science departments for 30 years now, and yet functional programming has hardly been used outside academia. Maybe it’s because it’s not practical, at least in its purest form.

I’ve heard other people say that functional programming is the way to go, but most programmers aren’t smart enough to work that way and its too hard for the ones who are smart enough to go against the herd. But there are too many brilliant maverick programmers out there to make such a condescending explanation plausible. Stroustrup’s explanation makes more sense.

Let me quickly address some objections.

  • Yes, there have been very successful functional programming projects.
  • Yes, procedural programming languages are adding support for functional programming.
  • Yes, the rise of multi-core processors is driving the search for ways to make concurrent programming easier, and functional programming has a lot to offer.

I fully expect there will be more functional programming in the future, but it will be part of a multi-paradigm approach. On the continuum between pure imperative programming and pure functional programming, development will move toward the functional end, but not all the way. A multi-paradigm approach could be a jumbled mess, but it doesn’t have to be. One could clearly delineate which parts of a code base are purely functional (say, because they need to run concurrently) and which are not (say, for efficiency). The problem of how to mix functional and procedural programming styles well seems interesting and tractable.

[Stroustrup’s remark came from an OnSoftware podcast. I’ve listed to several of his podcasts with OnSoftware lately but I don’t remember which one contained his comment about functional programming.]

Software in Space

Tuesday, June 17th, 2008

The latest episode of Software Engineering Radio has an interview with Hans-Joachim Popp of the German aerospace company DLR. A bug in the software embedded in a space probe could cost years of lost time and billions of dollars. These folks have to write solid code.

The interview gives some details of the unusual practices DLR uses to produce such high quality code. For one, Popp said that his company writes an average of 12 lines of test code for every line of production code. They also pair junior and senior developers. The junior developer writes all the code, and the senior developer picks it apart.

Such extreme attention to quality doesn’t come cheap. Popp said that they produce about 0.6 lines of (production?) code per hour.

C++ templates may reduce memory footprint

Monday, June 16th, 2008

One of the complaints about C++ templates is that they can cause code bloat. But Scott Meyers pointed out in an interview that some people are using templates in embedded systems applications because templates result in smaller code.

C++ compilers only generate code for template methods that are actually used in an application, so it’s possible that code using templates may result in a smaller executable than code that a more traditional object oriented approach.

Programming language subsets

Wednesday, June 11th, 2008

I just found out that Douglas Crockford has written a book JavaScript: The Good Parts. I haven’t read the book, but I imagine it’s quite good based on having seen the author’s JavaScript videos.

Crockford says JavaScript is an elegant and powerful language at its core, but it suffers from numerous egregious flaws that have been impossible to correct due to its rapid adoption and standardization.

I like the idea of carving out a subset of a language, the good parts, but several difficulties come to mind.

  1. Although you may limit yourself to a certain language subset, your colleagues may choose a different subset. This is particularly a problem with an enormous language such as Perl. Coworkers may carve out nearly disjoint subsets for their own use.
  2. Features outside your intended subset may be just a typo away. You have to have at least some familiarity with the whole language because every feature is a feature you might accidentally use.
  3. The parts of the language you don’t want to use still take up space in your reference material and make it harder to find what you’re looking for.

One of the design principles of C++ is “you only pay for what you use.” I believe the primary intention was that you shouldn’t pay a performance penalty for language features you don’t use, and C++ delivers on that promise. But there’s a mental price to pay for language features you don’t use. As I’d commented about Perl before, you have to use the language several hours a week just to keep it loaded in your memory.

There’s an old saying that when you marry a girl you marry her family. A similar principle applies to programming languages. You may have a subset you love, but you’re going to have to live with the rest of the language.

Experienced programmers and lines of code

Tuesday, June 3rd, 2008

I heard of a study recently that concluded inexperienced and experienced programmers write about the same number of lines of code per day. The difference is that experienced programmers keep more of those lines of code, making steady progress toward a goal. Less experienced programmers write large chunks of code only to rip them out and rewrite the same chunk many times until the code appears to work. Or instead of ripping out the code, they debug for days on end, changing one or two lines at a time, almost at random, until the code appears to work.

As Greg Wilson pointed out in his interview, focusing on quality in software development often results in increased productivity as well. More effort goes into forward progress and less goes into re-work.

Not only do experienced programmers produce more lines of code worth keeping each day, they also accomplish more per line of code, sometimes dramatically more. But that’s not news. It’s well known that the best programmers aren’t just a little more productive than average, they’re one or two orders of magintude more productive. (See, for example, Joel Spolsky’s book Smart and Gets Things Done.) More interesting is that the best programmers don’t seem to have a much larger capacity for producing and understanding lines of code.

There have also been studies that show programmers produce about the same number of lines of code per day independent of the language they use. You might think that someone working in assembly language could produce more lines of per day than someone writing in a higher level language such as VB or Java, but that’s not the case. It seems that while counting lines of code is a terrible way to measure productivity, it is a good way to measure what you can expect someone to be able to hold in their head.

Transactional memory

Wednesday, May 21st, 2008

The IT Conversations podcast carried a by Simon Peyton-Jones entitled Transactional Memory for Concurrent Programming.

The best programmers find concurrency hard and the rest find it impossible. And yet concurrency can’t be avoided if we’re going to take advantage of multiple core machines. Researchers are working to make concurrent programming easier, but most approaches are minor variations on the basic approach used for the last 30 years. Transactional memory sounds like a promising departure from the traditional lock-based model.

Reusable code vs. re-editable code

Saturday, May 3rd, 2008

In a recent interview, Donald Knuth made this comment about reusable code.I also must confess to a strong bias against the fashion for reusable code.

I also must confess to a strong bias against the fashion for reusable code. To me, “re-editable code” is much, much better than an untouchable black box or toolkit. I could go on and on about this. If you’re totally convinced that reusable code is wonderful, I probably won’t be able to sway you anyway, but you’ll never convince me that reusable code isn’t mostly a menace.

Knuth didn’t elaborate on what he means by “re-editable” code, but I assume he means code that is easy to maintain. The best chance most code has at reuse is remaining useful in its original project over multiple versions, so maybe we’d get more reuse if we focused more on maintainability.

I think whether code should be editable or in “an untouchable black box” depends on the number of developers involved, as well as their talent and motivation. Knuth is a highly motivated genius working in isolation. Most software is developed by large teams of programmers with varying degrees of motivation and talent. I think the further you move away from Knuth along these three axes the more important black boxes become.

How to calculate percentiles in memory-bound applications

Monday, April 28th, 2008

I just published a new article on CodeProject: Calculating percentiles in memory-bound applications.

Finding the percentiles of a list of samples is trivial if you can read all the samples into memory and sort the list. If the list is too big to fit into memory, it’s still not terribly difficult, but there are a couple subtleties if you haven’t thought about it before.

The application that motivated the code presented in the article generates a large amount of data. You could think of the software as generating an enormous matrix one row at a time, then computing percentiles of each column. For example, in one problem the matrix had 800,000 columns and 2,000 rows of floating point numbers.

Fibonacci numbers at work

Wednesday, April 23rd, 2008

Today I needed to use Fibonacci numbers to solve a problem at work. Fibonacci numbers are great fun, but I don’t recall needing them in an applied problem before.

I needed to compute a series of integrals of the form  

f(x, y) = xa (1-x)b yc (1-y)d p(x, y)

over the unit square for a statistical application. The function p(x, y) is a little complicated but its specific form is not important here. If the constants a, b, c, and d are all positive, as they usually are in my application, the integrand can be extended to a continuous periodic function in the plane. Lattice rules are efficient for such integration problems, and the optimal lattice rules for two-variable integration are given by Fibonacci lattice rules.

If Fk is the kth Fibonacci number and {x} is the remainder when subtracting off the greatest integer less than x, the kth Fibonacci lattice rule approximates the integral of f by

Fibonacci lattice integration rule

This rule worked well. I compared the results to those using the two-variable trapezoid rule and the lattice rule always gave more accuracy for the same number of integration points. (Normally the trapezoid rule typically not very efficient. It’s more efficient for periodic integrands, but the lattice rule was more efficient still.)

To implement this integration rule, I first had to write a function to compute Fibonacci numbers. This is such a common academic exercise that it is almost a cliché. I was surprised to have a practical need for such a function. My implementation was

    int Fibonacci(int k)
    {
        double phi = 0.5*(1.0 + sqrt(5.0));
        return int( pow(phi, k)/sqrt(5.0) + 0.5 );
    }

The reason this works is that

equation for Fibonacci numbers

where

definition of phi, golden ratio

is the golden ratio. The second term in the formula for Fk is smaller than 1, and Fk is an integer, so by rounding the first term to the nearest integer we get the exact result. Adding 0.5 before casting the result to an integer causes the result to be rounded to the nearest integer.

If you’re interested in technical details of the numerical integration method, see Lattice Methods for Multiple Integration by I. H. Sloan and S. Joe.

Database-first or object model-first software development

Friday, April 18th, 2008

The most recent edition of the .NET Rocks podcast interviewed Jeremy Miller and David Laribee. They made the observation that Microsoft’s development tools are implicitly designed to support database-first development rather than object model-first development. Because they are committed to agile and object model-first development, they’re feeling resistance from some of Microsoft’s products.

Miller and Laribee are part of an interesting group of developers who call themselves “ALT.NET” because they’re developing on Microsoft’s .NET framework but are looking for alternative tools. They embrace the bottom of the Microsoft technology stack — operating systems, languages, compilers — but not the top — tools for modelling, code generation, testing, etc.

Overflow and loss of precision

Wednesday, April 16th, 2008

Suppose you need to evaluate the function f(x) = log(1 + ex). The most obvious code to write something like this in C:

    double f(double x) { return log(1 + exp(x)); }

This is so simple you don’t even test it. Then someone calls you to say they’re getting strange output such as 1.#INF depending on your environment. What’s going on?

For large values of x, ex is much bigger than 1, so f(x) is approximately log(ex) which equals x. So, for example, f(1000) should return something very close to 1000. But instead, you get gibberish. In most environments a floating point number can be as large as about 10308, but exp(1000) is about 2×10434 and so the result overflows. Then the code takes the log of “infinity.”

But f(1000) shouldn’t overflow; the result is about 1000. Our function result can be contained in a floating point number, but our intermediate result exp(1000) cannot be. We need some way of telling the computer “Hold on. I know this is going to be a huge number, but trust me for a minute. After I add 1 I’m going to take a log and bring the result back down to a medium sized number.” Unfortunately computers don’t work that way.

One solution would be to see whether x is so large that exp(x) will overflow, and in that case f(x) could just return x. So our second attempt might look like this:

    double f(double x) { return (x > X_MAX) ? x : log(1 + exp(x)); }

I’ll come back later to what X_MAX should be. Suppose you’ve found a good value for this constant and release your code again. Then you get another call. They say your code is returning zero when it should not.

Someone is trying to compute f(-50). They know the answer is small, but it shouldn’t be zero. How can that be? Since you just learned about overflow, you suspect the problem might have to do with the opposite problem, underflow. But that’s not quite it. The result of exp(-50) does not underflow; it’s about 2×10-22. But it is so small that machine precision cannot distinguish 1+exp(-50) from 1. So the code returns log(1), which equals zero. This may or may not be a problem. The correct answer is very small, so the absolute error is small but the relative error is 100%. Whether that is a problem depends on what you’re going to do with the result. If you’re going to add it to a moderate size number, no big deal. If you’re going to divide by it, it’s a very big deal.

Now what do you do? You think back to your calculus class and remember that for small x, log(1 + x) is approximately x with error on the order of x2. (To see this, expand log(1 + x) in a power series centered at 1.) If x is so small that 1 + x equals 1 inside the computer, x2 must be very small indeed. So f(x) could return exp(x) if exp(x) is sufficiently small. This gives our third attempt.

double f(double x)
{ 
    if (x > X_MAX) return x; 
    if (x < X_MIN) return exp(x);
    return log(1.0 + exp(x));
}

This is a big improvement. It can still underflow for large negative x, but in that case the function itself is underflowing, not just an intermediate result.

Now what do we make X_MAX and X_MIN? For X_MAX we don’t really have to worry about exp(x) overflowing. We can stop when x is so large that exp(x) + 1 equals exp(x) to machine precision. In C there is a header file float.h that contains a constant DBL_EPSILON which is the smallest number ε such that 1 + ε does not equal 1 in machine precision. So it turns out we can set X_MAX equal to -log(DBL_EPSILON). Similarly we could set X_MIN equal to log(DBL_EPSILON).

There’s still a small problem. When x is large and negative, but not so negative that 1 + exp(x) equals 1 in the computer, we could lose precision in computing log(1 + exp(x)).

I have just posted an article on CodeProject entitled Avoiding Overflow, Underflow, and Loss of Precision that has more examples where the most direct method for evaluating a function may not work. Example 2 in that paper explains why directly computing log(1 + y) can be a problem even if y is not so small that 1 + y equals 1 to machine precision. The article explains how to compute log(1 + y) in that case. Setting y = exp(x) explains how to finish the example here when x is moderately large and negative.

If you’re going to do XHTML, you’d better do it right

Monday, April 14th, 2008

XHTML is essentially a stricter form of HTML, but not quite. For the most part, you can satisfy the requirements of both standards at the same time. However, when it comes to closing tags, the two standards are incompatible. For example, the line break tag in HTML is <br> but in XHTML is <br/>. Most browsers will tolerate the unnecessary backslash before the closing tag in HTML, especially if you put a space before it. But it’s not strictly correct.

So is this just a pedantic point of markup language grammar? Chris Maunder says an error with closing tags caused Google to stop indexing his web site. He had XHTML-style end tags but had set his DOCTYPE to HTML.

I’ve also heard of browsers refusing to render a page at all because it had DOCTYPE set to XHTML but contained an HTML entity not supported in XHTML. I believe the person reporting this said that he had run the XHTML page through a validator that failed to find the error. Unfortunately I’ve forgotten where I saw this. Does anyone know about this?

Text reviews for software

Friday, April 11th, 2008

When users find spelling and grammar errors in your software, your credibility takes a hit. But apparently very few software projects review the text their software displays. I imagine the ones that do review their text use a combination of two leaky methods: asking execution testers to take note of prose errors, and requiring that all text displayed to users be stored in a string table.

There are a couple problems with asking execution testers to be copy editors. First, they’re not copy editors. They may not recognize a grammatical error when they see it. Second, they only see the text that their path through the software exposes. Messages displayed to the user under unusual circumstances slip through testing.

String tables are a good idea. They can be reviewed by a professional editor. (Or translator, if you’re application is internationalized.) But it’s difficult to make sure that every string the user might see is in the string table. When you need to add a few quick lines of error-handling code, it’s so easy to just include the text right there in the code rather than adding an entry to the string table. After all, you say to yourself, the code’s probably not going to run anyway.

My solution was to write a script that extracts all the quoted text from a source tree so it can be reviewed separately. The script tries to only pick out strings that a user could see, filtering out, for example, code quoted inside code. Doing this perfectly would be very hard, but by tolerating a small error rate, the problem can be solved quickly in a few lines of code. I’ve used this script for years. Nearly every time I run it I discover potentially embarrassing errors.

In addition to helping with copy editing, an extract of all the string literals in a project gives an interesting perspective on the source code. For example, it could help uncover security risks such as SQL injection vulnerabilities.

I’ve posted an article on CodeProject along with the script I wrote.

PowerShell Script for Reviewing Text Shown to Users

The script on CodeProject is written for Microsoft’s PowerShell. If anyone would like a Perl version of the script, just let me know. I first wrote the script in Perl, but then moved it to PowerShell as my team was moving to PowerShell for all administrative scripting.