Posts Tagged ‘Programming’

How to calculate percentiles in memory-bound applications

Monday, April 28th, 2008

I just published a new article on CodeProject: Calculating percentiles in memory-bound applications.

Finding the percentiles of a list of samples is trivial if you can read all the samples into memory and sort the list. If the list is too big to fit into memory, it’s still not terribly difficult, but there are a couple subtleties if you haven’t thought about it before.

The application that motivated the code presented in the article generates a large amount of data. You could think of the software as generating an enormous matrix one row at a time, then computing percentiles of each column. For example, in one problem the matrix had 800,000 columns and 2,000 rows of floating point numbers.

Fibonacci numbers at work

Wednesday, April 23rd, 2008

Today I needed to use Fibonacci numbers to solve a problem at work. Fibonacci numbers are great fun, but I don’t recall needing them in an applied problem before.

I needed to compute a series of integrals of the form  

f(x, y) = xa (1-x)b yc (1-y)d p(x, y)

over the unit square for a statistical application. The function p(x, y) is a little complicated but its specific form is not important here. If the constants a, b, c, and d are all positive, as they usually are in my application, the integrand can be extended to a continuous periodic function in the plane. Lattice rules are efficient for such integration problems, and the optimal lattice rules for two-variable integration are given by Fibonacci lattice rules.

If Fk is the kth Fibonacci number and {x} is the remainder when subtracting off the greatest integer less than x, the kth Fibonacci lattice rule approximates the integral of f by

Fibonacci lattice integration rule

This rule worked well. I compared the results to those using the two-variable trapezoid rule and the lattice rule always gave more accuracy for the same number of integration points. (Normally the trapezoid rule typically not very efficient. It’s more efficient for periodic integrands, but the lattice rule was more efficient still.)

To implement this integration rule, I first had to write a function to compute Fibonacci numbers. This is such a common academic exercise that it is almost a cliché. I was surprised to have a practical need for such a function. My implementation was

    int Fibonacci(int k)
    {
        double phi = 0.5*(1.0 + sqrt(5.0));
        return int( pow(phi, k)/sqrt(5.0) + 0.5 );
    }

The reason this works is that

equation for Fibonacci numbers

where

definition of phi, golden ratio

is the golden ratio. The second term in the formula for Fk is smaller than 1, and Fk is an integer, so by rounding the first term to the nearest integer we get the exact result. Adding 0.5 before casting the result to an integer causes the result to be rounded to the nearest integer.

If you’re interested in technical details of the numerical integration method, see Lattice Methods for Multiple Integration by I. H. Sloan and S. Joe.

Database-first or object model-first software development

Friday, April 18th, 2008

The most recent edition of the .NET Rocks podcast interviewed Jeremy Miller and David Laribee. They made the observation that Microsoft’s development tools are implicitly designed to support database-first development rather than object model-first development. Because they are committed to agile and object model-first development, they’re feeling resistance from some of Microsoft’s products.

Miller and Laribee are part of an interesting group of developers who call themselves “ALT.NET” because they’re developing on Microsoft’s .NET framework but are looking for alternative tools. They embrace the bottom of the Microsoft technology stack — operating systems, languages, compilers — but not the top — tools for modelling, code generation, testing, etc.

Overflow and loss of precision

Wednesday, April 16th, 2008

Suppose you need to evaluate the function f(x) = log(1 + ex). The most obvious code to write something like this in C:

    double f(double x) { return log(1 + exp(x)); }

This is so simple you don’t even test it. Then someone calls you to say they’re getting strange output such as 1.#INF depending on your environment. What’s going on?

For large values of x, ex is much bigger than 1, so f(x) is approximately log(ex) which equals x. So, for example, f(1000) should return something very close to 1000. But instead, you get gibberish. In most environments a floating point number can be as large as about 10308, but exp(1000) is about 2×10434 and so the result overflows. Then the code takes the log of “infinity.”

But f(1000) shouldn’t overflow; the result is about 1000. Our function result can be contained in a floating point number, but our intermediate result exp(1000) cannot be. We need some way of telling the computer “Hold on. I know this is going to be a huge number, but trust me for a minute. After I add 1 I’m going to take a log and bring the result back down to a medium sized number.” Unfortunately computers don’t work that way.

One solution would be to see whether x is so large that exp(x) will overflow, and in that case f(x) could just return x. So our second attempt might look like this:

    double f(double x) { return (x > X_MAX) ? x : log(1 + exp(x)); }

I’ll come back later to what X_MAX should be. Suppose you’ve found a good value for this constant and release your code again. Then you get another call. They say your code is returning zero when it should not.

Someone is trying to compute f(-50). They know the answer is small, but it shouldn’t be zero. How can that be? Since you just learned about overflow, you suspect the problem might have to do with the opposite problem, underflow. But that’s not quite it. The result of exp(-50) does not underflow; it’s about 2×10-22. But it is so small that machine precision cannot distinguish 1+exp(-50) from 1. So the code returns log(1), which equals zero. This may or may not be a problem. The correct answer is very small, so the absolute error is small but the relative error is 100%. Whether that is a problem depends on what you’re going to do with the result. If you’re going to add it to a moderate size number, no big deal. If you’re going to divide by it, it’s a very big deal.

Now what do you do? You think back to your calculus class and remember that for small x, log(1 + x) is approximately x with error on the order of x2. (To see this, expand log(1 + x) in a power series centered at 1.) If x is so small that 1 + x equals 1 inside the computer, x2 must be very small indeed. So f(x) could return exp(x) if exp(x) is sufficiently small. This gives our third attempt.

double f(double x)
{ 
    if (x > X_MAX) return x; 
    if (x < X_MIN) return exp(x);
    return log(1.0 + exp(x));
}

This is a big improvement. It can still underflow for large negative x, but in that case the function itself is underflowing, not just an intermediate result.

Now what do we make X_MAX and X_MIN? For X_MAX we don’t really have to worry about exp(x) overflowing. We can stop when x is so large that exp(x) + 1 equals exp(x) to machine precision. In C there is a header file float.h that contains a constant DBL_EPSILON which is the smallest number ε such that 1 + ε does not equal 1 in machine precision. So it turns out we can set X_MAX equal to -log(DBL_EPSILON). Similarly we could set X_MIN equal to log(DBL_EPSILON).

There’s still a small problem. When x is large and negative, but not so negative that 1 + exp(x) equals 1 in the computer, we could lose precision in computing log(1 + exp(x)).

I have just posted an article on CodeProject entitled Avoiding Overflow, Underflow, and Loss of Precision that has more examples where the most direct method for evaluating a function may not work. Example 2 in that paper explains why directly computing log(1 + y) can be a problem even if y is not so small that 1 + y equals 1 to machine precision. The article explains how to compute log(1 + y) in that case. Setting y = exp(x) explains how to finish the example here when x is moderately large and negative.

If you’re going to do XHTML, you’d better do it right

Monday, April 14th, 2008

XHTML is essentially a stricter form of HTML, but not quite. For the most part, you can satisfy the requirements of both standards at the same time. However, when it comes to closing tags, the two standards are incompatible. For example, the line break tag in HTML is <br> but in XHTML is <br/>. Most browsers will tolerate the unnecessary backslash before the closing tag in HTML, especially if you put a space before it. But it’s not strictly correct.

So is this just a pedantic point of markup language grammar? Chris Maunder says an error with closing tags caused Google to stop indexing his web site. He had XHTML-style end tags but had set his DOCTYPE to HTML.

I’ve also heard of browsers refusing to render a page at all because it had DOCTYPE set to XHTML but contained an HTML entity not supported in XHTML. I believe the person reporting this said that he had run the XHTML page through a validator that failed to find the error. Unfortunately I’ve forgotten where I saw this. Does anyone know about this?

Text reviews for software

Friday, April 11th, 2008

When users find spelling and grammar errors in your software, your credibility takes a hit. But apparently very few software projects review the text their software displays. I imagine the ones that do review their text use a combination of two leaky methods: asking execution testers to take note of prose errors, and requiring that all text displayed to users be stored in a string table.

There are a couple problems with asking execution testers to be copy editors. First, they’re not copy editors. They may not recognize a grammatical error when they see it. Second, they only see the text that their path through the software exposes. Messages displayed to the user under unusual circumstances slip through testing.

String tables are a good idea. They can be reviewed by a professional editor. (Or translator, if you’re application is internationalized.) But it’s difficult to make sure that every string the user might see is in the string table. When you need to add a few quick lines of error-handling code, it’s so easy to just include the text right there in the code rather than adding an entry to the string table. After all, you say to yourself, the code’s probably not going to run anyway.

My solution was to write a script that extracts all the quoted text from a source tree so it can be reviewed separately. The script tries to only pick out strings that a user could see, filtering out, for example, code quoted inside code. Doing this perfectly would be very hard, but by tolerating a small error rate, the problem can be solved quickly in a few lines of code. I’ve used this script for years. Nearly every time I run it I discover potentially embarrassing errors.

In addition to helping with copy editing, an extract of all the string literals in a project gives an interesting perspective on the source code. For example, it could help uncover security risks such as SQL injection vulnerabilities.

I’ve posted an article on CodeProject along with the script I wrote.

PowerShell Script for Reviewing Text Shown to Users

The script on CodeProject is written for Microsoft’s PowerShell. If anyone would like a Perl version of the script, just let me know. I first wrote the script in Perl, but then moved it to PowerShell as my team was moving to PowerShell for all administrative scripting.

New spin on the cathedral and the bazaar

Wednesday, April 2nd, 2008

Eric Raymond’s famous essay The Cathedral and the Bazaar compares commercial software projects to cathedrals and open source software projects to bazaars. Cathedrals are carefully planned. Bazaars are not. The organizational structure a bazaars emerges without deliberate coordination of its participants. The open source community has embraced the metaphor of the bazaar and the informality and spontaneity it implies.

Shmork wrote the following observation in the comments to a Coding Horror post yesterday that discussed the difficulties of using Linux software.

Almost nobody in the Western world shops at real-life bazaars either, because they are dodgy, unsafe, and unregulated. And in the Western world, we like things to be reliable, working, safe. So cathedral it is. Even our flea markets aren’t bazaars, really, they’re just knock-off cathedrals.

Closures and lambdas in C++

Wednesday, April 2nd, 2008

The ISO C++ Standard committee voted to closures and lambda (anonymous) functions to the language according to Herb Sutter. Anonymous functions would make templates much easier to use, but it seems odd to add these features to C++.

Feasibility studies

Thursday, March 27th, 2008

Jeff Atwood gives a summary of Facts and Fallacies of Software Engineering by Robert Glass on his blog. I was struck by point #14:

The answer to a feasibility study is almost always “yes”.

I hadn’t thought about that before, but it certainly rings true. I can’t think of an exception.

Some say about half of all large software projects fail, and presumably many of these failures passed a feasibility study. Why can’t we predict whether a project stands a good chance of succeeding? Are committees sincerely overly optimistic, or do they recognize doomed projects but tell the sponsor what the sponsor wants to hear?

Interview with Ward Cunniningham

Saturday, March 22nd, 2008

A professor told me one time that you’re lucky if you have one good idea in your career. He said he was famous because he’d had two or three good ideas.

Ward Cunningham has had at least two good ideas. He created the first Wiki and developed Extreme Programming. The FLOSS Weekly podcast has an interview with Cunningham. The interview shows how closely related Wikis and Extreme Programming are, how both flow naturally from Cunningham’s way of thinking.

Simple unit tests

Thursday, March 20th, 2008

After you’ve read a few books or articles on unit testing, the advice becomes repetitive. But today I heard someone who had a few new things to say. Gerard Meszaros made these points in an interview on the OOPSLA 2007 podcast, Episode 11.

Test code should be much simpler than production code for three reasons.

  1. Unit tests should not contain branching logic. Each test should test one path through the production code. If a unit test has branching logic, it’s doing too much, attempting to test more than one path.
  2. Unit tests are the safety net for changes to production code, but there is no such safety net for the tests themselves. Therefore tests should be written simply the first time rather simplified later through refactoring.
  3. Unit tests are not subject to the same constraints as production code. They can be slow, and they only have to work in isolation. Brute force is more acceptable in tests than in production code. 

(Meszaros made points 1 and 2 directly. Point 3 is my interpolation.)

A well-tested project will have at least as much test code as production code. The immediate conclusion too many people draw is that therefore unit testing doubles the cost of a project.  One reason this is not true is that test code is easier to write than production code for the reasons listed above. Or rather, test code can be easier to write, if the project uses test-driven development. Retrofitting tests to code that wasn’t designed to be testable is hard work indeed.

Conceptual integrity

Tuesday, March 18th, 2008

How do you maintain conceptual integrity when multiple people contribute to a project?

Fred Brooks, author of the software engineering classic The Mythical Man-Month, gave a talk at at OOPSLA 2007 entitled Collaboration and Telecollaboration in Design (audio here). In his talk, Brooks discusses the importance of conceptual integrity. Great products have conceptual integrity and are nearly always the fruit of one or at most two minds. Products reflect their creators, and products designed by committees have multiple personalities. How do you maintain conceptual integrity when scale and complexity demand the participation of many people? Listen to Fred Brooks for some ideas.

Going overboard with nouns

Saturday, March 15th, 2008

I ran across an article by Steve Yegge called Execution in the Kindom of Nouns that explains what can happen when object oriented programming goes bad.

In programming, objects are like nouns and functions are like verbs. You can go off into the weeds by placing too much emphasis on nouns, and Yegge makes a good case that Java has done just that.

Alphabetical order is wrong

Wednesday, March 12th, 2008

Seth Godin posted an article today entitled Alphabetical order is obsolete. He makes a good argument that alphabetically sorted lists are often not the best user interface. I particularly liked his idea that an email client should sort junk mail according to the probability that each message is spam.

Code to make an XML sitemap

Monday, February 25th, 2008

Here’s some Python code to create a sitemap in the format specified by sitemaps.org and read by search engines. Download the file sitemapmaker.txt and change the extension from .txt to .py.

Change the url variable in the script before running it or else you’ll point search engines to my web site rather than yours. Also, edit the file extensions_to_keep variable if you want to index any file types besides HTML and PDF.

Copy the file sitemapmaker.py to the directory on your computer where you have your files. Run the script and direct its output to a file, sitemapmaker.py > sitemap.xml. See sitemaps.org for instructions on how to let search engines know about your sitemap.

This code assumes all the files to index in your sitemap are in one directory, the directory you run the script from. It also assumes the timestamps on your computer match those on your web server. Optional fields are left out of the sitemap.