Posts tagged as:

Programming

Bug in SciPy’s erf function

by John on September 2, 2010

Last night I produced the plot below and was very surprised at the jagged spike. I knew the curve should be smooth and strictly increasing.

My first thought was that there must be a numerical accuracy problem in my code, but it turns out there’s a bug in SciPy version 0.8.0b1. I started to report it, but I saw there were similar bug reports and one such report was marked as closed, so presumably the fix will appear in the next release.

The problem is that SciPy’s erf function is inaccurate for arguments with imaginary part near 5.8. For example, Mathematica computes erf(1.0 + 5.7i) as -4.5717×1012 + 1.04767×1012 i. SciPy computes the same value as -4.4370×1012 + 1.3652×1012 i. The imaginary component is off by about 30%.

Here is the code that produced the plot.

from scipy.special import erf
from numpy import linspace, exp
import matplotlib.pyplot as plt

def g(y):
    z = (1 + 1j*y) /  sqrt(2)
    temp = exp(z*z)*(1 - erf(z))
    u, v = temp.real, temp.imag
    return -v / u

x = linspace(0, 10, 101)
plt.plot(x, g(x))

{ 3 comments }

Is helpful software really helpful?

by John on August 24, 2010

In his new book The Shallows, Nicholas Carr relates an experiment by Christof van Nimwegen on computer-human interaction. Users were asked to solve a puzzle using software. Some users were given software designed to be very helpful, highlighting permissible moves etc. Other users were given bare-bones software.

In the early stages of solving the puzzle, the group using the helpful software made correct moves more quickly than the other group, as would be expected. But as the test proceeded, the proficiency of the group using the bare-bones software increased more rapidly. In the end, those using the unhelpful program were able to solve the puzzle more quickly and with fewer wrong moves.

I immediately thought of the debate over fancy software development tools versus simple tools. Then I read the conclusion:

… those using the unhelpful software were better able to plan ahead and plot strategy, while those using the helpful software tending to rely on simple trial and error. Often, in fact, those with the helpful software were found “to aimlessly click around” as they tried to crack the puzzle.

That really sounds like software development.

Christof van Nimwegen did variations on his experiment and got similar results. For example, he had two groups schedule a complicated series of meetings. One group had plain calendar software and one had software designed to help people schedule complicated meetings. The folks with the simple software won.

The debate over whether to use fancy software development tools (e.g. integrated development environments, wizards, etc.) or simple tools (editors and make files) is a Ford-Chevy argument that won’t go away. I could imagine many valid objections to the applicability of the van Nimwegen studies to the software tools debate, but I’d say they score a point for the simple tools side.

A rebuttal to the van Nimwegan studies is that he has only shown that particular helpful software wasn’t particularly helpful. Maybe the specific puzzle-solving software didn’t help in the long run, but someone could have written software that was ultimately more helpful than the bare-bones software. Maybe someone could have written scheduling software that allows people to schedule tasks faster than using simple calendar software.

A rebuttal to the rebuttal is that someone might indeed write software that allows users get the job done more quickly than they would using simpler software. It may even be inevitable that someone will write such software eventually. However, most attempts fail. It’s hard to write genuinely helpful software. Attempts to help a user too much may interfere with the user’s ability to form a good mental model of the problem.

Related post:

Would you rather have a chauffeur or a Ferrari?

{ 13 comments }

Timing software

by John on August 5, 2010

When measuring how long it takes to execute a program, people often report the best time out of several runs with the same input. That seems odd at first. Why report the best time? Why not report the average time?

Operating systems have more to do than just run your program. The time that elapses between the start and finish of your program may vary because of other demands on the operating system’s resources. By taking the best time, you (presumably) get an idea how much time it takes to run the program itself with minimal overhead from other operating system activity.

The best-time approach is appropriate when comparing the efficiency of two programs: compare the best time for Program A to the best time of Program B. But if you want to estimate how long you can expect wait for Program A itself to run, not comparing the efficiency of Program A to anything else, then averaging the run times is reasonable.

(Everything above assumes constant input. This post is not about analyzing execution times with varying inputs. For example, you might use best-case timing to compare two sorting algorithms by sorting the exact same list several times. The only thing changing from run to run is the state of the operating system, not the input. Of course you might also want to measure the average run time with multiple randomly generated data sets, but that’s a different matter.)

{ 6 comments }

What does this code do?

by John on July 21, 2010

At the SciPy 2010 conference, a speaker showed several short code samples and asked us what each sample did. The samples were clearly written, but we had no comments to provide context. This was the last sample.

def what( x, n ):
    if n < 0:
        n = -n
        x = 1.0 / x
    z = 1.0
    while n > 0:
        if n % 2 == 1:
            z *= x
        x *= x
        n /= 2
    return z

The quiz was at the end of the day and I was tired. I couldn’t tell what the code does. Then I found out to my chagrin that the sample above implements an algorithm I know well. I’ve written the same code and I’ve even blogged about here.

This exercise changed my opinion of “self-documenting” code. Without some contextual clue, it is hard to understand the purpose of even a small piece of code.

Meaningful variable and function names would have helped, but a tiny comment might have helped even more. Not some redundant comment like explaining that the line x = 1.0 / x takes a reciprocal, but a comment explaining the problem the code is trying to solve.

For another example, what do you think this code does?

uint what()
{
    m_z = 36969 * (m_z & 65535) + (m_z >> 16);
    m_w = 18000 * (m_w & 65535) + (m_w >> 16);
    return (m_z << 16) + (m_w & 65535);
}

It’s clear enough what the code does at a low level — it’s just a few operations — but it’s not at all clear what it’s for.

Try to figure out what the code samples do before reading further. But if you give up, the first example is described here and the second example comes from here.

In an ordinary face-to-face conversation, more information is conveyed non-verbally than verbally. We may think that our literal words are most important, but so much is conveyed by voice inflection, facial expression, posture, etc. Something similar is going on with source code. When we read a piece of source code, we typically bring a huge amount of implicit knowledge with us.

Suppose a coworker Sam asks you to look at his code. The fact that the question came up at work provides a large amount of context; this isn’t just a random code fragment on the web. More specifically, you know what kinds of projects Sam works on. You know why Sam wants you to look at the code. He may be showing you something he’s proud of or he may be asking for help finding a bug. You know a lot about his code before you see it.

Now suppose you’re a contractor. Sam was hit by a bus and you’ve been asked to work on his projects until he gets out of the hospital. You may complain to his office mate that Sam’s code is an awful mess, but she can’t understand what you’re talking about. She thinks his code is perfectly clear.

Now suppose you’re a contractor on the opposite side of the world from Sam. You have even less context than if you were in his office talking to his office mate. After a great deal of agony, you send your contribution back to Sam’s company. You comment your code beautifully, but Sam’s colleagues complain that your code is poorly written and that you didn’t solve the right problem.

Institutional memory is more valuable than source code comments. It costs a great deal to replace a programmer, even one who leaves behind well-commented code.

Related posts:

Do you really want to be indispensable?
Preserving (the memory of) documents
The buck stops with the programmer

{ 27 comments }

I heard the terms “covariance” and “contravariance” used in math long before I heard them used in object oriented programming.  I was curious whether there was any connection between the two. To my surprise, they’re very similar. In fact, you could formalize the OOP use of the terms so that they’re not just analogous but actually  special cases of the mathematical terms.

When I started writing this post, I intended to explain covariance and contravariance. However, the post became longer and more technical than what I like to write here. Instead, I just announce that a connection exists and give references for those who want to read further.

Chris Burrows describes covariance and contravariance in object oriented programming in his article New C# Features in the .NET Framework 4.

Wikipedia has a short but readable introduction to category theory including covariant and contravariant functors. See also A Categorical Manifesto (PostScript file). The terms covariant and contravariant were defined in category theory before computer scientists applied the terms to object oriented programming.

Computer scientists have been interested in category theory for some time, so it’s not too surprising that category theory terms would filter down into practical programming. The real surprise was hearing category terminology used outside of math. It was like the feeling you get when you run into a coworker at a family reunion or a neighbor at a restaurant in another city.

Related post:

My mathematical opposite

{ 3 comments }

Porting Python to C#

by John on June 18, 2010

When people start programming in Python, they often mention having to type less: no braces, no semicolons, fewer type declarations etc.

The difference may be more obvious when you go in the other direction, moving from Python to another language. This morning I ported some Python code to C# and was a little surprised how much extra code I had to add. When I’ve ported C# to Python I wasn’t as aware of the language differences. I guess it is easier to go down a notch in ceremony than to go up a notch.

Related post:

Plain Python

{ 20 comments }

Why computers have two zeros: +0 and -0

by John on June 15, 2010

Here’s a strange detail of floating point arithmetic: computers have two versions of 0: positive zero and negative zero. Most of the time the distinction between +0 and -0 doesn’t matter, once in a while signed versions of zero come in handy.

If a positive quantity underflows to zero, it becomes +0. And if a negative quantity underflows to zero, it becomes -0. You could think of +0 (respectively, -0) as the bit pattern for a positive (negative) number too small to represent.

The IEEE floating point standard says 1/+0 should be +infinity and 1/-0 should be -infinity. This makes sense if you interpret +/- 0 as the ghost of a number that underflowed leaving behind only its sign. The reciprocal of a positive (negative) number too small to represent is a positive (negative) number too large to represent.

To demonstrate this, run the following C code.

int main()
{
    double x = 1e-200;
    double y = 1e-200 * x;
    printf("Reciprocal of +0: %g\n", 1/y);
    y = -1e-200*x;
    printf("Reciprocal of -0: %g\n", 1/y);
}

On Linux with gcc the output is

Reciprocal of +0: inf
Reciprocal of -0: -inf

Windows with Visual C++ returns the same output except Windows prints infinity as 1#INF rather than inf.

There is something, however, about signed zeros and exceptions that doesn’t make sense to me. The aptly named report “What Every Computer Scientist Should Know About Floating Point Arithmetic” has the following to say about signed zeros.

In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.

This implies log(-0) should be NaN and log(+0) should be -∞. That makes sense, but that’s not what happens in practice. The log function returns -∞ for both +0 and -0.

I ran the following code on Linux and Windows.

int main()
{
    double x = 1e-200;
    double y = 1e-200 * x;
    printf("Log of +0: %g\n", log(y));
    y = -1e-200*x;
    printf("Log of -0: %g\n", log(y));
}

On Linux, the code prints

Log of +0: -inf
Log of -0: -inf

The results were the same on Windows, except for the way Windows displays infinities.

Related link:

IEEE floating point exceptions in C++
Anatomy of a floating point number
Math library functions that seem unnecessary

{ 10 comments }

Generating Poisson random values

by John on June 14, 2010

The most direct way of generating random samples from a Poisson distribution is efficient for some parameters and inefficient for others. Wikipedia attributes the following algorithm to Donald Knuth:

    init:
         Let L ← exp(−λ), k ← 0 and p ← 1.
    do:
         k ← k + 1.
         Generate uniform random number u in [0,1] and let p ← p × u.
    while p > L.
    return k − 1.

Here’s the idea behind the algorithm. The time between arrivals in a Poisson process is exponentially distributed. Count how many arrivals there are in an interval by simulating the times between arrivals and adding them up until the time sum spills over the interval.

The problem with this approach is that the expected run time of the loop is proportional to the parameter λ. This algorithm is commonly used despite its terrible performance for large arguments.

Below is an algorithm that has expected run time independent of the argument λ. The algorithm is fairly simple, though it takes a moderate amount of theoretical justification to explain. It goes back to 1979 and so may not the most efficient algorithm available. It is definitely not the most efficient algorithm if you need to generate a large number of samples using the same parameter λ. The paper describing the algorithm recommends using Knuth’s algorithm for values of λ less than 30 and this algorithm for larger values of λ.

c = 0.767 - 3.36/lambda
beta = PI/sqrt(3.0*lambda)
alpha = beta*lambda
k = log(c) - lambda - log(beta)

forever
{
	u = random()
	x = (alpha - log((1.0 - u)/u))/beta
	n = floor(x + 0.5)
	if (n < 0)
		continue
	v = random()
	y = alpha - beta*x
	lhs = y + log(v/(1.0 + exp(y))^2)
	rhs = k + n*log(lambda) - log(n!)
	if (lhs <= rhs)
		return n
}

This is an accept-reject method based on a normal approximation. The performance improves as λ increases because the quality of the normal approximation improves. (Actually, it uses a logistic approximation to a normal approximation!) Theoretically it could be caught in an infinite loop, as with all accept-reject methods. However, the expected number of iterations is bounded. The continue statement means that if n is negative, the algorithm goes back to the top of the loop. The random() function is assumed to return a uniform random variable between 0 and 1.

A possible obstacle to turning the pseudocode above into actual code is computing log(n!). Naively computing n! and taking the log of the result will overflow for moderately large values of n. If you have a function available that computes the log of the gamma function, such as lgamma in C’s math.h, then evaluate that function at n+1. Otherwise, see How to compute log factorial.

The algorithm above is “method PA” from “The Computer Generation of Poisson Random Variables” by A. C. Atkinson, Journal of the Royal Statistical Society Series C (Applied Statistics) Vol. 28, No. 1. (1979), pages 29-35. Method PA is on page 32.

Related posts:

C# random number generation code
How to test a random number generator

{ 3 comments }

Dynamic typing and anti-lock brakes

by John on June 9, 2010

When we make one part of our lives safer, we tend to take more chances somewhere else. Psychologists call this tendency risk homeostasis.

One of the studies often cited to support the theory of risk homeostasis involved German cab drivers. Drivers in the experimental group were given cabs with anti-lock brakes while drivers in the control group were given cabs with conventional brakes. There was no difference in the rate of crashes between the two groups. The drivers who had better brakes drove less carefully.

Risk homeostasis may explain why dynamic programming languages such as Python aren’t as dangerous as critics suppose.

Advocates of statically typed programming languages argue that it is safer to have static type checking than to not have it. Would you rather the computer to catch some of your errors or not? I’d rather it catch some of my errors, thank you. But this argument assumes two things:

  1. static type checking comes at no cost, and
  2. static type checking has no impact on programmer behavior.

Advocates of dynamic programming languages have mostly focused on the first assumption. They argue that static typing requires so much extra programming effort that it is not worth the cost. I’d like to focus on the second assumption. Maybe the presence or absence of static typing changes programmer behavior.

Maybe a lack of static type checking scares dynamic language programmers into writing unit tests. Or to turn things around, perhaps static type checking lulls programmers into thinking they do not need unit tests. Maybe static type checking is like anti-lock brakes.

Nearly everyone would agree that static type checking does not eliminate the need for unit testing. Someone accustomed to working in a statically typed language might say “I know the compiler isn’t going to catch all my errors, but I’m glad that it catches some of them.” Static typing might not eliminate the need for unit testing, but it may diminish the motivation for unit testing. The lack of compile-time checking in dynamic languages may inspire developers to write more unit tests.

See Bruce Eckel’s article Strong Typing vs. Strong Testing for more discussion of the static typing and unit testing.

Update: I’m not knocking statically typed languages. I spend most of my coding time in such languages and I’m not advocating that we get rid of static typing in order to scare people into unit testing.

I wanted to address the question of what programmers do, not what they should do. In that sense, this post is more about psychology than software engineering. (Though I believe a large part of software engineering is in fact psychology as I’ve argued here.) Do programmers who work in dynamic languages write more tests? If so, does risk homeostasis help explain why?

Finally, I appreciate the value of unit testing. I’ve spent most of the last couple days writing unit tests. But there is a limit to the kinds of bugs that unit tests can catch. Unit tests are good at catching errors in code that has been written, but most errors come from code that should have been written. See Software sins of omission.

Related posts:

All languages equally complex
Reasoning about code

{ 23 comments }

C# math gotchas

by John on June 8, 2010

C# has three mathematical constants that look like constants in the C header file float.h. Two of these are not what you might expect.

The constant double.MaxValue in C# looks like the constant DBL_MAX in C, and indeed it is. Both give the maximum finite value of a double, which is on the order of 10^308. This might lead you to believe that double.MinValue in C# is the same as DBL_MIN in C or that double.Epsilon in C# is the same as DBL_EPSILON. If so, you’re in for a surprise.

The constants DBL_MAX and double.MaxValue are the same because there is no ambiguity over what “max” means: the largest finite value of a double. But DBL_MIN and double.MinValue are different because they minimize over different ranges. The constant DBL_MIN is the smallest positive value of a normalized double. The constant double.MinValue in C# is the smallest (i.e. most negative) value of a double and is the negative of double.MaxValue. The difference between DBL_MIN and double.MinValue is approximately the difference between 10^-308 and -10^308, between a very small positive number and a very large negative number.

C has a constant DBL_EPSILON for the smallest positive double precision number x such that 1 + x does not equal 1 in machine precision. Typically a double has about 15 figures of precision, and so DBL_EPSILON is on the order of 10^-16. (For a more precise description, see Anatomy of a floating point number.)

You might expect double.Epsilon in C# corresponds to DBL_EPSILON in C. I did, until a unit test failed on some numerical code I was porting from C++ to C#. But in C# double.Epsilon is the smallest positive value a (denormalized) double can take. It is similar to DBL_MIN, except that double.Epsilon is the possible smallest value of a double, not requiring normalization. The constant DBL_MIN is on the order of 10^-308 while double.Epsilon is on the order of 10^-324 because it allows denormalized values. (See Anatomy of a floating point number for details of denormalized numbers.)

Incidentally, the C constants DBL_MAX, DBL_MIN, and DBL_EPSILON equal the return values of max, min, and epsilon for the C++ class numeric_limits<double>.

To summarize,

  • double.MaxValue in C# equals DBL_MAX in C.
  • double.MinValue in C# equals -DBL_MAX in C.
  • double.Epsilon is similar to DBL_MIN in C, but orders of magnitude smaller.
  • C# has no analog of DBL_EPSILON from C.

One could argue that the C# names are better than the C names. It makes sense for double.MinValue to be the negative of double.MaxValue. But the use of Epsilon was a mistake. The term “epsilon” in numeric computing has long been established and is not limited to C. It would have been better if Microsoft had used the name MinPositiveValue to be more explicit and not conflict with established terminology.

Related posts:

C# random number generation
Math library functions that seem unnecessary
Floating point numbers are a leaky abstraction

{ 10 comments }

Numerical libraries usually have a function for finding the hypotenuse of a right triangle. Isn’t this trivial? If the sides of a triangle are x and y, just write this:

sqrt(x*x + y*y)

That works in theory, but in practice it may fail. If x is so large that x*x overflows, the code will produce an infinite result. 

We’d like to be able to say to the computer “Now x*x and y*y might be too big, but just wait a minute. I’m going to take a square root, and so the numbers will get smaller. I just need to borrow some extra range for a minute.” You can do something like that in environments that have extended precision, but that would be inefficient and unnecessary. And of course it would fail completely if you’re already using the largest numeric type available.

Here’s how to compute sqrt(x*x + y*y) without risking overflow.

  1. max = maximum(|x|, |y|)
  2. min = minimum(|x|, |y|)
  3. r = min / max
  4. return max*sqrt(1 + r*r)

Notice that the square root argument is between 1 and 2 and so there is no danger of overflow in taking the square root. The only way the algorithm could overflow is if the result itself is too large to represent. In that case the fault lies with the problem, not the algorithm: the algorithm was asked to compute a quantity larger than the largest representable number, and so it does the correct thing by overflowing and returning an infinite value. However, the algorithm will not unnecessarily return an infinite value due to an overflow in an intermediate computation.

To see how the algorithm above may succeed when the naive implementation would fail, let M be the largest representable number. For IEEE double precision, M is on the order of 10^308. All that matters for this example is that M is greater than 4. Let x and y equal 0.5 M. The algorithm above would return 0.707 M. But the naive algorithm would compute x*x and overflow since 0.25 M2 > M. Then x*x + y*y would evaluate to infinity and the square root would evaluate to infinity.

A hypotenuse function is included in math.h on every system that I am aware of. It may be named hypot or _hypot.

Related posts:

Floating point numbers are a leaky abstraction
IEEE floating point exceptions in C++
Anatomy of a floating point number

{ 18 comments }

Computing before Fortran

by John on May 25, 2010

In the beginning was Fortran. Or maybe not.

It’s easy to imagine that no one wrote any large programs before there were compilers, but that’s not true. The SAGE system, for example, involved 500,000 lines of assembly code and is regarded as one of the most successful large computer systems ever developed. Work on SAGE began before the first Fortran compiler was delivered in 1957.

The Whirlwind computer that ran SAGE had a monitor, keyboard, printer, and modem. It also had a light gun, a precursor to the mouse. It’s surprising that all this came before Fortran.

The video below is an interview with Admiral Grace Hopper, a pioneering computer scientist who began her career before Fortran.

Related posts:

Technology history quiz
Doing good work with bad tools

{ 2 comments }

In defense of reinventing wheels

by John on May 20, 2010

Sometimes reinventing the wheel can be a good thing. Jeff Atwood wrote an article that ends with this conclusion:

So, no, you shouldn’t reinvent the wheel. Unless you plan on learning more about wheels, that is.

You have to be very selective about which wheels you’re going to reinvent. Trying to do everything yourself from scratch is the road to poverty. But in your area of specialization, reinventing a few wheels is a good idea.

Buy, don’t build?

Years ago I subscribed to the “buy, don’t build” philosophy of software development. Pay $100 for a CD rather than spend weeks writing your own buggy knock-off of a commercial product, etc.  (This isn’t limited to commercial software. Substitute “download an open source product” for “pay $100 for a CD” and the principle is still the same. Call it “download, don’t build” if you like.) I still believe “buy, don’t build” is basically good advice, though I’d make more exceptions now than I would have then.

When you buy a dozen incompatible components that each solve part of your problem, you now have two new problems: managing the components and writing code to glue the components together.

Managing components includes sorting out licenses and keeping track of versions. Commercial components may require purchasing support. Using open source components may require monitoring blogs and list servers to keep track of bug fixes and upgrades.

Writing the glue code is a bigger problem. Mike Taylor calls this “the library lie.” This is the idea that your problem becomes trivial once you find a set of libraries that together do what you need to do. But integrating all these libraries may be more work than writing more of the code yourself.

Related posts:

Opening black boxes
Chauffeurs and Ferraris revisited

{ 6 comments }

Pure functions have side-effects

by John on May 18, 2010

Functional programming emphasizes “pure” functions, functions that have no side effects. When you call a pure function, all you need to know is the return value of the function. You can be confident that calling a function doesn’t leave any state changes that will effect future function calls.

But pure functions are only pure at a certain level of abstraction. Every function has some side effect: it uses memory, it takes CPU time, etc. Harald Armin Massa makes this point in his PyCon 2010 talk “The real harm of functional programming.” (His talk is about eight minutes into the February 21, 2010 afternoon lightning talks:  video, audio.)

Even pure functions in programming have side effects. They use memory. They use CPU. They take runtime. And if you look at those evil languages, they are quite fast at doing Fibonacci or something, but in bigger applications you get reports “Hmm, I have some runtime problems. I don’t know how to get it faster or what it going wrong.

Massa argues that the concept of an action without side effects is dangerous because it disassociates us from the real world. I disagree. I appreciate his warning that the “no side effect” abstraction may leak like any other abstraction. But pure functions are a useful abstraction.

You can’t avoid state, but you can partition the stateful and stateless parts of your code. 100% functional purity is impossible, but 85% functional purity may be very productive.

Related posts:

Using Python as a functional language
F# may succeed where others have failed
Functional in the small, OO in the large

{ 9 comments }

The dark matter of programmers

by John on May 17, 2010

Kate Gregory said in an interview that she likes to call C++ programmers the dark matter of the software development world. They’re not visible — they don’t attend conferences and they don’t buy a lot of books — and yet there are a lot of them.

I’ve thought the same about those who write code for games and embedded systems. I’ve hardly met anyone who works in either area, but there is a huge amount of such software and someone has to be writing it.

Game developers and embedded systems folk often work in C++, so my observation would overlap with Kate Gregory’s.

Related posts:

Termites and programmers
Disagreeing with Torvalds on C++
Complementary perspectives on the design of C++

{ 7 comments }