Bad programmers create jobs

Jeff Atwood quotes an interview with David Parnas in his most recent blog post.

Q: What is the most often-overlooked risk in software engineering?

A: Incompetent programmers. There are estimates that the number of programmers needed in the U.S. exceeds 200,000. This is entirely misleading. It is not a quantity problem; we have a quality problem. One bad programmer can easily create two new jobs a year. Hiring more bad programmers will just increase our perceived need for them. If we had more good programmers, and could easily identify them, we would need fewer, not more.

IEEE floating point arithmetic in Python

Sometimes a number is not a number. Numeric data types represent real numbers in a computer fairly well most of the time, but sometimes the abstraction leaks. The sum of two numeric types is always a numeric type, but the result might be a special bit pattern that says overflow occurred. Similarly, the ratio of two numeric types is a numeric type, but that type might be a special type that says the result is not a number.

The IEEE 754 standard dictates how floating point numbers work. I’ve talked about IEEE exceptions in C++ before. This post is the Python counterpart. Python’s floating point types are implemented in terms of C’s double type, and so the C++ notes describe what’s going on at a low level. However, Python creates a higher level abstraction for floating point numbers. (Python also has arbitrary precision integers, which we will discuss at the end of this post.)

There are two kinds of exceptional floating point values: infinities and NaNs. Infinite values are represented by inf and can be positive or negative. A NaN, not a number, is represented by nan. Let x = 10^200. Then x^2 will overflow because 10^400 is too big to fit inside a C double. (To understand just why, see Anatomy of a floating point number.) In the following code, y will contain a positive infinity.

x = 1e200; y = x*x

If you’re running Python 3.0 and you print y, you’ll see inf. If you’re running an earlier version of Python, the result may depend on your operating system. On Windows, you’ll see 1.#INF but on Linux you’ll see inf. Now keep the previous value of y and run the following code.

z = y; z /= y

Since z = y/y, you might think z should be 1. But since y was infinite, it doesn’t work that way. There’s no meaningful way to assign a numeric value to the ratio of infinite values and so z contains a NaN. (You’d have to know “how they got there” so you could take limits.) So if you print z you’d see nan or 1.#IND depending on your version of Python and your operating system.

The way you test for inf and nan values depends on your version of Python. In Python 3.0, you can use the functions math.isinf and math.isnan respectively. Earlier versions of Python do not have these functions. However, the SciPy library has corresponding functions scipy.isinf and scipy.isnan.

What if you want to deliberately create an inf or a nan? In Python 3.0, you can use float('inf') or float('nan'). In earlier versions of Python you can use scipy.inf and scipy.nan if you have SciPy installed.
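Putting these pieces together, here is a minimal sketch assuming Python 2.6 or later, where math.isinf, math.isnan, and the float('inf') syntax are all available:

```python
import math

x = 1e200
y = x * x            # overflows to positive infinity
z = y / y            # inf/inf has no meaningful value: NaN

print(math.isinf(y))   # True
print(math.isnan(z))   # True

# Creating the special values directly
pos_inf = float('inf')
nan = float('nan')

print(pos_inf == y)    # True
print(nan == nan)      # False: a NaN does not equal anything, even itself
```
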

IronPython does not yet support Python 3.0, nor does it support SciPy directly. However, you can use SciPy with IronPython by using Ironclad from Resolver Systems. If you don’t need a general numerical library but just want functions like isinf and isnan you can create your own.


def isnan(x):
    return type(x) is float and x != x

def isinf(x):
    inf = 1e5000  # literal too large for a double, so it parses as infinity
    return x == inf or x == -inf

The isnan function above looks odd. Why would x != x ever be true? According to the IEEE standard, NaNs don’t equal anything, even each other. (See comments on the function IsFinite here for more explanation.) The isinf function is really a dirty hack but it works.
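As a quick sanity check, the definitions above (repeated here so the snippet runs on its own) behave as expected on both ordinary and exceptional values:

```python
def isnan(x):
    return type(x) is float and x != x

def isinf(x):
    inf = 1e5000  # literal too large for a double, so it parses as infinity
    return x == inf or x == -inf

y = 1e200 * 1e200   # overflows to positive infinity
z = y / y           # inf/inf is a NaN

print(isinf(y))                 # True
print(isnan(z))                 # True
print(isnan(1.0), isinf(1.0))   # False False
```
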

To wrap things up, we should talk a little about integers in Python. Although Python floating point numbers are essentially C floating point numbers, Python integers are not C integers. Python integers have arbitrary precision, so we can sometimes avoid problems with overflow by working with integers. For example, if we had defined x as 10**200 in the example above, x would be an integer, y = x*x would also be an integer, and y would not overflow; a Python integer can hold 10^400 with no problem. We’re OK as long as we keep producing integer results, but we could run into trouble if we do anything that produces a non-integer result. For example,

x = 10**200; y = (x + 0.5)*x

would cause y to be inf, and

x = 10**200; y = x*x + 0.5

would throw an OverflowError exception.
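The contrast can be seen directly. A sketch, using math.isinf (available in Python 2.6 and later) to test the result:

```python
import math

x = 10**200          # arbitrary-precision Python integer

y = x * x            # still an integer: exactly 10**400, no overflow
assert y == 10**400

# Mixing in a float before multiplying forces double arithmetic:
# x + 0.5 becomes the double 1e200, and the product overflows to inf
z = (x + 0.5) * x
print(math.isinf(z))   # True

# Doing the exact integer arithmetic first and then adding a float
# requires converting 10**400 to a double, which cannot be done
try:
    w = x * x + 0.5
except OverflowError:
    print("OverflowError: integer too large to convert to float")
```
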


Probability distributions in SciPy

Here are some notes on how to work with probability distributions using the SciPy numerical library for Python.

Functions related to probability distributions are located in scipy.stats. The general pattern is

scipy.stats.<distribution family>.<function>

There are 81 supported continuous distribution families and 12 discrete distribution families. Some distributions have obvious names: gamma, cauchy, t, f, etc. The only possible surprise is that all distributions begin with a lower-case letter, even those corresponding to a proper name (e.g. Cauchy). Other distribution names are less obvious: expon for the exponential, chi2 for chi-squared distribution, etc.

Each distribution supports several functions. The density and cumulative distribution functions are pdf and cdf respectively. (Discrete distributions use pmf rather than pdf.) One surprise here is that the inverse CDF function is called ppf for “percentage point function.” I’d never heard that terminology and would have expected something like “quantile.”

Example: scipy.stats.beta.cdf(0.1, 2, 3) evaluates the CDF of a beta(2, 3) random variable at 0.1.

Random values are generated using rvs which takes an optional size argument. The size is set to 1 by default.

Example: scipy.stats.norm.rvs(2, 3) generates a random sample from a normal (Gaussian) random variable with mean 2 and standard deviation 3. The function call scipy.stats.norm.rvs(2, 3, size = 10) returns an array of 10 samples from the same distribution.

The command line help() facility does not document the distribution parameterizations, but the external documentation does. Most distributions are parameterized in terms of location and scale. This means, for example, that the exponential distribution is parameterized in terms of its mean, not its rate. Somewhat surprisingly, the exponential distribution has a location parameter. This means, for example, that scipy.stats.expon.pdf(x, 7) evaluates at x the PDF of an exponential distribution with location 7. This is not what I expected. I assumed there would be no location parameter and that the second argument, 7, would be the mean (scale). Instead, the location was set to 7 and the scale was left at its default value 1. Writing scipy.stats.expon.pdf(x, scale=7) would have given the expected result because the default location value is 0.
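To make the pitfall concrete, here is a sketch comparing the two calls, assuming SciPy is installed. The checks use the standard exponential density exp(-x) and the scaled density exp(-x/7)/7:

```python
import math
from scipy import stats

# Exponential with mean (scale) 7, evaluated at x = 10
with_scale = stats.expon.pdf(10, scale=7)   # density (1/7) exp(-10/7)

# Passing 7 positionally sets the *location*, not the scale:
# this is a standard exponential shifted to start at 7
with_loc = stats.expon.pdf(10, 7)           # density exp(-(10 - 7))

print(abs(with_scale - math.exp(-10.0/7)/7) < 1e-12)   # True
print(abs(with_loc - math.exp(-3)) < 1e-12)            # True
```
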

SciPy also provides constructors for objects representing random variables.

Example: x = scipy.stats.norm(3, 1); x.cdf(2.7) returns the same value as scipy.stats.norm.cdf(2.7, 3, 1).

Constructing objects representing random variables encapsulates the differences between distributions in the constructors. For example, some distributions take more parameters than others and so their object constructors require more arguments. But once a distribution object is created, its PDF, for example, can be called with a single argument. This makes it easier to write code that takes a general distribution object as an argument.
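For example, a function can accept any frozen distribution object and call its methods without knowing how many parameters the underlying family takes. A sketch, assuming SciPy is installed:

```python
from scipy import stats

def tail_probability(dist, x):
    """P(X > x) for any frozen distribution object."""
    return 1 - dist.cdf(x)

# Different families need different numbers of constructor arguments,
# but the resulting objects all share the same interface.
normal = stats.norm(3, 1)    # mean 3, standard deviation 1
b = stats.beta(2, 3)         # shape parameters 2 and 3

print(tail_probability(normal, 3))   # 0.5 by symmetry
print(tail_probability(b, 0.0))      # 1.0, since the beta support is (0, 1)
```
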

Financial control and useless projects

Tom DeMarco has an article in the latest IEEE Software in which he gives an example of two hypothetical software projects. Both are expected to cost around a million dollars. One is expected to return a value of 1.1 million and the other 50 million. Financial controls are crucial for the former but not for the latter. He concludes

… strict control is something that matters a lot on relatively useless projects and much less on useful projects. It suggests that the more you focus on control, the more likely you’re working on a project that’s striving to deliver something of relatively minor value.

Thanks to John MacIntyre for pointing out Tom DeMarco’s article.

Solo software development

Is it becoming easier or harder to be a solo software developer? I see two trends flowing in opposite directions.

Matt Heusser argues in his article The Boutique Tester that it’s easier to be an independent software developer now than it was a decade ago. You don’t have to burn CDs and ship them; you just put your software up on the web. You don’t have to maintain your own server; you can rent a server cheaply. You don’t have to buy expensive development tools; good tools are available for free. All these things are true, but there are other issues.

Software developers are required to know more languages than ever. A decade ago, you could make a career writing desktop applications in Visual Basic or C++ and not need to know any other language. Now in order to write a web application you need to know at least HTML, CSS, JavaScript, and SQL in addition to a programming language such as Java, C#, or Ruby. And knowing these languages is just a beginning: you also need to learn a web development framework such as JSP, ASP.NET, or Rails. The list seems to never end. See Joe Brinkman’s article Polyglot Programming: Death By A Thousand DSLs. Programming language proliferation is not the only new difficulty in software development — security anyone? — but I’ll focus on languages.

Can one developer learn all these languages? The surprising answer is “yes.” You might think that such a menagerie of languages would lead developers to specialize, but programmers are not nearly as specialized as an outsider might expect, even in large organizations. On the other hand, most developers don’t entirely understand what they’re doing, having to work with more languages than they could possibly master. This is no doubt the root cause of many bugs.

Going back to the original question, is it easier or harder to be a solo developer these days? Software development itself has gotten harder, but the external difficulties have been greatly reduced. Programmers have to know more programming languages, but programmers have a knack for that. They don’t have to spend as much time on distribution, system administration, etc. Even sales and marketing, the bane of many developers, are easier now. So while software development itself has become harder, being an independent software developer may have become easier.

Many people disagree that software development has gotten harder; my opinion may be in the minority. Software development tools have certainly improved. It would be much easier to develop 1999-style applications now than it was in 1999. But I believe that developing 2009-style applications with 2009 tools is harder than developing 1999-style applications was with 1999 tools, particularly for high quality software. Throwing together applications that sorta work most of the time may be easier now, but developing quality software is more difficult.

Related post: Programming language fatigue

I owe Microsoft Word an apology

I tried to use the Equation Editor in Microsoft Word years ago and hated it. It was hard to use and produced ugly output. I tried it again recently and was pleasantly surprised. I’m using Word 2007. I don’t remember what version I’d tried before.

I’ve long said that math written in Word is ugly, and it usually is. But the fault lies with users, like myself, not with Word. I realize now that the problem is that most people writing math in Word are not using the Equation Editor. LaTeX produces ugly math too when people do not use it correctly, though this happens less often.

Math typography is subtle. For example, mathematical symbols are set in an italic font that is not quite the same as the italic font used in prose. Also, word-like symbols such as “log” or “cos” are not set in italics. I imagine most people do not consciously notice these conventions — I never noticed until I learned to use LaTeX — but subconsciously notice when the conventions are violated. The conventions of math typography give clues that help readers distinguish, for example, the English indefinite article “a” from a variable named “a” and to distinguish the symbol for maximum from the product of variables “m”, “a”, and “x.”

Microsoft’s Equation Editor typesets math correctly. Word documents usually do not, but only because folks usually do not use the Equation Editor. In the following example, I set the same equation three times: using ordinary text, using ordinary italic for the “x”, and finally using the Equation Editor.

screen shot of trig identity using MS Word

Note that the “x” in the third version is not the same as the italic “x” in the second version. The prose in this example is set in Calibri font and the Equation Editor uses Cambria Math font. Also, I did not tell Word to format “sin” and “cos” one way and “x” another or tell it what font to use; I simply typed sin^2 x + cos^2 x = 1 into the Equation Editor and it formatted the result as above. I haven’t used it much, but the Equation Editor seems to be more capable and easier to use than I thought.

Here are a few more examples of Equation Editor output.

examples of math using Word: Gaussian integral, Fourier series, quadratic equation

I still prefer using LaTeX for documents containing math symbols. I’ve used LaTeX for many years and I can typeset equations very quickly using it. But I’m glad to know that Word can typeset equations well and that the process is easier than I thought.

I tried out the Equation Editor because Bob Matthews suggested I try MathType, a third-party equation editor add-on for Microsoft Word. I haven’t tried MathType yet but from what I hear it produces even better output.

Related post: Contrasting Microsoft Word and LaTeX

Ever feel like a newspaper?

Why are newspapers going out of business? The simple explanation is that newspaper owners are stupid; the world around them is changing and they’re oblivious. Michael Nielsen has a more interesting explanation. He says that newspapers are in trouble not because they’re stupid now but because they’ve been smart in the past.

Nielsen argues that newspapers are locked into their current business models because they have been so successful. Any small changes will make their businesses less profitable. I don’t know enough about the newspaper industry to say whether Nielsen is right, though I find his argument plausible. (His article is entitled Is scientific publishing about to be disrupted? However, it is about much more than scientific publishing.)

Nielsen argues that newspapers are standing on the top of one hill and profitable online news sources are standing on a higher hill, a hill that didn’t exist 20 years ago. In mathematical lingo, both businesses are at local maxima. Newspapers are trapped because they can’t improve their situation without first making it worse. Anyone who leads a newspaper down its hill in order to climb a new hill will be fired before he starts gaining altitude again.

I don’t care that much about newspapers, but Nielsen’s article struck me because it provides an explanation for many other situations. I feel like some areas of my life are stuck at a local maximum: there’s plenty of room for improvement, but not by making small changes.

Random inequalities VIII: folded normals

Someone who ran into my previous posts on random inequalities asked me how to compute random inequalities for folded normals. (A folded normal random variable is the absolute value of a normal random variable.) So the question is how to compute

Pr(|X| > |Y|)

where X and Y are normally distributed. Here’s my reply as a short tech report: Inequality probabilities for folded normal random variables.
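The exact computation is in the tech report, but the probability is easy to estimate by simulation. A Monte Carlo sketch, assuming X and Y are independent; in the symmetric case where both are standard normals, the probability is 1/2:

```python
import random

def prob_folded_gt(mu_x, sigma_x, mu_y, sigma_y, n=100000, seed=42):
    """Estimate P(|X| > |Y|) for independent normals X and Y."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        x = rng.gauss(mu_x, sigma_x)
        y = rng.gauss(mu_y, sigma_y)
        if abs(x) > abs(y):
            count += 1
    return count / float(n)

# X and Y identically distributed, so by symmetry P(|X| > |Y|) = 1/2
print(prob_folded_gt(0, 1, 0, 1))   # approximately 0.5
```
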

Previous posts in this series:

Introduction
Analytical results
Numerical results
Cauchy distributions
Beta distributions
Gamma distributions
Three or more random variables

F# may succeed where others have failed

Philip Wadler wrote an article a decade ago entitled Why no one uses functional languages. He begins the article by explaining that yes, there have been a number of large successful projects developed in functional programming languages. But compared to the number of programmers who work in procedural languages, the number working in functional languages is essentially zero. The reasons he listed fall into eight categories.

  1. Lack of compatibility with existing code
  2. Limited library support compared to popular languages
  3. Lack of portability across operating systems
  4. Small communities and correspondingly little community support
  5. Inability to package code well for reuse
  6. Lack of sophisticated tool support
  7. Lack of training for new developers in functional programming
  8. Lack of popularity

Most of these reasons do not apply to Microsoft’s new functional language F# since it is built on top of the .NET framework. For example, F# has access to the enormous Common Language Runtime library and smoothly interoperates with anything developed with .NET. And as far as tool support, Visual Studio will support F# starting with the 2010 release. Even portability is not a barrier: The Mono Project has been quite successful in porting .NET code to non-Microsoft platforms. (Listen to this Hanselminutes interview with Aaron Bockover for an idea of how mature Mono is.)

The only issues that may apply to F# are training and popularity. Programmers receive far more training in procedural programming, and the popularity of procedural programming is self-reinforcing. Despite these disadvantages, interest in functional programming in general is growing. And when programmers want to learn a functional programming language, I believe many will choose F#.

It will be interesting to see whether F# catches on. It resolves many of the accidental difficulties of functional programming, but the intrinsic difficulties remain. Functional programming requires a different mindset, one that programmers have been reluctant to adopt. But now programmers have a new incentive to give functional languages a try: multi-core processors.

Individual processor cores are not getting faster, but we’re getting more of them per box. We have to write multi-threaded code to take advantage of extra cores, and multi-threaded programming in procedural languages is hard, beyond the ability of most programmers. Strict functional languages eliminate many of the difficulties with multi-threaded programming, and so it seems likely that at least portions of systems will be written in functional languages.

Related post: Functional in the small, OO in the large

Emily Dickinson versus Paris Hilton

Mark Helprin discusses the decline of serious political discourse in America in his excellent book Digital Barbarism. Earlier generations were more patient, “primed to deliberate rather than merely to react.” He summarizes his argument by comparing Emily Dickinson and Paris Hilton.

That is not to say that all Americans were models of dignity and concentration, but by and large they were quite different from what we are now. … Rather than a massive comparison, suffice it to say that although today not everyone is like Paris Hilton, and in the nineteenth century not everyone was like Emily Dickinson, each of these is far more characteristic of her age than would be the other, and that this is self-evident along with all it implies.

Related post: Place, privacy, and dignity