Posts tagged as:

Python

A knight’s tour magic square

by John on April 6, 2011

This magic square was created by Leonhard Euler (1707-1783). Each row and each column sum to 260. Each half-row and half-column sum to 130. The square is also a knight’s tour: a knight could visit each square on a chessboard exactly once by following the numbers in sequence.

Here is Python code to verify that the square has the properties listed above.

Update: It seems the attribution to Euler is a persistent error. Euler did publish the first paper on knight’s tours, but the knight’s tour square above was published by William Beverley in 1848. Thanks to George Jelliss for the correction. See the comments below.

Update 2: Notes from George Jelliss on magic king and queen tours.

{ 22 comments }

Python for high performance computing

by John on March 21, 2011

William Scullin’s talk from PyCon 2011: Python for high performance computing.

At least in our shop [Argonne National Laboratory] we have three accepted languages for scientific computing. In this order they are C/C++, Fortran in all its dialects, and Python. You’ll notice the absolute and total lack of Ruby, Perl, Java.

If you’re interested in Python and HPC, check out SciPyTip.

{ 8 comments }

Digital workflow

by John on March 13, 2011

William Turkel has a nice four-part series of blog posts entitled A Workflow for Digital Research Using Off-the-Shelf Tools. His four points are

  1. Start with a backup and versioning strategy.
  2. Make everything digital.
  3. Research 24/7 (using RSS feeds).
  4. Make local copies of everything.

Also by William Turkel, The Programming Historian, “an open-access introduction to programming in Python, aimed at working historians (and other humanists) with little previous experience.”

Related post:

Create offline, analyze online

{ 5 comments }

Scientific Python on Twitter

by John on February 21, 2011

Next week I’m starting a new daily tip Twitter account: @SciPyTip.

This account will post on things related to scientific computing in Python, including the SciPy library, related software, and scientific computing in general.

Full list of daily tip accounts

{ 2 comments }

See Microsoft Research’s announcement of the the Sho project.

Sho is an interactive environment for data analysis and scientific computing that lets you seamlessly connect scripts (in IronPython) with compiled code (in .NET) to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

Maybe this is why Microsoft contracted Enthought this summer to port NumPy and SciPy to .NET.

{ 7 comments }

Ruby, Python, and Science

by John on November 28, 2010

David Jacobs has written a long blog post Ruby is beautiful (but I’m moving to Python). Here’s my summary.

Ruby is much better than Java, but the Ruby community is too focused on web development and the language has no scientific library. Python has a lot of the same advantages as Ruby, is used for more than web programming, and has SciPy.

Update: There is now a fledgling SciRuby project.

Further reading:

Plain Python
Getting started with SciPy
Replacing Mathematica with Python
SciPy and NumPy for .NET

{ 9 comments }

Best rational approximation

by John on October 20, 2010

Suppose you have a number x between 0 and 1. You want to find a rational approximation for x, but you only want to consider fractions with denominators below a given limit.

For example, suppose x = 1/e = 0.367879…  Rational approximations with powers of 10 in the denominator are trivial to find: 3/10, 36/100, 367/1000, etc. But say you’re willing to have a denominator as large as 10. Could you do better than 3/10? Yes, 3/8 = 0.375 is a better approximation. What about denominators no larger than 100? Then 32/87 = 0.36781… is the best choice, much better than 36/100.

How do you find the best approximations? You could do a brute force search. For example, if the maximum denominator size is N, you could try all fractions with denominators less than or equal to N. But there’s a much more efficient algorithm. The algorithm is related to the Farey sequence named after John Farey, though I don’t know whether he invented the algorithm.

The idea is to start with two fractions, a/b = 0/1 and c/d = 1/1. We update either a/b or c/d at each step so that a/b will be the best lower bound of x with denominator no bigger than b, and c/d will be the best upper bound with denominator no bigger than d. At each step we do a sort of binary search by introducing the mediant of the upper and lower bounds. The mediant of a/b and c/d is the fraction (a+c)/(b+d) which always lies between a/b and c/d.

Here is an implementation of the algorithm in Python. The code takes a number x between 0 and 1 and a maximum denominator size N. It returns the numerator and denominator of the best rational approximation to x using denominators no larger than N.

def farey(x, N):
    a, b = 0, 1
    c, d = 1, 1
    while (b <= N and d <= N):
        mediant = float(a+c)/(b+d)
        if x == mediant:
            if b + d <= N:
                return a+c, b+d
            elif d > b:
                return c, d
            else:
                return a, b
        elif x > mediant:
            a, b = a+c, b+d
        else:
            c, d = a+c, b+d

    if (b > N):
        return c, d
    else:
        return a, b

In Python 3.0, the float statement could be removed since the division operator does floating point division of integers.

Read more about rational approximation in Breastfeeding, the golden ratio, and rational approximation.

Here’s an example of a situation in which you might need rational approximations. Suppose you’re designing an experiment which will randomize subjects between two treatments A and B. You want to randomize in blocks of size no larger than N and you want the probability of assigning treatment A to be p. You could find the best rational approximation a/b to p with denominator b no larger than N and use the denominator as the block size. Each block would be a permutation of a A’s and b-a B’s.

Update 1: Here is a form that implements the algorithm above.

Update 2: Eugene Wallingford wrote a blog post about implementing the algorithm in Klein, a very restricted language used for teaching compiler writing.

{ 17 comments }

Bug in SciPy’s erf function

by John on September 2, 2010

Last night I produced the plot below and was very surprised at the jagged spike. I knew the curve should be smooth and strictly increasing.

My first thought was that there must be a numerical accuracy problem in my code, but it turns out there’s a bug in SciPy version 0.8.0b1. I started to report it, but I saw there were similar bug reports and one such report was marked as closed, so presumably the fix will appear in the next release.

The problem is that SciPy’s erf function is inaccurate for arguments with imaginary part near 5.8. For example, Mathematica computes erf(1.0 + 5.7i) as -4.5717×1012 + 1.04767×1012 i. SciPy computes the same value as -4.4370×1012 + 1.3652×1012 i. The imaginary component is off by about 30%.

Here is the code that produced the plot.

from scipy.special import erf
from numpy import linspace, exp
import matplotlib.pyplot as plt

def g(y):
    z = (1 + 1j*y) /  sqrt(2)
    temp = exp(z*z)*(1 - erf(z))
    u, v = temp.real, temp.imag
    return -v / u

x = linspace(0, 10, 101)
plt.plot(x, g(x))

{ 4 comments }

What does this code do?

by John on July 21, 2010

At the SciPy 2010 conference, a speaker showed several short code samples and asked us what each sample did. The samples were clearly written, but we had no comments to provide context. This was the last sample.

def what( x, n ):
    if n < 0:
        n = -n
        x = 1.0 / x
    z = 1.0
    while n > 0:
        if n % 2 == 1:
            z *= x
        x *= x
        n /= 2
    return z

The quiz was at the end of the day and I was tired. I couldn’t tell what the code does. Then I found out to my chagrin that the sample above implements an algorithm I know well. I’ve written the same code and I’ve even blogged about here.

This exercise changed my opinion of “self-documenting” code. Without some contextual clue, it is hard to understand the purpose of even a small piece of code.

Meaningful variable and function names would have helped, but a tiny comment might have helped even more. Not some redundant comment like explaining that the line x = 1.0 / x takes a reciprocal, but a comment explaining the problem the code is trying to solve.

For another example, what do you think this code does?

uint what()
{
    m_z = 36969 * (m_z & 65535) + (m_z >> 16);
    m_w = 18000 * (m_w & 65535) + (m_w >> 16);
    return (m_z << 16) + (m_w & 65535);
}

It’s clear enough what the code does at a low level — it’s just a few operations — but it’s not at all clear what it’s for.

Try to figure out what the code samples do before reading further. But if you give up, the first example is described here and the second example comes from here.

In an ordinary face-to-face conversation, more information is conveyed non-verbally than verbally. We may think that our literal words are most important, but so much is conveyed by voice inflection, facial expression, posture, etc. Something similar is going on with source code. When we read a piece of source code, we typically bring a huge amount of implicit knowledge with us.

Suppose a coworker Sam asks you to look at his code. The fact that the question came up at work provides a large amount of context; this isn’t just a random code fragment on the web. More specifically, you know what kinds of projects Sam works on. You know why Sam wants you to look at the code. He may be showing you something he’s proud of or he may be asking for help finding a bug. You know a lot about his code before you see it.

Now suppose you’re a contractor. Sam was hit by a bus and you’ve been asked to work on his projects until he gets out of the hospital. You may complain to his office mate that Sam’s code is an awful mess, but she can’t understand what you’re talking about. She thinks his code is perfectly clear.

Now suppose you’re a contractor on the opposite side of the world from Sam. You have even less context than if you were in his office talking to his office mate. After a great deal of agony, you send your contribution back to Sam’s company. You comment your code beautifully, but Sam’s colleagues complain that your code is poorly written and that you didn’t solve the right problem.

Institutional memory is more valuable than source code comments. It costs a great deal to replace a programmer, even one who leaves behind well-commented code.

Related posts:

Do you really want to be indispensable?
Preserving (the memory of) documents
The buck stops with the programmer

{ 30 comments }

How many errors are left to find?

by John on July 13, 2010

There’s a simple statistic called the Lincoln Index that lets you estimate the total number of errors based on the number of errors found. I’ll explain what the Lincoln Index is, why it works, give some code for playing with it, and discuss how it applies to software testing.

What is the Lincoln Index?

Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn’t very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.

Suppose two testers independently search for bugs. Let E1 be the number of errors the first tester finds and E2 the number of errors the second tester finds. Let S be the number of errors both testers find. The Lincoln Index estimates the total number of errors as E1 E2/S. You can find historical background on the Lincoln Index here.

How does the index work?

Suppose there are n bugs and the two testers find bugs with probability p1 and p2 respectively. You’d expect the two testers to find around np1 and np2 bugs. If you assume the probabilities of each tester finding a bug are independent, you’d expect the testers to find around np1 p2 bugs in common. That says E1*E2/S would be around

(n2 p1 p2) / (n p1 p2) = n.

The probabilities of each tester finding a bug cancel out leaving only n, the total number of bugs.

Simulation code

Here’s some Python code for simulating estimates using the Lincoln Index.


from random import random

def find_error(p):
    "Find an error with probability p"
    if random() < p:
        return 1
    return 0

def simulate(true_error_count, p1, p2, reps=10000):
    """Simulate Lincoln's method for estimating errors
    given the true number of errors, each person's probability
    of finding an error, and the number of simulations to run."""
    estimation_error_sum = 0
    for rep in xrange(reps):
        caught1 = 0
        caught2 = 0
        caught_both = 0
        for error in xrange(true_error_count):
            found1 = find_error(p1)
            found2 = find_error(p2)
            caught1 += found1
            caught2 += found2
            caught_both += found1*found2
        estimate = caught1*caught2 / float(caught_both)
        estimation_error_sum += abs(estimate - true_error_count)
    return estimation_error_sum / float(reps)

I used this to simulate the case of two testers, one with a 30% chance of finding a bug and the other with a 40% chance, and a total of 100 bugs. I simulated the Lincoln Index 1,000 times, keeping track of the absolute error in the estimates. The code to do this was simulate(100, 0.30, 0.40, 1000). On average, the Lincoln index over- or under-estimated the number of bugs by about 16. This is a good estimate considering each tester greatly under-estimated the number of bugs.

If you didn’t think about using something like the Lincoln Index, in the previous example one tester would find around 30 bugs and the other around 40. The two lists might have 10 bugs in common, so you’d estimate the total number at 60, far short of 100. But the Lincoln index would often find estimates between 84 and 116.

Note that it is possible that the testers won’t find any of the same bugs. In that case the Lincoln Index cannot be computed and the code will divide by zero. But this is unlikely unless the p’s are small and n is small.

Software testing

Does the Lincoln Index actually provide a good bug count estimate? That depends on how well the assumptions are met. The index assumes all bugs are equally hard for a given tester to find. It does not assume that both testers are equally skilled, but it does assume that their chances of finding a bug are independent. In other words, tester A is no more or less likely to find a bug just because tester B found it.

The most questionable assumption is that all bugs are equally hard to find. That’s usually not true. But it may be true that all bugs of a certain kind are equally hard to find. For example, spelling errors may be easier to find than validation oversights, but the Lincoln Index might be good for estimating separately how many spelling errors or validation errors there are.

The index might provide a rough rule of thumb even if the assumptions it that go into it are violated. For example, suppose one tester found 15 bugs and another found 20. But only 3 of the bugs were the same. A naive estimate would say since there are 32 unique bugs found, there must be around that many in total. But the Lincoln Index would estimate 100 bugs. Maybe the Lincoln estimate is not at all accurate, but it does tell you to be worried that there may be a lot more bugs to find since the overlap between the two bug lists was so small.

Related post:

Estimating the chances of something that hasn’t happened yet

{ 19 comments }

Replacing Mathematica with Python

by John on July 9, 2010

Everything I do regularly in Mathematica can be done in Python. Even though Mathematica has a mind-boggling amount of functionality, I only use a tiny proportion of it. I skimmed through some of my Mathematica files to see what functions I use and then looked for Python counterparts. I found I use less of Mathematica than I imagined.

The core mathematical functions I need are in SciPy. The plotting features are in matplotlib. The SymPy library appears to have the symbolic functionality I need, though I’m as not sure about this one.

I don’t have much experience with the Python libraries listed above. I haven’t used SymPy at all; I’ve only browsed its web site. Maybe I’ll find I’d rather work in Mathematica, particularly when I’m just trying out ideas. But I want to experiment with using Python for more tasks.

As I’ve blogged about before, I’d like to consolidate my tools. I started using Emacs again because I was frustrated with using a different editor for every kind of file. One of the things I find promising about Python is that I may be able to do more in Python and reduce the number of programming languages I use regularly.

Related posts:

Languages that are easy to pick back up
Where the Unix philosophy breaks down

{ 17 comments }

SciPy and NumPy for .NET

by John on July 1, 2010

Travis Oliphant announced this morning at the SciPy 2010 conference that Microsoft is partnering with Enthought to produce a version of NumPy and SciPy for .NET. NumPy and SciPy are Python libraries for scientific computing. Oliphant is the president of Enthought and the original developer of NumPy.

It is possible to call NumPy and SciPy from IronPython now by using IronClad. However, going through IronClad can be inefficient.  The new libraries will enable efficient access to NumPy and SciPy from .NET languages and in particular from IronPython.

Here is the official press release from Enthought.

Photo credit: Paul Ivanov

{ 15 comments }

Porting Python to C#

by John on June 18, 2010

When people start programming in Python, they often mention having to type less: no braces, no semicolons, fewer type declarations etc.

The difference may be more obvious when you go in the other direction, moving from Python to another language. This morning I ported some Python code to C# and was a little surprised how much extra code I had to add. When I’ve ported C# to Python I wasn’t as aware of the language differences. I guess it is easier to go down a notch in ceremony than to go up a notch.

Related post:

Plain Python

{ 23 comments }

Dynamic typing and anti-lock brakes

by John on June 9, 2010

When we make one part of our lives safer, we tend to take more chances somewhere else. Psychologists call this tendency risk homeostasis.

One of the studies often cited to support the theory of risk homeostasis involved German cab drivers. Drivers in the experimental group were given cabs with anti-lock brakes while drivers in the control group were given cabs with conventional brakes. There was no difference in the rate of crashes between the two groups. The drivers who had better brakes drove less carefully.

Risk homeostasis may explain why dynamic programming languages such as Python aren’t as dangerous as critics suppose.

Advocates of statically typed programming languages argue that it is safer to have static type checking than to not have it. Would you rather the computer to catch some of your errors or not? I’d rather it catch some of my errors, thank you. But this argument assumes two things:

  1. static type checking comes at no cost, and
  2. static type checking has no impact on programmer behavior.

Advocates of dynamic programming languages have mostly focused on the first assumption. They argue that static typing requires so much extra programming effort that it is not worth the cost. I’d like to focus on the second assumption. Maybe the presence or absence of static typing changes programmer behavior.

Maybe a lack of static type checking scares dynamic language programmers into writing unit tests. Or to turn things around, perhaps static type checking lulls programmers into thinking they do not need unit tests. Maybe static type checking is like anti-lock brakes.

Nearly everyone would agree that static type checking does not eliminate the need for unit testing. Someone accustomed to working in a statically typed language might say “I know the compiler isn’t going to catch all my errors, but I’m glad that it catches some of them.” Static typing might not eliminate the need for unit testing, but it may diminish the motivation for unit testing. The lack of compile-time checking in dynamic languages may inspire developers to write more unit tests.

See Bruce Eckel’s article Strong Typing vs. Strong Testing for more discussion of the static typing and unit testing.

Update: I’m not knocking statically typed languages. I spend most of my coding time in such languages and I’m not advocating that we get rid of static typing in order to scare people into unit testing.

I wanted to address the question of what programmers do, not what they should do. In that sense, this post is more about psychology than software engineering. (Though I believe a large part of software engineering is in fact psychology as I’ve argued here.) Do programmers who work in dynamic languages write more tests? If so, does risk homeostasis help explain why?

Finally, I appreciate the value of unit testing. I’ve spent most of the last couple days writing unit tests. But there is a limit to the kinds of bugs that unit tests can catch. Unit tests are good at catching errors in code that has been written, but most errors come from code that should have been written. See Software sins of omission.

Related posts:

All languages equally complex
Reasoning about code

{ 25 comments }

Pure functions have side-effects

by John on May 18, 2010

Functional programming emphasizes “pure” functions, functions that have no side effects. When you call a pure function, all you need to know is the return value of the function. You can be confident that calling a function doesn’t leave any state changes that will effect future function calls.

But pure functions are only pure at a certain level of abstraction. Every function has some side effect: it uses memory, it takes CPU time, etc. Harald Armin Massa makes this point in his PyCon 2010 talk “The real harm of functional programming.” (His talk is about eight minutes into the February 21, 2010 afternoon lightning talks:  video, audio.)

Even pure functions in programming have side effects. They use memory. They use CPU. They take runtime. And if you look at those evil languages, they are quite fast at doing Fibonacci or something, but in bigger applications you get reports “Hmm, I have some runtime problems. I don’t know how to get it faster or what it going wrong.

Massa argues that the concept of an action without side effects is dangerous because it disassociates us from the real world. I disagree. I appreciate his warning that the “no side effect” abstraction may leak like any other abstraction. But pure functions are a useful abstraction.

You can’t avoid state, but you can partition the stateful and stateless parts of your code. 100% functional purity is impossible, but 85% functional purity may be very productive.

Related posts:

Using Python as a functional language
F# may succeed where others have failed
Functional in the small, OO in the large

{ 20 comments }