From the category archives:

Computing

Last week I wrote about how memoization reduced a simulation’s execution time from 300 hours down to 38 minutes. This weekend I came up with a different approach that reduces the time down to 21 minutes by taking advantage of a recurrence relation.

The most expensive function in the simulation has four integer arguments. Each time the function is called, one of the arguments is incremented by 1. A random process determines which argument is incremented each time. This function has a set of recurrence relations that make it faster to calculate a new value if you have a previously computed value for adjacent arguments.

The recurrence relation approach is about 100x faster than the memoization approach for small runs. However, the time per run with memoization decreases with each run. The longer it runs, the more values it caches, and so the greater the chances that a value has already been computed. In the end, the recurrence relation is about 2x faster than the memoization approach. With an even larger run, I suspect the time for the two approaches would be roughly equal.

The new approach is faster, but it was more work to implement, and it depends critically on the assumption that consecutive arguments differ only by one increment of one argument.

Related post:

20 weeks down to 20 minutes

{ 2 comments }

Twenty weeks down to twenty minutes

by John on January 26, 2012

I had a simulation this week that I estimated would take 20 weeks to run. This estimate was based on too small a run. A more reliable estimate based on a longer run was 300 CPU hours, about two weeks running on a single processor. Before I split the problem up to run in parallel, I wanted to see if I could make each thread run faster.

The most expensive part of the simulation is a function that takes in four integer parameters and does a numerical integration. Each parameter could be any value from 0 to 120. So in principle there are 121^4 = 214,358,881 possible parameter combinations. I knew that some of these combinations were extremely unlikely or impossible, so I thought I could reduce the execution time by caching the most commonly called parameters. Maybe a few thousand arguments would account for most of the function calls.

I saved the function arguments and computed values in map, a.k.a. dictionary, associative array, etc. (The buzz word for this is memoization.) I set a maximum size on the map because I thought it could grow very large. Even if, say, a thousand arguments accounted for most of the function calls, maybe there were many more that only appeared once.

It turned out that the function was only called with 408 distinct parameter combinations. That means the same integrals were being evaluated millions of times. Caching these values reduced the execution time by a factor of about 800. In other words, the software could retrieve a previously computed value 800x faster than compute it from scratch.

The simulation that would have taken 300 hours took only 22 minutes, and so there was no need to make it run in parallel.

Update: I found a bug in my hashing algorithm. After fixing the bug, the time to run my simulations increased to 38 minutes. So my speed-up was a about a factor of 420 rather than 800, but still enough to run on one thread over lunch. The number of distinct arguments to the numerical integration was much more than 408 in the corrected program, but still small enough to cache.

{ 6 comments }

The most brutal man page

by John on January 25, 2012

In The Linux Command Line, the author describes the bash man page* as “the most brutal man page of them all.”

Many man pages are hard to read, but I think that the grand prize for difficulty has to go to the man page for bash. As I was doing my research for this book, I gave it a careful review to ensure that I was covering most of its topics. When printed, it’s over 80 pages long and extremely dense, and its structure makes absolutely no sense to a new user.

On the other hand, it is very accurate and concise, as well as being extremely complete. So check it out if you dare, and look forward to the day when you can read it and it all makes sense.

* If you’re not familiar with Unix lingo, “man” stands for “manual”. Man pages are online documentation.

Related post:

Review of The Linux Command Line

{ 15 comments }

Review: The Linux Command Line

by John on January 25, 2012

No Starch Press recently released The Linux Command Line: A Complete Introduction by William E. Shotts, Jr.

True to its name, the book is about using Linux from command line. It’s not an encyclopedia of Linux. It doesn’t explain how to install Linux, doesn’t go into system APIs, and says little about how to administer Linux. At the same time, the book is broader than just a book on bash. It’s about how to “live” at the command line.

The introduction explains the intended audience.

This book is for Linux users who have migrated from other platforms. Most likely you are a “power user” of some version of Microsoft Windows.

The book has a conversational style, explaining the motivation behind ways of working as well as providing technical detail. It includes small but very useful suggestions along the way, the kinds of tips you’d pick up from a friend but might not find in a book.

The book has four parts

  1. Learning the shell
  2. Configuration and the environment
  3. Common tasks and essential tools
  4. Writing shell scripts

The book could have just included the first three sections; the forth part is a bit more specialized than the others. If you’d prefer, think of the book has having three parts, plus a lengthy appendix on shell scripting.

The Linux Command Line is pleasant to read. It has a light tone, while also getting down to business.

Related post:

Perverse hipster desire for retro-computing

{ 8 comments }

Rushing to see numbers

by John on January 19, 2012

Scientific programmers and algebra students start out with analogous bad habits.

Beginning algebra students rush to enter numbers into a calculator. They find it comforting to reduce expressions to floating point numbers as frequently as possible. This is understandable, but it’s a habit they need to break for numerous reasons: it’s more work, harder for a grader to follow, less accurate, etc. Better to do as much calculation as possible with symbols, then stick in numbers at the end.

A similar phenomena happens in scientific programming. We’re anxious to see numbers, so we print out numbers as soon as we produce them. There’s a tendency to sprinkle code with printf statements, then write scripts to scrape text output to gather results.

It’s better to keep numbers in data structures as long as possible, then dump the data to text as the last step. Why? For one thing, the output format might change: instead of a text dump, maybe you’ll want to put the data in a database or format it as an HTML table. More importantly, the “last step” will often change. What was the last step now becomes the next-to-last step because you think of something else to do. So you do more with the data in memory before producing output, rather than scraping the text output.

I quickly learned to delay plugging numbers into algebraic expressions. It took me longer to learn not to output numeric results until the last moment.

{ 13 comments }

double.Epsilon != DBL_EPSILON

by John on January 5, 2012

Here’s a pitfall in C# that keeps coming up. C# has a constant double.Epsilon that programmers coming from C naturally assume is the same as C’s DBL_EPSILON. It’s not. In fact, the former is hundreds of orders of magnitude smaller.

C#’s double.Epsilon is the closest floating point number to 0. C’s DBL_EPSILON is the distance between 1 and the closest floating point number greater than 1. Said another way, DBL_EPSILON is the smallest positive floating point number x such that 1 + x != 1, often called “machine epsilon.”

Typically double.Epsilon is on the order of 10^-324 and DBL_EPSILON is on the order of 10^-16. (These values could potentially change depending on the platform, but they hardly ever do.)

C# has no constant corresponding to DBL_EPSILON. This is unfortunate, since this constant appears frequently in numerical software. Why? Because it tells you, for example, when to stop adding series.

If DBL_EPSILON is on the order of 10^-16, that means that if you add two numbers that differ by more than 16 orders of magnitude, the sum doesn’t change. If you’re summing a decreasing series of numbers, say in order to evaluate a Taylor approximation, you might as well stop once the next term is 16 orders of magnitude smaller than the sum. If you keep going past that point, you’ll burn CPU cycles but you won’t change your answer.

DBL_EPSILON is almost always about 10^-16. But by giving it a name, you avoid having 10^-16 as a mysterious constant throughout code. And if your code should ever move to an environment with different floating point resolution, your code will correctly adjust to the new platform.

Related links:

An introduction to numerical programming in C#
Anatomy of a floating point number

{ 8 comments }

Just what do you mean by ’scale’?

by John on January 4, 2012

“Fancy algorithms are slow when n is small, and n is usually small.” — Rob Pike

Someone might object that Rob Pike’s observation is irrelevant. Everything is fast when the problem size n is small, so design your code to be efficient for large n and don’t worry about small n. But it’s not that simple.

Suppose you have two sorting algorithms, Simple Sort and Fancy Sort. Simple Sort is more efficient for lists with less than 50 element and Fancy Sort is more efficient for lists with more than 50 elements.

You could say that Fancy Sort scales better. What if n is a billion? Fancy Sort could be a lot faster.

But there’s another way a problem could scale. Instead of sorting longer lists, you could sort more lists. What if you have a billion lists of size 40 to sort?

People toss around the term “scaling,” assuming everyone has the same notion of scaling. But projects could scale along different dimensions. Whether Simple Sort or Fancy Sort scales better depends on how the problem scales.

The sorting example just has two dimensions: the length of each list and the number of lists. Software trade-offs are often much more complex. The more dimensions a problem has, the more opportunities there are for competing solutions to each claim that it scales better.

Related posts:

{ 12 comments }

Moore’s law squared

by John on January 1, 2012

In a review of linear programming solvers from 1987 to 2002, Bob Bixby says that solvers benefited as much from algorithm improvements as from Moore’s law.

Three orders of magnitude in machine speed and three orders of magnitude in algorithmic speed add up to six orders of magnitude in solving power. A model that might have taken a year to solve 10 years ago can now solve in less than 30 seconds.

Source: In Pursuit of the Traveling Salesman

Related post:

A brief note on Moore’s law

{ 14 comments }

Advantages of everything-is-text

by John on December 30, 2011

In a typical Windows program, some output text can be copied as text but some is only an image. For example, the text in the body of a document is text, but the words on the menus are not. The text in a dialog box may or may not be.

In Emacs, text is just text, no matter where you see it. This means that Emacs is less visually interesting because there is less variety. But this also has several advantages that may not be immediately obvious. I’ll give a couple examples.

The Emacs directory editor dired lists files in a text buffer. The output is essentially the output of ls. (On Windows, Emacs emulates ls.) That means you can do anything to the output you could do to text: copy, paste, search, etc. No text is off-limits. And in editable mode, you can rename files just by editing their names. You could, for example, rename many files at once using regular expressions.

Dired also lets you navigate the file system with the same commands you use to edit files. In particular, you can keep your hands on the keyboard. (It’s possible to use the Windows file explorer without a mouse, but it’s limited and awkward.)

Another advantage of everything simply being text is that you can run a shell inside Emacs. Why not? A shell is just text output. In an ordinary shell, editing commands are mostly limited to the current line. Not all text is created equal. But when you’re running a shell inside an editor, all text is equal: current line, previous lines, and text on either side of the command prompt. And if you want a transcript of a shell session, just save the buffer displaying the session.

If you’re shopping for a text editor and spend an hour trying each one out, Emacs may not look very promising. It has odd terminology, weird key strokes, etc. But if you try Emacs long enough, and particularly if you have someone to help you get started, you start to appreciate the advantages of subtle features such as everything simply being a text buffer.

Related posts:

The importance of being textual
Mental context switches are evil
Personal organization software

{ 10 comments }

Selective use of technology

by John on December 8, 2011

In his book The Nine, Jeffrey Toobin gives a few details of former Supreme Court Justice David Souter’s decidedly low-tech life. Souter has no cell phone or voice mail. He does not use email. He was given a television once but never turned it on. He moves his chair around his office throughout the day so he can read by natural light. Toobin says Souter lives something like an eighteenth century gentleman.

I find it interesting that Justice Souter would have such independence of mind that he chooses not to use much of the technology that our world takes for granted. He made it to the top of his profession and had a job for life, so he could afford to be eccentric. But he wasn’t born on the Supreme Court. I would like to know whether his low-tech work habits developed before or after his legal success.

I imagine most readers of this blog could more easily relate to Donald Knuth than David Souter. Knuth obviously doesn’t reject technology, but he is selective in how he uses it.

I had the opportunity to see Knuth speak while I was in college. Much to my surprise, his slides were handwritten. The author of TeX didn’t see the need to use TeX for his slides. While he cares about the fine details of how math looks in print, he apparently didn’t feel it was worth the effort to typeset his notes for an informal presentation.

In 1990 Knuth decided to stop using email.

I’d used email since about 1975, and it seems to me that 15 years of email is plenty for one lifetime. Email is a wonderful thing for people whose role in life is to be on top of things. But not for me; my role is to be on the bottom of things.

I believe I’ve read that Knuth does most of his work on a Linux box with no network connection. He also has a Mac for creating graphics and using the Internet. He has a secretary to handle his correspondence, including email.

If you’re reading legal briefs by sunlight, your thoughts will not be exactly the same as they would be if you were reading by fluorescent light. If you’re writing a presentation by hand, you’re not going to think the same way you would if you were pecking on a computer keyboard. And if you do use a computer, your thinking is subtlety different depending on what program you use. Technology affects the way you think. The effect is not uniformly better or worse, but it is certainly real.

Related posts:

Create offline, analyze online
Tim Bray’s high-tech monastic cell
Living within chosen limits

{ 14 comments }

Unix tool tips

by John on November 10, 2011

I’ve renamed my SedAwkTip twitter account to UnixToolTip to reflect its new scope. If you were following SedAwkTip, there’s no need to do anything. You’ll just see a different name.

I have about a week’s worth of sed and awk tips scheduled. Then I’ll start adding in tips on grep, find, uniq, etc. And I’ll come back to sed and awk now and then.

These tools came from the Unix world, but they’re also available on Windows.

For now I’m keeping the original icon. I’m open to suggestions if someone has an idea for a better icon.

s///

Related posts:

Thermonuclear word processor
Retro computing

{ 2 comments }

More than sed and awk

by John on November 8, 2011

I’m going to expand the content of my SedAwkTip twitter account. I’ve covered the most commonly used features of sed and awk, and rather than go into more advanced/obscure features of these languages, I’m going to add tips on other common command line software.

I’ll probably change the name of the account to reflect the new content. I’ll cycle back to sed and awk tips now and then.

Update: I’ve renamed SedAwkTip to UnixToolTip.

The applications I plan to go into ship as part of Linux and OS X, and they’re available for Windows here.

Other daily tip Twitter accounts:

Computing:

Math:

{ 2 comments }

Software knowledge shelf life

by John on October 18, 2011

In my experience, software knowledge has a longer useful shelf life in the Unix world than in the Microsoft world. (In this post Unix is a shorthand for Unix and Linux.)

A pro-Microsoft explanation would say that Microsoft is more progressive, always improving their APIs and tools, and that Unix is stagnant.

A pro-Unix explanation would say that Unix got a lot of things right the first time, that it is more stable, and that Microsoft’s technology turn-over is more churn than progress.

Pick your explanation. But for better or worse, change comes slower on the Unix side. And when it comes, it’s less disruptive.

At least that’s how it seems to me. Although I’ve used Windows and Unix, I’ve done different kinds of work on the two platforms. Maybe the pace of change relates more to the task than the operating system. Also, I have more experience with Windows and so perhaps I’m more aware of the changes there. But most of the things I knew about Unix 20 years ago are still useful, and most the things I knew about Windows 10 years ago are not.

Related posts:

Programmers without computers
Where the Unix philosophy breaks down
Software development and the myth of progress

{ 19 comments }

A brief note on Moore’s law

by John on October 14, 2011

One variation on Moore’s law says that computer performance doubles every 18 months. Another says performance improves by a factor of 100 every decade. The two are equivalent, though the latter sounds more impressive.

Exponential growth doesn’t mesh well with human intuition. Scott Berkun said “Technological progress is overestimated in the short term, underestimated in the long term.” This is probably not limited to technology but true of exponential growth in general.

(Why are the two statements of Moore’s law equivalent? The first says performance grows like exp( log(2) t / 1.5 ) and the second says it grows like exp( log(100) t / 10 ). And log(2) / 1.5 = 0.462 and log(100)/10 = 0.461.)

Related post:

Moore’s law and software bloat

{ 13 comments }

System administration in remote areas

by John on August 14, 2011

Last February I interviewed Rick Richter, CIO of Food for the Hungry, a Christian relief and development organization. This morning I spoke with Rick and was reminded of some of the challenges involved in supporting computers in poor parts of the world.

Traveling to areas most in need of relief is arduous. You may, for example, have two or three days of travel by land or water after your airplane lands. So naturally you want to do as much system administration remotely as you can.

In general you can’t expect broadband in the poorest areas, though there are some surprises. You often have to rely on satellite connections, though these are slow, unreliable, and expensive. Dial-up is preferable, if you can get it. Bandwidth may be adequate for command line access, though even then the latency is annoying. Video conferencing (e.g. for showing someone how to do something) is out of the question. So remote locations need to be mostly self-sufficient.

To read more about Food for the Hungry and their computing environment, see Why Food for the Hungry runs Ubuntu.

{ 0 comments }