Timing software

When measuring how long it takes to execute a program, people often report the best time out of several runs with the same input. That seems odd at first. Why report the best time? Why not report the average time?

Operating systems have more to do than just run your program. The time that elapses between the start and finish of your program may vary because of other demands on the operating system’s resources. By taking the best time, you (presumably) get an idea how much time it takes to run the program itself with minimal overhead from other operating system activity.

The best-time approach is appropriate when comparing the efficiency of two programs: compare the best time for Program A to the best time for Program B. But if you want to estimate how long you can expect to wait for Program A itself to run, rather than comparing it to anything else, then averaging the run times is reasonable.
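To make the distinction concrete, here is a minimal Python sketch that times several runs of the same command and reports both the best and the average wall-clock time. The command line ./myprog input.dat is only a placeholder for whatever program you are measuring.

```python
import statistics
import subprocess
import time

def time_command(cmd, runs=5):
    """Run an external command several times and return the elapsed wall-clock times."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return times

# "./myprog" and "input.dat" are placeholders for the program and input being measured.
times = time_command(["./myprog", "input.dat"])
print(f"best of {len(times)} runs: {min(times):.3f} s")             # closest to the program's cost with minimal OS overhead
print(f"mean of {len(times)} runs: {statistics.mean(times):.3f} s")  # closer to what a user can expect to wait
```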

(Everything above assumes constant input. This post is not about analyzing execution times with varying inputs. For example, you might use best-case timing to compare two sorting algorithms by sorting the exact same list several times. The only thing changing from run to run is the state of the operating system, not the input. Of course you might also want to measure the average run time with multiple randomly generated data sets, but that’s a different matter.)
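As a sketch of that kind of best-of-several comparison, the following Python snippet sorts the exact same list twenty times with each of two routines and reports the best time for each. The built-in sorted and the hand-written insertion sort are just stand-ins for whatever two programs or algorithms you care about.

```python
import random
import timeit

def insertion_sort(xs):
    """Deliberately naive O(n^2) sort, standing in for a second algorithm to compare."""
    out = list(xs)
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out

# One fixed input reused for every run; only machine/OS state varies between runs.
data = [random.random() for _ in range(2000)]

best_builtin = min(timeit.repeat(lambda: sorted(data), number=1, repeat=20))
best_naive = min(timeit.repeat(lambda: insertion_sort(data), number=1, repeat=20))

print(f"built-in sort,  best of 20 runs: {best_builtin:.5f} s")
print(f"insertion sort, best of 20 runs: {best_naive:.5f} s")
```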

6 thoughts on “Timing software”

  1. Re: “The only thing changing…”: can’t the cache change (hardware, not really OS), and perhaps files get memory mapped (I guess that’s OS state), so the first run would likely be longer than subsequent runs?

    Also, what about VMs that JIT (C#, Java) on the first run, so that subsequent runs have shorter startup times, especially if modules are loaded into memory on the first run (not really OS, but sort of quasi-OS)? I guess that means subsequent runs are more useful to compare to one another if you are profiling/optimizing, but I agree it wouldn’t necessarily tell you about typical operating behavior or how end users can expect the software to actually behave. This also depends a lot more on the type of process (desktop/GUI app vs. long-running server process, or mobile app where widget-style apps run for seconds and need to start up fast).

  2. Repeated runs can also be misleading due to caches. From the L1 cache (10s of kilobytes) to the L2 cache (100s of kilobytes) to the L3 cache (megabytes) to the kernel’s page cache (gigabytes) to your hard disk’s internal cache (also gigabytes), it can be quite tricky to get consistent and realistic benchmark numbers between runs.

    I know how to flush most of these — but not all — on the systems I care about benchmarking. If you want to be sure, you can always power-cycle the whole system between runs :-).

    Anyway, the point is that taking the fastest run is not always right, even for simply comparing two programs.

  3. @Janne: exactly my thought. The mean is influenced too much by outliers to rely on it for something like this. To state an extreme case: if your computer is under full load when you start the program, the measured time will not be representative of the program’s performance.

    Although, the replies from Nemo and Jared also make sense: such benchmarks are a lot more intricate than they first seem. If I needed a really good measure to account for such effects, I would run the program inside a VM that can easily be restored to a (relatively well) known state.

  4. I personally use a non-parametric Wilcoxon rank sum test to compare code timings. It sidesteps the issue of whether the best time is a fluke, does not assume that the timing distribution is unimodal, is robust to outliers, and gives me a confidence interval. A burn-in run is used to initialize caches and load shared libraries into memory. (A rough sketch of this approach appears below.)
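Here is a rough sketch of the procedure described in the comment above, assuming SciPy is available: one burn-in run is discarded, and the remaining timings of two hypothetical code versions are compared with the Wilcoxon rank-sum test (scipy.stats.ranksums). The confidence interval the commenter mentions is not computed here.

```python
import time
from scipy.stats import ranksums  # Wilcoxon rank-sum test

def sample_times(fn, runs=20):
    """Time fn() repeatedly; the first run is a burn-in and is discarded."""
    times = []
    for _ in range(runs + 1):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times[1:]  # drop the burn-in run

# Hypothetical workloads standing in for the two code versions being compared.
def version_a():
    return sum(i * i for i in range(100_000))

def version_b():
    return sum(i ** 2 for i in range(100_000))

stat, p = ranksums(sample_times(version_a), sample_times(version_b))
print(f"rank-sum statistic = {stat:.2f}, p-value = {p:.3f}")
```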

Comments are closed.