From Knuth’s book Sorting and Searching:
Computer manufacturers of the 1960’s estimated that more than 25 percent of the running time of their computers was spent on sorting, when all their customers were taken into account. In fact, there were many installations in which the task of sorting was responsible for more than half of the computing time. From these statistics we may conclude that either
- there are many important applications of sorting, or
- many people sort when they shouldn’t, or
- inefficient sorting algorithms have been in common use.
Computing has changed since the 1960’s, but not so much that sorting has gone from being extraordinarily important to unimportant.
A operations research professor once told me a similar story about the simplex algorithm, that it was taking a large percentage of cycles in the 60’s. No wonder they didn’t have Twitter back then, the computers were all busy sorting and finding convex hulls.
> Computing has changed since the 1960’s, but not so much that sorting has gone from being extraordinarily important to unimportant.
I wonder. When I think about what the kernel is doing or what my applications are doing, sorting doesn’t seem like something they’d need to do too often.
And I was reading Google’s research paper about their company wide profiling system which monitors all their servers (http://research.google.com/pubs/archive/36575.pdf), and sorting is mentioned only in the context of the formatting of the profiling results. In contrast, other tasks are singled out: compression, for example, used up *5*% of their server cycles. And apparently they do so much floating-point they considered buying co-processors for that (not mentioned is whether they found it worthwhile; I am guessing not).
They don’t specifically say that sorting *isn’t* a common task, but given how many optimizations are possible for sorting and that they mentioned zlib at 5%, I can’t believe that sorting would be anywhere close to 25% of the server cycles.
I think two is a big reason. Good programmers learn very early to maintain a data set sorted by some common variable, say, school ID, rather than sorting it in temp space for multiple uses. There are plentynof mediocre programmers who learn slowly or not at all
If you define sorting broadly as arranging data for efficient retrieval, then computers spend a great deal of time sorting: maintaining hash tables, database indexes, etc.
If you look at the sorting phase of big MapReduce jobs, you can see that the problem today is how to efficiently route (key, value) data from one set of machines (the mappers) to the destination set (the reducers), each reducer handling a subset of the (key, [value1, … , value n]), tuples -with guarantees that it is the only (non-failing) reducer to see that key, and that the map phase is complete so that all possible values for this run are complete. The problem is still there, only more distributed. And while GB ot TB of RAM means that a lot of sorting can be done in-memory, the big datasets are measured in Petabytes, with the IO bandwidth of the disks they are striped across, seek times measured in milliseconds. Sorting still matters.