Math/CS cheat sheet

Here’s something called a theoretical computer science cheat sheet. I don’t know whether I agree with the name, but it’s a nice cheat sheet.

The first two pages of the cheat sheet have to do with sums, combinatorics, and recurrence relations, the kinds of things you’d find in Concrete Mathematics and certainly useful in theoretical computer science. The chart mentions the “master method” that I blogged about here. But a large part of the cheat sheet is devoted to calculus and other general mathematics not as directly related to computer science.

There’s a nice summary of finite calculus on page 8, a useful little area of math that’s not very widely known.
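
In case the term is unfamiliar, here’s the one-line idea (my summary, not from the cheat sheet): finite calculus replaces the derivative with the forward difference Δf(x) = f(x+1) − f(x) and ordinary powers with falling powers x^(m) = x(x−1)⋯(x−m+1). The power rule then carries over exactly, Δ x^(m) = m x^(m−1), and sums play the role of integrals, so, for example, Σ_{k=0}^{n−1} k = n^(2)/2 = n(n−1)/2.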

Thanks to Bill the Lizard for pointing out the cheat sheet.

Counterfeit coins and rare diseases

Here’s a puzzle I saw a long time ago that came to mind recently.

You have a bag of 27 coins. One of these coins is counterfeit and the rest are genuine. The genuine coins all weigh exactly the same, but the counterfeit coin is lighter. You have a simple balance. How can you find the counterfeit coin using the scales as few times as possible?

The surprising answer is that the false coin can be found using the scales only three times. Here’s how. Put nine coins on each side of the balance. If one side is lighter, the counterfeit is on that side; otherwise, it is one of the nine coins not on the scales. Now that you’ve narrowed it down to nine coins, apply the same idea recursively by putting three of the suspect coins on each side of the balance. The false coin is now either on the lighter side, if the scales do not balance, or one of the three remaining coins, if the scales do balance. Now apply the same idea one last time to find which of the remaining three coins is the counterfeit. In general, you can find one counterfeit among 3^n coins by using the scales n times.
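
Here’s a minimal Python sketch of the weighing strategy, assuming a simulated balance where genuine coins all weigh the same and the counterfeit weighs less. The function name and weights are mine, for illustration only.

```python
def find_counterfeit(coins):
    # Locate the single light coin by repeatedly splitting the
    # suspects into thirds and "weighing" the first two thirds.
    # Assumes len(coins) is a power of 3 and exactly one coin is light.
    lo, hi = 0, len(coins)                             # current range of suspects
    while hi - lo > 1:
        third = (hi - lo) // 3
        left = sum(coins[lo:lo + third])               # left pan
        right = sum(coins[lo + third:lo + 2 * third])  # right pan
        if left < right:                 # counterfeit on the left pan
            hi = lo + third
        elif right < left:               # counterfeit on the right pan
            lo, hi = lo + third, lo + 2 * third
        else:                            # counterfeit among the coins left off
            lo = lo + 2 * third
    return lo

coins = [10] * 27   # 27 genuine coins of weight 10...
coins[13] = 9       # ...except a lighter counterfeit at index 13
assert find_counterfeit(coins) == 13
```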

The counterfeit coin problem came to mind when I was picking out homework problems for a probability class and ran into the following (problem 4.56 here):

A large number, N = mk, of people are subject to a blood test. This can be administered in two ways.

  1. Each person can be tested separately. In this case N tests are required.
  2. The blood samples of k people can be pooled and analyzed together. If the test is negative, this one test suffices for k people. If the test is positive, each of the k people must be tested separately, and, in all, k+1 tests are required for the k people.

Suppose each person being tested has the disease with probability p. If the disease is rare, i.e., if p is sufficiently small, the second approach will be more efficient. Consider the extremes. If p = 0, the first approach takes mk tests and the second approach takes only m tests. At the other extreme, if p = 1, the first approach still takes mk tests, but the second approach now takes m(k+1) tests.

The homework problem asks for the expected number of tests used with each approach as a function of p for fixed k. Alternatively, you could assume that you always use the second method but need to find the optimal value of k. (This includes the possibility that k=1, which is equivalent to using the first method.)
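
For what it’s worth, the pooled expectation falls out directly: a group of k needs one test with probability (1 − p)^k and k + 1 tests otherwise, so the expected number of tests per group is k + 1 − k(1 − p)^k. Here’s a minimal Python sketch along these lines; the function names are mine, and it restricts k to divisors of N so the groups come out even.

```python
def expected_tests(N, k, p):
    # Expected number of tests to screen N people in pools of size k,
    # assuming N is a multiple of k and infections are independent.
    if k == 1:
        return N                      # first method: everyone tested separately
    q = (1 - p) ** k                  # probability a pooled sample tests negative
    return (N // k) * (k + 1 - k * q)

def best_pool_size(N, p):
    # Brute-force search over the divisors of N for the best pool size.
    divisors = [k for k in range(1, N + 1) if N % k == 0]
    return min(divisors, key=lambda k: expected_tests(N, k, p))

N, p = 1000, 0.01
k = best_pool_size(N, p)
print(k, expected_tests(N, k, p))     # far fewer than N tests when p is small
```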

I’d be curious to know whether these algorithms have names. I suspect computer scientists have given the coin testing algorithm a name. I also suspect the idea of pooling blood samples has several names, possibly one name when it is used in literally testing blood samples and other names when the same idea is applied to analogous testing problems.

The cost of breaking things down and putting them back together

The basic approach to solving large, complex problems is to break them down into smaller problems and then re-assemble the solutions. The total cost of a solution has to account not only for the cost of solving all the sub-problems, but also for the cost of dividing the original problem into sub-problems and the cost of assembling the sub-problem solutions into a solution to the original problem. For example, the cost of writing a program is not simply the cost of writing each of the subroutines that make up the final program. The cost has to include the cost of dividing the original task into subroutines and the cost of integrating those subroutines into the final product.

There is a theorem in computer science that quantifies the fuzzy statements above.

Let a ≥ 1 and b > 1 be constants. Let T(n) be defined on the non-negative integers by the recurrence T(n) = a T(n/b) + f(n). In applications, T(n) is the total cost of solving a problem of size n. The recurrence relation says we have a way of breaking the original problem into a sub-problems, each of size n/b. The cost of breaking the original problem into sub-problems and assembling the sub-problem solutions into the final solution is described by f(n).

The theorem, which Introduction to Algorithms (ISBN 0262032937) calls the “master theorem,” says that the total cost T(n) of the solution can be bounded according to three separate cases.

  1. If f(n) = O(n^(log_b a − ε)) for some constant ε > 0, then T(n) = Θ(n^(log_b a)).
  2. If f(n) = Θ(n^(log_b a)), then T(n) = Θ(n^(log_b a) lg n).
  3. If f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and if a f(n/b) ≤ c f(n) for some constant c < 1 and all sufficiently large n, then T(n) = Θ(f(n)).

The notation O, Ω, and Θ is explained in these notes on asymptotic order notation.
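
As a sanity check on case 2, here’s a small Python sketch (my example, not from the book) for the mergesort-style recurrence T(n) = 2 T(n/2) + n. With a = 2 and b = 2 we have n^(log_b a) = n, so f(n) = Θ(n^(log_b a)) and the theorem predicts T(n) = Θ(n lg n); the computed ratio T(n)/(n lg n) settles toward a constant.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def T(n):
    # Mergesort-style recurrence: a = 2, b = 2, f(n) = n.
    if n <= 1:
        return 1
    return 2 * T(n // 2) + n

for k in (10, 15, 20):
    n = 2 ** k
    print(n, T(n) / (n * math.log2(n)))  # ratio approaches a constant (here 1 + 1/lg n)
```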

The main point here is that the term f(n), the one we tend to think less about, is the most important term in the theorem. T(n), the total time for the solution, falls into one of three cases depending on whether f(n) grows more slowly than, at the same rate as, or faster than a multiple of n^(log_b a).

Strictly speaking, the theorem applies to the cost (operation count) of running a program. However, you can draw an analogy from the theorem to the cost of developing a program as well. You have to consider the cost of the human subdividing the original problem (design) and the cost of making the pieces work together (integration).