Better R console fonts

The default installation of R on Windows uses Courier New for the console font. Unfortunately, this font offers low contrast between the letter ‘l’ and the number ‘1.’ There is also poor contrast between the letter ‘O’ and the number ‘0.’ The contrast between periods and commas is OK.

Lucida Console is an improvement. It has high contrast between ‘l’ and ‘1’, though ‘O’ and ‘0’ are still hard to distinguish. But my favorite console font is Consolas. It offers strong contrast between ‘l’ and ‘1’, commas and periods, and especially between lower case ‘o’, upper case ‘O’, and the number ‘0.’

Consolas is more legible while also fitting more characters into the same horizontal space. It can do this because it uses ClearType anti-aliasing while the other two fonts do not. Here is a sample of the three fonts magnified 4x to show the anti-aliasing.

I found setting the default console font in R a little tricky. Clicking on the Edit -> GUI preferences menu brings up the Rgui Configuration Editor. From there it’s obvious how to change the font. However, what I found surprising is that clicking the “OK” button only changes the font for the current session. I can’t think of another application that behaves analogously. To set your choice of font for all future sessions, click “Save” rather than “OK.”
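
(For what it’s worth, my understanding is that “Save” writes these settings to a plain-text file named Rconsole, either in your home directory or in the etc directory of your R installation, with entries along the lines of the sketch below. Treat the location and the exact syntax as assumptions and check your own installation.)

	font = TT Consolas
	points = 10
	style = normal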

The cost of breaking things down and putting them back together

The basic approach to solving large, complex problems is to break them down into smaller problems, then re-assemble the solutions. The total cost of a solution has to account for not only the cost of solving all the sub-problems, but also the cost of dividing the original problem into sub-problems and the cost of assembling the solutions to the sub-problems into a solution to the original problem. For example, the cost of writing a program is not simply the sum of the costs of the subroutines that comprise the final program. The cost has to include the cost of dividing the original task into subroutines and the cost of integrating these subroutines into the final product.

There is a theorem in computer science that quantifies the fuzzy statements above.

Let a ≥ 1 and b > 1 be constants. Let T(n) be defined on the non-negative integers by the recurrence T(n) = a T(n/b) + f(n). In applications, T(n) is the total cost of solving a problem of size n. The recurrence says we have a way of breaking the original problem into a sub-problems, each of size n/b. The cost of dividing the original problem into sub-problems and assembling their solutions into the final solution is described by f(n).

The theorem, which Introduction to Algorithms calls the “master theorem,” says that the total cost T(n) of the solution can be bounded according to three separate cases.

  1. If f(n) = O(n^(log_b a − ε)) for some constant ε > 0, then T(n) = Θ(n^(log_b a)).
  2. If f(n) = Θ(n^(log_b a)), then T(n) = Θ(n^(log_b a) lg n).
  3. If f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and if a f(n/b) ≤ c f(n) for some constant c < 1 and all sufficiently large n, then T(n) = Θ(f(n)).

The notation O, Ω, and Θ is explained in these notes on asymptotic order notation.

The main point here is that the term f(n), the one we tend to think less about, is the most important term in the theorem. T(n), the total cost of the solution, falls into one of three cases depending on whether f(n) grows more slowly than, at the same rate as, or faster than a multiple of n^(log_b a).
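
As a concrete illustration (my own example, not from the original discussion), consider the merge sort recurrence T(n) = 2 T(n/2) + n. Here a = b = 2, so n^(log_b a) = n, and f(n) = n grows at the same rate, so the second case applies and T(n) = Θ(n lg n). A short R sketch confirms the growth rate numerically:

	# Merge-sort-style recurrence: a = 2, b = 2, f(n) = n, with T(1) = 1.
	cost <- function(n) if (n == 1) 1 else 2 * cost(n / 2) + n
	for (k in c(5, 10, 15, 20)) {
	  n <- 2^k
	  cat(n, cost(n) / (n * log2(n)), "\n")  # the ratio approaches 1, consistent with Θ(n lg n)
	}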

Strictly speaking, the theorem applies to the cost (operation count) of running a program. However, you can draw an analogy from the theorem to the cost of developing a program as well. You have to consider the cost of the human subdividing the original problem (design) and the cost of making the pieces work together (integration).

Why there will always be programmers

The latest episode of the .NET Rocks podcast is an interview with Robert Martin. In the interview, Robert Martin gives his view of why there will always be programmers. Here’s my summary/paraphrase of his argument.

Systems are full of details, and somebody has to manage those details. That’s what programmers do. Maybe some day, for example, we will draw diagrams instead of typing lines of text in order to specify computer systems. Fine, but then those diagrams will have lots of detail, and we will need professionals who can manage that complexity. Call them programmers.

Carl Franklin, one of the hosts, added that part of a programmer’s responsibility is to know what needs to happen when things go wrong, which is part of managing the details of a system.

The essential challenge of programming is managing complexity. Those who think the difficulty is primarily translating ideas into computer language syntax line-by-line haven’t written large programs. Writing small programs is challenging, and not many people can do it. But far fewer people can write large programs.

To write a big program, you just break it into lots of small programs, right? Well, that’s true in a sense, in the same sense that writing a book is merely a matter of writing chapters, which is merely a matter of writing paragraphs, and so on. But writing books is hard because the pieces have to hang together in a coherent whole. If part of a book doesn’t quite fit with the whole, the result is aesthetically disappointing. If a part of a program doesn’t quite fit in, it’s called a crash. Paper doesn’t abort, but programs do.

42nd Carnival of Mathematics

It’s the 42nd Carnival of Mathematics. Don’t panic!

First, the customary trivia about the number of the Carnival.

  • 42, Jackie Robinson’s jersey number, is the only number retired by all Major League Baseball teams.
  • 42 is a domino game, especially popular in Texas.
  • 42 is the top result for the Google search “answer to life, the universe, and everything.”

And now on to the posts.

TwoPi at the 360 blog presents two fun articles using advanced techniques to prove elementary results, Using calculus to derive the quadratic equation and Integration by parts via iterated integrals.

Next, Barry Leiba discusses the use of logic in mathematics in Modus ponens, modus tollens, and proof by contradiction posted at Staring At Empty Pages. The footnotes at the end have some interesting information on prime numbers, factorization, and cryptography.

Moving from philosophical logic to hardware logic, Himadri Choudhury illustrates how a seemingly trivial problem becomes interesting when you look into it deeply enough. He presents a clever algorithm for simultaneously adding three integers more efficiently in a microchip. Check out Multiple Operand Adder posted at Thread Pool for an insight into the work that goes into operations most of us take for granted.

Moving higher up the computing levels of abstraction, Samuel Jack combines mathematics and computer science in presenting his solutions to problems 18 and 67 from Project Euler. See Project Euler Problems 18 and 67: Finding the Maximal Route through a Triangle posted at Functional Fun. He also presents his solution to problem 15 in the same series in City Grids and Pascal’s Triangle.

Finally, MBB explains how the check-sum algorithm works on major credit cards in Generation Of Luhn Algorithm Credit Card Numbers posted at Money Blue Book Finance Blog. This algorithm will detect any error that changes a single digit and will detect most errors that transpose consecutive digits.
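
As an aside, here is a minimal R sketch of the Luhn check being described (my own illustration; the number below is just a test value that passes the check, not a real card number):

	# Luhn check: starting from the rightmost digit, double every second digit,
	# subtract 9 from any doubled value over 9, and require the total to be
	# divisible by 10.
	luhn_valid <- function(number) {
	  digits <- rev(as.integer(strsplit(number, "")[[1]]))
	  scaled <- digits * ifelse(seq_along(digits) %% 2 == 0, 2, 1)
	  scaled <- ifelse(scaled > 9, scaled - 9, scaled)
	  sum(scaled) %% 10 == 0
	}
	luhn_valid("79927398713")  # TRUE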

Jason Dyer will host the 43rd Carnival of Mathematics at The Number Warrior on November 7. Submit your articles for the next Carnival here.

Three math diagrams

Here are three pages containing diagrams that each summarize several theorems.

The distribution relationships page summarizes around 40 relationships between around 20 statistical distributions.

The modes of convergence page has three diagrams like the following that explain when one kind of convergence implies another: almost everywhere, almost uniform, Lp, and convergence in measure.

The page conjugate prior relationships is similar to the probability distribution page, but concentrates on relationships in Bayesian statistics.

What are some of your favorite diagrams? Do you have any suggestions for diagrams that you’d like to see someone make?

Five kinds of subscripts in R

Five kinds of things can be subscripts in R, and they all behave differently.

  1. Positive integers
  2. Negative integers
  3. Zero
  4. Booleans
  5. Nothing

For all examples below, let x be the vector (3, 1, 4, 1, 5, 9).

Positive integers

Ordinary vector subscripts in R start with 1, like FORTRAN and unlike C and its descendants. So for the vector above, x[1] is 3, x[2] is 1, etc. R doesn’t actually have scalar types; everything is a vector, so subscripts are vectors. In the expression x[2], the subscript is a vector containing a single element equal to 2. But you could use the vector (2, 3) as a subscript of x, and you’d get (1, 4).
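
For example, in an R session:

	x <- c(3, 1, 4, 1, 5, 9)
	x[2]        # 1
	x[c(2, 3)]  # 1 4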

Negative integers

Although subscripts that reference particular elements are positive, negative subscripts are legal. However, they may not do what you’d expect. In scripting languages, it is conventional for negative subscripts to indicate indexing from the end of the array. So in Python or Perl, for example, the statement y = x[-1] would set y equal to 9 and y = x[-2] would set y equal to 5.

In R, a negative subscript is an instruction to remove an element from a vector. So y = x[-2] would set y equal to the vector (3, 4, 1, 5, 9), i.e. the vector x with the element x[2] removed.
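
A quick sketch of the difference:

	x <- c(3, 1, 4, 1, 5, 9)
	x[-2]         # drop the second element: 3 4 1 5 9
	x[c(-1, -2)]  # drop the first two elements: 4 1 5 9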

While R’s use of negative subscripts is unconventional, it makes sense in context. In some ways vectors in R are more like sets than arrays. Removing elements is probably a more common task than iterating backward.

Zero

So if positive subscripts index elements, and negative subscripts remove elements, what does a zero subscript do? Nothing. It doesn’t even produce an error. It is silently ignored. See Radford Neal’s blog post about zero subscripts in R for examples of how bad this can be.
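
To see this in a session:

	x <- c(3, 1, 4, 1, 5, 9)
	x[0]        # numeric(0), an empty vector, with no warning
	x[c(0, 2)]  # the zero is silently dropped: 1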

Booleans

Boolean subscripts are very handy, but they look strange to the uninitiated. Ask a C programmer to guess what x[x>3] would be and I doubt they could. A Boolean expression involving a vector evaluates to a vector of Boolean values, the result of evaluating the expression componentwise. So for our value of x, the expression x>3 evaluates to the vector (FALSE, FALSE, TRUE, FALSE, TRUE, TRUE). When you use a Boolean vector as a subscript, the result is the subset of elements whose positions correspond to TRUE values in the subscript. So x[x>3] is the subset of x consisting of elements larger than 3, i.e. x[x>3] equals (4, 5, 9).

When a vector with a Boolean subscript appears in an assignment, the assignment applies to the elements that would have been extracted if there had been no assignment. For example, x[x > 3] <- 7 turns (3, 1, 4, 1, 5, 9) into (3, 1, 7, 1, 7, 7). Also, x[x > 3] <- c(10, 11, 12) would produce (3, 1, 10, 1, 11, 12).
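
Putting this together in a session:

	x <- c(3, 1, 4, 1, 5, 9)
	x > 3         # FALSE FALSE TRUE FALSE TRUE TRUE
	x[x > 3]      # 4 5 9
	x[x > 3] <- 7
	x             # 3 1 7 1 7 7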

Nothing

A subscript can be left out entirely. So x[] would simply return x. In multi-dimensional arrays, missing subscripts are interpreted as wildcards. For example, M[3,] would return the third row of the matrix M.
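
For example:

	x <- c(3, 1, 4, 1, 5, 9)
	x[]                         # all of x
	M <- matrix(1:6, nrow = 3)  # a 3-by-2 matrix
	M[3, ]                      # the third row: 3 6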

Mixtures

Fortunately, mixing positive and negative values in a single subscript array is illegal. But you can, for example, mix zeros and positive numbers. And since numbers can be NA, you can even include NA as a component of a subscript.
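
A few examples of mixed subscripts:

	x <- c(3, 1, 4, 1, 5, 9)
	x[c(0, 2, 4)]  # zeros are dropped: 1 1
	x[c(2, NA)]    # an NA subscript yields an NA element: 1 NA
	# x[c(2, -4)]  # error: positive and negative subscripts can't be mixed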

See also R language for programmers.

What happens when you add a new teller?

Suppose a small bank has only one teller. Customers take an average of 10 minutes to serve and they arrive at the rate of 5.8 per hour. What will the expected waiting time be? What happens if you add another teller?

We assume customer arrivals and customer service times are random (details later). With only one teller, customers will have to wait nearly five hours on average before they are served. But if you add a second teller, the average waiting time is not just cut in half; it goes down to about 3 minutes. The waiting time is reduced by a factor of 93.

Why was the wait so long with one teller? There’s not much slack in the system. Customers are arriving every 10.3 minutes on average and are taking 10 minutes to serve on average. If customer arrivals were exactly evenly spaced and each took exactly 10 minutes to serve, there would be no problem. Each customer would be served before the next arrived. No waiting.

The service and arrival times have to be very close to their average values to avoid a line, but that’s not likely. On average there will be a long line, 28 people. But with a second teller, it’s not likely that even two people will arrive before one of the tellers is free.

Here are the technical footnotes. This problem is a typical example from queuing theory. Customer arrivals are modeled as a Poisson process with λ = 5.8/hour. Customer service times are assumed to be exponential with mean 10 minutes. (The Poisson and exponential distribution assumptions are common in queuing theory. They simplify the calculations, but they’re also realistic for many situations.) The waiting times given above assume the model has approached its steady state. That is, the bank has been open long enough for the line to reach a sort of equilibrium.

Queuing theory is fun because it is often possible to come up with surprising but useful results from simple equations. For example, for a single server queue, the expected waiting time is λ/(μ(μ – λ)) where λ is the arrival rate and μ is the service rate. In our example, λ = 5.8 per hour and μ = 6 per hour. Some applications require more complicated models that do not yield such simple results, but even then the simplest model may be a useful first approximation.
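
Here is a short R sketch (my own check, using the standard M/M/1 and M/M/2 formulas) that reproduces the numbers above:

	lambda <- 5.8  # arrivals per hour
	mu     <- 6.0  # services per hour per teller
	# One teller (M/M/1): expected wait in line
	wq1 <- lambda / (mu * (mu - lambda))     # about 4.8 hours
	# Two tellers (M/M/2): expected wait via the Erlang C formula
	a   <- lambda / mu                       # offered load
	rho <- lambda / (2 * mu)                 # utilization per teller
	pw  <- ((a^2 / 2) / (1 - rho)) /
	       (1 + a + (a^2 / 2) / (1 - rho))   # probability a customer must wait
	wq2 <- pw / (2 * mu - lambda)
	60 * c(wq1, wq2)                         # in minutes: roughly 290 and 3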

Related post: Server utilization: Joel on queuing

Nearly everyone is above average

Most people have a higher than average number of legs.

The vast majority of people have two legs. But some people have no legs or one leg.  So the average number of legs will be just slightly less than 2, meaning most people have an above average number of legs. (I didn’t come up with this illustration, but I can’t remember where I saw it.)
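
A quick illustration with made-up numbers:

	legs <- c(rep(2, 9999), 1, 0)  # almost everyone has two legs
	mean(legs)                     # a little less than 2
	mean(legs > mean(legs))        # fraction above average: about 0.9998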

Except in Lake Wobegon, it’s not possible for everyone to be above average. But this example shows it is possible for nearly everyone to be above average. And when you’re counting numbers of social connections, nearly everyone is below average. What’s not possible is for a majority of folks to be above or below the median.

Most people expect half the population to be above average and half below, in every context. That’s because they confuse the mean and the median. And even among people who understand the difference between mean and median, there is a strong tendency to implicitly assume symmetry. And when distributions are symmetric, the mean and the median are the same.

Comparing two ways to fit a line to data

Suppose you have a set of points and you want to fit a straight line to the data. I’ll present two ways of doing this, one that works better numerically than the other.

Given points (x1, y1), …, (xN, yN) we want to fit the best line of the form y = mx + b, best in the sense of minimizing the sum of the squares of the differences between the actual data values yi and the predicted values mxi + b.

The textbook solution is the following sequence of equations.
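
	Δ = N Σ xi² − (Σ xi)²
	m = (N Σ xiyi − Σ xi Σ yi) / Δ
	b = (Σ xi² Σ yi − Σ xi Σ xiyi) / Δ

(Here Σ denotes a sum over i = 1 to N. These are the standard normal equations for a least squares line, written to match the variables in the code below.)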

The equations above are absolutely correct, but turning them directly into software isn’t the best approach. Given vectors of x and y values, here is C++ code to compute the slope m and the intercept b directly from the equations above.

	// Accumulate the sums that appear in the equations above.
	double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
	int n = x.size();
	for (int i = 0; i < n; ++i)
	{
		sx += x[i];
		sy += y[i];
		sxx += x[i]*x[i];
		sxy += x[i]*y[i];
	}
	double delta = n*sxx - sx*sx;
	double slope = (n*sxy - sx*sy)/delta;
	double intercept = (sxx*sy - sx*sxy)/delta;

This direct method often works. However, it presents multiple opportunities for loss of precision. First of all, the Δ term is the same calculation criticized in this post on computing standard deviation. If the x values are large relative to their variance, the Δ term can be inaccurate, possibly wildly inaccurate.

The problem with computing Δ directly is that it could potentially require subtracting nearly equal numbers. You always want to avoid computing a small number as the difference of two big numbers if you can. The equations for slope and intercept have a similar vulnerability.

The second method is algebraically equivalent to the method above, but it is better suited to numerical calculation. Here it is in code.

	double sx = 0.0, sy = 0.0, stt = 0.0, sts = 0.0;
	int n = x.size();
	// First pass: compute the sums of x and y.
	for (int i = 0; i < n; ++i)
	{
		sx += x[i];
		sy += y[i];
	}
	// Second pass: work with x values shifted by their mean,
	// which avoids taking the difference of large, nearly equal numbers.
	for (int i = 0; i < n; ++i)
	{
		double t = x[i] - sx/n;
		stt += t*t;
		sts += t*y[i];
	}

	double slope = sts/stt;
	double intercept = (sy - sx*slope)/n;

This method involves subtraction, and it too can fail for some data sets. But in general it does better than the former method.

Here are a couple examples to show how the latter method can succeed when the former method fails.

First we populate x and y as follows.

	for (int i = 0; i < numSamples; ++i)
	{
		x[i] = 1e10 + i;
		y[i] = 3*x[i] + 1e10 + 100*GetNormal();
	}

Here GetNormal() is a generator for standard normal random variables. Aside from a small perturbation due to the random noise, we would expect a slope of 3 and an intercept of 1e+10.

The direct method gives slope -0.666667 and intercept 5.15396e+10. The better method gives slope 3.01055 and intercept 9.8945e+9. If we remove the Gaussian noise and try again, the direct method gives slope 3.33333 and intercept 5.72662e+9, while the better method is exactly correct: slope 3 and intercept 1e+10.

As another example, fill x and y as follows.

	for (int i = 0; i < numSamples; ++i)
	{
		x[i] = 1e6*i;
		y[i] = 1e6*x[i] + 50 + GetNormal();
	}

Now both methods compute the slope exactly as 1e6, but the direct method computes the intercept as 60.0686 while the second method computes 52.416. Neither is very close to the theoretical value of 50, but the latter method is closer.

To read more about why one way can be so much better than the other, see this writeup on the analogous issue with computing standard deviation: theoretical explanation for numerical results.

Update: See related post on computing correlation coefficients.