Make up your own rules of probability

Keith Baggerly and Kevin Coombes just wrote a paper about the analysis errors they commonly see in bioinformatics articles. From the abstract:

One theme that emerges is that the most common errors are simple (e.g. row or column offsets); conversely, it is our experience that the most simple errors are common.

The full title of the article is “Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology.” It will appear in the next issue of Annals of Applied Statistics and is available here. The key phrase in the title is forensic bioinformatics: reverse engineering the statistical analysis of bioinformatics data. The authors give five case studies of data analyses that cannot be reproduced and infer what analysis actually was carried out.

One of the more egregious errors came from the creative application of probability. One paper uses innovative probability results such as

P(ABCD) = P(A) + P(B) + P(C) + P(D) – P(A) P(B) P(C) P(D)

and

P(AB) = max( P(A), P(B) ).

Baggerly and Coombes were remarkably understated in their criticism: “None of these rules are standard.” In less diplomatic language, the rules are wrong.
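To see just how far off these rules are, here’s a quick numerical check of my own, assuming four independent events with probability 0.5 each:

```python
# Four independent events, each with probability 0.5
pA = pB = pC = pD = 0.5

# Standard result: P(A or B or C or D) = 1 - P(no event occurs)
# for independent events
correct_union = 1 - (1 - pA) * (1 - pB) * (1 - pC) * (1 - pD)

# The paper's rule: sum of the probabilities minus their product
bogus_union = pA + pB + pC + pD - pA * pB * pC * pD

# Standard result: P(A and B) = P(A) P(B) for independent events
correct_joint = pA * pB

# The paper's rule: the larger of the two probabilities
bogus_joint = max(pA, pB)

print(correct_union)  # 0.9375
print(bogus_union)    # 1.9375 -- not even a valid probability
print(correct_joint)  # 0.25
print(bogus_joint)    # 0.5
```

The made-up union rule doesn’t just give a wrong answer; it gives a number greater than 1, which no probability can be.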

To be fair, Baggerly and Coombes point out

These rules are not explicitly stated in the methods; we inferred them either from formulae embedded in Excel files … or from exploratory data analysis …

So, the authors didn’t state false theorems; they just used them. And nobody would have noticed if Baggerly and Coombes had not tried to reproduce their results.

More posts on reproducible research

Conservation of complexity

Larry Wall said something one time to the effect that Scheme is beautiful and every Scheme program is ugly; Perl is ugly, but it lets you write beautiful programs. Of course it also lets you write ugly programs if you choose.

Scheme is an elegant, minimalist language. The syntax of the language is extremely simple; you could say it has no syntax. But this simplicity comes at a price. Because the language does so little for you, you have to write the code that might have been included in other languages. And because the language has no syntax, code written in Scheme is hard to read. As Larry Wall said

The reason many people hate programming in Lisp [the parent language of Scheme] is because every thing looks the same. I’ve said it before, and I’ll say it again: Lisp has all the visual appeal of oatmeal with fingernail clippings mixed in.

The complexity left out of Scheme is transferred to the code you write in Scheme. If you’re writing small programs, that’s fine. But if you write large programs in Scheme, you’ll either write a lot of code yourself or you’ll leverage a lot of code someone else has written in libraries.

Perl is the opposite of a minimalist language. There are shortcuts for everything. And if you master the language, you can write programs that are beautiful in that they are very concise. Perl programs can even be easy to read. Yes, Perl programs look like line noise to the uninitiated, but once you’ve learned Perl, the syntax can be helpful if used well. (I have my complaints about Perl, but I got over the syntax.)

Perl is a complicated language, but it works very well for some problems. Features that other languages would put in libraries (e.g. regular expressions, text munging) are baked directly into the Perl language. And if you depend on those features, it’s very handy to have direct support in the language.

The point of my discussion of Scheme and Perl is that the complexity has to go somewhere: in the language, in libraries, or in application code. That doesn’t mean all languages are equal for all tasks. Some languages put the complexity where you don’t have to think about it. For example, Java is simpler than C++, as long as you don’t have to understand the inner workings of the JVM. But if you do need to look inside the JVM, suddenly Java is more complex than C++. The total complexity hasn’t changed, but your subjective experience of the complexity has increased.

Earlier this week I wrote a post about C and C++. My point there was similar. C is simpler than C++, but software written in C is often more complicated than software written in C++ when you compare code written by developers of similar talent. If you need the functionality of C++, and most large programs will, then you will have to write it yourself if you’re using C. And if you’re a superstar developer, that’s fine. If you’re less than a superstar, the people who inherit your code may wish that you had used a language with this functionality built in.

I understand the attraction to small programming languages. The ideal programming language has everything you need and nothing more. But that means the ideal language is a moving target, changing as your work changes. As your work becomes more complicated, you might be better off moving to a more complex language, pushing more of the complexity out of your application code and into the language and its environment. Or you may be able to down-size your language because you no longer need the functionality of a more complex language.

More posts on programming language complexity

Mercator projection

A natural approach to mapping the Earth is to imagine a cylinder wrapped around the equator. Points on the Earth are mapped to points on the cylinder. Then split the cylinder so that it lies flat. There are several ways to do this, all known as cylindrical projections.

One way to make a cylindrical projection is to draw a line from the center of the Earth through each point on the surface. Each point on the surface is then mapped to the place where the line intersects the cylinder. Another approach is to make horizontal projections, mapping each point on Earth to the closest point on the cylinder. The Mercator projection is yet another approach.

Mercator projection map

With any cylindrical projection, parallels (lines of constant latitude) become horizontal lines on the map, and meridians (lines of constant longitude) become vertical lines. Cylindrical projections differ in how the horizontal lines are spaced, and different projections are useful for different purposes. The Mercator projection is designed so that lines of constant bearing on the Earth correspond to straight lines on the map. For example, the course of a ship sailing northeast is a straight line on the map. (Any cylindrical projection will represent a due north or due east course as a straight line, but only the Mercator projection represents intermediate bearings as straight lines.) Clearly a navigator would find Mercator’s map indispensable.

Latitude lines become increasingly far apart as you move toward the north or south pole on maps drawn with the Mercator projection. This is because the distance between latitude lines has to change to keep bearing lines straight. Mathematical details follow.

Think of two meridians running around the earth. The distance between these two meridians along a due east line depends on the latitude. The distance is greatest at the equator and becomes zero at the poles. In fact, the distance is proportional to cos(φ), where φ is the latitude. Since meridians correspond to equally spaced vertical lines on the map, east-west distances on the Earth are stretched by a factor of 1/cos(φ) = sec(φ) on the map.

Suppose you have a map that shows the real time position of a ship sailing east at some constant rate. The corresponding rate of change on the map is proportional to sec(φ). In order for lines of constant bearing to be straight on the map, the rate of change should also be proportional to sec(φ) as the ship sails north. That says the spacing between latitude lines has to change according to h(φ) where h‘(φ) = sec(φ). This means that h(φ) is the integral of sec(φ) which equals log |sec(φ) + tan(φ)|. The function h(φ) becomes unbounded as φ approaches ± 90°. This explains why the north and south poles are infinitely far away on a Mercator projection map and why the area of northern countries is exaggerated.
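A few lines of code make the derivation concrete. The sketch below is my own illustration, not from the original post; it computes h(φ) and checks numerically that its derivative is sec(φ):

```python
import math

def mercator_y(lat_deg):
    """Vertical map coordinate h(phi) = log|sec(phi) + tan(phi)|."""
    phi = math.radians(lat_deg)
    return math.log(abs(1 / math.cos(phi) + math.tan(phi)))

# Numerical check that h'(phi) = sec(phi), here at phi = 45 degrees
d = 1e-4  # small step, in degrees
deriv = (mercator_y(45 + d) - mercator_y(45 - d)) / (2 * math.radians(d))
print(deriv, 1 / math.cos(math.radians(45)))  # both about 1.41421

# Latitude lines spread apart faster and faster toward the poles
for lat in (0, 30, 60, 80, 89):
    print(lat, round(mercator_y(lat), 3))
```

The loop at the end shows the unbounded growth described above: equal steps in latitude produce ever larger steps in the map coordinate as φ approaches 90°.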

(Update: The inverse of the function h(φ) has some surprising properties. See Inverse Mercator projection.)

The modern explanation of Mercator’s projection uses logarithms and calculus, but Mercator came up with his projection in 1569 before logarithms or calculus had been discovered.

For more details of the Mercator projection, see Portraits of the Earth.

More geography posts

I disagree with Linus Torvalds about C++

I heard about this note from Linus Torvalds via David Wolever yesterday. Here’s Torvalds’s opinion of C++.

C++ is a horrible language. It’s made more horrible by the fact that a lot of substandard programmers use it, to the point where it’s much, much easier to generate total and utter crap with it. Quite frankly, even if the choice of C were to do *nothing* but keep the C++ programmers out, that in itself would be a huge reason to use C.

Well, I’m nowhere near as talented a programmer as Linus Torvalds, but I totally disagree with him. If it’s easy to generate crap in a relatively high-level and type-safe language like C++, then it must be child’s play to generate crap in C. It’s not fair to compare world-class C programmers like Torvalds and his peers to average C++ programmers. Either compare the best with the best or compare the average with the average. Comparing the best with the best isn’t very interesting. I imagine gurus like Bjarne Stroustrup and Herb Sutter can write C++ as skillfully as Linus Torvalds writes C, though that is an almost pointless comparison. Comparing average programmers in each language is more important, and I don’t believe C would come out on top in such a comparison.

Torvalds talks about “STL and Boost and other total and utter crap.” A great deal of thought has gone into the STL and to Boost by some very smart people over the course of several years. Their work has been reviewed by countless peers. A typical C or C++ programmer simply will not write anything more efficient or more robust than the methods in these libraries if they decide to roll their own.

Torvalds goes on to say

In other words, the only way to do good, efficient, and system-level and portable C++ ends up to limit yourself to all the things that are basically available in C.

I’ve had the opposite experience. I’d say that anyone wanting to write a large C program ends up reinventing large parts of C++ and doing it poorly. The features added to C to form C++ were added for good reasons. For example, once you’ve allocated and de-allocated C structs a few times, you realize it would be good to have functions to do this allocation and de-allocation. You basically end up re-inventing C++ constructors and destructors. But you end up with something totally sui generis. There’s no compiler support for the conventions you’ve created. No one can read about your home-grown constructors and destructors in a book. And you probably have not thought about as many contingencies as the members of the C++ standards committee have thought of.

I disagree that writing projects in C keeps out inferior C++ programmers who are too lazy to write C. One could as easily argue the opposite, that C is for programmers too lazy to learn C++. Neither argument is fair, but I think there is at least as much validity to the latter as there is to the former. I think there may be a sort of bimodal distribution of C programmer talent: some of the best and some of the worst programmers use C but for different reasons.

I do not claim that C++ is perfect, but I’ve never had any desire to go back to C after I moved to C++ many years ago. I’ll grant that I’m not writing my own operating system, but neither are the vast majority of programmers. For my work, C++ is as low-level as I care to go.

More C++ posts

How to write multi-part definitions in LaTeX

This post explains how to typeset multi-part definitions in LaTeX.

The absolute value function is a simple example of a two-part definition.

absolute value definition

The Möbius function is a more complicated example of a three-part definition.

definition of the Möbius function

Here’s how you could write LaTeX for the absolute value definition.

|x| =
\left\{
	\begin{array}{ll}
		x  & \mbox{if } x \geq 0 \\
		-x & \mbox{if } x < 0
	\end{array}
\right.

The right-hand side of the equation is an array with an opening brace sized to fit on the left. Braces are special characters, so the opening brace needs to be escaped with a backslash. LaTeX requires a \right for every \left, but the dot in \right. says to make the matching delimiter on the right side empty.

Since this pattern comes up fairly often, it’s handy to have a command to encapsulate it. We define \twopartdef as follows.

\newcommand{\twopartdef}[4]
{
	\left\{
		\begin{array}{ll}
			#1 & \mbox{if } #2 \\
			#3 & \mbox{if } #4
		\end{array}
	\right.
}

Then we could call it as follows:

|x| = \twopartdef { x } {x \geq 0} {-x} {x < 0}
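As an aside, if your document loads the amsmath package, its cases environment handles the same pattern without a custom command:

```latex
% Assumes \usepackage{amsmath} in the preamble
|x| =
\begin{cases}
	x  & \text{if } x \geq 0 \\
	-x & \text{if } x < 0
\end{cases}
```

The custom command is still useful if you want uniform formatting across many definitions or prefer not to depend on amsmath.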

The command \threepartdef is very similar to \twopartdef.

\newcommand{\threepartdef}[6]
{
	\left\{
		\begin{array}{lll}
			#1 & \mbox{if } #2 \\
			#3 & \mbox{if } #4 \\
			#5 & \mbox{if } #6
		\end{array}
	\right.
}

You could call \threepartdef for the Möbius function as follows.

\mu(n) = \threepartdef
{1}      {n=1}
{0}      {a^2 \,|\, n \mbox{ for some } a > 1}
{(-1)^r} {n \mbox{ has } r \mbox{ distinct prime factors}}

More LaTeX posts


Subtle variations on familiar themes

I was skimming through George Leonard’s little book Mastery the other night and ran across this quote:

… the essence of boredom is to be found in the obsessive search for novelty. Satisfaction lies in … the discovery of endless riches and subtle variations on familiar themes.

This is a theme I’ve written about several times before. For example, see the post Six quotes on digging deep. I often think about one of the quotes in that post. Richard Feynman said that

… nearly everything is really interesting if you go into it deeply enough …

In the post God is in the details I talk about how that applies to statistics. Rote application of statistics is mind-numbingly dull, but statistics can be quite interesting when you dig down to the foundations.

When I was in new faculty orientation years ago I remember a chemistry professor exhorting us to volunteer to teach freshman courses. Most people want to teach the more advanced courses, but he said that some of his best inspiration came from teaching the most foundational courses.

Focusing on basics is hard work and few people want to do it. George Leonard describes this as America’s “anti-mastery” culture. Seth Godin uses the image of a starving woodpecker in his book The Dip.

A woodpecker can tap twenty times on a thousand trees and get nowhere, but stay busy. Or he can tap twenty thousand times on one tree and get dinner.

Sometimes I feel like the woodpecker tapping on a thousand trees, staying busy but getting nowhere. But then I also think about a line from W. C. Fields:

If at first you don’t succeed, try, try again. Then quit. No use being a damn fool about it.

Related posts

Smoking

Seth Godin has a blog post this morning in which he says

Smoking a pack a day for twenty years is a great way to be sure you’ll die early.

The point of his post was not the dangers of smoking. His point was that “What we do in the long run, over time, drip by drip” matters more than what we do sporadically, and I certainly agree. But I disagree with Seth’s comment on smoking.

Smoking certainly cuts your life short on average. But smoking is like playing Russian roulette: most of the time, you’re OK. Most smokers do not get lung cancer. Smoking does not ensure that you’ll die early. And that may be why smokers ignore warnings. They can point to plenty of fellow smokers who were not killed by smoking. For example, if I wanted to justify smoking, I could point out that my parents smoked and did not die of smoking-related causes. (Another smoker in my family, however, did die of lung cancer.)

People are most strongly motivated by consequences that are immediate and certain. Given a choice between the certain pleasure of enjoying a cigarette now versus a risk of lung cancer years from now, smokers choose the former.

It’s not very effective to tell someone, especially someone young, that if they smoke they will get lung cancer. For one thing, it’s not true: they probably will not get lung cancer. But they do increase their chances of cancer, and even more so their chances of emphysema, heart disease, etc. Still, those are probabilities of future events. Teenagers may be more motivated by the thought of their fingernails turning yellow or their clothes stinking.

Update: I want to be clear that I’m not defending smoking. I couldn’t wait to move out of the smoke-filled house I grew up in. Nor am I trying to down-play the health risks of smoking. The harmful effects are extraordinarily well established. As Fletcher Knebel said back in 1961, smoking is the leading cause of statistics. Half a century later we’re still spending money on studies to confirm what we already know.

Related posts

Easy to guess, hard to prove

Suppose you’re waiting for a friend and you have nothing to do. After a few minutes of boredom you pick up a pencil and some scrap paper. You start listing the prime numbers.

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …

Next you write down the forward differences, subtracting each number in the sequence from the one that follows it.

1, 2, 2, 4, 2, 4, 2, 4, …

Your friend is running late and so you repeat the process starting with the sequence you just created.

1, 0, 2, -2, 2, -2, 2, 2, 4, …

Hmm. That time you got a negative number in the list. You’re just doodling and you don’t want to think too hard, so you decide you’ll ignore signs and just write down the absolute values of the differences. So you erase the negative sign and take differences again.

1, 2, 0, 0, 0, 0, 0, 2, …

Your friend is quite late, so you keep doing this some more. After a while you notice that every new sequence has started with 1. Will every sequence start with 1? That’s Gilbreath’s conjecture, named after Norman Gilbreath who asked the question in 1958. I ran across the conjecture in The Math Book by Clifford Pickover. Gilbreath wasn’t the first to notice this pattern. François Proth noticed it in 1878 and published an incorrect proof of the conjecture.
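If you’d rather let a computer do the doodling, here’s a short sketch of my own that generates the rows of absolute differences (taking absolute values from the start, per the erase-the-signs convention above) and checks the leading terms:

```python
def primes_up_to(n):
    """Sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def gilbreath_rows(num_rows, prime_limit=1000):
    """Repeatedly take absolute values of forward differences,
    starting from the sequence of primes."""
    row = primes_up_to(prime_limit)
    rows = []
    for _ in range(num_rows):
        row = [abs(b - a) for a, b in zip(row, row[1:])]
        rows.append(row)
    return rows

rows = gilbreath_rows(50)
print(rows[0][:8])  # [1, 2, 2, 4, 2, 4, 2, 4]
print(rows[2][:8])  # [1, 2, 0, 0, 0, 0, 0, 2]
print(all(r[0] == 1 for r in rows))  # True, as the conjecture predicts
```

The leading term of row k depends only on the first k+1 primes, so 50 rows built from the primes below 1000 are more than covered; each row checked begins with 1.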

Gilbreath’s conjecture has been verified for the first several billion sequences, but nobody has proved that every sequence will start with 1. Paul Erdős speculated that Gilbreath’s conjecture is true but that it would be 200 years before anyone could prove it. I find Erdős’s conjecture more interesting than Gilbreath’s conjecture.

Here’s what I imagine that Erdős had in mind. While the process Gilbreath created is very simple, it is also a strange thing to study. It’s not the kind of thing that people have proven theorems about. No one knows how to approach the problem. There are far more complicated problems in the mainstream of mathematics that will probably be resolved sooner because they are related to previously solved problems and researchers have some idea where to start working on them.

Other posts on number theory:

Other posts on proofs:

Miscellaneous links

Science

Computing

Math

Just for fun

Why care about spherical trig?

Last spring I wrote a post on spherical trigonometry, the study of triangles drawn on a sphere (e.g. the surface of the Earth).

triangles drawn on a sphere

Mel Hagen left a comment on that post a few days ago saying

I am revisiting Spherical Trig after 30 years by going back over some of my books that I have collected over the years. …

I asked Mel via email why he was revisiting the subject. He wrote an interesting reply that I am including below with his permission.

Mr. Cook,

Well, let’s see, how did I come to revisit the world of Spherical Trigonometry?

In the early 1970s I lived within a mile of the coast of the Gulf of Mexico in southern Alabama. I took up daytime sailing with some of my friends from work (they sailed; I just went along for the ride).

Eventually I brought up the concern of being out of sight of land and how we would know where we were. In addition to a LORAN navigation receiver and a Radio Direction Finder receiver, they always had at least two navigation sextants on board. They demonstrated very quickly and without much detail how to get a “fix.”

I never thought much about it after that until I was at a yard sale in our town and there was a small book, Celestial Navigation for Yachtsmen by Mary Blewitt. I spent the dollar, took it home, and spent the next year or so reading it over and over until I had an idea of what was really going on with simple calculations and the examples using tables and charts.

Having taken most of the mathematics courses available in three different colleges, I knew there had to be a strong basis for Spherical Trigonometry in Celestial Navigation.

Anyway, her book deals exclusively with finding out where you are but not how to point yourself with the true heading for where you want to go. So I dug out several of my old textbooks including the Schaum’s Outline Series by Frank Ayres, Jr., and found just what I was looking for.

Now, let me digress for a moment. In the art of Celestial Navigation you really need three important items: the sextant, a reasonably accurate watch and either a Nautical Almanac or an Air Navigation Almanac. Nowadays you can buy a digital watch for less than $30.00 that keeps time better than anything they had during World War II. With those 3 items and some simple Spherical Trigonometry you can easily determine your location on the earth (Law of Sines and Law of Cosines).

So, time for a case scenario.

You’re out to sea. A storm comes up. The vessel loses all electronic instrumentation (radios, compasses, computers, LORAN, GPS, etc.).  (If you think this doesn’t happen in real life, think again!) But, as long as you can see the sun during the day you’re in better shape than you think.

Most important — you must have at least one watch that is set for Greenwich Mean Time!

As it is getting about mid-day you start taking sextant sights of the Sun. When it reaches its peak you write down your watch time. With 15 degrees per hour from midnight on your watch you now have local time. With the almanac you can get the declination of the Sun. You should have a reasonable idea of where you think you are. Combine that with two or three Spherical Trigonometry calculations and you can get your “fix.”

Knowing where you are and where you should be headed, and using Napier’s Analogies, you can determine where you should point your vessel and get on your way. With a few more calculations you can actually determine how many hours of daylight you have left, and when the sun will come up in the morning to give you a good reference check.

So curiosity got the best of me. I started playing with Spherical Trigonometry, logarithm tables, and a very good slide rule all over again. The name of the game is not to just use the formulas and the equations but derive them so that you don’t have to try to memorize them and make a mental translation error.

I think one of the things we are doing today is forgetting how we got where we are using certain fields of mathematics and relying too heavily on technology that can easily fail us. Spherical Trigonometry is just an extension of 1 plus 1 equals 2. It just takes some reading and practice, practice, practice.

By the way, another book (if you can find it) that is really helpful is Standard Mathematical Tables.

Keep in touch. Let me know how people respond to my background and also the sources of information.

Mel Hagen

mbhagen@yahoo.com
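The noon-sight longitude step Mel describes, 15 degrees per hour, reduces to a one-line calculation. Here is a minimal sketch (the function name and example time are my own, and it ignores the equation of time and other corrections a real navigator would apply):

```python
def longitude_from_noon(gmt_hours_at_local_noon):
    """Longitude in degrees (east positive) from the GMT time of local noon.

    The Sun covers 15 degrees of longitude per hour, so local noon
    occurring h hours after 12:00 GMT puts you 15 * h degrees west.
    """
    return (12.0 - gmt_hours_at_local_noon) * 15.0

# Local apparent noon observed at 18:00 GMT
print(longitude_from_noon(18.0))  # -90.0: 90 degrees west, roughly the Gulf coast
print(longitude_from_noon(12.0))  # 0.0: the Greenwich meridian
```

Combine this longitude with a latitude from the Sun’s noon altitude and declination and you have the “fix” Mel mentions.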

Related links