From the monthly archives:

April 2009

I was captured by Somali pirates a couple weeks ago, or so some friends thought.

Captain Richard Phillips

When I went to the library this evening, one of the librarians stopped me as soon as I walked in. She said “John! Wait, I’ve got to show you something.” She brought me a newspaper clipping with a photo of Richard Phillips, captain of the Maersk Alabama who was rescued by U. S. Navy SEAL snipers earlier this week.

She told me that she and her grandmother were watching the news the other day and thought that she saw my face on the TV. She hadn’t seen me at the library in a while and was certain that I had been captured by pirates.

After I told my wife about the story, she said other people had commented on how much I resemble Captain Phillips.

Now the librarian calls me “Captain.”

{ 5 comments }

A note to new subscribers

by John on April 17, 2009

Thank you for subscribing to my blog. I wanted to say a little about the blog for those of you who have just subscribed recently.

I post a little more than one article a day on average. I write about a variety of topics. Here’s a list of some of the most popular posts by category. If a few posts are not your cup of tea, please just ignore these but keep subscribing. I’ll write again soon about the topic that brought you here.

Here’s a list of books mentioned on the blog with links back to the post(s) where they came up.

Here’s my contact info. If you submit a comment and it doesn’t appear in a few hours, please send me a note. (I get thousands of spam comments, and so I filter spam aggressively and sometimes a legitimate comment gets blocked.) I enjoy hearing from you.

{ 0 comments }

Math blog carnivals

by John on April 17, 2009

It looks as though the Carnival of Mathematics is dead. The Blog Carnival site says “It appears that this carnival is no longer accepting submissions.” I may have had the honor of hosting the last Carnival of Mathematics unless the carnival is resurrected.

But there’s a new carnival in town, one devoted to math education. The fifth edition of the Math Teachers at Play carnival was posted this morning, hosted at Let’s Play Math.

By the way, I also wanted to mention the previous blog post on Let’s Play Math. Denise links to a TED presentation by Robert Lang on origami and math. She pulls out this quote from the talk:

The secret to productivity in so many fields … is letting dead people do your work for you. … take your problem and turn it into a problem that someone else has solved and use their solutions.

Update: According to Jason’s comment below, rumors of the death of Carnival of Mathematics have been greatly exaggerated.

{ 1 comment }

Metabolism and power laws

by John on April 16, 2009

Bigger animals have more cells than smaller animals. More cells means more cellular metabolism and so more heat produced. How does the amount of heat an animal produces vary with its size? We clearly expect it to go up with size, but does it increase in proportion to volume? Surface area? Something in between?

A first guess would be that metabolism (equivalently, heat produced) goes up in proportion to volume. If cells are all roughly the same size, then number of cells increases proportionately with volume. But heat is dissipated through the surface. Surface area increases in proportion to the square of length but volume increases in proportion to the cube of length. That means the ratio of surface area to volume decreases as overall size increases. The surface area to volume ratio for an elephant is much smaller than it is for a mouse. If an elephant’s metabolism per unit volume were the same as that of a mouse, the elephant’s skin would burn up.

So metabolism cannot be proportional to volume. What about surface area? Here we get into variety and controversy. Many people assume metabolism is proportional to surface area based on the argument above. This idea was first proposed by Max Rubner in 1883. Others emphasize data that supports the theory that suggests metabolism is proportional to surface area.

In the 1930’s, Max Kleiber proposed that metabolism increases according to body mass raised to the power 3/4. (I’ve been a little sloppy here using body mass and volume interchangeably. Body mass is more accurate, though to first approximation animals have uniform density.) If metabolism were proportional to volume, the exponent would be 1. If it were proportional to surface area, the exponent would be 2/3. But Kleiber’s law says it’s somewhere in between, namely 3/4. The image below comes from a paper by Kleiber from 1947.

Kleiber M. (1947). Body size and metabolic rate. Physiological Reviews 27: 511-541.

The graph shows that on a log-log plot, the metabolism rate versus body mass for a large variety of animals has slope approximately 3/4.

So why the exponent 3/4? There is a theoretical explanation called the metabolic scaling theory proposed by Geoffrey West, Brian Enquist, and James Brown. Metabolic scaling theory says that circulatory systems and other networks are fractal-like because this is the most efficient way to serve an animal’s physiological needs. To quote Enquist:

Although living things occupy a three-dimensional space, their internal physiology and anatomy operate as if they were four-dimensional. … Fractal geometry has literally given life an added dimension.

The fractal theory would explain the power law exponent exponent 3/4 simply: it’s the ratio of the volume dimension to the fractal dimension. However, as I suggested earlier, this theory is controversial. Some biologists dispute Kleiber’s law. Others accept Kleiber’s law as an empirical observation but dispute the theoretical explanation of West, Enquist, and Brown.

To read more about metabolism and power laws, see chapter 17 of Complexity: A Guided Tour.

Related posts:

Networks and power laws
Rate of regularizing English verbs

{ 10 comments }

StackOverflow question statistics

by John on April 15, 2009

This is the third in a series of posts about StackOverflow statistics. First I wrote about reputation. Then a couple days ago I wrote about users and votes on StackOverflow. Now this post answers three questions about questions.

  1. How many answers do question typically get?
  2. How many votes do questions typically get?
  3. Do questions with more answers also get more votes?

The following graph shows the distribution of answers per question.

histogram

At the time the data were collected, there were almost 126,000 questions. Only 134 of these had no answers. (This seems suspiciously small, but I’ll take it at face value for now.) Around 18% of questions had one answer, 23% had two answers, 18% had three answers, and the number of answers declines steadily from there. The median number of votes was 3, and the average was 4.2. About 98% of questions had 15 or fewer answers. Only 51 questions had more than 100 answers.

But a few questions had an extreme number of answers. The question “How old are you, and how old were you when you started coding?” had the most answers at 1,203. (Usually the StackOverflow moderators shut down frivolous questions, but they let this one go on a while for fun.)

Turning to votes, the following graph shows the distribution of number of votes per question.

About 29% of the questions had not yet received any votes when the data were collected.  About 99% of questions had 16 or fewer votes. The median number of votes was 1 and the mean was 2.45. Only 49 questions had more than 100 votes. (Note that we’re looking at total votes, up and down, not score. Score will typically be lower since votes can be negative.)

As with counting answers, there are a few outliers for votes. The question “What’s your favorite programmer cartoon? ” had the most votes with 630. (Most popular answer: xkcd.)

Finally, how do votes and reputation go together? Do questions with more answers also get more votes and vice versa? There’s definitely an association between number of votes and number of answers. (Technically, Spearman’s rank correlation rho was 0.49.) Answers and votes tend to increase together, though there’s plenty of variation in the number of votes that the number of answers does not explain.

Related posts:

Civic duty on StackOverflow
StackOverflow reputation statistics

{ 3 comments }

Intellectual traffic jam

by John on April 15, 2009

Imagine you’re on a highway with two lanes in each direction. Two cars are traveling side-by-side at exactly the speed limit. No one can pass, and so the cars immediately behind the lead pair go a little slower than the speed limit in order to maintain a safe distance. This process cascades until traffic slows down to a crawl miles behind the pair of cars responsible for the traffic jam.

Something similar can happen in organizations. Suppose the person at the top of a company is afraid to hire anyone smarter than himself. He wants to hire people who are talented, but not quite as talented as he is. If each of his subordinates follows his lead, the result can be a lack of talent at the bottom of the org chart.

{ 10 comments }

PowerShell browser toolbar

by John on April 15, 2009

Shay Levy created an amazing browser toolbar for PowerShell. The toolbar works with IE and Firefox. It updates itself using data that Shay maintains. It lets you do Google searches tailored to PowerShell sites, lists popular PowerShell blogs, and has a menu for a wide variety of PowerShell resources. Shay released his toolbar back in June 2008, so this is old news to people more in the know than I am, but I just found out about it yesterday.

I’m not a big fan of toolbars. I installed the toolbar but will keep it hidden most of the time. But I’m glad I installed it just to see Shay’s list of resources.

Shay created his toolbar using Conduit, something else I’d not heard of. Looks like Conduit makes it easy to create other similar toolbars. (Easy in the sense of programming effort; I’m sure a lot of work goes into keeping the resource lists up to date once you’ve created the toolbar.)

{ 1 comment }

I’ve written a small booklet, 10 pages, of things I wish someone had told me when I first started using Windows PowerShell.

Download here: PowerShell Day 1

{ 1 comment }

Civic duty on StackOverflow

by John on April 12, 2009

On StackOverflow, users gain reputation points when other users vote up their questions or answers. Voting is considered a civic duty. Voting doesn’t increase your own reputation. The only direct reward for voting is the “Civic Duty” badge for voting 300 times. But voting makes the site work well. Good questions and answers generally rise to the top.

How civic-minded are StackOverflow users? Where do the votes come from? Are people who receive more or less likely to give? That is, do those who have received high reputation scores through other users’ votes also give away reputation points in the form of votes? Jeff Atwood mailed me some data the other day so I could answer these questions.

Sixty percent of  StackOverflow users haven’t cast one vote, but that doesn’t tell the whole story. The site is growing rapidly and so there are always a large number of users who haven’t been on the site long enough to vote much or gain much reputation. Also, there are a large number of users who registered some time ago but hardly participate on the site.

When you compare reputation scores and votes, things get more interesting. For starters, users who are somewhat invested in the site, as indicated by reputation score > 100, have voted 91 times on average. That still doesn’t tell the full story because it averages over a huge range of reputation scores. Here’s the more interesting story: The number of votes users cast is proportional to their reputation.

The graph above shows average number of answer votes as a function of reputation. I divided reputation ranges into blocks of 100 (i.e. 0 – 99, 100 – 199, etc.) and averaged the number of times users in that range voted up an answer. There are two reasons I only considered answer votes: there are far more question votes than answer votes, and question votes follow a similar pattern to answer votes.

The graph starts to feather out on the right end because there are fewer users in each reputation range; there is more random variation because there are fewer people in the higher ranges to average over.The number of users at each reputation level drops off rapidly according to a power law. Although 99.4% of users have reputation less than 5000, the largest reputation score was 51,313 on the day Jeff collected the data. Here’s a graph from my earlier post, StackOverflow reputation statistics, that shows how quickly the number of people in each reputation range drops.

The graph above was based on data collected at the end of February this year but the data discussed in this post was collected in April. As you look at higher reputation scores, the curve continues to drop of quickly. Since reputations follow a power law, the decrease is linear on a log scale.

Even though users with the highest reputation scores vote the most, most votes come from users with lower reputation scores. That’s just because the large majority of users have lower reputation scores. Users with reputation < 1400 account for a little over half the answer votes cast. They also account for over 96% of all users. If you turn this around, it says that nearly half the votes come from the top 4% of users in reputation. This explains in part why the best answers usually rise to the top: the most knowledgeable users are active voters, assuming reputation and knowledge are correlated.

(The situation is analogous to that of income taxes. The very wealthy pay the most taxes per person, but the bulk of tax revenue comes from those who are not so wealthy. Even so, the percentage of total tax revenue from the top earners is surprisingly high. According to this site, the top 1% of tax payers were responsible for about 40% of all income tax revenue in 2008. The analogy holds for good reasons. Wealth, like StackOverflow reputation, follows a power law distribution. And taxes increase roughly linearly with wealth the same way StackOverflow votes increase with reputation.)

In short, it looks like StackOverflow users are civic minded. Those who receive the most votes also give the most votes. And users in the lower end of the reputation range cast most of the votes in total even though they cast fewer votes per person.

Related post: StackOverflow reputation statistics

{ 7 comments }

Albrecht Dürer’s art and math

by John on April 10, 2009

Richard Elwes has a fun blog post this morning: Dürer, rhinos, and snowflakes. The post is primarily about the art and mathematics of Albrecht Dürer (1471-1528)  but also includes some related links to recent writings, such as Michael Croucher’s blog post on snowflakes.

{ 4 comments }

Programming language fatigue

by John on April 10, 2009

Joe Brinkman wrote an insightful article the other day, Ployglot Programming: Death By A Thousand DSLs. Here’s an excerpt:

I don’t know about other programmers, but I am drowning in DSLs [domain specific languages].  It is hard enough keeping up with my primary development language and the associated platform APIs, but these DSLs are going to be the death of me.  The end result is that I have a pretty decent handle on maybe 3 or 4 of these DSLs but rarely do I have the requisite knowledge to make the right choices in anything beyond that.

It takes a dozen programming languages to do any web project these days. Whenever I bring this up in conversation, most developers say “Oh, well. That’s just the way it is. It isn’t so bad.” But I think it really is a problem. Obviously it’s intimidating amount of material for new developers to learn. But the more subtle problem is that experienced developers who think they understand all the different languages they use are probably wrong.

Case in point: JavaScript. Nearly every web project involves some client-side JavaScript, and 99% of the people who write JavaScript do not know the language. I never claimed to be a JavaScript expert, but I thought I understood the language better than I really did until I saw some presentations by Douglas Crockford.

Crockford has written an excellent book: JavaScript: The Good Parts. His position is that there is an elegant, powerful language at the core of JavaScript but it is surrounded by landmines. His book focuses on the good parts, but along the way he tells you how to avoid or disarm the landmines.

Related post: Programming language subsets

{ 2 comments }

Science versus medicine

by John on April 8, 2009

Before I started working for a cancer center, I was not aware of the tension between science and medicine. Popular perception is that the two go together hand and glove, but that’s not always true.

Physicians are trained to use their subjective judgment and to be decisive. And for good reason: making a fairly good decision quickly is often better than making the best decision eventually. But scientists must be tentative, withhold judgment, and follow protocols.

Sometimes physician-scientists can reconcile their two roles, but sometimes they have to choose to wear one hat or the other at different times.

The physician-scientist tension is just one facet of the constant tension between treating each patient effectively and learning how to treat future patients more effectively. Sometimes the interests of current patients and future patients coincide completely, but not always.

This ethical tension is part of what makes biostatistics a separate field of statistics. In manufacturing, for example, you don’t need to balance the interests of current light bulbs and future light bulbs. If you need to destroy 1,000 light bulbs to find out how to make better bulbs in the future, no big deal. But different rules apply when experimenting on people. Clinical trials will often use statistical designs that sacrifice some statistical power in order to protect the people participating in the trial. Ethical constraints make biostatistics interesting.

{ 2 comments }

Highlights of one year ago

by John on April 7, 2009

Five of the better posts here from April 2008:

A new spin on the cathedral and the bazaar
Why Unicode is subtle
A little simplicity goes a long way
Duct tape on the moon
How to calculate binomial probabilities

{ 0 comments }

Four pillars of Bayesian statistics

by John on April 7, 2009

Anthony O’Hagan’s book Bayesian Inference lists four basic principles of Bayesian statistics at the end of the first chapter:

  1. Prior information. Bayesian statistics provides a systematic way to incorporate what is known about parameters before an experiment is conducted. As a colleague of mine says, if you’re going to measure the distance to the moon, you know not to pick up a yard stick. You always know something before you do an experiment.
  2. Subjective probability. Some Bayesians don’t agree with the subjective probability interpretation, but most do, in practice if not in theory. If you write down reasonable axioms for quantifying degrees of belief, you inevitably end up with Bayesian statistics.
  3. Self-consistency. Even critics of Bayesian statistics acknowledge that Bayesian statistics has a rigorous self-consistent foundation. As O’Hagan says in his book, the difficulties with Bayesian statistics are practical, not foundational, and the practical difficulties are being resolved.
  4. No adhockery. Bruno de Finetti coined the term “adhockery” to describe the profusion of frequentist methods. More on this below.

This year I’ve had the chance to teach a mathematical statistics class primarily focusing on frequentist methods. Teaching frequentist statistics has increased my appreciation for Bayesian statistics. In particular, I better understand the criticism of frequentist adhockery.

For example, consider point estimation. Frequentist statistics to some extent has standardized on minimum variance unbiased estimators as the gold standard. But why? And what do you do when such estimators don’t exist?

Why focus on unbiased estimators? Granted, lack of bias sounds like a good thing to have. All things being equal, it would be better to be unbiased than biased. But all things are not equal. Sometimes unbiased estimators are ridiculous. Why only consider biased vs. unbiased rather, a binary choice, rather than degree of bias, a continuous choice? Efficiency is also important, and someone may reasonably accept a small amount of bias in exchange for a large increase in efficiency.

Why minimize expected mean squared error? Efficiency in classical statistics is typically measured by expected mean squared error. But why not minimize some other measure of error? Why use an exponent of 2 and not 1, or 4, or 2.738? Or why limit yourself to power functions at all? The theory is simplest for squared error, and while this is a reasonable choice in many applications, it is still an arbitrary choice.

How much emphasis should be given to robustness? Once you consider robustness, there are infinitely many ways to compromise between efficiency and robustness.

Many frequentists are asking the same questions and are investigating alternatives. But I believe these alternatives are exactly what de Finetti had in mind: there are an infinite number of ad hoc choices you can make. Bayesian methods are criticized because prior distributions are explicitly subjective. But there are myriad subjective choices that go into frequentist statistics as well, though these choices are often implicit.

There is a great deal of latitude in Bayesian statistics as well, but the latitude is confined to fit within a universal framework: specify a likelihood and prior distribution, then update the model with data to compute the posterior distribution. There are many ways to construct a likelihood (exactly as in frequentist statistics), many ways to specify a prior, and many ways to summarize the information contained in the posterior distribution. But the basic framework is fixed. (In fact, the framework is inevitable given certain common-sense rules of inference.)

Related posts:

Probability and information
What is a confidence interval?

{ 8 comments }

Anatomy of a floating point number

by John on April 6, 2009

In my previous post, I explained that floating point numbers are a leaky abstraction. Often you can pretend that they are mathematical real numbers, but sometimes you cannot. This post peels back the abstraction and explains exactly what a floating point number is. (Technically, this post describes an IEEE 754 double precision floating point number, by far the most common kind of floating point number in practice.)

A floating point number has 64 bits that encode a number of the form ± p × 2e. The first bit encodes the sign, 0 for positive numbers and 1 for negative numbers. The next 11 bits encode the exponent e, and the last 52 bits encode the precision p. The encoding of the exponent and precision require some explanation.

The exponent is stored with a bias of 1023. That is, positive and negative exponents are all stored in a single positive number by storing e + 1023 rather than storing e directly. Eleven bits can represent integers from 0 up to 2047. Subtracting the bias, this corresponds to values of e from -1023 to +1024. Define emin = -1022 and emax = +1023. The values emin – 1 and emax + 1 are reserved for special use. More on that below.

Floating point numbers are typically stored in normalized form. In base 10, a number is in normalized scientific notation if the significand is ≥ 1 and < 10. For example, 3.14 × 102 is in normalized form, but 0.314 × 103 and 31.4 × 102 are not. In general, a number in base β is in normalized form if it is of the form p × βe where 1 ≤ p < β. This says that for binary, i.e. β = 2, the first bit of the significand of a normalized number is always 1. Since this bit never changes, it doesn’t need to be stored. Therefore we can express 53 bits of precision in 52 bits of storage. Instead of storing the significand directly, we store f, the fractional part, where the significand is of the form 1.f.

The scheme above does not explain how to store 0. Its impossible to specify values of f and e so that 1.f × 2e = 0. The floating point format makes an exception to the rules stated above. When e = emin – 1 and f = 0, the bits are interpreted as 0. When e = emin – 1 and f ≠ 0, the result is a denormalized number. The bits are interpreted as 0.f × 2emin. In short, the special exponent reserved below emin is used to represent 0 and denormalized floating point numbers.

The special exponent reserved above emax is used to represent ∞ and NaN. If e = emax + 1 and f = 0, the bits are interpreted as ∞. But if e = emax + 1 and f ≠ 0, the bits are interpreted as a NaN or “not a number.” See IEEE floating point exceptions for more information about ∞ and NaN.

Since the largest exponent is 1023 and the largest significant is 1.f where f has 52 ones, the largest floating point number is 21023(2 – 2-52) = 21024 – 2971 ≈ 21024 ≈ 1.8 × 10308. In C, this constant is defined as DBL_MAX, defined in <float.h>.

Since the smallest exponent is -1022, the smallest positive normalized number is 1.0 × 2-1022 ≈ 2.2 × 10-308. In C, this is defined as DBL_MIN. However, it is not the smallest positive number representable as a floating point number, only the smallest normalized floating point number. Smaller numbers can be expressed in denormalized form, albeit at a loss of significance. The smallest denormalized positive number occurs with f has 51 0’s followed by a single 1. This corresponds to 2-52*2-1022 = 2-1074 ≈ 4.9 × 10-324. Attempts to represent any smaller number must underflow to zero.

C gives the name DBL_EPSILON to the smallest positive number ε such that 1 + ε ≠ 1 to machine precision. Since the significant has 52 bits, it’s clear that DBL_EPSILON = 2-52 ≈ 2.2 × 10-16. That is why we say a floating point number has between 15 and 16 significant (decimal) figures.

For more details see What Every Computer Scientist Should Know About Floating-Point Arithmetic.

First post in this series: Floating point numbers are a leaky abstraction

{ 13 comments }