Posts tagged as:

Power laws

This is one of my favorite quotes from Starbucks’ coffee cups:

When I was young I was mislead by flash cards into believing that xylophones and zebras were much more common.

Alphabet books treat every letter as equally important even though letters like X and Z are far less common than letters like E and T. Children need to learn the entire alphabet eventually, and there are only 26 letters, so teaching all the letters at once is not bad. But uniform emphasis doesn’t scale well. Learning a foreign language, or a computer language, by learning words without regard to frequency is absurd. The most common words are far more common than the less common words, and so it makes sense to learn the most common words first.

John Miles White has applied this idea to learning R. He did a keyword frequency analysis for R and showed that the frequency of the keywords follows Zipf’s law or something similar. I’d like to see someone do a similar study for other programming languages.

It would be interesting to write a programming language tutorial that introduces the keywords in the approximately the order of their frequency. Such a book might be quite unorthodox, and quite useful.

White points out that when teaching human languages in a classroom, “the usefulness of a word tends to be confounded with its respectability.” I imagine something similar happens with programming languages. Programs that produce lists of Fibonacci numbers or prime numbers are the xylophones and zebras of the software world.

Related posts:

Zebras and xylophones part II: learning Spanish
Rate of regularizing English verbs
Four reasons we don’t apply the 80-20 rule
R, the good parts

{ 5 comments }

Metabolism and power laws

by John on April 16, 2009

Bigger animals have more cells than smaller animals. More cells means more cellular metabolism and so more heat produced. How does the amount of heat an animal produces vary with its size? We clearly expect it to go up with size, but does it increase in proportion to volume? Surface area? Something in between?

A first guess would be that metabolism (equivalently, heat produced) goes up in proportion to volume. If cells are all roughly the same size, then number of cells increases proportionately with volume. But heat is dissipated through the surface. Surface area increases in proportion to the square of length but volume increases in proportion to the cube of length. That means the ratio of surface area to volume decreases as overall size increases. The surface area to volume ratio for an elephant is much smaller than it is for a mouse. If an elephant’s metabolism per unit volume were the same as that of a mouse, the elephant’s skin would burn up.

So metabolism cannot be proportional to volume. What about surface area? Here we get into variety and controversy. Many people assume metabolism is proportional to surface area based on the argument above. This idea was first proposed by Max Rubner in 1883. Others emphasize data that supports the theory that suggests metabolism is proportional to surface area.

In the 1930’s, Max Kleiber proposed that metabolism increases according to body mass raised to the power 3/4. (I’ve been a little sloppy here using body mass and volume interchangeably. Body mass is more accurate, though to first approximation animals have uniform density.) If metabolism were proportional to volume, the exponent would be 1. If it were proportional to surface area, the exponent would be 2/3. But Kleiber’s law says it’s somewhere in between, namely 3/4. The image below comes from a paper by Kleiber from 1947.

Kleiber M. (1947). Body size and metabolic rate. Physiological Reviews 27: 511-541.

The graph shows that on a log-log plot, the metabolism rate versus body mass for a large variety of animals has slope approximately 3/4.

So why the exponent 3/4? There is a theoretical explanation called the metabolic scaling theory proposed by Geoffrey West, Brian Enquist, and James Brown. Metabolic scaling theory says that circulatory systems and other networks are fractal-like because this is the most efficient way to serve an animal’s physiological needs. To quote Enquist:

Although living things occupy a three-dimensional space, their internal physiology and anatomy operate as if they were four-dimensional. … Fractal geometry has literally given life an added dimension.

The fractal theory would explain the power law exponent exponent 3/4 simply: it’s the ratio of the volume dimension to the fractal dimension. However, as I suggested earlier, this theory is controversial. Some biologists dispute Kleiber’s law. Others accept Kleiber’s law as an empirical observation but dispute the theoretical explanation of West, Enquist, and Brown.

To read more about metabolism and power laws, see chapter 17 of Complexity: A Guided Tour.

Related posts:

Networks and power laws
Rate of regularizing English verbs

{ 10 comments }

Civic duty on StackOverflow

by John on April 12, 2009

On StackOverflow, users gain reputation points when other users vote up their questions or answers. Voting is considered a civic duty. Voting doesn’t increase your own reputation. The only direct reward for voting is the “Civic Duty” badge for voting 300 times. But voting makes the site work well. Good questions and answers generally rise to the top.

How civic-minded are StackOverflow users? Where do the votes come from? Are people who receive more or less likely to give? That is, do those who have received high reputation scores through other users’ votes also give away reputation points in the form of votes? Jeff Atwood mailed me some data the other day so I could answer these questions.

Sixty percent of  StackOverflow users haven’t cast one vote, but that doesn’t tell the whole story. The site is growing rapidly and so there are always a large number of users who haven’t been on the site long enough to vote much or gain much reputation. Also, there are a large number of users who registered some time ago but hardly participate on the site.

When you compare reputation scores and votes, things get more interesting. For starters, users who are somewhat invested in the site, as indicated by reputation score > 100, have voted 91 times on average. That still doesn’t tell the full story because it averages over a huge range of reputation scores. Here’s the more interesting story: The number of votes users cast is proportional to their reputation.

The graph above shows average number of answer votes as a function of reputation. I divided reputation ranges into blocks of 100 (i.e. 0 – 99, 100 – 199, etc.) and averaged the number of times users in that range voted up an answer. There are two reasons I only considered answer votes: there are far more question votes than answer votes, and question votes follow a similar pattern to answer votes.

The graph starts to feather out on the right end because there are fewer users in each reputation range; there is more random variation because there are fewer people in the higher ranges to average over.The number of users at each reputation level drops off rapidly according to a power law. Although 99.4% of users have reputation less than 5000, the largest reputation score was 51,313 on the day Jeff collected the data. Here’s a graph from my earlier post, StackOverflow reputation statistics, that shows how quickly the number of people in each reputation range drops.

The graph above was based on data collected at the end of February this year but the data discussed in this post was collected in April. As you look at higher reputation scores, the curve continues to drop of quickly. Since reputations follow a power law, the decrease is linear on a log scale.

Even though users with the highest reputation scores vote the most, most votes come from users with lower reputation scores. That’s just because the large majority of users have lower reputation scores. Users with reputation < 1400 account for a little over half the answer votes cast. They also account for over 96% of all users. If you turn this around, it says that nearly half the votes come from the top 4% of users in reputation. This explains in part why the best answers usually rise to the top: the most knowledgeable users are active voters, assuming reputation and knowledge are correlated.

(The situation is analogous to that of income taxes. The very wealthy pay the most taxes per person, but the bulk of tax revenue comes from those who are not so wealthy. Even so, the percentage of total tax revenue from the top earners is surprisingly high. According to this site, the top 1% of tax payers were responsible for about 40% of all income tax revenue in 2008. The analogy holds for good reasons. Wealth, like StackOverflow reputation, follows a power law distribution. And taxes increase roughly linearly with wealth the same way StackOverflow votes increase with reputation.)

In short, it looks like StackOverflow users are civic minded. Those who receive the most votes also give the most votes. And users in the lower end of the reputation range cast most of the votes in total even though they cast fewer votes per person.

Related post: StackOverflow reputation statistics

{ 5 comments }

StackOverflow reputation statistics

by John on March 2, 2009

What is the distribution of StackOverflow user reputation scores? Here’s a graph of the number of users with reputations between 0 and 100, 100 and 200, …, 900 and 1000. (The scores go out much further, but the curve looks flat compared to the peak unless you zoom in further.)

This graph was based on a snapshot of the user reputations one day last week. The largest group, 15,219 users, had reputation less than 100. There were 2,494 users with reputation between 100 and 200, etc. The number of users in a 100-point reputation range generally decreases as the reputation score increases. The majority of users have reputation less than 100, and yet the top percentile have reputations over 4,800 and the highest reputation was 38,700. This sort of extreme disparity suggests a power law distribution.

The test for whether the reputation scores follow a power law is to plot the logarithms of the number of people with each score and look for a straight line. And after an initial steep drop off, the logs of the counts do fall roughly on a straight line.

(The graph goes out to scores below 7,700. Beyond that point, there are a few empty 100-point ranges. I stopped at 7,700 to avoid taking logs of zeros.)

The average reputation was 364, though the average does not mean much with a power law distribution. Instead of a bell shape centered around the average, there is a long tail. The average is not the middle because there is no middle to a power law.

Update: As pointed out in the comments, I should have plotted with the log of the reputation score to test for a power law distribution. Whether or not there is a power law here, however, there is a long tail and there’s no useful “middle.”

Other posts about StackOverflow:

Voting patterns on StackOverflow
StackOverflow question statistics

Other posts about power laws:

Networks and power laws
Rate of regularizing English verbs
Metabolism and power laws

{ 5 comments }

Rate of regularizing English verbs

by John on November 1, 2008

English verbs form the past tense two ways. Regular verbs ad “ed” to the end of the verb. For example, “work” becomes “worked.” Irregular verbs, sometimes called “strong” verbs, require internal changes to form the past tense. For example, “sing” becomes “sang” and “do” becomes “did.”

Irregular verbs are becoming regularized over time. For example, “help” is now a regular verb, though its past tense was once “holp.” (I’ve heard that you can still occasionally hear someone use archaic forms such as “holp” and “holpen.”)

What I find most interesting about this change quantifying the rate of change. It appears that the half-life of an irregular verb is inversely proportional to the square root of its frequency. Rarely used irregular verbs are most quickly regularized while commonly used irregular verbs are the most resistant to change.

Exceptions have to be constantly reinforced to keep speakers from applying the more general rules. Exceptions that we hear less often get dropped over time. So it’s not surprising that half-life is a decreasing function of frequency. What is surprising is that that half life is such a simple decreasing function, a constant over square root.

See this Computerworld article, near the end of page 2.

{ 2 comments }

Networks and power laws

by John on October 19, 2008

In many networks — social networks, electrical power networks, networks of web pages, etc.— the number of connections for each node follows a statistical distribution known as a power law. Here’s a model of network formation that shows how power laws can emerge from very simple rules.

Albert-László Barabási describes the following algorithm in his book Linked. Create a network by starting with two connected nodes. Then add nodes one at a time. Every time a new node joins the network, it connects to two older nodes at random with probability proportional to the number of links the old nodes have. That is, nodes that already have more links are more likely to get new links. If a node has three times as many links as another node, then it’s three times as attractive to new nodes. The rich get richer, just like with Page Rank.

Barabási says this algorithm produces networks whose connectivity distribution follows a power law. That is, the number of nodes with n connections is proportional to n-k. In particular he says k = 3.

I wrote some code to play with this algorithm. As the saying goes, programming is understanding. There were aspects of the algorithm I never would have noticed had I not written code for it. For one thing, after I decided to write a program I had to read the description more carefully and I noticed I’d misunderstood a couple details.

If the number of nodes with n connections really is proportional to n-k, then when you plot the number of nodes with 1, 2, 3, … connections, the points should fall near a straight line when plotted on the log-log scale and the slope of this line should be -k.

In my experiment with 10,000 nodes, I got a line of slope -2.72 with a standard error of 0.04. Not exactly -3, but maybe the theoretical result only holds in the limit as the network becomes infinitely large. The points definitely fall near a straight line in log-log scale:

In this example, about half the nodes (5,073 out of 10,000) had only two connections. The average number of connections was 4, but the most connected node had 200 connections.

Note that the number of connections per node does not follow a bell curve at all. The standard deviation on the number of connections per node was 6.2. This means the node with 200 connections was over 30 standard deviations from the average. That simply does not happen with normal (Gaussian) distributions. But it’s not unusual at all for power law distributions because they have such thick tails.

Of course real networks are more complex than the model given here. For example, nodes add links over time, not just when they join a network. Barabási discusses in his book how some elaborations of the original model still produce power laws (possibly with different exponents) and while others do not.

{ 2 comments }