StackOverflow reputation statistics

What is the distribution of StackOverflow user reputation scores? Here’s a graph of the number of users with reputations between 0 and 100, 100 and 200, …, 900 and 1000. (The scores go out much further, but the curve looks flat compared to the peak unless you zoom in further.)

This graph was based on a snapshot of the user reputations one day last week. The largest group, 15,219 users, had reputation less than 100. There were 2,494 users with reputation between 100 and 200, etc. The number of users in a 100-point reputation range generally decreases as the reputation score increases. The majority of users have reputation less than 100, and yet the top percentile have reputations over 4,800 and the highest reputation was 38,700. This sort of extreme disparity suggests a power law distribution.
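Here is a minimal sketch of how counts like these could be produced from a list of reputation scores. The function name `bin_counts`, the half-open ranges (a score of exactly 100 lands in the 100 to 200 group), and the sample scores are all assumptions made for illustration; they are not the snapshot data or the method used for the graph.

```python
from collections import Counter

def bin_counts(reputations, width=100):
    """Count users per reputation range [k*width, (k+1)*width)."""
    counts = Counter(rep // width for rep in reputations)
    # Key each range by its lower edge; empty ranges are simply absent.
    return {k * width: counts[k] for k in sorted(counts)}

# Made-up scores for illustration only, not the StackOverflow snapshot.
sample = [1, 15, 42, 99, 100, 101, 250, 980, 4800]
hist = bin_counts(sample)
for low, n in hist.items():
    print(f"{low:>5} to {low + 100:<5} {n}")
```

Ranges with no users do not appear in the result at all, which is convenient if the counts are later passed through a logarithm.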

The test for whether the reputation scores follow a power law is to plot the logarithms of the number of people in each score range against the score and look for a straight line. And after an initial steep drop-off, the logs of the counts do fall roughly on a straight line.

(The graph goes out to scores below 7,700. Beyond that point, there are a few empty 100-point ranges. I stopped at 7,700 to avoid taking logs of zeros.)

The average reputation was 364, though the average does not mean much with a power law distribution. Instead of a bell shape centered around the average, there is a long tail. The average is not the middle because there is no middle to a power law.
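To see why the average says little about a typical user in a long-tailed distribution, here is a small sketch that compares the mean and median of a synthetic heavy-tailed sample. The helper `pareto_quantiles` and the exponent `alpha=1.2` are arbitrary assumptions chosen for illustration; nothing here is fitted to the StackOverflow data.

```python
import statistics

def pareto_quantiles(n, alpha, x_min=1.0):
    """Evenly spaced quantiles of a Pareto(alpha) distribution with minimum x_min."""
    # Inverse CDF: x = x_min * (1 - u) ** (-1 / alpha), evaluated at u = (i + 0.5) / n.
    return [x_min * (1 - (i + 0.5) / n) ** (-1 / alpha) for i in range(n)]

# Synthetic scores; alpha=1.2 is an arbitrary choice, not an estimate.
scores = pareto_quantiles(n=1000, alpha=1.2)
mean = statistics.mean(scores)
median = statistics.median(scores)
share_below_mean = sum(s < mean for s in scores) / len(scores)
print(f"mean={mean:.2f} median={median:.2f} below mean={share_below_mean:.0%}")
```

In this sample a few very large values pull the mean well above the median, so most of the values sit below the average. That is the same pattern as an average reputation of 364 when most users are under 100.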

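Update: As pointed out in the comments, I should have plotted the log of the counts against the log of the reputation score (a log-log plot) to test for a power law distribution. A straight line on the semi-log plot I made points to an exponential law instead. Whether or not there is a power law here, however, there is a long tail and there’s no useful “middle.”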
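The distinction between a semi-log and a log-log plot can be checked numerically. The sketch below fits a least-squares line to each version of the data and reports how well it fits. The function names, the synthetic counts, and the use of R² as a measure of straightness are all assumptions for illustration. A high R² is only suggestive, not a rigorous test for a power law.

```python
import math

def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)

def compare_fits(mids, counts):
    """Return (semi-log R^2, log-log R^2) for bin midpoints and positive counts."""
    log_counts = [math.log(c) for c in counts]
    log_mids = [math.log(m) for m in mids]
    return r_squared(mids, log_counts), r_squared(log_mids, log_counts)

# Synthetic counts (not the StackOverflow data): one power law, one exponential.
mids = [50 + 100 * k for k in range(20)]          # midpoints of 100-point ranges
power = [1e6 * m ** -1.5 for m in mids]           # straight on a log-log plot
expo = [1e4 * math.exp(-m / 300) for m in mids]   # straight on a semi-log plot

power_fit = compare_fits(mids, power)
expo_fit = compare_fits(mids, expo)
print("power law   semi-log R^2=%.3f  log-log R^2=%.3f" % power_fit)
print("exponential semi-log R^2=%.3f  log-log R^2=%.3f" % expo_fit)
```

The power-law counts come out straight only when both axes are logged, and the exponential counts come out straight only on the semi-log version. That is the point raised in the comments.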

Other posts about StackOverflow:

Voting patterns on StackOverflow
StackOverflow question statistics

Other posts about power laws:

Networks and power laws
Rate of regularizing English verbs
Metabolism and power laws

7 thoughts on “StackOverflow reputation statistics”

  1. It’s interesting to see an illustration of the disparity between Stack Overflow’s average users and its power users (pun intended, with apologies). If you invert the first graph you can see a classic Long Tail distribution, with many users whose questions and answers on the site are generating relatively few up votes (non-hits in Long Tail parlance). I’m not sure the publicly available data can tell us the whole story, though. I wonder if there’s an invisible group of users who are relatively inactive in asking and answering questions, but who are very active in voting for questions and answers. If so, the Long Tail users are having a bigger impact on the site than is apparent from the publicly available data.

  2. Nice analysis, but I have a remark:

    you need a straight line on a log-log scale to show a power law, and you are using a semi-log scale, where a straight line indicates an exponential law.

  3. @bandi: You are correct: I should have made a log-log plot.

    @Ludwig Weinzierl: Thanks. I corrected the mistake you pointed out.

  4. I haven’t done any more with this, but others have. StackOverflow now makes this data freely available, and some people have analyzed it extensively.

    Brian Bondy has analyzed the StackOverflow data to report reputation scores for users who have Twitter accounts. Maybe he has done other analysis.
