The Law of Medium Numbers

Posted on 25 February 2010 by John

There’s a law of large numbers, a law of small numbers, and a law of medium numbers in between.

The law of large numbers is a mathematical theorem. It describes what happens as you average more and more random variables.

The law of small numbers is a semi-serious statement about how people underestimate the variability of the average of a small number of random variables.

The law of medium numbers is a term coined by Gerald Weinberg in his book An Introduction to General Systems Thinking. He states the law as follows.

For medium number systems, we can expect that large fluctuations, irregularities, and discrepancy with any theory will occur more or less regularly.

The law of medium numbers applies to systems too large to study exactly and too small to study statistically. For example, it may be easier to understand the behavior of an individual or a nation than the dynamics of a small community. Atoms are simple, and so are stars, but medium-sized things like birds are complicated. Medium-sized systems are where you see chaos.

Weinberg warns that medium-sized systems challenge science because scientific disciplines define their boundaries by the set of problems they can handle. He says, for example, that

Mechanics, then, is the study of those systems for which the approximations of mechanics work successfully.

He warns that we should not be mislead by a discipline’s “success with systems of its own choosing.”

Weinberg’s book was written in 1975. Since that time there has been much more interest in the emergent properties of medium-sized systems that are not explained by more basic sciences. We may not understand these systems well, but we may appreciate the limits of our understanding better than we did a few decades ago.

Underwhelmed with progress

Posted on 24 February 2010 by John

Virtual reality pioneer Jaron Lanier writes in his book You Are Not a Gadget about the lack of creativity in our use of computing power.

Let’s suppose that back in the 1980s I had said, “In a quarter century, when the digital revolution has made great progress and computer chips are millions of times faster than they are now, humanity will finally win the prize of being able to write a new encyclopedia and a new version of UNIX!” It would have sounded utterly pathetic.

The quote specifically alludes to Wikipedia and Linux, but Lanier is critical of web culture in general. I’m not sure what I think about his position, but at a minimum he provides a counterbalance to the people who speak about the web in messianic tones.

Something like a random sequence but …

Posted on 24 February 2010 by John

When people ask for a random sequence, they’re often disappointed with what they get.

Random sequences clump more than most folks expect. For graphical applications, quasi-random sequence may be more appropriate.These sequences are “more random than random” in the sense that they behave more like what some folks expect from randomness. They jitter around like a random sequence, but they don’t clump as much.

Researchers conducting clinical trials are dismayed when a randomized trial puts several patients in a row on the same treatment. They want to assign patients one at a time to one of two treatments with equal probability, but they also want the allocation to work out evenly. This is like saying you want to flip a coin 100 times, and you also want to get exactly 50 heads and 50 tails. You can’t guarantee both, but there are effective compromises.

One approach is to randomize in blocks. For example, you could randomize in blocks of 10 patients by taking a sequence of 5 A‘s and 5 B‘s and randomly permuting the 10 letters. This guarantees that the allocations will be balanced, but some outcomes will be predictable. At a minimum, the last assignment in each block is always predictable: you assign whatever is left. Assignments could be even more predictable: if you give n A‘s in a row in a block of 2n, you know the last n assignments will be all B‘s.

Another approach is to “encourage” balance rather than enforce it. When you’ve given more A‘s than B‘s you could increase the probability of assigning a B. The greater the imbalance, the more heavily you bias the randomization probability in favor of the treatment that has been assigned less. This is a sort of compromise between equal randomization and block randomization. All assignments are random, though some assignments may be more predictable than others. Large imbalances are less likely than with equal randomization, but more likely than with block randomization. You can tune how aggressively the method responds to imbalances in order to make the method more like equal randomization or more like block randomization.

No approach to randomization will satisfy everyone because there are conflicting requirements. Randomization is a dilemma to be managed rather than a problem to be solved.

Random improvisation subjects

Posted on 23 February 2010 by John

Destination ImagiNation is a non-profit organization that encourages student creativity. This is my family’s first year to participate in DI and it has been a lot of fun. One of the things that impresses me most about DI is that they have strict rules limiting adult input.

This weekend I was an appraiser at a DI competition for an improvisation challenge. Teams could prepare for the overall format of the challenge, but some elements of the challenge were randomly selected on the day of the competition. This year the improvisations centered around endangered things. Teams were given a list of 10 endangered things ahead of time, but they wouldn’t know which thing would be theirs until just before they had to perform. Some of the things on the list were endangered animals, such as the giant panda. There were also other things in danger of disappearing, such as the VHS tape. The students also had to use a randomly chosen stock character and had to include a character with a randomly chosen “unimpressive superpower.”

There were 13 teams in the elementary division. What would you expect from 13 teams randomly selecting 10 endangered things? Obviously some endangered thing has to be chosen at least twice. Would you expect every item on the list to be chosen at least once? How often do you expect the most common item would be chosen?

In our case, three teams were assigned “glaciers” and five were assigned “the landline telephone.” The other items were assigned once or not at all. (No one was assigned “the Yiddish language”. Too bad. I really wanted to see what the students would do with that one.)

Is there reason to suspect that the assignments were not random? How likely is it that in a competition of 13 teams that five or more teams would be given the same subject? How likely is it that every subject would be used at least once? See an explanation here. Make a guess before looking at my answer.

Here’s some Python code you could use to simulate the selection of endangered things.

from random import random

num_reps     = 100000 # number of simulation repetitions
num_subjects = 10     # number of endangered things
num_teams    = 13     # number of teams competing

def maxperday():
    tally = [0] * num_subjects
    for i in range(num_teams):
        subject = int(random()*num_subjects)
        tally[subject] += 1
    return max(tally)

total = 0
for rep in range(num_reps):
    if maxperday() &gt;= 5:
        total += 1
print float(total)/num_reps

Popular research areas produce more false results

Posted on 19 February 2010 by John

The more active a research area is, the less reliable its results are.

John Ioannidis suggested popular areas of research publish a greater proportion of false results in his paper Why most published research findings are false. Of course popular areas produce more results, and so they will naturally produce more false results. But Ioannidis is saying that they also produce a greater proportion of false results.

Now Thomas Pfeiffer and Robert Hoffmann have produced empirical support for Ioannidis’s theory in the paper Large-Scale Assessment of the Effect of Popularity on the Reliability of Research. Pfeiffer and Hoffmann review two reasons why popular areas have more false results.

First, in highly competitive fields there might be stronger incentives to “manufacture” positive results by, for example, modifying data or statistical tests until formal statistical significance is obtained. This leads to inflated error rates for individual findings: actual error probabilities are larger than those given in the publications. … The second effect results from multiple independent testing of the same hypotheses by competing research groups. The more often a hypothesis is tested, the more likely a positive result is obtained and published even if the hypothesis is false.

In other words,

In a popular area there’s more temptation to fiddle with the data or analysis until you get what you expect.
The more people who test an idea, the more likely someone is going to find data in support of it by chance.

The authors produce evidence of the two effects above in the context of papers written about protein interactions in yeast. They conclude that “The second effect is about 10 times larger than the first one.”

“Noncommercial” is fuzzy

Posted on 18 February 2010 by John

It is common for software, photos, and other creative works to be free for noncommercial use. I appreciate the generosity of those who want to give away their creations, and I appreciate the business savvy of those who see giving some things away as a way to make more money elsewhere. But “noncommercial” is a fuzzy term.

What exactly is noncommercial use? If I include a photo in software that I give away, is that noncommercial use? What if someone includes the same photo in iTunes? That’s software that is freely given away, although it’s clearly a distribution channel for Apple music sales. What about Internet Explorer? Microsoft gives away IE, and it’s not an obvious distribution channel for Microsoft, but many people would call IE commercial software. Is it the nature of the organization rather than the nature of the product that determines whether something is non-commercial?

Sometimes “noncommercial” is used as an opposite of “professional.” But what about employees of charitable organizations such as the American Red Cross? Is a Red Cross relief worker in Haiti doing noncommercial work? What about a lawyer working at Red Cross headquarters? Would it change anything if the lawyer were a volunteer?

Sometimes “educational” is used as a synonym for noncommercial. But if your profession is education, is your work professional or educational? Does it matter whether a school is public or private? Most people would agree that a student doing a homework assignment is engaged in noncommercial activity. What if the student is a teaching assistant receiving a small salary? In that case is it noncommercial use when the student is doing his own homework but commercial use when he’s preparing to teach a class? Isn’t education almost always a commercial activity? After all, why are students in school? They’re preparing to make a living at something. They may have blatantly commercial motives for doing their homework.

Not only can you argue that educational use is commercial, you can argue that commercial use is educational. If an accountant looks up a tax regulation, they’re trying to learn something. Isn’t that educational? Is it educational use when a student looks up a tax regulation but commercial use when an accountant looks up the same regulation?

Individuals and organizations are free to define “commercial” or “noncommercial” use however they please. Personally, I’d rather either sell something or give it away without regard for how it’s going to be used.

Economizing approximations

Posted on 18 February 2010 by John

The most obvious approximation may not be the best. But sometimes a small change to an obvious approximation can make it better approximation. This post gives an example illustrating this point.

The easiest, most obvious way to obtain a polynomial approximation is to use a truncated Taylor series. But such a polynomial may not be the most efficient approximation. Forman Acton gives the example of approximating cos(π x) on the interval [-1, 1] in his classic book Numerical Methods that Work. The point of this example is not the usefulness of the final result; a footnote below explains that this isn’t how cosines are computed in practice. The point is that you can sometimes improve a convenient but suboptimal approximation with a small change.

The goal in Acton’s example is to approximate cos(π x) with a maximum error of less than 10^-6 across the interval. The Taylor polynomial

$\cos \pi x = 1 - \frac{\pi^2}{2!}x^2 + \frac{\pi^4}{4!}x^4 - \cdots + \frac{\pi^{16}}{16!}x^{16}$

is accurate to within about 10^-7 and so is certainly good enough. However the last term of the series, the x¹⁶ term, contributes less than the other terms to the accuracy of the approximation. On the other hand, this term cannot simply be discarded because without it the error rises to 10^-5. The clever idea is to replace the x¹⁶ term with a linear combination of the other terms. After all, x¹⁶ doesn’t look that different from x¹⁴ or x¹². Acton uses the 16th Chebyshev polynomial to approximate x¹⁶ by a combination of smaller even powers of x. This new approximation is almost as accurate as the original Taylor polynomial with an error that remains below the desire threshold. Acton calls this process economizing an approximation.

This process could be repeated to see whether the x¹⁴ term could be eliminated. Or you could directly find a Chebyshev series approximation to cos(π x) from the beginning. Acton did not have a symbolic computation package like Mathematica when he wrote his book in 1970 and so he was computing his approximations by hand. Directly computing a Chebyshev approximation would have been a lot of work. By just replacing the highest order term, he achieved nearly the same effect but with less effort.

Computing power has improved several orders of magnitude since Acton wrote his book, and some of his examples now seem quaint. However, I don’t know of a better book for teaching how to think about numerical analysis than Numerical Methods that Work. Acton has another good book that is harder to find, Real Computing Made Real: Preventing Errors in Scientific and Engineering Calculations.

Footnote 1: evaluating polynomials

Suppose you want to write code to evaluate the polynomial

P(x) = a₀ + a₂x² + a₄x⁴ + … + a₁₄x¹⁴.

The first step would be to reduce this to a 7th degree polynomial in y = x².

Q(y) = a₀ + a₁y + a₂y² + … + a₇y⁷.

Directly evaluating Q(y) would take 1 + 2 + 3 + … + 7 = 28 multiplications, computing every power of y directly. For example, computing y⁵ as y*y*y*y*y. Factoring the polynomial is much more efficient:

((((((a₇y + a₆)y + a₅)y + a₄)y + a₃)y + a₂)y + a₁)y + a₀

Footnote 2: computing sine and cosine

The point of Acton’s example was to improve on a Taylor polynomial evaluated a moderate distance from the point where the Taylor series is centered. It does not illustrate how cosines are actually computed. See this answer on StackOverflow for an outline of how trig functions are computed in practice.

Self-sufficiency is the road to poverty

Posted on 17 February 2010 by John

In his podcast Roberts on Smith, Ricardo, and Trade, Russ Roberts states that self-sufficiency is the road to poverty. Roberts elaborates on the economic theories of Adam Smith and David Ricardo to explain how specialization and trade create wealth and how radical self-sufficiency leads to poverty.

Suppose you decide to grow your own food. Are you going to buy your gardening tools from Ace Hardware? If you really want to be self-reliant, you should make your own tools. Are you going to take your chances with what water happens to fall on your property, or are you going to rely on municipal water? Are you going to forgo fertilizer or rely on someone else to sell it to you? Carried to extremes, self-reliance ends in a Robinson Crusoe-like existence.

People in poor countries are often poor because they are self-reliant in the sense that they must do many things for themselves. They do not have the opportunities for specialization and trade that are available to those who live in more prosperous countries.

Some degree of self-reliance makes economic sense. Transaction costs, for example, make it impractical to outsource small tasks. It also makes sense to do some things that are not economically feasible. For example, an orthodontist may choose to make some of her own clothing or keep a garden for the pleasure of doing so, not because these activities are worth her time. In general, however, specialization and large trading communities are the road to prosperity. Without a large economic community, no one can become an orthodontist (or an accountant, barrista, electrician, …)

Why do we so often value self-sufficiency more than specialization and trade? Here are a three reasons that come to mind.

In America, self-sufficiency is deeply rooted in our culture. We admire the pioneer spirit, and this leads to seeing as virtues actions that were once a necessity.
Self-sufficient people are generally well liked, especially if they’re not too prosperous. Conversely, those who create wealth by leveraging the labor of others are often treated with suspicion and jealously.
Our school system encourages “well-roundedness” rather than excellence. The way to succeed is to be moderately good at everything, even if you’re not outstanding at anything. (More on this idea here.)

Update: After writing this post, I read Russ Robert’s book The Choice: A Fable of Free Trade and Protectionism. I discovered one of the later chapters is entitled “Self-Sufficiency Is the Road to Poverty.” Excellent book.

Statistical functions in Excel

Posted on 17 February 2010 by John

Depending on your expectations, you may have different reactions to the statistical function support in Excel. If you expect anything similar to a statistical package, you’ll be sorely disappointed. But if you think of Excel as a spreadsheet for everybody that sometimes lets you do statistical tasks right there without having to open up a statistical package, you’ll be pleased.

I was looking into the functions in Excel 2007 while preparing for a class I taught yesterday. I wanted to emphasize that certain functions are everywhere, not only in mathematical packages like Mathematica and R, but also in Python and even Excel.

Excel’s set of functions is inconsistent, both in the functionality provided and in the names it uses. Having an asymmetric API makes it harder to remember what is available and how to use it. On the other hand, the most commonly needed functions are available. The functions are individually reasonable even though they do not fit together into a simple pattern.

For details, see my notes Probability distributions in Excel 2007.

I discovered along the way that Excel has a GAMMALN function to compute the logarithm of the Gamma function Γ(x). This is a very useful function to have, even more useful than the Gamma function itself for reasons explained here.

Top four LaTeX mistakes

Posted on 15 February 2010 by John

Here are four of the most common typesetting errors I see in books and articles created with LaTeX.

1) Quotes

Quotation marks in LaTeX files begin with two back ticks, ``, and end with two single quotes, ''.

The first “Yes” was written as

``Yes.''

in LaTeX while the one with the backward opening quote was written as

"Yes."

2) Differentials

Differentials, most commonly the dx at the end of an integer, should have a little space separating them from other elements. The “dx” is a unit and so it needs a little space to keep from looking like the product of “d” and “x.” You can do this in LaTeX by inserting \, before and between differentials.

The first integral was written as

 \int_0^1 f(x) \, dx

while the second forgot the , and was written as

 \int_0^1 f(x)  dx

The need for a little extra space around differentials becomes more obvious in multiple integrals.

The first was written as

dx \, dy = r \, dr \, d\theta

while the second was written as

dx  dy = r  dr  d\theta

3) Multi-letter function names

The LaTeX commands for typesetting functions like sin, cos, log, max, etc. begin with a backslash. The command log keeps “log,” for example, from looking like the product of variables “l”, “o”, and “g.”

The first example above was written as

\log e^x = x

and the second as

log e^x = x

The double angle identity for sine is readable when properly typeset and a jumbled mess when the necessary backslashes are left out.

The first example was written

\sin 2u = 2 \sin u \cos u

and the second as

sin 2u = 2 sin u cos u

4) Failure to use math mode

LaTeX uses math mode to distinguish variables from ordinary letters. Variables are typeset in math italic, a special style that is not the same as ordinary italic prose.

The first sentence was written as

Given a matrix $A$ and vector $b$, solve $Ax = b$.

and the second as

Given a matrix A and vector b, solve Ax = b.

Month: February 2010

The Law of Medium Numbers

Related posts

Underwhelmed with progress

Something like a random sequence but …

Related posts

Random improvisation subjects

Popular research areas produce more false results

Related posts

“Noncommercial” is fuzzy

Economizing approximations

Related posts

Self-sufficiency is the road to poverty

Related posts

Statistical functions in Excel

Related links

Top four LaTeX mistakes

Related posts