The Law of Medium Numbers

There’s a law of large numbers, a law of small numbers, and a law of medium numbers in between.

The law of large numbers is a mathematical theorem. It describes what happens as you average more and more random variables.

The law of small numbers is a semi-serious statement about how people underestimate the variability of the average of a small number of random variables.

The law of medium numbers is a term coined by Gerald Weinberg in his book An Introduction to General Systems Thinking. He states the law as follows.

For medium number systems, we can expect that large fluctuations, irregularities, and discrepancy with any theory will occur more or less regularly.

The law of medium numbers applies to systems too large to study exactly and too small to study statistically. For example, it may be easier to understand the behavior of an individual or a nation than the dynamics of a small community. Atoms are simple, and so are stars, but medium-sized things like birds are complicated. Medium-sized systems are where you see chaos.

Weinberg warns that medium-sized systems challenge science because scientific disciplines define their boundaries by the set of problems they can handle. He says, for example, that

Mechanics, then, is the study of those systems for which the approximations of mechanics work successfully.

He warns that we should not be misled by a discipline’s “success with systems of its own choosing.”

Weinberg’s book was written in 1975. Since that time there has been much more interest in the emergent properties of medium-sized systems that are not explained by more basic sciences. We may not understand these systems well, but we may appreciate the limits of our understanding better than we did a few decades ago.

Underwhelmed with progress

Virtual reality pioneer Jaron Lanier writes in his book You Are Not a Gadget about the lack of creativity in our use of computing power.

Let’s suppose that back in the 1980s I had said, “In a quarter century, when the digital revolution has made great progress and computer chips are millions of times faster than they are now, humanity will finally win the prize of being able to write a new encyclopedia and a new version of UNIX!” It would have sounded utterly pathetic.

The quote specifically alludes to Wikipedia and Linux, but Lanier is critical of web culture in general. I’m not sure what I think about his position, but at a minimum he provides a counterbalance to the people who speak about the web in messianic tones.

Something like a random sequence but …

When people ask for a random sequence, they’re often disappointed with what they get.

Random sequences clump more than most folks expect. For graphical applications, quasi-random sequences may be more appropriate. These sequences are “more random than random” in the sense that they behave more like what some folks expect from randomness. They jitter around like a random sequence, but they don’t clump as much.
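As an illustration, here is a sketch of one classic quasi-random sequence, the van der Corput sequence, which fills the unit interval evenly by reflecting the base-b digits of the integers across the radix point. (The code below is only an illustration, not from the original post.)

```python
def van_der_corput(n, base=2):
    """Return the nth van der Corput point: reverse the base-b digits
    of n across the radix point, e.g. 6 = 110 in binary -> 0.011 = 0.375."""
    x, denom = 0.0, 1.0
    while n > 0:
        denom *= base
        n, digit = divmod(n, base)
        x += digit / denom
    return x

# Successive points fill the interval without clumping:
points = [van_der_corput(n) for n in range(1, 8)]
# [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```

Note how each new point lands in the largest gap left by the previous points, which is exactly the non-clumping behavior described above.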

Researchers conducting clinical trials are dismayed when a randomized trial puts several patients in a row on the same treatment. They want to assign patients one at a time to one of two treatments with equal probability, but they also want the allocation to work out evenly. This is like saying you want to flip a coin 100 times, and you also want to get exactly 50 heads and 50 tails. You can’t guarantee both, but there are effective compromises.

One approach is to randomize in blocks. For example, you could randomize in blocks of 10 patients by taking a sequence of 5 A's and 5 B's and randomly permuting the 10 letters. This guarantees that the allocations will be balanced, but some outcomes will be predictable. At a minimum, the last assignment in each block is always predictable: you assign whatever is left. Assignments could be even more predictable: if you give n A's in a row in a block of 2n, you know the last n assignments will be all B's.
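Block randomization is only a few lines of code. Here is a sketch (the function name and block-size parameter are just for illustration):

```python
import random

def block_randomize(block_size=10):
    """One block: equal numbers of A's and B's in random order."""
    half = block_size // 2
    block = ["A"] * half + ["B"] * half
    random.shuffle(block)
    return block

# Each block is perfectly balanced, but the tail of a block
# becomes predictable once one treatment is exhausted.
print(block_randomize())
```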

Another approach is to “encourage” balance rather than enforce it. When you’ve given more A's than B's you could increase the probability of assigning a B. The greater the imbalance, the more heavily you bias the randomization probability in favor of the treatment that has been assigned less. This is a sort of compromise between equal randomization and block randomization. All assignments are random, though some assignments may be more predictable than others. Large imbalances are less likely than with equal randomization, but more likely than with block randomization. You can tune how aggressively the method responds to imbalances in order to make the method more like equal randomization or more like block randomization.
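One way to encourage balance is to let the odds of the lagging treatment grow with the imbalance. The sketch below is an illustration of the idea, not any particular published scheme; the function names and the `bias` parameterization are invented for this example.

```python
import random

def prob_A(n_a, n_b, bias=2.0):
    """Probability of assigning treatment A given the counts so far.

    bias=1 gives equal randomization; larger bias reacts more
    aggressively to imbalance, approaching block-like behavior."""
    odds_a = bias ** (n_b - n_a)    # shrink the odds of A when A is ahead
    return odds_a / (1.0 + odds_a)

def adaptive_assign(n_a, n_b, bias=2.0):
    """Randomly assign the next patient, biased toward the lagging arm."""
    return "A" if random.random() < prob_A(n_a, n_b, bias) else "B"
```

For example, with bias=2 and three more A's than B's assigned so far, the next patient gets B with probability 8/9, yet every assignment remains random.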

No approach to randomization will satisfy everyone because there are conflicting requirements. Randomization is a dilemma to be managed rather than a problem to be solved.

Random improvisation subjects

Destination ImagiNation is a non-profit organization that encourages student creativity. This is my family’s first year to participate in DI and it has been a lot of fun. One of the things that impresses me most about DI is that they have strict rules limiting adult input.

This weekend I was an appraiser at a DI competition for an improvisation challenge. Teams could prepare for the overall format of the challenge, but some elements of the challenge were randomly selected on the day of the competition. This year the improvisations centered around endangered things. Teams were given a list of 10 endangered things ahead of time, but they wouldn’t know which thing would be theirs until just before they had to perform. Some of the things on the list were endangered animals, such as the giant panda. There were also other things in danger of disappearing, such as the VHS tape. The students also had to use a randomly chosen stock character and had to include a character with a randomly chosen “unimpressive superpower.”

There were 13 teams in the elementary division. What would you expect when 13 teams each randomly draw from a list of 10 endangered things? Obviously some endangered thing has to be chosen at least twice. Would you expect every item on the list to be chosen at least once? How many times would you expect the most common item to be chosen?

In our case, three teams were assigned “glaciers” and five were assigned “the landline telephone.” The other items were assigned once or not at all. (No one was assigned “the Yiddish language”. Too bad. I really wanted to see what the students would do with that one.)

Is there reason to suspect that the assignments were not random? How likely is it that, in a competition of 13 teams, five or more teams would be given the same subject? How likely is it that every subject would be used at least once? See an explanation here. Make a guess before looking at my answer.

Here’s some Python code you could use to simulate the selection of endangered things.

from random import random

num_reps     = 100000 # number of simulation repetitions
num_subjects = 10     # number of endangered things
num_teams    = 13     # number of teams competing

def maxperday():
    tally = [0] * num_subjects
    for i in range(num_teams):
        subject = int(random()*num_subjects)
        tally[subject] += 1
    return max(tally)

total = 0
for rep in range(num_reps):
    if maxperday() >= 5:
        total += 1
print(float(total)/num_reps)
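A similar simulation, added here in the same style, estimates the chance that every subject gets used at least once:

```python
from random import randrange

num_reps     = 100000 # number of simulation repetitions
num_subjects = 10     # number of endangered things
num_teams    = 13     # number of teams competing

def all_used():
    """Did all 10 subjects get drawn at least once by the 13 teams?"""
    tally = [0] * num_subjects
    for i in range(num_teams):
        tally[randrange(num_subjects)] += 1
    return min(tally) > 0

covered = sum(1 for rep in range(num_reps) if all_used())
print(float(covered)/num_reps)   # roughly 0.014
```

So full coverage of the list is actually quite unlikely, which makes the observed outcome (two items unused) unsurprising.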


Popular research areas produce more false results

The more active a research area is, the less reliable its results are.

In his paper Why most published research findings are false, John Ioannidis suggested that popular areas of research publish a greater proportion of false results. Of course popular areas produce more results, and so they will naturally produce more false results. But Ioannidis is saying that they also produce a greater proportion of false results.

Now Thomas Pfeiffer and Robert Hoffmann have produced empirical support for Ioannidis’s theory in the paper Large-Scale Assessment of the Effect of Popularity on the Reliability of Research. Pfeiffer and Hoffmann review two reasons why popular areas have more false results.

First, in highly competitive fields there might be stronger incentives to “manufacture” positive results by, for example, modifying data or statistical tests until formal statistical significance is obtained. This leads to inflated error rates for individual findings: actual error probabilities are larger than those given in the publications. … The second effect results from multiple independent testing of the same hypotheses by competing research groups. The more often a hypothesis is tested, the more likely a positive result is obtained and published even if the hypothesis is false.

In other words,

1. In a popular area there’s more temptation to fiddle with the data or analysis until you get what you expect.
2. The more people who test an idea, the more likely someone is going to find data in support of it by chance.

The authors produce evidence of the two effects above in the context of papers written about protein interactions in yeast. They conclude that “The second effect is about 10 times larger than the first one.”

“Noncommercial” is fuzzy

It is common for software, photos, and other creative works to be free for noncommercial use. I appreciate the generosity of those who want to give away their creations, and I appreciate the business savvy of those who see giving some things away as a way to make more money elsewhere. But “noncommercial” is a fuzzy term.

What exactly is noncommercial use? If I include a photo in software that I give away, is that noncommercial use? What if someone includes the same photo in iTunes? That’s software that is freely given away, although it’s clearly a distribution channel for Apple music sales. What about Internet Explorer? Microsoft gives away IE, and it’s not an obvious distribution channel for Microsoft, but many people would call IE commercial software. Is it the nature of the organization rather than the nature of the product that determines whether something is non-commercial?

Sometimes “noncommercial” is used as an opposite of “professional.” But what about employees of charitable organizations such as the American Red Cross? Is a Red Cross relief worker in Haiti doing noncommercial work? What about a lawyer working at Red Cross headquarters? Would it change anything if the lawyer were a volunteer?

Sometimes “educational” is used as a synonym for noncommercial. But if your profession is education, is your work professional or educational? Does it matter whether a school is public or private? Most people would agree that a student doing a homework assignment is engaged in noncommercial activity. What if the student is a teaching assistant receiving a small salary? In that case is it noncommercial use when the student is doing his own homework but commercial use when he’s preparing to teach a class? Isn’t education almost always a commercial activity? After all, why are students in school? They’re preparing to make a living at something. They may have blatantly commercial motives for doing their homework.

Not only can you argue that educational use is commercial, you can argue that commercial use is educational. If an accountant looks up a tax regulation, they’re trying to learn something. Isn’t that educational? Is it educational use when a student looks up a tax regulation but commercial use when an accountant looks up the same regulation?

Individuals and organizations are free to define “commercial” or “noncommercial” use however they please. Personally, I’d rather either sell something or give it away without regard for how it’s going to be used.

Economizing approximations

The most obvious approximation may not be the best. But sometimes a small change to an obvious approximation can make it a better approximation. This post gives an example illustrating this point.

The easiest, most obvious way to obtain a polynomial approximation is to use a truncated Taylor series. But such a polynomial may not be the most efficient approximation. Forman Acton gives the example of approximating cos(π x) on the interval [-1, 1] in his classic book Numerical Methods that Work. The point of this example is not the usefulness of the final result; a footnote below explains that this isn’t how cosines are computed in practice. The point is that you can sometimes improve a convenient but suboptimal approximation with a small change.

The goal in Acton’s example is to approximate cos(π x) with a maximum error of less than 10^-6 across the interval. The Taylor polynomial

1 - (πx)^2/2! + (πx)^4/4! - … + (πx)^16/16!

is accurate to within about 10^-7 and so is certainly good enough. However the last term of the series, the x^16 term, contributes less than the other terms to the accuracy of the approximation. On the other hand, this term cannot simply be discarded because without it the error rises to 10^-5. The clever idea is to replace the x^16 term with a linear combination of the other terms. After all, x^16 doesn’t look that different from x^14 or x^12. Acton uses the 16th Chebyshev polynomial to approximate x^16 by a combination of smaller even powers of x. This new approximation is almost as accurate as the original Taylor polynomial with an error that remains below the desired threshold. Acton calls this process economizing an approximation.

This process could be repeated to see whether the x14 term could be eliminated. Or you could directly find a Chebyshev series approximation to cos(π x) from the beginning. Acton did not have a symbolic computation package like Mathematica when he wrote his book in 1970 and so he was computing his approximations by hand. Directly computing a Chebyshev approximation would have been a lot of work. By just replacing the highest order term, he achieved nearly the same effect but with less effort.
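What Acton did by hand is easy to reproduce by machine. The sketch below (an illustration using only the standard library; the helper names are invented) builds the degree-16 Taylor polynomial, subtracts the right multiple of T_16 to kill the x^16 term, and checks that the error stays under 10^-6:

```python
from math import factorial, pi, cos

def cheb_T(n):
    """Coefficients of the Chebyshev polynomial T_n, ascending powers,
    built from the recurrence T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t_cur = [1.0], [0.0, 1.0]
    for _ in range(n):
        t_next = [0.0] + [2.0 * c for c in t_cur]   # 2x * T_k
        for i, c in enumerate(t_prev):              # ... minus T_{k-1}
            t_next[i] -= c
        t_prev, t_cur = t_cur, t_next
    return t_prev

# Taylor coefficients of cos(pi x) through x^16 (even powers only)
p = [0.0] * 17
for k in range(9):
    p[2*k] = (-1)**k * pi**(2*k) / factorial(2*k)

# The leading coefficient of T_16 is 2^15, so subtracting
# p[16]/2^15 * T_16 removes the x^16 term exactly.
T16 = cheb_T(16)
econ = [c - p[16] * t / 2**15 for c, t in zip(p, T16)]

def horner(coeffs, x):
    """Evaluate a polynomial from ascending coefficients."""
    y = 0.0
    for c in reversed(coeffs):
        y = y * x + c
    return y

grid = [i / 500.0 - 1.0 for i in range(1001)]
max_err = max(abs(cos(pi * x) - horner(econ, x)) for x in grid)
```

The economized polynomial has degree 14 rather than 16, yet its maximum error on [-1, 1] is still on the order of 10^-7, safely below the 10^-6 target.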

Computing power has improved several orders of magnitude since Acton wrote his book, and some of his examples now seem quaint. However, I don’t know of a better book for teaching how to think about numerical analysis than Numerical Methods that Work. Acton has another good book that is harder to find, Real Computing Made Real: Preventing Errors in Scientific and Engineering Calculations.

Footnote 1: evaluating polynomials

Suppose you want to write code to evaluate the polynomial

P(x) = a_0 + a_2 x^2 + a_4 x^4 + … + a_14 x^14.

The first step would be to reduce this to a 7th degree polynomial in y = x^2.

Q(y) = a_0 + a_1 y + a_2 y^2 + … + a_7 y^7.

Directly evaluating Q(y) would take 1 + 2 + 3 + … + 7 = 28 multiplications, computing every power of y directly. For example, computing y^5 as y*y*y*y*y. Writing the polynomial in nested (Horner) form is much more efficient:

((((((a_7 y + a_6)y + a_5)y + a_4)y + a_3)y + a_2)y + a_1)y + a_0

which takes only 7 multiplications.
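The nested form above translates directly into a loop. A sketch in Python (for illustration only):

```python
def horner(coeffs, y):
    """Evaluate a0 + a1*y + ... + an*y^n from ascending coefficients,
    using one multiplication and one addition per coefficient."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * y + a
    return result

horner([1, 2, 3], 10)   # 1 + 2*10 + 3*100 = 321
```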

Footnote 2: computing sine and cosine

The point of Acton’s example was to improve on a Taylor polynomial evaluated a moderate distance from the point where the Taylor series is centered. It does not illustrate how cosines are actually computed. See this answer on StackOverflow for an outline of how trig functions are computed in practice.

Self-sufficiency is the road to poverty

In his podcast Roberts on Smith, Ricardo, and Trade, Russ Roberts states that self-sufficiency is the road to poverty. Roberts elaborates on the economic theories of Adam Smith and David Ricardo to explain how specialization and trade create wealth and how radical self-sufficiency leads to poverty.

Suppose you decide to grow your own food. Are you going to buy your gardening tools from Ace Hardware? If you really want to be self-reliant, you should make your own tools. Are you going to take your chances with what water happens to fall on your property, or are you going to rely on municipal water? Are you going to forgo fertilizer or rely on someone else to sell it to you? Carried to extremes, self-reliance ends in a Robinson Crusoe-like existence.

People in poor countries are often poor because they are self-reliant in the sense that they must do many things for themselves. They do not have the opportunities for specialization and trade that are available to those who live in more prosperous countries.

Some degree of self-reliance makes economic sense. Transaction costs, for example, make it impractical to outsource small tasks. It also makes sense to do some things that are not economically efficient. For example, an orthodontist may choose to make some of her own clothing or keep a garden for the pleasure of doing so, not because these activities are worth her time. In general, however, specialization and large trading communities are the road to prosperity. Without a large economic community, no one can become an orthodontist (or an accountant, barista, electrician, …).

Why do we so often value self-sufficiency more than specialization and trade? Here are three reasons that come to mind.

1. In America, self-sufficiency is deeply rooted in our culture. We admire the pioneer spirit, and this leads to seeing as virtues actions that were once a necessity.
2. Self-sufficient people are generally well liked, especially if they’re not too prosperous. Conversely, those who create wealth by leveraging the labor of others are often treated with suspicion and jealousy.
3. Our school system encourages “well-roundedness” rather than excellence. The way to succeed is to be moderately good at everything, even if you’re not outstanding at anything. (More on this idea here.)

Update: After writing this post, I read Russ Roberts’s book The Choice: A Fable of Free Trade and Protectionism. I discovered one of the later chapters is entitled “Self-Sufficiency Is the Road to Poverty.” Excellent book.

Statistical functions in Excel

Depending on your expectations, you may have different reactions to the statistical function support in Excel. If you expect anything similar to a statistical package, you’ll be sorely disappointed. But if you think of Excel as a spreadsheet for everybody that sometimes lets you do statistical tasks right there without having to open up a statistical package, you’ll be pleased.

I was looking into the functions in Excel 2007 while preparing for a class I taught yesterday. I wanted to emphasize that certain functions are everywhere, not only in mathematical packages like Mathematica and R, but also in Python and even Excel.

Excel’s set of functions is inconsistent, both in the functionality provided and in the names it uses. Having an asymmetric API makes it harder to remember what is available and how to use it. On the other hand, the most commonly needed functions are available. The functions are individually reasonable even though they do not fit together into a simple pattern.

For details, see my notes Probability distributions in Excel 2007.

I discovered along the way that Excel has a GAMMALN function to compute the logarithm of the Gamma function Γ(x). This is a very useful function to have, even more useful than the Gamma function itself for reasons explained here.
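Python exposes the same function as math.lgamma. One common use, given here as an illustration (the helper name log_binom is invented), is computing binomial coefficients on a log scale to avoid overflow:

```python
from math import lgamma, exp

def log_binom(n, k):
    """Log of the binomial coefficient C(n, k) computed via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

exp(log_binom(10, 5))    # 252, up to floating point error
log_binom(2000, 1000)    # about 1382; C(2000, 1000) itself overflows a double
```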

Top four LaTeX mistakes

Here are four of the most common typesetting errors I see in books and articles created with LaTeX.

1) Quotes

Quotation marks in LaTeX files begin with two backticks, ``, and end with two single quotes, ''.

The first “Yes” was written as

``Yes.''

in LaTeX while the one with the backward opening quote was written as

"Yes."

2) Differentials

Differentials, most commonly the dx at the end of an integral, should have a little space separating them from other elements. The “dx” is a unit and so it needs a little space to keep from looking like the product of “d” and “x.” You can do this in LaTeX by inserting \, before and between differentials.

The first integral was written as

 \int_0^1 f(x) \, dx

while the second forgot the \, and was written as

 \int_0^1 f(x)  dx

The need for a little extra space around differentials becomes more obvious in multiple integrals.

The first was written as

dx \, dy = r \, dr \, d\theta

while the second was written as

dx  dy = r  dr  d\theta

3) Multi-letter function names

The LaTeX commands for typesetting functions like sin, cos, log, max, etc. begin with a backslash. The command \log keeps “log,” for example, from looking like the product of variables “l”, “o”, and “g.”

The first example above was written as

\log e^x = x

and the second as

log e^x = x

The double angle identity for sine is readable when properly typeset and a jumbled mess when the necessary backslashes are left out.

The first example was written

\sin 2u = 2 \sin u \cos u

and the second as

sin 2u = 2 sin u cos u

4) Failure to use math mode

LaTeX uses math mode to distinguish variables from ordinary letters. Variables are typeset in math italic, a special style that is not the same as ordinary italic prose.

The first sentence was written as

Given a matrix $A$ and vector $b$, solve $Ax = b$.

and the second as

Given a matrix A and vector b, solve Ax = b.

Euclid’s proof that there are infinitely many primes

Paul Erdős had this notion that God kept a book of the best proofs. Erdős called God’s book simply “the book.” Springer recently published Proofs from THE BOOK, a collection of elegant proofs that the authors suppose might be in God’s book.

For many mathematicians, the first proof that comes to mind as a candidate for a proof in the book is Euclid’s proof that there are infinitely many primes. The proof is so short and simple that it can almost be written within the 140-character limit of Twitter. I give the proof in my AlgebraFact Twitter account as follows.

There are infinitely many primes. Proof: If not, multiply all primes together and add 1. Now you’ve got a new prime.

Several people pointed out that I was a little too terse. If N is the supposed product of all primes + 1, N itself doesn’t have to be a new prime. It could be that N has a factor that is a new prime. The smallest prime factor of N must be a prime not on the list. Either way “you’ve got a new prime,” but the proof above leaves out some important details.

Here’s an example. Suppose we believed the only primes were 2, 3, 5, 7, 11, and 13. Let N = 2*3*5*7*11*13 + 1 = 30031. Now N cannot be divisible by 2, 3, 5, 7, 11, or 13 because there is a remainder of 1 when dividing N by any of these numbers. Maybe N is a new prime, or maybe N is composite but N has a prime factor not on our list. The latter turns out to be the case: 30031 = 59*509.
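The example above is easy to verify with a few lines of code (added here as an illustration):

```python
def smallest_prime_factor(n):
    """Find the smallest prime factor of n by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n   # n itself is prime

primes = [2, 3, 5, 7, 11, 13]
N = 2 * 3 * 5 * 7 * 11 * 13 + 1   # 30031
# N leaves remainder 1 when divided by every prime on the list,
# and its smallest prime factor, 59, is not on the list.
print(N, smallest_prime_factor(N))
```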

If you can find a better summary of Euclid’s proof in 140 characters or less, please leave it in the comments. Please include the character count with your proposal.

For daily tweets on algebra and other math, follow @AlgebraFact on Twitter.

A book so good I had to put it down

“I couldn’t put it down.”  “A real page-turner.” That’s how you might describe a good novel to take on vacation. But for more serious reading, a good book is one you have to put down. Thoreau put it this way:

A truly good book teaches me better than to read it. I must soon lay it down, and commence to living on its hint.

Some books take a long time to read, not because they are dull, but because they are exciting. You have to put them down frequently to think about what you’ve read before reading more. It may not be the content of the book per se but the thoughts the book sparks that make you have to put it down.

What are some books you had to put down frequently because they stirred your thinking?

Using py2exe with SciPy

py2exe is a program that takes Python code and produces a Windows executable that can run on computers that do not have Python installed. My focus here is in using py2exe on Python code that depends on SciPy.

py2exe is itself a Python program, and its latest version is built for Python 2.6. The code I want it to compile is written with Python 2.5 because the latest version of SciPy depends on Python 2.5. How do I tell py2exe that my code uses an earlier version of Python?

It took me a while to realize this, but py2exe has two version numbers: one for the version of py2exe and one for the version of Python. The key was to download and install py2exe-0.6.9.win32-py2.5.exe. This is the latest version of py2exe (version 0.6.9) for an earlier version of Python (version 2.5). The py2exe software runs in the same version of Python as the code it is compiling.

I tested py2exe on a script helloscipy.py as follows:

from scipy.special import gamma
print "Gamma(1/2) = ", gamma(0.5)

The instructions given in the py2exe tutorial don’t quite work because of the dependence on SciPy, but the site gives an example of how to modify the setup.py file to specify the dependence on SciPy. Here’s where things get a little murky. The instructions don’t say exactly how to modify the setup.py file, but they give strong hints. I know my code depends on scipy.special, but I don’t know what further dependencies I might have. Here’s the setup file I used.

from distutils.core import setup
import py2exe

excludes = []
includes = ["scipy.special"]

opts = {
    "py2exe": {
        "includes": includes,
        "excludes": excludes
    }
}

setup(console=['helloscipy.py'], options=opts)

This worked. The output listed 94 additional SciPy dependencies that I might need to include, some of which were clearly not needed. I was pretty sure, for example, that my program did not need email.Utils. Apparently I didn’t need any of these because the code worked fine.

It’s not clear just what files need to be distributed along with the .exe file that py2exe produces. py2exe creates a dist directory with the .exe file as well as other files that you might need, primarily .dll and .pyd files. Many of these were obviously unnecessary. I knew, for example, that my little command line program did not depend on Tk graphics. I deleted these and the code worked fine as expected.

A little optimization and a challenge

Waldir Pimenta asked me whether it is possible to test the condition

max(a, b) / min(a, b) < r

without computing max(a, b) or min(a, b). Here a > 0, b > 0, and r > 1. (If r ≤ 1, the condition is always false.) The inequality has to be evaluated in an inner loop of a real-time image processing application, and so the time saved by avoiding computing the max and min might matter.

We can multiply the original inequality by min(a, b) since a and b are positive. The condition becomes

max(a, b) < r min(a, b) = min(ra, rb).

We then expand this inequality into four simple inequalities:

a < ra
b < ra
a < rb
b < rb

Since r > 1, we know that a < ra and b < rb and so we only need to check a < rb and b < ra. In C notation we would evaluate

 ( a < r*b && b < r*a )

rather than

 ( max(a,b) / min(a,b) < r )

Whether this saves any time depends on context, though it’s plausible that the former might be more efficient. Some portion of the time the condition a < r*b will evaluate to false and the second half of the condition will not be evaluated.

(C evaluates an expression like (p && q) from left to right. If p is false, q is not evaluated since it is known at that point that the expression (p && q) will evaluate to false regardless of the value of q.)
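The two forms are mathematically equivalent for positive a and b, though floating point rounding could in principle disagree right at the boundary where the ratio equals r. A quick sanity check in Python (a sketch; the function names are invented for the comparison):

```python
import random

def direct(a, b, r):
    """The original test, with a division and a max/min."""
    return max(a, b) / min(a, b) < r

def optimized(a, b, r):
    """The rewritten test: two multiplies, no division."""
    return a < r*b and b < r*a

# Compare the two forms on random positive inputs.
random.seed(1)
for _ in range(10000):
    a = random.uniform(0.01, 10.0)
    b = random.uniform(0.01, 10.0)
    r = random.uniform(1.001, 5.0)
    assert direct(a, b, r) == optimized(a, b, r)
```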

Challenge: How often will the condition a < r*b evaluate to false? What are your assumptions? Leave your answers below.

Update: Combining the ideas of Carlos Santos and Christian Oudard in the comments below, you could implement the inequality as

a < b ? (b < r*a) : (a < r*b)

There are too many other good ideas in the comments for me to keep editing the blog post so I’ll just ask that you scroll down. I’m surprised at all the ideas that came out of evaluating a simple expression.

Amateur software

I’m growing increasingly frustrated with amateur software. Before I explain why, let me first be clear on what I do not mean by amateur.

• Amateur does not mean low quality. Some amateur software is outstanding, and some professional software is terrible.
• Amateur does not mean open source. Some amateur projects are open source and some are not.

I’m using “amateur software” to mean software projects developed by volunteers. I imagine most amateur software is written by professional developers. These are folks paid to write software for a company by day who then work on something else they love by night.

Open source software is not necessarily amateur software. Linux, for example, is now professional software. Around 75% of Linux kernel development is carried out by people paid to work on Linux. Some of the best software is both open source and at least partially professional.

Volunteers do what they want to do by definition. The problem is that the reverse is also true: volunteers do not do what they do not want to do. And for software developers, writing documentation usually falls in the “do not want to do” column. So does making software easy to install. So does testing in multiple environments.

When a company has an interest in a piece of software, they can pay people to do the tasks the volunteers don’t want to do. In fact, if they’re smart, they will concentrate their efforts precisely on the tasks volunteers don’t want to do. In this way even one or two paid staff can make an enormous contribution to a largely volunteer project.

Some amateur projects are highly polished. These may be small projects led by rare individuals who pay attention to details beyond pure software development. More often, these are large mature projects that have so many volunteers that they have a few who are willing to do tasks that most developers do not want to do.
