Incredibly simple approximation

Suppose you need to find the slope of a line through a set of data. You can get a surprisingly good approximation by simply fitting a line to the first and last points. This is known as “Bancroft’s rule.” It seems too good to be true. Of course fitting a line to just two points is not as good as using all the data, but unless you have a fairly large amount of data, it’s not too much worse either. It’s good enough for a quick estimate.

Just how good is this estimate compared to using all the data? We’ll look at the technical details of an example below.

Suppose you have a regression model y = α + βx + ε where ε is random noise. Suppose ε is normally distributed with mean 0 and variance σ². Let b be the least squares estimator of β. The variance in b is

\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}

Now suppose we have observations yᵢ corresponding to xᵢ = 0, 1, 2, …, 2n. The average value of x is n, and the denominator in the expression for the variance of the slope estimator is 2(1² + 2² + 3² + … + n²) = (2n³ + 3n² + n)/3. If we just use the data at x = 0 and x = 2n, the denominator is (0 – n)² + (2n – n)² = 2n².

If we divide the estimator variance based on Bancroft’s rule by the estimator variance using all the data, the σ² terms cancel and we are left with n/3 + 1/2 + 1/(6n). So Bancroft’s rule increases the variance of the slope estimate by roughly a factor of n/3 compared to using all the data, and thus it widens the confidence interval by a factor of roughly the square root of n/3. So with a couple dozen data points (n around 12), the confidence interval would be about twice as wide. Said another way, the estimate based on all the data is only about twice as good as the estimate based on just the first and last points.
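
As a sanity check on the factor above, here is a minimal simulation sketch. The true intercept, slope, noise level, and number of replications are arbitrary choices of mine, not values from any particular data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12                                   # x runs 0, 1, ..., 2n
x = np.arange(2 * n + 1)
alpha, beta, sigma = 1.0, 2.0, 1.0       # hypothetical true parameters

full_slopes, endpoint_slopes = [], []
for _ in range(20_000):
    y = alpha + beta * x + rng.normal(0, sigma, x.size)
    full_slopes.append(np.polyfit(x, y, 1)[0])               # least squares slope
    endpoint_slopes.append((y[-1] - y[0]) / (x[-1] - x[0]))   # Bancroft's rule

print(np.var(endpoint_slopes) / np.var(full_slopes))  # Monte Carlo estimate
print(n / 3 + 1 / 2 + 1 / (6 * n))                    # theoretical ratio, about 4.5
```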

Probability mistake can give a good approximation

If you run into someone on the street, the probability that the other person shares your birthday is 1/365. If you run into five people, the probability that at least one of them shares your birthday is 5/365, right?

The answer 5/365 is quite accurate, but not exactly correct. It’s good enough for this problem since there are other practical difficulties besides the quality of the approximation. For example, the problem implicitly assumes there are 365 days in a year, i.e. that no one is ever born on Leap Day.

Now think about a similar problem. Suppose the chance of rain is 40% each day for the next three days. Does that mean there is a 120% chance that it will rain at least one of the next three days? That can’t be. In fact, the chance of some rain over the next three days is 78.4% (assuming the probabilities are independent, which they’re probably not).

The following rule appears to be correct for birthdays but not for predicting rain.

If the probability of a success in one attempt is p, then the probability of at least one success in n attempts is np.

Why does this rule hold sometimes and not at other times? If the probability of success on each attempt is p, the probability of failure on each attempt is (1-p). The probability of n failures in a row is (1-p)ⁿ and so the probability of at least one success is 1 – (1-p)ⁿ. That’s the right way to approach the birthday example and the rain example. In the birthday example, p = 1/365 and so the probability of running into at least one person in five who shares your birthday is 1 – (364/365)⁵ = 0.013624. And 5/365 = 0.013699 is a very good approximation. In the rain example, p = 0.4 and the probability of at least one day of rain out of the next three is 1 – 0.6³ = 0.784. The difference between the birthday example and the rain example is the size of p. The following equation, based on the binomial theorem, explains why.

1 - (1-p)^n = np - {n \choose 2}p^2 + {n \choose 3} p^3 - {n \choose 4}p^4 + \cdots

When we use np as our approximation, we’re ignoring the terms involving p² and higher powers of p. When p is small, higher powers of p are very small and can be ignored. That’s why the approximation worked well for p = 1/365. But when p is large, say p = 0.4, the error is large; the terms involving higher powers of p are important in that case. Notice also that the size of n matters. The birthday example breaks down when n is large. If you run into 400 people, it is likely that one of them will share your birthday, but far from certain. The probability in that case is about 2/3, not 400/365.

When p and n are both small, the probability of at least one success out of n tries is approximately np. We can say more. Because the first term the approximation drops from the equation above has a negative sign, our approximation is also an upper bound. This says np slightly over-estimates the probability.

Now how small do p and n have to be? If you calculate the approximation np and get a small answer, then it’s a good answer. Why? The error in the np approximation is roughly n(n-1)p²/2, which is less than (np)². And if np is small, (np)² is very small.
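
Here’s a small sketch that checks these numbers with the values of p and n from the birthday and rain examples above:

```python
def exact(p, n):
    """Probability of at least one success in n independent attempts."""
    return 1 - (1 - p)**n

def rule_of_thumb(p, n):
    """The np approximation."""
    return n * p

# (p, n): birthday with 5 people, rain over 3 days, birthday with 400 people
for p, n in [(1/365, 5), (0.4, 3), (1/365, 400)]:
    print(f"exact = {exact(p, n):.6f}, np = {rule_of_thumb(p, n):.6f}, "
          f"error bound (np)^2 = {(n * p)**2:.6f}")
```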

See Sales tax included for a similar discussion. That post also looks at a common mistake and explains why it makes a good approximation under some circumstances and not under others.

Optical illusion, mathematical illusion

Someone sent me a link to an optical illusion while I was working on a math problem. The two things turned out to be related.

In the image below, what look like blue spirals and green spirals are actually exactly the same color. The spiral that looks blue is actually green inside, but the magenta stripes across it make the green look blue. I know you don’t believe me; I didn’t believe it either. You can open it in an image editor and use a color selector to examine it for yourself.

My math problem was also a case of two things that look different even though they are not. Maybe you can think back to a time as a student when you knew your answer was correct even though it didn’t match the answer in the back of the book. The two answers were equivalent but written differently. In an algebra class you might answer 5/√3 when the book has 5√3/3. In trig class you might answer 1 – cos²x when the book has sin²x. In a differential equations class, equivalent answers may look very different since arbitrary constants can obfuscate differences.

In my particular problem, I was looking at weights for Gauss-Hermite integration. I was trying to reconcile two different expressions for the weights, one in some software I’d written years ago and one given in Abramowitz and Stegun. I thought I’d found a bug, at least in my comments if not in my software. My confusion was analogous to not recognizing a trig identity.  I wish I could say that the optical illusion link made me think that the two expressions may be the same and they just look different because of a mathematical illusion. That would make a good story. Instead, I discovered the equivalence of the two expressions by brute force, having Mathematica print out the values so I could compare them. Only later did I see the analogy between my problem and the optical illusion.

In case you’re interested in the details, my problem boiled down to the equivalence between Hₙ₊₁(xᵢ)² and 4n²Hₙ₋₁(xᵢ)² where Hₙ(x) is the nth Hermite polynomial and xᵢ is the ith root of Hₙ. Here’s why these are the same. The Hermite polynomials satisfy a recurrence relation Hₙ₊₁(x) = 2x Hₙ(x) – 2n Hₙ₋₁(x) for all x. Since Hₙ(xᵢ) = 0, Hₙ₊₁(xᵢ) = –2n Hₙ₋₁(xᵢ). Now square both sides.
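
If you want to check the identity numerically, here’s a short sketch using NumPy’s physicists’ Hermite polynomials; the degree n = 6 is an arbitrary choice of mine.

```python
import numpy as np
from numpy.polynomial import hermite as H

n = 6
xi = H.hermroots([0] * n + [1])                           # roots of H_n
lhs = H.hermval(xi, [0] * (n + 1) + [1]) ** 2             # H_{n+1}(x_i)^2
rhs = 4 * n**2 * H.hermval(xi, [0] * (n - 1) + [1]) ** 2  # 4 n^2 H_{n-1}(x_i)^2
print(np.allclose(lhs, rhs))  # True
```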

Related post: Orthogonal polynomials

John Tukey’s median of medians

Yesterday I got an email from Jestin Abraham asking a question about Tukey’s “median of medians” paper from 1978. (The full title is “The Ninther, a Technique for Low-Effort Robust (Resistant) Location in Large Samples.”) Jestin thought I might be familiar with the paper since I’ve written about Tukey several times, but I’d never heard of it.

Tukey’s “ninther” or “median of medians” procedure is quite simple. Understanding the problem he was trying to solve is a little more difficult.

Suppose you are given nine data points: y₁, y₂, …, y₉. Let yA be the median of the first three samples, yB the median of the next three samples, and yC the median of the last three samples. The “ninther” of the data set is the median of yA, yB, and yC, hence the “median of medians.” If the data were sorted, the ninther would simply be the median, but in general it will not be.

For example, suppose your data are 3, 1, 4, 4, 5, 9, 9, 8, 2.  Then

yA = median( 3, 1, 4 ) = 3
yB = median( 4, 5, 9 ) = 5
yC = median( 9, 8, 2 ) = 8

and so the ninther is median( 3, 5, 8 ) = 5. The median is 4.
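
The procedure is short enough to code up directly. Here’s a sketch (the function names are mine, not Tukey’s):

```python
def median3(a, b, c):
    """Median of three values, using only comparisons."""
    if a > b:
        a, b = b, a          # now a <= b
    return min(b, c) if c > a else a

def ninther(y):
    """Tukey's median of medians for exactly nine data points."""
    return median3(median3(*y[0:3]), median3(*y[3:6]), median3(*y[6:9]))

print(ninther([3, 1, 4, 4, 5, 9, 9, 8, 2]))  # prints 5, as in the example above
```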

That’s Tukey’s solution, so what was his problem? First of all, he’s trying to find an estimate for the central value of a large data set. Assume the data come from a symmetric distribution so that the mean equals the median. He’s looking for a robust estimator of the mean, an estimator resistant to the influence of outliers. That’s why he’s using an estimator that is more like the median than the mean.

Why not just use the median? Computing the sample median requires storing all data points and then sorting them to pick the middle value. Tukey wants to do his computation in one pass without storing the data. Also, he wants to do as few comparisons and as few arithmetic operations as possible. His ninther procedure uses no arithmetic operations and only order comparisons. He shows that it uses only about 1.1 comparisons per data point on average and 1.33 comparisons per data point in the worst case.

How well does Tukey’s ninther perform? He shows that if the data come from a normal distribution, the ninther has about 55% efficiency relative to the sample mean. That is, the variances of his estimates are a little less than twice the variances of estimates using the sample mean. But the purpose of robust statistics is efficient estimation in case the data do not come from a normal distribution but from a distribution with thicker tails. The relative efficiency of the ninther improves when data do come from distributions with thicker tails.

Where do large data sets come in? So far we’ve only talked about analyzing data sets with nine points. Tukey’s idea was to use the ninther in conjunction with the median. For some large number M, you could estimate the central value of 9M data points by applying the ninther to groups of 9 points and taking the median of the M ninthers. This still requires computing the median of M points, but the memory requirement has been reduced by a factor of 9. Also, the sorting time has been reduced by more than a factor of 9 since sorting n points takes time proportional to n log n.
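
Here’s a sketch of that idea, reusing the ninther function above. It streams through the data nine points at a time and stores only the ninthers; the call to statistics.median at the end is my shortcut for the final step.

```python
from statistics import median

def ninther_estimate(stream):
    """Estimate the center of a long data stream via the median of ninthers."""
    ninthers, buffer = [], []
    for y in stream:
        buffer.append(y)
        if len(buffer) == 9:
            ninthers.append(ninther(buffer))  # ninther() as defined above
            buffer.clear()
    return median(ninthers)  # only the M ninthers are stored and sorted

# Example: 9,000 points from a symmetric distribution centered at 10
import random
random.seed(0)
print(ninther_estimate(random.gauss(10, 2) for _ in range(9 * 1000)))
```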

For even larger data sets, Tukey recommended breaking the data into sets of 81 points and computing the ninther of the ninthers. Then 81M data points could be processed by storing and sorting M values.

Tukey gave M = 1,000,000 as an example of what he called an “impractically large data set.” I suppose finding the median of 81 million data points was impractical in 1978, though it’s a trivial problem today. Perhaps Tukey’s ninther is still useful for an embedded device with extremely limited resources that must process enormous amounts of data.

Dan Bricklin interview

Dan Bricklin is best known for creating VisiCalc along with Bob Frankston in 1979. Since that time he has been active as a software developer and entrepreneur. His new book is Bricklin on Technology (ISBN 0470402377).

I quoted Dan Bricklin in a blog post a few weeks ago and he left a couple of comments in the discussion. This started an email correspondence that led to the following interview.

JC: Do you ever feel that the fame of VisiCalc has overshadowed some of your more recent accomplishments?

DB: It had better. VisiCalc was a pretty big thing to have done, and I’m very happy that I had the opportunity to make such a big contribution to the world. On the other hand, I frequently run into people who remember me because of some of my other products, especially Dan Bricklin’s Demo, or my writings that had a major impact on their work, so I know it’s not all that I’ve done of interest. Having done VisiCalc has opened many doors for me, and I surely appreciate that. I wouldn’t call it overshadowed, I’d call it added to and enhanced.

JC: What would your 30-second bio be without VisiCalc?

DB: I am a long-term toolmaker and commentator in the area of the personal use of computing power. I’ve stayed current in the technology area, and continually programmed and developed products in the latest genre, and shared my observations through blogging, podcasting, and other means, including a book.

JC: What are you doing these days as a programmer? As an entrepreneur?

DB: I have been working on an Open Source JavaScript-based spreadsheet called SocialCalc. It is being used throughout the world on the One Laptop Per Child’s XO computer, as well as by enterprise social-software company Socialtext, which paid for much of its development. I also serve on a few high-tech boards, and do a variety of types of consulting, including speaking engagements. I plan to continue developing software of various sorts and consulting.

JC: What trends do you see in software development?

DB: Software development is pervading more and more fields as a major component. We have moved from the computer being an adjunct to other means of expression or deployment to being the only or dominant means. The use of major system components, be they libraries or services, has continued to grow.

JC: Every time a new technology comes out, someone asks what the killer app will be. That is, what application will do for the new technology what VisiCalc did for personal computers. Could you comment on some other “killer apps” since VisiCalc?

DB: I viewed VisiCalc as an app that made buying it and the whole system needed to run it an extremely simple decision. I saw it as having a “two week payback” for buying the whole system. That came from being two orders of magnitude better than what was used before. In VisiCalc’s case, you could use paper and pencil, taking at least 100 times as long to do the same thing, or, in those days, a timesharing system at a few thousand dollars a month (at least).

Similarly compelling applications since VisiCalc (for businesses) were desktop publishing, email, and mobile computing (like the Blackberry, Treo, and now iPhone). For the home, initially CD-ROM encyclopedias were a pretty compelling reason for homes with children to buy a PC (less than the cost of a paper encyclopedia and a bookcase to hold it but you could also use it for word processing), then the combination of email and the web with an always-connected Internet connection.

JC: The personal computer had a killer app and became popular. Are we reading too much into history by expecting that every technology must have a killer app before it can take off?

DB: You only need something that justifies buying a whole system if the sum of other applications or other reasons don’t cause the purchase on their own. For the iPhone, for some people, just having a large catalog of things you might want (those long tail apps I discuss in Chapter 7 of my book) may be enough.

JC: What do you think of open source business models? Ad sponsored, freemium, selling support/consulting services, etc.

DB: As I point out in Chapter 2 of my book where I talk about artists getting paid, there are many ways to make money. A “business model” is just saying here is how the pieces of what I do fit together and end up making enough money to meet the needs I have. This includes the cost structure as well as the sources of revenue and desired results. All long term endeavors, be they mainly based on developing or using Open Source or proprietary source or a mixture, look to different mixtures. They have historically used selling support, relationships with other companies (which advertising is a variant of), and other techniques as part of their mix. Open Source just gives us other options, including on the cost side. Also, as Prof. Ariely explains in the interview I did with him (Chapter 5) once you move into the realm of “free”, and when you appropriately invoke “community”, both of which Open Source can do, you get added benefits in your relationship with other people that can leverage your marketing and other costs.

JC: What did you learn in the process of writing your book? In particular, could you say a little bit about typography?

DB: Most of what I went through is in my essay on the topic, Turning My Blog Into A Book. I think that typography is important, and we’ve seen that as web pages have moved from very basic to better layout to full use of CSS. Typography is a way of expressing ideas and information outside of the direct flow of what you are saying. It is very valuable. Just as a well-delivered speech can convey much more than just the raw words, appropriate use of typographical techniques can convey much more than simple text.

Conservation of attractive profits

Tim O’Reilly talked about the “law of conservation of attractive profits” in a recent interview on the FLOSS Weekly podcast. Clayton Christensen explained this law in an HBR report in 2004. It says that when one thing becomes modular and commoditized, another thing becomes valuable.

O’Reilly argues that just as computers made out of commodity hardware made software more valuable, now commodity software and open standards have made data more valuable.

Taking this line of reasoning one step further, open data makes analysis more valuable. Good news for experts in statistics and machine learning.

The Endeavour is now a podcast

This blog is now available as a podcast. I’m experimenting with the Odiogo service to automatically create an audio version of the blog text. The speech quality is surprisingly good, much better than the Windows speech synthesizer.

You can listen from the web page or subscribe to the podcast via iTunes etc. Let me know what you think.

Update: The speech quality is good on ordinary prose. It even pronounces foreign names correctly. Unsurprisingly, it mangles equations and special symbols. It generally has good inflection. It handles punctuation well, but it does not pause for <br> elements and so lists are unnaturally concatenated together.

I don’t know whether the podcast improves accessibility. I’d be interested to hear whether listening to the podcast is preferable to using a screen reader.

Update: As of February 2015, it looks like Odiogo is dead. Broken link to http://podcasts.odiogo.com/the-endeavour/podcasts-html.php above removed.

Backup and recovery

Paul Randal had the following to say about database backup and recovery in his interview with .NET Rocks.

Don’t ever, ever plan a backup strategy. Plan a restore strategy.

The point of a backup is to be able to restore data when necessary. That sounds obvious, but it’s easy to lose sight of. Too many companies back up their data, or think that they back up their data, but don’t have much of a recovery plan. Or they have a recovery plan but they haven’t rehearsed it. Until you rehearse your recovery plan, you don’t know whether it’s going to work, you don’t know how long it’s going to take, and you don’t know whether you need to invest in hardware to speed it up.