From Digital Barbarism by Mark Helprin:
I do not want my life history in the hands of either J. Edgar Hoover or Walt Disney, thank you very much.
Suppose you need to find the slope of a line through a set of data. You can get a surprisingly good approximation by simply fitting a line to the first and last points. This is known as “Bancroft’s rule.” It seems too good to be true. Of course just fitting a line to just two points is not as good as using all the data, but unless you have a fairly large amount of data, it’s not too much worse either. It’s good enough for a quick estimate.
Just how good is this estimate compared to using all the data? We’ll look at the technical details of an example below.
Suppose you have a regression model y = α + βx + ε where ε is random noise. Suppose ε is normally distributed with mean 0 and variance σ2. Let b be the least squares estimator of β. The variance in b is
Now suppose we have observations yi corresponding to xi = 0, 1, 2, …, 2n. The average value of x is n, and the denominator in the expression for the variance of the slope estimator is 2(12 + 22 + 32 + … + n2) = (2n3 + 3n2 + n)/3. If we just use the data at x = 0 and x = 2n, the denominator is (0 – n)2 + (2n – n)2 = 2n2.
If we divide the estimator variance based on Bancroft’s rule by the estimator variance using all the data, the σ2 terms cancel and we are left with n/3 + 1/2 + 1/6n. So Bancroft’s rule increases the variance in the estimate for the slope by roughly n/3 compared to using all the data. Thus it increases the confidence interval by roughly the square root of n/3. So if you had 12 data points, the confidence interval would be about twice as wide. Said another way, the estimate based on all the data is only twice as good as the estimate based on just the first and last points.
If you run into someone on the street, the probability that the other person shares your birthday is 1/365. If you run into five people, the probability that at least one of them shares your birthday is 5/365, right?
The answer 5/365 is quite accurate, but not exactly correct. It’s good enough for this problem since there are other practical difficulties besides the quality of the approximation. For example, the problem implicitly assumes there are 365 days in a year, i.e. that no one is ever born on Leap Day.
Now think about a similar problem. Suppose the chance of rain is 40% each day for the next three days. Does that mean there is a 120% chance that it will rain at least one of the next three days? That can’t be. In fact, the chance of some rain over the next three days is 78.4%.
The following rule appears to be correct for birthdays but not for predicting rain.
If the probability of a success in one attempt is p, then the probability of at least one success in n attempts is np.
Why does this rule hold sometimes and not at other times? If the probability of success on each attempt is p, the probability of failure on each attempt is (1-p). The probability of n failures in a row is (1-p)n and so the probability of at least one success is 1 – (1-p)n. That’s the right way to approach the birthday example and the rain example. In the birthday example, p = 1/365 and so the probability of running into at least one person in five who shares your birthday is 1 – (364/365)5 = 0.013624. And 5/365 = 0.013699 is a very good approximation. In the rain example, p = 0.4 and the probability of at least one day of rain out of the next three is 1 – 0.63 = 0.784. The difference between the birthday example and the rain example is the size of p. The following equation, based on the binomial theorem, explains why.
When we use np as our approximation, we’re ignoring the terms involving p2 and higher powers of p. When p is small, higher powers of p are very small and can be ignored. That’s why the approximation worked well for p = 1/365. But when p is large, say p = 0.4, the error is large; the terms involving higher powers of p are important in that case. Notice also that the size of n matters. The birthday example breaks down when n is large. If you run into 400 people, it is likely that one of them will share your birthday, but far from certain. The probability in that case is about 2/3, not 400/365.
When p and n are both small, the probability of at least one success out of n tries is approximately np. We can say more. Because the first term the approximation drops from the equation above has a negative sign, our approximation is also an upper bound. This says np slightly over-estimates the probability.
Now how small do p and n have to be? If you calculate the approximation np and get a small answer, then it’s a good answer. Why? The error in the np approximation is roughly n(n-1)p2/2, which is less than (np)2. And if np is small, (np)2 is very small.
See Sales tax included for a similar discussion. That also post looks at a common mistake and explains why it makes a good approximation under some circumstances and not under others.
Someone sent me a link to an optical illusion while I was working on a math problem. The two things turned out to be related.
In the image below, what look like blues spiral and green spirals are actually exactly the same color. The spiral that looks blue is actually green inside, but the magenta stripes across it make the green look blue. I know you don’t believe me; I didn’t believe it either. See this blog post for an explanation, including a magnified close-up of the image. Or open it in an image editor and use a color selector to see for yourself.
My math problem was also a case of two things that look different even though they are not. Maybe you can think back to a time as a student when you knew your answer was correct even though it didn’t match the answer in the back of the book. The two answers were equivalent but written differently. In an algebra class you might answer 5 / √ 3 when the book has 5 √ 3 / 3. In trig class you might answer 1 – cos2x when the book has sin2x. In a differential equations class, equivalent answers may look very different since arbitrary constants can obfuscate differences.
In my particular problem, I was looking at weights for Gauss-Hermite integration. I was trying to reconcile two different expressions for the weights, one in some software I’d written years ago and one given in A&S. I thought I’d found a bug, at least in my comments if not in my software. My confusion was analogous to not recognizing a trig identity. I wish I could say that the optical illusion link made me think that the two expressions may be the same and they just look different because of a mathematical illusion. That would make a good story. Instead, I discovered the equivalence of the two expressions by brute force, having Mathematica print out the values so I could compare them. Only later did I see the analogy between my problem and the optical illusion.
In case you’re interested in the details, my problem boiled down to the equivalence between Hn+1(xi)2 and 4n2Hn-1(xi)2 where Hn(x) is the nth Hermite polynomial and xi is the ith root of Hn. Here’s why these are the same. The Hermite polynomials satisfy a recurrence relation Hn+1(x) = 2x Hn(x) – 2n Hn-1(x) for all x. Since Hn(xi) = 0, Hn+1(xi) = -2nHn-1(xi). Now square both sides.
Related post: Orthogonal polynomials
Yesterday I got an email from Jestin Abraham asking a question about Tukey’s “median of medians” paper from 1978. (The full title is “The Ninther, a Technique for Low-Effort Robust (Resistant) Location in Large Samples.”) Jestin thought I might be familiar with the paper since I’ve written about Tukey several times, but I’d never heard of it.
Tukey’s “ninther” or “median of medians” procedure is quite simple. Understanding the problem he was trying to solve is a little more difficult.
Suppose you are given nine data points: y1, y2, …, y9. Let yA be the median of the first three samples, yB the median of the next three samples, and yC the median of the last three samples. The “ninther” of the data set is the median of yA, yB, and yC, hence the “median of medians.” If the data were sorted, the ninther would simply be the median, but in general it will not be.
For example, suppose your data are 3, 1, 4, 4, 5, 9, 9, 8, 2. Then
yA = median( 3, 1, 4 ) = 3
yB = median( 4, 5, 9 ) = 5
yC = median( 9, 8, 2 ) = 8
and so the ninther is median( 3, 5, 8 ) = 5. The median is 4.
That’s Tukey’s solution, so what was his problem? First of all, he’s trying to find an estimate for the central value of a large data set. Assume the data come from a symmetric distribution so that the mean equals the median. He’s looking for a robust estimator of the mean, an estimator resistant to the influence of outliers. That’s why he’s using an estimator that is more like the median than the mean.
Why not just use the median? Computing the sample median requires storing all data points and then sorting them to pick the middle value. Tukey wants to do his computation in one pass without storing the data. Also, he wants to do as few comparisons and as few arithmetic operations as possible. His ninther procedure uses no arithmetic operations and only order comparisons. He shows that it uses only about 1.1 comparisons per data point on average and 1.33 comparisons per data point in the worst case.
How well does Tukey’s ninther perform? He shows that if the data come from a normal distribution, the ninther has about 55% efficiency relative to the sample mean. That is, the variances of his estimates are a little less than twice the variances of estimates using the sample mean. But the purpose of robust statistics is efficient estimation in case the data do not come from a normal distribution but from a distribution with thicker tails. The relative efficiency of the ninther improves when data do come from distributions with thicker tails.
Where do large data sets come in? So far we’ve only talked about analyzing data sets with nine points. Tukey’s idea was to use the ninther in conjunction with the median. For some large number M, you could estimate the central value of 9M data points by applying the ninther to groups of 9 points and take the median of the M ninthers. This still requires computing the median of M points, but the memory requirement has been reduced by a factor of 9. Also, the sorting time has been reduced by more than a factor of 9 since sorting n points takes time proportional to n log n.
For even larger data sets, Tukey recommended breaking the data in to sets of 81 points and computing the ninther of the ninthers. Then 81M data points could be processed by storing and sorting M values.
Tukey gave M = 1,000,000 as an example of what he called an “impractically large data set.” I suppose finding the median of 81 million data points was impractical in 1978, though it’s a trivial problem today. Perhaps Tukey’s ninther is sill useful for an embedded device with extremely limited resources that must process enormous amounts of data.
Other posts on robust statistics:
Other posts on John Tukey:
I quoted Dan Bricklin in a blog post a few weeks ago and he left a couple comments in the discussion. This started an email correspondence that lead to the following interview.
JC: Do you ever feel that the fame of VisiCalc has overshadowed some of your more recent accomplishments?
DB: It had better. VisiCalc was a pretty big thing to have done, and I’m very happy that I had the opportunity to make such a big contribution to the world. On the other hand, I frequently run into people who remember me because of some of my other products, especially Dan Bricklin’s Demo, or my writings that had a major impact on their work, so I know it’s not all that I’ve done of interest. Having done VisiCalc has opened many doors for me, and I surely appreciate that. I wouldn’t call it overshadowed, I’d call it added to and enhanced.
JC: What would your 30-second bio be without VisiCalc?
DB: I am a long-term toolmaker and commentator in the area of the personal use of computing power. I’ve stayed current in the technology area, and continually programmed and developed products in the latest genre, and shared my observations through blogging, podcasting, and other means, including a book.
JC: What are you doing these days as a programmer? As an entrepreneur?
JC: What trends do you see in software development?
DB: Software development is pervading more and more fields as a major component. We have moved from the computer being an adjunct to other means of expression or deployment to being the only or dominant means. The use of major system components, be they libraries or services, has continued to grow.
JC: Every time a new technology comes out, someone asks what the killer app will. That is, what application will do for the new technology what VisiCalc did for personal computers. Could you comment on some other “killer apps” since VisiCalc?
DB: I viewed VisiCalc as an app that made buying it and the whole system needed to run it an extremely simple decision. I saw it as having a “two week payback” for buying the whole system. That came from being two orders of magnitude better than what was used before. In VisiCalc’s case, you could use paper and pencil, taking at least 100 times as long to do the same thing, or, in those days, a timesharing system at a few thousand dollars a month (at least).
Similarly compelling applications since VisiCalc (for businesses) were desktop publishing, email, and mobile computing (like the Blackberry, Treo, and now iPhone). For the home, initially CD-ROM encyclopedias were a pretty compelling reason for homes with children to buy a PC (less than the cost of a paper encyclopedia and a bookcase to hold it but you could also use it for word processing), then the combination of email and the web with an always-connected Internet connection.
JC: The personal computer had a killer app and became popular. Are we reading too much into history by expecting that every technology must have a killer app before it can take off?
DB: You only need something that justifies buying a whole system if the sum of other applications or other reasons don’t cause the purchase on their own. For the iPhone, for some people, just having a large catalog of things you might want (those long tail apps I discuss in Chapter 7 of my book) may be enough.
JC: What do you think of open source business models? Ad sponsored, freemium, selling support/consulting services, etc.
DB: As I point out in Chapter 2 of my book where I talk about artists getting paid, there are many ways to make money. A “business model” is just saying here is how the pieces of what I do fit together and end up making enough money to meet the needs I have. This includes the cost structure as well as the sources of revenue and desired results. All long term endeavors, be they mainly based on developing or using Open Source or proprietary source or a mixture, look to different mixtures. They have historically used selling support, relationships with other companies (which advertising is a variant of), and other techniques as part of their mix. Open Source just gives us other options, including on the cost side. Also, as Prof. Ariely explains in the interview I did with him (Chapter 5) once you move into the realm of “free”, and when you appropriately invoke “community”, both of which Open Source can do, you get added benefits in your relationship with other people that can leverage your marketing and other costs.
JC: What did you learn in the process of writing your book? In particular, could you say a little bit about typography?
DB: Most of what I went through is in my essay on the topic, Turning My Blog Into A Book. I think that typography is important, and we’ve seen that as web pages have moved from very basic to better layout to full use of CSS. Typography is a way of expressing ideas and information outside of the direct flow of what you are saying. It is very valuable. Just as a well-delivered speech can convey much more than just the raw words, appropriate use of typographical techniques can convey much more than simple text.
Tim O’Reilly talked about the “law of conservation of attractive profits” in a recent interview on the FLOSS Weekly podcast. Clayton Christensen explained this law in an HBR report in 2004. It says that when one thing becomes modular and commoditized, another thing becomes valuable.
O’Reilly argues that just as computers made out of commodity hardware made software more valuable, now commodity software and open standards have made data more valuable.
Taking this line of reasoning one step further, open data makes analysis more valuable. Good news for experts in statistics and machine learning.
This blog is now available as a podcast. I’m experimenting with the Odiogo service to automatically create an audio version of the blog text. The speech quality is surprisingly good, much better than the Windows speech synthesizer.
You can listen from the web page or subscribe to the podcast via iTunes etc. Let me know what you think.
Update: The speech quality is good on ordinary prose. It even pronounces foreign names correctly. Unsurprisingly, it mangles equations and special symbols. It generally has good inflection. It handles punctuation well, but it does not pause for <br> elements and so lists are unnaturally concatenated together.
I don’t know whether the podcast improves accessibility. I’d be interested to hear whether listening to the podcast is preferable to using a screen reader.
What it’s like to run a marathon in space.
Mathematical artwork from Chris Henden:
finite projective plane
balanced incomplete block design
(A lot of mathematical art is gimmicky. Interesting, but not beautiful. Chris Henden’s work is beautiful. I believe people would enjoy it who had no appreciation for the math that inspired the artwork. Here are a couple non-mathematical pieces by the same artist: life drawing, Abu Dhabi.)
Andrew Gelman’s take on the statistics of the recent election in Iran. Note that he says “let me emphasize that I’m not saying that I think nothing funny was going on in the election. As I wrote, I’m commenting on the statistics.”
Long list of use useful command line commands on Windows.
Server design, a blog post about “the Four Horsemen of Poor Performance.”
Paul Randal had the following to say about database backup and recovery in his interview with .NET Rocks.
Don’t ever, ever plan a backup strategy. Plan a restore strategy.
The point of a backup is to be able to restore data when necessary. That sounds obvious, but it’s easy to lose site of. Too many companies backup their data, or think that they back up their data, but don’t have much of a recovery plan. Or they have a recovery plan but they haven’t rehearsed it. Until you rehearse your recovery plan, you don’t know whether it’s going to work, you don’t know how long it’s going to take, and you don’t know whether you need to invest in hardware to speed it up.
Six of the seven deadly sins are easy to define, but one is more subtle. The seven deadly sins are
Sloth is the subtle one.
I discovered recently that I didn’t know what sloth meant. When I first heard of the seven deadly sins, I thought it was odd that sloth was on the list. How would you know whether you’re sufficiently active to avoid sloth? It turns out that the original idea of sloth was only indirectly related to activity.
The idea of a list of deadly sins started in the 4th century and has changed over time. The word in the middle of the list was “acedia” before it became “sloth,” and the word “sloth” has taken on a different meaning since then. So what is acedia? According to Wikipedia,
Acedia is a word from ancient Greek describing a state of listlessness or torpor, of not caring or not being concerned with one’s position or condition in the world. It can lead to a state of being unable to perform one’s duties in life. Its spiritual overtones make it related to but distinct from depression.
In short, “sloth” did not mean inactivity but rather a state of apathy. As Os Guinness says in his book The Call
… sloth must be distinguished from idling, a state of carefree living that can be admirable, as in friends lingering over a meal … [Sloth] can reveal itself in frenetic activism as easily as in lethargy … It is a condition of explicitly spiritual dejection … inner despair at the worthwhileness of the worthwhile …
Sloth and rest could look the same externally while proceeding from opposite motivations. One person could be idle because he lacked the faith to do anything, while another person could be idle because he had faith that his needs would be met even if he rested a while. The key to avoiding sloth is not the proper level of activity but the proper attitude of the heart.
The Hawthorne effect is the idea that people perform better when they’re being studied. The name comes from studies conducted at Western Electric’s Hawthorne Works facility. Increased lighting improved productivity in the plant. Later, lowering the lighting also increased productivity. The Hawthorne effect says that the productivity increase wasn’t due to changes in lighting per se but either the variety of changing something about the plant or the attention that workers got by being measured, a sort of placebo effect.
The Alternative Blog has a post this morning entitled Hawthorne effect debunked. The original Hawthorne effect was apparently due to a flaw in the study design; correcting for that flaw eliminates the effect.
The term “debunked” in the post title may imply too much. The effect in the original studies may have been debunked, but that does not necessarily mean there is no Hawthorne effect. Perhaps there are good examples of the Hawthorne effect elsewhere. On the other hand, I expect closer examination of the data could debunk other reported instances of the Hawthorne effect as well.
The Hawthorne effect makes sense. It has been ingrained in pop culture. I heard a reference to it on a podcast just this morning before reading the blog post mentioned above. Everyone knows it’s true. And maybe it is. But at a minimum, there is at least one example suggesting the effect is not as wide-spread as previously thought.
It would be interesting to track the popularity of the Hawthorne effect in scholarly literature and in pop culture. If the effect becomes less credible in scholarly circles, will it also become less credible in pop culture? And if so, how quickly will pop culture respond?
Joel Spolsky recommends the following books to self-taught programmers who apply to his company and need to fill in some gaps in their training.
The one that has me scratching my head is The Unix Programming Environment, first published in 1984. After listening to Joel’s podcast, I thumbed through my old copy of the book and thought “Man, I could never work like this.” Of course I could work like that, because I did, back around 1990. But the world has really changed since then.
I appreciate history and old books. I see the value in learning things you might not directly apply. But imagine telling twentysomething applicants to go read an operating system book that was written before they were born. Most would probably think you’re insane.
Update (16 November 2010): On second thought, I could see recommending that someone read The Unix Programming Environment these days even though technology has changed so much, but I’d still expect resistance.
The world’s computer systems kept working on January 1, 2000 thanks to billions of dollars spent on fixing old software. Two wrong conclusions to draw from Y2K are
The programmers who wrote the Y2K bugs were highly successful: their software lasted longer than anyone imagined it would. The two-digit dates were only a problem because their software was still in use decades later. (OK, some programmers were still writing Y2K bugs as late as 1999, but I’m thinking about COBOL programmers from the 1970′s.)
Y2K may be behind us, but we will be facing Y2K-like problems for years to come. Twitter just faced a Y2K-like problem last night, the so called Twitpocalypse. Twitter messages were indexed with a signed 32-bit integer. That means the original software was implicitly designed with a limit of around two billion messages. Like the COBOL programmers mentioned above, Twitter was more successful than anticipated. Twitter fixed the problem without any disruption, except that some third party Twitter clients need to be updated.
We are running out of Internet addresses because these addresses also use 32-bit integers. To make matters worse, an Internet address has an internal structure that greatly reduces the number of possible 32-bit addresses. IPv6 will fix this by using 128-bit addresses.
The US will run out of 10-digit phone numbers at some point, especially since not all 10-digit combinations are possible phone numbers. For example, the first three digits are a geographical area code. One area code can run out of 7-digit numbers while another has numbers left over.
At some point the US will run out of 9-digit social security numbers.
The original Unix systems counted time as the number of seconds since January 1, 1970, stored in a signed 32-bit integer. On January 19, 2038, the number of seconds will exceed the capacity of such an integer and the time will roll over to zero, i.e. it will be January 1, 1970 again. This is more insidious than the Y2K problem because there are many software date representations in common use, including the old Unix method. Some (parts of) software will have problems in 2038 while others will not, depending on the whim of the programmer when picking a way to represent dates.
There will always be Y2K-like problems. Computers are finite. Programmers have to guess at limitations for data. Sometimes these limitations are implicit, and so we can pretend they are not there, but they are. Sometimes programmers guess wrong because their software succeeds beyond their expectations.