Earliest personal account of slavery

According to William R. Cook, only one ancient account of slavery written by a slave survives: a letter written by Saint Patrick. Many ancient documents were written by slaves, but none about the experience of being a slave.

Patrick was born in Britain. He was kidnapped at age 16 and became a slave in Ireland. He served as a slave for six years before escaping and returning to Britain. Later he returned to Ireland as a missionary. Although there are many legends surrounding Patrick, historians generally agree that his autobiographical letter, now known as the Confession of St. Patrick, is authentic.

I was surprised to hear that there are no other extant autobiographies of slaves since there were many literate slaves in antiquity. Obviously slaves were not given the liberty to write about whatever they pleased, and slave owners would be unlikely to request candid biographies of their chattel. Still, I imagine some slaves wrote autobiographies, perhaps secretly. But it makes sense that such documents would not likely be preserved.

The lack of first-hand accounts of slavery may contribute to our rosy mental image of classical history. When we think of ancient Greece, we think of Plato and Aristotle, not the anonymous slaves who made up perhaps 40% of the population of classical Athens.

[Update December 2014: The information above comes from a Teaching Company course by William Cook. The original link is dead, and I don’t remember now which of his courses it was from.]

Robust prior illustration

In Bayesian statistics, prior distributions “get out of the way” as data accumulate. But some prior distributions get out of the way faster than others. The influence of robust priors decays faster when the data conflict with the prior.

Consider a single sample y from a Normal(θ, σ²) distribution. We consider two priors on θ: a conjugate Normal(0, τ²) prior and a robust Cauchy(0, 1) prior. We will look at the posterior mean of θ under each prior as y increases.

With the normal prior, the posterior mean value of θ is

y τ²/(τ² + σ²).

That is, the posterior mean is always a fixed fraction of y. If the prior variance τ² is large relative to the variance σ² of the sampling distribution, the fraction will be closer to 1, but it will always be less than 1.

With the Cauchy prior, the posterior mean is

y – O(1/y)

as y increases. (See these notes if you’re unfamiliar with “big-O” notation.) So the larger y becomes, the closer the posterior mean of θ comes to the value of the data y.

In the graph, the green line on bottom plots the posterior mean of θ with a normal prior as a function of y. The blue line on top is y. The red line in the middle is the posterior mean of θ with a Cauchy prior. Note how the red line starts out close to the green line. That is, for small values of y, the posterior mean is nearly the same under the normal and Cauchy priors. But as y increases, the red line approaches the blue line. The Cauchy prior has less influence as y increases.

In this graph σ = τ = 1. The results would be qualitatively the same for any values of σ and τ. If τ were larger relative to σ, the bottom line would be steeper, but the middle curve would still asymptotically approach the top line.

You can also show that with multiple samples, the posterior mean of θ converges more quickly to the empirical mean of the data when using a Cauchy prior than when using a normal prior if the mean is sufficiently large.
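The two claims above are easy to check numerically. The sketch below (my own illustration, not code from the original post) uses the closed-form shrinkage factor for the normal prior and brute-force grid integration for the Cauchy prior, with σ = τ = 1 as in the graph:

```python
import numpy as np

def post_mean_normal(y, sigma=1.0, tau=1.0):
    # Conjugate case: the posterior mean is a fixed fraction of y.
    return y * tau**2 / (tau**2 + sigma**2)

def post_mean_cauchy(y, sigma=1.0):
    # Posterior density is proportional to likelihood times Cauchy(0, 1)
    # prior; integrate numerically on a fine uniform grid around y.
    theta = np.linspace(y - 20 * sigma, y + 20 * sigma, 200_001)
    w = np.exp(-0.5 * ((y - theta) / sigma) ** 2) / (1.0 + theta**2)
    return float(np.sum(theta * w) / np.sum(w))

# The normal prior shrinks y by the same factor no matter how large y is,
# while the Cauchy prior's shrinkage y - E[theta | y] shrinks as y grows.
for y in [1.0, 5.0, 10.0]:
    print(y, post_mean_normal(y), post_mean_cauchy(y))
```

Running this shows the gap y minus the Cauchy posterior mean decaying roughly like 2/y, in line with the y – O(1/y) statement above.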

Update: See Asymptotic results for Normal-Cauchy model for proofs of the claims in this post.


Related post: Robust priors

Inside the multitasking and marijuana study

A study came out in 2005 saying that multitasking lowers your IQ more than smoking marijuana does. David Freedman interviewed Dr. Glenn Wilson, author of the study. Wilson’s first response was “Oh, that damned thing.”

Someone from Hewlett-Packard contacted Glenn Wilson and asked him to conduct the multitasking study.

Encouraged by his sponsor at HP to keep the budget extremely low, and assured there was no pretense of trying to obtain scientifically valid, peer-reviewable, journal-publishable results, Wilson dragged eight students into a quiet room one at a time and gave them a standard IQ test, and then gave each of them another one — except that the second time, he left either a phone ringing continuously in the room or a flashing notification of incoming e-mail on a computer monitor in front of them.

Wilson said “It didn’t prove much of anything, of course.” But the study made a huge splash.

I don’t imagine anyone would be surprised that a constantly ringing telephone reduces your ability to concentrate on an IQ test. And comparing the result to marijuana use is pure sensationalism. While hearing a phone ring and smoking marijuana both impair concentration, they’re obviously not comparable.

Artificial studies like this one fail to answer the more important question of what effect voluntary multitasking has on creativity and productivity. As Tyler Cowen says:

To sound intentionally petulant, the only multitasking that works for me is mine, mine, mine!  Until I see a study showing that self-chosen multitasking programs lower performance, I don’t see that the needle has budged.

Paul Graham made a similar observation.

The danger of a distraction depends not on how long it is, but on how much it scrambles your brain. A programmer can leave the office and go and get a sandwich without losing the code in his head. But the wrong kind of interruption can wipe your brain in 30 seconds.

I’m convinced that multitasking, even voluntary multitasking, does decrease creativity and productivity. But I reached that opinion from personal experience, not based on any study of people taking IQ tests while listening to a phone ring. And of course some activities pair more effectively than others. Sweeping floors while listening to an iPod works better than checking email while taking an IQ test.

Two tragic animal-to-human studies

David Freedman gives two examples of animal-to-human studies that went horribly wrong. One actually happened. The other is hypothetical.

The actual study involves the experimental drug TGN1412. The compound was found safe in animal studies at 500 times the dose that would be given to humans. In 2006, TGN1412 was administered to six healthy men. All six were in excruciating pain within an hour of receiving the drug. Within 48 hours, all six were experiencing multiple organ failure. One subject remained in intensive care for several months. More information is available in this report.

Safety in animal studies is necessary but insufficient for testing new compounds in human subjects. That is, compounds that are harmful to animals do not go on to testing in human subjects. This policy is eminently reasonable. However, some drugs that would have been safe and effective in humans are discarded because they were toxic in animals. From Freedman:

It is frequently claimed that penicillin might easily have become one of those mistakenly discarded drugs because it sickens rabbits and guinea pigs in large or oral doses.

In other words, animal testing might have blocked the development of one of the most important drugs in the history of medicine.

Predicting height from genes

How well can you predict height based on genetic markers?

A 2009 study came up with a technique for predicting the height of a person based on looking at the 54 genes found to be correlated with height in 5,748 people — and discovered the results were one-tenth as accurate as the 125-year-old technique of averaging the heights of both parents and adjusting for sex.

The quote above is from Wrong: Why experts keep failing us — and how to know when not to trust them by David Freedman.

The article Freedman quotes is Predicting human height by Victorian and genomic methods. The “Victorian” method is the method suggested by Sir Francis Galton of averaging parents’ heights. The article’s abstract opines

For highly heritable traits such as height, we conclude that in applications in which parental phenotypic information is available (eg, medicine), the Victorian Galton’s method will long stay unsurpassed, in terms of both discriminative accuracy and costs.

Is helpful software really helpful?

In his new book The Shallows, Nicholas Carr relates an experiment by Christof van Nimwegen on computer-human interaction. Users were asked to solve a puzzle using software. Some users were given software designed to be very helpful, highlighting permissible moves etc. Other users were given bare-bones software.

In the early stages of solving the puzzle, the group using the helpful software made correct moves more quickly than the other group, as would be expected. But as the test proceeded, the proficiency of the group using the bare-bones software increased more rapidly. In the end, those using the unhelpful program were able to solve the puzzle more quickly and with fewer wrong moves.

I immediately thought of the debate over fancy software development tools versus simple tools. Then I read the conclusion:

… those using the unhelpful software were better able to plan ahead and plot strategy, while those using the helpful software tended to rely on simple trial and error. Often, in fact, those with the helpful software were found “to aimlessly click around” as they tried to crack the puzzle.

That really sounds like software development.

Christof van Nimwegen did variations on his experiment and got similar results. For example, he had two groups schedule a complicated series of meetings. One group had plain calendar software and one had software designed to help people schedule complicated meetings. The folks with the simple software won.

The debate over whether to use fancy software development tools (e.g. integrated development environments, wizards, etc.) or simple tools (editors and make files) is a Ford-Chevy argument that won’t go away. I could imagine many valid objections to the applicability of the van Nimwegen studies to the software tools debate, but I’d say they score a point for the simple tools side.

A rebuttal to the van Nimwegen studies is that he has only shown that particular helpful software wasn’t particularly helpful. Maybe the specific puzzle-solving software didn’t help in the long run, but someone could have written software that was ultimately more helpful than the bare-bones software. Maybe someone could have written scheduling software that allows people to schedule tasks faster than using simple calendar software.

A rebuttal to the rebuttal is that someone might indeed write software that allows users to get the job done more quickly than they would using simpler software. It may even be inevitable that someone will write such software eventually. However, most attempts fail. It’s hard to write genuinely helpful software. Attempts to help a user too much may interfere with the user’s ability to form a good mental model of the problem.

Related post: Would you rather have a chauffeur or a Ferrari?

Bandwidth is not the bottleneck

Google’s Urs Hölzle gives the following world-wide average statistics regarding internet use.

  • Average page load: 4.9 seconds
  • Average page size: 320 kilobytes
  • Average bandwidth: 225 kilobytes/second

If bandwidth were the only limitation, the average page should load in about 1.4 seconds, roughly three and a half times faster than it does. The Internet protocols that have served us remarkably well were designed for very different usage scenarios. Hölzle says that web pages could load between two and four times faster if we made slight changes to infrastructure protocols and their implementations.

(Sam Savage would point out that you can get into trouble using averages as we did above. When you have variable quantities X and Y, the average of X/Y is not simply the average of X divided by the average of Y. But the calculations above are accurate enough for back-of-the-envelope estimates.)
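The back-of-the-envelope arithmetic above can be written out explicitly (this is just a sanity check of Hölzle’s averages, assuming load time would be pure transfer time if bandwidth were the only limit):

```python
page_kb = 320.0          # average page size, kilobytes
bandwidth_kbps = 225.0   # average bandwidth, kilobytes per second
observed_load_s = 4.9    # average observed page load, seconds

transfer_s = page_kb / bandwidth_kbps      # transfer-only load time
speedup = observed_load_s / transfer_s     # potential speedup

print(f"transfer-only load: {transfer_s:.2f} s")   # about 1.42 s
print(f"potential speedup: {speedup:.1f}x")        # about 3.4x
```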

Defining minimalism

I stirred up some controversy yesterday with an article critical of extreme minimalism. Some people took my article as an attack on minimalism in general. I wanted to clarify a few thoughts on minimalism.

I’m attracted to the general idea of minimalism, though I don’t like the name. “Minimal” literally means an extreme. I appreciate moderate minimalists, though strictly speaking “moderate minimalist” is a contradiction in terms. A more accurate but unwieldy name for minimalists might be “people who are keenly aware of the indirect costs of owning stuff.” Possessions have to be dusted, oiled, upgraded, insured, etc. Eliminating unnecessary things frees up physical and mental space.

Minimalists want to pare down their possessions to a minimum. But an absolute minimum would be to own nothing. Instead, minimalists want to eliminate non-essentials. So you could define a minimalist as someone who wants to eliminate non-essential possessions (or more generally non-essential intangibles as well). But by that definition, Donald Trump would be a minimalist if he believes everything he owns is essential. The essence of minimalism is an aesthetic for what constitutes “essential.”

One final complaint about the term “minimalism” is that it implies that a minimalist’s goal in life is to minimize possessions. I imagine most people who call themselves minimalists do not want to be obsessed with eliminating stuff any more than they want to be obsessed with acquiring stuff. They just want to think about their stuff less.

Selfish minimalism

I saw an article the other day about a man who had chosen to get rid of all of his possessions except for a fair amount of computer equipment, a couch, and a few odds and ends.  (I’m not linking to the article because I want this post to be about a hypothetical extreme minimalist rather than the specifics of one person’s story that I know almost nothing about.) For a moment such a lack of possessions seems like a virtuous lack of attachment to material goods. But on second thought it seems incredibly selfish.

This man owns only what he personally wants. He has nothing for the benefit of anyone else. He cannot offer anyone a place to sleep, or even a place to sit down. He has nothing to loan to a neighbor. Not only does he have nothing to meet anyone else’s material needs, he is probably a burden on others. I imagine he is able to do without some things because he plans to borrow from neighbors or relatives when necessary. Such extreme minimalism would be an interesting exercise, but a sad way to live.

I’m not saying that minimalists are selfish. Minimalism is entirely subjective: each person defines what his or her minimum is. Some take others into consideration when deciding what their minimum should be and some do not. Some even become minimalists in order to have more margin to serve others.

Minimalism becomes ugly when it turns into a more-minimal-than-thou contest.

“Read my blog. I only have 47 things!”

“Buy my book. I have only 39 things!”

“I’ll see your 39 and lower you five!”

In a contest to live with the fewest possessions, one way to get ahead is to jettison anything that only benefits someone else.

Update: See my follow-up post clarifying my ideas of minimalism.

Related post: Poverty versus squalor

Acknowledging problems versus solving problems

People want their problems acknowledged more than they want them solved, at least at first. That’s one of the points from Thomas Limoncelli’s book Time Management for System Administrators.

Suppose two system administrators get an email about similar problems. The first starts working on the problem right away and replies to the email a couple hours later saying the problem is fixed. The second replies immediately to say he understands the problem and will resolve it first thing tomorrow. The second system administrator will be more popular.

Of course people want their problems solved, and sooner is better than later. But first they want to know someone is listening. Sometimes that’s all they want.

Contrasting Tolkien and Lewis

Ralph Wood gave a lecture for Big Ideas contrasting J. R. R. Tolkien and C. S. Lewis. Wood begins his lecture by explaining that though the two men have much in common, this commonality has been emphasized to the point of concealing some of their differences.

One way Tolkien and Lewis differed was in their writing processes. Tolkien would revise, revise, and revise. Friends would take papers from him and publish them to break the editing cycles. Lewis, on the other hand, would send first drafts to the publisher and later make only minor corrections to the proofs. Wood also discusses much deeper differences between the two authors.

(Big Ideas makes it difficult to link to show notes for individual podcast episodes. Here’s a link to the audio of Wood’s lecture.)

How to compute log factorial

Factorials grow very quickly, so quickly the results can easily exceed the limits of computer arithmetic. For example, 30! is too big to store in a typical integer and 200! is too big to store (even approximately) in a standard floating point number. It’s usually more convenient to work with the logarithms of factorials rather than factorials themselves.

So how might you compute log( n! )? You don’t want to compute n! first and then take logs because you’ll overflow for moderately large arguments. You want to compute the log factorial directly.

Since n! = 1 × 2 × 3 × … × n, log(n!) = log(1) + log(2) + log(3) + … + log(n). That gives a way to compute log(n!), but it’s slow for large arguments: the run time is proportional to n.

If you only need to compute log(n!) for n within a moderate range, you could just tabulate the values. Calculate log(n!) for n = 1, 2, 3, …, N by any means, no matter how slow, and save the results in an array. Then at runtime, just look up the result.

But suppose you want to be able to compute log(n!) for any value of n such that log(n!) won’t overflow. Since log(n!) is on the order of n log(n) for large n, it takes an astronomically large n before log(n!) overflows. You’ll run out of memory to store your table long before you run out of values of n for which log(n!) fits in a floating point number.

So now the problem becomes how to evaluate log(n!) for large values of n. Say we tabulate log(n!) for n up to some size and then use a formula to calculate log(n!) for larger arguments. I’m going to switch notation now and work with the gamma function Γ(x) because most references state their results in terms of the gamma function rather than in terms of factorial. It’s easy to move between the two since Γ(n+1) = n!.

Stirling’s approximation says that

log Γ(x) ≈ (x – 1/2) log(x) – x + (1/2) log(2π)

with an error on the order of 1/(12x). So if n were around 1000, the approximation would be good to about four decimal places. What if we wanted more accuracy but not a bigger table? Stirling’s approximation above is part of an infinite series of approximations, an asymptotic series for log Γ(x):

log Γ(x) ≈ (x – 1/2) log(x) – x + (1/2) log(2π) + 1/(12x) – 1/(360x³) + 1/(1260x⁵) – …

This raises a couple questions.

  1. What is the form of a general term in the series?
  2. How accurate is the series when we truncate it at some point?

The answer to the first question is that the term with x^(2m–1) in the denominator has coefficient B₂ₘ/(2m(2m–1)), where the B’s are Bernoulli numbers. Perhaps that’s not very satisfying, but that’s what it is.

Now to the question of accuracy. If you’ve never used asymptotic series before, you might be tempted to use one like you’d use a Taylor series: the more terms the better. But asymptotic series don’t work that way. Typically the error improves at first as you take more terms, reaches a minimum, then increases as you take more terms. Another difference is that while Taylor series approximations improve as arguments get smaller, asymptotic approximations improve as arguments get larger. That’s convenient for us since we’re looking for an approximation for n so large that it’s beyond our table of saved values.

For this particular series, the absolute value of the error in truncating the series is less than the absolute value of the first term left out. Suppose we make a look-up table for the values 1 through 256. If we truncate the series after 1/(12x), the error will be less than 1/(360x³). If x > 256, then log(x!) > 1167 and the error in the asymptotic approximation is less than 1/(360 · 256³) ≈ 1.65 × 10⁻¹⁰. Since the number we’re computing has at least four digits and the result is good to 10 decimal places, we should have at least 14 significant figures, near the limits of floating point precision. (For more details, see Anatomy of a floating point number.)

In summary, one way to compute log factorial is to pre-compute log(n!) for n = 1, 2, 3, … 256 and store the results in an array. For values of n ≤ 256, look up the result from the table. For n > 256, return

(x – 1/2) log(x) – x + (1/2) log(2π) + 1/(12x)

with x = n + 1. This has been coded up here.

You could include the 1/(360 x3) term or higher terms from the asymptotic series and use a smaller table. This would use less memory but would require more computation for arguments outside the range of the table.
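Here is a sketch of the scheme in Python. This is my own illustration, not the code linked above: the table is filled once by summing logs, and larger arguments fall through to the truncated asymptotic series.

```python
import math

TABLE_SIZE = 256

# Pre-compute log(n!) for n = 0, 1, ..., 256 by summing logs.
_log_fact = [0.0] * (TABLE_SIZE + 1)
for k in range(2, TABLE_SIZE + 1):
    _log_fact[k] = _log_fact[k - 1] + math.log(k)

_HALF_LOG_2PI = 0.5 * math.log(2.0 * math.pi)

def log_factorial(n):
    """Return log(n!): table lookup for n <= 256, Stirling's series otherwise."""
    if n < 0:
        raise ValueError("n must be nonnegative")
    if n <= TABLE_SIZE:
        return _log_fact[n]
    x = n + 1.0  # log(n!) = log Gamma(n + 1)
    return (x - 0.5) * math.log(x) - x + _HALF_LOG_2PI + 1.0 / (12.0 * x)
```

For n > 256 the truncation error is below 1/(360 · 256³), so the result agrees with math.lgamma(n + 1) to roughly ten decimal places.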


John Cleese on creativity

Here’s a 10-minute talk by John Cleese on creativity:

From about 6:20 into the video:

If you’re racing around all day, ticking things off on lists, looking at your watch, making phone calls, and generally just keeping all the balls in the air, you are not going to have any creative ideas.

What distribution does my data have?

“Which distribution describes my data?” Variations on that question pop up regularly on various online forums. Sometimes the person asking the question is looking for a goodness of fit test but doesn’t know the jargon “goodness of fit.” But more often they have something else in mind. They’re thinking of some list of familiar, named distribution families — normal, gamma, Poisson, etc. — and want to know which distribution from this list best fits their data. So the real question is something like the following:

Which distribution from the well-known families of probability distributions fits my data best?

Statistics classes can give the impression that there is a short list of probability distribution families, say the list in the index of the textbook for the class, and that something from one of those families will always fit any data set. This impression starts to seem absurd when stated explicitly. It raises two questions.

  1. What exactly is the list of well-known distributions?
  2. Why should a distribution from this list fit your data?

As for the first question, there is some consensus as to what the well-known distributions are. The distribution families in this diagram would make a good start. But the question of which distributions are “well known” is a sociological question, not a mathematical one. There’s nothing intrinsic to a distribution that makes it well-known. For example, most statisticians would consider the Kumaraswamy distribution obscure and the beta distribution well-known, even though the two are analytically similar.

You could argue that the canonical set of distributions is somewhat natural by a chain of relations. The normal distribution is certainly natural due to the central limit theorem. The chi-squared distribution is natural because the square of a normal random variable has a chi-squared distribution. The F distribution is related to the ratio of chi-squared variables, so perhaps it ought to be included. And so on and so forth. But each link in the chain is a little weaker than the previous. Also, why this chain of relationships and not some other?

Alternatively, you could argue that the distributions that made the canon are there because they have been found useful in practice. And so they have.  But had people been interested in different problems, a somewhat different set of distributions would have been found useful.

Now on to the second question: Why should a famous distribution fit a particular data set?

Suppose a police artist asked a witness which U. S. president a criminal most closely resembled. The witness might respond

Well, she didn’t look much like any of them, but if I have to pick one, I’d pick John Adams.

The U. S. presidents form a convenient set of faces. You can find posters of their faces in many classrooms. The U. S. presidents are historically significant, but a police artist would do better to pick a different set of faces as a first pass in making a sketch.

I’m not saying it is unreasonable to want to fit a famous distribution to your data. Given two distributions that fit the data equally well, go with the more famous distribution. This is a sort of celebrity version of Occam’s razor. It’s convenient to use distributions that other people recognize. Famous distributions often have nice mathematical properties and widely available software implementations. But the list of famous distributions can form a Procrustean bed that we force our data to fit.

The extreme of Procrustean statistics is a list of well-known distributions with only one item: the normal distribution. Researchers often apply a normal distribution where it doesn’t fit at all. More dangerously, experienced statisticians can assume a normal distribution when the lack of fit isn’t obvious. If you implicitly assume a normal distribution, then any data point that doesn’t fit the distribution is an outlier. Throw out the outliers and the normal distribution fits well! Nassim Taleb calls the normal distribution the “Great Intellectual Fraud” in his book The Black Swan because people so often assume the distribution fits when it does not.
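One concrete way to compare candidate families is to fit each by maximum likelihood and compare log-likelihoods. Below is a minimal sketch with made-up skewed data and just two candidates; it is my own illustration of the idea, and a real analysis would use more candidates and a proper goodness-of-fit test (e.g. via scipy.stats).

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1000)  # skewed, positive data

# Normal fit by maximum likelihood: sample mean and standard deviation.
mu, sigma = data.mean(), data.std()
ll_normal = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu) ** 2 / (2 * sigma**2))

# Exponential fit by maximum likelihood: rate = 1 / sample mean.
lam = 1.0 / data.mean()
ll_expon = np.sum(np.log(lam) - lam * data)

# The exponential family fits this skewed data far better than the normal.
print("normal:", ll_normal, "exponential:", ll_expon)
```

Forcing the normal fit here, then discarding the points it handles worst as “outliers,” is exactly the Procrustean failure mode described above.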
