The acronym RAID originally stood for Redundant Array of Inexpensive Disks. The idea is to create reliable disk storage using multiple drives that are not so reliable individually. Data is stored in such a way that if one drive fails, you’re OK as long as you can replace it before another drive fails. RAID is very commonly used.
Now folks often use redundant arrays of expensive disks. The name “RAID” is too established to change, and so now people say the “I” stands for “Independent.” And their in lies a problem.
The naïve theory is that if hard disk 1 has probability of failure 1/1000 and so does disk 2, then the probability of both failing is 1/1,000,000. That assumes failures are statistically independent, but they’re not. You can’t just multiply probabilities like that unless the failures are uncorrelated. Wrongly assuming independence is a common error in applying probability, maybe the most common error.
Joel Spolsky commented on this problem in the latest StackOverflow podcast. When a company builds a RAID, they may grab four or five disks that came off the assembly line together. If one of these disks has a slight flaw that causes it to fail after say 10,000 hours of use, it’s likely they all do. This is not just a theoretical possibility. Companies have observed batches of disks all failing around the same time.
Using disks made by competing companies would significantly decrease the correlation in failure probabilities. The failures will always be somewhat correlated. For one thing, different manufacturers use similar materials and processes. Also, regardless of where the drives come from, they end up in the same box, all subject to the same environmental factors.