Hard disk array failure probabilities

The acronym RAID originally stood for Redundant Array of Inexpensive Disks. The idea is to create reliable disk storage using multiple drives that are not so reliable individually. Data is stored in such a way that if one drive fails, you’re OK as long as you can replace it before another drive fails. RAID is very commonly used.

Now folks often use redundant arrays of expensive disks. The name “RAID” is too established to change, and so now people say the “I” stands for “Independent.” And therein lies a problem.

The naïve theory is that if hard disk 1 has probability of failure 1/1000 and so does disk 2, then the probability of both failing is 1/1,000,000. That assumes failures are statistically independent, but they’re not. You can’t just multiply probabilities like that unless the failures are uncorrelated. Wrongly assuming independence is a common error in applying probability, maybe the most common error.
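To put rough numbers on this, here is a minimal sketch of a shared-cause model. It assumes that both disks come from one batch, that the batch is flawed with some small probability, and that the disks fail independently once you know the batch quality. The 1% bad-batch rate and the 5% failure rate within a bad batch are made-up illustrative values, not measurements; only the 1/1000 figure comes from the example above.

```python
def joint_failure_probability(p_marginal, q_bad_batch, p_fail_bad):
    """Probability that two disks from the same batch both fail.

    Hypothetical model: the batch is bad with probability q_bad_batch.
    A disk from a bad batch fails with probability p_fail_bad. The failure
    rate for a good batch is chosen so that each disk's overall failure
    probability is still p_marginal.
    """
    p_fail_good = (p_marginal - q_bad_batch * p_fail_bad) / (1 - q_bad_batch)
    if not 0 <= p_fail_good <= 1:
        raise ValueError("parameters are inconsistent with p_marginal")
    # Conditional on batch quality, the two disks fail independently,
    # so we average the squared failure rate over the two batch types.
    return (q_bad_batch * p_fail_bad ** 2
            + (1 - q_bad_batch) * p_fail_good ** 2)


p = 1 / 1000                      # per-disk failure probability from the text
naive = p * p                     # what you get by assuming independence
correlated = joint_failure_probability(p, q_bad_batch=0.01, p_fail_bad=0.05)

print(f"naive (independent): {naive:.2e}")
print(f"shared-batch model:  {correlated:.2e}")
print(f"ratio:               {correlated / naive:.1f}")
```

Each disk still fails with probability 1/1000 on its own, yet under these assumed parameters both failing together is roughly 25 times more likely than the naïve product suggests.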

Joel Spolsky commented on this problem in the latest StackOverflow podcast. When a company builds a RAID, they may grab four or five disks that came off the assembly line together. If one of these disks has a slight flaw that causes it to fail after say 10,000 hours of use, it’s likely they all do. This is not just a theoretical possibility. Companies have observed batches of disks all failing around the same time.

Using disks made by competing companies would significantly decrease the correlation between failures, but the failures will always be somewhat correlated. For one thing, different manufacturers use similar materials and processes. Also, regardless of where the drives come from, they end up in the same box, all subject to the same environmental factors.

9 thoughts on “Hard disk array failure probabilities”

Sarah

6 January 2009 at 07:28

There was a recent example: a blogging site (I’m sorry to have forgotten the name) lost all of its content in a single fiasco. They had RAID disks, but of course their system was designed to protect the data in case of drive failure. Instead, the drives remained intact but got wiped by some sort of coding error that overwrote good data.

There’s something to be said for the good old-fashioned tape backup stored offsite…

John

6 January 2009 at 07:34

I think I heard the same story, but I can’t remember the name of the site either. The company suggested users recover their data by grabbing Google’s cached versions of their posts, one by one. Sounds painful, but at least it’s something. I bet Google’s cache has saved more than one company’s bacon.

John Venier

6 January 2009 at 12:29

I pointed a knifemaker I know to the Wayback Machine, and he was really happy to find the old content of his by then long defunct website. It included images of knives he had made and sold but for which he no longer had photos.

My long-ago experience with tapes was that they are very solid but slow. Recovering a deleted file usually took hours of automated searching, but at least it was possible. Also, at that time one of the highest-bandwidth channels was a tape or hard drive sent by FedEx, beating normal channels by orders of magnitude. I’m pretty sure it is still true today.

John

6 January 2009 at 12:35

A long time ago someone said never underestimate the bandwidth of a station wagon full of floppy disks. Substitute DVD for floppy disk and it remains true.

ragozzi

10 March 2010 at 09:03

Yes, the independence assumption is a very strong assumption. That’s what caused the “Challenger” disaster in 1986, because it was assumed that the 6 O-rings on the right rocket booster were “independent events”. But they were not, because they had common failure causes. (I think it had to do with low temperature and other environmental factors.) However, even if they were independent, that still wouldn’t be good enough, because all 6 of them had to succeed (unlike RAID) in order for the ship not to fall apart. So if the probability of success of one ring is 0.975, the probability of all 6 succeeding is (0.975) to the power of (6) = 0.859, which is not that much.
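As a quick check of the arithmetic in this comment, the sketch below contrasts the two situations under the independence assumption: a system where every component must work, and a redundant one where a single working component is enough. The function names are just illustrative, and the 0.975 figure is the commenter's example value.

```python
def all_succeed(p_success, n):
    """Probability that all n independent components succeed."""
    return p_success ** n


def at_least_one_succeeds(p_success, n):
    """Probability that at least one of n independent components succeeds."""
    return 1 - (1 - p_success) ** n


p_ring = 0.975   # example per-component success probability from the comment
n_rings = 6

series = all_succeed(p_ring, n_rings)
redundant = at_least_one_succeeds(p_ring, n_rings)

print(f"all {n_rings} must succeed:  {series:.3f}")
print(f"any one of {n_rings} suffices: {redundant:.10f}")
```

Both numbers rely on independence, so with common failure causes the real figures could be considerably worse.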

Mike

15 May 2012 at 09:56

I realize this is an old post, but another factor is that the disks are usually loaded the same way. Two disks getting random but relatively equal use is one thing: one might by chance have more drive seeks than the other, or happen to be spinning when the server is bumped, or whatever. However, when two drives are writing at the same time, or, in the case of X0 (50, 0 etc) mirroring, making very nearly identical movements at any time, the chance of one thing causing both drives to get borked is much higher. For the uber paranoid: build SAN LUNs with mirrors from two different disk arrays if possible. Redundant channels to each disk array, and redundant/double bandwidth as well, since your “live” wires to your data will be two, not one.

Miguel Duarte

8 February 2013 at 17:11

It happened to me. Two hard disks, same model, bought at the same time, one the backup of the other (on the same NAS). One died a week after the other, the second dying at the exact moment I was copying its data for the replacement of the first.

Good thing I had another backup on a different drive.

Tobias Brox

24 October 2016 at 03:26

Just found a story from a computer repair guy who recently had four independent customer cases of a “broken hard disk” within a relatively short timeframe. It turned out all four customers had a 500G Western Digital WD5000AAKX hard disk manufactured in 2011.

https://steemit.com/hardware/@oecp85/planned-obsolescence

Sean

21 November 2016 at 21:35

I recently had this happen…

4x WD Red 4TB drives in a Netgear ReadyNAS performing a redundant backup. One drive failed within weeks and was replaced (with a drive from the same batch), and that also failed. While I was replacing that disk, 2x more drives failed and I lost all the data on the array. I’ve since had all of the drives replaced and rebuilt those backups, but it was a testing time there for a couple of weeks.
