Erasure coding white paper

Last year I worked with Hitachi Data Systems to evaluate the trade-offs of replication and erasure coding as ways to increase data storage reliability while minimizing costs. This led to a white paper that has just been published:

Compare Cost and Performance of Replication and Erasure Coding
Hitachi Review Vol. 63 (July 2014)
John D. Cook
Robert Primmer
Ab de Kwant

3 thoughts on “Erasure coding white paper”

BobC

12 August 2014 at 12:47

About 30 years ago I helped perform a reliability analysis for a portable fault-tolerant distributed file system called “Gemini” that supported local editing and global access by retrieving a variable quorum of the replicated copies before performing certain operations (e.g., the write quorum was larger than the read quorum).
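As an aside on the quorum idea mentioned above: a minimal sketch of the textbook quorum-intersection check, assuming N replicas with one vote each. This is not Gemini's actual rule, which the comment does not spell out; the function name and parameters are hypothetical.

```python
def quorums_are_safe(n_replicas: int, read_quorum: int, write_quorum: int) -> bool:
    """Textbook quorum-intersection check (an assumption, not Gemini's documented rule).

    Every read quorum must overlap every write quorum, and any two
    write quorums must overlap each other.
    """
    reads_see_latest_write = read_quorum + write_quorum > n_replicas
    writes_cannot_diverge = 2 * write_quorum > n_replicas
    return reads_see_latest_write and writes_cannot_diverge


if __name__ == "__main__":
    # Five replicas with a small read quorum and a larger write quorum.
    print(quorums_are_safe(5, read_quorum=2, write_quorum=4))
    # A write quorum of 3 no longer guarantees overlap with a read quorum of 2.
    print(quorums_are_safe(5, read_quorum=2, write_quorum=3))
```

Under these conditions a write quorum larger than the read quorum makes reads cheap and writes expensive, which matches the arrangement described above.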

At that time, a major consideration was sector loss due to head crashes. If I correctly understood your paper, only whole-disk loss was a concern. Is sector-level loss no longer an issue? Or is the SMART subsystem (or something similar) being relied upon to handle it?

In our analysis we tested two Gemini implementation strategies, one that “owned” one or more disks (or partitions) on each remote system, and another implementation layered over conventional networked filesystems (primarily NFS at the time). Not surprisingly, we found the best performance and reliability (by far) was obtained when we eliminated all external layers and protections, and tailored Gemini’s algorithms to maximize availability given all known sources of loss. But the tailored implementation was extremely non-portable, so it was abandoned after the “real world” ideal performance had been identified.

After gathering the various failure modes and their distributions, we modeled the quorum building process using Markov chains while, in parallel, testing multiple instrumented Gemini deployments. The models highlighted problems in the implementations, and the implementations revealed shortcomings in the model. By the time the two converged, we had learned a huge amount about not only Gemini, but even more about disks (from many makers), filesystems, network hardware (primarily Ethernet and T1), network protocols, and operating systems (4.3BSD derivatives).

While Gemini itself faded away, the results of our work led to many improvements in how disks and networks were utilized and unified. I failed to find the specific papers containing our reliability analysis (I was the 11th or 12th author, not bad for an undergrad), but searching for “gemini fault-tolerant distributed filesystem” will yield several gems, most of which were published after I left the Gemini group in late 1985.

I did make one small but significant contribution to the project, discovering a mapping between the Gemini server permissions (run under a non-root user called “gemini” on each system), and normal users who were creating and using Gemini files (Gemini users typically did not have accounts on the systems running the Gemini servers). The initial Gemini implementation provided its own ponderous distributed ownership and authorization subsystem (to keep Gemini users from seeing each other’s data without permission) that clearly had to be eliminated, and the group was gearing up to modify the 4.3BSD permission system to provide the required capabilities (everyone modified BSD back then). The need for that work was completely eliminated by my little hack (which relied on a big honking insight).

John

12 August 2014 at 13:01

We don’t get into kinds of failure in this paper, other than distinguishing temporary from permanent, i.e. unavailability vs loss. You can interpret failure how you like, and if you supply the corresponding probabilities, everything goes through. So you can think of disk failure as personal: If a sector of your data fails, the disk has failed as far as you’re concerned.
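To illustrate what "supply the corresponding probabilities" can look like, here is a rough sketch. It is not the model from the paper. It assumes each disk (or fragment) fails independently with the same probability p, and the function names are made up for this example.

```python
from math import comb


def replication_loss(p: float, copies: int) -> float:
    """Probability that every one of `copies` independent replicas has failed."""
    return p ** copies


def erasure_loss(p: float, k: int, m: int) -> float:
    """Probability of loss when data is split into k + m fragments and any k suffice.

    Data is lost when more than m of the n = k + m fragments have failed.
    Assumes independent failures with a common probability p.
    """
    n = k + m
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(m + 1, n + 1))


def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data under replication."""
    return float(copies)


def erasure_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data under a k + m erasure code."""
    return (k + m) / k


if __name__ == "__main__":
    p = 0.01  # illustrative value only, not a measured failure rate
    print("3 replicas:", replication_loss(p, 3), "overhead", replication_overhead(3))
    print("6+3 code:  ", erasure_loss(p, 6, 3), "overhead", erasure_overhead(6, 3))
```

Whether p stands for whole-disk loss, sector loss, or temporary unavailability is up to whoever supplies the number; the arithmetic is the same.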

Michael Yeaney

13 August 2014 at 09:05

Another paper I’ve recently read relevant to erasure coding is the Erasure Coding in Windows Azure paper by Huang et al. [1]. It’s an interesting read that discusses the storage and cost benefits for large-scale storage systems.

[1]: http://research.microsoft.com/pubs/179583/LRC12-cheng%20webpage.pdf
