Ultra-reliable software development

From a NASA page advocating formal methods:

We are very good at building complex software systems that work 95% of the time. But we do not know how to build complex software systems that are ultra-reliably safe (i.e. P_f < 10^-7/hour).

Emphasis added.

Developing medium-reliability and high-reliability software are almost entirely different professions. Using typical software development procedures on systems that must be ultra-reliable would invite disaster. But using extremely cautious development methods on systems that can afford to fail relatively often would be an economic disaster.
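
To put rough numbers on that gap, here is a minimal sketch. It assumes a constant failure rate (an exponential model) and reads "works 95% of the time" as a 95% chance of getting through one hour without a failure. Both are assumptions made for illustration, not something the NASA page states, and the function names are hypothetical.

```python
import math


def survival_probability(failure_rate_per_hour: float, hours: float) -> float:
    """P(no failure during `hours`), assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def rate_from_reliability(reliability: float, hours: float) -> float:
    """Constant failure rate that gives `reliability` over `hours`."""
    return -math.log(reliability) / hours


# Target from the quote: P_f < 10^-7 per hour.
ULTRA_RATE = 1e-7

# Assumption: "95% of the time" means a 95% chance of surviving one hour.
typical_rate = rate_from_reliability(0.95, hours=1.0)

# How many times more often the typical system fails per hour.
ratio = typical_rate / ULTRA_RATE

# Chance of an ultra-reliable system running ten years without failure.
ten_years_in_hours = 10 * 365 * 24
ultra_ten_year_survival = survival_probability(ULTRA_RATE, ten_years_in_hours)

print(f"typical failure rate: {typical_rate:.4f} per hour")
print(f"ratio to ultra-reliable target: {ratio:,.0f}")
print(f"ultra-reliable ten-year survival: {ultra_ten_year_survival:.4f}")
```

Under those assumptions the typical system fails roughly half a million times as often per hour as the ultra-reliable target allows, which is one way to see why the two kinds of development are different professions.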

2 thoughts on “Ultra-reliable software”

Nick

16 November 2016 at 15:26

I honestly have no clue how to get failure rates down that low. However, getting from 95% reliable to (say) 99% reliable is attainable simply through tooling. In the C/C++ ecosystem, honggfuzz will find your edge cases (assuming your input format is sane), and if you can compile with clang, you can use AddressSanitizer to find memory errors and leaks and ThreadSanitizer to find data races.

But the problem is that once you start working in a multi-language environment, you have to find similar toolchains for each language. Java is especially bad to have in a high-reliability stack, IMO, as there’s essentially no comparable tooling for it.

19 November 2016 at 09:28

The Space Shuttle famously had an entire team and process for ultra-reliable software. I’m sure, as John said, that it would be unaffordable to have that sort of team for most products.

In any event, when they retired the shuttles and partially disassembled them for preservation, they discovered that some hydraulic actuators on the tail had been installed *backwards* at the factory.

So possibly even 10^-7 reliability wouldn’t have made the system any safer, because of issues external to the software. The takeaway is that in general there might be a floor below which tighter reliability specifications stop being useful.

If there is a 10^-7 chance your database might mangle a user’s address, and a 10^-3 chance that the data entry operator might get it wrong on original entry, how much effort should you put into the first problem?
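
A back-of-the-envelope way to check that question is sketched below. It assumes the two kinds of error are independent, which is an assumption for illustration, and the function and variable names are hypothetical.

```python
def combined_error(p_database: float, p_entry: float) -> float:
    """P(address ends up wrong) if either independent error can spoil it."""
    return 1.0 - (1.0 - p_database) * (1.0 - p_entry)


P_DATABASE = 1e-7  # chance the database mangles the address
P_ENTRY = 1e-3     # chance the operator mistypes it on original entry

baseline = combined_error(P_DATABASE, P_ENTRY)

# Scenario 1: a perfect database, operator unchanged.
perfect_database = combined_error(0.0, P_ENTRY)

# Scenario 2: database unchanged, operator error rate halved.
halved_entry = combined_error(P_DATABASE, P_ENTRY / 2)

# Fraction of the overall error removed by each scenario.
gain_from_database = (baseline - perfect_database) / baseline
gain_from_entry = (baseline - halved_entry) / baseline

print(f"baseline error probability: {baseline:.10f}")
print(f"relative gain, perfect database: {gain_from_database:.6f}")
print(f"relative gain, halved entry errors: {gain_from_entry:.6f}")
```

With these figures, even a flawless database removes only about one part in ten thousand of the overall error, while halving the data entry error rate removes about half of it.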
