Reliable system design

Reliable system design is the design of systems with high levels of reliability and availability.

No system is perfectly reliable, and reliable systems engineering cannot engineer out failure modes that the modelling did not anticipate. For this reason, reliable systems are generally engineered to a designed failure rate rather than to a zero failure rate.

Typical design targets for reliable systems include "five nines" (99.999% availability) and "six nines" (99.9999% availability). Some life-critical systems are designed to even higher levels of availability.
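
To make these targets concrete, the following minimal sketch converts an availability figure into the downtime it permits per year. The function name and the 365.25-day year are assumptions made for illustration; they do not come from any standard.

```python
# Convert an availability target into the downtime it permits per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year of 365.25 days


def allowed_downtime_minutes(availability: float) -> float:
    """Return the minutes per year a system may be down at this availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, availability in [("five nines", 0.99999), ("six nines", 0.999999)]:
    minutes = allowed_downtime_minutes(availability)
    print(f"{label}: {minutes:.2f} minutes/year ({minutes * 60:.1f} seconds)")
```

Under these assumptions, five nines allows a little over five minutes of downtime per year, and six nines allows roughly half a minute.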

Reliable system design attempts to create reliable systems by design, rather than by blindly over-engineering them. The analytical tools for reliable systems design are root cause analysis and threat tree analysis. These allow real-world system failures to be investigated and the failure modes of new systems to be modelled.

The main engineering approaches of reliable systems design are:

eliminating single points of failure
engineering any remaining single points of failure to whatever level is necessary to reach the system specification
adding extra system safety margins to allow for errors in modelling or implementation

The term "single point of failure" describes any part of the system that can bring down the whole system if it fails.
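
One way to apply this definition is to model the system as a graph and test each component in turn: if removing it cuts the path from one end to the other, it is a single point of failure. The sketch below is only an illustration under simplifying assumptions. The topology and component names are hypothetical, links are treated as one-way, only one component fails at a time, and the two endpoints themselves are not tested.

```python
from collections import deque


def is_connected(links, start, goal, failed=None):
    """Breadth-first search from start to goal, skipping the failed component."""
    if failed in (start, goal):
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in links.get(node, ()):
            if neighbour != failed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, start, goal):
    """Return the components whose individual failure cuts start off from goal."""
    components = set(links) | {n for targets in links.values() for n in targets}
    return sorted(
        c for c in components - {start, goal}
        if not is_connected(links, start, goal, failed=c)
    )


# Hypothetical topology: two routers, but only one shared switch.
links = {
    "client": ["router_a", "router_b"],
    "router_a": ["switch"],
    "router_b": ["switch"],
    "switch": ["server"],
}
print(single_points_of_failure(links, "client", "server"))
```

In this hypothetical topology either router can fail without cutting the client off from the server, so only the shared switch is reported.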

Most non-critical real-world systems have many single points of failure: for example, a typical desktop computer has only one processor, one power supply, one keyboard, one screen, and so on, the failure of any of which will render that computer unusable.

However, a business as a whole generally conducts its affairs so that the failure of any single desktop PC will not bring the business down. Thus, the components mentioned above are single points of failure for the PC, but not for the larger system of which the PC is a component. Similar techniques of using duplicated systems and backup systems are used to create resilient systems for critical applications such as databases, communications networks and air traffic control systems.

However, massive redundancy alone does not make a system reliable as long as even one single point of failure remains. For example, a network in which power feeds, network connections, routers, and router interconnections have all been correctly made redundant can still have a single point of failure if both routers are housed in a single rack, allowing a single spilled cup of coffee to take out both routers at once.
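
The effect of the shared rack can be sketched with the usual series and parallel availability formulas. These formulas assume that components fail independently, and the availability figures below are hypothetical values chosen only for illustration.

```python
from math import prod


def series(*availabilities):
    """All components must work: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities):
    """At least one component must work: 1 minus the product of unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)


ROUTER = 0.999  # hypothetical availability of one router
RACK = 0.9999   # hypothetical availability of one rack (spills, power strip, ...)

# Each router in its own rack: two fully independent router-plus-rack paths.
separate_racks = parallel(series(ROUTER, RACK), series(ROUTER, RACK))

# Both routers in one rack: the rack sits in series with the redundant pair.
shared_rack = series(RACK, parallel(ROUTER, ROUTER))

print(f"separate racks: {separate_racks:.7f}")
print(f"shared rack:    {shared_rack:.7f}")
```

With the shared rack, the overall availability can never exceed that of the rack itself, however many routers are placed inside it.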

Even eliminating every conceivable single point of failure is not by itself enough to make a system truly resilient, as the extra redundancy may make the system vulnerable to Byzantine failure modes.