RCMBnet

RCMBnet: A Distributed Hardware and Firmware Support for Software Fault Tolerance

Project type: PhD project

Period: 1998 - 2007

Design, formal verification and implementation of hardware support for the consistent execution of fault-tolerant software that follows the N-Version Programming approach.

Software is a major source of reliability degradation in dependable systems. One of the classical remedies is to tolerate software faults by means of N-Version Programming (NVP). However, NVP solutions are very costly and therefore seldom used, because they require the independent development of N versions of the same application program, special hardware, and changes and additions at all levels of the system.

In previous work, a low-cost architecture for NVP execution was devised, in which each of the N versions is executed on a different standard computer. A special hardware unit, called NVXP, is attached to each computer to take care of the communication among versions and of all tasks required to support the execution of the NVP application. Some of the key features of this architecture are the following: the broadcast network that links the NVXPs together is duplicated in order to prevent a single point of failure; off-the-shelf components are used; and the fault tolerance functionality is concentrated in the NVXP, which makes it orthogonal to the other functionalities of the system.

In this dissertation, we present a complete redesign of the above-mentioned architecture, which aims at ensuring its actual implementability and at resolving the potentially inconsistent states that were not treated in detail in the original design.

We have made three main changes to the hardware of this architecture. First, we have developed a completely new and improved design for the NVXPs, which we call Redundancy and Communication Management Boards (RCMBs). The set of RCMBs connected to each other by means of the broadcast network is what we call the RCMBnet. Second, in order to further reduce the cost of the final system, we have chosen PCs as the platforms for executing the different versions, so that the RCMBs are boards plugged into the bus of the host PCs. Third, we use the Controller Area Network (CAN) protocol as the basic communication technology for the broadcast network, due to its well-known advantages in cost, reliability and real-time performance, and due to the growing interest in using CAN for critical applications. More precisely, we use a modification of the CAN protocol, called MajorCAN, which we have devised to eliminate some scenarios of inconsistent communication that CAN exhibits.

Besides introducing changes in the hardware, we have designed new software to be executed in the RCMBs. Two aspects of this software, both concerning the consistent management of the redundancy, have received special attention in this dissertation: the enforcement of replica determinism for all replicated operations, and the consistent reintegration of RCMBs after they suffer transient faults. Replica determinism enforcement ensures, for instance, that all non-faulty versions produce the same results even if some versions are faulty. Reintegration, in turn, prevents a quick attrition of the available redundancy by allowing an RCMB that has experienced a temporary failure to return to normal coordinated operation with the other RCMBs. The new mechanism proposed for consistent reintegration has been formally verified using model checking.
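To illustrate why replica determinism matters for NVP, the following minimal sketch shows an exact-match majority voter over the results of N versions. It is not the voting algorithm used in the RCMBs, which this summary does not describe; the function name `majority_vote` and the choice of exact-match comparison are assumptions made purely for illustration. Such a voter can only mask faulty versions if all non-faulty versions deliver identical results.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the result shared by a strict majority of versions, or None.

    Hypothetical helper for illustration only; not taken from the RCMBnet design.
    """
    if not results:
        return None
    # most_common(1) yields the single most frequent (value, count) pair.
    value, count = Counter(results).most_common(1)[0]
    # A strict majority requires more than half of all submitted results.
    return value if count > len(results) // 2 else None


# Three versions, one of which is faulty: the two agreeing results win.
agreed = majority_vote([42, 42, 7])

# No two versions agree: the voter cannot decide and reports None.
undecided = majority_vote([1, 2, 3])
```

With three versions, this voter masks one faulty result; if the non-faulty versions were allowed to diverge, no majority would exist even without any fault.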

Because our ultimate goal, the design of a complete fault-tolerant distributed system, is a complex one, we have made a significant effort to keep our approach as systematic as possible. More specifically, we have adopted the guidelines of the paradigm proposed by Prof. Avizienis for the design of fault-tolerant systems. Following one of these guidelines, we have developed initial prototypes of the essential parts of the architecture, which supports our claim of actual implementability.

Our design differs in relevant ways from other architectures conceived for similar purposes. Moreover, our approach presents significant advantages beyond the fulfillment of its initial aims of low cost, actual implementability and consistent operation. Some of these additional advantages are the following: it achieves a consistent management of the redundancy without introducing a sizeable computation or communication overhead in the system; it requires fewer computers and versions than common distributed systems to tolerate a fixed number of faults; and it allows the broadcast network to use a simplified topology such as a bus, thereby facilitating scalability. Finally, although initially developed for NVP, our RCMBnet is also able to support other software replication techniques, such as active replication.

Project Leader

Project Collaborators

Related Publications