RCMBnet: A distributed Hardware and Firmware Support for Software Fault Tolerance

Type | PhD project
Duration | 1998 - 2007
Project leader | Julián Proenza Arenas
Collaborators | José Miró

DESCRIPTION

Design, formal verification and implementation of hardware support for the consistent execution of fault-tolerant software that follows the N-Version Programming approach.

Software is a major source of reliability degradation in dependable systems. One of the classical remedies is to tolerate software faults by means of N-Version Programming (NVP). However, NVP requires the independent development of N versions of the same application program, special hardware, and changes and additions at all levels of the system. NVP solutions are therefore very costly and seldom used.

In previous work, a low-cost architecture for NVP execution was devised, in which each of the N versions is executed on a different standard computer. A special hardware unit, called NVXP, is attached to each computer to handle the communication among versions and all tasks required to support the execution of the NVP application. Some of the key features of this architecture are: the broadcast network that links the NVXPs together is duplicated in order to prevent a single point of failure; off-the-shelf components are used; and the fault tolerance functionality is concentrated in the NVXP, which makes it orthogonal to the other functionalities of the system.

In this dissertation, we present a complete redesign of the above-mentioned architecture that aims at ensuring its actual implementability and at resolving the potential inconsistent states that were not treated in detail in the original design. We have made three main changes to the hardware of this architecture. First, we have developed a completely new and improved design for the NVXPs, which we call Redundancy and Communication Management Boards (RCMBs). The set of RCMBs connected to each other by means of the broadcast network is what we call the RCMBnet. Second, in order to further reduce the cost of the final system, we have chosen PCs as the platforms for executing the different versions, so that the RCMBs are PC boards inserted in the bus of the host PCs. Third, we use the Controller Area Network (CAN) protocol as the basic communication technology for the broadcast network, due to its well-known advantages in cost, reliability and real-time performance, and due to the growing interest in using CAN for critical applications. In fact, what we use is a modification of the CAN protocol, called MajorCAN, which we have devised to eliminate some scenarios of inconsistent communication that CAN exhibits.

Besides introducing changes in the hardware, we have designed new software to be executed in the RCMBs. Two aspects of this software concerning the consistent management of the redundancy have received special attention in this dissertation: replica determinism enforcement of all replicated operations, and consistent reintegration of RCMBs after they suffer transient faults. Replica determinism enforcement ensures, for instance, that all non-faulty versions produce the same results even if some versions are faulty. Reintegration, in turn, prevents a quick attrition of the available redundancy by allowing an RCMB that has experienced a temporary failure to return to normal coordinated operation with the other RCMBs. The new mechanism proposed for consistent reintegration has been formally verified using model checking.

Given that our ultimate goal is as complex as providing the design of a complete fault-tolerant distributed system, we have made a significant effort to keep our approach as systematic as possible. More specifically, we have adopted the guidelines provided by the paradigm proposed by Prof. Avizienis for the design of fault-tolerant systems. Following one of these guidelines, we have developed first prototypes of the essential parts of the architecture, which supports our claim of actual implementability.

Our design exhibits relevant differences from other architectures that were conceived for similar purposes. Moreover, our approach presents significant advantages beyond the fulfillment of its initial aims of low cost, actual implementability and consistent operation. Some of these additional advantages are: it achieves a consistent management of the redundancy without introducing a sizeable computation or communication overhead in the system; it requires fewer computers and versions to tolerate a fixed number of faults than common distributed systems; and it allows the broadcast network to use a simplified topology such as a bus, thereby facilitating scalability. Finally, although initially developed for NVP, our RCMBnet is also able to support other software replication techniques, such as active replication.
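
As a rough illustration of the NVP idea described above, the results of the N versions are typically combined by a decision algorithm such as majority voting. The Python fragment below is only a minimal sketch of such a voter and does not reproduce the RCMB firmware. It assumes that each version delivers one result per round and that a result is accepted only when a strict majority of the N versions agree on it. The function name `majority_vote` and the use of `None` for a missing result are hypothetical choices made for this example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of the N versions.

    `results` holds one entry per version; None marks a version that
    delivered no result in this round (an assumption of this sketch).
    Returns None when no value reaches a strict majority.
    """
    n = len(results)
    # Count only the results that were actually delivered.
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    # Strict majority over all N versions, not only over those that answered.
    return value if votes > n // 2 else None
```

With three versions, `majority_vote([42, 42, 17])` yields 42, which masks the single deviating version, whereas three pairwise different results yield no majority and the function returns `None`.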

PUBLICATIONS

J. Proenza, J. Miró. RCMBnet: A distributed Hardware and Firmware Support for Software Fault Tolerance. Universitat de les Illes Balears, 2007.

A. Ortiz, J. Proenza, G. Bernat, G. Oliver. Improving the Safety of AUVs. In MTS/IEEE Oceans, Seattle-WA (USA), 1999.

M. A. Barranco, J. Proenza, G. Rodríguez-Navas, L. Almeida. Design and implementation of CANcentrate: an active star topology for improving fault confinement in CAN networks. 2005.

J. Proenza, J. Miro-Julia. System Partitioning of a Low-Cost Distributed Hardware and Firmware Support for Software-Fault Tolerance. 2004.

G. Rodríguez-Navas, J. Proenza. Una descripción de los fundamentos del protocolo TTCAN diseñado por Bosch. 2003.

G. Rodríguez-Navas, J. Proenza. On the Design of a Clock Service for CAN Networks. 2003.

M. A. Barranco, J. Proenza, L. Almeida. Enhancing the response of ReCANcentrate in presence of faults. 2007.

M. A. Barranco, J. Proenza, L. Almeida. On the Management of Media Replication in ReCANcentrate. 2007.

G. Rodríguez-Navas, J. Proenza, H. Hansson. An UPPAAL Model for Formal Verification of Clock Synchronization over CAN. 2005.

J. Proenza, J. Miro-Julia, H. Hansson. Redundancy Management in a Low-Cost Distributed Hardware and Firmware Support for Software-Fault Tolerance. 2007.

P. Ferriol, F. Navio, J. José, J. Pons, J. Proenza, J. Miro-Julia. Un circuito de bajo coste para la tolerancia a fallos en sistemas de control industrial. In Seminario Anual de Automática, Electrónica Industrial e Instrumentación (SAAEI), Pamplona (Spain), 1998.

G. Bernat, J. Miro-Julia, J. Proenza. Fixed priority schedulability analysis of a distributed real-time fault-tolerant architecture. In International Conference on Parallel and Distributed Techniques and Applications (PDPTA), Las Vegas (USA), 1998.

G. Bernat, J. Miro-Julia, J. Proenza. A technique to analyse the tolerance to transient overloads of a fault-tolerant real-time system. In IEEE High Assurance Systems Engineering Workshop (HASE), Washington (USA), 1998.

J. Proenza, J. Miro-Julia. Consistent management and slow attrition of redundancy in a distributed hardware support for NVP execution. In First International Conference on Dependable Systems and Networks (ICDSN), New York (USA), 2000.

J. Proenza, J. Pons, J. Miro-Julia. A low-cost fail-safe circuit for fault-tolerant control systems. In IEEE International Conference on Electronics, Circuits and Systems (ICECS), Paphos (Cyprus), 1999.

J. Proenza, A. Ortiz, G. Bernat, G. Oliver. A Cost-Effective Hardware Architecture for Fail-Safe Autonomous Underwater Vehicles. In IEEE International Conference on Electronics, Circuits and Systems (ICECS), Paphos (Cyprus), 1999.

J. Proenza, A. Ortiz, F. Prats, B. Santandreu, G. Oliver. Incorporating A Safety Hardware Module On A Low-Cost AUV. In International Symposium on Robotics and Automation (ISRA), Monterrey (Mexico), 2000.

J. Proenza, J. Miro-Julia. MajorCAN: A modification to the Controller Area Network protocol to achieve Atomic Broadcast. In First International Workshop on Group Communications and Computations (IWGCC), Taipei (Taiwan), 2000.

D. Avià, M. D. Diego, G. Oliver, A. Ortiz, J. Proenza. RAO: A Low-Cost AUV for Testing. In MTS/IEEE Oceans, Providence-RI (USA), 2000.

C. Guerrero, G. Rodríguez-Navas, J. Proenza. Hardware Support for Fault Tolerance in Triple Redundant CAN Controllers. In IEEE International Conference on Electronics, Circuits and Systems (ICECS), Dubrovnik (Croatia), 2002.

C. Guerrero, G. Rodríguez-Navas, J. Proenza. Design and Implementation of a Redundancy Manager for Triple Redundant CAN Controllers. In Annual Conference of the IEEE Industrial Electronics Society (IECON), Sevilla (Spain), 2002.

A. Ortiz, J. Proenza, G. Oliver. Automatic Inspection of Underwater Environments by Means of a Fleet of Submersible Agents. In Pattern Recognition Advances in Iberoamerica, Barcelona (Spain), 2000.

G. Rodríguez-Navas, J. Proenza. An Orthogonal and Fault-Tolerant Subsystem for High-Precision Clock Synchronization in CAN Networks. In WSEAS International Conference on Signal Processing, Robotics and Automation (ISPRA), Chiclana, Cádiz (Spain), 2002.

J. Ferreira, L. Almeida, J. A. Fonseca, G. Rodríguez-Navas, J. Proenza. Enforcing Consistency of Communication Requirements Updates in FTT-CAN. In International Workshop on Dependable Embedded Systems (DES), Florence (Italy), 2003.

G. Rodríguez-Navas, J. Proenza. Analyzing Atomic Broadcast in TTCAN Networks. In IFAC International Conference on Fieldbus Systems and their Applications (FET), Aveiro (Portugal), 2003.

G. Rodríguez-Navas, J. Jiménez, J. Proenza. An Architecture for Physical Injection of Complex Fault Scenarios in CAN Networks. In IEEE International Conference on Emerging Technology and Factory Automation (ETFA), Lisbon (Portugal), 2003.

G. Rodríguez-Navas, J. Proenza. Harmonizing Dependability and Real Time in CAN Networks. In Second International Workshop on Real-Time LANs in the Internet Age (RTLIA), Lisbon (Portugal), 2003.

G. Rodríguez-Navas, J. Juan, J. Proenza. Hardware Design of a High-Precision and Fault-Tolerant Clock Subsystem for CAN Networks. In IFAC International Conference on Fieldbus Systems and their Applications (FET), Aveiro (Portugal), 2003.

G. Rodríguez-Navas, M. A. Barranco, J. Proenza, I. Broster. COTS-based Hardware Support to Timeliness in CAN Networks. In IEEE International Conference on Emerging Technology and Factory Automation (ETFA), Lisbon (Portugal), 2003.

G. Rodríguez-Navas, J. Proenza. Clock Synchronization in CAN Distributed Embedded Systems. In International Workshop on Real-Time Networks (RTN), Catania (Italy), 2004.

M. A. Barranco, G. Rodríguez-Navas, J. Proenza, L. Almeida. CANcentrate: An Active Star Topology for CAN Networks. In IEEE International Workshop on Factory Communication Systems (WFCS), Vienna (Austria), 2004.

G. Rodríguez-Navas, J. Proenza, J. Rigo, J. Ferreira, L. Almeida, J. A. Fonseca. Design and Modeling of a Protocol to Enforce Consistency among Replicated Masters in FTT-CAN. In IEEE International Workshop on Factory Communication Systems (WFCS), Vienna (Austria), 2004.

M. A. Barranco, L. Almeida, J. Proenza. ReCANcentrate: A replicated star topology for CAN networks. In IEEE Conference on Emerging Technologies and Factory Automation (ETFA), Catania (Italy), 2005.

J. Proenza, L. Almeida. Position Paper on Dependability and Reconfigurability in Distributed Embedded Systems. In International Workshop on Real-Time Networks (RTN), Pisa (Italy), 2007.

G. Rodríguez-Navas, J. Proenza, H. Hansson. Using UPPAAL to Model and Verify a Clock Synchronization Protocol for the Controller Area Network. In IEEE Conference on Emerging Technologies and Factory Automation (ETFA), Catania (Italy), 2005.

T. Nolte, G. Rodríguez-Navas, J. Proenza, S. Punnekkat, H. Hansson. Towards analyzing the fault-tolerant operation of Server-CAN. In IEEE Conference on Emerging Technologies and Factory Automation (ETFA), Catania (Italy), 2005.

G. Rodríguez-Navas, J. Proenza, H. Hansson. An UPPAAL Model for Formal Verification of Master/slave Clock Synchronization over the Controller Area Network. In IEEE International Workshop on Factory Communication Systems (WFCS), Torino (Italy), 2006.

M. A. Barranco, J. Proenza, L. Almeida. First results of the assessment of the improvement of error containment achieved by CANcentrate. In IEEE International Workshop on Factory Communication Systems (WFCS), Torino (Italy), 2006.

M. Bonet, G. Donaire, J. Proenza. Modelling MajorCAN with UPPAAL. In IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Patras (Greece), 2007.

G. Rodríguez-Navas, J. Proenza, H. Hansson. Modeling and Verification of Master/Slave Clock Synchronization Using Hybrid Automata and Model-Checking. In International Conference on Formal Engineering Methods (ICFEM, LNCS 4789), Boca Raton (Florida, USA), 2007.

J. Proenza, J. Miro-Julia, H. Hansson. Managing redundancy in CAN-based networks supporting N-Version Programming. In Computer Standards and Interfaces, vol. 31, no. 1, pp. 120-127, 2009.

M. A. Barranco, J. Proenza, G. Rodríguez-Navas, L. Almeida. An Active Star Topology for Improving Fault Confinement in CAN Networks. In IEEE Transactions on Industrial Informatics, vol. 2, no. 2, pp. 78-85, 2006.

J. Ferreira, L. Almeida, J. Alberto, P. Pedreiras, E. Martins, G. Rodríguez-Navas, J. Rigo, J. Proenza. Combining Operational Flexibility and Dependability in FTT-CAN. In IEEE Transactions on Industrial Informatics, vol. 2, no. 2, pp. 95-102, 2006.

M. A. Barranco, J. Proenza, L. Almeida. Experimental assessment of ReCANcentrate, a replicated star topology for CAN. In SAE 2006 Transactions Journal of Passenger Cars: Electronic and Electrical Systems, SAE International, pp. 437-446, 2007.

M. A. Barranco, J. Proenza, G. Rodríguez-Navas, L. Almeida. Pushing error containment and fault tolerance. In CAN Newsletter. Special Edition on Automotive Networks, CAN in Automation GmbH, 2006.

M. A. Barranco, J. Proenza, G. Rodríguez-Navas, L. Almeida. A CAN hub with Improved Error Detection and Isolation. In International CAN Conference (iCC), 2005.

P. Ferriol, F. Navio, J. José, J. Pons, J. Proenza, J. Miro-Julia. A double CAN architecture for fault-tolerant control systems. In International CAN Conference (iCC), 1998.

