RCMBnet – Systems, Robotics & Vision Group

RCMBnet: A distributed Hardware and Firmware Support for Software Fault Tolerance

Project type: PhD project

Period: 1998 - 2007

Design, formal verification and implementation of a hardware support for the consistent execution of fault-tolerant software that follows the N-Version Programming approach.

Software is a major source of reliability degradation in dependable systems. One of the classical remedies is to provide tolerance to software faults by using N-Version Programming (NVP). However, NVP solutions are very costly and therefore seldom used: the N versions of the same application program must be developed independently, special hardware is required, and changes and additions are needed at all levels of the system.
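As a rough illustration of the NVP idea, the following Python sketch runs N independently written versions on the same input and accepts a result only if a strict majority of the versions agrees on it. The function name `nvp_vote` and the toy versions are hypothetical and are not taken from the project; real NVP systems may use other decision algorithms.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def nvp_vote(versions: Sequence[Callable[[int], Hashable]], x: int) -> Optional[Hashable]:
    """Run every version on the same input and majority-vote the results.

    Hypothetical helper for illustration only. Returns None when no value
    is backed by a strict majority of all N versions.
    """
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            # A version that crashes simply contributes no vote.
            continue
    if not results:
        return None
    value, votes = Counter(results).most_common(1)[0]
    # Require a strict majority of the N versions, not just of the survivors.
    return value if votes > len(versions) // 2 else None


if __name__ == "__main__":
    # Three toy "versions" of a squaring routine; the third one is faulty.
    versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x + 1]
    print(nvp_vote(versions, 3))  # prints 9
```

With three versions, one faulty result is outvoted by the two correct ones; if all three disagree, the sketch reports no decision instead of guessing.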

In previous work, a low-cost architecture for NVP execution was devised, in which each of the N versions is executed on a different standard computer. A special hardware unit, called NVXP, is attached to each computer to handle the communication among versions and all tasks required to support the execution of the NVP application. Key features of this architecture are that the broadcast network linking the NVXPs is duplicated to prevent a single point of failure, that off-the-shelf components are used, and that the fault tolerance functionality is concentrated in the NVXP and is therefore orthogonal to the other functionalities of the system.

In this dissertation, we present a complete redesign of the above-mentioned architecture that aims at ensuring its actual implementability and at resolving the potential inconsistent states that were not treated in detail in the original design.

We have made three main changes to the hardware of this architecture. First, we have developed a completely new and improved design for the NVXPs, which we call Redundancy and Communication Management Boards (RCMBs). The set of RCMBs connected to each other by means of the broadcast network is what we call the RCMBnet. Second, in order to further reduce the cost of the final system, we have chosen PCs as the platforms that execute the different versions, so that the RCMBs are PC boards inserted in the bus of the host PCs. Third, we use the Controller Area Network (CAN) protocol as the basic communication technology for the broadcast network, because of its well-known advantages in cost, reliability and real-time performance, and because of the growing interest in using CAN for critical applications. In fact, what we use is a modification of the CAN protocol, called MajorCAN, which we have devised to eliminate some scenarios of inconsistent communication that CAN exhibits.

Besides changing the hardware, we have designed new software to be executed in the RCMBs. Two aspects of this software concerning the consistent management of the redundancy have received special attention in this dissertation: replica determinism enforcement of all replicated operations and consistent reintegration of RCMBs after transient faults. Replica determinism enforcement ensures, for instance, that all non-faulty versions produce the same results even if some versions are faulty. Reintegration, in turn, prevents a quick attrition of the available redundancy by allowing an RCMB that has experienced a temporary failure to return to normal coordinated operation with the other RCMBs. The new mechanism proposed for consistent reintegration has been formally verified using model checking.
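The effect of reintegration on redundancy attrition can be illustrated with a deliberately simple counting model. This is only a sketch under stated assumptions, not the mechanism verified in the dissertation: transient faults hit one board at a time, and a reintegrating board is assumed to rejoin before the next fault occurs. The function name `active_boards` is hypothetical.

```python
def active_boards(n_boards: int, transient_faults: int, reintegration: bool) -> int:
    """Count boards still operating after a sequence of transient faults.

    Assumption (not from the source): faults are sequential, each one
    removes a single board, and reintegration, when enabled, completes
    before the next fault arrives.
    """
    active = n_boards
    for _ in range(transient_faults):
        if active == 0:
            break  # no redundancy left to lose
        active -= 1  # the transient fault removes one board from the group
        if reintegration:
            active += 1  # the recovered board rejoins coordinated operation
    return active


if __name__ == "__main__":
    print(active_boards(3, 2, reintegration=False))  # prints 1
    print(active_boards(3, 2, reintegration=True))   # prints 3
```

Without reintegration, two transient faults leave a three-board system with a single board and no majority; with reintegration, the full redundancy is restored after each fault in this simplified model.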

Given that our ultimate goal is as complex as designing a complete fault-tolerant distributed system, we have made a significant effort to keep our approach as systematic as possible. More specifically, we have adopted the guidelines of the paradigm proposed by Prof. Avizienis for the design of fault-tolerant systems. Following one of these guidelines, we have developed first prototypes of the essential parts of the architecture, which supports our claim of actual implementability.

Our design differs in relevant ways from other architectures conceived for similar purposes. Moreover, our approach offers significant advantages beyond its initial aims of low cost, actual implementability and consistent operation: it achieves consistent management of the redundancy without introducing a sizeable computation or communication overhead; it requires fewer computers and versions to tolerate a fixed number of faults than common distributed systems; and it allows the broadcast network to use a simplified topology such as a bus, thereby facilitating scalability. Finally, although initially developed for NVP, our RCMBnet can also support other software replication techniques, such as active replication.
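The claim about needing fewer computers can be made concrete with textbook bounds. These figures are an assumption, since the page itself does not state them: majority voting over consistently delivered results needs 2f + 1 replicas to mask f faulty ones, whereas agreement under arbitrary (Byzantine) faults without such consistent delivery needs 3f + 1. The function names below are hypothetical.

```python
def replicas_with_consistent_broadcast(f: int) -> int:
    """Replicas needed for majority voting to mask f faulty replicas.

    Assumes every correct replica sees the same set of results.
    """
    return 2 * f + 1


def replicas_for_byzantine_agreement(f: int) -> int:
    """Classical bound for agreement with f arbitrarily faulty replicas."""
    return 3 * f + 1


if __name__ == "__main__":
    for f in range(1, 4):
        print(f, replicas_with_consistent_broadcast(f), replicas_for_byzantine_agreement(f))
```

Under these assumed bounds, tolerating one fault takes three replicas instead of four, and the gap grows by one replica for each additional fault to be tolerated.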

Project Leader

Julián Proenza Arenas

Project Collaborators

José Miró

Related Publications

RCMBnet: A distributed Hardware and Firmware Support for Software Fault Tolerance

Authors: Julián Proenza Arenas; José Miró.
Category: Ph. D. Theses Date: December, 2007
Improving the Safety of AUVs

MTS/IEEE Oceans

Authors: Alberto Ortiz Rodriguez; Julián Proenza Arenas; Guillem Bernat; Gabriel Oliver Codina.
Category: Conferences Date: December, 1999
A double CAN architecture for fault-tolerant control systems

International CAN Conference (iCC)

Authors: Pere Ferriol; Francisco Navio; Juan José Navio; Joan Pons; Julián Proenza Arenas; José Miro-Julia.
Category: Chapters Date: December, 1998

A CAN hub with Improved Error Detection and Isolation

International CAN Conference (iCC)

Authors: Manuel Alejandro Barranco González; Julián Proenza Arenas; Guillermo Rodríguez-Navas; Luís Almeida.
Category: Chapters Date: December, 2005

Pushing error containment and fault tolerance

CAN Newsletter. Special Edition on Automotive Networks

Authors: Manuel Alejandro Barranco González; Julián Proenza Arenas; Guillermo Rodríguez-Navas; Luís Almeida.
Category: Chapters Date: December, 2006 In: CAN in Automation GmbH

Experimental assessment of ReCANcentrate, a replicated star topology for CAN

SAE 2006 Transactions Journal of Passenger Cars: Electronic and Electrical Systems

Authors: Manuel Alejandro Barranco González; Julián Proenza Arenas; Luís Almeida.
Category: Chapters Date: December, 2007 In: SAE international

Combining Operational Flexibility and Dependability in FTT-CAN

IEEE Transactions on Industrial Informatics

Authors: Joaquim Ferreira; Luís Almeida; José Alberto Fonseca; Paulo Pedreiras; Ernesto Martins; Guillermo Rodríguez-Navas; Joan Rigo; Julián Proenza Arenas.
Category: Journals Date: December, 2006
An Active Star Topology for Improving Fault Confinement in CAN Networks

IEEE Transactions on Industrial Informatics

Authors: Manuel Alejandro Barranco González; Julián Proenza Arenas; Guillermo Rodríguez-Navas; Luís Almeida.
Category: Journals Date: December, 2006
Managing redundancy in CAN-based networks supporting N-Version Programming

Computer Standards and Interfaces

Authors: Julián Proenza Arenas; José Miro-Julia; Hans Hansson.
Category: Journals Date: December, 2009

Modeling and Verification of Master/Slave Clock Synchronization Using Hybrid Automata and Model-Checking

International Conference on Formal Engineering Methods (ICFEM, LNCS 4789)

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas; Hans Hansson.
Category: Conferences Date: December, 2007

Modelling MajorCAN with UPPAAL

IEEE International Conference on Emerging Technologies and Factory Automation (ETFA)

Authors: Matias Bonet; Gabriel Donaire; Julián Proenza Arenas.
Category: Conferences Date: December, 2007
First results of the assessment of the improvement of error containment achieved by CANcentrate

IEEE International Workshop on Factory Communication Systems (WFCS)

Authors: Manuel Alejandro Barranco González; Julián Proenza Arenas; Luís Almeida.
Category: Conferences Date: December, 2006
An UPPAAL Model for Formal Verification of Master/slave Clock Synchronization over the Controller Area Network

IEEE International Workshop on Factory Communication Systems (WFCS)

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas; Hans Hansson.
Category: Conferences Date: December, 2006
Towards analyzing the fault-tolerant operation of Server-CAN

IEEE Conference on Emerging Technologies and Factory Automation (ETFA)

Authors: Thomas Nolte; Guillermo Rodríguez-Navas; Julián Proenza Arenas; Sasikumar Punnekkat; Hans Hansson.
Category: Conferences Date: December, 2005
Using UPPAAL to Model and Verify a Clock Synchronization Protocol for the Controller Area Network

IEEE Conference on Emerging Technologies and Factory Automation (ETFA)

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas; Hans Hansson.
Category: Conferences Date: December, 2005
Position Paper on Dependability and Reconfigurability in Distributed Embedded Systems

International Workshop on Real-Time Networks (RTN)

Authors: Julián Proenza Arenas; Luís Almeida.
Category: Conferences Date: December, 2007

ReCANcentrate: A replicated star topology for CAN networks

IEEE Conference on Emerging Technologies and Factory Automation (ETFA)

Authors: Manuel Alejandro Barranco González; Luís Almeida; Julián Proenza Arenas.
Category: Conferences Date: December, 2005
Design and Modeling of a Protocol to Enforce Consistency among Replicated Masters in FTT-CAN

IEEE International Workshop on Factory Communication Systems (WFCS)

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas; Joan Rigo; Joaquim Ferreira; Luís Almeida; José A. Fonseca.
Category: Conferences Date: December, 2004
CANcentrate: An Active Star Topology for CAN Networks

IEEE International Workshop on Factory Communication Systems (WFCS)

Authors: Manuel Alejandro Barranco González; Guillermo Rodríguez-Navas; Julián Proenza Arenas; Luís Almeida.
Category: Conferences Date: December, 2004
Clock Synchronization in CAN Distributed Embedded Systems

International Workshop on Real-Time Networks (RTN)

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas.
Category: Conferences Date: December, 2004

COTS-based Hardware Support to Timeliness in CAN Networks

IEEE International Conference on Emerging Technologies and Factory Automation (ETFA)

Authors: Guillermo Rodríguez-Navas; Manuel Alejandro Barranco González; Julián Proenza Arenas; Ian Broster.
Category: Conferences Date: December, 2003
Harmonizing Dependability and Real Time in CAN Networks

Second International Workshop on Real-Time LANs in the Internet Age (RTLIA)

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas.
Category: Conferences Date: December, 2003

An Architecture for Physical Injection of Complex Fault Scenarios in CAN Networks

IEEE International Conference on Emerging Technologies and Factory Automation (ETFA)

Authors: Guillermo Rodríguez-Navas; Jesús Jiménez; Julián Proenza Arenas.
Category: Conferences Date: December, 2003
Enforcing Consistency of Communication Requirements Updates in FTT-CAN

International Workshop on Dependable Embedded Systems (DES)

Authors: Joaquim Ferreira; Luís Almeida; José A. Fonseca; Guillermo Rodríguez-Navas; Julián Proenza Arenas.
Category: Conferences Date: December, 2003

Automatic Inspection of Underwater Environments by Means of a Fleet of Submersible Agents

Pattern Recognition Advances in Iberoamerica

Authors: Alberto Ortiz Rodriguez; Julián Proenza Arenas; Gabriel Oliver Codina.
Category: Conferences Date: December, 2000
Design and Implementation of a Redundancy Manager for Triple Redundant CAN Controllers

Annual Conference of the IEEE Industrial Electronics Society (IECON)

Authors: Carlos Guerrero; Guillermo Rodríguez-Navas; Julián Proenza Arenas.
Category: Conferences Date: December, 2002
Hardware Support for Fault Tolerance in Triple Redundant CAN Controllers

IEEE International Conference on Electronics, Circuits and Systems (ICECS)

Authors: Carlos Guerrero; Guillermo Rodríguez-Navas; Julián Proenza Arenas.
Category: Conferences Date: December, 2002
RAO: A Low-Cost AUV for Testing

MTS/IEEE Oceans

Authors: Daniel Avià; Miquel de Diego; Gabriel Oliver Codina; Alberto Ortiz Rodriguez; Julián Proenza Arenas.
Category: Conferences Date: December, 2000
MajorCAN: A modification to the Controller Area Network protocol to achieve Atomic Broadcast

First International Workshop on Group Communications and Computations (IWGCC)

Authors: Julián Proenza Arenas; José Miro-Julia.
Category: Conferences Date: December, 2000

Incorporating A Safety Hardware Module On A Low-Cost AUV

International Symposium on Robotics and Automation (ISRA)

Authors: Julián Proenza Arenas; Alberto Ortiz Rodriguez; Ferran Prats; Bartomeu Santandreu; Gabriel Oliver Codina.
Category: Conferences Date: December, 2000
A Cost-Effective Hardware Architecture for Fail-Safe Autonomous Underwater Vehicles

IEEE International Conference on Electronics, Circuits and Systems (ICECS)

Authors: Julián Proenza Arenas; Alberto Ortiz Rodriguez; Guillem Bernat; Gabriel Oliver Codina.
Category: Conferences Date: December, 1999
A low-cost fail-safe circuit for fault-tolerant control systems

IEEE International Conference on Electronics, Circuits and Systems (ICECS)

Authors: Julián Proenza Arenas; Joan Pons; José Miro-Julia.
Category: Conferences Date: December, 1999
Consistent management and slow attrition of redundancy in a distributed hardware support for NVP execution

First International Conference on Dependable Systems and Networks (ICDSN)

Authors: Julián Proenza Arenas; José Miro-Julia.
Category: Conferences Date: December, 2000
A technique to analyse the tolerance to transient overloads of a fault-tolerant real-time system

IEEE High Assurance Systems Engineering Workshop (HASE)

Authors: Guillem Bernat; José Miro-Julia; Julián Proenza Arenas.
Category: Conferences Date: December, 1998
Fixed priority schedulability analysis of a distributed real-time fault-tolerant architecture

International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA)

Authors: Guillem Bernat; José Miro-Julia; Julián Proenza Arenas.
Category: Conferences Date: December, 1998

Un circuito de bajo coste para la tolerancia a fallos en sistemas de control industrial

Seminario Anual de Automática, Electrónica Industrial e Instrumentación (SAAEI)

Authors: Pere Ferriol; Francisco Navio; Juan José Navio; Joan Pons; Julián Proenza Arenas; José Miro-Julia.
Category: Conferences Date: December, 1998
Redundancy Management in a Low-Cost Distributed Hardware and Firmware Support for Software-Fault Tolerance

Authors: Julián Proenza Arenas; José Miro-Julia; Hans Hansson.
Category: Reports Date: December, 2007
An UPPAAL Model for Formal Verification of Clock Synchronization over CAN

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas; Hans Hansson.
Category: Reports Date: December, 2005
On the Management of Media Replication in ReCANcentrate

Authors: Manuel Alejandro Barranco González; Julián Proenza Arenas; Luís Almeida.
Category: Reports Date: December, 2007
Enhancing the response of ReCANcentrate in presence of faults

Authors: Manuel Alejandro Barranco González; Julián Proenza Arenas; Luís Almeida.
Category: Reports Date: December, 2007
On the Design of a Clock Service for CAN Networks

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas.
Category: Reports Date: December, 2003
Una descripción de los fundamentos del protocolo TTCAN diseñado por Bosch

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas.
Category: Reports Date: December, 2003
System Partitioning of a Low-Cost Distributed Hardware and Firmware Support for Software-Fault Tolerance

Authors: Julián Proenza Arenas; José Miro-Julia.
Category: Reports Date: December, 2004
Design and implementation of CANcentrate: an active star topology for improving fault confinement in CAN networks

Authors: Manuel Alejandro Barranco González; Julián Proenza Arenas; Guillermo Rodríguez-Navas; Luís Almeida.
Category: Reports Date: December, 2005
Hardware Design of a High-Precision and Fault-Tolerant Clock Subsystem for CAN Networks

IFAC International Conference on Fieldbus Systems and their Applications (FET)

Authors: Guillermo Rodríguez-Navas; José Juan Bosch; Julián Proenza Arenas.
Category: Conferences Date: December, 2003

Analyzing Atomic Broadcast in TTCAN Networks

IFAC International Conference on Fieldbus Systems and their Applications (FET)

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas.
Category: Conferences Date: December, 2003

An Orthogonal and Fault-Tolerant Subsystem for High-Precision Clock Synchronization in CAN Networks

WSEAS International Conference on Signal Processing, Robotics and Automation (ISPRA)

Authors: Guillermo Rodríguez-Navas; Julián Proenza Arenas.
Category: Conferences Date: December, 2002
