Adding Flexibility to Node-Level Fault Tolerance in Distributed Embedded Systems Based on Real-Time Ethernet

Author: Alberto Ballesteros
Supervisors: Julián Proenza Arenas, Manuel Alejandro Barranco González

Universitat de les Illes Balears, 2023.

Distributed Embedded Systems (DESs) play a key role and are almost ubiquitous in many economic sectors, such as civil avionics, the automotive industry, healthcare, electric power distribution and telecommunications. This type of system is built by interconnecting various embedded devices through a communication channel and making them coordinate their operation to achieve a common goal.

DESs are mostly used to interact with the real world, where the specific time at which an action is carried out has a significant impact on its outcome. That is why they usually have real-time requirements. Moreover, this interaction with the real world must be trustworthy; otherwise, the DES could cause physical damage to its surroundings, including people. Consequently, DESs must also be dependable to some extent. In addition, there is a growing interest nowadays in DESs that can operate in dynamic and unforeseeable operational contexts. The operational context can be defined as the set of relevant aspects involved in the operation of the system. This includes: (1) the operational requirements, that is, what the system has to do; (2) the status of the system, that is, the use of the hardware resources and whether they are faulty or not; and (3) the status of the environment, that is, the status of any aspect of the system's surroundings that can affect its operation.

Traditional DESs have been designed assuming that the operational context in which they operate is known at design time and does not change at runtime. Consequently, when operating in dynamic and unforeseeable operational contexts, these systems can be inefficient and/or ineffective: they may require more resources than strictly needed, and/or they may fail or deliver a service of degraded quality or performance.

For a DES to operate efficiently and effectively in dynamic and unforeseeable operational contexts, it must be adaptive. An adaptive system is able to change its operation autonomously and at runtime, in response to changes in the operational context, in order to meet the operational requirements. Adaptivity is particularly appealing from a dependability perspective because it makes it possible to build Dynamic Fault-Tolerance (DFT): the ability of a system to use its adaptivity to build fault-tolerance mechanisms that meet the fault-tolerance requirements efficiently and effectively, despite foreseeable or unforeseeable changes in the operational context.

Building an adaptive real-time fault-tolerant DES poses a series of challenges. Among them, first, its underlying subsystems must be designed to be flexible. Second, the system must include mechanisms to reconfigure itself, that is, to monitor the operational context, determine whether the system is meeting the operational requirements and, if not, rearrange the subsystems into a new system configuration that does. Third, it is necessary to build a set of static and dynamic fault-tolerance mechanisms, the latter relying on the adaptivity of the system, that operate together to achieve the required level of fault tolerance.
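The reconfiguration loop described above (monitor the operational context, check whether the operational requirements are met and, if not, rearrange the subsystems) can be illustrated with a minimal Python sketch. All names used here (`OperationalContext`, `meets_requirements`, `reconfigure`, `adaptation_step`) and the greedy replica-placement policy are hypothetical assumptions made for illustration; they are not the DFT4FTT design, which this abstract does not detail. The fault-tolerance requirement is assumed, for simplicity, to be a minimum number of replicas per task hosted on non-faulty nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical types for illustration only (not part of DFT4FTT).
Configuration = Dict[str, Tuple[str, ...]]  # task -> nodes hosting its replicas


@dataclass
class OperationalContext:
    # (1) operational requirements: tasks and how many replicas each needs
    replicas_required: Dict[str, int]
    # (2) status of the system: nodes currently believed to be non-faulty
    healthy_nodes: FrozenSet[str]
    # (3) status of the environment (unused by this toy policy)
    environment: Dict[str, float] = field(default_factory=dict)


def meets_requirements(config: Configuration, ctx: OperationalContext) -> bool:
    """Return True if every task has enough distinct replicas on healthy nodes."""
    for task, needed in ctx.replicas_required.items():
        healthy = {n for n in config.get(task, ()) if n in ctx.healthy_nodes}
        if len(healthy) < needed:
            return False
    return True


def reconfigure(ctx: OperationalContext,
                candidates: Dict[str, List[str]]) -> Optional[Configuration]:
    """Greedy re-placement (assumed policy): take the first healthy candidates.

    Returns None if the candidates cannot satisfy the requirements.
    """
    new_config: Configuration = {}
    for task, needed in ctx.replicas_required.items():
        # dict.fromkeys drops duplicate candidates while keeping their order
        healthy = [n for n in dict.fromkeys(candidates.get(task, []))
                   if n in ctx.healthy_nodes]
        if len(healthy) < needed:
            return None
        new_config[task] = tuple(healthy[:needed])
    return new_config


def adaptation_step(config: Configuration, ctx: OperationalContext,
                    candidates: Dict[str, List[str]]) -> Optional[Configuration]:
    """One monitor -> evaluate -> rearrange iteration."""
    if meets_requirements(config, ctx):
        return config  # requirements met: keep the current configuration
    return reconfigure(ctx, candidates)  # None means no feasible configuration


if __name__ == "__main__":
    ctx = OperationalContext({"control": 2}, frozenset({"n1", "n2", "n3"}))
    cands = {"control": ["n1", "n2", "n3"]}
    cfg: Optional[Configuration] = {"control": ("n1", "n2")}
    ctx.healthy_nodes = frozenset({"n2", "n3"})  # node n1 becomes faulty
    cfg = adaptation_step(cfg, ctx, cands)
    print(cfg)
```

When node `n1` fails, the replica left on `n2` is no longer enough, so the step moves the task onto `n2` and `n3`. A real adaptive DES would also have to bound the time this rearrangement takes and coordinate it across nodes, which this single-process sketch deliberately ignores.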

In this dissertation we present Dynamic Fault-Tolerance for Flexible Time-Triggered (DFT4FTT), a solution for building adaptive real-time fault-tolerant DESs that use DFT to provide fault tolerance efficiently and effectively in dynamic operational contexts, both foreseen and unforeseen.
