This video shows a working prototype of FT4FTT, an infrastructure for dependable systems based on Ethernet. The communication subsystem of FT4FTT is built on top of Hard Real-Time Ethernet Switching (HaRTES), a switched-Ethernet implementation of the Flexible Time-Triggered (FTT) paradigm. FTT, in turn, makes it possible for distributed nodes to exchange real-time periodic and aperiodic data in a flexible manner. By flexible we mean that FTT allows changes in the communication requirements to be managed online. Finally, to achieve high reliability, FT4FTT provides distributed embedded systems with redundancy at several different levels.
The video demonstrates two of the main features of FT4FTT: tolerance of faults that cause network nodes to generate incorrect data, and tolerance of faults that affect the network itself.
The prototype is composed of three node replicas (at the bottom) connected to a custom Ethernet switch (on the left). All these network components are monitored and controlled through an instrumentation network (yellow cables). This network connects all the nodes and the switch to an instrumentation station, that is, a PC that can start and stop their operation and gather execution data. Additionally, the station can be used to view the internal state of each node on its screen.
The application running in the nodes is a simplification of a control application and can be divided into four phases. First, an integer value is generated. In this case the value is just a counter that increments in each iteration of the control loop. Second, the values are exchanged among the three replicas. That is, each node transmits its own value and receives the values from the other two. Third, the nodes execute a voting algorithm to reach an agreement on the value to use. In this case, since the values are integers, it is enough to select the value that has a majority. Finally, each node shows the value it transmitted (the value on top) and the value obtained from the voting (the value at the bottom).
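The four phases above can be sketched in a few lines of Python. This is only an illustrative model, not the code running in the prototype: the names majority_vote and run_iteration are hypothetical, and the exchange over the network is replaced by a plain list copy.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def majority_vote(values: Sequence[int], n_replicas: int = 3) -> Optional[int]:
    """Return the value held by a strict majority of the replicas, or None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > n_replicas // 2 else None


def run_iteration(generated: Sequence[int]) -> List[Tuple[int, Optional[int]]]:
    """Model one control-loop iteration for all replicas.

    `generated` holds the value produced by each replica in phase 1.
    """
    # Phase 2 (assumption: a fault-free exchange, so every node ends up
    # with the same list of three values).
    exchanged = list(generated)
    # Phase 3: every node runs the same vote on the same inputs.
    voted = majority_vote(exchanged, n_replicas=len(generated))
    # Phase 4: each node displays (value it transmitted, voted value).
    return [(sent, voted) for sent in generated]


if __name__ == "__main__":
    for top, bottom in run_iteration([7, 7, 7]):
        print(f"transmitted={top} voted={bottom}")
```

Because the values are integers, exact equality is enough to form the majority; an application exchanging real-valued measurements would typically need an inexact voter instead.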
As explained before, we test two different fault-tolerance mechanisms. First, we inject transient value errors at the application level. More specifically, for counter values from 11 to 20, we decrement the transmitted value by 10 units, each round in a different node. Despite these errors, the system is able to keep providing its service, as the other two replicas continue their operation. Moreover, after each error is injected, the affected node is able to detect the inconsistency and correct its own value according to the value received from the other two replicas.
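The first test can be reproduced in a small simulation. The sketch below rests on several assumptions that the description does not state: the faulty node is chosen in round-robin order, the error corrupts the node's own counter before transmission, and every node adopts the voted value at the end of the round. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

N_REPLICAS = 3


def _vote(values: List[int]) -> Optional[int]:
    """Strict-majority vote over the values of the three replicas."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > N_REPLICAS // 2 else None


def simulate_value_faults(rounds: int = 25) -> List[Tuple[List[int], Optional[int]]]:
    """Return, for each round, (values transmitted by the nodes, voted value)."""
    counters = [0] * N_REPLICAS
    history = []
    for expected in range(1, rounds + 1):
        # Phase 1: every replica increments its counter.
        counters = [c + 1 for c in counters]
        # Fault injection: for values 11..20, one node (assumed round-robin)
        # has its value decremented by 10.
        if 11 <= expected <= 20:
            faulty = (expected - 11) % N_REPLICAS
            counters[faulty] -= 10
        sent = list(counters)
        voted = _vote(sent)
        # Recovery: a node whose value differs from the voted one adopts it.
        if voted is not None:
            counters = [voted] * N_REPLICAS
        history.append((sent, voted))
    return history


if __name__ == "__main__":
    for sent, voted in simulate_value_faults():
        print(sent, "->", voted)
```

In this model the voted value equals the fault-free counter in every round, since at most one of the three transmitted values is wrong at any time.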
In the second test we simulate the omission of some of the messages. Specifically, we unplug, one by one, the cables connecting each node to the switch. Each disconnected node stops its operation. However, thanks to the other two replicas, the system does not stop providing its service. Moreover, as in the first test, when the cable of a node is plugged back in, the node can reintegrate into the system.
Note that we inject only one error at a time because we have three replicas and, thus, can tolerate the error of just one node. Injecting more than one error at the same time would make the system fail. To show this, we finish the test by unplugging two of the nodes. When we do this, the nodes cannot reach an agreement and, thus, the whole system fails.
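The limit of one tolerated fault follows from the majority rule: with three replicas, a value needs at least two matching copies to win the vote. The sketch below illustrates this for omissions, modeling a missing message as None. The function name and the None convention are assumptions made for illustration, not part of FT4FTT.

```python
from collections import Counter
from typing import Optional, Sequence


def vote_with_omissions(
    received: Sequence[Optional[int]], n_replicas: int = 3
) -> Optional[int]:
    """Vote over the values that actually arrived.

    `received` has one entry per replica; None marks an omitted message
    (for example, from a node whose cable is unplugged).
    Returns the agreed value, or None when no agreement is possible.
    """
    quorum = n_replicas // 2 + 1  # 2 out of 3
    present = [v for v in received if v is not None]
    if not present:
        return None
    value, count = Counter(present).most_common(1)[0]
    return value if count >= quorum else None


if __name__ == "__main__":
    print(vote_with_omissions([42, 42, 42]))      # all nodes connected
    print(vote_with_omissions([42, None, 42]))    # one node unplugged
    print(vote_with_omissions([42, None, None]))  # two nodes unplugged
```

With one node unplugged the two remaining values still form a quorum, whereas with two nodes unplugged the single remaining value cannot, which matches the failure shown at the end of the video.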