This paper presents a novel fault-tolerant networked control system architecture consisting of two cells working in-line. This architecture is fault-tolerant at the level of the controllers as well as the sensors. Each cell is based on the sensor-to-actuator approach and has an additional supervisor node. It is shown, via analysis as well as OMNeT++ simulations, that the production line meets all control system requirements with no dropped or over-delayed packets. A reliability analysis is then undertaken to quantitatively estimate the increase in reliability due to the introduction of fault-tolerance.
Traditional point-to-point control systems are currently being replaced by networks in modern manufacturing systems. These networks reduce cost by decreasing the amount of electrical wiring and also decrease maintenance costs. These Networked Control Systems (NCSs) transmit packets that have real-time constraints. These packets are small and frequent. They originate from a Sensor node (S) that samples a physical phenomenon, such as temperature, either regularly (in clock-driven systems) or when there is a system change (in event-driven systems). A packet is then sent to a controller (K) that decapsulates it, calculates the control decision, encapsulates this decision and sends it over the network to an Actuator node (A).
Currently, the most commonly used network protocols in NCSs are CAN, PROFIBUS, Ethernet/IP, ControlNET and DeviceNET [
However, it is important to remember here that the non-real-time traffic may cause the control traffic to violate system delay constraints. In [
Furthermore, an NCS architecture that typically has higher performance than conventional architectures is the Sensor-to-Actuator (S2A) architecture. In this architecture, each actuator has its own controller integrated in the same node. As such, control packets are transmitted directly from the sensor nodes to the integrated actuator nodes over one hop instead of over two hops as in traditional in-loop controller architectures.
Both the conventional in-loop and the S2A architectures need to be very reliable in order to reduce down time. Fault-tolerance is therefore a very important design aspect to be considered. In [
In this paper, the S2A architecture will be investigated in an in-line production scheme. Two cells will be concatenated. This is typically found in production lines where a work piece can be conditioned on several machines. It will be shown how to make the entire production fault-tolerant at the controller level first. Fault-tolerance at the sensor level will then be added and it will be shown that the system will meet all required timing constraints. Furthermore, a reliability study will be conducted to quantitatively evaluate the increase in system lifetime due to the introduction of fault-tolerance. Network fabric level fault-tolerance, which has been previously investigated in the literature [
In [
In [
In [
In [
An extra dimension of fault tolerance is added to an expanded NCS by in-lining two separate cells (machines) together each composed of 16 sensors, 1 supervisor and 4 integrated actuators similar to the S2A architecture in [
The rest of this paper is organized as follows. Section 2 describes some related works. In Section 3, the two-cell architecture is developed and its reliability modeling methodology detailed. In Section 4, the fault-tolerance of the proposed model is expanded through the application of Triple Modular Redundancy (TMR) at the sensor level. A case study is presented in Section 5 to quantify the improvements in reliability offered by the proposed models. Section 6 concludes this research.
In-line architectures have numerous applications in many industrial processes; an example is having two separate machines operating in tandem (in-line production) with the second machine working on functions released by the first one.
In addition, by interconnecting two machines, each with its own NCS, the system as a whole is expected to become more reliable through added fault-tolerance measures. In such a fault-tolerant architecture, the second cell can take over operation of both cells in case of the occurrence of a failure in the first one.
For each individual cell, an S2A (Sensor-to-Actuator) architecture, utilizing Switched Ethernet, is employed as in [
For fault-tolerance at the level of the control function, each actuator has an integrated controller (K) in addition to an extra network interface (E) as shown in
Thus, the control traffic flows can change based on the failure state; as such, all potential failure states must be investigated individually in order to guarantee that all system deadlines are met with no control packet losses.
When all system components are operational, various packets are exchanged over the network. These control packets must be successfully transmitted, with no packet loss, within the required system deadlines in order to guarantee correct system behavior.
Sensors: All the sensors in cell one, shown in
Actuators: The controllers within the actuators in cell one receive packets from the sensors in cell one. The controllers within the actuators in cell two receive packets from the sensors in cell two. Actuators from both cells send additional monitoring packets to both cells’ supervisors.
Supervisors: The two supervisors receive data from all the sensors and all the actuators in both cells to allow for fault-tolerance. Finally, each supervisor sends a watchdog signal to the other supervisor every 347 microseconds (half the sampling period) to indicate that it is functioning properly. This message is only 10 Bytes, as opposed to control messages, which are 100 Bytes. If a supervisor does not receive a watchdog signal from the other supervisor, it assumes the other one has failed and takes over its responsibilities. Thus, for correct operation of the proposed system, it is assumed that the supervisors are fail-silent.
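The watchdog exchange between the two supervisors can be illustrated with a simple timeout monitor. This is a minimal sketch and not the authors' implementation: the class name `PeerWatchdog`, the `missed_limit` parameter and the rule of declaring a failure after that many silent periods are assumptions made for illustration. Only the 347 microsecond period comes from the text.

```python
WATCHDOG_PERIOD_US = 347  # half the sampling period, as stated in the text


class PeerWatchdog:
    """Tracks watchdog messages received from the other supervisor (hypothetical helper)."""

    def __init__(self, period_us=WATCHDOG_PERIOD_US, missed_limit=1):
        # missed_limit is an assumption: number of silent periods tolerated
        self.period_us = period_us
        self.missed_limit = missed_limit
        self.last_seen_us = 0

    def on_watchdog(self, now_us):
        """Record the arrival time of a watchdog message from the peer."""
        self.last_seen_us = now_us

    def peer_failed(self, now_us):
        """True once the peer has been silent for longer than the allowed window."""
        return now_us - self.last_seen_us > self.period_us * self.missed_limit


wd = PeerWatchdog()
wd.on_watchdog(now_us=347)
alive_check = wd.peer_failed(now_us=600)    # still within one period of the last message
failed_check = wd.peer_failed(now_us=1000)  # more than one period of silence
```

Because the supervisors are assumed to be fail-silent, a missing watchdog message is the only failure indication this sketch needs to handle.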
In the proposed fault-tolerant two-cell architecture, both cells can still continue operating normally even if certain components fail within the system. Next is a description of the different fault-tolerant scenarios. It is important to note that, since TMR is employed at the level of the sensor nodes, the failure of any one sensor can be tolerated by the proposed control system.
If Sup1 from
Similar to the previous scenario but Sup2 fails instead of Sup1.
If any of the controllers (or all of them as a worst case scenario) within the actuators in cell one fail (K1,1; K1,2; K1,3; K1,4 from
Similar to the previous scenario but the controllers within the actuators in cell two (K2,1; K2,2; K2,3; K2,4) fail instead, Sup2 will take over the function of the failed controllers.
If the controllers within the actuators fail in cell one and cell two (K1,1; K1,2; K1,3; K1,4; K2,1; K2,2; K2,3; K2,4 from
If (K1,1; K1,2; K1,3; K1,4; Sup1 from
Similar to the previous scenario but (K2,1; K2,2; K2,3; K2,4; Sup2) fail instead and Sup1 takes over control of the failed cell.
If (K2,1; K2,2; K2,3; K2,4; Sup1 from
Similar to the previous scenario but (K1,1; K1,2; K1,3; K1,4; Sup2) fail instead and Sup1 takes over control of the actuators in cell one.
If (K1,1; K1,2; K1,3; K1,4; K2,1; K2,2; K2,3; K2,4; Sup1 from
Similar to the previous scenario but (K1,1; K1,2; K1,3; K1,4; K2,1; K2,2; K2,3; K2,4; Sup2) fail and Sup1 must instead undertake the function of the failed controllers in both cells.
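The takeover behaviour in the scenarios above can be summarized as a priority rule per actuator. The sketch below is an interpretation inferred from those scenarios, not a rule stated by the authors: the function name `controlling_node` and the component naming scheme are hypothetical, and the assumed priority is the integrated controller first, then the supervisor of the same cell, then the supervisor of the other cell.

```python
def controlling_node(cell, actuator, failed):
    """Return the node that computes the control action for one actuator.

    cell is 1 or 2, actuator is 1..4, and failed is a set of failed component
    names such as {"K1,2", "Sup1"}. Returns None if no node can take over.
    """
    own_controller = f"K{cell},{actuator}"
    own_supervisor = f"Sup{cell}"
    other_supervisor = f"Sup{3 - cell}"
    # Assumed priority order: local controller, local supervisor, remote supervisor
    for candidate in (own_controller, own_supervisor, other_supervisor):
        if candidate not in failed:
            return candidate
    return None


# Example: controller K1,2 and Sup1 have failed, so Sup2 serves actuator 2 of cell one
takeover = controlling_node(1, 2, {"K1,2", "Sup1"})
```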
Through analysis of the model, the end-to-end delay for the control packets will be calculated for both Fast Ethernet and Gigabit Ethernet. A worst-case delay analysis will be carried out on this model. As previously mentioned, one of the restrictions on the proposed model is that the control action must be taken within 694 μs. Therefore, it is crucial that the worst-case delay not exceed this limit in order to ensure correct control operation. The following analysis will focus on the last packet transmitted from the final sensor node; this represents the worst-case scenario because all the previously sent packets are queued ahead of this particular packet, so it requires the largest amount of time to be transmitted over the network. In addition, processing delay is not taken into account because previous work has shown that it is small enough, compared to the other delays, to be considered negligible [
The total end-to-end delay for the worst-case packet flow is given by:
The link transmission delay (Dtransmission) is the amount of time required for all of the packet’s bits to be transmitted onto the link. It depends on the packet length L (bits) and link transmission rate R (bps) [
The length of the packet is fixed to 100 Bytes at the application layer; however, additional packet and frame header overhead (approximately 58 Bytes) must be taken into consideration. All the links are Gigabit Ethernet in one scenario and Fast Ethernet in the second scenario, therefore:
The propagation delay (Dpropagation) is the time taken for the packet to travel from the sender to the receiver; it depends on the link length d (m) and the propagation speed s (m/s) [
The length of the link between each node and the switch is d = 1.5 m and the propagation speed in the Ethernet links is s = 2 × 10⁸ m/s.
Hence, the total end-to-end delay is given by:
Using the delays obtained above as constants and substituting with the appropriate values in (8), the sensor to actuator delays for all possible system states (including fault-free and fault-tolerant scenarios) is calculated next.
For the fault-free scenario, the number of packets in the worst case queue is 20 packets (16 packets from the sensors to the switch and 4 packets from the switch to the actuators) following the same analysis methodology as in [
Following the same delay calculation methodology,
Scenario | Cell 1 packets | Cell 1 Fast Ethernet delay (µs) | Cell 1 Gigabit Ethernet delay (µs) | Cell 2 packets | Cell 2 Fast Ethernet delay (µs) | Cell 2 Gigabit Ethernet delay (µs) |
---|---|---|---|---|---|---|
FF | 20 | 252.95 | 25.43 | 20 | 252.95 | 25.43 |
FT-1 | 20 | 252.95 | 25.43 | 20 | 252.95 | 25.43 |
FT-2 | 20 | 252.95 | 25.43 | 20 | 252.95 | 25.43 |
FT-3 | 22 | 278.245 | 27.973 | 20 | 252.95 | 25.43 |
FT-4 | 20 | 252.95 | 25.43 | 22 | 278.245 | 27.973 |
FT-5 | 22 | 278.245 | 27.973 | 22 | 278.245 | 27.973 |
FT-6 | 24 | 303.54 | 30.516 | 20 | 252.95 | 25.43 |
FT-7 | 20 | 252.95 | 25.43 | 24 | 303.54 | 30.516 |
FT-8 | 20 | 252.95 | 25.43 | 22 | 278.245 | 27.973 |
FT-9 | 22 | 278.245 | 27.973 | 20 | 252.95 | 25.43 |
FT-10 | 44 | 556.49 | 55.946 | 22 | 278.245 | 27.973 |
FT-11 | 22 | 278.245 | 27.973 | 44 | 556.49 | 55.946 |
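The tabulated delays are consistent with a simple model in which every packet in the worst-case queue costs one transmission delay plus one propagation delay. The sketch below works under that assumption; the function names are illustrative, and the packet sizes, link length, propagation speed and deadline are the values given in the text.

```python
PAYLOAD_BYTES = 100      # application-layer packet length (from the text)
OVERHEAD_BYTES = 58      # approximate header overhead (from the text)
LINK_LENGTH_M = 1.5      # node-to-switch distance (from the text)
PROPAGATION_SPEED = 2e8  # m/s (from the text)
DEADLINE_US = 694.0      # control deadline (from the text)

LINK_RATES_BPS = {"Fast Ethernet": 100e6, "Gigabit Ethernet": 1e9}


def per_packet_delay_us(rate_bps):
    """Transmission plus propagation delay of one packet over one link, in microseconds."""
    bits = (PAYLOAD_BYTES + OVERHEAD_BYTES) * 8
    transmission_s = bits / rate_bps
    propagation_s = LINK_LENGTH_M / PROPAGATION_SPEED
    return (transmission_s + propagation_s) * 1e6


def worst_case_delay_us(queued_packets, rate_bps):
    """Assumed model: each packet in the worst-case queue costs one per-packet delay."""
    return queued_packets * per_packet_delay_us(rate_bps)


# Queue lengths that appear in the table for the two-cell system without TMR
for packets in (20, 22, 24, 44):
    for name, rate in LINK_RATES_BPS.items():
        delay = worst_case_delay_us(packets, rate)
        print(f"{packets:3d} packets, {name}: {delay:.3f} us, "
              f"margin to deadline {DEADLINE_US - delay:.3f} us")
```

Under this model, even the 44-packet scenarios on Fast Ethernet stay below the 694 µs deadline.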
The proposed two-cell fault-tolerant architecture was simulated on OMNeT++ [
In
The observed end-to-end delays were deterministic due to the periodic nature of the control traffic combined with the use of Switched Ethernet. In all simulated scenarios, the observed percentage error between the simulated and analytical delay results did not exceed 5% as shown in
The reliability of the proposed two-cell architecture will be calculated next. Two values will be calculated: the Control Function Reliability (CFR) and the Node Reliability (NR).
CFR(t) focuses only on the components where the control function is executed, i.e., the supervisors (Sup1 and Sup2) and the four controllers connected to the four actuators in both cells (Ki, j, i = 1,2 and j = 1 to 4). It will be assumed that, if the system loses its observability (both supervisors fail), this will be considered as a system failure even if both cells are still operational. There is one controller in each actuator and there are four actuators in each cell; in addition, there is a supervisor in each cell, which means the control function for the two-cell system is based on 10 components.
Scenario | Cell 1 Fast Ethernet % error | Cell 1 Gigabit Ethernet % error | Cell 2 Fast Ethernet % error | Cell 2 Gigabit Ethernet % error |
---|---|---|---|---|
FF | 4.77 | 4.31 | 4.77 | 4.31 |
FT-1 | 4.77 | 4.31 | 4.77 | 4.31 |
FT-2 | 4.77 | 4.31 | 4.77 | 4.31 |
FT-3 | 4.77 | 4.35 | 4.77 | 4.31 |
FT-4 | 4.77 | 4.31 | 4.77 | 4.35 |
FT-5 | 4.77 | 4.35 | 4.77 | 4.35 |
FT-6 | 4.78 | 4.39 | 4.77 | 4.31 |
FT-7 | 4.77 | 4.31 | 4.78 | 4.39 |
FT-8 | 4.77 | 4.31 | 4.77 | 4.35 |
FT-9 | 4.77 | 4.35 | 4.77 | 4.31 |
FT-10 | 2.55 | 0.812 | 4.77 | 4.35 |
FT-11 | 4.77 | 4.35 | 2.55 | 0.812 |
In order to calculate CFR(t), all the different situations are analyzed and the failure states are identified. There are 2¹⁰ = 1024 situations to consider since there are 10 components involved in the Control Function. A small-scale model is first analyzed, using Algorithm I of
It was observed that all failure states have one feature in common: in each failure state, both supervisors are in a failure state, regardless of the state of the controllers. The rationale behind this finding is as follows: Assuming the Ethernet port of the actuators is always operational, the controllers Ki, j will not cause a system failure because, even if they fail, the supervisors can take over their function. This is clear from the 11 scenarios described in Section 3.2. However, when both supervisors fail at the same time, the system loses its observability which is considered to be a system failure as mentioned above (even if all the controllers are working). Hence, CFR(t) for the two-cell system is:
The reliability CFRsim(t) is that of the simplex system (two cells without any fault-tolerance). Note that:
In the above equation, it is assumed that both supervisors have the same failure rate. The same is assumed for all controllers connected to the actuators. The reliability of any of the 8 controllers in both cells is Rk.
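The observation that only a double supervisor failure breaks the control function can be checked by brute-force enumeration of the 2¹⁰ situations. The sketch below is an illustration under stated assumptions, not the authors' Algorithm I: it assumes independent component failures and perfect coverage, and it models the simplex system as all two supervisors and eight controllers in series, which is an assumption about a model the text does not spell out. All function names are hypothetical.

```python
from itertools import product

COMPONENTS = ["Sup1", "Sup2"] + [f"K{i},{j}" for i in (1, 2) for j in range(1, 5)]


def control_function_ok(state):
    """Operational as long as at least one supervisor works (rule from the text)."""
    return state["Sup1"] or state["Sup2"]


def cfr_by_enumeration(r_sup, r_k):
    """Sum the probability of every operational state among the 2**10 situations."""
    total = 0.0
    for bits in product((True, False), repeat=len(COMPONENTS)):
        state = dict(zip(COMPONENTS, bits))
        if not control_function_ok(state):
            continue
        prob = 1.0
        for name, up in state.items():
            r = r_sup if name.startswith("Sup") else r_k
            prob *= r if up else 1.0 - r
        total += prob
    return total


def cfr_closed_form(r_sup):
    """1-of-2 parallel pair of supervisors, assuming perfect coverage."""
    return 1.0 - (1.0 - r_sup) ** 2


def cfr_simplex(r_sup, r_k):
    """Assumed simplex model: two supervisors and eight controllers in series."""
    return r_sup ** 2 * r_k ** 8


# Number of situations in which both supervisors are down
failure_states = sum(
    1
    for bits in product((True, False), repeat=len(COMPONENTS))
    if not control_function_ok(dict(zip(COMPONENTS, bits)))
)
```

The enumeration and the closed form agree because the controller states sum out to one once the supervisor condition is fixed.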
As mentioned above, another perspective would include the Ethernet ports of both sensors and actuators in the reliability calculations (in addition to both supervisors and the 4 controllers connected to the actuators in each cell), i.e., all nodes connected to the network fabric. As a result, across both cells, the components consist of 16 sensor Ethernet ports in each cell, 1 supervisor in each cell and 4 controllers as well as 4 actuator Ethernet ports for each cell; in other words, a total of 25 components for each cell and a total of 50 components for the two-cell system in total.
Since there are 50 components to analyze, 2⁵⁰ situations have to be studied. As for CFR(t), a small-scale model is first analyzed following Algorithm I with one controller, one sensor Ethernet port, one supervisor and one actuator Ethernet port for each cell, resulting in a total of 8 components, or 2⁸ = 256 situations.
Of the 256 situations, precisely 27 are operational and the remaining 229 situations are failure states. By observing the 27 states in which the system is operational, the following observations can be made.
Sensors: If any of the sensor Ethernet ports fail for any reason in either cell, the system fails.
Controllers: If the controllers connected to the actuators fail, the system will continue to work on the condition that one of the supervisors is working and that the corresponding Ethernet ports in the actuators are working as well.
Supervisors: At least one out of the two supervisors must be working at all times. If both supervisors fail at the same time, the observability of the system is lost which is assumed to cause a system failure.
Actuator Ethernet Ports: The system can continue working normally even if the Ethernet ports in the actuators are down, but only if the controllers attached to the corresponding actuators are working. For each actuator, its controller and Ethernet port form a 1-of-2 system. If a controller fails inside the actuator and the Ethernet port inside the same actuator fails at the same time, the system fails. Hence, NR(t) can be calculated as follows:
where Rs is the reliability of the sensor’s Ethernet port, Ra is the reliability of the actuator’s Ethernet port and Rk is the reliability of the controller attached to the actuator. Note that NRsim(t) is the reliability of the simplex system from a node point of view.
It is again assumed that both supervisors have the same failure rate. All sensors also have the same failure rate as do all controllers.
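The four observations above can be encoded as a predicate and checked against the small-scale model by enumeration. The sketch below is an illustration rather than the authors' Algorithm I: the component labels are hypothetical, and the closed form in `nr_full_system` is an assumed product of the sensor, supervisor and actuator blocks under independent failures and perfect coverage.

```python
from itertools import product

# Per cell: one sensor port (S), one controller (K), one actuator Ethernet port (E), one supervisor
SMALL_MODEL = ["S1", "K1", "E1", "Sup1", "S2", "K2", "E2", "Sup2"]


def node_level_ok(s):
    """Apply the sensor, supervisor and actuator rules listed in the text."""
    sensors_ok = s["S1"] and s["S2"]                              # every sensor port must work
    supervisors_ok = s["Sup1"] or s["Sup2"]                       # at least one supervisor
    actuators_ok = (s["K1"] or s["E1"]) and (s["K2"] or s["E2"])  # 1-of-2 per actuator
    return sensors_ok and supervisors_ok and actuators_ok


states = [dict(zip(SMALL_MODEL, bits)) for bits in product((True, False), repeat=8)]
operational = sum(node_level_ok(s) for s in states)


def nr_full_system(r_s, r_sup, r_k, r_a, sensors=32, actuators=8):
    """Assumed closed form for the full two-cell system without TMR."""
    sensor_block = r_s ** sensors
    supervisor_block = 1.0 - (1.0 - r_sup) ** 2
    actuator_block = (1.0 - (1.0 - r_k) * (1.0 - r_a)) ** actuators
    return sensor_block * supervisor_block * actuator_block
```

The enumeration reproduces the 27 operational states of the small-scale model: one admissible sensor state, three supervisor states and three states for each of the two actuators.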
Although connecting two cells together is expected to increase system reliability (whether CFR(t) or NR(t)), there is still the problem of single points of failure, i.e., the sensors. If any one sensor out of the thirty two sensors fails, the whole system will fail. In [
The same approach is going to be implemented next; TMR is applied to all 32 sensors (for the two cells). A sensor will only fail if at least two of its three modules fail. The application of TMR results in 48 sensors in cell one and 48 sensors in cell two, giving a total of 96 sensors in the system. And while the increase from 32 sensors to 96 sensors means an increase in costs, the system is expected to become much more reliable. Such an expensive but extremely reliable architecture is appropriate for applications in sensitive environments such as the nuclear industry or the space industry where reliability is the most important factor in system design.
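The effect of TMR on a single sensor can be illustrated with the standard 2-of-3 majority formula. The sketch below assumes identical, independent modules and a perfect voter; the function names and the default of 32 logical sensors (taken from the text) are for illustration only.

```python
def tmr_reliability(r_module):
    """Reliability of a 2-of-3 majority group of identical, independent modules."""
    all_three = r_module ** 3
    exactly_two = 3 * r_module ** 2 * (1.0 - r_module)
    return all_three + exactly_two


def sensor_block_reliability(r_module, logical_sensors=32, use_tmr=True):
    """All logical sensors must work; each is either one module or one TMR group."""
    per_sensor = tmr_reliability(r_module) if use_tmr else r_module
    return per_sensor ** logical_sensors


# Example with a module reliability of 0.99
with_tmr = sensor_block_reliability(0.99)
without_tmr = sensor_block_reliability(0.99, use_tmr=False)
```

Note that a TMR group is more reliable than a single module only while the module reliability is above 0.5, so the benefit shrinks for long mission times.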
On the other hand, such an increase in the number of sensors will significantly increase network traffic. Note that any control packet must not have an end-to-end delay greater than the 694 µs system control deadline. Clearly, the fault-free and fault-tolerant scenarios will be identical to those described in Sections 3.1 and 3.2. However, the delays are expected to increase significantly due to the extra traffic generated by 96 sensors instead of 32.
In addition to the fault-free scenario, this system must meet the required control deadline under all the fault-tolerant scenarios stated in Section 3.2. Furthermore, the system will tolerate the failure of one sensor in any TMR group during mission time, as long as the other two sensors in that group are working.
Although the sensor to actuator delays experienced in the proposed model in Section 3 were under the 694 microsecond limit, there is an expected increase in the delays for the proposed TMR model because of the large amount of extra traffic due to TMR. In addition to meeting the required control delay deadline, the proposed TMR model must guarantee zero control packet loss.
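A rough estimate shows why the link speed matters once TMR is applied. The sketch below uses the worst-case queue lengths reported for the TMR configuration (52 packets in the fault-free case and 108 packets in the most loaded fault-tolerant scenario) and assumes that each queued packet costs one transmission delay plus one propagation delay, a model that matches the Gigabit Ethernet delays tabulated in this section. The function names are illustrative.

```python
DEADLINE_US = 694.0               # control deadline (from the text)
FRAME_BITS = (100 + 58) * 8       # payload plus approximate overhead (from the text)
PROPAGATION_US = 1.5 / 2e8 * 1e6  # 1.5 m link at 2e8 m/s, in microseconds


def tmr_worst_case_delay_us(queued_packets, rate_bps):
    """Assumed model: queue length times per-packet transmission plus propagation delay."""
    per_packet_us = FRAME_BITS / rate_bps * 1e6 + PROPAGATION_US
    return queued_packets * per_packet_us


def meets_deadline(queued_packets, rate_bps):
    """True if the estimated worst-case delay is within the control deadline."""
    return tmr_worst_case_delay_us(queued_packets, rate_bps) <= DEADLINE_US


for packets in (52, 108):
    for name, rate in (("Fast Ethernet", 100e6), ("Gigabit Ethernet", 1e9)):
        print(packets, name, meets_deadline(packets, rate))
```

Under this model the 108-packet scenario exceeds the deadline on a 100 Mb/s link but not on a 1 Gb/s link.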
Calculations for TMR applied to two cells are obtained in the same manner depicted in Section 3.3 for two cells connected to each other without TMR. The only difference is an increase in the number of packets in the system. Furthermore, calculations (and OMNeT++ simulations) were conducted only using Gigabit Ethernet links because it was shown in [
Following the same delay calculation methodology,
The proposed two cell fault-tolerant architecture, after applying TMR on the sensor nodes, was simulated on OMNeT++ [
Scenario | Cell 1 packets | Cell 1 Gigabit Ethernet delay (µs) | Cell 1 % error | Cell 2 packets | Cell 2 Gigabit Ethernet delay (µs) | Cell 2 % error |
---|---|---|---|---|---|---|
FF | 52 | 66.118 | 4.28 | 52 | 66.118 | 4.28 |
FT-1 | 52 | 66.118 | 4.28 | 52 | 66.118 | 4.28 |
FT-2 | 52 | 66.118 | 4.28 | 52 | 66.118 | 4.28 |
FT-3 | 54 | 68.661 | 4.29 | 52 | 66.118 | 4.28 |
FT-4 | 52 | 66.118 | 4.28 | 54 | 68.661 | 4.29 |
FT-5 | 54 | 68.661 | 4.29 | 54 | 68.661 | 4.29 |
FT-6 | 56 | 71.204 | 4.31 | 52 | 66.118 | 4.28 |
FT-7 | 52 | 66.118 | 6.08 | 56 | 71.204 | 4.31 |
FT-8 | 52 | 66.118 | 4.29 | 54 | 68.661 | 6.03 |
FT-9 | 54 | 68.661 | 4.29 | 52 | 66.118 | 4.29 |
FT-10 | 108 | 137.322 | 3.39 | 54 | 68.661 | 4.29 |
FT-11 | 54 | 68.661 | 4.29 | 108 | 137.322 | 3.39 |
delays were less than the required system deadline. In all simulated scenarios, the highest observed percentage error between the simulated and analytical delay results was 6.08% as shown in
From a control function point of view, which only focuses on the supervisors and the controllers within the actuators, the reliability equation does not change from the two-cell system without TMR, even with the addition of the extra sensors, because the sensors are not taken into consideration in this reliability model. Hence, the reliability equations are still the same as in (13) and (14).
System reliability can also be calculated by taking into account the sensors and the Ethernet ports within the actuators, in addition to the supervisors and the controllers within the actuators. In this case, the equation will only differ in the sensor block, but the rest of the equation from (15) will remain the same. Instead of the reliability of
the sensors being
time. As such, the reliability equation is:
For the two proposed fault-tolerant architectures, a case study was conducted to quantify overall system reliability compared to a corresponding simplex system. An exponential Time To Failure (TTF) is assumed with time measured in days.
Based on the case study parameters assumed in
The CFR for the two proposed architectures is shown, and compared to the reliability of a corresponding simplex system with no fault-tolerance, in
It can be seen that, from a Control Function perspective, there is no difference in overall system reliability between the two proposed fault-tolerant architectures (with and without TMR). However, in both cases, a significant improvement in CFR can be seen compared to the corresponding simplex architecture.
It can be seen that, when Node Reliability is taken into account, the proposed fault-tolerant architecture with TMR shows significant improvement in reliability compared to that without TMR as well as the simplex architecture. This is due to the large number of sensor nodes (32) utilized across the architecture’s two cells. Without TMR, each of these sensor nodes is a single point of failure for the entire system.
Parameter | Value |
---|---|
λsensor | 1/365 days⁻¹ |
λa | 1/365 days⁻¹ |
λsupervisor | 1/180 days⁻¹ |
λcontroller | 1/90 days⁻¹ |
Coverage (C) | 1 |
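The case study can be explored numerically with the parameters above and an exponential time-to-failure model. The sketch below is not the authors' calculation: the reliability expressions are assumed forms derived from the verbal rules in the text (at least one supervisor alive, a controller or Ethernet port working in each actuator, every logical sensor working, a 2-of-3 majority under TMR) with perfect coverage and independent failures. Reading λa as the failure rate of the actuator Ethernet port is also an assumption.

```python
import math

FAILURE_RATES = {           # per day, from the case study parameters
    "sensor": 1 / 365,
    "actuator_port": 1 / 365,  # assumed meaning of lambda_a
    "supervisor": 1 / 180,
    "controller": 1 / 90,
}


def reliability(rate, t_days):
    """Exponential time-to-failure model: R(t) = exp(-lambda * t)."""
    return math.exp(-rate * t_days)


def case_study(t_days):
    """Return the assumed reliability figures at mission time t_days."""
    r_s = reliability(FAILURE_RATES["sensor"], t_days)
    r_a = reliability(FAILURE_RATES["actuator_port"], t_days)
    r_sup = reliability(FAILURE_RATES["supervisor"], t_days)
    r_k = reliability(FAILURE_RATES["controller"], t_days)

    supervisors = 1 - (1 - r_sup) ** 2            # at least one supervisor alive
    actuators = (1 - (1 - r_k) * (1 - r_a)) ** 8  # controller or port per actuator
    r_tmr = 3 * r_s ** 2 - 2 * r_s ** 3           # 2-of-3 sensor group

    return {
        "CFR": supervisors,
        "CFR_simplex": r_sup ** 2 * r_k ** 8,     # assumed series model
        "NR": r_s ** 32 * supervisors * actuators,
        "NR_TMR": r_tmr ** 32 * supervisors * actuators,
    }


for t in (1, 7, 30):
    print(t, {name: round(value, 4) for name, value in case_study(t).items()})
```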
Fault-tolerant design is essential for a robust Networked Control System (NCS) with a high reliability and a long operational lifetime. With the increasing complexity of NCSs consisting of a large number of nodes such as sensors, controllers and actuators, the probability of the occurrence of any single failure increases significantly. Without fault-tolerance, the occurrence of a single fault in any one node can lead to the failure of the entire control system resulting in lengthy downtimes and consequently significant production losses.
In this paper, the architecture of a two-cell fault-tolerant NCS is developed on top of both unmodified Fast and Gigabit Switched Ethernet. The proposed architecture models a production line composed of two identical machines, each based on a Sensor-to-Actuator (S2A) control architecture.
Fault-tolerance is first applied at the controller level across both cells. An extra network interface is added to each actuator node in addition to the integrated controller node. In case of failure of the integrated controller, a supervisor node becomes part of the control loop and sends the required control action to the affected actuator’s added network card. Under all possible failure scenarios, it was shown that the proposed fault-tolerant architecture fulfils the required control system deadline with zero dropped or over-delayed packets under both Fast and Gigabit Ethernet (both analytically and through OMNeT++ simulations).
Additionally, the fault-tolerance of the proposed architecture was expanded to the level of the sensor nodes through Triple Modular Redundancy (TMR). The application of TMR at the sensor nodes led to a significant increase in the number of control packets transmitted across the proposed architecture making Fast Ethernet unsuitable for meeting the required control system deadline. It was shown that the proposed fault-tolerant architecture fulfils the required control system deadline with zero dropped or over-delayed packets using Gigabit Ethernet under all possible failure scenarios.
Two reliability modeling methodologies were illustrated to quantify the achievable improvement in lifetime compared to a corresponding simplex architecture with no fault-tolerance: Control Function Reliability (CFR) and Node Reliability (NR). CFR only considers the probability of failure of the integrated controllers and supervisors while NR also takes into account the probability of failure of the sensor Ethernet ports and the added integrated network interfaces. A case study was carried out for a typical industrial system. It was shown that, for both modeling methodologies, the proposed fault-tolerant architectures significantly improve overall system reliability.
Merna N. Abou Eita, Mostafa W. Hussein, Shereen S. Abouelazayem, Mennatallah A. Morsi, Eslam A. Moustafa, Hassan H. Halawa, Ramez M. Daoud, Hassanein H. Amer, Hany M. ElSayed, Ahmed A. Ibrahim (2016) Multi-Node Fault-Tolerant Two-Cell Real-Time S2A Network. Intelligent Control and Automation, 7, 25-38. doi: 10.4236/ica.2016.72004