This paper presents a component importance analysis for a virtualized system with live migration. Component importance analysis is essential for determining the design of a virtualized system from the viewpoints of availability and cost. This paper discusses the importance of components with respect to system availability. Specifically, we introduce two component importance analyses, one for a hybrid model (fault trees and continuous-time Markov chains) and one for continuous-time Markov chains alone, and apply them to existing probabilistic models of virtualized systems. In numerical examples, we illustrate the quantitative component importance analysis for a virtualized system with live migration.

Virtualization is one of the key technologies to deploy cloud computing, which provides a variety of system resources as a service over the Internet [

Although virtualization is a promising approach to HA services, designing the system architecture is harder than for a non-virtualized system. For example, system availability can easily be improved by adding physical servers that run the virtualization platform. However, from the viewpoints of cost and energy consumption, this is not always the best design. Hence, to find the best design of a virtualized system, we need a method for evaluating system performance beforehand.

On the performance index, Kundu et al. [

This paper is an extension work of [

The rest of this paper is organized as follows. Section 2 presents the hybrid model for virtualized system design in [

In this section, we introduce the availability model for virtualized system which was presented in [

In [

In the availability modeling, the state of system can be classified into two sets:

As seen in

The CTMC model for VMM is given by

In [

system failure in the virtualized system cannot be represented by the FT. Since this paper describes the correlation between the failures of VM and host by the AND gate in the FT representation, the CTMC model simply becomes the same model as VMM, i.e., the model of

Based on these CTMC models, the steady-state availability for component x can be calculated as follows.

where

Let A_{i} be the steady-state availability of component i. Then we have the following steady-state availability for a host in the virtualized system according to the FT analysis:

where HW is the set of

where

In [

where

In this paper, since we do not treat the 2-state availability model to represent the component availability, the importance measures proposed in [

Aggregation is a technique for transforming a CTMC-based availability model into an equivalent 2-state, 2-transition availability model that has the same availability as the original model. As mentioned before, the states of CTMC-based availability models can be classified into

where the set

By applying the aggregation to the component availability models as preprocessing, the availability importance measures of the component i can be rewritten by

where

In the previous section, we introduced the component importance for the structure function given by the FT model. That model treated live migration as a static structure. However, since live migration is essentially a dynamic behavior, the previous method cannot analyze the effect of components on the dynamic behavior of live migration. Thus, in this section, we consider the component importance for live migration from the viewpoint of dynamic behavior; that is, we apply the component importance analysis to a CTMC representing the dynamic behavior of live migration presented in [

Matos et al. [

State | Description |
---|---|
UUXUUX | VM1 is running on H1, VM2 is running on H2. |
FXXUUX | H1 is failed, VM1 is failed due to the failure of H1. VM2 is running on H2. |
DXXUUR | H1 failure is detected, VM1 is restarting on H2. |
DXXUUU | H1 is down, VM1 and VM2 are running on H2. |
UXXUUU | H1 is up, VM1 and VM2 are running on H2. |
UXXFXX | H1 is up, H2 is failed. VM1 and VM2 are failed due to the failure of H2. |
URXDXX | H2 failure is detected. VM1 is restarting on H1. |
DXXFXX | H1 is down, H2 is failed. |
DXXDXX | H1 is down, H2 failure is detected. |
DXXURX | H1 is down, H2 is up, VM2 is restarting on H2. |
UXXURX | H1 is up, H2 is up, VM2 is restarting on H2. |
UXXUUR | H1 is up, VM2 is running on H2. VM1 is restarting on H2. |
UFaXUUX | App1 is failed, both VMs and Hosts are up. |
UDaXUUX | App1 failure is detected. |
UPaXUUX | App1 failure is not covered. Additional recovery step is started. |
UFvXUUX | H1 is up, VM1 is failed, VM2 is running on H2. |
UDvXUUX | VM1 failure is detected. |
UPvXUUX | VM1 failure is not covered. Manual repair is started. |

state of system is represented by “Da”. If App1 requires an additional repair in the case where the application restart cannot solve the problem, the character is given by “Pa”. Also, when VM1 and App1 are restarting, the state is given by “R”. If VM1 and App1 are not running on the H1, then the character is “X”. The third character represents whether or not VM2 and App2 are running on H1. If VM2 and App2 run on H1, the character is given by “U”. If they are restarting on H1, the character is “R”. Otherwise, if they are not running on H1, the character is “X”. The fourth through sixth characters represent the state of H2 in the same manner as the first through third characters.
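The label encoding described above can be made concrete with a small parser. The following is a minimal sketch only: the field names are hypothetical, and the assignment of the fifth and sixth fields (VM2 on H2, then VM1 on H2) is read off the state table rather than stated explicitly in the text.

```python
import re

# Two-character codes ("Fa", "Dv", "Pa", ...) are listed first so that they
# are matched before the single-character codes.
TOKEN = re.compile(r"[FDP][av]|[UFDRX]")

# Hypothetical field names, introduced here for illustration only.
FIELDS = ("h1", "vm1_on_h1", "vm2_on_h1", "h2", "vm2_on_h2", "vm1_on_h2")


def parse_state(label):
    """Split a state label such as 'UFaXUUX' into its six fields."""
    tokens = TOKEN.findall(label)
    # Reject labels with unknown characters or the wrong number of fields.
    if "".join(tokens) != label or len(tokens) != len(FIELDS):
        raise ValueError(f"not a valid state label: {label!r}")
    return dict(zip(FIELDS, tokens))


print(parse_state("UUXUUX"))
print(parse_state("UFaXUUX"))
```

Trying the two-character codes first is what keeps a label such as UFaXUUX at six fields instead of seven.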

Unlike the case of the FT model, the structure function is not known for the CTMC. We therefore carry out the component importance analysis using only parameter sensitivity analysis.

Let Q be the infinitesimal generator of the CTMC for live migration. Then the steady-state probability vector π_{s} is given by the linear equations π_{s} Q = 0 and π_{s} 1 = 1,

| Params | Description |
|---|---|
| | Mean time to host failure |
| | Mean time to VM failure |
| | Mean time to Application failure |
| | Mean time for host failure detection |
| | Mean time for VM failure detection |
| | Mean time for App failure detection |
| | Mean time to migrate a VM |
| | Mean time to restart a VM |
| | Mean time to repair a host |
| | Mean time to repair a VM |
| | Mean time to App first repair (covered case) |
| | Mean time to App second repair (not covered case) |
| | Coverage factor for VM repair |
| | Coverage factor for application repair |

where 1 is a column vector whose elements are 1. Also we define the following vectors:

・

・

・

・

Then the component availability is given by an inner product of π_{s} and

On the other hand, the system availability can be obtained by

Similar to the case of FT model, we define the importance measures of component i as follows.

where

Similarly, the importance measure with respect to repair rate is given by

Thus the problem is to estimate the sensitivity

To estimate the sensitivities for all the component availabilities, we consider the sensitivities of system and component availabilities with respect to model parameters. Suppose that

where

By using the vector

According to [

where T is the transpose operator. By substituting the estimates of the sensitivities into Equations (11) and (12), we have the component importance measures for live migration.
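As an illustration of how such sensitivities can be computed, the sketch below uses one standard approach: differentiating π_{s} Q = 0 and π_{s} 1 = 1 with respect to a parameter θ gives s Q = −π_{s} (∂Q/∂θ) and s 1 = 0, where s = ∂π_{s}/∂θ. This is not necessarily the formula of the cited reference. The 2-state CTMC, its rates, and all function names are assumptions chosen so that the result can be checked against the closed form.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def transposed_with_normalization(q):
    """Return Q^T with its last row replaced by ones (the normalization row)."""
    n = len(q)
    a = [[q[j][i] for j in range(n)] for i in range(n)]
    a[-1] = [1.0] * n
    return a


def steady_state(q):
    """Solve pi Q = 0 together with sum(pi) = 1."""
    n = len(q)
    return solve(transposed_with_normalization(q), [0.0] * (n - 1) + [1.0])


def steady_state_sensitivity(q, dq):
    """Solve s Q = -pi dQ together with sum(s) = 0, where s = d pi / d theta."""
    n = len(q)
    pi = steady_state(q)
    rhs = [-sum(pi[j] * dq[j][i] for j in range(n)) for i in range(n)]
    rhs[-1] = 0.0  # derivative of the normalization condition
    return solve(transposed_with_normalization(q), rhs)


# Hypothetical 2-state model: state 0 = up, state 1 = down (rates per hour).
lam, mu = 1.0 / 2654.0, 0.6
Q = [[-lam, lam], [mu, -mu]]
dQ_dlam = [[-1.0, 1.0], [0.0, 0.0]]  # derivative of Q with respect to lam

s = steady_state_sensitivity(Q, dQ_dlam)
print("dA/dlam (numerical):", s[0])
print("dA/dlam (closed form):", -mu / (lam + mu) ** 2)
```

For a larger CTMC only Q, the derivative matrix and the indicator vector of up states change; the availability sensitivity is then the inner product of s with that indicator vector.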

In this section, we illustrate the quantitative component importance analysis of hybrid model for virtualized system.

Using the aggregation technique, we first transform the availability models for all components into the equivalent 2-state, 2-transition models, then compute the effective failure and repair rates for components based on the model parameters. We also compute the component availabilities, and these results are shown in

We then compute the system availabilities based on the structure functions and the component availabilities. The availabilities of a hardware unit and a host, and the system availability are presented in

Next we derive the importance measures of components in the virtualized system by using Equation (6), and the effective failure and repair rates shown in

| Params | Description | Value (hours) |
|---|---|---|
| | MTTF of CPU | 2,500,000 |
| | MTTF of Mem | 480,000 |
| | MTTF of Pow | 670,000 |
| | MTTF of Net | 120,000 |
| | MTTF of Cool | 3,100,000 |
| | MTTF of SAN | 20,000,000 |
| | MTTF of VMM | 2880 |
| | MTTF of VM | 2880 |
| | MTTR of CPU | 0.5 |
| | MTTR of Mem | 0.5 |
| | MTTR of one power module | 0.5 |
| | MTTR of two power modules | 1 |
| | MTTR of one network device | 0.5 |
| | MTTR of two network devices | 1 |
| | MTTR of one cooler module | 0.5 |
| | MTTR of two cooler modules | 1 |
| | MTTR of one disk unit | 0.5 |
| | MTTR of two disk units | 1 |
| | MTTR of VMM | 1 |
| | MTTR of VM | 0.5 |

| Params | Description | Value |
|---|---|---|
| | Mean time to repair person summoned | 30 minutes |
| | Mean time to copy data | 20 minutes |
| | Mean time for VMM failure detection | 30 seconds |
| | Mean time for VM failure detection | 30 seconds |
| | Mean time to reboot VMM | 10 minutes |
| | Mean time to reboot VM | 5 minutes |
| | Coverage factor for VMM reboot | 0.9 |
| | Coverage factor for VM reboot | 0.95 |

Component | Effective failure rate (1/hour) | Effective repair rate (1/hour) | Availability |
---|---|---|---|
CPU | 8.0000000e-7 | 1.0000000 | 0.99999920 |
Mem | 8.3333333e-6 | 1.0000000 | 0.99999167 |
Net | 1.6666528e-5 | 1.9999833 | 0.99999167 |
Pow | 2.9850702e-6 | 1.9999970 | 0.99999851 |
Cool | 6.4516108e-7 | 1.9999990 | 0.99999968 |
VMM | 3.4722222e-4 | 3.0769231 | 0.99988717 |
VM | 3.4722222e-4 | 7.0588235 | 0.99995081 |
SAN | 9.9999992e-8 | 1.9999999 | 0.99999995 |
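The aggregation step that produces effective rates like those in the table can be sketched as follows. The sketch assumes the commonly used flow-based definition: the effective failure rate is the probability flow from up states to down states divided by the probability of being up, and the effective repair rate is defined symmetrically. The 3-state component model (up, failed but undetected, under repair) and its rates are hypothetical and are not one of the models of this paper.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def steady_state(q):
    """Solve pi Q = 0 with sum(pi) = 1 (last balance equation replaced)."""
    n = len(q)
    a = [[q[j][i] for j in range(n)] for i in range(n)]
    a[-1] = [1.0] * n
    return solve(a, [0.0] * (n - 1) + [1.0])


def aggregate(q, up):
    """Return (effective failure rate, effective repair rate, availability)."""
    n = len(q)
    pi = steady_state(q)
    down = [i for i in range(n) if i not in up]
    p_up = sum(pi[i] for i in up)
    p_down = sum(pi[i] for i in down)
    fail_flow = sum(pi[i] * q[i][j] for i in up for j in down)
    repair_flow = sum(pi[i] * q[i][j] for i in down for j in up)
    return fail_flow / p_up, repair_flow / p_down, p_up


# Hypothetical rates per hour: failure, failure detection (30 s), repair (1 h).
lam, delta, mu = 1.0 / 2880.0, 120.0, 1.0
Q = [
    [-lam, lam, 0.0],       # state 0: up
    [0.0, -delta, delta],   # state 1: failed, not yet detected
    [mu, 0.0, -mu],         # state 2: under repair
]

lam_eff, mu_eff, avail = aggregate(Q, up={0})
print("effective failure rate:", lam_eff)
print("effective repair rate:", mu_eff)
print("availability:", avail, "2-state model:", mu_eff / (lam_eff + mu_eff))
```

In steady state the up-to-down flow equals the down-to-up flow, which is why the aggregated 2-state model reproduces the availability of the original chain.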

System | Availability |
---|---|
HW1 and HW2 | 0.99998072 |
H1 and H2 | 0.99986789 |
System availability | 0.99999992 |
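The tabulated availabilities can be recomputed from the effective rates of the components. The sketch below rests on an assumed structure that is inferred from the reported numbers rather than taken from the fault tree itself: a hardware unit is a series system of CPU, Mem, Net, Pow and Cool; a host is the hardware unit in series with its VMM; and the system fails if the SAN fails, if both hosts fail, or if one host and the VM of the other host fail. Function and variable names are hypothetical.

```python
# (effective failure rate, effective repair rate) per hour, from the component table.
RATES = {
    "CPU": (8.0000000e-7, 1.0000000),
    "Mem": (8.3333333e-6, 1.0000000),
    "Net": (1.6666528e-5, 1.9999833),
    "Pow": (2.9850702e-6, 1.9999970),
    "Cool": (6.4516108e-7, 1.9999990),
    "VMM": (3.4722222e-4, 3.0769231),
    "VM": (3.4722222e-4, 7.0588235),
    "SAN": (9.9999992e-8, 1.9999999),
}


def availability(lam, mu):
    """Steady-state availability of a 2-state, 2-transition model."""
    return mu / (lam + mu)


def system_availabilities(rates):
    a = {name: availability(lam, mu) for name, (lam, mu) in rates.items()}
    # Series structure assumed for the hardware unit and for the host.
    a_hw = a["CPU"] * a["Mem"] * a["Net"] * a["Pow"] * a["Cool"]
    a_host = a_hw * a["VMM"]
    q_h, q_vm = 1.0 - a_host, 1.0 - a["VM"]
    # Assumed failure condition for the host/VM part, by inclusion-exclusion:
    # (H1 and H2 down) or (H1 and VM2 down) or (VM1 and H2 down).
    q_pair = q_h * (q_h + q_vm - q_h * q_vm) + q_vm * q_h - q_h * q_h * q_vm
    a_sys = a["SAN"] * (1.0 - q_pair)
    return a_hw, a_host, a_sys


a_hw, a_host, a_sys = system_availabilities(RATES)
print(f"hardware unit: {a_hw:.8f}")
print(f"host:          {a_host:.8f}")
print(f"system:        {a_sys:.8f}")
```

Under these assumptions the three values agree with the table to the reported precision, which supports the inferred structure but does not prove it.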

Component | Importance w.r.t. failure rate | Importance w.r.t. repair rate |
---|---|---|
CPU | 1.8126415e-4 | 1.4501132e-10 |
Mem | 1.8126278e-4 | 1.5105232e-09 |
Net | 9.0632147e-5 | 7.5526790e-10 |
Pow | 9.0632147e-5 | 1.3527186e-10 |
Cool | 9.0632162e-5 | 2.9236186e-11 |
VMM | 5.8904249e-5 | 6.6471808e-09 |
VM | 1.8711815e-5 | 9.2043069e-10 |
SAN | 0.5000000000 | 2.4999999e-08 |
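One way to read the two value columns is as the magnitudes of the derivatives of the system availability with respect to the effective failure rate and the effective repair rate of a single component. The sketch below follows that reading using the chain rule: the derivative of the system availability with respect to the component availability, times μ/(λ+μ)² or λ/(λ+μ)² for the 2-state component model. The system structure is the same assumption as before (series hardware unit and host; system failure if the SAN fails, both hosts fail, or one host and the VM of the other host fail), inferred from the reported values. All names are hypothetical.

```python
# (effective failure rate, effective repair rate) per hour, from the component table.
RATES = {
    "CPU": (8.0000000e-7, 1.0000000),
    "Mem": (8.3333333e-6, 1.0000000),
    "Net": (1.6666528e-5, 1.9999833),
    "Pow": (2.9850702e-6, 1.9999970),
    "Cool": (6.4516108e-7, 1.9999990),
    "VMM": (3.4722222e-4, 3.0769231),
    "VM": (3.4722222e-4, 7.0588235),
    "SAN": (9.9999992e-8, 1.9999999),
}
HOST_PARTS = ("CPU", "Mem", "Net", "Pow", "Cool", "VMM")


def importance(name, rates=RATES):
    """Return (importance w.r.t. failure rate, importance w.r.t. repair rate)."""
    a = {k: mu / (lam + mu) for k, (lam, mu) in rates.items()}
    a_host = 1.0
    for part in HOST_PARTS:
        a_host *= a[part]
    q_h, q_vm = 1.0 - a_host, 1.0 - a["VM"]
    q_pair = q_h * (q_h + q_vm - q_h * q_vm) + q_vm * q_h - q_h * q_h * q_vm

    # Derivative of the system availability w.r.t. the component availability.
    if name == "SAN":
        birnbaum = 1.0 - q_pair
    elif name == "VM":
        # Derivative of q_pair w.r.t. the unavailability of one VM.
        birnbaum = a["SAN"] * (q_h - q_h * q_h)
    else:
        # Derivative of q_pair w.r.t. the unavailability of one host, times the
        # derivative of that host unavailability w.r.t. the part unavailability.
        birnbaum = a["SAN"] * (q_h + q_vm - 2.0 * q_h * q_vm) * a_host / a[name]

    lam, mu = rates[name]
    return birnbaum * mu / (lam + mu) ** 2, birnbaum * lam / (lam + mu) ** 2


for component in ("CPU", "Net", "VMM", "VM", "SAN"):
    by_failure, by_repair = importance(component)
    print(f"{component:4s} {by_failure:.7e} {by_repair:.7e}")
```

Under this reading the SAN dominates because it is a single point of failure, while each host component is masked by the redundancy of the second host.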

This section illustrates the quantitative component importance analysis of the CTMC for live migration in the virtualized system. Based on these parameters shown in

Next we compute the effective failure and repair rates for all components based on the aggregation of CTMC model, and the results are shown in

| Params | Description | Value |
|---|---|---|
| | Mean time for host failure | 2654 hr |
| | Mean time for VM failure | 2893 hr |
| | Mean time to Application failure | 175 hr |
| | Mean time for host failure detection | 30 sec |
| | Mean time for VM failure detection | 30 sec |
| | Mean time for App failure detection | 30 sec |
| | Mean time to migrate a VM | 330 sec |
| | Mean time to restart a VM | 50 sec |
| | Mean time to repair a host | 100 min |
| | Mean time to repair a VM | 30 min |
| | Mean time to App first repair (covered case) | 1 min |
| | Mean time to App second repair (not covered case) | 20 min |
| | Coverage factor for VM repair | 0.95 |
| | Coverage factor for application repair | 0.8 |

System | Availability |
---|---|
H1 and H2 | 0.9993644 |
VM1 and VM2 | 0.9999746 |
App1 and App2 | 0.9994520 |
System availability | 0.9999992 |

Component | Effective failure rate (1/hour) | Effective repair rate (1/hour) |
---|---|---|
H1 and H2 | 3.763673e-4 | 0.5917368 |
VM1 and VM2 | 7.212219e-4 | 28.351750 |
App1 and App2 | 6.425198e-3 | 11.718790 |
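As a consistency check, the component availabilities of the live-migration model can be recovered from the aggregated rates through the 2-state relation A = μ/(λ+μ). The short sketch below assumes that the two value columns are the effective failure and repair rates per hour; the dictionary name is hypothetical.

```python
# (effective failure rate, effective repair rate) per hour, from the table above.
EFFECTIVE_RATES = {
    "H1 and H2": (3.763673e-4, 0.5917368),
    "VM1 and VM2": (7.212219e-4, 28.351750),
    "App1 and App2": (6.425198e-3, 11.718790),
}


def availability(lam, mu):
    """Steady-state availability of the aggregated 2-state model."""
    return mu / (lam + mu)


component_availability = {
    name: availability(lam, mu) for name, (lam, mu) in EFFECTIVE_RATES.items()
}
for name, value in component_availability.items():
    print(f"{name:14s} {value:.7f}")
```

The values agree with the tabulated component availabilities to seven decimal places, as expected when aggregation preserves availability.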

Component | Importance w.r.t. failure rate | Importance w.r.t. repair rate |
---|---|---|
H1 and H2 | 2.118715e-03 | 1.347584e-06 |
VM1 and VM2 | 1.675414e-12 | 4.261977e-17 |
App1 and App2 | 9.438502e-13 | 5.174957e-16 |


In this paper, we have presented a quantitative component importance analysis of a virtualized system with live migration in terms of availability. In [