
J. Software Engineering & Applications, 2010, 3, 603-609
doi:10.4236/jsea.2010.36070 Published Online June 2010 (http://www.SciRP.org/journal/jsea)
Copyright © 2010 SciRes. JSEA

Dynamic Two-phase Truncated Rayleigh Model for Release Date Prediction of Software

Lianfen Qian1, Qingchuan Yao2, Taghi M. Khoshgoftaar2

1Department of Mathematical Sciences, Florida Atlantic University, Boca Raton, USA; 2Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA.
Email: lqian@fau.edu, qingchuan_yao@yahoo.com, taghi@cse.fau.edu

Received October 23rd, 2009; revised November 13th, 2009; accepted November 15th, 2009.

ABSTRACT

Software reliability modeling and prediction are important issues during software development, especially when one has to reach a desired reliability prior to software release. Various techniques, both static and dynamic, are used for reliability modeling and prediction in the context of software risk management. The single-phase Rayleigh model is a dynamic reliability model; however, it is not suitable for software release date prediction. We propose a new multi-phase truncated Rayleigh model and obtain parameter estimation using the nonlinear least squares method. The proposed model has been successfully tested in a large software company for several software projects. It is shown that the two-phase truncated Rayleigh model outperforms the traditional single-phase Rayleigh model in modeling weekly software defect arrival data. The model is useful for project management in planning release times and defect management.

Keywords: Software Testing, Weekly Defect Arrival Data, Single-Phase Rayleigh Model, Two-Phase Truncated Rayleigh Model, Software Reliability

1. Introduction

Software reliability is a key attribute of software quality. Various models have been developed for software reliability engineering [1].
The rising complexity, size and functionality of software systems make software reliability prediction difficult. The problem is compounded by short development times and strict release deadlines. Consequently, predicting the release date for achieving a pre-specified system reliability has become a very important issue in software project development. Reliability modeling can not only assist in fulfilling commitments and meeting project deadlines, but also aid in efficient resource management and planning.

Software reliability is the probability of failure-free software operation for a given period of time in a given operating environment. The key attribute in software reliability engineering is the number of defects observed in specified time intervals (e.g. weeks). Software reliability prediction models assess a software product's reliability or estimate the number of latent defects when it is released to the customers. Such an estimate is important for two reasons: 1) as an objective statement of the quality of the product and 2) for resource planning in the software maintenance phase.

There are two categories of software reliability models: static and dynamic. Among the static models, Bayesian belief networks [2] and models utilizing software process metrics are relatively popular. Related literature also proposes various models for software defect prediction which can be used to indirectly gauge software reliability [3,4]. The primary drawback of static models is that they cannot effectively capture the software process and its variations during the course of software project development. On the other hand, a dynamic software reliability model is reflective of the software testing phase and is generally applicable before product release.

Among dynamic models, the (single-phase) Rayleigh model has been shown suitable to fit software defect arrival patterns [5,6].
A single-phase Rayleigh model divides the whole software development life cycle into six stages in chronological order: High Level Design (HLD), Low Level Design (LLD), CODING, Unit Testing (UT), Integration Testing (IT) and System Testing (ST). The six stages are assigned to a sequence of numerical scales: HLD = 0.5, LLD = 1.5, CODING = 2.5, UT = 3.5, IT = 4.5 and ST = 5.5 [5]. Those numerical assignments seem rather ad hoc. Instead, we could assign the six stages to, for instance, $\{t_1, t_2, t_3, t_4, t_5, t_6\}$, as long as $\{t_i\}_{i=1}^{6}$ satisfies $t_1 < t_2 < t_3 < t_4 < t_5 < t_6$. With different numerical assignments for the six stages, the fitted single-phase Rayleigh models could show a much different accuracy pattern, as shown in Figure 1.

The defects/KLOC data in Figure 1 are reconstructed from the work of Thangarajan et al. [5]. The quadratic fit is shown to illustrate that such a small set of data pairs could be fitted well by an arbitrary model such as a quadratic model, rather than just the single-phase Rayleigh model. Also, by assigning one numerical value to each stage, the data set contains only six pairs at most. For prediction purposes, most likely only 3, 4 or 5 pairs are available, and such a small sample size offers no confidence in the reliability prediction.

During the software development life cycle, collecting one single representative number for each stage results in a very small sample size. Furthermore, it is more likely that the data of major software defects are followed weekly, hence allowing project management to monitor the dynamic progress of the software development process. Our motivating weekly software development defects data set, shown in Figure 2, reveals the serious inadequacy of the single-phase Rayleigh model.
This leads to our research on developing a better dynamic software reliability model to estimate the number of major defects and hence predict the software release date.

The existing organizational reliability prediction model for software release date prediction at a large software company, where the weekly data in Figure 2 were collected, is the dynamic single-phase Rayleigh model [5,6]. The software process in the organization consisted of two or more development phases. This is due to the software production cycles, availability of supporting hardware (e.g. wingboard/test phones) in the earlier software development stages, man-power management (e.g. testers' rearrangement) during the software development phases, and other dynamic issues during development. Figure 2, for instance, shows the scatter plot overlaid with the single-phase and the newly proposed two-phase truncated (piecewise, for short) Rayleigh models for the data set from the large software company. It is clear that the two-phase truncated Rayleigh model fits the data much better than the single-phase Rayleigh model.

Motivated by the example, we propose a multiple-phase truncated Rayleigh model in this paper. Such a model is better suited to fit the weekly defect arrival patterns during the software development process. For simplicity, we focus on the two-phase truncated (piecewise) Rayleigh model. The model can be extended to include additional phases reflecting the development process. It is shown through empirical modeling that the model accuracy is significantly improved. Furthermore, using the two-phase truncated Rayleigh model, the release date is predicted with a much higher confidence level.

The paper is organized as follows: Section 2 summarizes the single-phase Rayleigh model and proposes the multi-phase model, with a focus on the two-phase truncated Rayleigh model.
Section 3 presents the algorithms for the nonlinear least squares estimators of the model parameters and flowcharts of the dynamic process. Section 4 applies the proposed two-phase truncated Rayleigh model to defect arrival data of a large real-world software project from the large software organization. Finally, Section 5 concludes the paper and provides suggestions for future work.

2. Multi-Phase Truncated Rayleigh Models for Software Reliability Prediction

The dynamic single-phase Rayleigh model is a standard technique for software reliability modeling, and has been widely used for software project and quality management in the software industry. The software organization from which our case study data is obtained has utilized the dynamic single-phase Rayleigh model for several of its previous software development projects.

The single-phase Rayleigh model is a parametric regression model with the regression function specified by the Rayleigh distribution with a multiplier coefficient. When the parameters of the Rayleigh distribution are estimated based on the updated data from a software project, dynamic projections about the number of defects for the software can be made based on the model over the software development life cycle.

The Rayleigh distribution is a special case of the Weibull distribution, and has various applications including reliability estimation and life cycle pattern modeling [7,8] in developing software projects, and life testing experiments in clinical studies dealing with cancer patients [9].

We now summarize the Rayleigh distribution. Let $t_m$ denote the time at which the single-phase Rayleigh density curve reaches its peak. The cumulative distribution function of the Rayleigh distribution with the constant multiplier $K$ (the total number of latent defects) is

$$F(t; K, \lambda) = K\left(1 - e^{-\lambda t^2}\right),$$

where $\lambda = 1/(2t_m^2)$ is the scale parameter.
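The relation between the peak time and the scale parameter can be checked numerically. The following is a minimal Python sketch, not part of the original study: it codes the cumulative curve with multiplier K, its derivative in t (the defect arrival rate), and the scale parameter implied by a given peak time. The function names and the values `K_TOTAL` and `T_PEAK` are hypothetical and chosen only for illustration.

```python
import math

def rayleigh_cdf(t, k, lam):
    """Cumulative defects K * (1 - exp(-lam * t^2)) expected by time t >= 0."""
    return k * (1.0 - math.exp(-lam * t * t))

def rayleigh_rate(t, k, lam):
    """Derivative of rayleigh_cdf in t: 2 * K * lam * t * exp(-lam * t^2)."""
    return 2.0 * k * lam * t * math.exp(-lam * t * t)

def scale_from_peak(t_m):
    """Scale parameter 1 / (2 * t_m^2) for a curve whose rate peaks at t_m."""
    return 1.0 / (2.0 * t_m * t_m)

# Hypothetical values, chosen only for illustration.
K_TOTAL = 100.0   # total number of latent defects
T_PEAK = 0.4      # time at which the arrival rate peaks
LAM = scale_from_peak(T_PEAK)

# Arrival rate at the peak and slightly before and after it.
peak_rate = rayleigh_rate(T_PEAK, K_TOTAL, LAM)
before = rayleigh_rate(T_PEAK - 0.01, K_TOTAL, LAM)
after = rayleigh_rate(T_PEAK + 0.01, K_TOTAL, LAM)
```

With these inputs the rate at `T_PEAK` exceeds the rate on either side, and the cumulative curve approaches `K_TOTAL` for large t, which is why K is read as the total number of latent defects.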
The single-phase Rayleigh model has a regression function parameterized as

$$f(t; K, \lambda) = 2K\lambda t\, e^{-\lambda t^2}, \qquad (1)$$

where $K$ and $\lambda$ are the two parameters that need to be estimated from the data.

Figure 1. Single-phase Rayleigh model vs. quadratic model for two ad hoc numerical assignments for the ordinal stages in the software development life cycle. The solid line is for the single-phase Rayleigh model, while the dashed line is for the quadratic model. (a) (HLD, LLD, CODING, UT, IT, ST) = (0.5, 1.5, 2.5, 3.5, 4.5, 5.5); (b) (HLD, LLD, CODING, UT, IT, ST) = (1, 3, 7, 8, 8.5, 9)

Figure 2. Major defects vs. development time in weeks

The single-phase Rayleigh model (1) does not fit the case study data set well. Actually it is a very poor fit, as seen in Figure 2, which makes the case for a much needed improvement in modeling software defect arrival patterns. We propose a new multi-phase truncated Rayleigh model, defined as below:

$$g(t; \theta) = \begin{cases} f(t; K_1, \lambda_1), & 0 < t \le \tau_1, \\ f(t - \mu_1; K_2, \lambda_2), & \tau_1 < t \le \tau_2, \\ \quad\vdots \\ f(t - \mu_{d-1}; K_d, \lambda_d), & t > \tau_{d-1}, \end{cases}$$

where $d$ is the number of phases and $\theta = (\lambda_1, \dots, \lambda_d, K_1, \dots, K_d, \tau_1, \dots, \tau_{d-1}, \mu_1, \dots, \mu_{d-1})^T$ is the model parameter vector.

For simplicity, we will discuss the case with $d = 2$, the two-phase truncated (piecewise) Rayleigh model with regression function parameterized as follows:

$$g(t; \theta) = \begin{cases} f(t; K_1, \lambda_1), & 0 < t \le \tau, \\ f(t - \mu; K_2, \lambda_2), & t > \tau, \end{cases} \qquad (2)$$

where $\tau$ is the location of the phase change and $\mu$ is the starting location for the second phase. Due to the nature of the software defect data, we suggest using the left truncated Rayleigh model for the second phase. Then $\theta = (\lambda_1, \lambda_2, K_1, K_2, \tau, \mu)^T$ is the parameter vector that needs to be estimated.

3. Algorithms for Piecewise Rayleigh Models

In this section, we describe the nonlinear least squares estimator of the model parameters.
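Before the estimation steps, it may help to see the two-phase regression function (2) of Section 2 written out as code. The Python sketch below is illustrative only: the argument names (`k1`, `lam1` for the first-phase multiplier and scale, `k2`, `lam2` for the second phase, `tau` for the phase change location, `mu` for the second-phase starting location) and the `EXAMPLE` values are hypothetical, and it assumes the starting location does not exceed the change location so that the shifted time stays positive in phase two.

```python
import math

def rayleigh_rate(t, k, lam):
    """Single-phase regression function 2 * K * lam * t * exp(-lam * t^2)."""
    return 2.0 * k * lam * t * math.exp(-lam * t * t)

def two_phase_rate(t, k1, lam1, k2, lam2, tau, mu):
    """Two-phase truncated Rayleigh regression function.

    Phase one (right truncated at tau) is used for 0 < t <= tau.
    Phase two (left truncated) is a Rayleigh curve that starts at mu
    but is only used for t > tau. Assumes mu <= tau.
    """
    if t <= tau:
        return rayleigh_rate(t, k1, lam1)
    return rayleigh_rate(t - mu, k2, lam2)

# Hypothetical parameter values, chosen only for illustration.
EXAMPLE = dict(k1=5.0, lam1=5.0, k2=8.0, lam2=10.0, tau=0.45, mu=0.40)

in_phase_one = two_phase_rate(0.30, **EXAMPLE)
in_phase_two = two_phase_rate(0.60, **EXAMPLE)
```

Because the two pieces are fitted separately, the curve is generally discontinuous at the change location, which is what lets it follow a second surge in weekly defects.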
Let $\{(t_i, d_i)\}_{i=1}^{n}$ be the defect arrival data collected over time, where $t_i = i/n$ is the time index for the ith week, $d_i$ is the total number of software defects detected during the ith week, and $n$ is the number of weeks observed. Let

$$S(\theta) = \sum_{i=1}^{n} \left(d_i - g(t_i; \theta)\right)^2.$$

Then the nonlinear least squares estimator is the minimizer of $S(\theta)$. Notice that $S(\theta)$ is not differentiable in the location of the phase change point $\tau$ and the starting point of the second phase $\mu$. In conjunction with the nonlinear least squares method and the Gauss-Newton algorithm, we utilize a four-step technique (described below) to obtain the estimators of the parameter vector $\theta$. The nls routine in the R language is used to obtain the estimates of the model parameters.

Step 1: For any given location of phase change $\tau$ in (0, 1), fix a $\mu$ such that $0 < \mu < 1$. We compute the nonlinear least squares estimators [10], $\hat\theta_{1n}(\tau, \mu)$, for the smooth parameters $\theta_1 = (\lambda_1, \lambda_2, K_1, K_2)^T$ by minimizing $S(\theta)$ over $\theta_1$.

Step 2: Substitute $\hat\theta_{1n}(\tau, \mu)$ into $S(\theta)$ to obtain the profile objective function $S(\tau, \mu)$. Then we minimize $S(\tau, \mu)$ over $0 < \mu < 1$ for the given $\tau$ to obtain $\hat\mu(\tau)$. Notice that the minimizer $\hat\mu(\tau)$ is a function of $\tau$.

Step 3: Substitute $\hat\mu(\tau)$ into $S(\tau, \mu)$ to get $S(\tau)$. The minimizer of $S(\tau)$ over (0, 1) is called the change point estimator, denoted by $\hat\tau$.

Step 4: Substitute $\hat\tau$ into $\hat\mu(\tau)$ to get $\hat\mu = \hat\mu(\hat\tau)$, and $(\hat\tau, \hat\mu)$ into $\hat\theta_{1n}(\tau, \mu)$ to get $\hat\theta_{1n}$. Putting them together, we obtain the nonlinear least squares estimator $\hat\theta_n = (\hat\theta_{1n}^T, \hat\tau, \hat\mu)^T$ of $\theta$.

Figures 3 and 4 illustrate the flow charts of the dynamic process of the algorithm for the single-phase and multi-phase truncated Rayleigh models, respectively. We provide the flowchart for the single-phase Rayleigh model for comparison purposes.

Figure 3. Algorithm for single-phase Rayleigh model

Figure 4. Algorithm for multi-phase Rayleigh model

4. Application to a Real Software Defect Data Set

The data set that motivated our research was collected from Feb-25-06 to Aug-04-07 at a large software company. It contains 76 weeks of software defect arrival data, with the number of major defects reported for each week.

4.1 Single-phase vs. Piecewise Rayleigh Models

We illustrate the two-phase truncated Rayleigh model by fitting the software defect arrival data set. From Figure 2, it is observed that using the two-phase truncated Rayleigh model improves the model fitting significantly compared to the single-phase Rayleigh model with respect to model accuracy and model goodness-of-fit. For comparison purposes, the estimated single-phase Rayleigh regression function is given by

$$\hat g(t) = 2\hat K \hat\lambda\, t\, e^{-\hat\lambda t^2},$$

where $\hat K = 13.7111$ and $\hat\lambda = 1.5297$.

For the two-phase truncated Rayleigh model (2), the estimated change point is at the $\hat\tau$ = 33rd week, with the starting point for the second phase estimated at the $\hat\mu$ = 31st week. Hence phase one is from the first week to the 33rd week and phase two is from the 34th week to the 76th week, with its estimated starting point at the 31st week. The first phase (right truncated) of the regression function is estimated as

$$\hat g(t) = 2\hat K_1 \hat\lambda_1\, t\, e^{-\hat\lambda_1 t^2}, \quad \text{if } t \le 33/76,$$

with $\hat K_1 = 4.3022$, $\hat\lambda_1 = 5.2279$, and the second phase (left truncated) of the regression function is estimated as

$$\hat g(t) = 2\hat K_2 \hat\lambda_2\, (t - 31/76)\, e^{-\hat\lambda_2 (t - 31/76)^2}, \quad \text{if } t > 33/76,$$

with $\hat K_2 = 7.7731$, $\hat\lambda_2 = 11.2445$. Figure 2 shows the scatter plot overlaid with the two fitted curves using the single-phase and two-phase truncated Rayleigh models, respectively. From the fitted model, one can predict the number of software defects in future weeks and establish the quality assurance criterion for predicting the release date.

The proposed multi-phase truncated Rayleigh model can be utilized for modeling future software development projects to obtain better predictions and provide more efficient estimation of the release date of the software product.
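The four-step profile search of Section 3 can be sketched in Python as a nested grid search. This is not the authors' implementation, which used Gauss-Newton through R: here the inner fit exploits the fact that, for a fixed scale parameter, the model is linear in the multiplier K, and the scale is picked from a grid. The change location and the second-phase starting location are searched over week indices, with the start assumed not to exceed the change week. All names (`fit_phase`, `profile_fit`, `LAM_GRID`, `min_len`) and the synthetic data are hypothetical.

```python
import math

def basis(t, lam):
    """Rayleigh shape 2 * lam * t * exp(-lam * t^2); the model is K times this."""
    return 2.0 * lam * t * math.exp(-lam * t * t)

def fit_phase(ts, ds, lam_grid):
    """Return (K, lam, SSE) minimising the squared error of one phase.

    For a fixed lam the model is linear in K, so K has a closed form.
    lam is picked from lam_grid, a simple stand-in for Gauss-Newton.
    """
    best = None
    for lam in lam_grid:
        h = [basis(t, lam) for t in ts]
        hh = sum(x * x for x in h)
        if hh == 0.0:
            continue  # degenerate shape, skip this grid value
        k = sum(d * x for d, x in zip(ds, h)) / hh
        sse = sum((d - k * x) ** 2 for d, x in zip(ds, h))
        if best is None or sse < best[2]:
            best = (k, lam, sse)
    return best

def profile_fit(ds, lam_grid, min_len=3):
    """Grid version of Steps 1 to 4 over week indices for the two locations."""
    n = len(ds)
    ts = [(i + 1) / n for i in range(n)]        # t_i = i / n for week i
    best = None
    for c in range(min_len, n - min_len + 1):   # phase change after week c
        p1 = fit_phase(ts[:c], ds[:c], lam_grid)
        for m in range(0, c + 1):               # phase two starts at week m <= c
            shifted = [t - m / n for t in ts[c:]]
            p2 = fit_phase(shifted, ds[c:], lam_grid)
            if p1 is None or p2 is None:
                continue
            sse = p1[2] + p2[2]
            if best is None or sse < best["sse"]:
                best = {"tau_week": c, "mu_week": m, "sse": sse,
                        "k1": p1[0], "lam1": p1[1],
                        "k2": p2[0], "lam2": p2[1]}
    return best

# Synthetic, noise-free data from hypothetical parameters (not the paper's data).
N = 30
TRUE = {"k1": 5.0, "lam1": 5.0, "k2": 8.0, "lam2": 10.0,
        "tau_week": 13, "mu_week": 11}
data = []
for week in range(1, N + 1):
    t = week / N
    if week <= TRUE["tau_week"]:
        data.append(TRUE["k1"] * basis(t, TRUE["lam1"]))
    else:
        data.append(TRUE["k2"] * basis(t - TRUE["mu_week"] / N, TRUE["lam2"]))

LAM_GRID = [0.5 * j for j in range(1, 41)]      # 0.5, 1.0, ..., 20.0
best = profile_fit(data, LAM_GRID)
```

On this synthetic series the search recovers the generating change week and starting week. On real, noisy counts a finer scale grid or a proper nonlinear least squares routine would be the more sensible choice for the inner fit.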
4.2 Quality Assurance Criterion for Release Date Prediction

In this section, we establish the quality assurance criterion for software release. The quality assurance criterion is determined by 95% and 99.9% confidence levels. That is, based on the fitted model, if the model shows that 95% or 99.9% of the total expected software defects have been detected, then we suggest that the software is ready for release.

For the single-phase Rayleigh model, we estimate the release date with 95% confidence level. We set $F(t; \hat K, \hat\lambda) = 0.95\hat K$ and solve for $t$, or equivalently

$$1 - e^{-\hat\lambda t^2} = 0.95.$$

This implies that the release date equals the ceiling of

$$n\sqrt{-\ln(1 - 0.95)/\hat\lambda} \approx 106.4, \text{ that is, 107 weeks,}$$

where $\hat\lambda = 1.5297$ and $n = 76$. Hence, with 95% confidence the software project will need 107 − 76 = 31 weeks of further testing before releasing the software product. That is, the predicted release date using the 76 weeks of data is Feb-29-08 based on the single-phase Rayleigh model. With 99% confidence, it would require an even longer testing time.

Alternatively, utilizing the two-phase truncated Rayleigh model (2), with $\hat\tau = 33$ and $\hat\mu = 31$ expressed in weeks, we set

$$\int_0^{\hat\tau/n} \hat g(t)\,dt + \int_{\hat\tau/n}^{i/n} \hat g(t)\,dt = 0.999 \int_0^{1} \hat g(t)\,dt$$

and solve for $i$ to get the estimated release week with 99.9% confidence level. Equivalently, the estimated release week number $i$ satisfies

$$e^{-\hat\lambda_2 \left((i - \hat\mu)/n\right)^2} = e^{-\hat\lambda_2 \left((\hat\tau - \hat\mu)/n\right)^2} - \frac{0.999\int_0^{1} \hat g(t)\,dt - \int_0^{\hat\tau/n} \hat g(t)\,dt}{\hat K_2} = A_0 - \frac{0.999A_2 - A_1}{\hat K_2}, \qquad (3)$$

where

$$A_0 = e^{-\hat\lambda_2 \left((\hat\tau - \hat\mu)/n\right)^2} = 0.9922, \quad A_1 = \int_0^{\hat\tau/n} \hat g(t)\,dt = 2.6967, \quad A_2 = \int_0^{1} \hat g(t)\,dt = 10.2586.$$

Solving Equation (3) gives the estimated release week number:

$$\hat i = \left\lceil \hat\mu + n\sqrt{-\frac{1}{\hat\lambda_2}\ln\left(A_0 - \frac{0.999A_2 - A_1}{\hat K_2}\right)} \right\rceil = 76 \text{ weeks},$$

where $\lceil x \rceil$ is the smallest integer greater than or equal to $x$. This indicates that, with 99.9% confidence, the estimated release week is the end of the 76th week. That is, the software is ready for release with almost 100% confidence based on the two-phase truncated Rayleigh model.
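The two release-week calculations above can be reproduced with a short Python sketch. It is a reading of the criterion, not the authors' code. The pairing of each reported estimate with a multiplier or a scale parameter is an assumption: the pairing used here (first phase: multiplier 4.3022, scale 5.2279; second phase: multiplier 7.7731, scale 11.2445) is the one that reproduces the reported constants 0.9922, 2.6967 and 10.2586. Function and variable names are hypothetical.

```python
import math

N_WEEKS = 76  # weeks of data; time is scaled as t = week / N_WEEKS

def release_week_single(lam, level, n=N_WEEKS):
    """Smallest week i with 1 - exp(-lam * (i / n)^2) >= level."""
    return math.ceil(n * math.sqrt(-math.log(1.0 - level) / lam))

def two_phase_constants(k1, lam1, k2, lam2, tau_week, mu_week, n=N_WEEKS):
    """Return (A0, A1, A2) for the fitted two-phase curve.

    A1 is the area of phase one up to the change week; A2 is the total
    area over the observed weeks, i.e. t in [0, 1].
    """
    a0 = math.exp(-lam2 * ((tau_week - mu_week) / n) ** 2)
    a1 = k1 * (1.0 - math.exp(-lam1 * (tau_week / n) ** 2))
    a2 = a1 + k2 * (a0 - math.exp(-lam2 * ((n - mu_week) / n) ** 2))
    return a0, a1, a2

def release_week_two_phase(k1, lam1, k2, lam2, tau_week, mu_week, level,
                           n=N_WEEKS):
    """Smallest week at which the accumulated area reaches level * A2."""
    a0, a1, a2 = two_phase_constants(k1, lam1, k2, lam2, tau_week, mu_week, n)
    remainder = a0 - (level * a2 - a1) / k2
    return math.ceil(mu_week + n * math.sqrt(-math.log(remainder) / lam2))

# Estimates as reported in Section 4.1 (pairing assumed, see lead-in).
single = release_week_single(1.5297, 0.95)
a0, a1, a2 = two_phase_constants(4.3022, 5.2279, 7.7731, 11.2445, 33, 31)
two_phase = release_week_two_phase(4.3022, 5.2279, 7.7731, 11.2445,
                                   33, 31, 0.999)
```

Under these assumptions the single-phase criterion gives week 107 and the two-phase criterion gives week 76, matching the figures quoted in the text.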
We note that the large software organization has adopted our new two-phase truncated Rayleigh model and is using it to predict the number of software defects dynamically and the release dates for ongoing software projects. Our new two-phase truncated Rayleigh model has improved the software release life cycle a great deal and has saved a lot of man-power resources for the large software organization.

4.3 Model Performance Check

We utilize three measures of goodness-of-fit to assess the performance of the models: root mean square error (RMSE), mean magnitude of relative error (MRE), and adjusted coefficient of determination $R^2_{adj}$.

The root mean square error measures the model accuracy and is defined as the square root of the mean squared residuals. That is,

$$RMSE = \sqrt{\frac{1}{n-5}\sum_{i=1}^{n}\left(d_i - \hat d_i\right)^2},$$

where $d_i$ is the number of defects detected during the ith week and $\hat d_i$ is the fitted (predicted) value of $d_i$. The smaller the RMSE, the better the model fits.

The second criterion for assessing the performance of model fitting used in the reliability literature is the mean magnitude of relative error, defined as

$$MRE = \frac{\sum_{i=1}^{n} \frac{|d_i - \hat d_i|}{d_i}\, I(d_i > 0)}{\sum_{i=1}^{n} I(d_i > 0)}.$$

The implicit assumption in this summary measure is that the seriousness of the absolute error is proportional to the size of the observations. The smaller the MRE, the better the model fits.

The third measure of goodness-of-fit used is the adjusted coefficient of determination $R^2_{adj}$, which is the adjusted percentage of variation in the number of defects per week explained by the model. That is,

$$R^2_{adj} = 1 - \frac{SSE/(n-5)}{SSTO/(n-1)},$$

where $SSE = (n-5)(RMSE)^2$ and $SSTO = \sum_{i=1}^{n}(d_i - \bar d)^2$ with $\bar d = n^{-1}\sum_{i=1}^{n} d_i$. The higher the $R^2_{adj}$, the better the model fits.

Table 1 summarizes the three performance criteria for the real-world weekly software defects data set using both single-phase and two-phase truncated (piecewise) Rayleigh models.
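The three measures defined in this section can be computed as in the following Python sketch. It follows the definitions as stated, including the n − 5 divisor and the restriction of MRE to weeks with a nonzero defect count. The function names, the `n_params` argument and the small `observed` and `fitted` series are hypothetical and are not the study's data.

```python
import math

def rmse(observed, fitted, n_params=5):
    """Root mean square error with the n - 5 divisor used in the text."""
    n = len(observed)
    sse = sum((d - f) ** 2 for d, f in zip(observed, fitted))
    return math.sqrt(sse / (n - n_params))

def mre(observed, fitted):
    """Mean magnitude of relative error over weeks with d_i > 0."""
    terms = [abs(d - f) / d for d, f in zip(observed, fitted) if d > 0]
    return sum(terms) / len(terms)

def adjusted_r2(observed, fitted, n_params=5):
    """Adjusted coefficient of determination 1 - (SSE/(n-5)) / (SSTO/(n-1))."""
    n = len(observed)
    mean = sum(observed) / n
    sse = sum((d - f) ** 2 for d, f in zip(observed, fitted))
    ssto = sum((d - mean) ** 2 for d in observed)
    return 1.0 - (sse / (n - n_params)) / (ssto / (n - 1))

# Made-up weekly counts and fitted values, for illustration only.
observed = [2, 4, 0, 6, 8, 4, 4]
fitted = [3, 4, 1, 5, 8, 6, 4]

example_rmse = rmse(observed, fitted)        # sqrt(7 / 2)
example_mre = mre(observed, fitted)          # (7 / 6) / 6
example_r2 = adjusted_r2(observed, fitted)   # 1 - 3.5 / (40 / 6)
```

Note that the week with zero observed defects contributes to RMSE and to the adjusted coefficient of determination but is skipped by MRE, which is the role of the indicator in its definition.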
Table 1. Model comparisons using RMSE, MRE and $R^2_{adj}$

Model          RMSE    MRE     R^2_adj
Single-phase   5.97    0.76    36.6%
Two-phase      4.13    0.36    70.4%

Based on the reported RMSE, MRE and $R^2_{adj}$ values, the two-phase truncated Rayleigh model is much better than the single-phase Rayleigh model. The MRE is reduced by about 50%, while the goodness-of-fit measure $R^2_{adj}$ is roughly doubled for the two-phase truncated model compared to the single-phase Rayleigh model. The two-phase truncated Rayleigh model explains almost twice as much of the variation in the number of defects as the single-phase Rayleigh model does. Thus, based on the given data, we conclude that the two-phase truncated Rayleigh model is an attractive model for predicting weekly software defects and the release date of software projects.

5. Conclusions

The research was motivated by real-world software defect arrival data collected over many weeks from a large software organization. The paper proposes a new multi-phase truncated Rayleigh model (focusing on a two-phase truncated model) for fitting weekly defect arrival data.

It is shown that the proposed model is much more accurate than the existing single-phase Rayleigh model. The single-phase model was previously used by the organization during software development. Using both the MRE and $R^2_{adj}$ performance measures, the proposed model almost doubled the prediction accuracy, hence shortening the release date prediction with a higher confidence level. From a software reliability perspective, our proposed two-phase truncated Rayleigh prediction model will help in the management and planning of project resources toward bettering the software release cycle time.

The two-phase truncated Rayleigh model can be easily extended to a multi-phase truncated Rayleigh model. Hence it can be used to predict the release date for future software projects with a higher confidence level. A general multi-phase Rayleigh software release prediction model can be developed to automatically detect and reflect all the change locations and the starting points of the software development phases, so that the multiple-phase truncated Rayleigh software prediction model can be generated to automatically forecast the software release time.

REFERENCES

[1] M. R. Lyu, "Software Reliability: To Use or not to Use?" Proceedings of 5th International Symposium on Software Reliability Engineering, November 1994, pp. 66-73.

[2] Y. Wang and M. Smith, "Release Date Prediction for Telecommunication Software Using Bayesian Belief Networks," Proceedings of the 2002 IEEE Canadian Conference on Electrical and Computer Engineering, 2002, pp. 738-742.

[3] T. M. Khoshgoftaar and N. Seliya, "Fault Prediction Modeling for Software Quality Estimation: Comparing Commonly Used Techniques," Empirical Software Engineering Journal, Vol. 8, No. 3, 2003, pp. 255-283.

[4] T. M. Khoshgoftaar and N. Seliya, "Comparative Assessment of Software Quality Classification Techniques: An Empirical Case Study," Empirical Software Engineering Journal, Vol. 9, No. 3, 2004, pp. 229-257.

[5] M. Thangarajan and B. Biswas, "Mathematical Model for Defect Prediction across Software Development Life Cycle," The SEPG (Software Engineering Process Group) Conference, India, 2000. http://www.qaiindia.com/Conferences/SEPG2000/index.html

[6] S. H. Kan, "Metrics and Models in Software Quality Engineering," 2nd Edition, Addison Wesley, Massachusetts, 2003.

[7] P. V. Norden, "Useful Tools for Project Management," Operations Research in Research and Development, B. V. Dean, Ed., John Wiley & Sons, New York, 1963.

[8] L. H. Putnam, "A General Empirical Solution to the Macro Software Sizing and Estimating Problem," IEEE Transactions on Software Engineering, Vol. SE-4, 1978, pp. 345-361.

[9] S. K. Bhattacharya and R. K. Tyagi, "Bayesian Survival Analysis Based on the Rayleigh Model," Trabajos de Estadistica, Vol. 5, No. 1, 1990, pp. 81-92.

[10] D. M. Bates and J. M. Chambers, "Nonlinear Models," Chapter 10 of Statistical Models in S, J. M. Chambers and T. J. Hastie, Eds., Wadsworth & Brooks/Cole, 1992.