J. Software Engineering & Applications, 2009, 2: 221236 doi:10.4236/jsea.2009.24030 Published Online November 2009 (http://www.SciRP.org/journal/jsea) Copyright © 2009 SciRes JSEA 221 Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects Tim MENZIES1, Osamu MIZUNO2, Yasunari TAKAGI3, Tohru KIKUNO2 1Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, USA; 2Graduate School of Information Science and Technology, Osaka University, Osaka, Japan; 3Social Systems Business Group, OMRON Corporation, Shiokoji Horikawa, Japan. Email: tim@menzies.us, {omizuno, kikuno}@ist.osakau.ac.jp, yasunari@omron.co.jp Received May 6th, 2009; revised July 7th, 2009; accepted July 14th, 2009. ABSTRACT Often, the explanatory power of a learned model must be traded off against model performance. In the case of predict ing runaway software projects, we show that the twin goals of high performance and good explanatory power are achievable after applying a variety of data mining techniques (discrimination, feature subset selection, rule covering algorithms). This result is a new high water mark in predicting runaway projects. Measured in terms of precision, this new model is as good as can be expected for our data. Other methods might outperform our result (e.g. by generating a smaller, more explainable model) but no other method could outperform the precision of our learned model. Keywords: Explanation, Data Mining, Runaway 1. Introduction Every teacher knows that generating succinct explana tions means skipping over tedious details. Such explana tions can be quickly communicated, but can miss the details needed to apply that knowledge in a real world setting. An analogous situation occurs with data miners. All data miners are performance systems; i.e. they can reach conclusions about a test case. However, only some data miners are explanation systems that offer a highlevel description of how the learned model functions. The ability to explain how a conclusion was reached is a very powerful tool for helping users to understand and accept the conclusions of a data miner. Despite this, sometimes explanatory power must be decreased in order to increase the efficacy of the predictor. For example, previously Abe, Muzono, Takagi, et al. used a Näive Bayes classifier to generate a predictor for runaway software projects [1–3]. That model performs well but, as shown below, cannot easily explain how it reaches its conclusions. This paper repairs the explainability of that prior result. Using an iterative exploration of data mining techniques (crossvalidation, different rule learners, discretization, feature subset selection), we found a particular combina tion of methods that yielded succinct explanations of how to predict for runaway software projects while outperforming the Näive Bayes classifier. In holdout experiments, this new model exhibited perfect precision; i.e. precision = 1.0. Other methods might be able to outperform this new result (e.g. by finding a more suc cinct and explainable model) but no other method could be more precise (since 0 ≤ precision ≤ 1). The rest of this paper is structured as follows. First, the software runaway problem is defined and the explanation problems of prior results are discussed. Next, the general problem of explaining a learned model is explored using a range of data miners. And examples from the software engineering literatures (in summary, the best performing models may be very poor at explaining how those models make their conclusions). A class of data miners called rule learners will then be introduced and applied to our data via various treatments (some combination of discre tizer, feature selector, and learner). The subsequent dis cussion will review (a) related work; (b) the external va lidity of these results; as well as (c) general principles of building explainable models via data mining. 2. Runaway Software Glass defines a “runaway software project” as “a project that goes out of control primarily because of the diffi culty of building the software needed by the system” [4]. For Glass “out of control” means “schedule, cost, or functionality that was twice as bad as the original estimates”.
Explanation vs Performance in Data Mining:A Case Study with Predicting Runaway Projects 222 Requirements Estimation Planning Team Organi zations Management R1 R2 R3 R4 R5 E1 E2 E3 E4 E5 P1 P2 P3 P4P5 P6 O1 O2 O3 M1 M2 M3class 0 0 0 0 0 2 3 3 2 02000002 1 0 0 0 0 ok 0 0 0 0 0 0 0 0 0 00002000 0 0 0 0 0 ok 0 0 0 0 3 0 0 2 3 00000200 0 0 0 0 0 ok 3 3 2 2 3 0 0 2 2 02200012 0 0 0 0 0 ok 0 0 0 0 2 0 0 0 0 00202200 0 0 0 2 0 ok 0 3 2 0 0 2 2 2 0 20200000 0 0 0 0 2 ok 0 0 2 3 2 0 0 0 0 00203000 0 0 0 0 0 ok 0 2 3 3 0 1 0 2 0 02200220 0 1 3 0 0 ok 0 2 0 2 3 0 0 0 0 02202200 0 0 0 0 2 ok 0 0 0 0 2 0 2 2 0 00200200 0 0 2 0 0 ok 0 3 3 2 0 0 0 3 3 00000000 0 2 0 0 0 ok 0 2 2 2 0 0 2 0 0 00202000 0 0 0 0 2 ok 0 2 0 2 0 0 0 0 0 02330222 2 0 2 2 1 ok 0 0 0 0 3 0 0 0 0 00000000 0 0 0 0 0 ok 0 2 2 2 2 0 2 2 0 00000003 2 0 3 0 0 ok 0 0 0 0 2 0 2 0 2 33202323 2 0 2 2 2 ok 0 0 0 0 0 0 2 0 0 02223220 0 0 2 2 0 ok 0 0 0 0 1 0 0 0 0 00220000 0 0 0 0 0 ok 0 0 0 0 3 0 0 0 0 00000000 0 0 0 0 0 ok 0 2 3 2 3 0 0 0 0 03000302 0 0 0 3 3 ok 0 2 2 0 0 0 0 0 0 00000000 0 0 0 0 0 ok 3 2 3 3 2 2 1 3 2 10222013 1 2 2 2 0 ok 2 2 0 2 3 0 0 2 3 02022320 0 0 0 2 3 runaway 2 2 3 3 3 2 2 3 2 33332323 3 0 2 2 2 runaway 3 2 0 0 3 0 0 0 0 03003300 0 0 0 0 0 runaway 0 2 3 2 2 3 0 2 2 10200220 2 2 2 0 2 runaway 0 2 2 2 2 0 3 2 3 30220022 2 0 0 0 0 runaway 2 3 3 2 2 0 0 3 3 23030232 0 2 0 2 2 runaway 3 2 3 2 0 3 2 2 2 00222302 0 2 0 3 3 runaway 2 2 3 3 2 0 0 2 0 22222203 0 2 0 2 0 runaway 0 0 0 0 0 0 0 0 0 03333333 3 0 0 3 3 runaway 2 3 3 3 2 2 3 3 3 33332333 3 2 3 3 0 runaway Figure 1. Data used in this study, collected using the meth ods. For an exp lanation of th e columns features , see Figu re 2 [1] Requirements features relate to the understanding and commitment of the requirements among the project members R1: Ambigious requirements R2: Insufficient explanation of the requirements R3: Misunderstanding of the requirements R4: Lack of commitment regarding requirements between the customer and the project members; R5: Frequent requirement changes Estimation features relate to the technical methods for carrying out the estimation, and the commitment between project members and customers: E1: Insufficient awareness of the importance of the estimation; E2: Insufficient skills or knowledge of the estimation method; E3: Insufficient estimation of the implicit requirements; E4: Insufficient estimation of the technical issues; E5: Lack of stake holders’ commitment of estimation. Planning features relate to the planning or scheduling activity and the commitment to the project plan among project members: P1: Lack of management review for the project plan; P2: Lack of assignment of responsibility; P3: Lack of breakdown of the work products; P4: Unspecified project review milestones; P5: Insufficient planning of project monitoring and controlling; P6: Lack of project members’ commitment for the project plan. Team organization features relate to the state of the projects; e.g. the fundamental skills or experience and morale of project members: O1: Lack of skills and experience; O2 ]: Insufficient allocation of resources; O3 ]: Low morale. Project management factors about management activities: M1: Project manager lack of resource management throughout a project; M2: Inadequate project monitoring and controlling; M3: Lack of data needed to keep objective track of a project. Figure 2. Explanation of the features seen in Figure 1 Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining:A Case Study with Predicting Runaway Projects 223 Many software projects suffer from runaways: In 2001, the Standish group reported that 53% of U.S. software projects ran over 189% of the original es timate [5]. This 189% is not the 200% required by Glass’ definition, but it is close enough and large enough to be alarming. Figure 1 shows data from 31 realworld projects, 10 of which (32%) are classified as “runaway”. Figure 1 was collected by [1–3] as follows: Questions covering the various aspects of software development (see Figure 2) areas were delivered to de velopment companies and collected one month later. These projects are actual industrial software development projects of embedded systems in the period 1996 to 1998. The questions were distributed to the project man agers or project leaders of various target projects. The detail and purpose of the questionnaire was explained. Answers were coded strongly agree, Agree, Neither agree nor disagree, and Disagree as 3, 2, 1, and 0, respec tively. All of these projects had completed their develop ment. As a result, some of the projects could be classified as “runaways”. Takagi et al. took care to ensure that all developers held a consensus view that some prior project had been a runaway. Also, to be classified as a runaway, the researchers used other objective measures such as cost and duration. Using manual methods, Takagi et al. [1] found four features from Figure 3 (e3,e5,p3,p5) that seemed prom ising predictors for runaways. The coefficients of those terms (found via logistic regression) were combined as follows: 5222.23228.13964.0 5577.13834.8)5,3,5,3( ppp eeppeeX X X e e XrunawayP 1 )( (1) Unlike prior results [4,6,7], this model is operational; it is possible to precisely characterize the strengths and weaknesses of its performance: For high and low values of P(runawayX), Equation 1 is a perfect predictor for runaways in Figure 1. No pro ject with P ≤ 0.03 is a “runaway” and no project with P ≥ 0.81 is “ok”. This is the majority (22 33=67%) of the data in Figure 1. In the minority case (11 33 ), P is midrange (0.03<P(runawayX)<0.81) and Equation 1 yields incor rect predictions in 4 11 rows. While an important result, Equation 1 has several drawbacks: Not automatic: Equation 1 was created after a man ual inspection of the data by a team of skilled mathema ticians. Such a manual analysis is hard to reproduce or apply to a new data set. Subsequent work by Abe, Takagi, et al. [2] automated the method with a Näive Bayes clas sifier, but this compromised the explainability of the pre dictive model (see below). Only explores one subset: Takagi et al. did not compare the feature subset {e3,e5,p3,p5} with other fea ture subsets. Hence, while they showed that this subset was useful, they did not demonstrate that it was the most useful subset. Ambiguous: At low and high P values, Equation 1 sends a clear signal about what is, and is not, a poten tially runaway project. However, at middlerange P val ues, Equation 1’s conclusions are ambiguous and, hence, hard to explain. 3. The Explanation Problem Learning explainable models is harder than it may appear. This section offers examples where learned models per form well, but explain themselves poorly. 3.1 Learning Latent Features Numerous data mining methods check if the available features can be combined in useful ways. In this way, latent features within a data set can be discovered. For example, principal components analysis (PCA) [8] has been widely applied to resolve problems with struc tural code measurements; e.g. [9]. PCA identifies the distinct orthogonal sources of variation in data sets, while mapping the raw features onto a set of uncorrelated fea tures that represent essentially the same information contained in the original data. For example, the data shown in two dimensions of Figure 3 (lefthandside) could be approximated in a single latent feature (right handside). Since PCA combines many features into fewer latent features, the structure of PCAbased models may be very simple. For example, previously [10], we have used PCA and a decision tree learner to find the following predictor for defective software modules: Figure 3. The two features in the left plot can be transferred to the right plot via one latent feature Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects 224 if domain1 ≤ 0.180 then NoDefects else if domain1 > 0.180 then if domain 1 ≤ 0.371 then NoDefects else if domain 1 > 0.371 then Defects Here, “domain1” is one of the latent features found by PCA. This tree seems very simple, yet is very hard to explain to business clients users since “domain1” is cal culated using a very complex weighted sum (in this sum, v(g),ev(g),iv(g) are McCabe or Halstead static code met rics [11,12] or variants on line counts): domain1= 0.241*loc+0.236* v(g) +0.222*ev(g)+0.236*iv(g)+0.241*n +0.238*v0.086*l+0.199*d +0.216*i+0.225*e+0.236*b+0 .221* t +0.241*lOCode+0.179*lOComment +0.221*lOBlank+0.158*lOCodeAndComment +0.163*uniqOp+0.234*uniqOpnd +0.241*totalOp+0.241*tota lOpnd +0.236*branch Count (2) As we shall see below, other learners can yield effec tive models that are simpler to explain without using complex latent features. 3.2 Ensemble Learning Data mining for SE means summarizing the complex behavior of a group of developers struggling to build intricate artifacts. Data mining over such complex multidimensional data often requires fusing together the results from multiple learners [13]. Such ensembles may perform well but, as we shall see, are hard to explain. In basic ensemble method (BEM), l learners are run on various subsets of the available data. These learners use EQ x\s\do5(j)(r\,s) that returns the probability of the target classes s. BEM returns the mean probability: 1 1 ˆ, l BEM j j xrs l (3) The linear generalized ensemble method (GEM) re turns a weighted sum of the conclusions of each learner x in the ensemble. 1 , l jj GEM j aax rs (4) where αj is the normalized performance score of xj on the training data (so learners that performed the worst, con tribute the least). For some data sets, the combination rule is nonlinear and complex. For example, Toh et al. [13]’s variant of Equation 4 uses a Jacobian matrix for x with different coefficients for each feature r i ∈r and target class s m ∈s. These coefficients are learned via multivariate polyno mial regression. Toh et al. report that their resulting en semble performs better than simpler schemes. However, it may be harder to explain the ensemble since that ex planation must cover: The learning methods used to generate xj; The combination rule that computes x ; and The regression method used to tune the coefficients used in the combination method. Such an explanation is not required if the users are willing to accept the conclusions of the learner, without explanation. However, for data sets as small Figure 1, it seems reasonable to expect that a simple explanation of runaway projects should be possible. Also, if managers are to use the results of the learner as part of their delib erations, they need some succinct structures that they can reflect over. 3.3 Näive Bayes Classifiers It is hardly surprising that complex latent features (e.g. Equation 2) or intricate combinations of multiple learners (e.g. Equation 4) are hard to explain. What is surprising is how hard it is to explain the results of even a single, supposedly simple, learner. For example, this section offers a complete description of how a Näive Bayes clas sifiers makes its conclusions. The reader is asked to con sider how many users would understand this description (in our experience, we have yet to meet a single one). A Näive Bayes classifier [14] is based on Bayes’ Theorem. Informally, the theorem says next=old*new i.e. what we’ll believe next comes from how new evidence effects old beliefs. More formally: P(HE)= P(H) P(E) i P(EiH) (5) i.e. given fragments of evidence Ei and a prior prob ability for a class P(H), the theorem lets us calculate a posterior probability P(HE). When building predictors for runaways, the posterior probability of each hypothesis class (H∈{“ok” or “run away”}) is calculated, given the features extracted from a project such “ambiguous requirements” or “low morale” or any other of the features shown in Figure 2. The clas sification is the hypothesis H with the highest posterior P (HE). Näive Bayes classifiers are called “näive” since they assume independence of each feature. While this as sumption simplifies the implementation (frequency counts are required only for each feature), it is possible that correlated events are missed by this “näive” ap proach. Domingos and Pazzani show theoretically that the independence assumption is a problem in a vanish ingly small percent of cases [15]. This explains the re peated empirical result that, on average, Näive Bayes classifiers perform as well as other seemingly more so Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects225 phisticated schemes (e.g. see Table 1 in [15]). Equation 5 offers a simple method for handling miss ing values. Generating a posterior probability means of tuning a prior probability to new evidence. If that evi dence is missing, then no tuning is needed. In this case Equation 5 sets P(EiH)=1 which, in effect, makes no change to P(H). When estimating the prior probability of hypothesis H, it is common practice [16] to use an Mestimate as fol lows. Given that the total number of hypothesis is C, the total number of training instances is I, and N(H) is the frequency the hypothesis H within I, then CmI mHN HP )( )( (6) Here m is a small nonzero constant (often, m=1). Three special cases of Equation 6 are: For high frequency hypothesis in large training sets, N(H) and I are much larger than m and m·C, so Equation 6 simplifies to P(H)= N(H) I, as one might expect. For low frequency classes in large training sets, N(H) is small, I is large, and the prior probability for a rare class is never less than 1 I; i.e. the inverse of the number of instances. If this were not true, rare classes would never appear in predictions. For very small data sets, I is small and N(H) is even smaller. In this case, Equation 6 approaches the inverse of the number of classes; i.e. 1 C. This is a useful ap proximation when learning from very small data sets when all the data relating to a certain class has not yet been seen. The prior probability calculated in Equation 6 is a useful lower bound for P(EiH). If some value v is seen N(f=vH) times in feature f ’s observations for hypothesis H, then P(EiH)= N(f=vH)+l P(H) N(H)+l (7) Here, l is the Lestimate and is set to a small constant (Yang &Webb [16] recommend l=2). Two special cases of are: A common situation is when there are many exam ples of an hypothesis and numerous observations have been made for a particular value. In that situation, N(H) and N(f=vH) are large and Equation 7 approaches N(f=vH) N(H), as one might expect. In the case of very little evidence for a rare hy pothesis, N(f=vH) and N(H) are small and Equation 7 approaches l·P(H) l; i.e. the default frequency of an ob servation in a hypothesis is a fraction of the probability of that hypothesis. This is a useful approximation when very little data is available. For numeric features it is common practice for Näive Bayes classifiers to use the Gaussian probability density function [17]: 2 2 2 1 2 x gx e (8) where {μ,σ} are the feature’s {mean, standard deviation}, respectively. To be precise, the probability of a continu ous feature having exactly the value x is zero, but the probability that it lies within a small region, say x ± ε/2, is ε×g(x). Since ε is a constant that weighs across all pos sibilities, it cancels out and needs not be computed. Näive Bayes classifiers are frustrating tools in the data mining arsenal. They exhibit excellent performance, but offer few clues about the structure of their models. The means and standard deviations for Figure 1 are shown in Figure 44. Note that this figure is an incomplete charac terization of Figure 1. For example, row 1 of Figure 4 suggests that r1 (“ambiguous requirements”) for “ok” is a Gaussian distribution with a mean of 0.27 and a stan dard deviation of 0.86. A visual inspection of column one values for “ok” projects in Figure 1 shows that this is not true: r1 is usually zero except in two cases where it takes the value of three. One method of handling nonGaussians like P(r1=Xok) is Johns and Langley’s kernel estimation technique [18]. This technique approximates a continuous distribution sampled by n observations as the sum of multiple Gaussians with means of multiple Gaussians with means 12 ,,..., n ob obob 12 , ,...,ob obobn and standard deviation P(ok) = 0.68 P(runaway) = 0.32 feature mean sd mean sd r1 0.2727 0.8624 1.35 1.0500 r2 0.9545 1.0650 1.65 0.8078 r3 0.9545 1.1571 1.95 1.3500 r4 0.8864 1.0759 1.65 1.0500 r5 1.4091 1.2670 1.90 1.0440 e1 0.3182 0.6998 1.00 1.2649 e2 0.7273 1.0082 1.00 1.2649 e3 0.8182 1.0824 1.65 1.0500 e4 0.5455 0.9642 1.65 1.2460 e5 0.2727 0.7497 1.40 1.2806 p1 0.6818 0.9833 1.80 1.3077 p2 0.9545 0.8516 1.50 1.1619 p3 0.3409 0.7744 1.80 1.1225 p4 0.6818 0.9833 1.35 1.0500 p5 0.7500 0.9857 2.25 1.0062 p6 0.4545 0.7820 1.70 1.1874 o1 0.6818 1.0824 1.65 1.2460 o2 0.3636 0.7100 1.30 1.3454 o3 0.2273 0.5979 1.00 1.0000 m1 0.6136 0.9762 0.60 0.9950 m2 0.4773 0.8323 1.50 1.1619 m30.54550.9404 1.50 1.2845 Figure 4. Means and standard deviations from Figure 1 Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects 226 1 n In this approach, to create a highly skew dis tribution like P(r1=Xok), multiple Gaussians would be added together at r1=0. Conclusions are made by asking all the Gaussians which class they believe is most likely. 3.4 Näive Bayes and Software Engineering NäiveBayes classifiers are widely used in the SE litera ture for several reasons. NäiveBayes classifiers summa rize the training data in one frequency table per class. Hence, they consume very little memory and can quickly modify their knowledge by incrementing the frequency count of feature ranges seen in new training examples. Also, many studies (e.g. [15,19,20]) report that Näive Bayes exhibit excellent performance compared to other learners. For example, recently Menzies, Greenwald & Frank [21] have built predictors for software detectors using a Näive Bayes classifier and two explanation systems the OneR rule learner and the J4.8 decision tree learner. In that study, the learner with the worst explanation power (Näive Bayes) had the best performance, by far. For the data sets explored by Menzies, Greenwald & Frank, the median advantage of Näive Bayes, the C4.5 decision tree learner [22], and the OneR rule learner [23] over the other learners was 52.4%, 0%,16.7%, respectively (see Figure 5). On analysis, Menzies, Greenwald & Frank concluded that Näive Bayes worked so well because of the the product calculation of Equation 5. They reasoned as follows. Many static code features have similar infor mation content. Hence, minor changes in how the train ing data was sampled yielded different “best” features for predicting defects. The best predictions come from mathematical methods like Näive Bayes that accumulate the signal from many code features (using Equation 5’s product rule). Decision tree learners like C4.5 and rule method median oneR 16.7 100% 100% j48 0.0 100% 100% Näive Bayes 52.4 100% 100% Figure 5. Quartile charts from Menzies, Greenwald & Frank [21]. The charts show the differences when learners were applied to the same the training and test data Performance was measured using recall; i.e. the percent of the defec tive modules found by the learners. The the upper and lower quartiles are marked with black lines. The median is marked with a black dot. Vertical bars are added to mark (i) the zero point and (ii) the minimum possible value and (iii) the maximum possible value. The median per formance of Näive Bayes was much higher than the other methods. learners like OneR, on the other hand, do not perform well in this domain since they assume hard and fast boundaries between what is defective and what is not. In summary, when mining software engineering data, there are many reasons to start with a Näive Bayes clas sifier. Abe, Muzono, Takagi, et al. [2] used such classifi ers to extend their prior work on runaway software pro jects [1,3]. However, this classifier was only a perform ance system. not an explanation system, so it could not offer insights into, say, how to best change a software project in order to avoid runaways. As shown above, Näive Bayes classifiers do not generate such succinct generalizations. This is a problem since what developers really want to know is what should be done to avoid runaway status. 3.5 Discussion of the Explanation Problem As the mathematics gets more elaborate, it becomes harder to explain a Näive Bayes classifier to a typical business user: Many users are not trained mathematicians. Hence, they may be confused by Equation 5, Equation 6, Equa tion 7 and Equation 8. Presenting the internal statistics (e.g. Figure 4) is uninformative, at least for the business users we have worked with. The problem is compounded if the data is nonGaussian (like Figure 1) since this requires explain ing kernel estimation. Worse, a standard Näive Bayes classifier (with our without kernel estimation) can not answer businesslevel questions such as “what minimal changes should be make to most decrease the odds of runaway projects?” To be fair, Näive Baye’s explanation problems are seen in other kinds of data miners: The problems with PCA and ensemblebased learn ers were discussed above. Tree learners such as C4.5 [22] or CART [24] exe cute in local topdown search, with no memory between different branches. Hence, the same concept can be need lessly repeated many times within the tree. Such trees can be cumbersome, needlessly large, and difficult to understand. Clustering algorithms [25] and nearest neighbor methods [26,27] do not condense their working memory into succinct descriptions. Rather, inferences on new information are made by a query over all the old infor mation. Simulated annealers [28] learn constraints to an in put space that results in higher values in the output space. However, there is no generalization or summarization in a simulated annealer such as which subset of the input space is most important to control. Neural networks store their knowledge as weights distributed across a network. Concepts have no centralized Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects227 location so it is impossible to inspect, say, all the infor mation about one idea at one location in a network [29]. The problem of explaining the performance of these learners to endusers has been explored extensively in the literature (see the review in [30]). Often, some postprocessor is used to convert an opaque model into a more understandable form: Towell and Shavlik generate refined rules from the internal data structures of a neural network [29]. Quinlan implemented a postprocessor to C4.5 called C45 rules that generates succinct rules from cumbersome decision tree branches via (a) a greedy pruning algorithm followed by (b) duplicate removal then (c) exploring sub sets of the rules relating to the same class [22]. TARZAN was another postprocessor to C4.5 that searched for the smallest number of decisions in decision tree branches that (a) pruned the most branches to unde sired outcomes while (b) retaining branches leading to desired outcomes [31]. 4. Learning Methods 4.1 Rule Learners Rather than patch an opaque learner with a postproces sor, it may be better to build learners than directly gener ate succinct highlevel descriptions of a domain. For example, RIPPER [32] is one of the fastest rule learners known in the literature. The generated rules are of the form condition conclusion: 112 2 ... conclu s io n condition eature Value FeatureValueClass The rules generated by RIPPER perform as well as C45rules, yet are much smaller and easier to read [32]. Rule learners like RIPPER and PRISM [33] generate small, easier to understand, symbolic representations of the patterns in a data set. PRISM is a less sophisticated learner than RIPPER and is not widely used. It was ini tially added to this study to generate a lower bound on the possible performance. However, as we shall see, it proved surprisingly effective. 1. Find the majority class C 2. Create a R with an empty condition that predicts for class C. 3. Until R is perfect (or there are no more features) do (a) For each feature F not mentioned in R • For each value vF, consider adding F=v to the condition of R (b) Select F and v to maximize p t where t is total number of exam ples of class C and p is the number of examples of class C se lected by F=v. Break ties by choosing the condition with the largest p. (c) Add F=v to R 4. Print R 5. Remove the examples covered by R. 6. If there are examples left, loop back to (1) Figure 6. PRISM pseudocode Like RIPPER, PRISM is a covering algorithm that runs over the data in multiple passes. As shown in the pseudocode of Figure 6, PRISM learns one rule at each pass for the majority class (e.g. in Figure 6, at pass 1, the majority class is ok). All the examples that satisfy the condition are marked as covered and removed from the data set. PRISM then recurses on the remaining data. The output of PRISM is an ordered decision list of rules where rulej is only tested if all conditions in rule i ＜ j fail. PRISM returns the conclusion of the first rule with a satisfied condition. One way to visualize a covering algorithm is to imag ine the data as a table on a piece of paper. If there exists a clear pattern between the features and the class, define that pattern as a rule and cross out all the rows covered by that rule. As covering recursively explores the re maining data, it keeps splitting the data into: What is easiest to explain, and Any remaining ambiguity that requires a more de tailed analysis. PRISM is a näive covering algorithm and has prob lems with residuals and overfitting. If there are rows with similar patterns and similar frequencies occur in different classes, then: These residual rows are the last to be removed for each class; So the same rule can be generated for different classes. In overfitting, a learner fixates on spurious signals that do not predict for the target class. PRISM’s overfitting arises from part 3.a of Figure 6 where the algorithm loops through all features. If some feature is poorly measured, it might be noisy (contains spurious signals). Ideally, a rule learner knows how to skip over noisy features. RIPPER addresses residuals and overfitting problem three techniques: pruning, description length and ruleset optimization for a full description of these techniques, see [34]. In summary: Pruning: After building a rule, RIPPER performs a backselect to see what parts of a condition can be de leted, without degrading the performance of the rule. Similarly, after building a set of rules, RIPPER performs a backselect to see what rules can be deleted, without degrading the performance of the rule set. These backselects remove features/rules that add little to the overall performance. For example, back pruning could remove the residual rules. Description length: The learned rules are built while minimizing their description length. This is an in formation theoretic measure computed from the size of the learned rules, as well as the rule errors. If a rule set is overfitted, the error rate increases, the description length grows, and RIPPER applies a rule set pruning operator. Rule set optimization tries replacing rules straw man alternatives (i.e. rules grown very quickly by some näive method). Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects 228 4.2 Performance Measures Our results are presented in terms of the following per formance measures. Suppose we have some historical log, like Figure 1 that can comment on the correct classifica tion of each row. By comparing the historical log with the output of the learner, we can define several measures of success. Let {A,B,C,D} denote the true negatives, false negatives, false positives, and true positives (respectively) found by a binary detector (binary detectors work on data sets with two classes, like Figure 1). A,B,C,D can be combined in many ways. For example, accuracy (or acc) is the percentage of true positives (D) and negatives (A) found by the detector. acc=accuracy=(A+D)/(A+B+C+D) (9) Also, recall (or pd) comments on how much of the target was found. pd=recall=D/(B+D) (10) Precision (or prec) comments on how many of the in stances that triggered the detector actually containing the target concept. prec=precision=D/(D+C) (11) The fmeasure is the harmonic mean of precision and recall. It has the property that if either precision or recall is low, then the fmeasure is decreased. The f measure is useful for dual assessments that include both precision and recall. 2prec pd f measureprec pd (12) All these measures fall in the range 0≤ {pd,prec,f,acc} ≤1. Also, the larg er these values, the better the model. 4.3 Experiments with the Learning Methods Various combinations of the learning method described above were applied to Figure 1. The results are shown in Figure 7. In all 13 treatments where applied to Figure 1. Each treatment is some combination of a data filter, a learner, and a assessment method. This section discusses how each treatment was designed using results from the proceeding treatments. Before moving on, we call attention to the accuracy results of Figure 7. Observe how accuracy can be a re markably insensitive performance measure; i.e. it re mained roughly constant, despite large changes in recall and precision. This result has been seen in many other data sets [21,35]. Hence, accuracy is deprecated by this paper. 4.3.1 Cross Validation Treatment a is a simple application of RIPPER to Figure 1. The learned theory was applied back on the training data used to generate it; i.e. all of Figure 1. As shown binslearner#features #tests a n/a ripper 22 1 bn/a ripper 22 10 c n/a nb 22 10 d3 ripper 22 10 e 3 nb 22 10 f 3 prism 22 10 g3 ripper 1 (r1) 10 h3 ripper 2 (r1 + p5) 10 i 3 bayes 1 (r1) 10 j 3 bayes 2 (r1 + p5) 10 k3 prism 1 (r1) 10 l 3 prism 2 (r1 + p6) 10 m3 prism 3 (r1 + p6 + o3) 10 Figure 7. Results from this study. The four plots, shown at top, come from the 13 treatments shown at bottom in Figure 7, this produced one of the largest fmeasures seen in this study. Treatment a assessed a learned model using the data that generated it. Such a selftest can lead to an overestimate of the value of that model. Crossvalida tion, on the other hand, assesses a learned model using data not used to generate it. The data is divided into, say, 10 buckets. Each bucket is set aside as a test set and a model is learned from the remaining data. This learned model is then assessed using the test set. Such crossvalidation studies are the preferred evaluation method when the goal is to produce predictors intended to predict future events [17]. In treatment b, a crossvalidation experiment was ap plied to the data. The treatment b results shows how badly treatment a overestimated the performance: changing the training data by as little as 10% nearly halved the precision and recall. Clearly, the conclusions from the selftest from this data set are brittle; i.e. unduly Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects229 altered by minor changes in the training data. Treatment c illustrates the explanation vs performance tradeoff discussed in the introduction. As mentioned above, the output from rule learners can be far easier to explain than the output of treatment c; i.e. a Näive Bayes classifier (with kernel estimation) running on data sets with nonGaussian distributions like Figure 1. So, if op timizing for explainability, an analyst might favor rule learners over Bayes classifiers. On the other hand, Figure 7 shows treatment c outperforming treatment b, espe cially in terms of recall. So, if optimizing for perform ance an analyst might favor a Bayes classifier. Note that treatment c uses the method favored by the previous high water mark in this research [2]. In the se quel, we show how this study found data mining methods that significantly outperform that prior work. 4.3.2 Discretiz ation Treatments d,e and f explore discretization. Discretiza tion clumps together observations taken over a continu ous range into a small number of regions. Humans often discretize real world data. For example, parents often share tips for “toddlers”; i.e. humans found between the breaks of age=1 and age=3. Many researchers report that discretization improves the performance of a learner since it gives a learner a smaller space to reason about, with more examples in each part of the space [16,20], [36,37]. Discretization can generally be described as a process of assigning data attribute instances to bins or buckets that they fit in according to their value or some other score. The general concept for discretization as a binning process is dividing up each instance of an attribute to be discretized into a number distinct buckets or bins. The number of bins is most often a userdefined, arbitrary value; however, some methods use more advanced tech niques to determine an ideal number of bins to use for the values while others use the userdefined value as a start ing point and expand or contract the number of bins that are actually used based upon the number of data in stances being placed in the bins. Each bin or bucket is assigned a range of the attribute values to contain, and discretization occurs when the values that fall within a particular bucket or bin are replaced by identifier for the bucket into which they fall. After Gama and Pinto [38], we say that discretization is the process of converting a continuous range into a his togram with k break points b1…bk where ij:bi bj. The histogram divides a continuous range into bins (one for each break) and many observations from the range may fall between two break points bi and bi+1 at fre quency counts ci. Simple discretizers are unsupervised methods that build their histograms without exploiting information about the target class; e.g. equal width: i,j:bibi1 bjbj1 ; equal frequency: i,j:cicj . For Näive Bayes classifiers working on n instances, Yang & Webb [16] advocate equal frequency with ci=cj= n. For example, Figure 1 holds 32 instances so a b=3 equal frequency discretion hopes to place 32 3≈10 values into each part of the histogram. However, Figure 1 does not have ten instances for each feature value so, as shown in Figure 8, a skewed histogram is generated. More sophisticated discretizers are supervised methods that build their histograms using knowledge of the target class. Specifically, the continuous range is explored looking for a break that is a cliff; i.e. a point where the class frequencies are most different above and below the cliff. Once a toplevel cliff is found, this method usually recurses into each region above and below the cliff to find the next best subcliff, subsubcliff, and so on. For example, the Fayyad & Irani [37] supervised dis cretizer assumes that the best cliff is the one that most divides target classes. In terms of information theory, this can be measured using entropy; i.e. the number of bits required to encode the class distribution. If the classes in a sample of n instances occur at frequencies counts c1,c2,..., then the entropy of that sample is 1122 12 22 , ...... cccc Ent ccloglog nnnn If a break divides n numbers into two regions of size n1,n2, then the best cliff is the one that minimizes the sum of the entropy below and above the cliff; i.e. n1 n·Ent1+ n2 n·Ent2. Various discretizers were explored, with disappointing results: Yang & Webb’s rule (ci=n=33≈6) was not useful here since our data has less than 6 distinct values per feature. Fayyad&Irani’s method reduced most features to a single bin; i.e. it found no information gain in any parts featurerangefrequency o3 0 24 1 1 2,37 p6 0 19 1,2 10 3 3 r1 0,123 2 5 3 4 Figure 8. Some 3bin results from Figure 1 Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects 230 of our ranges. Best results were seen with a simple 3bin equal fre quency scheme (i.e. b=3) in Treatment f where PRISM achieved precisions as high as the RIPPER selftest (treatment a). However, the same experiment saw the worst recall. The same 3bin scheme offered little help to RIP PER or Näive Bayes (see treatments d,e). Since the precision results were the most promising seen to date, 3bin was retained for the rest of our ex periments. Other methods were then employed to achieve the benefits of 3bin (high precision) without its associ ated costs (low recall). 4.3.3 Feature Subset Selection The remaining treatments (g,h,i,j,k,l,m) explore how dif ferent feature subsets change the performance of the learning. A repeated result in the data mining community is that simpler models with equivalent or higher per formance can be built via feature subset selection algo rithms that intelligently prune useless features [19]. Fea tures may be pruned for several reasons: They may be noisy; i.e. contain spurious signals unrelated to the target class; They may be uninformative; e.g. contain mostly one value, or no repeating values; They may be correlated to other variables in which case, they can be pruned since their signal is also present in other variables. The reduced feature set has many advantages: Miller has shown that models generally containing fewer variables have less variance in their outputs [39]. The smaller the model, the fewer are the demands on interfaces (sensors and actuators) to the external en vironment. Hence, systems designed around small mod els are easier to use (less to do) and cheaper to build. In terms of this article, the most important aspect of learning from a reduced features set is that it produces smaller models. Such smaller models are easier to ex plain (or audit). One such feature subset selector is Kohavi & Johns’ WRAPPER algorithm [40]. Starting with the empty set, WRAPPER adds some combinations of features and asks some target learner to build a model using just those fea tures. WRAPPER then grows the set of selected features and checks if a better model comes from learning over the larger set of features. If we applied WRAPPER to our three learners (RIP PER, PRISM, Näive Bayes), then WRAPPER’s search through the 22 features of Figure 1 could require 3•222=12,582,912 calls to a learner. In practice, a heuris tic search drastically reduced this search space. WRAP PER stops when there are no more features to select, or there has been no significant improvement in the learned model for the last five additions (in which case, those last five additions are deleted). Technically speaking, this is a hillclimbing forward select search with a “stale” param featurePRISM Näive Bayes RIPPER average group #1 : usually selected r1 10 10 6 8.7 o3 7 7 p5 8 4 6 p6 8 1 4.5 group #2: sometimes selected m3 3 3 r2 2 2 p2 2 1 1.5 e1 1 2 1 1.3 o2 1 2 1 1.3 e2 1 1 group #3: rarely selected e3 1 1 m2 1 1 o1 1 1 p1 1 1 1 p3 1 1 p4 1 1 r3 1 1 r4 group #4: never selected r5 e4 e5 m1 Figure 9. Number of times WRAPPER selected features in ten experiments on 90% samples of the data eter set to 5. For data sets as small as Figure 1, WRAP PER terminates in an under a minute (but for large data sets, other feature selectors would be requiredsee [19] for a survey). Figure 9 shows the results of running 10 WRAPPER experiments on Figure 1 (discretized via 3bin) for our three learners. In each experiment, 10% of Figure 1 (se lected at random) was ignored: Group #1 shows the features that, on average, were selected in the majority of ten runs (on average, 6 times or more). Group #2 shows the features that were selected 2 to 5 times. Group #3 shows the features that were selected only once. Group #4 shows the features that were never se lected. There are only three features in Group #1 suggesting that many of the Figure 1 features could be ignored. This has implications for the cost of data collection and the explaining runaway projects: Data collection could be constrained to just Group #1, and perhaps p6 (which PRISM selected eight times). Such a constrained data collection program would be cheaper to conduct, especially over a large organization. Figure 10 shows a rule predicting runaway projects found by PRISM using just the features recommend by WRAPPER (r1, p6, o3) on 3bin discretized data. The figure shows that just using the topranked features of Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects231 1: If o3 = 1 then ok 2: If r1 = 0,1 and p6 = 1 then ok 3: If r1 = 3 and p6 = 1,2 then ok 4: If r1 = 0,1 and p6 = 1,2 and o3 = 0 then ok 5: If r1 = 1,2 then runaway 6: If p6 = 3 then runaway 7: If r1 = 3 and p6 = 0 then runaway 8: If r1 = 0,1 and p6 = 1,2 and o3 = 2,3 then runaway 9: If r1 = 0,1 and p6 = 1,2 and o3 = 0 then runaway Figure 10. Rules generated by treatment m Figure 9 yields a very succinct, easy to explain model. Treatments g,h,...m show the results of applying the topranked features to the discretized data. For each learner, if WRAPPER usually selected N features, then that learner was tested in a 10way crossvalidation using the top ranked feature, the secondtop ranked features, and so on up to using N features. 4.3.4 Best Results The best results were obtained in treatment m. That treatment applied PRISM using the three features usually selected by WRAPPER+PRISM: r1, o3, p6. This re sulted in Figure 10. Figure 8 showed r1{0,1}, p6{1,2}, o3=0 is a fre quent pattern in our data. Hence, after a covering algo rithm removes all other more interesting structures, the residual rows can contain this frequent pattern. This, in turn, means that identical rules could be generated for different classes; e.g. rules 4&9 of Figure 10 (this is the residual rule problem discussed above). It is important to read these rules top to bottom since a rule fires only if all the rules above it fail. In practice, this means that the residual rule 9 is never used (it is blocked by rule 4). A 10way crossvalidation study showed that this rule generation method yields an average precision, recall, and fmeasure across the 10way of 1, 0.85, and 0.92 (respectively). This result is actually much better than it appears. To achieve average precisions and recalls of 1 and 0.85 in such a 10way is something of an accom plishment. In a 10way crossvalidation on the 33 records of Figure 1, the test set is of size three or four. In such a small test set, a single outlier project can have a large and detrimental result on the collected statistics. 4.3.5 User Studi es To test the explainability of Figure 10, we ran a session with eight software engineers managing large software verification projects. Pseudocode for Näive Bayes (with kernel estimation) and PRISM (Figure 6) was introduced. PRISM was summarized this way: “each rule handles some examples, which are then removed, and the algorithm repeats on the remaining data.” Within an hour, the engineers were handsimulating PRISM. Using a pen and ruler, all the rows of Figure 1 that matched rule #1 (in Figure 10) were identified and crossed off. The rows that matched rule #2 were identi fied, then crossed off. The engineers stopped after simu lating PRISM’s activities on two or three rules, making comments like “I see what is going on the learner is finding and handling the most obvious next thing.” Sig nificantly, none of the engineers tried to apply Näive Bayes; i.e. mestimates, lestimates, the approximation, and the Gaussians of kernel estimation. In summary, the simplicity of PRISM the rules of Fig ure 10 allowed them to be explained to one focus group, all within a one hour session. 5. Discussion 5.1 Related Work This research aims at producing a precise, explainable, operational definition of a runaway project. Other work in this area is less precise and not operational. For example, in 1997, Glass [4] had informally sam pled several highprofile software disasters and found the following features to be predictive for runaways: Project objectives not fully specified (in 51% of the sample); Bad planning and estimating (48%); Technology new to the organization (45%); Inadequate/no project management methodology (42%); Insufficient senior staff on the team (42%); Poor performance by suppliers of hard ware/software (42%) Otherperformance (efficiency) problems (42%) Glass did not offer a clear operational method for com bining their features into an effective predictor. Other work carefully documented the software risk problem, but did not offer automatic tool support: Jiang et al. [6] studied 40 features collected from questionnaires posted to personnel with recent experi ence with an IS project. Their study is an exemplary ex ample of software engineering research: after clearly defined six hypotheses about software risk, they identify those hypotheses not supported by their data. Ropponen & Lyytinen [7] studied selfreported data from 83 project managers and 1,110 projects to find 26 software risk components: six scheduling and timing risks; four system functionality risks; three subcontract ing risks; four requirements management risks; four re source usage and performance risks; and five personnel management risks. Both reports have the same limitations: their conclu sions contain a somewhat illdefined and manual proce dure for managers to explore the above risks. For exam ple, both reports list risks and their weighted contribution to total risk. However, no combination rule is offered on how to best combine evidence of multiple risks. Another aspect that sets this work apart from other Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects 232 studies is reproducibility. Neither the Jiang et al. nor Ropponen & Lyytinen [7] studies are reproducible since they did not made their data available to other research ers. Reproducibility is an important methodological prin ciple in other disciplines since it allows a community to confirm, refute, or even improve prior results. In our view, in the field of software engineering, there are all too few examples of reproduced, and extended, results*. This current report began when the second and third au thors published their data [1] and defined a research challenge: how to better explain the results of their learning to developers [2]. We would strongly encourage software engineering researchers to share data, define challenges, and to take the time to rework the results of others. 5.2 External Validity This study has produced: 1). A recommended feature subset for predicting run aways (r1,p6,o3); 2). A recommended model that combines those fea tures (Figure 10); and 3). A recommended method for generating that subset and that model: 3bin discretization; a WRAPPER around PRISM; 10way crossvalidation using PRISM on the sub sets found by WRAPPER. It is good practice to question the external validity of these recommendations. WRAPPER selected different features than the manual method that produced Equation 1. That is, the recom mended feature subset learned by our recommended method is different to that found by our earlier work. This raises a concern about external validity: why do our conclusions keep changing? We endorse the conclusions of this study over our prior work [1] for two reasons. Firstly, this study ex plored far more feature subsets that before: Equation 1 was generated after a manual analysis of a few features. Figure 10 was generated after an automatic search through thousands of subsets. Secondly, the results of this study perform better than our prior results: Equation 1 offers ambiguous conclusions in the range (0.03<P(runawayX)<0.81). Figure 10 offers categorical conclusions about the runaway status of a project. Further, it does so with perfect precision. A more serious validity threat comes from the data used in this study. Any inductive process suffers from a sampling bias; i.e. the conclusions of the study are a function of the data used in that study. In that regard, we have evidence that our results are stable across small to mediumsized changes to our project sample. In a 10way crossvalidation experiment, 10% of the data (in our case, 3 to 4 records) is set aside and the model is learned from the remaining information. Our learned model had an average precision of 1.0 in a 10way; i.e. the precision of our model remained perfect, despite a 10% change in the training data. Also, Figure 1 does not show all the data available to this study. Some of the data available to this research group is proprietary and cannot be generally released. In order to check the external validity of our methods, these ten extra records were not analyzed until after we reached the above conclusions regarding the recom mended data mining method for this data. When our recommended method was applied to Figure 1, plus the extra ten records, WRAPPER still found the features shown in Figure 9. Further, the performance of the rule set learned from the extended data had the same proper ties as Figure 10; i.e. It outperformed NäiveBayes; It exhibited perfect precision (precision=1.0) over the 10way crossvalidation. In summary, despite the data set size changing by a small to medium amount (10% to +33%), there is: No instability in the recommended features; No instability in the performance of the recom mended model; No instability in the recommended method. 5.3 Method Selection for Quirky Data Several times we found that certain widely regarded methods (RIPPER; discretization using Fayyad&Irani; discretization with Yang & Webb’s n rule) did not yield the best results for this data set. The reason for this is simple: software engineering data sets are often small: Figure 1 is one table with only 22*33 cells; Elsewhere we have published results on even smaller data sets [41,42]. It is hard to know apriori what are the quirks of small software engineering data sets. Hence, we recommend trying many methods, even supposedly outdated ones. For example, in this study, a very simple rulelearner (PRISM) produced the best performance while being most understandable to our users. More generally, Fayyad [43] argues persuasively that data mining should be viewed as a small part of the knowledge and data discovery (KDD) cycle shown in Fig ure 11. For example, in this report we used discretization and feature subsetselection for preprocessing and selec tion steps shown in Figure 11. Also, we looped through the KDD cycle 13 times: each time, the results from the pre vious round informed our work for the next round. *Exception: see the reports of the PROMISE workshop http://promis edata.org/repository/papers.html Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects233 Figure 11. The KDD (Knowledge Discovery in Databases) cycle, adapted from [43] 5.4 Data Mining Methods Based on this work, and certain standard texts in the data mining field [17,43], we offer the following advice to other researchers data mining on SE data. It is important to understand the goals of the data mining task. If the learned model only needs to perform, and not explain then any data mining method might do ranging from Näive Bayes classifiers To clustering algorithms, decision tree learners, neural nets, etc Or, as explored in Equation 4, ensembles of the above. The simplest of the above is Näive Bayes. Such classi fiers scale to very large data sets and, in many domains, have performed very well [15,19,20]. Also, in at least one SE domain [21], they far outperformed other meth ods. However, if the goal is to generate an explainable the ory, then: Many business users do not have the background required to understand mathematicalbased learners. For such users, the rule learners (e.g. RIPPER) may be most useful since they produce succinct summaries of the data. It is useful to reduce the range of number variables with discretization. Once reduced, the learned model can be simpler since it only needs to comment on a few dis crete ranges rather than the entire number line. It is also useful to reduce the number of features with feature subset selection. A repeated result in the literature [19,39,40] is that the majority of the features can be pruned away and the resulting model is either simpler, performs better or both. For example, in this case study, the best performance and the most suc cinct/explainable model were found using just 3/22 of the available data. As to the choice of feature subset selector: Hall and Holmes [19] compare WRAPPER to sev eral other variable pruning methods including the princi pal component analysis (PCA) method used by Roppo nen & Lyytinen and Munson [9] (amongst others). Fea ture selection methods can be grouped according to (a) whether or not they make special use of the target vari able in the data set such as “runaway”; (b) whether or not pruning uses the target learner. PCA does not make spe cial use of the target variable. Also, unlike other pruning methods, WRAPPER does use the target learner as part of its analysis. Hall and Holmes found that PCA was one of the worst performing methods (perhaps because it ignored the target variable) while WRAPPER was the best (since it can exploit its special knowledge of the target learner). For large data sets, WRAPPER can be too slow. When WRAPPER is not possible, see the conclusion of the Hall & Holmes study [19] for recommendations on two other feature subset selection methods. If the data set is small enough (e.g. Figure 1), use WRAPPER around a rule learner. WRAPPER is the slowest feature subset selector but it is the only one that can tune itself to the target learner. Regarding performance measures, we have two rec ommendations: Comparing the fmeasures in treatment a and b of Figure 7, it is clear that selftests can overestimate the value of a learned model. Holdout sets are the recom mended way to assess a learned model. Accuracy is a widely used measure for assessing a learned theory. Figure 7 shows that it can be remarkably uninformative. In that figure, large changes in precision and recall make very little impact on the accuracy. Hence, we strongly recommend against the use of accuracy. The above issues are widely discussed in the data mining literature (e.g. [17,43–45]). Nevertheless, our reading of the literature is that multiple traversals of the KDD cyclic application using a range of techniques (e.g. different learners, discretizers, and feature subset selec tors) is quite rare. Often researchers take one learner, apply it once, then report the conclusion. Also, despite many positive empirical studies, feature selection is rarely seen in software engineering (exceptions: [21,46]). Further, it is still standard practice for software engineers to present their data mining results in terms of accuracy of nonholdout experiments (e.g. [47]). We hope our results encourage a change in that standard practice. 6. Conclusions Intuitively, it seems reasonable that optimizing for per formance can compromise explainability. Software en gineering data can be complex, noisy, or confusing. Such complex data may require complex and arcane learning strategies; e.g. the defect data sets studied by Menzies, Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects 234 Greenwald, and Frank. Complex and arcane learning strategies will be hard to explain. That is, good perform ance in a learned model may imply poor explanatory power, especially for real world software engineering data. This paper is a counterargument to such pessimism. We show that at least for predicting runaway software projects, certain standard data mining methods resulted in models with both: High performance: i.e. precision=1.0; and Good explainability: i.e. small rule sets, under standable by our users; This result is a new high water mark in predicting runaway projects. This new predictor outperforms prior results in several ways: Our results are fully reproducible: the data for our analysis comes from Figure 7; the software used is freely available*. Prior work by other researchers [4–7] has carefully documented the influence of features on software risk, but did not offer an operational model (by “operational”, we mean that the model can generate performance statis tics like Figure 7). As to our own prior results, the logistic regression method [1] required some manual intervention on the part of the analyst. In contrast to that, the techniques de scribed here are automatic. Also, due to ambiguities in the middle P ranges of Equation 1, or the inner com plexities of our Näive Bayes classifier [2], our prior mathematical results were much harder to explain than the new rules of Figure 10. Comparing treatment c and treatment m in Figure 7, we see that our new data mining method (treatment m: 3bin, WRAPPER, PRISM) has similar recall but much higher precision than our old data mining method (treat ment c: NäiveBayes [2]). Measured in terms of precision, this new model is as good as can ever be expected for our data. Other com bination data mining methods could outperform our re sult (e.g. by generating a smaller, more explainable model with higher recall) but no other method could be more precise (since precision’s maximum value is 1.0). Prior results conducted a manual exploration of a few subsets of the features [1]. Here, we employed a feature subset selector that explored thousands of feature subsets. Hence, we have far more confidence that the following factors are most useful in recognizing run aways: ambiguous requirements; low morale; lack of project members’ commitment to the project plan. REFERENCES [1] Y. Takagi, O. Mizuno, and T. Kikuno, “An empirical approach to characterizing risky software projects based on logistic regression analysis,” Empirical Software En gineering, Vol. 10, No. 4, pp. 495–515, 2005. [2] S. Abe, O. Mizuno, T. Kikuno, N. Kikuchi, and M. Hira yama, “Estimation of project success using bayesian clas sifier,” in ICSE 2006, pp. 600–603, 2006. [3] O. Mizuno, T. Kikuno, Y. Takagi, and K. Sakamoto, “Characterization of risky projects based on project man agers evaluation,” in ICSE 2000, 2000. [4] R. Glass, “Software runaways: Lessons learned from massive software project failures,” Pearson Education, 1997. [5] “The Standish Group Report: Chaos 2001,” 2001, http://standishgroup.com/sample research/PDFpages/ ex treme chaos.pdf. [6] J. Jiang, G. Klein, H. Chen, and L. Lin, “Reducing userrelated risks during and prior to system develop ment,” International Journal of Project Management, Vol. 20, No. 7, pp. 507–515, October 2002. [7] J. Ropponen and K. Lyytinen, “Components of software development risk: how to address them? A project man ager survey,” IEEE Transactions on Software Engineer ing, pp. 98–112, Feburary 2000. [8] W. Dillon and M. Goldstein, “Multivariate analysis: Methods and applications.” WileyInterscience, 1984. [9] J. C. Munson and T. M. Khoshgoftaar, “The use of soft ware complexity metrics in software reliability model ing,” in Proceedings of the International Symposium on Software Reliability Engineering, Austin, TX, May 1991. [10] G. Boetticher, T. Menzies, and T. Ostrand, “The PROM ISE Repository of Empirical Software Engineering Data,” 2007, http://promisedata.org/repository. [11] T. McCabe, “A complexity measure,” IEEE Transactions on Software Engineering, Vol. 2, No. 4, pp. 308–320, December 1976. [12] M. Halstead, “Elements of software science,” Elsevier, 1977. [13] K. Toh, W. Yau, and X. Jiang, “A reduced multivariate polynomial model for multimodal biometrics and classi fiers fusion,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 224–233, February 2004. [14] R. Duda, P. Hart, and N. Nilsson, “Subjective bayesian methods for rulebased inference systems,” in Technical Report 124, Artificial Intelligence Center, SRI Interna tional, 1976. [15] P. Domingos and M. J. Pazzani, “On the optimality of the simple bayesian classifier under zeroone loss,” Machine Learning, Vol. 29, No. 23, pp. 103–130, 1997. http:// citeseer.ist.psu.edu/domingos97 optimality. html [16] Y. Yang and G. Webb, “Weighted proportional kinterval discretization for naivebayes classifiers,” in Proceedings of the 7th PacificAsia Conference on Knowledge Dis covery and Data Mining (PAKDD 2003), 2003, Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects235 http://www.csse.monash.edu/_webb/Files/YangWebb03.pdf. [17] I. H. Witten and E. Frank, Data mining. 2nd edition. Los Altos, Morgan Kaufmann, US, 2005. [18] G. John and P. Langley, “Estimating continuous distribu tions in bayesian classifiers,” in Proceedings of the Elev enth Conference on Uncertainty in Artificial Intelligence Montreal, Quebec: Morgan Kaufmann, 1995, pp. 338–345, http://citeseer.ist.psu.edu/john95 estimating.html. [19] M. Hall and G. Holmes, “Benchmarking attribute selec tion techniques for discrete class data mining,” IEEE Transactions On Knowledge And Data Engineering, Vol. 15, No. 6, pp. 1437–1447, 2003, http://www.cs.waikato.ac.nz/_mhall/HallHolmesTKDE.pdf. [20] J. Dougherty, R. Kohavi, and M. Sahami, “Supervised and unsupervised discretization of continuous features,” in International Conference on Machine Learning, pp. 194–202, 1995, http://www.cs.pdx.edu/_timm/dm/dougherty95supervised.pdf. [21] T. Menzies, J. Greenwald, and A. Frank, “Data mining static code attributes to learn defect predictors,” IEEE Transactions on Software Engineering, January 2007, http://menzies.us/pdf/06learnPredict.pdf. [22] R. Quinlan, C4.5: Programs for Machine Learning. Mor gan Kaufman, 1992. [23] R. Holte, “Very simple classification rules perform well on most commonly used datasets,” Machine Learning, Vol. 11, pp. 63, 1993. [24] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, “Classification and regression trees,” Wadsworth Interna tional, Monterey, CA, Tech. Rep., 1984. [25] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967. [26] T. M. Cover and P. E. Hart, “Nearest neighbour pattern classification,” IEEE Transactions on Information Theory, pp. 21–27, January 1967. [27] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in ICML’06, 2006, http://hunch.net/_jl/projects/cover tree/cover tree.html. [28] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimi zation by simulated annealing,” Science, No. 4598, Vol. 220, pp. 671–680, 1983, http://citeseer.nj.nec.com/kirkpatrick83optimization.html [29] G. G. Towell and J. W. Shavlik, “Extracting refined rules from knowledgebased neural networks,” Machine Learning, Vol. 13, pp. 71–101, 1993, http: //citeseer.ist.psu.edu/towell92extracting.html [30] B. Taylor and M. Darrah, “Rule extraction as a formal method for the verification and validation of neural net works,” in IJCNN ’05: Proceedings. 2005 IEEE Interna tional Joint Conference on Neural Networks, Vol. 5, pp. 2915–2920, 2005. [31] T. Menzies and E. Sinsel, “Practical large scale whatif queries: Case studies with software risk assessment,” in Proceedings ASE 2000, 2000, http://menzies.us/pdf/00ase.pdf. [32] W. Cohen, “Fast effective rule induction,” in ICML’95, 1995, pp. 115–123, http://www.cs.cmu.edu/_wcohen/postscript/ml95ripper.ps. [33] J. Cendrowska, “Prism: An algorithm for inducing modular rules,” International Journal of ManMachine Studies, Vol. 27, No. 4, pp. 349–370, 1987. [34] T. Dietterich, “Machine learning research: Four current directions,” AI Magazine, Vol. 18, No. 4, pp. 97–136, 1997. [35] T. Menzies and J. S. D. Stefano, “How good is your blind spot sampling policy?” in 2004 IEEE Conference on High Assurance Software Engineering, 2003, http://menzies.us/pdf/03blind.pdf. [36] J. Lu, Y. Yang, and G. Webb, “Incremental discretization for naivebayes classifier,” in Lecture Notes in Computer Science 4093: Proceedings of the Second International Conference on Advanced Data Mining and Applications (ADMA 2006), pp. 223–238, 2006, http://www.csse.monash.edu/_webb/Files/LuYangWebb06.pdf. [37] U. M. Fayyad and I. H. Irani, “Multiinterval discretiza tion of continuousvalued attributes for classification learning,” in Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 1022–1027, 1993. [38] J. Gama and C. Pinto, “Discretization from data streams: Applications to histograms and data mining,” in SAC ’06: Proceedings of the 2006 ACM symposium on Applied computing. New York, NY, USA: ACM Press, pp. 662–667, 2006. http://www.liacc.up.pt/_jgama/ IWKDDS/Papers/p6.pdf. [39] A. Miller, Subset Selection in Regression (second edition). Chapman & Hall, 2002. [40] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, Vol. 97, No. 12, pp. 273–324, 1997, http://citeseer.nj.nec.com/ kohavi96wrappers.html [41] T. Menzies and J. D. Stefano, “More success and failure factors in software reuse,” IEEE Transactions on Soft ware Engineering, May 2003, http://men zies.us/pdf/02sereuse.pdf. [42] T. Menzies, Z. Chen, J. Hihn, and K. Lum, “Selecting best practices for effort estimation,” IEEE Transactions on Software Engineering, November 2006, http://menzies.us/pdf/06coseekmo.pdf. [43] U. Fayyad, “Data mining and knowledge discovery in databases: Implications for scientific databases,” in Pro ceedings on Ninth International Conference on Scientific and Statistical Database Management, pp. 2–11, 1997. [44] F. Provost, T. Fawcett, and R. Kohavi, “The case against accuracy estimation for comparing induction Copyright © 2009 SciRes JSEA
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects Copyright © 2009 SciRes JSEA 236 algorithms,” in Proc. 15th International Conf. on Ma chine Learning. Morgan Kaufmann, San Francisco, CA, pp. 445–453, 1998, http://citeseer.nj.nec.com/ provost98case.html. [45] R. Bouckaert, “Choosing between two learning algo rithms based on calibrated tests,” in ICML’03, 2003, http://www.cs.pdx.edu/_timm/dm/10x 10way. [46] C. Kirsopp and M. Shepperd, “Case and feature subset selection in casebased software project effort predic tion,” in Proc. of 22nd SGAI International Conference on KnowledgeBased Systems and Applied Artificial Intel ligence, Cambridge, UK, 2002. [47] N. Nagappan and T. Ball, “Static analysis tools as early indicators of prerelease defect density,” in ICSE 2005, St. Louis, 2005.
